Apple’s research group has a new paper making the rounds for an almost annoyingly simple idea: post-train a coding model on its own unfiltered generations, and it gets markedly better anyway. The discussion kicked off around Bo Wang’s summary thread on X, which links to the paper and accompanying code.
The method is called Simple Self-Distillation (SSD). The recipe, as described in the thread: sample solutions from the current model, don’t filter them for correctness, and then fine-tune on those raw outputs. No external teacher model, no verifier, and no RL—just the model’s own completions fed back into training.
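The recipe above is simple enough to sketch in a few lines. This is a schematic illustration only, not the paper's implementation: the helper names (`ssd_round`, `sample`, `fine_tune`) and the toy dictionary "model" are stand-ins invented here to show the control flow.

```python
def ssd_round(model, prompts, sample, fine_tune):
    # One SSD round as the thread describes it: draw one completion per
    # prompt, keep it regardless of correctness (no verifier, no reward
    # model, no external teacher), then fine-tune on the raw pairs.
    dataset = [(p, sample(model, p)) for p in prompts]  # unfiltered
    return fine_tune(model, dataset)

# Toy stand-ins so the sketch runs: the "model" is a prompt -> completion
# dict, sampling is a lookup, and "fine-tuning" just memorizes the pairs.
toy_model = {"reverse a list": "xs[::-1]"}
sample = lambda m, p: m.get(p, "pass")
fine_tune = lambda m, ds: {**m, **dict(ds)}

updated = ssd_round(toy_model, ["reverse a list", "sum a list"],
                    sample, fine_tune)
# `updated` now treats the model's own outputs as training targets,
# including the unhelpful "pass" completion -- nothing was filtered out.
```

The point of the sketch is what is *absent*: no correctness check sits between sampling and fine-tuning, which is exactly the scaffolding most self-improvement pipelines add.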
What SSD reported on coding benchmarks
The headline numbers in the thread focus on LiveCodeBench results for Qwen3-30B-Instruct:
- pass@1 increases from 42.4% to 55.3% (a roughly +30% relative improvement, per the thread).
- On harder problems, pass@5 goes from 31.1% to 54.1%.
The thread also claims the same pattern holds across Qwen and Llama families at 4B, 8B, and 30B scales, and that one sample per prompt is enough. Notably, SSD is framed as requiring no execution environment, no reward model, and no labels.
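For readers unfamiliar with the metrics above: pass@k is the probability that at least one of k sampled solutions passes the tests. It is commonly computed with the unbiased estimator from the original HumanEval evaluation (whether the thread's numbers use exactly this estimator is an assumption):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n generated samples of which c
    are correct, estimate P(at least one of k drawn samples is correct)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples, 3 correct: pass@1 = 0.3, pass@5 = 11/12 ~ 0.917
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

The gap between p1 and p5 in this toy case mirrors the paper's framing: a model can "know" a solution it rarely surfaces on the first try.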
Why “train on your own outputs” might not collapse
Part of what makes SSD provocative is that it skips the usual scaffolding meant to prevent self-training from reinforcing mistakes. The thread’s explanation is more about changing the model’s output distribution than injecting new information: SSD allegedly suppresses “distractors” in some contexts while preserving diversity in others, described as keeping “diversity alive at forks.”
In that framing, the capability is treated as already present in the weights, but standard decoding (for example, greedy-style defaults) may fail to reliably surface it. SSD then acts less like a knowledge upgrade and more like a way of recovering latent capacity via post-training over the model’s own sampled traces.
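A toy numeric example (mine, not the paper's analysis) makes the decoding claim concrete: if the correct continuation at a "fork" has substantial but non-maximal probability, greedy decoding misses it every time, while sampling recovers it at roughly its true rate.

```python
import random

# Hypothetical next-token distribution at a fork: a distractor narrowly
# outweighs the correct token, so greedy decoding always picks wrong.
dist = {"distractor": 0.45, "correct": 0.40, "other": 0.15}

greedy_pick = max(dist, key=dist.get)  # always "distractor"

random.seed(0)  # deterministic toy run
tokens, weights = zip(*dist.items())
samples = random.choices(tokens, weights=weights, k=1000)
hit_rate = samples.count("correct") / len(samples)  # close to 0.40
```

On this view, fine-tuning on the model's own sampled traces would shift probability mass toward continuations like `"correct"` that sampling already reaches, which is one way to read the thread's "recovering latent capacity" framing.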
Reactions: elegant baseline, plus some skepticism
The replies and quote posts orbit two competing intuitions:
- Surprise that no verification still yields strong coding gains, given how self-improvement approaches can plateau.
- Concern that unverified self-training could become an echo chamber, or could trade general capability for narrower “code-focused” behavior, a concern Bo Wang himself called plausible in reply to one commenter.
There are also notes that self-distillation has existing precedents, with some commenters pointing to prior similar baselines and related objectives.
Original source: https://x.com/BoWang87/status/2039943931543331237