Apple’s simple self-distillation boosts coding models without verification

Apple researchers are spotlighting Simple Self-Distillation: fine-tuning a coding model on its own unfiltered outputs. In results reported on LiveCodeBench, pass@1 and pass@5 jump sharply—without labels, RL, or an execution-based verifier.

TL;DR

  • Apple paper proposes Simple Self-Distillation (SSD): post-train a coding model on its own unfiltered generations
  • Training recipe: sample model solutions, no correctness filtering, then fine-tune on raw outputs
  • No external scaffolding: no teacher model, verifier, RL, execution environment, reward model, or labels
  • LiveCodeBench (Qwen3-30B-Instruct): pass@1 42.4%→55.3%; pass@5 (harder) 31.1%→54.1%
  • Reported to generalize across Qwen/Llama at 4B, 8B, 30B; one sample per prompt sufficient
  • Mechanism claim: shifts output distribution; suppresses distractors while preserving diversity at “forks”; skepticism on echo-chamber risk

Apple’s research group has a new paper making the rounds for an almost annoyingly simple idea: post-train a coding model on its own unfiltered generations and it can still get markedly better. The discussion kicked off around Bo Wang’s summary thread on X, which links to the paper and accompanying code.

The method is called Simple Self-Distillation (SSD). The recipe, as described in the thread: sample solutions from the current model, don’t filter them for correctness, and then fine-tune on those raw outputs. No external teacher model, no verifier, and no RL—just the model’s own completions fed back into training.
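The loop described above is short enough to sketch end to end. The snippet below is a minimal, illustrative sketch of that recipe, not the paper's implementation: `generate` and `fine_tune` are hypothetical stand-ins (toy stubs) for a real sampling call and a standard SFT step, so only the shape of the pipeline is shown.

```python
# Minimal sketch of the SSD recipe as described in the thread.
# `generate` and `fine_tune` are HYPOTHETICAL stand-ins, not real APIs.

def generate(model, prompt, temperature=1.0):
    # Stand-in for sampling one completion from the current model.
    return f"solution_to({prompt})"

def fine_tune(model, dataset):
    # Stand-in for an ordinary SFT step on (prompt, completion) pairs.
    return {"base": model, "sft_examples": len(dataset)}

def ssd_round(model, prompts):
    # 1. Sample one solution per prompt from the current model.
    # 2. Apply NO correctness filtering: keep every raw output.
    # 3. Fine-tune on the unfiltered (prompt, completion) pairs.
    dataset = [(p, generate(model, p)) for p in prompts]
    return fine_tune(model, dataset)

model = "qwen3-30b-instruct"
prompts = ["two-sum", "lru-cache", "topo-sort"]
model = ssd_round(model, prompts)
```

The notable design choice is step 2: unlike rejection-sampling or verifier-gated pipelines, nothing between sampling and fine-tuning checks whether a completion is correct.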

What SSD reported on coding benchmarks

The headline numbers in the thread focus on LiveCodeBench results for Qwen3-30B-Instruct:

  • pass@1 increases from 42.4% to 55.3% on LiveCodeBench (a +30% relative improvement, per the thread).
  • On harder problems, pass@5 goes from 31.1% to 54.1%.
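For readers unfamiliar with the metric: pass@k is commonly computed with the standard unbiased estimator from n ≥ k sampled solutions per problem, of which c pass the tests (the thread does not specify the paper's exact evaluation setup, so this is the conventional formula, not a claim about the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k solutions,
    drawn without replacement from n samples (c of them correct),
    solves the problem. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some draw must hit a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 correct out of 10 samples:
p1 = pass_at_k(10, 3, 1)  # 1 - 7/10 = 0.3
p5 = pass_at_k(10, 3, 5)  # 1 - C(7,5)/C(10,5) = 1 - 21/252 ≈ 0.917
```

Note that pass@5 is usually well above pass@1 for the same model, which is why the reported pass@5 figure refers to a harder problem subset.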

The thread also claims the same pattern holds across Qwen and Llama families at 4B, 8B, and 30B scales, and that one sample per prompt is enough. Notably, SSD is framed as requiring no execution environment, no reward model, and no labels.

Why “train on your own outputs” might not collapse

Part of what makes SSD provocative is that it skips the usual scaffolding meant to prevent self-training from reinforcing mistakes. The thread’s explanation is more about changing the model’s output distribution than injecting new information: SSD allegedly suppresses “distractors” in some contexts while preserving diversity in others, described as keeping “diversity alive at forks.”

In that framing, the capability is treated as already present in the weights, but standard decoding (for example, greedy-style defaults) may fail to reliably surface it. SSD then acts less like a knowledge upgrade and more like a way of recovering latent capacity via post-training over the model’s own sampled traces.
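A toy simulation can make the distribution-shifting intuition concrete. Below, an SFT-like "rich-get-richer" dynamic is mimicked on a single categorical distribution: repeatedly sample one outcome and nudge its logit upward, analogous to fine-tuning on one self-sample per prompt. This is purely illustrative, not the paper's analysis; the numbers and update rule are invented for the demo.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

random.seed(0)
logits = [0.0, 0.0, 0.0]  # three candidate "solutions", initially tied
initial_entropy = entropy(softmax(logits))

for _ in range(2000):
    probs = softmax(logits)
    j = random.choices(range(3), weights=probs)[0]  # sample own output
    logits[j] += 0.5                                # train toward it

final_entropy = entropy(softmax(logits))  # collapses toward one mode
```

The per-prompt distribution sharpens toward a mode, consistent with the "suppressing distractors" framing; which mode wins varies across runs and prompts, which is one way diversity can survive at genuine forks even as individual distributions sharpen.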

Reactions: elegant baseline, plus some skepticism

The replies and quote posts orbit two competing intuitions:

  • Surprise that no verification still yields strong coding gains, given how self-improvement approaches can plateau.
  • Concern that unverified self-training could become an echo chamber or trade generalization for more “code-focused” behavior—an idea Bo Wang himself said seemed plausible in response to one comment.

There are also notes that self-distillation has existing precedents, with some commenters pointing to prior similar baselines and related objectives.

Original source: https://x.com/BoWang87/status/2039943931543331237
