New paper says scaling agentic coding needs smarter reuse

Joongwon Kim and coauthors argue that test-time scaling for long-horizon coding agents depends less on sampling more rollouts than on carrying forward useful information from them. Their summary-based RTV and PDR methods boost results on SWE-Bench Verified and Terminal-Bench v2.0.

TL;DR

  • Paper: **“Scaling Test-Time Compute for Agentic Coding”** argues test-time scaling should carry forward useful information across rollouts
  • Focus: **long-horizon coding agents**, where rollouts include actions, observations, errors, and partial progress
  • Framework: **compact rollout summaries** preserving salient hypotheses, progress, failure modes; dropping low-signal trace details
  • Parallel scaling method: **Recursive Tournament Voting (RTV)** narrows rollout summaries via recursive small-group comparisons
  • Sequential scaling method: **Parallel-Distill-Refine (PDR)** conditions new rollouts on distilled summaries from prior attempts
  • Reported gains: **SWE-Bench Verified 77.6% vs 70.9%**; **Terminal-Bench v2.0 59.1% vs 46.9%** (Claude-4.5-Opus)

In a paper posted on arXiv, Joongwon Kim and coauthors argue that applying test-time scaling to agentic coding is less about producing more attempts and more about carrying forward useful information from prior ones. Their paper, **“Scaling Test-Time Compute for Agentic Coding,”** focuses on long-horizon coding agents, where each rollout can generate a long trail of actions, observations, errors, and partial progress rather than a short answer that can be ranked at a glance.

The authors propose a framework built around compact summaries of rollout trajectories. Those summaries are designed to preserve “salient hypotheses, progress, and failure modes” while dropping low-signal trace details. The paper presents two inference-time methods on top of that representation layer.
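As a rough illustration of what such a summary might carry, here is a minimal Python sketch; the `RolloutSummary` fields, the `summarize_rollout` helper, and the prompt wording are assumptions for illustration, not the paper's actual schema or prompts.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical shape of a compact rollout summary. The paper describes
# preserving hypotheses, progress, and failure modes; the exact fields
# here are an assumption.
@dataclass
class RolloutSummary:
    hypotheses: list[str]     # candidate explanations the agent pursued
    progress: list[str]       # concrete steps completed (edits, passing tests)
    failure_modes: list[str]  # errors hit and approaches that dead-ended
    patch: str                # the rollout's candidate fix, if it produced one

def summarize_rollout(trace: list[str], llm: Callable[[str], str]) -> str:
    """Compress a raw action/observation trace into a compact summary,
    dropping low-signal details such as verbatim tool output."""
    prompt = (
        "Summarize this coding-agent trajectory. Keep salient hypotheses, "
        "progress, and failure modes; omit raw logs and tool output.\n\n"
        + "\n".join(trace)
    )
    return llm(prompt)  # in practice the summary may be free text or structured
```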

For parallel scaling, the authors introduce **Recursive Tournament Voting (RTV)**, which recursively narrows a pool of rollout summaries through small-group comparisons. For sequential scaling, they adapt **Parallel-Distill-Refine (PDR)** so that new rollouts are conditioned on summaries distilled from previous attempts.
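The paper's reference implementation isn't reproduced here, but the selection loop is easy to picture. Below is a minimal sketch of the RTV idea, assuming a hypothetical `judge` callable (an LLM comparator) that picks the most promising summary from a small group:

```python
import random
from typing import Callable

# Hypothetical LLM comparator: given a small group of rollout summaries,
# return the index of the most promising one.
Judge = Callable[[list[str]], int]

def recursive_tournament_vote(summaries: list[str], judge: Judge,
                              group_size: int = 4) -> str:
    """Repeatedly split the pool into small groups, advance each group's
    judged winner, and recurse until a single summary remains."""
    pool = list(summaries)
    while len(pool) > 1:
        random.shuffle(pool)  # reduce positional bias across rounds
        winners = []
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            winners.append(group[0] if len(group) == 1 else group[judge(group)])
        pool = winners
    return pool[0]
```

And a correspondingly rough sketch of the PDR loop, where `run_agent`, `summarize`, and `distill` are hypothetical stand-ins for the agent harness and summarization calls:

```python
def parallel_distill_refine(task: str,
                            run_agent: Callable[[str, str], list[str]],
                            summarize: Callable[[list[str]], str],
                            distill: Callable[[list[str]], str],
                            width: int = 4, rounds: int = 2) -> str:
    """Each round runs `width` fresh rollouts conditioned on a digest of
    prior attempts, then distills their summaries into the next digest."""
    digest = ""  # no prior context in the first round
    for _ in range(rounds):
        traces = [run_agent(task, digest) for _ in range(width)]
        digest = distill([summarize(t) for t in traces])
    return digest  # a distilled record of what was tried and learned
```

The design point common to both sketches is that candidates are compared and reused as compact summaries rather than full traces, which keeps each judging or conditioning call within a manageable context window.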

According to the paper, the approach improves frontier coding agents on both SWE-Bench Verified and Terminal-Bench v2.0. The reported example results put Claude-4.5-Opus at **77.6%** on SWE-Bench Verified with mini-SWE-agent, up from **70.9%**, and at **59.1%** on Terminal-Bench v2.0 with Terminus 1, up from **46.9%**.

The paper's central claim is narrow but pointed: for long-horizon agents, test-time scaling appears to hinge on “representation, selection, and reuse,” not just more sampling. Whether that holds up beyond the benchmarks in the paper will depend on how these methods perform in other agentic coding settings.

Source: arXiv
