In a paper posted on arXiv, Joongwon Kim and coauthors argue that applying test-time scaling to agentic coding is less about producing more attempts and more about carrying forward useful information from prior ones. Their paper, **“Scaling Test-Time Compute for Agentic Coding,”** focuses on long-horizon coding agents, where each rollout can generate a long trail of actions, observations, errors, and partial progress rather than a short answer that can be ranked at a glance.
The authors propose a framework built around compact summaries of rollout trajectories. Those summaries are designed to preserve “salient hypotheses, progress, and failure modes” while dropping low-signal trace details. The paper presents two inference-time methods on top of that representation layer.
For parallel scaling, the authors introduce **Recursive Tournament Voting (RTV)**, which recursively narrows a pool of rollout summaries through small-group comparisons. For sequential scaling, they adapt **Parallel-Distill-Refine (PDR)** so that new rollouts are conditioned on summaries distilled from previous attempts.
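The paper does not include reference code, but the RTV selection step can be sketched as a recursive small-group tournament over summaries. The `judge` comparator below is a hypothetical stand-in for the paper's LLM-based comparison; everything here is an illustrative assumption, not the authors' implementation.

```python
from typing import Callable, List

def recursive_tournament(summaries: List[str],
                         judge: Callable[[List[str]], str],
                         group_size: int = 4) -> str:
    """Recursively narrow a pool of rollout summaries via small-group
    comparisons until a single winner remains (the RTV idea, sketched)."""
    if len(summaries) == 1:
        return summaries[0]
    # Compare summaries in small groups; each group yields one winner.
    winners = [
        judge(summaries[i:i + group_size])
        for i in range(0, len(summaries), group_size)
    ]
    # Recurse on the winners until the pool collapses to one summary.
    return recursive_tournament(winners, judge, group_size)

# Toy usage: a trivial judge that prefers the longest summary. A real
# system would instead query a model to compare progress and failure modes.
pool = [f"summary-{'x' * n}" for n in range(1, 10)]
best = recursive_tournament(pool, judge=lambda g: max(g, key=len))
```

The recursion depth grows logarithmically in the pool size, so each round only asks the judge to rank a handful of compact summaries rather than full trajectories, which is the point of the representation layer the authors describe.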
According to the paper, the approach improves frontier coding agents on both SWE-Bench Verified and Terminal-Bench v2.0. The headline results put Claude-4.5-Opus at **77.6%** on SWE-Bench Verified with mini-SWE-agent, up from **70.9%**, and at **59.1%** on Terminal-Bench v2.0 with Terminus 1, up from **46.9%**.
The broader claim is narrow but pointed: for long-horizon agents, test-time scaling appears to hinge on "representation, selection, and reuse," not just more sampling. Whether that holds beyond these two benchmarks will depend on how the methods fare in other agentic coding settings.
Source: arXiv