Ramp finds AI coding agents won’t self-limit token spend

Ramp Labs says coding agents blow past budgets even with live meters and explicit approvals. In SWE-bench tests, agents almost always chose to keep spending, and separate “controller” models were easily swayed by bad recommendations.

TL;DR

  • Ramp Labs: coding agents don’t self-regulate token spend, even with real-time budgets or explicit approval prompts
  • Ramp customers’ AI token spend up 13x since January 2025; spend control should be external
  • GPT-5.4 on SWE-bench Verified via mini-swe-agent; 50,000-token cap per task; runs halt at cap
  • Prompt interventions failed: budget meter ignored, token-efficiency grading unchanged, request_more_budget used 0 times in 5,000+ turns
  • Forced choice at cap: approve more spend 97% (default approve) or 79% (neutral approve/deny); self-evaluation bias cited
  • Proposed fix: separate controller model approves spend using workspace snapshot; “colleague” advice can degrade accuracy below coin flip

Ramp Labs is out with a blunt conclusion in a new thread: coding agents don’t meaningfully regulate their own token spend, even when shown real-time budgets or asked to explicitly approve additional usage. With AI token spend among Ramp customers up 13x since January 2025, the company argues that “spend control belongs outside the process doing the spending.”

Ramp’s setup: a coding agent, a hard cap, and a lot of ignored signals

Ramp says it evaluated GPT-5.4 as a coding agent using the minimal mini-swe-agent harness on SWE-bench Verified tasks. Each task started with a 50,000-token budget; once the agent hit the cap, the run would halt unless more budget was approved.

The first set of interventions was basically informational:

  • a live “budget status” meter injected into the prompt
  • explicit grading tied to token efficiency
  • a request_more_budget tool exposed to the agent

Across thousands of turns, Ramp says none of it moved outcomes. The agent never referenced the meter, efficiency metrics didn't change, and the budget tool was invoked zero times across more than 5,000 turns.
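The implied enforcement pattern is that the harness, not the agent, counts tokens and freezes the run at the cap. A minimal sketch of that pattern (hypothetical: `run_with_hard_cap`, `EXTENSION`, and the callback signatures are illustrative, not mini-swe-agent's actual API):

```python
BUDGET = 50_000      # per-task token budget from Ramp's setup
EXTENSION = 25_000   # hypothetical top-up granted per approval

def run_with_hard_cap(step, approve_more, budget=BUDGET):
    """Drive an agent turn loop under an externally enforced token cap.

    step() -> (tokens_used_this_turn, done). approve_more(spent) is a
    decision made outside the agent loop: the process doing the spending
    never referees its own budget.
    """
    spent = 0
    while True:
        tokens, done = step()
        spent += tokens
        if done:
            return "completed", spent
        while spent >= budget:
            # The agent is frozen here until an outside authority decides.
            if not approve_more(spent):
                return "halted_at_cap", spent
            budget += EXTENSION
```

An always-deny `approve_more` is the hard stop; substituting a call to a separate controller model gives the structural fix Ramp goes on to propose.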

When forced to decide, agents mostly keep spending

Ramp then forced the agent to make an explicit decision at the token limit: extend the run or end it.

When “approve” was the pre-selected default, the agent chose it 97% of the time. When “approve” and “deny” were presented as neutral options, approval still came in at 79%.

Ramp attributes this to a mix of incentives and bias: the agent is effectively judging its own work, and denial amounts to stopping short of a goal it was instructed to complete—plus prior research suggesting AI can be more lenient when evaluating its own output (arXiv:2603.04582).

The proposed fix: separate the worker from the budget approver

Ramp’s main recommendation is structural: freeze the coding agent at the budget limit and hand the decision to a separate “controller” model. That controller only sees a snapshot of objective workspace signals (task info, tokens used, files modified, tests run, commands executed) and is scored on a single rule: approve more spend only if the probability of success is above 1/2.
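The snapshot-plus-rule design can be sketched as a prompt builder for the controller model. The field names and wording here are illustrative assumptions, not Ramp's actual schema:

```python
def controller_prompt(snap: dict) -> str:
    """Render objective workspace signals into a spend-approval prompt
    for a separate controller model (field names are illustrative)."""
    return (
        "You are a budget controller. You did not write this code.\n"
        f"Task: {snap['task']}\n"
        f"Tokens used: {snap['tokens_used']}\n"
        f"Files modified: {', '.join(snap['files_modified'])}\n"
        f"Tests run: {snap['tests_run']}, passing: {snap['tests_passing']}\n"
        f"Commands executed: {snap['commands_executed']}\n"
        "Approve more spend ONLY if the probability of eventual success "
        "is above 1/2. Reply with exactly APPROVE or DENY."
    )

def parse_decision(reply: str) -> bool:
    """True iff the controller approved more spend."""
    return reply.strip().upper().startswith("APPROVE")
```

The design choice is that the controller sees only the snapshot, never the worker agent's own narration of its progress, which is exactly the channel the forced-choice results show to be biased.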

Ramp built a 100-task evaluation set from runs that hit the token limit, balanced between cases where approving more budget would eventually succeed and cases where it would still fail.

Controllers are optimistic — and easily swayed by “colleagues”

Ramp tested six controller models: GPT-5.4 Mini, GPT-5.4, Claude Haiku 4.5, Claude Sonnet 4.6, Claude Opus 4.6, and Claude Opus 4.7.

On “workspace-only” information, Ramp says most controllers leaned heavily toward approving additional spend; only Haiku 4.5 and Opus 4.7 did meaningfully better than random chance by leaning on workspace cues like tests run and scope of changes.

Adding a vague base rate (“about half”) didn’t help much. But giving exact task-level probabilities led every controller to make the correct expected-value decision, which Ramp frames as evidence that arithmetic isn’t the bottleneck—usable numbers are.
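That contrast can be made concrete with a toy version of the balanced eval. On a 50/50 set, a controller told only the base rate has no per-task signal and cannot beat a coin flip in expectation, while calibrated exact probabilities make the above-1/2 rule trivially correct. (Entirely synthetic numbers, not Ramp's data.)

```python
import random

random.seed(0)

# Synthetic stand-in for Ramp's 100-task balanced set: half the capped runs
# would succeed with more budget (True), half would not (False).
labels = [True] * 50 + [False] * 50

# Assume the controller is handed calibrated per-task success probabilities.
probs = [random.uniform(0.60, 0.95) if ok else random.uniform(0.05, 0.40)
         for ok in labels]

def decide_exact(p: float) -> bool:
    # Ramp's rule: approve only if success is more likely than not.
    return p > 0.5

def decide_base_rate(_p: float) -> bool:
    # Told only "about half succeed": no per-task signal, so on a balanced
    # set any policy is a coin flip in expectation.
    return random.random() < 0.5

acc_exact = sum(decide_exact(p) == ok for p, ok in zip(probs, labels)) / 100
acc_base = sum(decide_base_rate(p) == ok for p, ok in zip(probs, labels)) / 100
print(acc_exact)  # 1.0: with calibrated exact probabilities the rule is trivial
```

This is the sense in which "arithmetic isn't the bottleneck": the thresholding is easy once a usable number exists.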

The sharpest failure mode showed up when the controller got a short recommendation from a “colleague.” Ramp reports that many models deferred heavily to that advice, even when it contradicted the optimal choice, with accuracy dropping to worse than a coin flip under “bad recommendation” conditions. Ramp notes Opus 4.6 as an exception in these tests, appearing less influenced by recommendations.

“Governance over narration”

Ramp’s takeaway is familiar to anyone who has tried to control spend through dashboards alone: simply surfacing a budget signal doesn’t enforce anything. The company points to related research on budget-aware tool use (arXiv:2511.17006), confidence vs. decision faithfulness (arXiv:2601.07767), and LLM sycophancy (arXiv:2310.13548) to argue that agentic spend control needs external, auditable mechanisms—approved budgets and hard stops—rather than hoping the agent self-regulates.

Community replies to the thread largely echoed the theme. One commenter joked about needing “a ramp policy agent for agents,” while another argued the same pattern applies to specs, tests, and error handling: prompt-only signals don’t matter unless something outside the agent enforces them.

Source: Ramp Labs on X
