Agent frameworks may be sabotaging prefix caching and inference speed

In an X thread, Chayenne Zhao argues that many agent frameworks waste tokens in ways that undercut key inference optimizations like prefix caching, hurting cost and throughput in long sessions. The takeaway: better agent–inference co-design may unlock big efficiency gains.

TL;DR

  • Agent frameworks often waste tokens in ways that undermine inference-stack optimizations
  • “Stateless API call” patterns can defeat prefix caching, especially in long-running sessions
  • Example: Claude Code patterns showed low cache hit rates against a local serving engine
  • Token bloat drivers: retransmitted context, re-parsed tool outputs, accumulated low-density history
  • Long sessions could use fewer tokens via context compression, cache-friendly prompts, better tool scheduling
  • Efficiency gains likely require agent–inference co-design, including cache-state awareness and session semantics

Chayenne Zhao’s thread, “We’re Not Wasting Tokens — We’re Wasting the Design Margin of the Entire Inference Stack,” lands on an uncomfortable point for anyone building agentic tooling: the problem isn’t simply that models are expensive—it’s that agent frameworks often spend tokens in ways that actively undermine the inference stack’s best optimizations.

When “stateless API calls” collide with inference reality

Zhao frames the current crunch as a design mismatch across layers. Agent harnesses commonly behave as if inference is a stateless endpoint, repeatedly sending large context blocks and expecting the serving layer to clean up the mess with caching. The thread argues that this pattern can effectively defeat prefix caching and related techniques that inference engine developers rely on to keep throughput and cost sane—especially in long-running sessions.
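To make the mismatch concrete, here is a minimal sketch (not any framework's real API; the prompt builders and session contents are hypothetical) of why prompt layout decides whether a prefix cache can help at all. A serving engine can only reuse cached KV entries for the longest prefix a request shares with one it has already processed, so any volatile data placed before the stable content invalidates everything after it:

```python
# Sketch: how prompt construction determines prefix-cache reuse.
# A serving engine can only reuse KV-cache state for the longest
# leading span shared with a previously processed request.

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

SYSTEM = "You are a coding agent.\n"  # hypothetical stable system prompt

def cache_hostile(turn: int, history: list) -> str:
    # Volatile data (here, a per-request counter) placed BEFORE the
    # stable content changes the prefix on every single request.
    return f"[request #{turn}]\n" + SYSTEM + "".join(history)

def cache_friendly(turn: int, history: list) -> str:
    # Stable content first; only append new material at the end.
    return SYSTEM + "".join(history) + f"[request #{turn}]\n"

history = ["tool output 0\n", "tool output 1\n"]
h_prev, h_next = cache_hostile(0, history[:1]), cache_hostile(1, history)
f_prev, f_next = cache_friendly(0, history[:1]), cache_friendly(1, history)

# The hostile layout shares only the literal "[request #" across turns;
# the friendly layout shares the entire system prompt plus prior history.
assert shared_prefix_len(h_prev, h_next) < shared_prefix_len(f_prev, f_next)
```

The same few bytes of per-request churn cost nothing in token count but, placed at the front, force the engine to recompute the whole context from scratch.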

In a concrete example, Zhao describes observing Claude Code request patterns against a local serving engine and measuring painfully low cache hit rates, suggesting that carefully engineered prefix cache mechanisms can be “almost entirely defeated” by how sessions are constructed.
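A toy model makes it easy to see how such low hit rates arise. The sketch below imitates block-based prefix caching (inspired by engines such as vLLM, but deliberately simplified and not any engine's real implementation): prompts are split into fixed-size token blocks, and a block only hits if the identical block arrived earlier at the identical position. A single changed token near the front of the prompt invalidates every block after it:

```python
# Toy model of block-based prefix caching. A block hits only if the
# same block was previously seen with the same preceding blocks; any
# early change in the prompt invalidates everything downstream.

BLOCK = 16  # tokens per cache block (illustrative size)

def hit_rate(requests: list) -> float:
    """Fraction of full blocks across all requests served from cache."""
    cache = set()
    hits = total = 0
    for toks in requests:
        prefix = ()  # identity of everything before the current block
        for i in range(0, len(toks) - len(toks) % BLOCK, BLOCK):
            block = tuple(toks[i:i + BLOCK])
            key = (prefix, block)  # position-dependent cache key
            total += 1
            if key in cache:
                hits += 1
            else:
                cache.add(key)
            prefix = key
    return hits / total if total else 0.0

base = list(range(64))  # a stable 4-block context
# Append-only session: each turn adds new tokens at the end.
stable = [base + list(range(100 + t, 116 + t)) for t in range(4)]
# Churning session: one token rewritten at the very front each turn.
churn = [[t] + base for t in range(4)]

print(hit_rate(stable))  # most blocks reused across turns
print(hit_rate(churn))   # zero reuse despite near-identical content
```

In this toy, the append-only stream reuses a majority of blocks while the front-churning stream hits the cache exactly never, even though the two sessions carry almost the same content.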

Token bloat as an engineering problem, not a pricing debate

A key idea in the thread is that “token demand” from agents can be artificially inflated by crude context management: retransmitting already-processed context, re-parsing tool outputs, and accumulating low-density history. Zhao goes further with a hypothesis that very long sessions (hundreds of thousands of tokens) could potentially be restructured to use a fraction of the tokens—without sacrificing output quality—through smarter context compression, cache-friendly prompt construction, and better tool scheduling.
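One way to restructure a long session along these lines is to fold older turns into a compact digest while keeping the front of the context byte-identical, so compression and cache-friendliness do not work against each other. The sketch below is hypothetical (the `summarize` helper is a stand-in for a real model-backed summarization call, and the layout is an assumption, not Zhao's proposal):

```python
# Hypothetical context-compression sketch for long agent sessions:
# older turns are collapsed into a short summary block while the
# system prompt stays byte-identical at the front of every request.

def summarize(turns: list) -> str:
    # Stand-in for a real summarization call: keep only each old
    # turn's first line as a crude digest.
    return "".join(t.splitlines()[0] + "\n" for t in turns)

def build_context(system: str, turns: list, keep_recent: int = 2) -> str:
    """Stable system prompt, then a digest of old turns, then recent turns verbatim."""
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    parts = [system]
    if old:
        parts.append("[summary of earlier turns]\n" + summarize(old))
    parts.extend(recent)
    return "".join(parts)

turns = [f"turn {i}: ran tests\n" + "log line\n" * 40 for i in range(6)]
full = "SYS\n" + "".join(turns)
compressed = build_context("SYS\n", turns)
print(len(full), len(compressed))  # compressed is a fraction of the full transcript
```

The recent turns stay verbatim so the model keeps high-fidelity working context, while the accumulated low-density history, exactly the bloat the thread describes, is what gets compressed away.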

The bigger implication: co-design across agent, model, and inference

While the conversation started from a dispute around third-party harness access and token-plan design, Zhao’s main push is broader: meaningful efficiency gains likely require agent–inference co-design, where harnesses become aware of cache state (and inference engines understand more about agent session semantics), rather than treating each request as an isolated transaction.

Original source: https://x.com/i/status/2040976525030494211
