Agent frameworks may be sabotaging prefix caching and inference speed

In an X thread, Chayenne Zhao argues that many agent frameworks waste tokens in ways that undercut key inference optimizations like prefix caching, hurting cost and throughput in long sessions. The takeaway: better agent–inference co-design may unlock big efficiency gains.

TL;DR

  • Agent frameworks often waste tokens in ways that undermine inference-stack optimizations
  • “Stateless API call” patterns can defeat prefix caching, especially in long-running sessions
  • Example: Claude Code patterns showed low cache hit rates against a local serving engine
  • Token bloat drivers: retransmitted context, re-parsed tool outputs, accumulated low-density history
  • Long sessions could use fewer tokens via context compression, cache-friendly prompts, better tool scheduling
  • Efficiency gains likely require agent–inference co-design, including cache-state awareness and session semantics

Chayenne Zhao’s thread, “We’re Not Wasting Tokens — We’re Wasting the Design Margin of the Entire Inference Stack,” lands on an uncomfortable point for anyone building agentic tooling: the problem isn’t simply that models are expensive—it’s that agent frameworks often spend tokens in ways that actively undermine the inference stack’s best optimizations.

When “stateless API calls” collide with inference reality

Zhao frames the current crunch as a design mismatch across layers. Agent harnesses commonly behave as if inference is a stateless endpoint, repeatedly sending large context blocks and expecting the serving layer to clean up the mess with caching. The thread argues that this pattern can effectively defeat prefix caching and related techniques that inference engine developers rely on to keep throughput and cost sane—especially in long-running sessions.
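To make the mismatch concrete, here is a minimal sketch (not any framework's real API; the prompt builders and session contents are hypothetical) of why prompt layout decides whether a prefix cache can help at all. A serving engine can only reuse cached KV entries for the longest prefix a request shares with one it has already processed, so any volatile data placed before the stable content invalidates everything after it:

```python
# Sketch: how prompt construction determines prefix-cache reuse.
# A serving engine can only reuse KV-cache state for the longest
# leading span shared with a previously processed request.

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

SYSTEM = "You are a coding agent.\n"  # hypothetical stable system prompt

def cache_hostile(turn: int, history: list) -> str:
    # Volatile data (here, a per-request counter) placed BEFORE the
    # stable content changes the prefix on every single request.
    return f"[request #{turn}]\n" + SYSTEM + "".join(history)

def cache_friendly(turn: int, history: list) -> str:
    # Stable content first; only append new material at the end.
    return SYSTEM + "".join(history) + f"[request #{turn}]\n"

history = ["tool output 0\n", "tool output 1\n"]
h_prev, h_next = cache_hostile(0, history[:1]), cache_hostile(1, history)
f_prev, f_next = cache_friendly(0, history[:1]), cache_friendly(1, history)

# The hostile layout shares only the literal "[request #" across turns;
# the friendly layout shares the entire system prompt plus prior history.
assert shared_prefix_len(h_prev, h_next) < shared_prefix_len(f_prev, f_next)
```

The same few bytes of per-request churn cost nothing in token count but, placed at the front, force the engine to recompute the whole context from scratch.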

In a concrete example, Zhao describes observing Claude Code request patterns against a local serving engine and measuring painfully low cache hit rates, suggesting that carefully engineered prefix cache mechanisms can be “almost entirely defeated” by how sessions are constructed.
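A toy model makes it easy to see how such low hit rates arise. The sketch below imitates block-based prefix caching (inspired by engines such as vLLM, but deliberately simplified and not any engine's real implementation): prompts are split into fixed-size token blocks, and a block only hits if the identical block arrived earlier at the identical position. A single changed token near the front of the prompt invalidates every block after it:

```python
# Toy model of block-based prefix caching. A block hits only if the
# same block was previously seen with the same preceding blocks; any
# early change in the prompt invalidates everything downstream.

BLOCK = 16  # tokens per cache block (illustrative size)

def hit_rate(requests: list) -> float:
    """Fraction of full blocks across all requests served from cache."""
    cache = set()
    hits = total = 0
    for toks in requests:
        prefix = ()  # identity of everything before the current block
        for i in range(0, len(toks) - len(toks) % BLOCK, BLOCK):
            block = tuple(toks[i:i + BLOCK])
            key = (prefix, block)  # position-dependent cache key
            total += 1
            if key in cache:
                hits += 1
            else:
                cache.add(key)
            prefix = key
    return hits / total if total else 0.0

base = list(range(64))  # a stable 4-block context
# Append-only session: each turn adds new tokens at the end.
stable = [base + list(range(100 + t, 116 + t)) for t in range(4)]
# Churning session: one token rewritten at the very front each turn.
churn = [[t] + base for t in range(4)]

print(hit_rate(stable))  # most blocks reused across turns
print(hit_rate(churn))   # zero reuse despite near-identical content
```

In this toy, the append-only stream reuses a majority of blocks while the front-churning stream hits the cache exactly never, even though the two sessions carry almost the same content.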

Token bloat as an engineering problem, not a pricing debate

A key idea in the thread is that “token demand” from agents can be artificially inflated by crude context management: retransmitting already-processed context, re-parsing tool outputs, and accumulating low-density history. Zhao goes further with a hypothesis that very long sessions (hundreds of thousands of tokens) could potentially be restructured to use a fraction of the tokens—without sacrificing output quality—through smarter context compression, cache-friendly prompt construction, and better tool scheduling.
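One way to restructure a long session along these lines is to fold older turns into a compact digest while keeping the front of the context byte-identical, so compression and cache-friendliness do not work against each other. The sketch below is hypothetical (the `summarize` helper is a stand-in for a real model-backed summarization call, and the layout is an assumption, not Zhao's proposal):

```python
# Hypothetical context-compression sketch for long agent sessions:
# older turns are collapsed into a short summary block while the
# system prompt stays byte-identical at the front of every request.

def summarize(turns: list) -> str:
    # Stand-in for a real summarization call: keep only each old
    # turn's first line as a crude digest.
    return "".join(t.splitlines()[0] + "\n" for t in turns)

def build_context(system: str, turns: list, keep_recent: int = 2) -> str:
    """Stable system prompt, then a digest of old turns, then recent turns verbatim."""
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    parts = [system]
    if old:
        parts.append("[summary of earlier turns]\n" + summarize(old))
    parts.extend(recent)
    return "".join(parts)

turns = [f"turn {i}: ran tests\n" + "log line\n" * 40 for i in range(6)]
full = "SYS\n" + "".join(turns)
compressed = build_context("SYS\n", turns)
print(len(full), len(compressed))  # compressed is a fraction of the full transcript
```

The recent turns stay verbatim so the model keeps high-fidelity working context, while the accumulated low-density history, exactly the bloat the thread describes, is what gets compressed away.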

The bigger implication: co-design across agent, model, and inference

While the conversation started from a dispute around third-party harness access and token-plan design, Zhao’s main push is broader: meaningful efficiency gains likely require agent–inference co-design, where harnesses become aware of cache state (and inference engines understand more about agent session semantics), rather than treating each request as an isolated transaction.

Original source: https://x.com/i/status/2040976525030494211
