Prompt caching has a way of turning from an “infra detail” into the thing that quietly dictates whether an agent product feels fast, predictable, and affordable. That’s the thesis behind Thariq’s thread, “Lessons from Building Claude Code: Prompt Caching Is Everything”, which lays out how Claude Code’s agent harness was effectively designed around cache behavior—not bolted onto it later.
“Cache Rules Everything Around Me,” but for agents
The thread frames prompt caching as prefix matching up to cache_control breakpoints, which immediately makes ordering a first-order architectural decision. Claude Code's approach is to structure requests so the most reusable parts stay at the front: static system prompt and tools, then project-level context (like CLAUDE.md), then session context, and only after that the conversation messages. That structure maximizes shared prefixes across sessions, but it's also easy to break in surprisingly mundane ways—like adding a timestamp to "static" instructions or letting tool definitions shuffle.
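To make the ordering concrete, here is a minimal sketch of a request built in that most-reusable-first order, using the Anthropic Messages API shape with a cache_control breakpoint after the stable prefix. The helper name, tool definition, model id, and context strings are all illustrative, not taken from Claude Code itself:

```python
# Sketch: order request parts so the most reusable prefix comes first.
# Everything before the cache_control breakpoint should stay byte-identical
# across turns (and, ideally, across sessions) to keep the prefix cacheable.

def build_request(project_context: str, session_context: str, messages: list) -> dict:
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model id
        # 1. Tool definitions sit at the very front of the cached prefix,
        # so their order and contents must never shuffle mid-conversation.
        "tools": [
            {
                "name": "read_file",  # illustrative tool
                "description": "Read a file from disk.",
                "input_schema": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                },
            }
        ],
        "system": [
            # 2. Static instructions: identical across all sessions.
            {"type": "text", "text": "You are a coding agent..."},
            # 3. Project-level context (e.g. CLAUDE.md): shared within a project.
            {"type": "text", "text": project_context},
            # 4. Session context, then a breakpoint marking the end of the
            # reusable prefix.
            {
                "type": "text",
                "text": session_context,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # 5. Conversation messages change every turn, so they come last.
        "messages": messages,
    }

req = build_request(
    "## CLAUDE.md\nProject conventions...",
    "cwd: /repo",
    [{"role": "user", "content": "hi"}],
)
```

Anything injected above the breakpoint, even a timestamp, changes the prefix bytes and forces a full cache miss on the next turn.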
Design patterns that avoid accidental cache busts
A few of the most practical lessons are about resisting the “obvious” implementation:
- Instead of updating the system prompt when something changes, Claude Code injects updates via messages (including a noted pattern) to preserve the cached prefix.
- Switching models mid-session can backfire because prompt caches are model-specific. The thread suggests using subagents and handoff messages rather than swapping models directly in the same flow.
- The biggest recurring footgun: adding/removing tools mid-conversation, since tool definitions live in the cached prefix. Claude Code reportedly models feature state changes (like plan mode) as tools rather than changing the tool set.
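The first of those patterns, injecting changed state as a message rather than editing the system prompt, might look like the following sketch. The `inject_update` helper and the `<context-update>` wrapper tag are hypothetical names for illustration:

```python
# Sketch: append changed state as a new message so the cached
# system-prompt prefix stays byte-identical. Mutating the system prompt
# instead would invalidate the entire prefix cache.

def inject_update(messages: list, update: str) -> list:
    """Return a new message list with the update appended as a user turn."""
    return messages + [{
        "role": "user",
        # Hypothetical wrapper tag so the model can distinguish injected
        # state from actual user input.
        "content": f"<context-update>{update}</context-update>",
    }]

history = [{"role": "user", "content": "fix the failing test"}]
history = inject_update(history, "The user changed the working directory to /app.")
```

The same idea generalizes: anything that would tempt you to rewrite the prefix (mode changes, environment updates) gets expressed further down the request instead.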
There’s also an interesting section on compaction: summarization can become dramatically more expensive if it’s done in a separate call that doesn’t share the original prefix—and Claude Code’s mitigation is “cache-safe forking” that keeps the same system prompt, context, and tools.
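A cache-safe fork can be sketched as reusing the base request wholesale and swapping only the messages, so the summarization call shares the cached system/tools prefix instead of re-processing it. The function name and the summarization instruction are illustrative:

```python
# Sketch of "cache-safe forking": the compaction call keeps the same model,
# system prompt, and tools as the main conversation, so it hits the same
# cached prefix. A fresh prompt with a different prefix would pay full
# input-token cost for everything the cache already covered.

def fork_for_compaction(base_request: dict, transcript: list) -> dict:
    fork = dict(base_request)  # shallow copy: model/system/tools unchanged
    fork["messages"] = transcript + [{
        "role": "user",
        "content": "Summarize the conversation so far for compaction.",
    }]
    return fork

base = {
    "model": "claude-sonnet-4-20250514",  # placeholder model id
    "system": "You are a coding agent...",
    "tools": [],
    "messages": [],
}
forked = fork_for_compaction(base, [{"role": "user", "content": "long history..."}])
```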
Thariq wraps by arguing cache hit rate deserves production-level monitoring (including treating drops as incidents), since a small miss-rate swing can meaningfully change both latency and cost.
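As a starting point for that monitoring, a per-request hit rate can be derived from the usage block the API returns, where `cache_read_input_tokens` counts tokens served from cache, `cache_creation_input_tokens` counts tokens written to cache, and `input_tokens` counts uncached tokens. The alert threshold here is illustrative:

```python
# Sketch: compute a per-request cache hit rate from the API's usage fields.
# A sustained drop in this number is the signal the thread suggests treating
# as a production incident.

def cache_hit_rate(usage: dict) -> float:
    read = usage.get("cache_read_input_tokens", 0)
    total = (
        read
        + usage.get("cache_creation_input_tokens", 0)
        + usage.get("input_tokens", 0)
    )
    return read / total if total else 0.0

usage = {
    "cache_read_input_tokens": 9000,
    "cache_creation_input_tokens": 500,
    "input_tokens": 500,
}
rate = cache_hit_rate(usage)  # 9000 / 10000 = 0.9
if rate < 0.8:  # illustrative alert threshold
    print("cache hit rate dropped; investigate")
```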
Source: https://x.com/trq212/status/2024638793719177291?s=20

