Claude Code’s secret weapon: prompt caching shapes everything

A recent thread by Thariq explains why Claude Code is architected around prompt caching to keep agent workflows fast and affordable. He highlights a few easy ways teams accidentally tank cache hit rates—and why monitoring misses matters.


TL;DR

  • Prompt caching works as prefix matching up to cache_control breakpoints, so request ordering becomes an architectural constraint
  • Claude Code prompt layout: system prompt + tools → CLAUDE.md → session context → conversation messages
  • Common cache-busters: timestamps in “static” instructions, shuffled tool definitions, other minor prefix changes
  • Preserve the cached prefix by injecting updates via messages instead of editing the system prompt
  • Model-specific caches: prefer subagents and handoff messages over mid-session model switches
  • Monitor cache hit rate in production; treat drops as incidents due to latency and cost impact

Prompt caching has a way of turning from an “infra detail” into the thing that quietly dictates whether an agent product feels fast, predictable, and affordable. That’s the thesis behind Thariq’s thread, “Lessons from Building Claude Code: Prompt Caching Is Everything”, which lays out how Claude Code’s agent harness was effectively designed around cache behavior—not bolted onto it later.

“Cache Rules Everything Around Me,” but for agents

The thread frames prompt caching as prefix matching up to cache_control breakpoints, which immediately makes ordering a first-order architectural decision. Claude Code’s approach is to structure requests so the most reusable parts stay at the front: the static system prompt and tools, then project-level context (like CLAUDE.md), then session context, and only after that the conversation messages. That structure maximizes shared prefixes across sessions, but it’s also easy to break in surprisingly mundane ways, like adding a timestamp to “static” instructions or letting tool definitions shuffle.
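In API terms, that layout can be sketched roughly as follows, assuming the Anthropic Messages API’s cache_control content blocks. The prompt text, tool list, and function names here are illustrative placeholders, not Claude Code’s actual values.

```python
# Sketch of a prefix-ordered request, assuming the Anthropic Messages API's
# cache_control blocks. All prompt and tool contents are placeholders.

STATIC_SYSTEM_PROMPT = "You are a coding agent."  # kept byte-identical; no timestamps

TOOLS = [  # fixed order across requests: shuffling these busts the cached prefix
    {"name": "read_file", "description": "Read a file from disk",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "run_shell", "description": "Run a shell command",
     "input_schema": {"type": "object", "properties": {}}},
]

def build_request(claude_md: str, session_context: str, messages: list) -> dict:
    """Order content from most to least stable so shared prefixes stay cacheable."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 4096,
        "tools": TOOLS,
        "system": [
            # 1. Static instructions, shared across all sessions.
            {"type": "text", "text": STATIC_SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}},
            # 2. Project-level context (CLAUDE.md), shared per project.
            {"type": "text", "text": claude_md,
             "cache_control": {"type": "ephemeral"}},
            # 3. Session context, shared within one session.
            {"type": "text", "text": session_context,
             "cache_control": {"type": "ephemeral"}},
        ],
        # 4. Conversation messages: the only part expected to change every turn.
        "messages": messages,
    }
```

Each cache_control marker ends a cacheable prefix, so a change at level 3 can still reuse the caches established at levels 1 and 2.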

Design patterns that avoid accidental cache busts

A few of the most practical lessons are about resisting the “obvious” implementation:

  • Instead of updating the system prompt when something changes, Claude Code injects updates as messages (the thread notes a specific message pattern for this) to preserve the cached prefix.
  • Switching models mid-session can backfire because prompt caches are model-specific. The thread suggests using subagents and handoff messages rather than swapping models directly in the same flow.
  • The biggest recurring footgun: adding/removing tools mid-conversation, since tool definitions live in the cached prefix. Claude Code reportedly models feature state changes (like plan mode) as tools rather than changing the tool set.
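The first and third bullets can be sketched together. Both the state-update wrapper tag and the plan-mode tool name below are assumptions for illustration, not Claude Code’s actual format.

```python
# Sketch: deliver state changes as messages, and feature toggles as tools,
# so the cached system/tool prefix stays byte-identical. The wrapper tag and
# tool name are hypothetical.

def inject_update(messages: list, update: str) -> list:
    """Append the change as a new message instead of editing the system prompt."""
    return messages + [{
        "role": "user",
        "content": f"<state-update>{update}</state-update>",
    }]

# Feature state modeled as a tool, so the tool *set* never changes mid-session:
PLAN_MODE_TOOL = {
    "name": "exit_plan_mode",  # hypothetical name
    "description": "Signal that planning is done and execution may begin.",
    "input_schema": {"type": "object", "properties": {}},
}
```

Either way, everything in front of the conversation messages stays untouched, which is the whole point.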

There’s also an interesting section on compaction: summarization can become dramatically more expensive if it’s done in a separate call that doesn’t share the original prefix—and Claude Code’s mitigation is “cache-safe forking” that keeps the same system prompt, context, and tools.
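A cache-safe fork can be sketched as below: the summarization request reuses the parent request’s system prompt, context, and tools verbatim and only appends to the messages, so the expensive prefix is read from cache. The function and instruction text are illustrative.

```python
def fork_for_compaction(request: dict, summary_instruction: str) -> dict:
    """Build a summarization request that shares the parent's cached prefix.

    Everything before `messages` (system prompt, context, tools) is reused
    verbatim; only the conversation tail differs, so the fork hits the
    existing cache instead of paying for a cold prefix.
    """
    fork = dict(request)  # shallow copy: system/tools are shared, not rebuilt
    fork["messages"] = request["messages"] + [
        {"role": "user", "content": summary_instruction}
    ]
    return fork
```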

Thariq wraps by arguing cache hit rate deserves production-level monitoring (including treating drops as incidents), since a small miss-rate swing can meaningfully change both latency and cost.
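As a concrete starting point, a per-request hit rate can be derived from the API’s usage counters; the field names below follow the Anthropic usage object (cache_read_input_tokens, cache_creation_input_tokens, input_tokens), while the alert threshold is an arbitrary example.

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of prompt tokens served from cache for a single request."""
    read = usage.get("cache_read_input_tokens", 0)         # tokens read from cache
    written = usage.get("cache_creation_input_tokens", 0)  # tokens newly cached
    uncached = usage.get("input_tokens", 0)                # tokens processed cold
    total = read + written + uncached
    return read / total if total else 0.0

ALERT_THRESHOLD = 0.80  # example value; tune to your workload

def hit_rate_alarm(recent_rates: list) -> bool:
    """True when the rolling average drops below the threshold: treat as an incident."""
    return bool(recent_rates) and sum(recent_rates) / len(recent_rates) < ALERT_THRESHOLD
```

Tracking this per request makes the cost of a regression visible immediately, rather than showing up later as a surprising bill.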

Source: https://x.com/trq212/status/2024638793719177291?s=20

Continue the conversation on Slack

Did this article spark your interest? Join our community of experts and enthusiasts to dive deeper, ask questions, and share your ideas.
