Headroom: Open-Source Proxy That Cuts LLM Token Costs Up To 90%

Headroom is an open-source proxy that compresses, caches, and manages LLM inputs and outputs to cut token usage and provider costs. It provides reversible compression (SmartCrusher/CCR), prefix stabilization, rolling-window context, and a drop-in proxy + SDK with no code changes.

TL;DR

  • Transparent proxy: provider-bound requests are routed through Headroom, which applies compression, caching, and context-window transforms; existing clients point at the proxy with no code changes.
  • Core transforms: SmartCrusher (statistical JSON compression), CacheAligner (prefix normalization), RollingWindow (context management), CCR (reversible compression with automatic retrieval), and opt-in LLMLingua-2.
  • Providers supported: OpenAI (tiktoken), Anthropic, Google, Cohere, Mistral. Reported reductions: search results 45k→4.5k (90%), log analysis 22k→3.3k (85%), long conversation 80k→32k (60%).
  • Runtime and safety: ~1–5 ms overhead per request; user/assistant messages not compressed, tool-call ordering preserved, parse failures pass through, compression reversible via CCR.

Headroom is an open-source “context optimization layer” designed to sit in front of LLM applications as a proxy, handling compression, caching, and context-window management. The project packages a proxy server and SDK that aim to reduce token usage — and therefore provider costs — by applying reversible and statistical compression techniques without requiring changes to existing tool integrations.

What it does

Headroom operates as a transparent proxy: provider-bound requests are routed through the proxy, which applies transforms to tool outputs and request prefixes. Key behaviors include statistical compression of JSON tool outputs, prefix stabilization to improve provider-side caching, and rolling-window context management that prevents token-limit failures while preserving tool semantics. Compression is reversible via the CCR architecture, so original content can be retrieved if the LLM requests it.
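
To make the reversibility idea concrete, here is a hypothetical sketch in the spirit of CCR (all names are illustrative, not Headroom's API): long tool outputs are replaced with a preview plus a retrieval marker, and the original is stashed so it can be expanded on demand.

    # Hypothetical sketch of reversible compression in the spirit of CCR;
    # Headroom's real internals may differ. Originals are stored under a
    # marker ID so full content can be retrieved if the LLM asks for it.
    import hashlib

    _originals: dict[str, str] = {}  # marker ID -> original tool output

    def compress(tool_output: str, max_chars: int = 200) -> str:
        """Replace a long tool output with a preview plus a retrieval marker."""
        if len(tool_output) <= max_chars:
            return tool_output  # small outputs pass through untouched
        marker = hashlib.sha256(tool_output.encode()).hexdigest()[:12]
        _originals[marker] = tool_output
        return f"{tool_output[:max_chars]}... [truncated; expand with marker {marker}]"

    def expand(marker: str) -> str:
        """Retrieve the original content when the model requests it."""
        return _originals[marker]

    compressed = compress("x" * 10_000)
    marker = compressed.rsplit(" ", 1)[-1].rstrip("]")
    assert expand(marker) == "x" * 10_000  # round-trip: compression is reversible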

Core capabilities

  • SmartCrusher — statistical compression for JSON outputs, reducing large lists and search results while keeping anomalies and relevant items (see the sketch after this list).
  • CacheAligner — normalizes prefixes to increase cache hit rates with providers.
  • RollingWindow — manages context windows to avoid token-limit errors without breaking tool call order.
  • CCR — reversible compression with automatic retrieval when needed.
  • LLMLingua-2 (opt-in) — an ML-based compression option advertised for higher compression ratios.
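
To make SmartCrusher-style compression concrete, here is a hypothetical sketch (not Headroom's actual algorithm): keep a few head items plus statistical outliers from a large JSON list, and summarize the rest.

    # Hypothetical illustration of statistical JSON compression in the
    # spirit of SmartCrusher; the real heuristics are more sophisticated.
    import json
    import statistics

    def crush(items: list[dict], keep_head: int = 3, z_cutoff: float = 2.0) -> list[dict]:
        """Keep the first few items plus score outliers; summarize the rest."""
        scores = [it["score"] for it in items]
        mean, stdev = statistics.mean(scores), statistics.stdev(scores)
        outliers = [it for it in items[keep_head:]
                    if stdev and abs(it["score"] - mean) > z_cutoff * stdev]
        kept = items[:keep_head] + outliers
        kept.append({"_summary": f"{len(items) - len(kept)} similar items omitted"})
        return kept

    results = [{"id": i, "score": 1.0} for i in range(1000)]
    results[500]["score"] = 9.9  # an anomaly worth keeping
    print(json.dumps(crush(results), indent=2))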

The repository emphasizes zero code changes required: existing clients can point at the proxy endpoint and benefit from token savings.

Quickstart

A minimal proxy quickstart, as shown in the repo:

  1. Install the proxy: pip install "headroom-ai[proxy]"
  2. Start the proxy: headroom proxy --port 8787
  3. Point clients at the proxy, for example OPENAI_BASE_URL=http://localhost:8787/v1 for OpenAI-compatible clients or ANTHROPIC_BASE_URL=http://localhost:8787 for Anthropic.
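
Clients that construct their own SDK instance can set the base URL in code instead of via an environment variable; a minimal sketch with the official openai Python package (the model name here is illustrative):

    from openai import OpenAI

    # Point the official client at the Headroom proxy; the API key is still
    # read from OPENAI_API_KEY and forwarded to the provider.
    client = OpenAI(base_url="http://localhost:8787/v1")

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": "Summarize these search results."}],
    )
    print(resp.choices[0].message.content)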

A stats endpoint exposes token savings and cost metrics; for example, curl http://localhost:8787/stats returns JSON about tokens saved and percent savings.
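
The endpoint can also be polled programmatically; a small sketch using requests (the exact fields in the returned JSON depend on the Headroom version, so it is printed as-is):

    import requests

    # Fetch cumulative savings from the proxy's stats endpoint.
    stats = requests.get("http://localhost:8787/stats").json()
    print(stats)  # tokens saved and percent savings, per the repo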

Providers and performance

Headroom lists support for major provider integrations and token counting strategies, including OpenAI (tiktoken), Anthropic (official API), Google, Cohere, and Mistral. The repo provides performance examples with reported token reductions:

  • Search results (1000 items): from 45,000 tokens to 4,500 tokens (90% savings)
  • Log analysis (500 entries): from 22,000 tokens to 3,300 tokens (85% savings)
  • Long conversation (50 turns): from 80,000 tokens to 32,000 tokens (60% savings)

Reported overhead is small: approximately 1–5 ms per request.
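
To sanity-check reductions like these on your own payloads, token counts can be measured with tiktoken, the library the repo lists for OpenAI token counting; a minimal sketch (the sample strings are placeholders):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o")  # pick the model you target

    def count_tokens(text: str) -> int:
        return len(enc.encode(text))

    original = '{"results": [...]}'    # your raw tool output
    compressed = '{"results": "..."}'  # the proxy's compressed version
    print(count_tokens(original), "->", count_tokens(compressed))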

Safety and guarantees

The project documents several safety constraints:

  • User and assistant messages are not compressed.
  • Tool call ordering is preserved.
  • Parse failures are no-ops (malformed content passes through unchanged); see the sketch after this list.
  • Compression is reversible and can be expanded via CCR when necessary.
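
The parse-failure guarantee boils down to a guard around the transform. A hypothetical sketch of the pattern, not Headroom's actual code:

    import json

    def maybe_compress(raw: str, compress) -> str:
        """Apply compression only if the payload parses; otherwise pass through."""
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            return raw  # malformed content is returned unchanged (a no-op)
        return json.dumps(compress(data))

    print(maybe_compress("not json {", lambda d: d))  # -> "not json {"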

Docs, examples, and contribution

Documentation covers SDK usage, proxy deployment, configuration, CCR internals, metrics, and troubleshooting. Examples in the repo include runnable code such as basic_usage.py, proxy_integration.py, and ccr_demo.py. The project is licensed under Apache-2.0 and includes contribution guidance and a test suite.

For code, documentation, and to explore the project further, see the repository: https://github.com/chopratejas/headroom.
