headroom

Name: headroom
Rating: 5 (56599 reviews)

Compress everything your AI agent reads — tool outputs, logs, RAG chunks, and files — before it reaches the LLM, achieving 60–95% fewer tokens with the same answers.

56.6Kstars

4.1Kforks

Apache License 2.0

Python

View Source Visit Website

On This Page

Headroom is a context compression layer that sits between your AI agent and the LLM provider. Every piece of content your agent ingests — tool outputs, log files, RAG chunks, codebase searches, and conversation history — is automatically compressed before being sent, slashing token costs by 60–95% without degrading the quality of the model’s responses.

The library ships three integration modes that can be mixed and matched: an inline Python or TypeScript compress() function you add to any existing LLM call; a transparent HTTP proxy that intercepts requests from any language with zero code changes; and a command-line agent wrapper (headroom wrap claude|codex|cursor|aider) that bootstraps the proxy and launches the agent in one step. An MCP server exposes compression, retrieval, and statistics as tools any MCP-native client can call.

Beyond simple compression, Headroom includes a cross-agent memory store that deduplicates context across Claude, Codex, and Gemini sessions; a headroom learn command that mines failed agent sessions and writes corrections back to CLAUDE.md or AGENTS.md; and a Reversible Compression (CCR) system that caches originals locally so the LLM can call headroom_retrieve whenever it needs the unabridged version.

Output token reduction is also supported: the proxy can inject a terseness instruction that reduces model verbosity and uses effort routing to dial down thinking depth on routine steps like file reads and passing tests — cutting what the model writes back in addition to what you send.

What You Get

Inline library API — Call compress(messages, model=...) in Python or TypeScript; drop the result into any existing Anthropic, OpenAI, or LiteLLM call with no other changes
Transparent HTTP proxy — headroom proxy --port 8787 intercepts all LLM requests, applies compression, and forwards them; any language, zero code changes
Agent wrapper CLI — headroom wrap claude|codex|cursor|aider|copilot starts the proxy and launches the target agent in one command, with optional --memory and --code-graph flags
MCP server — headroom_compress, headroom_retrieve, and headroom_stats tools available to any MCP-native client via headroom mcp install
Cross-agent memory — shared, deduplicated memory store across Claude, Codex, and Gemini sessions backed by SQLite-vec for lightweight vector search
headroom learn — scans past sessions for failure patterns and writes targeted corrections to CLAUDE.md or AGENTS.md, including verbosity calibration with --verbosity
Output token reduction — proxy-side verbosity steering and effort routing reduce model output costs on Opus-class models without touching your prompts
Reversible Compression (CCR) — originals cached locally; LLM calls headroom_retrieve when it needs the full version, making every compression safely reversible

Common Use Cases

Reducing Claude Code / Codex costs — wrap an existing coding agent with headroom wrap claude to compress every tool result and codebase search before it reaches the model, cutting context costs on long sessions
RAG pipeline optimization — pass retrieved document chunks through compress() before adding them to the LLM prompt, reducing retrieval-augmented calls by 60–92% without losing answer accuracy
Log and incident triage agents — feed multi-thousand-line SRE logs through Headroom’s proxy before analysis; benchmarks show 92% compression on incident-debugging workloads
Multi-agent memory sharing — use the cross-agent memory store so separate Claude and Codex agents working on the same codebase share context without re-sending the same files
Zero-code proxy for any LLM client — point any OpenAI-compatible HTTP client at http://localhost:8787 to get automatic compression without modifying application code
GitHub Copilot CLI subscription routing — authenticate with headroom copilot-auth login and wrap Copilot CLI subscription traffic through the compression proxy

Under The Hood

Architecture Headroom is organized as a layered pipeline with a clear separation between content detection, compression, caching, and retrieval. The ContentRouter sits at the top of the stack and dispatches each incoming payload to the appropriate specialist: SmartCrusher for structured JSON, CodeCompressor for source code using AST-aware slicing, and Kompress-v2-base for unstructured prose and logs. A CacheAligner pre-processes every input to stabilize token prefixes — stripping or normalizing timestamps, UUIDs, and version numbers — so that provider-side KV caches produce hits across turns rather than missing on superficially changed context. The CCR (Compress-Cache-Retrieve) layer stores originals locally in SQLite and injects a headroom_retrieve tool so the model can pull back the uncompressed version on demand, making every compression reversible without round-tripping data to any external service. The proxy and library modes share a single canonical pipeline, which extension authors can hook into at well-defined lifecycle stages via a plugin entry point.

Tech Stack The core compression library is written in Python 3.10+ and is distributed as headroom-ai on PyPI. The proxy layer is built on FastAPI and Uvicorn and communicates upstream over HTTP/2 via httpx. An AST-aware CodeCompressor uses tree-sitter (with regex fallback) for language-agnostic code slicing, and the Kompress-v2-base neural compression model runs through ONNX Runtime for inference without requiring a PyTorch installation. Vector-based memory search is backed by sqlite-vec, a lightweight SQLite extension. A parallel Rust port of the hot path — tokenization, cache-control logic, and CCR transforms — lives in the crates/ directory and is built with Maturin for Python binding via PyO3. A TypeScript SDK in sdk/typescript/ provides first-class Node/Bun support. Optional integrations include LangChain, LangGraph, Agno, and Strands.

Code Quality The test suite is extensive, with over 300 test files covering unit tests, acceptance tests, integration tests, adversarial grid evaluations, and end-to-end scenarios for WebSocket proxying and batch API results. Core modules use comprehensive type annotations and Pydantic v2 models for all configuration and data structures; the project ships a py.typed marker for downstream type checking. A codecov.yml and GitHub Actions CI pipeline run on every push. Error handling is explicit throughout, with custom exception types in exceptions.py and carefully guarded optional imports for heavyweight dependencies like PyTorch and tree-sitter. The codebase follows a consistent pattern of lazy-initialized singletons protected by threading locks to make the library safe for concurrent use.

What Makes It Unique Headroom’s most distinctive design choice is treating LLM context as a typed, structured resource rather than a flat string. Instead of summarizing or truncating, it routes each content type to a purpose-built compressor — AST slicing for code preserves signatures while collapsing bodies, entropy-aware masking retains UUIDs and hashes that naive truncation would destroy, and the neural Kompress model targets prose. CCR reversibility is architecturally unusual: rather than discarding compressed content, Headroom registers a retrieval tool with the LLM so the model can pull originals back mid-conversation, effectively giving the agent an infinite context window backed by local storage. The headroom learn feedback loop — which mines failed sessions and writes corrective instructions to agent memory files — closes a loop that typical compression tools leave open, allowing the system to improve agent behavior over time rather than just reduce token spend.

Self-Hosting

Headroom is released under the Apache License 2.0, a permissive open-source license that allows commercial use, modification, and redistribution with no copyleft requirements. You are free to embed it in proprietary products, run it in commercial production environments, and distribute modified versions, as long as you preserve the original license notice and the NOTICE file. There are no open-core restrictions, gating flags, or enterprise-only features in the source.

Running Headroom yourself is operationally straightforward for most self-hosters. The proxy is a single FastAPI/Uvicorn process that can be started with one command and has no required external dependencies beyond Python 3.10+; optional components like vector memory (sqlite-vec) and the neural Kompress model (ONNX Runtime) are installed as extras. All compressed originals are stored in local SQLite databases, so there is no external data store to operate. For production deployments, the project ships a Dockerfile and docker-compose configuration, and a Gunicorn production ASGI server option is available on Unix. The main operational responsibilities are keeping the proxy process running, managing local SQLite storage growth, and upgrading the package as new releases ship at a brisk pace of roughly 18 releases per month.

Headroom does not currently offer a managed cloud service, so everything a SaaS tier might provide — hosted infrastructure, uptime SLAs, managed upgrades, enterprise support contracts, and team-level dashboards — must be handled by the operator. The ENTERPRISE.md file points interested teams to hello@headroomlabs.ai for discussions about scale deployments. For teams that want hands-off operation, the absence of a hosted tier means accepting responsibility for availability, backup of the local CCR store, and keeping pace with the project’s rapid release cadence.

On This Page