How to Reduce Claude Code Token Costs
Claude Code, Cursor, and Codex all bill by the token. This guide covers every practical lever in order of implementation cost — from built-in prompt caching to semantic compression via MCP — with real numbers and no fabricated benchmarks.
Why Claude Code token bills compound
At Anthropic's published Sonnet 4.6 pricing, input costs $3.00 per million tokens and output costs $15.00 per million (source: anthropic.com/pricing). Output is 5× more expensive per token, but a coding agent typically reads far more tokens than it writes, so input is usually the larger share of the bill and the first place to optimize. Four compounding sources account for most Claude Code bills:
- Tool catalog at session start. Every MCP session opens with a tools/list response. A large MCP catalog, like the gotcontext gateway at 142 tools before v1.23.18, cost ~38,000 tokens per cold session start, or roughly $0.11 before a single tool call runs. The cost-breakdown post has the full math.
- Large files sent verbatim. Claude Code reads files to understand code. A 3,000-line router file or a long diff can easily consume 30,000–50,000 tokens in a single turn. Repeated reads of the same file multiply the cost.
- Verbose tool output. Test runner output, git log, and log dumps often contain an order of magnitude more text than the signal you actually need. That output lands in context as input tokens on the next turn.
- Repeated context across turns. In a long session, Claude Code re-reads portions of the conversation context on every turn. Content that is not cached gets billed again each time.
None of these is individually catastrophic. All four together, across a full coding session, produce the invoice surprise that brings AI engineers to search queries like the one that landed you here.
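To make the per-turn arithmetic concrete, here is a minimal Python sketch that converts token counts into dollars at the rates quoted above. The function name and the 40,000-token file turn are illustrative assumptions; only the rates and the 38,000-token catalog figure come from this guide.

```python
# Rates quoted in this guide (USD per million tokens).
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def turn_cost_usd(input_tokens: int, output_tokens: int = 0) -> float:
    """Return the uncached cost of one turn at the rates above."""
    return (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


# Cold session start with the 142-tool catalog (figure from this guide).
catalog_cost = turn_cost_usd(38_000)
print(f"catalog at session start: ${catalog_cost:.3f}")

# Hypothetical turn: one 40,000-token file read and a 1,000-token reply.
file_turn_cost = turn_cost_usd(40_000, 1_000)
print(f"large-file turn: ${file_turn_cost:.3f}")
```

Multiplying the second figure by the number of turns in a session gives a rough upper bound for an uncached session that keeps re-reading the same file.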
Measure your spend before optimizing
Before applying any lever, identify which source is costing most. The Anthropic console usage dashboard breaks input and output tokens per model per day. Two questions narrow the diagnosis:
- Are many short sessions expensive? Tool-catalog cost dominates. Fix: Lever 2 (?profile=core).
- Are long sessions expensive? Repeated large-file reads or verbose output dominates. Fix: Lever 3 (semantic compression).
The quantitative breakdown post walks through each cost source with specific dollar figures at $3/M input and $15/M output.
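The two diagnostic questions can be turned into a rough check against your own usage logs. The sketch below is an assumption-heavy illustration: the `Session` record, the 50% threshold, and the idea that the catalog is billed once per session are all hypothetical choices, not something the Anthropic console exports directly.

```python
from dataclasses import dataclass

CATALOG_TOKENS = 38_000  # full-catalog session-start figure from this guide


@dataclass
class Session:
    """Hypothetical per-session record; fill it from your own logs."""
    input_tokens: int  # total input tokens billed for the session


def catalog_share(session: Session, catalog_tokens: int = CATALOG_TOKENS) -> float:
    """Fraction of a session's input tokens spent on the tool catalog."""
    if session.input_tokens <= 0:
        return 0.0
    return min(catalog_tokens / session.input_tokens, 1.0)


def suggest_lever(sessions: list[Session], threshold: float = 0.5) -> str:
    """Point at Lever 2 when the catalog dominates the average session."""
    if not sessions:
        return "no data"
    avg_share = sum(catalog_share(s) for s in sessions) / len(sessions)
    if avg_share >= threshold:
        return "Lever 2 (?profile=core)"
    return "Lever 3 (semantic compression)"


# Many short sessions: the catalog is most of each bill.
print(suggest_lever([Session(50_000), Session(60_000)]))
# One long session: the catalog is a small slice.
print(suggest_lever([Session(800_000)]))
```

The threshold is arbitrary; the useful output is the share itself, which tells you how much of each session is spent before any real work happens.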
Lever 1: Prompt caching (free, built-in)
Anthropic's prompt caching stores repeated prefixes server-side and bills cache hits at 10% of the standard input rate. In Claude Code, this applies automatically to the system prompt and any stable large files that appear early in the context. There is nothing to toggle in Claude Code itself; caching takes effect when the same prefix appears across consecutive turns of the same session.
What prompt caching does not cover:
- Cold session starts. The cache is warm only after the first request that populates it. Session one is billed at full rate.
- Context that changes between turns. Modified files, new tool output, and incremental edits all fall outside the stable prefix window.
- Tool catalog tokens. The tools/list response is transmitted separately from the prompt prefix cache path.
Prompt caching is highest-impact for repetitive workloads with a stable system prompt and a fixed set of large files. Apply it first because it costs nothing extra. Then layer the remaining levers on top.
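As a rough illustration of why a stable prefix benefits so much, the sketch below compares the cost of re-sending the same prefix on every turn with and without cache hits. It is a simplification under stated assumptions: the first turn is billed at the full input rate, every later turn is a hit at 10%, and any cache-write surcharge or cache expiry between turns is ignored. The 20,000-token prefix and 30-turn session are hypothetical.

```python
INPUT_USD_PER_MTOK = 3.00    # input rate quoted in this guide
CACHE_HIT_MULTIPLIER = 0.10  # cache hits billed at 10% of the input rate


def prefix_cost_usd(prefix_tokens: int, turns: int, cached: bool) -> float:
    """Cost of re-sending one stable prefix on every turn of a session.

    Simplification: the first turn is billed at the full input rate and
    every later turn is a cache hit. Cache-write surcharges and cache
    expiry between turns are ignored.
    """
    if turns <= 0:
        return 0.0
    full = prefix_tokens * INPUT_USD_PER_MTOK / 1_000_000
    if not cached:
        return full * turns
    return full + full * CACHE_HIT_MULTIPLIER * (turns - 1)


# Hypothetical session: a 20,000-token stable prefix over 30 turns.
uncached = prefix_cost_usd(20_000, 30, cached=False)
cached = prefix_cost_usd(20_000, 30, cached=True)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.3f}")
```

The gap widens with session length, which is why the lever matters most for long, repetitive sessions and does nothing for a cold start.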
Lever 2: Trim the tool surface with ?profile=core
When Claude Code opens an MCP session, the server sends every tool it exposes as part of the tools/list handshake. For the gotcontext gateway, the full catalog is 142 tools at roughly 38,000 tokens. Adding ?profile=core to the MCP URL reduces that to 7 essential tools at ~2,000 tokens — a 95% manifest reduction at session start.
The 7 core tools cover the complete ingest-read-search-expand lifecycle: add a document, get its compressed skeleton, search within it, expand any region, check stats, list documents, and delete documents. That set serves the majority of agentic coding workflows without the full 142-tool catalog overhead.
To put the session-start saving in context: 38,000 tokens is 19% of a 200K context window and ~30% of a 128K window. Switching to core frees that share before your first real tool call. The full account of what that 19% figure means — and what it does not mean — is in the dedicated post.
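The percentages in this section follow directly from the token counts. A short Python sketch reproduces them; the helper names are illustrative, while the token counts and window sizes are the ones quoted above.

```python
FULL_CATALOG_TOKENS = 38_000  # full profile, from this guide
CORE_CATALOG_TOKENS = 2_000   # ?profile=core, from this guide


def window_share(tokens: int, window: int) -> float:
    """Fraction of a context window consumed by `tokens`."""
    return tokens / window


def reduction(before: int, after: int) -> float:
    """Relative reduction going from `before` to `after` tokens."""
    return 1 - after / before


for window in (200_000, 128_000):
    full = window_share(FULL_CATALOG_TOKENS, window)
    core = window_share(CORE_CATALOG_TOKENS, window)
    print(f"{window:>7,} window: full {full:.1%}, core {core:.1%}")

saved = reduction(FULL_CATALOG_TOKENS, CORE_CATALOG_TOKENS)
print(f"manifest reduction: {saved:.1%}")
```

The exact reduction works out to about 94.7%, which the guide rounds to 95%.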
If you need the full tool surface for a session, remove the parameter or set ?profile=full. Existing gc_ keys already in production continue working on the full profile unchanged.
Lever 3: Semantic compression via the gotcontext MCP gateway
Prompt caching handles repeated stable content. Tool-surface trimming handles session start. Semantic compression handles everything in between: the large files, verbose diffs, terminal output, and documentation that AI coding agents read on every turn.
The gotcontext MCP gateway compresses text before it enters the context window. The engine is not a summarizer — it produces a structured skeleton that preserves function signatures, type declarations, and key identifiers while stripping boilerplate and implementation bodies. For code review and navigation workloads, the skeleton conveys what the agent needs at a fraction of the token count.
Live rolling-average savings across all compression calls sit at up to ~60% token reduction (source: api.gotcontext.ai/v1/global-savings). That figure is a rolling average across document sizes and fidelity levels; results vary by content type and input size.
For codebases specifically, the compress_codebase tool produces an AST-aware digest: signatures and public interfaces only, bodies stripped. We ran it on our own 3,448-line MCP gateway — the result was a 50-file digest containing 1,408 ranked symbols in one call. That case study has the raw output and the cost math.
Stacked with Anthropic's native prompt caching on warm prefixes, the combined effect can reach up to ~95% input-cost reduction on large stable documents that repeat across turns.
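One way to see where a figure in that range comes from is to multiply the two effects. This is a back-of-the-envelope sketch, assuming the compressed text is served entirely from a warm cache and that both effects are independent; it is not a measurement.

```python
def stacked_reduction(compression_reduction: float,
                      cache_hit_multiplier: float = 0.10) -> float:
    """Combined input-cost reduction when compressed text is also a cache hit.

    The compressed text keeps (1 - compression_reduction) of the tokens,
    and each remaining token is billed at cache_hit_multiplier of the
    standard input rate.
    """
    remaining_tokens = 1 - compression_reduction
    return 1 - remaining_tokens * cache_hit_multiplier


# ~60% compression on a warm cache hit.
print(f"{stacked_reduction(0.60):.0%}")
# A weaker 50% compression still lands in the same range.
print(f"{stacked_reduction(0.50):.0%}")
```

Because cache hits already remove 90% of the cost, the compression ratio moves the combined figure by only a few points on warm turns; its larger effect is on cold turns and on content that never becomes a stable prefix.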
When compression helps (and when it does not)
Honest disclosure
- Small inputs have fixed overhead. The compression engine adds structure headers regardless of input size. On inputs under ~200 tokens, the compressed output can be larger than the original. The live Playground demonstrates this: a 58-token paste produces 67 compressed tokens. Use compression on inputs of 500 tokens or more for a positive return.
- ~60% is a rolling average, not a floor. Results depend on content type (code compresses differently from prose), input size (larger inputs compress better), and fidelity level. The no-signup Playground shows your actual ratio before you integrate.
- Compression changes what the model sees. Lower fidelity settings strip more content. For tasks where the full text matters — verbatim code generation, exact diff review — use higher fidelity or skip compression on that specific file.
- Output tokens are not compressed. The gateway compresses what you send in. What the model writes back is unchanged. At the 5× input-to-output rate multiplier, output optimization is a separate concern (token budgets, chain-of-thought controls).
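The caveats above suggest a simple gate in front of any compression call. The sketch below is one possible policy, not part of the gotcontext API: the function names and the `needs_verbatim` flag are hypothetical, and the 500-token threshold is the one recommended in this guide.

```python
MIN_TOKENS_FOR_COMPRESSION = 500  # threshold recommended in this guide


def should_compress(token_count: int, needs_verbatim: bool = False) -> bool:
    """Decide whether to route a piece of text through compression.

    Skips small inputs, where fixed header overhead can outweigh the
    saving, and anything the task needs verbatim (exact diffs, code
    that will be copied).
    """
    if needs_verbatim:
        return False
    return token_count >= MIN_TOKENS_FOR_COMPRESSION


def size_change(original_tokens: int, compressed_tokens: int) -> float:
    """Relative size change; a positive value means the output grew."""
    return (compressed_tokens - original_tokens) / original_tokens


# The Playground example from this guide: 58 tokens in, 67 tokens out.
print(f"{size_change(58, 67):+.1%}")
print(should_compress(58), should_compress(3_000))
```

Token counts here are whatever your tokenizer or the Playground reports; the gate only needs an estimate, not an exact count.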
Step-by-step: add the gotcontext MCP server
- Get a free API key. Sign up at gotcontext.ai and mint a gc_ key from the dashboard. The Free tier includes 1,000 compressions per month with no credit card.
- Add the MCP server to your config. For Claude Code, edit ~/.claude/mcp.json (or claude_desktop_config.json for Claude Desktop). The same URL works for Cursor, Codex CLI, and any other MCP-capable client.
```json
{
  "mcpServers": {
    "gotcontext": {
      "type": "http",
      "url": "https://api.gotcontext.ai/mcp?profile=core",
      "headers": {
        "Authorization": "Bearer ${GOTCONTEXT_API_KEY}"
      }
    }
  }
}
```

Set GOTCONTEXT_API_KEY in your shell environment or paste your gc_ key directly into the header value. The ?profile=core suffix activates the 7-tool surface; remove it or replace it with ?profile=full for the complete catalog.
- Restart your client. Claude Code picks up the new MCP server config on restart and runs the tools/list handshake automatically.
- Compress large files before they enter context. The agent calls gotcontext tools directly in-session, or you can compress via the REST endpoint before pasting content:
```shell
curl -X POST https://api.gotcontext.ai/v1/compress \
  -H "Authorization: Bearer gc_<your-key>" \
  -H "Content-Type: application/json" \
  -d '{"text": "<large file content here>", "fidelity": "balanced"}'
```

Full setup documentation, including Cursor, Codex CLI, and Gemini CLI configs, is at /docs#mcp-server.
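If you would rather script the REST call than paste into curl, the same request can be built with Python's standard library. This is a minimal sketch: the endpoint, headers, and the `text` and `fidelity` fields are taken from the curl example, while the function names are illustrative. The response schema is not described in this guide, so the sketch returns the parsed JSON as-is and assumes it is an object.

```python
import json
import os
import urllib.request

# Endpoint taken from the curl example in this guide.
COMPRESS_URL = "https://api.gotcontext.ai/v1/compress"


def build_compress_request(text: str, api_key: str,
                           fidelity: str = "balanced") -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"text": text, "fidelity": fidelity}).encode("utf-8")
    return urllib.request.Request(
        COMPRESS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def compress(text: str, fidelity: str = "balanced") -> dict:
    """Send the request and return the parsed JSON response.

    No field names are assumed; inspect the returned dictionary to see
    what the service actually sends back.
    """
    api_key = os.environ["GOTCONTEXT_API_KEY"]  # raises KeyError if unset
    request = build_compress_request(text, api_key, fidelity)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it easy to log or inspect what will be sent before spending one of your monthly compressions.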
Before and after: what the numbers look like
Rather than invent a scenario, here are the numbers from our own production dogfood loop — we use gotcontext on the codebase that runs this site:
| Scenario | Without gotcontext | With gotcontext |
|---|---|---|
| MCP session start (full catalog) | ~38,000 tokens (~$0.11 at $3/M) | ~2,000 tokens (~$0.006) with ?profile=core |
| 3,448-line MCP gateway file | ~38K tokens verbatim | 50-file AST digest, 1,408 symbols (case study) |
| Rolling avg, all compression calls | Baseline | Up to ~60% reduction (live: /v1/global-savings) |
| Compression + prompt cache (warm prefix) | Baseline | Up to ~95% input-cost reduction |
The session-start and file-compression rows are from real instrumented calls. The rolling-average figure comes from the live /v1/global-savings endpoint and updates continuously. The stacked 95% figure applies specifically to large stable documents with warm cache hits; smaller or more volatile content will see a lower combined figure.
TL;DR
Three levers, in order of implementation effort:
- Prompt caching: zero setup, handles stable prefixes automatically. Apply first.
- ?profile=core: one URL change, cuts session-start context ~95% (38K → 2K tokens).
- Semantic compression via gotcontext MCP: compresses files, diffs, and terminal output before they reach the context window. Up to ~60% rolling average per call; up to ~95% combined with prompt caching on warm prefixes.
Start for free — 1,000 compressions/month, no card required.