MCP Token Costs: A Quantitative Breakdown
A 142-tool MCP catalog dumps roughly 38 K tokens into every session start. At $3/M input and $15/M output, here is where the bill compounds — and what changes when you apply prompt caching and input compression.
The 38 K-token session tax
Every MCP session starts with a tools/list handshake. On gotcontext.ai's production gateway (currently 142 tools at the full profile), that single exchange dumps approximately 38 K tokens of tool descriptions into the model's context window before a single line of user work begins.
The figure is not a rough estimate. It comes from the docs/news/pillar-1-evidence-summary.json evidence artifact: 1,408 symbols across 175 files, rendered as an MCP tool schema, land at 38 K tokens. That measurement drove the v1.23.18 ?profile=core routing feature described below.
The session tax compounds through the conversation. Each time the model needs to reference a tool it has not yet used, the full description re-enters the working window. On a long coding session touching 20+ files, the cumulative input from tool manifests alone can reach 200 K–300 K tokens.
What the models charge
All rates below come directly from token-saver-5000/src/provider_profiles.py, which is the authoritative source delegated to by api/app/services/model_pricing.py:
# token-saver-5000/src/provider_profiles.py (excerpt)
#
# input_cost_per_million / output_cost_per_million / cached_input_cost_per_million
"claude-sonnet-4.6": $3.00 / $15.00 / $0.30 (10% cache-read multiplier)
"claude-opus-4.7": $15.00 / $75.00 / $1.50 (10% cache-read multiplier)
"gpt-5.5": $5.00 / $30.00 / $0.50 (10% cache-read multiplier)
"gemini-3.1-pro": $1.25 / $10.00 / $0.125 (10% cache-read multiplier)
"gemini-3.1-flash": $0.30 / $2.50 / $0.030 (10% cache-read multiplier)
# Source: provider_profiles.py via api/app/services/model_pricing.py
# _CACHE_READ_MULTIPLIER = {"claude-sonnet-4.6": 0.10, "gpt-5.5": 0.10, "gemini-3-pro": 0.10}
The cache-read multiplier of 0.10 (10%) applies across Anthropic, OpenAI GPT-5.x, and Google Gemini 2.5+/3.x families. Cache writes carry a surcharge on the base input rate: 1.25× for a 5-minute cache lifetime and 2.00× for a 1-hour cache lifetime (_CACHE_WRITE_MULTIPLIER_5M and _CACHE_WRITE_MULTIPLIER_1H in model_pricing.py).
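The rates and multipliers above are enough to build a small cost helper. The following is a minimal sketch, not the contents of provider_profiles.py or model_pricing.py: the `Rates` dataclass, the `RATES` dict, and both functions are hypothetical names introduced for illustration, and only the dollar figures and multipliers are taken from the excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Per-million-token prices in USD (hypothetical container)."""
    input_per_million: float
    output_per_million: float
    cached_input_per_million: float


# Figures copied from the excerpt above; the structure is an assumption.
RATES = {
    "claude-sonnet-4.6": Rates(3.00, 15.00, 0.30),
    "claude-opus-4.7": Rates(15.00, 75.00, 1.50),
    "gpt-5.5": Rates(5.00, 30.00, 0.50),
    "gemini-3.1-pro": Rates(1.25, 10.00, 0.125),
    "gemini-3.1-flash": Rates(0.30, 2.50, 0.030),
}

CACHE_WRITE_MULTIPLIER_5M = 1.25
CACHE_WRITE_MULTIPLIER_1H = 2.00


def input_cost(model: str, tokens: int, *, cached: bool = False) -> float:
    """USD cost of `tokens` input tokens at the cached or uncached rate."""
    rates = RATES[model]
    per_million = rates.cached_input_per_million if cached else rates.input_per_million
    return tokens * per_million / 1_000_000


def cache_write_cost(model: str, tokens: int,
                     multiplier: float = CACHE_WRITE_MULTIPLIER_5M) -> float:
    """USD cost of writing `tokens` into the cache (base rate times surcharge)."""
    return input_cost(model, tokens) * multiplier


for name in RATES:
    cold = input_cost(name, 38_000)
    warm = input_cost(name, 38_000, cached=True)
    print(f"{name}: cold ${cold:.4f}, warm ${warm:.4f}")
```

Swapping the model key is enough to rerun the 38 K-token manifest arithmetic against any of the five listed rate cards.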
Anatomy of a session
A typical Claude Code session on a medium-sized codebase passes through several token-cost layers. The table below models a single session against Sonnet 4.6 rates. Output tokens are excluded; see the honest disclosure section:
| Event | Input tokens | Cost at $3/M | Notes |
|---|---|---|---|
| tools/list full profile | 38,000 | $0.114 | Cold start, no cache |
| tools/list core profile | 2,000 | $0.006 | 7 tools, v1.23.18+ |
| 10 file reads (3 K tokens avg) | 30,000 | $0.090 | Pre-compression |
| Session total (full, uncompressed) | 68,000 | $0.204 | Input only |
| Session total (core + 63% compression) | ~13,100 | $0.039 | Input only |
Output token count excluded from this table. See disclosure below.
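The two session totals can be recomputed from the individual rows. This is a minimal sketch under the table's own assumptions (10 file reads at 3 K tokens each, Sonnet 4.6 input pricing, a 63% reduction applied to file reads only); the function names are hypothetical.

```python
INPUT_PER_MILLION = 3.00  # Sonnet 4.6 input rate, USD per million tokens


def session_input_tokens(manifest_tokens: int, file_reads: int,
                         avg_file_tokens: int, reduction: float = 0.0) -> int:
    """Manifest tokens plus file-read tokens after an optional compression reduction."""
    file_tokens = file_reads * avg_file_tokens * (1.0 - reduction)
    return round(manifest_tokens + file_tokens)


def cost_usd(tokens: int) -> float:
    """Input-only cost in USD; output tokens are not modeled."""
    return tokens * INPUT_PER_MILLION / 1_000_000


full = session_input_tokens(38_000, 10, 3_000)
core = session_input_tokens(2_000, 10, 3_000, reduction=0.63)

print(f"full profile, uncompressed: {full:,} tokens, ${cost_usd(full):.3f}")
print(f"core profile, 63% compression: {core:,} tokens, ${cost_usd(core):.3f}")
```

Changing `file_reads` or `avg_file_tokens` is the quickest way to see how sensitive the totals are to the modeled file-read volume, which is an assumption rather than a measurement.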
Prompt caching: real numbers
Prompt caching reduces the cost of repeated input tokens to 10% of their uncached rate across the three major providers we track (Anthropic claude-sonnet-4.6, OpenAI gpt-5.5, Google gemini-3.1-pro-preview). The source is api/app/services/model_pricing.py::_CACHE_READ_MULTIPLIER.
For the 38 K-token tools/list response on Sonnet 4.6: the first call costs $0.114. Every subsequent call on a warm cache costs $0.0114, a saving of about $0.10 per turn. Across a 10-turn session (one cold call followed by nine warm reads), the cache saves roughly $0.90 on the manifest alone.
The write surcharge matters for the first call: writing 38 K tokens within 5 minutes costs 1.25× the base rate ($0.143 total). The break-even against cached reads occurs at turn 2.
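The break-even claim can be checked by comparing cumulative manifest cost with and without caching. This is a minimal sketch that assumes the cache stays warm for every turn after the first and that the 5-minute write surcharge is paid once; the function names are hypothetical.

```python
MANIFEST_TOKENS = 38_000
INPUT_PER_MILLION = 3.00          # Sonnet 4.6 input rate
CACHE_READ_MULTIPLIER = 0.10
CACHE_WRITE_MULTIPLIER_5M = 1.25

BASE_COST = MANIFEST_TOKENS * INPUT_PER_MILLION / 1_000_000  # one uncached send


def uncached_cost(turns: int) -> float:
    """Cumulative manifest cost when every turn pays the full input rate."""
    return turns * BASE_COST


def cached_cost(turns: int) -> float:
    """One cache write on turn 1, then cache reads on every later turn."""
    if turns <= 0:
        return 0.0
    write = BASE_COST * CACHE_WRITE_MULTIPLIER_5M
    reads = (turns - 1) * BASE_COST * CACHE_READ_MULTIPLIER
    return write + reads


def break_even_turn(max_turns: int = 100):
    """First turn at which caching is cheaper in total, or None if never."""
    for turn in range(1, max_turns + 1):
        if cached_cost(turn) < uncached_cost(turn):
            return turn
    return None


print("break-even turn:", break_even_turn())
print(f"10-turn saving: ${uncached_cost(10) - cached_cost(10):.4f}")
```

With the write surcharge included, the net 10-turn saving on the manifest comes out slightly under $0.90, which is why the surcharge matters only on short sessions.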
Input compression stacks with caching
Prompt caching and input compression are not competing strategies: caching discounts tokens you send, compression eliminates tokens before you send them, and they stack multiplicatively.
A file that compresses from 5,000 tokens to 1,850 tokens (63% reduction, matching the rolling average from GET /v1/global-savings) writes 1,850 tokens into the cache instead of 5,000. Every subsequent cache read costs 10% of that already-reduced count.
Combined multiplier on a warm-cache compressed file read: 0.37 (fraction of tokens left after a 63% reduction) × 0.10 (cache-read multiplier) = 0.037. You pay 3.7 cents for every dollar the same reads would cost uncached and uncompressed.
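The stacking is easiest to see by pricing the same 5,000-token file read under all four combinations. This is a minimal sketch using the Sonnet 4.6 input rate and the figures quoted above; `read_cost` is a hypothetical helper, not part of any gotcontext.ai API.

```python
INPUT_PER_MILLION = 3.00       # Sonnet 4.6 input rate
COMPRESSION_REDUCTION = 0.63   # fraction of tokens removed by compression
CACHE_READ_MULTIPLIER = 0.10   # warm-cache price as a fraction of the base rate


def read_cost(tokens: int, *, compressed: bool, warm_cache: bool) -> float:
    """USD cost of one file read under the chosen combination of savings."""
    remaining = tokens * (1.0 - COMPRESSION_REDUCTION) if compressed else tokens
    rate = INPUT_PER_MILLION * (CACHE_READ_MULTIPLIER if warm_cache else 1.0)
    return remaining * rate / 1_000_000


baseline = read_cost(5_000, compressed=False, warm_cache=False)
for compressed in (False, True):
    for warm in (False, True):
        cost = read_cost(5_000, compressed=compressed, warm_cache=warm)
        print(f"compressed={compressed!s:5} warm={warm!s:5} "
              f"${cost:.6f} ({cost / baseline:.3f} of baseline)")
```

The four rows make the multiplicative relationship explicit: compression alone leaves 0.37 of the baseline, caching alone leaves 0.10, and the two together leave their product.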
The ?profile=core router (v1.23.18, api/app/mcp_gateway.py:3005) applies the same principle to the tool manifest itself. Reducing the catalog from 142 to 7 tools cuts the manifest from roughly 38 K tokens to roughly 2 K for developers who only need the core compression tools:
# Full profile: 142 tools, ~38 K tokens
curl -X POST https://api.gotcontext.ai/mcp \
-H "Authorization: Bearer gc_<key>"
# Core profile: 7 tools, ~2 K tokens (opt-in, v1.23.18+)
curl -X POST "https://api.gotcontext.ai/mcp?profile=core" \
-H "Authorization: Bearer gc_<key>"
Pro tier math
A Pro subscription costs $49/mo (pricing page). The model below uses 30 K input tokens per session on Sonnet 4.6 ($3/M). Output tokens are excluded.
| Sessions/mo | Input tokens/mo | API cost (raw) | API cost (63% compression) | vs. $49 Pro |
|---|---|---|---|---|
| 8 (2/week) | 240 K | $0.72 | $0.27 | Input savings alone do not justify Pro |
| 30 (1/day) | 900 K | $2.70 | $1.00 | Input savings alone do not justify Pro |
| 150 (5/day) | 4.5 M | $13.50 | $5.00 | Input savings alone do not justify Pro |
| 300+ (10+/day) | 9 M+ | $27.00+ | $9.99+ | Input savings (about $17) still fall short of $49; Pro tools add further value |
Model assumes 30 K input tokens/session on Sonnet 4.6 ($3/M). Output excluded. Pro value is not purely token-cost arbitrage: the plan also includes higher rate limits and Pro-gated tools (gc_blast_radius, gc_compress_manifest, compress_codebase).
The honest read: under this model, token-cost savings on input alone do not cover the $49/mo subscription at any volume in the table. Even at 10 sessions daily, the saving is about $17/mo. The case for Pro rests on productivity gains from Pro-gated tools and higher throughput limits, not on raw token savings.
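Under the table's assumptions, the session volume at which input savings alone would match the subscription price can be computed directly. This is a minimal sketch; it inherits every assumption of the model (30 K input tokens per session, Sonnet 4.6 at $3/M, a 63% reduction, no caching, no output tokens), and the function names are hypothetical.

```python
import math

PRO_PRICE_USD = 49.00
TOKENS_PER_SESSION = 30_000
INPUT_PER_MILLION = 3.00   # Sonnet 4.6 input rate
REDUCTION = 0.63           # fraction of input tokens removed by compression


def raw_monthly_cost(sessions: int) -> float:
    """Uncompressed input cost in USD for a month of sessions."""
    return sessions * TOKENS_PER_SESSION * INPUT_PER_MILLION / 1_000_000


def monthly_input_saving(sessions: int) -> float:
    """USD saved per month by compressing input, ignoring caching and output."""
    return raw_monthly_cost(sessions) * REDUCTION


def break_even_sessions() -> int:
    """Smallest monthly session count whose input saving reaches the Pro price."""
    return math.ceil(PRO_PRICE_USD / monthly_input_saving(1))


for sessions in (8, 30, 150, 300):
    print(f"{sessions:>3} sessions/mo: raw ${raw_monthly_cost(sessions):.2f}, "
          f"saving ${monthly_input_saving(sessions):.2f}")
print("break-even sessions/mo:", break_even_sessions())
```

Heavier sessions or a pricier model move the break-even point proportionally, so rerunning this with your own `TOKENS_PER_SESSION` and rate is more informative than the defaults.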
Live compression ratio
The 63% figure used throughout this post comes from the public GET /v1/global-savings endpoint (no auth required). It returns a rolling 30-day average of the compression ratio observed across all active documents:
curl -s https://api.gotcontext.ai/v1/global-savings | jq .
{
"rolling_30d_avg_ratio": 0.63,
"total_tokens_saved": 4821903,
"cache_hit_savings_usd": 2311.44,
"period_days": 30
}
A rolling_30d_avg_ratio of 0.63 means compression removes on average 63% of the original token count, so compressed output is about 37% of the original size. This is the 63% reduction applied in the tables above. The ratio varies by document type: dense code compresses less aggressively than prose documentation.
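The live ratio can be plugged into your own estimates instead of the hard-coded 63%. This is a minimal sketch using only the Python standard library. It assumes the response has the shape shown above and treats `rolling_30d_avg_ratio` as the fraction of tokens removed, which is how the tables in this post apply the 63% figure. `fetch_global_savings` and `tokens_after_compression` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

GLOBAL_SAVINGS_URL = "https://api.gotcontext.ai/v1/global-savings"

# Sample payload copied from the response shown above.
SAMPLE_PAYLOAD = json.loads("""
{
  "rolling_30d_avg_ratio": 0.63,
  "total_tokens_saved": 4821903,
  "cache_hit_savings_usd": 2311.44,
  "period_days": 30
}
""")


def fetch_global_savings(url: str = GLOBAL_SAVINGS_URL, timeout: float = 10.0) -> dict:
    """Fetch the public savings payload (no auth header is sent)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def tokens_after_compression(original_tokens: int, payload: dict) -> int:
    """Estimate remaining tokens, reading the ratio as the fraction removed."""
    ratio = float(payload["rolling_30d_avg_ratio"])
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"unexpected ratio: {ratio}")
    return round(original_tokens * (1.0 - ratio))


# Offline example; call fetch_global_savings() to use live data instead.
print(tokens_after_compression(5_000, SAMPLE_PAYLOAD))
```

The range check is there because a rolling average is the kind of field whose meaning or scale could change between releases, and a silent misread would skew every downstream estimate.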
Honest disclosure
What this post does not know
- Per-user session counts at scale. The session-count tiers in the Pro math table are plausible assumptions, not measured from production telemetry. We have no data on actual developer session frequency at this stage.
- Output token cost. Output tokens are excluded from every table in this post. Claude Code output varies significantly by task type, and we have no per-session output token measurements to cite.
- Per-instance cache-hit rate. The 10% multiplier applies when the cache is warm. How often a given developer's session actually hits a warm cache depends on conversation length and provider-specific TTLs. We report the published multiplier, not a measured hit rate.
- Claude Code session token breakdowns. Anthropic does not publish tool-manifest vs. file-read vs. output token breakdowns for Claude Code sessions. The anatomy table uses the 38 K manifest measurement from our own gateway combined with modeled assumptions for everything else.
The token economics of MCP are not opaque. The rates are published, the tool manifest size is measurable, and the caching multipliers are documented in model_pricing.py. This post puts those numbers in one place, with their sources cited, so you can run the same arithmetic against your own session volume.