Same Prompt, Different Token Count: The Hidden Cost of Switching Providers

You benchmarked GPT-4o. The cost looked right. You switched to Claude for the longer context window. Your bill went up 20%.

You didn't change your prompts. You didn't change your usage patterns. The only thing that changed was the tokenizer, and tokenizers disagree, sometimes substantially, on how many tokens the same text is worth.

Three Tokenizers, Three Different Answers ¶

The major frontier models use different tokenization schemes:

OpenAI (GPT-4o, GPT-4 Turbo) uses Byte Pair Encoding (BPE) with a 200,000-token vocabulary, implemented via the tiktoken library. BPE builds its vocabulary by iteratively merging the most frequent byte pairs in the training corpus.

Google (Gemini 1.5 Pro, Gemini 2.0) uses SentencePiece with a 262,144-token vocabulary. SentencePiece operates directly on raw text without pre-tokenization, using unigram language model or BPE internally, with vocabulary construction optimized differently from tiktoken.

Anthropic (Claude 3.5, Claude 4) uses a BPE variant with its own vocabulary, sized and trained on Anthropic's corpus mix.

These aren't minor implementation differences. The vocabulary sizes, merge rules, and training corpora produce meaningfully different token counts for the same input text.

Where the Gaps Are Largest ¶

Tokenizer disagreement is not uniform across content types. English prose from typical web text is where all three tokenizers were most heavily optimized. Gaps are smallest here, typically 2 to 5%.

The gaps grow significantly in three areas:

Code and technical content. Code uses a lot of tokens that appear rarely in natural language: indentation, brackets, underscores, camelCase identifiers, and language-specific syntax. A Python function with heavy use of list comprehensions and f-strings will tokenize very differently across providers. Gaps of 15 to 25% on code-heavy prompts are common.

Non-English text. BPE vocabularies trained primarily on English text handle other languages less efficiently. A 500-character Japanese paragraph might use 80 tokens under one tokenizer and 140 under another. For multilingual applications, tokenizer choice can double your effective cost.

Structured data formats. JSON, XML, YAML, and Markdown all contain repeated structural characters that tokenizers handle differently. A large JSON payload with many short string values will produce very different token counts across providers.

Checking Token Counts Before You Commit ¶

Each provider exposes tokenizer access:

``python # OpenAI / tiktoken import tiktoken enc = tiktoken.encoding_for_model("gpt-4o") tokens = enc.encode(your_text) print(len(tokens))`

`python # Anthropic / count_tokens API method import anthropic client = anthropic.Anthropic() response = client.messages.count_tokens( model="claude-opus-4-7-20251101", messages=[{"role": "user", "content": your_text}] ) print(response.input_tokens)`

For Google's Gemini, the countTokens` method in the Gemini API provides equivalent functionality.

The important discipline: benchmark your actual prompts through each provider's tokenizer before committing to a cost model. Generic token-per-word estimates (the common shortcut of "1 token ≈ 4 characters") obscure provider-specific variation that can swing your budget by 20% or more on technical workloads.

The Compression Lens ¶

Tokenizer differences matter most when you're managing context at scale: long system prompts, large codebase injections, multi-turn agent sessions. This is where compression earns its value.

A 15× compression of a long codebase context doesn't just reduce tokens proportionally. It also reduces the surface area where tokenizer inefficiency compounds. A 100,000-token JSON payload compressed to 6,700 tokens through semantic compression doesn't just cost 15× less; it also eliminates 93,300 tokens worth of tokenizer-specific overhead.

For teams running on multiple providers simultaneously (common for cost arbitrage or reliability), compression normalizes the token footprint before it reaches the provider's tokenizer. The compressed representation uses fewer total tokens on every provider, and the relative gaps between providers shrink because compressed text has a higher information density per character.

Practical Recommendations ¶

If you're switching providers or running a cost comparison:

Run your top 10 most common prompts through each provider's tokenizer before finalizing cost models. Don't use character counts or word counts as proxies.

Weight by content type. If 60% of your token spend is on code-related prompts, the tokenizer gap will be at the high end. If it's primarily English prose, it will be at the low end.

Re-check after prompt changes. Prompt engineering changes that reduce token count under one tokenizer may not reduce it proportionally under another.

Use compression on high-volume context injections. System prompts, codebase context, and long document injections are where tokenizer-specific overhead compounds most aggressively. Compressing these before they reach the API reduces both the base cost and the provider-specific variance.

The tokenizer is part of the pricing mechanism. It deserves the same scrutiny as the per-token rate.

Cut your token footprint across every provider with gotcontext.ai →

Try it on your own context

You just read the writeup. Now run the thing. Paste a doc or some verbose tool output and watch it shrink — free, no signup.

Your text

# Service Operations Runbook: Payments API

## Purpose and scope

This runbook covers the payments-api service: what it does, how it is deployed, what its dependencies are, and what to do when it misbehaves. It is written for the on-call engineer. Every procedure here assumes you have production read access and the ability to trigger a deploy through the standard pipeline. Nothing in this document requires direct database write access, and no procedure here should be improvised under pressure: if the situation is not covered, page the service owner rather than inventing a fix at 3am.

The payments-api accepts charge requests from the checkout frontend, validates them against the pricing catalog, forwards them to the payment processor, and records the outcome in the orders database. It is the only service permitted to talk to the processor. Average traffic is steady during business hours with a daily peak around 19:00 UTC and a weekly peak on Friday evenings.

## Architecture and dependencies

The service runs as three replicas behind the regional load balancer. Each replica is stateless; all persistent state lives in the orders database and the idempotency-key store. The service depends on four things: the orders database (primary and one read replica), the idempotency-key store, the pricing catalog service, and the external payment processor. Of these, only the processor is outside our control.

Dependency failure behavior is deliberate and asymmetric. If the pricing catalog is unreachable, the service serves prices from its local cache for up to ten minutes and emits a degraded-mode metric. If the idempotency store is unreachable, the service refuses new charges entirely, because accepting a charge without idempotency protection risks double-billing, and double-billing is strictly worse than downtime. If the processor times out, the charge is recorded as pending and a reconciliation job resolves it within the hour.

## Deployment

Deploys go through the standard pipeline: merge to main, automated tests, staging deploy, a thirty-minute soak with synthetic checkout traffic, then production rollout one replica at a time. The pipeline aborts automatically if the error rate on the new replica exceeds the old baseline. A full rollout takes about twenty minutes. Rollback is the same pipeline in reverse and takes about six minutes; the on-call engineer can trigger it without approval.

## Monitoring and alerts

Three alerts page the on-call engineer. High charge failure rate fires when more than two percent of charge attempts fail over five minutes; the usual causes are a processor incident or a bad deploy, in that order. Idempotency store unavailable fires immediately on connection failure. Reconciliation backlog fires when pending charges older than ninety minutes accumulate, which usually means the reconciliation job is stuck rather than the processor being slow.

2,912/12,000 chars

Compressed

Compressed text will appear here…

Cite this¶

Researchers, analysts, or journalists referencing this post can use either format below — both are copyable.

BibTeXbibtex

@misc{tokenizer-differences-provider-cost-reality-2026,
  title  = {Same Prompt, Different Token Count: The Hidden Cost of Switching Providers},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://gotcontext.ai/blog/tokenizer-differences-provider-cost-reality},
  note   = {gotcontext.ai engineering blog.},
}

APAtext

James Hollingsworth. (2026, May 8). Same Prompt, Different Token Count: The Hidden Cost of Switching Providers. gotcontext.ai. Retrieved from https://gotcontext.ai/blog/tokenizer-differences-provider-cost-reality.

Contribute¶

Suggest an edit

Spotted a typo, a stale benchmark, or a missing nuance? Open a GitHub issue.

Discuss this post

Counterexamples, follow-up questions, and adjacent research welcome.

Email us

Bigger story? Hit us directly at hello@gotcontext.ai.

← Back to all posts