Speculative Decoding: The Throughput Trick That Changes What Fast Means
A small draft model proposes tokens. The large target model verifies them in parallel. Result: 2–3x throughput with a provably identical output distribution. Here's what that means for your API usage.
Language models generate tokens one at a time. That's not a performance choice. It's a fundamental property of autoregressive generation. Each token depends on every token before it. You can't parallelize it.
Except you can, with a workaround that feels like it shouldn't work.
Speculative decoding, introduced in a 2022 paper by Leviathan, Kalman, and Matias at Google Research (arXiv:2211.17192), demonstrated a 2–3× speedup on T5-XXL without changing the model, without retraining it, and without changing the output distribution. The target model does not do less arithmetic overall. It does the same kind of work in fewer, wider forward passes.
The paper's result: the output distribution is provably identical to standard autoregressive decoding. Not approximately the same. Identical. The trick is purely about how you use the compute.
How Speculative Decoding Works
The key insight is that verifying a sequence of tokens in parallel is cheaper than generating them serially.
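Here's the setup: a small, fast draft model proposes the next K tokens one at a time. The target model then scores all K positions in a single forward pass and accepts the drafted tokens left to right until the first one it rejects, where it substitutes a token of its own.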
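On average, a meaningful fraction of the draft model's tokens is accepted. Each accepted token costs one cheap draft-model pass plus a share of a single target-model pass. That target pass evaluates all K drafted tokens in parallel, which is much cheaper in wall-clock time than K sequential forward passes, because decoding is usually limited by memory bandwidth rather than arithmetic.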
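The accounting above can be made concrete with the closed forms from Leviathan et al. What follows is a minimal sketch, not the paper's code. It assumes every draft token is accepted independently with the same probability `alpha`, that `k` tokens are drafted per step, and that one draft pass costs a fraction `c` of one target pass. The function names are illustrative.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha. The step always yields at least one token, because
    the target supplies a replacement (or bonus) token itself.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_walltime_speedup(alpha: float, k: int, c: float) -> float:
    """Expected wall-clock speedup over plain autoregressive decoding.

    c is the assumed cost of one draft pass relative to one target pass,
    so one speculative step costs k * c + 1 target-pass equivalents.
    """
    return expected_tokens_per_step(alpha, k) / (k * c + 1.0)
```

With `alpha = 0.8`, `k = 5`, and `c = 0.05`, this gives roughly 3.7 tokens per step and a speedup a little under 3×. With `alpha = 0`, the speedup drops below 1: the draft passes are pure overhead.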
The critical property: the acceptance/rejection algorithm is designed so that the final output distribution is identical to running the target model alone. If every draft token gets rejected, you still get one token per target pass, as in standard generation, with no quality loss (you only pay the wasted draft compute). If most get accepted, you get significant throughput gains with no change in output quality.
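The accept/reject rule is short enough to write down. The sketch below follows the sampling scheme described in Leviathan et al., but everything around it is an assumption for illustration: `draft_dist` and `target_dist` are hypothetical callables that map a token prefix to a probability list over a toy vocabulary, and the target is called in a loop where a real system would run one batched forward pass.

```python
import random
from typing import Callable, List, Sequence

Dist = Callable[[List[int]], Sequence[float]]  # prefix -> probabilities


def sample(probs: Sequence[float], rng: random.Random) -> int:
    """Draw one token id from a (possibly unnormalized) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def speculative_step(prefix: List[int], draft_dist: Dist, target_dist: Dist,
                     k: int, rng: random.Random) -> List[int]:
    """Run one draft-then-verify step and return the tokens it emits."""
    # 1. The draft model proposes k tokens autoregressively.
    drafted: List[int] = []
    q: List[Sequence[float]] = []
    for _ in range(k):
        probs = draft_dist(prefix + drafted)
        q.append(probs)
        drafted.append(sample(probs, rng))

    # 2. The target scores all k + 1 positions (one batched pass in practice).
    p = [target_dist(prefix + drafted[:i]) for i in range(k + 1)]

    # 3. Accept left to right with probability min(1, p / q).
    out: List[int] = []
    for i, tok in enumerate(drafted):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            out.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p[i], q[i])]
        out.append(sample(residual, rng))
        return out

    # Every draft token was accepted: take a bonus token from the target.
    out.append(sample(p[k], rng))
    return out
```

The residual resampling is what keeps the output distribution exact: the probability of emitting token x at a position is min(p, q) from the accept branch plus max(0, p - q) from the reject branch, which sums to p.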
Medusa: Speculative Decoding Without a Draft Model
The original speculative decoding setup requires maintaining two separate models. For production systems, this means a second model to choose or train, deploy, and keep in memory alongside the target.
A 2024 paper, Medusa (arXiv:2401.10774), eliminated the separate draft model by adding multiple decoding heads directly to the target model.
Standard language models have one head: the language model head that converts the final hidden state to a probability distribution over the vocabulary. Medusa adds 5 additional heads, each predicting a token further in the future: the first extra head predicts the token after next, the second the one after that, and so on.
All 5 heads run in parallel on every forward pass. In Medusa-1, the base model's parameters are frozen and only the new heads are trained, which takes a fraction of the time of training the base model.
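To make the head structure concrete, here is a toy sketch in pure Python. It assumes the head form described in the Medusa paper (a single residual feed-forward layer followed by a vocabulary projection). The dimensions, random weights, and the names `MedusaHead` and `propose_tokens` are illustrative, and the greedy top-1 pick stands in for the paper's richer candidate tree.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def silu(v: Vector) -> Vector:
    return [x / (1.0 + math.exp(-x)) for x in v]


def softmax(v: Vector) -> Vector:
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    # Arbitrary toy weights; a trained head would load learned parameters.
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class MedusaHead:
    """One extra head: softmax(W2 @ (SiLU(W1 @ h) + h))."""

    def __init__(self, hidden: int, vocab: int, rng: random.Random) -> None:
        self.w1 = random_matrix(hidden, hidden, rng)
        self.w2 = random_matrix(vocab, hidden, rng)

    def __call__(self, h: Vector) -> Vector:
        inner = [a + b for a, b in zip(silu(matvec(self.w1, h)), h)]
        return softmax(matvec(self.w2, inner))


def propose_tokens(h: Vector, lm_head: Matrix,
                   heads: List[MedusaHead]) -> List[int]:
    """Greedy guess for the next token plus one guess per extra head.

    All heads read the same final hidden state h, so a real implementation
    evaluates them together in one forward pass.
    """
    dists = [softmax(matvec(lm_head, h))] + [head(h) for head in heads]
    return [max(range(len(d)), key=d.__getitem__) for d in dists]
```

The point of the sketch is the data flow: one hidden state in, several future-token guesses out, with no second model involved.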
Medusa-1 (heads trained in isolation, added to a frozen base): achieved 2.18–2.33× throughput across standard benchmarks.
Medusa-2 (heads trained jointly with fine-tuning of the base): achieved 2.83× throughput on 7B and 13B models.
On specific task types, the gains are higher. For extraction tasks (generating outputs that closely follow a fixed template) Medusa reached up to 3.62×. On Vicuna-7B specifically, throughput went from 37 tokens/second to 107 tokens/second using Medusa-2.
The acceptance behavior: on average, 3.01–3.51 tokens are accepted per decoding step across the benchmarks in the paper.
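These figures can be cross-checked with simple arithmetic. A baseline decoder emits one token per step, so speedup equals tokens per step divided by how much slower each step becomes. The sketch below uses only the numbers quoted above. Pairing the 3.01–3.51 acceptance range with the 2.83× figure is an assumption made for illustration, not a result reported in this post.

```python
def speedup(baseline_tps: float, accelerated_tps: float) -> float:
    """Throughput ratio between two tokens-per-second measurements."""
    return accelerated_tps / baseline_tps


def implied_step_overhead(tokens_per_step: float, observed_speedup: float) -> float:
    """How much slower one decoding step must be to explain the gap.

    Baseline decoding emits 1 token per step, so
    speedup = tokens_per_step / step_overhead.
    """
    return tokens_per_step / observed_speedup


vicuna_speedup = speedup(37.0, 107.0)            # about 2.89
overhead_low = implied_step_overhead(3.01, 2.83)  # about 1.06
overhead_high = implied_step_overhead(3.51, 2.83)  # about 1.24
```

Under that assumption, each Medusa step would be roughly 6–24% slower than a plain decoding step, which is why 3+ accepted tokens per step translates into a bit less than 3× throughput.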
What This Means for API Consumers
You don't implement speculative decoding yourself when calling Claude or GPT-4o via API. These providers run their own inference infrastructure, and whether speculative decoding is active is invisible to the API consumer.
But you experience the effects:
Throughput is not constant across output types. Speculative decoding works best when the output is predictable, when the draft model guesses right frequently. Outputs with high entropy (creative writing, code generation in a novel style, multi-step reasoning chains) accept fewer draft tokens. Outputs with low entropy (template fills, short answers to factual questions, repeated formatting patterns) accept more.
This means your tokens-per-second numbers, and with them total response latency, will vary across use cases in ways that aren't obvious from the model's stated speed. Time-to-first-token is mostly a prefill cost and is less affected. A system running 50% creative generation and 50% structured extraction can see noticeably different per-token latency for the two workloads, even at the same input length.
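The link between predictability and acceptance has a precise form when sampling: in Leviathan et al., the chance that a drafted token is accepted equals the overlap between the target and draft distributions, the sum over tokens of min(p, q). The sketch below computes that overlap for two made-up next-token distributions. The numbers are illustrative assumptions, not measurements.

```python
import math
from typing import Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def acceptance_probability(target: Sequence[float],
                           draft: Sequence[float]) -> float:
    """Probability that a token sampled from `draft` is accepted."""
    return sum(min(p, q) for p, q in zip(target, draft))


# Hypothetical template-fill step: both models are confident in one token.
template_target = [0.94, 0.03, 0.02, 0.01]
template_draft = [0.90, 0.05, 0.03, 0.02]

# Hypothetical creative step: the target is spread out, the draft guesses
# a different shape.
creative_target = [0.30, 0.28, 0.22, 0.20]
creative_draft = [0.55, 0.10, 0.25, 0.10]

template_accept = acceptance_probability(template_target, template_draft)
creative_accept = acceptance_probability(creative_target, creative_draft)
```

In this toy case the low-entropy step accepts about 96% of drafted tokens and the high-entropy step about 72%. High entropy does not force low acceptance, but it leaves far more room for the two models to disagree.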
Output length affects throughput more than input length. For responses of any real length, the bottleneck is token-by-token generation, not prefill, and speculative decoding specifically addresses the generation bottleneck. This means very short outputs (1–5 tokens) don't benefit much: there aren't enough generated tokens for the speedup to outweigh the fixed costs of the request and the draft overhead. Long outputs (100+ tokens) see the full speedup.
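One way to see the output-length effect is an Amdahl-style toy model: prefill is a fixed cost that speculative decoding does not touch, and only the per-token decode time shrinks. All parameter values in the usage note are assumptions chosen for illustration, and the model ignores draft overhead entirely.

```python
def end_to_end_speedup(prefill_s: float, n_out: int,
                       per_token_s: float, decode_speedup: float) -> float:
    """Whole-request speedup when only the decode phase gets faster.

    prefill_s       assumed fixed prefill latency in seconds
    n_out           number of generated tokens
    per_token_s     assumed baseline decode latency per token
    decode_speedup  assumed speedup of the decode phase alone
    """
    baseline = prefill_s + n_out * per_token_s
    accelerated = prefill_s + n_out * per_token_s / decode_speedup
    return baseline / accelerated
```

With an assumed 0.5 s prefill, 20 ms per token, and a 2.5× decode speedup, a 3-token answer finishes only about 7% sooner, while a 500-token answer finishes about 2.3× sooner.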
Context length affects draft quality. Draft models are smaller and attend to context differently than target models. Very long context windows may reduce draft acceptance rates if the draft model can't effectively attend to relevant context. Systems using large context injections (full codebase, long document) may see lower effective throughput from speculative decoding than systems with tight, compressed context.
The Compression Connection
This last point is where context compression intersects with generation throughput. Long, uncompressed context doesn't just cost more at input time. It can also reduce the effectiveness of speculative decoding at output time, if draft model quality degrades with long, noisy context.
Compressing context before inference (reducing a 100,000-token codebase injection to 6,700 tokens) doesn't just cut input costs. It gives the draft model a cleaner signal to work with, improving acceptance rates, which improves effective throughput per dollar.
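For scale, the token counts in that example work out as follows. This is plain arithmetic on the two figures quoted above. It says nothing about acceptance rates, which would have to be measured.

```python
def compression_stats(original_tokens: int, compressed_tokens: int) -> tuple:
    """Return (compression ratio, fraction of tokens removed)."""
    ratio = original_tokens / compressed_tokens
    reduction = 1.0 - compressed_tokens / original_tokens
    return ratio, reduction


ratio, reduction = compression_stats(100_000, 6_700)
```

That is a compression ratio of roughly 15:1, or about 93% fewer input tokens for both the draft and target models to process.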
The throughput wins in Leviathan et al. and in Medusa were measured under controlled benchmark conditions. In production systems with bloated context, the actual gains can be lower. Compression may recover some of that gap.
The Bottom Line
Speculative decoding is the most important inference optimization that API consumers didn't build but benefit from daily. Understanding how it works clarifies why throughput differs between predictable and open-ended outputs, why long outputs benefit more than very short ones, and why long, noisy context can erode the gains.
The 2–3× throughput improvements demonstrated in the original paper and the 2.83× demonstrated in Medusa aren't marketing numbers. They're benchmark measurements on real models. The caveat is that they're measured with clean, focused context, which is an argument for keeping your context clean.
Cite this
Researchers, analysts, or journalists referencing this post can use either format below.
@misc{speculative-decoding-throughput-tradeoffs-2026,
title = {Speculative Decoding: The Throughput Trick That Changes What Fast Means},
author = {James Hollingsworth},
year = {2026},
month = {May},
url = {https://www.gotcontext.ai/blog/speculative-decoding-throughput-tradeoffs},
note = {gotcontext.ai engineering blog.},
}
James Hollingsworth. (2026, May 8). Speculative Decoding: The Throughput Trick That Changes What Fast Means. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/speculative-decoding-throughput-tradeoffs.