Speculative Decoding: The Throughput Trick That Changes What Fast Means

Language models generate tokens one at a time. That's not a performance choice. It's a fundamental property of autoregressive generation. Each token depends on every token before it. You can't parallelize it.

Except you can, with a workaround that feels like it shouldn't work.

Speculative decoding, introduced in a 2022 paper by Leviathan, Kalman, and Matias at Google Research (arXiv:2211.17192), demonstrated 2 to 3× faster generation on T5-XXL without changing the model architecture, without retraining, and without changing the output distribution. The price is extra total compute, spent in parallel rather than in sequence.

The paper's result: the output distribution is provably identical to standard autoregressive decoding from the target model. Not approximately the same. Identical. The trick is purely about how you use the compute.

How Speculative Decoding Works

The key insight is that verifying a sequence of tokens in parallel is cheaper than generating them serially.

The setup:

1. A small, fast draft model generates K candidate tokens speculatively (say, K = 5).

2. The large target model runs a single forward pass over all K candidates simultaneously.

3. The target model accepts each candidate that is consistent with what it would have generated, stops at the first rejection, replaces that token with one drawn from its own distribution, and discards everything after it.

4. Repeat from the new position.

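On average, some meaningful fraction of the draft model's tokens are accepted. Each accepted token adds only the draft model's generation cost on top of the shared verification pass. The target model pays for one forward pass that scores K tokens in parallel, which is cheaper in wall-clock time than K sequential forward passes because single-token decoding is typically limited by memory bandwidth rather than arithmetic.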
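The trade-off can be put into numbers. The sketch below follows the analysis in Leviathan et al., under the simplifying assumption that every draft token is accepted independently with the same probability. The names `alpha` (acceptance probability), `gamma` (the K above), and `cost_ratio` (draft step time divided by target pass time) are labels chosen for this example, not an API.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model pass.

    alpha: probability that a single draft token is accepted (assumed
           independent and identical across positions).
    gamma: number of draft tokens proposed per pass (K in the text).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus one extra token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-clock improvement over plain decoding.

    cost_ratio: time of one draft-model step divided by the time of one
                target-model pass (an assumption you must measure).
    """
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1.0)
```

With `alpha = 0.8` and five drafts the formula gives about 3.7 tokens per pass. With `alpha = 0` it gives exactly one token per pass, and the speedup drops below 1 because the drafting time is wasted.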

The critical property: the acceptance/rejection algorithm guarantees the final output distribution is identical to running the target model alone. If every draft token gets rejected, each pass still yields one token from the target model, so you fall back to roughly standard generation speed (plus the drafting overhead) with no quality loss. If most get accepted, you get significant speedups with no change in output quality.
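Here is a minimal sketch of that acceptance/rejection rule, in the spirit of the speculative sampling procedure from the paper. It assumes the per-position probability vectors have already been computed and are passed in as plain lists; in a real system they come from the two models' forward passes. The function names are made up for this example.

```python
import random
from typing import List, Sequence


def sample(dist: Sequence[float], rng: random.Random) -> int:
    """Draw a token id from a (possibly unnormalized) weight vector."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_step(
    target_dists: Sequence[Sequence[float]],
    draft_dists: Sequence[Sequence[float]],
    draft_tokens: Sequence[int],
    rng: random.Random,
) -> List[int]:
    """Verify K drafted tokens and return the tokens actually emitted.

    target_dists has K + 1 rows: the target distribution at each drafted
    position plus one extra position used if every draft is accepted.
    draft_dists has K rows: the distributions the draft tokens came from.
    """
    emitted: List[int] = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_dists[i], draft_dists[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q), then stop,
        # discarding the remaining drafts.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(sample(residual, rng))
        return emitted
    # All drafts accepted: take one extra token from the target model.
    emitted.append(sample(target_dists[len(draft_tokens)], rng))
    return emitted
```

Each call emits between 1 and K + 1 tokens. Resampling from the residual is the step that keeps the output distribution equal to the target's: tokens the draft over-proposes are rejected at exactly the rate needed to cancel the bias.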

Medusa: Speculative Decoding Without a Draft Model

The original speculative decoding setup requires maintaining two separate models. For production systems, this means:

Additional memory to hold the draft model

Coordination between two model inference processes

Risk that the draft model's distribution drifts from the target model over time

A 2024 paper, Medusa (arXiv:2401.10774), eliminated the separate draft model by adding multiple decoding heads directly to the target model.

Standard language models have one head: the language model head that converts the final hidden state to a probability distribution over the vocabulary. Medusa adds extra heads alongside it, each predicting a token further in the future. Counting the original head, a five-head setup looks like this:

Head 1: predict token t+1 (standard LM head)

Head 2: predict token t+2

Head 3: predict token t+3

Head 4: predict token t+4

Head 5: predict token t+5

All 5 heads run in parallel on every forward pass. In the simplest variant, the base model's parameters are frozen and only the new heads are trained, which takes a fraction of the time of training the base model.
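To make the role of the extra heads concrete, here is a deliberately simplified sketch of how their guesses could be checked under greedy decoding: keep the longest run of guesses that the base model itself would have produced. The `next_token` callable is a hypothetical stand-in for the base model, and the loop checks positions one at a time for readability, whereas a real implementation scores them in a single batched pass. The paper's tree-based attention and typical-acceptance scheme are left out.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for the base model's greedy next-token choice.
NextToken = Callable[[Sequence[int]], int]


def accept_head_guesses(
    prefix: Sequence[int],
    head_guesses: Sequence[int],
    next_token: NextToken,
) -> List[int]:
    """Return the tokens kept from one decoding step.

    head_guesses[0] is the standard LM head's token (t+1); the remaining
    entries are the extra heads' guesses for t+2, t+3, and so on.
    """
    accepted: List[int] = []
    context = list(prefix)
    for guess in head_guesses:
        # Keep a guess only if the base model would have produced it here.
        if next_token(context) != guess:
            break
        accepted.append(guess)
        context.append(guess)
    return accepted
```

The number of tokens this returns per step is the quantity that acceptance statistics summarize: the closer the later heads track the base model, the more tokens each forward pass yields.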

Medusa-1 (heads trained in isolation, added to a frozen base): achieved 2.18 to 2.33× throughput across standard benchmarks.

Medusa-2 (heads trained jointly with fine-tuning of the base): achieved 2.83× throughput on 7B and 13B models.

On specific task types, the gains are higher. For extraction tasks (generating outputs that closely follow a fixed template) Medusa reached up to 3.62×. On Vicuna-7B specifically, throughput went from 37 tokens/second to 107 tokens/second using Medusa-2.

The acceptance behavior: on average, 3.01 to 3.51 tokens are accepted per decoding step across the benchmarks in the paper.

What This Means for API Consumers

You don't implement speculative decoding yourself when calling Claude or GPT-4o via API. These providers run their own inference infrastructure, and whether speculative decoding is active is invisible to the API consumer.

But you experience the effects:

Throughput is not constant across output types. Speculative decoding works best when the output is predictable, when the draft model guesses right frequently. Outputs with high entropy (creative writing, code generation in a novel style, multi-step reasoning chains) accept fewer draft tokens. Outputs with low entropy (template fills, short answers to factual questions, repeated formatting patterns) accept more.

This means your tokens-per-second numbers, and with them total response latency, will vary across use cases in ways that aren't obvious from the model's stated speed. Time-to-first-token is dominated by prefill and is largely unaffected. A system running 50% creative generation and 50% structured extraction may see noticeably different latency per use case, even at the same input length.
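If you want to see this in your own traffic, it helps to separate time-to-first-token from the decode rate and track the latter per use case. The helper below is a small sketch that works on timestamps you record yourself while consuming a streamed response. It assumes one timestamp per token; streaming APIs often deliver multi-token chunks, so treat the result as an approximation.

```python
from typing import Sequence, Tuple


def stream_metrics(arrival_times: Sequence[float]) -> Tuple[float, float]:
    """Return (time_to_first_token, decode_tokens_per_second).

    arrival_times: seconds since the request was sent at which each output
    token arrived, in order. At least two tokens are required.
    """
    if len(arrival_times) < 2:
        raise ValueError("need at least two tokens to measure a decode rate")
    ttft = arrival_times[0]
    decode_window = arrival_times[-1] - arrival_times[0]
    if decode_window <= 0:
        raise ValueError("arrival times must be increasing")
    # The first token is excluded from the rate: its delay is prefill time.
    return ttft, (len(arrival_times) - 1) / decode_window
```

Comparing the decode rate for, say, template fills against open-ended generation on the same model is a cheap way to check whether output predictability is affecting your latency.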

Output length affects throughput more than input length. In autoregressive serving, the sequential decode phase is usually the bottleneck, not prefill. Speculative decoding specifically addresses the decode bottleneck. This means very short outputs (1 to 5 tokens) don't benefit much, because there aren't enough tokens to amortize the fixed and drafting overhead. Long outputs (100+ tokens) come closest to the full speedup.
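A toy latency model shows the amortization effect. It assumes a fixed prefill time and a constant decode rate, both of which are simplifications, and every number you feed it is your own estimate rather than anything a provider publishes.

```python
def end_to_end_speedup(
    prefill_s: float,
    output_tokens: int,
    base_tokens_per_s: float,
    decode_speedup: float,
) -> float:
    """Ratio of total latency without and with a faster decode phase."""
    baseline = prefill_s + output_tokens / base_tokens_per_s
    accelerated = prefill_s + output_tokens / (base_tokens_per_s * decode_speedup)
    return baseline / accelerated
```

With a hypothetical 0.5 s prefill, 40 tokens/second baseline, and a 2.5× decode speedup, a 5-token answer finishes only about 1.14× faster end to end, while a 500-token answer finishes about 2.36× faster.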

Context length affects draft quality. Draft models are smaller and attend to context differently than target models. Very long context windows may reduce draft acceptance rates if the draft model can't effectively attend to relevant context. Systems using large context injections (full codebase, long document) may see lower effective throughput from speculative decoding than systems with tight, compressed context.

The Compression Connection

This last point is where context compression intersects with generation throughput. Long, uncompressed context doesn't just cost more at input time. It can also reduce the effectiveness of speculative decoding at output time, if draft quality degrades with long, noisy context.

Compressing context before inference (reducing a 100,000-token codebase injection to 6,700 tokens) doesn't just cut input costs. It may also give the draft model a cleaner signal to work with, which would improve acceptance rates and, with them, effective throughput per dollar.
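For scale, the arithmetic behind that example is easy to check. The helper name is made up for illustration.

```python
def compression_summary(original_tokens: int, compressed_tokens: int) -> dict:
    """Compression ratio and fractional reduction for a context rewrite."""
    if original_tokens <= 0 or compressed_tokens <= 0:
        raise ValueError("token counts must be positive")
    return {
        "ratio": original_tokens / compressed_tokens,
        "reduction": 1.0 - compressed_tokens / original_tokens,
    }
```

Going from 100,000 to 6,700 tokens is roughly a 14.9× ratio, or about a 93% reduction in input tokens.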

The throughput wins in Leviathan et al. and in Medusa were measured under controlled conditions. In production systems with bloated context, the actual gains are likely to be lower. Compression can recover some of that gap.

The Bottom Line

Speculative decoding is one of the most important inference optimizations that API consumers didn't build but likely benefit from. Understanding how it works clarifies why:

Structured, predictable outputs are faster than open-ended ones

Output length matters more than input length for latency

Context quality affects generation throughput and input cost

The 2 to 3× improvements demonstrated in the original paper and the 2.83× demonstrated in Medusa aren't marketing numbers. They're measured on real models under benchmark conditions. The caveat is that those conditions involve clean, focused context, which is an argument for keeping your context clean.



Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{speculative-decoding-throughput-tradeoffs-2026,
  title  = {Speculative Decoding: The Throughput Trick That Changes What Fast Means},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://gotcontext.ai/blog/speculative-decoding-throughput-tradeoffs},
  note   = {gotcontext.ai engineering blog.},
}

APA

James Hollingsworth. (2026, May 8). Speculative Decoding: The Throughput Trick That Changes What Fast Means. gotcontext.ai. Retrieved from https://gotcontext.ai/blog/speculative-decoding-throughput-tradeoffs.
