Perfect retrieval isn't enough: the hidden cost of long LLM context

Recent controlled experiments show LLMs get less accurate as input grows, even when retrieval is perfect and irrelevant tokens are masked out. That makes context compression a lever that improves accuracy and cost at the same time, not a tradeoff between them.

By James Hollingsworth (Contributor) · 7 min read

The EMNLP finding: accuracy drops even when retrieval is perfect

The common explanation for why long-context LLMs struggle is distraction: irrelevant tokens dilute the signal. That explanation implies the fix is better retrieval. Get the right documents and remove the wrong ones, and the model should perform at its ceiling.

Du et al. tested this directly. “Context Length Alone Hurts LLM Performance Despite Perfect Retrieval” (EMNLP 2025 Findings) fed models exactly the passages they needed to answer each question (100% exact-match retrieval) and varied how much surrounding context those passages were padded with. The result: accuracy still dropped by 13.9% to 85% as input length grew, even though every input stayed within the model's claimed context window.

The magnitude varies by model and task, but the direction is consistent. Llama-3.1-8B (128K context window) drops 24.2 percentage points on MMLU at 30K tokens despite perfect retrieval. Bigger context windows do not fix the problem; the window's size and the model's ability to use all of it are different things.

The paper also tested a prompt engineering intervention: asking the model to “recite the evidence before answering.” This recovered up to 4% on RULER for GPT-4o. That is a real but partial improvement, and it adds tokens to the context it is trying to help with.
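As a rough illustration of the recitation intervention, the sketch below wraps a question in a prompt that asks for evidence first. The function name, the template layout, and the `Answer:` convention are assumptions made for this example; only the idea of reciting evidence before answering comes from the paper.

```python
def build_recite_prompt(context: str, question: str) -> str:
    """Wrap a question so the model is asked to quote its evidence first.

    Hypothetical template: the exact wording used by Du et al. may differ.
    """
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "First, recite the evidence from the context that is relevant to the "
        "question. Then give your final answer on a new line starting with "
        "'Answer:'."
    )


prompt = build_recite_prompt(
    context="The meeting was moved from Tuesday to Thursday.",
    question="What day is the meeting now?",
)
print(prompt)
```

Note that the recited evidence is generated as output tokens, so this trades extra generation for the accuracy it recovers.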

Why masking rules out distraction as the full cause

The most direct part of the Du et al. study is the masking experiment. The researchers removed the distraction variable entirely: they modified the attention mask so the model could only attend to the evidence tokens, not the surrounding padding. If irrelevant context were the main driver of accuracy loss, masking should have recovered performance.

It did not. Under full masking, Llama-3 drops 50% on HumanEval at 30K tokens. The model was attending only to the correct evidence and still performed substantially worse than at shorter context lengths.
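To make the masking idea concrete, here is a toy sketch of an attention mask that hides padding tokens. It is a simplified illustration, not the paper's implementation: the function name, the token labels, and the choice to let every position attend to itself are assumptions for this example.

```python
def evidence_only_mask(is_evidence: list[bool]) -> list[list[bool]]:
    """Return mask[q][k]: True if query position q may attend to key position k.

    The mask is causal (k <= q) and restricted to evidence keys. Each position
    may also attend to itself so that no row is empty (an assumption of this
    sketch).
    """
    n = len(is_evidence)
    return [
        [k <= q and (is_evidence[k] or k == q) for k in range(n)]
        for q in range(n)
    ]


# Toy sequence: padding, evidence, more padding, then the question token.
tokens = ["pad", "pad", "ev", "ev", "pad", "q"]
is_evidence = [t == "ev" for t in tokens]
mask = evidence_only_mask(is_evidence)

# Which tokens can the final (question) position see?
visible_to_question = [tokens[k] for k, ok in enumerate(mask[-1]) if ok]
print(visible_to_question)  # ['ev', 'ev', 'q']
```

The sequence is still six tokens long even though the question position sees only three of them, which is the condition the masking experiment isolates: the length stays, the distraction goes.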

The practical implication is direct: solving for retrieval quality does not solve for context length. The two are separate problems. A system that retrieves perfectly but passes long sequences to the model still pays the length penalty.

Liu et al.'s “Lost in the Middle” (TACL 2024) established the positional-bias baseline: models use information at the start and end of long contexts better than the middle. The EMNLP masking result adds a separate layer — even when position is controlled, length itself remains a variable that degrades performance.

What compression does about it: the LLMLingua-2 numbers

If context length is the variable that hurts accuracy independent of retrieval quality, then compressing the context before it enters the model addresses the accuracy problem directly, not just the token bill.

Pan, Wu, Jiang et al. at Microsoft Research quantified this in LLMLingua-2 (ACL 2024 Findings). On MeetingBank QA, 3.1x compression (3,003 tokens to 970 tokens) produced a QA-F1 of 86.92 versus 87.75 uncompressed — under 1 point of quality loss at 3x reduction. The compressor is a fine-tuned XLM-RoBERTa-large encoder and runs 3x–6x faster than prior compression methods.
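The arithmetic behind those figures can be checked in a few lines. The numbers are the ones quoted above; the variable names are just for this example.

```python
# Figures quoted above for LLMLingua-2 on MeetingBank QA.
original_tokens = 3003
compressed_tokens = 970
f1_uncompressed = 87.75
f1_compressed = 86.92

ratio = original_tokens / compressed_tokens
tokens_removed = 1 - compressed_tokens / original_tokens
f1_loss = f1_uncompressed - f1_compressed

print(f"compression ratio: {ratio:.1f}x")               # 3.1x
print(f"tokens removed:    {tokens_removed:.1%}")       # 67.7%
print(f"QA-F1 loss:        {f1_loss:.2f} points")       # 0.83 points
```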

Taken with the Du et al. finding, this makes compression an accuracy lever as well as a cost lever. Shortening the context targets the length penalty that retrieval alone cannot address, and LLMLingua-2 shows the compression itself can cost under 1 point of QA-F1 on its test set.

gotcontext's /v1/compress and /v1/compress-code/structural endpoints apply semantic compression via the MCP gateway at https://api.gotcontext.ai/mcp. Compression ratios vary by document type and fidelity preset; see /benchmarks/compression for numbers on real corpora.

The cost angle: compression vs. model-switching

The standard cost-reduction playbook is to switch to a cheaper model when task accuracy permits. This is a reasonable approach, but a recent preprint identified a structural problem with it.

“The Price Reversal Phenomenon” (arXiv:2603.23971, preprint, not yet peer-reviewed) found that in 21.8% of model-pair comparisons, the cheaper-listed model actually cost more per task. The mechanism is hidden thinking tokens: models with internal chain-of-thought generate token chains before producing their output, and those tokens are billed even though they never appear in the response. The reversal magnitude reached 28x in some pairs.
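A small worked example shows how the reversal can arise. Every number below is hypothetical and chosen only to illustrate the mechanism; none comes from the preprint. The sketch also assumes that thinking tokens are billed at the output-token rate.

```python
def cost_per_task(input_tokens, visible_output_tokens, thinking_tokens,
                  input_price_per_mtok, output_price_per_mtok):
    """Dollar cost of one task, with hidden thinking tokens billed as output."""
    billed_output = visible_output_tokens + thinking_tokens
    total = (input_tokens * input_price_per_mtok
             + billed_output * output_price_per_mtok)
    return total / 1_000_000


# Hypothetical model A: lower list price, but reasons internally at length.
cheap_listed = cost_per_task(2000, 300, 6000,
                             input_price_per_mtok=0.50,
                             output_price_per_mtok=2.00)

# Hypothetical model B: three times the list price, no hidden thinking tokens.
pricey_listed = cost_per_task(2000, 300, 0,
                              input_price_per_mtok=1.50,
                              output_price_per_mtok=6.00)

print(f"cheaper-listed model: ${cheap_listed:.4f} per task")   # $0.0136
print(f"pricier-listed model: ${pricey_listed:.4f} per task")  # $0.0048
```

In this made-up case the model with the lower list price costs almost three times as much per task, because 6,000 of its 6,300 billed output tokens never appear in the response.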

Input compression does not have this problem. The token reduction happens before the model sees the input, so there is no hidden multiplier on the downstream model's side. A 60% reduction in input tokens is a 60% reduction in input cost, regardless of which model processes it or whether that model uses internal reasoning.

The levers also compose: compress the input, apply prompt caching to the compressed form, then route to the appropriate model. The savings from each stage multiply rather than compete. The gotcontext MCP gateway is built around this composition; see the MCP server setup guide for how to wire it into Anthropic, OpenAI, or Gemini pipelines.
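One way to see why the savings multiply is to model each stage as a factor applied to the input cost. This is a sketch under stated assumptions: the token count, price, cache hit fraction, cached-token discount, and routing price ratio are all hypothetical, and only input-side cost is modeled.

```python
from math import prod


def cache_factor(cached_fraction: float, cached_price_multiplier: float) -> float:
    """Cost factor when part of the input is billed at a discounted cached rate."""
    return (1 - cached_fraction) + cached_fraction * cached_price_multiplier


# Hypothetical per-stage factors (1.0 means no savings).
stage_factors = {
    "compression": 0.40,              # 60% fewer input tokens
    "caching": cache_factor(0.5, 0.1),  # half the tokens hit cache at 10% price
    "routing": 0.50,                  # routed model costs half as much per token
}

baseline_cost = 100_000 * 3.00 / 1_000_000  # 100K input tokens at $3.00 per Mtok
combined_factor = prod(stage_factors.values())
final_cost = baseline_cost * combined_factor

print(f"baseline:        ${baseline_cost:.3f}")
print(f"combined factor: {combined_factor:.2f}")
print(f"after all stages: ${final_cost:.3f}")
```

With these made-up factors, compression alone brings the input cost to 40% of baseline, and the three stages together bring it to 11%, since 0.40 × 0.55 × 0.50 = 0.11.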

What this doesn't prove

Honest limits of the cited research

  • Academic benchmarks differ from production workloads. The Du et al. results are on MMLU, HumanEval, and RULER — structured, single-answer tasks. Real agent sessions involve multi-turn context, tool calls, and partial information across many files. Whether the accuracy drop scales proportionally to production workloads is not established by the paper.
  • LLMLingua-2 numbers are from meeting transcripts. The 3.1x / 86.92 QA-F1 figures are from MeetingBank, a corpus of meeting transcripts. Compression ratios and quality preservation vary by document type. Code compresses differently than prose; short structured documents behave differently than long narrative ones.
  • The price reversal preprint is not yet peer-reviewed. arXiv:2603.23971 has not gone through peer review as of this post's publish date. The mechanism (hidden thinking tokens) is independently verifiable from provider documentation, but the 21.8% and 28x figures should be treated as preliminary.
  • Compression is not lossless. Any compression that removes tokens removes information. The LLMLingua-2 paper shows the loss is small at 3.1x on its test set. That does not mean all compression at all ratios preserves all semantics. Higher compression ratios carry higher semantic risk, and the right fidelity level depends on the task.

The masking result from Du et al. is the most useful single finding here: it separates “long context hurts because of distraction” from “long context hurts because of length itself.” If length is the independent variable, then shorter context is the direct intervention. The LLMLingua-2 QA-F1 numbers show it is achievable at 3.1x with under 1 point of quality loss on its test set. Real-world ratios vary by document and fidelity setting — see /benchmarks/compression for measurements on real corpora, or connect the MCP gateway and measure it on your own documents.