Your LLM Gets Measurably Worse as the Conversation Grows. All of Them Do.

Context rot is not a bug. It's a property.

Chroma recently published a technical report testing 18 frontier LLMs (Claude Opus 4, GPT-4.1, Gemini 2.5 Pro, and 15 others) on a consistent benchmark as input context length grew. The finding was unambiguous: every model tested showed measurable quality degradation as context length increased. No model was immune. The phenomenon has a name: context rot.

This isn't an anecdotal observation. It's a consistent, reproducible pattern across model families, parameter scales, and providers. The models don't forget uniformly; they become progressively less reliable. Accuracy on the same question, with the same answer present in the context, degrades as more surrounding text is added.
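The setup described above (same question, same answer, more surrounding text) can be sketched in a few lines. This is a minimal illustration, not the harness used in the Chroma report: `build_prompt`, the filler sentences, and the needle are all hypothetical, and word count is used as a crude stand-in for tokens.

```python
def build_prompt(needle: str, question: str, filler: list[str],
                 target_words: int, depth: float = 0.5) -> str:
    """Embed `needle` in filler text of roughly `target_words` words.

    `depth` is the relative position of the needle (0.0 = start, 1.0 = end).
    Words approximate tokens here; swap in a real tokenizer for actual runs.
    """
    haystack: list[str] = []
    words = 0
    i = 0
    while words < target_words:
        sentence = filler[i % len(filler)]
        haystack.append(sentence)
        words += len(sentence.split())
        i += 1
    haystack.insert(int(len(haystack) * depth), needle)
    return " ".join(haystack) + f"\n\nQuestion: {question}"


# Hypothetical filler and needle, chosen only for illustration.
FILLER = [
    "The quarterly report was filed on time.",
    "Rainfall was above average in the northern region.",
    "The committee postponed its vote until spring.",
]
NEEDLE = "The access code for the archive room is 7341."
QUESTION = "What is the access code for the archive room?"

# Same question, same answer; only the amount of surrounding text changes.
prompts = {n: build_prompt(NEEDLE, QUESTION, FILLER, n) for n in (100, 1_000, 10_000)}
for n, prompt in prompts.items():
    print(n, len(prompt.split()), NEEDLE in prompt)
```

Sending each prompt to the same model and scoring the answers against the known needle gives one accuracy point per context length.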

Three patterns that make it worse

The Chroma study identified specific structural factors that accelerate degradation:

Semantic distance matters more than context length alone. When the question and the relevant passage are semantically dissimilar (a technical question whose answer is buried in adjacent narrative text), performance degrades faster than when question and answer are closely matched in vocabulary and framing. Long context windows do more than dilute signal; they penalize most heavily the cases where retrieval is already hardest.

Distractors amplify with scale. Irrelevant but plausible content placed near the answer degrades accuracy more at 100K tokens than at 10K tokens. The model's ability to suppress misleading context weakens as the total input grows.

Coherent content hurts more than shuffled content. This is the counterintuitive result: logically structured, well-organized surrounding content degrades performance *more* than randomly shuffled noise. The hypothesis is that coherent text activates the model's tendency to read across sections, pulling attention away from the specific answer location.
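To compare these conditions on your own material, you need three variants of the same input that differ only in structure. The sketch below is one way to build them, assuming sentence-level shuffling and a single distractor placed directly after the answer; `make_conditions` and the sample sentences are hypothetical and not taken from the study.

```python
import random


def make_conditions(sentences: list[str], needle: str, distractor: str,
                    seed: int = 0) -> dict[str, str]:
    """Build coherent, shuffled, and distractor variants of one haystack.

    The needle sits at the same index in every variant, so only the
    structure of the surrounding text differs between conditions.
    """
    mid = len(sentences) // 2
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs reproducible

    def with_needle(body: list[str], extra: tuple[str, ...] = ()) -> str:
        return " ".join(body[:mid] + [needle, *extra] + body[mid:])

    return {
        "coherent": with_needle(sentences),
        "shuffled": with_needle(shuffled),
        "coherent_plus_distractor": with_needle(sentences, (distractor,)),
    }


# Hypothetical sample text, in logical order.
SENTENCES = [
    "The archive was founded in 1962.",
    "It moved to the east wing in 1980.",
    "The east wing was renovated ten years later.",
    "After the renovation, access rules were tightened.",
    "Staff now need a code to enter the archive room.",
    "Codes are rotated at the start of each quarter.",
]
NEEDLE = "The access code for the archive room is 7341."
DISTRACTOR = "The access code for the supply room is 7314."

conditions = make_conditions(SENTENCES, NEEDLE, DISTRACTOR)
for name, text in conditions.items():
    print(f"{name}: {text}\n")
```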

A parallel study (arXiv:2601.11564) found a non-linear relationship between KV cache growth and performance on dense transformer architectures. Performance doesn't degrade linearly with context; it drops faster as context length increases, particularly when input mixes relevant and irrelevant material.

The practical implication nobody wants to say out loud

LLM providers have spent two years competing on context window size. 128K, 200K, 1M tokens. The marketing framing is: bigger window = more powerful model. Feed it your entire codebase. Your whole conversation history. Everything.

The research says: longer context doesn't mean better results. It frequently means worse results. No irrelevant token you add to the context window is neutral. Each one actively degrades performance on what you care about.

| Pattern | Effect on quality |
| --- | --- |
| Full conversation history in every call | Degrading; each turn adds distractor tokens |
| Full codebase as context | Degrading; semantically distant files suppress relevant signal |
| Complete tool output recirculation | Degrading; verbose outputs bury the relevant lines |
| Compressed, query-relevant context | Quality-preserving; model sees what matters |
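As a rough illustration of the last row, the sketch below keeps the most recent turns plus any older turns that share vocabulary with the current query, within a word budget. It is a deliberately crude stand-in: `select_turns` is a hypothetical helper, and plain word overlap will miss exactly the semantically distant cases discussed earlier, where an embedding-based score would be a better fit.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude proxy for semantic content."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_turns(history: list[str], query: str, budget_words: int,
                 keep_last: int = 2) -> list[str]:
    """Keep the last `keep_last` turns, then add older turns that overlap
    with the query (best match first) until the word budget is used up."""
    q = tokens(query)
    recent = set(range(max(0, len(history) - keep_last), len(history)))
    overlap = {i: len(q & tokens(history[i]))
               for i in range(len(history)) if i not in recent}

    chosen = set(recent)
    used = sum(len(history[i].split()) for i in chosen)
    for i in sorted(overlap, key=overlap.get, reverse=True):
        cost = len(history[i].split())
        if overlap[i] == 0 or used + cost > budget_words:
            continue  # skip unrelated turns and turns that do not fit
        chosen.add(i)
        used += cost
    return [history[i] for i in sorted(chosen)]  # preserve original order


# Hypothetical conversation, for illustration only.
history = [
    "user: Our payments service uses Postgres 15 for the orders table.",
    "assistant: Noted. Postgres 15 supports MERGE statements.",
    "user: Unrelated, what is a good name for a team offsite?",
    "assistant: Maybe 'Basecamp Week'.",
    "user: Also the office plants need watering.",
    "assistant: I cannot water plants, sadly.",
]
query = "Which Postgres version does the orders table run on?"

selected = select_turns(history, query, budget_words=40)
print("\n".join(selected))
```

Here the two offsite turns are dropped, while the Postgres turns and the two most recent turns are kept.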

Context rot is a cost problem and a quality problem simultaneously

Removing unnecessary tokens from your context window doesn't just save money. It makes your agent more accurate.

The two objectives (cost reduction and quality improvement) point at the same intervention: feed the model less, not more, of what it doesn't need.
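The cost side is simple arithmetic. If every call resends the full history, call k carries k turns, so total input tokens over n turns grow as n(n+1)/2 rather than n. The sketch below compares that with a capped context; the 500 tokens per turn and the 4,000-token cap are illustrative assumptions, not measured values.

```python
def full_history_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when every call resends the whole history."""
    # Call k carries k turns, so the total is t * (1 + 2 + ... + n).
    return tokens_per_turn * turns * (turns + 1) // 2


def capped_tokens(turns: int, tokens_per_turn: int, cap: int) -> int:
    """Total input tokens when each call's context is held to `cap` tokens."""
    return sum(min(k * tokens_per_turn, cap) for k in range(1, turns + 1))


# Illustrative numbers: 50 turns, 500 tokens per turn, 4,000-token cap.
full = full_history_tokens(50, 500)
capped = capped_tokens(50, 500, 4_000)
print(full)    # 637500
print(capped)  # 186000
```

Under these assumptions the capped conversation sends less than a third of the input tokens, and by the argument above those are also the tokens least likely to help.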

gotcontext's compression pipeline is designed around this finding. Rather than truncating arbitrarily (which discards information without regard to relevance), it builds a semantic graph of the input, ranks content by structural importance, and emits a compressed form that preserves what the model needs to answer the query at hand. The result is a shorter context that gets better answers, not despite being shorter, but because of it.

```json
{
  "mcpServers": {
    "gotcontext": {
      "url": "https://api.gotcontext.ai/mcp",
      "headers": {
        "Authorization": "Bearer gc_live_YOUR_KEY"
      }
    }
  }
}
```
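For intuition about the graph-and-rank idea described above, here is a small TextRank-style sketch: sentences are nodes, word overlap defines edge weights, and a query-biased PageRank picks what to keep. This is not gotcontext's implementation, whose internals are not described here; the stopword list, the Jaccard similarity, and every function name are assumptions made for illustration.

```python
import re
from itertools import combinations

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "as", "does", "how", "in", "is", "no", "of", "on", "the", "to"}


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a and b else 0.0


def rank(sentences: list[str], query: str,
         damping: float = 0.85, iterations: int = 50) -> list[float]:
    """Query-biased PageRank over a sentence-similarity graph."""
    bags = [words(s) for s in sentences]
    n = len(sentences)
    weight = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weight[i][j] = weight[j][i] = jaccard(bags[i], bags[j])
    out = [sum(row) for row in weight]

    # Teleport mass goes to sentences that resemble the query.
    bias = [jaccard(bag, words(query)) for bag in bags]
    total = sum(bias)
    bias = [b / total for b in bias] if total else [1.0 / n] * n

    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) * bias[i]
            + damping * sum(scores[j] * weight[j][i] / out[j]
                            for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return scores


def compress(sentences: list[str], query: str, keep: int) -> list[str]:
    """Keep the `keep` highest-ranked sentences, in their original order."""
    scores = rank(sentences, query)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]


# Hypothetical input, for illustration only.
DOC = [
    "Deploys go through the standard pipeline.",
    "Rollback takes about six minutes and needs no approval.",
    "Traffic peaks on Friday evenings.",
    "Rollback uses the same pipeline in reverse.",
    "The team lunch is on Thursday.",
]
QUERY = "How long does a rollback take?"
print(compress(DOC, QUERY, keep=2))
```

With this input the two rollback sentences survive and the unrelated ones are dropped. The point of the graph is that a sentence can earn its place through connections to relevant sentences, not only through direct overlap with the query.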

Eighteen models. Same result. Longer context degrades quality. The fix is the same as the cost fix: compress before the model reads it.



Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{why-your-llm-gets-dumber-as-the-conversation-grows-2026,
  title  = {Your LLM Gets Measurably Worse as the Conversation Grows. All of Them Do.},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://gotcontext.ai/blog/why-your-llm-gets-dumber-as-the-conversation-grows},
  note   = {gotcontext.ai engineering blog.},
}

APA

Hollingsworth, J. (2026, May 8). Your LLM Gets Measurably Worse as the Conversation Grows. All of Them Do. gotcontext.ai. https://gotcontext.ai/blog/why-your-llm-gets-dumber-as-the-conversation-grows

