Measured savings across 11 LLMs — Claude Opus 4.7 to Gemini Flash.→ See per-model data
Get free API key →
Engineering

Why Long Agent Sessions Fall Apart (And the Paper That Explains It)

Chroma tested 18 LLMs and found every one degrades as context grows. A 2024 paper shows compression at 5–51x beats retrieval for preserving causal structure in agent sessions.

James Hollingsworth(Contributor)Published 6 min~915 words

Long agent sessions don't fail because the model runs out of tokens. They fail because the model starts ignoring most of what you gave it.

Chroma tested 18 large language models on a task called conversational QA. Their finding, published in their Context Rot research, was blunt: every single model degraded as context length grew. A focused ~300-token prompt consistently outperformed the full ~113,000-token conversation window. More context made answers worse.

They called the phenomenon context rot.

What Context Rot Actually Means for Agents

Context rot isn't a bug in any particular model. It's a structural property of how attention works at scale. As conversations accumulate, the signal-to-noise ratio in the context window degrades. Early exchanges get deprioritized. Repeated patterns get overweighted. The model spends attention budget on old, irrelevant turns instead of the current task.

For a single-turn chat session, this doesn't matter much. For an agent running 50+ turns across a multi-hour debugging session, it's the primary failure mode.

The practical symptoms are recognizable if you've hit them:

  • The agent "forgets" a constraint you set in turn 3 by turn 40
  • Responses grow longer and less actionable as the session continues
  • The agent starts re-asking questions you already answered
  • Tool call quality degrades: more hallucinated arguments, more retries
  • All of these are context rot. The model isn't broken. It's overwhelmed.

    The Compression Approach: Semantic Anchor Compression

    A 2024 paper from researchers at several universities proposed a different framing for the problem. Instead of asking "how do we fit more context into the window," they asked: what is the minimal representation of a conversation that preserves the information that actually matters?

    The result was Semantic Anchor Compression (SAC), published as arXiv:2510.08907.

    SAC works by identifying anchor tokens: the tokens in a conversation that carry the most semantic weight. Rather than summarizing or paraphrasing (which introduces drift), SAC aggregates KV representations around these anchors, producing a compressed version of the conversation that the model can attend to as if it were normal context.

    No autoencoder. No separate compression model. The compression happens in the KV cache layer using the same model that will consume the result.

    The compression ratios the paper demonstrates are not incremental:

  • 5× compression with quality comparable to full context
  • 15× compression with F1 score of 54.95 vs 51.52 for retrieval-augmented baselines
  • 51× compression still functional on standard QA benchmarks
  • At 15× compression, SAC outperformed RAG approaches by up to 23.5% F1 and 26.8% EM on certain tasks. The compressed representation outperformed retrieval because compression preserves conversational structure (the order and flow of the dialogue) while retrieval collapses it into a bag of relevant chunks.

    Why Compression Outperforms Retrieval for Agent Sessions

    This is the counterintuitive part. RAG is the standard answer to long-context problems: embed the conversation, retrieve the relevant bits, feed only those to the model. It works for document QA. It fails for agent sessions.

    Agent sessions have causal dependencies. The fact that you told the agent "don't touch the production database" in turn 5 is not just a relevant fact. It's a constraint that must be present for every subsequent tool call. Retrieval will surface it when you explicitly ask about databases. It won't surface it when the agent is deciding whether to run a migration script.

    Conversation compression preserves this causality. The compressed context still contains the constraint, in the right temporal position, even at 15× compression. Retrieval does not guarantee this.

    The same logic applies to:

  • File paths established early in a session
  • User preferences stated once and assumed thereafter
  • Error states the agent encountered and resolved
  • Decisions made with rationale that affects later choices
  • All of these are load-bearing facts in an agent session. All of them are at risk under context rot. None of them are reliably retrievable without the conversational structure that compression preserves.

    The Engineering Tradeoff

    Compression adds latency to context preparation. At 5×, this is usually acceptable. At 51×, the preparation step is non-trivial. The practical operating range for production agent systems is 5–15×, which brings a 100,000-token conversation down to 6,600–20,000 tokens, well within the sweet spot where attention is focused and generation is fast.

    The other cost is implementation complexity. SAC requires access to the KV cache layer, which is not exposed in standard API calls to hosted models. For teams using Claude, GPT-4o, or Gemini via API, a proxy compression step (compressing the text representation before sending) achieves similar results with less fidelity.

    This is exactly what gotcontext.ai's compress_codebase and ingest_context tools do: they compress the context representation before it reaches the model, trading some fidelity for a dramatic reduction in the tokens the model actually processes.

    What This Means for Your Agent Architecture

    If you're building agents that run long sessions, the Chroma and SAC findings together suggest a clear design principle: never let the raw conversation accumulate in the context window.

    Instead:

  • Compress conversation history before each turn, not just when you hit the limit
  • Prefer compression over retrieval for preserving causal dependencies
  • Monitor answer quality over session length as an early signal of context rot onset
  • Set a compression threshold (10× is a reasonable starting point) and apply it proactively
  • Context rot is inevitable in any system that accumulates context without managing it. The models aren't going to get better at ignoring irrelevant history. The architecture has to do that work.

    Compress your agent sessions automatically with gotcontext.ai →

    Cite this

    Researchers, analysts, or journalists referencing this post can use either format below — both are copyable.

    BibTeXbibtex
    @misc{conversation-compression-long-agent-sessions-2026,
      title  = {Why Long Agent Sessions Fall Apart (And the Paper That Explains It)},
      author = {James Hollingsworth},
      year   = {2026},
      month  = {May},
      url    = {https://www.gotcontext.ai/blog/conversation-compression-long-agent-sessions},
      note   = {gotcontext.ai engineering blog.},
    }
    APAtext
    James Hollingsworth. (2026, May 8). Why Long Agent Sessions Fall Apart (And the Paper That Explains It). gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/conversation-compression-long-agent-sessions.

    Contribute