Why Long Agent Sessions Fall Apart (And the Paper That Explains It)

Long agent sessions don't fail because the model runs out of tokens. They fail because the model starts ignoring most of what you gave it.

Chroma tested 18 large language models on several long-context tasks, including conversational QA. Their finding, published in their Context Rot research, was blunt: every single model degraded as context length grew. On the conversational QA task, a focused ~300-token prompt consistently outperformed the full ~113,000-token conversation window. More context made answers worse.

They called the phenomenon context rot.

What Context Rot Actually Means for Agents

Context rot isn't a bug in any particular model. It's a structural property of how attention works at scale. As conversations accumulate, the signal-to-noise ratio in the context window degrades. Early exchanges get deprioritized. Repeated patterns get overweighted. The model spends attention budget on old, irrelevant turns instead of the current task.

For a single-turn chat session, this doesn't matter much. For an agent running 50+ turns across a multi-hour debugging session, it's the primary failure mode.

The practical symptoms are recognizable if you've hit them:

The agent "forgets" a constraint you set in turn 3 by turn 40

Responses grow longer and less actionable as the session continues

The agent starts re-asking questions you already answered

Tool call quality degrades: more hallucinated arguments, more retries

All of these are context rot. The model isn't broken. It's overwhelmed.

The Compression Approach: Semantic Anchor Compression

A 2025 paper from researchers at several universities proposed a different framing for the problem. Instead of asking "how do we fit more context into the window," they asked: what is the minimal representation of a conversation that preserves the information that actually matters?

The result was Semantic Anchor Compression (SAC), published as arXiv:2510.08907.

SAC works by identifying anchor tokens: the tokens in a conversation that carry the most semantic weight. Rather than summarizing or paraphrasing (which introduces drift), SAC aggregates KV representations around these anchors, producing a compressed version of the conversation that the model can attend to as if it were normal context.

No autoencoder. No separate compression model. The compression happens in the KV cache layer using the same model that will consume the result.
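To make "aggregating representations around anchors" concrete, here is a toy sketch. It is not the SAC algorithm from the paper: the importance scores, the nearest-anchor assignment, and the mean pooling are all assumptions chosen for illustration, and the short vectors merely stand in for per-token KV entries.

```python
# Toy illustration only: NOT the SAC algorithm from the paper.
# Assumption: each token has a vector (standing in for a KV entry) and a
# hypothetical importance score; real anchor selection is model-specific.

def compress_around_anchors(vectors, scores, ratio):
    """Keep len(vectors) // ratio anchors and mean-pool each vector into its nearest anchor."""
    n = len(vectors)
    keep = max(1, n // ratio)
    # The highest-scoring positions become anchors; restore original order.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    anchors = sorted(top)
    groups = {a: [] for a in anchors}
    for i, vec in enumerate(vectors):
        nearest = min(anchors, key=lambda a: abs(a - i))
        groups[nearest].append(vec)
    dim = len(vectors[0])
    return [
        [sum(v[d] for v in groups[a]) / len(groups[a]) for d in range(dim)]
        for a in anchors
    ]

vectors = [[float(i), 1.0] for i in range(10)]
scores = [0.1, 0.9, 0.2, 0.1, 0.3, 0.2, 0.8, 0.1, 0.2, 0.1]
compressed = compress_around_anchors(vectors, scores, ratio=5)
print(len(vectors), "->", len(compressed))  # 10 -> 2
```

The point of the sketch is the shape of the operation: the output is shorter than the input, keeps the original ordering of the anchors, and every input position contributes to exactly one output slot.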

The compression ratios the paper demonstrates are not incremental:

5× compression with quality comparable to full context

15× compression with an F1 score of 54.95, versus 51.52 for retrieval-augmented baselines

51× compression still functional on standard QA benchmarks

At 15× compression, SAC outperformed RAG approaches by up to 23.5% F1 and 26.8% EM on certain tasks. The compressed representation outperformed retrieval because compression preserves conversational structure (the order and flow of the dialogue) while retrieval collapses it into a bag of relevant chunks.

Why Compression Outperforms Retrieval for Agent Sessions

This is the counterintuitive part. RAG is the standard answer to long-context problems: embed the conversation, retrieve the relevant bits, feed only those to the model. It works for document QA. It fails for agent sessions.

Agent sessions have causal dependencies. The fact that you told the agent "don't touch the production database" in turn 5 is a constraint on every subsequent tool call. Retrieval will surface it when you explicitly ask about databases. It won't surface it when the agent is deciding whether to run a migration script.

Conversation compression preserves this causality. The compressed context still contains the constraint, in the right temporal position, even at 15× compression. Retrieval does not guarantee this.

The same logic applies to:

File paths established early in a session

User preferences stated once and assumed thereafter

Error states the agent encountered and resolved

Decisions made with rationale that affects later choices

All of these are load-bearing facts in an agent session. All of them are at risk under context rot. None of them are reliably retrievable without the conversational structure that compression preserves.
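The production-database example can be reproduced with a deliberately simple retriever. The sketch below ranks turns by word overlap with the query, which is an assumption made for brevity; real systems typically use embeddings and may do better. The turn texts and the file path are hypothetical.

```python
import re

def words(text):
    """Lowercased alphabetic tokens of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(turns, query, k=1):
    """Toy lexical retriever: rank turns by word overlap with the query."""
    q = words(query)
    ranked = sorted(turns, key=lambda t: len(words(t) & q), reverse=True)
    return ranked[:k]

turns = [
    "Turn 5: don't touch the production database",
    "Turn 12: the migration script lives in scripts/migrate.py",  # hypothetical path
    "Turn 20: tests pass on the staging branch",
]
query = "should I run the migration script now"
hits = retrieve(turns, query)
constraint_surfaced = any("production database" in t for t in hits)
print(hits)
print("constraint surfaced:", constraint_surfaced)  # constraint surfaced: False
```

The query shares vocabulary with the turn about the migration script, not with the turn that forbids touching production, so the constraint never reaches the model at the moment it matters.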

The Engineering Tradeoff

Compression adds latency to context preparation. At 5×, this is usually acceptable. At 51×, the preparation step is non-trivial. The practical operating range for production agent systems is 5 to 15×, which brings a 100,000-token conversation down to roughly 6,700 to 20,000 tokens, well within the sweet spot where attention is focused and generation is fast.
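The token counts follow from dividing the original length by the compression ratio. A quick check, using the three ratios quoted from the paper (the helper name is ours):

```python
def compressed_tokens(original_tokens, ratio):
    """Tokens remaining after compressing by the given ratio, rounded to the nearest token."""
    return round(original_tokens / ratio)

for ratio in (5, 15, 51):
    print(f"{ratio}x: {compressed_tokens(100_000, ratio):,} tokens")
# 5x: 20,000 tokens
# 15x: 6,667 tokens
# 51x: 1,961 tokens
```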

The other cost is implementation complexity. SAC requires access to the KV cache layer, which is not exposed in standard API calls to hosted models. For teams using Claude, GPT-4o, or Gemini via API, a proxy compression step (compressing the text representation before sending it) approximates the same effect at lower fidelity.

This is exactly what gotcontext.ai's compress_codebase and ingest_context tools do: they compress the context representation before it reaches the model, trading some fidelity for a dramatic reduction in the tokens the model actually processes.
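For a sense of what a text-level proxy step can look like, here is a minimal sketch. It does not describe how compress_codebase or ingest_context are implemented; the function names and the two heuristics (drop verbatim-duplicate turns, keep only the head and tail of long blocks) are assumptions chosen because they are easy to reason about.

```python
def shrink_block(text, head=3, tail=3):
    """Keep the first and last lines of a long block and note how many were dropped."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text
    dropped = len(lines) - head - tail
    marker = f"[... {dropped} lines omitted ...]"
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])

def compress_history(turns, head=3, tail=3):
    """Drop verbatim-duplicate turns, then shrink long blocks in what remains."""
    seen, out = set(), []
    for turn in turns:
        if turn in seen:
            continue
        seen.add(turn)
        out.append(shrink_block(turn, head, tail))
    return out

log = "\n".join(f"line {i}" for i in range(100))  # stand-in for verbose tool output
history = ["user: run the tests", log, "user: run the tests", "agent: 2 failures"]
compact = compress_history(history)
before = sum(len(t) for t in history)
after = sum(len(t) for t in compact)
print(len(history), "->", len(compact), "turns;", before, "->", after, "chars")
```

Heuristics like these lose information outright, which is the fidelity cost relative to KV-level compression.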

What This Means for Your Agent Architecture

If you're building agents that run long sessions, the Chroma and SAC findings together suggest a clear design principle: never let the raw conversation accumulate in the context window.

Instead:

Compress conversation history before each turn, rather than waiting until you hit the context limit

Prefer compression over retrieval for preserving causal dependencies

Monitor answer quality over session length as an early signal of context rot onset

Set a target compression ratio (10× is a reasonable starting point) and apply it consistently
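One way to wire the first of these points into an agent loop is a context-preparation step that runs before every turn. The sketch below is illustrative: the four-characters-per-token estimate is a rough heuristic, the token budget and all function names are assumptions, and keep_recent is a stand-in for a real compressor rather than a recommendation.

```python
def estimate_tokens(text):
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)

def prepare_context(pinned, history, budget_tokens, compress):
    """Return the context for the next turn, compressing history if over budget.

    pinned   -- constraints kept verbatim on every turn
    history  -- prior turns, oldest first
    compress -- caller-supplied function taking and returning a list of strings
    """
    total = sum(estimate_tokens(t) for t in pinned + history)
    if total > budget_tokens:
        history = compress(history)
    return pinned + history

def keep_recent(history, n=2):
    """Stand-in compressor: replace all but the last n turns with a marker."""
    if len(history) <= n:
        return history
    return [f"[{len(history) - n} earlier turns compressed]"] + history[-n:]

pinned = ["constraint: don't touch the production database"]
history = [f"turn {i}: " + "x" * 400 for i in range(50)]
context = prepare_context(pinned, history, budget_tokens=1000, compress=keep_recent)
print(len(pinned) + len(history), "->", len(context), "items")  # 51 -> 4 items
```

Keeping load-bearing constraints in a separate pinned list means they survive no matter how aggressive the compressor is.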

Context rot is inevitable in any system that accumulates context without managing it. The models aren't going to get better at ignoring irrelevant history. The architecture has to do that work.

Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{conversation-compression-long-agent-sessions-2026,
  title  = {Why Long Agent Sessions Fall Apart (And the Paper That Explains It)},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://gotcontext.ai/blog/conversation-compression-long-agent-sessions},
  note   = {gotcontext.ai engineering blog.},
}

APA

Hollingsworth, J. (2026, May 8). Why Long Agent Sessions Fall Apart (And the Paper That Explains It). gotcontext.ai. https://gotcontext.ai/blog/conversation-compression-long-agent-sessions
