A smaller context made the agent more accurate

If you run an LLM agent against real tools, you have watched its context window fill with tool output it will never read again. The reflex is to keep all of it, in case the model needs it later. A team at Microsoft tested that reflex on a 50-task expense-itemization benchmark built on Model Context Protocol tools, and the result runs the other way: keeping the full history finished 71.0% of the tasks, while pruning the context to the last five tool calls and adding a short summary finished 91.6%, on 62% fewer tokens (Lodha et al., 2026).

We build a compression API, so we have a stake in this. The numbers here are not ours, though, and they stand on their own.

The setup: an expense agent drowning in tool output

The benchmark is narrow and concrete, which is what makes it useful. The agent itemizes hotel expenses inside Microsoft Dynamics 365 Finance and Operations, calling MCP tools to read and write records. Each tool response carries a lot of structured data the agent mostly does not need on the next turn. Across 50 tasks and five independent runs, the authors compared four ways of handling that growing pile of context.

The weakest setup gave the agent no running model of the user at all. It completed 8.0% of the itemizations. That is the floor: an agent with tools but no memory of what it has already done.

Full history: 1.48 million tokens to finish 71%

Giving the agent its complete conversation history is the obvious fix, and it helps. Completion climbs to 71.0%. The cost is steep. That configuration burned 1,480,996 tokens and 14.56 hours per benchmark run (Lodha et al., 2026).

Two things happen at once inside that number. The agent pays for every stale tool response it re-sends on every turn, and it also has to read them. The second cost is the one people forget. A long context is not free attention. It is a pile the model searches every time, and the noise in it competes with the signal.
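The re-sending cost can be sketched with a toy model. This is a minimal illustration, not the study's accounting: it assumes every tool response is the same size and counts only the tool responses re-sent as input on each turn. The function name and the numbers (40 turns, 2,000 tokens per response) are hypothetical.

```python
from typing import Optional


def cumulative_input_tokens(
    turns: int, tokens_per_response: int, window: Optional[int] = None
) -> int:
    """Total tool-response tokens re-sent as input across a whole run.

    window=None re-sends every earlier response (full history);
    window=k re-sends only the k most recent responses.
    """
    total = 0
    for turn in range(1, turns + 1):
        earlier = turn - 1  # responses produced before this turn
        kept = earlier if window is None else min(earlier, window)
        total += kept * tokens_per_response
    return total


# Hypothetical workload, not taken from the study.
full = cumulative_input_tokens(turns=40, tokens_per_response=2_000)
windowed = cumulative_input_tokens(turns=40, tokens_per_response=2_000, window=5)
print(full, windowed)  # 1560000 370000
```

With full history the total grows with the square of the number of turns. With a fixed window it grows linearly, which is why the gap widens on longer tasks.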

Pruning plus a summary: 62% fewer tokens, 91.6% done

The configuration that won kept only the last five tool call and response pairs and replaced the older history with an automated summary. It finished 91.6% of the itemizations, with 99.64% of the dollar amounts correct, using 553,374 tokens and 5.79 hours (Lodha et al., 2026).
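The shape of that configuration can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToolExchange`, `build_context`, and the `count_summary` stand-in are hypothetical names, and the post does not say how the study's automated summary was produced.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToolExchange:
    call: str      # the tool call the agent issued
    response: str  # the tool's raw output


def build_context(
    history: List[ToolExchange],
    summarize: Callable[[List[ToolExchange]], str],
    keep_last: int = 5,
) -> List[str]:
    """Keep the most recent exchanges verbatim; fold the rest into one summary."""
    split = max(len(history) - keep_last, 0)
    older, recent = history[:split], history[split:]

    context: List[str] = []
    if older:
        context.append("Summary of earlier steps: " + summarize(older))
    for exchange in recent:
        context.append(f"CALL {exchange.call}")
        context.append(f"RESPONSE {exchange.response}")
    return context


def count_summary(older: List[ToolExchange]) -> str:
    # Stand-in summarizer (an assumption); a real one would call a model.
    return f"{len(older)} earlier tool calls completed."


history = [ToolExchange(f"call-{i}", f"output-{i}") for i in range(8)]
context = build_context(history, count_summary)
print(len(context))  # 11: one summary line plus five call/response pairs
```

Passing the summarizer in as a function keeps the pruning logic testable without a model in the loop.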

Set those side by side. Full history: 71.0% done, 1,480,996 tokens. Prune and summarize: 91.6% done, 553,374 tokens. The compressed run cost 62.6% fewer tokens and finished 20.6 percentage points more of the work. The authors saw the same pattern with Claude Sonnet 4.5, so it is not a single-model artifact.
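The comparison is simple arithmetic on the reported figures, and it is easy to check. The variable names below are ours; the inputs are the numbers quoted above from Lodha et al. (2026).

```python
# Figures as reported for the two configurations.
full_tokens, pruned_tokens = 1_480_996, 553_374
full_hours, pruned_hours = 14.56, 5.79
full_done, pruned_done = 71.0, 91.6  # completion rate, percent

token_saving = (1 - pruned_tokens / full_tokens) * 100
time_saving = (1 - pruned_hours / full_hours) * 100
completion_gain = pruned_done - full_done  # percentage points

print(f"{token_saving:.1f}% fewer tokens")            # 62.6% fewer tokens
print(f"{time_saving:.1f}% less wall-clock time")     # 60.2% less wall-clock time
print(f"{completion_gain:.1f} points more completed")  # 20.6 points more completed
```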

This is the line for anyone who still treats compression as a cost-versus-quality trade-off: here it was a cost win and a quality win in the same run.

Why less context can help

The mechanism is not mysterious. As an agent's history grows, two failure modes set in. The first is stale state: an early tool response describes a record that a later write has already changed, and the model trusts the old copy. The second is lost signal: the fact that matters is buried under thousands of tokens of routine output, and the model's attention spreads too thin to find it.

This matches a result we wrote about in "Perfect retrieval isn't enough," where models lose accuracy as context grows even when the right document is present and perfectly retrievable. Pruning and summarizing attack both failure modes. Drop the stale records, and the model stops trusting outdated state. Compress the rest into a short summary, and the relevant facts move back into the part of the window the model actually reads.
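The stale-state half of that can be sketched as a filter over the tool log. This is a hypothetical illustration, not something the study describes: it assumes each event is tagged as a read or a write and carries a record identifier, which real MCP tool output may not expose so cleanly.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolEvent:
    kind: str       # "read" or "write" (assumed tagging)
    record_id: str  # which record the tool touched
    payload: str    # the tool output as the model saw it


def drop_stale_reads(events: List[ToolEvent]) -> List[ToolEvent]:
    """Remove reads of a record that a later write has since changed."""
    last_write: Dict[str, int] = {}
    for index, event in enumerate(events):
        if event.kind == "write":
            last_write[event.record_id] = index

    return [
        event
        for index, event in enumerate(events)
        if not (event.kind == "read" and index < last_write.get(event.record_id, -1))
    ]


log = [
    ToolEvent("read", "expense-1", "amount=120.00"),   # stale after the write below
    ToolEvent("write", "expense-1", "amount=95.00"),
    ToolEvent("read", "expense-1", "amount=95.00"),    # fresh, kept
    ToolEvent("read", "expense-2", "amount=40.00"),    # never written, kept
]
fresh = drop_stale_reads(log)
print([event.payload for event in fresh])
```

Here the first read is dropped because the write that follows it supersedes what it reported.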

This is the case for compression, measured

The expense study is one workflow, but it lands in a growing pile of evidence pointing the same way. A 2025 RAG study found that compressing retrieved documents to 3% of their original length improved Exact Match by 3.3 points over feeding the model the full documents (Cui et al., 2025). The first systems-level measurement of MCP agents found that the protocol's system prompts, tool definitions, and context histories inflate token usage sharply, which turns context management into a real cost lever instead of a tuning detail (Ding et al., 2025).

The shared finding is that the model does not want all of your context. It wants the part that bears on the current step, in a form short enough to read. That is the premise of what we build at gotcontext: an MCP gateway that compresses tool output, documents, and codebases before they reach the model, so the agent reads the signal instead of the pile. The Microsoft result is a clean outside measurement of why that helps.

What this does not prove

One benchmark, one domain, one primary model. The expense workflow has a property that flatters compression: most tool output is structured data the agent references once and then never again, which is exactly the case where pruning loses little. A workflow where the agent has to reason across the full history, such as a long legal document or a multi-file refactor, will not compress as cleanly, and aggressive summarization there can drop a fact the model needed. The honest version of the claim is narrow: for tool-heavy agent workflows where most context is reference data, a short recent window plus a summary beat keeping everything, on both cost and accuracy, in this study. That is a strong result, not a universal law, and we would rather you read it that way.

Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{smaller-context-made-the-agent-more-accurate-2026,
  title  = {A smaller context made the agent more accurate},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {June},
  url    = {https://gotcontext.ai/blog/smaller-context-made-the-agent-more-accurate},
  note   = {gotcontext.ai engineering blog.},
}

APA

James Hollingsworth. (2026, June 13). A smaller context made the agent more accurate. gotcontext.ai. Retrieved from https://gotcontext.ai/blog/smaller-context-made-the-agent-more-accurate.
