97% of MCP Tool Descriptions Have Quality Problems — and Your Agent Pays for It

The invisible tax on every MCP-powered agent

Every time your agent runs, it carries its full tool manifest in the context window. Thirty tools, fifty tools, a hundred. Each description is injected into every call, whether the tool is relevant or not. Most developers think of this as a fixed overhead and move on.

They shouldn't. A February 2026 study, "MCP Tool Descriptions Are Smelly!" (arXiv:2602.14878), audited 856 tool descriptions across 103 MCP servers and found that 97.1% contain at least one quality defect, what the authors call a "smell." These aren't cosmetic issues. They directly affect agent behavior and, through the agent's execution path, your token bill.
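To put a rough number on the manifest overhead, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function names are hypothetical, the manifest is synthetic, and the four-characters-per-token ratio is a crude heuristic rather than a real tokenizer.

```python
# Rough estimate of tokens spent re-sending a tool manifest on every call.
# All inputs are hypothetical; 4 characters per token is a crude heuristic,
# not the behavior of any particular tokenizer.

def estimate_manifest_tokens(descriptions, chars_per_token=4.0):
    """Approximate token count for a list of tool description strings."""
    total_chars = sum(len(d) for d in descriptions)
    return round(total_chars / chars_per_token)


def manifest_overhead(descriptions, calls_per_task, tasks_per_day):
    """Tokens per day spent on the manifest alone, before any real work."""
    per_call = estimate_manifest_tokens(descriptions)
    return per_call * calls_per_task * tasks_per_day


if __name__ == "__main__":
    # Synthetic manifest: 50 tools with 400-character descriptions.
    manifest = ["x" * 400] * 50
    print(estimate_manifest_tokens(manifest))       # 5000
    print(manifest_overhead(manifest, 10, 1000))    # 50000000
```

Under these made-up numbers the manifest alone accounts for 50 million tokens a day. Swap in your own description lengths and call volumes to see where you actually stand.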

What a "smell" actually means

The paper identifies smell categories including: descriptions that fail to state what the tool does, ambiguous parameter explanations, missing error behavior, and overlapping functionality described with different vocabulary. The practical effect: agents with smelly tool descriptions misroute calls, retry unnecessarily, and select wrong tools, all of which expand the execution trace and inflate token consumption.

When the researchers augmented tool descriptions to remove smells, task success improved by a median of 5.85 percentage points and partial goal completion improved by 15.12 percentage points. The cost? Augmented descriptions increased average execution steps by 67.46%: agents took more steps, but got further. Separately, 56% of tools in the corpus failed to clearly state their purpose in the description.

The tradeoff is real: better descriptions improve outcomes but also change execution patterns. The point isn't that more tokens are always bad. It's that *wasted* tokens from poor routing are pure cost with no quality return.

The three smell patterns that hurt most

1. Purpose ambiguity. 56% of analyzed tools failed to clearly state what they do in the opening line. An agent scanning for a web search tool and encountering "Retrieves external data based on query parameters" cannot confidently distinguish it from a database lookup tool.

2. Parameter under-specification. When a tool description omits a parameter's valid range or format, the agent either guesses (and retries on error) or asks a clarifying question (adding a round-trip). Both paths cost tokens.

3. Functional overlap without disambiguation. Multiple tools with similar descriptions but different scopes cause the agent to try the wrong one first, get an error or partial result, and re-invoke. Each failed attempt resends the full context.
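The third pattern lends itself to a rough automated check. The sketch below flags pairs of tools whose descriptions share most of their vocabulary, using Jaccard similarity over word sets. This is an illustrative heuristic and not the method used in the paper; the stopword list, the 0.5 threshold, and the tool names are all assumptions.

```python
import re
from itertools import combinations

# Small, hand-picked stopword list (an assumption; tune it for your manifest).
STOPWORDS = {"a", "an", "the", "of", "on", "in", "to", "for",
             "and", "or", "based", "from", "by"}


def content_words(description):
    """Lowercased word set of a description, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {w for w in words if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlapping_pairs(tools, threshold=0.5):
    """tools maps tool name -> description. Returns (name_a, name_b, score)
    for every pair whose similarity meets the threshold."""
    word_sets = {name: content_words(desc) for name, desc in tools.items()}
    pairs = []
    for a, b in combinations(sorted(word_sets), 2):
        score = jaccard(word_sets[a], word_sets[b])
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


if __name__ == "__main__":
    # Hypothetical manifest entries.
    tools = {
        "web_search": "Retrieves external data based on query parameters",
        "db_lookup": "Retrieves internal data based on query parameters",
        "send_email": "Sends an email message to one recipient address",
    }
    print(overlapping_pairs(tools))  # [('db_lookup', 'web_search', 0.67)]
```

Word overlap is a blunt instrument: it catches near-duplicate phrasing but misses tools that describe the same function in different vocabulary, so treat a clean result as a starting point rather than a pass.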

What this means operationally

If you run MCP-powered agents in production, your tool manifest is probably working against you. The ecosystem grew fast and description quality wasn't enforced.

The practical audit:

For each tool in your manifest:

1. Can you state what it does in one sentence without jargon?
2. Does every parameter have a concrete example?
3. Is there another tool it could be confused with? If yes, does the description differentiate?

If the answer to any of these is no, that tool is a misrouting risk.
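Parts of this audit can be approximated in code. The sketch below assumes each tool is a dict with `description` and `inputSchema` keys, with parameters listed as JSON Schema properties. The function name `audit_tool`, the five-word minimum, and the example-detection rule are hypothetical heuristics. It covers only the first two questions; the third needs a pairwise comparison across the manifest.

```python
def audit_tool(tool, min_words=5):
    """Return a list of findings for one tool definition.

    Heuristics (assumptions, not a standard):
    - the opening sentence should have at least `min_words` words;
    - each parameter should carry an `examples` entry or mention an
      example ("e.g." or "example") in its own description.
    """
    findings = []

    description = (tool.get("description") or "").strip()
    first_sentence = description.split(".")[0]
    if len(first_sentence.split()) < min_words:
        findings.append("purpose: opening sentence is missing or too short")

    properties = tool.get("inputSchema", {}).get("properties", {})
    for name, schema in properties.items():
        text = (schema.get("description") or "").lower()
        has_example = (
            bool(schema.get("examples")) or "e.g." in text or "example" in text
        )
        if not has_example:
            findings.append(f"parameter '{name}': no concrete example")

    return findings


if __name__ == "__main__":
    # Hypothetical tool definition.
    tool = {
        "name": "search_orders",
        "description": "Search.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "status": {"type": "string",
                           "description": "Order status, e.g. 'shipped'"},
                "since": {"type": "string", "description": "Start date"},
            },
        },
    }
    for finding in audit_tool(tool):
        print(finding)
```

A length check cannot tell whether a sentence is jargon-free, so a tool that passes still deserves a human read. The value is in surfacing the obvious failures across a large manifest quickly.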

Context compression as a parallel fix

Auditing tool descriptions fixes the routing problem at the source. But for agents where you don't control the tool manifest (third-party MCP servers, auto-generated descriptions, or tools you inherited), context compression at the ingestion layer is the complementary fix.

gotcontext's gc_compress_manifest tool compresses the tool schema injected into your agent's context, preserving the semantically load-bearing information while cutting the token footprint of the manifest itself. When combined with description hygiene, you get both better routing (fewer wasted calls) and smaller per-call context (lower baseline cost).

```json
{
  "mcpServers": {
    "gotcontext": {
      "url": "https://api.gotcontext.ai/mcp",
      "headers": {
        "Authorization": "Bearer gc_live_YOUR_KEY"
      }
    }
  }
}
```

The study is clear: 97.1% of MCP tool descriptions have at least one quality problem, and fixing them demonstrably improves agent outcomes. Your context window is being taxed by descriptions that weren't written with token cost in mind. That's fixable starting this afternoon.


Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{mcp-tool-descriptions-are-eating-your-context-2026,
  title  = {97% of MCP Tool Descriptions Have Quality Problems. And Your Agent Pays for It},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://gotcontext.ai/blog/mcp-tool-descriptions-are-eating-your-context},
  note   = {gotcontext.ai engineering blog.},
}

APA

James Hollingsworth. (2026, May 8). 97% of MCP Tool Descriptions Have Quality Problems. And Your Agent Pays for It. gotcontext.ai. Retrieved from https://gotcontext.ai/blog/mcp-tool-descriptions-are-eating-your-context.
