What is gotcontext.ai?

gotcontext.ai is a semantic compression MCP (Model Context Protocol) gateway that reduces LLM token usage by roughly 50% on typical inputs and up to 87.4% on large documents. It compresses every tool response before your agent reads it, so 5+ connected MCP servers do not flood the context window. It works with Claude Code, Cursor, Codex, and Gemini CLI using one bearer token at https://api.gotcontext.ai/mcp. Also available as a REST API, Python SDK, and TypeScript SDK. Free tier included, no credit card required.

How does gotcontext.ai reduce LLM token costs?

gotcontext.ai compresses the input context — documents, code, conversation history — before passing it to the LLM. The compression engine ranks content by semantic importance using graph-based PageRank, removes low-signal tokens, and returns a compressed skeleton. Because LLM pricing is per-token, sending fewer tokens directly reduces API cost, typically by 40–60% on average input and up to 85% on long redundant inputs.

How is gotcontext.ai different from LLMLingua?

LLMLingua (Microsoft Research) is an academic prompt compression method with no hosted API, no MCP server, and no SDKs. gotcontext.ai is a production-grade hosted API with REST endpoints, a Model Context Protocol (MCP) Streamable HTTP server, Python and TypeScript SDKs, a free tier, and a self-hosted Docker image. All paid plans include the same 140+ MCP tools (compression, ACE agent context engineering, knowledge management, multimodal ingestion); plans differ on monthly compression volume, embedding tier (TF-IDF / ONNX / SBERT), and enterprise wraparound (OIDC, audit-log export, SLA, support). LLMLingua requires running your own GPU inference; gotcontext.ai works with an API key.

Does gotcontext.ai work with Claude Code, Cursor, Codex, or Gemini?

Yes. Run "npx gotcontext wrap claude" to configure Claude Code, "npx gotcontext wrap codex" for OpenAI Codex CLI, or "npx gotcontext wrap gemini" for Google Gemini CLI. Each command writes the MCP server entry to the right config file automatically. Cursor and any other MCP-compatible client can also connect directly to https://api.gotcontext.ai/mcp using a Bearer API key. gotcontext.ai also ships a Claude Code plugin installable via /plugin marketplace add oimiragieo/gotcontext-plugin.

How do I install gotcontext.ai and start reducing token usage?

Get a free API key at https://gotcontext.ai/sign-up (no credit card required). Then run "npx gotcontext wrap claude" in your terminal to configure Claude Code automatically. For Codex CLI run "npx gotcontext wrap codex", for Gemini CLI run "npx gotcontext wrap gemini". After that, every tool response your agent reads is compressed by gotcontext.ai before it reaches the context window. Run "npx gotcontext doctor" to verify which CLIs are detected and configured. Full instructions at https://gotcontext.ai/docs/getting-started.

Does gotcontext.ai handle high-traffic load fairly across plans?

Yes. gotcontext.ai uses a plan-priority compression queue: each tier has its own pool of concurrent compression slots — Enterprise gets 8 slots, Team gets 4, Pro gets 2, Free gets 1. Pools are isolated, so a flood of Free traffic cannot starve a paying Enterprise customer (similar to Anthropic Priority Tier and OpenAI Priority API). When a tier reaches its slot cap, additional requests wait briefly rather than being rejected with 429, and only return HTTP 503 with a Retry-After hint after a timeout. This is reserved capacity by design, not queue jumping.
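
The isolation property described above can be illustrated with one semaphore per plan. This is a minimal sketch, not the gateway's actual implementation: the `PlanQueue` class, the 0.05-second wait timeout, and the `Retry-After` value of 1 second are all hypothetical. Only the slot counts come from the text.

```python
import asyncio

# Slot counts per plan, as stated in the answer above.
SLOTS = {"enterprise": 8, "team": 4, "pro": 2, "free": 1}


class PlanQueue:
    """One isolated semaphore per plan, so tiers never share capacity."""

    def __init__(self, slots=SLOTS, wait_timeout=0.05):
        self._pools = {plan: asyncio.Semaphore(n) for plan, n in slots.items()}
        self._wait_timeout = wait_timeout  # hypothetical value, in seconds

    async def run(self, plan, job):
        pool = self._pools[plan]
        try:
            # Wait briefly for a slot instead of rejecting immediately.
            await asyncio.wait_for(pool.acquire(), timeout=self._wait_timeout)
        except asyncio.TimeoutError:
            # Only after the wait expires: 503 plus a Retry-After hint.
            return 503, {"Retry-After": "1"}, None
        try:
            return 200, {}, await job()
        finally:
            pool.release()


async def demo():
    queue = PlanQueue()

    async def slow():
        await asyncio.sleep(0.2)
        return "done"

    async def fast():
        return "ok"

    # Saturate the single Free slot, then send one more Free request
    # and one Enterprise request while it is still busy.
    hog = asyncio.create_task(queue.run("free", slow))
    await asyncio.sleep(0.01)
    free_result = await queue.run("free", fast)
    enterprise_result = await queue.run("enterprise", fast)
    await hog
    return free_result, enterprise_result


result = asyncio.run(demo())
print(result)
```

In this sketch the second Free request times out and receives a 503, while the Enterprise request completes immediately because it draws from a separate pool.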

Is gotcontext.ai self-hostable for enterprise?

Yes. Enterprise customers get a Docker image that runs entirely in your VPC, with OIDC federation (Okta, Auth0, Azure AD, Keycloak), audit-log export (NDJSON or CSV) for SOC2 evidence pipelines, Ed25519-signed license JWTs, SSO/SAML, dedicated SLA with uptime credits, named CSM with 4h critical response, and a custom MSA / DPA / IP indemnity. Usage metering reports back to a control plane you choose, including air-gapped operation. Enterprise differs from Pro/Team on compliance, deployment, support, and contract terms — all three paid tiers include the same 140+ MCP tools and the same compression engine.

MCP Gateway

50% smaller tool responses on average. One bearer token.

Point your MCP client at https://api.gotcontext.ai/mcp, add your key, and every tool response is compressed before your agent reads it. Works with Claude Code, Cursor, Codex, and Gemini CLI. 140+ tools.

Get your free API key

Free tier: 1,000 compressions/month, no credit card.

Read the docs →

Try it live

50% typical token saving (live /v1/global-savings runs higher)

87.4% benchmark peak on large docs

Run the open harness yourself →

Self-hosted Docker available

Exportable audit logs (NDJSON/CSV)

OIDC / SSO / DPA

Try it. No signup.

# Service Operations Runbook: Payments API

## Purpose and scope

This runbook covers the payments-api service: what it does, how it is deployed, what its dependencies are, and what to do when it misbehaves. It is written for the on-call engineer. Every procedure here assumes you have production read access and the ability to trigger a deploy through the standard pipeline. Nothing in this document requires direct database write access, and no procedure here should be improvised under pressure: if the situation is not covered, page the service owner rather than inventing a fix at 3am.

The payments-api accepts charge requests from the checkout frontend, validates them against the pricing catalog, forwards them to the payment processor, and records the outcome in the orders database. It is the only service permitted to talk to the processor. Average traffic is steady during business hours with a daily peak around 19:00 UTC and a weekly peak on Friday evenings.

## Architecture and dependencies

The service runs as three replicas behind the regional load balancer. Each replica is stateless; all persistent state lives in the orders database and the idempotency-key store. The service depends on four things: the orders database (primary and one read replica), the idempotency-key store, the pricing catalog service, and the external payment processor. Of these, only the processor is outside our control.

Dependency failure behavior is deliberate and asymmetric. If the pricing catalog is unreachable, the service serves prices from its local cache for up to ten minutes and emits a degraded-mode metric. If the idempotency store is unreachable, the service refuses new charges entirely, because accepting a charge without idempotency protection risks double-billing, and double-billing is strictly worse than downtime. If the processor times out, the charge is recorded as pending and a reconciliation job resolves it within the hour.

## Deployment

Deploys go through the standard pipeline: merge to main, automated tests, staging deploy, a thirty-minute soak with synthetic checkout traffic, then production rollout one replica at a time. The pipeline aborts automatically if the error rate on the new replica exceeds the old baseline. A full rollout takes about twenty minutes. Rollback is the same pipeline in reverse and takes about six minutes; the on-call engineer can trigger it without approval. Database migrations ship separately from code, are always backwards-compatible for at least one release, and run before the code that needs them.

## Monitoring and alerts

Three alerts page the on-call engineer. High charge failure rate fires when more than two percent of charge attempts fail over five minutes; the usual causes are a processor incident or a bad deploy, in that order. Idempotency store unavailable fires immediately on connection failure, because the service is refusing charges while it is down. Reconciliation backlog fires when pending charges older than ninety minutes accumulate, which usually means the reconciliation job is stuck rather than the processor being slow.

Two further signals warn without paging: elevated latency on the catalog cache path, and a rising rate of declined cards, which is almost always upstream issuer behavior rather than anything on our side. Dashboards live in the standard observability stack; the service overview board links to per-dependency drill-downs.

## Common incidents and procedures

Processor outage: confirm against the processor status page first. Do not roll back; the deploy is not the cause. Charges queue as pending and reconcile automatically when the processor recovers. Communicate the expected resolution path to support so they can answer customer tickets accurately.

Bad deploy: the symptom is a failure-rate alert that begins within minutes of a rollout completing. Trigger rollback first and investigate second; the six-minute rollback is always cheaper than debugging in production. Capture the failing release tag in the incident channel before it scrolls away.

Idempotency store failure: this is the highest-urgency scenario because the service is intentionally refusing all new charges. Verify whether the store itself is down or the network path to it is broken. The store runs with a replica that takes over automatically; if failover has not happened within two minutes, force it using the documented store-failover procedure, then watch the charge acceptance rate recover.

Reconciliation backlog: restart the reconciliation job first; a stuck job accounts for nearly every historical occurrence. If the backlog continues to grow after a restart, check whether the processor is returning errors on the reconciliation endpoint specifically, which has happened during their maintenance windows even when charging worked normally.

## Post-incident

Every paging incident gets a written timeline within one business day, while details are fresh. The timeline records when the alert fired, what was tried, what worked, and what the customer-visible impact was. Action items get owners and dates, and the runbook section that failed to cover the incident gets updated in the same pull request as the timeline. A runbook that does not absorb its incidents is decoration, not documentation.

Compatible with

Claude Code

Cursor

Gemini CLI

Codex

Windsurf

VS Code

Compression Pipeline

How a response gets compressed.

Same input, same output, every run. Four steps, no model in the loop. The output is a re-ranking of your own sentences. Every token in the compressed response appears in the original.

Step 1: Ingest

Document Analysis

Text chunked, analyzed, and scored semantically. Compression graph assembled.

Step 2: Rank

PageRank Scoring

Graph edges weighted by semantic similarity. Importance propagated through the network.

Step 3: Extract

Ranked extract (not generated)

Top-ranked nodes form the compressed output. Every output token appears in your input. Target ratio controls fidelity.

Step 4: Deliver

Return to MCP client

Compressed output returned to your AI tool, typically ~50% smaller on production traffic (87.4% on benchmark peak). Expandable on demand.
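
The four steps above resemble classic extractive ranking, and a toy version fits in a few dozen lines. This is a sketch under stated assumptions, not the production engine: it uses bag-of-words cosine similarity where the real pipeline presumably uses embeddings, and the function names, the damping factor of 0.85, and the reading of `ratio` as the fraction of sentences kept are all illustrative.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Step 1 (ingest): naive chunking on sentence-ending punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine(a, b):
    # Similarity between two word-count vectors (stand-in for embeddings).
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    # Step 2 (rank): weighted PageRank by fixed-count power iteration.
    n = len(weights)
    scores = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping
            * sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


def compress(text, ratio=0.5):
    sentences = split_sentences(text)
    if not sentences:
        return ""
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    weights = [
        [cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    # Step 3 (extract): keep the top-ranked sentences, ties broken by position.
    keep = max(1, round(n * ratio))
    top = sorted(range(n), key=lambda i: (-scores[i], i))[:keep]
    # Step 4 (deliver): emit survivors verbatim, in their original order.
    return " ".join(sentences[i] for i in sorted(top))


sample = (
    "Cats chase mice. Cats chase birds in the garden. "
    "The stock market fell today. Dogs chase cats."
)
print(compress(sample, ratio=0.5))
```

Because nothing here is sampled or generated, the same input always yields the same output, and every sentence in the result is copied unchanged from the input.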

How It Works

Why the output is auditable.

Not a summary. A re-ranking of your own sentences. Documents are chunked, embedded, and scored on a semantic graph; only the highest-ranked nodes survive into the output. Typically ~50% smaller on production traffic. Methodology and benchmark peak are in the measurement section.

Three compression modes: fast / balanced / SBERT
AST-aware code compression for 7+ languages
Per-workspace key scoping. Keys cannot read across workspaces.
Command Palette: Cmd+K navigation, G+D shortcuts, full-text search
GitHub Integration: token-savings summaries posted on your pull requests
Real-Time Queue Monitor: live SSE streaming for batch jobs
Roles: Owner, Admin, Member, Viewer. Shared projects, activity feed.
Stacks with native prompt caching (Anthropic / OpenAI / Gemini). When both apply, total input-cost reduction can reach 95%. See methodology.
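
One way a combined figure near 95% could arise is simple multiplication of the two discounts. The numbers below are illustrative assumptions, not the site's methodology: compression keeps half of the input tokens, and cached input tokens are billed at one tenth of the normal input price.

```python
def combined_input_saving(tokens_kept, cached_price_fraction, cached_share=1.0):
    """Fraction of input cost removed when compression and caching both apply.

    tokens_kept: fraction of tokens remaining after compression (assumed).
    cached_price_fraction: cached-token price relative to normal price (assumed).
    cached_share: fraction of the remaining tokens served from cache (assumed).
    """
    per_token_cost = cached_share * cached_price_fraction + (1 - cached_share)
    return 1 - tokens_kept * per_token_cost


print(f"{combined_input_saving(0.5, 0.1):.0%}")  # prints 95%
```

With no cache hits (`cached_share=0.0`) the same function returns the compression saving alone, which is why the 95% figure is an upper bound rather than a typical result.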

COMPRESSION

Semantic Graph

PageRank-based importance scoring

50% live average

Developer First

140+ MCP tools behind one endpoint.

Works with any MCP-compatible client. Claude Code, Codex, Gemini CLI, Cursor, VS Code. One command configures the MCP server. No JSON editing required.

CWE-22 path traversal prevention on all file I/O
Async batch ingest: 4× throughput
Prometheus metrics, OpenTelemetry tracing, health checks
gc_compress_manifest shrinks MCP tool-description bloat

View API Docs

One-command setup

1. Get a free key from the dashboard
2. Run the CLI. It prompts for your key
3. Restart your CLI

$ npx gotcontext wrap claude

Also: npx gotcontext doctor — shows which CLIs are detected and configured.

Prefer manual JSON config?

.mcp.json

{
  "mcpServers": {
    "gotcontext": {
      "url": "https://api.gotcontext.ai/mcp",
      "headers": {
        "Authorization": "Bearer gc_your_key_here"
      }
    }
  }
}
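
If a project already has a .mcp.json with other servers in it, the entry above can be merged in rather than pasted over the file. The helper below is a hypothetical sketch (`add_gotcontext` is not part of the gotcontext CLI); it only reproduces the JSON shape shown above and leaves existing entries untouched.

```python
import json


def add_gotcontext(config, api_key):
    """Return a copy of an .mcp.json dict with the gotcontext entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["gotcontext"] = {
        "url": "https://api.gotcontext.ai/mcp",
        "headers": {"Authorization": f"Bearer {api_key}"},
    }
    merged["mcpServers"] = servers
    return merged


# "other" is a placeholder for whatever servers are already configured.
existing = {"mcpServers": {"other": {"url": "https://example.com/mcp"}}}
updated = add_gotcontext(existing, "gc_your_key_here")
print(json.dumps(updated, indent=2))
```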

terminal

live

# 1. Get a free API key at gotcontext.ai/sign-up

# 2. Point your AI tool at our MCP endpoint:

https://api.gotcontext.ai/mcp

Authorization: Bearer gc_your_key

# 3. Call tools naturally (Claude Code / Cursor / etc):

> ingest_context(file_id="api.md", content="...")

> read_skeleton(file_id="api.md", ratio=0.15)

# Result: 485 → 61 tokens (87.4% reduction)
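
For readers curious what an MCP client sends under the hood, the sketch below builds (but does not send) a JSON-RPC 2.0 `tools/call` request for the `read_skeleton` call shown above. The `build_tool_call` helper is hypothetical. A real MCP session also starts with an `initialize` handshake that MCP clients perform automatically, and whether the gateway accepts this exact request in isolation is an assumption.

```python
import json
import urllib.request

MCP_URL = "https://api.gotcontext.ai/mcp"  # endpoint from the text above


def build_tool_call(api_key, tool, arguments, request_id=1):
    """Build a JSON-RPC 2.0 tools/call request without sending it."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


req = build_tool_call(
    "gc_your_key", "read_skeleton", {"file_id": "api.md", "ratio": 0.15}
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```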

50%

Live avg compression

<90ms

p95 pipeline latency

140+

MCP Tools

Try it now

Paste any text and see how much you can save. No signup required.

Text is processed in-memory and is not stored, logged with PII, or used for training. Do not paste secrets or production credentials. Privacy details →

Code compression

Your codebase is the biggest thing your agent reads.

Text summarizers compress prose. We compress code, AST-aware and structure-preserving, at 10-11× on real source files. An agent that reads mcp_gateway.py in full spends 20,076 words of context. With read_skeleton it gets a faithful structural skeleton for 1,935 words and drills into any function on demand. Same answerable questions, one-tenth the context budget.

Live dogfood: our own gateway

10.38×

compression on api/app/mcp_gateway.py (~5.5K LoC, production)

read_file: 20,076 words

read_skeleton: 1,935 words

The agent gets a structural skeleton (every function signature, every class, every import), then calls modulate_region to expand any section it needs. gc_blast_radius gives ranked context for a specific symbol. compress_codebase produces an AST digest of a whole directory.
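
To make the idea of a structural skeleton concrete, here is a minimal sketch for Python source using the standard `ast` module. It is not how `read_skeleton` is implemented: the `skeleton` function is hypothetical, handles Python only, and skips functions nested inside other functions. It requires Python 3.9 or later for `ast.unparse`.

```python
import ast


def skeleton(source):
    """Keep imports, class headers, and function signatures; drop bodies."""
    lines = []

    def visit(node, depth):
        pad = "    " * depth
        for child in node.body:
            if isinstance(child, (ast.Import, ast.ImportFrom)):
                lines.append(pad + ast.unparse(child))
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                keyword = (
                    "async def" if isinstance(child, ast.AsyncFunctionDef) else "def"
                )
                returns = f" -> {ast.unparse(child.returns)}" if child.returns else ""
                args = ast.unparse(child.args)
                lines.append(f"{pad}{keyword} {child.name}({args}){returns}: ...")
            elif isinstance(child, ast.ClassDef):
                bases = ", ".join(ast.unparse(b) for b in child.bases)
                header = f"class {child.name}({bases}):" if bases else f"class {child.name}:"
                lines.append(pad + header)
                visit(child, depth + 1)  # recurse into methods and nested classes

    visit(ast.parse(source), 0)
    return "\n".join(lines)


example = '''
import os

class Gateway(Base):
    def handle(self, request: dict) -> str:
        data = os.environ.get("KEY", "")
        return str(request) + data

async def main():
    pass
'''
print(skeleton(example))
```

The output keeps every signature an agent needs to decide where to look next, while the function bodies, which usually dominate the file size, are left out until requested.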

read_skeleton · modulate_region · gc_blast_radius · compress_codebase

Benchmark results (CI-locked)

| Input | Original | Skeleton | Ratio | Saved |
|---|---|---|---|---|
| Code file (real source) | 3,390 | 299 | 11.34× | 91.2% |
| api/app/mcp_gateway.py (dogfood) | 20,076 | 1,935 | 10.38× | 90.4% |
| Large doc (~2.3K words) | 7,173 | 752 | 9.54× | 89.5% |

Compression is size-dependent. Small files compress little. The engine keeps them faithful. The ratios above are on large files where agents actually struggle to fit the full source into context.
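
The Ratio and Saved columns in the benchmark results are two views of the same pair of numbers: ratio is original divided by skeleton, and saved is one minus the inverse. A short check, using only the Original and Skeleton figures from the table (the function name is illustrative):

```python
def ratio_and_saved(original, skeleton):
    """Return (compression ratio, fraction of input saved)."""
    return original / skeleton, 1 - skeleton / original


rows = {
    "Code file (real source)": (3390, 299),
    "api/app/mcp_gateway.py": (20076, 1935),
    "Large doc (~2.3K words)": (7173, 752),
}

for name, (original, skeleton) in rows.items():
    ratio, saved = ratio_and_saved(original, skeleton)
    print(f"{name}: {ratio:.2f}x, {saved:.1%} saved")
```

Running this reproduces the published columns, which also shows why a 10× ratio and a 90% saving are the same claim.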

Read the docs

How we measure

How the numbers are measured.

Two sources: the live API, and an open benchmark you can run. The hero number is a conservative ~50% typical saving; the live production average is at the /v1/global-savings endpoint. The benchmark peak below is from the open-source harness. Run it yourself; the numbers will be identical. Code context is powered by tensor-grep (open source, Apache-2.0).

87.4%

Benchmark peak: large-document workloads

Peak on long-form documents (API specs, codebases, research papers). The headline is a conservative ~50% typical saving; the live production average runs higher. Both the typical figure and the peak are real; the difference is workload mix.

View public benchmarks

140+

MCP tools

Claude, GPT, Gemini, Codex.

<90ms

Pipeline latency

Ingest → compress → return, p95.

CLI integrations

Claude Code, Cursor, Gemini CLI, Codex, Windsurf.

Pricing

Pay for tool calls. Compression is included.

Every MCP tool response is compressed before it returns to your agent, so each call delivers more context per token. The multiplier scales with the live compression ratio (see hero). Covers solo developers to enterprise teams.

Free

$0/month

Free tier

No credit card. Built for evaluation and side projects.

1,000 compressions/month
100KB max document
Standard compression
Command Palette & shortcuts
Activity Feed
Dark/Light theme
Community support

Start free: 1,000 compressions/mo

Pro

$49/mo

For individual developers

All 140+ MCP tools, accelerated compression, priority queue with 2 reserved compression slots.

50,000 compressions/month
All 140+ MCP tools (incl. ACE, knowledge mgmt, multimodal)
Priority queue: 2 concurrent compression slots
1MB max document
Accelerated compression (3-5x faster)
Queue Monitor (real-time SSE)
Usage analytics
Webhook Notifications
Priority support

Start Pro Plan

Business

$199/mo

Shared infra with exportable audit logging, OIDC/SSO, and DPA

Self-hosted Docker, OIDC/SSO, audit-log export for compliance reviews, SBERT embeddings, named Customer Success Manager.

500,000+ compressions / month
All 140+ MCP tools
Priority queue: 8 concurrent compression slots
Self-hosted Docker (run in your VPC)
OIDC federation (Okta, Auth0, Azure AD)
Audit-log export (NDJSON/CSV) for compliance reviews
SBERT embeddings (higher fidelity than the default MiniLM tier)
SSO / SAML
Email support · SLA on request (custom MSA)
DPA / IP indemnity / custom MSA

Contact Sales

See full plan comparison (Free · Pro · Team · Enterprise)

FROM THE BLOG

What we're measuring and writing about context

Read the blog →

Research · 2026-07-08

Token-level compression breaks agents at any ratio

A 2026 study ran standard token-pruning compressors on LLM agents across 17 configurations. Every one collapsed to near-zero reward, from 1.3x to 13.3x compression. The tokens a compressor drops first are the ones agents need most.

Read →

Research · 2026-07-08

Compressing prompts harder made them more expensive

A pre-registered randomized trial compressed production agent prompts at three retention rates. Keeping half the tokens cut total cost 27.9%. Keeping a fifth raised cost 1.8%, because the model wrote longer outputs.

Read →

Research · 2026-07-08

When to compact matters more than how much

JHU and Apple researchers let the model decide when to summarize its own context, guided by a short rubric. Up to 18.1 points better on math and 5 to 9 on agentic search than never summarizing, at 30 to 70% lower cost.

Read →

Start today

Start free.

1,000 compressions/month, all 140+ tools, no credit card.

Create free account

Read the Docs