gotcontext + TokenSpeed: stack input compression with TRT-LLM-class inference for self-hosters

Two layers of the stack you can squeeze independently ¶

Most LLM token-cost articles assume you're calling someone else's API: Anthropic, OpenAI, Google. If that's you, your inference engine is whatever the provider runs and you can't change it. Compression at the input layer (gotcontext) is the only knob you have.

But if you're self-hosting open models (Llama 4, Qwen3, Mixtral, Kimi K2.5, DeepSeek V4) there are two distinct knobs:

Input layer: how many tokens reach the GPU per request

Inference layer: how fast the GPU processes each token

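The two compose multiplicatively. If GPU cost per request scales with tokens processed divided by throughput, halving input tokens and doubling throughput gives 0.5 × 0.5 = 0.25× the original cost, a ~75% real-cost reduction. They live at different layers of the agent stack and don't conflict.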

This week the LightSeek Foundation released TokenSpeed, an MIT-licensed open-source LLM inference engine targeting TensorRT-LLM-level performance. According to LightSeek's own announcement, its Multi-head Latent Attention (MLA) kernel has been adopted by vLLM. For self-hosters, this is the first credible open-source replacement for TRT-LLM that's also hand-tuned for agentic workload patterns (long input, short output, high concurrency).

The two stack like this:

Layer	Tool	What it optimizes
Application / agent	(your code)	n/a
Input preprocessing	gotcontext	tokens reaching the GPU
Provider boundary	(your endpoint)	n/a
Inference engine	TokenSpeed	tokens/second from the GPU
GPU	(NVIDIA Hopper, Blackwell)	n/a

Where each one operates ¶

	gotcontext	TokenSpeed
Domain	Documentation, KB, conversation history, tool output	GPU-side token generation
Method	Semantic graph + PageRank importance scoring + structural chunking	Per-GPU throughput optimization, FSM-based KV cache safety
Architecture	API + MCP gateway in front of your inference endpoint	Drop-in replacement for TensorRT-LLM behind your endpoint
Where it sits	Between agent and inference endpoint	Between endpoint and GPU
License	Proprietary (free + paid plans)	MIT
Setup	MCP config + API key	Replace TRT-LLM in your serving stack
Scope	Cuts what arrives	Speeds up what's processed

gotcontext rewrites prompt content before any inference engine sees it. TokenSpeed is the inference engine. They cannot conflict: gotcontext doesn't run on a GPU, and TokenSpeed doesn't read documents.

The math, joint impact ¶

The savings compound. Concrete example for a self-hosted agentic workload:

Workload	Input tokens / req	Inference tok/sec	Cost per request
Baseline (vLLM default + uncompressed)	12,000	45	1.0×
+ gotcontext (input compression ~3×)	4,000	45	0.33×
+ TokenSpeed (~2× tok/sec on agentic patterns)	4,000	90	0.17×

~83% real-cost reduction. The two interventions are independent: gotcontext doesn't care which inference engine consumes its compressed output; TokenSpeed doesn't care whether the input was compressed before it arrived. Stacking them is multiplicative, not additive.
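The arithmetic in the table can be reproduced with a small model. This is a sketch under a simplifying assumption: cost per request is proportional to GPU time, modeled as tokens processed divided by tokens per second. The function name `relative_cost` and its parameters are illustrative, and the defaults are the baseline row of the table.

```python
def relative_cost(
    input_tokens: float,
    tok_per_sec: float,
    base_tokens: float = 12_000,
    base_tok_per_sec: float = 45,
) -> float:
    """Cost of a request relative to the baseline row of the table.

    Assumption: cost is proportional to GPU time, i.e. tokens / throughput.
    """
    gpu_time = input_tokens / tok_per_sec
    base_time = base_tokens / base_tok_per_sec
    return gpu_time / base_time


# The three rows of the table: baseline, compressed input, compressed + faster engine.
rows = {
    "baseline": relative_cost(12_000, 45),
    "+ gotcontext": relative_cost(4_000, 45),
    "+ TokenSpeed": relative_cost(4_000, 90),
}

for name, cost in rows.items():
    print(f"{name:14s} {cost:.2f}x  ({(1 - cost) * 100:.0f}% reduction)")
```

Plugging in your own token counts and measured throughput shows how much of the saving comes from each layer. Real costs also depend on output length, batching, and GPU utilization, which this model ignores.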

Setup: roughly 1 hour for both, end to end ¶

gotcontext

Add to your Claude Code MCP config (~/.claude/claude_desktop_config.json):

```json
{
  "mcpServers": {
    "gotcontext": {
      "url": "https://api.gotcontext.ai/mcp",
      "headers": {
        "Authorization": "Bearer gc_live_YOUR_KEY"
      }
    }
  }
}
```

Get a key at gotcontext.ai/sign-up. The free tier covers 1,000 compressions/month, no card required. Setup takes about 30 seconds.
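If you would rather script the change than hand-edit the file, the same entry can be merged into an existing config without disturbing other servers. This is a minimal sketch using only the Python standard library. The helper name `add_mcp_server` is illustrative and not part of any gotcontext tooling; the URL and key placeholder are the ones shown above.

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, url: str, api_key: str) -> dict:
    """Insert or update one entry under "mcpServers", keeping everything else."""
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():  # tolerate an empty file
            config = json.loads(text)

    # Create the "mcpServers" object if it is missing, then set our entry.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {
        "url": url,
        "headers": {"Authorization": f"Bearer {api_key}"},
    }

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling `add_mcp_server(Path.home() / ".claude" / "claude_desktop_config.json", "gotcontext", "https://api.gotcontext.ai/mcp", "gc_live_YOUR_KEY")` writes the same structure as the JSON above while keeping any servers already configured.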

TokenSpeed

For self-hosters running NVIDIA Hopper or Blackwell GPUs, TokenSpeed is a TRT-LLM replacement. The MLA kernel has already landed in vLLM, so depending on your engine choice you may already get partial benefit. But the standalone TokenSpeed runtime is where the agentic-workload tuning lives.

Browse the project, install via the project's documented path, point your inference endpoint at TokenSpeed instead of TRT-LLM, restart your serving deployment. ~1 hour for the full migration on a typical setup. License is MIT. Vendor relationship is "open-source library" not "managed service."
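Before and after the migration, it is worth measuring throughput on your own traffic rather than relying on a headline multiplier. The sketch below is engine-agnostic: `generate` is a hypothetical callable you supply that sends one prompt to your endpoint and returns the number of output tokens produced. Nothing here is a TokenSpeed, vLLM, or TRT-LLM API.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def measure_tokens_per_second(
    generate: Callable[[str], int],
    prompts: Sequence[str],
    concurrency: int = 8,
) -> float:
    """Aggregate output tokens per wall-clock second across concurrent requests.

    `generate` must block until the response is complete and return the
    number of tokens generated for that prompt.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(generate, prompts))
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("elapsed time was not positive; cannot compute a rate")
    return sum(token_counts) / elapsed
```

Run it with the same prompts and the same concurrency against the old and new serving stacks, and compare the two numbers. Using prompts sampled from real agent traffic (long input, short output) keeps the comparison representative of the workload the post describes.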

Why we're recommending an inference engine ¶

gotcontext doesn't run inference. We're an API + MCP layer that sits between your agent and your model endpoint. TokenSpeed doesn't run preprocessing. It's a GPU runtime that takes whatever tokens arrive and processes them as fast as possible.

These tools cannot replace each other. A customer who runs only gotcontext gets the input-side win but is leaving inference performance on the table if they're self-hosting. A customer who runs only TokenSpeed gets faster inference on their existing token volume but is paying for tokens they didn't need to send.

The use case where this matters most is enterprise self-hosters: companies running open models on their own NVIDIA hardware, not API customers. Those teams typically have a serving stack (vLLM, TGI, TRT-LLM today) and an application stack. gotcontext slots into the application stack; TokenSpeed slots into the serving stack. No team-boundary friction, no integration work between the two products.

Operational notes ¶

gotcontext is a remote API. Content is sent to our servers for compression. If your KB is sensitive, decide source by source what may leave your infrastructure. Self-hosting gotcontext is available on the enterprise plan.

TokenSpeed is local. Runs on your GPUs. No data leaves your infrastructure.

Free tiers exist for both. gotcontext: 1,000 compressions/month, no card. TokenSpeed: MIT, no fee.

TokenSpeed is not relevant to API customers. If you're calling Claude or GPT-4 via Anthropic's or OpenAI's API, you can't choose your inference engine; that's the provider's concern. TokenSpeed doesn't help there. gotcontext does.

vLLM users get a partial win. TokenSpeed's MLA kernel landed in vLLM upstream. If you're already running vLLM with MLA-capable models (Kimi, DeepSeek V4 Pro), you're picking up some of the throughput gain without migrating.

TL;DR ¶

gotcontext = input-layer compression (cuts what reaches the GPU)

TokenSpeed = inference-engine optimization (speeds up what the GPU processes)

Different layers, no conflict, multiplicative savings

Joint reduction: ~83% real-cost on self-hosted agentic workloads (in the worked example above)

gotcontext: ~30 second setup. TokenSpeed: ~1 hour migration

API customers (Claude, GPT-4): only gotcontext applies

Self-hosters (Llama, Qwen, Kimi, DeepSeek): install both

Cite this ¶

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{tokenspeed-companion-self-hosters-2026,
  title  = {gotcontext + TokenSpeed: stack input compression with TRT-LLM-class inference for self-hosters},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://gotcontext.ai/blog/tokenspeed-companion-self-hosters},
  note   = {gotcontext.ai engineering blog.},
}

APA

Hollingsworth, J. (2026, May 9). gotcontext + TokenSpeed: stack input compression with TRT-LLM-class inference for self-hosters. gotcontext.ai. https://gotcontext.ai/blog/tokenspeed-companion-self-hosters
