gotcontext + TokenSpeed: stack input compression with TRT-LLM-class inference for self-hosters

TokenSpeed (Lightseek Foundation, MIT, May 2026) is the first open-source LLM inference engine targeting TensorRT-LLM-level performance for agentic workloads. It sits at a different layer from gotcontext: gotcontext compresses what reaches the GPU, while TokenSpeed speeds up what the GPU processes. For self-hosters running open models, the two stack multiplicatively for a ~80% real-cost reduction.

James Hollingsworth (Contributor) | 6 min read | ~905 words

Two layers of the stack you can squeeze independently

Most LLM token-cost articles assume you're calling someone else's API: Anthropic, OpenAI, Google. If that's you, your inference engine is whatever the provider runs and you can't change it. Compression at the input layer (gotcontext) is the only knob you have.

But if you're self-hosting open models (Llama 4, Qwen3, Mixtral, Kimi K2.5, DeepSeek V4), there are two distinct knobs:

  • Input layer: how many tokens reach the GPU per request
  • Inference layer: how fast the GPU processes each token

The two compose independently. Halving input tokens and doubling per-token throughput each halve the cost, leaving 0.5 × 0.5 = 0.25 of the baseline: a ~75% real-cost reduction. They live at different layers of the agent stack and don't conflict.
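
As a worked version of that arithmetic, here is a sketch under one simplifying assumption: cost per request $C$ scales with GPU time, i.e. input tokens $N$ divided by throughput $R$. The symbols are introduced here for illustration, and this is not a measured cost model.

```latex
\[
C \propto \frac{N}{R}
\qquad\Longrightarrow\qquad
\frac{C'}{C} = \frac{N'}{N}\cdot\frac{R}{R'}
             = \frac{1}{2}\cdot\frac{1}{2}
             = \frac{1}{4},
\qquad
1 - \frac{C'}{C} = 75\%.
\]
```

Because the two ratios multiply, improving either one never reduces the benefit of the other.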

    This week the Lightseek Foundation released TokenSpeed, an MIT-licensed open-source LLM inference engine targeting TensorRT-LLM-level performance. According to the foundation's own announcement, its Multi-head Latent Attention (MLA) kernel has been adopted by vLLM. For self-hosters, this is the first credible open-source replacement for TRT-LLM that's also hand-tuned for agentic workload patterns (long input, short output, high concurrency).

    The two stack like this:

    | Layer | Tool | What it optimizes |
    |---|---|---|
    | Application / agent | (your code) | n/a |
    | Input preprocessing | gotcontext | tokens reaching the GPU |
    | Provider boundary | (your endpoint) | n/a |
    | Inference engine | TokenSpeed | tokens/second from the GPU |
    | GPU | (NVIDIA Hopper, Blackwell) | n/a |

    Where each one operates

    | | gotcontext | TokenSpeed |
    |---|---|---|
    | Domain | Documentation, KB, conversation history, tool output | GPU-side token generation |
    | Method | Semantic graph + PageRank importance scoring + structural chunking | Per-GPU throughput optimization, FSM-based KV cache safety |
    | Architecture | API + MCP gateway in front of your inference endpoint | Drop-in replacement for TensorRT-LLM behind your endpoint |
    | Where it sits | Between agent and ingest | Between endpoint and GPU |
    | License | Proprietary (free + paid plans) | MIT |
    | Setup | MCP config + API key | Replace TRT-LLM in your serving stack |
    | Scope | Cuts what arrives | Speeds up what's processed |

    gotcontext rewrites prompt content before any inference engine sees it. TokenSpeed runs the inference engine itself. They literally cannot conflict: gotcontext doesn't run on a GPU; TokenSpeed doesn't read documents.
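
The layering described above can be sketched as two pluggable stages. Everything in this snippet is hypothetical: `Compressor`, `Engine`, `handle_request`, and the toy stand-ins are illustrative names, not part of either product's API. The only point is that each stage can be swapped without touching the other.

```python
from typing import Callable

# Hypothetical stage signatures, introduced only for this sketch.
Compressor = Callable[[str], str]  # input layer: rewrites prompt text
Engine = Callable[[str], str]      # inference layer: prompt in, completion out


def handle_request(prompt: str, compress: Compressor, infer: Engine) -> str:
    """Run the input layer first, then pass its output to the configured engine."""
    return infer(compress(prompt))


def identity(prompt: str) -> str:
    """No-op compressor: models a deployment with no input compression."""
    return prompt


def drop_blank_lines(prompt: str) -> str:
    """Toy stand-in for a compressor; real compression is far more involved."""
    return "\n".join(line for line in prompt.splitlines() if line.strip())


def char_count_engine(prompt: str) -> str:
    """Toy stand-in for an engine: reports how many characters it received."""
    return f"received {len(prompt)} chars"


if __name__ == "__main__":
    prompt = "alpha\n\nbeta"
    print(handle_request(prompt, identity, char_count_engine))
    print(handle_request(prompt, drop_blank_lines, char_count_engine))
```

Swapping `identity` for another compressor, or `char_count_engine` for another engine, changes one argument and nothing else, which is the independence the comparison table describes.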

    The math, joint impact

    The savings compound. Concrete example for a self-hosted agentic workload:

    | Workload | Input tokens / req | Inference tok/sec | Cost per request |
    |---|---|---|---|
    | Baseline (vLLM default + uncompressed) | 12,000 | 45 | 1.0× |
    | + gotcontext (input compression ~3×) | 4,000 | 45 | 0.33× |
    | + TokenSpeed (~2× tok/sec on agentic patterns) | 4,000 | 90 | 0.17× |

    ~83% real-cost reduction. The two interventions are independent: gotcontext doesn't care which inference engine consumes its compressed output; TokenSpeed doesn't care whether the input was compressed before it arrived. Stacking them is multiplicative, not additive.
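
The table's last column can be reproduced with a few lines of Python. This is a minimal sketch that assumes cost per request scales with input tokens divided by throughput, which is the model the example implies rather than a measured one. The function name `relative_cost` is made up for illustration.

```python
def relative_cost(input_tokens: int, tok_per_sec: float,
                  base_tokens: int = 12_000, base_tok_per_sec: float = 45.0) -> float:
    """Cost relative to the baseline row, assuming cost ~ input_tokens / tok_per_sec."""
    return (input_tokens / base_tokens) * (base_tok_per_sec / tok_per_sec)


# Rows taken from the table above: (input tokens per request, tokens per second).
rows = {
    "baseline": (12_000, 45),
    "+ gotcontext": (4_000, 45),
    "+ TokenSpeed": (4_000, 90),
}

if __name__ == "__main__":
    for name, (tokens, rate) in rows.items():
        cost = relative_cost(tokens, rate)
        print(f"{name:<14} {cost:.2f}x  ({(1 - cost) * 100:.0f}% reduction)")
```

Substituting your own measured token counts and throughput for the table's figures gives a quick estimate of what the stacked savings would be for your workload under the same assumption.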

    Setup: roughly 1 hour for both, end to end

    gotcontext

    Add to your Claude MCP config (~/.claude/claude_desktop_config.json here; the exact location depends on your client and OS):

    ```json
    {
      "mcpServers": {
        "gotcontext": {
          "url": "https://api.gotcontext.ai/mcp",
          "headers": {
            "Authorization": "Bearer gc_live_YOUR_KEY"
          }
        }
      }
    }
    ```

    Get a key at gotcontext.ai/sign-up. The free tier covers 1,000 compressions/month with no card required. Setup takes about 30 seconds.
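
If your config file already lists other MCP servers, pasting the snippet over it would drop them. A minimal sketch of merging the entry instead is shown here. The helper names `add_mcp_server` and `update_config_file` are hypothetical, and the entry shape simply mirrors the JSON above; check your client's documentation for the file location and any extra fields it expects.

```python
import json
from pathlib import Path


def add_mcp_server(config: dict, name: str, url: str, api_key: str) -> dict:
    """Return a copy of config with one MCP server entry added or replaced."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # copy so the input is not mutated
    servers[name] = {"url": url, "headers": {"Authorization": f"Bearer {api_key}"}}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path, api_key: str) -> None:
    """Read the JSON config at path (if present), merge the gotcontext entry, write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    updated = add_mcp_server(existing, "gotcontext", "https://api.gotcontext.ai/mcp", api_key)
    path.write_text(json.dumps(updated, indent=2) + "\n")
```

Calling `update_config_file(Path("path/to/your/config.json"), "gc_live_YOUR_KEY")` leaves any existing server entries in place.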

    TokenSpeed

    For self-hosters running NVIDIA Hopper or Blackwell GPUs, TokenSpeed is a TRT-LLM replacement. The MLA kernel has already landed in vLLM, so depending on your engine choice you may already get partial benefit. But the standalone TokenSpeed runtime is where the agentic-workload tuning lives.

    Browse the project, install it via its documented path, point your inference endpoint at TokenSpeed instead of TRT-LLM, and restart your serving deployment. Expect about 1 hour for the full migration on a typical setup. The license is MIT, and the vendor relationship is "open-source library", not "managed service".
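
To check whether a migration actually paid off, it helps to measure throughput before and after with the same prompts. The sketch below is engine-agnostic and assumes nothing about TokenSpeed's interface: `generate` is a hypothetical callable you would write yourself to send one prompt to your serving endpoint and return the number of output tokens produced. It runs prompts sequentially, so it does not capture high-concurrency behavior.

```python
import time
from typing import Callable, Iterable


def measure_tok_per_sec(generate: Callable[[str], int], prompts: Iterable[str],
                        clock: Callable[[], float] = time.perf_counter) -> float:
    """Average output tokens per second across prompts, run one after another.

    generate is a user-supplied (hypothetical) function: prompt in, token count out.
    clock is injectable so the measurement can be tested deterministically.
    """
    total_tokens = 0
    start = clock()
    for prompt in prompts:
        total_tokens += generate(prompt)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / elapsed


def speedup(before: float, after: float) -> float:
    """Ratio of post-migration to pre-migration throughput."""
    return after / before
```

Run it once against the old serving stack and once against the new one, then compare the two rates with `speedup`.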

    Why we're recommending an inference engine

    gotcontext doesn't run inference. We're an API + MCP layer that sits between your agent and your model endpoint. TokenSpeed doesn't run preprocessing. It's a GPU runtime that takes whatever tokens arrive and processes them as fast as possible.

    These tools cannot replace each other. A customer who runs only gotcontext gets the input-side win but is leaving inference performance on the table if they're self-hosting. A customer who runs only TokenSpeed gets faster inference on their existing token volume but is paying for tokens they didn't need to send.

    The use case where this matters most is enterprise self-hosters: companies running open models on their own NVIDIA hardware, not API customers. Those teams typically have a serving stack (vLLM, TGI, TRT-LLM today) and an application stack. gotcontext slots into the application stack; TokenSpeed slots into the serving stack. No team-boundary friction, no integration work between the two products.

    Operational notes

  • gotcontext is a remote API. Content is sent to our servers for compression. If your KB is sensitive, evaluate source by source whether that content may leave your infrastructure. Self-hosting gotcontext is on the enterprise plan.
  • TokenSpeed is local. Runs on your GPUs. No data leaves your infrastructure.
  • Free tiers exist for both. gotcontext: 1,000 compressions/month, no card. TokenSpeed: MIT, no fee.
  • TokenSpeed is not relevant to API customers. If you're calling Claude or GPT-4 via Anthropic's or OpenAI's API, you can't choose your inference engine; that's the provider's problem. TokenSpeed doesn't help there, but gotcontext does.
  • vLLM users get a partial win. TokenSpeed's MLA kernel landed in vLLM upstream. If you're already running vLLM with MLA-capable models (Kimi, DeepSeek V4 Pro), you're picking up some of the throughput gain without migrating.
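
One way to act on the per-source data-flow note above is a simple routing rule: only content from sources you have cleared goes to the remote compressor. This is a sketch under stated assumptions. The `Source` type, its `sensitive` flag, and `prepare_context` are hypothetical, and `remote_compress` stands for whatever client call you use, not a documented gotcontext function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Source:
    name: str
    sensitive: bool  # hypothetical flag, set by whoever owns the data source


def prepare_context(source: Source, text: str,
                    remote_compress: Callable[[str], str]) -> str:
    """Send text to the remote compressor only if the source is not marked sensitive."""
    if source.sensitive:
        return text  # never leaves your infrastructure; passed through uncompressed
    return remote_compress(text)
```

Sensitive sources then forgo the input-side savings but keep the inference-side ones, since the local engine processes them either way.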

    TL;DR

  • gotcontext = input-layer compression (cuts what reaches the GPU)
  • TokenSpeed = inference-engine optimization (speeds up what the GPU processes)
  • Different layers, no conflict, multiplicative savings
  • Joint reduction: ~80% real-cost on self-hosted agentic workloads
  • gotcontext: ~30 second setup. TokenSpeed: ~1 hour migration
  • API customers (Claude, GPT-4): only gotcontext applies
  • Self-hosters (Llama, Qwen, Kimi, DeepSeek): install both

    Cite this

    Researchers, analysts, or journalists referencing this post can use either format below.

    BibTeX:

    ```bibtex
    @misc{tokenspeed-companion-self-hosters-2026,
      title  = {gotcontext + TokenSpeed: stack input compression with TRT-LLM-class inference for self-hosters},
      author = {James Hollingsworth},
      year   = {2026},
      month  = {May},
      url    = {https://www.gotcontext.ai/blog/tokenspeed-companion-self-hosters},
      note   = {gotcontext.ai engineering blog.},
    }
    ```

    APA:

    ```text
    Hollingsworth, J. (2026, May 9). gotcontext + TokenSpeed: stack input compression with TRT-LLM-class inference for self-hosters. gotcontext.ai. https://www.gotcontext.ai/blog/tokenspeed-companion-self-hosters
    ```
