Semantic Caching Can Cut LLM API Calls by 68.8% — But the Threshold Is Everything

What semantic caching actually is ¶

Semantic caching stores LLM responses keyed by embedding vectors instead of exact query strings. When a new query arrives, the system computes its embedding, searches the cache for a similar past query, and returns the cached answer if the similarity exceeds a threshold, without calling the LLM.
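The lookup described above can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the `SemanticCache` class, its linear scan, and the three-dimensional toy vectors are all assumptions made for the example, and a real system would get embeddings from a model and search them with a vector index.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Toy cache: linear scan over stored (embedding, response) pairs."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the best-matching cached response, or None on a miss."""
        best_score, best_response = -1.0, None
        for stored, response in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        # Only serve the cached answer if it clears the threshold.
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0, 0.0], "Refunds are available within 30 days.")  # made-up answer
near_hit = cache.get([0.99, 0.05, 0.0])  # nearly parallel vector: served from cache
clear_miss = cache.get([0.0, 1.0, 0.0])  # orthogonal vector: falls through to the LLM
```

On a miss (`None`), the caller would send the query to the LLM and `put` the new response under the query's embedding.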

The appeal is obvious. Real user queries cluster. "What is the refund policy?" and "How do I get a refund?" are semantically near-identical but string-different. A cache hit on either saves one API call.

arXiv:2411.05276 ("Semantic Caching for AI") measured this on production query logs and found a 68.8% reduction in API calls with a cosine similarity threshold of 0.95. That is the headline number. It is also incomplete without the context that follows.

The threshold problem ¶

A cosine similarity threshold of 0.95 means that any two queries whose embeddings score at least 0.95 against each other are assumed to have the same correct answer. This works when the assumption holds. It breaks silently when it does not.

Consider the pair:

"What are the side effects of ibuprofen at 400mg?"

"What are the side effects of ibuprofen at 800mg?"

With many embedding models, these queries score above 0.95 similarity. The correct answers are different. A cache that returns the 400mg answer for the 800mg query is wrong, and there is no error signal.

The 68.8% reduction requires that your threshold is calibrated to your query distribution. It is not a universal setting.
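One way to check whether a threshold fits a workload is to label a sample of query pairs and count how often a hit would have been wrong. The sketch below assumes you already have such pairs; the `evaluate_threshold` helper and the similarity scores in `sample` are invented for illustration.

```python
from typing import Iterable, Tuple

# Each labeled pair: (cosine similarity of two queries, do they share a correct answer?)
LabeledPair = Tuple[float, bool]


def evaluate_threshold(pairs: Iterable[LabeledPair], threshold: float) -> Tuple[float, float]:
    """Return (hit_rate, false_hit_rate) for one threshold.

    hit_rate: share of all pairs that would be served from cache.
    false_hit_rate: share of those hits whose correct answers actually differ.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0, 0.0
    hits = [same for sim, same in pairs if sim >= threshold]
    hit_rate = len(hits) / len(pairs)
    false_hit_rate = hits.count(False) / len(hits) if hits else 0.0
    return hit_rate, false_hit_rate


# Made-up scores for illustration only.
sample = [
    (0.99, True),   # paraphrases of the same question
    (0.97, False),  # e.g. same drug, different dose
    (0.96, True),
    (0.91, True),
    (0.80, False),
]
hit_rate, false_hit_rate = evaluate_threshold(sample, 0.95)
```

In this toy sample, three of five pairs clear 0.95 and one of those three is a false hit, which is the silent failure described above.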

Category-aware thresholds ¶

arXiv:2510.26835 ("Category-Aware Semantic Caching") extended the baseline work by measuring cache hit rates across query categories at different thresholds:

Query category	Hit rate at threshold 0.95	Threshold for equivalent accuracy
Factual / definitional	40 to 60%	0.90 to 0.92
Conversational / chitchat	55 to 70%	0.88 to 0.92
Code / technical	5 to 15%	0.97 to 0.99
Math / calculation	2 to 8%	0.99+

Code queries have low cache utility because small syntactic differences produce semantically distinct queries that embed closely. "Sort a list in Python" and "Sort a list in Python in descending order" score high similarity but need different answers.

The paper found that a uniform threshold across all query types underperforms category-aware thresholds by 15 to 30% on precision. Applying a 0.95 threshold to code queries produces too many false hits. Applying the same threshold to factual queries leaves cache hits on the table.
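A category-aware check can be as simple as a lookup table. In this sketch the threshold values are picked from inside the ranges in the table above purely as placeholders, and the fallback to the strictest threshold for unknown categories is an assumption of the example, not something the paper prescribes.

```python
# Illustrative thresholds chosen from inside the table's ranges; not calibrated values.
CATEGORY_THRESHOLDS = {
    "factual": 0.92,
    "conversational": 0.92,
    "code": 0.98,
    "math": 0.99,
}

# Assumption of this sketch: unknown categories get the strictest threshold.
DEFAULT_THRESHOLD = max(CATEGORY_THRESHOLDS.values())


def is_cache_hit(category: str, similarity: float) -> bool:
    """Accept a cached answer only if it clears its category's threshold."""
    return similarity >= CATEGORY_THRESHOLDS.get(category, DEFAULT_THRESHOLD)
```

With these placeholder values, a similarity of 0.95 counts as a hit for a factual query but a miss for a code query.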

Implementation implications ¶

The operational requirements for semantic caching that actually works:

Query classification at cache lookup time. Before checking the cache, route the query to a category. This can be a lightweight classifier (a 7B model distilled for intent classification runs in <20ms). Apply the category-specific threshold.

Cache TTL by content type. Factual answers about stable topics (company policies, product descriptions) can cache for days. Answers about current events, prices, or anything time-sensitive need short TTL or explicit invalidation.

Hit/miss logging with human review. The silent failure mode (a wrong cached answer served with full confidence) is only detectable by sampling cache hits and reviewing them. Build this into the system from the start.

Separate cache namespaces by use case. Do not share a cache between a customer support bot and a coding assistant. The query distributions are different enough that a shared cache degrades both.
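Three of the requirements above (TTL by content type, separate namespaces, sampled review of hits) can be sketched together. Everything here is hypothetical: the `NamespacedCache` class, the TTL values, and the exact-key lookup, which stands in for the embedding search only to keep the example short.

```python
import random
import time
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical TTLs in seconds; real values depend on how fast the content changes.
TTL_SECONDS = {"stable": 3 * 24 * 3600, "time_sensitive": 15 * 60}


@dataclass
class Entry:
    response: str
    expires_at: float


class NamespacedCache:
    """Exact-key store used only to show namespaces and TTLs.

    A real semantic cache would look up by embedding similarity instead.
    """

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], Entry] = {}

    def put(self, namespace: str, key: str, response: str,
            content_type: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._store[(namespace, key)] = Entry(response, now + TTL_SECONDS[content_type])

    def get(self, namespace: str, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._store.get((namespace, key))
        if entry is None or entry.expires_at <= now:
            self._store.pop((namespace, key), None)  # drop expired entries lazily
            return None
        return entry.response


def sample_hits_for_review(hit_log: List[dict], rate: float,
                           seed: Optional[int] = None) -> List[dict]:
    """Keep each logged hit with probability `rate` for human review."""
    rng = random.Random(seed)
    return [hit for hit in hit_log if rng.random() < rate]


cache = NamespacedCache()
cache.put("support", "refund-policy", "cached answer", "time_sensitive", now=0.0)
fresh = cache.get("support", "refund-policy", now=60.0)      # within the 15-minute TTL
other_ns = cache.get("coding", "refund-policy", now=60.0)    # different namespace: miss
expired = cache.get("support", "refund-policy", now=901.0)   # past the TTL: miss
```

Passing `now` explicitly is a testing convenience; in normal use both methods fall back to `time.time()`.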

What 68.8% reduction means in dollars ¶

At GPT-4o pricing ($2.50/MTok input, $10/MTok output), a customer support application processing 10,000 queries/day at ~500 tokens average (the costs below are consistent with roughly 500 input and 500 output tokens per query):

Scenario	Daily API calls	Daily cost
No caching	10,000	~$62
68.8% cache hit rate	3,120	~$19
Monthly savings	n/a	~$1,300

The caveat: these numbers assume the 68.8% hit rate is achievable on your query distribution. For code generation or math workloads, the hit rates in the category table are 5 to 15% and 2 to 8% respectively. The savings are correspondingly smaller.
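The table's arithmetic can be reproduced under one assumption that the text does not state outright: each query uses about 500 input tokens and 500 output tokens. The `daily_cost` helper and the 30-day month are likewise choices made for this sketch.

```python
INPUT_PRICE_PER_MTOK = 2.50    # USD, from the text
OUTPUT_PRICE_PER_MTOK = 10.00  # USD, from the text


def daily_cost(calls: int, input_tokens: int = 500, output_tokens: int = 500) -> float:
    """USD per day, assuming every call uses the same token counts."""
    per_call = (
        input_tokens * INPUT_PRICE_PER_MTOK + output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000
    return calls * per_call


queries_per_day = 10_000
hit_rate = 0.688
calls_after_cache = round(queries_per_day * (1 - hit_rate))  # 3,120 calls reach the LLM

baseline = daily_cost(queries_per_day)          # about $62 per day
cached = daily_cost(calls_after_cache)          # about $19 per day
monthly_savings = (baseline - cached) * 30      # about $1,300 over a 30-day month
```

Swapping in a lower `hit_rate` shows how quickly the savings shrink for low-hit workloads. The sketch ignores the cost of computing embeddings and running the cache itself.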

Compounding with context compression ¶

Semantic caching and context compression address different parts of the cost curve.

Caching eliminates repeat API calls entirely. Compression reduces the token cost of calls that do go through, including the cache misses. They compound multiplicatively.

If 68.8% of queries are cache hits (zero token cost), the remaining 31.2% can be compressed before the LLM call. If compression reduces token count by 40%, the effective reduction on the full query volume is: 1 - (0.312 × 0.6) = 81.3% token cost reduction.
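The same calculation as a small function, so other hit rates and compression ratios can be plugged in. The function name is made up for this example, and like the formula above it treats cache hits as costing zero tokens.

```python
def combined_token_reduction(hit_rate: float, compression_ratio: float) -> float:
    """Fraction of baseline token cost removed by caching plus compression.

    compression_ratio is the fraction of tokens removed from each cache miss.
    """
    remaining = (1 - hit_rate) * (1 - compression_ratio)
    return 1 - remaining


reduction = combined_token_reduction(0.688, 0.40)   # 1 - (0.312 * 0.6)
caching_only = combined_token_reduction(0.688, 0.0)
```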

gotcontext handles the compression side via the ingest_context MCP tool. Combined with a semantic cache layer, the two optimizations target distinct cost buckets.

The threshold is a product decision ¶

The right cosine similarity threshold for your application is not 0.95 and it is not any other fixed number. It is determined by:

Your acceptable false-hit rate (wrong cached answer served as correct)

Your query category distribution

Your embedding model and its similarity calibration

Measure these before deploying. The 68.8% reduction figure from arXiv:2411.05276 is achievable, but only after calibration on your specific workload, not out of the box.
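One way to turn an acceptable false-hit rate into a threshold is to sweep candidate values over labeled query pairs and keep the lowest one that stays within budget. This is a sketch under assumptions: `pick_threshold`, the candidate grid, and the sample pairs are all invented, and in practice you would run it once per query category with similarities from your own embedding model.

```python
from typing import List, Optional, Sequence, Tuple

LabeledPair = Tuple[float, bool]  # (similarity, same correct answer?)


def false_hit_rate(pairs: Sequence[LabeledPair], threshold: float) -> float:
    """Share of would-be cache hits whose correct answers differ."""
    hits = [same for sim, same in pairs if sim >= threshold]
    return hits.count(False) / len(hits) if hits else 0.0


def pick_threshold(pairs: Sequence[LabeledPair], candidates: Sequence[float],
                   max_false_hit_rate: float) -> Optional[float]:
    """Lowest candidate whose false-hit rate fits the budget, or None if none does."""
    for threshold in sorted(candidates):
        if false_hit_rate(pairs, threshold) <= max_false_hit_rate:
            return threshold
    return None


# Made-up labeled pairs for illustration only.
pairs: List[LabeledPair] = [
    (0.99, True), (0.98, True), (0.97, False),
    (0.96, True), (0.93, True), (0.91, False),
]
candidates = [0.90, 0.92, 0.95, 0.98]

lenient = pick_threshold(pairs, candidates, max_false_hit_rate=0.25)
strict = pick_threshold(pairs, candidates, max_false_hit_rate=0.0)
```

Choosing the lowest threshold that fits the budget keeps the hit rate as high as the budget allows. A tighter budget pushes the threshold up and the hit rate down, which is why the choice is a product decision.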


Cite this ¶

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{semantic-caching-what-the-research-actually-says-about-similarity-thresholds-2026,
  title  = {Semantic Caching Can Cut LLM API Calls by 68.8%. But the Threshold Is Everything},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://gotcontext.ai/blog/semantic-caching-what-the-research-actually-says-about-similarity-thresholds},
  note   = {gotcontext.ai engineering blog.},
}

APA

James Hollingsworth. (2026, May 8). Semantic Caching Can Cut LLM API Calls by 68.8%. But the Threshold Is Everything. gotcontext.ai. Retrieved from https://gotcontext.ai/blog/semantic-caching-what-the-research-actually-says-about-similarity-thresholds.
