Semantic Caching Can Cut LLM API Calls by 68.8% — But the Threshold Is Everything
Research shows semantic caching with cosine similarity matching slashes API costs dramatically. Two papers explain why a single threshold value is the difference between a cache that helps and one that silently poisons your answers.
What semantic caching actually is
Semantic caching stores LLM responses keyed by embedding vectors instead of exact query strings. When a new query arrives, the system computes its embedding, searches the cache for a similar past query, and returns the cached answer if the similarity exceeds a threshold, without calling the LLM.
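A minimal sketch of that lookup flow, assuming an `embed` function supplied by the caller (any embedding model would do) and a linear scan in place of a real vector index. The class and method names here are illustrative, not taken from either paper.

```python
import math
from typing import Callable, Optional

Vector = list[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory cache keyed by query embeddings."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed            # assumption: caller provides the embedding function
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the most similar cached response if it clears the threshold."""
        query_vec = self.embed(query)
        best_score, best_response = -1.0, None
        for vec, response in self.entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        """Store a response under the embedding of the query that produced it."""
        self.entries.append((self.embed(query), response))
```

On a miss (`get` returns `None`), the caller would call the LLM and then `put` the new response. A production system would likely replace the linear scan with an approximate nearest-neighbor index.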
The appeal is obvious. Real user queries cluster. "What is the refund policy?" and "How do I get a refund?" are semantically near-identical but string-different. A cache hit on either saves one API call.
arXiv:2411.05276 ("Semantic Caching for AI") measured this on production query logs and found a 68.8% reduction in API calls at a cosine similarity threshold of 0.95. That is the headline number. It is also incomplete without the context that follows.
The threshold problem
A cosine similarity threshold of 0.95 means that any two queries whose embeddings score 0.95 or higher are assumed to have the same correct answer. This works when the assumption holds. It breaks silently when it does not.
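Consider a pair of queries that are worded almost identically but ask about different doses, one about 400mg and one about 800mg.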
With many embedding models, these queries score above 0.95 similarity. The correct answers are different. A cache that returns the 400mg answer for the 800mg query is wrong, and there is no error signal.
The 68.8% reduction requires that your threshold is calibrated to your query distribution. It is not a universal setting.
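One way to calibrate, sketched under assumptions: take a sample of query pairs from your own logs, have a reviewer label whether each pair truly shares an answer, and pick the lowest threshold whose would-be hits meet a precision target. The function name, the candidate grid, and the 0.99 default target are all illustrative choices, not values from the papers.

```python
def calibrate_threshold(pairs, target_precision=0.99, candidates=None):
    """Pick the lowest threshold whose cache hits meet the precision target.

    pairs: list of (similarity, same_answer) tuples, where same_answer is a
    human label saying whether the two queries share a correct answer.
    Returns (threshold, precision, hit_rate), or None if no candidate qualifies.
    """
    if candidates is None:
        # Assumed search grid: 0.80, 0.81, ..., 0.99
        candidates = [round(0.80 + 0.01 * i, 2) for i in range(20)]
    for threshold in sorted(candidates):
        # Labels of the pairs that would be served from cache at this threshold
        hits = [same for sim, same in pairs if sim >= threshold]
        if not hits:
            continue
        precision = sum(hits) / len(hits)
        if precision >= target_precision:
            return threshold, precision, len(hits) / len(pairs)
    return None
```

Choosing the lowest qualifying threshold keeps the hit rate as high as the precision target allows. A high-similarity pair labeled `False`, like the dosage case above, pushes the chosen threshold upward.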
Category-aware thresholds
arXiv:2510.26835 ("Category-Aware Semantic Caching") extended the baseline work by measuring cache hit rates across query categories at different thresholds:
| Query category | Hit rate at threshold 0.95 | Threshold for equivalent accuracy |
|---|---|---|
| Factual / definitional | 40–60% | 0.90–0.92 |
| Conversational / chitchat | 55–70% | 0.88–0.92 |
| Code / technical | 5–15% | 0.97–0.99 |
| Math / calculation | 2–8% | 0.99+ |
The paper found that a uniform threshold across all query types underperforms category-aware thresholds by 15–30% on precision. Applying a 0.95 threshold to code queries produces too many false hits. Applying the same threshold to factual queries leaves cache hits on the table.
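As a rough illustration of per-category thresholds, the sketch below uses the lower end of each range in the table above as a starting value. The category keys, the strict fallback for unknown categories, and the function name are assumptions for the example; real values should come from calibration on your own traffic.

```python
# Illustrative starting points: the lower bound of each range in the table.
CATEGORY_THRESHOLDS = {
    "factual": 0.90,
    "conversational": 0.88,
    "code": 0.97,
    "math": 0.99,
}

# Assumption: an unrecognized category falls back to the strictest threshold,
# trading hit rate for safety.
DEFAULT_THRESHOLD = 0.99


def is_cache_hit(similarity: float, category: str) -> bool:
    """Decide whether a similarity score counts as a hit for this category."""
    return similarity >= CATEGORY_THRESHOLDS.get(category, DEFAULT_THRESHOLD)
```

With these values, a similarity of 0.95 is a hit for a factual query and a miss for a code query, which is the asymmetry the table describes.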
Implementation implications
The operational requirements for semantic caching that actually works:
Query classification at cache lookup time. Before checking the cache, route the query to a category. This can be a lightweight classifier (a 7B model distilled for intent classification runs in <20ms). Apply the category-specific threshold.
Cache TTL by content type. Factual answers about stable topics (company policies, product descriptions) can cache for days. Answers about current events, prices, or anything time-sensitive need short TTL or explicit invalidation.
Hit/miss logging with human review. The silent failure mode (a wrong cached answer served with full confidence) is only detectable by sampling cache hits and reviewing them. Build this into the system from the start.
Separate cache namespaces by use case. Do not share a cache between a customer support bot and a coding assistant. The query distributions are different enough that a shared cache degrades both.
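The TTL, review-sampling, and namespace points above can be sketched together. This is a simplified illustration: an exact string key stands in for the similarity search, and the TTL values, the 5% sample rate, and all names are assumptions rather than recommendations from the papers.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Optional

# Assumed TTLs: days for stable content, minutes for time-sensitive content.
TTL_SECONDS = {"stable": 3 * 24 * 3600, "time_sensitive": 300}


@dataclass
class CacheEntry:
    response: str
    content_type: str  # must be a key of TTL_SECONDS
    created_at: float = field(default_factory=time.time)

    def is_expired(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return now - self.created_at > TTL_SECONDS[self.content_type]


class NamespacedCache:
    """Hypothetical cache with one store per use case and sampled hit review."""

    def __init__(self, review_sample_rate: float = 0.05, rng: Optional[random.Random] = None):
        self.stores: dict[str, dict[str, CacheEntry]] = {}
        self.review_queue: list[tuple[str, str, str]] = []
        self.review_sample_rate = review_sample_rate
        self.rng = rng or random.Random()

    def put(self, namespace: str, key: str, entry: CacheEntry) -> None:
        self.stores.setdefault(namespace, {})[key] = entry

    def get(self, namespace: str, key: str, now: Optional[float] = None) -> Optional[str]:
        entry = self.stores.get(namespace, {}).get(key)
        if entry is None:
            return None
        if entry.is_expired(now):
            del self.stores[namespace][key]  # drop stale answers on read
            return None
        # Sample a fraction of hits so a human can check for silently wrong answers.
        if self.rng.random() < self.review_sample_rate:
            self.review_queue.append((namespace, key, entry.response))
        return entry.response
```

Because each namespace has its own store, a support-bot entry is never returned to the coding assistant, and the review queue records which namespace each sampled hit came from.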
What 68.8% reduction means in dollars
At GPT-4o pricing ($2.50/MTok input, $10/MTok output), consider a customer support application processing 10,000 queries/day at ~500 tokens average. The costs below work out if that means roughly 500 input and 500 output tokens per query:
| Scenario | Daily API calls | Daily cost |
|---|---|---|
| No caching | 10,000 | ~$62 |
| 68.8% cache hit rate | 3,120 | ~$19 |
| Monthly savings | n/a | ~$1,300 |
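The table's figures can be reproduced with a few lines of arithmetic. Two assumptions are baked in and are not stated in the original numbers: each query uses about 500 input and 500 output tokens, and a month is 30 days.

```python
INPUT_PRICE_PER_MTOK = 2.50    # USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 10.00  # USD per million output tokens


def daily_cost(calls: int, input_tokens: int = 500, output_tokens: int = 500) -> float:
    """Daily spend in USD, assuming every call uses the same token counts."""
    per_call = (
        input_tokens * INPUT_PRICE_PER_MTOK + output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000
    return calls * per_call


queries_per_day = 10_000
hit_rate = 0.688

calls_after_cache = round(queries_per_day * (1 - hit_rate))  # calls that still reach the LLM
baseline = daily_cost(queries_per_day)                        # no caching
cached = daily_cost(calls_after_cache)                        # with caching
monthly_savings = (baseline - cached) * 30                    # assumed 30-day month

print(f"{calls_after_cache} calls/day, ${baseline:.2f} -> ${cached:.2f}, "
      f"~${monthly_savings:.0f}/month saved")
```

This ignores the cost of computing embeddings and running the cache itself, which would reduce the net savings somewhat.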
Compounding with context compression
Semantic caching and context compression address different parts of the cost curve.
Caching eliminates repeat API calls entirely. Compression reduces the token cost of calls that do go through, including the cache misses. They compound multiplicatively.
If 68.8% of queries are cache hits (zero token cost), the remaining 31.2% can be compressed before the LLM call. If compression reduces token count by 40%, the effective reduction on the full query volume is: 1 - (0.312 × 0.6) = 81.3% token cost reduction.
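The same calculation as a small helper, for plugging in your own numbers. It assumes cache hits and misses have the same average token count, which may not hold if longer or rarer queries miss more often; the function name is illustrative.

```python
def combined_token_reduction(cache_hit_rate: float, compression_ratio: float) -> float:
    """Fraction of baseline token cost removed by caching plus compression.

    cache_hit_rate: share of queries served from cache (zero token cost).
    compression_ratio: share of tokens removed from each call that goes through.
    """
    remaining = (1 - cache_hit_rate) * (1 - compression_ratio)
    return 1 - remaining


reduction = combined_token_reduction(0.688, 0.40)
print(f"{reduction:.1%}")
```

Setting either argument to zero recovers the effect of the other technique alone, which makes the multiplicative relationship easy to check.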
gotcontext handles the compression side via the ingest_context MCP tool. Combined with a semantic cache layer, the two optimizations target distinct cost buckets.
The threshold is a product decision
The right cosine similarity threshold for your application is not 0.95 and it is not any other fixed number. It is determined by:
Measure these before deploying. The 68.8% reduction figure from arXiv:2411.05276 is achievable, but only after calibration on your specific workload, not out of the box.
Cite this
Researchers, analysts, or journalists referencing this post can use either format below.
@misc{semantic-caching-what-the-research-actually-says-about-similarity-thresholds-2026,
title = {Semantic Caching Can Cut LLM API Calls by 68.8% — But the Threshold Is Everything},
author = {James Hollingsworth},
year = {2026},
month = {May},
url = {https://www.gotcontext.ai/blog/semantic-caching-what-the-research-actually-says-about-similarity-thresholds},
note = {gotcontext.ai engineering blog.},
}

James Hollingsworth. (2026, May 8). Semantic Caching Can Cut LLM API Calls by 68.8% — But the Threshold Is Everything. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/semantic-caching-what-the-research-actually-says-about-similarity-thresholds.