
Semantic Caching Can Cut LLM API Calls by 68.8% — But the Threshold Is Everything

Research shows semantic caching with cosine similarity matching slashes API costs dramatically. Two papers explain why a single threshold value is the difference between a cache that helps and one that silently poisons your answers.

James Hollingsworth (Contributor). Published May 8, 2026. 5 min read, ~779 words.

What semantic caching actually is

Semantic caching stores LLM responses keyed by embedding vectors instead of exact query strings. When a new query arrives, the system computes its embedding, searches the cache for a similar past query, and returns the cached answer if the similarity exceeds a threshold, without calling the LLM.
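A minimal sketch of that lookup, assuming a caller-supplied `embed` function (hypothetical; in practice an embedding model) and a linear scan in place of a vector index. The class and parameter names are illustrative and are not taken from either paper.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Toy semantic cache: linear scan over stored (embedding, response) pairs."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed          # hypothetical embedding function
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries: list[tuple[Vector, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the closest cached response if it clears the threshold, else None."""
        q = self.embed(query)
        best_score, best_response = -1.0, None
        for vec, response in self.entries:
            score = cosine_similarity(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        """Store a response keyed by the embedding of its query."""
        self.entries.append((self.embed(query), response))
```

On a miss (`get` returns `None`), the caller would invoke the LLM and then `put` the new response. A production system would swap the linear scan for an approximate nearest-neighbor index.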

The appeal is obvious. Real user queries cluster. "What is the refund policy?" and "How do I get a refund?" are semantically near-identical but string-different. A cache hit on either saves one API call.

arXiv:2411.05276 ("Semantic Caching for AI") measured this on production query logs and found a 68.8% reduction in API calls with a cosine similarity threshold of 0.95. That is the headline number. It is also incomplete without the context that follows.

The threshold problem

A cosine similarity threshold of 0.95 means that any two queries whose embeddings score at least 0.95 against each other are assumed to have the same correct answer. This works when the assumption holds. It breaks silently when it does not.

Consider the pair:

  • "What are the side effects of ibuprofen at 400mg?"
  • "What are the side effects of ibuprofen at 800mg?"

With many embedding models, these queries score above 0.95 similarity. The correct answers are different. A cache that returns the 400mg answer for the 800mg query is wrong, and there is no error signal.

The 68.8% reduction requires that your threshold is calibrated to your query distribution. It is not a universal setting.
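One way to approach calibration is to label a sample of query pairs from your own logs as same-answer or different-answer, then pick the lowest threshold whose false-hit rate stays within your budget. A minimal sketch, assuming such labeled pairs already exist; the function names are hypothetical.

```python
def false_hit_rate(pairs: list[tuple[float, bool]], threshold: float) -> float:
    """Fraction of would-be cache hits whose answers actually differ.

    Each pair is (cosine_similarity, same_answer) for two logged queries.
    Returns 0.0 when no pair reaches the threshold.
    """
    hits = [same for sim, same in pairs if sim >= threshold]
    if not hits:
        return 0.0
    return sum(1 for same in hits if not same) / len(hits)


def lowest_safe_threshold(pairs, candidates, max_false_hit_rate):
    """Lowest candidate threshold whose false-hit rate is within budget, or None."""
    for t in sorted(candidates):
        if false_hit_rate(pairs, t) <= max_false_hit_rate:
            return t
    return None
```

Choosing the lowest passing threshold keeps as many hits as the accuracy budget allows. With a small labeled sample the estimate is noisy, so it is worth re-running as more reviewed pairs accumulate.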

    Category-aware thresholds

    arXiv:2510.26835 ("Category-Aware Semantic Caching") extended the baseline work by measuring cache hit rates across query categories at different thresholds:

| Query category | Hit rate at threshold 0.95 | Threshold for equivalent accuracy |
| --- | --- | --- |
| Factual / definitional | 40–60% | 0.90–0.92 |
| Conversational / chitchat | 55–70% | 0.88–0.92 |
| Code / technical | 5–15% | 0.97–0.99 |
| Math / calculation | 2–8% | 0.99+ |

Code queries have low cache utility because small syntactic differences produce semantically distinct queries that embed closely. "Sort a list in Python" and "Sort a list in Python in descending order" score high similarity but need different answers.

    The paper found that a uniform threshold across all query types underperforms category-aware thresholds by 15–30% on precision. Applying a 0.95 threshold to code queries produces too many false hits. Applying the same threshold to factual queries leaves cache hits on the table.
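A sketch of per-category threshold selection. The values are picked from inside the ranges reported above and are placeholders, not recommendations; `classify` stands in for whatever classifier you use and is hypothetical.

```python
from typing import Callable

# Placeholder values chosen from within the per-category ranges reported above.
CATEGORY_THRESHOLDS = {
    "factual": 0.91,
    "conversational": 0.90,
    "code": 0.98,
    "math": 0.99,
}
DEFAULT_THRESHOLD = 0.99  # unknown categories fall back to the strictest value


def threshold_for(query: str, classify: Callable[[str], str]) -> float:
    """Look up the similarity threshold for the query's category."""
    return CATEGORY_THRESHOLDS.get(classify(query), DEFAULT_THRESHOLD)


def is_cache_hit(similarity: float, query: str, classify: Callable[[str], str]) -> bool:
    """A match counts as a hit only if it clears its category's threshold."""
    return similarity >= threshold_for(query, classify)
```

Falling back to the strictest threshold for unrecognized categories trades hit rate for safety, which seems the better default given that false hits are silent.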

    Implementation implications

    The operational requirement for semantic caching that actually works:

    Query classification at cache lookup time. Before checking the cache, route the query to a category. This can be a lightweight classifier (a 7B model distilled for intent classification runs in <20ms). Apply the category-specific threshold.

    Cache TTL by content type. Factual answers about stable topics (company policies, product descriptions) can cache for days. Answers about current events, prices, or anything time-sensitive need short TTL or explicit invalidation.

    Hit/miss logging with human review. The silent failure mode (a wrong cached answer served with full confidence) is only detectable by sampling cache hits and reviewing them. Build this into the system from the start.

    Separate cache namespaces by use case. Do not share a cache between a customer support bot and a coding assistant. The query distributions are different enough that a shared cache degrades both.
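The TTL, namespace, and review-sampling requirements can be sketched together. This assumes an in-memory store and omits the similarity match itself (`key` stands in for the matched cache entry). The TTL values, namespace names, and 5% sample rate are illustrative placeholders.

```python
import random
import time
from dataclasses import dataclass, field

# Illustrative TTLs in seconds; tune per content type.
TTL_SECONDS = {"stable": 3 * 24 * 3600, "time_sensitive": 15 * 60}


@dataclass
class Entry:
    response: str
    expires_at: float


@dataclass
class NamespacedCache:
    review_sample_rate: float = 0.05
    stores: dict = field(default_factory=dict)        # namespace -> {key: Entry}
    review_queue: list = field(default_factory=list)  # sampled hits for human review

    def put(self, namespace, key, response, content_type, now=None):
        """Store a response in its own namespace with a TTL set by content type."""
        now = time.time() if now is None else now
        entry = Entry(response, now + TTL_SECONDS[content_type])
        self.stores.setdefault(namespace, {})[key] = entry

    def get(self, namespace, key, now=None):
        """Return a live entry from this namespace only, sampling hits for review."""
        now = time.time() if now is None else now
        store = self.stores.get(namespace, {})
        entry = store.get(key)
        if entry is None:
            return None
        if now >= entry.expires_at:
            del store[key]  # expired: evict and report a miss
            return None
        if random.random() < self.review_sample_rate:
            self.review_queue.append((namespace, key, entry.response))
        return entry.response
```

Keeping one store per namespace means a support bot and a coding assistant never see each other's entries, and the review queue gives reviewers a random sample of served hits to check for wrong answers.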

    What 68.8% reduction means in dollars

At GPT-4o pricing ($2.50/MTok input, $10/MTok output), consider a customer support application processing 10,000 queries/day at ~500 tokens average for both input and output (the split that yields the ~$62 baseline):

| Scenario | Daily API calls | Daily cost |
| --- | --- | --- |
| No caching | 10,000 | ~$62 |
| 68.8% cache hit rate | 3,120 | ~$19 |
| Monthly savings | n/a | ~$1,300 |

The caveat: these numbers assume the 68.8% hit rate is achievable on your query distribution. For code generation or math workloads, the hit rates reported above are 2–15%, and the savings are correspondingly smaller.
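The table's arithmetic can be reproduced as follows. The sketch assumes roughly 500 input and 500 output tokens per query, which is the split that yields the ~$62 baseline, and a 30-day month; the function name is hypothetical.

```python
INPUT_PRICE_PER_MTOK = 2.50    # USD per million input tokens (from the text)
OUTPUT_PRICE_PER_MTOK = 10.00  # USD per million output tokens (from the text)


def daily_cost(calls: int, input_tokens: int = 500, output_tokens: int = 500) -> float:
    """Daily API cost in USD for `calls` LLM calls of the given size."""
    input_cost = calls * input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
    output_cost = calls * output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


queries_per_day = 10_000
hit_rate = 0.688
calls_after_cache = round(queries_per_day * (1 - hit_rate))  # cache misses only

baseline = daily_cost(queries_per_day)
cached = daily_cost(calls_after_cache)
monthly_savings = (baseline - cached) * 30  # assumes a 30-day month

print(f"baseline ${baseline:.2f}/day, cached ${cached:.2f}/day, "
      f"savings ${monthly_savings:.0f}/month")
```

Under these assumptions the baseline comes to $62.50/day and the cached scenario to $19.50/day, for about $1,290/month, which the table rounds to ~$1,300. Substituting your own measured hit rate for `hit_rate` shows how quickly the savings shrink on low-hit workloads.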

    Compounding with context compression

    Semantic caching and context compression address different parts of the cost curve.

    Caching eliminates repeat API calls entirely. Compression reduces the token cost of calls that do go through, including the cache misses. They compound multiplicatively.

    If 68.8% of queries are cache hits (zero token cost), the remaining 31.2% can be compressed before the LLM call. If compression reduces token count by 40%, the effective reduction on the full query volume is: 1 - (0.312 × 0.6) = 81.3% token cost reduction.
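The same calculation as a small function, so other hit rates and compression ratios can be plugged in. The function and parameter names are hypothetical, and the sketch assumes compression applies uniformly to every cache miss.

```python
def combined_token_reduction(cache_hit_rate: float, compression_ratio: float) -> float:
    """Fraction of token cost removed when cache misses are also compressed.

    cache_hit_rate: share of queries served from cache (no tokens spent).
    compression_ratio: share of tokens removed from each remaining call.
    """
    remaining = (1 - cache_hit_rate) * (1 - compression_ratio)
    return 1 - remaining


reduction = combined_token_reduction(cache_hit_rate=0.688, compression_ratio=0.40)
print(f"{reduction:.1%}")
```

With a 10% hit rate, closer to what the code and math categories see, the same 40% compression gives 1 - (0.9 × 0.6) = 46% overall, so on those workloads most of the saving comes from compression.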

    gotcontext handles the compression side via the ingest_context MCP tool. Combined with a semantic cache layer, the two optimizations target distinct cost buckets.

    The threshold is a product decision

The right cosine similarity threshold for your application is not 0.95 and it is not any other fixed number. It is determined by:

  • Your acceptable false-hit rate (wrong cached answer served as correct)
  • Your query category distribution
  • Your embedding model and its similarity calibration

Measure these before deploying. The 68.8% reduction figure from arXiv:2411.05276 is achievable, but only after calibration on your specific workload, not out of the box.


    Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX
    @misc{semantic-caching-what-the-research-actually-says-about-similarity-thresholds-2026,
      title  = {Semantic Caching Can Cut LLM API Calls by 68.8\% --- But the Threshold Is Everything},
      author = {James Hollingsworth},
      year   = {2026},
      month  = {May},
      url    = {https://www.gotcontext.ai/blog/semantic-caching-what-the-research-actually-says-about-similarity-thresholds},
      note   = {gotcontext.ai engineering blog.},
    }
APA
    James Hollingsworth. (2026, May 8). Semantic Caching Can Cut LLM API Calls by 68.8% — But the Threshold Is Everything. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/semantic-caching-what-the-research-actually-says-about-similarity-thresholds.
