Benchmark
Cache correctness, measured.
The hard problem with semantic caching is not making it hit. It's making sure the response it returns actually answers the new request. We measure that with an independent LLM judge across 8 workload verticals.
Hit-rate and savings benchmarks are tracked separately below and updated less frequently. Correctness is the launch-readiness gate.
Last savings run: 2026-05-28
Cache correctness
2026-05-29 · n=38
Every cache return is scored by an independent LLM judge (gemini-2.5-pro) that decides whether the cached response actually answers the new request. We report two numbers separately:
Approximate-match correctness
100.0%
21 of 21 reuses judged correct
Cache reused a stored response for a request whose wording was different. This is the metric that matters when the cache makes a judgement call.
Exact-match return quality
100.0%
17 of 17 returned responses judged correct
Request matched a prior call verbatim, so the cache returned the original upstream answer. Failures here are model behavior, not cache logic.
Combined pass rate across all 38 sampled returns: 100.0%. Covers 8 workload verticals.
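The split between the two numbers above can be reproduced with simple arithmetic. The sketch below is illustrative only: the `JudgedReturn` type and the `"approximate"` / `"exact"` labels are assumptions made for this example, not the harness's actual schema. The counts are the ones reported for the 2026-05-29 run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedReturn:
    # Hypothetical record shape; the real audit JSON may differ.
    match_type: str        # "approximate" or "exact"
    judged_correct: bool   # verdict from the LLM judge


def pass_rate(returns):
    """Percentage of judged returns that the judge marked correct."""
    if not returns:
        raise ValueError("no judged returns in this bucket")
    return 100.0 * sum(r.judged_correct for r in returns) / len(returns)


# Counts from the 2026-05-29 run: 21 approximate-match and 17 exact-match
# returns, all judged correct.
sample = (
    [JudgedReturn("approximate", True)] * 21
    + [JudgedReturn("exact", True)] * 17
)

approximate = [r for r in sample if r.match_type == "approximate"]
exact = [r for r in sample if r.match_type == "exact"]

print(f"approximate-match correctness: {pass_rate(approximate):.1f}%")
print(f"exact-match return quality:    {pass_rate(exact):.1f}%")
print(f"combined pass rate (n={len(sample)}):   {pass_rate(sample):.1f}%")
```

Keeping the two buckets separate means a single incorrect paraphrase reuse would show up as 20 of 21 in the approximate-match number rather than being diluted across all 38 returns.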
Cache content safety
2026-05-30 · n=20
Every response sampled from the cache is run through an independent safety classifier (gemini-2.5-flash). A cached response that an upstream model emitted once would otherwise be replayed to every semantically-similar request for the TTL window; this measures how often that produces actively harmful content.
Safe cached responses
100.0%
20 of 20 sampled returns passed
Flagged for review
0
0.0% of sampled returns
By category: none (20)
Tenants can opt in to a real-time safety hook on the cache-store boundary that drops flagged responses before they land in the cache. See safetyClassifierEnabled in tenant settings.
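As a rough illustration of what a hook on the cache-store boundary does, the sketch below gates writes to a toy in-memory cache. Everything in it is hypothetical: `GuardedCache`, `toy_classifier`, and the `safety_classifier_enabled` argument are stand-ins invented for this example (the argument only mirrors the `safetyClassifierEnabled` setting by name), and this is not SemanticGuard's implementation.

```python
from typing import Callable, Dict, Optional


class GuardedCache:
    """Toy in-memory cache with an optional safety check on writes."""

    def __init__(
        self,
        is_flagged: Callable[[str], bool],
        safety_classifier_enabled: bool = False,
    ) -> None:
        self._store: Dict[str, str] = {}
        self._is_flagged = is_flagged
        self._enabled = safety_classifier_enabled

    def put(self, key: str, response: str) -> bool:
        """Store the response unless the hook is on and the classifier flags it."""
        if self._enabled and self._is_flagged(response):
            # Dropped before it lands in the cache, so it can never be replayed.
            return False
        self._store[key] = response
        return True

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)


def toy_classifier(text: str) -> bool:
    # Stand-in for a real safety classifier call; assumption for this sketch.
    return "UNSAFE" in text


cache = GuardedCache(toy_classifier, safety_classifier_enabled=True)
cache.put("greeting", "Hello, how can I help?")
cache.put("bad", "UNSAFE content from upstream")

print(cache.get("greeting"))  # stored normally
print(cache.get("bad"))       # None: the flagged response was never cached
```

The point of checking at store time rather than at read time is that a flagged response is rejected once, instead of being filtered on every later hit.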
Aggregate hit rate
48.9%
across the workload corpus
Aggregate savings rate
49.8%
of total LLM spend absorbed
Cache correctness
100.0%
on 21 judged paraphrase returns
How to read this: each row is a workload type with its own fixtures. Low hit-rates on RAG and adversarial are intentional; those fixtures test miss-correctness. The columns track what the cache delivers, not what it tries to.
| Workload | Hit rate | Savings rate |
|---|---|---|
| Customer Support | 41.7% | 37.7% |
| Content Generation | 25.0% | 25.0% |
| Agent Workflows | 30.0% | 36.1% |
| RAG / Document Q&A (most RAG cases ask about distinct documents; the cache correctly misses on different content) | 30.0% | 8.2% |
| Developer Tools | 40.0% | 38.9% |
| Cross-Provider Reuse | 75.0% | 83.2% |
| Adversarial Inputs (fixtures test miss-correctness; low hit rate is the cache correctly refusing to over-cache) | 10.0% | 27.7% |
| Paraphrase Reuse | 95.8% | 100.0% |
Latency: cache hits typically return in <200ms. Misses pass through to the upstream provider with proxy overhead of <10ms on the request path.
How it's measured
The cache-eval harness sends real HTTP requests through SemanticGuard against the same upstream LLM provider you would use yourself. Each suite is run three times and the median is reported.
Hit rate is the share of eval requests served from cache. Savings rate is the share of would-have-been LLM spend (input + output tokens at published per-million prices) that the cache absorbed. Cache correctness (top-right card) is the share of cache-served responses that an independent LLM judge graded as correctly answering the new request.
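The definitions above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `EvalRequest` record and made-up per-million-token prices; it is not the cache-eval harness itself. It also shows why the two columns can diverge: a hit on an expensive request absorbs more spend than a hit on a cheap one.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class EvalRequest:
    # Hypothetical record shape for one eval request.
    served_from_cache: bool
    input_tokens: int
    output_tokens: int


# Illustrative prices in USD per million tokens (assumptions, not real prices).
INPUT_PRICE_PER_M = 1.00
OUTPUT_PRICE_PER_M = 4.00


def would_have_cost(r: EvalRequest) -> float:
    """What the request would have cost if it had gone to the upstream model."""
    return (
        r.input_tokens * INPUT_PRICE_PER_M + r.output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


def hit_rate(requests) -> float:
    """Share of eval requests served from cache, as a percentage."""
    return 100.0 * sum(r.served_from_cache for r in requests) / len(requests)


def savings_rate(requests) -> float:
    """Share of would-have-been spend that cache hits absorbed, as a percentage."""
    total = sum(would_have_cost(r) for r in requests)
    absorbed = sum(would_have_cost(r) for r in requests if r.served_from_cache)
    return 100.0 * absorbed / total


# One large request is a hit; three small ones are misses.
run = [
    EvalRequest(True, 1000, 500),
    EvalRequest(False, 200, 100),
    EvalRequest(False, 200, 100),
    EvalRequest(False, 200, 100),
]

print(f"hit rate:     {hit_rate(run):.1f}%")
print(f"savings rate: {savings_rate(run):.1f}%")

# Each suite is run three times; the reported figure is the median of the runs.
three_runs = [run, run, run]
print(f"median hit rate: {median([hit_rate(r) for r in three_runs]):.1f}%")
```

In this toy run one of four requests hits, but it is the costly one, so the savings rate comes out well above the hit rate. The reverse pattern (many cheap hits, few expensive misses) pushes savings below hit rate.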
Your actual numbers will depend on workload mix, prompt diversity, and the model you run. Treat these as a directional reference, not a contract.
How correctness is measured
We sample cache returns from a fresh traffic scope, re-issue each prompt to the upstream model for a baseline answer, and ask an independent LLM judge (gemini-2.5-pro) whether the cached response correctly addresses the request. The judge is not told which response came from the cache vs. fresh from the model.
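One way to keep the judge blind to provenance is to shuffle the cached and baseline responses behind neutral labels before building the prompt. The sketch below shows that idea only; the prompt wording, the `judge` callable, and its `{"A": bool, "B": bool}` return shape are assumptions made for this example and do not describe the actual judge integration.

```python
import random
from typing import Callable, Dict, Tuple


def build_blinded_prompt(
    request: str, cached: str, baseline: str, rng: random.Random
) -> Tuple[str, str]:
    """Return a judge prompt with neutral labels, plus the label hiding the cached response."""
    pair = [("cached", cached), ("baseline", baseline)]
    rng.shuffle(pair)  # random order, so position carries no provenance signal
    cached_label = "A" if pair[0][0] == "cached" else "B"
    prompt = (
        f"Request:\n{request}\n\n"
        f"Response A:\n{pair[0][1]}\n\n"
        f"Response B:\n{pair[1][1]}\n\n"
        "For each response, state whether it correctly addresses the request."
    )
    return prompt, cached_label


def judge_cached_return(
    request: str,
    cached: str,
    baseline: str,
    judge: Callable[[str], Dict[str, bool]],
    rng: random.Random,
) -> bool:
    """True if the judge graded the cached response as correct."""
    prompt, cached_label = build_blinded_prompt(request, cached, baseline, rng)
    verdicts = judge(prompt)  # hypothetical shape: {"A": bool, "B": bool}
    return verdicts[cached_label]


def stub_judge(prompt: str) -> Dict[str, bool]:
    # Stand-in for a real LLM judge call; always approves both responses.
    return {"A": True, "B": True}


rng = random.Random(0)  # seeded so the shuffle is reproducible
ok = judge_cached_return(
    "How do I reset my password?",
    "Use the reset link on the sign-in page.",
    "Click 'Forgot password' and follow the email.",
    stub_judge,
    rng,
)
print(ok)
```

Only the harness keeps the mapping from label to source, so the verdict for the cached response is looked up after the judge has answered.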
Approximate-match correctness isolates the cases where the cache made a wording-tolerant decision, the only place a cache layer can be wrong. Exact-match return quality covers cases where the request matched a prior call verbatim; failures there are upstream model behavior, not cache logic.
Sample composition for the 2026-05-29 run: 21 approximate-match returns + 17 exact-match returns, total n=38. Judge model: gemini-2.5-pro. Temperature: 0 for both fixture calls and judge calls (to minimize run-to-run variation). Zero judge failures, zero upstream errors. We do not redact failure samples from the underlying audit JSON.
The fixture harness, runner script, and per-run JSON are maintained in our private engineering repo. To request a copy or have a third party reproduce the numbers on your own infrastructure, contact support@semanticguard.dev.