Benchmark

Cache correctness, measured.

The hard problem with semantic caching is not making it hit. It's making sure the response it returns actually answers the new request. We measure that with an independent LLM judge across 8 workload verticals.

Hit-rate and savings benchmarks are tracked separately below and updated less frequently. Correctness is the launch-readiness gate.

Last savings run: 2026-05-28

Cache correctness

2026-05-29 · n=38

Every cache return is scored by an independent LLM judge (gemini-2.5-pro) that decides whether the cached response actually answers the new request. We report two numbers separately:

Approximate-match correctness

100.0%

21 of 21 reuses judged correct

Cache reused a stored response for a request whose wording was different. This is the metric that matters when the cache makes a judgement call.

Exact-match return quality

100.0%

17 of 17 returned responses judged correct

Request matched a prior call verbatim, so the cache returned the original upstream answer. Failures here are model behavior, not cache logic.

Combined pass rate across all 38 sampled returns: 100.0%. Covers 8 workload verticals.
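As a minimal sketch of how the two figures and the combined rate relate, the snippet below recomputes them from the sample counts reported above. The `JudgedReturn` record and the `pass_rate` helper are hypothetical names used for illustration; they are not the harness's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedReturn:
    # Hypothetical record shape; the real audit JSON may differ.
    match_type: str       # "approximate" or "exact"
    judged_correct: bool  # verdict from the LLM judge


def pass_rate(samples, match_type=None):
    """Percentage judged correct, optionally restricted to one match type."""
    selected = [s for s in samples if match_type is None or s.match_type == match_type]
    if not selected:
        raise ValueError("no samples for this match type")
    return 100.0 * sum(s.judged_correct for s in selected) / len(selected)


# Counts from the 2026-05-29 run: 21 approximate-match and 17 exact-match
# returns, all judged correct.
samples = [JudgedReturn("approximate", True)] * 21 + [JudgedReturn("exact", True)] * 17

print(f"approximate-match: {pass_rate(samples, 'approximate'):.1f}%")
print(f"exact-match:       {pass_rate(samples, 'exact'):.1f}%")
print(f"combined (n={len(samples)}): {pass_rate(samples):.1f}%")
```

Keeping the two match types separate matters because a failure in the exact-match group would lower the combined rate without saying anything about the cache's matching logic.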

Cache content safety

2026-05-30 · n=20

Every response sampled from the cache is run through an independent safety classifier (gemini-2.5-flash). Unless it is filtered, a response that an upstream model emitted once is replayed to every semantically similar request for the TTL window; this measures how often such a replay would serve actively harmful content.

Safe cached responses

100.0%

20 of 20 sampled returns passed

Flagged for review

0.0% of sampled returns

By category: none (20)

Tenants can opt in to a real-time safety hook on the cache-store boundary that drops flagged responses before they land in the cache. See safetyClassifierEnabled in tenant settings.
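To illustrate what a hook on the cache-store boundary could look like, here is a minimal sketch. Only the setting name `safetyClassifierEnabled` comes from this page; the `GuardedCache` class, its methods, and the `is_flagged` callback are assumptions made for the example and do not describe SemanticGuard's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class TenantSettings:
    # Setting name taken from this page; the container class is hypothetical.
    safetyClassifierEnabled: bool = False


@dataclass
class GuardedCache:
    """Hypothetical in-memory cache with an optional safety check on writes."""
    settings: TenantSettings
    is_flagged: Callable[[str], bool]  # stand-in for a real safety classifier
    _store: Dict[str, str] = field(default_factory=dict)

    def put(self, key: str, response: str) -> bool:
        """Store the response unless the hook is enabled and flags it."""
        if self.settings.safetyClassifierEnabled and self.is_flagged(response):
            return False  # dropped before it lands in the cache
        self._store[key] = response
        return True

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)


def toy_classifier(text: str) -> bool:
    # Toy classifier for the example only: flags one marker string.
    return "UNSAFE" in text


cache = GuardedCache(TenantSettings(safetyClassifierEnabled=True), toy_classifier)
cache.put("q1", "A normal answer.")
cache.put("q2", "UNSAFE content")
print(cache.get("q1"), cache.get("q2"))
```

Checking at write time rather than read time means a flagged response is never replayed at all, at the cost of one classifier call per cache miss.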

Aggregate hit rate

48.9%

across the workload corpus

Aggregate savings rate

49.8%

of total LLM spend absorbed

Cache correctness

100.0%

on 21 judged paraphrase returns

How to read this: each row is a workload type with its own fixtures. Low hit rates on RAG and adversarial workloads are intentional; those fixtures test miss-correctness. The columns track what the cache delivers, not what it attempts.

Workload	Hit rate	Savings rate
Customer Support	41.7%	37.7%
Content Generation	25.0%	25.0%
Agent Workflows	30.0%	36.1%
RAG / Document Q&A	30.0%	8.2%
Developer Tools	40.0%	38.9%
Cross-Provider Reuse	75.0%	83.2%
Adversarial Inputs	10.0%	27.7%
Paraphrase Reuse	95.8%	100.0%

Notes:
RAG / Document Q&A: Most RAG cases ask about distinct documents; the cache correctly misses on different content.
Adversarial Inputs: Fixtures test miss-correctness; the low hit rate is the cache correctly refusing to over-cache.

Latency: cache hits typically return in under 200 ms. Misses pass through to the upstream provider with under 10 ms of proxy overhead on the request path.

How it's measured

The cache-eval harness sends real HTTP requests through SemanticGuard against the same upstream LLM provider you would use yourself. Each suite is run three times and the median is reported.
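The three-run median can be sketched as follows. The function name and the sample values are illustrative assumptions, not actual run data or harness code.

```python
from statistics import median


def reported_value(run_values):
    """Return the median of exactly three runs of one suite."""
    if len(run_values) != 3:
        raise ValueError("expected exactly three runs per suite")
    return median(run_values)


# Illustrative hit rates from three runs of one suite (not real results).
runs = [0.40, 0.50, 0.45]
print(reported_value(runs))
```

With three runs the median is simply the middle value, so a single outlier run (for example, one disturbed by an upstream hiccup) does not move the reported figure.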

Hit rate is the share of eval requests served from cache. Savings rate is the share of would-have-been LLM spend (input + output tokens at published per-million prices) that the cache absorbed. Cache correctness (the summary card above) is the share of cache-served responses that an independent LLM judge graded as correctly answering the new request.
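The two rate definitions above can be written out as a short sketch. The `EvalRequest` record, the function names, and the prices are assumptions chosen for illustration; substitute the published per-million prices of the model you run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRequest:
    # Hypothetical record shape for one eval request.
    served_from_cache: bool
    input_tokens: int
    output_tokens: int


def would_have_cost(req, input_price_per_m, output_price_per_m):
    """Spend the request would have incurred upstream, in the price's currency."""
    return (req.input_tokens * input_price_per_m
            + req.output_tokens * output_price_per_m) / 1_000_000


def hit_rate(requests):
    """Share of requests served from cache."""
    return sum(r.served_from_cache for r in requests) / len(requests)


def savings_rate(requests, input_price_per_m, output_price_per_m):
    """Share of would-have-been spend that cache hits absorbed."""
    total = sum(would_have_cost(r, input_price_per_m, output_price_per_m) for r in requests)
    saved = sum(would_have_cost(r, input_price_per_m, output_price_per_m)
                for r in requests if r.served_from_cache)
    return saved / total


# Illustrative data and prices (1.0 per million input tokens, 2.0 per million output tokens).
requests = [
    EvalRequest(True, 1000, 500),
    EvalRequest(False, 1000, 500),
    EvalRequest(True, 2000, 1000),
    EvalRequest(False, 4000, 2000),
]
print(f"hit rate: {hit_rate(requests):.3f}")
print(f"savings rate: {savings_rate(requests, 1.0, 2.0):.3f}")
```

In this example half the requests hit, but the hits are the cheaper requests, so the savings rate comes out lower than the hit rate. The same effect explains why the two columns in the table diverge for some workloads.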

Your actual numbers will depend on workload mix, prompt diversity, and the model you run. Treat these as a directional reference, not a contract.

How correctness is measured

We sample cache returns from a fresh traffic scope, re-issue each prompt to the upstream model for a baseline answer, and ask an independent LLM judge (gemini-2.5-pro) whether the cached response correctly addresses the request. The judge is not told which response came from the cache vs. fresh from the model.
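One way to keep the judge blind is to present the cached and baseline responses under neutral labels in random order and map the verdict back afterwards. The sketch below assumes that approach; the prompt wording, function names, and verdict format are hypothetical and are not taken from the actual harness.

```python
import random


def build_blind_judge_prompt(request, cached_response, baseline_response, rng):
    """Return (prompt, answer_key). The prompt never says which response is which."""
    candidates = [("cached", cached_response), ("baseline", baseline_response)]
    rng.shuffle(candidates)  # random order so position carries no signal
    answer_key = {}
    sections = [f"Request:\n{request}"]
    for label, (source, text) in zip("AB", candidates):
        answer_key[label] = source  # kept by the harness, never sent to the judge
        sections.append(f"Response {label}:\n{text}")
    sections.append(
        "For each response, answer CORRECT or INCORRECT: "
        "does it correctly address the request?"
    )
    return "\n\n".join(sections), answer_key


def verdict_for_cached(judge_verdicts, answer_key):
    """Map the judge's per-label verdicts back to the cached response."""
    cached_label = next(label for label, source in answer_key.items() if source == "cached")
    return judge_verdicts[cached_label] == "CORRECT"


prompt, key = build_blind_judge_prompt(
    "How do I reset my password?",
    "Use the reset link on the sign-in page.",
    "Click 'Forgot password' on the login screen.",
    random.Random(7),
)
print(prompt)
```

The answer key stays on the harness side, so the only information the judge receives is the request and two unlabeled candidate answers.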

Approximate-match correctness isolates the cases where the cache made a wording-tolerant decision, the only place a cache layer can be wrong. Exact-match return quality covers cases where the request matched a prior call verbatim; failures there are upstream model behavior, not cache logic.

Sample composition for the 2026-05-29 run: 21 approximate-match returns + 17 exact-match returns, total n=38. Judge model: gemini-2.5-pro. Temperature: 0 for both fixture calls and judge calls, to minimize run-to-run variation. Zero judge failures, zero upstream errors. We do not redact failure samples from the underlying audit JSON.

The fixture harness, runner script, and per-run JSON are maintained in our private engineering repo. To request a copy or have a third party reproduce the numbers on your own infrastructure, contact support@semanticguard.dev.