Documentation
Integration guides and API reference.
Quick Start
SemanticGuard is an OpenAI-compatible proxy. Point your client at SemanticGuard instead of the provider, add your SG API key, and all requests are cached, logged, and tracked.
    curl https://semanticguard.dev/api/proxy/v1/chat/completions \
      -H "Authorization: Bearer your-openai-api-key" \
      -H "x-sg-api-key: sg-your-key-here" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
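The same request can be assembled in TypeScript. This is a minimal sketch that reuses only the endpoint and headers from the curl example; `buildChatRequest` and its types are hypothetical helpers for illustration, not part of any SemanticGuard package.

```typescript
// Hypothetical helper: assembles the proxy request shown in the curl example.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ProxyRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildChatRequest(
  providerKey: string, // upstream provider key, forwarded as a Bearer token
  sgKey: string,       // SemanticGuard key, sent in x-sg-api-key
  model: string,
  messages: ChatMessage[],
): ProxyRequest {
  return {
    url: "https://semanticguard.dev/api/proxy/v1/chat/completions",
    method: "POST",
    headers: {
      Authorization: `Bearer ${providerKey}`,
      "x-sg-api-key": sgKey,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages }),
  };
}

const chatRequest = buildChatRequest(
  "your-openai-api-key",
  "sg-your-key-here",
  "gpt-4o-mini",
  [{ role: "user", content: "Hello" }],
);
```

The resulting object can be handed to any HTTP client, for example `fetch(chatRequest.url, chatRequest)` in a runtime that provides `fetch`.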
AI SDK Integration
Using the Vercel AI SDK? Add a fetch wrapper to any provider. Works with OpenAI, Anthropic, Vertex AI, and any provider that accepts a custom fetch function.
    import { generateText } from "ai";
    import { createOpenAI } from "@ai-sdk/openai";
    import { withSemanticGuard } from "@semanticguard/ai-sdk";

    // Route provider traffic through SemanticGuard via a custom fetch.
    const openai = createOpenAI({
      apiKey: "your-openai-key",
      fetch: withSemanticGuard({
        gatewayUrl: "https://semanticguard.dev",
        apiKey: "sg-your-key-here",
      }),
    });

    const result = await generateText({
      model: openai("gpt-4o-mini"),
      prompt: "Hello",
    });
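To make the idea of a fetch wrapper concrete, the sketch below shows the kind of rewriting such a wrapper might perform before passing the call on. It is not the `@semanticguard/ai-sdk` implementation. `rerouteThroughGateway` and `OutgoingCall` are hypothetical names, and the `/api/proxy` path prefix is an assumption taken from the Quick Start URL.

```typescript
// Illustrative sketch only: NOT the @semanticguard/ai-sdk implementation.
interface OutgoingCall {
  url: string;
  headers: Record<string, string>;
}

// Assumption: the gateway serves provider paths under /api/proxy (as in the
// Quick Start URL) and authenticates with the x-sg-api-key header.
function rerouteThroughGateway(
  call: OutgoingCall,
  gatewayUrl: string,
  sgKey: string,
): OutgoingCall {
  const path = call.url.replace(/^https?:\/\/[^/]+/, ""); // strip scheme and host
  const base = gatewayUrl.replace(/\/+$/, "");            // drop trailing slashes
  return {
    url: `${base}/api/proxy${path}`,
    // The provider's own headers are kept; only the SemanticGuard key is added.
    headers: { ...call.headers, "x-sg-api-key": sgKey },
  };
}

const rerouted = rerouteThroughGateway(
  {
    url: "https://api.openai.com/v1/chat/completions",
    headers: { Authorization: "Bearer your-openai-key" },
  },
  "https://semanticguard.dev",
  "sg-your-key-here",
);
```

Under these assumptions the rewritten call targets the same URL as the curl example, with the provider's `Authorization` header untouched.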
Vercel Marketplace
Install SemanticGuard from the Vercel Marketplace for zero-config setup. The integration automatically provisions your dedicated infrastructure and injects environment variables.
What happens on install
- Click "Add Integration" in the Vercel Marketplace
- Select which projects to connect
- Dedicated infrastructure is provisioned automatically
- SEMANTICGUARD_URL and SG_API_KEY are injected into your project
- Add the SDK wrapper and deploy
After installation, the SDK reads from environment variables automatically:
    import { createOpenAI } from "@ai-sdk/openai";
    import { withSemanticGuard } from "@semanticguard/ai-sdk";

    // Zero-config: reads SEMANTICGUARD_URL and SG_API_KEY from env
    const openai = createOpenAI({
      fetch: withSemanticGuard(),
    });
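If you want your own app to fail fast when the injected variables are missing, a startup check along these lines may help. This is a sketch, not SDK behavior: `resolveGuardConfig` and `GuardConfig` are hypothetical names, and letting explicit options take precedence over environment variables is an assumption.

```typescript
// Hypothetical startup check; not part of @semanticguard/ai-sdk.
interface GuardConfig {
  gatewayUrl: string;
  apiKey: string;
}

function resolveGuardConfig(
  explicit: Partial<GuardConfig>,
  env: Record<string, string | undefined>, // e.g. process.env in Node.js
): GuardConfig {
  // Assumption: explicit options win over environment variables.
  const gatewayUrl = explicit.gatewayUrl ?? env["SEMANTICGUARD_URL"];
  const apiKey = explicit.apiKey ?? env["SG_API_KEY"];
  if (!gatewayUrl || !apiKey) {
    throw new Error(
      "Set SEMANTICGUARD_URL and SG_API_KEY, or pass gatewayUrl and apiKey explicitly.",
    );
  }
  return { gatewayUrl, apiKey };
}

// Placeholder values standing in for what the integration injects.
const resolvedConfig = resolveGuardConfig(
  {},
  { SEMANTICGUARD_URL: "https://semanticguard.dev", SG_API_KEY: "sg-your-key-here" },
);
```

Passing the environment in as a parameter keeps the check easy to test and avoids tying it to one runtime.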
Manage your installation from the Integrations page.
Authentication
Every request needs two keys:
- Your LLM API key (passed to the upstream provider via Authorization: Bearer or x-api-key)
- Your SemanticGuard key (via the x-sg-api-key header, or the ?sg_key= query param for clients that cannot set custom headers). Generate one from the API Keys page in the dashboard.
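For clients that cannot set custom headers, the key goes into the URL instead. A minimal sketch of the `?sg_key=` form follows; `withSgKeyParam` is a hypothetical helper name.

```typescript
// Hypothetical helper: appends the SemanticGuard key as the sg_key query param.
function withSgKeyParam(url: string, sgKey: string): string {
  // Use "&" when the URL already carries a query string, "?" otherwise.
  const separator = url.indexOf("?") === -1 ? "?" : "&";
  return `${url}${separator}sg_key=${encodeURIComponent(sgKey)}`;
}

const keyedUrl = withSgKeyParam(
  "https://semanticguard.dev/api/proxy/v1/chat/completions",
  "sg-your-key-here",
);
```

Query strings tend to end up in access logs and browser history, so the header form is the safer default whenever the client allows it.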
Supported Providers
| Provider | Auth Header | Models |
|---|---|---|
| OpenAI | Authorization: Bearer sk-... | gpt-4o, gpt-4o-mini, gpt-4.1-*, o3, o4-mini |
| Anthropic | x-api-key: sk-ant-... | claude-sonnet-4, claude-opus-4, claude-haiku-4 |
| Google | Authorization: Bearer ... | gemini-2.5-flash, gemini-2.5-pro |
| Azure OpenAI | Authorization: Bearer <azure-key> | gpt-4o, gpt-4o-mini (via x-sg-provider: azure) |
| AWS Bedrock | x-sg-aws-access-key + x-sg-aws-secret-key | amazon.titan-*, meta.llama3-*, cohere.command-r-* |
Azure requires x-sg-provider: azure, x-sg-azure-resource, and x-sg-azure-deployment headers. Bedrock requires x-sg-aws-access-key and x-sg-aws-secret-key. Other providers (Mistral, etc.) work via the passthrough proxy.
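The per-provider header requirements can be captured in one function. The sketch below uses only the header names listed in this section; `ProviderAuth` and `providerHeaders` are hypothetical names, and the SemanticGuard key header still has to be added separately.

```typescript
// Hypothetical types: one variant per provider, carrying what that provider needs.
type ProviderAuth =
  | { provider: "openai" | "google"; apiKey: string }
  | { provider: "anthropic"; apiKey: string }
  | { provider: "azure"; apiKey: string; resource: string; deployment: string }
  | { provider: "bedrock"; accessKey: string; secretKey: string };

function providerHeaders(auth: ProviderAuth): Record<string, string> {
  switch (auth.provider) {
    case "openai":
    case "google":
      return { Authorization: `Bearer ${auth.apiKey}` };
    case "anthropic":
      return { "x-api-key": auth.apiKey };
    case "azure":
      // Azure needs the provider hint plus resource and deployment names.
      return {
        Authorization: `Bearer ${auth.apiKey}`,
        "x-sg-provider": "azure",
        "x-sg-azure-resource": auth.resource,
        "x-sg-azure-deployment": auth.deployment,
      };
    case "bedrock":
      // Bedrock authenticates with an AWS key pair instead of a Bearer token.
      return {
        "x-sg-aws-access-key": auth.accessKey,
        "x-sg-aws-secret-key": auth.secretKey,
      };
  }
}

const azureHeaders = providerHeaders({
  provider: "azure",
  apiKey: "<azure-key>",
  resource: "my-resource",     // placeholder value
  deployment: "my-deployment", // placeholder value
});
```

Modeling the providers as a discriminated union lets the compiler flag a missing field, such as an Azure call without a deployment name.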
Response Headers
| Header | Example | Description |
|---|---|---|
| x-sg-cache | hit-exact, hit-semantic, miss | Cache result. Includes the layer that matched. |
| x-sg-latency | 12ms | Total proxy processing time |
| x-sg-provider | openai, anthropic, google, azure, bedrock | Detected upstream provider |
| x-sg-score | 0.97 | Similarity score (semantic hits only) |
| x-sg-confidence | 0.872 | Confidence score (0-1). Factors: similarity, age, template completeness, model recency. |
| x-sg-prompt-category | factual, code, creative, extraction, instruction, general | Auto-classified prompt category. Code and creative prompts use stricter matching thresholds. |
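Client code can read these headers into a typed object for logging or metrics. A minimal sketch follows, assuming the value formats shown in the Example column; `GuardMeta` and `parseGuardHeaders` are hypothetical names.

```typescript
type CacheStatus = "hit-exact" | "hit-semantic" | "miss" | "unknown";

interface GuardMeta {
  cache: CacheStatus;
  latencyMs: number | null;
  provider: string | null;
  score: number | null;      // expected only on semantic hits
  confidence: number | null;
  promptCategory: string | null;
}

function toCacheStatus(raw: string | null): CacheStatus {
  switch (raw) {
    case "hit-exact": return "hit-exact";
    case "hit-semantic": return "hit-semantic";
    case "miss": return "miss";
    default: return "unknown"; // header absent or unrecognized value
  }
}

function toNumber(raw: string | null): number | null {
  if (raw === null) return null;
  const parsed = parseFloat(raw); // parseFloat ignores a trailing unit such as "ms"
  return isNaN(parsed) ? null : parsed;
}

// Takes a getter so it works with any header container.
function parseGuardHeaders(get: (headerName: string) => string | null): GuardMeta {
  return {
    cache: toCacheStatus(get("x-sg-cache")),
    latencyMs: toNumber(get("x-sg-latency")),
    provider: get("x-sg-provider"),
    score: toNumber(get("x-sg-score")),
    confidence: toNumber(get("x-sg-confidence")),
    promptCategory: get("x-sg-prompt-category"),
  };
}

const sampleHeaders: Record<string, string> = {
  "x-sg-cache": "hit-semantic",
  "x-sg-latency": "12ms",
  "x-sg-provider": "openai",
  "x-sg-score": "0.97",
};
const guardMeta = parseGuardHeaders((headerName) => sampleHeaders[headerName] ?? null);
```

With a `fetch` response, the getter would be `(h) => response.headers.get(h)`.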
Savings Methodology
Baseline pricing
SemanticGuard tracks the cost of every request using published per-token rates for each model. When a cache hit is served, the baseline cost is what the upstream provider would have charged for the same prompt and response. The delta between baseline and actual cost is the realized saving for that request.
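The arithmetic behind this can be sketched as follows. The rate shape (USD per million tokens, separate input and output rates) and the numbers are assumptions for illustration, not SemanticGuard's internal pricing table; `baselineCostUsd` and `realizedSavingUsd` are hypothetical names.

```typescript
// Hypothetical rate shape: USD per one million tokens.
interface ModelRate {
  inputUsdPerMTok: number;
  outputUsdPerMTok: number;
}

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// What the upstream provider would have charged for this prompt and response.
function baselineCostUsd(usage: TokenUsage, rate: ModelRate): number {
  return (
    (usage.inputTokens * rate.inputUsdPerMTok +
      usage.outputTokens * rate.outputUsdPerMTok) / 1e6
  );
}

// Realized saving: baseline minus what was actually paid.
function realizedSavingUsd(baselineUsd: number, actualUsd: number): number {
  return baselineUsd - actualUsd;
}

// Made-up rates; substitute the model's published price list.
const exampleRate: ModelRate = { inputUsdPerMTok: 1.0, outputUsdPerMTok: 2.0 };
const exampleUsage: TokenUsage = { inputTokens: 1000, outputTokens: 500 };

// (1000 * 1.0 + 500 * 2.0) / 1e6 = 0.002 USD
const exampleBaseline = baselineCostUsd(exampleUsage, exampleRate);
// Assumes a cache hit incurs no upstream cost.
const exampleSaving = realizedSavingUsd(exampleBaseline, 0);
```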
What "savings" means
Savings represent avoided LLM spend: the upstream tokens that were never sent because a cached response was returned. Token counts for cached responses are derived from the original response that was stored. Savings are denominated in USD using the model's published rate at storage time.
Shadow mode and projected savings
When shadow mode is active, SemanticGuard identifies requests that would have been served from cache but forwards them to the upstream model anyway. This lets you measure hit rate and estimated savings before committing to live caching. Projected savings are the sum of shadow-mode cost deltas over the selected period.
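Projected savings over a period can be sketched as a filtered sum. `ShadowEntry` is a hypothetical record shape, and treating the cost delta of a would-be hit as its full baseline cost is an assumption.

```typescript
// Hypothetical record of one request observed in shadow mode.
interface ShadowEntry {
  timestampMs: number;
  wouldHit: boolean;       // true if the cache would have served this request
  baselineCostUsd: number; // what the upstream call cost
}

// Sum the deltas of would-be hits inside the half-open window [fromMs, toMs).
function projectedSavingsUsd(
  entries: ShadowEntry[],
  fromMs: number,
  toMs: number,
): number {
  let total = 0;
  for (const entry of entries) {
    const inWindow = entry.timestampMs >= fromMs && entry.timestampMs < toMs;
    if (entry.wouldHit && inWindow) {
      total += entry.baselineCostUsd; // assumption: delta equals full baseline
    }
  }
  return total;
}

const shadowLog: ShadowEntry[] = [
  { timestampMs: 100, wouldHit: true, baselineCostUsd: 0.5 },
  { timestampMs: 200, wouldHit: false, baselineCostUsd: 1.0 }, // miss: no saving
  { timestampMs: 300, wouldHit: true, baselineCostUsd: 0.25 },
  { timestampMs: 900, wouldHit: true, baselineCostUsd: 2.0 },  // outside the window
];
const projected = projectedSavingsUsd(shadowLog, 0, 500);
```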
Reconciliation and drift
The savings dashboard compares realized savings (live cache hits) against shadow-mode projections for the same time window. When the two diverge by more than 25%, a drift warning is shown. Common causes: switching from shadow mode to live mode mid-period, changes in traffic mix, cache TTL expiry clearing entries, or threshold tuning. The savings ledger (Observe section) provides a per-request audit trail for detailed investigation.
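The drift check can be approximated in your own reporting. In the sketch below, measuring divergence relative to the projection is an assumption, since the text does not say which value is the denominator; `driftRatio` and `hasDrift` are hypothetical names.

```typescript
// Assumption: divergence is measured relative to the shadow-mode projection.
function driftRatio(realizedUsd: number, projectedUsd: number): number {
  if (projectedUsd === 0) {
    // Nothing was projected: any realized saving counts as unbounded drift.
    return realizedUsd === 0 ? 0 : Infinity;
  }
  return Math.abs(realizedUsd - projectedUsd) / projectedUsd;
}

// A warning applies only when the two diverge by MORE than the threshold.
function hasDrift(
  realizedUsd: number,
  projectedUsd: number,
  threshold: number = 0.25,
): boolean {
  return driftRatio(realizedUsd, projectedUsd) > threshold;
}

const driftAt70 = hasDrift(70, 100); // 30% below the projection
const driftAt80 = hasDrift(80, 100); // 20% below the projection
```

A flagged window is a prompt to check the listed causes in the savings ledger, not evidence of a fault on its own.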
Assumptions and limitations
- Model pricing uses list rates; negotiated discounts are not factored in.
- Thinking tokens (extended reasoning) are tracked separately and included in baseline cost when present.
- Streaming responses: token counts are derived from the stored response, which may differ slightly from the upstream count due to buffering.
- Savings are computed at request time and recorded in the ledger; retroactive pricing changes are not applied to historical entries.
How caching works
SemanticGuard uses multi-layer caching that goes beyond exact key matching. It recognizes the same prompt worded differently, prompts that differ only in names or IDs, and conversations that share context with prior turns.
Every cache match is verified with advanced pattern matching before being served. The system is conservative by default: when in doubt, it forwards to the upstream provider rather than risk returning the wrong response.
Quality safeguards
- Your own AI continuously validates cached responses against the prompts they served. Failures are surfaced to admins, never silently delivered.
- The system learns which parts of your prompts vary (names, IDs, dates) so it never confuses one user's data with another's.
- Matching strictness adapts to prompt type. Code and creative prompts use stricter thresholds than factual lookups.
- Sign in to the dashboard to inspect cache decisions, verification logs, and per-request audit trails.
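The conservative default and the category-dependent strictness described above can be illustrated with a small decision function. The threshold values are invented for illustration, since actual values are not published here; `MIN_SCORE` and `decideCacheUse` are hypothetical names.

```typescript
type PromptCategory =
  | "factual" | "code" | "creative" | "extraction" | "instruction" | "general";

// Hypothetical thresholds: code and creative are stricter than factual lookups.
const MIN_SCORE: Record<PromptCategory, number> = {
  factual: 0.9,
  extraction: 0.93,
  instruction: 0.93,
  general: 0.93,
  code: 0.97,
  creative: 0.97,
};

type CacheDecision = "serve-cached" | "forward-upstream";

function decideCacheUse(
  category: PromptCategory | undefined,
  score: number | undefined,
): CacheDecision {
  // Conservative default: any missing or invalid signal forwards upstream.
  if (category === undefined || score === undefined || isNaN(score)) {
    return "forward-upstream";
  }
  return score >= MIN_SCORE[category] ? "serve-cached" : "forward-upstream";
}

const factualDecision = decideCacheUse("factual", 0.95);
const codeDecision = decideCacheUse("code", 0.95);
const unknownDecision = decideCacheUse(undefined, 0.99);
```

With these made-up thresholds, the same 0.95 score serves a factual prompt from cache but forwards a code prompt to the provider.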