How to establish LLM cost governance in your org

Guy Kobrinsky | Software Engineering Manager @ Meta. Building SemanticGuard

Most engineering teams discover the same problem in the same order. First, a single team ships a feature using GPT-4 or Claude. Results are great. Within a quarter, a dozen teams are doing the same thing, often with the same API key copy-pasted into environment variables. Then the consolidated bill arrives, and nobody can answer the only question that matters: who spent what, and was any of it worth it?

This is not a story about the cost of AI. It is a story about the absence of attribution. A monolithic bill from OpenAI or Anthropic tells you the total, but it cannot tell you which team, which product feature, or which customer account drove the spend. Without that, you cannot set budgets, identify outliers, or calculate ROI. You are flying blind on the fastest-growing line item in your infrastructure budget.

The core capability you need is LLM cost tracking per consumer. A "consumer" here might be an internal team, a microservice, a B2B customer account, or an individual end-user of your application. Granular attribution is the foundation everything else sits on.

Why the cloud FinOps playbook falls short

The natural instinct is to reach for the classic cloud cost management toolkit. Tag resources. Set IAM policies. Watch the usage dashboards. These are reasonable first steps, but they miss the shape of the problem.

LLM costs are not metered like EC2 instance-hours. The unit is a token, processed in milliseconds, and the billing data from your model provider often lags by hours or a full day. That is far too slow to catch a runaway loop that can burn through a month's budget in an afternoon. A single API key is usually shared across an entire application, sometimes across multiple applications, so traditional resource tagging has nothing useful to attach to. The "resource" is the key itself, not the individual calls made with it.
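To make the token-level arithmetic concrete, here is a minimal sketch of estimating cost per request from token counts. The model names and prices are hypothetical placeholders, not any provider's actual rates, and the loop parameters are invented for illustration.

```python
# Hypothetical prices in USD per one million tokens. Real prices differ by
# provider and model and change over time; load them from your own config.
PRICES_PER_MILLION_TOKENS = {
    "example-large": {"input": 10.00, "output": 30.00},
    "example-small": {"input": 0.50, "output": 1.50},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    price = PRICES_PER_MILLION_TOKENS[model]
    return (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000


# Illustrative runaway loop: 5 calls per second for 4 hours,
# each call using 2,000 input tokens and 500 output tokens.
calls = 5 * 60 * 60 * 4
per_call = estimate_cost_usd("example-large", 2_000, 500)
total = calls * per_call
print(f"{calls:,} calls at ${per_call:.3f} each is about ${total:,.2f}")
```

Under these made-up prices, a few cents per call adds up to roughly $2,500 in a single afternoon, well before a day-old billing export would show it.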

What you need is a layer of instrumentation that operates at the application level, understands the context of each request, and can act on that context before the call ever reaches the model provider. That is the foundation of LLM governance.

A five-step framework

Building a governance model from scratch sounds heavyweight, but it comes down to five steps. The goal is a single point of control that gives decentralized teams the visibility they need to make safe decisions on their own.

1. Centralize API key management. Stop the proliferation of provider keys stored in environment variables across dozens of projects. Route all LLM traffic through a central proxy or gateway. This single change is what makes every later step possible.
2. Issue consumer identifiers. Instead of one OpenAI key for your whole application, the gateway should issue scoped keys or accept a consumer identifier on each request. Identifiers might look like team:marketing, feature:chatbot-widget, or customer:acme-corp. Pass them as a header.
3. Log and attribute every request. Capture the consumer identifier, prompt, model, token counts, and latency for every call. This is the raw material for everything downstream. Store it somewhere queryable.
4. Visualize the spend. Raw logs are for forensics. Governance needs dashboards. Show spend over time, broken down by consumer. Surface the top ten most expensive consumers and the fastest-growing ones. Visibility is what makes the value conversations possible.
5. Set budgets and rate limits. Once attribution works, the gateway becomes the place to enforce policy. Hard monthly budgets per team. Rate limits on a feature that looks abusive. These guardrails turn cost management from reactive to proactive.
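The logging, attribution, and budget steps above can be sketched as a small in-memory ledger that a gateway consults before forwarding a call and updates afterwards. This is an assumption-laden illustration, not a reference design: UsageRecord, UsageLedger, and BudgetExceeded are hypothetical names, storage is a plain list rather than a queryable database, and monthly resets are left out.

```python
import time
from dataclasses import dataclass, field


@dataclass
class UsageRecord:
    """One attributed LLM call (hypothetical schema)."""
    consumer_id: str       # e.g. "team:marketing" or "customer:acme-corp"
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    timestamp: float = field(default_factory=time.time)


class BudgetExceeded(Exception):
    """Raised when a consumer has used up its budget."""


class UsageLedger:
    def __init__(self, monthly_budgets_usd: dict[str, float]):
        self.budgets = monthly_budgets_usd
        self.records: list[UsageRecord] = []
        self.spend: dict[str, float] = {}

    def check_budget(self, consumer_id: str) -> None:
        """Call before forwarding a request to the model provider."""
        budget = self.budgets.get(consumer_id)
        spent = self.spend.get(consumer_id, 0.0)
        # Consumers without a configured budget are allowed through.
        if budget is not None and spent >= budget:
            raise BudgetExceeded(f"{consumer_id} spent ${spent:.2f} of ${budget:.2f}")

    def record(self, rec: UsageRecord) -> None:
        """Call after the provider responds, once token counts are known."""
        self.records.append(rec)
        self.spend[rec.consumer_id] = self.spend.get(rec.consumer_id, 0.0) + rec.cost_usd

    def top_consumers(self, n: int = 10) -> list[tuple[str, float]]:
        """Highest-spending consumers first, for a dashboard view."""
        return sorted(self.spend.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Checking the budget before the call and recording after it means a consumer can overshoot by at most the requests already in flight, which is usually an acceptable trade for keeping the check cheap.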

What it looks like in practice

Consider a B2B SaaS company offering an AI-powered analytics feature to its enterprise customers. Their monthly bill from the model provider has crossed $20,000 and is growing unpredictably. They do not know whether one large customer is running massive reports or whether all 100 customers are using the feature moderately.

They route all traffic through an AI gateway and require every request to carry an X-Customer-ID header. Within a week, the dashboard shows three customers accounting for the majority of spend, and an internal load test that someone forgot to shut down has quietly consumed over a thousand dollars on its own.
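A rough sketch of the two mechanics in that scenario: rejecting requests that lack the X-Customer-ID header, and measuring how concentrated the spend is. The function names and the spend figures are hypothetical and exist only to illustrate the calculation.

```python
def extract_customer_id(headers: dict[str, str]) -> str:
    """Return the customer ID from request headers, or raise if it is missing."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    customer_id = normalized.get("x-customer-id", "")
    if not customer_id:
        raise ValueError("request rejected: missing X-Customer-ID header")
    return customer_id


def top_n_share(spend_by_customer: dict[str, float], n: int = 3) -> float:
    """Fraction of total spend attributable to the n largest customers."""
    total = sum(spend_by_customer.values())
    if total == 0:
        return 0.0
    largest = sorted(spend_by_customer.values(), reverse=True)[:n]
    return sum(largest) / total


# Made-up monthly spend per customer, in USD.
example_spend = {"acme-corp": 500.0, "globex": 300.0, "initech": 200.0,
                 "umbrella": 100.0, "hooli": 100.0}
print(f"Top 3 customers: {top_n_share(example_spend):.0%} of spend")
```

Rejecting unattributed requests outright is the strict option; a gentler rollout might log them under an "unknown" consumer first and enforce later.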

That is what governance produces in week one. Not savings yet, but the information needed to act. The team can have a real conversation with the heavy-usage customers about consumption-based pricing. They can shut down the runaway script. They can give product managers a per-feature budget and let them make their own trade-offs between model quality and cost. And with intelligent caching enabled at the gateway layer, repeated analytics queries can be served without a fresh model call at all, which is where the recurring savings come from.
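As a simplified stand-in for the caching mentioned above, here is a sketch of an exact-match response cache with a time-to-live. It only reuses a response when the model and prompt are byte-for-byte identical, which is a much simpler technique than semantic or "intelligent" caching; ResponseCache and the call_model callback are hypothetical names.

```python
import hashlib
import json
import time
from typing import Callable


class ResponseCache:
    """Exact-match cache keyed on a hash of (model, prompt)."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict[str, tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str,
                    call_model: Callable[[str, str], str]) -> str:
        """Return a fresh cached response, or call the model and cache the result."""
        key = self._key(model, prompt)
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        response = call_model(model, prompt)
        self._store[key] = (now, response)
        return response
```

The hit and miss counters give a direct measure of how many model calls the cache avoided, which can be attributed per consumer in the same way as spend.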

The mechanics differ by stack. SemanticGuard is one option built for this pattern, but the framework above applies whether you build it in-house, adopt an open-source proxy, or use a commercial gateway.

The payoff

Governance is not red tape. It is the set of guardrails that lets your teams keep moving fast without one runaway feature blowing up the quarter. When you can attribute spend per consumer, your AI line item stops being an opaque operational expense and starts behaving like every other unit-economics input. You can calculate cost-to-serve per customer. You can measure the ROI of a new AI feature against its actual cost. You can hand a product manager a budget and trust them to make the call between a bigger model and a cheaper one.
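Once spend is attributed, the unit-economics arithmetic is short. The sketch below uses the common definition of ROI as (value minus cost) divided by cost; the function names and all figures are hypothetical.

```python
def llm_cost_share(attributed_spend_usd: float, revenue_usd: float) -> float:
    """LLM spend for a customer as a fraction of that customer's revenue."""
    if revenue_usd <= 0:
        raise ValueError("revenue must be positive")
    return attributed_spend_usd / revenue_usd


def roi(value_usd: float, cost_usd: float) -> float:
    """Return on investment: (value - cost) / cost."""
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return (value_usd - cost_usd) / cost_usd


# Made-up figures: a customer paying $2,000 a month with $300 of attributed
# LLM spend, and a feature that costs $2,000 and is valued at $5,000.
print(f"Cost share: {llm_cost_share(300.0, 2_000.0):.0%}")
print(f"Feature ROI: {roi(5_000.0, 2_000.0):.1f}x")
```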

The durable AI strategy is not the one with the best model. It is the one with the controls around it.
