Why I Built SemanticGuard


My career has been defined by a persistent drive to build efficient, scalable systems and to manage their operational costs. From transforming localized processes into web-scale platforms at Meta to spearheading FinOps strategies as VP Cloud Platform at Teads and Outbrain, I've spent years immersed in the practical realities of infrastructure economics. I've seen firsthand how easily technology, despite its immense power, can become a drain if not managed shrewdly.
Then came Generative AI. The promise was clear: transformative applications, incredible productivity gains. But it wasn't long before a familiar challenge emerged, one that mirrored the early days of cloud adoption, only amplified: unpredictable, rapidly escalating costs. The developers, product managers, and CTOs I spoke with were all grappling with the same issue: how to reduce LLM API cost without sacrificing the very quality that made these models so compelling.
This wasn't just a theoretical problem for me; it was a daily reality for teams trying to ship AI-powered features. We were building remarkable things, but every API call felt like it had a ticking meter attached. I knew there had to be a better way to harness the power of LLMs responsibly.
Many of us started our LLM journeys by simply calling the OpenAI, Anthropic, and Google Gemini APIs directly. The initial costs might seem manageable for a proof-of-concept. But as applications scale, token counts skyrocket. A single complex agent chain or an LLM-powered internal tool can quickly run up a sizable bill. Consider that GPT-4, for instance, costs around $30 per 1 million input tokens and $60 per 1 million output tokens. For sophisticated applications making hundreds or thousands of calls daily, these figures quickly turn into significant operational expenses.
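To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch. The prices are the GPT-4 figures quoted above; the traffic profile (`exampleTraffic`) and the helper names are assumptions invented for illustration, not measurements from any real system.

```typescript
// USD per 1 million tokens, using the GPT-4 figures quoted in the text.
interface ModelPricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Hypothetical traffic profile: every number here is an assumption.
interface TrafficProfile {
  callsPerDay: number;
  avgInputTokens: number;
  avgOutputTokens: number;
}

const GPT4_PRICING: ModelPricing = { inputPerMillion: 30, outputPerMillion: 60 };

// Cost of one average call in USD.
function costPerCall(pricing: ModelPricing, traffic: TrafficProfile): number {
  const weightedTokens =
    traffic.avgInputTokens * pricing.inputPerMillion +
    traffic.avgOutputTokens * pricing.outputPerMillion;
  return weightedTokens / 1e6;
}

// Projected spend over `days` days; the division happens last to keep the math exact.
function monthlyCost(pricing: ModelPricing, traffic: TrafficProfile, days: number = 30): number {
  const weightedTokens =
    traffic.avgInputTokens * pricing.inputPerMillion +
    traffic.avgOutputTokens * pricing.outputPerMillion;
  return (traffic.callsPerDay * days * weightedTokens) / 1e6;
}

const exampleTraffic: TrafficProfile = {
  callsPerDay: 5000,
  avgInputTokens: 1500,
  avgOutputTokens: 500,
};

console.log(monthlyCost(GPT4_PRICING, exampleTraffic)); // 11250
```

Under these assumed numbers, a tool making 5,000 calls a day lands at roughly $11,250 a month, which is why the share of repeated or near-repeated calls matters so much.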
What often gets overlooked is the nature of these calls. How many are genuinely unique? How many are slightly rephrased versions of a previous query? Without a way to recognize that two prompts mean the same thing, each variation becomes a new, expensive API call. This isn't just about eliminating identical calls; it's about matching on semantic similarity. A user asking "What's the capital of France?" and then "Tell me the capital of France" should ideally get the same answer from a cache, but most simple caching mechanisms would treat them as distinct requests. This is where traditional key-value caching falls short: it lacks the understanding of meaning needed to truly reduce LLM API cost.
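The difference between exact-key and similarity-based lookup can be sketched in a few lines. This is not SemanticGuard's implementation: `toyEmbed` is a bag-of-words stand-in for a real embedding model, and `SemanticCache` and the 0.6 threshold are hypothetical names and values chosen so the example runs without any external service.

```typescript
type Embedding = Map<string, number>;

// Toy stand-in for an embedding model: lowercase word counts.
// A real system would call an embedding model and compare dense vectors.
function toyEmbed(text: string): Embedding {
  const counts: Embedding = new Map<string, number>();
  for (const token of text.toLowerCase().split(/[^a-z0-9]+/)) {
    if (token.length === 0) continue;
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return counts;
}

function norm(v: Embedding): number {
  let sumOfSquares = 0;
  v.forEach((weight) => { sumOfSquares += weight * weight; });
  return Math.sqrt(sumOfSquares);
}

function cosineSimilarity(a: Embedding, b: Embedding): number {
  let dot = 0;
  a.forEach((weight, token) => { dot += weight * (b.get(token) ?? 0); });
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

interface CacheEntry {
  embedding: Embedding;
  response: string;
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  store(prompt: string, response: string): void {
    this.entries.push({ embedding: toyEmbed(prompt), response });
  }

  // Returns the response of the most similar stored prompt, if it clears the threshold.
  lookup(prompt: string): string | undefined {
    const query = toyEmbed(prompt);
    let best: CacheEntry | undefined;
    let bestScore = -1;
    for (const entry of this.entries) {
      const score = cosineSimilarity(query, entry.embedding);
      if (score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best !== undefined && bestScore >= this.threshold ? best.response : undefined;
  }
}

const cache = new SemanticCache(0.6);
cache.store("What's the capital of France?", "Paris");

// The prompts share 4 of their 6 tokens, so similarity is about 0.67: a hit.
const rephrased = cache.lookup("Tell me the capital of France");
// No tokens in common, so similarity is 0: a miss.
const unrelated = cache.lookup("How do I reset my password?");
console.log(rephrased, unrelated);
```

A plain key-value cache would miss on the rephrased prompt because the strings differ. The threshold is the sensitive part of the design: set it too low and users get answers to questions they did not ask, set it too high and the cache degrades back to exact matching.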
My experience in distributed systems taught me that optimization needs layers. Just as we wouldn't run the same database query repeatedly if the data hadn't changed, we shouldn't ask an LLM the same question, in different words, over and over. The challenge was how to build that semantic layer without introducing complexity or compromising accuracy.
At companies like Outbrain and Meta, a core part of my role involved optimizing large-scale cloud infrastructure. This wasn't just about buying cheaper instances; it was about smart architecture, efficient resource utilization, and granular visibility into spending. When I looked at LLM usage, I saw the same patterns of inefficiency that I had battled with traditional cloud resources.
The idea for SemanticGuard didn't come out of thin air; it was born from these FinOps principles applied to a new domain. I recognized that to genuinely reduce LLM API cost, we needed a solution that was:
Many engineering teams initially consider building their own LLM caching solution. I understand this impulse; I've led teams that built everything from scratch. But the nuances of effective LLM caching are substantial. It's not just a dict lookup.
You need to:
import OpenAI from "openai";
import { withSemanticGuard } from "@semanticguard/ai-sdk";

// The client is created as usual; only the fetch implementation is swapped.
const openai = new OpenAI({
  apiKey: "your-openai-key",
  fetch: withSemanticGuard(),
});
While the code snippet above is illustrative, it highlights the core principle: the developer experience should remain familiar, while the underlying intelligence drastically optimizes resource use. This simple integration pattern was central to how I envisioned SemanticGuard: a powerful optimization that doesn't require a complete rewrite of your LLM interaction logic.
One of the biggest hurdles in adopting new infrastructure is proving its value before making a full commitment. I've been in countless meetings where I had to justify significant cloud spend or infrastructure changes. That's why I insisted on a "Shadow Mode" for SemanticGuard. This feature allows teams to route their LLM traffic through our gateway, observe the potential savings, and see exactly how much they could reduce LLM API cost, all before enabling caching or making any financial commitment. It reflects my engineering ethos: measure, validate, then optimize.
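The measure-first idea can be illustrated with a small ledger that watches real traffic and only counts what a cache would have saved, without ever serving a cached response. This is a sketch of the concept, not SemanticGuard's Shadow Mode: `ShadowLedger`, `normalizePrompt`, and the per-call costs are hypothetical, and the matcher is a simple normalized exact match standing in for semantic matching.

```typescript
interface ShadowReport {
  totalCalls: number;
  wouldBeHits: number;
  actualSpendUsd: number;
  avoidableSpendUsd: number;
}

// Hypothetical matcher: lowercase, strip punctuation, collapse whitespace.
function normalizePrompt(prompt: string): string {
  return prompt.toLowerCase().replace(/[^a-z0-9\s]/g, "").replace(/\s+/g, " ").trim();
}

class ShadowLedger {
  private seen = new Set<string>();
  private report: ShadowReport = {
    totalCalls: 0,
    wouldBeHits: 0,
    actualSpendUsd: 0,
    avoidableSpendUsd: 0,
  };

  // Call once per real upstream request, after it has already been paid for.
  // Nothing is served from cache here; the ledger only records what could have been.
  observe(prompt: string, costUsd: number): void {
    const key = normalizePrompt(prompt);
    this.report.totalCalls += 1;
    this.report.actualSpendUsd += costUsd;
    if (this.seen.has(key)) {
      this.report.wouldBeHits += 1;
      this.report.avoidableSpendUsd += costUsd;
    } else {
      this.seen.add(key);
    }
  }

  summary(): ShadowReport {
    return { ...this.report };
  }
}

const ledger = new ShadowLedger();
ledger.observe("What is the capital of France?", 0.25);
ledger.observe("what is the capital of france", 0.25); // would have been a hit
ledger.observe("Summarize this support ticket", 0.5);

const shadowSummary = ledger.summary();
console.log(shadowSummary);
```

Because every request still goes to the provider, a ledger like this cannot change application behavior. It only produces the number needed to decide whether turning caching on is worth it.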
This isn't just about cost; it's about confidence. Confidence that your solution will perform, that your data is secure (running in your own infrastructure), and that you retain control. It's about providing a tool that acts as a reliable partner, allowing developers to focus on building features, not battling spiraling operational expenses.
I built SemanticGuard because I believe in empowering developers to build amazing AI applications without being constrained by unpredictable costs or complex infrastructure. It's the culmination of years of experience in FinOps, cloud architecture, and distributed systems, applied to solve one of the most pressing problems in modern AI development.
Even if you're not ready to implement an intelligent caching solution, there are immediate actions you can take to gain control over your LLM API costs: