Engineering · 8 min read

How We Cut AI Inference Costs by 70% with Semantic Caching

Most teams pay for the same AI computation over and over. Here's the architecture we built to stop that - and the surprising edge cases we had to solve.

Maya Okonkwo

Staff Engineer

May 8, 2025

When we started tracking how our customers use the Synthr API, we noticed something striking: roughly 65% of all incoming prompts were semantically identical to a prompt we had already answered within the last 24 hours.

Not identical in the string-equality sense, but close enough that a user could not tell the cached answer from a freshly generated one. This is the insight behind semantic caching - and it turns out to be a massive cost lever.

Why exact-match caching fails for AI

Traditional HTTP caching works on exact URL and header matching. That approach fails immediately for LLM prompts because users phrase the same question a thousand different ways. 'Summarize this article' and 'Give me a summary of the following text' are identical in intent but different in bytes.

typescript
// Naive approach: exact string match, almost never hits
const cached = cache.get(prompt);

// Semantic approach: nearest-neighbor lookup on the prompt embedding,
// hits 70% of the time
const embedding = await embed(prompt);
const nearest = await vectorSearch(embedding, { threshold: 0.92 });
if (nearest) return nearest.response;

The architecture

Every incoming prompt is embedded by a small, fast embedding model running at the edge. We then search a vector index of recently answered prompts for the nearest neighbor by cosine similarity. If the similarity exceeds the configured threshold, we return the cached response immediately - no LLM call needed.

  • Embedding latency: ~3ms at the edge
  • Vector search latency: ~5ms against 100K entries
  • Total cache hit overhead: under 10ms
  • Average LLM call saved: 800ms and $0.004
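
The lookup path described above can be sketched in a few lines. This is a minimal in-memory illustration, not Synthr's implementation: the names `SemanticCache`, `CacheEntry`, and `cosineSimilarity` are hypothetical, the 24-hour expiry is an assumption borrowed from the observation window mentioned earlier, and the linear scan stands in for a real vector index. Embeddings are passed in as plain number arrays so the sketch stays independent of any particular embedding model.

```typescript
// Hypothetical sketch of a semantic cache lookup. Names and the TTL are assumptions.
type CacheEntry = { embedding: number[]; response: string; storedAt: number };

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as matching nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private readonly threshold: number;
  private readonly ttlMs: number;

  constructor(threshold = 0.92, ttlMs = 24 * 60 * 60 * 1000) {
    this.threshold = threshold;
    this.ttlMs = ttlMs; // assumed expiry window, not stated in the post
  }

  store(embedding: number[], response: string, now: number = Date.now()): void {
    this.entries.push({ embedding, response, storedAt: now });
  }

  // Returns the cached response of the most similar unexpired entry,
  // or undefined when nothing scores at or above the threshold.
  lookup(embedding: number[], now: number = Date.now()): string | undefined {
    let best: CacheEntry | undefined;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      if (now - entry.storedAt > this.ttlMs) continue; // skip expired entries
      const score = cosineSimilarity(embedding, entry.embedding);
      if (score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best !== undefined && bestScore >= this.threshold ? best.response : undefined;
  }
}
```

On a miss, the caller would invoke the LLM and then call `store` with the new response. A linear scan like this will not reach single-digit-millisecond search at 100K entries; an approximate nearest-neighbor index would take its place in practice.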

Edge cases that will bite you

The threshold matters more than you think. Set it too low (0.85) and you return wrong answers to subtly different questions. Set it too high (0.98) and you barely cache anything. We settled on 0.92 as the default after testing on 50M real queries, but we expose it as a configurable parameter.
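
One way to reason about this trade-off is to sweep candidate thresholds over prompt pairs that have been labeled by hand. The sketch below assumes such a labeled set exists; `LabeledPair`, `evaluateThreshold`, and the six sample rows are invented for illustration and are not drawn from the 50M-query test.

```typescript
// Hypothetical threshold sweep. The sample data is made up for illustration only.
type LabeledPair = { similarity: number; sameIntent: boolean };

type ThresholdReport = { threshold: number; hitRate: number; wrongAnswerRate: number };

function evaluateThreshold(pairs: LabeledPair[], threshold: number): ThresholdReport {
  let hits = 0;
  let wrongHits = 0;
  for (const pair of pairs) {
    if (pair.similarity >= threshold) {
      hits++;
      // A hit on a pair with different intent would serve a wrong answer.
      if (!pair.sameIntent) wrongHits++;
    }
  }
  return {
    threshold,
    hitRate: pairs.length > 0 ? hits / pairs.length : 0,
    wrongAnswerRate: hits > 0 ? wrongHits / hits : 0,
  };
}

// Toy data: similarity of a new prompt to its nearest cached prompt,
// plus a human judgment of whether the two ask for the same thing.
const samplePairs: LabeledPair[] = [
  { similarity: 0.99, sameIntent: true },
  { similarity: 0.95, sameIntent: true },
  { similarity: 0.93, sameIntent: true },
  { similarity: 0.88, sameIntent: false },
  { similarity: 0.86, sameIntent: true },
  { similarity: 0.8, sameIntent: false },
];

const sweep = [0.85, 0.92, 0.98].map((t) => evaluateThreshold(samplePairs, t));
```

On this toy set the lowest threshold hits most often but also serves a wrong answer, while the highest hits only once, which mirrors the pattern described above.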

Time-sensitive prompts need special handling. 'What is today's weather?' should never be cached. We detect temporal signals in prompts and bypass the cache automatically.
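
The post does not say how temporal signals are detected, so the following is only a plausible starting point: a keyword heuristic that runs before the cache lookup. The pattern list and the `isTimeSensitive` name are assumptions, and a production detector would likely need more than regular expressions.

```typescript
// Hypothetical heuristic. These patterns are illustrative, not Synthr's actual rules.
const TEMPORAL_PATTERNS: RegExp[] = [
  /\b(today|tonight|tomorrow|yesterday)\b/i,
  /\b(right now|currently|at the moment|as of now)\b/i,
  /\b(latest|current|breaking)\b/i,
  /\bthis (morning|afternoon|evening|week|month|year)\b/i,
  /\bwhat (time|day|date) is it\b/i,
];

// Returns true when the prompt contains a phrase suggesting the answer
// depends on when the question is asked, so the cache should be bypassed.
function isTimeSensitive(prompt: string): boolean {
  return TEMPORAL_PATTERNS.some((pattern) => pattern.test(prompt));
}
```

A keyword list like this errs toward false positives ('Summarize the current draft' would bypass the cache), which costs an extra LLM call but never serves a stale answer.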

Results

Teams using semantic caching at the default threshold see an average 71% reduction in billable LLM calls. For a team spending $5,000/month on inference, that's $3,550 back in their pocket - every month.
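
The arithmetic behind that figure is simple enough to check for your own spend. The sketch below assumes cost scales linearly with billable calls, that is, every call costs about the same; `estimateMonthlySavings` is a hypothetical helper, not part of the Synthr API.

```typescript
// Hypothetical helper. Assumes a uniform cost per billable LLM call.
function estimateMonthlySavings(
  monthlySpendUsd: number,
  callReduction: number, // fraction of billable calls avoided, between 0 and 1
): { savedUsd: number; remainingUsd: number } {
  if (callReduction < 0 || callReduction > 1) {
    throw new RangeError("callReduction must be between 0 and 1");
  }
  // Round to cents to avoid floating-point noise in the reported figures.
  const savedUsd = Math.round(monthlySpendUsd * callReduction * 100) / 100;
  const remainingUsd = Math.round((monthlySpendUsd - savedUsd) * 100) / 100;
  return { savedUsd, remainingUsd };
}

// The example from the text: $5,000/month at a 71% reduction.
const example = estimateMonthlySavings(5000, 0.71);
// example.savedUsd is 3550 and example.remainingUsd is 1450
```

If long prompts are less likely to hit the cache than short ones, the dollar savings will be lower than the reduction in call count suggests.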
