When we started tracking how our customers use the Synthr API, we noticed something striking: roughly 65% of all incoming prompts were semantically identical to a prompt we had already answered within the last 24 hours.
They were not identical in the string-equality sense, but close enough that returning the previous answer would be indistinguishable in quality. This is the insight behind semantic caching, and it turns out to be a massive cost lever.
Why exact-match caching fails for AI
Traditional HTTP caching works on exact URL and header matching. That approach fails immediately for LLM prompts because users phrase the same question a thousand different ways. 'Summarize this article' and 'Give me a summary of the following text' are identical in intent but different in bytes.
    // Naive approach: almost never hits
    const cached = cache.get(prompt); // exact string match

    // Semantic approach: hits 70% of the time
    const embedding = await embed(prompt);
    const nearest = await vectorSearch(embedding, { threshold: 0.92 });
    if (nearest) return nearest.response;
The architecture
Every incoming prompt gets embedded using a small, fast embedding model running at the edge. We then compare that embedding, by cosine similarity, against a vector index of recently answered prompts and their responses. If the similarity exceeds the configured threshold, we return the cached response immediately, with no LLM call needed.
- Embedding latency: ~3ms at the edge
- Vector search latency: ~5ms against 100K entries
- Total cache hit overhead: under 10ms
- Average saving per cache hit: 800ms of latency and $0.004 in LLM cost
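The lookup path described above can be sketched roughly as follows. This is a minimal in-memory illustration under stated assumptions, not the production implementation: `cosineSimilarity`, `SemanticCache`, the 24-hour `ttlMs` default, and the brute-force scan are all hypothetical names and choices made for the example. A real deployment would likely use an approximate nearest-neighbor index rather than a linear scan. The embedding step is left out, so the sketch takes embeddings as plain arrays of numbers.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical in-memory cache; a linear scan stands in for the vector index.
class SemanticCache {
  constructor({ threshold = 0.92, ttlMs = 24 * 60 * 60 * 1000 } = {}) {
    this.threshold = threshold;
    this.ttlMs = ttlMs; // assumed retention window for "recent" entries
    this.entries = [];
  }

  // Store a response alongside the embedding of the prompt that produced it.
  set(embedding, response, now = Date.now()) {
    this.entries.push({ embedding, response, storedAt: now });
  }

  // Return the response of the most similar unexpired entry whose similarity
  // exceeds the threshold, or null on a cache miss.
  get(embedding, now = Date.now()) {
    let best = null;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      if (now - entry.storedAt > this.ttlMs) continue; // expired
      const score = cosineSimilarity(embedding, entry.embedding);
      if (score > this.threshold && score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best ? best.response : null;
  }
}
```

On a miss, the caller would make the LLM call as usual and then `set` the new embedding and response so later near-duplicates can hit.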
Edge cases that will bite you
The threshold matters more than you think. Set it too low (0.85) and you return wrong answers to subtly different questions. Set it too high (0.98) and you barely cache anything. We settled on 0.92 as the default after testing on 50M real queries, but we expose it as a configurable parameter.
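One way to choose a threshold is to sweep candidate values over a labeled sample of prompt pairs and watch how the hit rate trades off against wrong answers. The sketch below assumes such a sample exists, where each pair carries a similarity score and a judgment of whether reusing the cached answer would have been acceptable. The `evaluateThreshold` helper, the field names, and the sample data are hypothetical and purely illustrative; they are not taken from the 50M-query test.

```javascript
// Each pair holds the similarity of a new prompt to its nearest cached prompt,
// plus a label saying whether reusing the cached answer would be acceptable.
function evaluateThreshold(pairs, threshold) {
  let hits = 0;
  let wrongHits = 0;
  for (const { similarity, sameIntent } of pairs) {
    if (similarity > threshold) {
      hits++;
      if (!sameIntent) wrongHits++;
    }
  }
  return {
    threshold,
    hitRate: pairs.length ? hits / pairs.length : 0,
    wrongHitRate: hits ? wrongHits / hits : 0, // share of hits that were wrong
  };
}

// Made-up sample data, for illustration only.
const samplePairs = [
  { similarity: 0.99, sameIntent: true },
  { similarity: 0.95, sameIntent: true },
  { similarity: 0.93, sameIntent: true },
  { similarity: 0.90, sameIntent: false },
  { similarity: 0.87, sameIntent: false },
  { similarity: 0.86, sameIntent: true },
  { similarity: 0.60, sameIntent: false },
  { similarity: 0.40, sameIntent: false },
];

const sweep = [0.85, 0.92, 0.98].map((t) => evaluateThreshold(samplePairs, t));
console.table(sweep);
```

On this toy sample, lowering the threshold raises the hit rate but lets wrong answers through, while raising it leaves almost nothing cached, which is the same trade-off described above.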
Time-sensitive prompts need special handling. A prompt like "What is today's weather?" should never be served from the cache, because the correct answer changes over time. We detect temporal signals in prompts and bypass the cache automatically.
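A simple way to approximate this kind of bypass is a keyword heuristic. The sketch below is an assumption about how temporal detection could work, not a description of the actual detector: the `TEMPORAL_PATTERNS` list and the `shouldBypassCache` name are hypothetical, and a pattern list like this will miss many phrasings and flag some harmless ones.

```javascript
// Hypothetical patterns suggesting the answer depends on the current time.
const TEMPORAL_PATTERNS = [
  /\b(today|tonight|tomorrow|yesterday)\b/i,
  /\b(right now|currently|at the moment|as of now)\b/i,
  /\blatest\b/i,
  /\bthis (morning|afternoon|evening|week|month|year)\b/i,
  /\bwhat time\b/i,
];

// Returns true when the prompt looks time-sensitive and should skip the cache.
function shouldBypassCache(prompt) {
  return TEMPORAL_PATTERNS.some((pattern) => pattern.test(prompt));
}
```

A caller would check `shouldBypassCache(prompt)` before the cache lookup and, when it returns true, send the prompt straight to the LLM without storing the response.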
Results
Teams using semantic caching at the default threshold see an average 71% reduction in billable LLM calls. For a team spending $5,000/month on inference, that's roughly $3,550 back in their pocket every month.
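The arithmetic behind that figure can be written as a small sketch. It assumes inference spend scales linearly with billable calls and ignores the cost of running the cache itself; `estimateMonthlySavings` is a hypothetical helper, not part of any API.

```javascript
// Estimated monthly savings, assuming cost is proportional to billable calls.
function estimateMonthlySavings(monthlySpendUsd, reductionPercent) {
  return (monthlySpendUsd * reductionPercent) / 100;
}

console.log(estimateMonthlySavings(5000, 71)); // 3550
```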