Synthr – The AI API Platform for Developers

When we started tracking how our customers use the Synthr API, we noticed something striking: roughly 65% of all incoming prompts were semantically identical to a prompt we had already answered within the last 24 hours.

Not identical in the string-equality sense. But close enough that returning the previous answer would be indistinguishable in quality. This is the insight behind semantic caching - and it turns out to be a massive cost lever.

Why exact-match caching fails for AI

Traditional HTTP caching works on exact URL and header matching. That approach fails immediately for LLM prompts because users phrase the same question a thousand different ways. 'Summarize this article' and 'Give me a summary of the following text' are identical in intent but different in bytes.

```typescript
// Naive approach: exact string match, almost never hits
const cached = cache.get(prompt);

// Semantic approach: hits roughly 70% of the time
const embedding = await embed(prompt);
// Options are passed as an object; TypeScript has no named arguments
const nearest = await vectorSearch(embedding, { threshold: 0.92 });
if (nearest) return nearest.response;
```

The architecture

Every incoming prompt gets embedded using a small, fast embedding model running at the edge. We use cosine similarity against a vector index of recent responses. If similarity exceeds the configured threshold, we return the cached response immediately - no LLM call needed.

Embedding latency: ~3ms at the edge
Vector search latency: ~5ms against 100K entries
Total cache hit overhead: under 10ms
Average saving per avoided LLM call: 800ms and $0.004
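
The lookup path described above can be sketched roughly as follows. This is a minimal illustration, not the Synthr implementation: `CacheEntry`, `cosineSimilarity`, and `lookup` are hypothetical names, and the brute-force scan stands in for whatever vector index is actually used at the edge.

```typescript
// Hypothetical types and names for illustration; not the Synthr API.
interface CacheEntry {
  embedding: number[];
  response: string;
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("embedding dimensions must match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest-neighbour scan over the cached entries.
// An index over 100K entries would more likely use an approximate
// nearest-neighbour structure; that choice is an assumption here.
function lookup(
  query: number[],
  entries: CacheEntry[],
  threshold = 0.92,
): string | null {
  let best: CacheEntry | null = null;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cosineSimilarity(query, entry.embedding);
    if (score > bestScore) {
      bestScore = score;
      best = entry;
    }
  }
  // Serve the cached response only when the best match exceeds the threshold.
  return best !== null && bestScore > threshold ? best.response : null;
}
```

A `null` result means a cache miss, at which point the caller would fall through to the LLM and store the new response alongside its embedding.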

Edge cases that will bite you

The threshold matters more than you think. Set it too low (say, 0.85) and you return wrong answers to subtly different questions. Set it too high (say, 0.98) and you barely cache anything. We settled on 0.92 as the default after testing on 50M real queries, but we expose it as a configurable parameter.
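
One way to reason about this trade-off on your own traffic is to label a sample of query pairs and sweep the threshold. The sketch below assumes you have such labels; `LabeledPair`, `evaluateThreshold`, and the toy data are all hypothetical and invented for illustration, not drawn from the 50M-query test.

```typescript
// Hypothetical evaluation record: similarity of a query to its nearest cached
// prompt, plus a human label saying whether the cached answer was acceptable.
interface LabeledPair {
  similarity: number;
  acceptable: boolean;
}

interface ThresholdStats {
  threshold: number;
  hitRate: number; // fraction of queries that would be served from cache
  wrongHitRate: number; // fraction of those cache hits that were not acceptable
}

function evaluateThreshold(pairs: LabeledPair[], threshold: number): ThresholdStats {
  const hits = pairs.filter((p) => p.similarity > threshold);
  const wrong = hits.filter((p) => !p.acceptable);
  return {
    threshold,
    hitRate: pairs.length === 0 ? 0 : hits.length / pairs.length,
    wrongHitRate: hits.length === 0 ? 0 : wrong.length / hits.length,
  };
}

// Toy data, invented for illustration only.
const samplePairs: LabeledPair[] = [
  { similarity: 0.99, acceptable: true },
  { similarity: 0.95, acceptable: true },
  { similarity: 0.93, acceptable: true },
  { similarity: 0.88, acceptable: false },
  { similarity: 0.86, acceptable: true },
  { similarity: 0.6, acceptable: false },
];

// Compare the three thresholds discussed in the text.
const sweep: ThresholdStats[] = [0.85, 0.92, 0.98].map((t) =>
  evaluateThreshold(samplePairs, t),
);
```

On this toy sample the lowest threshold serves the most queries but also lets a wrong answer through, while the highest one is safe but rarely hits, which is the pattern described above.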

Time-sensitive prompts need special handling. 'What is today's weather?' should never be cached. We detect temporal signals in prompts and bypass the cache automatically.
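
As a rough illustration of the bypass, a first-pass detector could be a handful of regular expressions. This is only a sketch under that assumption: the pattern list and the `isTimeSensitive` and `shouldUseCache` helpers are hypothetical, and a real detector would need much broader coverage (explicit dates, other languages, relative expressions) than keyword matching provides.

```typescript
// Hypothetical pattern list for illustration; deliberately small and English-only.
const TEMPORAL_PATTERNS: RegExp[] = [
  /\b(today|tonight|tomorrow|yesterday)\b/i,
  /\b(right now|currently|at the moment|latest|breaking)\b/i,
  /\bthis (morning|afternoon|evening|week|month|year)\b/i,
  /\b(current|live) (price|weather|score|time|date)\b/i,
];

// True when the prompt contains any of the temporal signals above.
function isTimeSensitive(prompt: string): boolean {
  return TEMPORAL_PATTERNS.some((pattern) => pattern.test(prompt));
}

// Time-sensitive prompts skip the cache and always go to the LLM.
function shouldUseCache(prompt: string): boolean {
  return !isTimeSensitive(prompt);
}
```

Keyword matching like this errs toward bypassing the cache, which costs a few extra LLM calls but avoids serving a stale answer.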

Results

Teams using semantic caching at the default threshold see an average 71% reduction in billable LLM calls. For a team spending $5,000/month on inference, that's $3,550 back in their pocket - every month.
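
The arithmetic behind that figure is simple enough to run against your own numbers. The helper below is a hypothetical sketch that assumes inference spend scales linearly with the number of billable LLM calls and ignores the cost of running the cache itself.

```typescript
// Estimated monthly savings in USD, rounded to the nearest cent.
// Assumption: spend is proportional to billable LLM calls.
function estimateMonthlySavings(monthlySpendUsd: number, callReduction: number): number {
  if (callReduction < 0 || callReduction > 1) {
    throw new RangeError("callReduction must be between 0 and 1");
  }
  return Math.round(monthlySpendUsd * callReduction * 100) / 100;
}

// The example from the text: $5,000/month at a 71% reduction.
const exampleSavings = estimateMonthlySavings(5000, 0.71);
```

Swapping in your own monthly spend and an observed hit rate gives a first-order estimate before enabling the cache in production.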

How We Cut AI Inference Costs by 70% with Semantic Caching
