Three Kinds of Caching: Prompt, Semantic, Result
Three distinct caches you can wire up for an LLM app. Each one wins on different workloads. Here is which to reach for, in which order, and the failure modes you only find at scale.
Every "AI app optimisation" post tells you to cache. None of them tell you which cache. There are at least three distinct caches that could live in an LLM pipeline, and they win in different places, stack in different orders, and fail in different ways. If you reach for the wrong one first, you can spend a week building infrastructure that produces a 2% savings when a different cache would have given you 50%.
This is the post that disambiguates them. Three caches: prompt cache, semantic cache, result cache. I'll walk through each — what it does, when it wins, how much it saves, and how to know you need it — and finish with the order in which to reach for them in a real app.
No magic. Some real math. The single biggest cost-saving move in this course, if you apply it right.
The mental map
First, the map. Three caches, three different layers of your LLM pipeline, each catching a different kind of repetition.
The three caches sit at different points in the request. Result cache checks if the exact query has been answered before. Semantic cache checks if a similar query has been answered before. Prompt cache lives inside the model call itself and makes the model cheaper and faster when part of the prompt has been seen before. You can use all three together; they don't conflict.
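One way to picture the ordering is as a lookup chain. The sketch below is illustrative only: `result_cache`, `semantic_lookup`, and `call_model` are hypothetical stand-ins for whatever store, vector search, and provider call your app uses. The prompt cache does not appear as its own step because it lives inside the model call.

```python
from typing import Callable, Optional


def answer(
    query: str,
    result_cache: dict[str, str],
    semantic_lookup: Callable[[str], Optional[str]],
    call_model: Callable[[str], str],
) -> tuple[str, str]:
    """Return (answer, source), where source names the layer that served it."""
    # 1. Result cache: has this exact query been answered before?
    if query in result_cache:
        return result_cache[query], "result_cache"
    # 2. Semantic cache: has a similar query been answered before?
    similar = semantic_lookup(query)
    if similar is not None:
        return similar, "semantic_cache"
    # 3. Model call. Provider-side prompt caching applies inside this call.
    fresh = call_model(query)
    result_cache[query] = fresh
    return fresh, "model"
```

Returning the source alongside the answer is a design choice that makes it easy to log which layer served each request.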
Let me take each in turn.
Cache 1: prompt cache (free, turn it on)
The cheapest cache, easiest to adopt, and by far the most under-used. Every major frontier provider (Anthropic, OpenAI, Google) now exposes prompt caching at the API level. On some APIs you mark a portion of your prompt, usually the system message and the first part of your context, as cacheable; on others, caching of long repeated prefixes is applied automatically. On subsequent calls with that same prefix, the provider reuses its internal computation state, charging you a fraction of the normal input token price for that portion (roughly 10% on Anthropic; the discount varies by provider) and returning the response measurably faster.
The relevant fact is that prompt cache doesn't require you to maintain any cache yourself. The provider handles it. You opt in by adding a cache marker to your prompt, and the provider takes care of the rest. On Anthropic's API, for instance, you add a cache_control block:
# pip install anthropic
from anthropic import Anthropic

client = Anthropic()

LONG_SYSTEM_PROMPT = "... 3000 tokens of stable system instructions ..."

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "My specific question"}],
)
On the first call, you pay full price for the system prompt plus a small cache-write surcharge. On every subsequent call within the cache TTL (5 minutes by default on Anthropic, varies by provider), you pay ~10% of the system prompt's input cost. For an app that sends the same 3,000-token system prompt on every call, that can cut the input-token bill by around 40%, depending on how much of each request is the cached prefix, essentially for free. Latency drops by 50-200ms on cached prefixes because the model doesn't have to re-process them.
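To see where a figure like 40% can come from, here is a back-of-the-envelope sketch. The multipliers are assumptions: 0.10 for cache reads matches the ~10% above, and 1.25 for the cache write is a stand-in for the write surcharge, so check your provider's price sheet. The 3,750 uncached tokens per request is an invented number chosen for illustration.

```python
def cached_input_ratio(prefix_tokens, variable_tokens, calls,
                       read_mult=0.10, write_mult=1.25):
    """Input cost with caching divided by input cost without caching,
    over `calls` requests that all land inside one cache lifetime."""
    uncached = calls * (prefix_tokens + variable_tokens)
    cached = (
        prefix_tokens * write_mult                   # first call writes the cache
        + (calls - 1) * prefix_tokens * read_mult    # later calls read it
        + calls * variable_tokens                    # uncached part, full price
    )
    return cached / uncached


busy = cached_input_ratio(3000, 3750, calls=100)
lonely = cached_input_ratio(3000, 3750, calls=1)
print(f"100 calls per cache lifetime: {1 - busy:.1%} of input cost saved")
print(f"1 call per cache lifetime: {1 - lonely:.1%} of input cost saved")
```

Under these assumptions the busy path saves just under 40% of input cost, while a prefix that is written and never read again comes out slightly more expensive than not caching at all.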
When this wins big:
- Long, stable system prompts that are reused on every request (your "you are a support bot with these 40 rules" prompt).
- Long stable context that changes rarely (a document the user is asking questions about).
- Any hot-path call with >1,024 tokens of prefix.
When this barely wins:
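- Short system prompts (< 500 tokens). The cost of the cache write-overhead can match or exceed the savings. Providers also enforce a minimum cacheable prefix length (commonly around 1,024 tokens), and anything shorter is not cached at all.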
- Calls where every request has a different system prompt.
- Low-volume features where the cache TTL expires between calls.
Rule of thumb: if your system prompt plus shared context is over 1,500 tokens and the same prefix is hit multiple times per minute, flip this on today. It is the cheapest win in this entire course. It is a one-line change to your API call.
The provider docs have the exact API shape and TTL behaviour for your stack (Anthropic cache_control, OpenAI prompt cache, Gemini cached content). Ten minutes of reading, one deploy, real savings.
Cache 2: result cache (simple, powerful for read-heavy workloads)
The classic application-level cache. You compute a hash of the full request — system prompt, user query, relevant parameters — and use it as a key in your cache store. On the next request with the same hash, you return the previously stored response without calling the model at all. No LLM call. No tokens billed. Near-zero latency (one cache read).
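import hashlib
import json

# Assumes `client` is the Anthropic client from the earlier example and `cache`
# is a Redis-style client created with decode_responses=True, so get() returns str.

def cache_key(system: str, user: str, model: str, temperature: float) -> str:
    # sort_keys keeps the JSON, and therefore the hash, stable across runs.
    payload = json.dumps({
        "system": system,
        "user": user,
        "model": model,
        "temperature": temperature,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(system: str, user: str) -> str:
    # Pass 0.0, not 0: json.dumps renders the two differently, which would split keys.
    key = cache_key(system, user, "claude-sonnet-4-6", 0.0)
    cached = cache.get(key)
    if cached is not None:
        return cached  # hit: no model call, no tokens billed
    result = client.messages.create(...).content[0].text
    cache.set(key, result, ex=3600)  # 1 hour TTL
    return result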
The key question is how often the exact same request repeats. For most conversational apps: almost never. Every user's query is slightly different. For specific kinds of apps, though, repetition is the norm:
- FAQ-heavy support bots. "How do I reset my password?" gets typed by thousands of users per day. You don't need to call the model thousands of times for the same answer.
- Internal tools with repeated queries. Dashboards, report generators, status pages where users keep asking the same things.
- Evaluation runs. If you run your eval set before and after every PR, caching identical (system, user, model) triples saves you the cost of re-running unchanged cases.
- Background batch jobs with deduplication. If you're summarising 50,000 documents and 5% are exact duplicates, caching saves you 5%.
The storage is trivial — Redis, Postgres, SQLite, or even an in-memory dict for a single-process app. The TTL decision is the interesting part. Too short and you miss hits. Too long and you serve stale content when the system prompt or underlying data changes. A good default: one hour for user-facing content, one day for batch jobs, invalidate explicitly on deploy.
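For the single-process case, the TTL and explicit-invalidation behaviour can be sketched in a few lines. This is a minimal illustration, not a replacement for Redis: the class name `TTLCache` and the injectable `clock` parameter are hypothetical, and there is no size limit or thread safety.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory cache with per-entry expiry and explicit invalidation."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}  # key -> (expires_at, value)

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # lazily evict the expired entry
            return None
        return value

    def invalidate_all(self) -> None:
        """Call on deploys that change prompts or underlying data."""
        self._store.clear()
```

Passing the clock in makes expiry testable without sleeping; in a shared store such as Redis, the server-side expiry option does the same job.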
Rules:
- Only cache deterministic settings. temperature=0 (or close to it). Caching a temperature=0.9 answer makes no sense; the next user deserves their own roll of the dice.
- Always version the cache key by the system prompt version. If your prompt changes (per B2.2), old cache entries must be invalidated. A hash of the system prompt string is a natural part of the key.
- Never cache sensitive data across users. User A's query about their specific account cannot be a cache hit for user B. Scope the cache key by user_id or tenant_id for any query that might contain PII.
- Cache the response, not the whole API object. Extract just the text (and any structured fields you need). Future SDK upgrades might shuffle the object shape, but a plain string survives.
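The first three rules can be folded into the key builder itself. The following is a sketch under assumptions: `scoped_cache_key` and `normalise` are hypothetical helpers, the sketch is stricter than the rule in that it refuses any temperature above zero, and collapsing whitespace is only safe if whitespace carries no meaning in your prompts (it would for pasted code, for example).

```python
import hashlib
import json
from typing import Optional


def normalise(text: str) -> str:
    """Collapse runs of whitespace so cosmetic edits do not change the key."""
    return " ".join(text.split())


def scoped_cache_key(system: str, user: str, model: str, temperature: float,
                     tenant_id: Optional[str] = None) -> str:
    if temperature > 0.0:
        raise ValueError("only cache deterministic (temperature=0) calls")
    payload = json.dumps({
        # Hashing the prompt text versions the key: edit the prompt, get new keys.
        "system_hash": hashlib.sha256(normalise(system).encode()).hexdigest(),
        "user": normalise(user),
        "model": model,
        "temperature": float(temperature),
        # None means a deliberately shared, non-personal entry.
        "tenant_id": tenant_id,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

With this shape, a changed prompt or a different tenant can never collide with an old entry, so stale or cross-user hits become impossible by construction rather than by discipline.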
Realistic savings: on FAQ-heavy workloads with good caching, you routinely see 30-60% of queries hit the cache, which translates to a roughly 30-60% reduction in cost and average latency for that feature, since a hit bills no tokens and returns in one cache read. On conversational apps with mostly unique queries, savings are 1-5%.
Cache 3: semantic cache (powerful, tricky)
The fancy one. Instead of requiring the exact same query to hit, you embed the query (per B3.1), search a cache of previous queries for the nearest neighbour, and if the similarity is above a threshold, return the previously stored answer.
The idea sounds magical: "How do I reset my password?" and "I forgot my login info" both hit the same cache entry because they mean the same thing. In practice it's powerful and dangerous, and the trade-off deserves respect.
# pip install anthropic openai numpy
import numpy as np
from openai import OpenAI

client = OpenAI()

CACHE: list[tuple[np.ndarray, str]] = []  # (embedding, answer)
THRESHOLD = 0.92

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    v = np.array(resp.data[0].embedding, dtype=np.float32)
    return v / np.linalg.norm(v)

def semantic_cached_call(query: str, ask_model_fn) -> str:
    q_vec = embed(query)
    # Find nearest cached entry
    if CACHE:
        scores = np.array([float(q_vec @ v) for v, _ in CACHE])
        best_idx = int(scores.argmax())
        if scores[best_idx] >= THRESHOLD:
            return CACHE[best_idx][1]
    # Miss: call the model and cache the answer
    answer = ask_model_fn(query)
    CACHE.append((q_vec, answer))
    return answer
Fifteen lines, working semantic cache. In production you'd use a proper vector store (pgvector, per B3.2), scope by user or tenant, and expire entries periodically.
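As a rough illustration of the scoping and expiry mentioned above, here is an in-memory variant. It is a sketch, not a production design: `ScopedSemanticCache` is a hypothetical class, the `embed` callable is assumed to return unit-length vectors (as the `embed` function above does), and a real deployment would push the per-tenant filter and the expiry into the vector store's query.

```python
import time
from typing import Callable, Optional

import numpy as np


class ScopedSemanticCache:
    def __init__(self, embed: Callable[[str], np.ndarray], threshold: float = 0.92,
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._embed = embed
        self._threshold = threshold
        self._ttl = ttl_seconds
        self._clock = clock
        # tenant_id -> list of (unit-length embedding, answer, expires_at)
        self._entries: dict[str, list[tuple[np.ndarray, str, float]]] = {}

    def lookup(self, tenant_id: str, query: str) -> Optional[str]:
        now = self._clock()
        # Drop expired entries for this tenant before searching.
        live = [e for e in self._entries.get(tenant_id, []) if e[2] > now]
        self._entries[tenant_id] = live
        if not live:
            return None
        q = self._embed(query)
        scores = np.array([float(q @ vec) for vec, _, _ in live])
        best = int(scores.argmax())
        return live[best][1] if scores[best] >= self._threshold else None

    def store(self, tenant_id: str, query: str, answer: str) -> None:
        entry = (self._embed(query), answer, self._clock() + self._ttl)
        self._entries.setdefault(tenant_id, []).append(entry)
```

Because entries are partitioned by tenant before any similarity search runs, one tenant's cached answer can never be returned to another, whatever the threshold.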
Here's the honest picture of when semantic cache wins and when it bites:
Wins on:
- Genuinely repeated questions with varied phrasing. FAQ bots with hundreds of distinct phrasings of the same handful of questions.
- Search-style features where slight rewordings should produce the same answer.
- Workloads where you've measured the semantic-match hit rate and it's meaningfully above zero.
Bites on:
- False positives: different questions that happen to be vector-close. "How do I cancel my subscription?" and "How do I cancel my order?" are only one word apart but the answers are very different. Semantic cache returns the wrong answer confidently.
- Threshold tuning is fragile. Set the threshold too high (0.98) and you get almost no hits. Set it too low (0.85) and you get false positives. The correct threshold depends on your query distribution, your embedding model, and your risk tolerance for wrong answers. There is no universal number.
- Same words, different context. Two users ask "what's my balance?": same words, same embedding, completely different correct answers. Semantic cache never works unscoped for user-specific data. Scope hard.
- Regressions when the underlying data changes. The semantic cache doesn't know your product's refund policy changed last week. It will keep serving the old answer for any query that semantically matches.
My rule on semantic cache: reach for it only after you've done prompt cache and result cache and you have a specific, measured failure mode they don't address — namely, "users ask the same question in many different ways and I'm burning tokens on near-duplicates." If that's not demonstrably your workload, the complexity and the false-positive risk outweigh the savings.
When you do use it, build it in shadow mode first: on every query, compute what the semantic cache would return, but don't actually use it. Compare the cached answer to the real answer for a week and look at the disagreements. Tune the threshold until the false-positive rate is acceptably low, then flip it to serving mode. Never go straight to serving.
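The week of shadow data can then drive the threshold choice directly. The sketch below assumes each shadow-mode query was logged as a hypothetical `ShadowRecord` holding the best similarity found and a judgement of whether the would-be cached answer agreed with the fresh one. How you produce that judgement (human review, exact match, a grader) is outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class ShadowRecord:
    similarity: float  # best cosine similarity found in the cache
    agrees: bool       # did the would-be cached answer match the fresh one?


def sweep(records: list[ShadowRecord],
          thresholds: list[float]) -> list[tuple[float, float, float]]:
    """For each threshold return (threshold, hit_rate, false_positive_rate).

    hit_rate: share of all queries that would have been served from cache.
    false_positive_rate: share of those hits whose cached answer disagreed.
    """
    rows = []
    for t in thresholds:
        hits = [r for r in records if r.similarity >= t]
        wrong = [r for r in hits if not r.agrees]
        hit_rate = len(hits) / len(records) if records else 0.0
        fp_rate = len(wrong) / len(hits) if hits else 0.0
        rows.append((t, hit_rate, fp_rate))
    return rows
```

Reading the rows side by side shows the trade-off plainly: lowering the threshold buys hit rate and costs false positives, and the acceptable point is a product decision rather than a constant.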
Stacking: which cache first
The order to adopt these, in my experience:
Week one: turn on prompt cache. One-line change. No new infrastructure. Works immediately on any call with a long shared prefix. Measure the bill delta and the latency delta.
Week two: add result cache on specific hot paths. Start with your evals (they're the most duplicate-heavy workload you have) and your FAQ bot if you have one. Use Redis or Postgres; don't over-engineer. Measure the hit rate; if it's above 10%, it's paying for itself.
Week three to N: consider semantic cache. Only if your workload specifically has high paraphrase repetition and you've confirmed it. Build in shadow mode. Measure. Decide.
At the end of this sequence, most apps will have cut 30-70% off their LLM bill with approximately zero quality loss, for about a week of total engineering effort spread across those steps. That is the biggest single ROI win available to you after "pick the right model."
A realistic savings picture
A made-up-but-realistic example, the way I'd sketch it on a whiteboard:
- Baseline: 100,000 requests/day, $0.02 average cost/request, $2,000/day, 2.5s average latency.
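- Prompt cache on the 3,000-token system prompt: 40% input cost reduction, which is about 30% off the total if input tokens are roughly three quarters of each request's cost → $0.014/request, $1,400/day. Latency: ~2.3s.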
- Result cache on 20% of requests (FAQ-heavy): 20% of requests hit cache for free → $1,120/day. Latency: cached requests drop to ~50ms; blended average ~1.9s.
- Semantic cache catching an additional 10% of the remaining model calls (similar-but-not-exact queries): $1,008/day. Latency: similar blended drop.
From $2,000/day to $1,008/day with no model change, no quality regression, and a week of engineering. That's a 50% cost reduction that pays for itself many times over before you've even looked at the more exotic optimisations.
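The whiteboard arithmetic can be reproduced in a few lines. Two assumptions are mine, made to reconcile the figures: input tokens are taken to be about 75% of per-request cost (so a 40% input reduction is 30% overall), and the semantic cache's 10% is applied to the spend remaining after the result cache.

```python
requests_per_day = 100_000
baseline_cost = 0.02  # dollars per request

# Assumption: input tokens are ~75% of per-request cost, so a 40% input
# reduction is a 30% reduction overall (0.75 * 0.40 = 0.30).
after_prompt_cache = baseline_cost * (1 - 0.75 * 0.40)

daily = requests_per_day * baseline_cost
daily_prompt = requests_per_day * after_prompt_cache
daily_result = daily_prompt * (1 - 0.20)    # 20% of requests never reach the model
daily_semantic = daily_result * (1 - 0.10)  # a further 10% of what remains

# Blended latency after the result cache: 80% at ~2.3s, 20% at ~0.05s.
blended_latency = 0.80 * 2.3 + 0.20 * 0.05

print(round(daily), round(daily_prompt), round(daily_result), round(daily_semantic))
print(round(blended_latency, 2))
```

This prints 2000, 1400, 1120 and 1008 dollars per day, and a blended latency of 1.85 seconds, which rounds to the ~1.9s quoted above.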
Admit what breaks
- Prompt cache TTLs expire. If your workload is bursty (lots of calls in a minute, then quiet), the second burst starts cold. On providers that refresh the TTL on every hit (Anthropic does), you can keep the prefix warm by sending a keep-alive call every few minutes if the economics justify it.
- Result cache hashes are fragile. A whitespace change in the system prompt invalidates every existing entry. Normalise before hashing.
- Cross-user cache leaks are a security incident waiting to happen. Any cache that crosses user boundaries needs a rigorous user_id (or tenant_id) in the key. Get this wrong and you're leaking data.
- Semantic cache false positives look like hallucinations. Users think the model made up the answer; it actually came from a cache entry for a different question. Your monitoring needs to distinguish "cached vs fresh" in the response so you can correlate bug reports.
- Caches can mask upstream bugs. If you fix a bug in your prompt, the cached answers from before the fix are still served. Invalidate explicitly on deploys, or keep TTLs short enough that bugs flush within an hour.
- Cache infrastructure has its own bills. Redis or pgvector at scale isn't free. Usually small compared to LLM savings, but check.
- Observability lies to you about cache hits. Your dashboard says "60% hit rate." Look at the distribution: the hits are disproportionately on your cheapest queries, not your expensive ones, so your cost savings are lower than the hit rate suggests. Log cost-weighted hit rate, not just query-count hit rate.
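The gap between the two hit rates in the last point is easy to compute once each request is logged with a cost. A minimal sketch, assuming you log a hypothetical `(was_cache_hit, cost_if_sent_to_model)` pair per request; for hits, that cost has to be an estimate, for example the cost of the original miss that populated the entry.

```python
def hit_rates(requests: list[tuple[bool, float]]) -> tuple[float, float]:
    """requests: (was_cache_hit, cost_if_sent_to_model) per request.

    Returns (count_hit_rate, cost_weighted_hit_rate).
    """
    if not requests:
        return 0.0, 0.0
    hit_costs = [cost for hit, cost in requests if hit]
    total_cost = sum(cost for _, cost in requests)
    count_rate = len(hit_costs) / len(requests)
    cost_rate = sum(hit_costs) / total_cost if total_cost else 0.0
    return count_rate, cost_rate


# Six cheap FAQ hits and four expensive misses (invented numbers).
sample = [(True, 0.005)] * 6 + [(False, 0.05)] * 4
count_rate, cost_rate = hit_rates(sample)
print(f"{count_rate:.0%} of requests, {cost_rate:.0%} of cost")
```

With these invented numbers the dashboard figure is 60% of requests, but only about 13% of the spend was actually avoided.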
What just changed in your code
- Turn on prompt cache for every call with a >1,000-token stable prefix, today. It is the single cheapest optimisation in LLM engineering.
- Add a result cache for your evaluation runs at minimum, and for any FAQ-heavy user-facing feature.
- Always key result caches by (system_prompt_hash, user_message, model, temperature) and, for anything user-specific, scope by user_id or tenant_id.
- Do not reach for semantic cache until you've measured specific paraphrase repetition and run it in shadow mode for a week.
- Invalidate caches explicitly on deploys that change prompts or underlying data.
- Log cost-weighted cache hit rate, not just query count, to know the real savings.
Next post, B5.3, we turn the lights on. You can't optimise what you can't see, and most LLM apps are shipped without any meaningful observability. We'll walk through the minimum viable tracing, logging, and metrics for an LLM feature — enough to debug, enough to alert, enough to improve.
This post is part of the AI for Builders series.