Skip to main content

Command Palette

Search for a command to run...

Cost and Latency: the Two Dials Users Feel

Your users will never read your model card. They will feel two things: how fast your app is and how much it costs you per answer. Here is how to reason about both without a spreadsheet.

Updated
10 min read
Cost and Latency: the Two Dials Users Feel

Welcome to Module B5 — Shipping. The module where we stop talking about what to build and start talking about what makes the difference between a demo that works on your laptop and a product that survives ten thousand users on a bad day. Cost. Latency. Caching. Observability. Guardrails. Rollouts. Five posts. Every one of them is unglamorous. Every one of them is what separates a team that ships from a team that demos.

We start with the two dials users actually feel. They will never read your model card. They will never know which embedding model you picked. They will feel: how fast the app responds, and — eventually, through pricing or product decisions — how much it costs you per answer. Both of these are load-bearing product choices, and both of them are easy to ignore until they start hurting.

This post is the practical version. Concrete numbers, concrete moves, concrete trade-offs. No spreadsheets — just the handful of levers you actually pull.


What users actually feel

Latency and cost are not user-visible the way "this button is blue" is user-visible. They're user-visible in indirect ways that matter more:

  • Time to first token (TTFT) is what users feel as "the app is alive." A 200ms TTFT feels instant. A 2-second TTFT feels slow. A 5-second TTFT feels broken, even if the final response is fast.
  • Total time to final answer matters less than TTFT for streaming products, more than TTFT for non-streaming ones (status reports, batch jobs, background agents).
  • Cost shows up in two ways: directly to users through pricing tiers or usage limits, and indirectly to you through whether the product is economically viable at scale. Either way, the user eventually notices if your margin is too thin.
  • Variance matters as much as the mean. A product with a 500ms median latency but a 15-second P99 latency will feel broken to 1% of users, every day. Users remember the bad experience, not the median.

If there is one thing to remember from this post: optimise TTFT and P99, not just the average total time. Users feel the first two. They do not feel the third.


The four levers

There are four main dials you can turn on LLM cost and latency. In rough order of impact, and cheapest to most expensive operationally:

Lever 1: pick the right model for the call

The single biggest lever, and the one most teams underuse. Picking a smaller or faster model where it's capable enough is the cheapest way to cut cost and latency simultaneously. Two dimensions:

  • Capability tier. Frontier models (Claude Sonnet 4.6, GPT-5, Gemini 2.5 Pro) are the most capable and usually the most expensive per token. Mid-tier models (Claude Haiku, GPT-5-mini, Gemini Flash) are ~5-10x cheaper and faster, at meaningfully-but-not-dramatically lower quality on most tasks. On simple tasks — classification, extraction, routing — mid-tier is often indistinguishable from frontier.
  • Reasoning mode. Modern models have "reasoning" or "extended thinking" modes that burn extra tokens on internal thought before answering. These help on hard tasks and hurt on everything else. Default off. Turn on only when your eval shows a lift.

The rule I use: try the cheapest model first, measure, and only upgrade to frontier when you have evidence. The opposite direction — start on frontier and try to downshift later — wastes time because you anchor on frontier-quality results and downgrades feel like regressions.

A realistic cost comparison, in rough 2026 numbers per million tokens for the default "mid-tier" model from each lab vs the frontier tier. These move quarterly; treat them as order-of-magnitude:

TierInput ($/M)Output ($/M)Use for
Haiku / mini / Flash$0.30–$1$1–$4Classification, extraction, routing, simple generation
Sonnet / GPT-5 / Gemini Pro$3–$15$15–$75Complex reasoning, long-form generation, agents
Opus / o-series / Ultra$15–$75$75–$150+Hardest reasoning tasks where quality is everything

Most apps that mix tiers end up spending 70% of their request volume on cheap models and 30% on frontier models, with cost dominated by the 30% and latency dominated by the 70%. Routing the right query to the right tier is half the game.

Lever 2: cut the prompt

Your system prompt is probably too long. Your RAG context is probably too long. Your few-shot examples are probably doing nothing. Your tool descriptions are probably repeating themselves.

Input tokens are the quiet cost driver. A 3,000-token system prompt on a call with a 200-token user message is spending 94% of its input budget on developer-written text. If you can cut that to 1,500 tokens without quality loss — and you usually can — you just halved your input cost and chopped 100–300ms off the latency.

Practical moves:

  • Delete dead few-shot examples. If structured output handles the format (B1.3), the examples teaching "respond as JSON" are pure cost.
  • Compress system prompts. Read yours out loud. Every sentence that doesn't change the model's behaviour when removed is dead weight. Measure against your eval set and delete aggressively.
  • Cap retrieval top-k. Four well-chosen chunks beat ten mediocre ones on most queries, and four is much cheaper. See B3.4 on hybrid search + rerankers.
  • Summarise long conversations. After 10 turns, the first 5 are often out of context. Replace them with a compact summary when the history gets long.
  • Drop tool definitions you're not going to call. If your agent only needs 3 of 20 tools for this specific request, use a classifier up front to pick the relevant subset.

Every one of these is a "free" improvement — quality stays flat, cost and latency drop. Do them before you touch the model choice.

Lever 3: stream

Streaming (B1.2) is the single best thing you can do for perceived latency. It doesn't reduce the total time to final answer at all — the model still takes as long to produce the full response — but it moves the "user sees something" moment from "end of response" to "200ms after the call started." That's the dial users feel.

A 4-second generation that starts rendering at 300ms feels like 300ms. A 2-second generation that stays blank and dumps all at once feels like 2 seconds. The user's internal stopwatch starts on "I hit enter" and stops on "I see something." Streaming is the move that makes the stopwatch stop early.

Three things to know:

  1. Streaming costs the same tokens as non-streaming. Zero cost penalty.
  2. Streaming catches errors faster. If the call fails, you find out in ~100ms instead of after full generation.
  3. Streaming breaks on buffered middleware. See B1.2 for the proxy/CDN landmines. Test end-to-end in staging.

If your user-facing LLM calls aren't streaming, fix that today. It's the cheapest single lever in this post.

Lever 4: cache (next post)

Caching is the fourth lever and I'm saving the full treatment for B5.2. Here's the preview: there are three distinct caches you can wire up — provider-level prompt cache (free, flip a flag), semantic cache (one embedding call per query), and result cache (plain database). At scale, a good caching strategy cuts cost by 30–70% on read-heavy workloads and shaves 200–500ms off cached responses. But it has its own failure modes and trade-offs, which is why it gets its own post.

For now: know that it exists, plan for it, and don't optimise anything else in this post assuming cache savings. Cache savings come last.


The two numbers to track

You don't need elaborate observability to start reasoning about cost and latency. Two numbers per user-visible feature:

  1. P50 and P99 TTFT (time to first token). Log both. Alert when P99 crosses 3 seconds. This is the user-feel number.
  2. Average cost per request, broken down by model. Log the model, input tokens, output tokens, and dollar cost per call. Aggregate per feature. You want to be able to answer "how much does feature X cost me per active user per day?" without doing math in your head.

Everything else — P95, P999, total monthly cost by tier, cost per seat — comes later. Those two numbers are the minimum dashboard.


A real tuning story

Let me walk through a realistic cost/latency tuning session, the way I'd actually do it. Suppose you have a support bot:

  • Feature: classify incoming tickets, retrieve relevant docs, answer.
  • Volume: 10,000 requests/day.
  • Current latency P50: 4.2s, P99: 12s.
  • Current cost: $180/day, ~$5,400/month.
  • Current setup: one big call to a frontier model, with 4,000-token system prompt, full top-10 RAG context, no streaming.

First-pass diagnosis:

  • 4,000-token system prompt → 40% of it is dead few-shot and redundant instructions. Cut to 1,500 tokens. Estimated save: 60% of input cost → $30/day.
  • Top-10 → Top-4 with hybrid search + reranker (B3.4). Estimated save: 60% of context cost → another $40/day.
  • Classification step is on the same frontier model → route to mid-tier (Haiku) for classification. Save: 80% of classifier cost → $15/day.
  • Add streaming → Perceived latency drops dramatically without changing total time.

Before/after expectation:

  • Daily cost: $180 → ~$95 (47% reduction).
  • TTFT P50: 4.2s → ~400ms (streaming).
  • TTFT P99: 12s → ~2s (streaming + shorter prompt).
  • Total time to full answer: mostly unchanged, but users don't feel that number.

That's a realistic first-pass tuning. Not exotic. No new infrastructure. All four levers turned one click. Always start with these before reaching for anything fancier.


Admit what breaks

  • Model downshifting can degrade quality invisibly. The cheaper model produces confident-sounding but subtly worse answers, and your eval set (if you have one) catches 90% of the regression. The 10% lands in production. Have good evals (B2.5) before you downshift.
  • Prompt cutting can break specific edge cases. The one paragraph you removed was the one that handled the ambiguous cases you never wrote an eval for. Fix: backfill the eval from production bug reports after the cut.
  • Streaming amplifies partial-failure bugs. A stream that dies mid-response leaves the user looking at half an answer. Handle stream errors explicitly (see B1.2).
  • TTFT optimisations don't help batch workloads. If your LLM work is background processing, nobody is watching the output stream. Optimise total throughput instead — batching, async parallelism, smaller models.
  • Cost and latency are often in tension. The cheapest model is not always the fastest. The fastest path is not always the cheapest. You'll sometimes have to pick: do I optimise for my user or for my margin? The answer depends on your product and your pricing, and it's a real decision, not a technical one.
  • "Cost per request" hides variance. Your average is fine. Your P99 cost per request is 50x higher, and that 1% of queries is half your total spend. Log cost per request as a distribution, not just a mean.

What just changed in your code

  • Log P50 and P99 TTFT per feature and alert on regressions. This is the minimum.
  • Log cost per request (input tokens, output tokens, model name, dollars) and aggregate per feature per day. If you can't answer "how much does this feature cost me," you can't optimise it.
  • Route classification and extraction to mid-tier models before doing anything else. It's the biggest free win for multi-call pipelines.
  • Cut every system prompt by 30% this week. Measure against your eval set. Most prompts can lose 30% without quality loss.
  • Enable streaming on every user-visible call. If it isn't streaming, you're leaving perceived-latency on the table.
  • Don't pre-optimise around imaginary cache savings. Caching is real, but it's the last lever, not the first.

Next post, B5.2, is the dedicated caching post: three kinds of caching — prompt cache, semantic cache, result cache — and when each one earns its complexity. One of the biggest cost wins in this whole course, but only if you know which cache to reach for.


Course navigation

⬅️ Previous📍 You are hereNext ➡️
⬅️ Previous
B4.5 · Long-Running Agents
B5.1 of B6.4Next ➡️
B5.2 · Three Kinds of Caching

📚 AI for Builders · Course Home — 28 posts, six modules.


Cover photo via Unsplash. This post is part of the AI for Builders series.

More from this blog

Learn AI - Zero to Hero

111 posts