The Demo-to-Prod Gap | AI for Product

You ran a demo on a Tuesday. The model nailed it. Leadership was excited. Sales started pitching it. The engineering team said "we can ship this in six weeks." On week eight, the ship date moved. On week twelve, it moved again. On week sixteen, you found yourself explaining to leadership why "the thing that worked in the demo" was not, in fact, close to ready. You felt like you had misled them. You had not; the demo really did work. The product that came out of it was a different thing, and the difference is the subject of this post.

The space between the two is the demo-to-prod gap. It is the single most expensive surprise in AI product work, and it is not a mystery — it follows a predictable shape. Every team that ships AI products crosses this gap. The ones that budget for it ship on time. The ones that treat the demo as a prototype ("we just need to productionise this") lose a quarter to discovering what "productionise" actually means.

This post is the map of the gap. Five specific, measurable differences between a demo that wows and a product that survives GA, each one quantified against April 2026 realities, each with a concrete move to de-risk it before engineering has written production code. Read this before your next demo turns into a commitment.

Why the gap exists at all

Start with the mental model. A demo is an instance: one interaction, curated inputs, one attentive reviewer, no adversaries, no edge cases, no scale. A product is a distribution: millions of interactions, adversarial inputs, inattentive users, edge cases at every seam, scale, cost, latency, compliance, and a team of engineers supporting it at 3am.

A frontier model handles "instance well" in a way that feels magical. It handles "distribution well" in a way that is extremely specific work. The gap exists because the magic transfers poorly from the first mode to the second, and the transfer is the product engineering. No amount of model quality makes the transfer free.

Five costs. Every one of them is real. I'll take each in turn, with 2026-specific numbers where they help.

Gap 1: tail quality (the 80/20 on a fresh tail)

The demo uses three to five inputs. The model handles them well. The product will see thousands to millions of inputs. The distribution has a tail — a long list of weird, ambiguous, or adversarial cases the demo never touched. Tail behaviour is where quality dies.

A typical pattern: your demo achieves ~95% subjective "looks great" quality on curated inputs. On a production-like eval set of 100-200 real user inputs, the same model plus prompt achieves 78%. The 17-point gap is the tail, and it is large. The specific failures in the tail are usually:

Inputs in a language or dialect you didn't test on. A demo in English gets English right. Your user base speaks fifteen languages, seven of which have quirks your demo never exercised.
Inputs with formatting the model hasn't seen before. A PDF with tables arranged oddly. A transcript with multiple overlapping speakers. A document with embedded images your pipeline strips silently.
Inputs that are legitimately ambiguous. The user asks a question that has two valid interpretations. The demo always picked the right one because you tested it on clean cases.
Inputs that are short. A one-word query gives the model almost no signal. Demos almost never use short inputs.
Inputs with adversarial intent, which we'll cover in Gap 4.

Defence (do this before committing a ship date):

Before the demo becomes a commitment, build an eval set from real user data, not your own intuition. 100 cases minimum. If you don't have real user data yet, source from customer interviews, past tickets, or similar proxy data.
Run your demo's exact prompt and model against the eval set. Record the number. That number is your honest starting quality, not the demo number.
The gap between the eval score and your shipping threshold is the tail work. Budget for it as a first-class chunk of the roadmap, not as "polish."

What "budget for it" means in weeks: in my experience, closing the tail from eval-set score to shipping threshold takes 2-8 weeks depending on how bad the gap is. If the eval-set score is within 5 points of the threshold, 2 weeks. If it's 20+ points off, 8 weeks. If it's 30+ points off, the model or approach is probably wrong — don't try to close a huge gap; rethink the prompt, the retrieval, or the shape.

Gap 2: unit economics (the single most skipped calculation)

A demo costs whatever a few API calls cost — effectively zero. A product at scale costs whatever millions of API calls cost. The demo never forces you to do the math; the product forces you to do it on the worst possible day, which is when leadership reviews the P&L.

As of April 2026, here are rough per-call costs on frontier models for a realistic RAG feature (retrieval + generation, ~4,000 input tokens, ~400 output tokens), to put concrete numbers on this:

Claude Sonnet 4.6: roughly $0.018-$0.025 per call
GPT-5: roughly $0.020-$0.030 per call
Gemini 2.5 Pro: roughly $0.015-$0.023 per call
Mid-tier (Claude Haiku, GPT-5-mini, Gemini Flash): roughly $0.002-$0.005 per call

Now do the product math. Suppose the feature is used 10 times per active user per day, you have 30,000 active users, and you're on Claude Sonnet. That's 300,000 calls/day × $0.02 = $6,000/day = ~$180,000/month. For one feature.

A demo never has this conversation. A product has it at the worst possible moment. The question becomes: does the feature justify $180K/month? If the feature generates $200K/month of pricing lift or cost savings, easy yes. If it doesn't, you have a unit-economics problem and "we'll optimise later" is not a plan.

Defence:

Before the demo becomes a commitment, calculate the worst-case unit economics at 2-3 volume scenarios: pilot (thousands of calls/day), launch (tens of thousands), scale (hundreds of thousands).
Multiply by the cost of one frontier call with your full prompt. Don't estimate; run a real call and measure the actual tokens in/out.
Compare the number to the revenue or cost savings the feature produces. If the ratio is worse than 3:1 (value to cost), you need either a smaller model, aggressive caching (Course 2's B5.2), or a different feature.
Have the cost conversation before the ship commitment, not after.

The single most common surprise here is teams that demo on Claude Sonnet or GPT-5 because "we wanted the best model" and then discover in month three that they need to downshift to the cheaper tier, which requires re-tuning the prompts and re-running the evals. Two weeks of rework avoided by thinking about costs on day one.

Gap 3: latency at p99 (the tail users feel)

A demo runs one call at a time, in a quiet environment, with a clean network. Latency feels great. A product runs many calls concurrently, sometimes queued, sometimes throttled by the provider, sometimes competing with other workloads. The p50 latency stays roughly similar to the demo. The p99 latency — the 1% of calls users wait the longest on — can be 10-20x the p50 in production.

Concrete 2026 numbers for a typical "RAG + generation" call with streaming enabled on a frontier API:

Demo p50 time-to-first-token: 300-600ms.
Demo p50 total: 2-4 seconds.
Production p50: similar, maybe slightly worse under load.
Production p99 time-to-first-token: 1.5-3 seconds (5x the demo).
Production p99 total: 8-25 seconds, occasionally more (5-10x the demo).

Users feel p99 as much as p50. A feature where 1% of calls take 20 seconds is a feature 1% of users bounce on every day, and that 1% compounds into real churn over weeks. The demo never surfaces this.

Defence:

Load-test the pipeline before committing a ship date. 100 concurrent users for 30 minutes, on the real provider, with your real prompt. Measure p50, p95, p99. Compare to your UX requirements.
Enable streaming everywhere a user is watching. This is the cheapest win: streaming moves the user-feel latency from "total time" to "time to first token," which cuts the p50 user experience by ~60% and makes p99 feel less broken.
Plan for slow-call handling. What happens when a call takes 25 seconds? Do you show a progress indicator? Cancel and retry? Fail gracefully? Your UX has to answer these questions before launch.

The $180K/month feature from Gap 2 is also an 8-second-p99 feature in Gap 3, and both problems exist at the same time. The demo showed neither.

Gap 4: adversarial robustness

The demo used inputs you wrote. Production will use inputs written by millions of users you have never met, a small fraction of whom are actively trying to break your feature. Some of them are testing for fun; some of them are probing for exploits; some are legitimate users whose phrasings happen to hit edge cases the demo never considered.

Adversarial concerns in 2026 cluster into three areas:

Prompt injection. Users sending inputs that try to manipulate the model into ignoring its system prompt. Course 2's B2.4 covers this in depth. In production, 1-5% of inputs from public-facing features contain injection-shaped text, mostly benign experimentation, occasionally malicious.
Off-topic abuse. Users trying to get the model to do something unrelated to the product's purpose. "Write me a poem about my ex" in a customer support bot. You will see this, it will be tweeted, and your response has to be both polite and clearly off-topic.
Data extraction attempts. Users trying to get the system to reveal internal prompts, training data, or information about other users. Rare but serious.

A demo never tests these. A product has to survive them from Day 1 of launch because the first adversarial user is already typing.

Defence:

Run a prompt-injection red-team session against your pipeline before ship. Two hours, one person, realistic attack patterns. You will find at least one real issue.
Implement the guardrail stack from Course 2's B5.4 — input validation, output filtering, abuse detection — as first-class spec items, not as "we'll add that later."
Plan the response copy for off-topic and refusal cases as user-facing text. "I can't help with that but here's what I can do" beats a bare "refused" every time.

Gap 5: user trust and UX

The last gap is the most qualitative, and the one teams most often don't recognise as a demo-to-prod gap at all. The demo is reviewed by a sympathetic audience (your team, leadership, a friendly pilot customer). They want it to work. They forgive minor issues. They interpret outputs charitably. They're primed to be impressed.

A cold user — someone who's never seen the product before, has no relationship with your team, and is trying to accomplish a task — is not primed. They will:

Not understand what the AI can and can't do unless you tell them.
Not know how to phrase inputs for good results unless you guide them.
Interpret outputs literally and without charity.
Stop trusting the feature after a single bad output.
Not verify outputs unless the UX makes verification easy.

The demo feels great because the reviewers knew what they were doing. The product has to teach users what to do, and that teaching is UX work that doesn't exist yet when the demo ships.

Specific UX patterns that rebuild trust in production:

Set expectations up front. A first-run tooltip or onboarding moment explaining what the feature does and what it can't. Costs 30 minutes of design work, saves two months of churn.
Show your work. Display citations, sources, or the reasoning steps the model took. Cite-or-refuse (Course 2 B3.5) is the default for grounded features; always available, always visible.
Make editing trivial. If the output is a draft, make editing the primary action. If the output is a fact, make "show source" one click away.
Design the failure state. What does the UX look like when the model returns a bad answer, a low-confidence answer, or an "I don't know"? These states need dedicated design, not a default error message.
Instrument the trust signal. Track how often users accept, edit, or reject AI outputs. If the edit rate is 90% or the rejection rate is high, users don't trust the feature and you have a problem to fix before scaling.

The cold-user experience is the biggest single predictor of whether an AI feature succeeds or quietly gets removed. A demo can't tell you if cold users will trust it. Only cold-user testing, before launch, can.

Defence:

Run cold-user testing with five to ten participants who have never seen the product before. Observe them. Watch where they hesitate, where they re-read, where they undo their action. Each hesitation is a trust gap.
Design for the failure state explicitly. Failure states get as much design attention as success states.
Track the trust signals on Day 1 of launch. If users aren't interacting with the feature in healthy ways, you have a pre-churn signal you can act on.

A worked example: demo vs shipped cost

Let me put all five gaps in one honest table for a realistic feature. Suppose you demoed a "summarise any document in our knowledge base" feature for a B2B SaaS.

Variable	Demo value	Realistic production value	Gap
Quality on eval set	"Looked great" on 3 hand-picked docs	76% pass rate on 150 real docs	~20 points
Cost per call	Effectively $0	$0.022 per call × 8,000 calls/day = $176/day = $5,280/month	100% of spend is new
P50 total latency	3.2s	3.5s	small
P99 total latency	"Felt fast"	11.8s	~8s tail
Adversarial robustness	Untested	2 real injection issues found in a 2-hour red-team	2 bugs
UX for failure state	No design	Default "something went wrong"	entire state unbuilt
Cold-user trust signal	Unknown	Edit rate of 87%, rejection rate of 11% on pilot	user doesn't trust output

Seven rows, seven gaps. Each one is a real chunk of work, not a vibe.

Timeline impact: closing these seven gaps realistically takes ~10-14 weeks of engineering and design, on top of whatever you thought "productionise the demo" meant. Teams that budget this upfront ship on schedule. Teams that don't ship 3-4 months late and hit the bad conversation with leadership.

Mitigations that help: caching the summarisation for repeat documents (saves ~40% of cost), downshifting to mid-tier for short documents (saves another ~20%), adding cite-or-refuse (fixes the trust issue on low-confidence cases), and running the eval-set loop before shipping. Each is a specific move; each is measurable.

None of them would have been necessary if the demo had told you they would be. The demo couldn't tell you; only the gap could.

The failure mode: "productionise it in 6 weeks"

The demo-to-prod gap has one signature failure mode, and it always sounds the same: "we can productionise it in 6 weeks." A confident engineer or lead says it, leadership nods, and the team commits. Six weeks later, the team is 30% through Gap 1. Twelve weeks later, they're in Gap 4. Eighteen weeks later, the feature ships with two of the five gaps still unaddressed and quietly underperforms.

The failure mode is not the engineer being wrong; engineers generally know the work is more than 6 weeks. The failure is the word "productionise" — a small, tidy-sounding verb that hides five large, untidy chunks of work. Nobody inside the team says "we're lying"; everyone says "productionise" and everyone means something different by it, and the ambiguity eats the schedule.

The defence: never accept "productionise" as a scoping term. Break every "productionise this demo" commitment into the five gaps explicitly. For each gap, answer three questions: (1) what specifically has to be true for this gap to be closed, (2) how do we measure that it's closed, and (3) how long will it realistically take. If a team can't answer these, they can't commit to a schedule, and saying so out loud is the single highest-leverage thing a PM can do in the demo-to-commitment conversation.

Six weeks to productionise turns into "two weeks to close Gap 2, four weeks to close Gap 1 to threshold, three weeks for Gap 4 and Gap 5, and two weeks of buffer" — 11 weeks, honestly budgeted. Leadership would rather hear "11 weeks, here's why" than "6 weeks, oh no it slipped, it slipped again, it slipped again."

What just changed in your roadmap

Before a demo becomes a commitment, run the five-gap audit. Quality eval set, unit economics math, load test for p99, red-team for adversarial, cold-user test for trust. Five quick tests before the ship date is written.
Never write a ship date based on a demo. Write it based on the gap measurements.
Kill the word "productionise." Replace it with the specific gap-closing work. Each gap gets its own line on the project plan, with its own estimate and owner.
Assume eval-set quality is ~15-20 points below demo quality as a planning default. Adjust based on measurement.
Run unit-economics math on Day 1. The first call you make should be "how much does this cost per user per month at target scale?"
Always load-test before commit. p99 latency is the number leadership will see in the product review, and "we didn't test for concurrency" is a career-limiting answer.
Design the failure state before you design the happy path. It's where the trust breaks.
If leadership is pushing for a demo-based commitment, show them this framework. "Here's what we still need to measure. We can commit after we have the numbers." Five days of measurement saves ten weeks of schedule slip.

Next post, P2.3, covers a less glamorous but equally load-bearing topic: prototyping with AI as a non-coder. How a PM or designer can generate throwaway artefacts that earn real scoping decisions, without having to write a line of production code or wait for an engineering sprint to learn something testable.

⬅️ Previous	📍 You are here	Next ➡️
⬅️ Previous P2.1 · Finding AI-Shaped Problems	P2.2 of P5.4	Next ➡️ P2.3 · Prototyping With AI as a Non-Coder

📚 AI for Product · Course Home — 20 posts, five modules.

Cover photo via Unsplash. This post is part of the AI for Product series.

The Demo-to-Prod Gap: Why Your Magic Demo Will Die

Why the gap exists at all

Gap 1: tail quality (the 80/20 on a fresh tail)

Gap 2: unit economics (the single most skipped calculation)

Gap 3: latency at p99 (the tail users feel)

Gap 4: adversarial robustness

Gap 5: user trust and UX

A worked example: demo vs shipped cost

The failure mode: "productionise it in 6 weeks"

What just changed in your roadmap

Course navigation

Comments

AI for Product

Prototyping With AI as a Non-Coder: Earning Decisions Without Sprints

More from this blog

A Reading List and Two Habits: Staying Current in Ten Minutes a Week

What to Decide Now, What to Defer, What to Ignore: The AI Action Matrix

The Next 18 Months of AI: A Calibrated Leader's Forecast

Calibrating Your AI Exposure: Upside and Downside in One Matrix

Five AI Capabilities That Matter for Your Business, and Five That Do Not

Command Palette

Why the gap exists at all

Gap 1: tail quality (the 80/20 on a fresh tail)

Gap 2: unit economics (the single most skipped calculation)

Gap 3: latency at p99 (the tail users feel)

Gap 4: adversarial robustness

Gap 5: user trust and UX

A worked example: demo vs shipped cost

The failure mode: "productionise it in 6 weeks"

What just changed in your roadmap

Course navigation

Comments

AI for Product

Prototyping With AI as a Non-Coder: Earning Decisions Without Sprints

More from this blog