The Non-Determinism Tax: Why Your Definition of Done Needs a Rewrite
Every AI product pays a hidden tax that no traditional product spec accounts for. The tax is called non-determinism, and it rewrites your definition of done, your testing story, and your roadmap.
Here is the moment every PM on their first AI product has, usually around week six.
You shipped a spec. The engineer built to it. The feature works in staging. You review it. It gives a good answer to your test question. Your boss reviews it, asks the same question, and gets a slightly different good answer. QA reviews it, asks a slightly different question, and gets a bad answer. You dig in. "Why is it different every time?" you ask. "That's how LLMs work," the engineer says. "You asked for an AI product. AI products are non-deterministic."
You nod. You don't have a framework for what to do next. The rest of your week is conversations about "quality" without a clear shared definition of what "quality" is. Your definition of done is broken and you don't know how to fix it.
This post is the fix. The idea is called the non-determinism tax, and it's the hidden cost every AI product pays that no traditional product spec accounts for. Once you know it's there, your definition of done, your testing story, and your shipping confidence all shift into a shape that actually works for probabilistic products.
What non-determinism actually means here
In the traditional software world, determinism is a gift you don't know you have. Given the same inputs, the same function returns the same output, every time, forever. Your tests check for exact matches. Your bugs are reproducible. Your definition of done is "the function returns what we said it should."
AI products don't have this gift. An LLM asked the same question twice will often give two slightly different answers. Same prompt, same settings, different output, often even with temperature set to its lowest value. The same bug might happen 30% of the time and "work fine" 70% of the time. Your QA engineer tests ten variations and finds two failures, and you can't tell whether that's a real rate or a fluke without running more tests. The baseline behaviour is probabilistic, not deterministic, and your entire quality machinery was designed for the deterministic case.
That machinery is: unit tests that check exact output; bug reports with "repro steps" that have to reliably reproduce; definition of done that reads "passes tests"; release criteria that are binary. None of these are wrong, but none of them map cleanly to probabilistic systems. You need a different set.
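To make the "two failures in ten, real rate or fluke?" problem concrete, here is a minimal sketch in Python. It uses the Wilson score interval, a standard way to put error bars on a rate from a small sample. The function name and the choice of a roughly 95% confidence level are my assumptions for illustration, not something your QA tooling necessarily provides.

```python
import math


def wilson_interval(failures: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate; z=1.96 gives roughly 95% confidence."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = failures / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Same observed 20% failure rate, two very different sample sizes.
for failures, runs in [(2, 10), (40, 200)]:
    low, high = wilson_interval(failures, runs)
    print(f"{failures}/{runs} failures: true rate plausibly between {low:.0%} and {high:.0%}")
```

With ten runs, the plausible range is roughly 6% to 51%, which is too wide to argue about. With two hundred runs at the same observed rate, it narrows to roughly 15% to 26%. That gap is the reason "run it more times" is part of the tax.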
The "non-determinism tax" is the name for all the extra work an AI product has to do that a deterministic product doesn't. It's real and it's budgetable.
The tax exists whether you plan for it or not. Teams that plan for it pay it in an organised way; teams that don't pay it in surprise delays, half-finished launches, and "why is this still broken?" meetings with leadership.
The five costs you will pay
The tax has five line items. Each is real, and each is one that teams forget to budget for:
Cost 1: evaluation sets, as part of the spec
In a traditional product, you write a spec and the team implements it. In an AI product, you also write a quality bar — a set of example inputs with expected-ish outputs — and the team has to hit it. The eval set is a first-class artifact of the spec, not an afterthought.
Who writes the eval set? Usually the PM, with help from a domain expert. It's 20-100 concrete examples of the kinds of inputs the feature will see and what "good" looks like for each. It is not a test suite; it's a vocabulary for arguing about quality. Without it, every "is this good enough?" meeting is a debate about vibes.
Budget: 1-2 days to write, plus ongoing maintenance. I'll cover eval-set design properly in P3.2, but for now, know that it goes on your roadmap alongside the feature itself.
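If it helps to picture the artifact, here is one possible shape for an eval set, sketched in Python. The `EvalCase` fields and the `eval_set_problems` helper are hypothetical names I made up for illustration; the only things taken from this post are the 20-100 size range and the idea that each case describes what "good" looks like rather than an exact expected string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str              # a realistic input the feature will see
    good_looks_like: str         # plain-language description of a good output, not an exact string
    tags: tuple[str, ...] = ()   # hypothetical labels, e.g. ("billing", "angry-customer")
    critical: bool = False       # assumed flag: failures here would block a release


def eval_set_problems(cases: list[EvalCase], min_cases: int = 20, max_cases: int = 100) -> list[str]:
    """Return structural problems with the eval set; an empty list means none found."""
    problems = []
    if not min_cases <= len(cases) <= max_cases:
        problems.append(f"expected {min_cases}-{max_cases} cases, got {len(cases)}")
    ids = [c.case_id for c in cases]
    if len(ids) != len(set(ids)):
        problems.append("duplicate case ids")
    problems += [f"{c.case_id}: missing description of a good output"
                 for c in cases if not c.good_looks_like.strip()]
    return problems


cases = [EvalCase(f"case-{i}", f"example input {i}", "states the customer's problem plainly")
         for i in range(20)]
print(eval_set_problems(cases))
```

A spreadsheet with the same columns works just as well. The point is that every case has an ID you can refer to in a meeting and a written description of "good" that someone other than you can apply.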
Cost 2: statistical testing instead of binary testing
A traditional bug report is "this happened, reproduce it." An AI bug report is "this happened 3 times out of 10, here are the 3 bad outputs and the 7 good ones." The bug is a frequency, not a single event. Fixing it means lowering the frequency, not eliminating the specific instance.
Your QA process has to change. Instead of "try it once, write the bug," QA runs each test case multiple times and looks at the distribution. Instead of "the bug is fixed," the team says "the rate went from 30% to 4%, which is under our threshold." This takes longer than traditional QA. Add 30-50% to QA time for AI features, at minimum.
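A rough sketch of what "run it multiple times and look at the rate" can mean in practice, in Python. Everything here is a stand-in: `flaky_feature` is a hypothetical fake that fails about 30% of the time, and the 5% threshold is an assumed number, not a recommendation.

```python
import random
from typing import Callable


def failure_rate(feature: Callable[[str], str], prompt: str,
                 is_failure: Callable[[str], bool], runs: int = 10) -> float:
    """Run the same prompt `runs` times and return the share of failing outputs."""
    failures = sum(is_failure(feature(prompt)) for _ in range(runs))
    return failures / runs


# Hypothetical stand-in for the real feature: returns a bad answer about 30% of the time.
rng = random.Random(0)


def flaky_feature(prompt: str) -> str:
    return "bad answer" if rng.random() < 0.30 else "good answer"


THRESHOLD = 0.05  # assumed agreed maximum failure rate

rate = failure_rate(flaky_feature, "Summarise this call",
                    is_failure=lambda out: out.startswith("bad"), runs=500)
verdict = "under threshold" if rate <= THRESHOLD else "still a bug"
print(f"failure rate {rate:.1%} vs threshold {THRESHOLD:.0%}: {verdict}")
```

In a real setup, `is_failure` is the hard part: it might be a human reviewer, a rule, or a second model acting as grader. The loop around it is the cheap part, and it is what turns "I saw it break once" into a number the team can track.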
Cost 3: shipping criteria that are thresholds, not pass/fail
A traditional release says "all tests pass." An AI release says "the eval set pass rate is above 87%, the regression rate from the previous version is below 2%, and no critical-category failure appears in the failure log." These are threshold criteria, not binary gates. Someone has to agree on the thresholds, and the agreement is political more than technical.
Who sets the thresholds? The PM, collaborating with engineering and any affected business owner. The thresholds should be written down, public, and revisited quarterly. Without them, every release is a debate about whether the feature is ready, and nobody wins a vibes argument.
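Here is a minimal sketch of a threshold-based release check in Python, using the example numbers from above (87% pass rate, 2% regression, zero critical failures). The names are hypothetical, and the definition of "regression" (a case that passed on the previous version and fails now) is my assumption; your team may define it differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseThresholds:
    min_pass_rate: float = 0.87        # eval pass rate must be above this
    max_regression_rate: float = 0.02  # regression rate must be below this
    max_critical_failures: int = 0     # no critical-category failures allowed


def release_decision(current: dict[str, bool], previous: dict[str, bool],
                     critical_ids: set[str],
                     t: ReleaseThresholds = ReleaseThresholds()) -> tuple[bool, list[str]]:
    """Return (ship?, reasons not to ship). Inputs map eval case id -> passed."""
    pass_rate = sum(current.values()) / len(current)
    # Assumed definition: a regression is a case that passed before and fails now.
    was_passing = [cid for cid, ok in previous.items() if ok and cid in current]
    regressed = [cid for cid in was_passing if not current[cid]]
    regression_rate = len(regressed) / len(was_passing) if was_passing else 0.0
    critical_failures = sorted(cid for cid in critical_ids if not current.get(cid, True))

    reasons = []
    if pass_rate <= t.min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.1%} is not above {t.min_pass_rate:.0%}")
    if regression_rate >= t.max_regression_rate:
        reasons.append(f"regression rate {regression_rate:.1%} is not below {t.max_regression_rate:.0%}")
    if len(critical_failures) > t.max_critical_failures:
        reasons.append(f"critical failures: {critical_failures}")
    return not reasons, reasons


case_ids = [f"case-{i}" for i in range(100)]
previous = {cid: i >= 5 for i, cid in enumerate(case_ids)}  # 95 of 100 passed last time
current = {**previous, "case-50": False, "case-51": False, "case-52": False}
ok, reasons = release_decision(current, previous, critical_ids={"case-50"})
print("ship" if ok else "hold", reasons)
```

Note that the example release holds even though its 92% pass rate clears the bar: three previously passing cases now fail, and one of them is critical. That is the value of writing down more than one threshold.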
Cost 4: ongoing quality measurement in production
A traditional feature, once shipped, stays behaviourally the same until you change it. An AI feature drifts. Models update under you. User inputs shift over time. The same prompt that was fine six months ago may be subtly worse today, and you won't know unless you measure. Every AI product needs ongoing quality measurement in production — something watching a sample of real traffic and flagging when quality dips.
This is new operational work. Budget for it. It's not optional.
Cost 5: communication overhead when something goes weird
Users will report bugs that aren't bugs. "The AI told me X yesterday and Y today for the same question — is this broken?" A deterministic product has a clear answer ("no, the output is always the same, you must be misremembering"). A non-deterministic product has to acknowledge that yes, both answers came out, and explain why that's normal without sounding defensive.
Your support team needs training. Your docs need a "how the AI works" section. Your in-product copy needs to set expectations. Every one of these is a small cost that non-AI products don't pay. Budget for it or your support queue becomes a running "is this broken?" thread.
A specific example: how the tax re-shapes a spec
Let me walk through the same feature, specced two ways — traditional and with the tax — to make the difference concrete.
The feature: auto-summarise a customer support call transcript into a one-sentence note for the CRM.
Traditional-style spec:
- Input: call transcript (plain text).
- Output: one-sentence summary in the CRM's "notes" field.
- Acceptance: "When I provide a transcript, the feature generates a summary in under 3 seconds. The summary is written in English. It fits in the CRM's 255-character field."
- Done: when the three bullets above are checked.
Tax-aware spec:
- Input: call transcript (plain text, 50-5,000 words typically).
- Output: one-sentence summary, under 255 characters, focused on the customer's problem not the agent's tone.
- Quality bar (eval set): 40 representative transcripts with expected-summary descriptions. Quality measured by: (a) length under 255 chars, (b) contains the customer's problem in plain language, (c) does not include pleasantries, (d) does not hallucinate facts not in the transcript.
- Shipping threshold: on the eval set, 90%+ pass on (a), 85%+ pass on (b), 95%+ pass on (c), and zero failures on (d).
- QA protocol: each eval case runs 3 times to measure variance; reported as distribution.
- Production monitoring: 1% of real calls are flagged for quality review weekly; track the pass rate as a trend line.
- Failure mode handling: "What do we do when the summary is wrong?" The CRM shows an edit button; the share of summaries that users edit is tracked, and an edit rate above 20% triggers a prompt review.
- Support copy: a "how this works" note in the CRM tooltip explaining that summaries are generated and may need light editing.
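The numeric parts of the tax-aware spec translate fairly directly into checks. Below is a hedged Python sketch: the criterion keys and function names are mine, and only criterion (a) is automated here. Criteria (b) to (d) need a grader, human or otherwise, so the sketch assumes each graded run arrives as a dictionary of pass/fail results.

```python
# Minimum pass rate per criterion, taken from the shipping threshold in the spec.
THRESHOLDS = {
    "a_length": 0.90,
    "b_states_problem": 0.85,
    "c_no_pleasantries": 0.95,
    "d_no_hallucination": 1.00,  # "zero failures"
}


def check_length(summary: str, limit: int = 255) -> bool:
    """Criterion (a): the spec says 'under 255 characters', read here as strictly fewer."""
    return len(summary) < limit


def meets_thresholds(graded_runs: list[dict[str, bool]]) -> dict[str, bool]:
    """For each criterion, report whether its pass rate across all graded runs clears the bar."""
    total = len(graded_runs)
    return {name: sum(run[name] for run in graded_runs) / total >= minimum
            for name, minimum in THRESHOLDS.items()}


def needs_prompt_review(edited: int, shown: int, max_edit_rate: float = 0.20) -> bool:
    """Production trigger: edit rate above 20% means the prompt gets reviewed."""
    return shown > 0 and edited / shown > max_edit_rate


# 40 eval cases x 3 runs = 120 graded runs (hypothetical results).
runs = [dict.fromkeys(THRESHOLDS, True) for _ in range(120)]
for run in runs[:6]:
    run["a_length"] = False           # 6 summaries too long: 95% pass, still fine
runs[10]["d_no_hallucination"] = False  # a single hallucination fails criterion (d)

print(meets_thresholds(runs))
print(needs_prompt_review(edited=25, shown=100))
```

One hallucinated fact in 120 runs is enough to block the release under this spec, while six overlong summaries are not. Whether that is the right trade-off is exactly the conversation the thresholds are meant to force.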
Same feature. The second spec is three times longer and accurately describes the real work. The first spec looks tight and produces a month of "this isn't quite working, is it?" meetings.
The difference is not smart-PM vs dumb-PM. The difference is whether the PM knows the non-determinism tax exists and budgets for it in the spec. Once you know, you write tax-aware specs by default, and your team stops running into surprises six weeks in.
What the tax is NOT
A quick list of things that are sometimes blamed on "AI non-determinism" but aren't actually the tax — they're bugs in something else:
- Model quality being bad. If the model is simply wrong about the task, that's a capability problem, not a non-determinism problem. Fix the prompt, the retrieval, or the model.
- Inconsistent formatting. If the output format drifts, use structured output (engineering concern) rather than complaining about non-determinism.
- Hallucination. Different from non-determinism. The model can be 100% consistent and 100% wrong. This is a grounding problem; fix retrieval or the system prompt.
- Prompt drift over time. That's a deployment issue — the prompt was quietly changed without an eval — not a non-determinism issue.
Knowing these aren't the tax matters because the fix is different. Non-determinism is a budget issue — you pay it by doing more work. These other issues are design issues — you fix them by changing something specific.
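One way to keep these apart in practice is to measure consistency and correctness as two separate numbers. The Python sketch below does that with made-up outputs; the order number and the `is_correct` rule are hypothetical, and real grading would be less tidy than a substring check.

```python
from collections import Counter
from typing import Callable


def consistency(outputs: list[str]) -> float:
    """Share of runs that match the single most common output."""
    return Counter(outputs).most_common(1)[0][1] / len(outputs)


def accuracy(outputs: list[str], is_correct: Callable[[str], bool]) -> float:
    """Share of runs that a grader judges correct."""
    return sum(is_correct(out) for out in outputs) / len(outputs)


# Hypothetical grader: the transcript says the refund was for order 1042.
def is_correct(out: str) -> bool:
    return "1042" in out


consistent_but_wrong = ["Refund issued for order 9999"] * 10
varied_but_right = [
    "Refund issued for order 1042",
    "Customer refunded on order 1042",
    "Order 1042 refunded after complaint",
] * 3

for name, outputs in [("consistent but wrong", consistent_but_wrong),
                      ("varied but right", varied_but_right)]:
    print(f"{name}: consistency {consistency(outputs):.0%}, accuracy {accuracy(outputs, is_correct):.0%}")
```

The first batch is perfectly consistent and entirely wrong, which is a grounding problem. The second varies in wording and is right every time, which is the tax working as intended.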
The failure mode: "we'll handle it in QA"
The single most common way teams screw up the non-determinism tax is pushing it entirely to QA at the end. The PM writes a traditional spec, engineering builds to it, and then at the end of the cycle QA runs the feature a few times, sees inconsistent behaviour, raises a stack of bugs, and the team scrambles to fix them one by one without a shared framework. The feature ships late, the team feels like it was "underspecced," and the retro blames vague things like "AI is hard."
The fix is not to QA harder at the end. The fix is to write the eval set at the beginning, set thresholds up front, and make the tax visible as part of the spec. You can't settle the tax in a lump sum at the end; you pay it along the way.
Teams that do this early feel slower in the first two weeks of a feature and faster from week three onward. Teams that skip it feel fast early and slow forever.
What just changed in your roadmap
- Budget the non-determinism tax explicitly for every AI feature. Add 20-40% to the timeline over a comparable traditional feature. This is real work, not overhead.
- Write an eval set as part of the spec, not after. 20-100 cases, owned by the PM, reviewed with engineering and domain experts. It's the vocabulary you'll use to argue about quality.
- Define shipping thresholds up front, in numbers, not vibes. "Eval pass rate above X, regression below Y, zero critical-category failures."
- Change your QA protocol. Each test case runs multiple times. Report distributions, not single results. This is a cultural shift more than a tooling one.
- Plan for in-production quality monitoring before launch. A sample of real traffic reviewed weekly. Treat it as a ritual, not an incident response.
- Brief your support team. Give them copy for "why does the AI give different answers" questions. Put a "how this works" note somewhere users can find.
- Stop blaming "non-determinism" for things that are actually grounding, formatting, or capability problems. The tax is a budget line; the other things are design fixes.
Next post, P1.3, is the budget question that eats two weeks of every AI project if you don't have a framework: build vs buy vs wrap. When to train your own model, when to pay a vendor, and when to ship an off-the-shelf wrapper with your branding on it.
Course navigation
| ⬅️ Previous | 📍 You are here | Next ➡️ |
|---|---|---|
| P1.1 · AI Feature vs AI Product | P1.2 of P5.4 | P1.3 · Build vs Buy vs Wrap |
📚 AI for Product · Course Home — 20 posts, five modules.
This post is part of the AI for Product series.