Defining Good Enough for a Probabilistic Product

Your team cannot agree on whether the AI is ready to ship because nobody wrote down what good enough means. Here is the four-part quality bar that replaces the vibes meeting.


Fridays on AI teams have a distinct shape. Someone runs the demo. Someone says "that looks pretty good." Someone else says "I saw it fail on a case yesterday that worried me." A third person says "is it ready to ship?" The PM says "what do we mean by ready?" Nobody has an answer. The meeting ends in twenty minutes of "let's look at a few more examples" and the ship date slides by a week.

I have sat in this meeting a dozen times. The root cause is always the same: the team has never written down what "good enough" means for this specific feature, and in the absence of a shared definition, every person in the room is silently using their own. The skeptical engineer is benchmarking against the last outage. The enthusiastic designer is benchmarking against the Tuesday demo. The CEO is benchmarking against competitor marketing copy. None of them is wrong; all of them are incompatible.

This post is the fix. Module P3 opens with the single most important quality question for AI products — how do you know when it's good enough to ship? — and answers it with a four-part bar you can write on one page and defend in a review. No statistics background required. No data science sign-off. Just the specific decisions that turn a vibes meeting into a shipping decision.

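If you only take four things from Module P3, take these four: a rubric of quality categories, a target pass rate, failure mode ceilings, and zero-tolerance critical cases. Every subsequent post builds on them.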


Why "good enough" is hard on AI features

On a traditional feature, "good enough" is binary. Tests pass or they fail. The checkout button either submits an order or it throws an error. Engineering fixes failures; product writes the acceptance criteria; QA verifies; everyone goes home. "Good enough" lives inside "passing."

On an AI feature, "good enough" is a distribution. The checkout button becomes "the AI summariser produces a one-sentence summary of a support ticket." Run it on one ticket: maybe the summary is perfect. Run it on a hundred tickets: some are perfect, some are fine, some are slightly off, a few are genuinely wrong. The team stares at this and argues.

The argument is unresolvable at the level of examples. You can always find an example that looks bad, and you can always find one that looks great. The only way to get out of the loop is to stop arguing about examples and start measuring the distribution — the rate at which outputs fall into each quality category, the rate at which specific failure types appear, and the threshold above which the feature is allowed to ship. That's "good enough" for AI products, and it requires four specific decisions instead of one.

Four parts. Let me take each in detail with concrete guidance and 2026-realistic numbers.


Part 1: quality categories

Before you pick a number, you have to decide what "good" means for this specific feature. "Good" on a customer-support summary is not the same as "good" on a code suggestion or a recommendation. Every feature has its own rubric, and the rubric has to be explicit before you measure anything.

The rubric I use has four categories, designed to be coarse enough to score quickly and fine enough to be actionable:

  • Pass. The output is correct and usable as-is. A user would not need to edit, correct, or verify it.
  • Borderline. The output has a minor issue — tone is slightly off, a minor fact is missing, formatting is imperfect — but it's still useful. A user would accept it with a small edit.
  • Bad. The output is wrong, misleading, unhelpful, or off-topic. A user would reject it and either retry or abandon the feature.
  • Critical. The output is actively harmful — wrong in a way that could create legal, compliance, safety, or trust damage. A single occurrence is a shipping block.

Four categories, four clear definitions. Now the hard part: write the specific rubric for your feature. What specifically counts as "pass" vs "borderline" vs "bad" for this output?

Example for a customer-support summary feature:

Pass: Summary is 1-2 sentences, captures the customer's core issue in plain language, names specific entities (product, amount, date) if they're in the source, no hallucinated facts, no apologies or pleasantries.

Borderline: Summary has 3+ sentences, or omits one minor entity, or includes a polite phrase, or has awkward phrasing — but a reader would still understand the core issue.

Bad: Summary misstates the core issue, misses the customer's request entirely, or includes pleasantries that pad length.

Critical: Summary fabricates a fact (wrong amount, wrong product, wrong date), misattributes the complaint to the wrong customer, or contains content that violates a privacy policy.

Four categories, one paragraph each, specific to the feature. This rubric is the first thing to write when defining quality, and it's the thing most teams skip. Without the rubric, every scorer is measuring something slightly different and your "pass rate" is noise.

Rules for writing a rubric that works:

  1. Be specific enough that two scorers agree 90%+ of the time. If two people score the same output and disagree, the rubric is too vague. Tighten the definitions.
  2. Tie critical categories to real downstream consequences. "Critical" should map to things your legal, compliance, or safety teams would actually care about. If nothing would happen on a critical failure, it isn't critical.
  3. Keep the rubric to one page. A 20-page rubric nobody reads is equivalent to no rubric.
  4. Validate the rubric with a pilot scoring session. Have 2-3 people score the same 20 outputs independently. Compare. Update the rubric until agreement is high. This costs two hours and prevents weeks of downstream confusion.
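
To make the 90% agreement rule concrete, here is a minimal sketch of how a pilot session could be tallied. The function name `pairwise_agreement` and the pilot data are hypothetical, and the measure is raw percent agreement across every pair of scorers (it does not correct for agreement by chance).

```python
from itertools import combinations

def pairwise_agreement(scores_by_scorer):
    """Fraction of (scorer pair, case) comparisons where both labels match."""
    if len(scores_by_scorer) < 2:
        raise ValueError("need at least two scorers")
    matches = 0
    comparisons = 0
    for a, b in combinations(scores_by_scorer.values(), 2):
        if len(a) != len(b):
            raise ValueError("every scorer must label the same cases")
        for label_a, label_b in zip(a, b):
            comparisons += 1
            matches += label_a == label_b
    return matches / comparisons

# Hypothetical pilot: three scorers, five outputs (a real pilot would use ~20).
pilot = {
    "scorer_a": ["pass", "pass", "borderline", "bad", "pass"],
    "scorer_b": ["pass", "borderline", "borderline", "bad", "pass"],
    "scorer_c": ["pass", "pass", "borderline", "bad", "critical"],
}
agreement = pairwise_agreement(pilot)
print(f"agreement: {agreement:.0%}")  # 11 of 15 comparisons match: 73%
```

At 73% this hypothetical rubric is too vague. The disagreements on the second and fifth outputs are the definitions to tighten before scoring again.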

Part 2: target pass rate

Once you have the rubric, the second decision is a number: what pass rate on a representative eval set is good enough to ship?

The instinct is to reach for 95% or 99%. Resist. On most real AI features, 95% is genuinely out of reach without heavy engineering investment, and the feature would be valuable at 80% if the failure modes are benign. The right target depends on three variables:

  • The cost of a failure. How bad is it when the output is wrong? Wrong support summary: user manually re-reads the ticket (cheap). Wrong legal clause extraction: lawyer misses a clause and signs a bad contract (very expensive). The higher the cost of a failure, the higher your target pass rate has to be.
  • The user's ability to catch failures. Can the user see when the model is wrong? For a drafted email the user will read, yes — they'll catch and edit. For an automated classification that routes tickets without review, no — the wrong classification happens silently. The less visible the failure, the higher the target.
  • The alternative. What does the user do today? If the baseline is "nothing" or "slow manual work," an 80% pass rate is transformative. If the baseline is "an existing automated system at 92%," an 85% AI version is a regression.

Concrete bands I've used successfully in 2026, calibrated against real shipped features:

| Feature type | Target pass rate | Notes |
| --- | --- | --- |
| Draft something the user will edit and approve | 78-88% | User catches failures; edits are cheap |
| Extract structured fields the user will verify | 88-94% | User catches failures, but verification load matters |
| Classify tickets or content, user-visible | 85-92% | User sees class, can correct |
| Classify tickets or content, silent routing | 93-97% | User doesn't see failures; higher bar |
| RAG answer with cite-or-refuse | 85-92% on answered queries, >98% on refusals | Refusing correctly matters more than answering perfectly |
| Grounded factual output in regulated domain | 95%+ | Legal / medical / financial; high cost of failure |
| Creative generation where user picks one of several | 70-85% per candidate | User picks best; lower individual bar OK |

These are starting targets, not commandments. Adjust up or down based on your cost/visibility/alternative calculation. What matters is that the target is written down in numbers, not "we want it to be high quality."

The specific trap: copying a target from a benchmark without understanding what it measured. "GPT-5 scores 92% on MMLU" does not mean your summariser will pass at 92% on your eval set. Public benchmarks measure curated tasks on curated inputs; your feature runs on real user inputs. The gap between benchmark and eval-set performance is routinely 10-20 percentage points, always in the direction you don't want.


Part 3: failure mode ceilings

Pass rate alone is not enough. A feature can pass at 88% overall and still be broken if 100% of the failures are in one specific category that users hit every day. You need ceilings on each failure type separately.

The move: after you score an eval set, group the bad and critical failures by category, and set a maximum rate for each. If any category exceeds its ceiling, the feature blocks on that category even if the overall pass rate is fine.

Example for the support summariser:

Overall pass rate: ≥ 85%.

Failure mode ceilings (percentage of total eval cases):

  • Wrong core issue: ≤ 3%
  • Missing key entity: ≤ 5%
  • Hallucinated fact: ≤ 1%
  • Pleasantry padding: ≤ 8%
  • Off-topic response: ≤ 2%

Note that these ceilings do not have to sum to 15% (the "non-pass" budget). Some outputs have multiple failure modes; some cases get counted in two categories. The ceilings are per category, not a division of the pass rate.
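
A small sketch of the per-category check, assuming each scored eval case carries a set of failure tags (empty for a clean case). The tag names, `ceiling_breaches`, and the 100-case run are hypothetical; only the ceiling values come from the example above. Because a case can carry several tags, the category rates are counted independently and need not add up to the non-pass budget.

```python
from collections import Counter

# Ceilings from the support summariser example, as shares of total eval cases.
CEILINGS = {
    "wrong_core_issue": 0.03,
    "missing_key_entity": 0.05,
    "hallucinated_fact": 0.01,
    "pleasantry_padding": 0.08,
    "off_topic": 0.02,
}

def ceiling_breaches(failure_tags_per_case, ceilings):
    """Return {mode: observed_rate} for every failure mode above its ceiling."""
    total = len(failure_tags_per_case)
    if total == 0:
        raise ValueError("eval set is empty")
    # A case with two failure modes is counted once in each category.
    counts = Counter(tag for tags in failure_tags_per_case for tag in set(tags))
    return {
        mode: counts[mode] / total
        for mode, limit in ceilings.items()
        if counts[mode] / total > limit
    }

# Hypothetical run of 100 cases: 90 clean, 10 with one or two failure tags.
cases = [set() for _ in range(90)]
cases += [{"wrong_core_issue"}] * 2
cases += [{"hallucinated_fact", "missing_key_entity"}] * 2
cases += [{"pleasantry_padding"}] * 6
breaches = ceiling_breaches(cases, CEILINGS)
print(breaches)  # only hallucinated_fact (2%) is above its 1% ceiling
```

In this made-up run the overall picture looks healthy, yet the hallucination ceiling is breached, so the feature blocks on that one category.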

Why ceilings matter more than they sound like:

  • They prevent the "we'll fix it later" trap where one nasty failure mode persists unaddressed because the overall number looks acceptable.
  • They make the quality conversation specific. "We're failing the hallucination ceiling" is a clear problem with a clear solution; "quality feels off" is a vibe.
  • They catch regressions that the pass rate would hide. A new prompt might improve overall pass rate from 85% to 87% while hallucination rate goes from 1% to 4%. The pass rate celebrates; the ceilings block.

The specific trap: setting ceilings that are too loose because nobody wants to block ship. Ceilings should be uncomfortable. If your current feature is at 4% hallucination and you set the ceiling at 5%, you've learned nothing; a regression is silently allowed. Set ceilings tight enough that they catch real degradations, and use them as targets to push against, not walls to avoid.


Part 4: zero-tolerance critical cases

The fourth and most load-bearing piece. Critical failures — the category from Part 1 that maps to legal, compliance, safety, or trust damage — get their own, stricter, treatment: a single occurrence in the eval set blocks ship, period.

Zero tolerance means zero. Not "below 1%." Not "minimum threshold." Zero. If the eval set has one case where the feature fabricates a dollar amount, misattributes content to the wrong customer, reveals internal information, or produces content that violates a policy, the feature is not allowed to ship with that prompt and model combination. Engineering has to fix or re-scope; the PM cannot wave it through.

Why so strict? Because critical failures are exactly the ones where a single production occurrence creates disproportionate damage. A 0.3% rate sounds small until you translate it: at 100,000 production calls a week, that's 300 compliance incidents a week. The team that ships a 0.3% critical-failure feature is shipping 300 lawsuits a week.
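
The translation from a rate to a weekly count is one multiplication, sketched here so you can plug in your own traffic. `expected_incidents` is a hypothetical helper; the 0.3% rate and 100,000 calls are the figures from the paragraph above.

```python
def expected_incidents(failure_rate, calls_per_week):
    """Expected number of critical failures per week at a given failure rate."""
    return failure_rate * calls_per_week

weekly = expected_incidents(0.003, 100_000)
print(round(weekly))  # 300
```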

How to set zero-tolerance categories:

  1. Map critical failures to real consequences. "Fabricates a fact that a customer could reference in a dispute" → legal risk. "Reveals data from another tenant" → privacy incident. "Produces content that violates GDPR" → regulatory fine. Tie each critical category to a specific downstream cost.
  2. Get sign-off from the owner of that cost. Your legal team should agree with your definition of "legal risk." Your compliance team should agree with "compliance risk." If the owners don't sign off, you're making policy on their behalf without their input — a career-limiting move on the first incident.
  3. Keep the list small. Three to five zero-tolerance categories maximum. More than that and the list gets ignored. Prioritise the categories where a single incident would be career-limiting.
  4. Review quarterly. Critical categories can shift as the product evolves. Re-validate the list every quarter and after any regulatory change.

The defensive move: explicitly write "if any critical case occurs in the eval set, ship is blocked" into the feature's one-pager (Section 6, the quality bar, from P2.4). When the first critical failure shows up in eval, the conversation has already happened, because the rule was set in advance. Without the pre-commitment, the rule gets debated in the moment, and political pressure usually wins over safety pressure.


Putting the four parts together

Here is the full quality-bar block for the support summariser, in the form it would appear in a PRD:

## Quality bar

Rubric: see rubric v2 in /quality/support_summariser_rubric.md

Eval set: 80 cases. Sourced 70% real anonymised tickets, 30% synthetic
edge cases. Includes 12 critical-priority cases covering fabrication,
misattribution, and PII leaks.

Target pass rate: ≥ 85% on full eval set.

Failure mode ceilings:
  - Wrong core issue: ≤ 3%
  - Missing key entity: ≤ 5%
  - Hallucinated fact: ≤ 1%
  - Pleasantry padding: ≤ 8%
  - Off-topic: ≤ 2%

Zero-tolerance (one occurrence blocks ship):
  - Fabricated dollar amount, product name, or date
  - Misattribution of complaint to wrong customer or account
  - PII leak from another ticket
  - Content violating GDPR right-to-be-forgotten policy

Ship criteria: all four satisfied on the most recent eval run before the
ship candidate is approved.

Under twenty lines. Every one of them is specific. Every one of them turns a quality argument into an explicit criterion instead of a vibe. A team looking at this PRD can disagree with the numbers (and should), but they can't disagree about what they're measuring. The disagreement becomes productive: "I think 85% is too low for this user segment" is a real conversation that ends with a decision.

Compare this to the traditional version: "the feature should produce high-quality summaries." That sentence is a vibe, not a criterion, and every ship review about it is a waste of time.
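
If the eval scores live in a script or a spreadsheet export, the numeric parts of the bar can be checked mechanically. What follows is a rough sketch under my own assumptions, not a prescribed implementation: `QualityBar`, `EvalRun`, `ship_blockers`, the category keys, and the sample run are all hypothetical names and data, while the thresholds are the ones from the PRD block.

```python
from dataclasses import dataclass, field

@dataclass
class QualityBar:
    min_pass_rate: float   # target pass rate as a fraction of eval cases
    ceilings: dict         # failure mode -> max share of eval cases
    zero_tolerance: set    # critical categories; one occurrence blocks ship

@dataclass
class EvalRun:
    total_cases: int
    passes: int
    failure_counts: dict = field(default_factory=dict)   # mode -> case count
    critical_counts: dict = field(default_factory=dict)  # category -> case count

def ship_blockers(bar, run):
    """Return human-readable reasons the run fails the bar (empty means ship)."""
    blockers = []
    pass_rate = run.passes / run.total_cases
    if pass_rate < bar.min_pass_rate:
        blockers.append(f"pass rate {pass_rate:.1%} below target {bar.min_pass_rate:.0%}")
    for mode, limit in bar.ceilings.items():
        rate = run.failure_counts.get(mode, 0) / run.total_cases
        if rate > limit:
            blockers.append(f"{mode} at {rate:.2%} exceeds ceiling {limit:.0%}")
    for category in sorted(bar.zero_tolerance):
        hits = run.critical_counts.get(category, 0)
        if hits:
            blockers.append(f"zero-tolerance hit: {category} ({hits} case(s))")
    return blockers

bar = QualityBar(
    min_pass_rate=0.85,
    ceilings={"wrong_core_issue": 0.03, "missing_key_entity": 0.05,
              "hallucinated_fact": 0.01, "pleasantry_padding": 0.08,
              "off_topic": 0.02},
    zero_tolerance={"fabricated_fact", "misattribution", "pii_leak",
                    "gdpr_violation"},
)

# Hypothetical run on the 80-case eval set.
run = EvalRun(total_cases=80, passes=70,
              failure_counts={"wrong_core_issue": 2, "hallucinated_fact": 1,
                              "pleasantry_padding": 5})
blockers = ship_blockers(bar, run)
print(blockers or "ship")
```

One detail this surfaces: on an 80-case set a single case is 1.25% of the total, so a 1% ceiling effectively allows no cases in that category. That may be exactly what you want, but it is worth knowing when you pick the numbers.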


A 2026 worked example: raising quality from 76% to 87%

Let me walk through what it looks like to apply the framework to a feature that isn't ready yet, and how the four parts guide the work to close the gap.

Starting state (week 1):

  • Feature: summarise support tickets into one-sentence CRM notes.
  • Model: Claude Sonnet 4.6, temperature 0, custom prompt.
  • Eval set: 80 real tickets, scored against rubric v1.
  • Result: 76% pass, 18% borderline, 4% bad, 2% critical (2 cases: one fabrication, one misattribution).
  • Failure modes (of the 4% bad): 65% wrong core issue, 25% hallucinated fact, 10% off-topic.

Ship criteria (fixed at the start):

  • Pass rate ≥ 85%
  • Wrong core issue ≤ 3%, hallucinated fact ≤ 1%, off-topic ≤ 2%
  • Zero critical failures

Verdict on week 1: not ready. Missing pass rate target by 9 points. Hallucination rate (25% of 4% = 1% of total) is at the ceiling. Two critical cases — automatic block.

Week 2: fix critical. Engineering adds a guardrail that strips fabricated-looking numeric strings from outputs unless they appear verbatim in the input. Retry: 0 critical failures. Critical ceiling met.

Week 3: fix hallucination. Engineer adds a "ground in source text" instruction and few-shot examples of grounded summarisation. Pass rate 82%, hallucination rate 0.5% (hallucination ceiling now met), wrong-core-issue rate 4% (now above its 3% ceiling, up from 2.6% in week 1).

Week 4: fix core-issue extraction. Engineer experiments with a two-step approach (extract key entities first, then summarise using only those entities). Pass rate 87%, wrong-core-issue rate 2.5% (now below ceiling). All ceilings met. Zero critical. Ship.

Four weeks, measured progress, specific interventions, a clear "done" point. Compare this to the alternate universe where "quality feels off, let's iterate more" is the only thing on the meeting agenda. That team is still iterating in week six and leadership is losing patience.

The four-part bar isn't just a shipping criterion — it's a work planner. Each week's engineering time goes to the specific gap the bar identifies, and the team has a visible number to defend each week. This is the single biggest operational win of writing a real quality bar.
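
The work-planner idea can be sketched as a weekly gap report. The structure below is my own illustration (`BAR`, `WEEKS`, and `open_gaps` are hypothetical names), and it uses only figures stated in the walkthrough, with the week 1 failure-mode rates derived from the 4% bad share (65%, 25%, and 10% of 4%). Metrics the walkthrough does not report for a given week are left out rather than guessed.

```python
# Ship criteria from the worked example, in percentage points.
BAR = {
    "pass_rate": ("min", 85.0),
    "wrong_core_issue": ("max", 3.0),
    "hallucinated_fact": ("max", 1.0),
    "off_topic": ("max", 2.0),
    "critical_cases": ("max", 0),
}

# Only the numbers the walkthrough reports for each week.
WEEKS = {
    1: {"pass_rate": 76.0, "wrong_core_issue": 2.6, "hallucinated_fact": 1.0,
        "off_topic": 0.4, "critical_cases": 2},
    2: {"critical_cases": 0},
    3: {"pass_rate": 82.0, "hallucinated_fact": 0.5, "wrong_core_issue": 4.0},
    4: {"pass_rate": 87.0, "wrong_core_issue": 2.5, "critical_cases": 0},
}

def open_gaps(results, bar):
    """List the reported metrics that miss the bar, in the bar's own order."""
    gaps = []
    for metric, (kind, target) in bar.items():
        if metric not in results:
            continue  # not reported this week, so nothing to judge
        value = results[metric]
        missed = value < target if kind == "min" else value > target
        if missed:
            gaps.append(metric)
    return gaps

for week, results in WEEKS.items():
    gaps = open_gaps(results, BAR)
    print(f"week {week}: open gaps among reported metrics: {gaps or 'none'}")
```

Each non-empty list is next week's engineering agenda, and an empty list on a fully reported run is the "done" point.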


The specific trap: inflation of "pass"

One subtle and common failure mode deserves naming. As a feature matures, there's a quiet pressure to soften the rubric. A case that was "borderline" in week 1 gets re-categorised as "pass" by week 4 because the team has seen so many that the bar has shifted. The numerical pass rate goes up. The actual feature quality hasn't changed. The team ships on a phantom improvement.

Rubric drift is the name, and it's insidious because no single decision is wrong — each individual re-categorisation feels defensible. The effect is cumulative.

Two defences:

  1. Lock the rubric version for the duration of a ship cycle. If the rubric needs to change, bump the version and re-score the whole eval set. Don't quietly modify the rubric and call the new number comparable.
  2. Use multiple scorers on contested cases. When a score changes from "borderline" to "pass," require a second scorer to agree. This slows drift.
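
Both defences can be enforced in whatever tooling holds the scores. Here is a minimal sketch, assuming each scored run records the rubric version it was scored against. `ScoredRun`, `pass_rate_delta`, `needs_second_scorer`, and the ticket ids are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScoredRun:
    rubric_version: str
    labels: dict  # case id -> "pass" | "borderline" | "bad" | "critical"

def pass_rate(run):
    """Pass rate of a run, in percent."""
    return 100 * sum(label == "pass" for label in run.labels.values()) / len(run.labels)

def pass_rate_delta(baseline, candidate):
    """Pass-rate change in percentage points; refuses cross-version comparisons."""
    if baseline.rubric_version != candidate.rubric_version:
        raise ValueError(
            f"rubric {baseline.rubric_version} vs {candidate.rubric_version}: "
            "re-score the whole eval set before comparing"
        )
    return pass_rate(candidate) - pass_rate(baseline)

def needs_second_scorer(baseline, candidate):
    """Case ids whose label moved from borderline to pass between runs."""
    return sorted(
        case_id
        for case_id, label in candidate.labels.items()
        if label == "pass" and baseline.labels.get(case_id) == "borderline"
    )

# Hypothetical four-case runs scored against the same rubric version.
week1 = ScoredRun("v1", {"t1": "pass", "t2": "borderline", "t3": "bad", "t4": "pass"})
week4 = ScoredRun("v1", {"t1": "pass", "t2": "pass", "t3": "borderline", "t4": "pass"})

delta = pass_rate_delta(week1, week4)
contested = needs_second_scorer(week1, week4)
print(delta, contested)  # 25.0 ['t2']
```

The comparison fails loudly when the versions differ, which is the point: a number scored under rubric v2 is not comparable to one scored under v1 until the full set has been re-scored.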

A less subtle version of the same trap: reducing the eval set size when the pass rate is below target. "Let's remove the hardest 10 cases because they're unfair." No. Those are the cases the feature needs to handle, and removing them is lying to yourself.


The failure mode: "quality is a vibe"

The single most common way this whole framework fails is the team that reads it and says "this is too rigid for our creative product." Or "our feature is too novel to measure this way." Or "our stakeholders won't accept a pass rate below 95%." Each excuse ends with the team not writing the bar and shipping on vibes instead.

The diagnosis is easy: if your AI feature is in its third month of "still not quite ready" without a written bar, the problem is not that the framework is too rigid. The problem is that nobody knows when to ship, and the "too rigid" objection is a symptom of not deciding. The framework is an alignment tool more than a measurement tool. Its primary value is making the argument happen once, up front, with numbers, rather than every week with vibes.

The response when you hear "this is too rigid": "I hear you. Let's write a draft bar this afternoon, pilot-score 20 cases to see if the rubric is usable, and iterate the numbers on Friday. If the framework genuinely doesn't fit the feature, we'll know in a day instead of a quarter." Most of the time, the team that was skeptical ends up with a working bar they defend the following week.


What just changed in your roadmap

  • Write a four-part quality bar for every AI feature before you start measuring anything. Rubric, pass rate, ceilings, zero-tolerance. One page, 15-20 lines.
  • Pilot-score the rubric with two scorers on 20 cases before committing. 90%+ agreement means the rubric is usable.
  • Set the target pass rate from the cost/visibility/alternative triangle, not from benchmark envy. 85% is ambitious for most real features; 95% is usually out of reach without major investment.
  • Always set ceilings on failure modes, not just a total pass rate. Ceilings catch the "overall number looks fine but one bug is killing us" case.
  • Get sign-off on zero-tolerance categories from the owner of the downstream cost. Legal owns legal risk; compliance owns compliance risk. Don't make policy on their behalf.
  • Lock the rubric version per ship cycle. Drift is silent and cumulative.
  • Use the quality bar as a work planner, not just a shipping gate. Each unmet criterion is a specific chunk of engineering work with a measurable end state.
  • When stakeholders say "this is too rigid," treat it as "nobody knows when to ship." The framework is the cheapest alignment tool in the room.

Next post, P3.2, dives into the practical side of the rubric and the eval set: how to build an eval set as a PM. What cases to include, where to source them, how big to start, how to keep it fresh, and how to keep ownership from silently moving to engineering and then to nobody at all.



📚 AI for Product · Course Home — 20 posts, five modules.


Cover photo via Unsplash. This post is part of the AI for Product series.
