Reading Eval Numbers: the PM's Skeptic Kit for Quality Reports

Your engineering team brings you a number. 87 percent. Is that good? Is it real? Is it better than last week? Is it the number you should care about? Here is the skeptic kit for reading any AI quality report without getting fooled.

Updated • 16 min read

Your engineer walks into the ship review with a slide. The slide has one big number on it: "87%." Below the number, in smaller text: "eval set pass rate, up from 82% last week." The room nods. Someone says "great work." The PM is about to say "OK ship it." The meeting is moving fast. And now you have about nine seconds to decide whether that 87% is real.

This is the moment Module P3 has been building toward. P3.1 taught you what "good enough" means. P3.2 taught you how to build the eval set. P3.3 taught you how to grade it. This post teaches you how to read the number a team brings you without getting fooled by the specific, predictable ways quality reports dress up progress that isn't quite there. It is the single most leverage-dense skill a PM of an AI product can develop, because the quality report is where every shipping decision eventually lands, and a skeptic kit lets you catch the problems before the ship, not after.

Ten questions, three fool-yourself patterns, and a one-page format you can paste into every ship review from Monday.


Why the single number is never enough

Start with the mental model: an 87% pass rate is not a fact; it is a claim that has many ways of being wrong. The number is the output of a pipeline: the eval set, the rubric, the grader, the run configuration, the date. Each piece of the pipeline can be wrong in a way that makes the number look better (or worse) than the truth. A PM who accepts the number at face value is accepting the whole pipeline at face value, sight unseen.

The analogous skill in traditional software is reading a code review or a benchmark result: you don't accept "I ran the tests, they passed"; you ask which tests, what config, when. The same interrogation applies here. It's not about distrusting the team; it's about giving the number the scrutiny it deserves because the pipeline that produced it has more hidden assumptions than a test suite does.

Those are the obvious questions, and most teams don't ask them. Let me walk through the full ten I actually ask, and the three specific fool-yourself patterns they catch.


The ten questions

Each question takes about 10 seconds to ask and the answer takes about 30 seconds to give. Ten questions, 400 seconds, one ship-review conversation. The time is cheap and the return is catching things that would otherwise ship broken.

Question 1: "Which eval set version ran?"

Eval sets evolve. Version 3 has 60 cases; version 4 has 80; version 5 dropped 10 cases because they were "too ambiguous." The pass rate on v5 is not comparable to the pass rate on v4, and comparing them gives you a false improvement. Always ask for the version. If the answer is "the current set," you do not know what you're comparing. Make the team version the eval set (spreadsheet filename with a date, git tag, platform version) and reference it by version in every report.

Question 2: "Which grader version?"

Same problem. A rule-based grader that had 8 rules last week and 12 this week is a different measuring stick. An LLM-as-judge with an updated prompt is a different measuring stick. Ask "have we changed the grader since last run?" If yes, the comparison is not apples-to-apples. The team should re-run the previous model/prompt against the current grader to get a real before/after. If they haven't, the improvement claim is invalid.

Question 3: "What's the breakdown by stratum?"

Always break the number down by the strata from P3.2: common happy path, edge cases, ambiguous, adversarial, critical. An 87% overall pass rate can mean:

  • 93% common, 85% edge, 72% adversarial, 100% critical: healthy
  • 97% common, 70% edge, 40% adversarial, 100% critical: the overall looks fine but edge and adversarial are broken
  • 80% common, 100% edge, 100% adversarial, 100% critical: common is the problem

These three are indistinguishable in the overall number and very different in practice. The right question: "what does the table look like?" Not "what's the number?"
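To make the point concrete, here is a minimal Python sketch of two eval runs that share the same headline number but have very different stratum tables. The case counts are invented for illustration, and `stratum_table` and `make_results` are hypothetical helpers, not part of any eval platform.

```python
from collections import defaultdict

def stratum_table(results):
    """results: list of (stratum, passed) pairs.
    Returns (overall_pass_rate, {stratum: pass_rate})."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for stratum, passed in results:
        totals[stratum] += 1
        passes[stratum] += int(passed)
    overall = sum(passes.values()) / sum(totals.values())
    return overall, {s: passes[s] / totals[s] for s in totals}

def make_results(counts):
    """counts: {stratum: (n_pass, n_total)} -> flat list of (stratum, passed)."""
    out = []
    for stratum, (n_pass, n_total) in counts.items():
        out += [(stratum, True)] * n_pass + [(stratum, False)] * (n_total - n_pass)
    return out

# Two hypothetical 100-case runs; each has exactly 87 passing cases.
healthy = make_results({"common": (46, 50), "edge": (21, 25),
                        "adversarial": (10, 15), "critical": (10, 10)})
lopsided = make_results({"common": (50, 50), "edge": (18, 25),
                         "adversarial": (9, 15), "critical": (10, 10)})

overall_a, table_a = stratum_table(healthy)
overall_b, table_b = stratum_table(lopsided)
```

Both `overall_a` and `overall_b` come out at 0.87, while the edge and adversarial rows differ by 12 and roughly 7 points respectively. The table is the report; the headline is just its weighted average.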

Question 4: "What are the top three failure modes?"

Rank the failures by frequency. The top three usually account for 60-80% of total failures on mature features. If the top three are concentrated (e.g., "65% of failures are wrong-core-issue, 20% are hallucinations, 10% are format drift"), you have a specific engineering target for the next iteration. If the failures are scattered (no mode over 15%), the problem is subtler and probably requires rethinking the prompt or retrieval, not fixing specific bugs.

The team should bring this breakdown to every review unprompted. If they don't, they probably haven't looked at the failures; they're only looking at the aggregate. Push back.
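If each failed case carries a failure-mode label, the ranking is a few lines of code. This is a sketch under that assumption; `top_failure_modes` is a hypothetical helper, the labels are invented, and the 15% "scattered" threshold is simply the rule of thumb from the paragraph above.

```python
from collections import Counter

def top_failure_modes(failure_labels, k=3, scattered_threshold=0.15):
    """failure_labels: one label per FAILED case.
    Returns (top_k [(mode, share)], combined share of top_k, scattered flag)."""
    counts = Counter(failure_labels)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0, False
    top = [(mode, n / total) for mode, n in counts.most_common(k)]
    top_share = sum(share for _, share in top)
    # "Scattered" here means no single mode exceeds the threshold.
    scattered = max(counts.values()) / total <= scattered_threshold
    return top, top_share, scattered

# Hypothetical labels for 20 failed cases.
labels = (["wrong-core-issue"] * 13 + ["hallucination"] * 4
          + ["format drift"] * 2 + ["other"])
top, top_share, scattered = top_failure_modes(labels)
```

With these made-up labels the top mode accounts for 65% of failures and `scattered` is `False`, which is the "specific engineering target" shape.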

Question 5: "How many eval cases?"

The size of the set determines the variance of the number. On 20 cases, the pass rate can swing 5-10 points between runs just from sampling noise. On 100 cases, it's more like 2-4 points. On 300, under 2. If the team proudly reports "we went from 82% to 87%" on a 30-case set, the "improvement" might be noise. Ask for the absolute count of cases that changed category, not just the percentage. "4 more cases pass than last week, out of 30" is noise. "20 more cases pass than last week, out of 300" is signal.
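Counting the cases that actually changed category is easy if both runs record a pass/fail result per case ID. A minimal sketch, assuming that per-case record exists; `case_flips` and the case IDs are hypothetical.

```python
def case_flips(previous, current):
    """previous/current: {case_id: passed}. Compares only cases present in both runs.
    Returns (newly_passing_ids, newly_failing_ids)."""
    shared = previous.keys() & current.keys()
    fixed = sorted(c for c in shared if not previous[c] and current[c])
    broken = sorted(c for c in shared if previous[c] and not current[c])
    return fixed, broken

# Hypothetical per-case results from two runs.
previous = {"c1": True, "c2": False, "c3": False, "c4": True}
current = {"c1": True, "c2": True, "c3": False, "c4": False}

fixed, broken = case_flips(previous, current)
net_change = len(fixed) - len(broken)
```

In this toy example one case was fixed and one broke, so the pass rate is unchanged even though two cases moved. The raw counts show churn that the percentage hides.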

Question 6: "How many runs did you average over?"

Even on a fixed eval set, the model's output varies run to run (P1.2, the non-determinism tax). A single run of a single eval can differ 1-3 points from another run on the same prompt and model. Responsible teams run the eval 3-5 times on each prompt candidate and report the average and the spread. If the team ran it once and reports a point estimate, the number has invisible noise and the comparison to last week may be swamped by it. Ask for the variance; if they can't give it, that's a flag.
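Reporting the average and the spread needs nothing beyond Python's standard library. A sketch, assuming the team has the pass rate from each repeated run; `summarize_runs` is a hypothetical name and the three pass rates are invented.

```python
from statistics import mean, stdev

def summarize_runs(pass_rates):
    """pass_rates: pass rate (in percent) from each repeated run of the SAME
    eval set, prompt, and model. Returns mean, sample stdev, and range."""
    if len(pass_rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "mean": mean(pass_rates),
        "stdev": stdev(pass_rates),
        "min": min(pass_rates),
        "max": max(pass_rates),
    }

# Three hypothetical runs of one prompt candidate.
summary = summarize_runs([86.0, 88.0, 87.0])
```

Here the mean is 87.0 with a sample standard deviation of 1.0 point. A week-over-week delta smaller than that spread is not worth a slide.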

Question 7: "What's the confidence interval?"

You don't need stats training for this. Rough rule of thumb: for a binary pass/fail on N cases, one standard error is √(p(1-p)/N), and the 95% confidence interval is roughly twice that, ±2√(p(1-p)/N). For 85% pass on 100 cases, the standard error is about 3.6 points, so the 95% interval is roughly ±7 points. So an improvement from 82% to 87% (a 5-point gap on a 100-case set) sits inside the noise. On 50 cases, 82% to 87% is absolutely noise. On 300 cases the interval shrinks to about ±4 points, and a 5-point gap starts to look like a real improvement. The right question: "at this eval set size, is this delta larger than the confidence interval?"

The team doesn't have to quote stats jargon; they just have to understand that small improvements on small sets are unreliable. Push back on any claim that doesn't account for this.
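For anyone who wants to check the arithmetic, here is a small sketch using the normal approximation for a binary pass rate. The square-root term √(p(1-p)/N) is one standard error; the conventional 95% interval is about 1.96 times that. The function names are hypothetical, and the approximation gets rough for very small sets or pass rates near 0% or 100%.

```python
import math

def standard_error(pass_rate, n_cases):
    """Standard error of a pass rate (fraction between 0 and 1) on n_cases."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n_cases)

def ci95_half_width(pass_rate, n_cases):
    """Half-width of the approximate 95% interval (normal approximation)."""
    return 1.96 * standard_error(pass_rate, n_cases)

# How the interval narrows as the eval set grows, at an 85% pass rate.
for n in (50, 100, 300):
    half_width_points = round(100 * ci95_half_width(0.85, n), 1)
    print(f"N={n}: 85% ± {half_width_points} points")
```

At an 85% pass rate this gives roughly ±9.9 points on 50 cases, ±7.0 on 100, and ±4.0 on 300, which is why a 5-point week-over-week delta means very different things at different set sizes.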

Question 8: "Did the eval set change between runs?"

Subtle version of question 1. Even if the eval set version is the same, sometimes individual cases get edited ("updated this case's expected output to be more lenient"). Ask specifically: "did we change any case between runs, including labels and expected outputs?" If yes, the comparison is invalid and the team has to re-score the old model/prompt against the updated cases.
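One way to answer this without relying on memory is to fingerprint every case and diff the fingerprints between runs. A sketch, assuming each case can be represented as a JSON-serialisable dict; the field names (`input`, `expected`) and helper names are hypothetical.

```python
import hashlib
import json

def fingerprint(case):
    """Stable hash of a case's full content (input, expected output, labels)."""
    payload = json.dumps(case, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def diff_eval_sets(old_cases, new_cases):
    """old_cases/new_cases: {case_id: case_dict}.
    Returns (added_ids, removed_ids, edited_ids)."""
    added = sorted(new_cases.keys() - old_cases.keys())
    removed = sorted(old_cases.keys() - new_cases.keys())
    edited = sorted(
        cid for cid in old_cases.keys() & new_cases.keys()
        if fingerprint(old_cases[cid]) != fingerprint(new_cases[cid])
    )
    return added, removed, edited

# Hypothetical eval sets from last week and this week.
last_week = {"c1": {"input": "clause A", "expected": "indemnity"},
             "c2": {"input": "clause B", "expected": "termination"}}
this_week = {"c1": {"input": "clause A", "expected": "indemnity or liability"},
             "c2": {"input": "clause B", "expected": "termination"},
             "c3": {"input": "clause C", "expected": "exclusion"}}

added, removed, edited = diff_eval_sets(last_week, this_week)
```

Any non-empty `edited` list means the old model or prompt has to be re-scored against the updated cases before the two pass rates can be compared.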

Question 9: "What does the production pass rate look like?"

The eval set is a proxy. Production is the reality. If the team has observability (Course 2 B5.3), there should be a sampled production pass rate running alongside the eval pass rate. If they align within 3-5 points, your eval is calibrated. If they diverge by 10+ points, your eval set is not representative of production and the eval pass rate is a comforting lie. Ask for both.

This is the question that catches "eval looks great, users complain." If your team can't produce a production pass rate, they're flying blind and you should put "set up production sampling" on the roadmap immediately.
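The comparison itself is trivial once both numbers exist. A sketch that encodes the thresholds above (within 5 points is calibrated, 10 or more is not representative); the "investigate" label for the range in between is my own assumption, as is the function name.

```python
def eval_production_gap(eval_pass_rate, production_pass_rate):
    """Both rates in percentage points. Returns (gap, verdict)."""
    gap = eval_pass_rate - production_pass_rate
    size = abs(gap)
    if size <= 5:
        verdict = "calibrated"
    elif size >= 10:
        verdict = "not representative"
    else:
        # Assumption: the text gives no rule for 5-10 points, so flag it for a look.
        verdict = "investigate"
    return gap, verdict

gap, verdict = eval_production_gap(89, 81)
```

An eval pass rate of 89% against a production sample at 81% yields an 8-point gap and lands in the middle band: not proof the eval set is wrong, but enough to ask whether it is representative.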

Question 10: "Did the model, the prompt, the retrieval, or the data change?"

A single pass-rate delta attributed to "prompt improvements" may actually be the combined effect of a prompt change, a retrieval tweak, a model version update, and a silent data refresh. If three things changed between runs, you cannot know which one moved the number. Insist on one change per eval run. That is the single highest-leverage engineering discipline for AI features, and it's surprisingly rare.
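If every eval run records the version of each component, checking the one-change rule becomes mechanical. A sketch, assuming the team keeps such a record; the component names and version strings below are invented.

```python
def changed_components(previous_run, current_run):
    """Each run is a dict {component: version}. Returns the components that differ."""
    components = sorted(previous_run.keys() | current_run.keys())
    return [c for c in components if previous_run.get(c) != current_run.get(c)]

# Hypothetical run records for two consecutive weeks.
last_week = {"model": "model-a", "prompt": "v4",
             "retrieval": "r2", "data": "snapshot-12"}
this_week = {"model": "model-a", "prompt": "v5",
             "retrieval": "r3", "data": "snapshot-12"}

changed = changed_components(last_week, this_week)
attributable = len(changed) == 1  # exactly one change means the delta has one cause
```

Here both the prompt and the retrieval moved, so `attributable` is `False` and the delta cannot be credited to either change alone.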

"We made several improvements this week and the number went up" is not a useful claim. "We changed X; nothing else moved; the number went from Y to Z; here's the before/after on the relevant failure stratum" is a useful claim. Push for the second shape.


Three fool-yourself patterns

The ten questions catch almost all the accidental and intentional ways quality reports mislead. Three specific patterns deserve their own names because I have seen them repeatedly and they are seductive enough to slip past even skeptical PMs.

Pattern 1: the eval-set inflation trick

The team adds easy cases to the eval set because "we found more common scenarios." The pass rate goes up because the new cases are things the model already handles well. The team reports the improvement as if it were prompt progress. It isn't; it's pass-rate inflation from easier cases being added to the set.

How to catch it: always ask "how many cases did you add since last run, and what was the pass rate on just the NEW cases versus just the OLD cases?" If the new cases are pulling the average up, the "old-cases pass rate" will be flat or worse, and the improvement is illusory.
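The old-versus-new split is straightforward if the team kept the list of case IDs from the previous run. A sketch under that assumption; `split_pass_rates` and the case IDs are hypothetical.

```python
def pass_rate(outcomes):
    """Fraction of True values, or None for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else None

def split_pass_rates(current_results, previous_case_ids):
    """current_results: {case_id: passed} from this run.
    previous_case_ids: set of case ids that were in last run's eval set."""
    old = [ok for cid, ok in current_results.items() if cid in previous_case_ids]
    new = [ok for cid, ok in current_results.items() if cid not in previous_case_ids]
    return {"old": pass_rate(old), "new": pass_rate(new),
            "overall": pass_rate(old + new)}

# Hypothetical run: 10 carried-over cases (8 pass) plus 5 newly added cases (all pass).
previous_ids = {f"old-{i}" for i in range(10)}
results = {f"old-{i}": i < 8 for i in range(10)}
results.update({f"new-{i}": True for i in range(5)})

rates = split_pass_rates(results, previous_ids)
```

In this example the overall rate is about 87%, but the carried-over cases sit at 80% and the new ones at 100%. The headline moved because the set got easier, not because the feature got better.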

Who does this: usually not maliciously. Teams under pressure to show progress add cases they know will pass, often without realising they're biasing the set.

Pattern 2: the quiet rubric softening

The rubric changed subtly between runs. "Borderline" now counts as "pass" for cases where "the summary is slightly wordy but captures the point." This is a judgment call someone made, probably informally, and it shifts the line between pass and not-pass. Two percentage points of "improvement" are now free: no actual work on the model or prompt, just a softer rubric.

How to catch it: ask "did we change the rubric or the pass/fail definition between runs?" If the answer is "no" but the team has been discussing "what counts as pass" recently, go look at the scoring. Run a few old cases through the new rubric and see if the categorisation changed. If it did, the pass-rate comparison is invalid.
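Running old cases through both rubric versions can be sketched as below. This assumes each case already has a three-way grade ("pass", "borderline", "fail") and models the softening described above as "borderline now counts as pass"; both rubric functions and the grades are hypothetical stand-ins for whatever the team's grader actually does.

```python
def old_rubric(grade):
    """Stricter definition: only a clean pass counts."""
    return grade == "pass"

def new_rubric(grade):
    """Softened definition: borderline is now treated as a pass."""
    return grade in {"pass", "borderline"}

def rubric_flips(graded_cases, before, after):
    """graded_cases: {case_id: grade}. Returns ids whose pass/fail verdict
    differs between the two rubric versions."""
    return sorted(cid for cid, grade in graded_cases.items()
                  if before(grade) != after(grade))

# Hypothetical grades for four old cases.
graded = {"c1": "pass", "c2": "borderline", "c3": "fail", "c4": "borderline"}
flipped = rubric_flips(graded, old_rubric, new_rubric)
```

Two of the four cases change verdict without the model output changing at all, so any pass-rate delta that spans the rubric change includes free points.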

Who does this: often engineering, often informally, without bad intent. The team is trying to move forward and the rubric feels like a blocker.

Pattern 3: the single-number dashboard

The team built a dashboard that shows one number: the eval pass rate. Over time, the number goes up and to the right. Leadership is pleased. Nobody is looking at the failure breakdown, the production sample, the stratum table, or the variance. The dashboard is hiding quality debt in the aggregated number.

How to catch it: demand breakdowns. Every dashboard should show: overall pass rate, pass rate by stratum, top failure modes, production sample pass rate, variance across runs, and the delta from the previous run with its confidence interval. If the dashboard shows only the aggregate, it's marketing, not measurement.

Who does this: the team building the dashboard, who wants to keep it simple. The PM's job is to insist on the breakdowns.


The one-page quality report format

Instead of accepting whatever format the team brings, use this template for every ship-review quality report. Paste it into your project template, require it for every AI feature, and never accept a report missing a section.

# [Feature] Quality Report - [date]

## Headline
- Pass rate: [X]% on [N] cases (eval set v[V]).
- Delta from last run: [+/-]Y points (previous: [date]).
- Confidence interval: ±[Z] points (at N=[N]).
- [Signal / noise / unclear]: is this delta larger than CI?

## What changed since last run
- Model: [same / new version]
- Prompt: [same / v4 → v5, diff linked]
- Retrieval: [same / changed]
- Eval set: [same / N cases added]
- Rubric: [same / changed]

## Breakdown by stratum
| Stratum | Pass rate | Delta | Notes |
|---|---|---|---|
| Common happy path | [%] | [±] | |
| Edge cases | [%] | [±] | |
| Ambiguous | [%] | [±] | |
| Adversarial | [%] | [±] | |
| Critical | [%] | [±] | zero-tolerance: [clean / BLOCK] |

## Top 3 failure modes
1. [name] - [% of failures] - [example input+output]
2. [name] - [% of failures] - [example]
3. [name] - [% of failures] - [example]

## Production sample
- Production pass rate (last 7 days, [M]-case sample): [%]
- Eval-to-production gap: [X] points, [within range / flag]

## Ship status
- [ ] Overall pass rate ≥ target
- [ ] All failure-mode ceilings met
- [ ] Zero critical failures
- [ ] Production sample aligned with eval set
- [ ] Top failure modes have specific next-step work

One page, six sections, every one forcing a specific answer. A team that can't fill in every section doesn't have enough measurement to ship. A team that fills in every section is already catching the fool-yourself patterns above, because the template doesn't give the patterns any place to hide.

Implementing this format takes one hour of tooling: a template in your ship-review doc, a one-line explanation on how to fill it, and a standing ask for it on every AI-feature review. After two weeks it becomes the default and nobody questions the extra sections; they're just "how we ship AI features here." The overhead pays for itself the first time it catches a ship-blocking issue that would have slipped past a one-number report.


A worked example: catching a false improvement

Let me walk through a realistic scenario where the skeptic kit catches a problem that would have shipped otherwise. Names are fictional; patterns are real.

The setup: Priya is the PM for a contract review feature. Rafi, the eng lead, brings a quality report to Friday's ship review. Headline: "Pass rate went from 84% to 89% this week." The room is ready to ship.

Priya asks the 10 questions:

  1. Which eval set version? v6. Last week's run was on v6 too. ✓
  2. Which grader version? Rule-based v2 + LLM-as-judge v3. Same as last week. ✓
  3. Breakdown by stratum? Common: 91% (was 88%). Edge: 86% (was 85%). Ambiguous: 72% (was 78%). Adversarial: 100% (was 100%). Critical: 100% (was 100%). Ambiguous went DOWN 6 points.
  4. Top three failure modes? "Wrong clause type" 45%, "missing effective date" 25%, "misinterprets exclusion" 20%. Wait: "misinterprets exclusion" is a new failure mode that wasn't in last week's top three.
  5. How many cases? 120. ✓
  6. How many runs averaged? One. 🚩 "Rafi, can we do 3 runs and get the variance?"
  7. Confidence interval? At 120 cases and an 85% pass rate, one standard error is roughly ±3.3 points, so the 95% interval is about ±6.4 points. The 5-point overall delta is ~1.5 standard errors: marginal, not strong.
  8. Did any cases change? Rafi: "We updated the expected output for 4 ambiguous cases because the last run showed them as unfair." 🚩🚩 Four cases in the ambiguous stratum were relabelled between runs. That may be why the ambiguous pass rate went down: if the relabelled cases are stricter now, last week's 78% and this week's 72% were measured against different expected outputs and are not directly comparable.
  9. Production pass rate? "Last 7 days, 200-case sample: 81%." 8 points below eval. 🚩 "That's a meaningful gap. Is the eval representative?"
  10. What changed since last run? "New prompt plus re-labelled eval cases." Two things changed. Cannot attribute the improvement.

Verdict: the 5-point "improvement" is mostly noise at this eval-set size, made worse by the fact that two things changed at once (prompt + eval-case labels), compounded by an 8-point gap to production that isn't closing. The failure mode breakdown reveals a new issue ("misinterprets exclusion") that the team hadn't been tracking. The ambiguous stratum went down.

Priya's call: don't ship. Specific next steps: run the new prompt 3 times to get variance; run the old prompt against the re-labelled cases to get a clean comparison; add "misinterprets exclusion" to the tracked failure modes; investigate why the eval-production gap is 8 points. Two weeks of work before a ship decision, but now it's real work with a specific target instead of "iterate on quality."

Without the skeptic kit, the ship review would have seen 84% → 89%, declared victory, and launched a feature that was actually worse than the previous version on the stratum that matters most and doesn't match production anyway. This is the exact pattern that produces "we shipped and users complained" incidents.

Total time cost of the skeptic kit in this meeting: ~8 minutes. Total value: ~3 weeks of incident response avoided and the team caught a real regression before it shipped. This is why the ten questions exist.


The failure mode: "we trust our team"

One specific pattern deserves its own naming, because it's the one that prevents PMs from asking these questions in the first place: the "we trust our team" mindset. The PM thinks that demanding the ten questions is a sign of distrust in engineering, so they accept the number without asking, and feel good about the relationship.

The ten questions are not about distrust. They are about process rigor on a class of reporting that is unusually easy to get wrong. The team is doing their best; the numbers are still vulnerable to noise, inflation, softening, and single-number dashboards. Even a perfectly honest team produces misleading numbers if they don't have the process in place. Asking the questions isn't a signal that you doubt the team's integrity; it's a signal that you know how eval reports work and you want the team to give you a report that survives scrutiny.

The right framing to use out loud, if it helps: "I want to make sure the numbers we bring to leadership survive the questions leadership would ask. Let's get ahead of that together." This reframes the ten questions as preparing for review rather than reviewing the team, and teams usually engage with it enthusiastically once they realise the goal is shared.

The teams that adopt the skeptic kit once it's explained this way stop needing the PM to ask; they start bringing the answers unprompted, because they want to ship clean reports. That cultural shift is the real deliverable of this post, and it is worth the ~1 awkward meeting where you first ask a team "can you break this down by stratum?" and they don't have the answer.


What just changed in your roadmap

  • Ask the ten questions in every AI quality report review. All ten. Takes eight minutes. Catches a shipping incident at least once a quarter.
  • Adopt the one-page quality report template. Put it in your ship-review doc template. Require every AI feature to fill it in. Two weeks and it becomes the default.
  • Never accept a single-number dashboard. Always demand breakdowns: by stratum, by failure mode, by production sample, by variance.
  • Insist on one change per eval run. Single-variable attribution is the most important engineering discipline on AI features.
  • Watch for the three fool-yourself patterns: eval-set inflation, quiet rubric softening, single-number dashboard. Name them when you see them.
  • Compute the confidence interval in your head. For binary pass/fail, ±2√(p(1-p)/N) is enough math; you can do it in a meeting.
  • Always ask for the production sample. If the team can't produce one, they're flying blind and production monitoring moves to the top of the roadmap.
  • Frame the questions as preparation for leadership review, not distrust of the team. The cultural shift is the real win.

And that closes Module P3, Evaluation and Quality. You now have the four-part quality bar, the PM-owned eval set, the matched grading stack, and the skeptic kit for reading the numbers your team brings you. You can walk into a ship review with an AI feature's quality story and either green-light it with confidence or block it with a specific list of missing measurements. No more vibes-based ship decisions.

Next up is Module P4: Trust, Risk, and Rollout. User trust as a UX problem, the compliance checklist every AI product now needs, rolling out without losing the room, and handling your first public failure. Four posts on the things that happen between "the eval passes" and "the user trusts the feature enough to use it in production." See you there.



Cover photo via Unsplash. This post is part of the AI for Product series.
