Golden Sets, LLM-as-Judge, Human Review: Which Grading Style to Reach For
Four ways to score an AI output. Each wins on a different problem. Teams that pick the wrong one waste weeks on infrastructure that does not match the quality signal they actually need.
You have an eval set. You know your rubric. You want a pass rate. Now a new question appears: who decides whether each individual output is a pass? You, reading them one by one? A rule engine that checks for specific strings? Another model that grades the output? A human-reviewer panel? The answer affects speed, cost, accuracy, and how often you can actually run the eval. Pick the wrong one and you spend a month building scoring infrastructure that produces numbers nobody trusts.
This post is the map. There are four common grading styles in 2026 (exact match, rule-based, LLM-as-judge, and human review), and each one wins on a different shape of problem. Most production features use a stack of two or three of them, not just one. The skill is knowing which to reach for when, where each one lies to you, and how to stack them without redundancy. I'll go through each style with its cost, its accuracy, its failure modes, and the specific features where it's the right primary choice.
This is a PM-focused version of the same topic Course 2 B2.5 covers for engineers. If you read that post, a lot here will feel familiar; the difference is what you do with it as a PM who is not writing the grading code. You will still be the person deciding which style to trust, and the decision is harder than it looks.
The four styles, in one sentence each
- Exact match. The output's text has to equal an expected string, or fall into a literal enum of expected strings. Fastest, cheapest, least flexible.
- Rule-based. The output is graded by a set of rules (contains-phrase, length-limit, schema-valid, regex-match, forbidden-word). Still fast and cheap, more flexible than exact match, still limited to what you can express as a rule.
- LLM-as-judge. Another model is given the input, the expected behaviour, and the actual output, and asked to score whether the output is acceptable. Slower, more expensive, much more flexible, and itself a probabilistic system that has to be calibrated.
- Human review. A person reads the input and the output and scores it. Slowest, most expensive, most accurate. The gold standard that everything else is calibrated against.
Every grading system is some stack of these. A sensible stack is usually: exact match + rule-based for everything you can measure cheaply, LLM-as-judge for the things rules can't capture, and human review on a rolling sample to keep the cheaper graders honest. What that stack looks like depends on the feature.
Four paths, one decision per eval case. Most features end up using the "stack" pattern for the cases where multiple signals matter and the simpler styles for the cheap wins. Let me take each in detail.
Exact match: the free and boring win
Exact match wins on features where the correct output is a finite set of strings or a single canonical form. Classification into one of N categories. Boolean yes/no. Enum values. Structured fields like urgency = "high". If the model should return one of a fixed list and you can enumerate the list, exact match is the right grader and it is free.
When it wins cleanly:
- Routing / classification / labelling into a closed taxonomy.
- Boolean outputs.
- Structured-output schemas where a field has a finite set of legal values.
- Any case where "the output is right" has a unique answer you can type out in advance.
The specific 2026 tip: pair exact match with schema-enforced structured output (Course 2 B1.3). The model's output is constrained to a valid enum by the schema; the grader checks whether it matches the expected enum. Together these produce a grading pipeline that is cheap, fast, and 100% reliable about "did the model pick the right class."
When exact match silently lies to you:
- Case sensitivity mismatches. "Billing" vs "billing": your grader says fail, users don't care.
- Whitespace and punctuation drift. "Yes." vs "Yes" vs "yes ": same answer, three fails.
- Minor legitimate variation. "2024-01-15" vs "Jan 15, 2024": same date, two fails.
The defences are obvious (normalise casing, strip whitespace, compare parsed dates), but every normalisation is a piece of logic that could be wrong. Keep it simple; the moment your "exact match" has 20 lines of pre-processing, you've invented a rule-based grader and should commit to that.
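For readers who want to see how small the normalisation layer can stay, here is a minimal sketch in Python. The function names `normalise` and `exact_match` are illustrative, not from any library, and date parsing is deliberately left out because it is the point where the logic starts to resemble a rule-based grader.

```python
import string


def normalise(text: str) -> str:
    """Trim whitespace, lowercase, and drop trailing punctuation."""
    return text.strip().lower().rstrip(string.punctuation).strip()


def exact_match(output: str, expected: set[str]) -> bool:
    """Pass when the normalised output equals one of the expected labels."""
    allowed = {normalise(label) for label in expected}
    return normalise(output) in allowed


# The three "Yes" variants and the casing mismatch from the list above all pass.
for candidate in ["Yes.", "Yes", "yes "]:
    print(repr(candidate), exact_match(candidate, {"yes"}))
print(exact_match("Billing", {"billing", "shipping"}))
```

Three lines of normalisation is about the ceiling; anything longer is a hint that the check belongs in the rule-based layer.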
Rule-based: the workhorse
Most of your scoring will be rule-based. A rule-based grader runs a set of checks against the output: contains specific phrases, absent specific phrases, length within bounds, JSON schema valid, regex match, numeric value within a range. Each rule is a function from output to pass/fail. The case passes if all its rules pass.
When it wins cleanly:
- Structured extractions with multiple fields, each checkable independently.
- Outputs with format requirements (character limits, required sections, forbidden content).
- Any case where "correct" is a conjunction of measurable properties.
- Most "the output must contain / must not contain X" cases.
Concrete 2026 examples where rule-based grading is the right primary:
- Support ticket summariser. Rules: length < 200 chars, contains at least one noun from a list of product terms, absent "I'm so sorry to hear", absent any dollar amount not in the input.
- Code suggestion feature. Rules: syntactically valid code, no placeholder variables (foo, bar, TODO), function signature matches the request.
- Email drafter. Rules: starts with a greeting, ends with a closing phrase, length 30-200 words, no prohibited words from the house-style forbidden list.
- Classification with confidence. Rules: class in expected enum (exact match), confidence > 0.5, reason field populated.
The specific 2026 tip: most of what you think you need LLM-as-judge for can be done with rules if you think carefully about the rubric from P3.1. A lot of "this needs a sophisticated grader" is actually "I haven't broken the rubric down into measurable properties yet." Spend an hour trying to express each rubric criterion as 1-3 rules before reaching for LLM-as-judge; you'll usually find 60-80% of them can become rules. That's 60-80% of your grading running at rule-based cost instead of LLM-as-judge cost.
When rule-based silently lies:
- Rules that are too loose. "Contains the word invoice" passes a summary that is about invoices in general rather than the specific invoice in question.
- Rules that are too tight. "Matches exactly this regex" fails on legitimate paraphrases.
- Missing rules. The rubric says "be polite"; you have no rule for politeness; the grader silently ignores it.
- Rules that capture format but not meaning. The output has all the right structural pieces but says something wrong; rule-based grading says pass.
The sign that rule-based is no longer enough: you keep adding special-case rules to catch specific failures, and the rule list grows past 10-12 per case. At that point the complexity of maintaining rules has exceeded the complexity of calling a judge model, and you should graduate to the stacked approach.
LLM-as-judge: powerful and dangerous
LLM-as-judge gives you a grader that can handle subjective criteria: "is this summary faithful to the source?", "does this response match the house tone?", "does this code solve the problem even if the approach is different from the reference answer?" These are things no rule engine can reliably measure, and they're also where most of the hardest AI-feature quality questions live.
The shape: you write a prompt for a judge model that includes the input, the reference (or expected properties), the actual output, and instructions for how to score it. The judge returns a verdict: pass / fail, or a numeric score, or a rubric-aligned category. Structured output (Course 2 B1.3) makes the judge's output reliable to parse.
A minimum-viable judge prompt, for a summariser feature:
You are a strict grader for a customer support summary feature.
Grade the summary as PASS, BORDERLINE, or BAD based on these rules:
- PASS: 1-2 sentences, captures the customer's specific problem,
no added facts, no apologies.
- BORDERLINE: slight tone or formatting issue but meaning is correct.
- BAD: misses the problem, adds facts not in the source, or
includes pleasantries.
Source ticket:
---
{input}
---
Candidate summary:
---
{output}
---
Reply with exactly one word: PASS, BORDERLINE, or BAD.
The judge runs over the whole eval set in minutes, produces a distribution, and gives you a number. It costs roughly one LLM call per eval case; at mid-tier model prices (Claude Haiku, GPT-5-mini, Gemini Flash in 2026), that's about $0.001-$0.005 per graded case. A 100-case eval set costs roughly $0.10-$0.50 to grade. Cheap enough to run on every commit.
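If it helps to see how the prompt above turns into a grader, here is a minimal Python sketch. The model call is passed in as a plain function (`call_model`, a hypothetical stand-in for whatever client your team uses), so nothing here assumes a specific vendor API. The cost helper just reproduces the arithmetic from the paragraph above.

```python
from typing import Callable

JUDGE_TEMPLATE = """You are a strict grader for a customer support summary feature.
Grade the summary as PASS, BORDERLINE, or BAD based on these rules:
- PASS: 1-2 sentences, captures the customer's specific problem,
  no added facts, no apologies.
- BORDERLINE: slight tone or formatting issue but meaning is correct.
- BAD: misses the problem, adds facts not in the source, or
  includes pleasantries.
Source ticket:
---
{input}
---
Candidate summary:
---
{output}
---
Reply with exactly one word: PASS, BORDERLINE, or BAD."""

VERDICTS = ("PASS", "BORDERLINE", "BAD")


def build_judge_prompt(ticket: str, summary: str) -> str:
    return JUDGE_TEMPLATE.format(input=ticket, output=summary)


def parse_verdict(reply: str) -> str:
    """Accept the one-word reply; anything else is flagged, not guessed."""
    word = reply.strip().strip(".").upper()
    return word if word in VERDICTS else "UNPARSEABLE"


def judge(ticket: str, summary: str, call_model: Callable[[str], str]) -> str:
    # call_model is an assumption: any function that sends a prompt, returns text.
    return parse_verdict(call_model(build_judge_prompt(ticket, summary)))


def grading_cost(cases: int, cost_per_case: float) -> float:
    """One judge call per case, so cost scales linearly with set size."""
    return cases * cost_per_case


# Usage with a fake model, so the sketch runs without network access.
print(judge("Charged twice.", "Customer was double charged.", lambda prompt: "pass."))
print(round(grading_cost(100, 0.001), 2), round(grading_cost(100, 0.005), 2))
```

Mapping unexpected replies to UNPARSEABLE instead of a default verdict keeps a misbehaving judge visible in the results.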
When LLM-as-judge wins cleanly:
- Subjective criteria that rules can't capture (tone, faithfulness, coherence).
- Features where you have many cases and the judge can scale where a human couldn't.
- Cases where you want a per-case justification alongside the score (the judge can explain why it picked BAD, which is valuable for the team).
- Initial quality signals in the first weeks of a project when you're iterating fast and don't want to spend humans on every run.
The specific 2026 tip: use a cheaper model for the judge than you use for the feature itself. Your feature runs on Claude Sonnet 4.6 or GPT-5; your judge runs on Haiku or GPT-5-mini. This saves cost without losing much accuracy, because grading is usually easier than generation and the cheap models do it well.
Why LLM-as-judge lies to you, and how badly:
This is the failure mode section that deserves the most attention in this post, because LLM-as-judge is the grading style where teams think they're measuring quality and are actually measuring something subtly different.
- The judge has its own preferences. Judge models tend to prefer longer outputs over shorter ones, more formal tone over casual, hedged language over confident. If your feature is supposed to be confident and direct, the judge may downgrade it for being "too blunt" even though the rubric says direct is good. The judge's priors leak into the grade.
- The judge can't verify facts. "Is this summary faithful to the source?" sounds like something a model can check, but in practice the judge model may itself misread the source and declare the summary wrong when it's right, or right when it's wrong. On factual-precision tasks, judge accuracy often lags human accuracy by 10-15 points.
- The judge misses subtle errors. An output that sounds coherent and relevant gets a pass even when it contradicts the source. The judge is as vulnerable to "plausible but wrong" output as humans are, sometimes more so because it reads quickly.
- The judge drifts with model updates. You calibrate the judge against human labels in week 1. In week 8, the judge model silently updates under you, and now it scores 4 points higher on the same cases. The feature looks better without changing. You ship a false improvement.
- The judge is yours to prompt, which means you can bias it. You write the judge prompt; you emphasise the criteria you care about; the judge returns what you asked for, not what's true. This is subtle and common.
The single most important rule for LLM-as-judge: calibrate against human labels before you trust it. Label 30-50 cases by hand first. Run the judge on the same cases. Compare. If the judge agrees with your human labels 85%+ of the time, it's usable. If it's below 80%, tune the judge prompt or fall back to rule-based. If the distribution of agreement is lopsided (judge agrees on passes but disagrees on fails, or vice versa), the judge is biased in a specific direction and the pass rate it produces is wrong.
Calibration is not a one-time thing. Re-validate monthly, or whenever the judge model updates. If you skip this step, the judge is producing numbers that feel real and aren't. This is how the "production is 78% but the dashboard says 91%" situation happens.
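The calibration check is simple enough to run from a spreadsheet export. As a sketch, assuming two parallel lists of labels for the same cases (the function names are illustrative), the Python below computes overall agreement and agreement per human label, which is where lopsided bias shows up. The 85% and 80% thresholds come from the rule above; the post gives no guidance for the band in between, so treating it as a grey zone is my assumption.

```python
from collections import Counter


def calibration_report(human: list[str], judge: list[str]) -> dict:
    """Compare judge verdicts against human labels for the same cases."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    totals = Counter(human)
    hits = Counter(h for h, j in pairs if h == j)
    return {
        "overall": sum(h == j for h, j in pairs) / len(pairs),
        # Agreement broken down by what the human said.
        "per_label": {label: hits[label] / totals[label] for label in totals},
    }


def recommendation(overall: float) -> str:
    if overall >= 0.85:
        return "usable"
    if overall < 0.80:
        return "tune the judge prompt or fall back to rule-based"
    return "grey zone: inspect disagreements before trusting"  # assumption


human = ["PASS"] * 6 + ["BAD"] * 4
judge = ["PASS"] * 6 + ["PASS", "PASS", "BAD", "BAD"]
report = calibration_report(human, judge)
print(report, recommendation(report["overall"]))
```

In this made-up sample the judge agrees on every human PASS but only half of the human BADs, which is the lopsided pattern described above: the judge is too lenient, so the pass rate it reports is inflated.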
Human review: the ground truth
Human review is the slowest, most expensive, most accurate grading style, and the one everything else is calibrated against. You cannot eliminate human review; it has to stay in the loop somewhere, even if only on a rolling sample, or the automated graders drift unchecked. But human review also cannot carry the primary load for most features, because it doesn't scale.
When it wins cleanly:
- The rubric has subjective criteria the judge can't be trusted on.
- The stakes per case are high enough to warrant the time.
- You're calibrating another grading system (this is the most important use).
- You're diagnosing a new failure mode before you know how to automate it.
- The feature is early and the eval set is small (under 30 cases: run humans on the whole set).
Concrete human-review setups that work in 2026:
- Full manual scoring, small set. 20-40 cases, one scorer, 60-90 minutes per run. The PM does it. Works for early-stage projects, rubric development, rapid iteration.
- Two-scorer blind grading. Two people score independently, compare, resolve disagreements with a third. Use when you need defensible scores for a ship decision or to calibrate LLM-as-judge.
- Rolling sample audit. 10-20 cases per week from recent production traffic, manually scored, compared against the dashboard's automated score. Use to catch drift in LLM-as-judge.
- Stakeholder review. A domain expert (lawyer, doctor, support lead) grades a small set. Use when you need sign-off for a shipping criterion that only they can validate.
The specific 2026 tip: build human review into the workflow before it's urgent. If the first time you run human review is during an incident, it's going to be slow, inconsistent, and politically fraught. Set up a weekly 30-minute review ritual from week 1, with a spreadsheet and a clear rubric, even when the automated grading is working. The ritual keeps the human judgment in the loop and makes it cheap to invoke when you need it.
Why human review also lies to you:
- Single-scorer bias. One person's interpretation of the rubric is not the same as another's. Use two scorers on anything that matters.
- Fatigue drift. Hour 1 of a scoring session: careful, consistent, 90% agreement with a second scorer. Hour 3: sloppy, inconsistent, 70% agreement. Cap sessions at 90 minutes, take breaks.
- Rubric drift. Same as P3.1: the scorer's mental rubric evolves as they see more examples, even when the written rubric hasn't changed.
- Politeness bias. Internal scorers are nicer to outputs than external users would be, because they want the feature to succeed.
None of these make human review useless; they make it require a process. Two scorers, short sessions, fixed rubric per cycle, occasional external validation. The process is the difference between human review that holds the other graders accountable and human review that's just more noise.
Stacking: how a real production feature grades itself
Here's what a mature grading stack looks like for a realistic 2026 feature: the customer-support summariser I've been using as the running example.
Eval run, per case:
- Exact match checks the structured metadata fields (category in enum, priority in enum, customer_id matches). Grading time: nanoseconds. Cost: zero. These fields are either right or wrong.
- Rule-based checks the summary text: length < 200 chars, no dollar amounts not in the input, absent forbidden phrases (I'm so sorry, I completely understand), contains at least one product term. Grading time: milliseconds. Cost: near zero.
- LLM-as-judge checks subjective criteria the rules can't capture: is the summary faithful to the source, is the tone matter-of-fact, does the summary capture the customer's specific issue. Grading time: ~1-2 seconds per case. Cost: ~$0.003 per case. Calibrated against human labels monthly.
- Human review runs on a rolling 10-case weekly sample: the PM reads 10 recent production outputs and scores them against the rubric. Compares to the automated scores. Flags any case where automated and human disagree. Time: 30 minutes/week. Cost: PM time.
The overall grade for a case is: pass exact-match AND pass rule-based AND pass LLM-judge. Any step failing blocks that case from passing. The human review doesn't block individual cases but acts as an ongoing audit of the whole system.
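The AND logic is short enough to sketch. In the Python below, each grader is a function from a case to pass/fail, and the three graders shown are hypothetical stand-ins (the judge one just reads a pre-recorded verdict). The sketch runs every grader on every case so the per-grader record is complete; a real pipeline might skip the judge call when a cheaper layer has already failed.

```python
from dataclasses import dataclass, field
from typing import Callable

Grader = Callable[[dict], bool]


@dataclass
class CaseResult:
    case_id: str
    verdicts: dict[str, bool] = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        # Any layer failing blocks the case from passing.
        return bool(self.verdicts) and all(self.verdicts.values())


def grade_case(case_id: str, case: dict, graders: dict[str, Grader]) -> CaseResult:
    result = CaseResult(case_id)
    for name, grader in graders.items():
        result.verdicts[name] = grader(case)  # keep every layer's verdict
    return result


def pass_rate(results: list[CaseResult]) -> float:
    return sum(r.passed for r in results) / len(results)


# Hypothetical stand-in graders for the summariser example.
GRADERS: dict[str, Grader] = {
    "exact_match": lambda c: c["category"] == c["expected_category"],
    "rule_based": lambda c: len(c["summary"]) < 200,
    "llm_judge": lambda c: c["judge_verdict"] == "PASS",
}

cases = {
    "case-1": {"category": "billing", "expected_category": "billing",
               "summary": "Customer was double charged.", "judge_verdict": "PASS"},
    "case-2": {"category": "billing", "expected_category": "billing",
               "summary": "Customer has an issue.", "judge_verdict": "BAD"},
}
results = [grade_case(cid, case, GRADERS) for cid, case in cases.items()]
for r in results:
    print(r.case_id, r.verdicts, r.passed)
print(pass_rate(results))
```

Keeping the full verdict dictionary per case is what lets the weekly human audit point at the layer that got it wrong, not just at the final grade.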
Cost per run:
- 100-case eval set, running the full stack: ~$0.30-$0.50 of LLM calls for the judge, 30 minutes of PM time for the weekly sample. Sustainable indefinitely.
What this stack catches that any single style would miss:
- Exact match alone would pass a case where the structured fields are right and the summary is garbage.
- Rule-based alone would pass a case where the format is right and the meaning is wrong.
- LLM-as-judge alone would pass a case where the judge misreads the source.
- Human review alone doesn't scale to a daily-running eval set.
The stack is the answer. Not because "more is better" but because each layer catches a different class of failure the others miss, and the total cost is still low enough to run on every PR.
A concrete decision: which style for which feature
Let me run the decision for three realistic 2026 features to show the flow.
Feature 1: ticket router. Classifies incoming tickets into one of 12 categories.
- Primary grader: exact match, because the output is a closed enum.
- Supplement: LLM-as-judge on "confidence", because "the model is confident when it shouldn't be" is a real failure mode that rules don't catch.
- Audit: human review on 20 cases/week, to catch drift.
- No need for rule-based or extensive human review.
Feature 2: code review assistant. Reads a PR diff and produces suggested comments.
- Primary grader: rule-based, because you can check "output has valid markdown format, references specific line numbers, contains at least one suggestion, absent off-topic content."
- Supplement: LLM-as-judge on "is the suggestion useful", because "useful" is a subjective criterion rules can't capture.
- Audit: human review on 15 cases/week, specifically by a senior engineer to ground what "useful" means.
- Exact match doesn't apply here: there's no single canonical correct review comment.
Feature 3: legal clause extractor. Extracts specific clauses from contracts into structured data.
- Primary grader: exact match on the structured fields and rule-based on the extracted text. The fields have expected values for the eval cases.
- Supplement: human review, not LLM-as-judge, because legal accuracy is high-stakes and LLM-as-judge on legal content is notoriously unreliable.
- Two-scorer human review required on any shipping criterion.
- No LLM-as-judge: the risk of a subtly wrong grade is higher than the cost of doing human review manually.
Three features, three different stacks, each matched to the shape of the problem. There is no universal answer. The skill is in matching.
The failure mode: "let's just use an eval platform"
One specific failure mode deserves its own call-out. A team, feeling overwhelmed by the grading-style decision, adopts an eval platform (LangSmith, Braintrust, Helicone evals, a homegrown tool) and delegates the decision to the platform's defaults. The platform is usually built around LLM-as-judge as the default grader because that's the most impressive thing to demo. The team accepts LLM-as-judge for their whole eval set without calibrating it, and the platform produces a pass rate that looks authoritative.
Two months later, an incident shows the production pass rate is meaningfully different from the platform's number, and the team has spent two months shipping on a grader they never validated.
The defence: adopt platforms for tooling, not for judgment. Use them to run evals, store history, visualise runs, and version rubrics. Do not use them to decide which grading style to use for which case. That decision is still yours, and the platform's default is usually wrong for your specific feature mix. Calibrate everything the platform's grader produces against human labels before trusting it. The platform is a UI, not an authority.
And the simpler version: do not adopt a platform at all until your spreadsheet-based grading has proven the shape of the problem. Most teams spend a week setting up a platform, building integrations, learning the UI, and then realise their spreadsheet approach was working fine and the platform is just adding complexity. Pay the platform cost when you have evidence you need it, not before.
What just changed in your roadmap
- Pick the right grading style per feature, not per team. A team that uses "LLM-as-judge for everything" or "human review for everything" is leaving quality signal on the table.
- Use exact match and rule-based for everything you can cheaply measure. Most of your rubric items can become rules if you think carefully.
- Use LLM-as-judge for subjective criteria rules can't capture, but calibrate against human labels before trusting it. Recalibrate monthly.
- Always keep human review in the loop, even if only on a rolling sample. It's the only way to catch drift in the automated graders.
- Cap human review sessions at 90 minutes and use two scorers on anything that matters.
- Adopt eval platforms for tooling and versioning, not for judgment. The decision about which grading style to use stays with the PM.
- Log every grader's verdict per case, not just the final pass/fail. If the LLM-as-judge says pass but the rule-based grader says fail, you want both recorded so you can audit the stack later.
- Start with a spreadsheet, not a platform. Graduate to a platform only when the spreadsheet can't keep up, usually past 200 eval cases and multiple concurrent prompt versions.
Next post, P3.4, closes Module P3 with the skill that matters most in a ship-review meeting: how to read the eval numbers your team brings you without getting fooled. Overall pass rate vs category breakdowns, significance vs noise, the three questions to ask about any quality report, and the specific ways teams dress up numbers to make progress look better than it is.
AI for Product · Course Home: 20 posts, five modules.
Cover photo via Unsplash. This post is part of the AI for Product series.