AI Product Metrics That Actually Tell You If It Is Working
Your engineering dashboard says 88 percent eval pass rate and everyone is celebrating. Your product is quietly dying. Here is the three-tier metrics stack that catches the problem six weeks before the bottom line does.
An engineering team I worked with shipped an AI feature that looked great on every dashboard they cared about. Eval pass rate: 88%. Latency P99: 2.3 seconds. Cost per call: well under budget. Deployment was clean. The team felt good and moved on. Twelve weeks later, the PM ran an audit because the product's gross retention had quietly dropped 3 points over the same window. The feature was correlated with the drop. Users who tried it once were 30% less likely to renew than users who didn't. The team had been shipping a retention-negative feature for three months while congratulating itself on its eval numbers.
This is the specific metrics gap I want to close in this post. Engineering measures what's cheap to measure and technically interesting: eval pass rate, latency, cost, uptime. Product measures what's supposed to matter: adoption, engagement, satisfaction. Business measures what actually matters: retention, expansion, revenue per user. The three layers almost never move together, and a team that watches only one of them is the team that silently ships a bad feature and doesn't find out for a quarter.
This post is the three-tier metrics stack for AI products in 2026. Layer 1: quality metrics (the engineering dashboard, necessary but insufficient). Layer 2: adoption and engagement metrics (what PMs should watch and usually don't watch closely enough). Layer 3: outcome metrics (what the business actually cares about, and how to connect the AI feature to them without lying with statistics). For each layer I'll name the specific metrics, the ones teams measure by default, the ones they should, and the leading indicators that warn you six weeks in advance when a feature is about to quietly die.
This is the most load-bearing measurement post in this course. If you get this right, you catch problems before they hit your P&L. If you get it wrong, you ship the feature that kills your retention and blame the next quarter's dashboard.
Why three tiers, not one
Start with why the layers exist as separate things. It is tempting to just track "does the feature work," pick one number, and report it. Doesn't work. Each layer answers a different question with a different latency:
- Quality answers: does the feature technically perform well? Latency: real-time.
- Adoption and engagement answers: do users try it and come back? Latency: days to weeks.
- Outcome answers: does the feature change what the user ultimately does? Do they complete more tasks, renew at a higher rate, expand their usage? Latency: weeks to months.
A single metric at any one layer can move in the wrong direction while the others look fine. The engineering team sees quality green. The PM sees adoption OK. The business sees retention slipping. The three signals together tell the truth; any single one lies.
Three tiers, three latencies, one set of decisions. The PM's job is to watch all three and synthesise them into "what to do next", a job nobody else in the org is structured to do. Engineering watches Tier 1. Growth and analytics teams watch Tier 3. The PM is the only person whose dashboard has all three on it.
Let me take each tier in detail: the metrics that belong there, the specific leading indicators to watch, and the traps in each layer.
Tier 1: quality metrics (necessary, insufficient)
Tier 1 is the engineering dashboard from Course 2's B5.3 and Module P3 of this course. If you built the eval set and the observability right, you already have these. The important thing for a PM is to know which ones must be there and which ones must not be the only thing you look at.
The tier-1 metrics that must exist:
- Eval pass rate over time. The number from Module P3, run against a locked eval set version. Trend line over weeks, with clear version markers when the set changes.
- Failure mode breakdown. Pass rate by category from the rubric. Each failure mode has its own trend line.
- Latency: P50 and P99 for both time to first token (TTFT) and total response time, tracked separately. TTFT is the user-visible number and the one that matters for experience; total time is for backend budgets.
- Cost per call and cost per active user per day/month. The COGS line from P5.1, tracked weekly.
- Error rate and stop-reason distribution. Non-2xx from the provider, timeouts, max_tokens truncations, schema-validation failures. Alerts on spikes (the two alerts from B5.3: error rate and cost spike).
- Production sample pass rate vs eval set pass rate. The gap from P3.4: if the two diverge by >5 points, your eval set is drifting from reality.
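The eval-versus-production gap check is simple enough to automate. Here is a minimal sketch, assuming you already have pass/fail counts from a locked eval run and from a graded production sample. The `PassRate` type and function names are hypothetical, and the counts are made up for illustration; only the 5-point threshold comes from the list above.

```python
from dataclasses import dataclass

# From the text: a gap of more than 5 points suggests the eval set is drifting.
DRIFT_THRESHOLD_POINTS = 5.0


@dataclass
class PassRate:
    """Pass/fail counts from one grading run (hypothetical container)."""
    passed: int
    total: int

    @property
    def percent(self) -> float:
        if self.total == 0:
            raise ValueError("cannot compute a pass rate over zero graded samples")
        return 100.0 * self.passed / self.total


def eval_production_gap(eval_set: PassRate, production: PassRate) -> float:
    """Eval pass rate minus production-sample pass rate, in percentage points."""
    return eval_set.percent - production.percent


def is_drifting(eval_set: PassRate, production: PassRate,
                threshold: float = DRIFT_THRESHOLD_POINTS) -> bool:
    """True when the two rates diverge by more than the threshold, in either direction."""
    return abs(eval_production_gap(eval_set, production)) > threshold


# Illustrative numbers only.
eval_run = PassRate(passed=176, total=200)      # 88.0%
prod_sample = PassRate(passed=405, total=500)   # 81.0%
gap = eval_production_gap(eval_run, prod_sample)
print(f"gap: {gap:+.1f} points, drifting: {is_drifting(eval_run, prod_sample)}")
```

The check uses the absolute gap on purpose: production beating the eval set by a wide margin is also a sign that the two are measuring different things.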
Tier-1 metrics that look important but usually aren't:
- Specific-model attribution ("our Sonnet 4.6 pass rate went from 85% to 87%"). Useful to engineering, not useful to product decisions unless the model change is itself a product event.
- Token consumption stats in isolation. Cost matters; tokens-as-tokens are a means to it. Report cost.
- Cache hit rate (from Course 2 B5.2) unless you're optimising cost specifically. High hit rates are good but they don't tell you about feature quality directly.
The specific tier-1 trap: eval pass rate up and to the right. A rising eval pass rate can mean: (a) the feature is getting genuinely better, (b) the eval set is drifting easier (the rubric-drift problem from P3.1), (c) the grader is being softened (the fool-yourself patterns from P3.4), or (d) eval-set inflation (adding easy cases). From the number alone you cannot tell which. Every Tier 1 report has to show the rising number plus the breakdowns that would catch each of the fool-yourself patterns.
The other specific trap: declaring victory at a pass-rate milestone without looking at Tier 2 or Tier 3. "We hit 90% eval" is not a business result. It is a precondition for a business result. Celebrate quietly in the engineering channel; defer judgement until the other tiers come in.
Tier 2: adoption and engagement metrics (the layer PMs neglect)
Tier 2 is where most AI feature decisions should be made, and where most PMs under-invest. Engineering can't build this dashboard for you: the metrics are product-shaped, not model-shaped, and they require the PM to define what "good" looks like for this specific feature's user journey.
The tier-2 metrics that must exist:
- Try rate. Of users who were exposed to the feature (saw the button, were in the Stage 4 opt-in, were in the default-on population), what fraction tried it at least once? Segment by user cohort.
- Return rate after first try. Of users who tried it once, what fraction came back within 7 days? This is the single most predictive Tier 2 metric for AI features. A return rate below 40% is a warning; below 30% is a crisis.
- Daily active user rate of the feature. Of active product users who could use the feature, what fraction do on a given day?
- Sessions per user per week. Intensity metric. Users who use the feature once a month are not the same as users who use it once a day, and the distinction matters for pricing and positioning.
- Edit rate / rejection rate / retry rate. The trust UX signals from P4.1. How often do users edit the model's output? How often do they reject it? How often do they retry with modifications? These tell you whether users trust the feature or are fighting it.
- Time to first value. From first exposure to first successful interaction. Often measured in seconds for simple features, minutes for complex ones. If it's more than 3x what you'd expect, your onboarding is broken.
- Feature-specific success rate. Task-level: did the user complete what they started? Different from eval pass rate: eval is about output quality; success rate is about user journey completion.
- Deactivation / turn-off rate. For opt-in features, how often do users enable and then disable? A high number means the feature failed on first use.
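To make the first two definitions concrete, here is a rough sketch of try rate and 7-day return rate computed from a usage log. It assumes one entry per user listing the dates on which they actually invoked the feature; the data shape, function names, and sample users are all hypothetical. It also assumes day-level granularity, so a second use on the same day as the first does not count as a return.

```python
from datetime import date, timedelta


def try_rate(exposed: set[str], uses: dict[str, list[date]]) -> float:
    """Fraction of exposed users with at least one real invocation."""
    if not exposed:
        raise ValueError("no exposed users: the denominator is empty")
    tried = {user for user in exposed if uses.get(user)}
    return len(tried) / len(exposed)


def return_rate(uses: dict[str, list[date]], window_days: int = 7) -> float:
    """Of users who tried the feature, the fraction who used it again on a
    later day within `window_days` of their first use."""
    by_user = {user: sorted(days) for user, days in uses.items() if days}
    if not by_user:
        raise ValueError("nobody has tried the feature yet")
    window = timedelta(days=window_days)
    returned = sum(
        1
        for days in by_user.values()
        if any(days[0] < later <= days[0] + window for later in days[1:])
    )
    return returned / len(by_user)


# Illustrative data only.
exposed = {"u1", "u2", "u3", "u4", "u5"}
uses = {
    "u1": [date(2026, 3, 2), date(2026, 3, 5)],    # came back on day 3
    "u2": [date(2026, 3, 3)],                      # tried once, never returned
    "u3": [date(2026, 3, 1), date(2026, 3, 20)],   # came back, but too late
    "u4": [date(2026, 3, 4), date(2026, 3, 11)],   # came back on day 7 exactly
}
print(f"try rate: {try_rate(exposed, uses):.0%}, "
      f"7-day return rate: {return_rate(uses):.0%}")
# try rate: 80%, 7-day return rate: 50%
```

One thing this sketch does not handle: users whose first try was fewer than seven days ago have not had a full window yet, so a real report would exclude them from the denominator.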
The leading indicators that predict trouble 4-6 weeks early:
- Try rate flat or falling on a newly launched feature means your discovery or onboarding is broken. Even if everything else is green, users aren't finding the feature.
- Return rate below 40% means users are trying once and bouncing. You have a trust UX problem (P4.1) or a quality problem Tier 1 didn't catch. Investigate immediately.
- Edit rate climbing over time on a feature that should be getting better is a regression signal: the model is producing worse output, users are compensating, and the eval set isn't catching it. Double-check the grader.
- Rejection rate spiking on a specific input type means a new failure mode has appeared in production that your eval set doesn't cover. Add the case immediately.
- Deactivation rate above 15% means users are enabling and finding the feature actively worse than the baseline. Drop what you're doing and diagnose.
- Time to first value climbing over days means your onboarding or the feature's affordance has drifted. Check recent UX changes.
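Several of these indicators reduce to a threshold or a week-over-week comparison, which makes them easy to check mechanically. The sketch below covers four of them (try rate, return rate, edit rate, deactivation rate) using the thresholds quoted in this post. The `Tier2Week` type and the sample numbers are hypothetical, and a single week-over-week comparison is noisy, so treat a flag as a prompt to look closer, not as a verdict.

```python
from dataclasses import dataclass


@dataclass
class Tier2Week:
    """One week of Tier 2 metrics, each expressed as a fraction between 0 and 1."""
    try_rate: float
    return_rate_7d: float
    edit_rate: float
    deactivation_rate: float


def leading_indicator_flags(prev: Tier2Week, curr: Tier2Week) -> list[str]:
    """Return the leading indicators that fired, comparing this week to last."""
    flags = []
    if curr.try_rate <= prev.try_rate:
        flags.append("try rate flat or falling")
    if curr.return_rate_7d < 0.30:
        flags.append("return rate below 30% (crisis)")
    elif curr.return_rate_7d < 0.40:
        flags.append("return rate below 40% (warning)")
    if curr.edit_rate > prev.edit_rate:
        flags.append("edit rate climbing")
    if curr.deactivation_rate > 0.15:
        flags.append("deactivation rate above 15%")
    return flags


# Illustrative numbers only.
last_week = Tier2Week(try_rate=0.38, return_rate_7d=0.44,
                      edit_rate=0.41, deactivation_rate=0.09)
this_week = Tier2Week(try_rate=0.38, return_rate_7d=0.36,
                      edit_rate=0.47, deactivation_rate=0.17)
for flag in leading_indicator_flags(last_week, this_week):
    print("FLAG:", flag)
```

Rejection spikes by input type and time to first value are left out because they need per-segment data; the same pattern extends to them.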
The tier-2 trap: reporting adoption as "total users of the feature" without cohorting by exposure. Your feature is used by 3,000 users a month and growing! Great, but 3,000 out of what population? If you have 100,000 active product users, 3,000 is 3% adoption and probably a problem. Always report as a percentage of the addressable user population, not as an absolute count. Absolute counts hide the denominator, which is usually where the bad news lives.
The other specific trap: treating "feature enabled" as adoption. A user who turned on the feature in settings and never used it is not adopting. You want to count actual usage, not permission state. This distinction has killed more than one "great adoption numbers" retrospective that turned out to be almost entirely toggles with no invocations.
Tier 3: outcome metrics (the one that actually matters)
Tier 3 is where the feature connects to the business result. This is also the hardest layer to measure because outcomes lag by weeks or months, multiple other things affect them, and attribution is messy. A PM who skips Tier 3 ships features that look great at Tiers 1 and 2 and don't change what the business cares about; a PM who only looks at Tier 3 is making decisions with 8-week data latency and can't course-correct in time.
The tier-3 metrics that must exist for any AI feature that's supposed to drive business value:
- Retention of users who adopted the feature vs users who didn't. Same plan tier, same vintage, compared over 30/60/90 days. Users who use your AI feature should retain at a meaningfully higher rate than those who don't; if they don't, the feature isn't worth its cost.
- Plan upgrade / expansion rate. If you priced the feature behind a higher plan (P5.1), are users upgrading? What's the conversion rate from "hits the feature cap" to "upgrades"?
- Revenue per user, segmented by AI-feature usage. Are AI-feature users worth more on average? By how much? Is the gap growing or shrinking?
- Gross churn attributable to the feature. Did any customers churn and mention the AI feature in their exit interview? Did any customers churn specifically because the feature produced something bad? This number is usually small in absolute terms and disproportionately important.
- Task-level business outcomes. If the feature is a support-ticket assistant, is average handle time dropping? If it's a code assistant, is PR cycle time dropping? If it's a meeting summariser, is meeting follow-up completion rate rising? The specific outcome from the P1.1 "AI feature vs AI product" framing should be measurable here.
The leading indicators in Tier 3 that predict trouble:
- AI-feature users retain at the same rate as non-users, or worse. You thought the feature added value; the data says it doesn't. Critical signal.
- Conversion from feature-cap-hit to upgrade stays below 8%. Users hit your limit and leave rather than paying more. Either the price is wrong or the value is wrong.
- Task-level metric (handle time, cycle time, etc.) is not moving even though adoption is up. The feature is getting used but it isn't changing the job. Either users are using it wrong or it isn't actually helpful.
- Customer exit interviews mention AI features negatively, even at low frequency. One mention is random; three is a pattern; five is a signal to fix or remove something.
The tier-3 trap: attribution theatre. A growth team declares that "the AI feature drove a 12% retention lift." Sounds impressive. Look at the methodology: users who chose to try the feature are systematically different from users who didn't: they might be more engaged, on larger accounts, in specific verticals. The 12% lift is confounded. The real answer is probably 2-4% after controlling for obvious selection effects, and 0-1% after controlling for hidden ones.
The defence: wherever possible, compare randomly assigned populations (A/B rollout from P4.3) rather than self-selected ones. If you can't randomise (Stage 5 is default-on for everyone), use quasi-experimental methods: difference-in-differences across user segments, synthetic-control comparisons with similar users from earlier vintages, etc. These are harder; they're also more honest. A PM reporting Tier 3 metrics must know which comparisons they're making and what the confounds are. "AI-feature users retain better" without context is Tier 3 theatre.
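To see how much a visible confound can inflate a lift, here is a small sketch that compares a naive adopter-versus-non-adopter retention gap with one computed inside matched strata (same plan, same vintage) and then averaged, weighted by adopter count. This is plain exact-match stratification, the simplest control there is, not difference-in-differences or synthetic control. The record layout, function names, and all the numbers are hypothetical.

```python
from collections import defaultdict

# Each record: (plan, vintage, adopted_feature, retained_at_90d).
Record = tuple[str, str, bool, bool]


def retention(rows: list[Record]) -> float:
    """Fraction of rows where the user was retained."""
    return sum(1 for row in rows if row[3]) / len(rows)


def naive_lift(records: list[Record]) -> float:
    """Adopter minus non-adopter retention, in points, with no controls."""
    adopters = [r for r in records if r[2]]
    others = [r for r in records if not r[2]]
    return 100 * (retention(adopters) - retention(others))


def stratified_lift(records: list[Record]) -> float:
    """Lift computed within each (plan, vintage) stratum, then averaged
    with weights proportional to the number of adopters in the stratum."""
    strata = defaultdict(lambda: {True: [], False: []})
    for r in records:
        strata[(r[0], r[1])][r[2]].append(r)
    weighted_sum, weight = 0.0, 0
    for groups in strata.values():
        adopters, others = groups[True], groups[False]
        if not adopters or not others:
            continue  # no like-for-like comparison possible in this stratum
        weighted_sum += len(adopters) * (retention(adopters) - retention(others))
        weight += len(adopters)
    if weight == 0:
        raise ValueError("no stratum contains both adopters and non-adopters")
    return 100 * weighted_sum / weight


def cohort(plan: str, vintage: str, adopted: bool, n: int, kept: int) -> list[Record]:
    """Build n synthetic users, of whom `kept` were retained."""
    return [(plan, vintage, adopted, i < kept) for i in range(n)]


# Illustrative data: enterprise users adopt more AND retain more regardless.
records = (
    cohort("enterprise", "2025Q1", True, 80, 72)     # 90% retained
    + cohort("enterprise", "2025Q1", False, 20, 17)  # 85% retained
    + cohort("starter", "2025Q1", True, 20, 12)      # 60% retained
    + cohort("starter", "2025Q1", False, 80, 48)     # 60% retained
)
print(f"naive lift: {naive_lift(records):.1f} points")
print(f"stratified lift: {stratified_lift(records):.1f} points")
```

In this made-up data the naive gap is about 19 points and the stratified one is about 4, because plan mix is doing most of the work. Stratifying only removes the confounds you thought to stratify on; hidden ones like underlying engagement still need randomisation or a stronger quasi-experimental design.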
The other specific trap: ignoring Tier 3 because it lags. A team decides to "wait until the data is in" and then skips running the analysis when the data arrives, because Tiers 1 and 2 look fine and there's no pressure. Six months later, someone else runs the analysis and finds the feature doesn't actually move retention. The team had the data all along. Tier 3 metrics must be on a calendar, not triggered by crisis.
The three-tier dashboard: what to actually build
If you're a PM on an AI feature and you're trying to build a working measurement practice, here is the specific dashboard to produce. One page, three sections.
# [Feature] Metrics Dashboard: Week of [date]
## Tier 1: Quality
- Eval pass rate (locked set v[N]): [X]%; delta vs last week: [±]
- Failure mode breakdown: top 3 with percentages
- Production sample pass rate: [Y]%; gap to eval: [±]
- P50 TTFT: [ms]; P99 TTFT: [ms]
- Cost per active user per month: $[X]
- Error rate: [%]; alerts: [count]
## Tier 2: Adoption and engagement
- Feature try rate (% of addressable users): [X]%; cohort: [desc]
- Return rate after first try (7d): [X]%
- Daily active users of feature / addressable: [X]%
- Edit rate: [X]%; rejection rate: [X]%
- Deactivation rate: [X]%
- Time to first value (median): [seconds or minutes]
## Tier 3: Outcome
- Retention delta (adopters vs non-adopters, controlled): [±X points]
- Plan upgrade rate from feature caps: [X]%
- Revenue per user delta (AI users vs non-AI): $[X]
- Task-level business outcome: [metric and delta]
- Exit interview mentions: [count, sentiment]
## Leading indicators to watch
- [Specific thing that's moving in a concerning direction]
- [Another]
- [Another]
## Decisions this week
- [What changes based on these numbers]
The dashboard is one page. Every section has a specific claim with a number. The "Leading indicators" and "Decisions" sections are the most important: they're where the PM does the synthesis work that no other role on the team is positioned to do. Without these, the dashboard is a status report; with them, it's a decision-making tool.
The cadence: weekly for Tier 1 and Tier 2, monthly for Tier 3 (because outcome data lags). Review the dashboard with engineering, design, and at least one business stakeholder (growth, finance, or CSM) once a week. The cross-functional review is where the synthesis happens and where the "Decisions this week" section gets filled in with specific actions.
A worked 2026 example: the feature that nearly died
Let me ground all this in a specific scenario I've watched play out, anonymised and simplified.
The setup: a B2B SaaS product shipped an AI meeting-summariser feature. It's running on Claude Sonnet 4.6. Eval pass rate is 91%. Latency is fine. Cost is within budget. Product is celebrating. Adoption in the first week after launch is 34% of addressable users, which is respectable.
Week 2: try rate growing slowly (+3% week-over-week). Return rate after first try: 52%. Deactivation rate: 7%. All within expected ranges. Tier 1 still green.
Week 4: try rate stalled at 41%, no longer growing. Return rate dropped to 46%. Edit rate climbing to 58%. Deactivation rate at 11%. Leading indicators firing. The PM digs in.
Investigation finding: the feature is producing summaries that look fine to the eval set (which is based on curated meeting transcripts) but are being heavily edited by real users because they don't capture the specific decisions made in the meeting. The eval set was measuring quality of summary-as-literature, not summary-as-decision-trace. The rubric from P3.1 was missing a category.
Week 5 intervention: PM adds 30 new eval cases focused on decision capture. Runs them against current prompt: 68% pass. Updates the system prompt to explicitly require a "decisions made" section. Re-evals: 89%. Ships the update.
Week 8: try rate back to growing (+5% week-over-week). Return rate at 63%. Edit rate dropping to 34%. Deactivation at 4%. Tier 2 recovering.
Week 12: Tier 3 data comes in. Adopters are now retaining at a 6-point lift vs non-adopters (controlled for plan and vintage). Plan upgrade rate from feature caps: 14%. The feature is working.
Counterfactual: without the Tier 2 leading indicators at week 4, the team would have waited for Tier 3 data at week 12 to notice the problem, and by then the engagement damage would have set in. Real users would have had 8 more weeks to form a "this feature is annoying" habit. Tier 2 caught it at week 4, and the fix was in production by week 5. Eight weeks of damage averted by watching the right tier at the right cadence.
This is the specific ROI of a proper metrics stack. Tier 1 alone would have missed the problem (it was invisible in the eval pass rate). Tier 3 alone would have caught it too late. The three tiers together, with explicit leading indicators, caught it with time to act.
The failure modes across all three tiers
Three specific patterns kill AI feature measurement. Each deserves naming because they recur in real teams.
Failure 1: "green dashboard syndrome"
The engineering team builds a beautiful dashboard showing all the Tier 1 metrics in green. Weekly reviews celebrate. The team believes the feature is succeeding. Meanwhile, Tier 2 and Tier 3 go unmeasured because "the dashboard already says we're doing well." Six months later, Tier 3 data surfaces a retention problem. The team is shocked. They shouldn't be: the dashboard was lying by omission.
The defence: require the three-tier dashboard as the canonical review artifact, not the engineering dashboard. Tier 1 is an input, not the answer. If the review meeting agenda says "show me the quality metrics," the meeting is incomplete.
Failure 2: "vanity adoption"
The PM reports absolute numbers without denominators. "15,000 users tried the feature this month!" Sounds great. Out of what? 250,000 active product users means a 6% try rate, which is low. The absolute number was framed to impress; the percentage would have flagged a problem.
The defence: never report adoption numbers without the denominator and a comparison to your target. "15,000 of 250,000 (6%), target was 20%, so we're undershooting; here's why." Reporting the ratio and the target makes the number accountable.
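One cheap way to enforce this is to make the reporting helper refuse to produce a number without a denominator and a target. A tiny sketch follows; the function name and output wording are just an illustration.

```python
def adoption_line(users: int, addressable: int, target_rate: float) -> str:
    """Format an adoption number so it always carries its denominator and target."""
    if addressable <= 0:
        raise ValueError("addressable population must be positive")
    rate = users / addressable
    status = "on or above target" if rate >= target_rate else "undershooting"
    return (f"{users:,} of {addressable:,} ({rate:.0%}), "
            f"target was {target_rate:.0%}: {status}")


print(adoption_line(15_000, 250_000, 0.20))
# 15,000 of 250,000 (6%), target was 20%: undershooting
```

Because all three arguments are required, there is no way to call it with the absolute count alone.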
Failure 3: "correlation is causation theatre"
Growth or marketing reports that "AI-feature users have 12% higher retention" without controlling for anything. The claim is used to justify more investment. Six months later, someone runs the controlled analysis and finds the actual lift is 2 points. The original claim was confounded by self-selection: users who try the AI feature are more engaged generally, and the engagement is doing most of the lift.
The defence: every tier-3 claim comes with its methodology in one line. "AI-feature users retain 6 points higher than matched non-users (same plan, same vintage, same industry segment)." If the claim can't say "matched" or "randomised," it's a correlation, not an attribution, and you should describe it as such.
What just changed in your roadmap
- Build the three-tier dashboard for every AI feature. One page, three sections, weekly for Tiers 1 and 2, monthly for Tier 3. Cross-functional review weekly.
- Watch the leading indicators, not just the aggregates. Return rate after first try, edit rate, deactivation rate. These predict trouble 4-6 weeks before outcomes move.
- Always report adoption as a percentage with a denominator, not as an absolute count.
- Distinguish correlation from causation in Tier 3. Every retention claim names its methodology. If you can't run a randomised experiment, name the matching strategy.
- Run the analyses on a calendar, not on crisis. Tier 3 metrics should be reviewed on a monthly schedule whether or not you suspect a problem. The team that waits for a problem to trigger the analysis is the team that learns too late.
- Own Tier 2 as a PM. Engineering won't build it for you. It is the single highest-leverage layer in your measurement practice.
- Share the dashboard with cross-functional stakeholders. Growth, finance, CSM. Their perspectives catch things you won't see alone.
- When adoption stalls at Tier 2, check your eval set coverage before you check anything else. Missing eval categories are the most common cause of Tier 2 drops.
Next post, P5.4, is the final post of Course 3. A look at what the next year of AI product work probably looks like: the specific capability shifts that are on the near horizon for 2026-2027, the positioning and pricing questions that will reopen as capabilities stabilise, and a short, honest list of the habits and rhythms that keep a PM ahead of the field without burning out.
Cover photo via Unsplash. This post is part of the AI for Product series.