The One-Pager for an AI Feature

The first AI feature you spec with a traditional PRD will fail in a specific way. You'll write a tidy document: problem, solution, user story, acceptance criteria. Engineering will read it, nod, start building, and within two weeks send back a list of questions the PRD didn't answer. What does "good enough" mean for a probabilistic output? Which model? What's the fallback when retrieval misses? What do we log? What's the eval set? What happens at p99? Who owns quality after launch? Your PRD had no place for any of these. You end up having the conversation in Slack, piecemeal, over three weeks, and the spec turns into tribal knowledge distributed across a dozen messages. The launch slips. The retro blames "missing requirements."

The fix is not more process. It's a different PRD shape — one that carries the things AI features actually need into a single reviewable page. This post is that shape. Twelve sections, one page, designed to force the specific decisions AI products require and to leave no room for the "we didn't think about that" gap that kills timelines.

This is the closing post of Module P2 (Discovery and Scoping). P2.1 taught you to find shape-matched problems. P2.2 mapped the demo-to-prod gap. P2.3 showed how to prototype before engineering touches it. This post is where all that discovery condenses into something you hand to a team. If you have been climbing the ladder, you will walk into your next spec review with three pages of ladder artifacts and this one page of PRD, and the conversation will be dramatically different from the one you had on your first AI feature.

Why traditional PRDs fail AI features

A traditional PRD works for deterministic products. Its implicit assumptions are:

The feature either works or it doesn't.
"Done" is a yes/no.
Bugs are reproducible.
Quality is implicit in "passing acceptance criteria."
The model is a given; you describe the feature around it.

Every one of these breaks on AI features. As established in Module P1, AI features have a non-determinism tax — quality is a distribution, not a binary; bugs are frequencies; done is a threshold; the model is the most important product choice and the PRD has to make it explicitly. The traditional shape has no place for any of this, which is why the same gap reopens on every AI project with the same symptoms.

The one-pager template

The one-pager below is designed around the specific questions AI features force you to answer. It is not longer than a traditional PRD; it's the same length, redistributed. Every section earns its line.

Twelve sections, four clusters: what and why, model and economics, quality and ops, risk and trust. Each cluster answers a different stakeholder's real questions — user and design for the first, engineering and finance for the second, QA and SRE for the third, legal and support for the fourth. One page, four audiences, no surprises.

Here is the full shape. Read it, then we'll walk through each section in detail with concrete guidance.

# [Feature name]
Owner: [PM name]  ·  Eng lead: [name]  ·  Status: [discovery / committed / building / shipped]  ·  Target launch: [quarter]

## 1. Problem (2-3 sentences)
Who feels this problem, how often, and what do they do instead today.

## 2. Shape
Which of the three shapes this fits: messy-in/structured-out, blank-page, or search-corpus. If none, reject.

## 3. Capability band
Band 1 / 2 / 3. Cite the specific capability this depends on and the evidence for the band placement.

## 4. Build vs buy vs wrap
The posture, in one word, plus one sentence explaining why the decision matrix landed there.

## 5. User story
One concrete story of a named persona hitting the feature, using it, and getting value. Not a template.

## 6. Quality bar
Eval set size, target pass rate, regression ceiling, explicit "critical failures" that block ship.

## 7. Model and prompt plan
Which model tier (frontier / mid / small), temperature, max_tokens, and a link to the prompt version in code.

## 8. Unit economics
Measured cost per call and projected cost per active user per month at launch volume. Cite the math.

## 9. Latency budget
Target p50 and p99 time-to-first-token and total. Hard ceilings, not wishes.

## 10. Trust and fallback
What the UX does when the model is wrong, unsure, or unavailable. "I don't know" states count.

## 11. Observability and guardrails
What we log per call. What blocks a call. What alerts us. Link to the guardrail spec.

## 12. Gap budget
For each of the 5 demo-to-prod gaps, one line: status, owner, estimated work.

Twelve sections, one page. Every section forces a decision that would otherwise live as tribal knowledge. Each takes a paragraph to explain well. Let's walk through them.

Section 1: Problem (2-3 sentences)

Standard PM section. The only AI-specific note: name the real alternative. What does the user do today when this problem arises? They write the email themselves. They copy-paste between tools. They email a colleague. They skip the task entirely. The feature has to be better than the current alternative, and the alternative is often "the user is already coping adequately," which is a higher bar than "this problem exists."

A specific trap: "users don't have an AI assistant for this" is not a problem. The absence of an AI feature is never a problem statement. Describe the human-level pain, in human terms, before naming the AI as the solution.

Section 2: Shape

Pick one of the three shapes from P2.1: messy-input-to-structured-output, blank-page-first-draft, or search-across-corpus. If it doesn't fit any of the three, the one-pager ends here — reject and send the idea back to discovery.

This section is short — one word and one sentence — but it is load-bearing, because it tells engineering which tool category they should reach for. A Shape 1 feature needs schema enforcement. A Shape 2 feature needs edit-friendly UX. A Shape 3 feature needs RAG infrastructure. Saying the shape up front saves two weeks of "wait, is this a RAG thing or a classification thing?" back-and-forth.

Section 3: Capability band

From P1.4. Is this feature depending on Band 1 (reliably shipped), Band 2 (shipped but uneven), or Band 3 (on the horizon)? For Band 2 dependencies, which specific capability, and what is the evidence — benchmark data, public reliability history, your own eval results — that it's real?

Example of a good entry:

Band 2. Depends on citation grounding at production quality. Evidence: 89% pass rate on our Rung 3 eval set using Claude Sonnet 4.6 with reasoning mode, confirmed against Anthropic's cookbook example from March 2026. Six months of public demos working. Fallback if regression: paragraph-level citations via our retrieval layer.

Example of a bad entry:

We're using AI.

One sentence is not enough. Four to six sentences is right. This is where vapour commitments get caught.

Section 4: Build vs buy vs wrap

From P1.3. One word: wrap, buy, or build. One sentence explaining which rows of the five-row decision matrix drove the choice. If you don't know the matrix, re-read P1.3 before filling this section.

For 80-90% of AI features in 2026, the answer is wrap. That's not a failure of ambition — it's the matrix working as intended. Don't let politics override the matrix on this field.

Section 5: User story

One specific story. A named persona (even if anonymised), a real scenario, a sequence of actions, and an outcome. Not "a user opens the app and gets help with their task." Something like:

Aisha is a customer support lead at a mid-size fintech. On Wednesday morning, a ticket lands in her queue from a customer who has been charged twice. Aisha opens the ticket; the AI has pre-classified it as "billing dispute, high urgency," extracted the duplicate charge details into a structured panel, and drafted a one-paragraph response quoting the specific amounts. Aisha reads the classification, confirms it's correct, edits two words in the draft, and sends. The whole interaction takes 40 seconds instead of the usual 6 minutes. By the end of her shift, she has processed 30% more tickets than her previous average, with the same edit-rate as her usual replies.

That story does more work than a generic "users process tickets faster with AI" sentence. It names the before time (6 minutes), the after time (40 seconds), the action (edit two words), and the measurable outcome (30% more tickets). It gives engineering a target to build against and QA a scenario to test against. If you cannot write a story this specific, you do not know the feature well enough to spec it.

Section 6: Quality bar

This is the most important single section, and it is the one traditional PRDs don't have at all. It names the specific numeric criteria that tell engineering and leadership when the feature is ready to ship.

A good quality-bar section has four parts:

Eval set size and composition. "40 cases, sourced 50% from real anonymised tickets, 50% from customer interviews. 8 critical-priority cases where a wrong answer would create a compliance issue."
Target pass rate. "85% overall, no single failure mode above 10% of total failures."
Regression ceiling. "New prompt or model cannot drop below 83% on any individual category compared to the current version."
Critical failure definitions. "Any response that fabricates a charge amount, cites a non-existent account, or reveals another customer's data is a zero-tolerance critical failure. One occurrence in eval blocks ship."

Four bullets, one paragraph each. The team now has a concrete answer to "are we ready to ship?" that doesn't require a vibes meeting. We'll cover the specifics of building an eval set in Module P3; for now, the one-pager just has to name that these criteria will exist and give the rough outline.

The specific trap to avoid: writing "high quality" or "production-ready" in this section. Those are not criteria; they are vibes. If the entry doesn't have numbers, it isn't done.
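
A quality bar written as numbers can be checked by a script rather than a meeting. The sketch below is one minimal way to do that, assuming eval results arrive as simple records; `EvalResult`, `ship_blockers`, and the category names are hypothetical, and the thresholds are the ones from the example entries above. It covers the overall target, the per-category floor, and the zero-tolerance rule, and leaves the failure-mode breakdown out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    case_id: str
    category: str
    passed: bool
    critical_failure: bool = False  # e.g. fabricated amount, leaked customer data


def ship_blockers(results, target=0.85, category_floor=0.83):
    """Return the reasons the quality bar is not met; an empty list means ready."""
    blockers = []

    # Overall pass rate against the target.
    overall = sum(r.passed for r in results) / len(results)
    if overall < target:
        blockers.append(f"overall pass rate {overall:.1%} is below {target:.0%}")

    # Per-category floor (the regression ceiling in the example above).
    by_category = {}
    for r in results:
        by_category.setdefault(r.category, []).append(r.passed)
    for category, outcomes in sorted(by_category.items()):
        rate = sum(outcomes) / len(outcomes)
        if rate < category_floor:
            blockers.append(f"{category} pass rate {rate:.1%} is below {category_floor:.0%}")

    # Zero tolerance: a single critical failure blocks ship.
    critical = [r.case_id for r in results if r.critical_failure]
    if critical:
        blockers.append(f"critical failures in cases: {', '.join(critical)}")

    return blockers


# Illustrative data: 40 cases, 34 passing overall, but refunds lag behind.
results = (
    [EvalResult(f"billing-{i}", "billing", passed=True) for i in range(18)]
    + [EvalResult(f"billing-{i}", "billing", passed=False) for i in range(18, 20)]
    + [EvalResult(f"refund-{i}", "refund", passed=True) for i in range(16)]
    + [EvalResult(f"refund-{i}", "refund", passed=False) for i in range(16, 20)]
)
print(ship_blockers(results))  # ['refund pass rate 80.0% is below 83%']
```

Note what the example shows: the overall rate sits exactly at 85%, yet the feature is still blocked because one category is under the floor. That is the case a single headline number hides.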

Section 7: Model and prompt plan

Three to four lines. Which model tier will the feature use? Frontier (Claude Sonnet 4.6, GPT-5, Gemini 2.5 Pro) or mid (Haiku, GPT-5-mini, Flash) or a hybrid routed by a classifier? What temperature? What max_tokens budget? And a link to the prompt, versioned in code (per Course 2's B2.2).

Example:

Model: Claude Sonnet 4.6 for generation; Haiku for the classifier step. Temperature: 0 for classification, 0.3 for generation. max_tokens: 800 for generation, 50 for classifier. Prompt: prompts/support_ticket_drafter/v4.py (see PR #2341).

This section makes the cost model knowable (see Section 8), the latency predictable (Section 9), and the quality reproducible. It also prevents the common anti-pattern where engineering and the PM argue about "which model are we actually using" three weeks into the build.
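
If it helps engineering, the same plan can live next to the prompt as a small typed config, so the PRD and the code cannot drift apart silently. This is a sketch only: `StepPlan`, `ModelPlan`, and `worst_case_output_tokens` are hypothetical names, and the model strings are the tier names from the example above, not real API model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepPlan:
    model: str          # tier name as written in the PRD (assumption: not an API model ID)
    temperature: float
    max_tokens: int


@dataclass(frozen=True)
class ModelPlan:
    prompt_path: str    # versioned prompt file the PRD links to
    classifier: StepPlan
    generator: StepPlan


# Values copied from the example entry above.
SUPPORT_DRAFTER_PLAN = ModelPlan(
    prompt_path="prompts/support_ticket_drafter/v4.py",
    classifier=StepPlan(model="Haiku", temperature=0.0, max_tokens=50),
    generator=StepPlan(model="Claude Sonnet 4.6", temperature=0.3, max_tokens=800),
)


def worst_case_output_tokens(plan: ModelPlan) -> int:
    """Upper bound on output tokens per request; an input to Sections 8 and 9."""
    return plan.classifier.max_tokens + plan.generator.max_tokens


print(worst_case_output_tokens(SUPPORT_DRAFTER_PLAN))  # 850
```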

Section 8: Unit economics

Two numbers:

Cost per call at the average expected usage, using the model plan in Section 7, calculated from real token counts. Not an estimate — a measured cost from a Rung 3 prototype run.
Cost per active user per month at the expected launch volume.

Plus one sentence showing the math: "Avg 3,000 input + 400 output tokens on Claude Sonnet 4.6 = $0.021 per call. 12 calls/user/day, 25 days/month, 8,000 MAU = $50,400/month."
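
The arithmetic in that sentence is simple enough to keep as a script beside the PRD, so the number can be rerun when volume or pricing changes. The sketch below reproduces the monthly figure from the post's own inputs. The function names are hypothetical, and the per-million-token rates passed to `cost_per_call` are placeholders chosen so the sketch lands on the $0.021 figure, not any provider's price list; substitute the current rates for your model.

```python
from decimal import Decimal


def cost_per_call(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Cost of one call in USD, given prices per million tokens."""
    million = Decimal(1_000_000)
    return (Decimal(input_tokens) * Decimal(usd_per_m_input)
            + Decimal(output_tokens) * Decimal(usd_per_m_output)) / million


def monthly_cost(per_call, calls_per_user_per_day, days_per_month, users):
    """Total monthly spend in USD at the stated usage."""
    return Decimal(per_call) * calls_per_user_per_day * days_per_month * users


# PLACEHOLDER rates (assumption): replace with your provider's current prices.
per_call = cost_per_call(3000, 400, usd_per_m_input="5", usd_per_m_output="15")

total = monthly_cost(per_call, calls_per_user_per_day=12, days_per_month=25, users=8000)
per_user = monthly_cost(per_call, calls_per_user_per_day=12, days_per_month=25, users=1)

print(f"${per_call} per call, ${per_user:.2f} per user per month, ${total:,.0f} per month")
```

Prices are passed as strings and handled with `Decimal` so the totals do not pick up floating-point noise. The per-user figure ($6.30 a month on these inputs) is the second number the section asks for.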

The goal of this section is to make sure the feature's economics are understood before commit, not after launch. The PM should be able to defend the number in a pricing review. If the number is surprising — too high or too low — the feature needs re-scoping (smaller model, caching, restricted usage) before engineering commits time.

This is the section most PMs skip. Don't. It is the most expensive skip in the entire document.

Section 9: Latency budget

Three numbers, each with a target and a hard ceiling. The ceiling is a limit, not a goal:

P50 time-to-first-token (TTFT): target and ceiling.
P99 TTFT: target and ceiling.
P99 total time: target and ceiling.

Example:

P50 TTFT: 400ms target, 800ms ceiling. P99 TTFT: 1.5s target, 3.0s ceiling. P99 total: 6s target, 12s ceiling. Measured from the user's click, not the server's first byte.

Ceilings are what matter. If the feature regresses past a ceiling, it's a ship-blocking issue, not a nice-to-have. Without explicit ceilings, "it feels slow sometimes" becomes an argument instead of a measurement.
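
Turning "it feels slow sometimes" into a measurement takes very little code. The sketch below assumes you already collect per-request timings in milliseconds; `percentile` and `ceiling_breaches` are hypothetical helpers, the ceilings are the ones from the example entry above, and nearest-rank is only one of several accepted percentile definitions.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Ceilings from the example entry above, in milliseconds.
CEILINGS_MS = {"p50_ttft": 800, "p99_ttft": 3000, "p99_total": 12000}


def ceiling_breaches(ttft_ms, total_ms, ceilings=CEILINGS_MS):
    """Return {metric: measured_ms} for every metric that is over its ceiling."""
    measured = {
        "p50_ttft": percentile(ttft_ms, 50),
        "p99_ttft": percentile(ttft_ms, 99),
        "p99_total": percentile(total_ms, 99),
    }
    return {name: value for name, value in measured.items() if value > ceilings[name]}


# Illustrative samples: the median is healthy, but the tail is not.
ttft_ms = [400] * 98 + [3500, 4000]
total_ms = [5000] * 100
print(ceiling_breaches(ttft_ms, total_ms))  # {'p99_ttft': 3500}
```

The sample data is deliberately lopsided: the median looks fine, and the only breach is in the tail, which is why the budget names P99 and not just P50.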

Streaming is assumed. If the feature is not streaming, explain why. Skipping streaming is almost always a mistake.

Section 10: Trust and fallback

What does the UX show when:

The model is wrong (user catches it; how do they recover?)
The model is uncertain (confidence below threshold; show what?)
The model is unavailable (provider outage; do what?)
The user rejects the model's output (retry? edit? abandon?)

Four scenarios, one line each. Example:

Wrong: user can edit the draft inline; edit rate tracked. Uncertain: draft shows a yellow "review carefully" banner if confidence score is below 0.7. Unavailable: drafter button disabled with tooltip "Temporarily unavailable, sending manually"; no error page. Rejected: one-click "try again" and "write from scratch" buttons; rejection rate tracked.

This section is the first place trust lives. If you leave it blank, the feature will ship with default error messages and no failure-state design, and cold users will not trust it. We cover this in depth in P4.1; for now, the one-pager just has to have the four answers.
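
The "uncertain" and "unavailable" answers reduce to a small decision the front end makes on every request, which is a useful thing to show engineering alongside the prose. A minimal sketch, assuming the backend exposes an availability flag and a confidence score between 0 and 1; `DraftState` and `draft_state` are hypothetical names, and the 0.7 threshold comes from the example entry above. The "wrong" and "rejected" cases are user actions on a displayed draft, so they are not modelled here.

```python
from enum import Enum


class DraftState(Enum):
    SHOW_DRAFT = "show_draft"              # editable draft, no warning
    SHOW_WITH_BANNER = "review_carefully"  # yellow banner above the draft
    DISABLED = "temporarily_unavailable"   # button disabled with a tooltip, no error page


CONFIDENCE_THRESHOLD = 0.7  # from the example entry above


def draft_state(provider_available: bool, confidence: float) -> DraftState:
    """Pick the UX state for one request."""
    if not provider_available:
        return DraftState.DISABLED
    if confidence < CONFIDENCE_THRESHOLD:
        return DraftState.SHOW_WITH_BANNER
    return DraftState.SHOW_DRAFT


print(draft_state(True, 0.65).value)   # review_carefully
print(draft_state(False, 0.95).value)  # temporarily_unavailable
```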

Section 11: Observability and guardrails

From Course 2's B5.3 (observability) and B5.4 (guardrails). The one-pager doesn't have to reproduce the whole guardrail spec, but it has to reference it and call out the feature-specific additions.

A good entry is three lines:

Log per call: trace_id, prompt version, full prompt, full output, model, cost, latency, user_id, confidence score. PII redacted at ingest. Input guardrails: size cap 10KB, moderation classification, abuse detection per user. Output guardrails: schema validation on structured extraction, off-topic classifier, cite-or-refuse when retrieval confidence < 0.5. Link: guardrails/support_drafter.md.

This is enough for engineering to build the observability layer and the guardrail checks. The detail lives in the linked doc. The one-pager makes sure those docs exist and are referenced.
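
Two of the guardrails in that entry are simple enough to sketch, which can help a reviewer see what "blocks a call" means in practice. The code below is an illustration under assumptions, not the linked guardrail spec: the function names and refusal text are hypothetical, "10KB" is read as 10 × 1024 UTF-8 bytes, and an answer with no citations is treated as a refusal case.

```python
MAX_INPUT_BYTES = 10 * 1024      # size cap from the example entry (assumed to mean UTF-8 bytes)
MIN_RETRIEVAL_CONFIDENCE = 0.5   # cite-or-refuse threshold from the example entry
REFUSAL = "I couldn't find a reliable source for this."  # hypothetical wording


def input_allowed(text: str) -> bool:
    """Input guardrail: block anything over the size cap."""
    return len(text.encode("utf-8")) <= MAX_INPUT_BYTES


def cite_or_refuse(answer: str, citations: list, retrieval_confidence: float) -> str:
    """Output guardrail: return the answer only when retrieval backs it up."""
    if retrieval_confidence < MIN_RETRIEVAL_CONFIDENCE or not citations:
        return REFUSAL
    return answer


print(input_allowed("a" * 10_240))                        # True
print(input_allowed("a" * 10_241))                        # False
print(cite_or_refuse("Charged twice.", ["doc-17"], 0.4))  # prints the refusal text
```

Measuring the cap in bytes and not in characters matters for non-ASCII input: a ticket written in accented or non-Latin text can be well under 10,240 characters and still exceed the cap.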

Section 12: Gap budget

The five gaps from P2.2 — tail quality, unit economics, p99 latency, adversarial robustness, user trust — each get one line on the PRD.

Format:

Tail quality: status (measured / estimated / unknown), number, owner, estimated weeks to close.
Unit economics: see Section 8, confirmed as within budget? yes/no.
P99 latency: see Section 9, measured in load test? yes/no.
Adversarial robustness: red-team run? yes/no. Issues found and defence shipped?
User trust: cold-user test run? yes/no. Edit/rejection rates measured?

Example:

Tail: 76% on eval, target 85%, 3 weeks of prompt + retrieval work, owned by @engsi. Unit econ: confirmed at under $0.03/call, within budget. Latency: not yet load-tested, blocker. Adversarial: 2h red-team planned next week, owned by @security. Trust: 5-user cold test planned in dev week, owned by @pm.

This section is the single strongest early-warning system in the document. If three of the five gaps say "not measured," you're not ready to commit — the one-pager is still in discovery, not in build phase.
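
The three-of-five rule is mechanical, so it can be expressed as a check. A minimal sketch: `GAPS` lists the five gaps from P2.2, the function names are hypothetical, and the status values read the example entry above with "planned" and "not yet load-tested" both counted as unmeasured, which is an interpretation.

```python
GAPS = (
    "tail quality",
    "unit economics",
    "p99 latency",
    "adversarial robustness",
    "user trust",
)


def unmeasured_gaps(measured: dict) -> list:
    """Gaps with no measurement yet; a missing key counts as unmeasured."""
    return [gap for gap in GAPS if not measured.get(gap, False)]


def ready_to_commit(measured: dict, max_unmeasured: int = 2) -> bool:
    """False when three or more of the five gaps are still unmeasured."""
    return len(unmeasured_gaps(measured)) <= max_unmeasured


# Status as read from the example entry (assumption: planned = not yet measured).
status = {
    "tail quality": True,             # 76% on eval
    "unit economics": True,           # confirmed
    "p99 latency": False,             # not yet load-tested
    "adversarial robustness": False,  # red-team planned
    "user trust": False,              # cold test planned
}
print(unmeasured_gaps(status))
print(ready_to_commit(status))  # False
```

Read that way, the example entry has three unmeasured gaps and sits right on the line. Running the load test alone would flip the result.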

The one-pager in action

Here's what the section-by-section treatment looks like when compressed into an actual one-page PRD for a real-feeling feature. (Shortened for the post; in real life each section gets more detail.)

# Ticket Auto-Drafter v1
Owner: Maya (PM)  ·  Eng lead: Rafi  ·  Status: committed  ·  Target: Q3 2026

## 1. Problem
Support agents spend 4-7 minutes per ticket on triage + first draft. Avg queue latency is 18 min; CSAT drops sharply above 20 min. Agents currently cope by copying templates and editing, which misses ticket-specific context.

## 2. Shape
Hybrid: Shape 1 for extraction (ticket → structured fields), Shape 2 for draft generation (fields → reply draft).

## 3. Capability band
Band 1 for extraction (structured output is rock-solid). Band 2 for faithful draft generation without hallucinating details — mitigated by grounding the draft in the extracted fields only.

## 4. Build vs buy vs wrap
Wrap. All 5 rows of the matrix favour wrap: no proprietary data advantage, no need for custom moat, wrap-compatible at our volume.

## 5. User story
Aisha, shift lead, opens a ticket about a duplicate charge. AI has pre-extracted fields and drafted a reply. Aisha reads, corrects 2 words, sends. 40s vs previous 6min. (Anonymised from pilot user interviews, March 2026.)

## 6. Quality bar
Eval set: 60 real anonymised tickets, 12 critical-priority (compliance-sensitive). Target: 85% overall pass, 10% regression ceiling on critical. Zero-tolerance: fabricated amounts, wrong customer data.

## 7. Model and prompt plan
Claude Haiku 4.5 for classifier step; Sonnet 4.6 for draft generation. Temp 0 for both. Max_tokens 800 for draft. Prompt: `prompts/ticket_drafter/v3.py` (PR #2341).

## 8. Unit economics
Classifier: $0.0009/call (Haiku). Drafter: $0.022/call (Sonnet). Total ~$0.023/call. 12 drafts/agent/day × 80 agents × 22 days = $486/month. Within budget.

## 9. Latency budget
P50 TTFT: 400ms target, 700ms ceiling. P99 TTFT: 1.2s target, 2.5s ceiling. P99 total: 5s target, 10s ceiling. Load test before GA.

## 10. Trust and fallback
Wrong: inline edit, edit-rate tracked. Uncertain: yellow banner at confidence < 0.7. Unavailable: button disabled with tooltip. Rejected: one-click retry or write-from-scratch.

## 11. Observability and guardrails
Log: trace_id, prompt version, full I/O, cost, latency, user_id, confidence. PII redacted. Input: size cap, moderation, abuse detection. Output: schema valid, off-topic check, cite-or-refuse. Spec: `guardrails/ticket_drafter.md`.

## 12. Gap budget
Tail: 78% eval today, 3 weeks to 85% (Rafi). Unit econ: confirmed, within budget. Latency: load test scheduled week of May 5 (Rafi). Adversarial: 2h red-team May 8 (security). Trust: cold test with 6 users starting May 3 (Maya).

One page. Every section has a specific answer. Engineering knows what to build. QA knows what to test. Leadership can see when it ships and what it costs. Legal can review the guardrails. Finance can defend the economics. This is what a PRD for a 2026 AI feature looks like.

Using the one-pager to unblock a stuck project

One more practical use: diagnosing a stuck project by filling in the one-pager retroactively. If you inherit an AI project that's been in "we're figuring it out" for two months, have the team fill in this template. The sections where the team can't answer are the specific blockers.

Example diagnoses I've seen:

Team can't fill in Section 6 (Quality bar) → there's no eval set, which means nobody knows what "done" means, which is why they can't agree on whether to ship.
Team can't fill in Section 8 (Unit economics) → nobody has costed the feature at launch volume, which means leadership is about to get a surprise.
Team can't fill in Section 12 (Gap budget) → they haven't measured the tail, which means the ship date is fiction.

Once you see which sections are blank, the unblock is direct. You don't need a retro or a process change; you need to go measure the specific thing.
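
If the one-pager lives somewhere structured, even the diagnosis can be a short script. This sketch assumes the PRD is available as a mapping from section name to text; `blank_sections` is a hypothetical helper, and treating "TBD" as blank is an assumption.

```python
SECTIONS = [
    "Problem", "Shape", "Capability band", "Build vs buy vs wrap",
    "User story", "Quality bar", "Model and prompt plan", "Unit economics",
    "Latency budget", "Trust and fallback", "Observability and guardrails",
    "Gap budget",
]


def blank_sections(prd: dict) -> list:
    """List the sections that are missing, empty, or still marked TBD."""
    blanks = []
    for number, name in enumerate(SECTIONS, start=1):
        text = prd.get(name, "").strip()
        if text.lower() in ("", "tbd"):
            blanks.append(f"Section {number}: {name}")
    return blanks


# Illustrative stuck project: everything filled in except three sections.
prd = {name: "filled in" for name in SECTIONS}
prd["Quality bar"] = ""
prd["Unit economics"] = "TBD"
del prd["Gap budget"]

print(blank_sections(prd))
# ['Section 6: Quality bar', 'Section 8: Unit economics', 'Section 12: Gap budget']
```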

The failure mode: "that's too much for a one-pager"

The one specific failure mode: a team that reads this template and says "that's too much for a one-pager, let me just do the first three sections." They skip sections 6-12 because they're "engineering concerns." The PRD ships looking lean. The team celebrates. Six weeks later, they are in the same "missing requirements" conversation the old PRD shape produced, only now they're also three weeks further along and more committed to their assumptions.

The sections are not optional. They are the specific things AI features need that traditional PRDs don't carry. If the sections feel too heavy, the feature isn't ready to be a PRD yet — it's still in discovery, and should be climbing the Rung ladder from P2.3. The one-pager is the output of discovery, not a substitute for it.

The test: if a team can't fill in the one-pager in an afternoon, the feature is not ready to commit. Send it back to discovery and have them climb another rung.

What just changed in your roadmap

Adopt the 12-section one-pager shape for every AI feature PRD. The template above is free to copy.
Never write an AI PRD without Section 6 (quality bar). Vibes are not a shipping criterion.
Never skip Section 8 (unit economics). The cost conversation happens before commit, not after launch.
Never skip Section 12 (gap budget). The five gaps from P2.2 are where projects slip; making them visible is half the defence.
Use the template to diagnose stuck projects. The blank sections are the specific blockers.
When a team says "that's too much for a one-pager," hear that as "this feature isn't ready to commit." Send it back to the Rung ladder.
Treat the one-pager as a living doc. It updates during the build as measurements come in. "Gap budget" in particular should reflect current state, not the original estimate.

That closes Module P2 — Discovery and Scoping. You now have the three shapes, the demo-to-prod gap, the five-rung prototyping ladder, and the one-page PRD that carries all of it into engineering. You can take an AI idea from "interesting hunch" to "committed roadmap item" with evidence at every stage and a spec that doesn't silently lose half the requirements.

Next up: Module P3 — Evaluation and Quality. The part the traditional PM playbook hates the most, because it forces probability into a world that wanted binary. Four posts on how to define "good enough," how to build eval sets that work for PMs (not engineers), how to read the quality numbers your team brings you, and how to set shipping criteria that don't require a rewrite every six weeks. See you there.


📚 AI for Product · Course Home — 20 posts, five modules.

Cover photo via Unsplash. This post is part of the AI for Product series.
