Finding AI-Shaped Problems in Your Product
Most 'AI features' do not earn their keep because they were bolted onto problems that were not AI-shaped to begin with. Here is how to find the problems where AI actually wins, and how to reject the ones where it does not.
Open your product roadmap. Look at the column marked "AI." Count how many of those bullets are there because someone on the team identified a specific user pain that AI is uniquely good at relieving, and how many are there because leadership asked for "more AI." If your honest answer to the first count is fewer than half, you're in the room with every AI product team in 2026.
The root cause isn't a shortage of AI capability. Frontier models in April 2026 can reason over 200,000 tokens, call tools reliably, produce structured output on demand, and classify, extract, and summarise text with a precision that used to take a team of domain experts a week. Capability is not the bottleneck. Problem selection is. Teams are inventing AI features for problems that never needed them, while real AI-shaped problems — the ones where a model earns its keep — sit unnoticed a few rows down in the same roadmap.
This post is the filter: a framework for identifying which problems in your product are genuinely AI-shaped, and for rejecting the ones that aren't, even when leadership really wants you to ship something with "AI" in the launch post. It's the first post of Module P2 (Discovery and Scoping) and the one I most wish every PM read before their first AI feature.
No model details. Just the question you ask about each problem on your backlog.
The three shapes of AI-winning problem
From watching a lot of AI features ship and fail, I've landed on this: almost every problem where AI genuinely wins fits one of three shapes. If your candidate problem doesn't fit any of them, you're probably about to build a feature that won't repay what it cost to build and will quietly be removed in a future cleanup.
Shape 1: the "messy input, structured output" problem. The user has something unstructured and wants something structured. Messy PDF → cleaned fields. Long email thread → action items. Support transcript → CRM note. Photo of a receipt → expense entry. These problems used to require a human or a brittle rule engine. Modern frontier models handle them in a single API call with a typed schema.
Shape 2: the "write the first draft" problem. The user has a blank page and needs to stop staring at it. Draft this email. Draft this policy document. Draft this commit message. Draft this job posting. The user will edit whatever the model produces, but the first 70% is what was slow. The AI's job is to produce something non-embarrassing as a starting point, not to produce the final output.
Shape 3: the "search across everything" problem. The user has a question and the answer is buried somewhere in a corpus they cannot efficiently search. "Which of our 300 past tickets touched this issue?" "What does our 80-page employee handbook say about bereavement leave?" "What was the decision in that product meeting three months ago?" Retrieval-augmented generation (RAG, covered in Course 2's Module B3) was built for exactly this shape.
Three shapes. Most AI features that succeed are one of them. A clean test for any candidate problem is: "does this reduce to messy-in-structured-out, blank-page, or search-a-corpus?" If yes, you probably have a real problem. If no, you probably don't.
Let me walk through each shape in turn: when it wins, concrete examples that shipped, and the failure mode inside it, because even within a winning shape, teams get it wrong.
Shape 1, in depth: messy input to structured output
The reason Shape 1 is the single most common successful AI product category in 2026 is that it solves a class of problems that previously required either humans or brittle software, with a tool that's dramatically cheaper than both.
When it wins cleanly:
- The input has meaningful variation that would need a human to parse.
- The output shape is well-defined and small (a dozen fields, not a hundred).
- The downstream consumer of the output is software, not a human who will re-read the input anyway.
- The cost of a wrong extraction is recoverable — the user can spot it and fix it.
Concrete 2026 examples that shipped well and are making money:
- Expense management tools that turn receipt photos into categorised line items. The receipt-extraction snippet from Course 2's B6.3 ships this in ~40 lines of code.
- Legal tools that turn contract PDFs into structured clause databases.
- Recruiting tools that turn freeform job postings into standardised category/skill/location fields.
- Customer support tools that extract "intent, sentiment, urgency" from incoming tickets.
These are not flashy features. None of them would headline a ProductHunt launch. All of them move real product metrics — time to complete a task, accuracy of routing, cost of a manual back-office operation — by large, measurable margins. The "time to categorise a receipt" number goes from 45 seconds to 3 seconds. The "cost per support ticket triage" goes from $0.80 to $0.04. These are the economics that make AI features pay.
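To make those economics concrete, here is a back-of-envelope sketch using the two before/after figures quoted above. The monthly ticket volume is an assumed number chosen for illustration, not a figure from any real product, and `improvement` is a hypothetical helper.

```python
def improvement(before: float, after: float) -> tuple[float, float]:
    """Return (improvement factor, fractional reduction) for a before/after metric."""
    return before / after, (before - after) / before

# Figures quoted in the text above.
receipt_factor, receipt_cut = improvement(45, 3)      # seconds per receipt
triage_factor, triage_cut = improvement(0.80, 0.04)   # dollars per ticket triaged

# ASSUMPTION: 10,000 tickets a month, purely illustrative.
MONTHLY_TICKETS = 10_000
monthly_saving = MONTHLY_TICKETS * (0.80 - 0.04)

print(f"Receipts: {receipt_factor:.0f}x faster, {receipt_cut:.0%} less time")
print(f"Triage: {triage_factor:.0f}x cheaper, {triage_cut:.0%} less cost")
print(f"Saving at {MONTHLY_TICKETS:,} tickets/month: ${monthly_saving:,.0f}")
```

Swap in your own volume before quoting the saving to anyone; the ratios are the part that comes from the numbers above.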
The failure mode inside Shape 1: schema creep. The team starts with "extract these 8 fields from a receipt" and, over the next two months, stakeholders add another 12 fields, some of which aren't in every receipt, and some of which the model invents rather than leaves blank. The original 8-field extractor worked at 94% precision on your eval set. The 20-field version works at 71%. Users stop trusting it. The feature fails not because Shape 1 was wrong for this problem, but because the team let the schema bloat beyond what the model (and the input) could reliably support.
The defence: treat the schema as a first-class spec artifact and require a reason to add fields. Each new field lowers the reliability of every field, not just itself. Keep it small.
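One way to make "the schema is a spec artifact" enforceable is to put a field budget in code, so that adding a field fails a check until someone raises the limit on purpose. The sketch below is a minimal illustration under stated assumptions: the eight receipt fields, the `ReceiptFields` name, and the budget of 12 are all hypothetical, and the parser only checks field names, not value types.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional

MAX_FIELDS = 12  # ASSUMPTION: team-chosen budget, raised only with a written reason


@dataclass
class ReceiptFields:
    # Hypothetical 8-field schema. Every field defaults to None so the
    # model can leave a value blank instead of inventing one.
    merchant: Optional[str] = None
    date: Optional[str] = None
    total: Optional[float] = None
    currency: Optional[str] = None
    tax: Optional[float] = None
    category: Optional[str] = None
    payment_method: Optional[str] = None
    receipt_number: Optional[str] = None


def check_field_budget(schema: type, limit: int = MAX_FIELDS) -> None:
    """Raise if the schema has grown past the agreed field budget."""
    count = len(fields(schema))
    if count > limit:
        raise ValueError(f"{schema.__name__} has {count} fields; budget is {limit}")


def parse_extraction(raw: str) -> ReceiptFields:
    """Parse model output, keeping absent fields as None and rejecting extras."""
    data = json.loads(raw)
    allowed = {f.name for f in fields(ReceiptFields)}
    unknown = set(data) - allowed
    if unknown:
        raise ValueError(f"Unexpected fields: {sorted(unknown)}")
    return ReceiptFields(**data)


check_field_budget(ReceiptFields)
receipt = parse_extraction('{"merchant": "Cafe Uno", "total": 12.5, "currency": "EUR"}')
print(receipt.tax)  # absent in the input, so it stays None
```

Running `check_field_budget` in CI turns schema creep from a silent drift into a decision someone has to sign off on.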
Shape 2, in depth: blank-page first-draft
Shape 2 is the one most teams build when they don't know better, and it's often the right choice, but for a specific reason: the user values speed-to-70% more than perfection. If a user would rather edit a mediocre draft than write one from scratch, Shape 2 wins. If they'd rather write from scratch because they're going to have to rewrite everything anyway, it doesn't.
When it wins cleanly:
- The user is currently blocked by a blank page more than by "I don't know what to write."
- The user has enough context and domain expertise to catch model errors.
- The cost of a bad draft is low — they can discard and try again or edit freely.
- The user's output is not going to another human without a review step.
Concrete 2026 examples:
- Email drafters in CRMs (Gmail's Smart Compose is the canonical example of the shape; every CRM is cloning it).
- Commit message generators in IDEs.
- Draft responses in customer support tools where the agent still sends.
- Resume and cover letter tools where the candidate edits heavily before sending.
- "Write the PRD first draft from this notes doc" features in product tools.
The pattern: the model produces a first pass, the user reviews and edits, the user's brain gets unstuck. The win is measured in time saved, not in what the model wrote. Good Shape-2 products make it trivial to accept, edit, or reject the draft; bad ones force the user to commit.
The failure mode inside Shape 2: the "sounds plausible but subtly wrong" draft. The model produces a draft that looks competent at a glance but contains a specific error the user doesn't catch because the draft reads smoothly. Support responses that reference features the product doesn't have. PR descriptions that claim a change the PR didn't make. Emails that misrepresent a meeting summary. Users trust the draft because it's well-written, they skim the edit, they send it, and now a wrong message is in production.
The defence: design the UX so editing feels effortless and not editing feels slightly wrong. Highlight the model-generated parts. Require at least one user interaction before sending. Instrument the edit rate as a signal — if users are sending drafts unedited, either your model is great or your users are trusting it too much; either way, measure and investigate.
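Instrumenting the edit rate can be as small as comparing what the model proposed with what the user sent. The sketch below assumes you log both texts per draft; `DraftEvent`, `edit_rate`, and `interpret` are hypothetical names. The bands follow this post's rule of thumb (30-60% healthy, 0% a trust problem, 90%+ a quality problem), and labelling everything between those bands "investigate" is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class DraftEvent:
    draft_text: str  # what the model proposed
    sent_text: str   # what the user actually sent


def edit_rate(events: list[DraftEvent]) -> float:
    """Fraction of sent drafts that the user changed before sending."""
    if not events:
        raise ValueError("no draft events to measure")
    edited = sum(1 for e in events if e.sent_text.strip() != e.draft_text.strip())
    return edited / len(events)


def interpret(rate: float) -> str:
    """Map an edit rate onto the rule-of-thumb bands."""
    if rate == 0.0:
        return "trust problem: nobody edits"
    if rate >= 0.90:
        return "quality problem: almost every draft needs fixing"
    if 0.30 <= rate <= 0.60:
        return "healthy"
    return "investigate"  # ASSUMPTION: gaps between the bands get a closer look


# Ten sent drafts, four of them edited before sending.
events = [DraftEvent("Hi, thanks!", "Hi, thanks!")] * 6 + [
    DraftEvent("Hi, thanks!", "Hi Sam, thanks for the update!")
] * 4
rate = edit_rate(events)
print(f"{rate:.0%}: {interpret(rate)}")
```

An exact-match comparison counts a one-character fix the same as a full rewrite. If that distinction matters for your product, a similarity measure such as the standard library's `difflib.SequenceMatcher` is one option.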
Shape 3, in depth: search across a corpus
Shape 3 is the RAG use case: the user has a question, the answer exists somewhere in a body of text, and the product's job is to find and present the answer. Under the hood, this is retrieval plus grounded generation, which Course 2 Module B3 covers in detail.
When it wins cleanly:
- The corpus is large enough that humans can't search it efficiently.
- The answers are in the corpus, not synthesised from world knowledge.
- The user will accept "I don't know" as a valid answer when the corpus is silent.
- Citations back to source documents are available and valued by the user.
Concrete 2026 examples:
- Internal "ask your company's docs" tools (Glean, Notion AI's search).
- Legal research tools that ground answers in case law.
- Customer-facing product Q&A bots backed by official help documentation.
- Clinical decision support tools that cite medical guidelines.
- Compliance tools that reference regulatory text directly.
All of these are Shape 3. None of them is a chatbot. None of them is "generic AI assistant." Each of them does one specific thing well because the problem fit the shape.
The failure mode inside Shape 3: the "answers not in the corpus" trap. Users ask questions that genuinely have no answer in the corpus — because the document hasn't been written yet, or the corpus is out of date, or the question is outside the product's scope. A naive RAG pipeline generates a plausible-sounding answer anyway, synthesising from the model's general knowledge, and the user trusts it because they see the "search your docs" framing. Now the tool is confidently wrong on exactly the questions where it should have said "I don't know."
The defence: implement a confidence threshold and a "cite-or-refuse" mode. If retrieval returns no chunks above a similarity floor, respond "I don't have information about that in your docs" rather than synthesising. This is cheap to implement and disproportionately valuable for user trust. We covered it in Course 2's B3.5 (RAG failure modes).
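A cite-or-refuse gate can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: `Chunk`, `answer_or_refuse`, and the `generate` callback are hypothetical names, `generate` stands in for whatever grounded-generation call your stack uses, and the 0.75 floor is a placeholder you would tune against your own retriever's scores.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "I don't have information about that in your docs."


@dataclass
class Chunk:
    source: str        # document the chunk came from, used as the citation
    text: str
    similarity: float  # retriever score, higher means closer to the question


def answer_or_refuse(
    question: str,
    chunks: list[Chunk],
    generate: Callable[[str, list[Chunk]], str],
    floor: float = 0.75,  # ASSUMPTION: placeholder, tune on your own eval set
) -> dict:
    """Answer only from chunks above the similarity floor; otherwise refuse."""
    grounded = [c for c in chunks if c.similarity >= floor]
    if not grounded:
        # Nothing relevant retrieved: refuse instead of calling the model.
        return {"answer": REFUSAL, "citations": []}
    answer = generate(question, grounded)
    return {"answer": answer, "citations": sorted({c.source for c in grounded})}


def fake_generate(question: str, chunks: list[Chunk]) -> str:
    # Stand-in for a real model call, so the sketch runs on its own.
    return f"Based on {chunks[0].source}: {chunks[0].text}"


hits = [
    Chunk("handbook.pdf", "Bereavement leave is five days.", 0.91),
    Chunk("faq.md", "Office hours are 9 to 5.", 0.32),
]
print(answer_or_refuse("How long is bereavement leave?", hits, fake_generate))
print(answer_or_refuse("What is our Mars policy?", hits[1:], fake_generate))
```

The design choice that matters is that the refusal path never reaches the model, so there is nothing for it to synthesise from general knowledge.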
A worked example: auditing a real AI roadmap
Let me apply the framework to a realistic B2B product's "AI roadmap" to show the flow.
The product: a project management SaaS for small to mid-size companies. The "AI column" on the roadmap has eight items. Let's walk them through.
| # | Roadmap item | Shape? | Decision |
| 1 | Auto-summarise completed tasks into weekly reports | Shape 1 (messy notes → structured report sections) | Accept. Core AI win. |
| 2 | Suggest next tasks based on project history | Not a clean shape — it's a recommender, not a Shape-1/2/3 problem | Reject as "AI feature"; may be a good ML feature, but use classical recsys, not LLMs. |
| 3 | Draft status-update emails from the week's task list | Shape 2 (blank-page drafts the user edits and sends) | Accept. Classic Shape 2. |
| 4 | AI project name generator | Not a product problem. It's a toy. | Reject. Users don't need AI help naming a project. |
| 5 | Search across all past project discussions | Shape 3 (Q&A over corpus of old messages) | Accept with care — confidence threshold mandatory. |
| 6 | AI chatbot on the marketing site | Shape 3 if grounded in help docs, but it's not in the product's main value loop | Defer. Real Shape 3, but low-priority compared to 1, 3, 5. |
| 7 | Automatic meeting transcription | Shape 1 (audio → structured notes) but requires a separate audio-first model | Accept with care — owner needs to evaluate transcription vendors; likely Course 3 P1.3 "buy" decision. |
| 8 | AI-generated project timelines from a free-form description | Shape 1 (text → structured timeline) | Accept. Good Shape 1. |
Before the audit: 8 items, all labelled "AI," no prioritisation framework, a year of engineering planned.
After the audit: 5 solid candidates (#1, 3, 5, 7, 8), 2 rejections (#2, 4), 1 deferred (#6). The team now has a focused roadmap of Shape-1 and Shape-2 wins with one Shape-3 that has clear design requirements, and the two rejections save six weeks of engineering that would have gone to features that underperform.
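If you want the audit to live somewhere more durable than a slide, the table above fits in a small data structure. The sketch below encodes the eight items exactly as decided above and tallies them; the class and field names are hypothetical, and the final check ("nothing without a shape gets accepted") is one way to express the three-shape test as a rule.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    DEFER = "defer"


@dataclass
class RoadmapItem:
    number: int
    name: str
    shape: Optional[int]  # 1, 2, or 3; None when the item fits no shape
    decision: Decision


ROADMAP = [
    RoadmapItem(1, "Weekly report summaries", 1, Decision.ACCEPT),
    RoadmapItem(2, "Suggest next tasks", None, Decision.REJECT),
    RoadmapItem(3, "Draft status-update emails", 2, Decision.ACCEPT),
    RoadmapItem(4, "Project name generator", None, Decision.REJECT),
    RoadmapItem(5, "Search past discussions", 3, Decision.ACCEPT),
    RoadmapItem(6, "Marketing-site chatbot", 3, Decision.DEFER),
    RoadmapItem(7, "Meeting transcription", 1, Decision.ACCEPT),
    RoadmapItem(8, "Timelines from description", 1, Decision.ACCEPT),
]

tally = Counter(item.decision for item in ROADMAP)

# The three-shape test as a rule: an item with no shape cannot be accepted.
unshaped_accepts = [
    item.name
    for item in ROADMAP
    if item.shape is None and item.decision is Decision.ACCEPT
]

print({decision.value: count for decision, count in tally.items()})
print("Accepted without a shape:", unshaped_accepts)
```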
This is a realistic outcome. I've done this exercise with five different teams and the "audit rejected two to three items" result has been consistent. AI roadmaps almost always contain items that aren't shape-matched, and removing them is the cheapest quality improvement available to a PM in 2026.
The specific shapes where AI doesn't win (and what to use instead)
A symmetric list. If your problem is one of these, don't reach for an LLM — reach for the tool that was designed for the shape.
- High-precision classification over well-labelled training data. Use classical ML (gradient-boosted trees, small fine-tuned transformers). LLMs are overkill and less accurate on this specific shape.
- Real-time systems with sub-100ms latency budgets. LLM inference typically adds 200-500ms or more, even with streaming. Find a non-LLM approach or pre-compute.
- Exact deterministic output required. LLMs are probabilistic. If the downstream consumer can't tolerate any variance, you need a rules engine or a template system, not a model.
- Tasks where users don't trust AI and want to do it themselves. Shape 2 fails if the user rejects every draft. Measure before committing.
- Problems where the correct output is a number or a calculation. LLMs do arithmetic poorly compared to, you know, a calculator. Use tool-calling to delegate math to actual software.
- Search over very structured data. If the answer is in a well-designed relational database, you want SQL, not RAG. "Write SQL from natural language" is itself a valid AI feature, but the underlying search stays SQL.
- Recommendations based on user behaviour. A classical collaborative filter or matrix factorisation beats an LLM on this shape almost always.
Rejecting a problem from your AI roadmap doesn't kill it — it moves it to a more appropriate tool. "Not AI-shaped" is not the same as "not worth doing." It's a scoping decision, and saying no to bad AI fits makes room for saying yes to good ones.
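The "delegate math to actual software" item in the list above can be illustrated with a tiny calculator that a tool-calling setup might expose. This is a sketch under assumptions: `calculate` is a hypothetical tool name, no particular vendor's tool-calling API is shown, and the model's only job would be to produce the expression string. The evaluator walks Python's `ast` instead of calling `eval()`, so only plain arithmetic is accepted.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculate(expression: str):
    """Evaluate +, -, *, / arithmetic on numeric literals without eval()."""

    def walk(node):
        if isinstance(node, ast.Constant):
            value = node.value
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                return value
        elif isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        elif isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)


# The model proposes the expression; deterministic code does the arithmetic.
print(calculate("1200 * (80 - 4) / 100"))
```

Function calls, names, and other non-arithmetic input raise `ValueError`, and malformed input raises `SyntaxError` from `ast.parse`, so the tool fails loudly instead of guessing.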
The failure mode: "AI-washing the roadmap"
One specific failure mode deserves naming, because it's the most common way product teams end up with a bloated AI roadmap that underperforms.
AI-washing the roadmap happens when leadership demands "more AI" and the PM translates that demand by relabelling existing features. A "smart" feature that already worked is renamed to "AI-powered smart." A recommender that was using classical ML is re-specced to use an LLM. A rule-based workflow is replaced with a model-driven one because it sounds better. None of these changes improves the product; each adds cost, latency, and a non-determinism tax, and none moves any user metric.
The product ends up with a roadmap full of items that are labelled "AI" but aren't shape-matched, every one of which is slightly worse than the feature it replaced. Six months in, engineering quietly rolls back several of them, and the team spends the next quarter explaining to leadership why the "AI push" didn't deliver.
The defence: apply the three-shape test to every item before it gets an "AI" label. If it doesn't fit a shape, do not add "AI" to the name, and do not commit engineering time to making it use a model. Ship the feature with the tool it actually needs (classical ML, rules, a better UI), and save the AI budget for the shape-matched problems where it'll earn real wins.
A PM who can push back on "add more AI" with a framework saves their team a whole quarter of misdirected work.
What just changed in your roadmap
- Audit your current AI roadmap against the three shapes this week. Item by item. Most roadmaps have at least one item that should be rejected outright and at least one that should be deferred.
- Require a "which shape" label on every new AI feature proposal. If the shape isn't clear, the problem isn't AI-shaped and the proposal needs reshaping.
- Inside Shape 1, limit schemas aggressively. Every field is a reliability cost. 8-12 fields is a good range; 20+ is a warning sign.
- Inside Shape 2, instrument the edit rate. Users editing drafts 30-60% of the time is healthy; 0% is a trust problem, 90%+ is a quality problem.
- Inside Shape 3, always implement cite-or-refuse. No RAG product should synthesise answers to questions the corpus doesn't cover.
- When leadership asks for "more AI," respond with the audit. You'll find two features they didn't know they already had, and you'll reject two features that wouldn't have shipped well. That's a better outcome than a longer roadmap.
- Keep a "rejected from AI roadmap" list. Revisit quarterly. Capabilities change; a problem that wasn't AI-shaped in Q1 might be in Q3 when a new capability stabilises.
Next post, P2.2, we take on the second most-ignored variable in AI product planning: the demo-to-prod gap. Why your magic demo is going to die on the road to general availability, and the specific moves that de-risk it early before engineering has written a line of production code.
📚 AI for Product · Course Home — 20 posts, five modules.
This post is part of the AI for Product series.