Prototyping With AI as a Non-Coder

You have a hunch about an AI feature. You can describe it in a sentence. Leadership is interested but noncommittal. Engineering is busy and won't build a prototype unless the team decides to prioritise it. You cannot write code. You have one week before the next roadmap review.

In 2023, this would have been a frustrating blocker. In 2026, the first round of evidence is an afternoon of work. A PM, designer, or founder with a $20-a-month Claude or ChatGPT account and a rough instinct for prompting can produce an artifact that earns a real scoping decision: no waiting for engineering, no sprint capacity committed, and not a single line of production code touched. The prototype is not the product; it is the evidence that decides whether to build the product.

This post is the ladder. Five progressively richer ways to prototype an AI feature as a non-coder, each useful at a different stage of discovery, each producing a different kind of artifact. You pick the rung you need for the decision you're trying to make, do the work, and walk into the next review with something specific instead of a pitch. If Module P1 was about mental models and P2.1-P2.2 were about problem selection, this post is about the hands-on work that happens between "interesting idea" and "commit to engineering."

No code. Real tools. Concrete time estimates. The ladder I have actually used.

The five rungs

The ladder goes from cheapest/roughest to most expensive/most realistic. Each rung answers a specific question and rules out or validates a specific risk. Climb only as high as you need to make the next decision: going higher wastes time, and stopping short leaves the decision unsupported.

Rung 1: the chat-window prototype. You open Claude or ChatGPT. You paste your best attempt at a system prompt. You manually feed in 8-10 realistic inputs. You look at the outputs. Time: 30 minutes. Answers: is this problem even shape-matched to a model?

Rung 2: the "projects" prototype. You use Claude Projects, ChatGPT Custom GPTs, or Gemini Gems — the provider's built-in "give me a persistent system prompt and a small knowledge base" feature. You give it a handful of real documents and share the project with a pilot user. Time: 2-3 hours. Answers: is the experience as a user actually useful, and what do they do in the first five minutes?

Rung 3: the no-code workflow. You use a no-code tool like n8n, Zapier, or Make with a built-in AI node, wire a simple "input → model call → output" flow, and test it against a batch of real inputs from a spreadsheet. Time: 4-6 hours. Answers: what is the shape of the output across a realistic distribution, not just cherry-picked examples?

Rung 4: the vibe-coded prototype. You use a code-generation tool like v0, Lovable, Bolt, or Cursor's agent mode to generate a throwaway web app around the AI call. You describe the feature in plain English and iterate until it's usable. Time: 1-3 days. Answers: does this make sense as a user-facing product, with a real UI and a real flow?

Rung 5: the pilot with real users. You take the Rung 4 prototype, give it to 5-10 real users in a structured setting, and watch them use it. Time: 1-2 weeks. Answers: do cold users actually get value, and what do they trust or not trust?

Five rungs. Most ideas should die at rung 1, 2, or 3 — cheap, fast, before the team commits real engineering. The few that survive all the way to rung 5 are the ones worth building. Climbing this ladder is the single highest-leverage thing a non-coding PM can do in 2026, and almost nobody does it because they've been trained to wait for engineering to "build a prototype."

Let me take each rung in detail with concrete 2026 tools and honest limitations.

Rung 1: the chat-window prototype

The cheapest possible prototype. You open Claude.ai (or ChatGPT, or Gemini — pick whichever you personally use most) and paste your first attempt at a system prompt into a fresh conversation. Then you paste a realistic input and look at the output. Then you paste a harder input. Then an adversarial one. Then a weird one. Then a short one. Thirty minutes, eight to ten inputs.

What you're testing: whether the problem is shape-matched to a frontier model at all. If the model gives broadly reasonable outputs on most of your inputs, there's a signal worth pursuing. If it fails on the basics — hallucinates, ignores the instruction, produces the wrong format — the problem either needs a different shape, more context (retrieval), or a different approach altogether. You find that out in thirty minutes instead of six weeks.

What it does NOT tell you: how the feature would feel as a product, what the distribution of outputs looks like at scale, whether real users would trust it, what the unit economics are, or how it'll behave on the tail. Those are later rungs.

Concrete 2026 setup (Claude example):

Open claude.ai, start a new chat.
Paste your best attempt at a system prompt. It should be one paragraph at most for a Rung-1 test.
In a notebook, write down 8-10 realistic inputs covering: 2-3 canonical happy-path cases, 2-3 edge cases you think might be hard, 2 adversarial cases, 1-2 clearly-out-of-scope cases.
Paste each input. Record the response.
Score each response: good, borderline, bad. Write one line about why.
At the end, look at the pattern. Is the good rate above 70%? Are the failures clustering? Does anything surprise you?
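
You can do this tally by eye in the notebook. If you, or a friendly engineer, would rather script it, a minimal Python sketch might look like the following. The inputs, scores, and reasons are invented for illustration; only the good/borderline/bad scale and the 70% bar come from the steps above.

```python
from collections import Counter

# Hypothetical scored results from one Rung 1 session:
# (input label, score, one-line reason). All values are made up.
results = [
    ("happy path 1", "good", "correct and well formatted"),
    ("happy path 2", "good", "correct"),
    ("happy path 3", "good", "correct"),
    ("edge case 1", "good", "handled the missing field"),
    ("edge case 2", "good", "handled the long input"),
    ("edge case 3", "bad", "wrong format"),
    ("adversarial 1", "good", "declined politely"),
    ("adversarial 2", "borderline", "partly followed the injected instruction"),
    ("out of scope 1", "good", "said it could not help"),
    ("out of scope 2", "good", "said it could not help"),
]


def summarise(results, threshold=0.70):
    """Return the good rate, whether it clears the threshold, and the score counts."""
    counts = Counter(score for _, score, _ in results)
    good_rate = counts["good"] / len(results)
    return good_rate, good_rate > threshold, counts


good_rate, keep_exploring, counts = summarise(results)
print(f"good rate: {good_rate:.0%}, keep exploring: {keep_exploring}")
print(dict(counts))
```

The number matters less than the reasons column: two failures with the same one-line reason are a cluster worth naming.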

Thirty minutes. You now have a one-paragraph answer to "should we keep exploring this?" which is often enough to kill or greenlight a direction.

Specific 2026 tip: use the reasoning mode (Claude extended thinking, GPT-5 reasoning, Gemini deep thinking) for the hardest 2-3 inputs only, and compare its answers to the default mode. If reasoning mode clearly beats default, you have a signal that the feature needs a reasoning-model tier, which has real cost and latency implications downstream.

The failure mode for Rung 1: mistaking "I found a prompt that works on my cherry-picked inputs" for "this feature is ready." Rung 1 is the start of the ladder, not the end. Too many PMs demo a rung-1 chat-window session to leadership, get a commit-to-build response, and then discover at rung 3 or 4 that the real distribution is much harder. Always climb at least one more rung before committing.

Rung 2: the Projects / Custom GPT / Gems prototype

Rung 2 moves from a bare chat window to a reusable shared artifact. The specific features — Claude Projects, ChatGPT Custom GPTs, Gemini Gems — each let you package a system prompt and a small knowledge base into something with a link, so a pilot user can access it without pasting your prompt into their own chat.

Why this rung matters: it's the first point where someone other than you uses the prototype. The gap between "I know what this is supposed to do" and "a user who doesn't know tries to use it" is enormous, and Rung 2 is the cheapest way to measure it. You'll learn in the first 10 minutes of watching a pilot user that their first instinct was different from yours, they phrased their query in a way you didn't predict, and they gave up after two turns because they weren't sure what the assistant could do.

Concrete setup (Claude Projects example):

Create a new project in Claude.ai.
Write a slightly more polished system prompt: 2-3 paragraphs, no more.
Upload 5-20 real documents (help articles, past tickets, product docs, whatever constitutes the "knowledge base" for the feature).
Share the project link with 3-5 colleagues who aren't on your team. Give them a one-sentence task description and nothing else.
Wait an hour. Review the conversations. (Projects preserve the full history; if you cannot see a colleague's chats from your own account, ask them to share them with you.)
Ask each user 3 questions: what did you try first, what surprised you, would you use this if it worked well.

Three hours of your time. Three to five outside perspectives. Specific insights you couldn't have gotten from the chat window alone.

What Rung 2 is good at: onboarding gaps (users don't know what to ask), trust signals (users stop using it after one bad answer), latent expectations (users try to use it for things you didn't scope), and voice/tone fit (users find the responses too formal or too casual).

What Rung 2 is still bad at: measuring output distribution across many inputs (that's Rung 3), production-grade UX decisions (Rung 4), and cold-user trust at scale (Rung 5).

The failure mode for Rung 2: sharing the project only with teammates who are primed to be impressed. You'll get a positive vibe, no useful signal, and a false greenlight. Share with people outside your team — ideally outside your department — who have no stake in the feature's success.

Rung 3: the no-code workflow prototype

Rung 3 is where you start testing across a distribution of inputs instead of one at a time. You use a no-code automation tool to wire up "spreadsheet of inputs → model call → spreadsheet of outputs," run it, and look at the full grid.

Concrete 2026 tools:

n8n — self-hostable automation platform with AI nodes, good for batching and iteration.
Zapier / Make — similar, slightly easier to start, less flexible at the edges.
Airtable + AI extensions — turns a spreadsheet into a mini-pipeline, very PM-friendly.
Google Sheets with the AI functions — even simpler, surprisingly competent for quick batches.
Anthropic's own Workbench, for testing Claude prompts directly at small batch scale with versioning.

Why Rung 3 matters: this is the first rung where you're measuring output quality as a distribution instead of anecdotally. You assemble 40-100 real inputs (or proxy data if you don't have real data yet), run them all through your prompt, and look at the full output table. In my experience, roughly one output in five reveals something you didn't know. That fifth is the signal Rung 1 and Rung 2 can't give you.

Concrete flow (Airtable example):

Create an Airtable base. One row per input.
Add an "AI output" column using an AI extension wired to your chosen model with your Rung-2 prompt.
Run the column against all rows. Wait. Come back.
Add a "quality" column. Manually score each row: pass, borderline, fail.
Add a "failure mode" column. For each fail, one-word category: wrong-answer, wrong-format, hallucination, refusal, off-topic.
Sort by failure mode. Look at the clusters.
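
Airtable's own grouping and sorting is enough for the last step. If you export the sheet instead, the same summary can be sketched in a few lines of Python. The column names (`input`, `quality`, `failure_mode`) and the sample rows are assumptions made up for this sketch, not an Airtable export format.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV export of the scored sheet. Column names and rows are invented.
SHEET_CSV = """input,quality,failure_mode
refund for order 1182,pass,
where is my invoice,pass,
cancel but keep my data,borderline,
asdf,fail,off-topic
refund AND upgrade,fail,wrong-answer
invoice for last year,fail,hallucination
upgrade to annual,pass,
refund policy in Germany,fail,hallucination
"""


def summarise_sheet(csv_text):
    """Return the pass rate, the quality counts, and failure modes sorted by frequency."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    quality = Counter(row["quality"] for row in rows)
    clusters = Counter(
        row["failure_mode"] for row in rows if row["quality"] == "fail"
    )
    pass_rate = quality["pass"] / len(rows)
    return pass_rate, quality, clusters.most_common()


pass_rate, quality, clusters = summarise_sheet(SHEET_CSV)
print(f"pass rate: {pass_rate:.0%}")
for mode, count in clusters:
    print(f"{mode}: {count}")
```

Eight rows is far too few for a real run; the point is only the shape of the summary, a pass rate plus a ranked list of failure modes.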

Four to six hours. You now have an honest eval-shaped artifact, which is not the same as a real eval set (that's the PM-written version from Module P3) but is dramatically better than anything lower on the ladder.

What Rung 3 is good at: finding failure-mode clusters, estimating shipping quality realistically, spotting tail behaviour the chat window would never surface, and producing a quantitative claim ("this approach is 82% passing on a realistic set") that earns real engineering discussion.

What it's still bad at: the user experience. An 82% pass rate might feel great in a spreadsheet and terrible in a product where 18% of users hit the failure case on their first try.
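
The arithmetic gets worse once users make more than one request. As a rough illustration, and assuming each request fails independently of the others (a simplification that real usage will not match exactly), the chance that a user has hit at least one failure after n requests is 1 minus 0.82 to the power n:

```python
def chance_of_hitting_a_failure(pass_rate, attempts):
    """Probability of at least one failure in `attempts` tries, assuming independence."""
    return 1 - pass_rate ** attempts


for attempts in (1, 3, 5):
    p = chance_of_hitting_a_failure(0.82, attempts)
    print(f"{attempts} attempt(s): {p:.0%} chance of at least one failure")
```

Under that assumption the figure is 18% on the first request, about 45% by the third, and about 63% by the fifth, which is why a spreadsheet pass rate and a product experience can feel so different.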

The failure mode for Rung 3: using unrealistic inputs because real ones are hard to get. A PM makes up 40 "representative" inputs from memory. The spreadsheet looks impressive. Production inputs turn out to be very different — messier, shorter, more adversarial — and the 82% number doesn't hold. The fix: every rung-3 run should use at least 50% real or anonymised-real inputs, not invented ones.

Rung 4: the vibe-coded prototype

Rung 4 is where you build a real, clickable, user-facing UI around the model call — without writing code yourself. This is the most important shift in 2026 for non-coding PMs, and it deserves more space than it usually gets.

The tools that work in 2026:

v0 (Vercel) — turn English into a React UI, connects to any API, deploy with one click.
Lovable — similar, more opinionated, includes database wiring.
Bolt.new — full-stack prototype in one page, extremely fast initial generation.
Cursor's agent mode or composer — if you have a little more technical inclination, this is faster once you learn it.
Claude Artifacts — in Claude.ai, you can ask the model to "build me a simple web app that does X" and iterate on the artifact live in the chat.

Pick one you like. They're all different in the details and converge on the same use case: describe a UI in plain English, iterate until it works, end up with something you can put in front of a user without waiting for engineering.

Concrete flow (v0 example):

Start a new v0 project. Describe the prototype: "A web app where a user pastes in a customer support ticket. When they click analyse, it calls the Claude API with my system prompt and displays the structured output as a table with copy-to-clipboard buttons."
v0 generates the initial version. Iterate: "Make the output table editable. Add a confidence score next to each field. If confidence is below 0.7, highlight the field in yellow."
Wire the API call: the tool walks you through adding an API key and hitting the model.
Deploy. You get a URL.
Use the URL yourself, then share with 3-5 pilot users.
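
You never have to write the highlight rule from the second step yourself; the tool generates it. It still helps to know how small the logic is when you review what came back. A minimal sketch, in Python for readability (v0 would generate React), with the field names and values invented and the confidence score assumed to arrive alongside each field:

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # assumed 0-1 score; where it comes from is left open here


def highlight_colour(field, threshold=0.7):
    """Return 'yellow' for low-confidence fields, otherwise None (no highlight)."""
    return "yellow" if field.confidence < threshold else None


# Hypothetical structured output for one support ticket.
fields = [
    Field("category", "billing", 0.93),
    Field("urgency", "high", 0.64),
    Field("product", "mobile app", 0.70),
]

for f in fields:
    print(f.name, f.value, highlight_colour(f))
```

Note the boundary: a field at exactly 0.7 is not highlighted, because the instruction said "below 0.7". Small wording choices like that are worth checking in the generated app.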

One to three days. You now have a working prototype you can demo and let real people use. This is the first rung where real UX decisions get tested: do users understand the output, do they trust it, do they edit or accept, where do they click first.

What Rung 4 is good at: validating the UX, measuring cold-user behaviour, producing something you can share with leadership that looks and feels like a product. The specific thing Rung 4 gives you that Rung 3 doesn't is the answer to the question "what does the interface look like when the model is wrong?" — you can only design for that when you have a UI to put it in.

What it's still not: production-ready, scalable, secure, or reliable. Vibe-coded prototypes are throwaway. Do not launch one to customers on the public internet. Do not let anything important depend on one staying up. Use them to test decisions, then throw them away when engineering builds the real thing.

The failure mode for Rung 4: falling in love with the vibe-coded prototype and trying to productionise it. This is a surprisingly common way AI products ship, and they always ship broken. The prototype's charm is in its throwaway nature; the moment you try to harden it, you're better off starting from scratch with real engineering. Rung 4 outputs inform Rung 5 and then get deleted.

Rung 5: the pilot with real users

The top of the ladder. You take the Rung 4 prototype, recruit 5-10 real users or pilot customers, give them access for 1-2 weeks, and watch. Not "demo it to them." Watch them use it for a real task in their own environment.

This is the single most expensive rung in time — one to two weeks, recruiting effort, user research time — and also the single most valuable, because it's the first rung where you observe actual behaviour rather than reactions to a pitch. Users who loved the idea in the abstract often never open the link. Users who were skeptical sometimes become your most engaged testers. You cannot predict this; you can only observe it.

What you learn in Rung 5 that nothing earlier tells you:

How often users actually try the feature when it's available to them.
What they try to do with it first (rarely what you expected).
How long it takes before they give up, and what specifically makes them give up.
What they tell a colleague about it after a week.
Whether they integrate it into their workflow or treat it as a curiosity.
What they'd pay for it, and under what conditions.

Concrete setup:

Recruit 5-10 real users (or pilot customers). Not people on your team. Not friends who'll be polite.
Give them a one-paragraph description of the feature and access to the Rung 4 prototype.
Don't onboard them in person. Let them try it cold. That's the whole point.
Set a weekly 20-minute check-in with each. Ask open questions: "What have you tried? What worked? What didn't? Did you come back after the first session, and why?"
At the end of the pilot, write a one-page summary of what you learned, grouped by "strong signal" and "weak signal."

Two weeks. You walk into the roadmap review with a specific set of learnings, real user quotes, and a confidence level that no lower rung can give you. Leadership will ask hard questions. You will have real answers.

What Rung 5 is good at: the things the product will actually live or die on — engagement, retention after first use, word-of-mouth signal, the specific UX decisions that separate a feature users love from one they ignore.

The failure mode for Rung 5: recruiting friendly users and getting a friendly false positive. Pilots with polite users are worse than useless — they give you confidence you shouldn't have. The defence: explicitly recruit people who have no relationship with you or your team, and bias your selection toward users who've complained about the problem you're solving rather than users who've enthused about AI in general.

A worked 2026 example

Let me ground this in a specific scenario. You're the PM for a small analytics SaaS and you have an idea: an AI feature that converts natural-language questions into SQL queries against the user's data warehouse. You have two weeks before the next roadmap review.

Day 1 (Rung 1): Open Claude.ai. Paste a system prompt: "You are a SQL expert. Given a natural-language question and a database schema, produce a PostgreSQL query that answers it." Paste 10 test questions against your own product's database schema. Score the results. 7/10 are good. 3/10 are wrong in interesting ways — one confuses column names, one produces a query that runs but returns the wrong semantics, one refuses because the question was ambiguous. Verdict: shape-matched, worth pursuing. Thirty minutes.
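
Pasting by hand is fine for ten questions. If it helps to keep every test message in the same shape, the assembly can be sketched like this. The schema and the second question are invented for illustration, and the "Schema / Question" layout is just one reasonable choice, not a required format.

```python
SYSTEM_PROMPT = (
    "You are a SQL expert. Given a natural-language question and a database "
    "schema, produce a PostgreSQL query that answers it."
)

# Hypothetical schema, invented for illustration only.
SCHEMA = """\
customers(id, name, created_at)
orders(id, customer_id, total_usd, placed_at)
"""


def build_test_message(schema, question):
    """Combine the schema and one test question into the text pasted after the system prompt."""
    return f"Schema:\n{schema}\nQuestion: {question}"


questions = ["show me my top customers", "total revenue by month for last year"]
for q in questions:
    print(SYSTEM_PROMPT)
    print(build_test_message(SCHEMA, q))
    print("-" * 40)
```

Keeping the layout fixed means that when an answer goes wrong, you know the question caused it and not a change in how you pasted the schema.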

Day 2 (Rung 2): Create a Claude Project. Upload a dummy schema document, the 10 questions from Day 1, and 3 example question-SQL pairs. Share with 3 colleagues outside your team, ask them to try 5 questions each. Review the conversations. One colleague writes questions that are much more casual than yours ("show me my top customers" instead of "select customers by revenue"). Another asks a question the model refuses because it involves a table that wasn't in the schema document. Insight: onboarding matters more than the prompt quality. Three hours.

Day 3 (Rung 3): Build a 60-row sheet in Airtable with real anonymised questions from your analytics product's search logs. Wire an AI extension to Claude Sonnet. Run the column. Score each row. You get 41 correct, 9 borderline, 10 wrong. Failure modes cluster: 6 of the 10 wrong ones are ambiguous questions where the model picked the wrong interpretation confidently. Insight: "show me X" queries need a disambiguation step, not a direct SQL call. Four hours.
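
The Day 3 numbers are worth turning into rates before the review. A short worked calculation, using only the counts from the paragraph above:

```python
total, correct, borderline, wrong = 60, 41, 9, 10
ambiguous_failures = 6  # wrong answers where the model picked an interpretation confidently

# Sanity check: the three buckets should account for every row.
assert correct + borderline + wrong == total

print(f"correct:    {correct / total:.1%}")
print(f"borderline: {borderline / total:.1%}")
print(f"wrong:      {wrong / total:.1%}")
print(f"share of wrong answers due to ambiguity: {ambiguous_failures / wrong:.0%}")
```

That is roughly 68% correct, 15% borderline, and 17% wrong, with 60% of the wrong answers tracing back to one cause. A single cause behind most failures is the kind of finding that turns into a concrete requirement.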

Day 4 (Rung 4): Use v0 to build a prototype: a text input, a "generate query" button, and a display that shows the generated SQL, a "run" button, and the query results in a table. Add a "clarify" button that asks a follow-up question when confidence is low. Iterate for a few hours. Ship to yourself and 3 colleagues. Insight: the clarify flow feels natural; users click through it without friction; the confidence-triggered UI makes the ambiguity problem tractable. One day.
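
The decision behind the clarify flow is small enough to sketch. This is one hypothetical way to express it, in Python for readability rather than the code v0 would generate. The `Draft` shape, the 0.7 threshold (borrowed from the Rung 4 example earlier), and the sample queries are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Draft:
    sql: str
    confidence: float  # assumed 0-1 score; how it is produced is left open
    clarifying_question: Optional[str] = None


def next_step(draft, threshold=0.7):
    """Decide whether the UI shows the SQL or asks the user to clarify first."""
    if draft.confidence < threshold and draft.clarifying_question:
        return ("clarify", draft.clarifying_question)
    return ("show_sql", draft.sql)


# Hypothetical drafts for a clear question and an ambiguous one.
clear = Draft("SELECT count(*) FROM orders;", 0.92)
vague = Draft(
    "SELECT * FROM customers ORDER BY id LIMIT 10;",
    0.41,
    "Top customers by revenue or by number of orders?",
)

print(next_step(clear))
print(next_step(vague))
```

The design choice to notice is that a low-confidence draft without a clarifying question still falls through to showing the SQL; whether that is acceptable is exactly the kind of thing to watch pilot users react to.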

Days 5-14 (Rung 5): Recruit 6 real analytics users from your customer base who've complained about the time it takes to write queries by hand. Give them the Rung 4 prototype with a short email describing the idea. Watch their usage. 4 of 6 come back after Day 1. 2 integrate it into their daily workflow within a week. 1 says "I would pay $50/month for this if it were reliable." Insight: the feature has pull; the disambiguation pattern is what makes it work; the unreliability concern is the thing to solve next. Two weeks.

Roadmap review: you walk in with specific findings. The feature is shape-matched, disambiguation is the load-bearing UX, 2 of 6 pilot users have built it into their daily workflow, 1 has named a price they would pay, and the main concern is reliability. You're asking engineering to build the feature with specific requirements, not to "explore AI for SQL." The conversation is far more productive than it would have been without the ladder.

That whole sequence took about two weeks of calendar time, only a few days of it your own hands-on work, and zero engineering sprints. It produced more evidence than a traditional "spec first, engineering prototype second, pilot third" flow would have produced in two months. This is what the ladder is for.

When NOT to use the ladder

A few honest caveats. The ladder is powerful but wrong for a few situations:

Highly regulated or sensitive data. Don't test with real production data in consumer chat interfaces. Use anonymised proxies.
Features that fundamentally require backend integration. Some features only make sense inside the real product's data and permissions. For these, Rungs 1-3 still work; Rungs 4-5 need engineering help.
Features where the shape of the interaction is novel. If the whole feature is about a new UI paradigm nobody has seen, vibe-coding tools may not be able to generate it, and you'll need a designer to mock it up in Figma first.
Cases where leadership won't accept "PM-generated evidence." Politically, some organisations demand engineering-built prototypes. Fight that, but pick your battles.

The ladder is a tool, not a mandate. Climb as high as you need and no higher.

The failure mode: "I'll just wait for engineering"

The specific failure mode that makes this whole post necessary: PMs waiting for an engineering sprint to produce a prototype, and in the meantime losing three weeks to back-and-forth on a scope that could have been settled in a day. I've watched this play out dozens of times. The PM has a vague feature idea. Engineering says "we need more spec." The PM writes more spec based on intuition instead of evidence. Engineering says "we need a prototype to validate." The PM says "can you build one?" Engineering says "we have other work first." Three weeks disappear.

Every week the PM spent waiting was a week they could have been climbing the ladder on their own. They didn't, because the culture said "PMs don't prototype" — a norm from an earlier era when prototyping required code. That norm is obsolete. A PM who climbs the ladder every quarter ships faster, earns more credibility with engineering, and makes better scoping decisions. A PM who doesn't, waits.

The defence is cultural more than technical. You need to give yourself permission to prototype. And you need to show your team the artifact afterwards, with a short writeup, to normalise the pattern. Two or three ladder-climbs later, your engineering team will start asking you to climb the ladder before you write the spec. That's when the pattern has set in.

What just changed in your roadmap

Pick one AI feature idea on your backlog this week. Climb the ladder. Rung 1 in 30 minutes, Rung 2 in an afternoon if Rung 1 survives, Rung 3 the next day if Rung 2 survives. Stop at the rung that answers your decision question.
Do not wait for engineering to prototype. The delta in speed between "you do it" and "engineering does it" is enormous, and engineering rarely has the sprint capacity anyway.
Build a Rung 3 eval-style sheet on every serious AI idea. It takes half a day and it produces a quantitative claim you can defend in reviews.
Use vibe-coding tools (v0, Lovable, Bolt) for Rung 4. Describe the feature in English; iterate to a clickable demo; throw it away after.
Never productionise a Rung 4 prototype. Use it to inform real engineering; delete it when the real thing ships.
Recruit pilot users outside your team. Friendly users are worse than useless. Bias toward people who've complained about the specific problem you're solving.
Walk into your next roadmap review with an artifact, not a pitch. The conversation shifts dramatically in your favour.

Next post, P2.4, closes Module P2 with the thing every PM eventually has to hand to engineering: the one-pager for an AI feature. The specific PRD shape that survives first contact with engineering, with concrete sections that account for the non-determinism tax, the capability band, the shape of the problem, and the gaps you need to close before shipping. Four modules of framework plus five rungs of prototyping all boil down into one page.

📚 AI for Product · Course Home — 20 posts, five modules.

Cover photo via Unsplash. This post is part of the AI for Product series.
