Reading AI Vendor Claims Without Getting Fooled

Every AI vendor pitch deck in 2026 looks like the last one. Here is the seven-question skeptic kit for senior buyers who have 20 minutes to tell a real claim from a rehearsed demo.


Bottom line. Every AI vendor pitch deck you see in 2026 contains the same four or five impressive claims, the same benchmark chart, the same customer logo strip, and the same "it just works" demo. The substance varies enormously. A senior buyer's job in the 20-minute vendor meeting is not to evaluate the claims directly; it is to run seven specific questions against them, and to know the three patterns vendors use to make weak claims sound strong. This briefing is that kit, calibrated for the 2026 vendor landscape. Print it. Bring it to your next vendor call. Use it.

In Course 4 so far you have the state-of-the-field map and the capability allocation filter. L1.3 is the third tool in the leadership kit: how to interrogate a claim before you write a check. Not a deep procurement framework — a 20-minute skeptic kit for the meetings where a senior leader's attention decides whether a vendor moves into pilot or stays out. The cost of getting this wrong is a quarter of wasted engineering time and a vendor bill with no return. The cost of getting it right is one afternoon of practice.


The three patterns every AI vendor uses

Start by naming the patterns. Almost every weak vendor claim in 2026 fits one of these three shapes, and spotting the shape is half the work.

Pattern 1: benchmark inflation. The vendor shows a chart where their product outperforms a competitor or a baseline on some benchmark. The number is real; the benchmark is not representative of your workload. "We beat GPT-5 on our internal eval" is technically true and strategically meaningless because the eval was designed to highlight what they do well and not what matters to you.

Pattern 2: demo abstraction. The vendor shows a live demo with a carefully chosen input, a clean environment, and a pre-tested prompt. The demo works flawlessly. Your workload involves a thousand distinct inputs, adversarial users, integration with legacy data, and the specific edge cases the demo carefully avoided. The gap between demo and your production is exactly what Course 3's P2.2 covered, and it applies here.

Pattern 3: customer logo theatre. The vendor's slide has 12-20 customer logos. Some are paying customers, some are pilot users, some are companies that ran a 30-day evaluation and moved on, some are companies whose "usage" is an internal champion experimenting with a free tier. The logo strip is designed to imply momentum; the underlying usage is often much shallower.

Each pattern is defensible in a sales context — the vendor is doing their job — and each is an information gap for the buyer. The seven questions below exist to close the gaps.

Seven questions, one informed decision. Each question is a specific ask the vendor should be able to answer concretely; evasion on any of them is signal.


The seven questions

Question 1: "Can we run the evaluation on our own data, not yours?"

Why it matters: this single question breaks pattern 1 (benchmark inflation) cleanly. If the vendor's performance claims only hold on their curated evaluation, the claims are not transferable to your workload. Every credible vendor in 2026 has a "bring your own data" evaluation path; if they resist, that tells you something.

What a good answer looks like: "Yes, we can run a 50-200 case evaluation on your data within 1-2 weeks under NDA, and we'll share the results including failure cases." What a bad answer looks like: "Our eval set is representative of most customers" or "we can't run on customer data for privacy reasons" (they can: an NDA plus anonymised data is a routine arrangement).

The cost to the buyer: 1-2 hours assembling the evaluation data, NDA signature, 1-2 weeks of elapsed time. Cheap in absolute terms; dispositive of most vendor decisions.
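
When the vendor returns results from a bring-your-own-data evaluation, the two things worth extracting are the pass rate and the list of failure cases. The sketch below shows one way to summarise them, assuming you record each case yourself. The `EvalCase` fields and the `summarise_eval` helper are hypothetical names for illustration, not part of any vendor's tooling.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalCase:
    case_id: str             # hypothetical identifier for one of your own test cases
    passed: bool             # your reviewer's pass/fail judgment, not the vendor's
    failure_note: str = ""   # short label for what went wrong; empty if it passed


def summarise_eval(cases: list[EvalCase]) -> dict:
    """Return the pass rate plus failure cases grouped by their note."""
    if not cases:
        raise ValueError("evaluation set is empty")
    failures = [c for c in cases if not c.passed]
    passed = len(cases) - len(failures)
    return {
        "total": len(cases),
        "passed": passed,
        "pass_rate": passed / len(cases),
        "failures_by_note": Counter(c.failure_note or "unlabelled" for c in failures),
        "failed_case_ids": [c.case_id for c in failures],
    }


if __name__ == "__main__":
    # Made-up sample data, purely to show the shape of the output.
    sample = [
        EvalCase("case-001", True),
        EvalCase("case-002", False, "wrong customer name"),
        EvalCase("case-003", True),
        EvalCase("case-004", False, "wrong customer name"),
    ]
    print(summarise_eval(sample))
```

Grouping failures by a short note makes it easier to compare what you observed against the failure modes the vendor describes.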

Question 2: "What does a failure look like, and how often?"

Why it matters: vendors who understand their own product can describe its failure modes specifically. Vendors who can't are either new to production or hiding something. In 2026, any AI product in production has hit specific failure modes repeatedly — hallucinations in particular corners, integration edge cases, specific input patterns that break — and a senior engineer at the vendor should be able to name them.

What a good answer looks like: "Our main failure modes are X, Y, and Z. X happens in about 2% of cases on our benchmark, and we handle it with a specific mitigation. Y is harder; it shows up on one particular shape of input." What a bad answer looks like: "We haven't seen failures at our customers" or "the model is very reliable." These are red flags, not reassurance.

The calibration: any product that has been in production for 6+ months has seen at least 3-5 distinct failure modes. A vendor who names zero is not honest; a vendor who names them specifically is sophisticated.

Question 3: "Who owns compliance for the data we send you, and what does your DPA say about training?"

Why it matters: the compliance questions from Course 3's P4.2 apply to your vendor evaluations, too. In 2026, a credible B2B AI vendor has an AI-specific DPA, explicit "no training on your data" language, regional processing options, and a path to answer security questionnaire items within days. A vendor without these is not enterprise-ready regardless of their product quality.

What a good answer looks like: "Our DPA is here. It specifies no training on customer data, SOC 2 Type II with AI features in scope, optional EU-only processing, and a 24-hour incident notification clause." What a bad answer looks like: "We're working on our compliance story" or "we can provide a DPA on the enterprise tier."

The specific trap: vendors who gate basic compliance behind an enterprise tier. The compliance language costs the vendor the same to provide whether the customer pays $20K or $200K a year; gating it means the vendor is using compliance as a pricing wedge, which is a red flag about how they treat customers overall.

Question 4: "What's the real cost at my expected volume, including all inference, tooling, and support?"

Why it matters: many 2026 AI vendor pricing models are designed to look cheap at low volume and expensive at high volume. "$0.10 per call" sounds fine until you do the math at your expected usage and realise you're looking at $80K/month. The vendor's sales motion rarely volunteers the total; the buyer has to do the math.

What to ask specifically: "Given that we expect 50,000 calls per day at an average prompt size of 3,000 tokens in and 400 out, what's our monthly bill in the first year, including overage charges, support fees, and any infrastructure costs?" The vendor should produce a number within 24 hours. If they can't, their pricing is opaque.
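
The buyer-side math can be sketched before the vendor replies. The function below is a minimal estimate, assuming a 30-day month and pricing made up of per-token rates, an optional per-call fee, and flat monthly fees. The call volume and token sizes come from the question above; every price in the example is a placeholder, not a real vendor rate, and `monthly_bill_usd` is a hypothetical helper name.

```python
def monthly_bill_usd(
    calls_per_day: int,
    tokens_in_per_call: int,
    tokens_out_per_call: int,
    usd_per_million_tokens_in: float,
    usd_per_million_tokens_out: float,
    usd_per_call: float = 0.0,
    flat_monthly_fees_usd: float = 0.0,
    days_per_month: int = 30,
) -> float:
    """Estimate one month's bill from expected volume and unit prices."""
    calls = calls_per_day * days_per_month
    input_cost = calls * tokens_in_per_call / 1_000_000 * usd_per_million_tokens_in
    output_cost = calls * tokens_out_per_call / 1_000_000 * usd_per_million_tokens_out
    per_call_cost = calls * usd_per_call
    return input_cost + output_cost + per_call_cost + flat_monthly_fees_usd


if __name__ == "__main__":
    # Volume figures are from the question above; all prices are placeholders.
    estimate = monthly_bill_usd(
        calls_per_day=50_000,
        tokens_in_per_call=3_000,
        tokens_out_per_call=400,
        usd_per_million_tokens_in=3.0,     # hypothetical rate
        usd_per_million_tokens_out=15.0,   # hypothetical rate
        flat_monthly_fees_usd=2_000.0,     # hypothetical support fee
    )
    print(f"Estimated monthly bill: ${estimate:,.0f}")  # $24,500 with these placeholders
```

Overage tiers and infrastructure costs are not modelled here; they would need the vendor's actual rate card. Swapping in the vendor's quoted numbers and comparing the result with their written answer is a quick check on whether the pricing is as transparent as claimed.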

The specific red flag: pricing that shifts based on your organisation's size rather than your usage. "We charge based on your revenue" or "enterprise pricing is custom" with no underlying unit-economics model usually means the vendor is pricing on your ability to pay, not on their cost to serve. This becomes expensive fast.

Question 5: "Can we talk to reference customers before we negotiate?"

Why it matters: this is the pattern 3 (logo theatre) breaker. A logo on a slide tells you nothing; a 30-minute conversation with the named technical lead at that customer tells you everything. A credible vendor in 2026 has 3-5 reference customers willing to talk to prospective buyers under NDA, and the conversations are not scripted.

What a good answer looks like: "We have 4 reference customers at similar scale to you. Here are their names, the use cases, and the introductions. Our ops lead will join the call to answer technical questions you don't want the customer to have to answer." What a bad answer looks like: "Our customers don't typically take reference calls" or a single reference who turns out to be an investor, an early pilot, or someone the vendor's executives are friends with.

The cost-free test: ask for the reference call before the commercial negotiation. Vendors who only offer references after contract signing are giving you no information, because by then you're too committed to act on it.

Question 6: "What changed in your product in the last 90 days, and what's changing in the next 90?"

Why it matters: AI vendor products move fast. A product that was great 6 months ago may have shifted in ways that matter — the underlying model changed, the prompt was re-tuned, a feature was removed, a capability was added. Asking about the recent trajectory tells you whether the product is still under active development, whether the team is shipping in sensible directions, and whether any of the specific capabilities you're buying today are at risk of being changed or removed.

What a good answer looks like: "Last quarter we swapped our primary model from Claude Sonnet 3.5 to 4.6, re-ran our eval set, and gained 6 points of pass rate. Next quarter we're adding structured reasoning for long documents, which will affect Feature X. Nothing we're planning removes functionality you'd depend on." What a bad answer looks like: "Our product is stable" (suspicious: AI products aren't stable in 2026) or "we're pivoting to agents" (a warning: pivots create gaps).

The subtext: you're asking about the vendor's engineering discipline. Teams that can describe their recent changes in specific terms are teams that ship well. Teams that can't are teams that drift.

Question 7: "What would make you recommend we not buy?"

Why it matters: this is the honesty test. Any competent salesperson should be able to answer this. The answer separates vendors who understand their product's limits from vendors who pretend the product fits every use case. "We're not a fit for real-time sub-100ms latency" or "we're not great at extremely small context windows" are the shapes of real answers. "We're a fit for everyone" is the shape of a non-answer.

What a good answer looks like: a specific 2-3 item list of where the product genuinely doesn't fit, with one honest caveat about a current limitation the vendor is working on. What a bad answer looks like: some form of "we're great for everything" or deflection to "we'd love to understand your needs better."

The vendor reaction to this question tells you a lot. A vendor who pauses, thinks, and produces a real list respects you as a buyer. A vendor who laughs and pivots is selling, not qualifying. The former is likely to be honest about product limits during deployment; the latter is likely to produce the "capability we promised isn't actually working" conversation in month 3.


A worked example: running the seven questions on a realistic pitch

Let me walk through a realistic 2026 vendor evaluation to show how the questions play out.

The setup: you're the COO of a 400-person B2B SaaS company. A vendor, call them Helios AI, is pitching their "customer success AI" that auto-drafts QBR emails, call summaries, and expansion recommendations from your CRM and meeting data. Price: $80K/year for 80 seats. The sales team is excited. Your CS leader is leaning yes. You have 30 minutes with Helios tomorrow.

In the meeting, you run the seven:

  • Q1 (evaluation on our data): Helios agrees to a 100-case eval on your anonymised historical data within 2 weeks under NDA. Good.
  • Q2 (failure modes): Helios names 3 specific failure modes from their production usage, including "we underperform on CS workflows for companies with <50 customers per rep because we need more historical data per customer." Specific, testable, honest. Good.
  • Q3 (compliance and training): they have a standard DPA with explicit "no training" language, SOC 2 Type II with AI in scope, EU data residency optional at higher tier. Good for US, check for EU.
  • Q4 (real cost): at 80 seats × 15 draft calls/day × 22 working days × $0.18/call (they quote this transparently) = ~$4,750/month in variable usage fees on top of the $80K/year seat cost. Total first-year cost: ~$140K, not $80K. They answered directly, which is good; the answer is higher than the slide implied, which is useful to know.
  • Q5 (customer references): they offer two reference calls at customers of similar size. One customer is available this week; the other within two weeks. Good — set up the calls.
  • Q6 (recent and planned changes): Helios swapped from GPT-5-mini to Claude Sonnet 4.6 last quarter (quality improvement). Next quarter they're adding "meeting intent detection" which affects the call summary feature you're buying. Specific and verifiable — ask for the release notes.
  • Q7 (reasons not to buy): "If your CSMs spend more than 40% of their time on outbound growth instead of inbound retention, our automation fits less well — we're stronger on retention workflows. Also, if your CRM is Salesforce, we're native; if you use HubSpot, integration is more manual right now." Honest and specific. Good answer.
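
The Q4 arithmetic in the list above is worth reproducing yourself rather than taking from the vendor's slide. The sketch below does so, with one caveat: the per-call price is the sensitive input, and the figures of roughly $4,750/month and ~$140K for the first year correspond to $0.18 per call. `first_year_cost` is a hypothetical helper name, and the defaults are the Helios numbers from the example.

```python
def first_year_cost(
    seats: int = 80,
    calls_per_seat_per_day: int = 15,
    working_days_per_month: int = 22,
    usd_per_call: float = 0.18,
    annual_seat_fee_usd: float = 80_000.0,
) -> tuple[float, float]:
    """Return (monthly variable fee, first-year total) in USD."""
    calls_per_month = seats * calls_per_seat_per_day * working_days_per_month
    monthly_variable = calls_per_month * usd_per_call
    first_year_total = annual_seat_fee_usd + 12 * monthly_variable
    return monthly_variable, first_year_total


if __name__ == "__main__":
    monthly, total = first_year_cost()
    print(f"Variable usage: ${monthly:,.0f}/month")   # $4,752/month
    print(f"First-year total: ${total:,.0f}")         # $137,024
```

Because the variable fee scales linearly with the per-call price, a misread decimal in the quote changes the monthly figure tenfold, which is a good reason to get the unit price in writing.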

Verdict from the seven: Helios passes all seven questions with specific, falsifiable answers. The total cost is ~75% higher than the headline but still within what you had budgeted. The failure modes are known; the reference calls are lined up; the roadmap is specific. You move to pilot with high confidence.

Counterfactual: run the same seven on a vendor who fails. On Q1 they resist the customer-data eval. On Q2 they claim "very reliable, we don't see many failures." On Q4 their real cost at your volume turns out to be 3x the quoted number. On Q5 the only reference is an investor. On Q6 they can't describe any specific changes in the last 90 days. On Q7 they say they're a fit for everyone. Six of the seven fail. You walk away after 30 minutes instead of committing $80K and a quarter of CS team time.

The seven questions are not complicated. Their power is that they're asked in order, with specific expectations for each answer. A vendor who outright fails two or more, by refusing or evading rather than merely answering weakly, is almost never worth a pilot regardless of how polished the rest of the pitch was.


The failure mode: "the demo convinced me"

The single specific failure mode that sinks senior buyers: letting a polished demo substitute for the seven questions. The vendor shows a live demo. The demo is impressive. The room feels the product is real. The buyer skips the hard questions because the demo already "proved" it works.

This happens because demos are a high-bandwidth sensory experience and questions are a low-bandwidth cognitive exercise. The demo wins the feeling; the questions win the decision. A buyer who lets the demo decide is a buyer who ships a failed pilot and blames procurement for not catching it.

The defence is specific: always ask Question 1 (evaluation on our data) before you watch the demo. A vendor who agrees to a real eval is a vendor whose demo you can trust, because the eval will confirm or deny the demo. A vendor who declines the eval is a vendor whose demo was rehearsed, and you should ignore the demo. Inverting the order — eval commitment first, demo second — breaks the emotional-then-rational trap that demos are designed to exploit.


What to decide on Monday

  • Print the seven questions and put them on your wall before your next vendor meeting. Literally print them.
  • Add Question 1 to your standard vendor intake. Every AI vendor you talk to should agree to a bring-your-own-data evaluation before you move to a pilot.
  • Require reference calls before contract signing, not after. Any vendor who delays this is not giving you information.
  • Do the full math on Question 4 before the commercial conversation. Surprise costs at scale are the most common post-pilot regret.
  • Score vendor answers on each question. Five or more weak answers = no pilot. Three or four weak = negotiate harder. One or two weak = acceptable.
  • Do not let a demo skip the questions. The demo is the vendor's best day; the questions are your every-day.
  • Train your procurement team on the seven, so vendor evaluations scale beyond your own attention.
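
The scoring rule in the list above is simple enough to write down as a function, which helps when several people score the same vendor. This is a minimal sketch: the thresholds are the ones stated in the list, while treating zero weak answers as "acceptable" is an assumption, since the list does not mention that case. `pilot_decision` and the `Q1`..`Q7` keys are hypothetical names.

```python
QUESTION_IDS = ("Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7")


def pilot_decision(weak_by_question: dict[str, bool]) -> str:
    """Map per-question 'weak answer' flags to a decision.

    5+ weak -> no pilot; 3-4 weak -> negotiate harder; 0-2 weak -> acceptable.
    (Zero weak answers is treated as acceptable by assumption.)
    """
    if set(weak_by_question) != set(QUESTION_IDS):
        raise ValueError("expected a flag for each of Q1..Q7")
    weak = sum(weak_by_question.values())
    if weak >= 5:
        return "no pilot"
    if weak >= 3:
        return "negotiate harder"
    return "acceptable"


if __name__ == "__main__":
    # Helios from the worked example: no weak answers.
    helios = {q: False for q in QUESTION_IDS}
    # The counterfactual vendor: weak on everything except Q3, which was not discussed.
    counterfactual = {q: q != "Q3" for q in QUESTION_IDS}
    print(pilot_decision(helios))          # acceptable
    print(pilot_decision(counterfactual))  # no pilot
```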

Next briefing, L1.4, closes Module L1 with the fourth leadership tool: calibrating your own organisation's AI exposure. It covers the upside-downside matrix for measuring how much AI strategy actually matters to your business, and why the answer is often "less than you think on the upside, more than you think on the downside."



📚 AI for Leaders · Course Home — 15 briefings, four modules.


Cover photo via Unsplash. This post is part of the AI for Leaders series.
