RAG Failure Modes: Seven Ways Your Retrieval Will Lie

Seven specific ways RAG systems fail in production, how to spot each one from logs, and how to fix them without blaming the model. The post I wish I had read before I shipped my first RAG feature.

You built the RAG system. You chunked well, embedded with a frontier model, added hybrid search, wired pgvector, and stood up evals. Your numbers look good. You ship.

A week later, a user reports a wrong answer. You dig in and find the system retrieved three chunks — all relevant, all factual — and then the model produced an answer that contradicted them. A month later, another user reports a wrong answer from the same system; this time the retrieved chunks were wrong, two of them from a document that was correct six months ago and is now stale. A month after that, you get a bug where the retrieved chunks are perfect, the model is perfect, but the chunks were stored without metadata, so the model's answer doesn't cite any sources and your compliance team is unhappy.

These are not bugs in one specific place. They are the characteristic failure modes of any RAG pipeline. If you ship a RAG system, you will hit at least five of the seven in this post. The difference between a team that fixes them and a team that drowns is whether they can see them in logs and know which layer to fix.

This is the last post of Module B3. I'm going to walk you through all seven, with how to spot each one and what to change. You'll notice a pattern: almost none of them are "the model hallucinated." Almost all of them are in the retrieval or pipeline layer, which is exactly where a skilled RAG team should live.


The map

Four yes/no questions, five places to land. We'll walk seven specific failure modes through this decision tree.


Failure 1 · retrieval missed the right document

The classic. The user's question had a perfectly correct answer in your corpus, but your retrieval returned five unrelated chunks. The model's answer is either a confabulation ("I don't know but here's a guess") or a technically-correct-but-useless "I couldn't find information about that."

How to spot: log, for every user query, the top-k retrieved chunks and their similarity scores. If the top score is unusually low (well below the top scores you typically see across queries), you have a coverage problem. Even more telling: spot-check a sample of user queries against the corpus by hand. Can you find the correct answer in the source documents when retrieval couldn't?
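A minimal sketch of that low-score check, assuming you already log one top similarity score per query. The function name, the z-score rule, and the cutoff of 2.0 are my own illustrative choices, not a standard; tune them against your own score distribution.

```python
from statistics import mean, stdev


def flag_low_coverage(top_scores, z_cutoff=2.0):
    """Return query IDs whose best similarity score is unusually low.

    top_scores maps a query ID to the similarity of its top-ranked chunk.
    A query is flagged when its top score sits more than z_cutoff standard
    deviations below the mean top score across all logged queries.
    """
    values = list(top_scores.values())
    if len(values) < 2:
        return []  # not enough data to estimate a baseline
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # every query scored the same, nothing stands out
    floor = mu - z_cutoff * sigma
    return sorted(q for q, s in top_scores.items() if s < floor)
```

The flagged queries are the ones worth checking against the corpus by hand first.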

How to fix:

  • Chunking is the first suspect (see B3.3). Long chunks might blur the topic. Short chunks might fragment the answer across two chunks.
  • Query expansion. Add synonyms, rephrasings, or a brief re-query using the LLM ("rephrase this question three ways"), retrieve for each variant, then fuse the result lists with reciprocal rank fusion (RRF).
  • Hybrid search (see B3.4) if the query has specific tokens the embedding model misses.
  • Increase k. Retrieve top-20 instead of top-5 and let the model see more. Only do this if your context budget allows.
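For the query-expansion bullet, here is a small sketch of reciprocal rank fusion. It assumes each query variant has already been retrieved separately and gives you an ordered list of chunk IDs. The constant k=60 is a commonly used default, not something tuned for your corpus, and the function name is mine.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk earns 1 / (k + rank) from every list it appears in
    (rank starts at 1), and chunks are returned by descending total.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda c: (-scores[c], c))


# Three rephrasings of one question, each with its own retrieval result.
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
])
```

A chunk that shows up near the top for several rephrasings beats one that ranks first for only one of them, which is the behavior you want from expansion.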

Failure 2 · retrieval returned the right docs but drowned them in distractors

Variant of the same bug but sneakier. You retrieved the correct chunk along with four others that are thematically related but irrelevant to the specific question. The model reads all five, the distractors are convincing enough that the model weighs them, and the answer is a compromise that's subtly wrong.

How to spot: the retrieved chunks look good in logs — your team nods along while reviewing them — but the answer is still wrong. This is the failure mode that hides best from automated eval, because rule-based checks often measure "the right chunk was retrieved" not "the wrong chunks were not retrieved."

How to fix:

  • Rerankers (B3.4) earn their cost here. A cross-encoder reranker is much better at distinguishing "related but not the answer" from "actually the answer."
  • Lower top-k. Use top-3 instead of top-5 — fewer distractors, even if you occasionally miss the right doc.
  • Stricter score thresholds. Drop retrieved chunks whose similarity is below a floor, even if they're in the top-k.
  • Better metadata filtering. If you can scope the query to a specific document set ("within the user's tenant's files"), you cut the distractor pool dramatically.
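The last three bullets can be sketched as one pruning step. The `Hit` shape, the `tenant_id` field, and the 0.75 floor are assumptions for illustration. In a real system you would usually push the tenant filter into the vector store query itself rather than filtering afterwards.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    chunk_id: str
    tenant_id: str
    score: float  # similarity, higher is better


def prune_hits(hits, tenant_id, min_score=0.75, max_k=3):
    """Keep only in-tenant hits above the score floor, best first, capped at max_k."""
    scoped = [h for h in hits if h.tenant_id == tenant_id and h.score >= min_score]
    scoped.sort(key=lambda h: h.score, reverse=True)
    return scoped[:max_k]
```

Note that this can return fewer than `max_k` chunks, or none at all. That is the point: an empty context is a clearer signal than a padded one.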

Failure 3 · lost in the middle

A well-documented phenomenon in long-context models: when you stuff 20 chunks into the prompt, the model tends to pay more attention to chunks at the start and end than to chunks in the middle. A correct answer that lands at chunk #10 of 20 can be silently ignored.

How to spot: if you increase top-k and quality worsens or stays flat, you're probably losing the middle. Running the same eval with different top-k values is the quickest diagnostic.
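A rough sketch of that top-k sweep. Here `answer_fn` is a hypothetical stand-in for your own pipeline (question and k in, answer out), and exact-match scoring is a deliberately crude placeholder for whatever grader your eval already uses.

```python
def sweep_top_k(eval_cases, answer_fn, k_values=(3, 5, 10, 20)):
    """Run the same eval set at several top-k values and return accuracy per k.

    eval_cases is a list of {"question": ..., "expected": ...} dicts.
    answer_fn(question, k) is your pipeline; it is an assumption of this sketch.
    """
    results = {}
    for k in k_values:
        correct = sum(
            1 for case in eval_cases
            if answer_fn(case["question"], k) == case["expected"]
        )
        results[k] = correct / len(eval_cases)
    return results
```

If accuracy is flat or falling as k grows, the extra chunks are not helping and may be burying the good one.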

How to fix:

  • Keep top-k small. 3 to 5 chunks usually beats 20, unless you're on a reasoning model designed for long context.
  • Rank by a reranker and pass the top 3 only.
  • Split the prompt. Instead of 20 chunks in one call, do 4 calls of 5 chunks each and fuse the answers.
  • Recent reasoning models are somewhat better at long-context attention, but not immune. Test on your own data.
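The "split the prompt" option could look roughly like this. The `ask` callable is a hypothetical wrapper around whatever model client you use (prompt in, text out), and the prompt wording is only a placeholder.

```python
def answer_in_batches(question, chunks, ask, batch_size=5):
    """Answer over small batches of chunks, then fuse the partial answers.

    ask(prompt) -> str is assumed to call your model; it is not a real API.
    With 20 chunks and batch_size=5 this makes 4 batch calls plus 1 fuse call.
    """
    partials = []
    for start in range(0, len(chunks), batch_size):
        context = "\n\n".join(chunks[start:start + batch_size])
        partials.append(ask(f"Context:\n{context}\n\nQuestion: {question}"))
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(partials, start=1))
    return ask(
        "Combine these partial answers into one answer. "
        "Ignore partial answers that say the context had no information.\n"
        f"{numbered}\n\nQuestion: {question}"
    )
```

This trades one long call for several short ones, so measure latency and cost alongside quality before adopting it.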

Failure 4 · stale documents

The retrieved chunk is correct — as of six months ago. The product has since changed, the pricing has updated, the policy has been superseded. The model reads the stale chunk, answers confidently, and quotes a number that no longer applies.

How to spot: user reports where the answer is "factually wrong" but the retrieved chunks are "factually right if you assume the world stopped six months ago." Check the updated_at timestamps on your retrieved chunks. If they're all old, you have a freshness problem.

How to fix:

  • Track update timestamps as first-class metadata on every chunk.
  • Filter by recency where it makes sense: for pricing/policy questions, exclude chunks older than N months.
  • Set up a re-ingestion pipeline that re-pulls source documents on a schedule (weekly for fast-moving content, monthly for stable).
  • Have an explicit deprecation signal. When a source document is superseded, mark its chunks as deprecated and filter them out of retrieval. Never rely on "the new version will just rank higher than the old" — it often won't.
  • Surface the document timestamp in the final answer: "Based on the Billing Policy last updated 2026-02-14." Users catch staleness faster than you will.
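Here is one way the timestamp, deprecation, and recency bullets might fit together. The `Chunk` fields and helper names are illustrative assumptions, not a schema from this series.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Chunk:
    chunk_id: str
    text: str
    updated_at: datetime      # timezone-aware, set at ingestion time
    deprecated: bool = False  # flipped when the source document is superseded


def fresh_chunks(chunks, max_age_days, now=None):
    """Drop deprecated chunks and chunks older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [c for c in chunks if not c.deprecated and c.updated_at >= cutoff]


def freshness_note(chunk, title):
    """Build the 'last updated' line to surface in the final answer."""
    return f"Based on the {title} last updated {chunk.updated_at.date().isoformat()}."
```

Apply the age filter only to question types where recency matters, such as pricing or policy. For everything else the deprecation flag alone is usually the safer filter.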

Failure 5 · contradictory retrieved chunks

You retrieved two chunks. Chunk A says "our refund window is 14 days." Chunk B says "our refund window is 30 days." Both are in your corpus. Neither is marked stale. The model either picks one (essentially at random) or averages them ("between 14 and 30 days"). Either way, the user gets a wrong-for-them answer.

How to spot: the retrieved chunks disagree with each other in obvious ways. A quick sanity check: log the retrieved chunks alongside the answer, and spot-check for internal contradictions.
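Spot-checking by eye does not scale, so a cheap heuristic can help with triage. This sketch only catches one narrow kind of conflict, different "N days" values across the retrieved chunks, and it is an assumption of mine rather than a general contradiction detector.

```python
import re

DAY_COUNT = re.compile(r"\b(\d+)\s+days?\b", re.IGNORECASE)


def conflicting_day_counts(chunks):
    """Map each distinct day count to the chunk IDs that mention it.

    chunks maps chunk ID -> text. Returns an empty dict when the chunks
    mention at most one distinct value, which means no conflict was found.
    """
    seen = {}
    for chunk_id, text in chunks.items():
        for value in DAY_COUNT.findall(text):
            seen.setdefault(int(value), []).append(chunk_id)
    return seen if len(seen) > 1 else {}
```

It will also fire on chunks that legitimately mention different periods (a 14-day refund window and a 30-day trial), so treat a hit as a reason to look, not as proof.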

How to fix:

  • This is a data problem, not a retrieval problem. Your source of truth has drifted. Fix the sources, not the model. Merge or deprecate conflicting documents.
  • For genuinely context-dependent answers (refund policy differs by customer tier), carry the disambiguation into the query: "customer X, tier Y, what's the refund policy?" Filter retrieval by the relevant tier.
  • Have the model flag conflicts. In your prompt: "If the retrieved context contains contradictory information, say so and list both positions." Better to surface the uncertainty than confidently pick wrong.

Failure 6 · ungrounded drift (the rare real hallucination)

This is what everyone fears but is less common than people think: the retrieved chunks were correct and sufficient, and the model still produced an answer that contradicts them. This happens when the model's pre-trained prior is stronger than the retrieved evidence, or when the retrieval context is short enough to be outweighed by the model's general world knowledge.

How to spot: compare the retrieved chunks to the model's answer, word by word. If the answer asserts a fact that's neither in the chunks nor a reasonable inference from them, you have ungrounded drift.
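A word-by-word comparison is manual work, but numbers are an easy first pass to automate. The sketch below is a crude heuristic under my own assumptions: it flags any number in the answer that appears in none of the retrieved chunks.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(answer, chunks):
    """Return numbers that appear in the answer but in none of the chunk texts."""
    supported = set()
    for text in chunks:
        supported.update(NUMBER.findall(text))
    return sorted(set(NUMBER.findall(answer)) - supported)
```

It says nothing about non-numeric claims, and numeric citation markers like "[1]" in the answer will show up as false positives unless you strip them first.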

How to fix:

  • Stronger grounding instructions in the system prompt. "Answer ONLY using information from the provided context. If the answer isn't in the context, say 'I don't have information about that.' Do not use outside knowledge."
  • Cite-or-refuse. Ask the model to cite the specific chunk ID for every claim, and refuse to answer if it can't cite.
  • Lower temperature. Near-zero for RAG answers (see B1.4).
  • Don't answer if the score is low. If your top chunk's similarity is below your threshold, skip the generation step and return "I don't have enough information to answer that." Better than a confident hallucination.
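Pulling the grounding prompt, the low temperature, and the score gate together might look like this. The request shape, the 0.75 threshold, and the names are assumptions for illustration; adapt the returned dict to whatever your model client expects.

```python
GROUNDING_PROMPT = (
    "Answer ONLY using information from the provided context. "
    "If the answer isn't in the context, say 'I don't have information about that.' "
    "Do not use outside knowledge."
)
REFUSAL = "I don't have enough information to answer that."


def build_request(question, hits, min_score=0.75):
    """Return a refusal, or the pieces of a grounded model request.

    hits is a list of (chunk_text, similarity_score) pairs.
    """
    if not hits or max(score for _, score in hits) < min_score:
        # Nothing retrieved clears the bar: refuse without calling the model.
        return {"refuse": True, "answer": REFUSAL}
    context = "\n\n".join(text for text, _ in hits)
    return {
        "refuse": False,
        "system": GROUNDING_PROMPT,
        "user": f"Context:\n{context}\n\nQuestion: {question}",
        "temperature": 0.0,  # near-zero for RAG answers
    }
```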

Failure 7 · the pipeline lost the source

The retrieved chunks are perfect. The model's answer is perfect. But the system is supposed to cite sources to the user, and no citations are rendered — the chunk metadata was lost between retrieval and the final answer. From the user's perspective, the app generated an answer from nowhere, and they can't verify it.

How to spot: the answer has no citations when it should. Check the pipeline from retrieval → prompt assembly → answer → response. Find where the metadata fell off.

How to fix:

  • Pass chunk IDs into the prompt, not just the text. "Context chunk [1]: ..., Context chunk [2]: ..." and instruct the model to cite "[1]" or "[2]" in the answer. Parse those citations back out and attach the real source URLs to the response.
  • Store every chunk with source metadata (see B3.3) and carry it through the full pipeline as a structured object, never as a flat string.
  • Render citations as links in the final UI so users can click through. Verification is trust.
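The first two bullets can be sketched as a pair of helpers: one numbers the chunks on the way into the prompt, the other maps the model's "[n]" markers back to real sources. `SourceChunk` and its `url` field are hypothetical stand-ins for your own chunk object.

```python
import re
from dataclasses import dataclass


@dataclass
class SourceChunk:
    text: str
    url: str  # carried alongside the text, never flattened away


def format_context(chunks):
    """Number each chunk so the model can cite it as [1], [2], ..."""
    return "\n".join(
        f"Context chunk [{i}]: {c.text}" for i, c in enumerate(chunks, start=1)
    )


def resolve_citations(answer, chunks):
    """Map [n] markers in the answer to source URLs, in order of first use.

    Markers that point outside the chunk list are ignored, not guessed at.
    """
    urls = []
    for marker in re.findall(r"\[(\d+)\]", answer):
        index = int(marker)
        if 1 <= index <= len(chunks) and chunks[index - 1].url not in urls:
            urls.append(chunks[index - 1].url)
    return urls
```

If `resolve_citations` comes back empty for an answer that should be sourced, you know the metadata survived the pipeline and the problem is the model not citing, which narrows the search.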

How to debug any RAG failure in five minutes

A short checklist. Print these four things for any user query that got a wrong answer:

  1. The query as received by the pipeline, and the query after any expansion/rewriting.
  2. The top-k retrieved chunks — not just IDs, the actual text, with similarity scores and document timestamps.
  3. The prompt as sent to the model — the full assembled system + context + user.
  4. The model's raw output — full response, not the post-processed version.
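As a sketch, the four items fit in one JSON record per query. The field names and the shape of each hit are my own assumptions; the important part is that chunk text, scores, and timestamps are all in the same record as the prompt and the raw output.

```python
import json


def debug_record(raw_query, rewritten_query, hits, prompt, raw_output):
    """Serialize the four debug items for one query as a single JSON line.

    hits is a list of dicts with chunk_id, score, updated_at, and text keys.
    """
    record = {
        "query_received": raw_query,
        "query_rewritten": rewritten_query,
        "retrieved": [
            {
                "chunk_id": h["chunk_id"],
                "score": h["score"],
                "updated_at": h["updated_at"],
                "text": h["text"],  # the actual text, not just the ID
            }
            for h in hits
        ],
        "prompt": prompt,          # full assembled system + context + user
        "raw_output": raw_output,  # before any post-processing
    }
    return json.dumps(record, ensure_ascii=False)
```

One line per query keeps the log easy to search, so you can pull the full record for any reported wrong answer.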

Read them in order. The failure mode will usually jump out. If the retrieval is bad, it's in step 2. If retrieval is good but the answer ignored it, it's in step 3 or 4. If retrieval is good and the answer respected it but the user is still unhappy, the bug is upstream — maybe the source docs are stale, or the user's question was ambiguous, or the answer simply wasn't what they needed.

Teams that cannot produce these four things on demand are teams that cannot debug RAG systematically. The first thing I do on any new RAG codebase is wire up this four-line log, because every subsequent debugging session starts here.


Admit what breaks

  • These seven failure modes overlap. A single wrong answer can involve stale data and lost in the middle and a missing citation. Triage means identifying the primary cause, not the only cause.
  • Evals lag production. Your eval set catches what you thought to test. Production users ask questions you didn't anticipate. Expect to backfill the eval set from production bug reports — that's the healthy cycle.
  • Grounding instructions don't completely prevent drift. They reduce it, they don't eliminate it. For high-stakes answers, add an output-side check that confirms every claim maps back to the retrieved chunks. Expensive but honest.
  • Fixes have costs. Rerankers add latency. Filtering by recency can miss useful older documents. Top-k reduction can miss the correct chunk. Every fix has a trade-off, and you should measure before and after.
  • Some failures are upstream from your team. If your source documents contradict each other, retrieval can't save you. If users ask ambiguous questions, no pipeline can read minds. Know where the boundary is.

What just changed in your code

  • Log the four debug items on every user query: the query, the retrieved chunks, the assembled prompt, the raw output. Without these, you cannot debug RAG.
  • Add timestamps and deprecation flags to every chunk. Freshness is a first-class field.
  • Carry source metadata end-to-end and render citations as clickable links in the final UI.
  • Use grounding instructions in the system prompt: answer only from the provided context.
  • Build a "cite or refuse" path: require a chunk citation for every claim, and when the top retrieved chunk is below a score threshold, respond "I don't have information on that" instead of guessing.
  • Backfill your eval set from production bug reports. Every wrong-answer incident should add one case to the eval.

And that closes Module B3 — Retrieval, Really. You now have embeddings, the right place to store them, chunking that actually works, hybrid search, and a debugging map for every kind of failure you'll see. You can ship a RAG feature and diagnose it when it breaks.

Next up is Module B4 — Tools and Agents. We leave the retrieval world and get into the part everyone wants to build: LLMs that don't just answer, they act. Tool use, the agent loop written by hand, planning-vs-reacting, and why "more agents" is usually a bug, not a feature.


📚 AI for Builders · Course Home — 28 posts, six modules.


Cover photo via Unsplash. This post is part of the AI for Builders series.
