RAG systems fail quietly. Evaluate them like they will.

A retrieval-augmented system rarely crashes. It just starts answering confidently from the wrong page. Honest evaluation is the only defence.

By Imran Haiqal · 6 min read

Retrieval-augmented generation, which lets a language model answer from your documents instead of its memory, is the most-requested AI capability in business right now, and for good reason. But RAG has a dangerous property: it fails silently. A broken dashboard shows an error. A broken RAG system shows a fluent, confident, wrong answer.

The two failure points, and why they compound

Every RAG answer has two stages: retrieval finds passages, generation writes from them. If retrieval brings back the wrong passage, the model summarizes the wrong truth beautifully. If retrieval is right but generation drifts, the model decorates a good source with invented details. Users can't tell which failure they're seeing, because both arrive in the same polished prose. That's why 'it looked right in the demo' means very little.

Evaluate the stages separately

The single most useful practice: score retrieval on its own before judging the end-to-end answer. Build a test set of real questions with known source passages (even fifty is transformative) and measure whether the right passage appears in what's retrieved. Retrieval problems are cheap to fix (chunking, indexing, query rewriting); generation problems are expensive. Knowing which one you have saves weeks.
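One way to put a number on this is a hit rate at k: the share of test questions whose known source passage shows up in the top k retrieved results. What follows is a minimal sketch, not a definitive harness. The names `hit_rate_at_k`, `toy_retrieve`, and `expected_passage_id`, the passage ids, and the choice of k=5 are all assumptions made for illustration; in practice you would swap `toy_retrieve` for a call into your own retriever.

```python
from typing import Callable, Dict, List


def hit_rate_at_k(
    test_set: List[Dict[str, str]],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose known source passage is in the top-k results."""
    if not test_set:
        raise ValueError("test set is empty")
    hits = 0
    for case in test_set:
        # Only the first k retrieved passage ids count towards a hit.
        retrieved_ids = retrieve(case["question"])[:k]
        if case["expected_passage_id"] in retrieved_ids:
            hits += 1
    return hits / len(test_set)


def toy_retrieve(question: str) -> List[str]:
    """Hypothetical stand-in for a real retriever: returns ranked passage ids."""
    index = {
        "What is the refund window?": ["policy-12", "policy-03"],
        "Who approves travel expenses?": ["hr-07", "finance-02"],
    }
    return index.get(question, [])


# Hypothetical test set: real questions paired with the passage that answers them.
test_set = [
    {"question": "What is the refund window?", "expected_passage_id": "policy-12"},
    {"question": "Who approves travel expenses?", "expected_passage_id": "finance-09"},
]

score = hit_rate_at_k(test_set, toy_retrieve, k=5)
print(f"hit rate@5: {score:.2f}")  # prints "hit rate@5: 0.50"
```

Tracking this one number across changes to chunking or indexing tells you whether retrieval improved, without the generation stage muddying the result.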

Make groundedness measurable

For the generation side, the question is: does every claim in the answer trace to a retrieved passage? This can be checked in three ways: by a second model acting as a judge, by spot-checking samples weekly, or, at minimum, by always showing sources in the interface so users can verify for themselves. If your system can't cite where an answer came from, it isn't an honest RAG system yet; it's an oracle you're choosing to trust.
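To make the idea concrete, here is a deliberately crude sketch of a groundedness check. It is an assumption-heavy stand-in, not the judge-model approach itself: it splits the answer into sentences and flags any sentence whose words mostly do not appear in a retrieved passage. The function name `ungrounded_claims` and the 0.6 overlap threshold are hypothetical, and word overlap will miss paraphrases and subtle contradictions that a second model or a human reviewer would catch.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ungrounded_claims(
    answer: str, passages: List[str], min_overlap: float = 0.6
) -> List[str]:
    """Return answer sentences with too little word overlap with every passage.

    Crude lexical proxy for groundedness; the threshold is an illustrative guess.
    """
    passage_tokens = [_tokens(p) for p in passages]
    flagged = []
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        # Best overlap against any single passage (0.0 if nothing was retrieved).
        best = max(
            (len(words & p) / len(words) for p in passage_tokens), default=0.0
        )
        if best < min_overlap:
            flagged.append(sentence)
    return flagged


passages = ["Refunds are accepted within 30 days of purchase."]
answer = (
    "Refunds are accepted within 30 days of purchase. "
    "Shipping costs are always reimbursed in full."
)

flagged = ungrounded_claims(answer, passages)
print(flagged)  # prints ['Shipping costs are always reimbursed in full.']
```

Even a check this rough gives you a weekly list of sentences worth a human look, which is where the spot-checking starts.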

Decide what 'good enough' means before launch

The uncomfortable conversation worth having early: what's the cost of a wrong answer here? A RAG assistant for internal documentation can tolerate the occasional miss. One that answers compliance or contract questions cannot. The threshold decides the architecture: how much human review, how conservative the refusals, whether the system should say 'I don't know' far more often than feels impressive. A system that declines gracefully is doing its job; one that never declines is hiding its failures.
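Written down as configuration, that decision might look like the sketch below. Everything in it is hypothetical: the use-case names, the `Policy` fields, and the score thresholds are illustrative assumptions, and the idea of gating on a single top retrieval score is one simple option among many.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    min_retrieval_score: float  # below this, the system declines to answer
    human_review: bool          # whether answers are held for a reviewer


# Illustrative values only: stricter thresholds where a wrong answer costs more.
POLICIES = {
    "internal_docs": Policy(min_retrieval_score=0.5, human_review=False),
    "compliance": Policy(min_retrieval_score=0.8, human_review=True),
}


def decide(use_case: str, top_score: float) -> str:
    """Return 'decline', 'review', or 'answer' for one query."""
    policy = POLICIES[use_case]
    if top_score < policy.min_retrieval_score:
        return "decline"  # say "I don't know" rather than guess
    return "review" if policy.human_review else "answer"


print(decide("internal_docs", 0.6))  # prints "answer"
print(decide("compliance", 0.6))     # prints "decline"
print(decide("compliance", 0.9))     # prints "review"
```

The same retrieval score leads to an answer in one setting and a refusal in another, which is the point: the cost of a wrong answer, not the model, sets the bar.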
