CRAG: Grade Your Retrieval Before You Trust It
Retrieval can fail silently. CRAG adds a confidence-scoring evaluator between the retriever and the generator — discarding bad chunks before the LLM ever sees them.
HYPOTHESIS
H₀: THE PROBLEM
RAG systems assume that if retrieval returns something, it returned something relevant. That assumption is wrong often enough to matter. Dense retrieval produces results for every query — it has no way to say “nothing useful was found.” The cosine similarity score tells you the best match found, not whether that match is actually good.
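The "always returns something" behaviour is easy to reproduce. The sketch below is a toy top-k search in plain Python; the three-dimensional embeddings, document names, and the `top_k` helper are all made up for illustration and do not come from any particular vector store.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, corpus, k=2):
    """Return the k best (score, doc_id) pairs. It never returns an empty list
    for a non-empty corpus, however poor the best match is."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in corpus.items()]
    return sorted(scored, reverse=True)[:k]

# Toy embeddings (hypothetical values).
corpus = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.8, 0.3, 0.1],
    "api-rate-limits": [0.1, 0.9, 0.2],
}
off_topic_query = [0.0, 0.1, 1.0]  # points in a direction no document covers

results = top_k(off_topic_query, corpus)
for score, doc_id in results:
    print(f"{doc_id}: {score:.2f}")
```

Two documents come back even though the best similarity is low. The ranking says which chunk is closest, and nothing in the retriever says the closest chunk is good enough.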
When irrelevant context reaches the LLM, the LLM doesn’t say “I don’t know”. It reasons from the bad context and produces a confident, hallucinated answer. Prompt guardrails (“only answer from the provided context”) reduce this — but they don’t eliminate it, and they don’t fix the root cause.
CRAG (Corrective Retrieval Augmented Generation) adds an evaluator between the retriever and the generator. Each retrieved chunk gets a verdict: CORRECT (clearly relevant), AMBIGUOUS (partially relevant), or INCORRECT (not relevant). The system routes on those verdicts: it generates from CORRECT chunks only, refines the retrieval when the results are merely AMBIGUOUS, or falls back to another source when nothing is relevant.
LAYMAN EXPLANATION
Think of a researcher preparing a briefing. A naive approach: hand them the first five search results and tell them to write the report. They’ll use whatever they get, even if two of the five results are irrelevant — the brief will sound confident even where it’s wrong.
CRAG is the editorial review step. Before the researcher writes, an editor reads the five sources and marks each one: solid reference, questionable, unrelated. The researcher only uses the solid references. If none pass, they go find better sources rather than writing from bad ones.
The evaluator in CRAG is itself a language model acting as a lightweight judge (the original paper fine-tunes a T5-large model for this role). It receives the query and each chunk and outputs a relevance verdict. The cost is one additional evaluation pass per retrieval call, but the evaluator is typically far cheaper to run than the generator, and it guards against the expensive failure mode of confident hallucination.
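A minimal sketch of the judging step, assuming the evaluator is reached through a plain `judge(prompt) -> str` callable. The prompt wording, the `grade_chunks` helper, and the keyword-overlap `fake_judge` stand-in are all hypothetical; a real system would swap in a model call. Treating an unparseable reply as AMBIGUOUS is also an assumption of this sketch.

```python
from typing import Callable, List

VERDICTS = ("CORRECT", "AMBIGUOUS", "INCORRECT")

PROMPT = (
    "Query: {query}\n"
    "Chunk: {chunk}\n"
    "Is the chunk relevant to the query? "
    "Answer with one word: CORRECT, AMBIGUOUS, or INCORRECT."
)

def grade_chunks(query: str, chunks: List[str], judge: Callable[[str], str]) -> List[str]:
    """Ask the judge for one verdict per chunk.
    Replies that are not a known verdict are treated as AMBIGUOUS (an assumption)."""
    verdicts = []
    for chunk in chunks:
        reply = judge(PROMPT.format(query=query, chunk=chunk)).strip().upper()
        verdicts.append(reply if reply in VERDICTS else "AMBIGUOUS")
    return verdicts

def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call: counts words shared by query and chunk."""
    query_line, chunk_line = prompt.splitlines()[:2]
    query_words = set(query_line[len("Query: "):].lower().split())
    chunk_words = set(chunk_line[len("Chunk: "):].lower().split())
    overlap = len(query_words & chunk_words)
    if overlap >= 2:
        return "correct"
    return "ambiguous" if overlap == 1 else "incorrect"

verdicts = grade_chunks(
    "refund window length",
    [
        "the refund window is 30 days",
        "shipping takes one week",
        "refund requests go by email",
    ],
    fake_judge,
)
print(verdicts)
```

This version calls the judge once per chunk. Batching all chunks into a single prompt is another option, at the price of a more fragile output format.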
LIVE DEMO
[Interactive demo: type any query. CRAG evaluates a set of hypothetical retrieved chunks and shows the verdict for each (CORRECT, AMBIGUOUS, or INCORRECT), with the routing decision at the end.]
↑ Notice that the evaluator doesn’t just score — it routes. No CORRECT chunks means no generation from bad context.
THE MATH
Standard RAG passes all retrieved chunks directly to the generator.
CRAG inserts an evaluation step. For each retrieved chunk, the evaluator assigns a relevance score and a verdict.
The routing decision depends on the evaluation outcome. The confidence threshold is the key engineering parameter: it controls the precision-recall tradeoff.
↑ Low threshold: more chunks pass, fewer fallbacks triggered, lower precision. High threshold: fewer chunks pass, more fallbacks, higher precision. Set based on your knowledge base quality and how expensive fallbacks are.
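One way to write these steps down. The notation is assumed here for illustration: q is the query, d_1 … d_k are the retrieved chunks, G is the generator, E is the evaluator, and τ is the confidence threshold. The lower cutoff τ_low that separates AMBIGUOUS from INCORRECT is also an assumption, since the text above names only one threshold.

```latex
\begin{align*}
\text{Standard RAG:}\quad & y = G\bigl(q,\ \{d_1, \dots, d_k\}\bigr) \\
\text{Evaluation:}\quad & s_i = E(q, d_i) \in [0, 1] \\
\text{Verdict:}\quad & v_i =
  \begin{cases}
    \text{CORRECT}   & s_i \ge \tau \\
    \text{AMBIGUOUS} & \tau_{\text{low}} \le s_i < \tau \\
    \text{INCORRECT} & s_i < \tau_{\text{low}}
  \end{cases} \\
\text{CRAG generation:}\quad & y = G\bigl(q,\ \{d_i : v_i = \text{CORRECT}\}\bigr)
  \quad \text{if that set is non-empty}
\end{align*}
```

Under this formulation, raising τ shrinks the set of CORRECT chunks, which is why a higher threshold means higher precision and more fallbacks.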
| Evaluation outcome | Action | Why |
|---|---|---|
| ≥1 CORRECT chunk | Generate using only CORRECT chunks | Irrelevant chunks discarded before generation |
| Only AMBIGUOUS chunks | Rewrite query, re-retrieve, or use web fallback | Partial signal — refining may surface better chunks |
| ★ All INCORRECT | Fallback (web search / 4-route system) | No internal knowledge — retrieve externally or decline gracefully |
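The table can be sketched as a small routing function. The `route` name and the action labels are hypothetical. The table does not say what to do with a mix of AMBIGUOUS and INCORRECT chunks, so this sketch assumes such a mix is handled like the AMBIGUOUS row.

```python
from typing import List, Tuple

def route(graded: List[Tuple[str, str]]) -> Tuple[str, List[str]]:
    """Map (chunk, verdict) pairs to an action and the chunks to generate from."""
    correct = [chunk for chunk, verdict in graded if verdict == "CORRECT"]
    if correct:
        # At least one CORRECT chunk: generate from the CORRECT chunks only.
        return "generate", correct
    if any(verdict == "AMBIGUOUS" for _, verdict in graded):
        # Partial signal: rewrite the query, re-retrieve, or try the web.
        # (Assumption: AMBIGUOUS mixed with INCORRECT is treated the same way.)
        return "refine", []
    # Everything was INCORRECT, or nothing was retrieved at all.
    return "fallback", []

action, context = route([
    ("refund window is 30 days", "CORRECT"),
    ("shipping takes one week", "INCORRECT"),
])
print(action, context)
```

Returning the surviving chunks together with the action keeps the generator from ever seeing a chunk that did not pass.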
REFERENCE PAPERS
| Paper | Year | Key contribution |
|---|---|---|
| Corrective Retrieval Augmented Generation (Yan et al.) | 2024 | Original CRAG paper — introduces the retrieval evaluator with CORRECT/AMBIGUOUS/INCORRECT routing and web search fallback |
| Self-RAG: Learning to Retrieve, Generate, and Critique Through Self-Reflection (Asai et al.) | 2023 | Related: LLM trained to generate reflection tokens inline rather than using a separate evaluator model |
| Is Your LLM Secretly a World Model and a Reasoner? Analysing LLM Capabilities Through Retrieval Failure (Leng et al.) | 2023 | Analysis of how RAG fails when retrieval quality is low — motivates the need for evaluation-before-generation |
WHAT NEXT
CRAG evaluates retrieval at query time and routes to a fallback when needed. The next evolution: FLARE (Forward-Looking Active Retrieval) makes retrieval itself dynamic — triggering additional retrieval passes mid-generation whenever the model detects its own uncertainty, rather than evaluating upfront.
CONCLUSION
Retrieval failure is silent. Without CRAG, the LLM receives bad context and produces a confident, plausible-sounding hallucination. There is no error and no warning, so the failure is hard to detect downstream. The evaluator breaks that silence: chunks judged INCORRECT never reach the generator.
The confidence threshold is a real engineering decision, not a default. Setting it too low defeats the purpose (bad chunks still pass). Setting it too high triggers expensive fallbacks on every slightly ambiguous query. Calibrate it against measured retrieval precision on your knowledge base, not against intuition.
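Calibration can be as simple as sweeping candidate thresholds over a small labeled sample. The sketch below assumes you have evaluator scores paired with human relevance labels; the `sweep` helper and the data are hypothetical.

```python
from typing import List, Optional, Tuple

def sweep(
    labeled: List[Tuple[float, bool]], thresholds: List[float]
) -> List[Tuple[float, Optional[float], Optional[float]]]:
    """For each threshold, report precision and recall of the chunks that pass."""
    total_relevant = sum(1 for _, relevant in labeled if relevant)
    rows = []
    for t in thresholds:
        passed = [relevant for score, relevant in labeled if score >= t]
        true_pos = sum(passed)
        # Precision is undefined when nothing passes; report None in that case.
        precision = true_pos / len(passed) if passed else None
        recall = true_pos / total_relevant if total_relevant else None
        rows.append((t, precision, recall))
    return rows

# Hypothetical (evaluator score, human says relevant) pairs.
labeled = [
    (0.95, True), (0.80, True), (0.70, False),
    (0.55, True), (0.30, False), (0.10, False),
]
rows = sweep(labeled, [0.2, 0.5, 0.75])
for t, precision, recall in rows:
    print(f"threshold={t}: precision={precision}, recall={recall}")
```

On this toy sample, precision rises as the threshold rises while recall eventually drops. That is the tradeoff to weigh against the cost of a fallback.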