EXP #004 · Building RAG from Scratch · step 4

CRAG: Grade Your Retrieval Before You Trust It

Retrieval can fail silently. CRAG adds a confidence-scoring evaluator between the retriever and the generator — discarding bad chunks before the LLM ever sees them.

2026-05-27 · 7 MIN READ

HYPOTHESIS

H₀ Scoring retrieved chunks as CORRECT / AMBIGUOUS / INCORRECT before generation, and routing to a fallback when quality is low, will reduce hallucination caused by bad context more reliably than prompt-only guardrails.

THE PROBLEM

RAG systems assume that if retrieval returns something, it returned something relevant. That assumption is wrong often enough to matter. Dense retrieval produces results for every query — it has no way to say “nothing useful was found.” The cosine similarity score tells you the best match found, not whether that match is actually good.
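A minimal sketch of that failure mode, assuming toy 3-dimensional embeddings picked by hand (not produced by any real model) and a hypothetical `retrieve` helper. Top-k retrieval returns k results even when the query points away from every document in the index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def retrieve(query_vec, index, k):
    """Return the k best (score, doc_id) pairs. There is no 'nothing found' outcome."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    return sorted(scored, reverse=True)[:k]

# Hand-picked toy vectors, for illustration only.
index = {
    "cosine_similarity_explained": [0.9, 0.1, 0.0],
    "bm25_scoring": [0.7, 0.3, 0.1],
    "gradient_descent": [0.1, 0.9, 0.2],
}

off_topic_query = [0.0, 0.1, 0.9]  # nearly orthogonal to every document
results = retrieve(off_topic_query, index, k=2)
for score, doc_id in results:
    print(f"{score:.2f}  {doc_id}")
```

Both results come back with low similarity, but the retriever still ranks them first and second. Nothing in the return value says "these are all bad".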

When irrelevant context reaches the LLM, the LLM doesn’t say “I don’t know”. It reasons from the bad context and produces a confident, hallucinated answer. Prompt guardrails (“only answer from the provided context”) reduce this — but they don’t eliminate it, and they don’t fix the root cause.

CRAG (Corrective Retrieval Augmented Generation) adds an evaluator between the retriever and the generator. Each retrieved chunk gets a score: CORRECT (clearly relevant), AMBIGUOUS (partially relevant), or INCORRECT (not relevant). The system routes based on those verdicts — using only CORRECT chunks for generation, refining ambiguous results, or triggering a fallback entirely.

LAYMAN EXPLANATION

Think of a researcher preparing a briefing. A naive approach: hand them the first five search results and tell them to write the report. They’ll use whatever they get, even if two of the five results are irrelevant — the brief will sound confident even where it’s wrong.

CRAG is the editorial review step. Before the researcher writes, an editor reads the five sources and marks each one: solid reference, questionable, unrelated. The researcher only uses the solid references. If none pass, they go find better sources rather than writing from bad ones.

The evaluator in CRAG is itself a model acting as a lightweight judge: here an LLM, while the original paper fine-tunes a small T5-based model for the role. It receives the query and each chunk and outputs a relevance verdict. The cost is one evaluator call per retrieved chunk (or one batched call per query), but the evaluator can be a much cheaper model than the generator, and it prevents the expensive failure mode of confident hallucination.
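The judge step could be sketched roughly as follows. The prompt wording, the `score_chunk` and `score_chunks` names, and the `judge` callable (any function that takes a prompt string and returns the model's text reply) are all assumptions for illustration, not the paper's implementation:

```python
import re
from typing import Callable, List

# Hypothetical prompt wording; tune it for whichever judge model you use.
JUDGE_PROMPT = (
    "Rate how well the passage answers the question on a scale from 0 to 1.\n"
    "Reply with the number only.\n\n"
    "Question: {query}\n"
    "Passage: {chunk}\n"
)

def score_chunk(query: str, chunk: str, judge: Callable[[str], str]) -> float:
    """Ask the judge for a relevance score and clamp it to [0, 1]."""
    reply = judge(JUDGE_PROMPT.format(query=query, chunk=chunk))
    match = re.search(r"-?\d*\.?\d+", reply)
    if match is None:
        # Assumption: an unparseable reply is treated as "not relevant".
        return 0.0
    return min(1.0, max(0.0, float(match.group())))

def score_chunks(query: str, chunks: List[str], judge: Callable[[str], str]) -> List[float]:
    """One judge call per chunk, so k retrieved chunks cost k evaluator calls."""
    return [score_chunk(query, chunk, judge) for chunk in chunks]

# Stand-in for a real model call, so the sketch runs offline.
def stub_judge(prompt: str) -> str:
    return "0.9" if "Passage: Cosine" in prompt else "0.1"

scores = score_chunks(
    "What is cosine similarity?",
    ["Cosine similarity measures the angle between vectors.",
     "Gradient descent minimizes a loss."],
    stub_judge,
)
print(scores)
```

Passing the judge in as a callable keeps the scoring logic testable without network access, and makes it easy to swap in a cheaper model later.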

LIVE DEMO

Interactive demo: enter a query, and CRAG evaluates a set of hypothetical retrieved chunks, showing the verdict for each (CORRECT, AMBIGUOUS, or INCORRECT) with the routing decision at the end.

Notice that the evaluator doesn't just score; it routes. No CORRECT chunks means no generation from bad context.

THE MATH

Standard RAG passes all retrieved chunks directly to the generator:

$\text{Answer} = \text{LLM}_\text{gen}(q, C) \quad \text{where } C = \text{Retrieve}(q, k)$

CRAG inserts an evaluation step. For each chunk $c_i$, the evaluator assigns a score and verdict:

$\text{score}(c_i) = \text{LLM}_\text{eval}(q, c_i) \in [0, 1]$

$\text{verdict}(c_i) = \begin{cases} \text{CORRECT} & \text{if } \text{score}(c_i) \geq \theta \\ \text{AMBIGUOUS} & \text{if } \theta - 0.3 \leq \text{score}(c_i) < \theta \\ \text{INCORRECT} & \text{otherwise} \end{cases}$
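The piecewise definition above translates almost line for line into code. This is a sketch: the function name `verdict` and the `band` parameter (the fixed 0.3 width of the AMBIGUOUS zone from the formula) are naming choices made here for illustration.

```python
def verdict(score: float, theta: float, band: float = 0.3) -> str:
    """Map an evaluator score in [0, 1] to a CRAG verdict.

    theta is the confidence threshold; band is the width of the
    AMBIGUOUS zone just below it (0.3 in the formula above).
    """
    if score >= theta:
        return "CORRECT"
    if score >= theta - band:
        return "AMBIGUOUS"
    return "INCORRECT"

for s in (0.88, 0.41, 0.25):
    print(s, verdict(s, theta=0.6))
```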

The routing decision depends on the evaluation outcome. The confidence threshold $\theta$ is the key engineering parameter — it controls the precision-recall tradeoff:

PARAMETER SIMULATOR · CONFIDENCE THRESHOLD

Snapshot at confidence threshold $\theta = 0.6$ (slider range: 0.3 lenient to 0.9 strict).

Chunk	Score	Verdict	Content
A	0.88	CORRECT	Directly explains cosine similarity with formula and examples.
B	0.72	CORRECT	Covers dense retrieval broadly; touches on cosine but not the focus.
C	0.41	AMBIGUOUS	Related to BM25 scoring, a different retrieval paradigm.
D	0.25	INCORRECT	About gradient descent optimization, an unrelated topic.
E	0.55	AMBIGUOUS	Embedding model comparisons, with indirect relevance.

Totals: 2 CORRECT · 2 AMBIGUOUS · 1 INCORRECT
→ ROUTING: 2 chunks passed → generate answer directly

↑ Low threshold: more chunks pass, fewer fallbacks triggered, lower precision. High threshold: fewer chunks pass, more fallbacks, higher precision. Set $\theta$ based on your knowledge base quality and how expensive fallbacks are.
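To see the tradeoff without the slider, the simulator's five scores can be tallied at a few thresholds. This is a small sketch; the `tally` helper is a hypothetical name, and it reuses the same verdict rule and 0.3 band as the formula above.

```python
from collections import Counter

# Scores for chunks A to E, taken from the simulator above.
SCORES = {"A": 0.88, "B": 0.72, "C": 0.41, "D": 0.25, "E": 0.55}

def tally(scores, theta, band=0.3):
    """Count verdicts for a set of chunk scores at a given threshold."""
    def label(s):
        if s >= theta:
            return "CORRECT"
        return "AMBIGUOUS" if s >= theta - band else "INCORRECT"
    return Counter(label(s) for s in scores.values())

for theta in (0.3, 0.6, 0.9):
    counts = tally(SCORES, theta)
    print(f"theta={theta}: "
          f"{counts['CORRECT']} CORRECT, "
          f"{counts['AMBIGUOUS']} AMBIGUOUS, "
          f"{counts['INCORRECT']} INCORRECT")
```

At 0.3 four of the five chunks pass, at 0.6 two pass, and at 0.9 none do, so the strict setting would skip direct generation for this query.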

DATA TABLE n=3

Evaluation outcome	Action	Why
≥1 CORRECT chunk	Generate using only CORRECT chunks	Irrelevant chunks discarded before generation
Only AMBIGUOUS chunks	Rewrite query, re-retrieve, or use web fallback	Partial signal — refining may surface better chunks
★ All INCORRECT	Fallback (web search / 4-route system)	No internal knowledge — retrieve externally or decline gracefully
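The routing table can be sketched as a small function. The names `route` and `select_context` and the action labels are hypothetical. One case the table leaves open is a mix of AMBIGUOUS and INCORRECT chunks with no CORRECT ones; the sketch assumes that case is treated like "only AMBIGUOUS", since some partial signal exists.

```python
from typing import List, Tuple

def route(verdicts: List[str]) -> str:
    """Pick an action from per-chunk verdicts, following the table above."""
    if "CORRECT" in verdicts:
        return "generate"   # use only the CORRECT chunks
    if "AMBIGUOUS" in verdicts:
        # Assumption: AMBIGUOUS mixed with INCORRECT is still worth refining.
        return "refine"     # rewrite query, re-retrieve, or web fallback
    return "fallback"       # all INCORRECT, or nothing retrieved at all

def select_context(chunks: List[Tuple[str, str]]) -> List[str]:
    """Keep only the text of chunks whose verdict is CORRECT."""
    return [text for text, v in chunks if v == "CORRECT"]

graded = [
    ("Chunk A text", "CORRECT"),
    ("Chunk C text", "AMBIGUOUS"),
    ("Chunk D text", "INCORRECT"),
]
action = route([v for _, v in graded])
context = select_context(graded) if action == "generate" else []
print(action, context)
```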

REFERENCE PAPERS

DATA TABLE n=3

Paper	Year	Key contribution
Corrective Retrieval Augmented Generation (Yan et al.)	2024	Original CRAG paper — introduces the retrieval evaluator with CORRECT/AMBIGUOUS/INCORRECT routing and web search fallback
Self-RAG: Learning to Retrieve, Generate, and Critique Through Self-Reflection (Asai et al.)	2023	Related: LLM trained to generate reflection tokens inline rather than using a separate evaluator model
Is Your LLM Secretly a World Model and a Reasoner? Analysing LLM Capabilities Through Retrieval Failure (Leng et al.)	2023	Analysis of how RAG fails when retrieval quality is low — motivates the need for evaluation-before-generation

WHAT NEXT

CRAG evaluates retrieval at query time and routes to a fallback when needed. The next evolution: FLARE (Forward-Looking Active Retrieval) makes retrieval itself dynamic — triggering additional retrieval passes mid-generation whenever the model detects its own uncertainty, rather than evaluating upfront.

CONCLUSION

Hypothesis confirmed.

Retrieval failure is silent. Without CRAG, the LLM receives bad context and produces a confident, plausible-sounding hallucination: there is no error, no warning, and no easy way to detect the failure downstream. The evaluator breaks that silence, because chunks judged INCORRECT are dropped before they reach the generator.

The confidence threshold is a real engineering decision, not a default. Setting it too low defeats the purpose (bad chunks still pass). Setting it too high triggers expensive fallbacks on every slightly ambiguous query. Calibrate it against your knowledge base’s typical retrieval precision, not on intuition.
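One way to calibrate, sketched under the assumption that you have a small hand-labeled set of (evaluator score, human relevance judgment) pairs. The `LABELED` data below is invented for illustration and `precision_recall` is a hypothetical helper; it measures how well "score ≥ θ" agrees with the human labels at each candidate threshold.

```python
def precision_recall(labeled, theta):
    """Precision and recall of the rule 'score >= theta' against human labels."""
    passed = [relevant for score, relevant in labeled if score >= theta]
    total_relevant = sum(relevant for _, relevant in labeled)
    true_pos = sum(passed)
    # Convention assumed here: with nothing passed, precision is reported as 1.0.
    precision = true_pos / len(passed) if passed else 1.0
    recall = true_pos / total_relevant if total_relevant else 1.0
    return precision, recall

# Invented example data: (evaluator score, did a human judge the chunk relevant?)
LABELED = [
    (0.92, True), (0.81, True), (0.66, False), (0.58, True),
    (0.44, False), (0.30, False), (0.21, False),
]

for theta in (0.5, 0.7, 0.95):
    p, r = precision_recall(LABELED, theta)
    print(f"theta={theta}: precision={p:.2f} recall={r:.2f}")
```

Sweeping θ this way turns the choice into a measured tradeoff: pick the lowest threshold whose precision is acceptable, then check how often it would leave a query with no CORRECT chunks.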
