EXP #012 Ev Ev Q-01 · Building RAG from Scratch · step 5 ⚠ Learning

How Do You Grade a RAG System?

Retrieval found the right passage — at rank 80, where the model never reads it. Classification metrics can't grade a ranked list. Building MRR and NDCG from scratch, and the one idea that makes NDCG click: rank is the guess, grade is the truth.

2026-06-29 10 MIN READ COMPLETE

HYPOTHESIS

H₀ A relevant chunk showing up in the retrieved set means retrieval did its job — presence is what matters.

METHOD

The previous experiment graded one prediction (is this spam?). A RAG retriever instead hands back a ranked list of passages, and the tempting hypothesis is “if the right chunk is in there, retrieval worked.” FAR FROM IT. A gold chunk at rank 80 is a chunk the model never reads. I built the retrieval-evaluation toolkit from first principles, and the two metrics that matter, MRR and NDCG, fall out of asking “what should a good ranking score reward?”

TWO WAYS RETRIEVAL FAILS

RAG eval has two layers — retrieval (did we fetch the right context) and generation (is the answer good given it). This post is retrieval. And retrieval fails in exactly two ways, both names you already know:

① Missed a needed chunk → context recall. Plain: of all the info the answer needed, how much did we fetch? (A question comparing HyDE vs Step-Back needs both chunks; miss one and the answer can’t be written.)
② Grabbed junk → context precision. Plain: of what we fetched, how much was actually relevant? Distractor chunks don’t just waste space; they pull the model’s attention and degrade the answer.

And the lever between them is familiar: the similarity-score threshold is the same dial as a classifier’s — slide it and you trade context precision against context recall, exactly like Bouncer Babu’s strictness knob.
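
Here is a minimal sketch of that dial, assuming each retrieved chunk carries a similarity score and that we already know which chunks the answer needs. The chunk ids, the scores, and the helper name `precision_recall_at_threshold` are all made up for illustration; they are not from any particular library.

```python
def precision_recall_at_threshold(scored_chunks, needed_ids, threshold):
    """Keep chunks scoring >= threshold, then grade the kept set.

    scored_chunks: dict of chunk id -> similarity score (hypothetical values)
    needed_ids: set of chunk ids the answer actually requires
    """
    kept = {cid for cid, score in scored_chunks.items() if score >= threshold}
    hits = kept & needed_ids
    # Assumed convention: an empty kept set counts as precision 1.0 (no junk fetched).
    precision = len(hits) / len(kept) if kept else 1.0
    recall = len(hits) / len(needed_ids)
    return precision, recall


# Illustrative data: two needed chunks and three distractors.
scores = {"hyde": 0.91, "distractor_a": 0.78, "step_back": 0.62,
          "distractor_b": 0.55, "distractor_c": 0.31}
needed = {"hyde", "step_back"}

# Strict threshold first, then progressively looser.
for t in (0.9, 0.6, 0.3):
    p, r = precision_recall_at_threshold(scores, needed, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

With these toy numbers the strict threshold keeps only the HyDE chunk (precision 1.00, recall 0.50), and loosening it recovers Step-Back at the cost of letting distractors in.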

WHY POSITION MATTERS — LOST IN THE MIDDLE

Here’s why “is it in the set?” isn’t enough. LLMs suffer from “lost in the middle”: they attend most to the start and end of a long prompt, so a relevant chunk buried deep gets under-read even though it’s technically in the context. A set-based metric like recall@k (is a good chunk in the top k?) with, say, k = 20 gives full marks whether the gold chunk sits at rank 1 or rank 19, and those are wildly different outcomes for the model.
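
A small sketch of that blindness, assuming a single gold chunk per query (in which case recall@k reduces to a hit-or-miss check) and k = 20. The list builder `ranking_with_gold_at` and the filler ids are hypothetical scaffolding, only there to place the gold chunk at a chosen rank.

```python
def hit_at_k(ranked_ids, gold_id, k):
    """Set-based check: 1.0 if the gold chunk is anywhere in the top k, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def ranking_with_gold_at(rank, total=20):
    """Build a made-up ranked list of `total` ids with 'gold' at `rank` (1-indexed)."""
    fillers = [f"filler_{i}" for i in range(total - 1)]
    return fillers[:rank - 1] + ["gold"] + fillers[rank - 1:]


top = ranking_with_gold_at(1)      # gold chunk first
buried = ranking_with_gold_at(19)  # gold chunk second to last

# Both rankings get the same score, even though the model's experience differs.
print(hit_at_k(top, "gold", k=20), hit_at_k(buried, "gold", k=20))
```

Both calls return 1.0: the metric only asks whether the chunk made the cut, not where it landed.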

The fix in practice is a reranker (plain: a cheap first pass casts a wide net, then a heavier scorer — a cross-encoder or hybrid BM25+dense — reorders the catch so the best chunks sit on top). But to even measure this, we need a metric that rewards relevant chunks for being high, not just present.

MRR — REWARD THE FIRST HIT

Simplest case: one query, one right answer. Score it by where the gold chunk landed — and you want rank 1 to be great, rank 20 nearly worthless, with a big penalty at the top and a flat tail. Put the rank in the denominator:

$\text{MRR} = \frac{1}{Q}\sum_{q=1}^{Q} \frac{1}{\text{rank of the first relevant chunk}}$

[Interactive figure: the score 1/rank plotted against where the gold chunk lands, from rank 1 to rank 20. With the gold chunk at rank 1 the score is 1/1 = 1.000.]

↑ Slide where the gold chunk lands. 1/rank halves from rank 1→2 (the penalty you want for falling off the top) and barely moves from 19→20 (already buried). The dashed line is a linear score; notice how little it separates rank 1 from rank 3. Averaging 1/rank over every query gives MRR.

Average that over all your queries and you’ve built MRR, mean reciprocal rank. It’s perfect for one-right-answer retrieval, but it’s blind to everything after the first hit: if a query needs the HyDE chunk and the Step-Back chunk, MRR scores a smug 1.0 the moment it finds HyDE at rank 1, even with Step-Back rotting at rank 80.
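
A from-scratch sketch of MRR and its blind spot. The function names and the toy rankings are my own for illustration, and scoring a query as 0.0 when no relevant chunk is retrieved is a common convention that I am assuming here.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0  # assumed convention for a complete miss


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical queries with the gold chunk at rank 1, rank 2, and rank 4.
runs = [
    (["a", "b", "c"], {"a"}),
    (["x", "gold", "y"], {"gold"}),
    (["p", "q", "r", "s"], {"s"}),
]
print(round(mean_reciprocal_rank(runs), 3))  # (1 + 1/2 + 1/4) / 3

# The blind spot: Step-Back at rank 2 versus Step-Back at rank 80.
both_on_top = ["hyde", "step_back"] + [f"filler_{i}" for i in range(78)]
one_buried = ["hyde"] + [f"filler_{i}" for i in range(78)] + ["step_back"]
needed = {"hyde", "step_back"}
print(reciprocal_rank(both_on_top, needed), reciprocal_rank(one_buried, needed))
```

The last line prints the same score for both rankings, because the loop returns as soon as it sees the first relevant chunk.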

NDCG — GRADE × POSITION

To fix MRR’s blind spot, count every relevant chunk, and let chunks be graded (perfect=3, decent=2, weak=1, useless=0), not just yes/no. Each chunk’s contribution is two factors multiplied — how relevant it is times how high it sits:

$\text{DCG} = \sum_i \frac{rel_i}{\log_2(1 + \text{rank}_i)} \qquad \text{NDCG} = \frac{\text{DCG}}{\text{IDCG}} \in [0,1]$

IDCG is the score of the ideal ordering (highest grades on top); dividing by it normalizes to [0,1] so queries are comparable. (Neat: the log base cancels in the ratio.)
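
To see why the base cancels, rewrite the discount in an arbitrary base $b$ (the symbol $b$ and the subscripts on DCG are introduced here only for this illustration). Switching base multiplies every term by the same constant, and IDCG is just DCG of the ideal ordering, so it picks up the same constant:

$\log_b x = \frac{\log_2 x}{\log_2 b} \;\Rightarrow\; \text{DCG}_b = \sum_i \frac{rel_i}{\log_b(1 + \text{rank}_i)} = \log_2 b \cdot \text{DCG}_2 \qquad \frac{\text{DCG}_b}{\text{IDCG}_b} = \frac{\log_2 b \cdot \text{DCG}_2}{\log_2 b \cdot \text{IDCG}_2} = \frac{\text{DCG}_2}{\text{IDCG}_2}$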

Rank	Grade (rel)	Contribution: rel / log₂(1 + rank)
#1	🥇 3	3.00
#2	⬜ 0	0.00
#3	🥈 2	1.00
#4	🥉 1	0.43
#5	⬜ 0	0.00
#6	🥉 1	0.36

DCG 4.79 · IDCG 5.19 (fixed by truth) · NDCG 0.92

↑ Reorder the chunks, or hit “bury the gold” to drop the 🥇 to the bottom. Watch IDCG (the truth) never move while DCG, and so NDCG, collapses. Grade = truth, rank = the retriever’s guess, NDCG = how far the guess falls short of the truth.
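
The six-chunk example (grades 3, 0, 2, 1, 0, 1 in the retriever’s order) can be reproduced with a short sketch. The function names are mine, and returning 0.0 for a query with no relevant chunks is an assumed convention. The `buried` list is my reading of “bury the gold”: the same chunks with the grade-3 chunk moved to the last rank.

```python
import math


def dcg(grades):
    """Discounted cumulative gain for grades listed in ranked order (rank 1 first)."""
    return sum(g / math.log2(1 + rank) for rank, g in enumerate(grades, start=1))


def ndcg(grades):
    """DCG divided by the DCG of the ideal ordering (grades sorted high to low)."""
    ideal = dcg(sorted(grades, reverse=True))
    # Assumed convention: a query with no relevant chunks scores 0.0.
    return dcg(grades) / ideal if ideal > 0 else 0.0


retrieved = [3, 0, 2, 1, 0, 1]  # grades in the retriever's order
buried = [0, 2, 1, 0, 1, 3]     # same chunks, grade-3 chunk moved to the bottom

print(round(dcg(retrieved), 2))                        # 4.79
print(round(dcg(sorted(retrieved, reverse=True)), 2))  # 5.19 (IDCG)
print(round(ndcg(retrieved), 2))                       # 0.92
print(round(ndcg(buried), 2))                          # 0.62
```

Note that `ndcg` sorts the same grades to get the ideal ordering, which is why IDCG stays at 5.19 no matter how the chunks are shuffled.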

SPECIMEN · rank is the guess, grade is the truth 100×

And here’s the idea that makes NDCG finally click: rank is where the retriever put the chunk (its imperfect guess); grade is what the chunk is actually worth (the ground truth, from a human or LLM-judge). They can disagree (a weak chunk at rank 1, the gold one at rank 80), and that disagreement is exactly what NDCG measures. IDCG is fixed by the truth; DCG follows the retriever’s order; NDCG is their ratio, so 1.0 means the guess matched the truth and lower values mean a wider gap.

THE RETRIEVAL METRIC RACK

DATA TABLE n=5

Metric	Answers	Best for	Ignores / cost
Recall@k / Precision@k	is a good chunk in the top k / how much of the top k is good	a quick set check	order within the top k
Context recall	did we fetch all needed info	coverage gaps	where it ranked
Context precision	of fetched, how much was relevant	junk / distractors	—
MRR	how high was the first hit	one right answer	everything after hit #1
★ NDCG	best stuff highest, graded	many graded results	(cost: needs graded labels)

CONCLUSION

⚠ FAR FROM IT

Presence isn’t success — position is.

A gold chunk at rank 80 is one the LLM never reads, so where it lands is what matters, not whether it’s technically in the set. Classification metrics grade one prediction; retrieval hands back a ranked list, so you need ranking metrics: MRR when there’s one right answer, NDCG when several chunks matter to different degrees.

The anchor for NDCG: rank is the retriever’s guess, grade is the ground truth, and NDCG scores how closely the guess matches the truth (1.0 is a perfect match).

WHAT NEXT

We’ve graded the fetch. But perfect retrieval doesn’t guarantee a good answer — the model can have every right chunk in hand and still hallucinate or wander off the question. Next: faithfulness and answer relevancy — when the retrieval is perfect and the answer still lies (EXP-013).
