LIVE · BENGALURU EST. 2024
EXP #012 Ev Ev Q-01 · Building RAG from Scratch · step 5 ⚠ Learning

How Do You Grade a RAG System?

Retrieval found the right passage — at rank 80, where the model never reads it. Classification metrics can't grade a ranked list. Building MRR and NDCG from scratch, and the one idea that makes NDCG click: rank is the guess, grade is the truth.

2026-06-29 10 MIN READ COMPLETE
01

HYPOTHESIS

H₀: A relevant chunk showing up in the retrieved set means retrieval did its job; presence is what matters.
02

METHOD

The previous experiment graded one prediction (is this spam?). But a RAG retriever hands back a ranked list of passages, and the tempting hypothesis is “if the right chunk is in there, retrieval worked.” FAR FROM IT. A gold chunk at rank 80 is a chunk the model never reads. I built the retrieval-evaluation toolkit from first principles, and the two metrics that matter, MRR and NDCG, fall out of asking “what should a good ranking score reward?”

03

TWO WAYS RETRIEVAL FAILS

RAG eval has two layers — retrieval (did we fetch the right context) and generation (is the answer good given it). This post is retrieval. And retrieval fails in exactly two ways, both names you already know:

  • Missed a needed chunk → context recall. In plain terms: of all the info the answer needed, how much did we fetch? (A question comparing HyDE with Step-Back needs both chunks; miss one and the answer can’t be written.)
  • Grabbed junk → context precision. In plain terms: of what we fetched, how much was actually relevant? Distractor chunks don’t just waste space; they pull the model’s attention and degrade the answer.

And the lever between them is familiar: the similarity-score threshold is the same dial as a classifier’s — slide it and you trade context precision against context recall, exactly like Bouncer Babu’s strictness knob.
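To make the dial concrete, here is a minimal sketch of that trade-off. It assumes the plain set-based definitions given above (some evaluation frameworks define context precision differently), and the chunk ids, similarity scores, and the `context_precision_recall` helper are all made up for illustration.

```python
# Toy retrieval output: (chunk_id, similarity score). All values are invented.
scored = [
    ("hyde", 0.91),
    ("distractor_a", 0.78),
    ("step_back", 0.64),
    ("distractor_b", 0.55),
    ("distractor_c", 0.31),
]
needed = {"hyde", "step_back"}  # chunks the answer actually requires


def context_precision_recall(scored, needed, threshold):
    """Keep chunks scoring >= threshold, then grade the kept set."""
    fetched = {cid for cid, score in scored if score >= threshold}
    hits = fetched & needed
    precision = len(hits) / len(fetched) if fetched else 0.0
    recall = len(hits) / len(needed) if needed else 0.0
    return precision, recall


# Strict threshold: clean but incomplete. Loose threshold: complete but noisy.
for t in (0.9, 0.6, 0.3):
    p, r = context_precision_recall(scored, needed, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold lifts recall from 0.5 to 1.0 while precision falls from 1.0 to 0.4.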

04

WHY POSITION MATTERS — LOST IN THE MIDDLE

Here’s why “is it in the set?” isn’t enough. LLMs suffer from “lost in the middle”: they attend most to the start and end of a long prompt, so a relevant chunk buried deep gets under-read even though it’s technically in the context. So a set-based metric like recall@k (is a good chunk in the top k) with, say, k = 20 gives full marks whether the gold chunk sits at rank 1 or rank 19, and those are wildly different outcomes for the model.
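A small sketch of that blindness, under the simplifying assumption of a single gold chunk per query (so recall@k reduces to a hit-or-miss check). The `hit_at_k` and `ranking_with_gold_at` helpers are hypothetical names, not from any library.

```python
def hit_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold chunk appears anywhere in the top k, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def ranking_with_gold_at(rank, n=20):
    """Build an n-item ranking with the gold chunk at a 1-based rank."""
    ids = [f"filler_{i}" for i in range(n - 1)]
    ids.insert(rank - 1, "gold")
    return ids


# Same score at rank 1 and rank 19: the metric cannot see position inside k.
for rank in (1, 19):
    print(rank, hit_at_k(ranking_with_gold_at(rank), "gold", k=20))
```

Both lines print the same score, which is the whole problem: inside the top k, the metric carries no information about where the chunk landed.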

The fix in practice is a reranker (plain: a cheap first pass casts a wide net, then a heavier scorer — a cross-encoder or hybrid BM25+dense — reorders the catch so the best chunks sit on top). But to even measure this, we need a metric that rewards relevant chunks for being high, not just present.

05

MRR — REWARD THE FIRST HIT


Simplest case: one query, one right answer. Score it by where the gold chunk landed — and you want rank 1 to be great, rank 20 nearly worthless, with a big penalty at the top and a flat tail. Put the rank in the denominator:

$$\text{MRR} = \frac{1}{Q}\sum_{q=1}^{Q} \frac{1}{\text{rank of the first relevant chunk}}$$

[Interactive plot: score (0 to 1) against rank of the gold chunk (1 to 20), comparing 1/rank with a linear score (bad). Example reading: gold chunk at rank 1 scores 1/1 = 1.000.]

↑ Slide where the gold chunk lands. 1/rank halves from rank 1→2 (the penalty you want for falling off the top) and barely moves 19→20 (already buried). The dashed line is a linear score; notice how little it separates rank 1 from rank 3.

Average that over all your queries and you’ve built MRR, mean reciprocal rank. It’s perfect for one-right-answer retrieval, but it’s blind to everything after the first hit: if a query needs the HyDE chunk and the Step-Back chunk, MRR scores a smug 1.0 the moment it finds HyDE at rank 1, even with Step-Back rotting at rank 80.
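Here is a minimal from-scratch sketch of MRR as defined above. The function names and toy rankings are made up for illustration, and scoring a query with no relevant chunk as 0.0 is a common convention assumed here, not something the formula dictates.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant chunk (1-based); 0.0 if none is found."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0  # assumed convention for "never retrieved"


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(ids, rel) for ids, rel in runs) / len(runs)


runs = [
    (["a", "b", "c"], {"a"}),        # first hit at rank 1 -> 1.0
    (["x", "b", "c"], {"b"}),        # first hit at rank 2 -> 0.5
    (["x", "y", "z", "c"], {"c"}),   # first hit at rank 4 -> 0.25
]
print(mean_reciprocal_rank(runs))    # (1.0 + 0.5 + 0.25) / 3

# The blind spot: the query needs both chunks, but only the first hit counts.
ranking = ["hyde"] + [f"filler_{i}" for i in range(78)] + ["step_back"]
print(reciprocal_rank(ranking, {"hyde", "step_back"}))  # 1.0, Step-Back at rank 80
```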

06

NDCG — GRADE × POSITION


To fix MRR’s blind spot, count every relevant chunk, and let chunks be graded (perfect=3, decent=2, weak=1, useless=0), not just yes/no. Each chunk’s contribution is two factors multiplied — how relevant it is times how high it sits:

$$\text{DCG} = \sum_i \frac{rel_i}{\log_2(1 + \text{rank}_i)} \qquad \text{NDCG} = \frac{\text{DCG}}{\text{IDCG}} \in [0,1]$$

IDCG is the score of the ideal ordering (highest grades on top); dividing by it normalizes to [0,1] so queries are comparable. (Neat: the log base cancels in the ratio.)
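The log-base remark can be checked in two lines. This is a sketch in notation that is not in the original: $b$ is an arbitrary log base, and $rel^{*}_j$ is the grade sitting at position $j$ of the ideal ordering.

```latex
% Change of base: \log_b x = \ln x / \ln b, so 1/\log_b x = \ln b / \ln x.
\frac{\text{DCG}_b}{\text{IDCG}_b}
  = \frac{\sum_i rel_i / \log_b(1 + \text{rank}_i)}
         {\sum_j rel^{*}_j / \log_b(1 + j)}
  = \frac{\ln b \,\sum_i rel_i / \ln(1 + \text{rank}_i)}
         {\ln b \,\sum_j rel^{*}_j / \ln(1 + j)}
  = \frac{\sum_i rel_i / \ln(1 + \text{rank}_i)}
         {\sum_j rel^{*}_j / \ln(1 + j)}
```

Every term in the numerator and the denominator carries the same factor $\ln b$, so the base changes DCG and IDCG individually but leaves NDCG untouched.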

[Interactive widget: six retrieved chunks with their grades and per-rank DCG contributions, rel / log2(1 + rank).]

| Rank | Grade | Contribution to DCG |
|------|-------|---------------------|
| #1 | 🥇 rel=3 | 3.00 |
| #2 | rel=0 | 0.00 |
| #3 | 🥈 rel=2 | 1.00 |
| #4 | 🥉 rel=1 | 0.43 |
| #5 | rel=0 | 0.00 |
| #6 | 🥉 rel=1 | 0.36 |

DCG 4.79 · IDCG 5.19 (fixed by truth) · NDCG 0.92

↑ Reorder the chunks, or hit “bury the gold.” Watch IDCG never move while DCG — and so NDCG — collapses.
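The widget's numbers can be reproduced with a few lines. This is a minimal sketch of the DCG and NDCG formulas above, with function names chosen for illustration. It assumes IDCG is computed from the grades in the retrieved list only, which matches the widget but would overstate NDCG if a relevant chunk were never retrieved at all.

```python
import math


def dcg(grades):
    """grades: relevance grades in the order the retriever ranked them."""
    return sum(g / math.log2(1 + rank) for rank, g in enumerate(grades, start=1))


def ndcg(grades):
    """DCG divided by the DCG of the ideal order (grades sorted descending)."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0


retrieved = [3, 0, 2, 1, 0, 1]  # grades at ranks 1..6, as in the widget
print(round(dcg(retrieved), 2))                        # 4.79
print(round(dcg(sorted(retrieved, reverse=True)), 2))  # 5.19 (IDCG)
print(round(ndcg(retrieved), 2))                       # 0.92

# "Bury the gold": same chunks, the rel=3 chunk moved to the bottom.
buried = [0, 2, 1, 0, 1, 3]
print(round(ndcg(buried), 2))                          # 0.62
```

Sorting `buried` gives the same ideal order as sorting `retrieved`, which is why IDCG stays put while DCG drops.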

SPECIMEN · rank is the guess, grade is the truth 100×

And here’s the idea that makes NDCG finally click: rank is where the retriever put the chunk (its imperfect guess); grade is what the chunk is actually worth (the ground truth, from a human or LLM-judge). They can disagree (a weak chunk at rank 1, the gold one at rank 80), and that disagreement is exactly what NDCG measures. IDCG is fixed by the truth; DCG follows the retriever’s order; NDCG is the ratio between them, so 1.0 means the guess matched the truth and lower values mean a wider gap.

07

THE RETRIEVAL METRIC RACK

DATA TABLE n=5
| Metric | Answers | Best for | Ignores |
|--------|---------|----------|---------|
| Recall@k / Precision@k | is a good chunk in the top k | a quick set check | order within the top k |
| Context recall | did we fetch all needed info | coverage gaps | where it ranked |
| Context precision | of fetched, how much was relevant | junk / distractors | |
| MRR | how high was the first hit | one right answer | everything after hit #1 |
| ★ NDCG | best stuff highest, graded | many graded results | needs graded labels |
08

CONCLUSION

⚠ FAR FROM IT
Presence isn’t success — position is.

A gold chunk at rank 80 is one the LLM never reads, so where it lands is what matters, not whether it’s technically in the set. Classification metrics grade one prediction; retrieval hands back a ranked list, so you need ranking metrics: MRR when there’s one right answer, NDCG when several chunks matter to different degrees.

The anchor for NDCG: rank is the retriever’s guess, grade is the ground truth, and NDCG measures how close the guess comes to the truth.

09

WHAT NEXT

We’ve graded the fetch. But perfect retrieval doesn’t guarantee a good answer — the model can have every right chunk in hand and still hallucinate or wander off the question. Next: faithfulness and answer relevancy — when the retrieval is perfect and the answer still lies (EXP-013).