How Do You Grade a RAG System?
Retrieval found the right passage — at rank 80, where the model never reads it. Classification metrics can't grade a ranked list. Building MRR and NDCG from scratch, and the one idea that makes NDCG click: rank is the guess, grade is the truth.
HYPOTHESIS
The previous experiment graded one prediction (is this spam?). But a RAG retriever hands back a ranked list of passages, and the tempting hypothesis is “if the right chunk is in there, retrieval worked.” FAR FROM IT. A gold chunk at rank 80 is a chunk the model never reads. I built the retrieval-evaluation toolkit from first principles, and the two metrics that matter, MRR and NDCG, fall out of asking “what should a good ranking score reward?”
TWO WAYS RETRIEVAL FAILS
RAG eval has two layers: retrieval (did we fetch the right context?) and generation (is the answer good given that context?). This post covers retrieval, which fails in exactly two ways, both with names you already know:
- ① Missed a needed chunk → context recall. In plain terms: of all the info the answer needed, how much did we fetch? (Comparing HyDE vs Step-Back needs both chunks; miss one and the answer can’t be written.)
- ② Grabbed junk → context precision. In plain terms: of what we fetched, how much was actually relevant? Distractor chunks don’t just waste space; they pull the model’s attention and degrade the answer.
And the lever between them is familiar: the similarity-score threshold is the same dial as a classifier’s — slide it and you trade context precision against context recall, exactly like Bouncer Babu’s strictness knob.
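To make the dial concrete, here is a minimal sketch of chunk-level context precision and recall as the threshold slides. It assumes a simplified set-based definition (chunk ids compared against a hand-labelled set of needed chunks). The names `SCORES`, `NEEDED`, and `retrieve`, and all the similarity values, are made up for illustration.

```python
def context_precision(retrieved: set[str], needed: set[str]) -> float:
    """Of the chunks we fetched, what fraction was actually needed?"""
    if not retrieved:
        return 0.0  # convention chosen here: nothing fetched scores 0
    return len(retrieved & needed) / len(retrieved)


def context_recall(retrieved: set[str], needed: set[str]) -> float:
    """Of the chunks the answer needed, what fraction did we fetch?"""
    if not needed:
        return 1.0  # nothing was needed, so nothing was missed
    return len(retrieved & needed) / len(needed)


# Hypothetical similarity scores for one query (ids and numbers are made up).
SCORES = {"hyde": 0.82, "distractor_a": 0.60, "step_back": 0.55, "distractor_b": 0.40}
NEEDED = {"hyde", "step_back"}


def retrieve(threshold: float) -> set[str]:
    """Keep every chunk whose similarity score clears the threshold."""
    return {chunk for chunk, score in SCORES.items() if score >= threshold}


for threshold in (0.8, 0.5, 0.3):
    fetched = retrieve(threshold)
    print(
        f"threshold={threshold}: "
        f"precision={context_precision(fetched, NEEDED):.2f} "
        f"recall={context_recall(fetched, NEEDED):.2f}"
    )
```

With these made-up scores, a strict threshold of 0.8 fetches only the HyDE chunk (precision 1.00, recall 0.50). Loosening to 0.5 recovers Step-Back but lets a distractor in (precision 0.67, recall 1.00), and 0.3 lets in both distractors (precision 0.50, recall 1.00).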
WHY POSITION MATTERS — LOST IN THE MIDDLE
Here’s why “is it in the set?” isn’t enough. LLMs suffer from “lost in the middle”: they attend most to the start and end of a long prompt, so a relevant chunk buried deep gets under-read even though it’s technically in the context. A set-based metric like recall@k (did the good chunks make the top k?) can’t see this: at k = 20 it gives full marks whether the gold chunk sits at rank 1 or rank 19, and those are wildly different outcomes for the model.
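A small sketch of that blindness, assuming a single gold chunk and a list of 20 results. The helper `ranking_with_gold_at` and the filler chunk ids are hypothetical, there only to build the two rankings.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear anywhere in the top k."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ranking_with_gold_at(rank: int, size: int = 20) -> list[str]:
    """Build a toy ranking of `size` chunks with 'gold' at the given 1-based rank."""
    fillers = [f"filler_{i}" for i in range(size - 1)]
    return fillers[: rank - 1] + ["gold"] + fillers[rank - 1 :]


gold_first = ranking_with_gold_at(1)
gold_deep = ranking_with_gold_at(19)

# Both rankings get the same score: the metric only asks "is it in the set?"
print(recall_at_k(gold_first, {"gold"}, k=20))
print(recall_at_k(gold_deep, {"gold"}, k=20))
```

Both calls return 1.0. Shrinking k to 10 would finally penalize the deep ranking, but only as an all-or-nothing cliff, not as a gradual reward for sitting higher.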
The fix in practice is a reranker (plain: a cheap first pass casts a wide net, then a heavier scorer — a cross-encoder or hybrid BM25+dense — reorders the catch so the best chunks sit on top). But to even measure this, we need a metric that rewards relevant chunks for being high, not just present.
MRR — REWARD THE FIRST HIT
Simplest case: one query, one right answer. Score it by where the gold chunk landed. You want rank 1 to be great and rank 20 nearly worthless, with a big penalty near the top and a flat tail. Put the rank in the denominator:

$$\text{score} = \frac{1}{\text{rank}}$$
1/rank halves from rank 1 to rank 2 (the penalty you want for falling off the top) and barely moves from rank 19 to rank 20 (already buried). A linear score, by contrast, drops by the same small step at every rank, so it can barely tell rank 1 from rank 3.
Average that over all your queries and you’ve built MRR, the mean reciprocal rank. It’s perfect for one-right-answer retrieval, but it’s blind to everything after the first hit: if a query needs both the HyDE chunk and the Step-Back chunk, MRR scores a smug 1.0 the moment HyDE lands at rank 1, even with Step-Back rotting at rank 80.
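MRR from scratch is only a few lines. This sketch assumes each query comes as a ranked list of chunk ids plus a set of relevant ids, and scores a query with no hit as 0 (a common convention, not something the metric forces). The toy queries and filler ids are made up.

```python
def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for position, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average the reciprocal rank over (ranking, relevant-set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Three toy queries: gold at rank 1, gold at rank 2, gold missing entirely.
runs = [
    (["gold", "a", "b"], {"gold"}),
    (["a", "gold", "b"], {"gold"}),
    (["a", "b", "c"], {"gold"}),
]
print(mean_reciprocal_rank(runs))  # (1 + 0.5 + 0) / 3

# The blind spot: HyDE at rank 1, Step-Back buried at rank 80.
buried = ["hyde"] + [f"filler_{i}" for i in range(78)] + ["step_back"]
print(reciprocal_rank(buried, {"hyde", "step_back"}))
```

The last call still returns 1.0: the loop exits on the first hit, so nothing after it can change the score.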
NDCG — GRADE × POSITION
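To fix MRR’s blind spot, count every relevant chunk, and let chunks be graded (perfect=3, decent=2, weak=1, useless=0), not just yes/no. Each chunk’s contribution is two factors multiplied: how relevant it is times how high it sits. In the common linear-gain form, with $\text{grade}_i$ the grade of the chunk at rank $i$ and $k$ the number of ranked chunks:

$$\text{DCG} = \sum_{i=1}^{k} \frac{\text{grade}_i}{\log_2(i+1)}, \qquad \text{NDCG} = \frac{\text{DCG}}{\text{IDCG}}$$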
IDCG is the score of the ideal ordering (highest grades on top); dividing by it normalizes to [0,1] so queries are comparable. (Neat: the log base cancels in the ratio.)
Reordering the chunks never moves IDCG, because it depends only on which grades are present, not on their order. Bury the gold chunk and DCG collapses, taking NDCG with it.
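Here is a minimal sketch of grade times position in code. It assumes the common linear-gain form, where the chunk at rank i contributes grade / log(i + 1); other variants use an exponential gain such as 2^grade − 1. The `base` parameter and the two example rankings are illustrative, added only to show the log base cancelling.

```python
import math


def dcg(grades: list[int], base: float = 2.0) -> float:
    """Discounted cumulative gain: each grade divided by log(rank + 1)."""
    return sum(
        grade / math.log(position + 1, base)
        for position, grade in enumerate(grades, start=1)
    )


def ndcg(grades: list[int], base: float = 2.0) -> float:
    """DCG divided by the DCG of the ideal ordering (highest grades first)."""
    ideal = dcg(sorted(grades, reverse=True), base)
    return dcg(grades, base) / ideal if ideal > 0 else 0.0


# Grades in ranked order: perfect=3, decent=2, weak=1, useless=0.
well_ranked = [3, 2, 1, 0, 0]
gold_buried = [0, 0, 1, 2, 3]  # same chunks, best ones pushed to the bottom

print(ndcg(well_ranked))           # ideal order
print(ndcg(gold_buried))           # same grades, worse order
print(ndcg(gold_buried, base=10))  # matches base 2 up to float rounding
```

The well-ranked list scores 1.0 and the buried one roughly 0.53, even though both contain exactly the same chunks. Their IDCG is identical, so the whole drop comes from DCG.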
THE RETRIEVAL METRIC RACK
| Metric | Answers | Best for | Ignores / cost |
|---|---|---|---|
| Recall@k / Precision@k | are the good chunks in the top k, and how much of the top k is good | a quick set check | order within the top k |
| Context recall | did we fetch all needed info | coverage gaps | where it ranked |
| Context precision | of fetched, how much was relevant | junk / distractors | — |
| MRR | how high was the first hit | one right answer | everything after hit #1 |
| ★ NDCG | best stuff highest, graded | many graded results | needs graded labels |
CONCLUSION
A gold chunk at rank 80 is one the LLM never reads, so where it lands is what matters, not whether it’s technically in the set. Classification metrics grade one prediction; retrieval hands back a ranked list, so you need ranking metrics: MRR when there’s one right answer, NDCG when several chunks matter to different degrees.
The anchor for NDCG: rank is the retriever’s guess, grade is the ground truth, and NDCG scores how closely they agree (1.0 means the ranking matches the grades; lower means the two have drifted apart).
WHAT NEXT
We’ve graded the fetch. But perfect retrieval doesn’t guarantee a good answer — the model can have every right chunk in hand and still hallucinate or wander off the question. Next: faithfulness and answer relevancy — when the retrieval is perfect and the answer still lies (EXP-013).