Pick the Wrong Metric and You Go Blind
A model has no single "performance" — it has a confusion matrix, a score, and a dial. Every metric is a slice of those four cells, and the wrong slice hides the exact failure you shipped. A bouncer, a club, and the 0.99 that lies.
HYPOTHESIS
Fifth entry in ML Foundations from First Principles. The hypothesis is the one everyone secretly holds: “accuracy — or AUC — tells you how good the model is.” I re-sat the whole metrics topic as a senior-style mock and rebuilt each piece from its mechanism, and the verdict is FAR FROM IT: a classifier has no single number. It has a confusion matrix, a score, and a dial — and a completely different story under every metric. Pick the wrong one and you don’t just under-measure, you go blind to the failure you shipped.
THE TRUTH TABLE — EVERY METRIC IS A SLICE
Everything starts with the confusion matrix (the truth table): the four ways a prediction can land. Take Bouncer Babu: he lets someone in (predicted positive) or turns them away (predicted negative), crossed with whether they are truly a VIP or a gatecrasher:
- ✓ TP (true positive): VIP let in · TN (true negative): gatecrasher kept out. The two you got right.
- ✗ FP (false positive): gatecrasher sneaked in · FN (false negative): VIP turned away. The two ways you were wrong.
Here’s the unlock: every metric is just a different slice of these four cells. Precision is a column, recall is a row, specificity is the other row, accuracy is the diagonal, and F1 never touches TN.
Two cousins fall straight out of the matrix:
Accuracy — plain: overall fraction correct — is the famous trap: on a 99%-legit stream, “predict legit for everyone” scores 99% accuracy with zero recall. The metric applauds a useless model, because the giant TN cell drowns everything. That’s why accuracy is the first number to distrust on imbalance. Specificity — plain: of the real negatives, how many you correctly cleared — is recall’s mirror, and FPR = 1 − specificity (remember that; it’s ROC’s x-axis).
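The accuracy trap is easy to reproduce with nothing but the four cells. The sketch below is illustrative only: the function name `confusion_metrics`, the 990/10 split, and the choice to report an undefined 0/0 ratio as 0.0 are all assumptions made for this example, not something from the original widget.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the basic slices of a 2x2 confusion matrix."""
    def ratio(num, den):
        # Convention assumed here: an undefined 0/0 ratio is reported as 0.0.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),  # the diagonal
        "precision": ratio(tp, tp + fp),                # predicted-positive column
        "recall": ratio(tp, tp + fn),                   # actual-positive row
        "specificity": ratio(tn, tn + fp),              # actual-negative row
        "fpr": ratio(fp, fp + tn),                      # equals 1 - specificity
    }


# Hypothetical stream: 990 legit, 10 fraud, and a "model" that says legit every time.
always_legit = confusion_metrics(tp=0, fp=0, fn=10, tn=990)
for name, value in always_legit.items():
    print(f"{name:12s} {value:.3f}")
```

Accuracy comes out at 0.990 while recall is 0.000: the 990 true negatives carry the whole score.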
THE DIAL — AND HOW TO SET IT
A classifier hands you a score, not a verdict; a threshold (the dial) turns it into a decision. Move the dial and predictions flip sides: lowering it turns some FNs into TPs (good) and some TNs into FPs (bad).
Lower the dial → flag more → recall climbs, precision usually slips. Which way you lean is set by cost asymmetry: fraud → a miss (FN) is costly → recall-first; spam → a real email lost (FP) is costly → precision-first. But “lean toward recall” is hand-wavy; the dial is actually a cost calculation:
Put a price on a false alarm and on a miss, and the best dial setting is the one that minimizes total cost. Crank up the cost of a miss (FN), as in fraud, and the optimal threshold slides left, toward recall. That’s how you pick an operating point, not by vibes.
One guardrail: optimise one metric subject to a floor on the other, or it goes degenerate — “100% recall” is free if you flag everyone.
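Here is a minimal sketch of that cost calculation, including the guardrail. Everything in it is hypothetical: the nine scores and labels, the cost values, the helper names `counts_at` and `best_threshold`, and the choice to try only observed scores as candidate thresholds.

```python
def counts_at(scores, labels, threshold):
    """Count TP, FP, FN when everything scoring >= threshold is flagged."""
    tp = fp = fn = 0
    for s, y in zip(scores, labels):
        if s >= threshold:
            if y == 1:
                tp += 1
            else:
                fp += 1
        elif y == 1:
            fn += 1
    return tp, fp, fn


def best_threshold(scores, labels, cost_fp, cost_fn, min_precision=0.0):
    """Return (threshold, total_cost) minimizing cost, subject to a precision floor."""
    best = None
    for t in sorted(set(scores)):  # candidate dials: the observed scores
        tp, fp, fn = counts_at(scores, labels, t)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        if precision < min_precision:
            continue  # guardrail: skip dials that violate the floor
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


# Hypothetical validation set: 1 = positive (fraud / spam), 0 = negative.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1]

print(best_threshold(scores, labels, cost_fp=1, cost_fn=10))   # (0.4, 2)
print(best_threshold(scores, labels, cost_fp=10, cost_fn=1))   # (0.8, 2)
print(best_threshold(scores, labels, cost_fp=1, cost_fn=10, min_precision=0.7))  # (0.6, 11)
```

When a miss is ten times pricier the dial drops to 0.4; when a false alarm is pricier it climbs to 0.8. With a precision floor of 0.7 the cheapest unconstrained dial is no longer allowed, so the search settles on 0.6 at a higher cost.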
BOUNCER BABU — WHERE THE CURVE COMES FROM
Instead of one dial setting, sweep all of them. Plot catch rate (TPR) against sneak-in rate (FPR = FP/(FP+TN)) as the dial sweeps: that’s the ROC curve, from (0,0) “let nobody in” to (1,1) “let everybody in.” AUC (the area) = the chance Babu scores a random VIP above a random gatecrasher, his “eye,” independent of the dial.
Read the ROC curve like a story. Its two ends: (0,0) is the door locked — nobody in, zero caught, zero sneaked (that’s the origin); (1,1) is the door wide open — everyone in, every VIP and every gatecrasher. Between them the dot walks as the dial loosens.
The more the arc bulges toward the top-left, the better Babu’s eye. That corner, (FPR, TPR) = (0, 1), is the dream: all VIPs in, zero gatecrashers. The diagonal is Babu guessing at random: a coin flip, AUC 0.5.
And the lovely bit: AUC = the chance Babu gives a random real VIP a higher score than a random gatecrasher. It’s threshold-free — it judges whether his scores are in the right order before you argue about the dial, which is exactly what lets the owner compare two bouncers fairly.
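The ranking reading of AUC can be checked directly by counting pairs. This is a brute-force sketch on made-up scores; the name `pairwise_auc` and the tie rule (a tie counts as half a win) are assumptions of the example.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention: a tie is half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores Babu gives to real VIPs and to gatecrashers.
vips = [0.9, 0.8, 0.6, 0.4]
crashers = [0.7, 0.5, 0.3, 0.2, 0.1]

print(pairwise_auc(vips, crashers))  # 0.85 (17 of 20 pairs ranked correctly)

# Squaring every score changes the numbers but not their order,
# so the AUC is unchanged: no threshold is involved anywhere.
print(pairwise_auc([s ** 2 for s in vips], [s ** 2 for s in crashers]))  # 0.85
```

The quadratic loop is fine for a handful of scores; on real data a rank-based computation gives the same number far faster.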
Interactive demo (not reproduced here): slide the dial and people cross IN/OUT while the dot slides along both the ROC and PR curves. The MOBBED setting floods the door with gatecrashers, who swamp the VIPs.
Watch MOBBED. The gatecrashers (TN) become an enormous pile, so even hundreds of sneak-ins barely move FPR = FP/(FP+TN) — ROC stays near-perfect — while inside the club those sneak-ins swamp the few VIPs, so precision craters. ROC hides the false positives in a giant TN pile; precision has no TN, so it tells the truth. On rare-positive problems report PR-AUC (baseline = the prevalence, not 0.5), not ROC-AUC.
The PR curve asks Babu’s other question — not “did I catch the VIPs?” but “is the room actually exclusive?” (precision = of everyone inside, what fraction are real VIPs). On a mobbed night the gatecrashers who slip in barely move the sneak-in rate — FPR drowns them in a giant pile of correctly-blocked randos, so ROC shrugs — yet inside the club they swamp the handful of VIPs, so precision collapses.
The keeper: balanced crowd, you just want a good eye → trust ROC-AUC (baseline 0.5); VIPs are rare and a room packed with randos is the failure → trust PR-AUC (baseline = how rare VIPs are).
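A few lines of arithmetic make the mobbed night concrete. The crowd sizes below are invented for illustration, and `door_rates` is a hypothetical helper; the point is only that the same ROC point can sit on very different precision.

```python
def door_rates(tp, fp, fn, tn):
    """TPR and FPR (the ROC axes) plus precision (the PR y-axis)."""
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }


# Hypothetical quiet night: 100 VIPs, 100 gatecrashers.
quiet = door_rates(tp=90, fp=5, fn=10, tn=95)

# Hypothetical mobbed night: same 100 VIPs, 100,000 gatecrashers,
# and Babu lets in the same 5% of gatecrashers as before.
mobbed = door_rates(tp=90, fp=5000, fn=10, tn=95000)

for name, night in (("quiet", quiet), ("mobbed", mobbed)):
    print(f"{name:7s} TPR={night['tpr']:.2f} FPR={night['fpr']:.2f} "
          f"precision={night['precision']:.3f}")
```

Both nights land on the same ROC point (TPR 0.90, FPR 0.05), yet precision falls from about 0.95 to under 0.02, because 5,000 sneak-ins now share the room with 90 VIPs.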
F1 — THE WEAKEST LINK
To collapse precision and recall into one number, use the harmonic mean: F1 = 2·P·R / (P + R).
The harmonic mean is the weakest-link average, dragged toward whichever of P/R is worse. Precision 1.0, recall 0.0 → F1 = 0 (correctly damning), where the plain average would say a comfy 0.5. Fβ tilts it: F2 for recall (fraud), F0.5 for precision (spam). And note from the truth table: F1 never uses TN, which is exactly why, like PR-AUC, it survives imbalance where accuracy dies.
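The weakest-link behaviour and the Fβ tilt fit in one small function. This sketch uses the standard definition Fβ = (1 + β²)·P·R / (β²·P + R); the function name `f_beta`, the example values, and returning 0.0 when the denominator is zero are assumptions of the example.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 leans toward recall, beta < 1 toward precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    # Assumed convention: report 0.0 when precision and recall are both zero.
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Weakest link: perfect precision cannot rescue zero recall.
print(f_beta(1.0, 0.0))        # 0.0
print((1.0 + 0.0) / 2)         # 0.5, the plain average that flatters the model

# Hypothetical model with recall (0.9) much better than precision (0.5).
p, r = 0.5, 0.9
print(round(f_beta(p, r, beta=2.0), 3))   # rewards the strong recall
print(round(f_beta(p, r, beta=1.0), 3))
print(round(f_beta(p, r, beta=0.5), 3))   # punishes the weak precision
```

For this recall-heavy model the scores come out ordered F2 > F1 > F0.5, which is the tilt in action.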
MACRO VS MICRO — HOW A METRIC HIDES A BROKEN CLASS
Go multi-class: a moderation model across 14 languages. Compute a per-class F1, then squash them into one number, and how you squash decides what you can see. Macro averages the per-class scores equally (a tiny language counts as much as a giant); micro pools every prediction into one score, which for single-label problems is just accuracy, and is dominated by the big classes.
Drop tiny Assamese’s F1 to the floor and macro drops and rings the alarm, while weighted (per-class F1 averaged by class size) and micro barely move. Same model, same failure, opposite verdicts.
So when minority classes matter, report macro-F1 plus per-class precision/recall. A shiny micro-F1 on imbalanced classes is accuracy wearing a lab coat.
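To see the opposite verdicts side by side, here is a small sketch on a made-up three-language confusion matrix (the source model has 14 languages; three are enough to show the effect). The counts, class names, and helper names are all hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written in counts: 2TP / (2TP + FP + FN)."""
    den = 2 * tp + fp + fn
    return 2 * tp / den if den else 0.0


def averaged_f1(matrix):
    """matrix[true][pred] of counts -> per-class, macro, weighted and micro F1."""
    k = len(matrix)
    counts = []
    for c in range(k):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                        # rest of the true row
        fp = sum(matrix[r][c] for r in range(k)) - tp   # rest of the predicted column
        counts.append((tp, fp, fn))
    supports = [sum(row) for row in matrix]
    per_class = [f1_from_counts(*c) for c in counts]
    return {
        "per_class": per_class,
        "macro": sum(per_class) / k,
        "weighted": sum(f * n for f, n in zip(per_class, supports)) / sum(supports),
        "micro": f1_from_counts(*(sum(col) for col in zip(*counts))),
    }


# Hypothetical counts; rows = true class, columns = predicted class.
languages = ["english", "hindi", "assamese"]
matrix = [
    [8900, 80, 20],   # 9,000 English items
    [60, 830, 10],    # 900 Hindi items
    [70, 20, 10],     # 100 Assamese items, mostly misrouted
]

result = averaged_f1(matrix)
for lang, score in zip(languages, result["per_class"]):
    print(f"{lang:9s} F1={score:.3f}")
print({k: round(v, 3) for k, v in result.items() if k != "per_class"})
```

Micro and weighted stay above 0.97 while macro falls below 0.70, and only the per-class line shows that Assamese is the broken class. Micro-F1 here equals plain accuracy (9,740 correct out of 10,000).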
THE TRUTH TABLE, AS A LENS RACK
| Metric | From the cells | Plain question | Watch out |
|---|---|---|---|
| Accuracy | (TP+TN)/all | overall fraction correct | useless on imbalance |
| Precision | TP/(TP+FP) | can I trust a positive | degenerate if you flag ~nothing |
| Recall (TPR) | TP/(TP+FN) | did I catch the positives | degenerate if you flag everything |
| Specificity (TNR) | TN/(TN+FP) | did I clear the negatives | FPR = 1 − this |
| F1 / Fβ | harmonic(P, R) | balance P and R | ignores TN; equal weight may mismatch cost |
| ★ MCC | all four cells | one honest number on imbalance | less intuitive to explain |
| ROC-AUC | TPR vs FPR, swept | ranking quality | lies on rare positives |
| PR-AUC | precision vs recall, swept | quality of your positives | baseline = prevalence |
MCC (Matthews correlation) uses all four cells at once and, unlike accuracy, only scores high when both the positives and the negatives are predicted well. That makes it the most honest single number on imbalanced data (+1 perfect, 0 coin-flip, −1 inverted). When someone demands “just give me one metric,” MCC is the least misleading choice.
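A quick sketch of MCC on the four cells, using the standard formula (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function name `mcc`, the example counts, and returning 0.0 when a marginal is empty are assumptions of the example.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from the four confusion-matrix cells."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: if any row or column is empty, report 0.0.
    return num / den if den else 0.0


# Hypothetical stream of 990 legit and 10 fraud cases.
print(mcc(tp=0, fp=0, fn=10, tn=990))    # "always legit": 0.0 despite 99% accuracy
print(mcc(tp=10, fp=0, fn=0, tn=990))    # perfect: 1.0
print(mcc(tp=0, fp=990, fn=10, tn=0))    # every call inverted: -1.0
```

The always-legit model that scored 99% accuracy gets an MCC of 0, the same as a coin flip.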
⚠ Caveat — they all grade decisions, not trust. Every metric here scores a thresholded decision or a ranking. None checks whether the probability itself is believable. If you act on the score — expected-value thresholds, cost math, downstream models — you also need calibration: a predicted 0.9 should be right about 90% of the time (reliability curve / Brier score). A model can have a perfect AUC and badly miscalibrated probabilities. That’s its own experiment — coming later.
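The gap between ranking and trust shows up even on six data points. In this sketch both sets of probabilities are invented, and `ranking_auc` and `brier_score` are hand-rolled helpers written for the example, not a library API.

```python
def ranking_auc(probs, labels):
    """Chance that a random positive gets a higher score than a random negative."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier_score(probs, labels):
    """Mean squared gap between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


labels = [0, 0, 0, 1, 1, 1]

# Hypothetical model A: ranks perfectly and its probabilities are believable.
sharp = [0.1, 0.2, 0.1, 0.9, 0.8, 0.9]

# Hypothetical model B: same perfect ordering, but every score hugs 0.5.
squashed = [0.50, 0.51, 0.50, 0.54, 0.53, 0.54]

for name, probs in (("sharp", sharp), ("squashed", squashed)):
    print(f"{name:9s} AUC={ranking_auc(probs, labels):.2f} "
          f"Brier={brier_score(probs, labels):.3f}")
```

Both models have an AUC of 1.00, but the squashed one has a Brier score more than ten times worse: it says roughly 0.54 for cases that turn out positive every time, so any cost math built on those probabilities would be off.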
CONCLUSION
It all reduces to four cells. Accuracy reads the diagonal and dies on imbalance; precision and recall read a column and a row and trade off on the dial; F1 and MCC compress them; ROC and PR sweep them. A 0.99 ROC-AUC can sit on single-digit precision; micro-F1 hides the minority class. Picking the wrong slice doesn’t just under-measure — it blinds you to the exact failure you shipped.
Match the metric to the cost asymmetry and the class balance, set the dial by cost, and report the operating point — not a headline number.
WHAT NEXT
These metrics grade one prediction. But a retriever hands back a ranked list — and “is the right chunk in there?” isn’t enough when the LLM only reads the top. Next: MRR and NDCG, and why a chunk being retrieved doesn’t mean it helped (EXP-012). Calibration — whether the probabilities are trustworthy — gets its own experiment after that.