EXP #011 Ev Ev Q-02 · ML Foundations from First Principles · step 5 ⚠ Learning

Pick the Wrong Metric and You Go Blind

A model has no single "performance": it has a confusion matrix, a score, and a dial. Every metric is a slice of the matrix's four cells, and the wrong slice hides the exact failure you shipped. A bouncer, a club, and the 0.99 that lies.

2026-06-29 12 MIN READ COMPLETE

HYPOTHESIS

H₀ Accuracy (or AUC) tells you how good a classifier is — read the number and you know where you stand.

METHOD

Fifth entry in ML Foundations from First Principles. The hypothesis is the one everyone secretly holds: “accuracy — or AUC — tells you how good the model is.” I re-sat the whole metrics topic as a senior-style mock and rebuilt each piece from its mechanism, and the verdict is FAR FROM IT: a classifier has no single number. It has a confusion matrix, a score, and a dial — and a completely different story under every metric. Pick the wrong one and you don’t just under-measure, you go blind to the failure you shipped.

THE TRUTH TABLE — EVERY METRIC IS A SLICE

Everything starts with the confusion matrix (the truth table): the four ways a prediction can land. Using Bouncer Babu — let in (predicted positive) or turn away (predicted negative), crossed with who’s truly a VIP or a gatecrasher:

✓TP (true positive) — VIP let in · TN (true negative) — gatecrasher kept out. The two you got right.
✗FP (false positive) — gatecrasher sneaked in · FN (false negative) — VIP turned away. The two ways you were wrong.

Here’s the unlock: every metric is just a different slice of these four cells:

	predicted IN	predicted OUT
actual 🤩 VIP	TP = 11	FN = 2
actual 🕶️ crasher	FP = 3	TN = 12

Precision = TP/(TP+FP) = 11/(11+3) = 0.79

The dial sets the four counts, and every metric reads a different slice of them: precision is a column, recall is a row, specificity the other row, accuracy the diagonal. F1 never touches TN, which is why it survives imbalance.

Two cousins fall straight out of the matrix:

$\text{accuracy} = \frac{TP+TN}{TP+FP+FN+TN} \qquad \text{specificity (TNR)} = \frac{TN}{TN+FP}$

Accuracy — plain: overall fraction correct — is the famous trap: on a 99%-legit stream, “predict legit for everyone” scores 99% accuracy with zero recall. The metric applauds a useless model, because the giant TN cell drowns everything. That’s why accuracy is the first number to distrust on imbalance. Specificity — plain: of the real negatives, how many you correctly cleared — is recall’s mirror, and FPR = 1 − specificity (remember that; it’s ROC’s x-axis).
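The trap is easy to reproduce by scoring the "predict legit for everyone" model directly. The sketch below is a minimal illustration in plain Python; the 1,000-item stream with 10 positives is a made-up example chosen to match the 99%-legit figure, not data from this experiment.

```python
# Hypothetical stream: 1 = fraud (positive), 0 = legit (negative).
labels = [1] * 10 + [0] * 990          # 99% legit
preds = [0] * len(labels)              # "predict legit for everyone"

tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

accuracy = (tp + tn) / len(labels)     # the TN cell carries the whole score
recall = tp / (tp + fn)                # of the real positives, how many were caught

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

The model never flags anything, so TP is zero and recall is zero, yet the 990 true negatives alone are enough to put accuracy at 0.99.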

SPECIMEN · why it's one truth table 100×

Don’t memorise five formulas — memorise the 2×2. Precision reads a column (of who I let in, how many were really VIPs); recall reads a row (of the real VIPs, how many I let in); specificity the other row; accuracy the diagonal. Every metric is just one question you can ask of these four cells.
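To make the "one table, many questions" idea concrete, here is a small sketch that reads each metric off the same four counts. The `Cells` class and `metrics` function are names invented for this illustration, and the code assumes every denominator is non-zero (no guard for a model that flags nobody).

```python
from dataclasses import dataclass


@dataclass
class Cells:
    tp: int
    fp: int
    fn: int
    tn: int


def metrics(c: Cells) -> dict:
    """Read each metric as a slice of the same four cells."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "precision": c.tp / (c.tp + c.fp),        # column: everyone predicted IN
        "recall": c.tp / (c.tp + c.fn),           # row: every actual VIP
        "specificity": c.tn / (c.tn + c.fp),      # row: every actual gatecrasher
        "accuracy": (c.tp + c.tn) / total,        # diagonal over everything
        "f1": 2 * c.tp / (2 * c.tp + c.fp + c.fn),  # same as 2PR/(P+R); no TN
    }


# Bouncer Babu's counts from the truth table above.
babu = metrics(Cells(tp=11, fp=3, fn=2, tn=12))
for name, value in babu.items():
    print(f"{name:12s} {value:.2f}")
```

Changing only `tn` moves specificity and accuracy and leaves precision, recall, and F1 untouched, which is the imbalance argument in one line.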

THE DIAL — AND HOW TO SET IT

A classifier hands you a score, not a verdict; a threshold (the dial) turns it into a decision. Moving the dial shifts VIPs between TP and FN and gatecrashers between FP and TN, so two metrics pull against each other:

$\text{precision} = \frac{TP}{TP+FP} \qquad \text{recall} = \frac{TP}{TP+FN}$

Lower the dial → flag more → recall climbs, precision slips. Which way you lean is set by cost asymmetry: fraud → a miss (FN) is costly → recall-first; spam → a real email lost (FP) is costly → precision-first. But “lean toward recall” is hand-wavy — the dial is actually a cost calculation:

Figure (interactive): sliders for the cost of a false alarm (FP) and the cost of a miss (FN), with a green line marking the cost-minimizing dial. The dial isn’t a vibe: it minimizes total cost. Crank up the cost of a miss (FN), as in fraud, and the optimal threshold slides left (lower), toward recall. That’s how you pick an operating point.

One guardrail: optimise one metric subject to a floor on the other, or it goes degenerate — “100% recall” is free if you flag everyone.
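The cost calculation behind the dial can be sketched as a brute-force sweep: try every candidate threshold, price the resulting false alarms and misses, and keep the cheapest. The scores, labels, and cost values below are hypothetical, and the sketch assumes correct decisions cost nothing. It also applies no floor on the other metric, so the guardrail above would be a separate constraint.

```python
# Hypothetical (score, label) pairs; label 1 = positive (e.g. fraud).
SCORED = [
    (0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
    (0.55, 1), (0.40, 0), (0.35, 1), (0.20, 0), (0.10, 0),
]


def total_cost(threshold, cost_fp, cost_fn, scored=SCORED):
    """Cost of flagging every item whose score is >= threshold."""
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def best_threshold(cost_fp, cost_fn, scored=SCORED):
    """Sweep every distinct score (plus 'flag nobody') and keep the cheapest."""
    candidates = sorted({s for s, _ in scored}) + [float("inf")]
    return min(candidates, key=lambda t: total_cost(t, cost_fp, cost_fn, scored))


recall_first = best_threshold(cost_fp=1, cost_fn=5)     # misses are expensive
precision_first = best_threshold(cost_fp=5, cost_fn=1)  # false alarms are expensive
print(recall_first, precision_first)
```

With these toy numbers the sweep lands on 0.35 when a miss costs five times a false alarm and on 0.9 when the costs are reversed: the same scores, two different operating points.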

BOUNCER BABU — WHERE THE CURVE COMES FROM

Instead of one dial setting, sweep all of them. Plot catch rate (TPR) against sneak-in rate (FPR = FP/(FP+TN)) as the dial sweeps — that’s the ROC curve, from (0,0) “let nobody in” to (1,1) “let everybody in.” AUC (the area) = the chance Babu scores a random VIP above a random gatecrasher — his “eye,” independent of the dial.
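The "random VIP above random gatecrasher" reading of AUC can be computed literally, by checking every (VIP, gatecrasher) pair. This is a minimal sketch with made-up scores. It is quadratic in the number of examples, so it suits illustration rather than large datasets, and it counts tied scores as half a win, which is the usual convention.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


vip_scores = [0.95, 0.90, 0.70, 0.55, 0.35]        # hypothetical scores
crasher_scores = [0.80, 0.60, 0.40, 0.20, 0.10]    # hypothetical scores
auc = pairwise_auc(vip_scores, crasher_scores)
print(f"AUC = {auc:.2f}")
```

No threshold appears anywhere in the function, which is the sense in which AUC is independent of the dial.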

✦ Bouncer Babu · reading the curve

Read the ROC curve like a story. Its two ends: (0,0) is the door locked — nobody in, zero caught, zero sneaked (that’s the origin); (1,1) is the door wide open — everyone in, every VIP and every gatecrasher. Between them the dot walks as the dial loosens.

The more the arc bulges toward the top-left, the better Babu’s eye. The top-left corner, (0,1), is the dream: zero sneak-in rate and a full catch rate, so all VIPs in and zero gatecrashers. The diagonal is Babu guessing at random: a coin flip, AUC 0.5.

And the lovely bit: AUC = the chance Babu gives a random real VIP a higher score than a random gatecrasher. It’s threshold-free — it judges whether his scores are in the right order before you argue about the dial, which is exactly what lets the owner compare two bouncers fairly.

~ from Babu's notebook ~ ✦

IN (let through) · 14

🤩🤩🤩🤩🤩🤩🤩🤩🤩🤩🤩🕶️🕶️🕶️

OUT (turned away) · 14

🤩🤩🕶️🕶️🕶️🕶️🕶️🕶️🕶️🕶️🕶️🕶️🕶️🕶️

TP 11 · FP 3 · FN 2 · TN 12 · precision 0.79 · recall 0.85

Figure (interactive): sliding the dial moves 🤩 VIPs and 🕶️ gatecrashers across IN/OUT, and a dot marks the operating point on both the ROC and PR curves. In the MOBBED setting gatecrashers swamp the VIPs: ROC stays high while PR caves.

Watch MOBBED. The correctly blocked gatecrashers (TN) become an enormous pile, so even hundreds of sneak-ins barely move FPR = FP/(FP+TN) and ROC stays near-perfect. Inside the club, though, those sneak-ins swamp the few VIPs, so precision craters. ROC hides the false positives in a giant TN pile; precision has no TN, so it tells the truth. On rare-positive problems report PR-AUC (baseline = the prevalence, not 0.5), not ROC-AUC.

The PR curve asks Babu’s other question — not “did I catch the VIPs?” but “is the room actually exclusive?” (precision = of everyone inside, what fraction are real VIPs). On a mobbed night the gatecrashers who slip in barely move the sneak-in rate — FPR drowns them in a giant pile of correctly-blocked randos, so ROC shrugs — yet inside the club they swamp the handful of VIPs, so precision collapses.

The keeper: balanced crowd, you just want a good eye → trust ROC-AUC (baseline 0.5); VIPs are rare and a room packed with randos is the failure → trust PR-AUC (baseline = how rare VIPs are).
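The MOBBED effect can be reproduced with arithmetic alone. The sketch below keeps Babu's per-person behaviour fixed (same catch rate, same sneak-in rate) and only multiplies the number of gatecrashers; the 100x crowd factor is an assumption picked for illustration.

```python
def roc_point_and_precision(tp, fp, fn, tn):
    """Return (TPR, FPR, precision) for one dial setting."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return tpr, fpr, precision


# Balanced night: the counts from Babu's notebook.
balanced = roc_point_and_precision(tp=11, fp=3, fn=2, tn=12)

# MOBBED (assumption): 100x more gatecrashers, same per-crasher sneak-in rate.
crowd = 100
mobbed = roc_point_and_precision(tp=11, fp=3 * crowd, fn=2, tn=12 * crowd)

for name, (tpr, fpr, precision) in [("balanced", balanced), ("mobbed", mobbed)]:
    print(f"{name:9s} TPR={tpr:.2f} FPR={fpr:.2f} precision={precision:.3f}")
```

The ROC coordinates are identical in both rows, because TPR and FPR are rates within each true class. Precision mixes the two classes, so it falls from about 0.79 to about 0.035 once the crowd grows.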

F1 — THE WEAKEST LINK

To collapse precision and recall into one number, use the harmonic mean:

$F_1 = \frac{2PR}{P+R} \qquad\qquad F_\beta = \frac{(1+\beta^2)\,PR}{\beta^2 P + R}$

The harmonic mean is the weakest-link average — dragged toward whichever of P/R is worse. Precision 1.0, recall 0.0 → F1 = 0 (correctly damning), where the plain average would say a comfy 0.5. Fβ tilts it: F2 for recall (fraud), F0.5 for precision (spam). And note from §03: F1 never uses TN, which is exactly why — like PR-AUC — it survives imbalance where accuracy dies.
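A short sketch of the Fβ formula above, evaluated on Babu's precision and recall and on the degenerate case. The function name `f_beta` and the choice to return 0.0 when the denominator vanishes are conventions assumed for this example.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (0.0 if both vanish)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


p, r = 11 / 14, 11 / 13          # Babu's precision and recall
print(f"F1   = {f_beta(p, r):.3f}")
print(f"F2   = {f_beta(p, r, beta=2):.3f}")     # leans toward recall
print(f"F0.5 = {f_beta(p, r, beta=0.5):.3f}")   # leans toward precision
print(f"degenerate: F1 = {f_beta(1.0, 0.0)}, plain mean = {(1.0 + 0.0) / 2}")
```

Because Babu's recall is higher than his precision, F2 comes out above F1 and F0.5 below it: each β pulls the score toward the metric it favours.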

MACRO VS MICRO — HOW A METRIC HIDES A BROKEN CLASS

Go multi-class — a moderation model across 14 languages. Compute a per-class F1, then squash them into one number, and how you squash decides what you can see. Macro averages per-class equally (a tiny language counts as much as a giant); micro pools every prediction into one score — which for single-label is just accuracy — and is dominated by the big classes.

MACRO (equal vote) = 0.62 · small class counts fully → exposes it
WEIGHTED / micro ≈ acc = 0.95 · size vote → hides it
Assamese F1 = 0.10

Figure (interactive): drag tiny Assamese’s F1 to the floor. MACRO drops and rings the alarm; WEIGHTED / micro barely move. Same model, same failure, opposite verdicts.

So when minority classes matter, report macro-F1 plus per-class precision/recall. A shiny micro-F1 on imbalanced classes is accuracy wearing a lab coat.
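The two averaging schemes can be compared on a toy three-class example. The class names and counts below are invented for illustration (they are not the 14-language model's numbers); they are chosen so that total FP equals total FN, as it must when every item receives exactly one label.

```python
# Hypothetical per-class counts (tp, fp, fn) for a single-label model.
COUNTS = {
    "big_a": (960, 58, 40),
    "big_b": (950, 48, 50),
    "tiny": (2, 2, 18),      # the broken minority class
}


def f1(tp, fp, fn):
    """F1 from counts; 0.0 when the class is never predicted or present."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


per_class = {name: f1(*c) for name, c in COUNTS.items()}
macro = sum(per_class.values()) / len(per_class)   # every class gets one vote

tp_all = sum(c[0] for c in COUNTS.values())
fp_all = sum(c[1] for c in COUNTS.values())
fn_all = sum(c[2] for c in COUNTS.values())
micro = f1(tp_all, fp_all, fn_all)                 # pool counts, then score
accuracy = tp_all / (tp_all + fn_all)              # each item is a TP or FN of its true class

print({k: round(v, 3) for k, v in per_class.items()})
print(f"macro={macro:.3f} micro={micro:.3f} accuracy={accuracy:.3f}")
```

Micro-F1 and accuracy agree at roughly 0.95, while macro-F1 sits below 0.70 because the tiny class's F1 of about 0.17 counts as a full third of the average.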

THE TRUTH TABLE, AS A LENS RACK

DATA TABLE n=4

Metric	From the cells	Plain question	Watch out
Accuracy	(TP+TN)/all	overall fraction correct	useless on imbalance
Precision	TP/(TP+FP)	can I trust a positive	degenerate if you flag ~nothing
Recall (TPR)	TP/(TP+FN)	did I catch the positives	degenerate if you flag everything
Specificity (TNR)	TN/(TN+FP)	did I clear the negatives	FPR = 1 − this
F1 / Fβ	harmonic(P, R)	balance P and R	ignores TN; equal weight may mismatch cost
★ MCC	all four cells	one honest number on imbalance	less intuitive to explain
ROC-AUC	TPR vs FPR, swept	ranking quality	lies on rare positives
PR-AUC	precision vs recall, swept	quality of your positives	baseline = prevalence

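MCC (Matthews correlation) is the correlation between predicted and actual labels. Like accuracy it draws on all four cells, but it only scores high when both classes are predicted well, which makes it the most honest single number on imbalanced data (+1 perfect, 0 no better than chance, −1 inverted). When someone demands “just give me one metric,” MCC is the least misleading choice.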
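For reference, the standard formula is $\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$. A minimal sketch follows, applied to Babu's counts and to a hypothetical always-negative model on a 99%-negative stream. Returning 0.0 when a row or column of the table is empty is a common convention, assumed here.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when any margin of the table is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


babu_mcc = mcc(tp=11, fp=3, fn=2, tn=12)
lazy_mcc = mcc(tp=0, fp=0, fn=10, tn=990)   # hypothetical: predicts negative for everyone

print(f"Babu:            {babu_mcc:.3f}")
print(f"always-negative: {lazy_mcc:.3f}")
```

The always-negative model has 99% accuracy but an MCC of 0, which is the behaviour the paragraph above relies on.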

⚠ Caveat — they all grade decisions, not trust. Every metric here scores a thresholded decision or a ranking. None checks whether the probability itself is believable. If you act on the score — expected-value thresholds, cost math, downstream models — you also need calibration: a predicted 0.9 should be right about 90% of the time (reliability curve / Brier score). A model can have a perfect AUC and badly miscalibrated probabilities. That’s its own experiment — coming later.
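A tiny sketch of why ranking quality and believable probabilities are separate questions: AUC depends only on the order of the scores, so squashing them toward 0.5 leaves AUC untouched while the Brier score gets worse. The four examples and both score vectors are made up, and with so few points this is an illustration of the mechanism rather than a calibration analysis.

```python
def brier(probs, labels):
    """Mean squared gap between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def pairwise_auc(probs, labels):
    """Chance a random positive outscores a random negative; ties count half."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
confident = [0.90, 0.90, 0.10, 0.10]   # hypothetical scores
squashed = [0.60, 0.55, 0.50, 0.45]    # hypothetical scores, same ordering of classes

for name, probs in [("confident", confident), ("squashed", squashed)]:
    print(f"{name:9s} AUC={pairwise_auc(probs, labels):.2f} Brier={brier(probs, labels):.3f}")
```

Both score vectors rank every positive above every negative, so both get AUC 1.0, but only the first produces probabilities close to the outcomes.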

CONCLUSION

⚠ FAR FROM IT

There is no single number — the metric is a lens.

It all reduces to four cells. Accuracy reads the diagonal and dies on imbalance; precision and recall read a column and a row and trade off on the dial; F1 and MCC compress them; ROC and PR sweep them. A 0.99 ROC-AUC can sit on single-digit precision; micro-F1 hides the minority class. Picking the wrong slice doesn’t just under-measure — it blinds you to the exact failure you shipped.

Match the metric to the cost asymmetry and the class balance, set the dial by cost, and report the operating point — not a headline number.

WHAT NEXT

These metrics grade one prediction. But a retriever hands back a ranked list — and “is the right chunk in there?” isn’t enough when the LLM only reads the top. Next: MRR and NDCG, and why a chunk being retrieved doesn’t mean it helped (EXP-012). Calibration — whether the probabilities are trustworthy — gets its own experiment after that.
