EXP #010 Ev Ev Q-02 · ML Foundations from First Principles · step 4 ⚠ Learning

Cross-Validation: Why a Great CV Score Lies

A 0.92 in cross-validation, a 0.71 in production, and no code changed. Leakage and the winner's curse inflate CV without touching the real world — and the fix is treating CV as a strict simulation of deployment.

2026-06-14 10 MIN READ COMPLETE

HYPOTHESIS

H₀ A high cross-validation score means the model will generalise — trust the number and ship.

METHOD

Fourth entry in ML Foundations from First Principles. The hypothesis I walked in with is the one most people quietly hold: a high cross-validation score means the model will generalise. Then comes the scenario that breaks it — 0.92 in CV, 0.71 in production, and nothing about the model changed.

I sat the four-question mock — diagnose that cliff, name the leakage types, design CV for delayed labels, explain why best-of-N tuning is biased — and the lesson was a priority correction: before blaming the world for drifting, ask whether your 0.92 was ever honest. Verdict: FAR FROM IT.

K-FOLD — EVERY SLICE TAKES A TURN

Start with the honest version. You have a question bank and want to estimate your exam score. Test on one fixed slice and the estimate is partly luck (maybe that slice was easy), and those questions are lost for studying. So rotate: split into k folds, and let each fold take a turn as the test set while the rest train.

$\text{estimate} = \frac{1}{k}\sum_{i=1}^{k} \text{score}_i$

[Interactive figure: k-fold rotation, k = 5. Each fold takes a turn as the test set. Shown state: estimate after 1 fold = 0.810.]

↑ Step through the rounds: the coral block is the held-out test each time. Averaging the k scores smooths out the luck of any single split, though it does not remove it entirely. This is the tool; the rest of the post is the ways it quietly stops being honest.

The one rule that makes it honest: every validation fold must be independent of everything used to build the model — the preprocessing, the features, and any related rows. Break that and the score inflates.
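The rotation and the averaging formula above can be sketched in plain Python. This is a minimal illustration, not a replacement for a library splitter: it does no shuffling, and the per-fold scores are hypothetical numbers standing in for real model evaluations.

```python
from statistics import mean

def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def cv_estimate(fold_scores):
    """The CV estimate is the plain mean of the k per-fold scores."""
    return mean(fold_scores)

splits = list(k_fold_indices(n_rows=10, k=5))

# Hypothetical per-fold scores; in practice each comes from training on
# train_idx and scoring on test_idx.
fold_scores = [0.81, 0.78, 0.84, 0.80, 0.82]
estimate = cv_estimate(fold_scores)
```

Every row lands in a test fold exactly once and never sits in its own training set, which is the independence the rule above asks for at the row level.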

LEAKAGE — THE OFFLINE→ONLINE CLIFF

Leakage is information from the validation fold sneaking into training. It inflates CV but never production — so the gap between your CV score and your live score is the signature. The nastiest example: target encoding, where you replace a category with its average label. For a category that appears once — a user_id — that average is that row’s own label. You’ve pasted the answer into a feature: CV reads 0.99, prod collapses.

The fraud version is the same shape: a feature like lifetime_chargeback_count is only filled in because fraud happened, so at scoring time in prod it’s empty. The model leaned on a column that won’t exist when it matters.
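The target-encoding case can be reproduced in a few lines. The sketch below uses made-up user ids and labels and two hand-picked folds (all assumptions for illustration): fitting the encoding on the full data hands each row its own label, while an out-of-fold encoding only ever sees training rows.

```python
from collections import defaultdict

def fit_target_encoding(categories, labels):
    """Return (mean label per category, global mean), using only the rows given."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, labels):
        sums[c] += y
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    prior = sum(labels) / len(labels)
    return means, prior

def encode(categories, means, prior):
    """Categories unseen at fit time fall back to the global mean."""
    return [means.get(c, prior) for c in categories]

# Hypothetical data: every user_id appears exactly once.
user_ids = ["u1", "u2", "u3", "u4", "u5", "u6"]
labels = [1, 0, 1, 0, 0, 1]

# Leaky version: encoding fit on all rows, validation rows included.
means, prior = fit_target_encoding(user_ids, labels)
leaky = encode(user_ids, means, prior)

# Out-of-fold version: each row is encoded by a fit that never saw it.
folds = [[0, 1, 2], [3, 4, 5]]
oof = [None] * len(labels)
for val_idx in folds:
    train_idx = [i for i in range(len(labels)) if i not in val_idx]
    m, p = fit_target_encoding(
        [user_ids[i] for i in train_idx], [labels[i] for i in train_idx]
    )
    encoded = encode([user_ids[i] for i in val_idx], m, p)
    for i, value in zip(val_idx, encoded):
        oof[i] = value
```

Here `leaky` is the label column itself, so any model can score perfectly on it, whereas `oof` carries only the training-fold average and no information about the row's own label.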

[Interactive figure: CV score vs production score for a selected leak. Shown state: category encoded by its own mean label, CV 37 pts above prod (the leak signature).]

↑ Pick a leak, then flip the fit from full data to inside each fold. Under the leak the CV bar towers over production; fit honestly and the two snap together. The fix is structural: every preprocessing step fit inside the fold (a Pipeline), and the splitter chosen to match the data’s structure.
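One way to get "fit inside the fold" for free is to wrap preprocessing and model in a scikit-learn pipeline and hand the whole thing to the cross-validator. The sketch below uses synthetic data and an arbitrary model choice (both assumptions); the point is only that the scaler is refit on each training fold and never sees the held-out rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# The scaler lives inside the pipeline, so cross_val_score fits it on the
# training part of each split and only applies it to the validation part.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
```

The leaky alternative would be calling `StandardScaler().fit_transform(X)` once before splitting, which lets validation-fold statistics shape the training features.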

DELAYED LABELS — MATURITY & THE EMBARGO GAP

Now the trap I couldn’t name in the mock. Imagine running Rapido in Tirupati and predicting which ride draws a complaint — except the complaint can land up to 30 days later. Two things break a naive split:

① Immature labels. A ride from last week marked “clean” might just not have been complained about yet. Train on it as a confirmed negative and you teach the model that bad rides are fine. Only train on rides whose 30-day window has fully closed, like a shopkeeper who won’t call a customer a “good payer” before the due date passes.
② The seam. A training ride near the train/validation boundary has a complaint window that reaches into the validation period, so its label is decided by something that happens in the future you’re testing on.

The fix for the seam is an embargo gap: leave a gap at least as long as the maturation window between the last training ride and the start of validation, so no training label can reach across.
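Both rules, maturity and the embargo gap, fit in one small split function. This is a sketch under stated assumptions: the function name, the dates, and the choice to implement the gap by dropping training rides within 30 days of the validation start are all illustrative, not taken from a real pipeline.

```python
from datetime import date, timedelta

MATURITY = timedelta(days=30)  # complaint window described in the post

def split_with_embargo(ride_dates, val_start, today, maturity=MATURITY):
    """Return (train_idx, val_idx) for a time-ordered split.

    Train keeps only rides whose label window closes before validation starts,
    which drops the rides sitting in the embargo gap. Validation keeps rides on
    or after val_start whose label window has already closed by `today`.
    """
    train_idx, val_idx = [], []
    for i, ride_day in enumerate(ride_dates):
        window_end = ride_day + maturity
        if window_end < val_start:
            train_idx.append(i)      # label settled before the seam
        elif ride_day >= val_start and window_end <= today:
            val_idx.append(i)        # validation ride with a mature label
        # everything else is embargoed or still immature, so it is dropped
    return train_idx, val_idx

# Hypothetical rides and dates.
rides = [
    date(2026, 3, 1),   # window closes 31 Mar: safe for training
    date(2026, 4, 10),  # window reaches 10 May: inside the embargo gap
    date(2026, 5, 5),   # validation ride, window closed by 4 Jun
    date(2026, 6, 1),   # label not mature until 1 Jul: unusable for now
]
train_idx, val_idx = split_with_embargo(
    rides, val_start=date(2026, 5, 1), today=date(2026, 6, 14)
)
```

Only the first ride is trainable and only the third is scorable; the other two are exactly the seam and immature cases from the list above.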

[Interactive figure: ride label windows around the train/validation seam, maturity M = 30d. Shown state: 4 train rides leak because their label window reaches into validation.]

↑ Each line is a ride’s label window. Rides within M days of the seam glow red — their complaint could land in validation. Turn on the embargo and the gap pushes validation clear, so no label leaks across. Drag M to feel how the danger zone grows with the delay.
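scikit-learn's `TimeSeriesSplit` accepts a `gap` argument that leaves out a number of samples between each training block and its validation block. The gap is counted in rows, not days, so the sketch below assumes one row per day (a simplification) to make `gap=30` correspond to M = 30d.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Assumption: rows are sorted by time with exactly one row per day.
X = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3, test_size=10, gap=30)

observed_gaps = []
for train_idx, val_idx in tscv.split(X):
    # Number of rows skipped between the last training row and the first
    # validation row of this split.
    observed_gaps.append(int(val_idx[0] - train_idx[-1] - 1))
```

With real ride data, where many rides share a day, the row-based gap no longer maps to a fixed number of days, and a date-based filter is the safer way to enforce the embargo.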

THE WINNER’S CURSE — AND NESTED CV

Last trap. You try 50 hyperparameter configs, pick the one with the best CV score, and report that as your expected performance. The winner won partly by genuine merit and partly by luck — it happened to suit those particular folds. The maximum over many noisy estimates drifts above the truth:

$\mathbb{E}\big[\max_{1 \le j \le N}\text{CV}_j\big] > \text{true performance}, \qquad \text{growing with } N$
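A small simulation makes the inequality concrete. It assumes the simplest possible setting: every config has the same true score and each CV estimate is that score plus independent Gaussian noise. The true score of 0.79 and the noise level of 0.03 are illustrative choices, not measurements.

```python
import random
from statistics import mean

def expected_best_of_n(n_configs, true_score=0.79, noise_sd=0.03,
                       trials=2000, seed=0):
    """Average, over many simulated searches, of the best CV score among N configs.

    All configs share the same true score, so anything the winner shows above
    true_score is selection luck rather than merit.
    """
    rng = random.Random(seed)
    best_scores = []
    for _ in range(trials):
        cv_scores = [rng.gauss(true_score, noise_sd) for _ in range(n_configs)]
        best_scores.append(max(cv_scores))
    return mean(best_scores)

curve = {n: expected_best_of_n(n) for n in (1, 10, 50, 200)}
```

With a single config the average sits at the true score; as N grows the reported best climbs even though no config is actually better than any other.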

The fix is nested CV: an inner loop picks the hyperparameters, an outer loop scores the winner on a fold the picking never touched. Think of a sealed final exam locked in the principal’s office — you choose your study strategy from weekly mock tests, then sit the sealed paper exactly once. Because the final was never used to choose, its score is honest.
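In scikit-learn, one common way to express this is to pass a `GridSearchCV` object as the estimator to `cross_val_score`: the search runs on each outer training split, and the outer fold only scores the refit winner. The data, model, and grid below are synthetic placeholders chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # the weekly mocks
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # the sealed paper

# Inner loop: pick C using only the outer-training rows.
search = GridSearchCV(pipe, param_grid, cv=inner_cv)

# Outer loop: score the refit winner on rows the picking never touched.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
nested_estimate = nested_scores.mean()
```

The number to report is `nested_estimate`, not `search.best_score_` from a single search over all the data, since the latter is the best-of-N figure the section warns about.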

[Interactive figure: best-of-N CV score vs number of configs tried, slider from 1 to 200 configs. Shown state: reported = 0.888, honest = 0.79, inflation = +10 pts.]

↑ Slide the number of configs you try. The best-of-N CV climbs as you search more — that’s the curse. Nested CV stays flat at the honest number, because the data it scores on never took part in the picking.

FIVE WAYS THE SCORE LIES

DATA TABLE n=3

| Leak / trap | How it sneaks in | Fix |
|---|---|---|
| Preprocessing | scaler/encoder fit on full data | fit inside each fold (Pipeline) |
| ★ Target encoding | category ← its own mean label | out-of-fold encoding |
| Temporal | random fold on time-ordered data | TimeSeriesSplit + embargo gap |
| Group / entity | same user in train and val | GroupKFold |
| Winner's curse | report best-of-N tuned score | nested CV / one held-out test |

One mental model collapses the whole list: cross-validation is a simulation of deployment. Anything that wouldn’t be true in production — future data, the label itself, related rows of the same entity, parameters fit on the validation data — must not exist when the model trains.
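The Group / entity row can be sketched with scikit-learn's `GroupKFold`, which keeps all rows of one group on the same side of every split. The user ids below are hypothetical; the check at the end confirms that no user appears in both train and validation.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 12 rows belonging to 4 users, 3 rows each.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
user_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

gkf = GroupKFold(n_splits=4)

shared_users = []
for train_idx, val_idx in gkf.split(X, y, groups=user_ids):
    # Users present on both sides of this split; should always be empty.
    shared_users.append(set(user_ids[train_idx]) & set(user_ids[val_idx]))
```

A plain `KFold` on the same rows would happily put two rides from one user on opposite sides of the split, which is the "related rows of the same entity" case the mental model rules out.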

CONCLUSION

⚠ FAR FROM IT

Hypothesis refuted — a great CV score isn’t a promise.

A high cross-validation score doesn’t mean the model generalises — it means it did well on data that may not have been independent of its own training. Two ways the number lies: leakage (fit preprocessing inside the fold; split by the structure — Group, Time) and selection bias (nested CV, not best-of-N). Delayed labels add maturity + embargo on top.

The priority correction that mattered most: when CV beats production, suspect your own estimate before you blame the world. Treat CV as a strict simulation of deployment and the score stops lying.

WHAT NEXT

Evaluation metrics — precision/recall, F1, the PR-AUC-vs-ROC trap on imbalanced data, and MRR/NDCG for ranking. Once the estimate is honest, the next question is whether you’re even measuring the right thing.
