Regularisation: Four Knobs That Aren't the Same Knob
I assumed L1, L2, dropout and BatchNorm were one family of regularisation knobs: pick whichever. Far from it: BatchNorm is an optimisation aid, L1 and L2 shrink by different machines, and "L2 inside Adam" isn't even clean weight decay.
Third entry in ML Foundations from First Principles. I went in with a lazy hypothesis: L1, L2, dropout and BatchNorm are all “regularisation” — one bucket of interchangeable knobs. Then I sat a four-question mock — AdamW vs L2, what BatchNorm is actually for, the inverted-dropout rescale, and why L1 zeroes weights when L2 doesn’t — and the bucket fell apart.
The verdict this time is FAR FROM IT, and that’s the useful part: the shared label hides four different machines. Here’s each one, with a knob you can turn.
L1 vs L2 — A BUDGET, AND THE SHAPE OF IT
Picture cutting your monthly spending. L2 is “trim everything proportionally”: shave a slice off groceries, fuel, Netflix alike; but as a bill gets small you trim less of it, so nothing ever hits zero. L1 is “a flat pressure on every line item”: the big essentials absorb it, but your rarely-used small subscription gets cancelled outright. Same goal, opposite endgame, and it’s all in the gradient.
L2 penalises (λ/2)w², so its gradient is λw: a pull proportional to the weight. As w shrinks the pull fades, so weights glide toward 0 but never arrive. That’s the proportional trim: the smaller the bill, the gentler the cut.
L1 penalises λ|w|, so its gradient is λ·sign(w): a constant-magnitude push that doesn’t ease off near zero. It keeps shoving until the small weights are pinned at exactly 0 (the cancelled subscription), while the big ones survive. That’s why L1 does feature selection and L2 doesn’t. Geometrically it’s the diamond’s corners sitting on the axes vs the circle that has none, which is what you can drag below.
↑ Drag the optimum around. The solution is where the loss first touches the budget. The L1 diamond keeps catching it on a corner → one weight snaps to exactly 0 (sparse). The L2 circle has no corners, so it only ever shrinks: both weights stay alive.
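To make the two endgames concrete, here is a minimal NumPy sketch that applies only the penalty, with no data loss at all, so the shrinking behaviour is isolated. The function names, starting weights, learning rate and λ are all made up for illustration. One caveat worth stating: a raw subgradient step with λ·sign(w) tends to hover around zero rather than land on it, so the L1 side below uses the soft-threshold (proximal) step, which is the usual way to get exact zeros.

```python
import numpy as np

def l2_step(w, lr, lam):
    """One gradient step on the penalty (lam/2)*w**2: a multiplicative shrink."""
    return (1.0 - lr * lam) * w

def l1_prox_step(w, lr, lam):
    """Soft-threshold step for lam*|w|: a constant-size pull, clipped at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

# Hypothetical weights: one big, one medium, one small.
w0 = np.array([2.0, 0.5, -0.05])
w_l2, w_l1 = w0.copy(), w0.copy()

for _ in range(100):
    w_l2 = l2_step(w_l2, lr=0.1, lam=0.1)
    w_l1 = l1_prox_step(w_l1, lr=0.1, lam=0.1)

print("L2:", w_l2)  # every weight smaller, none exactly zero
print("L1:", w_l1)  # the two smaller weights are exactly zero, the big one survives
```

Each L2 step multiplies by 0.99, so the weights decay geometrically and never reach zero. Each L1 step subtracts a fixed 0.01 from the magnitude, so anything smaller than the total pull is cancelled outright.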
THE TWIST — “L2 IS WEIGHT DECAY” IS A LIE UNDER ADAM
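Under plain SGD, the L2 gradient term λw turns into a multiplicative shrink factor (1 − αλ) on the weight, since w − α(∇L + λw) = (1 − αλ)w − α∇L, and that is weight decay. But only for plain SGD: the moment you switch to Adam, the story breaks, and this is the one I got wrong in the mock.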
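Coupled L2 folds the penalty into the gradient, so it rides through the adaptive denominator:

$$g_t = \nabla L(w_t) + \lambda w_t, \qquad w_{t+1} = w_t - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where m̂ and v̂ are Adam’s bias-corrected running averages of g and g².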
That √v̂ divides everything, including the λw term. So a weight with a big recent gradient (large √v̂) gets its decay diluted — the loud weights, the ones you most want to rein in, barely shrink. AdamW fixes it by decoupling: it pulls λw out of the gradient and applies it straight to the weight, untouched by √v̂.
Think of λ as a fine everyone in class should pay equally. Adam+L2 divides the fine by how loud you are — so the loudest kid walks free. AdamW charges everyone the same. That’s why AdamW, not vanilla Adam, trains the transformers.
↑ Same λ for all four weights. Under Adam+L2 the decay collapses as √v̂ grows: the LOUD weight is decayed an order of magnitude less. AdamW is flat: every weight, same decay.
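A rough numerical sketch of that dilution, under a simplifying assumption: I treat √v̂ as a fixed per-weight number and look only at the part of one update that comes from the λw term. The four `rms_grad` values, the learning rate and λ are hypothetical, chosen to span quiet to loud.

```python
import numpy as np

lr, lam, eps = 1e-3, 1e-2, 1e-8

w = np.ones(4)                                # same weight value everywhere
rms_grad = np.array([0.01, 0.1, 1.0, 10.0])   # hypothetical sqrt(v_hat): quiet -> loud

# Coupled L2 (Adam + L2): lam*w sits inside the gradient,
# so it is divided by sqrt(v_hat) like everything else.
decay_adam_l2 = lr * lam * w / (rms_grad + eps)

# Decoupled (AdamW): lam*w is applied straight to the weight.
decay_adamw = lr * lam * w

for r, coupled, decoupled in zip(rms_grad, decay_adam_l2, decay_adamw):
    print(f"sqrt(v_hat)={r:6.2f}  Adam+L2 decay={coupled:.2e}  AdamW decay={decoupled:.2e}")
```

Each tenfold increase in √v̂ cuts the coupled decay tenfold, while the decoupled column stays constant. In PyTorch terms, this is the difference between passing `weight_decay` to `torch.optim.Adam` (added to the gradient) and using `torch.optim.AdamW` (applied to the weight).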
DROPOUT — AND THE RESCALE EVERYONE FORGETS
Dropout zeroes each unit with probability 1−p during training (p is the keep probability), so no neuron can lean on any single input: reliance spreads out, like a team forced to practise with random players benched. That part everyone knows. The part that trips people up is the rescale.
Carry the team analogy through the rescale. With only a fraction p of players on the pitch, total output falls short, so each one who is playing covers 1/p times as much ground (inverted dropout divides survivors by p) and the team’s expected output is whole again. On match day everyone plays at normal intensity: at test you do nothing, because the books are already balanced. Get the rescale wrong, either by skipping it in training or by leaving it switched on at test, and the full team overshoots: test activations come in 1/p times larger than anything the network trained on, and it quietly falls apart.
↑ Drag keep_prob. With scaling ON the train and test signal bars line up (the rescale compensates for the dropped units). Toggle it OFF and the test bar overshoots by 1/p — the failure mode the interviewer is fishing for.
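The rescale is small enough to write out in full. This is a minimal NumPy sketch of inverted dropout, not any library's implementation; the function name, the all-ones activations and `keep_prob=0.8` are assumptions for the demo.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng, training=True):
    """Drop units with probability 1 - keep_prob and rescale survivors by 1/keep_prob."""
    if not training:
        return a                               # test time: do nothing
    mask = rng.random(a.shape) < keep_prob     # True for units that stay on
    return a * mask / keep_prob                # survivors cover for the benched units

rng = np.random.default_rng(0)
a = np.ones(100_000)                           # hypothetical activations, all equal to 1

train_out = inverted_dropout(a, keep_prob=0.8, rng=rng)
test_out = inverted_dropout(a, keep_prob=0.8, rng=rng, training=False)

# Same masking without the rescale, to show the mismatch.
unscaled = a * (rng.random(a.shape) < 0.8)

print("train mean, scaled:  ", train_out.mean())
print("train mean, unscaled:", unscaled.mean())
print("test mean:           ", test_out.mean())
```

With the rescale, the training mean sits close to the test mean of 1. Without it, the training mean sits near 0.8, so the test signal is about 1/0.8 = 1.25 times what the next layer was trained on.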
BATCHNORM — THE ONE THAT ISN’T REGULARISATION
This is the one I had filed in the wrong drawer. BatchNorm normalises each pre-activation using the mini-batch’s own mean and variance, then lets the network scale and shift it back with learnable γ, β:

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$
Its job is optimisation: it smooths the loss landscape so you can use a higher learning rate and converge faster, with less sensitivity to initialisation. The regularising effect people credit it with is an accident — because μ_B, σ_B are estimated from whichever examples happened to share the batch, each example gets a slightly noisy normalisation, and that noise behaves a little like dropout.
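For reference, the training-time forward pass fits in a few lines of NumPy. This is a sketch under assumptions: the function name, the batch shape and the input distribution are mine, and it leaves out the running statistics a real layer keeps for inference.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm over a (batch, features) array."""
    mu = x.mean(axis=0)                    # per-feature mean of this mini-batch
    var = x.var(axis=0)                    # per-feature variance of this mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalise with the batch's own statistics
    return gamma * x_hat + beta            # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # hypothetical pre-activations

y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))   # close to 0 for every feature
print(y.std(axis=0))    # close to 1 for every feature
```

Because `mu` and `var` come from whichever 64 rows landed in the batch, the same example would be normalised slightly differently in a different batch. That is the noise source described above.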
The give-away: that noise scales as σ/√(batch). Judge Tirupati’s weather from 2 days and your average is jumpy; from 200 days it’s rock-steady. Grow the batch and the “regularisation” evaporates — proof it was never the point.
↑ Each dot is one mini-batch’s estimate of the mean. Small batch → the estimates scatter (noise injected = incidental regularisation). Slide the batch up and they collapse onto the true value — the reg effect vanishes. (Transformers use LayerNorm instead: normalise per example, not per batch.)
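The σ/√(batch) claim is easy to check by simulation. A small sketch, assuming Gaussian activations with a made-up σ of 2: draw many mini-batches at each size, take each batch's mean, and compare the spread of those means with the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0   # hypothetical true standard deviation of the activations

def spread_of_batch_means(batch_size, n_batches=2000):
    """Standard deviation of the per-batch mean across many simulated mini-batches."""
    batches = rng.normal(loc=0.0, scale=sigma, size=(n_batches, batch_size))
    return batches.mean(axis=1).std()

for batch_size in (2, 32, 512):
    measured = spread_of_batch_means(batch_size)
    predicted = sigma / np.sqrt(batch_size)
    print(f"batch={batch_size:4d}  measured={measured:.3f}  sigma/sqrt(batch)={predicted:.3f}")
```

The measured spread tracks the prediction at every size, and going from a batch of 2 to a batch of 512 shrinks it by a factor of 16. The injected noise, and the incidental regularisation with it, falls away at the same rate.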
FOUR MACHINES, ONE LABEL
| Technique | Mechanism | Weights → exactly 0? | Its real job |
|---|---|---|---|
| L2 / weight decay | grad adds λw → ×(1−αλ) shrink | no — asymptotes | shrink all weights (variance↓) |
| L1 / lasso | grad adds λ·sign(w) → soft-threshold | yes — exactly 0 | sparsity / feature selection |
| Dropout | keep each unit with prob p, ÷p (keep_prob) at train | no | implicit ensemble; spread reliance |
| ★ BatchNorm | normalise pre-acts per batch (γ, β) | no | optimisation — reg is a side-effect |
One orthogonality note worth keeping (Ng’s): early stopping also reduces variance, but it tangles “optimise the loss” and “control variance” into a single knob — so when you can, prefer L2 + train longer, which keeps the two concerns separable.
CONCLUSION
L1 and L2 shrink by different machines (a constant soft-threshold to exact zeros vs a proportional decay that only asymptotes). “L2 inside Adam” isn’t even clean weight decay: the adaptive denominator dilutes it, which is the entire reason AdamW exists. And BatchNorm is an optimisation aid whose regularising effect is an accident of mini-batch noise that fades as the batch grows.
The quest motto earned its keep: name the precise mechanism, don’t describe around it. “Regularisation” is a label on a drawer with four different tools inside.
WHAT NEXT
Deep Learning from first principles — attention (Q/K/V), multi-head, positional encodings — and a dedicated note on LayerNorm vs BatchNorm: why per-example normalisation is the one that survives variable-length sequences and tiny batches, and BatchNorm doesn’t.