LIVE · BENGALURU EST. 2024
EXP #009 · Q-02 · ML Foundations from First Principles · step 3 ⚠ Learning

Regularisation: Four Knobs That Aren't the Same Knob

I assumed L1, L2, dropout and BatchNorm were one family of regularisation knobs, pick whichever. Far from it: BatchNorm is an optimisation aid, L1 and L2 shrink by different machines, and "L2 inside Adam" isn't even weight decay.

2026-06-14 9 MIN READ COMPLETE
01

HYPOTHESIS

H₀ L1, L2, dropout and BatchNorm are interchangeable regularisation knobs — same goal (shrink the train/val gap), so reach for whichever is handy.
02

METHOD

Third entry in ML Foundations from First Principles. I went in with a lazy hypothesis: L1, L2, dropout and BatchNorm are all “regularisation” — one bucket of interchangeable knobs. Then I sat a four-question mock — AdamW vs L2, what BatchNorm is actually for, the inverted-dropout rescale, and why L1 zeroes weights when L2 doesn’t — and the bucket fell apart.

The verdict this time is FAR FROM IT, and that’s the useful part: the shared label hides four different machines. Here’s each one, with a knob you can turn.

03

L1 vs L2 — A BUDGET, AND THE SHAPE OF IT


Picture cutting your monthly spending. L2 is “trim everything proportionally” — shave a slice off groceries, fuel, Netflix alike; but as a bill gets small you trim less of it, so nothing ever hits zero. L1 is “a flat pressure on every line item” — the big essentials absorb it, but your rarely-used small subscription gets cancelled outright. Same goal, opposite endgame — and it’s all in the gradient.

$$J = \frac{1}{m}\sum L \;+\; \frac{\lambda}{2m}\sum_l \lVert W^{[l]}\rVert_F^2 \quad\Longrightarrow\quad W := \Big(1 - \frac{\alpha\lambda}{m}\Big)W - \alpha\,\partial L$$

L2 penalises w², so its gradient is λw (dropping the 1/m): a pull proportional to the weight. As w shrinks the pull fades, so weights glide toward 0 but never arrive. That’s the proportional trim: the smaller the bill, the gentler the cut.
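The shrink factor falls out of one line of algebra. This is a sketch under the same assumptions as the formula above: plain gradient descent with learning rate α, and ∂L standing for the gradient of the averaged data loss with respect to W.

```latex
\begin{aligned}
\frac{\partial J}{\partial W} &= \partial L + \frac{\lambda}{m}\,W \\
W &:= W - \alpha\Big(\partial L + \frac{\lambda}{m}\,W\Big)
   = \Big(1 - \frac{\alpha\lambda}{m}\Big)W - \alpha\,\partial L
\end{aligned}
```

The penalty's only contribution is the multiplicative factor in front of W, which is why the pull weakens as W gets small.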

$$\text{L1: } \frac{\lambda}{m}\sum |w| \quad\Longrightarrow\quad \text{gradient adds } \frac{\lambda}{m}\,\operatorname{sign}(w)$$

L1 penalises |w|, so its gradient is λ·sign(w): a constant-magnitude push that doesn’t ease off near zero. It keeps shoving until the small weights are pinned at exactly 0 (the cancelled subscription), while the big ones survive. In practice the pin comes from a soft-threshold update that clamps at 0; a raw sign step would overshoot and hover around it. That’s why L1 does feature selection and L2 doesn’t. Geometrically it’s the diamond’s corners sitting on the axes vs the circle that has none, which is what the interactive figure lets you drag.
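To see the two endgames side by side, here is a minimal sketch that applies only the penalty step to a single weight, with no data loss at all. The function names, the learning rate and λ are illustrative assumptions, and the L1 step is written as a soft-threshold (clamp at 0), which is my assumption about how "pinned at exactly 0" is realised in practice.

```python
def l2_step(w, lr, lam):
    """Penalty-only L2 step: multiply the weight by (1 - lr*lam)."""
    return (1.0 - lr * lam) * w


def l1_step(w, lr, lam):
    """Penalty-only L1 step as a soft-threshold: move |w| toward 0 by lr*lam, clamp at 0."""
    shrink = lr * lam
    if abs(w) <= shrink:
        return 0.0
    return w - shrink if w > 0 else w + shrink


def run(step, w0, lr=0.1, lam=0.5, steps=200):
    """Apply the same penalty step repeatedly, starting from w0."""
    w = w0
    for _ in range(steps):
        w = step(w, lr, lam)
    return w


w_l2 = run(l2_step, 1.0)  # shrinks geometrically, stays strictly positive
w_l1 = run(l1_step, 1.0)  # loses a fixed amount per step, then is clamped to 0.0

print("L2 after 200 steps:", w_l2)
print("L1 after 200 steps:", w_l1)
```

The L2 weight ends up tiny but non-zero; the L1 weight is exactly 0.0 after a finite number of steps.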

[Interactive figure: a draggable optimum in the (w₁, w₂) plane with the L1 diamond and the L2 circle, and a λ slider from strong (t = 0.4) to weak (t = 2.0). Sample reading: L1 → (0.85, 0.25), L2 → (0.94, 0.57), L1 sparse? no.]

↑ Drag the optimum around. The solution is where the loss first touches the budget. The L1 diamond keeps catching it on a corner → one weight snaps to exactly 0 (sparse). The L2 circle has no corners, so it only ever shrinks — both weights stay alive. That’s the whole reason L1 does feature selection and L2 doesn’t.

04

THE TWIST — “L2 IS WEIGHT DECAY” IS A LIE UNDER ADAM


The shrink factor (1 − αλ/m) above is weight decay — but only for plain SGD. The moment you switch to Adam, the story breaks, and this is the one I got wrong in the mock.

Coupled L2 folds the penalty into the gradient, so it rides through the adaptive denominator:

$$g = \partial L + \lambda w, \qquad W := W - \alpha\,\frac{\hat m}{\sqrt{\hat v} + \varepsilon}$$

That √v̂ divides everything, including the λw term. So a weight with a big recent gradient (large √v̂) gets its decay diluted — the loud weights, the ones you most want to rein in, barely shrink. AdamW fixes it by decoupling: it pulls λw out of the gradient and applies it straight to the weight, untouched by √v̂.

$$\textbf{AdamW:}\quad W := W - \alpha\,\frac{\hat m}{\sqrt{\hat v} + \varepsilon} \;-\; \alpha\lambda W$$

Think of λ as a fine everyone in class should pay equally. Adam+L2 divides the fine by how loud you are, so the loudest kid pays next to nothing. AdamW charges everyone the same. That’s why AdamW, rather than Adam with an L2 penalty, is the usual choice for training transformers.
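The dilution can be checked with a few lines of arithmetic. This sketch isolates the decay part of one step and ignores the loss-gradient part. It assumes the weight is roughly constant, so the λw term passes through the momentum average unchanged, and that the denominator √v̂ is set by the loss gradient. The four √v̂ values and λ = 0.04 are taken from the figure that follows; the learning rate and function names are illustrative assumptions.

```python
def adam_l2_decay(w, lam, lr, sqrt_v, eps=1e-8):
    """Decay part of a coupled Adam + L2 step: lam*w is divided by (sqrt_v + eps)."""
    return lr * lam * w / (sqrt_v + eps)


def adamw_decay(w, lam, lr):
    """Decay part of an AdamW step: lr*lam*w is applied straight to the weight."""
    return lr * lam * w


lam, lr, w = 0.04, 1e-3, 1.0  # lr and w are illustrative values
sqrt_v_values = {"quiet": 0.15, "typical": 0.5, "active": 1.4, "loud": 4.0}

coupled = {name: adam_l2_decay(w, lam, lr, s) for name, s in sqrt_v_values.items()}
decoupled = {name: adamw_decay(w, lam, lr) for name in sqrt_v_values}

# How much less the loud weight is decayed than the quiet one under Adam + L2.
ratio = coupled["quiet"] / coupled["loud"]

for name in sqrt_v_values:
    print(f"{name:8s} Adam+L2: {coupled[name]:.2e}   AdamW: {decoupled[name]:.2e}")
print("quiet / loud under Adam+L2:", ratio)
```

Under these assumptions the coupled decay differs between weights purely through √v̂, while the decoupled decay is identical for all four.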

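[Interactive figure: effective decay at λ = 0.040 for four weights: quiet (√s = 0.15), typical (√s = 0.5), active (√s = 1.4) and LOUD (√s = 4). Adam + L2 gives λ/√s; AdamW gives a uniform λ. Under Adam+L2 the LOUD weight is decayed about 26.7× less than the quiet one (4 / 0.15); AdamW treats them equally.]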

↑ Same λ for all four weights. Under Adam+L2 the decay collapses as √s grows (√s is the figure’s label for the √v̂ denominator): the LOUD weight is decayed more than an order of magnitude less than the quiet one. AdamW is flat: every weight, same decay.

05

DROPOUT — AND THE RESCALE EVERYONE FORGETS


Dropout zeroes each unit with probability 1−p during training (p is the keep probability, keep_prob), so no neuron can lean on any single input: reliance spreads out, like a team forced to practise with random players benched. That part everyone knows. The part that trips people up is the rescale.

$$\textbf{train:}\quad a = \frac{a \odot \text{mask}}{p} \qquad\qquad \textbf{test:}\quad a = a$$

Carry the team analogy through the rescale. With only a fraction p of players on the pitch, total output falls short, so each one who is playing covers 1/p times as much ground (inverted dropout divides survivors by p) and the team’s expected output is whole again. On match day everyone plays at normal intensity: at test you do nothing, because the books are already balanced. Get the bookkeeping wrong and it breaks. Keep the 1/p boost with a full team, or skip it in training, and test activations come in 1/p times larger than what the next layer trained on; the network sees magnitudes it never learned to handle, and quietly falls apart.
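The bookkeeping is easy to verify numerically. Below is a minimal sketch of inverted dropout on 12 units that each output 1.0, matching the numbers in the figure that follows. The function names, the trial count and the `scale` switch are illustrative assumptions, not part of any library API.

```python
import random


def inverted_dropout(acts, keep_prob, rng, scale=True):
    """Training-time dropout: keep each unit with probability keep_prob.

    With scale=True the survivors are divided by keep_prob (inverted dropout).
    """
    out = []
    for a in acts:
        if rng.random() < keep_prob:
            out.append(a / keep_prob if scale else a)
        else:
            out.append(0.0)
    return out


def mean_train_signal(acts, keep_prob, scale, trials=20000, seed=0):
    """Average summed activation over many random dropout masks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sum(inverted_dropout(acts, keep_prob, rng, scale))
    return total / trials


acts = [1.0] * 12        # 12 units, each outputting 1.0
test_signal = sum(acts)  # test time: no mask, no scaling

with_scale = mean_train_signal(acts, 0.7, scale=True)
without_scale = mean_train_signal(acts, 0.7, scale=False)

print("test signal:           ", test_signal)
print("train mean, scaling on: ", with_scale)
print("train mean, scaling off:", without_scale)
```

With scaling on, the average training signal sits close to the test signal. With scaling off it sits near keep_prob times the test signal, which is the 1/p mismatch described above.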

[Interactive figure: dropout on 12 units with a keep_prob slider (0.5 to 0.95). Train at keep_prob = 0.70: 8/12 units kept, survivors scaled ×1.43, E[signal] = 12.0. Test: all 12 units on, no dropout, signal = 12.0. With scaling ON, survivors are divided by keep_prob so the train and test magnitudes match; with it OFF, test activations blow up by 1/keep_prob relative to training.]

↑ Drag keep_prob. With scaling ON the train and test signal bars line up (the rescale compensates for the dropped units). Toggle it OFF and the test bar overshoots by 1/p — the failure mode the interviewer is fishing for.

06

BATCHNORM — THE ONE THAT ISN’T REGULARISATION


This is the one I had filed in the wrong drawer. BatchNorm normalises each pre-activation using the mini-batch’s own mean and variance, then lets the network scale and shift it back with learnable γ, β:

$$\hat z = \frac{z - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad \tilde z = \gamma\,\hat z + \beta$$
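The formula above can be sketched for a single feature across one mini-batch. This is the training-mode computation only; at inference, implementations typically substitute running averages for the batch statistics, which this sketch leaves out. The function name and the default ε are illustrative assumptions.

```python
import math


def batchnorm_forward(z_batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale by gamma and shift by beta."""
    n = len(z_batch)
    mu = sum(z_batch) / n                          # mini-batch mean
    var = sum((z - mu) ** 2 for z in z_batch) / n  # mini-batch (population) variance
    z_hat = [(z - mu) / math.sqrt(var + eps) for z in z_batch]
    return [gamma * zh + beta for zh in z_hat]


out = batchnorm_forward([2.0, 4.0, 6.0, 8.0], gamma=2.0, beta=1.0)

out_mean = sum(out) / len(out)
out_var = sum((o - out_mean) ** 2 for o in out) / len(out)
print("output mean:", out_mean)     # set by beta
print("output variance:", out_var)  # set by gamma squared, up to eps
```

Whatever the input scale, the output mean is set by β and the output spread by γ, which is the "scale and shift it back" step in the text.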

Its job is optimisation: it smooths the loss landscape so you can use a higher learning rate and converge faster, with less sensitivity to initialisation. The regularising effect people credit it with is an accident — because μ_B, σ_B are estimated from whichever examples happened to share the batch, each example gets a slightly noisy normalisation, and that noise behaves a little like dropout.

The give-away: that noise scales as σ/√(batch). Judge Tirupati’s weather from 2 days and your average is jumpy; from 200 days it’s rock-steady. Grow the batch and the “regularisation” evaporates — proof it was never the point.
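The σ/√(batch) claim can be checked by simulation. This sketch draws many mini-batches from a standard normal (σ = 1, my assumption for illustration) and measures how much the batch means scatter. The function name, the number of batches and the seed are also illustrative.

```python
import math
import random


def batch_mean_spread(batch_size, sigma=1.0, n_batches=2000, seed=0):
    """Empirical standard deviation of mini-batch means drawn from N(0, sigma^2)."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_batches):
        batch = [rng.gauss(0.0, sigma) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    centre = sum(means) / n_batches
    return math.sqrt(sum((m - centre) ** 2 for m in means) / n_batches)


spread_small = batch_mean_spread(8)    # theory predicts 1/sqrt(8)
spread_large = batch_mean_spread(200)  # theory predicts 1/sqrt(200)

print("batch = 8:   measured", spread_small, " theory", 1 / math.sqrt(8))
print("batch = 200: measured", spread_large, " theory", 1 / math.sqrt(200))
```

The measured scatter tracks σ/√n closely, so a 25× larger batch leaves about a fifth of the noise.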

[Interactive figure: each dot is one mini-batch’s estimate μ̂ around the true μ, with a batch-size slider from small to large. At batch = 8 the noise σ/√n is 0.354 (strong incidental regularisation). The μ̂ jitter, which is the noise BatchNorm injects, shrinks as σ/√batch: big batch → no noise → no regularisation. It was never the point.]

↑ Each dot is one mini-batch’s estimate of the mean. Small batch → the estimates scatter (noise injected = incidental regularisation). Slide the batch up and they collapse onto the true value — the reg effect vanishes. (Transformers use LayerNorm instead: normalise per example, not per batch.)

07

FOUR MACHINES, ONE LABEL

DATA TABLE n=4
| Technique | Mechanism | Weights → exactly 0? | Its real job |
| --- | --- | --- | --- |
| L2 / weight decay | grad adds λw → ×(1−αλ/m) shrink | no, asymptotes | shrink all weights (variance↓) |
| L1 / lasso | grad adds λ·sign(w) → soft-threshold | yes, exactly 0 | sparsity / feature selection |
| Dropout | drop with prob 1−p, ÷p (keep_prob) at train | no | implicit ensemble; spread reliance |
| ★ BatchNorm | normalise pre-acts per batch (γ, β) | no | optimisation; reg is a side-effect |

One orthogonality note worth keeping (Ng’s): early stopping also reduces variance, but it tangles “optimise the loss” and “control variance” into a single knob — so when you can, prefer L2 + train longer, which keeps the two concerns separable.

08

CONCLUSION

⚠ FAR FROM IT
Hypothesis refuted — they’re not one knob.

L1 and L2 shrink by different machines (a constant soft-threshold to exact zeros vs a proportional decay that only asymptotes). “L2 inside Adam” isn’t even clean weight decay: the adaptive denominator dilutes it, which is the entire reason AdamW exists. And BatchNorm is an optimisation aid whose regularising effect is an accident of mini-batch noise that fades as the batch grows.

The quest motto earned its keep: name the precise mechanism, don’t describe around it. “Regularisation” is a label on a drawer with four different tools inside.

09

WHAT NEXT

Deep Learning from first principles — attention (Q/K/V), multi-head, positional encodings — and a dedicated note on LayerNorm vs BatchNorm: why per-example normalisation is the one that survives variable-length sequences and tiny batches, and BatchNorm doesn’t.