L1 and L2 Regularization

A neural network is trained by minimizing a loss function. For a supervised learning problem, this loss measures how far the model predictions are from the target values. If the model has parameters $\theta$, and the training data are denoted by $\mathcal{D}$, the usual training objective has the form

$$\min_\theta \; \mathcal{L}_{\text{data}}(\theta).$$

The term $\mathcal{L}_{\text{data}}$ is the data-fitting loss. It may be mean squared error for regression, cross-entropy for classification, or another task-specific loss.

Regularization modifies this objective by adding a penalty on the parameters:

$$\min_\theta \; \mathcal{L}_{\text{data}}(\theta) + \lambda \Omega(\theta).$$

Here $\Omega(\theta)$ is a regularization penalty, and $\lambda \geq 0$ controls its strength. A larger value of $\lambda$ places more pressure on the model parameters. A smaller value places more emphasis on fitting the training data.

L1 and L2 regularization are two of the simplest and most widely used parameter penalties.

The Purpose of Parameter Regularization

Deep neural networks often contain many more parameters than the number of training examples. Such models can fit complex functions, but they can also fit noise, accidental correlations, and artifacts of the training set. This leads to overfitting: the model performs well on training data but poorly on new data.

Regularization constrains the learning problem. It discourages solutions that fit the training data only by using unnecessarily large or unstable parameter values.

Parameter regularization encodes a simple preference: among models that fit the data similarly well, prefer the one with smaller or simpler parameters.

This principle appears in many forms. In linear regression, small coefficients lead to smoother functions. In neural networks, small weights often reduce sensitivity to input perturbations and help avoid extreme activations. The precise effect depends on architecture, normalization, optimizer, and data, but the basic pressure remains the same.
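The sensitivity point can be made concrete with a toy linear unit (a minimal pure-Python sketch; the helper name `output_change` is ours, not from the text):

```python
# For a linear unit y = w * x, an input perturbation delta changes the
# output by w * delta, so a smaller weight means a less sensitive output.
def output_change(w: float, delta: float) -> float:
    return w * delta

big = output_change(5.0, 0.1)    # large weight: output shifts by 0.5
small = output_change(0.5, 0.1)  # small weight: output shifts by 0.05
```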

L2 Regularization

L2 regularization penalizes the squared Euclidean norm of the parameter vector:

$$\Omega_2(\theta) = \|\theta\|_2^2 = \sum_{j=1}^{p} \theta_j^2.$$

The regularized objective is

$$\min_\theta \; \mathcal{L}_{\text{data}}(\theta) + \lambda \|\theta\|_2^2.$$

Some texts include a factor of $1/2$ in the penalty:

$$\min_\theta \; \mathcal{L}_{\text{data}}(\theta) + \frac{\lambda}{2}\|\theta\|_2^2.$$

This convention makes the derivative cleaner, because

$$\nabla_\theta \left(\frac{\lambda}{2}\|\theta\|_2^2\right) = \lambda \theta.$$

L2 regularization penalizes large parameter values more strongly than small ones because the penalty grows quadratically. If a weight doubles, its L2 contribution becomes four times larger.

For a single parameter $\theta_j$, the L2 penalty contributes

$$\lambda \theta_j^2.$$

Thus large weights become expensive.
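To make the quadratic growth concrete, here is a minimal pure-Python check (the helper name and the value of `lam` are illustrative):

```python
# L2 contribution of a single weight: lam * w**2.
# Doubling the weight multiplies the contribution by four.
def l2_contribution(w: float, lam: float = 0.1) -> float:
    return lam * w ** 2

ratio = l2_contribution(2.0) / l2_contribution(1.0)  # 4.0
```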

Gradient Effect of L2 Regularization

Let the regularized objective be

$$J(\theta) = \mathcal{L}_{\text{data}}(\theta) + \frac{\lambda}{2}\|\theta\|_2^2.$$

Taking the gradient gives

$$\nabla_\theta J(\theta) = \nabla_\theta \mathcal{L}_{\text{data}}(\theta) + \lambda \theta.$$

A gradient descent update with learning rate $\eta$ becomes

$$\theta_{t+1} = \theta_t - \eta\left( \nabla_\theta \mathcal{L}_{\text{data}}(\theta_t) + \lambda \theta_t \right).$$

Rearranging,

$$\theta_{t+1} = (1-\eta\lambda)\theta_t - \eta \nabla_\theta \mathcal{L}_{\text{data}}(\theta_t).$$

This expression shows why L2 regularization is often connected to weight decay. Each update shrinks the current parameter vector by the factor $1-\eta\lambda$, then applies the usual gradient step.
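The algebra can be checked numerically with a one-parameter sketch (pure Python; the numbers are arbitrary stand-ins, and `data_grad` plays the role of the data-loss gradient):

```python
# One SGD step on J(theta) = L_data(theta) + (lam/2) * theta**2,
# written two equivalent ways.
eta, lam = 0.1, 0.01
theta = 2.0
data_grad = 0.5  # stand-in for the gradient of L_data at theta

# Form 1: step on the penalized gradient
penalized = theta - eta * (data_grad + lam * theta)

# Form 2: shrink by (1 - eta*lam), then take the usual gradient step
decayed = (1 - eta * lam) * theta - eta * data_grad

# The two forms agree up to floating-point rounding.
```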

For plain stochastic gradient descent, L2 regularization and weight decay are closely related. For adaptive optimizers such as Adam, the distinction matters. AdamW uses decoupled weight decay, where the shrinkage is applied separately from the adaptive gradient update.

L2 Regularization in PyTorch

In PyTorch, L2 regularization is usually implemented through the weight_decay argument of an optimizer.

import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(100, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-2,
    weight_decay=1e-4,
)

Here weight_decay=1e-4 adds $\lambda \theta$ to each parameter's gradient before the update, which for plain SGD matches the L2 penalty described above.

For Adam-style training, AdamW is usually preferred over Adam with weight decay:

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-2,
)

AdamW decouples weight decay from the adaptive gradient calculation. This tends to behave more predictably in modern deep learning practice.

A common detail is to avoid applying weight decay to bias parameters and normalization parameters. Biases, LayerNorm weights, and BatchNorm weights often contribute little to overfitting in the same way as large weight matrices, and decaying them can degrade performance.

A typical parameter grouping pattern is:

decay = []
no_decay = []

for name, param in model.named_parameters():
    if not param.requires_grad:
        continue

    if name.endswith("bias") or "norm" in name.lower():
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=3e-4,
)

This style is common in transformer training.

L1 Regularization

L1 regularization penalizes the sum of absolute parameter values:

$$\Omega_1(\theta) = \|\theta\|_1 = \sum_{j=1}^{p} |\theta_j|.$$

The regularized objective is

$$\min_\theta \; \mathcal{L}_{\text{data}}(\theta) + \lambda \|\theta\|_1.$$

Unlike L2 regularization, L1 regularization grows linearly with the magnitude of each parameter. It applies a constant pressure toward zero. This often encourages sparsity: many parameters may become exactly zero or very close to zero.

For linear models, L1 regularization is closely associated with feature selection. If a coefficient becomes zero, the corresponding feature is removed from the model. In neural networks, the effect is less direct, but L1 penalties can still encourage sparse weights, sparse activations, or sparse feature usage depending on where the penalty is applied.

Gradient Effect of L1 Regularization

The absolute value function has derivative

$$\frac{d}{d\theta_j}|\theta_j| = \begin{cases} 1, & \theta_j > 0, \\ -1, & \theta_j < 0. \end{cases}$$

At $\theta_j = 0$, the absolute value function is not differentiable. In optimization, we use a subgradient. Any value in the interval $[-1,1]$ is a valid subgradient at zero.

Thus, away from zero, the L1 contribution to the gradient is

$$\lambda \operatorname{sign}(\theta_j).$$

A gradient step with L1 regularization applies a nearly constant pull toward zero. Positive weights are pushed downward. Negative weights are pushed upward. This differs from L2 regularization, where the pull is proportional to the current value of the parameter.

This difference explains the sparsity effect. L2 makes weights smaller. L1 can drive weights to zero.
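The contrast can be seen in a small pure-Python sketch with no data gradient. Clamping the L1 update at zero mimics the proximal (soft-thresholding) treatment of the absolute value; the step sizes are arbitrary illustrative choices:

```python
# 100 shrinkage-only steps under each penalty.
# L2 shrinks geometrically and never reaches zero exactly;
# the constant L1 pull reaches zero in finitely many steps.
eta, lam = 0.1, 0.5
w_l2 = 1.0
w_l1 = 1.0
for _ in range(100):
    w_l2 -= eta * (2 * lam * w_l2)     # gradient of lam * w**2
    w_l1 = max(w_l1 - eta * lam, 0.0)  # constant pull, clamped at zero

# w_l2 is tiny but still positive; w_l1 is exactly zero.
```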

L1 Regularization in PyTorch

PyTorch optimizers do not usually expose L1 regularization as a built-in weight_decay argument. The standard implementation adds an L1 penalty explicitly to the loss.

import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(100, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

lambda_l1 = 1e-5

x = torch.randn(32, 100)
target = torch.randint(0, 10, (32,))

logits = model(x)
data_loss = criterion(logits, target)

# Accumulate the absolute values of every parameter
l1_penalty = 0.0
for param in model.parameters():
    l1_penalty = l1_penalty + param.abs().sum()

loss = data_loss + lambda_l1 * l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()

This code adds the absolute value of every parameter to the objective. In practice, one may apply L1 regularization only to selected layers.

For example:

l1_penalty = model[0].weight.abs().sum()
loss = data_loss + lambda_l1 * l1_penalty

This penalizes only the first linear layer’s weight matrix.

Comparing L1 and L2 Regularization

| Property | L1 regularization | L2 regularization |
| --- | --- | --- |
| Penalty | $\sum_j \lvert\theta_j\rvert$ | $\sum_j \theta_j^2$ |
| Effect on weights | Encourages sparsity | Encourages small weights |
| Gradient pressure | Roughly constant toward zero | Proportional to weight size |
| Differentiability | Not differentiable at zero | Smooth everywhere |
| Common use | Sparse models, feature selection | General-purpose regularization |
| PyTorch implementation | Add penalty to loss manually | Use optimizer weight_decay |

L2 regularization is more common in deep learning. It is smooth, easy to optimize, and works well with standard neural network training. L1 regularization is useful when sparsity is desired, but it can make optimization less smooth.

Regularizing Weights, Biases, and Activations

The symbol $\theta$ may refer to all model parameters, but regularization choices are often more selective.

A model contains different kinds of parameters:

| Parameter type | Typical regularization choice |
| --- | --- |
| Linear and convolution weights | Often regularized |
| Bias terms | Often not regularized |
| BatchNorm scale and shift | Usually not regularized |
| LayerNorm scale and shift | Usually not regularized |
| Embedding matrices | Sometimes regularized |
| Output heads | Depends on task |

One can also regularize activations instead of parameters. Suppose $h(x)$ is a hidden representation. An activation penalty may take the form

$$\lambda \|h(x)\|_1$$

or

$$\lambda \|h(x)\|_2^2.$$

Activation regularization encourages hidden units to remain small or sparse. It is common in sparse autoencoders, representation learning, and some interpretability methods.
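As a sketch (assuming PyTorch is available; the penalty scale lambda_act and the layer indexing are illustrative choices, not from the text above), an L1 activation penalty can be added to the loss like this:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 10))
criterion = nn.CrossEntropyLoss()
lambda_act = 1e-4  # illustrative activation-penalty strength

x = torch.randn(32, 100)
target = torch.randint(0, 10, (32,))

h = model[1](model[0](x))  # hidden representation h(x) after the ReLU
logits = model[2](h)

data_loss = criterion(logits, target)
act_penalty = h.abs().sum(dim=1).mean()  # L1 norm of h(x), batch-averaged
loss = data_loss + lambda_act * act_penalty
loss.backward()
```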

Weight Decay Versus L2 Penalty

The terms weight decay and L2 regularization are often used interchangeably, but they are not always identical.

With plain SGD, adding an L2 penalty to the loss produces the same update form as multiplicative weight decay. With adaptive optimizers, the optimizer rescales gradients coordinate by coordinate. If the L2 penalty is included inside the gradient, the regularization term is also affected by this rescaling. Decoupled weight decay avoids this by applying parameter shrinkage separately.

This is the main reason AdamW is preferred for many modern models.

A simplified AdamW-style update can be written as

$$\theta_{t+1} = (1-\eta\lambda)\theta_t - \eta \cdot \text{AdamUpdate}_t.$$

The shrinkage term and the adaptive gradient term are separate. This gives the weight decay coefficient a cleaner interpretation.
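A toy one-parameter comparison (pure Python; `scale` stands in for Adam's per-coordinate denominator, and all numbers are arbitrary) shows why the two formulations differ:

```python
# Coupled: the decay term lam * theta is rescaled along with the data
# gradient. Decoupled: the shrinkage is applied as-is, outside the
# adaptive step.
eta, lam = 0.1, 0.01
theta, grad, scale = 1.0, 0.2, 4.0

coupled = theta - eta * (grad + lam * theta) / scale
decoupled = (1 - eta * lam) * theta - eta * grad / scale

# The coupled form's effective decay is divided by `scale`,
# so the two updates land at different points.
```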

Choosing the Regularization Strength

The regularization coefficient $\lambda$ is a hyperparameter. It must be chosen using validation performance.

If $\lambda$ is too small, the penalty has little effect and the model may overfit. If $\lambda$ is too large, the model may underfit because its parameters are forced too close to zero.

Typical ranges depend on optimizer, model size, data size, and task. For AdamW, values such as

$$10^{-4},\quad 10^{-3},\quad 10^{-2},\quad 10^{-1}$$

are common candidates. For SGD, smaller values such as

$$10^{-5},\quad 10^{-4},\quad 10^{-3}$$

are often tested.

A simple search may use:

weight_decays = [0.0, 1e-5, 1e-4, 1e-3, 1e-2]

for wd in weight_decays:
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,
        weight_decay=wd,
    )
    # train model and evaluate validation metric

The correct value is empirical. It should be selected by validation loss, validation accuracy, or another task-relevant metric.

Practical Guidance

L2 regularization through weight decay is a strong default for most neural networks. For modern transformer models, AdamW with decoupled weight decay is the usual starting point.

L1 regularization should be used when sparsity is an explicit goal. It can be applied to weights, activations, embeddings, or specific layers. Applying L1 to every parameter in a large deep model may slow optimization and may not produce useful sparsity without additional pruning or thresholding.

Biases and normalization parameters should usually be excluded from weight decay. This is especially important in architectures with LayerNorm, BatchNorm, or many small affine parameters.

Regularization should be tuned together with learning rate, batch size, model size, and data augmentation. These choices interact. A model trained with strong augmentation may need less weight decay. A very large model trained on limited data may need more regularization.

Summary

L1 and L2 regularization add parameter penalties to the training objective.

L2 regularization penalizes squared parameter values. It encourages weights to remain small and is commonly implemented as weight decay. In modern PyTorch training, AdamW is often the preferred optimizer for decoupled weight decay.

L1 regularization penalizes absolute parameter values. It encourages sparsity and is usually added manually to the loss.

Both methods control model complexity by modifying the optimization problem. They do not guarantee good generalization, but they often improve stability and reduce overfitting when chosen carefully.