
Mean Squared Error

Mean squared error is one of the simplest and most widely used loss functions in supervised learning. It measures the average squared difference between a model’s prediction and the target value.

It is mainly used for regression problems, where the target is a real-valued quantity rather than a class label. Examples include predicting house prices, temperatures, distances, demand, ratings, physical measurements, or future numerical values.

Suppose a model receives an input $x$ and produces a prediction

$$\hat{y} = f_\theta(x),$$

where $\theta$ denotes the model parameters. If the true target is $y$, the squared error for one example is

$$(\hat{y} - y)^2.$$

For a dataset with $n$ examples, the mean squared error is

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2.$$

The loss is small when predictions are close to targets. The loss grows quickly when predictions are far away from targets because the error is squared.
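
As a quick numerical illustration, suppose two predictions are $\hat{y}_1 = 1$ and $\hat{y}_2 = 3$ with targets $y_1 = 2$ and $y_2 = 5$. The squared errors are $1$ and $4$, so

$$\mathrm{MSE} = \frac{1 + 4}{2} = 2.5.$$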

Regression as Function Approximation

In regression, the model learns a function from inputs to continuous outputs:

$$f_\theta : \mathcal{X} \to \mathbb{R}.$$

For example, a model may map a feature vector describing a house to its price:

$$x = \begin{bmatrix} \text{area} \\ \text{number of rooms} \\ \text{location score} \\ \text{age} \end{bmatrix}, \qquad y = \text{price}.$$

The model prediction is

$$\hat{y} = f_\theta(x).$$

Training adjusts $\theta$ so that $\hat{y}$ becomes close to $y$ on the training data and, more importantly, on future unseen data.

Mean squared error gives a precise objective for this problem. It asks the model to minimize the average squared prediction error.

MSE for Batches

Deep learning models usually train on mini-batches. Suppose a batch contains $B$ examples. The predictions and targets may be written as vectors:

$$\hat{y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_B \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_B \end{bmatrix}.$$

The batch mean squared error is

$$L = \frac{1}{B} \sum_{i=1}^{B} (\hat{y}_i - y_i)^2.$$

In PyTorch:

import torch

y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])

loss = torch.mean((y_pred - y_true) ** 2)
print(loss)

PyTorch also provides the built-in nn.MSELoss module:

import torch
import torch.nn as nn

loss_fn = nn.MSELoss()

loss = loss_fn(y_pred, y_true)
print(loss)

Both versions compute the same loss when the default reduction is used.

MSE for Multi-Output Regression

Many models predict more than one value per example. For instance, a model may predict both temperature and humidity, or it may predict the $x$- and $y$-coordinates of a point.

Suppose

$$\hat{Y}, Y \in \mathbb{R}^{B \times d},$$

where $B$ is the batch size and $d$ is the number of output dimensions. The mean squared error is usually computed over all entries:

$$L = \frac{1}{Bd} \sum_{i=1}^{B} \sum_{j=1}^{d} (\hat{Y}_{ij} - Y_{ij})^2.$$

In PyTorch:

y_pred = torch.randn(32, 5)  # 32 examples, 5 outputs each
y_true = torch.randn(32, 5)

loss_fn = nn.MSELoss()
loss = loss_fn(y_pred, y_true)

print(loss.shape)  # torch.Size([])

The result is a scalar tensor. This scalar is then used for backpropagation.

Gradient of Mean Squared Error

Mean squared error is popular partly because its derivative is simple.

For one prediction $\hat{y}$ and target $y$, define

$$L = (\hat{y} - y)^2.$$

Then

$$\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y).$$

For the averaged batch loss

$$L = \frac{1}{B} \sum_{i=1}^{B} (\hat{y}_i - y_i)^2,$$

the derivative with respect to $\hat{y}_i$ is

$$\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{B}(\hat{y}_i - y_i).$$

The sign of the gradient tells the model how to change the prediction. If $\hat{y}_i > y_i$, the gradient is positive, so gradient descent pushes the prediction downward. If $\hat{y}_i < y_i$, the gradient is negative, so gradient descent pushes the prediction upward.

This behavior is exactly what we want for regression.
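
As a quick check, the $\frac{2}{B}$ factor can be verified with autograd. The following sketch reuses the small example tensors from earlier:

import torch

y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0], requires_grad=True)

# Batch MSE with mean reduction
loss = torch.mean((y_pred - y_true) ** 2)
loss.backward()

# Analytical gradient: (2 / B) * (y_pred - y_true)
B = y_true.shape[0]
analytical = (2.0 / B) * (y_pred.detach() - y_true)

print(y_pred.grad)   # gradient computed by autograd
print(analytical)    # matches the formula above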

Linear Regression with MSE

Consider a linear model

$$\hat{y} = wx + b.$$

For one example, the squared error is

$$L = (wx + b - y)^2.$$

The gradients are

$$\frac{\partial L}{\partial w} = 2(wx+b-y)\,x,$$

and

$$\frac{\partial L}{\partial b} = 2(wx+b-y).$$

Thus the weight gradient depends on both the prediction error and the input value. Large input values amplify the gradient. This is one reason feature scaling matters in regression.
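
These formulas can be checked against autograd for a single example. This is only an illustrative sketch with made-up numbers:

import torch

x = torch.tensor(2.0)
y = torch.tensor(7.0)

w = torch.tensor(1.5, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)

# Squared error for one example
loss = (w * x + b - y) ** 2
loss.backward()

# Gradients from the formulas above
error = (w * x + b - y).detach()
print(w.grad, 2 * error * x)  # dL/dw = 2(wx + b - y) x
print(b.grad, 2 * error)      # dL/db = 2(wx + b - y)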

For a batch of examples, let

$$X \in \mathbb{R}^{B \times d}, \qquad w \in \mathbb{R}^{d}, \qquad y \in \mathbb{R}^{B}.$$

The prediction is

$$\hat{y} = Xw + b.$$

The mean squared error is

$$L = \frac{1}{B} \|Xw + b - y\|_2^2.$$

This compact vector form is the standard formulation of least-squares regression.
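
The norm form and the elementwise mean are the same quantity, which a short check with random data confirms (the shapes below are arbitrary):

import torch

B, d = 8, 3
X = torch.randn(B, d)
w = torch.randn(d)
b = torch.randn(())

y = torch.randn(B)
y_pred = X @ w + b

# Squared L2 norm divided by B ...
loss_norm = torch.sum((y_pred - y) ** 2) / B
# ... equals the elementwise mean of squared errors
loss_mean = torch.mean((y_pred - y) ** 2)

print(torch.allclose(loss_norm, loss_mean))  # True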

PyTorch Example: Linear Regression

The following example trains a one-layer model using mean squared error.

import torch
import torch.nn as nn
import torch.optim as optim

# Synthetic data:
# y = 3x + 2 + noise
torch.manual_seed(0)

x = torch.randn(1000, 1)
noise = 0.2 * torch.randn(1000, 1)
y = 3.0 * x + 2.0 + noise

model = nn.Linear(1, 1)

loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("weight:", model.weight.item())
print("bias:", model.bias.item())
print("loss:", loss.item())

The learned weight should be close to 3, and the learned bias should be close to 2. The exact values differ because the training data includes noise and the parameters are initialized randomly.

The training loop follows the standard pattern:

y_pred = model(x)
loss = loss_fn(y_pred, y)

optimizer.zero_grad()
loss.backward()
optimizer.step()

The loss measures the prediction error. backward() computes gradients. The optimizer updates the model parameters.

Reduction Modes in PyTorch

nn.MSELoss supports different reduction modes.

nn.MSELoss(reduction="mean")
nn.MSELoss(reduction="sum")
nn.MSELoss(reduction="none")

The default is "mean".

With "mean", PyTorch averages over all elements:

loss_fn = nn.MSELoss(reduction="mean")
loss = loss_fn(y_pred, y_true)

With "sum", PyTorch sums all squared errors:

loss_fn = nn.MSELoss(reduction="sum")
loss = loss_fn(y_pred, y_true)

With "none", PyTorch returns the elementwise squared errors:

loss_fn = nn.MSELoss(reduction="none")
loss = loss_fn(y_pred, y_true)

print(loss.shape)

The "none" option is useful when examples need different weights. For example:

loss_fn = nn.MSELoss(reduction="none")

errors = loss_fn(y_pred, y_true)
weights = torch.tensor([1.0, 0.5, 2.0, 1.0])

weighted_loss = (errors * weights).mean()

For multi-dimensional outputs, care is needed because the weight tensor must have a shape compatible with the error tensor.
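
For example, a per-dimension weight vector of shape (d,) broadcasts across the batch dimension of a (B, d) error tensor. The weights here are arbitrary and only illustrate the shape handling:

import torch
import torch.nn as nn

B, d = 32, 3
y_pred = torch.randn(B, d)
y_true = torch.randn(B, d)

errors = nn.MSELoss(reduction="none")(y_pred, y_true)  # shape (B, d)

# Per-dimension weights, broadcast over the batch dimension
dim_weights = torch.tensor([1.0, 0.5, 2.0])  # shape (d,)

weighted_loss = (errors * dim_weights).mean()
print(weighted_loss)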

MSE and Gaussian Noise

Mean squared error has a probabilistic interpretation. Suppose the target is generated by

$$y = f_\theta(x) + \epsilon,$$

where the noise term $\epsilon$ follows a Gaussian distribution:

$$\epsilon \sim \mathcal{N}(0, \sigma^2).$$

Then

$$y \mid x \sim \mathcal{N}(f_\theta(x), \sigma^2).$$

The likelihood of observing $y$ given $x$ and $\theta$ is

$$p(y \mid x, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y - f_\theta(x))^2}{2\sigma^2} \right).$$

Maximizing this likelihood is equivalent to minimizing squared error. More precisely, minimizing the negative log-likelihood gives

$$-\log p(y \mid x, \theta) = \frac{1}{2\sigma^2} (y - f_\theta(x))^2 + \text{constant}.$$

The constant does not depend on θ\theta, so it has no effect on training. Thus, under the assumption of independent Gaussian noise with constant variance, mean squared error is the natural loss function.
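
This correspondence can be checked numerically with torch.distributions. Up to the constant, the negative log-likelihood equals the squared error scaled by $\frac{1}{2\sigma^2}$; the values below are arbitrary:

import math
import torch
from torch.distributions import Normal

sigma = 1.5
y_pred = torch.tensor([2.0, 0.5, -1.0])  # f_theta(x)
y_true = torch.tensor([2.3, 0.0, -0.5])

# Negative log-likelihood under y ~ N(f_theta(x), sigma^2)
nll = -Normal(y_pred, sigma).log_prob(y_true)

# Squared-error term plus the constant log(sqrt(2 * pi * sigma^2))
manual = (y_true - y_pred) ** 2 / (2 * sigma ** 2) + math.log(math.sqrt(2 * math.pi * sigma ** 2))

print(torch.allclose(nll, manual))  # True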

This interpretation matters. MSE assumes that large positive and negative errors are penalized symmetrically, and that deviations are well modeled by Gaussian noise. When the noise has heavy tails or many outliers, MSE may perform poorly.

Sensitivity to Outliers

Because MSE squares the error, large errors receive very large penalties.

If the error is 1, the squared error is 1. If the error is 10, the squared error is 100. If the error is 100, the squared error is 10000.

This makes MSE sensitive to outliers. A small number of unusual examples can dominate the loss and strongly affect the learned model.

For example, suppose a dataset contains house prices, but a few prices are incorrectly recorded with extra zeros. MSE may push the model toward these corrupted targets because their squared errors are very large.

When outliers are expected, other losses may be preferable, such as mean absolute error or Huber loss.

MSE Versus Mean Absolute Error

Mean absolute error uses absolute differences:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|.$$

MSE penalizes large errors more strongly than MAE. This can be useful when large errors are especially undesirable. But it can also make MSE less robust.

| Loss | Formula | Behavior |
| --- | --- | --- |
| MSE | $\frac{1}{n}\sum_i(\hat{y}_i-y_i)^2$ | Strongly penalizes large errors |
| MAE | $\frac{1}{n}\sum_i \lvert \hat{y}_i-y_i \rvert$ | More robust to large errors |
| Huber | Quadratic near zero, linear for large errors | Compromise between MSE and MAE |

MSE has smooth gradients everywhere. MAE has a nondifferentiable point at zero, although this rarely prevents practical optimization. Huber loss combines the smoothness of MSE near zero with the robustness of MAE for large errors.
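
The difference is easy to see on a small batch with one large error. The numbers are made up; nn.L1Loss and nn.HuberLoss are PyTorch's built-in MAE and Huber losses:

import torch
import torch.nn as nn

y_pred = torch.tensor([2.0, 3.0, 4.0, 5.0])
y_true = torch.tensor([2.1, 3.1, 3.9, 25.0])  # the last target is an outlier

print(nn.MSELoss()(y_pred, y_true))    # dominated by the outlier
print(nn.L1Loss()(y_pred, y_true))     # grows only linearly with the outlier
print(nn.HuberLoss()(y_pred, y_true))  # quadratic near zero, linear for large errors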

Scale Dependence

MSE depends on the scale of the target variable. If the target is measured in meters, the squared error is measured in square meters. If the target is measured in kilometers, the numerical loss changes.

For example, an error of 10 meters gives squared error 100. The same error expressed as 0.01 kilometers gives squared error 0.0001.

This means raw MSE values are hard to compare across datasets with different units.

In practice, regression targets are often standardized:

$$y' = \frac{y - \mu}{\sigma},$$

where $\mu$ is the mean target value and $\sigma$ is the target standard deviation.

The model is trained to predict $y'$. Predictions can later be transformed back to the original scale:

$$\hat{y} = \sigma \hat{y}' + \mu.$$

Target normalization often improves numerical stability and optimization.
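
A minimal sketch of this workflow, assuming the raw targets are stored in a tensor y and the model outputs normalized predictions:

import torch

y = torch.tensor([150_000.0, 230_000.0, 310_000.0, 420_000.0])  # raw targets, e.g. prices

# Standardize targets before training
mu = y.mean()
sigma = y.std()
y_norm = (y - mu) / sigma

# ... train the model to predict y_norm ...

# Transform normalized predictions back to the original scale
y_pred_norm = torch.tensor([-1.0, 0.1, 0.4, 1.2])  # hypothetical model outputs
y_pred = sigma * y_pred_norm + mu
print(y_pred)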

Root Mean Squared Error

A related metric is root mean squared error:

$$\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 }.$$

RMSE has the same unit as the target variable. If the target is measured in dollars, RMSE is also measured in dollars. This makes RMSE easier to interpret than MSE.

However, MSE is usually used as the training loss because it has a simpler derivative. RMSE can be used as an evaluation metric.

In PyTorch:

mse = nn.MSELoss()(y_pred, y_true)
rmse = torch.sqrt(mse)

print(rmse)

When using RMSE during training, one often adds a small epsilon for numerical stability:

eps = 1e-8
rmse = torch.sqrt(mse + eps)

Shape Requirements in PyTorch

nn.MSELoss expects predictions and targets to have the same shape.

Correct:

y_pred = torch.randn(32, 1)
y_true = torch.randn(32, 1)

loss = nn.MSELoss()(y_pred, y_true)

Potentially incorrect:

y_pred = torch.randn(32, 1)
y_true = torch.randn(32)   # shape (32,), not (32, 1)

loss = nn.MSELoss()(y_pred, y_true)

PyTorch may broadcast the tensors (usually with a warning) instead of raising an error. Here the difference is computed over a broadcast (32, 32) tensor, which silently changes the loss value. For regression, it is usually better to make the shapes explicitly match:

y_true = y_true.view(-1, 1)

or

y_pred = y_pred.squeeze(-1)

depending on the intended shape.

Shape mismatches are a common source of silent training errors.

MSE in Neural Network Training

In a neural network, MSE is applied to the final output. The model may contain many layers, but the loss only compares the final prediction with the target.

For example:

class RegressionMLP(nn.Module):
    def __init__(self, in_features, hidden_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, 1),
        )

    def forward(self, x):
        return self.net(x)

Training:

model = RegressionMLP(in_features=10, hidden_features=64)

x = torch.randn(32, 10)
y = torch.randn(32, 1)

loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

y_pred = model(x)
loss = loss_fn(y_pred, y)

optimizer.zero_grad()
loss.backward()
optimizer.step()

The MSE loss creates gradients at the output. Backpropagation then propagates those gradients through every layer of the network.

When to Use MSE

Mean squared error is appropriate when the task is regression, the target is continuous, errors are roughly symmetric, large errors should be penalized strongly, and the noise is reasonably close to Gaussian.

It is commonly used for:

| Task | Example target |
| --- | --- |
| Price prediction | House price, stock return, demand |
| Physical prediction | Temperature, force, velocity |
| Coordinate prediction | Object center, landmark location |
| Signal reconstruction | Audio waveform, image pixels |
| Autoencoder training | Reconstructed input values |
| Value prediction | Reinforcement learning value estimates |

For classification, MSE is usually the wrong default. Classification normally uses cross-entropy because the model predicts class probabilities, not continuous targets.

Limitations

MSE has several limitations.

It is sensitive to outliers because large errors are squared. It depends strongly on the scale of the target variable. It assumes symmetric costs for overprediction and underprediction. It corresponds to Gaussian noise with constant variance, which may be inappropriate for many real datasets.

MSE also treats each output dimension equally unless weights are added. In multi-output regression, this can cause problems when different target dimensions have different units or scales.

For example, suppose a model predicts both age and income. Income may numerically dominate the loss because its values are much larger. Standardization or weighted losses are usually needed.

Practical Guidelines

Use MSE as the first baseline for regression. Normalize input features and often normalize regression targets. Check prediction and target shapes carefully. Inspect residuals to see whether errors are symmetric and whether outliers dominate. Report RMSE or MAE alongside MSE when the original units matter.

For PyTorch, the standard pattern is:

loss_fn = nn.MSELoss()
loss = loss_fn(predictions, targets)

For many regression models, this simple loss is sufficient. For noisy, heavy-tailed, or outlier-heavy data, compare it against MAE and Huber loss.