Mean squared error is one of the simplest and most widely used loss functions in supervised learning. It measures the average squared difference between a model’s prediction and the target value.
It is mainly used for regression problems, where the target is a real-valued quantity rather than a class label. Examples include predicting house prices, temperatures, distances, demand, ratings, physical measurements, or future numerical values.
Suppose a model receives an input $x$ and produces a prediction
$$\hat{y} = f_\theta(x),$$
where $\theta$ denotes the model parameters. If the true target is $y$, the squared error for one example is
$$\ell(\hat{y}, y) = (\hat{y} - y)^2.$$
For a dataset with $n$ examples, the mean squared error is
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2.$$
The loss is small when predictions are close to targets. The loss grows quickly when predictions are far away from targets because the error is squared.
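As a quick numeric check, the definition can be computed by hand in plain Python (a minimal sketch; the prediction and target values are made up for illustration):

```python
# Hypothetical predictions and targets, chosen only for illustration.
predictions = [2.5, 0.0, 2.0, 8.0]
targets = [3.0, -0.5, 2.0, 7.0]

# Mean of the squared differences.
squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
mse = sum(squared_errors) / len(squared_errors)
print(mse)  # 0.375
```

The largest single error (1.0 on the last example) contributes 1.0 to the sum, while the two 0.5 errors contribute only 0.25 each, illustrating how squaring emphasizes larger deviations.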
Regression as Function Approximation
In regression, the model learns a function from inputs to continuous outputs:
$$f_\theta : \mathbb{R}^d \to \mathbb{R}.$$
For example, a model may map a feature vector $x$ describing a house to its price:
$$x \mapsto \text{price}.$$
The model prediction is
$$\hat{y} = f_\theta(x).$$
Training adjusts $\theta$ so that $f_\theta(x)$ becomes close to $y$ on the training data and, more importantly, on future unseen data.
Mean squared error gives a precise objective for this problem. It asks the model to minimize the average squared prediction error.
MSE for Batches
Deep learning models usually train on mini-batches. Suppose a batch contains $n$ examples. The predictions and targets may be written as vectors:
$$\hat{\mathbf{y}} = (\hat{y}_1, \dots, \hat{y}_n), \qquad \mathbf{y} = (y_1, \dots, y_n).$$
The batch mean squared error is
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2.$$
In PyTorch:

```python
import torch

y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])

loss = torch.mean((y_pred - y_true) ** 2)
print(loss)
```

PyTorch also provides the built-in module:

```python
import torch.nn as nn

loss_fn = nn.MSELoss()
loss = loss_fn(y_pred, y_true)
print(loss)
```

Both versions compute the same loss when the default reduction is used.
MSE for Multi-Output Regression
Many models predict more than one value per example. For instance, a model may predict both temperature and humidity, or it may predict the $x$- and $y$-coordinates of a point.
Suppose
$$\hat{Y}, Y \in \mathbb{R}^{n \times d},$$
where $n$ is the batch size and $d$ is the number of output dimensions. The mean squared error is usually computed over all entries:
$$\mathrm{MSE} = \frac{1}{nd} \sum_{i=1}^{n} \sum_{j=1}^{d} (\hat{Y}_{ij} - Y_{ij})^2.$$
In PyTorch:

```python
y_pred = torch.randn(32, 5)  # 32 examples, 5 outputs each
y_true = torch.randn(32, 5)

loss_fn = nn.MSELoss()
loss = loss_fn(y_pred, y_true)
print(loss.shape)  # torch.Size([])
```

The result is a scalar tensor. This scalar is then used for backpropagation.
Gradient of Mean Squared Error
Mean squared error is popular partly because its derivative is simple.
For one prediction $\hat{y}$ and target $y$, define
$$\ell = (\hat{y} - y)^2.$$
Then
$$\frac{\partial \ell}{\partial \hat{y}} = 2(\hat{y} - y).$$
For the averaged batch loss
$$L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,$$
the derivative with respect to $\hat{y}_i$ is
$$\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{n} (\hat{y}_i - y_i).$$
The sign of the gradient tells the model how to change the prediction. If $\hat{y}_i > y_i$, the gradient is positive, so gradient descent pushes the prediction downward. If $\hat{y}_i < y_i$, the gradient is negative, so gradient descent pushes the prediction upward.
This behavior is exactly what we want for regression.
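The batch gradient formula can be checked against autograd (a small sketch; the prediction and target values are arbitrary):

```python
import torch

y_pred = torch.tensor([2.0, -1.0, 4.0], requires_grad=True)
y_true = torch.tensor([1.0, 0.0, 4.0])

# Averaged batch MSE.
loss = torch.mean((y_pred - y_true) ** 2)
loss.backward()

# Autograd's gradient should match (2/n) * (y_pred - y_true).
n = y_pred.shape[0]
manual = (2.0 / n) * (y_pred.detach() - y_true)
print(y_pred.grad)
print(manual)
```

Note that the third prediction is exactly correct, so its gradient is zero: a prediction that already matches its target receives no update pressure.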
Linear Regression with MSE
Consider a linear model
$$\hat{y} = wx + b.$$
For one example, the squared error is
$$\ell = (wx + b - y)^2.$$
The gradients are
$$\frac{\partial \ell}{\partial w} = 2(wx + b - y)\, x$$
and
$$\frac{\partial \ell}{\partial b} = 2(wx + b - y).$$
Thus the weight gradient depends on both the prediction error and the input value. Large input values amplify the gradient. This is one reason feature scaling matters in regression.
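The two gradient formulas can be verified with autograd on a single example (a sketch; the values of $w$, $b$, $x$, and $y$ are arbitrary):

```python
import torch

w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
x = torch.tensor(3.0)
y = torch.tensor(2.0)

# Squared error for one example: (w*x + b - y)^2
loss = (w * x + b - y) ** 2
loss.backward()

# Manual formulas: dL/dw = 2*(wx + b - y)*x and dL/db = 2*(wx + b - y).
residual = w.detach() * x + b.detach() - y
print(w.grad, 2 * residual * x)  # these should match
print(b.grad, 2 * residual)      # these should match
```

Doubling `x` would double `w.grad` while leaving `b.grad` unchanged, which is the amplification effect that makes feature scaling matter.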
For a batch of $n$ examples, let
$$\mathbf{x} = (x_1, \dots, x_n), \qquad \mathbf{y} = (y_1, \dots, y_n).$$
The prediction is
$$\hat{\mathbf{y}} = w\mathbf{x} + b.$$
The mean squared error is
$$L = \frac{1}{n} \lVert \hat{\mathbf{y}} - \mathbf{y} \rVert^2.$$
This compact vector form is the standard formulation of least-squares regression.
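Because this loss is quadratic in the parameters, least-squares regression also admits a closed-form solution. A sketch using `torch.linalg.lstsq` on noise-free synthetic data (the coefficients 3 and 2 are assumptions chosen for this illustration):

```python
import torch

torch.manual_seed(0)
x = torch.randn(100, 1)
y = 3.0 * x + 2.0  # noise-free targets for a clean check

# Augment x with a column of ones so the bias becomes part of the weight vector.
X = torch.cat([x, torch.ones_like(x)], dim=1)

# Solve min_w ||X w - y||^2 in closed form.
w = torch.linalg.lstsq(X, y).solution
print(w.squeeze())  # approximately [3.0, 2.0]
```

Gradient descent on the MSE loss converges toward this same solution; the closed form is practical only when the problem is small and linear.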
PyTorch Example: Linear Regression
The following example trains a one-layer model using mean squared error.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Synthetic data: y = 3x + 2 + noise
torch.manual_seed(0)
x = torch.randn(1000, 1)
noise = 0.2 * torch.randn(1000, 1)
y = 3.0 * x + 2.0 + noise

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("weight:", model.weight.item())
print("bias:", model.bias.item())
print("loss:", loss.item())
```

The learned weight should be close to 3, and the learned bias should be close to 2. The exact values differ because the training data includes noise and the parameters are initialized randomly.
The training loop follows the standard pattern:
```python
y_pred = model(x)
loss = loss_fn(y_pred, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The loss measures the prediction error. `backward()` computes gradients. The optimizer updates the model parameters.
Reduction Modes in PyTorch
nn.MSELoss supports different reduction modes.
```python
nn.MSELoss(reduction="mean")
nn.MSELoss(reduction="sum")
nn.MSELoss(reduction="none")
```

The default is `"mean"`.
With `"mean"`, PyTorch averages over all elements:

```python
loss_fn = nn.MSELoss(reduction="mean")
loss = loss_fn(y_pred, y_true)
```

With `"sum"`, PyTorch sums all squared errors:

```python
loss_fn = nn.MSELoss(reduction="sum")
loss = loss_fn(y_pred, y_true)
```

With `"none"`, PyTorch returns the elementwise squared errors:

```python
loss_fn = nn.MSELoss(reduction="none")
loss = loss_fn(y_pred, y_true)
print(loss.shape)
```

The `"none"` option is useful when examples need different weights. For example:

```python
loss_fn = nn.MSELoss(reduction="none")
errors = loss_fn(y_pred, y_true)
weights = torch.tensor([1.0, 0.5, 2.0, 1.0])
weighted_loss = (errors * weights).mean()
```

For multi-dimensional outputs, care is needed because the weight tensor must have a shape compatible with the error tensor.
MSE and Gaussian Noise
Mean squared error has a probabilistic interpretation. Suppose the target is generated by
$$y = f_\theta(x) + \epsilon,$$
where the noise term follows a Gaussian distribution:
$$\epsilon \sim \mathcal{N}(0, \sigma^2).$$
Then
$$y \mid x \sim \mathcal{N}\!\left(f_\theta(x), \sigma^2\right).$$
The likelihood of observing $y$ given $x$ and $\theta$ is
$$p(y \mid x, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - f_\theta(x))^2}{2\sigma^2}\right).$$
Maximizing this likelihood is equivalent to minimizing squared error. More precisely, minimizing the negative log-likelihood gives
$$-\log p(y \mid x, \theta) = \frac{(y - f_\theta(x))^2}{2\sigma^2} + \text{const}.$$
The constant does not depend on $\theta$, so it has no effect on training. Thus, under the assumption of independent Gaussian noise with constant variance, mean squared error is the natural loss function.
This interpretation matters. MSE assumes that large positive and negative errors are penalized symmetrically, and that deviations are well modeled by Gaussian noise. When the noise has heavy tails or many outliers, MSE may perform poorly.
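The equivalence can be illustrated numerically: the per-example Gaussian negative log-likelihood equals the squared error scaled by $1/(2\sigma^2)$ plus a constant (a sketch with $\sigma$ fixed at 1 and arbitrary made-up values):

```python
import math
import torch

sigma = 1.0
f_x = torch.tensor([1.0, 2.0, 3.0])  # model outputs (arbitrary values)
y = torch.tensor([1.5, 1.0, 3.5])    # observed targets (arbitrary values)

# Per-example negative log-likelihood under y ~ Normal(f_x, sigma^2).
nll = -torch.distributions.Normal(f_x, sigma).log_prob(y)

# Squared error scaled by 1/(2*sigma^2), plus the sigma-dependent constant.
const = 0.5 * math.log(2 * math.pi * sigma ** 2)
se_term = (y - f_x) ** 2 / (2 * sigma ** 2) + const

print(torch.allclose(nll, se_term))  # True
```

Since the constant and the $1/(2\sigma^2)$ factor do not depend on the model outputs, minimizing the NLL and minimizing MSE select the same parameters.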
Sensitivity to Outliers
Because MSE squares the error, large errors receive very large penalties.
If the error is $1$, the squared error is $1$. If the error is $2$, the squared error is $4$. If the error is $10$, the squared error is $100$.
This makes MSE sensitive to outliers. A small number of unusual examples can dominate the loss and strongly affect the learned model.
For example, suppose a dataset contains house prices, but a few prices are incorrectly recorded with extra zeros. MSE may push the model toward these corrupted targets because their squared errors are very large.
When outliers are expected, other losses may be preferable, such as mean absolute error or Huber loss.
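A small comparison on data with a single corrupted target shows how much more an outlier moves MSE than MAE or Huber loss (a sketch; the values are made up):

```python
import torch
import torch.nn as nn

y_pred = torch.tensor([1.0, 2.0, 3.0, 4.0])
y_clean = torch.tensor([1.1, 1.9, 3.2, 4.0])
y_outlier = y_clean.clone()
y_outlier[3] = 40.0  # one corrupted target, e.g. an extra zero in a price

for name, fn in [("MSE", nn.MSELoss()),
                 ("MAE", nn.L1Loss()),
                 ("Huber", nn.HuberLoss())]:
    clean = fn(y_pred, y_clean).item()
    dirty = fn(y_pred, y_outlier).item()
    print(f"{name}: clean={clean:.4f} with_outlier={dirty:.4f}")
```

With the outlier, the MSE is dominated by the single error of 36 (contributing $36^2 = 1296$ to the sum), while MAE and Huber grow only linearly in that error.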
MSE Versus Mean Absolute Error
Mean absolute error uses absolute differences:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert \hat{y}_i - y_i \rvert.$$
MSE penalizes large errors more strongly than MAE. This can be useful when large errors are especially undesirable. But it can also make MSE less robust.
| Loss | Formula | Behavior |
|---|---|---|
| MSE | $\frac{1}{n}\sum_i (\hat{y}_i - y_i)^2$ | Strongly penalizes large errors |
| MAE | $\frac{1}{n}\sum_i \lvert \hat{y}_i - y_i \rvert$ | More robust to outliers |
| Huber | Quadratic near zero, linear for large errors | Compromise between MSE and MAE |
MSE has smooth gradients everywhere. MAE has a nondifferentiable point at zero, although this rarely prevents practical optimization. Huber loss combines the smoothness of MSE near zero with the robustness of MAE for large errors.
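The difference in gradient behavior can be seen with autograd (a sketch; the error magnitudes are chosen arbitrarily):

```python
import torch

for err in [0.01, 0.1, 1.0, 10.0]:
    # MSE gradient: 2 * err, shrinking smoothly as the error shrinks.
    y_pred = torch.tensor(err, requires_grad=True)
    mse = (y_pred - 0.0) ** 2
    mse.backward()
    mse_grad = y_pred.grad.item()

    # MAE gradient: the sign of the error, +1 for any positive error.
    y_pred2 = torch.tensor(err, requires_grad=True)
    mae = torch.abs(y_pred2 - 0.0)
    mae.backward()
    mae_grad = y_pred2.grad.item()

    print(f"err={err}: mse_grad={mse_grad:.4f}, mae_grad={mae_grad:.4f}")
```

The constant-magnitude MAE gradient never decays near the optimum, which is one reason MAE can oscillate around the minimum under plain gradient descent while MSE settles smoothly.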
Scale Dependence
MSE depends on the scale of the target variable. If the target is measured in meters, the squared error is measured in square meters. If the target is measured in kilometers, the numerical loss changes.
For example, an error of $100$ meters gives squared error $10000$ (in square meters). The same error expressed as $0.1$ kilometers gives squared error $0.01$ (in square kilometers).
This means raw MSE values are hard to compare across datasets with different units.
In practice, regression targets are often standardized:
$$\tilde{y} = \frac{y - \mu}{\sigma},$$
where $\mu$ is the mean target value and $\sigma$ is the target standard deviation.
The model is trained to predict $\tilde{y}$. Predictions can later be transformed back to the original scale:
$$\hat{y} = \sigma \hat{\tilde{y}} + \mu.$$
Target normalization often improves numerical stability and optimization.
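A standardization round trip might look like the following sketch (`unbiased=False` uses the population standard deviation; the target values are made up):

```python
import torch

y = torch.tensor([100.0, 150.0, 200.0, 250.0])  # raw targets (arbitrary)

mu = y.mean()
sigma = y.std(unbiased=False)

# Standardize for training ...
y_tilde = (y - mu) / sigma

# ... and invert after prediction.
y_back = y_tilde * sigma + mu

print(y_tilde.mean(), y_tilde.std(unbiased=False))  # ~0 and ~1
print(torch.allclose(y_back, y))  # True
```

The statistics `mu` and `sigma` must be computed on the training set only and reused unchanged at evaluation time, or the evaluation targets leak into training.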
Root Mean Squared Error
A related metric is root mean squared error:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}.$$
RMSE has the same unit as the target variable. If the target is measured in dollars, RMSE is also measured in dollars. This makes RMSE easier to interpret than MSE.
However, MSE is usually used as the training loss because it has a simpler derivative. RMSE can be used as an evaluation metric.
In PyTorch:

```python
mse = nn.MSELoss()(y_pred, y_true)
rmse = torch.sqrt(mse)
print(rmse)
```

When using RMSE during training, one often adds a small epsilon for numerical stability:

```python
eps = 1e-8
rmse = torch.sqrt(mse + eps)
```

Shape Requirements in PyTorch
nn.MSELoss expects predictions and targets to have the same shape.
Correct:

```python
y_pred = torch.randn(32, 1)
y_true = torch.randn(32, 1)
loss = nn.MSELoss()(y_pred, y_true)
```

Potentially incorrect:

```python
y_pred = torch.randn(32, 1)
y_true = torch.randn(32)
loss = nn.MSELoss()(y_pred, y_true)
```

PyTorch may broadcast the tensors instead of raising an immediate error. Broadcasting can produce an unintended loss shape. For regression, it is usually better to make the shapes explicitly match:

```python
y_true = y_true.view(-1, 1)
```

or

```python
y_pred = y_pred.squeeze(-1)
```

depending on the intended shape.
Shape mismatches are a common source of silent training errors.
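The hazard can be demonstrated directly with raw tensor arithmetic: with shapes `(4, 1)` and `(4,)`, the difference broadcasts to `(4, 4)`, so every prediction is compared against every target (a small sketch):

```python
import torch

y_pred = torch.tensor([[1.0], [2.0], [3.0], [4.0]])  # shape (4, 1)
y_true = torch.tensor([1.0, 2.0, 3.0, 4.0])          # shape (4,)

# Broadcasting pairs every prediction with every target.
diff = y_pred - y_true
print(diff.shape)  # torch.Size([4, 4])

# The matching-shape loss is 0 here; the broadcast loss is not.
wrong = torch.mean(diff ** 2)
right = torch.mean((y_pred.squeeze(-1) - y_true) ** 2)
print(wrong.item(), right.item())
```

Even though the predictions equal the targets exactly, the broadcast version averages the 16 cross-pair errors and reports a nonzero loss, which is exactly the kind of silent error the shape check prevents.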
MSE in Neural Network Training
In a neural network, MSE is applied to the final output. The model may contain many layers, but the loss only compares the final prediction with the target.
For example:

```python
class RegressionMLP(nn.Module):
    def __init__(self, in_features, hidden_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, 1),
        )

    def forward(self, x):
        return self.net(x)
```

Training:

```python
model = RegressionMLP(in_features=10, hidden_features=64)
x = torch.randn(32, 10)
y = torch.randn(32, 1)

loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

y_pred = model(x)
loss = loss_fn(y_pred, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The MSE loss creates gradients at the output. Backpropagation then propagates those gradients through every layer of the network.
When to Use MSE
Mean squared error is appropriate when the task is regression, the target is continuous, errors are roughly symmetric, large errors should be penalized strongly, and the noise is reasonably close to Gaussian.
It is commonly used for:
| Task | Example target |
|---|---|
| Price prediction | House price, stock return, demand |
| Physical prediction | Temperature, force, velocity |
| Coordinate prediction | Object center, landmark location |
| Signal reconstruction | Audio waveform, image pixels |
| Autoencoder training | Reconstructed input values |
| Value prediction | Reinforcement learning value estimates |
For classification, MSE is usually the wrong default. Classification normally uses cross-entropy because the model predicts class probabilities, not continuous targets.
Limitations
MSE has several limitations.
It is sensitive to outliers because large errors are squared. It depends strongly on the scale of the target variable. It assumes symmetric costs for overprediction and underprediction. It corresponds to Gaussian noise with constant variance, which may be inappropriate for many real datasets.
MSE also treats each output dimension equally unless weights are added. In multi-output regression, this can cause problems when different target dimensions have different units or scales.
For example, suppose a model predicts both age and income. Income may numerically dominate the loss because its values are much larger. Standardization or weighted losses are usually needed.
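One way to keep a large-scale output from dominating is per-dimension weighting with `reduction="none"` (a sketch; the age and income values are made-up illustrations, and the weight $10^{-6}$ is chosen only to roughly cancel the scale gap):

```python
import torch
import torch.nn as nn

# Hypothetical targets: column 0 is age (~tens), column 1 is income (~tens of thousands).
y_true = torch.tensor([[30.0, 50000.0],
                       [45.0, 72000.0]])
y_pred = torch.tensor([[32.0, 48000.0],
                       [40.0, 70000.0]])

errors = nn.MSELoss(reduction="none")(y_pred, y_true)

# Unweighted mean: the income errors dominate completely.
print(errors.mean())

# Per-dimension weights that roughly cancel the scale difference.
weights = torch.tensor([1.0, 1e-6])
balanced = (errors * weights).mean()
print(balanced)
```

Standardizing each target dimension before training achieves the same balance without hand-picked weights and is usually the simpler choice.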
Practical Guidelines
Use MSE as the first baseline for regression. Normalize input features and often normalize regression targets. Check prediction and target shapes carefully. Inspect residuals to see whether errors are symmetric and whether outliers dominate. Report RMSE or MAE alongside MSE when the original units matter.
For PyTorch, the standard pattern is:
```python
loss_fn = nn.MSELoss()
loss = loss_fn(predictions, targets)
```

For many regression models, this simple loss is sufficient. For noisy, heavy-tailed, or outlier-heavy data, compare it against MAE and Huber loss.