Linear regression is the simplest supervised learning model used in deep learning. It maps an input vector to a numerical output by applying a linear transformation. Although the model is simple, it introduces the main structure of neural network training: parameters, predictions, loss functions, gradients, and optimization.
A linear regression model assumes that the target value can be approximated by a weighted sum of the input features. If the input is

$$\mathbf{x} = (x_1, x_2, \dots, x_d),$$

then the model predicts

$$\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b.$$

Here $\mathbf{w} = (w_1, \dots, w_d)$ is the weight vector, $b$ is the bias, and $\hat{y}$ is the predicted output.
The weight $w_j$ controls how much feature $x_j$ contributes to the prediction. The bias $b$ shifts the prediction independently of the input.
The Linear Model
For one input vector $\mathbf{x} \in \mathbb{R}^d$, the model is

$$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b.$$

This can be written compactly as

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$

The notation $\mathbf{w}^\top \mathbf{x}$ means the dot product between $\mathbf{w}$ and $\mathbf{x}$:

$$\mathbf{w}^\top \mathbf{x} = \sum_{j=1}^{d} w_j x_j.$$
Linear regression is called linear because the prediction is linear in the input features and in the parameters.
For a batch of inputs, we write

$$X \in \mathbb{R}^{B \times d},$$

where each row of $X$ is one input vector. The batch prediction is

$$\hat{\mathbf{y}} = X \mathbf{w} + b,$$

where $\hat{\mathbf{y}} \in \mathbb{R}^{B}$.
In PyTorch:
```python
import torch

B = 32
d = 5
X = torch.randn(B, d)
w = torch.randn(d)
b = torch.randn(())

y_hat = X @ w + b
print(y_hat.shape)  # torch.Size([32])
```

The expression `X @ w` performs matrix-vector multiplication. The scalar bias `b` is broadcast across the batch.
Data and Targets
In supervised learning, we are given training examples

$$(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_N, y_N).$$

Each $\mathbf{x}_i \in \mathbb{R}^d$ is an input vector, and each $y_i \in \mathbb{R}$ is a real-valued target.
For example, a model may predict house price from features such as area, number of rooms, distance to city center, and age of the building. The input vector contains the measured features. The target is the observed price.
The training data can be stored as

$$X \in \mathbb{R}^{N \times d}, \qquad \mathbf{y} \in \mathbb{R}^{N}.$$

The model produces predictions

$$\hat{\mathbf{y}} = X \mathbf{w} + b.$$

The goal of training is to choose $\mathbf{w}$ and $b$ so that $\hat{\mathbf{y}}$ is close to $\mathbf{y}$.
Mean Squared Error
The standard loss for linear regression is mean squared error:

$$L(\mathbf{w}, b) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2.$$

Since

$$\hat{y}_i = \mathbf{w}^\top \mathbf{x}_i + b,$$

we can write

$$L(\mathbf{w}, b) = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{w}^\top \mathbf{x}_i + b - y_i)^2.$$

The quantity $\hat{y}_i - y_i$ is the prediction error for example $i$. Squaring the error penalizes large mistakes more strongly than small mistakes and makes positive and negative errors contribute equally.
In PyTorch:
```python
y = torch.randn(B)
loss = ((y_hat - y) ** 2).mean()
print(loss.shape)  # torch.Size([])
```

The loss is a scalar tensor. Training reduces this scalar by changing the parameters.
Parameters and Gradients
The parameters of the model are $\mathbf{w}$ and $b$. These are the quantities learned from data.

To train the model, we compute the gradient of the loss with respect to each parameter:

$$\frac{\partial L}{\partial \mathbf{w}} \quad \text{and} \quad \frac{\partial L}{\partial b}.$$

For a single example, the squared error loss is

$$\ell = (\hat{y} - y)^2, \qquad \hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$

Let

$$e = \hat{y} - y.$$

Then

$$\ell = e^2.$$

Using the chain rule,

$$\frac{\partial \ell}{\partial \mathbf{w}} = \frac{\partial \ell}{\partial e} \cdot \frac{\partial e}{\partial \mathbf{w}} = 2e \cdot \mathbf{x}.$$

Therefore,

$$\frac{\partial \ell}{\partial \mathbf{w}} = 2(\hat{y} - y)\,\mathbf{x}.$$

The derivative with respect to the bias is

$$\frac{\partial \ell}{\partial b} = 2(\hat{y} - y).$$

For a batch of $N$ examples, the mean squared error gradients are

$$\frac{\partial L}{\partial \mathbf{w}} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)\,\mathbf{x}_i$$

and

$$\frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i).$$

These formulas show the role of the error. If predictions match targets, then $\hat{y}_i - y_i = 0$ and the gradient is zero. If predictions differ from targets, the gradient points in the direction of steepest increase of the loss, so moving the parameters against it reduces the loss.
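The batch gradient formulas above can be checked against PyTorch's autograd. A minimal sketch on random data (the sizes here are arbitrary):

```python
import torch

torch.manual_seed(0)
N, d = 8, 3
X = torch.randn(N, d)
y = torch.randn(N)
w = torch.randn(d, requires_grad=True)
b = torch.randn((), requires_grad=True)

# Forward pass and autograd gradients.
y_hat = X @ w + b
loss = ((y_hat - y) ** 2).mean()
loss.backward()

# Analytic gradients: (2/N) sum_i (y_hat_i - y_i) x_i and (2/N) sum_i (y_hat_i - y_i).
err = (y_hat - y).detach()
grad_w = 2.0 / N * X.T @ err
grad_b = 2.0 / N * err.sum()

print(torch.allclose(w.grad, grad_w, atol=1e-5))  # True
print(torch.allclose(b.grad, grad_b, atol=1e-5))  # True
```

Agreement between `w.grad` and the closed-form expression confirms that autograd computes exactly the derivation above.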
Gradient Descent
Gradient descent updates parameters by moving in the negative gradient direction. For learning rate $\eta$,

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \frac{\partial L}{\partial \mathbf{w}}, \qquad b \leftarrow b - \eta \frac{\partial L}{\partial b}.$$
The learning rate controls the size of each update. A small learning rate may train slowly. A large learning rate may make training unstable.
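This can be seen on a toy one-dimensional problem (the values here are illustrative, not from the main example): minimizing $L(w) = w^2$, whose gradient is $2w$, so each update multiplies $w$ by $(1 - 2\eta)$.

```python
# Gradient descent on L(w) = w^2, with gradient 2w.
# Each step computes w <- w - lr * 2w = w * (1 - 2*lr).
def run(lr, steps=20, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(abs(run(0.1)))   # shrinks toward 0: training converges
print(abs(run(1.1)))   # grows every step: training diverges
```

With `lr = 0.1` the multiplier is $0.8$ and $w$ decays geometrically; with `lr = 1.1` the multiplier is $-1.2$ and $|w|$ explodes.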
A simple manual implementation is:
```python
import torch

N = 100
d = 3
X = torch.randn(N, d)
true_w = torch.tensor([2.0, -3.0, 0.5])
true_b = torch.tensor(1.0)
noise = 0.1 * torch.randn(N)
y = X @ true_w + true_b + noise

w = torch.randn(d, requires_grad=True)
b = torch.zeros((), requires_grad=True)
lr = 0.05

for step in range(200):
    y_hat = X @ w + b
    loss = ((y_hat - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()
    b.grad.zero_()

print(w)
print(b)
```

The block `with torch.no_grad()` prevents PyTorch from tracking the parameter update itself as part of the computation graph. The calls to `zero_()` clear accumulated gradients before the next step.
Linear Regression with nn.Linear
PyTorch provides linear layers through torch.nn.Linear.
A layer

```python
torch.nn.Linear(in_features, out_features)
```

computes

$$\hat{\mathbf{y}} = \mathbf{x} W^\top + \mathbf{b}.$$

If `in_features = d` and `out_features = 1`, then the layer implements scalar-output linear regression.
```python
import torch
from torch import nn

model = nn.Linear(in_features=3, out_features=1)
X = torch.randn(100, 3)
y_hat = model(X)
print(y_hat.shape)  # torch.Size([100, 1])
```

The output shape is `[100, 1]`, while a target vector may have shape `[100]`. It is often convenient to make them match:

```python
y_hat = model(X).squeeze(-1)
print(y_hat.shape)  # torch.Size([100])
```

The layer stores its parameters as:

```python
print(model.weight.shape)  # torch.Size([1, 3])
print(model.bias.shape)    # torch.Size([1])
```

PyTorch uses the shape `[out_features, in_features]` for the weight matrix.
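Given that storage convention, the layer's computation can be reproduced by hand. A minimal sketch, using a fresh `nn.Linear(3, 1)`:

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)
X = torch.randn(4, 3)

# nn.Linear computes X @ weight.T + bias, with weight of shape [out_features, in_features].
manual = X @ layer.weight.T + layer.bias
print(torch.allclose(layer(X), manual))  # True
```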
Training with an Optimizer
Instead of manually updating parameters, we usually use an optimizer.
```python
import torch
from torch import nn

N = 100
d = 3
X = torch.randn(N, d)
true_w = torch.tensor([2.0, -3.0, 0.5])
true_b = 1.0
y = X @ true_w + true_b + 0.1 * torch.randn(N)

model = nn.Linear(d, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for step in range(200):
    y_hat = model(X).squeeze(-1)
    loss = loss_fn(y_hat, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(model.weight)
print(model.bias)
```

The training step has four parts:
| Step | PyTorch code | Meaning |
|---|---|---|
| Forward pass | `y_hat = model(X)` | Compute predictions |
| Loss computation | `loss = loss_fn(y_hat, y)` | Measure prediction error |
| Backward pass | `loss.backward()` | Compute gradients |
| Parameter update | `optimizer.step()` | Change parameters |
This pattern appears throughout deep learning. More complex networks use the same structure.
Multiple Outputs
Linear regression can predict more than one target. Suppose each input has $d$ features and each output has $k$ values. Then

$$\mathbf{x} \in \mathbb{R}^{d}, \qquad \hat{\mathbf{y}} \in \mathbb{R}^{k}.$$

The model computes

$$\hat{\mathbf{y}} = W \mathbf{x} + \mathbf{b},$$

where

$$W \in \mathbb{R}^{k \times d}, \qquad \mathbf{b} \in \mathbb{R}^{k}.$$
In PyTorch:
```python
model = nn.Linear(in_features=5, out_features=3)
X = torch.randn(32, 5)
Y_hat = model(X)
print(Y_hat.shape)  # torch.Size([32, 3])
```

This model predicts a 3-dimensional output for each input example. Multiple-output regression is useful in forecasting, control, computer vision, and scientific modeling.
Closed-Form Solution and Learning-Based Solution
For ordinary least squares, linear regression has a closed-form solution under suitable conditions. If the bias is included as an additional constant feature, the solution is

$$\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}.$$

This solution minimizes mean squared error when $X^\top X$ is invertible.
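As a sketch, the closed-form fit can be computed with `torch.linalg.lstsq`, which solves the least-squares problem without forming the inverse explicitly. The synthetic data here mirrors the earlier training examples:

```python
import torch

torch.manual_seed(0)
N, d = 200, 3
X = torch.randn(N, d)
true_w = torch.tensor([2.0, -3.0, 0.5])
y = X @ true_w + 1.0 + 0.01 * torch.randn(N)

# Append a constant feature so the bias becomes one more weight.
X_aug = torch.cat([X, torch.ones(N, 1)], dim=1)

# Solve the least-squares problem X_aug @ theta ~= y.
theta = torch.linalg.lstsq(X_aug, y.unsqueeze(1)).solution.squeeze(1)
print(theta)  # approximately [2.0, -3.0, 0.5, 1.0]
```

The recovered `theta` matches the data-generating weights and bias up to the noise level, without any iterative training.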
Deep learning usually uses gradient-based optimization instead of closed-form matrix inversion. There are several reasons.
First, the same training loop extends to nonlinear networks where no closed-form solution exists. Second, stochastic gradient methods scale better to very large datasets. Third, minibatch training allows models to learn from streaming data and large files that may not fit in memory.
Linear regression therefore serves as a useful bridge. It is simple enough to understand exactly, but it has the same training structure used by deep neural networks.
Feature Scaling
Linear regression is sensitive to feature scale. If one feature is measured in thousands and another in fractions, their gradients may have very different magnitudes. This can slow optimization.
A common preprocessing step is standardization:

$$x'_j = \frac{x_j - \mu_j}{\sigma_j},$$

where $\mu_j$ is the mean of feature $j$, and $\sigma_j$ is its standard deviation.
In PyTorch:
```python
mean = X.mean(dim=0)
std = X.std(dim=0)
X_scaled = (X - mean) / (std + 1e-8)
```

The small constant avoids division by zero.
Feature scaling usually improves gradient-based training, especially when using plain SGD.
Linear Regression as a Neural Network
Linear regression can be viewed as a neural network with one layer and no nonlinear activation:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$

A multilayer neural network extends this idea by composing many linear transformations with nonlinear activation functions. For example,

$$\hat{\mathbf{y}} = W_2\, \sigma(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2.$$

The first transformation produces hidden features. The activation function $\sigma$ introduces nonlinearity. The second transformation maps the hidden features to an output.
Without nonlinear activations, stacking linear layers gives another linear layer. This is why nonlinear activation functions are essential in deep networks.
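This collapse can be checked numerically. A minimal sketch composing two illustrative layers `f1` and `f2` and merging them into one linear map:

```python
import torch
from torch import nn

torch.manual_seed(0)
f1 = nn.Linear(4, 8)
f2 = nn.Linear(8, 2)
x = torch.randn(5, 4)

# Two stacked linear layers equal one linear layer with
# merged weight W = W2 @ W1 and merged bias b = W2 @ b1 + b2.
W = f2.weight @ f1.weight
b = f2.weight @ f1.bias + f2.bias

merged = x @ W.T + b
print(torch.allclose(f2(f1(x)), merged, atol=1e-6))  # True
```

No matter how many linear layers are stacked, the same merging argument applies, so the composite model can never represent more than a single linear map.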
Summary
Linear regression predicts a real-valued output from a weighted sum of input features. Its parameters are a weight vector and a bias. The usual loss is mean squared error. Training adjusts the parameters by following gradients of the loss.
In PyTorch, linear regression can be implemented manually with tensors and automatic differentiation, or directly with nn.Linear, nn.MSELoss, and an optimizer such as torch.optim.SGD.
The model is simple, but it contains the core training pattern used throughout deep learning: compute predictions, compute loss, backpropagate gradients, and update parameters.