Linear Regression

Linear regression is the simplest supervised learning model used in deep learning. It maps an input vector to a numerical output by applying a linear transformation. Although the model is simple, it introduces the main structure of neural network training: parameters, predictions, loss functions, gradients, and optimization.

A linear regression model assumes that the target value can be approximated by a weighted sum of the input features. If the input is

x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \in \mathbb{R}^d,

then the model predicts

\hat{y} = w^\top x + b.

Here $w \in \mathbb{R}^d$ is the weight vector, $b \in \mathbb{R}$ is the bias, and $\hat{y} \in \mathbb{R}$ is the predicted output.

The weight $w_j$ controls how much feature $x_j$ contributes to the prediction. The bias $b$ shifts the prediction independently of the input.

The Linear Model

For one input vector $x$, the model is

\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b.

This can be written compactly as

\hat{y} = w^\top x + b.

The notation $w^\top x$ means the dot product between $w$ and $x$:

w^\top x = \sum_{j=1}^{d} w_j x_j.

Linear regression is called linear because the prediction is linear in the input features and in the parameters.

For a batch of $B$ inputs, we write

X \in \mathbb{R}^{B \times d}, \quad w \in \mathbb{R}^{d}, \quad b \in \mathbb{R}.

The batch prediction is

\hat{y} = Xw + b,

where $\hat{y} \in \mathbb{R}^{B}$.

In PyTorch:

import torch

B = 32  # batch size
d = 5   # number of input features

X = torch.randn(B, d)  # one row per example
w = torch.randn(d)     # weight vector
b = torch.randn(())    # scalar bias

y_hat = X @ w + b
print(y_hat.shape)  # torch.Size([32])

The expression X @ w performs matrix-vector multiplication. The scalar bias b is broadcast across the batch.

Data and Targets

In supervised learning, we are given training examples

(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N).

Each $x_i \in \mathbb{R}^d$ is an input vector, and each $y_i \in \mathbb{R}$ is a real-valued target.

For example, a model may predict house price from features such as area, number of rooms, distance to city center, and age of the building. The input vector contains the measured features. The target is the observed price.

The training data can be stored as

X \in \mathbb{R}^{N \times d}, \quad y \in \mathbb{R}^{N}.

The model produces predictions

\hat{y} \in \mathbb{R}^{N}.

The goal of training is to choose $w$ and $b$ so that $\hat{y}$ is close to $y$.
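
In PyTorch, such a dataset is simply a pair of tensors whose first dimensions match. A minimal sketch with random placeholder values:

import torch

N = 100  # number of examples
d = 4    # features per example (e.g. area, rooms, distance, age)

X = torch.randn(N, d)  # inputs, one row per example
y = torch.randn(N)     # real-valued targets

print(X.shape, y.shape)  # torch.Size([100, 4]) torch.Size([100])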

Mean Squared Error

The standard loss for linear regression is mean squared error.

L(w, b) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2.

Since

\hat{y}_i = w^\top x_i + b,

we can write

L(w, b) = \frac{1}{N} \sum_{i=1}^{N} (w^\top x_i + b - y_i)^2.

The quantity $\hat{y}_i - y_i$ is the prediction error for example $i$. Squaring the error penalizes large mistakes more strongly than small ones and makes positive and negative errors contribute equally.

In PyTorch:

y = torch.randn(B)

loss = ((y_hat - y) ** 2).mean()
print(loss.shape)  # torch.Size([])

The loss is a scalar tensor. Training reduces this scalar by changing the parameters.

Parameters and Gradients

The parameters of the model are $w$ and $b$. These are the quantities learned from the data.

To train the model, we compute the gradient of the loss with respect to each parameter:

\nabla_w L \quad \text{and} \quad \frac{\partial L}{\partial b}.

For a single example, the squared error loss is

\ell(w, b) = (\hat{y} - y)^2.

Let

e = \hat{y} - y = w^\top x + b - y.

Then

\ell(w, b) = e^2.

Using the chain rule,

\frac{\partial \ell}{\partial w_j} = 2e \frac{\partial e}{\partial w_j} = 2e\, x_j.

Therefore,

\nabla_w \ell = 2e\, x.

The derivative with respect to the bias is

\frac{\partial \ell}{\partial b} = 2e.

For a batch of $N$ examples, the mean squared error gradients are

\nabla_w L = \frac{2}{N} X^\top (\hat{y} - y),

and

\frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i).

These formulas show the central role of the prediction error. If predictions match targets, then $\hat{y} - y = 0$ and both gradients vanish, so the parameters stop changing. If predictions differ from targets, the gradients are nonzero, and moving against them reduces the loss.
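
These batch formulas can be checked against automatic differentiation. The following sketch, using random data, compares the analytic gradients with the ones PyTorch computes in backward():

import torch

N, d = 100, 3
X = torch.randn(N, d)
y = torch.randn(N)

w = torch.randn(d, requires_grad=True)
b = torch.randn((), requires_grad=True)

y_hat = X @ w + b
loss = ((y_hat - y) ** 2).mean()
loss.backward()

# Analytic gradients from the formulas above
e = (y_hat - y).detach()
grad_w = 2.0 / N * (X.T @ e)
grad_b = 2.0 / N * e.sum()

print(torch.allclose(w.grad, grad_w, atol=1e-6))  # True
print(torch.allclose(b.grad, grad_b, atol=1e-6))  # True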

Gradient Descent

Gradient descent updates the parameters by moving in the negative gradient direction. For a learning rate $\eta > 0$,

w \leftarrow w - \eta\, \nabla_w L, \qquad b \leftarrow b - \eta\, \frac{\partial L}{\partial b}.

The learning rate controls the size of each update. A small learning rate may train slowly. A large learning rate may make training unstable.
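
As a tiny illustration of instability, consider gradient descent on the one-dimensional loss $L(w) = w^2$, whose gradient is $2w$. With too large a learning rate, the iterates overshoot the minimum and grow in magnitude:

w = 1.0
lr = 1.1  # deliberately too large

for step in range(5):
    w = w - lr * (2 * w)  # gradient of w^2 is 2w
    print(w)
# magnitudes grow: about 1.2, 1.44, 1.73, 2.07, 2.49 -- the iterates diverge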

A simple manual implementation is:

import torch

N = 100
d = 3

# Synthetic data generated from known parameters plus noise
X = torch.randn(N, d)
true_w = torch.tensor([2.0, -3.0, 0.5])
true_b = torch.tensor(1.0)

noise = 0.1 * torch.randn(N)
y = X @ true_w + true_b + noise

# Learnable parameters, tracked by autograd
w = torch.randn(d, requires_grad=True)
b = torch.zeros((), requires_grad=True)

lr = 0.05

for step in range(200):
    y_hat = X @ w + b                  # forward pass
    loss = ((y_hat - y) ** 2).mean()   # mean squared error

    loss.backward()                    # compute gradients

    with torch.no_grad():
        w -= lr * w.grad               # gradient descent update
        b -= lr * b.grad

        w.grad.zero_()                 # clear gradients for the next step
        b.grad.zero_()

print(w)
print(b)

The block with torch.no_grad() prevents PyTorch from tracking the parameter update itself as part of the computation graph. The calls to zero_() clear accumulated gradients before the next step.
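
Gradient accumulation is easy to observe directly. Calling backward() twice without clearing the gradient adds the two results together:

import torch

w = torch.ones((), requires_grad=True)

(2 * w).backward()
print(w.grad)  # tensor(2.)

(2 * w).backward()
print(w.grad)  # tensor(4.) -- accumulated, not replaced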

Linear Regression with nn.Linear

PyTorch provides linear layers through torch.nn.Linear.

A layer

torch.nn.Linear(in_features, out_features)

computes

Y = X A^\top + b,

where $A$ is the layer's weight matrix. With in_features = d and out_features = 1, the layer implements scalar-output linear regression.

import torch
from torch import nn

model = nn.Linear(in_features=3, out_features=1)

X = torch.randn(100, 3)
y_hat = model(X)

print(y_hat.shape)  # torch.Size([100, 1])

The output shape is [100, 1], while a target vector may have shape [100]. It is often convenient to make them match:

y_hat = model(X).squeeze(-1)
print(y_hat.shape)  # torch.Size([100])

The layer stores its parameters as:

print(model.weight.shape)  # torch.Size([1, 3])
print(model.bias.shape)    # torch.Size([1])

PyTorch uses the shape [out_features, in_features] for the weight matrix.
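
To see the convention in action, the layer's output can be reproduced with the explicit formula. A small self-contained check:

import torch
from torch import nn

layer = nn.Linear(in_features=3, out_features=1)
X = torch.randn(4, 3)

# nn.Linear computes X @ A^T + b, where A = layer.weight
manual = X @ layer.weight.T + layer.bias
print(torch.allclose(layer(X), manual))  # True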

Training with an Optimizer

Instead of manually updating parameters, we usually use an optimizer.

import torch
from torch import nn

N = 100
d = 3

X = torch.randn(N, d)
true_w = torch.tensor([2.0, -3.0, 0.5])
true_b = 1.0
y = X @ true_w + true_b + 0.1 * torch.randn(N)

model = nn.Linear(d, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for step in range(200):
    y_hat = model(X).squeeze(-1)
    loss = loss_fn(y_hat, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(model.weight)
print(model.bias)

The training step has four parts:

Step               PyTorch code                Meaning
Forward pass       y_hat = model(X)            Compute predictions
Loss computation   loss = loss_fn(y_hat, y)    Measure prediction error
Backward pass      loss.backward()             Compute gradients
Parameter update   optimizer.step()            Change parameters

This pattern appears throughout deep learning. More complex networks use the same structure.

Multiple Outputs

Linear regression can predict more than one target. Suppose each input has $d$ features and each output has $k$ values. Then

X \in \mathbb{R}^{B \times d}, \quad W \in \mathbb{R}^{d \times k}, \quad b \in \mathbb{R}^{k}.

The model computes

\hat{Y} = XW + b,

where

\hat{Y} \in \mathbb{R}^{B \times k}.

In PyTorch:

model = nn.Linear(in_features=5, out_features=3)

X = torch.randn(32, 5)
Y_hat = model(X)

print(Y_hat.shape)  # torch.Size([32, 3])

This model predicts a 3-dimensional output for each input example. Multiple-output regression is useful in forecasting, control, computer vision, and scientific modeling.

Closed-Form Solution and Learning-Based Solution

For ordinary least squares, linear regression has a closed-form solution under suitable conditions. If the bias is included as an additional constant feature, the solution is

w^\star = (X^\top X)^{-1} X^\top y.

This solution minimizes the mean squared error when $X^\top X$ is invertible.
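
The closed-form solution is straightforward to compute in PyTorch. This sketch appends a constant column of ones so the bias is recovered as the last coefficient:

import torch

N, d = 100, 3
X = torch.randn(N, d)
true_w = torch.tensor([2.0, -3.0, 0.5])
y = X @ true_w + 1.0 + 0.1 * torch.randn(N)

# Append a constant feature so the bias becomes the last weight
X_aug = torch.cat([X, torch.ones(N, 1)], dim=1)

# Solve the normal equations (X^T X) w = X^T y
w_star = torch.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(w_star)  # approximately [2.0, -3.0, 0.5, 1.0]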

Deep learning usually uses gradient-based optimization instead of closed-form matrix inversion. There are several reasons.

First, the same training loop extends to nonlinear networks where no closed-form solution exists. Second, stochastic gradient methods scale better to very large datasets. Third, minibatch training allows models to learn from streaming data and large files that may not fit in memory.
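
A minibatch variant of the earlier training loop differs only in which rows participate in each step. A sketch, sampling indices with replacement for simplicity:

import torch
from torch import nn

N, d = 1000, 3
X = torch.randn(N, d)
y = X @ torch.tensor([2.0, -3.0, 0.5]) + 1.0 + 0.1 * torch.randn(N)

model = nn.Linear(d, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

batch_size = 32
for step in range(500):
    idx = torch.randint(0, N, (batch_size,))  # random minibatch indices
    y_hat = model(X[idx]).squeeze(-1)
    loss = loss_fn(y_hat, y[idx])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()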

Linear regression therefore serves as a useful bridge. It is simple enough to understand exactly, but it has the same training structure used by deep neural networks.

Feature Scaling

Linear regression is sensitive to feature scale. If one feature is measured in thousands and another in fractions, their gradients may have very different magnitudes. This can slow optimization.

A common preprocessing step is standardization:

x'_j = \frac{x_j - \mu_j}{\sigma_j},

where $\mu_j$ is the mean of feature $j$ and $\sigma_j$ is its standard deviation.

In PyTorch:

mean = X.mean(dim=0)  # per-feature mean, shape [d]
std = X.std(dim=0)    # per-feature standard deviation, shape [d]

X_scaled = (X - mean) / (std + 1e-8)

The small constant avoids division by zero.

Feature scaling usually improves gradient-based training, especially when using plain SGD.

Linear Regression as a Neural Network

Linear regression can be viewed as a neural network with one layer and no nonlinear activation:

x \longmapsto w^\top x + b.

A multilayer neural network extends this idea by composing many linear transformations with nonlinear activation functions. For example,

x \longmapsto W_2\, \sigma(W_1 x + b_1) + b_2.

The first transformation produces hidden features. The activation function introduces nonlinearity. The second transformation maps the hidden features to an output.
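
In PyTorch, such a two-layer network can be written with nn.Sequential, here using ReLU as one common choice for $\sigma$:

import torch
from torch import nn

mlp = nn.Sequential(
    nn.Linear(4, 8),  # W_1, b_1: produce hidden features
    nn.ReLU(),        # the nonlinearity sigma
    nn.Linear(8, 1),  # W_2, b_2: map hidden features to the output
)

x = torch.randn(2, 4)
print(mlp(x).shape)  # torch.Size([2, 1])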

Without nonlinear activations, stacking linear layers gives another linear layer. This is why nonlinear activation functions are essential in deep networks.
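
This collapse is easy to verify numerically. Two bias-free linear layers composed together match a single linear map whose weight is $W_2 W_1$:

import torch
from torch import nn

f1 = nn.Linear(4, 8, bias=False)
f2 = nn.Linear(8, 2, bias=False)

# f2(f1(x)) = x @ W1^T @ W2^T = x @ (W2 @ W1)^T
W_combined = f2.weight @ f1.weight

x = torch.randn(5, 4)
print(torch.allclose(f2(f1(x)), x @ W_combined.T, atol=1e-6))  # True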

Summary

Linear regression predicts a real-valued output from a weighted sum of input features. Its parameters are a weight vector and a bias. The usual loss is mean squared error. Training adjusts the parameters by following gradients of the loss.

In PyTorch, linear regression can be implemented manually with tensors and automatic differentiation, or directly with nn.Linear, nn.MSELoss, and an optimizer such as torch.optim.SGD.

The model is simple, but it contains the core training pattern used throughout deep learning: compute predictions, compute loss, backpropagate gradients, and update parameters.