Linear Regression

Linear regression is the simplest supervised learning model used in deep learning. It maps an input vector to a numerical output by applying a linear transformation. Although the model is simple, it introduces the main structure of neural network training: parameters, predictions, loss functions, gradients, and optimization.

A linear regression model assumes that the target value can be approximated by a weighted sum of the input features. If the input is

x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \in \mathbb{R}^d,

then the model predicts

\hat{y} = w^\top x + b.

Here $w \in \mathbb{R}^d$ is the weight vector, $b \in \mathbb{R}$ is the bias, and $\hat{y} \in \mathbb{R}$ is the predicted output.

The weight $w_j$ controls how much feature $x_j$ contributes to the prediction. The bias $b$ shifts the prediction independently of the input.

The Linear Model

For one input vector $x$, the model is

\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b.

This can be written compactly as

\hat{y} = w^\top x + b.

The notation $w^\top x$ means the dot product between $w$ and $x$:

w^\top x = \sum_{j=1}^{d} w_j x_j.

Linear regression is called linear because the prediction is linear in the input features and in the parameters.

For a batch of $B$ inputs, we write

X \in \mathbb{R}^{B \times d}, \quad w \in \mathbb{R}^{d}, \quad b \in \mathbb{R}.

The batch prediction is

\hat{y} = Xw + b,

where $\hat{y} \in \mathbb{R}^{B}$.

In PyTorch:

import torch

B = 32  # batch size
d = 5   # number of input features

X = torch.randn(B, d)  # one row per example
w = torch.randn(d)     # weight vector
b = torch.randn(())    # scalar bias

y_hat = X @ w + b
print(y_hat.shape)  # torch.Size([32])

The expression X @ w performs matrix-vector multiplication. The scalar bias b is broadcast across the batch.

Data and Targets

In supervised learning, we are given training examples

(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N).

Each $x_i \in \mathbb{R}^d$ is an input vector, and each $y_i \in \mathbb{R}$ is a real-valued target.

For example, a model may predict house price from features such as area, number of rooms, distance to city center, and age of the building. The input vector contains the measured features. The target is the observed price.

The training data can be stored as

X \in \mathbb{R}^{N \times d}, \quad y \in \mathbb{R}^{N}.

The model produces predictions

\hat{y} \in \mathbb{R}^{N}.

The goal of training is to choose $w$ and $b$ so that $\hat{y}$ is close to $y$.
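
In PyTorch, such a dataset is simply a pair of tensors whose first dimensions match. A minimal sketch with random placeholder values:

import torch

N = 100  # number of examples
d = 4    # features per example (e.g. area, rooms, distance, age)

X = torch.randn(N, d)  # inputs, one row per example
y = torch.randn(N)     # real-valued targets

print(X.shape, y.shape)  # torch.Size([100, 4]) torch.Size([100])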

Mean Squared Error

The standard loss for linear regression is mean squared error.

L(w, b) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2.

Since

\hat{y}_i = w^\top x_i + b,

we can write

L(w, b) = \frac{1}{N} \sum_{i=1}^{N} (w^\top x_i + b - y_i)^2.

The quantity $\hat{y}_i - y_i$ is the prediction error for example $i$. Squaring the error penalizes large mistakes more strongly than small ones and makes positive and negative errors contribute equally.

In PyTorch:

y = torch.randn(B)

loss = ((y_hat - y) ** 2).mean()
print(loss.shape)  # torch.Size([])

The loss is a scalar tensor. Training reduces this scalar by changing the parameters.

Parameters and Gradients

The parameters of the model are $w$ and $b$. These are the quantities learned from the data.

To train the model, we compute the gradient of the loss with respect to each parameter:

\nabla_w L \quad \text{and} \quad \frac{\partial L}{\partial b}.

For a single example, the squared error loss is

\ell(w, b) = (\hat{y} - y)^2.

Let

e = \hat{y} - y = w^\top x + b - y.

Then

\ell(w, b) = e^2.

Using the chain rule,

\frac{\partial \ell}{\partial w_j} = 2e \frac{\partial e}{\partial w_j} = 2e\, x_j.

Therefore,

\nabla_w \ell = 2e\, x.

The derivative with respect to the bias is

\frac{\partial \ell}{\partial b} = 2e.

For a batch of $N$ examples, the mean squared error gradients are

\nabla_w L = \frac{2}{N} X^\top (\hat{y} - y),

and

\frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i).

These formulas show the central role of the prediction error. If predictions match targets, then $\hat{y} - y = 0$ and both gradients vanish, so the parameters stop changing. If predictions differ from targets, the gradients are nonzero, and moving against them reduces the loss.
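
These batch formulas can be checked against automatic differentiation. The following sketch, using random data, compares the analytic gradients with the ones PyTorch computes in backward():

import torch

N, d = 100, 3
X = torch.randn(N, d)
y = torch.randn(N)

w = torch.randn(d, requires_grad=True)
b = torch.randn((), requires_grad=True)

y_hat = X @ w + b
loss = ((y_hat - y) ** 2).mean()
loss.backward()

# Analytic gradients from the formulas above
e = (y_hat - y).detach()
grad_w = 2.0 / N * (X.T @ e)
grad_b = 2.0 / N * e.sum()

print(torch.allclose(w.grad, grad_w, atol=1e-6))  # True
print(torch.allclose(b.grad, grad_b, atol=1e-6))  # True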

Gradient Descent

Gradient descent updates the parameters by moving in the negative gradient direction. For a learning rate $\eta > 0$,

w \leftarrow w - \eta\, \nabla_w L, \qquad b \leftarrow b - \eta\, \frac{\partial L}{\partial b}.

The learning rate controls the size of each update. A small learning rate may train slowly. A large learning rate may make training unstable.
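
As a tiny illustration of instability, consider gradient descent on the one-dimensional loss $L(w) = w^2$, whose gradient is $2w$. With too large a learning rate, the iterates overshoot the minimum and grow in magnitude:

w = 1.0
lr = 1.1  # deliberately too large

for step in range(5):
    w = w - lr * (2 * w)  # gradient of w^2 is 2w
    print(w)
# magnitudes grow: about 1.2, 1.44, 1.73, 2.07, 2.49 -- the iterates diverge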

A simple manual implementation is:

import torch

N = 100
d = 3

# Synthetic data generated from known parameters plus noise
X = torch.randn(N, d)
true_w = torch.tensor([2.0, -3.0, 0.5])
true_b = torch.tensor(1.0)

noise = 0.1 * torch.randn(N)
y = X @ true_w + true_b + noise

# Learnable parameters, tracked by autograd
w = torch.randn(d, requires_grad=True)
b = torch.zeros((), requires_grad=True)

lr = 0.05

for step in range(200):
    y_hat = X @ w + b                  # forward pass
    loss = ((y_hat - y) ** 2).mean()   # mean squared error

    loss.backward()                    # compute gradients

    with torch.no_grad():
        w -= lr * w.grad               # gradient descent update
        b -= lr * b.grad

        w.grad.zero_()                 # clear gradients for the next step
        b.grad.zero_()

print(w)
print(b)

The block with torch.no_grad() prevents PyTorch from tracking the parameter update itself as part of the computation graph. The calls to zero_() clear accumulated gradients before the next step.
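
Gradient accumulation is easy to observe directly. Calling backward() twice without clearing the gradient adds the two results together:

import torch

w = torch.ones((), requires_grad=True)

(2 * w).backward()
print(w.grad)  # tensor(2.)

(2 * w).backward()
print(w.grad)  # tensor(4.) -- accumulated, not replaced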

Linear Regression with nn.Linear

PyTorch provides linear layers through torch.nn.Linear.

A layer

torch.nn.Linear(in_features, out_features)

computes

Y = X A^\top + b,

where $A$ is the layer's weight matrix. With in_features = d and out_features = 1, the layer implements scalar-output linear regression.

import torch
from torch import nn

model = nn.Linear(in_features=3, out_features=1)

X = torch.randn(100, 3)
y_hat = model(X)

print(y_hat.shape)  # torch.Size([100, 1])

The output shape is [100, 1], while a target vector may have shape [100]. It is often convenient to make them match:

y_hat = model(X).squeeze(-1)
print(y_hat.shape)  # torch.Size([100])

The layer stores its parameters as:

print(model.weight.shape)  # torch.Size([1, 3])
print(model.bias.shape)    # torch.Size([1])

PyTorch uses the shape [out_features, in_features] for the weight matrix.
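
To see the convention in action, the layer's output can be reproduced with the explicit formula. A small self-contained check:

import torch
from torch import nn

layer = nn.Linear(in_features=3, out_features=1)
X = torch.randn(4, 3)

# nn.Linear computes X @ A^T + b, where A = layer.weight
manual = X @ layer.weight.T + layer.bias
print(torch.allclose(layer(X), manual))  # True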

Training with an Optimizer

Instead of manually updating parameters, we usually use an optimizer.

import torch
from torch import nn

N = 100
d = 3

X = torch.randn(N, d)
true_w = torch.tensor([2.0, -3.0, 0.5])
true_b = 1.0
y = X @ true_w + true_b + 0.1 * torch.randn(N)

model = nn.Linear(d, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for step in range(200):
    y_hat = model(X).squeeze(-1)
    loss = loss_fn(y_hat, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(model.weight)
print(model.bias)

The training step has four parts:

Step               PyTorch code                Meaning
Forward pass       y_hat = model(X)            Compute predictions
Loss computation   loss = loss_fn(y_hat, y)    Measure prediction error
Backward pass      loss.backward()             Compute gradients
Parameter update   optimizer.step()            Change parameters

This pattern appears throughout deep learning. More complex networks use the same structure.

Multiple Outputs

Linear regression can predict more than one target. Suppose each input has $d$ features and each output has $k$ values. Then

X \in \mathbb{R}^{B \times d}, \quad W \in \mathbb{R}^{d \times k}, \quad b \in \mathbb{R}^{k}.

The model computes

\hat{Y} = XW + b,

where

\hat{Y} \in \mathbb{R}^{B \times k}.

In PyTorch:

model = nn.Linear(in_features=5, out_features=3)

X = torch.randn(32, 5)
Y_hat = model(X)

print(Y_hat.shape)  # torch.Size([32, 3])

This model predicts a 3-dimensional output for each input example. Multiple-output regression is useful in forecasting, control, computer vision, and scientific modeling.

Closed-Form Solution and Learning-Based Solution

For ordinary least squares, linear regression has a closed-form solution under suitable conditions. If the bias is included as an additional constant feature, the solution is

w^\star = (X^\top X)^{-1} X^\top y.

This solution minimizes the mean squared error when $X^\top X$ is invertible.
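
The closed-form solution is straightforward to compute in PyTorch. This sketch appends a constant column of ones so the bias is recovered as the last coefficient:

import torch

N, d = 100, 3
X = torch.randn(N, d)
true_w = torch.tensor([2.0, -3.0, 0.5])
y = X @ true_w + 1.0 + 0.1 * torch.randn(N)

# Append a constant feature so the bias becomes the last weight
X_aug = torch.cat([X, torch.ones(N, 1)], dim=1)

# Solve the normal equations (X^T X) w = X^T y
w_star = torch.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(w_star)  # approximately [2.0, -3.0, 0.5, 1.0]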

Deep learning usually uses gradient-based optimization instead of closed-form matrix inversion. There are several reasons.

First, the same training loop extends to nonlinear networks where no closed-form solution exists. Second, stochastic gradient methods scale better to very large datasets. Third, minibatch training allows models to learn from streaming data and large files that may not fit in memory.
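
A minibatch variant of the earlier training loop differs only in which rows participate in each step. A sketch, sampling indices with replacement for simplicity:

import torch
from torch import nn

N, d = 1000, 3
X = torch.randn(N, d)
y = X @ torch.tensor([2.0, -3.0, 0.5]) + 1.0 + 0.1 * torch.randn(N)

model = nn.Linear(d, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

batch_size = 32
for step in range(500):
    idx = torch.randint(0, N, (batch_size,))  # random minibatch indices
    y_hat = model(X[idx]).squeeze(-1)
    loss = loss_fn(y_hat, y[idx])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()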

Linear regression therefore serves as a useful bridge. It is simple enough to understand exactly, but it has the same training structure used by deep neural networks.

Feature Scaling

Linear regression is sensitive to feature scale. If one feature is measured in thousands and another in fractions, their gradients may have very different magnitudes. This can slow optimization.

A common preprocessing step is standardization:

x'_j = \frac{x_j - \mu_j}{\sigma_j},

where $\mu_j$ is the mean of feature $j$ and $\sigma_j$ is its standard deviation.

In PyTorch:

mean = X.mean(dim=0)  # per-feature mean, shape [d]
std = X.std(dim=0)    # per-feature standard deviation, shape [d]

X_scaled = (X - mean) / (std + 1e-8)

The small constant avoids division by zero.

Feature scaling usually improves gradient-based training, especially when using plain SGD.

Linear Regression as a Neural Network

Linear regression can be viewed as a neural network with one layer and no nonlinear activation:

x \longmapsto w^\top x + b.

A multilayer neural network extends this idea by composing many linear transformations with nonlinear activation functions. For example,

x \longmapsto W_2\, \sigma(W_1 x + b_1) + b_2.

The first transformation produces hidden features. The activation function introduces nonlinearity. The second transformation maps the hidden features to an output.
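
In PyTorch, such a two-layer network can be written with nn.Sequential, here using ReLU as one common choice for $\sigma$:

import torch
from torch import nn

mlp = nn.Sequential(
    nn.Linear(4, 8),  # W_1, b_1: produce hidden features
    nn.ReLU(),        # the nonlinearity sigma
    nn.Linear(8, 1),  # W_2, b_2: map hidden features to the output
)

x = torch.randn(2, 4)
print(mlp(x).shape)  # torch.Size([2, 1])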

Without nonlinear activations, stacking linear layers gives another linear layer. This is why nonlinear activation functions are essential in deep networks.
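
This collapse is easy to verify numerically. Two bias-free linear layers composed together match a single linear map whose weight is $W_2 W_1$:

import torch
from torch import nn

f1 = nn.Linear(4, 8, bias=False)
f2 = nn.Linear(8, 2, bias=False)

# f2(f1(x)) = x @ W1^T @ W2^T = x @ (W2 @ W1)^T
W_combined = f2.weight @ f1.weight

x = torch.randn(5, 4)
print(torch.allclose(f2(f1(x)), x @ W_combined.T, atol=1e-6))  # True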

Summary

Linear regression predicts a real-valued output from a weighted sum of input features. Its parameters are a weight vector and a bias. The usual loss is mean squared error. Training adjusts the parameters by following gradients of the loss.

In PyTorch, linear regression can be implemented manually with tensors and automatic differentiation, or directly with nn.Linear, nn.MSELoss, and an optimizer such as torch.optim.SGD.

The model is simple, but it contains the core training pattern used throughout deep learning: compute predictions, compute loss, backpropagate gradients, and update parameters.