Chapter 109. Linear Regression

Linear regression studies how one set of variables depends approximately on another through a linear relationship.

Given observations of inputs and outputs, the goal is to construct a linear model that predicts the outputs from the inputs as accurately as possible.

Linear regression is one of the most important applications of linear algebra. It connects vectors, matrices, projections, least squares problems, optimization, statistics, and numerical computation into a single framework.

The central problem is simple:

Given measured data that may contain noise or inconsistency, find the linear relationship that best explains the observations.

109.1 Data and Models

Suppose we observe pairs of values

(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m).

The variable x_i is called the input, predictor, or feature. The variable y_i is called the output, response, or target.

A linear regression model assumes that the output is approximately linear in the input:

y \approx \beta_0 + \beta_1 x.

Here:

Symbol | Meaning
\beta_0 | Intercept
\beta_1 | Slope

The unknown coefficients \beta_0 and \beta_1 must be estimated from the data.

For example, suppose the data are:

x | y
1 | 2
2 | 3
3 | 5
4 | 4

The points do not lie exactly on a single line. Linear regression finds the line that best approximates them.

109.2 The Geometric Problem

Each observed point contributes an equation:

\beta_0 + \beta_1 x_i \approx y_i.

Collecting all equations gives

\begin{aligned} \beta_0 + \beta_1 x_1 &\approx y_1, \\ \beta_0 + \beta_1 x_2 &\approx y_2, \\ &\vdots \\ \beta_0 + \beta_1 x_m &\approx y_m. \end{aligned}

In matrix form:

A\beta \approx y,

where

A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}.

The matrix A is called the design matrix.

The vector A\beta contains the predicted values.

The problem usually has more equations than unknowns. Therefore an exact solution often does not exist.

Instead, we search for the vector \beta that makes the error as small as possible.
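
As a concrete illustration, here is a minimal NumPy sketch that assembles the design matrix for the four data points from Section 109.1 (the variable names are our own):

```python
import numpy as np

# The four observations from Section 109.1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])

# Design matrix: a column of ones for the intercept, then the inputs.
A = np.column_stack([np.ones_like(x), x])
print(A.shape)  # (4, 2): four equations, only two unknowns
```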

109.3 Residuals

The residual vector is

r = y - A\beta.

Each component measures prediction error:

r_i = y_i - (\beta_0 + \beta_1 x_i).

Large residuals correspond to poor predictions.

The goal is to minimize the total error.

A natural measure is the squared Euclidean norm:

\|r\|^2 = r^Tr.

Substituting r = y - A\beta gives

\|y - A\beta\|^2.

This quantity is called the residual sum of squares.

Linear regression minimizes this expression.
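
A short NumPy sketch makes the definitions concrete: it evaluates the residual vector and the residual sum of squares for a trial coefficient vector (the trial value below is arbitrary, not the optimum):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

beta = np.array([1.0, 1.0])  # an arbitrary trial value, not the optimum
r = y - A @ beta             # residual vector
rss = r @ r                  # residual sum of squares, equals ||r||^2
print(r, rss)                # [ 0.  0.  1. -1.] 2.0
```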

109.4 Least Squares

The least squares problem is

\min_{\beta} \|y - A\beta\|^2.

This is the central optimization problem of linear regression.

The solution is the point in the column space of A closest to y.

Geometrically, linear regression is an orthogonal projection problem.

The vector A\hat{\beta} is the orthogonal projection of y onto the column space of A.

The residual vector is orthogonal to every column of A:

A^T(y - A\hat{\beta}) = 0.

This condition produces the normal equations.

109.5 Normal Equations

Starting from

A^T(y - A\hat{\beta}) = 0,

we obtain

A^TA\hat{\beta} = A^Ty.

These are called the normal equations.

If A^TA is invertible, then

\hat{\beta} = (A^TA)^{-1}A^Ty.

This formula gives the least squares solution.

The matrix

(A^TA)^{-1}A^T

is called the Moore-Penrose pseudoinverse when A has full column rank.
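
A minimal NumPy sketch of both routes, solving the normal equations directly and applying the pseudoinverse via np.linalg.pinv; the two agree when A has full column rank:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

# Solve the normal equations A^T A beta = A^T y.
beta_hat = np.linalg.solve(A.T @ A, A.T @ y)

# Equivalently, apply the Moore-Penrose pseudoinverse.
beta_pinv = np.linalg.pinv(A) @ y
print(beta_hat, beta_pinv)  # both give [1.5 0.8] for this data
```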

109.6 Example of Simple Linear Regression

Consider the data:

x | y
1 | 1
2 | 2
3 | 2
4 | 4

The design matrix is

A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}.

The target vector is

y = \begin{bmatrix} 1 \\ 2 \\ 2 \\ 4 \end{bmatrix}.

Compute

A^TA = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix},

and

A^Ty = \begin{bmatrix} 9 \\ 27 \end{bmatrix}.

The normal equations become

\begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix} \hat{\beta} = \begin{bmatrix} 9 \\ 27 \end{bmatrix}.

Solving gives

\hat{\beta} = \begin{bmatrix} 0 \\ 0.9 \end{bmatrix}.

Thus the regression line is

y = 0.9x.
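
The computation can be checked numerically; a short NumPy sketch reproducing the steps above:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

print(A.T @ A)  # [[ 4. 10.]
                #  [10. 30.]]
print(A.T @ y)  # [ 9. 27.]
print(np.linalg.solve(A.T @ A, A.T @ y))  # [0.  0.9]
```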

109.7 Multiple Linear Regression

Linear regression extends naturally to multiple variables.

Suppose each observation has n features:

x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{in} \end{bmatrix}.

The model becomes

y \approx \beta_0 + \beta_1x_1 + \cdots + \beta_nx_n.

In matrix form:

A\beta \approx y.

Now

A = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1n} \\ 1 & x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix}.

Each row corresponds to one observation. Each column corresponds to one feature.

The least squares problem remains

\min_{\beta}\|y - A\beta\|^2.

The solution again satisfies

A^TA\hat{\beta} = A^Ty.
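
A minimal NumPy sketch with hypothetical two-feature data; np.linalg.lstsq solves the least squares problem without forming A^TA:

```python
import numpy as np

# Hypothetical data: five observations, two features each.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.0, 2.5, 5.0, 8.0, 8.5])

# Design matrix: intercept column followed by the feature columns.
A = np.column_stack([np.ones(len(X)), X])

# np.linalg.lstsq minimizes ||y - A beta||^2 directly.
beta_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(beta_hat)  # [intercept, coefficient per feature]
```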

109.8 Projection Interpretation

Linear regression is fundamentally a projection problem.

The column space of A contains all vectors representable by the model.

The observed vector y may not lie in this subspace.

The least squares solution finds the closest vector inside the subspace.

If

\hat{y} = A\hat{\beta},

then

y = \hat{y} + r,

where:

Vector | Meaning
\hat{y} | Projection onto the column space
r | Orthogonal residual

The orthogonality condition is

A^Tr = 0.

This geometric interpretation explains why least squares works.

109.9 Orthogonal Projection Matrix

The projection matrix onto the column space of A is

P = A(A^TA)^{-1}A^T.

The predicted vector satisfies

\hat{y} = Py.

The matrix P has several important properties:

Property | Meaning
P^2 = P | Idempotent
P^T = P | Symmetric
Py \in \operatorname{Col}(A) | Projection property

The residual operator is

I - P.

Thus

r = (I - P)y.
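
A short NumPy sketch that builds P for a small design matrix and verifies the three properties in the table, together with the orthogonality of the residual:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

# Projection matrix onto the column space of A.
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ P, P))    # idempotent: True
print(np.allclose(P.T, P))      # symmetric:  True

y_hat = P @ y                   # projection of y onto Col(A)
r = (np.eye(len(y)) - P) @ y    # residual via the operator I - P
print(np.allclose(A.T @ r, 0))  # residual orthogonal to every column: True
```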

109.10 Statistical Interpretation

In statistics, linear regression models the response as

y = A\beta + \varepsilon,

where \varepsilon is a random error vector.

Common assumptions are:

Assumption | Meaning
Mean zero | E[\varepsilon] = 0
Constant variance | Equal noise variance across observations
Independence | Errors are independent
Normality | Errors are Gaussian

Under these assumptions, least squares estimators have strong statistical properties.

The estimator \hat{\beta} is unbiased:

E[\hat{\beta}] = \beta.

It is also the maximum likelihood estimator under Gaussian noise.
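
A minimal simulation sketch of the unbiasedness claim under the assumptions above (the true coefficients and noise level are our own choices): averaging \hat{\beta} over many noisy datasets should recover \beta.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
A = np.column_stack([np.ones_like(x), x])
beta_true = np.array([1.0, 2.0])  # chosen for the simulation

# Refit on many independently generated noisy datasets; if the estimator
# is unbiased, the average of beta_hat approaches beta_true.
estimates = [
    np.linalg.lstsq(A, A @ beta_true + rng.normal(scale=0.5, size=len(x)),
                    rcond=None)[0]
    for _ in range(2000)
]
print(np.mean(estimates, axis=0))  # close to [1. 2.]
```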

109.11 Rank and Identifiability

The regression problem depends critically on the rank of A.

If the columns of A are linearly independent, then

\operatorname{rank}(A) = n,

and A^TA is invertible.

If the columns are dependent, then multiple parameter vectors produce the same predictions.

This phenomenon is called multicollinearity.

For example, if one feature is an exact multiple of another, then the regression coefficients are not uniquely determined.

Rank deficiency causes instability and ill-conditioning.
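
A small NumPy sketch of this failure mode: duplicating a column (up to a scalar) drops the rank and makes A^TA numerically singular.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# The third column is an exact multiple of the second: rank-deficient design.
A = np.column_stack([np.ones_like(x), x, 2.0 * x])

print(np.linalg.matrix_rank(A))  # 2, not 3: the columns are dependent
print(np.linalg.cond(A.T @ A))   # enormous: A^T A is numerically singular
```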

109.12 Numerical Computation

The normal equations are conceptually simple but numerically unstable in some problems.

Modern numerical linear algebra usually solves regression problems using QR decomposition or singular value decomposition.

QR Method

If

A = QR,

where:

Matrix | Property
Q | Orthonormal columns
R | Upper triangular

then

R\hat{\beta} = Q^Ty.

This avoids forming A^TA explicitly, which squares the condition number of the problem and can amplify numerical error.
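
A minimal NumPy sketch of the QR route for the data of Section 109.6:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

# Thin QR factorization: Q has orthonormal columns, R is upper triangular.
Q, R = np.linalg.qr(A)

# Solve R beta = Q^T y; a dedicated triangular solver (for example
# scipy.linalg.solve_triangular) would exploit the structure of R.
beta_hat = np.linalg.solve(R, Q.T @ y)
print(beta_hat)  # [0.  0.9]
```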

SVD Method

If

A = U\Sigma V^T,

then least squares solutions can be computed robustly even when A is nearly singular.

The SVD reveals rank, conditioning, and geometric structure simultaneously.
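
A minimal NumPy sketch of the SVD route, again for the data of Section 109.6:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

# Thin SVD: A = U Sigma V^T, singular values in the array s.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Least squares via the pseudoinverse; tiny singular values could be
# truncated here to stabilize a nearly singular problem.
beta_hat = Vt.T @ ((U.T @ y) / s)
print(beta_hat)  # [0.  0.9]
```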

109.13 Regularization

Large regression problems may overfit the data.

Regularization introduces additional constraints.

Ridge Regression

Ridge regression minimizes

\|y - A\beta\|^2 + \lambda \|\beta\|^2.

The parameter \lambda > 0 penalizes large coefficients.

The solution becomes

(A^TA + \lambda I)\hat{\beta} = A^Ty.
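
A minimal NumPy sketch of the ridge solution; the value of \lambda is an arbitrary choice:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

lam = 0.1  # regularization strength, an arbitrary choice here

# Ridge solution of (A^T A + lambda I) beta = A^T y. Note that this
# sketch penalizes the intercept as well, which many practical
# implementations deliberately avoid.
beta_ridge = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
print(beta_ridge)  # coefficients shrunk toward zero relative to least squares
```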

Lasso Regression

Lasso regression minimizes

\|y - A\beta\|^2 + \lambda \|\beta\|_1.

The \ell_1 penalty encourages sparse coefficients.
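
Unlike ridge, the lasso has no closed-form solution, so it is solved iteratively. A hedged sketch using scikit-learn's Lasso (an external library; its objective scales the squared error by 1/(2m), so its alpha is not directly comparable to the \lambda above):

```python
import numpy as np
from sklearn.linear_model import Lasso

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# Coordinate-descent lasso fit; intercept is handled separately.
model = Lasso(alpha=0.1).fit(X, y)
print(model.intercept_, model.coef_)
```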

Regularization connects linear algebra with optimization and machine learning.

109.14 Polynomial Regression

Regression can model nonlinear relationships while remaining linear algebraically.

Suppose we fit

y \approx \beta_0 + \beta_1x + \beta_2x^2.

The model is nonlinear in x but linear in the coefficients.

The design matrix becomes

A = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_m & x_m^2 \end{bmatrix}.

The least squares framework remains unchanged.

This principle extends to arbitrary basis expansions.
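
A minimal NumPy sketch of a degree-2 fit on hypothetical data; only the design matrix changes:

```python
import numpy as np

# Hypothetical data with visible curvature.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 0.5, 2.0, 4.5, 9.0])

# Degree-2 design matrix with columns 1, x, x^2.
# (np.vander(x, 3, increasing=True) builds the same matrix.)
A = np.column_stack([np.ones_like(x), x, x**2])

beta_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(beta_hat)  # coefficients of 1, x, and x^2
```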

109.15 Applications

Linear regression appears throughout science and engineering.

Field | Example
Statistics | Trend estimation
Economics | Forecasting
Physics | Experimental fitting
Biology | Growth models
Machine learning | Predictive models
Signal processing | Parameter estimation
Computer vision | Camera calibration
Finance | Risk modeling
Engineering | System identification

Many advanced machine learning methods are extensions of linear regression.

109.16 Geometric Summary

Linear regression unifies several major ideas in linear algebra.

Concept | Role
Vectors | Observations and predictions
Matrices | Design operators
Column space | Model subspace
Orthogonality | Residual condition
Projection | Best approximation
Least squares | Optimization principle
Rank | Identifiability
Decompositions | Numerical algorithms

The subject demonstrates how abstract linear algebra directly solves practical approximation problems.

109.17 Summary

Linear regression seeks the linear model that best approximates observed data.

Given a system

A\beta \approx y,

the least squares solution minimizes

\|y - A\beta\|^2.

The solution satisfies the normal equations

A^TA\hat{\beta} = A^Ty.

Geometrically, regression projects the observed vector onto the column space of the design matrix.

Computationally, regression relies on matrix factorizations such as QR decomposition and singular value decomposition.

Statistically, regression estimates relationships between variables under uncertainty.

Linear regression is therefore both a practical computational method and a central application of the geometry and algebra of vector spaces.