Chapter 109. Linear Regression

Linear regression studies how one set of variables depends approximately on another through a linear relationship.

Given observations of inputs and outputs, the goal is to construct a linear model that predicts the outputs from the inputs as accurately as possible.

Linear regression is one of the most important applications of linear algebra. It connects vectors, matrices, projections, least squares problems, optimization, statistics, and numerical computation into a single framework.

The central problem is simple:

Given measured data that may contain noise or inconsistency, find the linear relationship that best explains the observations.

109.1 Data and Models

Suppose we observe pairs of values

(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m).

The variable x_i is called the input, predictor, or feature. The variable y_i is called the output, response, or target.

A linear regression model assumes that the output is approximately linear in the input:

y \approx \beta_0 + \beta_1 x.

Here:

Symbol | Meaning
\beta_0 | Intercept
\beta_1 | Slope

The unknown coefficients \beta_0 and \beta_1 must be estimated from the data.

For example, suppose the data are:

x | y
1 | 2
2 | 3
3 | 5
4 | 4

The points do not lie exactly on a single line. Linear regression finds the line that best approximates them.

109.2 The Geometric Problem

Each observed point contributes an equation:

\beta_0 + \beta_1 x_i \approx y_i.

Collecting all equations gives

\begin{aligned} \beta_0 + \beta_1 x_1 &\approx y_1, \\ \beta_0 + \beta_1 x_2 &\approx y_2, \\ &\vdots \\ \beta_0 + \beta_1 x_m &\approx y_m. \end{aligned}

In matrix form:

A\beta \approx y,

where

A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}.

The matrix A is called the design matrix.

The vector A\beta contains the predicted values.

The problem usually has more equations than unknowns. Therefore an exact solution often does not exist.

Instead, we search for the vector \beta that makes the error as small as possible.
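
As a concrete illustration, here is a minimal NumPy sketch that assembles the design matrix for the four data points from Section 109.1 (the variable names are our own):

```python
import numpy as np

# The four observations from Section 109.1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])

# Design matrix: a column of ones for the intercept, then the inputs.
A = np.column_stack([np.ones_like(x), x])
print(A.shape)  # (4, 2): four equations, only two unknowns
```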

109.3 Residuals

The residual vector is

r = y - A\beta.

Each component measures prediction error:

r_i = y_i - (\beta_0 + \beta_1 x_i).

Large residuals correspond to poor predictions.

The goal is to minimize the total error.

A natural measure is the squared Euclidean norm:

\|r\|^2 = r^Tr.

Substituting r = y - A\beta gives

\|y - A\beta\|^2.

This quantity is called the residual sum of squares.

Linear regression minimizes this expression.
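
A short NumPy sketch makes the definitions concrete: it evaluates the residual vector and the residual sum of squares for a trial coefficient vector (the trial value below is arbitrary, not the optimum):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

beta = np.array([1.0, 1.0])  # an arbitrary trial value, not the optimum
r = y - A @ beta             # residual vector
rss = r @ r                  # residual sum of squares, equals ||r||^2
print(r, rss)                # [ 0.  0.  1. -1.] 2.0
```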

109.4 Least Squares

The least squares problem is

\min_{\beta} \|y - A\beta\|^2.

This is the central optimization problem of linear regression.

The solution is the point in the column space of A closest to y.

Geometrically, linear regression is an orthogonal projection problem.

The vector A\hat{\beta} is the orthogonal projection of y onto the column space of A.

The residual vector is orthogonal to every column of A:

A^T(y - A\hat{\beta}) = 0.

This condition produces the normal equations.

109.5 Normal Equations

Starting from

A^T(y - A\hat{\beta}) = 0,

we obtain

A^TA\hat{\beta} = A^Ty.

These are called the normal equations.

If A^TA is invertible, then

\hat{\beta} = (A^TA)^{-1}A^Ty.

This formula gives the least squares solution.

The matrix

(A^TA)^{-1}A^T

is called the Moore-Penrose pseudoinverse when A has full column rank.
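
A minimal NumPy sketch of both routes, solving the normal equations directly and applying the pseudoinverse via np.linalg.pinv; the two agree when A has full column rank:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

# Solve the normal equations A^T A beta = A^T y.
beta_hat = np.linalg.solve(A.T @ A, A.T @ y)

# Equivalently, apply the Moore-Penrose pseudoinverse.
beta_pinv = np.linalg.pinv(A) @ y
print(beta_hat, beta_pinv)  # both give [1.5 0.8] for this data
```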

109.6 Example of Simple Linear Regression

Consider the data:

x | y
1 | 1
2 | 2
3 | 2
4 | 4

The design matrix is

A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}.

The target vector is

y = \begin{bmatrix} 1 \\ 2 \\ 2 \\ 4 \end{bmatrix}.

Compute

A^TA = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix},

and

A^Ty = \begin{bmatrix} 9 \\ 27 \end{bmatrix}.

The normal equations become

\begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix} \hat{\beta} = \begin{bmatrix} 9 \\ 27 \end{bmatrix}.

Solving gives

\hat{\beta} = \begin{bmatrix} 0 \\ 0.9 \end{bmatrix}.

Thus the regression line is

y = 0.9x.
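
The computation can be checked numerically; a short NumPy sketch reproducing the steps above:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

print(A.T @ A)  # [[ 4. 10.]
                #  [10. 30.]]
print(A.T @ y)  # [ 9. 27.]
print(np.linalg.solve(A.T @ A, A.T @ y))  # [0.  0.9]
```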

109.7 Multiple Linear Regression

Linear regression extends naturally to multiple variables.

Suppose each observation has n features:

x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{in} \end{bmatrix}.

The model becomes

y \approx \beta_0 + \beta_1x_1 + \cdots + \beta_nx_n.

In matrix form:

A\beta \approx y.

Now

A = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1n} \\ 1 & x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix}.

Each row corresponds to one observation. Each column corresponds to one feature.

The least squares problem remains

\min_{\beta}\|y - A\beta\|^2.

The solution again satisfies

A^TA\hat{\beta} = A^Ty.
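
A minimal NumPy sketch with hypothetical two-feature data; np.linalg.lstsq solves the least squares problem without forming A^TA:

```python
import numpy as np

# Hypothetical data: five observations, two features each.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.0, 2.5, 5.0, 8.0, 8.5])

# Design matrix: intercept column followed by the feature columns.
A = np.column_stack([np.ones(len(X)), X])

# np.linalg.lstsq minimizes ||y - A beta||^2 directly.
beta_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(beta_hat)  # [intercept, coefficient per feature]
```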

109.8 Projection Interpretation

Linear regression is fundamentally a projection problem.

The column space of A contains all vectors representable by the model.

The observed vector y may not lie in this subspace.

The least squares solution finds the closest vector inside the subspace.

If

\hat{y} = A\hat{\beta},

then

y = \hat{y} + r,

where:

Vector | Meaning
\hat{y} | Projection onto the column space
r | Orthogonal residual

The orthogonality condition is

A^Tr = 0.

This geometric interpretation explains why least squares works.

109.9 Orthogonal Projection Matrix

The projection matrix onto the column space of A is

P = A(A^TA)^{-1}A^T.

The predicted vector satisfies

\hat{y} = Py.

The matrix P has several important properties:

Property | Meaning
P^2 = P | Idempotent
P^T = P | Symmetric
Py \in \operatorname{Col}(A) | Projection property

The residual operator is

I - P.

Thus

r = (I - P)y.
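
A short NumPy sketch that builds P for a small design matrix and verifies the three properties in the table, together with the orthogonality of the residual:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

# Projection matrix onto the column space of A.
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ P, P))    # idempotent: True
print(np.allclose(P.T, P))      # symmetric:  True

y_hat = P @ y                   # projection of y onto Col(A)
r = (np.eye(len(y)) - P) @ y    # residual via the operator I - P
print(np.allclose(A.T @ r, 0))  # residual orthogonal to every column: True
```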

109.10 Statistical Interpretation

In statistics, linear regression models the response as

y = A\beta + \varepsilon,

where \varepsilon is a random error vector.

Common assumptions are:

Assumption | Meaning
Mean zero | E[\varepsilon] = 0
Constant variance | Equal noise variance across observations
Independence | Errors are independent
Normality | Errors are Gaussian

Under these assumptions, least squares estimators have strong statistical properties.

The estimator \hat{\beta} is unbiased:

E[\hat{\beta}] = \beta.

It is also the maximum likelihood estimator under Gaussian noise.
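
A minimal simulation sketch of the unbiasedness claim under the assumptions above (the true coefficients and noise level are our own choices): averaging \hat{\beta} over many noisy datasets should recover \beta.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
A = np.column_stack([np.ones_like(x), x])
beta_true = np.array([1.0, 2.0])  # chosen for the simulation

# Refit on many independently generated noisy datasets; if the estimator
# is unbiased, the average of beta_hat approaches beta_true.
estimates = [
    np.linalg.lstsq(A, A @ beta_true + rng.normal(scale=0.5, size=len(x)),
                    rcond=None)[0]
    for _ in range(2000)
]
print(np.mean(estimates, axis=0))  # close to [1. 2.]
```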

109.11 Rank and Identifiability

The regression problem depends critically on the rank of A.

If the columns of A are linearly independent, then

\operatorname{rank}(A) = n,

and A^TA is invertible.

If the columns are dependent, then multiple parameter vectors produce the same predictions.

This phenomenon is called multicollinearity.

For example, if one feature is an exact multiple of another, then the regression coefficients are not uniquely determined.

Rank deficiency causes instability and ill-conditioning.
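
A small NumPy sketch of this failure mode: duplicating a column (up to a scalar) drops the rank and makes A^TA numerically singular.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# The third column is an exact multiple of the second: rank-deficient design.
A = np.column_stack([np.ones_like(x), x, 2.0 * x])

print(np.linalg.matrix_rank(A))  # 2, not 3: the columns are dependent
print(np.linalg.cond(A.T @ A))   # enormous: A^T A is numerically singular
```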

109.12 Numerical Computation

The normal equations are conceptually simple but numerically unstable in some problems.

Modern numerical linear algebra usually solves regression problems using QR decomposition or singular value decomposition.

QR Method

If

A = QR,

where:

Matrix | Property
Q | Orthonormal columns
R | Upper triangular

then

R\hat{\beta} = Q^Ty.

This avoids forming A^TA explicitly, which squares the condition number of the problem and can amplify numerical error.
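
A minimal NumPy sketch of the QR route for the data of Section 109.6:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

# Thin QR factorization: Q has orthonormal columns, R is upper triangular.
Q, R = np.linalg.qr(A)

# Solve R beta = Q^T y; a dedicated triangular solver (for example
# scipy.linalg.solve_triangular) would exploit the structure of R.
beta_hat = np.linalg.solve(R, Q.T @ y)
print(beta_hat)  # [0.  0.9]
```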

SVD Method

If

A = U\Sigma V^T,

then least squares solutions can be computed robustly even when A is nearly singular.

The SVD reveals rank, conditioning, and geometric structure simultaneously.
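
A minimal NumPy sketch of the SVD route, again for the data of Section 109.6:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

# Thin SVD: A = U Sigma V^T, singular values in the array s.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Least squares via the pseudoinverse; tiny singular values could be
# truncated here to stabilize a nearly singular problem.
beta_hat = Vt.T @ ((U.T @ y) / s)
print(beta_hat)  # [0.  0.9]
```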

109.13 Regularization

Large regression problems may overfit the data.

Regularization introduces additional constraints.

Ridge Regression

Ridge regression minimizes

\|y - A\beta\|^2 + \lambda \|\beta\|^2.

The parameter \lambda > 0 penalizes large coefficients.

The solution becomes

(A^TA + \lambda I)\hat{\beta} = A^Ty.
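
A minimal NumPy sketch of the ridge solution; the value of \lambda is an arbitrary choice:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
A = np.column_stack([np.ones_like(x), x])

lam = 0.1  # regularization strength, an arbitrary choice here

# Ridge solution of (A^T A + lambda I) beta = A^T y. Note that this
# sketch penalizes the intercept as well, which many practical
# implementations deliberately avoid.
beta_ridge = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
print(beta_ridge)  # coefficients shrunk toward zero relative to least squares
```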

Lasso Regression

Lasso regression minimizes

\|y - A\beta\|^2 + \lambda \|\beta\|_1.

The \ell_1 penalty encourages sparse coefficients.
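
Unlike ridge, the lasso has no closed-form solution, so it is solved iteratively. A hedged sketch using scikit-learn's Lasso (an external library; its objective scales the squared error by 1/(2m), so its alpha is not directly comparable to the \lambda above):

```python
import numpy as np
from sklearn.linear_model import Lasso

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 2.0, 4.0])
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# Coordinate-descent lasso fit; intercept is handled separately.
model = Lasso(alpha=0.1).fit(X, y)
print(model.intercept_, model.coef_)
```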

Regularization connects linear algebra with optimization and machine learning.

109.14 Polynomial Regression

Regression can model nonlinear relationships while remaining linear algebraically.

Suppose we fit

y \approx \beta_0 + \beta_1x + \beta_2x^2.

The model is nonlinear in x but linear in the coefficients.

The design matrix becomes

A = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_m & x_m^2 \end{bmatrix}.

The least squares framework remains unchanged.

This principle extends to arbitrary basis expansions.
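
A minimal NumPy sketch of a degree-2 fit on hypothetical data; only the design matrix changes:

```python
import numpy as np

# Hypothetical data with visible curvature.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 0.5, 2.0, 4.5, 9.0])

# Degree-2 design matrix with columns 1, x, x^2.
# (np.vander(x, 3, increasing=True) builds the same matrix.)
A = np.column_stack([np.ones_like(x), x, x**2])

beta_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(beta_hat)  # coefficients of 1, x, and x^2
```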

109.15 Applications

Linear regression appears throughout science and engineering.

Field | Example
Statistics | Trend estimation
Economics | Forecasting
Physics | Experimental fitting
Biology | Growth models
Machine learning | Predictive models
Signal processing | Parameter estimation
Computer vision | Camera calibration
Finance | Risk modeling
Engineering | System identification

Many advanced machine learning methods are extensions of linear regression.

109.16 Geometric Summary

Linear regression unifies several major ideas in linear algebra.

Concept | Role
Vectors | Observations and predictions
Matrices | Design operators
Column space | Model subspace
Orthogonality | Residual condition
Projection | Best approximation
Least squares | Optimization principle
Rank | Identifiability
Decompositions | Numerical algorithms

The subject demonstrates how abstract linear algebra directly solves practical approximation problems.

109.17 Summary

Linear regression seeks the linear model that best approximates observed data.

Given a system

A\beta \approx y,

the least squares solution minimizes

\|y - A\beta\|^2.

The solution satisfies the normal equations

A^TA\hat{\beta} = A^Ty.

Geometrically, regression projects the observed vector onto the column space of the design matrix.

Computationally, regression relies on matrix factorizations such as QR decomposition and singular value decomposition.

Statistically, regression estimates relationships between variables under uncertainty.

Linear regression is therefore both a practical computational method and a central application of the geometry and algebra of vector spaces.