
Chapter 52. Least Squares Problems

A least squares problem asks for the best approximate solution to a linear system that may have no exact solution. Instead of requiring

Ax = b,

we choose x so that Ax is as close as possible to b. The usual measure of closeness is the Euclidean norm of the residual:

\|Ax - b\|_2.

Equivalently, we minimize the squared residual norm:

\|Ax - b\|_2^2.

This problem appears whenever there are more equations than unknowns, noisy measurements, inconsistent constraints, or data that should be fitted by a simple model. Standard linear algebra texts describe least squares as solving Ax = b as closely as possible by minimizing the squared error \|Ax - b\|^2.

52.1 The Basic Problem

Let

A \in \mathbb{R}^{m\times n}, \qquad b \in \mathbb{R}^m.

The least squares problem is

\min_{x\in\mathbb{R}^n} \|Ax - b\|_2.

Since the square root is increasing, this is equivalent to

\min_{x\in\mathbb{R}^n} \|Ax - b\|_2^2.

The vector

r = b - Ax

is called the residual. A least squares solution is a vector \hat{x} for which

\|A\hat{x} - b\|_2 \le \|Ax - b\|_2

for every x \in \mathbb{R}^n.

The vector A\hat{x} is the closest vector to b among all vectors in the column space of A.
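As a first concrete illustration (our own small example, using NumPy's least squares routine), the problem can be posed and solved directly:

```python
import numpy as np

# A small overdetermined system: 4 equations, 2 unknowns (invented data).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

# lstsq minimizes ||Ax - b||_2 over all x.
x_hat, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)

print("least squares solution:", x_hat)
print("residual norm:", np.linalg.norm(A @ x_hat - b))
```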

52.2 Why Exact Solutions May Fail

The equation

Ax = b

has a solution exactly when

b \in \operatorname{Col}(A).

If b does not lie in the column space, no vector x satisfies the system exactly.

This is common when m > n, so that there are more equations than unknowns. Such systems are called overdetermined. An overdetermined system may be consistent, but in applications with measurement error it usually is not.

Least squares replaces the exact equation by the approximation problem

Ax \approx b.

The goal is to choose the vector in Col(A) nearest to b.

52.3 Geometric Interpretation

The set of all possible vectors Ax is the column space of A:

\operatorname{Col}(A) = \{Ax : x\in\mathbb{R}^n\}.

Thus the least squares problem is the problem of projecting b onto Col(A).

Let

p = \operatorname{proj}_{\operatorname{Col}(A)}(b).

Then

p = A\hat{x}

for at least one vector \hat{x}. The residual is

r = b - p = b - A\hat{x}.

The defining condition for orthogonal projection is

r \perp \operatorname{Col}(A).

Thus least squares is an orthogonal projection problem. The method of least squares can be understood as finding the projection of a vector onto a column space.

52.4 Orthogonality Condition

Since the residual

r = b - A\hat{x}

is orthogonal to the column space of A, it is orthogonal to every column of A. This condition can be written as

A^T r = 0.

Substituting r = b - A\hat{x}, we obtain

A^T(b - A\hat{x}) = 0.

Expanding gives

A^T b - A^T A\hat{x} = 0.

Therefore

A^T A\hat{x} = A^T b.

These equations are called the normal equations. They characterize least squares solutions in the Euclidean norm.

52.5 The Normal Equations

The normal equations are

A^T A\hat{x} = A^T b.

They form an n \times n linear system.

If A has linearly independent columns, then A^T A is invertible, and the least squares solution is unique:

\hat{x} = (A^T A)^{-1}A^T b.

If A does not have linearly independent columns, the normal equations may have infinitely many solutions. In that case, every solution of the normal equations gives the same projected vector A\hat{x}, but the coefficient vector \hat{x} may not be unique.

The normal equations are central because they replace an inconsistent system by a consistent square system.
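As an illustration, here is a minimal NumPy sketch of the normal-equations route; the data values are invented for the example, and A is assumed to have full column rank.

```python
import numpy as np

# Invented overdetermined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Form and solve the normal equations  A^T A x = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# The residual should be (numerically) orthogonal to every column of A.
r = b - A @ x_hat
print("x_hat =", x_hat)      # approximately [7/6, 1/2]
print("A^T r =", A.T @ r)    # approximately the zero vector
```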

52.6 Why A^T A Is Invertible Under Full Column Rank

Assume the columns of A are linearly independent. Then

\operatorname{Null}(A) = \{0\}.

Suppose

A^T A x = 0.

Multiply on the left by x^T:

x^T A^T A x = 0.

But

x^T A^T A x = (Ax)^T(Ax) = \|Ax\|_2^2.

Thus

\|Ax\|_2^2 = 0.

So

Ax = 0.

Since A has linearly independent columns,

x = 0.

Therefore A^T A has trivial null space and is invertible.

52.7 Projection Matrix

If A has full column rank, the projection of b onto Col(A) is

p = A\hat{x}.

Using

\hat{x} = (A^T A)^{-1}A^T b,

we get

p = A(A^T A)^{-1}A^T b.

Thus the projection matrix onto the column space of A is

P = A(A^T A)^{-1}A^T.

This matrix satisfies

P^2 = P

and

P^T = P.

Hence it is an orthogonal projection matrix.

The residual is

r = b - p = (I - P)b.

Since p \in \operatorname{Col}(A) and r \in \operatorname{Col}(A)^\perp, we have

b = p + r.

This is the orthogonal decomposition behind least squares.
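The projection-matrix identities are easy to verify numerically. A short sketch, using random full-column-rank data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # random 6x3 matrix; full column rank with probability 1
b = rng.standard_normal(6)

# Projection matrix onto Col(A):  P = A (A^T A)^{-1} A^T.
P = A @ np.linalg.solve(A.T @ A, A.T)

p = P @ b          # fitted part, in Col(A)
r = b - p          # residual part, in Col(A)^perp

print(np.allclose(P @ P, P))     # idempotent: P^2 = P
print(np.allclose(P.T, P))       # symmetric: P^T = P
print(np.allclose(A.T @ r, 0))   # residual orthogonal to the columns of A
print(np.allclose(p + r, b))     # orthogonal decomposition b = p + r
```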

52.8 The Objective Function

Define

f(x) = \|Ax - b\|_2^2.

Then

f(x) = (Ax - b)^T(Ax - b).

Expanding gives

f(x) = x^T A^T A x - 2x^T A^T b + b^T b.

This is a quadratic function of x. Its gradient is

\nabla f(x) = 2A^T A x - 2A^T b.

At a minimizer \hat{x}, the gradient must vanish:

2A^T A\hat{x} - 2A^T b = 0.

Hence

A^T A\hat{x} = A^T b.

Thus the normal equations can also be derived from calculus.
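As a quick numerical sanity check on the gradient formula (not part of the text's derivation), one can compare 2A^T A x - 2A^T b with a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

def f(v):
    # f(v) = ||Av - b||_2^2
    return float(np.sum((A @ v - b) ** 2))

grad = 2 * A.T @ A @ x - 2 * A.T @ b   # analytic gradient

# Central finite differences, one coordinate at a time.
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)])

print(np.allclose(grad, fd, atol=1e-4))   # True: the formulas agree
```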

52.9 Example: An Inconsistent System

Consider

A = \begin{bmatrix} 1\\ 1 \end{bmatrix}, \qquad b = \begin{bmatrix} 2\\ 4 \end{bmatrix}.

The system

Ax = b

means

x = 2, \qquad x = 4.

There is no exact solution.

The least squares problem is

\min_x \left\| \begin{bmatrix} x\\ x \end{bmatrix} - \begin{bmatrix} 2\\ 4 \end{bmatrix} \right\|_2^2.

That is,

\min_x \left((x-2)^2+(x-4)^2\right).

The normal equations are

A^T A\hat{x} = A^T b.

Compute

A^T A = \begin{bmatrix} 1&1 \end{bmatrix} \begin{bmatrix} 1\\ 1 \end{bmatrix} = 2,

and

A^T b = \begin{bmatrix} 1&1 \end{bmatrix} \begin{bmatrix} 2\\ 4 \end{bmatrix} = 6.

Thus

2\hat{x} = 6,

so

\hat{x} = 3.

The best approximation to b within the column space is

A\hat{x} = \begin{bmatrix} 3\\ 3 \end{bmatrix}.

The residual is

r = b - A\hat{x} = \begin{bmatrix} 2\\ 4 \end{bmatrix} - \begin{bmatrix} 3\\ 3 \end{bmatrix} = \begin{bmatrix} -1\\ 1 \end{bmatrix}.

Check orthogonality:

A^T r = \begin{bmatrix} 1&1 \end{bmatrix} \begin{bmatrix} -1\\ 1 \end{bmatrix} = 0.

Thus the residual is orthogonal to the column space of A.
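The same tiny example can be reproduced in NumPy (a sketch; any least squares routine would do):

```python
import numpy as np

A = np.array([[1.0],
              [1.0]])
b = np.array([2.0, 4.0])

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
r = b - A @ x_hat

print(x_hat)      # approximately [3.]
print(r)          # approximately [-1.  1.]
print(A.T @ r)    # approximately [0.]
```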

52.10 Least Squares Line Fitting

Suppose data points

(t_1,y_1),\ldots,(t_m,y_m)

are to be fitted by a line

y = \beta_0 + \beta_1 t.

For each data point, the model gives

\beta_0 + \beta_1 t_i \approx y_i.

Collecting all equations gives

\begin{bmatrix} 1 & t_1\\ 1 & t_2\\ \vdots & \vdots\\ 1 & t_m \end{bmatrix} \begin{bmatrix} \beta_0\\ \beta_1 \end{bmatrix} \approx \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{bmatrix}.

This has the form

X\beta \approx y.

The least squares estimate is

\hat{\beta} = (X^T X)^{-1}X^T y,

provided X has full column rank.

The first column represents the intercept. The second column represents the slope. More complicated models add more columns.
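In code, building the design matrix amounts to stacking a column of ones next to the column of t-values. A sketch with invented measurements:

```python
import numpy as np

# Hypothetical data points (t_i, y_i).
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Design matrix: intercept column of ones, then the slope column of t-values.
X = np.column_stack([np.ones_like(t), t])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept:", beta_hat[0])
print("slope:    ", beta_hat[1])
```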

52.11 Example: Fitting a Line

Fit a line

y = \beta_0 + \beta_1 t

to the three data points

(0,1), \quad (1,2), \quad (2,2).

The design matrix and data vector are

X = \begin{bmatrix} 1&0\\ 1&1\\ 1&2 \end{bmatrix}, \qquad y = \begin{bmatrix} 1\\ 2\\ 2 \end{bmatrix}.

Compute

X^T X = \begin{bmatrix} 1&1&1\\ 0&1&2 \end{bmatrix} \begin{bmatrix} 1&0\\ 1&1\\ 1&2 \end{bmatrix} = \begin{bmatrix} 3&3\\ 3&5 \end{bmatrix}.

Also,

X^T y = \begin{bmatrix} 1&1&1\\ 0&1&2 \end{bmatrix} \begin{bmatrix} 1\\ 2\\ 2 \end{bmatrix} = \begin{bmatrix} 5\\ 6 \end{bmatrix}.

The normal equations are

\begin{bmatrix} 3&3\\ 3&5 \end{bmatrix} \begin{bmatrix} \hat{\beta}_0\\ \hat{\beta}_1 \end{bmatrix} = \begin{bmatrix} 5\\ 6 \end{bmatrix}.

Solving,

3\hat{\beta}_0 + 3\hat{\beta}_1 = 5, \qquad 3\hat{\beta}_0 + 5\hat{\beta}_1 = 6.

Subtracting the first equation from the second gives

2\hat{\beta}_1 = 1,

so

\hat{\beta}_1 = \frac{1}{2}.

Then

3\hat{\beta}_0 + \frac{3}{2} = 5,

so

\hat{\beta}_0 = \frac{7}{6}.

The least squares line is

y = \frac{7}{6} + \frac{1}{2} t.
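A quick numerical check of this hand computation (the printed values should match 7/6 and 1/2 up to rounding):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.0])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
print(beta_hat)           # approximately [1.1667, 0.5], i.e. (7/6, 1/2)
print(y - X @ beta_hat)   # residuals, approximately [-1/6, 1/3, -1/6]
```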

52.12 Residuals in Data Fitting

For the fitted line above, the predicted values are

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 t_i.

Thus

\hat{y} = \begin{bmatrix} 7/6\\ 5/3\\ 13/6 \end{bmatrix}.

The residual vector is

r = y - \hat{y} = \begin{bmatrix} 1\\ 2\\ 2 \end{bmatrix} - \begin{bmatrix} 7/6\\ 5/3\\ 13/6 \end{bmatrix} = \begin{bmatrix} -1/6\\ 1/3\\ -1/6 \end{bmatrix}.

The residual is orthogonal to both columns of X:

X^T r = 0.

This gives two equations:

\sum_{i=1}^m r_i = 0,

and

\sum_{i=1}^m t_i r_i = 0.

For line fitting, these equations say that the residuals balance against the intercept column and the slope column.

52.13 Least Squares with Orthonormal Columns

If Q has orthonormal columns, then

Q^T Q = I.

The least squares problem

\min_x \|Qx - b\|_2

has normal equations

Q^T Q\hat{x} = Q^T b.

Since Q^T Q = I, this reduces to

\hat{x} = Q^T b.

The projected vector is

p = Q\hat{x} = QQ^T b.

This is why orthonormal bases are useful. They make least squares coefficients easy to compute and avoid forming A^T A.

52.14 QR Solution of Least Squares

Let

A = QR,

where Q has orthonormal columns and R is upper triangular and invertible. Then

\|Ax - b\|_2 = \|QRx - b\|_2.

Substituting A = QR into the normal equations A^T A\hat{x} = A^T b gives R^T Q^T Q R\hat{x} = R^T Q^T b. Since Q^T Q = I and R^T is invertible, this reduces to

R\hat{x} = Q^T b.

Thus

\hat{x} = R^{-1}Q^T b.

In practice, one solves the triangular system

R\hat{x} = Q^T b

by back substitution rather than explicitly computing R^{-1}.

QR methods are usually preferred over the normal equations for numerical stability. Forming A^T A can worsen conditioning, while orthogonal transformations preserve Euclidean norms.
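A sketch of the QR route with NumPy; numpy.linalg.qr returns a reduced factorization, and np.linalg.solve stands in for the back substitution (a dedicated triangular solver would normally be used):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)

# Reduced QR: Q is 8x3 with orthonormal columns, R is 3x3 upper triangular.
Q, R = np.linalg.qr(A)

# Solve R x = Q^T b.
x_qr = np.linalg.solve(R, Q.T @ b)

# Agrees with the normal-equations solution up to rounding.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(x_qr, x_ne))
```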

52.15 Rank-Deficient Least Squares

If A does not have full column rank, the least squares problem still has at least one solution, but the solution vector may not be unique.

The normal equations

A^T A x = A^T b

remain valid. However, A^T A is singular.

If \hat{x} is one least squares solution and z lies in Null(A), then

A(\hat{x} + z) = A\hat{x}.

Thus \hat{x} + z gives the same fitted vector and the same residual.

Among all least squares solutions, one often chooses the solution with smallest Euclidean norm. This solution can be computed using the singular value decomposition.
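A small rank-deficient sketch (the matrix below is our own example): the second column repeats the first, so any null-space vector can be added without changing the fit, and numpy.linalg.lstsq returns the minimum-norm solution.

```python
import numpy as np

# Rank-deficient: the two columns are identical.
A = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [2.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

x_min, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)   # minimum-norm least squares solution
print("rank of A:", rank)                                # 1

z = np.array([1.0, -1.0])    # a null-space direction: A z = 0
x_other = x_min + z          # another least squares solution

print(np.allclose(A @ x_min, A @ x_other))                # same fitted vector
print(np.linalg.norm(x_min) < np.linalg.norm(x_other))    # x_min has the smaller norm
```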

52.16 Weighted Least Squares

In some problems, not all residuals should be treated equally. If measurements have different reliability, more reliable measurements should receive greater weight.

Weighted least squares minimizes

\|W(Ax - b)\|_2^2,

where W is a weighting matrix.

If W is diagonal,

W = \begin{bmatrix} w_1\\ & w_2\\ && \ddots\\ &&& w_m \end{bmatrix},

then the objective is

\sum_{i=1}^m w_i^2 (a_i^T x - b_i)^2,

where a_i^T denotes the i-th row of A.

The normal equations become

A^T W^T W A\hat{x} = A^T W^T W b.

Weighted least squares appears in statistics, inverse problems, and numerical modeling when observations have unequal variance or importance.
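With a diagonal W, a weighted fit can be computed by scaling each row of A and b by its weight and then solving an ordinary least squares problem. A sketch with invented weights:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])
w = np.array([1.0, 1.0, 10.0])   # trust the third measurement most (illustrative choice)

# Minimizing ||W(Ax - b)||_2^2 with W = diag(w) is ordinary least squares on scaled rows.
Aw = A * w[:, None]
bw = b * w

x_weighted, *_ = np.linalg.lstsq(Aw, bw, rcond=None)
x_plain, *_ = np.linalg.lstsq(A, b, rcond=None)

print("weighted:  ", x_weighted)   # pulled toward fitting the heavily weighted point
print("unweighted:", x_plain)
```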

52.17 Least Squares and the Pseudoinverse

The Moore-Penrose pseudoinverse A^+ gives a compact expression for least squares solutions.

If A has full column rank, then

A^+ = (A^T A)^{-1}A^T,

and

\hat{x} = A^+ b.

For general A, the vector

A^+ b

is the minimum-norm least squares solution.

The pseudoinverse is especially useful when A is rectangular or rank deficient.
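In NumPy, numpy.linalg.pinv computes A^+ through the SVD, so A^+ b agrees with the minimum-norm solution from lstsq even in the rank-deficient case; a brief check:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [2.0, 2.0]])   # rank 1
b = np.array([1.0, 2.0, 2.0])

x_pinv = np.linalg.pinv(A) @ b                       # minimum-norm least squares solution
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_pinv, x_lstsq))   # True
```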

52.18 Error Decomposition

Let

p = A\hat{x}

be the projection of b onto Col(A), and let

r = b - p.

Since p \in \operatorname{Col}(A) and r \in \operatorname{Col}(A)^\perp, the two pieces are orthogonal, p \perp r, and the Pythagorean theorem gives

\|b\|_2^2 = \|p\|_2^2 + \|r\|_2^2.

The total squared size of bb splits into a fitted part and an unfitted part.

In regression language, the projection explains part of the variation, and the residual contains the part left unexplained by the chosen model.
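The decomposition is easy to confirm numerically; a sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((7, 2))
b = rng.standard_normal(7)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
p = A @ x_hat   # fitted part, in Col(A)
r = b - p       # residual part, in Col(A)^perp

# ||b||^2 = ||p||^2 + ||r||^2
print(np.isclose(np.linalg.norm(b) ** 2,
                 np.linalg.norm(p) ** 2 + np.linalg.norm(r) ** 2))
```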

52.19 Conditioning

The normal equations can be numerically sensitive.

If A is ill-conditioned, then A^T A is often much more ill-conditioned. In the Euclidean norm,

\kappa_2(A^T A) = \kappa_2(A)^2.

Thus solving least squares by normal equations may lose accuracy.

More stable or more flexible approaches include the following:

- QR factorization: reduces least squares to a triangular solve.
- SVD: handles rank deficiency and computes minimum-norm solutions.
- Iterative methods: useful for large sparse problems.
- Regularization: stabilizes ill-conditioned problems.

The normal equations are important theoretically, but QR and SVD are often better computational tools.
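The squaring of the condition number can be seen directly with numpy.linalg.cond; the matrix below, with two nearly parallel columns, is our own illustrative choice:

```python
import numpy as np

# Two nearly parallel columns make A ill-conditioned.
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-4],
              [1.0, 1.0 - 1e-4]])

kappa_A = np.linalg.cond(A)            # 2-norm condition number of A
kappa_ATA = np.linalg.cond(A.T @ A)    # condition number of the normal-equations matrix

print(kappa_A)
print(kappa_ATA)
print(np.isclose(kappa_ATA, kappa_A ** 2, rtol=1e-3))   # kappa_2(A^T A) = kappa_2(A)^2
```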

52.20 Summary

A least squares problem seeks

\hat{x} = \arg\min_x \|Ax - b\|_2.

Geometrically, A\hat{x} is the orthogonal projection of b onto Col(A). The residual

r = b - A\hat{x}

satisfies

r \perp \operatorname{Col}(A).

This orthogonality gives the normal equations:

A^T A\hat{x} = A^T b.

If A has full column rank, the solution is unique:

\hat{x} = (A^T A)^{-1}A^T b.

Least squares connects linear systems, projection, approximation, data fitting, and numerical computation. It is one of the main uses of inner product geometry in applied linear algebra.