
Chapter 52. Least Squares Problems

A least squares problem asks for the best approximate solution to a linear system that may have no exact solution. Instead of requiring

Ax = b,

we choose x so that Ax is as close as possible to b. The usual measure of closeness is the Euclidean norm of the residual:

\|Ax - b\|_2.

Equivalently, we minimize the squared residual norm:

\|Ax - b\|_2^2.

This problem appears whenever there are more equations than unknowns, noisy measurements, inconsistent constraints, or data that should be fitted by a simple model. Standard linear algebra texts describe least squares as solving Ax = b as closely as possible by minimizing the squared error \|Ax - b\|^2.

52.1 The Basic Problem

Let

A \in \mathbb{R}^{m\times n}, \qquad b \in \mathbb{R}^m.

The least squares problem is

\min_{x\in\mathbb{R}^n} \|Ax - b\|_2.

Since the square root is increasing, this is equivalent to

\min_{x\in\mathbb{R}^n} \|Ax - b\|_2^2.

The vector

r = b - Ax

is called the residual. A least squares solution is a vector \hat{x} for which

\|A\hat{x} - b\|_2 \le \|Ax - b\|_2

for every x \in \mathbb{R}^n.

The vector A\hat{x} is the closest vector to b among all vectors in the column space of A.
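As a first concrete illustration (our own small example, using NumPy's least squares routine), the problem can be posed and solved directly:

```python
import numpy as np

# A small overdetermined system: 4 equations, 2 unknowns (invented data).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

# lstsq minimizes ||Ax - b||_2 over all x.
x_hat, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)

print("least squares solution:", x_hat)
print("residual norm:", np.linalg.norm(A @ x_hat - b))
```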

52.2 Why Exact Solutions May Fail

The equation

Ax = b

has a solution exactly when

b \in \operatorname{Col}(A).

If b does not lie in the column space, no vector x satisfies the system exactly.

This is common when m > n, so that there are more equations than unknowns. Such systems are called overdetermined. An overdetermined system may be consistent, but in applications with measurement error it usually is not.

Least squares replaces the exact equation by the approximation problem

Ax \approx b.

The goal is to choose the vector in Col(A) nearest to b.

52.3 Geometric Interpretation

The set of all possible vectors Ax is the column space of A:

\operatorname{Col}(A) = \{Ax : x\in\mathbb{R}^n\}.

Thus the least squares problem is the problem of projecting b onto Col(A).

Let

p = \operatorname{proj}_{\operatorname{Col}(A)}(b).

Then

p = A\hat{x}

for at least one vector \hat{x}. The residual is

r = b - p = b - A\hat{x}.

The defining condition for orthogonal projection is

r \perp \operatorname{Col}(A).

Thus least squares is an orthogonal projection problem. The method of least squares can be understood as finding the projection of a vector onto a column space.

52.4 Orthogonality Condition

Since the residual

r = b - A\hat{x}

is orthogonal to the column space of A, it is orthogonal to every column of A. This condition can be written as

A^T r = 0.

Substituting r = b - A\hat{x}, we obtain

A^T(b - A\hat{x}) = 0.

Expanding gives

A^T b - A^T A\hat{x} = 0.

Therefore

A^T A\hat{x} = A^T b.

These equations are called the normal equations. They characterize least squares solutions in the Euclidean norm.

52.5 The Normal Equations

The normal equations are

A^T A\hat{x} = A^T b.

They form an n \times n linear system.

If A has linearly independent columns, then A^T A is invertible, and the least squares solution is unique:

\hat{x} = (A^T A)^{-1}A^T b.

If A does not have linearly independent columns, the normal equations may have infinitely many solutions. In that case, every solution of the normal equations gives the same projected vector A\hat{x}, but the coefficient vector \hat{x} may not be unique.

The normal equations are central because they replace an inconsistent system by a consistent square system.
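As an illustration, here is a minimal NumPy sketch of the normal-equations route; the data values are invented for the example, and A is assumed to have full column rank.

```python
import numpy as np

# Invented overdetermined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Form and solve the normal equations  A^T A x = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# The residual should be (numerically) orthogonal to every column of A.
r = b - A @ x_hat
print("x_hat =", x_hat)      # approximately [7/6, 1/2]
print("A^T r =", A.T @ r)    # approximately the zero vector
```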

52.6 Why A^T A Is Invertible Under Full Column Rank

Assume the columns of A are linearly independent. Then

\operatorname{Null}(A) = \{0\}.

Suppose

A^T A x = 0.

Multiply on the left by x^T:

x^T A^T A x = 0.

But

x^T A^T A x = (Ax)^T(Ax) = \|Ax\|_2^2.

Thus

\|Ax\|_2^2 = 0.

So

Ax = 0.

Since A has linearly independent columns,

x = 0.

Therefore A^T A has trivial null space and is invertible.

52.7 Projection Matrix

If A has full column rank, the projection of b onto Col(A) is

p = A\hat{x}.

Using

\hat{x} = (A^T A)^{-1}A^T b,

we get

p = A(A^T A)^{-1}A^T b.

Thus the projection matrix onto the column space of A is

P = A(A^T A)^{-1}A^T.

This matrix satisfies

P^2 = P

and

P^T = P.

Hence it is an orthogonal projection matrix.

The residual is

r = b - p = (I - P)b.

Since p \in \operatorname{Col}(A) and r \in \operatorname{Col}(A)^\perp, we have

b = p + r.

This is the orthogonal decomposition behind least squares.
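The projection-matrix identities are easy to verify numerically. A short sketch, using random full-column-rank data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # random 6x3 matrix; full column rank with probability 1
b = rng.standard_normal(6)

# Projection matrix onto Col(A):  P = A (A^T A)^{-1} A^T.
P = A @ np.linalg.solve(A.T @ A, A.T)

p = P @ b          # fitted part, in Col(A)
r = b - p          # residual part, in Col(A)^perp

print(np.allclose(P @ P, P))     # idempotent: P^2 = P
print(np.allclose(P.T, P))       # symmetric: P^T = P
print(np.allclose(A.T @ r, 0))   # residual orthogonal to the columns of A
print(np.allclose(p + r, b))     # orthogonal decomposition b = p + r
```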

52.8 The Objective Function

Define

f(x) = \|Ax - b\|_2^2.

Then

f(x) = (Ax - b)^T(Ax - b).

Expanding gives

f(x) = x^T A^T A x - 2x^T A^T b + b^T b.

This is a quadratic function of x. Its gradient is

\nabla f(x) = 2A^T A x - 2A^T b.

At a minimizer \hat{x}, the gradient must vanish:

2A^T A\hat{x} - 2A^T b = 0.

Hence

A^T A\hat{x} = A^T b.

Thus the normal equations can also be derived from calculus.
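As a quick numerical sanity check on the gradient formula (not part of the text's derivation), one can compare 2A^T A x - 2A^T b with a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

def f(v):
    # f(v) = ||Av - b||_2^2
    return float(np.sum((A @ v - b) ** 2))

grad = 2 * A.T @ A @ x - 2 * A.T @ b   # analytic gradient

# Central finite differences, one coordinate at a time.
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)])

print(np.allclose(grad, fd, atol=1e-4))   # True: the formulas agree
```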

52.9 Example: An Inconsistent System

Consider

A = \begin{bmatrix} 1\\ 1 \end{bmatrix}, \qquad b = \begin{bmatrix} 2\\ 4 \end{bmatrix}.

The system

Ax = b

means

x = 2, \qquad x = 4.

There is no exact solution.

The least squares problem is

\min_x \left\| \begin{bmatrix} x\\ x \end{bmatrix} - \begin{bmatrix} 2\\ 4 \end{bmatrix} \right\|_2^2.

That is,

\min_x \left((x-2)^2+(x-4)^2\right).

The normal equations are

A^T A\hat{x} = A^T b.

Compute

A^T A = \begin{bmatrix} 1&1 \end{bmatrix} \begin{bmatrix} 1\\ 1 \end{bmatrix} = 2,

and

A^T b = \begin{bmatrix} 1&1 \end{bmatrix} \begin{bmatrix} 2\\ 4 \end{bmatrix} = 6.

Thus

2\hat{x} = 6,

so

\hat{x} = 3.

The best approximation to b within the column space is

A\hat{x} = \begin{bmatrix} 3\\ 3 \end{bmatrix}.

The residual is

r = b - A\hat{x} = \begin{bmatrix} 2\\ 4 \end{bmatrix} - \begin{bmatrix} 3\\ 3 \end{bmatrix} = \begin{bmatrix} -1\\ 1 \end{bmatrix}.

Check orthogonality:

A^T r = \begin{bmatrix} 1&1 \end{bmatrix} \begin{bmatrix} -1\\ 1 \end{bmatrix} = 0.

Thus the residual is orthogonal to the column space of A.
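The same tiny example can be reproduced in NumPy (a sketch; any least squares routine would do):

```python
import numpy as np

A = np.array([[1.0],
              [1.0]])
b = np.array([2.0, 4.0])

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
r = b - A @ x_hat

print(x_hat)      # approximately [3.]
print(r)          # approximately [-1.  1.]
print(A.T @ r)    # approximately [0.]
```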

52.10 Least Squares Line Fitting

Suppose data points

(t_1,y_1),\ldots,(t_m,y_m)

are to be fitted by a line

y = \beta_0 + \beta_1 t.

For each data point, the model gives

\beta_0 + \beta_1 t_i \approx y_i.

Collecting all equations gives

\begin{bmatrix} 1 & t_1\\ 1 & t_2\\ \vdots & \vdots\\ 1 & t_m \end{bmatrix} \begin{bmatrix} \beta_0\\ \beta_1 \end{bmatrix} \approx \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{bmatrix}.

This has the form

X\beta \approx y.

The least squares estimate is

\hat{\beta} = (X^T X)^{-1}X^T y,

provided X has full column rank.

The first column represents the intercept. The second column represents the slope. More complicated models add more columns.
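In code, building the design matrix amounts to stacking a column of ones next to the column of t-values. A sketch with invented measurements:

```python
import numpy as np

# Hypothetical data points (t_i, y_i).
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Design matrix: intercept column of ones, then the slope column of t-values.
X = np.column_stack([np.ones_like(t), t])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept:", beta_hat[0])
print("slope:    ", beta_hat[1])
```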

52.11 Example: Fitting a Line

Fit a line

y = \beta_0 + \beta_1 t

to the three data points

(0,1), \quad (1,2), \quad (2,2).

The design matrix and data vector are

X = \begin{bmatrix} 1&0\\ 1&1\\ 1&2 \end{bmatrix}, \qquad y = \begin{bmatrix} 1\\ 2\\ 2 \end{bmatrix}.

Compute

X^T X = \begin{bmatrix} 1&1&1\\ 0&1&2 \end{bmatrix} \begin{bmatrix} 1&0\\ 1&1\\ 1&2 \end{bmatrix} = \begin{bmatrix} 3&3\\ 3&5 \end{bmatrix}.

Also,

X^T y = \begin{bmatrix} 1&1&1\\ 0&1&2 \end{bmatrix} \begin{bmatrix} 1\\ 2\\ 2 \end{bmatrix} = \begin{bmatrix} 5\\ 6 \end{bmatrix}.

The normal equations are

\begin{bmatrix} 3&3\\ 3&5 \end{bmatrix} \begin{bmatrix} \hat{\beta}_0\\ \hat{\beta}_1 \end{bmatrix} = \begin{bmatrix} 5\\ 6 \end{bmatrix}.

Solving,

3\hat{\beta}_0 + 3\hat{\beta}_1 = 5, \qquad 3\hat{\beta}_0 + 5\hat{\beta}_1 = 6.

Subtracting the first equation from the second gives

2\hat{\beta}_1 = 1,

so

\hat{\beta}_1 = \frac{1}{2}.

Then

3\hat{\beta}_0 + \frac{3}{2} = 5,

so

\hat{\beta}_0 = \frac{7}{6}.

The least squares line is

y = \frac{7}{6} + \frac{1}{2} t.
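A quick numerical check of this hand computation (the printed values should match 7/6 and 1/2 up to rounding):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.0])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
print(beta_hat)           # approximately [1.1667, 0.5], i.e. (7/6, 1/2)
print(y - X @ beta_hat)   # residuals, approximately [-1/6, 1/3, -1/6]
```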

52.12 Residuals in Data Fitting

For the fitted line above, the predicted values are

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 t_i.

Thus

\hat{y} = \begin{bmatrix} 7/6\\ 5/3\\ 13/6 \end{bmatrix}.

The residual vector is

r = y - \hat{y} = \begin{bmatrix} 1\\ 2\\ 2 \end{bmatrix} - \begin{bmatrix} 7/6\\ 5/3\\ 13/6 \end{bmatrix} = \begin{bmatrix} -1/6\\ 1/3\\ -1/6 \end{bmatrix}.

The residual is orthogonal to both columns of X:

X^T r = 0.

This gives two equations:

\sum_{i=1}^m r_i = 0,

and

\sum_{i=1}^m t_i r_i = 0.

For line fitting, these equations say that the residuals balance against the intercept column and the slope column.

52.13 Least Squares with Orthonormal Columns

If Q has orthonormal columns, then

Q^T Q = I.

The least squares problem

\min_x \|Qx - b\|_2

has normal equations

Q^T Q\hat{x} = Q^T b.

Since Q^T Q = I, this reduces to

\hat{x} = Q^T b.

The projected vector is

p = Q\hat{x} = QQ^T b.

This is why orthonormal bases are useful. They make least squares coefficients easy to compute and avoid forming A^T A.

52.14 QR Solution of Least Squares

Let

A = QR,

where Q has orthonormal columns and R is upper triangular and invertible. Then

\|Ax - b\|_2 = \|QRx - b\|_2.

Substituting A = QR into the normal equations A^T A\hat{x} = A^T b gives R^T Q^T Q R\hat{x} = R^T Q^T b. Since Q^T Q = I and R^T is invertible, this reduces to

R\hat{x} = Q^T b.

Thus

\hat{x} = R^{-1}Q^T b.

In practice, one solves the triangular system

R\hat{x} = Q^T b

by back substitution rather than explicitly computing R^{-1}.

QR methods are usually preferred over the normal equations for numerical stability. Forming A^T A can worsen conditioning, while orthogonal transformations preserve Euclidean norms.
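A sketch of the QR route with NumPy; numpy.linalg.qr returns a reduced factorization, and np.linalg.solve stands in for the back substitution (a dedicated triangular solver would normally be used):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)

# Reduced QR: Q is 8x3 with orthonormal columns, R is 3x3 upper triangular.
Q, R = np.linalg.qr(A)

# Solve R x = Q^T b.
x_qr = np.linalg.solve(R, Q.T @ b)

# Agrees with the normal-equations solution up to rounding.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(x_qr, x_ne))
```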

52.15 Rank-Deficient Least Squares

If A does not have full column rank, the least squares problem still has at least one solution, but the solution vector may not be unique.

The normal equations

A^T A x = A^T b

remain valid. However, A^T A is singular.

If \hat{x} is one least squares solution and z lies in Null(A), then

A(\hat{x} + z) = A\hat{x}.

Thus \hat{x} + z gives the same fitted vector and the same residual.

Among all least squares solutions, one often chooses the solution with smallest Euclidean norm. This solution can be computed using the singular value decomposition.
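A small rank-deficient sketch (the matrix below is our own example): the second column repeats the first, so any null-space vector can be added without changing the fit, and numpy.linalg.lstsq returns the minimum-norm solution.

```python
import numpy as np

# Rank-deficient: the two columns are identical.
A = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [2.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

x_min, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)   # minimum-norm least squares solution
print("rank of A:", rank)                                # 1

z = np.array([1.0, -1.0])    # a null-space direction: A z = 0
x_other = x_min + z          # another least squares solution

print(np.allclose(A @ x_min, A @ x_other))                # same fitted vector
print(np.linalg.norm(x_min) < np.linalg.norm(x_other))    # x_min has the smaller norm
```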

52.16 Weighted Least Squares

In some problems, not all residuals should be treated equally. If measurements have different reliability, more reliable measurements should receive greater weight.

Weighted least squares minimizes

\|W(Ax - b)\|_2^2,

where W is a weighting matrix.

If W is diagonal,

W = \begin{bmatrix} w_1\\ & w_2\\ && \ddots\\ &&& w_m \end{bmatrix},

then the objective is

\sum_{i=1}^m w_i^2 (a_i^T x - b_i)^2,

where a_i^T denotes the i-th row of A.

The normal equations become

A^T W^T W A\hat{x} = A^T W^T W b.

Weighted least squares appears in statistics, inverse problems, and numerical modeling when observations have unequal variance or importance.
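With a diagonal W, a weighted fit can be computed by scaling each row of A and b by its weight and then solving an ordinary least squares problem. A sketch with invented weights:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])
w = np.array([1.0, 1.0, 10.0])   # trust the third measurement most (illustrative choice)

# Minimizing ||W(Ax - b)||_2^2 with W = diag(w) is ordinary least squares on scaled rows.
Aw = A * w[:, None]
bw = b * w

x_weighted, *_ = np.linalg.lstsq(Aw, bw, rcond=None)
x_plain, *_ = np.linalg.lstsq(A, b, rcond=None)

print("weighted:  ", x_weighted)   # pulled toward fitting the heavily weighted point
print("unweighted:", x_plain)
```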

52.17 Least Squares and the Pseudoinverse

The Moore-Penrose pseudoinverse A^+ gives a compact expression for least squares solutions.

If A has full column rank, then

A^+ = (A^T A)^{-1}A^T,

and

\hat{x} = A^+ b.

For general A, the vector

A^+ b

is the minimum-norm least squares solution.

The pseudoinverse is especially useful when A is rectangular or rank deficient.
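In NumPy, numpy.linalg.pinv computes A^+ through the SVD, so A^+ b agrees with the minimum-norm solution from lstsq even in the rank-deficient case; a brief check:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [2.0, 2.0]])   # rank 1
b = np.array([1.0, 2.0, 2.0])

x_pinv = np.linalg.pinv(A) @ b                       # minimum-norm least squares solution
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_pinv, x_lstsq))   # True
```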

52.18 Error Decomposition

Let

p = A\hat{x}

be the projection of b onto Col(A), and let

r = b - p.

Since p \in \operatorname{Col}(A) and r \in \operatorname{Col}(A)^\perp, the two pieces are orthogonal, p \perp r, and the Pythagorean theorem gives

\|b\|_2^2 = \|p\|_2^2 + \|r\|_2^2.

The total squared size of bb splits into a fitted part and an unfitted part.

In regression language, the projection explains part of the variation, and the residual contains the part left unexplained by the chosen model.
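The decomposition is easy to confirm numerically; a sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((7, 2))
b = rng.standard_normal(7)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
p = A @ x_hat   # fitted part, in Col(A)
r = b - p       # residual part, in Col(A)^perp

# ||b||^2 = ||p||^2 + ||r||^2
print(np.isclose(np.linalg.norm(b) ** 2,
                 np.linalg.norm(p) ** 2 + np.linalg.norm(r) ** 2))
```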

52.19 Conditioning

The normal equations can be numerically sensitive.

If A is ill-conditioned, then A^T A is often much more ill-conditioned. In the Euclidean norm,

\kappa_2(A^T A) = \kappa_2(A)^2.

Thus solving least squares by normal equations may lose accuracy.

More stable or more flexible approaches include the following:

- QR factorization: reduces least squares to a triangular solve.
- SVD: handles rank deficiency and computes minimum-norm solutions.
- Iterative methods: useful for large sparse problems.
- Regularization: stabilizes ill-conditioned problems.

The normal equations are important theoretically, but QR and SVD are often better computational tools.
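The squaring of the condition number can be seen directly with numpy.linalg.cond; the matrix below, with two nearly parallel columns, is our own illustrative choice:

```python
import numpy as np

# Two nearly parallel columns make A ill-conditioned.
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-4],
              [1.0, 1.0 - 1e-4]])

kappa_A = np.linalg.cond(A)            # 2-norm condition number of A
kappa_ATA = np.linalg.cond(A.T @ A)    # condition number of the normal-equations matrix

print(kappa_A)
print(kappa_ATA)
print(np.isclose(kappa_ATA, kappa_A ** 2, rtol=1e-3))   # kappa_2(A^T A) = kappa_2(A)^2
```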

52.20 Summary

A least squares problem seeks

\hat{x} = \arg\min_x \|Ax - b\|_2.

Geometrically, A\hat{x} is the orthogonal projection of b onto Col(A). The residual

r = b - A\hat{x}

satisfies

r \perp \operatorname{Col}(A).

This orthogonality gives the normal equations:

A^T A\hat{x} = A^T b.

If A has full column rank, the solution is unique:

\hat{x} = (A^T A)^{-1}A^T b.

Least squares connects linear systems, projection, approximation, data fitting, and numerical computation. It is one of the main uses of inner product geometry in applied linear algebra.