Chapter 88. Iterative Methods for Linear Systems

Iterative methods solve a linear system by producing a sequence of approximate solutions. Instead of factoring the matrix once and solving the system directly, an iterative method starts from an initial guess and repeatedly improves it.

The target problem is

$$ Ax = b. $$

A direct method, such as Gaussian elimination, tries to compute the solution in a finite sequence of algebraic steps. An iterative method constructs vectors

$$ x^{(0)}, x^{(1)}, x^{(2)}, \ldots $$

with the goal that

$$ x^{(k)} \to x $$

as $k$ increases.

Iterative methods are especially important for large sparse systems, where direct factorization may require too much memory or computation. Large sparse systems arise in discretized differential equations, graph problems, optimization, inverse problems, simulation, and machine learning. Standard classifications distinguish stationary methods such as Jacobi and Gauss-Seidel from nonstationary Krylov methods such as conjugate gradient and GMRES.

88.1 Direct Methods and Iterative Methods

A direct method attempts to solve

$$ Ax = b $$

by transforming the system into an equivalent system that can be solved exactly in exact arithmetic.

Examples include:

| Method | Main idea |
| --- | --- |
| Gaussian elimination | eliminate variables |
| LU factorization | factor $A$ into triangular matrices |
| Cholesky factorization | factor symmetric positive definite matrices |
| QR factorization | use orthogonal transformations |

An iterative method instead computes a sequence of approximations. It may stop before the exact solution is reached, once the residual or error is small enough.

| Method type | Strength | Weakness |
| --- | --- | --- |
| Direct | predictable, robust for moderate dense systems | expensive for large sparse systems |
| Iterative | memory efficient, often fast for sparse systems | convergence depends on matrix structure |

For large sparse matrices, iterative methods are often the only practical choice.

88.2 The Residual

Given an approximate solution $x^{(k)}$, the residual is

$$ r^{(k)} = b - Ax^{(k)}. $$

The residual measures how much the current approximation fails to satisfy the equation.

If

$$ r^{(k)} = 0, $$

then

$$ Ax^{(k)} = b, $$

so $x^{(k)}$ is an exact solution.

Most iterative methods use the residual to decide how to correct the current approximation.

88.3 The Error

Let xx be the exact solution, and let

$$ e^{(k)} = x - x^{(k)} $$

be the error at iteration $k$.

Because

$$ Ax = b, $$

we have

$$ A e^{(k)} = A(x - x^{(k)}) = b - Ax^{(k)} = r^{(k)}. $$

Thus

$$ A e^{(k)} = r^{(k)}. $$

The residual is the image of the error under $A$. A small residual implies a small error only when the matrix is well-conditioned.
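To make this concrete, here is a minimal NumPy sketch (the matrix and vectors are illustrative, not taken from the text) of an ill-conditioned system where a tiny residual hides a large error.

```python
import numpy as np

# Illustrative ill-conditioned system: the two rows are nearly parallel.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
b = np.array([2.0, 2.0001])
x_exact = np.array([1.0, 1.0])      # satisfies Ax = b exactly

x_approx = np.array([2.0, 0.0])     # a poor approximation
residual = b - A @ x_approx
error = x_exact - x_approx

print("cond(A)       ", np.linalg.cond(A))         # about 4e4
print("residual norm ", np.linalg.norm(residual))  # about 1e-4 (small)
print("error norm    ", np.linalg.norm(error))     # about 1.4 (large)
```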

88.4 General Fixed-Point Iteration

Many iterative methods can be written as fixed-point iterations.

Starting from

$$ Ax = b, $$

we rewrite the equation in the form

$$ x = Gx + c. $$

Then define

$$ x^{(k+1)} = Gx^{(k)} + c. $$

The matrix $G$ is called the iteration matrix.

If this iteration converges, its limit $x$ satisfies

$$ x = Gx + c, $$

which is equivalent to the original system.

88.5 Convergence of Fixed-Point Iteration

Let $x$ be the exact solution. Subtracting

$$ x = Gx + c $$

from

$$ x^{(k+1)} = Gx^{(k)} + c $$

gives

$$ e^{(k+1)} = G e^{(k)}. $$

Therefore,

$$ e^{(k)} = G^k e^{(0)}. $$

The iteration converges for every starting vector if and only if

$$ \rho(G) < 1, $$

where $\rho(G)$ is the spectral radius of $G$, the largest absolute value of its eigenvalues.

This criterion is central for stationary iterative methods.
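A quick way to test the criterion numerically is to compute the spectral radius of a candidate iteration matrix. The sketch below does so for the Jacobi iteration matrix of a small illustrative matrix.

```python
import numpy as np

def spectral_radius(G):
    """Largest absolute value of the eigenvalues of G."""
    return np.max(np.abs(np.linalg.eigvals(G)))

# Illustrative test matrix and its Jacobi iteration matrix G = -D^{-1}(L + U).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
D = np.diag(np.diag(A))
G = -np.linalg.inv(D) @ (A - D)

print(spectral_radius(G))   # about 0.35 < 1, so the iteration converges
```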

88.6 Matrix Splitting

A common construction uses a splitting of the matrix

$$ A = M - N, $$

where $M$ is easy to invert.

Then

$$ Ax = b $$

becomes

$$ (M - N)x = b, $$

so

$$ Mx = Nx + b. $$

This gives the iteration

$$ Mx^{(k+1)} = Nx^{(k)} + b. $$

Equivalently,

$$ x^{(k+1)} = M^{-1}N x^{(k)} + M^{-1}b. $$

Thus the iteration matrix is

$$ G = M^{-1}N. $$

The quality of the method depends on choosing $M$ so that solving systems with $M$ is cheap and $M^{-1}N$ has favorable convergence behavior.
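A generic splitting iteration fits in a few lines. The sketch below is a dense NumPy illustration; in practice $M$ would be diagonal or triangular so that the solve with $M$ is cheap.

```python
import numpy as np

def splitting_iteration(A, b, M, x0, tol=1e-10, max_iter=500):
    """Stationary iteration M x^{(k+1)} = N x^{(k)} + b with N = M - A."""
    N = M - A
    x = x0.copy()
    for k in range(max_iter):
        x = np.linalg.solve(M, N @ x + b)   # in practice M is chosen so this solve is cheap
        if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
            break
    return x, k
```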

88.7 Diagonal, Lower, and Upper Parts

Write

$$ A = D + L + U, $$

where:

| Symbol | Meaning |
| --- | --- |
| $D$ | diagonal part of $A$ |
| $L$ | strictly lower triangular part |
| $U$ | strictly upper triangular part |

Different choices of $M$ produce different classical methods.

| Method | Choice of $M$ |
| --- | --- |
| Jacobi | $D$ |
| Gauss-Seidel | $D + L$ |
| SOR | $\frac{1}{\omega}D + L$ |

These methods are called stationary because the same iteration matrix is used at every step.

88.8 Jacobi Method

The Jacobi method solves each equation for its diagonal variable using values from the previous iteration.

For the system

Ax=b, Ax=b,

with entries $a_{ij}$, the update is

$$ x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j \ne i} a_{ij} x_j^{(k)} \right). $$

The Jacobi method corresponds to the splitting

$$ A = D - \bigl( -(L+U) \bigr). $$

Its iteration matrix is

$$ G_J = -D^{-1}(L+U). $$

Jacobi is simple and parallelizable because each component of $x^{(k+1)}$ uses only values from the previous iteration.

Its convergence may be slow.
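A minimal dense NumPy implementation of the Jacobi update might look as follows (the function name and stopping rule are illustrative choices):

```python
import numpy as np

def jacobi(A, b, x0, tol=1e-10, max_iter=1000):
    """Jacobi iteration: every component is updated from the previous iterate."""
    D = np.diag(A)                  # diagonal entries a_ii
    R = A - np.diagflat(D)          # off-diagonal part L + U
    x = x0.copy()
    for k in range(max_iter):
        x_new = (b - R @ x) / D     # x_i = (b_i - sum_{j != i} a_ij x_j) / a_ii
        if np.linalg.norm(x_new - x) <= tol:
            return x_new, k
        x = x_new
    return x, max_iter
```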

88.9 Gauss-Seidel Method

The Gauss-Seidel method updates components sequentially and immediately uses the newest available values.

The update is

$$ x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j < i} a_{ij} x_j^{(k+1)} - \sum_{j > i} a_{ij} x_j^{(k)} \right). $$

The method corresponds to the splitting

$$ M = D + L. $$

Its iteration matrix is

$$ G_{GS} = -(D+L)^{-1}U. $$

Gauss-Seidel often converges faster than Jacobi because it uses updated values as soon as they are available, though it remains relatively slow compared with modern Krylov methods.
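The corresponding Gauss-Seidel sweep can be sketched as follows; note how components updated earlier in the sweep are used immediately.

```python
import numpy as np

def gauss_seidel(A, b, x0, tol=1e-10, max_iter=1000):
    """Gauss-Seidel iteration: new values are used as soon as they are computed."""
    n = len(b)
    x = np.array(x0, dtype=float)
    for k in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            s_new = A[i, :i] @ x[:i]        # components already updated this sweep
            s_old = A[i, i+1:] @ x[i+1:]    # components from the previous sweep
            x[i] = (b[i] - s_new - s_old) / A[i, i]
        if np.linalg.norm(x - x_old) <= tol:
            return x, k
    return x, max_iter
```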

88.10 Successive Over-Relaxation

Successive over-relaxation, or SOR, modifies Gauss-Seidel by introducing a relaxation parameter $\omega$.

The update has the form

$$ x_i^{(k+1)} = (1-\omega)\, x_i^{(k)} + \frac{\omega}{a_{ii}} \left( b_i - \sum_{j < i} a_{ij} x_j^{(k+1)} - \sum_{j > i} a_{ij} x_j^{(k)} \right). $$

The parameter $\omega$ controls the step size.

| Range | Interpretation |
| --- | --- |
| $0 < \omega < 1$ | under-relaxation |
| $\omega = 1$ | Gauss-Seidel |
| $1 < \omega < 2$ | over-relaxation |

When $\omega$ is chosen well, SOR may converge much faster than Gauss-Seidel. A poor choice can slow convergence or cause divergence.
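In code, SOR differs from Gauss-Seidel by a single blending step; the sketch below exposes $\omega$ as a parameter (the default value here is only illustrative).

```python
import numpy as np

def sor(A, b, x0, omega=1.5, tol=1e-10, max_iter=1000):
    """SOR: blend the Gauss-Seidel update with the previous value using omega."""
    n = len(b)
    x = np.array(x0, dtype=float)
    for k in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            gs = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
            x[i] = (1 - omega) * x[i] + omega * gs   # omega = 1 recovers Gauss-Seidel
        if np.linalg.norm(x - x_old) <= tol:
            return x, k
    return x, max_iter
```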

88.11 Convergence Conditions

Convergence depends on the matrix and the method.

Some useful sufficient conditions are:

| Matrix property | Consequence |
| --- | --- |
| Strict diagonal dominance | Jacobi and Gauss-Seidel converge |
| Symmetric positive definiteness | Gauss-Seidel converges |
| Spectral radius less than one | fixed-point iteration converges |
| Poor conditioning | convergence may be slow |

Strict diagonal dominance means

$$ |a_{ii}| > \sum_{j \ne i} |a_{ij}| $$

for each row.

This condition says that each diagonal entry dominates the off-diagonal entries in its row.
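Checking the condition takes one comparison per row; a small illustrative NumPy helper is:

```python
import numpy as np

def is_strictly_diagonally_dominant(A):
    """Return True if |a_ii| > sum_{j != i} |a_ij| holds for every row."""
    diag = np.abs(np.diag(A))
    off_diag = np.sum(np.abs(A), axis=1) - diag
    return bool(np.all(diag > off_diag))
```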

88.12 Stopping Criteria

An iterative method must decide when to stop.

Common stopping criteria include:

| Criterion | Form |
| --- | --- |
| Absolute residual | $\lVert r^{(k)} \rVert \le \tau$ |
| Relative residual | $\frac{\lVert r^{(k)} \rVert}{\lVert b \rVert} \le \tau$ |
| Step size | $\lVert x^{(k+1)} - x^{(k)} \rVert \le \tau$ |
| Maximum iterations | $k \le k_{\max}$ |

The relative residual is commonly used because it accounts for the scale of $b$.

However, residual-based stopping must be interpreted carefully. If $A$ is ill-conditioned, a small residual may still correspond to a large solution error.
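These tests are often combined; a sketch of such a combined stopping rule (the function name and thresholds are illustrative) is:

```python
import numpy as np

def should_stop(A, b, x_new, x_old, k, tol=1e-8, max_iter=1000):
    """Stop on a small relative residual, a small step, or too many iterations."""
    residual_ok = np.linalg.norm(b - A @ x_new) <= tol * np.linalg.norm(b)
    step_ok = np.linalg.norm(x_new - x_old) <= tol
    return residual_ok or step_ok or k >= max_iter
```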

88.13 Stationary and Nonstationary Methods

Stationary methods use the same update rule at every iteration.

Examples:

| Stationary method | Feature |
| --- | --- |
| Jacobi | parallel, simple |
| Gauss-Seidel | sequential, often faster |
| SOR | relaxation parameter |
| SSOR | symmetric form useful as preconditioner |

Nonstationary methods adapt the iteration using information from previous steps.

Examples:

| Nonstationary method | Typical use |
| --- | --- |
| Conjugate gradient | symmetric positive definite systems |
| MINRES | symmetric indefinite systems |
| GMRES | nonsymmetric systems |
| BiCGSTAB | nonsymmetric systems |
| Chebyshev iteration | spectrum approximately known |

Modern large-scale solvers usually rely on nonstationary methods, especially Krylov subspace methods.

88.14 Krylov Subspaces

Krylov methods search for approximations in spaces generated by repeated multiplication by $A$.

Given an initial residual

$$ r^{(0)} = b - Ax^{(0)}, $$

the $k$-th Krylov subspace is

$$ \mathcal{K}_k(A, r^{(0)}) = \operatorname{span} \{ r^{(0)}, Ar^{(0)}, A^2 r^{(0)}, \ldots, A^{k-1} r^{(0)} \}. $$

A Krylov method chooses $x^{(k)}$ from

$$ x^{(0)} + \mathcal{K}_k(A, r^{(0)}). $$

This approach is powerful because it uses only matrix-vector products.

For sparse matrices, matrix-vector products are cheap compared with factorization.
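The following sketch builds the raw Krylov vectors with nothing but matrix-vector products; practical methods orthonormalize them (for example with the Arnoldi process) because the raw powers quickly become nearly linearly dependent.

```python
import numpy as np

def krylov_basis(A, r0, k):
    """Columns r0, A r0, ..., A^{k-1} r0 spanning K_k(A, r0) (not orthonormalized)."""
    vectors = [r0]
    for _ in range(k - 1):
        vectors.append(A @ vectors[-1])    # only matvecs are needed
    return np.column_stack(vectors)
```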

88.15 Conjugate Gradient Method

The conjugate gradient method solves

$$ Ax = b $$

when $A$ is symmetric positive definite.

It constructs approximations in Krylov subspaces and minimizes the error in the energy norm.

The method uses three main sequences:

| Sequence | Meaning |
| --- | --- |
| $x^{(k)}$ | approximate solution |
| $r^{(k)}$ | residual |
| $p^{(k)}$ | search direction |

The search directions are $A$-conjugate:

$$ (p^{(i)})^T A p^{(j)} = 0 $$

for $i \ne j$.

In exact arithmetic, conjugate gradient reaches the exact solution in at most $n$ iterations for an $n \times n$ system. In floating point arithmetic, it is used as an iterative method and often stops much earlier, once the residual is sufficiently small.

88.16 Basic Conjugate Gradient Algorithm

Given $x^{(0)}$, define

$$ r^{(0)} = b - Ax^{(0)}, \qquad p^{(0)} = r^{(0)}. $$

For $k = 0, 1, 2, \ldots$:

$$ \alpha_k = \frac{(r^{(k)})^T r^{(k)}}{(p^{(k)})^T A p^{(k)}}, $$

$$ x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)}, $$

$$ r^{(k+1)} = r^{(k)} - \alpha_k A p^{(k)}, $$

$$ \beta_k = \frac{(r^{(k+1)})^T r^{(k+1)}}{(r^{(k)})^T r^{(k)}}, $$

$$ p^{(k+1)} = r^{(k+1)} + \beta_k p^{(k)}. $$

The method requires one matrix-vector product per iteration.

This makes it attractive for large sparse symmetric positive definite systems.
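A compact NumPy version of these recurrences (an illustrative sketch, with a relative-residual stopping test added) is:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Conjugate gradient for symmetric positive definite A; one matvec per step."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    if max_iter is None:
        max_iter = len(b)
    for k in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```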

88.17 Convergence of Conjugate Gradient

The convergence of conjugate gradient depends on the eigenvalues of $A$.

For symmetric positive definite $A$, a standard estimate is

$$ \frac{\|x - x^{(k)}\|_A}{\|x - x^{(0)}\|_A} \le 2 \left( \frac{\sqrt{\kappa_2(A)} - 1}{\sqrt{\kappa_2(A)} + 1} \right)^k. $$

Here

$$ \|v\|_A = \sqrt{v^T A v}. $$

The estimate shows that convergence slows as the condition number grows.

Eigenvalue clustering can make convergence much faster than the worst-case bound suggests.
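The worst-case factor is easy to evaluate; for example, with $\kappa_2(A) = 100$ the bound after 20 iterations is already below 4 percent (a small illustrative computation):

```python
import numpy as np

def cg_error_bound(kappa, k):
    """Worst-case energy-norm error reduction after k CG steps, kappa = cond_2(A)."""
    q = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    return 2.0 * q**k

print(cg_error_bound(100.0, 20))   # about 0.036
```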

88.18 GMRES

GMRES stands for generalized minimal residual method.

It is designed for nonsymmetric systems.

At step $k$, GMRES chooses an approximation in the Krylov subspace that minimizes the residual norm:

$$ \|b - Ax^{(k)}\|_2. $$

GMRES is robust, but it stores an increasing number of basis vectors as $k$ grows. To control memory, practical implementations often use restarted GMRES.

Restarting reduces memory cost, but it may slow convergence.
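Restarted GMRES is available in SciPy; the sketch below builds an illustrative sparse nonsymmetric matrix and limits the stored basis to 30 vectors per cycle.

```python
import numpy as np
from scipy.sparse import identity, random as sparse_random
from scipy.sparse.linalg import gmres

# Illustrative sparse nonsymmetric matrix, shifted to keep it well behaved.
n = 1000
A = sparse_random(n, n, density=0.01, format="csr", random_state=0)
A = A + 4.0 * identity(n, format="csr")
b = np.ones(n)

x, info = gmres(A, b, restart=30)   # restart bounds the number of stored basis vectors
print("info:", info,
      "relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```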

88.19 Preconditioning

Preconditioning transforms the system into an equivalent system that is easier for an iterative method to solve.

Instead of solving

$$ Ax = b, $$

we choose a matrix $P$ that approximates $A$ but is easier to invert.

A left-preconditioned system is

$$ P^{-1}Ax = P^{-1}b. $$

A good preconditioner makes

$$ P^{-1}A $$

better conditioned or more favorable spectrally.

Preconditioning is often the difference between a method that converges quickly and one that does not converge in useful time.
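As a concrete sketch, SciPy's conjugate gradient accepts a preconditioner through its `M` argument. Here a simple Jacobi (diagonal) preconditioner is applied to an illustrative SPD matrix whose diagonal entries vary widely.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import LinearOperator, cg

# Illustrative SPD matrix with widely varying diagonal entries.
n = 2000
d = np.linspace(3.0, 1e4, n)
A = diags([d], [0], format="csr")
A = A + diags([-1.0, -1.0], [-1, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Jacobi (diagonal) preconditioner: applying P^{-1} just divides by the diagonal of A.
P_inv = LinearOperator((n, n), matvec=lambda v: v / d)

x, info = cg(A, b, M=P_inv)   # M plays the role of P^{-1}
print(info, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```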

88.20 Examples of Preconditioners

Common preconditioners include:

| Preconditioner | Main idea |
| --- | --- |
| Jacobi | use the diagonal of $A$ |
| SSOR | use symmetric relaxation |
| Incomplete LU | approximate LU factorization |
| Incomplete Cholesky | approximate Cholesky factorization |
| Multigrid | reduce errors across scales |
| Domain decomposition | split the problem into subdomains |

The best preconditioner depends on matrix structure.

There is no universally optimal choice.

88.21 Sparse Matrices

Iterative methods are closely tied to sparse matrices.

A sparse matrix has mostly zero entries. It is stored by recording only the nonzero entries and their locations.

The cost of one matrix-vector product is proportional to the number of nonzero entries:

$$ O(\operatorname{nnz}(A)). $$

This is much cheaper than dense multiplication when

$$ \operatorname{nnz}(A) \ll n^2. $$

For this reason, iterative methods can solve systems far larger than direct dense methods.
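A small SciPy example of sparse storage and a matrix-vector product (the entries are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 4x4 matrix with only 6 stored nonzeros, kept in CSR format.
rows = np.array([0, 0, 1, 2, 3, 3])
cols = np.array([0, 3, 1, 2, 0, 3])
vals = np.array([4.0, -1.0, 3.0, 5.0, -2.0, 6.0])
A = csr_matrix((vals, (rows, cols)), shape=(4, 4))

x = np.ones(4)
print(A.nnz)    # 6 stored nonzeros
print(A @ x)    # one matvec costs O(nnz(A)) operations
```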

88.22 Matrix-Free Methods

Some applications do not form $A$ explicitly.

Instead, they provide a function that computes

$$ v \mapsto Av. $$

Such methods are called matrix-free.

Krylov methods are well suited to matrix-free computation because they mainly require matrix-vector products.

This is important in:

| Application | Reason |
| --- | --- |
| PDE solvers | matrix may be too large to assemble |
| optimization | Hessian-vector products are available |
| inverse problems | operator defined by simulation |
| machine learning | Jacobian or Hessian products computed implicitly |

Matrix-free methods reduce storage cost and allow very large-scale computation.
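SciPy supports this style through `LinearOperator`: the solver only needs a function that applies the operator. The sketch below wraps a shifted 1D Laplacian that is never assembled (the operator and problem size are illustrative).

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 1_000_000

def matvec(v):
    """Apply a shifted 1D Laplacian (3 on the diagonal, -1 on the off-diagonals)."""
    out = 3.0 * v
    out[:-1] -= v[1:]
    out[1:] -= v[:-1]
    return out

A = LinearOperator((n, n), matvec=matvec)   # the matrix itself is never formed
b = np.ones(n)

x, info = cg(A, b)                          # CG only ever calls matvec
print(info, np.linalg.norm(b - matvec(x)) / np.linalg.norm(b))
```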

88.23 Scaling and Normalization

Scaling can improve iterative convergence.

Poorly scaled systems may have entries with vastly different magnitudes. This can slow convergence and worsen numerical behavior.

Common scaling techniques include:

| Technique | Goal |
| --- | --- |
| Row scaling | balance equations |
| Column scaling | balance variables |
| Diagonal equilibration | reduce magnitude disparities |
| Symmetric scaling | preserve symmetry |

Scaling is often applied before preconditioning.
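As one example, symmetric diagonal (Jacobi) equilibration rescales a symmetric matrix so its diagonal becomes ones; this is an illustrative sketch of that transformation.

```python
import numpy as np

def symmetric_diagonal_scaling(A, b):
    """Scale A to D^{-1/2} A D^{-1/2} and b to D^{-1/2} b, where D = diag(A)."""
    d = np.sqrt(np.abs(np.diag(A)))
    A_scaled = A / np.outer(d, d)
    b_scaled = b / d
    return A_scaled, b_scaled, d   # after solving the scaled system for y, recover x = y / d
```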

88.24 Parallelism

Iterative methods are often suitable for parallel computation.

Matrix-vector products can be parallelized across rows.

Vector updates can be parallelized componentwise.

Dot products require global reductions, which may become bottlenecks in distributed systems.

| Operation | Parallel behavior |
| --- | --- |
| Sparse matrix-vector product | good locality if partitioned well |
| Vector addition | highly parallel |
| Dot product | requires reduction |
| Preconditioner application | depends on preconditioner |

Designing scalable iterative solvers requires attention to communication cost, not only arithmetic cost.

88.25 Choosing an Iterative Method

A practical choice depends on matrix properties.

| Matrix type | Common method |
| --- | --- |
| Symmetric positive definite | Conjugate gradient |
| Symmetric indefinite | MINRES |
| Nonsymmetric | GMRES, BiCGSTAB |
| Very structured PDE matrix | multigrid |
| Ill-conditioned sparse matrix | Krylov method with preconditioning |
| Small dense matrix | direct factorization |

Before choosing a method, one should ask:

| Question | Why it matters |
| --- | --- |
| Is $A$ symmetric? | determines eligible methods |
| Is $A$ positive definite? | enables conjugate gradient |
| Is $A$ sparse? | favors iterative methods |
| Is a good preconditioner available? | controls convergence |
| How accurate must the result be? | determines stopping tolerance |

88.26 Failure Modes

Iterative methods may fail or perform poorly for several reasons.

| Failure mode | Cause |
| --- | --- |
| Divergence | iteration matrix spectral radius too large |
| Stagnation | residual stops decreasing |
| Slow convergence | poor conditioning or unfavorable spectrum |
| Loss of orthogonality | floating point error in the Krylov basis |
| Bad preconditioner | transformed system remains difficult |
| Poor stopping rule | iteration stops too early or too late |

Failure should be diagnosed through residual history, conditioning estimates, and knowledge of matrix structure.

88.27 Residual History

A residual history records values such as

$$ \|r^{(k)}\| $$

over iterations.

It helps identify the behavior of the method.

| Pattern | Interpretation |
| --- | --- |
| Steady decrease | normal convergence |
| Rapid early decrease, then stagnation | limited precision or a poor preconditioner |
| Oscillation | nonsymmetric or indefinite behavior |
| Growth | divergence or instability |
| Flat residual | no useful progress |

Residual history is one of the main diagnostic tools for iterative solvers.
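Many solvers let a callback record this history; with SciPy's CG, for example, one can collect $\|r^{(k)}\|$ after each iteration (an illustrative sketch) and plot it on a logarithmic scale.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

# Illustrative SPD test matrix.
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(500, 500), format="csr")
b = np.ones(500)

history = []
cg(A, b, callback=lambda xk: history.append(np.linalg.norm(b - A @ xk)))

# `history` now holds the residual norm per iteration; a log-scale plot reveals the pattern.
print(len(history), history[0], history[-1])
```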

88.28 Summary

Iterative methods solve linear systems by improving an approximate solution step by step.

They are essential for large sparse systems because they avoid dense factorization and often require only matrix-vector products.

The central objects are:

| Concept | Meaning |
| --- | --- |
| Residual | $r^{(k)} = b - Ax^{(k)}$ |
| Error | $e^{(k)} = x - x^{(k)}$ |
| Iteration matrix | controls stationary convergence |
| Spectral radius | determines fixed-point convergence |
| Krylov subspace | search space for modern methods |
| Preconditioner | transforms the system for faster convergence |

Classical stationary methods include Jacobi, Gauss-Seidel, and SOR. Modern large-scale methods include conjugate gradient, GMRES, MINRES, and BiCGSTAB. In practice, preconditioning is often the decisive ingredient.