
Chapter 91. Conjugate Gradient Method

The conjugate gradient method is an iterative method for solving linear systems

Ax = b

where A is symmetric positive definite. It is one of the most important Krylov subspace methods in numerical linear algebra.

The method combines three ideas:

Idea | Role
Quadratic minimization | interprets Ax = b as an optimization problem
Conjugate directions | avoids undoing previous progress
Krylov subspaces | builds approximations using matrix-vector products

For large sparse symmetric positive definite systems, conjugate gradient is often the standard first method to try. It avoids matrix factorization and requires mainly matrix-vector products, vector updates, and inner products. In exact arithmetic, it reaches the exact solution in at most n steps for an n × n system, but in floating point arithmetic it is used as an iterative method and is usually stopped when the residual is small enough.

91.1 Symmetric Positive Definite Systems

The conjugate gradient method applies to systems

Ax = b

where A is symmetric positive definite.

Symmetric means

A^T = A.

Positive definite means

x^T A x > 0

for every nonzero vector x.

These assumptions are essential. They imply that A has positive eigenvalues, defines an inner product, and determines a strictly convex quadratic function.

Many applied problems produce symmetric positive definite matrices, including finite difference methods, finite element methods, least squares normal equations, graph Laplacian reductions, and quadratic optimization problems.

91.2 Linear Systems as Quadratic Minimization

If A is symmetric positive definite, solving

Ax = b

is equivalent to minimizing the quadratic function

\phi(x) = \frac{1}{2} x^T A x - b^T x.

The gradient is

\nabla \phi(x) = Ax - b.

Thus the stationary condition

\nabla \phi(x) = 0

is exactly

Ax = b.

Since A is positive definite, the quadratic function is strictly convex. Therefore it has a unique minimizer, and that minimizer is the unique solution of the linear system.
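
As a quick numerical check of this equivalence, the sketch below (a minimal NumPy example; the matrix and right-hand side are made up for illustration) solves a small symmetric positive definite system directly, confirms that the gradient of \phi vanishes at the solution, and verifies that \phi increases at nearby points.

import numpy as np

# A small symmetric positive definite system (values chosen only for illustration).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

def phi(x):
    # Quadratic objective phi(x) = 1/2 x^T A x - b^T x.
    return 0.5 * x @ A @ x - b @ x

x_star = np.linalg.solve(A, b)          # exact solution of Ax = b
grad = A @ x_star - b                   # gradient of phi at x_star

print("gradient at solution:", grad)    # ~ [0, 0]

# phi should be larger at every perturbed point, since x_star is the unique minimizer.
rng = np.random.default_rng(0)
for _ in range(5):
    d = rng.standard_normal(2)
    assert phi(x_star + 0.1 * d) > phi(x_star)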

91.3 Residual and Gradient

At an approximate solution x^{(k)}, define the residual

r^{(k)} = b - Ax^{(k)}.

The gradient of \phi is

\nabla \phi(x^{(k)}) = Ax^{(k)} - b.

Therefore,

r^{(k)} = -\nabla \phi(x^{(k)}).

The residual is the negative gradient of the quadratic objective. It points in the steepest descent direction.

Steepest descent uses the residual direction at each step. Conjugate gradient improves on steepest descent by choosing directions that are mutually conjugate with respect to A.

91.4 The Energy Inner Product

For symmetric positive definite A, define the A-inner product by

\langle u, v \rangle_A = u^T A v.

The associated norm is

\|u\|_A = \sqrt{u^T A u}.

This norm is called the energy norm.

Two vectors p and q are A-conjugate if

p^T A q = 0.

This is orthogonality measured in the geometry induced by A, rather than ordinary Euclidean geometry.
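
The following sketch (again assuming NumPy and an illustrative matrix) computes the A-inner product and energy norm, and constructs a vector that is A-conjugate to a given one by removing its A-projection.

import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # SPD example matrix (illustrative)

def a_inner(u, v):
    # A-inner product <u, v>_A = u^T A v.
    return u @ A @ v

def a_norm(u):
    # Energy norm ||u||_A = sqrt(u^T A u).
    return np.sqrt(a_inner(u, u))

p = np.array([1.0, 0.0])
# Make q A-conjugate to p by subtracting the A-projection of a trial vector onto p.
t = np.array([0.0, 1.0])
q = t - (a_inner(p, t) / a_inner(p, p)) * p

print("p^T A q =", a_inner(p, q))          # ~ 0, so p and q are A-conjugate
print("||p||_A =", a_norm(p))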

91.5 Conjugate Directions

Suppose we have nonzero directions

p^{(0)}, p^{(1)}, \ldots, p^{(m-1)}

that are mutually A-conjugate:

(p^{(i)})^T A p^{(j)} = 0 \qquad i \ne j.

If we minimize \phi successively along these directions, each step does not interfere with the progress made in previous directions.

This is the key advantage over steepest descent.

Steepest descent may zigzag across elongated level sets. Conjugate directions account for the geometry of the quadratic surface and avoid repeated correction in the same effective direction.

91.6 One-Dimensional Minimization

Given a current approximation x^{(k)} and a search direction p^{(k)}, choose

x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)}.

We choose \alpha_k to minimize

\phi(x^{(k)} + \alpha p^{(k)}).

Differentiate with respect to \alpha:

\frac{d}{d\alpha} \phi(x^{(k)} + \alpha p^{(k)}) = (p^{(k)})^T (Ax^{(k)} - b) + \alpha (p^{(k)})^T A p^{(k)}.

Since

r^{(k)} = b - Ax^{(k)},

setting the derivative to zero gives

\alpha_k = \frac{(p^{(k)})^T r^{(k)}}{(p^{(k)})^T A p^{(k)}}.

In the standard conjugate gradient recurrence, (p^{(k)})^T r^{(k)} = (r^{(k)})^T r^{(k)}, because the exact line search at the previous step makes r^{(k)} orthogonal to p^{(k-1)}, so this becomes

\alpha_k = \frac{(r^{(k)})^T r^{(k)}}{(p^{(k)})^T A p^{(k)}}.
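
The sketch below (NumPy, made-up data) carries out one exact line search with this \alpha_k and checks that the new residual is orthogonal to the search direction, which is precisely the zero-derivative condition above.

import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # SPD example matrix (illustrative)
b = np.array([1.0, 2.0])

x = np.zeros(2)                            # current approximation
r = b - A @ x                              # residual
p = r.copy()                               # first search direction is the residual

alpha = (r @ r) / (p @ A @ p)              # exact line search along p
x_new = x + alpha * p
r_new = b - A @ x_new

# The derivative of phi along p vanishes at the minimizer, i.e. p^T r_new = 0.
print("p^T r_new =", p @ r_new)            # ~ 0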

91.7 Residual Update

After choosing \alpha_k, the new approximation is

x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)}.

The residual updates as

r^{(k+1)} = b - Ax^{(k+1)}.

Substitute the update for x^{(k+1)}:

r^{(k+1)} = b - A(x^{(k)} + \alpha_k p^{(k)}) = b - Ax^{(k)} - \alpha_k A p^{(k)}.

Therefore,

r^{(k+1)} = r^{(k)} - \alpha_k A p^{(k)}.

This update reuses the product Ap^{(k)} already computed for \alpha_k, so the new residual is obtained without a second matrix-vector product.

91.8 Constructing the Next Direction

The next search direction is formed from the new residual and the previous direction:

p^{(k+1)} = r^{(k+1)} + \beta_k p^{(k)}.

The coefficient \beta_k is chosen so that

(p^{(k+1)})^T A p^{(k)} = 0.

For the standard conjugate gradient method,

\beta_k = \frac{(r^{(k+1)})^T r^{(k+1)}}{(r^{(k)})^T r^{(k)}}.

Thus the method stores only the current approximation, residual, search direction, and one matrix-vector product.

91.9 The Basic Algorithm

Choose an initial vector x^{(0)}.

Set

r^{(0)} = b - Ax^{(0)}, \qquad p^{(0)} = r^{(0)}.

For

k = 0, 1, 2, \ldots

compute

\alpha_k = \frac{(r^{(k)})^T r^{(k)}}{(p^{(k)})^T A p^{(k)}}.

Then update

x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)}.

Update the residual:

r^{(k+1)} = r^{(k)} - \alpha_k A p^{(k)}.

If the residual is small enough, stop.

Otherwise compute

\beta_k = \frac{(r^{(k+1)})^T r^{(k+1)}}{(r^{(k)})^T r^{(k)}}.

Then update

p^{(k+1)} = r^{(k+1)} + \beta_k p^{(k)}.

This is the standard coupled two-term recurrence form of conjugate gradient.

91.10 Pseudocode

conjugate_gradient(A, b, x, tol, max_iter):
    r = b - A * x
    p = r
    rsold = dot(r, r)

    for k = 0 to max_iter - 1:
        Ap = A * p
        alpha = rsold / dot(p, Ap)

        x = x + alpha * p
        r = r - alpha * Ap

        rsnew = dot(r, r)

        if sqrt(rsnew) / norm(b) <= tol:
            return x

        beta = rsnew / rsold
        p = r + beta * p
        rsold = rsnew

    return x

Each iteration requires:

Operation | Count
Matrix-vector product | 1
Dot products | 2
Vector updates | several
Stored vectors | few

The dominant cost for sparse matrices is usually the matrix-vector product.
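
A minimal runnable NumPy realization of the pseudocode is sketched below; the function name cg and its default arguments are illustrative rather than taken from any particular library, and A is assumed to support the @ matrix-vector product.

import numpy as np

def cg(A, b, x0=None, tol=1e-8, max_iter=1000):
    """Conjugate gradient for SPD A; a direct transcription of the pseudocode above."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    b_norm = np.linalg.norm(b) or 1.0        # guard against b = 0

    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)            # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) / b_norm <= tol:  # relative residual test
            break
        beta = rs_new / rs_old               # beta from new and old residual norms
        p = r + beta * p
        rs_old = rs_new
    return x

# Usage on a small made-up SPD system.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(A, b)
print(x, np.linalg.norm(b - A @ x))

Because only the @ product of A and a few vector operations are used, the same function should also accept a scipy.sparse matrix without changes.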

91.11 Orthogonality of Residuals

In exact arithmetic, the residuals produced by conjugate gradient are mutually orthogonal:

(r^{(i)})^T r^{(j)} = 0 \qquad i \ne j.

This means each iteration removes an error component in a new independent direction.

Since there can be at most n mutually orthogonal nonzero vectors in \mathbb{R}^n, the method reaches zero residual in at most n steps in exact arithmetic. This is the finite termination property.

91.12 Conjugacy of Search Directions

The search directions are mutually A-conjugate:

(p^{(i)})^T A p^{(j)} = 0 \qquad i \ne j.

Mutual A-conjugacy of nonzero directions implies linear independence, so this is stronger than ordinary independence.

It means the directions are orthogonal under the energy inner product.

Because the objective function has Hessian A, A-conjugacy is exactly the orthogonality needed to minimize the quadratic efficiently.
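
The sketch below (NumPy, with a random symmetric positive definite test matrix built for illustration) runs n CG steps, collects residuals and directions, and checks that R^T R and P^T A P are numerically diagonal, which is the orthogonality and conjugacy just stated.

import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M.T @ M + n * np.eye(n)         # random SPD matrix (illustrative)
b = rng.standard_normal(n)

x = np.zeros(n)
r = b - A @ x
p = r.copy()
residuals, directions = [r.copy()], [p.copy()]

for _ in range(n):                   # at most n steps in exact arithmetic
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x += alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new
    residuals.append(r.copy())
    directions.append(p.copy())

R = np.column_stack(residuals[:-1])
P = np.column_stack(directions[:-1])
# Off-diagonal entries should be near zero: orthogonal residuals, A-conjugate directions.
print("max |off-diag of R^T R|  :", np.max(np.abs(np.triu(R.T @ R, 1))))
print("max |off-diag of P^T A P|:", np.max(np.abs(np.triu(P.T @ A @ P, 1))))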

91.13 Krylov Subspace Interpretation

The k-th Krylov subspace is

\mathcal{K}_k(A, r^{(0)}) = \operatorname{span}\{r^{(0)}, Ar^{(0)}, A^2 r^{(0)}, \ldots, A^{k-1} r^{(0)}\}.

The conjugate gradient iterate satisfies

x^{(k)} \in x^{(0)} + \mathcal{K}_k(A, r^{(0)}).

Among all vectors in this affine Krylov space, x^{(k)} minimizes the error in the energy norm:

\|x - x^{(k)}\|_A.

This Krylov interpretation explains why the method only needs matrix-vector products.

91.14 Polynomial Approximation View

Write the error as

e^{(k)} = x - x^{(k)}.

Because x^{(k)} - x^{(0)} lies in \mathcal{K}_k(A, r^{(0)}) and r^{(0)} = A e^{(0)}, the error after k steps can be written as

e^{(k)} = q_k(A) e^{(0)}

for a polynomial q_k of degree at most k satisfying

q_k(0) = 1.

The method chooses a polynomial that makes the error small over the eigenvalues of A.

Thus convergence depends not only on the largest and smallest eigenvalues, but also on the distribution of all eigenvalues.

Clustered eigenvalues often lead to rapid convergence.

91.15 Convergence Bound

For symmetric positive definite A, a standard bound is

\frac{\|x - x^{(k)}\|_A}{\|x - x^{(0)}\|_A} \le 2 \left( \frac{\sqrt{\kappa_2(A)} - 1}{\sqrt{\kappa_2(A)} + 1} \right)^k.

Here

\kappa_2(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}

is the spectral condition number.

This bound shows that convergence slows as the condition number grows, but the rate depends on \sqrt{\kappa_2(A)} rather than on \kappa_2(A) itself. This square-root dependence is one reason conjugate gradient is often much faster than Jacobi or Gauss-Seidel for symmetric positive definite systems.
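
A tiny sketch (NumPy, hypothetical condition number) that evaluates this bound: it computes the per-step contraction factor and the number of iterations the bound guarantees for a chosen error reduction.

import numpy as np

kappa = 100.0                                        # hypothetical condition number
rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)    # per-iteration factor in the bound

target = 1e-6                                        # desired relative error reduction
# Smallest k with 2 * rho**k <= target, from the bound above.
k = int(np.ceil(np.log(target / 2) / np.log(rho)))
print("contraction factor per step:", rho)
print("iterations guaranteed by the bound:", k)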

91.16 Effect of Eigenvalue Distribution

The condition number gives a useful worst-case estimate, but it does not tell the full story.

Conjugate gradient may converge rapidly when eigenvalues are clustered, even if the condition number is large.

Intuitively, each iteration builds a polynomial that is small at the eigenvalues of A. If many eigenvalues lie in clusters, a low-degree polynomial can be small on most of them.

Thus two matrices with the same condition number may produce different convergence histories.
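
The experiment sketched below (NumPy; both spectra are made up) builds two diagonal matrices with the same condition number, one with eigenvalues spread over the whole interval and one with eigenvalues clustered at a few values, and counts CG iterations to the same tolerance.

import numpy as np

def cg_iterations(A, b, tol=1e-10, max_iter=10_000):
    # Compact CG loop that returns the number of iterations to reach the tolerance.
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for k in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            return k + 1
        p = r + (rs_new / rs) * p
        rs = rs_new
    return max_iter

n = 200
rng = np.random.default_rng(0)
b = rng.standard_normal(n)

spread = np.diag(np.linspace(1.0, 1000.0, n))                    # eigenvalues spread over [1, 1000]
clustered = np.diag(np.r_[np.full(n - 2, 1.0), 500.0, 1000.0])   # mostly 1, plus two outliers

# Both matrices have condition number 1000, but the clustered spectrum needs far fewer steps.
print("spread    :", cg_iterations(spread, b), "iterations")
print("clustered :", cg_iterations(clustered, b), "iterations")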

91.17 Preconditioned Conjugate Gradient

Preconditioning replaces the original system by an equivalent one that has better spectral properties.

Choose a symmetric positive definite matrix M that approximates A and is easy to solve with.

The preconditioned method uses

M z^{(k)} = r^{(k)}.

The vector z^{(k)} is the preconditioned residual.

The goal is to make the effective system better conditioned. In practice, preconditioning is often necessary for fast convergence of conjugate gradient.

91.18 Preconditioned Algorithm

Choose x^{(0)}.

Set

r^{(0)} = b - Ax^{(0)}.

Solve

M z^{(0)} = r^{(0)}.

Set

p^{(0)} = z^{(0)}.

For k = 0, 1, 2, \ldots:

\alpha_k = \frac{(r^{(k)})^T z^{(k)}}{(p^{(k)})^T A p^{(k)}},

x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)},

r^{(k+1)} = r^{(k)} - \alpha_k A p^{(k)}.

Solve

M z^{(k+1)} = r^{(k+1)},

then compute

\beta_k = \frac{(r^{(k+1)})^T z^{(k+1)}}{(r^{(k)})^T z^{(k)}},

p^{(k+1)} = z^{(k+1)} + \beta_k p^{(k)}.
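
A minimal runnable sketch of this iteration with a Jacobi (diagonal) preconditioner follows; the function name pcg_jacobi and the badly scaled test matrix are illustrative, and NumPy is assumed.

import numpy as np

def pcg_jacobi(A, b, tol=1e-8, max_iter=1000):
    """Preconditioned CG with M = diag(A); follows the iteration above."""
    d = np.diag(A)                      # Jacobi preconditioner: solving Mz = r is a division
    x = np.zeros_like(b)
    r = b - A @ x
    z = r / d
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b) or 1.0

    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) / b_norm <= tol:
            break
        z = r / d                       # apply the preconditioner
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x

# Badly scaled SPD test problem where Jacobi scaling helps (values made up).
n = 100
scales = np.logspace(0, 4, n)
A = np.diag(scales) + 0.1 * np.ones((n, n))   # SPD: dominant diagonal plus a small rank-one term
b = np.ones(n)
x = pcg_jacobi(A, b)
print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))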

91.19 Common Preconditioners

Common preconditioners include:

Preconditioner | Main idea
Jacobi | use the diagonal of A
Incomplete Cholesky | approximate Cholesky factorization
SSOR | symmetric relaxation-based preconditioner
Multigrid | solve error components across scales
Domain decomposition | split problem into subdomains
Algebraic multigrid | build hierarchy from matrix structure

A good preconditioner must balance two costs:

Requirement | Meaning
Strong approximation | reduces iteration count
Cheap application | keeps each iteration inexpensive

A preconditioner that is too expensive may erase the benefit of faster convergence.

91.20 Floating Point Behavior

In exact arithmetic, residuals remain mutually orthogonal and directions remain A-conjugate.

In floating point arithmetic, these properties gradually degrade.

Consequences include:

Exact arithmetic property | Floating point behavior
finite termination in at most n steps | rarely exact termination
mutually orthogonal residuals | orthogonality is gradually lost
mutually A-conjugate directions | conjugacy is gradually lost
exact residual recurrence | recursive residual may drift

For this reason, implementations sometimes recompute the true residual

b - Ax^{(k)}

occasionally, rather than relying only on the recurrence.

91.21 Stopping Criteria

Common stopping rules use the relative residual:

\frac{\|r^{(k)}\|_2}{\|b\|_2} \le \tau.

Other choices include:

Criterion | Form
relative residual | \|r^{(k)}\| / \|b\|
preconditioned residual | (r^{(k)})^T z^{(k)}
estimated error | bound using condition estimate
maximum iterations | fixed safety limit
stagnation | residual no longer decreases

A small residual must still be interpreted in light of the conditioning of A. For an ill-conditioned system, a small residual may still correspond to a large error in the solution.

91.22 Computational Cost

For a sparse matrix with \operatorname{nnz}(A) nonzero entries, one conjugate gradient iteration costs approximately

O(\operatorname{nnz}(A)).

The method stores only a small number of vectors:

Vector | Meaning
x | approximate solution
r | residual
p | search direction
Ap | matrix-vector product
z | preconditioned residual, if used

This low storage cost makes the method suitable for very large sparse systems.

91.23 Comparison with Steepest Descent

Steepest descent uses

p^{(k)} = r^{(k)}

at every step.

Conjugate gradient uses

p^{(k+1)} = r^{(k+1)} + \beta_k p^{(k)}.

Feature | Steepest descent | Conjugate gradient
Direction | residual only | residual plus previous direction
Geometry | Euclidean steepest direction | A-conjugate directions
Typical convergence | slow on elongated quadratics | much faster
Memory | very low | very low
Matrix requirement | SPD for standard form | SPD

Conjugate gradient can be viewed as steepest descent with memory adjusted to the quadratic geometry.
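
The comparison can be made concrete with the sketch below (NumPy; the ill-conditioned diagonal test matrix is made up), which runs both methods from the same loop and counts iterations to the same relative residual.

import numpy as np

def solve(A, b, use_cg, tol=1e-8, max_iter=100_000):
    # One loop for both methods: use_cg toggles the conjugate correction of the direction.
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    for k in range(max_iter):
        Ap = A @ p
        alpha = (r @ p) / (p @ Ap)        # exact line search (valid for both methods)
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) <= tol * np.linalg.norm(b):
            return k + 1
        if use_cg:
            beta = (r_new @ r_new) / (r @ r)
            p = r_new + beta * p          # conjugate gradient direction
        else:
            p = r_new                     # steepest descent direction
        r = r_new
    return max_iter

A = np.diag(np.linspace(1.0, 1000.0, 50))   # condition number 1000 (illustrative)
b = np.ones(50)
print("steepest descent  :", solve(A, b, use_cg=False), "iterations")
print("conjugate gradient:", solve(A, b, use_cg=True), "iterations")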

91.24 Comparison with Direct Methods

Direct methods factor the matrix.

Conjugate gradient does not.

Feature | Direct factorization | Conjugate gradient
Matrix type | many types | SPD only
Work | predictable | depends on convergence
Memory | may be large due to fill-in | low
Sparse systems | fill-in may dominate | often efficient
Accuracy | high if stable | depends on tolerance and conditioning
Reuse for many right-hand sides | strong | weaker unless preconditioner reused

For a single large sparse SPD system, conjugate gradient is often preferable. For many right-hand sides with the same matrix, a direct factorization may become competitive.

91.25 Failure Modes

Conjugate gradient may fail or perform poorly when its assumptions are violated or when convergence is too slow.

Failure mode | Cause
denominator nonpositive | A not positive definite
slow convergence | large condition number
stagnation | poor preconditioner or roundoff
loss of conjugacy | floating point error
residual drift | recursive residual differs from true residual
breakdown in PCG | preconditioner not SPD

If A is symmetric indefinite, MINRES is often more appropriate. If A is nonsymmetric, GMRES or BiCGSTAB is usually used.

91.26 Practical Checklist

Before using conjugate gradient, check:

Question | Reason
Is A symmetric? | required by standard CG
Is A positive definite? | required for positive denominators and convexity
Is A sparse or matrix-free? | favors CG
Is a good preconditioner available? | controls iteration count
Is the stopping tolerance meaningful? | residual does not equal error
Is the matrix scaled reasonably? | improves numerical behavior

For many applications, the most important implementation decision is the preconditioner.
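
For the first two items of the checklist, a short sketch (NumPy assumed) tests symmetry directly and positive definiteness through an attempted Cholesky factorization.

import numpy as np

def is_symmetric(A, rtol=1e-12):
    # Symmetry check up to a relative tolerance.
    return np.allclose(A, A.T, rtol=rtol, atol=0.0)

def is_positive_definite(A):
    # Cholesky succeeds exactly when a symmetric matrix is positive definite.
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

A = np.array([[4.0, 1.0], [1.0, 3.0]])            # illustrative matrix
print(is_symmetric(A), is_positive_definite(A))   # True True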

91.27 Summary

The conjugate gradient method solves symmetric positive definite systems

Ax = b

by minimizing the quadratic

\phi(x) = \frac{1}{2} x^T A x - b^T x

over expanding Krylov subspaces.

Its basic recurrence is:

r^{(0)} = b - Ax^{(0)}, \qquad p^{(0)} = r^{(0)},

\alpha_k = \frac{(r^{(k)})^T r^{(k)}}{(p^{(k)})^T A p^{(k)}},

x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)},

r^{(k+1)} = r^{(k)} - \alpha_k A p^{(k)},

\beta_k = \frac{(r^{(k+1)})^T r^{(k+1)}}{(r^{(k)})^T r^{(k)}},

p^{(k+1)} = r^{(k+1)} + \beta_k p^{(k)}.

In exact arithmetic, it terminates in at most n steps. In practice, it is used as a fast iterative method. Its effectiveness depends strongly on conditioning, eigenvalue distribution, and preconditioning.