The conjugate gradient method is an iterative method for solving linear systems $Ax = b$
where $A$ is symmetric positive definite. It is one of the most important Krylov subspace methods in numerical linear algebra.
The method combines three ideas:
| Idea | Role |
|---|---|
| Quadratic minimization | interprets $Ax = b$ as an optimization problem |
| Conjugate directions | avoids undoing previous progress |
| Krylov subspaces | builds approximations using matrix-vector products |
For large sparse symmetric positive definite systems, conjugate gradient is often the standard first method to try. It avoids matrix factorization and requires mainly matrix-vector products, vector updates, and inner products. In exact arithmetic, it reaches the exact solution in at most $n$ steps for an $n \times n$ system, but in floating point arithmetic it is used as an iterative method and is usually stopped when the residual is small enough.
91.1 Symmetric Positive Definite Systems
The conjugate gradient method applies to systems
$$Ax = b,$$
where $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite.
Symmetric means $A^T = A$.
Positive definite means
$$x^T A x > 0$$
for every nonzero vector $x$.
These assumptions are essential. They imply that $A$ has positive eigenvalues, that $x^T A y$ defines an inner product, and that $A$ determines a strictly convex quadratic function.
Many applied problems produce symmetric positive definite matrices, including finite difference methods, finite element methods, least squares normal equations, graph Laplacian reductions, and quadratic optimization problems.
91.2 Linear Systems as Quadratic Minimization
If $A$ is symmetric positive definite, solving
$$Ax = b$$
is equivalent to minimizing the quadratic function
$$\phi(x) = \tfrac{1}{2} x^T A x - b^T x.$$
The gradient is
$$\nabla \phi(x) = Ax - b.$$
Thus the stationary condition $\nabla \phi(x) = 0$ is exactly $Ax = b$.
Since $A$ is positive definite, the quadratic function is strictly convex. Therefore it has a unique minimizer, and that minimizer is the unique solution of the linear system.
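As a quick numerical illustration (a minimal sketch in NumPy; the small SPD matrix and right-hand side are arbitrary examples, not taken from the text), the solution of the linear system is indeed the minimizer of the quadratic: $\phi$ is strictly larger at every perturbed point.

```python
import numpy as np

# Small SPD matrix and right-hand side (illustrative values only).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

phi = lambda x: 0.5 * x @ A @ x - b @ x   # quadratic objective

x_star = np.linalg.solve(A, b)            # solution of A x = b

# phi is strictly larger at perturbed points than at x_star.
rng = np.random.default_rng(0)
perturbed = [x_star + 0.1 * rng.standard_normal(2) for _ in range(5)]
print(all(phi(x) > phi(x_star) for x in perturbed))   # True
```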
91.3 Residual and Gradient
At an approximate solution $x_k$, define the residual
$$r_k = b - A x_k.$$
The gradient of $\phi$ is $\nabla \phi(x_k) = A x_k - b$.
Therefore,
$$r_k = -\nabla \phi(x_k).$$
The residual is the negative gradient of the quadratic objective. It points in the steepest descent direction.
Steepest descent uses the residual direction at each step. Conjugate gradient improves on steepest descent by choosing directions that are mutually conjugate with respect to $A$.
91.4 The Energy Inner Product
For symmetric positive definite $A$, define the $A$-inner product by
$$\langle u, v \rangle_A = u^T A v.$$
The associated norm is
$$\|u\|_A = \sqrt{u^T A u}.$$
This norm is called the energy norm.
Two vectors $u$ and $v$ are $A$-conjugate if
$$u^T A v = 0.$$
This is orthogonality measured in the geometry induced by $A$, rather than ordinary Euclidean geometry.
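A small numerical illustration (a sketch with an arbitrary SPD example matrix) of how $A$-conjugacy differs from ordinary orthogonality:

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # SPD example matrix

def a_inner(u, v):
    """Energy (A-) inner product <u, v>_A = u^T A v."""
    return u @ A @ v

u = np.array([1.0, 0.0])
v = np.array([-1.0, 4.0])   # chosen so that u^T A v = 4*(-1) + 1*4 = 0

print(a_inner(u, v))   # 0.0  -> u and v are A-conjugate
print(u @ v)           # -1.0 -> but not orthogonal in the Euclidean sense
```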
91.5 Conjugate Directions
Suppose we have nonzero directions $p_0, p_1, \dots, p_{k-1}$
that are mutually $A$-conjugate:
$$p_i^T A p_j = 0 \quad \text{for } i \neq j.$$
If we minimize $\phi$ successively along these directions, each step does not interfere with the progress made in previous directions.
This is the key advantage over steepest descent.
Steepest descent may zigzag across elongated level sets. Conjugate directions account for the geometry of the quadratic surface and avoid repeated correction in the same effective direction.
91.6 One-Dimensional Minimization
Given a current approximation $x_k$ and a search direction $p_k$, choose
$$x_{k+1} = x_k + \alpha_k p_k.$$
We choose $\alpha_k$ to minimize $\phi(x_k + \alpha p_k)$ as a function of $\alpha$.
Differentiate with respect to $\alpha$:
$$\frac{d}{d\alpha}\,\phi(x_k + \alpha p_k) = p_k^T\bigl(A(x_k + \alpha p_k) - b\bigr) = \alpha\, p_k^T A p_k - p_k^T r_k.$$
Since $r_k = b - A x_k$,
setting the derivative to zero gives
$$\alpha_k = \frac{p_k^T r_k}{p_k^T A p_k}.$$
In the standard conjugate gradient recurrence, $p_k^T r_k = r_k^T r_k$, so this becomes
$$\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}.$$
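A quick numerical check of this step length (a sketch reusing the same small SPD example as above, with the first step where $p = r$): the closed-form $\alpha$ is no worse than any nearby step length along the same direction.

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
phi = lambda x: 0.5 * x @ A @ x - b @ x   # quadratic objective

x = np.zeros(2)          # current approximation
r = b - A @ x            # residual
p = r.copy()             # search direction (first CG step uses p = r)

alpha = (r @ r) / (p @ A @ p)             # closed-form step length

# phi along the line x + a*p is a 1-D quadratic minimized at a = alpha.
trial = alpha + np.linspace(-0.2, 0.2, 9)
print(all(phi(x + a * p) >= phi(x + alpha * p) for a in trial))   # True
```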
91.7 Residual Update
After choosing $\alpha_k$, the new approximation is
$$x_{k+1} = x_k + \alpha_k p_k.$$
The residual updates as
$$r_{k+1} = b - A x_{k+1}.$$
Substitute the update for $x_{k+1}$:
$$r_{k+1} = b - A(x_k + \alpha_k p_k).$$
Therefore,
$$r_{k+1} = r_k - \alpha_k A p_k.$$
This update avoids recomputing $b - A x_{k+1}$ from scratch; it reuses the product $A p_k$ already needed for $\alpha_k$.
91.8 Constructing the Next Direction
The next search direction is formed from the new residual and the previous direction:
$$p_{k+1} = r_{k+1} + \beta_k p_k.$$
The coefficient $\beta_k$ is chosen so that
$$p_{k+1}^T A p_k = 0.$$
For the standard conjugate gradient method,
$$\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}.$$
Thus the method stores only the current approximation, residual, search direction, and one matrix-vector product.
91.9 The Basic Algorithm
Choose an initial vector $x_0$.
Set
$$r_0 = b - A x_0, \qquad p_0 = r_0.$$
For $k = 0, 1, 2, \dots$
compute
$$\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}.$$
Then update
$$x_{k+1} = x_k + \alpha_k p_k.$$
Update the residual:
$$r_{k+1} = r_k - \alpha_k A p_k.$$
If the residual is small enough, stop.
Otherwise compute
$$\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}.$$
Then update
$$p_{k+1} = r_{k+1} + \beta_k p_k.$$
This coupled recurrence for the iterate, residual, and search direction is the standard form of the conjugate gradient method.
91.10 Pseudocode
In Python with NumPy, the method can be written as:

```python
import numpy as np

def conjugate_gradient(A, b, x, tol, max_iter):
    r = b - A @ x
    p = r.copy()
    rsold = r @ r
    for k in range(max_iter):
        Ap = A @ p
        alpha = rsold / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r @ r
        if np.sqrt(rsnew) / np.linalg.norm(b) <= tol:
            return x
        beta = rsnew / rsold
        p = r + beta * p
        rsold = rsnew
    return x
```

Each iteration requires:
| Operation | Count |
|---|---|
| Matrix-vector product | 1 |
| Dot products | 2 |
| Vector updates | several |
| Stored vectors | few |
The dominant cost for sparse matrices is usually the matrix-vector product.
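As a usage sketch (assuming the conjugate_gradient routine above; the SPD test matrix here is an arbitrary example built for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = M.T @ M + n * np.eye(n)        # M^T M is positive semidefinite; adding n*I makes A SPD
b = rng.standard_normal(n)

x = conjugate_gradient(A, b, np.zeros(n), tol=1e-8, max_iter=1000)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # small relative residual
```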
91.11 Orthogonality of Residuals
In exact arithmetic, the residuals produced by conjugate gradient are mutually orthogonal:
$$r_i^T r_j = 0 \quad \text{for } i \neq j.$$
This means each iteration removes an error component in a new independent direction.
Since there can be at most $n$ mutually orthogonal nonzero vectors in $\mathbb{R}^n$, the method reaches zero residual in at most $n$ steps in exact arithmetic. This is the finite termination property.
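A small numerical check of this orthogonality (a sketch that runs a few steps of the recurrence while storing the residuals; the SPD test matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M.T @ M + n * np.eye(n)       # SPD test matrix
b = rng.standard_normal(n)

x = np.zeros(n)
r = b - A @ x
p = r.copy()
residuals = [r.copy()]
for _ in range(4):                 # a few steps are enough to see the pattern
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
    residuals.append(r.copy())

R = np.array(residuals)
G = R @ R.T                        # Gram matrix of the stored residuals
offdiag = G - np.diag(np.diag(G))
print(np.max(np.abs(offdiag)))     # close to zero: residuals are mutually orthogonal
```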
91.12 Conjugacy of Search Directions
The search directions are mutually $A$-conjugate:
$$p_i^T A p_j = 0 \quad \text{for } i \neq j.$$
This is stronger than ordinary independence.
It means the directions are orthogonal under the energy inner product.
Because the objective function has Hessian $A$, $A$-conjugacy is exactly the orthogonality needed to minimize the quadratic efficiently.
91.13 Krylov Subspace Interpretation
The $k$-th Krylov subspace is
$$\mathcal{K}_k(A, r_0) = \operatorname{span}\{r_0, A r_0, A^2 r_0, \dots, A^{k-1} r_0\}.$$
The conjugate gradient iterate satisfies
$$x_k \in x_0 + \mathcal{K}_k(A, r_0).$$
Among all vectors in this affine Krylov space, $x_k$ minimizes the error in the energy norm:
$$\|x_k - x_*\|_A = \min_{x \,\in\, x_0 + \mathcal{K}_k(A, r_0)} \|x - x_*\|_A, \qquad x_* = A^{-1} b.$$
This Krylov interpretation explains why the method only needs matrix-vector products.
91.14 Polynomial Approximation View
Since
$$x_k \in x_0 + \mathcal{K}_k(A, r_0),$$
the error after $k$ steps can be written as
$$x_k - x_* = P_k(A)\,(x_0 - x_*)$$
for a polynomial $P_k$ of degree at most $k$ satisfying $P_k(0) = 1$.
The method chooses a polynomial that makes the error small over the eigenvalues of $A$.
Thus convergence depends not only on the largest and smallest eigenvalues, but also on the distribution of all eigenvalues.
Clustered eigenvalues often lead to rapid convergence.
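The following sketch illustrates this effect (a toy experiment with diagonal test matrices whose spectra are prescribed; both have condition number 1000):

```python
import numpy as np

def cg_iterations(A, b, tol=1e-10, max_iter=5000):
    """Run plain CG and return the iteration count at which the tolerance is met."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rsold = r @ r
    for k in range(max_iter):
        Ap = A @ p
        alpha = rsold / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r @ r
        if np.sqrt(rsnew) / np.linalg.norm(b) <= tol:
            return k + 1
        p = r + (rsnew / rsold) * p
        rsold = rsnew
    return max_iter

n = 200
rng = np.random.default_rng(2)
b = rng.standard_normal(n)

# Same condition number (1000), different eigenvalue distributions.
spread = np.diag(np.linspace(1.0, 1000.0, n))                          # spread-out spectrum
clustered = np.diag(np.concatenate([np.full(n - 1, 1.0), [1000.0]]))   # one outlier, rest clustered

print(cg_iterations(spread, b), cg_iterations(clustered, b))
# Typically many iterations for the spread spectrum, only a handful for the clustered one.
```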
91.15 Convergence Bound
For symmetric positive definite $A$, a standard bound is
$$\|x_k - x_*\|_A \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{k} \|x_0 - x_*\|_A.$$
Here
$$\kappa = \kappa_2(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$$
is the spectral condition number.
This bound shows that convergence becomes slower as the condition number grows. It also explains why conjugate gradient is often much faster than Jacobi or Gauss-Seidel for well-preconditioned symmetric positive definite systems.
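As a rough worked example of the bound: with $\kappa(A) = 100$, the per-iteration reduction factor is
$$\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} = \frac{10 - 1}{10 + 1} = \frac{9}{11} \approx 0.82,$$
so reducing the energy-norm error by a factor of $10^6$ requires $2\,(9/11)^k \le 10^{-6}$, which holds for $k$ on the order of 70 to 75 iterations.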
91.16 Effect of Eigenvalue Distribution
The condition number gives a useful worst-case estimate, but it does not tell the full story.
Conjugate gradient may converge rapidly when eigenvalues are clustered, even if the condition number is large.
Intuitively, each iteration builds a polynomial that is small at the eigenvalues of $A$. If many eigenvalues lie in clusters, a low-degree polynomial can be small on most of them.
Thus two matrices with the same condition number may produce different convergence histories.
91.17 Preconditioned Conjugate Gradient
Preconditioning replaces the original system by an equivalent one that has better spectral properties.
Choose a symmetric positive definite matrix $M$ that approximates $A$ and for which systems of the form $Mz = r$ are easy to solve.
The preconditioned method uses the vectors
$$z_k = M^{-1} r_k,$$
computed by solving $M z_k = r_k$.
The vector $z_k$ is the preconditioned residual.
The goal is to make the effective system better conditioned. In practice, preconditioning is often necessary for fast convergence of conjugate gradient.
91.18 Preconditioned Algorithm
Choose $x_0$.
Set
$$r_0 = b - A x_0.$$
Solve
$$M z_0 = r_0.$$
Set
$$p_0 = z_0.$$
For $k = 0, 1, 2, \dots$:
$$\alpha_k = \frac{r_k^T z_k}{p_k^T A p_k}, \qquad x_{k+1} = x_k + \alpha_k p_k, \qquad r_{k+1} = r_k - \alpha_k A p_k.$$
Solve
$$M z_{k+1} = r_{k+1},$$
then set
$$\beta_k = \frac{r_{k+1}^T z_{k+1}}{r_k^T z_k}, \qquad p_{k+1} = z_{k+1} + \beta_k p_k.$$
Stop when the residual is small enough.
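A minimal NumPy sketch of this loop, assuming the preconditioner is supplied as a callable apply_Minv that solves $Mz = r$ (the names and interface here are illustrative, not a standard API):

```python
import numpy as np

def preconditioned_cg(A, b, apply_Minv, x0=None, tol=1e-8, max_iter=1000):
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    z = apply_Minv(r)              # solve M z = r
    p = z.copy()
    rz_old = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) / np.linalg.norm(b) <= tol:
            return x
        z = apply_Minv(r)          # solve M z = r for the new residual
        rz_new = r @ z
        beta = rz_new / rz_old
        p = z + beta * p
        rz_old = rz_new
    return x
```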
91.19 Common Preconditioners
Common preconditioners include:
| Preconditioner | Main idea |
|---|---|
| Jacobi | use the diagonal of $A$ |
| Incomplete Cholesky | approximate Cholesky factorization |
| SSOR | symmetric relaxation-based preconditioner |
| Multigrid | solve error components across scales |
| Domain decomposition | split problem into subdomains |
| Algebraic multigrid | build hierarchy from matrix structure |
A good preconditioner must balance two costs:
| Requirement | Meaning |
|---|---|
| Strong approximation | reduces iteration count |
| Cheap application | keeps each iteration inexpensive |
A preconditioner that is too expensive may erase the benefit of faster convergence.
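For instance, the Jacobi preconditioner takes $M = \operatorname{diag}(A)$, so applying $M^{-1}$ costs one division per entry: it is almost free to apply and often helps when the diagonal of $A$ varies widely. A sketch (assuming the preconditioned_cg routine sketched in 91.18; the test matrix is an arbitrary SPD example):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
B = rng.standard_normal((n, n))
A = B.T @ B + np.diag(rng.uniform(1.0, 100.0, n))    # SPD matrix with a widely varying diagonal
b = rng.standard_normal(n)

d = np.diag(A).copy()                                 # Jacobi preconditioner M = diag(A)
x = preconditioned_cg(A, b, lambda r: r / d)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))  # small relative residual
```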
91.20 Floating Point Behavior
In exact arithmetic, residuals remain mutually orthogonal and directions remain $A$-conjugate.
In floating point arithmetic, these properties gradually degrade.
Consequences include:
| Exact arithmetic property | Floating point behavior |
|---|---|
| finite termination in at most $n$ steps | rarely exact termination |
| mutually orthogonal residuals | orthogonality is gradually lost |
| mutually $A$-conjugate directions | conjugacy is gradually lost |
| exact residual recurrence | recursive residual may drift |
For this reason, implementations sometimes recompute the true residual
$$r_k = b - A x_k$$
occasionally, rather than relying only on the recurrence.
91.21 Stopping Criteria
Common stopping rules use the relative residual:
$$\frac{\|r_k\|_2}{\|b\|_2} \le \mathrm{tol}.$$
Other choices include:
| Criterion | Form |
|---|---|
| relative residual | $\|r_k\|_2 / \|b\|_2 \le \mathrm{tol}$ |
| preconditioned residual | $\sqrt{r_k^T z_k} \le \mathrm{tol}$ |
| estimated error | bound using condition estimate |
| maximum iterations | fixed safety limit |
| stagnation | residual no longer decreases |
A small residual must still be interpreted in light of conditioning. For ill-conditioned systems, a small residual may correspond to a much larger solution error.
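One standard way to quantify this is the bound
$$\frac{\|x_k - x_*\|_2}{\|x_*\|_2} \;\le\; \kappa_2(A)\,\frac{\|r_k\|_2}{\|b\|_2},$$
so, for example, with $\kappa_2(A) = 10^8$ a relative residual of $10^{-10}$ guarantees only about two correct digits in the computed solution.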
91.22 Computational Cost
For a sparse matrix with $\mathrm{nnz}(A)$ nonzero entries, one conjugate gradient iteration costs approximately $O(\mathrm{nnz}(A))$ operations for the matrix-vector product plus a small multiple of $n$ for the vector updates and inner products.
The method stores only a small number of vectors:
| Vector | Meaning |
|---|---|
| $x_k$ | approximate solution |
| $r_k$ | residual |
| $p_k$ | search direction |
| $A p_k$ | matrix-vector product |
| $z_k$ | preconditioned residual, if used |
This low storage cost makes the method suitable for very large sparse systems.
91.23 Comparison with Steepest Descent
Steepest descent uses the update
$$x_{k+1} = x_k + \alpha_k r_k,$$
searching along the residual at every step.
Conjugate gradient uses
$$p_{k+1} = r_{k+1} + \beta_k p_k,$$
searching along directions that combine the new residual with the previous direction.
| Feature | Steepest descent | Conjugate gradient |
|---|---|---|
| Direction | residual only | residual plus previous direction |
| Geometry | Euclidean steepest direction | $A$-conjugate directions |
| Typical convergence | slow on elongated quadratics | much faster |
| Memory | very low | very low |
| Matrix requirement | SPD for standard form | SPD |
Conjugate gradient can be viewed as steepest descent with memory adjusted to the quadratic geometry.
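The difference is easy to see numerically. The sketch below (an arbitrary diagonal test problem with condition number 100; both loops are written inline for the comparison) counts iterations to reach a fixed residual tolerance:

```python
import numpy as np

A = np.diag([1.0, 100.0])          # elongated quadratic, condition number 100
b = np.array([1.0, 1.0])
tol = 1e-8

def steepest_descent_steps(max_iter=100000):
    x = np.zeros(2)
    for k in range(max_iter):
        r = b - A @ x
        if np.linalg.norm(r) <= tol:
            return k
        alpha = (r @ r) / (r @ A @ r)   # exact line search along the residual
        x = x + alpha * r
    return max_iter

def cg_steps(max_iter=100000):
    x = np.zeros(2)
    r = b - A @ x
    p = r.copy()
    rsold = r @ r
    for k in range(max_iter):
        if np.sqrt(rsold) <= tol:
            return k
        Ap = A @ p
        alpha = rsold / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r @ r
        p = r + (rsnew / rsold) * p
        rsold = rsnew
    return max_iter

# Steepest descent typically needs many iterations here; CG finishes in 2,
# since the matrix has only two distinct eigenvalues.
print(steepest_descent_steps(), cg_steps())
```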
91.24 Comparison with Direct Methods
Direct methods factor the matrix.
Conjugate gradient does not.
| Feature | Direct factorization | Conjugate gradient |
|---|---|---|
| Matrix type | many types | SPD only |
| Work | predictable | depends on convergence |
| Memory | may be large due to fill-in | low |
| Sparse systems | fill-in may dominate | often efficient |
| Accuracy | high if stable | depends on tolerance and conditioning |
| Reuse for many right-hand sides | strong | weaker unless preconditioner reused |
For a single large sparse SPD system, conjugate gradient is often preferable. For many right-hand sides with the same matrix, a direct factorization may become competitive.
91.25 Failure Modes
Conjugate gradient may fail or perform poorly when its assumptions are violated or when convergence is too slow.
| Failure mode | Cause |
|---|---|
| denominator $p_k^T A p_k$ nonpositive | $A$ not positive definite |
| slow convergence | large condition number |
| stagnation | poor preconditioner or roundoff |
| loss of conjugacy | floating point error |
| residual drift | recursive residual differs from true residual |
| breakdown in PCG | preconditioner not SPD |
If $A$ is symmetric indefinite, MINRES is often more appropriate. If $A$ is nonsymmetric, GMRES or BiCGSTAB is usually used.
91.26 Practical Checklist
Before using conjugate gradient, check:
| Question | Reason |
|---|---|
| Is $A$ symmetric? | required by standard CG |
| Is $A$ positive definite? | required for positive denominators and convexity |
| Is $A$ sparse or matrix-free? | favors CG |
| Is a good preconditioner available? | controls iteration count |
| Is the stopping tolerance meaningful? | residual does not equal error |
| Is the matrix scaled reasonably? | improves numerical behavior |
For many applications, the most important implementation decision is the preconditioner.
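A minimal pre-flight check (a sketch; practical only for dense matrices, since attempting a Cholesky factorization of a huge sparse system defeats the purpose of an iterative method):

```python
import numpy as np

def looks_spd(A, tol=1e-12):
    """Cheap sanity check before running CG on a dense matrix."""
    if not np.allclose(A, A.T, atol=tol):
        return False                    # not symmetric
    try:
        np.linalg.cholesky(A)           # succeeds only if A is positive definite
        return True
    except np.linalg.LinAlgError:
        return False

A = np.array([[4.0, 1.0], [1.0, 3.0]])
print(looks_spd(A))                     # True
```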
91.27 Summary
The conjugate gradient method solves symmetric positive definite systems
$$Ax = b$$
by minimizing the quadratic
$$\phi(x) = \tfrac{1}{2} x^T A x - b^T x$$
over expanding Krylov subspaces.
Its basic recurrence is:
$$\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}, \quad x_{k+1} = x_k + \alpha_k p_k, \quad r_{k+1} = r_k - \alpha_k A p_k, \quad \beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}, \quad p_{k+1} = r_{k+1} + \beta_k p_k.$$
In exact arithmetic, it terminates in at most $n$ steps. In practice, it is used as a fast iterative method. Its effectiveness depends strongly on conditioning, eigenvalue distribution, and preconditioning.