
Appendix F. Numerical Computation

Linear algebra has an exact theory and a numerical practice. The exact theory treats numbers as exact objects. The numerical practice works with finite representations on a computer.

This distinction matters. A matrix identity may be true in exact arithmetic, while its computed version may contain rounding error. Numerical computation studies how to obtain useful answers despite finite precision, limited memory, and finite time.

Most modern numerical computing uses floating-point arithmetic. Floating-point systems represent real numbers with a finite number of digits and an exponent, similar to scientific notation. The IEEE 754 standard defines common formats, rounding behavior, infinities, and special values such as NaN.

F.1 Exact Arithmetic and Floating-Point Arithmetic

In exact arithmetic, the expression

\frac{1}{3}+\frac{1}{3}+\frac{1}{3}

equals

1.

On a computer using finite decimal or binary representation, the number 1/3 usually cannot be represented exactly. The stored value is rounded. After arithmetic operations, additional rounding may occur.

Floating-point arithmetic represents numbers in the form

\pm m \times \beta^e,

where m is the significand, \beta is the base, and e is the exponent. In most hardware arithmetic, the base is 2. Floating-point numbers can represent very small and very large quantities, but only finitely many values are representable.

Thus numerical computation replaces exact real arithmetic by approximate arithmetic.
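These rounding effects are easy to observe directly. A short Python check (assuming IEEE double precision, as used by CPython's `float`):

```python
from fractions import Fraction

# 1/3 has no finite binary expansion, so the stored double is rounded.
third = 1.0 / 3.0
print(Fraction(third) == Fraction(1, 3))   # False: the stored value is not 1/3

# The classic illustration: 0.1 and 0.2 are already rounded on input,
# so their computed sum is not the double nearest to 0.3.
print(0.1 + 0.2 == 0.3)                    # False
print(abs((0.1 + 0.2) - 0.3))              # tiny but nonzero
```

`Fraction(third)` recovers the exact rational value of the stored double, which makes the rounding visible without any further approximation.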

F.2 Rounding

When an exact real number cannot be represented in a floating-point system, it is rounded to a nearby representable number.

If x is a real number, let

\operatorname{fl}(x)

denote the floating-point number obtained by rounding x.

For a basic operation \circ, such as addition or multiplication, one often models floating-point arithmetic as

\operatorname{fl}(a \circ b) = (a \circ b)(1+\delta),

where

|\delta| \leq u.

The number u is called the unit roundoff. It measures the relative size of one rounding error.

This model is not a complete description of every floating-point situation, but it is useful for analyzing many algorithms.
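The model can be checked empirically for addition. The sketch below (Python, assuming IEEE doubles with round-to-nearest; `relative_rounding_error` is an illustrative helper, not a library routine) recovers \delta exactly via rational arithmetic:

```python
from fractions import Fraction

u = 2.0 ** -53  # unit roundoff for IEEE double with round-to-nearest

def relative_rounding_error(a: float, b: float) -> float:
    """Return |delta| in fl(a+b) = (a+b)(1+delta), computed exactly."""
    exact = Fraction(a) + Fraction(b)   # exact rational sum of the stored inputs
    computed = Fraction(a + b)          # the rounded floating-point sum
    if exact == 0:
        return 0.0
    return abs(float((computed - exact) / exact))

# The observed relative error never exceeds the unit roundoff.
for a, b in [(0.1, 0.2), (1.0, 1e-16), (1e10, 3.7)]:
    assert relative_rounding_error(a, b) <= u
```

Note that the bound applies to the stored inputs a and b; any rounding that happened when they were first produced is a separate error term.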

F.3 Absolute Error and Relative Error

Let x be the exact value and let \widehat{x} be a computed approximation.

The absolute error is

|x-\widehat{x}|.

The relative error is

\frac{|x-\widehat{x}|}{|x|}

when x \neq 0.

Absolute error measures error in the original units. Relative error measures error compared with the size of the true value.

For example, an error of 10^{-3} is small when approximating 1000, but large when approximating 10^{-5}. Relative error captures this distinction.
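The same comparison can be written out directly. The helper names below are illustrative:

```python
def abs_error(x: float, xhat: float) -> float:
    return abs(x - xhat)

def rel_error(x: float, xhat: float) -> float:
    return abs(x - xhat) / abs(x)

# The same absolute error of 1e-3 at two very different scales:
print(rel_error(1000.0, 1000.0 + 1e-3))   # about 1e-6: a good approximation
print(rel_error(1e-5, 1e-5 + 1e-3))       # about 100: a useless approximation
```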

F.4 Significant Digits

A computed value has many correct significant digits when its relative error is small.

Roughly, if

\frac{|x-\widehat{x}|}{|x|} \approx 10^{-k},

then \widehat{x} has about k correct decimal digits.

For example, if

x = 3.14159265

and

\widehat{x} = 3.1416,

then the approximation has several correct leading digits.

Significant digits are a practical way to describe numerical quality, but error bounds are more precise.

F.5 Cancellation

Cancellation occurs when two nearly equal numbers are subtracted.

For example,

1.000001 - 1.000000 = 0.000001.

If the inputs have already been rounded, their leading digits cancel, and the remaining result may contain few accurate digits.

Cancellation is dangerous when the result is small relative to the operands. It often appears in formulas that subtract nearly equal quantities.

Example

The expression

\sqrt{x+1}-\sqrt{x}

suffers cancellation when x is large. A more stable form is obtained by multiplying by the conjugate:

\sqrt{x+1}-\sqrt{x} = \frac{1}{\sqrt{x+1}+\sqrt{x}}.

Both expressions are algebraically equal in exact arithmetic, but the second is usually better numerically for large x.
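The difference is dramatic in double precision. In the sketch below (the names `naive` and `stable` are illustrative), x = 10^{16} is chosen so that x + 1 rounds back to x, and a 50-digit `decimal` computation serves as a reference:

```python
import math
from decimal import Decimal, getcontext

def naive(x: float) -> float:
    return math.sqrt(x + 1) - math.sqrt(x)        # cancels for large x

def stable(x: float) -> float:
    return 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

x = 1e16
# At x = 1e16, the double x + 1 rounds back to x, so the naive form returns 0.
print(naive(x))    # 0.0
print(stable(x))   # about 5e-9, essentially the true value

# A 50-digit Decimal reference confirms the stable form.
getcontext().prec = 50
ref = (Decimal(x) + 1).sqrt() - Decimal(x).sqrt()
print(ref)
```

The stable form never subtracts nearly equal quantities, so no leading digits are lost.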

F.6 Conditioning

Conditioning is a property of a problem, not of an algorithm.

A problem is well-conditioned if small changes in the input cause small changes in the output. It is ill-conditioned if small input changes can cause large output changes.

For a scalar function f, the relative condition number at x is often measured by

\kappa(x) = \left| \frac{x f'(x)}{f(x)} \right|,

when the expression is defined.

A large condition number means the problem is sensitive. Even a perfect algorithm cannot recover information that has been lost through input uncertainty.

In linear systems, conditioning is measured by the condition number of the matrix.
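Two scalar examples make the formula concrete. The function names below are illustrative, and the derivatives are written in by hand:

```python
# Condition numbers from the formula kappa(x) = |x f'(x) / f(x)|.

def kappa_sqrt(x: float) -> float:
    # f(x) = sqrt(x), f'(x) = 1/(2 sqrt(x)), so kappa = 1/2 for all x > 0.
    return abs(x * (0.5 / x ** 0.5) / x ** 0.5)

def kappa_shift(x: float) -> float:
    # f(x) = x - 1, f'(x) = 1, so kappa = |x / (x - 1)|: large near x = 1.
    return abs(x / (x - 1.0))

print(kappa_sqrt(100.0))     # 0.5: taking square roots is well-conditioned
print(kappa_shift(1.0001))   # about 1e4: subtracting 1 near x = 1 is ill-conditioned
```

The second example is conditioning and cancellation seen from the same angle: subtracting nearly equal quantities is an ill-conditioned problem.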

F.7 Matrix Condition Number

For an invertible matrix A, the condition number with respect to a matrix norm is

\kappa(A) = \|A\|\,\|A^{-1}\|.

If \kappa(A) is close to 1, the system

Ax = b

is well-conditioned. If \kappa(A) is large, small perturbations in A or b may cause large changes in the solution x.

The condition number explains why some linear systems are difficult to solve accurately. The difficulty comes from the geometry of the problem itself.
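For a 2x2 matrix the definition can be evaluated directly from the explicit inverse. A tiny illustration in the infinity norm (the function names are illustrative; real software uses library routines and estimates rather than forming the inverse):

```python
# Infinity-norm condition number of a 2x2 matrix: kappa(A) = ||A|| * ||A^{-1}||.

def inf_norm_2x2(m):
    return max(abs(m[0][0]) + abs(m[0][1]), abs(m[1][0]) + abs(m[1][1]))

def cond_inf_2x2(m):
    a, b = m[0]
    c, d = m[1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return inf_norm_2x2(m) * inf_norm_2x2(inv)

well = [[2.0, 0.0], [0.0, 1.0]]       # kappa = 2: well-conditioned
ill  = [[1.0, 1.0], [1.0, 1.0001]]    # rows nearly parallel: kappa ~ 4e4
print(cond_inf_2x2(well))
print(cond_inf_2x2(ill))
```

The ill-conditioned example has nearly parallel rows, which is the geometric source of the sensitivity.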

F.8 Stability

Stability is a property of an algorithm.

An algorithm is stable if it does not greatly amplify rounding errors. An unstable algorithm may give poor answers even for well-conditioned problems.

It is useful to separate two questions.

| Question | Concept |
| --- | --- |
| Is the problem sensitive to input changes? | Conditioning |
| Does the algorithm control rounding error? | Stability |

A well-conditioned problem solved by a stable algorithm usually gives an accurate answer. An ill-conditioned problem may give inaccurate answers even when the algorithm is stable.

F.9 Forward Error and Backward Error

Forward error compares the computed answer with the exact answer.

If x is the exact solution and \widehat{x} is the computed solution, the forward error is

\|x-\widehat{x}\|.

Backward error asks a different question: for what nearby problem is the computed answer exact?

For a linear system, if

A\widehat{x} \neq b,

then backward error measures how much A or b must be changed so that \widehat{x} becomes an exact solution.

A backward stable algorithm produces the exact solution to a nearby problem. This is a strong and useful guarantee.

F.10 Residuals

For a computed solution \widehat{x} to

Ax = b,

the residual is

r = b - A\widehat{x}.

If r = 0, then \widehat{x} solves the system exactly.

A small residual means

A\widehat{x}

is close to b. However, a small residual does not always imply a small error in \widehat{x}. If A is ill-conditioned, a small residual may still correspond to a large solution error.

Residuals are easy to compute and widely used as stopping criteria in iterative methods.
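The gap between a small residual and a small error is easy to exhibit. In the sketch below, an ill-conditioned 2x2 system with exact solution (1, 1) gives a candidate that is off by 1 in each component yet has a residual of size about 10^{-4}:

```python
# Small residual, large error: an ill-conditioned 2x2 system.

A = [[1.0, 1.0], [1.0, 1.0001]]
b = [2.0, 2.0001]        # the exact solution is x = (1, 1)

def residual(A, x, b):
    return [b[i] - (A[i][0] * x[0] + A[i][1] * x[1]) for i in range(2)]

xhat = [2.0, 0.0]        # wrong by 1 in each component
r = residual(A, xhat, b)
print(r)                 # both components have magnitude at most about 1e-4
```

Geometrically, the two equations describe nearly parallel lines, so many very different points lie close to both of them.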

F.11 Norms in Numerical Computation

Norms measure the size of vectors and matrices.

Common vector norms include:

| Norm | Formula |
| --- | --- |
| 1-norm | \lVert x \rVert_1 = \sum_i \lvert x_i \rvert |
| 2-norm | \lVert x \rVert_2 = \sqrt{\sum_i x_i^2} |
| \infty-norm | \lVert x \rVert_\infty = \max_i \lvert x_i \rvert |

Common matrix norms include:

| Norm | Meaning |
| --- | --- |
| 1-norm | Maximum absolute column sum |
| \infty-norm | Maximum absolute row sum |
| 2-norm | Largest singular value |
| Frobenius norm | Square root of the sum of squared entries |

Norms are used to measure error, residuals, perturbations, and convergence.
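The vector norms in the table above translate directly into code:

```python
import math

# The three standard vector norms, written out directly.

def norm1(x):
    return sum(abs(v) for v in x)

def norm2(x):
    return math.sqrt(sum(v * v for v in x))

def norm_inf(x):
    return max(abs(v) for v in x)

x = [3.0, -4.0]
print(norm1(x), norm2(x), norm_inf(x))   # 7.0 5.0 4.0
```

(For long or extreme-magnitude vectors, library implementations such as `math.hypot` scale the 2-norm computation to avoid overflow and underflow in the squares.)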

F.12 Machine Epsilon

Machine epsilon is commonly used to describe floating-point precision. Informally, it is the distance from 1 to the next larger representable floating-point number in a given format.

For IEEE double precision, machine epsilon is approximately

2.22 \times 10^{-16}.

For IEEE single precision, it is approximately

1.19 \times 10^{-7}.

Unit roundoff is often half of machine epsilon when rounding to nearest is used. Authors sometimes use the terms differently, so one should check the convention in a given text.
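In Python (assuming IEEE double precision), machine epsilon is available directly, and its defining behavior near 1.0 can be checked:

```python
import sys

# Machine epsilon for IEEE double: the gap from 1.0 to the next double.
eps = sys.float_info.epsilon
print(eps)                     # 2.220446049250313e-16, i.e. 2**-52

assert eps == 2.0 ** -52
assert 1.0 + eps > 1.0         # eps is the smallest gap visible at 1.0
assert 1.0 + eps / 4 == 1.0    # below half an ulp, the sum rounds back to 1.0
```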

F.13 Overflow, Underflow, and NaN

Floating-point systems have finite range.

Overflow occurs when a result is too large to represent. Underflow occurs when a result is too small in magnitude to represent normally.

IEEE-style floating-point arithmetic also includes special values:

| Value | Meaning |
| --- | --- |
| +\infty | Positive overflow or exact infinite result |
| -\infty | Negative overflow or exact infinite result |
| NaN | Not a Number |
| Signed zero | Distinguishes +0 and -0 |
| Subnormal numbers | Very small numbers with reduced precision |

These values are part of practical numerical computation. They allow many operations to continue after exceptional events, but they also require care.
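All of these special values can be produced and inspected from Python (assuming IEEE double precision):

```python
import math

# Overflow saturates to infinity.
print(1e308 * 10)                # inf

# Underflow: results below the subnormal range flush to zero.
print(5e-324 / 2)                # 0.0 (5e-324 is the smallest positive double)

# NaN propagates through arithmetic and never compares equal, even to itself.
nan = float("nan")
print(nan == nan)                # False
print(math.isnan(nan + 1.0))     # True

# Signed zeros compare equal but carry a sign that some operations observe.
print(0.0 == -0.0)               # True
print(math.copysign(1.0, -0.0))  # -1.0
```

The `nan != nan` behavior in particular trips up naive equality checks; tests for NaN should use `math.isnan`.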

F.14 Scaling

Scaling changes the magnitude of a problem without changing its essential mathematical content.

For example, a linear system may be difficult to solve numerically if one row contains entries near 10^{12} and another row contains entries near 10^{-12}. Row or column scaling can reduce the range of magnitudes.

Scaling is used to reduce overflow, underflow, and loss of precision. It is also used before computing norms, factorizations, and eigenvalues.

Good numerical software often applies scaling internally.

F.15 Pivoting

Gaussian elimination may fail or become inaccurate if it divides by a small pivot.

Partial pivoting swaps rows so that a larger pivot is used. In each column, the algorithm selects an entry of large magnitude below or at the current pivot position and swaps that row into place.

Pivoting improves numerical stability. It also avoids division by zero when a nonzero pivot can be found by row exchange.

In exact arithmetic, row exchanges may seem like bookkeeping. In floating-point arithmetic, they are a stability mechanism.
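The stability effect can be seen in a compact sketch of elimination with partial pivoting (library routines such as LAPACK's factorizations do this with blocking and far more care; the function name below is illustrative):

```python
# Gaussian elimination with partial pivoting, as a compact sketch.

def solve_partial_pivoting(A, b):
    n = len(A)
    A = [row[:] for row in A]   # work on copies
    b = b[:]
    for k in range(n):
        # Pick the largest-magnitude entry in column k, on or below the diagonal,
        # and swap its row into the pivot position.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

# The classic tiny-pivot example: eliminating on the 1e-20 pivot without a
# row swap destroys all accuracy; with pivoting the answer is fine.
x = solve_partial_pivoting([[1e-20, 1.0], [1.0, 1.0]], [1.0, 2.0])
print(x)   # close to [1.0, 1.0]
```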

F.16 Dense and Sparse Computation

A dense matrix stores most of its entries explicitly. A sparse matrix has many zero entries and stores only the nonzero structure.

Dense algorithms are appropriate when most entries are nonzero. Sparse algorithms are appropriate when the matrix has few nonzero entries relative to its size.

| Matrix type | Storage | Typical algorithms |
| --- | --- | --- |
| Dense | All entries | LU, QR, SVD |
| Sparse | Nonzero entries and indices | Sparse LU, iterative methods, Krylov methods |

Sparse computation is essential for large graphs, finite element methods, optimization, and scientific simulations.
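A minimal sketch shows the storage idea: keep only the nonzero entries, here in a {(row, col): value} dictionary (real libraries use formats such as CSR), so a matrix-vector product touches each nonzero once:

```python
# Sparse matrix-vector product over nonzeros only: cost O(m), not O(n^2).

def sparse_matvec(nonzeros, x, n):
    y = [0.0] * n
    for (i, j), v in nonzeros.items():
        y[i] += v * x[j]
    return y

# A 4x4 matrix with just five nonzero entries.
A = {(0, 0): 2.0, (1, 1): 3.0, (2, 2): 4.0, (3, 3): 5.0, (0, 3): 1.0}
print(sparse_matvec(A, [1.0, 1.0, 1.0, 1.0], 4))   # [3.0, 3.0, 4.0, 5.0]
```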

F.17 Direct and Iterative Methods

A direct method solves a problem by a finite sequence of algebraic operations. Gaussian elimination, LU decomposition, QR decomposition, and Cholesky decomposition are direct methods.

An iterative method starts from an initial approximation and improves it repeatedly. Jacobi, Gauss-Seidel, conjugate gradient, GMRES, and power iteration are iterative methods.

| Method class | Strength | Limitation |
| --- | --- | --- |
| Direct | Predictable, accurate for moderate size | Can require large memory |
| Iterative | Scales to large sparse problems | Needs convergence control |

The choice depends on matrix size, sparsity, conditioning, and required accuracy.
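As a sketch of the iterative idea, here is Jacobi iteration on a small diagonally dominant system (for which Jacobi is guaranteed to converge); the function name and iteration count are illustrative:

```python
# Jacobi iteration: repeatedly solve each equation for its diagonal unknown,
# using the previous iterate for all the other unknowns.

def jacobi(A, b, iterations=50):
    n = len(A)
    x = [0.0] * n
    for _ in range(iterations):
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]   # strictly diagonally dominant
b = [1.0, 2.0]
x = jacobi(A, b)
print(x)   # converges toward the exact solution (1/11, 7/11)
```

In practice the fixed iteration count would be replaced by a residual-based stopping test, which is where tolerances (Section F.23) enter.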

F.18 Complexity

Computational complexity estimates the cost of an algorithm.

For dense n \times n matrices, classical matrix multiplication costs

O(n^3)

arithmetic operations.

Gaussian elimination also costs

O(n^3)

operations for dense square systems.

Matrix-vector multiplication with a dense n \times n matrix costs

O(n^2).

If the matrix is sparse with m nonzero entries, matrix-vector multiplication costs approximately

O(m).

These estimates describe scaling behavior. Actual runtime also depends on memory access, cache behavior, parallelism, and hardware.

F.19 Memory and Data Movement

Modern numerical computation is often limited by memory movement rather than arithmetic.

A processor can perform many arithmetic operations per second, but moving data between memory hierarchy levels can be expensive. Efficient algorithms use blocking, cache-aware layouts, and batched operations to reuse data.

For example, matrix multiplication can be organized so that blocks of matrices remain in fast cache while many arithmetic operations are performed on them. This is one reason high-quality numerical libraries can be much faster than straightforward code.

F.20 Reproducibility

Floating-point arithmetic is not always reproducible across platforms or execution orders.

Addition is associative in exact arithmetic:

(a+b)+c = a+(b+c).

In floating-point arithmetic, this identity may fail because rounding occurs after each operation.

Parallel computations often sum terms in different orders. This can produce slightly different results.

Reproducible numerical computation requires controlled rounding, deterministic reduction order, fixed libraries, or compensated algorithms.
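The failure of associativity takes only three additions to demonstrate (assuming IEEE doubles):

```python
# The same three terms, summed in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)     # False
print(left, right)       # the results differ in the last bit

# A parallel reduction that groups the same terms differently can therefore
# produce a bitwise different total from a sequential loop.
```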

F.21 Compensated Summation

Ordinary summation can accumulate rounding error. Compensated summation keeps an additional correction term.

The Kahan summation algorithm is a common example. It improves the accuracy of summing many floating-point numbers by tracking small errors lost during addition.

Compensated techniques are useful when many numbers of different magnitudes are added. They do not make arithmetic exact, but they often reduce error substantially.
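A sketch of Kahan summation, tested against Python's `math.fsum` (which computes a correctly rounded sum) on a case where plain summation drops every small term:

```python
import math

def kahan_sum(values):
    """Kahan compensated summation: carry the rounding error forward."""
    total = 0.0
    c = 0.0                    # running compensation for lost low-order bits
    for v in values:
        y = v - c              # add back the error from the previous step
        t = total + y          # low-order digits of y may be lost here...
        c = (t - total) - y    # ...and are recovered into c
        total = t
    return total

# One large term followed by many tiny ones: each 1e-16 vanishes when added
# to 1.0 directly, but the compensation term accumulates them.
values = [1.0] + [1e-16] * 10
ref = math.fsum(values)                  # correctly rounded reference
print(abs(sum(values) - ref))            # plain sum: error about 1e-15
print(abs(kahan_sum(values) - ref))      # Kahan: error at most a few ulps
```

When a correctly rounded sum is all that is needed, `math.fsum` itself is often the simplest choice.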

F.22 Numerical Software

Numerical linear algebra is usually implemented through specialized libraries.

Common examples include BLAS, LAPACK, SuiteSparse, ARPACK, and vendor-optimized libraries. These libraries encode decades of work on algorithms, memory layout, stability, and hardware performance.

For most serious numerical work, one should use tested numerical libraries rather than writing basic matrix algorithms from scratch.

The reason is not only speed. Correct numerical behavior depends on pivoting, scaling, blocking, tolerance selection, and error handling.

F.23 Tolerances

Because computed values are approximate, numerical algorithms often compare quantities to a tolerance.

Instead of testing

x = 0,

one may test

|x| \leq \tau,

where \tau is a chosen tolerance.

A good tolerance depends on the scale of the problem. Absolute tolerances alone may fail when values are very large or very small. Relative tolerances account for magnitude.

A common pattern is

|x-\widehat{x}| \leq \tau_{\mathrm{abs}} + \tau_{\mathrm{rel}} |x|.

This combines absolute and relative error control.
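The combined test is a one-liner; the helper name and tolerance values below are illustrative. Python's standard library offers `math.isclose`, which implements a closely related test (it takes the maximum of the two terms rather than their sum):

```python
import math

# The combined absolute/relative test from the text.
def close(x, xhat, tau_abs=1e-12, tau_rel=1e-9):
    return abs(x - xhat) <= tau_abs + tau_rel * abs(x)

assert close(1e12, 1e12 + 1.0)    # the relative term absorbs a unit error at 1e12
assert not close(1.0, 1.0001)     # a 1e-4 discrepancy at scale 1 is rejected

# The standard library version:
print(math.isclose(1e12, 1e12 + 1.0, rel_tol=1e-9, abs_tol=1e-12))   # True
```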

F.24 Summary

Numerical computation studies how mathematical procedures behave when implemented with finite precision and finite resources.

The main ideas are:

| Concept | Meaning |
| --- | --- |
| Floating-point arithmetic | Finite approximation to real arithmetic |
| Rounding | Replacement by nearby representable values |
| Absolute error | Error measured in original units |
| Relative error | Error measured relative to the exact value |
| Cancellation | Loss of significant digits in subtraction |
| Conditioning | Sensitivity of the mathematical problem |
| Stability | Error behavior of the algorithm |
| Residual | Failure of a computed solution to satisfy the equation |
| Pivoting | Row exchange for stable elimination |
| Scaling | Magnitude adjustment for numerical reliability |
| Sparse computation | Exploiting zero structure |
| Tolerance | Practical replacement for exact comparison |

Exact linear algebra explains what should be true. Numerical linear algebra explains what can be computed reliably. Both viewpoints are necessary for using linear algebra in scientific computing, data analysis, engineering, and large-scale simulation.