Numerical linear algebra is built on finite computation. Real numbers form a continuous mathematical system, but computers store numbers using a finite number of bits. Floating point arithmetic is the standard method used to approximate real numbers in scientific computing.
Every numerical algorithm in linear algebra depends on floating point representation. Matrix multiplication, Gaussian elimination, eigenvalue computation, singular value decomposition, and iterative methods all operate within this finite arithmetic system. Understanding floating point arithmetic is therefore necessary for understanding numerical error, instability, conditioning, and algorithm design.
85.1 Exact Arithmetic and Machine Arithmetic
Pure mathematics assumes exact arithmetic.
For example,
\frac{1}{3} + \frac{1}{3} + \frac{1}{3} = 1
exactly.
Computers cannot generally represent irrational numbers or infinitely repeating decimals exactly. Instead, they approximate numbers using finite binary representations.
For example, the decimal number
0.1
does not have a finite binary expansion. A computer therefore stores a nearby approximation.
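This can be checked directly; a small Python illustration (Python is used here only for demonstration):

```python
# 0.1 has no finite binary expansion, so the stored double is only
# close to 0.1. Printing more digits reveals the approximation.
stored = format(0.1, ".20f")      # the double nearest to 0.1, shown with 20 digits
exact_check = (0.1 + 0.2 == 0.3)  # fails due to representation error
print(stored)
print(exact_check)  # False
```

The familiar surprise that `0.1 + 0.2 != 0.3` is a direct consequence of this binary representation.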
This distinction between exact mathematics and machine arithmetic is fundamental.
In numerical linear algebra, every computed quantity contains some level of approximation error.
85.2 Fixed Point and Floating Point
There are two common approaches to representing real numbers on computers.
| Method | Description |
|---|---|
| Fixed point | Decimal or binary point stays in a fixed location |
| Floating point | Decimal or binary point moves according to an exponent |
Fixed point arithmetic is simple but has limited range.
Floating point arithmetic allows very small and very large numbers to be represented using scientific notation.
A floating point number has the form
x = \pm\, m \times \beta^{e}
where:
| Symbol | Meaning |
|---|---|
| m | significand or mantissa |
| \beta | base |
| e | exponent |
Modern systems almost always use base \beta = 2.
Thus numbers are stored in binary scientific notation.
For example,
0.75 = 1.1_2 \times 2^{-1}
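One convenient way to inspect this binary scientific notation in Python is `float.hex`, which displays a double as a hexadecimal significand times a power of two:

```python
# float.hex shows a double as (hex significand) x 2**exponent.
# 0.75 = 1.5 x 2**-1, and 1.5 in hex-fraction form is 1.8.
h = (0.75).hex()
print(h)  # '0x1.8000000000000p-1'
```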
85.3 Normalized Floating Point Representation
Most floating point systems use normalized representation.
A normalized binary floating point number has the form
x = \pm\, 1.f_1 f_2 \cdots f_t \times 2^{e}
where:
- the leading digit is always 1,
- t is the number of stored fraction bits,
- e is the exponent.
Normalization ensures uniqueness and maximizes precision.
For example,
0.75 = +1.1_2 \times 2^{-1}
The stored value contains:
| Field | Value |
|---|---|
| Sign | + |
| Significand | 1.1_2 |
| Exponent | -1 |
85.4 IEEE 754 Standard
Most modern computers use the IEEE 754 floating point standard.
The two most common formats are:
| Format | Total bits | Exponent bits | Fraction bits |
|---|---|---|---|
| Single precision | 32 | 8 | 23 |
| Double precision | 64 | 11 | 52 |
Double precision is standard in numerical linear algebra because it provides much higher accuracy.
A double precision number has approximate relative precision on the order of 10^{-16}.
This quantity is called machine epsilon.
85.5 Machine Epsilon
Machine epsilon measures the maximum relative error of rounding, equivalently half the spacing between 1 and the next representable floating point number.
It is commonly denoted by \varepsilon_{\mathrm{mach}}.
For IEEE double precision,
\varepsilon_{\mathrm{mach}} = 2^{-53} \approx 1.11 \times 10^{-16}
Machine epsilon provides a scale for floating point error.
If a computation produces relative error on the order of \varepsilon_{\mathrm{mach}},
then the computation is usually considered near machine precision.
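A quick Python check illustrates these quantities. Note a convention difference: `sys.float_info.epsilon` reports the gap 2^{-52} between 1 and the next double, while the rounding unit used above is half of that:

```python
import sys

gap = sys.float_info.epsilon  # 2**-52: spacing between 1.0 and the next double
u = gap / 2                   # 2**-53: maximum relative rounding error
print(1.0 + u == 1.0)         # True: an increment at the rounding unit is lost
print(1.0 + gap > 1.0)        # True: the full gap is representable
```

The first result relies on round-to-nearest-even: 1 + 2^{-53} lies exactly halfway between two doubles and rounds back to 1.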
85.6 Rounding
Most real numbers cannot be represented exactly in floating point form.
The computer therefore rounds numbers to the nearest representable value.
If x is a real number and \mathrm{fl}(x) is its floating point representation, then
\mathrm{fl}(x) = x(1+\delta)
where
|\delta| \le \varepsilon_{\mathrm{mach}}
This model is one of the foundations of numerical analysis.
Every arithmetic operation introduces a small rounding error.
For floating point addition,
\mathrm{fl}(x + y) = (x + y)(1+\delta_1), \qquad |\delta_1| \le \varepsilon_{\mathrm{mach}}
For multiplication,
\mathrm{fl}(x \cdot y) = (x \cdot y)(1+\delta_2), \qquad |\delta_2| \le \varepsilon_{\mathrm{mach}}
Although each error is small, large computations may accumulate substantial total error.
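Accumulation is easy to observe; a minimal Python sketch:

```python
# Each addition rounds. Ten additions of 0.1 accumulate enough
# rounding error that the result is not exactly 1.0.
total = 0.0
for _ in range(10):
    total += 0.1
print(total == 1.0)      # False
print(abs(total - 1.0))  # tiny but nonzero
```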
85.7 Floating Point Addition
Floating point addition is more complicated than integer addition.
To add two numbers:
- Align exponents.
- Shift significands.
- Add significands.
- Normalize result.
- Round result.
Suppose we compute
1 + 10^{-20}
in double precision.
The smaller number may disappear after alignment because the significand has finite precision.
This phenomenon is called loss of significance.
Large differences in scale are therefore numerically dangerous.
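The absorption effect can be seen directly in Python:

```python
# Near 1e16 the spacing between adjacent doubles is 2, so adding 1.0
# has no effect: the smaller operand vanishes during exponent alignment.
x = 1e16
print(x + 1.0 == x)  # True: 1.0 is absorbed
print(x + 2.0 == x)  # False: 2.0 equals one spacing and survives
```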
85.8 Cancellation
Subtraction of nearly equal numbers may destroy significant digits.
Suppose
x = 1.23456789, \qquad y = 1.23456780
Then
x - y = 0.00000009
Most leading digits cancel.
This is called catastrophic cancellation.
Cancellation amplifies relative error because the result is much smaller than the original operands.
Many unstable algorithms fail because they repeatedly subtract nearly equal numbers.
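A classical instance is computing 1 - cos(x) for small x; a Python sketch comparing the naive form with an algebraically equivalent stable one:

```python
import math

# For small x, cos(x) is extremely close to 1, so 1 - cos(x)
# subtracts nearly equal numbers and loses all significant digits.
x = 1e-8
naive = 1.0 - math.cos(x)            # catastrophic cancellation
stable = 2.0 * math.sin(x / 2) ** 2  # identity: 1 - cos(x) = 2 sin^2(x/2)
print(naive)   # 0.0: every correct digit is gone
print(stable)  # ~5e-17: the correct value
```

Rewriting the formula to avoid the subtraction is a standard cure for cancellation.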
85.9 Underflow and Overflow
Floating point systems have finite exponent range.
Very large numbers exceed the maximum representable magnitude.
This is overflow.
Very small numbers fall below the minimum representable magnitude.
This is underflow.
For IEEE double precision:
| Quantity | Approximate value |
|---|---|
| Largest number | \approx 1.8 \times 10^{308} |
| Smallest normalized positive number | \approx 2.2 \times 10^{-308} |
Overflow usually produces infinity.
Underflow may produce zero or subnormal numbers.
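Both boundaries are easy to probe in Python:

```python
import sys

# Overflow: exceeding the largest double produces infinity.
big = sys.float_info.max  # ~1.8e308
print(big * 2)            # inf

# Underflow: halving the smallest positive subnormal flushes to zero.
tiny = 5e-324             # smallest positive subnormal double
print(tiny / 2)           # 0.0
```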
85.10 Subnormal Numbers
Near zero, IEEE arithmetic allows subnormal numbers.
These numbers do not use normalized form.
Instead of
1.f \times 2^{e}
they use
0.f \times 2^{e_{\min}}
Subnormal numbers provide gradual underflow.
They improve continuity near zero but reduce precision.
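Gradual underflow can be observed directly:

```python
import sys

# Below the smallest normalized double, results become subnormal
# rather than jumping straight to zero.
smallest_normal = sys.float_info.min  # ~2.2e-308
sub = smallest_normal / 2             # subnormal: nonzero, below normal range
print(sub > 0.0)                      # True: gradual underflow
print(sub < smallest_normal)          # True
```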
85.11 Floating Point Number Line
Floating point numbers are not uniformly spaced.
Near zero, numbers are very dense.
Far from zero, spacing becomes larger.
If x is a floating point number, adjacent representable numbers differ approximately by
\varepsilon_{\mathrm{mach}} \, |x|
Thus relative precision is approximately constant, while absolute precision varies with scale.
This property explains why floating point arithmetic behaves differently for very large and very small numbers.
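Python's `math.ulp` (available since Python 3.9) reports this spacing directly:

```python
import math

# math.ulp(x) gives the spacing between x and the next double.
# Absolute spacing grows with magnitude; relative spacing is
# identical at the start of each binade.
print(math.ulp(1.0))      # 2**-52
print(math.ulp(2.0**20))  # 2**-32: the spacing scales with the exponent
ratio = math.ulp(2.0**20) / 2.0**20
print(ratio == math.ulp(1.0) / 1.0)  # True
```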
85.12 Relative and Absolute Error
Suppose x is the exact value and \hat{x} is the computed approximation.
Absolute error is
|\hat{x} - x|
Relative error is
\frac{|\hat{x} - x|}{|x|}
Relative error is usually more important in numerical linear algebra because it measures error relative to scale.
For example:
| Exact value | Approximation | Relative error |
|---|---|---|
| very small | ||
| large |
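The two measures are a few lines of Python (function names here are illustrative):

```python
# Helpers for the two error measures.
def abs_error(exact, approx):
    return abs(approx - exact)

def rel_error(exact, approx):
    return abs(approx - exact) / abs(exact)

# The same absolute error means very different things at different scales:
print(abs_error(1e6, 1e6 + 1), rel_error(1e6, 1e6 + 1))  # 1.0, small
print(abs_error(1e-6, 2e-6), rel_error(1e-6, 2e-6))      # tiny, but 100% relative error
```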
85.13 Backward Error Interpretation
Modern numerical analysis often interprets algorithms using backward error.
Instead of asking:
How wrong is the computed answer?
we ask:
For what nearby problem is the computed answer exact?
Suppose we compute solution \hat{x} for system
A x = b
If
(A + \Delta A)\,\hat{x} = b + \Delta b
then \hat{x} is the exact solution of a nearby problem.
Backward stability means the perturbations satisfy
\frac{\|\Delta A\|}{\|A\|} \ \text{and} \ \frac{\|\Delta b\|}{\|b\|}
small relative to machine precision.
This viewpoint is central in numerical linear algebra.
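A computable proxy for backward error is the normwise relative residual; a plain-Python sketch for a 2×2 system (the matrix and the name `eta` are illustrative):

```python
# Backward error of an approximate solution xhat for A x = b can be
# estimated via the relative residual ||b - A xhat|| / (||A|| ||xhat||).
A = [[2.0, 1.0],
     [1.0, 3.0]]
b = [3.0, 4.0]
xhat = [1.0, 1.0 + 1e-12]  # slightly perturbed solution (exact one is [1, 1])

r = [b[i] - sum(A[i][j] * xhat[j] for j in range(2)) for i in range(2)]
norm_r = max(abs(v) for v in r)                      # infinity norm of residual
norm_A = max(sum(abs(v) for v in row) for row in A)  # induced infinity norm
norm_x = max(abs(v) for v in xhat)
eta = norm_r / (norm_A * norm_x)                     # backward error estimate
print(eta)  # tiny: xhat exactly solves a very nearby problem
```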
85.14 Floating Point Arithmetic Laws
Floating point arithmetic approximately obeys algebraic laws, but not exactly.
Associativity may fail.
For example,
(x + y) + z \ne x + (y + z)
Consider:
x = 10^{16}, \qquad y = -10^{16}, \qquad z = 1
Then:
(x + y) + z = 1
while
x + (y + z) = 0
in floating point arithmetic.
This failure occurs because rounding changes intermediate results.
Algorithms must therefore avoid rearrangements that increase error.
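The failure of associativity is easy to reproduce in Python:

```python
# The same three numbers summed in two orders give different results.
x, y, z = 1e16, -1e16, 1.0
left = (x + y) + z   # 0.0 + 1.0 = 1.0
right = x + (y + z)  # -1e16 + 1 rounds back to -1e16, so the sum is 0.0
print(left, right)   # 1.0 0.0
```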
85.15 Floating Point Matrix Operations
Matrix computations magnify floating point effects because they involve many arithmetic operations.
For matrix multiplication C = AB,
each entry requires multiple multiplications and additions:
c_{ij}=\sum_{k=1}^{n} a_{ik} b_{kj}
Each operation introduces rounding error.
The total error depends on:
| Factor | Effect |
|---|---|
| Matrix size | More operations |
| Matrix conditioning | Error amplification |
| Algorithm structure | Stability properties |
| Floating point precision | Baseline rounding scale |
Numerical linear algebra studies algorithms that minimize this accumulated error.
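Each entry of a matrix product is an inner product, i.e. a long sum, so summation strategy matters. Python's `math.fsum` (a compensated summation) illustrates the difference:

```python
import math

# Naive left-to-right summation loses the small middle term;
# compensated summation recovers it.
terms = [1e16, 1.0, -1e16]
print(sum(terms))        # 0.0: the 1.0 is absorbed, then cancelled away
print(math.fsum(terms))  # 1.0: correct
```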
85.16 Floating Point and Gaussian Elimination
Gaussian elimination is sensitive to rounding error.
Small pivot elements can produce large multipliers:
m_{ik} = \frac{a_{ik}}{a_{kk}}
Large multipliers amplify rounding effects.
Partial pivoting improves stability by choosing larger pivots.
This leads to the practical algorithm:
- Gaussian elimination with partial pivoting,
- usually written as LU factorization with row permutations.
Most practical dense linear solvers use this method.
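A minimal 2×2 elimination sketch (illustrative code, not library-quality) shows the effect of a tiny pivot:

```python
# Gaussian elimination on a 2x2 system, with and without partial pivoting.
def solve2(A, b, pivot):
    A = [row[:] for row in A]
    b = b[:]
    if pivot and abs(A[1][0]) > abs(A[0][0]):
        A[0], A[1] = A[1], A[0]   # swap rows so the larger entry is the pivot
        b[0], b[1] = b[1], b[0]
    m = A[1][0] / A[0][0]         # multiplier: huge when the pivot is tiny
    A[1][1] -= m * A[0][1]
    b[1] -= m * b[0]
    x1 = b[1] / A[1][1]
    x0 = (b[0] - A[0][1] * x1) / A[0][0]
    return [x0, x1]

A = [[1e-20, 1.0], [1.0, 1.0]]    # tiny pivot in the (1,1) position
b = [1.0, 2.0]                    # true solution is very close to [1, 1]
print(solve2(A, b, pivot=False))  # [0.0, 1.0]: x0 is completely wrong
print(solve2(A, b, pivot=True))   # [1.0, 1.0]: accurate
```

Without pivoting the multiplier 10^{20} swamps the second row, destroying the information needed to recover x0.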
85.17 Floating Point and Orthogonality
Orthogonal matrices are numerically important because they preserve vector norms.
If Q is orthogonal,
Q^{T}Q = I
Orthogonal transformations do not significantly amplify rounding error.
For this reason:
| Method | Stability |
|---|---|
| QR factorization | highly stable |
| Orthogonal iteration | stable |
| Householder reflections | stable |
| Gram-Schmidt (classical) | less stable |
Orthogonality is therefore one of the central tools of stable numerical computation.
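Norm preservation under repeated orthogonal transformations can be checked with a simple plane rotation (an illustrative example):

```python
import math

# Apply a 2x2 rotation (an orthogonal transformation) many times.
# Rounding errors are not amplified: the norm stays essentially constant.
c, s = math.cos(0.7), math.sin(0.7)
x, y = 3.0, 4.0              # ||(x, y)|| = 5
for _ in range(10_000):
    x, y = c * x + s * y, -s * x + c * y
norm = math.hypot(x, y)
print(norm)  # still very close to 5.0 after 10,000 rotations
```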
85.18 Floating Point and Eigenvalue Problems
Eigenvalue algorithms are especially sensitive to numerical issues.
Repeated matrix powers may overflow or underflow.
Close eigenvalues may be difficult to separate numerically.
Non-normal matrices may exhibit strong sensitivity.
Modern algorithms therefore use:
- orthogonal similarity transformations,
- Hessenberg reduction,
- QR iteration,
- balancing and scaling techniques.
These methods are carefully designed to control floating point error.
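The scaling idea can be illustrated with power iteration, where normalizing after each matrix-vector product keeps the iterate inside the representable range (the matrix below is an illustrative choice with eigenvalues 4 and 2):

```python
# Power iteration with per-step rescaling. Without the rescaling the
# entries grow like 4**k and would eventually overflow.
A = [[3.0, 1.0], [1.0, 3.0]]  # eigenvalues 4 and 2
x = [1.0, 0.0]
for _ in range(100):
    y = [A[0][0] * x[0] + A[0][1] * x[1],
         A[1][0] * x[0] + A[1][1] * x[1]]
    scale = max(abs(y[0]), abs(y[1]))  # rescale to avoid overflow
    x = [y[0] / scale, y[1] / scale]

# Rayleigh quotient estimates the dominant eigenvalue.
Ax = [A[0][0] * x[0] + A[0][1] * x[1],
      A[1][0] * x[0] + A[1][1] * x[1]]
lam = (x[0] * Ax[0] + x[1] * Ax[1]) / (x[0] * x[0] + x[1] * x[1])
print(lam)  # close to 4.0, the dominant eigenvalue
```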
85.19 Conditioning Versus Stability
Two different concepts are fundamental:
| Concept | Meaning |
|---|---|
| Conditioning | Sensitivity of the mathematical problem |
| Stability | Sensitivity introduced by the algorithm |
A well-conditioned problem may still produce poor results if the algorithm is unstable.
A stable algorithm may still produce poor results if the problem itself is ill-conditioned.
These ideas are developed in later chapters.
85.20 Floating Point as a Mathematical Model
Floating point arithmetic is often modeled abstractly by:
\mathrm{fl}(x \circ y) = (x \circ y)(1+\delta), \qquad |\delta| \le \varepsilon_{\mathrm{mach}}
Here \circ represents an arithmetic operation (+, -, \times, \div).
This simple model supports much of modern error analysis.
It allows theoretical estimates of accumulated error in matrix algorithms.
85.21 Numerical Reproducibility
Floating point computations may vary across systems because:
- processors differ,
- instruction order differs,
- parallel execution changes summation order,
- compiler optimizations rearrange arithmetic.
Thus two mathematically identical programs may produce slightly different results.
Reproducibility is an important issue in scientific computing and high-performance linear algebra.
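The ordering effect behind non-reproducibility is easy to simulate sequentially:

```python
# Reordering a sum (as parallel reductions effectively do) can change
# the result: same numbers, different rounding history.
vals = [1e16, 3.14, -1e16, 2.71]
forward = 0.0
for v in vals:
    forward += v
backward = 0.0
for v in reversed(vals):
    backward += v
print(forward == backward)  # False
```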
85.22 Floating Point in Practice
Practical numerical software must account for floating point limitations.
Good numerical software:
- avoids cancellation,
- avoids overflow and underflow,
- uses stable decompositions,
- scales matrices appropriately,
- estimates conditioning,
- monitors residual error,
- minimizes unnecessary operations.
Libraries such as LAPACK, BLAS, Eigen, and Intel MKL are designed around these principles.
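One concrete example of these principles is avoiding overflow in intermediate quantities when computing a vector norm; Python's `math.hypot` rescales internally:

```python
import math

# The naive two-norm overflows when a component squared exceeds the
# double range; a rescaling implementation (math.hypot) does not.
x, y = 1e200, 1e200
naive = math.sqrt(x * x + y * y)  # x*x overflows to inf
robust = math.hypot(x, y)         # ~1.414e200, correct
print(naive)   # inf
print(robust)
```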
85.23 Summary
Floating point arithmetic approximates real arithmetic using finite binary representations.
Key concepts include:
| Concept | Meaning |
|---|---|
| Machine epsilon | Baseline rounding scale |
| Rounding | Approximation to nearest representable value |
| Cancellation | Loss of significant digits |
| Overflow | Number too large to represent |
| Underflow | Number too small to represent |
| Relative error | Scale-sensitive error measure |
| Stability | Algorithmic resistance to rounding error |
Floating point arithmetic is the computational foundation of numerical linear algebra. Every numerical algorithm operates within its limitations. Understanding these limitations is necessary for designing stable algorithms and interpreting computed results correctly.