Chapter 85. Floating Point Arithmetic

Numerical linear algebra is built on finite computation. Real numbers form a continuous mathematical system, but computers store numbers using a finite number of bits. Floating point arithmetic is the standard method used to approximate real numbers in scientific computing.

Every numerical algorithm in linear algebra depends on floating point representation. Matrix multiplication, Gaussian elimination, eigenvalue computation, singular value decomposition, and iterative methods all operate within this finite arithmetic system. Understanding floating point arithmetic is therefore necessary for understanding numerical error, instability, conditioning, and algorithm design.

85.1 Exact Arithmetic and Machine Arithmetic

Pure mathematics assumes exact arithmetic.

For example,

\frac{1}{3} + \frac{1}{3} + \frac{1}{3} = 1

exactly.

Computers cannot generally represent irrational numbers or infinitely repeating decimals exactly. Instead, they approximate numbers using finite binary representations.

For example, the decimal number

0.1

does not have a finite binary expansion. A computer therefore stores an approximation.
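This can be checked directly; the following Python snippet (illustrative only) recovers the exact value a double actually stores for 0.1:

```python
from fractions import Fraction

# Fraction(0.1) converts the stored double exactly, with no rounding,
# revealing the binary approximation behind the literal 0.1.
stored = Fraction(0.1)

print(stored == Fraction(1, 10))         # False: the stored value is not 1/10
print(float(stored - Fraction(1, 10)))   # the tiny representation error
```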

This distinction between exact mathematics and machine arithmetic is fundamental.

In numerical linear algebra, every computed quantity contains some level of approximation error.

85.2 Fixed Point and Floating Point

There are two common approaches to representing real numbers on computers.

| Method | Description |
| --- | --- |
| Fixed point | Decimal or binary point stays in a fixed location |
| Floating point | Decimal or binary point moves according to an exponent |

Fixed point arithmetic is simple but has limited range.

Floating point arithmetic allows very small and very large numbers to be represented using scientific notation.

A floating point number has the form

\pm m \times \beta^e

where:

| Symbol | Meaning |
| --- | --- |
| m | significand or mantissa |
| β | base |
| e | exponent |

Modern systems almost always use base

\beta = 2.

Thus numbers are stored in binary scientific notation.

For example,

101.101_2 = 1.01101_2 \times 2^2.

85.3 Normalized Floating Point Representation

Most floating point systems use normalized representation.

A normalized binary floating point number has the form

1.b_1 b_2 b_3 \cdots b_t \times 2^e

where:

  • the leading digit is always 1,
  • t is the number of stored fraction bits,
  • e is the exponent.

Normalization ensures uniqueness and maximizes precision.

For example,

13.25 = 1101.01_2 = 1.10101_2 \times 2^3.

The stored value contains:

| Field | Value |
| --- | --- |
| Sign | + |
| Significand | 1.10101_2 |
| Exponent | 3 |
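The sign–significand–exponent decomposition can be inspected in code; Python's float.hex (used here purely for illustration) prints exactly this normalized form, with the significand in hexadecimal:

```python
# 13.25 = 1.10101_2 x 2^3.  The hex significand 1.a8 encodes the binary
# fraction .10101000...: a = 1010, 8 = 1000.  "p+3" is the binary exponent.
print((13.25).hex())    # 0x1.a800000000000p+3
```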

85.4 IEEE 754 Standard

Most modern computers use the IEEE 754 floating point standard.

The two most common formats are:

| Format | Total bits | Exponent bits | Fraction bits |
| --- | --- | --- | --- |
| Single precision | 32 | 8 | 23 |
| Double precision | 64 | 11 | 52 |

Double precision is standard in numerical linear algebra because it provides much higher accuracy.

A double precision number has approximate relative precision

2^{-53} \approx 1.11 \times 10^{-16}.

This quantity is called machine epsilon.

85.5 Machine Epsilon

Machine epsilon measures the granularity of floating point numbers near 1. Conventions vary: some authors define it as the spacing between 1 and the next representable number (2^{-52} in double precision), while this chapter uses the closely related unit roundoff 2^{-53}, which bounds the relative rounding error.

It is commonly denoted by

\varepsilon_{\text{mach}}.

For IEEE double precision,

\varepsilon_{\text{mach}} = 2^{-53} \approx 1.11 \times 10^{-16}.

Machine epsilon provides a scale for floating point error.

If a computation produces relative error on the order of

10^{-16},

then the computation is usually considered near machine precision.
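Both conventions are easy to observe in Python (illustrative; note that Python's sys.float_info.epsilon reports the spacing 2^-52, not the unit roundoff 2^-53 used above):

```python
import sys

# Spacing convention: the gap between 1.0 and the next double is 2**-52.
print(sys.float_info.epsilon == 2.0 ** -52)   # True

# Unit roundoff convention: perturbations of size 2**-53 or smaller
# can vanish entirely when added to 1.0.
print(1.0 + 2.0 ** -53 == 1.0)   # True  (rounds back to 1.0)
print(1.0 + 2.0 ** -52 == 1.0)   # False (reaches the next number)
```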

85.6 Rounding

Most real numbers cannot be represented exactly in floating point form.

The computer therefore rounds numbers to the nearest representable value.

If x is a real number and fl(x) is its floating point representation, then

\mathrm{fl}(x) = x(1 + \delta)

where

|\delta| \le \varepsilon_{\text{mach}}.

This model is one of the foundations of numerical analysis.

Every arithmetic operation introduces a small rounding error.

For floating point addition,

\mathrm{fl}(a+b) = (a+b)(1+\delta).

For multiplication,

\mathrm{fl}(ab) = ab(1+\delta).

Although each error is small, large computations may accumulate substantial total error.
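The model can be verified with exact rational arithmetic; this sketch measures the actual δ for one floating point addition:

```python
from fractions import Fraction

a, b = 0.1, 0.2
computed = a + b                      # one rounded addition: fl(a + b)
exact = Fraction(a) + Fraction(b)     # exact sum of the stored operands

# fl(a + b) = (a + b)(1 + delta); solve for delta exactly.
delta = Fraction(computed) / exact - 1
print(abs(delta) <= Fraction(1, 2 ** 53))   # True: |delta| <= eps_mach
```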

85.7 Floating Point Addition

Floating point addition is more complicated than integer addition.

To add two numbers:

  1. Align exponents.
  2. Shift significands.
  3. Add significands.
  4. Normalize result.
  5. Round result.

Suppose we compute

1.000000 \times 10^6 + 1.

The smaller number may disappear after alignment because the significand has finite precision.

This phenomenon is called loss of significance.

Large differences in scale are therefore numerically dangerous.
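The example above is written in a seven-digit decimal format; in IEEE double precision, with roughly sixteen decimal digits, the same absorption appears at a larger scale:

```python
# Adding 1 to 1e16 cannot change the stored value: after exponent
# alignment, the 1 falls entirely below the 53-bit significand.
print(1.0e16 + 1.0 == 1.0e16)   # True: the smaller addend disappears
print(1.0e8 + 1.0 == 1.0e8)     # False: both addends still fit exactly
```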

85.8 Cancellation

Subtraction of nearly equal numbers may destroy significant digits.

Suppose

x = 1.23456789, \qquad y = 1.23456780.

Then

x - y = 0.00000009.

Most leading digits cancel.

This is called catastrophic cancellation.

Cancellation amplifies relative error because the result is much smaller than the original operands.

Many unstable algorithms fail because they repeatedly subtract nearly equal numbers.
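A classic instance (our own illustration, not from the numbers above) is computing 1 − cos x for small x: the subtraction cancels every significant digit, while the algebraically identical form 2 sin²(x/2) avoids the subtraction entirely:

```python
import math

x = 1e-9
naive = 1.0 - math.cos(x)              # cos(x) rounds to exactly 1.0,
                                       # so all significant digits cancel
stable = 2.0 * math.sin(x / 2.0) ** 2  # same quantity, no subtraction

print(naive)    # 0.0: catastrophic cancellation
print(stable)   # ~5e-19, the correct value x**2 / 2
```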

85.9 Underflow and Overflow

Floating point systems have finite exponent range.

Very large numbers exceed the maximum representable magnitude.

This is overflow.

Very small numbers fall below the minimum representable magnitude.

This is underflow.

For IEEE double precision:

| Quantity | Approximate value |
| --- | --- |
| Largest number | 1.8 × 10^308 |
| Smallest normalized positive number | 2.2 × 10^-308 |

Overflow usually produces infinity.

Underflow may produce zero or subnormal numbers.
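These limits are easy to probe (Python sketch; sys.float_info exposes the IEEE double parameters):

```python
import sys

print(sys.float_info.max)           # ~1.8e308: largest finite double
print(sys.float_info.min)           # ~2.2e-308: smallest normalized positive

print(sys.float_info.max * 2.0)     # inf: overflow
print(1e-308 / 1e10)                # underflow into the subnormal range
```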

85.10 Subnormal Numbers

Near zero, IEEE arithmetic allows subnormal numbers.

These numbers do not use normalized form.

Instead of

1.b_1 b_2 \cdots b_t \times 2^e,

they use

0.b_1 b_2 \cdots b_t \times 2^{e_{\min}},

with the exponent fixed at its minimum value.

Subnormal numbers provide gradual underflow.

They improve continuity near zero but reduce precision.
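Gradual underflow can be observed by dividing past the normalized limit (illustrative sketch):

```python
import sys

x = sys.float_info.min     # smallest normalized positive double, 2**-1022

# Halving a number at the bottom of the normalized range does not jump
# straight to zero: subnormals fill in the gap...
print(x / 2 > 0.0)         # True: the result is subnormal
print(x / 2)               # ~1.1e-308, stored with reduced precision

# ...down to the smallest subnormal, 5e-324 = 2**-1074.
print(5e-324 / 2.0)        # 0.0: precision is finally exhausted
```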

85.11 Floating Point Number Line

Floating point numbers are not uniformly spaced.

Near zero, numbers are very dense.

Far from zero, spacing becomes larger.

If x is a floating point number, adjacent representable numbers differ approximately by

|x| \, \varepsilon_{\text{mach}}.

Thus relative precision is approximately constant, while absolute precision varies with scale.

This property explains why floating point arithmetic behaves differently for very large and very small numbers.
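Python's math.ulp (again, purely illustrative) reports this spacing directly:

```python
import math

print(math.ulp(1.0))     # 2**-52 ~ 2.2e-16: gap above 1.0
print(math.ulp(1.0e8))   # ~1.5e-8: absolute spacing grows with |x|
print(math.ulp(1.0e-8))  # ~1.7e-24: and shrinks near zero

# Relative spacing stays roughly constant across scales:
print(math.ulp(1.0e8) / 1.0e8)   # ~1.5e-16, comparable to machine epsilon
```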

85.12 Relative and Absolute Error

Suppose x is the exact value and x̂ is the computed approximation.

Absolute error is

|x - \hat{x}|.

Relative error is

\frac{|x - \hat{x}|}{|x|}.

Relative error is usually more important in numerical linear algebra because it measures error relative to scale.

For example:

| Exact value | Approximation | Relative error |
| --- | --- | --- |
| 10^6 | 10^6 + 1 | very small (10^-6) |
| 10^-8 | 2 × 10^-8 | large (equal to 1) |
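The two table rows, computed directly (the helper function names are ours, for illustration):

```python
def abs_err(x, xhat):
    return abs(x - xhat)

def rel_err(x, xhat):
    return abs(x - xhat) / abs(x)

print(abs_err(1e6, 1e6 + 1), rel_err(1e6, 1e6 + 1))  # 1.0  1e-06 (small)
print(abs_err(1e-8, 2e-8), rel_err(1e-8, 2e-8))      # 1e-08  1.0 (large)
```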

85.13 Backward Error Interpretation

Modern numerical analysis often interprets algorithms using backward error.

Instead of asking:

How wrong is the computed answer?

we ask:

For what nearby problem is the computed answer exact?

Suppose we compute a solution x̂ for the system

Ax = b.

If

(A + \Delta A)\hat{x} = b + \Delta b,

then x̂ is the exact solution of a nearby problem.

Backward stability means the perturbations

\|\Delta A\|, \qquad \|\Delta b\|

are small relative to \|A\| and \|b\|, on the order of machine precision.

This viewpoint is central in numerical linear algebra.
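A small sketch makes this concrete: solve a 2×2 system with Cramer's rule (any method would do), then check that the computed x̂ exactly solves the nearby problem Ax = b − r, where r is the tiny residual:

```python
# System: 0.1*x1 + 0.2*x2 = 0.3,  0.3*x1 + 0.4*x2 = 0.7
# (exact solution x1 = x2 = 1; the entries are inexact in binary,
# so rounding is visible).
a11, a12, a21, a22 = 0.1, 0.2, 0.3, 0.4
b1, b2 = 0.3, 0.7

det = a11 * a22 - a12 * a21
x1 = (b1 * a22 - b2 * a12) / det
x2 = (a11 * b2 - a21 * b1) / det

# Residual r = b - A @ xhat: xhat solves (A, b - r) exactly,
# and r is on the order of machine precision.
r1 = b1 - (a11 * x1 + a12 * x2)
r2 = b2 - (a21 * x1 + a22 * x2)
print(abs(r1) < 1e-13 and abs(r2) < 1e-13)   # True
```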

85.14 Floating Point Arithmetic Laws

Floating point arithmetic approximately obeys algebraic laws, but not exactly.

Associativity may fail.

For example,

(a+b)+c \neq a+(b+c).

Consider:

a = 10^{16}, \qquad b = -10^{16}, \qquad c = 1.

Then:

(a+b)+c = 1,

while

a+(b+c) = 0

in floating point arithmetic.

This failure occurs because rounding changes intermediate results.

Algorithms must therefore avoid rearrangements that increase error.
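The failure is reproducible in any IEEE double environment:

```python
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c    # a + b cancels exactly to 0.0, then 0.0 + 1 = 1.0
right = a + (b + c)   # b + c rounds back to -1e16 (the 1 is absorbed)

print(left, right)    # 1.0 0.0
```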

85.15 Floating Point Matrix Operations

Matrix computations magnify floating point effects because they involve many arithmetic operations.

For matrix multiplication,

C = AB,

each entry requires multiple multiplications and additions:

c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}.

Each operation introduces rounding error.

The total error depends on:

| Factor | Effect |
| --- | --- |
| Matrix size | More operations |
| Matrix conditioning | Error amplification |
| Algorithm structure | Stability properties |
| Floating point precision | Baseline rounding scale |

Numerical linear algebra studies algorithms that minimize this accumulated error.
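Each entry c_ij above is a long sum, and long sums are where rounding accumulates. The effect is visible with plain summation; math.fsum (a compensated summation routine from Python's standard library, shown only for contrast) recovers the correctly rounded result:

```python
import math

xs = [0.1] * 1000         # the decimal sum is exactly 100

naive = sum(xs)           # left-to-right: one rounding per addition
exact = math.fsum(xs)     # compensated summation, correctly rounded

print(naive == 100.0)     # False: accumulated error
print(exact == 100.0)     # True
print(abs(naive - 100.0)) # size of the accumulated error
```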

85.16 Floating Point and Gaussian Elimination

Gaussian elimination is sensitive to rounding error.

Small pivot elements can produce large multipliers:

m_{ik} = \frac{a_{ik}}{a_{kk}}.

Large multipliers amplify rounding effects.

Partial pivoting improves stability by choosing larger pivots.

This leads to the practical algorithm:

  • Gaussian elimination with partial pivoting,
  • usually written as LU factorization with row permutations.

Most practical dense linear solvers use this method.
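A 2×2 sketch shows why the pivot matters (illustrative only, not a production solver):

```python
def solve2(a11, a12, b1, a21, a22, b2):
    """Eliminate x1 from row 2 using row 1 as the pivot row."""
    m = a21 / a11             # multiplier m21 = a21 / a11
    a22p = a22 - m * a12      # if m is huge, this update destroys a22
    b2p = b2 - m * b1
    x2 = b2p / a22p
    x1 = (b1 - a12 * x2) / a11
    return x1, x2

# System: 1e-20*x1 + x2 = 1,  x1 + x2 = 2  (true solution ~ (1, 1)).
print(solve2(1e-20, 1.0, 1.0, 1.0, 1.0, 2.0))  # tiny pivot: x1 = 0.0 (wrong)
print(solve2(1.0, 1.0, 2.0, 1e-20, 1.0, 1.0))  # rows swapped: (1.0, 1.0)
```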

85.17 Floating Point and Orthogonality

Orthogonal matrices are numerically important because they preserve vector norms.

If Q is orthogonal,

Q^T Q = I.

Orthogonal transformations do not significantly amplify rounding error.

For this reason:

| Method | Stability |
| --- | --- |
| QR factorization | highly stable |
| Orthogonal iteration | stable |
| Householder reflections | stable |
| Gram-Schmidt (classical) | less stable |

Orthogonality is therefore one of the central tools of stable numerical computation.
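A minimal illustration: a Givens rotation (a 2×2 orthogonal matrix) applied to a vector leaves the 2-norm unchanged up to rounding:

```python
import math

theta = 0.7
c, s = math.cos(theta), math.sin(theta)   # rotation: [[c, -s], [s, c]]

x1, x2 = 3.0, 4.0                         # ||x||_2 = 5 exactly
y1 = c * x1 - s * x2
y2 = s * x1 + c * x2

print(math.hypot(x1, x2))                      # 5.0
print(abs(math.hypot(y1, y2) - 5.0) < 1e-12)   # True: norm preserved
```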

85.18 Floating Point and Eigenvalue Problems

Eigenvalue algorithms are especially sensitive to numerical issues.

Repeated matrix powers may overflow or underflow.

Close eigenvalues may be difficult to separate numerically.

Non-normal matrices may exhibit strong sensitivity.

Modern algorithms therefore use:

  • orthogonal similarity transformations,
  • Hessenberg reduction,
  • QR iteration,
  • balancing and scaling techniques.

These methods are carefully designed to control floating point error.

85.19 Conditioning Versus Stability

Two different concepts are fundamental:

| Concept | Meaning |
| --- | --- |
| Conditioning | Sensitivity of the mathematical problem |
| Stability | Sensitivity introduced by the algorithm |

A well-conditioned problem may still produce poor results if the algorithm is unstable.

A stable algorithm may still produce poor results if the problem itself is ill-conditioned.

These ideas are developed in later chapters.

85.20 Floating Point as a Mathematical Model

Floating point arithmetic is often modeled abstractly by:

\mathrm{fl}(x \circ y) = (x \circ y)(1+\delta), \qquad |\delta| \le \varepsilon_{\text{mach}}.

Here ∘ represents an arithmetic operation (addition, subtraction, multiplication, or division).

This simple model supports much of modern error analysis.

It allows theoretical estimates of accumulated error in matrix algorithms.

85.21 Numerical Reproducibility

Floating point computations may vary across systems because:

  • processors differ,
  • instruction order differs,
  • parallel execution changes summation order,
  • compiler optimizations rearrange arithmetic.

Thus two mathematically identical programs may produce slightly different results.
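Summation order alone is enough to change a result, which is exactly what parallel reductions and compiler rearrangements do (sketch):

```python
xs = [1.0, 1e16, -1e16]       # mathematical sum: 1.0

forward = sum(xs)             # (1 + 1e16) absorbs the 1, then cancels -> 0.0
backward = sum(reversed(xs))  # (-1e16 + 1e16) cancels first, then + 1 -> 1.0

print(forward, backward)      # 0.0 1.0
```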

Reproducibility is an important issue in scientific computing and high-performance linear algebra.

85.22 Floating Point in Practice

Practical numerical software must account for floating point limitations.

Good numerical software:

  • avoids cancellation,
  • avoids overflow and underflow,
  • uses stable decompositions,
  • scales matrices appropriately,
  • estimates conditioning,
  • monitors residual error,
  • minimizes unnecessary operations.

Libraries such as LAPACK, BLAS, Eigen, and Intel MKL are designed around these principles.

85.23 Summary

Floating point arithmetic approximates real arithmetic using finite binary representations.

Key concepts include:

| Concept | Meaning |
| --- | --- |
| Machine epsilon | Baseline rounding scale |
| Rounding | Approximation to nearest representable value |
| Cancellation | Loss of significant digits |
| Overflow | Number too large to represent |
| Underflow | Number too small to represent |
| Relative error | Scale-sensitive error measure |
| Stability | Algorithmic resistance to rounding error |

Floating point arithmetic is the computational foundation of numerical linear algebra. Every numerical algorithm operates within its limitations. Understanding these limitations is necessary for designing stable algorithms and interpreting computed results correctly.