# Chapter 85. Floating Point Arithmetic

Numerical linear algebra is built on finite computation. Real numbers form a continuous mathematical system, but computers store numbers using a finite number of bits. Floating point arithmetic is the standard method used to approximate real numbers in scientific computing.

Every numerical algorithm in linear algebra depends on floating point representation. Matrix multiplication, Gaussian elimination, eigenvalue computation, singular value decomposition, and iterative methods all operate within this finite arithmetic system. Understanding floating point arithmetic is therefore necessary for understanding numerical error, instability, conditioning, and algorithm design.

## 85.1 Exact Arithmetic and Machine Arithmetic

Pure mathematics assumes exact arithmetic.

For example,

$$
\frac{1}{3} + \frac{1}{3} + \frac{1}{3} = 1
$$

exactly.

Computers cannot generally represent irrational numbers or infinitely repeating decimals exactly. Instead, they approximate numbers using finite binary representations.

For example, the decimal number

$$
0.1
$$

does not have a finite binary expansion. A computer therefore stores an approximation.

This distinction between exact mathematics and machine arithmetic is fundamental.

In numerical linear algebra, every computed quantity contains some level of approximation error.
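
For illustration, a minimal Python sketch (standard library only) makes the stored approximation of \(0.1\) visible:

```python
from decimal import Decimal

# The double closest to 0.1 is not exactly 0.1.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# Summing ten copies of the stored value does not return exactly 1.
total = sum([0.1] * 10)
print(total == 1.0)   # False
print(total)          # 0.9999999999999999
```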

## 85.2 Fixed Point and Floating Point

There are two common approaches to representing real numbers on computers.

| Method | Description |
|---|---|
| Fixed point | Decimal or binary point stays in a fixed location |
| Floating point | Decimal or binary point moves according to an exponent |

Fixed point arithmetic is simple but has limited range.

Floating point arithmetic allows very small and very large numbers to be represented using scientific notation.

A floating point number has the form

$$
\pm m \times \beta^e
$$

where:

| Symbol | Meaning |
|---|---|
| \(m\) | significand or mantissa |
| \(\beta\) | base |
| \(e\) | exponent |

Modern systems almost always use base

$$
\beta = 2.
$$

Thus numbers are stored in binary scientific notation.

For example,

$$
101.101_2 =
1.01101_2 \times 2^2.
$$
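
A small Python sketch (standard library only) shows the significand–exponent decomposition; note that `math.frexp` uses the normalization \(0.5 \le m < 1\) rather than the \(1.b_1b_2\cdots\) form written above, which is an equivalent convention.

```python
import math

# 101.101_2 = 4 + 1 + 0.5 + 0.125 = 5.625
x = 5.625

# frexp returns (m, e) with x = m * 2**e and 0.5 <= m < 1.
m, e = math.frexp(x)
print(m, e)            # 0.703125 3
print(m * 2**e == x)   # True
```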

## 85.3 Normalized Floating Point Representation

Most floating point systems use normalized representation.

A normalized binary floating point number has the form

$$
1.b_1b_2b_3\cdots b_t \times 2^e
$$

where:

- the leading digit is always \(1\),
- \(t\) is the number of stored fraction bits,
- \(e\) is the exponent.

Normalization ensures uniqueness and maximizes precision.

For example,

$$
13.25 =
1101.01_2 =
1.10101_2 \times 2^3.
$$

The stored value contains:

| Field | Value |
|---|---|
| Sign | \(+\) |
| Significand | \(1.10101_2\) |
| Exponent | \(3\) |
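
In Python, `float.hex` exposes exactly these fields (a short sketch; the hexadecimal digits encode the binary fraction four bits at a time):

```python
x = 13.25
print(x.hex())   # 0x1.a800000000000p+3

# The hex significand 1.a8 means 1 + 10/16 + 8/256 = 1.65625,
# and 1.65625 * 2**3 = 13.25, matching 1.10101_2 x 2^3 above.
print(1.65625 * 2**3)   # 13.25
```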

## 85.4 IEEE 754 Standard

Most modern computers use the IEEE 754 floating point standard.

The two most common formats are:

| Format | Total bits | Exponent bits | Fraction bits |
|---|---:|---:|---:|
| Single precision | 32 | 8 | 23 |
| Double precision | 64 | 11 | 52 |

Double precision is standard in numerical linear algebra because it provides far greater accuracy than single precision.

A double precision number has approximate relative precision

$$
2^{-53} \approx 1.11 \times 10^{-16}.
$$

This quantity is called machine epsilon.
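
A brief sketch, assuming NumPy is available, shows the difference between the two formats:

```python
import numpy as np

# The same division carried out in single and double precision.
x32 = np.float32(1.0) / np.float32(3.0)
x64 = 1.0 / 3.0

print(repr(x32))   # 0.33333334         (about 7 correct digits)
print(repr(x64))   # 0.3333333333333333 (about 16 correct digits)

# NumPy reports the gap between 1 and the next float for each format:
# 2**-23 for single and 2**-52 for double, twice the unit roundoff
# values 2**-24 and 2**-53.
print(np.finfo(np.float32).eps)   # ~1.19e-07
print(np.finfo(np.float64).eps)   # ~2.22e-16
```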

## 85.5 Machine Epsilon

Machine epsilon, in the convention used throughout this chapter, is the unit roundoff: the largest relative error that rounding to the nearest representable floating point number can introduce. It equals half the spacing between \(1\) and the next larger representable number.

It is commonly denoted by

$$
\varepsilon_{\text{mach}}.
$$

For IEEE double precision,

$$
\varepsilon_{\text{mach}} =
2^{-53}
\approx 1.11 \times 10^{-16}.
$$

Machine epsilon provides a scale for floating point error.

If a computation produces relative error on the order of

$$
10^{-16},
$$

then the computation is usually considered near machine precision.
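
This can be checked directly (a short sketch; `math.nextafter` requires Python 3.9 or later):

```python
import math

# Spacing between 1.0 and the next representable double: 2**-52.
gap = math.nextafter(1.0, 2.0) - 1.0
print(gap == 2.0 ** -52)   # True

# The unit roundoff used in this chapter is half of that gap.
u = gap / 2.0
print(u)                   # 1.1102230246251565e-16, i.e. 2**-53
```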

## 85.6 Rounding

Most real numbers cannot be represented exactly in floating point form.

The computer therefore rounds numbers to the nearest representable value.

If \(x\) is a real number and \(\mathrm{fl}(x)\) is its floating point representation, then

$$
\mathrm{fl}(x) = x(1 + \delta)
$$

where

$$
|\delta| \le \varepsilon_{\text{mach}}.
$$

This model is one of the foundations of numerical analysis.

Every arithmetic operation introduces a small rounding error.

For floating point addition,

$$
\mathrm{fl}(a+b) =
(a+b)(1+\delta).
$$

For multiplication,

$$
\mathrm{fl}(ab) =
ab(1+\delta).
$$

Although each error is small, large computations may accumulate substantial total error.
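
The rounding model can be tested directly with exact rational arithmetic (a sketch using only the Python standard library):

```python
from fractions import Fraction

a, b = 0.1, 0.2

# Exact sum of the two *stored* operands, in rational arithmetic.
exact = Fraction(a) + Fraction(b)

# Floating point sum and the resulting relative rounding error delta.
computed = a + b
delta = (Fraction(computed) - exact) / exact

print(float(delta))                        # roughly 9e-17 here
print(abs(delta) <= Fraction(1, 2 ** 53))  # True: |delta| <= eps_mach
```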

## 85.7 Floating Point Addition

Floating point addition is more complicated than integer addition.

To add two numbers:

1. Align exponents.
2. Shift significands.
3. Add significands.
4. Normalize result.
5. Round result.

Suppose we compute

$$
1.000000 \times 10^{8}
+
1
$$

in a system that carries only seven significant decimal digits. After the exponents are aligned, the contribution of the smaller operand falls below the last stored digit, so the computed sum is exactly \(1.000000 \times 10^{8}\): the smaller number disappears.

This phenomenon is sometimes called absorption; the smaller operand is swallowed entirely and its significance is lost.

Large differences in scale are therefore numerically dangerous.
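
The effect is easy to reproduce in double precision (a minimal Python sketch):

```python
# Near 1e16 the spacing between doubles is 2, so adding 1 has no effect.
print(1.0e16 + 1.0 == 1.0e16)   # True

# Near 1e15 the spacing is much smaller and the 1 survives.
print(1.0e15 + 1.0 == 1.0e15)   # False
```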

## 85.8 Cancellation

Subtraction of nearly equal numbers may destroy significant digits.

Suppose

$$
x = 1.23456789,
\qquad
y = 1.23456780.
$$

Then

$$
x-y = 0.00000009.
$$

Most leading digits cancel.

This is called catastrophic cancellation.

Cancellation amplifies relative error: errors already present in the operands keep their absolute size, but the result is much smaller than the operands, so the relative error can grow enormously.

Many unstable algorithms fail because they repeatedly subtract nearly equal numbers.
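
A classic illustration is computing \(1 - \cos x\) for small \(x\) (a sketch; the reformulation \(2\sin^2(x/2)\) is mathematically identical but avoids the subtraction):

```python
import math

x = 1.0e-8

# Direct formula: cos(x) is so close to 1 that every stored digit cancels.
direct = 1.0 - math.cos(x)

# Equivalent formula with no subtraction of nearly equal numbers.
stable = 2.0 * math.sin(x / 2.0) ** 2

print(direct)   # 0.0   (all information lost)
print(stable)   # about 5e-17, close to the true value x**2 / 2
```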

## 85.9 Underflow and Overflow

Floating point systems have finite exponent range.

Very large numbers exceed the maximum representable magnitude.

This is overflow.

Very small numbers fall below the minimum representable magnitude.

This is underflow.

For IEEE double precision:

| Quantity | Approximate value |
|---|---|
| Largest number | \(1.8 \times 10^{308}\) |
| Smallest normalized positive number | \(2.2 \times 10^{-308}\) |

Overflow usually produces infinity.

Underflow may produce zero or subnormal numbers.
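
These limits can be inspected from Python (a standard-library sketch):

```python
import sys

print(sys.float_info.max)   # about 1.7976931348623157e+308
print(sys.float_info.min)   # about 2.2250738585072014e-308 (smallest normal)

# Overflow: exceeding the largest magnitude gives infinity.
print(sys.float_info.max * 10.0)    # inf

# Underflow: tiny results enter the subnormal range and eventually reach zero.
print(sys.float_info.min / 2.0)     # nonzero, but subnormal
print(sys.float_info.min / 1e20)    # 0.0
```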

## 85.10 Subnormal Numbers

Near zero, IEEE arithmetic allows subnormal numbers.

These numbers do not use normalized form.

Instead of

$$
1.b_1b_2\cdots b_t \times 2^e,
$$

they use

$$
0.b_1b_2\cdots b_t \times 2^{e_{\min}},
$$

where \(e_{\min}\) is the minimum exponent of the format.

Subnormal numbers provide gradual underflow.

They improve continuity near zero but reduce precision.
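
A short sketch shows gradual underflow in double precision:

```python
import math
import sys

smallest_normal = sys.float_info.min   # about 2.2250738585072014e-308

# Halving moves into the subnormal range instead of jumping to zero.
print(smallest_normal / 2.0)           # nonzero, but subnormal

# The smallest positive subnormal double is 2**-1074, about 5e-324.
tiny = math.ldexp(1.0, -1074)
print(tiny)                            # 5e-324
print(tiny / 2.0)                      # 0.0: below this, underflow to zero
```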

## 85.11 Floating Point Number Line

Floating point numbers are not uniformly spaced.

Near zero, numbers are very dense.

Far from zero, spacing becomes larger.

If \(x\) is a floating point number, adjacent representable numbers differ approximately by

$$
|x| \varepsilon_{\text{mach}}.
$$

Thus relative precision is approximately constant, while absolute precision varies with scale.

This property explains why floating point arithmetic behaves differently for very large and very small numbers.
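
The varying spacing can be measured with `math.ulp` (a sketch; requires Python 3.9 or later):

```python
import math

# math.ulp(x) is the spacing from x to the next larger representable double.
for x in (1.0e-8, 1.0, 1.0e8, 1.0e16):
    print(x, math.ulp(x), math.ulp(x) / x)

# The absolute spacing spans many orders of magnitude,
# while the relative spacing stays close to machine epsilon.
```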

## 85.12 Relative and Absolute Error

Suppose \(x\) is the exact value and \(\hat{x}\) is the computed approximation.

Absolute error is

$$
|x - \hat{x}|.
$$

Relative error is

$$
\frac{|x-\hat{x}|}{|x|}.
$$

Relative error is usually more important in numerical linear algebra because it measures error relative to scale.

For example:

| Exact value | Approximation | Relative error |
|---|---|---|
| \(10^6\) | \(10^6 + 1\) | \(10^{-6}\) (very small) |
| \(10^{-8}\) | \(2 \times 10^{-8}\) | \(1\), i.e. 100% (large) |
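
Both measures are trivial to compute (a small sketch; the helper names are chosen only for illustration):

```python
def abs_error(x, x_hat):
    return abs(x - x_hat)

def rel_error(x, x_hat):
    return abs(x - x_hat) / abs(x)

# Large value: absolute error 1, relative error only 1e-6.
print(abs_error(1.0e6, 1.0e6 + 1.0), rel_error(1.0e6, 1.0e6 + 1.0))

# Small value: absolute error 1e-8 looks tiny, but the relative error is 100%.
print(abs_error(1.0e-8, 2.0e-8), rel_error(1.0e-8, 2.0e-8))
```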

## 85.13 Backward Error Interpretation

Modern numerical analysis often interprets algorithms using backward error.

Instead of asking:

> How wrong is the computed answer?

we ask:

> For what nearby problem is the computed answer exact?

Suppose we compute solution \(\hat{x}\) for system

$$
Ax=b.
$$

If

$$
(A+\Delta A)\hat{x}=b+\Delta b,
$$

then \(\hat{x}\) is the exact solution of a nearby problem.

Backward stability means the perturbations can be chosen so that

$$
\frac{\|\Delta A\|}{\|A\|},
\qquad
\frac{\|\Delta b\|}{\|b\|}
$$

are small multiples of machine epsilon.

This viewpoint is central in numerical linear algebra.
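
A common way to estimate the normwise backward error of a computed solution is the scaled residual below (a sketch, assuming NumPy; the random test problem is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)

x_hat = np.linalg.solve(A, b)
r = b - A @ x_hat

# Normwise relative backward error estimate: small values mean x_hat
# solves a problem very close to the original one.
eta = np.linalg.norm(r) / (np.linalg.norm(A) * np.linalg.norm(x_hat)
                           + np.linalg.norm(b))
print(eta)   # typically a small multiple of 1e-16
```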

## 85.14 Floating Point Arithmetic Laws

Floating point arithmetic approximately obeys algebraic laws, but not exactly.

Associativity may fail.

For example,

$$
(a+b)+c
\neq
a+(b+c).
$$

Consider:

$$
a = 10^{16},
\qquad
b = -10^{16},
\qquad
c = 1.
$$

Then:

$$
(a+b)+c = 1,
$$

while

$$
a+(b+c)=0
$$

in floating point arithmetic.

This failure occurs because rounding changes intermediate results.

Algorithms must therefore avoid rearrangements that increase error.
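
The example above runs verbatim in double precision:

```python
a, b, c = 1.0e16, -1.0e16, 1.0

print((a + b) + c)   # 1.0 : a + b cancels exactly, then 1 is added
print(a + (b + c))   # 0.0 : b + c rounds back to -1e16, which cancels a
```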

## 85.15 Floating Point Matrix Operations

Matrix computations magnify floating point effects because they involve many arithmetic operations.

For matrix multiplication,

$$
C = AB,
$$

each entry requires multiple multiplications and additions:

$$
c_{ij} =
\sum_{k=1}^n a_{ik}b_{kj}.
$$

Each operation introduces rounding error.

The total error depends on:

| Factor | Effect |
|---|---|
| Matrix size | More operations |
| Matrix conditioning | Error amplification |
| Algorithm structure | Stability properties |
| Floating point precision | Baseline rounding scale |

Numerical linear algebra studies algorithms that minimize this accumulated error.
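
One way to see the accumulation is to compare the same product in single and double precision (a sketch, assuming NumPy; the random matrices are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

C64 = A @ B                                        # double precision product
C32 = A.astype(np.float32) @ B.astype(np.float32)  # single precision product

# Relative error of the single precision result, measured against double.
err = np.linalg.norm(C32.astype(np.float64) - C64) / np.linalg.norm(C64)
print(err)   # typically around 1e-6, i.e. on the scale of float32 rounding
```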

## 85.16 Floating Point and Gaussian Elimination

Gaussian elimination is sensitive to rounding error.

Small pivot elements can produce large multipliers:

$$
m_{ik} = \frac{a_{ik}}{a_{kk}}.
$$

Large multipliers amplify rounding effects.

Partial pivoting improves stability by choosing larger pivots.

This leads to the practical algorithm:

- Gaussian elimination with partial pivoting,
- usually written as LU factorization with row permutations.

Most practical dense linear solvers use this method.
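
A two-by-two example shows why a small pivot is dangerous (a sketch, assuming NumPy; `np.linalg.solve` applies partial pivoting internally):

```python
import numpy as np

A = np.array([[1.0e-20, 1.0],
              [1.0,     1.0]])
b = np.array([1.0, 2.0])

# Elimination WITHOUT pivoting: the multiplier 1/1e-20 is enormous.
m = A[1, 0] / A[0, 0]            # 1e20
a22 = A[1, 1] - m * A[0, 1]      # 1 - 1e20 rounds to -1e20
b2 = b[1] - m * b[0]             # 2 - 1e20 rounds to -1e20
x2 = b2 / a22                    # 1.0
x1 = (b[0] - A[0, 1] * x2) / A[0, 0]
print([x1, x2])                  # [0.0, 1.0]: x1 is completely wrong

# With row pivoting, the computed solution is accurate.
print(np.linalg.solve(A, b))     # approximately [1.0, 1.0]
```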

## 85.17 Floating Point and Orthogonality

Orthogonal matrices are numerically important because they preserve vector norms.

If \(Q\) is orthogonal,

$$
Q^TQ = I.
$$

Orthogonal transformations do not significantly amplify rounding error.

For this reason:

| Method | Stability |
|---|---|
| QR factorization | highly stable |
| Orthogonal iteration | stable |
| Householder reflections | stable |
| Gram-Schmidt (classical) | less stable |

Orthogonality is therefore one of the central tools of stable numerical computation.
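
The difference can be demonstrated on an ill-conditioned matrix (a sketch, assuming NumPy; `classical_gram_schmidt` is a helper written here only for illustration, while `np.linalg.qr` is based on Householder reflections):

```python
import numpy as np

def classical_gram_schmidt(A):
    """Classical Gram-Schmidt orthogonalization (known to be less stable)."""
    n = A.shape[1]
    Q = np.zeros_like(A)
    for j in range(n):
        v = A[:, j] - Q[:, :j] @ (Q[:, :j].T @ A[:, j])
        Q[:, j] = v / np.linalg.norm(v)
    return Q

# An ill-conditioned test matrix (a Hilbert matrix).
n = 10
H = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1.0)

Q_cgs = classical_gram_schmidt(H)
Q_house, _ = np.linalg.qr(H)

I = np.eye(n)
print(np.linalg.norm(Q_cgs.T @ Q_cgs - I))      # far above machine epsilon
print(np.linalg.norm(Q_house.T @ Q_house - I))  # near machine epsilon
```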

## 85.18 Floating Point and Eigenvalue Problems

Eigenvalue algorithms are especially sensitive to numerical issues.

Repeated matrix powers may overflow or underflow.

Close eigenvalues may be difficult to separate numerically.

Non-normal matrices may exhibit strong sensitivity.

Modern algorithms therefore use:

- orthogonal similarity transformations,
- Hessenberg reduction,
- QR iteration,
- balancing and scaling techniques.

These methods are carefully designed to control floating point error.
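
A tiny example of non-normal sensitivity (a sketch, assuming NumPy; the matrix is chosen only for illustration):

```python
import numpy as np

# A highly non-normal matrix: both eigenvalues equal 1.
A = np.array([[1.0, 1.0e8],
              [0.0, 1.0]])

# Perturb a single zero entry by 1e-12.
E = np.zeros((2, 2))
E[1, 0] = 1.0e-12

print(np.linalg.eigvals(A))       # [1. 1.]
print(np.linalg.eigvals(A + E))   # approximately [1.01, 0.99]
```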

## 85.19 Conditioning Versus Stability

Two different concepts are fundamental:

| Concept | Meaning |
|---|---|
| Conditioning | Sensitivity of the mathematical problem |
| Stability | Sensitivity introduced by the algorithm |

A well-conditioned problem may still produce poor results if the algorithm is unstable.

A stable algorithm may still produce poor results if the problem itself is ill-conditioned.

These ideas are developed in later chapters.

## 85.20 Floating Point as a Mathematical Model

Floating point arithmetic is often modeled abstractly by:

$$
\mathrm{fl}(x \circ y) =
(x \circ y)(1+\delta),
\qquad
|\delta| \le \varepsilon_{\text{mach}}.
$$

Here \(\circ\) represents an arithmetic operation.

This simple model supports much of modern error analysis.

It allows theoretical estimates of accumulated error in matrix algorithms.

## 85.21 Numerical Reproducibility

Floating point computations may vary across systems because:

- processors differ,
- instruction order differs,
- parallel execution changes summation order,
- compiler optimizations rearrange arithmetic.

Thus two mathematically identical programs may produce slightly different results.

Reproducibility is an important issue in scientific computing and high-performance linear algebra.
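
Even within one program, changing only the order of a summation changes the result (a standard-library sketch; `math.fsum` is order-independent because it returns a correctly rounded sum):

```python
import math
import random

random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(100000)]

# The same numbers, summed in two different orders.
s_forward = sum(values)
s_reverse = sum(reversed(values))

print(s_forward == s_reverse)   # usually False
print(s_forward - s_reverse)    # a tiny, nonzero difference

# A correctly rounded summation does not depend on the order.
print(math.fsum(values) == math.fsum(reversed(values)))   # True
```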

## 85.22 Floating Point in Practice

Practical numerical software must account for floating point limitations.

Good numerical software:

- avoids cancellation,
- avoids overflow and underflow,
- uses stable decompositions,
- scales matrices appropriately,
- estimates conditioning,
- monitors residual error,
- minimizes unnecessary operations.

Libraries such as LAPACK, BLAS, Eigen, and Intel MKL are designed around these principles.

## 85.23 Summary

Floating point arithmetic approximates real arithmetic using finite binary representations.

Key concepts include:

| Concept | Meaning |
|---|---|
| Machine epsilon | Baseline rounding scale |
| Rounding | Approximation to nearest representable value |
| Cancellation | Loss of significant digits |
| Overflow | Number too large to represent |
| Underflow | Number too small to represent |
| Relative error | Scale-sensitive error measure |
| Stability | Algorithmic resistance to rounding error |

Floating point arithmetic is the computational foundation of numerical linear algebra. Every numerical algorithm operates within its limitations. Understanding these limitations is necessary for designing stable algorithms and interpreting computed results correctly.
