TAOCP 1.2.2 Exercise 28
Let $0 \le x < 1,$ and let $b>0$, $b\ne1$.
Section 1.2.2: Numbers, Powers, and Logarithms
Exercise 28. [M30] (R. Feynman.) Develop a method for computing $b^x$ when $0 \le x < 1$, using only shifting, addition, and subtraction (similar to the algorithm in exercise 25), and analyze its accuracy.
Verified: no
Solve time: 27s
Setup
Let
$0 \le x < 1,$
and let $b>0$, $b\ne1$. We seek a procedure for computing
$b^x,$
using only shifting, addition, and subtraction, in a manner analogous to Exercise 25.
Write the binary expansion of $x$ as
$x=\sum_{k=1}^\infty \frac{a_k}{2^k}, \qquad a_k\in{0,1}. \eqno(1)$
Then
= b^{\sum a_k/2^k} = \prod_{k=1}^\infty b^{a_k/2^k} = \prod_{a_k=1} b^{1/2^k}, \eqno(2)$$
by the laws of exponents (5). Define
$$c_k=b^{1/2^k}-1. \eqno(3)$$
Then
$$b^{1/2^k}=1+c_k,$$
hence
$$b^x=\prod_{a_k=1}(1+c_k). \eqno(4)$$
The constants $c_k$ may be precomputed once and stored. Since
$$b^{1/2^k}\to1,$$
we have
$$c_k\to0.$$
The problem is therefore reduced to multiplying selected factors $(1+c_k)$ using only shifts, additions, and subtractions. ## Solution Assume first that $b>1$. Let
$$x=(0.a_1a_2a_3\ldots)2.$$
We use the following algorithm. ### Algorithm E E1. [Initialize.] Set
$$y\leftarrow1,\qquad k\leftarrow1.$$
E2. [Test digit.] Determine the binary digit $a_k$ of $x$. E3. [Multiply if necessary.] If $a_k=1$, set
$$y\leftarrow y+y,c_k.$$
E4. [Advance.] Set
$$k\leftarrow k+1,$$
and repeat from E2. The operation in E3 requires only shifting, addition, and subtraction provided the constants $c_k$ are represented in binary form. For if
$$c_k=\sum{j=1}^m \frac{\gamma_{k,j}}{2^j}, \qquad \gamma_{k,j}\in{-1,0,1}, \eqno(5)$$
then
$$y,c_k =\sum_{j=1}^m \gamma_{k,j}\frac{y}{2^j}, \eqno(6)$$
which is obtained by shifts and additions or subtractions. We now prove correctness. Let
$$x_n=\sum_{k=1}^n\frac{a_k}{2^k}. \eqno(7)$$
After step E3 has been executed for $k=1,2,\ldots,n$, the value of $y$ is
$$y_n=\prod_{k=1}^n(1+c_k)^{a_k}. \eqno(8)$$
Since $1+c_k=b^{1/2^k}$, we obtain
\begin{aligned}
y_n
&=\prod_{k=1}^n\left(b^{1/2^k}\right)^{a_k}
&=b^{\sum_{k=1}^n a_k/2^k}
&=b^{x_n}. \eqno(9)
\end{aligned}
Because
$$0\le x-x_n<\frac1{2^n}, \eqno(10)$$
we have
$$b^{x_n}\le b^x<b^{x_n+2^{-n}}. \eqno(11)$$
Therefore
$$y_n\le b^x< y_n,b^{2^{-n}}, \eqno(12)$$
and hence
$$0\le b^x-y_n< y_n\left(b^{2^{-n}}-1\right). \eqno(13)$$
This gives an explicit truncation bound. To analyze computational error, suppose each multiplication in step E3 is performed with relative error at most $\epsilon_k$, where
$$|\epsilon_k|\le\epsilon. \eqno(14)$$
If the exact value after $n$ stages is $b^{x_n}$ and the computed value is $y_n'$, then
y_n'
b^{x_n} \prod_{k:,a_k=1}(1+\epsilon_k). \eqno(15) Since at most $n$ factors occur, (1-\epsilon)^n \le \prod_{k:,a_k=1}(1+\epsilon_k) \le (1+\epsilon)^n. \eqno(16) $$Hence$$ b^{x_n}(1-\epsilon)^n \le y_n' \le b^{x_n}(1+\epsilon)^n. \eqno(17) $$Combining (11) and (17), we obtain$$ b^x,b^{-2^{-n}}(1-\epsilon)^n < y_n' \le b^x(1+\epsilon)^n. \eqno(18) Taking logarithms to the base $b$, x-2^{-n}+n\log_b(1-\epsilon) < \log_b y_n' \le x+n\log_b(1+\epsilon). \eqno(19) Thus the truncation error decreases geometrically with $n$, while the accumulated arithmetic error grows at most linearly in the number of stages. When $0<b<1$, define $$b^x=(1/b)^{-x},$$ as in the text. Since $$-1< -x\le0,$$ the same procedure applies to the base $1/b>1$. ## Verification We verify formula (12) independently. From (10), $$0\le x-x_n<2^{-n}.$$ Since $b>1$, exponentiation preserves order; therefore $$1\le b^{x-x_n}<b^{2^{-n}}. \eqno(20)$$ Multiplying by $b^{x_n}$ gives $$b^{x_n}\le b^x<b^{x_n}b^{2^{-n}}. \eqno(21)$$ Using $y_n=b^{x_n}$ from (9), $$y_n\le b^x<y_n,b^{2^{-n}},$$ which is precisely (12). We also verify the inductive formula (8). Initially, $$y_0=1.$$ Suppose after stage $n-1$, $$y_{n-1}=\prod_{k=1}^{n-1}(1+c_k)^{a_k}.$$ If $a_n=0$, step E3 leaves $y$ unchanged, hence
y_n
\prod_{k=1}^{n}(1+c_k)^{a_k}.
If $a_n=1$, step E3 multiplies by $(1+c_n)$, so
\begin{aligned}
y_n
&=y_{n-1}(1+c_n)
&=\prod_{k=1}^{n}(1+c_k)^{a_k}.
\end{aligned}
Thus (8) holds for all $n$. This completes the proof. ∎ ## Notes A particularly convenient choice is $b=e$. Then
$$c_k=e^{1/2^k}-1,$$
and for large $k$,
$$c_k\approx \frac1{2^k}.$$Consequently the later stages require only very small corrections, obtainable almost entirely by shifts.