TAOCP 3.3.1 Exercise 23

Section 3.3.1: General Test Procedures for Studying Random Data

Exercise 23. [**] [M22] (T. Gonzalez, S. Sahni, and W. R. Franta.) (a) Suppose that the maximum value in formula (13) for the KS statistic $K_n^+$ occurs at a given index $j$ where $\lfloor nF(X_j) \rfloor = k$. Prove that $F(X_j) = \max_{k \le i \le n} {F(X_i) \mid \lfloor nF(X_i) \rfloor = k}$. (b) Design an algorithm that calculates $K_n^+$ and $K_n^-$ in $O(n)$ steps (without sorting).

Verified: yes
Solve time: 1m53s

Let $U_i=F(X_i)$, and let the values be renumbered so that

$$ U_1\le U_2\le\cdots\le U_n. $$

Formula (13) is

$$ K_n^+=\max_{1\le i\le n}\left(U_i-\frac{i-1}{n}\right). $$

Suppose the maximum occurs at index $j$, and let $k=\lfloor nU_j\rfloor$. Every $U_i$ with $\lfloor nU_i\rfloor=k$ lies in the interval $[k/n,(k+1)/n)$, hence such indices form a consecutive block in the ordered sequence. If there were an index $r\ge j$ with $\lfloor nU_r\rfloor=k$ and $U_r>U_j$, then

$$ U_r-\frac{r-1}{n} \ge U_r-\frac{j-1}{n}

U_j-\frac{j-1}{n}, $$

because all indices in the same block beginning at $j$ satisfy $r\ge j$. This contradicts the choice of $j$. Therefore

$$ U_j=\max_{k\le i\le n}{U_i\mid \lfloor nU_i\rfloor=k}, $$

which is the required statement.

For part (b), place each $U_i$ into the bucket

$$ k=\lfloor nU_i\rfloor,\qquad 0\le k\le n-1. $$

For each bucket $k$ compute:

$$ N_k=#{i:\lfloor nU_i\rfloor=k}, $$

$$ A_k=\max{U_i:\lfloor nU_i\rfloor=k}, $$

$$ B_k=\min{U_i:\lfloor nU_i\rfloor=k}. $$

Let

$$ C_k=N_0+\cdots+N_{k-1}. $$

Then $C_k$ is the number of observations in lower-numbered buckets.

By part (a), the maximum contribution to $K_n^+$ from bucket $k$ comes from the largest $U$ in that bucket and the first index occupied by the bucket, namely $C_k+1$. Hence

$$ K_n^+ = \max_k\left(A_k-\frac{C_k}{n}\right). $$

Similarly, the maximum contribution to

$$ K_n^-=\max_i\left(\frac{i}{n}-U_i\right) $$

from bucket $k$ comes from the smallest $U$ in that bucket and the last index occupied by the bucket, namely $C_k+N_k$. Therefore

$$ K_n^- = \max_k\left(\frac{C_k+N_k}{n}-B_k\right). $$

The quantities $N_k$, $A_k$, and $B_k$ are obtained in one pass through the data, and the prefix sums $C_k$ together with the final maxima are obtained in one pass through the $n$ buckets. The total running time is therefore $O(n)$, and no sorting is required. ∎