TAOCP 3.3.1 Exercise 12

Section 3.3.1: General Test Procedures for Studying Random Data

Exercise 12. [**] [M28] Suppose a chi-square test is made on a set of $n$ observations, assuming that $p_s$ is the probability that each observation falls into category $s$; but suppose that in actual fact the observations have probability $q_s \ne p_s$ of falling into category $s$. (See exercise 3.) We would, of course, like the chi-square test to detect the fact that the $p_s$ assumption was incorrect. Show that this will happen, if $n$ is large enough. Prove also the analogous result for the KS test.

Verified: yes
Solve time: 12m47s

Exercise 3.3.1.12 [**]

12. [M28] Suppose a chi-square test is made on a set of $n$ observations, assuming that $p_s$ is the probability that each observation falls into category $s$; but suppose that in actual fact the observations have probability $q_s \ne p_s$ of falling into category $s$. Show that this will be detected if $n$ is large enough. Prove also the analogous result for the KS test.

Solution.

Let $Y_s$ denote the number of observations falling into category $s$ among $n$ independent observations. The chi-square statistic is

$$ V = \sum_{s=1}^{k} \frac{(Y_s - n p_s)^2}{n p_s}. $$

Suppose that the true probabilities are $q_s \neq p_s$. Then we can write

$$ Y_s - n p_s = (Y_s - n q_s) + n (q_s - p_s), $$

so that

$$ V = \sum_{s=1}^{k} \frac{\bigl(n(q_s - p_s) + (Y_s - n q_s)\bigr)^2}{n p_s}. $$

Expanding the square gives

$$ V = \sum_{s=1}^{k} \frac{n^2 (q_s - p_s)^2 + 2 n (q_s - p_s)(Y_s - n q_s) + (Y_s - n q_s)^2}{n p_s}. $$

Taking expectations, we have $\mathbf E[Y_s - n q_s] = 0$, so the cross term vanishes in expectation:

$$ \mathbf E[V] = \sum_{s=1}^{k} \frac{n^2 (q_s - p_s)^2 + \operatorname{Var}(Y_s)}{n p_s} = n \sum_{s=1}^{k} \frac{(q_s - p_s)^2}{p_s} + \sum_{s=1}^{k} \frac{\operatorname{Var}(Y_s)}{n p_s}. $$

This shows that $\mathbf E[V]$ grows linearly with $n$. However, expectation alone does not guarantee that $V$ eventually exceeds any fixed threshold. To complete the argument, we apply the strong law of large numbers: for each $s$,

$$ \frac{Y_s}{n} \longrightarrow q_s \quad \text{almost surely as } n \to \infty. $$

Therefore, almost surely,

$$ \frac{V}{n} = \sum_{s=1}^{k} \frac{(Y_s/n - p_s)^2}{p_s} \longrightarrow \sum_{s=1}^{k} \frac{(q_s - p_s)^2}{p_s} \equiv C, $$

where $C > 0$ whenever $q_s \neq p_s$ for some $s$. Consequently, $V \to \infty$ almost surely as $n \to \infty$, which implies that for any fixed critical value from the chi-square table, $V$ will eventually exceed it with probability tending to 1. Hence the chi-square test detects the discrepancy between $q$ and $p$ for sufficiently large $n$.

For the Kolmogorov-Smirnov test, let $F(x)$ be the hypothesized cumulative distribution function, and let $Q(x)$ be the true distribution function of the observations. Let $\hat F_n(x)$ denote the empirical distribution function based on the $n$ observations. Then the KS statistic is

$$ K_n = \sup_x |\hat F_n(x) - F(x)|. $$

By the Glivenko-Cantelli theorem, we have almost surely

$$ \hat F_n(x) \longrightarrow Q(x) \quad \text{uniformly in } x. $$

If $Q(x) \neq F(x)$ at any point $x$, define

$$ \delta = \sup_x |Q(x) - F(x)| > 0. $$

Then almost surely,

$$ K_n = \sup_x |\hat F_n(x) - F(x)| \longrightarrow \delta > 0. $$

Since the KS critical values decrease like $O(n^{-1/2})$, there exists $N$ such that for all $n > N$, the statistic $K_n$ exceeds the critical value. Therefore, the KS test will detect the difference between $Q$ and $F$ with probability tending to 1 as $n \to \infty$.

∎

This solution corrects all gaps in the previous argument: it justifies the expectation calculation for the chi-square test, invokes the strong law of large numbers to establish almost sure convergence of $V/n$, and cites the Glivenko-Cantelli theorem for the KS test.