TAOCP 2.3.4.5 Exercise 14

Let a forest be a collection of extended binary trees, each node carrying a positive weight, and let the weight of a tree be the sum of weights of its leaves.

Section 2.3.4.5: Path Length

Exercise 14. [**] [25] (T. C. Hu and A. C. Tucker.) After $k$ steps of Huffman's algorithm, the nodes combined so far form a forest of $m-k$ extended binary trees. Prove that this forest has the smallest total weighted path length, among all forests of $m-k$ extended binary trees that have the given weights.

Verified: yes
Solve time: 2m13s


Let a forest be a collection of extended binary trees, each node carrying a positive weight, and let the weight of a tree be the sum of weights of its leaves. For a forest $F$, define its internal path length

$$ I(F)=\sum_{v} w(v),d_F(v), $$

where $d_F(v)$ is the depth of $v$ within its own tree in the forest (roots have depth $0$).

Let $F_k$ be the forest obtained after $k$ steps of Huffman’s algorithm. We prove:

$F_k$ has minimum internal path length among all forests obtainable from the initial weights by performing $k$ Huffman-type merges (equivalently, all forests with $m-k$ trees whose node weights arise from $k$ prior binary mergers).

1. Correct decomposition of completion cost

Fix a forest $F$ with root weights $W={w_1,\dots,w_{m-k}}$. Any completion of $F$ into a full binary tree consists of:

  1. The internal structure of each tree in $F$, contributing $I(F)$.
  2. A supertree $T$ whose leaves are the roots of the trees in $F$, with leaf weights $W$.

For any completion $C$,

$$ \mathrm{WPL}(C)=I(F)+\mathrm{WPL}(T), $$

because every original node has its depth equal to its depth inside its tree plus the depth of that tree’s root in $T$.

Crucially, the second term depends only on the multiset $W$, not on the internal shapes of the trees in $F$. Therefore,

$$ \min_{C \supset F} \mathrm{WPL}(C) = I(F) + \min_{T \text{ on } W} \mathrm{WPL}(T). $$

Since the second term is independent of the internal structure of $F$, comparing forests with the same root weights reduces exactly to comparing their internal path lengths $I(F)$. This justifies the reduction used in the exercise.

2. Induction invariant

Let $C_k$ be the class of forests obtainable from the initial $m$ singleton nodes by performing exactly $k$ binary merges (so each forest has $m-k$ trees and $k$ new internal nodes with induced weights).

Induction claim. $F_k$ minimizes $I(F)$ over all $F \in C_k$.

3. Base case

For $k=0$, every node is isolated, so $I(F)=0$. Hence $F_0$ is optimal.

4. Key structural lemma (exchange argument in the forest setting)

Let $F_k \in C_k$, and let $u,v$ be the two smallest-weight roots in $F_k$, with weights $a \le b$.

Lemma

There exists an optimal completion of $F_k$ in which $u$ and $v$ are siblings.

Proof

Consider an optimal completion $C$ of $F_k$, and let $T$ be its supertree on root set $W$.

In any optimal binary tree with leaf weights $W$, the standard Huffman sibling property applies: there exists an optimal tree in which the two smallest weights are siblings. This argument depends only on swaps in the supertree $T$, and does not affect internal structures of the trees in $F_k$, since those are already fixed subtrees attached at the leaves of $T$.

More explicitly, if the two smallest weights $a,b$ are not siblings in $T$, consider a deepest pair of sibling leaves in $T$, say with weights $x \le y$. By the standard exchange argument, swapping $a$ with $x$ and $b$ with $y$ (or appropriate adjustments if only one is deeper) does not increase $\mathrm{WPL}(T)$. This operation only permutes leaves of $T$ and preserves feasibility because all leaf weights are interchangeable as root-trees of $F_k$.

Thus there exists an optimal supertree where $u$ and $v$ are siblings.

5. Effect of merging

Form a new forest $F_{k+1}$ by merging $u$ and $v$ into a single tree with root weight $a+b$.

We compare internal costs.

In any forest $F$, merging $u$ and $v$ increases internal path length by exactly

$$ \Delta I = a + b. $$

Reason: the new internal node is at depth $0$ within its tree, and every leaf in $u$ and $v$ increases its depth by $1$, contributing $a$ and $b$ respectively in total weighted increase.

Hence,

$$ I(F_{k+1}) = I(F_k) + a + b. $$

6. Optimal substructure step

Let $G_k \in C_k$ be any forest, and let $x,y$ be its two smallest roots.

By the lemma, there exists an optimal completion of $G_k$ in which $x$ and $y$ are merged first. Therefore, among all possible first merges, choosing $x,y$ is optimal in the sense that it leads to the smallest possible increase in total cost.

Any other choice of pair $p,q$ produces increase $p+q \ge x+y$, since $x,y$ are minimal.

Thus the greedy merge minimizes the incremental contribution to internal path length at step $k+1$.

7. Induction step

Assume $F_k$ minimizes $I(F)$ over $C_k$. Let $F_{k+1}$ be obtained by merging the two smallest roots $a,b$ in $F_k$.

For any $G_k \in C_k$, its optimal continuation to $C_{k+1}$ must choose a pair minimizing the incremental cost $p+q$, hence at least $a+b$. Therefore,

$$ I(F_{k+1}) = \min_{G_k \in C_k} I(G_{k+1}). $$

Combining with the induction hypothesis, $F_{k+1}$ minimizes internal path length over $C_{k+1}$.

8. Conclusion

By induction, $F_k$ has minimum internal path length among all forests obtainable by $k$ merges from the initial weights. Together with the decomposition of completion cost, this establishes that after $k$ steps of Huffman’s algorithm, the resulting forest is optimal among all comparable forests.