TAOCP 2.3.4.5 Exercise 14
Let a forest be a collection of extended binary trees, each node carrying a positive weight, and let the weight of a tree be the sum of weights of its leaves.
Exercise 14. [**] [25] (T. C. Hu and A. C. Tucker.) After $k$ steps of Huffman's algorithm, the nodes combined so far form a forest of $m-k$ extended binary trees. Prove that this forest has the smallest total weighted path length, among all forests of $m-k$ extended binary trees that have the given weights.
Verified: yes
Solve time: 2m13s
Let a forest be a collection of extended binary trees, each node carrying a positive weight, and let the weight of a tree be the sum of weights of its leaves. For a forest $F$, define its internal path length
$$ I(F)=\sum_{v} w(v),d_F(v), $$
where $d_F(v)$ is the depth of $v$ within its own tree in the forest (roots have depth $0$).
Let $F_k$ be the forest obtained after $k$ steps of Huffman’s algorithm. We prove:
$F_k$ has minimum internal path length among all forests obtainable from the initial weights by performing $k$ Huffman-type merges (equivalently, all forests with $m-k$ trees whose node weights arise from $k$ prior binary mergers).
1. Correct decomposition of completion cost
Fix a forest $F$ with root weights $W={w_1,\dots,w_{m-k}}$. Any completion of $F$ into a full binary tree consists of:
- The internal structure of each tree in $F$, contributing $I(F)$.
- A supertree $T$ whose leaves are the roots of the trees in $F$, with leaf weights $W$.
For any completion $C$,
$$ \mathrm{WPL}(C)=I(F)+\mathrm{WPL}(T), $$
because every original node has its depth equal to its depth inside its tree plus the depth of that tree’s root in $T$.
Crucially, the second term depends only on the multiset $W$, not on the internal shapes of the trees in $F$. Therefore,
$$ \min_{C \supset F} \mathrm{WPL}(C) = I(F) + \min_{T \text{ on } W} \mathrm{WPL}(T). $$
Since the second term is independent of the internal structure of $F$, comparing forests with the same root weights reduces exactly to comparing their internal path lengths $I(F)$. This justifies the reduction used in the exercise.
2. Induction invariant
Let $C_k$ be the class of forests obtainable from the initial $m$ singleton nodes by performing exactly $k$ binary merges (so each forest has $m-k$ trees and $k$ new internal nodes with induced weights).
Induction claim. $F_k$ minimizes $I(F)$ over all $F \in C_k$.
3. Base case
For $k=0$, every node is isolated, so $I(F)=0$. Hence $F_0$ is optimal.
4. Key structural lemma (exchange argument in the forest setting)
Let $F_k \in C_k$, and let $u,v$ be the two smallest-weight roots in $F_k$, with weights $a \le b$.
Lemma
There exists an optimal completion of $F_k$ in which $u$ and $v$ are siblings.
Proof
Consider an optimal completion $C$ of $F_k$, and let $T$ be its supertree on root set $W$.
In any optimal binary tree with leaf weights $W$, the standard Huffman sibling property applies: there exists an optimal tree in which the two smallest weights are siblings. This argument depends only on swaps in the supertree $T$, and does not affect internal structures of the trees in $F_k$, since those are already fixed subtrees attached at the leaves of $T$.
More explicitly, if the two smallest weights $a,b$ are not siblings in $T$, consider a deepest pair of sibling leaves in $T$, say with weights $x \le y$. By the standard exchange argument, swapping $a$ with $x$ and $b$ with $y$ (or appropriate adjustments if only one is deeper) does not increase $\mathrm{WPL}(T)$. This operation only permutes leaves of $T$ and preserves feasibility because all leaf weights are interchangeable as root-trees of $F_k$.
Thus there exists an optimal supertree where $u$ and $v$ are siblings.
∎
5. Effect of merging
Form a new forest $F_{k+1}$ by merging $u$ and $v$ into a single tree with root weight $a+b$.
We compare internal costs.
In any forest $F$, merging $u$ and $v$ increases internal path length by exactly
$$ \Delta I = a + b. $$
Reason: the new internal node is at depth $0$ within its tree, and every leaf in $u$ and $v$ increases its depth by $1$, contributing $a$ and $b$ respectively in total weighted increase.
Hence,
$$ I(F_{k+1}) = I(F_k) + a + b. $$
6. Optimal substructure step
Let $G_k \in C_k$ be any forest, and let $x,y$ be its two smallest roots.
By the lemma, there exists an optimal completion of $G_k$ in which $x$ and $y$ are merged first. Therefore, among all possible first merges, choosing $x,y$ is optimal in the sense that it leads to the smallest possible increase in total cost.
Any other choice of pair $p,q$ produces increase $p+q \ge x+y$, since $x,y$ are minimal.
Thus the greedy merge minimizes the incremental contribution to internal path length at step $k+1$.
7. Induction step
Assume $F_k$ minimizes $I(F)$ over $C_k$. Let $F_{k+1}$ be obtained by merging the two smallest roots $a,b$ in $F_k$.
For any $G_k \in C_k$, its optimal continuation to $C_{k+1}$ must choose a pair minimizing the incremental cost $p+q$, hence at least $a+b$. Therefore,
$$ I(F_{k+1}) = \min_{G_k \in C_k} I(G_{k+1}). $$
Combining with the induction hypothesis, $F_{k+1}$ minimizes internal path length over $C_{k+1}$.
8. Conclusion
By induction, $F_k$ has minimum internal path length among all forests obtainable by $k$ merges from the initial weights. Together with the decomposition of completion cost, this establishes that after $k$ steps of Huffman’s algorithm, the resulting forest is optimal among all comparable forests.
∎