14.6 Huffman Coding
Huffman coding is one of the most successful greedy algorithms ever developed.
14.6 Huffman Coding
Huffman coding is one of the most successful greedy algorithms ever developed. It is used in file compression, image formats, audio codecs, archive utilities, network protocols, and countless data storage systems.
Unlike activity scheduling, which optimizes the number of selected intervals, Huffman coding optimizes the representation of information itself. Given a set of symbols and their frequencies, the goal is to construct a binary encoding that minimizes the total number of bits required to store or transmit the data.
The remarkable aspect of Huffman coding is that a sequence of locally optimal merges produces a globally optimal code.
Problem
Given symbols and frequencies:
| Symbol | Frequency |
|---|---|
| A | 45 |
| B | 13 |
| C | 12 |
| D | 16 |
| E | 9 |
| F | 5 |
Construct a binary code such that:
- No code is a prefix of another code.
- The total encoded size is minimized.
The second condition is the optimization objective.
The first condition guarantees unique decoding.
Why Fixed-Length Codes Waste Space
Suppose six symbols exist.
A fixed-length encoding requires:
$$ \lceil \log_2 6 \rceil = 3 $$
bits per symbol.
Example:
| Symbol | Code |
|---|---|
| A | 000 |
| B | 001 |
| C | 010 |
| D | 011 |
| E | 100 |
| F | 101 |
Total bits:
$$ 3 \times \text{number of symbols} $$
This ignores frequency information.
Highly frequent symbols consume exactly the same space as rare symbols.
Basic Idea
Frequently occurring symbols should receive shorter codes.
Rare symbols can tolerate longer codes.
For example:
| Symbol | Frequency | Code |
|---|---|---|
| A | 45 | 0 |
| B | 13 | 101 |
| C | 12 | 100 |
| D | 16 | 111 |
| E | 9 | 1101 |
| F | 5 | 1100 |
The common symbol A now uses only one bit.
Rare symbols receive longer encodings.
Average storage decreases dramatically.
Prefix Codes
A valid Huffman code must be a prefix code.
Definition:
No codeword may be the prefix of another codeword.
Valid:
| Symbol | Code |
|---|---|
| A | 0 |
| B | 10 |
| C | 110 |
| D | 111 |
Invalid:
| Symbol | Code |
|---|---|
| A | 0 |
| B | 01 |
Because:
0
is a prefix of:
01
Decoding becomes ambiguous.
Binary Tree Representation
Prefix codes naturally correspond to binary trees.
Example:
root
/ \\n A *
/ \\n B *
/ \\n C D
Assign:
- Left edge = 0
- Right edge = 1
Codes become:
| Symbol | Code |
|---|---|
| A | 0 |
| B | 10 |
| C | 110 |
| D | 111 |
Every leaf represents a symbol.
The path from root to leaf determines the code.
Cost Function
Let:
- (f_i) = frequency
- (d_i) = depth of leaf
Total encoded size:
$$ \sum_i f_i d_i $$
This is the quantity we wish to minimize.
The optimization problem becomes:
Construct a binary tree with minimum weighted path length.
Key Greedy Observation
Consider the two least frequent symbols.
Example:
| Symbol | Frequency |
|---|---|
| E | 9 |
| F | 5 |
Since they occur rarely, placing them deeper in the tree causes little penalty.
An optimal tree always places the least frequent symbols among the deepest leaves.
This observation leads directly to the greedy strategy.
Greedy Rule
Repeatedly:
Merge the two lowest-frequency nodes.
Create a new parent whose frequency equals the sum.
Continue until only one node remains.
Example Construction
Initial frequencies:
| Symbol | Frequency |
|---|---|
| A | 45 |
| B | 13 |
| C | 12 |
| D | 16 |
| E | 9 |
| F | 5 |
Step 1
Merge smallest frequencies:
F(5) + E(9)
New node:
14
Current set:
12 13 14 16 45
Step 2
Merge:
12 + 13
New node:
25
Current set:
14 16 25 45
Step 3
Merge:
14 + 16
New node:
30
Current set:
25 30 45
Step 4
Merge:
25 + 30
New node:
55
Current set:
45 55
Step 5
Merge:
45 + 55
New node:
100
Finished.
The final root frequency equals the total frequency count.
Resulting Tree
100
/ \\n 45 55
/ \\n 25 30
/ \ / \\n 12 13 14 16
/ \\n 5 9
Assign:
- Left = 0
- Right = 1
Codes emerge automatically.
Why the Greedy Choice Works
The key theorem is:
The two least frequent symbols can always be placed as sibling leaves in some optimal tree.
Suppose the two smallest frequencies are:
$$ x $$
and
$$ y $$
Any optimal tree can be transformed so that:
- (x) and (y) become siblings.
- Total cost remains unchanged or decreases.
This is an exchange argument.
Once combined, the pair behaves like a single symbol with frequency:
$$ x+y $$
The problem becomes smaller.
The same argument applies recursively.
Recursive Structure
After merging:
$$ x+y $$
the remaining task is:
Construct an optimal Huffman tree for the reduced symbol set.
This self-similarity allows the greedy algorithm to solve the entire problem.
Priority Queue Implementation
At each step we need the two smallest frequencies.
A min-heap is ideal.
Algorithm:
Insert all frequencies into min-heap
While heap size > 1:
x = extract minimum
y = extract minimum
create node x+y
insert x+y
Return remaining node
Go Implementation
type Node struct {
Freq int
Left *Node
Right *Node
}
Core algorithm:
for pq.Len() > 1 {
x := heap.Pop(pq).(*Node)
y := heap.Pop(pq).(*Node)
parent := &Node{
Freq: x.Freq + y.Freq,
Left: x,
Right: y,
}
heap.Push(pq, parent)
}
The remaining node is the Huffman tree root.
Complexity Analysis
Let:
$$ n $$
be the number of symbols.
Heap construction:
$$ O(n) $$
Each merge:
$$ O(\log n) $$
Number of merges:
$$ n-1 $$
Total:
$$ O(n \log n) $$
Space:
$$ O(n) $$
Real-World Usage
Huffman coding appears in:
ZIP Archives
Many ZIP implementations use Huffman encoding internally.
PNG Images
PNG compression relies heavily on Huffman codes.
JPEG Images
Entropy coding phase uses Huffman trees.
DEFLATE
The compression algorithm behind ZIP and gzip uses Huffman coding.
Network Protocols
Efficient symbol encoding reduces transmission costs.
Why Huffman Is Important
Huffman coding demonstrates a greedy algorithm operating on a tree structure rather than a sequence of intervals or graph edges.
The algorithm repeatedly performs a local optimization:
Merge the two least important nodes.
Despite this local viewpoint, the final tree is globally optimal.
This makes Huffman coding one of the most elegant examples of greedy design.
Relationship to Previous Sections
Interval scheduling:
Choose earliest finish time
Activity selection:
Choose earliest finish time
Huffman coding:
Merge smallest frequencies
All three algorithms rely on the same pattern:
- Prove a greedy choice is safe.
- Reduce the problem size.
- Solve the smaller instance recursively.
The specific structures differ, but the proof strategy remains remarkably similar.
Takeaway
Huffman coding constructs an optimal prefix code by repeatedly merging the two least frequent symbols. The resulting binary tree minimizes the weighted path length and therefore minimizes the total encoded size. A priority queue enables an (O(n \log n)) implementation, while an exchange argument proves that the greedy merge operation is always safe. Huffman coding remains one of the most important and widely deployed greedy algorithms in computer science, demonstrating how a sequence of simple local decisions can produce a globally optimal solution.