20.6 Building a Compression Tool
Design a compression tool that reduces the size of textual data while preserving the original information.
20.6 Building a Compression Tool
Problem
Design a compression tool that reduces the size of textual data while preserving the original information. The tool should accept an input stream, generate a compact representation, and later reconstruct the original content exactly.
Compression appears throughout computing. Source code repositories compress objects before storage. Databases compress pages to improve cache efficiency. Network protocols reduce bandwidth consumption. Log systems compress archives to reduce storage costs. Backup systems compress files before transmission. The underlying objective remains the same: represent information using fewer bits.
Consider the following text:
ERROR ERROR ERROR ERROR ERROR
A naive storage system records every character:
ERROR ERROR ERROR ERROR ERROR
A compression system recognizes repetition and stores a more compact representation:
5 × ERROR
The challenge is determining how to discover patterns automatically and encode them efficiently.
Understanding Compression
Compression algorithms exploit redundancy.
If every symbol appears with equal probability and no patterns exist, compression opportunities are limited. If symbols, words, or structures repeat frequently, the data contains redundancy that can be removed.
For example:
AAAAAAABBBBBBBCCCCCCC
contains obvious repetition.
By contrast:
Q7xJ4mT2ZpL8vN5
appears much more random and offers fewer compression opportunities.
Compression algorithms generally fall into two categories:
| Category | Goal |
|---|---|
| Lossless | Recover the exact original data |
| Lossy | Recover an approximation |
A source code file, executable, database page, or financial record requires lossless compression. Image, audio, and video systems often use lossy compression.
This case study focuses entirely on lossless techniques.
Run-Length Encoding
The simplest compression algorithm is run-length encoding.
Instead of storing repeated symbols individually:
AAAAAA
store:
6A
Implementation:
def rle_encode(text):
if not text:
return []
result = []
current = text[0]
count = 1
for ch in text[1:]:
if ch == current:
count += 1
else:
result.append((count, current))
current = ch
count = 1
result.append((count, current))
return result
Example:
print(rle_encode("AAABBBCC"))
Output:
[(3, 'A'), (3, 'B'), (2, 'C')]
Decoding:
def rle_decode(encoded):
parts = []
for count, symbol in encoded:
parts.append(symbol * count)
return "".join(parts)
This algorithm is extremely simple and extremely fast.
Unfortunately, it performs poorly on many real datasets.
Input:
ABCDEFG
Output:
1A1B1C1D1E1F1G
The compressed representation is larger than the original.
Compression systems therefore need more sophisticated approaches.
Frequency-Based Compression
Many datasets contain symbols that occur far more frequently than others.
Consider:
AAAAAABBBBCCD
Frequency table:
| Symbol | Frequency |
|---|---|
| A | 6 |
| B | 4 |
| C | 2 |
| D | 1 |
A fixed-length encoding might use:
A = 00
B = 01
C = 10
D = 11
Every symbol requires two bits.
A better strategy assigns shorter codes to common symbols and longer codes to rare symbols.
This observation leads to Huffman coding.
Huffman Coding
Huffman coding constructs an optimal prefix code from symbol frequencies.
Given frequencies:
| Symbol | Frequency |
|---|---|
| A | 6 |
| B | 4 |
| C | 2 |
| D | 1 |
Build a tree by repeatedly merging the least frequent nodes.
Initial nodes:
A:6
B:4
C:2
D:1
Merge:
(D,C) -> 3
Now:
A:6
B:4
(DC):3
Merge:
(DC,B) -> 7
Now:
A:6
(DCB):7
Merge:
(A,DCB) -> 13
The final tree assigns codes:
A = 0
B = 11
C = 101
D = 100
The most common symbol receives the shortest code.
Building a Huffman Tree
Represent nodes:
import heapq
class Node:
def __init__(self, frequency, symbol=None):
self.frequency = frequency
self.symbol = symbol
self.left = None
self.right = None
def __lt__(self, other):
return self.frequency < other.frequency
Construct the tree:
def build_huffman_tree(frequencies):
heap = []
for symbol, frequency in frequencies.items():
heapq.heappush(
heap,
Node(frequency, symbol)
)
while len(heap) > 1:
left = heapq.heappop(heap)
right = heapq.heappop(heap)
parent = Node(
left.frequency + right.frequency
)
parent.left = left
parent.right = right
heapq.heappush(heap, parent)
return heap[0]
Notice the combination of two algorithmic ideas:
- Frequency counting using hash tables.
- Priority queues using heaps.
This pattern appears repeatedly in real systems. Complex solutions often emerge from combining simple data structures.
Generating Codes
Traverse the tree.
def build_codes(node, prefix="", codes=None):
if codes is None:
codes = {}
if node.symbol is not None:
codes[node.symbol] = prefix
return codes
build_codes(node.left, prefix + "0", codes)
build_codes(node.right, prefix + "1", codes)
return codes
Example:
freq = {
"A": 6,
"B": 4,
"C": 2,
"D": 1,
}
tree = build_huffman_tree(freq)
codes = build_codes(tree)
print(codes)
Possible output:
{
'A': '0',
'B': '11',
'C': '101',
'D': '100'
}
Different valid trees may produce different codes while preserving the same compression ratio.
Encoding Data
Encoding becomes a table lookup.
def encode(text, codes):
bits = []
for ch in text:
bits.append(codes[ch])
return "".join(bits)
Example:
encoded = encode(
"ABCD",
codes
)
print(encoded)
Output:
011101100
The encoded stream contains fewer bits than a fixed-width representation.
Decoding Data
Decoding walks the Huffman tree.
def decode(bits, root):
result = []
node = root
for bit in bits:
if bit == "0":
node = node.left
else:
node = node.right
if node.symbol is not None:
result.append(node.symbol)
node = root
return "".join(result)
Example:
original = decode(encoded, tree)
print(original)
Output:
ABCD
The original data is recovered exactly.
Dictionary Compression
Many datasets contain repeated strings rather than repeated characters.
Example:
ERROR user login failed
ERROR database unavailable
ERROR timeout reached
ERROR invalid password
The word:
ERROR
appears repeatedly.
Dictionary-based compression replaces recurring substrings with references.
Dictionary:
1 -> ERROR
Compressed:
1 user login failed
1 database unavailable
1 timeout reached
1 invalid password
This idea forms the foundation of the LZ family of algorithms.
Sliding Window Compression
LZ77 maintains a sliding window over previously seen data.
Example:
ABABABABAB
After reading:
ABAB
the next occurrence:
ABAB
already exists in the window.
Instead of storing it again, store:
(offset, length)
Such as:
(4, 4)
meaning:
copy four characters from four positions back
Modern compression formats frequently build on variants of this idea.
Examples include:
- DEFLATE
- gzip
- ZIP
- PNG
These systems often combine dictionary matching with Huffman coding.
Compression Pipeline Design
A practical compression tool consists of several stages.
Input
|
Tokenizer
|
Pattern Detection
|
Dictionary Builder
|
Entropy Encoder
|
Compressed Output
Decompression reverses the pipeline.
Compressed Input
|
Entropy Decoder
|
Dictionary Expansion
|
Original Output
Each stage has a clear responsibility.
Streaming Compression
Large datasets may not fit into memory.
Examples:
- Multi-gigabyte log files.
- Database backups.
- Network traffic.
- Video archives.
A streaming compressor processes chunks incrementally.
def compress_stream(reader, compressor):
while True:
chunk = reader.read(8192)
if not chunk:
break
compressor.write(chunk)
compressor.finish()
Streaming avoids loading the entire dataset at once.
Compression and CPU Tradeoffs
Higher compression ratios usually require more computation.
| Strategy | Speed | Compression |
|---|---|---|
| RLE | Very High | Poor |
| Huffman | High | Moderate |
| LZ77 | Moderate | Good |
| DEFLATE | Moderate | Very Good |
| Zstandard | High | Excellent |
| Brotli | Lower | Excellent |
There is no universally optimal choice.
Log ingestion systems may prioritize throughput.
Archival storage systems may prioritize compression ratio.
Interactive web applications often require a balance.
Measuring Compression Effectiveness
Compression ratio:
compressed_size / original_size
Example:
Original: 100 MB
Compressed: 25 MB
Ratio:
0.25
Or:
75% reduction
Additional metrics include:
- Compression speed.
- Decompression speed.
- Memory usage.
- CPU consumption.
- Random access capability.
A practical evaluation should measure all of them.
Complexity Analysis
For Huffman coding:
| Operation | Complexity |
|---|---|
| Frequency counting | O(n) |
| Heap construction | O(k) |
| Tree construction | O(k log k) |
| Encoding | O(n) |
| Decoding | O(n) |
Where:
- n = input size.
- k = number of unique symbols.
For typical text, k is much smaller than n, making Huffman coding effectively linear.
Testing
A compression system should satisfy one critical property:
decode(encode(x)) == x
For every valid input.
Test cases should include:
| Case | Purpose |
|---|---|
| Empty input | Boundary condition |
| Single symbol | Minimal alphabet |
| Repeated symbol | Compression effectiveness |
| Random data | Worst-case behavior |
| Large files | Scalability |
| Unicode text | Character handling |
| Binary data | Generality |
Property-based testing is especially effective because it automatically generates large numbers of inputs.
Common Bugs
A frequent mistake is assuming encoded bits can be stored as ordinary strings. Real compressors operate on packed bit streams.
Another common error is forgetting to store metadata such as the Huffman tree or frequency table. The decoder cannot reconstruct the original text without enough information to rebuild the coding scheme.
Streaming implementations often fail at chunk boundaries. A pattern may begin in one chunk and continue into the next.
Dictionary compressors frequently introduce off-by-one errors when calculating offsets and lengths.
Performance bugs often arise from excessive string concatenation. Use buffers, byte arrays, or streaming interfaces instead.
Recipe
When building a compression tool:
- Start with a lossless design.
- Measure symbol frequencies.
- Build a Huffman encoder.
- Add dictionary compression for repeated substrings.
- Support streaming input.
- Store metadata required for decompression.
- Verify round-trip correctness.
- Measure compression ratio and throughput.
- Benchmark against representative datasets.
- Optimize only after identifying bottlenecks.
Compression illustrates a recurring theme in algorithm engineering. Significant performance gains rarely come from a single algorithm. Instead, multiple techniques cooperate: hash tables collect frequencies, heaps build optimal trees, dictionaries capture repeated structures, and bit-level encoders transform patterns into compact representations. The resulting system achieves far more than any individual component could provide alone.