20.6 Building a Compression Tool

Design a compression tool that reduces the size of textual data while preserving the original information.

20.6 Building a Compression Tool

Problem

Design a compression tool that reduces the size of textual data while preserving the original information. The tool should accept an input stream, generate a compact representation, and later reconstruct the original content exactly.

Compression appears throughout computing. Source code repositories compress objects before storage. Databases compress pages to improve cache efficiency. Network protocols reduce bandwidth consumption. Log systems compress archives to reduce storage costs. Backup systems compress files before transmission. The underlying objective remains the same: represent information using fewer bits.

Consider the following text:

ERROR ERROR ERROR ERROR ERROR

A naive storage system records every character:

ERROR ERROR ERROR ERROR ERROR

A compression system recognizes repetition and stores a more compact representation:

5 × ERROR

The challenge is determining how to discover patterns automatically and encode them efficiently.

Understanding Compression

Compression algorithms exploit redundancy.

If every symbol appears with equal probability and no patterns exist, compression opportunities are limited. If symbols, words, or structures repeat frequently, the data contains redundancy that can be removed.

For example:

AAAAAAABBBBBBBCCCCCCC

contains obvious repetition.

By contrast:

Q7xJ4mT2ZpL8vN5

appears much more random and offers fewer compression opportunities.

Compression algorithms generally fall into two categories:

Category Goal
Lossless Recover the exact original data
Lossy Recover an approximation

A source code file, executable, database page, or financial record requires lossless compression. Image, audio, and video systems often use lossy compression.

This case study focuses entirely on lossless techniques.

Run-Length Encoding

The simplest compression algorithm is run-length encoding.

Instead of storing repeated symbols individually:

AAAAAA

store:

6A

Implementation:

def rle_encode(text):
    if not text:
        return []

    result = []

    current = text[0]
    count = 1

    for ch in text[1:]:
        if ch == current:
            count += 1
        else:
            result.append((count, current))
            current = ch
            count = 1

    result.append((count, current))

    return result

Example:

print(rle_encode("AAABBBCC"))

Output:

[(3, 'A'), (3, 'B'), (2, 'C')]

Decoding:

def rle_decode(encoded):
    parts = []

    for count, symbol in encoded:
        parts.append(symbol * count)

    return "".join(parts)

This algorithm is extremely simple and extremely fast.

Unfortunately, it performs poorly on many real datasets.

Input:

ABCDEFG

Output:

1A1B1C1D1E1F1G

The compressed representation is larger than the original.

Compression systems therefore need more sophisticated approaches.

Frequency-Based Compression

Many datasets contain symbols that occur far more frequently than others.

Consider:

AAAAAABBBBCCD

Frequency table:

Symbol Frequency
A 6
B 4
C 2
D 1

A fixed-length encoding might use:

A = 00
B = 01
C = 10
D = 11

Every symbol requires two bits.

A better strategy assigns shorter codes to common symbols and longer codes to rare symbols.

This observation leads to Huffman coding.

Huffman Coding

Huffman coding constructs an optimal prefix code from symbol frequencies.

Given frequencies:

Symbol Frequency
A 6
B 4
C 2
D 1

Build a tree by repeatedly merging the least frequent nodes.

Initial nodes:

A:6
B:4
C:2
D:1

Merge:

(D,C) -> 3

Now:

A:6
B:4
(DC):3

Merge:

(DC,B) -> 7

Now:

A:6
(DCB):7

Merge:

(A,DCB) -> 13

The final tree assigns codes:

A = 0
B = 11
C = 101
D = 100

The most common symbol receives the shortest code.

Building a Huffman Tree

Represent nodes:

import heapq

class Node:
    def __init__(self, frequency, symbol=None):
        self.frequency = frequency
        self.symbol = symbol
        self.left = None
        self.right = None

    def __lt__(self, other):
        return self.frequency < other.frequency

Construct the tree:

def build_huffman_tree(frequencies):
    heap = []

    for symbol, frequency in frequencies.items():
        heapq.heappush(
            heap,
            Node(frequency, symbol)
        )

    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)

        parent = Node(
            left.frequency + right.frequency
        )

        parent.left = left
        parent.right = right

        heapq.heappush(heap, parent)

    return heap[0]

Notice the combination of two algorithmic ideas:

  • Frequency counting using hash tables.
  • Priority queues using heaps.

This pattern appears repeatedly in real systems. Complex solutions often emerge from combining simple data structures.

Generating Codes

Traverse the tree.

def build_codes(node, prefix="", codes=None):
    if codes is None:
        codes = {}

    if node.symbol is not None:
        codes[node.symbol] = prefix
        return codes

    build_codes(node.left, prefix + "0", codes)
    build_codes(node.right, prefix + "1", codes)

    return codes

Example:

freq = {
    "A": 6,
    "B": 4,
    "C": 2,
    "D": 1,
}

tree = build_huffman_tree(freq)
codes = build_codes(tree)

print(codes)

Possible output:

{
  'A': '0',
  'B': '11',
  'C': '101',
  'D': '100'
}

Different valid trees may produce different codes while preserving the same compression ratio.

Encoding Data

Encoding becomes a table lookup.

def encode(text, codes):
    bits = []

    for ch in text:
        bits.append(codes[ch])

    return "".join(bits)

Example:

encoded = encode(
    "ABCD",
    codes
)

print(encoded)

Output:

011101100

The encoded stream contains fewer bits than a fixed-width representation.

Decoding Data

Decoding walks the Huffman tree.

def decode(bits, root):
    result = []

    node = root

    for bit in bits:
        if bit == "0":
            node = node.left
        else:
            node = node.right

        if node.symbol is not None:
            result.append(node.symbol)
            node = root

    return "".join(result)

Example:

original = decode(encoded, tree)

print(original)

Output:

ABCD

The original data is recovered exactly.

Dictionary Compression

Many datasets contain repeated strings rather than repeated characters.

Example:

ERROR user login failed
ERROR database unavailable
ERROR timeout reached
ERROR invalid password

The word:

ERROR

appears repeatedly.

Dictionary-based compression replaces recurring substrings with references.

Dictionary:
1 -> ERROR

Compressed:
1 user login failed
1 database unavailable
1 timeout reached
1 invalid password

This idea forms the foundation of the LZ family of algorithms.

Sliding Window Compression

LZ77 maintains a sliding window over previously seen data.

Example:

ABABABABAB

After reading:

ABAB

the next occurrence:

ABAB

already exists in the window.

Instead of storing it again, store:

(offset, length)

Such as:

(4, 4)

meaning:

copy four characters from four positions back

Modern compression formats frequently build on variants of this idea.

Examples include:

  • DEFLATE
  • gzip
  • ZIP
  • PNG

These systems often combine dictionary matching with Huffman coding.

Compression Pipeline Design

A practical compression tool consists of several stages.

Input
  |
Tokenizer
  |
Pattern Detection
  |
Dictionary Builder
  |
Entropy Encoder
  |
Compressed Output

Decompression reverses the pipeline.

Compressed Input
  |
Entropy Decoder
  |
Dictionary Expansion
  |
Original Output

Each stage has a clear responsibility.

Streaming Compression

Large datasets may not fit into memory.

Examples:

  • Multi-gigabyte log files.
  • Database backups.
  • Network traffic.
  • Video archives.

A streaming compressor processes chunks incrementally.

def compress_stream(reader, compressor):
    while True:
        chunk = reader.read(8192)

        if not chunk:
            break

        compressor.write(chunk)

    compressor.finish()

Streaming avoids loading the entire dataset at once.

Compression and CPU Tradeoffs

Higher compression ratios usually require more computation.

Strategy Speed Compression
RLE Very High Poor
Huffman High Moderate
LZ77 Moderate Good
DEFLATE Moderate Very Good
Zstandard High Excellent
Brotli Lower Excellent

There is no universally optimal choice.

Log ingestion systems may prioritize throughput.

Archival storage systems may prioritize compression ratio.

Interactive web applications often require a balance.

Measuring Compression Effectiveness

Compression ratio:

compressed_size / original_size

Example:

Original: 100 MB
Compressed: 25 MB

Ratio:

0.25

Or:

75% reduction

Additional metrics include:

  • Compression speed.
  • Decompression speed.
  • Memory usage.
  • CPU consumption.
  • Random access capability.

A practical evaluation should measure all of them.

Complexity Analysis

For Huffman coding:

Operation Complexity
Frequency counting O(n)
Heap construction O(k)
Tree construction O(k log k)
Encoding O(n)
Decoding O(n)

Where:

  • n = input size.
  • k = number of unique symbols.

For typical text, k is much smaller than n, making Huffman coding effectively linear.

Testing

A compression system should satisfy one critical property:

decode(encode(x)) == x

For every valid input.

Test cases should include:

Case Purpose
Empty input Boundary condition
Single symbol Minimal alphabet
Repeated symbol Compression effectiveness
Random data Worst-case behavior
Large files Scalability
Unicode text Character handling
Binary data Generality

Property-based testing is especially effective because it automatically generates large numbers of inputs.

Common Bugs

A frequent mistake is assuming encoded bits can be stored as ordinary strings. Real compressors operate on packed bit streams.

Another common error is forgetting to store metadata such as the Huffman tree or frequency table. The decoder cannot reconstruct the original text without enough information to rebuild the coding scheme.

Streaming implementations often fail at chunk boundaries. A pattern may begin in one chunk and continue into the next.

Dictionary compressors frequently introduce off-by-one errors when calculating offsets and lengths.

Performance bugs often arise from excessive string concatenation. Use buffers, byte arrays, or streaming interfaces instead.

Recipe

When building a compression tool:

  1. Start with a lossless design.
  2. Measure symbol frequencies.
  3. Build a Huffman encoder.
  4. Add dictionary compression for repeated substrings.
  5. Support streaming input.
  6. Store metadata required for decompression.
  7. Verify round-trip correctness.
  8. Measure compression ratio and throughput.
  9. Benchmark against representative datasets.
  10. Optimize only after identifying bottlenecks.

Compression illustrates a recurring theme in algorithm engineering. Significant performance gains rarely come from a single algorithm. Instead, multiple techniques cooperate: hash tables collect frequencies, heaps build optimal trees, dictionaries capture repeated structures, and bit-level encoders transform patterns into compact representations. The resulting system achieves far more than any individual component could provide alone.