16.17 Compressed Strings
Many strings contain repeated structure.
16.17 Compressed Strings
Problem
Many strings contain repeated structure.
Examples:
aaaaaa
abababab
the the the the
A direct representation stores every character explicitly. That is simple, but it can waste space and make repeated operations expensive.
Suppose a string is represented as:
a repeated 1,000,000 times
It is inefficient to store and scan one million characters when the structure can be described compactly.
The challenge is to represent repetitive strings in compressed form while still supporting useful operations such as indexing, comparison, search, and expansion when needed.
Solution
Use a compressed string representation.
A compressed string stores a description of the text rather than the text itself. Common representations include:
| Representation | Best For |
|---|---|
| Run-length encoding | Long repeated runs |
| Grammar compression | Repeated phrases and nested patterns |
| LZ-style parsing | Repeated substrings |
| Rope trees | Large editable strings |
| Prefix-sharing tries | Large dictionaries |
| Suffix-based indexes | Large searchable texts |
This recipe focuses on the basic algorithmic idea: exploit repetition explicitly instead of treating every string as a flat array of characters.
Run-Length Encoding
The simplest compressed representation is run-length encoding, or RLE.
Instead of storing:
aaaaabbbbcc
store:
a5 b4 c2
Each run records:
- The character.
- The number of repetitions.
A Lean-like structure:
structure Run where
ch : Char
count : Nat
abbrev RLEString := List Run
This representation is compact when the string contains long runs.
Encoding
To encode a string, scan from left to right and group consecutive equal characters.
aaabbcaaaa
becomes:
a3 b2 c1 a4
Pseudocode:
runs = []
for character c in string:
if runs is empty or last run character differs:
append new run (c, 1)
else:
increment last run count
The scan is linear in the input length.
Decoding
Decoding expands the compressed representation.
a3 b2 c1 a4
becomes:
aaabbcaaaa
This operation may be expensive because the expanded output can be much larger than the compressed input.
That distinction matters.
If a compressed string represents one billion characters, decoding is necessarily expensive. Compression does not eliminate output size.
Indexing
RLE can support indexing without full decompression.
Given:
a3 b2 c1 a4
the expanded string is:
aaabbcaaaa
To find character at index 5:
| Run | Range |
|---|---|
| a3 | 0..3 |
| b2 | 3..5 |
| c1 | 5..6 |
| a4 | 6..10 |
Index 5 falls in:
c1
so the answer is:
c
A naive index operation scans runs from the beginning. With prefix sums over run lengths, indexing can be answered by binary search.
Prefix Sums for Runs
Store cumulative lengths:
Runs: a3 b2 c1 a4
Prefix: 3 5 6 10
To locate index i, find the first prefix value greater than i.
For index:
5
the first prefix greater than 5 is:
6
which corresponds to the run:
c1
This gives:
O(log r)
indexing, where r is the number of runs.
Comparing Compressed Strings
Compressed strings can often be compared without full expansion.
Example:
A = a5 b2
B = a3 a2 b2
Both expand to:
aaaaabb
A comparison algorithm walks through runs from both sides and consumes the smaller remaining count.
a5 vs a3 consume 3
a2 vs a2 consume 2
b2 vs b2 consume 2
The expanded text is never materialized.
This technique is useful in testing, data pipelines, compression tools, and parsers over repeated input.
Searching in Compressed Strings
Pattern search over compressed strings is more subtle.
For simple RLE text, searching for a single repeated character is easy.
Example:
pattern = aaa
text = a10 b2 a2
The first run contains the pattern.
But a general pattern may cross run boundaries:
pattern = aaabb
text = a5 b3
The match begins inside the a run and continues into the b run.
A practical strategy is to operate at the run level when the pattern also compresses well, and fall back to local expansion near run boundaries when necessary.
Grammar Compression
Run-length encoding handles repeated characters, but not repeated phrases.
For:
abcabcabcabc
RLE gives little benefit.
Grammar compression represents repeated structure using rules:
X = abc
Y = XXXX
The full string becomes:
Y
This can represent long repeated patterns compactly.
Grammar-compressed strings are common in theoretical string algorithms and compressed indexing.
LZ-Style Compression
Lempel-Ziv methods represent a string as references to earlier substrings.
Example:
abcabcabc
can be described as:
abc + copy previous 6 characters
LZ-style compression is the foundation of many practical compression formats.
Algorithmically, it is important because it converts repeated substring discovery into a parsing problem.
Suffix arrays, suffix automata, and rolling hashes often appear in LZ parsers.
Ropes
A rope stores a large string as a balanced tree of chunks.
Instead of one contiguous array:
very large string
a rope stores:
root
/ \\n chunk node
/ \\n chunk chunk
Ropes are useful when strings are edited frequently.
Insertion and deletion in the middle of a large flat string can require shifting many characters. In a rope, these operations mostly rearrange tree nodes.
Algorithmic Operations
Different compressed representations support different operations.
| Operation | RLE | Grammar | LZ | Rope |
|---|---|---|---|---|
| Sequential scan | Easy | Moderate | Moderate | Easy |
| Random access | Easy with prefix sums | Harder | Harder | Easy |
| Pattern search | Moderate | Hard | Hard | Moderate |
| Editing | Poor | Poor | Poor | Good |
| Compression ratio | Narrow | Good | Good | Not primary goal |
Choose the representation according to the workload, not only the compression ratio.
Complexity Analysis
For RLE, let:
| Symbol | Meaning |
|---|---|
| n | Expanded length |
| r | Number of runs |
| Operation | Complexity |
|---|---|
| Encode | O(n) |
| Decode | O(n) |
| Index without prefix sums | O(r) |
| Index with prefix sums | O(log r) |
| Sequential scan | O(r) compressed steps |
| Storage | O(r) |
If r << n, compressed processing can be dramatically faster and smaller.
If r ≈ n, RLE gives little benefit and may increase overhead.
Correctness
An RLE representation is correct if each run records a positive count and adjacent runs never have the same character.
Encoding preserves the original order by grouping maximal consecutive equal characters.
Decoding preserves each run by emitting its character exactly count times.
Because every input character belongs to exactly one run, decoding an encoded string reconstructs the original string exactly.
Common Pitfalls
Do not assume compressed length predicts expanded cost. A tiny compressed representation can describe a huge output.
Do not decode automatically unless the operation truly requires it.
Keep runs canonical. Adjacent runs with the same character should be merged.
Be careful with integer overflow in run lengths and expanded lengths.
Choose indexing semantics carefully for Unicode text.
Takeaway
Compressed strings replace flat character arrays with structured descriptions of text. The main algorithmic question is not only how small the representation is, but which operations remain efficient without decompression. Run-length encoding is the simplest example: it can turn repeated-character strings into compact run lists while still supporting indexing, comparison, and some searches directly. More advanced representations generalize the same principle to repeated phrases, large editable documents, and compressed full-text indexes.