16.17 Compressed Strings

Many strings contain repeated structure.

16.17 Compressed Strings

Problem

Many strings contain repeated structure.

Examples:

aaaaaa
abababab
the the the the

A direct representation stores every character explicitly. That is simple, but it can waste space and make repeated operations expensive.

Suppose a string is represented as:

a repeated 1,000,000 times

It is inefficient to store and scan one million characters when the structure can be described compactly.

The challenge is to represent repetitive strings in compressed form while still supporting useful operations such as indexing, comparison, search, and expansion when needed.

Solution

Use a compressed string representation.

A compressed string stores a description of the text rather than the text itself. Common representations include:

Representation Best For
Run-length encoding Long repeated runs
Grammar compression Repeated phrases and nested patterns
LZ-style parsing Repeated substrings
Rope trees Large editable strings
Prefix-sharing tries Large dictionaries
Suffix-based indexes Large searchable texts

This recipe focuses on the basic algorithmic idea: exploit repetition explicitly instead of treating every string as a flat array of characters.

Run-Length Encoding

The simplest compressed representation is run-length encoding, or RLE.

Instead of storing:

aaaaabbbbcc

store:

a5 b4 c2

Each run records:

  1. The character.
  2. The number of repetitions.

A Lean-like structure:

structure Run where
  ch : Char
  count : Nat

abbrev RLEString := List Run

This representation is compact when the string contains long runs.

Encoding

To encode a string, scan from left to right and group consecutive equal characters.

aaabbcaaaa

becomes:

a3 b2 c1 a4

Pseudocode:

runs = []

for character c in string:
    if runs is empty or last run character differs:
        append new run (c, 1)
    else:
        increment last run count

The scan is linear in the input length.

Decoding

Decoding expands the compressed representation.

a3 b2 c1 a4

becomes:

aaabbcaaaa

This operation may be expensive because the expanded output can be much larger than the compressed input.

That distinction matters.

If a compressed string represents one billion characters, decoding is necessarily expensive. Compression does not eliminate output size.

Indexing

RLE can support indexing without full decompression.

Given:

a3 b2 c1 a4

the expanded string is:

aaabbcaaaa

To find character at index 5:

Run Range
a3 0..3
b2 3..5
c1 5..6
a4 6..10

Index 5 falls in:

c1

so the answer is:

c

A naive index operation scans runs from the beginning. With prefix sums over run lengths, indexing can be answered by binary search.

Prefix Sums for Runs

Store cumulative lengths:

Runs:    a3 b2 c1 a4
Prefix:  3  5  6 10

To locate index i, find the first prefix value greater than i.

For index:

5

the first prefix greater than 5 is:

6

which corresponds to the run:

c1

This gives:

O(log r)

indexing, where r is the number of runs.

Comparing Compressed Strings

Compressed strings can often be compared without full expansion.

Example:

A = a5 b2
B = a3 a2 b2

Both expand to:

aaaaabb

A comparison algorithm walks through runs from both sides and consumes the smaller remaining count.

a5  vs a3  consume 3
a2  vs a2  consume 2
b2  vs b2  consume 2

The expanded text is never materialized.

This technique is useful in testing, data pipelines, compression tools, and parsers over repeated input.

Searching in Compressed Strings

Pattern search over compressed strings is more subtle.

For simple RLE text, searching for a single repeated character is easy.

Example:

pattern = aaa
text    = a10 b2 a2

The first run contains the pattern.

But a general pattern may cross run boundaries:

pattern = aaabb
text    = a5 b3

The match begins inside the a run and continues into the b run.

A practical strategy is to operate at the run level when the pattern also compresses well, and fall back to local expansion near run boundaries when necessary.

Grammar Compression

Run-length encoding handles repeated characters, but not repeated phrases.

For:

abcabcabcabc

RLE gives little benefit.

Grammar compression represents repeated structure using rules:

X = abc
Y = XXXX

The full string becomes:

Y

This can represent long repeated patterns compactly.

Grammar-compressed strings are common in theoretical string algorithms and compressed indexing.

LZ-Style Compression

Lempel-Ziv methods represent a string as references to earlier substrings.

Example:

abcabcabc

can be described as:

abc + copy previous 6 characters

LZ-style compression is the foundation of many practical compression formats.

Algorithmically, it is important because it converts repeated substring discovery into a parsing problem.

Suffix arrays, suffix automata, and rolling hashes often appear in LZ parsers.

Ropes

A rope stores a large string as a balanced tree of chunks.

Instead of one contiguous array:

very large string

a rope stores:

        root
       /    \\n   chunk    node
           /    \\n       chunk    chunk

Ropes are useful when strings are edited frequently.

Insertion and deletion in the middle of a large flat string can require shifting many characters. In a rope, these operations mostly rearrange tree nodes.

Algorithmic Operations

Different compressed representations support different operations.

Operation RLE Grammar LZ Rope
Sequential scan Easy Moderate Moderate Easy
Random access Easy with prefix sums Harder Harder Easy
Pattern search Moderate Hard Hard Moderate
Editing Poor Poor Poor Good
Compression ratio Narrow Good Good Not primary goal

Choose the representation according to the workload, not only the compression ratio.

Complexity Analysis

For RLE, let:

Symbol Meaning
n Expanded length
r Number of runs
Operation Complexity
Encode O(n)
Decode O(n)
Index without prefix sums O(r)
Index with prefix sums O(log r)
Sequential scan O(r) compressed steps
Storage O(r)

If r << n, compressed processing can be dramatically faster and smaller.

If r ≈ n, RLE gives little benefit and may increase overhead.

Correctness

An RLE representation is correct if each run records a positive count and adjacent runs never have the same character.

Encoding preserves the original order by grouping maximal consecutive equal characters.

Decoding preserves each run by emitting its character exactly count times.

Because every input character belongs to exactly one run, decoding an encoded string reconstructs the original string exactly.

Common Pitfalls

Do not assume compressed length predicts expanded cost. A tiny compressed representation can describe a huge output.

Do not decode automatically unless the operation truly requires it.

Keep runs canonical. Adjacent runs with the same character should be merged.

Be careful with integer overflow in run lengths and expanded lengths.

Choose indexing semantics carefully for Unicode text.

Takeaway

Compressed strings replace flat character arrays with structured descriptions of text. The main algorithmic question is not only how small the representation is, but which operations remain efficient without decompression. Run-length encoding is the simplest example: it can turn repeated-character strings into compact run lists while still supporting indexing, comparison, and some searches directly. More advanced representations generalize the same principle to repeated phrases, large editable documents, and compressed full-text indexes.