16.23 Complexity Analysis

String algorithms often look deceptively simple.

16.23 Complexity Analysis

Problem

String algorithms often look deceptively simple.

A few nested loops may be harmless for short text but catastrophic for large input. A suffix-array comparison may appear to take O(log n) binary-search steps, while each comparison secretly scans many characters. A hash-based algorithm may appear constant time, but only after preprocessing and with collision assumptions.

The challenge is to analyze string algorithms honestly, including hidden costs such as substring allocation, character comparison, Unicode decoding, output size, and preprocessing.

Solution

Analyze string algorithms in layers.

A useful complexity analysis should account for:

  1. Input size
  2. Pattern size
  3. Alphabet size
  4. Number of matches
  5. Preprocessing cost
  6. Query cost
  7. Memory cost
  8. Representation cost

String algorithms are rarely just about n.

Define the Variables

Before analyzing, define the symbols.

Common variables:

Symbol Meaning
n Text length
m Pattern length
k Number of patterns
M Total length of all patterns
z Number of matches reported
σ Alphabet size
q Number of queries
d Number of documents

Without clear variables, complexity claims become ambiguous.

For example:

Aho-Corasick is O(n)

is incomplete. More precisely:

Build:  O(M)
Search: O(n + z)

where M is the total pattern length and z is the number of matches.

Single-Pattern Matching

For one pattern and one text:

Algorithm Preprocessing Search Space
Naive O(1) O(nm) worst case O(1)
KMP O(m) O(n) O(m)
Z Algorithm O(n + m) O(n + m) O(n + m)
Rabin-Karp O(m) O(n) expected, O(nm) worst O(1)
Boyer-Moore O(m + σ) typical Sublinear average, O(nm) worst O(m + σ)

The table shows why no algorithm dominates all others.

KMP provides strong worst-case guarantees.

Boyer-Moore often wins in practice.

Rabin-Karp is attractive when rolling hashes are useful elsewhere.

Multiple-Pattern Matching

If there are k patterns, repeated single-pattern search costs:

O(k × n)

or worse, depending on the algorithm.

Aho-Corasick changes the shape of the problem:

Build:  O(M)
Search: O(n + z)

where:

M = total length of all patterns
z = number of reported matches

This is a major improvement when the pattern set is large and reused.

Output-Sensitive Complexity

Some algorithms must report matches.

If the text is:

aaaaaaaaaa

and the patterns are:

a
aa
aaa
aaaa

the number of matches can be large.

Any algorithm that prints or stores every match must spend at least:

O(z)

time.

This is not an implementation weakness. It is a lower bound.

When analyzing search algorithms, include the output term:

O(n + z)

not merely:

O(n)

Substring Indexes

Suffix structures move work from query time to preprocessing time.

Structure Build Query Space
Suffix Array O(n log n) typical O(m log n + z) O(n)
Suffix Array + LCP/RMQ O(n log n) typical O(m + log n + z) or better O(n)
Suffix Automaton O(n) O(m) membership O(n)
Suffix Tree O(n) theoretical O(m + z) O(n), large constants

A suffix array is often preferred because it is compact and cache-friendly.

A suffix automaton is often better for substring analytics.

A suffix tree is powerful but pointer-heavy.

Hidden Cost: String Comparison

Sorting suffixes is not just sorting integers.

If each suffix comparison scans up to O(n) characters, then a comparison sort may cost:

O(n log n × n)
=
O(n² log n)

This is why naive suffix-array construction is slow.

Prefix-doubling, induced sorting, and rank-based methods avoid repeated long comparisons.

When you see:

sort suffixes

ask:

What does one comparison cost?

That question often determines the real complexity.

Hidden Cost: Substring Allocation

Many languages make substring creation cheap-looking.

Example:

s[i..j]

But this may allocate a new string of length:

j - i

If done inside a loop, complexity can silently increase.

Bad pattern:

for i in range(n):
    suffixes.append(s[i:])
sort(suffixes)

This materializes O(n²) total characters.

Better pattern:

store suffix start positions

and compare by index or rank.

Hidden Cost: Unicode

For ASCII-like text, indexing appears constant time.

For UTF-8 text, indexing by character position may require scanning from the beginning or maintaining an auxiliary table.

Possible units:

Unit Notes
Byte Fast, storage-oriented
Code point Better for many algorithms
Grapheme cluster Better for user-visible text

If an algorithm says:

access s[i]

make sure the representation supports that operation at the claimed cost.

Hashing Complexity

String hashing gives:

O(1)

substring comparison only after preprocessing.

The full cost is:

Operation Complexity
Build powers O(n)
Build prefix hashes O(n)
Substring hash O(1)
Equality check O(1) probabilistic

The correctness guarantee is probabilistic unless matching hashes are verified by character comparison.

If every candidate is verified, worst-case cost may include extra character comparisons.

Dynamic Programming Costs

Edit distance and longest common substring both use grids.

For strings of lengths n and m:

O(nm)

time is common.

Space may be:

O(nm)

with a full table, or:

O(min(n,m))

with rolling rows.

If reconstruction is required, full-table storage or extra backtracking information may be necessary.

Optimization depends on what you need:

Need Space
Distance only O(min(n,m))
Full edit script O(nm) typical
Longest substring only O(min(n,m))
All matches Output-sensitive

Preprocessing vs Query Tradeoff

Indexing is worthwhile only when queries justify it.

Suppose:

n = 1,000,000
q = 1

A suffix array may be overkill.

But if:

q = 1,000,000

preprocessing pays for itself.

A useful decision rule:

total cost =
build cost + q × query cost

Compare alternatives using the actual workload.

Example:

Strategy Total Cost
Scan per query O(qn)
Build suffix array O(n log n + q m log n)
Build inverted index O(n + query postings)

The best choice changes with q, m, and query type.

Memory Complexity

String algorithms often trade memory for speed.

Examples:

Algorithm Extra Memory
Naive matching O(1)
KMP O(m)
Z Algorithm O(n + m)
Aho-Corasick O(M)
Suffix Array O(n)
LCP Array O(n)
Suffix Automaton O(n)
Eertree O(n)
Edit Distance full DP O(nm)

Memory also has constant factors.

A suffix automaton with hash-map transitions may use far more memory than a suffix array, even though both are O(n).

Big O hides this engineering detail.

Alphabet Size

The alphabet size σ affects transition storage.

A trie node can store children as:

Representation Lookup Memory
Fixed array of size σ O(1) High
Hash map O(1) average Moderate
Sorted vector O(log degree) Low
Linked list O(degree) Low

For DNA:

σ = 4

fixed arrays are attractive.

For Unicode:

σ = very large

sparse maps are usually better.

Amortized Analysis

Some string structures rely on amortized bounds.

Examples:

  • Suffix automaton construction
  • Manacher's algorithm
  • Z Algorithm
  • Kasai's algorithm

These algorithms contain loops that appear nested, but the total work remains linear because a pointer or boundary moves monotonically.

Typical argument:

The right boundary only increases.
It increases at most n times.
Therefore total expansion work is O(n).

This pattern appears repeatedly in advanced string algorithms.

Benchmarking

Complexity analysis predicts scaling behavior. Benchmarks reveal constants.

Benchmark with:

  • Random text
  • Repetitive text
  • Real documents
  • Worst-case patterns
  • Unicode-heavy input
  • Many short queries
  • Few long queries

Different algorithms win under different distributions.

For example, Boyer-Moore may outperform KMP on natural language text but behave less predictably on adversarial repetitive input.

Correctness of Complexity Claims

A complexity claim is correct only when its assumptions are stated.

For example:

Substring comparison is O(1)

is true only if:

  • Prefix hashes have been built
  • Collisions are ignored or accepted
  • Substring boundaries are valid
  • Hash extraction is constant time

Without those assumptions, the statement is misleading.

Common Pitfalls

Do not omit output size.

Do not ignore preprocessing.

Do not allocate substrings in inner loops.

Do not assume character indexing is constant time for every string representation.

Do not compare suffixes naively and claim suffix-array construction is O(n log n).

Do not ignore memory constants.

Do not use average-case claims where worst-case guarantees are required.

Takeaway

Complexity analysis for string algorithms must account for more than the length of the text. Pattern size, number of patterns, output size, indexing representation, preprocessing, memory layout, and character encoding all affect the real cost. Good analysis separates build time from query time, identifies hidden comparison and allocation costs, and states the assumptions behind every bound.