16.23 Complexity Analysis
String algorithms often look deceptively simple.
16.23 Complexity Analysis
Problem
String algorithms often look deceptively simple.
A few nested loops may be harmless for short text but catastrophic for large input. A suffix-array comparison may appear to take O(log n) binary-search steps, while each comparison secretly scans many characters. A hash-based algorithm may appear constant time, but only after preprocessing and with collision assumptions.
The challenge is to analyze string algorithms honestly, including hidden costs such as substring allocation, character comparison, Unicode decoding, output size, and preprocessing.
Solution
Analyze string algorithms in layers.
A useful complexity analysis should account for:
- Input size
- Pattern size
- Alphabet size
- Number of matches
- Preprocessing cost
- Query cost
- Memory cost
- Representation cost
String algorithms are rarely just about n.
Define the Variables
Before analyzing, define the symbols.
Common variables:
| Symbol | Meaning |
|---|---|
n |
Text length |
m |
Pattern length |
k |
Number of patterns |
M |
Total length of all patterns |
z |
Number of matches reported |
σ |
Alphabet size |
q |
Number of queries |
d |
Number of documents |
Without clear variables, complexity claims become ambiguous.
For example:
Aho-Corasick is O(n)
is incomplete. More precisely:
Build: O(M)
Search: O(n + z)
where M is the total pattern length and z is the number of matches.
Single-Pattern Matching
For one pattern and one text:
| Algorithm | Preprocessing | Search | Space |
|---|---|---|---|
| Naive | O(1) | O(nm) worst case | O(1) |
| KMP | O(m) | O(n) | O(m) |
| Z Algorithm | O(n + m) | O(n + m) | O(n + m) |
| Rabin-Karp | O(m) | O(n) expected, O(nm) worst | O(1) |
| Boyer-Moore | O(m + σ) typical | Sublinear average, O(nm) worst | O(m + σ) |
The table shows why no algorithm dominates all others.
KMP provides strong worst-case guarantees.
Boyer-Moore often wins in practice.
Rabin-Karp is attractive when rolling hashes are useful elsewhere.
Multiple-Pattern Matching
If there are k patterns, repeated single-pattern search costs:
O(k × n)
or worse, depending on the algorithm.
Aho-Corasick changes the shape of the problem:
Build: O(M)
Search: O(n + z)
where:
M = total length of all patterns
z = number of reported matches
This is a major improvement when the pattern set is large and reused.
Output-Sensitive Complexity
Some algorithms must report matches.
If the text is:
aaaaaaaaaa
and the patterns are:
a
aa
aaa
aaaa
the number of matches can be large.
Any algorithm that prints or stores every match must spend at least:
O(z)
time.
This is not an implementation weakness. It is a lower bound.
When analyzing search algorithms, include the output term:
O(n + z)
not merely:
O(n)
Substring Indexes
Suffix structures move work from query time to preprocessing time.
| Structure | Build | Query | Space |
|---|---|---|---|
| Suffix Array | O(n log n) typical | O(m log n + z) | O(n) |
| Suffix Array + LCP/RMQ | O(n log n) typical | O(m + log n + z) or better | O(n) |
| Suffix Automaton | O(n) | O(m) membership | O(n) |
| Suffix Tree | O(n) theoretical | O(m + z) | O(n), large constants |
A suffix array is often preferred because it is compact and cache-friendly.
A suffix automaton is often better for substring analytics.
A suffix tree is powerful but pointer-heavy.
Hidden Cost: String Comparison
Sorting suffixes is not just sorting integers.
If each suffix comparison scans up to O(n) characters, then a comparison sort may cost:
O(n log n × n)
=
O(n² log n)
This is why naive suffix-array construction is slow.
Prefix-doubling, induced sorting, and rank-based methods avoid repeated long comparisons.
When you see:
sort suffixes
ask:
What does one comparison cost?
That question often determines the real complexity.
Hidden Cost: Substring Allocation
Many languages make substring creation cheap-looking.
Example:
s[i..j]
But this may allocate a new string of length:
j - i
If done inside a loop, complexity can silently increase.
Bad pattern:
for i in range(n):
suffixes.append(s[i:])
sort(suffixes)
This materializes O(n²) total characters.
Better pattern:
store suffix start positions
and compare by index or rank.
Hidden Cost: Unicode
For ASCII-like text, indexing appears constant time.
For UTF-8 text, indexing by character position may require scanning from the beginning or maintaining an auxiliary table.
Possible units:
| Unit | Notes |
|---|---|
| Byte | Fast, storage-oriented |
| Code point | Better for many algorithms |
| Grapheme cluster | Better for user-visible text |
If an algorithm says:
access s[i]
make sure the representation supports that operation at the claimed cost.
Hashing Complexity
String hashing gives:
O(1)
substring comparison only after preprocessing.
The full cost is:
| Operation | Complexity |
|---|---|
| Build powers | O(n) |
| Build prefix hashes | O(n) |
| Substring hash | O(1) |
| Equality check | O(1) probabilistic |
The correctness guarantee is probabilistic unless matching hashes are verified by character comparison.
If every candidate is verified, worst-case cost may include extra character comparisons.
Dynamic Programming Costs
Edit distance and longest common substring both use grids.
For strings of lengths n and m:
O(nm)
time is common.
Space may be:
O(nm)
with a full table, or:
O(min(n,m))
with rolling rows.
If reconstruction is required, full-table storage or extra backtracking information may be necessary.
Optimization depends on what you need:
| Need | Space |
|---|---|
| Distance only | O(min(n,m)) |
| Full edit script | O(nm) typical |
| Longest substring only | O(min(n,m)) |
| All matches | Output-sensitive |
Preprocessing vs Query Tradeoff
Indexing is worthwhile only when queries justify it.
Suppose:
n = 1,000,000
q = 1
A suffix array may be overkill.
But if:
q = 1,000,000
preprocessing pays for itself.
A useful decision rule:
total cost =
build cost + q × query cost
Compare alternatives using the actual workload.
Example:
| Strategy | Total Cost |
|---|---|
| Scan per query | O(qn) |
| Build suffix array | O(n log n + q m log n) |
| Build inverted index | O(n + query postings) |
The best choice changes with q, m, and query type.
Memory Complexity
String algorithms often trade memory for speed.
Examples:
| Algorithm | Extra Memory |
|---|---|
| Naive matching | O(1) |
| KMP | O(m) |
| Z Algorithm | O(n + m) |
| Aho-Corasick | O(M) |
| Suffix Array | O(n) |
| LCP Array | O(n) |
| Suffix Automaton | O(n) |
| Eertree | O(n) |
| Edit Distance full DP | O(nm) |
Memory also has constant factors.
A suffix automaton with hash-map transitions may use far more memory than a suffix array, even though both are O(n).
Big O hides this engineering detail.
Alphabet Size
The alphabet size σ affects transition storage.
A trie node can store children as:
| Representation | Lookup | Memory |
|---|---|---|
Fixed array of size σ |
O(1) | High |
| Hash map | O(1) average | Moderate |
| Sorted vector | O(log degree) | Low |
| Linked list | O(degree) | Low |
For DNA:
σ = 4
fixed arrays are attractive.
For Unicode:
σ = very large
sparse maps are usually better.
Amortized Analysis
Some string structures rely on amortized bounds.
Examples:
- Suffix automaton construction
- Manacher's algorithm
- Z Algorithm
- Kasai's algorithm
These algorithms contain loops that appear nested, but the total work remains linear because a pointer or boundary moves monotonically.
Typical argument:
The right boundary only increases.
It increases at most n times.
Therefore total expansion work is O(n).
This pattern appears repeatedly in advanced string algorithms.
Benchmarking
Complexity analysis predicts scaling behavior. Benchmarks reveal constants.
Benchmark with:
- Random text
- Repetitive text
- Real documents
- Worst-case patterns
- Unicode-heavy input
- Many short queries
- Few long queries
Different algorithms win under different distributions.
For example, Boyer-Moore may outperform KMP on natural language text but behave less predictably on adversarial repetitive input.
Correctness of Complexity Claims
A complexity claim is correct only when its assumptions are stated.
For example:
Substring comparison is O(1)
is true only if:
- Prefix hashes have been built
- Collisions are ignored or accepted
- Substring boundaries are valid
- Hash extraction is constant time
Without those assumptions, the statement is misleading.
Common Pitfalls
Do not omit output size.
Do not ignore preprocessing.
Do not allocate substrings in inner loops.
Do not assume character indexing is constant time for every string representation.
Do not compare suffixes naively and claim suffix-array construction is O(n log n).
Do not ignore memory constants.
Do not use average-case claims where worst-case guarantees are required.
Takeaway
Complexity analysis for string algorithms must account for more than the length of the text. Pattern size, number of patterns, output size, indexing representation, preprocessing, memory layout, and character encoding all affect the real cost. Good analysis separates build time from query time, identifies hidden comparison and allocation costs, and states the assumptions behind every bound.