16.3 Z Algorithm
Many string algorithms need to answer the question: > How many characters match between a string prefix and a substring starting at a particular position?
16.3 Z Algorithm
Problem
Many string algorithms need to answer the question:
How many characters match between a string prefix and a substring starting at a particular position?
Consider:
ABABABCA
At position 2:
ABABABCA
ABABCA
The substring beginning at position 2 shares four characters with the prefix:
ABAB
A naive solution compares characters one by one for every position.
For a string of length n, this may require:
O(n²)
comparisons.
The challenge is to compute all prefix matches in linear time.
Solution
The Z Algorithm computes an array:
Z[i]
where:
Z[i] = length of the longest substring
starting at i that matches
the prefix of the string
The resulting array captures significant structural information about the string and can be used for:
- Exact pattern matching
- Repeated substring detection
- String compression
- Prefix analysis
- Periodicity detection
- Border computation
The algorithm runs in:
O(n)
time.
Understanding the Z Array
Consider:
S = AABCAABXAAY
Indexing:
| Index | Character |
|---|---|
| 0 | A |
| 1 | A |
| 2 | B |
| 3 | C |
| 4 | A |
| 5 | A |
| 6 | B |
| 7 | X |
| 8 | A |
| 9 | A |
| 10 | Y |
The Z values are:
| Index | Z |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 0 |
| 4 | 3 |
| 5 | 1 |
| 6 | 0 |
| 7 | 0 |
| 8 | 2 |
| 9 | 1 |
| 10 | 0 |
At position 4:
AABCAABXAAY
AAB
Three characters match the prefix.
Therefore:
Z[4] = 3
The Z Box
The key optimization is maintaining a region:
[L, R]
called the Z-box.
Inside this interval:
S[L..R]
matches:
S[0..R-L]
This means previous work can be reused.
Example:
A B A B A B C A
|-----|
Z-box
Any position inside the box inherits information already computed.
Instead of starting comparisons from scratch, the algorithm copies information from a corresponding position in the prefix.
This avoids redundant character comparisons.
Naive Construction
A direct implementation:
def zValue
(s : Array Char)
(start : Nat)
: Nat :=
let rec loop (k : Nat) :=
if start + k < s.size
&& s[k]! = s[start + k]! then
loop (k + 1)
else
k
loop 0
Applying this at every position yields:
O(n²)
time.
The real Z Algorithm eliminates this inefficiency.
Linear-Time Construction
The algorithm keeps track of:
L = left boundary
R = right boundary
of the current Z-box.
For each position:
Case 1: Outside the Z-box
i > R
No previous information exists.
Perform direct comparison.
AABCAABXAAY
^
Expand character by character.
Case 2: Inside the Z-box
i ≤ R
Reuse previously computed information.
Suppose:
k = i - L
Then:
Z[k]
already contains information about matching prefixes.
Two subcases arise:
Fully Contained
Z[k] < R - i + 1
Then:
Z[i] = Z[k]
No further work required.
Touching Boundary
Z[k] ≥ R - i + 1
The match may continue beyond R.
Start expansion from the boundary.
Current box
[L----------R]
^
i
Only new characters are compared.
Lean Implementation
def buildZ
(s : Array Char)
: Array Nat :=
Id.run do
let n := s.size
let mut z := Array.mkArray n 0
let mut l := 0
let mut r := 0
for i in [1:n] do
if i <= r then
let k := i - l
let value :=
Nat.min z[k]! (r - i + 1)
z := z.set! i value
let mut current := z[i]!
while i + current < n
&& s[current]! =
s[i + current]! do
current := current + 1
z := z.set! i current
if i + current - 1 > r then
l := i
r := i + current - 1
return z
Every character contributes to at most a constant amount of expansion work.
Therefore total runtime remains linear.
Pattern Matching with Z
The Z Algorithm becomes a pattern matcher through a simple transformation.
Construct:
Pattern + "$" + Text
Example:
Pattern = ABA
Text = ABABABA
Combined string:
ABA$ABABABA
Compute the Z array.
Whenever:
Z[i] = Pattern.length
a complete match exists.
Example
Pattern:
ABA
Text:
ABABABA
Combined:
ABA$ABABABA
Z values:
| Position | Z |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 3 |
| 5 | 0 |
| 6 | 3 |
| 7 | 0 |
| 8 | 3 |
| 9 | 0 |
| 10 | 1 |
Matches occur where:
Z[i] = 3
Positions:
0
2
4
inside the original text.
Detecting Repeated Patterns
The Z array naturally reveals repetition.
Example:
ABABABAB
Z values:
| Position | Z |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 6 |
| 3 | 0 |
| 4 | 4 |
| 5 | 0 |
| 6 | 2 |
| 7 | 0 |
Large values indicate repeated prefix structures.
This property is useful for:
- Compression
- DNA analysis
- Pattern mining
- Period detection
Correctness
The algorithm maintains the invariant:
S[L..R]
matches:
S[0..R-L]
at every step.
For positions inside the current Z-box, previously computed information remains valid because the matched regions are identical.
Whenever the copied information reaches the boundary, direct expansion verifies whether the match extends further.
Therefore every Z value equals the exact length of the longest prefix match beginning at that position.
Complexity Analysis
| Operation | Complexity |
|---|---|
| Z construction | O(n) |
| Pattern matching | O(n + m) |
| Extra memory | O(n) |
Where:
| Symbol | Meaning |
|---|---|
| n | Text length |
| m | Pattern length |
KMP vs Z Algorithm
| Feature | KMP | Z Algorithm |
|---|---|---|
| Preprocessing target | Pattern | Entire string |
| Matching complexity | O(n + m) | O(n + m) |
| Failure table | LPS | Z-array |
| Pattern matching | Excellent | Excellent |
| Prefix analysis | Moderate | Excellent |
| Repetition detection | Moderate | Excellent |
| Conceptual simplicity | Moderate | High |
Both algorithms achieve linear performance.
KMP views matching from the perspective of the pattern and its prefixes.
The Z Algorithm views matching from the perspective of every position in the string and how strongly it resembles the prefix.
This prefix-centric perspective makes the Z Algorithm one of the most elegant and versatile tools in string processing.