2.12 Substring Search

Substring search locates occurrences of a pattern p inside a text s. The objective is to find all indices i such that:

s[i : i + m] == p

where m = len(p).

The choice of algorithm depends on input size, number of queries, and tolerance for probabilistic error.

Problem

Given:

s = "abracadabra"
p = "abra"

Return all starting indices:

[0, 7]

Baseline: Naive Search

Check every possible starting position.

func NaiveSearch(s, p string) []int {
	n := len(s)
	m := len(p)

	var out []int

	for i := 0; i+m <= n; i++ {
		match := true
		for j := 0; j < m; j++ {
			if s[i+j] != p[j] {
				match = false
				break
			}
		}
		if match {
			out = append(out, i)
		}
	}

	return out
}

Complexity

Worst case: O(n * m)

Works well for small inputs or random data. Performs poorly on repetitive inputs like "aaaaaa...".

Knuth-Morris-Pratt (KMP)

KMP avoids rechecking characters by preprocessing the pattern.

Idea

Build an array lps (longest prefix which is also suffix). When a mismatch occurs, use lps to skip ahead.

LPS Construction

func BuildLPS(p string) []int {
	m := len(p)
	lps := make([]int, m)

	length := 0
	i := 1

	for i < m {
		if p[i] == p[length] {
			length++
			lps[i] = length
			i++
		} else {
			if length != 0 {
				length = lps[length-1]
			} else {
				lps[i] = 0
				i++
			}
		}
	}

	return lps
}

Search

func KMPSearch(s, p string) []int {
	n := len(s)
	m := len(p)

	lps := BuildLPS(p)

	var out []int

	i, j := 0, 0

	for i < n {
		if s[i] == p[j] {
			i++
			j++
		}

		if j == m {
			out = append(out, i-m)
			j = lps[j-1]
		} else if i < n && s[i] != p[j] {
			if j != 0 {
				j = lps[j-1]
			} else {
				i++
			}
		}
	}

	return out
}

Invariant

At each step, p[0:j] matches s[i-j : i]

When mismatch occurs, lps[j-1] gives the next valid prefix length.

Complexity

Time:  O(n + m)
Space: O(m)

Rabin-Karp (Rolling Hash)

Use hashing to compare substrings in constant time.

Idea

Compute hash of pattern and rolling hash of substrings of s.

If hashes match, verify to avoid collisions.

Implementation

func RabinKarp(s, p string) []int {
	n := len(s)
	m := len(p)

	if m == 0 || m > n {
		return nil
	}

	const base = 256
	const mod = 1_000_000_007

	var h, ph, power int

	power = 1
	for i := 0; i < m-1; i++ {
		power = (power * base) % mod
	}

	for i := 0; i < m; i++ {
		h = (h*base + int(s[i])) % mod
		ph = (ph*base + int(p[i])) % mod
	}

	var out []int

	for i := 0; i <= n-m; i++ {
		if h == ph {
			if s[i:i+m] == p {
				out = append(out, i)
			}
		}

		if i < n-m {
			h = (h - int(s[i])*power%mod + mod) % mod
			h = (h*base + int(s[i+m])) % mod
		}
	}

	return out
}

Complexity

Average: O(n + m)
Worst:   O(n * m) (due to collisions)

Z Algorithm

Build an array Z where Z[i] is the length of the longest substring starting at i that matches the prefix.

Apply it to:

p + "#" + s

Matches occur where Z[i] == len(p).

Key Property

Z[i] = length of match between s[i:] and s[0:]

Complexity

Time: O(n + m)
Space: O(n + m)

When to Use What

Method	Time	Space	Notes
Naive	O(n*m)	O(1)	Simple, small inputs
KMP	O(n+m)	O(m)	Deterministic, no backtracking
Rabin-Karp	O(n+m) avg	O(1)	Good for multiple patterns
Z algorithm	O(n+m)	O(n)	Clean prefix-based approach

Multiple Pattern Search

For many patterns, use a trie-based automaton such as Aho-Corasick.

This builds a finite automaton that processes the text in one pass and reports all matches.

Common Pitfalls

Comparing substrings directly in loops leads to quadratic behavior.

Using Rabin-Karp without verifying matches risks false positives.

Incorrect LPS construction breaks KMP silently.

Mixing byte indexing and Unicode characters produces incorrect matches in non-ASCII text.

For large alphabets, hash base selection affects collision rate.

Design Pattern

Substring search reduces to:

A representation of partial matches
A rule for advancing on mismatch
A way to avoid rechecking previous work

KMP uses prefix structure. Rabin-Karp uses hashing. Z uses prefix comparisons.

Takeaway

Efficient substring search avoids redundant comparisons. KMP provides deterministic linear time. Rolling hash offers flexibility for multiple queries. The correct choice depends on constraints: input size, number of patterns, and tolerance for probabilistic behavior.