5.11 Rolling Hashes

Problem

You need to compare many substrings quickly.

Direct substring comparison can be expensive because comparing two substrings of length m may take O(m) time. If an algorithm compares many substrings, this cost can dominate the whole computation.

A rolling hash assigns each prefix of a string a numeric summary so that the hash of any substring can be computed in constant time after preprocessing.

Solution

Choose a base B and modulus M. For a string s, define a prefix hash array.

prefix[0] = 0

for i from 0 to n - 1:
    prefix[i + 1] = (prefix[i] * B + code(s[i])) mod M

Also precompute powers of B.

power[0] = 1

for i from 0 to n - 1:
    power[i + 1] = (power[i] * B) mod M

The hash of substring s[l..r) is:

hash(l, r) = (prefix[r] - prefix[l] * power[r - l]) mod M

If the result is negative, add M.

Discussion

A rolling hash treats a string like a number written in base B. Each character contributes a digit-like value. The prefix array stores hashes of prefixes, and subtraction removes the prefix before the substring.

The main benefit is that substring hashes become cheap. After O(n) preprocessing, each substring hash takes O(1) time. This is useful in substring search, duplicate substring detection, palindrome checks, and string algorithms that compare many intervals.

Rolling hashes are probabilistic when used for equality. Equal substrings always have equal hashes, but unequal substrings may collide. In most algorithmic settings, a large modulus or two independent moduli make collisions unlikely enough. In correctness-critical systems, confirm equality with direct comparison after a hash match.

The choice of base and modulus matters. The base should usually be larger than the alphabet size and should avoid obvious structure. The modulus should be large. Common choices include large primes or machine-word overflow arithmetic, depending on the language and performance goals.

Correctness

For any fixed string and fixed parameters, the prefix recurrence gives each prefix a deterministic hash. The substring formula subtracts the contribution of the first l characters after shifting it to the same scale as prefix[r].

Suppose:

prefix[r] = hash(s[0..r))
prefix[l] = hash(s[0..l))

The prefix s[0..l) contributes to prefix[r] multiplied by B^(r-l). Therefore:

prefix[r] - prefix[l] * B^(r-l)

leaves exactly the contribution from s[l..r). Reducing modulo M gives the stored substring hash.

This proves that the formula consistently computes the same value for the same substring. It does not prove that different substrings have different hashes, because modular arithmetic can collapse distinct values.

Complexity

Preprocessing takes O(n) time and O(n) space for a string of length n.

Each substring hash query takes O(1) time.

Comparing two substrings by hash takes O(1) expected or probabilistic time, depending on the collision model. If a direct comparison is used after equal hashes, the worst-case comparison cost remains O(m) for substring length m.

Example

Find whether a pattern occurs in a text.

contains(text, pattern):
    n = length(text)
    m = length(pattern)

    build prefix hashes for text
    pattern_hash = hash of pattern

    for i from 0 to n - m:
        if hash_text(i, i + m) == pattern_hash:
            if text[i..i+m) == pattern:
                return true

    return false

The hash check filters most positions quickly. The direct comparison protects against collisions.

Implementation Notes

Normalize negative modular results after subtraction.

Use integer types large enough to avoid unintended overflow unless overflow arithmetic is intentional.

Use double hashing when collision probability matters but direct comparison is expensive.

Map characters to positive codes. Using zero for common characters can weaken some formulations.

For persistent or adversarial inputs, avoid relying on a single predictable hash.

Common Mistakes

Treating equal hashes as a proof of equal substrings in correctness-critical code.

Forgetting to multiply the left prefix by power[r - l].

Using inconsistent indexing conventions.

Failing to normalize negative results after modular subtraction.

Choosing a weak base or modulus and then testing only on random strings.