2.17 Rolling Hashes

Rolling hashes assign numeric fingerprints to substrings so that many substring comparisons can be done quickly. After preprocessing, the hash of any substring can be computed in constant time. This is useful in substring search, duplicate detection, palindrome checks, longest common substring, and approximate text indexing.

The method is usually probabilistic. Equal hashes suggest equal strings, but collisions can occur unless the algorithm uses a collision-free representation or verifies matches directly.

Problem

Given a string s, answer many queries of the form:

Are s[l1:r1] and s[l2:r2] equal?

A direct comparison costs O(length) per query. With rolling hashes, each comparison can usually be reduced to O(1) after O(n) preprocessing.

Polynomial Hash

A common hash for a byte string is:

H(s[0:n]) = s[0] * B^(n-1) + s[1] * B^(n-2) + ... + s[n-1]

computed modulo a large prime M.

Here:

B = base
M = modulus

The base should be larger than the alphabet size or chosen randomly from a safe range. The modulus should be large enough to reduce collision probability.

Prefix Hashes

Build prefix hashes so that each prefix has a stored hash value.

type RollingHash struct {
	base  int64
	mod   int64
	pow   []int64
	pref  []int64
}

func NewRollingHash(s string, base, mod int64) RollingHash {
	n := len(s)

	pow := make([]int64, n+1)
	pref := make([]int64, n+1)

	pow[0] = 1

	for i := 0; i < n; i++ {
		pow[i+1] = (pow[i] * base) % mod
		pref[i+1] = (pref[i]*base + int64(s[i])) % mod
	}

	return RollingHash{
		base: base,
		mod:  mod,
		pow:  pow,
		pref: pref,
	}
}

The prefix invariant is:

pref[i] is the hash of s[0:i]
pow[i] is base^i modulo mod

Substring Hash

For a half-open range [l, r), compute:

hash(l, r) = pref[r] - pref[l] * base^(r-l)

modulo M.

func (h RollingHash) Hash(l, r int) int64 {
	x := h.pref[r] - (h.pref[l]*h.pow[r-l])%h.mod
	if x < 0 {
		x += h.mod
	}
	return x
}

This gives a normalized hash for s[l:r], independent of its original position.

Comparing Substrings

func SameSubstring(s string, h RollingHash, l1, r1, l2, r2 int) bool {
	if r1-l1 != r2-l2 {
		return false
	}

	if h.Hash(l1, r1) != h.Hash(l2, r2) {
		return false
	}

	return s[l1:r1] == s[l2:r2] // exact verification
}

The final string comparison removes collision risk. If the application tolerates probabilistic answers, the final comparison can be omitted.

Example

s := "abracadabra"
h := NewRollingHash(s, 911382323, 1_000_000_007)

fmt.Println(h.Hash(0, 4) == h.Hash(7, 11)) // "abra" == "abra"
fmt.Println(h.Hash(0, 3) == h.Hash(3, 6))  // "abr" != "aca"

Rolling Window Hash

For fixed-size windows, update the hash incrementally.

func FixedWindowHashes(s string, k int, base, mod int64) []int64 {
	if k <= 0 || k > len(s) {
		return nil
	}

	power := int64(1)
	for i := 0; i < k-1; i++ {
		power = (power * base) % mod
	}

	hash := int64(0)
	for i := 0; i < k; i++ {
		hash = (hash*base + int64(s[i])) % mod
	}

	out := []int64{hash}

	for i := k; i < len(s); i++ {
		left := int64(s[i-k])
		right := int64(s[i])

		hash = (hash - left*power%mod + mod) % mod
		hash = (hash*base + right) % mod

		out = append(out, hash)
	}

	return out
}

This is the usual Rabin-Karp update rule.

Double Hashing

To reduce collision probability, compute two independent hashes.

type DoubleHash struct {
	a RollingHash
	b RollingHash
}

type HashPair struct {
	X int64
	Y int64
}

func NewDoubleHash(s string) DoubleHash {
	return DoubleHash{
		a: NewRollingHash(s, 911382323, 1_000_000_007),
		b: NewRollingHash(s, 972663749, 1_000_000_009),
	}
}

func (h DoubleHash) Hash(l, r int) HashPair {
	return HashPair{
		X: h.a.Hash(l, r),
		Y: h.b.Hash(l, r),
	}
}

Two hashes do not eliminate collisions, but they make accidental collisions much less likely.

Longest Repeated Substring

Rolling hashes can be combined with binary search on the answer length.

For a fixed length k, compute all substring hashes of length k. If a hash repeats, there may be a repeated substring.

func HasRepeatedOfLength(s string, k int) bool {
	if k == 0 {
		return true
	}

	h := NewDoubleHash(s)
	seen := make(map[HashPair]struct{})

	for i := 0; i+k <= len(s); i++ {
		x := h.Hash(i, i+k)
		if _, ok := seen[x]; ok {
			return true
		}
		seen[x] = struct{}{}
	}

	return false
}

For exact correctness, store positions per hash and verify equal substrings when hashes match.

Palindrome Queries

Build one rolling hash for the string and one for the reversed string.

A substring s[l:r] is a palindrome if its forward hash equals the corresponding reverse hash.

func ReverseStringBytes(s string) string {
	b := []byte(s)

	for l, r := 0, len(b)-1; l < r; l, r = l+1, r-1 {
		b[l], b[r] = b[r], b[l]
	}

	return string(b)
}

For substring [l, r) in s, the corresponding range in rev is:

[n-r, n-l)

where n = len(s).

Complexity

Preprocessing:

Time:  O(n)
Space: O(n)

Substring hash query:

Time:  O(1)
Space: O(1)

Exact substring verification, when used, costs:

O(length)

Common Pitfalls

Hash equality does not guarantee string equality. Verify matches when exact correctness is required.

Using a small modulus creates frequent collisions.

Forgetting to normalize negative modulo results produces incorrect hashes.

Using byte hashes for Unicode text compares UTF-8 byte sequences, not characters.

Changing the base or modulus between compared strings makes hashes incompatible.

Integer multiplication may overflow before modulo if values are too large for the chosen type. Use int64 carefully, or use unsigned overflow hashing deliberately and consistently.

Design Pattern

Rolling hash consists of:

A prefix hash table
A power table
A constant time substring formula
An explicit collision policy

The collision policy is part of the algorithm, not an implementation detail.

Takeaway

Rolling hashes make substring comparison fast by replacing text segments with numeric fingerprints. They are powerful for repeated queries and pattern search, but they must be used with a clear collision strategy. Use direct verification for correctness-critical code and double hashing when probabilistic comparison is acceptable.