19.8 Hashing with Randomness

Hash tables are usually presented as deterministic data structures: compute a hash, reduce it to a table index, and store or retrieve the key.

19.8 Hashing with Randomness

Hash tables are usually presented as deterministic data structures: compute a hash, reduce it to a table index, and store or retrieve the key. In practice, randomness is often part of the design. A randomized hash function can reduce collision patterns, defend against adversarial inputs, and provide strong expected-time guarantees without relying on accidental properties of the input.

The main issue is collisions. A hash table is fast when keys spread evenly across buckets. It is slow when many keys land in the same bucket. Randomized hashing makes the bucket assignment independent of the input order and, in many designs, independent of attacker control.

Problem

You need a dictionary supporting:

insert(key, value)
lookup(key)
delete(key)

with expected constant-time performance.

A hash table gives:

Expected lookup: O(1)
Expected insert: O(1)
Expected delete: O(1)

but only if keys distribute well across buckets.

A bad hash function can turn operations into:

O(n)

because many keys collide into the same bucket.

Solution

Use a hash function chosen randomly from a family of hash functions.

Instead of fixing one function forever:

h(key)

choose:

hᵣ(key)

where r is random seed material selected when the table is created.

The seed remains fixed during table operation. The randomness is used to choose the mapping from keys to buckets.

A simple model looks like this:

import random

class RandomHash:
    def __init__(self, table_size: int, prime: int) -> None:
        self.table_size = table_size
        self.prime = prime
        self.a = random.randrange(1, prime)
        self.b = random.randrange(0, prime)

    def hash(self, key: int) -> int:
        return ((self.a * key + self.b) % self.prime) % self.table_size

For integer keys, this is a common universal-hashing pattern. The parameters a and b are randomly selected. The prime should be larger than the largest key value in the simplified textbook model.

Universal Hashing

A family of hash functions is universal if, for any two distinct keys x and y, the probability that they collide is small when a function is chosen randomly from the family.

The desired property is:

P(h(x) = h(y)) ≤ 1/m

where m is the number of buckets.

This means no pair of distinct keys has an unusually high chance of collision.

The guarantee is over the random choice of the hash function, not over the input distribution.

That detail matters. Universal hashing works even when the input keys are not random.

Why Randomized Hashing Helps

Suppose a hash table has:

m = 1024

buckets.

A weak deterministic hash function might map many related keys to the same bucket.

Example:

0, 1024, 2048, 3072, 4096

If the table index is computed by:

key % 1024

all of these keys land in bucket 0.

A randomized affine hash changes the mapping:

((a * key + b) mod p) mod m

Now the attacker or input pattern cannot easily predict bucket placement without knowing a and b.

The table still has collisions, but systematic collision patterns are much harder to trigger.

Separate Chaining

The easiest collision strategy is separate chaining. Each bucket stores a list of key-value pairs.

from dataclasses import dataclass
from typing import Generic, TypeVar

K = TypeVar("K")
V = TypeVar("V")

@dataclass
class Entry(Generic[K, V]):
    key: K
    value: V

A basic chained hash table for integer keys:

import random
from typing import Optional

class ChainedHashTable:
    def __init__(self, bucket_count: int = 8) -> None:
        self.bucket_count = bucket_count
        self.prime = 2_147_483_647
        self.hash_function = RandomHash(bucket_count, self.prime)
        self.buckets: list[list[Entry[int, str]]] = [
            [] for _ in range(bucket_count)
        ]
        self.size = 0

    def _index(self, key: int) -> int:
        return self.hash_function.hash(key)

    def put(self, key: int, value: str) -> None:
        index = self._index(key)
        bucket = self.buckets[index]

        for entry in bucket:
            if entry.key == key:
                entry.value = value
                return

        bucket.append(Entry(key, value))
        self.size += 1

    def get(self, key: int) -> Optional[str]:
        index = self._index(key)
        bucket = self.buckets[index]

        for entry in bucket:
            if entry.key == key:
                return entry.value

        return None

    def delete(self, key: int) -> bool:
        index = self._index(key)
        bucket = self.buckets[index]

        for position, entry in enumerate(bucket):
            if entry.key == key:
                bucket.pop(position)
                self.size -= 1
                return True

        return False

Example:

table = ChainedHashTable()

table.put(10, "ten")
table.put(20, "twenty")
table.put(30, "thirty")

print(table.get(20))
print(table.delete(10))
print(table.get(10))

Output:

twenty
True
None

This implementation omits resizing to keep the randomized hashing idea visible.

Load Factor

The load factor is:

α = n / m

where:

n = number of stored keys
m = number of buckets

When α is small, buckets are short. When α grows, buckets become longer and operations slow down.

A typical chained hash table resizes when:

α > 0.75

or:

α > 1.0

depending on implementation goals.

Randomized hashing helps spread keys. Resizing keeps the average bucket size small.

Expected Chain Length

With universal hashing, the expected number of keys colliding with a fixed key is bounded by the load factor.

For a table with n keys and m buckets:

Expected collisions for a key ≤ n / m

So if:

n / m = 0.75

the expected number of other keys in the same bucket is constant.

This gives expected constant-time lookup.

Recipe: Rehash on Resize

When the table grows, allocate a larger bucket array and choose a new random hash function.

class ResizingHashTable:
    def __init__(self, bucket_count: int = 8) -> None:
        self.bucket_count = bucket_count
        self.prime = 2_147_483_647
        self.hash_function = RandomHash(bucket_count, self.prime)
        self.buckets: list[list[Entry[int, str]]] = [
            [] for _ in range(bucket_count)
        ]
        self.size = 0

    def _index(self, key: int) -> int:
        return self.hash_function.hash(key)

    def _load_factor(self) -> float:
        return self.size / self.bucket_count

    def _resize(self) -> None:
        old_entries: list[Entry[int, str]] = []

        for bucket in self.buckets:
            old_entries.extend(bucket)

        self.bucket_count *= 2
        self.hash_function = RandomHash(self.bucket_count, self.prime)
        self.buckets = [[] for _ in range(self.bucket_count)]
        self.size = 0

        for entry in old_entries:
            self.put(entry.key, entry.value)

    def put(self, key: int, value: str) -> None:
        if self._load_factor() > 0.75:
            self._resize()

        index = self._index(key)
        bucket = self.buckets[index]

        for entry in bucket:
            if entry.key == key:
                entry.value = value
                return

        bucket.append(Entry(key, value))
        self.size += 1

    def get(self, key: int) -> Optional[str]:
        index = self._index(key)

        for entry in self.buckets[index]:
            if entry.key == key:
                return entry.value

        return None

Choosing a fresh hash function during resize helps avoid preserving an unlucky or adversarial collision pattern.

Random Seeds in Real Hash Tables

Production hash tables often randomize hash seeds for strings and other attacker-controlled inputs.

The goal is not merely performance. It is also denial-of-service resistance.

Without randomization, an attacker who knows the hash function may construct many strings with the same hash bucket. A web server that parses many such keys into a hash map can degrade from expected linear request processing to quadratic behavior.

Randomized seeding makes such attacks harder because the attacker does not know the exact hash mapping used by the process.

Open Addressing

Randomness also appears in open-addressed hash tables.

In open addressing, all entries live directly inside the table array. When a collision occurs, the algorithm probes alternative locations.

Common probing strategies:

Strategy Description
Linear probing Try next slot
Quadratic probing Try positions with quadratic offsets
Double hashing Use second hash to compute step size

Double hashing uses a second hash function:

indexᵢ = (h₁(key) + i * h₂(key)) mod m

The second function randomizes the probe sequence and reduces clustering compared with simple linear probing.

Collision Attacks

A collision attack deliberately feeds a hash table many keys that collide.

Example:

keys = [k₁, k₂, k₃, ..., kₙ]

where:

h(k₁) = h(k₂) = ... = h(kₙ)

A chained table places all keys in one bucket.

Operations become:

O(n)

If the table performs many insertions, total processing can degrade toward:

O(n²)

Randomized hashing reduces this risk by making the hash function unpredictable.

Determinism and Reproducibility

Randomized hashing has one trade-off: iteration order may vary between runs.

Example:

values = {"a": 1, "b": 2, "c": 3}

In some languages or runtimes, internal hash seeding can affect the order in which items are visited.

For tests, logs, serialized output, and builds, nondeterminism can be inconvenient.

The usual answer is to separate concerns:

  • Use randomized hashing for table internals.
  • Sort keys when producing stable output.
  • Use fixed seeds only in controlled test scenarios.
  • Avoid depending on hash table iteration order.

Complexity

With a good randomized hash family and bounded load factor:

Operation Expected Cost
Lookup O(1)
Insert O(1) amortized
Delete O(1)
Resize O(n), occasional

The O(1) bounds are expected, not worst-case. Randomized hashing makes bad cases rare, but it does not make them impossible.

Correctness Argument

Randomized hashing does not change dictionary semantics.

For each key:

  1. The hash function computes a bucket index.
  2. The bucket is searched for an equal key.
  3. Insert updates or appends.
  4. Lookup returns the value associated with the matching key.
  5. Delete removes the matching entry if present.

Collisions affect performance, not correctness. The equality check inside the bucket is essential. A hash match alone never proves key equality.

Common Mistakes

Do not treat hash equality as key equality. Hashes collide.

Do not allow the load factor to grow without resizing.

Do not use modulo reduction with a poor upstream hash and expect good distribution.

Do not expose deterministic hash behavior to adversarial users unless collision resistance is handled elsewhere.

Do not use a non-cryptographic randomized hash as a password hash, MAC, signature, or security boundary. Hash-table randomization and cryptographic hashing solve different problems.

Takeaway

Randomized hashing makes hash tables robust against structured and adversarial inputs. The core technique is to choose a hash function or hash seed randomly, then keep the load factor bounded through resizing. The resulting table preserves exact dictionary semantics while giving expected constant-time operations. Randomness protects performance; equality checks preserve correctness.