# 5.24 Real-World Hash Table Design

## Problem

You need to choose and implement a hash table that performs reliably under real workloads: mixed key types, uneven access patterns, deletions, resizing, and possible adversarial input.

Purely theoretical designs often fail in production because of constant factors, memory layout, and edge cases.

## Solution

Design the hash table as a layered system with explicit responsibilities:

```text id="h2v8qk"
- hashing layer
- indexing layer
- collision handling layer
- resizing layer
- memory layout layer
```

A robust baseline structure:

```text id="k9m1zr"
HashTable:
    entries: contiguous array
    capacity: integer
    size: integer
    load_factor_threshold: float
```

Core operations:

```text id="p4n7xd"
index = hash(key)
probe sequence in entries
match via equality check
```

Resizing policy:

```text id="v7c3pl"
if size / capacity > threshold:
    rehash to larger capacity
```
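The pieces above can be combined into a minimal open-addressing sketch in Python. All names are illustrative; a production table would also handle deletion (via tombstones) and use a stronger, possibly keyed, hash:

```python
class HashTable:
    _EMPTY = object()  # sentinel marking an unused slot

    def __init__(self, capacity=8, load_factor_threshold=0.7):
        self.capacity = capacity
        self.size = 0
        self.load_factor_threshold = load_factor_threshold
        self.entries = [self._EMPTY] * capacity  # contiguous slot array

    def _probe(self, key):
        # Linear probe sequence starting at hash(key) mod capacity;
        # stops at the key's slot or the first empty slot.
        index = hash(key) % self.capacity
        while True:
            slot = self.entries[index]
            if slot is self._EMPTY or slot[0] == key:  # explicit equality check
                return index
            index = (index + 1) % self.capacity

    def insert(self, key, value):
        if (self.size + 1) / self.capacity > self.load_factor_threshold:
            self._resize(self.capacity * 2)  # geometric growth
        index = self._probe(key)
        if self.entries[index] is self._EMPTY:
            self.size += 1
        self.entries[index] = (key, value)

    def lookup(self, key):
        slot = self.entries[self._probe(key)]
        return None if slot is self._EMPTY else slot[1]

    def _resize(self, new_capacity):
        old = self.entries
        self.capacity = new_capacity
        self.size = 0
        self.entries = [self._EMPTY] * new_capacity
        for slot in old:
            if slot is not self._EMPTY:
                self.insert(*slot)  # rehash every surviving entry
```

Because the load factor is bounded strictly below 1 before every insert, the probe loop always terminates at an empty slot or a match.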

## Discussion

A production-grade hash table is not a single algorithm but a combination of design decisions that interact.

### 1. Hashing layer

The hash function must be:

* fast for common keys
* well-distributed for structured input
* consistent with equality
* optionally keyed for untrusted input

A weak hash function can destroy performance even if the rest of the table is well-designed.
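As a concrete (and deliberately simple) illustration, here is 64-bit FNV-1a with an optional seed XORed into the offset basis to approximate keying. For genuinely untrusted input you would reach for a vetted keyed hash such as SipHash; this sketch only shows the shape of the idea:

```python
FNV_OFFSET = 0xcbf29ce484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001b3        # standard 64-bit FNV prime

def fnv1a(data: bytes, seed: int = 0) -> int:
    # Mixing a secret seed into the starting state "keys" the function
    # (weakly); seed=0 gives plain FNV-1a.
    h = FNV_OFFSET ^ seed
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF  # keep 64 bits
    return h
```

Note the consistency requirement: if two keys compare equal, they must feed identical bytes to the hash.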

### 2. Indexing layer

Index computation maps hash values into table positions.

```text id="m8q2dn"
index = hash(key) mod capacity
```

Capacity choice affects distribution. Power-of-two capacities allow masking instead of division, which is fast, but they expose only the low bits of the hash and therefore require good bit mixing. Prime capacities break up some structured collision patterns but pay for a slower modulo.
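A sketch of the power-of-two case, assuming 64-bit hash values: multiply by a large odd constant (here the golden-ratio multiplier, i.e. Fibonacci hashing) and take the top bits, so that low-entropy hashes still spread across the table:

```python
GOLDEN = 0x9E3779B97F4A7C15  # 2^64 / golden ratio, rounded to odd

def index_for(h: int, capacity: int) -> int:
    assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
    mixed = (h * GOLDEN) & 0xFFFFFFFFFFFFFFFF  # scramble all 64 bits
    shift = capacity.bit_length() - 1          # log2(capacity)
    return mixed >> (64 - shift)               # take the top bits
```

Compare the plain mask `h & (capacity - 1)` on structured input such as multiples of 64: the mask collapses every key to index 0, while the mixed index spreads them.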

### 3. Collision handling layer

Choices:

* chaining for simplicity and flexibility
* open addressing for cache efficiency
* hybrid structures for worst-case control

The correct choice depends on workload:

* read-heavy: open addressing
* insert-heavy: chaining or hybrid
* adversarial input: hybrid or treeified buckets
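Chaining is the simplest option to sketch: each slot holds a small bucket list, which makes deletion trivial (no tombstones) at the cost of pointer-chasing per bucket. Names here are illustrative:

```python
class ChainedTable:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]  # one list per slot

    def insert(self, key, value):
        bucket = self.buckets[hash(key) % self.capacity]
        for i, (k, _) in enumerate(bucket):
            if k == key:              # replace an existing entry
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def lookup(self, key):
        for k, v in self.buckets[hash(key) % self.capacity]:
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self.buckets[hash(key) % self.capacity]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]         # no tombstones needed with chaining
                return True
        return False
```

A treeified variant would replace the bucket list with a balanced tree once it grows past a threshold, capping per-bucket work at O(log n).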

### 4. Resizing layer

Resizing is essential for maintaining performance.

Without resizing:

```text id="q1z8mv"
load factor grows → collisions grow → performance degrades
```

With resizing:

```text id="t6k3np"
load factor bounded → expected constant time preserved
```

Geometric growth ensures amortized O(1) insertion cost.
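The amortized claim can be checked numerically under assumed parameters (doubling, threshold 0.7): the total number of entries copied across all rehashes stays within a small constant multiple of the number of inserts, because the resize sizes form a geometric series:

```python
def total_rehash_copies(n: int, initial_capacity: int = 8,
                        threshold: float = 0.7) -> int:
    # Simulate n inserts and count every entry moved by a rehash.
    capacity, size, copies = initial_capacity, 0, 0
    for _ in range(n):
        if (size + 1) / capacity > threshold:
            copies += size        # a rehash moves every existing entry
            capacity *= 2         # geometric growth
        size += 1
    return copies
```

With linear growth (say, adding a fixed number of slots each time) the same count would be quadratic in n, which is why geometric growth is the standard choice.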

### 5. Memory layout layer

Memory layout often determines real performance more than algorithm choice.

Key options:

* contiguous arrays (best cache locality)
* pointer-based nodes (flexible but slower)
* segmented arrays (balance between both)

Modern high-performance implementations favor contiguous storage with careful probing.

## Correctness

Correctness is defined by invariants:

1. Every key is stored exactly once (no duplicates).
2. Lookup always searches the correct candidate positions.
3. Equality is always checked explicitly.
4. Resizing preserves all entries.

Formally:

```text id="x3v9jq"
lookup(key) returns value iff key was inserted and not deleted
```

All optimizations must preserve these invariants regardless of layout or caching behavior.
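These invariants are easy to check mechanically with a model-based test: drive the table through random operations and compare every answer against a reference Python dict. The `DictTable` stand-in and the `insert`/`lookup`/`delete` method names are assumptions; substitute any implementation exposing the same interface:

```python
import random

def check_invariants(table, num_ops=2000, seed=0):
    rng = random.Random(seed)
    model = {}                                # reference implementation
    for _ in range(num_ops):
        key = rng.randrange(50)               # small key space forces reuse
        op = rng.choice(("insert", "delete", "lookup"))
        if op == "insert":
            value = rng.randrange(1000)
            table.insert(key, value)
            model[key] = value
        elif op == "delete":
            table.delete(key)
            model.pop(key, None)
        # lookup(key) returns value iff key was inserted and not deleted
        assert table.lookup(key) == model.get(key)
    return True

class DictTable:
    """Trivial dict-backed stand-in used to exercise the harness."""
    def __init__(self):
        self.d = {}
    def insert(self, key, value):
        self.d[key] = value
    def delete(self, key):
        self.d.pop(key, None)
    def lookup(self, key):
        return self.d.get(key)
```

Running the harness after every optimization catches invariant regressions that benchmarks alone would miss.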

## Complexity

Expected performance:

```text id="z7m4pd"
insert → O(1)
lookup → O(1)
delete → O(1)
```

Worst case:

```text id="c8q1xn"
insert / lookup / delete → O(n)  (all keys in one collision chain)
```
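The O(n) worst case is easy to provoke: keys engineered to share a single hash value collapse any table into one probe chain. This sketch counts equality checks during a lookup in Python's built-in dict; the names are illustrative:

```python
class BadKey:
    comparisons = 0                    # class-wide equality-check counter

    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return 42                      # every key collides

    def __eq__(self, other):
        BadKey.comparisons += 1
        return isinstance(other, BadKey) and self.n == other.n

def lookup_comparisons(n):
    # Build a dict of n colliding keys, then count equality checks
    # made while looking up the last-inserted key.
    d = {BadKey(i): i for i in range(n)}
    BadKey.comparisons = 0
    assert d[BadKey(n - 1)] == n - 1
    return BadKey.comparisons
```

A single lookup performs on the order of n equality checks, which is exactly the degradation a keyed hash is meant to prevent for adversarial input.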

With hybrid bucket structures:

```text id="r2v9mk"
worst-case lookup → O(log n)
```

Amortized resizing cost:

```text id="f6n2ld"
O(1) per operation
```

## Example

A real system design for a key-value cache:

```text id="k5m8qz"
key: request_id
value: response
```

Design choices:

* keyed hash for security
* open addressing for cache efficiency
* load factor threshold at 0.7
* resizing at 2x growth
* fallback treeified buckets for collision clusters

Behavior under load:

* normal traffic: constant-time access
* burst traffic: stable latency
* adversarial keys: bounded degradation

## Implementation Notes

Treat the hash table as a system, not a data structure.

Tune parameters based on workload:

```text id="p9c3nx"
- load factor threshold
- bucket representation
- hash function strength
- resizing strategy
```

Benchmark with realistic workloads, not synthetic random inputs.

Monitor:

* probe length distribution
* bucket size distribution
* cache miss rate
* resize frequency

Avoid premature micro-optimizations. Focus first on invariants and distribution quality.
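As one example of the monitoring above, a probe-length (displacement) histogram for a linearly probed table can be computed from the raw slot array. `slots` (with `None` marking empty slots) and `home_index` (mapping a key to its initial index) are assumptions about the host table's internals:

```python
from collections import Counter

def probe_length_histogram(slots, capacity, home_index):
    # Displacement = how far each stored key sits from its home slot.
    # A long tail here is an early warning before latency degrades.
    hist = Counter()
    for i, slot in enumerate(slots):
        if slot is None:
            continue
        key = slot[0]
        distance = (i - home_index(key)) % capacity
        hist[distance] += 1
    return hist
```

Sampling this periodically (rather than on every operation) keeps the monitoring itself cheap.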

## Common Mistakes

* Choosing a hash function without considering input structure.
* Ignoring memory layout and cache behavior.
* Using a single design for all workloads.
* Allowing uncontrolled growth of the load factor.
* Optimizing lookup while ignoring insertion and deletion costs.
* Treating resizing as an implementation detail rather than a core mechanism.
* Relying on average-case guarantees without measuring tail latency.

