# 5.22 Hybrid Hash Structures

## Problem

Pure hash tables do not handle all workloads well. They are fast on average, but can degrade under collisions, skewed key distributions, or heavy deletion patterns.

You need a design that preserves hash-table speed while adapting to pathological cases.

## Solution

Combine multiple internal representations depending on bucket state or load conditions.

A common hybrid design:

```text id="h3q9vn"
bucket:
    if small:
        use array or list
    if medium:
        use hash chaining list
    if large:
        use balanced tree
```
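The bucket-level logic above can be sketched in Python. This is a minimal illustration, not a production design: the names (`HybridBucket`, `SMALL_MAX`) and the threshold value are assumptions, and a sorted list with binary search stands in for the later, tree-like stage:

```python
import bisect

SMALL_MAX = 8  # illustrative threshold: beyond this, upgrade the bucket

class HybridBucket:
    """Collision bucket that starts as an unsorted list (linear scan)
    and upgrades to a sorted list (binary search) once it grows."""

    def __init__(self):
        self.entries = []      # list of (key, value) pairs
        self.sorted = False    # False: linear scan; True: binary search

    def get(self, key):
        if self.sorted:
            i = bisect.bisect_left(self.entries, (key,))
            if i < len(self.entries) and self.entries[i][0] == key:
                return self.entries[i][1]
            return None
        for k, v in self.entries:  # linear scan: cheap while the bucket is small
            if k == key:
                return v
        return None

    def put(self, key, value):
        if self.sorted:
            i = bisect.bisect_left(self.entries, (key,))
            if i < len(self.entries) and self.entries[i][0] == key:
                self.entries[i] = (key, value)   # replace existing key
            else:
                self.entries.insert(i, (key, value))
            return
        for i, (k, _) in enumerate(self.entries):
            if k == key:
                self.entries[i] = (key, value)
                return
        self.entries.append((key, value))
        if len(self.entries) > SMALL_MAX:  # upgrade: sort once, then O(log b) lookups
            self.entries.sort()
            self.sorted = True
```

A full implementation would add a third stage (a balanced tree, as in the pseudocode) for very large buckets; the two-stage version shows the transition mechanism.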

At the table level:

```text id="v8m2lx"
if load_factor high:
    resize

if bucket_size > threshold:
    upgrade bucket structure
```
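These table-level checks can be expressed as a small decision function. The threshold values below are illustrative assumptions, not prescriptions:

```python
def plan_maintenance(num_entries, bucket_sizes, load_max=0.75, bucket_max=8):
    """Return the maintenance actions implied by the table-level rules.

    num_entries / len(bucket_sizes) is the load factor; any bucket whose
    size exceeds bucket_max is flagged for a structure upgrade.
    """
    actions = []
    if num_entries / len(bucket_sizes) > load_max:
        actions.append("resize")
    for i, size in enumerate(bucket_sizes):
        if size > bucket_max:
            actions.append(("upgrade", i))
    return actions
```

Separating the decision from the mutation keeps both rules easy to test and tune independently.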

## Discussion

A hybrid hash structure acknowledges that no single collision strategy is optimal for all regimes.

Hash tables are typically optimized for expected-case performance. However, real workloads can be skewed:

* a small number of keys dominate access
* certain prefixes cluster due to input structure
* adversarial inputs target collisions
* repeated insert-delete cycles create fragmentation

A hybrid approach adapts locally.

### Small buckets

For small bucket sizes, arrays or vectors are optimal. They have minimal overhead and benefit from cache locality. Linear scan is faster than more complex structures at this scale.

### Medium buckets

When a bucket grows, linear scan becomes expensive. At this point, switch to structured storage: a sorted array when lookups dominate, or a linked list when insertions are frequent.

### Large buckets

For worst-case collision clusters, performance must be bounded. Converting the bucket into a balanced tree ensures:

```text id="k4x7bd"
lookup = O(log b)
```

instead of linear time.

This prevents catastrophic degradation.
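As a concrete stand-in (Python's standard library has no balanced tree), a sorted array searched with `bisect` achieves the same O(log b) lookup bound; a red-black tree or similar would play this role in other languages:

```python
import bisect

def upgrade_bucket(entries):
    """Convert an unsorted collision list of (key, value) pairs
    into a sorted array supporting logarithmic lookup."""
    return sorted(entries)

def bucket_lookup(sorted_entries, key):
    """Binary search over the upgraded bucket: O(log b) comparisons."""
    i = bisect.bisect_left(sorted_entries, (key,))
    if i < len(sorted_entries) and sorted_entries[i][0] == key:
        return sorted_entries[i][1]
    return None
```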

## Correctness

The hybrid structure preserves correctness as long as the following invariants hold:

* all entries for a bucket are stored in exactly one internal structure
* equality is always checked explicitly
* transitions between representations preserve all elements

When a bucket is upgraded (for example, list → tree), all entries must be migrated without loss:

```text id="n2q8vd"
for each entry in old_structure:
    insert into new_structure
```

No key may be duplicated or dropped during conversion.

From the perspective of correctness, the internal representation is irrelevant. Only membership and key-value association matter.
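A migration routine can enforce the no-loss, no-duplication invariant directly. In this sketch a `dict` stands in for whatever new representation the bucket is upgrading to:

```python
def migrate(old_entries):
    """Rebuild a bucket in a new representation (a dict here),
    raising if any key would be duplicated along the way."""
    new_structure = {}
    for key, value in old_entries:
        if key in new_structure:
            raise ValueError(f"duplicate key during migration: {key!r}")
        new_structure[key] = value
    # every entry was inserted exactly once, so nothing was dropped
    assert len(new_structure) == len(old_entries)
    return new_structure
```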

## Complexity

Hybrid structures aim to guarantee bounded worst-case lookup.

Typical guarantees:

* small bucket: O(1)
* medium bucket: O(b)
* large bucket: O(log b)

Since large buckets are rare under good hashing, expected performance remains O(1).

Even under adversarial clustering, worst-case lookup is bounded.

Resizing still costs O(n), but occurs infrequently under load factor control.

## Example

A bucket evolves under insertions:

### Stage 1: small

```text id="q9k2lm"
["a", "b"]
```

Linear scan is efficient.

### Stage 2: medium

```text id="z7v3nc"
["a", "b", "c", "d", "e"]
```

Still manageable as array or list.

### Stage 3: large

```text id="x1p8fd"
tree:
    balanced structure over keys
```

Lookup becomes logarithmic rather than linear.

## Implementation Notes

Define clear thresholds for structure transitions.

Avoid frequent oscillation between representations by using hysteresis:

```text id="b6m2kp"
upgrade at size > T1
downgrade at size < T2
where T2 < T1
```
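The hysteresis rule can be sketched as a pure transition function, with T1 and T2 as named above (the values are illustrative):

```python
T1, T2 = 8, 4  # upgrade above T1, downgrade below T2; T2 < T1

def next_representation(current, size):
    """Pick the bucket representation after an insert or delete.

    Sizes between T2 and T1 keep the current form, so a bucket
    hovering near one threshold does not oscillate between
    representations on every operation.
    """
    if current == "array" and size > T1:
        return "tree"
    if current == "tree" and size < T2:
        return "array"
    return current
```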

Keep conversion cost amortized by triggering upgrades only when necessary.

Ensure memory layout changes do not break iterator behavior unless explicitly allowed.

In performance-critical systems, benchmark both representation and transition cost.

## Common Mistakes

* Upgrading too early, adding unnecessary overhead.
* Upgrading too late, allowing performance collapse.
* Failing to migrate all elements during conversion.
* Using trees for all buckets, removing cache efficiency benefits.
* Ignoring downgrade logic, leading to permanently overbuilt structures.
* Assuming hybridization removes the need for good hashing. It only mitigates worst-case behavior, not poor distribution.

