# 5.2 Hash Functions

## Problem

You need a function that turns a key into a hash code so that a hash table can place the key into a bucket.

A hash table depends on two operations. First, it must decide whether two keys are equal. Second, it must compute a hash code for each key. These two operations must agree. Whenever two keys are equal, they must produce the same hash code.

## Solution

Define the hash function over exactly the same data used by equality.

```text
hash(key) -> integer
```

For an integer key, the hash function may begin with the integer itself and then mix its bits.

```text
hash_int(x):
    h = x
    h = h xor (h >> 16)
    h = h * constant1
    h = h xor (h >> 13)
    h = h * constant2
    h = h xor (h >> 16)
    return h
```
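
As a concrete sketch, here is one possible version of this mixer in Python, masked to 32 bits. The multiply constants and shift amounts match those popularized by the MurmurHash3 finalizer, but other well-tested choices work just as well.

```python
MASK32 = 0xFFFFFFFF  # keep intermediate results within 32 bits

def hash_int(x: int) -> int:
    """Bit mixer for a 32-bit integer key (MurmurHash3-style finalizer)."""
    h = x & MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h
```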

For a string key, the hash function processes each character or byte.

```text
hash_string(s):
    h = seed

    for each byte b in s:
        h = mix(h, b)

    return finalize(h)
```
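
One minimal, runnable instance of this shape is the 32-bit FNV-1a hash: the seed is the FNV offset basis, the mix step is an xor followed by a multiply, and no separate finalize step is applied. A Python sketch:

```python
def hash_string(s: str) -> int:
    """FNV-1a over the UTF-8 bytes of s, reduced to 32 bits."""
    h = 0x811C9DC5                          # 32-bit FNV offset basis (the seed)
    for b in s.encode("utf-8"):
        h ^= b                              # mix the byte into the state
        h = (h * 0x01000193) & 0xFFFFFFFF   # multiply by the 32-bit FNV prime
    return h
```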

For a composite key, combine the hashes of all fields that participate in equality.

```text
hash_pair(a, b):
    h1 = hash(a)
    h2 = hash(b)
    return combine(h1, h2)
```
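
If the library does not provide a combiner, one common order-sensitive choice is the formula used by Boost's `hash_combine`. The Python sketch below assumes 64-bit hash codes; `hash_pair` simply delegates to the built-in `hash` for its fields.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # keep results within 64 bits

def combine(h1: int, h2: int) -> int:
    """Order-sensitive combination of two hash codes (Boost hash_combine style)."""
    h1 &= MASK64
    h2 &= MASK64
    # The large odd constant, derived from the golden ratio, breaks up patterns
    # when the inputs repeat or are zero.
    return (h1 ^ (h2 + 0x9E3779B97F4A7C15 + (h1 << 6) + (h1 >> 2))) & MASK64

def hash_pair(a, b) -> int:
    return combine(hash(a), hash(b))
```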

The essential contract is simple.

```text
if key1 == key2:
    hash(key1) == hash(key2)
```

The converse does not need to hold. Two different keys may share the same hash code.

## Discussion

A hash function is not required to be unique. It maps a large input space into a smaller output space, so collisions are unavoidable. The purpose of the hash function is to distribute common inputs evenly enough that each bucket remains small.

A poor hash function can make a good hash table behave badly. For example, suppose a table has capacity `1024`, and the bucket index is computed from the low bits of the hash code. If the hash function maps many keys to values whose low bits are similar, most entries will land in the same small group of buckets. The table will still be correct, but lookup will become slow.
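
A short sketch makes the failure concrete. Assume the table has a power-of-two capacity and masks the low bits of the hash code to pick a bucket; a hash function that leaves the low bits constant sends every key to the same bucket.

```python
CAPACITY = 1024                        # power of two, so index = hash & (CAPACITY - 1)

def bucket_index(h: int) -> int:
    return h & (CAPACITY - 1)          # keeps only the low 10 bits of the hash code

# A hash function that leaves the low bits constant clusters these keys badly.
hashes = [k << 10 for k in range(100)]       # low 10 bits are always zero
print({bucket_index(h) for h in hashes})     # {0}: every key maps to bucket 0
```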

Hash quality has several practical dimensions. Uniformity means keys are spread across the output space. Speed means the hash can be computed without dominating the cost of lookup. Avalanche behavior means a small change in the key tends to change many bits of the hash. Stability means the same key produces the same hash within the intended scope of use. Security means adversarial users cannot easily create many colliding keys.

Not every application needs the same hash function. A compiler symbol table, an in-memory cache, and a public HTTP parameter parser have different risk profiles. A compiler may prefer speed and deterministic output. A public service may prefer randomized or keyed hashing to reduce collision attacks. A persistent file format may require stable hashes across program versions.

## Correctness

The correctness requirement is compatibility with equality.

Assume a map uses an equality relation `==` on its keys. If two keys are equal, they represent the same logical key and must be stored and found as the same entry. Therefore, they must follow the same hash-table search path. Since the table chooses the first bucket from the hash code, equal keys must have equal hash codes.

If equal keys could produce different hash codes, insertion and lookup could inspect different buckets. The table might insert one key into bucket `i`, then later search for an equal key in bucket `j`. If `i` and `j` differ, lookup could fail even though the logical key is present.

The opposite direction is not required. If two unequal keys produce the same hash code, they land in the same bucket or probing sequence. The table then compares the stored keys directly. The equality check separates the unequal keys and preserves correctness.

## Example

Consider a key representing a point in a grid.

```text
Point:
    x: integer
    y: integer
```

If equality compares both coordinates, then hashing must also use both coordinates.

```text
point_equal(p, q):
    return p.x == q.x and p.y == q.y

hash_point(p):
    return combine(hash_int(p.x), hash_int(p.y))
```

This is correct because equal points have equal `x` values and equal `y` values, so their combined hashes are equal.
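
In Python terms, the pairing could look like the sketch below, where `__hash__` uses exactly the fields that `__eq__` compares.

```python
class Point:
    def __init__(self, x: int, y: int):
        self.x = x
        self.y = y

    def __eq__(self, other) -> bool:
        # Equality uses both coordinates...
        return isinstance(other, Point) and self.x == other.x and self.y == other.y

    def __hash__(self) -> int:
        # ...so hashing must use both coordinates as well.
        return hash((self.x, self.y))

assert Point(3, 4) == Point(3, 4)
assert hash(Point(3, 4)) == hash(Point(3, 4))
```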

This version is correct but risky for performance, because it ignores `y`.

```text
hash_point_bad(p):
    return hash_int(p.x)
```

It is still correct, because equal points have equal `x` values. But many different points with the same `x` coordinate collide. A data set containing points on a vertical line would place all entries into the same bucket group.

This version is incorrect.

```text
point_equal_by_x(p, q):
    return p.x == q.x

hash_point_wrong(p):
    return combine(hash_int(p.x), hash_int(p.y))
```

Here equality ignores `y`, but hashing includes `y`. The points `(3, 4)` and `(3, 9)` are equal according to `point_equal_by_x`, but they may produce different hash codes. A hash table using this pair of operations can fail to find entries.
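
A small Python sketch shows the failure mode with the built-in `dict`: equality ignores `y` while hashing includes it, so a key that compares equal to a stored key can still be missed.

```python
class PointByX:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        # Equality ignores y...
        return isinstance(other, PointByX) and self.x == other.x

    def __hash__(self):
        # ...but hashing includes y, so equal keys can get different hash codes.
        return hash((self.x, self.y))

table = {PointByX(3, 4): "value"}
print(PointByX(3, 9) in table)   # usually False, although the keys compare equal
```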

## Choosing a Hash Function

For small integer keys, use the language or library default unless you have measured a problem. Mature libraries usually provide well-tested integer hashing.

For strings, use a standard non-cryptographic hash such as the one provided by the runtime or container library. Avoid writing a simple sum of character codes, because anagrams and similar strings collide easily.

For composite keys, use the standard tuple, record, or struct hashing mechanism if the language provides one. If you combine fields manually, use an order-sensitive combination function. The pair `(a, b)` should generally hash differently from `(b, a)`.
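
In Python, for example, the built-in tuple hash already provides an order-sensitive combination, so delegating to it is usually enough:

```python
# (a, b) and (b, a) hash differently in general, because the tuple hash
# mixes each element's position into the result.
print(hash(("alpha", "beta")) == hash(("beta", "alpha")))   # False in general
```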

For untrusted inputs, use a keyed hash when available. A keyed hash includes a secret seed chosen by the process. This makes it difficult for an attacker to construct many keys with the same hash behavior.
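
As one illustration, Python's `hashlib.blake2b` accepts a key and a short digest size, which makes it easy to sketch a keyed hash over untrusted byte strings. The names `SECRET_SEED` and `keyed_hash` are only for this sketch.

```python
import hashlib
import os

SECRET_SEED = os.urandom(16)     # chosen once per process, unknown to clients

def keyed_hash(data: bytes) -> int:
    """Keyed hash of untrusted input, reduced to a 64-bit integer."""
    digest = hashlib.blake2b(data, key=SECRET_SEED, digest_size=8).digest()
    return int.from_bytes(digest, "little")

print(keyed_hash(b"user-supplied parameter"))
```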

For persistent identifiers, avoid process-randomized hashes. Many runtime hash functions are intentionally unstable across process runs. That is good for attack resistance but bad for stored indexes, cache keys, or file formats.

## Complexity

A hash function adds cost to every lookup, insertion, and deletion. For fixed-size keys such as integers, this cost is usually constant. For variable-size keys such as strings, the cost is proportional to the length of the key unless the hash is cached.

If a string of length `L` is hashed from scratch, lookup has at least `O(L)` cost before bucket access begins. This matters when keys are long, such as URLs, file paths, source-code identifiers, or serialized records.

Caching hash codes can help when keys are immutable and frequently reused. The cache must remain consistent with the key. Mutable keys are dangerous because changing a key after insertion can invalidate its bucket position.
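
A minimal sketch of hash caching, assuming a wrapper around an immutable string key; the `CachedKey` name is just for the sketch. The hash is computed once at construction, so later lookups avoid the `O(L)` scan.

```python
class CachedKey:
    """Key wrapper that computes its hash code once at construction."""

    def __init__(self, value: str):
        self.value = value
        self._hash = hash(value)        # O(L) once, instead of on every lookup

    def __eq__(self, other):
        return isinstance(other, CachedKey) and self.value == other.value

    def __hash__(self):
        return self._hash               # O(1) after construction

    # The caller must not mutate `value` after the key is inserted into a table.
```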

## Common Mistakes

Do not require different keys to have different hashes. That is usually impossible and unnecessary.

Do not use hashing as equality. A matching hash code only says that two keys may be equal.

Do not hash fields that equality ignores. This can break correctness.

Do not ignore fields that equality uses unless you accept the collision cost.

Do not use process-randomized hashes for persistent data.

Do not use a cryptographic hash by default inside a normal in-memory table. It may be much slower than necessary.

Do not use a weak hash on adversarial inputs exposed to the network. Collision attacks can turn expected constant-time operations into linear-time operations.

