5.19 Deterministic Hashing

Problem

You need hash values that remain stable across program runs, machines, builds, or language runtimes.

Many hash tables deliberately use randomized hashing for attack resistance. This is good for public-facing in-memory maps, but it is wrong when the hash value becomes part of a persistent file, cache key, distributed partition rule, or reproducible build artifact.

Solution

Use a fixed, documented hash function with fixed parameters.

deterministic_hash(key) -> unsigned integer

The function must define:

- input encoding
- byte order
- seed or absence of seed
- output width
- version

For a string key, first define the byte representation, then hash the bytes.

hash_string(s):
    bytes = encode_utf8(s)
    return deterministic_hash_bytes(bytes)

For a composite key, encode fields unambiguously.

hash_pair(a, b):
    bytes = encode(a) || separator || encode(b)
    return deterministic_hash_bytes(bytes)

When possible, prefer structured field hashing rather than ad hoc string concatenation.

Discussion

Deterministic hashing is a reproducibility tool. It gives the same answer for the same logical key under the same specification. This matters when a hash is stored, shared, compared, or used to route work across systems.

Examples include content-addressed storage, database shard assignment, build cache keys, deduplication fingerprints, distributed task partitioning, and stable document identifiers. In these settings, changing the hash function changes the identity system.

Deterministic hashing must be specified more carefully than ordinary in-memory hashing. A runtime hash function may depend on a random seed. A language may change its hash implementation between versions. A string may have multiple Unicode representations that look identical to a human but encode differently. A composite key may collide accidentally if fields are concatenated without boundaries.

For persistent use, the hash function becomes part of the data format. Treat it like a schema field. Give it a name and version.

hash_algorithm = "xxh3-64-v1"

or:

hash_algorithm = "sha256-v1"

The exact choice depends on the use case. Fast non-cryptographic hashes are appropriate for partitioning, checksums, and internal identifiers where adversarial collision resistance is not required. Cryptographic hashes are appropriate for content addressing, integrity checks, and untrusted inputs where collision resistance matters.

Correctness

The equality-hash contract still applies.

if key1 == key2:
    deterministic_hash(key1) == deterministic_hash(key2)

For this to hold across systems, equality must also be deterministic. If two systems normalize strings differently, serialize records differently, or compare case differently, then their hashes may diverge.

For composite keys, the encoding must be injective with respect to the logical fields. The pair ("ab", "c") must not encode the same way as ("a", "bc").

A safe encoding includes lengths or typed boundaries.

encode_pair(a, b):
    return len(a) || a || len(b) || b

This prevents accidental ambiguity.

Complexity

Hashing cost depends on key size.

For fixed-size keys such as integers, hashing is O(1).

For strings, byte arrays, and serialized records, hashing is O(L), where L is the encoded length.

For large content-addressed objects, hashing may dominate runtime. Streaming hash APIs avoid loading the whole object into memory.

Example

Suppose a distributed system assigns files to shards by path.

shard = deterministic_hash(path) mod shard_count

If one service uses randomized runtime hashing and another service uses a different seed, they disagree on shard placement. Requests may be routed to the wrong node.

A stable rule fixes the function and encoding:

bytes = utf8_normalized_path(path)
h = xxh3_64_seed0(bytes)
shard = h mod shard_count

Every service that implements this rule assigns the same path to the same shard.

Implementation Notes

Do not use default language hash functions for persistent identifiers unless the language explicitly guarantees stability.

Record the hash algorithm and version near the stored hash.

Define Unicode normalization rules for string keys.

Define byte order for integers.

Use length-prefix encoding for composite keys.

For distributed systems, deploy hash changes as migrations, not silent replacements.

For security-sensitive identifiers, use cryptographic hashing.

Common Mistakes

Using process-randomized hashes for persisted data.

Changing the hash algorithm without migrating stored values.

Concatenating fields without separators or lengths.

Ignoring Unicode normalization.

Assuming different languages implement the same named hash identically.

Using deterministic non-cryptographic hashes where adversarial collision resistance is required.