5.6 Sets

Problem

You need to store a collection of distinct keys and answer membership questions quickly.

A set differs from a map because it stores only keys. There is no associated value. The main operations are insert, contains, and delete.

Solution

Implement the set as a hash table that stores keys directly.

HashSet:
    buckets: array of bucket
    size: number of stored keys
    capacity: number of buckets

Membership uses the same bucket selection rule as a hash table.

contains(set, key):
    i = hash(key) mod set.capacity

    for each stored_key in set.buckets[i]:
        if stored_key == key:
            return true

    return false

Insertion adds the key only if it is not already present.

insert(set, key):
    i = hash(key) mod set.capacity

    for each stored_key in set.buckets[i]:
        if stored_key == key:
            return false

    append key to set.buckets[i]
    set.size = set.size + 1
    return true

Deletion removes the key if it exists.

delete(set, key):
    i = hash(key) mod set.capacity

    for each stored_key in set.buckets[i]:
        if stored_key == key:
            remove stored_key
            set.size = set.size - 1
            return true

    return false

Discussion

A set is the simplest useful abstraction built on top of hashing. It answers the question “have I seen this key before?” without storing extra data. This makes it a natural tool for deduplication, graph traversal, cache membership, duplicate detection, and constraint checking.

The central invariant is uniqueness. At most one copy of a key may exist in the set. Insertion must therefore check for an existing equal key before appending a new one. If insertion skips this check, membership may still appear to work, but size becomes wrong, deletion becomes ambiguous, and algorithms that rely on distinctness become incorrect.

Sets use the same equality and hashing contract as maps. Equal keys must have equal hashes. Unequal keys may collide, so contains must compare actual keys after selecting the bucket.

The choice between hash set and ordered set depends on required operations. A hash set is usually better for membership and insertion when no ordering is needed. An ordered set is better when the algorithm needs sorted iteration, range queries, predecessor, or successor operations.

Correctness

The correctness of contains follows from the placement invariant. When a key is inserted, it is placed in the bucket selected by hash(key) mod capacity. A later membership test computes the same index and searches the same bucket. If the key is present, direct equality comparison finds it.

The correctness of insert depends on the uniqueness check. If an equal key already exists in the selected bucket, insertion returns without modifying the set. If no equal key exists in that bucket, the placement invariant guarantees that no equal key exists anywhere else in the set, so appending the key preserves uniqueness.

Deletion is correct because it examines the only bucket where the key can appear. If the key is found, removing it makes future membership tests return false. If the key is not found in that bucket, the key is absent from the set.

Complexity

With a good hash function and bounded load factor, contains, insert, and delete take expected O(1) time.

The worst case is O(n) when many keys collide into the same bucket or probe sequence.

Resizing costs O(n) when it happens, but with geometric growth the amortized insertion cost remains expected O(1).

Space usage is O(n + m), where n is the number of stored keys and m is the number of buckets. In many implementations, m is kept proportional to n.

Example

Suppose you want to remove duplicates from a stream while preserving first occurrence order.

deduplicate(items):
    seen = empty set
    output = empty list

    for item in items:
        if not contains(seen, item):
            insert(seen, item)
            append item to output

    return output

For input:

["go", "lean", "go", "rust", "lean", "zig"]

the output is:

["go", "lean", "rust", "zig"]

The set stores the keys already emitted. Each duplicate is detected by membership before it reaches the output list.

Implementation Notes

Use a hash set when the only question is membership. Do not store dummy values in a map unless the language lacks a native set type.

Keep the equality rule explicit for composite keys. If two records are considered the same by id, then the set should hash and compare by id.

Be careful with mutable keys. If a key is modified after insertion, its hash bucket may no longer match its current value. This can make the key impossible to find or delete.

When iteration order matters, remember that most hash sets do not guarantee stable order. Use an ordered set, a linked hash set, or a separate output list if order is part of the result.

Common Mistakes

Allowing duplicate keys inside the set.

Using a set when the algorithm requires counts. Use a frequency map instead.

Relying on hash set iteration order.

Mutating keys after insertion.

Using identity equality when value equality is required.

Using value equality when identity equality is required.