Skip to content

LeetCode 705: Design HashSet

A clear explanation of designing a hash set without using built-in hash table libraries.

Problem Restatement

We need to design a class called MyHashSet.

The class should support three operations:

OperationMeaning
add(key)Insert key into the hash set
remove(key)Remove key from the hash set if it exists
contains(key)Return whether key exists in the hash set

We cannot use built-in hash table libraries.

The value range is bounded:

0 <= key <= 1_000_000

At most 10,000 calls will be made to add, remove, and contains.

Input and Output

ItemMeaning
ConstructorCreates an empty hash set
add(key)Returns nothing
remove(key)Returns nothing
contains(key)Returns True or False
Duplicate addDoes not create multiple copies
Missing removeDoes nothing

The class shape is:

class MyHashSet:

    def __init__(self):
        ...

    def add(self, key: int) -> None:
        ...

    def remove(self, key: int) -> None:
        ...

    def contains(self, key: int) -> bool:
        ...

Examples

Example:

myHashSet = MyHashSet()
myHashSet.add(1)
myHashSet.add(2)
myHashSet.contains(1)
myHashSet.contains(3)
myHashSet.add(2)
myHashSet.contains(2)
myHashSet.remove(2)
myHashSet.contains(2)

The returned values are:

[None, None, True, False, None, True, None, False]

Step by step:

OperationSet After OperationReturn
add(1){1}None
add(2){1, 2}None
contains(1){1, 2}True
contains(3){1, 2}False
add(2){1, 2}None
contains(2){1, 2}True
remove(2){1}None
contains(2){1}False

First Thought: Direct Addressing

Because every key is between 0 and 1,000,000, we can create a boolean array of size 1,000,001.

Each index represents one possible key.

If key x exists in the set, store:

data[x] = True

If key x does not exist, store:

data[x] = False

Then all three operations are simple.

add(key):      data[key] = True
remove(key):   data[key] = False
contains(key): return data[key]

This is the simplest valid solution for the given constraints.

Problem With Direct Addressing

Direct addressing uses memory for every possible key, even if we only store a few keys.

The key range has 1,000,001 possible values.

So the space cost is:

O(1_000_001)

For this LeetCode problem, that is acceptable.

For a real hash set, this is usually wasteful. A real hash set should use a smaller array of buckets and a hash function.

We will implement the bucket-based version because it better matches the idea of a hash set.

Key Insight

A hash set maps a key to a bucket.

For an integer key, a simple hash function is:

bucket_index = key % bucket_count

Each bucket stores all keys that map to that index.

For example, if:

bucket_count = 1009

then these keys go to the same bucket:

1
1010
2019

because they all have the same remainder modulo 1009.

This is called a collision.

To handle collisions, each bucket can be a list.

When we add, remove, or check a key, we only scan the bucket where that key belongs.

Algorithm

Use an array of buckets.

Each bucket is a list of keys.

During initialization:

  1. Choose a bucket count.
  2. Create that many empty lists.

For add(key):

  1. Compute the bucket index.
  2. Check whether the key is already in that bucket.
  3. If not, append it.

For remove(key):

  1. Compute the bucket index.
  2. Search the bucket.
  3. If the key exists, remove it.

For contains(key):

  1. Compute the bucket index.
  2. Return whether the key exists in that bucket.

Correctness

Each key is always stored in exactly one possible bucket: the bucket given by key % bucket_count.

The add method computes this bucket and only inserts the key if it is not already present. Therefore the structure stores unique keys.

The remove method computes the same bucket for the key. If the key exists, it must be in that bucket, so removing it from that bucket removes it from the whole set. If it is not in that bucket, then it is not in the set.

The contains method also computes the same bucket. Since every inserted key is placed in its hash bucket, checking that bucket is sufficient. If the key is found there, it is in the set. If it is not found there, it has never been inserted or has already been removed.

Therefore all three operations behave exactly like a hash set.

Complexity

Let b be the number of buckets and m be the number of keys stored in one bucket.

OperationAverage TimeWorst-case TimeWhy
addO(1)O(m)Scan one bucket before inserting
removeO(1)O(m)Scan one bucket before removing
containsO(1)O(m)Scan one bucket
SpaceO(b + n)O(b + n)Buckets plus stored keys

With a reasonable bucket count and at most 10,000 operations, the buckets stay small in normal cases.

Implementation

class MyHashSet:

    def __init__(self):
        self.size = 1009
        self.buckets = [[] for _ in range(self.size)]

    def _hash(self, key: int) -> int:
        return key % self.size

    def add(self, key: int) -> None:
        index = self._hash(key)
        bucket = self.buckets[index]

        if key not in bucket:
            bucket.append(key)

    def remove(self, key: int) -> None:
        index = self._hash(key)
        bucket = self.buckets[index]

        if key in bucket:
            bucket.remove(key)

    def contains(self, key: int) -> bool:
        index = self._hash(key)
        bucket = self.buckets[index]

        return key in bucket

Code Explanation

The constructor creates the bucket array:

self.size = 1009
self.buckets = [[] for _ in range(self.size)]

1009 is a prime number. A prime bucket count is often used with modulo hashing because it can reduce some regular collision patterns.

The helper method maps a key to a bucket index:

def _hash(self, key: int) -> int:
    return key % self.size

The add method gets the correct bucket:

index = self._hash(key)
bucket = self.buckets[index]

Then it inserts only if the key is not already present:

if key not in bucket:
    bucket.append(key)

The remove method also finds the correct bucket:

index = self._hash(key)
bucket = self.buckets[index]

Then it removes the key only if it exists:

if key in bucket:
    bucket.remove(key)

The contains method checks membership in the correct bucket:

return key in bucket

Alternative: Direct Addressing

Because the key range is small enough, we can also solve the problem with a boolean array.

class MyHashSet:

    def __init__(self):
        self.data = [False] * 1_000_001

    def add(self, key: int) -> None:
        self.data[key] = True

    def remove(self, key: int) -> None:
        self.data[key] = False

    def contains(self, key: int) -> bool:
        return self.data[key]

This version is simpler and gives guaranteed O(1) time for every operation.

Its tradeoff is memory. It allocates space for all possible keys, even if only a few keys are used.

Direct Addressing Complexity

OperationTimeSpace
addO(1)O(1_000_001)
removeO(1)O(1_000_001)
containsO(1)O(1_000_001)

Testing

def test_my_hash_set():
    hs = MyHashSet()

    hs.add(1)
    hs.add(2)

    assert hs.contains(1) is True
    assert hs.contains(3) is False

    hs.add(2)
    assert hs.contains(2) is True

    hs.remove(2)
    assert hs.contains(2) is False

    hs.remove(42)
    assert hs.contains(42) is False

    hs.add(1009)
    hs.add(2018)

    assert hs.contains(1009) is True
    assert hs.contains(2018) is True

    hs.remove(1009)

    assert hs.contains(1009) is False
    assert hs.contains(2018) is True

    print("all tests passed")

test_my_hash_set()

Test coverage:

TestWhy
Add and containsConfirms insertion
Missing keyConfirms false lookup
Duplicate addConfirms set behavior
Remove existing keyConfirms deletion
Remove missing keyConfirms no error
Colliding keysConfirms bucket collision handling