A clear explanation of designing a hash set without using built-in hash table libraries.
Problem Restatement
We need to design a class called MyHashSet.
The class should support three operations:
| Operation | Meaning |
|---|---|
add(key) | Insert key into the hash set |
remove(key) | Remove key from the hash set if it exists |
contains(key) | Return whether key exists in the hash set |
We cannot use built-in hash table libraries.
The value range is bounded:
0 <= key <= 1_000_000At most 10,000 calls will be made to add, remove, and contains.
Input and Output
| Item | Meaning |
|---|---|
| Constructor | Creates an empty hash set |
add(key) | Returns nothing |
remove(key) | Returns nothing |
contains(key) | Returns True or False |
| Duplicate add | Does not create multiple copies |
| Missing remove | Does nothing |
The class shape is:
class MyHashSet:
def __init__(self):
...
def add(self, key: int) -> None:
...
def remove(self, key: int) -> None:
...
def contains(self, key: int) -> bool:
...Examples
Example:
myHashSet = MyHashSet()
myHashSet.add(1)
myHashSet.add(2)
myHashSet.contains(1)
myHashSet.contains(3)
myHashSet.add(2)
myHashSet.contains(2)
myHashSet.remove(2)
myHashSet.contains(2)The returned values are:
[None, None, True, False, None, True, None, False]Step by step:
| Operation | Set After Operation | Return |
|---|---|---|
add(1) | {1} | None |
add(2) | {1, 2} | None |
contains(1) | {1, 2} | True |
contains(3) | {1, 2} | False |
add(2) | {1, 2} | None |
contains(2) | {1, 2} | True |
remove(2) | {1} | None |
contains(2) | {1} | False |
First Thought: Direct Addressing
Because every key is between 0 and 1,000,000, we can create a boolean array of size 1,000,001.
Each index represents one possible key.
If key x exists in the set, store:
data[x] = TrueIf key x does not exist, store:
data[x] = FalseThen all three operations are simple.
add(key): data[key] = True
remove(key): data[key] = False
contains(key): return data[key]This is the simplest valid solution for the given constraints.
Problem With Direct Addressing
Direct addressing uses memory for every possible key, even if we only store a few keys.
The key range has 1,000,001 possible values.
So the space cost is:
O(1_000_001)For this LeetCode problem, that is acceptable.
For a real hash set, this is usually wasteful. A real hash set should use a smaller array of buckets and a hash function.
We will implement the bucket-based version because it better matches the idea of a hash set.
Key Insight
A hash set maps a key to a bucket.
For an integer key, a simple hash function is:
bucket_index = key % bucket_countEach bucket stores all keys that map to that index.
For example, if:
bucket_count = 1009then these keys go to the same bucket:
1
1010
2019because they all have the same remainder modulo 1009.
This is called a collision.
To handle collisions, each bucket can be a list.
When we add, remove, or check a key, we only scan the bucket where that key belongs.
Algorithm
Use an array of buckets.
Each bucket is a list of keys.
During initialization:
- Choose a bucket count.
- Create that many empty lists.
For add(key):
- Compute the bucket index.
- Check whether the key is already in that bucket.
- If not, append it.
For remove(key):
- Compute the bucket index.
- Search the bucket.
- If the key exists, remove it.
For contains(key):
- Compute the bucket index.
- Return whether the key exists in that bucket.
Correctness
Each key is always stored in exactly one possible bucket: the bucket given by key % bucket_count.
The add method computes this bucket and only inserts the key if it is not already present. Therefore the structure stores unique keys.
The remove method computes the same bucket for the key. If the key exists, it must be in that bucket, so removing it from that bucket removes it from the whole set. If it is not in that bucket, then it is not in the set.
The contains method also computes the same bucket. Since every inserted key is placed in its hash bucket, checking that bucket is sufficient. If the key is found there, it is in the set. If it is not found there, it has never been inserted or has already been removed.
Therefore all three operations behave exactly like a hash set.
Complexity
Let b be the number of buckets and m be the number of keys stored in one bucket.
| Operation | Average Time | Worst-case Time | Why |
|---|---|---|---|
add | O(1) | O(m) | Scan one bucket before inserting |
remove | O(1) | O(m) | Scan one bucket before removing |
contains | O(1) | O(m) | Scan one bucket |
| Space | O(b + n) | O(b + n) | Buckets plus stored keys |
With a reasonable bucket count and at most 10,000 operations, the buckets stay small in normal cases.
Implementation
class MyHashSet:
def __init__(self):
self.size = 1009
self.buckets = [[] for _ in range(self.size)]
def _hash(self, key: int) -> int:
return key % self.size
def add(self, key: int) -> None:
index = self._hash(key)
bucket = self.buckets[index]
if key not in bucket:
bucket.append(key)
def remove(self, key: int) -> None:
index = self._hash(key)
bucket = self.buckets[index]
if key in bucket:
bucket.remove(key)
def contains(self, key: int) -> bool:
index = self._hash(key)
bucket = self.buckets[index]
return key in bucketCode Explanation
The constructor creates the bucket array:
self.size = 1009
self.buckets = [[] for _ in range(self.size)]1009 is a prime number. A prime bucket count is often used with modulo hashing because it can reduce some regular collision patterns.
The helper method maps a key to a bucket index:
def _hash(self, key: int) -> int:
return key % self.sizeThe add method gets the correct bucket:
index = self._hash(key)
bucket = self.buckets[index]Then it inserts only if the key is not already present:
if key not in bucket:
bucket.append(key)The remove method also finds the correct bucket:
index = self._hash(key)
bucket = self.buckets[index]Then it removes the key only if it exists:
if key in bucket:
bucket.remove(key)The contains method checks membership in the correct bucket:
return key in bucketAlternative: Direct Addressing
Because the key range is small enough, we can also solve the problem with a boolean array.
class MyHashSet:
def __init__(self):
self.data = [False] * 1_000_001
def add(self, key: int) -> None:
self.data[key] = True
def remove(self, key: int) -> None:
self.data[key] = False
def contains(self, key: int) -> bool:
return self.data[key]This version is simpler and gives guaranteed O(1) time for every operation.
Its tradeoff is memory. It allocates space for all possible keys, even if only a few keys are used.
Direct Addressing Complexity
| Operation | Time | Space |
|---|---|---|
add | O(1) | O(1_000_001) |
remove | O(1) | O(1_000_001) |
contains | O(1) | O(1_000_001) |
Testing
def test_my_hash_set():
hs = MyHashSet()
hs.add(1)
hs.add(2)
assert hs.contains(1) is True
assert hs.contains(3) is False
hs.add(2)
assert hs.contains(2) is True
hs.remove(2)
assert hs.contains(2) is False
hs.remove(42)
assert hs.contains(42) is False
hs.add(1009)
hs.add(2018)
assert hs.contains(1009) is True
assert hs.contains(2018) is True
hs.remove(1009)
assert hs.contains(1009) is False
assert hs.contains(2018) is True
print("all tests passed")
test_my_hash_set()Test coverage:
| Test | Why |
|---|---|
| Add and contains | Confirms insertion |
| Missing key | Confirms false lookup |
| Duplicate add | Confirms set behavior |
| Remove existing key | Confirms deletion |
| Remove missing key | Confirms no error |
| Colliding keys | Confirms bucket collision handling |