Remove duplicate entries from a dataset or stream efficiently using hash sets for membership tracking.
Problem
You need to remove duplicates from a dataset or stream efficiently.
A naive approach compares each element with all previously seen elements, which leads to quadratic time. For large inputs, this is impractical.
Solution
Use a hash set to track seen elements.
def deduplicate(items):
    seen = set()                 # items already emitted
    output = []
    for item in items:
        if item not in seen:     # expected O(1) membership test
            seen.add(item)
            output.append(item)
    return output

For streaming or memory-constrained environments, emit items immediately and maintain only the set.
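For example, a minimal streaming sketch as a Python generator: the logic is unchanged, but each item is emitted as soon as it is first seen, and only the set is retained.

def deduplicate_stream(items):
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item           # emit immediately; no output list is kept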
Discussion
Hash-based deduplication relies on fast membership checks. Each item is processed once. If the item has not been seen, it is emitted and recorded. If it has been seen, it is skipped.
The set encodes the invariant that all emitted items are unique. The order of the output matches the first occurrence of each item in the input.
This pattern appears in log processing, database ingestion, event streams, and data cleaning pipelines. It is also used as a preprocessing step before grouping or counting.
When memory is limited, exact deduplication may not be feasible for very large key spaces. In such cases, approximate structures such as Bloom filters can be used; they admit false positives, which means some unique items may be wrongly treated as duplicates and dropped.
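As a rough sketch of the approximate approach, the following hand-rolled Bloom filter derives several bit positions from a single SHA-256 digest. The sizes here are illustrative, not tuned; a production system would compute num_bits and num_hashes from the expected key count and target false-positive rate. Note that a false positive here means a genuinely new item is dropped.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Slice num_hashes 4-byte chunks out of one SHA-256 digest
        # and map each to a bit position. Assumes string items.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def deduplicate_approx(items):
    bloom = BloomFilter()
    for item in items:
        if item not in bloom:    # false positive: a unique item is dropped
            bloom.add(item)
            yield item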
Correctness
Let output be the list produced by the algorithm and seen be the set.
Invariant after processing the first i items:
- seen contains exactly the items already appended to output.
- output contains no duplicates.
- output contains each distinct item from the prefix of length i exactly once, in order of first occurrence.
When processing item x:
- If x is in seen, it has already been appended, so skipping it preserves uniqueness.
- If x is not in seen, appending it to output and inserting it into seen maintains the invariant.
Thus, after processing all items, output contains exactly one copy of each distinct item.
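The argument can be spot-checked against a brute-force reference. A small randomized test, assuming the deduplicate function from the Solution section:

import random

def dedup_reference(items):
    # Quadratic reference: keep an item only if it was not already kept.
    out = []
    for item in items:
        if item not in out:
            out.append(item)
    return out

for _ in range(1000):
    data = [random.randint(0, 9) for _ in range(random.randint(0, 30))]
    assert deduplicate(data) == dedup_reference(data)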
Complexity
Let n be the number of items and k be the number of distinct items.
Each iteration performs a membership test and possibly an insertion. With a good hash function and bounded load factor, both operations take expected O(1) time.
Total expected time is O(n).
Space usage is O(k) for the set plus O(k) for the output.
Worst case time is O(n^2) if all keys collide, but this is avoided in practice with proper hashing.
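A rough timing sketch makes the gap concrete; the list-based variant is the quadratic approach from the Problem statement, and the absolute numbers depend on the machine.

import time

def dedup_with_list(items):
    seen = []                    # O(n) membership test per item -> O(n^2) total
    output = []
    for item in items:
        if item not in seen:
            seen.append(item)
            output.append(item)
    return output

data = list(range(5000)) * 2     # 10,000 items, 5,000 distinct

start = time.perf_counter()
deduplicate(data)
print("set:  %.3fs" % (time.perf_counter() - start))

start = time.perf_counter()
dedup_with_list(data)
print("list: %.3fs" % (time.perf_counter() - start))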
Example
Remove duplicates from a list:
items = ["a", "b", "a", "c", "b", "d"]

Processing:
"a" -> output ["a"], seen {"a"}
"b" -> output ["a", "b"], seen {"a", "b"}
"a" -> skip
"c" -> output ["a", "b", "c"], seen {"a", "b", "c"}
"b" -> skip
"d" -> output ["a", "b", "c", "d"], seen {"a", "b", "c", "d"}Final result:
["a", "b", "c", "d"]Variations
Deduplicate without preserving order by using only a set and returning its elements.
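In Python this collapses to a one-liner; the iteration order of the result is then arbitrary with respect to the input.

unique_items = list(set(items))  # order of first occurrence is lost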
Deduplicate based on a derived key:
for item in items:
    key = key_of(item)           # e.g. a field, a normalized form, or a hash
    if key not in seen:
        seen.add(key)
        output.append(item)

Deduplicate within a sliding window by removing expired items from the set, as sketched below.
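A sketch of the sliding-window variant, under one reasonable definition: an item is emitted if no copy of it appeared among the previous window_size items. A deque tracks the window contents in order, and per-item counts decide when an expired item can really be forgotten.

from collections import deque

def deduplicate_window(items, window_size):
    counts = {}                  # item -> occurrences in the current window
    window = deque()             # the last window_size items, oldest first
    for item in items:
        if counts.get(item, 0) == 0:
            yield item           # no copy in the current window
        counts[item] = counts.get(item, 0) + 1
        window.append(item)
        if len(window) > window_size:
            expired = window.popleft()
            counts[expired] -= 1
            if counts[expired] == 0:
                del counts[expired]  # keeps memory bounded by the window size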
Approximate deduplication with Bloom filters for large-scale streaming.
Implementation Notes
Use a hash set when order of first occurrence matters. Use a sort-and-unique approach when order does not matter and memory is constrained.
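A sketch of the sort-and-unique alternative: after sorting, duplicates are adjacent, so no auxiliary set is needed.

def deduplicate_sorted(items):
    result = []
    for item in sorted(items):   # sorts a copy; sort in place to save memory
        if not result or item != result[-1]:
            result.append(item)
    return result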
Ensure the equality and hash function reflect the notion of duplicate. For example, normalize strings if case-insensitive deduplication is required.
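For instance, case-insensitive deduplication of strings via a normalized key, using the derived-key pattern from Variations:

def deduplicate_case_insensitive(strings):
    seen = set()
    output = []
    for s in strings:
        key = s.casefold()       # normalize for comparison, emit the original
        if key not in seen:
            seen.add(key)
            output.append(s)
    return output

print(deduplicate_case_insensitive(["Log", "log", "LOG", "warn"]))  # ['Log', 'warn']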
Avoid mutating items after insertion into the set: if a mutation changes an item's hash, the set can no longer find it, and later duplicates slip through.
For large objects, store identifiers or hashes instead of full objects when possible.
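A sketch for large records, assuming a caller-supplied serialize function that maps each record to bytes (hypothetical here): storing a fixed-size digest bounds the memory per key, at the cost of a negligible but nonzero hash-collision risk.

import hashlib

def deduplicate_large(records, serialize):
    seen = set()
    for record in records:
        digest = hashlib.sha256(serialize(record)).digest()  # 32 bytes per key
        if digest not in seen:
            seen.add(digest)
            yield record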
Common Mistakes
Using a list to track seen items, leading to quadratic time.
Using inconsistent equality, such as comparing normalized values but hashing raw values.
Forgetting that approximate methods may drop unique items.
Allowing the set to grow without bound in streaming systems.
Assuming deduplication preserves order when using unordered structures.