Remove duplicate entries from a dataset or stream efficiently using hash sets for membership tracking.
Problem
You need to remove duplicates from a dataset or stream efficiently.
A naive approach compares each element with all previously seen elements, which leads to quadratic time. For large inputs, this is impractical.
Solution
Use a hash set to track seen elements.
def deduplicate(items):
    seen = set()                 # items already emitted
    output = []
    for item in items:
        if item not in seen:     # expected O(1) membership test
            seen.add(item)
            output.append(item)
    return output

For streaming or memory-constrained environments, emit items immediately and maintain only the set.
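For example, a minimal streaming sketch as a Python generator: the logic is unchanged, but each item is emitted as soon as it is first seen, and only the set is retained.

def deduplicate_stream(items):
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item           # emit immediately; no output list is kept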
Discussion
Hash-based deduplication relies on fast membership checks. Each item is processed once. If the item has not been seen, it is emitted and recorded. If it has been seen, it is skipped.
The set encodes the invariant that all emitted items are unique. The order of the output matches the first occurrence of each item in the input.
This pattern appears in log processing, database ingestion, event streams, and data cleaning pipelines. It is also used as a preprocessing step before grouping or counting.
When memory is limited, exact deduplication may not be feasible for very large key spaces. In such cases, approximate structures such as Bloom filters can be used; they admit false positives, which means some unique items may be wrongly treated as duplicates and dropped.
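As a rough sketch of the approximate approach, the following hand-rolled Bloom filter derives several bit positions from a single SHA-256 digest. The sizes here are illustrative, not tuned; a production system would compute num_bits and num_hashes from the expected key count and target false-positive rate. Note that a false positive here means a genuinely new item is dropped.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Slice num_hashes 4-byte chunks out of one SHA-256 digest
        # and map each to a bit position. Assumes string items.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def deduplicate_approx(items):
    bloom = BloomFilter()
    for item in items:
        if item not in bloom:    # false positive: a unique item is dropped
            bloom.add(item)
            yield item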
Correctness
Let output be the list produced by the algorithm and seen be the set.
Invariant after processing the first i items:
- seen contains exactly the items already appended to output.
- output contains no duplicates.
- output contains each distinct item from the prefix of length i exactly once, in order of first occurrence.
When processing item x:
- If x is in seen, it has already been appended, so skipping it preserves uniqueness.
- If x is not in seen, appending it to output and inserting it into seen maintains the invariant.
Thus, after processing all items, output contains exactly one copy of each distinct item.
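The argument can be spot-checked against a brute-force reference. A small randomized test, assuming the deduplicate function from the Solution section:

import random

def dedup_reference(items):
    # Quadratic reference: keep an item only if it was not already kept.
    out = []
    for item in items:
        if item not in out:
            out.append(item)
    return out

for _ in range(1000):
    data = [random.randint(0, 9) for _ in range(random.randint(0, 30))]
    assert deduplicate(data) == dedup_reference(data)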
Complexity
Let n be the number of items and k be the number of distinct items.
Each iteration performs a membership test and possibly an insertion. With a good hash function and bounded load factor, both operations take expected O(1) time.
Total expected time is O(n).
Space usage is O(k) for the set plus O(k) for the output.
Worst case time is O(n^2) if all keys collide, but this is avoided in practice with proper hashing.
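A rough timing sketch makes the gap concrete; the list-based variant is the quadratic approach from the Problem statement, and the absolute numbers depend on the machine.

import time

def dedup_with_list(items):
    seen = []                    # O(n) membership test per item -> O(n^2) total
    output = []
    for item in items:
        if item not in seen:
            seen.append(item)
            output.append(item)
    return output

data = list(range(5000)) * 2     # 10,000 items, 5,000 distinct

start = time.perf_counter()
deduplicate(data)
print("set:  %.3fs" % (time.perf_counter() - start))

start = time.perf_counter()
dedup_with_list(data)
print("list: %.3fs" % (time.perf_counter() - start))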
Example
Remove duplicates from a list:
items = ["a", "b", "a", "c", "b", "d"]

Processing:
"a" -> output ["a"], seen {"a"}
"b" -> output ["a", "b"], seen {"a", "b"}
"a" -> skip
"c" -> output ["a", "b", "c"], seen {"a", "b", "c"}
"b" -> skip
"d" -> output ["a", "b", "c", "d"], seen {"a", "b", "c", "d"}Final result:
["a", "b", "c", "d"]Variations
Deduplicate without preserving order by using only a set and returning its elements.
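In Python this collapses to a one-liner; the iteration order of the result is then arbitrary with respect to the input.

unique_items = list(set(items))  # order of first occurrence is lost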
Deduplicate based on a derived key:
for item in items:
    key = key_of(item)           # e.g. a field, a normalized form, or a hash
    if key not in seen:
        seen.add(key)
        output.append(item)

Deduplicate within a sliding window by removing expired items from the set, as sketched below.
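A sketch of the sliding-window variant, under one reasonable definition: an item is emitted if no copy of it appeared among the previous window_size items. A deque tracks the window contents in order, and per-item counts decide when an expired item can really be forgotten.

from collections import deque

def deduplicate_window(items, window_size):
    counts = {}                  # item -> occurrences in the current window
    window = deque()             # the last window_size items, oldest first
    for item in items:
        if counts.get(item, 0) == 0:
            yield item           # no copy in the current window
        counts[item] = counts.get(item, 0) + 1
        window.append(item)
        if len(window) > window_size:
            expired = window.popleft()
            counts[expired] -= 1
            if counts[expired] == 0:
                del counts[expired]  # keeps memory bounded by the window size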
Approximate deduplication with Bloom filters for large-scale streaming.
Implementation Notes
Use a hash set when order of first occurrence matters. Use a sort-and-unique approach when order does not matter and memory is constrained.
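A sketch of the sort-and-unique alternative: after sorting, duplicates are adjacent, so no auxiliary set is needed.

def deduplicate_sorted(items):
    result = []
    for item in sorted(items):   # sorts a copy; sort in place to save memory
        if not result or item != result[-1]:
            result.append(item)
    return result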
Ensure the equality and hash function reflect the notion of duplicate. For example, normalize strings if case-insensitive deduplication is required.
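For instance, case-insensitive deduplication of strings via a normalized key, using the derived-key pattern from Variations:

def deduplicate_case_insensitive(strings):
    seen = set()
    output = []
    for s in strings:
        key = s.casefold()       # normalize for comparison, emit the original
        if key not in seen:
            seen.add(key)
            output.append(s)
    return output

print(deduplicate_case_insensitive(["Log", "log", "LOG", "warn"]))  # ['Log', 'warn']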
Avoid mutating items after insertion into the set: if a mutation changes an item's hash, the set can no longer find it, and later duplicates slip through.
For large objects, store identifiers or hashes instead of full objects when possible.
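A sketch for large records, assuming a caller-supplied serialize function that maps each record to bytes (hypothetical here): storing a fixed-size digest bounds the memory per key, at the cost of a negligible but nonzero hash-collision risk.

import hashlib

def deduplicate_large(records, serialize):
    seen = set()
    for record in records:
        digest = hashlib.sha256(serialize(record)).digest()  # 32 bytes per key
        if digest not in seen:
            seen.add(digest)
            yield record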
Common Mistakes
Using a list to track seen items, leading to quadratic time.
Using inconsistent equality, such as comparing normalized values but hashing raw values.
Forgetting that approximate methods may drop unique items.
Allowing the set to grow without bound in streaming systems.
Assuming deduplication preserves order when using unordered structures.