Skip to content

2.9 Deduplication

Deduplication removes repeated values while preserving a chosen notion of identity and, optionally, order.

Deduplication removes repeated values while preserving a chosen notion of identity and, optionally, order. The design depends on two decisions:

  • whether the input is sorted
  • whether the output must preserve original order

These decisions determine both the algorithm and its complexity.

Problem

Given an array a, produce a sequence that contains each value at most once.

Example:

[3, 1, 3, 2, 1, 4]

Possible outputs:

  • preserve first occurrence order: [3, 1, 2, 4]
  • preserve sorted order: [1, 2, 3, 4]
  • in-place on sorted input: modify the array prefix

Case 1: Sorted Input, In-Place

When equal values are adjacent, deduplication reduces to keeping the first value in each run.

func UniqueSorted(a []int) []int {
	if len(a) == 0 {
		return a
	}

	write := 1

	for read := 1; read < len(a); read++ {
		if a[read] != a[write-1] {
			a[write] = a[read]
			write++
		}
	}

	return a[:write]
}

Invariant

Before each iteration:

a[0:write] contains the unique values from a[0:read], in sorted order

The comparison a[read] != a[write-1] works because duplicates are adjacent.

Complexity

Time:  O(n)
Space: O(1)

Case 2: Unsorted Input, Preserve Order

Use a set to track which values have been seen.

func UniqueStable(a []int) []int {
	seen := make(map[int]struct{})
	write := 0

	for _, x := range a {
		if _, ok := seen[x]; !ok {
			seen[x] = struct{}{}
			a[write] = x
			write++
		}
	}

	return a[:write]
}

Invariant

a[0:write] contains the first occurrence of each distinct value in a, in original order
seen contains exactly those values

Each value is written once, at its first occurrence.

Complexity

Time:  O(n)
Space: O(n)

The extra space is required to remember which values have already appeared.

Case 3: Unsorted Input, Order Not Required

Sort first, then deduplicate.

func UniqueBySorting(a []int) []int {
	sort.Ints(a)
	return UniqueSorted(a)
}

Complexity

Time:  O(n log n)
Space: O(1) or O(log n) depending on sort

This approach is useful when sorting is already needed or when memory is constrained and stable order is not required.

Case 4: Deduplicate with Counts

Sometimes the goal is to keep counts rather than discard duplicates.

func Frequency(a []int) map[int]int {
	count := make(map[int]int)

	for _, x := range a {
		count[x]++
	}

	return count
}

This produces a histogram:

value -> number of occurrences

The counts can be used for further processing, such as filtering values that appear exactly once.

Case 5: Allow Limited Duplicates

Some problems allow at most k duplicates per value.

Example: keep at most two occurrences in a sorted array.

func KeepAtMostTwo(a []int) []int {
	if len(a) <= 2 {
		return a
	}

	write := 2

	for read := 2; read < len(a); read++ {
		if a[read] != a[write-2] {
			a[write] = a[read]
			write++
		}
	}

	return a[:write]
}

Invariant

a[0:write] contains each value at most twice, in sorted order

The comparison uses write-2 to ensure no more than two equal values are kept.

Case 6: Deduplicate by Key

For structured data, duplicates are defined by a key function.

type Item struct {
	ID   int
	Name string
}

func UniqueByID(a []Item) []Item {
	seen := make(map[int]struct{})
	write := 0

	for _, x := range a {
		if _, ok := seen[x.ID]; !ok {
			seen[x.ID] = struct{}{}
			a[write] = x
			write++
		}
	}

	return a[:write]
}

The identity of an element is determined by ID, not by the entire struct.

Streaming Deduplication

When data arrives incrementally, maintain a set of seen elements.

type Deduper struct {
	seen map[int]struct{}
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[int]struct{})}
}

func (d *Deduper) Add(x int) bool {
	if _, ok := d.seen[x]; ok {
		return false
	}
	d.seen[x] = struct{}{}
	return true
}

The method Add returns true only for the first occurrence.

Approximate Deduplication

When memory is limited and exact results are not required, use probabilistic structures such as Bloom filters.

These structures trade exactness for space. They can report false positives (claiming an element has been seen when it has not), but they do not report false negatives.

This is useful in large-scale systems where storing all seen elements is impractical.

Common Pitfalls

Using the sorted-array method on unsorted input fails because duplicates are not adjacent.

Using a map for deduplication without considering order loses ordering information if the result is built from the map keys.

Forgetting to slice the result after in-place deduplication exposes stale values in the tail.

Comparing entire structures when only a key matters can lead to incorrect deduplication.

Using floating-point values as keys without normalization may treat nearly equal values as distinct.

Design Pattern

Deduplication consists of three components:

  1. A definition of equality or identity
  2. A policy for ordering
  3. A storage mechanism to detect repeats

The algorithm follows directly from these choices.

Takeaway

Use in-place scanning for sorted data. Use a set when order must be preserved. Use sorting when order does not matter or when memory is constrained. Always make the notion of identity explicit before choosing the method.