# Array Deduplication

# Array Deduplication

Array deduplication removes repeated values from an array. The usual method keeps a set of values already seen and writes each new value once.

You use it when the array may contain repeated elements and downstream work needs only distinct values.

## Problem

Given an array $A$ of length $n$, produce an array containing each distinct value once.

For stable deduplication, preserve the first occurrence order.

## Algorithm

Use a set and a write pointer.

```id="array-deduplication"
deduplicate(A):
    seen = empty set
    k = 0

    for i from 0 to length(A) - 1:
        if A[i] not in seen:
            add A[i] to seen
            A[k] = A[i]
            k += 1

    return k
```

## Example

Let

$$
A = [8, 3, 5, 3, 8, 1]
$$

| step | value | seen after step | array prefix |
| ---- | ----: | --------------- | ------------ |
| 1    |     8 | {8}             | [8]          |
| 2    |     3 | {8, 3}          | [8, 3]       |
| 3    |     5 | {8, 3, 5}       | [8, 3, 5]    |
| 4    |     3 | {8, 3, 5}       | [8, 3, 5]    |
| 5    |     8 | {8, 3, 5}       | [8, 3, 5]    |
| 6    |     1 | {8, 3, 5, 1}    | [8, 3, 5, 1] |

Return new length $4$.

The deduplicated prefix is:

$$
[8, 3, 5, 1]
$$

## Correctness

The set `seen` contains exactly the values already written into the output prefix. When the scan reaches a new value, the algorithm writes it once and records it. When the scan reaches a value already in `seen`, the value has already been written, so skipping it preserves uniqueness.

Since the scan proceeds left to right, the first occurrence order is preserved.

## Complexity

| operation   | expected time |
| ----------- | ------------: |
| deduplicate |        $O(n)$ |

Space usage:

$$
O(m)
$$

where $m$ is the number of distinct values.

## When to Use

Array deduplication is appropriate when:

* repeated values should be removed
* first occurrence order matters
* extra memory for a set is acceptable
* expected linear time is preferred

It is less suitable when:

* values cannot be hashed
* strict constant extra space is required

For sorted arrays, deduplication can be done in-place with constant extra space.

## Implementation

```python id="array-deduplication-python"
def deduplicate(a):
    seen = set()
    k = 0

    for x in a:
        if x not in seen:
            seen.add(x)
            a[k] = x
            k += 1

    return k
```

```go id="array-deduplication-go"
func Deduplicate[T comparable](a []T) int {
    seen := make(map[T]struct{})
    k := 0

    for _, x := range a {
        if _, ok := seen[x]; ok {
            continue
        }

        seen[x] = struct{}{}
        a[k] = x
        k++
    }

    return k
}
```

