# 6.12 Top k Selection

# 6.12 Top k Selection

Top k selection finds the largest or smallest `k` elements without necessarily sorting the entire input. It is closely related to partial sorting, but the emphasis is selection rather than presentation. The caller may only need the set of best candidates, not their complete internal order.

## Problem

You have an array `A[0..n-1]` and an integer `k`. You want the largest `k` elements, or the smallest `k` elements, using less work than a full sort when `k` is small compared with `n`.

## Solution

For the largest `k` elements, keep a min-heap of size `k`.

```text id="b9q3ld"
top_k_largest(A, k):
  heap = empty min-heap

  for each x in A:
    if size(heap) < k:
      push x into heap
    else if x > min(heap):
      pop min(heap)
      push x into heap

  return elements of heap
```

The heap contains the current best `k` elements. The smallest element in the heap is the cutoff. If a new element is not larger than the cutoff, it cannot enter the top `k`.

For the smallest `k` elements, use the symmetric method with a max-heap.

```text id="o5n2wp"
top_k_smallest(A, k):
  heap = empty max-heap

  for each x in A:
    if size(heap) < k:
      push x into heap
    else if x < max(heap):
      pop max(heap)
      push x into heap

  return elements of heap
```

## Example

Let:

```text id="kgpvh3"
A = [7, 1, 5, 9, 2, 8, 3]
k = 3
```

For the largest `3` elements, scan left to right.

After reading `7, 1, 5`, the heap contains:

```text id="yhzh4q"
[1, 7, 5]
```

The cutoff is `1`. Read `9`. Since `9 > 1`, remove `1` and insert `9`:

```text id="ukxkrx"
[5, 7, 9]
```

Read `2`. Since `2 <= 5`, ignore it.

Read `8`. Since `8 > 5`, remove `5` and insert `8`:

```text id="sl7vms"
[7, 9, 8]
```

Read `3`. Since `3 <= 7`, ignore it.

The heap contains the largest three elements:

```text id="c4ce8d"
[7, 8, 9]
```

The heap order itself is not sorted output. It is only the internal representation of the selected set.

## Invariant

After processing the first `i` elements, the heap contains the largest `min(i, k)` elements among those `i` elements.

When the heap has fewer than `k` elements, every seen element belongs in the candidate set.

When the heap already has `k` elements, the root is the weakest selected candidate. For largest `k`, this root is the minimum selected element. A new element larger than that root must replace it. A new element less than or equal to that root cannot belong to the largest `k`, because there are already `k` selected elements at least as large as it.

This invariant gives the proof of correctness directly. At the end of the scan, the first `i` elements are the whole array, so the heap contains the largest `k` elements overall.

## Complexity

The heap size never exceeds `k`.

Each push or replacement costs:

```text id="vgycqg"
O(log k)
```

The scan processes `n` elements, so the running time is:

```text id="dnt4vi"
O(n log k)
```

The extra memory is:

```text id="5myvyc"
O(k)
```

If the selected elements must be returned in sorted order, add:

```text id="71w92z"
O(k log k)
```

to sort the final heap contents.

## Quickselect Alternative

When the input is an array and mutation is allowed, quickselect can find the boundary element more quickly in expectation.

For largest `k`, find the element that would appear at index `n - k` in sorted order. After partitioning around that element, the last `k` positions contain the largest `k` elements.

```text id="rqa195"
top_k_largest_partition(A, k):
  target = length(A) - k
  quickselect(A, 0, length(A), target)
  return A[target..length(A)-1]
```

Expected time:

```text id="lkl1an"
O(n)
```

Worst-case time without safeguards:

```text id="3ex5pr"
O(n^2)
```

If sorted output is required, sort the returned suffix:

```text id="jqvk0x"
O(n + k log k)
```

expected time.

## Heap vs Quickselect

| Method           | Mutates input | Streaming | Expected time |             Worst-case time |                Extra space |
| ---------------- | ------------: | --------: | ------------: | --------------------------: | -------------------------: |
| Heap of size `k` |            No |       Yes |  `O(n log k)` |                `O(n log k)` |                     `O(k)` |
| Quickselect      |           Yes |        No |        `O(n)` |                    `O(n^2)` | `O(1)` average stack aside |
| Full sort        |     Often yes |        No |  `O(n log n)` | `O(n log n)` for safe sorts |                     varies |

The heap method is usually preferred for streams, immutable inputs, and small `k`. Quickselect is usually preferred for large in-memory arrays when average speed matters and mutation is acceptable.

## Handling Ties

Top k selection needs a tie policy. If several elements compare equal at the boundary, there may be more than `k` equally valid answers.

For example:

```text id="v19p1o"
A = [10, 9, 9, 9, 8]
k = 2
```

The largest two elements could be:

```text id="fdu6bl"
[10, 9]
```

but there are three possible `9` elements. If records have identities, the selection rule should specify which equal-key records are retained.

Common policies include:

```text id="0c6hbn"
arbitrary ties
stable ties by input order
secondary-key ties
return all elements tied with the kth value
```

The last policy can return more than `k` elements.

## Implementation Notes

For largest `k`, use a min-heap. For smallest `k`, use a max-heap. This is the most common source of confusion.

Handle boundary cases before running the main algorithm:

```text id="1i8hbp"
if k <= 0:
  return empty result

if k >= length(A):
  return all elements, optionally sorted
```

When `k` is very small, the heap method is simple and robust. When `k` is close to `n`, selecting the excluded side may be cheaper. For example, finding the largest `n - 3` elements is equivalent to excluding the smallest `3`.

## Common Bugs

A common bug is returning the heap array as though it were sorted. A heap only guarantees that the root is the minimum or maximum. The remaining elements are partially ordered.

Another bug is using the wrong comparison at the cutoff. For largest `k`, replace only when `x > min(heap)`, unless the tie policy says otherwise.

A third bug is ignoring duplicates. Top k by value and top k by record identity can produce different expectations.

A fourth bug is using quickselect when the caller expects the input order to remain unchanged. Quickselect partitions in place.

## Takeaway

Top k selection is a selection problem before it is a sorting problem. Use a fixed-size heap for streaming or non-mutating code. Use quickselect for in-place array selection when expected linear time is more important than stable ordering.

