Skip to content

6.12 Top k Selection

Find the largest or smallest k elements without sorting the full input, using a heap or partition-based approach.

Top k selection finds the largest or smallest k elements without necessarily sorting the entire input. It is closely related to partial sorting, but the emphasis is selection rather than presentation. The caller may only need the set of best candidates, not their complete internal order.

Problem

You have an array A[0..n-1] and an integer k. You want the largest k elements, or the smallest k elements, using less work than a full sort when k is small compared with n.

Solution

For the largest k elements, keep a min-heap of size k.

top_k_largest(A, k):
  heap = empty min-heap

  for each x in A:
    if size(heap) < k:
      push x into heap
    else if x > min(heap):
      pop min(heap)
      push x into heap

  return elements of heap

The heap contains the current best k elements. The smallest element in the heap is the cutoff. If a new element is not larger than the cutoff, it cannot enter the top k.

For the smallest k elements, use the symmetric method with a max-heap.

top_k_smallest(A, k):
  heap = empty max-heap

  for each x in A:
    if size(heap) < k:
      push x into heap
    else if x < max(heap):
      pop max(heap)
      push x into heap

  return elements of heap

Example

Let:

A = [7, 1, 5, 9, 2, 8, 3]
k = 3

For the largest 3 elements, scan left to right.

After reading 7, 1, 5, the heap contains:

[1, 7, 5]

The cutoff is 1. Read 9. Since 9 > 1, remove 1 and insert 9:

[5, 7, 9]

Read 2. Since 2 <= 5, ignore it.

Read 8. Since 8 > 5, remove 5 and insert 8:

[7, 9, 8]

Read 3. Since 3 <= 7, ignore it.

The heap contains the largest three elements:

[7, 8, 9]

The heap order itself is not sorted output. It is only the internal representation of the selected set.

Invariant

After processing the first i elements, the heap contains the largest min(i, k) elements among those i elements.

When the heap has fewer than k elements, every seen element belongs in the candidate set.

When the heap already has k elements, the root is the weakest selected candidate. For largest k, this root is the minimum selected element. A new element larger than that root must replace it. A new element less than or equal to that root cannot belong to the largest k, because there are already k selected elements at least as large as it.

This invariant gives the proof of correctness directly. At the end of the scan, the first i elements are the whole array, so the heap contains the largest k elements overall.

Complexity

The heap size never exceeds k.

Each push or replacement costs:

O(log k)

The scan processes n elements, so the running time is:

O(n log k)

The extra memory is:

O(k)

If the selected elements must be returned in sorted order, add:

O(k log k)

to sort the final heap contents.

Quickselect Alternative

When the input is an array and mutation is allowed, quickselect can find the boundary element more quickly in expectation.

For largest k, find the element that would appear at index n - k in sorted order. After partitioning around that element, the last k positions contain the largest k elements.

top_k_largest_partition(A, k):
  target = length(A) - k
  quickselect(A, 0, length(A), target)
  return A[target..length(A)-1]

Expected time:

O(n)

Worst-case time without safeguards:

O(n^2)

If sorted output is required, sort the returned suffix:

O(n + k log k)

expected time.

Heap vs Quickselect

MethodMutates inputStreamingExpected timeWorst-case timeExtra space
Heap of size kNoYesO(n log k)O(n log k)O(k)
QuickselectYesNoO(n)O(n^2)O(1) average stack aside
Full sortOften yesNoO(n log n)O(n log n) for safe sortsvaries

The heap method is usually preferred for streams, immutable inputs, and small k. Quickselect is usually preferred for large in-memory arrays when average speed matters and mutation is acceptable.

Handling Ties

Top k selection needs a tie policy. If several elements compare equal at the boundary, there may be more than k equally valid answers.

For example:

A = [10, 9, 9, 9, 8]
k = 2

The largest two elements could be:

[10, 9]

but there are three possible 9 elements. If records have identities, the selection rule should specify which equal-key records are retained.

Common policies include:

arbitrary ties
stable ties by input order
secondary-key ties
return all elements tied with the kth value

The last policy can return more than k elements.

Implementation Notes

For largest k, use a min-heap. For smallest k, use a max-heap. This is the most common source of confusion.

Handle boundary cases before running the main algorithm:

if k <= 0:
  return empty result

if k >= length(A):
  return all elements, optionally sorted

When k is very small, the heap method is simple and robust. When k is close to n, selecting the excluded side may be cheaper. For example, finding the largest n - 3 elements is equivalent to excluding the smallest 3.

Common Bugs

A common bug is returning the heap array as though it were sorted. A heap only guarantees that the root is the minimum or maximum. The remaining elements are partially ordered.

Another bug is using the wrong comparison at the cutoff. For largest k, replace only when x > min(heap), unless the tie policy says otherwise.

A third bug is ignoring duplicates. Top k by value and top k by record identity can produce different expectations.

A fourth bug is using quickselect when the caller expects the input order to remain unchanged. Quickselect partitions in place.

Takeaway

Top k selection is a selection problem before it is a sorting problem. Use a fixed-size heap for streaming or non-mutating code. Use quickselect for in-place array selection when expected linear time is more important than stable ordering.