Sample Sort Selection

Sample Sort Selection uses random sampling to approximate the position of the k-th element, then narrows the search to a smaller subarray.

The idea is to avoid full partitioning of the entire array by first estimating where the target lies.

Problem

Given an array $A$ of length $n$ and an index $k$ , find the k-th smallest element.

Algorithm

Take a random sample, sort it, and use it to estimate bounds around the desired rank.

sample_sort_selection(A, k):
    s = choose sample size
    S = random sample of size s from A
    sort S

    estimate position of k in S:
        i = floor(k * s / n)

    choose bounds:
        low = S[i - delta]
        high = S[i + delta]

    B = [x in A where low <= x <= high]

    return quickselect(B, k - count(x < low))

The parameter $\delta$ controls how wide the interval is.

Key Idea

The sample approximates the distribution of the full array. The rank of the target in the sample gives an estimate of its value in the full array.

By selecting a range around this estimate, the algorithm reduces the problem size from $n$ to a much smaller subset $B$ .

Example

Let:

A = [7, 2, 9, 4, 3]

Suppose the sample is:

[2, 4, 9]

and we want $k = 2$ . The estimate suggests the target lies near value $4$ , so the algorithm narrows the search around that region and applies Quickselect only on that subset.

Correctness

If the sample is representative, the estimated rank position is close to the true rank. With high probability, the true k-th element lies within the selected bounds.

Even if the bounds are slightly off, the refinement step still includes the correct element with high probability when the sampling parameters are chosen properly.

Complexity

phase	cost
sampling	$O(s)$
sorting sample	$O(s \log s)$
filtering	$O(n)$
final selection	$O(	B	)$

With appropriate sample size, expected complexity is:

O(n)

but with smaller constant factors compared to full Quickselect in some cases.

When to Use

Use Sample Sort Selection when:

input is very large
approximate narrowing is effective
you want to reduce partition cost
randomized algorithms are acceptable

It is often used in parallel and external-memory settings.

Implementation

import random

def sample_sort_selection(a, k, sample_size=50):
    n = len(a)
    s = min(sample_size, n)

    sample = random.sample(a, s)
    sample.sort()

    i = int(k * s / n)

    delta = max(1, s // 10)

    low = sample[max(0, i - delta)]
    high = sample[min(s - 1, i + delta)]

    below = [x for x in a if x < low]
    middle = [x for x in a if low <= x <= high]

    if k < len(below):
        return sample_sort_selection(below, k, sample_size)

    if k < len(below) + len(middle):
        return quickselect(middle, k - len(below))

    above = [x for x in a if x > high]
    return sample_sort_selection(above, k - len(below) - len(middle), sample_size)

func SampleSortSelection(a []int, k int) int {
	// simplified version using random pivot fallback
	return Quickselect(a, k)
}