Use sampling to approximate order statistics and refine selection efficiently.
Sample Sort Selection uses random sampling to approximate the position of the k-th element, then narrows the search to a smaller subarray.
The idea is to avoid full partitioning of the entire array by first estimating where the target lies.
Problem
Given an array of length and an index , find the k-th smallest element.
Algorithm
Take a random sample, sort it, and use it to estimate bounds around the desired rank.
sample_sort_selection(A, k):
s = choose sample size
S = random sample of size s from A
sort S
estimate position of k in S:
i = floor(k * s / n)
choose bounds:
low = S[i - delta]
high = S[i + delta]
B = [x in A where low <= x <= high]
return quickselect(B, k - count(x < low))The parameter controls how wide the interval is.
Key Idea
The sample approximates the distribution of the full array. The rank of the target in the sample gives an estimate of its value in the full array.
By selecting a range around this estimate, the algorithm reduces the problem size from to a much smaller subset .
Example
Let:
Suppose the sample is:
and we want . The estimate suggests the target lies near value , so the algorithm narrows the search around that region and applies Quickselect only on that subset.
Correctness
If the sample is representative, the estimated rank position is close to the true rank. With high probability, the true k-th element lies within the selected bounds.
Even if the bounds are slightly off, the refinement step still includes the correct element with high probability when the sampling parameters are chosen properly.
Complexity
| phase | cost | ||
|---|---|---|---|
| sampling | |||
| sorting sample | |||
| filtering | |||
| final selection | $O( | B | )$ |
With appropriate sample size, expected complexity is:
but with smaller constant factors compared to full Quickselect in some cases.
When to Use
Use Sample Sort Selection when:
- input is very large
- approximate narrowing is effective
- you want to reduce partition cost
- randomized algorithms are acceptable
It is often used in parallel and external-memory settings.
Implementation
import random
def sample_sort_selection(a, k, sample_size=50):
n = len(a)
s = min(sample_size, n)
sample = random.sample(a, s)
sample.sort()
i = int(k * s / n)
delta = max(1, s // 10)
low = sample[max(0, i - delta)]
high = sample[min(s - 1, i + delta)]
below = [x for x in a if x < low]
middle = [x for x in a if low <= x <= high]
if k < len(below):
return sample_sort_selection(below, k, sample_size)
if k < len(below) + len(middle):
return quickselect(middle, k - len(below))
above = [x for x in a if x > high]
return sample_sort_selection(above, k - len(below) - len(middle), sample_size)func SampleSortSelection(a []int, k int) int {
// simplified version using random pivot fallback
return Quickselect(a, k)
}