In-place parallel super scalar samplesort for high performance comparison sorting on modern multicore machines.
IPS4o stands for in-place parallel super scalar samplesort. It is a high performance comparison sorting algorithm designed for modern multicore CPUs. It combines samplesort partitioning, cache efficient block movement, and parallel execution while keeping extra memory usage small.
The algorithm is useful when comparison sorting is required but ordinary quicksort or mergesort does not make full use of hardware parallelism.
Problem
Given an array of comparable elements, sort the array in increasing order using parallel partitioning and recursive sorting.
Idea
IPS4o follows the samplesort pattern:
- choose a sample from the input
- sort the sample
- select splitters from the sample
- partition the input into buckets using the splitters
- recursively sort each bucket in parallel
Unlike simple samplesort, IPS4o performs partitioning in place using small buffers and block permutation.
Splitters
If the algorithm uses buckets, it selects splitters:
Each element is assigned to a bucket by finding the interval containing it:
with boundary cases for the first and last bucket.
Algorithm
ips4o(A):
if len(A) <= base_case:
comparison_sort(A)
return
sample = choose_sample(A)
sort(sample)
splitters = choose_splitters(sample)
classify elements into buckets using splitters
compute bucket sizes
permute blocks in place so each bucket is contiguous
parallel for each bucket:
ips4o(bucket)Example
Sort:
Suppose splitters are:
Then the buckets are:
| bucket | range | values |
|---|---|---|
| 0 | 12, 27, 8 | |
| 1 | 43, 35, 50 | |
| 2 | 91, 64 |
After placing buckets contiguously and recursively sorting:
Concatenating buckets gives:
Correctness
The splitters divide the value order into disjoint intervals. Every element is placed into exactly one interval bucket. All elements in an earlier bucket are less than or equal to all elements in a later bucket.
Recursive sorting orders elements inside each bucket. Since bucket order is already correct globally, concatenating the sorted buckets yields a sorted array.
Complexity
Let:
- be the number of elements
- be the number of worker threads
Expected work:
Expected parallel time, ignoring synchronization overhead:
Extra memory:
where is the number of buckets and is the block buffer size.
Properties
| property | value |
|---|---|
| stable | no |
| in place | mostly |
| comparison based | yes |
| parallel | yes |
| cache efficient | yes |
When to Use
Use IPS4o when sorting large arrays on multicore machines and comparison order is required. It is especially relevant when the key type does not admit cheap radix extraction or when the comparator defines a custom order.
Avoid it for small arrays, single-threaded teaching code, or cases where a simpler standard library sort is already fast enough.
Implementation Sketch
from concurrent.futures import ThreadPoolExecutor
def sample_sort(a, workers=4, base_case=32):
if len(a) <= base_case:
return sorted(a)
sample = sorted(a[::max(1, len(a) // (workers * 4))])
splitters = []
for i in range(1, workers):
pos = i * len(sample) // workers
splitters.append(sample[pos])
buckets = [[] for _ in range(workers)]
for x in a:
b = 0
while b < len(splitters) and x >= splitters[b]:
b += 1
buckets[b].append(x)
with ThreadPoolExecutor(max_workers=workers) as pool:
parts = list(pool.map(lambda b: sample_sort(b, workers, base_case), buckets))
out = []
for part in parts:
out.extend(part)
return out