IPS4o

IPS4o stands for in-place parallel super scalar samplesort. It is a high performance comparison sorting algorithm designed for modern multicore CPUs. It combines samplesort partitioning, cache efficient block movement, and parallel execution while keeping extra memory usage small.

The algorithm is useful when comparison sorting is required but ordinary quicksort or mergesort does not make full use of hardware parallelism.

Problem

Given an array $A$ of $n$ comparable elements, sort the array in increasing order using parallel partitioning and recursive sorting.

Idea

IPS4o follows the samplesort pattern:

choose a sample from the input
sort the sample
select splitters from the sample
partition the input into buckets using the splitters
recursively sort each bucket in parallel

Unlike simple samplesort, IPS4o performs partitioning in place using small buffers and block permutation.

Splitters

If the algorithm uses $k$ buckets, it selects $k - 1$ splitters:

s_1 < s_2 < \cdots < s_{k-1}

Each element $x$ is assigned to a bucket by finding the interval containing it:

B_i = {x : s_i \le x < s_{i+1}}

with boundary cases for the first and last bucket.

Algorithm

ips4o(A):
    if len(A) <= base_case:
        comparison_sort(A)
        return

    sample = choose_sample(A)
    sort(sample)
    splitters = choose_splitters(sample)

    classify elements into buckets using splitters
    compute bucket sizes
    permute blocks in place so each bucket is contiguous

    parallel for each bucket:
        ips4o(bucket)

Example

Sort:

A = [43, 12, 91, 27, 35, 8, 64, 50]

Suppose splitters are:

[30, 60]

Then the buckets are:

bucket	range	values
0	$x < 30$	12, 27, 8
1	$30 \le x < 60$	43, 35, 50
2	$x \ge 60$	91, 64

After placing buckets contiguously and recursively sorting:

[8, 12, 27],\ [35, 43, 50],\ [64, 91]

Concatenating buckets gives:

[8, 12, 27, 35, 43, 50, 64, 91]

Correctness

The splitters divide the value order into disjoint intervals. Every element is placed into exactly one interval bucket. All elements in an earlier bucket are less than or equal to all elements in a later bucket.

Recursive sorting orders elements inside each bucket. Since bucket order is already correct globally, concatenating the sorted buckets yields a sorted array.

Complexity

Let:

$n$ be the number of elements
$p$ be the number of worker threads

Expected work:

O(n \log n)

Expected parallel time, ignoring synchronization overhead:

O\left(\frac{n \log n}{p}\right)

Extra memory:

O(kp + b)

where $k$ is the number of buckets and $b$ is the block buffer size.

Properties

property	value
stable	no
in place	mostly
comparison based	yes
parallel	yes
cache efficient	yes

When to Use

Use IPS4o when sorting large arrays on multicore machines and comparison order is required. It is especially relevant when the key type does not admit cheap radix extraction or when the comparator defines a custom order.

Avoid it for small arrays, single-threaded teaching code, or cases where a simpler standard library sort is already fast enough.

Implementation Sketch

from concurrent.futures import ThreadPoolExecutor

def sample_sort(a, workers=4, base_case=32):
    if len(a) <= base_case:
        return sorted(a)

    sample = sorted(a[::max(1, len(a) // (workers * 4))])
    splitters = []

    for i in range(1, workers):
        pos = i * len(sample) // workers
        splitters.append(sample[pos])

    buckets = [[] for _ in range(workers)]

    for x in a:
        b = 0
        while b < len(splitters) and x >= splitters[b]:
            b += 1
        buckets[b].append(x)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda b: sample_sort(b, workers, base_case), buckets))

    out = []
    for part in parts:
        out.extend(part)

    return out