Skip to content

IPS4o

In-place parallel super scalar samplesort for high performance comparison sorting on modern multicore machines.

IPS4o stands for in-place parallel super scalar samplesort. It is a high performance comparison sorting algorithm designed for modern multicore CPUs. It combines samplesort partitioning, cache efficient block movement, and parallel execution while keeping extra memory usage small.

The algorithm is useful when comparison sorting is required but ordinary quicksort or mergesort does not make full use of hardware parallelism.

Problem

Given an array AA of nn comparable elements, sort the array in increasing order using parallel partitioning and recursive sorting.

Idea

IPS4o follows the samplesort pattern:

  1. choose a sample from the input
  2. sort the sample
  3. select splitters from the sample
  4. partition the input into buckets using the splitters
  5. recursively sort each bucket in parallel

Unlike simple samplesort, IPS4o performs partitioning in place using small buffers and block permutation.

Splitters

If the algorithm uses kk buckets, it selects k1k - 1 splitters:

s1<s2<<sk1 s_1 < s_2 < \cdots < s_{k-1}

Each element xx is assigned to a bucket by finding the interval containing it:

Bi=x:six<si+1 B_i = {x : s_i \le x < s_{i+1}}

with boundary cases for the first and last bucket.

Algorithm

ips4o(A):
    if len(A) <= base_case:
        comparison_sort(A)
        return

    sample = choose_sample(A)
    sort(sample)
    splitters = choose_splitters(sample)

    classify elements into buckets using splitters
    compute bucket sizes
    permute blocks in place so each bucket is contiguous

    parallel for each bucket:
        ips4o(bucket)

Example

Sort:

A=[43,12,91,27,35,8,64,50] A = [43, 12, 91, 27, 35, 8, 64, 50]

Suppose splitters are:

[30,60] [30, 60]

Then the buckets are:

bucketrangevalues
0x<30x < 3012, 27, 8
130x<6030 \le x < 6043, 35, 50
2x60x \ge 6091, 64

After placing buckets contiguously and recursively sorting:

[8,12,27], [35,43,50], [64,91] [8, 12, 27],\ [35, 43, 50],\ [64, 91]

Concatenating buckets gives:

[8,12,27,35,43,50,64,91] [8, 12, 27, 35, 43, 50, 64, 91]

Correctness

The splitters divide the value order into disjoint intervals. Every element is placed into exactly one interval bucket. All elements in an earlier bucket are less than or equal to all elements in a later bucket.

Recursive sorting orders elements inside each bucket. Since bucket order is already correct globally, concatenating the sorted buckets yields a sorted array.

Complexity

Let:

  • nn be the number of elements
  • pp be the number of worker threads

Expected work:

O(nlogn) O(n \log n)

Expected parallel time, ignoring synchronization overhead:

O(nlognp) O\left(\frac{n \log n}{p}\right)

Extra memory:

O(kp+b) O(kp + b)

where kk is the number of buckets and bb is the block buffer size.

Properties

propertyvalue
stableno
in placemostly
comparison basedyes
parallelyes
cache efficientyes

When to Use

Use IPS4o when sorting large arrays on multicore machines and comparison order is required. It is especially relevant when the key type does not admit cheap radix extraction or when the comparator defines a custom order.

Avoid it for small arrays, single-threaded teaching code, or cases where a simpler standard library sort is already fast enough.

Implementation Sketch

from concurrent.futures import ThreadPoolExecutor

def sample_sort(a, workers=4, base_case=32):
    if len(a) <= base_case:
        return sorted(a)

    sample = sorted(a[::max(1, len(a) // (workers * 4))])
    splitters = []

    for i in range(1, workers):
        pos = i * len(sample) // workers
        splitters.append(sample[pos])

    buckets = [[] for _ in range(workers)]

    for x in a:
        b = 0
        while b < len(splitters) and x >= splitters[b]:
            b += 1
        buckets[b].append(x)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda b: sample_sort(b, workers, base_case), buckets))

    out = []
    for part in parts:
        out.extend(part)

    return out