Distribute sorting work across multiple processors to reduce wall-clock time, with analysis of total work, span, communication, and synchronization.
Parallel sorting distributes work across multiple processing units. The goal is to reduce wall-clock time while preserving correctness and scalability. The cost model includes total work, parallel time (span), communication, and synchronization.
Problem
You have an array A[0..n-1] and p processors. You want to sort faster than a single-threaded algorithm by exploiting parallelism.
Models
Two common models guide design:
- Shared-memory model: threads access a common address space
- Distributed-memory model: data is partitioned across nodes; communication uses messages
Performance is described by:
- work T1 = total operations
- span T∞ = critical path length
- time Tp = parallel time on p processors

A scalable algorithm keeps T1 = O(n log n) and reduces T∞ while limiting communication.
Parallel Merge Sort
Split the array, sort halves in parallel, then merge in parallel.
parallel_merge_sort(A):
    if small:
        sort sequentially
        return
    split A into L and R
    spawn parallel_merge_sort(L)
    spawn parallel_merge_sort(R)
    sync
    parallel_merge(L, R, A)

Parallel merge divides one input by selecting a pivot and locating its rank in the other input via binary search, then recurses on the subproblems.
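A minimal shared-memory sketch of this scheme in Python (the `depth` spawn budget and the sequential `merge` are illustrative simplifications; note that CPython's GIL serializes these threads, so the structure, not the speedup, is the point):

```python
import threading

def merge(left, right):
    # Sequential two-way merge; a span-efficient version would split
    # the work by binary-searching pivot ranks, as described above.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out

def parallel_merge_sort(a, depth=2):
    # Cut off to sequential sorting for small inputs or when the
    # spawn budget (depth) is exhausted.
    if len(a) <= 1 or depth == 0:
        return sorted(a)
    mid = len(a) // 2
    result = {}
    # spawn: sort the left half in a separate thread
    t = threading.Thread(
        target=lambda: result.update(L=parallel_merge_sort(a[:mid], depth - 1)))
    t.start()
    right = parallel_merge_sort(a[mid:], depth - 1)
    t.join()  # sync: wait for the spawned half before merging
    return merge(result['L'], right)
```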
Properties:
- work T1 = O(n log n)
- span T∞ = O(log^2 n) with parallel merge
- space = O(n)

Parallel Quick Sort
Partition in parallel, then recurse on both sides.
parallel_quick_sort(A, lo, hi):
    if hi - lo <= threshold:
        sort sequentially
        return
    m = parallel_partition(A, lo, hi)
    spawn parallel_quick_sort(A, lo, m)
    spawn parallel_quick_sort(A, m+1, hi)
    sync

Parallel partition computes a boolean flag per element (≤ pivot), performs a parallel prefix sum to compute target indices, then scatters elements into place.
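A sequential Python sketch of this flag / prefix-sum / scatter scheme (the name `prefix_partition` is illustrative; in a real implementation each of the three loops runs as a parallel pass):

```python
def prefix_partition(a, pivot):
    # Sequential sketch of the parallel prefix-sum partition.
    n = len(a)
    # 1. flag pass: flags[i] = 1 if a[i] goes to the left side
    flags = [1 if x <= pivot else 0 for x in a]
    # 2. exclusive prefix sums give each element its target slot
    left = [0] * n
    right = [0] * n
    num_left = num_right = 0
    for i in range(n):
        left[i], right[i] = num_left, num_right
        num_left += flags[i]
        num_right += 1 - flags[i]
    # 3. scatter: every write lands in a distinct slot, so the
    # parallel version is race-free by construction
    out = [None] * n
    for i, x in enumerate(a):
        if flags[i]:
            out[left[i]] = x
        else:
            out[num_left + right[i]] = x  # right block starts after the left block
    return out, num_left  # partitioned array and split point
```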
Properties:
- work T1 = O(n log n) expected
- span T∞ = O(log^2 n) expected: O(log n) per prefix-sum partition times O(log n) expected recursion depth
- space = O(n) for temporary buffers

Performance depends on pivot quality. Randomized or median-of-three pivots reduce imbalance.
Sample Sort (Distributed)
Sample sort scales well across distributed-memory nodes because most work is local and data is exchanged only once.
- Each node sorts its local chunk
- Nodes select samples and compute global splitters
- Partition data by splitters and exchange buckets
- Each node sorts its received bucket
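The steps above can be simulated in Python on one machine (the `chunks` lists stand in for per-node data and the all-to-all exchange, and `oversample` is an illustrative tuning knob):

```python
import bisect
import random

def sample_sort(chunks, oversample=4):
    p = len(chunks)
    # 1. each "node" sorts its local chunk
    chunks = [sorted(c) for c in chunks]
    # 2. each node contributes samples; choose p-1 global splitters
    samples = sorted(x for c in chunks
                     for x in random.sample(c, min(len(c), oversample)))
    step = max(1, len(samples) // p)
    splitters = samples[step::step][:p - 1]
    # 3. route every element to the bucket owning its splitter range
    # (in a real system this is the all-to-all exchange)
    buckets = [[] for _ in range(p)]
    for c in chunks:
        for x in c:
            buckets[bisect.bisect_right(splitters, x)].append(x)
    # 4. each node sorts its received bucket; because buckets cover
    # disjoint key ranges, their concatenation is globally sorted
    return [x for b in buckets for x in sorted(b)]
```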
sample_sort(distributed A):
    local_sort(A_i)
    samples = pick from each A_i
    splitters = global_select(samples)
    buckets = partition(A_i, splitters)
    all_to_all_exchange(buckets)
    local_sort(received bucket)

Properties:
- balanced buckets → near O((n/p) log (n/p)) local work
- communication dominates if buckets are skewed

Bitonic Sort (Network-Based)
Bitonic sort uses a fixed sequence of compare-exchange operations. It is suitable for SIMD and GPU execution.
for size = 2, 4, 8, ...:
    build bitonic sequences
    perform bitonic merges with compare-exchange

Properties:
- work T1 = O(n log^2 n)
- span T∞ = O(log^2 n)

It performs more work than comparison-optimal sorts but has a regular structure and no branching, which suits hardware.
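The fixed compare-exchange schedule can be sketched in Python (a sequential rendering of the standard network; all compare-exchanges at a given stage are independent, which is what makes each stage one data-parallel step):

```python
def bitonic_sort(a):
    # In-place bitonic sort; length must be a power of two.
    n = len(a)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    size = 2
    while size <= n:              # build bitonic sequences of this size
        stride = size // 2
        while stride > 0:         # bitonic merge via compare-exchange
            for i in range(n):
                partner = i ^ stride
                up = (i & size) == 0   # direction of this subsequence
                # data-independent schedule: which pairs are compared
                # never depends on the values, only on the indices
                if partner > i and (a[i] > a[partner]) == up:
                    a[i], a[partner] = a[partner], a[i]
            stride //= 2
        size *= 2
    return a
```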
Radix Sort on GPUs
Radix sort parallelizes well by processing digits with counting and prefix sums.
Per pass:
    count digits in parallel
    compute prefix sums
    scatter elements to output positions

Properties:
- work T1 = O(d(n + b)) for d digit passes over b buckets
- span T∞ = O(d log n) for the prefix sums

High throughput comes from coalesced memory access and uniform control flow.
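The per-pass structure can be sketched sequentially in Python (assumes non-negative integer keys; on a GPU the count, prefix-sum, and scatter steps each run as a parallel kernel):

```python
def radix_sort(a, bits=4):
    # LSD radix sort over `bits`-bit digits.
    if not a:
        return a
    mask = (1 << bits) - 1
    shift = 0
    max_val = max(a)
    while max_val >> shift:
        # 1. count digit occurrences (a parallel histogram on a GPU)
        counts = [0] * (mask + 1)
        for x in a:
            counts[(x >> shift) & mask] += 1
        # 2. exclusive prefix sum gives each digit's first output slot
        total = 0
        for d in range(mask + 1):
            counts[d], total = total, total + counts[d]
        # 3. scatter: stable placement by digit value
        out = [0] * len(a)
        for x in a:
            d = (x >> shift) & mask
            out[counts[d]] = x
            counts[d] += 1
        a = out
        shift += bits
    return a
```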
Correctness
Parallel algorithms preserve the same invariants as sequential versions:
- Partition ensures elements are on correct sides of the pivot
- Merge ensures the output is globally ordered
- Stable variants maintain tie-breaking rules
Parallel execution must avoid races:
- Writes to shared arrays must be disjoint or synchronized
- Prefix sums provide deterministic placement indices
- Barriers (sync) ensure dependencies are respected
Load Balancing
Imbalance reduces speedup.
Strategies:
- randomized pivots
- oversampling for splitter selection
- dynamic task scheduling (work stealing)
- chunking thresholds for small tasks

Work-stealing schedulers keep processors busy by redistributing tasks at runtime.
Memory and Communication
Shared-memory:
- avoid false sharing
- align buffers
- prefer contiguous access

Distributed-memory:
- minimize all-to-all volume
- batch messages
- compress keys when possible

Communication often dominates at scale.
Complexity and Speedup
Ideal speedup:
speedup = T1 / Tp ≈ p

In practice:
Tp ≈ T1 / p + overhead

Overheads include synchronization, communication, and cache effects.
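Plugging numbers into this model shows how quickly overhead erodes speedup (a toy calculation; the single `overhead` term lumps synchronization, communication, and cache effects together):

```python
def estimated_speedup(t1, p, overhead):
    # Tp ≈ T1/p + overhead; speedup = T1 / Tp
    tp = t1 / p + overhead
    return t1 / tp
```

With T1 = 100 and p = 10, zero overhead gives the ideal 10x, while an overhead equal to just one tenth of T1 cuts the speedup in half.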
When to Use
- Large datasets where single-thread time is too high
- Systems with multiple cores or distributed nodes
- Workloads tolerant of additional memory and coordination costs
Avoid parallelization when:
- Input size is small
- Memory bandwidth is the bottleneck
- Synchronization cost dominates
Common Bugs
- Data races during partition or scatter
- Incorrect prefix sums leading to overwritten elements
- Imbalanced partitions causing idle processors
- Excessive synchronization reducing parallel gains
Takeaway
Parallel sorting reduces elapsed time by distributing work, but it introduces coordination costs. Effective designs keep work optimal, reduce span, balance load, and control communication.