Distribute sorting work across multiple processors to reduce wall-clock time, with analysis of total work, span, communication, and synchronization.
Parallel sorting distributes work across multiple processing units. The goal is to reduce wall-clock time while preserving correctness and scalability. The cost model includes total work, parallel time (span), communication, and synchronization.
Problem
You have an array A[0..n-1] and p processors. You want to sort faster than a single-threaded algorithm by exploiting parallelism.
Models
Two common models guide design:
- Shared-memory model: threads access a common address space
- Distributed-memory model: data is partitioned across nodes; communication uses messages
Performance is described by:
- work T1 = total operations
- span T∞ = critical path length
- time Tp = parallel time on p processors

A scalable algorithm keeps T1 = O(n log n) and reduces T∞ while limiting communication.
Parallel Merge Sort
Split the array, sort halves in parallel, then merge in parallel.
parallel_merge_sort(A):
    if small:
        sort sequentially
        return
    split A into L and R
    spawn parallel_merge_sort(L)
    spawn parallel_merge_sort(R)
    sync
    parallel_merge(L, R, A)

Parallel merge divides one input by selecting a pivot and locating its rank in the other input via binary search, then recurses on the subproblems.
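A minimal shared-memory sketch of this scheme in Python (the `depth` spawn budget and the sequential `merge` are illustrative simplifications; note that CPython's GIL serializes these threads, so the structure, not the speedup, is the point):

```python
import threading

def merge(left, right):
    # Sequential two-way merge; a span-efficient version would split
    # the work by binary-searching pivot ranks, as described above.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out

def parallel_merge_sort(a, depth=2):
    # Cut off to sequential sorting for small inputs or when the
    # spawn budget (depth) is exhausted.
    if len(a) <= 1 or depth == 0:
        return sorted(a)
    mid = len(a) // 2
    result = {}
    # spawn: sort the left half in a separate thread
    t = threading.Thread(
        target=lambda: result.update(L=parallel_merge_sort(a[:mid], depth - 1)))
    t.start()
    right = parallel_merge_sort(a[mid:], depth - 1)
    t.join()  # sync: wait for the spawned half before merging
    return merge(result['L'], right)
```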
Properties:
- work T1 = O(n log n)
- span T∞ = O(log^2 n) with parallel merge
- space = O(n)

Parallel Quick Sort
Partition in parallel, then recurse on both sides.
parallel_quick_sort(A, lo, hi):
    if hi - lo <= threshold:
        sort sequentially
        return
    m = parallel_partition(A, lo, hi)
    spawn parallel_quick_sort(A, lo, m)
    spawn parallel_quick_sort(A, m+1, hi)
    sync

Parallel partition computes a boolean flag per element (≤ pivot), performs a parallel prefix sum to compute target indices, then scatters elements into place.
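A sequential Python sketch of this flag / prefix-sum / scatter scheme (the name `prefix_partition` is illustrative; in a real implementation each of the three loops runs as a parallel pass):

```python
def prefix_partition(a, pivot):
    # Sequential sketch of the parallel prefix-sum partition.
    n = len(a)
    # 1. flag pass: flags[i] = 1 if a[i] goes to the left side
    flags = [1 if x <= pivot else 0 for x in a]
    # 2. exclusive prefix sums give each element its target slot
    left = [0] * n
    right = [0] * n
    num_left = num_right = 0
    for i in range(n):
        left[i], right[i] = num_left, num_right
        num_left += flags[i]
        num_right += 1 - flags[i]
    # 3. scatter: every write lands in a distinct slot, so the
    # parallel version is race-free by construction
    out = [None] * n
    for i, x in enumerate(a):
        if flags[i]:
            out[left[i]] = x
        else:
            out[num_left + right[i]] = x  # right block starts after the left block
    return out, num_left  # partitioned array and split point
```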
Properties:
- work T1 = O(n log n) expected
- span T∞ = O(log^2 n) expected: O(log n) per prefix-sum partition times O(log n) expected recursion depth
- space = O(n) for temporary buffers

Performance depends on pivot quality. Randomized or median-of-three pivots reduce imbalance.
Sample Sort (Distributed)
Sample sort scales well across distributed-memory nodes because most work is local and data is exchanged only once.
- Each node sorts its local chunk
- Nodes select samples and compute global splitters
- Partition data by splitters and exchange buckets
- Each node sorts its received bucket
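The steps above can be simulated in Python on one machine (the `chunks` lists stand in for per-node data and the all-to-all exchange, and `oversample` is an illustrative tuning knob):

```python
import bisect
import random

def sample_sort(chunks, oversample=4):
    p = len(chunks)
    # 1. each "node" sorts its local chunk
    chunks = [sorted(c) for c in chunks]
    # 2. each node contributes samples; choose p-1 global splitters
    samples = sorted(x for c in chunks
                     for x in random.sample(c, min(len(c), oversample)))
    step = max(1, len(samples) // p)
    splitters = samples[step::step][:p - 1]
    # 3. route every element to the bucket owning its splitter range
    # (in a real system this is the all-to-all exchange)
    buckets = [[] for _ in range(p)]
    for c in chunks:
        for x in c:
            buckets[bisect.bisect_right(splitters, x)].append(x)
    # 4. each node sorts its received bucket; because buckets cover
    # disjoint key ranges, their concatenation is globally sorted
    return [x for b in buckets for x in sorted(b)]
```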
sample_sort(distributed A):
    local_sort(A_i)
    samples = pick from each A_i
    splitters = global_select(samples)
    buckets = partition(A_i, splitters)
    all_to_all_exchange(buckets)
    local_sort(received bucket)

Properties:
- balanced buckets → near O((n/p) log (n/p)) local work
- communication dominates if buckets are skewed

Bitonic Sort (Network-Based)
Bitonic sort uses a fixed sequence of compare-exchange operations. It is suitable for SIMD and GPU execution.
for size = 2, 4, 8, ...:
    build bitonic sequences
    perform bitonic merges with compare-exchange

Properties:
- work T1 = O(n log^2 n)
- span T∞ = O(log^2 n)

It performs more work than comparison-optimal sorts but has a regular structure and no branching, which suits hardware.
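The fixed compare-exchange schedule can be sketched in Python (a sequential rendering of the standard network; all compare-exchanges at a given stage are independent, which is what makes each stage one data-parallel step):

```python
def bitonic_sort(a):
    # In-place bitonic sort; length must be a power of two.
    n = len(a)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    size = 2
    while size <= n:              # build bitonic sequences of this size
        stride = size // 2
        while stride > 0:         # bitonic merge via compare-exchange
            for i in range(n):
                partner = i ^ stride
                up = (i & size) == 0   # direction of this subsequence
                # data-independent schedule: which pairs are compared
                # never depends on the values, only on the indices
                if partner > i and (a[i] > a[partner]) == up:
                    a[i], a[partner] = a[partner], a[i]
            stride //= 2
        size *= 2
    return a
```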
Radix Sort on GPUs
Radix sort parallelizes well by processing digits with counting and prefix sums.
Per pass:
    count digits in parallel
    compute prefix sums
    scatter elements to output positions

Properties:
- work T1 = O(d(n + b)) for d digit passes over b buckets
- span T∞ = O(d log n) for the prefix sums

High throughput comes from coalesced memory access and uniform control flow.
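The per-pass structure can be sketched sequentially in Python (assumes non-negative integer keys; on a GPU the count, prefix-sum, and scatter steps each run as a parallel kernel):

```python
def radix_sort(a, bits=4):
    # LSD radix sort over `bits`-bit digits.
    if not a:
        return a
    mask = (1 << bits) - 1
    shift = 0
    max_val = max(a)
    while max_val >> shift:
        # 1. count digit occurrences (a parallel histogram on a GPU)
        counts = [0] * (mask + 1)
        for x in a:
            counts[(x >> shift) & mask] += 1
        # 2. exclusive prefix sum gives each digit's first output slot
        total = 0
        for d in range(mask + 1):
            counts[d], total = total, total + counts[d]
        # 3. scatter: stable placement by digit value
        out = [0] * len(a)
        for x in a:
            d = (x >> shift) & mask
            out[counts[d]] = x
            counts[d] += 1
        a = out
        shift += bits
    return a
```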
Correctness
Parallel algorithms preserve the same invariants as sequential versions:
- Partition ensures elements are on correct sides of the pivot
- Merge ensures the output is globally ordered
- Stable variants maintain tie-breaking rules
Parallel execution must avoid races:
- Writes to shared arrays must be disjoint or synchronized
- Prefix sums provide deterministic placement indices
- Barriers (sync) ensure dependencies are respected
Load Balancing
Imbalance reduces speedup.
Strategies:
- randomized pivots
- oversampling for splitter selection
- dynamic task scheduling (work stealing)
- chunking thresholds for small tasks

Work-stealing schedulers keep processors busy by redistributing tasks at runtime.
Memory and Communication
Shared-memory:
- avoid false sharing
- align buffers
- prefer contiguous access

Distributed-memory:
- minimize all-to-all volume
- batch messages
- compress keys when possible

Communication often dominates at scale.
Complexity and Speedup
Ideal speedup:
speedup = T1 / Tp ≈ p

In practice:
Tp ≈ T1 / p + overhead

Overheads include synchronization, communication, and cache effects.
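Plugging numbers into this model shows how quickly overhead erodes speedup (a toy calculation; the single `overhead` term lumps synchronization, communication, and cache effects together):

```python
def estimated_speedup(t1, p, overhead):
    # Tp ≈ T1/p + overhead; speedup = T1 / Tp
    tp = t1 / p + overhead
    return t1 / tp
```

With T1 = 100 and p = 10, zero overhead gives the ideal 10x, while an overhead equal to just one tenth of T1 cuts the speedup in half.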
When to Use
- Large datasets where single-thread time is too high
- Systems with multiple cores or distributed nodes
- Workloads tolerant of additional memory and coordination costs
Avoid parallelization when:
- Input size is small
- Memory bandwidth is the bottleneck
- Synchronization cost dominates
Common Bugs
- Data races during partition or scatter
- Incorrect prefix sums leading to overwritten elements
- Imbalanced partitions causing idle processors
- Excessive synchronization reducing parallel gains
Takeaway
Parallel sorting reduces elapsed time by distributing work, but it introduces coordination costs. Effective designs keep work optimal, reduce span, balance load, and control communication.