# 6.22 Parallel Sorting

Parallel sorting distributes work across multiple processing units. The goal is to reduce wall-clock time while preserving correctness and scaling with the number of processors. The cost model accounts for total work, parallel time (span), communication, and synchronization.

## Problem

You have an array `A[0..n-1]` and `p` processors. You want to sort faster than a single-threaded algorithm by exploiting parallelism.

## Models

Two common models guide design:

* Shared-memory model: threads access a common address space
* Distributed-memory model: data is partitioned across nodes; communication uses messages

Performance is described by:

```text id="m1x3y9"
work T1   = total operations
span T∞   = critical path length
time Tp   = parallel time on p processors
```

A scalable algorithm keeps `T1 = O(n log n)` and reduces `T∞` while limiting communication.
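The quantities above are tied together by Brent's bound: a greedy scheduler on `p` processors achieves `Tp ≤ T1/p + T∞`. A minimal sketch, with toy numbers chosen for illustration only:

```python
import math

def brent_bound(t1: float, t_inf: float, p: int) -> float:
    """Brent's bound: Tp <= T1 / p + T_inf for a greedy scheduler."""
    return t1 / p + t_inf

# Example: ~n log n units of work, span ~log^2 n (parallel merge sort).
n = 1_000_000
t1 = n * math.log2(n)
t_inf = math.log2(n) ** 2
for p in (1, 8, 64):
    print(p, brent_bound(t1, t_inf, p))
```

The bound makes the scalability claim concrete: as long as `T1/p` dominates `T∞`, adding processors keeps paying off.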

## Parallel Merge Sort

Split the array, sort halves in parallel, then merge in parallel.

```text id="b4v0h2"
parallel_merge_sort(A):
  if small:
    sort sequentially
    return

  split A into L and R
  spawn parallel_merge_sort(L)
  spawn parallel_merge_sort(R)
  sync
  parallel_merge(L, R, A)
```

Parallel merge picks the median of the larger input as a pivot, locates its rank in the other input via binary search, and recurses on the two resulting independent subproblems.
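
The divide-and-conquer merge can be sketched as follows. The two recursive calls are independent, so a real runtime would spawn them as tasks; here they run sequentially to show the structure:

```python
import bisect

def parallel_merge(left, right):
    """Merge two sorted lists by divide and conquer.

    Picks the median of the larger input as a pivot, finds its rank
    in the other input by binary search, and recurses. In a parallel
    runtime, the two recursive calls below would be spawned tasks.
    """
    if len(left) < len(right):
        left, right = right, left          # ensure left is larger
    if not left:
        return []
    mid = len(left) // 2
    pivot = left[mid]
    r = bisect.bisect_left(right, pivot)   # pivot's rank in the other input
    return (parallel_merge(left[:mid], right[:r])
            + [pivot]
            + parallel_merge(left[mid + 1:], right[r:]))
```

Each level of recursion halves the larger input, giving the `O(log^2 n)` span quoted below.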

Properties:

```text id="8g6s3n"
work  T1 = O(n log n)
span  T∞ = O(log^2 n) with parallel merge
space = O(n)
```

## Parallel Quick Sort

Partition in parallel, then recurse on both sides.

```text id="j7y2qk"
parallel_quick_sort(A, lo, hi):
  if hi - lo <= threshold:
    sort sequentially
    return

  p = parallel_partition(A, lo, hi)

  spawn parallel_quick_sort(A, lo, p)
  spawn parallel_quick_sort(A, p+1, hi)
  sync
```

Parallel partition computes a boolean flag per element (`≤ pivot`), performs a parallel prefix sum to compute target indices, then scatters elements.
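
A sequential simulation of that three-phase partition. Each phase (flag map, exclusive prefix sum, scatter) touches disjoint indices, which is exactly what makes it parallelizable:

```python
def parallel_partition(a, pivot):
    """Stable partition via flags + prefix sums (sequentially simulated).

    Returns (partitioned list, index of the first element > pivot).
    Each phase below is a data-parallel primitive in a real implementation.
    """
    flags = [1 if x <= pivot else 0 for x in a]   # phase 1: parallel map
    prefix = [0] * len(a)                         # phase 2: exclusive prefix sum
    total = 0
    for i, f in enumerate(flags):
        prefix[i] = total
        total += f
    out = [None] * len(a)                         # phase 3: scatter
    big = total                                   # "large" block starts here
    for i, x in enumerate(a):
        if flags[i]:
            out[prefix[i]] = x
        else:
            out[big] = x
            big += 1
    return out, total
```

Because the prefix sum assigns each element a unique target index, the scatter writes are disjoint and race-free.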

Properties:

```text id="6t9q1p"
work  T1 = O(n log n) expected
span  T∞ = O(log^2 n) expected: O(log n) partition span × O(log n) recursion depth
space = O(n) for temporary buffers
```

Performance depends on pivot quality. Randomized or median-of-three pivots reduce imbalance.

## Sample Sort (Distributed)

Sample sort scales well across nodes.

1. Each node sorts its local chunk
2. Nodes select samples and compute global splitters
3. Partition data by splitters and exchange buckets
4. Each node sorts its received bucket

```text id="p2d5wv"
sample_sort(distributed A):
  local_sort(A_i)
  samples = pick from each A_i
  splitters = global_select(samples)
  buckets = partition(A_i, splitters)
  all_to_all_exchange(buckets)
  local_sort(received bucket)
```
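
The four steps above can be simulated on one machine, with each list in `chunks` standing in for one node's data (the helper name and the oversampling factor are illustrative choices, not part of any standard API):

```python
import bisect

def sample_sort(chunks, oversample=4):
    """Simulate distributed sample sort over p = len(chunks) 'nodes'."""
    p = len(chunks)
    local = [sorted(c) for c in chunks]            # step 1: local sort
    samples = sorted(x for c in local              # step 2: gather samples
                     for x in c[::max(1, len(c) // oversample)])
    step = max(1, len(samples) // p)
    splitters = samples[step::step][:p - 1]        # p-1 global splitters
    buckets = [[] for _ in range(p)]               # step 3: all-to-all exchange
    for c in local:
        for x in c:
            buckets[bisect.bisect_right(splitters, x)].append(x)
    return [sorted(b) for b in buckets]            # step 4: sort received bucket
```

Concatenating the returned buckets yields a globally sorted sequence; bucket sizes stay balanced only if the splitters approximate the true quantiles, which is why oversampling matters.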

Properties:

```text id="3h0b4c"
balanced buckets → near O((n/p) log (n/p)) local work
communication dominates if buckets are skewed
```

## Bitonic Sort (Network-Based)

Bitonic sort uses a fixed sequence of compare-exchange operations. It is suitable for SIMD and GPU execution.

```text id="r8k1e6"
for size = 2, 4, 8, ...:
  build bitonic sequences
  perform bitonic merges with compare-exchange
```
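
A compact sketch of the full network for power-of-two inputs. Every compare-exchange in an inner pass touches a disjoint pair `(i, i ^ j)`, so each pass could execute entirely in parallel:

```python
def bitonic_sort(a):
    """Bitonic sort as a data-independent compare-exchange network.

    Requires len(a) to be a power of two. The comparison pattern
    depends only on indices, never on data, which is what makes it
    suitable for SIMD and GPU execution.
    """
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    a = list(a)
    size = 2
    while size <= n:              # build bitonic sequences of this size
        j = size // 2
        while j > 0:              # bitonic merge with halving stride
            for i in range(n):    # all pairs here are independent
                partner = i ^ j
                if partner > i:
                    ascending = (i & size) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        size *= 2
    return a
```

The nested loops make the `O(log^2 n)` pass count, and hence the `O(n log^2 n)` work, directly visible.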

Properties:

```text id="z9c2q7"
work  T1 = O(n log^2 n)
span  T∞ = O(log^2 n)
```

It performs more work than comparison-optimal sorts but has regular structure and no branching, which suits hardware.

## Radix Sort on GPUs

Radix sort parallelizes well by processing digits with counting and prefix sums.

Per pass:

```text id="h5n7r2"
count digits in parallel
compute prefix sums
scatter elements to output positions
```
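
One pass of that pipeline, sequentially simulated for non-negative integers (on a GPU each phase would be a separate parallel kernel):

```python
def counting_pass(a, shift, bits=8):
    """One stable radix-sort pass on the digit at `shift`."""
    buckets = 1 << bits
    mask = buckets - 1
    count = [0] * buckets
    for x in a:                       # phase 1: count digits
        count[(x >> shift) & mask] += 1
    start = [0] * buckets             # phase 2: exclusive prefix sum
    total = 0
    for d in range(buckets):
        start[d] = total
        total += count[d]
    out = [0] * len(a)                # phase 3: stable scatter
    for x in a:
        d = (x >> shift) & mask
        out[start[d]] = x
        start[d] += 1
    return out

def radix_sort(a, bits=8, passes=4):
    """LSD radix sort: 4 passes of 8 bits cover 32-bit keys."""
    for shift in range(0, bits * passes, bits):
        a = counting_pass(a, shift, bits)
    return a
```

Stability of each pass is what lets later (more significant) passes preserve the ordering established by earlier ones.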

Properties:

```text id="y4d8k1"
work  T1 = O(d(n + b)) for d digit passes over b buckets
span  T∞ = O(d log n) from the prefix sums
```

High throughput comes from coalesced memory access and uniform control flow.

## Correctness

Parallel algorithms preserve the same invariants as sequential versions:

* Partition ensures elements are on correct sides of the pivot
* Merge ensures the output is globally ordered
* Stable variants maintain tie-breaking rules

Parallel execution must avoid races:

* Writes to shared arrays must be disjoint or synchronized
* Prefix sums provide deterministic placement indices
* Barriers (`sync`) ensure dependencies are respected

## Load Balancing

Imbalance reduces speedup.

Strategies:

```text id="n6f3k9"
randomized pivots
oversampling for splitter selection
dynamic task scheduling (work stealing)
chunking thresholds for small tasks
```

Work-stealing schedulers keep processors busy by redistributing tasks at runtime.

## Memory and Communication

Shared-memory:

```text id="u2w8x3"
avoid false sharing
align buffers
prefer contiguous access
```

Distributed-memory:

```text id="c7p1e5"
minimize all-to-all volume
batch messages
compress keys when possible
```

Communication often dominates at scale.

## Complexity and Speedup

Ideal speedup:

```text id="d3m0s8"
speedup = T1 / Tp ≈ p
```

In practice:

```text id="v9l2t6"
Tp ≈ T1 / p + overhead
```

Overheads include synchronization, communication, and cache effects.
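
Plugging toy numbers (assumed purely for illustration) into the model above shows how a fixed overhead caps speedup:

```python
def speedup(t1, p, overhead):
    """Observed speedup under the Tp ≈ T1/p + overhead model."""
    return t1 / (t1 / p + overhead)

# Assumed: 1 s of sequential work, 10 ms of fixed per-run overhead.
for p in (1, 8, 64):
    print(p, round(speedup(1.0, p, 0.01), 2))
```

With these numbers, 64 processors yield roughly 39x rather than 64x: once `T1/p` shrinks toward the overhead term, extra processors stop helping.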

## When to Use

* Large datasets where single-thread time is too high
* Systems with multiple cores or distributed nodes
* Workloads tolerant of additional memory and coordination costs

Avoid parallelization when:

* Input size is small
* Memory bandwidth is the bottleneck
* Synchronization cost dominates

## Common Bugs

* Data races during partition or scatter
* Incorrect prefix sums leading to overwritten elements
* Imbalanced partitions causing idle processors
* Excessive synchronization reducing parallel gains

## Takeaway

Parallel sorting reduces elapsed time by distributing work, but it introduces coordination costs. Effective designs keep work optimal, reduce span, balance load, and control communication.

