15.10 Parallel Divide and Conquer
Divide-and-conquer algorithms naturally create independent subproblems.
15.10 Parallel Divide and Conquer
Problem
Divide-and-conquer algorithms naturally create independent subproblems. Merge sort sorts the left half and right half independently. Quick sort partitions the array and then recursively sorts both sides. Matrix multiplication splits work into independent block products.
On a single core, these recursive calls run one after another. On modern hardware, this leaves available CPU cores idle.
How can we turn recursive decomposition into parallel execution without creating excessive overhead?
Solution
Parallelize independent recursive calls, but only while the subproblems are large enough to justify the cost.
A divide-and-conquer algorithm usually has this structure:
solve(problem)
if problem is small
solve directly
split problem into subproblems
solve subproblem A
solve subproblem B
combine results
The parallel version changes the recursive calls:
solve(problem)
if problem is small
solve directly
split problem into subproblems
run subproblem A concurrently
run subproblem B concurrently
wait for both results
combine results
The algorithmic idea is simple. The engineering problem is controlling granularity.
Why Divide and Conquer Parallelizes Well
A recursive split often produces independent work.
In merge sort:
left half -> can be sorted independently
right half -> can be sorted independently
In quick sort:
values less than pivot -> independent
values greater than pivot -> independent
In tree algorithms:
left subtree -> independent
right subtree -> independent
The combine step may still be sequential, but the recursive calls can often run in parallel.
This gives divide-and-conquer algorithms a natural task graph.
Work and Span
Parallel algorithms are commonly analyzed using two quantities.
The work is the total amount of computation performed if executed on one processor.
The span is the length of the longest dependency chain.
For merge sort, the total work remains:
$$ O(n\log n) $$
The span depends on how the merge step is implemented.
If merging remains sequential, the span includes a large combine step at each level. If merging is also parallelized, the span can be reduced further.
The theoretical maximum speedup is bounded by:
$$ \frac{\text{work}}{\text{span}} $$
This tells us how much parallelism the algorithm exposes.
A Sequential Merge Sort
func MergeSort(nums []int) []int {
if len(nums) <= 1 {
return nums
}
mid := len(nums) / 2
left := MergeSort(nums[:mid])
right := MergeSort(nums[mid:])
return Merge(left, right)
}
The recursive calls are independent.
This is the exact point where parallelism can be introduced.
A Parallel Merge Sort Sketch
func ParallelMergeSort(nums []int, depth int) []int {
if len(nums) <= 1 {
return nums
}
if depth <= 0 || len(nums) < 4096 {
return MergeSort(nums)
}
mid := len(nums) / 2
var left []int
var right []int
done := make(chan struct{})
go func() {
left = ParallelMergeSort(nums[:mid], depth-1)
close(done)
}()
right = ParallelMergeSort(nums[mid:], depth-1)
<-done
return Merge(left, right)
}
This version spawns one goroutine for the left half and computes the right half in the current goroutine.
The depth parameter limits how many recursive levels create parallel tasks.
Without such a limit, the program may create too many goroutines.
Choosing a Cutoff
Parallelism has overhead.
Creating tasks, scheduling work, synchronizing results, and moving memory all cost time. For small subproblems, this overhead may exceed the benefit.
A practical implementation uses two cutoffs:
minimum input size
maximum parallel depth
Example:
if depth <= 0 || len(nums) < 4096 {
return MergeSort(nums)
}
The exact threshold depends on:
- CPU core count
- scheduler overhead
- cache behavior
- merge implementation
- allocation cost
- input type
Do not parallelize every recursive call.
Parallelize only enough calls to keep cores busy.
Depth-Based Control
A common rule is:
parallel depth ≈ log2(number of cores)
For 8 cores:
log2(8) = 3
This means at most three recursive levels spawn parallel tasks.
That produces up to:
2^3 = 8
large tasks.
Each task can then run sequentially.
This avoids creating thousands of tiny jobs.
Work Stealing
Many parallel runtimes use work stealing.
The idea is:
- Each worker owns a deque of tasks.
- A worker executes local tasks first.
- Idle workers steal tasks from others.
- Large recursive tasks naturally split into smaller tasks.
Divide-and-conquer algorithms work well with this model because tasks form a recursive tree. Large tasks split near the top, and smaller tasks appear near the leaves.
Languages and runtimes with work-stealing schedulers can often parallelize recursive algorithms cleanly.
Parallel Quick Sort
Quick sort is also naturally parallel after partitioning.
func ParallelQuickSort(nums []int, low, high int, depth int) {
if low >= high {
return
}
if depth <= 0 || high-low < 4096 {
QuickSort(nums, low, high)
return
}
pivot := partition(nums, low, high)
done := make(chan struct{})
go func() {
ParallelQuickSort(nums, low, pivot-1, depth-1)
close(done)
}()
ParallelQuickSort(nums, pivot+1, high, depth-1)
<-done
}
Partitioning itself is usually sequential in simple implementations.
After the pivot is placed, both sides are independent.
Load Balance
Quick sort has a problem that merge sort does not.
Merge sort splits evenly by construction:
n/2 and n/2
Quick sort may split unevenly:
1 and n-2
or:
0 and n-1
Poor pivot choices cause bad load balance.
One core may receive most of the work while other cores sit idle.
Random pivots, median-of-three pivots, and three-way partitioning improve both sequential performance and parallel load balance.
Parallel Recursion as a Task Tree
A parallel recursive algorithm creates a tree of tasks.
Example with depth 3:
n
/ \\n n/2 n/2
/ \ / \\n n/4 n/4 n/4 n/4
/ \ / \ / \ / \\n...
Tasks near the top are large and worth parallelizing.
Tasks near the bottom are small and should usually run sequentially.
A good implementation exploits top-level parallelism and avoids leaf-level overhead.
Avoiding Shared Mutable State
Parallel divide-and-conquer code should minimize shared writes.
Good pattern:
subproblem A writes only to region A
subproblem B writes only to region B
Bad pattern:
all tasks append to one shared list
Shared mutable structures require locks or atomic operations. These often erase the benefit of parallelism.
Partition arrays, use local buffers, and combine results after the parallel phase.
False Sharing
Even when two tasks write to different variables, they may still contend on the same cache line.
This is called false sharing.
Example:
worker 1 writes counter[0]
worker 2 writes counter[1]
If both counters live in the same cache line, CPU cores repeatedly invalidate each other's cache entries.
In high-performance parallel divide-and-conquer code, memory layout matters.
Use per-worker buffers, padding, or chunked output when necessary.
Parallel Combine Steps
Some algorithms parallelize only the recursive calls.
Others also parallelize the combine step.
For merge sort, the merge operation can itself be parallelized:
- Split the larger sorted array in half.
- Binary search that midpoint into the other array.
- Recursively merge the two independent output ranges.
This creates parallelism inside the merge step, reducing span.
The implementation is more complex, but it matters for very large inputs.
Example: Parallel Sum
Parallel divide and conquer can be useful even for simple reductions.
func ParallelSum(nums []int, depth int) int {
if len(nums) == 0 {
return 0
}
if depth <= 0 || len(nums) < 4096 {
sum := 0
for _, x := range nums {
sum += x
}
return sum
}
mid := len(nums) / 2
done := make(chan int)
go func() {
done <- ParallelSum(nums[:mid], depth-1)
}()
right := ParallelSum(nums[mid:], depth-1)
left := <-done
return left + right
}
Sequential sum is already simple and cache-friendly. Parallel sum only helps when the array is large enough.
This makes it a good example of the cutoff principle.
Complexity Analysis
Parallelization does not change total work.
For merge sort:
$$ O(n\log n) $$
work remains.
Parallelization reduces wall-clock time by distributing independent work across processors.
With (p) processors, ideal time is roughly:
$$ O\left(\frac{\text{work}}{p} + \text{span}\right) $$
In practice, performance also depends on:
- scheduling overhead
- memory bandwidth
- cache locality
- synchronization
- allocation
- load balance
Asymptotic complexity is only part of the story.
Benchmarking Parallel Algorithms
Parallel code must be benchmarked carefully.
Measure:
- single-thread baseline
- parallel version with 2, 4, 8, and more workers
- different input sizes
- different cutoff thresholds
- memory allocations
- CPU utilization
A parallel algorithm that is faster for 100 million elements may be slower for 100 thousand elements.
Always compare against a strong sequential implementation.
Common Mistakes
Creating Too Many Tasks
Spawning a task at every recursive call creates enormous overhead.
Use depth and size cutoffs.
Ignoring Load Balance
Uneven partitioning can leave cores idle.
This is especially common in quick sort.
Sharing Output Buffers Incorrectly
Concurrent writes to overlapping memory ranges cause data races.
Assign disjoint regions to each task.
Parallelizing Memory-Bound Work
Some algorithms are limited by memory bandwidth, not CPU time.
Adding threads may produce little or no speedup.
Measuring Only One Input Size
Parallel speedups are input-size dependent.
Small benchmarks often produce misleading results.
Discussion
Parallel divide and conquer works because recursive decomposition exposes independent work. The algorithm already has the shape of a task graph; parallel execution simply schedules parts of that graph on different processors.
The important engineering skill is restraint. More parallelism is not automatically better. Good implementations create enough tasks to saturate available cores, then switch to sequential code before overhead dominates. They avoid shared mutable state, control memory allocation, and benchmark cutoffs rather than guessing.
This recipe also shows a recurring theme in high-performance algorithms: the asymptotic work may stay the same while real execution time improves dramatically. Divide and conquer gives us a clean way to express that parallel structure, but the final performance depends on hardware-aware choices about granularity, locality, and synchronization.