15.10 Parallel Divide and Conquer

Problem

Divide-and-conquer algorithms naturally create independent subproblems. Merge sort sorts the left half and right half independently. Quick sort partitions the array and then recursively sorts both sides. Matrix multiplication splits work into independent block products.

On a single core, these recursive calls run one after another. On modern hardware, this leaves available CPU cores idle.

How can we turn recursive decomposition into parallel execution without creating excessive overhead?

Solution

Parallelize independent recursive calls, but only while the subproblems are large enough to justify the cost.

A divide-and-conquer algorithm usually has this structure:

solve(problem)
    if problem is small
        solve directly

    split problem into subproblems

    solve subproblem A
    solve subproblem B

    combine results

The parallel version changes the recursive calls:

solve(problem)
    if problem is small
        solve directly

    split problem into subproblems

    run subproblem A concurrently
    run subproblem B concurrently

    wait for both results

    combine results

The algorithmic idea is simple. The engineering problem is controlling granularity.

Why Divide and Conquer Parallelizes Well

A recursive split often produces independent work.

In merge sort:

left half  -> can be sorted independently
right half -> can be sorted independently

In quick sort:

values less than pivot    -> independent
values greater than pivot -> independent

In tree algorithms:

left subtree  -> independent
right subtree -> independent

The combine step may still be sequential, but the recursive calls can often run in parallel.

This gives divide-and-conquer algorithms a natural task graph.

Work and Span

Parallel algorithms are commonly analyzed using two quantities.

The work is the total amount of computation performed if executed on one processor.

The span is the length of the longest dependency chain.

For merge sort, the total work remains:

$$ O(n\log n) $$

The span depends on how the merge step is implemented.

If merging remains sequential, the span includes a large combine step at each level. If merging is also parallelized, the span can be reduced further.

The theoretical maximum speedup is bounded by:

$$ \frac{\text{work}}{\text{span}} $$

This tells us how much parallelism the algorithm exposes.

A Sequential Merge Sort

func MergeSort(nums []int) []int {
    if len(nums) <= 1 {
        return nums
    }

    mid := len(nums) / 2

    left := MergeSort(nums[:mid])
    right := MergeSort(nums[mid:])

    return Merge(left, right)
}

The recursive calls are independent.

This is the exact point where parallelism can be introduced.

A Parallel Merge Sort Sketch

func ParallelMergeSort(nums []int, depth int) []int {
    if len(nums) <= 1 {
        return nums
    }

    if depth <= 0 || len(nums) < 4096 {
        return MergeSort(nums)
    }

    mid := len(nums) / 2

    var left []int
    var right []int

    done := make(chan struct{})

    go func() {
        left = ParallelMergeSort(nums[:mid], depth-1)
        close(done)
    }()

    right = ParallelMergeSort(nums[mid:], depth-1)

    <-done

    return Merge(left, right)
}

This version spawns one goroutine for the left half and computes the right half in the current goroutine.

The depth parameter limits how many recursive levels create parallel tasks.

Without such a limit, the program may create too many goroutines.

Choosing a Cutoff

Parallelism has overhead.

Creating tasks, scheduling work, synchronizing results, and moving memory all cost time. For small subproblems, this overhead may exceed the benefit.

A practical implementation uses two cutoffs:

minimum input size
maximum parallel depth

Example:

if depth <= 0 || len(nums) < 4096 {
    return MergeSort(nums)
}

The exact threshold depends on:

CPU core count
scheduler overhead
cache behavior
merge implementation
allocation cost
input type

Do not parallelize every recursive call.

Parallelize only enough calls to keep cores busy.

Depth-Based Control

A common rule is:

parallel depth ≈ log2(number of cores)

For 8 cores:

log2(8) = 3

This means at most three recursive levels spawn parallel tasks.

That produces up to:

2^3 = 8

large tasks.

Each task can then run sequentially.

This avoids creating thousands of tiny jobs.

Work Stealing

Many parallel runtimes use work stealing.

The idea is:

Each worker owns a deque of tasks.
A worker executes local tasks first.
Idle workers steal tasks from others.
Large recursive tasks naturally split into smaller tasks.

Divide-and-conquer algorithms work well with this model because tasks form a recursive tree. Large tasks split near the top, and smaller tasks appear near the leaves.

Languages and runtimes with work-stealing schedulers can often parallelize recursive algorithms cleanly.

Parallel Quick Sort

Quick sort is also naturally parallel after partitioning.

func ParallelQuickSort(nums []int, low, high int, depth int) {
    if low >= high {
        return
    }

    if depth <= 0 || high-low < 4096 {
        QuickSort(nums, low, high)
        return
    }

    pivot := partition(nums, low, high)

    done := make(chan struct{})

    go func() {
        ParallelQuickSort(nums, low, pivot-1, depth-1)
        close(done)
    }()

    ParallelQuickSort(nums, pivot+1, high, depth-1)

    <-done
}

Partitioning itself is usually sequential in simple implementations.

After the pivot is placed, both sides are independent.

Load Balance

Quick sort has a problem that merge sort does not.

Merge sort splits evenly by construction:

n/2 and n/2

Quick sort may split unevenly:

1 and n-2

or:

0 and n-1

Poor pivot choices cause bad load balance.

One core may receive most of the work while other cores sit idle.

Random pivots, median-of-three pivots, and three-way partitioning improve both sequential performance and parallel load balance.

Parallel Recursion as a Task Tree

A parallel recursive algorithm creates a tree of tasks.

Example with depth 3:

                 n
          /             \\n       n/2               n/2
     /     \           /     \\n   n/4     n/4       n/4     n/4
  /  \     /  \      /  \    /  \\n...

Tasks near the top are large and worth parallelizing.

Tasks near the bottom are small and should usually run sequentially.

A good implementation exploits top-level parallelism and avoids leaf-level overhead.

Avoiding Shared Mutable State

Parallel divide-and-conquer code should minimize shared writes.

Good pattern:

subproblem A writes only to region A
subproblem B writes only to region B

Bad pattern:

all tasks append to one shared list

Shared mutable structures require locks or atomic operations. These often erase the benefit of parallelism.

Partition arrays, use local buffers, and combine results after the parallel phase.

Even when two tasks write to different variables, they may still contend on the same cache line.

This is called false sharing.

Example:

worker 1 writes counter[0]
worker 2 writes counter[1]

If both counters live in the same cache line, CPU cores repeatedly invalidate each other's cache entries.

In high-performance parallel divide-and-conquer code, memory layout matters.

Use per-worker buffers, padding, or chunked output when necessary.

Parallel Combine Steps

Some algorithms parallelize only the recursive calls.

Others also parallelize the combine step.

For merge sort, the merge operation can itself be parallelized:

Split the larger sorted array in half.
Binary search that midpoint into the other array.
Recursively merge the two independent output ranges.

This creates parallelism inside the merge step, reducing span.

The implementation is more complex, but it matters for very large inputs.

Example: Parallel Sum

Parallel divide and conquer can be useful even for simple reductions.

func ParallelSum(nums []int, depth int) int {
    if len(nums) == 0 {
        return 0
    }

    if depth <= 0 || len(nums) < 4096 {
        sum := 0
        for _, x := range nums {
            sum += x
        }
        return sum
    }

    mid := len(nums) / 2

    done := make(chan int)

    go func() {
        done <- ParallelSum(nums[:mid], depth-1)
    }()

    right := ParallelSum(nums[mid:], depth-1)
    left := <-done

    return left + right
}

Sequential sum is already simple and cache-friendly. Parallel sum only helps when the array is large enough.

This makes it a good example of the cutoff principle.

Complexity Analysis

Parallelization does not change total work.

For merge sort:

$$ O(n\log n) $$

work remains.

Parallelization reduces wall-clock time by distributing independent work across processors.

With (p) processors, ideal time is roughly:

$$ O\left(\frac{\text{work}}{p} + \text{span}\right) $$

In practice, performance also depends on:

scheduling overhead
memory bandwidth
cache locality
synchronization
allocation
load balance

Asymptotic complexity is only part of the story.

Benchmarking Parallel Algorithms

Parallel code must be benchmarked carefully.

Measure:

single-thread baseline
parallel version with 2, 4, 8, and more workers
different input sizes
different cutoff thresholds
memory allocations
CPU utilization

A parallel algorithm that is faster for 100 million elements may be slower for 100 thousand elements.

Always compare against a strong sequential implementation.

Common Mistakes

Creating Too Many Tasks

Spawning a task at every recursive call creates enormous overhead.

Use depth and size cutoffs.

Ignoring Load Balance

Uneven partitioning can leave cores idle.

This is especially common in quick sort.

Concurrent writes to overlapping memory ranges cause data races.

Assign disjoint regions to each task.

Parallelizing Memory-Bound Work

Some algorithms are limited by memory bandwidth, not CPU time.

Adding threads may produce little or no speedup.

Measuring Only One Input Size

Parallel speedups are input-size dependent.

Small benchmarks often produce misleading results.

Discussion

Parallel divide and conquer works because recursive decomposition exposes independent work. The algorithm already has the shape of a task graph; parallel execution simply schedules parts of that graph on different processors.

The important engineering skill is restraint. More parallelism is not automatically better. Good implementations create enough tasks to saturate available cores, then switch to sequential code before overhead dominates. They avoid shared mutable state, control memory allocation, and benchmark cutoffs rather than guessing.

This recipe also shows a recurring theme in high-performance algorithms: the asymptotic work may stay the same while real execution time improves dramatically. Divide and conquer gives us a clean way to express that parallel structure, but the final performance depends on hardware-aware choices about granularity, locality, and synchronization.

15.10 Parallel Divide and Conquer

15.10 Parallel Divide and Conquer

Problem

Solution

Why Divide and Conquer Parallelizes Well

Work and Span

A Sequential Merge Sort

A Parallel Merge Sort Sketch

Choosing a Cutoff

Depth-Based Control

Work Stealing

Parallel Quick Sort

Load Balance

Parallel Recursion as a Task Tree

Avoiding Shared Mutable State

False Sharing

Parallel Combine Steps

Example: Parallel Sum

Complexity Analysis

Benchmarking Parallel Algorithms

Common Mistakes

Creating Too Many Tasks

Ignoring Load Balance

Sharing Output Buffers Incorrectly

Parallelizing Memory-Bound Work

Measuring Only One Input Size

Discussion