Count key frequencies in parallel, compute prefix sums, then scatter elements into sorted positions.
Parallel counting sort extends counting sort by distributing counting and placement across multiple workers. It assumes keys are integers within a bounded range. Instead of comparing elements, it counts occurrences and reconstructs the sorted output.
The algorithm consists of three main stages: local counting, prefix sum computation, and parallel scatter.
Problem
Given an array of size n with integer keys in the range [0, k), produce a sorted array.
Algorithm
parallel_counting_sort(A, k, workers):
    create local_count[workers][k]
    parallel for each worker w:
        for each element in its chunk:
            local_count[w][key] += 1
    global_count = sum over workers of local_count
    compute prefix sums over global_count
    compute per-worker offsets from local_count and prefix sums
    parallel for each worker w:
        for each element in its chunk:
            place element in output using offset and increment offset

Prefix Sum
The exclusive prefix sum defines starting positions: offset[d] = count[0] + count[1] + ... + count[d-1], with offset[0] = 0.
Each key d occupies the contiguous block [offset[d], offset[d] + count[d]) in the output.
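The exclusive prefix sum above can be sketched in a few lines (a minimal example; `count` stands in for the merged global histogram):

```go
package main

import "fmt"

// exclusiveScan returns, for each key d, the sum of the counts of all
// smaller keys: the index where key d's block starts in the output.
func exclusiveScan(count []int) []int {
	offset := make([]int, len(count))
	sum := 0
	for d, c := range count {
		offset[d] = sum
		sum += c
	}
	return offset
}

func main() {
	count := []int{1, 2, 2} // counts for keys 0, 1, 2
	fmt.Println(exclusiveScan(count)) // [0 1 3]
}
```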
Example
Let A = [1, 2, 0, 2, 1] and k = 3 (any input with the counts below behaves the same).
Counts:
| key | count |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 2 |
Prefix sums:
| key | offset |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 3 |
Final sorted array: [0, 1, 1, 2, 2].
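The scatter step for this example can be traced directly (using the illustrative input [1, 2, 0, 2, 1] and the offsets from the table above):

```go
package main

import "fmt"

func main() {
	a := []int{1, 2, 0, 2, 1}
	offset := []int{0, 1, 3} // exclusive prefix sums of the counts
	out := make([]int, len(a))
	for _, v := range a {
		out[offset[v]] = v // place v at the next free slot of its block
		offset[v]++        // advance that key's write cursor
	}
	fmt.Println(out) // [0 1 1 2 2]
}
```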
Complexity
| measure | value |
|---|---|
| counting | O(n/p) per worker |
| prefix sum | O(k) |
| scatter | O(n/p) per worker |
| total work | O(n + k) |
| parallel time | O(n/p + k) |
| space | O(n + p·k) |

Here p denotes the number of workers.
Correctness
Counting ensures that the number of occurrences of each key is exact. Prefix sums assign non-overlapping, key-ordered intervals in the output. During scatter, each element is placed into its key's interval at a cursor that only advances, so no position is written twice. Since intervals are ordered by key, the final array is sorted; since each worker scans its chunk in input order and workers' sub-intervals follow worker order, equal keys also keep their relative order (stability).
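Stability is easiest to see with records rather than bare keys. A minimal serial sketch (the `item` type and its fields are illustrative, not part of the implementation below):

```go
package main

import "fmt"

type item struct {
	key int
	tag string
}

// stableCountingSort keeps equal-key items in input order: the scatter
// walks the input left to right and each key's cursor only advances.
func stableCountingSort(a []item, k int) []item {
	count := make([]int, k)
	for _, it := range a {
		count[it.key]++
	}
	offset := make([]int, k)
	sum := 0
	for d := 0; d < k; d++ {
		offset[d] = sum
		sum += count[d]
	}
	out := make([]item, len(a))
	for _, it := range a {
		out[offset[it.key]] = it
		offset[it.key]++
	}
	return out
}

func main() {
	in := []item{{1, "a"}, {0, "b"}, {1, "c"}}
	fmt.Println(stableCountingSort(in, 2)) // [{0 b} {1 a} {1 c}]
}
```

Note that "a" still precedes "c" in the output, even though both share key 1.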
Practical Considerations
- Local counts avoid contention on shared counters.
- Memory overhead grows with workers × k.
- The prefix sum can be parallelized for large k.
- Stable ordering requires processing elements in original order within each worker.
- Efficient when k is small relative to n.
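The bullet on parallelizing the prefix sum can be sketched as a two-pass blocked scan (a simplified sketch; block partitioning and worker counts are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// blockedScan computes an exclusive prefix sum over count in two passes:
// workers scan disjoint blocks, the small array of per-block totals is
// scanned serially, and the totals are added back in parallel.
func blockedScan(count []int, workers int) []int {
	k := len(count)
	out := make([]int, k)
	blockSum := make([]int, workers)
	var wg sync.WaitGroup

	// Pass 1: each worker scans its own block and records the block total.
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			lo, hi := w*k/workers, (w+1)*k/workers
			sum := 0
			for i := lo; i < hi; i++ {
				out[i] = sum
				sum += count[i]
			}
			blockSum[w] = sum
		}(w)
	}
	wg.Wait()

	// Serial scan over the per-block totals (only `workers` entries).
	carry := make([]int, workers)
	sum := 0
	for w := 0; w < workers; w++ {
		carry[w] = sum
		sum += blockSum[w]
	}

	// Pass 2: shift each block by the totals of all earlier blocks.
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			lo, hi := w*k/workers, (w+1)*k/workers
			for i := lo; i < hi; i++ {
				out[i] += carry[w]
			}
		}(w)
	}
	wg.Wait()
	return out
}

func main() {
	count := []int{1, 2, 2, 3, 1, 4}
	fmt.Println(blockedScan(count, 2)) // [0 1 3 5 8 9]
}
```

This only pays off when k is large enough to amortize the goroutine overhead.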
When to Use
Use parallel counting sort when:
- keys are integers with small bounded range
- input size is large
- comparison cost should be avoided
- stable ordering is required
Avoid it when the key range is large, sparse, or unknown.
Implementation (Go, simplified)
func ParallelCountingSort(a []int, k int, workers int) []int {
	n := len(a)
	if n <= 1 {
		return append([]int(nil), a...)
	}
	if workers < 1 {
		workers = 1
	}

	// Stage 1: each worker histograms its own chunk into a private count
	// array, so no synchronization is needed on the counters.
	local := make([][]int, workers)
	for i := range local {
		local[i] = make([]int, k)
	}
	done := make(chan struct{}, workers)
	for w := 0; w < workers; w++ {
		lo := w * n / workers
		hi := (w + 1) * n / workers
		go func(w, lo, hi int) {
			for i := lo; i < hi; i++ {
				local[w][a[i]]++
			}
			done <- struct{}{}
		}(w, lo, hi)
	}
	for w := 0; w < workers; w++ {
		<-done
	}

	// Stage 2: merge the local histograms, then take an exclusive prefix
	// sum to get the start of each key's block in the output.
	global := make([]int, k)
	for d := 0; d < k; d++ {
		for w := 0; w < workers; w++ {
			global[d] += local[w][d]
		}
	}
	offset := make([]int, k)
	sum := 0
	for d := 0; d < k; d++ {
		offset[d] = sum
		sum += global[d]
	}

	// Each worker gets its own write cursor per key: worker w starts
	// after the slots that earlier workers will fill for that key.
	workerOffset := make([][]int, workers)
	for w := 0; w < workers; w++ {
		workerOffset[w] = make([]int, k)
	}
	for d := 0; d < k; d++ {
		pos := offset[d]
		for w := 0; w < workers; w++ {
			workerOffset[w][d] = pos
			pos += local[w][d]
		}
	}

	// Stage 3: scatter. Workers write to disjoint index sets, so the
	// output needs no locking; in-order chunk scans keep the sort stable.
	out := make([]int, n)
	for w := 0; w < workers; w++ {
		lo := w * n / workers
		hi := (w + 1) * n / workers
		go func(w, lo, hi int) {
			pos := workerOffset[w]
			for i := lo; i < hi; i++ {
				v := a[i]
				out[pos[v]] = v
				pos[v]++
			}
			done <- struct{}{}
		}(w, lo, hi)
	}
	for w := 0; w < workers; w++ {
		<-done
	}
	return out
}