Parallel Counting Sort

Count key frequencies in parallel, compute prefix sums, then scatter elements into sorted positions.

Parallel counting sort extends counting sort by distributing counting and placement across multiple workers. It assumes keys are integers within a bounded range. Instead of comparing elements, it counts occurrences and reconstructs the sorted output.

The algorithm consists of three main stages: local counting, prefix sum computation, and parallel scatter.

Problem

Given an array A of size n with integer keys in the range [0, k), produce a sorted array.

Algorithm

parallel_counting_sort(A, k, workers):
    create local_count[workers][k]

    parallel for each worker w:
        for each element in its chunk:
            local_count[w][key] += 1

    global_count = sum over workers of local_count

    compute prefix sums over global_count

    compute per-worker offsets from local_count and prefix sums

    parallel for each worker w:
        for each element in its chunk:
            place element in output using offset and increment offset

Prefix Sum

The prefix sum defines starting positions:

offset[i] = \sum_{j < i} count[j]

Each key i occupies a contiguous block in the output.

Example

Let:

A = [2, 1, 0, 2, 1]

Counts:

key  count
0    1
1    2
2    2

Prefix sums:

key  offset
0    0
1    1
2    3

Final sorted array:

[0, 1, 1, 2, 2]

Complexity

measure        value
counting       O(n/p) per worker
prefix sum     O(k)
scatter        O(n/p) per worker
total work     O(n + k)
parallel time  O(n/p + k)
space          O(n + pk)

Correctness

Counting ensures that the number of occurrences of each key is exact. Prefix sums assign non-overlapping intervals to the keys. During scatter, each element is placed into its assigned interval. Since intervals are ordered by key, the final array is sorted.

Practical Considerations

  • Local counts avoid contention on shared counters.
  • Memory overhead grows with p · k.
  • Prefix sum computation can be parallelized for large k.
  • Stable ordering requires processing elements in original order within each worker.
  • Efficient when k is small relative to n.

When to Use

Use parallel counting sort when:

  • keys are integers with small bounded range
  • input size is large
  • comparison cost should be avoided
  • stable ordering is required

Avoid it when the key range is large, sparse, or unknown.

Implementation (Go, simplified)

// ParallelCountingSort sorts a, whose keys must lie in [0, k),
// using the given number of workers (assumed >= 1).
func ParallelCountingSort(a []int, k int, workers int) []int {
	n := len(a)
	if n <= 1 {
		return append([]int(nil), a...)
	}

	// Per-worker histograms: local[w][d] counts key d in worker w's chunk.
	local := make([][]int, workers)
	for i := range local {
		local[i] = make([]int, k)
	}

	done := make(chan struct{}, workers)

	// Stage 1: parallel local counting. Worker w owns chunk [lo, hi).
	for w := 0; w < workers; w++ {
		lo := w * n / workers
		hi := (w + 1) * n / workers

		go func(w, lo, hi int) {
			for i := lo; i < hi; i++ {
				local[w][a[i]]++
			}
			done <- struct{}{}
		}(w, lo, hi)
	}

	for w := 0; w < workers; w++ {
		<-done
	}

	// Merge local histograms into global counts.
	global := make([]int, k)
	for d := 0; d < k; d++ {
		for w := 0; w < workers; w++ {
			global[d] += local[w][d]
		}
	}

	// Stage 2: exclusive prefix sum gives each key's first output index.
	offset := make([]int, k)
	sum := 0
	for d := 0; d < k; d++ {
		offset[d] = sum
		sum += global[d]
	}

	// Per-worker offsets: worker w writes its occurrences of key d
	// right after those of workers 0..w-1, preserving stability.
	workerOffset := make([][]int, workers)
	for w := 0; w < workers; w++ {
		workerOffset[w] = make([]int, k)
	}

	for d := 0; d < k; d++ {
		pos := offset[d]
		for w := 0; w < workers; w++ {
			workerOffset[w][d] = pos
			pos += local[w][d]
		}
	}

	out := make([]int, n)

	// Stage 3: parallel scatter. Workers write disjoint slots of out.
	for w := 0; w < workers; w++ {
		lo := w * n / workers
		hi := (w + 1) * n / workers

		go func(w, lo, hi int) {
			pos := workerOffset[w]
			for i := lo; i < hi; i++ {
				v := a[i]
				out[pos[v]] = v
				pos[v]++
			}
			done <- struct{}{}
		}(w, lo, hi)
	}

	for w := 0; w < workers; w++ {
		<-done
	}

	return out
}