Parallel Counting Sort

Count key frequencies in parallel, compute prefix sums, then scatter elements into sorted positions.

Parallel counting sort extends counting sort by distributing counting and placement across multiple workers. It assumes keys are integers within a bounded range. Instead of comparing elements, it counts occurrences and reconstructs the sorted output.

The algorithm consists of three main stages: local counting, prefix sum computation, and parallel scatter.

Problem

Given an array A of size n with integer keys in the range [0, k), produce a sorted array.

Algorithm

parallel_counting_sort(A, k, workers):
    create local_count[workers][k]

    parallel for each worker w:
        for each element in its chunk:
            local_count[w][key] += 1

    global_count = sum over workers of local_count

    compute prefix sums over global_count

    compute per-worker offsets from local_count and prefix sums

    parallel for each worker w:
        for each element in its chunk:
            place element in output using offset and increment offset

Prefix Sum

The prefix sum defines starting positions:

offset[i] = \sum_{j < i} count[j]

Each key i occupies a contiguous block in the output.

Example

Let:

A = [2, 1, 0, 2, 1]

Counts:

key  count
0    1
1    2
2    2

Prefix sums:

key  offset
0    0
1    1
2    3

Final sorted array:

[0, 1, 1, 2, 2]

Complexity

measure        value
counting       O(n/p) per worker
prefix sum     O(k)
scatter        O(n/p) per worker
total work     O(n + k)
parallel time  O(n/p + k)
space          O(n + pk)

Correctness

Counting ensures that the number of occurrences of each key is exact. Prefix sums assign non-overlapping intervals to the keys. During scatter, each element is placed into its assigned interval. Since intervals are ordered by key, the final array is sorted.

Practical Considerations

  • Local counts avoid contention on shared counters.
  • Memory overhead grows with p · k.
  • Prefix sum computation can be parallelized for large k.
  • Stable ordering requires processing elements in original order within each worker.
  • Efficient when k is small relative to n.

When to Use

Use parallel counting sort when:

  • keys are integers with small bounded range
  • input size is large
  • comparison cost should be avoided
  • stable ordering is required

Avoid it when the key range is large, sparse, or unknown.

Implementation (Go, simplified)

// ParallelCountingSort sorts a, whose keys must lie in [0, k),
// using the given number of workers (assumed >= 1).
func ParallelCountingSort(a []int, k int, workers int) []int {
	n := len(a)
	if n <= 1 {
		return append([]int(nil), a...)
	}

	// Per-worker histograms: local[w][d] counts key d in worker w's chunk.
	local := make([][]int, workers)
	for i := range local {
		local[i] = make([]int, k)
	}

	done := make(chan struct{}, workers)

	// Stage 1: parallel local counting. Worker w owns chunk [lo, hi).
	for w := 0; w < workers; w++ {
		lo := w * n / workers
		hi := (w + 1) * n / workers

		go func(w, lo, hi int) {
			for i := lo; i < hi; i++ {
				local[w][a[i]]++
			}
			done <- struct{}{}
		}(w, lo, hi)
	}

	for w := 0; w < workers; w++ {
		<-done
	}

	// Merge local histograms into global counts.
	global := make([]int, k)
	for d := 0; d < k; d++ {
		for w := 0; w < workers; w++ {
			global[d] += local[w][d]
		}
	}

	// Stage 2: exclusive prefix sum gives each key's first output index.
	offset := make([]int, k)
	sum := 0
	for d := 0; d < k; d++ {
		offset[d] = sum
		sum += global[d]
	}

	// Per-worker offsets: worker w writes its occurrences of key d
	// right after those of workers 0..w-1, preserving stability.
	workerOffset := make([][]int, workers)
	for w := 0; w < workers; w++ {
		workerOffset[w] = make([]int, k)
	}

	for d := 0; d < k; d++ {
		pos := offset[d]
		for w := 0; w < workers; w++ {
			workerOffset[w][d] = pos
			pos += local[w][d]
		}
	}

	out := make([]int, n)

	// Stage 3: parallel scatter. Workers write disjoint slots of out.
	for w := 0; w < workers; w++ {
		lo := w * n / workers
		hi := (w + 1) * n / workers

		go func(w, lo, hi int) {
			pos := workerOffset[w]
			for i := lo; i < hi; i++ {
				v := a[i]
				out[pos[v]] = v
				pos[v]++
			}
			done <- struct{}{}
		}(w, lo, hi)
	}

	for w := 0; w < workers; w++ {
		<-done
	}

	return out
}