# Parallel Counting Sort

Parallel counting sort extends counting sort by distributing counting and placement across multiple workers. It assumes keys are integers within a bounded range. Instead of comparing elements, it counts occurrences and reconstructs the sorted output.

The algorithm consists of three main stages: local counting, prefix sum computation, and parallel scatter.

## Problem

Given an array $A$ of size $n$ with integer keys in range $[0, k)$, produce a sorted array.

## Algorithm

```text id="r8p1zx"
parallel_counting_sort(A, k, workers):
    create local_count[workers][k]

    parallel for each worker w:
        for each element in its chunk:
            local_count[w][key] += 1

    global_count = sum over workers of local_count

    compute prefix sums over global_count

    compute per-worker offsets from local_count and prefix sums

    parallel for each worker w:
        for each element in its chunk:
            place element in output using offset and increment offset
```
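The per-worker offset step is the part the pseudocode glosses over, so here is a sketch in Go. The helper name `perWorkerOffsets` is illustrative, not part of the full listing later in this document; it assumes the global prefix sums (`offset`) have already been computed:

```go
package main

import "fmt"

// perWorkerOffsets splits each key's output block among workers.
// local[w][d] is worker w's count of key d; offset[d] is the first
// output index of key d's block (an exclusive prefix sum).
// Visiting workers in order keeps the sort stable and race-free:
// worker 0 writes a key's first occurrences, worker 1 the next, and so on.
func perWorkerOffsets(local [][]int, offset []int, k int) [][]int {
	workers := len(local)
	workerOffset := make([][]int, workers)
	for w := range workerOffset {
		workerOffset[w] = make([]int, k)
	}
	for d := 0; d < k; d++ {
		pos := offset[d]
		for w := 0; w < workers; w++ {
			workerOffset[w][d] = pos
			pos += local[w][d]
		}
	}
	return workerOffset
}

func main() {
	// Two workers, k = 3: worker 0 saw keys {0, 1, 2}, worker 1 saw {1, 2}.
	local := [][]int{{1, 1, 1}, {0, 1, 1}}
	offset := []int{0, 1, 3} // exclusive prefix sums of global counts [1 2 2]
	fmt.Println(perWorkerOffsets(local, offset, 3)) // [[0 1 3] [1 2 4]]
}
```

Because every `(worker, key)` pair gets a disjoint sub-range of the output, the scatter phase needs no locks.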

## Prefix Sum

The prefix sum defines starting positions:

$$
offset[i] = \sum_{j < i} count[j]
$$

Each key $i$ occupies a contiguous block in the output.
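In Go this is a short exclusive scan. A minimal sketch (`exclusivePrefix` is an illustrative name, not part of the listing below):

```go
package main

import "fmt"

// exclusivePrefix returns offset[i] = sum of count[j] for j < i,
// i.e. the first output index of each key's block.
func exclusivePrefix(count []int) []int {
	offset := make([]int, len(count))
	sum := 0
	for i, c := range count {
		offset[i] = sum
		sum += c
	}
	return offset
}

func main() {
	fmt.Println(exclusivePrefix([]int{1, 2, 2})) // [0 1 3]
}
```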

## Example

Let:

$$
A = [2, 1, 0, 2, 1]
$$

Counts:

| key | count |
| --- | ----- |
| 0   | 1     |
| 1   | 2     |
| 2   | 2     |

Prefix sums:

| key | offset |
| --- | ------ |
| 0   | 0      |
| 1   | 1      |
| 2   | 3      |

Final sorted array:

$$
[0, 1, 1, 2, 2]
$$
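The worked example can be checked with a single-threaded reference that runs the same three stages in sequence (a sketch for verification only; the parallel version appears later):

```go
package main

import "fmt"

// countingSortSeq: count, exclusive prefix sum, then stable scatter.
func countingSortSeq(a []int, k int) []int {
	count := make([]int, k)
	for _, v := range a {
		count[v]++
	}
	offset := make([]int, k)
	sum := 0
	for d := 0; d < k; d++ {
		offset[d] = sum
		sum += count[d]
	}
	out := make([]int, len(a))
	for _, v := range a { // scan in input order => stable
		out[offset[v]] = v
		offset[v]++
	}
	return out
}

func main() {
	fmt.Println(countingSortSeq([]int{2, 1, 0, 2, 1}, 3)) // [0 1 1 2 2]
}
```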

## Complexity

| measure       | value               |
| ------------- | ------------------- |
| counting      | $O(n/p)$ per worker |
| prefix sum    | $O(k)$              |
| scatter       | $O(n/p)$ per worker |
| total work    | $O(n + k)$          |
| parallel time | $O(n/p + k)$        |
| space         | $O(n + pk)$         |

## Correctness

Counting ensures that the number of occurrences of each key is exact. Prefix sums assign non-overlapping, key-ordered intervals of the output to each key. During the scatter, each element is placed into the interval for its key, so the final array is sorted. Within an interval, elements keep their input order because workers claim sub-ranges in worker order and scan their chunks in order, which makes the sort stable.

## Practical Considerations

* Local counts avoid contention on shared counters.
* Memory overhead grows with $p \cdot k$.
* Prefix sum can be parallelized for large $k$.
* Stable ordering requires each worker to process its chunk in original order and workers' sub-ranges to be assigned in chunk order.
* Efficient when $k$ is small relative to $n$.
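The stability point matters only when elements carry payloads beyond the key. A minimal single-threaded sketch with a hypothetical `item` record (the full listing below sorts bare ints, where stability is invisible):

```go
package main

import "fmt"

// item is a hypothetical record type for illustration.
type item struct {
	key int
	tag string
}

// stableCountingSort keeps equal-key items in input order: the scatter
// scans items in their original order, and each key's slots are handed
// out sequentially via pos.
func stableCountingSort(items []item, k int) []item {
	count := make([]int, k)
	for _, it := range items {
		count[it.key]++
	}
	pos := make([]int, k)
	sum := 0
	for d := 0; d < k; d++ {
		pos[d] = sum
		sum += count[d]
	}
	out := make([]item, len(items))
	for _, it := range items { // input order => stable
		out[pos[it.key]] = it
		pos[it.key]++
	}
	return out
}

func main() {
	in := []item{{1, "a"}, {0, "b"}, {1, "c"}}
	fmt.Println(stableCountingSort(in, 2)) // [{0 b} {1 a} {1 c}]
}
```

Note that "a" still precedes "c" in the output even though both have key 1.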

## When to Use

Use parallel counting sort when:

* keys are integers with small bounded range
* input size is large
* comparison cost should be avoided
* stable ordering is required

Avoid it when the key range is large, sparse, or unknown.

## Implementation (Go, simplified)

```go id="y8m4hp"
func ParallelCountingSort(a []int, k int, workers int) []int {
	n := len(a)
	if n <= 1 {
		return append([]int(nil), a...)
	}
	if workers < 1 {
		workers = 1
	}

	// Stage 1: each worker counts keys in its own chunk.
	// Per-worker histograms avoid contention on shared counters.
	local := make([][]int, workers)
	for i := range local {
		local[i] = make([]int, k) // assumes all keys lie in [0, k)
	}

	done := make(chan struct{}, workers)

	for w := 0; w < workers; w++ {
		lo := w * n / workers
		hi := (w + 1) * n / workers

		go func(w, lo, hi int) {
			for i := lo; i < hi; i++ {
				local[w][a[i]]++
			}
			done <- struct{}{}
		}(w, lo, hi)
	}

	for w := 0; w < workers; w++ {
		<-done
	}

	// Stage 2: merge local histograms into a global one, then compute
	// exclusive prefix sums to find each key's first output index.
	global := make([]int, k)
	for d := 0; d < k; d++ {
		for w := 0; w < workers; w++ {
			global[d] += local[w][d]
		}
	}

	offset := make([]int, k)
	sum := 0
	for d := 0; d < k; d++ {
		offset[d] = sum
		sum += global[d]
	}

	// Split each key's output block among workers in worker order,
	// so the scatter is race-free and stable.
	workerOffset := make([][]int, workers)
	for w := 0; w < workers; w++ {
		workerOffset[w] = make([]int, k)
	}

	for d := 0; d < k; d++ {
		pos := offset[d]
		for w := 0; w < workers; w++ {
			workerOffset[w][d] = pos
			pos += local[w][d]
		}
	}

	// Stage 3: each worker scatters its chunk into its reserved slots.
	out := make([]int, n)

	for w := 0; w < workers; w++ {
		lo := w * n / workers
		hi := (w + 1) * n / workers

		go func(w, lo, hi int) {
			pos := workerOffset[w]
			for i := lo; i < hi; i++ {
				v := a[i]
				out[pos[v]] = v
				pos[v]++
			}
			done <- struct{}{}
		}(w, lo, hi)
	}

	for w := 0; w < workers; w++ {
		<-done
	}

	return out
}
```

