Distributed Range Partition Sort

Distributed range partition sort divides the key space into fixed ranges and assigns each range to a worker. Each worker sends records to the correct range owner, then sorts its received records locally.

This method is simpler than sample sort because it uses predefined ranges. It works well when the key distribution is known or bounded.

Problem

Given $n$ records distributed across $p$ workers, sort all records by key in nondecreasing order.

Each worker produces one sorted partition corresponding to a key interval.

Algorithm

distributed_range_sort(workers, ranges):
    broadcast range boundaries to all workers

    each worker partitions local records by range

    all_to_all exchange records to target workers

    each worker sorts received records

    return workers in range order

The ranges define key intervals:

$$ (-\infty, r_1], (r_1, r_2], \dots, (r_{p-1}, \infty) $$

Bucket Assignment

Each key is mapped to a range using binary search.

range_id(key, ranges):
    return upper_bound(ranges, key)

This determines the destination worker.

Communication

Each worker sends data to all others based on range.

for each worker u:
    for each record:
        v = range_id(key)
        send record to worker v

Each worker collects all records belonging to its assigned range.

Complexity

measure	value
partitioning	$O((n/p)\log p)$
communication	$O(n)$
local sorting	$O((n/p)\log(n/p))$
total work	$O(n\log(n/p))$

The performance depends strongly on how balanced the ranges are.

Correctness

Each range defines a disjoint interval of keys. All records assigned to worker $i$ fall within its interval. For any workers $i < j$, all keys in worker $i$ are less than or equal to all keys in worker $j$. Each worker sorts its own records locally, so all partitions are internally sorted and globally ordered.

Practical Considerations

Requires prior knowledge of key distribution.
Poor range selection causes load imbalance.
Heavy skew leads to one overloaded worker.
Easier to implement than sample based methods.
Often used when keys are uniformly distributed or bounded.
Communication cost can dominate runtime.

When to Use

Use distributed range partition sort when:

key distribution is known or predictable
ranges can be chosen accurately
simple implementation is preferred
distributed sorted output is acceptable

Avoid it when the key distribution is unknown or highly skewed. In such cases, sampling based methods give better balance.

Implementation Sketch

assign_ranges(samples, p):
    sort samples
    choose evenly spaced boundaries
    return ranges

worker_process(records, ranges):
    buckets = array of p empty lists

    for record in records:
        b = upper_bound(ranges, key(record))
        buckets[b].append(record)

    send buckets to corresponding workers

    received = receive from all workers
    sort received

    return received

Simplified Python Model

from bisect import bisect_right

def distributed_range_sort(records_by_worker, ranges):
    p = len(records_by_worker)
    inbox = [[] for _ in range(p)]

    for records in records_by_worker:
        for r in records:
            key = r[0]
            idx = bisect_right(ranges, key)
            inbox[idx].append(r)

    for i in range(p):
        inbox[i].sort(key=lambda x: x[0])

    return inbox