Sort distributed data by assigning fixed key ranges to workers, routing records by range, then sorting locally.
Distributed range partition sort divides the key space into fixed ranges and assigns each range to a worker. Each worker sends records to the correct range owner, then sorts its received records locally.
This method is simpler than sample sort because it uses predefined ranges. It works well when the key distribution is known or bounded.
Problem
Given records distributed across workers, sort all records by key in nondecreasing order.
Each worker produces one sorted partition corresponding to a key interval.
Algorithm
distributed_range_sort(workers, ranges):
broadcast range boundaries to all workers
each worker partitions local records by range
all_to_all exchange records to target workers
each worker sorts received records
return workers in range orderThe ranges define key intervals:
Bucket Assignment
Each key is mapped to a range using binary search.
range_id(key, ranges):
return upper_bound(ranges, key)This determines the destination worker.
Communication
Each worker sends data to all others based on range.
for each worker u:
for each record:
v = range_id(key)
send record to worker vEach worker collects all records belonging to its assigned range.
Complexity
| measure | value |
|---|---|
| partitioning | |
| communication | |
| local sorting | |
| total work |
The performance depends strongly on how balanced the ranges are.
Correctness
Each range defines a disjoint interval of keys. All records assigned to worker fall within its interval. For any workers , all keys in worker are less than or equal to all keys in worker . Each worker sorts its own records locally, so all partitions are internally sorted and globally ordered.
Practical Considerations
- Requires prior knowledge of key distribution.
- Poor range selection causes load imbalance.
- Heavy skew leads to one overloaded worker.
- Easier to implement than sample based methods.
- Often used when keys are uniformly distributed or bounded.
- Communication cost can dominate runtime.
When to Use
Use distributed range partition sort when:
- key distribution is known or predictable
- ranges can be chosen accurately
- simple implementation is preferred
- distributed sorted output is acceptable
Avoid it when the key distribution is unknown or highly skewed. In such cases, sampling based methods give better balance.
Implementation Sketch
assign_ranges(samples, p):
sort samples
choose evenly spaced boundaries
return rangesworker_process(records, ranges):
buckets = array of p empty lists
for record in records:
b = upper_bound(ranges, key(record))
buckets[b].append(record)
send buckets to corresponding workers
received = receive from all workers
sort received
return receivedSimplified Python Model
from bisect import bisect_right
def distributed_range_sort(records_by_worker, ranges):
p = len(records_by_worker)
inbox = [[] for _ in range(p)]
for records in records_by_worker:
for r in records:
key = r[0]
idx = bisect_right(ranges, key)
inbox[idx].append(r)
for i in range(p):
inbox[i].sort(key=lambda x: x[0])
return inbox