Sorting method that divides input into bounded-size chunks, sorts each chunk independently, and then combines the chunks.
Chunked sort divides a large input into smaller chunks, sorts each chunk, and then combines the sorted chunks. It is a practical pattern for sorting data that is too large or too inconvenient to process as one array.
The method appears in external sorting, streaming systems, parallel sorting, and batch data pipelines.
Problem
Given a sequence of records and a memory limit that can hold only records, sort the full sequence.
If , a direct in-memory sort is not possible. The input must be split into manageable parts.
Core Idea
The algorithm has two stages:
Chunk sorting Split the input into chunks of size at most and sort each chunk independently.
Chunk combining Combine the sorted chunks, usually by k-way merge.
chunked_sort(input, chunk_size):
chunks = []
while input has records:
chunk = read at most chunk_size records
sort chunk
chunks.append(chunk)
return merge_sorted_chunks(chunks)Example
Input:
Chunk size:
Sorted chunks:
| chunk | sorted |
|---|---|
| [8, 3, 5] | [3, 5, 8] |
| [1, 9, 2] | [1, 2, 9] |
| [7, 4] | [4, 7] |
Merge:
Final output:
Correctness
Each chunk is sorted independently. The merge phase repeatedly selects the smallest current element among all sorted chunks.
Since no later element in a sorted chunk can be smaller than its current head, each selected element is the smallest remaining element overall. Therefore the merged output is sorted.
Every input element belongs to exactly one chunk and is emitted exactly once during merging, so the result is a sorted permutation of the input.
Complexity
Let:
- be the number of records
- be the chunk size
- be the number of chunks
- be the merge fan-in
Chunk sorting cost:
Merge cost with a heap:
If all chunks are merged at once, then :
Total CPU cost is commonly:
External I/O follows the same pattern as external merge sort, with sequential reads and writes for chunks and merge passes.
Design Choices
| choice | effect |
|---|---|
| larger chunks | fewer chunks and fewer merge inputs |
| smaller chunks | lower memory pressure |
| merge fan-in | controls number of merge passes |
| stable chunk sort | helps preserve equal-key order locally |
| stable merge | preserves global stable ordering |
When to Use
Use chunked sort when:
- data exceeds memory
- input arrives in batches
- parallel chunk sorting is possible
- implementation simplicity matters
- temporary sorted runs are acceptable
It is also useful when sorting files, logs, large arrays, and database operator output.
Implementation Sketch
import heapq
def merge_sorted_chunks(chunks):
heap = []
iters = [iter(chunk) for chunk in chunks]
for chunk_id, it in enumerate(iters):
try:
value = next(it)
heapq.heappush(heap, (value, chunk_id))
except StopIteration:
pass
out = []
while heap:
value, chunk_id = heapq.heappop(heap)
out.append(value)
try:
nxt = next(iters[chunk_id])
heapq.heappush(heap, (nxt, chunk_id))
except StopIteration:
pass
return outdef chunked_sort(data, chunk_size):
chunks = []
for i in range(0, len(data), chunk_size):
chunk = sorted(data[i:i + chunk_size])
chunks.append(chunk)
return merge_sorted_chunks(chunks)Notes
Chunked sort is the simplest external sorting pattern. It becomes external merge sort when chunks are written as runs to disk and later merged with bounded memory.