External sorting method that uses memory-mapped files so large data can be accessed through virtual memory while still sorting in chunks and merging runs.
Memory mapped external sort uses memory-mapped files to sort data larger than ordinary working memory. Instead of manually reading and writing file blocks with explicit I/O calls, the program maps file regions into virtual memory and lets the operating system page data in and out.
This can simplify implementation, but it does not remove the external-memory problem. The algorithm still needs chunking, sorted runs, and merging to avoid random page faults and excessive memory pressure.
Problem
Given a large file of records, sort the records without loading the entire file into normal memory.
The file may be larger than available RAM, but the operating system can map file pages into the process address space.
Core Idea
Use memory mapping for file access, but keep the sorting pattern external-memory friendly:
- map a region of the input file
- sort a chunk that fits comfortably in memory
- write or map the sorted chunk as a run
- merge the sorted runs sequentially
The memory map is an access mechanism. The sorting strategy is still external merge sort.
Algorithm
memory_mapped_external_sort(file, chunk_size):
runs = []
for each chunk region in file:
A = map chunk region
sort A in memory
run = write sorted chunk
runs.append(run)
unmap A
return merge_runs(runs)The merge phase should read runs sequentially.
merge_mapped_runs(runs):
map small windows from each run
keep current value from each run in a heap
while heap is not empty:
output minimum value
advance inside that run window
if window is exhausted:
map next window from that runExample
Suppose the file contains:
and memory allows chunks of 3 records.
Run generation:
| chunk | sorted run |
|---|---|
| [8, 3, 5] | [3, 5, 8] |
| [1, 9, 2] | [1, 2, 9] |
| [7, 4] | [4, 7] |
Merge the runs:
Final output:
Correctness
Each mapped chunk is sorted before being written as a run. Therefore every generated run is sorted.
The merge phase repeatedly emits the smallest current element among all run heads. Since each run is sorted, no later element in a run can be smaller than its current head. Thus every emitted element is the smallest remaining element overall.
All records belong to exactly one chunk and then exactly one run. The final merge emits every record once, so the output is a sorted permutation of the input.
Complexity
Let:
- be the number of records
- be the chosen chunk size
- be the number of runs
- be the page or block size
- be the merge fan-in
CPU cost for run generation:
Merge CPU cost:
per merge level.
I/O cost is similar to external merge sort:
The actual cost depends heavily on page faults, readahead, dirty-page writeback, and mapping granularity.
Design Choices
| choice | effect |
|---|---|
| chunk size | controls memory pressure and run count |
| mapped window size | controls page-fault behavior during merge |
| sequential access | improves OS readahead |
| random access | may cause many page faults |
| explicit unmapping | releases address space and page pressure |
| temp run layout | affects merge locality |
When to Use
Use memory mapped external sort when:
- the data is file-backed
- record layout is fixed-width or easy to address
- the platform has good memory-mapping support
- implementation simplicity matters
- sequential access can be preserved
Avoid relying on memory mapping alone for huge random-access sorts. The operating system may thrash if the working set is much larger than RAM.
Implementation Sketch
This simplified example uses arrays, but the structure matches a memory-mapped external sort.
def make_sorted_runs(data, chunk_size):
runs = []
for i in range(0, len(data), chunk_size):
chunk = data[i:i + chunk_size]
runs.append(sorted(chunk))
return runsimport heapq
def merge_runs(runs):
heap = []
iters = [iter(run) for run in runs]
for run_id, it in enumerate(iters):
try:
heapq.heappush(heap, (next(it), run_id))
except StopIteration:
pass
out = []
while heap:
value, run_id = heapq.heappop(heap)
out.append(value)
try:
nxt = next(iters[run_id])
heapq.heappush(heap, (nxt, run_id))
except StopIteration:
pass
return outdef memory_mapped_external_sort_model(data, chunk_size):
runs = make_sorted_runs(data, chunk_size)
return merge_runs(runs)Notes
Memory mapped external sort is mostly an engineering variant of external merge sort. The main decision is not whether mapping is possible, but whether the algorithm still gives the operating system predictable sequential access.