
Cache Efficient Merge Sort

Merge sort variant designed to improve cache locality by structuring recursion and merging to fit memory hierarchies.

Cache efficient merge sort improves merge sort by reducing cache misses. It reorganizes recursion and merging so that working data fits into cache levels.

The goal is to minimize memory transfers rather than just comparisons.

Problem

Given an array A of length n, reorder it such that:

A[0] \le A[1] \le \dots \le A[n-1]

while improving cache locality.

Idea

Standard merge sort performs well in theory but may suffer from poor cache usage due to repeated passes over large memory regions.

Cache efficient variants aim to:

  • divide the problem until subarrays fit into cache
  • perform merges on contiguous memory blocks
  • reduce repeated scanning of large arrays

A common approach uses blocked recursion:

  1. recursively sort subarrays until they fit into cache
  2. merge subarrays in a way that maximizes sequential access
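The base-case size in step 1 is usually derived from the cache geometry. A minimal sketch of that calculation, with hypothetical sizes (`L1_CACHE_BYTES` and `ITEM_BYTES` are illustrative assumptions, not measured values):

```python
# Hypothetical sizes; real code would query the hardware or tune empirically.
L1_CACHE_BYTES = 32 * 1024  # a common L1 data-cache size
ITEM_BYTES = 8              # e.g. 64-bit keys

# During a merge, a subarray and its temporary buffer are both live,
# so leave room for two copies of the block.
THRESHOLD = L1_CACHE_BYTES // (2 * ITEM_BYTES)

print(THRESHOLD)  # 2048 elements per base-case block
```

In practice the threshold is often tuned empirically, since TLB behavior and prefetching also affect the sweet spot.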

Algorithm

cache_merge_sort(A, lo, hi):
    if hi - lo <= threshold:
        insertion_sort(A, lo, hi)
        return

    mid = (lo + hi) // 2

    cache_merge_sort(A, lo, mid)
    cache_merge_sort(A, mid + 1, hi)

    cache_merge(A, lo, mid, hi)

The merge step is optimized for locality.

cache_merge(A, lo, mid, hi):
    copy A[lo..mid] into buffer

    i = 0
    j = mid + 1
    k = lo

    while i < len(buffer) and j <= hi:
        if buffer[i] <= A[j]:
            A[k] = buffer[i]
            i += 1
        else:
            A[k] = A[j]
            j += 1
        k += 1

    copy remaining buffer elements back
    (any leftovers from A[mid+1..hi] are already in their final place)

Example

Let:

A = [8, 3, 7, 1, 6, 2, 5, 4]

Divide recursively until small subarrays fit in cache:

  • sort small blocks locally
  • merge them sequentially

Final result:

[1, 2, 3, 4, 5, 6, 7, 8]
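The example can be traced in a few lines of Python; `heapq.merge` stands in for the sequential block merge, and a block size of 2 is assumed so the trace stays short:

```python
import heapq

A = [8, 3, 7, 1, 6, 2, 5, 4]

# Sort cache-sized blocks locally (here: blocks of 2).
blocks = [sorted(A[i:i + 2]) for i in range(0, len(A), 2)]
# blocks == [[3, 8], [1, 7], [2, 6], [4, 5]]

# Merge neighbouring blocks pairwise until one sorted run remains.
while len(blocks) > 1:
    blocks = [list(heapq.merge(blocks[i], blocks[i + 1]))
              for i in range(0, len(blocks), 2)]

print(blocks[0])  # [1, 2, 3, 4, 5, 6, 7, 8]
```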

Correctness

Each recursive call sorts a subarray. The merge step combines two sorted subarrays into one sorted array. The cache optimizations do not change ordering logic, so correctness follows directly from merge sort.

Complexity

| metric           | value           |
| ---------------- | --------------- |
| time             | O(n \log n)     |
| cache complexity | improved        |
| extra space      | O(n)            |

The number of cache misses is reduced compared to naive merge sort.
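The "improved" cache complexity can be made precise in the external-memory model, with a cache holding M items and lines of B items. A standard analysis, quoted here as a reference rather than derived:

```latex
% Two-way merge sort whose base case fits in cache:
Q(n) = \Theta\!\left(\frac{n}{B}\log_2\frac{n}{M}\right)
% An (M/B)-way merge achieves the optimal sorting bound:
Q(n) = \Theta\!\left(\frac{n}{B}\log_{M/B}\frac{n}{B}\right)
```

Intuitively, each merge pass streams n items at a cost of about n/B transfers, and stopping the recursion once subarrays fit in cache removes the lowest \log_2 M levels of passes.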

Properties

| property          | value    |
| ----------------- | -------- |
| stable            | yes      |
| in-place          | no       |
| cache locality    | improved |
| sequential access | high     |

Notes

Cache efficient merge sort focuses on memory hierarchy rather than asymptotic complexity. It is important for large datasets where memory bandwidth dominates performance.

This idea leads to more advanced designs such as cache oblivious algorithms and external memory sorting.

Implementation

def cache_merge_sort(a):
    # Blocked recursion: small subarrays are handled by insertion sort
    # (cache-resident, sequential), larger ones by recursive merge sort
    # that copies only the left half into a temporary buffer.
    THRESHOLD = 16  # base-case size; tune to the target cache

    def sort(lo, hi):
        if hi - lo <= THRESHOLD:
            insertion_sort(lo, hi)
            return

        mid = (lo + hi) // 2
        sort(lo, mid)
        sort(mid + 1, hi)
        merge(lo, mid, hi)

    def insertion_sort(lo, hi):
        # Sort a[lo..hi] in place; fast on blocks that fit in cache.
        for i in range(lo + 1, hi + 1):
            key = a[i]
            j = i - 1
            while j >= lo and a[j] > key:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = key

    def merge(lo, mid, hi):
        # Copy only the left half; the right half is merged in place.
        left = a[lo:mid + 1]
        i = 0
        j = mid + 1
        k = lo

        while i < len(left) and j <= hi:
            if left[i] <= a[j]:  # <= keeps equal keys in order (stability)
                a[k] = left[i]
                i += 1
            else:
                a[k] = a[j]
                j += 1
            k += 1

        # Drain the buffer; right-half leftovers are already in place.
        while i < len(left):
            a[k] = left[i]
            i += 1
            k += 1

    sort(0, len(a) - 1)
    return a
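For contrast, the same idea can be written bottom-up, which avoids recursion and makes the sequential block structure explicit. This is a hedged sketch, not the implementation above; `heapq.merge` plays the role of `cache_merge`, and `block` is the assumed cache-sized run length:

```python
import heapq

def blocked_merge_sort(a, block=16):
    """Bottom-up blocked merge sort sketch (block = assumed cache-sized run)."""
    if not a:
        return []
    # Pass 1: sort cache-sized runs locally.
    runs = [sorted(a[i:i + block]) for i in range(0, len(a), block)]
    # Passes 2+: merge neighbouring runs pairwise, doubling run length.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + 2]))
                for i in range(0, len(runs), 2)]
    return runs[0]
```

Each pass streams sequentially through memory, which is exactly the access pattern the recursive version is engineered to produce.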