6.16 Nearly Sorted Data

Exploit near-sorted structure in inputs like append-only logs or incremental updates to sort in linear or near-linear time.

Nearly sorted inputs appear frequently in practice. Logs appended over time, incremental updates, and cached results often differ only slightly from sorted order. Algorithms that exploit this structure can run in linear or near-linear time.

Problem

You have an array A[0..n-1] that is close to sorted. You want to sort it faster than general-purpose O(n log n) algorithms.

Measuring “Nearly Sorted”

Several measures capture how far an array is from sorted.

Number of inversions

An inversion is a pair (i, j) with i < j and A[i] > A[j]. A sorted array has zero inversions.
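Inversions can be counted exactly in O(n log n) with a merge-sort-style pass; this helper (the name count_inversions is ours, not from the text) is a sketch of that standard technique.

```python
def count_inversions(a):
    """Count pairs (i, j) with i < j and a[i] > a[j] via merge sort."""
    def sort_count(xs):
        if len(xs) <= 1:
            return xs, 0
        mid = len(xs) // 2
        left, inv_l = sort_count(xs[:mid])
        right, inv_r = sort_count(xs[mid:])
        merged, i, j = [], 0, 0
        inv = inv_l + inv_r
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                # every remaining element of left exceeds right[j]
                merged.append(right[j]); j += 1
                inv += len(left) - i
        merged.extend(left[i:]); merged.extend(right[j:])
        return merged, inv
    return sort_count(list(a))[1]
```

A sorted array reports zero inversions; a reversed array of length n reports n(n-1)/2.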

Maximum displacement

Each element moves at most d positions from its sorted location.

Runs

The array can be decomposed into a small number of already sorted segments.

Different algorithms exploit different measures.

Insertion Sort on Nearly Sorted Data

Insertion sort adapts to low inversion count.

For each element, the cost is proportional to how far it must move left. If most elements are already near their correct positions, the total work is small.

Time complexity becomes:

O(n + number_of_inversions)

If inversions are few, this is close to linear.

Example

[1, 2, 3, 5, 4, 6, 7]

Only one inversion exists: the pair (5, 4) at indices 3 and 4. Insertion sort shifts 5 one position to the right and finishes quickly.
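The adaptive behavior is visible if we count shifts: total shifts equal the number of inversions. A minimal sketch (the shift counter is ours, added for illustration):

```python
def insertion_sort(a):
    """Sort a in place; returns the number of element shifts performed."""
    shifts = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]  # shift one position right
            j -= 1
            shifts += 1
        a[j + 1] = key
    return shifts

data = [1, 2, 3, 5, 4, 6, 7]
insertion_sort(data)  # performs exactly one shift: 5 moves past 4
```

On this input the loop body runs once, matching the O(n + inversions) bound.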

Heap-Based Method for Bounded Displacement

If each element is at most d positions away from its correct position, use a min-heap of size d + 1.

sort_k_sorted(A, d):
  heap = min-heap of A[0 .. d]

  for i = d + 1 to n - 1:
    output extract_min(heap)
    insert A[i] into heap

  while heap not empty:
    output extract_min(heap)
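The pseudocode above translates directly to Python using the standard heapq module; this is a sketch under the stated assumption that every element is at most d positions from its sorted location.

```python
import heapq

def sort_k_sorted(a, d):
    """Sort a, assuming each element is at most d positions from its place."""
    heap = a[:d + 1]
    heapq.heapify(heap)          # window of d + 1 candidates
    out = []
    for x in a[d + 1:]:
        out.append(heapq.heappop(heap))  # smallest must be in the window
        heapq.heappush(heap, x)
    while heap:
        out.append(heapq.heappop(heap))
    return out
```

Each element passes through the heap once, giving the O(n log d) total.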

Invariant

At each step, the smallest remaining element must lie within the next d + 1 elements of the input. The heap stores exactly those candidates.

Complexity

Each heap operation costs:

O(log d)

Total time:

O(n log d)

When d is small, this is close to linear.

Run-Based Sorting

If the array consists of sorted runs, detect them and merge.

identify runs
merge runs using k-way merge

If there are r runs, merging costs:

O(n log r)

This is efficient when r is small.
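Given the runs, the k-way merge itself is a one-liner in Python: heapq.merge keeps a heap of r iterators, which matches the O(n log r) cost.

```python
import heapq

# runs assumed already identified as individually sorted lists
runs = [[1, 2, 3], [7, 8, 9], [10, 11]]
merged = list(heapq.merge(*runs))  # k-way merge across r = 3 runs
```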

This idea is used in practical algorithms such as Timsort, which detects natural runs and merges them.

Adaptive Merge Strategy

Instead of fixed-size splits, scan the array to find increasing or decreasing segments:

[1, 2, 3] [9, 8, 7] [10, 11]

Reverse decreasing runs to make them increasing:

[1, 2, 3] [7, 8, 9] [10, 11]

Then merge these runs efficiently.
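The scan-reverse-merge idea can be sketched as follows (find_runs and run_sort are our names). Note that a greedy maximal-run scan may split the example differently than shown above, e.g. attaching 9 to the first run; every detected run is still sorted, so the merge still produces sorted output.

```python
import heapq

def find_runs(a):
    """Split a into maximal runs, reversing decreasing runs in place."""
    runs, i, n = [], 0, len(a)
    while i < n:
        j = i + 1
        if j < n and a[j] < a[i]:                 # strictly decreasing run
            while j < n and a[j] < a[j - 1]:
                j += 1
            runs.append(a[i:j][::-1])             # reverse to make increasing
        else:                                     # non-decreasing run
            while j < n and a[j] >= a[j - 1]:
                j += 1
            runs.append(a[i:j])
        i = j
    return runs

def run_sort(a):
    return list(heapq.merge(*find_runs(a)))
```

Reversing a strictly decreasing run is safe because reversal is stable on distinct keys; Timsort applies the same trick.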

Correctness

Each method relies on a structural invariant:

  • Insertion sort maintains a sorted prefix and shifts only necessary elements
  • Heap method maintains a candidate window containing the next correct element
  • Run-based methods maintain sorted subarrays and merge them correctly

All methods preserve permutation because they only move existing elements.

Choosing a Strategy

Property                  Recommended method
Few inversions            Insertion sort
Bounded displacement d    Heap-based sort
Few sorted runs           Run detection + merge
Unknown structure         Hybrid adaptive sort

Practical Algorithms

Modern standard libraries use adaptive algorithms.

Timsort, used in Python and Java, combines:

  • run detection
  • insertion sort for small runs
  • merge sort for combining runs

It guarantees:

  • O(n log n) worst case
  • O(n) on nearly sorted input

Common Bugs

A common mistake is assuming “nearly sorted” without measuring it. Algorithms like insertion sort degrade to quadratic time if the assumption is false.

Another bug is choosing d incorrectly in the heap method. If d is underestimated, the true minimum can lie outside the heap's window when it should be emitted, and the output is not sorted.

A third bug is failing to reverse descending runs before merging, which breaks the assumption that runs are sorted.

Takeaway

Nearly sorted data allows algorithms to skip unnecessary work. By exploiting structure such as small inversions, bounded displacement, or existing runs, sorting can approach linear time in practice.