# 6.16 Nearly Sorted Data

# 6.16 Nearly Sorted Data

Nearly sorted inputs appear frequently in practice. Logs appended over time, incremental updates, and cached results often differ only slightly from sorted order. Algorithms that exploit this structure can run in linear or near-linear time.

## Problem

You have an array `A[0..n-1]` that is close to sorted. You want to sort it faster than general-purpose `O(n log n)` algorithms.

## Measuring “Nearly Sorted”

Several measures capture how far an array is from sorted.

**Number of inversions**

An inversion is a pair `(i, j)` with `i < j` and `A[i] > A[j]`. A sorted array has zero inversions.

**Maximum displacement**

Each element moves at most `d` positions from its sorted location.

**Runs**

The array can be decomposed into a small number of already sorted segments.

Different algorithms exploit different measures.

## Insertion Sort on Nearly Sorted Data

Insertion sort adapts to low inversion count.

For each element, the cost is proportional to how far it must move left. If most elements are already near their correct positions, the total work is small.

Time complexity becomes:

```text id="8o8ztt"
O(n + number_of_inversions)
```

If inversions are few, this is close to linear.

## Example

```text id="pf7p2r"
[1, 2, 3, 5, 4, 6, 7]
```

Only one inversion exists: `(5, 4)`. Insertion sort shifts `5` once and finishes quickly.

## Heap-Based Method for Bounded Displacement

If each element is at most `d` positions away from its correct position, use a min-heap of size `d + 1`.

```text id="9u7xw0"
sort_k_sorted(A, d):
  heap = first d + 1 elements

  for i = d + 1 to n - 1:
    output extract_min(heap)
    insert A[i] into heap

  while heap not empty:
    output extract_min(heap)
```

## Invariant

At each step, the smallest remaining element must lie within the next `d + 1` elements of the input. The heap stores exactly those candidates.

## Complexity

Each heap operation costs:

```text id="h2pbsi"
O(log d)
```

Total time:

```text id="7h9g4q"
O(n log d)
```

When `d` is small, this is close to linear.

## Run-Based Sorting

If the array consists of sorted runs, detect them and merge.

```text id="v2q0yo"
identify runs
merge runs using k-way merge
```

If there are `r` runs, merging costs:

```text id="kp6tq8"
O(n log r)
```

This is efficient when `r` is small.

This idea is used in practical algorithms such as Timsort, which detects natural runs and merges them.

## Adaptive Merge Strategy

Instead of fixed-size splits, scan the array to find increasing or decreasing segments:

```text id="7a7o8t"
[1, 2, 3] [9, 8, 7] [10, 11]
```

Reverse decreasing runs to make them increasing:

```text id="v6r1zt"
[1, 2, 3] [7, 8, 9] [10, 11]
```

Then merge these runs efficiently.

## Correctness

Each method relies on a structural invariant:

* Insertion sort maintains a sorted prefix and shifts only necessary elements
* Heap method maintains a candidate window containing the next correct element
* Run-based methods maintain sorted subarrays and merge them correctly

All methods preserve permutation because they only move existing elements.

## Choosing a Strategy

| Property                 | Recommended method    |
| ------------------------ | --------------------- |
| Few inversions           | Insertion sort        |
| Bounded displacement `d` | Heap-based sort       |
| Few sorted runs          | Run detection + merge |
| Unknown structure        | Hybrid adaptive sort  |

## Practical Algorithms

Modern standard libraries use adaptive algorithms.

Timsort, used in Python and Java, combines:

```text id="p3c0s3"
run detection
insertion sort for small runs
merge sort for combining runs
```

It guarantees:

```text id="s1skgj"
O(n log n) worst case
O(n) on nearly sorted input
```

## Common Bugs

A common mistake is assuming “nearly sorted” without measuring it. Algorithms like insertion sort degrade to quadratic time if the assumption is false.

Another bug is choosing `d` incorrectly in the heap method. If `d` is underestimated, correctness fails.

A third bug is failing to reverse descending runs before merging, which breaks the assumption that runs are sorted.

## Takeaway

Nearly sorted data allows algorithms to skip unnecessary work. By exploiting structure such as small inversions, bounded displacement, or existing runs, sorting can approach linear time in practice.

