# Tape Merge Sort

# Tape Merge Sort

Tape merge sort is an external sorting algorithm designed for sequential storage. It was historically used with magnetic tapes, where random access was expensive or unavailable. The algorithm writes sorted runs to multiple tapes, then repeatedly merges runs from input tapes onto output tapes.

The same idea still matters for sequential external storage: make long sequential reads and writes, avoid random seeks, and reduce the number of full passes over the data.

## Problem

Given a dataset too large to fit in memory, sort it using only sequential reads and writes.

The input is divided into sorted runs. These runs are then merged until one sorted run remains.

## Core Idea

Tape merge sort uses two phases:

1. **Run generation**
   Read chunks into memory, sort them, and write sorted runs to output tapes.

2. **Tape merging**
   Merge runs from input tapes to output tapes. After each pass, swap input and output tape roles.

## Algorithm

```text id="oxp4s9"
tape_merge_sort(input, memory_size, tapes):
    runs = generate_sorted_runs(input, memory_size)
    distribute runs across input tapes

    while more than one run remains:
        merge runs from input tapes onto output tapes
        rewind tapes
        swap input tapes and output tapes

    return final run
```

A two-way tape merge uses two input tapes and one output tape.

```text id="ti6s2a"
merge_pass(input_tape_1, input_tape_2, output_tape):
    while both input tapes have runs:
        merge one run from input_tape_1
        with one run from input_tape_2
        write merged run to output_tape
```

## Example

Suppose initial sorted runs are:

$$
R_1, R_2, R_3, R_4
$$

Distribute them across two input tapes:

| tape | runs       |
| ---- | ---------- |
| T1   | $R_1, R_3$ |
| T2   | $R_2, R_4$ |

First merge pass:

| output tape | merged runs                          |
| ----------- | ------------------------------------ |
| T3          | merge($R_1, R_2$), merge($R_3, R_4$) |

Now there are two larger runs. A second pass merges those into one final run.

## Correctness

Each initial run is sorted. A merge of two sorted runs produces a larger sorted run by repeatedly selecting the smaller current head element.

Each merge pass preserves all records and reduces the number of runs. Since every pass produces sorted runs and the process stops only when one run remains, the final run contains all input records in sorted order.

## Complexity

Let:

* $N$ be the number of records
* $R$ be the number of initial runs
* $B$ be the block size
* $k$ be the merge fan-in

Number of merge passes:

$$
O(\log_k R)
$$

I/O cost:

$$
O\left(\frac{N}{B} \log_k R\right)
$$

CPU cost for k-way merging:

$$
O(N \log k)
$$

per merge level when using a heap.

## Tape Scheduling

Tape merge sort differs from ordinary external merge sort mainly in scheduling.

| schedule        | idea                                          |
| --------------- | --------------------------------------------- |
| balanced merge  | distribute runs evenly across tapes           |
| polyphase merge | distribute runs unevenly to reduce idle tapes |
| cascade merge   | pass merged runs through staged levels        |

For real tape devices, minimizing rewinds and redistribution can dominate performance.

## When to Use

Tape merge sort is useful when:

* storage supports fast sequential access but poor random access
* data is much larger than memory
* runs must be processed in passes
* input and output streams are naturally sequential

It is mostly historical as a named method, but its I/O discipline remains relevant for external sorting and streaming pipelines.

## Implementation Sketch

```python id="e2d8vf"
def merge_two_runs(a, b):
    i = 0
    j = 0
    out = []

    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1

    out.extend(a[i:])
    out.extend(b[j:])
    return out
```

```python id="hx47sm"
def tape_merge_sort_runs(runs):
    while len(runs) > 1:
        next_runs = []

        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                next_runs.append(merge_two_runs(runs[i], runs[i + 1]))
            else:
                next_runs.append(runs[i])

        runs = next_runs

    return runs[0] if runs else []
```

## Notes

Tape merge sort is best understood as merge sort under sequential-storage constraints. The sorting rule is ordinary merging. The important part is how runs are placed, read, written, rewound, and redistributed across sequential devices.

