Learned Index Search

Learned index search treats an ordered index as a prediction problem. Given a key, a model estimates the position where that key should appear in sorted data. The algorithm then corrects the estimate by searching a small range around the predicted position.

The main idea is that a sorted array defines a cumulative distribution function from key to position. If this function is predictable, a model can replace part of the traditional index.

Problem

Given a sorted array $A$ of length $n$ and a target key $x$ , find an index $i$ such that

A[i] = x

or report failure.

Idea

Train or define a model $M$ that predicts the approximate rank of a key:

M(x) \approx i

where $i$ is the position where $x$ should occur.

Because the prediction may have error, search in a bounded window:

[M(x) - \epsilon, M(x) + \epsilon]

where $\epsilon$ is the maximum known prediction error.

Algorithm

learned_index_search(A, x, M, epsilon):
    p = M(x)

    left = max(0, floor(p) - epsilon)
    right = min(length(A) - 1, floor(p) + epsilon)

    return binary_search(A, x, left, right)

Example

Let

A = [10, 20, 30, 40, 50, 60, 70, 80]

Suppose the model predicts:

M(60) = 5.2

and the error bound is:

\epsilon = 2

Then search only the range:

[3, 7]

index	value
3	40
4	50
5	60
6	70
7	80

Binary search in this range returns index $5$ .

Correctness

The model does not need to be exact. Correctness depends on the error bound. If every true position lies within $\epsilon$ of the predicted position, then the target must be inside the bounded search window whenever it exists.

The final binary search is exact on that window. Therefore, if the key exists and the error bound is valid, the algorithm returns a correct index. If the key is absent, binary search correctly reports failure within the only possible window.

Complexity

Let $\epsilon$ be the maximum prediction error.

operation	time
model evaluation	$O(1)$ or model dependent
correction search	$O(\log \epsilon)$
total	$O(1 + \log \epsilon)$

Space depends on the model. A simple linear model uses $O(1)$ space. A recursive or piecewise model may use more.

When to Use

Use learned index search when:

keys are sorted
query volume is high
the key distribution is predictable
memory access dominates comparison cost
an approximate rank model can be maintained

It is most useful for static or mostly static indexes. If inserts and deletes are frequent, the model must be retrained or updated.

Implementation

def learned_index_search(a, x, model, epsilon):
    if not a:
        return -1

    p = int(model(x))
    left = max(0, p - epsilon)
    right = min(len(a) - 1, p + epsilon)

    while left <= right:
        mid = (left + right) // 2
        if a[mid] == x:
            return mid
        elif a[mid] < x:
            left = mid + 1
        else:
            right = mid - 1

    return -1

func LearnedIndexSearch(a []int, x int, model func(int) int, epsilon int) int {
    if len(a) == 0 {
        return -1
    }

    p := model(x)
    left := p - epsilon
    if left < 0 {
        left = 0
    }

    right := p + epsilon
    if right >= len(a) {
        right = len(a) - 1
    }

    for left <= right {
        mid := (left + right) / 2
        if a[mid] == x {
            return mid
        } else if a[mid] < x {
            left = mid + 1
        } else {
            right = mid - 1
        }
    }

    return -1
}