Use a hierarchy of learned models to predict a key position, then search inside a bounded error range.
Recursive model index search uses several learned models arranged as a hierarchy. A top model selects a lower model, and the lower model predicts the likely position of a key in sorted data. The final prediction is then corrected by searching inside a bounded local range.
This is the main search procedure behind a recursive model index, often shortened to RMI. It replaces part of a traditional index with prediction.
Problem
Given a sorted array $A$ of length $n$, a target key $x$, and a hierarchy of models, find an index $i$ such that

$$A[i] = x,$$
or report failure.
Idea
A single model may be too inaccurate for a large or irregular key distribution. An RMI splits the prediction task into stages.
The first model predicts which submodel should handle the key. The selected submodel predicts the final position.
For a two stage RMI:

$$m = f_{\text{root}}(x), \qquad p = f_m(x),$$

where $f_m$ is the selected model and $p$ is the predicted position.
Because the prediction may be wrong by some bounded error $\epsilon_m$, the algorithm searches:

$$[\,\max(0,\; p - \epsilon_m),\ \min(n - 1,\; p + \epsilon_m)\,].$$
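The text leaves training unspecified. As a minimal sketch of how such a hierarchy might be built, the following fits least-squares linear models for the root and the leaves, and records each leaf's worst-case training error as its bound; the names `fit_linear` and `fit_rmi`, and the choice of linear models, are illustrative assumptions, not from the original.

```python
def fit_linear(xs, ys):
    """Least-squares line y = slope * x + intercept through the given points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var if var else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_rmi(keys, num_leaves):
    n = len(keys)
    # Root model: map a key to (roughly) its leaf index in [0, num_leaves).
    root = fit_linear(keys, [i * num_leaves / n for i in range(n)])
    # Partition training keys by the leaf the root routes them to.
    buckets = [[] for _ in range(num_leaves)]
    for i, k in enumerate(keys):
        m = min(num_leaves - 1, max(0, int(root(k))))
        buckets[m].append((k, i))
    leaves, errors = [], []
    for bucket in buckets:
        if not bucket:
            leaves.append(lambda x: 0.0)
            errors.append(n - 1)  # leaf saw no data: window covers everything
            continue
        xs, ys = zip(*bucket)
        leaf = fit_linear(xs, ys)
        leaves.append(leaf)
        # Verified bound: worst prediction error over this leaf's training keys.
        errors.append(max(abs(int(leaf(k)) - i) for k, i in bucket))
    return root, leaves, errors
```

The returned `errors` are exactly the per-leaf bounds that the search procedure relies on.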
Algorithm
```
recursive_model_index_search(A, x, root_model, leaf_models, errors):
    m = root_model(x)
    m = clamp(floor(m), 0, length(leaf_models) - 1)
    p = leaf_models[m](x)
    p = floor(p)
    epsilon = errors[m]
    left = max(0, p - epsilon)
    right = min(length(A) - 1, p + epsilon)
    return binary_search(A, x, left, right)
```

Example
Let the target key be $x = 70$.
Suppose the root model sends $x$ to leaf model $m$:

$$m = f_{\text{root}}(70).$$

The leaf model predicts:

$$p = f_m(70) = 6.$$

If the known error for leaf model $m$ is:

$$\epsilon_m = 1,$$

then search the interval $[5, 7]$:
| index | value |
|---|---|
| 5 | 60 |
| 6 | 70 |
| 7 | 80 |
Binary search returns index $6$.
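The numbers in this example can be replayed directly. The example fixes only $A[5..7]$, so the array entries outside those indices are invented here purely to make the snippet runnable:

```python
A = [10, 20, 30, 40, 50, 60, 70, 80]  # only A[5..7] is fixed by the example
x = 70

p = 6    # leaf prediction from the example
eps = 1  # leaf error bound from the example
left = max(0, p - eps)
right = min(len(A) - 1, p + eps)
print(left, right)  # 5 7

# Plain binary search inside the window.
while left <= right:
    mid = (left + right) // 2
    if A[mid] == x:
        break
    if A[mid] < x:
        left = mid + 1
    else:
        right = mid - 1
print(mid)  # 6
```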
Correctness
The model hierarchy is only a predictor. Correctness comes from the verified error bound attached to the selected leaf model.
If the true position of $x$ is guaranteed to lie within $\epsilon_m$ positions of the leaf prediction, then the bounded search interval contains the target whenever the target exists. Binary search over that interval is exact, so the algorithm returns the correct index or correctly reports failure.
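Operationally, the bound is a property that can be re-verified against the array at any time: every key must land within the stated bound of the leaf the root routes it to. A small checker sketch (the function name is assumed; the clamping mirrors the pseudocode):

```python
def verify_error_bounds(a, root_model, leaf_models, errors):
    """Return True iff every key's true index lies within the stated
    error bound of the leaf model the root routes it to."""
    for i, key in enumerate(a):
        m = max(0, min(int(root_model(key)), len(leaf_models) - 1))
        p = int(leaf_models[m](key))
        if abs(p - i) > errors[m]:
            return False
    return True
```

Running this once after training (or after any rebuild) is what justifies skipping a full binary search at query time.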
Complexity
Let $k$ be the number of model stages and let $w$ be the correction window size for the selected leaf (at most $2\epsilon_m + 1$).
| part | time |
|---|---|
| model traversal | $O(k)$ |
| correction search | $O(\log w)$ |
| total | $O(k + \log w)$ |
Space depends on the number and size of models.
For a two stage RMI with $L$ leaf models, space is:

$$O(L)$$

when each model has constant size.
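To make the payoff concrete, compare the correction search against a full binary search over the whole array; the numbers here ($10^8$ keys, a per-leaf bound of $100$) are purely illustrative:

```python
import math

n = 10**8      # array size (assumed for illustration)
epsilon = 100  # per-leaf error bound (assumed)

full_search = math.ceil(math.log2(n))                  # comparisons over all of A
window_search = math.ceil(math.log2(2 * epsilon + 1))  # comparisons inside the window
print(full_search, window_search)  # 27 8
```

A handful of model evaluations plus ~8 comparisons replaces ~27 comparisons, and the window's memory accesses are far more local.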
When to Use
Use recursive model index search when:
- keys are sorted
- the dataset is large
- the key distribution has predictable regions
- query volume is high
- the index is static or mostly static
- memory access cost matters
It is less suitable when data changes frequently, because insertions and deletions may invalidate the learned distribution and its error bounds.
Implementation
```python
def binary_search_range(a, x, left, right):
    """Standard binary search restricted to a[left..right]."""
    while left <= right:
        mid = (left + right) // 2
        if a[mid] == x:
            return mid
        if a[mid] < x:
            left = mid + 1
        else:
            right = mid - 1
    return -1


def recursive_model_index_search(a, x, root_model, leaf_models, errors):
    if not a:
        return -1
    # Stage 1: the root model selects a leaf model; clamp to a valid index.
    m = int(root_model(x))
    m = max(0, min(m, len(leaf_models) - 1))
    # Stage 2: the selected leaf model predicts a position.
    p = int(leaf_models[m](x))
    # Correct the prediction inside the leaf's verified error window.
    epsilon = errors[m]
    left = max(0, p - epsilon)
    right = min(len(a) - 1, p + epsilon)
    return binary_search_range(a, x, left, right)
```

```go
// binarySearchRange is a standard binary search restricted to a[left..right].
func binarySearchRange(a []int, x int, left int, right int) int {
	for left <= right {
		mid := (left + right) / 2
		if a[mid] == x {
			return mid
		}
		if a[mid] < x {
			left = mid + 1
		} else {
			right = mid - 1
		}
	}
	return -1
}

func RecursiveModelIndexSearch(
	a []int,
	x int,
	rootModel func(int) int,
	leafModels []func(int) int,
	errors []int,
) int {
	if len(a) == 0 {
		return -1
	}
	// Stage 1: the root model selects a leaf model; clamp to a valid index.
	m := rootModel(x)
	if m < 0 {
		m = 0
	}
	if m >= len(leafModels) {
		m = len(leafModels) - 1
	}
	// Stage 2: the selected leaf model predicts a position.
	p := leafModels[m](x)
	// Correct the prediction inside the leaf's verified error window.
	epsilon := errors[m]
	left := p - epsilon
	if left < 0 {
		left = 0
	}
	right := p + epsilon
	if right >= len(a) {
		right = len(a) - 1
	}
	return binarySearchRange(a, x, left, right)
}
```
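A quick end-to-end check of the Python version, using trivial hand-built models over evenly spaced keys. All values are assumed for illustration, and the two functions are repeated in compact form so the snippet runs standalone:

```python
def binary_search_range(a, x, left, right):
    while left <= right:
        mid = (left + right) // 2
        if a[mid] == x:
            return mid
        if a[mid] < x:
            left = mid + 1
        else:
            right = mid - 1
    return -1

def recursive_model_index_search(a, x, root_model, leaf_models, errors):
    if not a:
        return -1
    m = max(0, min(int(root_model(x)), len(leaf_models) - 1))
    p = int(leaf_models[m](x))
    left = max(0, p - errors[m])
    right = min(len(a) - 1, p + errors[m])
    return binary_search_range(a, x, left, right)

a = list(range(0, 100, 10))          # [0, 10, ..., 90]
root = lambda x: 0 if x < 50 else 1  # two leaves, split at key 50
leaves = [lambda x: x // 10] * 2     # exact position models
errors = [0, 0]                      # so the window is a single slot

print(recursive_model_index_search(a, 70, root, leaves, errors))  # 7
print(recursive_model_index_search(a, 35, root, leaves, errors))  # -1
```

With exact models and zero error bounds the "search" degenerates to a single probe; real trained models trade a wider window for much smaller model state.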