Use a hierarchy of learned models to predict a key position, then search inside a bounded error range.
Recursive model index search uses several learned models arranged as a hierarchy. A top model selects a lower model, and the lower model predicts the likely position of a key in sorted data. The final prediction is then corrected by searching inside a bounded local range.
This is the main search procedure behind a recursive model index, often shortened to RMI. It replaces part of a traditional index with prediction.
Problem
Given a sorted array $A$ of length $n$, a target key $x$, and a hierarchy of models, find an index $i$ such that

$$A[i] = x,$$
or report failure.
Idea
A single model may be too inaccurate for a large or irregular key distribution. An RMI splits the prediction task into stages.
The first model predicts which submodel should handle the key. The selected submodel predicts the final position.
For a two stage RMI:

$$m = f_{\text{root}}(x), \qquad p = f_m(x),$$

where $f_m$ is the selected model and $p$ is the predicted position.
Because the prediction may be wrong by some bounded error $\epsilon_m$, the algorithm searches:

$$[\,\max(0,\; p - \epsilon_m),\ \min(n - 1,\; p + \epsilon_m)\,].$$
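The text leaves training unspecified. As a minimal sketch of how such a hierarchy might be built, the following fits least-squares linear models for the root and the leaves, and records each leaf's worst-case training error as its bound; the names `fit_linear` and `fit_rmi`, and the choice of linear models, are illustrative assumptions, not from the original.

```python
def fit_linear(xs, ys):
    """Least-squares line y = slope * x + intercept through the given points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var if var else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_rmi(keys, num_leaves):
    n = len(keys)
    # Root model: map a key to (roughly) its leaf index in [0, num_leaves).
    root = fit_linear(keys, [i * num_leaves / n for i in range(n)])
    # Partition training keys by the leaf the root routes them to.
    buckets = [[] for _ in range(num_leaves)]
    for i, k in enumerate(keys):
        m = min(num_leaves - 1, max(0, int(root(k))))
        buckets[m].append((k, i))
    leaves, errors = [], []
    for bucket in buckets:
        if not bucket:
            leaves.append(lambda x: 0.0)
            errors.append(n - 1)  # leaf saw no data: window covers everything
            continue
        xs, ys = zip(*bucket)
        leaf = fit_linear(xs, ys)
        leaves.append(leaf)
        # Verified bound: worst prediction error over this leaf's training keys.
        errors.append(max(abs(int(leaf(k)) - i) for k, i in bucket))
    return root, leaves, errors
```

The returned `errors` are exactly the per-leaf bounds that the search procedure relies on.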
Algorithm
```
recursive_model_index_search(A, x, root_model, leaf_models, errors):
    m = root_model(x)
    m = clamp(floor(m), 0, length(leaf_models) - 1)
    p = leaf_models[m](x)
    p = floor(p)
    epsilon = errors[m]
    left = max(0, p - epsilon)
    right = min(length(A) - 1, p + epsilon)
    return binary_search(A, x, left, right)
```

Example
Let the target key be $x = 70$.
Suppose the root model sends $x$ to leaf model $m$:

$$m = f_{\text{root}}(70).$$

The leaf model predicts:

$$p = f_m(70) = 6.$$

If the known error for leaf model $m$ is:

$$\epsilon_m = 1,$$

then search the interval $[5, 7]$:
| index | value |
|---|---|
| 5 | 60 |
| 6 | 70 |
| 7 | 80 |
Binary search returns index $6$.
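The numbers in this example can be replayed directly. The example fixes only $A[5..7]$, so the array entries outside those indices are invented here purely to make the snippet runnable:

```python
A = [10, 20, 30, 40, 50, 60, 70, 80]  # only A[5..7] is fixed by the example
x = 70

p = 6    # leaf prediction from the example
eps = 1  # leaf error bound from the example
left = max(0, p - eps)
right = min(len(A) - 1, p + eps)
print(left, right)  # 5 7

# Plain binary search inside the window.
while left <= right:
    mid = (left + right) // 2
    if A[mid] == x:
        break
    if A[mid] < x:
        left = mid + 1
    else:
        right = mid - 1
print(mid)  # 6
```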
Correctness
The model hierarchy is only a predictor. Correctness comes from the verified error bound attached to the selected leaf model.
If the true position of $x$ is guaranteed to lie within $\epsilon_m$ positions of the leaf prediction, then the bounded search interval contains the target whenever the target exists. Binary search over that interval is exact, so the algorithm returns the correct index or correctly reports failure.
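Operationally, the bound is a property that can be re-verified against the array at any time: every key must land within the stated bound of the leaf the root routes it to. A small checker sketch (the function name is assumed; the clamping mirrors the pseudocode):

```python
def verify_error_bounds(a, root_model, leaf_models, errors):
    """Return True iff every key's true index lies within the stated
    error bound of the leaf model the root routes it to."""
    for i, key in enumerate(a):
        m = max(0, min(int(root_model(key)), len(leaf_models) - 1))
        p = int(leaf_models[m](key))
        if abs(p - i) > errors[m]:
            return False
    return True
```

Running this once after training (or after any rebuild) is what justifies skipping a full binary search at query time.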
Complexity
Let $k$ be the number of model stages and let $w$ be the correction window size for the selected leaf (at most $2\epsilon_m + 1$).
| part | time |
|---|---|
| model traversal | $O(k)$ |
| correction search | $O(\log w)$ |
| total | $O(k + \log w)$ |
Space depends on the number and size of models.
For a two stage RMI with $L$ leaf models, space is:

$$O(L)$$

when each model has constant size.
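To make the payoff concrete, compare the correction search against a full binary search over the whole array; the numbers here ($10^8$ keys, a per-leaf bound of $100$) are purely illustrative:

```python
import math

n = 10**8      # array size (assumed for illustration)
epsilon = 100  # per-leaf error bound (assumed)

full_search = math.ceil(math.log2(n))                  # comparisons over all of A
window_search = math.ceil(math.log2(2 * epsilon + 1))  # comparisons inside the window
print(full_search, window_search)  # 27 8
```

A handful of model evaluations plus ~8 comparisons replaces ~27 comparisons, and the window's memory accesses are far more local.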
When to Use
Use recursive model index search when:
- keys are sorted
- the dataset is large
- the key distribution has predictable regions
- query volume is high
- the index is static or mostly static
- memory access cost matters
It is less suitable when data changes frequently, because insertions and deletions may invalidate the learned distribution and its error bounds.
Implementation
```python
def binary_search_range(a, x, left, right):
    """Standard binary search restricted to a[left..right]."""
    while left <= right:
        mid = (left + right) // 2
        if a[mid] == x:
            return mid
        if a[mid] < x:
            left = mid + 1
        else:
            right = mid - 1
    return -1


def recursive_model_index_search(a, x, root_model, leaf_models, errors):
    if not a:
        return -1
    # Stage 1: the root model selects a leaf model; clamp to a valid index.
    m = int(root_model(x))
    m = max(0, min(m, len(leaf_models) - 1))
    # Stage 2: the selected leaf model predicts a position.
    p = int(leaf_models[m](x))
    # Correct the prediction inside the leaf's verified error window.
    epsilon = errors[m]
    left = max(0, p - epsilon)
    right = min(len(a) - 1, p + epsilon)
    return binary_search_range(a, x, left, right)
```

```go
// binarySearchRange is a standard binary search restricted to a[left..right].
func binarySearchRange(a []int, x int, left int, right int) int {
	for left <= right {
		mid := (left + right) / 2
		if a[mid] == x {
			return mid
		}
		if a[mid] < x {
			left = mid + 1
		} else {
			right = mid - 1
		}
	}
	return -1
}

func RecursiveModelIndexSearch(
	a []int,
	x int,
	rootModel func(int) int,
	leafModels []func(int) int,
	errors []int,
) int {
	if len(a) == 0 {
		return -1
	}
	// Stage 1: the root model selects a leaf model; clamp to a valid index.
	m := rootModel(x)
	if m < 0 {
		m = 0
	}
	if m >= len(leafModels) {
		m = len(leafModels) - 1
	}
	// Stage 2: the selected leaf model predicts a position.
	p := leafModels[m](x)
	// Correct the prediction inside the leaf's verified error window.
	epsilon := errors[m]
	left := p - epsilon
	if left < 0 {
		left = 0
	}
	right := p + epsilon
	if right >= len(a) {
		right = len(a) - 1
	}
	return binarySearchRange(a, x, left, right)
}
```
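A quick end-to-end check of the Python version, using trivial hand-built models over evenly spaced keys. All values are assumed for illustration, and the two functions are repeated in compact form so the snippet runs standalone:

```python
def binary_search_range(a, x, left, right):
    while left <= right:
        mid = (left + right) // 2
        if a[mid] == x:
            return mid
        if a[mid] < x:
            left = mid + 1
        else:
            right = mid - 1
    return -1

def recursive_model_index_search(a, x, root_model, leaf_models, errors):
    if not a:
        return -1
    m = max(0, min(int(root_model(x)), len(leaf_models) - 1))
    p = int(leaf_models[m](x))
    left = max(0, p - errors[m])
    right = min(len(a) - 1, p + errors[m])
    return binary_search_range(a, x, left, right)

a = list(range(0, 100, 10))          # [0, 10, ..., 90]
root = lambda x: 0 if x < 50 else 1  # two leaves, split at key 50
leaves = [lambda x: x // 10] * 2     # exact position models
errors = [0, 0]                      # so the window is a single slot

print(recursive_model_index_search(a, 70, root, leaves, errors))  # 7
print(recursive_model_index_search(a, 35, root, leaves, errors))  # -1
```

With exact models and zero error bounds the "search" degenerates to a single probe; real trained models trade a wider window for much smaller model state.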