Use a predictive model to estimate where a key should appear, then search within a bounded local range.
Learned index search treats an ordered index as a prediction problem. Given a key, a model estimates the position where that key should appear in sorted data. The algorithm then corrects the estimate by searching a small range around the predicted position.
The main idea is that a sorted array defines a cumulative distribution function from key to position. If this function is predictable, a model can replace part of the traditional index.
Problem
Given a sorted array of length and a target key , find an index such that
or report failure.
Idea
Train or define a model that predicts the approximate rank of a key:
where is the position where should occur.
Because the prediction may have error, search in a bounded window:
where is the maximum known prediction error.
Algorithm
learned_index_search(A, x, M, epsilon):
p = M(x)
left = max(0, floor(p) - epsilon)
right = min(length(A) - 1, floor(p) + epsilon)
return binary_search(A, x, left, right)Example
Let
Suppose the model predicts:
and the error bound is:
Then search only the range:
| index | value |
|---|---|
| 3 | 40 |
| 4 | 50 |
| 5 | 60 |
| 6 | 70 |
| 7 | 80 |
Binary search in this range returns index .
Correctness
The model does not need to be exact. Correctness depends on the error bound. If every true position lies within of the predicted position, then the target must be inside the bounded search window whenever it exists.
The final binary search is exact on that window. Therefore, if the key exists and the error bound is valid, the algorithm returns a correct index. If the key is absent, binary search correctly reports failure within the only possible window.
Complexity
Let be the maximum prediction error.
| operation | time |
|---|---|
| model evaluation | or model dependent |
| correction search | |
| total |
Space depends on the model. A simple linear model uses space. A recursive or piecewise model may use more.
When to Use
Use learned index search when:
- keys are sorted
- query volume is high
- the key distribution is predictable
- memory access dominates comparison cost
- an approximate rank model can be maintained
It is most useful for static or mostly static indexes. If inserts and deletes are frequent, the model must be retrained or updated.
Implementation
def learned_index_search(a, x, model, epsilon):
if not a:
return -1
p = int(model(x))
left = max(0, p - epsilon)
right = min(len(a) - 1, p + epsilon)
while left <= right:
mid = (left + right) // 2
if a[mid] == x:
return mid
elif a[mid] < x:
left = mid + 1
else:
right = mid - 1
return -1func LearnedIndexSearch(a []int, x int, model func(int) int, epsilon int) int {
if len(a) == 0 {
return -1
}
p := model(x)
left := p - epsilon
if left < 0 {
left = 0
}
right := p + epsilon
if right >= len(a) {
right = len(a) - 1
}
for left <= right {
mid := (left + right) / 2
if a[mid] == x {
return mid
} else if a[mid] < x {
left = mid + 1
} else {
right = mid - 1
}
}
return -1
}