20.15 Building a Mini Database Index

Problem

Design a small database index that supports fast lookup by key and efficient range scans.

A database table may contain millions of rows. Without an index, a query such as:

SELECT * FROM users WHERE id = 42;

requires scanning rows until the target appears. A range query such as:

SELECT * FROM users WHERE id BETWEEN 1000 AND 2000;

requires scanning the whole table unless the storage layout happens to match the query.

An index stores a searchable structure beside the data. It trades write cost and storage space for faster reads.

The core algorithmic problem is ordered lookup. The engineering problem is maintaining that order while supporting inserts, range scans, and disk-friendly access.

Solution

Use a sorted key-value structure first. Then evolve it toward a B-tree-style page index.

For an in-memory prototype, store sorted (key, row_id) pairs and use binary search.

import bisect

class SortedIndex:
    def __init__(self):
        self.keys = []
        self.row_ids = []

    def insert(self, key, row_id):
        position = bisect.bisect_left(self.keys, key)

        if position < len(self.keys) and self.keys[position] == key:
            self.row_ids[position] = row_id
            return

        self.keys.insert(position, key)
        self.row_ids.insert(position, row_id)

    def get(self, key):
        position = bisect.bisect_left(self.keys, key)

        if position == len(self.keys):
            return None

        if self.keys[position] != key:
            return None

        return self.row_ids[position]

Use it:

index = SortedIndex()

index.insert(10, "row-a")
index.insert(20, "row-b")
index.insert(15, "row-c")

print(index.get(15))

Output:

row-c

This gives fast lookup, but insertion into the middle of a Python list costs O(n). That is acceptable for a teaching prototype and poor for a write-heavy database.

Range Scans

The main advantage of ordered indexes is range scanning.

class SortedIndex:
    def __init__(self):
        self.keys = []
        self.row_ids = []

    def insert(self, key, row_id):
        position = bisect.bisect_left(self.keys, key)

        if position < len(self.keys) and self.keys[position] == key:
            self.row_ids[position] = row_id
            return

        self.keys.insert(position, key)
        self.row_ids.insert(position, row_id)

    def get(self, key):
        position = bisect.bisect_left(self.keys, key)

        if position == len(self.keys):
            return None

        if self.keys[position] != key:
            return None

        return self.row_ids[position]

    def range_scan(self, start, end):
        position = bisect.bisect_left(self.keys, start)

        while position < len(self.keys) and self.keys[position] <= end:
            yield self.keys[position], self.row_ids[position]
            position += 1

Example:

index = SortedIndex()

for key in [10, 20, 15, 30, 25]:
    index.insert(key, f"row-{key}")

print(list(index.range_scan(15, 25)))

Output:

[(15, 'row-15'), (20, 'row-20'), (25, 'row-25')]

The range query cost is:

O(log n + k)

where k is the number of returned rows.

This is the defining value of an ordered index.

Why Hash Indexes Are Not Enough

A hash table supports fast equality lookup.

id = 42

But it does not support ordered range queries efficiently.

id BETWEEN 1000 AND 2000

A hash index can answer exact matches but cannot scan keys in sorted order without extra work.

Index type	Equality lookup	Range scan
Hash index	Fast	Poor
Ordered index	Fast	Fast
Full table scan	Poor	Poor

Database systems use different index structures depending on query shape. This recipe focuses on ordered indexes because they support both equality and range access.

Duplicate Keys

Real indexes often need duplicate keys.

Example:

SELECT * FROM orders WHERE customer_id = 42;

Many orders may share the same customer_id.

Store a list of row IDs per key.

class NonUniqueSortedIndex:
    def __init__(self):
        self.keys = []
        self.row_id_lists = []

    def insert(self, key, row_id):
        position = bisect.bisect_left(self.keys, key)

        if position < len(self.keys) and self.keys[position] == key:
            self.row_id_lists[position].append(row_id)
            return

        self.keys.insert(position, key)
        self.row_id_lists.insert(position, [row_id])

    def get_all(self, key):
        position = bisect.bisect_left(self.keys, key)

        if position == len(self.keys):
            return []

        if self.keys[position] != key:
            return []

        return list(self.row_id_lists[position])

Range scans now return each matching row ID.

def range_scan(self, start, end):
    position = bisect.bisect_left(self.keys, start)

    while position < len(self.keys) and self.keys[position] <= end:
        for row_id in self.row_id_lists[position]:
            yield self.keys[position], row_id

        position += 1

This models a secondary index.

Separating Index and Table

The index should not store full rows. It should store row identifiers that point to a table or storage layer.

class Table:
    def __init__(self):
        self.rows = {}
        self.next_row_id = 1

    def insert(self, row):
        row_id = self.next_row_id
        self.next_row_id += 1

        self.rows[row_id] = row

        return row_id

    def get(self, row_id):
        return self.rows.get(row_id)

Use the table and index together:

table = Table()
index = NonUniqueSortedIndex()

row_id = table.insert({
    "id": 1,
    "customer_id": 42,
    "amount": 99,
})

index.insert(42, row_id)

for row_id in index.get_all(42):
    print(table.get(row_id))

This mirrors a database design: table storage owns records, and indexes point into it.

Composite Keys

Many queries use more than one column.

SELECT * FROM orders
WHERE customer_id = 42
  AND created_at >= '2026-01-01'
  AND created_at < '2026-02-01';

A composite index on (customer_id, created_at) supports this efficiently.

Represent composite keys as tuples:

index.insert((42, "2026-01-12"), row_id)
index.insert((42, "2026-01-20"), row_id)
index.insert((43, "2026-01-05"), row_id)

Range scan for one customer:

start = (42, "2026-01-01")
end = (42, "2026-02-01")

matches = list(index.range_scan(start, end))

Tuple ordering gives lexicographic behavior:

(42, 2026-01-01)
(42, 2026-01-12)
(42, 2026-01-20)
(43, 2026-01-05)

Composite indexes are powerful, but column order matters.

An index on:

(customer_id, created_at)

helps queries that constrain customer_id. It may not help much for queries that constrain only created_at.

Moving Toward Pages

A sorted array is easy to understand, but database indexes are stored in pages.

A page is a fixed-size block, often 4 KB, 8 KB, or 16 KB.

A page contains sorted keys and pointers.

Page
+-----------------------+
| keys: 10 20 30 40     |
| ptrs: r1 r2 r3 r4     |
+-----------------------+

When a page fills up, it splits.

This idea leads to B-trees and B+ trees.

B+ Tree Shape

A B+ tree stores keys in sorted order and keeps all row pointers in leaf pages.

Internal pages route searches.

             [30]
            /    \\n [10 20] -> [30 40 50]

A real B+ tree links leaf pages together:

[10 20] -> [30 40 50] -> [60 70]

This makes range scans efficient. The index finds the first matching leaf, then walks leaf links until the range ends.

Minimal Leaf Page

Define a small page structure.

class LeafPage:
    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = []
        self.values = []
        self.next = None

    def insert(self, key, value):
        position = bisect.bisect_left(self.keys, key)

        self.keys.insert(position, key)
        self.values.insert(position, value)

        if len(self.keys) > self.capacity:
            return self.split()

        return None

    def split(self):
        middle = len(self.keys) // 2

        right = LeafPage(self.capacity)
        right.keys = self.keys[middle:]
        right.values = self.values[middle:]
        right.next = self.next

        self.keys = self.keys[:middle]
        self.values = self.values[:middle]
        self.next = right

        separator = right.keys[0]

        return separator, right

When a leaf splits, the separator key is promoted to the parent.

Minimal Internal Page

An internal page stores separator keys and child pointers.

class InternalPage:
    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = []
        self.children = []

    def child_index(self, key):
        return bisect.bisect_right(self.keys, key)

For a complete B+ tree, insertion descends to a leaf, inserts the key, splits if needed, and propagates separators upward. If the root splits, the tree grows by one level.

The complete implementation is longer than this recipe needs. The important point is the invariant:

Each internal key separates ranges stored in child pages.
All leaves are linked in sorted order.

Lookup Path

A lookup starts at the root.

search key 37

root keys: [30, 60]
37 > 30 and 37 < 60
go to child 1

leaf keys: [30, 35, 37, 40]
find 37

Height is small because pages hold many keys. If a page holds 200 keys, a tree with height 3 can address millions of records.

This is why B-trees are disk-friendly. They reduce random I/O by using high fanout.

Range Scan in a B+ Tree

A range scan has two phases:

Find the first leaf containing start.
Walk leaf pages until keys exceed end.

find start -> leaf page
scan keys in leaf
move to next leaf
repeat

This is the same logical pattern as the sorted-array range scan:

O(log n + k)

The difference is storage layout. B+ trees are designed around pages, while arrays are contiguous memory structures.

Write Costs

Indexes speed reads but slow writes.

On insert:

Insert row into table.
Insert key into each affected index.
Split index pages if needed.
Write updated pages.
Log changes for recovery.

A table with many indexes has more expensive writes.

This is why database schema design is a tradeoff. An index should exist because a query needs it, not because a column looks important.

Covering Indexes

Sometimes the index contains all columns needed by a query.

Query:

SELECT customer_id, created_at
FROM orders
WHERE customer_id = 42;

If the index stores:

(customer_id, created_at)

then the database may answer the query from the index alone without reading table rows.

This is called a covering index.

In prototype terms, store payloads in the index:

index.insert((customer_id, created_at), {
    "customer_id": customer_id,
    "created_at": created_at,
})

Covering indexes improve reads but increase index size.

Selectivity

Index usefulness depends on selectivity.

A highly selective column filters many rows.

email = '[email protected]'

A low-selectivity column filters few rows.

status = 'active'

If almost every row has status = active, an index on status may not help much. The database still has to fetch many rows.

A simple selectivity estimate:

def estimate_selectivity(total_rows, distinct_values):
    if total_rows == 0:
        return 0.0

    return distinct_values / total_rows

Higher selectivity usually makes an index more useful for equality filters.

Testing

Test exact lookup.

def test_sorted_index_lookup():
    index = SortedIndex()

    index.insert(10, "row-10")
    index.insert(20, "row-20")

    assert index.get(10) == "row-10"
    assert index.get(30) is None

Test range scan.

def test_sorted_index_range_scan():
    index = SortedIndex()

    for key in [30, 10, 20, 40]:
        index.insert(key, f"row-{key}")

    rows = list(index.range_scan(15, 35))

    assert rows == [
        (20, "row-20"),
        (30, "row-30"),
    ]

Test duplicate keys.

def test_non_unique_index():
    index = NonUniqueSortedIndex()

    index.insert("active", "row-1")
    index.insert("active", "row-2")

    assert index.get_all("active") == ["row-1", "row-2"]

Test composite keys.

def test_composite_key_range():
    index = SortedIndex()

    index.insert((42, "2026-01-10"), "row-1")
    index.insert((42, "2026-01-20"), "row-2")
    index.insert((43, "2026-01-10"), "row-3")

    rows = list(index.range_scan(
        (42, "2026-01-01"),
        (42, "2026-02-01"),
    ))

    assert rows == [
        ((42, "2026-01-10"), "row-1"),
        ((42, "2026-01-20"), "row-2"),
    ]

These tests define the basic contract: ordered lookup, range scan, duplicate handling, and composite-key behavior.

Common Bugs

The most common index bug is updating the table but not updating the index. This creates stale query results.

Another common bug is using a hash index when the workload needs range scans.

Duplicate keys are often mishandled. A secondary index usually needs multiple row IDs per key.

Composite index order is frequently misunderstood. (a, b) and (b, a) serve different query patterns.

Range boundaries produce off-by-one errors. Decide whether bounds are inclusive or exclusive and test them.

B+ tree implementations often break during split propagation. Keep page invariants explicit and test small page capacities to force splits.

Covering indexes can silently become stale if included payload columns are not updated.

Recipe

Build the index in layers.

Start with a sorted array and binary search. Add range scans. Add duplicate-key support for secondary indexes. Separate table storage from index entries. Use tuple keys for composite indexes. Move to pages when storage layout matters. Use a B+ tree when inserts, range scans, and disk access must all be efficient. Measure selectivity before adding indexes. Test boundaries, duplicates, and update paths carefully.

The main lesson is that an index is a maintained search structure. It makes reads faster because the database pays extra work on writes and storage. A good index matches the query pattern: equality, range, ordering, uniqueness, or coverage.