Skip to content

Reservoir Sampling

Select a uniform random sample from a stream of unknown length using fixed memory.

Reservoir Sampling selects a uniform random sample from a stream without knowing the stream length in advance. It keeps only a fixed-size reservoir and updates it as new items arrive.

The simplest version keeps one item. The general version keeps kk items.

Problem

Given a stream:

x1,x2,,xn x_1, x_2, \dots, x_n

choose kk items uniformly at random from the stream, using only:

O(k) O(k)

memory.

Algorithm

Fill the reservoir with the first kk items. For the ii-th item after that, choose a random integer jj from 00 to ii. If j<kj < k, replace reservoir position jj.

Here ii is zero-based.

reservoir_sampling(stream, k):
    R = empty array

    for i, x in enumerate(stream):
        if i < k:
            append x to R
        else:
            j = random integer from 0 to i
            if j < k:
                R[j] = x

    return R

Example

Let:

stream=[a,b,c,d,e] stream = [a, b, c, d, e]

and:

k=2 k = 2

The reservoir first becomes:

[a,b] [a, b]

When c,d,ec, d, e arrive, each new item has a chance to replace one existing reservoir item. After the stream ends, every 2-item subset has equal probability.

Correctness

For any item xix_i, the probability that it enters the reservoir when processed is:

ki \frac{k}{i}

using one-based position ii.

After entering, it must avoid replacement by every later item xtx_t for t=i+1,,nt = i + 1, \dots, n. At step tt, the probability that a specific reservoir item is replaced is:

1t \frac{1}{t}

so the probability it survives that step is:

11t 1 - \frac{1}{t}

Therefore its final probability is:

$$ \frac{k}{i} \prod_{t=i+1}^{n}\left(1 - \frac{1}{t}\right)

\frac{k}{i} \prod_{t=i+1}^{n}\frac{t-1}{t}

\frac{k}{n} $$

Thus every item is included with equal probability k/nk/n.

Complexity

operationcost
memoryO(k)O(k)
update per itemO(1)O(1)
total timeO(n)O(n)

The algorithm is online and does not need to know nn in advance.

When to Use

Use Reservoir Sampling when:

  • the input is a stream
  • the stream length is unknown
  • storing all items is impossible or unnecessary
  • a uniform sample is required

It is common in logging, telemetry, randomized testing, large file sampling, and database query processing.

Implementation

import random

def reservoir_sampling(stream, k):
    reservoir = []

    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = x

    return reservoir
import "math/rand"

func ReservoirSampling[T any](stream []T, k int) []T {
	reservoir := make([]T, 0, k)

	for i, x := range stream {
		if i < k {
			reservoir = append(reservoir, x)
			continue
		}

		j := rand.Intn(i + 1)
		if j < k {
			reservoir[j] = x
		}
	}

	return reservoir
}