# Reservoir Sampling

# Reservoir Sampling

Reservoir Sampling selects a uniform random sample from a stream without knowing the stream length in advance. It keeps only a fixed-size reservoir and updates it as new items arrive.

The simplest version keeps one item. The general version keeps $k$ items.

## Problem

Given a stream:

$$
x_1, x_2, \dots, x_n
$$

choose $k$ items uniformly at random from the stream, using only:

$$
O(k)
$$

memory.

## Algorithm

Fill the reservoir with the first $k$ items. For the $i$-th item after that, choose a random integer $j$ from $0$ to $i$. If $j < k$, replace reservoir position $j$.

Here $i$ is zero-based.

```id="rs1"
reservoir_sampling(stream, k):
    R = empty array

    for i, x in enumerate(stream):
        if i < k:
            append x to R
        else:
            j = random integer from 0 to i
            if j < k:
                R[j] = x

    return R
```

## Example

Let:

$$
stream = [a, b, c, d, e]
$$

and:

$$
k = 2
$$

The reservoir first becomes:

$$
[a, b]
$$

When $c, d, e$ arrive, each new item has a chance to replace one existing reservoir item. After the stream ends, every 2-item subset has equal probability.

## Correctness

For any item $x_i$, the probability that it enters the reservoir when processed is:

$$
\frac{k}{i}
$$

using one-based position $i$.

After entering, it must avoid replacement by every later item $x_t$ for $t = i + 1, \dots, n$. At step $t$, the probability that a specific reservoir item is replaced is:

$$
\frac{1}{t}
$$

so the probability it survives that step is:

$$
1 - \frac{1}{t}
$$

Therefore its final probability is:

$$
\frac{k}{i}
\prod_{t=i+1}^{n}\left(1 - \frac{1}{t}\right)
=============================================

\frac{k}{i}
\prod_{t=i+1}^{n}\frac{t-1}{t}
==============================

\frac{k}{n}
$$

Thus every item is included with equal probability $k/n$.

## Complexity

| operation       |   cost |
| --------------- | -----: |
| memory          | $O(k)$ |
| update per item | $O(1)$ |
| total time      | $O(n)$ |

The algorithm is online and does not need to know $n$ in advance.

## When to Use

Use Reservoir Sampling when:

* the input is a stream
* the stream length is unknown
* storing all items is impossible or unnecessary
* a uniform sample is required

It is common in logging, telemetry, randomized testing, large file sampling, and database query processing.

## Implementation

```id="rs2"
import random

def reservoir_sampling(stream, k):
    reservoir = []

    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = x

    return reservoir
```

```id="rs3"
import "math/rand"

func ReservoirSampling[T any](stream []T, k int) []T {
	reservoir := make([]T, 0, k)

	for i, x := range stream {
		if i < k {
			reservoir = append(reservoir, x)
			continue
		}

		j := rand.Intn(i + 1)
		if j < k {
			reservoir[j] = x
		}
	}

	return reservoir
}
```

