19.5 Random Sampling
Random sampling is one of the most powerful techniques in algorithm design.
19.5 Random Sampling
Random sampling is one of the most powerful techniques in algorithm design. Instead of processing an entire dataset, we examine a carefully selected subset and use it to estimate properties of the whole. When designed correctly, a sample can provide surprisingly accurate answers while requiring only a fraction of the computation, memory, and storage needed by exact algorithms.
The idea appears throughout computer science. Database systems estimate query costs from sampled rows. Search engines analyze representative subsets of documents. Machine learning algorithms train on randomly selected batches. Streaming systems maintain compact summaries of massive event streams. Randomized algorithms often use sampling to avoid expensive exhaustive computation.
This section introduces the fundamental sampling techniques used in algorithm design and analysis.
Problem
Suppose you have a dataset containing:
10,000,000,000 records
You want to estimate:
- Average value
- Maximum frequency
- Distinct counts
- Distribution properties
- Popular items
Processing every record may be too expensive.
Can a small subset provide useful answers?
Random sampling answers this question.
Recipe: Uniform Random Sampling
The simplest sampling strategy selects each element with equal probability.
Given:
A = [a₁, a₂, a₃, ..., aₙ]
choose a random element:
P(select aᵢ) = 1/n
Every element has an equal chance of being selected.
This property is called uniform sampling.
Uniformity is the foundation of most randomized analysis.
Example: Estimating an Average
Suppose a company stores the purchase amounts of:
100,000,000 customers
Computing the exact average requires scanning the entire dataset.
Instead:
- Select 10,000 customers uniformly.
- Compute the average of the sample.
- Use that value as an estimate.
Example:
| Sample Values |
|---|
| 120 |
| 95 |
| 140 |
| 110 |
| 105 |
Sample average:
114
With a sufficiently large sample, the estimate often becomes very close to the true average.
Why Sampling Works
Random samples tend to preserve important characteristics of the underlying population.
Suppose:
60% of users prefer feature A
40% prefer feature B
A sufficiently large random sample will likely exhibit nearly the same proportions.
Perhaps:
59.8% / 40.2%
or:
60.3% / 39.7%
The sample becomes a compact representation of the larger population.
This principle powers many statistical and algorithmic techniques.
Sampling Without Replacement
Suppose we select:
k
elements from:
n
elements.
Once an element is chosen, it cannot be selected again.
Example:
Population:
[1,2,3,4,5]
Sample size:
3
Possible sample:
[2,5,1]
Element:
2
cannot appear twice.
This is called sampling without replacement.
Most survey methods follow this model.
Sampling With Replacement
In some applications, selected elements may be chosen again.
Example:
Draw a card
Record it
Return it
Shuffle
Draw again
Possible sample:
[4,4,1,3]
Repeated selections are allowed.
This simplifies some analyses because selections become independent.
Many probabilistic proofs assume sampling with replacement.
Estimating Frequencies
Suppose we want to estimate:
Frequency("error")
inside a massive log stream.
Instead of examining billions of entries:
- Sample 100,000 records.
- Count occurrences.
- Scale appropriately.
Example:
| Sample Size | Error Count |
|---|---|
| 100,000 | 2,000 |
Estimated frequency:
2%
The estimate becomes increasingly accurate as sample size grows.
Random Sampling in Algorithms
Sampling often reduces computational complexity.
Consider finding a good pivot for quicksort.
Instead of examining:
n
elements:
- Select a small random sample.
- Compute its median.
- Use that value as the pivot.
Work decreases dramatically.
Partition quality remains high.
Many practical sorting implementations use this idea.
Reservoir Sampling
One of the most important sampling techniques is reservoir sampling.
Problem:
A stream contains:
Unknown number of elements
You want a uniform random sample.
You cannot store the entire stream.
Basic Algorithm
Maintain a reservoir of size:
k
For the first:
k
items:
Store them directly.
For each later item:
i
Generate a random integer:
r ∈ [1, i]
If:
r ≤ k
replace a random reservoir element.
Otherwise discard the new item.
Example
Stream:
10, 20, 30, 40, 50, 60, 70
Reservoir size:
k = 3
After processing the entire stream:
[20, 60, 70]
might remain.
Every stream element has equal probability of appearing.
This remarkable property holds regardless of stream length.
Why Reservoir Sampling Matters
Modern systems often process:
- Log streams
- Sensor feeds
- Click streams
- Financial transactions
- Network traffic
The total number of records may be unknown.
Reservoir sampling provides:
O(k)
memory usage.
Memory remains constant regardless of stream size.
This makes the algorithm invaluable in large-scale systems.
Stratified Sampling
Pure random sampling may miss important minority groups.
Example population:
| Group | Percentage |
|---|---|
| A | 95% |
| B | 5% |
A small sample may contain few members of group B.
Stratified sampling addresses this issue.
Process:
- Divide data into groups.
- Sample each group independently.
- Combine results.
This often reduces estimation error.
Many analytical systems use stratified sampling to ensure representation.
Importance Sampling
Not all records are equally valuable.
Suppose rare events matter more than common events.
Instead of uniform selection:
Rare events → higher probability
Common events → lower probability
This approach is called importance sampling.
Applications include:
- Monte Carlo simulation
- Bayesian inference
- Reinforcement learning
- Rare-event analysis
The resulting estimates require weighting to remove bias.
Sample Size Selection
Larger samples generally improve accuracy.
However:
Cost ↑
Memory ↑
Computation ↑
The goal is finding the smallest sample that achieves acceptable error.
Many practical systems discover that surprisingly small samples provide useful estimates.
For example:
1,000 observations
often reveal major trends within datasets containing millions of records.
Bias
The greatest danger in sampling is bias.
Consider:
Survey only active users
Results may not represent the entire population.
Similarly:
Sample only recent events
may distort long-term behavior.
A biased sample can produce highly inaccurate conclusions regardless of size.
Randomness helps eliminate systematic bias.
Variance
Even unbiased sampling introduces variability.
Two different samples may produce slightly different estimates.
Example:
| Sample | Average |
|---|---|
| A | 101 |
| B | 105 |
| C | 98 |
This variation is expected.
Variance measures how much estimates fluctuate.
Reducing variance is a major goal of advanced sampling methods.
Applications
Random sampling appears throughout algorithm design.
Databases
- Query optimization
- Approximate aggregation
- Cardinality estimation
Machine Learning
- Mini-batch training
- Validation sets
- Data augmentation
Streaming Systems
- Reservoir sampling
- Traffic monitoring
- Event analysis
Distributed Systems
- Health checks
- Node selection
- Load estimation
Search Engines
- Index quality estimation
- Ranking experiments
- A/B testing
Sampling often transforms impossible computations into practical ones.
Complexity Analysis
For a sample of size:
k
drawn from:
n
elements:
| Operation | Complexity |
|---|---|
| Uniform sampling | O(k) |
| Reservoir sampling | O(n) stream processing |
| Reservoir memory | O(k) |
| Average estimation | O(k) |
| Frequency estimation | O(k) |
The key advantage is that storage depends on:
k
rather than:
n
where:
k << n
in most applications.
Common Mistakes
Confusing Random and Arbitrary
Selecting:
First 1000 records
is not random sampling.
It often introduces bias.
Ignoring Sample Bias
Large biased samples can be less useful than small unbiased samples.
Using Too Small a Sample
Very small samples may exhibit large variance.
Forgetting Stream Constraints
Algorithms designed for stored datasets often fail in streaming environments.
Reservoir sampling exists specifically to address this issue.
Discussion
Random sampling is one of the fundamental tools of scalable algorithm design. It allows massive datasets to be summarized, analyzed, and estimated using only a tiny fraction of the available data. The power of sampling comes from the fact that carefully chosen random subsets often preserve the essential statistical properties of the full dataset.
Many of the probabilistic data structures and streaming algorithms that follow rely on this principle. Whether estimating cardinalities, detecting trends, identifying heavy hitters, or approximating query results, sampling provides a bridge between exact computation and practical scalability. The next section explores randomized shuffling, another foundational technique that uses randomness to eliminate bias and generate uniformly random permutations.