19.5 Random Sampling

Random sampling is one of the most powerful techniques in algorithm design.

19.5 Random Sampling

Random sampling is one of the most powerful techniques in algorithm design. Instead of processing an entire dataset, we examine a carefully selected subset and use it to estimate properties of the whole. When designed correctly, a sample can provide surprisingly accurate answers while requiring only a fraction of the computation, memory, and storage needed by exact algorithms.

The idea appears throughout computer science. Database systems estimate query costs from sampled rows. Search engines analyze representative subsets of documents. Machine learning algorithms train on randomly selected batches. Streaming systems maintain compact summaries of massive event streams. Randomized algorithms often use sampling to avoid expensive exhaustive computation.

This section introduces the fundamental sampling techniques used in algorithm design and analysis.

Problem

Suppose you have a dataset containing:

10,000,000,000 records

You want to estimate:

  • Average value
  • Maximum frequency
  • Distinct counts
  • Distribution properties
  • Popular items

Processing every record may be too expensive.

Can a small subset provide useful answers?

Random sampling answers this question.

Recipe: Uniform Random Sampling

The simplest sampling strategy selects each element with equal probability.

Given:

A = [a₁, a₂, a₃, ..., aₙ]

choose a random element:

P(select aᵢ) = 1/n

Every element has an equal chance of being selected.

This property is called uniform sampling.

Uniformity is the foundation of most randomized analysis.

Example: Estimating an Average

Suppose a company stores the purchase amounts of:

100,000,000 customers

Computing the exact average requires scanning the entire dataset.

Instead:

  1. Select 10,000 customers uniformly.
  2. Compute the average of the sample.
  3. Use that value as an estimate.

Example:

Sample Values
120
95
140
110
105

Sample average:

114

With a sufficiently large sample, the estimate often becomes very close to the true average.

Why Sampling Works

Random samples tend to preserve important characteristics of the underlying population.

Suppose:

60% of users prefer feature A
40% prefer feature B

A sufficiently large random sample will likely exhibit nearly the same proportions.

Perhaps:

59.8% / 40.2%

or:

60.3% / 39.7%

The sample becomes a compact representation of the larger population.

This principle powers many statistical and algorithmic techniques.

Sampling Without Replacement

Suppose we select:

k

elements from:

n

elements.

Once an element is chosen, it cannot be selected again.

Example:

Population:
[1,2,3,4,5]

Sample size:

3

Possible sample:

[2,5,1]

Element:

2

cannot appear twice.

This is called sampling without replacement.

Most survey methods follow this model.

Sampling With Replacement

In some applications, selected elements may be chosen again.

Example:

Draw a card
Record it
Return it
Shuffle
Draw again

Possible sample:

[4,4,1,3]

Repeated selections are allowed.

This simplifies some analyses because selections become independent.

Many probabilistic proofs assume sampling with replacement.

Estimating Frequencies

Suppose we want to estimate:

Frequency("error")

inside a massive log stream.

Instead of examining billions of entries:

  1. Sample 100,000 records.
  2. Count occurrences.
  3. Scale appropriately.

Example:

Sample Size Error Count
100,000 2,000

Estimated frequency:

2%

The estimate becomes increasingly accurate as sample size grows.

Random Sampling in Algorithms

Sampling often reduces computational complexity.

Consider finding a good pivot for quicksort.

Instead of examining:

n

elements:

  1. Select a small random sample.
  2. Compute its median.
  3. Use that value as the pivot.

Work decreases dramatically.

Partition quality remains high.

Many practical sorting implementations use this idea.

Reservoir Sampling

One of the most important sampling techniques is reservoir sampling.

Problem:

A stream contains:

Unknown number of elements

You want a uniform random sample.

You cannot store the entire stream.

Basic Algorithm

Maintain a reservoir of size:

k

For the first:

k

items:

Store them directly.

For each later item:

i

Generate a random integer:

r ∈ [1, i]

If:

r ≤ k

replace a random reservoir element.

Otherwise discard the new item.

Example

Stream:

10, 20, 30, 40, 50, 60, 70

Reservoir size:

k = 3

After processing the entire stream:

[20, 60, 70]

might remain.

Every stream element has equal probability of appearing.

This remarkable property holds regardless of stream length.

Why Reservoir Sampling Matters

Modern systems often process:

  • Log streams
  • Sensor feeds
  • Click streams
  • Financial transactions
  • Network traffic

The total number of records may be unknown.

Reservoir sampling provides:

O(k)

memory usage.

Memory remains constant regardless of stream size.

This makes the algorithm invaluable in large-scale systems.

Stratified Sampling

Pure random sampling may miss important minority groups.

Example population:

Group Percentage
A 95%
B 5%

A small sample may contain few members of group B.

Stratified sampling addresses this issue.

Process:

  1. Divide data into groups.
  2. Sample each group independently.
  3. Combine results.

This often reduces estimation error.

Many analytical systems use stratified sampling to ensure representation.

Importance Sampling

Not all records are equally valuable.

Suppose rare events matter more than common events.

Instead of uniform selection:

Rare events → higher probability
Common events → lower probability

This approach is called importance sampling.

Applications include:

  • Monte Carlo simulation
  • Bayesian inference
  • Reinforcement learning
  • Rare-event analysis

The resulting estimates require weighting to remove bias.

Sample Size Selection

Larger samples generally improve accuracy.

However:

Cost ↑
Memory ↑
Computation ↑

The goal is finding the smallest sample that achieves acceptable error.

Many practical systems discover that surprisingly small samples provide useful estimates.

For example:

1,000 observations

often reveal major trends within datasets containing millions of records.

Bias

The greatest danger in sampling is bias.

Consider:

Survey only active users

Results may not represent the entire population.

Similarly:

Sample only recent events

may distort long-term behavior.

A biased sample can produce highly inaccurate conclusions regardless of size.

Randomness helps eliminate systematic bias.

Variance

Even unbiased sampling introduces variability.

Two different samples may produce slightly different estimates.

Example:

Sample Average
A 101
B 105
C 98

This variation is expected.

Variance measures how much estimates fluctuate.

Reducing variance is a major goal of advanced sampling methods.

Applications

Random sampling appears throughout algorithm design.

Databases

  • Query optimization
  • Approximate aggregation
  • Cardinality estimation

Machine Learning

  • Mini-batch training
  • Validation sets
  • Data augmentation

Streaming Systems

  • Reservoir sampling
  • Traffic monitoring
  • Event analysis

Distributed Systems

  • Health checks
  • Node selection
  • Load estimation

Search Engines

  • Index quality estimation
  • Ranking experiments
  • A/B testing

Sampling often transforms impossible computations into practical ones.

Complexity Analysis

For a sample of size:

k

drawn from:

n

elements:

Operation Complexity
Uniform sampling O(k)
Reservoir sampling O(n) stream processing
Reservoir memory O(k)
Average estimation O(k)
Frequency estimation O(k)

The key advantage is that storage depends on:

k

rather than:

n

where:

k << n

in most applications.

Common Mistakes

Confusing Random and Arbitrary

Selecting:

First 1000 records

is not random sampling.

It often introduces bias.

Ignoring Sample Bias

Large biased samples can be less useful than small unbiased samples.

Using Too Small a Sample

Very small samples may exhibit large variance.

Forgetting Stream Constraints

Algorithms designed for stored datasets often fail in streaming environments.

Reservoir sampling exists specifically to address this issue.

Discussion

Random sampling is one of the fundamental tools of scalable algorithm design. It allows massive datasets to be summarized, analyzed, and estimated using only a tiny fraction of the available data. The power of sampling comes from the fact that carefully chosen random subsets often preserve the essential statistical properties of the full dataset.

Many of the probabilistic data structures and streaming algorithms that follow rely on this principle. Whether estimating cardinalities, detecting trends, identifying heavy hitters, or approximating query results, sampling provides a bridge between exact computation and practical scalability. The next section explores randomized shuffling, another foundational technique that uses randomness to eliminate bias and generate uniformly random permutations.