19.20 Streaming Algorithms
Streaming algorithms process data one item at a time while using much less memory than the input size.
19.20 Streaming Algorithms
Streaming algorithms process data one item at a time while using much less memory than the input size. They are designed for systems where the stream is too large to store, too fast to revisit, or too distributed to collect in one place.
A streaming algorithm usually sees each element once. It updates a compact state and discards the raw item. Later, queries are answered from that state. The answer may be exact for simple problems, but many useful streaming algorithms are approximate and probabilistic.
This section introduces the streaming model and the basic patterns used to design algorithms for massive data streams.
Problem
Suppose events arrive continuously:
click
view
view
purchase
view
click
...
You want to answer questions such as:
How many events have arrived?
How many distinct users appeared?
Which items are frequent?
What is the approximate average?
Has this key appeared before?
The stream may contain:
10 billion events per day
Storing every event is impractical.
You need compact summaries.
Streaming Model
A streaming algorithm receives items sequentially:
x₁, x₂, x₃, ..., xₙ
For each item, it performs a small update:
state = update(state, xᵢ)
After processing the stream, it answers queries from state.
The design goals are:
| Goal | Meaning |
|---|---|
| Small memory | Use far less than O(n) space |
| Fast updates | Process each item quickly |
| One pass | Avoid rereading the stream |
| Approximate answers | Allow controlled error when needed |
| Mergeability | Combine summaries from multiple workers |
The strongest streaming algorithms use memory that depends on the desired error, not on the number of events.
Exact Counting
Some streaming tasks are exact and simple.
Counting total events needs one counter:
class EventCounter:
def __init__(self) -> None:
self.count = 0
def add(self, item: object) -> None:
self.count += 1
def estimate(self) -> int:
return self.count
Memory:
O(1)
This is a streaming algorithm, but it is not very interesting. The challenge begins when the answer depends on many distinct items.
Running Average
An average can also be maintained exactly for numeric streams.
class RunningAverage:
def __init__(self) -> None:
self.count = 0
self.total = 0.0
def add(self, value: float) -> None:
self.count += 1
self.total += value
def average(self) -> float:
if self.count == 0:
raise ValueError("average is undefined for an empty stream")
return self.total / self.count
This uses constant memory.
For variance, use a numerically stable online algorithm rather than storing all values.
Online Variance
Welford's method maintains mean and variance in one pass.
class RunningVariance:
def __init__(self) -> None:
self.count = 0
self.mean = 0.0
self.m2 = 0.0
def add(self, value: float) -> None:
self.count += 1
delta = value - self.mean
self.mean += delta / self.count
delta2 = value - self.mean
self.m2 += delta * delta2
def variance(self) -> float:
if self.count < 2:
raise ValueError("variance requires at least two values")
return self.m2 / (self.count - 1)
This is exact up to floating-point error and avoids the numerical instability of subtracting two large sums.
Distinct Counting
Distinct counting is harder.
Exact solution:
seen = set()
for user_id in stream:
seen.add(user_id)
answer = len(seen)
Memory grows with the number of distinct users:
O(distinct_count)
This may be unacceptable.
Approximate solution:
HyperLogLog
uses fixed memory and estimates distinct count with known statistical error.
This is a typical streaming trade-off:
Exact answer
requires growing memory.
Approximate answer
uses bounded memory.
Membership Testing
Exact membership also requires storing keys:
seen = set()
def add(key: str) -> None:
seen.add(key)
def contains(key: str) -> bool:
return key in seen
A Bloom filter provides a compact approximate alternative.
Properties:
| Query Result | Meaning |
|---|---|
| Definitely absent | Exact |
| Possibly present | May be false positive |
This is useful when a negative answer can avoid expensive work.
Frequency Estimation
Exact per-key frequency counting uses a dictionary:
from collections import Counter
counts = Counter()
for key in stream:
counts[key] += 1
Memory grows with the number of distinct keys.
A Count-Min Sketch estimates frequencies using fixed memory:
estimate(key) ≥ true_count(key)
with bounded overestimation.
This is useful when approximate frequency is sufficient.
Heavy Hitters
A heavy hitter is an item whose frequency exceeds a threshold.
Example:
Items appearing in at least 1% of the stream
A simple exact solution stores all counts.
A streaming solution can use algorithms such as Misra-Gries.
Recipe: Misra-Gries Heavy Hitters
Maintain at most k - 1 counters.
When an item arrives:
- If it already has a counter, increment it.
- Else if there is free space, add it with count 1.
- Else decrement all counters by 1.
- Remove counters that reach zero.
class MisraGries:
def __init__(self, capacity: int) -> None:
if capacity < 2:
raise ValueError("capacity must be at least 2")
self.capacity = capacity
self.counters: dict[str, int] = {}
def add(self, item: str) -> None:
if item in self.counters:
self.counters[item] += 1
return
if len(self.counters) < self.capacity - 1:
self.counters[item] = 1
return
to_delete = []
for key in list(self.counters):
self.counters[key] -= 1
if self.counters[key] == 0:
to_delete.append(key)
for key in to_delete:
del self.counters[key]
def candidates(self) -> dict[str, int]:
return dict(self.counters)
This does not return exact counts. It returns candidate heavy hitters. A second pass, or an auxiliary sketch, can estimate their true frequencies.
Why Misra-Gries Works
Each decrement step effectively removes a group of k distinct items from consideration.
An item that appears more than:
n / k
times cannot be completely eliminated by these cancellations.
Therefore every true heavy hitter remains among the candidates.
The algorithm uses:
O(k)
memory.
Sliding Windows
Many streams need recent statistics rather than lifetime statistics.
Examples:
Requests in the last 5 minutes
Errors in the last hour
Active users in the last day
This is the sliding-window model.
The simplest exact implementation uses timestamped buckets.
from collections import deque
from dataclasses import dataclass
@dataclass
class Event:
timestamp: float
value: int
class SlidingWindowSum:
def __init__(self, window_seconds: float) -> None:
self.window_seconds = window_seconds
self.events: deque[Event] = deque()
self.total = 0
def add(self, timestamp: float, value: int) -> None:
self.events.append(Event(timestamp, value))
self.total += value
self._expire(timestamp)
def query(self, now: float) -> int:
self._expire(now)
return self.total
def _expire(self, now: float) -> None:
cutoff = now - self.window_seconds
while self.events and self.events[0].timestamp < cutoff:
expired = self.events.popleft()
self.total -= expired.value
This is exact, but memory depends on the number of events inside the window.
Approximate window algorithms use buckets or sketches to reduce memory.
Reservoir Sampling in Streams
Reservoir sampling keeps a uniform random sample from a stream of unknown length.
For sample size k, memory remains:
O(k)
regardless of stream length.
This is useful for debugging, audits, dashboards, and sampling representative events from continuous systems.
Distributed Streaming
Large streams are usually partitioned.
Example:
Worker 1 processes shard A
Worker 2 processes shard B
Worker 3 processes shard C
Each worker maintains a local summary.
A coordinator merges summaries.
Mergeable sketches are especially valuable:
| Sketch | Merge Operation |
|---|---|
| HyperLogLog | Register-wise max |
| Count-Min Sketch | Counter-wise addition |
| Bloom Filter | Bit-wise OR |
| Reservoir sample | Weighted merge |
| Running count | Addition |
Mergeability often determines whether an algorithm is practical in distributed systems.
One Pass vs Multiple Passes
Some streaming algorithms require one pass.
Others require two passes.
Example:
Pass 1:
Find heavy-hitter candidates.
Pass 2:
Compute exact counts for candidates.
One-pass algorithms are necessary when historical data cannot be replayed.
Two-pass algorithms are often acceptable in batch systems, log archives, and data lakes.
Error Budgets
Approximate streaming algorithms should be designed with explicit error budgets.
Questions to answer:
How much relative error is acceptable?
How often may the estimate fail?
How much memory is available?
How fast must updates be?
Can sketches be merged?
Without these constraints, choosing a sketch size is guesswork.
Complexity Patterns
Streaming algorithms are usually evaluated by:
| Metric | Question |
|---|---|
| Update time | Cost per item |
| Query time | Cost to answer |
| Memory | Size of maintained state |
| Error | Accuracy guarantee |
| Failure probability | Probability bound |
| Merge cost | Distributed aggregation cost |
For high-volume streams, update time is often the most important engineering constraint.
Common Mistakes
Do not store the whole stream and call it streaming. The defining constraint is sublinear memory.
Do not use approximate sketches without stating their error model.
Do not merge sketches with different parameters unless the algorithm supports it.
Do not assume event-time and processing-time windows are equivalent. Late data changes window semantics.
Do not use lifetime counters when the product question asks about recent behavior.
Takeaway
Streaming algorithms process data incrementally while maintaining compact state. Exact streaming is possible for simple aggregates such as counts, sums, averages, and variances. For distinct counts, membership, frequencies, and heavy hitters, approximate sketches become essential. The recurring design pattern is to replace raw data with a summary whose memory footprint depends on accuracy requirements rather than stream size.