19.20 Streaming Algorithms

Streaming algorithms process data one item at a time while using much less memory than the input size. They are designed for systems where the stream is too large to store, too fast to revisit, or too distributed to collect in one place.

A streaming algorithm usually sees each element once. It updates a compact state and discards the raw item. Later, queries are answered from that state. The answer may be exact for simple problems, but many useful streaming algorithms are approximate and probabilistic.

This section introduces the streaming model and the basic patterns used to design algorithms for massive data streams.

Problem

Suppose events arrive continuously:

click
view
view
purchase
view
click
...

You want to answer questions such as:

How many events have arrived?
How many distinct users appeared?
Which items are frequent?
What is the approximate average?
Has this key appeared before?

The stream may contain:

10 billion events per day

Storing every event is impractical.

You need compact summaries.

Streaming Model

A streaming algorithm receives items sequentially:

x₁, x₂, x₃, ..., xₙ

For each item, it performs a small update:

state = update(state, xᵢ)

After processing the stream, it answers queries from state.

The design goals are:

Goal	Meaning
Small memory	Use far less than O(n) space
Fast updates	Process each item quickly
One pass	Avoid rereading the stream
Approximate answers	Allow controlled error when needed
Mergeability	Combine summaries from multiple workers

The strongest streaming algorithms use memory that depends on the desired error, not on the number of events.

Exact Counting

Some streaming tasks are exact and simple.

Counting total events needs one counter:

class EventCounter:
    def __init__(self) -> None:
        self.count = 0

    def add(self, item: object) -> None:
        self.count += 1

    def estimate(self) -> int:
        return self.count

Memory:

O(1)

This is a streaming algorithm, but it is not very interesting. The challenge begins when the answer depends on many distinct items.

Running Average

An average can also be maintained exactly for numeric streams.

class RunningAverage:
    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value

    def average(self) -> float:
        if self.count == 0:
            raise ValueError("average is undefined for an empty stream")

        return self.total / self.count

This uses constant memory.

For variance, use a numerically stable online algorithm rather than storing all values.

Online Variance

Welford's method maintains mean and variance in one pass.

class RunningVariance:
    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0

    def add(self, value: float) -> None:
        self.count += 1

        delta = value - self.mean
        self.mean += delta / self.count
        delta2 = value - self.mean
        self.m2 += delta * delta2

    def variance(self) -> float:
        if self.count < 2:
            raise ValueError("variance requires at least two values")

        return self.m2 / (self.count - 1)

This is exact up to floating-point error and avoids the numerical instability of subtracting two large sums.

Distinct Counting

Distinct counting is harder.

Exact solution:

seen = set()

for user_id in stream:
    seen.add(user_id)

answer = len(seen)

Memory grows with the number of distinct users:

O(distinct_count)

This may be unacceptable.

Approximate solution:

HyperLogLog

uses fixed memory and estimates distinct count with known statistical error.

This is a typical streaming trade-off:

Exact answer
requires growing memory.

Approximate answer
uses bounded memory.

Membership Testing

Exact membership also requires storing keys:

seen = set()

def add(key: str) -> None:
    seen.add(key)

def contains(key: str) -> bool:
    return key in seen

A Bloom filter provides a compact approximate alternative.

Properties:

Query Result	Meaning
Definitely absent	Exact
Possibly present	May be false positive

This is useful when a negative answer can avoid expensive work.

Frequency Estimation

Exact per-key frequency counting uses a dictionary:

from collections import Counter

counts = Counter()

for key in stream:
    counts[key] += 1

Memory grows with the number of distinct keys.

A Count-Min Sketch estimates frequencies using fixed memory:

estimate(key) ≥ true_count(key)

with bounded overestimation.

This is useful when approximate frequency is sufficient.

Heavy Hitters

A heavy hitter is an item whose frequency exceeds a threshold.

Example:

Items appearing in at least 1% of the stream

A simple exact solution stores all counts.

A streaming solution can use algorithms such as Misra-Gries.

Recipe: Misra-Gries Heavy Hitters

Maintain at most k - 1 counters.

When an item arrives:

If it already has a counter, increment it.
Else if there is free space, add it with count 1.
Else decrement all counters by 1.
Remove counters that reach zero.

class MisraGries:
    def __init__(self, capacity: int) -> None:
        if capacity < 2:
            raise ValueError("capacity must be at least 2")

        self.capacity = capacity
        self.counters: dict[str, int] = {}

    def add(self, item: str) -> None:
        if item in self.counters:
            self.counters[item] += 1
            return

        if len(self.counters) < self.capacity - 1:
            self.counters[item] = 1
            return

        to_delete = []

        for key in list(self.counters):
            self.counters[key] -= 1

            if self.counters[key] == 0:
                to_delete.append(key)

        for key in to_delete:
            del self.counters[key]

    def candidates(self) -> dict[str, int]:
        return dict(self.counters)

This does not return exact counts. It returns candidate heavy hitters. A second pass, or an auxiliary sketch, can estimate their true frequencies.

Why Misra-Gries Works

Each decrement step effectively removes a group of k distinct items from consideration.

An item that appears more than:

n / k

times cannot be completely eliminated by these cancellations.

Therefore every true heavy hitter remains among the candidates.

The algorithm uses:

O(k)

memory.

Sliding Windows

Many streams need recent statistics rather than lifetime statistics.

Examples:

Requests in the last 5 minutes
Errors in the last hour
Active users in the last day

This is the sliding-window model.

The simplest exact implementation uses timestamped buckets.

from collections import deque
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float
    value: int

class SlidingWindowSum:
    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self.events: deque[Event] = deque()
        self.total = 0

    def add(self, timestamp: float, value: int) -> None:
        self.events.append(Event(timestamp, value))
        self.total += value
        self._expire(timestamp)

    def query(self, now: float) -> int:
        self._expire(now)
        return self.total

    def _expire(self, now: float) -> None:
        cutoff = now - self.window_seconds

        while self.events and self.events[0].timestamp < cutoff:
            expired = self.events.popleft()
            self.total -= expired.value

This is exact, but memory depends on the number of events inside the window.

Approximate window algorithms use buckets or sketches to reduce memory.

Reservoir Sampling in Streams

Reservoir sampling keeps a uniform random sample from a stream of unknown length.

For sample size k, memory remains:

O(k)

regardless of stream length.

This is useful for debugging, audits, dashboards, and sampling representative events from continuous systems.

Distributed Streaming

Large streams are usually partitioned.

Example:

Worker 1 processes shard A
Worker 2 processes shard B
Worker 3 processes shard C

Each worker maintains a local summary.

A coordinator merges summaries.

Mergeable sketches are especially valuable:

Sketch	Merge Operation
HyperLogLog	Register-wise max
Count-Min Sketch	Counter-wise addition
Bloom Filter	Bit-wise OR
Reservoir sample	Weighted merge
Running count	Addition

Mergeability often determines whether an algorithm is practical in distributed systems.

One Pass vs Multiple Passes

Some streaming algorithms require one pass.

Others require two passes.

Example:

Pass 1:
Find heavy-hitter candidates.

Pass 2:
Compute exact counts for candidates.

One-pass algorithms are necessary when historical data cannot be replayed.

Two-pass algorithms are often acceptable in batch systems, log archives, and data lakes.

Error Budgets

Approximate streaming algorithms should be designed with explicit error budgets.

Questions to answer:

How much relative error is acceptable?
How often may the estimate fail?
How much memory is available?
How fast must updates be?
Can sketches be merged?

Without these constraints, choosing a sketch size is guesswork.

Complexity Patterns

Streaming algorithms are usually evaluated by:

Metric	Question
Update time	Cost per item
Query time	Cost to answer
Memory	Size of maintained state
Error	Accuracy guarantee
Failure probability	Probability bound
Merge cost	Distributed aggregation cost

For high-volume streams, update time is often the most important engineering constraint.

Common Mistakes

Do not store the whole stream and call it streaming. The defining constraint is sublinear memory.

Do not use approximate sketches without stating their error model.

Do not merge sketches with different parameters unless the algorithm supports it.

Do not assume event-time and processing-time windows are equivalent. Late data changes window semantics.

Do not use lifetime counters when the product question asks about recent behavior.

Takeaway

Streaming algorithms process data incrementally while maintaining compact state. Exact streaming is possible for simple aggregates such as counts, sums, averages, and variances. For distinct counts, membership, frequencies, and heavy hitters, approximate sketches become essential. The recurring design pattern is to replace raw data with a summary whose memory footprint depends on accuracy requirements rather than stream size.