Skip to content

Streaming Heavy Hitters

Find frequent items in a stream using bounded memory.

Streaming Heavy Hitters finds items that occur frequently in a stream. The stream may be too large to store, so the algorithm keeps only a compact summary.

The usual goal is to report all values whose frequency is above a threshold, or to estimate the most frequent values.

Problem

Given a stream

x1,x2,,xn x_1, x_2, \dots, x_n

find items whose frequency is large compared with nn.

For a threshold ϕ\phi, a heavy hitter is an item xx such that

count(x)>ϕn \text{count}(x) > \phi n

Algorithm

A standard deterministic method is Misra Gries. It keeps at most k1k - 1 counters.

streaming_heavy_hitters(stream, k):
    counters = empty map

    for x in stream:
        if x in counters:
            counters[x] = counters[x] + 1
        else if size(counters) < k - 1:
            counters[x] = 1
        else:
            for each key y in counters:
                counters[y] = counters[y] - 1
                if counters[y] == 0:
                    remove y from counters

    return keys of counters

The returned keys are candidates. A second pass can compute exact counts.

Key Idea

When the counter table is full and a new unseen item arrives, the algorithm removes one count from every stored item. This operation cancels a group of kk distinct items.

An item occurring more than n/kn/k times cannot be completely canceled, so it remains as a candidate.

Example

Let the stream be:

a,b,a,c,a,d,b,a a, b, a, c, a, d, b, a

With k=3k = 3, the algorithm keeps at most two counters. The item aa appears four times out of eight, so it is a heavy hitter for threshold 1/31/3.

Correctness

Each decrement step removes one occurrence from at most k1k - 1 tracked items and also accounts for one untracked item. This can be viewed as deleting a group of up to kk distinct values.

An item with frequency greater than n/kn/k cannot be deleted in all such groups, because there are too many copies of it. Therefore every item with frequency greater than n/kn/k must appear among the final candidates.

The candidate set may contain false positives, so exact reporting requires a verification pass.

Complexity

measurecost
memoryO(k)O(k)
update, simple map versionO(k)O(k)
total timeO(nk)O(nk)
candidatesat most k1k - 1

With more careful data structures, decrement handling can be optimized, but the simple version is usually enough for small kk.

When to Use

Use Streaming Heavy Hitters when:

  • the input is a stream
  • full frequency counting is too expensive
  • approximate candidate detection is acceptable
  • a second pass is possible for exact counts

Use Count Min Sketch when you need frequency estimates for arbitrary queried keys.

Implementation

def streaming_heavy_hitters(stream, k):
    counters = {}

    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            remove = []
            for y in counters:
                counters[y] -= 1
                if counters[y] == 0:
                    remove.append(y)

            for y in remove:
                del counters[y]

    return list(counters.keys())
func StreamingHeavyHitters(stream []string, k int) []string {
	counters := map[string]int{}

	for _, x := range stream {
		if _, ok := counters[x]; ok {
			counters[x]++
			continue
		}

		if len(counters) < k-1 {
			counters[x] = 1
			continue
		}

		var remove []string
		for y := range counters {
			counters[y]--
			if counters[y] == 0 {
				remove = append(remove, y)
			}
		}

		for _, y := range remove {
			delete(counters, y)
		}
	}

	out := make([]string, 0, len(counters))
	for x := range counters {
		out = append(out, x)
	}

	return out
}