19.16 Set Cover Approximation

Set Cover is the canonical greedy approximation problem. It is easy to describe, appears in many practical systems, and has a clean logarithmic approximation guarantee. If you understand Set Cover, you understand the basic shape of many approximation algorithms: choose the best local move, compare that move to what an optimal solution must accomplish, and accumulate a global bound.

The problem is computationally hard in general, so exact algorithms do not scale well. The greedy algorithm is simple, fast, and provably close to optimal.

Problem

You are given a universe of elements:

U = {1, 2, 3, 4, 5, 6, 7, 8}

and a collection of subsets:

S1 = {1, 2, 3}
S2 = {2, 4, 5}
S3 = {3, 6}
S4 = {4, 5, 6, 7}
S5 = {7, 8}
S6 = {1, 8}

Choose as few subsets as possible so that every element in U is covered.

One valid cover is:

S1, S4, S5

because:

S1 ∪ S4 ∪ S5 = {1,2,3,4,5,6,7,8}

The optimization goal is to minimize the number of chosen sets.

Exact Search

A brute-force solver tries every combination of sets.

from itertools import combinations

def exact_set_cover(
    universe: set[int],
    sets: list[set[int]],
) -> list[set[int]] | None:
    for size in range(1, len(sets) + 1):
        for candidate in combinations(sets, size):
            covered = set().union(*candidate)

            if covered >= universe:
                return list(candidate)

    return None

This is useful for testing small cases. It is not a scalable algorithm.

The number of possible subsets of the collection is:

2^m

where m is the number of available sets.

Greedy Solution

The greedy algorithm repeatedly chooses the set that covers the largest number of still-uncovered elements.

def greedy_set_cover(
    universe: set[int],
    sets: list[set[int]],
) -> list[set[int]]:
    uncovered = set(universe)
    chosen: list[set[int]] = []

    while uncovered:
        best = max(
            sets,
            key=lambda s: len(s & uncovered),
        )

        newly_covered = best & uncovered

        if not newly_covered:
            raise ValueError("universe cannot be covered by the provided sets")

        chosen.append(best)
        uncovered -= best

    return chosen

Example:

universe = {1, 2, 3, 4, 5, 6, 7, 8}

sets = [
    {1, 2, 3},
    {2, 4, 5},
    {3, 6},
    {4, 5, 6, 7},
    {7, 8},
    {1, 8},
]

cover = greedy_set_cover(universe, sets)

print(cover)

Possible output:

[{4, 5, 6, 7}, {1, 2, 3}, {7, 8}]

The exact selected sets may depend on tie-breaking.

Step-by-Step Execution

Start:

Uncovered = {1,2,3,4,5,6,7,8}

Set S4 covers four uncovered elements:

{4,5,6,7}

Choose S4.

Remaining:

{1,2,3,8}

Now S1 covers:

{1,2,3}

Choose S1.

Remaining:

{8}

Choose either S5 or S6.

Final cover size:

Weighted Set Cover

In weighted Set Cover, each set has a cost.

Example:

Set	Elements	Cost
S1	{1,2,3}	5
S2	{2,4,5}	3
S3	{3,6}	2
S4	{4,5,6,7}	8
S5	{7,8}	2
S6	{1,8}	4

The goal is to cover the universe with minimum total cost.

The greedy rule changes. Instead of choosing the set with the most uncovered elements, choose the set with the lowest cost per newly covered element.

cost(S)
------------------------
new elements covered by S

Weighted Implementation

from dataclasses import dataclass

@dataclass(frozen=True)
class WeightedSet:
    name: str
    elements: frozenset[int]
    cost: float

def greedy_weighted_set_cover(
    universe: set[int],
    sets: list[WeightedSet],
) -> list[WeightedSet]:
    uncovered = set(universe)
    chosen: list[WeightedSet] = []

    while uncovered:
        best = min(
            sets,
            key=lambda s: (
                s.cost / len(s.elements & uncovered)
                if s.elements & uncovered
                else float("inf")
            ),
        )

        newly_covered = set(best.elements) & uncovered

        if not newly_covered:
            raise ValueError("universe cannot be covered by the provided sets")

        chosen.append(best)
        uncovered -= newly_covered

    return chosen

Example:

weighted_sets = [
    WeightedSet("S1", frozenset({1, 2, 3}), 5),
    WeightedSet("S2", frozenset({2, 4, 5}), 3),
    WeightedSet("S3", frozenset({3, 6}), 2),
    WeightedSet("S4", frozenset({4, 5, 6, 7}), 8),
    WeightedSet("S5", frozenset({7, 8}), 2),
    WeightedSet("S6", frozenset({1, 8}), 4),
]

cover = greedy_weighted_set_cover(universe, weighted_sets)

print([(s.name, s.cost) for s in cover])

Possible output:

[('S3', 2), ('S5', 2), ('S2', 3), ('S1', 5)]

This may not be optimal, but it has a logarithmic approximation guarantee.

Approximation Guarantee

For unweighted Set Cover, the greedy algorithm is an H(n)-approximation, where:

n = |U|

and:

H(n) = 1 + 1/2 + 1/3 + ... + 1/n

The harmonic number satisfies:

H(n) ≤ 1 + ln n

So the approximation ratio is:

O(log n)

For weighted Set Cover, the same harmonic-style guarantee holds with respect to the size of the universe.

Proof Sketch

Let:

OPT

be the number of sets in an optimal cover.

Suppose there are:

uncovered elements remaining.

Since OPT sets can cover all elements, those same OPT sets can cover the remaining r elements.

Therefore at least one set in the optimal solution covers at least:

r / OPT

of the remaining elements.

The greedy algorithm chooses a set covering at least that many uncovered elements.

So each greedy step reduces the number of uncovered elements by a factor of at least:

1 - 1 / OPT

After enough steps, the number of uncovered elements falls below 1.

This leads to a logarithmic bound.

Charging Argument

Another common proof uses charging.

When the greedy algorithm selects a set, distribute the cost of that set equally among the newly covered elements.

In the unweighted case, if a chosen set covers k new elements, each newly covered element receives charge:

1 / k

The total charge over all elements equals the number of sets chosen by the greedy algorithm.

The proof then shows each element's charge can be bounded by a harmonic sequence relative to the optimal cover.

This yields:

Greedy cost ≤ H(n) · OPT

The charging view is useful because it generalizes cleanly to the weighted case.

Why the Bound Is Logarithmic

The greedy algorithm removes a constant fraction of what remains relative to OPT.

That creates a decay pattern:

n
n(1 - 1/OPT)
n(1 - 1/OPT)^2
...

The number of rounds needed is proportional to:

OPT · ln n

Hence the logarithmic factor.

The algorithm can be much better in practice, but the worst-case guarantee is logarithmic.

Tie-Breaking

Ties occur frequently.

Example:

S1 covers 3 uncovered elements
S2 covers 3 uncovered elements

Both are equally good according to the greedy rule.

Tie-breaking can affect the final solution size, but not the worst-case approximation guarantee.

Practical tie-breakers include:

Lower cost
Larger total set size
Stable input order
Random choice
Domain-specific priority

Tie-breaking is a tuning decision, not part of the core proof.

Improving the Implementation

The simple implementation scans all sets in every iteration.

If there are many sets, this can be expensive.

A faster version can maintain a priority queue keyed by current uncovered contribution. Since contributions shrink as elements are covered, entries become stale and must be recomputed when popped.

Sketch:

import heapq

def greedy_set_cover_heap(
    universe: set[int],
    sets: list[set[int]],
) -> list[set[int]]:
    uncovered = set(universe)
    chosen: list[set[int]] = []

    heap = [
        (-len(s & uncovered), index)
        for index, s in enumerate(sets)
    ]

    heapq.heapify(heap)

    while uncovered:
        _, index = heapq.heappop(heap)
        current_gain = len(sets[index] & uncovered)

        if current_gain == 0:
            continue

        heapq.heappush(heap, (-current_gain, index))

        best_gain, best_index = heapq.heappop(heap)
        best_gain = -best_gain

        recomputed_gain = len(sets[best_index] & uncovered)

        if recomputed_gain < best_gain:
            heapq.heappush(heap, (-recomputed_gain, best_index))
            continue

        chosen.append(sets[best_index])
        uncovered -= sets[best_index]

    return chosen

In production, this pattern needs careful testing. Lazy priority queues are easy to get subtly wrong. The simple scan version is often preferable until profiling proves it is a bottleneck.

Applications

Set Cover models many practical problems.

Domain	Universe	Sets
Testing	Code paths	Test cases
Monitoring	Failure modes	Sensors
Security	Vulnerabilities	Patches
Logistics	Delivery regions	Facilities
Advertising	Users	Campaign segments
Databases	Queries	Materialized views
Cloud systems	Requirements	Server configurations

Whenever the task is “cover all requirements with minimum choices,” Set Cover is a candidate model.

Common Mistakes

Do not use greedy Set Cover without checking that every universe element is coverable.

Do not assume the greedy solution is optimal.

Do not forget to use cost per newly covered element in the weighted version.

Do not treat ties as irrelevant in production. They can materially affect empirical quality.

Do not optimize the implementation before validating the model. Many failures come from modeling the wrong universe or wrong sets.

Takeaway

Greedy Set Cover is the standard example of a logarithmic approximation algorithm. The algorithm is simple: repeatedly choose the set that covers the most uncovered elements, or the lowest cost per newly covered element in the weighted case. The analysis works because the optimal solution must cover the remaining elements somehow, so at least one available set must make measurable progress. This yields an H(n) approximation guarantee and a practical algorithm for large covering problems.