19.16 Set Cover Approximation
Set Cover is the canonical greedy approximation problem.
19.16 Set Cover Approximation
Set Cover is the canonical greedy approximation problem. It is easy to describe, appears in many practical systems, and has a clean logarithmic approximation guarantee. If you understand Set Cover, you understand the basic shape of many approximation algorithms: choose the best local move, compare that move to what an optimal solution must accomplish, and accumulate a global bound.
The problem is computationally hard in general, so exact algorithms do not scale well. The greedy algorithm is simple, fast, and provably close to optimal.
Problem
You are given a universe of elements:
U = {1, 2, 3, 4, 5, 6, 7, 8}
and a collection of subsets:
S1 = {1, 2, 3}
S2 = {2, 4, 5}
S3 = {3, 6}
S4 = {4, 5, 6, 7}
S5 = {7, 8}
S6 = {1, 8}
Choose as few subsets as possible so that every element in U is covered.
One valid cover is:
S1, S4, S5
because:
S1 ∪ S4 ∪ S5 = {1,2,3,4,5,6,7,8}
The optimization goal is to minimize the number of chosen sets.
Exact Search
A brute-force solver tries every combination of sets.
from itertools import combinations
def exact_set_cover(
universe: set[int],
sets: list[set[int]],
) -> list[set[int]] | None:
for size in range(1, len(sets) + 1):
for candidate in combinations(sets, size):
covered = set().union(*candidate)
if covered >= universe:
return list(candidate)
return None
This is useful for testing small cases. It is not a scalable algorithm.
The number of possible subsets of the collection is:
2^m
where m is the number of available sets.
Greedy Solution
The greedy algorithm repeatedly chooses the set that covers the largest number of still-uncovered elements.
def greedy_set_cover(
universe: set[int],
sets: list[set[int]],
) -> list[set[int]]:
uncovered = set(universe)
chosen: list[set[int]] = []
while uncovered:
best = max(
sets,
key=lambda s: len(s & uncovered),
)
newly_covered = best & uncovered
if not newly_covered:
raise ValueError("universe cannot be covered by the provided sets")
chosen.append(best)
uncovered -= best
return chosen
Example:
universe = {1, 2, 3, 4, 5, 6, 7, 8}
sets = [
{1, 2, 3},
{2, 4, 5},
{3, 6},
{4, 5, 6, 7},
{7, 8},
{1, 8},
]
cover = greedy_set_cover(universe, sets)
print(cover)
Possible output:
[{4, 5, 6, 7}, {1, 2, 3}, {7, 8}]
The exact selected sets may depend on tie-breaking.
Step-by-Step Execution
Start:
Uncovered = {1,2,3,4,5,6,7,8}
Set S4 covers four uncovered elements:
{4,5,6,7}
Choose S4.
Remaining:
{1,2,3,8}
Now S1 covers:
{1,2,3}
Choose S1.
Remaining:
{8}
Choose either S5 or S6.
Final cover size:
3
Weighted Set Cover
In weighted Set Cover, each set has a cost.
Example:
| Set | Elements | Cost |
|---|---|---|
| S1 | {1,2,3} | 5 |
| S2 | {2,4,5} | 3 |
| S3 | {3,6} | 2 |
| S4 | {4,5,6,7} | 8 |
| S5 | {7,8} | 2 |
| S6 | {1,8} | 4 |
The goal is to cover the universe with minimum total cost.
The greedy rule changes. Instead of choosing the set with the most uncovered elements, choose the set with the lowest cost per newly covered element.
cost(S)
------------------------
new elements covered by S
Weighted Implementation
from dataclasses import dataclass
@dataclass(frozen=True)
class WeightedSet:
name: str
elements: frozenset[int]
cost: float
def greedy_weighted_set_cover(
universe: set[int],
sets: list[WeightedSet],
) -> list[WeightedSet]:
uncovered = set(universe)
chosen: list[WeightedSet] = []
while uncovered:
best = min(
sets,
key=lambda s: (
s.cost / len(s.elements & uncovered)
if s.elements & uncovered
else float("inf")
),
)
newly_covered = set(best.elements) & uncovered
if not newly_covered:
raise ValueError("universe cannot be covered by the provided sets")
chosen.append(best)
uncovered -= newly_covered
return chosen
Example:
weighted_sets = [
WeightedSet("S1", frozenset({1, 2, 3}), 5),
WeightedSet("S2", frozenset({2, 4, 5}), 3),
WeightedSet("S3", frozenset({3, 6}), 2),
WeightedSet("S4", frozenset({4, 5, 6, 7}), 8),
WeightedSet("S5", frozenset({7, 8}), 2),
WeightedSet("S6", frozenset({1, 8}), 4),
]
cover = greedy_weighted_set_cover(universe, weighted_sets)
print([(s.name, s.cost) for s in cover])
Possible output:
[('S3', 2), ('S5', 2), ('S2', 3), ('S1', 5)]
This may not be optimal, but it has a logarithmic approximation guarantee.
Approximation Guarantee
For unweighted Set Cover, the greedy algorithm is an H(n)-approximation, where:
n = |U|
and:
H(n) = 1 + 1/2 + 1/3 + ... + 1/n
The harmonic number satisfies:
H(n) ≤ 1 + ln n
So the approximation ratio is:
O(log n)
For weighted Set Cover, the same harmonic-style guarantee holds with respect to the size of the universe.
Proof Sketch
Let:
OPT
be the number of sets in an optimal cover.
Suppose there are:
r
uncovered elements remaining.
Since OPT sets can cover all elements, those same OPT sets can cover the remaining r elements.
Therefore at least one set in the optimal solution covers at least:
r / OPT
of the remaining elements.
The greedy algorithm chooses a set covering at least that many uncovered elements.
So each greedy step reduces the number of uncovered elements by a factor of at least:
1 - 1 / OPT
After enough steps, the number of uncovered elements falls below 1.
This leads to a logarithmic bound.
Charging Argument
Another common proof uses charging.
When the greedy algorithm selects a set, distribute the cost of that set equally among the newly covered elements.
In the unweighted case, if a chosen set covers k new elements, each newly covered element receives charge:
1 / k
The total charge over all elements equals the number of sets chosen by the greedy algorithm.
The proof then shows each element's charge can be bounded by a harmonic sequence relative to the optimal cover.
This yields:
Greedy cost ≤ H(n) · OPT
The charging view is useful because it generalizes cleanly to the weighted case.
Why the Bound Is Logarithmic
The greedy algorithm removes a constant fraction of what remains relative to OPT.
That creates a decay pattern:
n
n(1 - 1/OPT)
n(1 - 1/OPT)^2
...
The number of rounds needed is proportional to:
OPT · ln n
Hence the logarithmic factor.
The algorithm can be much better in practice, but the worst-case guarantee is logarithmic.
Tie-Breaking
Ties occur frequently.
Example:
S1 covers 3 uncovered elements
S2 covers 3 uncovered elements
Both are equally good according to the greedy rule.
Tie-breaking can affect the final solution size, but not the worst-case approximation guarantee.
Practical tie-breakers include:
- Lower cost
- Larger total set size
- Stable input order
- Random choice
- Domain-specific priority
Tie-breaking is a tuning decision, not part of the core proof.
Improving the Implementation
The simple implementation scans all sets in every iteration.
If there are many sets, this can be expensive.
A faster version can maintain a priority queue keyed by current uncovered contribution. Since contributions shrink as elements are covered, entries become stale and must be recomputed when popped.
Sketch:
import heapq
def greedy_set_cover_heap(
universe: set[int],
sets: list[set[int]],
) -> list[set[int]]:
uncovered = set(universe)
chosen: list[set[int]] = []
heap = [
(-len(s & uncovered), index)
for index, s in enumerate(sets)
]
heapq.heapify(heap)
while uncovered:
_, index = heapq.heappop(heap)
current_gain = len(sets[index] & uncovered)
if current_gain == 0:
continue
heapq.heappush(heap, (-current_gain, index))
best_gain, best_index = heapq.heappop(heap)
best_gain = -best_gain
recomputed_gain = len(sets[best_index] & uncovered)
if recomputed_gain < best_gain:
heapq.heappush(heap, (-recomputed_gain, best_index))
continue
chosen.append(sets[best_index])
uncovered -= sets[best_index]
return chosen
In production, this pattern needs careful testing. Lazy priority queues are easy to get subtly wrong. The simple scan version is often preferable until profiling proves it is a bottleneck.
Applications
Set Cover models many practical problems.
| Domain | Universe | Sets |
|---|---|---|
| Testing | Code paths | Test cases |
| Monitoring | Failure modes | Sensors |
| Security | Vulnerabilities | Patches |
| Logistics | Delivery regions | Facilities |
| Advertising | Users | Campaign segments |
| Databases | Queries | Materialized views |
| Cloud systems | Requirements | Server configurations |
Whenever the task is “cover all requirements with minimum choices,” Set Cover is a candidate model.
Common Mistakes
Do not use greedy Set Cover without checking that every universe element is coverable.
Do not assume the greedy solution is optimal.
Do not forget to use cost per newly covered element in the weighted version.
Do not treat ties as irrelevant in production. They can materially affect empirical quality.
Do not optimize the implementation before validating the model. Many failures come from modeling the wrong universe or wrong sets.
Takeaway
Greedy Set Cover is the standard example of a logarithmic approximation algorithm. The algorithm is simple: repeatedly choose the set that covers the most uncovered elements, or the lowest cost per newly covered element in the weighted case. The analysis works because the optimal solution must cover the remaining elements somehow, so at least one available set must make measurable progress. This yields an H(n) approximation guarantee and a practical algorithm for large covering problems.