11.14 Clustering
You have a set of objects and pairwise distances between them.
11.14 Clustering
Problem
You have a set of objects and pairwise distances between them. You want to divide the objects into groups so that nearby objects stay together and distant objects fall into different groups.
This is a clustering problem.
Minimum spanning trees provide a simple and effective clustering method when the data can be represented as a weighted undirected graph:
vertex = object
edge weight = distance or dissimilarity
The idea is to build an MST, then cut its largest edges.
Solution
To create k clusters:
- Build a minimum spanning tree.
- Remove the
k - 1largest edges from the MST. - The remaining connected components are the clusters.
This is often called single-linkage clustering.
The MST captures the cheapest way to keep all objects connected. Large edges in the MST are natural separation points because they connect regions that are far apart.
Example
Suppose the MST contains these weighted edges:
| Edge | Weight |
|---|---|
| A-B | 1 |
| B-C | 2 |
| D-E | 1 |
| E-F | 2 |
| C-D | 10 |
The MST looks like this:
A --1-- B --2-- C --10-- D --1-- E --2-- F
If we want two clusters, remove the largest edge:
C-D = 10
Remaining components:
A -- B -- C
D -- E -- F
Clusters:
{A, B, C}
{D, E, F}
The large edge was acting as a bridge between two dense groups.
Why This Works
The MST keeps the lightest possible connections needed to make the graph connected.
When an MST edge has a large weight, it suggests that two groups had to be connected through a relatively expensive link.
Removing that edge separates the graph at its weakest connection.
For k clusters, removing the k - 1 largest MST edges maximizes the spacing between clusters under the single-linkage criterion.
Spacing
The spacing of a clustering is the minimum distance between any two points in different clusters.
For two clusters:
Cluster 1: A, B, C
Cluster 2: D, E, F
the spacing is the smallest edge that crosses between the two clusters.
In the MST example, after removing:
C-D = 10
the spacing is:
10
because that was the cheapest edge connecting the two groups.
Kruskal View
The same clustering can be produced directly by Kruskal's algorithm.
Instead of running Kruskal until one component remains, stop when exactly k components remain.
KruskalClustering(G, k):
sort edges by weight
make each vertex its own component
for each edge(u, v, w):
if find(u) != find(v):
if component_count == k:
stop
union(u, v)
add edge to forest
component_count--
The result is a minimum spanning forest with k components.
This is often easier than building the full MST and then deleting edges.
Example with Kruskal
Edges:
| Edge | Weight |
|---|---|
| A-B | 1 |
| D-E | 1 |
| B-C | 2 |
| E-F | 2 |
| C-D | 10 |
Goal:
k = 2
Start:
{A} {B} {C} {D} {E} {F}
Add:
A-B
D-E
B-C
E-F
Now we have:
{A,B,C}
{D,E,F}
Since there are exactly two components, stop.
The next edge would have been:
C-D = 10
That edge defines the spacing.
Maximum Spacing k-Clustering
Kruskal clustering solves the maximum spacing k-clustering problem.
The objective is:
create k clusters
maximize the minimum distance between clusters
Kruskal's algorithm naturally does this because it merges the closest components first.
When it stops with k components, the next edge connecting two different components is the smallest inter-cluster edge. Its weight is the spacing.
Implementation Pattern
You need:
- Edge list sorted by weight.
- Union-find structure.
- Component count.
- Optional list of chosen edges.
- Optional spacing value.
The spacing is found by continuing to scan after reaching k components until you find the next edge whose endpoints belong to different components.
spacing = none
for edge(u, v, w) in sorted_edges:
if find(u) == find(v):
continue
if component_count == k:
spacing = w
break
union(u, v)
component_count--
Choosing k
The algorithm requires a target number of clusters.
In practical work, choosing k is often harder than running the algorithm.
Common approaches include:
| Method | Idea |
|---|---|
| Domain choice | Use business or scientific constraints |
| Gap heuristic | Look for large jumps in MST edge weights |
| Validation metric | Compare clustering quality across several k values |
| Visualization | Plot MST edge weights or dendrogram-like structure |
For a cookbook implementation, the gap heuristic is a useful starting point.
Sort the MST edge weights:
1, 1, 2, 2, 10
The largest jump is:
2 -> 10
This suggests cutting before weight 10, producing two clusters.
When It Works Well
MST clustering works well when clusters are separated by clear gaps.
Example:
dense group
large distance
dense group
The MST will contain small internal edges and large bridge edges.
Removing the bridge edges gives intuitive clusters.
When It Works Poorly
Single-linkage clustering has a known weakness: chaining.
If points form a thin bridge between two dense groups, single-linkage may connect them into one cluster.
Example:
Group A -- p1 -- p2 -- p3 -- Group B
Each local edge is short, so the MST does not contain one obviously large bridge. The algorithm may fail to separate the groups.
This is not an implementation bug. It is a property of the single-linkage criterion.
Distance Graphs
For n objects, a complete distance graph contains:
n(n - 1) / 2
edges.
This can be expensive.
For large datasets, avoid materializing all pairwise edges when possible. Common alternatives include:
| Approach | Use when |
|---|---|
| k-nearest-neighbor graph | Local similarity is enough |
| Spatial index | Points live in geometric space |
| Approximate nearest neighbors | Large high-dimensional datasets |
| Threshold graph | Only short distances matter |
The MST method depends heavily on how the graph is constructed.
Complexity
If the graph has V vertices and E edges:
| Step | Complexity |
|---|---|
| Sort edges | O(E log E) |
| Union-find processing | O(E α(V)) |
| Total | O(E log E) |
| Space | O(V + E) |
For a complete graph:
E = O(V^2)
so sorting dominates.
Common Mistakes
Building a Complete Graph Unnecessarily
If the dataset is large, complete pairwise distances may be too expensive.
Confusing MST Clustering with k-Means
MST clustering uses pairwise distances and graph connectivity. It does not compute centroids.
Forgetting to Stop at k Components
For direct Kruskal clustering, stop when the component count reaches k.
Not Computing Spacing Correctly
The spacing is the next edge that connects two different components after clustering stops.
Assuming It Handles All Cluster Shapes
Single-linkage can suffer from chaining.
Recipe
Use MST-based clustering when you have pairwise distances and want a simple maximum-spacing clustering:
- Represent objects as vertices.
- Represent distances as weighted undirected edges.
- Sort edges by weight.
- Run Kruskal with union-find.
- Stop when
kcomponents remain. - The components are the clusters.
- The next inter-component edge gives the spacing.
MST clustering is attractive because it uses the same machinery as minimum spanning trees: sorted edges, union-find, and component merging. It is easy to implement, easy to explain, and effective when cluster boundaries appear as large gaps in the graph.