11.14 Clustering

You have a set of objects and pairwise distances between them.

11.14 Clustering

Problem

You have a set of objects and pairwise distances between them. You want to divide the objects into groups so that nearby objects stay together and distant objects fall into different groups.

This is a clustering problem.

Minimum spanning trees provide a simple and effective clustering method when the data can be represented as a weighted undirected graph:

vertex = object
edge weight = distance or dissimilarity

The idea is to build an MST, then cut its largest edges.

Solution

To create k clusters:

  1. Build a minimum spanning tree.
  2. Remove the k - 1 largest edges from the MST.
  3. The remaining connected components are the clusters.

This is often called single-linkage clustering.

The MST captures the cheapest way to keep all objects connected. Large edges in the MST are natural separation points because they connect regions that are far apart.

Example

Suppose the MST contains these weighted edges:

Edge Weight
A-B 1
B-C 2
D-E 1
E-F 2
C-D 10

The MST looks like this:

A --1-- B --2-- C --10-- D --1-- E --2-- F

If we want two clusters, remove the largest edge:

C-D = 10

Remaining components:

A -- B -- C

D -- E -- F

Clusters:

{A, B, C}
{D, E, F}

The large edge was acting as a bridge between two dense groups.

Why This Works

The MST keeps the lightest possible connections needed to make the graph connected.

When an MST edge has a large weight, it suggests that two groups had to be connected through a relatively expensive link.

Removing that edge separates the graph at its weakest connection.

For k clusters, removing the k - 1 largest MST edges maximizes the spacing between clusters under the single-linkage criterion.

Spacing

The spacing of a clustering is the minimum distance between any two points in different clusters.

For two clusters:

Cluster 1: A, B, C
Cluster 2: D, E, F

the spacing is the smallest edge that crosses between the two clusters.

In the MST example, after removing:

C-D = 10

the spacing is:

10

because that was the cheapest edge connecting the two groups.

Kruskal View

The same clustering can be produced directly by Kruskal's algorithm.

Instead of running Kruskal until one component remains, stop when exactly k components remain.

KruskalClustering(G, k):

    sort edges by weight

    make each vertex its own component

    for each edge(u, v, w):

        if find(u) != find(v):

            if component_count == k:
                stop

            union(u, v)
            add edge to forest
            component_count--

The result is a minimum spanning forest with k components.

This is often easier than building the full MST and then deleting edges.

Example with Kruskal

Edges:

Edge Weight
A-B 1
D-E 1
B-C 2
E-F 2
C-D 10

Goal:

k = 2

Start:

{A} {B} {C} {D} {E} {F}

Add:

A-B
D-E
B-C
E-F

Now we have:

{A,B,C}
{D,E,F}

Since there are exactly two components, stop.

The next edge would have been:

C-D = 10

That edge defines the spacing.

Maximum Spacing k-Clustering

Kruskal clustering solves the maximum spacing k-clustering problem.

The objective is:

create k clusters
maximize the minimum distance between clusters

Kruskal's algorithm naturally does this because it merges the closest components first.

When it stops with k components, the next edge connecting two different components is the smallest inter-cluster edge. Its weight is the spacing.

Implementation Pattern

You need:

  1. Edge list sorted by weight.
  2. Union-find structure.
  3. Component count.
  4. Optional list of chosen edges.
  5. Optional spacing value.

The spacing is found by continuing to scan after reaching k components until you find the next edge whose endpoints belong to different components.

spacing = none

for edge(u, v, w) in sorted_edges:

    if find(u) == find(v):
        continue

    if component_count == k:
        spacing = w
        break

    union(u, v)
    component_count--

Choosing k

The algorithm requires a target number of clusters.

In practical work, choosing k is often harder than running the algorithm.

Common approaches include:

Method Idea
Domain choice Use business or scientific constraints
Gap heuristic Look for large jumps in MST edge weights
Validation metric Compare clustering quality across several k values
Visualization Plot MST edge weights or dendrogram-like structure

For a cookbook implementation, the gap heuristic is a useful starting point.

Sort the MST edge weights:

1, 1, 2, 2, 10

The largest jump is:

2 -> 10

This suggests cutting before weight 10, producing two clusters.

When It Works Well

MST clustering works well when clusters are separated by clear gaps.

Example:

dense group
large distance
dense group

The MST will contain small internal edges and large bridge edges.

Removing the bridge edges gives intuitive clusters.

When It Works Poorly

Single-linkage clustering has a known weakness: chaining.

If points form a thin bridge between two dense groups, single-linkage may connect them into one cluster.

Example:

Group A -- p1 -- p2 -- p3 -- Group B

Each local edge is short, so the MST does not contain one obviously large bridge. The algorithm may fail to separate the groups.

This is not an implementation bug. It is a property of the single-linkage criterion.

Distance Graphs

For n objects, a complete distance graph contains:

n(n - 1) / 2

edges.

This can be expensive.

For large datasets, avoid materializing all pairwise edges when possible. Common alternatives include:

Approach Use when
k-nearest-neighbor graph Local similarity is enough
Spatial index Points live in geometric space
Approximate nearest neighbors Large high-dimensional datasets
Threshold graph Only short distances matter

The MST method depends heavily on how the graph is constructed.

Complexity

If the graph has V vertices and E edges:

Step Complexity
Sort edges O(E log E)
Union-find processing O(E α(V))
Total O(E log E)
Space O(V + E)

For a complete graph:

E = O(V^2)

so sorting dominates.

Common Mistakes

Building a Complete Graph Unnecessarily

If the dataset is large, complete pairwise distances may be too expensive.

Confusing MST Clustering with k-Means

MST clustering uses pairwise distances and graph connectivity. It does not compute centroids.

Forgetting to Stop at k Components

For direct Kruskal clustering, stop when the component count reaches k.

Not Computing Spacing Correctly

The spacing is the next edge that connects two different components after clustering stops.

Assuming It Handles All Cluster Shapes

Single-linkage can suffer from chaining.

Recipe

Use MST-based clustering when you have pairwise distances and want a simple maximum-spacing clustering:

  1. Represent objects as vertices.
  2. Represent distances as weighted undirected edges.
  3. Sort edges by weight.
  4. Run Kruskal with union-find.
  5. Stop when k components remain.
  6. The components are the clusters.
  7. The next inter-component edge gives the spacing.

MST clustering is attractive because it uses the same machinery as minimum spanning trees: sorted edges, union-find, and component merging. It is easy to implement, easy to explain, and effective when cluster boundaries appear as large gaps in the graph.