LeetCode 433: Minimum Genetic Mutation

Problem Restatement

We are given two gene strings:

Name	Meaning
`startGene`	The starting gene
`endGene`	The target gene

Each gene string has length 8 and contains only these characters:

"A", "C", "G", "T"

One mutation means changing exactly one character.

For example:

AACCGGTT -> AACCGGTA

This is one mutation because only the last character changed.

We are also given a gene bank. A gene is valid only if it appears in bank, except the starting gene, which is assumed to be valid even if it does not appear in the bank.

We need to return the minimum number of mutations needed to change startGene into endGene.

If no valid mutation sequence exists, return -1.

The official statement defines each gene as an 8-character string over A, C, G, and T; each mutation changes one character; every valid mutation must be in bank; and the start gene may be absent from the bank.

Input and Output

Item	Meaning
Input	`startGene`, `endGene`, and `bank`
Output	Minimum number of mutations
Impossible case	Return `-1`
Gene length	Always `8`
Valid characters	`A`, `C`, `G`, `T`

Example function shape:

def minMutation(startGene: str, endGene: str, bank: list[str]) -> int:
    ...

Examples

Example 1:

startGene = "AACCGGTT"
endGene = "AACCGGTA"
bank = ["AACCGGTA"]

We can mutate directly:

AACCGGTT -> AACCGGTA

The answer is:

Example 2:

startGene = "AACCGGTT"
endGene = "AAACGGTA"
bank = ["AACCGGTA", "AACCGCTA", "AAACGGTA"]

One shortest valid path is:

AACCGGTT -> AACCGGTA -> AAACGGTA

The answer is:

Example 3:

startGene = "AAAAACCC"
endGene = "AACCCCCC"
bank = ["AAAACCCC", "AAACCCCC", "AACCCCCC"]

A valid path is:

AAAAACCC -> AAAACCCC -> AAACCCCC -> AACCCCCC

The answer is:

First Thought: Try All Mutation Paths

We could try every possible mutation sequence.

From each gene, there are 8 positions, and each position can change to one of 3 other characters.

So each gene can have up to:

8 * 3 = 24

one-step mutations.

Trying all paths without structure can revisit the same genes and loop forever.

We need the shortest valid path, so this is a graph problem.

Key Insight

Treat each valid gene string as a graph node.

There is an edge between two genes if they differ by exactly one character.

Each edge has cost 1, because one mutation costs one step.

So the problem becomes:

Find the shortest path from startGene to endGene in an unweighted graph.

The correct tool for shortest path in an unweighted graph is BFS.

BFS explores all genes reachable in:

0 mutations
1 mutation
2 mutations
3 mutations
...

Therefore, the first time BFS reaches endGene, the number of steps is minimal.

Algorithm

Create a set from the bank:

bank_set = set(bank)

If endGene is not in the bank, then no valid path can end there:

if endGene not in bank_set:
    return -1

Start BFS from:

(startGene, 0)

where 0 is the number of mutations used so far.

For each current gene:

If it equals endGene, return the step count.
For every position from 0 to 7:
- Try replacing it with A, C, G, or T.
- Skip the same character.
- If the new gene is in bank_set, it is valid.
- Add it to the queue with step count + 1.
- Remove it from bank_set so we do not visit it again.

If BFS ends without reaching endGene, return -1.

Correctness

Each queue entry stores a gene and the number of mutations needed to reach it from startGene.

BFS begins with startGene at distance 0.

When processing a gene at distance d, the algorithm generates every valid gene that differs by exactly one character. Each such neighbor is reachable in exactly d + 1 mutations.

Because BFS uses a queue, it processes all genes at distance d before any gene at distance d + 1.

Therefore, when endGene is first removed from the queue, no shorter valid mutation sequence can exist. If a shorter sequence existed, BFS would have discovered and processed it earlier.

The algorithm only adds genes that appear in bank_set, so every intermediate mutation is valid. It removes visited genes from bank_set, so each gene is processed at most once and cycles cannot cause repeated work.

If BFS finishes without finding endGene, then every reachable valid mutation has been explored. Therefore, no valid sequence exists, and returning -1 is correct.

Complexity

Metric	Value	Why
Time	`O(B * L * C)`	Each visited bank gene generates mutations across `L` positions and `C` letters
Space	`O(B)`	The bank set and queue store valid genes

Here:

Symbol	Meaning
`B`	Number of genes in `bank`
`L`	Gene length, which is `8`
`C`	Alphabet size, which is `4`

Since L = 8 and C = 4, this is effectively linear in the bank size.

Implementation

from collections import deque

class Solution:
    def minMutation(
        self,
        startGene: str,
        endGene: str,
        bank: list[str],
    ) -> int:
        bank_set = set(bank)

        if endGene not in bank_set:
            return -1

        letters = ["A", "C", "G", "T"]
        queue = deque([(startGene, 0)])

        while queue:
            gene, steps = queue.popleft()

            if gene == endGene:
                return steps

            for i in range(len(gene)):
                for ch in letters:
                    if ch == gene[i]:
                        continue

                    mutated = gene[:i] + ch + gene[i + 1:]

                    if mutated in bank_set:
                        bank_set.remove(mutated)
                        queue.append((mutated, steps + 1))

        return -1

Code Explanation

We convert the bank into a set:

bank_set = set(bank)

This gives average O(1) lookup for valid mutations.

If the target gene is not in the bank, we can stop immediately:

if endGene not in bank_set:
    return -1

Every valid final mutation must be in the bank.

The BFS queue starts with the starting gene:

queue = deque([(startGene, 0)])

The second value is the number of mutations used so far.

For each gene, we try every one-character mutation:

for i in range(len(gene)):
    for ch in letters:

We skip replacing a character with itself:

if ch == gene[i]:
    continue

Then we form the mutated string:

mutated = gene[:i] + ch + gene[i + 1:]

If it is in the bank, it is a valid next state:

if mutated in bank_set:

We remove it immediately:

bank_set.remove(mutated)

This marks it as visited.

Then we push it into the queue with one more mutation:

queue.append((mutated, steps + 1))

If the BFS cannot reach the target, we return:

return -1

Testing

def run_tests():
    s = Solution()

    assert s.minMutation(
        "AACCGGTT",
        "AACCGGTA",
        ["AACCGGTA"],
    ) == 1

    assert s.minMutation(
        "AACCGGTT",
        "AAACGGTA",
        ["AACCGGTA", "AACCGCTA", "AAACGGTA"],
    ) == 2

    assert s.minMutation(
        "AAAAACCC",
        "AACCCCCC",
        ["AAAACCCC", "AAACCCCC", "AACCCCCC"],
    ) == 3

    assert s.minMutation(
        "AACCGGTT",
        "AACCGGTA",
        [],
    ) == -1

    assert s.minMutation(
        "AACCGGTT",
        "CCCCCCCC",
        ["AACCGGTA", "AACCGCTA", "AAACGGTA"],
    ) == -1

    print("all tests passed")

run_tests()

Test meaning:

Test	Why
One mutation	Checks direct target
Two mutations	Checks shortest path through an intermediate gene
Three mutations	Checks longer valid chain
Empty bank	Target cannot be reached
Missing target	Must return `-1`