Find the minimum number of valid one-character gene mutations using breadth-first search.
Problem Restatement
We are given two gene strings:
| Name | Meaning |
|---|---|
startGene | The starting gene |
endGene | The target gene |
Each gene string has length 8 and contains only these characters:
"A", "C", "G", "T"One mutation means changing exactly one character.
For example:
AACCGGTT -> AACCGGTAThis is one mutation because only the last character changed.
We are also given a gene bank. A gene is valid only if it appears in bank, except the starting gene, which is assumed to be valid even if it does not appear in the bank.
We need to return the minimum number of mutations needed to change startGene into endGene.
If no valid mutation sequence exists, return -1.
The official statement defines each gene as an 8-character string over A, C, G, and T; each mutation changes one character; every valid mutation must be in bank; and the start gene may be absent from the bank.
Input and Output
| Item | Meaning |
|---|---|
| Input | startGene, endGene, and bank |
| Output | Minimum number of mutations |
| Impossible case | Return -1 |
| Gene length | Always 8 |
| Valid characters | A, C, G, T |
Example function shape:
def minMutation(startGene: str, endGene: str, bank: list[str]) -> int:
...Examples
Example 1:
startGene = "AACCGGTT"
endGene = "AACCGGTA"
bank = ["AACCGGTA"]We can mutate directly:
AACCGGTT -> AACCGGTAThe answer is:
1Example 2:
startGene = "AACCGGTT"
endGene = "AAACGGTA"
bank = ["AACCGGTA", "AACCGCTA", "AAACGGTA"]One shortest valid path is:
AACCGGTT -> AACCGGTA -> AAACGGTAThe answer is:
2Example 3:
startGene = "AAAAACCC"
endGene = "AACCCCCC"
bank = ["AAAACCCC", "AAACCCCC", "AACCCCCC"]A valid path is:
AAAAACCC -> AAAACCCC -> AAACCCCC -> AACCCCCCThe answer is:
3First Thought: Try All Mutation Paths
We could try every possible mutation sequence.
From each gene, there are 8 positions, and each position can change to one of 3 other characters.
So each gene can have up to:
8 * 3 = 24one-step mutations.
Trying all paths without structure can revisit the same genes and loop forever.
We need the shortest valid path, so this is a graph problem.
Key Insight
Treat each valid gene string as a graph node.
There is an edge between two genes if they differ by exactly one character.
Each edge has cost 1, because one mutation costs one step.
So the problem becomes:
Find the shortest path from startGene to endGene in an unweighted graph.
The correct tool for shortest path in an unweighted graph is BFS.
BFS explores all genes reachable in:
0 mutations
1 mutation
2 mutations
3 mutations
...Therefore, the first time BFS reaches endGene, the number of steps is minimal.
Algorithm
Create a set from the bank:
bank_set = set(bank)If endGene is not in the bank, then no valid path can end there:
if endGene not in bank_set:
return -1Start BFS from:
(startGene, 0)where 0 is the number of mutations used so far.
For each current gene:
- If it equals
endGene, return the step count. - For every position from
0to7:- Try replacing it with
A,C,G, orT. - Skip the same character.
- If the new gene is in
bank_set, it is valid. - Add it to the queue with step count
+ 1. - Remove it from
bank_setso we do not visit it again.
- Try replacing it with
If BFS ends without reaching endGene, return -1.
Correctness
Each queue entry stores a gene and the number of mutations needed to reach it from startGene.
BFS begins with startGene at distance 0.
When processing a gene at distance d, the algorithm generates every valid gene that differs by exactly one character. Each such neighbor is reachable in exactly d + 1 mutations.
Because BFS uses a queue, it processes all genes at distance d before any gene at distance d + 1.
Therefore, when endGene is first removed from the queue, no shorter valid mutation sequence can exist. If a shorter sequence existed, BFS would have discovered and processed it earlier.
The algorithm only adds genes that appear in bank_set, so every intermediate mutation is valid. It removes visited genes from bank_set, so each gene is processed at most once and cycles cannot cause repeated work.
If BFS finishes without finding endGene, then every reachable valid mutation has been explored. Therefore, no valid sequence exists, and returning -1 is correct.
Complexity
| Metric | Value | Why |
|---|---|---|
| Time | O(B * L * C) | Each visited bank gene generates mutations across L positions and C letters |
| Space | O(B) | The bank set and queue store valid genes |
Here:
| Symbol | Meaning |
|---|---|
B | Number of genes in bank |
L | Gene length, which is 8 |
C | Alphabet size, which is 4 |
Since L = 8 and C = 4, this is effectively linear in the bank size.
Implementation
from collections import deque
class Solution:
def minMutation(
self,
startGene: str,
endGene: str,
bank: list[str],
) -> int:
bank_set = set(bank)
if endGene not in bank_set:
return -1
letters = ["A", "C", "G", "T"]
queue = deque([(startGene, 0)])
while queue:
gene, steps = queue.popleft()
if gene == endGene:
return steps
for i in range(len(gene)):
for ch in letters:
if ch == gene[i]:
continue
mutated = gene[:i] + ch + gene[i + 1:]
if mutated in bank_set:
bank_set.remove(mutated)
queue.append((mutated, steps + 1))
return -1Code Explanation
We convert the bank into a set:
bank_set = set(bank)This gives average O(1) lookup for valid mutations.
If the target gene is not in the bank, we can stop immediately:
if endGene not in bank_set:
return -1Every valid final mutation must be in the bank.
The BFS queue starts with the starting gene:
queue = deque([(startGene, 0)])The second value is the number of mutations used so far.
For each gene, we try every one-character mutation:
for i in range(len(gene)):
for ch in letters:We skip replacing a character with itself:
if ch == gene[i]:
continueThen we form the mutated string:
mutated = gene[:i] + ch + gene[i + 1:]If it is in the bank, it is a valid next state:
if mutated in bank_set:We remove it immediately:
bank_set.remove(mutated)This marks it as visited.
Then we push it into the queue with one more mutation:
queue.append((mutated, steps + 1))If the BFS cannot reach the target, we return:
return -1Testing
def run_tests():
s = Solution()
assert s.minMutation(
"AACCGGTT",
"AACCGGTA",
["AACCGGTA"],
) == 1
assert s.minMutation(
"AACCGGTT",
"AAACGGTA",
["AACCGGTA", "AACCGCTA", "AAACGGTA"],
) == 2
assert s.minMutation(
"AAAAACCC",
"AACCCCCC",
["AAAACCCC", "AAACCCCC", "AACCCCCC"],
) == 3
assert s.minMutation(
"AACCGGTT",
"AACCGGTA",
[],
) == -1
assert s.minMutation(
"AACCGGTT",
"CCCCCCCC",
["AACCGGTA", "AACCGCTA", "AAACGGTA"],
) == -1
print("all tests passed")
run_tests()Test meaning:
| Test | Why |
|---|---|
| One mutation | Checks direct target |
| Two mutations | Checks shortest path through an intermediate gene |
| Three mutations | Checks longer valid chain |
| Empty bank | Target cannot be reached |
| Missing target | Must return -1 |