# LeetCode 433: Minimum Genetic Mutation

## Problem Restatement

We are given two gene strings:

| Name | Meaning |
|---|---|
| `startGene` | The starting gene |
| `endGene` | The target gene |

Each gene string has length `8` and contains only these characters:

```python
"A", "C", "G", "T"
```

One mutation means changing exactly one character.

For example:

```text
AACCGGTT -> AACCGGTA
```

This is one mutation because only the last character changed.

We are also given a gene bank. A gene is valid only if it appears in `bank`, except the starting gene, which is assumed to be valid even if it does not appear in the bank.

We need to return the minimum number of mutations needed to change `startGene` into `endGene`.

If no valid mutation sequence exists, return `-1`.

The official statement defines each gene as an 8-character string over `A`, `C`, `G`, and `T`; each mutation changes one character; every valid mutation must be in `bank`; and the start gene may be absent from the bank.

## Input and Output

| Item | Meaning |
|---|---|
| Input | `startGene`, `endGene`, and `bank` |
| Output | Minimum number of mutations |
| Impossible case | Return `-1` |
| Gene length | Always `8` |
| Valid characters | `A`, `C`, `G`, `T` |

Example function shape:

```python
def minMutation(startGene: str, endGene: str, bank: list[str]) -> int:
    ...
```

## Examples

Example 1:

```python
startGene = "AACCGGTT"
endGene = "AACCGGTA"
bank = ["AACCGGTA"]
```

We can mutate directly:

```text
AACCGGTT -> AACCGGTA
```

The answer is:

```python
1
```

Example 2:

```python
startGene = "AACCGGTT"
endGene = "AAACGGTA"
bank = ["AACCGGTA", "AACCGCTA", "AAACGGTA"]
```

One shortest valid path is:

```text
AACCGGTT -> AACCGGTA -> AAACGGTA
```

The answer is:

```python
2
```

Example 3:

```python
startGene = "AAAAACCC"
endGene = "AACCCCCC"
bank = ["AAAACCCC", "AAACCCCC", "AACCCCCC"]
```

A valid path is:

```text
AAAAACCC -> AAAACCCC -> AAACCCCC -> AACCCCCC
```

The answer is:

```python
3
```

## First Thought: Try All Mutation Paths

We could try every possible mutation sequence.

From each gene, there are 8 positions, and each position can change to one of 3 other characters.

So each gene can have up to:

```python
8 * 3 = 24
```

one-step mutations.

Trying all paths without structure can revisit the same genes and loop forever.

We need the shortest valid path, so this is a graph problem.

## Key Insight

Treat each valid gene string as a graph node.

There is an edge between two genes if they differ by exactly one character.

Each edge has cost `1`, because one mutation costs one step.

So the problem becomes:

Find the shortest path from `startGene` to `endGene` in an unweighted graph.

The correct tool for shortest path in an unweighted graph is BFS.

BFS explores all genes reachable in:

```text
0 mutations
1 mutation
2 mutations
3 mutations
...
```

Therefore, the first time BFS reaches `endGene`, the number of steps is minimal.

## Algorithm

Create a set from the bank:

```python
bank_set = set(bank)
```

If `endGene` is not in the bank, then no valid path can end there:

```python
if endGene not in bank_set:
    return -1
```

Start BFS from:

```python
(startGene, 0)
```

where `0` is the number of mutations used so far.

For each current gene:

1. If it equals `endGene`, return the step count.
2. For every position from `0` to `7`:
   - Try replacing it with `A`, `C`, `G`, or `T`.
   - Skip the same character.
   - If the new gene is in `bank_set`, it is valid.
   - Add it to the queue with step count `+ 1`.
   - Remove it from `bank_set` so we do not visit it again.

If BFS ends without reaching `endGene`, return `-1`.

## Correctness

Each queue entry stores a gene and the number of mutations needed to reach it from `startGene`.

BFS begins with `startGene` at distance `0`.

When processing a gene at distance `d`, the algorithm generates every valid gene that differs by exactly one character. Each such neighbor is reachable in exactly `d + 1` mutations.

Because BFS uses a queue, it processes all genes at distance `d` before any gene at distance `d + 1`.

Therefore, when `endGene` is first removed from the queue, no shorter valid mutation sequence can exist. If a shorter sequence existed, BFS would have discovered and processed it earlier.

The algorithm only adds genes that appear in `bank_set`, so every intermediate mutation is valid. It removes visited genes from `bank_set`, so each gene is processed at most once and cycles cannot cause repeated work.

If BFS finishes without finding `endGene`, then every reachable valid mutation has been explored. Therefore, no valid sequence exists, and returning `-1` is correct.

## Complexity

| Metric | Value | Why |
|---|---:|---|
| Time | `O(B * L * C)` | Each visited bank gene generates mutations across `L` positions and `C` letters |
| Space | `O(B)` | The bank set and queue store valid genes |

Here:

| Symbol | Meaning |
|---|---|
| `B` | Number of genes in `bank` |
| `L` | Gene length, which is `8` |
| `C` | Alphabet size, which is `4` |

Since `L = 8` and `C = 4`, this is effectively linear in the bank size.

## Implementation

```python
from collections import deque

class Solution:
    def minMutation(
        self,
        startGene: str,
        endGene: str,
        bank: list[str],
    ) -> int:
        bank_set = set(bank)

        if endGene not in bank_set:
            return -1

        letters = ["A", "C", "G", "T"]
        queue = deque([(startGene, 0)])

        while queue:
            gene, steps = queue.popleft()

            if gene == endGene:
                return steps

            for i in range(len(gene)):
                for ch in letters:
                    if ch == gene[i]:
                        continue

                    mutated = gene[:i] + ch + gene[i + 1:]

                    if mutated in bank_set:
                        bank_set.remove(mutated)
                        queue.append((mutated, steps + 1))

        return -1
```

## Code Explanation

We convert the bank into a set:

```python
bank_set = set(bank)
```

This gives average `O(1)` lookup for valid mutations.

If the target gene is not in the bank, we can stop immediately:

```python
if endGene not in bank_set:
    return -1
```

Every valid final mutation must be in the bank.

The BFS queue starts with the starting gene:

```python
queue = deque([(startGene, 0)])
```

The second value is the number of mutations used so far.

For each gene, we try every one-character mutation:

```python
for i in range(len(gene)):
    for ch in letters:
```

We skip replacing a character with itself:

```python
if ch == gene[i]:
    continue
```

Then we form the mutated string:

```python
mutated = gene[:i] + ch + gene[i + 1:]
```

If it is in the bank, it is a valid next state:

```python
if mutated in bank_set:
```

We remove it immediately:

```python
bank_set.remove(mutated)
```

This marks it as visited.

Then we push it into the queue with one more mutation:

```python
queue.append((mutated, steps + 1))
```

If the BFS cannot reach the target, we return:

```python
return -1
```

## Testing

```python
def run_tests():
    s = Solution()

    assert s.minMutation(
        "AACCGGTT",
        "AACCGGTA",
        ["AACCGGTA"],
    ) == 1

    assert s.minMutation(
        "AACCGGTT",
        "AAACGGTA",
        ["AACCGGTA", "AACCGCTA", "AAACGGTA"],
    ) == 2

    assert s.minMutation(
        "AAAAACCC",
        "AACCCCCC",
        ["AAAACCCC", "AAACCCCC", "AACCCCCC"],
    ) == 3

    assert s.minMutation(
        "AACCGGTT",
        "AACCGGTA",
        [],
    ) == -1

    assert s.minMutation(
        "AACCGGTT",
        "CCCCCCCC",
        ["AACCGGTA", "AACCGCTA", "AAACGGTA"],
    ) == -1

    print("all tests passed")

run_tests()
```

Test meaning:

| Test | Why |
|---|---|
| One mutation | Checks direct target |
| Two mutations | Checks shortest path through an intermediate gene |
| Three mutations | Checks longer valid chain |
| Empty bank | Target cannot be reached |
| Missing target | Must return `-1` |