# LeetCode 187: Repeated DNA Sequences

## Problem Restatement

We are given a string `s` representing a DNA sequence.

A DNA sequence contains only these four characters:

| Character | Meaning |
|---|---|
| `A` | Adenine |
| `C` | Cytosine |
| `G` | Guanine |
| `T` | Thymine |

We need to find all 10-letter-long substrings that appear more than once in `s`.

The answer can be returned in any order. The official statement asks for all 10-letter-long sequences that occur more than once in a DNA molecule.

## Input and Output

| Item | Meaning |
|---|---|
| Input | A DNA string `s` |
| Output | A list of repeated 10-letter substrings |
| Substring length | Exactly `10` |
| Order | Any order is accepted |
| Characters | Only `A`, `C`, `G`, and `T` |

Example function shape:

```python
def findRepeatedDnaSequences(s: str) -> list[str]:
    ...
```

## Examples

Example 1:

```python
s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
```

The 10-letter sequence:

```text
"AAAAACCCCC"
```

appears more than once.

The 10-letter sequence:

```text
"CCCCCAAAAA"
```

also appears more than once.

So one valid answer is:

```python
["AAAAACCCCC", "CCCCCAAAAA"]
```

Example 2:

```python
s = "AAAAAAAAAAAAA"
```

Every 10-letter window is:

```text
"AAAAAAAAAA"
```

It appears multiple times, but the answer should include it only once.

Output:

```python
["AAAAAAAAAA"]
```

## First Thought: Count Every Length-10 Substring

The problem asks about repeated substrings of fixed length `10`.

So the direct idea is to slide a window of length `10` across the string.

For a string of length `n`, the first window starts at index `0`:

```python
s[0:10]
```

The next starts at index `1`:

```python
s[1:11]
```

The last valid window starts at index:

```python
n - 10
```

So the loop should run over:

```python
range(n - 9)
```

For each window, count how many times it appears.

Then return the windows whose count is greater than `1`.

## Key Insight

We do not need to store full counts for every sequence.

We only need to know two things:

| Set | Meaning |
|---|---|
| `seen` | Sequences that appeared at least once |
| `repeated` | Sequences that appeared at least twice |

When we see a 10-letter sequence:

1. If it is already in `seen`, then it is repeated.
2. Add it to `repeated`.
3. Add it to `seen`.

Using a set for `repeated` prevents duplicate answers when a sequence appears three or more times.

## Algorithm

1. Create an empty set `seen`.
2. Create an empty set `repeated`.
3. For every starting index `i` from `0` to `len(s) - 10`:
   - Extract `seq = s[i:i + 10]`.
   - If `seq` is already in `seen`, add it to `repeated`.
   - Otherwise, add it to `seen`.
4. Return `list(repeated)`.

If `len(s) < 10`, the loop naturally runs zero times and the answer is empty.

## Correctness

Every 10-letter substring of `s` starts at some index `i` where:

```python
0 <= i <= len(s) - 10
```

The loop visits exactly these starting positions, so every relevant substring is checked.

When a sequence appears for the first time, it is added to `seen`.

When the same sequence appears again later, it is already in `seen`, so it is added to `repeated`.

Therefore, every sequence that appears more than once is included in `repeated`.

A sequence is added to `repeated` only after it has already appeared before, so every sequence in `repeated` appears at least twice.

Since `repeated` is a set, each repeated sequence appears once in the final answer.

## Complexity

| Metric | Value | Why |
|---|---|---|
| Time | `O(n)` | There are `n - 9` windows, and each substring has fixed length `10` |
| Space | `O(n)` | The sets may store many distinct 10-letter substrings |

Since the substring length is fixed at `10`, extracting and hashing each substring is treated as constant work.

## Implementation

```python
class Solution:
    def findRepeatedDnaSequences(self, s: str) -> list[str]:
        seen = set()
        repeated = set()

        for i in range(len(s) - 9):
            seq = s[i:i + 10]

            if seq in seen:
                repeated.add(seq)
            else:
                seen.add(seq)

        return list(repeated)
```

## Code Explanation

We store sequences seen once or more:

```python
seen = set()
```

We store sequences confirmed to be repeated:

```python
repeated = set()
```

This loop visits every 10-letter substring:

```python
for i in range(len(s) - 9):
```

At each index, we extract the current window:

```python
seq = s[i:i + 10]
```

If the sequence was seen before, it is repeated:

```python
if seq in seen:
    repeated.add(seq)
```

Otherwise, this is the first time we have seen it:

```python
else:
    seen.add(seq)
```

Finally, LeetCode expects a list:

```python
return list(repeated)
```

## Alternative Implementation With Counts

A hash map count is also clear.

```python
from collections import defaultdict

class Solution:
    def findRepeatedDnaSequences(self, s: str) -> list[str]:
        count = defaultdict(int)
        ans = []

        for i in range(len(s) - 9):
            seq = s[i:i + 10]
            count[seq] += 1

            if count[seq] == 2:
                ans.append(seq)

        return ans
```

The condition:

```python
if count[seq] == 2:
```

adds a sequence exactly when it becomes repeated.

This avoids adding the same sequence multiple times.

## Testing

```python
def normalize(x):
    return sorted(x)

def run_tests():
    s = Solution()

    assert normalize(
        s.findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT")
    ) == normalize(["AAAAACCCCC", "CCCCCAAAAA"])

    assert normalize(
        s.findRepeatedDnaSequences("AAAAAAAAAAAAA")
    ) == normalize(["AAAAAAAAAA"])

    assert normalize(
        s.findRepeatedDnaSequences("ACGTACGT")
    ) == []

    assert normalize(
        s.findRepeatedDnaSequences("ACGTACGTAC")
    ) == []

    assert normalize(
        s.findRepeatedDnaSequences("ACGTACGTACACGTACGTAC")
    ) == normalize(["ACGTACGTAC"])

    print("all tests passed")

run_tests()
```

Test meaning:

| Test | Why |
|---|---|
| Long mixed DNA string | Main example |
| Repeated `A` string | Checks overlapping repeated windows |
| Length less than `10` | No valid 10-letter sequence |
| Length exactly `10` | One sequence cannot repeat |
| Same 10-letter block twice | Basic duplicate case |

