LeetCode 187: Repeated DNA Sequences

Problem Restatement

We are given a string s representing a DNA sequence.

A DNA sequence contains only these four characters:

Character	Meaning
`A`	Adenine
`C`	Cytosine
`G`	Guanine
`T`	Thymine

We need to find all 10-letter-long substrings that appear more than once in s.

The answer can be returned in any order. The official statement asks for all 10-letter-long sequences that occur more than once in a DNA molecule.

Input and Output

Item	Meaning
Input	A DNA string `s`
Output	A list of repeated 10-letter substrings
Substring length	Exactly `10`
Order	Any order is accepted
Characters	Only `A`, `C`, `G`, and `T`

Example function shape:

def findRepeatedDnaSequences(s: str) -> list[str]:
    ...

Examples

Example 1:

s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

The 10-letter sequence:

"AAAAACCCCC"

appears more than once.

The 10-letter sequence:

"CCCCCAAAAA"

also appears more than once.

So one valid answer is:

["AAAAACCCCC", "CCCCCAAAAA"]

Example 2:

s = "AAAAAAAAAAAAA"

Every 10-letter window is:

"AAAAAAAAAA"

It appears multiple times, but the answer should include it only once.

Output:

["AAAAAAAAAA"]

First Thought: Count Every Length-10 Substring

The problem asks about repeated substrings of fixed length 10.

So the direct idea is to slide a window of length 10 across the string.

For a string of length n, the first window starts at index 0:

s[0:10]

The next starts at index 1:

s[1:11]

The last valid window starts at index:

n - 10

So the loop should run over:

range(n - 9)

For each window, count how many times it appears.

Then return the windows whose count is greater than 1.

Key Insight

We do not need to store full counts for every sequence.

We only need to know two things:

Set	Meaning
`seen`	Sequences that appeared at least once
`repeated`	Sequences that appeared at least twice

When we see a 10-letter sequence:

If it is already in seen, then it is repeated.
Add it to repeated.
Add it to seen.

Using a set for repeated prevents duplicate answers when a sequence appears three or more times.

Algorithm

Create an empty set seen.
Create an empty set repeated.
For every starting index i from 0 to len(s) - 10:
- Extract seq = s[i:i + 10].
- If seq is already in seen, add it to repeated.
- Otherwise, add it to seen.
Return list(repeated).

If len(s) < 10, the loop naturally runs zero times and the answer is empty.

Correctness

Every 10-letter substring of s starts at some index i where:

0 <= i <= len(s) - 10

The loop visits exactly these starting positions, so every relevant substring is checked.

When a sequence appears for the first time, it is added to seen.

When the same sequence appears again later, it is already in seen, so it is added to repeated.

Therefore, every sequence that appears more than once is included in repeated.

A sequence is added to repeated only after it has already appeared before, so every sequence in repeated appears at least twice.

Since repeated is a set, each repeated sequence appears once in the final answer.

Complexity

Metric	Value	Why
Time	`O(n)`	There are `n - 9` windows, and each substring has fixed length `10`
Space	`O(n)`	The sets may store many distinct 10-letter substrings

Since the substring length is fixed at 10, extracting and hashing each substring is treated as constant work.

Implementation

class Solution:
    def findRepeatedDnaSequences(self, s: str) -> list[str]:
        seen = set()
        repeated = set()

        for i in range(len(s) - 9):
            seq = s[i:i + 10]

            if seq in seen:
                repeated.add(seq)
            else:
                seen.add(seq)

        return list(repeated)

Code Explanation

We store sequences seen once or more:

seen = set()

We store sequences confirmed to be repeated:

repeated = set()

This loop visits every 10-letter substring:

for i in range(len(s) - 9):

At each index, we extract the current window:

seq = s[i:i + 10]

If the sequence was seen before, it is repeated:

if seq in seen:
    repeated.add(seq)

Otherwise, this is the first time we have seen it:

else:
    seen.add(seq)

Finally, LeetCode expects a list:

return list(repeated)

Alternative Implementation With Counts

A hash map count is also clear.

from collections import defaultdict

class Solution:
    def findRepeatedDnaSequences(self, s: str) -> list[str]:
        count = defaultdict(int)
        ans = []

        for i in range(len(s) - 9):
            seq = s[i:i + 10]
            count[seq] += 1

            if count[seq] == 2:
                ans.append(seq)

        return ans

The condition:

if count[seq] == 2:

adds a sequence exactly when it becomes repeated.

This avoids adding the same sequence multiple times.

Testing

def normalize(x):
    return sorted(x)

def run_tests():
    s = Solution()

    assert normalize(
        s.findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT")
    ) == normalize(["AAAAACCCCC", "CCCCCAAAAA"])

    assert normalize(
        s.findRepeatedDnaSequences("AAAAAAAAAAAAA")
    ) == normalize(["AAAAAAAAAA"])

    assert normalize(
        s.findRepeatedDnaSequences("ACGTACGT")
    ) == []

    assert normalize(
        s.findRepeatedDnaSequences("ACGTACGTAC")
    ) == []

    assert normalize(
        s.findRepeatedDnaSequences("ACGTACGTACACGTACGTAC")
    ) == normalize(["ACGTACGTAC"])

    print("all tests passed")

run_tests()

Test meaning:

Test	Why
Long mixed DNA string	Main example
Repeated `A` string	Checks overlapping repeated windows
Length less than `10`	No valid 10-letter sequence
Length exactly `10`	One sequence cannot repeat
Same 10-letter block twice	Basic duplicate case