A clear explanation of finding repeated 10-letter DNA substrings using a fixed-size sliding window and hash sets.
Problem Restatement
We are given a string s representing a DNA sequence.
A DNA sequence contains only these four characters:
| Character | Meaning |
|---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
We need to find all 10-letter-long substrings that appear more than once in s.
The answer can be returned in any order. The official statement asks for all 10-letter-long sequences that occur more than once in a DNA molecule.
Input and Output
| Item | Meaning |
|---|---|
| Input | A DNA string s |
| Output | A list of repeated 10-letter substrings |
| Substring length | Exactly 10 |
| Order | Any order is accepted |
| Characters | Only A, C, G, and T |
Example function shape:
def findRepeatedDnaSequences(s: str) -> list[str]:
...Examples
Example 1:
s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"The 10-letter sequence:
"AAAAACCCCC"appears more than once.
The 10-letter sequence:
"CCCCCAAAAA"also appears more than once.
So one valid answer is:
["AAAAACCCCC", "CCCCCAAAAA"]Example 2:
s = "AAAAAAAAAAAAA"Every 10-letter window is:
"AAAAAAAAAA"It appears multiple times, but the answer should include it only once.
Output:
["AAAAAAAAAA"]First Thought: Count Every Length-10 Substring
The problem asks about repeated substrings of fixed length 10.
So the direct idea is to slide a window of length 10 across the string.
For a string of length n, the first window starts at index 0:
s[0:10]The next starts at index 1:
s[1:11]The last valid window starts at index:
n - 10So the loop should run over:
range(n - 9)For each window, count how many times it appears.
Then return the windows whose count is greater than 1.
Key Insight
We do not need to store full counts for every sequence.
We only need to know two things:
| Set | Meaning |
|---|---|
seen | Sequences that appeared at least once |
repeated | Sequences that appeared at least twice |
When we see a 10-letter sequence:
- If it is already in
seen, then it is repeated. - Add it to
repeated. - Add it to
seen.
Using a set for repeated prevents duplicate answers when a sequence appears three or more times.
Algorithm
- Create an empty set
seen. - Create an empty set
repeated. - For every starting index
ifrom0tolen(s) - 10:- Extract
seq = s[i:i + 10]. - If
seqis already inseen, add it torepeated. - Otherwise, add it to
seen.
- Extract
- Return
list(repeated).
If len(s) < 10, the loop naturally runs zero times and the answer is empty.
Correctness
Every 10-letter substring of s starts at some index i where:
0 <= i <= len(s) - 10The loop visits exactly these starting positions, so every relevant substring is checked.
When a sequence appears for the first time, it is added to seen.
When the same sequence appears again later, it is already in seen, so it is added to repeated.
Therefore, every sequence that appears more than once is included in repeated.
A sequence is added to repeated only after it has already appeared before, so every sequence in repeated appears at least twice.
Since repeated is a set, each repeated sequence appears once in the final answer.
Complexity
| Metric | Value | Why |
|---|---|---|
| Time | O(n) | There are n - 9 windows, and each substring has fixed length 10 |
| Space | O(n) | The sets may store many distinct 10-letter substrings |
Since the substring length is fixed at 10, extracting and hashing each substring is treated as constant work.
Implementation
class Solution:
def findRepeatedDnaSequences(self, s: str) -> list[str]:
seen = set()
repeated = set()
for i in range(len(s) - 9):
seq = s[i:i + 10]
if seq in seen:
repeated.add(seq)
else:
seen.add(seq)
return list(repeated)Code Explanation
We store sequences seen once or more:
seen = set()We store sequences confirmed to be repeated:
repeated = set()This loop visits every 10-letter substring:
for i in range(len(s) - 9):At each index, we extract the current window:
seq = s[i:i + 10]If the sequence was seen before, it is repeated:
if seq in seen:
repeated.add(seq)Otherwise, this is the first time we have seen it:
else:
seen.add(seq)Finally, LeetCode expects a list:
return list(repeated)Alternative Implementation With Counts
A hash map count is also clear.
from collections import defaultdict
class Solution:
def findRepeatedDnaSequences(self, s: str) -> list[str]:
count = defaultdict(int)
ans = []
for i in range(len(s) - 9):
seq = s[i:i + 10]
count[seq] += 1
if count[seq] == 2:
ans.append(seq)
return ansThe condition:
if count[seq] == 2:adds a sequence exactly when it becomes repeated.
This avoids adding the same sequence multiple times.
Testing
def normalize(x):
return sorted(x)
def run_tests():
s = Solution()
assert normalize(
s.findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT")
) == normalize(["AAAAACCCCC", "CCCCCAAAAA"])
assert normalize(
s.findRepeatedDnaSequences("AAAAAAAAAAAAA")
) == normalize(["AAAAAAAAAA"])
assert normalize(
s.findRepeatedDnaSequences("ACGTACGT")
) == []
assert normalize(
s.findRepeatedDnaSequences("ACGTACGTAC")
) == []
assert normalize(
s.findRepeatedDnaSequences("ACGTACGTACACGTACGTAC")
) == normalize(["ACGTACGTAC"])
print("all tests passed")
run_tests()Test meaning:
| Test | Why |
|---|---|
| Long mixed DNA string | Main example |
Repeated A string | Checks overlapping repeated windows |
Length less than 10 | No valid 10-letter sequence |
Length exactly 10 | One sequence cannot repeat |
| Same 10-letter block twice | Basic duplicate case |