Skip to content

LeetCode 187: Repeated DNA Sequences

A clear explanation of finding repeated 10-letter DNA substrings using a fixed-size sliding window and hash sets.

Problem Restatement

We are given a string s representing a DNA sequence.

A DNA sequence contains only these four characters:

CharacterMeaning
AAdenine
CCytosine
GGuanine
TThymine

We need to find all 10-letter-long substrings that appear more than once in s.

The answer can be returned in any order. The official statement asks for all 10-letter-long sequences that occur more than once in a DNA molecule.

Input and Output

ItemMeaning
InputA DNA string s
OutputA list of repeated 10-letter substrings
Substring lengthExactly 10
OrderAny order is accepted
CharactersOnly A, C, G, and T

Example function shape:

def findRepeatedDnaSequences(s: str) -> list[str]:
    ...

Examples

Example 1:

s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

The 10-letter sequence:

"AAAAACCCCC"

appears more than once.

The 10-letter sequence:

"CCCCCAAAAA"

also appears more than once.

So one valid answer is:

["AAAAACCCCC", "CCCCCAAAAA"]

Example 2:

s = "AAAAAAAAAAAAA"

Every 10-letter window is:

"AAAAAAAAAA"

It appears multiple times, but the answer should include it only once.

Output:

["AAAAAAAAAA"]

First Thought: Count Every Length-10 Substring

The problem asks about repeated substrings of fixed length 10.

So the direct idea is to slide a window of length 10 across the string.

For a string of length n, the first window starts at index 0:

s[0:10]

The next starts at index 1:

s[1:11]

The last valid window starts at index:

n - 10

So the loop should run over:

range(n - 9)

For each window, count how many times it appears.

Then return the windows whose count is greater than 1.

Key Insight

We do not need to store full counts for every sequence.

We only need to know two things:

SetMeaning
seenSequences that appeared at least once
repeatedSequences that appeared at least twice

When we see a 10-letter sequence:

  1. If it is already in seen, then it is repeated.
  2. Add it to repeated.
  3. Add it to seen.

Using a set for repeated prevents duplicate answers when a sequence appears three or more times.

Algorithm

  1. Create an empty set seen.
  2. Create an empty set repeated.
  3. For every starting index i from 0 to len(s) - 10:
    • Extract seq = s[i:i + 10].
    • If seq is already in seen, add it to repeated.
    • Otherwise, add it to seen.
  4. Return list(repeated).

If len(s) < 10, the loop naturally runs zero times and the answer is empty.

Correctness

Every 10-letter substring of s starts at some index i where:

0 <= i <= len(s) - 10

The loop visits exactly these starting positions, so every relevant substring is checked.

When a sequence appears for the first time, it is added to seen.

When the same sequence appears again later, it is already in seen, so it is added to repeated.

Therefore, every sequence that appears more than once is included in repeated.

A sequence is added to repeated only after it has already appeared before, so every sequence in repeated appears at least twice.

Since repeated is a set, each repeated sequence appears once in the final answer.

Complexity

MetricValueWhy
TimeO(n)There are n - 9 windows, and each substring has fixed length 10
SpaceO(n)The sets may store many distinct 10-letter substrings

Since the substring length is fixed at 10, extracting and hashing each substring is treated as constant work.

Implementation

class Solution:
    def findRepeatedDnaSequences(self, s: str) -> list[str]:
        seen = set()
        repeated = set()

        for i in range(len(s) - 9):
            seq = s[i:i + 10]

            if seq in seen:
                repeated.add(seq)
            else:
                seen.add(seq)

        return list(repeated)

Code Explanation

We store sequences seen once or more:

seen = set()

We store sequences confirmed to be repeated:

repeated = set()

This loop visits every 10-letter substring:

for i in range(len(s) - 9):

At each index, we extract the current window:

seq = s[i:i + 10]

If the sequence was seen before, it is repeated:

if seq in seen:
    repeated.add(seq)

Otherwise, this is the first time we have seen it:

else:
    seen.add(seq)

Finally, LeetCode expects a list:

return list(repeated)

Alternative Implementation With Counts

A hash map count is also clear.

from collections import defaultdict

class Solution:
    def findRepeatedDnaSequences(self, s: str) -> list[str]:
        count = defaultdict(int)
        ans = []

        for i in range(len(s) - 9):
            seq = s[i:i + 10]
            count[seq] += 1

            if count[seq] == 2:
                ans.append(seq)

        return ans

The condition:

if count[seq] == 2:

adds a sequence exactly when it becomes repeated.

This avoids adding the same sequence multiple times.

Testing

def normalize(x):
    return sorted(x)

def run_tests():
    s = Solution()

    assert normalize(
        s.findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT")
    ) == normalize(["AAAAACCCCC", "CCCCCAAAAA"])

    assert normalize(
        s.findRepeatedDnaSequences("AAAAAAAAAAAAA")
    ) == normalize(["AAAAAAAAAA"])

    assert normalize(
        s.findRepeatedDnaSequences("ACGTACGT")
    ) == []

    assert normalize(
        s.findRepeatedDnaSequences("ACGTACGTAC")
    ) == []

    assert normalize(
        s.findRepeatedDnaSequences("ACGTACGTACACGTACGTAC")
    ) == normalize(["ACGTACGTAC"])

    print("all tests passed")

run_tests()

Test meaning:

TestWhy
Long mixed DNA stringMain example
Repeated A stringChecks overlapping repeated windows
Length less than 10No valid 10-letter sequence
Length exactly 10One sequence cannot repeat
Same 10-letter block twiceBasic duplicate case