Episode 12

Interview Question Series | Season 3

Repeated DNA Sequences – Reading Version

Problem in short:

You’re given a DNA string made up of 'A', 'C', 'G', and 'T'.
Your task is to find all 10-letter-long substrings that appear more than once in the string.

Example:
If the DNA is "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
then "AAAAACCCCC" and "CCCCCAAAAA" both appear twice.

Idea of the solution:

We need to check every 10-letter substring in the DNA and count how many times it appears.
If any substring appears more than once, we include it in our result.

Step-by-step solution (two main approaches):

1. Naive Approach – O(N²)

Extract every 10-letter substring.
For each substring, compare it with all other substrings to check if it repeats.
This is very slow, because each substring comparison costs time, leading to roughly N² operations.

✅ Works for small strings.
❌ Too slow for large DNA sequences (like length 100,000).

2. Optimized Approach using Hash Map – O(N)

We use a hash map (or map<string, int>) to count how many times each 10-letter substring appears.

How it works:

Loop through the string, taking every substring of length 10.
Store each substring in a hash map, increasing its count.
After finishing, collect all substrings with a count > 1.

This reduces the time complexity to O(N) because:

Each substring operation is constant time (since it’s always length 10).
Insertion and lookup in a hash map are O(1) on average.
So overall → O(N).

Quick Example:

Input:
s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Process:

Take every 10-letter substring
Count occurrences
Substrings seen more than once → "AAAAACCCCC", "CCCCCAAAAA"

Output:
["AAAAACCCCC", "CCCCCAAAAA"]

C++ Code (Optimized Way – Using Hash Map)

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> result;
        map<string, int> count;
        int n = s.size();
        if (n < 10) return result;

        for (int i = 0; i <= n - 10; i++) {
            string part = s.substr(i, 10);
            count[part]++;
        }

        for (auto it : count) {
            if (it.second > 1)
                result.push_back(it.first);
        }

        return result;
    }
};

Notes:

Time Complexity: O(N)
Space Complexity: O(N) (for storing substrings)
Very popular interview problem (Google, Amazon, and Meta).
Tests your ability to use hashing to detect duplicates efficiently.