Repeated DNA Sequences – Reading Version
Problem in short:
You’re given a DNA string made up of 'A', 'C', 'G', and 'T'.
Your task is to find all 10-letter-long substrings that appear more than once in the string.
Example:
If the DNA is "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
then "AAAAACCCCC" and "CCCCCAAAAA" both appear twice.
Idea of the solution:
We need to check every 10-letter substring in the DNA and count how many times it appears.
If any substring appears more than once, we include it in our result.
Step-by-step solution (two main approaches):
1. Naive Approach – O(N²)
-
Extract every 10-letter substring.
-
For each substring, compare it with all other substrings to check if it repeats.
-
This is very slow, because each substring comparison costs time, leading to roughly N² operations.
✅ Works for small strings.
❌ Too slow for large DNA sequences (like length 100,000).
2. Optimized Approach using Hash Map – O(N)
We use a hash map (or map<string, int>) to count how many times each 10-letter substring appears.
How it works:
-
Loop through the string, taking every substring of length 10.
-
Store each substring in a hash map, increasing its count.
-
After finishing, collect all substrings with a count > 1.
This reduces the time complexity to O(N) because:
-
Each substring operation is constant time (since it’s always length 10).
-
Insertion and lookup in a hash map are O(1) on average.
So overall → O(N).
Quick Example:
Input:s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Process:
-
Take every 10-letter substring
-
Count occurrences
-
Substrings seen more than once →
"AAAAACCCCC","CCCCCAAAAA"
Output:["AAAAACCCCC", "CCCCCAAAAA"]
C++ Code (Optimized Way – Using Hash Map)
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
vector<string> result;
map<string, int> count;
int n = s.size();
if (n < 10) return result;
for (int i = 0; i <= n - 10; i++) {
string part = s.substr(i, 10);
count[part]++;
}
for (auto it : count) {
if (it.second > 1)
result.push_back(it.first);
}
return result;
}
};
Notes:
-
Time Complexity: O(N)
-
Space Complexity: O(N) (for storing substrings)
-
Very popular interview problem (Google, Amazon, and Meta).
-
Tests your ability to use hashing to detect duplicates efficiently.
