DNA Sequence Analyzer - Problem

You are given a DNA sequence represented as a string containing only the characters 'A', 'T', 'G', and 'C'. Find all DNA subsequences of length k that appear more than once in the sequence. A subsequence here means a contiguous substring, and occurrences may overlap.

For each repeated subsequence, return:

The subsequence string
Its frequency (number of occurrences)
All starting positions (0-indexed) where it appears

Return format: An array of objects, where each object has three properties: sequence (string), frequency (integer), and positions (array of integers).

Note: Return results sorted by frequency in descending order, then by sequence lexicographically if frequencies are equal.

Input & Output

Example 1 — Basic Repeated Patterns

$ Input: dna = "AGATCGATCGA", k = 3

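› Output: [{"sequence":"ATC","frequency":2,"positions":[2,6]},{"sequence":"CGA","frequency":2,"positions":[4,8]},{"sequence":"GAT","frequency":2,"positions":[1,5]},{"sequence":"TCG","frequency":2,"positions":[3,7]}]

💡 Note: The sequence has nine windows of length 3. 'ATC' appears at positions 2 and 6, 'CGA' at 4 and 8, 'GAT' at 1 and 5, and 'TCG' at 3 and 7; 'AGA' appears only once (position 0). All four repeated patterns have frequency 2, so they are sorted lexicographically: ATC, CGA, GAT, TCG.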

Example 2 — Single Character Repeats

$ Input: dna = "AAAAAAAAAA", k = 2

› Output: [{"sequence":"AA","frequency":9,"positions":[0,1,2,3,4,5,6,7,8]}]

💡 Note: In a string of 10 A's, the pattern 'AA' appears at every position from 0 to 8, giving it a frequency of 9.

Example 3 — No Repeated Patterns

$ Input: dna = "ATCG", k = 2

› Output: []

💡 Note: Patterns are 'AT', 'TC', 'CG' - each appears only once. No pattern has frequency > 1, so return empty array.

Constraints

1 ≤ dna.length ≤ 10⁴
1 ≤ k ≤ dna.length
dna contains only characters 'A', 'T', 'G', 'C'
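
The constraints above can be checked before running any solution. The following is a minimal sketch; the helper name `is_valid_input` is hypothetical and not part of the original problem.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

// Hypothetical helper (not from the original problem): returns true when
// dna and k satisfy the stated constraints.
bool is_valid_input(const char *dna, int k) {
    assert(dna != NULL);
    size_t len = strlen(dna);
    if (len < 1 || len > 10000) return false;   // 1 <= dna.length <= 10^4
    if (k < 1 || (size_t)k > len) return false; // 1 <= k <= dna.length
    // strspn returns the length of the leading run made up only of "ATGC"
    return strspn(dna, "ATGC") == len;
}
```

For instance, `is_valid_input("ATCG", 2)` holds, while a sequence containing any other character, or a k larger than the sequence length, is rejected.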

The key insight is to use a sliding window with a hash map so that every pattern is tracked in a single pass. The optimal approach slides a k-length window through the DNA sequence and maintains a count and a position list for each pattern. Time: O(n × k) on average, Space: O(n × k).

Common Approaches

Brute Force - Check All Patterns

⏱️ Time: O(n² × k) Space: O(n × k)

Generate every possible K-length substring from the DNA sequence. For each unique substring, scan through the entire sequence again to count occurrences and record positions. Finally, filter out patterns that appear only once.

Sliding Window with Hash Map

⏱️ Time: O(n × k) Space: O(n × k)

Use a sliding window approach to traverse the DNA sequence once. As we slide the window, maintain a hash map to count occurrences and track positions of each K-length pattern. This eliminates the need for multiple passes through the data.
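
The sliding-window idea could be sketched in C roughly as follows. This is an illustration under stated assumptions, not the page's reference solution: the names `PatternIndex`, `hash_window`, `find_slot`, `build_index`, and `free_index` are hypothetical, the table uses open addressing with an FNV-1a hash, and each pattern is identified by the start index of its first occurrence instead of a copied string (so the auxiliary arrays hold O(n) integers). Positions of one pattern are chained through the `next` array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical index: one open-addressing slot per distinct k-length pattern.
typedef struct {
    int *first;  // per slot: start of the first occurrence, -1 if the slot is empty
    int *last;   // per slot: start of the most recent occurrence
    int *count;  // per slot: number of occurrences seen so far
    int *next;   // per window: start of the next occurrence of the same pattern, -1 if none
    int cap;     // number of slots, always a power of two
} PatternIndex;

// FNV-1a hash of the k characters starting at s.
static uint32_t hash_window(const char *s, int k) {
    uint32_t h = 2166136261u;
    for (int i = 0; i < k; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

// Returns the slot holding pattern s[0..k), or the empty slot where it belongs.
static int find_slot(const PatternIndex *idx, const char *dna, const char *s, int k) {
    uint32_t mask = (uint32_t)idx->cap - 1;
    uint32_t slot = hash_window(s, k) & mask;
    while (idx->first[slot] != -1 &&
           strncmp(dna + idx->first[slot], s, (size_t)k) != 0) {
        slot = (slot + 1) & mask;  // linear probing
    }
    return (int)slot;
}

void free_index(PatternIndex *idx) {
    free(idx->first);
    free(idx->last);
    free(idx->count);
    free(idx->next);
}

// Slides a k-length window over dna once. Returns 0 on success, -1 if out of memory.
int build_index(const char *dna, int k, PatternIndex *idx) {
    int len = (int)strlen(dna);
    assert(k >= 1 && k <= len);
    int windows = len - k + 1;

    idx->cap = 1;
    while (idx->cap < 2 * windows) idx->cap *= 2;  // keeps the load factor at or below 0.5
    idx->first = malloc((size_t)idx->cap * sizeof(int));
    idx->last = malloc((size_t)idx->cap * sizeof(int));
    idx->count = calloc((size_t)idx->cap, sizeof(int));
    idx->next = malloc((size_t)windows * sizeof(int));
    if (!idx->first || !idx->last || !idx->count || !idx->next) {
        free_index(idx);
        return -1;
    }
    for (int s = 0; s < idx->cap; s++) idx->first[s] = -1;

    for (int i = 0; i < windows; i++) {
        int slot = find_slot(idx, dna, dna + i, k);
        idx->next[i] = -1;
        if (idx->first[slot] == -1) {
            idx->first[slot] = i;            // first time this pattern is seen
        } else {
            idx->next[idx->last[slot]] = i;  // append to the pattern's position chain
        }
        idx->last[slot] = i;
        idx->count[slot]++;
    }
    return 0;
}
```

After `build_index` returns, every slot with `count > 1` is a repeated pattern; its positions can be read by starting at `first[slot]` and following `next` until -1. Filtering and sorting then proceed as in the brute-force version. Hashing and comparing a window each cost O(k), so the expected total is O(n × k), assuming the hash spreads patterns evenly.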

Brute Force - Check All Patterns — Algorithm Steps

Step 1: Extract all possible K-length substrings from the DNA sequence
Step 2: For each unique substring, scan the entire DNA sequence to count occurrences
Step 3: Record all positions where each substring appears
Step 4: Filter results to keep only patterns with frequency > 1
Step 5: Sort results by frequency (descending) then lexicographically

Step-by-Step Walkthrough

Extract All Substrings

Generate all K-length substrings from the DNA sequence

Find Unique Patterns

Identify distinct patterns from extracted substrings

Rescan for Each Pattern

For each unique pattern, scan entire sequence to count occurrences

Filter and Sort Results

Keep patterns with frequency > 1, sort by frequency then lexicographically

Code

solution.c — C

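#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN 10000  // maximum dna length allowed by the constraints

typedef struct {
    char* sequence;   // heap copy of the pattern (k characters plus '\0')
    int frequency;    // number of occurrences; also the length of positions
    int* positions;   // heap array of 0-indexed starting positions
} Result;

// Sort by frequency (descending), then by sequence (lexicographic)
int compareResults(const void* a, const void* b) {
    const Result* ra = (const Result*)a;
    const Result* rb = (const Result*)b;
    if (ra->frequency != rb->frequency) {
        return rb->frequency - ra->frequency;
    }
    return strcmp(ra->sequence, rb->sequence);
}

// Fills results and returns the number of repeated patterns.
// results needs room for strlen(dna) - k + 1 entries; the caller frees
// the sequence and positions of every returned entry.
int solution(const char* dna, int k, Result* results) {
    int len = (int)strlen(dna);
    if (k < 1 || len < k) {
        return 0;
    }

    int windows = len - k + 1;  // number of k-length substrings
    int* first = malloc(windows * sizeof(int));     // start index of each unique pattern
    int* found_at = malloc(windows * sizeof(int));  // scratch buffer for positions
    if (!first || !found_at) {
        free(first);
        free(found_at);
        return 0;
    }
    int pattern_count = 0;
    int result_count = 0;

    // Extract all k-length substrings and find unique patterns
    for (int i = 0; i < windows; i++) {
        int found = -1;
        for (int j = 0; j < pattern_count; j++) {
            if (strncmp(dna + first[j], dna + i, k) == 0) {
                found = j;
                break;
            }
        }
        if (found == -1) {
            first[pattern_count++] = i;
        }
    }

    // For each unique pattern, count occurrences and find positions
    for (int p = 0; p < pattern_count; p++) {
        const char* pattern = dna + first[p];
        int frequency = 0;
        for (int i = 0; i < windows; i++) {
            if (strncmp(dna + i, pattern, k) == 0) {
                found_at[frequency++] = i;
            }
        }

        // Keep only patterns that appear more than once
        if (frequency > 1) {
            Result* r = &results[result_count];
            r->sequence = malloc(k + 1);
            r->positions = malloc(frequency * sizeof(int));
            if (!r->sequence || !r->positions) {
                free(r->sequence);
                free(r->positions);
                break;  // out of memory: return what was collected so far
            }
            memcpy(r->sequence, pattern, k);
            r->sequence[k] = '\0';
            memcpy(r->positions, found_at, frequency * sizeof(int));
            r->frequency = frequency;
            result_count++;
        }
    }

    free(first);
    free(found_at);
    qsort(results, result_count, sizeof(Result), compareResults);
    return result_count;
}

int main(void) {
    static char dna[MAX_LEN + 3];  // room for "\r\n" and the terminating '\0'
    int k;

    if (!fgets(dna, sizeof(dna), stdin)) {
        return 1;
    }
    dna[strcspn(dna, "\r\n")] = '\0';

    if (scanf("%d", &k) != 1) {
        return 1;
    }

    // There are at most strlen(dna) windows, hence at most that many results
    Result* results = malloc((strlen(dna) + 1) * sizeof(Result));
    if (!results) {
        return 1;
    }
    int count = solution(dna, k, results);

    printf("[");
    for (int i = 0; i < count; i++) {
        if (i > 0) printf(",");
        printf("{\"sequence\":\"%s\",\"frequency\":%d,\"positions\":[",
               results[i].sequence, results[i].frequency);
        for (int j = 0; j < results[i].frequency; j++) {
            if (j > 0) printf(",");
            printf("%d", results[i].positions[j]);
        }
        printf("]}");
        free(results[i].sequence);
        free(results[i].positions);
    }
    printf("]\n");

    free(results);
    return 0;
}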

Time & Space Complexity

Time Complexity

O(n² × k)

For each of n-k+1 substrings, we scan the entire string of length n, and each comparison takes O(k) time
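
One way to make this bound explicit is to count character comparisons. The symbol w = n − k + 1 (the number of windows) is introduced here for illustration and does not appear in the original. Deduplication compares each of the w windows against at most w known patterns, and the rescans compare each of at most w unique patterns against all w windows; every comparison touches at most k characters:

```latex
\[
T(n, k) \;\le\; \underbrace{w \cdot w \cdot k}_{\text{deduplication}}
        \;+\; \underbrace{w \cdot w \cdot k}_{\text{rescans}}
\;=\; 2\,(n - k + 1)^2\,k \;=\; O(n^2 \times k)
\]
```

The final sort compares at most w strings of length k, which costs O(w log w × k) and is dominated by the quadratic term.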

⚠ Quadratic Growth

Space Complexity

O(n × k)

Stores all unique substrings and their positions; in the worst case every substring is unique, giving up to n − k + 1 stored patterns of k characters each.

⚡ Linear in n for fixed k
