GenomicRangeQuery

Task

The task is to find the minimum values in slices of an array.

A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T have impact factors of 1, 2, 3 and 4, respectively. You are going to answer several queries of the form: What is the minimal impact factor of nucleotides contained in a particular part of the given DNA sequence?

The DNA sequence is given as a non-empty string S = S[0]S[1]…S[N-1] consisting of N characters. There are M queries, which are given in non-empty arrays P and Q, each consisting of M integers. The K-th query (0 ≤ K < M) requires you to find the minimal impact factor of nucleotides contained in the DNA sequence between positions P[K] and Q[K] (inclusive).

For example, consider string S = CAGCCTA and arrays P, Q such that:
P[0] = 2    Q[0] = 4
P[1] = 5    Q[1] = 5
P[2] = 0    Q[2] = 6

The answers to these M = 3 queries are as follows:

The part of the DNA between positions 2 and 4 contains nucleotides G and C (twice), whose impact factors are 3 and 2 respectively, so the answer is 2.

The part between positions 5 and 5 contains a single nucleotide T, whose impact factor is 4, so the answer is 4.

The part between positions 0 and 6 (the whole string) contains all nucleotides, in particular nucleotide A whose impact factor is 1, so the answer is 1.

Write a function: vector<int> solution(string &S, vector<int> &P, vector<int> &Q);

that, given a non-empty string S consisting of N characters and two non-empty arrays P and Q consisting of M integers, returns an array consisting of M integers specifying the consecutive answers to all queries.

Result array should be returned as a vector of integers.

For example, given the string S = CAGCCTA and arrays P, Q such that:
P[0] = 2    Q[0] = 4
P[1] = 5    Q[1] = 5
P[2] = 0    Q[2] = 6

the function should return the values [2, 4, 1], as explained above.

Write an efficient algorithm for the following assumptions:

N is an integer within the range [1..100,000];

M is an integer within the range [1..50,000];

each element of arrays P, Q is an integer within the range [0..N − 1];

P[K] ≤ Q[K], where 0 ≤ K < M;

string S consists only of upper-case English letters A, C, G, T.
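
To make the query semantics concrete, a direct reading of the task can be sketched as a brute-force scan of each slice. This is only a baseline for comparison, not the submitted solution: the names impactFactor and solutionBruteForce are made up for this illustration, and at the upper limits it would visit about 100,000 × 50,000 = 5 × 10^9 characters, which is presumably too slow for the performance tests.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: impact factor of one nucleotide, per the task statement.
int impactFactor(char nucleotide) {
  switch (nucleotide) {
    case 'A': return 1;
    case 'C': return 2;
    case 'G': return 3;
    case 'T': return 4;
    default:
      assert(false && "unexpected nucleotide");
      return 0;
  }
}

// Hypothetical brute-force baseline: scans every slice character by character.
std::vector<int> solutionBruteForce(const std::string &S,
                                    const std::vector<int> &P,
                                    const std::vector<int> &Q) {
  assert(P.size() == Q.size());
  std::vector<int> res;
  res.reserve(P.size());
  for (std::size_t k = 0; k < P.size(); k++) {
    int mini = 4;  // largest possible impact factor
    for (int i = P[k]; i <= Q[k]; i++) {
      mini = std::min(mini, impactFactor(S[i]));
    }
    res.push_back(mini);
  }
  return res;
}
```

For the example above, solutionBruteForce("CAGCCTA", {2, 5, 0}, {4, 5, 6}) returns [2, 4, 1].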

Solution

The trick is to use the ‘prefix sums’ technique described in the documentation. For each letter, store the number of times that letter occurs before each index.

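The number of times a letter occurs in a slice is then the difference between two of these stored counts, so checking whether the letter occurs in the slice takes constant time.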
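
The idea can be sketched for a single letter as follows. The helper names prefixCounts and countInSlice are made up for this illustration and are not part of the solution below, which builds three such arrays in one pass.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: prefix[i] = number of occurrences of `letter` in S[0..i-1].
std::vector<int> prefixCounts(const std::string &S, char letter) {
  std::vector<int> prefix(S.size() + 1, 0);
  for (std::size_t i = 0; i < S.size(); i++) {
    prefix[i + 1] = prefix[i] + (S[i] == letter ? 1 : 0);
  }
  return prefix;
}

// Hypothetical helper: occurrences of the letter in S[from..to], both ends inclusive.
int countInSlice(const std::vector<int> &prefix, int from, int to) {
  assert(0 <= from && from <= to && to + 1 < static_cast<int>(prefix.size()));
  return prefix[to + 1] - prefix[from];
}
```

For S = CAGCCTA, prefixCounts(S, 'A') gives [0, 0, 1, 1, 1, 1, 1, 2]. The slice from 2 to 4 then contains 1 - 1 = 0 As, while the whole string (0 to 6) contains 2 - 0 = 2.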


#include <cassert>
#include <string>
#include <vector>

using namespace std;

vector<int> solution(string &S, vector<int> &P, vector<int> &Q) {
  // write your code in C++14 (g++ 6.2.0)
  size_t N = S.size();
  size_t M = P.size();
  assert(Q.size() == M);
  // X_prev[i] is the number of Xs in S[0..i-1], so each array needs N+1 entries and X_prev[0] is 0
  vector<int> As_prev(N+1);
  vector<int> Cs_prev(N+1);
  vector<int> Gs_prev(N+1);
  As_prev[0] = 0;
  Cs_prev[0] = 0;
  Gs_prev[0] = 0;

  // build index of count of previous nucleotides of diff types
  int A_count = 0, C_count = 0, G_count = 0;
  for (std::string::size_type i = 0; i < S.size(); i++) {
    char nucleo = S[i];
    switch (nucleo) {
    case 'A':
      A_count++;
      break;
    case 'C':
      C_count++;
      break;
    case 'G':
      G_count++;
      break;
    case 'T':
      break;
    default:
      assert(false);
    }

    As_prev[i+1] = A_count;
    Cs_prev[i+1] = C_count;
    Gs_prev[i+1] = G_count;
  }
  // now calc minimums
  vector<int> res(M);
  for (size_t i = 0; i < M; i++) {
    int mini = 5;
    size_t from = P[i];
    size_t to = Q[i];

    // the slice includes position `to`, hence the to+1
    int numAs = As_prev[to+1] - As_prev[from];
    assert(numAs >= 0);
    if (numAs > 0) {
      mini = 1;
    } else {
      int numCs = Cs_prev[to+1] - Cs_prev[from];
      assert(numCs >= 0);
      if (numCs > 0) {
        mini = 2;
      } else {
        int numGs = Gs_prev[to+1] - Gs_prev[from];
        assert(numGs >= 0);
        if (numGs > 0) {
          mini = 3;
        } else {
          // only Ts left
          mini = 4;
        }
      }
    }// numAs else
    // mini now holds the minimal impact factor for this query
    assert(mini<=4);
    res[i] = mini;
  }// for
  return res;
}

This could be simplified by using an array of arrays, indexed by the letter, instead of three separately named vectors.
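
One possible shape for that simplification is sketched below. It is an untested-on-Codility variant, not the submitted code: the name solutionCompact and the choice to index rows by a letter's position in the string "ACG" are assumptions made for this illustration.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical variant of solution(): one prefix-count row per letter.
std::vector<int> solutionCompact(const std::string &S,
                                 const std::vector<int> &P,
                                 const std::vector<int> &Q) {
  assert(P.size() == Q.size());
  const std::string letters = "ACG";  // T needs no row: it is the fallback
  const std::size_t N = S.size();

  // counts[j][i] = occurrences of letters[j] in S[0..i-1]
  std::array<std::vector<int>, 3> counts;
  for (std::size_t j = 0; j < letters.size(); j++) {
    counts[j].assign(N + 1, 0);
    for (std::size_t i = 0; i < N; i++) {
      counts[j][i + 1] = counts[j][i] + (S[i] == letters[j] ? 1 : 0);
    }
  }

  std::vector<int> res(P.size(), 4);  // default: only Ts in the slice
  for (std::size_t k = 0; k < P.size(); k++) {
    // Try letters in order of increasing impact factor; the first hit wins.
    for (std::size_t j = 0; j < letters.size(); j++) {
      if (counts[j][Q[k] + 1] - counts[j][P[k]] > 0) {
        res[k] = static_cast<int>(j) + 1;  // impact factors: A=1, C=2, G=3
        break;
      }
    }
  }
  return res;
}
```

The nested if/else chain becomes a short loop, at the cost of scanning S once per letter. The running time is still linear in N + M.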

Results

Correctness	100%
Performance	100%
Time Complexity	O(N+M)
