Approximate pattern matching with gap constraints

(This research is supported by the National Natural Foundation of China (NSFC) under grant No. 61229301 and 61370144, the Natural Science Foundation of Hebei Province of China under grant No. F2013202138, and the Key Project of the Educational Commission of Hebei Province under grant No. ZH2012038.)

 

Abstract. Pattern matching is a key issue in sequential pattern mining, with many researchers now focussing on pattern matching with gap constraints. However, most of these studies involve exact pattern matching problems, a special case of approximate pattern matching and a more challenging task. In this study, we introduce an approximate pattern matching problem with Hamming distance. Its objective is to compute the number of approximate occurrences of pattern P with gap constraints in sequence S under similarity constraint d. We propose an efficient algorithm named SONG (Single-rOot Nettree for approximate pattern matchinG with gap constraints) and based on the new nonlinear data structure Single-root Nettree. SONG uses some new concepts of Single-root Nettrees to solve the problem effectively. Theoretical analysis and experiments demonstrate an interesting law that the ratio M(P,S,d)/N(P,S,m) approximately follows a binomial distribution, where M(P,S,d) and N(P,S,m) are the numbers of the approximate occurrences whose distances to pattern P are d (0≤dm) and no more than m (the length of pattern P), respectively. Experimental results for real biological data validate the efficiency and correctness of SONG.

 

 

Problem:

Definition 1. A pattern can be denoted as P= p0[a0, b0]p1… [aj-1, bj-1]pj… [am-2, bm-2]pm-1, where m denotes the length of P, aj-1 and bj-1 are given integers in a group of gap constraints, representing the minimal and maximal wildcards between pj and pj+1, where 0≤ajbj.

Definition 2. A sequence can be denoted as S=s0s1sisn-1, where n is the length of S. ∑ represents a set of characters and λ is the length of ∑.

Definition 3. Given two sequences P=p0p1pm-1 and Q=q0q1qm-1, if there are k positions at which the corresponding characters are different, i.e. pi qi (0≤i<m), then the Hamming distance between the two strings is k (0≤km). D(P,Q) is used to denote the Hamming distance between P and Q.

Definition 4. Given a threshold d, if a position indices I= <i0,… ,ij ,… ,im-1 > satisfies the following equations:

 

                                    (1)

0≤aj-1ij-ij-1-1≤bj-1                                        (2)

0< MinLenim-1- i0+1≤MaxLen                                         (3)

 

where 0≤jm-1 and 0≤ijn-1, then I is an approximate occurrence of P in S.

    Apparently, all gap constraints can determine the length constraints, with the lower bound of MinLen and the upper bound of MaxLen equal to  and , respectively.

Definition 5. The goal of the problem is to compute the total number of approximate occurrences, as denoted by N(P,S,d).

 

 

Example:

    Given pattern P=p0[a0,b0]p1[a1,b1]p2=a[0,2]g[1,3]a, sequence S=s0s1s2s3s4s5s6=atggaga, length constraints MinLen=4 and MaxLen=8, and similarity constraint d=1.

Firstly, if we do not consider the similarity constraint or the similarity constraint d=0, the instance is an exact pattern matching problem which was solved in [11] and [12]. According to pattern P and sequence S, we know that pattern P has two gaps. Each gap has two gap constraints, one being the lower bound of the number of wildcards and the other the upper bound of the number of wildcards. For example, for the first group of gap constraints, [0,2] in 'a[0,2]g' means that the number of wildcards between 'a' and 'g' can be 0, 1 or 2. We use a group of position indices in sequence S to denote an occurrence. The span (the distance between the beginning position and ending position) of an occurrence should be subject to the length constraints. For instance, <0, 2, 4> is an exact occurrence of P in S, because p0=s0=a, p1=s2=g, p2=s4=a, a0≤2-0-1≤b0 and a14-2-1b1. <0,1,4> is not an exact occurrence, although it does satisfy the gap constraints because p1='g' is not equal to s1='t'. However, we can say that <0,1,4> is an approximate occurrence with a Hamming distance of 1. Similarly, <0,4,6> is not an occurrence, because 4-0-1=3 is not subject to gap constraints [a0, b0]. We can thus easily enumerate all 3 exact occurrences: <0,2,4>, <0,2,6> and <0,3,6>, as well as all 7 approximate occurrences with a Hamming distance of 1: <0,1,4>, <0,2,5>, <0,3,5>, <1,2,4>, <1,2,6>, <1,3,6> and <2,3,6>. So there are 10 occurrences under the similarity constraint d=1. In the loose pattern matching problem, all occurrences are 4, 5 and 6. The span of occurrence <0,2,4> is 4-0+1=5, which is subject to MinLen=4 and MaxLen=8, since MinLen ≤ 5 ≤ MaxLen. Suppose the length constraints are MinLen=4 and MaxLen=6; occurrences <0,2,6> and <0,3,6> are not subject to these constraints since the spans of occurrences <0,2,6> and <0,3,6> are both 7, which is greater than MaxLen=6. Therefore, there are 8 occurrences under MinLen=4 and MaxLen=6.

 

 

 

Data:

The H1N1 virus was isolated during human swine flu outbreak in 2009. The biological sequences of the H1N1 2009 virus can be downloaded from the National Center for Biotechnology Information website (http://www.ncbi.nlm. nih.gov/). There are many candidates, and we select the first four segments of a result (A/Managua/2093.01/2009(H1N1)), published on March 30, 2010 (http://www.ncbi.nlm.nih.gov/genomes/FLU/ SwineFlu.html). Table 1 show all the sequences.

Table 1: Segments of the H1N1 virus sequence

Number

Segment NO.

Locus

length

S1

Segment1

CY058563

2286

S2

Segment2

CY058562

2299

S3

Segment3

CY058561

2169

S4

Segment4

CY058556

1720

 

Table 2 shows the 4 patterns.

Table 2: Patterns

Number

Pattern

P1

a[0,3]t[0,3]a[0,3]t[0,3]a[0,3]t[0,3]a[0,3]t[0,3]a[0,3]t[0,3]a

P2

g[1,5]t[0,6]a[2,7]g[3,9]t[2,5]a[4,9]g[1,8]t[2,9]a

P3

g[1,9]t[1,9]a[1,9]g[1,9]t[1,9]a[1,9]g[1,9]t[1,9]a[1,9]g[1,9]t

P4

g[1,5]t[0,6]a[2,7]g[3,9]t[2,5]a[4,9]g[1,8]t[2,9]a[1,9]g[1,9]t

 

Solving Algorithms:

 

PAIG-APPRO          base on PAIG

Song-nonp                without using an effective pruning strategy

 

Proposed algorithm: SONG