Approximate
pattern matching with gap constraints
(This research is supported by the National
Natural Foundation of China (NSFC) under grant No. 61229301 and 61370144, the Natural Science Foundation of Hebei
Province of China under grant No. F2013202138, and the Key Project of the
Educational Commission of Hebei Province under grant No. ZH2012038.)
Abstract. Pattern matching is a key issue in sequential pattern mining, with many researchers
now focussing on pattern matching with gap constraints. However, most of these
studies involve exact pattern matching problems, a special case of approximate
pattern matching and a more challenging task. In this study, we introduce an
approximate pattern matching problem with Hamming distance. Its objective is to
compute the number of approximate occurrences of pattern P with gap constraints in sequence S under similarity constraint d.
We propose an efficient algorithm named SONG (SinglerOot Nettree for
approximate pattern matchinG with gap constraints) and based on the new
nonlinear data structure Singleroot Nettree. SONG uses some new concepts of
Singleroot Nettrees to solve the problem effectively. Theoretical analysis and
experiments demonstrate an interesting law that the ratio M(P,S,d)/N(P,S,m) approximately
follows a binomial distribution, where M(P,S,d) and N(P,S,m) are the numbers of
the approximate occurrences whose distances to pattern P are d (0≤d≤m)
and no more than m (the length of
pattern P), respectively.
Experimental results for real biological data validate the efficiency and
correctness of SONG.
Problem:
Definition 1. A pattern can be denoted as P= p_{0}[a_{0}, b_{0}]p_{1}…
[a_{j}_{1}, b_{j}_{1}]p_{j}… [a_{m}_{2}, b_{m}_{2}]p_{m}_{1}, where m denotes the length of P, a_{j}_{1}
and b_{j}_{1} are
given integers in a group of gap constraints,
representing the minimal and maximal wildcards between p_{j }and p_{j}_{+1},
where 0≤a_{j}≤b_{j}.
Definition 2. A sequence can be denoted as S=s_{0}s_{1}…s_{i}… s_{n}_{1},
where n is the length of S. ∑ represents a set of characters and λ is the length of ∑.
Definition 3. Given two
sequences P=p_{0}p_{1}…p_{m}_{1 }and Q=q_{0}q_{1}…q_{m}_{1}, if there are k positions at
which the corresponding characters are different, i.e. p_{i}
≠ q_{i }(0≤i<m), then the Hamming distance between the two strings is k (0≤k≤m).
D(P,Q) is used to denote the Hamming distance
between P and Q.
Definition 4. Given a threshold d, if a position indices I=
<i_{0},… ,i_{j} ,… ,i_{m}_{1} > satisfies the following equations:
(1)
0≤a_{j}_{1}≤i_{j}i_{j}_{1}1≤b_{j}_{1} (2)
0< MinLen ≤i_{m}_{1}
i_{0}+1≤MaxLen (3)
where 0≤j≤m1 and 0≤i_{j}≤n1, then I is an approximate occurrence of P in S.
Apparently, all gap constraints can determine the length constraints,
with the lower bound of MinLen and
the upper bound of MaxLen equal to and ,
respectively.
Definition 5. The goal of the
problem is to compute the total number of approximate occurrences, as denoted
by N(P,S,d).
Example:
Given pattern P=p_{0}[a_{0},b_{0}]p_{1}[a_{1},b_{1}]p_{2}=a[0,2]g[1,3]a,
sequence S=s_{0}s_{1}s_{2}s_{3}s_{4}s_{5}s_{6}=atggaga, length constraints MinLen=4 and MaxLen=8,
and similarity constraint d=1.
Firstly, if we do not consider
the similarity constraint or the similarity constraint d=0, the instance is an exact pattern matching problem which was
solved in [11] and [12]. According to pattern P and sequence S, we know
that pattern P has two gaps. Each gap
has two gap constraints, one being the lower bound of the number of wildcards
and the other the upper bound of the number of wildcards. For example, for the
first group of gap constraints, [0,2] in 'a[0,2]g' means that the number of
wildcards between 'a' and 'g' can be 0, 1 or 2. We use a group of position
indices in sequence S to denote an
occurrence. The span (the distance between the beginning position and ending
position) of an occurrence should be subject to the length constraints. For
instance, <0, 2, 4> is an exact occurrence of P in S, because p_{0}=s_{0}=a, p_{1}=s_{2}=g, p_{2}=s_{4}=a,
a_{0}≤201≤b_{0} and a_{1}≤
Data:
The
H1N1 virus was isolated during human swine flu outbreak in 2009. The biological
sequences of the H1N1 2009 virus can be downloaded from the National Center for
Biotechnology Information website (http://www.ncbi.nlm. nih.gov/). There are
many candidates, and we select the first four segments of a result (A/Managua/2093.01/2009(H1N1)),
published on March 30, 2010 (http://www.ncbi.nlm.nih.gov/genomes/FLU/
SwineFlu.html). Table 1 show all the sequences.
Table 1: Segments of the H1N1 virus sequence
Number 
Segment NO. 
Locus 
length 
S1 
Segment1 
CY058563 

S2 
Segment2 
CY058562 

S3 
Segment3 
CY058561 

S4 
Segment4 
CY058556 
Table 2 shows the 4 patterns.
Table 2: Patterns
Number 
Pattern 
P1 
a[0,3]t[0,3]a[0,3]t[0,3]a[0,3]t[0,3]a[0,3]t[0,3]a[0,3]t[0,3]a 
P2 
g[1,5]t[0,6]a[2,7]g[3,9]t[2,5]a[4,9]g[1,8]t[2,9]a 
P3 
g[1,9]t[1,9]a[1,9]g[1,9]t[1,9]a[1,9]g[1,9]t[1,9]a[1,9]g[1,9]t 
P4 
g[1,5]t[0,6]a[2,7]g[3,9]t[2,5]a[4,9]g[1,8]t[2,9]a[1,9]g[1,9]t 
Solving Algorithms:
PAIGAPPRO base
on PAIG
Songnonp without
using an effective pruning strategy
Proposed algorithm: SONG