Mining
Sequential Patterns with Periodic Wildcard Gaps(MSPPWG)
(This
research is supported by the National Natural Foundation of China under grants No.
61229301, 61170190, and 61370144, the Program for Changjiang Scholars and
Innovative Research Team in University (PCSIRT) of the Ministry of Education,
China, under grant IRT13059, the National 863 Program of China under grant
2012AA011005, the National 973 Program of China under grant 2013CB329604, the
Natural Science Foundation of Hebei Province of China under grant No.
F2013202138, the Key Project of the Educational Commission of Hebei Province
under grant No. ZH2012038, and the Industrial Science and Technology Pillar
Program of Changzhou, Jiangsu, China, under grant CE20120026.)
MSPPWG
Definition 1. S=s_{0} ... s_{i} ...
s_{n1} is called as a subject sequence, where s_{i} is some
symbol and n is the length of S. ∑ can be a different symbol set indifferent
applications and the size of ∑ is denoted by ∑. In a DNA sequence, ∑={A,T,C,G
} and ∑=4.
Definition 2. P=p_{0}[min_{0},max_{0}]p_{1}
... [min_{j1},max_{j1}] p_{j} ... [min_{m2},max_{m2}]
p_{m1} is a pattern, where p_{j}_{}, m is
the length of P, min_{j1} and max_{j1} are integer numbers
and 0≤min_{j1} ≤max_{j1}. min_{j1} and max_{j1}
are gap constraints, presenting the minimum and maximum number of wildcards
between two consecutive letters p_{j1} and p_{j},
respectively. If min_{0}= min_{1} =… = min_{m2}=M and
max_{0}= max_{1} =… = max_{m2} =N, pattern P is called
a pattern with periodic wildcard gaps. For example, A[1,3]C[1,3]C is a pattern
with periodic wildcard gaps [1,3] .
Definition 3.If a sequence of indices D=<d_{0}, d_{1},
…, d_{m1} > is subject to M≤d_{j}d_{j1}1≤N
(0<j<m1), D is an offset sequence of P with periodic wildcard gaps [M,
N] in S. The number of offset sequences of P in S is denoted by ofs(P, S).
Definition 4. If an offset sequence I=<i_{0},… ,i_{j}
,… ,i_{m1} > is subject to _{}(0≤j≤m1 and 0≤i_{j}≤n1),
I is a support (or occurrence) of P in S. The number of supports of P in S is
denoted by sup(P, S).
Definition 5. r(P, S)= sup(P, S)/ofs(P,S) is called support ratio.
If r(P, S) is greater than a prespecified threshold _{}, pattern P is a
frequent pattern, otherwise P is an infrequent pattern.
Instance 1:Given
patterns P1= A[1,3]C and P2=G[1, 3]C, a sequence S=s_{0}s_{1}s_{2}s_{3}s_{4}s_{5}=AGCCCT
and _{}=0.25.
We can easily enumerate all 9 offset sequences of P1 and P
we give an illustrative example to show the principle of GCS as follows:
Instance 2:Given a pattern P=T[0,3]C[0,3]G and a sequence S=s_{0}s_{1}s_{2}s_{3}s_{4}s_{5}s_{6}s_{7}s_{8}s_{9} = TTCCTCCGCG.
In the GCS algorithm a two dimensional array A is created (shown in Figure 1). If p_{i}<>s_{j} (0≤i≤m1), A(i, j)=0. If i=0 and p_{i}=s_{j}, A(i, j)=1 otherwise A(i, j)=_{}(i1, jk1) (0<i≤m1). The sum of the elements in the last row is the number of supports of the pattern in the sequence. Hence we can know that the number of supports of P in S is 0+0+0+0+0+0+0+5+0+4=9.
Figure 1.Gap constrained pattern search.
Definition 6. A Nettree data structure is a collection of nodes. The collection can be empty, otherwise a Nettree consists of some distinguished nodes r_{1},…,r_{m} (root), and 0 or more nonempty subNettrees T_{1},T_{2}, …, T_{n}, each of whose roots can be connected by at least an edge from root r_{i}, where 1≤m, 1≤n, and 1≤i≤ n. A generic Nettree is shown in Figure 2.
Figure 2. A generic Nettree.
To show the
differences between the tree data structure and theNettree data structure,
Properties of Nettree are given as follows.
(1) A Nettree is an extension of a tree because it has
similar concepts of a tree, such as the root, leaf,level,parent,child,and so
on.
(2) A Nettree may have more than one root.
(3) Some nodes except roots in a Nettree may have more
than one parent.
(4) There may be more than one path from a node to a
root.
(5) Some nodes may occur more than once.
Definition 7. The same node can occur once or more in a Nettree. We can use the node name to denote each of them if each node occurs only once. Otherwise n^{i}_{j} denotes node i on the j^{th} level if a node occurs more than once in different levels in a Nettree.
Definition 8. A path from a first level node to node n^{i}_{j} is called as a root path. R(n^{i}_{j}) denotes the Number of Root Paths (NRP) of node n^{i}_{j} . The NRPs of a first level node and a j^{th} level node (j>1) are 1 and the sum of the NRPs of its parents, respectively.
Instance 3: Figure 3 shows a Nettree.
Node 3 occurs 3 times in the Nettree.n^{3}_{1} ,n^{3}_{2} and n^{3}_{3} denote node
Figure 3. A Nettree.
Incomplete Nettree means the last layer of Nettree. In order to reduce space consumption ,we use the incomplete Nettree to express a pattern. Here we give an example to account for its usage in computing the number of supports of a pattern'superpatterns.
Instance 4: Given a pattern P = T[0,3]C and a sequence S=s_{0}s_{1}s_{2}s_{3}s_{4}s_{5}s_{6}s_{7}s_{8}s_{9}=TTCCTCCGCG.
Figure 4 shows a Nettree to express pattern P. In fact ,we can only use the second layer of this Nettree in the black frame to express pattern P which is just the incomplete Nettree of P. The sum of values in grey circles is the number of supports of P, namely sup(P, S)=2+2+2+1+1=8.
Figure 4. A Nettree for P in S.
The most important thing is that we can employ pattern P to construct incomplete Nettrees of its superpatterns simultaneously and get their numbers of supports simply.For example, we can construct incomplete Nettrees of patterns P_{0}= T[0,3]C[0,3]A, P_{1}= T[0,3]C[0,3]T, P_{2}=T[0,3]C[0,3]C,P_{3}=T[0, 3]C[0,3]G in S at the same time. The process and result can be shown as follows :
R(node)
is used to express NRP of a node according to Definition 8. We create child
nodes of n^{2}_{2} from s_{3} to s_{6} since
gap constraints are [0, 3]. Because s_{3}=C, P_{2}=T[0,3]C[0,3]C
and node n^{3}_{3} is not in the incomplete Nettree of P_{2},
we create node n
Figure 5. Incomplete Nettrees of P, P_{0}, P_{1}, P_{2}, and P_{3}.
Solving algorithm:
(1)Representative algorithms:
MPP
Algorithm designed by Minghua Zhang ^{[1]}
Reprogramed by Lingling Wang using java
AMIN
^{[3]}
(2)The algorithms we design：
MAPB based on Incomplete Nettree
MAPD based on Incomplete Nettree
MAPBOK based on MAPB
(3)The input parameters of all algorithms:
The input parameters of all other algorithms except for MAPBOK include "a file path of biosequence(generally a *.txt file which contains the subject sequence, such as ' e:\ax1000.txt')", "a minimum gap", "a maximum gap", "a threshold" and "the possible maximum length of frequent patterns". For MAPBOK, the input parameters include "a file path of biosequence(similar to the other algorithms)", "a minimum gap","a maximum gap", "a threshold", "the value of K to ensure the number of frequent patterns printed" and "the extension factor e (a smaller number, such as 1, 1.5, 2 or 3 and so on)".
Experimental data
Table1. Biodata sequences
sequence 
the name of segment 
length 
data source 
experimental data 
S1 
AX1000 
1000 

S2 
AX2000 
2000 

S3 
AX4000 
4000 

S4 
AX8000 
8000 

S5 
AX10011 
10011 

S6 
AL20000 
20000 

S7 
AL40000 
40000 

S8 
AL80000 
80000 

S9 
AL167005 
167005 

S10 
AB15000 
15000 

S11 
AB30000 
30000 

S12 
AB60000 
60000 

S13 
AB131892 
131892 
Experimental environment and results
All experiments run on a laptop with Pentium(R) DualCore T4500@
2.30GHz CPU and 2.0 GB of RAM, Windows 7. Java Development Kit (JDK)
Table2. Mining results
sequence 
the number of frequent
patterns for various lengths 
the greatest
length of frequent patterns 
S1

4,16,64,256,1024,4096,13374,5678,1514,623,
242, 55, 12 
13 
S2

4,16,64,256,1024,4096,15205,3436,350,85,8,3

12 
S3

4,16,64,256,1024,4096,15965,1937,59,3

10 
S4

4,16,64,256,1024,4096,14970,4283,241,1

10 
S5

4,16,64,256,1024,4096,14422,4811,299,1

10 
S6

4,16,64,256,1024,4096,12619,7068,614,8

10 
S7

4,16,64,256,1024,4096,12388,6960,749,11

10 
S8

4,16,64,256,1024,4096,12947,6303,666,11

10 
S9

4,16,64,256,1024,4096,13438,5767,604,1

10 
S10

4,16,64,256,1024,4096,12507,7197,563,2

10 
S11

4,16,64,256,1024,4096,11126,7634,1164,54,1

11 
S12

4,16,64,256,1024,4096,12799,6404,699,11

10 
S13

4,16,64,256,1024,4096,12913,6558,672,11

10 
[1] Zhang M, Kao B,
Cheung DW,
[2] Zhu X, X. Wu
(2007) Mining complex patterns across sequences with gap requirements. In:
Proceedings of the 20th International Joint Conference on Artificial Intelligence,
[3] Min F, Wu Y, Wu X
(2012) The Apriori property of sequence pattern mining with wildcard gaps.
International Journal Functional Informatics and Personalised Medicine. 4(1):
15–31