Mining Sequential Patterns with Periodic Wildcard Gaps(MSPPWG)

(This research is supported by the National Natural Foundation of China under grants No. 61229301, 61170190, and 61370144, the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education, China, under grant IRT13059, the National 863 Program of China under grant 2012AA011005, the National 973 Program of China under grant 2013CB329604, the Natural Science Foundation of Hebei Province of China under grant No. F2013202138, the Key Project of the Educational Commission of Hebei Province under grant No. ZH2012038, and the Industrial Science and Technology Pillar Program of Changzhou, Jiangsu, China, under grant CE20120026.)

MSPPWG

Definition 1. S=s0 ... si ... sn-1 is called as a subject sequence, where si is some symbol and n is the length of S. ∑ can be a different symbol set indifferent applications and the size of ∑ is denoted by |∑|. In a DNA sequence, ∑={A,T,C,G } and |∑|=4.

Definition 2. P=p0[min0,max0]p1 ... [minj-1,maxj-1] pj ... [minm-2,maxm-2] pm-1 is a pattern, where pj , m is the length of P, minj-1 and maxj-1 are integer numbers and 0≤minj-1 ≤maxj-1. minj-1 and maxj-1 are gap constraints, presenting the minimum and maximum number of wildcards between two consecutive letters pj-1 and pj, respectively. If min0= min1 =… = minm-2=M and max0= max1 =… = maxm-2 =N, pattern P is called a pattern with periodic wildcard gaps. For example, A[1,3]C[1,3]C is a pattern with periodic wildcard gaps [1,3] .

Definition 3.If a sequence of indices D=<d0, d1, …, dm-1 > is subject to M≤dj-dj-1-1≤N (0<j<m-1), D is an offset sequence of P with periodic wildcard gaps [M, N] in S. The number of offset sequences of P in S is denoted by ofs(P, S).

Definition 4. If an offset sequence I=<i0,… ,ij ,… ,im-1 > is subject to (0≤j≤m-1 and 0≤ij≤n-1), I is a support (or occurrence) of P in S. The number of supports of P in S is denoted by sup(P, S).

Definition 5. r(P, S)= sup(P, S)/ofs(P,S) is called support ratio. If r(P, S) is greater than a pre-specified threshold , pattern P is a frequent pattern, otherwise P is an infrequent pattern.

Instance 1:Given patterns P1= A[1,3]C and P2=G[1, 3]C, a sequence S=s0s1s2s3s4s5=AGCCCT and =0.25.

We can easily enumerate all 9 offset sequences of P1 and P2 in S which are <0,2>,<0,3>,<0,4>, <1,3>, <1,4>, <1,5>, <2,4>, <2,5>, and <3,5>. Hence ofs(P1, S)=ofs(P2, S)=9. The supports of P1 in S include <0,2>,<0,3>, and <0,4> and sup(P1, S)=3. P1 is a frequent pattern since the support ratio r(P1,S)=3/9=0.333. We can know that sup(P2, S)=2 and the supports are <1, 3> and <1, 4>. So r(P2, S)=2/9=0.222 < and P2 is an infrequent pattern.

GCS

we give an illustrative example to show the principle of GCS as follows:

Instance 2:Given a pattern P=T[0,3]C[0,3]G and a sequence S=s0s1s2s3s4s5s6s7s8s9 = TTCCTCCGCG.

In the GCS algorithm a two dimensional array A is created (shown in Figure 1). If pi<>sj (0≤i≤m-1), A(i, j)=0. If i=0 and pi=sj, A(i, j)=1 otherwise A(i, j)= (i-1, j-k-1) (0<i≤m-1). The sum of the elements in the last row is the number of supports of the pattern in the sequence. Hence we can know that the number of supports of P in S is 0+0+0+0+0+0+0+5+0+4=9. Figure 1.Gap constrained pattern search.

Nettree

Definition 6. A Nettree data structure is a collection of nodes. The collection can be empty, otherwise a Nettree consists of some distinguished nodes r1,…,rm (root), and 0 or more non-empty subNettrees T1,T2, …, Tn, each of whose roots can be connected by at least an edge from root ri, where 1≤m, 1≤n, and 1≤i≤ n. A generic Nettree is shown in Figure 2. Figure 2. A generic Nettree.

To show the differences between the tree data structure and theNettree data structure, Properties of Nettree are given as follows.

(1) A Nettree is an extension of a tree because it has similar concepts of a tree, such as the root, leaf,level,parent,child,and so on.
(2) A Nettree may have more than one root.
(3) Some nodes except roots in a Nettree may have more than one parent.
(4) There may be more than one path from a node to a root.
(5) Some nodes may occur more than once.

Definition 7. The same node can occur once or more in a Nettree. We can use the node name to denote each of them if each node occurs only once. Otherwise nij denotes node i on the jth level if a node occurs more than once in different levels in a Nettree.

Definition 8. A path from a first level node to node nij is called as a root path. R(nij) denotes the Number of Root Paths (NRP) of node nij . The NRPs of a first level node and a jth level node (j>1) are 1 and the sum of the NRPs of its parents, respectively.

Instance 3: Figure 3 shows a Nettree. Node 3 occurs 3 times in the Nettree.n31 ,n32 and n33 denote node 3 in 1st, 2nd and 3rd level, respectively.Node n42 has two parents,n11 and n31 . Figure 3. A Nettree.

Incomplete Nettree

Incomplete Nettree means the last layer of Nettree. In order to reduce space consumption ,we use the incomplete Nettree to express a pattern. Here we give an example to account for its usage in computing the number of supports of a pattern'super-patterns.

Instance 4: Given a pattern P = T[0,3]C and a sequence S=s0s1s2s3s4s5s6s7s8s9=TTCCTCCGCG.

Figure 4 shows a Nettree to express pattern P. In fact ,we can only use the second layer of this Nettree in the black frame to express pattern P which is just the incomplete Nettree of P. The sum of values in grey circles is the number of supports of P, namely sup(P, S)=2+2+2+1+1=8. Figure 4. A Nettree for P in S.

The most important thing is that we can employ pattern P to construct incomplete Nettrees of its super-patterns simultaneously and get their numbers of supports simply.For example, we can construct incomplete Nettrees of patterns P0= T[0,3]C[0,3]A, P1= T[0,3]C[0,3]T, P2=T[0,3]C[0,3]C,P3=T[0, 3]C[0,3]G in S at the same time.  The process and result can be shown as follows :

R(node) is used to express NRP of a node according to Definition 8. We create child nodes of n22 from s3 to s6 since gap constraints are [0, 3]. Because s3=C, P2=T[0,3]C[0,3]C and node n33 is not in the incomplete Nettree of P2, we create node n33 in the incomplete Nettree of P2 and R(n33)=R(n22)=2. We create node n43 in the incomplete Nettree of P1 and R(n43)=R(n22)=2. Similarly, we create nodes n53 and n63 in the incomplete Nettree of P2, R(n53)= R(n63)= R(n22)=2. Then we create child nodes of n32 from s4 to s7. Because s4=T and node n43 is in the incomplete Nettree of P1, we update the value of R(n43)= R(n43)+ R(n32)=4. Similarly, we can know that R(n53)= R(n53)+ R(n32)=4, R(n63)= R(n63)+ R(n32)=4, and R(n73) = R(n32)=2. At last, we create all child nodes of node n52, n62, and n82. Figure 5 shows incomplete Nettrees of P, P0, P1, P2, and P3. It is straightforward to know that the numbers of supports of P0, P1, P2, and P3 in S are 0, 4, 2+4+6+3=15, and 5+4=9, respectively. Figure 5. Incomplete Nettrees of P, P0, P1, P2, and P3.

Solving algorithm:

(1)Representative algorithms:

MPP           Algorithm designed by Minghua Zhang 

Reprogramed by Lingling Wang using java

MGCS          based on GCS

(2)The algorithms we design

MAPB        based on Incomplete Nettree

MAPD          based on Incomplete Nettree

MAPBOK          based on MAPB

(3)The input parameters of all algorithms:

The input parameters of all other algorithms except for MAPBOK include "a file path of bio-sequence(generally a *.txt file which contains the subject sequence, such as ' e:\ax1000.txt')", "a minimum gap", "a maximum gap", "a threshold" and "the possible maximum length of frequent patterns". For MAPBOK, the input parameters include "a file path of bio-sequence(similar to the other algorithms)", "a minimum gap","a maximum gap", "a threshold", "the value of K to ensure the number of frequent patterns printed" and "the extension factor e (a smaller number, such as 1, 1.5, 2 or 3 and so on)".

Experimental data

Table1. Bio-data sequences

 sequence the name of segment length data source experimental data S1 AX1000 1000 S2 AX2000 2000 S3 AX4000 4000 Homo Sapiens AX829174 S4 AX8000 8000 S5 AX10011 10011 S6 AL20000 20000 S7 AL40000 40000 S8 AL80000 80000 S9 AL167005 167005 S10 AB15000 15000 S11 AB30000 30000 S12 AB60000 60000 S13 AB131892 131892

Experimental environment and results

All experiments run on a laptop with Pentium(R) Dual-Core T4500@ 2.30GHz CPU and 2.0 GB of RAM, Windows 7. Java Development Kit (JDK) 1.6.0 is used to develop all algorithms. In these experiments,the greatest length of frequent patterns is considered to be 13, the minimum and maximum gap constraints are 9 and 12, respectively and the threshold is 3*10^-5. What is more, considering that the default stack memory of JVM is too small, we assign 1GB memory space for every algorithm.An console running instance of MAPD is" javac MAPD.java, java -Xms1024M -Xmx1024M MAPD". Table 2 shows the mining result of every sequence.

Table2. Mining results

 sequence the number of frequent patterns for various lengths the greatest length of frequent patterns S1 4,16,64,256,1024,4096,13374,5678,1514,623, 242, 55, 12 13 S2 4,16,64,256,1024,4096,15205,3436,350,85,8,3 12 S3 4,16,64,256,1024,4096,15965,1937,59,3 10 S4 4,16,64,256,1024,4096,14970,4283,241,1 10 S5 4,16,64,256,1024,4096,14422,4811,299,1 10 S6 4,16,64,256,1024,4096,12619,7068,614,8 10 S7 4,16,64,256,1024,4096,12388,6960,749,11 10 S8 4,16,64,256,1024,4096,12947,6303,666,11 10 S9 4,16,64,256,1024,4096,13438,5767,604,1 10 S10 4,16,64,256,1024,4096,12507,7197,563,2 10 S11 4,16,64,256,1024,4096,11126,7634,1164,54,1 11 S12 4,16,64,256,1024,4096,12799,6404,699,11 10 S13 4,16,64,256,1024,4096,12913,6558,672,11 10

 Zhang M, Kao B, Cheung DW, Yip KY (2007) Mining periodic patterns with gap requirement from sequences. ACM Transactions on Knowledge Discovery from Data, 1(2): Article No. 7

 Zhu X, X. Wu (2007) Mining complex patterns across sequences with gap requirements. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, pp 2934-2940

 Min F, Wu Y, Wu X (2012) The Apriori property of sequence pattern mining with wildcard gaps. International Journal Functional Informatics and Personalised Medicine. 4(1): 15–31