Mining Sequential Patterns with Periodic Wildcard Gaps(MSPPWG)

(This research is supported by the National Natural Foundation of China under grants No. 61229301, 61170190, and 61370144, the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education, China, under grant IRT13059, the National 863 Program of China under grant 2012AA011005, the National 973 Program of China under grant 2013CB329604, the Natural Science Foundation of Hebei Province of China under grant No. F2013202138, the Key Project of the Educational Commission of Hebei Province under grant No. ZH2012038, and the Industrial Science and Technology Pillar Program of Changzhou, Jiangsu, China, under grant CE20120026.)

 

 

MSPPWG

 

 

Definition 1. S=s0 ... si ... sn-1 is called as a subject sequence, where si is some symbol and n is the length of S. ∑ can be a different symbol set indifferent applications and the size of ∑ is denoted by |∑|. In a DNA sequence, ∑={A,T,C,G } and |∑|=4.

 

Definition 2. P=p0[min0,max0]p1 ... [minj-1,maxj-1] pj ... [minm-2,maxm-2] pm-1 is a pattern, where pj, m is the length of P, minj-1 and maxj-1 are integer numbers and 0≤minj-1 ≤maxj-1. minj-1 and maxj-1 are gap constraints, presenting the minimum and maximum number of wildcards between two consecutive letters pj-1 and pj, respectively. If min0= min1 =… = minm-2=M and max0= max1 =… = maxm-2 =N, pattern P is called a pattern with periodic wildcard gaps. For example, A[1,3]C[1,3]C is a pattern with periodic wildcard gaps [1,3] .

 

Definition 3.If a sequence of indices D=<d0, d1, …, dm-1 > is subject to M≤dj-dj-1-1≤N (0<j<m-1), D is an offset sequence of P with periodic wildcard gaps [M, N] in S. The number of offset sequences of P in S is denoted by ofs(P, S).

 

Definition 4. If an offset sequence I=<i0,… ,ij ,… ,im-1 > is subject to (0≤j≤m-1 and 0≤ij≤n-1), I is a support (or occurrence) of P in S. The number of supports of P in S is denoted by sup(P, S).

 

Definition 5. r(P, S)= sup(P, S)/ofs(P,S) is called support ratio. If r(P, S) is greater than a pre-specified threshold , pattern P is a frequent pattern, otherwise P is an infrequent pattern.

 

Instance 1:Given patterns P1= A[1,3]C and P2=G[1, 3]C, a sequence S=s0s1s2s3s4s5=AGCCCT and =0.25.

 

  We can easily enumerate all 9 offset sequences of P1 and P2 in S which are <0,2>,<0,3>,<0,4>, <1,3>, <1,4>, <1,5>, <2,4>, <2,5>, and <3,5>. Hence ofs(P1, S)=ofs(P2, S)=9. The supports of P1 in S include <0,2>,<0,3>, and <0,4> and sup(P1, S)=3. P1 is a frequent pattern since the support ratio r(P1,S)=3/9=0.333. We can know that sup(P2, S)=2 and the supports are <1, 3> and <1, 4>. So r(P2, S)=2/9=0.222 < and P2 is an infrequent pattern.

 

GCS

we give an illustrative example to show the principle of GCS as follows:

Instance 2:Given a pattern P=T[0,3]C[0,3]G and a sequence S=s0s1s2s3s4s5s6s7s8s9 = TTCCTCCGCG.

 In the GCS algorithm a two dimensional array A is created (shown in Figure 1). If pi<>sj (0≤i≤m-1), A(i, j)=0. If i=0 and pi=sj, A(i, j)=1 otherwise A(i, j)=(i-1, j-k-1) (0<i≤m-1). The sum of the elements in the last row is the number of supports of the pattern in the sequence. Hence we can know that the number of supports of P in S is 0+0+0+0+0+0+0+5+0+4=9.

 

Figure 1.Gap constrained pattern search.

 

 

Nettree

Definition 6. A Nettree data structure is a collection of nodes. The collection can be empty, otherwise a Nettree consists of some distinguished nodes r1,…,rm (root), and 0 or more non-empty subNettrees T1,T2, …, Tn, each of whose roots can be connected by at least an edge from root ri, where 1≤m, 1≤n, and 1≤i≤ n. A generic Nettree is shown in Figure 2.

 

 

Figure 2. A generic Nettree.

 

 To show the differences between the tree data structure and theNettree data structure, Properties of Nettree are given as follows.
 
  (1) A Nettree is an extension of a tree because it has similar concepts of a tree, such as the root, leaf,level,parent,child,and so on.
  (2) A Nettree may have more than one root.
  (3) Some nodes except roots in a Nettree may have more than one parent.
  (4) There may be more than one path from a node to a root.
    (5) Some nodes may occur more than once.

Definition 7. The same node can occur once or more in a Nettree. We can use the node name to denote each of them if each node occurs only once. Otherwise nij denotes node i on the jth level if a node occurs more than once in different levels in a Nettree.

 

Definition 8. A path from a first level node to node nij is called as a root path. R(nij) denotes the Number of Root Paths (NRP) of node nij . The NRPs of a first level node and a jth level node (j>1) are 1 and the sum of the NRPs of its parents, respectively.

Instance 3: Figure 3 shows a Nettree. Node 3 occurs 3 times in the Nettree.n31 ,n32 and n33 denote node 3 in 1st, 2nd and 3rd level, respectively.Node n42 has two parents,n11 and n31 .

 

Figure 3. A Nettree.

 

Incomplete Nettree

Incomplete Nettree means the last layer of Nettree. In order to reduce space consumption ,we use the incomplete Nettree to express a pattern. Here we give an example to account for its usage in computing the number of supports of a pattern'super-patterns.

 

Instance 4: Given a pattern P = T[0,3]C and a sequence S=s0s1s2s3s4s5s6s7s8s9=TTCCTCCGCG.

 

Figure 4 shows a Nettree to express pattern P. In fact ,we can only use the second layer of this Nettree in the black frame to express pattern P which is just the incomplete Nettree of P. The sum of values in grey circles is the number of supports of P, namely sup(P, S)=2+2+2+1+1=8.

 

 

Figure 4. A Nettree for P in S.

The most important thing is that we can employ pattern P to construct incomplete Nettrees of its super-patterns simultaneously and get their numbers of supports simply.For example, we can construct incomplete Nettrees of patterns P0= T[0,3]C[0,3]A, P1= T[0,3]C[0,3]T, P2=T[0,3]C[0,3]C,P3=T[0, 3]C[0,3]G in S at the same time.  The process and result can be shown as follows :

R(node) is used to express NRP of a node according to Definition 8. We create child nodes of n22 from s3 to s6 since gap constraints are [0, 3]. Because s3=C, P2=T[0,3]C[0,3]C and node n33 is not in the incomplete Nettree of P2, we create node n33 in the incomplete Nettree of P2 and R(n33)=R(n22)=2. We create node n43 in the incomplete Nettree of P1 and R(n43)=R(n22)=2. Similarly, we create nodes n53 and n63 in the incomplete Nettree of P2, R(n53)= R(n63)= R(n22)=2. Then we create child nodes of n32 from s4 to s7. Because s4=T and node n43 is in the incomplete Nettree of P1, we update the value of R(n43)= R(n43)+ R(n32)=4. Similarly, we can know that R(n53)= R(n53)+ R(n32)=4, R(n63)= R(n63)+ R(n32)=4, and R(n73) = R(n32)=2. At last, we create all child nodes of node n52, n62, and n82. Figure 5 shows incomplete Nettrees of P, P0, P1, P2, and P3. It is straightforward to know that the numbers of supports of P0, P1, P2, and P3 in S are 0, 4, 2+4+6+3=15, and 5+4=9, respectively. 

 

Figure 5. Incomplete Nettrees of P, P0, P1, P2, and P3.

 

 

Solving algorithm:

 

(1)Representative algorithms:

                            MPP           Algorithm designed by Minghua Zhang [1]

 

                                                  Reprogramed by Lingling Wang using java

                          MGCS          based on GCS  [2]

 

                          AMIN           [3]

 (2)The algorithms we design

                          MAPB        based on Incomplete Nettree

 

                          MAPD          based on Incomplete Nettree

 

                          MAPBOK          based on MAPB

(3)The input parameters of all algorithms:

  The input parameters of all other algorithms except for MAPBOK include "a file path of bio-sequence(generally a *.txt file which contains the subject sequence, such as ' e:\ax1000.txt')", "a minimum gap", "a maximum gap", "a threshold" and "the possible maximum length of frequent patterns". For MAPBOK, the input parameters include "a file path of bio-sequence(similar to the other algorithms)", "a minimum gap","a maximum gap", "a threshold", "the value of K to ensure the number of frequent patterns printed" and "the extension factor e (a smaller number, such as 1, 1.5, 2 or 3 and so on)".

 

Experimental data

 

Table1. Bio-data sequences

sequence

the name of segment

length

data source

experimental data

S1

AX1000

1000

Homo Sapiens AX829174

top 1000

S2

AX2000

2000

Homo Sapiens AX829174

top 2000

S3

AX4000

4000

Homo Sapiens AX829174

top 4000

S4

AX8000

8000

Homo Sapiens AX829174

top 8000

S5

AX10011

10011

Homo Sapiens AX829174

the whole segment

S6

AL20000

20000

Homo Sapiens AL158070

top 20000

S7

AL40000

40000

Homo Sapiens AL158070

top 40000

S8

AL80000

80000

Homo Sapiens AL158070

top 80000

S9

AL167005

167005

Homo Sapiens AL158070

the whole segment

S10

AB15000

15000

Homo Sapiens AB038490

top 15000

S11

AB30000

30000

Homo Sapiens AB038490

top 30000

S12

AB60000

60000

Homo Sapiens AB038490

top 60000

S13

AB131892

131892

Homo Sapiens AB038490

the whole segment

 

 

 

Experimental environment and results

  All experiments run on a laptop with Pentium(R) Dual-Core T4500@ 2.30GHz CPU and 2.0 GB of RAM, Windows 7. Java Development Kit (JDK) 1.6.0 is used to develop all algorithms. In these experiments,the greatest length of frequent patterns is considered to be 13, the minimum and maximum gap constraints are 9 and 12, respectively and the threshold is 3*10^-5. What is more, considering that the default stack memory of JVM is too small, we assign 1GB memory space for every algorithm.An console running instance of MAPD is" javac MAPD.java, java -Xms1024M -Xmx1024M MAPD". Table 2 shows the mining result of every sequence.

Table2. Mining results

sequence

the number of frequent patterns for various lengths

the greatest length of frequent patterns

S1

4,16,64,256,1024,4096,13374,5678,1514,623, 242, 55, 12

13

S2

4,16,64,256,1024,4096,15205,3436,350,85,8,3

12

S3

4,16,64,256,1024,4096,15965,1937,59,3

10

S4

4,16,64,256,1024,4096,14970,4283,241,1

10

S5

4,16,64,256,1024,4096,14422,4811,299,1

10

S6

4,16,64,256,1024,4096,12619,7068,614,8

10

S7

4,16,64,256,1024,4096,12388,6960,749,11

10

S8

4,16,64,256,1024,4096,12947,6303,666,11

10

S9

4,16,64,256,1024,4096,13438,5767,604,1

10

S10

4,16,64,256,1024,4096,12507,7197,563,2

10

S11

4,16,64,256,1024,4096,11126,7634,1164,54,1

11

S12

4,16,64,256,1024,4096,12799,6404,699,11

10

S13

4,16,64,256,1024,4096,12913,6558,672,11

10

 

 

[1] Zhang M, Kao B, Cheung DW, Yip KY (2007) Mining periodic patterns with gap requirement from sequences. ACM Transactions on Knowledge Discovery from Data, 1(2): Article No. 7

[2] Zhu X, X. Wu (2007) Mining complex patterns across sequences with gap requirements. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, pp 2934-2940

[3] Min F, Wu Y, Wu X (2012) The Apriori property of sequence pattern mining with wildcard gaps. International Journal Functional Informatics and Personalised Medicine. 4(1): 15–31