Maximum Pattern Matching with Gaps and One-off ConstraintMPMGOOC

(本研究受国家自然科学基金60828005支持)

(This research is supported by the National Natural Foundation of China (NSFC) under grant No. 60828005.)

MPMGOOC

Definition 1. Given a pattern P= p0[min0, max0]p1 [minj-1, maxj-1]pj [minm-2, maxm-2]pm-1, a sequence S=s0s1 si sn-1, and two integer values, MinLen and MaxLen , where m is the length of pattern P, pj (0< j<= m-1), n is the length of sequence S, and si (0<= i<= n-1). is a wildcard which can match any letter in a given alphabet table. minj-1 and maxj-1 are given integer values indicating the minimal and maximum length of wildcards between two given letters pj-1 and pj respectively, where 0<= minj<= maxj. MinLen and MaxLen are called global length constraints. minj and maxj are called local length constraints. ,          (1)

Definition 2. An occurrence of pattern P in sequence S is a sequence of position indices {a0, ,aj , ,am-1 } subject to the following equation. ,          (1)

where 0<= j <= m-1 and 0<= aj<= n-1.

Definition 3. Let set T(S,P) be the set of all occurrences of P in S. The length of T(S,P)  is denoted by |T(S,P)|.

Definition 4. Given two occurrences B=<b0, , bj, , bm-1> and  C=<c0, ,  ck, , cm-1>. Occurrence C is related with occurrence B if there exists  ck=bj, otherwise occurrence C is unrelated with occurrence B. The number of occurrences Related with Occurrence B is denoted by RO(B).  Both occurrences B and C are containing position bj, where 0 <= k< m and 0<= j <m. The number of occurrences Containing Position i (0<= i <= |S|-1) of P in S is denoted by CP(i).

Definition 5. Given a pattern P and a sequence S, a Pattern Matching with Gaps and One-Off Condition (PMGOOC) is a subset T'(S,P) T(S,P) such that every two occurrences in T'(S,P) are unrelated with each other. The Maximum Pattern Matching with Gaps and One-Off Condition (MPMGOOC) problem refers to the problem of searching a MPMGOOC with the largest number of occurrences in T(S,P).

Instance: Given P=a[0,1]b[0,1]c, S=aabbcc, MinLen=3, and MaxLen=5.

We can know that the number of all occurrences is 4 and all occurrences are {0, 2, 4}, {1, 2, 4}, {1, 3, 4}, and {1, 3, 5}. In these occurrences, the unique solution of this instance is two optimal occurrences, {0, 2, 4} and {1, 3, 5}. If {1, 2, 4} or {1, 3, 4} is the solution of this instance, the number of occurrence of the solution is 1. Hence {1, 2, 4} or {1, 3, 4} can not be a solution of this instance.

(1)网树是树结构的拓展，它具备很多与树相似的概念，如根结点，叶子结点，层，双亲，孩子等等；

(2)一棵网树可以有多于一个根结点；

(3)除了根结点之外，网树的其它结点可以有多于一个双亲结点；

(4)从一个结点到达网树的一个根结点的路径数不唯一；

(5)相同结点名称的结点可以在网树的不同层上多次出现。

Definition 6. A Nettree is a kind of DAG with two kinds of edge labels, "parent-child" and "child-parent", where each node has zero or more children nodes and zero or more parents nodes. Furthermore, the children of each node have a specific order. A Nettree has the following three properties.

1. A Nettree is an extension of tree because it has all concepts of tree, such as root, leaf, level, parent, child and so on.

2. A Nettree may have r roots, where r 1.

3. Some nodes except roots in a Nettree may have q parents, where q 1.

A Nettree is shown in Fig. 1. Nodes A and B are two roots of the Nettree. Nodes C, F and G are three leaves. Node D has two parents (nodes A and B). In Fig.1, each edge has a label either "parent-child" or "child-parent". So it is a kind of DAG with edge labels. Otherwise there will be a cycle {B, D, G, E, B}. 1  一颗网树

Figure 1. A Nettree.

(1)一种在线算法An online algorithm:

SAIL       Algorithm designed by Gong Chen

Reprogramed by Youxi Wu using VC

(2)我们设计的算法：

AGSP          based on Nettree

ARMP        based on Nettree

SBO           based on Nettree

（源自 美国国家生物计算信息中心的H1N1病毒候选序列 1 真实生物数据片段

 序号 片段名称 位点 片段长度 S1 CY058563 2286 S2 CY058562 2299 S3 CY058561 2169 S4 CY058556 1720 S5 CY058559 1516 S6 CY058558 1418 S7 CY058557 982 S8 CY058560 844

2 模式串

 序号 模式串 最小长度 最大长度 P1 a[0,3]t[0,3]a[0,3]t[0,3]a[0,3]t[0,3]a[0,3]t[0,3]a[0,3]t[0,3]a 11 41 P2 g[1,5]t[0,6]a[2,7]g[3,9]t[2,5]a[4,9]g[1,8]t[2,9]a 24 57 P3 g[1,9]t[1,9]a[1,9]g[1,9]t[1,9]a[1,9]g[1,9]t[1,9]a[1,9]g[1,9]t 21 101 P4 g[1,5]t[0,6]a[2,7]g[3,9]t[2,5]a[4,9]g[1,8]t[2,9]a[1,9]g[1,9]t 27 73 P5 a[0,10]a[0,10]t[0,10]c[0,10]g[0,10]g 6 56 P6 a[0,5]t[0,7]c[0,9]g[0,11]g 5 37 P7 a[0,5]t[0,7]c[0,6]g[0,8]t[0,7]c[0,9]g 7 49 P8 a[5,6]c[4,7]g[3,8]t[2,8]a[1,7]c[0,9]g 22 52 P9 c[0,5]t[0,5]g[0,5]a[0,5]a 5 25

3 全部72个运行实例