具有间隙约束和一次性条件的最大模式匹配

Maximum Pattern Matching with Gaps and One-off ConstraintMPMGOOC

(本研究受国家自然科学基金60828005支持)

(This research is supported by the National Natural Foundation of China (NSFC) under grant No. 60828005.)

 

 

 

MPMGOOC

定义1给定一个模式串P= p0[min0, max0]p1 [minj-1, maxj-1]pj [minm-2, maxm-2]pm-1和一个序列串S=s0s1si sn-1 以及两个整数值最小长度MinLen和最大长度MaxLen,这里mn分别是模式串P和序列串S的长度,pj (0<jm-1) si (0≤ in-1). 是通配符可以匹配任何给定的字符。minj-1maxj-1是给定整数值,代表模式字符pj-1 pj 之间通配符可以匹配的最小和最大长度,这里0≤ minjmaxjW=max(maxj -minj+1)称为模式串P的最大间距。MinLenMaxLen称为全局约束, minjmaxj称为局部约束。

Definition 1. Given a pattern P= p0[min0, max0]p1[minj-1, maxj-1]pj[minm-2, maxm-2]pm-1, a sequence S=s0s1 si sn-1, and two integer values, MinLen and MaxLen , where m is the length of pattern P, pj (0< j<= m-1), n is the length of sequence S, and si (0<= i<= n-1).  is a wildcard which can match any letter in a given alphabet table. minj-1 and maxj-1 are given integer values indicating the minimal and maximum length of wildcards between two given letters pj-1 and pj respectively, where 0<= minj<= maxj. MinLen and MaxLen are called global length constraints. minj and maxj are called local length constraints.

定义2如果一个位置索引序列A=<a0,… ,aj ,… ,am-1 >服从如下约束条件

,          (1)

这里0≤ jm-1 0≤ ajn-1,则称APS中的一个出现。

Definition 2. An occurrence of pattern P in sequence S is a sequence of position indices {a0,,aj ,  ,am-1 } subject to the following equation.

 

,          (1)

 

where 0<= j <= m-1 and 0<= aj<= n-1.

定义3令集合T(S,P)代表模式串P在序列串S中的所有出现,集合T(S,P)的长度用|T(S,P)|来表示。

Definition 3. Let set T(S,P) be the set of all occurrences of P in S. The length of T(S,P)  is denoted by |T(S,P)|.

定义4给定两个出现B=<b0,…, bj,…, bm-1>C=<c0,…, ck, …, cm-1>,如果存在bj = ck,则称出现B与出现C相关并称出现BC都包含位置bj,这里0≤k<m并且0≤j<m;否则称出现BC互不相关。

Definition 4. Given two occurrences B=<b0,  , bj,  , bm-1> and  C=<c0,  ,  ck, , cm-1>. Occurrence C is related with occurrence B if there exists  ck=bj, otherwise occurrence C is unrelated with occurrence B. The number of occurrences Related with Occurrence B is denoted by RO(B).  Both occurrences B and C are containing position bj, where 0 <= k< m and 0<= j <m. The number of occurrences Containing Position i (0<= i <= |S|-1) of P in S is denoted by CP(i).

定义5T(S,P)中子集T1(S,P)满足的任何两个出现都是彼此不相关的,则称子集T1(S,P)是具有间隙约束和一次性条件的模式匹配PMGOOCPattern Matching with Gaps and One-Off Condition)。满足PMGOOC的最大子集T1(S,P)称为具有间隙约束和一次性条件的最大模式匹配MPMGOOC(Maximum Pattern Matching with Gaps and One-Off Condition)

Definition 5. Given a pattern P and a sequence S, a Pattern Matching with Gaps and One-Off Condition (PMGOOC) is a subset T'(S,P) T(S,P) such that every two occurrences in T'(S,P) are unrelated with each other. The Maximum Pattern Matching with Gaps and One-Off Condition (MPMGOOC) problem refers to the problem of searching a MPMGOOC with the largest number of occurrences in T(S,P).

 

实例:给定模式串P=a[0,1]b[0,1]c,序列串S=aabbcc以及最小和最大长度分别为:MinLen=3MaxLen=5

依据给定条件可知,模式串P在序列串S中所有出现T(S,P){<0, 2, 4>, <1, 2, 4>, <1, 3, 4>, <1, 3, 5> },故|T(S,P)|4。依据定义5可知,MPMGOOC问题的解是{<0, 2, 4>, <1, 3, 5>},因为这两个出现是T(S,P)中不相关最大子集。

Instance: Given P=a[0,1]b[0,1]c, S=aabbcc, MinLen=3, and MaxLen=5.

 

We can know that the number of all occurrences is 4 and all occurrences are {0, 2, 4}, {1, 2, 4}, {1, 3, 4}, and {1, 3, 5}. In these occurrences, the unique solution of this instance is two optimal occurrences, {0, 2, 4} and {1, 3, 5}. If {1, 2, 4} or {1, 3, 4} is the solution of this instance, the number of occurrence of the solution is 1. Hence {1, 2, 4} or {1, 3, 4} can not be a solution of this instance.

 

网树(Nettree

定义6网树是一种每条边具有“双亲-孩子关系”或“孩子-双亲关系”这样边标签的有向无环图DAGDirected Acyclic Graph),并且网树中每个结点具有0个或多个孩子结点以及0个或多个双亲结点。网树还具有如下5个性质:

(1)网树是树结构的拓展,它具备很多与树相似的概念,如根结点,叶子结点,层,双亲,孩子等等;

(2)一棵网树可以有多于一个根结点;

(3)除了根结点之外,网树的其它结点可以有多于一个双亲结点;

(4)从一个结点到达网树的一个根结点的路径数不唯一;

(5)相同结点名称的结点可以在网树的不同层上多次出现。

Definition 6. A Nettree is a kind of DAG with two kinds of edge labels, "parent-child" and "child-parent", where each node has zero or more children nodes and zero or more parents nodes. Furthermore, the children of each node have a specific order. A Nettree has the following three properties.

1. A Nettree is an extension of tree because it has all concepts of tree, such as root, leaf, level, parent, child and so on.

2. A Nettree may have r roots, where r1.

3. Some nodes except roots in a Nettree may have q parents, where q1.

A Nettree is shown in Fig. 1. Nodes A and B are two roots of the Nettree. Nodes C, F and G are three leaves. Node D has two parents (nodes A and B). In Fig.1, each edge has a label either "parent-child" or "child-parent". So it is a kind of DAG with edge labels. Otherwise there will be a cycle {B, D, G, E, B}.

1  一颗网树

Figure 1. A Nettree.

 

 

求解算法Solving algorithm:

 

(1)一种在线算法An online algorithm:

                            SAIL       Algorithm designed by Gong Chen

                                                Reprogramed by Youxi Wu using VC

(2)我们设计的算法:

                           AGSP          based on Nettree

 

 

 

                          ARMP        based on Nettree

 

 

 

                           SBO           based on Nettree

 

 

 

 

实验测试数据

(源自 美国国家生物计算信息中心的H1N1病毒候选序列

 

 

 

1 真实生物数据片段

序号

片段名称

位点

片段长度

S1

Segment 1

CY058563

2286

S2

Segment 2

CY058562

2299

S3

Segment 3

CY058561

2169

S4

Segment 4

CY058556

1720

S5

Segment 5

CY058559

1516

S6

Segment 6

CY058558

1418

S7

Segment 7

CY058557

982

S8

Segment 8

CY058560

844

 

 

2 模式串

序号

模式串

最小长度

最大长度

P1

a[0,3]t[0,3]a[0,3]t[0,3]a[0,3]t[0,3]a[0,3]t[0,3]a[0,3]t[0,3]a

11

41

P2

g[1,5]t[0,6]a[2,7]g[3,9]t[2,5]a[4,9]g[1,8]t[2,9]a

24

57

P3

g[1,9]t[1,9]a[1,9]g[1,9]t[1,9]a[1,9]g[1,9]t[1,9]a[1,9]g[1,9]t

21

101

P4

g[1,5]t[0,6]a[2,7]g[3,9]t[2,5]a[4,9]g[1,8]t[2,9]a[1,9]g[1,9]t

27

73

P5

a[0,10]a[0,10]t[0,10]c[0,10]g[0,10]g

6

56

P6

a[0,5]t[0,7]c[0,9]g[0,11]g

5

37

P7

a[0,5]t[0,7]c[0,6]g[0,8]t[0,7]c[0,9]g

7

49

P8

a[5,6]c[4,7]g[3,8]t[2,8]a[1,7]c[0,9]g

22

52

P9

c[0,5]t[0,5]g[0,5]a[0,5]a

5

25

 

 

3 全部72个运行实例

P1-S1

P1-S2

P1-S3

P1-S4

P1-S5

P1-S6

P1-S7

P1-S8

P2-S1

P2-S2

P2-S3

P2-S4

P2-S5

P2-S6

P2-S7

P2-S8

P3-S1

P3-S2

P3-S3

P3-S4

P3-S5

P3-S6

P3-S7

P3-S8

P4-S1

P4-S2

P4-S3

P4-S4

P4-S5

P4-S6

P4-S7

P4-S8

P5-S1

P5-S2

P5-S3

P5-S4

P5-S5

P5-S6

P5-S7

P5-S8

P6-S1

P6-S2

P6-S3

P6-S4

P6-S5

P6-S6

P6-S7

P6-S8

P7-S1

P7-S2

P7-S3

P7-S4

P7-S5

P7-S6

P7-S7

P7-S8

P8-S1

P8-S2

P8-S3

P8-S4

P8-S5

P8-S6

P8-S7

P8-S8

P9-S1

P9-S2

P9-S3

P9-S4

P9-S5

P9-S6

P9-S7

P9-S8