Subnettrees for Strict Pattern Matching with General Gaps and Length Constraints

Subnettrees for Strict Pattern Matching with General Gaps and Length Constraints

武优西,刘亚伟,郭磊,吴信东. 子网树求解一般间隙和长度约束严格模式匹配. 软件学报. 2013, 24(5): 915-932  (PDF) (Source codes)

Abstract: Pattern matching with gaps constraints has very important application in many fields such as information retrieval, computational biology, sequential pattern mining etc. This paper proposes a new pattern matching problem named Strict Pattern mAtching with geNeral Gaps and Length cOnstraints (SPANGLO) which has four aspects: it is a kind of strict exact pattern matching, any position in sequence can be used more than once, the pattern can have more than one general gap constraint and the occurrence has the length constraints. An instance of SPANGLO can be transformed into exponential solved instances with non-negative gaps in the worst case. In order to solve the problem effectively, an algorithm named SubnEtTreeSpanglo (SETS) is proposed based on Subnettree and its special concepts and properties. The correctness and completeness of the algorithm are given and the space and time complexities of the algorithm are O(m×MaxLen×W) and O(MaxLen×W×m2×n) respectively, where n, m, MaxLen and W are the sequence length, the pattern length, the maximum length of the occurrence and the maximum gap of the pattern respectively. Experimental results validate the correctness of the transforming method of SPANGLO and the efficientiveness and correctness of SETS.


摘  要: 具有通配符间隙约束的模式匹配问题在信息检索、计算生物学和序列模式挖掘等研究领域具有重要的应用。本文提出了更一般性的模式匹配问题,即一般间隙和长度约束的严格模式匹配(Strict Pattern mAtching with geNeral Gaps and Length cOnstraints,SPANGLO),其具有如下4个特点:它是一种严格的精确模式匹配;允许序列中任意位置的字符被多次使用;模式串中可以包含多个一般间隙;对出现的总体长度进行了约束。最坏情况下,一个SPANGLO实例将转换出指数个非负间隙的严格模式匹配实例。为了有效地解决该问题,本文提出了子网树及其相关概念和性质。在此基础上提出了求解算法,并给出算法的正确性和完备性证明,同时指出该算法的空间复杂度与时间复杂度分别为O(m×MaxLen×W)和O(MaxLen×W×m2×n),这里m,n, MaxLen和W分别是模式和序列的长度,出现的最大长度约束和模式的最大间距。实验结果既验证了SPANGLO问题转换方法的正确性又验证了本文算法的正确性和有效性。

Problem: Given P=a[0,1]b[-1,1]c, S=s0s1s2s3s4s5=acbacbc and length constraints MinLen=3, MaxLen=4.
Without length constraints, T(S,P)= {<0,2,1>,<0,2,4>,<3,5,4>, <3,5,6>} and |T(S,P)|=4.
With length constraints, T(S,P,LEN)={<0,2,1>,<3,5,4>, <3,5,6>} and |T(S,P,LEN)|=3.

Algorithm :sets

2 评论