一般间隙及一次性条件的严格模式匹配

柴欣,贾晓菲,武优西,江贺,吴信东. 一般间隙及一次性条件的严格模式匹配. 软件学报,  2015, 26(5):1096−1112.   (PDF) (Source codes)

摘  要:     具有间隙约束的模式匹配是序列模式挖掘的关键问题之一.一次性条件约束是要求序列中每个位置的字符最多只能使用一次,在序列模式挖掘中采用一次性条件约束更加合理.但目前间隙约束多为非负间隙,非负间隙对字符串中每个字符的出现顺序具有严格的约束,一定程度上限定了匹配的灵活性.为此本文提出了一般间隙及一次性条件的严格模式匹配问题,之后理论证明了该问题的计算复杂性为NP-Hard问题,为了对该问题进行有效求解,本文在网树结构上构建了动态更新结点信息的启发式求解算法(Dynamically Changing Node Property, DCNP),该算法动态地更新各个结点的树根路径数、叶子路径数和树根-叶子路径数等,进而每次可以获得一个较优的出现.之后迭代这一过程.为了有效地提高DCNP算法速度,避免动态更新大量的结点信息,我们提出了Checking机制,使得DCNP算法仅在可能产生内部重复出现的时候才进行动态更新.理论分析了DCNP算法的时间复杂度和空间复杂度.大量实验结果验证了DCNP算法的具有良好的求解性能.

 

Strict Pattern Matching with General Gaps and One-off Condition

Abstract:     Pattern matching with gap constraints is one of the key issues of sequential pattern mining. One-off condition which is always used in sequential pattern mining tasks means that the character of each position in the sequence can be used at most once for pattern matching. Recently, most researches focus on pattern matching with non-negative gaps, but the rule of non-negative gaps implies the character’s order in the sequence may limit the flexibility of matching. For these reasons, we propose the strict pattern matching with general gaps and one-off condition and show that this problem is NP-Hard. To handle this issue, we propose a heuristic algorithm, named Dynamically Changing Node Property (DCNP), based on Nettree which dynamically updates the properties of each node such as the numbers of root paths, leaf paths and root-leaf paths, thus we can get a better occurrence. Then we iterate the above process. To improve the speed of DCNP effectively and avoid dynamically updating information of nodes on a large scale, a checking mechanism is proposed which means that DCNP will update information of nodes only when the occurrence may have repetition. We also theoretically analyze the space and time complexities of DCNP. Experimental results show that DCNP has better performance than other competitive algorithms.

 

 

源程序

序列串与模式串

感谢刘同学的提示,做出如下勘误:

1、原文1108页表3中P8和P9应该为:

P8=g[-1,9]t[-1,9]a[-1,9]g[-1,9]t[-1,9]a[-1,9]g[-1,9]t[-1,9]a;
P9=g[-1,7]t[-1,7]a[-1,7]g[-1,7]t[-1,7]a[-1,7]g

2、原文1109页表5中DCNP算法在P9-S1实例上结果为138.

Mining Sequential Patterns with Periodic Wildcard Gaps

Youxi Wu, Lingling Wang, Jiadong Ren, Wei Ding, Xindong Wu. Mining Sequential Patterns with Periodic Wildcard Gaps. Applied Intelligence. Volume 41, Issue 1 (2014), Page 99-116.  (PDF) (Source codes)

 

Abstract  Mining frequent patterns with periodic wildcard gaps is a critical data mining problem to deal with complex real-world problems. This problem can be described as follows: given a subject sequence, a pre-specified threshold, and a variable gap-length with wildcards between each two consecutive letters. The task is to gain all frequent patterns with periodic wildcard gaps. State-of-the-art mining algorithms which use matrix or other linear data structures to solve the problem not only consume a large amount of memory but also run slowly. In this study, we use an Incomplete Nettree structure (the last layer of a Nettree which is an extension of a tree) of a sub-pattern P to efficiently create Incomplete Nettrees of all its super-patterns with prefix pattern P and compute the numbers of their supports in a one-way scan. We propose two new algorithms, MAPB (Mining sequentiAl Pattern using incomplete Nettree with Breadth first search) and MAPD (Mining sequentiAl Pattern using incomplete Nettree with Depth first search), to solve the problem effectively with low memory requirements. Furthermore, we design a heuristic algorithm MAPBOK (MAPB for tOp-K) based on MAPB to deal with the Top-K frequent patterns for each length. Experimental results on real-world biological data demonstrate the superiority of the proposed algorithms in running time and space consumption and also show that the pattern matching approach can be employed to mine special frequent patterns effectively.
Keywords Sequential pattern mining, Periodic wildcard gap, Pattern matching, Heuristic algorithm, Nettree

Definitions & Detail & Source codes

Subnettrees for Strict Pattern Matching with General Gaps and Length Constraints

武优西,刘亚伟,郭磊,吴信东. 子网树求解一般间隙和长度约束严格模式匹配. 软件学报. 2013, 24(5): 915-932  (PDF) (Source codes)

Abstract: Pattern matching with gaps constraints has very important application in many fields such as information retrieval, computational biology, sequential pattern mining etc. This paper proposes a new pattern matching problem named Strict Pattern mAtching with geNeral Gaps and Length cOnstraints (SPANGLO) which has four aspects: it is a kind of strict exact pattern matching, any position in sequence can be used more than once, the pattern can have more than one general gap constraint and the occurrence has the length constraints. An instance of SPANGLO can be transformed into exponential solved instances with non-negative gaps in the worst case. In order to solve the problem effectively, an algorithm named SubnEtTreeSpanglo (SETS) is proposed based on Subnettree and its special concepts and properties. The correctness and completeness of the algorithm are given and the space and time complexities of the algorithm are O(m×MaxLen×W) and O(MaxLen×W×m2×n) respectively, where n, m, MaxLen and W are the sequence length, the pattern length, the maximum length of the occurrence and the maximum gap of the pattern respectively. Experimental results validate the correctness of the transforming method of SPANGLO and the efficientiveness and correctness of SETS.

子网树求解一般间隙和长度约束严格模式匹配

摘  要: 具有通配符间隙约束的模式匹配问题在信息检索、计算生物学和序列模式挖掘等研究领域具有重要的应用。本文提出了更一般性的模式匹配问题,即一般间隙和长度约束的严格模式匹配(Strict Pattern mAtching with geNeral Gaps and Length cOnstraints,SPANGLO),其具有如下4个特点:它是一种严格的精确模式匹配;允许序列中任意位置的字符被多次使用;模式串中可以包含多个一般间隙;对出现的总体长度进行了约束。最坏情况下,一个SPANGLO实例将转换出指数个非负间隙的严格模式匹配实例。为了有效地解决该问题,本文提出了子网树及其相关概念和性质。在此基础上提出了求解算法,并给出算法的正确性和完备性证明,同时指出该算法的空间复杂度与时间复杂度分别为O(m×MaxLen×W)和O(MaxLen×W×m2×n),这里m,n, MaxLen和W分别是模式和序列的长度,出现的最大长度约束和模式的最大间距。实验结果既验证了SPANGLO问题转换方法的正确性又验证了本文算法的正确性和有效性。

Problem: Given P=a[0,1]b[-1,1]c, S=s0s1s2s3s4s5=acbacbc and length constraints MinLen=3, MaxLen=4.
Without length constraints, T(S,P)= {<0,2,1>,<0,2,4>,<3,5,4>, <3,5,6>} and |T(S,P)|=4.
With length constraints, T(S,P,LEN)={<0,2,1>,<3,5,4>, <3,5,6>} and |T(S,P,LEN)|=3.

Algorithm :sets

具有间隙约束和一次性条件的最大模式匹配

武优西,吴信东,江贺,闵帆. 一种求解MPMGOOC问题的启发式算法. 计算机学报. 2011, 34(8): 1452- 1462 .  (PDF) (Source codes)

 

具有间隙约束和一次性条件的最大模式匹配(Maximum Pattern Matching with Gaps and One-Off Condition,MPMGOOC)是一种具有通配符长度约束的模式匹配问题,其任务是寻找彼此互不相关的最多出现。文中基于一种新的非线性数据结构——网树,提出了一种解决MPMGOOC 问题的启发式算法。与树结构不同之处在于,除根结点外,网树中任何结点可以多于1 个双亲结点。文中给出了网树的定义及其相关的概念和性质。基于这些概念和性质,提出了一种选择较优出现 (Selecting Better Occurrence,SBO)的启发式算法。该算法在搜索一个出现的循环中,采用了贪婪搜索双亲策略 (Strategy of Greedy-Search Parent,SGSP)和最右双亲策略 (Strategy of RightMost Parent,SRMP)寻找相同叶子的两个出现并选择其中较好的出现作为SBO 算法的结果。SGSP 策略的核心思想是每一步都寻找当前结点的一个近似最优双亲(Approximately Optimimal Parent, AOP);SRMP 策略的核心思想是每一步都寻找当前结点的最右双亲结点。实验结果表明,在多数情况下SBO 算法可以获得更好的解且解的质量较其它算法有显著的提高。本文不但提供了一个解决MPMGOOC 问题的启发式算法,更重要的是对于求解其它复杂问题具有一定参考价值。

点击这里阅读本文