Abstract: In pattern matching, a gap constraint is a more flexible wildcard than traditional wildcards “?“ and “*“. Pattern matching with gap constraints is more difficult to handle and fulfills user’s enquiries more easily. Pattern matching with gap constraints has therefore been carried out in numerous research works such as music information retrieval, finding protein sites, and sequence pattern mining. Strict pattern matching under a nonoverlapping condition, as a type of pattern matching with gap constraints, is a key issue of sequence pattern mining with gap constraints since it can be used to compute the frequency of a pattern. Exact matching limits the flexibility of the match to some extent since it requires that each character match. Therefore, we address Approximate Strict Pattern matching under the NonOverlapping constraints (ASPNO). We propose an effective algorithm, named NETtree for ASPNO (NETASPNO), which first transforms the problem into a Nettree data structure. To find the nonoverlapping occurrences effectively, NETASPNO is iterated to find the rightmost root-leaf-path from the rightmost root and prunes the path. Extensive experimental results demonstrate that NETASPNO has better performance than other competitive algorithms.

Codes:

NETROL

NETLAP-Appro

NETASPNO

NOSEP: Non-Overlapping Sequence Pattern Mining with Gap Constraints

Youxi Wu, Yao Tong, Xingquan Zhu, Xindong Wu. NOSEP: Nonoverlapping Sequence Pattern Mining with Gap Constraints. IEEE Transactions on Cybernetics.  DOI (identifier) 10.1109/TCYB.2017.2750691

 

Abstract:

Sequence pattern mining aims to discover frequent subsequences as patterns in a single sequence or a sequence database. By combining gap constraints (or flexible wildcards), users can specify special characteristics of the patterns and discover meaningful subsequences suitable for their own application domains, such as finding gene transcription sites from DNA sequences or discovering patterns for time series data classification.  However, due to the inherent complexity of sequence patterns, including the exponential candidate space with respect to pattern letters and gap constraints, to date, existing sequence pattern mining methods are either incomplete or do not support the Apriori property since the support or support ratio of a pattern may be greater than that of its sub-patterns. Most importantly, patterns discovered by these methods are either too restrictive or too general and cannot represent underlying meaningful knowledge in the sequences. In this paper, we focus on a non-overlapping sequence pattern mining task with gap constraints, where a non-overlapping sequence pattern allows sequence letters to be flexible, yet maximally, utilized for pattern discovery. A new non-overlapping sequence pattern mining algorithm (NOSEP), which is an Apriori-based and complete mining algorithm, is proposed by using Nettree, a specially designed data structure, to calculate the exact occurrence of a pattern in the sequence. Experimental results and comparisons with biology DNA sequences, time series data,  and  Gazelle Datasets demonstrate the efficiency of the proposed algorithm and the uniqueness of non-overlapping sequence patterns compared to other methods.

 

Datasets:Dataset1  (DNA and Protein sequences)

 

Gazelle Datasets (From SPMF):

BMSWebView1 (Gazelle) ( KDD CUP 2000)

BMSWebView2 (Gazelle) ( KDD CUP 2000)

 

 

Algorithms:

NOSEP   for character version(small alphabet size)

NOSEPi   for integer version(large alphabet size)

 

NOSEP-B   (backtracking strategy) for character version(small alphabet size)

NOSEP-B-i  for integer version(large alphabet size)

 

NetM-B   (BFS strategy) for character version(small alphabet size)

NetM-B-i   for integer version(large alphabet size)

 

NetM-D  (DFS strategy) for character version(small alphabet size)

NetM-D-i   for integer version(large alphabet size)

 

 

GSgrow  for character version(small alphabet size)

GSgrow-i  for integer version(large alphabet size)

is proposed by Ding et al[1] .

 

[1]Ding B, Lo D, Han J, et al. Efficient mining of closed repetitive gapped subsequences from a sequence database. In: Proceedings of IEEE International Conference on Data Engineering, Shanghai, 2009. 1024-1035

 

 

中文摘要:序列模式挖掘是在单序列或序列数据库中发现频繁子序列。在结合间隙约束(或可变长度通配符)后形成具有间隙约束的序列模式挖掘,这种间隙约束可以用来指定模式的特点,以用于发现有意义的子序列,其已在诸多领域得到了广泛的应用,如DNA序列寻找基因转录位点或时间序列数据分类的模式发现。由于在序列字符和间隙的作用下,候选空间呈现指数形式,因此该挖掘具有一定的复杂性。迄今为止现有的挖掘方法要么是挖掘结果不具有完备性的算法,要么是不支持的Apriori性质的算法,因为模式的支持度或支持率可能大于其子模式的。更为重要的是这些方法发现的模式要么限制性太强,要么过于笼统,难以表示序列中隐含的有意义的知识。在本文中,我们研究了一种无重叠的序列模式挖掘,其是一种允许序列字符更加灵活并最大限度地用于模式发现。在此基础上,提出了一种新的基于Apriori性质的无重叠序列模式挖掘算法(NOSEP),该算法是一个具有完备性的模式挖掘算法,其采用模式匹配策略并运用一个称为网树的特殊数据结构来计算一个模式的支持度,并采用模式增长策略实现候选模式有效剪枝。在生物DNA序列、时间序列数据和一个公开的电子商务点击流数据集上与同类算法进行对比,实验结果不但验证了NOSEP算法的效率,并且表明在相同条件下其可以发现更多且更加有意义的频繁模式。

Strict pattern matching under non-overlapping condition

Youxi Wu, Cong Shen, He Jiang, Xindong Wu. Strict pattern matching under non-overlapping condition. Science China Information Sciences. 2017, 60 (1):012101:1-16.   Read Online (PDF)

 

ABSTRACT

Pattern matching (or string matching) is an essential task in computer science, especially in sequential pattern mining, since pattern matching methods can be used to calculate the support (or the number of occurrences) of a pattern and then to determine whether the pattern is frequent or not. A state-of-the-art sequential pattern mining with gap constraints (or flexible wildcards) uses the number of non-overlapping occurrences to denote the frequency of a pattern. Non-overlapping means that any two occurrences cannot use the same character of the sequence at the same position of the pattern. In this paper, we investigate strict pattern matching under the non-overlapping condition. We show that the problem is in P at first. Then we propose an algorithm, called NETLAP-Best, which uses Nettree structure. NETLAP-Best transforms the pattern matching problem into a Nettree and iterates to find the rightmost root-leaf path, to prune the useless nodes in the Nettree after removing the rightmost root-leaf path. We show that NETLAP-Best is a complete algorithm and analyse the time and space complexities of the algorithm. Extensive experimental results demonstrate the correctness and efficiency of NETLAP-Best.

摘要:

模式匹配(串匹配)是计算机科学中至关重要的一个任务,特别是在序列模式挖掘中,因为模式匹配方法可以用来计算一个模式在序列中的支持度(出现数),进而判断这个模式是否频繁。一种具有间隙约束(可变长度通配符)的序列模式挖掘算法采用模式的无重叠出现数目来表示这个模式的频度,这里无重叠是指任何两个出现不能共用序列的相同位置的字符。首先理论证明了无重叠条件的严格模式匹配的计算复杂度是P,然后提出了一个基于网树结构的NETLAP-Best算法,该算法将模式匹配问题转换为一颗网树,并在网树上迭代地寻找最右树根-叶子路径,之后剪去这条路径和无用的网树结点。之后理论证明了NETLAP-Best算法的完备性并分析了该算法的时间和空间复杂度。大量实验结果验证了NETLAP-Best算法的正确性和有效性。

 

Algorithms:

INSgrow

netlap_nopruning

netlap_slow

netlap_best

 

Data

dataset

 

Manuscript

NETLAP_2015

Approximate pattern matching with gap constraints

Youxi Wu, Zhiqiang Tang, He Jiang, Xindong Wu. Approximate Pattern Matching with Gap Constraints. Journal of Information Science.2016, 42(5): 639–658  (PDF)  (Source codes)

 

Abstract. Pattern matching is a key issue in sequential pattern mining, with many researchers now focussing on pattern matching with gap constraints. However, most of these studies involve exact pattern matching problems, a special case of approximate pattern matching and a more challenging task. In this study, we introduce an approximate pattern matching problem with Hamming distance. Its objective is to compute the number of approximate occurrences of pattern P with gap constraints in sequence S under similarity constraint d. We propose an efficient algorithm named SONG (Single-rOot Nettree for approximate pattern matchinG with gap constraints) and based on the new nonlinear data structure Single-root Nettree. SONG uses some new concepts of Single-root Nettrees to solve the problem effectively. Theoretical analysis and experiments demonstrate an interesting law that the ratio M(P,S,d)/N(P,S,m) approximately follows a binomial distribution, where M(P,S,d) and N(P,S,m) are the numbers of the approximate occurrences whose distances to pattern P are d (0≤dm) and no more than m (the length of pattern P), respectively. Experimental results for real biological data validate the efficiency and correctness of SONG.

在《网树求解有向无环图中具有长度约束的最大不相交路径》一文中提出的有向无环图构造生成算法。该论文信息如下:

李艳,武优西,黄春萍,张志颖,曾珍香. 网树求解有向无环图中具有长度约束的最大不相交路径. 通信学报. 2015, 36(8): 38-49

 

该算法可以依据用户输入的n, kd来随机生成从顶点1到顶点n路径长度为k情况下,最大不交路径数为w的有向无环图,这里d表示图中边密度,w为一个大于0.7倍理论最大不相交路径数(该数为(n-2)/(k-1))的随机数(注:因为如果最大不相交路径太少且在边密度较高情况下,任何一种算法都相对较容易获得较好质量的解,因此难于用来评价算法的好坏,这一点在后面的实验中也有所体现). 该算法采用的原理如下:

第一步:随机生成w*(k-1)个不同数字,其范围为2到n-1,每个数字表示一个顶点名;

第二步:建立一条长度为k的路径. 每k-1个数字与顶点1和n一起构成一条长度为k的路径,具体做法是从前一个数字顶点向后一个数字顶点建立一条有向边,然后顶点1向这k-1个数字中第一个数字顶点建立一条有向边,最后一个顶点向顶点n建立一条有向边;

第三步:迭代第二步,直到建立w条长度为k的不相交路径;

第四步:随机加边. 随机生成两个介于1到n-1之间的数字ab,判断从ab的有向边是否存在,如果存在重新生成;如果ab相同,也重新生成;如果会产生回路也重新生成(从顶点b出发,依据当前图G的状态,采用广度优先搜索策略,如果能够回到顶点a,说明存在回路),直到可以生成一组满足要求的数字,然后在顶点ab之间建立一条有向边;

第五步:迭代第四步,直到图中边的密度为用户指定的d值后停止.

由于第二和第三步保证了顶点1到顶点n之间有w条路径长度为k的不相交路径,并且第四步在生成边的过程中b最大为n-1,不能取到n,这样顶点n的入度就仅仅为w,这样就确保随机生成的有向无环图从顶点1到顶点n之间最多仅有w条路径长度为k的不相交路径.

 

算法的源代码:dpc

李艳,武优西,黄春萍,张志颖,曾珍香. 网树求解有向无环图中具有长度约束的最大不相交路径. 通信学报. 2015, 36(8): 38-49

 

A Nettree for Maximum Disjoint Paths with Length Constraint in DAGs

ABSTRACT  In this paper, we focus on the problem of the maximum disjoint paths in directed acyclic graphs (DAGs) which is to find the maximum disjoint paths with length k between two given vertices. We propose a greedy algorithm named Greedy Path (GP) to solve the problem. GP transforms a DAG into a Nettree with depth k+1 at first. Then we calculate the number of root-leaf paths for each node of the Nettree to achieve the number of total paths for each vertex of the DAG. In order to obtain an optimized disjoint path, GP selects the node in the k+1th level of the Nettree as the current node, and searches for the optimized parent in the usable parents whose number of total paths is minimal. We iterate this process, until there is no disjoint path. The space and time complexities of GP are O(wkn(p+q)) and O(kn(p+q)+n2), respectively, where k, n, p, q, and w are length constraint, number of vertices, maximum indegree, maximum outdegree, and the solution, respectively. To evaluate the performance of GP we also propose an algorithm which can create artificial DAGs with known maximum disjoint paths. Experimental results show that GP can get better performance than other competitive algorithms.

 

网树求解有向无环图中具有长度约束的最大不相交路径

摘要  本文对有向无环图中具有长度约束的最大不相交路径问题进行研究,该问题是求解图中两点间路径长度为k的最大不相交路径. 为了对该问题进行求解,本文提出了贪婪搜索算法(Greedy Path,GP),该算法先将一个有向无环图转化为一棵深度为k+1的网树,然后计算每个网树结点的树根叶子路径数,并以此计算图中每个顶点的总路径数,之后从网树的第k+1层结点出发,在当前结点的双亲结点中选择未被使用且总路径数最小的双亲,以此形成一条优化的不相交路径,最后迭代这一过程,直到不再有新的不相交路径为止. GP算法的时间和空间复杂度分别为O(wkn(p+q))和O(kn(p+q)+n2),这里k, n, p, qw为分别为长度约束、顶点数、最大顶点入度、最大顶点出度以及问题的解. 为了测试GP算法的近似性,本文又建立了一种能够生成人工数据的算法,该算法能够准确地控制有向无环图中最大不相交路径的数量. 通过该算法生成了大量测试用数据,实验结果表明GP算法较其他对比性算法具有良好的近似性且其实际求解时间较短,验证了该方法的有效性和可行性.

 

求解算法

 

 

求解算法1:LP

 

求解算法2:RP

 

求解算法3:SP

 

求解算法4:GP

 

 

 

测试数据:测试用48组数据

 

 

Strict approximate pattern matching with general gaps

Youxi Wu, Shuai Fu, He Jiang, Xindong Wu. Strict approximate pattern matching with general gaps. Applied Intelligence. 2015, Volume 42, Issue 3, pp 566-580.  (PDF) (Source codes)

 

Abstract  Pattern matching with gap constraints is one of the essential problems in computer science such as music information retrieval and sequential pattern mining. One of the cases is called loose matching, which means only considering the matching position of the last substring of pattern in the sequence. One more challenging problem is considering the matching positions of each character in the sequence, called strict pattern matching which is one of the essential tasks of sequential patterns mining with gap constraints. Some strict pattern matching algorithms were used to handle pattern mining tasks, since strict pattern matching can be used to compute the occurrence frequency of some patterns in the given sequence and then the frequent patternscan be derived. In this article, we address a more general strict approximate pattern matching with Hamming distance, named SAP (Strict Approximate Pattern matching with general gaps and length constraints), which means that the gap constraints can be negative. We show that a SAP instance can be transformed into exponential amount of the exact pattern matching with general gaps instances. Hence, we propose an effective online algorithm, named SETA (SubnETtree for sAp), based on the subnettree structure (a Nettree is an extension of a tree with multi-parents and multi-roots) and show the completeness of the algorithm. The space and time complexities of the algorithm are O(m×Maxlen×W×d) and O(Maxlen×W×m2×n×d), respectively, where m, Maxlen, W, and d are the length of pattern P, the maximal length constraint, the maximal gap length of pattern P and the approximate threshold. Extensive experimental results validate the correctness and effectiveness of SETA.

 

 

Algorithms:

 

SETA1

 

SETA2

 

SETA3

 

SETA

 

 

 

Data:

 

The data used in this paper

一般间隙及一次性条件的严格模式匹配

柴欣,贾晓菲,武优西,江贺,吴信东. 一般间隙及一次性条件的严格模式匹配. 软件学报,  2015, 26(5):1096−1112.   (PDF) (Source codes)

摘  要:     具有间隙约束的模式匹配是序列模式挖掘的关键问题之一.一次性条件约束是要求序列中每个位置的字符最多只能使用一次,在序列模式挖掘中采用一次性条件约束更加合理.但目前间隙约束多为非负间隙,非负间隙对字符串中每个字符的出现顺序具有严格的约束,一定程度上限定了匹配的灵活性.为此本文提出了一般间隙及一次性条件的严格模式匹配问题,之后理论证明了该问题的计算复杂性为NP-Hard问题,为了对该问题进行有效求解,本文在网树结构上构建了动态更新结点信息的启发式求解算法(Dynamically Changing Node Property, DCNP),该算法动态地更新各个结点的树根路径数、叶子路径数和树根-叶子路径数等,进而每次可以获得一个较优的出现.之后迭代这一过程.为了有效地提高DCNP算法速度,避免动态更新大量的结点信息,我们提出了Checking机制,使得DCNP算法仅在可能产生内部重复出现的时候才进行动态更新.理论分析了DCNP算法的时间复杂度和空间复杂度.大量实验结果验证了DCNP算法的具有良好的求解性能.

 

Strict Pattern Matching with General Gaps and One-off Condition

Abstract:     Pattern matching with gap constraints is one of the key issues of sequential pattern mining. One-off condition which is always used in sequential pattern mining tasks means that the character of each position in the sequence can be used at most once for pattern matching. Recently, most researches focus on pattern matching with non-negative gaps, but the rule of non-negative gaps implies the character’s order in the sequence may limit the flexibility of matching. For these reasons, we propose the strict pattern matching with general gaps and one-off condition and show that this problem is NP-Hard. To handle this issue, we propose a heuristic algorithm, named Dynamically Changing Node Property (DCNP), based on Nettree which dynamically updates the properties of each node such as the numbers of root paths, leaf paths and root-leaf paths, thus we can get a better occurrence. Then we iterate the above process. To improve the speed of DCNP effectively and avoid dynamically updating information of nodes on a large scale, a checking mechanism is proposed which means that DCNP will update information of nodes only when the occurrence may have repetition. We also theoretically analyze the space and time complexities of DCNP. Experimental results show that DCNP has better performance than other competitive algorithms.

 

 

源程序

序列串与模式串

感谢刘同学的提示,做出如下勘误:

1、原文1108页表3中P8和P9应该为:

P8=g[-1,9]t[-1,9]a[-1,9]g[-1,9]t[-1,9]a[-1,9]g[-1,9]t[-1,9]a;
P9=g[-1,7]t[-1,7]a[-1,7]g[-1,7]t[-1,7]a[-1,7]g

2、原文1109页表5中DCNP算法在P9-S1实例上结果为138.