Abstract: In pattern matching, a gap constraint is a more flexible wildcard than traditional wildcards “?“ and “*“. Pattern matching with gap constraints is more difficult to handle and fulfills user’s enquiries more easily. Pattern matching with gap constraints has therefore been carried out in numerous research works such as music information retrieval, finding protein sites, and sequence pattern mining. Strict pattern matching under a nonoverlapping condition, as a type of pattern matching with gap constraints, is a key issue of sequence pattern mining with gap constraints since it can be used to compute the frequency of a pattern. Exact matching limits the flexibility of the match to some extent since it requires that each character match. Therefore, we address Approximate Strict Pattern matching under the NonOverlapping constraints (ASPNO). We propose an effective algorithm, named NETtree for ASPNO (NETASPNO), which first transforms the problem into a Nettree data structure. To find the nonoverlapping occurrences effectively, NETASPNO is iterated to find the rightmost root-leaf-path from the rightmost root and prunes the path. Extensive experimental results demonstrate that NETASPNO has better performance than other competitive algorithms.
Youxi Wu, Yao Tong, Xingquan Zhu, Xindong Wu. NOSEP: Nonoverlapping Sequence Pattern Mining with Gap Constraints. IEEE Transactions on Cybernetics. DOI (identifier) 10.1109/TCYB.2017.2750691
Sequence pattern mining aims to discover frequent subsequences as patterns in a single sequence or a sequence database. By combining gap constraints (or flexible wildcards), users can specify special characteristics of the patterns and discover meaningful subsequences suitable for their own application domains, such as finding gene transcription sites from DNA sequences or discovering patterns for time series data classification. However, due to the inherent complexity of sequence patterns, including the exponential candidate space with respect to pattern letters and gap constraints, to date, existing sequence pattern mining methods are either incomplete or do not support the Apriori property since the support or support ratio of a pattern may be greater than that of its sub-patterns. Most importantly, patterns discovered by these methods are either too restrictive or too general and cannot represent underlying meaningful knowledge in the sequences. In this paper, we focus on a non-overlapping sequence pattern mining task with gap constraints, where a non-overlapping sequence pattern allows sequence letters to be flexible, yet maximally, utilized for pattern discovery. A new non-overlapping sequence pattern mining algorithm (NOSEP), which is an Apriori-based and complete mining algorithm, is proposed by using Nettree, a specially designed data structure, to calculate the exact occurrence of a pattern in the sequence. Experimental results and comparisons with biology DNA sequences, time series data, and Gazelle Datasets demonstrate the efficiency of the proposed algorithm and the uniqueness of non-overlapping sequence patterns compared to other methods.
Datasets:Dataset1 (DNA and Protein sequences)
Gazelle Datasets (From SPMF):
NOSEP for character version(small alphabet size)
NOSEPi for integer version(large alphabet size)
NOSEP-B (backtracking strategy) for character version(small alphabet size)
NOSEP-B-i for integer version(large alphabet size)
NetM-B (BFS strategy) for character version(small alphabet size)
NetM-B-i for integer version(large alphabet size)
NetM-D (DFS strategy) for character version(small alphabet size)
NetM-D-i for integer version(large alphabet size)
GSgrow for character version(small alphabet size)
GSgrow-i for integer version(large alphabet size)
is proposed by Ding et al .
Ding B, Lo D, Han J, et al. Efficient mining of closed repetitive gapped subsequences from a sequence database. In: Proceedings of IEEE International Conference on Data Engineering, Shanghai, 2009. 1024-1035
Youxi Wu, Cong Shen, He Jiang, Xindong Wu. Strict pattern matching under non-overlapping condition. Science China Information Sciences. 2017, 60 (1):012101:1-16. Read Online (PDF)
Pattern matching (or string matching) is an essential task in computer science, especially in sequential pattern mining, since pattern matching methods can be used to calculate the support (or the number of occurrences) of a pattern and then to determine whether the pattern is frequent or not. A state-of-the-art sequential pattern mining with gap constraints (or flexible wildcards) uses the number of non-overlapping occurrences to denote the frequency of a pattern. Non-overlapping means that any two occurrences cannot use the same character of the sequence at the same position of the pattern. In this paper, we investigate strict pattern matching under the non-overlapping condition. We show that the problem is in P at first. Then we propose an algorithm, called NETLAP-Best, which uses Nettree structure. NETLAP-Best transforms the pattern matching problem into a Nettree and iterates to find the rightmost root-leaf path, to prune the useless nodes in the Nettree after removing the rightmost root-leaf path. We show that NETLAP-Best is a complete algorithm and analyse the time and space complexities of the algorithm. Extensive experimental results demonstrate the correctness and efficiency of NETLAP-Best.
Abstract. Pattern matching is a key issue in sequential pattern mining, with many researchers now focussing on pattern matching with gap constraints. However, most of these studies involve exact pattern matching problems, a special case of approximate pattern matching and a more challenging task. In this study, we introduce an approximate pattern matching problem with Hamming distance. Its objective is to compute the number of approximate occurrences of pattern P with gap constraints in sequence S under similarity constraint d. We propose an efficient algorithm named SONG (Single-rOot Nettree for approximate pattern matchinG with gap constraints) and based on the new nonlinear data structure Single-root Nettree. SONG uses some new concepts of Single-root Nettrees to solve the problem effectively. Theoretical analysis and experiments demonstrate an interesting law that the ratio M(P,S,d)/N(P,S,m) approximately follows a binomial distribution, where M(P,S,d) and N(P,S,m) are the numbers of the approximate occurrences whose distances to pattern P are d (0≤d≤m) and no more than m (the length of pattern P), respectively. Experimental results for real biological data validate the efficiency and correctness of SONG.
该算法可以依据用户输入的n, k和d来随机生成从顶点1到顶点n路径长度为k情况下，最大不交路径数为w的有向无环图，这里d表示图中边密度，w为一个大于0.7倍理论最大不相交路径数（该数为(n-2)/(k-1)）的随机数（注：因为如果最大不相交路径太少且在边密度较高情况下，任何一种算法都相对较容易获得较好质量的解，因此难于用来评价算法的好坏，这一点在后面的实验中也有所体现). 该算法采用的原理如下：
ABSTRACT In this paper, we focus on the problem of the maximum disjoint paths in directed acyclic graphs (DAGs) which is to find the maximum disjoint paths with length k between two given vertices. We propose a greedy algorithm named Greedy Path (GP) to solve the problem. GP transforms a DAG into a Nettree with depth k+1 at first. Then we calculate the number of root-leaf paths for each node of the Nettree to achieve the number of total paths for each vertex of the DAG. In order to obtain an optimized disjoint path, GP selects the node in the k+1th level of the Nettree as the current node, and searches for the optimized parent in the usable parents whose number of total paths is minimal. We iterate this process, until there is no disjoint path. The space and time complexities of GP are O(wkn(p+q)) and O(kn(p+q)+n2), respectively, where k, n, p, q, and w are length constraint, number of vertices, maximum indegree, maximum outdegree, and the solution, respectively. To evaluate the performance of GP we also propose an algorithm which can create artificial DAGs with known maximum disjoint paths. Experimental results show that GP can get better performance than other competitive algorithms.
摘要 本文对有向无环图中具有长度约束的最大不相交路径问题进行研究，该问题是求解图中两点间路径长度为k的最大不相交路径. 为了对该问题进行求解，本文提出了贪婪搜索算法(Greedy Path，GP)，该算法先将一个有向无环图转化为一棵深度为k+1的网树，然后计算每个网树结点的树根叶子路径数，并以此计算图中每个顶点的总路径数，之后从网树的第k+1层结点出发，在当前结点的双亲结点中选择未被使用且总路径数最小的双亲，以此形成一条优化的不相交路径，最后迭代这一过程，直到不再有新的不相交路径为止. GP算法的时间和空间复杂度分别为O(wkn(p+q))和O(kn(p+q)+n2)，这里k, n, p, q和w为分别为长度约束、顶点数、最大顶点入度、最大顶点出度以及问题的解. 为了测试GP算法的近似性，本文又建立了一种能够生成人工数据的算法，该算法能够准确地控制有向无环图中最大不相交路径的数量. 通过该算法生成了大量测试用数据，实验结果表明GP算法较其他对比性算法具有良好的近似性且其实际求解时间较短，验证了该方法的有效性和可行性.
Youxi Wu, Shuai Fu, He Jiang, Xindong Wu. Strict approximate pattern matching with general gaps. Applied Intelligence. 2015, Volume 42, Issue 3, pp 566-580. (PDF) (Source codes)
Abstract Pattern matching with gap constraints is one of the essential problems in computer science such as music information retrieval and sequential pattern mining. One of the cases is called loose matching, which means only considering the matching position of the last substring of pattern in the sequence. One more challenging problem is considering the matching positions of each character in the sequence, called strict pattern matching which is one of the essential tasks of sequential patterns mining with gap constraints. Some strict pattern matching algorithms were used to handle pattern mining tasks, since strict pattern matching can be used to compute the occurrence frequency of some patterns in the given sequence and then the frequent patternscan be derived. In this article, we address a more general strict approximate pattern matching with Hamming distance, named SAP (Strict Approximate Pattern matching with general gaps and length constraints), which means that the gap constraints can be negative. We show that a SAP instance can be transformed into exponential amount of the exact pattern matching with general gaps instances. Hence, we propose an effective online algorithm, named SETA (SubnETtree for sAp), based on the subnettree structure (a Nettree is an extension of a tree with multi-parents and multi-roots) and show the completeness of the algorithm. The space and time complexities of the algorithm are O(m×Maxlen×W×d) and O(Maxlen×W×m2×n×d), respectively, where m, Maxlen, W, and d are the length of pattern P, the maximal length constraint, the maximal gap length of pattern P and the approximate threshold. Extensive experimental results validate the correctness and effectiveness of SETA.
摘 要: 具有间隙约束的模式匹配是序列模式挖掘的关键问题之一.一次性条件约束是要求序列中每个位置的字符最多只能使用一次,在序列模式挖掘中采用一次性条件约束更加合理.但目前间隙约束多为非负间隙,非负间隙对字符串中每个字符的出现顺序具有严格的约束,一定程度上限定了匹配的灵活性.为此本文提出了一般间隙及一次性条件的严格模式匹配问题,之后理论证明了该问题的计算复杂性为NP-Hard问题,为了对该问题进行有效求解,本文在网树结构上构建了动态更新结点信息的启发式求解算法(Dynamically Changing Node Property, DCNP),该算法动态地更新各个结点的树根路径数、叶子路径数和树根-叶子路径数等,进而每次可以获得一个较优的出现.之后迭代这一过程.为了有效地提高DCNP算法速度,避免动态更新大量的结点信息,我们提出了Checking机制,使得DCNP算法仅在可能产生内部重复出现的时候才进行动态更新.理论分析了DCNP算法的时间复杂度和空间复杂度.大量实验结果验证了DCNP算法的具有良好的求解性能.
Abstract: Pattern matching with gap constraints is one of the key issues of sequential pattern mining. One-off condition which is always used in sequential pattern mining tasks means that the character of each position in the sequence can be used at most once for pattern matching. Recently, most researches focus on pattern matching with non-negative gaps, but the rule of non-negative gaps implies the character’s order in the sequence may limit the flexibility of matching. For these reasons, we propose the strict pattern matching with general gaps and one-off condition and show that this problem is NP-Hard. To handle this issue, we propose a heuristic algorithm, named Dynamically Changing Node Property (DCNP), based on Nettree which dynamically updates the properties of each node such as the numbers of root paths, leaf paths and root-leaf paths, thus we can get a better occurrence. Then we iterate the above process. To improve the speed of DCNP effectively and avoid dynamically updating information of nodes on a large scale, a checking mechanism is proposed which means that DCNP will update information of nodes only when the occurrence may have repetition. We also theoretically analyze the space and time complexities of DCNP. Experimental results show that DCNP has better performance than other competitive algorithms.