NetNDP: Nonoverlapping (δ, γ)-approximate Pattern matching

Astract:
Pattern matching can calculate the support of patterns, which is a key issue of sequential pattern mining (or sequence pattern mining). Nonoverlapping pattern matching means that any two occurrences cannot use the same character of the sequence at the same position. Approximate pattern matching allows for some data noise, which is more general than exact pattern matching. At present, nonoverlapping approximate pattern matching employs Hamming distance, which cannot measure the local approximation between subsequence and pattern, resulting in large deviations in matching results. To tackle the aforementioned issue, this paper addresses Nonoverlapping Delta and gamma approximate Pattern matching (NDP), i.e. employ (δ, γ)-distance to study an approximate pattern matching, where the local distance and the global distance do not exceed δ and γ, respectively. We first transform the NDP problem into a local approximate Nettree. Then we construct an efficient algorithm, named local approximate Nettree for NDP (NetNDP). This paper proposes Minimal Root Distance (MRD). According to MRD, we can know whether a node has root paths satisfy the global constraint or not and prune invalid nodes and parent-child relationships. NetNDP finds the rightmost absolute leaf of the max root, searches the rightmost occurrence from the rightmost absolute leaf , and deletes the occurrence. We iterate the above steps until there is no new occurrence. Numerous experiments verify the correctness and effectiveness of the proposed algorithm.

Code
NetNDP
INSGrow-appro
NETLAP-(δ,γ)
NETASPNO-(δ,γ)
NetNDP-nonp

dataset
dataset

Abstract: In pattern matching, a gap constraint is a more flexible wildcard than traditional wildcards “?“ and “*“. Pattern matching with gap constraints is more difficult to handle and fulfills user’s enquiries more easily. Pattern matching with gap constraints has therefore been carried out in numerous research works such as music information retrieval, finding protein sites, and sequence pattern mining. Strict pattern matching under a nonoverlapping condition, as a type of pattern matching with gap constraints, is a key issue of sequence pattern mining with gap constraints since it can be used to compute the frequency of a pattern. Exact matching limits the flexibility of the match to some extent since it requires that each character match. Therefore, we address Approximate Strict Pattern matching under the NonOverlapping constraints (ASPNO). We propose an effective algorithm, named NETtree for ASPNO (NETASPNO), which first transforms the problem into a Nettree data structure. To find the nonoverlapping occurrences effectively, NETASPNO is iterated to find the rightmost root-leaf-path from the rightmost root and prunes the path. Extensive experimental results demonstrate that NETASPNO has better performance than other competitive algorithms.

Codes:

NETROL

NETLAP-Appro

NETASPNO

NOSEP: Non-Overlapping Sequence Pattern Mining with Gap Constraints

Youxi Wu, Yao Tong, Xingquan Zhu, Xindong Wu. NOSEP: Nonoverlapping Sequence Pattern Mining with Gap Constraints. IEEE Transactions on Cybernetics.  DOI (identifier) 10.1109/TCYB.2017.2750691

 

Abstract:

Sequence pattern mining aims to discover frequent subsequences as patterns in a single sequence or a sequence database. By combining gap constraints (or flexible wildcards), users can specify special characteristics of the patterns and discover meaningful subsequences suitable for their own application domains, such as finding gene transcription sites from DNA sequences or discovering patterns for time series data classification.  However, due to the inherent complexity of sequence patterns, including the exponential candidate space with respect to pattern letters and gap constraints, to date, existing sequence pattern mining methods are either incomplete or do not support the Apriori property since the support or support ratio of a pattern may be greater than that of its sub-patterns. Most importantly, patterns discovered by these methods are either too restrictive or too general and cannot represent underlying meaningful knowledge in the sequences. In this paper, we focus on a non-overlapping sequence pattern mining task with gap constraints, where a non-overlapping sequence pattern allows sequence letters to be flexible, yet maximally, utilized for pattern discovery. A new non-overlapping sequence pattern mining algorithm (NOSEP), which is an Apriori-based and complete mining algorithm, is proposed by using Nettree, a specially designed data structure, to calculate the exact occurrence of a pattern in the sequence. Experimental results and comparisons with biology DNA sequences, time series data,  and  Gazelle Datasets demonstrate the efficiency of the proposed algorithm and the uniqueness of non-overlapping sequence patterns compared to other methods.

 

Datasets:Dataset1  (DNA and Protein sequences)

 

Gazelle Datasets (From SPMF):

BMSWebView1 (Gazelle) ( KDD CUP 2000)

BMSWebView2 (Gazelle) ( KDD CUP 2000)

 

 

Algorithms:

NOSEP   for character version(small alphabet size)

NOSEPi   for integer version(large alphabet size)

 

NOSEP-B   (backtracking strategy) for character version(small alphabet size)

NOSEP-B-i  for integer version(large alphabet size)

 

NetM-B   (BFS strategy) for character version(small alphabet size)

NetM-B-i   for integer version(large alphabet size)

 

NetM-D  (DFS strategy) for character version(small alphabet size)

NetM-D-i   for integer version(large alphabet size)

 

 

GSgrow  for character version(small alphabet size)

GSgrow-i  for integer version(large alphabet size)

is proposed by Ding et al[1] .

 

[1]Ding B, Lo D, Han J, et al. Efficient mining of closed repetitive gapped subsequences from a sequence database. In: Proceedings of IEEE International Conference on Data Engineering, Shanghai, 2009. 1024-1035

 

 

中文摘要:序列模式挖掘是在单序列或序列数据库中发现频繁子序列。在结合间隙约束(或可变长度通配符)后形成具有间隙约束的序列模式挖掘,这种间隙约束可以用来指定模式的特点,以用于发现有意义的子序列,其已在诸多领域得到了广泛的应用,如DNA序列寻找基因转录位点或时间序列数据分类的模式发现。由于在序列字符和间隙的作用下,候选空间呈现指数形式,因此该挖掘具有一定的复杂性。迄今为止现有的挖掘方法要么是挖掘结果不具有完备性的算法,要么是不支持的Apriori性质的算法,因为模式的支持度或支持率可能大于其子模式的。更为重要的是这些方法发现的模式要么限制性太强,要么过于笼统,难以表示序列中隐含的有意义的知识。在本文中,我们研究了一种无重叠的序列模式挖掘,其是一种允许序列字符更加灵活并最大限度地用于模式发现。在此基础上,提出了一种新的基于Apriori性质的无重叠序列模式挖掘算法(NOSEP),该算法是一个具有完备性的模式挖掘算法,其采用模式匹配策略并运用一个称为网树的特殊数据结构来计算一个模式的支持度,并采用模式增长策略实现候选模式有效剪枝。在生物DNA序列、时间序列数据和一个公开的电子商务点击流数据集上与同类算法进行对比,实验结果不但验证了NOSEP算法的效率,并且表明在相同条件下其可以发现更多且更加有意义的频繁模式。

Strict pattern matching under non-overlapping condition

Youxi Wu, Cong Shen, He Jiang, Xindong Wu. Strict pattern matching under non-overlapping condition. Science China Information Sciences. 2017, 60 (1):012101:1-16.   Read Online (PDF)

 

ABSTRACT

Pattern matching (or string matching) is an essential task in computer science, especially in sequential pattern mining, since pattern matching methods can be used to calculate the support (or the number of occurrences) of a pattern and then to determine whether the pattern is frequent or not. A state-of-the-art sequential pattern mining with gap constraints (or flexible wildcards) uses the number of non-overlapping occurrences to denote the frequency of a pattern. Non-overlapping means that any two occurrences cannot use the same character of the sequence at the same position of the pattern. In this paper, we investigate strict pattern matching under the non-overlapping condition. We show that the problem is in P at first. Then we propose an algorithm, called NETLAP-Best, which uses Nettree structure. NETLAP-Best transforms the pattern matching problem into a Nettree and iterates to find the rightmost root-leaf path, to prune the useless nodes in the Nettree after removing the rightmost root-leaf path. We show that NETLAP-Best is a complete algorithm and analyse the time and space complexities of the algorithm. Extensive experimental results demonstrate the correctness and efficiency of NETLAP-Best.

摘要:

模式匹配(串匹配)是计算机科学中至关重要的一个任务,特别是在序列模式挖掘中,因为模式匹配方法可以用来计算一个模式在序列中的支持度(出现数),进而判断这个模式是否频繁。一种具有间隙约束(可变长度通配符)的序列模式挖掘算法采用模式的无重叠出现数目来表示这个模式的频度,这里无重叠是指任何两个出现不能共用序列的相同位置的字符。首先理论证明了无重叠条件的严格模式匹配的计算复杂度是P,然后提出了一个基于网树结构的NETLAP-Best算法,该算法将模式匹配问题转换为一颗网树,并在网树上迭代地寻找最右树根-叶子路径,之后剪去这条路径和无用的网树结点。之后理论证明了NETLAP-Best算法的完备性并分析了该算法的时间和空间复杂度。大量实验结果验证了NETLAP-Best算法的正确性和有效性。

 

Algorithms:

INSgrow

netlap_nopruning

netlap_slow

netlap_best

 

Data

dataset

 

Manuscript

NETLAP_2015

LCI-ELM: Length-Changeable Incremental Extreme Learning Machine

 

Abstract: Extreme Learning Machine (ELM) is a learning algorithm for generalized Single-hidden-Layer Feed-forward Networks (SLFNs). In order to obtain a suitable network architecture, Incremental Extreme Learning Machine (I-ELM) is a sort of ELM constructing SLFNs by adding hidden nodes one by one. Although kinds of I-ELM-class algorithms were proposed to improve the convergence rate or to obtain minimal training error, they do not change the construction way of I-ELM or face the over-fitting risk. Making the testing error converge quickly and stably therefore becomes an important issue. In this paper, we proposed a new incremental ELM which is referred to as Length-Changeable Incremental Extreme Learning Machine (LCI-ELM). It allows more than one hidden node to be added to the network and the existing network will be regarded as a whole in output weights tuning. The output weights of newly added hidden nodes are determined using a partial error-minimizing method. We prove that an SLFN constructed using LCI-ELM has approximation capability on a universal compact input set as well as on a finite training set. Experimental results demonstrate that LCI-ELM achieves higher convergence rate as well as lower over-fitting risk than some competitive I-ELM-class algorithms.

 

Some ELM-class Source Codes

Approximate pattern matching with gap constraints

Youxi Wu, Zhiqiang Tang, He Jiang, Xindong Wu. Approximate Pattern Matching with Gap Constraints. Journal of Information Science.2016, 42(5): 639–658  (PDF)  (Source codes)

 

Abstract. Pattern matching is a key issue in sequential pattern mining, with many researchers now focussing on pattern matching with gap constraints. However, most of these studies involve exact pattern matching problems, a special case of approximate pattern matching and a more challenging task. In this study, we introduce an approximate pattern matching problem with Hamming distance. Its objective is to compute the number of approximate occurrences of pattern P with gap constraints in sequence S under similarity constraint d. We propose an efficient algorithm named SONG (Single-rOot Nettree for approximate pattern matchinG with gap constraints) and based on the new nonlinear data structure Single-root Nettree. SONG uses some new concepts of Single-root Nettrees to solve the problem effectively. Theoretical analysis and experiments demonstrate an interesting law that the ratio M(P,S,d)/N(P,S,m) approximately follows a binomial distribution, where M(P,S,d) and N(P,S,m) are the numbers of the approximate occurrences whose distances to pattern P are d (0≤dm) and no more than m (the length of pattern P), respectively. Experimental results for real biological data validate the efficiency and correctness of SONG.

课题组现有博士生导师1名,硕士生导师3名,每年计划共招生4-8名研究生,欢迎有志向的本科生或研究生进行报考!

课题组先后培养多名硕士,其中刘亚伟和孙乐等2名硕士获评河北省优秀硕士学位论文;
王玲玲、沈丛、周坤、付帅、贾晓菲和刘东等6名硕士获评我校优秀硕士学位论文;
仝瑶、王月华和张红岩等3名硕士主持河北省省级科研项目共3项。
许多研究生的研究成果录用或发表在IEEE Transactions on Cybernetics、Science China Information Sciences、Journal of Computer Science and Technology、Neurocomputing、IEEE Access、Applied Intelligence、Journal of Information Science、Cluster Computing、计算机学报、软件学报、通信学报等国内外知名学术刊物。

以下是课题组自2012年以来部分同学取得的部分成绩(有些同学已经做出了很好的工作,但是其工作正在审稿中,因此暂时未能列出)(尽管目前有些同学已经攻读博士,但所列的工作为其硕士阶段或本科阶段所做的工作,因此仅以硕士或本科表示):

  1. 仝瑶硕士生 在《IEEE Transactions on Cybernetics》(IF:7.384)录用论文1篇;获得河北省在读研究生创新资助项目(220056)项目1项;软件著作权登记1项;
  2. 宋亚青硕士生 软件著作权登记1项;
  3. 沈丛硕士 在《Science China Information Sciences》(中国科学信息科学英文版)(CCF推荐B类期刊&SCI期刊)发表论文1篇
  4. 周坤硕士 在《计算机学报》发表论文1篇;软件著作权登记1项;
  5. 唐志强本科 在《Journal of Information Science》(SCI期刊)发表论文1篇
  6. 刘东硕士 在《Neurocomputing》(CCF推荐期刊&SCI期刊)及《Journal of Computer Science and Technology》(CCF推荐B类期刊&SCI期刊)共发表论文2篇
  7. 杨雷硕士 在国际会议录用/发表论文1篇;
  8. 闫洁硕士 在国际期刊录用/发表论文1篇;
  9. 付帅硕士 在《Applied Intelligence》(CCF推荐期刊&SCI期刊)发表论文1篇获得国家奖学金
  10. 贾晓菲硕士 在《软件学报》发表论文1篇获得国家奖学金
  11. 王玲玲硕士 在《Applied Intelligence》(CCF推荐期刊&SCI期刊)发表论文1篇;获得软件著作权登记1项;获得校优秀硕士学位论文
  12. 刘亚伟硕士 在《软件学报》发表论文1篇获得国家奖学金获得2015年河北省优秀硕士学位论文(全省共118篇)
  13. 孙秀芳硕士 在《小型微型计算机系统》发表论文1篇
  14. 孙劲耀硕士 在《计算机科学》发表论文1篇
  15. 侯丹丹硕士 在《小型微型计算机系统》发表论文
  16. 李建满硕士 在《计算机工程与应用》发表论文
  17. 孙乐硕士 在《计算机学报》发表论文1篇;获得2014年河北省优秀硕士学位论文(全省共106篇)

在《网树求解有向无环图中具有长度约束的最大不相交路径》一文中提出的有向无环图构造生成算法。该论文信息如下:

李艳,武优西,黄春萍,张志颖,曾珍香. 网树求解有向无环图中具有长度约束的最大不相交路径. 通信学报. 2015, 36(8): 38-49

 

该算法可以依据用户输入的n, kd来随机生成从顶点1到顶点n路径长度为k情况下,最大不交路径数为w的有向无环图,这里d表示图中边密度,w为一个大于0.7倍理论最大不相交路径数(该数为(n-2)/(k-1))的随机数(注:因为如果最大不相交路径太少且在边密度较高情况下,任何一种算法都相对较容易获得较好质量的解,因此难于用来评价算法的好坏,这一点在后面的实验中也有所体现). 该算法采用的原理如下:

第一步:随机生成w*(k-1)个不同数字,其范围为2到n-1,每个数字表示一个顶点名;

第二步:建立一条长度为k的路径. 每k-1个数字与顶点1和n一起构成一条长度为k的路径,具体做法是从前一个数字顶点向后一个数字顶点建立一条有向边,然后顶点1向这k-1个数字中第一个数字顶点建立一条有向边,最后一个顶点向顶点n建立一条有向边;

第三步:迭代第二步,直到建立w条长度为k的不相交路径;

第四步:随机加边. 随机生成两个介于1到n-1之间的数字ab,判断从ab的有向边是否存在,如果存在重新生成;如果ab相同,也重新生成;如果会产生回路也重新生成(从顶点b出发,依据当前图G的状态,采用广度优先搜索策略,如果能够回到顶点a,说明存在回路),直到可以生成一组满足要求的数字,然后在顶点ab之间建立一条有向边;

第五步:迭代第四步,直到图中边的密度为用户指定的d值后停止.

由于第二和第三步保证了顶点1到顶点n之间有w条路径长度为k的不相交路径,并且第四步在生成边的过程中b最大为n-1,不能取到n,这样顶点n的入度就仅仅为w,这样就确保随机生成的有向无环图从顶点1到顶点n之间最多仅有w条路径长度为k的不相交路径.

 

算法的源代码:dpc