NetNDP: Nonoverlapping (δ, γ)-approximate Pattern matching

Pattern matching can calculate the support of patterns, which is a key issue of sequential pattern mining (or sequence pattern mining). Nonoverlapping pattern matching means that any two occurrences cannot use the same character of the sequence at the same position. Approximate pattern matching allows for some data noise, which is more general than exact pattern matching. At present, nonoverlapping approximate pattern matching employs Hamming distance, which cannot measure the local approximation between subsequence and pattern, resulting in large deviations in matching results. To tackle the aforementioned issue, this paper addresses Nonoverlapping Delta and gamma approximate Pattern matching (NDP), i.e. employ (δ, γ)-distance to study an approximate pattern matching, where the local distance and the global distance do not exceed δ and γ, respectively. We first transform the NDP problem into a local approximate Nettree. Then we construct an efficient algorithm, named local approximate Nettree for NDP (NetNDP). This paper proposes Minimal Root Distance (MRD). According to MRD, we can know whether a node has root paths satisfy the global constraint or not and prune invalid nodes and parent-child relationships. NetNDP finds the rightmost absolute leaf of the max root, searches the rightmost occurrence from the rightmost absolute leaf , and deletes the occurrence. We iterate the above steps until there is no new occurrence. Numerous experiments verify the correctness and effectiveness of the proposed algorithm.



Abstract: In pattern matching, a gap constraint is a more flexible wildcard than traditional wildcards “?“ and “*“. Pattern matching with gap constraints is more difficult to handle and fulfills user’s enquiries more easily. Pattern matching with gap constraints has therefore been carried out in numerous research works such as music information retrieval, finding protein sites, and sequence pattern mining. Strict pattern matching under a nonoverlapping condition, as a type of pattern matching with gap constraints, is a key issue of sequence pattern mining with gap constraints since it can be used to compute the frequency of a pattern. Exact matching limits the flexibility of the match to some extent since it requires that each character match. Therefore, we address Approximate Strict Pattern matching under the NonOverlapping constraints (ASPNO). We propose an effective algorithm, named NETtree for ASPNO (NETASPNO), which first transforms the problem into a Nettree data structure. To find the nonoverlapping occurrences effectively, NETASPNO is iterated to find the rightmost root-leaf-path from the rightmost root and prunes the path. Extensive experimental results demonstrate that NETASPNO has better performance than other competitive algorithms.





NOSEP: Non-Overlapping Sequence Pattern Mining with Gap Constraints

Youxi Wu, Yao Tong, Xingquan Zhu, Xindong Wu. NOSEP: Nonoverlapping Sequence Pattern Mining with Gap Constraints. IEEE Transactions on Cybernetics.  DOI (identifier) 10.1109/TCYB.2017.2750691



Sequence pattern mining aims to discover frequent subsequences as patterns in a single sequence or a sequence database. By combining gap constraints (or flexible wildcards), users can specify special characteristics of the patterns and discover meaningful subsequences suitable for their own application domains, such as finding gene transcription sites from DNA sequences or discovering patterns for time series data classification.  However, due to the inherent complexity of sequence patterns, including the exponential candidate space with respect to pattern letters and gap constraints, to date, existing sequence pattern mining methods are either incomplete or do not support the Apriori property since the support or support ratio of a pattern may be greater than that of its sub-patterns. Most importantly, patterns discovered by these methods are either too restrictive or too general and cannot represent underlying meaningful knowledge in the sequences. In this paper, we focus on a non-overlapping sequence pattern mining task with gap constraints, where a non-overlapping sequence pattern allows sequence letters to be flexible, yet maximally, utilized for pattern discovery. A new non-overlapping sequence pattern mining algorithm (NOSEP), which is an Apriori-based and complete mining algorithm, is proposed by using Nettree, a specially designed data structure, to calculate the exact occurrence of a pattern in the sequence. Experimental results and comparisons with biology DNA sequences, time series data,  and  Gazelle Datasets demonstrate the efficiency of the proposed algorithm and the uniqueness of non-overlapping sequence patterns compared to other methods.


Datasets:Dataset1  (DNA and Protein sequences)


Gazelle Datasets (From SPMF):

BMSWebView1 (Gazelle) ( KDD CUP 2000)

BMSWebView2 (Gazelle) ( KDD CUP 2000)




NOSEP   for character version(small alphabet size)

NOSEPi   for integer version(large alphabet size)


NOSEP-B   (backtracking strategy) for character version(small alphabet size)

NOSEP-B-i  for integer version(large alphabet size)


NetM-B   (BFS strategy) for character version(small alphabet size)

NetM-B-i   for integer version(large alphabet size)


NetM-D  (DFS strategy) for character version(small alphabet size)

NetM-D-i   for integer version(large alphabet size)



GSgrow  for character version(small alphabet size)

GSgrow-i  for integer version(large alphabet size)

is proposed by Ding et al[1] .


[1]Ding B, Lo D, Han J, et al. Efficient mining of closed repetitive gapped subsequences from a sequence database. In: Proceedings of IEEE International Conference on Data Engineering, Shanghai, 2009. 1024-1035




Strict pattern matching under non-overlapping condition

Youxi Wu, Cong Shen, He Jiang, Xindong Wu. Strict pattern matching under non-overlapping condition. Science China Information Sciences. 2017, 60 (1):012101:1-16.   Read Online (PDF)



Pattern matching (or string matching) is an essential task in computer science, especially in sequential pattern mining, since pattern matching methods can be used to calculate the support (or the number of occurrences) of a pattern and then to determine whether the pattern is frequent or not. A state-of-the-art sequential pattern mining with gap constraints (or flexible wildcards) uses the number of non-overlapping occurrences to denote the frequency of a pattern. Non-overlapping means that any two occurrences cannot use the same character of the sequence at the same position of the pattern. In this paper, we investigate strict pattern matching under the non-overlapping condition. We show that the problem is in P at first. Then we propose an algorithm, called NETLAP-Best, which uses Nettree structure. NETLAP-Best transforms the pattern matching problem into a Nettree and iterates to find the rightmost root-leaf path, to prune the useless nodes in the Nettree after removing the rightmost root-leaf path. We show that NETLAP-Best is a complete algorithm and analyse the time and space complexities of the algorithm. Extensive experimental results demonstrate the correctness and efficiency of NETLAP-Best.















LCI-ELM: Length-Changeable Incremental Extreme Learning Machine


Abstract: Extreme Learning Machine (ELM) is a learning algorithm for generalized Single-hidden-Layer Feed-forward Networks (SLFNs). In order to obtain a suitable network architecture, Incremental Extreme Learning Machine (I-ELM) is a sort of ELM constructing SLFNs by adding hidden nodes one by one. Although kinds of I-ELM-class algorithms were proposed to improve the convergence rate or to obtain minimal training error, they do not change the construction way of I-ELM or face the over-fitting risk. Making the testing error converge quickly and stably therefore becomes an important issue. In this paper, we proposed a new incremental ELM which is referred to as Length-Changeable Incremental Extreme Learning Machine (LCI-ELM). It allows more than one hidden node to be added to the network and the existing network will be regarded as a whole in output weights tuning. The output weights of newly added hidden nodes are determined using a partial error-minimizing method. We prove that an SLFN constructed using LCI-ELM has approximation capability on a universal compact input set as well as on a finite training set. Experimental results demonstrate that LCI-ELM achieves higher convergence rate as well as lower over-fitting risk than some competitive I-ELM-class algorithms.


Some ELM-class Source Codes

Approximate pattern matching with gap constraints

Youxi Wu, Zhiqiang Tang, He Jiang, Xindong Wu. Approximate Pattern Matching with Gap Constraints. Journal of Information Science.2016, 42(5): 639–658  (PDF)  (Source codes)


Abstract. Pattern matching is a key issue in sequential pattern mining, with many researchers now focussing on pattern matching with gap constraints. However, most of these studies involve exact pattern matching problems, a special case of approximate pattern matching and a more challenging task. In this study, we introduce an approximate pattern matching problem with Hamming distance. Its objective is to compute the number of approximate occurrences of pattern P with gap constraints in sequence S under similarity constraint d. We propose an efficient algorithm named SONG (Single-rOot Nettree for approximate pattern matchinG with gap constraints) and based on the new nonlinear data structure Single-root Nettree. SONG uses some new concepts of Single-root Nettrees to solve the problem effectively. Theoretical analysis and experiments demonstrate an interesting law that the ratio M(P,S,d)/N(P,S,m) approximately follows a binomial distribution, where M(P,S,d) and N(P,S,m) are the numbers of the approximate occurrences whose distances to pattern P are d (0≤dm) and no more than m (the length of pattern P), respectively. Experimental results for real biological data validate the efficiency and correctness of SONG.


许多研究生的研究成果录用或发表在IEEE Transactions on Cybernetics、Science China Information Sciences、Journal of Computer Science and Technology、Neurocomputing、IEEE Access、Applied Intelligence、Journal of Information Science、Cluster Computing、计算机学报、软件学报、通信学报等国内外知名学术刊物。


  1. 仝瑶硕士生 在《IEEE Transactions on Cybernetics》(IF:7.384)录用论文1篇;获得河北省在读研究生创新资助项目(220056)项目1项;软件著作权登记1项;
  2. 宋亚青硕士生 软件著作权登记1项;
  3. 沈丛硕士 在《Science China Information Sciences》(中国科学信息科学英文版)(CCF推荐B类期刊&SCI期刊)发表论文1篇
  4. 周坤硕士 在《计算机学报》发表论文1篇;软件著作权登记1项;
  5. 唐志强本科 在《Journal of Information Science》(SCI期刊)发表论文1篇
  6. 刘东硕士 在《Neurocomputing》(CCF推荐期刊&SCI期刊)及《Journal of Computer Science and Technology》(CCF推荐B类期刊&SCI期刊)共发表论文2篇
  7. 杨雷硕士 在国际会议录用/发表论文1篇;
  8. 闫洁硕士 在国际期刊录用/发表论文1篇;
  9. 付帅硕士 在《Applied Intelligence》(CCF推荐期刊&SCI期刊)发表论文1篇获得国家奖学金
  10. 贾晓菲硕士 在《软件学报》发表论文1篇获得国家奖学金
  11. 王玲玲硕士 在《Applied Intelligence》(CCF推荐期刊&SCI期刊)发表论文1篇;获得软件著作权登记1项;获得校优秀硕士学位论文
  12. 刘亚伟硕士 在《软件学报》发表论文1篇获得国家奖学金获得2015年河北省优秀硕士学位论文(全省共118篇)
  13. 孙秀芳硕士 在《小型微型计算机系统》发表论文1篇
  14. 孙劲耀硕士 在《计算机科学》发表论文1篇
  15. 侯丹丹硕士 在《小型微型计算机系统》发表论文
  16. 李建满硕士 在《计算机工程与应用》发表论文
  17. 孙乐硕士 在《计算机学报》发表论文1篇;获得2014年河北省优秀硕士学位论文(全省共106篇)


李艳,武优西,黄春萍,张志颖,曾珍香. 网树求解有向无环图中具有长度约束的最大不相交路径. 通信学报. 2015, 36(8): 38-49


该算法可以依据用户输入的n, kd来随机生成从顶点1到顶点n路径长度为k情况下,最大不交路径数为w的有向无环图,这里d表示图中边密度,w为一个大于0.7倍理论最大不相交路径数(该数为(n-2)/(k-1))的随机数(注:因为如果最大不相交路径太少且在边密度较高情况下,任何一种算法都相对较容易获得较好质量的解,因此难于用来评价算法的好坏,这一点在后面的实验中也有所体现). 该算法采用的原理如下:


第二步:建立一条长度为k的路径. 每k-1个数字与顶点1和n一起构成一条长度为k的路径,具体做法是从前一个数字顶点向后一个数字顶点建立一条有向边,然后顶点1向这k-1个数字中第一个数字顶点建立一条有向边,最后一个顶点向顶点n建立一条有向边;


第四步:随机加边. 随机生成两个介于1到n-1之间的数字ab,判断从ab的有向边是否存在,如果存在重新生成;如果ab相同,也重新生成;如果会产生回路也重新生成(从顶点b出发,依据当前图G的状态,采用广度优先搜索策略,如果能够回到顶点a,说明存在回路),直到可以生成一组满足要求的数字,然后在顶点ab之间建立一条有向边;