Strict approximate pattern matching with general gaps

Youxi Wu, Shuai Fu, He Jiang, Xindong Wu. Strict approximate pattern matching with general gaps. Applied Intelligence. 2015, Volume 42, Issue 3, pp. 566-580. (PDF) (Source codes)

 

Abstract: Pattern matching with gap constraints is an essential problem in computer science, with applications such as music information retrieval and sequential pattern mining. One variant, called loose matching, considers only the matching position of the last substring of the pattern in the sequence. A more challenging variant, called strict pattern matching, considers the matching position of every character in the sequence and is one of the essential tasks of sequential pattern mining with gap constraints. Some strict pattern matching algorithms have been used for pattern mining tasks, since strict pattern matching can compute the occurrence frequency of patterns in a given sequence, from which the frequent patterns can be derived. In this article, we address a more general problem, strict approximate pattern matching with Hamming distance, named SAP (Strict Approximate Pattern matching with general gaps and length constraints), in which the gap constraints can be negative. We show that a SAP instance can be transformed into an exponential number of instances of exact pattern matching with general gaps. Hence, we propose an effective online algorithm, named SETA (SubnETtree for sAp), based on the subnettree structure (a Nettree is an extension of a tree with multi-parents and multi-roots), and show the completeness of the algorithm. The space and time complexities of the algorithm are O(m×Maxlen×W×d) and O(Maxlen×W×m²×n×d), respectively, where m, Maxlen, W, and d are the length of pattern P, the maximal length constraint, the maximal gap length of pattern P, and the approximate threshold, respectively. Extensive experimental results validate the correctness and effectiveness of SETA.
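To make the problem statement concrete, the following Python sketch counts SAP-style occurrences by brute-force enumeration. It is not SETA and does not use a Nettree; it is only a naive reference under stated assumptions. The function name count_sap_bruteforce and its parameters are hypothetical. The sketch assumes that the gap between consecutive matched positions i and j is j - i - 1 (so a negative gap lets the next character be matched to the left), that the positions within one occurrence are distinct, and that the length of an occurrence is its largest position minus its smallest position plus one. The paper's formal definitions may differ in these details.

```python
from typing import List, Tuple


def count_sap_bruteforce(
    seq: str,
    chars: str,
    gaps: List[Tuple[int, int]],
    min_len: int,
    max_len: int,
    d: int,
) -> int:
    """Count position tuples that match `chars` in `seq` (naive enumeration).

    chars   : pattern characters p1..pm
    gaps    : (min_gap, max_gap) between p_j and p_(j+1); may be negative
    min_len, max_len : assumed bounds on the span of an occurrence
    d       : maximal Hamming distance (number of mismatched characters)
    """
    m, n = len(chars), len(seq)
    assert len(gaps) == m - 1, "one gap constraint per adjacent character pair"
    count = 0

    def extend(positions: List[int], mismatches: int) -> None:
        nonlocal count
        if len(positions) == m:
            # Assumed length constraint: span from leftmost to rightmost position.
            span = max(positions) - min(positions) + 1
            if min_len <= span <= max_len:
                count += 1
            return
        j = len(positions)          # index of the next pattern character
        lo, hi = gaps[j - 1]
        prev = positions[-1]
        for gap in range(lo, hi + 1):
            pos = prev + gap + 1    # assumed gap definition: pos - prev - 1
            if pos < 0 or pos >= n or pos in positions:
                continue            # outside the sequence or position reused
            miss = mismatches + (seq[pos] != chars[j])
            if miss <= d:           # prune once the threshold is exceeded
                extend(positions + [pos], miss)

    for start in range(n):
        miss = int(seq[start] != chars[0])
        if miss <= d:
            extend([start], miss)
    return count
```

For example, count_sap_bruteforce("aab", "ab", [(0, 1)], 1, 3, 0) counts the exact occurrences of a[0,1]b in "aab", and raising d to 1 also admits position tuples with one mismatched character. The running time of this enumeration grows exponentially with the pattern length, which is the kind of cost an algorithm such as SETA is designed to avoid.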

Algorithms:

 

SETA1

 

SETA2

 

SETA3

 

SETA

Data:

 

The data used in this paper