Approximate pattern matching with gap constraints
Abstract. Pattern matching is a key issue in sequential pattern mining, with many researchers now focussing on pattern matching with gap constraints. However, most of these studies involve exact pattern matching problems, a special case of approximate pattern matching and a more challenging task. In this study, we introduce an approximate pattern matching problem with Hamming distance. Its objective is to compute the number of approximate occurrences of pattern P with gap constraints in sequence S under similarity constraint d. We propose an efficient algorithm named SONG (Single-rOot Nettree for approximate pattern matchinG with gap constraints) and based on the new nonlinear data structure Single-root Nettree. SONG uses some new concepts of Single-root Nettrees to solve the problem effectively. Theoretical analysis and experiments demonstrate an interesting law that the ratio M(P,S,d)/N(P,S,m) approximately follows a binomial distribution, where M(P,S,d) and N(P,S,m) are the numbers of the approximate occurrences whose distances to pattern P are d (0≤d≤m) and no more than m (the length of pattern P), respectively. Experimental results for real biological data validate the efficiency and correctness of SONG.