NOSEP: Non-Overlapping Sequence Pattern Mining with Gap Constraints


Abstract: Sequence pattern mining aims to discover frequent subsequences as patterns in a single sequence or a sequence database. By combining gap constraints (or flexible wildcards), users can specify special characteristics of the patterns and discover meaningful subsequences suitable for their own application domains, such as finding gene transcription sites from DNA sequences or discovering patterns for time series data classification. Due to the inherent complexity of sequence patterns, including the exponential candidate space with respect to pattern letters and gap constraints, existing sequence pattern mining methods are to date either incomplete or do not support the Apriori property, because the support ratio of a pattern may be greater than that of its sub-patterns. Most importantly, patterns discovered by these methods are either too restrictive or too general and cannot represent underlying meaningful knowledge in the sequences. In this paper, we focus on a non-overlapping sequence pattern mining task with gap constraints, where a non-overlapping sequence pattern allows sequence letters to be flexibly and maximally utilized for pattern discovery. A new Apriori-based non-overlapping sequence pattern mining algorithm (NOSEP) is proposed. NOSEP is a complete pattern mining algorithm, which uses a specially designed data structure, Nettree, to calculate the exact occurrence of a pattern in the sequence. Experimental results and comparisons on biological DNA sequences, time series data, and the Gazelle datasets demonstrate the efficiency of the proposed algorithm and the uniqueness of non-overlapping sequence patterns compared to other methods.
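
To make the task concrete, here is a minimal brute-force sketch in Python of counting non-overlapping occurrences under a gap constraint. It is not the Nettree-based method used by NOSEP and it is exponential in the number of occurrences, so it only suits tiny inputs. It assumes that two occurrences are non-overlapping when they never use the same sequence position at the same pattern index, and that one gap range [min_gap, max_gap] applies between every pair of consecutive pattern letters. All function and parameter names are hypothetical and do not come from this repository.

```python
from itertools import combinations


def occurrences(seq, pattern, min_gap, max_gap):
    """Return every occurrence of `pattern` in `seq` as a tuple of 0-based
    positions, with min_gap..max_gap wildcards between consecutive letters."""
    results = []

    def extend(prefix):
        j = len(prefix)
        if j == len(pattern):
            results.append(tuple(prefix))
            return
        if j == 0:
            candidates = range(len(seq))
        else:
            start = prefix[-1] + 1 + min_gap
            stop = min(len(seq), prefix[-1] + 1 + max_gap + 1)
            candidates = range(start, stop)
        for pos in candidates:
            if seq[pos] == pattern[j]:
                extend(prefix + [pos])

    extend([])
    return results


def non_overlapping(a, b):
    """Assumed condition: no position is shared at the same pattern index."""
    return all(x != y for x, y in zip(a, b))


def support(seq, pattern, min_gap, max_gap):
    """Size of the largest pairwise non-overlapping set of occurrences,
    found by exhaustive search over subsets (illustration only)."""
    occs = occurrences(seq, pattern, min_gap, max_gap)
    for size in range(len(occs), 0, -1):
        for subset in combinations(occs, size):
            if all(non_overlapping(a, b) for a, b in combinations(subset, 2)):
                return size
    return 0


print(occurrences("AABB", "AB", 0, 1))  # [(0, 2), (1, 2), (1, 3)]
print(support("AABB", "AB", 0, 1))      # 2
print(support("ABABA", "ABA", 0, 0))    # 2
```

In the last call, position 2 is used as the final letter of one occurrence and the first letter of the other. Under the assumed condition this is allowed, which is one way to read the statement that sequence letters are "flexibly and maximally utilized".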




Gazelle BMS1: Clickstream data from an e-commerce website

Gazelle BMS2: Clickstream data from an e-commerce website

Others: DNA & protein sequences




NOSEP: version for small character sets

NOSEP-i: version for large character sets



NOSEP-b: version for small character sets

NOSEP-b-i: version for large character sets



NetM-B: version for small character sets

NetM-B-i: version for large character sets



NetM-D: version for small character sets

NetM-D-i: version for large character sets



GSgrow: version for small character sets

GSgrow-i: version for large character sets