NOSEP: Non-Overlapping Sequence Pattern Mining with Gap Constraints

 

Abstract: Sequence pattern mining aims to discover frequent subsequences as patterns in a single sequence or a sequence database. By combining gap constraints (or flexible wildcards), users can specify special characteristics of the patterns and discover meaningful subsequences suitable for their own application domains, such as finding gene transcription sites in DNA sequences or discovering patterns for time series classification. Because of the inherent complexity of sequence patterns, including a candidate space that is exponential in the number of pattern letters and gap constraints, existing sequence pattern mining methods are either incomplete or do not support the Apriori property, because the support ratio of a pattern may be greater than that of its sub-patterns. Most importantly, the patterns discovered by these methods are either too restrictive or too general and cannot represent the underlying meaningful knowledge in the sequences. In this paper, we focus on a non-overlapping sequence pattern mining task with gap constraints, where a non-overlapping sequence pattern allows sequence letters to be flexibly and maximally utilized for pattern discovery. We propose a new Apriori-based non-overlapping sequence pattern mining algorithm, NOSEP. NOSEP is a complete pattern mining algorithm that uses a specially designed data structure, the Nettree, to calculate the exact number of occurrences of a pattern in the sequence. Experimental results and comparisons on biological DNA sequences, time series data, and the Gazelle datasets demonstrate the efficiency of the proposed algorithm and the uniqueness of non-overlapping sequence patterns compared with other methods.
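
To make the terms in the abstract concrete, here is a small Python sketch of occurrences under a gap constraint and of the non-overlapping condition. It is an illustration only, not the NOSEP or Nettree implementation. The function names (`occurrences`, `non_overlapping`, `greedy_non_overlapping`) and the parameters `min_gap` and `max_gap` are hypothetical. The sketch assumes that two occurrences are non-overlapping when they never use the same sequence position for the same pattern position, so a sequence letter may be reused at a different pattern position. The greedy selection returns a valid set of non-overlapping occurrences, which is only a lower bound on the support; it does not reproduce the exact count that NOSEP computes.

```python
from typing import List, Sequence, Tuple

Occurrence = Tuple[int, ...]


def occurrences(seq: str, pattern: str, min_gap: int, max_gap: int) -> List[Occurrence]:
    """Enumerate all occurrences of `pattern` in `seq` as tuples of 0-based positions.

    Between two consecutive matched positions, the number of skipped
    characters must lie in [min_gap, max_gap] (assumed gap semantics).
    This brute-force enumeration is exponential in the worst case.
    """
    results: List[Occurrence] = []

    def extend(prefix: List[int]) -> None:
        j = len(prefix)
        if j == len(pattern):
            results.append(tuple(prefix))
            return
        if j == 0:
            candidates = range(len(seq))
        else:
            start = prefix[-1] + 1 + min_gap
            stop = min(len(seq), prefix[-1] + 1 + max_gap + 1)
            candidates = range(start, stop)
        for pos in candidates:
            if seq[pos] == pattern[j]:
                extend(prefix + [pos])

    extend([])
    return results


def non_overlapping(a: Occurrence, b: Occurrence) -> bool:
    """True if a and b never share a sequence position at the same pattern position."""
    return all(x != y for x, y in zip(a, b))


def greedy_non_overlapping(occs: Sequence[Occurrence]) -> List[Occurrence]:
    """Pick occurrences in sorted order, keeping each one that is
    non-overlapping with everything kept so far (a lower bound only)."""
    chosen: List[Occurrence] = []
    for occ in sorted(occs):
        if all(non_overlapping(occ, kept) for kept in chosen):
            chosen.append(occ)
    return chosen


if __name__ == "__main__":
    all_occ = occurrences("AABB", "AB", min_gap=0, max_gap=2)
    print(all_occ)
    print(greedy_non_overlapping(all_occ))
    # Position 1 of "AAA" is reused at a different pattern position.
    print(occurrences("AAA", "AA", min_gap=0, max_gap=0))
```

For the sequence "AABB" and the pattern "AB" with gaps in [0, 2], the sketch finds four occurrences, of which two can be kept together under the non-overlapping condition. For "AAA" and "AA" with no gap allowed, both occurrences are kept even though they share the middle letter, because that letter matches a different pattern position in each.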

 

 

Datasets:

Gazelle BMS1: Clickstream data from an e-commerce website

Gazelle BMS2: Clickstream data from an e-commerce website

Others: DNA & protein sequences

 

Algorithms:

 

NOSEP: version for small character sets

NOSEP-i: version for large character sets

 

 

NOSEP-b: version for small character sets

NOSEP-b-i: version for large character sets

 

 

NetM-B: version for small character sets

NetM-B-i: version for large character sets

 

 

NetM-D: version for small character sets

NetM-D-i: version for large character sets

 

 

GSgrow: version for small character sets

GSgrow-i: version for large character sets