周期性一般间隙约束的序列模式挖掘

 

 

       序列模式挖掘能够有效地挖掘到序列中的出现频率高的模式,从而在诸多领域中具有广泛的应用。假定子模式pipj可以分别匹配事件A和事件B,如果i<j,传统的序列模式挖掘方法能够对事件A在事件B之后的序列进行检测,而不能对事件B发生在事件A之前的序列进行识别。这样由于序列的有序性,因而挖掘的模式必然是有序的频繁模式,然而这种情况在实际中有时不是很合理。有鉴于此,本文提出了周期性一般间隙约束的序列模式挖掘问题,该问题具有如下5个特点:间隙值最小值可以为负值,因此称为一般间隙约束;每个模式的间隙约束都相同,这样的模式称为周期性模式;在支持数统计方面无特殊约束,因此允许序列中事件多次使用;本挖掘问题满足Apriori性质;挖掘支持率大于给定的频繁度阈值的频繁模式。为了进行有效地挖掘,采用深度优先的方式建立频繁模式树。由于一棵不完整网树(即网树的最后一层结点,可以存储在一个数组中)可以有效地表示一个模式在一个序列中的支持数,因此本文采用模式匹配技术,在一遍扫描序列数据库的情况下,建立其所有超模式的不完整网树森林,并对这些超模式的支持率进行有效地计算,进而挖掘出所有频繁模式,有效地提高了序列模式挖掘速度。实验结果验证了本文算法的可行性和有效性。

关键词     序列模式挖掘;一般间隙;频繁模式树;模式匹配;Apriori性质

 

 

Mining Sequential Patterns with Periodic General Gap Constraints

 

Abstract   Sequential pattern mining method can effectively discover the frequent patterns in the sequences and plays an essential role in many critical data mining tasks. Given sub-patterns pi and pj (i < j) can match events A and B respectively, traditional pattern mining methods can detect the sequence in which event B is after event A, but fail to find the sequence with event B occurring before event A. Therefore, the frequent patterns are always ordered since the sequences are ordered. To tackle this challenge, in this paper, we propose sequential pattern mining with periodic general gap constraints with five characteristics as follows. The minimal gap constraint, namely general gap constraint, can be a negative value. All gap constraints of the pattern are the same. Any event in the sequence can be used more than once in different supports. The problem satisfies the Apriori property under the new definition of offset sequences. If the support ratio of a pattern is greater than the given threshold, the pattern is a frequent pattern. To solve the problem effectively, depth-first search is used to create the frequent pattern tree. An Incomplete Nettree structure, the last layer of a Nettree which can be stored in an array, can represent the number of supports of the pattern in the sequence. Pattern matching method is used to create the Incomplete Nettrees for all super patterns of a frequent pattern by one way scan over the sequence database. Therefore, we can calculate the support ratios of these super patterns and find the frequent patterns. Experimental results validate the feasibility and effectiveness of the proposed algorithms.

 

 

求解算法:

 

对比算法1MCPaS-PRO

 

对比算法2AMIN-PRO

 

提出算法:MAPD-PRO