Data Mining: Concepts and Techniques, Second Edition (Jiawei Han), Chapter 8, Part 03 (PPT)
November 28, 2010
Data Mining: Concepts and Techniques
GSP—Generalized Sequential Pattern Mining
GSP (Generalized Sequential Pattern) mining algorithm, proposed by Srikant and Agrawal, EDBT’96
Outline of the method
  Initially, every item in the DB is a candidate of length 1
  For each level (i.e., sequences of length k) do
    scan the database to collect the support count for each candidate sequence
    generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
  Repeat until no frequent sequence or no candidate can be found
Major strength: candidate pruning by Apriori
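The level-wise loop outlined above can be sketched in Python. This is a simplified illustration, not the GSP implementation from the paper: candidates are produced by extending each frequent length-k sequence with one frequent item (instead of GSP's join of two length-k sequences), and then pruned with the Apriori check. The helper names (`parse`, `contains`, `drop_one_item`, `extend`, `mine`) and the single-character-item notation are assumptions of this sketch. The data is the five-sequence example database used in this section.

```python
def parse(text):
    """Parse '<(bd)cb(ac)>' into a tuple of sorted item tuples (single-character items assumed)."""
    elements, group = [], None
    for ch in text.strip("<>"):
        if ch == "(":
            group = []
        elif ch == ")":
            elements.append(tuple(sorted(group)))
            group = None
        elif group is not None:
            group.append(ch)
        else:
            elements.append((ch,))
    return tuple(elements)


def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence`."""
    remaining = iter(sequence)
    return all(any(set(p) <= set(s) for s in remaining) for p in pattern)


def drop_one_item(pattern):
    """Yield every sub-pattern obtained by deleting exactly one item."""
    for i, elem in enumerate(pattern):
        for j in range(len(elem)):
            rest = elem[:j] + elem[j + 1:]
            yield pattern[:i] + ((rest,) if rest else ()) + pattern[i + 1:]


def extend(frequent, items):
    """Length-(k+1) candidates from length-k frequent patterns, with Apriori pruning."""
    candidates = set()
    for p in frequent:
        for x in items:
            candidates.add(p + ((x,),))                       # x starts a new element
            if x > p[-1][-1]:
                candidates.add(p[:-1] + (p[-1] + (x,),))      # x joins the last element
    # Apriori: keep a candidate only if all its length-k sub-patterns are frequent.
    return {c for c in candidates if all(s in frequent for s in drop_one_item(c))}


def mine(db, min_sup):
    """Return {pattern: support} for all sequences with support >= min_sup."""
    def frequent_only(candidates):
        counted = {c: sum(contains(s, c) for s in db) for c in candidates}  # one DB scan
        return {c: n for c, n in counted.items() if n >= min_sup}

    items = sorted({x for s in db for elem in s for x in elem})
    level = frequent_only({((x,),) for x in items})
    frequent_items = sorted(p[0][0] for p in level)
    patterns = {}
    while level:                       # stop when no frequent sequence is found
        patterns.update(level)
        level = frequent_only(extend(level, frequent_items))
    return patterns


db = [parse(t) for t in ("<(bd)cb(ac)>", "<(bf)(ce)b(fg)>", "<(ah)(bf)abf>",
                         "<(be)(ce)d>", "<a(bd)bcb(ade)>")]
patterns = mine(db, min_sup=2)
print(len(patterns), "frequent sequences")
print("support of <(bd)cba>:", patterns[parse("<(bd)cba>")])
```

Extending by one item keeps the sketch short while still finding every frequent sequence, because removing the last item of a frequent sequence always leaves a frequent sequence. The join used by GSP generally produces fewer candidates before pruning.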
       <a>   <b>     <c>     <d>     <e>     <f>
<a>          <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>                  <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                          <(cd)>  <(ce)>  <(cf)>
<d>                                  <(de)>  <(df)>
<e>                                          <(ef)>
<f>
Without Apriori property, 8*8+8*7/2=92 candidates
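The arithmetic behind this count can be checked with a short Python sketch. It assumes, as in the running example, 8 distinct items of which 6 are frequent; the function name `length2_candidates` is illustrative. With n items there are n*n candidates of the form <xy> and n*(n-1)/2 of the form <(xy)>.

```python
from math import comb


def length2_candidates(n_items):
    """Number of length-2 candidates built from n_items items."""
    two_elements = n_items * n_items     # <xy>: ordered, x may equal y
    one_element = comb(n_items, 2)       # <(xy)>: unordered, x != y
    return two_elements + one_element


without_apriori = length2_candidates(8)  # all 8 items
with_apriori = length2_candidates(6)     # only the 6 frequent items
pruned = (without_apriori - with_apriori) / without_apriori

print(without_apriori, with_apriori, f"{pruned:.2%}")
```

Under these assumptions the sketch reproduces the 92 candidates stated above, as well as the 51 candidates and the 44.57% reduction reported in this section.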
Data Mining:
Concepts and Techniques
— Chapter 8 —
8.3 Mining sequence patterns in transactional databases
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
The Apriori Property of Sequential Patterns
A basic property: Apriori (Agrawal & Srikant’94)
If a sequence S is not frequent, then none of the super-sequences of S is frequent
E.g., <hb> is infrequent
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
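The Apriori property can be checked on the example database above. The Python sketch below counts the support of <hb> and of two of its super-sequences, <hab> and <(ah)b>; the helper names (`seq`, `contains`, `support`) are assumptions of this sketch, and each element is written as a string of single-character items.

```python
def seq(*elements):
    """Build a sequence from strings, one per element: seq("ah", "b") is <(ah)b>."""
    return [frozenset(e) for e in elements]


def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence` (elements matched in order)."""
    remaining = iter(sequence)
    # Each pattern element consumes the iterator up to the first element containing it.
    return all(any(p <= s for s in remaining) for p in pattern)


def support(db, pattern):
    """Number of sequences in db that contain the pattern."""
    return sum(contains(s, pattern) for s in db)


db = [
    seq("bd", "c", "b", "ac"),              # 10
    seq("bf", "ce", "b", "fg"),             # 20
    seq("ah", "bf", "a", "b", "f"),         # 30
    seq("be", "ce", "d"),                   # 40
    seq("a", "bd", "b", "c", "b", "ade"),   # 50
]
min_sup = 2

hb = seq("h", "b")
print("<hb> support:", support(db, hb))
for name, super_seq in [("<hab>", seq("h", "a", "b")), ("<(ah)b>", seq("ah", "b"))]:
    sup = support(db, super_seq)
    # A super-sequence is never supported by more sequences than its subsequence.
    print(name, "support:", sup, "frequent:", sup >= min_sup)
```

Since every sequence that contains a super-sequence also contains <hb>, the support of the super-sequences cannot exceed that of <hb>, so they need not be counted once <hb> is known to be infrequent.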
Cand <a> <b> <c> <d> <e> <f> <g> <h>
<b>   <ab>  <bb>  <cb>  <db>  <eb>  <fb>
<c>   <ac>  <bc>  <cc>  <dc>  <ec>  <fc>
<d>   <ad>  <bd>  <cd>  <dd>  <ed>  <fd>
<e>   <ae>  <be>  <ce>  <de>  <ee>  <fe>
<f>   <af>  <bf>  <cf>  <df>  <ef>  <ff>
Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases
A mining algorithm should
  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  be highly efficient and scalable, involving only a small number of database scans
  be able to incorporate various kinds of user-specific constraints
Since <hb> is infrequent, so are <hab> and <(ah)b>
Given support threshold min_sup = 2
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences
Apriori prunes 44.57% of the candidates
The GSP Mining Process
Legend: candidates that cannot pass the support threshold
5th scan: 1 candidate, 1 length-5 sequential pattern
Finding Length-1 Sequential Patterns
Examine GSP using an example
Initial candidates: all singleton sequences
  <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan the database once and count support for the candidates
min_sup = 2

Cand  <a>  <b>  <c>  <d>  <e>  <f>  <g>  <h>
Sup    3    5    4    3    3    2    1    1
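The first scan can be reproduced with a few lines of Python. This is a minimal sketch over the five-sequence example database of this section, assuming single-character items; the names `db`, `distinct_items`, `support`, and `frequent` are illustrative.

```python
from collections import Counter

# Example sequence database (sequence ID -> sequence in slide notation).
db = {
    10: "<(bd)cb(ac)>",
    20: "<(bf)(ce)b(fg)>",
    30: "<(ah)(bf)abf>",
    40: "<(be)(ce)d>",
    50: "<a(bd)bcb(ade)>",
}
min_sup = 2


def distinct_items(text):
    """Items occurring in a sequence; each item counts once per sequence."""
    return {ch for ch in text if ch.isalpha()}


# Support of <x> = number of sequences that contain item x at least once.
support = Counter(item for text in db.values() for item in distinct_items(text))
frequent = sorted(item for item, count in support.items() if count >= min_sup)

for item in sorted(support):
    print(f"<{item}>: {support[item]}")
print("length-1 sequential patterns:", frequent)
```

Counting each item once per sequence matters: <b> occurs three times in sequence 50 but contributes only 1 to its support.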
GSP: Generating Length-2 Candidates
51 length-2 Candidates
      <a>   <b>   <c>   <d>   <e>   <f>
<a>   <aa>  <ba>  <ca>  <da>  <ea>  <fa>
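The 51 length-2 candidates can be enumerated directly from the six frequent items. The Python sketch below is an illustration only; it builds the candidates as strings in the slide notation, and the variable names are assumptions of this sketch.

```python
from itertools import combinations, product

frequent_items = ["a", "b", "c", "d", "e", "f"]

# Two elements, one item each: order matters and an item may repeat, e.g. <ab>, <ba>, <aa>.
two_elements = [f"<{x}{y}>" for x, y in product(frequent_items, repeat=2)]

# One element with two items: unordered and distinct, listed alphabetically, e.g. <(ab)>.
one_element = [f"<({x}{y})>" for x, y in combinations(frequent_items, 2)]

candidates = two_elements + one_element
print(len(two_elements), len(one_element), len(candidates))
print(candidates[:6], candidates[-3:])
```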
A sequence: <(ef)(ab)(df)cb>
A sequence database
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
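One way to hold such a sequence in code is as an ordered tuple of elements, where each element is a sorted tuple of items. The Python sketch below parses the angle-bracket notation used on these slides; the function name `parse_sequence` and the restriction to single-character items are assumptions of this sketch.

```python
def parse_sequence(text):
    """Parse notation such as '<(ef)(ab)(df)cb>' into a tuple of elements.

    Assumes every item is a single character. Items inside an element are
    sorted, because they are unordered and conventionally listed alphabetically.
    """
    elements, group = [], None
    for ch in text.strip().strip("<>"):
        if ch == "(":
            group = []                              # start of a multi-item element
        elif ch == ")":
            elements.append(tuple(sorted(group)))   # close the element
            group = None
        elif group is not None:
            group.append(ch)                        # item inside parentheses
        else:
            elements.append((ch,))                  # single-item element
    return tuple(elements)


s = parse_sequence("<(ef)(ab)(df)cb>")
print(s)
print(len(s), "elements,", sum(len(e) for e in s), "items")
```

Sorting inside each element means that <(ba)c> and <(ab)c> map to the same value, so they compare equal and can be used as dictionary keys.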
Chapter 8. Mining Stream, Time-Series, and Sequence Data
Mining data streams
Mining time-series data
Mining sequence patterns in transactional databases
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Sequential Pattern Mining Algorithms
Concept introduction and an initial Apriori-like algorithm
  Agrawal & Srikant. Mining sequential patterns, ICDE’95
Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT’96)
Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD’00; Pei et al. @ ICDE’01)
Vertical format-based mining: SPADE (Zaki @ Machine Learning’00)
Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB’99; Pei, Han, Wang @ CIKM’02)
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM’03)
Mining sequence patterns in biological data
Sequence Databases & Sequential Patterns
Transaction databases, time-series databases vs. sequence databases
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining
  Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months
  Medical treatments, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
  Telephone calling patterns, Weblog click streams
  DNA sequences and gene structures
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
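Both statements can be verified with a small subsequence test. The Python sketch below uses the four-sequence database of this slide; the helper names (`seq`, `is_subsequence`, `support`) are assumptions of this sketch, and each element is written as a string of single-character items.

```python
def seq(*elements):
    """Build a sequence from strings, one per element: seq("ab", "c") is <(ab)c>."""
    return [frozenset(e) for e in elements]


def is_subsequence(pattern, sequence):
    """True if each pattern element is contained, in order, in a distinct element of sequence."""
    pos = 0
    for elem in pattern:
        # Greedily take the earliest remaining element of `sequence` that contains `elem`.
        while pos < len(sequence) and not elem <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a strictly later element
    return True


db = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}
min_sup = 2


def support(pattern):
    """Number of sequences in db that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in db.values())


print(is_subsequence(seq("a", "bc", "d", "c"), db[10]))
print(support(seq("ab", "c")), support(seq("ab", "c")) >= min_sup)
```

Only sequences 10 and 30 contain an element with both a and b followed later by c, which gives <(ab)c> a support of exactly 2.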
Legend: candidates not in the DB at all
Length-5 sequential pattern: <(bd)cba>
4th scan: 8 candidates, 6 length-4 sequential patterns
  <abba> <(bd)bc> …
3rd scan: 47 candidates, 19 length-3 sequential patterns; 20 candidates not in the DB at all
  <abb> <aab> <aba> <baa> <bab> …
2nd scan: 51 candidates, 19 length-2 sequential patterns; 10 candidates not in the DB at all
  <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
1st scan: 8 candidates, 6 length-1 sequential patterns
  <a> <b> <c> <d> <e> <f> <g> <h>