Data Mining Lecture 9, Association Analysis: FP-growth and Others


Number of frequent itemsets $= 3 \times \sum_{k=1}^{10} \binom{10}{k}$

Need a compact representation
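As a quick check of the count 3 × Σ_{k=1..10} C(10, k), the sum can be evaluated directly. This is a minimal sketch that assumes the setting behind the formula: three disjoint groups of ten items each, where every non-empty subset of a group is frequent. The constant and variable names are illustrative only.

```python
from math import comb

GROUPS = 3            # assumed: three disjoint item groups (A, B, C)
ITEMS_PER_GROUP = 10  # assumed: ten items per group

# Every non-empty subset of a group is frequent, so sum over sizes k = 1..10.
per_group = sum(comb(ITEMS_PER_GROUP, k) for k in range(1, ITEMS_PER_GROUP + 1))
total_frequent = GROUPS * per_group

print(per_group, total_frequent)
```

Under these assumptions each group contributes 2^10 - 1 itemsets, which is why listing every frequent itemset quickly becomes impractical.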
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent
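The definition can be turned into a direct check. The sketch below assumes the collection of frequent itemsets is already known; the function name `maximal_frequent` and the small input collection are hypothetical and chosen only for illustration.

```python
def maximal_frequent(frequent, items):
    """Return the frequent itemsets that have no frequent immediate superset."""
    frequent = {frozenset(s) for s in frequent}
    maximal = set()
    for itemset in frequent:
        # Immediate supersets: add exactly one item not already in the itemset.
        supersets = (itemset | {i} for i in items if i not in itemset)
        if not any(s in frequent for s in supersets):
            maximal.add(itemset)
    return maximal

# Hypothetical frequent itemsets over items A, B, C.
frequent_sets = ["A", "B", "C", "AB", "AC"]
result = maximal_frequent(frequent_sets, items="ABC")
print(sorted("".join(sorted(s)) for s in result))
```

Here only AB and AC survive: A, B and C each have a frequent immediate superset, while ABC is not frequent.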
[Figure panels: (b) Specific-to-general, (c) Bidirectional]
Alternative Methods for Frequent Itemset Generation

Traversal of Itemset Lattice - Equivalence Classes
FP-growth Algorithm

Build a compressed representation of the database: an FP-tree.
Once the FP-tree has been constructed, FP-growth uses a recursive divide-and-conquer approach to mine the frequent itemsets.
FP-growth: Stopping Condition
[Figure: single-path conditional FP-tree with nodes null, a:3, b:3]
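When a conditional FP-tree consists of a single path, the recursion can stop: every combination of the nodes on that path is a frequent itemset, with support equal to the smallest count among the chosen nodes. The sketch below illustrates this for the path a:3, b:3; the function name `single_path_itemsets` and the list-of-(item, count) encoding are assumptions made for this example.

```python
from itertools import combinations

def single_path_itemsets(path, suffix=()):
    """Enumerate all itemsets generated by a single-path conditional FP-tree.

    path   -- list of (item, count) pairs from the root downwards
    suffix -- items already fixed by the enclosing recursion (may be empty)
    """
    result = {}
    for size in range(1, len(path) + 1):
        for combo in combinations(path, size):
            items = tuple(item for item, _ in combo) + tuple(suffix)
            # Support of a combination is the minimum count along it.
            result[items] = min(count for _, count in combo)
    return result

patterns = single_path_itemsets([("a", 3), ("b", 3)])
print(patterns)
```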
FP-growth Versus Apriori

Typically, FP-growth is faster than Apriori:
- no candidate generation, no candidate test
- compressed database: FP-tree structure
- no repeated scan of the entire database
However, Apriori does have its own merits:
- a relatively large support threshold
- when the data set is too large to be loaded into main memory
- when the data set is too dense

Horizontal data layout:
TID   Items
1     A,B,E
2     B,C,D
3     C,E
4     A,C,D
5     A,B,C,D
6     A,E
7     A,B
8     A,B,C
9     A,C,D
10    B

Vertical data layout (TID-list):
A: 1 4 5 6 7 8 9
B: 1 2 5 7 8 10
C: 2 3 4 8 9
D: 2 4 5 9
E: 1 3 6

[Figure: itemset lattice over A, B, C, D drawn as (a) a prefix tree and (b) a suffix tree]

Alternative Methods for Frequent Itemset Generation

Traversal of Itemset Lattice - Breadth-first vs Depth-first

Maximal Itemsets
[Figure: itemset lattice over A, B, C, D, E]
FP-growth Versus Apriori, cont.
Data set T25I20D10K
[Figure: run time (sec.) of D1 FP-growth and D1 Apriori]

Other tools - WEKA, SAS EM, SPSS CLEMENTINE, …
Compact Representation of Frequent Itemsets

Some itemsets are redundant because they have identical support as their supersets
ECLAT

Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets.
A: 1 4 5 6 7 8 9
B: 1 2 5 7 8 10
A ∧ B → AB: 1 5 7 8

3 traversal approaches: top-down, bottom-up and hybrid
Advantage: very fast support counting
Disadvantage: intermediate tid-lists may become too large for memory
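The tid-list intersection can be sketched as a small depth-first miner. This is an illustrative sketch, not a reference ECLAT implementation: the function name `eclat`, the tuple encoding of itemsets, and the minimum support count of 2 are assumptions. The tid-lists are the ones for items A to E given in the vertical layout.

```python
def eclat(tidlists, min_count):
    """Depth-first mining by intersecting tid-lists of two (k-1)-itemsets."""
    found = {}

    def mine(klass):
        # klass: list of (itemset, tids); all itemsets share the same prefix.
        for i, (itemset_a, tids_a) in enumerate(klass):
            found[itemset_a] = tids_a
            next_class = []
            for itemset_b, tids_b in klass[i + 1:]:
                tids = tids_a & tids_b  # support counting by intersection
                if len(tids) >= min_count:
                    next_class.append((itemset_a + itemset_b[-1:], tids))
            mine(next_class)

    first = [((item,), set(tids))
             for item, tids in sorted(tidlists.items())
             if len(tids) >= min_count]
    mine(first)
    return found

tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}
found = eclat(tidlists, min_count=2)
print(sorted(found[("A", "B")]))
```

Each recursive call keeps the intermediate tid-lists of one equivalence class in memory, which is where the memory concern noted above comes from.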
Important References

Data Sets and Algorithm Implementations
- Publicly available resources can be found at FIMI, http://fimi.cs.helsinki.fi
- Also, http://www.borgelt.net/apriori.html
[Figure axis: support threshold (%), from 0 to 3]
Beyond Apriori and FP-growth
Alternative Methods for Frequent Itemset Generation

Traversal of Itemset Lattice - General-to-specific vs Specific-to-general
TID      A1 ... A10   B1 ... B10   C1 ... C10
1-5      1            0            0
6-10     0            1            0
11-15    0            0            1
(Each entry holds for all ten items in the group: transactions 1-5 contain only A1 to A10, transactions 6-10 only B1 to B10, and transactions 11-15 only C1 to C10.)

The Grid View of Itemsets
[Figure: itemset lattice over A, B, C, D, E, from null down to ABCDE, with the itemsets ended by E marked]
How to traverse it? A suffix tree strategy!
[Figure: position of the frequent itemset border in the lattice between null and {a1,a2,...,an}; panel (a) General-to-specific]
[Figure: transaction database and the resulting FP-tree rooted at null, with node labels including A:7, B:5, B:3, C:3, C:1, D:1 and E:1, plus a header table with a pointer for each of the items A, B, C, D, E]
Pointers are used to assist frequent itemset generation
FP-growth: Recursive Divide-and-Conquer
[Figure: FP-tree and the sub-trees used in the recursion, with node labels such as A:7, B:5, C:3 and D:1]
Conditional pattern base for D: P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}
Conditional FP-tree for D
Recursively apply FP-growth on P
Frequent itemsets found (with sup > 1): {AD, BD, CD, ACD, BCD, ABD}
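The recursion on the conditional pattern base for D can be traced in a few lines. This is a simplified sketch: it recurses on conditional pattern bases directly instead of building conditional FP-trees, so it shows the divide-and-conquer step but not the tree compression. The function name `mine` and the tuple encoding of prefix paths are assumptions; "sup > 1" is expressed as a minimum count of 2.

```python
from collections import Counter

def mine(pattern_base, suffix, min_count):
    """Return {itemset: count} for frequent itemsets that extend `suffix`.

    pattern_base -- list of (prefix_path, count); paths use a fixed item order
    """
    counts = Counter()
    for path, n in pattern_base:
        for item in path:
            counts[item] += n

    found = {}
    for item, total in counts.items():
        if total < min_count:
            continue
        itemset = frozenset(suffix | {item})
        found[itemset] = total
        # Conditional pattern base for `item`: the part of each path before it.
        conditional = [(path[:path.index(item)], n)
                       for path, n in pattern_base if item in path]
        found.update(mine(conditional, itemset, min_count))
    return found

# Conditional pattern base for D, as quoted above.
base_for_D = [
    (("A", "B", "C"), 1),
    (("A", "B"), 1),
    (("A", "C"), 1),
    (("A",), 1),
    (("B", "C"), 1),
]
result = mine(base_for_D, frozenset({"D"}), min_count=2)
print(sorted("".join(sorted(s)) for s in result))
```

Because each conditional base keeps only the items that precede the chosen item on a path, every itemset is generated exactly once.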
Intelligent Data Engineering, 2010
Lecture 9:
Association Analysis: FP-growth and Others
The FP-Growth Algorithm
Bottleneck of Apriori

Mining long patterns needs many scanning passes over the database and generates lots of candidates. Bottleneck: candidate-generation-and-test. Can we avoid candidate generation? Might a new data structure help?

FP-tree Construction
Transactions are SORTED before insertion. Which order is the best?
After reading TID=1: null → A:1 → B:1
After reading TID=2: null → A:1 → B:1, and null → B:1 → C:1 → D:1
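The insertion procedure can be sketched as follows. The transaction database used here is hypothetical: the original table did not survive, so ten transactions were chosen to be consistent with the two insertion steps above (TID 1 = {A,B}, TID 2 = {B,C,D}) and with the node counts A:7, B:5, B:3 shown in the slides. The names `Node`, `build_fp_tree` and `prefix_paths` are likewise assumptions, and items are inserted in alphabetical order rather than by frequency.

```python
class Node:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, order):
    """Insert each transaction, sorted by `order`, into a prefix tree."""
    root = Node(None)
    header = {item: [] for item in order}  # item -> nodes carrying that item
    rank = {item: i for i, item in enumerate(order)}
    for transaction in transactions:
        node = root
        for item in sorted(transaction, key=lambda x: rank[x]):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1  # shared prefixes only increment counts
            node = child
    return root, header

def prefix_paths(header, item):
    """Follow the header pointers for `item` and collect its prefix paths."""
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        base.append((tuple(reversed(path)), node.count))
    return base

# Hypothetical database, chosen to match the counts quoted in the slides.
transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "D"}, {"B", "C", "E"},
]
root, header = build_fp_tree(transactions, order="ABCDE")
base_D = prefix_paths(header, "D")
print(root.children["A"].count, root.children["B"].count, base_D)
```

With this database the prefix paths collected for D are (B,C), (A,C), (A), (A,B,C) and (A,B), each with count 1.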
FP-tree Construction
[Figure panels: (a) Breadth first, (b) Depth first]
Alternative Methods for Frequent Itemset Generation

Representation of Database - horizontal vs vertical data layout
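Converting between the two layouts is mechanical. The sketch below assumes a horizontal database stored as a dict from TID to item set and produces the vertical layout as a dict from item to tid-list; the function name `to_vertical` and the four-transaction database are illustrative only.

```python
from collections import defaultdict

def to_vertical(horizontal):
    """Turn {tid: items} (horizontal) into {item: set of tids} (vertical)."""
    vertical = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)

# Small illustrative database in horizontal layout.
horizontal = {
    1: {"A", "B", "E"},
    2: {"B", "C", "D"},
    3: {"C", "E"},
    4: {"A", "C", "D"},
}
vertical = to_vertical(horizontal)
for item in sorted(vertical):
    print(item, sorted(vertical[item]))
```

In the vertical layout the support of a single item is simply the length of its tid-list, with no further scan of the transactions.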