USTC Data Mining, Slide Set 7
Computational Complexity
Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules: R = 3^d − 2^(d+1) + 1
(Figure: each of the N transactions, of maximum width w, is matched against the list of M candidate itemsets.)
– Match each transaction against every candidate
– Complexity ~ O(NMw) => Expensive since M = 2^d !!!
Frequent Itemset Generation Strategies
Example: {Milk, Bread, Diaper}
– k-itemset: An itemset that contains k items

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Basic Concepts
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
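As a rough illustration (mine, not from the slides), a minimal Python sketch of this brute-force strategy; it assumes the transactions are given as a list of Python sets and enumerates the entire lattice:

```python
# Hypothetical sketch of the brute-force strategy: every itemset in the lattice
# is a candidate, and its support is counted with a full scan of the database.
from itertools import combinations

def brute_force_frequent_itemsets(transactions, minsup):
    """transactions: list of sets of items; minsup: minimum support fraction."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):            # enumerate the lattice
            c = frozenset(candidate)
            count = sum(1 for t in transactions if c <= t)  # one DB scan per candidate
            if count / n >= minsup:
                frequent[c] = count
    return frequent
```

With d distinct items this examines 2^d − 1 candidates, which is exactly why the reduction strategies discussed later are needed.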
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive
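A minimal sketch (my own, not the textbook's algorithm) of step 2 for a single frequent itemset; `support_count` is assumed to be a dictionary mapping frozensets to support counts produced by step 1:

```python
# Hypothetical sketch of rule generation: each rule is a binary partition
# X -> Y of the frequent itemset, kept only if its confidence >= minconf.
from itertools import combinations

def rules_from_itemset(itemset, support_count, minconf):
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):                    # size of the antecedent X
        for antecedent in combinations(items, r):
            X = frozenset(antecedent)
            Y = items - X
            confidence = support_count[items] / support_count[X]
            if confidence >= minconf:
                rules.append((X, Y, confidence))
    return rules
```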
Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
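For concreteness, a small Python sketch (not part of the slides) that reproduces these two numbers on the five example transactions:

```python
# Hypothetical sketch: support count and support of {Milk, Bread, Diaper}
# over the five Market-Basket transactions shown above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
itemset = {"Milk", "Bread", "Diaper"}

sigma = sum(1 for t in transactions if itemset <= t)  # support count sigma(X)
support = sigma / len(transactions)                   # support s(X)
print(sigma, support)                                 # 2 0.4  (i.e. 2/5)
```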
Frequent Itemset Generation
(Figure: the itemset lattice over five items A, B, C, D, E, from the null set at the top through all 1-, 2-, 3-, and 4-itemsets down to ABCDE at the bottom.)
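A quick sketch (not from the slides) showing how large this lattice is for the five items in the figure; excluding the null set it has 2^5 − 1 = 31 nodes:

```python
# Hypothetical sketch: enumerate every non-empty itemset in the lattice
# over the five items A..E and count them.
from itertools import chain, combinations

items = ["A", "B", "C", "D", "E"]
lattice = list(chain.from_iterable(combinations(items, k)
                                   for k in range(1, len(items) + 1)))
print(len(lattice))   # 31, i.e. 2**5 - 1
```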
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
The total number of possible rules that can be extracted from d items is

$R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1$
If d=6, R = 602 rules
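A quick numerical check (mine, not from the slides) that the double sum and the closed form agree, and that d = 6 indeed gives 602 rules:

```python
# Hypothetical check of the rule-count formula R for small d.
from math import comb

def num_rules(d):
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(num_rules(d), 3**d - 2**(d + 1) + 1)   # 602 602
```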
Mining Association Rules
Reducing Number of Candidates
Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
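The principle holds because support is anti-monotone: X ⊆ Y implies s(X) ≥ s(Y). A minimal sketch (not the full Apriori algorithm) of the candidate pruning this enables; `frequent_k` is assumed to be a set of frozensets holding the frequent k-itemsets:

```python
# Hypothetical sketch of Apriori pruning: a (k+1)-candidate survives only if
# every one of its k-subsets is already known to be frequent.
from itertools import combinations

def prune_candidates(candidates, frequent_k):
    pruned = []
    for c in candidates:                              # c: frozenset of k+1 items
        if all(frozenset(s) in frequent_k for s in combinations(c, len(c) - 1)):
            pruned.append(c)
    return pruned
```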
Implication means co-occurrence, not causality!
Definition: Frequent Itemset
Itemset
– A collection of one or more items
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
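Putting the two steps together, a minimal driver sketch; it reuses the hypothetical helpers `brute_force_frequent_itemsets` and `rules_from_itemset` sketched earlier in this document (assumed to be defined in the same module) and is not the textbook's algorithm:

```python
# Hypothetical two-step driver: step 1 finds frequent itemsets (support >= minsup),
# step 2 derives high-confidence rules (confidence >= minconf) from each of them.
def mine_rules(transactions, minsup, minconf):
    frequent = brute_force_frequent_itemsets(transactions, minsup)   # step 1
    rules = []
    for itemset in frequent:
        if len(itemset) >= 2:                 # a rule needs non-empty X and Y
            rules.extend(rules_from_itemset(itemset, frequent, minconf))
    return rules
```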
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
=> Computationally prohibitive!
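A literal (and deliberately naive) sketch of this brute-force approach, not from the slides; it enumerates every candidate rule and only then prunes:

```python
# Hypothetical sketch of the prohibitive brute-force approach: list every rule
# X -> Y, compute its support and confidence, then prune by minsup and minconf.
from itertools import combinations

def brute_force_rules(transactions, minsup, minconf):
    items = sorted(set().union(*transactions))
    n = len(transactions)

    def sigma(X):
        return sum(1 for t in transactions if X <= t)

    rules = []
    for k in range(2, len(items) + 1):                 # itemsets that can be split
        for itemset in combinations(items, k):
            full = frozenset(itemset)
            s = sigma(full) / n
            for r in range(1, k):                      # every binary partition of full
                for X in map(frozenset, combinations(full, r)):
                    Y = full - X
                    c = sigma(full) / sigma(X) if sigma(X) else 0.0
                    if s >= minsup and c >= minconf:
                        rules.append((X, Y, s, c))
    return rules
```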
Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms
Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
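The deck points to a hash tree for this; below is a much simpler stand-in (my own sketch, not the hash-tree structure itself) that buckets candidates by their lexicographically smallest item, so each transaction is only compared against candidates it could possibly contain:

```python
# Hypothetical simplified stand-in for a hash tree: candidates are indexed by
# their smallest item, so a transaction only visits buckets keyed by its own items.
from collections import defaultdict

def count_with_index(transactions, candidates):
    index = defaultdict(list)
    for c in candidates:                       # c: frozenset of items
        index[min(c)].append(c)
    counts = defaultdict(int)
    for t in transactions:
        for item in t:                         # only buckets reachable from t
            for c in index.get(item, ()):
                if c <= t:
                    counts[c] += 1
    return counts
```

Each candidate lives in exactly one bucket, so it is still counted at most once per transaction, but most transaction–candidate comparisons are skipped.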
Rule Evaluation Metrics

– Support (s): Fraction of transactions that contain both X and Y
– Confidence (c): Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

$s = \frac{\sigma(\text{Milk, Diaper, Beer})}{|T|} = \frac{2}{5} = 0.4$

$c = \frac{\sigma(\text{Milk, Diaper, Beer})}{\sigma(\text{Milk, Diaper})} = \frac{2}{3} = 0.67$
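The same two numbers, reproduced in a short sketch (not from the slides); `transactions` is the hypothetical list of sets defined in the earlier support example:

```python
# Hypothetical sketch: support and confidence of the rule {Milk, Diaper} -> {Beer},
# reusing the `transactions` list of sets from the earlier support example.
def sigma(X, transactions):
    return sum(1 for t in transactions if X <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y, transactions) / len(transactions)        # 2 / 5 = 0.4
c = sigma(X | Y, transactions) / sigma(X, transactions)   # 2 / 3 ≈ 0.67
print(s, round(c, 2))
```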
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
(Figure: each of the N transactions in the Market-Basket table above is matched against the list of M candidate itemsets.)
Association Analysis: Basic Concepts and Algorithms
Dr. Hui Xiong, Rutgers University
Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule