Data Mining: Association Rules
Monday, February 28, 2011 Data Mining: Concepts and Techniques
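Association Rules
Any implication of the form X ⇒ Y is called an association rule, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
An association rule is interesting if it satisfies both the minimum support threshold and the minimum confidence threshold; such a rule is called a strong rule.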
[Figure: Venn diagram of customers who buy beer, customers who buy diaper, and customers who buy both]
Find all the rules X ⇒ Y with minimum support and confidence:
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y
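Association Rule Mining
Simply put, association rule mining is the discovery of interesting associations among itemsets in large amounts of data.
It searches transaction data, relational data, or other information repositories for frequent patterns, associations, correlations, or causal structures among sets of items or objects.
Applications: market basket analysis, cross-selling, catalog design, clustering, classification, etc.
Two strategies:
1. Place the items close together to increase their sales.
2. Place the items far apart to increase sales of other items.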
Basic Form of Association Rules
Basic form of an association rule: antecedent ⇒ consequent [support, confidence]
buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
Chapter 4: Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Scalable frequent itemset mining methods
Mining various kinds of association rules
Constraint-based association mining
From association to correlation analysis
Mining colossal patterns
Summary
Example transaction database:

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Min. support 50%, min. confidence 50%

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A ⇒ C:
  support = support({A} ∪ {C}) = 50%
  confidence = support({A} ∪ {C}) / support({A}) = 50% / 75% ≈ 66.7%
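The frequent patterns and the confidence of A ⇒ C in this example can be reproduced with a few lines of Python. This is only a brute-force sketch for a four-transaction toy database, not a scalable mining method; the names `transactions`, `support`, and `frequent_itemsets` are illustrative assumptions, not part of the slides.

```python
from itertools import combinations

# Transaction database from the example (Transaction-id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}
MIN_SUPPORT = 0.5  # 50%


def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)


def frequent_itemsets(db, min_support):
    """Brute force: test every candidate itemset, smallest size first."""
    all_items = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, k):
            s = support(candidate, db)
            if s >= min_support:
                result[candidate] = s
    return result


frequent = frequent_itemsets(transactions, MIN_SUPPORT)
for itemset, s in frequent.items():
    print("{" + ", ".join(itemset) + "}", f"{s:.0%}")

# Confidence of A => C is support({A, C}) / support({A}).
conf_a_c = support({"A", "C"}, transactions) / support({"A"}, transactions)
print(f"confidence(A => C) = {conf_a_c:.1%}")
```

Enumerating every subset is exponential in the number of items, which is acceptable for six items but is the reason scalable methods are needed on real data.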
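Which items are frequently purchased by customers?
Within a single purchase, which items are often bought together?
Is there a typical time sequence in customers' purchases?
Specific applications: profit maximization
Shelf layout design: better match customers' shopping paths
Inventory planning: achieve zero-inventory management for the supermarket
Customer segmentation: provide personalized service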
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: finding inherent regularities in data
  What products were often purchased together? Beer and diapers?!
  What are the subsequent purchases after buying a PC?
  What kinds of DNA are sensitive to this new drug?
  Can we automatically classify web documents?
Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
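Formal Definition of Association Rule Mining
Given:
  Let I = {i1, i2, …, im} be the set of items. A set of items is called an itemset.
  Let D be the set of transactions T, where each transaction T is a set of items and T ⊆ I.
  Each transaction has a unique identifier, such as a transaction number, denoted TID.
  Let X be a set of items in I. If X ⊆ T, we say that transaction T contains X.
Find: the interesting association rules (strong rules).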
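The definition maps directly onto sets. The following is a minimal sketch in which the item universe `I`, the database `D`, and the helper `contains` are hypothetical names chosen for illustration; the items and TIDs are made up and do not come from the slides.

```python
# Hypothetical item universe I and transaction database D (TID -> transaction T).
I = frozenset({"milk", "bread", "butter", "beer"})
D = {
    1: frozenset({"milk", "bread"}),
    2: frozenset({"bread", "butter"}),
    3: frozenset({"milk", "bread", "butter"}),
}


def contains(T, X):
    """Transaction T contains itemset X exactly when X is a subset of T."""
    return X <= T


# Every transaction must itself be a set of items drawn from I (T ⊆ I).
all_valid = all(T <= I for T in D.values())

# TIDs of the transactions that contain the itemset X = {milk, bread}.
X = frozenset({"milk", "bread"})
tids_containing_X = [tid for tid, T in D.items() if contains(T, X)]

print(all_valid)          # True
print(tids_containing_X)  # [1, 3]
```

Using `frozenset` keeps itemsets hashable, so they can later serve as dictionary keys when counting support.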
Basic Concepts: Association Rules
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk
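Association Rule Mining
A classic case of association rule mining: the market basket problem.
A store stocks a large number of products (items), such as milk and bread, and customers put the products they buy into their shopping baskets.
By discovering relationships among the different products that customers put into their baskets, we can analyze customers' buying habits.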
confidence and support
Itemset X = {i1, …, ik}
Find all the rules X ⇒ Y with minimum confidence and support.
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3,
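The counts listed above, and the rules that pass both thresholds, can be checked against the five-transaction beer and diaper table (Tid 10 to 50). The sketch below is an illustration only: it restricts itself to itemsets of size 1 and 2 for brevity, uses brute-force counting, and the names `db`, `count`, and `strong_rules` are assumptions introduced here.

```python
from itertools import combinations

# The five transactions of the running example (Tid -> items bought).
db = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}
MINSUP, MINCONF = 0.5, 0.5


def count(itemset):
    """Support count: number of transactions that contain the itemset."""
    return sum(set(itemset) <= items for items in db.values())


items = sorted(set().union(*db.values()))
n = len(db)

# Frequent 1-itemsets and 2-itemsets, mapped to their support counts.
frequent = {
    c: count(c)
    for k in (1, 2)
    for c in combinations(items, k)
    if count(c) / n >= MINSUP
}

# For each frequent pair {x, y}, test both rules x => y and y => x.
strong_rules = []
for pair, pair_count in frequent.items():
    if len(pair) != 2:
        continue
    for x, y in (pair, pair[::-1]):
        confidence = pair_count / count((x,))
        if confidence >= MINCONF:
            strong_rules.append((x, y, pair_count / n, confidence))

print(frequent)
for x, y, s, c in strong_rules:
    print(f"{x} => {y} [support {s:.0%}, confidence {c:.0%}]")
```

With these thresholds the only frequent pair is {Beer, Diaper}, and both directions of the rule are strong: Beer => Diaper at 60% support and 100% confidence, Diaper => Beer at 60% support and 75% confidence.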
itemset: a set of one or more items
k-itemset: X = {x1, …, xk}
(absolute) support, or support count, of X: the frequency or number of occurrences of an itemset X
(relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a minsup threshold
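The difference between absolute and relative support can be illustrated on the transactions of the Tid 10 to 50 table. The function names below are hypothetical helpers written for this sketch, and converting a relative minsup into a minimum count with `math.ceil` is one common convention assumed here rather than something the slides state.

```python
import math

# Items bought per transaction, taken from the Tid 10..50 example table.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]


def support_count(itemset, transactions):
    """Absolute support: number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def relative_support(itemset, transactions):
    """Relative support: fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent if its relative support is no less than minsup."""
    return relative_support(itemset, transactions) >= minsup


minsup = 0.5
# Equivalent threshold on the absolute count: ceil(0.5 * 5) = 3 transactions.
min_count = math.ceil(minsup * len(transactions))

for itemset in ({"Diaper"}, {"Beer", "Diaper"}, {"Beer", "Nuts"}):
    print(
        sorted(itemset),
        support_count(itemset, transactions),
        relative_support(itemset, transactions),
        is_frequent(itemset, transactions, minsup),
    )
```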
confidence, c: the conditional probability that a transaction having X also contains Y.
confidence(X ⇒ Y) = (number of transactions that contain both X and Y) / (number of transactions that contain X)
support, s: the probability that a transaction contains X ∪ Y.
support(X ⇒ Y) = (number of transactions that contain both itemsets X and Y) / (total number of transactions)
Support describes the usefulness of a rule.
A set containing k items is called a k-itemset.
The occurrence frequency of an itemset is the number of transactions that contain the itemset; it is also called the itemset's frequency, support count, or count.
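Written in set notation, with D denoting the transaction database as in the formal definition, the two measures and the strong-rule condition take the following form. This is only a restatement of the definitions in this deck; the symbols minsup and minconf stand for the two user-chosen thresholds.

```latex
\[
\mathrm{support}(X \Rightarrow Y) = P(X \cup Y)
  = \frac{\lvert \{\, T \in D : X \cup Y \subseteq T \,\} \rvert}{\lvert D \rvert}
\]
\[
\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X)
  = \frac{\lvert \{\, T \in D : X \cup Y \subseteq T \,\} \rvert}
         {\lvert \{\, T \in D : X \subseteq T \,\} \rvert}
\]
\[
X \Rightarrow Y \text{ is strong} \iff
  \mathrm{support}(X \Rightarrow Y) \ge \mathit{minsup}
  \ \text{and}\
  \mathrm{confidence}(X \Rightarrow Y) \ge \mathit{minconf}
\]
```

Here P(X ∪ Y) is the probability that a transaction contains every item of both X and Y, not the probability that it contains X or Y.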
Basic Concepts: Frequent Patterns
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diaper, and customers who buy both]
Confidence describes the certainty of a rule, that is, its "degree of trustworthiness" or "reliability".
Mining Association Rules—an Example
Transaction-id 10 20 30 40
Why Is Freq. Pattern Mining Important?
Freq. pattern: an intrinsic and important property of datasets
Foundation for many essential data mining tasks
  Association, correlation, and causality analysis
  Sequential, structural (e.g., sub-graph) patterns
  Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  Classification: discriminative, frequent pattern analysis
  Cluster analysis: frequent pattern-based clustering
  Data warehousing: iceberg cube and cube-gradient
  Semantic data compression: fascicles
Broad applications