Association Rule Analysis

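Supermarket example
Example 3.1 (Groceries.txt) This is a supermarket shopping example (Hahsler et al., 2006). The data contain 9835 transactions involving 169 items. Each transaction is one customer's purchase record, and each item is a binary variable, for example coded 1 if bought and 0 if not. A preliminary count shows that, among single items, whole milk has the highest frequency, 2513 (a relative frequency of about 26%), followed by other vegetables (1903), rolls/buns (1809), soda (1715), yogurt (1372), and so on. The items bought by more than 5% of customers are shown with their relative frequencies in Figure 3.1. One can also find how many customers bought each number of items; the counts for customers buying 1 to 9 items are shown in the table below.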
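The relative frequencies quoted in Example 3.1 can be checked with a few lines of base R. This is only an arithmetic sketch using the counts given in the text; the variable names `n_trans`, `counts`, `rel_freq` and `above_5pct` are illustrative and not part of the original code.

```r
# Item counts quoted in Example 3.1 (Groceries data, 9835 transactions)
n_trans <- 9835
counts <- c("whole milk" = 2513, "other vegetables" = 1903,
            "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)

# Relative item frequency = item count / number of transactions
rel_freq <- counts / n_trans
print(round(rel_freq, 3))

# Items that pass the 5% threshold used for Figure 3.1
above_5pct <- names(rel_freq)[rel_freq > 0.05]
print(above_5pct)
```

All five items listed in the example clear the 5% threshold, with whole milk at roughly 26%.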
a = as.matrix(a)
trans2 <- as(a, "transactions")  # coerce the 0/1 matrix to a transactions object
summary(trans2)                  # data overview
[Figure: item frequency bar charts; vertical axis: item frequency (relative), horizontal axis: item names]
# Discretize capital gain and capital loss: zero, below the median of positive values, above it
AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]],
    c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]), Inf)),
    labels = c("None", "Low", "High"))
AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]],
    c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]), Inf)),
    labels = c("none", "low", "high"))
# Convert to transactions and summarize
Adult <- as(AdultUCI, "transactions")
Adult
summary(Adult)
itemFrequencyPlot(Adult, support = 0.5, cex.names = 0.8)
# Mine rules, keep those with income=large on the right-hand side and lift > 1.2
rules = apriori(Adult, parameter = list(support = 0.01, confidence = 0.6))
x = subset(rules, subset = rhs %in% "income=large" & lift > 1.2)
inspect(sort(x, by = "confidence")[1:5])
inspect(sort(x, by = "lift")[1:5])
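Terminology
• Each observation is called a transaction.
• Each binary variable is called an item.
• The data form a transaction data set; a set of items is called an itemset.
• Let X denote an item or itemset, and Y another item or itemset that has no items in common with X. The notation "X=>Y" then denotes a rule stating that X and Y occur together.
• In X=>Y, X is called the antecedent (also the condition, the left-hand side or LHS of the rule), and Y is called the consequent (also the result, the right-hand side or RHS of the rule).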
library(arules)
w = read.table("f:/xzwu/adbook/shopping.txt", header = TRUE, sep = "\t")
a = w[1:10]
dim(a)
# [1] 786  10
names(a)
# [1] "Ready.made"   "Frozen.foods" "Alcohol"    "Fresh.Vegetables" "Milk"
# [6] "Bakery.goods" "Fresh.meat"   "Toiletries" "Snacks"           "Tinned.goods"
# %pin%: partial matching of item labels
x = subset(rules, subset = lhs %pin% "whole milk" & lift > 1.2)
inspect(sort(x, by = "support")[1:5])
inspect(sort(x, by = "confidence")[1:5])
# inspect(sort(x, by = "lift")[1:5])
x = subset(rules, subset = rhs %pin% "whole milk" & lift > 1.2)
inspect(sort(x, by = "support")[1:5])
inspect(sort(x, by = "confidence")[1:5])
# inspect(sort(x, by = "lift")[1:5])
• Compute the rules:
  – rules = apriori(trans2, parameter = list(support = 0.01, confidence = 0.6))
• Inspect the rules:
  – inspect(rules[1:3])
• Filter the rules:
  – x = subset(rules, subset = rhs %in% "Milk" & lift > 1.2)
• Sort the rules:
  – inspect(sort(x, by = "confidence")[1:3])
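The four steps above can be traced by hand on a small table. The following is a minimal base R sketch that does not call arules; the data and the names `toy`, `rule_measures` and `rules_tab` are hypothetical and only for illustration. It assumes the usual definitions: support is the share of transactions containing both sides of the rule, confidence is the number of such transactions divided by the number containing the left-hand side, and lift is the confidence divided by the share of transactions containing the right-hand side.

```r
# Hypothetical toy data: 8 transactions x 3 items (1 = bought, 0 = not bought)
toy <- matrix(c(1, 1, 0,
                1, 1, 1,
                1, 1, 0,
                1, 0, 0,
                0, 1, 1,
                0, 0, 1,
                0, 0, 1,
                0, 0, 0),
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL, c("Milk", "Bread", "Snacks")))
n <- nrow(toy)

# Support, confidence and lift of a single-item rule lhs => rhs
rule_measures <- function(lhs, rhs) {
  both <- sum(toy[, lhs] == 1 & toy[, rhs] == 1)  # transactions with both items
  confidence <- both / sum(toy[, lhs] == 1)
  data.frame(lhs = lhs, rhs = rhs,
             support = both / n,
             confidence = confidence,
             lift = confidence / mean(toy[, rhs] == 1))
}

# "Compute the rules": enumerate all ordered pairs of distinct items
items <- colnames(toy)
pairs <- expand.grid(lhs = items, rhs = items, stringsAsFactors = FALSE)
pairs <- pairs[pairs$lhs != pairs$rhs, ]
rules_tab <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(i) {
  rule_measures(pairs$lhs[i], pairs$rhs[i])
}))

# "Filter" (right-hand side is Milk, lift > 1.2) and "sort" by confidence
x <- rules_tab[rules_tab$rhs == "Milk" & rules_tab$lift > 1.2, ]
x <- x[order(-x$confidence), ]
print(x)
```

In this toy table only Bread => Milk survives the filter: Snacks => Milk has a lift below 1, meaning that snack buyers purchase milk less often than customers in general.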
# Plot the data
itemFrequencyPlot(trans2, support = 0.1, cex.names = 0.8)
fsets <- eclat(trans2, parameter = list(support = 0.05, maxlen = 10))        # find frequent itemsets
rules = apriori(trans2, parameter = list(support = 0.01, confidence = 0.6))  # mine rules
data("AdultUCI")  # requires library(arules)
attributes(AdultUCI)$class
attributes(AdultUCI)$names
dim(AdultUCI)
AdultUCI[1:2, ]
# Handling continuous variables
# Remove
AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL
# Discretize into ordered levels
AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)),
    labels = c("Young", "Middle-aged", "Senior", "Old"))
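To see what this kind of discretization does, here is a minimal base R sketch on a handful of made-up ages (the vector `age` and the name `age_level` are hypothetical, not taken from AdultUCI). It passes the labels directly to `cut()` with `ordered_result = TRUE`, which is an alternative to wrapping `cut()` in `ordered()` and should give the same levels here.

```r
# Hypothetical ages, not taken from AdultUCI
age <- c(19, 30, 50, 70, 25)

# Intervals are right-closed by default: (15,25], (25,45], (45,65], (65,100]
age_level <- cut(age, breaks = c(15, 25, 45, 65, 100),
                 labels = c("Young", "Middle-aged", "Senior", "Old"),
                 ordered_result = TRUE)

print(age_level)
print(table(age_level))
```

Because the intervals are closed on the right, an age of exactly 25 falls into "Young". Each resulting level then becomes one item (for example age=Young) when the data frame is converted to transactions.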
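Measures
Let s(Z) denote the number of transactions that contain itemset Z in a data set of N transactions. Let A be the event that a transaction contains X, and B the event that a transaction contains Y (X and Y have no items in common). Three measures are then defined for the rule X=>Y:
• the support of X=>Y
• the confidence of X=>Y
• the lift of X=>Y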
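In the notation above, the three measures are commonly written as follows. These are the standard textbook definitions, given here on the assumption that they match the formulas intended in this section; s(X ∪ Y) counts the transactions that contain every item of X and of Y, and probabilities are read as proportions of the N transactions.

```latex
\begin{align*}
\mathrm{support}(X \Rightarrow Y)    &= \frac{s(X \cup Y)}{N} = P(A \cap B), \\
\mathrm{confidence}(X \Rightarrow Y) &= \frac{s(X \cup Y)}{s(X)} = P(B \mid A), \\
\mathrm{lift}(X \Rightarrow Y)       &= \frac{\mathrm{confidence}(X \Rightarrow Y)}{s(Y)/N}
                                      = \frac{P(A \cap B)}{P(A)\,P(B)}.
\end{align*}
```

A lift above 1 indicates that X and Y occur together more often than they would if they were independent, which is why the code in this section filters on lift > 1.2.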
library(arules)
data(Groceries)
summary(Groceries)
itemFrequencyPlot(Groceries, support = 0.05, cex.names = 0.8)  # Figure 3.1
fsets <- eclat(Groceries, parameter = list(support = 0.05, maxlen = 10))  # find frequent itemsets
inspect(fsets[1:10])
inspect(sort(fsets, by = "support")[1:10])
rules = apriori(Groceries, parameter = list(support = 0.01, confidence = 0.01))  # mine rules
x = subset(rules, subset = rhs %in% "whole milk" & lift > 1.2)
inspect(sort(x, by = "support")[1:5])     # table in Chapter 3
inspect(sort(x, by = "confidence")[1:5])  # table in Chapter 3
# inspect(sort(x, by = "lift")[1:5])
# %in%: the itemset contains any of the given items
x = subset(rules, subset = lhs %in% "whole milk" & lift > 1.2)
inspect(sort(x, by = "support")[1:5])
inspect(sort(x, by = "confidence")[1:5])
# inspect(sort(x, by = "lift")[1:5])
# %ain%: the itemset contains all of the given items
x = subset(rules, subset = lhs %ain% "whole milk" & lift > 1.2)
inspect(sort(x, by = "support")[1:5])
inspect(sort(x, by = "confidence")[1:5])
# inspect(sort(x, by = "lift")[1:5])
x = subset(rules, subset = rhs %ain% "whole milk" & lift > 1.2)
inspect(sort(x, by = "support")[1:5])
inspect(sort(x, by = "confidence")[1:5])
# inspect(sort(x, by = "lift")[1:5])
Continuous variables
AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]], c(0, 25, 40, 60, 168)),
    labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
library(arules)
data(Groceries)
summary(Groceries)
itemFrequencyPlot(Groceries, support = 0.05, cex.names = 0.8)  # Figure 3.1
[Figure 3.1: Names and relative frequencies of the items bought by more than 5% of customers; frequency axis from 0.00 to 0.25]
Continuous variables (convert to categorical variables first)
• data("AdultUCI")  # requires library(arules)
• attributes(AdultUCI)$class; attributes(AdultUCI)$names; dim(AdultUCI); AdultUCI[1:2, ]
• Handling continuous variables:
  – Remove
    • AdultUCI[["fnlwgt"]] <- NULL
    • AdultUCI[["education-num"]] <- NULL