6. Decision Trees
Applying the Model to Classify (1)

Test Data: start from the root of the tree.
(Figure: the unlabeled test set (Tid 11-15) is passed through the learned Decision Tree model: Apply Model / Deduction over the Test Set.)
Examples of Classification Tasks

• Predict whether a tumor cell is benign or malignant.
• Classify a credit card transaction as legitimate or fraudulent.
• Classify a protein's secondary structure as α-helix, β-sheet, or random coil.
• Categorize news stories as finance, weather, entertainment, sports, etc.
(Figure: the Training Data table and the resulting Model: Decision Tree.)
Another Decision Tree

(Figure: an alternative tree that fits the same training data, rooted at MarSt instead of Refund.)
Data Warehousing and Data Mining Techniques
Chapter 6: Decision Tree Methods
Computers help the police keep the peace
Communications of the ACM, 2012, 55(3)
Outline

Classification
Decision Trees
Classification

Given a collection of records (the training set):
Each record contains several attributes; one (or a few) of them is the class attribute, and the remaining ones are called condition attributes.
The "Best" Split (1)

Before splitting: 10 records of class 0 and 10 records of class 1.
(Figure: candidate splits of these 20 records: Own Computer? (Yes: C0=6, C1=4; No: C0=4, C1=6), Car Type? (Family: C0=1, C1=3; Sports: C0=8, C1=0; Luxury: C0=1, C1=7), and Student ID?, whose branches each hold a single record, e.g. C0=1, C1=0.)
Applying the Model to Classify (3)

Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
(Figure: the record is traced down the tree: Refund at the root (Yes leads to NO, No leads to MarSt), MarSt (Single, Divorced lead to TaxInc; Married leads to NO), TaxInc (< 80K leads to NO, > 80K leads to YES).)
• Binary decision: (A < v) or (A ≥ v)
• Consider all possible split points v and choose the best one; this can be computationally expensive (a small sketch follows).
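A minimal sketch of this exhaustive search, assuming a toy (income, label) sample; the choice of weighted Gini impurity as the score, the function names, and the data are illustrative, not from the slides:

# Sketch: exhaustive search for the best binary split point (A < v) on a
# continuous attribute, scored by weighted Gini impurity.

def gini(labels):
    # Gini impurity of a set of class labels: 1 - sum_j p_j^2
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split_point(values, labels):
    # Try the midpoint between every pair of adjacent distinct values.
    pairs = sorted(zip(values, labels))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(pairs)
    best_v, best_score = None, float("inf")
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue
        v = (xs[i] + xs[i - 1]) / 2.0
        left, right = ys[:i], ys[i:]          # A < v  vs.  A >= v
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

# Toy version of the Taxable Income column with the Cheat labels.
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split_point(income, cheat))   # -> (97.5, 0.3) for this toy data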
Splitting on a Continuous Attribute (2)

(Figure: (i) a binary split, "Taxable Income > 80K?" with Yes/No branches; (ii) a multi-way split, "Taxable Income?" with branches < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.)
Decision Trees

Introduction
Building a decision tree
Evaluating the classification results
Decision Tree Classification

(Figure: a labeled training set, Tid 1-10 with Attrib1, Attrib2, Attrib3 and Class, is used to induce a decision tree.)
Applying the Model to Classify (2)

Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
(Figure: the same decision tree, with the test record one step further down from the root.)
Binary (2-way) split: the best grouping of the attribute values has to be found.
(Figure: CarType split as {Family} vs. {Sports, Luxury}, or as {Family, Luxury} vs. {Sports}; a sketch that enumerates such groupings follows.)
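A small sketch of the search space behind this optimization, assuming we only want to list the candidate groupings (the helper name is ours): for a nominal attribute with k values there are 2^(k-1) - 1 distinct binary partitions.

# Sketch: enumerate all distinct binary groupings of a nominal attribute's values.
from itertools import combinations

def binary_partitions(values):
    values = list(values)
    first, rest = values[0], values[1:]
    # Fixing the first value on the left side avoids listing mirror-image splits twice;
    # range stops before len(rest) + 1 so the right side is never empty.
    for r in range(0, len(rest)):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(values) - left
            yield left, right

for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(left, "vs", right)
# Prints the three groupings (set element order may vary):
#   {'Family'} vs {'Sports', 'Luxury'}
#   {'Family', 'Sports'} vs {'Luxury'}
#   {'Family', 'Luxury'} vs {'Sports'}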
Splitting on an Ordinal Attribute

Multi-way split:
(Figure: Size with one branch per value: Small, Medium, Large.)
Binary (2-way) split:
(Figure: Size split as {Small, Medium} vs. {Large}.)
The entropy of a node t: Entropy(t) = - Σ_j p(j | t) log p(j | t)

• p(j | t) is the relative frequency of class j at node t.
• It measures how pure (homogeneous) the records at the node are.
• Entropy is largest, log n_c, when the records are spread evenly over all n_c classes (the least informative case), and smallest, 0, when all records belong to a single class (the most informative case).
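A short sketch of this definition, assuming the base-2 logarithm (the slides do not fix the base) and a helper name of our choosing:

# Sketch: entropy of a node from its class counts, Entropy(t) = -sum_j p(j|t) log2 p(j|t).
from math import log2

def entropy(class_counts):
    n = sum(class_counts)
    ps = [c / n for c in class_counts if c > 0]   # 0 * log 0 is taken as 0
    return -sum(p * log2(p) for p in ps)

print(entropy([5, 5]))    # evenly split over 2 classes -> maximum log2(2) = 1.0
print(entropy([9, 1]))    # mostly one class            -> about 0.469
print(entropy([10, 0]))   # pure node                   -> minimum 0.0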
Which Split Is Better?

(Figure: Car Type? versus Student ID?: every Student ID branch (c1, c2, ..., c20) contains a single record, so each of its leaves is pure, with C0 = 1, C1 = 0 or C0 = 0, C1 = 1.)
The "Best" Split (2)

Greedy approach: prefer the split whose child nodes are the most homogeneous (records of a single class). This requires a measure of node impurity.
(Figure: a node with C0 = 5, C1 = 5: non-homogeneous, high degree of impurity.)
(Figure: the parent node has impurity M0 before splitting. Attribute A produces children N1 and N2 with impurities M1 and M2, combined into M12; attribute B produces children N3 and N4 with impurities M3 and M4, combined into M34.)
Gain = M0 - M12 versus M0 - M34: choose the split with the larger gain (a small sketch follows).
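A minimal sketch of this comparison, using entropy as the impurity measure M; the parent counts match the 10/10 example above, while the second split's counts are purely illustrative:

# Sketch: gain = M0 - weighted impurity of the children, with entropy as M.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain(parent_counts, children_counts):
    n = sum(parent_counts)
    m0 = entropy(parent_counts)                        # impurity before splitting
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children_counts)
    return m0 - weighted                               # e.g. M0 - M12 or M0 - M34

parent  = [10, 10]                                     # C0: 10, C1: 10
split_A = [[6, 4], [4, 6]]                             # children N1, N2 (Own Computer?)
split_B = [[8, 0], [2, 10]]                            # children N3, N4 (illustrative)
print(gain(parent, split_A), gain(parent, split_B))    # pick the split with the larger gain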
Splitting Based on Information Theory
Applying the Model to Classify (6)

Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
(Figure: at the MarSt node the record takes the Married branch and reaches the leaf NO.)
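The traversal illustrated in slides (1) through (6) can be written out directly as a tiny function. This is a sketch of this specific example tree only (the dictionary keys are ours), not a general induction algorithm:

# Sketch: the example tree from the slides, applied to the test record.
def classify(record):
    # Root node: Refund
    if record["Refund"] == "Yes":
        return "No"                       # leaf: NO
    # MarSt node
    if record["Marital Status"] == "Married":
        return "No"                       # leaf: NO
    # TaxInc node (Single or Divorced); the slide writes "< 80K: NO, > 80K: YES"
    return "No" if record["Taxable Income"] < 80 else "Yes"

test = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
print(classify(test))                     # -> "No": assign Cheat to "No"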
(Figure: the decision tree classification task. A Training Set (Tid 1-10, with known Class labels) is passed to a Tree Induction algorithm (Induction) to Learn a Model; the learned Decision Tree is then applied (Apply Model / Deduction) to an unlabeled Test Set.)

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
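One way to sketch this induction/deduction pipeline in code, assuming scikit-learn is available; the numeric encoding of Attrib1 and Attrib2 below is our own choice, since the library expects numeric features:

# Sketch: learn a decision tree from the labelled training set (induction),
# then apply it to the unlabelled test set (deduction).
from sklearn.tree import DecisionTreeClassifier

# Attrib1 (Yes=1/No=0), Attrib2 (Small=0/Medium=1/Large=2), Attrib3 (income in K)
X_train = [[1, 2, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
           [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y_train = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

X_test = [[0, 0, 55], [1, 1, 80], [1, 2, 110], [0, 0, 95], [0, 2, 67]]  # Tid 11-15

model = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)  # Learn Model
print(model.predict(X_test))                          # Apply Model: predicted Class for Tid 11-15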
Applying the Model to Classify (4)

Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
(Figure: the same decision tree, with the record traced another step toward a leaf.)
Example: Computing Entropy
(Figure: the test record ends at the leaf NO; assign Cheat to "No".)
(Figure: a node with C0 = 9, C1 = 1: homogeneous, low degree of impurity.)
Measures of Node Impurity

• Entropy
• Gini Index
• Misclassification error
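A small sketch comparing the three measures listed above on the same class counts, for the two-class case; the helper names and the example counts are illustrative:

# Sketch: the three impurity measures for a node with class counts (c0, c1).
from math import log2

def entropy(c0, c1):
    n = c0 + c1
    return -sum(c / n * log2(c / n) for c in (c0, c1) if c > 0)

def gini(c0, c1):
    n = c0 + c1
    return 1.0 - (c0 / n) ** 2 - (c1 / n) ** 2

def misclassification_error(c0, c1):
    n = c0 + c1
    return 1.0 - max(c0, c1) / n

for c0, c1 in [(5, 5), (9, 1), (10, 0)]:
    print((c0, c1), entropy(c0, c1), gini(c0, c1), misclassification_error(c0, c1))
# (5, 5): all three measures are at their maximum; (10, 0): all three are 0.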
How to Find the Best Split

Before splitting: the parent node has class counts C0 = N00 and C1 = N01.
Find a classification model whose conclusion is the class attribute and whose conditions are the values of the other attributes. Goal: predict the class attribute of previously unseen records as accurately as possible.
A test set is used to estimate the accuracy of the model.
An Illustration of Classification
(Figure: the training set, Tid 1-10 with Attrib1, Attrib2, Attrib3 and Class, feeding the learning algorithm, as in the framework figure above.)
Classification Methods

Decision tree classification
Rule-based classification
Neural networks
Bayesian methods and belief networks
Support vector machines
……
Outline

Classification
Decision Trees
Decision Trees

Introduction
Building a decision tree
Evaluating the classification results
Example of a Decision Tree

Training data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Figure: a decision tree fit to this data: Refund at the root (Yes leads to NO, No leads to MarSt), then MarSt (Married leads to NO; Single, Divorced lead to TaxInc), then TaxInc (< 80K leads to NO, > 80K leads to YES).)
Choosing the Test Attribute

It depends on the attribute type:
• Nominal
• Ordinal
• Continuous
It depends on the type of split:
• Binary (2-way) split
• Multi-way split
Splitting on a Nominal Attribute

Multi-way split: use as many branches as the attribute has distinct values.
(Figure: CarType with branches Family, Sports, Luxury.)
Decision Tree Classification

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Applying the Model to Classify (5)

Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
(Figure: the same decision tree, with the record traced further down the tree.)
(Figure: the Training Set above is passed to a Tree Induction algorithm (Induction) to Learn a Model.)
(Figure: further binary groupings of Size: {Medium, Large} vs. {Small}, and {Small, Large} vs. {Medium}.)
Splitting on a Continuous Attribute (1)

There are different ways to handle it:
• Discretization, to form an ordinal categorical attribute
• Static: discretize once, at the beginning
• Dynamic: equal interval bucketing, equal frequency bucketing, or clustering (a small sketch follows)
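A sketch of the two static bucketing strategies named above; the function names are ours, and the income values are the Taxable Income column from the example table:

# Sketch: discretizing a continuous attribute into k ordinal buckets.
def equal_interval_buckets(values, k):
    # Equal-width bins between min and max.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_buckets(values, k):
    # Each bin receives (roughly) the same number of records.
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = [0] * len(values)
    for rank, i in enumerate(order):
        buckets[i] = min(rank * k // len(values), k - 1)
    return buckets

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
print(equal_interval_buckets(income, 3))    # width-based bins over [60K, 220K]
print(equal_frequency_buckets(income, 3))   # roughly 3-4 records per bin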
There could be more than one tree that fits the same data!
(Figure: the general classification framework: a Training Set is passed to a Learning algorithm (Induction) to Learn a Model.)
(Figure: the example decision tree again, annotated with its Splitting Attributes, shown beside the training data table above.)