Introduction to Data Mining, Chapter 4
Tuesday, September 4, 2018    Introduction to Data Mining
Hunt's Algorithm: Example
Initial tree: a single leaf node labeled Don't Cheat.
Training records: Tid, Refund, Marital Status, Taxable Income, Cheat (10 records).
Decision Tree: Applying the Model
Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree.

Refund
  Yes: NO
  No: MarSt
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
    Married: NO
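As a rough illustration of applying the model, the sketch below walks the test record from the root of the tree down to a leaf. The nested-tuple encoding, the `branch` and `classify` helpers, and the choice to send an income of exactly 80K to the upper branch are assumptions made for this example; the slides only label the branches "< 80K" and "> 80K".

```python
# Decision tree from the slides: (attribute, {branch: subtree}); leaves are class labels.
TREE = ("Refund", {
    "Yes": "No",
    "No": ("MarSt", {
        "Married": "No",
        "Single, Divorced": ("TaxInc", {"< 80K": "No", ">= 80K": "Yes"}),
    }),
})


def branch(attribute, record):
    """Map a record's attribute value to the label of the outgoing edge."""
    value = record[attribute]
    if attribute == "MarSt":
        return "Married" if value == "Married" else "Single, Divorced"
    if attribute == "TaxInc":
        # Assumption: exactly 80K follows the upper branch.
        return "< 80K" if value < 80 else ">= 80K"
    return value


def classify(tree, record):
    """Follow attribute tests from the root until a leaf (class label) is reached."""
    path = []
    node = tree
    while not isinstance(node, str):
        attribute, children = node
        edge = branch(attribute, record)
        path.append((attribute, edge))
        node = children[edge]
    return node, path


test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
label, path = classify(TREE, test_record)
print(label, path)
```

For the test record on the slide, the walk stops at the Married leaf, so the taxable income test is never evaluated.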
4.3 Decision Tree Induction
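Decision Tree: Example
Training Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes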
General Structure of Hunt's Algorithm
Let Dt be the set of training records that reach a node t.
General procedure:
  If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
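Introduction to Data Mining
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Pearson Education Ltd. Translated by Fan Ming et al., Posts & Telecom Press (人民邮电出版社)
Chapter 4 Classification: Basic Concepts, Decision Trees, and Model Evaluation
  Introduction: preliminaries, general approach to solving a classification problem
  Decision tree induction
  Model overfitting
  Evaluating the performance of a classifier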
Classification: Techniques
  Decision tree based methods
  Rule-based methods
  Memory-based reasoning
  Neural networks
  Naïve Bayes and Bayesian belief networks
  Support vector machines
Decision Tree Induction
Many algorithms:
  Hunt's Algorithm (one of the earliest)
  CART
  ID3, C4.5
  SLIQ, SPRINT
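Decision Tree
A tree contains three kinds of nodes:
  Root node: no incoming edges and zero or more outgoing edges
  Internal node: exactly one incoming edge and two or more outgoing edges
  Leaf node (terminal node): exactly one incoming edge and no outgoing edges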
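The three kinds of nodes can be made concrete with a small data structure. The following is only a sketch: the `Node` class and `count_kinds` helper are hypothetical names chosen for this example, and the tree is the Refund / MarSt / TaxInc tree used throughout these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str  # attribute test (root/internal node) or class label (leaf)
    children: dict = field(default_factory=dict)  # outgoing edge label -> child Node


def count_kinds(root):
    """Count root, internal, and leaf nodes by walking the outgoing edges."""
    counts = {"root": 0, "internal": 0, "leaf": 0}
    stack = [(root, True)]
    while stack:
        node, is_root = stack.pop()
        if is_root:
            counts["root"] += 1        # no incoming edge
        elif node.children:
            counts["internal"] += 1    # one incoming edge, some outgoing edges
        else:
            counts["leaf"] += 1        # one incoming edge, no outgoing edges
        stack.extend((child, False) for child in node.children.values())
    return counts


tax_inc = Node("TaxInc", {"< 80K": Node("NO"), "> 80K": Node("YES")})
mar_st = Node("MarSt", {"Single, Divorced": tax_inc, "Married": Node("NO")})
root = Node("Refund", {"Yes": Node("NO"), "No": mar_st})

print(count_kinds(root))  # {'root': 1, 'internal': 2, 'leaf': 4}
```

Here Refund is the root, MarSt and TaxInc are internal nodes, and the four NO/YES nodes are leaves.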
Decision Tree: Applying the Model
Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
The test record follows the branch Refund = No to the MarSt node, then the branch Married, and reaches a leaf labeled NO.
Assign Cheat to "No".

Hunt's Algorithm: General Procedure (continued)
  If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
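The general procedure of Hunt's algorithm (a pure Dt gives a leaf, an empty Dt gives a leaf with the default class, a mixed Dt is split and the procedure recurses) can be sketched as follows. This is a minimal sketch under stated assumptions, not the textbook's implementation: the slides do not say how the attribute test is chosen, so the tests in `TESTS` are simply tried in a fixed, hand-picked order, the default class passed to each child is the majority class of its parent, and an income of exactly 80K is sent to the upper branch. The names `TRAIN`, `TESTS`, `majority`, and `hunt` are hypothetical.

```python
from collections import Counter

# Training records from the slides: (Refund, Marital Status, Taxable Income in K, Cheat)
TRAIN = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

# Assumed attribute tests, tried in this fixed order: (name, test function, outcomes).
TESTS = [
    ("Refund", lambda r: r[0], ["Yes", "No"]),
    ("MarSt", lambda r: "Married" if r[1] == "Married" else "Single, Divorced",
     ["Single, Divorced", "Married"]),
    ("TaxInc", lambda r: "< 80K" if r[2] < 80 else ">= 80K", ["< 80K", ">= 80K"]),
]


def majority(records, fallback):
    """Most common class label in records, or fallback if there are none."""
    if not records:
        return fallback
    return Counter(r[-1] for r in records).most_common(1)[0][0]


def hunt(records, tests, default):
    """Return a class label (leaf) or (test name, {outcome: subtree})."""
    if not records:                  # Dt is empty: leaf with the default class
        return default
    classes = {r[-1] for r in records}
    if len(classes) == 1:            # all records in one class: leaf with that class
        return classes.pop()
    if not tests:                    # assumption: no test left, fall back to majority
        return majority(records, default)
    name, test, outcomes = tests[0]
    child_default = majority(records, default)
    children = {
        outcome: hunt([r for r in records if test(r) == outcome], tests[1:], child_default)
        for outcome in outcomes
    }
    return (name, children)


tree = hunt(TRAIN, TESTS, default="No")
print(tree)
```

With this test order, the sketch splits on Refund, then marital status, then taxable income, which matches the shape of the tree shown on the slides.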
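Decision Tree Classification Task: Learning the Model
[Figure: the training set (Tid 1 to 10) is passed to a tree induction algorithm, which learns the model, a decision tree (induction).]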
Examples of Classification Tasks
  Tumors: predicting tumor cells as benign or malignant
  Credit card transactions: classifying credit card transactions as legitimate or fraudulent
  Protein structure: classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
  News: categorizing news stories as finance, weather, entertainment, sports, etc.
Classification: Illustration
Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: the tree induction algorithm learns a model from the training set.

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the model is applied to the test set to predict the unknown class labels.
Model: Decision Tree
Splitting attributes: Refund, MarSt, TaxInc

Refund
  Yes: NO
  No: MarSt
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
    Married: NO
Introduction
Classification: Definition
Given a collection of records (the training set):
  Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
  A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
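The split into training and test sets, and the accuracy measured on the test set, can be sketched as below. This is an illustrative sketch only: `holdout_split`, `learn_majority`, and `accuracy` are hypothetical helper names, the 30% test fraction and the random seed are arbitrary choices, and the "model" is a placeholder that always predicts the most common training class rather than a decision tree. The records are the ten Refund / Marital Status / Taxable Income examples from these slides.

```python
import random
from collections import Counter

# (attributes, class) pairs; attributes are (Refund, Marital Status, Taxable Income in K)
RECORDS = [
    (("Yes", "Single", 125), "No"), (("No", "Married", 100), "No"),
    (("No", "Single", 70), "No"), (("Yes", "Married", 120), "No"),
    (("No", "Divorced", 95), "Yes"), (("No", "Married", 60), "No"),
    (("Yes", "Divorced", 220), "No"), (("No", "Single", 85), "Yes"),
    (("No", "Married", 75), "No"), (("No", "Single", 90), "Yes"),
]


def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training set, test set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def learn_majority(training_set):
    """Placeholder learner: always predict the most common training class."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: majority


def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(model(attributes) == label for attributes, label in test_set)
    return correct / len(test_set)


training_set, test_set = holdout_split(RECORDS)
model = learn_majority(training_set)
print(len(training_set), len(test_set), accuracy(model, test_set))
```

Any learner that returns a function from attributes to a class label could replace `learn_majority` here; the evaluation step stays the same.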
Decision Tree Classification Task: Applying the Model
[Figure: the learned decision tree is applied to the test set (Tid 11 to 15) to predict the unknown class labels (deduction).]