Decision Tree Data Mining

6. financial condition=bad learning attitude=negative 20 ==> achievement=bad 19 conf:(0.95)
7. financial condition=bad coach=no 20 ==> achievement=bad 19 conf:(0.95)
8. IQ=inferior coach=no 20 ==> achievement=bad 19 conf:(0.95)
9. IQ=inferior 39 ==> achievement=bad 37 conf:(0.95)
10. IQ=inferior coach=yes 19 ==> achievement=bad 18 conf:(0.95)
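Partial rows (IQ, coach, achievement):
IQ        coach  achievement
superior  yes    bad
common    yes    good
common    yes    good

financial condition  discipline way  learning attitude  IQ      coach  achievement
good                 loose           positive           common  yes    good
good                 strict          negative           common  yes    good

There are 120 records in total; they are not all listed here.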
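Start Weka
Select Explorer
Select Open file
Visualize the related attributes
On the next page, load the testing data and run the test.
Save the data in ARFF format so it can be reloaded easily later
Testing Output
Test result: J48 -C 0.25 -M 2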
Correctly Classified Instances      23    57.5 %
Incorrectly Classified Instances    17    42.5 %
Kappa statistic                     0.1099
Mean absolute error                 0.431
Root mean squared error             0.5589
Relative absolute error             94.1176 %
Root relative squared error         104.3913 %
Total Number of Instances           40
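The accuracy figures in the summary follow directly from the two instance counts. The short Python sketch below only redoes that arithmetic; the variable names are illustrative, and the Kappa and error statistics are not recomputed because the slide does not give the confusion matrix they depend on.

```python
# Counts reported in the Weka summary for the external test set.
correct = 23
incorrect = 17

total = correct + incorrect      # number of test instances
accuracy = correct / total       # fraction classified correctly
error_rate = incorrect / total   # fraction misclassified

print(f"total={total} accuracy={accuracy:.1%} error={error_rate:.1%}")
```

An accuracy of 57.5 % on a two-class problem is only modestly better than guessing, which is consistent with the low Kappa statistic reported above.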
Best rules found
1. discipline way=loose learning attitude=negative 18 ==> achievement=bad 18 conf:(1)
2. learning attitude=negative IQ=inferior 18 ==> achievement=bad 18 conf:(1)
3. financial condition=good IQ=inferior 12 ==> achievement=bad 12 conf:(1)
4. financial condition=bad IQ=inferior 12 ==> achievement=bad 12 conf:(1)
5. discipline way=loose IQ=inferior 12 ==> achievement=bad 12 conf:(1)
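Choose a classifier: open the trees folder and select the J48 algorithm.
The parameters were adjusted and re-tested repeatedly to find the values that give the highest accuracy.
Training Tree (before pruning)
Training Tree (after pruning)
Output
Original 0: J48 -C 0.25 -M 2
Tuning 1: J48 -C 0.2 -M 2
Tuning 2: J48 -C 0.1 -M 2
Tuning 3: J48 -C 0.52 -M 5 (* gives the best model)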
Independent variables
Dependent variable
Algorithms
Common decision tree algorithms:
ID3, C4, C4.5, C5, CART (binary splits), CHAID (categorical attributes only), QUEST
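Decision Tree Training
The samples (examples) used to build the decision tree model are called the training data; the data used to check whether the model classifies correctly are called the test set. Both are already labeled. Within the training data, a further 1/3 is set aside for inside testing, mainly to avoid overfitting.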
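The classification attributes are summarized in the table below:
Table 1 - Attribute classification for fifth-grade learning achievement, Wenshan Elementary School, Taichung City

Meaning                Attribute            Type
Family environment     financial condition  categorical
Discipline style       discipline way       categorical
Learning attitude      learning attitude    categorical
IQ level               IQ                   categorical
After-school coaching  coach                categorical
Learning achievement   achievement          categorical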
financial condition  discipline way  learning attitude  IQ        coach  achievement
good                 loose           positive           superior  yes    good
good                 strict          negative           superior  yes    good
good                 democracy       negative           superior  yes    good
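Partial rows (financial condition, discipline way, learning attitude):
financial condition  discipline way  learning attitude
good                 loose           negative
good                 strict          positive
good                 democracy       positive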
financial condition  Type: categorical  Distinct: 3  (good, general, bad)
discipline way       Type: categorical  Distinct: 3  (strict, democracy, loose)
learning attitude    Distinct: 2  (positive, negative)
IQ                   Distinct: 3  (superior, common, inferior)
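Conclusion
The test on external data was less accurate than expected. The likely cause is that the data set is too small: with so few records, the variability of each record significantly affects the overall result, so the random sampling used to split the training and test data may have produced an uneven split.
The charts of the attribute table show that learning achievement is classified as "Bad" noticeably often, which may indicate a problem in the questionnaire design.
Overall, the classification results suggest that the most critical factor in student learning achievement is IQ, followed by family financial condition, after-school coaching, and learning attitude. Discipline way does not appear to have a significant effect.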
coach                Distinct: 2  (yes, no)
achievement          Distinct: 2  (good, bad)
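Data Description
1. The fifth grade at Wenshan Elementary School has 312 students. Student background attributes were collected by questionnaire, and 120 valid questionnaires were returned. The records were tabulated against each student's actual learning achievement, built in Microsoft Excel, and exported to a CSV file (Appendix Table 1).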
2. The original plan was to split the data into training, validation, and test sets.
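Decision Tree Analysis of Student Learning Achievement
A Case Study of Pupils at Wenshan Elementary School, Taichung City
Presenter: 赖庆霖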
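Outline
1. Introduction to basic concepts
– Data mining techniques
– Decision tree algorithms
– Data source
– Attributes
– Data grouping and splitting
2. Worked example
3. Result analysis
4. Association rule inference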
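Data Mining Techniques
Data mining
Supervised learning models
Unsupervised learning models
Attribute selection and data transformation
Tree structures
Association patterns
Rule induction
Clustering
Decision trees (categorical attributes)
Regression trees (continuous attributes)
CN2
ITRULE CART C4.5/C5 ID3 CART
M5
Cubist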
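Decision Trees
Decision tree:
1. A decision tree is a simple supervised learning procedure that learns from input examples to build a tree.
2. It assigns a new example to one definite class.
3. It classifies current behavior rather than predicting future types.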
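To make "learning from examples" more concrete, the sketch below shows one common way a tree learner in the ID3 family can pick the attribute to split on: it computes the information gain of each attribute and prefers the largest. This is a minimal illustration under stated assumptions, not the computation Weka performed for this report. The function names and the four toy records are hypothetical (they are not the survey data), and J48, being based on C4.5, uses the related gain ratio rather than raw information gain.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the target minus its weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

# Hypothetical toy records (NOT the survey data), reusing the slide's attribute names.
records = [
    {"IQ": "superior", "coach": "yes", "achievement": "good"},
    {"IQ": "superior", "coach": "no",  "achievement": "good"},
    {"IQ": "inferior", "coach": "yes", "achievement": "bad"},
    {"IQ": "inferior", "coach": "no",  "achievement": "bad"},
]

gains = {attr: information_gain(records, attr, "achievement") for attr in ("IQ", "coach")}
best = max(gains, key=gains.get)
print(gains, "-> split on", best)
```

In this toy set, IQ separates the two classes perfectly while coach tells nothing about the class, so the learner would split on IQ first.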
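Data Source
Motivation: to mine and analyze the learning achievement of schoolchildren in more depth, weighting and scoring several attributes in an attempt to build a model that can serve as a reference for decision making.
Data acquisition:
– Based on past studies and current professional background.
– Chosen for convenience of data access.
– The learning achievement of fifth graders at Wenshan Elementary School, Taichung City, is used as the classification data.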
Summary Table of Classification Attributes
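Originally planned split:
– 1/3 as the Test Set
– 2/3 * 2/3 as the Training Set
– 2/3 * 1/3 as the Validation Set
3. WEKA does not provide the mechanism above internally, so the plan was revised as follows:
– 2/3 as the Training Set, expected to hold 80 records (file training.csv)
– 1/3 as the Test Set, expected to hold 40 records (file testing.csv)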
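The revised 2/3 : 1/3 split can be sketched in a few lines of Python. This is only an illustration of a random holdout split, not the procedure actually used to produce training.csv and testing.csv: the helper name holdout_split, the fixed seed, and the integer stand-ins for the 120 records are all assumptions.

```python
import random

def holdout_split(records, train_fraction=2 / 3, seed=42):
    """Shuffle a copy of the records and cut it into (training, testing) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(120))   # stand-ins for the 120 questionnaire records
training, testing = holdout_split(records)
print(len(training), len(testing))
```

With only 120 records, a single random cut like this can easily leave the two parts with different class proportions, which is one reason a small test set may give unstable accuracy figures.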
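Another Testing Method
1. Train the model using Cross Validation or Percentage Split.
2. Save the trained model.
3. Open the external test data and load the previously saved model.
4. Run it to produce the new results.
(The steps are explained below.)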
Associator model
Method: Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Minimum support: 0.1 (12 instances) (all 120 records were fed into the association analysis)
Minimum metric <confidence>: 0.9
Number of cycles performed: 18
Generated sets of large itemsets:
Size of set of large itemsets L(1): 15
Size of set of large itemsets L(2): 88
Size of set of large itemsets L(3): 43
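The two thresholds in this output can be checked by hand. The Python sketch below shows how a minimum support of 0.1 on 120 records corresponds to 12 instances, and how the confidence of a rule is the count on its right-hand side divided by the count on its left-hand side. The counts are copied from rules listed in the Apriori output; the function and variable names are illustrative assumptions, and the sketch does not re-run Apriori itself.

```python
n_instances = 120        # all records were passed to Apriori
min_support = 0.1        # -M 0.1
min_confidence = 0.9     # -C 0.9

# Number of records an itemset must cover to count as "large": 0.1 x 120.
min_count = round(min_support * n_instances)

def confidence(antecedent_count, rule_count):
    """conf(X ==> Y) = count(X and Y) / count(X)."""
    return rule_count / antecedent_count

# (count of left-hand side, count of left-hand and right-hand side together)
rules = {
    "discipline way=loose, learning attitude=negative ==> achievement=bad": (18, 18),
    "IQ=inferior ==> achievement=bad": (39, 37),
    "IQ=inferior, coach=yes ==> achievement=bad": (19, 18),
}

print("minimum support count:", min_count)
for rule, (x_count, xy_count) in rules.items():
    c = confidence(x_count, xy_count)
    print(f"{rule}: conf={c:.2f}, kept={c >= min_confidence}")
```

Several of the reported rules cover exactly 12 records on the left-hand side, so they sit right at the minimum support threshold.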
Training Data

financial condition  discipline way  learning attitude  IQ        coach  achievement
good                 strict          positive           superior  yes    good
good                 democracy       positive           superior  yes    good
