Decision Tree Thesis: A Sensitivity-Based, Noise-Resistant Fuzzy SLIQ Decision Tree

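【Chinese abstract】Decision trees are among the most widely used techniques in data mining, and are favored for their strengths in knowledge acquisition and knowledge representation. As massive volumes of data are generated, the uncertain knowledge contained in that data grows with them, and such knowledge is receiving increasing attention. In the mid-1960s, Zadeh proposed fuzzy set theory, which made it possible to express fuzzy knowledge more precisely. Many researchers have since introduced fuzzy set theory into decision trees to overcome the sharp-boundary problem of traditional decision trees. The ID3 algorithm was fuzzified relatively early, and more recently the SLIQ algorithm has also been brought into the fuzzy setting.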
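
To make the sharp-boundary problem concrete, the following minimal sketch contrasts a crisp threshold test with a fuzzy one. The sigmoid shape, the `steepness` parameter, and all function names are illustrative assumptions; the abstract does not give the form of the functions used in the thesis.

```python
import math

def crisp_membership(x, threshold):
    """Crisp split: a value belongs to the right branch fully or not at all."""
    return 1.0 if x > threshold else 0.0

def sigmoid_membership(x, threshold, steepness):
    """Fuzzy split: degree of membership in the right branch, between 0 and 1.

    The sigmoid form and `steepness` are assumptions made for illustration,
    not the functions defined in the thesis.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (x - threshold)))

threshold = 30.0
for x in (29.9, 30.1):
    # Two nearly identical values: the crisp test sends them to opposite
    # branches, a gentle sigmoid treats them almost alike, and a very steep
    # sigmoid behaves almost like the crisp test again.
    print(x,
          crisp_membership(x, threshold),
          round(sigmoid_membership(x, threshold, 0.5), 3),
          round(sigmoid_membership(x, threshold, 200.0), 3))
```

With a gentle slope, values just either side of the threshold receive similar membership degrees; with a very steep slope the fuzzy test is practically indistinguishable from the crisp one.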

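This thesis examines G-FDT, the fuzzy SLIQ algorithm proposed by Chandra et al., and analyzes why the fuzzy decision trees it induces degenerate into traditional crisp decision trees. In view of the convexity defect that traditional split-evaluation functions exhibit in a fuzzy setting, it proposes a new fuzzy SLIQ algorithm: a noise-resistant fuzzy decision tree induction algorithm based on attribute sensitivity. Compared with G-FDT, the main improvements are:

(1) The discrimination functions that G-FDT constructs for candidate attributes are too narrow in shape. The method proposed here for determining the shape of the discrimination function fundamentally prevents it from approximating a crisp discrimination function.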

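(2) The concept of candidate-attribute sensitivity is introduced. In view of the convexity defect of traditional heuristic evaluation functions for node splitting in a fuzzy setting, the thesis proposes classification sensitivity, a measure that indicates the classification ability of a candidate attribute. A candidate attribute with high classification sensitivity is given a relatively narrow discrimination function, which makes it more likely to be selected.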

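(3) An outlier detection mechanism for the training data is introduced. G-FDT and the SG-FDT algorithm proposed earlier have very poor resistance to noise: the structure of the trees they build is sensitive to the particular training samples, which weakens the decision tree's ability to represent knowledge. The improved algorithm therefore removes noise from the current data whenever a node is tested for splitting, so the resulting decision tree is more stable and robust.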

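(4) Optimizations that improve computational efficiency are proposed. To make the improved induction algorithm more practical, the thesis proposes several optimizations that reduce the heavy overhead of its more complex operations. These include adding termination criteria for node splitting and checking, before a candidate attribute is tested, whether it has already been used by an ancestor of the current node.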
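
The outlier detection in improvement (3) can be sketched as follows. The keywords name the boxplot as the underlying tool, but the abstract does not state the exact rule, so this sketch assumes the conventional Tukey fences (1.5 times the interquartile range beyond the quartiles). The sample layout and the names `boxplot_fences` and `drop_outliers` are hypothetical.

```python
import statistics

def boxplot_fences(values, whisker=1.5):
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR.

    The 1.5*IQR rule is the usual boxplot convention, assumed here;
    the thesis may use a different multiplier or rule.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr

def drop_outliers(samples, attribute, whisker=1.5):
    """Keep only the samples whose `attribute` value lies inside the fences.

    `samples` is a list of dicts, a hypothetical layout for illustration.
    """
    if len(samples) < 4:  # too few points for meaningful quartiles
        return list(samples)
    lower, upper = boxplot_fences([s[attribute] for s in samples], whisker)
    return [s for s in samples if lower <= s[attribute] <= upper]

# Samples reaching one node, with one injected noisy record.
node_samples = [{"x": v, "label": "a"}
                for v in (4.9, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5)]
node_samples.append({"x": 50.0, "label": "b"})

cleaned = drop_outliers(node_samples, "x")
print(len(node_samples), len(cleaned))
```

In an induction loop, a filter of this kind would run on the samples at the current node before candidate splits are evaluated, which is where the abstract places the noise-removal step.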

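The thesis reports simulation experiments and an analysis of their results for the noise-resistant, sensitivity-based fuzzy decision tree induction algorithm. The results indicate that the algorithm genuinely realizes a fuzzy SLIQ algorithm and is robust, and that the fuzzy decision trees it builds show considerable improvement in classification ability and computational efficiency.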

【English abstract】Decision trees are among the most widely used techniques in data mining and are popular for their strengths in knowledge acquisition and knowledge representation. As huge amounts of data are produced, the uncertain knowledge embedded in that data is also increasing, and it is receiving more and more attention. In the mid-1960s, Zadeh proposed fuzzy set theory, which gave a more precise way to express fuzzy knowledge. Many scholars have since introduced fuzzy set theory into the decision tree domain to overcome the sharp-boundary problem of traditional decision trees. The ID3 algorithm was fuzzified early on, and recently the SLIQ algorithm has also been introduced into the fuzzy environment. This thesis focuses on G-FDT, the fuzzy SLIQ algorithm proposed by Chandra et al., and analyzes in detail why the fuzzy decision tree induced by this algorithm degenerates into a traditional crisp decision tree. In view of the convexity defect that traditional node-split evaluation functions display in the fuzzy environment, it proposes a new fuzzy SLIQ algorithm: a noise-resistant induction algorithm for fuzzy decision trees based on the classification sensitivity of candidate attributes. Compared with G-FDT, the algorithm has the following improvements: (1) The discrimination functions that G-FDT constructs for candidate attributes are too narrow; the proposed method of determining the discrimination function fundamentally avoids this problem. (2) The concept of candidate-attribute sensitivity is put forward. In view of the convexity defect of the traditional heuristic node-split test function in the fuzzy environment, the thesis proposes classification sensitivity, which reflects the classification ability of a candidate attribute. A candidate attribute with high classification sensitivity is assigned a relatively steep discrimination function, which makes it more likely to be selected. (3) An outlier detection mechanism is put forward. Because G-FDT and the SG-FDT algorithm proposed earlier have extremely low resistance to noise, the decision trees they induce are weakened in their ability to represent knowledge. The improved algorithm therefore deletes outliers from the current example set when it tests a possible node split, so the decision tree becomes more stable and more robust. (4) Optimizations are proposed to make the computation more efficient. To improve the practicality of the induction algorithm, the thesis proposes several optimizations that reduce the enormous cost of its complex operations. These include adding termination criteria for node splitting and testing whether a candidate attribute has already been used by the current node's ancestors. The thesis carries out simulations and an analysis of the results for the sensitivity-based, noise-resistant fuzzy SLIQ decision tree algorithm. The experimental results indicate that the algorithm truly implements a fuzzy SLIQ algorithm and shows good robustness, and that the classification ability of the fuzzy decision trees it constructs is substantially improved.
【Keywords】decision tree; fuzzy set theory; sensitivity; SLIQ; boxplot
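【Contents】A Sensitivity-Based, Noise-Resistant Fuzzy SLIQ Decision Tree
Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
  1.1 Research Background and Significance
  1.2 Research Status in China and Abroad
  1.3 Main Work of This Thesis
  1.4 Organization of This Thesis
Chapter 2 Fuzzy Decision Trees
  2.1 Classical Decision Trees
    2.1.1 The Classification Problem
    2.1.2 Definition of Classical Decision Trees
    2.1.3 Overview of Decision Tree Induction Algorithms
    2.1.4 Decision Tree Pruning and Evaluation
    2.1.5 Scalable Decision Tree Techniques
  2.2 Fuzzy Set Theory
    2.2.1 Classical Set Theory
    2.2.2 Membership Functions and Fuzzy Sets
    2.2.3 General Operations on Fuzzy Sets
  2.3 Fuzzy Decision Trees
Chapter 3 Fuzzy SLIQ Decision Tree Induction Algorithms
  3.1 Overview of the Fuzzy SLIQ Algorithm
  3.2 Overview of the G-FDT Algorithm
    3.2.1 Fuzzification of Candidate Attributes
    3.2.2 Node Splitting
    3.2.3 G-FDT Node-Splitting Termination Criteria
    3.2.4 Description of the G-FDT Algorithm
  3.3 Defects of the G-FDT Algorithm
    3.3.1 Defect Analysis
    3.3.2 Analysis of the Causes of the G-FDT Defects
  3.4 Principles for Correcting the G-FDT Algorithm
  3.5 Chapter Summary
Chapter 4 A Noise-Resistant Fuzzy SLIQ Decision Tree Algorithm Based on Classification Sensitivity
  4.1 SG-FDT: A Sensitivity-Based Fuzzy SLIQ Decision Tree Algorithm
    4.1.1 Classification Sensitivity of Candidate Attributes
    4.1.2 The SG-FDT Fuzzy Decision Tree Induction Algorithm
  4.2 The Noise-Resistant SG-FDT Algorithm
    4.2.1 Boxplot-Based Outlier Detection Mechanism
    4.2.2 Improving the Computational Efficiency of SG-FDT
    4.2.3 Overview of the Noise-Resistant SG-FDT Algorithm
  4.3 Chapter Summary
Chapter 5 Experimental Analysis
  5.1 Training Data and Decision Tree Validation
  5.2 Comparative Analysis of Classification Accuracy
  5.3 Comparative Analysis of Decision Tree Size and Construction Cost
  5.4 Analysis of Decision Tree Structure
  5.5 Chapter Summary
Chapter 6 Conclusions and Outlook
  6.1 Conclusions
  6.2 Outlook
References
Acknowledgements
List of Academic Papers Published During the Master's Program
Thesis Review and Defense Record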