A Variable-Flow-Rate Data Stream Clustering Algorithm Based on Skewed Distribution

XING Changzheng; HU Quanbo
【Journal】Computer Engineering (《计算机工程》)
【Year (Volume), Issue】2013(000)012
【Abstract】TDCA, a clustering algorithm for data streams with skewed distributions, falls short in clustering speed and memory utilization, and a variable-flow-rate data stream environment seriously degrades the quality of clustering results. To address these problems, a data stream clustering algorithm named GR-Stream is proposed. It uses grid cells to aggregate data points and an extension of the R-tree as the index structure that organizes the grid cells. On top of this structure it introduces a pruning strategy and adjusts the way data points enter the tree. Tests on the real dataset KDD-CUP99 show that, compared with the TDCA algorithm, the index structure improves access speed during clustering by 40%, the pruning strategy saves at least half of the memory usage, and the average purity of the clustering results stays above 90% in a variable-flow-rate data stream environment.
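The abstract names three ingredients: grid cells that aggregate data points, pruning of cells, and average purity as the quality measure. The following is a minimal Python sketch of those ideas under stated assumptions, not the GR-Stream implementation. The class `GridSummary`, the parameters `width`, `decay`, and `min_density`, and the exponential fading rule are all hypothetical, and a plain dictionary stands in for the R-tree-based index that the paper describes. Only `average_purity` follows a standard definition: the mean, over clusters, of the share of points carrying the cluster's dominant label.

```python
from collections import Counter, defaultdict
from math import floor


def cell_of(point, width):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(floor(x / width) for x in point)


class GridSummary:
    """Hypothetical grid summary: one faded point count per non-empty cell."""

    def __init__(self, width, decay=0.5, min_density=0.5):
        self.width = width              # side length of a cell (assumed)
        self.decay = decay              # fading factor per time unit (assumed)
        self.min_density = min_density  # pruning threshold (assumed)
        self.cells = {}                 # cell -> (density, time of last update)

    def insert(self, point, t):
        key = cell_of(point, self.width)
        density, last = self.cells.get(key, (0.0, t))
        # Fade the stored density by the elapsed time, then count the new point.
        self.cells[key] = (density * self.decay ** (t - last) + 1.0, t)

    def prune(self, t):
        """Drop cells whose faded density is below the threshold; return how many."""
        stale = [k for k, (d, last) in self.cells.items()
                 if d * self.decay ** (t - last) < self.min_density]
        for k in stale:
            del self.cells[k]
        return len(stale)


def average_purity(assignments, labels):
    """Mean, over clusters, of the share of the dominant true label."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(assignments, labels):
        by_cluster[cluster][label] += 1
    shares = [max(c.values()) / sum(c.values()) for c in by_cluster.values()]
    return sum(shares) / len(shares)
```

In this sketch, memory is bounded by the number of cells that survive pruning rather than by the number of points seen, which is the general reason a grid summary combined with pruning can reduce memory usage on a skewed stream.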

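【Pages】5 (pp. 247-250, 259)
【Authors】XING Changzheng; HU Quanbo
【Affiliation】School of Electronic and Information Engineering, Liaoning Technical University, Huludao, Liaoning 125105, China (both authors)
【Language】Chinese
【CLC Classification】TP18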