A new-style clustering algorithm based on swarm intelligent theory

合集下载

基于改进聚类和矩阵分解的协同过滤推荐算法

基于改进聚类和矩阵分解的协同过滤推荐算法

基于改进聚类和矩阵分解的协同过滤推荐算法王永贵;宋真真;肖成龙【摘要】Concerning data sparseness,low accuracy and poor real-time performance of traditional collaborative filtering recommendation algorithm in e-commerce system under the background of big data,a new collaborative filtering recommendation algorithm based on improved clustering and matrix decomposition was proposed.Firstly,the dimensionality reduction and data filling of the original data were reliazed by matrix decomposition.Then the time decay function was introduced to deal with user score.The attribute vector of a project was used to characterize the project and the interest vector of user was used to characterize the user,then the projects and users were clustered by k-means clustering algorithm.By using the improved similaritymeasure method,the nearest neighbors and the project recommendation candidate set in the cluster were searched,thus the recommendation wasmade.Experimental results show that the proposed algorithm can not only solve the problem of sparse data and cold start caused by new projects,but also can reflect the change of user's interest in multidimension,and the accuracy of recommendation algorithm is obviously improved.%大数据背景下,对于传统的协同过滤推荐算法在电子商务系统中的数据稀疏性、准确性不高、实时性不足等问题,提出一种改进的协同过滤推荐算法.该算法首先通过矩阵分解实现对原始数据的降维及其数据填充,并引入了时间衰减函数预处理用户评分,用项目的属性向量来表征项目,用用户的兴趣向量来表征用户,通过k-means聚类算法对用户和项目分别进行聚类;然后使用改进相似性度量方法在簇中查找用户的最近邻和项目推荐候选集,产生推荐.实验结果表明,该算法不仅可以有效解决数据稀疏和新项目带来的冷启动问题,而且还可以在多维度下反映用户的兴趣变化,推荐算法的准确度明显提升.【期刊名称】《计算机应用》【年(卷),期】2018(038)004【总页数】6页(P1001-1006)【关键词】协同过滤;聚类;时间衰变;兴趣向量;矩阵分解【作者】王永贵;宋真真;肖成龙【作者单位】辽宁工程技术大学软件学院,辽宁葫芦岛125105;辽宁工程技术大学软件学院,辽宁葫芦岛125105;辽宁工程技术大学软件学院,辽宁葫芦岛125105【正文语种】中文【中图分类】TP1810 引言在互联网时代,用户对信息的需求数量得到了满足,但是用户在搜索信息时无法直接有效地获取到他们真正想要的信息,其实质反而是降低了用户对信息的使用效率,即人们由信息匮乏的时代进入到了信息过载的时代。

一种CCA-层次聚类的基因聚类算法

一种CCA-层次聚类的基因聚类算法

第28卷㊀第5期2023年10月㊀哈尔滨理工大学学报JOURNAL OF HARBIN UNIVERSITY OF SCIENCE AND TECHNOLOGY㊀Vol.28No.5Oct.2023㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀一种CCA -层次聚类的基因聚类算法林倩闽(厦门理工学院电气工程与自动化学院,福建厦门361024)摘㊀要:针对基因芯片技术带来的海量基因表达数据,为了充分挖掘其蕴含的生物信息和潜在的生物机制,提出一种基于CCA -层次聚类的基因聚类算法(CCA-Hc )㊂该算法在层次聚类的基础上引入典型相关分析,优化相似性矩阵计算方法㊂首先,利用典型相关分析方法结合基因的多个特征信息进行基因相关性度量,得到基因相似性矩阵㊂然后将该相似性矩阵作为层次聚类的邻近矩阵进行凝聚层次聚类㊂在Oryza sativa L.(水稻)的基因表达数据集上进行CCA-Hc 聚类效果测试实验,结果表明,与采用欧式距离的传统层次聚类算法(EUC-Hc )相比,CCA-Hc 的内部稳定性指标和生物功能性指标均优于EUC-Hc ,具有更佳的鲁棒性和聚类准确性,更有利于去发现基因间的共表达关系㊂关键词:基因表达数据;聚类算法;典型相关分析;层次聚类DOI :10.15938/j.jhust.2023.05.011中图分类号:TP391文献标志码:A文章编号:1007-2683(2023)05-0085-06A Gene Clustering Algorithm Based on the CCA-Hierarchical ClusteringLIN Qianmin(School of Electrical Engineering and Automation,Xiamen University of Technology,Xiamen 361024,China)Abstract :Aiming at the massive gene expression data brought by gene chip technology,in order to fully mine the biological information and potential biological mechanisms contained in it,this paper proposes a gene clustering algorithm based on CCA-hierarchical clustering (CCA-Hc).The algorithm introduces canonical correlation analysis on the basis of hierarchical clustering,and optimizes the calculation method of similarity matrix.First,the canonical correlation analysis method is used to measure the gene correlation by combining the multiple feature information of the gene,and the gene similarity matrix is obtained.Then the similarity matrix is used as the neighbor matrix of hierarchical clustering for agglomerative hierarchical clustering.The CCA-Hc clustering effect test experiment was performed on the gene expression dataset of Oryza sativa L.(rice).The results show that,compared with the traditional hierarchical clustering algorithm using Euclidean distance (EUC-Hc),CCA-Hc is superior to EUC-Hc in both internal stability index and biological functional index,and has better robustness and clustering accuracy.It is more conducive to discoveringthe co-expression relationship between genes.Keywords :gene expression data;clustering algorithm;canonical correlation analysis;hierarchical clustering㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀收稿日期:2022-06-08基金项目:福建省科技厅引导性项目(2019H0039);福建省中青年教师教育科研项目(JAT210341).通信作者:林倩闽(1992 ),女,硕士,助理实验师,E-mail:1023447133@.0㊀引㊀言随着高通量测序技术的不断快速发展,出现越来越多复杂度高㊁数据量大的生物数据㊂不同测序技术可以得到不同水平的生物数据,如通过基因组测序得到DNA 水平的生物数据,转录组测序得到RNA 水平的生物数据㊂基因表达数据是通过DNA微阵列技术(又称为基因芯片技术)检测得到,是不同细胞在不同条件下的基因动态表达水平[1]㊂基因是携带遗传物质的DNA片段,在不同细胞中会有不同的表达方向[2],从而可以控制不同的性状㊂为此基因表达数据蕴含着丰富且重要的生物机制,具有很大的研究价值㊂在基因表达数据分析中,聚类分析方法被广大研究者选用,用以发现具有相似表达行为的基因集,基因间的共表达㊁共调控关系等,对于推断未知的基因功能及在疾病诊断方面具有重要意义[2]㊂目前基因聚类算法根据聚类对象可以分为基于基因㊁基于样本聚类以及基于基因样本的双聚类[3-4]㊂根据聚类方式的不同,又可以分为以K-means算法[5]㊁K-MEDOIDS[6]为代表的基于分区的聚类算法,以BIRCH算法[7]㊁CURE算法[8]为代表的基于层次的聚类算法,以DBSCAN算法[9]㊁OPTICS算法[10]为代表的基于密度的聚类算法和以CLIQUE算法[11]为代表的基于网格的聚类算法㊂在对基因表达数据进行聚类分析时,主要是度量基因之间的相关性,把相关性程度高的基因聚在一起㊂很多基因聚类研究中把皮尔森相关系数㊁欧式距离㊁曼哈顿距离等作为相关性程度的度量方式[12]㊂这些度量方式是基于基因的整体表达水平进行的,即一个基因只由一个一维的数据矩阵表示㊂而在实际的的测序过程中,往往会在不同的细胞周期进行实验测量基因的表达水平,使得一个基因会有多组数据,每组数据代表该基因的一个特征㊂大部分的研究中采用求和的方式把基因多个特征的数据进行累加,进而分析基因之间的相关性㊂这种方法存在的问题是忽略了基因各个特征对表达水平的影响,从而对聚类结果造成影响㊂为了解决上述问题,本文把典型相关分析(Ca-nonical Correlation Analysis,CCA)引入到层次聚类中来,搭建出基于CCA-层次聚类的基因聚类算法(CCA-Hc)㊂典型相关分析是一种计算变量之间相关性的统计学分析方法,能结合变量的多个特征,得到变量的整体相关性[13]㊂利用典型相关分析度量基因之间的相关性,能充分考虑基因的多个特征信息,使得聚类结果中的基因集相似性程度更高㊂同时采用凝聚层次聚类,可以从聚类树状图中直观地分析聚类结果,从而整体上提高聚类效果㊂最后用GEO数据库上的基因数据集来验证CCA-Hc算法的有效性㊂1㊀CCA-Hc算法设计1.1㊀典型相关分析给定基因微阵列数据矩阵A nˑm=(G,T),n表示基因个数,m表示条件的种类数㊂每个基因可以看成是一个变量,使用典型相关分析方法分析变量相关性时,假设变量X有p个特征,变量Y有q个特征,pɤq,每个特征均对应m个不同条件的数据,则X=[x1, ,x p]T(1) Y=[y1, ,y q]T(2)变量X的数据矩阵为x11x12x13 x1mx21x22x23 x2mx31x32x33 x3m︙︙︙︙x p1x p2x p3 x pméëêêêêêêêùûúúúúúúú变量Y的数据矩阵为y11y12y13 y1my21y22y23 y2my31y32y33 y3m︙︙︙︙y q1y q2y q3 y qméëêêêêêêêùûúúúúúúú变量X和变量Y的协方差矩阵为ð=Cov(X,Y)=Var(X)Cov(X,Y)Cov(Y,X)Var(Y)()=ð11ð12ð21ð22()(3)变量X和变量Y的线性表达式记为U㊁V,表示为:U=a1x1+a2x2+ +a p x p=a T X(4) V=b1y1+b2y2+ +b q y q=b T Y(5)变量X和变量Y进行典型相关性分析时,可用这两个变量的线性表达式U㊁V之间相关系数的最大值来度量变量之间的相关性程度,即max a,b corr(U,V)=a Tð12b(a Tð11aˑb Tð22b)1/2(6)在求解上述最值表达式时,运用拉格朗日数乘法求解瑞利熵矩阵(ð-111ð12ð-122ð21)得到p个特征值,68哈㊀尔㊀滨㊀理㊀工㊀大㊀学㊀学㊀报㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀第28卷㊀记为λ1,λ2 λp ㊂这p 个特征值即变量X 和变量Y之间的典型相关系数㊂每一个相关系数再应用卡方检验进行显著性检验,得到p 个卡方检验p-value 值,记为p 1,p 2 p p ㊂为了更好地表示变量之间的典型相关程度,引入一个关于典型相关系数和p-value 值的权重函数W 来表示,定义为:W =ðp i =1λi I (log P i )ðp i =1I (log P i )(7)其中I (log P i )=0P >0.05-log PP ɤ0.05{这样每两个变量之间就能得到一个w 值来度量它们的相关性程度㊂对基因表达数据的n 个基因进行如上方法的典型相关分析后,最终得到一个n ˑn 的相似性矩阵㊂1.2㊀层次聚类目前常用的聚类算法有基于分区㊁基于层次㊁基于密度和基于网络4种类型[2],其中基于层次聚类的算法因原理通俗易懂㊁结果直观且精度高等优点而被广泛使用[14]㊂层次聚类分为自下而上的凝聚聚类和自上而下的分裂聚类两种[15],其中凝聚层次聚类运用最为广泛,同时凝聚层次聚类在无预先定义类别数的分类中具有明显优势[16]㊂故本文采用的是凝聚层次聚类,可以用树状图和嵌套簇图来表示,例如图1所示㊂图1㊀凝聚层次聚类的树状图和嵌套簇图Fig.1㊀Dendrogram and Nested Cluster Diagramfor Agglomerative Hierarchical Clustering下面介绍凝聚层次聚类的聚类过程:步骤1:视每一个数据点(如基因变量)为一个集群;步骤2:计算邻近矩阵,把类间距离最接近的两个集群进行合并;步骤3:重复步骤2,直到所有数据点合并完成㊂步骤2中的类间距离即两个集群之间的距离,传统的层次聚类类间距离计算方法有如下几种[17]:1)两个集群中距离最近的两个样本距离;2)两个集群中距离最远的两个样本距离;3)两个集群中所有样本之间的距离再求平均值;完成所有聚类步骤后会生产一个树状图(又叫聚类树)㊂采用不同的变量相关性程度度量方式和不同的类间距离计算方法都将对聚类结果造成影响㊂1.3㊀CCA-HC 算法传统的层次聚类算法其计算复杂度为O (n 3),由于在聚类过程中需要不断地重复计算类间距离㊁不断地更新邻近矩阵,从而消耗大量的时间与资源[18]㊂对于数据量庞大的基因微阵列数据,迫切需要对算法进行优化,降低复杂度㊂本文提出了一种基于CCA 和层次聚类的基因聚类算法(CCA-HC),优化相似性矩阵计算方法,把典型相关分析的输出作为层次聚类的输入,即把典型相关分析得到的相似性矩阵作为层次聚类的邻近矩阵㊂CCA-HC 在度量基因相关性程度时采用典型相关分析的方法,在层次聚类方式上选择自下而上的凝聚层次聚类㊂CCA-HC 充分利用了典型相关分析和层次聚类的优点,能够结合基因的多个特征来量化基因之间的相关性,使得聚类结果中的基因集相似性程度更高,也能自主选择集群数目以得到更佳的聚类效果[18]㊂2㊀实验与结果分析2.1㊀实验数据为了评价章节一中提出算法的聚类效果,在GEO 数据库上下载Oryza sativa L.(水稻)的基因表达数据集,得到的原始数据集共有45063个基因,样本数为41㊂由于原始数据集基因数庞大,对其计算分析时不论在存储空间还是计算程序上都提出了较高的要求,为此进行适当的数据预处理显得尤为重要㊂本文在数据预处理方面开展的主要工作有:把基因名未知的数据剔除;过滤掉样本表达量过低的基因;采用log2的对数函数对原始数据进行标准化处理等㊂经过如上处理后得到4564ˑ41的数据矩阵,用于后续的实验分析㊂预处理后的实验数据集78第5期林倩闽:一种CCA -层次聚类的基因聚类算法统计情况如表1所示㊂表1㊀预处理后的实验数据集统计情况表Tab.1㊀Statistical table of experimental dataset after preprocessing数据集基因数样本数基因功能类别Oryza sativa L.456441881.5㊀评价标准基因表达数据的聚类效果可以从聚类结果中同一集群的相关性程度以及聚类算法的稳定性等方面进行评价,用生物功能性指标和内部稳定性指标来描述㊂1.生物功能性指标生物同源性指标(biological homogeneity index, BHI)是用来评估聚类集群在生物功能意义上的同源性程度[19]㊂在基因本体(gene ontology,GO)数据库上下载水稻的基因功能类数据,可以得知每个水稻基因所对应的生物组织功能,用来分析同一聚类集群中的基因在功能上的相关性㊂BHI公式计算如下:BHI(K,B)=1KðK k=11nk(n k-1)ðiʂjɪC k I(B(i)=B(j))(8)式中:C为聚类结果中的任一集群;B为基因功能类集合,当基因i和基因j所对应的功能类存在交集,则I(B(i)=B(j))=1,否则为0㊂最终得到的BHI 是介于0~1的值,BHI值越大,表示基因聚类集群的生物功能相关性越大,聚类效果更佳[19]㊂2.内部稳定性指标内部稳定性指标在于评价聚类算法的鲁棒性,通过改变基因微阵列数据的某几列进行聚类,进而比较基于不同数据的聚类结果㊂优值系数(figure of merit,FOM)是内部稳定性指标中的一种,表示数据列改变后基因之间的平均群内方差[20]㊂FOM公式计算如下:FOM(l,K)=1NðK k=1ðiɪC k(l)dist(x i,l, x C k(l))(9)式中:FOM的取值范围是0到无穷大,FOM值越小表示该聚类算法的稳定性越好[20]㊂2.3㊀结果与分析为验证CCA-Hc的聚类效果,对比采用欧式距离的传统层次聚类算法(EUC-Hc),运用相同数据集进行实验㊂为了获得更加准确的聚类效果,本实验设置不同的聚类集群参数,确定聚类集群数目K 分别为2㊁4㊁6㊁7㊁9㊁11㊁12这7组实验,并通过BHI 和FOM指标对这7组实验的聚类结果进行评估, BHI和FOM指标值分别见表2和表3㊂表2㊀不同聚类集群数目下的BHI指标值Tab.2㊀BHI index values under different number of clusters 算法类型\集群数目CCA-Hc EUC-Hc差异率K=20.4660.233100.05%K=40.4630.34633.77%K=60.4670.37723.90%K=70.4670.41213.34%K=90.4650.4357.12%K=110.4640.4512.72%K=120.4630.456 1.48%表3㊀不同聚类集群数目下的FOM指标值Tab.3㊀FOM index values under different number of clusters算法类型\集群数目CCA-Hc EUC-Hc差异率K=22.6974.633-41.78%K=42.6974.298-37.26%K=62.6964.047-33.37%K=72.6963.995-32.52%K=92.6963.816-29.35%K=112.6953.693-27.03%K=12 2.695 3.636-25.89%㊀㊀表2中的差异率指的是CCA-Hc的BHI指标比EUC-Hc的BHI指标相差的百分比,同理可以计算表3中的差异率㊂根据表2和表3的实验指标数据发现,对于7组不同的聚类集群数目实验,本文提出的CCA-Hc 的BHI指标均高于EUC-Hc,FOM指标均低于EUC-Hc,这表明CCA-Hc的鲁棒性更好,聚类结果中同一集群的基因相关性更大,聚类效果更加显著㊂同时还发现,集群数目对CCA-Hc的影响较小,K选不同的值,BHI指标值稳定在0.463~0.467之间,FOM88哈㊀尔㊀滨㊀理㊀工㊀大㊀学㊀学㊀报㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀第28卷㊀指标值稳定在2.695~2.697之间,而集群数目对EUC-Hc 算法的影响相对比较明显㊂图2为CCA-Hc 在Oryza sativa L.数据集的聚类树状图,可以自行在所需的层级对树状图进行 剪枝 操作以获得合适的聚类效果[21]㊂图2㊀CCA-Hc 在Oryza sativa L.数据集的聚类树状图Fjg.2㊀Clustering dendrogram of CCA-Hc in Oryzasativa L.dataset3㊀结㊀论本文为了充分有效地挖掘基因表达数据所蕴含的生物机制,提出一种基于CCA -层次聚类的基因聚类算法(CCA-Hc)㊂把典型相关分析方法引入到凝聚层次聚类中来进行多特征基因的聚类分析,成为本文的创新之处㊂该算法利用典型相关分析方法度量基因之间的相关性程度,能够充分考虑基因的多个特征信息㊂同时采用凝聚层次聚类可自主选择聚类集群数目,直观显示聚类结果㊂基于Oryza sativa L.(水稻)的基因表达数据集,本文对比了CCA-Hc 和EUC-Hc 的聚类效果,使用BHI 和FOM 两个评价指标进行衡量,结果表明CCA-Hc 的鲁棒性和聚类准确性均更好,更有利于去探索基因表达数据潜在的生物机制㊂参考文献:[1]㊀欧阳玉梅.基因表达数据聚类分析技术及其软件工具[J].生物信息学,2010,8(2):104.OUYANG Yumei.Gene Expression Data Cluster Analysis Technology and Software Tools [J ].Bioinformatics,2010,8(2):104.[2]㊀高华成.基于数据降维框架的基因聚类算法[D].南京:南京邮电大学,2021.[3]㊀姚登举,詹晓娟,张晓晶.一种加权K -均值基因聚类算法[J ].哈尔滨理工大学学报,2017,22(2):112.YAO Dengju,ZHAN Xiaojuan,ZHANG Xiaojing.A Weighted K-Means Gene Clustering Algorithm[J].Jour-nal of Harbin University of Science and Technology,2017,22(2):112.[4]㊀方匡南,陈远星,张庆昭,等.双向聚类方法综述[J].数理统计与管理,2020,39(1):22.FANG Kuangnan,CHEN Yuanxing,ZHANG Qingzhao,et al.Review of Bidirectional Clustering Methods [J].Journal of Applied Statistics and Management,2020,39(1):22.[5]㊀吴明阳,张芮,岳彩旭,等.应用K-means 聚类算法划分曲面及实验验证[J].哈尔滨理工大学学报,2017(1):54.WU Mingyang,ZHANG Rui,YUE Caixu,et al.Appli-cation of K-means Clustering Algorithm for Surface Divi-sion and Experimental Verification[J].Journal of HarbinUniversity of Science and Technology,2017(1):54.[6]㊀LACKO D,HUYSMANS T,VLEUGELS J,et al.ProductSizing with 3D Anthropometry and K-medoids Clustering[J].Computer-Aided Design,2017:S0010448517301173.[7]㊀ZHANG T,RAMAKRISHNAN R,LIVNY M.BIRCH:ANew Data Clustering Algorithm and Its Applications[J].Data Mining and Knowledge Discovery,1997,1(2):141.[8]㊀FUSHIMI T,MORI R.High-Speed Clustering of Region-al Photos Using Representative Photos of Different Re-gions[C].2018IEEE /WIC /ACM International Confer-ence on Web Intelligence (WI),IEEE,2018:520.[9]㊀Al-MAMORY S O,KAMIL I S.A New Density BasedSampling to Enhance DBSCAN Clustering Algorithm[J].Journal of Computer Science,2019,32(4):315.[10]ANKERST M,BREUNIG M M,KRIEGEL H P,et al.OPTICS:Ordering Points to Identify the Clustering Struc-ture[C]//SIGMOD 1999,Proceedings ACM SIGMOD International Conference on Management of Data,June 1-3,1999,Philadelphia,Pennsylvania,USA.ACM,1999:2008,99.[11]王飞,王国胤,李智星,等.一种基于网格的密度峰值聚类算法[J ].小型微型计算机系统,2017(5):1034.WANG Fei,WANG Guoyin,LI Zhixing,et al.A Grid-based Density Peak Clustering Algorithm[J].Journal of98第5期林倩闽:一种CCA -层次聚类的基因聚类算法Chinese Computer Systems,2017(5):1034. [12]YAO J,CHANG C,SALMI M L,et al.Genome-scaleClusteranalysis of Replicated Microarrays Using ShrinkageCorrelation Coefficient[J].BMC Bioinformatics,2008,9:288.[13]HONG S,CHEN X,JIN L,et al.Canonical CorrelationAnalysis for RNA-seq Co-expression Networks[J].Nu-cleic Acids Res,2013,41(8):e95.[14]万静,郑龙君,何云斌,等.高维数据的高密度子空间聚类算法[J].哈尔滨理工大学学报,2020,25(4):84.WAN Jing,ZHENG Longjun,HE Yunbin,et al.High-Density Subspace Clustering Algorithm for High-Dimen-sional Data[J].Journal of Harbin University of Scienceand Technology,2020,25(4):84.[15]刘昊.基于聚类算法的生物分析软件的设计与实现[D].上海:复旦大学,2013.[16]乔锦荣,原新鹏,梁旭东,等.凝聚层次聚类方法在降水预报评估中的应用[J].干旱气象,2022,40(4):690.QIAO Jinrong,YUAN Xinpeng,LIANG Xudong,et al.Application of Agglomerative Hierarchical ClusteringMethod in Precipitation Forecast Evaluation[J].AridMeteorology,2022,40(4):690.[17]JASKOWIAK P A,CAMPELLO R J,COSTA I G.Onthe Selection of Appropriate Distances for Gene Expres-sion Data Clustering[J].BMC Bioinformatics,2014,15(2):1.[18]季姜帅,裴颂文.面向异质基因数据的智能层次聚类算法研究[J].小型微型计算机系统,2021,43(9):1808.JI Jiangshuai,PEI Songwen.Research on Intelligent Hi-erarchical Clustering Algorithm for Heterogeneous GeneticData[J].Journal of Chinese Computer Systems,2021,43(9):1808.[19]DATTA S,DATTA S.Methods for Evaluating ClusteringAlgorithms for Gene Expression Data Using a ReferenceSet of Functional Classes[J].BMC Bioinformatics,2006,7(1):1.[20]DATTA parisons and Validation of Statistical Clus-tering Techniques for Microarray Gene Expression Data[J].Bioinformatics,2003,19(4):459. [21]HULOT A,CHIQUET J,JAFFRÉZIC F,et al.Fast TreeAggregation for Consensus Hierarchical Clustering[J].BMC Bioinformatics,2020,21(1):12.(编辑:温泽宇)09哈㊀尔㊀滨㊀理㊀工㊀大㊀学㊀学㊀报㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀第28卷㊀。

机器学习与人工智能领域中常用的英语词汇

机器学习与人工智能领域中常用的英语词汇

机器学习与人工智能领域中常用的英语词汇1.General Concepts (基础概念)•Artificial Intelligence (AI) - 人工智能1)Artificial Intelligence (AI) - 人工智能2)Machine Learning (ML) - 机器学习3)Deep Learning (DL) - 深度学习4)Neural Network - 神经网络5)Natural Language Processing (NLP) - 自然语言处理6)Computer Vision - 计算机视觉7)Robotics - 机器人技术8)Speech Recognition - 语音识别9)Expert Systems - 专家系统10)Knowledge Representation - 知识表示11)Pattern Recognition - 模式识别12)Cognitive Computing - 认知计算13)Autonomous Systems - 自主系统14)Human-Machine Interaction - 人机交互15)Intelligent Agents - 智能代理16)Machine Translation - 机器翻译17)Swarm Intelligence - 群体智能18)Genetic Algorithms - 遗传算法19)Fuzzy Logic - 模糊逻辑20)Reinforcement Learning - 强化学习•Machine Learning (ML) - 机器学习1)Machine Learning (ML) - 机器学习2)Artificial Neural Network - 人工神经网络3)Deep Learning - 深度学习4)Supervised Learning - 有监督学习5)Unsupervised Learning - 无监督学习6)Reinforcement Learning - 强化学习7)Semi-Supervised Learning - 半监督学习8)Training Data - 训练数据9)Test Data - 测试数据10)Validation Data - 验证数据11)Feature - 特征12)Label - 标签13)Model - 模型14)Algorithm - 算法15)Regression - 回归16)Classification - 分类17)Clustering - 聚类18)Dimensionality Reduction - 降维19)Overfitting - 过拟合20)Underfitting - 欠拟合•Deep Learning (DL) - 深度学习1)Deep Learning - 深度学习2)Neural Network - 神经网络3)Artificial Neural Network (ANN) - 人工神经网络4)Convolutional Neural Network (CNN) - 卷积神经网络5)Recurrent Neural Network (RNN) - 循环神经网络6)Long Short-Term Memory (LSTM) - 长短期记忆网络7)Gated Recurrent Unit (GRU) - 门控循环单元8)Autoencoder - 自编码器9)Generative Adversarial Network (GAN) - 生成对抗网络10)Transfer Learning - 迁移学习11)Pre-trained Model - 预训练模型12)Fine-tuning - 微调13)Feature Extraction - 特征提取14)Activation Function - 激活函数15)Loss Function - 损失函数16)Gradient Descent - 梯度下降17)Backpropagation - 反向传播18)Epoch - 训练周期19)Batch Size - 批量大小20)Dropout - 丢弃法•Neural Network - 神经网络1)Neural Network - 神经网络2)Artificial Neural Network (ANN) - 人工神经网络3)Deep Neural Network (DNN) - 深度神经网络4)Convolutional Neural Network (CNN) - 卷积神经网络5)Recurrent Neural Network (RNN) - 循环神经网络6)Long Short-Term Memory (LSTM) - 长短期记忆网络7)Gated Recurrent Unit (GRU) - 门控循环单元8)Feedforward Neural Network - 前馈神经网络9)Multi-layer Perceptron (MLP) - 多层感知器10)Radial Basis Function Network (RBFN) - 径向基函数网络11)Hopfield Network - 霍普菲尔德网络12)Boltzmann Machine - 玻尔兹曼机13)Autoencoder - 自编码器14)Spiking Neural Network (SNN) - 脉冲神经网络15)Self-organizing Map (SOM) - 自组织映射16)Restricted Boltzmann Machine (RBM) - 受限玻尔兹曼机17)Hebbian Learning - 海比安学习18)Competitive Learning - 竞争学习19)Neuroevolutionary - 神经进化20)Neuron - 神经元•Algorithm - 算法1)Algorithm - 算法2)Supervised Learning Algorithm - 有监督学习算法3)Unsupervised Learning Algorithm - 无监督学习算法4)Reinforcement Learning Algorithm - 强化学习算法5)Classification Algorithm - 分类算法6)Regression Algorithm - 回归算法7)Clustering Algorithm - 聚类算法8)Dimensionality Reduction Algorithm - 降维算法9)Decision Tree Algorithm - 决策树算法10)Random Forest Algorithm - 随机森林算法11)Support Vector Machine (SVM) Algorithm - 支持向量机算法12)K-Nearest Neighbors (KNN) Algorithm - K近邻算法13)Naive Bayes Algorithm - 朴素贝叶斯算法14)Gradient Descent Algorithm - 梯度下降算法15)Genetic Algorithm - 遗传算法16)Neural Network Algorithm - 神经网络算法17)Deep Learning Algorithm - 深度学习算法18)Ensemble Learning Algorithm - 集成学习算法19)Reinforcement Learning Algorithm - 强化学习算法20)Metaheuristic Algorithm - 元启发式算法•Model - 模型1)Model - 模型2)Machine Learning Model - 机器学习模型3)Artificial Intelligence Model - 人工智能模型4)Predictive Model - 预测模型5)Classification Model - 分类模型6)Regression Model - 回归模型7)Generative Model - 生成模型8)Discriminative Model - 判别模型9)Probabilistic Model - 概率模型10)Statistical Model - 统计模型11)Neural Network Model - 神经网络模型12)Deep Learning Model - 深度学习模型13)Ensemble Model - 集成模型14)Reinforcement Learning Model - 强化学习模型15)Support Vector Machine (SVM) Model - 支持向量机模型16)Decision Tree Model - 决策树模型17)Random Forest Model - 随机森林模型18)Naive Bayes Model - 朴素贝叶斯模型19)Autoencoder Model - 自编码器模型20)Convolutional Neural Network (CNN) Model - 卷积神经网络模型•Dataset - 数据集1)Dataset - 数据集2)Training Dataset - 训练数据集3)Test Dataset - 测试数据集4)Validation Dataset - 验证数据集5)Balanced Dataset - 平衡数据集6)Imbalanced Dataset - 不平衡数据集7)Synthetic Dataset - 合成数据集8)Benchmark Dataset - 基准数据集9)Open Dataset - 开放数据集10)Labeled Dataset - 标记数据集11)Unlabeled Dataset - 未标记数据集12)Semi-Supervised Dataset - 半监督数据集13)Multiclass Dataset - 多分类数据集14)Feature Set - 特征集15)Data Augmentation - 数据增强16)Data Preprocessing - 数据预处理17)Missing Data - 缺失数据18)Outlier Detection - 异常值检测19)Data Imputation - 数据插补20)Metadata - 元数据•Training - 训练1)Training - 训练2)Training Data - 训练数据3)Training Phase - 训练阶段4)Training Set - 训练集5)Training Examples - 训练样本6)Training Instance - 训练实例7)Training Algorithm - 训练算法8)Training Model - 训练模型9)Training Process - 训练过程10)Training Loss - 训练损失11)Training Epoch - 训练周期12)Training Batch - 训练批次13)Online Training - 在线训练14)Offline Training - 离线训练15)Continuous Training - 连续训练16)Transfer Learning - 迁移学习17)Fine-Tuning - 微调18)Curriculum Learning - 课程学习19)Self-Supervised Learning - 自监督学习20)Active Learning - 主动学习•Testing - 测试1)Testing - 测试2)Test Data - 测试数据3)Test Set - 测试集4)Test Examples - 测试样本5)Test Instance - 测试实例6)Test Phase - 测试阶段7)Test Accuracy - 测试准确率8)Test Loss - 测试损失9)Test Error - 测试错误10)Test Metrics - 测试指标11)Test Suite - 测试套件12)Test Case - 测试用例13)Test Coverage - 测试覆盖率14)Cross-Validation - 交叉验证15)Holdout Validation - 留出验证16)K-Fold Cross-Validation - K折交叉验证17)Stratified Cross-Validation - 分层交叉验证18)Test Driven Development (TDD) - 测试驱动开发19)A/B Testing - A/B 测试20)Model Evaluation - 模型评估•Validation - 验证1)Validation - 验证2)Validation Data - 验证数据3)Validation Set - 验证集4)Validation Examples - 验证样本5)Validation Instance - 验证实例6)Validation Phase - 验证阶段7)Validation Accuracy - 验证准确率8)Validation Loss - 验证损失9)Validation Error - 验证错误10)Validation Metrics - 验证指标11)Cross-Validation - 交叉验证12)Holdout Validation - 留出验证13)K-Fold Cross-Validation - K折交叉验证14)Stratified Cross-Validation - 分层交叉验证15)Leave-One-Out Cross-Validation - 留一法交叉验证16)Validation Curve - 验证曲线17)Hyperparameter Validation - 超参数验证18)Model Validation - 模型验证19)Early Stopping - 提前停止20)Validation Strategy - 验证策略•Supervised Learning - 有监督学习1)Supervised Learning - 有监督学习2)Label - 标签3)Feature - 特征4)Target - 目标5)Training Labels - 训练标签6)Training Features - 训练特征7)Training Targets - 训练目标8)Training Examples - 训练样本9)Training Instance - 训练实例10)Regression - 回归11)Classification - 分类12)Predictor - 预测器13)Regression Model - 回归模型14)Classifier - 分类器15)Decision Tree - 决策树16)Support Vector Machine (SVM) - 支持向量机17)Neural Network - 神经网络18)Feature Engineering - 特征工程19)Model Evaluation - 模型评估20)Overfitting - 过拟合21)Underfitting - 欠拟合22)Bias-Variance Tradeoff - 偏差-方差权衡•Unsupervised Learning - 无监督学习1)Unsupervised Learning - 无监督学习2)Clustering - 聚类3)Dimensionality Reduction - 降维4)Anomaly Detection - 异常检测5)Association Rule Learning - 关联规则学习6)Feature Extraction - 特征提取7)Feature Selection - 特征选择8)K-Means - K均值9)Hierarchical Clustering - 层次聚类10)Density-Based Clustering - 基于密度的聚类11)Principal Component Analysis (PCA) - 主成分分析12)Independent Component Analysis (ICA) - 独立成分分析13)T-distributed Stochastic Neighbor Embedding (t-SNE) - t分布随机邻居嵌入14)Gaussian Mixture Model (GMM) - 高斯混合模型15)Self-Organizing Maps (SOM) - 自组织映射16)Autoencoder - 自动编码器17)Latent Variable - 潜变量18)Data Preprocessing - 数据预处理19)Outlier Detection - 异常值检测20)Clustering Algorithm - 聚类算法•Reinforcement Learning - 强化学习1)Reinforcement Learning - 强化学习2)Agent - 代理3)Environment - 环境4)State - 状态5)Action - 动作6)Reward - 奖励7)Policy - 策略8)Value Function - 值函数9)Q-Learning - Q学习10)Deep Q-Network (DQN) - 深度Q网络11)Policy Gradient - 策略梯度12)Actor-Critic - 演员-评论家13)Exploration - 探索14)Exploitation - 开发15)Temporal Difference (TD) - 时间差分16)Markov Decision Process (MDP) - 马尔可夫决策过程17)State-Action-Reward-State-Action (SARSA) - 状态-动作-奖励-状态-动作18)Policy Iteration - 策略迭代19)Value Iteration - 值迭代20)Monte Carlo Methods - 蒙特卡洛方法•Semi-Supervised Learning - 半监督学习1)Semi-Supervised Learning - 半监督学习2)Labeled Data - 有标签数据3)Unlabeled Data - 无标签数据4)Label Propagation - 标签传播5)Self-Training - 自训练6)Co-Training - 协同训练7)Transudative Learning - 传导学习8)Inductive Learning - 归纳学习9)Manifold Regularization - 流形正则化10)Graph-based Methods - 基于图的方法11)Cluster Assumption - 聚类假设12)Low-Density Separation - 低密度分离13)Semi-Supervised Support Vector Machines (S3VM) - 半监督支持向量机14)Expectation-Maximization (EM) - 期望最大化15)Co-EM - 协同期望最大化16)Entropy-Regularized EM - 熵正则化EM17)Mean Teacher - 平均教师18)Virtual Adversarial Training - 虚拟对抗训练19)Tri-training - 三重训练20)Mix Match - 混合匹配•Feature - 特征1)Feature - 特征2)Feature Engineering - 特征工程3)Feature Extraction - 特征提取4)Feature Selection - 特征选择5)Input Features - 输入特征6)Output Features - 输出特征7)Feature Vector - 特征向量8)Feature Space - 特征空间9)Feature Representation - 特征表示10)Feature Transformation - 特征转换11)Feature Importance - 特征重要性12)Feature Scaling - 特征缩放13)Feature Normalization - 特征归一化14)Feature Encoding - 特征编码15)Feature Fusion - 特征融合16)Feature Dimensionality Reduction - 特征维度减少17)Continuous Feature - 连续特征18)Categorical Feature - 分类特征19)Nominal Feature - 名义特征20)Ordinal Feature - 有序特征•Label - 标签1)Label - 标签2)Labeling - 标注3)Ground Truth - 地面真值4)Class Label - 类别标签5)Target Variable - 目标变量6)Labeling Scheme - 标注方案7)Multi-class Labeling - 多类别标注8)Binary Labeling - 二分类标注9)Label Noise - 标签噪声10)Labeling Error - 标注错误11)Label Propagation - 标签传播12)Unlabeled Data - 无标签数据13)Labeled Data - 有标签数据14)Semi-supervised Learning - 半监督学习15)Active Learning - 主动学习16)Weakly Supervised Learning - 弱监督学习17)Noisy Label Learning - 噪声标签学习18)Self-training - 自训练19)Crowdsourcing Labeling - 众包标注20)Label Smoothing - 标签平滑化•Prediction - 预测1)Prediction - 预测2)Forecasting - 预测3)Regression - 回归4)Classification - 分类5)Time Series Prediction - 时间序列预测6)Forecast Accuracy - 预测准确性7)Predictive Modeling - 预测建模8)Predictive Analytics - 预测分析9)Forecasting Method - 预测方法10)Predictive Performance - 预测性能11)Predictive Power - 预测能力12)Prediction Error - 预测误差13)Prediction Interval - 预测区间14)Prediction Model - 预测模型15)Predictive Uncertainty - 预测不确定性16)Forecast Horizon - 预测时间跨度17)Predictive Maintenance - 预测性维护18)Predictive Policing - 预测式警务19)Predictive Healthcare - 预测性医疗20)Predictive Maintenance - 预测性维护•Classification - 分类1)Classification - 分类2)Classifier - 分类器3)Class - 类别4)Classify - 对数据进行分类5)Class Label - 类别标签6)Binary Classification - 二元分类7)Multiclass Classification - 多类分类8)Class Probability - 类别概率9)Decision Boundary - 决策边界10)Decision Tree - 决策树11)Support Vector Machine (SVM) - 支持向量机12)K-Nearest Neighbors (KNN) - K最近邻算法13)Naive Bayes - 朴素贝叶斯14)Logistic Regression - 逻辑回归15)Random Forest - 随机森林16)Neural Network - 神经网络17)SoftMax Function - SoftMax函数18)One-vs-All (One-vs-Rest) - 一对多(一对剩余)19)Ensemble Learning - 集成学习20)Confusion Matrix - 混淆矩阵•Regression - 回归1)Regression Analysis - 回归分析2)Linear Regression - 线性回归3)Multiple Regression - 多元回归4)Polynomial Regression - 多项式回归5)Logistic Regression - 逻辑回归6)Ridge Regression - 岭回归7)Lasso Regression - Lasso回归8)Elastic Net Regression - 弹性网络回归9)Regression Coefficients - 回归系数10)Residuals - 残差11)Ordinary Least Squares (OLS) - 普通最小二乘法12)Ridge Regression Coefficient - 岭回归系数13)Lasso Regression Coefficient - Lasso回归系数14)Elastic Net Regression Coefficient - 弹性网络回归系数15)Regression Line - 回归线16)Prediction Error - 预测误差17)Regression Model - 回归模型18)Nonlinear Regression - 非线性回归19)Generalized Linear Models (GLM) - 广义线性模型20)Coefficient of Determination (R-squared) - 决定系数21)F-test - F检验22)Homoscedasticity - 同方差性23)Heteroscedasticity - 异方差性24)Autocorrelation - 自相关25)Multicollinearity - 多重共线性26)Outliers - 异常值27)Cross-validation - 交叉验证28)Feature Selection - 特征选择29)Feature Engineering - 特征工程30)Regularization - 正则化2.Neural Networks and Deep Learning (神经网络与深度学习)•Convolutional Neural Network (CNN) - 卷积神经网络1)Convolutional Neural Network (CNN) - 卷积神经网络2)Convolution Layer - 卷积层3)Feature Map - 特征图4)Convolution Operation - 卷积操作5)Stride - 步幅6)Padding - 填充7)Pooling Layer - 池化层8)Max Pooling - 最大池化9)Average Pooling - 平均池化10)Fully Connected Layer - 全连接层11)Activation Function - 激活函数12)Rectified Linear Unit (ReLU) - 线性修正单元13)Dropout - 随机失活14)Batch Normalization - 批量归一化15)Transfer Learning - 迁移学习16)Fine-Tuning - 微调17)Image Classification - 图像分类18)Object Detection - 物体检测19)Semantic Segmentation - 语义分割20)Instance Segmentation - 实例分割21)Generative Adversarial Network (GAN) - 生成对抗网络22)Image Generation - 图像生成23)Style Transfer - 风格迁移24)Convolutional Autoencoder - 卷积自编码器25)Recurrent Neural Network (RNN) - 循环神经网络•Recurrent Neural Network (RNN) - 循环神经网络1)Recurrent Neural Network (RNN) - 循环神经网络2)Long Short-Term Memory (LSTM) - 长短期记忆网络3)Gated Recurrent Unit (GRU) - 门控循环单元4)Sequence Modeling - 序列建模5)Time Series Prediction - 时间序列预测6)Natural Language Processing (NLP) - 自然语言处理7)Text Generation - 文本生成8)Sentiment Analysis - 情感分析9)Named Entity Recognition (NER) - 命名实体识别10)Part-of-Speech Tagging (POS Tagging) - 词性标注11)Sequence-to-Sequence (Seq2Seq) - 序列到序列12)Attention Mechanism - 注意力机制13)Encoder-Decoder Architecture - 编码器-解码器架构14)Bidirectional RNN - 双向循环神经网络15)Teacher Forcing - 强制教师法16)Backpropagation Through Time (BPTT) - 通过时间的反向传播17)Vanishing Gradient Problem - 梯度消失问题18)Exploding Gradient Problem - 梯度爆炸问题19)Language Modeling - 语言建模20)Speech Recognition - 语音识别•Long Short-Term Memory (LSTM) - 长短期记忆网络1)Long Short-Term Memory (LSTM) - 长短期记忆网络2)Cell State - 细胞状态3)Hidden State - 隐藏状态4)Forget Gate - 遗忘门5)Input Gate - 输入门6)Output Gate - 输出门7)Peephole Connections - 窥视孔连接8)Gated Recurrent Unit (GRU) - 门控循环单元9)Vanishing Gradient Problem - 梯度消失问题10)Exploding Gradient Problem - 梯度爆炸问题11)Sequence Modeling - 序列建模12)Time Series Prediction - 时间序列预测13)Natural Language Processing (NLP) - 自然语言处理14)Text Generation - 文本生成15)Sentiment Analysis - 情感分析16)Named Entity Recognition (NER) - 命名实体识别17)Part-of-Speech Tagging (POS Tagging) - 词性标注18)Attention Mechanism - 注意力机制19)Encoder-Decoder Architecture - 编码器-解码器架构20)Bidirectional LSTM - 双向长短期记忆网络•Attention Mechanism - 注意力机制1)Attention Mechanism - 注意力机制2)Self-Attention - 自注意力3)Multi-Head Attention - 多头注意力4)Transformer - 变换器5)Query - 查询6)Key - 键7)Value - 值8)Query-Value Attention - 查询-值注意力9)Dot-Product Attention - 点积注意力10)Scaled Dot-Product Attention - 缩放点积注意力11)Additive Attention - 加性注意力12)Context Vector - 上下文向量13)Attention Score - 注意力分数14)SoftMax Function - SoftMax函数15)Attention Weight - 注意力权重16)Global Attention - 全局注意力17)Local Attention - 局部注意力18)Positional Encoding - 位置编码19)Encoder-Decoder Attention - 编码器-解码器注意力20)Cross-Modal Attention - 跨模态注意力•Generative Adversarial Network (GAN) - 生成对抗网络1)Generative Adversarial Network (GAN) - 生成对抗网络2)Generator - 生成器3)Discriminator - 判别器4)Adversarial Training - 对抗训练5)Minimax Game - 极小极大博弈6)Nash Equilibrium - 纳什均衡7)Mode Collapse - 模式崩溃8)Training Stability - 训练稳定性9)Loss Function - 损失函数10)Discriminative Loss - 判别损失11)Generative Loss - 生成损失12)Wasserstein GAN (WGAN) - Wasserstein GAN(WGAN)13)Deep Convolutional GAN (DCGAN) - 深度卷积生成对抗网络(DCGAN)14)Conditional GAN (c GAN) - 条件生成对抗网络(c GAN)15)Style GAN - 风格生成对抗网络16)Cycle GAN - 循环生成对抗网络17)Progressive Growing GAN (PGGAN) - 渐进式增长生成对抗网络(PGGAN)18)Self-Attention GAN (SAGAN) - 自注意力生成对抗网络(SAGAN)19)Big GAN - 大规模生成对抗网络20)Adversarial Examples - 对抗样本•Encoder-Decoder - 编码器-解码器1)Encoder-Decoder Architecture - 编码器-解码器架构2)Encoder - 编码器3)Decoder - 解码器4)Sequence-to-Sequence Model (Seq2Seq) - 序列到序列模型5)State Vector - 状态向量6)Context Vector - 上下文向量7)Hidden State - 隐藏状态8)Attention Mechanism - 注意力机制9)Teacher Forcing - 强制教师法10)Beam Search - 束搜索11)Recurrent Neural Network (RNN) - 循环神经网络12)Long Short-Term Memory (LSTM) - 长短期记忆网络13)Gated Recurrent Unit (GRU) - 门控循环单元14)Bidirectional Encoder - 双向编码器15)Greedy Decoding - 贪婪解码16)Masking - 遮盖17)Dropout - 随机失活18)Embedding Layer - 嵌入层19)Cross-Entropy Loss - 交叉熵损失20)Tokenization - 令牌化•Transfer Learning - 迁移学习1)Transfer Learning - 迁移学习2)Source Domain - 源领域3)Target Domain - 目标领域4)Fine-Tuning - 微调5)Domain Adaptation - 领域自适应6)Pre-Trained Model - 预训练模型7)Feature Extraction - 特征提取8)Knowledge Transfer - 知识迁移9)Unsupervised Domain Adaptation - 无监督领域自适应10)Semi-Supervised Domain Adaptation - 半监督领域自适应11)Multi-Task Learning - 多任务学习12)Data Augmentation - 数据增强13)Task Transfer - 任务迁移14)Model Agnostic Meta-Learning (MAML) - 与模型无关的元学习(MAML)15)One-Shot Learning - 单样本学习16)Zero-Shot Learning - 零样本学习17)Few-Shot Learning - 少样本学习18)Knowledge Distillation - 知识蒸馏19)Representation Learning - 表征学习20)Adversarial Transfer Learning - 对抗迁移学习•Pre-trained Models - 预训练模型1)Pre-trained Model - 预训练模型2)Transfer Learning - 迁移学习3)Fine-Tuning - 微调4)Knowledge Transfer - 知识迁移5)Domain Adaptation - 领域自适应6)Feature Extraction - 特征提取7)Representation Learning - 表征学习8)Language Model - 语言模型9)Bidirectional Encoder Representations from Transformers (BERT) - 双向编码器结构转换器10)Generative Pre-trained Transformer (GPT) - 生成式预训练转换器11)Transformer-based Models - 基于转换器的模型12)Masked Language Model (MLM) - 掩蔽语言模型13)Cloze Task - 填空任务14)Tokenization - 令牌化15)Word Embeddings - 词嵌入16)Sentence Embeddings - 句子嵌入17)Contextual Embeddings - 上下文嵌入18)Self-Supervised Learning - 自监督学习19)Large-Scale Pre-trained Models - 大规模预训练模型•Loss Function - 损失函数1)Loss Function - 损失函数2)Mean Squared Error (MSE) - 均方误差3)Mean Absolute Error (MAE) - 平均绝对误差4)Cross-Entropy Loss - 交叉熵损失5)Binary Cross-Entropy Loss - 二元交叉熵损失6)Categorical Cross-Entropy Loss - 分类交叉熵损失7)Hinge Loss - 合页损失8)Huber Loss - Huber损失9)Wasserstein Distance - Wasserstein距离10)Triplet Loss - 三元组损失11)Contrastive Loss - 对比损失12)Dice Loss - Dice损失13)Focal Loss - 焦点损失14)GAN Loss - GAN损失15)Adversarial Loss - 对抗损失16)L1 Loss - L1损失17)L2 Loss - L2损失18)Huber Loss - Huber损失19)Quantile Loss - 分位数损失•Activation Function - 激活函数1)Activation Function - 激活函数2)Sigmoid Function - Sigmoid函数3)Hyperbolic Tangent Function (Tanh) - 双曲正切函数4)Rectified Linear Unit (Re LU) - 矩形线性单元5)Parametric Re LU (P Re LU) - 参数化Re LU6)Exponential Linear Unit (ELU) - 指数线性单元7)Swish Function - Swish函数8)Softplus Function - Soft plus函数9)Softmax Function - SoftMax函数10)Hard Tanh Function - 硬双曲正切函数11)Softsign Function - Softsign函数12)GELU (Gaussian Error Linear Unit) - GELU(高斯误差线性单元)13)Mish Function - Mish函数14)CELU (Continuous Exponential Linear Unit) - CELU(连续指数线性单元)15)Bent Identity Function - 弯曲恒等函数16)Gaussian Error Linear Units (GELUs) - 高斯误差线性单元17)Adaptive Piecewise Linear (APL) - 自适应分段线性函数18)Radial Basis Function (RBF) - 径向基函数•Backpropagation - 反向传播1)Backpropagation - 反向传播2)Gradient Descent - 梯度下降3)Partial Derivative - 偏导数4)Chain Rule - 链式法则5)Forward Pass - 前向传播6)Backward Pass - 反向传播7)Computational Graph - 计算图8)Neural Network - 神经网络9)Loss Function - 损失函数10)Gradient Calculation - 梯度计算11)Weight Update - 权重更新12)Activation Function - 激活函数13)Optimizer - 优化器14)Learning Rate - 学习率15)Mini-Batch Gradient Descent - 小批量梯度下降16)Stochastic Gradient Descent (SGD) - 随机梯度下降17)Batch Gradient Descent - 批量梯度下降18)Momentum - 动量19)Adam Optimizer - Adam优化器20)Learning Rate Decay - 学习率衰减•Gradient Descent - 梯度下降1)Gradient Descent - 梯度下降2)Stochastic Gradient Descent (SGD) - 随机梯度下降3)Mini-Batch Gradient Descent - 小批量梯度下降4)Batch Gradient Descent - 批量梯度下降5)Learning Rate - 学习率6)Momentum - 动量7)Adaptive Moment Estimation (Adam) - 自适应矩估计8)RMSprop - 均方根传播9)Learning Rate Schedule - 学习率调度10)Convergence - 收敛11)Divergence - 发散12)Adagrad - 自适应学习速率方法13)Adadelta - 自适应增量学习率方法14)Adamax - 自适应矩估计的扩展版本15)Nadam - Nesterov Accelerated Adaptive Moment Estimation16)Learning Rate Decay - 学习率衰减17)Step Size - 步长18)Conjugate Gradient Descent - 共轭梯度下降19)Line Search - 线搜索20)Newton's Method - 牛顿法•Learning Rate - 学习率1)Learning Rate - 学习率2)Adaptive Learning Rate - 自适应学习率3)Learning Rate Decay - 学习率衰减4)Initial Learning Rate - 初始学习率5)Step Size - 步长6)Momentum - 动量7)Exponential Decay - 指数衰减8)Annealing - 退火9)Cyclical Learning Rate - 循环学习率10)Learning Rate Schedule - 学习率调度11)Warm-up - 预热12)Learning Rate Policy - 学习率策略13)Learning Rate Annealing - 学习率退火14)Cosine Annealing - 余弦退火15)Gradient Clipping - 梯度裁剪16)Adapting Learning Rate - 适应学习率17)Learning Rate Multiplier - 学习率倍增器18)Learning Rate Reduction - 学习率降低19)Learning Rate Update - 学习率更新20)Scheduled Learning Rate - 定期学习率•Batch Size - 批量大小1)Batch Size - 批量大小2)Mini-Batch - 小批量3)Batch Gradient Descent - 批量梯度下降4)Stochastic Gradient Descent (SGD) - 随机梯度下降5)Mini-Batch Gradient Descent - 小批量梯度下降6)Online Learning - 在线学习7)Full-Batch - 全批量8)Data Batch - 数据批次9)Training Batch - 训练批次10)Batch Normalization - 批量归一化11)Batch-wise Optimization - 批量优化12)Batch Processing - 批量处理13)Batch Sampling - 批量采样14)Adaptive Batch Size - 自适应批量大小15)Batch Splitting - 批量分割16)Dynamic Batch Size - 动态批量大小17)Fixed Batch Size - 固定批量大小18)Batch-wise Inference - 批量推理19)Batch-wise Training - 批量训练20)Batch Shuffling - 批量洗牌•Epoch - 训练周期1)Training Epoch - 训练周期2)Epoch Size - 周期大小3)Early Stopping - 提前停止4)Validation Set - 验证集5)Training Set - 训练集6)Test Set - 测试集7)Overfitting - 过拟合8)Underfitting - 欠拟合9)Model Evaluation - 模型评估10)Model Selection - 模型选择11)Hyperparameter Tuning - 超参数调优12)Cross-Validation - 交叉验证13)K-fold Cross-Validation - K折交叉验证14)Stratified Cross-Validation - 分层交叉验证15)Leave-One-Out Cross-Validation (LOOCV) - 留一法交叉验证16)Grid Search - 网格搜索17)Random Search - 随机搜索18)Model Complexity - 模型复杂度19)Learning Curve - 学习曲线20)Convergence - 收敛3.Machine Learning Techniques and Algorithms (机器学习技术与算法)•Decision Tree - 决策树1)Decision Tree - 决策树2)Node - 节点3)Root Node - 根节点4)Leaf Node - 叶节点5)Internal Node - 内部节点6)Splitting Criterion - 分裂准则7)Gini Impurity - 基尼不纯度8)Entropy - 熵9)Information Gain - 信息增益10)Gain Ratio - 增益率11)Pruning - 剪枝12)Recursive Partitioning - 递归分割13)CART (Classification and Regression Trees) - 分类回归树14)ID3 (Iterative Dichotomiser 3) - 迭代二叉树315)C4.5 (successor of ID3) - C4.5(ID3的后继者)16)C5.0 (successor of C4.5) - C5.0(C4.5的后继者)17)Split Point - 分裂点18)Decision Boundary - 决策边界19)Pruned Tree - 剪枝后的树20)Decision Tree Ensemble - 决策树集成•Random Forest - 随机森林1)Random Forest - 随机森林2)Ensemble Learning - 集成学习3)Bootstrap Sampling - 自助采样4)Bagging (Bootstrap Aggregating) - 装袋法5)Out-of-Bag (OOB) Error - 袋外误差6)Feature Subset - 特征子集7)Decision Tree - 决策树8)Base Estimator - 基础估计器9)Tree Depth - 树深度10)Randomization - 随机化11)Majority Voting - 多数投票12)Feature Importance - 特征重要性13)OOB Score - 袋外得分14)Forest Size - 森林大小15)Max Features - 最大特征数16)Min Samples Split - 最小分裂样本数17)Min Samples Leaf - 最小叶节点样本数18)Gini Impurity - 基尼不纯度19)Entropy - 熵20)Variable Importance - 变量重要性•Support Vector Machine (SVM) - 支持向量机1)Support Vector Machine (SVM) - 支持向量机2)Hyperplane - 超平面3)Kernel Trick - 核技巧4)Kernel Function - 核函数5)Margin - 间隔6)Support Vectors - 支持向量7)Decision Boundary - 决策边界8)Maximum Margin Classifier - 最大间隔分类器9)Soft Margin Classifier - 软间隔分类器10) C Parameter - C参数11)Radial Basis Function (RBF) Kernel - 径向基函数核12)Polynomial Kernel - 多项式核13)Linear Kernel - 线性核14)Quadratic Kernel - 二次核15)Gaussian Kernel - 高斯核16)Regularization - 正则化17)Dual Problem - 对偶问题18)Primal Problem - 原始问题19)Kernelized SVM - 核化支持向量机20)Multiclass SVM - 多类支持向量机•K-Nearest Neighbors (KNN) - K-最近邻1)K-Nearest Neighbors (KNN) - K-最近邻2)Nearest Neighbor - 最近邻3)Distance Metric - 距离度量4)Euclidean Distance - 欧氏距离5)Manhattan Distance - 曼哈顿距离6)Minkowski Distance - 闵可夫斯基距离7)Cosine Similarity - 余弦相似度8)K Value - K值9)Majority Voting - 多数投票10)Weighted KNN - 加权KNN11)Radius Neighbors - 半径邻居12)Ball Tree - 球树13)KD Tree - KD树14)Locality-Sensitive Hashing (LSH) - 局部敏感哈希15)Curse of Dimensionality - 维度灾难16)Class Label - 类标签17)Training Set - 训练集18)Test Set - 测试集19)Validation Set - 验证集20)Cross-Validation - 交叉验证•Naive Bayes - 朴素贝叶斯1)Naive Bayes - 朴素贝叶斯2)Bayes' Theorem - 贝叶斯定理3)Prior Probability - 先验概率4)Posterior Probability - 后验概率5)Likelihood - 似然6)Class Conditional Probability - 类条件概率7)Feature Independence Assumption - 特征独立假设8)Multinomial Naive Bayes - 多项式朴素贝叶斯9)Gaussian Naive Bayes - 高斯朴素贝叶斯10)Bernoulli Naive Bayes - 伯努利朴素贝叶斯11)Laplace Smoothing - 拉普拉斯平滑12)Add-One Smoothing - 加一平滑13)Maximum A Posteriori (MAP) - 最大后验概率14)Maximum Likelihood Estimation (MLE) - 最大似然估计15)Classification - 分类16)Feature Vectors - 特征向量17)Training Set - 训练集18)Test Set - 测试集19)Class Label - 类标签20)Confusion Matrix - 混淆矩阵•Clustering - 聚类1)Clustering - 聚类2)Centroid - 质心3)Cluster Analysis - 聚类分析4)Partitioning Clustering - 划分式聚类5)Hierarchical Clustering - 层次聚类6)Density-Based Clustering - 基于密度的聚类7)K-Means Clustering - K均值聚类8)K-Medoids Clustering - K中心点聚类9)DBSCAN (Density-Based Spatial Clustering of Applications with Noise) - 基于密度的空间聚类算法10)Agglomerative Clustering - 聚合式聚类11)Dendrogram - 系统树图12)Silhouette Score - 轮廓系数13)Elbow Method - 肘部法则14)Clustering Validation - 聚类验证15)Intra-cluster Distance - 类内距离16)Inter-cluster Distance - 类间距离17)Cluster Cohesion - 类内连贯性18)Cluster Separation - 类间分离度19)Cluster Assignment - 聚类分配20)Cluster Label - 聚类标签•K-Means - K-均值1)K-Means - K-均值2)Centroid - 质心3)Cluster - 聚类4)Cluster Center - 聚类中心5)Cluster Assignment - 聚类分配6)Cluster Analysis - 聚类分析7)K Value - K值8)Elbow Method - 肘部法则9)Inertia - 惯性10)Silhouette Score - 轮廓系数11)Convergence - 收敛12)Initialization - 初始化13)Euclidean Distance - 欧氏距离14)Manhattan Distance - 曼哈顿距离15)Distance Metric - 距离度量16)Cluster Radius - 聚类半径17)Within-Cluster Variation - 类内变异18)Cluster Quality - 聚类质量19)Clustering Algorithm - 聚类算法20)Clustering Validation - 聚类验证•Dimensionality Reduction - 降维1)Dimensionality Reduction - 降维2)Feature Extraction - 特征提取3)Feature Selection - 特征选择4)Principal Component Analysis (PCA) - 主成分分析5)Singular Value Decomposition (SVD) - 奇异值分解6)Linear Discriminant Analysis (LDA) - 线性判别分析7)t-Distributed Stochastic Neighbor Embedding (t-SNE) - t-分布随机邻域嵌入8)Autoencoder - 自编码器9)Manifold Learning - 流形学习10)Locally Linear Embedding (LLE) - 局部线性嵌入11)Isomap - 等度量映射12)Uniform Manifold Approximation and Projection (UMAP) - 均匀流形逼近与投影13)Kernel PCA - 核主成分分析14)Non-negative Matrix Factorization (NMF) - 非负矩阵分解15)Independent Component Analysis (ICA) - 独立成分分析16)Variational Autoencoder (VAE) - 变分自编码器17)Sparse Coding - 稀疏编码18)Random Projection - 随机投影19)Neighborhood Preserving Embedding (NPE) - 保持邻域结构的嵌入20)Curvilinear Component Analysis (CCA) - 曲线成分分析•Principal Component Analysis (PCA) - 主成分分析1)Principal Component Analysis (PCA) - 主成分分析2)Eigenvector - 特征向量3)Eigenvalue - 特征值4)Covariance Matrix - 协方差矩阵。

人工智能基础(习题卷9)

人工智能基础(习题卷9)

人工智能基础(习题卷9)第1部分:单项选择题,共53题,每题只有一个正确答案,多选或少选均不得分。

1.[单选题]由心理学途径产生,认为人工智能起源于数理逻辑的研究学派是( )A)连接主义学派B)行为主义学派C)符号主义学派答案:C解析:2.[单选题]一条规则形如:,其中“←"右边的部分称为(___)A)规则长度B)规则头C)布尔表达式D)规则体答案:D解析:3.[单选题]下列对人工智能芯片的表述,不正确的是()。

A)一种专门用于处理人工智能应用中大量计算任务的芯片B)能够更好地适应人工智能中大量矩阵运算C)目前处于成熟高速发展阶段D)相对于传统的CPU处理器,智能芯片具有很好的并行计算性能答案:C解析:4.[单选题]以下图像分割方法中,不属于基于图像灰度分布的阈值方法的是( )。

A)类间最大距离法B)最大类间、内方差比法C)p-参数法D)区域生长法答案:B解析:5.[单选题]下列关于不精确推理过程的叙述错误的是( )。

A)不精确推理过程是从不确定的事实出发B)不精确推理过程最终能够推出确定的结论C)不精确推理过程是运用不确定的知识D)不精确推理过程最终推出不确定性的结论答案:B解析:6.[单选题]假定你现在训练了一个线性SVM并推断出这个模型出现了欠拟合现象,在下一次训练时,应该采取的措施是()0A)增加数据点D)减少特征答案:C解析:欠拟合是指模型拟合程度不高,数据距离拟合曲线较远,或指模型没有很好地捕 捉到数据特征,不能够很好地拟合数据。

可通过增加特征解决。

7.[单选题]以下哪一个概念是用来计算复合函数的导数?A)微积分中的链式结构B)硬双曲正切函数C)softplus函数D)劲向基函数答案:A解析:8.[单选题]相互关联的数据资产标准,应确保()。

数据资产标准存在冲突或衔接中断时,后序环节应遵循和适应前序环节的要求,变更相应数据资产标准。

A)连接B)配合C)衔接和匹配D)连接和配合答案:C解析:9.[单选题]固体半导体摄像机所使用的固体摄像元件为( )。

复杂网络全局拓扑相似度计算方法实证研究

复杂网络全局拓扑相似度计算方法实证研究

复杂网络全局拓扑相似度计算方法实证研究胡燕祝;权桁;艾新波【摘要】The research on similarity of complex network has important effect on many hot fields, such as chain forecasting, evolution mechanism and community detection. This article defines a new method to compute the similar-ity of different complex network based on the global topological properties. The simulation results show that the method can distinguish different complex network according to their similarity. Thistext also shows that there are three stages in the development process of technology trading by using the method which this text defined in the empirical research, and the similarity of the complex networks in the same stage is obviously higher than the ones those in differ-ent stages which proves that the method this text defined is feasible and effective.%相似度研究对于复杂网络的链路预测、演化机制以及社团检测等相关热门研究领域都具有重要的作用,本文从网络相似度及演化的角度出发,基于提取复杂网络全局拓扑特性,定义了一种新的复杂网络相似度计算方法,仿真结果表明,该相似度计算方法可以准确表征不同复杂网络的相似程度,通过将该方法应用于技术交易中进行实证分析,发现可以将技术交易分为三个不同的阶段,每一阶段内的复杂网络之间相似度明显高于该阶段外的复杂网络,证实了本文提出的相似度计算方法具有可行性与有效性。

INSA de Lyon

INSA de Lyon

Keywords: Text Extraction, image enhancement, binarization, OCR, video Indexing 1
1
Introduage Retrieval and its extension to videos is a research area which gained a lot of attention in the recent years. Various methods and techniques have been presented, which allow to query big databases with multimedia contents (images, videos etc.) using features extracted by low level image processing methods and distance functions which have been designed to resemble human visual perception as closely as possible. Nevertheless, query results returned by these systems do not always match the results desired by a human user. This is largely due to the lack of semantic information in these systems. Systems trying to extract semantic information from low level features have already been presented [10], but they are error prone and very much depend on large databases of pre-defined semantic concepts and their low level representation. Another method to add more semantics to the query process is relevance feedback, which uses interactive user feedback to steer the query process. See [20] and [8] for surveying papers on this subject. Systems mixing features from different domains (image and text) are an interesting alternative to mono-domain based features [4]. However, the keywords are not available for all images and are very dependent on the indexer’s point of view on a given image (the so-called polysemy of images), even if they are closely related to the semantic information of certain video sequences (see figure 1). In this paper we focus on text extraction and recognition in videos. The text is automatically extracted from the videos in the database and stored together with a link to the video sequence and the frame number. This is a complementary approach to basic keywords. The user submits a request by providing a keyword, which is robustly matched against the previously extracted text in the database. Videos containing the keyword are presented to the user. This can also be merged with image features like color or texture. <Figure 1 about here> Extraction of text from images and videos is a very young research subject, which nevertheless attracts a large number of researchers. The first algorithms, introduced by the document processing community for the extraction of text from 2

recognition system

recognition system

A new clustering algorithm based on the chemicalrecognition system of antsNicolas Labroche and Nicolas Monmarch´e and Gilles VenturiniAbstract.In this paper,we introduce a new method to solve theunsupervised clustering problem,based on a modelling of the chem-ical recognition system of ants.This system allow ants to discrim-inate between nestmates and intruders,and thus to create homo-geneous groups of individuals sharing a similar odor by continu-ously exchanging chemical cues.This phenomenon,known as”colo-nial closure”,inspired us into developing a new clustering algorithmand then comparing it to a well-known method such as K-M EANSmethod.Our results show that our algorithm performs better thanK-M EANS over artificial and real data sets,and furthermore our ap-proach requires less initial information(such as number of classes,shape of classes,limitation in the types of attributes handled).1INTRODUCTIONThe efficiency of real ants collective behaviors has led number ofcomputer scientists to create and propose novel and successful ap-proaches to problem solving.For instance,modelling collective be-haviors has been used in the well known algorithmic approach AntColony Optimization(ACO)([4])in which pheromone trails are used.In the same way,other ants-based clustering algorithms have beenproposed([12],[8],[13]).In these studies,researchers have mod-elled real ants abilities to sort objects.Artificial ants may carry oneor more objects and may drop them according to given probabilities.These agents do not communicate directly with each others,but theymay influence each others through the configuration of objects onthefloor.Thus,after a while,these artificials ants are able to con-struct groups of similar objects,a problem which is known as dataclustering.In this paper,we focus on another important real ants collectivebehavior,namely the construction of a colonial odor and its use fordetermining the ants nest membership.As far as we know,this modelhas not yet been applied to any task in problem solving,and we showhere how it can be used in data clustering.The remaining of this article is organized as follows:section de-scribes the main principles of the real ants recognition system.Sec-tion presents thefirst data clustering algorithm that uses this newmodel:A NT C LUST.Section details experimental tests on bench-marks and their comparisons with standard approaches.Finally,sec-tion concludes on future extensions of this new model.2.2Odor ontogenesis and evolutionAt the early stage of their life,young ants,when fed by other colony members,physically impregnate nestmates’odors and learn them as a first Template.Their Label are then only defined by their own ge-netic information.After a short time,ants are able to synthetise their own hydrocarbons and thus can reinforce their Label by spreading their PPG’s content to their cuticle (”individual licking”).The ho-mogeneous sharing of all nestmates odors in a colony is achieved by trophallaxies (an ant decants its PPG contents in an other’s PPG),by ”social licking”(each ant spread a portion of its PPG over the other’s cuticle)or by simple contacts (only cuticular substances are exchanged).3Clustering algorithm A NT C LUST 3.1The clustering problemIn this paper,we focus on the unsupervised clustering problem in which the goal is to find groups of similar objects as close as possible to the natural partition of the given data set.No assumptions are made about the representation of the objects.They may be described with numerical or symbolic values or with first order logic.All we need here is the definition of a similarity measure which takes a couple of objects and as input and outputs a value between and .Value means that the two objects are totally different,means that they are identical.3.2Main principles of A NT C LUST algorithmThe main idea in this new model is as follows:one object is assigned to each artificial ant and represents the genetic part of the ant’s odor.We detail hereafter how its Label and its Template are represented (see 1).etic r,A cce pt h re sh oldsF inal cla ss i ftiona b c Figure 1.Principles of A NT C LUST .Labels and Templates are represented in a 2D-space for a better understanding.In a.ants have no Label and are just described by their Genetic odor.In b.the first labels have been computed by the algorithm.In c.the final classification groups in the same nest,theants that share a similar Label.For one ant ,we define the following parameters:The Label is determined by the belonging nest of the ant and is simply coded by a number,representative of the nest.At the beginning,ants are not under the influence of any nest,soequals .This Label will evolve over time until each ant has found its best nest.The Template is defined half by the genetic odor of the ant and half by an acceptance threshold .The first one corresponds to an object of the data set and can not evolve during the algorithm.The latter is learned during an initialization phase,similar to the real ants ontogenesis period,in which each artifi-cial ant will meet others and each time will evaluate the genetic odors similarities.The resulting Template thresholdis a function of all the similarities observed during this period.This acceptance threshold is dynamic and is updated after every encounters realised by ant .An estimator that reflects if the ant is successful during its meetings with all encountered ants or not.Since a young ant hasnot realised any meeting,at time .estimates the size of the nest to which belongs to (i.e.ants with the same).is simply increased when ant meets an other ant with the same Label and decreased when Labels are different (see section 3.4.1).An estimator which measures how well accepted is ant in its nest.It is increased when ant meets another ant with the same Label and when both ants accept each other and decreased when there is no acceptance between ants (see section 3.4.1).An age which,at the beginning,equals and is used when updating acceptance threshold.Estimates of the maximal similarity and mean similarity(1)Initializeiterations during which two ants,that are randomly chosen,meet (7)Deleteeach ant having no more nest to the nest ofthe most similar ant found that have a nest.3.3Initialization of young antsWe have copied the creation phase of young artificial ants from the ontogenesis period of biological ants,during which they learn a tem-plate of their colony odor,that will allow them to accept or reject encountered ants.Thus,we consider that the template can be defined as an accep-tance threshold,for each ant,that will be learned during a given number of random meetings.At the end of this period,the ant possesses values of mean and maximal similarities,which are used to define the Template at the beginning as shown in the following equation(1).(1) During its meetings,the ant will progressively make its mean and maximal similarities values evolve,in order to continuously update its Template threshold according to equation(1).3.4Ants meetings resolutionThe crucial point of our method concerns the resolution of meetings. It allows ants to share a common Label with individuals with com-patible templates.We consider thereafter two ants and.We define that there is acceptance(or recognition)between and(seefigure 2):(2)etic i etic jT e mpla iT e mpla jA cce p t aetic i etic jT e mpla iT e mpla jR e j o nFigure2.Principle of acceptance and rejection between two ants and3.4.1Behavioral rules associated with meetingsNew nest creation:If and Acceptance(,)Then Create a new Label and.If is false then rule is applied.Adding an ant with no Label to an existing nest:If and Acceptance(,)Then.The case()is handled in a similar way.”Positive”meeting between two nestmates:If and Acceptance(,)Then Increase and.By increasing(3)or decreasing(4)a variable we respectively mean:(3)(4) (Here we choose,because this is useful to”track”those evolving quantities)”Negative”meeting between two nestmates:If and Acceptance(,)=Then Increase and Decrease and.The ant(=,=)which possesses the worst integration in the nest()loses its Label and thus has no more nest(,and).Meeting between two ants of different nests:If and Acceptance(,)Then Decreaseand.The ant with the lowest(i.e.ant belonging to the smallest nest)changes its nest and belongs now to the nest of the encountered ant.Default rule:If no other rule applies then nothing happens.3.4.2Analysis of behavioral rulesThis section aims at briefly describing the rules mentioned herebe-fore.The rule has a fundamental role because it is the only creative rule in the method.No other rule can generate a new Label for a new nest.It causes the gathering of similar ants in the veryfirst clusters. The latter will be used as”seeds”to generate thefinal clusters.Ac-cording to this rule,a cluster contains at least two objects.The rule enlarges the cluster by adding an ant with no nest to a nest in which there exists a similar ant.The rule simply increments the estimators and in case of acceptance between the two ants.The rule permits to remove ants that were accepted when the nest profile was not clearly defined because there were not enough ants in it.By means of this rule,the worst integrated ants in a nest can be rejected and then reset.This allows bad or not optimally clustered objects to change their belonging cluster,which improves the results of the algorithm.The rule is also very important because it allows the gather-ing of similar clusters with one bigger than the other,the small one being absorbed by the other one.In fact,at the beginning,there are lots of clusers and this rule significantly decreases their number by gathering small sub-clusters into one bigger one.The rule happens when no other rule applies.4Experiments and resultsIn this section,we compare our method A NT C LUST to the K-M EANS algorithm[7].The latter is initialized with clusters ran-domly generated,so we will refer to it as10-M EANS hereafter.Be-fore detailling the settings of experiments,the benchmarks used for evaluation must be introduced.4.1Benchmarks and Experimental settingsIn order to test and compare the clustering abilities of the two meth-ods,we use randomly generated and real data sets attributes-basedrepresentations.For more detail on the data sets,see[13].Namely, there are:A RT as artificial data sets and for real ones:I RIS, G LASS,P IMA,S OYBEAN and T HYROID.Concerning artificial data sets,A RT1,A RT2,A RT3,A RT5and A RT6are generated by gaus-sian laws with different difficulties(irrelevant attributes,clusters overlap),A RT4data set is generated by a uniform law andfinally A RT7and A RT8correspond to white noise.The main characteristics of data are summarized in table1.All evaluations have been conducted over runs for each data set and each method.Concerning A NT C LUST,each test corre-sponds to iterations.During each of these iterations,two randomly chosen ants meet.Results are shown in table2.The fol-lowingfields are introduced in the table1for each datafile:the number of objects(”#Objects”)and their associated number of at-tributes(”#Attributes”),the number of clusters expected to be found in the data(”#Clusters”).Datas#Attributes400410002110042002900940041001100011503214779824742153Table1.Main characteristics of the data sets.4.2Similarity between objects and clustering errorevaluationThis section introduces two major definitions to help understand the results that are developped hereafter.The following equations present on the one hand,how we compute similarity and on the other hand, the mathematical expression used to compute the clustering error during our evaluations.Considering the similarity definition,each object is represented by a set of attributes,each of them having a data type among the existing data types(i.e.numeric,symbolic,...).Global similarity between two objects and can then be defined:(6)where is the similarity computed between all the attributes of type for both objects and,the number of times that data type is used to describe an object andfinally a function that computes the dissimilarity between two attributes of the compared objects and having data type.Description of thefunctions won’t be too detailled,but we formalize hereafter the(7)and(8)respectively,for a pair of numeric or symbolic values:if(9)where:0if1else(10)with the expected cluster identifiant for object in the original data and the cluster found by the evaluated algorithm.4.3ResultsThe table2shows the number of clusters effectively found by both methods(”#Clusters Found”)with the standard deviation(””) andfinally the error generated by both algorithms(”%Clustering Er-ror”)associated with its standard deviation too(””).Datas#Clusters Found%Clustering Error[][] Art1 4.00[0.00]0.18[0.02]8.52[0.96]0.38[0.01]Art3 2.00[0.00]0.15[0.02]6.38[0.75]0.32[0.02]Art5 3.28[0.45]0.28[0.03]8.46[1.08]0.10[0.02]Art7 3.28[0.45]0.66[0.02]8.78[0.83]0.88[0.01]Iris 2.16[0.37]0.22[0.01]9.44[0.70]0.29[0.02]Pima 2.66[0.56]0.45[0.01]8.82[0.97]0.13[0.02]Thyroid 2.88[0.33]0.18[0.06]Table2.Results obtained after50runs of each method applied over eachdata.Our algorithm A NT C LUST performs better than the10-M EANS method.It seems to be mainly because A NT C LUST manages to have, in general,a better appreciation of the number of clusters in the data. 10-M EANSfinds too many clusters because it starts from10and it does not manage to reduce this number because of the too little dif-ference existing between the objects.In fact,10-M EANS performs better than A NT C LUST only twice:for A RT5and G LASS,because the number of clusters expected is quite near.These results show that A NT C LUST can treat small to big sets of data with a great suc-cess(see S OYBEAN,A RT1,A RT2and A RT6)but also demonstrate, that A NT C LUST does not manage tofind a good partition when animportant number of clusters is expected (see A RT 5for instance).This may be due to the fact that there is only one rule that can create a new nest.This rule mainly applies at the beginning of the algorithm because even when an ant is ejected from its nest,it often remains alone and thus can not create a new nest,as two ants are needed.We study the influence of the number of iteration to see if A NT C LUST could be enhanced.The number of iteration defines the number of possible meetings between the ants so it is a very restric-tive parameter.Some tests allow us to verify that the number of iter-ations,that was inititially set to ,could be set to with similar results as shown in figure 3.With iterations or less,our algorithm does not manage to maintain the quality of the results because there is a large variability in clustering error and in the num-ber of clusters found,especially for A RT 1,A RT 2and A RT 5.In fact,when the number of iterations is no sufficient A NT C LUST generates too many clusters:this may be because the ants do not have enough time to gathercorrectly.Figure 3.Success of A NT C LUST for several number of iterations.Successis equal to clustering error.5ConclusionIn this paper we describe a new model of the ant recognition systemand its first application to the unsupervised clustering problem.Re-sults are good when compared to those of the 10-M EANS algorithm.Our approach does not make any assumption about the nature of the data to be clustered and does not require an initial partition or an ini-tial number of classes.This allows us to test our method in numerous application fields.The first one will be the web mining problem and more precisely the study of the behaviour of Internet users,because of the growing necessity for such tools for webmasters and because it provides a huge source of data.We are currently working on a new version of A NT C LUST that allows the user to see the generation of the nests in a 2D-space in real time.This version will rely on an other modelling of the Label and its evolution which will be more accurate.In fact,there are numerous ways left to adapt the mathematical model of the ants recognition system to the unsupervised clustering problem.REFERENCES[1]R Boulay and A.Lenoir,‘Social isolation of mature workers affectsnestmate recognition in the ant camponotus fellah ’,Behavioral Pro-cesses ,55,67–73,(2001).[2]N.F.Carlin and B.H¨o lldobler,‘The kin recognition system of carpenterants(camponotus spp.).i.hierarchical cues in small colonies’,Behav Ecol Sociobiol ,19,123–134,(1986).[3] D.Cliff,P.Husbands,J.A.Meyer,and Stewart W.,eds.Third Interna-tional Conference on Simulation of Adaptive Behavior:From Animals to Animats 3.MIT Press,Cambridge,Massachusetts,1994.[4] A.Colorni,M.Dorigo,and V .Maniezzo,‘Distributed optimization byant colonies’,in Proceedings of the First European Conference on Ar-tificial Life ,eds.,F.Varela and P.Bourgine,pp.134–142.MIT Press,Cambridge,Massachusetts,(1991).[5] D.Fresneau and C.Errard,‘L’identit´e coloniale et sa reprsentation chezles fourmis’,Intellectica ,2,91–115,(1994).[6] B.H¨o lldobler and E.O.Wilson,The Ants ,chapter Colony odor and kinrecognition,197–208,Springer Verlag,Berlin,Germany,1990.[7] A.K.Jain and Dubes R.C.,Algorithms for clustering Data ,chapterSquare-Error Clustering Method,96–101,Prentice Hall Advanced Ref-erence series,1988.[8]P.Kuntz and D.Snyers,‘Emergent colonization and graph partitioning’,In Cliff et al.[3],pp.494–500.[9]broche,N.Monmarch´e , A.Lenoir,and G.Venturini,‘Mod´e lisation du syst`e me de reconnaissance chimique des fourmis’,Rapport interne,Laboratoire d’Informatique de l’Universit´e de Tours,E3i Tours,(january 2002).45pages.[10]hav,V .Soroker,R.K.Vander Meer,and A.Hefetz,‘Nestmaterecognition in the ant cataglyphis niger :Do queens matter?’,Behav Ecol Sociobiol ,43,203–212,(1998).[11] A.Lenoir,D.Cuisset,and A.Hefetz,‘Effects of social isolation onhydrocarbon pattern and nestmate recognition in the ant aphaenogaster senilis ’,Insectes soc.,48,101–109,(2001).[12] E.D.Lumer and B.Faieta,‘Diversity and adaptation in populations ofclustering ants’,In Cliff et al.[3],pp.501–508.[13]N.Monmarch´e ,M.Slimane,and G.Venturini,‘On improving clus-tering in numerical databases with artificial ants’,in Lecture Notes inArtificial Intelligence ,eds.,D.Floreano,J.D.Nicoud,and F.Mondala,pp.626–635,Swiss Federal Institute of Technology,Lausanne,Switzer-land,(13-17September 1999).Springer-Verlag.。

纹理物体缺陷的视觉检测算法研究--优秀毕业论文

纹理物体缺陷的视觉检测算法研究--优秀毕业论文

摘 要
在竞争激烈的工业自动化生产过程中,机器视觉对产品质量的把关起着举足 轻重的作用,机器视觉在缺陷检测技术方面的应用也逐渐普遍起来。与常规的检 测技术相比,自动化的视觉检测系统更加经济、快捷、高效与 安全。纹理物体在 工业生产中广泛存在,像用于半导体装配和封装底板和发光二极管,现代 化电子 系统中的印制电路板,以及纺织行业中的布匹和织物等都可认为是含有纹理特征 的物体。本论文主要致力于纹理物体的缺陷检测技术研究,为纹理物体的自动化 检测提供高效而可靠的检测算法。 纹理是描述图像内容的重要特征,纹理分析也已经被成功的应用与纹理分割 和纹理分类当中。本研究提出了一种基于纹理分析技术和参考比较方式的缺陷检 测算法。这种算法能容忍物体变形引起的图像配准误差,对纹理的影响也具有鲁 棒性。本算法旨在为检测出的缺陷区域提供丰富而重要的物理意义,如缺陷区域 的大小、形状、亮度对比度及空间分布等。同时,在参考图像可行的情况下,本 算法可用于同质纹理物体和非同质纹理物体的检测,对非纹理物体 的检测也可取 得不错的效果。 在整个检测过程中,我们采用了可调控金字塔的纹理分析和重构技术。与传 统的小波纹理分析技术不同,我们在小波域中加入处理物体变形和纹理影响的容 忍度控制算法,来实现容忍物体变形和对纹理影响鲁棒的目的。最后可调控金字 塔的重构保证了缺陷区域物理意义恢复的准确性。实验阶段,我们检测了一系列 具有实际应用价值的图像。实验结果表明 本文提出的纹理物体缺陷检测算法具有 高效性和易于实现性。 关键字: 缺陷检测;纹理;物体变形;可调控金字塔;重构
Keywords: defect detection, texture, object distortion, steerable pyramid, reconstruction
II

一种新的基于聚类决策的码本更新算法

一种新的基于聚类决策的码本更新算法

一种新的基于聚类决策的码本更新算法谢蒙;易法令;杨松润;吴裕伟【期刊名称】《计算机技术与发展》【年(卷),期】2013(000)003【摘要】In the field of computer vision,the codebook algorithm is to extract the moving targets in the video through the background model,which is constituted by the background color-value ranges. Since the codebook background model's learning takes some time,and the unstable background requires any updates in real time,the analysis of the moving targets has the poor results. It presents a new code-book-updated algorithm based on the clustering decision,which through building the video frame queue clusters the frames of the queue in sequence according to the corresponding rules,can update the codebook step by step in the process of object extraction,effectively sol-ving the problem of the codebook real time. The results show that the method has good results for analysis of the moving target in the un-stable background environment.% 在计算机视觉领域中,码本算法是一种通过记录背景颜色值范围构成背景模型,以此提取出视频中运动目标的算法,由于码本背景模型的学习需要花费一段时间,并且对于背景不稳定的场景,需要实时更新,因而造成了运动目标分析效果的下降。

基于群组与密度的轨迹聚类算法

基于群组与密度的轨迹聚类算法

第47卷第4期Vol.47No.4计算机工程Computer Engineering2021年4月April2021基于群组与密度的轨迹聚类算法俞庆英1,2,赵亚军1,2,叶梓彤1,2,胡凡1,2,夏芸1,2(1.安徽师范大学计算机与信息学院,安徽芜湖241002;2.安徽师范大学网络与信息安全安徽省重点实验室,安徽芜湖241002)摘要:现有基于密度的聚类方法主要用于点数据的聚类,不适用于大规模轨迹数据。

针对该问题,提出一种利用群组和密度的轨迹聚类算法。

根据最小描述长度原则对轨迹进行分段预处理找出具有相似特征的子轨迹段,通过两次遍历轨迹数据集获取基于子轨迹段的群组集合,并采用群组搜索代替距离计算减少聚类过程中邻域对象集合搜索的计算量,最终结合群组和密度完成对轨迹数据集的聚类。

在大西洋飓风轨迹数据集上的实验结果表明,与基于密度的TRACLUS轨迹聚类算法相比,该算法运行时间更短,聚类结果更准确,在小数据集和大数据集上的运行时间分别减少73.79%和84.19%,且运行时间的减幅随轨迹数据集规模的扩大而增加。

关键词:群组;密度;群组可达;邻域搜索;轨迹聚类开放科学(资源服务)标志码(OSID):中文引用格式:俞庆英,赵亚军,叶梓彤,等.基于群组与密度的轨迹聚类算法[J].计算机工程,2021,47(4):100-107.英文引用格式:YU Qingying,ZHAO Yajun,YE Zitong,et al.Trajectory clustering algorithm based on group and density[J]. Computer Engineering,2021,47(4):100-107.Trajectory Clustering Algorithm Based on Group and DensityYU Qingying1,2,ZHAO Yajun1,2,YE Zitong1,2,HU Fan1,2,XIA Yun1,2(1.School of Computer and Information,Anhui Normal University,Wuhu,Anhui241002,China;2.Anhui Provincial Key Laboratory of Network and Information Security,Anhui Normal University,Wuhu,Anhui241002,China)【Abstract】The existing density-based clustering methods are mainly used for point data clustering,and not suitable for large-scale trajectory data.To address the problem,this paper proposes a trajectory clustering algorithm based on group and density. According to the principle of Minimum Description Length(MDL),the trajectories are preprocessed by segments to find out the sub trajectories with similar characteristics.The group set based on the sub trajectories is obtained by traversing the trajectories dataset twice,and the group search is used to replace the distance calculation to reduce the calculation amount required for the neighborhood object set search in the clustering process.Finally,the trajectory data set is clustered by combining the group and density.Experimental results on Atlantic hurricane track dataset show that,compared with the density-based TRACLUS track clustering algorithm,the running time of the proposed algorithm is less and the clustering results are more accurate.The running time on the small dataset and large dataset is reduced by73.79%and84.19%respectively,and the reduction of running time increases with the expansion of track dataset.【Key words】group;density;group reachability;neighborhood search;trajectory clusteringDOI:10.19678/j.issn.1000-3428.00574250概述随着定位、通信和存储技术的快速发展,车辆行驶轨迹数据、用户活动轨迹数据以及飓风轨迹数据等大量移动对象的轨迹数据可被搜集和存储。

基于遗传改进蚁群聚类算法的电力客户价值评价

基于遗传改进蚁群聚类算法的电力客户价值评价

1 电力客户价值评价指标体系
客户终生价值 (customer lifetime value , CLV) 包含企业与客户在整个生命周期内交易所能获得 的全部收益。目前对 CLV 的研究主要从企业、客 户和企业–客户 3 个视角出发; 而从企业视角对 CLV 进行研究是企业进行客户细分的重要标准 [13] 。因 此本文从企业视角对电力客户价值进行评价。 从企业视角来看,CLV 包括当前价值和潜在 价值 2 个方面:当前价值指客户当前消费模式不变 的情况下,在未来会给企业带来的价值;潜在价值 指通过调动客户购买积极性或向别人推荐产品和 服务等会给企业带来的价值。根据已有的参考资 本文构建 料[2-5,13],并考虑电力客户实际运营情况, 了电力客户价值评价指标体系,如图 1 所示。
ABSTRACT: It is an important procedure for service resource allocation of power supply enterprises to evaluate power consumer value. Based on the analysis on ant colony clustering algorithm (ACCA) and in allusion to the blindness in the setup of parametric combination of ACCA, its low convergence speed and easily falling into local convergence, a new method to evaluate power customer value, in which ACCA is improved by genetic algorithm (GA), is proposed. In the proposed method, the parameters of ACCA are optimized by GA, and then the clustering evaluation of power customer value is performed. Results of case study show that the clustering performance of the proposed method is evidently enhanced and the convergence is speeded up and local convergence can be avoided, in addition, the subjective factor during the evaluation is decreased. The proposed method is applied to evaluate ten industrial customers of a certain urban power supply company, and the evaluation results show that the proposed method is accurate, efficient and practicable. The features of various types of power customers are summarized and some suggestions on optimal allocation of service resources of power supply enterprises are put forward. KEY WORDS: power customer value; evaluation index system; ant colony clustering algorithm (ACCA); ACCA optimized by genetic algorithm; service resources optimization 摘要: 对电力客户价值进行评价是供电企业优化服务资源配 置的重要步骤。 分析了蚁群聚类算法, 并针对蚁群聚类算法 进行评价时参数组合设置盲目性、 收敛速度慢、 容易陷入局 部收敛的缺点, 提出了运用遗传算法改进蚁群聚类算法评价 电力客户价值的新方法。 该新方法利用遗传算法对蚁群聚类 算法的参数进行优化,进而再对电力客户价值进行聚类评 价。通过实例验证表明,该新方法聚类性能有较大的提升, 能够提升收敛速度和避免陷入局部收敛, 并且减少了聚类评

在线社交网络中影响节点的可视化(IJCNIS-V6-N5-2)

在线社交网络中影响节点的可视化(IJCNIS-V6-N5-2)

I.J. Computer Network and Information Security, 2014, 5, 9-20Published Online April 2014 in MECS (/)DOI: 10.5815/ijcnis.2014.05.02Visualization of Influencing Nodes in OnlineSocial NetworksPrajit Limsaiprom, Prasong Praneetpolgrang, Pilastpongs SubsermsriSchool of Information Technology, Sripatum University, Bangkok 10900, Thailandcrossprajit@, prasong.pr@spu.ac.th, pilastpongs@Abstract—The rise of the Internet accelerates the creation of various large-scale online social networks. The online social networks have brought considerable attention as an important medium for the information diffusion model, which can be described the relationships and activities among human beings. The online social networks’ relationships in the real world are too big to present with useful information to identify the criminal or cyber attacks. The methodology for information security analysis was proposed with the complementary of Cluster Algorithm and Social Network Analysis, which presented anomaly and cyber attack patterns in online social networks and visualized the influencing nodes of such anomaly and cyber attacks. The closet vertices of influencing nodes could not avoid from the harmfulness in social networking. The new proposed information security analysis methodology and results were significance analysis and could be applied as a guide for further investigate of social network behavior to improve the security model and notify the risk, computer viruses or cyber attacks for online social networks in advance.Index Terms—visualization, influencing nodes, anomaly and cyber attacks, online social networks, clustering, social network analysis.I.I NTRODUCTIONNew services will widen the Internet into an interactive medium for online social networks (Web 2.0) such as Facebook, MySpace, Twitter, Hi5, Google, etc. The online social network services increase the benefits of users, however these services may cause additional risks for users due to their trusted these online social networks. The attackers, spammers and scammers may have the opportunities to exploit their information easily or construct convincing social engineering attacks with all their data.Many organizations have become vulnerable from the intrusion of cyber attackers that compromise the security of their networks. The security incidents comprise the violation of the given security policies. To be able to detect violations of the security, it must be able to observe all activities that could potentially be part of such violations [1]. Many computer security techniques have been intensively studied in the last decade to defend against various cyber attacks and computer viruses, namely cryptography, firewalls, anomaly and intrusion detection [2], [3], [4], [5], [6].Online social networks can also play an important role as a medium for the spread of information. For example, innovation, hot topics and even malicious rumors can propagate through online social networks among individuals, and computer viruses can diffuse through email networks. Social Network Analysis (SNA) has become a powerful methodological tool alongside with statistics. The social network analysis is an approach and a set of techniques, which can use for studying the exchange of resources among actors (i.e., individuals, groups, or organizations). Regular patterns of information exchange reveal themselves as online social networks. The actors are nodes in the networks and information exchange relationships are connectors between nodes [7].The previous research paper was studied about computer virus distribution in online social networks and found that the users who visited MySpace was the first order40.58%, next was Hi5 35.12% and Facebook was the last one 24.30%. The virus behavior analysis was JS/PackRedir. A! tr. dldr90.36% from private IP address 172.16.10.96. The result of correlation analysis between the usage of online social networks and virus distribution was significant at the 0.01 level (2-tailed) [8].In this paper, we have carried out the anomaly and cyber attack patterns in online social networks, identified and visualized the influencing nodes of such anomaly and cyber attack patterns in online social networks.The organization of this paper as follows: Section I demonstrates about the background. Section II reviews the related works. Section III introduces the proposed approach. Section IV explains about the researching methodologies. Section V presents the data source that used in this research. Section VI illustrates the result and analysis. Section VII presents the discussion. Finally, section VIII shows the conclusion and references.II.R ELATED W ORKSData Mining has been interested in various research applications such as improved the performance of intrusion detection system, detected the anomaly and attack patterns, classified network behaviors, etc. Recently, many researchers have interested to propose the frameworks or the methods in data mining to improve the performance of intrusion detection system:C. Azad et al., studied many research papers to classify the area of interesting in Data Mining. Anomaly detection was interested in first priority that presented 67% [9]. R. Smith et al., introduced a new clustering algorithm based on the neural network architecture called an autoassociator (AA) [10].D. Fu et al., presented the improved association analysis algorithm based on FP-Growth and FCM network intrusion detection technologies based on statistical binning [11]. The result of this research found the intrusion activity in time and solved the problem of data mining speed effectively; enhanced the detective ability of intrusion detection. J. Song et al., presented an approach that was different from the traditional detection models based on raw traffic data. The proposed method could extract unknown activities from IDS [12].Data Mining was used to detect the anomaly and attack patterns: M. Bordie et al., developed a data-mining system with an integrated set of analysis methods that used to analyze a large amount of IDS log data to discover interesting, previous unknown and actionable information [13]. They presented an architecture and data mining process that used a set of integrated tools including visualization and data analysis and utilized several data-mining algorithms, including temporal association, event burst, and clustering, to discover valuable patterns. H. Yang et al., presented an anomaly detection approach based on clustering and classification for intrusion detection [4]. They used connections obtained from raw packet data of the audit trail and then performed clustering to group training data points into cluster, from which selected some clusters as normal and non-attack profile.The influence-based classification algorithm was used to classify network behaviors. Z. Qi et al., proposed k-walks in (1+A) k matrix for studying the connective relationship of a network. They normalized the walker vector of each node into unit n-dimensional unit spherical surface, and then they computed geodesic lines between two nodes on the spherical surface, and got the best part of the community by k-means clustering method [14]. Z. Jiliu et al., used the criminal data mining by applying data mining technology and graph theory to Social Network Analysis (SNA) in crime groups or terrorist organizations [15]. They proposed a Crime Group Identification Model (CGIM) based on time step by calculating the group similarity and the group members aggregation under different time steps. Besides, Social network is a phenomenon of the interaction among the people in a group, community or Internet world. It can visualize as a graph, where a vertex corresponds to a person in that group and an edge represents some kind of connections between the corresponding persons. Online social networks are interested in various research areas as well such as social network analysis, searching and discovering mechanisms in Social Networks, information diffusion, the problem of information dissemination, etc.Some researchers have interested in the applications of social network analysis: A. Apolloni et al., used a probabilistic model to determine whether two people would converse about a particular topic based on their similarity and familiarity [16]. H. Xia et al., analyzed the key factors influencing knowledge diffusion and innovation through quantitative analysis about network density, centrality and the cohesive subgroup [17].In addition, S.H. Sharif et al., reviewed on search and discovery mechanisms in Social Networks. They classified the existing methods into four categories: people search, job search, keyword search and web service discovery [18]. S. Sharma et al., presented a centrality measurement and analysis of the social networks for tracking online community by using betweenness, closeness and degree centrality measures [19].The approaches and techniques were proposed in order to study of information diffusion: C. Haythornthwaite introduced social network analysis as an approach and a set of techniques for the study of information diffusion [7]. T. Fushimi et al., attempted to answer a question “What does an information diffusion model tell about social network structure?”. They proposed a new scheme for an empirical study to explore the behavioral characteristics of information diffusion model [20]. C.T. Butts applied the notion of algorithmic complexity to the analysis of social network, based on theoretical motivation regarding constraints on the graph structure [21]. C. Lo Storto presented an approach useful to analyze the performance of the product development process [22].Some researchers have concentrated on the problem of information diffusion in social networks: M. Kimura et al., addressed the problem of efficiently estimating the influence function of initially activated nodes in a social network under the susceptible/infected/susceptible (SIS) model [23]. A. Pablo et al., solved the influence maximization problem in social networks with greedy algorithm [24]. They considered two information diffusion models, Independent Cascade Model and Linear Threshold Model. Their proposed algorithm compared with traditional maximization algorithms such as simple greedy and degree centrality using three data sets.III.P ROPOSED A PPROACHBased on the objectives of this research, a new approach was proposed with the complementary of cluster algorithm and social network analysis.The clustering technique was proposed to detect informative patterns of cyber attacks in online social networks. Consequently, the centroid of the cluster could use to represent the member within each cluster and discover attacks in the audit trail.Visualization the influencing nodes of anomaly and cyber attacks in online social networks, active nodes would be discovered by social network analysis (SNA) into two classes-Influencing nodes (Influencing Nodes:IN) and other nodes (Other Nodes: ON), IN was a class which indicated the influencing nodes and ON was a class which indicated the other nodes. The proposed architecture of this research presented as shown in Fig.1. The software, namely WEKA was used to analyze the IDS data streams of online social networks and present anomaly and cyber attack patterns. The software, namely Pajek was used to analyze and visualize the influencing nodes with high degrees’ nodes in onlin e social networks’ relationships with 7,035 clients and 25,778 edges of data in April 2012.Fig 1: Architecture of the proposed systemIV.M ETHODOLOGIESThe new approach was proposed in this research, which developed with the complementary of cluster algorithm and Social Network Analysis. The cluster algorithm was used to analyze IDS logs data and discover anomaly and cyber attack patterns in online social networks. Social network analysis (SNA) was used to identify and visualize the influencing nodes of anomaly and cyber attacks in online social networks, which could be applied in many applications such as identifying the structure of a criminal network or monitoring and controlling information diffusion, or cyber attacks for secure computer systems in network applications.A. Clustering AlgorithmClustering is a useful technique for the intrusion detection of online social networks as malicious activities will cluster together, separate itself from non-malicious activities.The clustering algorithm will group the similar objects from different objects by considering between similarity measure and distance measure. There are two most clustering analysis models, which are hierarchical cluster analysis and non-hierarchical cluster analysis. The hierarchical cluster analysis is included Agglomerative method and Division method while the method of non-hierarchical cluster analysis is K-Means clustering. The cluster analysis with distance measure and K-Means clustering method was used in this research.Formally, a cluster analysis can be described as the partitioning of a number N of classifying objects-a number of patterns with an endless dimension P in K groups or clusters {Ck; k = 1,…, K}. Given N objects X = {xi, i = 1,…, N}, where xi, j denotes the jth element of xi, the grouping of all objects with index i = 1,…, N in clusters k = 1,…, K can be defined as follows:(1) The association of each object to a cluster is unique by applying two conditions for the matrix.(2)Furthermore, let the following definition denominate the number of objects belonging to a cluster Ck:(3) The Euclidean distance was used as a similarity measure function to identify the similarity of two feature vectors. Whether a set of points is close enough to be considered a cluster. The distance measure D (x, y) was used to tell how far points x and y are. Often, the points may think to live in k-dimensional Euclidean space, and the distance between any two points, say asx= [x1, x2,…, x k] and y = [y1, y2,…, yk] (4)(5)The correlation coefficient provides another possibility to measure similarities between two classification objects:(6)B. Social Network Analysis (SNA)Social network analysis is based on the principles of graph theory, which consists of a set of mathematical formulae and concepts in the study of patterns of lines. Actors are the points in the graph, and relationships are the lines between actors, Sociograms are graphs of socialnetworks. Centrality is the extent to which a person is in the center of a network. Central people have more influence in their network than people who are less central. Measures of centrality include degree, betweenness and closeness centrality.1. A network is represented by a matrix that calledthe adjacency matrix A, which in the simplestcase is a (n x n) symmetric matrix, where n is thenumber of nodes in the network. The adjacencymatrix has elements.(7) 2.Let a graph G= (V, E) represents the graph ofonline social networks where vertices Vcorresponds to contact nodes (users) in thenetwork, and edges E corresponds to theinformation sending events among users.3.The degree of a vertex in a network is the numberof edges attached to it. In mathematical terms, thedegree Di of a vertex i is:(8) 4.The betweenness Ba of a node a is defined as thenumber of geodesics (shortest paths between twonodes) passing through it:(9)Where indicates whether the shortest pathbetween two other nodes i and j passes throughnode a.5.The Closeness Ca is the sum of the length ofgeodesics between a particular node a and all theother nodes in a network. It actually measureshow far away one node is from other nodes andsometimes called farness. Where l(i, a) is thelength of the shortest path connecting nodes i anda:(10) 6.For node i, we definitely InDg (I) as the uniquenumber of edges are sent to node i.(11)7.For node i, we definitely OutDg (I) as the uniquenumber of edges are received from node i.(12) 8.The total unique number of edges is sent to thenode i and the number of edges are received fromnode I definitely in TotalDg(i)(13)V.D ATA S OURCESThe organization in this research is the public health sector, which concerns about high privacy and security of information system and needs to avoid from the harmful of network applications. The event log data from IDS (Intruder Detection System) of Health Care Organization with Head Office in Bangkok and 12 regional centers (RC1, RC2… RC12) had located in the different region cover 76 provinces of Thailand in April 2012 with high severity were used in this research. Sample record of raw IDS logs file presented as shown in Table I.Table 1.N ETWORK IDS L OG R ECORD D ATA D EFENITIONIn this example, besides date and time stamp, the attributes of IDS log file were: attacked ID (attack_id) for representing the type of intruder signature; severity presented the level of risk; source IP address, destination IP address, source port, destination port, and a detailed message for additional explanation of this signature. The meaning of the example alert above, the attack id was 10873; the source IP was 172.16.10.96 with port number 2156; the destination IP was 172.16.10.137 with port number 80; and activity wa s “The remote attackers can gain control of vulnerable systems”.As mention above, this research focused on log data of intrusion detection system of online social networks toidentify and discover unknown patterns, the data preprocessing as follows:1.The attack logs data of online social networkswere filtered from Intrusion Detection System logto analyze and discover anomaly and cyber attackpatterns.2.The raw events were enhanced by augmenting theattack log data with other relevant information notfound in the raw event logs, such as the type ofweb (web 1.0, web 2.0), group number of onlinesocial networks in each real IP destinationaddress, so on.3.The attack logs data in April 2012 were filtered toIDS log data of online social networks (web 2.0)to analyze and discover unknown patterns withClustering Technique and Social NetworkAnalysis (SNA).4.This research merged online social networks’relationships with members, who were attackedfrom the attackers of such online social networks.The online social networks’relationships in thereal world are too big to present usefulinformation as shown in Fig.2, which was theexample of online social networks’relationshipswith 7,035 nodes and 25,778 edges in April 2012.They became very difficult to read as the memberof actors’ increases.Fig 2: The relationships in online social networksVI.R ESULTS AND A NALYSIS R ESULTSThe approach was proposed with the complementary of cluster algorithm and Social Network Analysis. The cluster algorithm was used to analyze IDS log data to discover anomaly and cyber attack patterns in online social networks. Social network analysis (SNA) was used to identify and visualize the influencing nodes of anomaly and cyber attacks in online social networks. The results of this research presented as follows;A. Anomaly and Cyber attacks Patterns in Online Social NetworksThe anomaly and cyber attacks patterns in online social networks were analyzed with a cluster algorithm by using source IP address. The model and evaluation of test set categorize signatures into four groups. The anomaly and cyber attacks patterns caused from attackers, which executed an arbitrary program on infected systems with 40% of the test set. The anomaly with HTTP with 37% of the test set. The remote attackers can gain control of vulnerable systems with 19% of the test set. Denial of Service (DOS) with 4% of test set. The result presented as shown in Fig. 3.Fig 3: Social networks anomaly and cyber attacks patterns analysisresultB. The Proportion of Anomaly and Cyber attacks Patterns in Each Branch of the OrganizationIn this experiment, 60% of the data stream was training set and 40% of the data stream was test set. When the training process finished and system model was built, the association rules of anomaly and cyber attacks patterns in online social networks would be discovered. The association predictive model was evaluated and best rules presented with eight principles. The anomaly and cyber attack patterns at head office and 12 regional centers were the same patterns. The attackers executed an arbitrary program on infected systems was the highest percentage of risk of RC6, located at Khon Khan. Anomaly with HTTP was the highest percentage of risk of RC2, located at Lopburi. The remote attackers could gain control of vulnerable systems was the highest percentage of risk of RC9and RC10, located at Pitsanuloke and Chiang Mai, respectively. DoS was the highest percentage of risk of Head office in Bangkok. The proportion of risk in each signature presented as shown in Table II and Fig. 4.Table 2. Percentage of Anomaly and Attack Classify by Site andSignatureFigure 4: Percentage of anomaly and cyber attacks patterns classify bysignatureC. The Proportion of Anomaly and Cyber attacksPatterns Classify by Online Social NetworksThe proportion of anomaly and cyber attack patternsclassified by online social networking presented asshown in Table III, Fig. 5 and Fig. 6.1.The proportion of risk caused by the remoteattackers could gain control of vulnerable systemswhen users access Facebook was 100%.2.The proportion risk caused by the attacker canexecute an arbitrary program on infected systemswhen users access googalz was 100%.3.The proportion of risk caused by Denial ofService when users access Skype was 55.56% andMSN was 44.44% as shown in Fig. 5.4.The proportion of risk caused by anomaly withHTTP when users access Facebook, imeem,YouTube, Hi5 and Twitter was 30%, 16%, 5 %,47% and 2%, respectively as shown in Fig. 6.Table 3. Percentage of Risks Classify by Online Social NetworksFig 5: Percentage of Denial of Service classifies by online socialnetworksFig 6: Percentage of Anomaly with HTTP classifies by online socialnetworksD. Identifying and Visualization the Influencing Nodes of Anomaly and Cyber attacks in Online Social Networks The centrality is the extent to which a person is in the center of a whole network. Central people have more influence in their network. Measures of centrality include degree, betweenness and closeness centrality.1. This research presented 42 nodes (node label 3, 60, 87, googalz, 224, 145, 119, B6, 216, 217, Facebook, Skype, MSN, Hi5, 234, 229, 30, 20, 179, 171, 96, 44, 229, A47, A12, B9, C26, B31, A24, C40, F48, J7, G24, E26, H21, M14, K16, L41, E29, E21, 235, and G19) these were the influencing nodes. A new network of 42 influencing nodes presented the cyber attack patterns, which called Egocentric network as shown in Fig. 7.2. A network member with a higher degree could be the leader or “hub” in a network.3. Top five In-degree nodes were node label 60, 87, 3, googalz and 224 with an In - degree equal 109, 109, 107, 72, and 60, respectively.4. Top five Out-degree nodes were node label 234, J7, 217, 229 and 216 with an Out - degree equal 1451, 101, 75, 68 and 61, respectively.5. Betweenness measures the extent to which a particular node lies between other nodes in a network. Top five Betweenness nodes were node label 216, 119, 234, C26 and J7 with Betweenness measure equal 1513.400, 1338.200, 1197.467, 785.233 and 695.000, respectively.6. Top five Closeness nodes were node label 119, 217, 216, 171 and 234 with closeness measure equal 8.456, 8.436, 8.413, 8.301 and 8.293, respectively. This meant that node label 119, 217, 216, 171 and 234 were persons in the center of the network and could be the leader in the network because they were both influencing nodes andhigh degrees. If they were attacked from many threats or cyber attacks such as social engineering or malware, they would be influenced in their network.A network member with a higher degree could be the leader or “hub” in the network. Betweenness measures the extent to which a particular node lies between other nodes in a network. Closeness is the sum of the length of geodesics between a particular node and all the other nodes in a network. It actually measures how far away one node is from other nodes. This research presented 42 influencing nodes, which could be partitioned to new network with high degree centrality.E. Anomaly and Cyber attacks Diffusion by Influencing Nodes in Online Social NetworksEgocentric Network is the graphs in subgroups of the whole network, builds a picture of a typical actor in any particular environment and shows how many ties they maintain, and what kinds of information they send to and receive from others in their network. This research presented the sociogram of each representative node, for example node label 119, 216, 217, 234 and their linkages of relationships as shown in Fig. 8, Fig. 9, Fig. 10, Fig. 11, and Fig. 12 respectively.Finally, the results of Egocentric Networks presented the closet vertices of node label 119 or node label 50 and node label 53. The closest vertices of node label 216 were node label 20 and node label 29. The closest vertices of node label 217 were node label 18 and node label 28. The closest vertices of node label 234 were node label 29 and node label 42. These could be implied that when node label 119, 216, 217, 234 were attacked from many threats such as social engineering or malware, they would send these attacks to node label 50, 53, 20, 29, 18, 28 29 and 42 with the highest proportionas shown in Fig. 13.Fig 7: Subgroup Identification of influence nodesFig 8: Egocentric network of representative node A24Fig. 8 presented the characteristics of node label A24 as follows: Node label A24 had 93 edges and average degree was equal 1.9565217. The closest vertices were node label 32 and node label 36 with distance: 0.08033. The ClosenessCentralization was equal 1.00000.Fig 9: Egocentric network of representative node 119Fig. 9 presented the characteristics of node label 119 as follows: Node label 119 had 107 edges and average degree was equal 1.9622642. The closest vertices were node label 50 and node label 53 with distance: 0.06899. The ClosenessCentralization was equal 1.00000.Fig 10: Egocentric network of representative node 216Fig. 10 presented the characteristics of node label 216 as follows: Node label 216 had 109 edges and average degree was equal 1.9629630. The closest vertices were node label 20 and node label 29 with distance: 0.06895. The ClosenessCentralization was equal 1.00000.Fig 11: Egocentric network of representative node 217Fig. 11 presented the characteristics of node label 217 as follows: Node label 217 had 105 edges and average degree was equal 1.9615385. The closest vertices were node label 18 and node label 28 with distance: 0.07135. The ClosenessCentralization was equal 1.00000.Fig 12: Egocentric network of representative node 234Fig. 12 presented the characteristics of node label 234 as follows: Node label 234 had 131 edges and average degree was equal 1.9692308. The closest vertices were node label 29 and node label 42 with distance: 0.05801. The ClosenessCentralization was equal 1.00000.Fig 13: Information Diffusion of representative nodes 119, 216, 217 and 234Fig. 13 presented the relationships of representative node label 119, 216, 217, and 234 as follows: Average degreewas equal 2.1604278. The closest vertices were node label 17 and node label 21 with distance: 0.00331. The ClosenessCentralization was equal 0.45397.。

数据仓库与数据挖掘智慧树知到课后章节答案2023年下济南大学

数据仓库与数据挖掘智慧树知到课后章节答案2023年下济南大学

数据仓库与数据挖掘智慧树知到课后章节答案2023年下济南大学济南大学绪论单元测试1.数据挖掘的目标不在于数据采集策略,而在于对于已经存在的数据进行模式的发掘。

()A:错 B:对答案:对第一章测试1.图挖掘技术在社会网络分析中扮演了重要的角色。

()A:对 B:错答案:对2.数据挖掘的主要任务是从数据中发现潜在的规则,从而能更好的完成描述数据、预测数据等任务。

( )A:对 B:错答案:对3.DSS主要是基于数据仓库.联机数据分析和数据挖掘技术的应用。

()A:对 B:错答案:对4.建立一个模型,通过这个模型根据已知的变量值来预测其他某个变量值属于数据挖掘的哪一类任务?( )A:建模描述B:根据内容检索C:寻找模式和规则D:预测建模答案:预测建模5.以下哪些学科和数据挖掘有密切联系?( )A:计算机组成原理B:矿产挖掘C:统计D:人工智能答案:统计;人工智能第二章测试1.下面哪个不属于数据的属性类型:( )A:区间B:序数C:相异D:标称答案:相异2.在上题中,属于定量的属性类型是:( )A:序数B:区间C:相异D:标称答案:区间3.只有非零值才重要的二元属性被称作:( )A:计数属性B:对称属性C:离散属性D:非对称的二元属性答案:非对称的二元属性4.以下哪种方法不属于特征选择的标准方法: ( )A:嵌入B:包装C:过滤D:抽样答案:抽样5.离群点可以是合法的数据对象或者值。

()答案:对第三章测试1.下面哪些属于可视化高维数据技术 ( )A:星形坐标B:平行坐标系C:矩阵D:Chernoff脸E:散布图答案:星形坐标;平行坐标系;矩阵;Chernoff脸2.下面哪种不属于数据预处理的方法? ( )A:聚集B:离散化C:变量代换D:估计遗漏值答案:估计遗漏值3.联机分析处理包括以下哪些基本分析功能? ( )A:转轴B:聚类D:分类E:切片答案:转轴;切块;切片4.检测一元正态分布中的离群点,属于异常检测中的基于()的离群点检测。

自动确定聚类中心的密度峰值算法

自动确定聚类中心的密度峰值算法

自动确定聚类中心的密度峰值算法王洋;张桂珠【摘要】密度峰值聚类算法(Density Peaks Clustering,DPC),是一种基于密度的聚类算法,该算法具有不需要指定聚类参数,能够发现非球状簇等优点.针对密度峰值算法凭借经验计算截断距离dc无法有效应对各个场景并且密度峰值算法人工选取聚类中心的方式难以准确获取实际聚类中心的缺陷,提出了一种基于基尼指数的自适应截断距离和自动获取聚类中心的方法,可以有效解决传统的DPC算法无法处理复杂数据集的缺点.该算法首先通过基尼指数自适应截断距离dc,然后计算各点的簇中心权值,再用斜率的变化找出临界点,这一策略有效避免了通过决策图人工选取聚类中心所带来的误差.实验表明,新算法不仅能够自动确定聚类中心,而且比原算法准确率更高.%Density Peaks Clustering(DPC)is a density-based clustering algorithm,which has the advantage of not need-ing to specify clustering parameters and discovering non-spherical clusters.In this paper,an adaptive truncation method based on Gini index is proposed to solve the problem that the density peak algorithm can not effectively deal with each scene by calculating the cutoff distance dc,and the density peak algorithm manually selects the clustering center to get the actual clustering center.Distance dcand automatic clustering center method can effectively solve the defects of tradi-tional DPC algorithm which can not handle the complex data set.The algorithm firstly cuts off the distance through Gini index,then calculates the cluster center weights of each point,and then uses the change of slope to find the critical point. This strategy effectively avoids the errors caused by manual selection of clustering centers bydecision graph. Experi-ments show that the new algorithm not only can automatically determine the clustering center,but also has higher accuracy than the original algorithm.【期刊名称】《计算机工程与应用》【年(卷),期】2018(054)008【总页数】6页(P137-142)【关键词】密度峰值;聚类;簇中心点;基尼指数【作者】王洋;张桂珠【作者单位】江南大学物联网工程学院,江苏无锡214122;江南大学物联网工程学院,江苏无锡214122【正文语种】中文【中图分类】TP181 引言聚类是把数据集划分成子集的过程,每一个子集是一个簇,满足簇中对象相似,但与其他簇中的对象不相似的条件。

考虑局部均值和类全局信息的快速近邻原型选择算法

考虑局部均值和类全局信息的快速近邻原型选择算法

第40卷第6期自动化学报Vol.40,No.6 2014年6月ACTA AUTOMATICA SINICA June,2014考虑局部均值和类全局信息的快速近邻原型选择算法李娟1,2王宇平1摘要压缩近邻法是一种简单的非参数原型选择算法,其原型选取易受样本读取序列、异常样本等干扰.为克服上述问题,提出了一个基于局部均值与类全局信息的近邻原型选择方法.该方法既在原型选取过程中,充分利用了待学习样本在原型集中k个同异类近邻局部均值和类全局信息的知识,又设定原型集更新策略实现对原型集的动态更新.该方法不仅能较好克服读取序列、异常样本对原型选取的影响,降低了原型集规模,而且在保持高分类精度的同时,实现了对数据集的高压缩效应.图像识别及UCI(University of California Irvine)基准数据集实验结果表明,所提出算法集具有较比较算法更有效的分类性能.关键词数据分类,原型选择,局部均值,类全局信息,自适应学习引用格式李娟,王宇平.考虑局部均值和类全局信息的快速近邻原型选择算法.自动化学报,2014,40(6):1116−1125DOI10.3724/SP.J.1004.2014.01116A Fast Neighbor Prototype Selection Algorithm Based on Local Mean andClass Global InformationLI Juan1,2WANG Yu-Ping1Abstract The condensed nearest neighbor(CNN)algorithm is a simple non-parametric prototype selection method, but its prototype selection process is susceptible to pattern read sequence,abnormal patterns and so on.To deal with the above problems,a new prototype selection method based on local mean and class global information is proposed. Firstly,the proposed method makes full use of those local means of the k heterogeneous and homogeneous nearest neighbors to each be-learning pattern and the class global information.Secondly,an updating process is introduced to the proposed stly,updating strategies are adopted in order to realize dynamic update of the prototype set. The proposed method can not only better lessen the influence of the pattern selected sequence and abnormal patterns on prototype selection,but also reduce the scale of the prototype set.The proposed method can achieve a higher compression efficiency that can guarantee the higher classification accuracy synchronously for original data set.Two image recognition data sets and University of California Irvine(UCI)benchmark data sets are selected as experimental data sets.The experiments show that the proposed method based on the classification performance is more effective than the compared algorithms.Key words Data classification,prototype selection,local mean,global class information,adaptive learningCitation Li Juan,Wang Yu-Ping.A fast neighbor prototype selection algorithm based on local mean and class global information.Acta Automatica Sinica,2014,40(6):1116−1125在机器学习和数据挖掘任务中,作为一种简单成熟的分类算法KNN(k-nearest neighbors algorithm)[1]获得广泛的应用.作为数据挖掘领域的十大经典算法之一,KNN算法具有理论简单、易收稿日期2013-06-19录用日期2013-11-11Manuscript received June19,2013;accepted November11, 2013国家自然科学基金(61272119)资助Supported by National Natural Science Foundation of China (61272119)本文责任编委章毓晋Recommended by Associate Editor ZHANG Yu-Jin1.西安电子科技大学计算机学院西安7100712.陕西师范大学远程教育学院西安7100621.School of Computer Science and Technology,Xidian Univer-sity,Xi an7100712.School of Distance Education,Shaanxi Normal University,Xi an710062于实现、无需预先训练分类器、可适用各种数据分布环境等优势,然而尤其处理大规模数据集时,由于其简单的处理策略而导致产生难以接受的时间和空间消耗.故在分类算法中,如何对大规模数据集去除冗余节点,保留高效分类贡献的代表点,进而降低数据规模、提高分类速度,成为了研究热点.为此,一种有效的处理策略即原型选择就是对原始数据集进行必要缩减,即在保证不降低甚至提高分类精度等性能的基础上,对原始训练集处理从中获取能够反映原始数据集分布及分类特性的代表样本集即原型集,进而降低数据规模和噪音的敏感度,提高分类算法执行效率.6期李娟等:考虑局部均值和类全局信息的快速近邻原型选择算法11171相关技术1.1原型选择算法原型选择算法的重要应用之一是作为某个分类算法的预处理步骤,可与各种分类算法相结合,降低分类算法的数据规模.本文选定原型选择算法与近邻算法相结合,通过分类精度比较所提出算法的执行效率.原型选择算法目标为在不降低分类性能的基础上,去除噪音等异常节点,降低训练集规模,进而提高算法执行效率.其常见模型[2]为:设T R(Training set)为训练集(包含一些无用信息,如噪音、冗余信息等),寻求选择子集T P(Training prototype set),T P⊂T R使得T P不包含多余原型,且Acc(T P)∼=Acc(T R)(Acc(X)表示X作为训练集所获得的分类精度).而在分类过程中,使用T P代替T R作为分类判断基准数据,从而降低了运算的数据规模.原型选择算法一经提出,就获得了长足的发展,产生了诸多的研究成果.其中剪辑近邻法(Edited nearest neighbor,ENN)[3]与压缩近邻法(Con-densed nearest neighbour,CNN)[4]是较早提出的样本选择算法.CNN算法的缺点是对原样本集样本的排列顺序敏感,而且压缩集中含有较多的冗余样本.围绕着CNN算法,产生了一系列改进算法:如FCNN(Fast condensed nearest neighbor)[5]侧重降低样本读取序列敏感性和尽可能获取类决策边界原型;GCNN(Generalized condensed nearest neighbor)[6]引入了同异类近邻,克服了CNN仅使用同类近邻的不足;MNV(Mutual neighborhood value)[7]使用互近邻值降低算法样本读取序列敏感性;RNN(Reduced nearest neighbor rule)[7]侧重于改进CNN算法原型集不能删除的缺陷等;基于聚类策略的类边界样本选择算法,如IKNN(Im-proved k-nearest neighbor classification)[8]和PSC (Prototype selection by clustering)[9]等.上述算法仍然具备算法对噪音的敏感性.通常压缩近邻法即剪辑法,通过去除噪声点和清理不同类别重叠区的样本点来达到代表点选择的目的.编辑法主要采取剔除原始样本集中的噪音等策略,是一种非增量算法,不适用于大规模数据集处理.为此,如何降低传统增量原型选择算法对样本读取序列、异常点敏感,成为增量原型选择算法的研究热点,同时也是本文研究的主要问题.1.2局部均值或类均值分类算法针对KNN算法的噪音敏感性及传统只关注近邻样本忽略其样本分布等弊端,很多研究者考虑了近邻局部均值或类均值信息与样本分布的关系,将近邻局部信息和类统计或均值信息纳入到近邻分类算法中.其中Mitani等[10]提出了一种基于局部均值的非参数分类方法,克服离群点对分类性能的影响,尤其在小样本情形下分类性能较好.Brown 等[11]使用了各自类近邻类样本距离加权信息进行分类,区别于文献[10]中样本集距离加权信息;Han 等[12]引入了类中心思想,充分利用训练样本的整体信息分类;在此基础上,Zeng等[13]提出了基于局部均值和类均值的分类算法,既利用未分类样本在每类里的近邻局部均值信息,又利用类均值的整体知识进行分类;而Brighton等[14]则定义了待学习样本的Reachable和Coverage概念,在此基础上同ENN算法相结合,提出了迭代样本过滤算法(The iterative casefiltering,ICF),Wang等[15]对其进行改进,提出了ISSARC(An iterative algorithm for sample selection based on the reachable and coverage)算法.设置不同的参数,基于均值的分类方法可退化传统最近邻方法.当选择待分类样本在每类训练样本集里的近邻数为1时,则该局部均值方法等价于最近邻分类;当选取近邻数等于对应类的训练样本数时,则等价于欧几里得距离分类[7].综上,在传统原型选择算法中,借鉴样本局部近邻均值和全局均值等信息,可进一步贴合原型集分布状态,降低了异常原型的干扰.本文在传统CNN 算法基础上,利用近邻局部均值和类全局信息,同时借鉴RNN样本删除思想,提出一种新的原型选择算法(An improved nearest neighbor prototype selection algorithm based on local-mean and class global information,LCNN),可在保障不降低甚至提高分类效率基础上,较好克服CNN及其改进算法对样本读取序列的依赖性,提升原型集的动态更新能力、降低算法的噪音敏感性.2考虑局部均值和类全局信息的近邻原型选择算法为便于描述,本文使用以下符号:记任意数据集D={x i=(x i1,x i2,···,x id)|i=1,2,···},类标记集C={c1,c2,···,c m},d为样本维度,m为类别数.记T R={(x i,y i)|x i∈D,y i∈C,i= 1,2,···,n}为训练集;T P⊂T R为训练所得原型集;T S(Testing set)与T R同构,为若干样本的测试集.1118自动化学报40卷设T P =∅,记任一待扫描学习样本x ∈T R ,任一原型p ∈T P ;s kx =S k (x )⊂T P 为x 同类别的k 个近邻原型;h kx =H k (x )⊂T P 为x 异类别的k 个近邻原型;d (x,y )为x 与y 间欧氏距离;label (x )表示样本x 的类别;D (x )= ki =1w i d (x,x i )(其中w i 为x 的第i 近邻原型的距离加权系数,x i 表示x 的第i 近邻原型)为x 的k 近邻加权距离和,即本文所定义的局部均值信息;Ind (p )表示原型p 在T P 中对应索引;P S 为四元组结构,用以表示p 及其同异类最近邻原型关系,其中P S (1)、P S (2)、P S (3)分别表示T P 中p 索引、p 同类最近邻索引和p 异类最近邻索引,P S (4)表示p 是否被删除的标识.2.1LCNN 算法策略在CNN 算法基础上产生了诸多的改进算法.但这些算法仅利用所筛选的k 近邻样本类别信息,未考虑到样本分布等数据集的局部或全局信息,易受近邻样本偏好影响;同时仍保留着CNN 算法的样本读取序列及噪音的敏感性;且较少涉足对原型集样本的动态增删操作,使噪音和孤立点等样本得以延续保存.为此,本文对CNN 算法进行必要改进,其处理策略如下:1)去除CNN 算法无指导的新类别原型获取策略,新增初始化操作,主要完成类全局信息的获取和所有类别初始原型的获取,以类全局信息调控噪音和孤立点能否成为原型;2)针对CNN 算法中仅使用最近邻样本而导致易受样本读取序列和噪音干扰的情况,扩充最近邻样本为k 同类近邻样本和k 异类近邻样本,使用同类近邻均值和异类近邻均值信息作为原型初步判断条件,可有效降低CNN 算法噪音敏感度;3)在预设的更新周期内,使用局部均值及类全局信息完成对孤立点原型、类中心区域原型等删除操作,进而对原型集信息进行针对性更新.图1显示了LCNN 算法运行框图,其中虚线框部分为本文研究的主要内容,即实现原型集选择功能;而分类算法构造分类器部分,本文选择了最近邻分类用以检验LCNN 所产生原型集的性能.令N 1,N 2,···,N m 表示T P 中对应于类别c 1,c 2,···,c m 的原型个数.设x 有k 个有效可取的同异类原型,s kx 为测试样本x 获取T P 中k 个近邻同类原型,那么x 同类局部均值为:D s =k i =1w i d (x,s i kx )(1)同理,h kx 为测试样本x 获取T P 中k 个近邻异类原型,x 的异类原型局部均值为:D h =k i =1w i d (x,h i kx )(2)对于T P 中属于类别c j (j =1,2,···,m )的原型表示为T P j ={p i j |i =1,2,···,N j },那么c j 类的全局均值原型为:G j =1N j N ji =1pij(3)对于类别c j 原型与类均值原型的平均距离为:D j =1N j N ji =1d (p i j ,G j )(4)综上,以类别为整体的均值原型及平均距离都属于c j 的全局信息,故定义了GD =<GD 1,GD 2,···,GD m >为T P 各类的全局信息结构,其中GD j (1)表示类均值原型向量,用来存储T P 各类别的动态中心,即存储G j ;GD j (2)表示类原型间平均距离,用来存储T P 各类别原型间的动态平均距离,即存储D j .本文中GD 被称为T P 类全局信息.图1LCNN 算法运行框图Fig.1Running diagram of LCNN algorithm2.2算法主要处理过程原型集初始化过程、原型学习过程和原型更新过程是LCNN 算法的核心内容.其中,初始化过程采取随机比例的样本读取获取类全局信息,根据类全局信息有指导性选取原型集初始化,降低CNN 算法原型无指导选取的随机性影响;学习过程在一定学习策略下,实现对原型集有效增添;更新过程,设置了不同的更新阀值,通过周期性删除T P 中不符合条件的原型,完成T P 集的动态更新,进而去除类中心原型、孤立点及噪音,较好克服传统CNN 算法只增加、不删除原型的弊端.6期李娟等:考虑局部均值和类全局信息的快速近邻原型选择算法11192.2.1原型集初始化过程原型集初始化过程包含两个功能:1)通过随机提取各类训练样本,获取类样本的平均距离、类均值中心节点的全局信息,作为原型集初始原型选择依据;2)在类全局信息指导下,为每类样本随机选取f (本文一般设置f=2,当不平衡数据集时,f=1)个初始原型加入T P,降低了CNN算法新类别原型选取的随机性;同时获取并填充T P中各原型的同异类近邻节点,更新类均值中心节点.LCNN的原型集初始化过程描述如下:输入.训练样本集T R.输出.GD、P S.步骤1.初始化GD=∅,P S=∅.步骤2.随机从T R中读取一定比例的训练样本,完成GD信息的填充.步骤3.i=1.步骤4.j=1.步骤5.从第i类样本中读取任一样本x,若其满足GD(i,2)<d(x,GD(i,1))<3×GD(i,2),加入到原型集,j=j+1.步骤6.若j<f,转到步骤5.步骤7.i=i+1,若i<m,转到步骤4.步骤8.逐类别逐原型完成对GD和P S的数据填充.步骤9.输出GD、P S.2.2.2原型学习过程LCNN算法是个增量学习算法,整个算法单遍扫描训练样本集,从读取第一个未被扫描样本开始,直至所有待学习样本学习完毕,获取最终原型集.当一个样本的同类近邻局部均值大于样本的异类近邻局部均值时,该样本被选作原型加入到原型集;同时,判断样本与其类中心点距离是否大于最近邻与类中心点距离,如大于则将其选作原型加入到原型集.LCNN的原型学习过程描述如下:输入.GD、P S、λ及T R.输出.T P.步骤1.如T R不存在未被扫描样本,则输出T P,结束算法.步骤2.任取一未被扫描样本x.步骤3.根据x的类别信息c,获取x的s kx、h kx、GD(c,:).步骤4.若d(x,GD(c,1))<GD(c,2),转到步骤8.步骤5.使用式(1)和式(2)分别计算x的同异类近邻局部均值D s和D h.步骤6.若D s>D h,x被选作原型加入T P,同步设置P S(x,:)数据,转到步骤8.步骤7.若d(x,GD(c,1))>d(s1kx,GD(c,1)), x被选作原型加入T P,同步设置P S(x,:)数据.步骤8.若已学习样本数是λ的整数倍,则调用更新过程.步骤9.否则,转到步骤1.LCNN算法突破了CNN算法仅使用最近邻判别原型的简单方式,考虑到原型样本分布等因素,引入训练样本x的同异类局部均值,通过局部均值信息、类均值中心间关系作为判断原型的依据,即克服最近邻原型判别准则的偏好,在一定程度上实现了类边界原型的选取.同时减少了与类中心距离过近原型的添加,稀疏化类中心区域原型个数.2.2.3原型集更新过程更新过程引入了原型删减思想,每λ个样本学习后,调用原型集更新过程,定期删除不符合规则的原型,减少原型数目.本文依托类全局信息和原型的最近同异类近邻设定不同的更新阀值,用以处理不同情况的原型删除操作.对于任一p i∈T P,c j、c s 分别为p i与最近邻异类原型类别,执行两步骤更新操作;待T P原型扫描完毕,执行局部均值及类全局信息更新操作.LCNN的原型集更新过程描述如下:步骤1(孤立原型更新).当d(p i,GD(c j,1))≥3×GD(c j,2)且d(p i,T P(P S(Ind(p i),2)))> d(p i,T P(P S(Ind(p i),3)))表明该类原型为孤立点,删除此类原型,可降低孤立原型影响,则设置P S(Ind(p i),4)=1.步骤2(噪音等异常原型更新).当GD(c s,2)≤d(T P(P S(Ind(p i)),3)),GD(c s,1))且3×GD(c s, 2)>d(T P(P S(Ind(p i)),3)),GD(c s,1))时,利用p i相关局部均值和类全局信息进行判断.若p i的同类局部均值小于它的异类局部均值(D s<D h),同时p i异类原型处于非类边缘区域(即d(T P(P S(Ind(p i),3)),GD(c j,1))> d(T P(P S(Ind(p i),3)),GD(c s,1))或d(T P(P S (Ind(p i),3)),GD(c j,1))>d(T P(P S(Ind(p i),1)), GD(c s,1)),则表示p i为噪音,则设置P S(Ind(p i), 4)=1.步骤3(局部均值及类全局信息更新).原型扫描完毕,对所有更新标识的原型进行删除;更新原型集T P的P S结构信息;最后分类别计算类均值中心 Gj和类原型标准差距离 D j,更新GD(c j,1)= G j1120自动化学报40卷和GD (cj ,2)= D j .2.3关键概念及参数界定1)孤立原型界定:本文采用文献[16]的定义,把孤立原型定义为与类原型均值的距离超过3倍标准差距离的原型.2)近邻权重选取:本文选取了最简单的倒数距离加权参数,即w i =1/i ,w i 随着i 的增加而减小,对应的原型对新原型选取的影响越小.在未考虑全局信息情况,若近邻数为1,即对待学习样本只选取一个同类近邻和一个异类近邻,则局部均值学习退化为传统CNN 学习.LCNN 运行必须两个参数支撑:一是原型近邻数k ;二是更新周期λ.两种参数选取有预设、交叉验证和动态调整三种方式.其中预设和动态调整方式简单便捷,交叉验证方式需要多次验证运行才能获取较好的参数配置.因此,结合原型集增量生成方式,选择了动态调整设置方式,即伴随着原型集的动态变化,动态调整k 和λ.为简化问题,本文在一个更新周期λ内,将不同类别样本x 在T P 中各同异类近邻数设置为相同,且k ≤min(N 1,N 2,···,N m ).本文分别选取k = m min j =1N j ,λ= m j =1N j (N j 表示更新周期开始时T P 中类别c j (j =1,2,···,m )原型数, · 表示向上取整).3算法评估为了更好评估LCNN 算法的性能指标,本文选择了KNN 、CNN 、GCNN 、PSC 、ISSARC 以及ILVQ (Incremental learning vector quantiza-tion)[17]作为比较算法.其中LCNN 与GCNN 处理策略相似,均采取了同异类近邻思想,本文中GCNN 选取了ρ=0,0.1,0.25,0.5,0.75,0.99下的平均运行效率;PSC 主要思想是以空间划分策略尽可能获取类边界原型,本文选取文献[9]中获取最佳运行效率的r =6m 和r =8m;ISSARC 算法是在ICF 算法基础上进行改进的非增量原型选择算法,主要思想是考虑同异类近邻距离的限定,同时通过去除噪音的非增量ENN 算法的预先处理,降低了算法噪音敏感度,其ENN 算法运行采取了文献[15]中的参数设置;而LCNN 也以获取类边界原型为处理目标;ILVQ 是目前高压缩性的快速的增量原型生成算法之一(为简化ILVQ 运行,本文对λ和Ageold采取简单预设λ=Ageold = √n );LCNN 单遍扫描训练集,也体现了快速增量原型选择思想;而选择KNN 和CNN 则作为比较算法分类性能的参照,其中KNN 算法预设5个常见的k 值,分别为3、5、7、9、11.为了验证LCNN 算法的有效性,选择了两个图像识别数据集以及其他12个UCI 数据集(见表1)和3个大规模数据集[18],采用5次5折交叉验证获得对比算法的平均分类效率及分类速度.本文在奔腾IV Intel (R)Core (TM)2Du CPU E 83002.83GHz 1G 的PC 硬件支撑,Windows XP 32位及Matlab 7运行环境下获取实验数据.本文中采用分类精度=|T S correct||T S |×100%、压缩比率=|T P ||T R |×100%、运行时间(单位:秒)作为比较算法的评价指标.其中|T S correct |表示T S 在T R 或T P 作为训练集下被正确分类的样本数,|T R |、|T P |、|T S |分别表示T R 、T P 、T S 所包含的样本或原型数.表1UCI 基准数据集信息Table 1The information of UCI benchmark data sets数据集特征数类别数样本数Iris 43150Wine 133178Glass 96214Ionosphere 342351Cancer 92699Zoo 167101Heart 132270TAE 53151Liver disorders62345Spectf 442267Ecoli 78336Ctg20321263.1理论分析分析比较算法,其中KNN 算法的时间复杂度为O(dn 2n i ),CNN 算法时间复杂度为O(nN 2d +n 1N 2d ),GCNN 算法时间复杂度为O(n 2Nd +n 1N 2d ),PSC 算法时间复杂度为O(τrnd +n 1N 2d ),ISSARC 算法时间复杂度为O(n 3d +d t i =1M 2i +n 1N 2d ),ILVQ 算法时间复杂度为O(dnN +n 1N 2d ).LCNN 算法是增量学习算法,主要分为两部分:增量原型生成时间O(dnN )和原型分类时间O(n 1N 2d ),其整体时间复杂度为O(dnN +n 1N 2d ).上述公式中,n 为训练样本数,d 为样本维度,N 为最终原型数,r 为聚类数,τ为聚6期李娟等:考虑局部均值和类全局信息的快速近邻原型选择算法1121类迭代次数,n 1为测试样本数,t 为ISSARC 算法迭代周期, ti =1M i 为ENN 算法运行所得原型集规模.LCNN 算法对于所有的训练样本而言是近线性的,但后续原型分类所需时间复杂度为传统的近邻分类算法时间复杂度.LCNN 算法是一种增量算法,仅在原型生成过程中执行单遍样本扫描,并不需对训练集进行存储,因此,LCNN 具有处理大规模数据集的能力.3.2人工数据实验为验证LCNN 算法在大规模数据集的处理性能,本文选择文献[17]实验所使用的人工数据集进行增量的原型分类比较.图2和图3均为2维人工数据集:图2中包含5类数据,类别1和类别2满足2维高斯分布,类别3和类别4数据分布为2个同心圆,类别5满足正弦分布;图3在图2有效数据分布的基础上,加入了20%的均匀分布噪音将其随机分布到5个类别中.图2无噪音人工数据集Fig.2No noise artificial dataset图3含噪音的人工数据集Fig.3Noise artificial data set区别于文献[17]实验中多种样本读取序列和不同迭代次数,本文采取单遍随机样本读取序列的简单方式.除选择三种增量算法外,由于ISSARC 算法通过ENN 算法对噪音等异常数据进行了预先处理,提高了算法的抗噪能力,故而选择其作为对照算法.图4、图6、图8、图10为四种算法在图2数据集上原型生成情况,可以看出,LCNN 算法在保持原始样本集分布的情况,对其进行必要缩减,其结果同ISSARC 和ILVQ 算法结果具有可相较性.图5、图7、图9、图11为4种算法在图3数据集上原型生成情况,LCNN 算法除原型个数明显少于ILVQ 算法结果外,相对于ISSARC 算法而言,在一定程度上降低了噪音的敏感性,其噪音数据数量明显少于比较算法.其中,ISSARC 算法属于非增量算法,在人工数据实验中的运行时间消耗达10小时以上.3.3图像识别对比1)医学图像诊断识别为验证LCNN 的实用性能,本文选取了569幅乳腺癌症图像数据进行实验,该数据将每幅乳腺癌症图像提取30个维度详细描述,其中212个异常图像,357个正常图像.通过5次5折交叉验证得到比较算法的平均运行数据.表2数据表明LCNN 在癌症图像识别中有着明显的压缩、分类精度及运行时间优势,是一种可行性的原型选择算法.图4CNN 在无噪音数据集的原型集Fig.4The prototype set obtained by CNN on no noisedataset图5CNN 在噪音数据集的原型集Fig.5The prototype set obtained by CNN on noisedata set1122自动化学报40卷图6ISSARC 在无噪音数据集的原型集Fig.6The prototype set obtained by ISSARC on no noise dataset图7ISSARC 在噪音数据集的原型集Fig.7The prototype set obtained by ISSARC on noisedataset图8ILVQ 在无噪音数据集的原型集Fig.8The prototype set obtained by ILVQ on no noisedataset图9ILVQ 在噪音数据集的原型集Fig.9The prototype set obtained by ILVQ on noisedataset图10LCNN 在无噪音数据集的原型集Fig.10The prototype set obtained by LCNN on nonoise dataset图11LCNN 在噪音数据集的原型集Fig.11The prototype set obtained by LCNN on noisedata set2)数字手写体识别为进一步验证LCNN 实际问题解决能力,特选择了研究文献中常用的手写体数字光学数据集进行算法的比较.该数据集含0到9阿拉伯手写体数字的3823个训练图像信息和1797个测试图像信息.表3数据获取环境同表2.表3数据显示LCNN 较其他比较算法有着一致好的运行效率.其中CNN 算法因运算简单且无需进行原型集的删除等操作,所以运行时间较少;PSC 需要较大的运行开销来完成初始的聚类操作;GCNN 因需要动态计算δ而增加一定运行时间消耗;ILVQ 算法增加了原型周期动态更新操作,运行消耗较大.ISSARC 算法虽然保持最好的压缩比率,然而由于其自身调用ENN 算法的预先处理策略,增加了ISSARC 算法的运行时间消耗.综上,采取LCNN 算法解决实际问题,可有效降低数据规模,可配合其他高效分类算法更好地发挥其优势.3.4UCI 基准数据集实验除上述图像识别数据集外,为更全面验证算法有效性,本文选择的12个中小规模和3个大规模UCI 基准数据集,较全面涵盖了数据集的维度规模和样本规模多样化分布,实验环境同上.6期李娟等:考虑局部均值和类全局信息的快速近邻原型选择算法1123表2比较算法在乳腺癌数据集上的运行效率Table2Operational efficiency results obtained by compared algorithms on breast cancer data set算法KNN CNN GCNN PSC ISSARC ILVQ LCNN 分类精度93.6781.5578.2789.2773.8190.6192.14压缩比率10060.1621.2446.2711.3535.5215.98运行时间 2.409 6.202 3.5193 3.872 4.7269.641 2.752表3比较算法在数字手写体集上的运行效率Table3Operational efficiency results obtained by compared algorithms on handwritten digits dataset算法KNN CNN GCNN PSC ISSARC ILVQ LCNN 分类精度97.9992.0794.5793.2592.4895.5997.08压缩比率10041.3425.7233.9419.9631.5822.57运行时间756.39214.34595.28456.92612.58372.35247.47表4比较算法分类精度与压缩比的实验数据Table4Operational efficiency results obtained by compared algorithms on breast cancer data set 算法KNN CNN GCNN PSC ISSARC ILVQ LCNN 分类精度压缩比率分类精度压缩比率分类精度压缩比率分类精度压缩比率分类精度压缩比率分类精度压缩比率分类精度压缩比率Iris96.6710095.5059.7195.7812.3292.8964.8394.5423.6793.0745.0493.3328.63 Wine70.8010071.2367.9467.3223.5462.2173.7465.5718.5467.6441.1269.6515.82 Glass65.0810062.6466.5968.2749.2660.6972.7862.1222.4064.6928.7465.4326.43 Ionosphere88.6810085.8948.3384.3222.1786.1845.1987.458.7689.2919.9186.1618.79 Cancer96.5010088.127.15394.6116.9278.0510.5584.5514.5678.0510.5595.1425.35 Zoo83.2210088.1457.6788.7331.5278.2657.1976.4323.5187.1035.1992.6234.36 Heart76.2110067.2743.4075.4946.2479.5431.2368.3110.6080.3437.3876.5737.01 TAE77.7210056.3436.7564.4844.6172.2537.1857.5736.9270.2223.3376.6921.69 Liver disorders67.6110055.8016.6765.2642.3161.8867.2155.3620.8760.8715.8967.5114.06 Spectf71.9510063.0520.3673.3552.0779.4124.5672.3316.4877.9335.7480.1435.09 Ecoli86.4510077.4453.7176.7933.8574.9342.8978.6915.7082.2242.9680.1729.73 Ctg82.0910062.6842.9164.4918.7374.8641.6665.0510.0969.1812.2376.7612.48 Average82.8710072.8443.4379.2132.8075.1147.4272.3318.5176.7229.0180.0124.95基于表4数据,可以得到如下结论:对比CNN 和GCNN,LCNN在保持明显的数据优势情况下,有着较高比例数据集的分类精度优势;对比ILVQ 算法,除Ionosphere外,LCNN分类效率优势明显,同时保证了11个数据集上的高压缩比;对比PSC快速原型算法,LCNN在保持11个数据集的高分类效率之外,仍保持9个数据集的高压缩比率,体现了较好的分类效率和较高的压缩比率.相对于其他对比算法而言,ISSARC算法保持着明显的平均压缩优势;而LCNN算法仅有2个数据集的压缩率高于ISSARC算法.通过表5运行时间数据,可得出在小规模数据集下,对于KNN和CNN 而言,LCNN时间优势不足,而在较大规模数据集Ctg下,LCNN时间优势明显;此外,LCNN相对于GCNN、ISSARC、ILVQ算法而言,有着显著的运行时间优势;而相对于目前快速原型PSC算法, LCNN也有着明显的12取7和平均的两项时间优势.。

基于障碍约束的空间聚类算法综述

基于障碍约束的空间聚类算法综述

基于障碍约束的空间聚类算法综述余冬梅【摘要】传统的空间聚类算法解决的是未带障碍约束的空间数据聚类问题,而现实的地理空间中经常会存在河流、山脉等阻碍物,因此,传统空间聚类算法不适用于带障碍数据约束的现实空间.在解析了带障碍空间聚类相关概念和定义的前提下,对带障碍约束条件的空间聚类算法进行梳理,给出了这类算法的研究历史和沿袭关系,并把这类算法按七个维度分为四大类,分析了每类的技术优缺点,最后给出了带障碍约束的空间聚类算法的未来研究趋向.【期刊名称】《计算机系统应用》【年(卷),期】2015(024)001【总页数】5页(P9-13)【关键词】空间聚类;障碍约束;分类;障碍距离;聚类算法【作者】余冬梅【作者单位】陕西理工学院数学与计算机科学学院,汉中723000【正文语种】中文基于空间数据的聚类是在地理空间数据集中, 按照对象间的距离、连接性或者相对密度等指标, 把一些具有一定相似性的数据对象归为一簇, 差异度大的数据对象分派到不同的簇中的过程. 真实地理环境中往往存在着江河、山脉、固定私有区等障碍物, 在空间聚类的分析应用中必须考虑到这些障碍物, 如快递公司投递站的选址问题、城市道路规划等问题, 而传统空间聚类算法没有考虑真实的地理环境特点,未将障碍物考虑在算法中并加以处理. 但近几年随着空间信息技术的发展, 为了使空间聚类结果更加合理和实用, 有障碍要求的空间聚类算法逐渐成了研究热点, 也为空间数据挖掘、模式识别等领域的发展做出了不小的贡献. 为此, 本文将障碍约束下的空间聚类算法进行了整理分析, 以期为研究者的后继研究提供参考.传统的空间聚类算法常常被分为基于划分的、基于密度的、基于层次的、基于模型的以及基于网格的等五种方法类型. 虽然这些方法在处理空间数据聚类分析时的侧重点、效果和特点不同, 但均已被实践检验是有效的无障碍约束的空间聚类算法,也为空间信息研究与应用领域提供了很大的帮助. 基于障碍约束的空间聚类算法大多都是在传统聚类算法的基础上, 增加了障碍距离计算方法, 这些方法改变了传统空间聚类算法中数据之间的距离计算方法, 并且产生了围绕降低计算障碍距离代价为主要目标的改进方法. 经过查阅大量有关考虑障碍约束的空间聚类方法的资料,本文将这些方法大致分为4类: 基于划分的方法、基于密度的方法、基于图论的方法、以及混合聚类方法, 将按两个指标对考虑障碍约束的空间聚类算法进行分析, 一是以时间为线索分析它们的发展研究轨迹, 二是从这些算法的特色上进行分类对比分析, 指出它们的共同特点和各自的优缺点, 并给出以后的研究方向.2.1 带障碍约束的空间聚类带障碍约束的空间聚类是指在具有障碍物的空间数据集中, 将具有相似特性(如空间位置相邻)的数据归为一类(簇), 使得同一类(簇)中的数据相似性最大, 不同类(簇)之间的数据相异性最大. 其中在数据相似性的计算判断中, 须考虑障碍物具有分隔其他数据直接连接的特性. 在具有障碍的空间中, 传统的空间聚类算法的聚类效果见图1, 而考虑障碍约束的空间聚类算法的效果见图2.2.2 带障碍距离空间数据集中两点间最短障碍距离是指两点间不与任何障碍物相交的最短路径长度,若使用Dist(a,b)表示障碍空间中任意a,b两点的带障碍距离, 则:如图3所示, Dist(a,b)= Dist(a,p) + Dist(p,q)+ Dist(q,b), 其中p和q分别是障碍物的相对a和b的可视顶点.在带障碍约束空间聚类算法中需要定义自己的带障碍距离计算方法, 不再是单纯的欧式距离, 致使这些算法的时间复杂度往往比传统空间聚类算法更高, 因此, 带障碍距离的计算是解决带障碍约束空间聚类分析问题的最关键技术之一.图3 障碍距离示意图3 带障碍约束的空间聚类算法的研究和发展最早的带障碍约束的空间聚类算法是2001年Anthony K.H.Tung等提出的COD-CLARANS算法[1], 之后不断研究出了很多新算法. 分析这些算法, 绝大多数都是在传统空间聚类算法的基础上, 增加了障碍距离、障碍模型等概念而产生的, 且主要针对的是二维空间维度下点状数据的聚类, 如文献[1]中的COD-CLARANS算法是在传统的基于划分的聚类算法CLARANS[2]的基础上首次引入障碍距离而产生的[3]; 文献[4]的AUTOCLUST+算法是基于图论的Delaunay三角网而提出的; 文献[5]的DBCLuC算法是在传统的基于密度的聚类算法DBSCAN[6]的基础上引入障碍模型而提出的, 即将障碍物模型化成一组多边形; 基于智能优化的障碍约束聚类算法[7-9]是在聚类的过程中加入智能优化算法中的优化模型而形成的. 下面从四种分类角度描述带障碍约束的空间聚类算法的研究发展情况, 如图4所示.4 带障碍约束的空间聚类算法分类研究中发现, 现有的带障碍约束的空间聚类算法以点目标数据的聚类为主, 未见有针对线目标和面目标数据的障碍约束聚类. 这与传统聚类算法中, 针对线目标和面目标空间聚类算法本身就少有一定关系.图4带障碍约束的空间聚类算法演变发展沿革4.1 算法的分类方法本文把考虑障碍约束的空间聚类算法分为5类: 划分类、密度类、图论类和混合类. 由于基于障碍约束的空间聚类算法间的典型区别在于, 聚类中数据对象间相似性的计算方法和时间效率等, 因此, 下面从多个维度出发分析这些算法: 依据的传统聚类算法的情况、有无预处理、是否需要输入参数、是否考虑障碍物上有连接点、代表性算法、优点和缺点, 如表1所示.表1 障碍空间聚类算法分类表分类依据的典型传统算法预处理需要输入参数值考虑障碍物上有连接点代表性算法典型优点典型缺点划分类CLARANS有是否COD-CLARANS[1]在空间聚类中第一个引入障碍距离等概念,能处理大数量数据预处理的开销非常大、检测球状或者近似球状的簇 K-means无是否OBS-UK-means[26]将不确定性数据引入到障碍空间聚类算法中带障碍距离计算过程复杂,不适合大数量数据空间密度类DBSCAN有是有DBCLuC[5]DBCOD[17]能处理任意形状的簇,引入障碍域的可视性和可视空间概念对噪声和数据输入顺序敏感,不能处理障碍物上有连接点的情况 DBRS无是有DBRS+[10]对障碍物处理细致,能处理连接型障碍时间开销大图论类Delaunay三角网(Voronoi图)有否否AUTOCLUST+[4]CBDTO[15]文献18能处理任意形状的簇,支持多层关联分析构造Delaunay三角网时处理约束代价大,缺少灵活性有否有FOA[11]能解决障碍物上有连接点的情况构造Voronoi图计算障碍距离耗时大混合类基于密度与网格DBSCAN+网格划分有是否DCellO[21]能够进行任意形状的带障碍的聚类,并且不需要指定聚类簇的数目。

ctex引用bib参考文献

ctex引用bib参考文献

ctex引用bib参考文献ctex是一款LaTeX的中文扩展,用于处理中文文档排版。

在处理学术论文时,经常需要使用参考文献。

为了方便管理和使用参考文献,我们可以使用bib文件来存储文献信息,并在LaTeX中引用。

下面是使用ctex引用bib参考文献的步骤:1. 准备bib文件首先需要准备一个bib文件,里面包含需要引用的参考文献。

bib 文件可以用文本编辑器手动编写,也可以使用专业的文献管理软件(如EndNote、Zotero等)自动生成。

例如,一个简单的bib文件内容如下:```@article{chen2017,author = {Chen, J. and Li, Y.},title = {A new algorithm for data clustering}, journal = {Journal of Data Science},volume = {15},number = {3},pages = {345-362},year = {2017},publisher = {National Taiwan University Press}}```其中,@article表示这是一篇期刊文章,chen2017是这篇文章的唯一标识符,后面的各个字段分别表示文章的作者、标题、期刊、卷号、期号、页码、年份和出版社。

2. 在LaTeX文档中引用bib文件在LaTeX文档中使用bib文件,需要在文档开头加入以下代码: ```bibliographystyle{plain}bibliography{filename}```其中,bibliographystyle命令指定参考文献的格式样式,plain 表示按照出现顺序排序;filename是bib文件的文件名(不带扩展名)。

3. 在文中引用参考文献在文中需要引用某篇参考文献时,可以使用cite命令。

例如,要引用上面那篇文章,可以在文中加入以下代码:```本文参考了Chen和Li的工作cite{chen2017}。

cluster-based discrimination -回复

cluster-based discrimination -回复

cluster-based discrimination -回复什么是"cluster-based discrimination"(基于聚类的歧视)?在一个愈发多元化和全球化的世界中,随着不同人群和群体的聚集和交流,我们也面临着一些新的挑战。

这其中之一就是"cluster-based discrimination",即通过聚类来实施的歧视行为。

在讨论这个话题之前,我们先来了解一下什么是聚类。

聚类是一种将数据分组为类别或簇的方法,每个类别内的数据点具有相似的特征,而不同类别之间的数据点则具有不同的特征。

聚类可以帮助我们理解和组织大量的数据,并从中发现潜在的模式和结构。

然而,当聚类被用于歧视的目的时,它就变成了一种有害的工具。

"cluster-based discrimination"指的是将人们划分到不同的群组或簇中,然后对这些群组或簇进行不平等的对待或歧视。

这种歧视可能存在于各个社会领域,包括教育、就业、住房、医疗等。

例如,一个将人们划分为高收入和低收入群体的经济模型,在分配资源时可能更偏向高收入群体,而对低收入群体提供更少的支持或机会。

那么,为什么"cluster-based discrimination"是有害和不公正的呢?首先,它通过强调和加强身份特征和群体差异,加剧了社会的分化和不平等。

它忽视了个体的独特性和多样性,而仅仅把人们简单地归类到某种群体中。

这种划分和偏见可能导致某些群体被边缘化、忽视或排斥。

其次,"cluster-based discrimination"根据人们的群组归属赋予他们特定的特征和属性。

这种界定和标签化可能导致人们被限制在这些刻板印象中,无法充分发展和实现个人潜能。

它剥夺了人们选择和改变自己的权利,将个人的机会和可能性限制在特定的范围内。

最后,"cluster-based discrimination"也会加剧社会的分裂和紧张。

基于SA-WPSO的遥感图像校正方法

基于SA-WPSO的遥感图像校正方法

基于SA-WPSO的遥感图像校正方法苏清贺;程红;王志强;卢永吉【摘要】This paper proposes a remote sensing image correction method based on SA-WPSO. The geometric polynomial model is used as early correction of images to obtain polynomial coefficients, an improved hybrid optimization algorithm SA-WPSO is used to optimize the polynomial correction factor, and the optimized polynomial is utilized to the final geometric correction. Experimental results show that compared with correction algorithms based on quadratic polynomial and cubic polynomial, improved method has higher precision and better robustness.%提出一种基于SA-WPSO的遥感图像校正方法.该方法利用多项式模型对图像进行初步几何校正,得到多项式校正系数后,将模拟退火(SA)思想引入粒子群优化(PSO)算法,通过改进的SA-WPSO算法优化多项式校正系数,在此基础上实现图像的几何校正.实验结果证明,与二次多项式及三次多项式校正方法相比,该方法的校正精度更高、鲁棒性更好.【期刊名称】《计算机工程》【年(卷),期】2012(038)006【总页数】3页(P210-212)【关键词】遥感图像;二次多项式;粒子群优化算法;几何校正;模拟退火算法;全局极值【作者】苏清贺;程红;王志强;卢永吉【作者单位】中国人民解放军空军航空大学特种专业系,长春130022;中国人民解放军空军航空大学特种专业系,长春130022;中国人民解放军空军航空大学特种专业系,长春130022;中国人民解放军空军航空大学航空理论系,长春130022【正文语种】中文【中图分类】TN911.731 概述遥感图像校正技术是进行图像信息融合、目标定位、高分辨率图像重建等的前提性工作,校正的质量直接影响后续图像处理工作的效率。

  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
相关文档
最新文档