Hierarchical hesitant fuzzy K-means clustering algorithm
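Understanding fuzzy K-means
As the saying goes, "birds of a feather flock together": classification problems abound in both the natural and the social sciences.
A cluster is made up of a number of patterns.
A pattern is usually a vector of measurements, or equivalently a point in a multidimensional space.
Cluster analysis is based on similarity: patterns within the same cluster are more similar to one another than to patterns in other clusters.
Cluster analysis therefore depends on how the closeness (distance) or similarity between observations is defined; different distance or similarity measures can produce different clustering results.
Informally, a class is a set of similar elements.
Clustering is the process of distinguishing and grouping things according to their similarity.
Cluster analysis, also called group analysis, is a statistical method for studying how to classify samples or indicators (variables).
Cluster analysis originated in taxonomy, and it can also serve as a preprocessing step for other analysis algorithms.
Clustering, simply put, means putting similar things into the same group. It differs from classification: ideally, a classifier "learns" from the training set it is given and thereby becomes able to classify unseen data, and supplying training data in this way is called supervised learning. In clustering we do not care what a given class is; the goal is only to group similar things together. A clustering algorithm therefore usually only needs to know how to compute similarity before it can start working, which is why clustering is called unsupervised learning.
The most commonly used unsupervised classification methods include K-means, ISODATA, fuzzy C-means and EM (Expectation-Maximization).
K-means has drawbacks: the clusters it produces tend to be of similar size, and it is sensitive to dirty data (noise and outliers).
In such cases the result is, admittedly, not very good.
In most cases, however, the results of k-means are quite satisfactory, and it is a simple, efficient and widely used clustering method.
The K initial centers are usually chosen with a heuristic suited to the specific problem or, in most cases, at random.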
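The K-means loop discussed above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and random selection of the K initial centers from the data; the names `kmeans` and `squared_distance` and the `seed` parameter are illustrative choices, not part of the original text.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means; returns (centers, labels)."""
    rng = random.Random(seed)
    # Pick K distinct data points at random as the initial centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

Calling `kmeans(points, 2)` on a list of coordinate tuples returns the two centers and one label per point. Because the initial centers are random, different seeds may give different results on less clearly separated data, which is the sensitivity mentioned above.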
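A video key-frame extraction algorithm based on semantic relevance
With the growth of multimedia information, video has become an important way for people to obtain information. Faced with massive amounts of video, how to extract the key parts and make viewing more efficient has become a widely discussed problem.
Video summarization is the key to solving this problem, and key-frame extraction is its core component.
Key-frame extraction methods fall into the following six categories: (1) Sampling-based key-frame extraction. Sampling-based methods select video frames at random, or at random within a prescribed time interval.
This approach is the simplest to implement but has drawbacks: in most cases randomly sampled key frames do not accurately represent the main content of the video, and similar key frames are sometimes drawn, so the result suffers from heavy redundancy and information loss and extraction quality is poor [1].
(2) Color-feature-based key-frame extraction. Color-feature-based methods take the first frame of the video as a key frame and then compare the color features of each subsequent frame with those of the preceding frame; if a large change has occurred, the frame is taken as a key frame, and the subsequent key frames are obtained in the same way.
Because this method compares only adjacent frames and never compares non-adjacent frames, the key frames extracted for the video as a whole contain some redundancy.
(3) Motion-analysis-based key-frame extraction. The most common motion-analysis algorithms compute the motion information in a video clip by optical-flow analysis and extract key frames from it.
A frame is extracted as a key frame where an action in the video pauses; for shots with different structures, the number of key frames to extract can be decided case by case.
Its drawbacks are also prominent, however: because the amount of motion must be computed in order to select local minima, this
A video key-frame extraction algorithm based on semantic relevance
Wang Junling (王俊玲), Lu Xinming (卢新明)
College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266500
Abstract: Key-frame extraction is an important component of video summarization, and the quality of the extracted key frames directly affects how well people understand the video.
Most traditional key-frame extraction algorithms are based on visual relevance: they simply extract low-level information and compute its similarity, ignoring semantic relevance, which easily introduces errors and also causes some redundancy.
To address this, a semantics-based video key-frame extraction algorithm is proposed.
The algorithm first uses hierarchical clustering to make a preliminary extraction of key frames; it then applies a semantic-relevance algorithm to compare the histograms of the preliminarily extracted key frames, removes redundant frames and determines the final key frames. Finally, in a comparison with other algorithms, the key frames extracted by the proposed algorithm show relatively low redundancy.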
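The histogram-comparison step that removes redundant frames can be illustrated with a small sketch. It is not the authors' semantic-relevance method: it assumes each frame is a non-empty flat list of 8-bit gray values, uses histogram intersection as the similarity measure, and uses an arbitrary threshold of 0.9. The function names are hypothetical.

```python
def gray_histogram(frame, bins=8, max_value=256):
    """Normalized intensity histogram of a frame (flat list of pixel values)."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // max_value, bins - 1)] += 1
    return [c / len(frame) for c in counts]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def drop_redundant_frames(frames, threshold=0.9):
    """Return indices of frames whose histogram differs from all kept frames."""
    kept = []  # (index, histogram) of the frames retained so far
    for index, frame in enumerate(frames):
        hist = gray_histogram(frame)
        # Keep the frame only if it is not too similar to any kept frame.
        if all(histogram_intersection(hist, h) < threshold for _, h in kept):
            kept.append((index, hist))
    return [index for index, _ in kept]
```

Unlike the adjacent-frame comparison of category (2), each candidate here is compared against every frame already kept, so a frame that repeats much earlier content is also dropped.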
Kernelized fuzzy attribute C-means clustering algorithm
Received 30 October 2006; received in revised form 17 March 2008; accepted 18 March 2008 Available online 26 March 2008
Abstract A novel kernelized fuzzy attribute C-means clustering algorithm is proposed in this paper. Since attribute means clustering algorithm is an extension of fuzzy C-means algorithm with weighting exponent m = 2, and fuzzy attribute C-means clustering is a general type of attribute means clustering with weighting exponent m > 1, we modify the distance in fuzzy attribute C-means clustering algorithm with kernel-induced distance, and obtain kernelized fuzzy attribute C-means clustering algorithm. Kernelized fuzzy attribute C-means clustering algorithm is a natural generalization of kernelized fuzzy C-means algorithm with stable function. Experimental results on standard Iris database and tumor/normal gene chip expression data demonstrate that kernelized fuzzy attribute C-means clustering algorithm with Gaussian radial basis kernel function and Cauchy stable function is more effective and robust than fuzzy C-means, fuzzy attribute C-means clustering and kernelized fuzzy C-means as well. © 2008 Elsevier B.V. All rights reserved.
K-Means Algorithm for Optimizing Initial Cluster Center Selection
YANG Yi-fan, HE Guo-xian, LI Yong-ding (School of Transportation, Lanzhou Jiaotong University, Lanzhou 730070, China)
Abstract: The clustering effect of the K-means algorithm is closely related to the selection of the initial cluster centers and to the isolated points in the data, so it has strong uncertainty.
To address this shortcoming, a K-means algorithm that optimizes the selection of the initial cluster centers, based on nearest-neighbor density, is proposed.
The algorithm considers the distribution of the data set and divides the sample points into isolated points, low-density points and core points; the isolated points and low-density points are then eliminated and the initial cluster centers are selected from the core points. Isolated points do not take part in computing the mean of each class during clustering.
Finally, the isolated points are assigned to the corresponding classes by the nearest-distance principle, which completes the algorithm.
The experimental results show that the improved K-means algorithm can improve clustering accuracy, reduce the number of iterations and give better clustering results.
Key words: clustering; K-means; nearest neighbor density; initial clustering center; isolated points
CLC number: TP391. Document code: A. Article ID: 1009-3044(2021)05-0252-04
Clustering is the process of dividing a set of physical or abstract objects into several classes according to some criterion; the objects within each resulting cluster should be as similar as possible, and objects in different clusters as dissimilar as possible [1-2].
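Fuzzy cluster analysis
FCM (Fuzzy C-Means) is a fuzzy clustering algorithm. It is a soft clustering method, meaning that a sample point can belong to several clusters.
This differs from hierarchical, K-means and density-based clustering, in which a sample either belongs to a cluster or does not.
Fuzzy clustering introduces the concept of a membership value: for each cluster, every sample is given a membership value in [0, 1] (similar to a probability) that expresses the degree to which it belongs to that cluster. If the membership values are restricted to 0 or 1, the method reduces to K-means clustering. Fuzzy clustering also imposes the constraint that a sample's membership values over all clusters sum to 1.
The idea of clustering is to make the differences between sample points within a cluster as small as possible and the differences between clusters as large as possible.
The C in fuzzy clustering has the same meaning as the K in K-means: both denote the number of clusters. Besides C, fuzzy clustering has one more parameter, m.
C sets the number of clusters, while m controls the softness (fuzziness) of the algorithm and can affect clustering accuracy. If m is too small, the sample points end up rather scattered and noise (outliers) has a large influence; if m is too large, the sample points end up concentrated and the algorithm has little control over samples that deviate from the mainstream.
A value of m = 2 is usually sufficient (it is also the default in R).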
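Fuzzy clustering works by iteratively minimizing an objective function, and the value of that function determines when the algorithm stops; for the detailed derivation see (https:///zjsghww/article/details/50922168). The algorithm proceeds roughly as follows:
1: Randomly generate C cluster centers (or randomly generate membership values);
2: Compute the membership matrix (or the cluster centers);
3: Given the membership matrix (or cluster centers), recompute the cluster centers (or the membership matrix);
4: Compute the objective function;
5: If the objective function has reached its minimum or no longer fluctuates appreciably, stop and take the current clustering as the final result; otherwise recompute the membership matrix (or cluster centers).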
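The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: Euclidean distance, the standard FCM update formulas, and the variant that starts from cluster centers rather than from random memberships. The function name, the `init_centers`, `tol` and `seed` parameters, and the clamping constant are illustrative choices, not taken from the original text.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fuzzy_c_means(points, c, m=2.0, max_iter=200, tol=1e-6,
                  init_centers=None, seed=0):
    """Returns (centers, u) where u[k][i] is the membership of point k in cluster i."""
    rng = random.Random(seed)
    # Step 1: start from C cluster centers (random data points unless given).
    centers = [list(v) for v in (init_centers or rng.sample(points, c))]
    exponent = 1.0 / (m - 1.0)
    dim = len(points[0])
    previous = None
    for _ in range(max_iter):
        # Step 2: membership matrix from the current centers. Squared distances
        # are clamped so a point lying on a center cannot divide by zero.
        d2 = [[max(squared_distance(p, v), 1e-12) for v in centers] for p in points]
        u = [[1.0 / sum((row[i] / row[j]) ** exponent for j in range(c))
              for i in range(c)] for row in d2]
        # Step 3: recompute each center as a membership-weighted mean.
        for i in range(c):
            weights = [u_k[i] ** m for u_k in u]
            total = sum(weights)
            centers[i] = [sum(w * p[d] for w, p in zip(weights, points)) / total
                          for d in range(dim)]
        # Step 4: objective J = sum over k, i of u_ki^m * d_ki^2.
        objective = sum(u[k][i] ** m * squared_distance(points[k], centers[i])
                        for k in range(len(points)) for i in range(c))
        # Step 5: stop once the objective no longer changes noticeably.
        if previous is not None and abs(previous - objective) < tol:
            break
        previous = objective
    return centers, u
```

Each row of the returned matrix sums to 1, which is the constraint described earlier; taking the largest entry of a row gives a hard assignment comparable to K-means.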
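Bisecting k-means clustering algorithm
Bisecting k-means is a clustering algorithm used for high-dimensional data.
It starts by treating all data points as a single cluster and then splits that cluster in two.
Different strategies can be used to choose which cluster to split next, for example the cluster with the largest SSE (within-cluster sum of squared errors), or the cluster whose sample points are farthest apart.
The chosen cluster is then split into two sub-clusters by running k-means with k = 2 on it.
This process is repeated until the preset number of clusters is reached.
The advantage of bisecting k-means is that it reduces the sensitivity of k-means to the initial cluster centers, and it can partition clusters of different sizes, densities and shapes better.
However, its computational cost is higher, and a suitable number of clusters is hard to determine.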
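The pseudocode for bisecting k-means is as follows:
1. Put all data points into a single cluster.
2. While the number of clusters is smaller than the preset number of clusters:
a. Select the cluster with the largest SSE from the cluster list.
b. Run k-means with k = 2 on that cluster to obtain two sub-clusters.
c. Replace the selected cluster in the cluster list with the two sub-clusters.
3. Return the final cluster list.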
In practice, the parameters and the splitting strategy can be adjusted to the specific problem to obtain better clustering results.
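A minimal Python sketch of the bisecting procedure, assuming the largest-SSE splitting strategy, Euclidean distance, and input points that are all distinct. The helper names (`two_means`, `sse`, `mean`) and the `seed` parameter are illustrative, not from the original text.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def sse(points):
    """Within-cluster sum of squared errors around the cluster mean."""
    center = mean(points)
    return sum(squared_distance(p, center) for p in points)

def two_means(points, rng, max_iter=50):
    """Split one cluster into two with k-means (k = 2)."""
    centers = rng.sample(points, 2)  # two distinct members as initial centers
    groups = ([], [])
    for _ in range(max_iter):
        groups = ([], [])
        for p in points:
            d0 = squared_distance(p, centers[0])
            d1 = squared_distance(p, centers[1])
            groups[0 if d0 <= d1 else 1].append(p)
        if not groups[0] or not groups[1]:
            break  # only possible if the input contains duplicate points
        new_centers = [mean(groups[0]), mean(groups[1])]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return groups

def bisecting_kmeans(points, k, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]  # step 1: everything in one cluster
    while len(clusters) < k:   # step 2: split until k clusters exist
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        if len(clusters[worst]) < 2:
            break  # nothing left that can be split
        left, right = two_means(clusters.pop(worst), rng)
        clusters.extend([left, right])
    return clusters            # step 3
```

Swapping the `max(..., key=...)` criterion for another measure, such as cluster size or diameter, gives the alternative splitting strategies mentioned in the text.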
English technical terms for clustering algorithms
1. Clustering (聚类)
2. Distance Metric (距离度量)
3. Similarity Metric (相似度度量)
4. Pearson Correlation Coefficient (皮尔逊相关系数)
5. Euclidean Distance (欧几里得距离)
6. Manhattan Distance (曼哈顿距离)
7. Chebyshev Distance (切比雪夫距离)
8. Cosine Similarity (余弦相似度)
9. Hierarchical Clustering (层次聚类)
10. Divisive Clustering (分层聚类)
11. Agglomerative Clustering (凝聚聚类)
12. K-Means Clustering (K均值聚类)
13. Gaussian Mixture Model Clustering (高斯混合模型聚类)
14. Density-Based Clustering (密度聚类)
15. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
16. OPTICS (Ordering Points To Identify the Clustering Structure)
17. Mean Shift
18. Clustering Evaluation Metrics (聚类评估指标)
19. Silhouette Coefficient (轮廓系数)
20. Calinski-Harabasz Index (Calinski-Harabasz指数)
21. Davies-Bouldin Index (Davies-Bouldin指数)
22. Cluster Center (聚类中心)
23. Cluster Radius (聚类半径)
24. Noise Point (噪声点)
25. Within-Cluster Variation (簇内差异)
26. Between-Cluster Variation (簇间差异)
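As a small illustration of terms 5 to 8, the following sketch implements the four measures for points given as equal-length sequences of numbers. The function names are illustrative, and `cosine_similarity` assumes neither vector is all zeros.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

The first three are distances (smaller means more alike), while cosine similarity is a similarity (larger means more alike), so a clustering routine has to be told which kind it is given.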
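Basics of cluster analysis
Cluster analysis is an important analytical method for finding regular patterns in how data are distributed.
This article introduces its basic concepts, its main methods and its applications.
First, what is cluster analysis? Simply put, it is a machine learning technique that groups the objects in a data set into clusters so that objects within a cluster are more similar to each other and objects in different clusters are more different.
Its purpose is to discover regular groupings present in the data.
Second, what methods of cluster analysis are there? Common ones include K-Means, Hierarchical Clustering, Fuzzy Clustering and DBSCAN.
K-Means is a clustering method based on geometric distance. It iteratively reassigns objects so that the distances within each cluster are minimized, which tends to keep different clusters far apart.
Hierarchical Clustering uses agglomeration (merging) and division (splitting) to organize the data into a hierarchy, from which the clusters are obtained.
Fuzzy Clustering assigns an object to several clusters with different degrees of membership, rather than treating it as either fully belonging or not belonging to a cluster.
DBSCAN is a density-based clustering method that groups objects into clusters according to differences in data density.
Finally, what are the applications of cluster analysis? It is widely used in business analytics, for example in customer analysis, market segmentation and association rules.
It is also used in other fields, such as text classification, biomedical data analysis and machine learning.
In short, cluster analysis is an effective data analysis tool that discovers regular groupings in data, and it is widely used in business analytics and other fields.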
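Interpreting SPSS cluster analysis results
14.4 Discriminant analysis (p. 374)
The concept of discriminant analysis: a method for deciding how a subject under study should be classified on the basis of the observed values of several variables. A discriminant function must first be established, Y = a1x1 + a2x2 + ... + anxn, where Y is the discriminant score (discriminant value), x1, x2, ..., xn are the variables reflecting the characteristics of the subject, and a1, a2, ..., an are coefficients. For subjects divided into m classes, SPSS builds m linear discriminant functions. To classify an individual, the values of its variables are substituted into the discriminant functions to obtain discriminant scores, which determine the class the individual belongs to; alternatively, the probability of belonging to each class is computed and used to decide the class. Standardized and unstandardized canonical discriminant functions are also built. For details see Professor Wu Xizhi's (吴喜之) lecture notes on discriminant analysis below.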
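The scoring step can be illustrated with a short sketch. It assumes one coefficient vector per class, no constant term (matching the formula above), and assignment to the class with the highest score; that decision rule and the function names are assumptions for illustration, not SPSS output or its API.

```python
def discriminant_score(coefficients, x):
    """Y = a1*x1 + a2*x2 + ... + an*xn for one discriminant function."""
    return sum(a * xi for a, xi in zip(coefficients, x))

def classify(functions, x):
    """functions maps a class label to its coefficient list.

    Returns the label with the highest score and all scores.
    """
    scores = {label: discriminant_score(coef, x)
              for label, coef in functions.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

With coefficients estimated elsewhere (for example read off a discriminant analysis output table), `classify` reproduces by hand the substitution of an individual's variable values into the m functions.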
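Cluster analysis
For a given data set, one can classify the variables (indicators), which amounts to classifying the columns of the data, or classify the observations (events, samples), which amounts to classifying the rows.
For example, with student grade data, students can be classified by their science grades or their arts grades (or by considering the grades in all subjects together).
Of course, the number of classes need not be assumed in advance; the classification can follow entirely from the patterns in the data itself.
The classification method introduced in this chapter is called cluster analysis. Clustering of variables is called R-type clustering, and clustering of observations is called Q-type clustering. Mathematically the two are symmetric, and there is no difference between them.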
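Data as above (data14-01a): cluster using the data of four outstanding performers, one from each of the four classes, as the initial cluster centers (seeds). The cluster-center data file is data14-01b (it lacks a Cluster_ column, so it cannot be used directly and must be modified). Classification of the athletes (again into 4 classes):
Analyze > Classify > K-Means Cluster
Variables: x1, x2, x3
Label Case By: no
Number of Cluster: 4
Center: Read initial from: data14-01b
Save: Cluster membership and Distance from Cluster Center
The more useful results (which can be compared with the earlier run without initial cluster centers) are:
the four final cluster centers (Final Cluster Centers);
the number of observations in each cluster (Number of Cases in each Cluster);
two new variables in the data file, qc1_1 (the cluster each observation is finally assigned to) and qc1_2 (the distance from each observation to the center of its cluster).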
A robust kernelized intuitionistic fuzzy c-means clustering algorithm
A robust kernelized intuitionistic fuzzy c-means clustering algorithm in segmentation of noisy medical images
Prabhjot Kaur a,*, A.K. Soni b,1, Anjana Gosain c,2
a Department of Information Technology, Maharaja Surajmal Institute of Technology, C-4, Janakpuri, New Delhi 110058, India
b Department of Computer Science, Sharda University, Greater Noida, Uttar Pradesh, India
c Department of Information Technology, USIT, Guru Gobind Singh Indraprastha University, New Delhi, India
P. Kaur et al. / Pattern Recognition Letters 34 (2013) 163–175
RETRACTED
Article history: Received 22 March 2011. Available online 4 October 2012. Communicated by S. Sarkar
Keywords: Fuzzy clustering; Intuitionistic fuzzy c-means; Robust image segmentation; RBF kernel based intuitionistic fuzzy c-means; Fuzzy c-means
* Corresponding author. Tel.: +91 9810665064/9810165064. E-mail addresses: thisisprabhjot@ (P. Kaur), ak.soni@sharda.ac.in (A.K. Soni), anjana_gosain@ (A. Gosain). 1 Tel.: +91 9990021800. 2 Tel.: +91 9811055716.
0167-8655/$ - see front matter © 2012 Elsevier B.V. All rights reserved. /10.1016/j.patrec.2012.09.015

Abstract
This paper presents an automatic effective intuitionistic fuzzy c-means which is an extension of standard intuitionistic fuzzy c-means (IFCM). We present a model called RBF Kernel based intuitionistic fuzzy c-means (KIFCM) where IFCM is extended by adopting a kernel induced metric in the data space to replace the original Euclidean norm metric. By using a kernel function it becomes possible to cluster data, which is linearly non-separable in the original space, into homogeneous groups by transforming the data into a high dimensional space. The proposed clustering method is applied on synthetic data-sets referred from various papers, real data-sets from the Public Library UCI, and simulated and real MR brain images. Experimental results are given to show the effectiveness of the proposed method in contrast to conventional fuzzy c-means, possibilistic c-means, possibilistic fuzzy c-means, noise clustering, kernelized fuzzy c-means, type-2 fuzzy c-means, kernelized type-2 fuzzy c-means, and intuitionistic fuzzy c-means.
© 2012 Elsevier B.V. All rights reserved.

1. Introduction
Many neurological conditions alter the shape, volume, and distribution of brain tissue; magnetic resonance imaging (MRI) is the preferred imaging modality for examining these conditions. MRI is an important diagnostic imaging technique for the early detection of abnormal changes in tissues and organs. MRI possesses good contrast resolution for different tissues and has advantages over CT for brain studies due to its superior contrast properties. Therefore, the majority of research in medical image segmentation concerns MR images.
Image segmentation plays an important role in image analysis and computer vision. The goal of image segmentation is partitioning of an image into a set of disjoint regions with uniform and homogeneous attributes such as intensity, color, tone, etc. In images, the boundaries between objects are blurred and distorted due to the imaging acquisition process. Furthermore, object definitions are not always crisp and knowledge about the objects in a scene may be vague. Fuzzy set theory and fuzzy logic are ideally suited to deal with such uncertainties. Fuzzy sets were introduced in 1965 by Lotfi Zadeh with a view to reconcile mathematical modeling and human knowledge in the engineering sciences. Medical images generally have limited spatial resolution, poor contrast, noise, and non-uniform intensity variation.
The fuzzy c-means (FCM) algorithm (Bezdek, 1981), proposed by Bezdek, is the most widely used algorithm in image segmentation. FCM is the extension of the fuzzy ISODATA algorithm proposed by Dunn (Dunn, 1974). FCM has been successfully applied to feature analysis, clustering, and classifier designs in fields such as astronomy, geology, medical imaging, target recognition, and image segmentation. An image can be represented in various feature spaces and the FCM algorithm classifies the image by grouping similar data points in the feature space into clusters. If the image is noisy or distorted, the FCM technique wrongly classifies noisy pixels because of their abnormal feature data, which is the major limitation of FCM.
Various approaches have been proposed by researchers to compensate for this drawback of FCM. Dave proposed the idea of a noise cluster to deal with noisy data using the technique known as noise clustering (Dave and Krishnapuram, 1997). Another similar technique, PCM, proposed by Krishnapuram and Keller (1993), interpreted clustering as a possibilistic partition. However, it caused clustering to be stuck in one or two clusters. Yang and Chung (2009) developed a robust clustering technique by deriving a novel objective function for FCM. Kang et al. (2009) proposed another technique which modified the FCM objective function by incorporating the spatial neighborhood information into the standard FCM algorithm. Rhee and Hwang (2001) proposed type 2 fuzzy clustering. A type 2 fuzzy set is the fuzziness in a fuzzy set. In this algorithm, the membership value of each pattern in the image is extended as type 2 fuzzy memberships by assigning membership grades (triangular membership function) to type 1 fuzzy membership.
While discussing the uncertainty, another uncertainty arises, which is the hesitation in defining the membership function of the pixels of an image. Since the membership degrees are imprecise and vary with a person's choice, there is some kind of hesitation present, which arises from the lack of precise knowledge in defining the membership function. This idea led to another higher order fuzzy set called the intuitionistic fuzzy set, which was introduced by Atanassov in 1983. It took into account the membership degree as well as the non-membership degree. Few works on clustering are reported in the literature on intuitionistic fuzzy sets. Zhang and Chen (2009) suggested a clustering approach where an intuitionistic fuzzy similarity matrix is transformed to an interval valued fuzzy matrix. Recently, Chaira (2011) proposed a novel intuitionistic fuzzy c-means algorithm using intuitionistic fuzzy set theory. This algorithm incorporated another uncertainty factor, which is the hesitation degree that arises while defining the membership function.
This paper proposes a new model called RBF kernel based intuitionistic fuzzy c-means (KIFCM), which is an extension of intuitionistic fuzzy c-means (IFCM) by adopting a kernel induced metric in the data space to replace the original Euclidean norm metric. By replacing the inner product with an appropriate 'kernel' function, one can implicitly perform a non-linear mapping to a high dimensional feature space in which the data is more clearly separable.
The organization of the paper is as follows: Section 2 briefly reviews fuzzy c-means (FCM), type-2 FCM (T2FCM) (Rhee and Hwang, 2001), intuitionistic fuzzy c-means (IFCM) (Chaira, 2011), possibilistic c-means (PCM), possibilistic fuzzy c-means (PFCM), and noise clustering (NC). Section 3 describes the proposed algorithm, RBF kernel based intuitionistic fuzzy c-means (KIFCM). Section 4 evaluates the performance of the proposed algorithm using synthetic data-sets and simulated and real medical images, followed by concluding remarks in Section 5.

2. Background information
This section briefly discusses the fuzzy c-means (FCM), type-2 fuzzy c-means, intuitionistic fuzzy c-means (IFCM), PCM, PFCM, and NC algorithms. In this paper, the data-set is denoted by 'X', where $X = \{x_1, x_2, x_3, \ldots, x_n\}$ specifies an image with 'n' pixels in M-dimensional space to be partitioned into 'c' clusters. Centroids of clusters are denoted by $v_i$ and $d_{ik}$ is the distance between $x_k$ and $v_i$.

2.1. The fuzzy c-means algorithm (FCM)
FCM is the most popular fuzzy clustering algorithm. It assumes that the number of clusters 'c' is known a priori and minimizes the objective function ($J_{FCM}$):
$$J_{FCM} = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} d_{ik}^{2} \tag{1}$$
where $d_{ik} = \|x_k - v_i\|$, and $u_{ik}$ is the membership of pixel '$x_k$' in cluster 'i', which satisfies the following relationship:
$$\sum_{i=1}^{c} u_{ik} = 1; \quad k = 1, 2, \ldots, n \tag{2}$$
Here 'm' is a constant, known as the fuzzifier (or fuzziness index), which controls the fuzziness of the resulting partition. Any norm $\|\cdot\|$ can be used for calculating $d_{ik}$. Minimization of $J_{FCM}$ is performed by a fixed point iteration scheme known as the alternating optimization technique. The conditions for local extrema for (1) and (2) are derived using Lagrangian multipliers:
$$u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \frac{d_{ik}}{d_{jk}} \right)^{\frac{2}{m-1}}} \quad \forall k, i \tag{3}$$
where $1 \le i \le c$; $1 \le k \le n$, and
$$v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_k}{\sum_{k=1}^{n} u_{ik}^{m}} \quad \forall i \tag{4}$$
The FCM algorithm iteratively optimizes $J_{FCM}(U, V)$ with the continuous update of U and V, until $\|U^{(l+1)} - U^{(l)}\| \le \varepsilon$, where 'l' is the number of iterations.
FCM works fine for images which are not corrupted with noise, but if the image is noisy or distorted then it wrongly classifies noisy pixels because of their abnormal feature data, which is pixel intensity in the case of images, and results in an incorrect membership and improper segmentation.

2.2. The type-2 fuzzy c-means algorithm (T2FCM)
Rhee and Hwang (2001) extended the type-1 membership values (i.e. membership values of FCM) to type-2 by assigning a membership function to each membership value of type-1. Their idea is based on the fact that higher membership values should contribute more than smaller membership values when updating the cluster centers. Type-2 memberships can be obtained as per the following equation:
$$a_{ik} = u_{ik} - \frac{1 - u_{ik}}{2} \tag{5}$$
where $a_{ik}$ and $u_{ik}$ are the type-2 and type-1 fuzzy membership respectively. From (5), the type-2 membership function area can be considered as the uncertainty of the type-1 membership contribution when the center is updated. Substituting (5) for the memberships in the center update equation of the conventional FCM method gives the following equation for updating centers:
$$v_i = \frac{\sum_{k=1}^{n} (a_{ik})^{m} x_k}{\sum_{k=1}^{n} (a_{ik})^{m}} \tag{6}$$
During the cluster center updates, the contribution of a pattern that has low memberships to a given cluster is relatively smaller when using type-2 memberships, and the memberships may represent better typicality. Cluster centers that are estimated by type-2 memberships tend to have more desirable locations than cluster centers obtained by the type-1 FCM method in the presence of noise. The T2FCM algorithm is identical to the type-1 FCM algorithm except for Eq. (6). At each iteration, the cluster center and membership matrix are updated and the algorithm stops when the updated membership and the previous membership satisfy $\max_{ik} |a_{ik}^{new} - a_{ik}^{prev}| < \varepsilon$, where $\varepsilon$ is a user defined value. Although T2FCM has proven effective for spherical data, it fails when the data structure of the input patterns is non-spherical and complex.

2.3. The intuitionistic fuzzy c-means algorithm (IFCM)
The intuitionistic fuzzy c-means clustering algorithm is based upon intuitionistic fuzzy set theory. A fuzzy set generates only the membership function $\mu(x)$, $x \in X$, whereas the intuitionistic fuzzy set (IFS) given by Atanassov considers both membership $\mu(x)$ and non-membership $\nu(x)$. An intuitionistic fuzzy set A in X is written as:
$$A = \{ \langle x, \mu_A(x), \nu_A(x) \rangle \mid x \in X \}$$
where $\mu_A(x) \to [0, 1]$, $\nu_A(x) \to [0, 1]$ are the membership and non-membership degrees of an element in the set A with the condition $0 \le \mu_A(x) + \nu_A(x) \le 1$.
When $\nu_A(x) = 1 - \mu_A(x)$ for every x in the set A, then the set A becomes a fuzzy set. For all intuitionistic fuzzy sets, Atanassov also indicated a hesitation degree, $\pi_A(x)$, which arises due to lack of knowledge in defining the membership degree of each element x in the set A and is given by:
$$\pi_A(x) = 1 - \mu_A(x) - \nu_A(x); \quad 0 \le \pi_A(x) \le 1.$$
Due to the hesitation degree, the membership values lie in the interval $[\mu_A(x), \mu_A(x) + \pi_A(x)]$.
The intuitionistic fuzzy c-means (Chaira, 2011) objective function contains two terms: (i) the modified objective function of conventional FCM using the intuitionistic fuzzy set and (ii) intuitionistic fuzzy entropy (IFE). IFCM minimizes the objective function:
$$J_{IFCM} = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{*m} d_{ik}^{2} + \sum_{i=1}^{c} \pi_i^{*} e^{1 - \pi_i^{*}} \tag{7}$$
Here $u_{ik}^{*} = u_{ik} + \pi_{ik}$, where $u_{ik}^{*}$ denotes the intuitionistic fuzzy membership and $u_{ik}$ denotes the conventional fuzzy membership of the k-th data in the i-th class. $\pi_{ik}$ is the hesitation degree, which is defined as:
$$\pi_{ik} = 1 - u_{ik} - (1 - u_{ik}^{\alpha})^{1/\alpha}, \quad \alpha > 0,$$
and is calculated from Yager's intuitionistic fuzzy complement $N(x) = (1 - x^{\alpha})^{1/\alpha}$, $\alpha > 0$. Thus, with the help of Yager's intuitionistic fuzzy complement, the intuitionistic fuzzy set becomes:
$$A_{\lambda}^{IFS} = \{ \langle x, \mu_A(x), (1 - \mu_A(x)^{\alpha})^{1/\alpha} \rangle \mid x \in X \} \tag{8}$$
and
$$\pi_i^{*} = \frac{1}{N} \sum_{k=1}^{n} \pi_{ik}, \quad k \in [1, N]$$
The second term in the objective function is called intuitionistic fuzzy entropy (IFE). Initially the idea of fuzzy entropy was given by Zadeh in 1969. It is the measure of fuzziness in a fuzzy set. Similarly, in the case of IFS, intuitionistic fuzzy entropy gives the amount of vagueness or ambiguity in a set. For intuitionistic fuzzy cases, if $\mu_A(x_i)$, $\nu_A(x_i)$, $\pi_A(x_i)$ are the membership, non-membership, and hesitation degrees of the elements of the set $X = \{x_1, x_2, \ldots, x_n\}$, then the intuitionistic fuzzy entropy, IFE, that denotes the degree of intuitionism in a fuzzy set, may be given as:
$$IFE(A) = \sum_{i=1}^{n} \pi_A(x_i) \, e^{[1 - \pi_A(x_i)]}$$
where $\pi_A(x_i) = 1 - \mu_A(x_i) - \nu_A(x_i)$.
IFE is introduced in the objective function to maximize the good points in the class. The goal is to minimize the entropy of the histogram of an image. The modified cluster centers are:
$$v_i^{*} = \frac{\sum_{k=1}^{n} u_{ik}^{*} x_k}{\sum_{k=1}^{n} u_{ik}^{*}} \tag{9}$$
At each iteration, the cluster center and membership matrix are updated and the algorithm stops when the updated membership and the previous membership satisfy $\max_{ik} |U_{ik}^{*new} - U_{ik}^{*prev}| < \varepsilon$, where $\varepsilon$ is a user defined value.

2.4. Possibilistic c-means clustering (PCM)
To avoid the noise sensitivity problem of FCM, Krishnapuram and Keller (1993) relaxed the column sum constraint
$$\sum_{k=1}^{c} u_{ki} = 1; \quad i = 1, 2, \ldots, n$$
of FCM and proposed a possibilistic approach to clustering by minimizing the objective function:
$$J_{PCM}(U, V) = \sum_{k=1}^{c} \sum_{i=1}^{n} u_{ki}^{m} d_{ki}^{2} + \sum_{k=1}^{c} \eta_k \sum_{i=1}^{n} (1 - u_{ki})^{m}$$
where $\eta_k$ are suitable positive numbers. The first term tries to reduce the distance from data points to the centroids as low as possible and the second term forces $u_{ki}$ to be as large as possible, thus avoiding the trivial solution. The updating of centroids is the same as that in FCM but the membership matrix of PCM is updated as:
$$u_{ki} = \frac{1}{1 + \left( \frac{\|x_i - v_k\|^{2}}{\eta_k} \right)^{\frac{1}{m-1}}}$$
PCM sometimes helps when data is noisy. It is very sensitive to initializations and sometimes results in overlapping or identical clusters.

2.5. Possibilistic fuzzy c-means clustering (PFCM)
Pal et al. (2005) integrate the fuzzy approach with the possibilistic approach and hence PFCM has two types of memberships, viz. a possibilistic membership ($t_{ki}$) that measures the absolute degree of typicality of a point in any particular cluster and a fuzzy membership ($u_{ki}$) that measures the relative degree of sharing of a point among the clusters. PFCM minimizes the objective function:
$$J_{PFCM}(U, V, T) = \sum_{k=1}^{c} \sum_{i=1}^{n} \left( a u_{ki}^{m} + b t_{ki}^{\eta} \right) d_{ki}^{2} + \sum_{k=1}^{c} \gamma_k \sum_{i=1}^{n} (1 - t_{ki})^{\eta}$$
subject to the constraint that
$$\sum_{k=1}^{c} u_{ki} = 1 \quad \forall i$$
Here, $a > 0$, $b > 0$, $m > 1$, and $\eta > 1$. The constants 'a' and 'b' define the relative importance of fuzzy membership and typicality values in the objective function. The minimization of the objective function gives the following conditions:
$$u_{ki} = \frac{1}{\sum_{j=1}^{c} \left( \frac{d_{ki}}{d_{ji}} \right)^{\frac{2}{m-1}}} \quad \forall k, i$$
and
$$t_{ki} = \frac{1}{1 + \left( \frac{b}{\gamma_k} d_{ki}^{2} \right)^{\frac{1}{\eta - 1}}} \quad \forall k, i$$
and
$$v_k = \frac{\sum_{i=1}^{n} \left( a u_{ki}^{m} + b t_{ki}^{\eta} \right) x_i}{\sum_{i=1}^{n} \left( a u_{ki}^{m} + b t_{ki}^{\eta} \right)}$$
Though PFCM is found to perform better than FCM and PCM, when two highly unequal sized clusters with outliers are given, it fails to give the desired results.

2.6. Noise clustering (NC)
Noise clustering was introduced by Dave and Krishnapuram (1997) to overcome the major deficiency of the FCM algorithm, i.e. its noise sensitivity. He gave the concept of a "noise prototype", which is a universal entity such that it is always at the same distance from every point in the data-set. Let '$v_k$' be the noise prototype and '$x_i$' be any point in the data-set such that $v_k, x_i \in R^p$. Then the noise prototype is the distance, $d_{ki}$, given by:
$$d_{ki} = \delta, \quad \forall i$$
The NC algorithm considers noise as a separate class. The membership $u_i^{*}$ of $x_i$ in a noise cluster is defined as
$$u_i^{*} = 1 - \sum_{k=1}^{c} u_{ki}$$
NC reformulates the FCM objective function as:
$$J_{NC}(U, V) = \sum_{k=1}^{c+1} \sum_{i=1}^{N} u_{ki}^{m} d_{ki}^{2}$$
where 'c + 1' consists of 'c' good clusters and one noise cluster and, for $k = n = c + 1$,
$$\delta^{2} = \lambda \left[ \frac{\sum_{k=1}^{c} \sum_{i=1}^{N} (d_{ki})^{2}}{N c} \right]$$
and the membership equation is:
$$u_{ji} = \left[ \sum_{k=1}^{c} \left( \frac{d_{ji}^{2}}{d_{ki}^{2}} \right)^{\frac{1}{m-1}} + \left( \frac{d_{ji}^{2}}{\delta^{2}} \right)^{\frac{1}{m-1}} \right]^{-1}$$
Noise clustering is a better approach than FCM, PCM, and PFCM. Although it identifies outliers in a separate cluster, it does not result in efficient clusters because it fails to identify those outliers which are located in between the regular clusters. Its main objective is to reduce the influence of outliers on the clusters rather than identifying them. Real-life data-sets usually contain cluster structures that differ from our assumptions, so a clustering technique should be independent of the number of clusters for the same data-set. In NC, the noise distance is given as:
$$\delta^{2} = \lambda \left[ \frac{\sum_{i=1}^{c} \sum_{k=1}^{N} (d_{ik})^{2}}{N c} \right]$$
Here, the noise distance depends upon the distance measure, the number of assumed clusters, and $\lambda$, which is the value of the multiplier used to obtain '$\delta$' from the average of distances. From the equation, it is interpreted that if the number of clusters is increased, $\delta$ assumes high values. NC assigns only those points to the noise cluster whose distance from regular clusters is less than the distance from the noise distance, $\delta$. If the number of clusters is increased for the same data-set, NC does not detect outliers, because in that scenario the average distance between points and regular clusters decreases and the noise distance remains almost constant or assumes relatively high values.

3. The proposed algorithm, radial basis kernel based intuitionistic fuzzy c-means (KIFCM)
3.1. Kernel based approach
The present work proposes a way of increasing the accuracy of the intuitionistic fuzzy c-means by exploiting a kernel function in calculating the distance of a data point from the cluster centers, i.e. mapping the data points from the input space to a high dimensional space in which the distance is measured using a radial basis kernel function.
The kernel function can be applied to any algorithm that solely depends on the dot product between two vectors. Wherever a dot product is used, it is replaced by a kernel function. When done, linear algorithms are transformed into non-linear algorithms. Those non-linear algorithms are equivalent to their linear originals operating in the range space of a feature space $\varphi$. However, because kernels are used, the $\varphi$ function does not need to be ever explicitly computed. This is highly desirable, as sometimes our higher-dimensional feature space could even be infinite-dimensional and thus infeasible to compute.
A kernel function is a generalization of the distance metric that measures the distance between two data points as the data points are mapped into a high dimensional space in which they are more clearly separable. By employing a mapping function $\Phi(x)$, which defines a non-linear transformation $x \to \Phi(x)$, the non-linearly separable data structure existing in the original data space can possibly be mapped into a linearly separable case in the higher dimensional feature space.
Given an unlabeled data set $X = \{x_1, x_2, \ldots, x_n\}$ in the p-dimensional space $R^p$, let $\Phi$ be a non-linear mapping function from this input space to a high dimensional feature space H:
$$\Phi : R^p \to H, \quad x \to \Phi(x)$$
The key notion in kernel based learning is that the mapping function $\Phi$ need not be explicitly specified. The dot product in the high dimensional feature space can be calculated through the kernel function $K(x_i, x_j)$ in the input space $R^p$:
$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$$
Consider the following example. For p = 2 and the mapping function $\Phi$,
$$\Phi : R^2 \to H = R^3, \quad (x_{i1}, x_{i2}) \to \left( x_{i1}^{2}, x_{i2}^{2}, \sqrt{2}\, x_{i1} x_{i2} \right)$$
the dot product in the feature space H is calculated as
$$\Phi(x_i) \cdot \Phi(x_j) = \left( x_{i1}^{2}, x_{i2}^{2}, \sqrt{2}\, x_{i1} x_{i2} \right) \cdot \left( x_{j1}^{2}, x_{j2}^{2}, \sqrt{2}\, x_{j1} x_{j2} \right) = \left( (x_{i1}, x_{i2}) \cdot (x_{j1}, x_{j2}) \right)^{2} = (x_i \cdot x_j)^{2} = K(x_i, x_j)$$
where the K-function is the square of the dot product in the input space. We saw from this example that use of the kernel function makes it possible to calculate the value of the dot product in the feature space H without explicitly calculating the mapping function $\Phi$. Some examples of kernel functions are:
Example 1. Polynomial kernel: $K(x_i, x_j) = (x_i \cdot x_j + c)^{d}$, where $c \ge 0$, $d \in N$.
Example 2. Gaussian kernel: $K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}} \right)$, where $\sigma > 0$.
Example 3. Radial basis kernel: $K(x_i, x_j) = \exp\left( -\frac{\sum |x_i^{a} - x_j^{a}|^{b}}{\sigma^{2}} \right)$, where $\sigma, a, b > 0$. The RBF function with a = 1, b = 2 reduces to the Gaussian function.
Example 4. Hyper tangent kernel: $K(x_i, x_j) = 1 - \tanh\left( -\frac{\|x_i - x_j\|^{2}}{\sigma^{2}} \right)$, where $\sigma > 0$.

3.2. Formulation
Our proposed model, RBF kernel based intuitionistic fuzzy c-means (KIFCM), adopts a kernel induced metric which is different from the Euclidean norm in the original intuitionistic fuzzy c-means. KIFCM minimizes the objective function:
$$J_{KIFCM} = 2 \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{*m} \|\Phi(x_k) - \Phi(v_i)\|^{2} + \sum_{i=1}^{c} \pi_i^{*} e^{1 - \pi_i^{*}} \tag{10}$$
where $\|\Phi(x_k) - \Phi(v_i)\|^{2}$ is the square of the distance between $\Phi(x_k)$ and $\Phi(v_i)$. The distance in the feature space is calculated through the kernel in the input space as follows:
Fig. 1. (a–c) Clustering results of NC with λ = 0.6, KFCM with h (kernel width) = 6 and KIFCM with h = 6, a = 2, b = 4.2, α = 7 with the Diamond data-set; (d–f) clustering results of NC with λ = 0.17, KFCM with h (kernel width) = 100 and KIFCM with h = 300, a = 2, b = 2, α = 3 with the Bensaid data-set; (g–i) results of NC with λ = 0.17, KFCM with h (kernel width) = 55, a = 2, b = 2.6 and KIFCM with h = 55, a = 2, b = 3, α = 0.7 with the non-linear data-set. Centroids are shown with the '*' symbol and the clusters are differentiated with the '.', '+' and '×' symbols. Noise is detected with 'o'.
$$u_{ik}^{*} = \frac{\left( \frac{1}{1 - K(x_k, v_i)} \right)^{\frac{1}{m-1}}}{\sum_{j=1}^{c} \left( \frac{1}{1 - K(x_k, v_j)} \right)^{\frac{1}{m-1}}} \tag{12}$$
$$v_i = \frac{\sum_{k=1}^{n} u_{ik}^{*m} \, K(x_k, v_i) \, x_k}{\sum_{k=1}^{n} u_{ik}^{*m} \, K(x_k, v_i)} \tag{13}$$
Proof. We differentiate $J_{KIFCM}$ with respect to $u_{ik}^{*}$ and $v_i$ and set the derivatives to zero. Thus, we get (12) and (13). The details are given in Appendix A. □

3.3. Steps involved in KIFCM
Radial basis kernel based intuitionistic fuzzy c-means clustering
Input parameters: Image data (X), number of clusters (K = c + 1), number of iterations, stopping criteria (ε).
Output: Cluster centroids matrix, membership matrix.
Step 1: Get data from image.
Step 2: Select initial prototypes.
Step 3: Obtain the memberships using (12).
Step 4: Update the prototypes using (13).
Step 5: Update the memberships using (12) with updated prototypes.
Step 6: Repeat steps (3)–(5) till the updated membership satisfies the condition $\|u_{ik}^{(t+1)} - u_{ik}^{(t)}\| < \varepsilon \;\; \forall i, k$ for successive iterations t and t + 1, where $\varepsilon$ is a small number.

4. Simulations and results
In this section, experimental results are presented to compare the segmentation performance of radial basis kernel based intuitionistic fuzzy c-means (KIFCM) with FCM, PCM, PFCM, NC, KFCM, T2FCM, KT2FCM and IFCM. Experiments are implemented and simulated using MATLAB Version 7.0. We considered the following common parameters: m = 2, which is a common choice for fuzzy clustering, ε = 0.03, α = 0.85 for IFCM (as referred in (Chaira, 2011)), and maximum iterations = 200. We used the RBF kernel for the kernelized methods. Note that the kernel width 'h' in the RBF kernel has a very important effect on the performance of the algorithms. However, how to choose an appropriate value for the kernel width in the RBF kernel is still an "open problem". In this paper, we adopt the "trial-and-error" technique to find the kernel width. We used a synthetic data-set, standard data-sets, and real medical images for the experiments.

4.1. Synthetic data-sets
Three synthetic data-sets, the diamond (D12) data-set (referred from Pal et al. (2005)), the BENSAID data-set (Bensaid et al., 1996) and a non-linear synthetic data-set, are considered in this section.
Diamond data-set, D11, D12 (referred from Pal et al. (2005)). D11 is a noiseless data-set of points $\{x_i\}_{i=1}^{11}$. D12 is the union of D11 and an outlier $x_{12}$. BENSAID's two-dimensional data-set consists of one big and two small size clusters. We have preserved the structure of this set but have increased the count of core points and added uniform noise to it, which is distributed over the region [0, 120] × [10, 80]. To evaluate the effect of non-linear data structure, a synthetic non-linear data-set consisting of one circular and one elliptic cluster is considered.
All the seven algorithms, FCM, IFCM, PCM, PFCM, NC, KFCM and KIFCM, are implemented with these data-sets to check the performance.
For the diamond data-set, it is observed that FCM, IFCM, PCM, and PFCM could not detect the original clusters and their performance is badly affected by the presence of noise; NC and KFCM detected the clusters but the centroid location is still affected by noise. To show the effectiveness of the proposed algorithm, we also calculated the error for recognizing correct cluster centers in the case of the Diamond data-set with the equation $E_{*} = \|V_{ideal} - V_{*}\|^{2}$, where * is PCM/PFCM/NC/KFCM/KIFCM. The ideal (true) centroids of the Diamond data-set are:
$$V_{ideal} = \begin{bmatrix} -3.34 & 0 \\ 3.34 & 0 \end{bmatrix}$$
FCM and IFCM could not detect the clusters. The average error with PCM, PFCM, NC, KFCM and KIFCM is 11.17, 11.82, 0.077, 0.0362 and 0.003 respectively. Fig. 1 shows the clustering results with the three best performing algorithms (NC, KFCM and KIFCM) for all the three data-sets. Clearly, after observing the results of Fig. 1 and considering the error percentage, it is observed that the proposed method can produce more accurate centroids than other methods and is highly robust against noise.
In the case of the BENSAID data-set, PCM, PFCM and NC could not detect the original clusters and the performance of FCM and KFCM is affected by the presence of noise. The best performance is achieved in the case of IFCM and KIFCM and the results are not much affected as in other cases.
In the case of the non-linear data-set, all the algorithms except KFCM and KIFCM could not detect the elliptical cluster. KFCM although
detected the elliptical cluster but misaligned some members of the elliptical cluster into the circular cluster,whereas KIFCM cor-rectly partitioned all the data points.4.1.1.Performance evaluation based upon misclassificationsTo verify the performance of seven algorithm based upon mis-classifications,we calculated their scores defined by the following quantitative index (Masulli and Schenone,1999).r ij ¼A ij \A refj ij refjwhere A ij represents the set of pixels belonging to the j th class found by the i th algorithm and A refj represents the set of pixels belonging to the j th class in the reference segmented image.r ij is a fuzzy similarity measure,indicating the degree of equality be-tween A ij and A refj ,and the larger the r ij ,the better the segmentation is.Table 1lists the misclassifications and comparison scores using seven methods.From Fig.1and Table 1,we observed that although many algo-rithms out of seven detected the right clusters in case of diamond and Bensaid data-set but in case of synthetic non-linear data-set no algorithm except KIFCM detected the original clusters.Apart from that,in case of numerical data-sets,location of cluster centers is the major criteria to compare various algorithms.So considering these points into view,we can say that KIFCM outperformed the other six algorithms.4.2.Real data setsFour real data-sets,Wisconsin Breast cancer,iris,Wine and PIDD (Pima Indians Diabetes database)from public data-bank UCI (UC Irvine Machine Learning Repository)are used to evaluate the performance of these algorithms.168P.Kaur et al./Pattern Recognition Letters 34(2013)163–175R E TRA CT E D。
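The comparison score $r_{ij}$ of Section 4.1.1 can be illustrated with a few lines of Python. This is only a sketch, assuming the index is the ratio of the intersection size to the union size of the two pixel sets; the function name `overlap_score` and the index lists are hypothetical and do not come from the paper.

```python
def overlap_score(found, reference):
    """r = |A ∩ A_ref| / |A ∪ A_ref| for two collections of pixel (or point) indices."""
    found, reference = set(found), set(reference)
    union = found | reference
    if not union:
        return 1.0  # assumption: two empty classes are treated as identical
    return len(found & reference) / len(union)


# Hypothetical example: indices assigned to class j by an algorithm
# versus the indices of class j in the reference segmentation.
found_class = [0, 1, 2, 3, 4]
reference_class = [2, 3, 4, 5]
score = overlap_score(found_class, reference_class)  # 3 shared / 6 in the union
```

A score of 1 means the two sets coincide and a score of 0 means they share nothing, which matches the statement that a larger $r_{ij}$ indicates a better segmentation.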
K-Means & Fuzzy C-Mean
Convergence criterion: stop the computation when $E(t) = |J(t) - J(t-1)|$ or $E(t) = \|C(t) - C(t-1)\|$ satisfies the stopping condition.
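Using K-Means clustering
• The number of clusters K must be fixed in advance
• If the initial cluster centers are poorly placed, the objective function J can get trapped in a local optimum, and the resulting clusters will be unsatisfactory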
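Using K-Means clustering
– A clustering algorithm that groups N data points into K clusters according to their features, where K is a positive integer
– The goal is to minimize the sum of squared distances between each data point and its cluster center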
$$J = \sum_{i=1}^{K} J_i = \sum_{i=1}^{K} \sum_{j=1}^{N} w_{ji} \, \| X_j - C_i \|^{2} \qquad (1)$$
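– $J_i$ is the objective function of the ith cluster
– K is the number of clusters
– $X_j$ is the jth input vector
– $C_i$ is the ith cluster center (a vector)
– $w_{ji}$ is the weight (whether $X_j$ belongs to cluster $C_i$)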
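– (ii) Determine which cluster each data point belongs to (membership matrix): $w_{ji}^{(t)} = 1$ if $i = \arg\min_{1 \le l \le K} d_{jl}^{(t)}$, and $w_{ji}^{(t)} = 0$ otherwise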
• (B) Update the cluster centers:

$$C_i^{(t)} = \frac{\sum_{j=1}^{N} w_{ji}^{(t)} X_j}{\sum_{j=1}^{N} w_{ji}^{(t)}}, \qquad i = 1, \ldots, K$$
• (C) Compute the convergence criterion; if it is satisfied, stop the computation; otherwise start the next iteration
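The assignment, center-update and convergence steps above can be sketched in plain Python. This is a minimal illustration, not the slides' own code: the tolerance `tol` on the change of J, the rule that an empty cluster keeps its old center, and the toy data are all assumptions.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def kmeans(points, centers, tol=1e-9, max_iter=100):
    """Hard K-means; `centers` holds the initial cluster centers."""
    centers = [tuple(c) for c in centers]
    prev_j = None
    for _ in range(max_iter):
        # (A) w_ji = 1 only for the nearest center, 0 otherwise
        labels = [min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
                  for x in points]
        # (B) each center becomes the mean of the points assigned to it
        for i in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:  # assumption: an empty cluster keeps its previous center
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
        # (C) stop when the objective J changes by less than tol
        j_val = sum(sq_dist(x, centers[lab]) for x, lab in zip(points, labels))
        if prev_j is not None and abs(prev_j - j_val) < tol:
            break
        prev_j = j_val
    return centers, labels, j_val


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centers, labels, j_val = kmeans(points, [(0, 0), (10, 10)])
```

Starting from different initial centers can lead to a different local optimum of J, which is the sensitivity to initialization noted earlier in the slides.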
[Figure: feature-space scatter plot of weight (kg) against top speed (km/h) showing three labelled clusters (lorries, sports cars, medium market cars); the annotations point out the feature space, a cluster, a label and a feature.]
FCM clustering algorithm parameter: the fuzziness coefficient

The fuzzy C-means (FCM) clustering algorithm is a popular method in data clustering and pattern recognition. It is a soft clustering algorithm that allows a data point to belong to multiple clusters with varying degrees of membership. One of the key parameters in FCM is the fuzziness coefficient, also known as the membership exponent.

The fuzziness coefficient controls the degree of fuzziness in the clustering process. A higher fuzziness coefficient results in softer membership assignments, allowing data points to belong to multiple clusters with more overlapping boundaries. A lower fuzziness coefficient leads to sharper cluster boundaries and more distinct cluster assignments for data points.
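The effect of the fuzziness coefficient can be seen by computing the memberships of a single point for several values of m. The sketch below assumes the standard FCM membership update based on squared distances; the function name `fcm_memberships` and the distances are hypothetical, and distances are assumed to be nonzero.

```python
def fcm_memberships(sq_dists, m):
    """Memberships of one point, given its squared distances to every center."""
    power = 1.0 / (m - 1.0)
    inv = [(1.0 / d) ** power for d in sq_dists]  # assumes every distance is > 0
    total = sum(inv)
    return [v / total for v in inv]


# Hypothetical point: squared distance 1 to the first center and 4 to the second.
sq_dists = [1.0, 4.0]
results = {m: fcm_memberships(sq_dists, m) for m in (1.1, 2.0, 5.0)}
for m, u in results.items():
    print(m, [round(v, 3) for v in u])
```

With m close to 1 the membership in the nearer cluster approaches 1 (an almost hard assignment), while a large m pushes both memberships toward 0.5.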
A Fuzzy K-means Clustering Algorithm Using Cluster Center Displacement
Journal of Information Science and Engineering 27, 995-1009 (2011)

Chih-Tang Chang (1), Jim Z. C. Lai (2) and Mu-Der Jeng (1)
(1) Department of Electrical Engineering
(2) Department of Computer Science and Engineering
National Taiwan Ocean University, Keelung, 202 Taiwan

In this paper, we present a fuzzy k-means clustering algorithm using the cluster center displacement between successive iterative processes to reduce the computational complexity of the conventional fuzzy k-means clustering algorithm. The proposed method, referred to as CDFKM, first classifies cluster centers into active and stable groups. Our method skips the distance calculations for stable clusters in the iterative process. To speed up the convergence of CDFKM, we also present an algorithm to determine the initial cluster centers for CDFKM. Compared to the conventional fuzzy k-means clustering algorithm, our proposed method can reduce computing time by a factor of 3.2 to 6.5 using the data sets generated from the Gauss Markov sequence. Our algorithm can reduce the number of distance calculations of the conventional fuzzy k-means clustering algorithm by 38.9% to 86.5% using the same data sets.

Keywords: vector quantization, fuzzy k-means clustering, data clustering, knowledge discovery, pattern recognition

1. INTRODUCTION

Data clustering is used frequently in a number of applications, such as vector quantization (VQ) [1-4], pattern recognition [5], knowledge discovery [6], speaker recognition [7], fault detection [8], and web/data mining [9]. Among clustering formulations that minimize an objective function, fuzzy k-means clustering is widely used and studied [10].
The fuzzy k-means clustering algorithm is a special case of the generalized fuzzy k-means clustering scheme, where point representatives are adopted and the Euclidean distance is used to measure the dissimilarity between a vector X and its cluster representative C.

The fuzzy k-means clustering (FKM) algorithm iteratively performs the partition step and the new cluster representative generation step until convergence. Applications of FKM can be found in reference [11], which provides an excellent review of FKM. An iterative process with extensive computations is usually required to generate a set of cluster representatives. The convergence of FKM is usually much slower than that of hard k-means clustering [12]. Some methods are available to speed up hard k-means clustering [13-15]. Kanungo et al. [13] developed a filtering algorithm on a kd-tree to speed up the generation of new cluster centers. Sorting data points in a kd-tree for k-means clustering was also used by Pelleg and Moore [14]. After some iterations of hard k-means clustering, most of the centers have converged to their final positions and the majority of data points have few candidates to be selected as their closest centers. Lai et al. [15] exploited this characteristic to develop a fast k-means clustering algorithm that reduces the computational complexity of k-means clustering.

To reduce the computational complexity of FKM, Shankar and Pal used multistage random sampling to reduce the data size [16]. This method reduced the computational complexity by a factor of 2 to 4. Cannon, Dave, and Bezdek used look-up tables for storing distances to approximate fuzzy k-means clustering and reduced the computing time by a factor of about 6 [17]. It is noted that this method is applicable only to integer-valued data in the range of 0 to 255 and that the accuracy of a cluster center's coordinate is up to 0.1.
Höppner developed an approximate FKM to reduce the computational complexity of conventional FKM [18]. This method gave the same membership as that of conventional FKM within a given precision and reduced the computing time of conventional FKM by a factor of 2 to 4. It is noted that none of the above methods can obtain the same clustering result as conventional FKM. After some iterations of FKM, it is expected that many of the centers have converged to their final positions and that many distance calculations can be avoided at each partition step. This characteristic is exploited to reduce the computational complexity of fuzzy k-means clustering.

In this paper, two algorithms are presented to reduce the computing time of fuzzy k-means clustering. These two algorithms classify cluster centers (representatives) into stable and active groups, and the distance calculations are executed only for the active cluster representatives during the iterative process. This paper is organized as follows. Section 2 describes the fuzzy k-means clustering algorithm. Section 3 presents the algorithms developed in this paper. Some theoretical analyses of the presented algorithms are also shown in section 3. Experimental results are presented in section 4 and concluding remarks are given in section 5.

2. FUZZY K-MEANS CLUSTERING ALGORITHM

The fuzzy k-means clustering algorithm partitions data points into k clusters $S_l$ (l = 1, 2, …, k), and the clusters $S_l$ are associated with representatives (cluster centers) $C_l$. The relationship between a data point and a cluster representative is fuzzy. That is, a membership $u_{i,j} \in [0, 1]$ is used to represent the degree of belongingness of data point $X_i$ to cluster center $C_j$. Denote the set of data points as $S = \{X_i\}$.
The FKM algorithm is based on minimizing the following distortion:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{N} u_{i,j}^{m} \, d_{ij} \qquad (1)$$

with respect to the cluster representatives $C_j$ and memberships $u_{i,j}$, where N is the number of data points; m is the fuzzifier parameter; k is the number of clusters; and $d_{ij}$ is the squared Euclidean distance between data point $X_i$ and cluster representative $C_j$. It is noted that $u_{i,j}$ should satisfy the following constraint:

$$\sum_{j=1}^{k} u_{i,j} = 1, \quad \text{for } i = 1 \text{ to } N. \qquad (2)$$

The major process of FKM is mapping a given set of representative vectors into an improved one through partitioning data points. It begins with a set of initial cluster centers and repeats this mapping process until a stopping criterion is satisfied. It is supposed that no two clusters have the same cluster representative. In the case that two cluster centers coincide, a cluster center should be perturbed to avoid coincidence in the iterative process. If $d_{ij} < \eta$, then $u_{i,j} = 1$ and $u_{i,l} = 0$ for $l \ne j$, where $\eta$ is a very small positive number. The fuzzy k-means clustering algorithm is now presented as follows.

(1) Input a set of initial cluster centers $SC_0 = \{C_j(0)\}$ and the value of $\varepsilon$. Set p = 1.
(2) Given the set of cluster centers $SC_p$, compute $d_{ij}$ for i = 1 to N and j = 1 to k. Update memberships $u_{i,j}$ using the following equation:

$$u_{i,j} = \left( (d_{ij})^{1/(m-1)} \sum_{l=1}^{k} \left( \frac{1}{d_{il}} \right)^{1/(m-1)} \right)^{-1}. \qquad (3)$$

If $d_{ij} < \eta$, set $u_{i,j} = 1$, where $\eta$ is a very small positive number.
(3) Compute the center for each cluster using Eq. (4) to obtain a new set of cluster representatives $SC_{p+1}$.

$$C_j(p) = \frac{\sum_{i=1}^{N} u_{i,j}^{m} X_i}{\sum_{i=1}^{N} u_{i,j}^{m}} \qquad (4)$$

(4) If $\|C_j(p) - C_j(p-1)\| < \varepsilon$ for j = 1 to k, then stop, where $\varepsilon > 0$ is a very small positive number. Otherwise set p + 1 → p and go to step 2.

The major computational complexity of FKM is from steps 2 and 3. However, the computational complexity of step 3 is much less than that of step 2.
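The four steps above can be sketched in plain Python. This is an illustrative reading of Eqs. (3) and (4), not the authors' implementation: the function names, the `max_iter` safeguard and the toy data are assumptions, and no cluster is assumed to lose all of its membership weight.

```python
import math


def sq_dist(x, c):
    """Squared Euclidean distance d_ij between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def update_memberships(points, centers, m=2.0, eta=1e-5):
    """Eq. (3): u_ij = [ d_ij^(1/(m-1)) * sum_l (1/d_il)^(1/(m-1)) ]^-1."""
    power = 1.0 / (m - 1.0)
    u = []
    for x in points:
        d = [sq_dist(x, c) for c in centers]
        close = [j for j, dj in enumerate(d) if dj < eta]
        if close:  # d_ij < eta: crisp membership in that cluster
            row = [0.0] * len(centers)
            row[close[0]] = 1.0
        else:
            s = sum((1.0 / dl) ** power for dl in d)
            row = [1.0 / (dj ** power * s) for dj in d]
        u.append(row)
    return u


def update_centers(points, u, k, m=2.0):
    """Eq. (4): C_j = sum_i u_ij^m X_i / sum_i u_ij^m."""
    dim = len(points[0])
    centers = []
    for j in range(k):
        w = [row[j] ** m for row in u]
        total = sum(w)  # assumed to be nonzero
        centers.append(tuple(
            sum(wi * x[t] for wi, x in zip(w, points)) / total for t in range(dim)))
    return centers


def fkm(points, centers, m=2.0, eps=1e-5, eta=1e-5, max_iter=1000):
    """Repeat steps 2-4 until every center moves by less than eps."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):  # max_iter is a safeguard added for this sketch
        u = update_memberships(points, centers, m, eta)
        new_centers = update_centers(points, u, len(centers), m)
        shift = max(math.sqrt(sq_dist(a, b)) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift < eps:
            break
    return centers, u  # u was computed from the centers of the previous pass


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, u = fkm(points, [(0.0, 0.0), (10.0, 10.0)])
```

Every pass recomputes all N × k distances in `update_memberships`, which is the cost that dominates the iteration.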
Therefore the computational complexity, in terms of the number of distance calculations, of FKM is O(Nkt), where t is the number of iterations.

3. PROPOSED METHODS

In the iterative process of fuzzy k-means clustering, one may expect that the displacements of some cluster centers will fall below the threshold $\varepsilon$ after a few iterations, while others need many more iterations to stabilize, where $\varepsilon > 0$ is a very small positive number. Let the jth cluster centers used in the current and previous partitions be denoted as $C_j$ and $C'_j$, respectively. Denote the displacement between $C_j$ and $C'_j$ as $D_j$. That is, $D_j = \|C_j - C'_j\|$. If $D_j < \varepsilon$, then the vector $C_j$ is defined as a stable cluster center; otherwise it is called an active cluster center. The cluster associated with an active center is called an active cluster. Similarly, the cluster having a stable center is defined as a stable cluster. The number of stable cluster centers increases as the iteration proceeds [19].

3.1 Fuzzy K-means Clustering Algorithm Using Cluster Displacement

Denote the subsets consisting of active cluster centers and stable cluster centers as $SC_a$ and $SC_s$, respectively. Let $k_{a,i}$ be the number of clusters in $SC_a$ at the ith iteration of fuzzy k-means clustering. In general, the value of $k_{a,i}$ decreases as the iteration proceeds. The performance of the proposed method, in terms of computing time, is better if $k_{a,i}$ decreases more quickly during the iteration. The value and decreasing rate of $k_{a,i}$ depend on the data distribution. For a data set with good data separation, $k_{a,i}$ will decrease quickly. For an evenly distributed data set, $k_{a,i}$ will decrease slowly. For a real data set, a good data separation is usually obtained. In the worst case, $k_{a,i}$ equals k, which is the number of clusters.
It is noted that the centers of clusters in the previous iteration will be used to partition the set of data points $\{X_i\}$ in the current iteration. The FKM algorithm stops if the displacements of all cluster centers are less than $\varepsilon$. That is, if $D_j < \varepsilon$, then cluster $S_j$ is a stable cluster and $d_{ij}$ (i = 1 to N) will not be recalculated to update $u_{i,j}$ in the iterative process. The proposed algorithm uses this property to speed up fuzzy k-means clustering. Now, the fuzzy k-means clustering algorithm using cluster displacement (CDFKM) is presented below.

Fuzzy K-means Clustering Algorithm Using Center Displacement

(1) Input a set of initial cluster centers $SC_0$ and the value of $\varepsilon$. Set p = 1.
(2) Given the set of cluster centers $SC_0 = \{C_j(0)\}$, compute $d_{ij}$ for i = 1 to N and j = 1 to k. Use Eq. (3) to update $u_{i,j}$. If $d_{ij} < \eta$, set $u_{i,j} = 1$.
(3) Use Eq. (4) to update $C_j(p)$ and calculate $D_j = \|C_j(p) - C_j(p-1)\|$ for j = 1 to k. Set q = 1.
(4) If $D_q < \varepsilon$, go to step 5; otherwise compute $d_{iq}$ for i = 1 to N.
(5) Set q = q + 1. If q > k go to step 6; otherwise go to step 4.
(6) Use Eq. (3) to compute $u_{i,j}$ (if $d_{ij} < \eta$, set $u_{i,j} = 1$) for i = 1 to N and j = 1 to k.
(7) Set p = p + 1, use Eq. (4) to update $C_j(p)$, and calculate $D_j = \|C_j(p) - C_j(p-1)\|$ for j = 1 to k. If $D_j < \varepsilon$ for j = 1 to k, then stop; otherwise set q = 1 and go to step 4.

3.2 Parameter Selection and Computational Complexity Analysis

In this subsection, the effect of $\varepsilon$ on $u_{i,j}$ will be investigated. Eq. (3) can be rewritten as

$$u_{i,j} = \frac{1}{1 + \sum_{l=1, l \ne j}^{k} \left( \frac{d_{ij}}{d_{il}} \right)^{1/(m-1)}}. \qquad (5a)$$

That is,

$$\sum_{l=1, l \ne j}^{k} \left( \frac{d_{ij}}{d_{il}} \right)^{1/(m-1)} = (u_{i,j})^{-1} - 1. \qquad (5b)$$

Let the squared Euclidean distances between data point $X_i$ and cluster centers $C_j(p-1)$ and $C_j(p)$ be $d'_{ij}$ and $d_{ij}$, respectively. In the case of $\|C_j(p) - C_j(p-1)\| < \varepsilon$, $d'_{ij}$ is used to calculate memberships $u_{i,j}$.
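The reuse of stale distances for stable centers can be sketched as follows. This is a simplified illustration for m = 2 under stated assumptions, not the authors' code: the distance cache `d`, the counter `n_dist`, the `max_iter` safeguard and the toy data are all introduced here for the example.

```python
import math


def sq_dist(x, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def cdfkm(points, centers, eps=1e-5, eta=1e-5, max_iter=1000):
    """Fuzzy k-means (m = 2) that recomputes distances only for active centers."""
    k, dim = len(centers), len(points[0])
    centers = [tuple(c) for c in centers]
    # d[i][j] caches the squared distance between point i and center j
    d = [[sq_dist(x, c) for c in centers] for x in points]
    n_dist = len(points) * k  # number of distance calculations so far
    for _ in range(max_iter):  # max_iter is a safeguard added for this sketch
        # memberships from the cached (possibly stale) distances, Eq. (3) with m = 2
        u = []
        for row in d:
            close = [j for j, dj in enumerate(row) if dj < eta]
            if close:
                urow = [0.0] * k
                urow[close[0]] = 1.0
            else:
                s = sum(1.0 / dl for dl in row)
                urow = [1.0 / (dj * s) for dj in row]
            u.append(urow)
        # new centers, Eq. (4) with m = 2
        new_centers = []
        for j in range(k):
            w = [urow[j] ** 2 for urow in u]
            total = sum(w)  # assumed to be nonzero
            new_centers.append(tuple(
                sum(wi * x[t] for wi, x in zip(w, points)) / total for t in range(dim)))
        moved = [math.sqrt(sq_dist(a, b)) for a, b in zip(new_centers, centers)]
        centers = new_centers
        active = [j for j in range(k) if moved[j] >= eps]
        if not active:  # every displacement D_j is below eps: stop
            break
        for j in active:  # stable centers keep their cached distances
            for i, x in enumerate(points):
                d[i][j] = sq_dist(x, centers[j])
        n_dist += len(points) * len(active)
    return centers, u, n_dist


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, u, n_dist = cdfkm(points, [(0.0, 0.0), (10.0, 10.0)])
```

The saving shows up in `n_dist`: each pass adds N distance calculations per active center instead of N × k, so the benefit grows as more centers become stable.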
Denote $u'_{i,j}$ as the membership of $X_i$ with respect to $C_j(p)$ if $d_{ij}$ is replaced by $d'_{ij}$ in Eq. (3). Similarly, let $u_{i,j}$ be the membership of $X_i$ with respect to $C_j(p)$ for the case that $d_{ij}$ is used to calculate memberships. For many applications, m = 2 is used [10], and it is adopted in this paper. In the case of $\|C_j(p) - C_j(p-1)\| < \varepsilon$, $d'_{ij}$ can be estimated by

$$d'_{ij} \approx d_{ij} \pm O(\varepsilon) (d_{ij})^{1/2}. \qquad (6)$$

Eq. (6) implies that if $\|C_j(p) - C_j(p-1)\| < \varepsilon$, then $|d'_{ij} - d_{ij}| / (d_{ij})^{1/2} \approx O(\|C_j(p) - C_j(p-1)\|)$. That is, $|d'_{ij} - d_{ij}| / (d_{ij})^{1/2}$ approximately equals the displacement of the jth cluster's center. $u'_{i,j}$ is obtained by replacing $d_{ij}$ in Eq. (5a) by $d'_{ij}$ given in Eq. (6). That is,

$$u'_{i,j} \approx \frac{1}{1 + \sum_{l=1, l \ne j}^{k} \frac{d_{ij}}{d_{il}} \pm O(\varepsilon) (d_{ij})^{1/2} \sum_{l=1, l \ne j}^{k} \frac{1}{d_{il}}} = \frac{1}{1 + \sum_{l=1, l \ne j}^{k} \frac{d_{ij}}{d_{il}} \pm O(\varepsilon) (d_{ij})^{-1/2} \sum_{l=1, l \ne j}^{k} \frac{d_{ij}}{d_{il}}} \qquad (7a)$$

for m = 2. For the case of m = 2, Eq. (5b) becomes

$$\sum_{l=1, l \ne j}^{k} \frac{d_{ij}}{d_{il}} = (u_{i,j})^{-1} - 1. \qquad (7b)$$

Substituting the term $\sum_{l=1, l \ne j}^{k} d_{ij}/d_{il}$ in Eq. (7a) by $(u_{i,j})^{-1} - 1$, as shown in Eq. (7b), gives

$$u'_{i,j} \approx \frac{1}{(u_{i,j})^{-1} \pm O(\varepsilon) (d_{ij})^{-1/2} \left( (u_{i,j})^{-1} - 1 \right)} \approx u_{i,j} \pm O(\varepsilon) (d_{ij})^{-1/2} \left( u_{i,j} - (u_{i,j})^{2} \right). \qquad (8)$$

That is, $|u'_{i,j} - u_{i,j}| \approx O(\varepsilon) (d_{ij})^{-1/2} (u_{i,j} - (u_{i,j})^{2}) \le O(\varepsilon) (d_{ij})^{-1/2}$. In the case of $d_{ij} < \eta$, it follows that $u_{i,j} = u'_{i,j} = 1$ and $|u'_{i,j} - u_{i,j}| = 0$. Since $\eta$ is a very small positive number, $\eta \ll \eta^{1/2}$. In the case of $\eta \le d_{ij} \le \varepsilon$, one can obtain $|u'_{i,j} - u_{i,j}| \le O(\varepsilon) / \eta^{1/2}$. If $\varepsilon = \eta$ is chosen, $|u'_{i,j} - u_{i,j}| \ll 1$ will be obtained due to $\eta \ll \eta^{1/2}$. In this paper, $\varepsilon = \eta = 0.00001$ is used.

The major computational complexity of CDFKM is from steps 4, 6, and 7. To compute $u_{i,j}$, the sums $\sum_{l=1}^{k} (1/d_{il})^{1/(m-1)}$ for each data point $X_i$ are first calculated and stored. That is, Nk multiplications and additions are needed to update $u_{i,j}$ at step 6. To calculate the distances between all data points and cluster centers, Nk distance calculations are required.
Each distance calculation requires d multiplications and (2d − 1) additions, where d is the data dimension. That is, Nkd multiplications and Nk(2d − 1) additions are needed to calculate distances at step 4. Therefore, it can be concluded that the computational complexities of steps 6 and 7 are much less than that of step 4. Let $k_a$ be the average number of active cluster centers at each stage of iteration for CDFKM, where $k_a = \left( \sum_{i=1}^{t} k_{a,i} \right) / t$, t is the number of iterations, and $k_{a,i}$ is the number of active clusters at the ith iteration. For a data set with good data separation, a small value of $k_a$ is expected. For an evenly distributed data set, $k_a$ may equal k. In the worst case, $k_a$ equals k and the proposed method makes no improvement over the original FKM. The probability of performing distance calculations at step 4 is $k_a / k$, where k is the number of clusters. That is, the computational complexity, in terms of the number of distance calculations, of CDFKM is $O(Nkt) \times O(k_a / k) = O(N k_a t)$, where t is the number of iterations. Since $1 \le k_a \le k$, the computational complexity, in terms of the number of distance calculations, of CDFKM is upper bounded by O(Nkt).

3.3 Determination of Initial Cluster Centers Using Subsets

The computing time and number of iterations for CDFKM increase as the data size increases. The proposed method first uses CDFKM to generate a codebook from subsets of S and then adopts this codebook as the initial codebook for CDFKM to partition the whole data set S into k clusters. This initial approximation helps to reduce the number of iterations for CDFKM. To speed up the convergence of CDFKM, M subsets of the data set S are used to estimate the initial cluster centers for CDFKM. Denote these M subsets as $SB_l$ (l = 1 to M). The data size of $SB_l$ is fN, where f < 1 and N is the number of data points in S.
It is noted that the data points in $SB_l$ are selected randomly from S and $SB_i \cap SB_j = \emptyset$ ($i \ne j$), where $\emptyset$ is the empty set. The cluster center estimation algorithm (CCEA) first generates an initial set of cluster centers $SC_0$, which is obtained by randomly selecting k data points from $SB_1$, for CDFKM to partition SU, where $SU = SB_1$. This partition process generates a set of cluster centers $SC_1$. Setting $SU = SU \cup SB_p$, where p = 2 to M, CCEA uses $SC_{p-1}$ as the initial set of cluster centers for CDFKM to generate $SC_p$ using the data set SU. This process is repeated until the set of cluster centers $SC_M$ is obtained. Finally, CDFKM uses $SC_M$ as the initial set of cluster centers to partition the whole data set S. Now, the cluster center estimation algorithm (CCEA) is presented as follows.

Cluster Center Estimation Algorithm

(1) Randomly select M subsets $SB_l$ (l = 1 to M) of size fN from the data set S such that $SB_i \cap SB_j = \emptyset$ ($i \ne j$), where f < 1. Set $SU = SB_1$ and p = 0.
(2) Given a set of initial cluster centers $SC_p = \{C_j\}$ and the data set SU, use CDFKM to determine a set of cluster centers $SC_{p+1} = \{C_j\}$.
(3) Update p = p + 1 and set $SU = SU \cup SB_{p+1}$. If $p \le M$, go to step 2.
(4) Output $SC_M$ as the set of initial cluster centers.

The set $SC_0$ is obtained by randomly selecting k data points from the subset $SB_1$. A subset of size sN is used by CCEA to obtain an initial set of cluster centers for CDFKM, where 0 < s < 1. Note that f is a small positive real number. The value of M is chosen so that it is less than s/f. When the initial cluster centers determined by CCEA are used for CDFKM, the corresponding algorithm is denoted as modified CDFKM (MCDFKM).

4. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed algorithms CDFKM and MCDFKM, several sets of synthetic data, a set of real images, and a real data set have been used. The values of f and $\varepsilon$ for MCDFKM are set to 0.05 and 0.00001, respectively.
In the first example, the data set has about 50,000 data points. This data set, obtained from real images, consists of image blocks of 4 × 4 pixels. That is, the set of data points with dimension 16 is obtained from three real images: "Peppers," "Lena," and "Baboon." In example 2, several data sets with size 20,000 and dimensions from 8 to 40 are generated. There are 40 cluster centers, which are evenly distributed over the hypercube $[-1, 1]^d$ with d ranging from 8 to 40. A Gaussian distribution with standard deviation σ = 0.05 along each coordinate is used to generate data points around each center, where each coordinate is generated independently. In example 3, several data sets with size 10,000 and dimensions from 8 to 40 are obtained from the Gauss Markov sequence [1] with σ = 10, μ = 0, and a = 0.9, where σ is the standard deviation, μ is the mean value of the sequence and a is the correlation coefficient. In example 4, the Statlog (Landsat Satellite) data set consisting of 6,435 data points with 36 attributes is used [20].

In these examples, the proposed algorithms are compared to the conventional fuzzy k-means clustering algorithm in terms of the average number of distance calculations and computing time. The average computing time and number of distance calculations are calculated for each algorithm over 50 repetitions using different initial cluster centers. The initial cluster centers are randomly selected from each data set. Every algorithm uses the same initial cluster centers at each repetition. All computing is performed on an AMD Dual Opteron 2.0 GHz PC with 2 GB of memory.
All programs are implemented as console applications of Microsoft Visual Studio 6.0 and are executed under Windows XP Professional SP3.

Example 1: A data set is generated from three real images.

In the first example, this data set with d = 16 and N ≈ 50,000 is generated from three real images: "Lena," "Peppers," and "Baboon." Tables 1 and 2 give the average execution time and number of distance calculations per data point, respectively, for FKM (fuzzy k-means clustering algorithm), CDFKM, and MCDFKM. Table 3 presents the average distortion per data point for these three methods. From Tables 1 and 2, it can be concluded that MCDFKM with M = 1 has the best performance, in terms of the computing time and number of distance calculations, for k ≤ 32. For k ≥ 64, CDFKM has the least computing time and number of distance calculations. Compared to FKM, CDFKM can reduce the computing time by a factor of 1.1 to 2.1. From Table 1, it can be found that the computing time of MCDFKM increases as the value of M increases. Therefore MCDFKM with M = 1 is used in the following examples. From Table 3, it can be found that FKM, CDFKM, and MCDFKM can obtain the same clustering result.

Table 1. The average computing time (in seconds) for FKM, CDFKM, and MCDFKM using a data set generated from three real images.

Method            k = 16     k = 32     k = 64     k = 128
FKM               509.50     1328.92    2925.56    15355.94
CDFKM             468.86     1212.73    2329.64    7920.42
MCDFKM (M = 1)    338.67     894.00     3192.59    7623.59
MCDFKM (M = 2)    349.08     911.80     3434.36    7791.73
MCDFKM (M = 3)    394.97     983.69     3678.59    8404.13
MCDFKM (M = 4)    412.97     1072.31    4069.98    10147.78
MCDFKM (M = 5)    475.51     1138.83    4390.61    10205.95
MCDFKM (M = 6)    530.64     1294.08    5385.81    14342.30

Table 2. The average number of distance calculations per data point for FKM, CDFKM, and MCDFKM using a data set generated from three real images.

Method            k = 16      k = 32       k = 64       k = 128
FKM               82577024    210768032    462431360    2415968128
CDFKM             65275360    173313824    225074624    398747520
MCDFKM (M = 1)    41237844    98798730     289202130    462964383
MCDFKM (M = 2)    42805470    106209666    301687848    503016639
MCDFKM (M = 3)    47960496    112179816    309157617    525194391
MCDFKM (M = 4)    49906440    118076676    341583657    555601119
MCDFKM (M = 5)    56393328    123518739    360170358    623380608
MCDFKM (M = 6)    62447400    136084641    387006372    669698964

Table 3. The average distortion per data point for FKM, CDFKM, and MCDFKM using a data set generated from three real images.

Method            k = 16    k = 32    k = 64    k = 128
FKM               828.68    404.38    200.13    99.76
CDFKM             828.68    404.38    200.13    99.76
MCDFKM (M = 1)    828.68    404.38    200.13    99.76
MCDFKM (M = 2)    828.68    404.38    200.13    99.76
MCDFKM (M = 3)    828.69    404.38    200.13    99.77
MCDFKM (M = 4)    828.69    404.38    200.13    99.77
MCDFKM (M = 5)    828.69    404.38    200.13    99.77
MCDFKM (M = 6)    828.69    404.38    200.13    99.77

Example 2: Data sets are generated with cluster centers evenly distributed over the hypercube $[-1, 1]^d$.

In this example, each data set consists of 20,000 data points. Fig. 1 shows the average computing time for FKM, CDFKM, and MCDFKM with M = 1, whereas Fig. 2 gives the average number of distance calculations per data point for FKM, CDFKM, and MCDFKM. Fig. 3 shows the average distortion per data point for these three methods. From Figs. 1 and 2, it can be found that the proposed method MCDFKM with M = 1 outperforms FKM in terms of the computing time and number of distance calculations. From Fig. 1, it can also be found that the computing time of the proposed approach MCDFKM with M = 1 grows linearly with data dimension. Compared to FKM, MCDFKM with M = 1 can reduce the computing time by a factor of 2.6 to 3.1. Fig. 3 shows that FKM, CDFKM, and MCDFKM with M = 1 can obtain the same clustering result.

Fig. 1. The average computing time for data sets from example 2 with N = 20,000 and k = 40.
Fig. 2. The average number of distance calculations per data point for data sets from example 2 with N = 20,000 and k = 40.
Fig. 3. The average distortion per data point for data sets from example 2 with N = 20,000 and k = 40.

Example 3: Data sets are generated from the Gauss Markov sequence.

In this example, each data set is obtained from the Gauss Markov source. Figs. 4 and 7 present the average computing time with k = 128 and 256, respectively. Figs. 5 and 8 show the average number of distance calculations per data point with k = 128 and 256, respectively. Figs. 6 and 9 present the average distortion per data point. From these figures, it can be found that the computing time of MCDFKM with M = 1 increases linearly with the data dimension d. From Figs. 4, 5, 7, and 8, it can be found that MCDFKM with M = 1 has the best performance in terms of the computing time and number of distance calculations for all cases. Compared with FKM, the proposed method MCDFKM with M = 1 can reduce the computing time by a factor of 3.0 to 6.5. From Figs. 5 and 8, it can be found that the number of distance calculations of the proposed approach MCDFKM with M = 1 is independent of data dimension. Figs. 6 and 9 also confirm that the proposed approaches and FKM can obtain the same clustering result.

Fig. 4. The average computing time for data sets from example 3 with N = 10,000 and k = 128.
Fig. 5. The average number of distance calculations per data point for data sets from example 3 with N = 10,000 and k = 128.

Example 4: The Statlog (Landsat Satellite) data set.

In the fourth example, the Statlog (Landsat Satellite) data set consists of 6,435 data points with d = 36 [20]. The value of each coordinate for a data point is in the range of 0 to 255. A more detailed description of this data set can be found in [20].
Tables 4 and 5 present the computing time and number of distance calculations per data point, respectively, for the three algorithms to partition this data set into 16, 32, and 64 clusters. Table 6 gives the average distortion per data point. From Tables 4 and 5, it can be concluded that MCDFKM with M = 1 has the best performance in terms of computing time and number of distance calculations. Compared to FKM, CDFKM can reduce the computing time by a factor of 1.58 to 2.47. From Table 6, it can be found that FKM, CDFKM, and MCDFKM can obtain the same clustering result.

From Examples 1 and 4, it is found that CDFKM has the better performance for real data sets. Since real data sets usually have high cluster separation, it is recommended that CDFKM be used instead of FKM for a data set with high cluster separation. It is noted that CDFKM is a fast version of FKM.

Fig. 6. The average distortion per data point for data sets from example 3 with N = 10,000 and k = 128.
Fig. 7. The average computing time for data sets from example 3 with N = 10,000 and k = 256.
Fig. 8. The average number of distance calculations per data point for data sets from example 3 with N = 10,000 and k = 256.
Fig. 9. The average distortion per data point for data sets from example 3 with N = 10,000 and k = 256.

Table 4. The average computing time (in seconds) for FKM, CDFKM, and MCDFKM using the Statlog data set.

Method            k = 16    k = 32    k = 64
FKM               297.64    916.06    1838.70
CDFKM             187.81    371.50    802.89
MCDFKM (M = 1)    143.64    314.49    724.46

To visualize the clustering result, a data set with N = 1000 and d = 2 is generated. This data set has ten cluster centers distributed over the hypercube $[-1, 1]^2$. A Gaussian distribution with standard deviation 0.15 along each coordinate is used to generate data points around each center. This data set is divided into 10 clusters using FKM and MCDFKM with M = 1. Fig. 10 presents the clustering results. From Fig. 10, it can be found that FKM and MCDFKM with M = 1 can obtain the same clustering result.

Table 5. The average number of distance calculations per data point for FKM, CDFKM, and MCDFKM using the Statlog data set.

Method            k = 16      k = 32      k = 64
FKM               25743984    79291488    158171072
CDFKM             9706652     17309091    29728478
MCDFKM (M = 1)    9855873     18970822    40250065

Table 6. The average distortion per data point for FKM, CDFKM, and MCDFKM using the Statlog data set.

Method            k = 16     k = 32    k = 64
FKM               1368.78    684.39    342.19
CDFKM             1368.78    684.39    342.19
MCDFKM (M = 1)    1368.78    684.39    342.19

5. CONCLUSIONS

In this paper, two novel algorithms are developed to speed up fuzzy k-means clustering by using the information of center displacement between two successive partition processes. A cluster center estimation algorithm is also presented to determine the initial cluster centers for the proposed algorithm CDFKM. The computing time of the proposed algorithm MCDFKM with M = 1 grows linearly with data dimension. Compared to FKM, the proposed algorithm MCDFKM with M = 1 can effectively reduce the computing time and number of distance calculations. Compared with FKM, the proposed method MCDFKM with M = 1 can reduce the computing time by a factor of 2.6 to 3.1 for the data sets generated with cluster centers evenly distributed over a hypercube. Experimental results show that the proposed approaches and FKM can obtain the same ...
fuzzy km
Fuzzy K-Means Clustering Algorithm with Selection of Number of Clusters
VOL. 20, NO. 11, NOVEMBER
Published by the IEEE Computer Society
CLUSTERING is a process of grouping a set of objects into clusters so that the objects in the same cluster have high similarity but are very dissimilar to objects in other clusters. Various types of clustering methods have been proposed and developed, see, for instance, [1]. The K-Means algorithm [1], [2], [3], [5] is well known for its efficiency in clustering large data sets. Fuzzy versions of the K-Means algorithm have been reported by Ruspini [4] and Bezdek [6], where each pattern is allowed to have memberships in all clusters rather than having a distinct membership to one single cluster. Numerous problems in real-world applications, such as pattern recognition and computer vision, can be tackled effectively by the fuzzy K-Means algorithms, see, for instance, [7], [8], and [9]. There are two major issues in the application of K-Means-type (nonfuzzy or fuzzy) algorithms in cluster analysis. The first issue is that the number of clusters k needs to be determined in advance as an input to these algorithms. In a real data set, k is usually unknown. In practice, different values of k are tried, and cluster validation techniques are used to measure the clustering results and determine the best value of k, see, for instance, [1]. In [10], Hamerly and Elkan studied statistical methods to learn k in K-Means-type algorithms.
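Clustering Algorithm Kmeans and Gradient Algorithm Meanshift
The relationship between Kmeans, Meanshift, and the EM algorithm
Kmeans is a classic clustering algorithm that is widely used in pattern recognition, and many variants are built on it, such as fuzzy Kmeans and hierarchical Kmeans.
Kmeans is equivalent to a restricted EM algorithm applied to a Gaussian mixture model.
Gaussian mixture models are widely used in data mining, pattern recognition, machine learning, and statistical analysis.
The Kmeans iteration can be viewed as an E-step and an M-step. E: with the parameters (the class center vectors) fixed, relabel the samples. M: with the sample labels fixed, adjust the class center vectors.
K-means estimates only the means and not the variance of each class, so the resulting clustering is best suited to classes whose features have equal covariance.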
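The E-step/M-step reading of Kmeans described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans_em`, the squared Euclidean distance, and the stopping rule are assumptions made for the example.

```python
import numpy as np

def kmeans_em(X, centers, n_iter=100):
    """Hypothetical helper: K-means written as alternating E and M steps."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(n_iter):
        # E-step: centers fixed, relabel every sample with its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # M-step: labels fixed, move each center to the mean of its samples.
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Only the means are updated in the M-step; no variance is estimated, which is the restriction the text refers to.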
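To some extent, Kmeans can also be viewed as a special case of Meanshift. Meanshift is a method for estimating the gradient of a probability density (its advantage is that the gradient is obtained directly, without first solving for the density itself), so it can be used to find the multiple modes (classes) of the data by gradient ascent.
A CVPR paper from 2006 proved that Meanshift is a variant of the Newton-Raphson algorithm.
Kmeans resembles the EM algorithm in that, when the form of the mixture density is known (a known parametric form), both search for a solution in parameter space by iteration.
Kmeans resembles Meanshift in that both are methods of probability density gradient estimation; Kmeans simply uses a special kernel function (the uniform kernel). This view does not depend on whether the form of the mixture density is known; it is a gradient-based way of solving the problem.
PS: the computation differs between these two views of Kmeans.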
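To make the uniform-kernel connection concrete, here is a small sketch of a Meanshift iteration with a flat (uniform) kernel. Each step moves the estimate to the mean of the samples inside a window, which has the same form as the Kmeans center update. The function name `mean_shift_mode` and the parameters `bandwidth` and `tol` are assumptions introduced for illustration.

```python
import numpy as np

def mean_shift_mode(X, start, bandwidth, n_iter=100, tol=1e-6):
    """Hypothetical helper: follow flat-kernel mean shift from `start` to a mode."""
    X = np.asarray(X, dtype=float)
    x = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        # Uniform kernel: every sample within `bandwidth` gets equal weight.
        inside = np.linalg.norm(X - x, axis=1) <= bandwidth
        if not inside.any():
            break
        new_x = X[inside].mean(axis=0)  # shift to the local mean
        if np.linalg.norm(new_x - x) < tol:
            x = new_x
            break
        x = new_x
    return x
```

Starting the iteration from different points leads to different modes, which is how Meanshift finds several classes without being told their number.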
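Vector quantization: a vector is replaced by a single symbol K.
For example, given 10,000 data points, Kmeans can cluster them into 100 classes, that is, the 100 vectors that best represent the data. The data are thereby compressed: data added later are stored by their class label, which saves space. This is lossy data compression.
Data compression is an important application of data clustering and a major method in data mining.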
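A minimal sketch of the encode/decode step of vector quantization, assuming a codebook of cluster centers is already available (for example from Kmeans). The names `vq_encode` and `vq_decode` are hypothetical.

```python
import numpy as np

def vq_encode(X, codebook):
    """Replace each vector by the index of its nearest codebook vector."""
    X = np.asarray(X, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def vq_decode(codes, codebook):
    """Reconstruct (lossily) by looking the indices up in the codebook."""
    return np.asarray(codebook, dtype=float)[codes]
```

Only the integer codes and the codebook need to be stored; the reconstruction differs from the original vectors, which is why the compression is lossy.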
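A Gaussian mixture model is a linear combination of several different Gaussian components.
When the likelihood function is maximized, direct differentiation runs into singularities: sometimes a component contains only a single sample point, so its covariance cannot be estimated, the likelihood tends to infinity, and no solution can be found.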
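The singularity can be written out explicitly. The notation below is an assumption made for illustration: mixing weights \pi_k, means \mu_k, isotropic covariances \sigma_k^2 I, and data dimension d. If component j collapses onto a single sample x_n, its density at that sample grows without bound as \sigma_j shrinks, and so does the log-likelihood.

```latex
\ln p(X) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x_n \mid \mu_k, \sigma_k^2 I\right),
\qquad
\mathcal{N}\!\left(x_n \mid \mu_j = x_n, \sigma_j^2 I\right)
  = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \;\longrightarrow\; \infty
  \quad \text{as } \sigma_j \to 0 .
```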
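Fuzzy k-means clustering
Fuzzy K-means (FKM) clustering is one of the most commonly used clustering algorithms; it assigns data points to different classes.
The algorithm is based on fuzzy set theory. It provides a flexible definition of classes and a non-strict (soft) form of clustering, which makes it better suited to complex data distributions.
Fuzzy K-means is an iterative algorithm with several advantages. First, it is simple to implement and its per-iteration cost grows linearly with the number of data points, so it can be applied to large data sets. Second, although the number of clusters must be specified in advance, each data point receives a degree of membership in every cluster rather than a single hard label, so overlapping clusters can be handled. Finally, it can be used with different distance measures, so it can cope with different data distributions.
Fuzzy K-means clustering is also widely used in machine learning.
In computer vision and speech processing, it is used to identify objects, carriers, backgrounds, and sounds, all of which are common tasks in machine vision and speech applications.
It is also used for text classification, similarity analysis, recommender systems, image classification, and pattern recognition.
The main steps of fuzzy K-means clustering are initialization, iteration, and classification.
First, initial cluster centers are given. Next, the degree of membership of each data point in each cluster is computed from its distance to each cluster center. Then each cluster center is recomputed from the memberships, and these two steps are repeated until the centers stop changing. Finally, each data point is assigned to the cluster in which its membership is highest, which completes the clustering.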
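The steps above can be sketched with the standard fuzzy c-means update rules: memberships proportional to d^(-2/(m-1)) and centers computed as means weighted by u^m. The function name `fuzzy_kmeans`, the fuzzifier default m = 2, and the stopping tolerance are assumptions for this example and are not taken from the text.

```python
import numpy as np

def fuzzy_kmeans(X, centers, m=2.0, n_iter=100, tol=1e-6):
    """Hypothetical helper: standard fuzzy c-means with fuzzifier m > 1."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(n_iter):
        # Squared distances, shape (n_samples, n_clusters); clipped to avoid 1/0.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        # Membership step: u_ik proportional to d_ik^(-2/(m-1)); each row sums to 1.
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
        # Center step: mean of the data weighted by u_ik^m.
        W = U ** m
        new_centers = (W.T @ X) / W.sum(axis=0)[:, None]
        shift = np.abs(new_centers - centers).max()
        centers = new_centers
        if shift < tol:
            break
    # U holds the memberships computed in the last iteration.
    return centers, U
```

A hard assignment, if one is needed, is `U.argmax(axis=1)`. Larger values of m make the memberships softer; values close to 1 approach hard K-means.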
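Fuzzy K-means can also be used to improve data processing: it represents the uncertainty in the data explicitly, and its results can be tuned by adjusting the fuzziness parameter.
It can also be applied to data quality problems, including outlier detection, filtering, and noise handling.
In addition, a fuzzy distance measure can be built on the algorithm to handle incomplete, unstructured data, which helps in complex real-world scenarios and can improve clustering accuracy.
Because it converges well and can represent ambiguous points through graded memberships, fuzzy K-means has been widely applied in practice.
It copes with complex clustering problems and diverse data sets, and better results can be obtained by adjusting its parameters.
However, the algorithm is slower than hard K-means, it is sensitive to parameter settings, and in its standard form it is suited only to linearly separable data.
In summary, fuzzy K-means is an effective clustering algorithm built on fuzzy set theory. It is widely used in machine learning and data mining, is simple to implement, and can address clustering and related data processing problems.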
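Cluster analysis basics
Cluster analysis is a well-suited multivariate statistical technique; its main forms are hierarchical clustering and iterative clustering.
Cluster analysis, also called group analysis or point-group analysis, is a multivariate statistical method for studying classification.
For example, bank branches can be divided into several grades according to their deposit volume, staffing, floor area, special services, branch level, and the functional district in which they are located, and the number of branches of each grade can then be compared across banks.
1. Basic idea: the samples (branches) or indicators (variables) under study are similar to one another to different degrees (closeness, measured by the distance between samples).
Based on several observed indicators for a batch of samples, we find statistics that measure the similarity between samples or between indicators, and use these statistics as the basis for classification.
Samples (or indicators) that are highly similar are merged into one class, other samples (or indicators) that are highly similar to each other are merged into another class, and so on until all samples (or indicators) have been merged. This is the basic idea of classification.
Depending on the objects being classified, cluster analysis is usually divided into Q-type and R-type cluster analysis.
R-type cluster analysis classifies variables; Q-type cluster analysis classifies samples.
The main uses of R-type cluster analysis are: 1. It reveals the closeness not only between individual variables but also between combinations of variables.
2. Based on the classification of the variables and the relationships between them, the main variables can be selected for regression analysis or Q-type cluster analysis.
The advantages of Q-type cluster analysis are: 1. It uses the information in several variables together to classify the samples. 2. The classification result is intuitive; the clustering dendrogram shows the numerical classification very clearly. 3. The results are more detailed, comprehensive, and reasonable than those of traditional classification methods.
To perform cluster analysis, we first need to define the distance between samples.
Common distances are: (1) absolute-value distance, (2) Euclidean distance, (3) Minkowski distance, (4) Chebyshev distance.
Clustering methods:
(1) Direct clustering method. First treat each object as a class of its own; then, following the minimum-distance principle, select one pair of objects at a time and merge them into a new class.
If one of the two objects already belongs to a class, the other is put into that class as well; if the two objects belong to two existing classes, those two classes are merged into one.
After each merge, strike out the column of that object and the row with the same index.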
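The distances listed above and the direct clustering procedure can be sketched together. This is an illustration under stated assumptions: `minkowski` and `direct_clustering` are hypothetical names, the Minkowski exponent p covers the absolute-value (p = 1), Euclidean (p = 2), and Chebyshev (p = infinity) cases, and after each merge the object with the larger index is the one struck out, which the text does not specify.

```python
import numpy as np

def minkowski(x, y, p=2.0):
    """Minkowski distance; p=1 absolute-value, p=2 Euclidean, p=inf Chebyshev."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))

def direct_clustering(points, p=2.0):
    """Merge the closest remaining pair, then strike out one of the two objects."""
    active = set(range(len(points)))  # objects whose row/column is still present
    merges = []                       # (kept object, struck object, distance)
    while len(active) > 1:
        # Smallest distance among the rows/columns that have not been struck.
        d, i, j = min(
            (minkowski(points[a], points[b], p), a, b)
            for a in active for b in active if a < b
        )
        merges.append((i, j, d))
        active.remove(j)              # assumption: strike the larger index
    return merges
```

The list of merges, read in order, gives the sequence in which classes are joined and the distance at which each join happens, which is the information needed to draw a clustering dendrogram.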
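Choosing k for clustering with the elbow method: implementation
The elbow method is a commonly used technique in cluster analysis that helps determine the best number of clusters.
In practice it is widely applied in data mining, market analysis, social network analysis, and other fields.
This article describes the principle and use of the elbow method and demonstrates it with an example.
Cluster analysis is an unsupervised learning method that groups similar objects.
In cluster analysis, a data set is divided into several classes so that data within the same class are highly similar and data in different classes are less similar.
The elbow method helps determine the best number of clusters k, that is, the number of classes into which the data are divided.
The principle of the elbow method is to compute the sum of squared errors (SSE) of the clustering result for different values of k and to select the k beyond which the SSE stops decreasing sharply.
The SSE is the sum of the squared distances between each data point and the center of the cluster it belongs to; it reflects how compact the clustering result is.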
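Written as a formula, with C_j denoting the j-th cluster and \mu_j its center (symbols introduced here for illustration), the sum of squared errors for a clustering into k clusters is:

```latex
\mathrm{SSE}(k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2}
```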
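The following example demonstrates how the elbow method is used.
Suppose we have a set of sales data that includes the sales volume and price of each product.
We want to divide these products into different classes for market analysis.
First, we need to choose a suitable clustering algorithm.
In this example, we use the k-means algorithm.
k-means is an iterative optimization algorithm that minimizes the sum of squared errors by repeatedly updating the cluster centers.
Next, we need to decide the range of the number of clusters k.
As a rule of thumb, k is usually taken between 2 and the square root of the number of data points.
In this example, we set the range of k to 2 through 10.
Then we run k-means for each value of k and compute the corresponding sum of squared errors.
For each k, we run the algorithm several times and keep the run with the smallest sum of squared errors.
Finally, we plot the sum of squared errors against k as a line chart and look at the shape of the curve.
In general, the curve falls gradually as k increases.
At some value of k, however, the curve shows a clear bend, forming the shape of an "elbow".
The k at this bend is the best number of clusters.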
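The procedure just described can be sketched in NumPy. Several parts are assumptions made so the example is self-contained: the sales data are replaced by a made-up synthetic data set with four well-separated groups, the seeding is k-means++ style, and the bend is picked automatically by the largest second difference of the SSE curve, whereas the text reads it off the plot.

```python
import numpy as np

def kmeans_sse(X, k, rng, n_iter=100):
    """One k-means run on X; returns the final sum of squared errors."""
    # k-means++-style seeding (an assumption; the text does not fix an init).
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def sse_curve(X, k_values, n_restarts=5, seed=0):
    """For each k, run k-means several times and keep the smallest SSE."""
    rng = np.random.default_rng(seed)
    return {k: min(kmeans_sse(X, k, rng) for _ in range(n_restarts))
            for k in k_values}

def elbow_k(sse):
    """Hypothetical heuristic: the k with the largest second difference of SSE."""
    ks = sorted(sse)
    bend = {ks[i]: sse[ks[i - 1]] - 2 * sse[ks[i]] + sse[ks[i + 1]]
            for i in range(1, len(ks) - 1)}
    return max(bend, key=bend.get)

# Made-up data: four tight groups of 25 points each (not the sales data).
data_rng = np.random.default_rng(1)
group_centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
X = np.vstack([c + 0.3 * data_rng.standard_normal((25, 2)) for c in group_centers])

sse = sse_curve(X, range(2, 11))  # k from 2 to 10, as in the example
best_k = elbow_k(sse)
```

Plotting the values in `sse` against k gives the line chart described in the text. The second-difference rule is only one way to automate the visual judgment and can disagree with it when the curve has no sharp bend.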
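In this example, we ran k-means and computed the sum of squared errors for 2 through 10 clusters.
We then plotted the sum of squared errors against k as a line chart.
The chart shows a clear bend in the curve at k = 4.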