An Efficient Spectral Algorithm for Network Community Discovery



























An Efficient Spectral Algorithm for Network Community Discovery and Its Applications to Biological and Social Networks

Jianhua Ruan and Weixiong Zhang
Department of Computer Science and Engineering
Washington University in St Louis
One Brookings Dr, St Louis, MO 63130
{jruan,zhang}@

Abstract

Automatic discovery of community structures in complex networks is a fundamental task in many disciplines, including social science, engineering, and biology. Recently, a quantitative measure called modularity (Q) has been proposed to effectively assess the quality of community structures. Several community discovery algorithms have since been developed based on the optimization of Q. However, this optimization problem is NP-hard, and the existing algorithms have a low accuracy or are computationally expensive. In this paper, we present an efficient spectral algorithm for modularity optimization. When tested on a large number of synthetic or real-world networks, and compared to the existing algorithms, our method is efficient and has a high accuracy. In addition, we have successfully applied our algorithm to detect interesting and meaningful community structures from real-world networks in different domains, including biology, medicine and social science. Due to space limitation, results of these applications are presented in a complete version of the paper available on our website (/˜jruan/).

1 Introduction

The study of complex networks has become a fast growing subject in many disciplines, including physics, biology, and social science. At least part of the reason can be attributed to the discovery that real-world networks from totally different sources can share surprisingly high similarities in their topological properties, such as the power-law degree distributions and high clustering coefficients.
(See [1, 14] for reviews.)

One of the key properties in complex networks that have attracted a great deal of interest recently is the so-called community structures, i.e. relatively densely connected sub-networks [15]. Community structures have been found in social and biological networks, as well as technological networks such as the Internet and power grid. Automatically discovering such structures is fundamentally important for understanding the relationships between network structures and functions, and has many practical applications. For example, identifying communities from a collaboration network may reveal scientific activities as well as evolution and development of research areas [12], while detecting hidden communities on the World Wide Web may help prevent crime and terrorism [2].

To design effective community discovery algorithms, Newman and Girvan [16] proposed a quantitative measure, called modularity (Q), to assess the quality of community structures, and formulated community discovery as an optimization problem. Since optimizing Q is an NP-hard problem, several heuristic methods have been developed, as surveyed in [4]. The fastest algorithm available uses a greedy strategy and suffers from poor quality [3]. A more accurate method is based on simulated annealing, which requires a prohibitively long running time on large networks [11].
Several spectral algorithms have been developed, which have relatively good performance but are still inefficient for large networks [21, 15].

In this paper, we propose a spectral algorithm that is effective in finding high quality communities as well as efficient on large networks. The algorithm adopts a recursive strategy to partition networks while optimizing Q. Unlike the existing algorithms, our method is a hybrid of direct k-way partitioning and recursive 2-way partitioning strategies [21, 15]. We evaluate our algorithm on a large number of synthetic and real-world networks. The results show that the algorithm is more efficient and more accurate than a recursive 2-way partitioning method. Compared to a direct k-way partitioning method, our algorithm is much more efficient, while having a comparable accuracy.

The paper is organized as follows. In Section 2, we introduce some basic concepts, notations, and previous works. In Section 3, we describe our algorithm and its complexity, and discuss several related methods. We present experimental results in Section 4, and conclude in Section 5.

Seventh IEEE International Conference on Data Mining

2 Preliminaries

2.1 Spectral graph partitioning

Let G = (V, E) be a network of n vertices in V and m edges in E. Let $A = (A_{ij})$ be the adjacency matrix of G. A graph partitioning problem is to find two or more vertex subsets of nearly equal sizes, while minimizing the number of edges cut by the partitioning [8]. Known to be NP-hard, the problem arises in many real applications, such as circuit design and load balancing in distributed computing. Many heuristic methods have been developed for the problem, among which spectral methods have received much attention and are the most popular.

Spectral graph partitioning is in fact a family of methods.
These methods depend on the eigenvectors of the Laplacian matrix of a graph or its relatives. Depending on the way they partition a graph, spectral methods can be classified into two classes. The first class uses the leading eigenvector of a graph Laplacian to bi-partition the graph. The second class of approaches computes a k-way partitioning of a graph using multiple eigenvectors. We briefly review some representatives of these two classes of algorithms below.

Let D be the diagonal degree matrix of A, i.e. $D_{ii} = \sum_j A_{ij}$. $L = D - A$ is called the Laplacian matrix of G. Let $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ be the eigenvalues and $\mu_1, \mu_2, \cdots, \mu_n$ be the corresponding eigenvectors for the generalized eigenvalue problem $L\mu = \lambda D\mu$. It can be shown that $\lambda_1 = 0$ and $\mu_1 = \mathbf{1}$, a vector with all ones.

Given the above notation, a representative of bi-partitioning, the SM algorithm [19], works as follows. (1) Compute $\mu_2$, the second smallest generalized eigenvector of L. (2) Conduct a linear search on $\mu_2$ to find a partition of the graph to minimize a normalized cut criterion [19]. It has been shown that when certain constraints are satisfied, the SM algorithm can reach the optima of normalized cuts [19].
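The two SM steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes NumPy, a dense symmetric 0/1 adjacency matrix with no isolated vertices, and the usual normalized cut criterion cut/vol(S) + cut/vol(V \ S); the function name `spectral_bipartition` is hypothetical. The generalized problem $L\mu = \lambda D\mu$ is solved through the equivalent symmetric problem $D^{-1/2} L D^{-1/2} v = \lambda v$ with $\mu = D^{-1/2} v$.

```python
import numpy as np

def spectral_bipartition(A):
    """Bi-partition a graph from its second generalized eigenvector.

    A: dense symmetric adjacency matrix (assumed: no isolated vertices).
    Returns (left, right, ncut): two sorted vertex lists and the
    normalized cut value of the chosen split.
    """
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                       # vertex degrees, D_ii
    L = np.diag(d) - A                      # Laplacian L = D - A
    inv_sqrt_d = 1.0 / np.sqrt(d)
    # Symmetric form of L mu = lambda D mu
    L_sym = inv_sqrt_d[:, None] * L * inv_sqrt_d[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)      # eigenvalues in ascending order
    mu2 = eigvecs[:, 1] * inv_sqrt_d        # second generalized eigenvector

    # Step (2): linear search over thresholds along the sorted entries of mu2
    order = np.argsort(mu2)
    total_vol = d.sum()
    best_ncut, best_split = np.inf, None
    for i in range(1, len(order)):
        left, right = order[:i], order[i:]
        cut = A[np.ix_(left, right)].sum()  # edges crossing the split
        vol_left = d[left].sum()
        ncut = cut / vol_left + cut / (total_vol - vol_left)
        if ncut < best_ncut:
            best_ncut, best_split = ncut, i
    left = sorted(int(v) for v in order[:best_split])
    right = sorted(int(v) for v in order[best_split:])
    return left, right, best_ncut
```

On a toy graph made of two triangles joined by a single edge, the search is expected to separate the two triangles. A sparse eigensolver would be the natural replacement for `np.linalg.eigh` on large networks.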
To find more than two clusters, the SM algorithm can be applied recursively.

The most popular algorithm in the second class, the NJW algorithm [17], finds a k-way partition of a network directly as follows, where k is given by the user. (1) Compute the k smallest generalized eigenvectors of L and stack them in columns to form a matrix $Y = [\mu_1, \mu_2, \cdots, \mu_k]$. (2) Normalize each row of Y to have unit length. (3) Treat each row as a point in $R^k$, and apply the standard k-means algorithm (or any other geometric clustering algorithm) to group them into k clusters.

2.2 Modularity and community structures

Given a partition of a network, $\Gamma_k$, which divides its vertices into k communities, the modularity is defined as $Q(\Gamma_k) = \sum_{i=1}^{k} \left( e_{ii}/c - (a_i/c)^2 \right)$, where $e_{ii}$ is the number of edges with both vertices within community i, $a_i$ is the number of edges with one or both vertices in community i, and c is the total number of edges [16]. Therefore, the Q function measures the fraction of edges falling within communities, subtracted by what one would expect if the edges were randomly placed. A larger Q value means stronger community structures. If a partition gives no more within-community edges than would be expected by chance, the modularity Q ≤ 0. For a trivial partitioning with a single community, Q = 0. It has been observed that most real-world networks have Q > 0.3 [16].

The Q function provides a good quality measure to compare different community structures. Several algorithms have been developed to search for community structures by looking for the division of a network that optimizes Q (see [4] for a survey). White and Smyth proposed a spectral algorithm (WS), which is effective on small networks [21]. They show that, when the number of communities k is given, the optimization of Q is equivalent to an eigen decomposition problem if the discrete membership constraint is relaxed [21]. Therefore, they directly applied a k-way spectral graph partitioning algorithm for this purpose. To automatically determine the number of communities, the
spectral algorithm is executed multiple times, with k ranging from the user-defined minimum $K_{min}$ to maximum $K_{max}$ number of communities. The k that gives the highest Q value is deemed the most appropriate number of communities. A slightly modified version of the WS algorithm is as follows. (1) For each k, $K_{min} \le k \le K_{max}$, apply NJW to find a k-way partition, denoted as $\Gamma_k$. (2) $k^* = \arg\max_k Q(\Gamma_k)$ is the number of communities, and $\Gamma^* = \Gamma_{k^*}$ is the best community structure.

While the WS algorithm is effective in finding good community structures, it scales poorly to large networks, because it needs to execute k-means up to $K_{max}$ times. Without any prior knowledge of a network, one may overestimate $K_{max}$ in order to reach the optimal Q. For sparse networks, $K_{max}$ can be linear in the number of vertices in the worst case, making it impractical to iterate over all possible k's for large networks.

3 The Kcut algorithm

In order to develop a method that scales well to large networks while retaining effectiveness in finding good communities, we may take the strategy used in the SM algorithm, i.e., to recursively divide a network into smaller ones.
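Since every method discussed here is driven by repeated evaluations of Q, a small Python sketch of the modularity of Section 2.2 may be useful. It is an illustration under stated assumptions, not the authors' code: it uses the standard Newman-Girvan degree-based form, in which the expected term for community i is the sum of the degrees of its vertices divided by 2c, which may differ from a literal reading of the wording above. The function name `modularity` and the edge-list input format are hypothetical.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    communities: list of vertex collections that together cover all vertices.
    Assumption: standard Newman-Girvan form,
        Q = sum_i ( e_ii / c - (deg_i / (2c))^2 ),
    where deg_i is the total degree of the vertices in community i.
    """
    label = {v: i for i, com in enumerate(communities) for v in com}
    c = 0                                    # total number of edges
    internal = [0] * len(communities)        # e_ii: edges inside community i
    degree_sum = [0] * len(communities)      # edge endpoints in community i
    for u, v in edges:
        c += 1
        cu, cv = label[u], label[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(e / c - (a / (2 * c)) ** 2
               for e, a in zip(internal, degree_sum))
```

Each community contributes its own term to the sum, so after a single cluster is split only the terms of the newly created communities need to be recomputed.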
However, two issues remain. First, when should the algorithm halt, or in other words, how do we decide whether a (sub)network should be partitioned? Since our goal is to find a partition with a high modularity, we can test whether the Q value increases after the partition. If no partition can improve the modularity, the (sub)network should not be divided. Second, it has been empirically observed that if there are multiple communities, using multiple eigenvectors to directly compute a k-way partition is better than recursive bi-partitioning methods [17]. Here, we propose an algorithm that is a unique combination of recursive partitioning and direct k-way methods, which will achieve the efficiency of a recursive approach, while also having the same accuracy as a direct k-way method.

We follow a greedy strategy to recursively partition a network to optimize Q. Unlike the existing algorithms that always seek a bi-partition, we adopt a direct k-way partition strategy as in the WS algorithm. Briefly, we compute the best k-way partition with k = 2, 3, ···, l using the NJW algorithm, and select the k that gives the highest Q value. Then for each subnetwork the algorithm is recursively applied. To reduce the computation cost, we restrict l to small integers. As we will show in the experiments, the algorithm with l as small as 3 or 4 can significantly improve modularity over the standard bi-partitioning strategy. Furthermore, the computation cost is also reduced with a slightly increased value of l compared to bi-partitioning.

Given a network G and a small integer l that is the maximal number of partitions to be considered for each subnetwork, our algorithm Kcut executes the following steps.

1. Initialize Γ to be a single cluster with all vertices, and set Q = 0.
2. For each cluster P in Γ,
   (a) Let g be the subnetwork of G containing the vertices in P.
   (b) For each integer k from 2 to l,
       i. Apply NJW to find a k-way partitioning of g, denoted by $\Gamma^g_k$.
       ii. Compute the new Q value of the network as $Q_k = Q(\Gamma \cup \Gamma^g_k \setminus P)$.
   (c) Find the k that gives the best Q value, i.e., $k^* = \arg\max_k Q_k$.
   (d) If $Q_{k^*} > Q$, accept the partition by replacing P with $\Gamma^g_{k^*}$, i.e., $\Gamma = \Gamma \cup \Gamma^g_{k^*} \setminus P$, and set $Q = Q_{k^*}$.
   (e) Advance to the next cluster in Γ, if there is any.

The inner loop, step 2(b), is similar to the first step of the WS algorithm, except that in 2(b)(ii) we compute the modularity of the whole network G, which is different from the modularity $Q(\Gamma^g_k)$. On the other hand, we do not need to iterate over all communities in the network to re-compute Q. From the definition of Q in Section 2.2, the contribution of each community towards Q is independent of the other communities. Therefore, after g is partitioned, Q can be efficiently updated with the communities that have just been created in g. At step 2(c), we decide the best way to partition g that can improve Q the most. This step turns out to be crucial in identifying globally good community structures with high Q values. At step 2(d), we test if partitioning g can contribute positively towards Q, and the partition is accepted only if Q increases. When the algorithm terminates, no communities can be further created to improve Q, thus Γ contains the best community structure.

3.1 Computational complexity

We first review the computational complexity of the WS algorithm, since the inner loop of Kcut is simply the WS algorithm, except that the computation of Q is slightly different. The WS algorithm contains two major components: computing eigenvectors and executing k-means to partition the network. Note that although WS calls NJW multiple times, the eigen problem needs to be solved only once to obtain all $K_{max}$ eigenvectors. To compute eigenvectors, we used the eigs function in MATLAB, which has a time complexity in $O(mKh + nK^2h + K^3h)$, where m and n are respectively the numbers of edges and vertices of the graph, $K = K_{max}$ is the number of eigenvectors to be computed, and h is the number of iterations for eigs to converge [21]. Since K < n, the running time of eigs can be simplified
to $O(mKh + nK^2h)$. Second, we adopted a fast k-means algorithm [6] in our implementation, which takes approximately $O(nKe)$ time, where e is the number of iterations for k-means to converge. Since k-means is called K times, the total running time is $O(mKh + nK^2h + nK^2e)$, where the first two terms are for eigs and the last term is for k-means. Assuming e and h are constants, the overall time complexity of WS is $O(mK + nK^2)$, which can be close to $O(n^3)$, since the maximal number of communities for a sparse network may be linear in n.

The running time of Kcut depends on the depth of the recursive calls. In the worst case, the partitions can be highly imbalanced, and the depth of the recursion is merely the number of partitions produced, K. A more practical estimate, however, is the average depth, which is close to $\log_l K$, where l is the maximal number of partitions considered by NJW. Therefore, the running time taken by eigs can be estimated to be $O((mlh + nl^2h) \log_l K)$, which can be further simplified to $O(mlh \log_l K)$, since l is small and therefore in general m > nl. Similarly, the average-case running time taken by k-means is $O(nl^2e \log_l K)$, and the total complexity is given by $O((mlh + nl^2e) \log_l K)$. Our experimental results show that for large networks and small values of l, the time taken by eigs dominates, giving an overall time complexity in $O(mlh \log_l K) = O(mh \ln K \cdot l/\ln l)$. Assuming h is constant, and also given that l is small and K = O(n), the total complexity is $O(m \log n)$, which is much smaller than the $O(n^3)$ running time of the WS algorithm. An important observation from the analysis is that the total running time of Kcut is not a monotonically increasing function of l. Analytically, over integer values of l, the minimum value of $l/\ln l$ is achieved at l = 3.
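The claim about $l/\ln l$ can be checked with a short worked derivation. The function name $f$ is introduced here only for illustration, and l is treated as a real variable before restricting to integers.

```latex
\begin{align*}
O(m l h \log_l K) &= O\!\left(m h \ln K \cdot \frac{l}{\ln l}\right)
  && \text{since } \log_l K = \frac{\ln K}{\ln l}, \\
f(l) &= \frac{l}{\ln l}, \qquad
f'(l) = \frac{\ln l - 1}{(\ln l)^2}
  && \text{for } l > 1, \\
f'(l) = 0 &\iff l = e \approx 2.718
  && \text{($f$ decreases on $(1, e)$ and increases on $(e, \infty)$),} \\
f(2) &= \frac{2}{\ln 2} \approx 2.885, \quad
f(3) = \frac{3}{\ln 3} \approx 2.731, \quad
f(4) = \frac{4}{2 \ln 2} = f(2) \approx 2.885 .
\end{align*}
```

Among integers l ≥ 2 the smallest value is therefore attained at l = 3, and l = 2 and l = 4 give the same value of the eigs term.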
Empirically, we observed that Kcut is most efficient with l = 3 to 5 (see Section 4.2).

The memory complexity of both algorithms is O(m), linear in the number of edges.

3.2 Related methods

Besides our algorithm and the WS method, several other algorithms have also been developed for identifying communities by modularity optimization. Newman proposed an algorithm that is based on recursive spectral bi-partitioning [15]. The algorithm computes the leading eigenvector of a so-called modularity matrix, and divides the vertices into two groups according to the signs of the elements in the eigenvector. The algorithm runs recursively on each subnetwork, until no improvement to Q is possible. Compared to our method, this algorithm is faster for small networks, since no k-means is performed. On the other hand, the modularity matrix is very dense, with almost no zero entries. Therefore, the algorithm takes $O(n^2)$ memory even for sparse networks, in contrast to O(m) for our method. Furthermore, the algorithm takes $O(n^2 \log n)$ running time, and therefore does not scale well to large networks. Importantly, we will show that by combining k-way partitioning with a recursive method, Kcut usually achieves higher modularity than the Newman method.

There are also several methods that are not spectral-based. The edge betweenness algorithm [9] and the extremal optimization algorithm [5] are known to be very slow, with $O(n^3)$ and $O(n^2 \log^2 n)$ running time, respectively. Another greedy approach, the CNM algorithm [3], has approximately the same time complexity ($O(m \log^2 n)$) as our method, but the communities returned often have poor quality [15].

4 Evaluation

We now evaluate our algorithm on a variety of networks and compare it with three existing algorithms that were mentioned in Section 3.2: the WS algorithm, the CNM algorithm, and Newman's algorithm (NM). In what follows, the results of our algorithm are denoted by K-2, K-3, ···, for l = 2, 3, ···. Note that Newman suggested in [15] a refining step to improve Q after the initial partitioning. To make a fair
comparison, this refining step was omitted in our study, since in theory the same strategy can be applied to any other algorithm as well.

Figure 1. Results on computer-generated networks. $Q_{relative} = Q_{discovered} - Q_{true}$.

4.1 Computer-generated networks

To evaluate Kcut, we first tested it on computer-generated networks with artificially embedded community structures. Each network had 256 vertices forming 8 communities of equal sizes. Edges were randomly placed with probability $p_{in}$ between vertices within the same community and with probability $p_{out}$ between vertices in different communities. We varied $p_{in}$ from 0.8 to 0.3, representing networks with dense to sparse communities. For each $p_{in}$, we varied $p_{out}$ from 0 to p in50.

For each pair of ($p_{in}$, $p_{out}$), we generated 100 networks and clustered them with the WS ($K_{min}$ = 2, $K_{max}$ = 15), Kcut (l = 2, 3, 4 and 5), and NM algorithms. To measure the accuracy of the results, we computed the Jaccard Index [20], which is roughly the percentage of within-community edges that were predicted correctly. The Jaccard Index between the true community structure (Γ) and the predicted community structure (Γ') is defined as $J(\Gamma, \Gamma') = |S(\Gamma) \cap S(\Gamma')| \,/\, |S(\Gamma) \cup S(\Gamma')|$.

Table 1. Q values for real-world networks.

Network      WS     K-2    K-3    K-4    K-5    NM     CNM    Best known
Karate       0.420  0.390  0.420  0.420  0.420  0.393  0.383  0.420 [21]
Football     0.602  0.524  0.600  0.596  0.590  0.493  0.577
Jazz         0.439  0.444  0.444  0.439  0.439  0.394  0.439  0.445 [5]
PPI          0.362  0.332  0.344  0.348  0.364  0.341  0.337
Internet     0.604  0.594  0.600  0.601  0.601  0.524  0.620
Physicists   -      0.734  0.738  0.739  0.743  -      0.659  0.723 [15]

$K_{max}$: maximal number of communities for WS. $K^*$: number of communities returned by WS. The last column gives the best Q values achieved by existing methods in the literature, and references to the methods.

Table 2. Total CPU time (seconds).

* A significant difference between CNM and the other algorithms here is that CNM was implemented in C, while all the other algorithms compared here were implemented in MATLAB m-files.
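The Jaccard Index used in Section 4.1 can be sketched in a few lines of Python. This is an illustration only, built on one loud assumption: S(Γ) is taken to be the set of unordered vertex pairs that share a community in Γ, a definition chosen here for illustration and not stated in the text above. The function names `same_community_pairs` and `jaccard_index` are hypothetical.

```python
from itertools import combinations

def same_community_pairs(communities):
    """Assumed S(Gamma): unordered vertex pairs that share a community."""
    return {frozenset(pair)
            for com in communities
            for pair in combinations(com, 2)}

def jaccard_index(true_communities, predicted_communities):
    """|S(true) & S(pred)| / |S(true) | S(pred)|, a value in [0, 1]."""
    s_true = same_community_pairs(true_communities)
    s_pred = same_community_pairs(predicted_communities)
    union = s_true | s_pred
    if not union:            # both partitions contain only singletons
        return 1.0
    return len(s_true & s_pred) / len(union)
```

Under this assumption, a prediction that splits one true community into two scores below 1 because the pairs straddling the split are missing from the intersection, while identical partitions score exactly 1.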
Fig. 1(a) shows the Jaccard Index as a function of $p_{out}$ for $p_{in}$ = 0.5. Results for other values of $p_{in}$ or using other types of accuracy measurement are similar (data not shown). The WS algorithm, which explicitly searches over all k's, has the best accuracy. On the other hand, Kcut with large l values can better approximate WS than with small l values. Moreover, as shown in Fig. 1(b), the Q values achieved by the algorithms match their accuracies: WS has the highest modularity, followed by K-5, K-4, ..., and the Newman algorithm at last. A third measure, the number of times an algorithm predicted k correctly, also shows that WS > K-5 > ··· > K-2 > NM (data not shown). The CNM algorithm has an accuracy similar to K-2 for smaller $p_{out}$, but its accuracy drops significantly when $p_{out}$ increases.

4.2 Real-world networks

We further tested our method on several real-world networks. These include an acquaintance network in a Karate club [22], the opponent network of American NCAA Division I college football teams in the year 2000 [9], a co-performing network of Jazz Bands [10], a protein-protein interaction network of E. coli [18], the Autonomous Systems topology of the Internet [7], and a collaboration network of physicists [13]. As shown in Table 1, the WS algorithm usually returns community structures with the highest Q value. Although Kcut with l = 2 often performed poorly, Kcut with l ≥ 3 can usually achieve Q values as good as those of WS, but with a much reduced running time. Moreover, for the three networks (Karate, Jazz, Physicists) that have been analyzed by others, Kcut can find modularity values that are comparable to or better than the best known ones. The NM algorithm (without the refining step) and the CNM algorithm usually have much worse accuracy compared to WS and Kcut. The WS and Newman algorithms failed to finish on the physicist network, due to their excessive running time or memory usage.

In addition, the communities returned by Kcut are often very close to the known communities if they are available.
For example, for the Karate club network, Kcut precisely predicted the actual separation of the club caused by a dispute among its members [9]. For the football network, Kcut correctly revealed the official NCAA conference structure of the football teams [9], except for a few teams that do not belong to any conference. Because of the space limit, we omit the detailed results here.

4.3 Running time

Table 2 shows the running time of the four algorithms on the six real-world networks. Table 3 shows the time spent on eigs and k-means by WS, Kcut and NM. CNM is based on a different rationale and does not have these two components. As shown in Table 2, although WS is efficient for small networks of up to a few hundred vertices, it is very inefficient on large networks. The Kcut algorithm, on the other hand, can handle networks of several thousand vertices in less than half a minute. It appears in Table 2 that CNM is the most efficient, especially for small networks. At least part of the reason is that CNM was implemented in the C language, while the other three algorithms were all implemented in MATLAB M-files. M-files are interpreted at run time, and therefore have higher overhead.

Also observe that Kcut is often faster with l = 3, 4, 5 than with l = 2. Based on the analysis in Section 3.1, the time Kcut spent in eigs is approximately linear in $l/\ln l$, which reaches its minimum at l = 3. In contrast, the time Kcut spent on k-means is proportional to $l^2/\ln l$, which is monotonically increasing for l ≥ 2. The experimental results in Table 3 partially support the theoretical analysis. For large networks, the total running time of Kcut is dominated by eigs. Therefore, Kcut can take advantage of a slightly increased l to reduce its running time. When l be-

Table 3. CPU time (seconds) for program components.
