Mutual information in learning feature transformations
collaborative mutual learning
Collaborative Mutual Learning

Introduction

In today’s fast-paced and interconnected world, the importance of collaborative learning has gained significant recognition. Collaborative mutual learning refers to a process where individuals or groups work together, sharing knowledge, experiences, and insights to enhance their overall learning outcomes. This article aims to delve into the concept of collaborative mutual learning, exploring its benefits, strategies, and potential challenges.

Benefits of Collaborative Mutual Learning

Collaborative mutual learning offers numerous advantages, both to individuals and to the collective group. Some key benefits include:

1. Enhanced Learning
By actively participating in collaborative mutual learning, individuals can expand their understanding and gain insights from different perspectives. This engagement promotes higher retention of information, critical thinking skills, and the ability to apply knowledge in real-world scenarios.

2. Acquisition of New Skills
Working collaboratively exposes individuals to a diverse set of skills possessed by others. Through observation, interaction, and hands-on experiences, individuals can develop new abilities and competencies that they may not have acquired through solitary learning.

3. Improved Communication and Social Skills
Collaborative mutual learning involves constant communication, active listening, and effective collaboration. Such interactions foster the development of excellent communication and social skills, including empathy, negotiation, and teamwork. These skills are crucial in professional settings and everyday life.

4. Building a Supportive Learning Community
Collaborative mutual learning creates a sense of belonging and camaraderie within a group. Participants can support and motivate each other, leading to increased engagement, enthusiasm, and overall satisfaction with the learning process.

Strategies for Successful Collaborative Mutual Learning

To make collaborative mutual learning effective, several strategies can be implemented. It is worth noting that these strategies can be adapted according to the specific context and participants’ needs. Some key strategies include:

1. Clear Objectives and Roles
Establishing clear learning objectives and assigning specific roles to each participant is crucial. This clarity ensures that everyone understands their responsibilities and the purpose of collaborative learning activities.

2. Effective Communication Channels
Utilizing communication channels that facilitate easy interaction and information sharing is vital. Tools such as online discussion forums, video conferencing, and collaboration platforms enable seamless communication and efficient exchange of ideas.

3. Encouragement of Active Participation
Creating an environment that encourages active participation is essential for successful collaborative mutual learning. Facilitators should promote inclusivity, raise thought-provoking questions, and provide opportunities for everyone to contribute.

4. Regular Reflection and Feedback
Incorporating regular reflection and feedback sessions allows participants to assess their progress, identify areas for improvement, and offer constructive suggestions to their peers. This reflective practice enhances the learning experience and helps in refining collaborative skills.

Challenges of Collaborative Mutual Learning

While collaborative mutual learning has numerous benefits, it is not without its challenges. Recognizing and addressing these challenges is crucial for fostering a positive learning environment. Some common challenges include:

1. Unequal Participation and Contributions
In collaborative settings, some individuals may dominate discussions, while others may remain silent or contribute minimally. This inequality hampers the mutual learning experience. It is essential to establish norms that promote equitable participation and ensure everyone’s contributions are valued.

2. Conflict Resolution
Differences in opinions and conflicts can arise during collaborative mutual learning. Effective conflict resolution strategies, such as active listening, empathy, and compromise, should be employed to maintain a harmonious learning environment.

3. Time Management
Collaborative mutual learning requires time for planning, communication, and coordination among participants. Managing time effectively and setting realistic deadlines is crucial to prevent delays and maintain the momentum of the learning process.

4. Balancing Individual and Group Goals
While collaborative learning emphasizes collective achievements, individuals may have personal goals and aspirations. Striking a balance between individual and group goals ensures the satisfaction and motivation of all participants.

Conclusion

Collaborative mutual learning offers a dynamic and engaging approach to learning. It enhances knowledge acquisition, fosters the development of essential skills, and cultivates a supportive learning community. By implementing effective strategies and addressing potential challenges, collaborative mutual learning can revolutionize the learning experience, preparing individuals for success in various personal and professional endeavors.
Multi-label feature selection based on mutual information
1 Introduction
Multi-label learning has become a research hotspot in the machine learning community both in China and abroad [1-2]. In many practical applications, multi-label learning involves large amounts of high-dimensional data. Although in theory more features should yield higher classification accuracy, in practice a large number of redundant features not only tends to cause overfitting, but also increases the complexity of the algorithm and degrades classifier performance [3]. To address this problem, a large number of multi-label dimensionality reduction algorithms [4] have been proposed.
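Before the formal development, the core idea—score each candidate feature by how much information it shares with the label set and prefer high-scoring features—can be sketched in a few lines. This is a minimal illustration under my own assumptions (discrete or discretized features, scikit-learn's mutual_info_score as the estimator, simple averaging over labels); it is not the algorithm proposed in this paper.

```python
# Minimal sketch: rank features for multi-label data by their average
# mutual information (MI) with the labels. Assumes discrete feature values.
import numpy as np
from sklearn.metrics import mutual_info_score

def mi_feature_ranking(X, Y):
    """X: (n_samples, n_features) discrete features.
    Y: (n_samples, n_labels) binary label matrix.
    Returns feature indices sorted from most to least relevant, and scores."""
    n_features, n_labels = X.shape[1], Y.shape[1]
    scores = np.zeros(n_features)
    for j in range(n_features):
        # Average MI between feature j and each label column.
        scores[j] = np.mean([mutual_info_score(X[:, j], Y[:, k])
                             for k in range(n_labels)])
    return np.argsort(scores)[::-1], scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(200, 6))          # 6 discrete features
    Y = np.stack([(X[:, 0] > 0).astype(int),       # label 1 depends on feature 0
                  (X[:, 2] == 1).astype(int)], 1)  # label 2 depends on feature 2
    order, scores = mi_feature_ranking(X, Y)
    print("features ranked by relevance:", order)
```

Redundancy among the selected features is deliberately ignored here; criteria that also penalize feature-feature mutual information are discussed in the survey later in this collection.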
Example-based metonymy recognition for proper nouns
Example-Based Metonymy Recognition for Proper NounsYves PeirsmanQuantitative Lexicology and Variational LinguisticsUniversity of Leuven,Belgiumyves.peirsman@arts.kuleuven.beAbstractMetonymy recognition is generally ap-proached with complex algorithms thatrely heavily on the manual annotation oftraining and test data.This paper will re-lieve this complexity in two ways.First,it will show that the results of the cur-rent learning algorithms can be replicatedby the‘lazy’algorithm of Memory-BasedLearning.This approach simply stores alltraining instances to its memory and clas-sifies a test instance by comparing it to alltraining examples.Second,this paper willargue that the number of labelled trainingexamples that is currently used in the lit-erature can be reduced drastically.Thisfinding can help relieve the knowledge ac-quisition bottleneck in metonymy recog-nition,and allow the algorithms to be ap-plied on a wider scale.1IntroductionMetonymy is afigure of speech that uses“one en-tity to refer to another that is related to it”(Lakoff and Johnson,1980,p.35).In example(1),for in-stance,China and Taiwan stand for the govern-ments of the respective countries:(1)China has always threatened to use forceif Taiwan declared independence.(BNC) Metonymy resolution is the task of automatically recognizing these words and determining their ref-erent.It is therefore generally split up into two phases:metonymy recognition and metonymy in-terpretation(Fass,1997).The earliest approaches to metonymy recogni-tion identify a word as metonymical when it vio-lates selectional restrictions(Pustejovsky,1995).Indeed,in example(1),China and Taiwan both violate the restriction that threaten and declare require an animate subject,and thus have to be interpreted metonymically.However,it is clear that many metonymies escape this characteriza-tion.Nixon in example(2)does not violate the se-lectional restrictions of the verb to bomb,and yet, it metonymically refers to the army under Nixon’s command.(2)Nixon bombed Hanoi.This example shows that metonymy recognition should not be based on rigid rules,but rather on statistical information about the semantic and grammatical context in which the target word oc-curs.This statistical dependency between the read-ing of a word and its grammatical and seman-tic context was investigated by Markert and Nis-sim(2002a)and Nissim and Markert(2003; 2005).The key to their approach was the in-sight that metonymy recognition is basically a sub-problem of Word Sense Disambiguation(WSD). 
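In other words, each occurrence of a possibly metonymical name becomes a labelled instance over its grammatical context, and recognition reduces to choosing one sense label per instance. The toy fragment below only illustrates that framing; the feature names and instances are invented for illustration and are not taken from the BNC data used in the paper.

```python
# Toy framing: metonymy recognition as sense classification. Each instance
# pairs context features of a proper noun (here: grammatical role and head
# word) with one of a fixed set of reading labels. Examples are invented.
instances = [
    ({"role": "subj", "head": "threaten"}, "place-for-people"),
    ({"role": "subj", "head": "declare"},  "place-for-people"),
    ({"role": "pp",   "head": "in"},       "literal"),
    ({"role": "gen",  "head": "border"},   "literal"),
]
features = [f for f, _ in instances]
readings = [r for _, r in instances]
# Any word-sense-disambiguation classifier over (features, readings) applies;
# the point is only that the task reduces to assigning one sense label.
```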
Possibly metonymical words are polysemous,and they generally belong to one of a number of pre-defined metonymical categories.Hence,like WSD, metonymy recognition boils down to the auto-matic assignment of a sense label to a polysemous word.This insight thus implied that all machine learning approaches to WSD can also be applied to metonymy recognition.There are,however,two differences between metonymy recognition and WSD.First,theo-retically speaking,the set of possible readings of a metonymical word is open-ended(Nunberg, 1978).In practice,however,metonymies tend to stick to a small number of patterns,and their la-bels can thus be defined a priori.Second,classic 71WSD algorithms take training instances of one par-ticular word as their input and then disambiguate test instances of the same word.By contrast,since all words of the same semantic class may undergo the same metonymical shifts,metonymy recogni-tion systems can be built for an entire semantic class instead of one particular word(Markert and Nissim,2002a).To this goal,Markert and Nissim extracted from the BNC a corpus of possibly metonymical words from two categories:country names (Markert and Nissim,2002b)and organization names(Nissim and Markert,2005).All these words were annotated with a semantic label —either literal or the metonymical cate-gory they belonged to.For the country names, Markert and Nissim distinguished between place-for-people,place-for-event and place-for-product.For the organi-zation names,the most frequent metonymies are organization-for-members and organization-for-product.In addition, Markert and Nissim used a label mixed for examples that had two readings,and othermet for examples that did not belong to any of the pre-defined metonymical patterns.For both categories,the results were promis-ing.The best algorithms returned an accuracy of 87%for the countries and of76%for the orga-nizations.Grammatical features,which gave the function of a possibly metonymical word and its head,proved indispensable for the accurate recog-nition of metonymies,but led to extremely low recall values,due to data sparseness.Therefore Nissim and Markert(2003)developed an algo-rithm that also relied on semantic information,and tested it on the mixed country data.This algo-rithm used Dekang Lin’s(1998)thesaurus of se-mantically similar words in order to search the training data for instances whose head was sim-ilar,and not just identical,to the test instances. Nissim and Markert(2003)showed that a combi-nation of semantic and grammatical information gave the most promising results(87%). However,Nissim and Markert’s(2003)ap-proach has two major disadvantages.Thefirst of these is its complexity:the best-performing al-gorithm requires smoothing,backing-off to gram-matical roles,iterative searches through clusters of semantically similar words,etc.In section2,I will therefore investigate if a metonymy recognition al-gorithm needs to be that computationally demand-ing.In particular,I will try and replicate Nissim and Markert’s results with the‘lazy’algorithm of Memory-Based Learning.The second disadvantage of Nissim and Mark-ert’s(2003)algorithms is their supervised nature. 
Because they rely so heavily on the manual annotation of training and test data, an extension of the classifiers to more metonymical patterns is extremely problematic. Yet, such an extension is essential for many tasks throughout the field of Natural Language Processing, particularly Machine Translation. This knowledge acquisition bottleneck is a well-known problem in NLP, and many approaches have been developed to address it. One of these is active learning, or sample selection, a strategy that makes it possible to selectively annotate those examples that are most helpful to the classifier. It has previously been applied to NLP tasks such as parsing (Hwa, 2002; Osborne and Baldridge, 2004) and Word Sense Disambiguation (Fujii et al., 1998). In section 3, I will introduce active learning into the field of metonymy recognition.

2 Example-based metonymy recognition

As I have argued, Nissim and Markert's (2003) approach to metonymy recognition is quite complex. I therefore wanted to see if this complexity can be dispensed with, and if it can be replaced with the much simpler algorithm of Memory-Based Learning. The advantages of Memory-Based Learning (MBL), which is implemented in the TiMBL classifier (Daelemans et al., 2004), are twofold. First, it is based on a plausible psychological hypothesis of human learning. It holds that people interpret new examples of a phenomenon by comparing them to "stored representations of earlier experiences" (Daelemans et al., 2004, p. 19). This contrasts with many other classification algorithms, such as Naive Bayes, whose psychological validity is an object of heavy debate. Second, as a result of this learning hypothesis, an MBL classifier such as TiMBL eschews the formulation of complex rules or the computation of probabilities during its training phase. Instead it stores all training vectors in its memory, together with their labels. In the test phase, it computes the distance between the test vector and all these training vectors, and simply returns the most frequent label of the most similar training examples.

One of the most important challenges in Memory-Based Learning is adapting the algorithm to one's data. This includes finding a representative seed set as well as determining the right distance measures. For my purposes, however, TiMBL's default settings proved more than satisfactory. TiMBL implements the IB1 and IB2 algorithms that were presented in Aha et al. (1991), but adds a broad choice of distance measures. Its default implementation of the IB1 algorithm, which is called IB1-IG in full (Daelemans and Van den Bosch, 1992), proved most successful in my experiments. It computes the distance between two vectors X and Y by adding up the weighted distances δ between their corresponding feature values x_i and y_i:

\Delta(X, Y) = \sum_{i=1}^{n} w_i \, \delta(x_i, y_i)    (3)

The most important element in this equation is the weight that is given to each feature. In IB1-IG, features are weighted by their Gain Ratio (equation 4), the division of the feature's Information Gain by its split info. Information Gain, the numerator in equation (4), "measures how much information it [feature i] contributes to our knowledge of the correct class label [...] by computing the difference in uncertainty (i.e. entropy) between the situations without and with knowledge of the value of that feature" (Daelemans et al., 2004, p. 20). In order not "to overestimate the relevance of features with large numbers of values" (Daelemans et al., 2004, p. 21), this Information Gain is then divided by the split info, the entropy of the feature values (equation 5). In the following equations, C is the set of class labels, H(C) is the entropy of that set, and V_i is the set of values for feature i:

w_i = \frac{H(C) - \sum_{v \in V_i} P(v) \times H(C \mid v)}{si(i)}    (4)

si(i) = - \sum_{v \in V_i} P(v) \log_2 P(v)    (5)

(Footnote: This data is publicly available and can be downloaded from /mnissim/mascara.)

[Table 1: Results for the mixed country data. TiMBL: my TiMBL results; N&M: Nissim and Markert's (2003) results.]

With this simple learning phase, TiMBL is able to replicate the results from Nissim and Markert (2003; 2005). As table 1 shows, accuracy for the mixed country data is almost identical to Nissim and Markert's figure, and precision, recall and F-score for the metonymical class lie only slightly lower. TiMBL's results for the Hungary data were similar, and equally comparable to Markert and Nissim's (Katja Markert, personal communication). Note, moreover, that these results were reached with grammatical information only, whereas Nissim and Markert's (2003) algorithm relied on semantics as well.

Next, table 2 indicates that TiMBL's accuracy for the mixed organization data lies about 1.5% below Nissim and Markert's (2005) figure. This result should be treated with caution, however. First, Nissim and Markert's available organization data had not yet been annotated for grammatical features, and my annotation may slightly differ from theirs. Second, Nissim and Markert used several feature vectors for instances with more than one grammatical role and filtered all mixed instances from the training set. A test instance was treated as mixed only when its several feature vectors were classified differently. My experiments, in contrast, were similar to those for the location data, in that each instance corresponded to one vector. Hence, the slightly lower performance of TiMBL is probably due to differences between the two experiments.

These first experiments thus demonstrate that Memory-Based Learning can give state-of-the-art performance in metonymy recognition. In this respect, it is important to stress that the results for the country data were reached without any semantic information, whereas Nissim and Markert's (2003) algorithm used Dekang Lin's (1998) clusters of semantically similar words in order to deal with data sparseness. This fact, together …

[Table 2 (fragment): Acc … R …; TiMBL 78.65% 65.10% 76.0% —]

[Figure 1: Accuracy learning curves for the mixed country data with and without semantic information.]

… in more detail. As figure 1 indicates, with respect to overall accuracy, semantic features have a negative influence: the learning curve with both features climbs much more slowly than that with only grammatical features. Hence, contrary to my expectations, grammatical features seem to allow a better generalization from a limited number of training instances. With respect to the F-score on the metonymical category in figure 2, the differences are much less outspoken. Both features give similar learning curves, but semantic features lead to a higher final F-score. In particular, the use of semantic features results in a lower precision figure, but a higher recall score. Semantic features thus cause the classifier to slightly overgeneralize from the metonymic training examples.

There are two possible reasons for this inability of semantic information to improve the classifier's performance. First, WordNet's synsets do not always map well to one of our semantic labels: many are rather broad and allow for several readings of the target word, while others are too specific to make generalization possible. Second, there is the predominance of prepositional phrases in our data. With their closed set of heads, the number of examples that benefits from semantic
information about its head is actually rather small. Nevertheless,myfirst round of experiments has indicated that Memory-Based Learning is a sim-ple but robust approach to metonymy recogni-tion.It is able to replace current approaches that need smoothing or iterative searches through a the-saurus,with a simple,distance-based algorithm.Figure3:Accuracy learning curves for the coun-try data with random and maximum-distance se-lection of training examples.over all possible labels.The algorithm then picks those instances with the lowest confidence,since these will contain valuable information about the training set(and hopefully also the test set)that is still unknown to the system.One problem with Memory-Based Learning al-gorithms is that they do not directly output prob-abilities.Since they are example-based,they can only give the distances between the unlabelled in-stance and all labelled training instances.Never-theless,these distances can be used as a measure of certainty,too:we can assume that the system is most certain about the classification of test in-stances that lie very close to one or more of its training instances,and less certain about those that are further away.Therefore the selection function that minimizes the probability of the most likely label can intuitively be replaced by one that max-imizes the distance from the labelled training in-stances.However,figure3shows that for the mixed country instances,this function is not an option. Both learning curves give the results of an algo-rithm that starts withfifty random instances,and then iteratively adds ten new training instances to this initial seed set.The algorithm behind the solid curve chooses these instances randomly,whereas the one behind the dotted line selects those that are most distant from the labelled training exam-ples.In thefirst half of the learning process,both functions are equally successful;in the second the distance-based function performs better,but only slightly so.There are two reasons for this bad initial per-formance of the active learning function.First,it is not able to distinguish between informativeandFigure4:Accuracy learning curves for the coun-try data with random and maximum/minimum-distance selection of training examples. 
unusual training instances.This is because a large distance from the seed set simply means that the particular instance’s feature values are relatively unknown.This does not necessarily imply that the instance is informative to the classifier,how-ever.After all,it may be so unusual and so badly representative of the training(and test)set that the algorithm had better exclude it—something that is impossible on the basis of distances only.This bias towards outliers is a well-known disadvantage of many simple active learning algorithms.A sec-ond type of bias is due to the fact that the data has been annotated with a few features only.More par-ticularly,the present algorithm will keep adding instances whose head is not yet represented in the training set.This entails that it will put off adding instances whose function is pp,simply because other functions(subj,gen,...)have a wider variety in heads.Again,the result is a labelled set that is not very representative of the entire training set.There are,however,a few easy ways to increase the number of prototypical examples in the train-ing set.In a second run of experiments,I used an active learning function that added not only those instances that were most distant from the labelled training set,but also those that were closest to it. After a few test runs,I decided to add six distant and four close instances on each iteration.Figure4 shows that such a function is indeed fairly success-ful.Because it builds a labelled training set that is more representative of the test set,this algorithm clearly reduces the number of annotated instances that is needed to reach a given performance.Despite its success,this function is obviously not yet a sophisticated way of selecting good train-76Figure5:Accuracy learning curves for the organi-zation data with random and distance-based(AL) selection of training examples with a random seed set.ing examples.The selection of the initial seed set in particular can be improved upon:ideally,this seed set should take into account the overall dis-tribution of the training examples.Currently,the seeds are chosen randomly.Thisflaw in the al-gorithm becomes clear if it is applied to another data set:figure5shows that it does not outper-form random selection on the organization data, for instance.As I suggested,the selection of prototypical or representative instances as seeds can be used to make the present algorithm more robust.Again,it is possible to use distance measures to do this:be-fore the selection of seed instances,the algorithm can calculate for each unlabelled instance its dis-tance from each of the other unlabelled instances. 
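To tie together the IB1-IG distance of equations (3)-(5) and the distance-based selection strategies compared in figures 3-5, here is a schematic sketch under my own assumptions (symbolic features, overlap distance, plain Python, invented toy data). It is not TiMBL's or the author's actual code; in particular, the "six distant and four close" batch is exposed as parameters rather than hard-coded, and the average-distance helper anticipates the prototypical-seed idea taken up in the next paragraph.

```python
# Schematic sketch: gain-ratio feature weights (eq. 4-5), the weighted overlap
# distance (eq. 3), and distance-based active-learning selection (section 3).
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(column, labels):
    """Eq. (4)-(5): information gain of one feature over the class labels,
    divided by the entropy of the feature's own values (split info)."""
    n = len(labels)
    cond = sum((cnt / n) * entropy([lab for x, lab in zip(column, labels) if x == v])
               for v, cnt in Counter(column).items())
    split_info = entropy(column)
    return (entropy(labels) - cond) / split_info if split_info > 0 else 0.0

def ib1_ig_distance(x, y, weights):
    """Eq. (3): weighted overlap distance (0 per matching value, 1 otherwise)."""
    return sum(w * (0.0 if xi == yi else 1.0) for w, xi, yi in zip(weights, x, y))

def distance_to_labelled(x, labelled, weights):
    """Certainty proxy: distance to the nearest labelled training instance."""
    return min(ib1_ig_distance(x, l, weights) for l in labelled)

def select_batch(pool, labelled, weights, n_far=6, n_close=4):
    """Pick the most distant and the closest unlabelled instances, mirroring
    the mixed maximum/minimum-distance selection described above."""
    ranked = sorted(pool, key=lambda x: distance_to_labelled(x, labelled, weights))
    return ranked[:n_close] + ranked[-n_far:]

def avg_distance(x, pool, weights):
    """Average distance of an instance to the rest of the unlabelled pool;
    small values indicate prototypical instances (see the next paragraph)."""
    others = [y for y in pool if y is not x]
    return sum(ib1_ig_distance(x, y, weights) for y in others) / len(others)

# Invented toy data: (grammatical role, head word) -> reading.
train = [(("subj", "threaten"), "metonymic"), (("subj", "declare"), "metonymic"),
         (("pp", "in"), "literal"), (("gen", "border"), "literal")]
labels = [lab for _, lab in train]
columns = list(zip(*[x for x, _ in train]))
weights = [gain_ratio(col, labels) for col in columns]
labelled = [x for x, _ in train]
pool = [("subj", "declare"), ("subj", "bomb"), ("pp", "to"), ("org", "announce")]
print(select_batch(pool, labelled, weights, n_far=2, n_close=1))
```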
In this way,it can build a prototypical seed set by selecting those instances with the smallest dis-tance on average.Figure6indicates that such an algorithm indeed outperforms random sample se-lection on the mixed organization data.For the calculation of the initial distances,each feature re-ceived the same weight.The algorithm then se-lected50random samples from the‘most proto-typical’half of the training set.5The other settings were the same as above.With the present small number of features,how-ever,such a prototypical seed set is not yet always as advantageous as it could be.A few experiments indicated that it did not lead to better performance on the mixed country data,for instance.However, as soon as a wider variety of features is taken into account(as with the organization data),the advan-pling can help choose those instances that are most helpful to the classifier.A few distance-based al-gorithms were able to drastically reduce the num-ber of training instances that is needed for a given accuracy,both for the country and the organization names.If current metonymy recognition algorithms are to be used in a system that can recognize all pos-sible metonymical patterns across a broad variety of semantic classes,it is crucial that the required number of labelled training examples be reduced. This paper has taken thefirst steps along this path and has set out some interesting questions for fu-ture research.This research should include the investigation of new features that can make clas-sifiers more robust and allow us to measure their confidence more reliably.This confidence mea-surement can then also be used in semi-supervised learning algorithms,for instance,where the clas-sifier itself labels the majority of training exam-ples.Only with techniques such as selective sam-pling and semi-supervised learning can the knowl-edge acquisition bottleneck in metonymy recogni-tion be addressed.AcknowledgementsI would like to thank Mirella Lapata,Dirk Geer-aerts and Dirk Speelman for their feedback on this project.I am also very grateful to Katja Markert and Malvina Nissim for their helpful information about their research.ReferencesD.W.Aha, D.Kibler,and M.K.Albert.1991.Instance-based learning algorithms.Machine Learning,6:37–66.W.Daelemans and A.Van den Bosch.1992.Generali-sation performance of backpropagation learning on a syllabification task.In M.F.J.Drossaers and A.Ni-jholt,editors,Proceedings of TWLT3:Connection-ism and Natural Language Processing,pages27–37, Enschede,The Netherlands.W.Daelemans,J.Zavrel,K.Van der Sloot,andA.Van den Bosch.2004.TiMBL:Tilburg Memory-Based Learner.Technical report,Induction of Linguistic Knowledge,Computational Linguistics, Tilburg University.D.Fass.1997.Processing Metaphor and Metonymy.Stanford,CA:Ablex.A.Fujii,K.Inui,T.Tokunaga,and H.Tanaka.1998.Selective sampling for example-based wordsense putational Linguistics, 24(4):573–597.R.Hwa.2002.Sample selection for statistical parsing.Computational Linguistics,30(3):253–276.koff and M.Johnson.1980.Metaphors We LiveBy.London:The University of Chicago Press.D.Lin.1998.An information-theoretic definition ofsimilarity.In Proceedings of the International Con-ference on Machine Learning,Madison,USA.K.Markert and M.Nissim.2002a.Metonymy res-olution as a classification task.In Proceedings of the Conference on Empirical Methods in Natural Language Processing(EMNLP2002),Philadelphia, USA.K.Markert and M.Nissim.2002b.Towards a cor-pus annotated for metonymies:the case of location names.In Proceedings of the Third 
International Conference on Language Resources and Evaluation (LREC2002),Las Palmas,Spain.M.Nissim and K.Markert.2003.Syntactic features and word similarity for supervised metonymy res-olution.In Proceedings of the41st Annual Meet-ing of the Association for Computational Linguistics (ACL-03),Sapporo,Japan.M.Nissim and K.Markert.2005.Learning to buy a Renault and talk to BMW:A supervised approach to conventional metonymy.In H.Bunt,editor,Pro-ceedings of the6th International Workshop on Com-putational Semantics,Tilburg,The Netherlands. G.Nunberg.1978.The Pragmatics of Reference.Ph.D.thesis,City University of New York.M.Osborne and J.Baldridge.2004.Ensemble-based active learning for parse selection.In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics(HLT-NAACL).Boston, USA.J.Pustejovsky.1995.The Generative Lexicon.Cam-bridge,MA:MIT Press.78。
Summary of feature selection algorithms
控 制 与 决 策第 27 卷 第 2 期 V ol. 27 No. 22012 年 2 月Feb. 2012Control andDecision文章编号: 1001-0920 (2012) 02-0161-06特征选择方法综述姚 旭, 王晓丹, 张玉玺, 权 文(空军工程大学 计算机工程系,陕西 三原 713800)摘 要: 特征选择是模式识别的关键问题之一, 特征选择结果的好坏直接影响着分类器的分类精度和泛化性能. 首 先分析了特征选择方法的框架; 然后从搜索策略和评价准则两个角度对特征选择方法进行了分析和总结; 最后分析 了对特征选择的影响因素, 并指出了实际应用中需要解决的问题. 关键词: 特征选择;搜索策略;评价准则 中图分类号: TP391文献标识码: ASummary of feature selection algorithmsYAO Xu , WANG Xiao-dan , ZHANG Y u-xi , QUAN Wen(Department of Computer Engineering ,Air Force Engineering University ,Sanyuan 713800,China. Correspondent: Y AO Xu ,E-mail :***************)Abstract: Feature selection is one of the key processes in pattern recognition. The accuracy and generalization capability of classifier are affected by the result of feature selection directly. Firstly, the framework of feature selection algorithm is analyzed. Then feature selection algorithm is classified and analyzed from two points which are searching strategy and evaluation criterion. Finally, the problem is given to solve real-world applications by analyzing infection factors in the feature selection technology.Key words: feature selection ;searching strategy ;evaluation criterion引言1 人[3] 从提高预测精度的角度定义特征选择是一个能 够增加分类精度, 或者在不降低分类精度的条件下 降低特征维数的过程; Koller 等人[4] 从分布的角度定 义特征选择为: 在保证结果类分布尽可能与原始数 据类分布相似的条件下, 选择尽可能小的特征子集; Dash 等人[5] 给出的定义是选择尽量小的特征子集, 并 满足不显著降低分类精度和不显著改变类分布两个 条件. 上述各种定义的出发点不同, 各有侧重点, 但是 目标都是寻找一个能够有效识别目标的最小特征子 集. 由文献 [2-5] 可知, 对特征选择的定义基本都是从 分类正确率以及类分布的角度考虑. Dash 等人[5] 给出 了特征选择的基本框架, 如图 1 所示.特征选择是从一组特征中挑选出一些最有效的 特征以降低特征空间维数的过程[1] , 是模式识别的关 键问题之一. 对于模式识别系统, 一个好的学习样本 是训练分类器的关键, 样本中是否含有不相关或冗余 信息直接影响着分类器的性能. 因此研究有效的特征 选择方法至关重要.本文分析讨论了目前常用的特征选择方法, 按照 搜索策略和评价准则的不同对特征选择方法进行了 分类和比较, 指出了目前特征选择方法及研究中存在 的问题. 目前, 虽然特征选择方法有很多, 但针对实际 问题的研究还存在很多不足, 如何针对特定问题给出 有效的方法仍是一个需要进一步解决的问题.特征选择的框架迄今为止, 已有很多学者从不同角度对特征选择进行过定义: Kira 等人[2] 定义理想情况下特征选择是 寻找必要的、足以识别目标的最小特征子集; John 等 2 图 1 特征选择的基本框架收稿日期: 2011-04-26;修回日期: 2011-07-12.基金项目: 国家自然科学基金项目(60975026).作者简介: 姚旭(1982−), 女, 博士生, 从事智能信息处理、机器学习等研究;王晓丹(1966−), 女, 教授, 博士生导师, 从事智能信息处理、机器学习等研究.由于子集搜索是一个比较费时的步骤, Y u 等 人[6]基于相关和冗余分析, 给出了另一种特征选择框 架, 避免了子集搜索, 可以高效快速地寻找最优子集. 框架如图 2 所示.间远远小于 (2N ).存在的问题: 具有较高的不确定性, 只有当总循 环次数较大时, 才可能找到较好的结果. 在随机搜索 策略中, 可能需对一些参数进行设置, 参数选择的合 适与否对最终结果的好坏起着很大的作用. 因此, 参 数选择是一个关键步骤.3.3 采用启发式搜索策略的特征选择方法这类特征选择方法主要有: 单独最优特征组合, 序列前向选择方法 (SFS), 广义序列前向选择方法 (GSFS), 序列后向选择方法 (SBS), 广义序列后向选择 方法 (GSBS), 增 l 去 选择方法, 广义增 l 去 选择方 法, 浮动搜索方法. 这类方法易于实现且快速, 它的搜 索空间是 (N 2 ). 一般认为采用浮动广义后向选择方 法 (FGSBS) 是较为有利于实际应用的一种特征选择 搜索策略, 它既考虑到特征之间的统计相关性, 又用 浮动方法保证算法运行的快速稳定性[13] . 存在的问 题是: 启发式搜索策略虽然效率高, 但是它以牺牲全 局最优为代价.每种搜索策略都有各自的优缺点, 在实际应用过 程中, 可以根据具体环境和准则函数来寻找一个最佳 的平衡点. 例如, 如果特征数较少, 可采用全局最优搜 索策略; 若不要求全局最优, 但要求计算速度快, 则可 采用启发式策略; 若需要高性能的子集, 而不介意计 算时间, 则可采用随机搜索策略.图 2 改进的特征选择框架从特征选择的基本框架可以看出, 特征选择方法中有 4 个基本步骤: 候选特征子集的生成 (搜索策 略)、评价准则、停止准则和验证方法[7-8] . 目前对特征 选择方法的研究主要集中于搜索策略和评价准则, 因 而, 本文从搜索策略和评价准则两个角度对特征选择 方法进行分类.基于搜索策略划分特征选择方法基本的搜索策略按照特征子集的形成过程可分 为以下 3 种: 全局最优、随机搜索和启发式搜索[9] . 一 个具体的搜索算法会采用两种或多种基本搜索策略, 例如遗传算法是一种随机搜索算法, 同时也是一种启 发式搜索算法. 下面对 3 种基本的搜索策略进行分析 比较.3.1 采用全局最优搜索策略的特征选择方法 迄今为止, 唯一得到最优结果的搜索方法是分支 定界法[10] . 这种算法能保证在事先确定优化特征子 集中特征数目的情况下, 找到相对于所设计的可分 性判据而言的最优子集. 它的搜索空间是 (2N ) (其 中 N 为特征的维数). 存在的问题: 很难确定优化特征 子集的数目; 满足单调性的可分性判据难以设计; 处 理高维多类问题时, 算法的时间复杂度较高. 因此, 虽 然全局最优搜索策略能得到最优解, 但因为诸多因素 限制, 无法被广泛应用.3.2 采用随机搜索策略的特征选择方法在计算过程中把特征选择问题与模拟退火算 法、禁忌搜索算法、遗传算法等, 或者仅仅是一个随 机重采样[11-12] 过程结合起来, 以概率推理和采样过程 作为算法的基础, 基于对分类估计的有效性, 在算法 运行中对每个特征赋予一定的权重; 然后根据用户所 定义的或自适应的阈值来对特征重要性进行评价. 当 特征所对应的权重超出了这个阈值, 它便被选中作为 重要的特征来训练分类器. Relief 系列算法即是一种 典型的根据权重选择特征的随机搜索方法, 它能有效 地去掉无关特征, 但不能去除冗余, 而且只能用于两 类分类. 随机方法可以细分为完全随机方法和概率随 机方法两种. 
虽然搜索空间仍为 (2N ), 但是可以通 过设置最大迭代次数限制搜索空间小于 (2N ). 例如 遗传算法, 由于采用了启发式搜索策略, 它的搜索空3 基于评价准则划分特征选择方法特征选择方法依据是否独立于后续的学习算 法, 可分为过滤式 (Filter) 和封装式 (Wrapper)[14] 两种. Filter 与后续学习算法无关, 一般直接利用所有训练 数据的统计性能评估特征, 速度快, 但评估与后续学 习算法的性能偏差较大. Wrapper 利用后续学习算法 的训练准确率评估特征子集, 偏差小, 计算量大, 不适 合大数据集. 下面分别对 Filter 和 Wrapper 方法进行 分析.4.1 过滤式 (Filter) 评价策略的特征选择方法Filter 特征选择方法一般使用评价准则来增强特 征与类的相关性, 削减特征之间的相关性. 可将评价 函数分成 4 类[5] : 距离度量、信息度量、依赖性度量以 及一致性度量.4.1.1 距离度量 距离度量通常也认为是分离性、差异性或者辨4 识能力的度量. 最为常用的一些重要距离测度 有[1] 欧氏距离、 阶 Minkowski 测度、Chebychev 距离、平 方距离等. 两类分类问题中, 对于特征 X 和 Y , 如果 由 X 引起的两类条件概率差异性大于 Y , 则 X 优于 Y . 因为特征选择的目的是找到使两类尽可能分离的姚 旭 等: 特征选择方法综述 第2 期 163特征. 如果差异性为 0, 则 X 与 Y 是不可区分的. 算法 Relief [2] 及其变种 ReliefF [15] , 分支定界 和 BFF [16] 等都 是基于距离度量的. 准则函数要求满足单调性, 也可 通过引进近似单调的概念放松单调性的标准. 蔡哲元 等人[17] 提出了基于核空间的距离度量, 有效地提高了 小样本与线性不可分数据集上的特征选择能力. 4.1.2 信息度量信息度量通常采用信息增益 (IG) 或互信息 (MI) 衡量. 信息增益定义为先验不确定性与期望的后验不 确定性之间的差异, 它能有效地选出关键特征, 剔除 无关特征[18] . 互信息描述的是两个随机变量之间相 互依存关系的强弱. 信息度量函数 (f ) 在 Filter 特征 选择方法中起着重要的作用. 尽管 (f ) 有多种不同 形式, 但是目的是相同的, 即使得所选择的特征子集 与类别的相关性最大, 子集中特征之间的相关性最小. 刘华文[19] 给出了一种泛化的信息标准, 准则如下:互信息的评价准则, 具体函数如下:1 ∑(f ) = (C ; f ) −(; f ), (4)∣∣s ∈S 其中 ∣∣ 表示已选特征的个数. 该算法的思想就是最 大化特征子集和类别的相关性, 同时最小化特征之间 的冗余. Peng 用这种方法将多变量联合概率密度估计 问题转化为多重二变量概率密度估计, 解决了一大难 题. Ding 等人[23] 还给出了此算法的一种变种形式, 将 准则函数中的减法改为除法, 即(C ; f )(f ) = .(5)1 ∑ s ∈S (; f )∣∣4) FCBF (fast correlation-based filter)[6] 是基于相 互关系度量给出的一种算法. 对于线性随机变量, 用 相关系数分析特征与类别、特征间的相互关系. 对于 非线性随机变量, 采用对称不确定性 (SU) 来度量. 对 于两个非线性随机变量 X 和 Y , 它们的相互关系可表 示为(f ) = α ⋅ (, , ) − . (1) [ (X ∣Y )]其中: C 为类别, f 为候选特征, 为已选择的特征, 函数 (, , ) 为 , , 之间的信息量; α 为调控系数,δ 为惩罚因子. 下面就此信息标准的泛化形式与几个 现有选择算法中的信息度量标准之间的关系进行讨 论:1) BIF (best individual feature)[20] 是一种最简单最 直接的特征选择方法. 它的评价函数为B (, Y ) = 2 .(6) (X ) + (Y ) 其中: (X ) 与 (Y ) 为信息熵, (X ∣Y ) 为信息增益. 该算法的基本思想是根据所定义的 C - 相关 (特征与类别的相互关系) 和 - 相关 (特征之间的相互关 系), 从原始特征集合中去除 C - 相关值小于给定阈值 的特征, 再对剩余的特征进行冗余分析.5) CMIM (conditional mutual information maxi-mization). 有些特征选择方法利用条件互信息来评价特征的重要性程度, 即在已知已选特征集 的情况下通过候选特征 f 与类别 C 的依赖程度来确定 f 的重要性, 其中条件互信息 (C ; f ∣) 值越大, f 能提供的新信息越多. 因为 (C ; f ∣) 计算费用较高, 且样本的多维性导致了其估值不准确, Fleuret [24] 在提出的条件互信息最大化选择算法 CMIM 中采取一种变 通的方式, 即使用单个已选特征 来代替整个已选子集 以估算 (C ; f ∣), 其中 是使 (C ; f ∣) 值最大的 已选特征. CMIM 的评价函数为(2) (f ) = (C ; f ),其中 ( ) 为互信息, (C ; f ) 为类别 C 与候选特征 f 之间的互信息. 它的基本思想是对于每一个候选特征 f 计算评价函数 (f ), 并按评价函数值降序排列, 取 前 k 个作为所选择的特征子集. 这种方法简单快速, 尤其适合于高维数据. 但是它没有考虑到所选特征间 的相关性, 会带来较大的冗余.2) MIFS (mutual information feature selection) 为 基于上述算法的缺点, 由 Battiti [21] 给出的一种使用候 选特征 f 与单个已选特征 相关性对 f 进行惩罚的方 法, 其评价函数为(f ) = arg min (C ; f∣).(7) s ∈S(f ) = (C ; f ) − β ∑(;(3)除以上几种信息度量和算法外, 针对存在的问 题, 研究者们提出了新的评价函数和算法. Kwak 等 人[25] 指出 MIFS 算法中评价函数 ( ) 的惩罚因子并 不能准确地表达冗余程度的增长量, 给出了 MIFS- U (MIFS-uncertainty) 算法; 与 MIFS 算法类似, MIFS- U 算法中参数 β 的取值将影响选择算法的性能. 为 了解决该问题, Novovicova 等人[26] 提出了 MIFS-U 的 一种改进算法 mMIFS-U (modified version of MIFS-U), 算法中将 f 与 中单个已选特征相关程度最大的 作 为它们之间的冗余程度; 为了解决对称不确定性可能s ∈S其中 β 为调节系数, 当 β ∈ [0.5, 1] 时, 算法性能较好. 3) mRMR (minimal-redundancy and maximal-relevance) [22] 方法. 从理论上分析了 mRMR 等价于 最大依赖性, 并分析了三者的关系. 基于最大依赖性, 可通过计算不同特征子集与类别的互信息来选取最 优子集. 但是, 在高维空间中, 估计多维概率密度是一 个难点. 另一个缺点是计算速度非常慢. 所以本文从 与其等价的最小冗余和最大相关出发, 给出一种基于提供一些错误或不确定信息, Qu 等人[27] 利用决策依赖相关性来精确度量特征f与间的依赖程度, 提出了DDC (decision dependent correlation) 算法. 它们的思想都是一致的, 只是评价函数的表达形式不同. 刘华文[19] 还提出了一种基于动态互信息的特征选择方法. 随着已选特征数的增加, 类别的不确定性也逐渐降低, 无法识别的样本数也越来越少. 因此, 已识别的样本会给特征带来干扰信息, 可采用动态互信息作为评价标准, 在特征选择过程中不断地删除已识别的样本, 使得评价标准在未识别样本上动态估值.基于信息的度量是近年来的一个研究热点, 出现了大量基于信息熵的特征选择方法, 如文献[28-31] 等. 因为信息熵理论不要求假定数据分布是已知的, 能够以量化的形式度量特征间的不确定程度, 并且能有效地度量特征间的非线性关系. 因此, 信息度量被广泛应用, 并且也通过试验证明了其性能[32-34] . 
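As a concrete illustration of the mRMR criterion in Eq. (4) above (maximal relevance of a candidate feature to the class, minimal mean redundancy with the already-selected features), the following is a minimal greedy sketch. It assumes discrete features and uses scikit-learn's mutual_info_score as the mutual information estimator; the function and variable names are mine, and this is not the reference implementation from [22].

```python
# Sketch of greedy mRMR-style selection: at each round pick the candidate f
# maximizing  I(C; f) - (1/|S|) * sum_{s in S} I(s; f).
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_select(X, c, k):
    """X: (n_samples, n_features) discrete feature matrix; c: class labels.
    Returns indices of k features chosen greedily by the mRMR criterion."""
    remaining = list(range(X.shape[1]))
    selected = []
    while remaining and len(selected) < k:
        def score(f):
            relevance = mutual_info_score(c, X[:, f])
            if not selected:
                return relevance
            redundancy = np.mean([mutual_info_score(X[:, s], X[:, f])
                                  for s in selected])
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The divisive variant mentioned in Eq. (5) would simply replace the subtraction in score() with a division by the mean redundancy.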
以上基于信息度量的评价准则虽然形式不同, 但是核心思想都是找出与类别相关性最大的特征子集, 并且该子集中特征之间的相关性最小. 设计体现这一思想的函数是至关重要的.4.1.3 依赖性度量有许多统计相关系数, 如Pearson 相关系数、概率误差、Fisher 分数、线性可判定分析、最小平方回归误差[35] 、平方关联系数[36] 、-test 和F-Statistic 等被用来表达特征相对于类别可分离性间的重要性程度. 例如, Ding[23] 和Peng[22] 在mRMR 中处理连续特征时, 分别使用F-Statistic 和Pearson 相关系数度量特征与类别和已选特征间的相关性程度. Hall[37] 给出一种既考虑了特征的类区分能力, 同时又考虑特征间冗余性的相关性度量标准. Zhang 等人[38] 使用成对约束即must-link 约束和cannot-link 约束计算特征的权重, 其中must-link 约束表示两个样本离得很近, 而cannot-link 表示样本间离得足够远.在依赖性度量中, Hilbert-Schmidt 依赖性准则(HSIC) 可作为一个评价准则度量特征与类别的相关性. 核心思想是一个好的特征应该最大化这个相关性. 特征选择问题可以看成组合最优化问题性准则用不一致率来度量, 它不是最大化类的可分离性, 而是试图保留原始特征的辨识能力, 即找到与全集有同样区分类别能力的最小子集. 它具有单调、快速、去除冗余和不相关特征、处理噪声等优点, 能获得一个较小的特征子集. 但其对噪声数据敏感, 且只适合离散特征. 典型算法有Focus[41] , LVF[42] 等. 文献[43-44] 给出了基于不一致度量的算法.上面分析了Filter 方法中的一些准则函数, 选择合适的准则函数将会得到较好的分类结果. 但Filter 方法也存在很多问题: 它并不能保证选择出一个优化特征子集, 尤其是当特征和分类器息息相关时. 因而, 即使能找到一个满足条件的优化子集, 它的规模也会比较庞大, 会包含一些明显的噪声特征. 但是它的一个明显优势在于可以很快地排除很大数量的非关键性的噪声特征, 缩小优化特征子集搜索的规模, 计算效率高, 通用性好, 可用作特征的预筛选器.4.2 封装式(Wrapper) 评价策略的特征选择方法除了上述4 种准则, 分类错误率也是一种衡量所选特征子集优劣的度量标准. Wrapper 模型将特征选择算法作为学习算法的一个组成部分, 并且直接使用分类性能作为特征重要性程度的评价标准. 它的依据是选择子集最终被用于构造分类模型. 因此, 若在构造分类模型时, 直接采用那些能取得较高分类性能的特征即可, 从而获得一个分类性能较高的分类模型. 该方法在速度上要比Filter 方法慢, 但是它所选择的优化特征子集的规模相对要小得多, 非常有利于关键特征的辨识; 同时它的准确率比较高, 但泛化能力比较差, 时间复杂度较高. 目前此类方法是特征选择研究领域的热点, 相关文献也很多. 例如, Hsu 等人[45] 用决策树来进行特征选择, 采用遗传算法来寻找使得决策树分类错误率最小的一组特征子集. Chiang 等人[46] 将Fisher 判别分析与遗传算法相结合, 用来在化工故障过程中辨识关键变量, 取得了不错的效果. Guyon 等人[47] 使用支持向量机的分类性能衡量特征的重要性程度, 并最终构造一个分类性能较高的分类器. Krzysztof [48] 提出了一种基于相互关系的双重策略的Wrapper 特征选择方法. 叶吉祥等人[49] 提出了一种快速的Wrapper 特征选择方法FFSR(fast feature subset ranking), 以特征子集作为评价单位, 以子集收敛能力作为评价标准. 戴平等人[50] 利用SVM 线性核与多项式核函数的特性, 结合二进制PSO 方法, 提出了一种基于SVM 的快速特征选择方法.综上所述, Filter 和Wrapper 特征选择方法各有优缺点. 将启发式搜索策略和分类器性能评价准则相结合来评价所选的特征, 相对于使用随机搜索策略的方法, 节约了不少时间. Filter 和Wrapper 是两种(8)= arg max (), s.t.∣∣⩽.⊆F其中: 为所选特征个数的上限, 为特征集合,为已选特征的集合, () 为评价准则. 从式(8) 中可知需要解决两个问题: 一是评价准则() 的选择; 二是算法的选择. 文献[39-40] 是HSIC 准则的具体应用.4.1.4 一致性度量给定两个样本, 若他们特征值均相同, 但所属类别不同, 则称它们是不一致的; 否则, 是一致的. 一致姚 旭 等: 特征选择方法综述第2 期 165互补的模式, 两者可以结合. 混合特征选择过程一般 由两个阶段组成, 首先使用 Filter 方法初步剔除大部 分无关或噪声特征, 只保留少量特征, 从而有效地减 小后续搜索过程的规模. 第 2 阶段将剩余的特征连 同样本数据作为输入参数传递给 Wrapper 选择方法, 以进一步优化选择重要的特征. 例如, 文献 [51] 采用 混合模型选择特征子集, 先使用互信息度量标准和 bootstrap 技术获取前 k 个重要的特征, 然后再使用支 持向量机构造分类器.292.Manoranjan Dash, Huan Liu. Feature selection forclassification[J]. Intelligent Data Analysis, 1997, 1(3): 131-156.Lei Y u, Huan Liu. Efficient feature selection via analysisof relevance and redundancy[J]. J of Machine Learnin gResearch, 2004, 5(1): 1205-1224.Liu H, Motoda H. Feature selection for knowledgediscovery and data mining[M]. Boston: Kluwer AcademicPublishers, 1998.Molina L C, Llu´ıs Belanche, A` ngela Nebot. Feature [5] [6] [7] 结论5 [8] 本文首先分析了特征选择的框架, 然后从两个角度对特征选择方法进行分类: 一个是搜索策略, 一个 是评价准则. 特征选择方法从研究之初到现在, 已经 有了很多成熟的方法, 但是, 研究过程中也存在很多 问题. 例如: 如何解决高维特征选择问题; 如何设计小 样本问题的特征选择方法; 如何针对不同问题设计特 定的特征选择方法; 研究针对新数据类型的特征选 择方法等. 影响特征选择方法的因素主要有数据类 型、样本数量. 针对两类还是多类问题, 特征选择方 法的选择也有不同. 例如 Koll-Saha [4] 和 Focus 等人[41] 受限于连续型特征; 分支定界, BFF [16] 和 MDLM(min description length method)[52] 等 不 支 持 布 尔 型 特 征;Relief 系 列 算 法, DTM(decision tree method)[53]和 PRESET [54] 都适合于大数据集; Focus 等人[41] 适用于 小样本; 在度量标准的选择中, 只有一致性度量仅适 用于离散型数据等等.尽管特征选择方法已经有很多, 但针对解决实 际问题的方法还存在很多不足, 如何针对特定问题 给出有效的方法仍是一个需要进一步解决的问题. 将 Filter 方法和 Wrapper 方法两者结合, 根据特定的环境 选择所需要的度量准则和分类器是一个值得研究的 方向.selection algorithms: A survey and experimentalevaluation[R]. Barcelona:Catalunya, 2002.Universitat Politecnicade[9] Sun Z H, George Bebis, Ronald Miller. Object detectionusing feature subset selection[J]. 
Pattern Recognition, 2004, 37(11): 2165-2176.Narendra P M, Fukunaga K. A branch and bound algorithmfor feature selection[J]. IEEE Trans on Computers, 1977, 26(9): 917-922.Tsymbal A, Seppo P, David W P. Ensemble featureselection with the simple Bayesian classification[J].Information Fusion, 2003, 4(2): 87-100.Wu B L, Tom A, David F, et al. Comparison of statisticalmethods for classification of ovarian cancer using massspectrometry data[J]. Bioinformatics, 2003, 19(13): 1636- 1643.Furlanello C, Serafini M, Merler S, et al. An acceleratedprocedure for recursive feature ranking on microarraydata[J]. Neural Networks, 2003, 16(4): 641-648.Langley P. Selection of relevant features in machinelearning[C]. Proc of the AAAI Fall Symposium on Relevance. New Orleans, 1994: 1-5. [10] [11] [12] [13] [14] [15] Kononenko I. Estimation attributes:Analysis andextensions of RELIEF[C]. Proc of the 1994 European Conf on Machine Learning. New Brunswick, 1994: 171-182.Xu L, Y an P, Chang T. Best first strategy for featureselection[C]. Proc of 9th Int Conf on Pattern Recognition.Rome, 1988: 706-708.蔡哲元, 余建国, 李先鹏, 等. 基于核空间距离测度的特征选择[J]. 模式识别与人工智能, 2010, 23(2): 235-240.(Cai Z Y , Y u J G, Li X P, et al. Feature selection algorithm based on kernel distance measure[J]. Pattern Recognition and Artificial Intelligence, 2010, 23(2): 235-240.) 徐燕, 李锦涛, 王斌, 等. 基于区分类别能力的高性能特 征选择方法[J]. 软件学报, 2008, 19(1): 82-89.(Xu Y , Li J T, Wang B, et al. A category resolve power- based feature selection method[J]. J of Software, 2008, 19(1): 82-89.)参考文献(References )边肇祺, 张学工. 模式识别[M]. 第 2 版. 北京: 清华大学出版社, 2000.(Bian Z Q, Zhang X G. Pattern recognition[M]. 2nd ed. Beijing: Tsinghua University Publisher, 2000.)Kira K, Rendell L A . The feature selection problem:Traditional methods and a new algorithm[C]. Proc of the9th National Conf on Artificial Intelligence. Menlo Park, 1992: 129-134.John G H, Kohavi R, Pfleger K. Irrelevant features and thesubset selection problem[C]. Proc of the 11th Int Conf onMachine Learning. New Brunswick, 1994: 121-129. Koller D, Sahami M. Toward optimal feature selection[C].Proc of Int Conf on Machine Learning. Bari, 1996: 284-[1] [16] [17] [2] [3] [18][4]刘华文. 基于信息熵的特征选择算法研究[D]. 长春: 吉林大学, 2010.(Liu Hua-wen. A study on feature selection algorithm using 孟洋, 赵方. 基于信息熵理论的动态规划特征选取算法[J]. 计算机工程与设计, 2010, 31(17): 3879-3881.(Meng Y , Zhao F. Feature selection algorithm based on dynamic programming and comentropy[J]. Computer Engineering and Design, 2010, 31(17): 3879-3881.) Forman G. An extensive empirical study of feature selection metrics for text classification[J]. J of MachineLearning Research, 2003, 3(11): 1289-1305.Liu H, Liu L, Zhang H. Feature selection using mutualinformation: An experimental study[C]. Proc of the 10thPacific Rim Int Conf on Artificial Intelligence. Las V egas, 2008: 235-246.Hua J, Waibhav D T, Edward R D. Performance of feature-selection methods in the classification of high-dimensiondata[J]. Pattern Recognition, 2009, 42(7): 409-424.Mitra P, Murthy C A, Sankar K P. Unsupervised featureselection using feature similarity[J]. IEEE Trans on PatternAnalysis and Machine Intelligence, 2002, 24(3): 301-312.Wei H-L, Billings S A. Feature subset selection and rankin gfor data dim ensionality reduction[J]. IEEE Trans on PatternAnalysis and Machine Intelligence, 2007, 29(1): 162-166.Hall M A. Correlation-based feature subset selection formachine learning[M]. Hamilton: University of Waikato,1999.Zhang D, Chen S, Zhou Z-H. 
Constraint score: A new filtermethod for feature selection with pairwise constraints[J].Pattern Recognition, 2008, 41(5): 1440-1451.Le Song, Alex Smola, Arthur Gretton, et al. Supervisedfeature selection via dependence estimation[C]. Proc of the24th Int Conf on Machine Learning. Corvallis, 2007: 245- 252.Gustavo Camps-V alls, Joris Mooij, Bernhard Scholkopf.Remote sensing feature selection by kernel dependencemeasures[J]. IEEE Geoscience and Remote Sensin gLetters, 2010, 7(3): 587-591.Almuallim H, Dietterich T G. Learning with manyirrelevant features[C]. Proc of 9th National Conf onArtificial Intelligence. Menlo Park, 1992: 547-552.Liu H, Setiono R. A probabilistic approach to featureselection –A filter solution[C]. Proc of Int Conf on MachineLearning. Bari, 1996: 319-327.Manoranjan Dash, Huan Liu. Consistency-based search infeature selection[J]. Artificial Intelligence, 2003, 151(16):155-176.Huan Liu, Hiroshi Motoda, Manoranjan Dash. Amonotonic measure for optimal feature selection[M].Machine Learning: ECML-98, Lecture Notes in ComputerScience, 1998: 101-106.(下转第192页)[19] [31] information entropy[D]. Changchun: 2010.)Jain A K, Robert P W, Mao J C. Jilin University, [20] Statistical pattern[32] recognition: A review[J]. IEEE Trans on Pattern Analysis and Machine Intelligence, 2000, 22(1): 4-37.Battiti R. Using mutual information for selecting featuresin supervised neural net learning[J]. IEEE Trans on Neural Networks, 1994, 5(4): 537-550.Hanchuan Peng, Fuhui Long, Chris Ding. Feature selectionbased on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy[J]. IEEE Trans on Pattern Analysis and Machine Intelligence, 2005, 27(8): 1226-1238.Ding C, Peng H. Minimum redundancy feature selectionfrom microarray gene expression data[J]. J of Bioinformatics and Computational Biology, 2005, 3(2): 185-205.Francois Fleuret. Fast binary feature selection withconditional mutual information[J]. J of Machine Learnin g Research, 2004, 5(10): 1531-1555.Kwak N, Choi C-H. Input feature selection by mutualinformation based on Parzen window[J]. IEEE Trans on Pattern Analysis and Machine Intelligence, 2002, 24(12): 1667-1671.Novovicova J, Petr S, Michal H, et al. Conditional mutualinformation based feature selection for classification task[C]. Proc of the 12th Iberoamericann Congress on Pattern Recognition. V alparaiso, 2007: 417-426.Qu G, Hariri S, Y ousif M. A new dependency andcorrelation analysis for features[J]. IEEE Trans onKnowledge and Data Engineering, 2005, 17(9): 1199- 1207.赵军阳, 张志利. 基于模糊粗糙集信息熵的蚁群特征选择方法[J]. 计算机应用, 2009, 29(1): 109-111.(Zhao J Y , Zhang Z L. Ant colony feature selection based on fuzzy rough set information entropy[J]. J of Computer Applications, 2009, 29(1): 109-111.)赵军阳, 张志利. 基于最大互信息最大相关熵的特征选 择方法[J]. 计算机应用研究, 2009, 26(1): 233-235.(Zhao J Y , Zhang Z L. Feature subset selection based on maxmutual information and max correlation entropy[J]. Application Research of Computers, 2009, 26(1): 233- 235.)渠小洁. 一种基于条件熵的特征选择算法[J]. 太原科技大学学报, 2010, 31(5): 413-416.(Qu X J. An algorithm of feature selection based on conditional entropy[J]. J of Taiyuan University of Science and Technology, 2010, 31(5): 413-416.)[21] [33] [22] [34] [35] [23] [36] [24] [37] [25] [38] [26] [39] [27] [40] [28] [41] [42] [29] [43] [44] [30]。
Common English vocabulary in machine learning and artificial intelligence
机器学习与人工智能领域中常用的英语词汇1.General Concepts (基础概念)•Artificial Intelligence (AI) - 人工智能1)Artificial Intelligence (AI) - 人工智能2)Machine Learning (ML) - 机器学习3)Deep Learning (DL) - 深度学习4)Neural Network - 神经网络5)Natural Language Processing (NLP) - 自然语言处理6)Computer Vision - 计算机视觉7)Robotics - 机器人技术8)Speech Recognition - 语音识别9)Expert Systems - 专家系统10)Knowledge Representation - 知识表示11)Pattern Recognition - 模式识别12)Cognitive Computing - 认知计算13)Autonomous Systems - 自主系统14)Human-Machine Interaction - 人机交互15)Intelligent Agents - 智能代理16)Machine Translation - 机器翻译17)Swarm Intelligence - 群体智能18)Genetic Algorithms - 遗传算法19)Fuzzy Logic - 模糊逻辑20)Reinforcement Learning - 强化学习•Machine Learning (ML) - 机器学习1)Machine Learning (ML) - 机器学习2)Artificial Neural Network - 人工神经网络3)Deep Learning - 深度学习4)Supervised Learning - 有监督学习5)Unsupervised Learning - 无监督学习6)Reinforcement Learning - 强化学习7)Semi-Supervised Learning - 半监督学习8)Training Data - 训练数据9)Test Data - 测试数据10)Validation Data - 验证数据11)Feature - 特征12)Label - 标签13)Model - 模型14)Algorithm - 算法15)Regression - 回归16)Classification - 分类17)Clustering - 聚类18)Dimensionality Reduction - 降维19)Overfitting - 过拟合20)Underfitting - 欠拟合•Deep Learning (DL) - 深度学习1)Deep Learning - 深度学习2)Neural Network - 神经网络3)Artificial Neural Network (ANN) - 人工神经网络4)Convolutional Neural Network (CNN) - 卷积神经网络5)Recurrent Neural Network (RNN) - 循环神经网络6)Long Short-Term Memory (LSTM) - 长短期记忆网络7)Gated Recurrent Unit (GRU) - 门控循环单元8)Autoencoder - 自编码器9)Generative Adversarial Network (GAN) - 生成对抗网络10)Transfer Learning - 迁移学习11)Pre-trained Model - 预训练模型12)Fine-tuning - 微调13)Feature Extraction - 特征提取14)Activation Function - 激活函数15)Loss Function - 损失函数16)Gradient Descent - 梯度下降17)Backpropagation - 反向传播18)Epoch - 训练周期19)Batch Size - 批量大小20)Dropout - 丢弃法•Neural Network - 神经网络1)Neural Network - 神经网络2)Artificial Neural Network (ANN) - 人工神经网络3)Deep Neural Network (DNN) - 深度神经网络4)Convolutional Neural Network (CNN) - 卷积神经网络5)Recurrent Neural Network (RNN) - 循环神经网络6)Long Short-Term Memory (LSTM) - 长短期记忆网络7)Gated Recurrent Unit (GRU) - 门控循环单元8)Feedforward Neural Network - 前馈神经网络9)Multi-layer Perceptron (MLP) - 多层感知器10)Radial Basis Function Network (RBFN) - 径向基函数网络11)Hopfield Network - 霍普菲尔德网络12)Boltzmann Machine - 玻尔兹曼机13)Autoencoder - 自编码器14)Spiking Neural Network (SNN) - 脉冲神经网络15)Self-organizing Map (SOM) - 自组织映射16)Restricted Boltzmann Machine (RBM) - 受限玻尔兹曼机17)Hebbian Learning - 海比安学习18)Competitive Learning - 竞争学习19)Neuroevolutionary - 神经进化20)Neuron - 神经元•Algorithm - 算法1)Algorithm - 算法2)Supervised Learning Algorithm - 有监督学习算法3)Unsupervised Learning Algorithm - 无监督学习算法4)Reinforcement Learning Algorithm - 强化学习算法5)Classification Algorithm - 分类算法6)Regression Algorithm - 回归算法7)Clustering Algorithm - 聚类算法8)Dimensionality Reduction Algorithm - 降维算法9)Decision Tree Algorithm - 决策树算法10)Random Forest Algorithm - 随机森林算法11)Support Vector Machine (SVM) Algorithm - 支持向量机算法12)K-Nearest Neighbors (KNN) Algorithm - K近邻算法13)Naive Bayes Algorithm - 朴素贝叶斯算法14)Gradient Descent Algorithm - 梯度下降算法15)Genetic Algorithm - 遗传算法16)Neural Network Algorithm - 神经网络算法17)Deep Learning Algorithm - 深度学习算法18)Ensemble Learning Algorithm - 集成学习算法19)Reinforcement Learning Algorithm - 强化学习算法20)Metaheuristic Algorithm - 元启发式算法•Model - 模型1)Model - 模型2)Machine Learning Model - 机器学习模型3)Artificial Intelligence Model - 人工智能模型4)Predictive Model - 预测模型5)Classification Model - 分类模型6)Regression Model - 回归模型7)Generative Model - 生成模型8)Discriminative Model - 判别模型9)Probabilistic Model - 概率模型10)Statistical Model - 统计模型11)Neural Network Model - 神经网络模型12)Deep Learning Model - 
深度学习模型13)Ensemble Model - 集成模型14)Reinforcement Learning Model - 强化学习模型15)Support Vector Machine (SVM) Model - 支持向量机模型16)Decision Tree Model - 决策树模型17)Random Forest Model - 随机森林模型18)Naive Bayes Model - 朴素贝叶斯模型19)Autoencoder Model - 自编码器模型20)Convolutional Neural Network (CNN) Model - 卷积神经网络模型•Dataset - 数据集1)Dataset - 数据集2)Training Dataset - 训练数据集3)Test Dataset - 测试数据集4)Validation Dataset - 验证数据集5)Balanced Dataset - 平衡数据集6)Imbalanced Dataset - 不平衡数据集7)Synthetic Dataset - 合成数据集8)Benchmark Dataset - 基准数据集9)Open Dataset - 开放数据集10)Labeled Dataset - 标记数据集11)Unlabeled Dataset - 未标记数据集12)Semi-Supervised Dataset - 半监督数据集13)Multiclass Dataset - 多分类数据集14)Feature Set - 特征集15)Data Augmentation - 数据增强16)Data Preprocessing - 数据预处理17)Missing Data - 缺失数据18)Outlier Detection - 异常值检测19)Data Imputation - 数据插补20)Metadata - 元数据•Training - 训练1)Training - 训练2)Training Data - 训练数据3)Training Phase - 训练阶段4)Training Set - 训练集5)Training Examples - 训练样本6)Training Instance - 训练实例7)Training Algorithm - 训练算法8)Training Model - 训练模型9)Training Process - 训练过程10)Training Loss - 训练损失11)Training Epoch - 训练周期12)Training Batch - 训练批次13)Online Training - 在线训练14)Offline Training - 离线训练15)Continuous Training - 连续训练16)Transfer Learning - 迁移学习17)Fine-Tuning - 微调18)Curriculum Learning - 课程学习19)Self-Supervised Learning - 自监督学习20)Active Learning - 主动学习•Testing - 测试1)Testing - 测试2)Test Data - 测试数据3)Test Set - 测试集4)Test Examples - 测试样本5)Test Instance - 测试实例6)Test Phase - 测试阶段7)Test Accuracy - 测试准确率8)Test Loss - 测试损失9)Test Error - 测试错误10)Test Metrics - 测试指标11)Test Suite - 测试套件12)Test Case - 测试用例13)Test Coverage - 测试覆盖率14)Cross-Validation - 交叉验证15)Holdout Validation - 留出验证16)K-Fold Cross-Validation - K折交叉验证17)Stratified Cross-Validation - 分层交叉验证18)Test Driven Development (TDD) - 测试驱动开发19)A/B Testing - A/B 测试20)Model Evaluation - 模型评估•Validation - 验证1)Validation - 验证2)Validation Data - 验证数据3)Validation Set - 验证集4)Validation Examples - 验证样本5)Validation Instance - 验证实例6)Validation Phase - 验证阶段7)Validation Accuracy - 验证准确率8)Validation Loss - 验证损失9)Validation Error - 验证错误10)Validation Metrics - 验证指标11)Cross-Validation - 交叉验证12)Holdout Validation - 留出验证13)K-Fold Cross-Validation - K折交叉验证14)Stratified Cross-Validation - 分层交叉验证15)Leave-One-Out Cross-Validation - 留一法交叉验证16)Validation Curve - 验证曲线17)Hyperparameter Validation - 超参数验证18)Model Validation - 模型验证19)Early Stopping - 提前停止20)Validation Strategy - 验证策略•Supervised Learning - 有监督学习1)Supervised Learning - 有监督学习2)Label - 标签3)Feature - 特征4)Target - 目标5)Training Labels - 训练标签6)Training Features - 训练特征7)Training Targets - 训练目标8)Training Examples - 训练样本9)Training Instance - 训练实例10)Regression - 回归11)Classification - 分类12)Predictor - 预测器13)Regression Model - 回归模型14)Classifier - 分类器15)Decision Tree - 决策树16)Support Vector Machine (SVM) - 支持向量机17)Neural Network - 神经网络18)Feature Engineering - 特征工程19)Model Evaluation - 模型评估20)Overfitting - 过拟合21)Underfitting - 欠拟合22)Bias-Variance Tradeoff - 偏差-方差权衡•Unsupervised Learning - 无监督学习1)Unsupervised Learning - 无监督学习2)Clustering - 聚类3)Dimensionality Reduction - 降维4)Anomaly Detection - 异常检测5)Association Rule Learning - 关联规则学习6)Feature Extraction - 特征提取7)Feature Selection - 特征选择8)K-Means - K均值9)Hierarchical Clustering - 层次聚类10)Density-Based Clustering - 基于密度的聚类11)Principal Component Analysis (PCA) - 主成分分析12)Independent Component Analysis (ICA) - 独立成分分析13)T-distributed Stochastic Neighbor Embedding (t-SNE) - t分布随机邻居嵌入14)Gaussian Mixture Model (GMM) - 高斯混合模型15)Self-Organizing Maps (SOM) - 自组织映射16)Autoencoder - 自动编码器17)Latent Variable - 潜变量18)Data Preprocessing - 
数据预处理19)Outlier Detection - 异常值检测20)Clustering Algorithm - 聚类算法•Reinforcement Learning - 强化学习1)Reinforcement Learning - 强化学习2)Agent - 代理3)Environment - 环境4)State - 状态5)Action - 动作6)Reward - 奖励7)Policy - 策略8)Value Function - 值函数9)Q-Learning - Q学习10)Deep Q-Network (DQN) - 深度Q网络11)Policy Gradient - 策略梯度12)Actor-Critic - 演员-评论家13)Exploration - 探索14)Exploitation - 开发15)Temporal Difference (TD) - 时间差分16)Markov Decision Process (MDP) - 马尔可夫决策过程17)State-Action-Reward-State-Action (SARSA) - 状态-动作-奖励-状态-动作18)Policy Iteration - 策略迭代19)Value Iteration - 值迭代20)Monte Carlo Methods - 蒙特卡洛方法•Semi-Supervised Learning - 半监督学习1)Semi-Supervised Learning - 半监督学习2)Labeled Data - 有标签数据3)Unlabeled Data - 无标签数据4)Label Propagation - 标签传播5)Self-Training - 自训练6)Co-Training - 协同训练7)Transudative Learning - 传导学习8)Inductive Learning - 归纳学习9)Manifold Regularization - 流形正则化10)Graph-based Methods - 基于图的方法11)Cluster Assumption - 聚类假设12)Low-Density Separation - 低密度分离13)Semi-Supervised Support Vector Machines (S3VM) - 半监督支持向量机14)Expectation-Maximization (EM) - 期望最大化15)Co-EM - 协同期望最大化16)Entropy-Regularized EM - 熵正则化EM17)Mean Teacher - 平均教师18)Virtual Adversarial Training - 虚拟对抗训练19)Tri-training - 三重训练20)Mix Match - 混合匹配•Feature - 特征1)Feature - 特征2)Feature Engineering - 特征工程3)Feature Extraction - 特征提取4)Feature Selection - 特征选择5)Input Features - 输入特征6)Output Features - 输出特征7)Feature Vector - 特征向量8)Feature Space - 特征空间9)Feature Representation - 特征表示10)Feature Transformation - 特征转换11)Feature Importance - 特征重要性12)Feature Scaling - 特征缩放13)Feature Normalization - 特征归一化14)Feature Encoding - 特征编码15)Feature Fusion - 特征融合16)Feature Dimensionality Reduction - 特征维度减少17)Continuous Feature - 连续特征18)Categorical Feature - 分类特征19)Nominal Feature - 名义特征20)Ordinal Feature - 有序特征•Label - 标签1)Label - 标签2)Labeling - 标注3)Ground Truth - 地面真值4)Class Label - 类别标签5)Target Variable - 目标变量6)Labeling Scheme - 标注方案7)Multi-class Labeling - 多类别标注8)Binary Labeling - 二分类标注9)Label Noise - 标签噪声10)Labeling Error - 标注错误11)Label Propagation - 标签传播12)Unlabeled Data - 无标签数据13)Labeled Data - 有标签数据14)Semi-supervised Learning - 半监督学习15)Active Learning - 主动学习16)Weakly Supervised Learning - 弱监督学习17)Noisy Label Learning - 噪声标签学习18)Self-training - 自训练19)Crowdsourcing Labeling - 众包标注20)Label Smoothing - 标签平滑化•Prediction - 预测1)Prediction - 预测2)Forecasting - 预测3)Regression - 回归4)Classification - 分类5)Time Series Prediction - 时间序列预测6)Forecast Accuracy - 预测准确性7)Predictive Modeling - 预测建模8)Predictive Analytics - 预测分析9)Forecasting Method - 预测方法10)Predictive Performance - 预测性能11)Predictive Power - 预测能力12)Prediction Error - 预测误差13)Prediction Interval - 预测区间14)Prediction Model - 预测模型15)Predictive Uncertainty - 预测不确定性16)Forecast Horizon - 预测时间跨度17)Predictive Maintenance - 预测性维护18)Predictive Policing - 预测式警务19)Predictive Healthcare - 预测性医疗20)Predictive Maintenance - 预测性维护•Classification - 分类1)Classification - 分类2)Classifier - 分类器3)Class - 类别4)Classify - 对数据进行分类5)Class Label - 类别标签6)Binary Classification - 二元分类7)Multiclass Classification - 多类分类8)Class Probability - 类别概率9)Decision Boundary - 决策边界10)Decision Tree - 决策树11)Support Vector Machine (SVM) - 支持向量机12)K-Nearest Neighbors (KNN) - K最近邻算法13)Naive Bayes - 朴素贝叶斯14)Logistic Regression - 逻辑回归15)Random Forest - 随机森林16)Neural Network - 神经网络17)SoftMax Function - SoftMax函数18)One-vs-All (One-vs-Rest) - 一对多(一对剩余)19)Ensemble Learning - 集成学习20)Confusion Matrix - 混淆矩阵•Regression - 回归1)Regression Analysis - 回归分析2)Linear Regression - 线性回归3)Multiple Regression - 多元回归4)Polynomial Regression - 多项式回归5)Logistic Regression - 逻辑回归6)Ridge Regression - 
岭回归7)Lasso Regression - Lasso回归8)Elastic Net Regression - 弹性网络回归9)Regression Coefficients - 回归系数10)Residuals - 残差11)Ordinary Least Squares (OLS) - 普通最小二乘法12)Ridge Regression Coefficient - 岭回归系数13)Lasso Regression Coefficient - Lasso回归系数14)Elastic Net Regression Coefficient - 弹性网络回归系数15)Regression Line - 回归线16)Prediction Error - 预测误差17)Regression Model - 回归模型18)Nonlinear Regression - 非线性回归19)Generalized Linear Models (GLM) - 广义线性模型20)Coefficient of Determination (R-squared) - 决定系数21)F-test - F检验22)Homoscedasticity - 同方差性23)Heteroscedasticity - 异方差性24)Autocorrelation - 自相关25)Multicollinearity - 多重共线性26)Outliers - 异常值27)Cross-validation - 交叉验证28)Feature Selection - 特征选择29)Feature Engineering - 特征工程30)Regularization - 正则化2.Neural Networks and Deep Learning (神经网络与深度学习)•Convolutional Neural Network (CNN) - 卷积神经网络1)Convolutional Neural Network (CNN) - 卷积神经网络2)Convolution Layer - 卷积层3)Feature Map - 特征图4)Convolution Operation - 卷积操作5)Stride - 步幅6)Padding - 填充7)Pooling Layer - 池化层8)Max Pooling - 最大池化9)Average Pooling - 平均池化10)Fully Connected Layer - 全连接层11)Activation Function - 激活函数12)Rectified Linear Unit (ReLU) - 线性修正单元13)Dropout - 随机失活14)Batch Normalization - 批量归一化15)Transfer Learning - 迁移学习16)Fine-Tuning - 微调17)Image Classification - 图像分类18)Object Detection - 物体检测19)Semantic Segmentation - 语义分割20)Instance Segmentation - 实例分割21)Generative Adversarial Network (GAN) - 生成对抗网络22)Image Generation - 图像生成23)Style Transfer - 风格迁移24)Convolutional Autoencoder - 卷积自编码器25)Recurrent Neural Network (RNN) - 循环神经网络•Recurrent Neural Network (RNN) - 循环神经网络1)Recurrent Neural Network (RNN) - 循环神经网络2)Long Short-Term Memory (LSTM) - 长短期记忆网络3)Gated Recurrent Unit (GRU) - 门控循环单元4)Sequence Modeling - 序列建模5)Time Series Prediction - 时间序列预测6)Natural Language Processing (NLP) - 自然语言处理7)Text Generation - 文本生成8)Sentiment Analysis - 情感分析9)Named Entity Recognition (NER) - 命名实体识别10)Part-of-Speech Tagging (POS Tagging) - 词性标注11)Sequence-to-Sequence (Seq2Seq) - 序列到序列12)Attention Mechanism - 注意力机制13)Encoder-Decoder Architecture - 编码器-解码器架构14)Bidirectional RNN - 双向循环神经网络15)Teacher Forcing - 强制教师法16)Backpropagation Through Time (BPTT) - 通过时间的反向传播17)Vanishing Gradient Problem - 梯度消失问题18)Exploding Gradient Problem - 梯度爆炸问题19)Language Modeling - 语言建模20)Speech Recognition - 语音识别•Long Short-Term Memory (LSTM) - 长短期记忆网络1)Long Short-Term Memory (LSTM) - 长短期记忆网络2)Cell State - 细胞状态3)Hidden State - 隐藏状态4)Forget Gate - 遗忘门5)Input Gate - 输入门6)Output Gate - 输出门7)Peephole Connections - 窥视孔连接8)Gated Recurrent Unit (GRU) - 门控循环单元9)Vanishing Gradient Problem - 梯度消失问题10)Exploding Gradient Problem - 梯度爆炸问题11)Sequence Modeling - 序列建模12)Time Series Prediction - 时间序列预测13)Natural Language Processing (NLP) - 自然语言处理14)Text Generation - 文本生成15)Sentiment Analysis - 情感分析16)Named Entity Recognition (NER) - 命名实体识别17)Part-of-Speech Tagging (POS Tagging) - 词性标注18)Attention Mechanism - 注意力机制19)Encoder-Decoder Architecture - 编码器-解码器架构20)Bidirectional LSTM - 双向长短期记忆网络•Attention Mechanism - 注意力机制1)Attention Mechanism - 注意力机制2)Self-Attention - 自注意力3)Multi-Head Attention - 多头注意力4)Transformer - 变换器5)Query - 查询6)Key - 键7)Value - 值8)Query-Value Attention - 查询-值注意力9)Dot-Product Attention - 点积注意力10)Scaled Dot-Product Attention - 缩放点积注意力11)Additive Attention - 加性注意力12)Context Vector - 上下文向量13)Attention Score - 注意力分数14)SoftMax Function - SoftMax函数15)Attention Weight - 注意力权重16)Global Attention - 全局注意力17)Local Attention - 局部注意力18)Positional Encoding - 位置编码19)Encoder-Decoder Attention - 编码器-解码器注意力20)Cross-Modal Attention - 跨模态注意力•Generative Adversarial Network (GAN) - 
生成对抗网络1)Generative Adversarial Network (GAN) - 生成对抗网络2)Generator - 生成器3)Discriminator - 判别器4)Adversarial Training - 对抗训练5)Minimax Game - 极小极大博弈6)Nash Equilibrium - 纳什均衡7)Mode Collapse - 模式崩溃8)Training Stability - 训练稳定性9)Loss Function - 损失函数10)Discriminative Loss - 判别损失11)Generative Loss - 生成损失12)Wasserstein GAN (WGAN) - Wasserstein GAN(WGAN)13)Deep Convolutional GAN (DCGAN) - 深度卷积生成对抗网络(DCGAN)14)Conditional GAN (c GAN) - 条件生成对抗网络(c GAN)15)Style GAN - 风格生成对抗网络16)Cycle GAN - 循环生成对抗网络17)Progressive Growing GAN (PGGAN) - 渐进式增长生成对抗网络(PGGAN)18)Self-Attention GAN (SAGAN) - 自注意力生成对抗网络(SAGAN)19)Big GAN - 大规模生成对抗网络20)Adversarial Examples - 对抗样本•Encoder-Decoder - 编码器-解码器1)Encoder-Decoder Architecture - 编码器-解码器架构2)Encoder - 编码器3)Decoder - 解码器4)Sequence-to-Sequence Model (Seq2Seq) - 序列到序列模型5)State Vector - 状态向量6)Context Vector - 上下文向量7)Hidden State - 隐藏状态8)Attention Mechanism - 注意力机制9)Teacher Forcing - 强制教师法10)Beam Search - 束搜索11)Recurrent Neural Network (RNN) - 循环神经网络12)Long Short-Term Memory (LSTM) - 长短期记忆网络13)Gated Recurrent Unit (GRU) - 门控循环单元14)Bidirectional Encoder - 双向编码器15)Greedy Decoding - 贪婪解码16)Masking - 遮盖17)Dropout - 随机失活18)Embedding Layer - 嵌入层19)Cross-Entropy Loss - 交叉熵损失20)Tokenization - 令牌化•Transfer Learning - 迁移学习1)Transfer Learning - 迁移学习2)Source Domain - 源领域3)Target Domain - 目标领域4)Fine-Tuning - 微调5)Domain Adaptation - 领域自适应6)Pre-Trained Model - 预训练模型7)Feature Extraction - 特征提取8)Knowledge Transfer - 知识迁移9)Unsupervised Domain Adaptation - 无监督领域自适应10)Semi-Supervised Domain Adaptation - 半监督领域自适应11)Multi-Task Learning - 多任务学习12)Data Augmentation - 数据增强13)Task Transfer - 任务迁移14)Model Agnostic Meta-Learning (MAML) - 与模型无关的元学习(MAML)15)One-Shot Learning - 单样本学习16)Zero-Shot Learning - 零样本学习17)Few-Shot Learning - 少样本学习18)Knowledge Distillation - 知识蒸馏19)Representation Learning - 表征学习20)Adversarial Transfer Learning - 对抗迁移学习•Pre-trained Models - 预训练模型1)Pre-trained Model - 预训练模型2)Transfer Learning - 迁移学习3)Fine-Tuning - 微调4)Knowledge Transfer - 知识迁移5)Domain Adaptation - 领域自适应6)Feature Extraction - 特征提取7)Representation Learning - 表征学习8)Language Model - 语言模型9)Bidirectional Encoder Representations from Transformers (BERT) - 双向编码器结构转换器10)Generative Pre-trained Transformer (GPT) - 生成式预训练转换器11)Transformer-based Models - 基于转换器的模型12)Masked Language Model (MLM) - 掩蔽语言模型13)Cloze Task - 填空任务14)Tokenization - 令牌化15)Word Embeddings - 词嵌入16)Sentence Embeddings - 句子嵌入17)Contextual Embeddings - 上下文嵌入18)Self-Supervised Learning - 自监督学习19)Large-Scale Pre-trained Models - 大规模预训练模型•Loss Function - 损失函数1)Loss Function - 损失函数2)Mean Squared Error (MSE) - 均方误差3)Mean Absolute Error (MAE) - 平均绝对误差4)Cross-Entropy Loss - 交叉熵损失5)Binary Cross-Entropy Loss - 二元交叉熵损失6)Categorical Cross-Entropy Loss - 分类交叉熵损失7)Hinge Loss - 合页损失8)Huber Loss - Huber损失9)Wasserstein Distance - Wasserstein距离10)Triplet Loss - 三元组损失11)Contrastive Loss - 对比损失12)Dice Loss - Dice损失13)Focal Loss - 焦点损失14)GAN Loss - GAN损失15)Adversarial Loss - 对抗损失16)L1 Loss - L1损失17)L2 Loss - L2损失18)Huber Loss - Huber损失19)Quantile Loss - 分位数损失•Activation Function - 激活函数1)Activation Function - 激活函数2)Sigmoid Function - Sigmoid函数3)Hyperbolic Tangent Function (Tanh) - 双曲正切函数4)Rectified Linear Unit (Re LU) - 矩形线性单元5)Parametric Re LU (P Re LU) - 参数化Re LU6)Exponential Linear Unit (ELU) - 指数线性单元7)Swish Function - Swish函数8)Softplus Function - Soft plus函数9)Softmax Function - SoftMax函数10)Hard Tanh Function - 硬双曲正切函数11)Softsign Function - Softsign函数12)GELU (Gaussian Error Linear Unit) - GELU(高斯误差线性单元)13)Mish Function - Mish函数14)CELU (Continuous Exponential Linear Unit) - 
CELU(连续指数线性单元)15)Bent Identity Function - 弯曲恒等函数16)Gaussian Error Linear Units (GELUs) - 高斯误差线性单元17)Adaptive Piecewise Linear (APL) - 自适应分段线性函数18)Radial Basis Function (RBF) - 径向基函数•Backpropagation - 反向传播1)Backpropagation - 反向传播2)Gradient Descent - 梯度下降3)Partial Derivative - 偏导数4)Chain Rule - 链式法则5)Forward Pass - 前向传播6)Backward Pass - 反向传播7)Computational Graph - 计算图8)Neural Network - 神经网络9)Loss Function - 损失函数10)Gradient Calculation - 梯度计算11)Weight Update - 权重更新12)Activation Function - 激活函数13)Optimizer - 优化器14)Learning Rate - 学习率15)Mini-Batch Gradient Descent - 小批量梯度下降16)Stochastic Gradient Descent (SGD) - 随机梯度下降17)Batch Gradient Descent - 批量梯度下降18)Momentum - 动量19)Adam Optimizer - Adam优化器20)Learning Rate Decay - 学习率衰减•Gradient Descent - 梯度下降1)Gradient Descent - 梯度下降2)Stochastic Gradient Descent (SGD) - 随机梯度下降3)Mini-Batch Gradient Descent - 小批量梯度下降4)Batch Gradient Descent - 批量梯度下降5)Learning Rate - 学习率6)Momentum - 动量7)Adaptive Moment Estimation (Adam) - 自适应矩估计8)RMSprop - 均方根传播9)Learning Rate Schedule - 学习率调度10)Convergence - 收敛11)Divergence - 发散12)Adagrad - 自适应学习速率方法13)Adadelta - 自适应增量学习率方法14)Adamax - 自适应矩估计的扩展版本15)Nadam - Nesterov Accelerated Adaptive Moment Estimation16)Learning Rate Decay - 学习率衰减17)Step Size - 步长18)Conjugate Gradient Descent - 共轭梯度下降19)Line Search - 线搜索20)Newton's Method - 牛顿法•Learning Rate - 学习率1)Learning Rate - 学习率2)Adaptive Learning Rate - 自适应学习率3)Learning Rate Decay - 学习率衰减4)Initial Learning Rate - 初始学习率5)Step Size - 步长6)Momentum - 动量7)Exponential Decay - 指数衰减8)Annealing - 退火9)Cyclical Learning Rate - 循环学习率10)Learning Rate Schedule - 学习率调度11)Warm-up - 预热12)Learning Rate Policy - 学习率策略13)Learning Rate Annealing - 学习率退火14)Cosine Annealing - 余弦退火15)Gradient Clipping - 梯度裁剪16)Adapting Learning Rate - 适应学习率17)Learning Rate Multiplier - 学习率倍增器18)Learning Rate Reduction - 学习率降低19)Learning Rate Update - 学习率更新20)Scheduled Learning Rate - 定期学习率•Batch Size - 批量大小1)Batch Size - 批量大小2)Mini-Batch - 小批量3)Batch Gradient Descent - 批量梯度下降4)Stochastic Gradient Descent (SGD) - 随机梯度下降5)Mini-Batch Gradient Descent - 小批量梯度下降6)Online Learning - 在线学习7)Full-Batch - 全批量8)Data Batch - 数据批次9)Training Batch - 训练批次10)Batch Normalization - 批量归一化11)Batch-wise Optimization - 批量优化12)Batch Processing - 批量处理13)Batch Sampling - 批量采样14)Adaptive Batch Size - 自适应批量大小15)Batch Splitting - 批量分割16)Dynamic Batch Size - 动态批量大小17)Fixed Batch Size - 固定批量大小18)Batch-wise Inference - 批量推理19)Batch-wise Training - 批量训练20)Batch Shuffling - 批量洗牌•Epoch - 训练周期1)Training Epoch - 训练周期2)Epoch Size - 周期大小3)Early Stopping - 提前停止4)Validation Set - 验证集5)Training Set - 训练集6)Test Set - 测试集7)Overfitting - 过拟合8)Underfitting - 欠拟合9)Model Evaluation - 模型评估10)Model Selection - 模型选择11)Hyperparameter Tuning - 超参数调优12)Cross-Validation - 交叉验证13)K-fold Cross-Validation - K折交叉验证14)Stratified Cross-Validation - 分层交叉验证15)Leave-One-Out Cross-Validation (LOOCV) - 留一法交叉验证16)Grid Search - 网格搜索17)Random Search - 随机搜索18)Model Complexity - 模型复杂度19)Learning Curve - 学习曲线20)Convergence - 收敛3.Machine Learning Techniques and Algorithms (机器学习技术与算法)•Decision Tree - 决策树1)Decision Tree - 决策树2)Node - 节点3)Root Node - 根节点4)Leaf Node - 叶节点5)Internal Node - 内部节点6)Splitting Criterion - 分裂准则7)Gini Impurity - 基尼不纯度8)Entropy - 熵9)Information Gain - 信息增益10)Gain Ratio - 增益率11)Pruning - 剪枝12)Recursive Partitioning - 递归分割13)CART (Classification and Regression Trees) - 分类回归树14)ID3 (Iterative Dichotomiser 3) - 迭代二叉树315)C4.5 (successor of ID3) - C4.5(ID3的后继者)16)C5.0 (successor of C4.5) - C5.0(C4.5的后继者)17)Split Point - 分裂点18)Decision Boundary - 决策边界19)Pruned Tree - 
剪枝后的树20)Decision Tree Ensemble - 决策树集成•Random Forest - 随机森林1)Random Forest - 随机森林2)Ensemble Learning - 集成学习3)Bootstrap Sampling - 自助采样4)Bagging (Bootstrap Aggregating) - 装袋法5)Out-of-Bag (OOB) Error - 袋外误差6)Feature Subset - 特征子集7)Decision Tree - 决策树8)Base Estimator - 基础估计器9)Tree Depth - 树深度10)Randomization - 随机化11)Majority Voting - 多数投票12)Feature Importance - 特征重要性13)OOB Score - 袋外得分14)Forest Size - 森林大小15)Max Features - 最大特征数16)Min Samples Split - 最小分裂样本数17)Min Samples Leaf - 最小叶节点样本数18)Gini Impurity - 基尼不纯度19)Entropy - 熵20)Variable Importance - 变量重要性•Support Vector Machine (SVM) - 支持向量机1)Support Vector Machine (SVM) - 支持向量机2)Hyperplane - 超平面3)Kernel Trick - 核技巧4)Kernel Function - 核函数5)Margin - 间隔6)Support Vectors - 支持向量7)Decision Boundary - 决策边界8)Maximum Margin Classifier - 最大间隔分类器9)Soft Margin Classifier - 软间隔分类器10) C Parameter - C参数11)Radial Basis Function (RBF) Kernel - 径向基函数核12)Polynomial Kernel - 多项式核13)Linear Kernel - 线性核14)Quadratic Kernel - 二次核15)Gaussian Kernel - 高斯核16)Regularization - 正则化17)Dual Problem - 对偶问题18)Primal Problem - 原始问题19)Kernelized SVM - 核化支持向量机20)Multiclass SVM - 多类支持向量机•K-Nearest Neighbors (KNN) - K-最近邻1)K-Nearest Neighbors (KNN) - K-最近邻2)Nearest Neighbor - 最近邻3)Distance Metric - 距离度量4)Euclidean Distance - 欧氏距离5)Manhattan Distance - 曼哈顿距离6)Minkowski Distance - 闵可夫斯基距离7)Cosine Similarity - 余弦相似度8)K Value - K值9)Majority Voting - 多数投票10)Weighted KNN - 加权KNN11)Radius Neighbors - 半径邻居12)Ball Tree - 球树13)KD Tree - KD树14)Locality-Sensitive Hashing (LSH) - 局部敏感哈希15)Curse of Dimensionality - 维度灾难16)Class Label - 类标签17)Training Set - 训练集18)Test Set - 测试集19)Validation Set - 验证集20)Cross-Validation - 交叉验证•Naive Bayes - 朴素贝叶斯1)Naive Bayes - 朴素贝叶斯2)Bayes' Theorem - 贝叶斯定理3)Prior Probability - 先验概率4)Posterior Probability - 后验概率5)Likelihood - 似然6)Class Conditional Probability - 类条件概率7)Feature Independence Assumption - 特征独立假设8)Multinomial Naive Bayes - 多项式朴素贝叶斯9)Gaussian Naive Bayes - 高斯朴素贝叶斯10)Bernoulli Naive Bayes - 伯努利朴素贝叶斯11)Laplace Smoothing - 拉普拉斯平滑12)Add-One Smoothing - 加一平滑13)Maximum A Posteriori (MAP) - 最大后验概率14)Maximum Likelihood Estimation (MLE) - 最大似然估计15)Classification - 分类16)Feature Vectors - 特征向量17)Training Set - 训练集18)Test Set - 测试集19)Class Label - 类标签20)Confusion Matrix - 混淆矩阵•Clustering - 聚类1)Clustering - 聚类2)Centroid - 质心3)Cluster Analysis - 聚类分析4)Partitioning Clustering - 划分式聚类5)Hierarchical Clustering - 层次聚类6)Density-Based Clustering - 基于密度的聚类7)K-Means Clustering - K均值聚类8)K-Medoids Clustering - K中心点聚类9)DBSCAN (Density-Based Spatial Clustering of Applications with Noise) - 基于密度的空间聚类算法10)Agglomerative Clustering - 聚合式聚类11)Dendrogram - 系统树图12)Silhouette Score - 轮廓系数13)Elbow Method - 肘部法则14)Clustering Validation - 聚类验证15)Intra-cluster Distance - 类内距离16)Inter-cluster Distance - 类间距离17)Cluster Cohesion - 类内连贯性18)Cluster Separation - 类间分离度19)Cluster Assignment - 聚类分配20)Cluster Label - 聚类标签•K-Means - K-均值1)K-Means - K-均值2)Centroid - 质心3)Cluster - 聚类4)Cluster Center - 聚类中心5)Cluster Assignment - 聚类分配6)Cluster Analysis - 聚类分析7)K Value - K值8)Elbow Method - 肘部法则9)Inertia - 惯性10)Silhouette Score - 轮廓系数11)Convergence - 收敛12)Initialization - 初始化13)Euclidean Distance - 欧氏距离14)Manhattan Distance - 曼哈顿距离15)Distance Metric - 距离度量16)Cluster Radius - 聚类半径17)Within-Cluster Variation - 类内变异18)Cluster Quality - 聚类质量19)Clustering Algorithm - 聚类算法20)Clustering Validation - 聚类验证•Dimensionality Reduction - 降维1)Dimensionality Reduction - 降维2)Feature Extraction - 特征提取3)Feature Selection - 特征选择4)Principal Component Analysis (PCA) - 主成分分析5)Singular Value Decomposition (SVD) - 奇异值分解6)Linear 
Discriminant Analysis (LDA) - 线性判别分析7)t-Distributed Stochastic Neighbor Embedding (t-SNE) - t-分布随机邻域嵌入8)Autoencoder - 自编码器9)Manifold Learning - 流形学习10)Locally Linear Embedding (LLE) - 局部线性嵌入11)Isomap - 等度量映射12)Uniform Manifold Approximation and Projection (UMAP) - 均匀流形逼近与投影13)Kernel PCA - 核主成分分析14)Non-negative Matrix Factorization (NMF) - 非负矩阵分解15)Independent Component Analysis (ICA) - 独立成分分析16)Variational Autoencoder (VAE) - 变分自编码器17)Sparse Coding - 稀疏编码18)Random Projection - 随机投影19)Neighborhood Preserving Embedding (NPE) - 保持邻域结构的嵌入20)Curvilinear Component Analysis (CCA) - 曲线成分分析•Principal Component Analysis (PCA) - 主成分分析1)Principal Component Analysis (PCA) - 主成分分析2)Eigenvector - 特征向量3)Eigenvalue - 特征值4)Covariance Matrix - 协方差矩阵。
Promoting Exchanges and Mutual Learning Among Civilizations and Jointly Building a Community with a Shared Future for Mankind
We live in an Age of Frac-tures. Humanity is divided along ideological, econom-ic, wealth, cultural, social as well as geographical lines. These fractures oc-cur not only between, but also within States. We are also simultaneously liv-ing with threats which threaten us all. Climate change is the most obvious, but lest we forget, there still exists enough nuclear weapons to obliterate the world many times over. It is there-fore clear humanity has no choice but to come together, cooperate more closely and more meaningfully, as well as promote peace and harmony .However, there can be no har-mony among civilizations without mutual learning among civilizations and there can be no mutual learning among civilizations without equality among civilizations. Thus, acceptance of equality is the basis of harmony . It is important to distinguish between harmony on the one hand, and peace and stability on the other. Peace and stability merely mean the absence of conflict. It can be imposed, it can arise out of fear, it can mean stagnation. In contrast, harmony is joy arising from being in resonance with something higher than one’s own self. In this sense, what is true for relations be-tween individuals is also true for rela-tions among countries.In his speech delivered at UN-ESCO on 23 March 2014, President Xi Jinping emphasized three important points. Firstly, civilizations come in different colors, and such diversity has made exchanges and mutual learn-ing among civilizations relevant and valuable. Secondly, all civilizations are equal, and such equality has made ex-changes and mutual learning among civilizations possible. Thirdly, civiliza-tions are inclusive, and such inclusive-ness has given exchanges and mutual learning among civilizations possible. Since time immemorial, Malaysia and polities which existed before in that geographical space have been at the crossroads of almost all the major civ-ilizations – Indian, Chinese, Persian, Muslim and through colonialism, the West, be they Portuguese, the Dutch or British, and have absorbed these influences.This characteristic was not only unique to Malaysia, but was general for South East A sia as a whole. Local cultures took what they thought to be the best and most germane from the outside influences mentioned above and incorporated them into their own, resulting in something unique and through a process of assimilation and mutation, ensured their survival. That all these happened continuously throughout millennia is an expres-sion not only of the confidence of South East Asian cultures, but also a willingness to admit the equality of all cultures that they came into con-tact with. Therefore, to me, President Xi’s points of the equality of civiliza-tions is the bedrock and principle for promoting harmony .I believe that the answer lies in the fact that the concept and origin of the word “civilization” itself in exclusivist – it can, and indeed tends, to promote the sense of “Us” and “Them” with all of its malevolent consequences. Etymologists tell us that “civilization” is derived from the Latin “civilis” (“civil”), related to “civis” (“citizen”) and “civitas” (“city”). 
It is also meta-phorically related to “civilitas”, which is defined as “the character of people who are citizens, who live in cities, in organized states and societies, as op-posed to primitive, barbarous peoplesPromoting Exchanges and Mutual Learning Among Civilizations and Jointly Buildinga Community with a Shared Future for MankindRaja Dato’ Nushirwan Zainal AbidinPeople experience the overseas heritage of Chinese culture at the temple fair in Penang, Malaysia on January 28, 2023.March/April 2023CONTEMPORARY WORLD(Photo/Xinhua)15who do not.” The civilized man is the man who lives in a society with its richer, fuller life and has the gifts that enable him to live in this life, which demands certain qualities of mind and character, and gives opportunities for development that the isolated life of the savage, living in a family or in a wandering tribe, cannot give.But the exclusivist nature of the “civilization” as used by the West, is further amplified and is given deeper meaning by how it was used in the context of the West’s colonial project. To quote the words of Professor Gerrit W. Gong in his book The Standard of “Civilization” in International Society: It was fundamentally a confrontation of civilizations and their respective cultural systems. At the heart of this clash were the standards of civiliza-tion by which these different civiliza-tions identified themselves and regu-lated themselves. Thus, the notion of “civilization” propelled, supported and ultimately perpetuated the colo-nial project. It did so in the following ways:Firstly, Western civilization was deemed superior. Indeed, only coun-tries of the West were deemed as “civi-lized”. Others were either “barbarous” or even worse, “savage”. This was not necessarily the case in the beginning. Early British traders in India for ex-ample, were mesmerized by Indian culture. That changed, however, when the British began to win military bat-tles against the Mughals. It was then that a distinction was increasingly made between a “modern”, superior, Western civilization and a “backward”, inferior, Eastern civilization. This merely amplified the exclusivist na-ture of the word itself, as mentioned before;Secondly, standards of civilization were set by the “superior” civiliza-tion. The main standard applied was material progress and the ability to compete economically and militarily, rather than the promotion of virtueand harmony as well as the achieve-ment of salvation;Thirdly, civilizations were classi-fied and in so doing, assumed thatthey were hermetically sealed, eachfree from being influenced by the oth-er. This also accelerated the process of“Othering” in which a group of peopleare viewed or treated as intrinsicallydifferent from and alien to oneself,thus giving rise to exclusivist, ratherthan inclusivist views of the world.These modes of thought persist upuntil today, either in abridged, dilutedor more insidious ways. However,there are prominent thinkers in theWest who still believe that Westernculture and therefore civilization issuperior to others. They often pointto the works of Shakespeare, Milton,Beethoven, Mozart, Rembrandt, Cer-vantes and Moliere to name a few,as the only, unmatched pinnacle ofhuman civilization. 
I wish to point outtwo points in this regard: First, thereis no objective global and commoncriteria of what constitutes great artand high culture, free from a society’shistorical and cultural context; Sec-ond, even within a culture, standardsmay vary across time; Third, in manycultures, the creation of art is a collec-tive endeavor. Thus, ascribing worksof art to specific individuals is an aliennotion. Equally alien is the idea thatart is an expression of individuality.Given that acceptance of equalityamong civilizations is the sine quanon for harmony among civilizations,what can we do to promote it? Thefirst task is to engender humility. Asmatters stand, this attribute is shortsupply, particularly in policy-makersin the West. This is the most difficulttask as it involves a changing of a col-lective psyche. This is not meant aspersonal and ad hominem criticismagainst any Western leader. They weremerely reflective of their societies atlarge. On the subject of humility, theMalaysian Prime Minister, Datuk SeriAnwar Ibrahim is fond of quoting thewords of the Anglo-American poet,T.S Eliot in his Four Quartets: “Theonly wisdom we can hope to acquire/Is the wisdom of humility: humility isendless”. Inculcating humility is notan exercise in philosophy floating inthe clouds, it is practical one, rootedin the earth. It has to be accepted thatfor more than two decades now, A siais marching towards the drumbeatof its own values, incorporating inthat march a clear sense of its owninterests. As one which has followedand occasionally participated in thatmarch, it comes as no surprise thatChina and India have taken the po-sitions that they have vis-à-vis theRussian-Ukrainian conflict. They arean expression of the values of friend-ship and relationship, as well as thestrategic interests of both countries.Asia’s march will continue, notonly because it can, given its increas-ing economic heft, but because it hasto. The central theme of Dato’ SeriAnwar Ibrahim’s book The A sian Re-naissance written in 1996 is a simplebut powerful one: A sia has learnt andbenefited from its interactions withthe West, but blind imitation of theWest will spell its doom. A s he writes“Asia needs to undergo a paradigmshift, as it seeks to respond to theutilitarian demands of the future,without forsaking its identity, a chal-lenge which requires a revitalizationof A sia’s traditions.”The second task is to promoteknowledge about one another. Obvi-ously, given the West’s dominationof world affairs for so long, the restknows much more about them thanvice-versa. But what should be theaim of gathering and accumulatingknowledge about one another? Forme, it is to enable us to act virtu-ously to those less-known to us. Thisis related to the aim of knowledgegenerally, which is to attain virtue.16March/April 2023 CONTEMPORARY WORLD“Knowledge is virtue” is a far, far cry from Francis Bacon’s aphorism that “knowledge is power.” On the nexus between knowledge and power, it is difficult to steer away from Edward Said’s seminal 1978 work Orientalism which he considers as an expression of “European-Atlantic power over the Orient than it is as a veridic discourse about the Orient.” He further writes “knowledge gives power, more power requires knowledge, and so on in an increasingly profitable dialectic of information and control.” What divides “knowledge is virtue” from “knowledge is power”? I believe that the main dividing lines are sincerity and objectivity towards the subject studied. 
Those who are sincere and objective aim to achieve understand-ing, while those less so aim to achieve domination. I have read books by purported Western “experts” on emerging or in some cases, re-emerg-ing nations (I unhesitatingly place China in this category). Certainly, they know much about their area of study; they speak the language, they have lived there and may even have deep and profound ties. But they tend to view their areas of study from their own perspective, that is to say,those who have been in power for along time and who wish to remain inpower; they have little understandingof how their subjects view them, orthemselves.The third task is to have more lead-ers, particularly from the developingworld, to speak on the subject of theequality of civilizations. In PresidentXi’s 2014 speech to UNESCO citedearlier, he said that “Civilizations areequal, and such equality has made ex-changes and mutual learning amongcivilizations possible. All human civi-lizations are equal in terms of value.There is no perfect civilization in theworld. Nor is there any civilization thatis devoid of any merit. No civilizationcan be judged superior to another.”President Xi delivered this instructivestatement in 2014. Since then, he hasspoken of this and related topics often,including at the Conference on A sianCivilizations on May 2019. To me, thisindicates the seriousness with whichhe views the subject.For some time now MalaysianPrime Minister Dato’ Seri AnwarIbrahim has spoken against culturalchauvinism, the spring from whichthe sense of civilizational superior-ity flows. In his address on assumingthe Presidency of UNESCO’s GeneralConference in 1989, he said “Thelegacy of history means that somepeoples and nations have enjoyed par-ticular benefits. We must all be awareof failing to make the distinctionbetween cultural pride and culturalchauvinism. Equally, we must all beable to distinguish between the rightto continue to be ourselves, whereverwe are an in whatever we do, and theease with which this becomes culturalimperialism that denies the samefreedom to others.” I have every con-fidence that on this issue the voice ofAfrica, the cradle of humanity, willbe even stronger in the future. This isinevitable, given that Africa is poisedfor growth and development. Whilechallenges remain, the continenthas put behind it the seemingly in-tractable security, human rights andeconomic hardships that bedeviledit twenty to thirty years ago; with 60percent of the world’s arable land, itcan be the breadbasket of the world –as it is by 2030, agribusiness in Africawill be worth 1 trillion US dollars;its population is projected to reach1.1 billion people by 2040, making itthe largest workforce globally. Thisgathering momentum will hopefullyresult in a cascading effect. A s it does,the voices speaking for civilizationalsuperiority, overt or covert, consciousor otherwise, will be swept away likedust in torrential rain.To summarize and conclude:what we must work towards is aglobal humanity composed of differ-ent but equal civilizations, existingin harmony with one another, prov-ing the words of Aime Cesaire that:“There is a place for all/ At the ren-dezvous of history”.——————————————Raja Dato’ Nushirwan Zainal Abidin isMalaysian Ambassador to ChinaPeople participate in the “2022 Malaysia-China Friendship Run” in Kota Kinabalu,Sabah, Malaysia, on December 4, 2022.(Photo/Xinhua)17。
mutual information feature selection
Mutual information feature selection is a feature selection method that evaluates and selects features based on mutual information from information theory.
Mutual information measures the dependence between two random variables, so it can reflect how strongly a feature depends on the target variable.
In feature selection, the idea is to compute the mutual information between each feature and the target variable, and then rank and select features according to these values.
A larger mutual information value means that the feature is more strongly related to the target variable and provides more information for classification or prediction.
By selecting features with high mutual information, the dimensionality of the feature space can be reduced while the features most closely related to the target variable are retained, which helps improve classification or prediction performance.
In addition, mutual information can be combined with other feature selection methods to further refine the selected feature subset.
Note that estimating mutual information requires a sufficient number of samples and non-trivial computation, so efficiency can become an issue for large data sets and high-dimensional feature spaces.
Mutual information is also fairly sensitive to noise and outliers, so data preprocessing and stability analysis are advisable.
Besides mutual information feature selection, there are many other feature selection approaches, such as statistics-based methods, model-based methods, filter methods, and wrapper methods.
Each method has its own strengths and weaknesses, and the appropriate one should be chosen according to the specific problem and the characteristics of the data set.
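As a concrete illustration of the ranking idea described above, the sketch below scores each feature of a labeled data set by its estimated mutual information with the class label and keeps the top-k features. It is a minimal filter-style example; the breast-cancer data set, the value of k, and the use of scikit-learn's mutual_info_classif estimator are illustrative assumptions, not part of the original text.

```python
# Minimal sketch: rank features by estimated mutual information with the class label.
# The data set and k are illustrative choices.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Estimate I(feature; label) for every feature.
mi_scores = mutual_info_classif(X, y, random_state=0)

# Keep the k features with the largest mutual information.
k = 10
top_idx = np.argsort(mi_scores)[::-1][:k]
X_selected = X[:, top_idx]

print("selected feature indices:", top_idx)
print("their MI scores:", np.round(mi_scores[top_idx], 3))
```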
Research on Chinese Text Data Classification
Master's thesis, Shanghai Normal University: Research on Chinese Text Data Classification. Author: ***. Degree sought: Master. Major: Computer Application Technology. Supervisors: 张功镀; 吴海涛. 2004-05-01. Abstract: With the continuous development of information technology, and in particular the spread of Internet applications, online information is growing exponentially; how to process this massive amount of information automatically and maintain large text collections effectively has become an important research topic.
One effective way to manage texts is to classify them systematically, i.e., text data classification.
Text data classification is an important intelligent information processing technology and the foundation of text retrieval; it is highly valuable in applications such as news categorization, electronic conferencing, automatic e-mail classification, and information filtering.
Text data classification also plays an important role in traditional information retrieval, the construction of web site index architectures, and web information retrieval.
Text data classification takes text mining technology as its basis and core, and has become a research hotspot in the fields of data mining and web mining in recent years.
This thesis introduces the information processing foundations of Chinese text data classification and the vector space model, discusses automatic word segmentation, and analyzes in detail several text feature selection algorithms and the Bayesian text classification model; through extensive experiments it studies several feature selection algorithms in depth: mutual information (MI), information gain (IG), the chi-square statistic, and weight of evidence for text, and it proposes an improvement to mutual information.
Given the unsatisfactory classification performance of naive Bayes, the thesis further proposes incorporating the Boosting idea from machine learning into the naive Bayes classification model; experiments show that both the improved mutual information and the Boosting-enhanced naive Bayes classifier produce good classification results in terms of precision, recall, and F1.
Keywords: text data classification, feature selection, vector space model, automatic word segmentation, naive Bayes
Abstract: With the development of Information Technology and the improvement of Internet applications, information on the Internet has increased exponentially; it is an important research subject to deal with large amounts of information and to store big text sets automatically. One effective method to manage texts is to classify them, also called text classification. Automatic text classification is an intelligent technology of information processing and the foundation of text retrieval, which is applied to news categorization, electronic conferences, e-mail categorization, information filtering, etc. Automatic text classification plays an important role in traditional intelligence retrieval, the foundation of web index architecture, web information retrieval, and so on. Based on web mining technology, automatic text classification has become a hot research area in the field of data mining and net mining. This thesis introduces the technical foundation of Chinese text classification and the Vector Space Model, discusses Chinese word segmentation, and analyzes many text feature selection algorithms and the Bayes categorization model. With a lot of experiments, the thesis deeply researches and evaluates many text feature selection algorithms such as Mutual Information, Information Gain, Chi-square evaluation, and Weight of Evidence for Text. The thesis also makes an improvement on Mutual Information. Because of the ineffectiveness of the Naive Bayes model for text classification, this thesis proposes integrating the Boosting theory of machine learning into the classification process, boosting the Naive Bayes categorization model through many rounds of training.
multi-label feature selection
Multi-label feature selection is a machine learning technique used to identify the most useful features from a dataset when the target variable has multiple labels or classes. It aims to select a subset of features that can effectively predict all possible outcomes of the target variable.
Multi-label feature selection involves evaluating the relevance and redundancy of each feature in relation to the target variable. Relevant features are those that have a strong correlation with the target variable, while redundant features are those that can be predicted or explained by other features.
There are several methods for multi-label feature selection, including wrapper methods, filter methods, and embedded methods. Wrapper methods select a subset of features and evaluate their performance using a specific machine learning algorithm. Filter methods evaluate the relevance of features by calculating statistical measures such as correlation, mutual information, and entropy. Embedded methods combine feature selection with the machine learning algorithm, optimizing both simultaneously.
Overall, multi-label feature selection is an important technique for improving the accuracy, efficiency, and interpretability of machine learning models in multi-label classification problems.
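To make the filter-method idea above concrete, the sketch below scores each feature against a multi-label target by averaging its estimated mutual information across the individual labels and keeps the highest-scoring features. This is only one simple aggregation strategy, assuming scikit-learn's mutual_info_classif and a binary indicator matrix Y; it is not a specific published multi-label algorithm, and the toy data are invented.

```python
# Minimal multi-label filter sketch: average per-label mutual information scores.
# Assumes Y is a binary indicator matrix of shape (n_samples, n_labels).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def multilabel_mi_scores(X, Y, random_state=0):
    """Return one relevance score per feature, averaged over all labels."""
    per_label = [
        mutual_info_classif(X, Y[:, j], random_state=random_state)
        for j in range(Y.shape[1])
    ]
    return np.mean(per_label, axis=0)

# Toy data: 200 samples, 6 features, 3 labels driven by the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
Y = np.column_stack([
    (X[:, 0] > 0).astype(int),
    (X[:, 1] > 0).astype(int),
    ((X[:, 0] + X[:, 1]) > 0).astype(int),
])

scores = multilabel_mi_scores(X, Y)
top = np.argsort(scores)[::-1][:2]
print("feature scores:", np.round(scores, 3))
print("selected features:", top)
```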
Word Vector Models in Natural Language Processing
Natural Language Processing (NLP) is an important research branch of Artificial Intelligence (AI); its goal is to enable computers to understand and process natural language and thus to communicate effectively with humans.
Within NLP, word vector models are an important research direction; their purpose is to convert textual information into vector form so that it can be processed and analyzed in a vector space to support specific NLP applications and functions.
1. Overview of word vector models. A word vector model is a technique that maps every word in a vocabulary to a vector in a vector space.
Common word vector models include statistics-based models and neural-network-based models.
Statistics-based models mainly include Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA).
Neural-network-based models mainly include embedding layers, recurrent neural networks (RNN), and convolutional neural networks (CNN).
2. Applications of word vector models. Word vector models are widely used in NLP.
The main applications include text classification and sentiment analysis.
(1) Text classification. Text classification is the task of assigning a document or a sentence to a specific predefined category.
For example, a news article may be assigned to a politics, technology, or sports category.
In text classification, word vector models help map words into a vector space and compute a vector representation for each category, so that test texts can be classified.
Common text classification algorithms include Naive Bayes, Support Vector Machines (SVM), and Logistic Regression.
(2) Sentiment analysis. Sentiment analysis determines, by analyzing the content of a text, the emotional state of people when writing or reading an article, watching a video, or using a product; a small sketch of embedding-based classification follows.
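As a rough illustration of how word vectors feed a downstream classifier, the sketch below represents each document by the average of its word vectors and trains a logistic regression on top. The tiny vocabulary, the random embedding matrix, and the toy labels are all invented for illustration; in practice the vectors would come from a pretrained model such as word2vec or GloVe.

```python
# Minimal sketch: average word vectors as document features, then logistic regression.
# The embedding matrix here is random; real systems would load pretrained vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

vocab = {"good": 0, "bad": 1, "great": 2, "awful": 3, "movie": 4, "plot": 5}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))  # one 8-dimensional vector per word

def doc_vector(text):
    """Average the vectors of the known words in a whitespace-tokenized text."""
    idx = [vocab[w] for w in text.lower().split() if w in vocab]
    return embeddings[idx].mean(axis=0) if idx else np.zeros(embeddings.shape[1])

docs = ["good movie great plot", "awful movie bad plot", "great good", "bad awful"]
labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

X = np.vstack([doc_vector(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([doc_vector("good plot")]))
```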
Corpus Linguistics Terminology (English-Chinese Glossary)
Aboutness 所言之事Absolute frequency 绝对频数Alignment (of parallel texts) (平行或对应)语料的对齐Alphanumeric 字母数字类的Annotate 标注(动词)Annotation 标注(名词)Annotation scheme 标注方案ANSI/American National Standards Institute 美国国家标准学会ASCII/American Standard Code for Information Exchange 美国信息交换标准码Associate (of keywords) (主题词的)联想词AWL/Academic word list 学术词表Balanced corpus 平衡语料库Base list 底表、基础词表Bigram 二元组、二元序列、二元结构Bi-hapax 两次词Bilingual corpus 双语语料库CA/Contrastive Analysis 对比分析Case-sensitive 大小写敏感、区分大小写Chi-square (χ2) test 卡方检验Chunk 词块CIA/Contrastive Interlanguage Analysis 中介语对比分析CLAWS/Constituent Likelihood Automatic Word-tagging System CLAWS词性赋码系统Clean text policy 干净文本原则Cluster 词簇、词丛Colligation 类联接、类连接、类联结Collocate n./v. 搭配词;搭配Collocability 搭配强度、搭配力Collocation 搭配、词语搭配Collocational strength 搭配强度Collocational framework/frame 搭配框架Comparable corpora 类比语料库、可比语料库ConcGram 同现词列、框合结构Concordance (line) 索引(行)Concordance plot (索引)词图Concordancer 索引工具Concordancing 索引生成、索引分析Context 语境、上下文Context word 语境词Contingency table 连列表、联列表、列连表、列联表Co-occurrence/Co-occurring 共现Corpora 语料库(复数)Corpus Linguistics 语料库语言学Corpus 语料库Corpus-based 基于语料库的Corpus-driven 语料库驱动的Corpus-informed 语料库指导的、参考了语料库的Co-select/Co-selection/Co-selectiveness 共选(机制)Co-text 共文DDL/Data Driven Learning 数据驱动学习Diachronic corpus 历时语料库Discourse 话语、语篇Discourse prosody 话语韵律Documentation 备检文件、文检报告EAGLES/Expert Advisory Groups on Language Engineering Standards EAGLES文本规格Empirical Linguistics 实证语言学Empiricism 经验主义Encoding 字符编码Error-tagging 错误标注、错误赋码Extended unit of meaning 扩展意义单位File-based search/concordancing 批量检索Formulaic sequence 程式化序列Frequency 频数、频率General (purpose) corpus 通用语料库Granularity 颗粒度Hapax legomenon/hapax 一次词Header/Text head 文本头、头标、头文件HMM/Hidden Markov Model 隐马尔科夫模型Idiom Principle 习语原则Index/Indexing (建)索引In-line annotation 文内标注、行内标注Key keyword 关键主题词Keyness 主题性、关键性Keyword 主题词KWIC/Key Word in Context 语境中的关键词、语境共现(方式)Learner corpus 学习者语料库Lemma 词目、原形词、词元Lemma list 词形还原对应表Lemmata 词目、原形词、词元(复数)Lemmatization 词形还原、词元化Lemmatizer 词形还原(词元化)工具Lexical bundle 词束Lexical density 词汇密度Lexical item 词项、词语项目Lexical priming 词汇触发理论Lexical richness 词汇丰富度Lexico-grammar/Lexical grammar 词汇语法Lexis 词语、词项LL/Log likelihood (ratio) 对数似然比、对数似然率Longitudinal/Developmental corpus 跟踪语料库、发展语料库、历时语料库Machine-readable 机读的Markup 标记、置标MDA/Multi-dimensional approach 多维度分析法Metadata 元信息Meta-metadata 元元信息MF/MD (Multi-feature/Multi-dimensional) approach 多特征/多维度分析法Mini-text 微型文本Misuse 误用Monitor corpus (动态)监察语料库Monolingual corpus 单语语料库Multilingual corpus 多语语料库Multimodal corpus 多模态语料库MWU/Multiword unit 多词单位MWE/Multiword expression 多词单位MI/Mutual information 互信息、互现信息N-gram N元组、N元序列、N元结构、N元词、多词序列NLP/Natural Language Processing 自然语言处理Node 节点(词)Normalization 标准化Normalized frequency 标准化频率、标称频率、归一频率Observed corpus 观察语料库Ontology 知识本体、本体Open Choice Principle 开放选择原则Overuse 超用、过多使用、使用过度、过度使用Paradigmatic 纵聚合(关系)的Parallel corpus 平行语料库、对应语料库Parole linguistics 言语语言学Parsed corpus 句法标注的语料库Parser 句法分析器Parsing 句法分析Pattern/patterning 型式Pattern grammar 型式语法Pedagogic corpus 教学语料库Phraseology 短语、短语学POSgram 赋码序列、码串POS tagging/Part-of-Speech tagging 词性赋码、词性标注、词性附码POS tagger 词性赋码器、词性赋码工具Prefab 预制语块Probabilistic (基于)概率的、概率性的、盖然的Probability 概率Rationalism 理性主义Raw text/Raw corpus 生文本(语料)Reference corpus 参照语料库Regex/RE/RegExp/Regular Expressions 正则表达式Register variation 语域变异Relative frequency 相对频率Representative/Representativeness 代表性(的)Rule-based 基于规则的Sample n./v. 
样本;取样、采样、抽样Sampling 取样、采样、抽样Search term 检索项Search word 检索词Segmentation 切分、分词Semantic preference 语义倾向Semantic prosody 语义韵SGML/Standard Generalized Markup Language 标准通用标记语言Skipgram 跨词序列、跨词结构Span 跨距Special purpose corpus 专用语料库、专门用途语料库、专题语料库Specialized corpus 专用语料库Standardized TTR/Standardized type-token ratio 标准化类符/形符比、标准化类/形比、标准化型次比Stand-off annotation 分离式标注Stop list 停用词表、过滤词表Stop word 停用词、过滤词Synchronic corpus 共时语料库Syntagmatic 横组合(关系)的Tag 标记、码、标注码Tagger 赋码器、赋码工具、标注工具Tagging 赋码、标注、附码Tag sequence 赋码序列、码串Tagset 赋码集、码集Text 文本TEI/Text Encoding Initiative 文本编码计划The Lexical Approach 词汇中心教学法The Lexical Syllabus 词汇大纲Token 形符、词次Token definition 形符界定、单词界定Tokenization 分词Tokenizer 分词工具Transcription 转写Translational corpus 翻译语料库Treebank 树库Trigram 三元组、三元序列、三元结构T-score T值Type 类符、词型TTR/Type-token ratio 类符/形符比、类/形比、型次比Underuse 少用、使用不足Unicode 通用码Unit of meaning 意义单位WaC/Web as Corpus 网络语料库Wildcard 通配符Word definition 单词界定Word form 词形Word family 词族Word list 词表XML/EXtensible Markup Language 可扩展标记语言Zipf's Law 齐夫定律Z-score Z值Welcome To Download !!!欢迎您的下载,资料仅供参考!。
Gaokao English grammar-fill exercises based on current news (cuju enters campuses; the history of cuju; chatting and cultural transmission)
Traditional Chinese sport ‘cuju’ is popular on campuses in Shandong中国传统运动蹴鞠进校园1 ________the FIFA World Cup ended in December 2022, the passion for soccer is still very much alive and well in Zibo, Shandong province.Zibo is called “the home of the soccer ball” as it is the birthplace of the ancient Chinese sport of cuju, which 2 __________________(acknowledge) by FIFA in 2004 as the earliest form of soccer. Li Weipeng, a seventh-generation inheritor (传承人) of cuju, has been practicing these skills for 18 years.Though the 34-year-old has mastered many different cuju techniques, he was stumped by the simplest of moves in 2004 when he joined the cuju team in Linzi and underwent professional soccer training for nearly a decade.“ 3 ________the beginning, I spent eight hours a day [just] practicing juggling (颠) a ball. It was4 ______________( exhaust) ,” Li told China Daily.But all his efforts 5 ___________(pay) off. Now, the 34-year-old is able to juggle a ball with his foot over 10,000 times in a row. After he could juggle a ball hundreds of times in a row, Li also started practicing other soccer skills such as “side-flicking” and “chest down”.The traditional Chinese sport cuju is now popular at the campus of every primary and middle school in Linzi and 6 _______ moves have been adapted into dances and morning exercises.Li is one of the teachers 7 ________ teach students cuju techniques.“Students show great interest in playing cuju, which encourages me 8 _________(promote) the ancient sport,” Li told China Daily.During the World Cup in Qatar, Li gave a demonstration of cuju at a China-Qatar youth exchange activity held in Doha. He led Chinese and Qatari youth players9 ___________ (wear) traditional cuju costumes to experience the ancient game and see for themselves the 10 ____________(similar) and differences between cuju and modern soccer.Li said that the cuju activity in Qatar attracted many soccer fans and they cheered when they demonstrated different cuju moves.参考答案:1 Though 2 was acknowledged 3 At 4 exhausting 5 paid 6 its 7 who 8 to promote 9 wearing 10 similaritiesTRACING HISTORY OF ‘CUJU’“蹴鞠”的历史追溯“Cuju, the ancient Chinese game, _________(go) beyond sports. It has become a platform _____________( enhance )exchanges and mutual learning among different civilizations,” said Yu Jian, _______inheritor of cuju equipment manufacturing (制造) in Linzi district of Zibo city.Cuju is an ancient Chinese game which_____________(involve) the kicking of a ball. The word cu means to kick, while ju refers _____an ancient type of leather (皮革) ball stuffed with feathers or grain chaff (谷壳).The ancient Chinese historical text Zhan Guo Ce (Strategies of the Warring States) recorded it as one of many forms of ____________(entertain) among the public.But, during the Han Dynasty (206 BC-AD 220), cuju was ______________(common) played by soldiers for military training_____________(purpose) . During the Tang Dynasty (618-907), women played cuju at the royal court for the emperor’s entertainment.Cuju reached _____ peak during the Song Dynasty (960-1279) _________its popularity extended to every societal class. Yet, the 2,000-year-old game began to slowly fade away during the Ming Dynasty (1368-1644).参考答案:1 has gone 2 to enhance 3 an 4 involves 5 to 6 entertainment 7 commonly 8 purposes 9 its 10 when闲聊在文化传承的作用Storytelling is a feature of every known culture, but what is it about stories 1. _________ makes them so universal?To put it 2. _______ (simple), they’ve kept us alive. 
The “story originated 3 ______ a method of bringing us together to share specific information that might be lifesaving,” Cron writes, citing a humorous example of one Neanderthal (尼安德特人的) warning another not to eat certain berries by sharing the tragic story of 4 __________happened to the last guy who ate them.Because a story involves both data and emotions, it’s more engaging – and therefore more memorable – than simply 5 . _______ (tell) someone, “Those berries are 6. _______ (poison).”In fact, stories 7. _______ (remember) up to 22 times more than facts alone, according to Jennifer Aaker, a marketing professor at Stanford’s Graduate School of Business.If you think telling stories about other people to convey information 8. ________ (sound) a lot like gossip (闲聊), you’d be correct. Evolutionary psychologist Robin Dunbar even argues that storytelling has its origins in gossip, 9. __________ social practice that continues today.Gossip actually accounts for 65 percent of all human conversations in public places, regardless of age or gender, according to Dunbar’s research, and that’s not necessarily a bad thing. Sharing _________(story) – even gossip – can help us learn and make sense of the world.参考答案:1. that 2 simply 3. as 4 what 5. telling 6 poisonous 7 are remembered 8 sounds9 a 10 stories。
An English essay on "two-way commitment" (双向奔赴): mutual help and support in learning
关于双向奔赴的英语作文指学习的互帮互助Mutual assistance in learning is a vital component of academic success. Two-way learning involves both teaching and learning from each other, creating a positive and supportive environment for everyone involved. This approach not only enhances academic achievement but also fosters a sense of community and collaboration among students.One of the key benefits of two-way learning is the opportunity for students to gain different perspectives and insights from their peers. By teaching something to someone else, students are forced to process and organize their thoughts in a way that allows them to effectively convey the information. This process of teaching not only enhances the understanding of the material for the student teaching but also provides a different viewpoint for the student learning. This dynamic exchange of information creates a richer learning experience for both parties.Furthermore, two-way learning promotes a sense of collaboration and teamwork among students. By working together to help each other succeed, students develop strong bonds with their peers and create a supportive community within the classroom. This sense of camaraderie can lead to increasedmotivation and engagement, as students are more likely to want to help each other succeed.In addition, two-way learning can also help to address individual learning needs and preferences. Each student has their own unique way of processing information and understanding concepts. By working together with their peers, students can tailor their learning experience to better suit their individual needs. For example, visual learners can create diagrams and charts to help explain concepts to their peers, while auditory learners can use verbal explanations to assist in comprehension.Overall, two-way learning is a powerful tool for academic success. By working together to support and help each other, students can not only enhance their understanding of the material but also develop important skills such as communication, collaboration, and critical thinking. This approach fosters a positive and supportive learning environment that benefits all students involved. Let us all embrace the concept of two-way learning and reap the rewards of mutual assistance in our academic pursuits.。
The mutual information method (mutual_info_regression) for regression feature selection
The mutual information method (mutual_info_regression) is a commonly used feature selection method for regression.
In machine learning and data mining, feature selection is an important preprocessing step: it helps us select the most relevant and useful features from the original feature set, which improves both model performance and interpretability.
This article describes the principle, steps, and implementation of the mutual information method (mutual_info_regression) in detail, and illustrates its application with examples.
The article is organized into the following parts: introduction, principle, steps, implementation, and summary.
1. Introduction. In regression analysis, we often need to select, from a large number of candidate features, the ones most related to the target variable.
The mutual information method (mutual_info_regression) is an information-theory-based feature selection method that measures the degree of dependence between two random variables.
For regression problems, mutual information can be used to evaluate the dependence between a feature and the target variable, and on that basis to select the features most useful for the regression task.
2. Principle. Mutual information is a measure of the mutual dependence between two random variables.
It is based on the concept of entropy and quantifies how much the joint distribution of the two variables differs from the product of their marginal distributions.
The mutual information can be written as I(X;Y) = H(X) + H(Y) - H(X,Y), where I(X;Y) is the mutual information of the random variables X and Y, H(X) and H(Y) are the entropies of X and Y, and H(X,Y) is their joint entropy; a small numeric example of this identity is given after the steps below.
In the mutual information method (mutual_info_regression), the target variable is treated as Y and a feature variable as X.
The mutual information between each feature and the target variable is then computed to evaluate their dependence, and the most relevant features are selected.
3. Steps. The main steps of the mutual information method (mutual_info_regression) are as follows: (1) Data preprocessing: clean the raw data and handle missing values to ensure data completeness and accuracy.
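To make the entropy identity from the principle section concrete, the sketch below computes H(X), H(Y), H(X,Y), and I(X;Y) for a small discrete joint distribution. The 2x2 probability table is an invented example, not data from the original text.

```python
# Worked example of I(X;Y) = H(X) + H(Y) - H(X,Y) on a small discrete joint distribution.
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (zero entries are ignored)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Invented joint distribution P(X, Y) for binary X and Y.
pxy = np.array([[0.30, 0.10],
                [0.10, 0.50]])

hx = entropy(pxy.sum(axis=1))   # H(X) from the marginal over X
hy = entropy(pxy.sum(axis=0))   # H(Y) from the marginal over Y
hxy = entropy(pxy.flatten())    # joint entropy H(X, Y)

print("I(X;Y) =", round(hx + hy - hxy, 4), "bits")  # about 0.256 bits here
```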
mutual_info_regression usage
1. Introduction. In machine learning and statistics, mutual information regression (mutual_info_regression) is a method for feature selection and for quantifying feature relevance.
It helps us determine the strength of the relationship between the input features and the output variable, and thereby helps us understand and model the data set.
This article introduces the basic principle of mutual_info_regression, how to use it, and an example.
2. How mutual_info_regression works. Mutual information is a measure of the dependence between two random variables.
In regression problems, mutual_info_regression can be used to determine the strength of the relationship between each input feature and the output variable, which helps us select the most relevant features.
mutual_info_regression is based on the concept of entropy: it measures the relationship between a feature and the output variable by comparing their joint distribution with their marginal distributions.
3. Steps for using mutual_info_regression. The steps for analyzing a data set with mutual_info_regression are as follows. 3.1 Prepare the data. First, prepare a labeled data set that contains the input features and the corresponding output variable.
Make sure the data set has gone through appropriate preprocessing, such as standardization and missing-value handling.
3.2 Import the necessary libraries. Next, import the relevant Python libraries, including `numpy`, `sklearn.feature_selection`, and `sklearn.datasets`.
These libraries provide the key functions and methods needed for mutual_info_regression.
3.3 Split the data. Split the data set into an input feature matrix (X) and an output variable vector (y).
3.4 Feature selection. Apply the mutual_info_regression function to the input features and the output variable, and compute a mutual information score for each feature; a sketch of these steps is given below.
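The following sketch strings together steps 3.1-3.4 above. The diabetes data set and the choice to keep the five highest-scoring features are illustrative assumptions; mutual_info_regression is the scikit-learn function the text refers to, and SelectKBest is shown only as one convenient way to apply the scores.

```python
# Steps 3.1-3.4 sketched with scikit-learn; the data set and k are illustrative choices.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import mutual_info_regression, SelectKBest

# 3.1 / 3.3: load a labeled regression data set as X (features) and y (target).
X, y = load_diabetes(return_X_y=True)

# 3.4: estimate the mutual information between each feature and the target.
scores = mutual_info_regression(X, y, random_state=0)
print("MI scores:", np.round(scores, 3))

# Keep the k features with the highest scores (here k = 5).
selector = SelectKBest(score_func=mutual_info_regression, k=5).fit(X, y)
X_selected = selector.transform(X)
print("selected feature indices:", selector.get_support(indices=True))
```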
Artificial Intelligence Terminology (English-Chinese Glossary)
Letter AAccumulated error backpropagation 累积误差逆传播Activation Function 激活函数Adaptive Resonance Theory/ART 自适应谐振理论Addictive model 加性学习Adversarial Networks 对抗网络Affine Layer 仿射层Affinity matrix 亲和矩阵Agent 代理 / 智能体Algorithm 算法Alpha-beta pruning α-β剪枝Anomaly detection 异常检测Approximation 近似Area Under ROC Curve/AUC Roc 曲线下面积Artificial General Intelligence/AGI 通用人工智能Artificial Intelligence/AI 人工智能Association analysis 关联分析Attention mechanism 注意力机制Attribute conditional independence assumption 属性条件独立性假设Attribute space 属性空间Attribute value 属性值Autoencoder 自编码器Automatic speech recognition 自动语音识别Automatic summarization 自动摘要Average gradient 平均梯度Average-Pooling 平均池化Letter BBackpropagation Through Time 通过时间的反向传播Backpropagation/BP 反向传播Base learner 基学习器Base learning algorithm 基学习算法Batch Normalization/BN 批量归一化Bayes decision rule 贝叶斯判定准则Bayes Model Averaging/BMA 贝叶斯模型平均Bayes optimal classifier 贝叶斯最优分类器Bayesian decision theory 贝叶斯决策论Bayesian network 贝叶斯网络Between-class scatter matrix 类间散度矩阵Bias 偏置 / 偏差Bias-variance decomposition 偏差-方差分解Bias-Variance Dilemma 偏差–方差困境Bi-directional Long-Short Term Memory/Bi-LSTM 双向长短期记忆Binary classification 二分类Binomial test 二项检验Bi-partition 二分法Boltzmann machine 玻尔兹曼机Bootstrap sampling 自助采样法/可重复采样/有放回采样Bootstrapping 自助法Break-Event Point/BEP 平衡点Letter CCalibration 校准Cascade-Correlation 级联相关Categorical attribute 离散属性Class-conditional probability 类条件概率Classification and regression tree/CART 分类与回归树Classifier 分类器Class-imbalance 类别不平衡Closed -form 闭式Cluster 簇/类/集群Cluster analysis 聚类分析Clustering 聚类Clustering ensemble 聚类集成Co-adapting 共适应Coding matrix 编码矩阵COLT 国际学习理论会议Committee-based learning 基于委员会的学习Competitive learning 竞争型学习Component learner 组件学习器Comprehensibility 可解释性Computation Cost 计算成本Computational Linguistics 计算语言学Computer vision 计算机视觉Concept drift 概念漂移Concept Learning System /CLS 概念学习系统Conditional entropy 条件熵Conditional mutual information 条件互信息Conditional Probability Table/CPT 条件概率表Conditional random field/CRF 条件随机场Conditional risk 条件风险Confidence 置信度Confusion matrix 混淆矩阵Connection weight 连接权Connectionism 连结主义Consistency 一致性/相合性Contingency table 列联表Continuous attribute 连续属性Convergence 收敛Conversational agent 会话智能体Convex quadratic programming 凸二次规划Convexity 凸性Convolutional neural network/CNN 卷积神经网络Co-occurrence 同现Correlation coefficient 相关系数Cosine similarity 余弦相似度Cost curve 成本曲线Cost Function 成本函数Cost matrix 成本矩阵Cost-sensitive 成本敏感Cross entropy 交叉熵Cross validation 交叉验证Crowdsourcing 众包Curse of dimensionality 维数灾难Cut point 截断点Cutting plane algorithm 割平面法Letter DData mining 数据挖掘Data set 数据集Decision Boundary 决策边界Decision stump 决策树桩Decision tree 决策树/判定树Deduction 演绎Deep Belief Network 深度信念网络Deep Convolutional Generative Adversarial Network/DCGAN 深度卷积生成对抗网络Deep learning 深度学习Deep neural network/DNN 深度神经网络Deep Q-Learning 深度 Q 学习Deep Q-Network 深度 Q 网络Density estimation 密度估计Density-based clustering 密度聚类Differentiable neural computer 可微分神经计算机Dimensionality reduction algorithm 降维算法Directed edge 有向边Disagreement measure 不合度量Discriminative model 判别模型Discriminator 判别器Distance measure 距离度量Distance metric learning 距离度量学习Distribution 分布Divergence 散度Diversity measure 多样性度量/差异性度量Domain adaption 领域自适应Downsampling 下采样D-separation (Directed separation)有向分离Dual problem 对偶问题Dummy node 哑结点Dynamic Fusion 动态融合Dynamic programming 动态规划Letter EEigenvalue decomposition 特征值分解Embedding 嵌入Emotional analysis 情绪分析Empirical conditional entropy 经验条件熵Empirical entropy 经验熵Empirical error 经验误差Empirical risk 经验风险End-to-End 端到端Energy-based model 基于能量的模型Ensemble learning 集成学习Ensemble pruning 集成修剪Error Correcting Output Codes/ECOC 纠错输出码Error 
rate 错误率Error-ambiguity decomposition 误差-分歧分解Euclidean distance 欧氏距离Evolutionary computation 演化计算Expectation-Maximization 期望最大化Expected loss 期望损失Exploding Gradient Problem 梯度爆炸问题Exponential loss function 指数损失函数Extreme Learning Machine/ELM 超限学习机Letter FFactorization 因子分解False negative 假负类False positive 假正类False Positive Rate/FPR 假正例率Feature engineering 特征工程Feature selection 特征选择Feature vector 特征向量Featured Learning 特征学习Feedforward Neural Networks/FNN 前馈神经网络Fine-tuning 微调Flipping output 翻转法Fluctuation 震荡Forward stagewise algorithm 前向分步算法Frequentist 频率主义学派Full-rank matrix 满秩矩阵Functional neuron 功能神经元Letter GGain ratio 增益率Game theory 博弈论Gaussian kernel function 高斯核函数Gaussian Mixture Model 高斯混合模型General Problem Solving 通用问题求解Generalization 泛化Generalization error 泛化误差Generalization error bound 泛化误差上界Generalized Lagrange function 广义拉格朗日函数Generalized linear model 广义线性模型Generalized Rayleigh quotient 广义瑞利商Generative Adversarial Networks/GAN 生成对抗网络Generative Model 生成模型Generator 生成器Genetic Algorithm/GA 遗传算法Gibbs sampling 吉布斯采样Gini index 基尼指数Global minimum 全局最小Global Optimization 全局优化Gradient boosting 梯度提升Gradient Descent 梯度下降Graph theory 图论Ground-truth 真相/真实Letter HHard margin 硬间隔Hard voting 硬投票Harmonic mean 调和平均Hesse matrix 海塞矩阵Hidden dynamic model 隐动态模型Hidden layer 隐藏层Hidden Markov Model/HMM 隐马尔可夫模型Hierarchical clustering 层次聚类Hilbert space 希尔伯特空间Hinge loss function 合页损失函数Hold-out 留出法Homogeneous 同质Hybrid computing 混合计算Hyperparameter 超参数Hypothesis 假设Hypothesis test 假设验证Letter IICML 国际机器学习会议Improved iterative scaling/IIS 改进的迭代尺度法Incremental learning 增量学习Independent and identically distributed/ 独立同分布Independent Component Analysis/ICA 独立成分分析Indicator function 指示函数Individual learner 个体学习器Induction 归纳Inductive bias 归纳偏好Inductive learning 归纳学习Inductive Logic Programming/ILP 归纳逻辑程序设计Information entropy 信息熵Information gain 信息增益Input layer 输入层Insensitive loss 不敏感损失Inter-cluster similarity 簇间相似度International Conference for Machine Learning/ICML 国际机器学习大会Intra-cluster similarity 簇内相似度Intrinsic value 固有值Isometric Mapping/Isomap 等度量映射Isotonic regression 等分回归Iterative Dichotomiser 迭代二分器Letter KKernel method 核方法Kernel trick 核技巧Kernelized Linear Discriminant Analysis/KLDA 核线性判别分析K-fold cross validation k 折交叉验证/k 倍交叉验证K-Means Clustering K –均值聚类K-Nearest Neighbours Algorithm/KNN K近邻算法Knowledge base 知识库Knowledge Representation 知识表征Letter LLabel space 标记空间Lagrange duality 拉格朗日对偶性Lagrange multiplier 拉格朗日乘子Laplace smoothing 拉普拉斯平滑Laplacian correction 拉普拉斯修正Latent Dirichlet Allocation 隐狄利克雷分布Latent semantic analysis 潜在语义分析Latent variable 隐变量Lazy learning 懒惰学习Learner 学习器Learning by analogy 类比学习Learning rate 学习率Learning Vector Quantization/LVQ 学习向量量化Least squares regression tree 最小二乘回归树Leave-One-Out/LOO 留一法linear chain conditional random field 线性链条件随机场Linear Discriminant Analysis/LDA 线性判别分析Linear model 线性模型Linear Regression 线性回归Link function 联系函数Local Markov property 局部马尔可夫性Local minimum 局部最小Log likelihood 对数似然Log odds/logit 对数几率Logistic Regression Logistic 回归Log-likelihood 对数似然Log-linear regression 对数线性回归Long-Short Term Memory/LSTM 长短期记忆Loss function 损失函数Letter MMachine translation/MT 机器翻译Macron-P 宏查准率Macron-R 宏查全率Majority voting 绝对多数投票法Manifold assumption 流形假设Manifold learning 流形学习Margin theory 间隔理论Marginal distribution 边际分布Marginal independence 边际独立性Marginalization 边际化Markov Chain Monte Carlo/MCMC 马尔可夫链蒙特卡罗方法Markov Random Field 马尔可夫随机场Maximal clique 最大团Maximum Likelihood Estimation/MLE 极大似然估计/极大似然法Maximum margin 最大间隔Maximum weighted spanning tree 最大带权生成树Max-Pooling 最大池化Mean squared error 均方误差Meta-learner 元学习器Metric learning 
度量学习Micro-P 微查准率Micro-R 微查全率Minimal Description Length/MDL 最小描述长度Minimax game 极小极大博弈Misclassification cost 误分类成本Mixture of experts 混合专家Momentum 动量Moral graph 道德图/端正图Multi-class classification 多分类Multi-document summarization 多文档摘要Multi-layer feedforward neural networks 多层前馈神经网络Multilayer Perceptron/MLP 多层感知器Multimodal learning 多模态学习Multiple Dimensional Scaling 多维缩放Multiple linear regression 多元线性回归Multi-response Linear Regression /MLR 多响应线性回归Mutual information 互信息Letter NNaive bayes 朴素贝叶斯Naive Bayes Classifier 朴素贝叶斯分类器Named entity recognition 命名实体识别Nash equilibrium 纳什均衡Natural language generation/NLG 自然语言生成Natural language processing 自然语言处理Negative class 负类Negative correlation 负相关法Negative Log Likelihood 负对数似然Neighbourhood Component Analysis/NCA 近邻成分分析Neural Machine Translation 神经机器翻译Neural Turing Machine 神经图灵机Newton method 牛顿法NIPS 国际神经信息处理系统会议No Free Lunch Theorem/NFL 没有免费的午餐定理Noise-contrastive estimation 噪音对比估计Nominal attribute 列名属性Non-convex optimization 非凸优化Nonlinear model 非线性模型Non-metric distance 非度量距离Non-negative matrix factorization 非负矩阵分解Non-ordinal attribute 无序属性Non-Saturating Game 非饱和博弈Norm 范数Normalization 归一化Nuclear norm 核范数Numerical attribute 数值属性Letter OObjective function 目标函数Oblique decision tree 斜决策树Occam’s razor 奥卡姆剃刀Odds 几率Off-Policy 离策略One shot learning 一次性学习One-Dependent Estimator/ODE 独依赖估计On-Policy 在策略Ordinal attribute 有序属性Out-of-bag estimate 包外估计Output layer 输出层Output smearing 输出调制法Overfitting 过拟合/过配Oversampling 过采样Letter PPaired t-test 成对 t 检验Pairwise 成对型Pairwise Markov property 成对马尔可夫性Parameter 参数Parameter estimation 参数估计Parameter tuning 调参Parse tree 解析树Particle Swarm Optimization/PSO 粒子群优化算法Part-of-speech tagging 词性标注Perceptron 感知机Performance measure 性能度量Plug and Play Generative Network 即插即用生成网络Plurality voting 相对多数投票法Polarity detection 极性检测Polynomial kernel function 多项式核函数Pooling 池化Positive class 正类Positive definite matrix 正定矩阵Post-hoc test 后续检验Post-pruning 后剪枝potential function 势函数Precision 查准率/准确率Prepruning 预剪枝Principal component analysis/PCA 主成分分析Principle of multiple explanations 多释原则Prior 先验Probability Graphical Model 概率图模型Proximal Gradient Descent/PGD 近端梯度下降Pruning 剪枝Pseudo-label 伪标记Letter QQuantized Neural Network 量子化神经网络Quantum computer 量子计算机Quantum Computing 量子计算Quasi Newton method 拟牛顿法Letter RRadial Basis Function/RBF 径向基函数Random Forest Algorithm 随机森林算法Random walk 随机漫步Recall 查全率/召回率Receiver Operating Characteristic/ROC 受试者工作特征Rectified Linear Unit/ReLU 线性修正单元Recurrent Neural Network 循环神经网络Recursive neural network 递归神经网络Reference model 参考模型Regression 回归Regularization 正则化Reinforcement learning/RL 强化学习Representation learning 表征学习Representer theorem 表示定理reproducing kernel Hilbert space/RKHS 再生核希尔伯特空间Re-sampling 重采样法Rescaling 再缩放Residual Mapping 残差映射Residual Network 残差网络Restricted Boltzmann Machine/RBM 受限玻尔兹曼机Restricted Isometry Property/RIP 限定等距性Re-weighting 重赋权法Robustness 稳健性/鲁棒性Root node 根结点Rule Engine 规则引擎Rule learning 规则学习Letter SSaddle point 鞍点Sample space 样本空间Sampling 采样Score function 评分函数Self-Driving 自动驾驶Self-Organizing Map/SOM 自组织映射Semi-naive Bayes classifiers 半朴素贝叶斯分类器Semi-Supervised Learning 半监督学习semi-Supervised Support Vector Machine 半监督支持向量机Sentiment analysis 情感分析Separating hyperplane 分离超平面Sigmoid function Sigmoid 函数Similarity measure 相似度度量Simulated annealing 模拟退火Simultaneous localization and mapping 同步定位与地图构建Singular Value Decomposition 奇异值分解Slack variables 松弛变量Smoothing 平滑Soft margin 软间隔Soft margin maximization 软间隔最大化Soft voting 软投票Sparse representation 稀疏表征Sparsity 稀疏性Specialization 特化Spectral Clustering 谱聚类Speech Recognition 语音识别Splitting 
variable 切分变量Squashing function 挤压函数Stability-plasticity dilemma 可塑性-稳定性困境Statistical learning 统计学习Status feature function 状态特征函Stochastic gradient descent 随机梯度下降Stratified sampling 分层采样Structural risk 结构风险Structural risk minimization/SRM 结构风险最小化Subspace 子空间Supervised learning 监督学习/有导师学习support vector expansion 支持向量展式Support Vector Machine/SVM 支持向量机Surrogat loss 替代损失Surrogate function 替代函数Symbolic learning 符号学习Symbolism 符号主义Synset 同义词集Letter TT-Distribution Stochastic Neighbour Embedding/t-SNE T –分布随机近邻嵌入Tensor 张量Tensor Processing Units/TPU 张量处理单元The least square method 最小二乘法Threshold 阈值Threshold logic unit 阈值逻辑单元Threshold-moving 阈值移动Time Step 时间步骤Tokenization 标记化Training error 训练误差Training instance 训练示例/训练例Transductive learning 直推学习Transfer learning 迁移学习Treebank 树库Tria-by-error 试错法True negative 真负类True positive 真正类True Positive Rate/TPR 真正例率Turing Machine 图灵机Twice-learning 二次学习Letter UUnderfitting 欠拟合/欠配Undersampling 欠采样Understandability 可理解性Unequal cost 非均等代价Unit-step function 单位阶跃函数Univariate decision tree 单变量决策树Unsupervised learning 无监督学习/无导师学习Unsupervised layer-wise training 无监督逐层训练Upsampling 上采样Letter VVanishing Gradient Problem 梯度消失问题Variational inference 变分推断VC Theory VC维理论Version space 版本空间Viterbi algorithm 维特比算法Von Neumann architecture 冯·诺伊曼架构Letter WWasserstein GAN/WGAN Wasserstein生成对抗网络Weak learner 弱学习器Weight 权重Weight sharing 权共享Weighted voting 加权投票法Within-class scatter matrix 类内散度矩阵Word embedding 词嵌入Word sense disambiguation 词义消歧Letter ZZero-data learning 零数据学习Zero-shot learning 零次学习---------------------作者:业余草来源:CSDN原文:版权声明:本文为博主原创文章,转载请附上博文链接!。
Feature selection based on neighborhood mutual information with max-relevance and min-redundancy
Feature selection based on neighborhood mutual information with max-relevance and min-redundancy. 林培榕. [Abstract] Feature selection is an important data preprocessing technique, and mutual information has been widely studied as an information measure. However, mutual information cannot directly calculate the relevancy among numeric features. In this paper, we first introduce neighborhood entropy and neighborhood mutual information. Then, we propose neighborhood-mutual-information-based max-relevance and min-redundancy feature selection. Finally, experimental results show that the proposed method can effectively select a discriminative feature subset, and it outperforms or equals other popular feature selection algorithms in classification performance. [Chinese abstract] Feature selection is an important data preprocessing step, and mutual information is an important class of information measures.
Because mutual information does not handle numeric features well, this paper introduces neighborhood entropy and neighborhood mutual information.
It then designs a max-relevance min-redundancy feature ranking algorithm based on neighborhood mutual information.
Finally, the top-ranked features selected by this algorithm are used for classification, and the classification accuracy is compared with that of other algorithms.
The experimental results show that the proposed algorithm is better than or comparable to other popular feature selection algorithms in classification accuracy.
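As a rough sketch of the max-relevance min-redundancy (mRMR) idea referred to above, the code below greedily adds the feature whose mutual information with the label is largest after subtracting its average mutual information with the features already chosen. It uses plain (not neighborhood) mutual information estimates from scikit-learn and the wine data set, so it illustrates only the generic mRMR criterion, not the specific algorithm of the cited paper.

```python
# Greedy mRMR sketch: relevance = I(feature; label), redundancy = mean I(feature; selected).
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr(X, y, k, random_state=0):
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y, random_state=random_state)
    selected, remaining = [], list(range(n_features))
    while len(selected) < k and remaining:
        best, best_score = None, -np.inf
        for j in remaining:
            if selected:
                # Redundancy: average MI between candidate j and already selected features.
                red = np.mean([
                    mutual_info_regression(X[:, [s]], X[:, j],
                                           random_state=random_state)[0]
                    for s in selected
                ])
            else:
                red = 0.0
            score = relevance[j] - red
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = load_wine(return_X_y=True)
print("mRMR-selected features:", mrmr(X, y, k=5))
```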
A multi-label feature selection algorithm based on max-relevance min-redundancy joint mutual information
A multi-label feature selection algorithm based on max-relevance min-redundancy joint mutual information. 张俐; 王枞. [Abstract] Feature selection has played an important role in machine learning and artificial intelligence in the past decades. Many existing feature selection algorithms choose redundant and irrelevant features because they overestimate the importance of some features. Moreover, too many features significantly slow down machine learning and lead to classification over-fitting. Therefore, a new nonlinear feature selection algorithm based on forward search is proposed. The algorithm uses the theory of mutual information and interaction information to find the optimal subset associated with multi-task labels while reducing the computational complexity. Comparative experiments on nine UCI data sets with four different classifiers show that the proposed algorithm is superior to the original feature set and to the feature sets selected by other feature selection algorithms.
[Chinese abstract] In the past few decades, feature selection has played an important role in machine learning and artificial intelligence. Many feature selection algorithms select redundant and irrelevant features because they exaggerate the importance of certain features. At the same time, too many features slow down machine learning and cause classification over-fitting. Therefore, a new forward-search-based nonlinear feature selection algorithm is proposed; it uses the theory of mutual information and interaction information to search for the optimal subset related to the multi-class labels while reducing computational complexity. Comparative experiments on 9 UCI data sets with 4 different classifiers show that the algorithm outperforms both the original feature set and the feature sets selected by other feature selection algorithms.
[Journal] Journal on Communications (通信学报) [Year (Volume), Issue] 2018 (039) 005 [Pages] 12 pages (P111-122) [Keywords] feature selection; conditional mutual information; feature interaction; feature relevance; feature redundancy [Authors] 张俐; 王枞 [Affiliations] School of Software Engineering, Beijing University of Posts and Telecommunications, Beijing 100876; Key Laboratory of Trustworthy Distributed Computing and Service (Ministry of Education), Beijing University of Posts and Telecommunications, Beijing 100876; School of Software Engineering, Beijing University of Posts and Telecommunications, Beijing 100876; Key Laboratory of Trustworthy Distributed Computing and Service (Ministry of Education), Beijing University of Posts and Telecommunications, Beijing 100876 [Language] Chinese [CLC] TP181
1. Introduction
In recent years, the rapid development of technologies such as big data, cloud computing, and artificial intelligence has brought unprecedented changes to social production and everyday life, and in the process has generated massive amounts of data [1].
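The forward-search idea described in the abstract above can be sketched with a joint-mutual-information-style criterion: at each step, add the candidate feature that, paired with the already selected features, carries the most information about the label. The version below discretizes features into quantile bins and estimates entropies from simple counts; it is a generic illustration under these assumptions, not the algorithm of the cited paper, and the wine data set is an illustrative choice.

```python
# Forward selection with a joint-MI-style score on quantile-discretized features.
# Generic illustration only; entropies are estimated from simple bin counts.
import numpy as np
from sklearn.datasets import load_wine

def discretize(x, bins=5):
    """Map a continuous column to integer quantile bins."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(x, edges)

def entropy_of(columns):
    """Entropy (bits) of the joint distribution of one or more discrete columns."""
    joint = np.stack(columns, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mi(cols_a, cols_b):
    return entropy_of(cols_a) + entropy_of(cols_b) - entropy_of(cols_a + cols_b)

def forward_jmi(X, y, k):
    Xd = [discretize(X[:, j]) for j in range(X.shape[1])]
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        def score(j):
            if not selected:
                return mi([Xd[j]], [y])
            # Average joint information I(X_j, X_s ; y) over already selected features.
            return np.mean([mi([Xd[j], Xd[s]], [y]) for s in selected])
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = load_wine(return_X_y=True)
print("JMI-style forward selection:", forward_jmi(X, y, k=5))
```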
Kari Torkkola (A540AA@) and William M. Campbell (P27439@), Motorola Human Interface Lab, 2100 East Elliot Road, MD EL508, Tempe, AZ 85284, USA

Abstract

We present feature transformations useful for exploratory data analysis or for pattern recognition. Transformations are learned from example data sets by maximizing the mutual information between transformed data and their class labels. We make use of Renyi's quadratic entropy, and we extend the work of Principe et al. to mutual information between continuous multidimensional variables and discrete-valued class labels.

1. Introduction

Reducing the dimensionality of feature vectors is usually an essential step in pattern recognition tasks to make the task practical. Usually this is done by using domain knowledge or heuristics. Dimensionality reduction is also essential in exploratory data analysis, where the purpose often is to map data onto a low-dimensional space for human eyes to gain some insight into the data.

It is well known that principal component analysis (PCA) has nothing to do with discriminative features optimal for classification, since it is only concerned with the covariance of all data regardless of the class. However, it may be very useful in reducing noise in the data. So-called linear discriminant analysis (LDA) can be used to derive a discriminative transformation that is optimal for certain cases (Fukunaga, 1990). LDA finds the eigenvectors of $S_w^{-1} S_b$, where $S_b$ is the between-class covariance matrix and $S_w$ is the sum of within-class covariance matrices. The eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$ form the columns of the transformation matrix $W$, and new discriminative features $y$ are derived from the original ones $x$ simply by $y = W^T x$ (a small numerical sketch of this decomposition is given at the end of this section). This simple algebraic way of deriving the transformation matrix is both a strength and a weakness of the method. Since LDA makes use of only second-order statistical information, covariances, it is optimal for data where each class has a unimodal Gaussian density with well separated means. Also, the maximum rank of $S_b$ is $N_c - 1$, where $N_c$ is the number of different classes; thus LDA cannot produce more than $N_c - 1$ features. Although extensions have been proposed for the latter problem (e.g., see Okada & Tomita, 1985), the first one remains.

The purpose of this paper is to demonstrate that mutual information (MI) can act as a more general criterion that overcomes these limitations. It accounts for higher-order statistics, not just for second order, and it can also be used as the basis for non-linear transformations. A result by Fano (1961) exists which states that maximizing the mutual information between transformed data and their class labels achieves the lowest possible bound to the error of a classifier. We discuss this result in Section 3.

The reasons why mutual information is not in wider use currently lie in computational difficulties. The probability density functions of the variables are required, and MI involves integrating functions of those, which in turn involves evaluating them on a dense set, which leads to exponential computational complexity. Evaluating MI between two scalar variables is feasible through histograms, and this approach has found use in feature selection rather than in feature transformation (Battiti, 1994; Bonnlander & Weigend, 1994; Yang & Moody, 1999). However, greedy algorithms based on sequential feature selection using MI are suboptimal, as they fail to find a feature set that would jointly maximize the MI between the features and the labels. This failure is due to the sparsity of (any amount of) data in high-dimensional spaces for histogram-based MI estimation.
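For concreteness, here is a minimal numerical sketch of the LDA computation described above, the eigen-decomposition of $S_w^{-1} S_b$. The scatter-matrix scaling and the small ridge term added to keep $S_w$ invertible are my assumptions, not part of the method as stated in the text.

```python
import numpy as np

def lda_transform(X, labels, m):
    """LDA as described above: eigenvectors of inv(S_w) @ S_b form the
    columns of W.  X: (N, d) data, labels: (N,) class ids, m <= N_c - 1."""
    classes = np.unique(labels)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))   # sum of within-class scatter matrices
    S_b = np.zeros((d, d))   # between-class scatter matrix
    for c in classes:
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        S_w += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_all)[:, None]
        S_b += Xc.shape[0] * (diff @ diff.T)
    # small ridge term keeps S_w invertible on degenerate data (an assumption)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w + 1e-6 * np.eye(d), S_b))
    order = np.argsort(-eigvals.real)
    W = eigvecs[:, order[:m]].real
    return X @ W, W
```

Because $S_b$ has rank at most $N_c - 1$, only the first $N_c - 1$ eigenvalues are non-zero, which is exactly the limitation the paper sets out to remove.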
Feature selection through any other joint criterion also leads to a combinatorial explosion. In fact, for this very reason finding a transformation to lower dimensions might be easier than selecting features, given an appropriate objective function.

Principe, Fisher, and Xu (2000) have shown that using an unconventional definition of entropy, Renyi's entropy instead of Shannon's, can lead to expressions of mutual information with significant computational savings. They explored the mutual information between two continuous variables. We describe first the basis of their work, and then extend it to mutual information between continuous variables and discrete class labels. We use the criterion to learn linear dimension-reducing feature transformations with optimal discriminative ability, and we demonstrate the results with some well-known data sets both in visualization and pattern recognition applications.

2. Formalized Objective

Given a set of training data $\{x_i\}$ as samples of a continuous-valued random variable $X$, and class labels $\{c_i\}$ as samples of a discrete-valued random variable $C$, the objective is to find a transformation $y = g(w, x)$ (or its parameters $w$) that maximizes the mutual information $I(Y, C)$ between the transformed data $Y$ and the class labels $C$. To this end we need to express $I(Y, C)$ as a function of the data set in a differentiable form. Once that is done, we can perform gradient ascent on $I$ as follows:

$$w_{t+1} = w_t + \eta \frac{\partial I}{\partial w}. \qquad (1)$$

3. Mutual Information and Optimal Classification

The mutual information between the transformed data $Y$ and the class labels $C$ can be written as the divergence between the joint density and the product of the marginals,

$$I(Y, C) = \sum_c \int_y p(y, c) \log \frac{p(y, c)}{p(y)\,P(c)}\, dy, \qquad (4)$$

after applying the identities $p(y, c) = p(c \mid y)\,p(y)$ and $P(c) = \int_y p(y, c)\, dy$. Mutual information also measures independence between two variables, in this case between $Y$ and $C$. It equals zero when $p(y, c) = p(y)P(c)$, that is, when the joint density of $Y$ and $C$ factors (the condition for independence). However, practical evaluation of MI based on this expression is difficult.

3.2 Fano's Bound

What does mutual information have to do with optimal classification? The answer is given by Fano's (1961) inequality. This result determines a lower bound to the probability of error when estimating a discrete random variable $C$ from another random variable $Y$ as

$$P(\hat{c} \neq c) \geq \frac{H(C) - I(Y, C) - 1}{\log N_c}, \qquad (5)$$

where $\hat{c}$ is the estimate of $C$ after observing a sample of $Y$, which can be scalar or multivariate, and $N_c$ is the number of classes. Thus the lower bound on error probability is minimized when the mutual information between $Y$ and $C$ is maximized; in other words, finding such features achieves the lowest possible bound to the error of a classifier. Whether this bound can be reached or not depends on the goodness of the classifier.

4. Renyi's Entropy Reduces to Pairwise Interactions

Having established that mutual information is an optimal measure to learn feature transformations, we recapitulate the results of Principe, Fisher, and Xu (2000) of using Renyi's entropy combined with Parzen window density estimation to reduce computational complexity.

4.1 Renyi's Entropy

Instead of Shannon's entropy we apply Renyi's quadratic entropy, as described in (Principe et al., 2000), because of its computational advantages. For a discrete variable $C$ and for a continuous variable $Y$, Renyi's quadratic entropy is defined as

$$H_{R_2}(C) = -\log \sum_c P(c)^2, \qquad (6)$$
$$H_{R_2}(Y) = -\log \int_y p(y)^2\, dy. \qquad (7)$$

It has been shown that, for the purposes of finding the extrema, Renyi's entropy coincides with Shannon's entropy (Kapur, 1994). Later, in Section 5, following Principe et al. we develop expressions for quadratic mutual information based on Renyi's entropy measure.

4.2 Parzen Density Estimation

Whether based on Shannon's or Renyi's entropy, evaluating the MI requires estimates of the probability density functions of the variables involved. One nonparametric method for this estimation is the Parzen (1962) window method. This involves placing a kernel function on top of each sample and evaluating the density as the sum of the kernels.
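As a concrete illustration of Sections 4.1 and 4.2, the sketch below evaluates Renyi's quadratic entropy of a set of projected samples with an isotropic Gaussian Parzen estimate. The function names and the default kernel width are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gauss(d2, sigma2, dim):
    """Isotropic Gaussian kernel value for squared distances d2, covariance sigma2*I."""
    return np.exp(-0.5 * d2 / sigma2) / ((2.0 * np.pi * sigma2) ** (dim / 2.0))

def renyi_quadratic_entropy(Y, sigma=1.0):
    """H_R2(Y) = -log integral p(y)^2 dy with the Parzen estimate
    p(y) = (1/N) sum_i G(y - y_i, sigma^2 I).  By the Gaussian convolution
    property, the integral collapses to pairwise kernel evaluations:
    integral p^2 = (1/N^2) sum_{k,l} G(y_k - y_l, 2*sigma^2 I)."""
    N, d = Y.shape
    d2 = cdist(Y, Y, 'sqeuclidean')
    information_potential = gauss(d2, 2.0 * sigma ** 2, d).sum() / (N ** 2)
    return -np.log(information_potential)
```

In use, Y would be the projected samples, for example Y = X @ W; the key point of the section is that no explicit integration is needed, only the N-by-N matrix of pairwise kernel values.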
It turns out that Renyi's measure combined with the Parzen density estimation method using Gaussian kernels provides significant computational savings. The Gaussian kernel in $d$-dimensional space is defined as

$$G(y, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2} y^T \Sigma^{-1} y\right). \qquad (8)$$

Now, for two kernels the following holds:

$$\int_y G(y - a_1, \Sigma_1)\, G(y - a_2, \Sigma_2)\, dy = G(a_1 - a_2, \Sigma_1 + \Sigma_2). \qquad (9)$$

Thus the convolution of two Gaussians centered at $a_1$ and $a_2$ is a Gaussian centered at $a_1 - a_2$ with covariance equal to the sum of the original covariances. This property facilitates evaluating Renyi's quadratic entropy measure, which is a function of the square of the density function. Assume that the density is estimated as a sum of symmetric Gaussians, each centered at a sample $y_i$, as

$$\hat{p}(y) = \frac{1}{N} \sum_{i=1}^{N} G(y - y_i, \sigma^2 I),$$

so that $\int_y \hat{p}(y)^2\, dy = \frac{1}{N^2} \sum_{k=1}^{N} \sum_{l=1}^{N} G(y_k - y_l, 2\sigma^2 I)$, a sum over pairs of samples only.

5. Quadratic Mutual Information

Mutual information can also be understood as the Kullback-Leibler divergence between the joint density and the product of the marginal densities of the variables. In general, for two densities $f$ and $g$ the Kullback-Leibler divergence is written as $K(f, g) = \int f \log (f / g)$. Instead of this, Principe et al. write quadratic divergence measures based on the Euclidean distance between the densities and on the Cauchy-Schwartz inequality,

$$D_{ED}(f, g) = \int (f - g)^2, \qquad D_{CS}(f, g) = \log \frac{\int f^2 \int g^2}{\left(\int f g\right)^2}.$$

It is easy to see that both measures are always positive, and when $f = g$ both measures evaluate to zero. This justifies their use in minimizing the mutual information. Moreover, especially the Euclidean-distance measure is also suitable for maximizing the mutual information (Principe, personal communication, 1999), although a rigorous proof for this does not exist.

Since mutual information is expressed as the divergence between the joint density and the marginals, we can insert them into the two different quadratic divergence expressions. The Cauchy-Schwartz inequality results in the quadratic mutual information measure between the continuous variable $Y$ and the discrete variable $C$ given in (23) and (24). Let us make the following definitions of the quantities appearing in (23) and (24):

$$V_{IN} = \sum_c \int_y p(y, c)^2\, dy, \qquad V_{ALL} = \sum_c P(c)^2 \int_y p(y)^2\, dy, \qquad V_{BTW} = \sum_c P(c) \int_y p(y, c)\, p(y)\, dy. \qquad (25)$$

With these, the derivative of $I$ that is needed in (1) can be expressed through the derivatives of $V_{IN}$, $V_{ALL}$, and $V_{BTW}$. Since the joint density can be written as $p(y, c_p) = \frac{1}{N} \sum_{k \in c_p} G(y - y_k, \sigma^2 I)$ for each class $c_p$, and the density of all data is $p(y) = \frac{1}{N} \sum_{k=1}^{N} G(y - y_k, \sigma^2 I)$, we can now write the quantities in (25) using the set of samples in the transformed space by inserting these Parzen estimates into (25). Making use of the convolution property (9) and the pairwise expansion above, each quantity reduces to sums of pairwise kernel evaluations between the transformed samples. The derivative of such a quantity with respect to a sample represents a sum of forces that the other "particles", regardless of class, exert on that particle; a particle is denoted without its class index when the class is irrelevant, and the direction of each pairwise force is again towards the other particle.
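As a concrete illustration of how the quantities in (25) reduce to pairwise kernel evaluations, here is a minimal sketch. It assumes the Euclidean-distance combination I_ED = V_IN - 2*V_BTW + V_ALL and an isotropic kernel width sigma; these choices, the constants, and the function names are my reconstruction and may differ from the paper's equations (23)-(39).

```python
import numpy as np
from scipy.spatial.distance import cdist

def quadratic_mi(Y, labels, sigma=1.0):
    """Quadratic mutual information between projected samples Y (N, d) and
    integer class labels (N,), assuming I_ED = V_IN - 2*V_BTW + V_ALL with
    every V evaluated through pairwise Gaussian kernels G(y_k - y_l, 2*sigma^2 I)."""
    N, d = Y.shape
    classes, counts = np.unique(labels, return_counts=True)
    priors = counts / N                                     # P(c_p) = J_p / N
    K = np.exp(-cdist(Y, Y, 'sqeuclidean') / (4.0 * sigma ** 2))
    K /= (4.0 * np.pi * sigma ** 2) ** (d / 2.0)            # G(y_k - y_l, 2*sigma^2 I)

    v_in = sum(K[np.ix_(labels == c, labels == c)].sum() for c in classes) / N ** 2
    v_all = (priors ** 2).sum() * K.sum() / N ** 2
    v_btw = sum(p * K[labels == c, :].sum() for c, p in zip(classes, priors)) / N ** 2
    return v_in - 2.0 * v_btw + v_all
```

Such a scalar objective can be driven by gradient ascent on the transformation parameters, for example with numerical gradients over the projection matrix or, as the next section does analytically, over rotation angles.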
trigonometric functions in have been replaced by their derivatives.A gradient ascent algorithm for updatingis now(44)8.ExperimentsWe present experiments in two types of applications,data visualization and pattern recognition,with data sets that are available on the Internet.In the former application,a pro-jection is learned from a-dimensional feature space onto a plane for visualization purposes.In the latter,wefind a projection onto a range of dimensions between one and,and we compare the goodness of the MMI projec-tion to PCA and LDA projections by using two different pattern recognition methods,Learning Vector Quantization (LVQ)(Kohonen et al.,1992),and a polynomial classifier (Campbell&Assaleh,1999).8.1Visualization of Class Structure in DataThis example illustrates a projection from36-dimensional feature space onto two.1This is the Landsat satellite image database from UCI Machine Learning Repository.2The data has six classes,and we used1500of the training sam-ples.The LDA projection is presented in Figure1,and the MMI-projection in Figure2.LDA separates two of the classes very well but places the other four almost on top of each other.The criterion of LDA is a combination of representing each class as compactly as possible and as separated from each other as possible.The compromise in Figure1has achieved this:all classes are represented as quite compact clusters–unfortunately they are on top of each other.Two of the classes and a third cluster compris-ing of the four remaining classes are well separated.MMI has produced a projection that attempts to separate all of the classes.We can see that the four classes that LDA was not able to separate lie on a single continuum and blend into each other,and MMI has found a projection orthogonal to that continuum while still keeping the other two classes well separated.In this example,as well as in general,it is essential to use a kernel width that gives each sample an influence over every other sample.A simple rule that seems to work well is to take the distance of two farthest points in the output space, and use a kernel width of about half of that.The forces become more local if the kernel width is narrowed down.If the transformation has more degrees of freedom,this sug-gests a procedure where a narrow kernel could be used at the end tofine-tune the parameters of the transformation.8.2Classification ExperimentsClassification experiments were performed usingfive data sets very different in terms of dimensionality,number of classes,and the amount of data.The sets and some of their characteristics are presented in Table1.The Phoneme set is available with the LVQ1More examples and video clips illustrating convergence are available at /torkkola/mmi.html 2/mlearn/MLRepository.html3http://www.cis.hut.fi/research/software.shtml−3−2−1123−3−2−1123Figure 1.The Landsat image data set.LDA projection.Figure 2.The Landsat image data set.MMI projection.is available from the Aston University.4The rest of the data sets are from the UCI Machine Learning plete training sets were used to learn the MMI projec-tion with two exceptions.Some 1500random samples were used with the Landsat set,and 2000samples with the Let-ter set.This was done because of the computational com-plexity of of the training procedure,where is the number of samples.This is due to evaluation of pairwise distances in the output space.PCA and LDA projections were learned from the complete training sets each case.For each data set and each dimension the MMI projection was initialized with 
PCA,LDA (where possible),and three different random projections.After Levenberg-Marquardt optimization the projection with the highest mutual infor-mation was chosen as the final projection for the particular dimension.The learned transformation was then applied to the complete training and test sets to produce the training and test sets for the two classifiers.Data set Dim.Classes Train size Test sizeLVQ experiments were performed using packageLVQTable2.Accuracy on the Letter data set using LVQ classifier.Dim:PCA13.438.053.168.180.386.392.4MMI12349153641.281.585.887.889.490.390.4LDA65.182.086.486.287.689.590.4 Table4.Accuracy on the Phoneme data set using LVQ classifier.Dim:PCA5.166.074.780.282.886.090.0MMI1234571241.588.087.889.796.497.299.0LDA99.499.198.999.298.999.099.0 Table6.Accuracy on the Pima data set using LVQ classifier.Dim:PCA65.8------MMI123456856.377.375.874.677.777.778.1LDA80.580.979.380.579.379.378.1 Although we only describe linear transformations in this paper,nothing prevents applying this method to non-linear transformations,as long as they are parametrizable and dif-ferentiable.In fact,the potential of the method lies exactly here,since unlike LDA,the MMI criterion is readily appli-cable to non-linear cases.ReferencesBattiti,R.(1994).Using mutual information for selecting features in supervised neural net learning.Neural Net-works,5,537–550.Bonnlander,B.,&Weigend,A.S.(1994).Selecting in-put variables using mutual information and nonparamet-ric density estimation.Proceedings of the1994Inter-national Symposium on Artificial Neural Networks(pp. 42–50).Tainan,Taiwan.Campbell,W.M.,&Assaleh,K.T.(1999).Polynomial classifier techniques for speaker verification.Proceed-ings of the IEEE International Conference on Acoustics, Speech,and Signal Processing(pp.321–324).Phoenix, AZ,USA.Fano,R.(1961).Transmission of information:A statistical theory of communications.New York:Wiley. Fukunaga,K.(1990).Introduction to statistical pattern recognition(2nd edition).New York:Academic Press. Kapur,J.(1994).Measures of information and their appli-cations.New Delhi,India:Wiley.Kohonen,T.,Kangas,J.,Laaksonen,J.,&Torkkola,K. (1992).LVQ。