Locally weighted learning

合集下载

Active Learning Literature Survey

Active Learning Literature SurveyAnita KrishnakumarUniversity of California,Santa CruzDepartment of Computer Scienceanita@June05,2007AbstractThe most time consuming and expensive task in machine learning is the gatheringof labeled data to train the model or to estimate its parameters.In the real-worldscenario,the availability of labeled data is scarce and we have limited resources tolabel the abundantly available unlabeled data.Hence it makes sense to pick onlythe most informative instances from the unlabeled data and request an expert to pro-vide the label for that instance.Active learning algorithms aim at minimizing theamount of labeled data required to achieve the goal of the machine learning task inhand by strategically selecting the data instance to be labeled by the expert.A lot ofresearch has been conducted in this area over the past two decades leading to greatimprovements in performance of several existing machine learning algorithms andhas also been applied to diverseﬁelds like text classiﬁcation,information retrieval,computer vision and bioinformatics,to name a few.This survey aims at providingan insight into the research in this area and categorizes the diverse algorithms pro-posed based on main characteristics.We also provides a desk where different activelearning algorithms can be compared by evaluation on benchmark datasets.1IntroductionThe central goal of machine learning is to develop systems that can learn from experience or data and improve their performance at some task.In many natural learning tasks,this experience or data is gained interactively,by taking actions,making queries,or doing experiments.Most machine learning research,however,treats the learner as a passiverecipient of the data to be processed.This passive approach ignores the learner’s ability to interact with the environment and gather data.Active learning is the study of how to use this ability effectively.Active learning algorithms have been developed for classiﬁcation, regression and function optimization and is found to improve the predictive accuracy of several algorithms compared to passive learning.2Major ApproachesThe three major approaches to active learning algorithms are as follows.•Pool-based active learning:As introduced in Lewis and Gale(1994),the learner is provided with a pool of independent and identically distributed unlabeled instances.The active learner at each step chooses an unlabeled instance to request the label from the expert by means of a querying function.This is also called as selective sampling.•Stream-based active learning:The active learner,for example in Freund1997,is presented with a stream of unlabeled instances,from which the learner picks an instance for labeling by the expert.This can be visualized as online pool-based active learning.•Active learning with membership queries:Here,as described in Angluin1988,the active learner asks the expert to classify cases generated by the learning systems.The learner imposes values on the attributes for an instance and observes the re-sponse.This gives the learner theﬂexibility of framing the data instance that will be the most informative to the learner at that moment.3Characteristics of Active Learning AlgorithmsWe have taken into account several key features that have been addressed in many of the proposed active learning algorithms to compare the effect of each characteristic on the overall performance of the algorithm.3.1Ranker considerationActive learning algorithms may or may not depend on a ranker function to pick the train-ing instance for expert labeling.Several algorithms proposed use support vector machines (SVM),logistic regression,naive Bayes,neural networks,etc.In this section we look atactive learning algorithms from the perspective of the ranker function used in the data instance selection process.Lewis and Gale(1994)describe an uncertainty sampling method where the active learner selects instances whose class membership is most unclear to the learner.Different deﬁni-tions of uncertainty have been used,for example the Query-by-Committee algorithm by Seung et al.(1992),picks those examples for which the selected classiﬁers disagree,to be labeled by the expert.The authors suggest that their algorithm can be used with any classiﬁer that predicts a class as well as provides a probability estimate of the prediction certainty.Cohn et al.(1995)describe how optimal data selection techniques can be applied to statistically-based learning algorithms like a mixture of Gaussians and locally weighted regression.The algorithm selects instances that if labeled and added to the training set, minimizes the expected error on future test data.The authors show that the statistical models perform more efﬁciently and accurately than the feedforward neural networks. Similar querying functions have been proposed by Tong and Koller(2000),Campbell et al.(2000)and Schohn and Cohn(2000),called Simple which uses SVMs as the induction component.Here,the querying function is based on the classiﬁer.The algorithm tries to pick instances which are the most informative to the SVM-the support vectors of the dividing hyperplane.This can be thought of as uncertainty sampling where the algorithm selects those instances about which it is most uncertain.In the case of SVMs,the classiﬁer is most uncertain about the examples that are lying close to the margin of the dividing hyperplane.Variations of the Simple algorithm-MaxMin and Ratio methods have been proposed by Tong and Koller(2000),which also use SVM as the learner.Iyengar et al.(2000)present an active learning algorithm that uses adaptive resampling (ALAR)to select instances for expert labeling.In the work described,a probabilistic classiﬁer is usedﬁrst to determine the degree of uncertainty and then decision trees is used for classiﬁcation.The experiments considered use the ensemble of classiﬁcation models generated in each phase or a nearest neighbor(3-nn)as the probabilistic classiﬁer. Roy and McCallum(2001)describe a method to directly maximize the expected error rate reduction,by estimating the future error rate by a loss function.The loss functions help the learner to select those instances that maximize the conﬁdence of the learner about the unlabeled data.Rather than estimating the expected error on the full distribution,this algorithm estimates it over a sample in the pool.The authors base their class probability estimates and classiﬁcation on naive Bayes,however SVMs or other models with complex parameter space are also recommended.Baram et al.(2003)provide an implementation of this method on SVMs,andﬁnd it to be better than the original naive Bayes algorithm. Logistic regression is used to estimate the class probabilities in the SVM based algorithm. Baram et al.(2003)propose a simple heuristic based on“farthest-ﬁrst”travel sequences for active learning called Kernel Farthest-First(KFF).Here the active learner selects thatinstance which is farthest away from the current labeled set and can be applied with any classiﬁer learning algorithm.The authors present an application of KFF with an SVM. Mitra et al.(2004)present a probabilistic active learning strategy for support vector ma-chine learning.Identifying and labeling all the true support vectors guarantees low future error.The algorithm uses the k nearest neighbor algorithm to assign a conﬁdence factor c to all instances lying within the boundary,close to the actual support vectors and1−c to interior points which are far from the support vectors.The instances are then chosen probabilistically based on the conﬁdence factor.Nguyen and Smeulders(2004)offer a framework to incorporate clustering into active learning.A classiﬁer is developed based on the cluster representatives and a local noise model propagates the classiﬁcation decision to the other instances.The model assumes that given the cluster label,the class label of data instance can be inferred.The logistic regression discriminative model is used to estimate the class probability and an isotropic Gaussian model is used to estimate the noise distribution,to propagate information of label from the representatives to the remaining data.A coarse-to-ﬁne strategy is used to adjust the balance between the advantage of large clusters and the accuracy of the data representation.Osugi et al.(2005)propose an active learning algorithm that balances the exploration and exploitation while selecting a new instance for labeling by the expert at each step. The algorithm randomly chooses between exploration and exploitation at each round and receives feedback on the effectiveness of the exploration step,based on the performance of the classiﬁer trained on the explored instance.The active learner updates the probability of exploring in the next phases based on the feedback.The algorithm chooses between the active learners KFF(which explores)and Simple(which exploits),using SVM light as the classiﬁer,with a probability p.If the exploration is a success,resulting in a change in the current hypothesis,then p is maintained with a high value,encouraging more exploration, else p is updated to reduce the probability of exploration.3.2Computational complexityComputational cost is an important factor to be considered in an algorithm.If an algorithm is computationally very expensive,the algorithm might be infeasible for certain real-world applications which are time sensitive.In this section,we consider the time complexities of the above discussed active learning algorithms inﬁnding an optimal instance for labeling. The active learning algorithms with mixture of Gaussians and locally weighted regres-sion proposed by Cohn et al.(1995)performs more effectively than the feedforward neural networks where computing variance estimate and re-training is computationally very expensive.With the mixture of Gaussians,training depends linearly on the number of data instances,but prediction time is independent.On the other hand,for a memory-based model like locally weighted regression,there is no training time,but prediction costs exist.However,both can be enhanced by optimized parallel implementations.The authors Tong and Koller(2000)suggest that the Simple margin active learning al-gorithm is computationally quite efﬁcient.However,improvement gains can be obtained by querying multiple instances at a time as suggested in Lewis and Gale(1994).But the MaxMin and Ratio methods are computationally very expensive.Active learning with adaptive resampling(Iyengar et al.,2000)is computationally very expensive because of the decision trees used in the classiﬁcation phase.The number of phases,number of points chosen to be labeled and number of adaptive resampling rounds also add to the computational cost,hence the authors have chosen these parameters based on computational complexity and accuracy considerations.The computational complexity to implement the algorithm proposed in Roy and McCal-lum(2001)is described as“hopelessly inefﬁcient”.However,various heuristics approxi-mations and optimizations have been suggested.Some of these approximations are gen-eral and some for the speciﬁc implementation by the authors on naive Bayes.The computational complexity of Kernel Farthest-First algorithm of Baram et al.(2003) is quite similar to the Simple margin algorithm.Simple computes the dot product for every unlabeled instance which takes the same time as computing the argmax for KFF.The probabilistic active support vector learning algorithm proposed by Mitra et al.(2004) is computationally more efﬁcient than even the Simple margin algorithm by Tong and Koller(2000),as presented in the comparison by the authors.The active learning using pre-clustering algorithm proposed by Nguyen and Smeulders (2004),use the K-medoid algorithm for clustering as it captures data representation bet-ter.But the K-medoid algorithm is computationally very expensive when the number of clusters or data points is large.However,some simpliﬁcations have been presented to reduce the computational cost.The algorithm proposed by Osugi et al.(2005),uses the active learners,Simple or KFF. Hence the time complexity depends on those algorithms.Simple and KFF have similar time complexities and hence,this algorithm has the sample complexity of those algo-rithms per round.3.3DensityIn real-world applications,the data under consideration might have skewed class distri-butions.Some classes might have larger number of samples,and hence a higher density than the other classes.Some classes might have very few instances in the dataset andhence a low density.In this section we analyze whether the active learning algorithms in discussion consider the density of the classes in the dataset,while selecting instances for labeling.The statistical models of Cohn et al.(1995)selects that instance that minimizes the ex-pectation of the learner’s variance on future test set.The instance selection method is independent of density considerations.The Simple algorithm of Tong and Koller(2000),Campbell et al.(2000)and Schohn and Cohn(2000),picks those instances which are close to the dividing hyperplane.Density of the class distribution is ignored here.The ALAR algorithm of Iyengar et al.(2000)also selects instances for expert labeling ignoring the density of samples,as it only considers the degree of uncertainty of the classiﬁer.The algorithm of Roy and McCallum(2001)queries for instances that provide maximal reassurance of the current model.Hence,it does not depend on the class density distribu-tion.The KFF algorithm of Baram et al.(2003)selects those instances that are the farthest from the current set of labeled instances,which does not really take into account the density of samples.The probabilistic active support vector learning algorithm by Mitra et al.(2004)does not take into account the density while querying for instances.The data selection criterion of the active learning algorithm with pre-clustering by Nguyen and Smeulders(2004),gives priority to samples which are cluster representatives,and chooses the ones belonging to high density clustersﬁbeling of high density clusters contribute to a substantial move of the classiﬁcation boundary,and hence the algorithm clusters the data into large clusters initially.And once the classiﬁcation boundary between the large clusters have been obtained,the parameters are adjusted to obtainﬁner clustering for a more accurate classiﬁcation boundary.The active learning algorithm proposed by Osugi et al.(2005)explores for new instances using KFF,which does not consider the density of the class distribution.The exploitation phase uses the Simple algorithm which again does not consider the density of samples.3.4DiversitySome active learning algorithms can have added advantage if they take into account the diverse nature of instances in the dataset.The classiﬁer developed will perform well when trained with dataset that has different kinds of samples that represent the entireRanker Computational Density DiversitycomplexityAlgorithm1 x xAlgorithm2 x x xAlgorithm3 x xAlgorithm4 x xAlgorithm5x x xAlgorithm6 x xAlgorithm7Algorithm8x x xTable1:Summary of characteristics of the active learning algorithms in study distribution.The algorithms by Cohn et al.(1995),Tong and Koller(2000),Campbell et al.(2000)and Schohn and Cohn(2000)do not consider the diversity of the samples in the labeled set used for training the classiﬁer.They just select the examples that optimize their criterion, which is minimizing the variance in Cohn et al.’s model and choosing the most unclear example to the classiﬁer in the Simple algorithm.The ALAR algorithm by Iyengar et al.(2000)and the algorithm by Roy and McCallum (2001)also do not select instances based on their diversity.The KFF algorithm by Baram et al.(2003)select those instances that are the farthest from the given set of labeled examples.Intuitively,this picks the instance from the unla-beled which is most dissimilar to the current set of labeled examples used for training the classiﬁer.The probabilistic algorithm by Mitra et al.(2004)selects samples that are far from the current boundary with a conﬁdence factor c.This kind of helps the active learner to pick instances in the dataset that are diverse in nature.The active-learning algorithm by Nguyen and Smeulders(2004),selects diverse samples as it gives priority to samples which are cluster representatives,and each cluster represents a different group of data instances.The algorithm by Osugi et al.(2005),uses KFF in the exploration phase,which considers the diversity of the dataset while selecting the next instance for expert labeling.3.5Close to boundaryInstances lying close to the decision boundary generally contribute to the accuracy of the classiﬁer,as in the case of support vector machines.Hence,those samples lying close to the boundary convey a lot of information regarding the underlying class distribution.In this section we see if the algorithms under study consider the instances lying close to the boundary for expert labeling.The statistical algorithms of Cohn et al.(1995)selects instances that minimize the vari-ance of the learner on the future dataset,and this might be equivalent to picking those instances close to the current decision boundary.The Simple algorithm described queries for instances that the learner is most uncertain about and this leads to choosing samples that are close to the classiﬁer’s decision bound-ary.The ALAR algorithm using3-nn classiﬁer for theﬁrst task of the determining the degree of uncertainty,minimizes the cumulative error by choosing the instances that are misclas-siﬁed by the classiﬁer algorithm in the second task,given the actual labels are those given by theﬁrst algorithm.This chooses the samples that are close to the current decision boundary.The active learning algorithm by Roy and McCallum(2001)tries to pick samples that provide maximum reassurance of the model and hence does not pick examples close to the boundary.The KFF algorithm also does not choose samples close to the boundary,it queries for the sample farthest from current labeled training set.The algorithm by Mitra et al.(2004)tries toﬁnd the support vectors of dividing hyper-plane and hence considers the samples lying close to the decision boundary.The active learning with pre-clustering algorithm,tries to minimize the current error of the classiﬁer.This leads to choosing those samples lying on the current classiﬁcation boundary as they contribute the largest to the current error.The algorithm by Osugi et al.(2005)queries for the samples lying close to the boundary during the exploitation phase,using the Simple active learning algorithm.3.6Far from boundarySome active learning algorithms might query for instances that are far from the current decision boundary,as these examples can help to reassure the model as well as give a chance to explore new instances which might be very informative.The algorithm by Cohn et al.(1995),the Simple algorithm,the ALAR algorithm and theKFF do not query for samples lying far from the current decision boundary.The algorithm proposed by Roy and McCallum(2001)queries for examples that reduce the future generalization error probability.This leads to picking examples that reassure the current model and the samples lying far from the boundary are chosen by the algorithm as the learner is most sure about the labels for those samples.The active learning algorithm by Mitra et al.(2004)queries for samples lying far from the boundary using the conﬁdence factor c,which varies adaptively with iteration.The algorithm proposed by Nguyen and Smeulders(2004)picks instances that lie close to the boundary for expert labeling,not the ones far away as they do not contribute towards the current error of the classiﬁer.The algorithm by Osugi et al.(2005)queries for the samples lying far from the boundary during the exploration phase using the KFF algorithm.3.7Probabilistic or uncertainty of rankerHere we consider whether the ranker used in the active learning algorithm is probabilistic and whether the uncertainty of the ranker is used to query for new instances for expert labeling.Most algorithms studied here in this survey use a probabilistic ranker or uncer-tainty of the ranker to pick the samples for labeling.The active learning algorithms with statistical models by Cohn et al.(1995)uses prob-abilistic measures for determining variance estimates.The Simple active learning algo-rithm use uncertainty sampling to pick the instance that is most unclear to the learner. The ALAR algorithm that uses the ensemble of classiﬁers generated in the second task is probabilistic and chooses those samples that are misclassiﬁed by the learner.In the algo-rithm by Roy and McCallum(2001)uses probabilistic estimates using logistic regression to calculated the expected log-loss.The KFF algorithm does not depend on the probabilistic or uncertainty of the ranker. The algorithm of Mitra et al.(2004)uses the conﬁdence factor c to pick the samples for labeling which is correlated with the selected samples in the labeled training set.The algorithm of Nguyen and Smeulders(2004)uses logistic regression with a proba-bilistic framework and also employs soft cluster membership to choose the sample for labeling.The algorithm by Osugi et al.(2005)considers probability measures for exploring based on feedback.3.8MyopicAn algorithm has a myopic approach when it greedily choose for instances that optimize the criterion at that instance(locally),instead of considering a globally-optimal solution. Most active learning algorithms choose a myopic approach as the learner thinks that the instance it selects for the expert to label,is the last instance that the expert is available for labeling.Most of the algorithms considered in this study adopt a myopic approach as they try to select the instance that optimizes the performance of the current classiﬁer on the future test set.This is a major limitation in case of greedy margin based methods as the algo-rithm never explores if the examples lying far from the current decision boundary have more information to convey regarding the class distribution,which helps the classiﬁer to become more effective.The statistical algorithm of Cohn et al.(1995)queries for the instance that minimizes the expected error of the model by minimizing its variance,which is actually myopic in approach.Similarly,the Simple algorithm by Tong and Koller(2000),Campbell et al. (2000)and Schohn and Cohn(2000)query for the instance that current classiﬁer is most unclear about,at each iteration.Roy and McCallum(2001)also choose the instance that reassures the current model.The KFF algorithm by Baram et al.(2003)also chooses the example that is most different from the current dataset.However,some of these active learning algorithms select multiple instances to request label from the expert at each iteration instead of choosing just one instance.For example, the ALAR algorithm by Iyengar et al.(2000)and the probabilistic algorithm of Mitra et al.(2004)allows the learner to query the expert for labels of multiple samples at each instance.However this does not exactly globally-optimize the problem in hand.The algorithm of Nguyen and Smeulders(2004)gives priority to the cluster represen-tatives for labeling after an initial clustering.It also prioritizes examples from the high density clustersﬁrst for labeling.This gives the algorithm a kind of approach for global optimization by choosing diverse samples and high density cluster samples initially.But in the later stages the proposed method chooses those instances that contribute the largest to the current error.The algorithm by Osugi et al.(2005)also gives importance to exploring the dataset with a probability p,apart from just optimizing the current criterion.A high value of p encour-ages exploration and is maintained with a high value if the current hypothesis changes. Otherwise it is updated to reduce the probability of exploration at the next step.The value of p has an upper and lower bound,and hence there is always a chance of exploring and exploiting.Close to Far from Probabilistic/Myopicboundary boundary uncertaintyof rankerAlgorithm1 xAlgorithm2 xAlgorithm3 x not clearAlgorithm4xAlgorithm5x x xAlgorithm6 not clearAlgorithm7 x xAlgorithm8 xTable2:Summary of characteristics of the active learning algorithms in studyAlgorithm Authors Name1Cohn et al.(2000)Active learning with statistical models2Tong and Koller(2000)Simple MarginCampbell et al.(2000)Query learning with large margin classiﬁersSchohn and Cohn(2000)Less is More:Active learning with supportvector machines3Iyengar et al.(2000)Active learning with adaptive resampling4Roy and McCallum(2001)Active learning with Samplingestimation of error reduction5Baram et al.(2003)Kernel Farthest First6Mitra et al.(2004)Probabilistic active support vectorlearning algorithm7Nguyen and Smeulders(2004)Active learning with pre-clustering8Osugi et al.(2005)Balancing exploration and exploitationalgorithm for active learningTable3:Key to Table1&24ConclusionActive learning enables the application of machine learning methods to problems where it is difﬁcult or expensive to acquire expert labels.Empirical results presented in the studied research papers indicate that active-learning based classiﬁers perform better than passive ones.In this paper we have presented a survey of several state-of-the-art active learning algorithms as well as the most popular ones.A detailed analysis of each algorithm has been made based on the characteristics of each algorithm,which gives an insight into the features each active learning algorithm considers while querying for instances for expert labeling.References[1]A NGLUIN,D.Queries and concept learning.Mach.Learn.2,4(1988),319–342.[2]B ARAM,Y.,E L-Y ANIV,R.,AND L UZ,K.Online choice of active learning algo-rithms,2003.[3]C AMPBELL,C.,C RISTIANINI,N.,AND S MOLA,A.Query learning with largemargin classiﬁers.In Proc.17th International Conf.on Machine Learning(2000), Morgan Kaufmann,San Francisco,CA,pp.111–118.[4]C OHN,D.A.,G HAHRAMANI,Z.,AND J ORDAN,M.I.Active learning withstatistical models.In Advances in Neural Information Processing Systems(1995),G.Tesauro,D.Touretzky,and T.Leen,Eds.,vol.7,The MIT Press,pp.705–712.[5]F REUND,Y.,S EUNG,H.S.,S HAMIR,E.,AND T ISHBY,N.Selective samplingusing the query by committee algorithm.Machine Learning28,2-3(1997),133–168.[6]I YENGAR,V.S.,A PTE,C.,AND Z HANG,T.Active learning using adaptive resam-pling.In KDD’00:Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining(New York,NY,USA,2000),ACM Press, pp.91–98.[7]L EWIS,D.D.,AND G ALE,W.A.A sequential algorithm for training text clas-siﬁers.In SIGIR’94:Proceedings of the17th annual international ACM SIGIR conference on Research and development in information retrieval(New York,NY, USA,1994),Springer-Verlag New York,Inc.,pp.3–12.[8]M ITRA,P.,M URTHY,C.,AND P AL,S.K.A probabilistic active support vectorlearning algorithm.IEEE Transactions on Pattern Analysis and Machine Intelli-gence26,3(2004),413–418.[9]N GUYEN,H.T.,AND S MEULDERS,A.Active learning using pre-clustering.InICML’04:Proceedings of the twenty-ﬁrst international conference on Machine learning(New York,NY,USA,2004),ACM Press,p.79.[10]O SUGI,T.,K UN,D.,AND S COTT,S.Balancing exploration and exploitation:Anew algorithm for active machine learning.In ICDM’05:Proceedings of the Fifth IEEE International Conference on Data Mining(Washington,DC,USA,2005), IEEE Computer Society,pp.330–337.[11]R OY,N.,AND M C C ALLUM,A.Toward optimal active learning through samplingestimation of error reduction.In Proc.18th International Conf.on Machine Learn-ing(2001),Morgan Kaufmann,San Francisco,CA,pp.441–448.[12]S CHOHN,G.,AND C OHN,D.Less is more:Active learning with support vectormachines.In Proc.17th International Conf.on Machine Learning(2000),Morgan Kaufmann,San Francisco,CA,pp.839–846.[13]S EUNG,H.S.,O PPER,M.,AND S OMPOLINSKY,H.Query by committee.InComputational Learning Theory(1992),pp.287–294.[14]T ONG,S.,AND K OLLER,D.Support vector machine active learning with applica-tions to text classiﬁcation.In Proceedings of ICML-00,17th International Confer-ence on Machine Learning(Stanford,US,2000),ngley,Ed.,Morgan Kaufmann Publishers,San Francisco,US,pp.999–1006.。

人工智能技术与深度学习测试选择题 64题

1. 人工智能的定义是什么？A. 模拟人类智能的机器B. 计算机科学的一个分支C. 机器学习的一个子集D. 数据分析的工具2. 深度学习是基于什么理论的？A. 符号主义B. 连接主义C. 行为主义D. 认知主义3. 以下哪项不是深度学习的主要应用领域？A. 图像识别B. 自然语言处理C. 数据挖掘D. 网络安全4. 卷积神经网络（CNN）主要用于什么类型的数据？A. 文本数据B. 图像数据C. 音频数据D. 时间序列数据5. 循环神经网络（RNN）适用于处理哪类数据？A. 静态图像B. 序列数据C. 表格数据D. 随机数据6. 以下哪项不是激活函数的类型？A. SigmoidB. ReLUC. TanhD. Lasso7. 损失函数在深度学习中的作用是什么？A. 优化网络结构B. 评估模型预测的准确性C. 调整学习率D. 增加网络层数8. 梯度下降法是用来做什么的？A. 计算梯度B. 最小化损失函数C. 增加网络复杂度D. 减少数据维度9. 什么是过拟合？A. 模型在训练数据上表现不佳B. 模型在测试数据上表现不佳C. 模型在训练数据上表现良好，但在新数据上表现不佳D. 模型在所有数据上表现一致10. 正则化技术主要用来解决什么问题？A. 欠拟合B. 过拟合C. 数据不平衡D. 数据缺失11. 以下哪项不是常见的正则化方法？A. L1正则化B. L2正则化C. DropoutD. Boosting12. 什么是迁移学习？A. 在不同任务间共享知识B. 在同一任务上使用不同模型C. 在不同数据集上训练同一模型D. 在同一数据集上训练不同模型13. 以下哪项不是迁移学习的优势？A. 减少训练时间B. 提高模型泛化能力C. 需要大量标注数据D. 利用预训练模型14. 什么是强化学习？A. 通过反馈优化模型B. 通过梯度下降优化模型C. 通过标签数据优化模型D. 通过无监督学习优化模型15. 强化学习中的“智能体”是什么？A. 环境B. 策略C. 状态D. 动作16. 以下哪项不是强化学习中的关键元素？A. 状态B. 动作C. 奖励D. 损失函数17. 什么是Q学习？A. 一种监督学习算法B. 一种无监督学习算法C. 一种强化学习算法D. 一种迁移学习算法18. 深度强化学习结合了哪两种技术？A. 深度学习和监督学习B. 深度学习和无监督学习C. 深度学习和强化学习D. 深度学习和迁移学习19. 以下哪项不是深度强化学习的应用？A. 游戏AIB. 自动驾驶C. 图像分类D. 机器人控制20. 什么是生成对抗网络（GAN）？A. 一种监督学习模型B. 一种无监督学习模型C. 一种强化学习模型D. 一种迁移学习模型21. GAN中的生成器和判别器的角色分别是什么？A. 生成器生成数据，判别器评估数据B. 生成器评估数据，判别器生成数据C. 生成器和判别器都生成数据D. 生成器和判别器都评估数据22. 以下哪项不是GAN的应用？A. 图像生成B. 文本生成C. 数据增强D. 模型压缩23. 什么是变分自编码器（VAE）？A. 一种监督学习模型B. 一种无监督学习模型C. 一种强化学习模型D. 一种迁移学习模型24. VAE的主要用途是什么？A. 数据压缩B. 数据生成C. 数据分类D. 数据增强25. 以下哪项不是VAE的优势？A. 生成高质量图像B. 可解释性高C. 训练稳定D. 计算效率高26. 什么是注意力机制？A. 一种数据预处理方法B. 一种模型优化方法C. 一种模型结构设计D. 一种数据增强方法27. 注意力机制在自然语言处理中的主要作用是什么？A. 提高模型计算效率B. 增强模型对关键信息的捕捉C. 减少模型参数数量D. 增加模型复杂度28. 以下哪项不是注意力机制的应用？A. 机器翻译B. 图像识别C. 语音识别D. 数据清洗29. 什么是Transformer模型？A. 一种基于RNN的模型B. 一种基于CNN的模型C. 一种基于注意力机制的模型D. 一种基于强化学习的模型30. Transformer模型的主要优势是什么？A. 并行计算能力强B. 序列依赖性强C. 计算复杂度低D. 模型参数少31. 以下哪项不是Transformer模型的应用？A. 文本分类B. 图像生成C. 机器翻译D. 语音识别32. 什么是BERT模型？A. 一种基于RNN的模型B. 一种基于CNN的模型C. 一种基于Transformer的模型D. 一种基于强化学习的模型33. BERT模型的主要创新点是什么？A. 双向编码器B. 单向编码器C. 多任务学习D. 迁移学习34. 以下哪项不是BERT模型的应用？A. 问答系统B. 情感分析C. 图像识别D. 命名实体识别35. 什么是GPT模型？A. 一种基于RNN的模型B. 一种基于CNN的模型C. 一种基于Transformer的模型D. 一种基于强化学习的模型36. GPT模型的主要特点是什么？A. 单向语言模型B. 双向语言模型C. 多任务学习D. 迁移学习37. 以下哪项不是GPT模型的应用？A. 文本生成B. 代码生成C. 图像生成D. 对话系统38. 什么是自监督学习？A. 使用标签数据进行训练B. 不使用标签数据进行训练C. 使用部分标签数据进行训练D. 使用全部标签数据进行训练39. 自监督学习的主要优势是什么？A. 减少标注数据需求B. 提高模型准确性C. 增加模型复杂度D. 减少模型参数40. 以下哪项不是自监督学习的应用？A. 图像分类B. 语音识别C. 文本生成D. 数据清洗41. 什么是元学习？A. 学习如何学习B. 学习如何分类C. 学习如何生成D. 学习如何识别42. 元学习的主要目标是什么？A. 提高模型准确性B. 减少训练时间C. 提高模型泛化能力D. 增加模型复杂度43. 以下哪项不是元学习的应用？A. 小样本学习B. 持续学习C. 迁移学习D. 数据清洗44. 什么是小样本学习？A. 使用大量数据进行训练B. 使用少量数据进行训练C. 不使用数据进行训练D. 使用随机数据进行训练45. 小样本学习的主要挑战是什么？A. 数据量过大B. 数据量过小C. 数据质量差D. 数据分布不均46. 以下哪项不是小样本学习的应用？A. 图像识别B. 语音识别C. 文本分类D. 数据清洗47. 什么是持续学习？A. 一次性学习所有知识B. 分阶段学习知识C. 不断学习新知识D. 不学习新知识48. 持续学习的主要挑战是什么？A. 知识遗忘B. 知识过载C. 知识不一致D. 知识重复49. 以下哪项不是持续学习的应用？A. 自动驾驶B. 机器人控制C. 数据清洗D. 自然语言处理50. 什么是知识蒸馏？A. 减少模型参数B. 增加模型参数C. 提高模型准确性D. 降低模型复杂度51. 知识蒸馏的主要目标是什么？A. 提高模型性能B. 减少模型大小C. 增加模型速度D. 降低模型能耗52. 以下哪项不是知识蒸馏的应用？A. 模型压缩B. 模型加速C. 模型增强D. 数据清洗53. 什么是模型压缩？A. 增加模型参数B. 减少模型参数C. 提高模型准确性D. 降低模型复杂度54. 模型压缩的主要方法是什么？A. 剪枝B. 量化C. 蒸馏D. 以上都是55. 以下哪项不是模型压缩的应用？A. 移动设备部署B. 云计算部署C. 边缘计算部署D. 数据清洗56. 什么是模型量化？A. 将模型参数转换为整数B. 将模型参数转换为浮点数C. 将模型参数转换为字符串D. 将模型参数转换为布尔值57. 模型量化的主要优势是什么？A. 提高模型准确性B. 减少模型大小C. 增加模型速度D. 降低模型能耗58. 以下哪项不是模型量化的应用？A. 移动设备部署B. 云计算部署C. 边缘计算部署D. 数据清洗59. 什么是模型剪枝？A. 增加模型参数B. 减少模型参数C. 提高模型准确性D. 降低模型复杂度60. 模型剪枝的主要方法是什么？A. 按权重剪枝B. 按梯度剪枝C. 按激活剪枝D. 以上都是61. 以下哪项不是模型剪枝的应用？A. 移动设备部署B. 云计算部署C. 边缘计算部署D. 数据清洗62. 什么是模型加速？A. 增加模型参数B. 减少模型参数C. 提高模型速度D. 降低模型复杂度63. 模型加速的主要方法是什么？A. 硬件优化B. 软件优化C. 算法优化D. 以上都是64. 以下哪项不是模型加速的应用？A. 移动设备部署B. 云计算部署C. 边缘计算部署D. 数据清洗答案1. A2. B3. D4. B5. B6. D7. B8. B9. C10. B11. D12. A13. C14. A15. D16. D17. C18. C19. C20. B21. A22. D23. B24. B25. B26. C27. B28. D29. C30. A31. B32. C33. A34. C35. C36. A37. C38. B39. A40. D41. A42. C43. D44. B45. B46. D47. C48. A49. C50. A51. B52. D53. B54. D55. D56. A57. B58. D59. B60. D61. D62. C63. D64. D。

NIR英文术语

英文缩写中文absorbance A 吸光度acousto optic tunable filter AOTF 声光可调滤光器advance process control APC 先进过程控制系统american society for testing and materials ASTM 美国材料试验协会标准artificial neural networks ANN 人工神经网络back propagation BP-ANN 反向传输神经网络calibration transfer CT 模型传递技术charge coupled device CCD 电荷耦合器件cross validation CV 交互验证方法direct standardisation DS 直接校正算法distributed control system DCS 分布式控制系统elimination of uninformative variable UVE 无信息变量消除法euclidean distance ED 欧式距离finite impulse filter FIR 有脉冲响应办法first derivative 1ST DER 一阶微分fourier transform FT 傅里叶变换fourier transform near infrared FT-NIR 傅里叶变换近红外fuzzy C-means clustering FCME 模糊C-均值聚类gas chromatography GC 气相色谱genetic algorithms GA 遗传算法global calibration model GCM 全局校正模型hierarchical cluster analysis HCA 系统聚类分析方法hybrid calibration model HCM 混合校正模型kennard-stone K-S K-S标样选取方法k-nearest neighbor KNN K-最近邻法linear discriminant analysis LDA 线性判别分析linear learning machine LLM 线性学习机light emitting diode LED 发光二极管locally weighted regression LWR 局部权重回归ling wavelength NIR LW-NIR 长波近红外mahalanobis distance MD 马氏距离Mid-infrared MIR 中红外model updating MU 模型更新multiple linear regression MLR 多元线性回归multiplicative scatter correction MSC 多元散射校正near infrared spectroscopy NIR 近红外光谱nearest neighbor distance NND 最邻近距离net analyte signal NAS 净分析信号number aperture NA 数值孔径orthogonal signal correction OSC 正交信号校正partial least squares PLS 偏最小二乘photodiode array PDA 光电二极管阵列piecewise direct standardisation PDS 分段直接校正算法prediction residual error sum of squares PRESS 预测残差平方和principal component analysis PCA 主成分分析principal component regression PCR 主成分回归process analytical chemistry PAC 过程分析化学root mean square error of cross validaton RMSECV 交互验证均方根偏差root mean square error of prediction RMSEP 预测均方根偏差root of mean sum squared residuals RMSSR 光谱残差均方根sample conditioning system SCS 样品预处理系统sample recovery system SRS 样品回收系统second derivative 2nd Der 二姐微分short wavelength near-infrared SW-NIR 短波近红外signal-to-noise ratio S/N 信噪比simulated annealing SA 模拟退火算法slope/bias S/B 斜率偏差算法soft independent modeling of class analogy SIMCA 簇类的独立软模式方法standard error of calibration SEC 校正标准偏差standard error of cross validation SECV 交互验证标准偏差standard error of prediction SEP 预测标准偏差standard normal variate SNV 标准正交变量变换stepwise regressiion analysis SRA 逐步回归分析法support vector machines SVM 支持向量机方法topology TP 拓扑方法ultraviolet-visible UV-VIS 紫外可见光谱wavelet transform WT 小波变换monochromator 单色器grating 光栅detector 检测器collimating mirror 准直镜平行光镜optical bench 光学台resolution 分辨率diffraction gratings 衍射光栅band-width 带宽slit 狭缝A spectroscopic instrument generally consists of entrance slit,collimator, a dispersive element, such as a grating or prism,focusing optics and detector. In a monochromator system there is normally also an exit slit, and only a narrow portion of the spectrum is projected on a one-element detector. In monochromators the entrance and exit slits are in a fixed position and can be changed in width. Rotating the grating scans the spectrum.Development of micro-electronics during the 90’s in the field ofmulti-element optical detectors, such as Charged Coupled Devices (CCD) Arrays and Photo Diode (PD) Arrays。

深度学习模型蒸馏在计算机视觉中的应用与优化

深度学习模型蒸馏在计算机视觉中的应用与优化深度学习模型蒸馏（Distillation）是一种迁移学习的技术，通过将一个复杂的模型的知识转移到另一个较简单的模型中，以提升后者的性能。

在计算机视觉领域，深度学习模型蒸馏被广泛应用于图像分类、目标检测和图像生成等任务中。

本文将介绍深度学习模型蒸馏的原理、应用和优化方法。

一、深度学习模型蒸馏原理深度学习模型蒸馏的核心思想是将一个复杂的模型（教师模型）的知识转移到另一个较简单的模型（学生模型）中，以提升学生模型的性能。

蒸馏的过程可以看作是一种知识传递，教师模型通过输出的软标签（概率分布）传递给学生模型。

学生模型在学习的过程中，不仅要拟合真实标签，还要尽量接近教师模型输出的软标签。

通过这种方式，学生模型可以利用教师模型的知识，从而在性能上超过仅使用真实标签进行训练的模型。

二、深度学习模型蒸馏的应用1. 图像分类深度学习模型蒸馏在图像分类任务中有着广泛的应用。

通常情况下，教师模型是一个较复杂的深度神经网络，可以在大规模的数据集上进行训练，获得较高的准确率。

学生模型则可以是一个较简单的模型，例如浅层神经网络或者线性模型。

通过蒸馏的过程，学生模型可以学习到教师模型的知识，并在准确率和计算效率上取得平衡。

2. 目标检测深度学习模型蒸馏也可以应用于目标检测任务中。

在目标检测中，通常需要同时进行目标的分类和位置回归。

教师模型可以是一个复杂的目标检测器，例如Faster R-CNN或YOLO，而学生模型可以是一个轻量级的目标检测器，例如SSD或MobileNet。

通过蒸馏的过程，学生模型可以学习到教师模型在目标分类和位置回归上的经验，从而提升性能。

3. 图像生成在图像生成任务中，深度学习模型蒸馏也可应用于生成模型（例如生成对抗网络GAN）的训练中。

教师模型可以是一个复杂的生成模型，例如DCGAN或WGAN-GP，而学生模型可以是一个轻量级的生成模型，例如Variational Autoencoder（VAE）。

基于改进YOLOv5的皮革抓取点识别及定位

doi:10.19677/j.issn.1004-7964.2024.01.005基于改进YOLOv5的皮革抓取点识别及定位金光，任工昌*，桓源，洪杰（陕西科技大学机电工程学院，陕西西安710021）摘要：为实现机器人对皮革抓取点的精确定位，文章通过改进YOLOv5算法，引入coordinate attention 注意力机制到Backbone 层中，用Focal-EIOU Loss 对CIOU Loss 进行替换来设置不同梯度，从而实现了对皮革抓取点快速精准的识别和定位。

利用目标边界框回归公式获取皮革抓点的定位坐标，经过坐标系转换获得待抓取点的三维坐标，采用Intel RealSense D435i 深度相机对皮革抓取点进行定位实验。

实验结果表明：与Faster R-CNN 算法和原始YOLOv5算法对比，识别实验中改进YOLOv5算法的准确率分别提升了6.9%和2.63%，召回率分别提升了8.39%和2.63%，mAP 分别提升了8.13%和0.21%；定位实验中改进YOLOv5算法的误差平均值分别下降了0.033m 和0.007m，误差比平均值分别下降了2.233%和0.476%。

关键词:皮革；抓取点定位；机器视觉；YOLOv5；CA 注意力机制中图分类号:TP 391文献标志码:AGrab Point Identification and Localization of Leather based onImproved YOLOv5(College of Mechanical and Electrical Engineering,Shaanxi University of Science and Technology,Xi ’an 710021,China)Abstract:In order to achieve precise localization of leather grasping points by robots,this study proposed an improved approach based on the YOLOv5algorithm.The methodology involved the integration of the coordinate attention mechanism into the Backbone layer and the replacement of the CIOU Loss with the Focal-EIOU Loss to enable different gradients and enhance the rapid and accurate recognition and localization of leather grasping points.The positioning coordinates of the leather grasping points were obtained by using the target bounding box regression formula,followed by the coordinate system conversion to obtain the three-dimensional coordinates of the target grasping points.The experimental positioning of leather grasping points was conducted by using the Intel RealSense D435i depth camera.Experimental results demonstrate the significant improvements over the Faster R -CNN algorithm and the original YOLOv5algorithm.The improved YOLOv5algorithm exhibited an accuracy enhancement of 6.9%and 2.63%,a recall improvement of 8.39%and 2.63%,and an mAP improvement of 8.13%and 0.21%in recognition experiments,respectively.Similarly,in the positioning experiments,the improved YOLOv5algorithm demonstrated a decrease in average error values of 0.033m and 0.007m,and a decrease in error ratio average values of 2.233%and 0.476%.Key words:leather;grab point positioning;machine vision;YOLOv5;coordinate attention收稿日期：2023-06-09修回日期：2023-07-08接受日期：2023-07-12基金项目：陕西省重点研发计划资助项目（2022GY-250）；西安市科技计划项目（23ZDCYJSGG0016-2022）第一作者简介：金光（1996-），男，硕士研究生，主要研究方向为机器视觉,深度学习。

人工智能词汇

常用英语词汇 -andrew Ng课程average firing rate均匀激活率intensity强度average sum-of-squares error均方差Regression回归backpropagation后向流传Loss function损失函数basis 基non-convex非凸函数basis feature vectors特点基向量neural network神经网络batch gradient ascent批量梯度上涨法supervised learning监察学习Bayesian regularization method贝叶斯规则化方法regression problem回归问题办理的是连续的问题Bernoulli random variable伯努利随机变量classification problem分类问题bias term偏置项discreet value失散值binary classfication二元分类support vector machines支持向量机class labels种类标记learning theory学习理论concatenation级联learning algorithms学习算法conjugate gradient共轭梯度unsupervised learning无监察学习contiguous groups联通地区gradient descent梯度降落convex optimization software凸优化软件linear regression线性回归convolution卷积Neural Network神经网络cost function代价函数gradient descent梯度降落covariance matrix协方差矩阵normal equations DC component直流重量linear algebra线性代数decorrelation去有关superscript上标degeneracy退化exponentiation指数demensionality reduction降维training set训练会合derivative导函数training example训练样本diagonal对角线hypothesis假定，用来表示学习算法的输出diffusion of gradients梯度的弥散LMS algorithm “least mean squares最小二乘法算eigenvalue特点值法eigenvector特点向量batch gradient descent批量梯度降落error term残差constantly gradient descent随机梯度降落feature matrix特点矩阵iterative algorithm迭代算法feature standardization特点标准化partial derivative偏导数feedforward architectures前馈构造算法contour等高线feedforward neural network前馈神经网络quadratic function二元函数feedforward pass前馈传导locally weighted regression局部加权回归fine-tuned微调underfitting欠拟合first-order feature一阶特点overfitting过拟合forward pass前向传导non-parametric learning algorithms无参数学习算forward propagation前向流传法Gaussian prior高斯先验概率parametric learning algorithm参数学习算法generative model生成模型activation激活值gradient descent梯度降落activation function激活函数Greedy layer-wise training逐层贪心训练方法additive noise加性噪声grouping matrix分组矩阵autoencoder自编码器Hadamard product阿达马乘积Autoencoders自编码算法Hessian matrix Hessian矩阵hidden layer隐含层hidden units隐蔽神经元Hierarchical grouping层次型分组higher-order features更高阶特点highly non-convex optimization problem高度非凸的优化问题histogram直方图hyperbolic tangent双曲正切函数hypothesis估值，假定identity activation function恒等激励函数IID 独立同散布illumination照明inactive克制independent component analysis独立成份剖析input domains输入域input layer输入层intensity亮度/灰度intercept term截距KL divergence相对熵KL divergence KL分别度k-Means K-均值learning rate学习速率least squares最小二乘法linear correspondence线性响应linear superposition线性叠加line-search algorithm线搜寻算法local mean subtraction局部均值消减local optima局部最优解logistic regression逻辑回归loss function损失函数low-pass filtering低通滤波magnitude幅值MAP 极大后验预计maximum likelihood estimation极大似然预计mean 均匀值MFCC Mel 倒频系数multi-class classification多元分类neural networks神经网络neuron 神经元Newton’s method牛顿法non-convex function非凸函数non-linear feature非线性特点norm 范式norm bounded有界范数norm constrained范数拘束normalization归一化numerical roundoff errors数值舍入偏差numerically checking数值查验numerically reliable数值计算上稳固object detection物体检测objective function目标函数off-by-one error缺位错误orthogonalization正交化output layer输出层overall cost function整体代价函数over-complete basis超齐备基over-fitting过拟合parts of objects目标的零件part-whole decompostion部分-整体分解PCA 主元剖析penalty term处罚因子per-example mean subtraction逐样本均值消减pooling池化pretrain预训练principal components analysis主成份剖析quadratic constraints二次拘束RBMs 受限 Boltzman 机reconstruction based models鉴于重构的模型reconstruction cost重修代价reconstruction term重构项redundant冗余reflection matrix反射矩阵regularization正则化regularization term正则化项rescaling缩放robust 鲁棒性run 行程second-order feature二阶特点sigmoid activation function S型激励函数significant digits有效数字singular value奇怪值singular vector奇怪向量smoothed L1 penalty光滑的L1 范数处罚Smoothed topographic L1 sparsity penalty光滑地形L1 稀少处罚函数smoothing光滑Softmax Regresson Softmax回归sorted in decreasing order降序摆列source features源特点Adversarial Networks抗衡网络sparse autoencoder消减归一化Affine Layer仿射层Sparsity稀少性Affinity matrix亲和矩阵sparsity parameter稀少性参数Agent 代理 /智能体sparsity penalty稀少处罚Algorithm 算法square function平方函数Alpha- beta pruningα - β剪枝squared-error方差Anomaly detection异样检测stationary安稳性（不变性）Approximation近似stationary stochastic process安稳随机过程Area Under ROC Curve／ AUC Roc 曲线下边积step-size步长值Artificial General Intelligence/AGI通用人工智supervised learning监察学习能symmetric positive semi-definite matrix Artificial Intelligence/AI人工智能对称半正定矩阵Association analysis关系剖析symmetry breaking对称无效Attention mechanism注意力体制tanh function双曲正切函数Attribute conditional independence assumptionthe average activation均匀活跃度属性条件独立性假定the derivative checking method梯度考证方法Attribute space属性空间the empirical distribution经验散布函数Attribute value属性值the energy function能量函数Autoencoder自编码器the Lagrange dual拉格朗日对偶函数Automatic speech recognition自动语音辨别the log likelihood对数似然函数Automatic summarization自动纲要the pixel intensity value像素灰度值Average gradient均匀梯度the rate of convergence收敛速度Average-Pooling均匀池化topographic cost term拓扑代价项Backpropagation Through Time经过时间的反向流传topographic ordered拓扑次序Backpropagation/BP反向流传transformation变换Base learner基学习器translation invariant平移不变性Base learning algorithm基学习算法trivial answer平庸解Batch Normalization/BN批量归一化under-complete basis不齐备基Bayes decision rule贝叶斯判断准则unrolling组合扩展Bayes Model Averaging／ BMA 贝叶斯模型均匀unsupervised learning无监察学习Bayes optimal classifier贝叶斯最优分类器variance 方差Bayesian decision theory贝叶斯决议论vecotrized implementation向量化实现Bayesian network贝叶斯网络vectorization矢量化Between-class scatter matrix类间散度矩阵visual cortex视觉皮层Bias 偏置 /偏差weight decay权重衰减Bias-variance decomposition偏差 - 方差分解weighted average加权均匀值Bias-Variance Dilemma偏差–方差窘境whitening白化Bi-directional Long-Short Term Memory/Bi-LSTMzero-mean均值为零双向长短期记忆Accumulated error backpropagation积累偏差逆传Binary classification二分类播Binomial test二项查验Activation Function激活函数Bi-partition二分法Adaptive Resonance Theory/ART自适应谐振理论Boltzmann machine玻尔兹曼机Addictive model加性学习Bootstrap sampling自助采样法／可重复采样Bootstrapping自助法Break-Event Point／ BEP 均衡点Calibration校准Cascade-Correlation级联有关Categorical attribute失散属性Class-conditional probability类条件概率Classification and regression tree/CART分类与回归树Classifier分类器Class-imbalance类型不均衡Closed -form闭式Cluster簇/ 类/ 集群Cluster analysis聚类剖析Clustering聚类Clustering ensemble聚类集成Co-adapting共适应Coding matrix编码矩阵COLT 国际学习理论会议Committee-based learning鉴于委员会的学习Competitive learning竞争型学习Component learner组件学习器Comprehensibility可解说性Computation Cost计算成本Computational Linguistics计算语言学Computer vision计算机视觉Concept drift观点漂移Concept Learning System /CLS观点学习系统Conditional entropy条件熵Conditional mutual information条件互信息Conditional Probability Table／ CPT 条件概率表Conditional random field/CRF条件随机场Conditional risk条件风险Confidence置信度Confusion matrix混杂矩阵Connection weight连结权Connectionism 连结主义Consistency一致性／相合性Contingency table列联表Continuous attribute连续属性Convergence收敛Conversational agent会话智能体Convex quadratic programming凸二次规划Convexity凸性Convolutional neural network/CNN卷积神经网络Co-occurrence同现Correlation coefficient有关系数Cosine similarity余弦相像度Cost curve成本曲线Cost Function成本函数Cost matrix成本矩阵Cost-sensitive成本敏感Cross entropy交错熵Cross validation交错考证Crowdsourcing众包Curse of dimensionality维数灾害Cut point截断点Cutting plane algorithm割平面法Data mining数据发掘Data set数据集Decision Boundary决议界限Decision stump决议树桩Decision tree决议树／判断树Deduction演绎Deep Belief Network深度信念网络Deep Convolutional Generative Adversarial NetworkDCGAN深度卷积生成抗衡网络Deep learning深度学习Deep neural network/DNN深度神经网络Deep Q-Learning深度Q 学习Deep Q-Network深度Q 网络Density estimation密度预计Density-based clustering密度聚类Differentiable neural computer可微分神经计算机Dimensionality reduction algorithm降维算法Directed edge有向边Disagreement measure不合胸怀Discriminative model鉴别模型Discriminator鉴别器Distance measure距离胸怀Distance metric learning距离胸怀学习Distribution散布Divergence散度Diversity measure多样性胸怀／差别性胸怀Domain adaption领域自适应Downsampling下采样D-separation（ Directed separation）有向分别Dual problem对偶问题Dummy node 哑结点General Problem Solving通用问题求解Dynamic Fusion 动向交融Generalization泛化Dynamic programming动向规划Generalization error泛化偏差Eigenvalue decomposition特点值分解Generalization error bound泛化偏差上界Embedding 嵌入Generalized Lagrange function广义拉格朗日函数Emotional analysis情绪剖析Generalized linear model广义线性模型Empirical conditional entropy经验条件熵Generalized Rayleigh quotient广义瑞利商Empirical entropy经验熵Generative Adversarial Networks/GAN生成抗衡网Empirical error经验偏差络Empirical risk经验风险Generative Model生成模型End-to-End 端到端Generator生成器Energy-based model鉴于能量的模型Genetic Algorithm/GA遗传算法Ensemble learning集成学习Gibbs sampling吉布斯采样Ensemble pruning集成修剪Gini index基尼指数Error Correcting Output Codes／ ECOC纠错输出码Global minimum全局最小Error rate错误率Global Optimization全局优化Error-ambiguity decomposition偏差 - 分歧分解Gradient boosting梯度提高Euclidean distance欧氏距离Gradient Descent梯度降落Evolutionary computation演化计算Graph theory图论Expectation-Maximization希望最大化Ground-truth实情／真切Expected loss希望损失Hard margin硬间隔Exploding Gradient Problem梯度爆炸问题Hard voting硬投票Exponential loss function指数损失函数Harmonic mean 调解均匀Extreme Learning Machine/ELM超限学习机Hesse matrix海塞矩阵Factorization因子分解Hidden dynamic model隐动向模型False negative假负类Hidden layer隐蔽层False positive假正类Hidden Markov Model/HMM 隐马尔可夫模型False Positive Rate/FPR假正例率Hierarchical clustering层次聚类Feature engineering特点工程Hilbert space希尔伯特空间Feature selection特点选择Hinge loss function合页损失函数Feature vector特点向量Hold-out 留出法Featured Learning特点学习Homogeneous 同质Feedforward Neural Networks/FNN前馈神经网络Hybrid computing混杂计算Fine-tuning微调Hyperparameter超参数Flipping output翻转法Hypothesis假定Fluctuation震荡Hypothesis test假定考证Forward stagewise algorithm前向分步算法ICML 国际机器学习会议Frequentist频次主义学派Improved iterative scaling/IIS改良的迭代尺度法Full-rank matrix满秩矩阵Incremental learning增量学习Functional neuron功能神经元Independent and identically distributed/独Gain ratio增益率立同散布Game theory博弈论Independent Component Analysis/ICA独立成分剖析Gaussian kernel function高斯核函数Indicator function指示函数Gaussian Mixture Model高斯混杂模型Individual learner个体学习器Induction归纳Inductive bias归纳偏好Inductive learning归纳学习Inductive Logic Programming／ ILP归纳逻辑程序设计Information entropy信息熵Information gain信息增益Input layer输入层Insensitive loss不敏感损失Inter-cluster similarity簇间相像度International Conference for Machine Learning/ICML国际机器学习大会Intra-cluster similarity簇内相像度Intrinsic value固有值Isometric Mapping/Isomap等胸怀映照Isotonic regression平分回归Iterative Dichotomiser迭代二分器Kernel method核方法Kernel trick核技巧Kernelized Linear Discriminant Analysis／KLDA核线性鉴别剖析K-fold cross validation k折交错考证／k 倍交错考证K-Means Clustering K–均值聚类K-Nearest Neighbours Algorithm/KNN K近邻算法Knowledge base 知识库Knowledge Representation知识表征Label space标记空间Lagrange duality拉格朗日对偶性Lagrange multiplier拉格朗日乘子Laplace smoothing拉普拉斯光滑Laplacian correction拉普拉斯修正Latent Dirichlet Allocation隐狄利克雷散布Latent semantic analysis潜伏语义剖析Latent variable隐变量Lazy learning懒散学习Learner学习器Learning by analogy类比学习Learning rate学习率Learning Vector Quantization/LVQ学习向量量化Least squares regression tree最小二乘回归树Leave-One-Out/LOO留一法linear chain conditional random field线性链条件随机场Linear Discriminant Analysis／ LDA 线性鉴别剖析Linear model线性模型Linear Regression线性回归Link function联系函数Local Markov property局部马尔可夫性Local minimum局部最小Log likelihood对数似然Log odds／ logit对数几率Logistic Regression Logistic回归Log-likelihood对数似然Log-linear regression对数线性回归Long-Short Term Memory/LSTM 长短期记忆Loss function损失函数Machine translation/MT机器翻译Macron-P宏查准率Macron-R宏查全率Majority voting绝对多半投票法Manifold assumption流形假定Manifold learning流形学习Margin theory间隔理论Marginal distribution边沿散布Marginal independence边沿独立性Marginalization边沿化Markov Chain Monte Carlo/MCMC马尔可夫链蒙特卡罗方法Markov Random Field马尔可夫随机场Maximal clique最大团Maximum Likelihood Estimation/MLE极大似然预计／极大似然法Maximum margin最大间隔Maximum weighted spanning tree最大带权生成树Max-Pooling 最大池化Mean squared error均方偏差Meta-learner元学习器Metric learning胸怀学习Micro-P微查准率Micro-R微查全率Minimal Description Length/MDL最小描绘长度Minimax game极小极大博弈Misclassification cost误分类成本Mixture of experts混杂专家Momentum 动量Moral graph道德图／正直图Multi-class classification多分类Multi-document summarization多文档纲要One shot learning一次性学习Multi-layer feedforward neural networks One-Dependent Estimator／ ODE 独依靠预计多层前馈神经网络On-Policy在策略Multilayer Perceptron/MLP多层感知器Ordinal attribute有序属性Multimodal learning多模态学习Out-of-bag estimate包外预计Multiple Dimensional Scaling多维缩放Output layer输出层Multiple linear regression多元线性回归Output smearing输出调制法Multi-response Linear Regression／ MLR Overfitting过拟合／过配多响应线性回归Oversampling 过采样Mutual information互信息Paired t-test成对 t查验Naive bayes 朴实贝叶斯Pairwise 成对型Naive Bayes Classifier朴实贝叶斯分类器Pairwise Markov property成对马尔可夫性Named entity recognition命名实体辨别Parameter参数Nash equilibrium纳什均衡Parameter estimation参数预计Natural language generation/NLG自然语言生成Parameter tuning调参Natural language processing自然语言办理Parse tree分析树Negative class负类Particle Swarm Optimization/PSO粒子群优化算法Negative correlation负有关法Part-of-speech tagging词性标明Negative Log Likelihood负对数似然Perceptron感知机Neighbourhood Component Analysis/NCA Performance measure性能胸怀近邻成分剖析Plug and Play Generative Network即插即用生成网Neural Machine Translation神经机器翻译络Neural Turing Machine神经图灵机Plurality voting相对多半投票法Newton method牛顿法Polarity detection极性检测NIPS 国际神经信息办理系统会议Polynomial kernel function多项式核函数No Free Lunch Theorem／ NFL 没有免费的午饭定理Pooling池化Noise-contrastive estimation噪音对照预计Positive class正类Nominal attribute列名属性Positive definite matrix正定矩阵Non-convex optimization非凸优化Post-hoc test后续查验Nonlinear model非线性模型Post-pruning后剪枝Non-metric distance非胸怀距离potential function势函数Non-negative matrix factorization非负矩阵分解Precision查准率／正确率Non-ordinal attribute无序属性Prepruning 预剪枝Non-Saturating Game非饱和博弈Principal component analysis/PCA主成分剖析Norm 范数Principle of multiple explanations多释原则Normalization归一化Prior 先验Nuclear norm核范数Probability Graphical Model概率图模型Numerical attribute数值属性Proximal Gradient Descent/PGD近端梯度降落Letter O Pruning剪枝Objective function目标函数Pseudo-label伪标记Oblique decision tree斜决议树Quantized Neural Network量子化神经网络Occam’s razor奥卡姆剃刀Quantum computer 量子计算机Odds 几率Quantum Computing量子计算Off-Policy离策略Quasi Newton method拟牛顿法Radial Basis Function／ RBF 径向基函数Random Forest Algorithm随机丛林算法Random walk随机闲步Recall 查全率／召回率Receiver Operating Characteristic/ROC受试者工作特点Rectified Linear Unit/ReLU线性修正单元Recurrent Neural Network循环神经网络Recursive neural network递归神经网络Reference model 参照模型Regression回归Regularization正则化Reinforcement learning/RL加强学习Representation learning表征学习Representer theorem表示定理reproducing kernel Hilbert space/RKHS重生核希尔伯特空间Re-sampling重采样法Rescaling再缩放Residual Mapping残差映照Residual Network残差网络Restricted Boltzmann Machine/RBM受限玻尔兹曼机Restricted Isometry Property/RIP限制等距性Re-weighting重赋权法Robustness稳重性 / 鲁棒性Root node根结点Rule Engine规则引擎Rule learning规则学习Saddle point鞍点Sample space样本空间Sampling采样Score function评分函数Self-Driving自动驾驶Self-Organizing Map／ SOM自组织映照Semi-naive Bayes classifiers半朴实贝叶斯分类器Semi-Supervised Learning半监察学习semi-Supervised Support Vector Machine半监察支持向量机Sentiment analysis感情剖析Separating hyperplane分别超平面Sigmoid function Sigmoid函数Similarity measure相像度胸怀Simulated annealing模拟退火Simultaneous localization and mapping同步定位与地图建立Singular Value Decomposition奇怪值分解Slack variables废弛变量Smoothing光滑Soft margin软间隔Soft margin maximization软间隔最大化Soft voting软投票Sparse representation稀少表征Sparsity稀少性Specialization特化Spectral Clustering谱聚类Speech Recognition语音辨别Splitting variable切分变量Squashing function挤压函数Stability-plasticity dilemma可塑性 - 稳固性窘境Statistical learning统计学习Status feature function状态特点函Stochastic gradient descent随机梯度降落Stratified sampling分层采样Structural risk构造风险Structural risk minimization/SRM构造风险最小化Subspace子空间Supervised learning监察学习／有导师学习support vector expansion支持向量展式Support Vector Machine/SVM支持向量机Surrogat loss代替损失Surrogate function代替函数Symbolic learning符号学习Symbolism符号主义Synset同义词集T-Distribution Stochastic Neighbour Embeddingt-SNE T–散布随机近邻嵌入Tensor 张量Tensor Processing Units/TPU张量办理单元The least square method最小二乘法Threshold阈值Threshold logic unit阈值逻辑单元Threshold-moving阈值挪动Time Step时间步骤Tokenization标记化Training error训练偏差Training instance训练示例／训练例Transductive learning直推学习Transfer learning迁徙学习Treebank树库algebra线性代数Tria-by-error试错法asymptotically无症状的True negative真负类appropriate适合的True positive真切类bias 偏差True Positive Rate/TPR真切例率brevity简洁，简洁；短暂Turing Machine图灵机[800 ] broader宽泛Twice-learning二次学习briefly简洁的Underfitting欠拟合／欠配batch 批量Undersampling欠采样convergence收敛，集中到一点Understandability可理解性convex凸的Unequal cost非均等代价contours轮廓Unit-step function单位阶跃函数constraint拘束Univariate decision tree单变量决议树constant常理Unsupervised learning无监察学习／无导师学习commercial商务的Unsupervised layer-wise training无监察逐层训练complementarity增补Upsampling上采样coordinate ascent同样级上涨Vanishing Gradient Problem梯度消逝问题clipping剪下物；剪报；修剪Variational inference变分推测component重量；零件VC Theory VC维理论continuous连续的Version space版本空间covariance协方差Viterbi algorithm维特比算法canonical正规的，正则的Von Neumann architecture冯· 诺伊曼架构concave非凸的Wasserstein GAN/WGAN Wasserstein生成抗衡网络corresponds相切合；相当；通讯Weak learner弱学习器corollary推论Weight权重concrete详细的事物，实在的东西Weight sharing权共享cross validation交错考证Weighted voting加权投票法correlation互相关系Within-class scatter matrix类内散度矩阵convention商定Word embedding词嵌入cluster一簇Word sense disambiguation词义消歧centroids质心，形心Zero-data learning零数据学习converge收敛Zero-shot learning零次学习computationally计算(机)的approximations近似值calculus计算arbitrary任意的derive获取，获得affine仿射的dual 二元的arbitrary任意的duality二元性；二象性；对偶性amino acid氨基酸derivation求导；获取；发源amenable 经得起查验的denote预示，表示，是的标记；意味着，[逻]指称axiom 公义，原则divergence散度；发散性abstract提取dimension尺度，规格；维数architecture架构，系统构造；建筑业dot 小圆点absolute绝对的distortion变形arsenal军械库density概率密度函数assignment分派discrete失散的人工智能词汇discriminative有辨别能力的indicator指示物，指示器diagonal对角interative重复的，迭代的dispersion分别，散开integral积分determinant决定要素identical相等的；完整同样的disjoint不订交的indicate表示，指出encounter碰到invariance不变性，恒定性ellipses椭圆impose把强加于equality等式intermediate中间的extra 额外的interpretation解说，翻译empirical经验；察看joint distribution结合概率ennmerate例举，计数lieu 代替exceed超出，越出logarithmic对数的，用对数表示的expectation希望latent潜伏的efficient奏效的Leave-one-out cross validation留一法交错考证endow 给予magnitude巨大explicitly清楚的mapping 画图，制图；映照exponential family指数家族matrix矩阵equivalently等价的mutual互相的，共同的feasible可行的monotonically单一的forary首次试试minor较小的，次要的finite有限的，限制的multinomial多项的forgo 摒弃，放弃multi-class classification二分类问题fliter过滤nasty厌烦的frequentist最常发生的notation标记，说明forward search前向式搜寻na?ve 朴实的formalize使定形obtain获取generalized归纳的oscillate摇动generalization归纳，归纳；广泛化；判断（依据不optimization problem最优化问题足）objective function目标函数guarantee保证；抵押品optimal最理想的generate形成，产生orthogonal(矢量，矩阵等 ) 正交的geometric margins几何界限orientation方向gap 裂口ordinary一般的generative生产的；有生产力的occasionally有时的heuristic启迪式的；启迪法；启迪程序partial derivative偏导数hone 怀恋；磨property性质hyperplane超平面proportional成比率的initial最先的primal原始的，最先的implement履行permit同意intuitive凭直觉获知的pseudocode 伪代码incremental增添的permissible可同意的intercept截距polynomial多项式intuitious直觉preliminary预备instantiation例子precision精度人工智能词汇perturbation不安，搅乱theorem定理poist 假定，假想tangent正弦positive semi-definite半正定的unit-length vector单位向量parentheses圆括号valid 有效的，正确的posterior probability后验概率variance方差plementarity增补variable变量；变元pictorially图像的vocabulary 词汇parameterize确立的参数valued经估价的；可贵的poisson distribution柏松散布wrapper 包装pertinent有关的总计 1038 词汇quadratic二次的quantity量，数目；重量query 疑问的regularization使系统化；调整reoptimize从头优化restrict限制；限制；拘束reminiscent回想旧事的；提示的；令人联想的（ of ）remark 注意random variable随机变量respect考虑respectively各自的；分其他redundant过多的；冗余的susceptible敏感的stochastic可能的；随机的symmetric对称的sophisticated复杂的spurious假的；假造的subtract减去；减法器simultaneously同时发生地；同步地suffice知足scarce罕有的，难得的split分解，分别subset子集statistic统计量successive iteratious连续的迭代scale标度sort of有几分的squares 平方trajectory轨迹temporarily临时的terminology专用名词tolerance容忍；公差thumb翻阅threshold阈，临界。

[转载]k-折交叉验证（K-fold

[转载]k-折交叉验证（K-fold cross-validation）原⽂地址：k-折交叉验证（K-fold cross-validation）作者：清风⼩荷塘k-折交叉验证（K-fold cross-validation）是指将样本集分为k份，其中k-1份作为训练数据集，⽽另外的1份作为验证数据集。

⽤验证集来验证所得分类器或者回归的错误码率。

⼀般需要循环k次，直到所有k份数据全部被选择⼀遍为⽌。

Cross validation is a model evaluation method that is better than residuals. The problem with residual evaluations is that they do not give an indication of how well the learner will do when it is asked to make new predictions for data it has not already seen. One way to overcome this problem is to not use the entire data set when training a learner. Some of the data is removed before training begins. Then when training is done, the data that was removed can be used to test the performance of the learned model on ``new'' data. This is the basic idea for a whole class of model evaluation methods called cross validation.The holdout method is the simplest kind of cross validation. The data set is separated into two sets, called the training set and the testing set. The function approximator fits a function using the training set only. Then the function approximator is asked to predict the output values for the data in the testing set (it has never seen these output values before). The errors it makes are accumulated as before to give the mean absolute test set error, which is used to evaluate the model. The advantage of this method is that it is usually preferable to the residual method and takes no longer to compute. However, its evaluation can have a high variance. The evaluation may depend heavily on which data points end up in the training set and which end up in the test set, and thus the evaluation may be significantly different depending on how the division is made.K-fold cross validation is one way to improve over the holdout method. The data set is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set. Then the average error across all k trials is computed. The advantage of this method is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times. The variance of the resulting estimate is reduced as k is increased. The disadvantage of this method is that the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation. A variant of this method is to randomly divide the data into a test and training set k different times. The advantage of doing this is that you can independently choose how large each test set is and how many trials you average over.Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N, the number of da ta points in the set. That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point. As before the average error is computed and used to evaluate the model. The evaluation given by leave-one-out cross validation error (LOO-XVE) is good, but at first pass it seems very expensive to compute. Fortunately, locally weighted learners can make LOO predictions just as easily as they make regular predictions. That means computing the LOO-XVE takes no more time than computing the residual error and it is a much better way to evaluate models. We will see shortly that Vizier relies heavily on LOO-XVE to choose its metacodes.Figure 26: Cross validation checks how well a model generalizes to new dataFig. 26 shows an example of cross validation performing better than residual error. The data set in the top two graphs is a simple underlying function with significant noise. Cross validation tells us that broad smoothing is best. The data set in the bottom two graphs is a complex underlying function with no noise. Cross validation tells us that very little smoothing is best for this data set.Now we return to the question of choosing a good metacode for data set a1.mbl:File -> Open -> a1.mblEdit -> Metacode -> A90:9Model -> LOOPredictEdit -> Metacode -> L90:9Model -> LOOPredictEdit -> Metacode -> L10:9Model -> LOOPredictLOOPredict goes through the entire data set and makes LOO predictions for each point. At the bottom of the page itshows the summary statistics including Mean LOO error, RMS LOO error, and information about the data point with the largest error. The mean absolute LOO-XVEs for the three metacodes given above (the same three used to generate the graphs in fig. 25), are 2.98, 1.23, and 1.80. Those values show that global linear regression is the best metacode of those three, which agrees with our intuitive feeling from looking at the plots in fig. 25. If you repeat the above operation on data set b1.mbl you'll get the values 4.83, 4.45, and 0.39, which also agrees with our observations.What are cross-validation and bootstrapping?--------------------------------------------------------------------------------Cross-validation and bootstrapping are both methods for estimatinggeneralization error based on "resampling" (Weiss and Kulikowski 1991; Efronand Tibshirani 1993; Hjorth 1994; Plutowski, Sakata, and White 1994; Shaoand Tu 1995). The resulting estimates of generalization error are often usedfor choosing among various models, such as different network architectures.Cross-validation++++++++++++++++In k-fold cross-validation, you divide the data into k subsets of(approximately) equal size. You train the net k times, each time leavingout one of the subsets from training, but using only the omitted subset tocompute whatever error criterion interests you. If k equals the samplesize, this is called "leave-one-out" cross-validation. "Leave-v-out" is amore elaborate and expensive version of cross-validation that involvesleaving out all possible subsets of v cases.Note that cross-validation is quite different from the "split-sample" or"hold-out" method that is commonly used for early stopping in NNs. In thesplit-sample method, only a single subset (the validation set) is used toestimate the generalization error, instead of k different subsets; i.e.,there is no "crossing". While various people have suggested thatcross-validation be applied to early stopping, the proper way of doing so isnot obvious.The distinction between cross-validation and split-sample validation isextremely important because cross-validation is markedly superior for smalldata sets; this fact is demonstrated dramatically by Goutte (1997) in areply to Zhu and Rohwer (1996). For an insightful discussion of thelimitations of cross-validatory choice among several learning methods, seeStone (1977).Jackknifing+++++++++++Leave-one-out cross-validation is also easily confused with jackknifing.Both involve omitting each training case in turn and retraining the networkon the remaining subset. But cross-validation is used to estimategeneralization error, while the jackknife is used to estimate the bias of astatistic. In the jackknife, you compute some statistic of interest in eachsubset of the data. The average of these subset statistics is compared withthe corresponding statistic computed from the entire sample in order toestimate the bias of the latter. You can also get a jackknife estimate ofthe standard error of a statistic. Jackknifing can be used to estimate thebias of the training error and hence to estimate the generalization error,but this process is more complicated than leave-one-out cross-validation(Efron, 1982; Ripley, 1996, p. 73).Choice of cross-validation method+++++++++++++++++++++++++++++++++Cross-validation can be used simply to estimate the generalization error of a given model, or it can be used for model selection by choosing one of several models that has the smallest estimated generalization error. For example, you might use cross-validation to choose the number of hidden units, or you could use cross-alidation to choose a subset of the inputs (subset selection). A subset that contains all relevant inputs will be called a "good" subsets, while the subset that contains all relevant inputs but no others will be called the "best" subset. Note that subsets are "good" and "best" in an asymptotic sense (as the number of training cases goes to infinity). With a small training set, it is possible that a subset that is smaller than the "best" subset may provide better generalization error.Leave-one-out cross-validation often works well for estimating generalization error for continuous error functions such as the mean squared error, but it may perform poorly for discontinuous error functions such as the number of misclassified cases. In the latter case, k-fold cross-validation is preferred. But if k gets too small, the error estimate is pessimistically biased because of the difference in training-set size between the full-sample analysis and the cross-validation analyses. (For model-selection purposes, this bias can actually help; see the discussion below of Shao, 1993.) A value of 10 for k is popular for estimating generalization error.Leave-one-out cross-validation can also run into trouble with variousmodel-selection methods. Again, one problem is lack of continuity--a smallchange in the data can cause a large change in the model selected (Breiman,1996). For choosing subsets of inputs in linear regression, Breiman andSpector (1992) found 10-fold and 5-fold cross-validation to work better thanleave-one-out. Kohavi (1995) also obtained good results for 10-foldcross-validation with empirical decision trees (C4.5). Values of k as smallas 5 or even 2 may work even better if you analyze several different randomk-way splits of the data to reduce the variability of the cross-validationestimate.Leave-one-out cross-validation also has more subtle deficiencies for modelselection. Shao (1995) showed that in linear models, leave-one-outcross-validation is asymptotically equivalent to AIC (and Mallows' C_p), butleave-v-out cross-validation is asymptotically equivalent to Schwarz's Bayesian criterion (called SBC or BIC) when v =n[1-1/(log(n)-1)], where n is the number of training cases. SBCprovides consistent subset-selection, while AIC does not. That is, SBC will choose the "best" subset with probability approaching one as the size of the training set goes to infinity. AIC has an asymptotic probability of one of choosing a "good" subset, but less than one of choosing the "best" subset (Stone, 1979). Many simulation studies have also found that AIC overfits badly in small samples, and that SBC works well (e.g., Hurvich and Tsai, 1989; Shao and Tu, 1995). Hence, these results suggest that leave-one-out cross-validation should overfit in small samples, but leave-v-outcross-validation with appropriate v should do better. However, when true models have an infinite number of parameters, SBC is not efficient, and other criteria that are asymptotically efficient but not consistent formodel selection may produce better generalization (Hurvich and Tsai, 1989). Shao (1993) obtained the surprising result that for selecting subsets of inputs in a linear regression, the probability of selecting the "best" doesnot converge to 1 (as the sample size n goes to infinity) for leave-v-out cross-validation unless the proportion v/n approaches 1. At first glance, Shao's result seems inconsistent with the analysis by Kearns (1997) ofsplit-sample validation, which shows that the best generalization is obtained with v/n strictly between 0 and 1, with little sensitivity to the precise value of v/n for large data sets. But the apparent conflict is dueto the fundamentally different properties of cross-validation andsplit-sample validation.To obtain an intuitive understanding of Shao (1993), let's review some background material on generalization error. Generalization error can be broken down into three additive parts, noise variance + estimation variance + squared estimation bias. Noise variance is the same for all subsets of inputs. Bias is nonzero for subsets that are not "good", but it's zero forall "good" subsets, since we are assuming that the function to be learned is linear. Hence the generalization error of "good" subsets will differ only in the estimation variance. The estimation variance is (2p/t)s^2 where pis the number of inputs in the subset, t is the training set size, and s^2is the noise variance. The "best" subset is better than other "good" subsetsonly because the "best" subset has (by definition) the smallest value of p. But the t in the denominator means that differences in generalization error among the "good" subsets will all go to zero as t goes to infinity.Therefore it is difficult to guess which subset is "best" based on the generalization error even when t is very large. It is well known that unbiased estimates of the generalization error, such as those based on AIC, FPE, and C_p, do not produce consistent estimates of the "best" subset (e.g., see Stone, 1979).In leave-v-out cross-validation, t=n-v. The differences of thecross-validation estimates of generalization error among the "good" subsets contain a factor 1/t, not 1/n. Therefore by making t small enough (and thereby making each regression based on t cases bad enough), we can make the differences of the cross-validation estimates large enough to detect. It turns out that to make t small enough to guess the "best" subset consistently, we have to have t/n go to 0 as n goes to infinity.The crucial distinction between cross-validation and split-sample validation is that with cross-validation, after guessing the "best" subset, we trainthe linear regression model for that subset using all n cases, but withsplit-sample validation, only t cases are ever used for training. If ourmain purpose were really to choose the "best" subset, I suspect we would still have to have t/n go to 0 even for split-sample validation. Butchoosing the "best" subset is not the same thing as getting the best generalization. If we are more interested in getting good generalizationthan in choosing the "best" subset, we do not want to make our regression estimate based on only t cases as bad as we do in cross-validation, because in split-sample validation that bad regression estimate is what we're stuck with. So there is no conflict between Shao and Kearns, but there is a conflict between the two goals of choosing the "best" subset and getting the best generalization in split-sample validation.Bootstrapping+++++++++++++Bootstrapping seems to work better than cross-validation in many cases (Efron, 1983). In the simplest form of bootstrapping, instead of repeatedly analyzing subsets of the data, you repeatedly analyze subsamples of the data. Each subsample is a random sample with replacement from the fullsample. Depending on what you want to do, anywhere from 50 to 2000 subsamples might be used. There are many more sophisticated bootstrap methods that can be used not only for estimating generalization error butalso for estimating confidence bounds for network outputs (Efron and Tibshirani 1993). For estimating generalization error in classification problems, the .632+ bootstrap (an improvement on the popular .632 bootstrap) is one of the currently favored methods that has the advantage of performing well even when there is severe overfitting. Use of bootstrapping for NNs is described in Baxt and White (1995), Tibshirani (1996), and Masters (1995). However, the results obtained so far are not very thorough, and it is known that bootstrapping does not work well for some other methodologies such as empirical decision trees (Breiman, Friedman, Olshen, and Stone, 1984; Kohavi, 1995), for which it can be excessively optimistic.For further information+++++++++++++++++++++++Cross-validation and bootstrapping become considerably more complicated for time series data; see Hjorth (1994) and Snijders (1988).More information on jackknife and bootstrap confidence intervals is available at ftp:///pub/neural/jackboot.sas (this is a plain-text file).References:Baxt, W.G. and White, H. (1995) "Bootstrapping confidence intervals forclinical input variable effects in a network trained to identify thepresence of acute myocardial infarction", Neural Computation, 7, 624-638. Breiman, L., and Spector, P. (1992), "Submodel selection and evaluationin regression: The X-random case," International Statistical Review, 60,291-319.Dijkstra, T.K., ed. (1988), On Model Uncertainty and Its StatisticalImplications, Proceedings of a workshop held in Groningen, TheNetherlands, September 25-26, 1986, Berlin: Springer-Verlag.Efron, B. (1982) The Jackknife, the Bootstrap and Other ResamplingPlans, Philadelphia: SIAM.Efron, B. (1983), "Estimating the error rate of a prediction rule:Improvement on cross-validation," J. of the American StatisticalAssociation, 78, 316-331.Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,London: Chapman & Hall.Efron, B. and Tibshirani, R.J. (1997), "Improvements on cross-validation: The .632+ bootstrap method," J. of the American Statistical Association, 92, 548-560.Goutte, C. (1997), "Note on free lunches and cross-validation," NeuralComputation, 9, 1211-1215,ftp://eivind.imm.dtu.dk/dist/1997/goutte.nflcv.ps.gz.Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods Validation, Model Selection, and Bootstrap, London: Chapman & Hall.Hurvich, C.M., and Tsai, C.-L. (1989), "Regression and time series model selection in small samples," Biometrika, 76, 297-307.Kearns, M. (1997), "A bound on the error of cross validation using theapproximation and estimation rates, with consequences for thetraining-test split," Neural Computation, 9, 1143-1161.Kohavi, R. (1995), "A study of cross-validation and bootstrap foraccuracy estimation and model selection," International Joint Conference on Artificial Intelligence (IJCAI), pp. ?,/users/ronnyk/Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0Plutowski, M., Sakata, S., and White, H. (1994), "Cross-validationestimates IMSE," in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.)Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufman, pp. 391-398.Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.Shao, J. (1993), "Linear model selection by cross-validation," J. of theAmerican Statistical Association, 88, 486-494.Shao, J. (1995), "An asymptotic theory for linear model selection,"Statistica Sinica ?.Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York:Springer-Verlag.Snijders, T.A.B. (1988), "On cross-validation for predictor evaluation intime series," in Dijkstra (1988), pp. 56-69.Stone, M. (1977), "Asymptotics for and against cross-validation,"Biometrika, 64, 29-35.Stone, M. (1979), "Comments on model selection criteria of Akaike and Schwarz," J. of the Royal Statistical Society, Series B, 41, 276-278.Tibshirani, R. (1996), "A comparison of some error estimates for neural network models," Neural Computation, 8, 152-163.Weiss, S.M. and Kulikowski, C.A. (1991), Computer Systems That Learn, Morgan Kaufmann.Zhu, H., and Rohwer, R. (1996), "No free lunch for cross-validation,"Neural Computation, 8, 1421-1426.。

模型蒸馏的使用中的模型微调和迁移学习(Ⅱ)

模型蒸馏的使用中的模型微调和迁移学习模型蒸馏是指将复杂的神经网络模型压缩成一个更小、更简单的模型的过程。

这种技术可以有效地减少模型的计算成本和存储空间，同时保持原始模型的预测能力。

在实际应用中，模型蒸馏往往需要结合模型微调和迁移学习来进一步提升蒸馏后的模型性能。

首先，模型微调是指在已经经过蒸馏处理的模型上进行进一步的训练，以适应特定的任务或数据集。

这种技术常常用于迁移学习中，通过微调蒸馏后的模型来适应新的任务。

模型微调的过程通常包括解冻原始模型的一些层，然后在新的数据上进行重新训练。

这样可以保留原始模型的知识，同时使模型更好地适应新的任务。

其次，迁移学习是指将已经在一个任务上训练好的模型应用到另一个相关的任务上。

在模型蒸馏中，迁移学习可以帮助蒸馏后的模型更快地收敛，并且在新任务上取得更好的性能。

通常，迁移学习包括特征提取和微调两个阶段。

特征提取阶段是指利用已经训练好的模型提取出对新任务有用的特征，然后将这些特征输入到一个新的模型中进行训练。

而微调阶段则是指在新的数据集上对整个模型进行重新训练，以进一步提升模型的性能。

在实际应用中，模型蒸馏、模型微调和迁移学习通常是结合使用的。

以自然语言处理领域为例，研究人员通常会使用大型的语言模型来进行模型蒸馏，然后通过模型微调和迁移学习来适应具体的任务，比如情感分析、命名实体识别等。

这种方法不仅可以提高模型的预测性能，还可以在一定程度上减少对大规模标注数据的需求。

另外，模型蒸馏、模型微调和迁移学习在计算机视觉领域也有着广泛的应用。

例如，研究人员可以使用模型蒸馏技术将大型的视觉模型压缩成更小、更快的模型，然后通过模型微调和迁移学习来适应特定的视觉任务，比如目标检测、图像分割等。

这种方法不仅可以提高模型的计算效率，还可以在一定程度上避免过拟合问题。

总之，模型蒸馏、模型微调和迁移学习是现代深度学习领域中非常重要的技术。

通过结合这些方法，研究人员可以在保持模型性能的前提下，进一步提高模型的计算效率和泛化能力。

机器学习英语词汇

目录第一部分 (3)第二部分 (12)Letter A (12)Letter B (14)Letter C (15)Letter D (17)Letter E (19)Letter F (20)Letter G (21)Letter H (22)Letter I (23)Letter K (24)Letter L (24)Letter M (26)Letter N (27)Letter O (29)Letter P (29)Letter R (31)Letter S (32)Letter T (35)Letter U (36)Letter W (37)Letter Z (37)第三部分 (37)A (37)B (38)C (38)D (40)E (40)F (41)G (41)H (42)L (42)J (43)L (43)M (43)N (44)O (44)P (44)Q (45)R (46)S (46)U (47)V (48)第一部分[ ] intensity 强度[ ] Regression 回归[ ] Loss function 损失函数[ ] non-convex 非凸函数[ ] neural network 神经网络[ ] supervised learning 监督学习[ ] regression problem 回归问题处理的是连续的问题[ ] classification problem 分类问题处理的问题是离散的而不是连续的回归问题和分类问题的区别应该在于回归问题的结果是连续的，分类问题的结果是离散的。

[ ]discreet value 离散值[ ] support vector machines 支持向量机,用来处理分类算法中输入的维度不单一的情况（甚至输入维度为无穷）[ ] learning theory 学习理论[ ] learning algorithms 学习算法[ ] unsupervised learning 无监督学习[ ] gradient descent 梯度下降[ ] linear regression 线性回归[ ] Neural Network 神经网络[ ] gradient descent 梯度下降监督学习的一种算法，用来拟合的算法[ ] normal equations[ ] linear algebra 线性代数原谅我英语不太好[ ] superscript上标[ ] exponentiation 指数[ ] training set 训练集合[ ] training example 训练样本[ ] hypothesis 假设，用来表示学习算法的输出，叫我们不要太纠结H的意思，因为这只是历史的惯例[ ] LMS algorithm “least mean squares” 最小二乘法算法[ ] batch gradient descent 批量梯度下降，因为每次都会计算最小拟合的方差，所以运算慢[ ] constantly gradient descent 字幕组翻译成“随机梯度下降” 我怎么觉得是“常量梯度下降”也就是梯度下降的运算次数不变，一般比批量梯度下降速度快，但是通常不是那么准确[ ] iterative algorithm 迭代算法[ ] partial derivative 偏导数[ ] contour 等高线[ ] quadratic function 二元函数[ ] locally weighted regression局部加权回归[ ] underfitting欠拟合[ ] overfitting 过拟合[ ] non-parametric learning algorithms 无参数学习算法[ ] parametric learning algorithm 参数学习算法[ ] other[ ] activation 激活值[ ] activation function 激活函数[ ] additive noise 加性噪声[ ] autoencoder 自编码器[ ] Autoencoders 自编码算法[ ] average firing rate 平均激活率[ ] average sum-of-squares error 均方差[ ] backpropagation 后向传播[ ] basis 基[ ] basis feature vectors 特征基向量[50 ] batch gradient ascent 批量梯度上升法[ ] Bayesian regularization method 贝叶斯规则化方法[ ] Bernoulli random variable 伯努利随机变量[ ] bias term 偏置项[ ] binary classfication 二元分类[ ] class labels 类型标记[ ] concatenation 级联[ ] conjugate gradient 共轭梯度[ ] contiguous groups 联通区域[ ] convex optimization software 凸优化软件[ ] convolution 卷积[ ] cost function 代价函数[ ] covariance matrix 协方差矩阵[ ] DC component 直流分量[ ] decorrelation 去相关[ ] degeneracy 退化[ ] demensionality reduction 降维[ ] derivative 导函数[ ] diagonal 对角线[ ] diffusion of gradients 梯度的弥散[ ] eigenvalue 特征值[ ] eigenvector 特征向量[ ] error term 残差[ ] feature matrix 特征矩阵[ ] feature standardization 特征标准化[ ] feedforward architectures 前馈结构算法[ ] feedforward neural network 前馈神经网络[ ] feedforward pass 前馈传导[ ] fine-tuned 微调[ ] first-order feature 一阶特征[ ] forward pass 前向传导[ ] forward propagation 前向传播[ ] Gaussian prior 高斯先验概率[ ] generative model 生成模型[ ] gradient descent 梯度下降[ ] Greedy layer-wise training 逐层贪婪训练方法[ ] grouping matrix 分组矩阵[ ] Hadamard product 阿达马乘积[ ] Hessian matrix Hessian 矩阵[ ] hidden layer 隐含层[ ] hidden units 隐藏神经元[ ] Hierarchical grouping 层次型分组[ ] higher-order features 更高阶特征[ ] highly non-convex optimization problem 高度非凸的优化问题[ ] histogram 直方图[ ] hyperbolic tangent 双曲正切函数[ ] hypothesis 估值，假设[ ] identity activation function 恒等激励函数[ ] IID 独立同分布[ ] illumination 照明[100 ] inactive 抑制[ ] independent component analysis 独立成份分析[ ] input domains 输入域[ ] input layer 输入层[ ] intensity 亮度/灰度[ ] intercept term 截距[ ] KL divergence 相对熵[ ] KL divergence KL分散度[ ] k-Means K-均值[ ] learning rate 学习速率[ ] least squares 最小二乘法[ ] linear correspondence 线性响应[ ] linear superposition 线性叠加[ ] line-search algorithm 线搜索算法[ ] local mean subtraction 局部均值消减[ ] local optima 局部最优解[ ] logistic regression 逻辑回归[ ] loss function 损失函数[ ] low-pass filtering 低通滤波[ ] magnitude 幅值[ ] MAP 极大后验估计[ ] maximum likelihood estimation 极大似然估计[ ] mean 平均值[ ] MFCC Mel 倒频系数[ ] multi-class classification 多元分类[ ] neural networks 神经网络[ ] neuron 神经元[ ] Newton’s method 牛顿法[ ] non-convex function 非凸函数[ ] non-linear feature 非线性特征[ ] norm 范式[ ] norm bounded 有界范数[ ] norm constrained 范数约束[ ] normalization 归一化[ ] numerical roundoff errors 数值舍入误差[ ] numerically checking 数值检验[ ] numerically reliable 数值计算上稳定[ ] object detection 物体检测[ ] objective function 目标函数[ ] off-by-one error 缺位错误[ ] orthogonalization 正交化[ ] output layer 输出层[ ] overall cost function 总体代价函数[ ] over-complete basis 超完备基[ ] over-fitting 过拟合[ ] parts of objects 目标的部件[ ] part-whole decompostion 部分-整体分解[ ] PCA 主元分析[ ] penalty term 惩罚因子[ ] per-example mean subtraction 逐样本均值消减[150 ] pooling 池化[ ] pretrain 预训练[ ] principal components analysis 主成份分析[ ] quadratic constraints 二次约束[ ] RBMs 受限Boltzman机[ ] reconstruction based models 基于重构的模型[ ] reconstruction cost 重建代价[ ] reconstruction term 重构项[ ] redundant 冗余[ ] reflection matrix 反射矩阵[ ] regularization 正则化[ ] regularization term 正则化项[ ] rescaling 缩放[ ] robust 鲁棒性[ ] run 行程[ ] second-order feature 二阶特征[ ] sigmoid activation function S型激励函数[ ] significant digits 有效数字[ ] singular value 奇异值[ ] singular vector 奇异向量[ ] smoothed L1 penalty 平滑的L1范数惩罚[ ] Smoothed topographic L1 sparsity penalty 平滑地形L1稀疏惩罚函数[ ] smoothing 平滑[ ] Softmax Regresson Softmax回归[ ] sorted in decreasing order 降序排列[ ] source features 源特征[ ] sparse autoencoder 消减归一化[ ] Sparsity 稀疏性[ ] sparsity parameter 稀疏性参数[ ] sparsity penalty 稀疏惩罚[ ] square function 平方函数[ ] squared-error 方差[ ] stationary 平稳性（不变性）[ ] stationary stochastic process 平稳随机过程[ ] step-size 步长值[ ] supervised learning 监督学习[ ] symmetric positive semi-definite matrix 对称半正定矩阵[ ] symmetry breaking 对称失效[ ] tanh function 双曲正切函数[ ] the average activation 平均活跃度[ ] the derivative checking method 梯度验证方法[ ] the empirical distribution 经验分布函数[ ] the energy function 能量函数[ ] the Lagrange dual 拉格朗日对偶函数[ ] the log likelihood 对数似然函数[ ] the pixel intensity value 像素灰度值[ ] the rate of convergence 收敛速度[ ] topographic cost term 拓扑代价项[ ] topographic ordered 拓扑秩序[ ] transformation 变换[200 ] translation invariant 平移不变性[ ] trivial answer 平凡解[ ] under-complete basis 不完备基[ ] unrolling 组合扩展[ ] unsupervised learning 无监督学习[ ] variance 方差[ ] vecotrized implementation 向量化实现[ ] vectorization 矢量化[ ] visual cortex 视觉皮层[ ] weight decay 权重衰减[ ] weighted average 加权平均值[ ] whitening 白化[ ] zero-mean 均值为零第二部分Letter A[ ] Accumulated error backpropagation 累积误差逆传播[ ] Activation Function 激活函数[ ] Adaptive Resonance Theory/ART 自适应谐振理论[ ] Addictive model 加性学习[ ] Adversarial Networks 对抗网络[ ] Affine Layer 仿射层[ ] Affinity matrix 亲和矩阵[ ] Agent 代理/ 智能体[ ] Algorithm 算法[ ] Alpha-beta pruning α-β剪枝[ ] Anomaly detection 异常检测[ ] Approximation 近似[ ] Area Under ROC Curve／AUC Roc 曲线下面积[ ] Artificial General Intelligence/AGI 通用人工智能[ ] Artificial Intelligence/AI 人工智能[ ] Association analysis 关联分析[ ] Attention mechanism 注意力机制[ ] Attribute conditional independence assumption 属性条件独立性假设[ ] Attribute space 属性空间[ ] Attribute value 属性值[ ] Autoencoder 自编码器[ ] Automatic speech recognition 自动语音识别[ ] Automatic summarization 自动摘要[ ] Average gradient 平均梯度[ ] Average-Pooling 平均池化Letter B[ ] Backpropagation Through Time 通过时间的反向传播[ ] Backpropagation/BP 反向传播[ ] Base learner 基学习器[ ] Base learning algorithm 基学习算法[ ] Batch Normalization/BN 批量归一化[ ] Bayes decision rule 贝叶斯判定准则[250 ] Bayes Model Averaging／BMA 贝叶斯模型平均[ ] Bayes optimal classifier 贝叶斯最优分类器[ ] Bayesian decision theory 贝叶斯决策论[ ] Bayesian network 贝叶斯网络[ ] Between-class scatter matrix 类间散度矩阵[ ] Bias 偏置/ 偏差[ ] Bias-variance decomposition 偏差-方差分解[ ] Bias-Variance Dilemma 偏差–方差困境[ ] Bi-directional Long-Short Term Memory/Bi-LSTM 双向长短期记忆[ ] Binary classification 二分类[ ] Binomial test 二项检验[ ] Bi-partition 二分法[ ] Boltzmann machine 玻尔兹曼机[ ] Bootstrap sampling 自助采样法／可重复采样／有放回采样[ ] Bootstrapping 自助法[ ] Break-Event Point／BEP 平衡点Letter C[ ] Calibration 校准[ ] Cascade-Correlation 级联相关[ ] Categorical attribute 离散属性[ ] Class-conditional probability 类条件概率[ ] Classification and regression tree/CART 分类与回归树[ ] Classifier 分类器[ ] Class-imbalance 类别不平衡[ ] Closed -form 闭式[ ] Cluster 簇/类/集群[ ] Cluster analysis 聚类分析[ ] Clustering 聚类[ ] Clustering ensemble 聚类集成[ ] Co-adapting 共适应[ ] Coding matrix 编码矩阵[ ] COLT 国际学习理论会议[ ] Committee-based learning 基于委员会的学习[ ] Competitive learning 竞争型学习[ ] Component learner 组件学习器[ ] Comprehensibility 可解释性[ ] Computation Cost 计算成本[ ] Computational Linguistics 计算语言学[ ] Computer vision 计算机视觉[ ] Concept drift 概念漂移[ ] Concept Learning System /CLS 概念学习系统[ ] Conditional entropy 条件熵[ ] Conditional mutual information 条件互信息[ ] Conditional Probability Table／CPT 条件概率表[ ] Conditional random field/CRF 条件随机场[ ] Conditional risk 条件风险[ ] Confidence 置信度[ ] Confusion matrix 混淆矩阵[300 ] Connection weight 连接权[ ] Connectionism 连结主义[ ] Consistency 一致性／相合性[ ] Contingency table 列联表[ ] Continuous attribute 连续属性[ ] Convergence 收敛[ ] Conversational agent 会话智能体[ ] Convex quadratic programming 凸二次规划[ ] Convexity 凸性[ ] Convolutional neural network/CNN 卷积神经网络[ ] Co-occurrence 同现[ ] Correlation coefficient 相关系数[ ] Cosine similarity 余弦相似度[ ] Cost curve 成本曲线[ ] Cost Function 成本函数[ ] Cost matrix 成本矩阵[ ] Cost-sensitive 成本敏感[ ] Cross entropy 交叉熵[ ] Cross validation 交叉验证[ ] Crowdsourcing 众包[ ] Curse of dimensionality 维数灾难[ ] Cut point 截断点[ ] Cutting plane algorithm 割平面法Letter D[ ] Data mining 数据挖掘[ ] Data set 数据集[ ] Decision Boundary 决策边界[ ] Decision stump 决策树桩[ ] Decision tree 决策树／判定树[ ] Deduction 演绎[ ] Deep Belief Network 深度信念网络[ ] Deep Convolutional Generative Adversarial Network/DCGAN 深度卷积生成对抗网络[ ] Deep learning 深度学习[ ] Deep neural network/DNN 深度神经网络[ ] Deep Q-Learning 深度Q 学习[ ] Deep Q-Network 深度Q 网络[ ] Density estimation 密度估计[ ] Density-based clustering 密度聚类[ ] Differentiable neural computer 可微分神经计算机[ ] Dimensionality reduction algorithm 降维算法[ ] Directed edge 有向边[ ] Disagreement measure 不合度量[ ] Discriminative model 判别模型[ ] Discriminator 判别器[ ] Distance measure 距离度量[ ] Distance metric learning 距离度量学习[ ] Distribution 分布[ ] Divergence 散度[350 ] Diversity measure 多样性度量／差异性度量[ ] Domain adaption 领域自适应[ ] Downsampling 下采样[ ] D-separation （Directed separation）有向分离[ ] Dual problem 对偶问题[ ] Dummy node 哑结点[ ] Dynamic Fusion 动态融合[ ] Dynamic programming 动态规划Letter E[ ] Eigenvalue decomposition 特征值分解[ ] Embedding 嵌入[ ] Emotional analysis 情绪分析[ ] Empirical conditional entropy 经验条件熵[ ] Empirical entropy 经验熵[ ] Empirical error 经验误差[ ] Empirical risk 经验风险[ ] End-to-End 端到端[ ] Energy-based model 基于能量的模型[ ] Ensemble learning 集成学习[ ] Ensemble pruning 集成修剪[ ] Error Correcting Output Codes／ECOC 纠错输出码[ ] Error rate 错误率[ ] Error-ambiguity decomposition 误差-分歧分解[ ] Euclidean distance 欧氏距离[ ] Evolutionary computation 演化计算[ ] Expectation-Maximization 期望最大化[ ] Expected loss 期望损失[ ] Exploding Gradient Problem 梯度爆炸问题[ ] Exponential loss function 指数损失函数[ ] Extreme Learning Machine/ELM 超限学习机Letter F[ ] Factorization 因子分解[ ] False negative 假负类[ ] False positive 假正类[ ] False Positive Rate/FPR 假正例率[ ] Feature engineering 特征工程[ ] Feature selection 特征选择[ ] Feature vector 特征向量[ ] Featured Learning 特征学习[ ] Feedforward Neural Networks/FNN 前馈神经网络[ ] Fine-tuning 微调[ ] Flipping output 翻转法[ ] Fluctuation 震荡[ ] Forward stagewise algorithm 前向分步算法[ ] Frequentist 频率主义学派[ ] Full-rank matrix 满秩矩阵[400 ] Functional neuron 功能神经元Letter G[ ] Gain ratio 增益率[ ] Game theory 博弈论[ ] Gaussian kernel function 高斯核函数[ ] Gaussian Mixture Model 高斯混合模型[ ] General Problem Solving 通用问题求解[ ] Generalization 泛化[ ] Generalization error 泛化误差[ ] Generalization error bound 泛化误差上界[ ] Generalized Lagrange function 广义拉格朗日函数[ ] Generalized linear model 广义线性模型[ ] Generalized Rayleigh quotient 广义瑞利商[ ] Generative Adversarial Networks/GAN 生成对抗网络[ ] Generative Model 生成模型[ ] Generator 生成器[ ] Genetic Algorithm/GA 遗传算法[ ] Gibbs sampling 吉布斯采样[ ] Gini index 基尼指数[ ] Global minimum 全局最小[ ] Global Optimization 全局优化[ ] Gradient boosting 梯度提升[ ] Gradient Descent 梯度下降[ ] Graph theory 图论[ ] Ground-truth 真相／真实Letter H[ ] Hard margin 硬间隔[ ] Hard voting 硬投票[ ] Harmonic mean 调和平均[ ] Hesse matrix 海塞矩阵[ ] Hidden dynamic model 隐动态模型[ ] Hidden layer 隐藏层[ ] Hidden Markov Model/HMM 隐马尔可夫模型[ ] Hierarchical clustering 层次聚类[ ] Hilbert space 希尔伯特空间[ ] Hinge loss function 合页损失函数[ ] Hold-out 留出法[ ] Homogeneous 同质[ ] Hybrid computing 混合计算[ ] Hyperparameter 超参数[ ] Hypothesis 假设[ ] Hypothesis test 假设验证Letter I[ ] ICML 国际机器学习会议[450 ] Improved iterative scaling/IIS 改进的迭代尺度法[ ] Incremental learning 增量学习[ ] Independent and identically distributed/i.i.d. 独立同分布[ ] Independent Component Analysis/ICA 独立成分分析[ ] Indicator function 指示函数[ ] Individual learner 个体学习器[ ] Induction 归纳[ ] Inductive bias 归纳偏好[ ] Inductive learning 归纳学习[ ] Inductive Logic Programming／ILP 归纳逻辑程序设计[ ] Information entropy 信息熵[ ] Information gain 信息增益[ ] Input layer 输入层[ ] Insensitive loss 不敏感损失[ ] Inter-cluster similarity 簇间相似度[ ] International Conference for Machine Learning/ICML 国际机器学习大会[ ] Intra-cluster similarity 簇内相似度[ ] Intrinsic value 固有值[ ] Isometric Mapping/Isomap 等度量映射[ ] Isotonic regression 等分回归[ ] Iterative Dichotomiser 迭代二分器Letter K[ ] Kernel method 核方法[ ] Kernel trick 核技巧[ ] Kernelized Linear Discriminant Analysis／KLDA 核线性判别分析[ ] K-fold cross validation k 折交叉验证／k 倍交叉验证[ ] K-Means Clustering K –均值聚类[ ] K-Nearest Neighbours Algorithm/KNN K近邻算法[ ] Knowledge base 知识库[ ] Knowledge Representation 知识表征Letter L[ ] Label space 标记空间[ ] Lagrange duality 拉格朗日对偶性[ ] Lagrange multiplier 拉格朗日乘子[ ] Laplace smoothing 拉普拉斯平滑[ ] Laplacian correction 拉普拉斯修正[ ] Latent Dirichlet Allocation 隐狄利克雷分布[ ] Latent semantic analysis 潜在语义分析[ ] Latent variable 隐变量[ ] Lazy learning 懒惰学习[ ] Learner 学习器[ ] Learning by analogy 类比学习[ ] Learning rate 学习率[ ] Learning Vector Quantization/LVQ 学习向量量化[ ] Least squares regression tree 最小二乘回归树[ ] Leave-One-Out/LOO 留一法[500 ] linear chain conditional random field 线性链条件随机场[ ] Linear Discriminant Analysis／LDA 线性判别分析[ ] Linear model 线性模型[ ] Linear Regression 线性回归[ ] Link function 联系函数[ ] Local Markov property 局部马尔可夫性[ ] Local minimum 局部最小[ ] Log likelihood 对数似然[ ] Log odds／logit 对数几率[ ] Logistic Regression Logistic 回归[ ] Log-likelihood 对数似然[ ] Log-linear regression 对数线性回归[ ] Long-Short Term Memory/LSTM 长短期记忆[ ] Loss function 损失函数Letter M[ ] Machine translation/MT 机器翻译[ ] Macron-P 宏查准率[ ] Macron-R 宏查全率[ ] Majority voting 绝对多数投票法[ ] Manifold assumption 流形假设[ ] Manifold learning 流形学习[ ] Margin theory 间隔理论[ ] Marginal distribution 边际分布[ ] Marginal independence 边际独立性[ ] Marginalization 边际化[ ] Markov Chain Monte Carlo/MCMC 马尔可夫链蒙特卡罗方法[ ] Markov Random Field 马尔可夫随机场[ ] Maximal clique 最大团[ ] Maximum Likelihood Estimation/MLE 极大似然估计／极大似然法[ ] Maximum margin 最大间隔[ ] Maximum weighted spanning tree 最大带权生成树[ ] Max-Pooling 最大池化[ ] Mean squared error 均方误差[ ] Meta-learner 元学习器[ ] Metric learning 度量学习[ ] Micro-P 微查准率[ ] Micro-R 微查全率[ ] Minimal Description Length/MDL 最小描述长度[ ] Minimax game 极小极大博弈[ ] Misclassification cost 误分类成本[ ] Mixture of experts 混合专家[ ] Momentum 动量[ ] Moral graph 道德图／端正图[ ] Multi-class classification 多分类[ ] Multi-document summarization 多文档摘要[ ] Multi-layer feedforward neural networks 多层前馈神经网络[ ] Multilayer Perceptron/MLP 多层感知器[ ] Multimodal learning 多模态学习[550 ] Multiple Dimensional Scaling 多维缩放[ ] Multiple linear regression 多元线性回归[ ] Multi-response Linear Regression ／MLR 多响应线性回归[ ] Mutual information 互信息Letter N[ ] Naive bayes 朴素贝叶斯[ ] Naive Bayes Classifier 朴素贝叶斯分类器[ ] Named entity recognition 命名实体识别[ ] Nash equilibrium 纳什均衡[ ] Natural language generation/NLG 自然语言生成[ ] Natural language processing 自然语言处理[ ] Negative class 负类[ ] Negative correlation 负相关法[ ] Negative Log Likelihood 负对数似然[ ] Neighbourhood Component Analysis/NCA 近邻成分分析[ ] Neural Machine Translation 神经机器翻译[ ] Neural Turing Machine 神经图灵机[ ] Newton method 牛顿法[ ] NIPS 国际神经信息处理系统会议[ ] No Free Lunch Theorem／NFL 没有免费的午餐定理[ ] Noise-contrastive estimation 噪音对比估计[ ] Nominal attribute 列名属性[ ] Non-convex optimization 非凸优化[ ] Nonlinear model 非线性模型[ ] Non-metric distance 非度量距离[ ] Non-negative matrix factorization 非负矩阵分解[ ] Non-ordinal attribute 无序属性[ ] Non-Saturating Game 非饱和博弈[ ] Norm 范数[ ] Normalization 归一化[ ] Nuclear norm 核范数[ ] Numerical attribute 数值属性Letter O[ ] Objective function 目标函数[ ] Oblique decision tree 斜决策树[ ] Occam’s razor 奥卡姆剃刀[ ] Odds 几率[ ] Off-Policy 离策略[ ] One shot learning 一次性学习[ ] One-Dependent Estimator／ODE 独依赖估计[ ] On-Policy 在策略[ ] Ordinal attribute 有序属性[ ] Out-of-bag estimate 包外估计[ ] Output layer 输出层[ ] Output smearing 输出调制法[ ] Overfitting 过拟合／过配[600 ] Oversampling 过采样Letter P[ ] Paired t-test 成对t 检验[ ] Pairwise 成对型[ ] Pairwise Markov property 成对马尔可夫性[ ] Parameter 参数[ ] Parameter estimation 参数估计[ ] Parameter tuning 调参[ ] Parse tree 解析树[ ] Particle Swarm Optimization/PSO 粒子群优化算法[ ] Part-of-speech tagging 词性标注[ ] Perceptron 感知机[ ] Performance measure 性能度量[ ] Plug and Play Generative Network 即插即用生成网络[ ] Plurality voting 相对多数投票法[ ] Polarity detection 极性检测[ ] Polynomial kernel function 多项式核函数[ ] Pooling 池化[ ] Positive class 正类[ ] Positive definite matrix 正定矩阵[ ] Post-hoc test 后续检验[ ] Post-pruning 后剪枝[ ] potential function 势函数[ ] Precision 查准率／准确率[ ] Prepruning 预剪枝[ ] Principal component analysis/PCA 主成分分析[ ] Principle of multiple explanations 多释原则[ ] Prior 先验[ ] Probability Graphical Model 概率图模型[ ] Proximal Gradient Descent/PGD 近端梯度下降[ ] Pruning 剪枝[ ] Pseudo-label 伪标记[ ] Letter Q[ ] Quantized Neural Network 量子化神经网络[ ] Quantum computer 量子计算机[ ] Quantum Computing 量子计算[ ] Quasi Newton method 拟牛顿法Letter R[ ] Radial Basis Function／RBF 径向基函数[ ] Random Forest Algorithm 随机森林算法[ ] Random walk 随机漫步[ ] Recall 查全率／召回率[ ] Receiver Operating Characteristic/ROC 受试者工作特征[ ] Rectified Linear Unit/ReLU 线性修正单元[650 ] Recurrent Neural Network 循环神经网络[ ] Recursive neural network 递归神经网络[ ] Reference model 参考模型[ ] Regression 回归[ ] Regularization 正则化[ ] Reinforcement learning/RL 强化学习[ ] Representation learning 表征学习[ ] Representer theorem 表示定理[ ] reproducing kernel Hilbert space/RKHS 再生核希尔伯特空间[ ] Re-sampling 重采样法[ ] Rescaling 再缩放[ ] Residual Mapping 残差映射[ ] Residual Network 残差网络[ ] Restricted Boltzmann Machine/RBM 受限玻尔兹曼机[ ] Restricted Isometry Property/RIP 限定等距性[ ] Re-weighting 重赋权法[ ] Robustness 稳健性/鲁棒性[ ] Root node 根结点[ ] Rule Engine 规则引擎[ ] Rule learning 规则学习Letter S[ ] Saddle point 鞍点[ ] Sample space 样本空间[ ] Sampling 采样[ ] Score function 评分函数[ ] Self-Driving 自动驾驶[ ] Self-Organizing Map／SOM 自组织映射[ ] Semi-naive Bayes classifiers 半朴素贝叶斯分类器[ ] Semi-Supervised Learning 半监督学习[ ] semi-Supervised Support Vector Machine 半监督支持向量机[ ] Sentiment analysis 情感分析[ ] Separating hyperplane 分离超平面[ ] Sigmoid function Sigmoid 函数[ ] Similarity measure 相似度度量[ ] Simulated annealing 模拟退火[ ] Simultaneous localization and mapping 同步定位与地图构建[ ] Singular Value Decomposition 奇异值分解[ ] Slack variables 松弛变量[ ] Smoothing 平滑[ ] Soft margin 软间隔[ ] Soft margin maximization 软间隔最大化[ ] Soft voting 软投票[ ] Sparse representation 稀疏表征[ ] Sparsity 稀疏性[ ] Specialization 特化[ ] Spectral Clustering 谱聚类[ ] Speech Recognition 语音识别[ ] Splitting variable 切分变量[700 ] Squashing function 挤压函数[ ] Stability-plasticity dilemma 可塑性-稳定性困境[ ] Statistical learning 统计学习[ ] Status feature function 状态特征函[ ] Stochastic gradient descent 随机梯度下降[ ] Stratified sampling 分层采样[ ] Structural risk 结构风险[ ] Structural risk minimization/SRM 结构风险最小化[ ] Subspace 子空间[ ] Supervised learning 监督学习／有导师学习[ ] support vector expansion 支持向量展式[ ] Support Vector Machine/SVM 支持向量机[ ] Surrogat loss 替代损失[ ] Surrogate function 替代函数[ ] Symbolic learning 符号学习[ ] Symbolism 符号主义[ ] Synset 同义词集Letter T[ ] T-Distribution Stochastic Neighbour Embedding/t-SNE T –分布随机近邻嵌入[ ] Tensor 张量[ ] Tensor Processing Units/TPU 张量处理单元[ ] The least square method 最小二乘法[ ] Threshold 阈值[ ] Threshold logic unit 阈值逻辑单元[ ] Threshold-moving 阈值移动[ ] Time Step 时间步骤[ ] Tokenization 标记化[ ] Training error 训练误差[ ] Training instance 训练示例／训练例[ ] Transductive learning 直推学习[ ] Transfer learning 迁移学习[ ] Treebank 树库[ ] Tria-by-error 试错法[ ] True negative 真负类[ ] True positive 真正类[ ] True Positive Rate/TPR 真正例率[ ] Turing Machine 图灵机[ ] Twice-learning 二次学习Letter U[ ] Underfitting 欠拟合／欠配[ ] Undersampling 欠采样[ ] Understandability 可理解性[ ] Unequal cost 非均等代价[ ] Unit-step function 单位阶跃函数[ ] Univariate decision tree 单变量决策树[ ] Unsupervised learning 无监督学习／无导师学习[ ] Unsupervised layer-wise training 无监督逐层训练[ ] Upsampling 上采样Letter V[ ] Vanishing Gradient Problem 梯度消失问题[ ] Variational inference 变分推断[ ] VC Theory VC维理论[ ] Version space 版本空间[ ] Viterbi algorithm 维特比算法[760 ] Von Neumann architecture 冯· 诺伊曼架构Letter W[ ] Wasserstein GAN/WGAN Wasserstein生成对抗网络[ ] Weak learner 弱学习器[ ] Weight 权重[ ] Weight sharing 权共享[ ] Weighted voting 加权投票法[ ] Within-class scatter matrix 类内散度矩阵[ ] Word embedding 词嵌入[ ] Word sense disambiguation 词义消歧Letter Z[ ] Zero-data learning 零数据学习[ ] Zero-shot learning 零次学习第三部分A[ ] approximations近似值[ ] arbitrary随意的[ ] affine仿射的[ ] arbitrary任意的[ ] amino acid氨基酸[ ] amenable经得起检验的[ ] axiom公理，原则[ ] abstract提取[ ] architecture架构，体系结构；建造业[ ] absolute绝对的[ ] arsenal军火库[ ] assignment分配[ ] algebra线性代数[ ] asymptotically无症状的[ ] appropriate恰当的B[ ] bias偏差[ ] brevity简短，简洁；短暂[800 ] broader广泛[ ] briefly简短的[ ] batch批量C[ ] convergence 收敛，集中到一点[ ] convex凸的[ ] contours轮廓[ ] constraint约束[ ] constant常理[ ] commercial商务的[ ] complementarity补充[ ] coordinate ascent同等级上升[ ] clipping剪下物；剪报；修剪[ ] component分量；部件[ ] continuous连续的[ ] covariance协方差[ ] canonical正规的，正则的[ ] concave非凸的[ ] corresponds相符合；相当；通信[ ] corollary推论[ ] concrete具体的事物，实在的东西[ ] cross validation交叉验证[ ] correlation相互关系[ ] convention约定[ ] cluster一簇[ ] centroids 质心，形心[ ] converge收敛[ ] computationally计算(机)的[ ] calculus计算D[ ] derive获得，取得[ ] dual二元的[ ] duality二元性；二象性；对偶性[ ] derivation求导；得到；起源[ ] denote预示，表示，是…的标志；意味着，[逻]指称[ ] divergence 散度；发散性[ ] dimension尺度，规格；维数[ ] dot小圆点[ ] distortion变形[ ] density概率密度函数[ ] discrete离散的[ ] discriminative有识别能力的[ ] diagonal对角[ ] dispersion分散，散开[ ] determinant决定因素[849 ] disjoint不相交的E[ ] encounter遇到[ ] ellipses椭圆[ ] equality等式[ ] extra额外的[ ] empirical经验；观察[ ] ennmerate例举，计数[ ] exceed超过，越出[ ] expectation期望[ ] efficient生效的[ ] endow赋予[ ] explicitly清楚的[ ] exponential family指数家族[ ] equivalently等价的F[ ] feasible可行的[ ] forary初次尝试[ ] finite有限的，限定的[ ] forgo摒弃，放弃[ ] fliter过滤[ ] frequentist最常发生的[ ] forward search前向式搜索[ ] formalize使定形G[ ] generalized归纳的[ ] generalization概括，归纳；普遍化；判断（根据不足）[ ] guarantee保证；抵押品[ ] generate形成，产生[ ] geometric margins几何边界[ ] gap裂口[ ] generative生产的；有生产力的H[ ] heuristic启发式的；启发法；启发程序[ ] hone怀恋；磨[ ] hyperplane超平面L[ ] initial最初的[ ] implement执行[ ] intuitive凭直觉获知的[ ] incremental增加的[900 ] intercept截距[ ] intuitious直觉[ ] instantiation例子[ ] indicator指示物，指示器[ ] interative重复的，迭代的[ ] integral积分[ ] identical相等的；完全相同的[ ] indicate表示，指出[ ] invariance不变性，恒定性[ ] impose把…强加于[ ] intermediate中间的[ ] interpretation解释，翻译J[ ] joint distribution联合概率L[ ] lieu替代[ ] logarithmic对数的，用对数表示的[ ] latent潜在的[ ] Leave-one-out cross validation留一法交叉验证M[ ] magnitude巨大[ ] mapping绘图，制图；映射[ ] matrix矩阵[ ] mutual相互的，共同的[ ] monotonically单调的[ ] minor较小的，次要的[ ] multinomial多项的[ ] multi-class classification二分类问题N[ ] nasty讨厌的[ ] notation标志，注释[ ] naïve朴素的O[ ] obtain得到[ ] oscillate摆动[ ] optimization problem最优化问题[ ] objective function目标函数[ ] optimal最理想的[ ] orthogonal(矢量，矩阵等)正交的[ ] orientation方向[ ] ordinary普通的[ ] occasionally偶然的P[ ] partial derivative偏导数[ ] property性质[ ] proportional成比例的[ ] primal原始的，最初的[ ] permit允许[ ] pseudocode伪代码[ ] permissible可允许的[ ] polynomial多项式[ ] preliminary预备[ ] precision精度[ ] perturbation 不安，扰乱[ ] poist假定，设想[ ] positive semi-definite半正定的[ ] parentheses圆括号[ ] posterior probability后验概率[ ] plementarity补充[ ] pictorially图像的[ ] parameterize确定…的参数[ ] poisson distribution柏松分布[ ] pertinent相关的Q[ ] quadratic二次的[ ] quantity量，数量；分量[ ] query疑问的R[ ] regularization使系统化；调整[ ] reoptimize重新优化[ ] restrict限制；限定；约束[ ] reminiscent回忆往事的；提醒的；使人联想…的（of）[ ] remark注意[ ] random variable随机变量[ ] respect考虑[ ] respectively各自的；分别的[ ] redundant过多的；冗余的S[ ] susceptible敏感的[ ] stochastic可能的；随机的[ ] symmetric对称的[ ] sophisticated复杂的[ ] spurious假的；伪造的[ ] subtract减去；减法器[ ] simultaneously同时发生地；同步地[ ] suffice满足[ ] scarce稀有的，难得的[ ] split分解，分离[ ] subset子集[ ] statistic统计量[ ] successive iteratious连续的迭代[ ] scale标度[ ] sort of有几分的[ ] squares平方T[ ] trajectory轨迹[ ] temporarily暂时的[ ] terminology专用名词[ ] tolerance容忍；公差[ ] thumb翻阅[ ] threshold阈，临界[ ] theorem定理[ ] tangent正弦U[ ] unit-length vector单位向量V[ ] valid有效的，正确的[ ] variance方差[ ] variable变量；变元[ ] vocabulary词汇[ ] valued经估价的；宝贵的[ ] W [1038 ] wrapper包装。

yolov7蒸馏算法

yolov7蒸馏算法
Yolov7蒸馏算法是一种改进的算法，主要基于Yolov5算法。

该算法通过
蒸馏技术对teacher和student网络的特征层进行loss计算，从而优化模
型的性能。

在实现Yolov7蒸馏算法时，需要对Model类进行修改，以便
在forward_once接口中返回所需的中间层。

对于模型的训练，需要从大量数据中提取有价值的特征信息，然后将其应用于模型中，以改进模型的性能。

具体而言，在Yolov7蒸馏算法中，需要将
教师网络中的特征信息传递给学生网络，并使用这些特征信息来优化学生网络的性能。

在实现蒸馏技术时，需要将教师网络中的特征图传递给学生网络，并使用这些特征图来指导学生网络的训练。

同时，学生网络中的输出层需要使用与教师网络相同的方法进行计算，以便进行比较和优化。

此外，在训练学生网络时，需要使用适当的优化算法，如Adam等。

同时，也需要使用交叉验证等技术来评估模型的性能。

总的来说，Yolov7蒸馏算法是一种有效的改进算法，可以显著提高模型的
性能。

该算法不仅适用于Yolov5算法，也适用于其他类似的算法。

基于遗传算法优化的BP神经网络在考研结果预测中的应用

黑铉语言信麵与电睡China Computer & Communication2021年第1期基于遗传算法优化的B P神经网络在考研结果预测中的应用李驰(四川大学锦城学院计算机科学与软件工程系，四川成都611731)摘要：通过遗传算法先对BP神经网络的初始权值和阈值进行优化后，再将BP神经网络用于考研结果的预测模型中。

实验表明，这种优化后的预测模型因为克服了收敛速度慢、易产生局部最小等缺陷，比单纯使用BP神经网络建立的预测模型准确度更高。

将这个预测模型用于考研报名之前供学生预测参考，方便学生做出合理的决策，具有一定的实际意义。

关键词：考研；预测；BP神经网络；遗传算法中图分类号：TD712 文献标识码：A文章编号：1003-9767 (2021) 01-038-04Application of BP Neural Network Based on Genetic Algorithms Optimization in Prediction of Postgraduate Entrance ExaminationLI Chi(Department of Computer Science and Software Engineering,Jincheng College of Sichuan University,Chengdu Sichuan611731, China) Abstract：F irs tly,the in itia l weight and threshold of BP neural network are optimized by genetic algorithm,and then BP neural netw ork is used in the pre diction model of the results o f the postgraduate entrance exam ination.The experim ent shows that the optim ized prediction model overcomes the shortcomings o f slow convergence speed and easy to produce local m inim um,so it is more accurate than the pre diction model established by BP neural network alone.This pre diction model can be used as a reference for students to make a reasonable decision before applying fo r postgraduate entrance examination.Key words：postgraduate entrance exam ination;prediction;BP neural network;genetic algorithms〇引言随着社会对于高素质知识型人才的需求越来越迫切，我国报考研究生的人数呈现逐年大幅増加的趋势。

weka教程_使用方法

University of Waikato 36

“Meta”-classifiers include:

1/28/2014
1/28/2014
University of Waikato
37
1/28/2014
University of Waikato
38
1/28/2014
University of Waikato
University of Waikato 8
1/28/2014
WEKA only deals with “flat” files
@relation heart-disease-simplified @attribute age numeric @attribute sex { female, male} @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina} @attribute cholesterol numeric @attribute exercise_induced_angina { no, yes} @attribute class { present, not_present} @data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present ...
作者: Ian H. Witten / Eibe Frank 副标题: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems) 页数: 525 出版社: Morgan Kaufmann 出版年: 2005-06-08

在线最优化求解(Online Optimization)-冯扬

��
最优值。如果��(��)为凸函数，则可以保证结果为全局最优解。针对有等式约束的最优化问题，采用拉格朗日乘数法（Lagrange Multiplier）[2]进行求解：
通过拉格朗日系数�� = [��1, ��2 … �� ]�� ∈ ℝ�� 把等式约束和目标函数组合成为一个式子，对该式子进行最优化求解：
那么称��(��)是凸函数（Convex）[1]。一个函数是凸函数是它存在最优解的充分必要条件。此外，如果��(��)满足： �� 1 + 1 − �� 2 < �� 1 + 1 − �� 2 ∀��1, ��2 ∈ ��, 0 < �� < 1
含义是在 n 个等式约束 �� 以及 m 个不等式约束 ��
目标函数��(��)最小。
的条件下，求解��，令
针对无约束最优化问题，通常做法就是对��(��)求导，并令 �� = 0，求解可以得到
1 动机与目的
在实际工作中，无论是工程师、项目经理、产品同学都会经常讨论一类话题：“从线上对比的效果来看，某某特征或因素对 xx 产品的最终效果有很大的影响”。这类话题本质上说的是通过已有的数据反映出某些特定的因素对结果有很强的正（或负）相关性。而如何定量计算这种相关性？如何得到一套模型参数能够使得效果达到最优？这就是最优化计算要做的事情。

乳腺癌影像组学研究进展

乳腺癌影像组学研究进展汪艳汪艳，，罗威罗威，，徐培豪深圳大学附属华南医院放射科，广东深圳518111摘要影像组学在人工智能大数据时代应运而生，近年来国内外在各种疾病的诊断、治疗、预后领域都有相关研究。

当前已有大量运用影像组学对乳腺癌进行研究的报道，均提示影像组学在乳腺癌的早期诊断、预后判断、疗效评估等方面具有重要意义。

本文将对现有研究成果进行综述。

关键词影像组学；乳腺癌；成果；综述中图分类号R737737..9文献标志码A doi10.11966/j.issn.2095-994X.2023.09.06.48Advances in Radiomics Research of Breast CancerWANG　Yan, LUO　Wei, XU PeihaoDepartment of Radiology, South China Hospital Affiliated to Shenzhen University, Shenzhen, Guangdong Province, 518111 ChinaAbstract Radiomics has emerged in the era of artificial intelligence and big data. In recent years, there have been related researches in the fields of diagnosis, treatment and prognosis of various diseases at home and abroad. At present, there have been a large number of reports on the use of radiomics in the study of breast cancer, all of which suggest that radiomics is of great significance in the early diagnosis, prognosis judgment, and efficacy evaluation of breast cancer. This paper will review the existing research results.Key words Radiomics; Breast cancer; Achievements; Summarize乳腺癌新发病例高达226万，已代替肺癌变成了全球第一大癌症；对于全球女性而言，乳腺癌的发病率和致死率居癌症首位[1]。

kNN算法综述

kNN算法综述王宇航13120476(北京交通大学计算机与信息技术学院，北京，100044)摘要：kNN算法是著名的模式识别统计学方法，是最好的文本分类算法之一，在机器学习分类算法中占有相当大的地位，是最简单的机器学习算法之一。

本文对kNN算法及相关文献做一份总结，详细介绍kNN算法的思想、原理、实现步骤以及具体实现代码，并分析了算法的优缺点及其各种改进方案。

本文还介绍了kNN算法的发展历程、重要的发表的论文。

本文在最后介绍了kNN算法的应用领域，并重点说明其在文本分类中的实现。

关键字：kNN算法；k近邻算法；机器学习；文本分类Abstract: KNN algorithm, a famous statistical method of pattern recognition, which is one of the best algorithms for dealing with text categorization, is playing an important role in machine learning classification algorithm, and it is one of the simplest algorithms in machine learning. This paper mainly summaries the kNN algorithm and its related literature, and detailed introduces its main idea, principle, implementation steps and specific implementation code, as well as analyzes the advantages and disadvantages of the algorithm and its various improvement schemes. This paper also introduces the development course of kNN algorithm, its important published paper. In the final, this paper introduces the application field of kNN algorithm, and especially in text categorization.Keywords: KNN algorithm, K neighbor algorithm, Machine learning, Text classification1引言分类是数据挖掘中的核心和基础技术，在经营、决策、管理、科学研究等多个领域都有着广泛的应用。

[第3集]欠拟合与过拟合的概念

[第3集]⽋拟合与过拟合的概念(⼤部分都不是⾃⼰写的，⽽是看完视频再总结的过程中看到有的博客已经总结的很好了，只是拿来保存⼀下，⾮原创)复习：（x（i），y（i））第 i 个样本，样本总数为 m令，以参数向量为条件，对于输⼊x，输出为：hθ（x（i））=θT xn为特征数量最⼩⼆乘法：通过正规⽅程组推导的结论：⼀、过拟合与⽋拟合 1、⽋拟合：⽋拟合就是模型没有很好地捕捉到数据特征，不能够很好地拟合数据，例如下⾯的例⼦：左图表⽰size与prize关系的数据，中间的图就是出现⽋拟合的模型，不能够很好地拟合数据，如果在中间的图的模型后⾯再加⼀个⼆次项，就可以很好地拟合图中的数据了，如右⾯的图所⽰。

2、过拟合：通俗⼀点地来说过拟合就是模型把数据学习的太彻底，以⾄于把噪声数据的特征也学习到了，这样就会导致在后期测试的时候不能够很好地识别数据，即不能正确的分类，模型泛化能⼒太差。

例如下⾯的例⼦。

若训练集有7个数据，则可拟合出最⾼6次的多项式，可以找到⼀条完美的曲线，该曲线经过每个数据点。

但是这样的模型⼜过于复杂，拟合结果仅仅反映了所给的特定数据的特质，不具有通过房屋⼤⼩来估计房价的普遍性。

过拟合中的到的曲线仅仅反应了特定数据的特质，不具有普遍性⼆、参数学习算法和⾮参数学习算法解决此类学习问题的⽅法： 1) 特征选择算法：⼀类⾃动化算法，在这类回归问题中选择要⽤到的特征 2) ⾮参数学习算法：缓解对于选取特征的需求，引出局部加权回归参数学习算法（parametric learning algorithm）定义：参数学习算法是⼀类有固定数⽬参数，以⽤来进⾏数据拟合的算法。

设该固定的参数集合为。

线性回归即使参数学习算法的⼀个例⼦定义：参数的数⽬会随着训练集合的⼤⼩线性增加局部加权回归（Locally Weighted Regression）⼀种特定的⾮参数学习算法。

也称作Loess。

算法思想：假设对于⼀个确定的查询点x，在x处对你的假设h(x)求值。

Lazy和Eager分类算法

概述
人们为了解决一个新问题，先是进行回忆，从记忆中找到一个与新问题相似的范例，然后把该范例中的有关信息和知识复用到新问题的求解之中。
• Summarization on RBF
– To provide a global approximation to the target function – Represented by a linear combination of many local kernel functions – To neglect the values out of defined region(region/width) – Can be trained more efficiently
• Instance representation V—feature vector <a1(x ), a2(x ),… an(x ) > • Distance metric(Euclidean distance)
d ( xi , x j )
(a ( x ) a ( x ))
r 1 r i r j
2 q
x k nearest nbrs of xq

ˆ ( x)) a ( x) K (d ( xq , x))( f ( x) f
j
[Atkeson,1997 & Bishop,1995]
Radial Basis Function
Related to distance-weighted regression & artificial neural networks [Powell,1987; Broomhead & Lowe 1988; Moody
ˆ ( x)) ( x) f