Literature (10): Semi-supervised and unsupervised extreme learning
Semi-supervised learning for generative models

In semi-supervised learning for generative models, the objective is to learn a model that can generate new data samples similar to the training data when only a subset of the data is labeled. This is achieved by utilizing both labeled and unlabeled data during the training process.

One common approach for semi-supervised learning with generative models is the generative adversarial network (GAN) framework. In this framework, a generator model is trained to generate realistic data samples, while a discriminator model is trained to distinguish between real and generated samples.

To incorporate the labeled and unlabeled data, the labeled data is used to train the discriminator to correctly classify the real and generated samples. This helps the generator improve its ability to generate realistic samples that can fool the discriminator. The unlabeled data, on the other hand, is used to help the generator learn the underlying data distribution and generate samples that align with the characteristics of the unlabeled data.

The training process typically alternates between updating the generator and the discriminator. The generator tries to produce samples that the discriminator classifies as real, while the discriminator tries to correctly separate real from generated samples. This adversarial process helps the generator produce more realistic samples over time.

Overall, semi-supervised learning for generative models aims to leverage the unlabeled data to improve the quality of the generated samples and achieve better generalization. It allows for more efficient and effective training when labeled data is limited.
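To make the discriminator's two roles concrete, here is a minimal PyTorch-style sketch of one common formulation (a K-class discriminator with an implicit extra "fake" class whose logit is fixed at 0). The supervised term uses the labeled data; the unsupervised term uses unlabeled real data and generated data. This specific construction is an illustrative assumption, not a recipe prescribed by the text above.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(logits_labeled, labels, logits_unlabeled, logits_fake):
    """Semi-supervised GAN discriminator loss (illustrative sketch).

    All logits have shape (batch, K): K real classes, with the implicit "fake"
    class logit fixed at 0, so p(real | x) = Z / (Z + 1) with Z = sum_k exp(logit_k).
    """
    # Supervised term: labeled real samples should receive their true class.
    loss_sup = F.cross_entropy(logits_labeled, labels)
    # Unsupervised term: unlabeled real samples should look real, generated ones fake.
    lse_real = torch.logsumexp(logits_unlabeled, dim=1)
    lse_fake = torch.logsumexp(logits_fake, dim=1)
    loss_unsup_real = -(lse_real - F.softplus(lse_real)).mean()  # -log p(real | x_unlabeled)
    loss_unsup_fake = F.softplus(lse_fake).mean()                # -log p(fake | G(z))
    return loss_sup + loss_unsup_real + loss_unsup_fake

def generator_loss(logits_fake):
    """Non-saturating generator loss: make generated samples look real to D."""
    lse_fake = torch.logsumexp(logits_fake, dim=1)
    return -(lse_fake - F.softplus(lse_fake)).mean()
```

In an alternating training loop, the discriminator loss is minimized on one step and the generator loss on the next, exactly as described above.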
Supervised and Unsupervised Learning
– Maximum Likelihood, EM, etc.
searching for patterns in the data. • In more detail, the most relevant DM tasks are:
– association – sequence or path analysis – clustering – classification – regression – visualization
Cluster Analysis
How many clusters do you expect?
Search for Outliers
Classification
• Data mining technique used to predict group membership for data instances. There are two ways to assign a new value to a given class.
• The model is not provided with the correct results during the training.
• Can be used to cluster the input data into classes on the basis of their statistical properties only.
• Some tasks
Survey of Neural Networks and Deep Learning (Deep Learning, 15 May 2014)
Draft:Deep Learning in Neural Networks:An OverviewTechnical Report IDSIA-03-14/arXiv:1404.7828(v1.5)[cs.NE]J¨u rgen SchmidhuberThe Swiss AI Lab IDSIAIstituto Dalle Molle di Studi sull’Intelligenza ArtificialeUniversity of Lugano&SUPSIGalleria2,6928Manno-LuganoSwitzerland15May2014AbstractIn recent years,deep artificial neural networks(including recurrent ones)have won numerous con-tests in pattern recognition and machine learning.This historical survey compactly summarises relevantwork,much of it from the previous millennium.Shallow and deep learners are distinguished by thedepth of their credit assignment paths,which are chains of possibly learnable,causal links between ac-tions and effects.I review deep supervised learning(also recapitulating the history of backpropagation),unsupervised learning,reinforcement learning&evolutionary computation,and indirect search for shortprograms encoding deep and large networks.PDF of earlier draft(v1):http://www.idsia.ch/∼juergen/DeepLearning30April2014.pdfLATEX source:http://www.idsia.ch/∼juergen/DeepLearning30April2014.texComplete BIBTEXfile:http://www.idsia.ch/∼juergen/bib.bibPrefaceThis is the draft of an invited Deep Learning(DL)overview.One of its goals is to assign credit to those who contributed to the present state of the art.I acknowledge the limitations of attempting to achieve this goal.The DL research community itself may be viewed as a continually evolving,deep network of scientists who have influenced each other in complex ways.Starting from recent DL results,I tried to trace back the origins of relevant ideas through the past half century and beyond,sometimes using“local search”to follow citations of citations backwards in time.Since not all DL publications properly acknowledge earlier relevant work,additional global search strategies were employed,aided by consulting numerous neural network experts.As a result,the present draft mostly consists of references(about800entries so far).Nevertheless,through an expert selection bias I may have missed important work.A related bias was surely introduced by my special familiarity with the work of my own DL research group in the past quarter-century.For these reasons,the present draft should be viewed as merely a snapshot of an ongoing credit assignment process.To help improve it,please do not hesitate to send corrections and suggestions to juergen@idsia.ch.Contents1Introduction to Deep Learning(DL)in Neural Networks(NNs)3 2Event-Oriented Notation for Activation Spreading in FNNs/RNNs3 3Depth of Credit Assignment Paths(CAPs)and of Problems4 4Recurring Themes of Deep Learning54.1Dynamic Programming(DP)for DL (5)4.2Unsupervised Learning(UL)Facilitating Supervised Learning(SL)and RL (6)4.3Occam’s Razor:Compression and Minimum Description Length(MDL) (6)4.4Learning Hierarchical Representations Through Deep SL,UL,RL (6)4.5Fast Graphics Processing Units(GPUs)for DL in NNs (6)5Supervised NNs,Some Helped by Unsupervised NNs75.11940s and Earlier (7)5.2Around1960:More Neurobiological Inspiration for DL (7)5.31965:Deep Networks Based on the Group Method of Data Handling(GMDH) (8)5.41979:Convolution+Weight Replication+Winner-Take-All(WTA) (8)5.51960-1981and Beyond:Development of Backpropagation(BP)for NNs (8)5.5.1BP for Weight-Sharing Feedforward NNs(FNNs)and Recurrent NNs(RNNs)..95.6Late1980s-2000:Numerous Improvements of NNs (9)5.6.1Ideas for Dealing with Long Time Lags and Deep CAPs (10)5.6.2Better BP Through Advanced Gradient Descent (10)5.6.3Discovering Low-Complexity,Problem-Solving NNs 
(11)5.6.4Potential Benefits of UL for SL (11)5.71987:UL Through Autoencoder(AE)Hierarchies (12)5.81989:BP for Convolutional NNs(CNNs) (13)5.91991:Fundamental Deep Learning Problem of Gradient Descent (13)5.101991:UL-Based History Compression Through a Deep Hierarchy of RNNs (14)5.111992:Max-Pooling(MP):Towards MPCNNs (14)5.121994:Contest-Winning Not So Deep NNs (15)5.131995:Supervised Recurrent Very Deep Learner(LSTM RNN) (15)5.142003:More Contest-Winning/Record-Setting,Often Not So Deep NNs (16)5.152006/7:Deep Belief Networks(DBNs)&AE Stacks Fine-Tuned by BP (17)5.162006/7:Improved CNNs/GPU-CNNs/BP-Trained MPCNNs (17)5.172009:First Official Competitions Won by RNNs,and with MPCNNs (18)5.182010:Plain Backprop(+Distortions)on GPU Yields Excellent Results (18)5.192011:MPCNNs on GPU Achieve Superhuman Vision Performance (18)5.202011:Hessian-Free Optimization for RNNs (19)5.212012:First Contests Won on ImageNet&Object Detection&Segmentation (19)5.222013-:More Contests and Benchmark Records (20)5.22.1Currently Successful Supervised Techniques:LSTM RNNs/GPU-MPCNNs (21)5.23Recent Tricks for Improving SL Deep NNs(Compare Sec.5.6.2,5.6.3) (21)5.24Consequences for Neuroscience (22)5.25DL with Spiking Neurons? (22)6DL in FNNs and RNNs for Reinforcement Learning(RL)236.1RL Through NN World Models Yields RNNs With Deep CAPs (23)6.2Deep FNNs for Traditional RL and Markov Decision Processes(MDPs) (24)6.3Deep RL RNNs for Partially Observable MDPs(POMDPs) (24)6.4RL Facilitated by Deep UL in FNNs and RNNs (25)6.5Deep Hierarchical RL(HRL)and Subgoal Learning with FNNs and RNNs (25)6.6Deep RL by Direct NN Search/Policy Gradients/Evolution (25)6.7Deep RL by Indirect Policy Search/Compressed NN Search (26)6.8Universal RL (27)7Conclusion271Introduction to Deep Learning(DL)in Neural Networks(NNs) Which modifiable components of a learning system are responsible for its success or failure?What changes to them improve performance?This has been called the fundamental credit assignment problem(Minsky, 1963).There are general credit assignment methods for universal problem solvers that are time-optimal in various theoretical senses(Sec.6.8).The present survey,however,will focus on the narrower,but now commercially important,subfield of Deep Learning(DL)in Artificial Neural Networks(NNs).We are interested in accurate credit assignment across possibly many,often nonlinear,computational stages of NNs.Shallow NN-like models have been around for many decades if not centuries(Sec.5.1).Models with several successive nonlinear layers of neurons date back at least to the1960s(Sec.5.3)and1970s(Sec.5.5). 
An efficient gradient descent method for teacher-based Supervised Learning(SL)in discrete,differentiable networks of arbitrary depth called backpropagation(BP)was developed in the1960s and1970s,and ap-plied to NNs in1981(Sec.5.5).BP-based training of deep NNs with many layers,however,had been found to be difficult in practice by the late1980s(Sec.5.6),and had become an explicit research subject by the early1990s(Sec.5.9).DL became practically feasible to some extent through the help of Unsupervised Learning(UL)(e.g.,Sec.5.10,5.15).The1990s and2000s also saw many improvements of purely super-vised DL(Sec.5).In the new millennium,deep NNs havefinally attracted wide-spread attention,mainly by outperforming alternative machine learning methods such as kernel machines(Vapnik,1995;Sch¨o lkopf et al.,1998)in numerous important applications.In fact,supervised deep NNs have won numerous of-ficial international pattern recognition competitions(e.g.,Sec.5.17,5.19,5.21,5.22),achieving thefirst superhuman visual pattern recognition results in limited domains(Sec.5.19).Deep NNs also have become relevant for the more generalfield of Reinforcement Learning(RL)where there is no supervising teacher (Sec.6).Both feedforward(acyclic)NNs(FNNs)and recurrent(cyclic)NNs(RNNs)have won contests(Sec.5.12,5.14,5.17,5.19,5.21,5.22).In a sense,RNNs are the deepest of all NNs(Sec.3)—they are general computers more powerful than FNNs,and can in principle create and process memories of ar-bitrary sequences of input patterns(e.g.,Siegelmann and Sontag,1991;Schmidhuber,1990a).Unlike traditional methods for automatic sequential program synthesis(e.g.,Waldinger and Lee,1969;Balzer, 1985;Soloway,1986;Deville and Lau,1994),RNNs can learn programs that mix sequential and parallel information processing in a natural and efficient way,exploiting the massive parallelism viewed as crucial for sustaining the rapid decline of computation cost observed over the past75years.The rest of this paper is structured as follows.Sec.2introduces a compact,event-oriented notation that is simple yet general enough to accommodate both FNNs and RNNs.Sec.3introduces the concept of Credit Assignment Paths(CAPs)to measure whether learning in a given NN application is of the deep or shallow type.Sec.4lists recurring themes of DL in SL,UL,and RL.Sec.5focuses on SL and UL,and on how UL can facilitate SL,although pure SL has become dominant in recent competitions(Sec.5.17-5.22). 
Sec.5is arranged in a historical timeline format with subsections on important inspirations and technical contributions.Sec.6on deep RL discusses traditional Dynamic Programming(DP)-based RL combined with gradient-based search techniques for SL or UL in deep NNs,as well as general methods for direct and indirect search in the weight space of deep FNNs and RNNs,including successful policy gradient and evolutionary methods.2Event-Oriented Notation for Activation Spreading in FNNs/RNNs Throughout this paper,let i,j,k,t,p,q,r denote positive integer variables assuming ranges implicit in the given contexts.Let n,m,T denote positive integer constants.An NN’s topology may change over time(e.g.,Fahlman,1991;Ring,1991;Weng et al.,1992;Fritzke, 1994).At any given moment,it can be described as afinite subset of units(or nodes or neurons)N= {u1,u2,...,}and afinite set H⊆N×N of directed edges or connections between nodes.FNNs are acyclic graphs,RNNs cyclic.Thefirst(input)layer is the set of input units,a subset of N.In FNNs,the k-th layer(k>1)is the set of all nodes u∈N such that there is an edge path of length k−1(but no longer path)between some input unit and u.There may be shortcut connections between distant layers.The NN’s behavior or program is determined by a set of real-valued,possibly modifiable,parameters or weights w i(i=1,...,n).We now focus on a singlefinite episode or epoch of information processing and activation spreading,without learning through weight changes.The following slightly unconventional notation is designed to compactly describe what is happening during the runtime of the system.During an episode,there is a partially causal sequence x t(t=1,...,T)of real values that I call events.Each x t is either an input set by the environment,or the activation of a unit that may directly depend on other x k(k<t)through a current NN topology-dependent set in t of indices k representing incoming causal connections or links.Let the function v encode topology information and map such event index pairs(k,t)to weight indices.For example,in the non-input case we may have x t=f t(net t)with real-valued net t= k∈in t x k w v(k,t)(additive case)or net t= k∈in t x k w v(k,t)(multiplicative case), where f t is a typically nonlinear real-valued activation function such as tanh.In many recent competition-winning NNs(Sec.5.19,5.21,5.22)there also are events of the type x t=max k∈int (x k);some networktypes may also use complex polynomial activation functions(Sec.5.3).x t may directly affect certain x k(k>t)through outgoing connections or links represented through a current set out t of indices k with t∈in k.Some non-input events are called output events.Note that many of the x t may refer to different,time-varying activations of the same unit in sequence-processing RNNs(e.g.,Williams,1989,“unfolding in time”),or also in FNNs sequentially exposed to time-varying input patterns of a large training set encoded as input events.During an episode,the same weight may get reused over and over again in topology-dependent ways,e.g.,in RNNs,or in convolutional NNs(Sec.5.4,5.8).I call this weight sharing across space and/or time.Weight sharing may greatly reduce the NN’s descriptive complexity,which is the number of bits of information required to describe the NN (Sec.4.3).In Supervised Learning(SL),certain NN output events x t may be associated with teacher-given,real-valued labels or targets d t yielding errors e t,e.g.,e t=1/2(x t−d t)2.A typical goal of supervised NN training is tofind weights that yield 
episodes with small total error E,the sum of all such e t.The hope is that the NN will generalize well in later episodes,causing only small errors on previously unseen sequences of input events.Many alternative error functions for SL and UL are possible.SL assumes that input events are independent of earlier output events(which may affect the environ-ment through actions causing subsequent perceptions).This assumption does not hold in the broaderfields of Sequential Decision Making and Reinforcement Learning(RL)(Kaelbling et al.,1996;Sutton and Barto, 1998;Hutter,2005)(Sec.6).In RL,some of the input events may encode real-valued reward signals given by the environment,and a typical goal is tofind weights that yield episodes with a high sum of reward signals,through sequences of appropriate output actions.Sec.5.5will use the notation above to compactly describe a central algorithm of DL,namely,back-propagation(BP)for supervised weight-sharing FNNs and RNNs.(FNNs may be viewed as RNNs with certainfixed zero weights.)Sec.6will address the more general RL case.3Depth of Credit Assignment Paths(CAPs)and of ProblemsTo measure whether credit assignment in a given NN application is of the deep or shallow type,I introduce the concept of Credit Assignment Paths or CAPs,which are chains of possibly causal links between events.Let usfirst focus on SL.Consider two events x p and x q(1≤p<q≤T).Depending on the appli-cation,they may have a Potential Direct Causal Connection(PDCC)expressed by the Boolean predicate pdcc(p,q),which is true if and only if p∈in q.Then the2-element list(p,q)is defined to be a CAP from p to q(a minimal one).A learning algorithm may be allowed to change w v(p,q)to improve performance in future episodes.More general,possibly indirect,Potential Causal Connections(PCC)are expressed by the recursively defined Boolean predicate pcc(p,q),which in the SL case is true only if pdcc(p,q),or if pcc(p,k)for some k and pdcc(k,q).In the latter case,appending q to any CAP from p to k yields a CAP from p to q(this is a recursive definition,too).The set of such CAPs may be large but isfinite.Note that the same weight may affect many different PDCCs between successive events listed by a given CAP,e.g.,in the case of RNNs, or weight-sharing FNNs.Suppose a CAP has the form(...,k,t,...,q),where k and t(possibly t=q)are thefirst successive elements with modifiable w v(k,t).Then the length of the suffix list(t,...,q)is called the CAP’s depth (which is0if there are no modifiable links at all).This depth limits how far backwards credit assignment can move down the causal chain tofind a modifiable weight.1Suppose an episode and its event sequence x1,...,x T satisfy a computable criterion used to decide whether a given problem has been solved(e.g.,total error E below some threshold).Then the set of used weights is called a solution to the problem,and the depth of the deepest CAP within the sequence is called the solution’s depth.There may be other solutions(yielding different event sequences)with different depths.Given somefixed NN topology,the smallest depth of any solution is called the problem’s depth.Sometimes we also speak of the depth of an architecture:SL FNNs withfixed topology imply a problem-independent maximal problem depth bounded by the number of non-input layers.Certain SL RNNs withfixed weights for all connections except those to output units(Jaeger,2001;Maass et al.,2002; Jaeger,2004;Schrauwen et al.,2007)have a maximal problem depth of1,because only thefinal links in the corresponding CAPs 
are modifiable.In general,however,RNNs may learn to solve problems of potentially unlimited depth.Note that the definitions above are solely based on the depths of causal chains,and agnostic of the temporal distance between events.For example,shallow FNNs perceiving large“time windows”of in-put events may correctly classify long input sequences through appropriate output events,and thus solve shallow problems involving long time lags between relevant events.At which problem depth does Shallow Learning end,and Deep Learning begin?Discussions with DL experts have not yet yielded a conclusive response to this question.Instead of committing myself to a precise answer,let me just define for the purposes of this overview:problems of depth>10require Very Deep Learning.The difficulty of a problem may have little to do with its depth.Some NNs can quickly learn to solve certain deep problems,e.g.,through random weight guessing(Sec.5.9)or other types of direct search (Sec.6.6)or indirect search(Sec.6.7)in weight space,or through training an NNfirst on shallow problems whose solutions may then generalize to deep problems,or through collapsing sequences of(non)linear operations into a single(non)linear operation—but see an analysis of non-trivial aspects of deep linear networks(Baldi and Hornik,1994,Section B).In general,however,finding an NN that precisely models a given training set is an NP-complete problem(Judd,1990;Blum and Rivest,1992),also in the case of deep NNs(S´ıma,1994;de Souto et al.,1999;Windisch,2005);compare a survey of negative results(S´ıma, 2002,Section1).Above we have focused on SL.In the more general case of RL in unknown environments,pcc(p,q) is also true if x p is an output event and x q any later input event—any action may affect the environment and thus any later perception.(In the real world,the environment may even influence non-input events computed on a physical hardware entangled with the entire universe,but this is ignored here.)It is possible to model and replace such unmodifiable environmental PCCs through a part of the NN that has already learned to predict(through some of its units)input events(including reward signals)from former input events and actions(Sec.6.1).Its weights are frozen,but can help to assign credit to other,still modifiable weights used to compute actions(Sec.6.1).This approach may lead to very deep CAPs though.Some DL research is about automatically rephrasing problems such that their depth is reduced(Sec.4). 
In particular,sometimes UL is used to make SL problems less deep,e.g.,Sec.5.10.Often Dynamic Programming(Sec.4.1)is used to facilitate certain traditional RL problems,e.g.,Sec.6.2.Sec.5focuses on CAPs for SL,Sec.6on the more complex case of RL.4Recurring Themes of Deep Learning4.1Dynamic Programming(DP)for DLOne recurring theme of DL is Dynamic Programming(DP)(Bellman,1957),which can help to facili-tate credit assignment under certain assumptions.For example,in SL NNs,backpropagation itself can 1An alternative would be to count only modifiable links when measuring depth.In many typical NN applications this would not make a difference,but in some it would,e.g.,Sec.6.1.be viewed as a DP-derived method(Sec.5.5).In traditional RL based on strong Markovian assumptions, DP-derived methods can help to greatly reduce problem depth(Sec.6.2).DP algorithms are also essen-tial for systems that combine concepts of NNs and graphical models,such as Hidden Markov Models (HMMs)(Stratonovich,1960;Baum and Petrie,1966)and Expectation Maximization(EM)(Dempster et al.,1977),e.g.,(Bottou,1991;Bengio,1991;Bourlard and Morgan,1994;Baldi and Chauvin,1996; Jordan and Sejnowski,2001;Bishop,2006;Poon and Domingos,2011;Dahl et al.,2012;Hinton et al., 2012a).4.2Unsupervised Learning(UL)Facilitating Supervised Learning(SL)and RL Another recurring theme is how UL can facilitate both SL(Sec.5)and RL(Sec.6).UL(Sec.5.6.4) is normally used to encode raw incoming data such as video or speech streams in a form that is more convenient for subsequent goal-directed learning.In particular,codes that describe the original data in a less redundant or more compact way can be fed into SL(Sec.5.10,5.15)or RL machines(Sec.6.4),whose search spaces may thus become smaller(and whose CAPs shallower)than those necessary for dealing with the raw data.UL is closely connected to the topics of regularization and compression(Sec.4.3,5.6.3). 
4.3Occam’s Razor:Compression and Minimum Description Length(MDL) Occam’s razor favors simple solutions over complex ones.Given some programming language,the prin-ciple of Minimum Description Length(MDL)can be used to measure the complexity of a solution candi-date by the length of the shortest program that computes it(e.g.,Solomonoff,1964;Kolmogorov,1965b; Chaitin,1966;Wallace and Boulton,1968;Levin,1973a;Rissanen,1986;Blumer et al.,1987;Li and Vit´a nyi,1997;Gr¨u nwald et al.,2005).Some methods explicitly take into account program runtime(Al-lender,1992;Watanabe,1992;Schmidhuber,2002,1995);many consider only programs with constant runtime,written in non-universal programming languages(e.g.,Rissanen,1986;Hinton and van Camp, 1993).In the NN case,the MDL principle suggests that low NN weight complexity corresponds to high NN probability in the Bayesian view(e.g.,MacKay,1992;Buntine and Weigend,1991;De Freitas,2003), and to high generalization performance(e.g.,Baum and Haussler,1989),without overfitting the training data.Many methods have been proposed for regularizing NNs,that is,searching for solution-computing, low-complexity SL NNs(Sec.5.6.3)and RL NNs(Sec.6.7).This is closely related to certain UL methods (Sec.4.2,5.6.4).4.4Learning Hierarchical Representations Through Deep SL,UL,RLMany methods of Good Old-Fashioned Artificial Intelligence(GOFAI)(Nilsson,1980)as well as more recent approaches to AI(Russell et al.,1995)and Machine Learning(Mitchell,1997)learn hierarchies of more and more abstract data representations.For example,certain methods of syntactic pattern recog-nition(Fu,1977)such as grammar induction discover hierarchies of formal rules to model observations. The partially(un)supervised Automated Mathematician/EURISKO(Lenat,1983;Lenat and Brown,1984) continually learns concepts by combining previously learnt concepts.Such hierarchical representation learning(Ring,1994;Bengio et al.,2013;Deng and Yu,2014)is also a recurring theme of DL NNs for SL (Sec.5),UL-aided SL(Sec.5.7,5.10,5.15),and hierarchical RL(Sec.6.5).Often,abstract hierarchical representations are natural by-products of data compression(Sec.4.3),e.g.,Sec.5.10.4.5Fast Graphics Processing Units(GPUs)for DL in NNsWhile the previous millennium saw several attempts at creating fast NN-specific hardware(e.g.,Jackel et al.,1990;Faggin,1992;Ramacher et al.,1993;Widrow et al.,1994;Heemskerk,1995;Korkin et al., 1997;Urlbe,1999),and at exploiting standard hardware(e.g.,Anguita et al.,1994;Muller et al.,1995; Anguita and Gomes,1996),the new millennium brought a DL breakthrough in form of cheap,multi-processor graphics cards or GPUs.GPUs are widely used for video games,a huge and competitive market that has driven down hardware prices.GPUs excel at fast matrix and vector multiplications required not only for convincing virtual realities but also for NN training,where they can speed up learning by a factorof50and more.Some of the GPU-based FNN implementations(Sec.5.16-5.19)have greatly contributed to recent successes in contests for pattern recognition(Sec.5.19-5.22),image segmentation(Sec.5.21), and object detection(Sec.5.21-5.22).5Supervised NNs,Some Helped by Unsupervised NNsThe main focus of current practical applications is on Supervised Learning(SL),which has dominated re-cent pattern recognition contests(Sec.5.17-5.22).Several methods,however,use additional Unsupervised Learning(UL)to facilitate SL(Sec.5.7,5.10,5.15).It does make sense to treat SL and UL in the same section:often gradient-based methods,such as 
BP(Sec.5.5.1),are used to optimize objective functions of both UL and SL,and the boundary between SL and UL may blur,for example,when it comes to time series prediction and sequence classification,e.g.,Sec.5.10,5.12.A historical timeline format will help to arrange subsections on important inspirations and techni-cal contributions(although such a subsection may span a time interval of many years).Sec.5.1briefly mentions early,shallow NN models since the1940s,Sec.5.2additional early neurobiological inspiration relevant for modern Deep Learning(DL).Sec.5.3is about GMDH networks(since1965),perhaps thefirst (feedforward)DL systems.Sec.5.4is about the relatively deep Neocognitron NN(1979)which is similar to certain modern deep FNN architectures,as it combines convolutional NNs(CNNs),weight pattern repli-cation,and winner-take-all(WTA)mechanisms.Sec.5.5uses the notation of Sec.2to compactly describe a central algorithm of DL,namely,backpropagation(BP)for supervised weight-sharing FNNs and RNNs. It also summarizes the history of BP1960-1981and beyond.Sec.5.6describes problems encountered in the late1980s with BP for deep NNs,and mentions several ideas from the previous millennium to overcome them.Sec.5.7discusses afirst hierarchical stack of coupled UL-based Autoencoders(AEs)—this concept resurfaced in the new millennium(Sec.5.15).Sec.5.8is about applying BP to CNNs,which is important for today’s DL applications.Sec.5.9explains BP’s Fundamental DL Problem(of vanishing/exploding gradients)discovered in1991.Sec.5.10explains how a deep RNN stack of1991(the History Compressor) pre-trained by UL helped to solve previously unlearnable DL benchmarks requiring Credit Assignment Paths(CAPs,Sec.3)of depth1000and more.Sec.5.11discusses a particular WTA method called Max-Pooling(MP)important in today’s DL FNNs.Sec.5.12mentions afirst important contest won by SL NNs in1994.Sec.5.13describes a purely supervised DL RNN(Long Short-Term Memory,LSTM)for problems of depth1000and more.Sec.5.14mentions an early contest of2003won by an ensemble of shallow NNs, as well as good pattern recognition results with CNNs and LSTM RNNs(2003).Sec.5.15is mostly about Deep Belief Networks(DBNs,2006)and related stacks of Autoencoders(AEs,Sec.5.7)pre-trained by UL to facilitate BP-based SL.Sec.5.16mentions thefirst BP-trained MPCNNs(2007)and GPU-CNNs(2006). Sec.5.17-5.22focus on official competitions with secret test sets won by(mostly purely supervised)DL NNs since2009,in sequence recognition,image classification,image segmentation,and object detection. 
Many RNN results depended on LSTM(Sec.5.13);many FNN results depended on GPU-based FNN code developed since2004(Sec.5.16,5.17,5.18,5.19),in particular,GPU-MPCNNs(Sec.5.19).5.11940s and EarlierNN research started in the1940s(e.g.,McCulloch and Pitts,1943;Hebb,1949);compare also later work on learning NNs(Rosenblatt,1958,1962;Widrow and Hoff,1962;Grossberg,1969;Kohonen,1972; von der Malsburg,1973;Narendra and Thathatchar,1974;Willshaw and von der Malsburg,1976;Palm, 1980;Hopfield,1982).In a sense NNs have been around even longer,since early supervised NNs were essentially variants of linear regression methods going back at least to the early1800s(e.g.,Legendre, 1805;Gauss,1809,1821).Early NNs had a maximal CAP depth of1(Sec.3).5.2Around1960:More Neurobiological Inspiration for DLSimple cells and complex cells were found in the cat’s visual cortex(e.g.,Hubel and Wiesel,1962;Wiesel and Hubel,1959).These cellsfire in response to certain properties of visual sensory inputs,such as theorientation of plex cells exhibit more spatial invariance than simple cells.This inspired later deep NN architectures(Sec.5.4)used in certain modern award-winning Deep Learners(Sec.5.19-5.22).5.31965:Deep Networks Based on the Group Method of Data Handling(GMDH) Networks trained by the Group Method of Data Handling(GMDH)(Ivakhnenko and Lapa,1965; Ivakhnenko et al.,1967;Ivakhnenko,1968,1971)were perhaps thefirst DL systems of the Feedforward Multilayer Perceptron type.The units of GMDH nets may have polynomial activation functions imple-menting Kolmogorov-Gabor polynomials(more general than traditional NN activation functions).Given a training set,layers are incrementally grown and trained by regression analysis,then pruned with the help of a separate validation set(using today’s terminology),where Decision Regularisation is used to weed out superfluous units.The numbers of layers and units per layer can be learned in problem-dependent fashion. 
This is a good example of hierarchical representation learning(Sec.4.4).There have been numerous ap-plications of GMDH-style networks,e.g.(Ikeda et al.,1976;Farlow,1984;Madala and Ivakhnenko,1994; Ivakhnenko,1995;Kondo,1998;Kord´ık et al.,2003;Witczak et al.,2006;Kondo and Ueno,2008).5.41979:Convolution+Weight Replication+Winner-Take-All(WTA)Apart from deep GMDH networks(Sec.5.3),the Neocognitron(Fukushima,1979,1980,2013a)was per-haps thefirst artificial NN that deserved the attribute deep,and thefirst to incorporate the neurophysiolog-ical insights of Sec.5.2.It introduced convolutional NNs(today often called CNNs or convnets),where the(typically rectangular)receptivefield of a convolutional unit with given weight vector is shifted step by step across a2-dimensional array of input values,such as the pixels of an image.The resulting2D array of subsequent activation events of this unit can then provide inputs to higher-level units,and so on.Due to massive weight replication(Sec.2),relatively few parameters may be necessary to describe the behavior of such a convolutional layer.Competition layers have WTA subsets whose maximally active units are the only ones to adopt non-zero activation values.They essentially“down-sample”the competition layer’s input.This helps to create units whose responses are insensitive to small image shifts(compare Sec.5.2).The Neocognitron is very similar to the architecture of modern,contest-winning,purely super-vised,feedforward,gradient-based Deep Learners with alternating convolutional and competition lay-ers(e.g.,Sec.5.19-5.22).Fukushima,however,did not set the weights by supervised backpropagation (Sec.5.5,5.8),but by local un supervised learning rules(e.g.,Fukushima,2013b),or by pre-wiring.In that sense he did not care for the DL problem(Sec.5.9),although his architecture was comparatively deep indeed.He also used Spatial Averaging(Fukushima,1980,2011)instead of Max-Pooling(MP,Sec.5.11), currently a particularly convenient and popular WTA mechanism.Today’s CNN-based DL machines profita lot from later CNN work(e.g.,LeCun et al.,1989;Ranzato et al.,2007)(Sec.5.8,5.16,5.19).5.51960-1981and Beyond:Development of Backpropagation(BP)for NNsThe minimisation of errors through gradient descent(Hadamard,1908)in the parameter space of com-plex,nonlinear,differentiable,multi-stage,NN-related systems has been discussed at least since the early 1960s(e.g.,Kelley,1960;Bryson,1961;Bryson and Denham,1961;Pontryagin et al.,1961;Dreyfus,1962; Wilkinson,1965;Amari,1967;Bryson and Ho,1969;Director and Rohrer,1969;Griewank,2012),ini-tially within the framework of Euler-LaGrange equations in the Calculus of Variations(e.g.,Euler,1744). Steepest descent in such systems can be performed(Bryson,1961;Kelley,1960;Bryson and Ho,1969)by iterating the ancient chain rule(Leibniz,1676;L’Hˆo pital,1696)in Dynamic Programming(DP)style(Bell-man,1957).A simplified derivation of the method uses the chain rule only(Dreyfus,1962).The methods of the1960s were already efficient in the DP sense.However,they backpropagated derivative information through standard Jacobian matrix calculations from one“layer”to the previous one, explicitly addressing neither direct links across several layers nor potential additional efficiency gains due to network sparsity(but perhaps such enhancements seemed obvious to the authors).。
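As a concrete illustration of the chain-rule view of backpropagation discussed in Sec. 5.5, and of the squared error e = 1/2 (x - d)^2 from Sec. 2, here is a minimal NumPy forward and backward pass for a two-layer tanh network. It is a generic textbook sketch, not the survey's event-oriented notation itself.

```python
import numpy as np

def forward_backward(x, target, W1, W2):
    """Minimal backpropagation sketch for a 2-layer tanh network.

    x: input vector, target: desired output vector,
    W1: (hidden, in) weights, W2: (out, hidden) weights.
    """
    # Forward pass
    h_pre = W1 @ x
    h = np.tanh(h_pre)
    y = W2 @ h
    e = 0.5 * np.sum((y - target) ** 2)
    # Backward pass: propagate dE/d(activation) from the output toward the input
    dy = y - target                       # dE/dy
    dW2 = np.outer(dy, h)                 # dE/dW2
    dh = W2.T @ dy                        # dE/dh
    dh_pre = dh * (1.0 - h ** 2)          # through the tanh nonlinearity
    dW1 = np.outer(dh_pre, x)             # dE/dW1
    return e, dW1, dW2
```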
Semi-Supervised Clustering Algorithms in Semi-Supervised Learning, Explained
Semi-supervised learning refers to learning that uses both labeled and unlabeled data during training.
Compared with supervised and unsupervised learning, semi-supervised learning is closer to real-world practice: real data usually contain a large amount of unlabeled samples, while obtaining labels is often time-consuming and labor-intensive.
Semi-supervised learning can use the unlabeled data during model training, thereby improving model performance and generalization.
Within semi-supervised learning, semi-supervised clustering is an important research direction; it aims to cluster using both labeled and unlabeled data in order to obtain better clustering results.
This article introduces and analyzes semi-supervised clustering algorithms in detail.
The core idea of semi-supervised clustering is to use the labeled data to guide the clustering of the unlabeled data.
In general, semi-supervised clustering algorithms fall into two categories: constraint-based methods and graph-based methods.
Constraint-based methods guide the clustering process with given constraints, such as must-link constraints (samples known to belong to the same class must be placed in the same cluster) and cannot-link constraints (samples known to belong to different classes must not be placed in the same cluster); a toy assignment step that honors such constraints is sketched below.
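The following NumPy sketch shows one way to honor must-link and cannot-link constraints during a single assignment step, loosely in the spirit of COP-KMeans. The greedy fallback used when no feasible center exists is an assumption made here for simplicity, not part of any specific published algorithm.

```python
import numpy as np

def constrained_assign(X, centers, must_link, cannot_link):
    """Toy constrained assignment step (COP-KMeans style sketch).

    must_link / cannot_link: dicts mapping a sample index to a list of sample
    indices it must / must not share a cluster with. Points are assigned greedily
    to their nearest feasible center; if none is feasible we fall back to nearest.
    """
    n = X.shape[0]
    labels = -np.ones(n, dtype=int)
    for i in range(n):
        order = np.argsort(((X[i] - centers) ** 2).sum(axis=1))
        assigned = False
        for c in order:
            ok = all(labels[j] == c for j in must_link.get(i, []) if labels[j] != -1)
            ok = ok and all(labels[j] != c for j in cannot_link.get(i, []) if labels[j] != -1)
            if ok:
                labels[i] = c
                assigned = True
                break
        if not assigned:
            labels[i] = order[0]  # fall back rather than fail
    return labels
```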
Graph-based methods instead build a graph over the samples and cluster on that structure; spectral clustering is a common choice in graph-based semi-supervised learning.
Among graph-based methods, spectral clustering is a commonly used semi-supervised clustering algorithm.
Spectral clustering first encodes the pairwise similarities between samples in a similarity matrix, then performs an eigendecomposition of this matrix to obtain an embedding of the samples, and finally clusters the samples in that embedding.
In the semi-supervised setting, spectral clustering can incorporate the information carried by the labeled data to guide the clustering process and improve its accuracy.
For example, one can build a weighted graph whose nodes are samples and whose edge weights are pairwise similarities; the labeled samples guide the clustering by fixing the weights between them according to their labels, so that labeled samples of the same class are more likely to end up in the same cluster.
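A compact NumPy/scikit-learn sketch of this label-guided spectral clustering is given below: pairwise similarities between labeled samples are overridden to 1 (same class) or 0 (different class) before the Laplacian eigendecomposition. The Gaussian kernel width, the hard 0/1 override, and the use of k-means on the spectral embedding are illustrative assumptions rather than a specific published algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def semi_supervised_spectral(X, y_partial, n_clusters, sigma=1.0):
    """Toy spectral clustering with label guidance.

    y_partial: class id for labeled samples, -1 for unlabeled samples.
    """
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    lab = np.where(y_partial >= 0)[0]
    for i in lab:                                   # inject the supervision
        for j in lab:
            W[i, j] = 1.0 if y_partial[i] == y_partial[j] else 0.0
    np.fill_diagonal(W, 0.0)
    # Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, :n_clusters]                        # eigenvectors of smallest eigenvalues
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
```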
Besides spectral clustering, graph-based semi-supervised learning includes many other algorithms, such as label propagation and semi-supervised support vector machines (Semi-Supervised Support Vector Machines).
These algorithms all build a graph over the samples and exploit the graph topology together with the pairwise similarities to perform semi-supervised learning.
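For instance, the label propagation idea mentioned above can be written in a few lines: starting from a one-hot label matrix for the labeled nodes, labels are repeatedly diffused over the normalized affinity matrix while the known labels keep injecting their information. The update below follows the common label-spreading style iteration F <- alpha*S*F + (1-alpha)*Y; the affinity matrix W is assumed to be precomputed, and alpha and the iteration count are illustrative choices.

```python
import numpy as np

def label_propagation(W, y_partial, n_classes, alpha=0.99, n_iter=200):
    """Minimal label propagation sketch on a precomputed affinity matrix W.

    y_partial: class id for labeled nodes, -1 for unlabeled nodes.
    """
    n = W.shape[0]
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # symmetric normalization
    Y = np.zeros((n, n_classes))
    labeled = y_partial >= 0
    Y[labeled, y_partial[labeled]] = 1.0                 # one-hot seeds
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y            # diffuse, keep re-injecting labels
    return F.argmax(axis=1)
```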
Review of Semi-supervised Deep Learning Image Classification Methods
Review of Semi-supervised Deep Learning Image Classification Methods
LYU Haoyuan (corresponding author), YU Lu, ZHOU Xingyu, DENG Xiang
College of Communication Engineering, Army Engineering University of PLA, Nanjing 210007, China. Corresponding author e-mail: *******************
Abstract: As one of the most closely watched technologies in artificial intelligence over the past ten years, deep learning has achieved excellent results in many applications, but current learning strategies rely heavily on large amounts of labeled data. In many practical problems it is not feasible to obtain a large number of labeled training samples, which makes model training harder, whereas large amounts of unlabeled data are easy to obtain. Semi-supervised learning makes full use of unlabeled data; it provides ideas and effective methods for improving model performance when labeled data are limited, and it reaches high recognition accuracy on image classification tasks. This paper first gives an overview of semi-supervised learning and then introduces the basic ideas commonly used in classification algorithms. It focuses on a comprehensive review of recent image classification methods built on semi-supervised deep learning frameworks, including multi-view training, consistency regularization, diversity mixing, and semi-supervised generative adversarial networks. It summarizes the techniques shared by these methods, analyzes and compares the differences in their experimental results, and finally discusses open problems and promising directions for future research.
Key words: semi-supervised deep learning; multi-view training; consistency regularization; diversity mixing; semi-supervised generative adversarial networks
计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 1673-9418/2021/15(06)-1038-11, doi: 10.3778/j.issn.1673-9418.2011020. Supported by the National Natural Science Foundation of China (61702543).
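Of the method families listed in the abstract, consistency regularization is the simplest to write down. Below is a minimal PyTorch-style sketch of a Pi-model-like objective: supervised cross-entropy on labeled images plus a term asking the model to give similar predictions under two random augmentations of the same unlabeled image. The `augment` function and the weight `lam` are assumptions for illustration; this shows the general idea surveyed, not any single method from the paper.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, x_labeled, y_labeled, x_unlabeled, augment, lam=1.0):
    """Supervised cross-entropy plus a consistency-regularization term (sketch).

    `model` maps images to class logits; `augment` is an assumed stochastic
    data-augmentation function that returns a different view on each call.
    """
    loss_sup = F.cross_entropy(model(augment(x_labeled)), y_labeled)
    p1 = F.softmax(model(augment(x_unlabeled)), dim=1)   # first random view
    p2 = F.softmax(model(augment(x_unlabeled)), dim=1)   # second random view
    loss_cons = F.mse_loss(p1, p2)                       # predictions should agree
    return loss_sup + lam * loss_cons
```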
A Semi-Supervised Classification Method Based on Co-Training Generative Adversarial Networks
Optics and Precision Engineering, Vol. 29, No. 5, May 2021. Article number 1004-924X(2021)05-1127-09. doi: 10.37188/OPE.20212905.1127

Co-training generative adversarial networks for semi-supervised classification
XU Zhe, GENG Jie*, JIANG Wen, ZHANG Zhuo, ZENG Qing-jie (School of Electronics and Information, Northwestern Polytechnical University, Xi'an 710072, China). *Corresponding author, e-mail: gengjie@nwpu.edu.cn

Abstract: Deep neural networks require a large amount of data for supervised learning; however, it is difficult to obtain enough labeled data in practical applications. Semi-supervised learning can reduce a deep network's dependence on labeled data, and semi-supervised generative adversarial networks can improve classification performance, but they remain unstable during training. To further improve classification accuracy and address this instability, this paper proposes a semi-supervised classification method based on co-training generative adversarial networks (CT-GAN): two discriminators are trained jointly to cancel the distribution error of a single discriminator, and high-confidence unlabeled samples are selected to expand the labeled set, which improves semi-supervised classification accuracy and the generalization ability of the model. Experiments on the CIFAR-10 and SVHN datasets show that the method achieves better classification accuracy under different numbers of labeled samples: with 2,000 labels the accuracy on CIFAR-10 reaches 80.36%, and with 10 labels the accuracy improves by about 5% compared with an existing semi-supervised method. To a certain extent, this alleviates the overfitting of GANs under small-sample conditions.
Key words: generative adversarial networks; semi-supervised learning; image classification; deep learning
Received Nov. 4, 2020; revised Jan. 4, 2021. Supported by the Equipment Pre-Research Field Foundation (61400010304) and the National Natural Science Foundation of China (61901376).

1 Introduction
Image classification is one of the most fundamental tasks in computer vision; it extracts features from raw images and performs classification based on those features [1]. Traditional feature extraction methods operate on surface-level image characteristics such as color, texture, and local features, for example the scale-invariant feature transform [2], histograms of oriented gradients [3], and local binary patterns [4].
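The sample-selection step described in the abstract can be sketched as follows: two discriminators' class probabilities are averaged, and an unlabeled sample is added to the labeled pool only when both discriminators agree on the class and the averaged confidence exceeds a threshold. The agreement rule and the 0.95 threshold are assumptions for illustration, not the exact criterion used by CT-GAN.

```python
import torch
import torch.nn.functional as F

def select_confident_pseudo_labels(logits_d1, logits_d2, threshold=0.95):
    """Select high-confidence unlabeled samples from two discriminators (sketch)."""
    p1 = F.softmax(logits_d1, dim=1)
    p2 = F.softmax(logits_d2, dim=1)
    p = 0.5 * (p1 + p2)                         # average the two class posteriors
    conf, pseudo = p.max(dim=1)
    agree = p1.argmax(dim=1) == p2.argmax(dim=1)
    keep = agree & (conf >= threshold)          # both agree and are confident
    return keep, pseudo
```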
An Image Retrieval Method Based on a Two-Stage Similarity Measurement Strategy
Authors: 张敏; 冯晓虹
Journal: Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition)
Year (volume), issue: 2010, 30(6)
Abstract: An image retrieval method based on semi-supervised learning is proposed. First, a preprocessing step effectively addresses the high computational cost of searching large image databases. Then, the similarity between the query image and all related images is measured to obtain preliminary retrieval results. Finally, a semi-supervised learning method based on random walk and restart refines the initial retrieval to improve precision. Experiments on a real image database show that the semi-supervised learning method yields high-precision retrieval results.
Pages: 6 (pp. 101-106)
Authors: 张敏; 冯晓虹
Affiliation: College of Automation, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China (both authors)
Language: Chinese
CLC classification: TP391
Related literature:
1. An image retrieval method based on multi-level semantic similarity measurement [J], 陈世亮; 李战怀; 袁柳
2. Research on medical image retrieval methods based on a hierarchical retrieval strategy [J], 尹东; 刘京锐
3. Research on clothing image retrieval based on image content and support vector machines [J], 薛培培; 邬延辉
4. Research on flower image retrieval based on similarity measurement [J], 俞颖; 邵志荣; 林燕玲
5. A color image retrieval method based on perceptual color features, sub-image segmentation, and multiple bitmaps [J], 邹彬; 潘志斌; 乔瑞萍; 禹贵辉; 姜彦民
Semi-supervised support vector machines
1. INTRODUCTION
In this work we propose a method for semi-supervised support vector machines (S3 VM). S3 VM are constructed using a mixture of labeled data (the training set) and unlabeled data (the working set). The objective is to assign class labels to the working set such that the “best” support vector machine (SVM) is constructed. If the working set is empty the method becomes the standard SVM approach to classification [20, 9, 8]. If the training set is empty, then the method becomes a form of unsupervised learning. Semi-supervised learning occurs when both training and working sets are nonempty. Semi-supervised learning for problems with small training sets and large working sets is a form of semi-supervised clustering. There are successful semi-supervised algorithms for k-means and fuzzy c-means clustering [4, 18]. Clustering is a potential application for S3 VM as well. When the training set is large relative to the working set, S3 VM can be viewed as a method for solving the transduction problem according to the principle of overall risk minimization (ORM) posed by Vapnik at the NIPS 1998 SVM Workshop and in [19, Chapter 10]. S3 VM for ORM is the focus of this paper. In classification, the transduction problem is to estimate the class of each given point in the unlabeled working set. The usual support vector machine (SVM) approach estimates the entire classification function using the principle of statistical risk minimization (SRM). In transduction, one estimates the classification function at points within the working set using information from both the training and working set data. Theoretically, if there is adequate training data to estimate the function satisfactorily, then SRM will be sufficient. We would expect transduction to yield no significant improvement over SRM alone. If, however, there is inadequate training data, then ORM may improve generalization on the working set. Intuitively, we would expect ORM to yield improvements when the training sets are small or when there is a significant deviation between the training and working set subsamples of the total population. Indeed,the theoretical results in [19] support these hypotheses. In Section 2, we briefly review the standard SVM model for structural risk minimization. According to the principles of structural risk minimization, SVM minimize both the empirical misclassification rate and the capacity of the classification function [19, 20] using the training data. The capacity of the function is determined by margin of separation between the two classes based on the training set. ORM also minimizes the both the empirical misclassification rate and the function capacity. But the capacity of the function is determined using both the training and working sets. In Section 3, we show how SVM can be extended to the semi-supervised case and how mixed integer programming can be used practically to solve the resulting problem. We compare support vector machines constructed by structural risk minimization and overall risk minimization computationally on eleven problems in Section 4. Our computational results support past theoretical results that improved generalization can be obtained by incorporating working set information during training when there is a deviation between the working set and training set sample distributions. In three of ten real-world problems the semi-supervised approach, S3 VM , achieved a significant increase in generalization. In no case did S3 VM ever obtain a significant decrease in generalization. 
We conclude with a discussion of more general S3 VM algorithms.
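As a concrete reference point for the transductive term discussed above, the sketch below evaluates one standard way to write an S3VM-style objective for a fixed linear classifier: the usual margin and labeled hinge-loss terms plus, for each working-set point, the smaller of the two hinge losses it would incur under either label. The constants C and C_star and this exact form are illustrative assumptions, not necessarily the paper's mixed-integer formulation.

```python
import numpy as np

def s3vm_objective(w, b, X_lab, y_lab, X_unlab, C=1.0, C_star=1.0):
    """Illustrative S3VM-style objective for a linear classifier (w, b).

    y_lab takes values in {-1, +1}; X_unlab is the unlabeled working set.
    """
    margin = 0.5 * np.dot(w, w)                                   # inverse-margin term
    f_lab = X_lab @ w + b
    hinge_lab = np.maximum(0.0, 1.0 - y_lab * f_lab).sum()        # labeled hinge loss
    f_un = X_unlab @ w + b
    # Each working-set point contributes the hinge loss of its better label.
    hinge_un = np.minimum(np.maximum(0.0, 1.0 - f_un),
                          np.maximum(0.0, 1.0 + f_un)).sum()
    return margin + C * hinge_lab + C_star * hinge_un
```

Minimizing this objective over w, b, and the implicit working-set labels is what the mixed-integer program in Section 3 does; the function above only evaluates the objective for a candidate classifier.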
Training with semi-supervised contrastive learning
Semi-supervised contrastive learning is a training approach used in machine learning to leverage labeled and unlabeled data in a semi-supervised setting.

In contrastive learning, the goal is to learn representations (embeddings) of data points such that similar points are pulled together and dissimilar points are pushed apart in the embedding space. This is typically done by training a deep neural network to maximize the agreement between augmented positive pairs (similar samples) while minimizing the agreement between augmented negative pairs (dissimilar samples).

In a semi-supervised setting, both labeled and unlabeled data are available. Labeled data includes samples with known class labels, while unlabeled data does not. Semi-supervised contrastive learning utilizes both to improve the quality of the learned representations.

Specifically, the training process involves two main steps. First, labeled data is used to create positive and negative pairs: positive pairs consist of two augmented versions of the same labeled sample, while negative pairs consist of augmented versions of different labeled samples. The model is trained to maximize agreement for positive pairs and minimize agreement for negative pairs.

Next, the model is further trained on unlabeled data to generalize the learned representations beyond the labeled samples. This is done by encouraging augmented versions of similar unlabeled samples to have consistent representations, while pushing apart augmented versions of dissimilar unlabeled samples.

By utilizing both labeled and unlabeled data, semi-supervised contrastive learning can improve performance compared to supervised approaches that use only labeled data. It allows the model to learn more robust representations and generalize better to unseen data.
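A minimal PyTorch sketch of the labeled part of such a scheme is shown below: a supervised contrastive term in which samples sharing a label act as positives and all other samples as negatives. The temperature value and the handling of anchors without positives are illustrative assumptions; the unlabeled part of training would use instance-level positives from two augmentations of the same sample instead.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Supervised contrastive term over a batch of labeled embeddings (sketch).

    z: (N, d) embeddings of augmented labeled samples; labels: (N,) class ids.
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)                # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1).float()
    # Average log-probability over each anchor's positives, then over anchors.
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_count
    return loss.mean()
```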
Interpreting semi-supervised action recognition code
The phrase "semi-supervised action recognition code interpretation" means reading and analyzing code used for semi-supervised action recognition.
Because concrete code implementations differ across research teams and individuals, examples can only be given as application cases or conceptual illustrations of semi-supervised learning in action recognition:
1. Pretraining and fine-tuning: first pretrain on a large amount of labeled data, then fine-tune the model with a small amount of labeled data.
This strategy is commonly used in transfer learning, where the pretrained model has already been trained on a large amount of data.
2. Pseudo labeling: assign pseudo labels to unlabeled data, then train the model on these now-labeled data (a sketch follows after this list).
A common practice is to run an already-trained model on the unlabeled data and use its predictions as the pseudo labels.
3. Generative adversarial networks (GANs): GANs can generate new data that can be used to enlarge the training set.
In action recognition, GANs can generate simulated motion sequences, which are then used for training together with the real labeled data.
4. Self-supervised learning: the model learns the prediction task indirectly, by learning to generate a target output from some internal representation of the input data.
For example, the model can learn to predict future frames from the current video frames.
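As referenced in item 2, here is a minimal PyTorch-style sketch of the pseudo-labeling idea: an already-trained model predicts on unlabeled clips, and only predictions above a confidence threshold are kept as pseudo labels for further training. The 0.9 threshold and the model/dataloader names are illustrative assumptions, not any specific repository's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(model, unlabeled_loader, threshold=0.9, device="cpu"):
    """Collect high-confidence pseudo labels from a trained action recognition model.

    `model` maps a batch of clips to class logits; `unlabeled_loader` yields clip batches.
    Returns the kept clips and their pseudo labels, to be mixed with the labeled set.
    """
    model.eval()
    kept_clips, kept_labels = [], []
    for clips in unlabeled_loader:
        clips = clips.to(device)
        probs = F.softmax(model(clips), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = conf >= threshold                 # keep only confident predictions
        if mask.any():
            kept_clips.append(clips[mask].cpu())
            kept_labels.append(pseudo[mask].cpu())
    if not kept_clips:
        return None, None
    return torch.cat(kept_clips), torch.cat(kept_labels)
```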
In summary, interpreting semi-supervised action recognition code means analyzing and understanding in depth the code used for semi-supervised action recognition.
This requires understanding the principles of semi-supervised learning and the specific strategies used in action recognition, together with the relevant programming and machine learning knowledge.
A Survey of Semi-Supervised Learning (slides)
D. J. Miller, H. S. Uyar. A mixture of experts classifier with learning based on both labelled and unlabelled data. In: M. Mozer, M. I. Jordan, T. Petsche, eds. Advances in Neural Information Processing Systems 9, Cambridge, MA: MIT Press, 1997, 571-577
1.2 Concrete steps of the EM algorithm (the solution)
1. Set an initial value $\theta^{(0)}$.
2. (E-step) For $n \ge 0$, let $\hat{X}^{(n)} = E_{\theta^{(n)}}(X \mid Y)$.
3. (M-step) (the corrected estimate) Choose $\theta^{(n+1)}$ such that
$$\log f(\theta^{(n+1)}, \hat{X}^{(n)}) = \max_{\theta} \log f(\theta, \hat{X}^{(n)}).$$
Here the E-step takes a conditional expectation and the M-step takes a maximum; this alternating procedure is called the EM method.
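To make the alternation concrete in the semi-supervised setting of these slides, here is a small NumPy sketch of EM for a Gaussian mixture in which labeled points keep fixed one-hot responsibilities and unlabeled points receive soft responsibilities in the E-step. Spherical (identity-covariance) components are an assumption made only to keep the sketch short.

```python
import numpy as np

def semi_supervised_em_gmm(X, y_partial, n_classes, n_iter=50):
    """EM for a spherical Gaussian mixture using labeled and unlabeled data (sketch).

    y_partial: class id for labeled points, -1 for unlabeled points.
    """
    n, d = X.shape
    rng = np.random.default_rng(0)
    means = X[rng.choice(n, n_classes, replace=False)]
    weights = np.full(n_classes, 1.0 / n_classes)
    labeled = y_partial >= 0
    for _ in range(n_iter):
        # E-step: responsibilities R (n x K)
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        log_r = np.log(weights)[None, :] - 0.5 * sq
        log_r -= log_r.max(axis=1, keepdims=True)          # numerical stability
        R = np.exp(log_r)
        R /= R.sum(axis=1, keepdims=True)
        R[labeled] = 0.0
        R[labeled, y_partial[labeled]] = 1.0                # labels override the posterior
        # M-step: re-estimate mixing weights and means from expected assignments
        Nk = R.sum(axis=0) + 1e-12
        weights = Nk / n
        means = (R.T @ X) / Nk[:, None]
    return means, weights, R
```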
1.1 Characteristics of the EM algorithm
Definition: maximum-likelihood estimation of the parameters of a distribution that has hidden state variables.
Suitability: it can produce good clusterings of the data.
Difficulty: write the expectation of the state under parameter $\theta$ as $\hat{X} = E_{\theta}(X \mid Y)$. When estimating the state variable $X$, the estimate should of course be this conditional expectation, but computing it requires knowing the parameter $\theta$; on the other hand, estimating $\theta$ in turn requires knowledge of $X$, which is why the E and M steps alternate.
IEEE reference format
•Creating a reference list or bibliographyA numbered list of references must be provided at the end of thepaper. The list should be arranged in the order of citation in the text of the assignment or essay, not in alphabetical order. List only one reference per reference number. Footnotes or otherinformation that are not part of the referencing format should not be included in the reference list.The following examples demonstrate the format for a variety of types of references. Included are some examples of citing electronic documents. Such items come in many forms, so only some examples have been listed here.Print DocumentsBooksNote: Every (important) word in the title of a book or conference must be capitalised. Only the first word of a subtitle should be capitalised. Capitalise the "v" in Volume for a book title.Punctuation goes inside the quotation marks.Standard formatSingle author[1] W.-K. Chen, Linear Networks and Systems. Belmont, CA: Wadsworth,1993, pp. 123-135.[2] S. M. Hemmington, Soft Science. Saskatoon: University ofSaskatchewan Press, 1997.Edited work[3] D. Sarunyagate, Ed., Lasers. New York: McGraw-Hill, 1996.Later edition[4] K. Schwalbe, Information Technology Project Management, 3rd ed.Boston: Course Technology, 2004.[5] M. N. DeMers, Fundamentals of Geographic Information Systems,3rd ed. New York : John Wiley, 2005.More than one author[6] T. Jordan and P. A. Taylor, Hacktivism and Cyberwars: Rebelswith a cause? London: Routledge, 2004.[7] U. J. Gelinas, Jr., S. G. Sutton, and J. Fedorowicz, Businessprocesses and information technology. Cincinnati:South-Western/Thomson Learning, 2004.Three or more authorsNote: The names of all authors should be given in the references unless the number of authors is greater than six. If there are more than six authors, you may use et al. after the name of the first author.[8] R. Hayes, G. Pisano, D. Upton, and S. Wheelwright, Operations,Strategy, and Technology: Pursuing the competitive edge.Hoboken, NJ : Wiley, 2005.Series[9] M. Bell, et al., Universities Online: A survey of onlineeducation and services in Australia, Occasional Paper Series 02-A. Canberra: Department of Education, Science andTraining, 2002.Corporate author (ie: a company or organisation)[10] World Bank, Information and Communication Technologies: AWorld Bank group strategy. Washington, DC : World Bank, 2002.Conference (complete conference proceedings)[11] T. J. van Weert and R. K. Munro, Eds., Informatics and theDigital Society: Social, ethical and cognitive issues: IFIP TC3/WG3.1&3.2 Open Conference on Social, Ethical andCognitive Issues of Informatics and ICT, July 22-26, 2002, Dortmund, Germany. Boston: Kluwer Academic, 2003.Government publication[12] Australia. Attorney-Generals Department. Digital AgendaReview, 4 Vols. Canberra: Attorney- General's Department,2003.Manual[13] Bell Telephone Laboratories Technical Staff, TransmissionSystem for Communications, Bell Telephone Laboratories,1995.Catalogue[14] Catalog No. MWM-1, Microwave Components, M. W. Microwave Corp.,Brooklyn, NY.Application notes[15] Hewlett-Packard, Appl. Note 935, pp. 25-29.Note:Titles of unpublished works are not italicised or capitalised. Capitalise only the first word of a paper or thesis.Technical report[16] K. E. Elliott and C.M. Greene, "A local adaptive protocol,"Argonne National Laboratory, Argonne, France, Tech. Rep.916-1010-BB, 1997.Patent / Standard[17] K. Kimura and A. Lipeles, "Fuzzy controller component, " U.S. 
Patent 14,860,040, December 14, 1996.Papers presented at conferences (unpublished)[18] H. A. Nimr, "Defuzzification of the outputs of fuzzycontrollers," presented at 5th International Conference onFuzzy Systems, Cairo, Egypt, 1996.Thesis or dissertation[19] H. Zhang, "Delay-insensitive networks," M.S. thesis,University of Waterloo, Waterloo, ON, Canada, 1997.[20] M. W. Dixon, "Application of neural networks to solve therouting problem in communication networks," Ph.D.dissertation, Murdoch University, Murdoch, WA, Australia, 1999.Parts of a BookNote: These examples are for chapters or parts of edited works in which the chapters or parts have individual title and author/s, but are included in collections or textbooks edited by others. If the editors of a work are also the authors of all of the included chapters then it should be cited as a whole book using the examples given above (Books).Capitalise only the first word of a paper or book chapter.Single chapter from an edited work[1] A. Rezi and M. Allam, "Techniques in array processing by meansof transformations, " in Control and Dynamic Systems, Vol.69, Multidemsional Systems, C. T. Leondes, Ed. San Diego: Academic Press, 1995, pp. 133-180.[2] G. O. Young, "Synthetic structure of industrial plastics," inPlastics, 2nd ed., vol. 3, J. Peters, Ed. New York:McGraw-Hill, 1964, pp. 15-64.Conference or seminar paper (one paper from a published conference proceedings)[3] N. Osifchin and G. Vau, "Power considerations for themodernization of telecommunications in Central and Eastern European and former Soviet Union (CEE/FSU) countries," in Second International Telecommunications Energy SpecialConference, 1997, pp. 9-16.[4] S. Al Kuran, "The prospects for GaAs MESFET technology in dc-acvoltage conversion," in Proceedings of the Fourth AnnualPortable Design Conference, 1997, pp. 137-142.Article in an encyclopaedia, signed[5] O. B. R. Strimpel, "Computer graphics," in McGraw-HillEncyclopedia of Science and Technology, 8th ed., Vol. 4. New York: McGraw-Hill, 1997, pp. 279-283.Study Guides and Unit ReadersNote: You should not cite from Unit Readers, Study Guides, or lecture notes, but where possible you should go to the original source of the information. If you do need to cite articles from the Unit Reader, treat the Reader articles as if they were book or journal articles. In the reference list or bibliography use the bibliographical details as quoted in the Reader and refer to the page numbers from the Reader, not the original page numbers (unless you have independently consulted the original).[6] L. Vertelney, M. Arent, and H. Lieberman, "Two disciplines insearch of an interface: Reflections on a design problem," in The Art of Human-Computer Interface Design, B. Laurel, Ed.Reading, MA: Addison-Wesley, 1990. Reprinted inHuman-Computer Interaction (ICT 235) Readings and Lecture Notes, Vol. 1. Murdoch: Murdoch University, 2005, pp. 32-37. Journal ArticlesNote: Capitalise only the first word of an article title, except for proper nouns or acronyms. Every (important) word in the title of a journal must be capitalised. Do not capitalise the "v" in volume for a journal article.You must either spell out the entire name of each journal that you reference or use accepted abbreviations. You must consistently do one or the other. Staff at the Reference Desk can suggest sources of accepted journal abbreviations.You may spell out words such as volume or December, but you must either spell out all such occurrences or abbreviate all. 
You do not need to abbreviate March, April, May, June or July.To indicate a page range use pp. 111-222. If you refer to only one page, use only p. 111.Standard formatJournal articles[1] E. P. Wigner, "Theory of traveling wave optical laser," Phys.Rev., vol. 134, pp. A635-A646, Dec. 1965.[2] J. U. Duncombe, "Infrared navigation - Part I: An assessmentof feasability," IEEE Trans. Electron. Devices, vol. ED-11, pp. 34-39, Jan. 1959.[3] G. Liu, K. Y. Lee, and H. F. Jordan, "TDM and TWDM de Bruijnnetworks and shufflenets for optical communications," IEEE Trans. Comp., vol. 46, pp. 695-701, June 1997.OR[4] J. R. Beveridge and E. M. Riseman, "How easy is matching 2D linemodels using local search?" IEEE Transactions on PatternAnalysis and Machine Intelligence, vol. 19, pp. 564-579, June 1997.[5] I. S. Qamber, "Flow graph development method," MicroelectronicsReliability, vol. 33, no. 9, pp. 1387-1395, Dec. 1993.[6] E. H. Miller, "A note on reflector arrays," IEEE Transactionson Antennas and Propagation, to be published.Electronic documentsNote:When you cite an electronic source try to describe it in the same way you would describe a similar printed publication. If possible, give sufficient information for your readers to retrieve the source themselves.If only the first page number is given, a plus sign indicates following pages, eg. 26+. If page numbers are not given, use paragraph or other section numbers if you need to be specific. An electronic source may not always contain clear author or publisher details.The access information will usually be just the URL of the source. As well as a publication/revision date (if there is one), the date of access is included since an electronic source may change between the time you cite it and the time it is accessed by a reader.E-BooksStandard format[1] L. Bass, P. Clements, and R. Kazman. Software Architecture inPractice, 2nd ed. Reading, MA: Addison Wesley, 2003. [E-book] Available: Safari e-book.[2] T. Eckes, The Developmental Social Psychology of Gender. MahwahNJ: Lawrence Erlbaum, 2000. [E-book] Available: netLibrary e-book.Article in online encyclopaedia[3] D. Ince, "Acoustic coupler," in A Dictionary of the Internet.Oxford: Oxford University Press, 2001. [Online]. Available: Oxford Reference Online, .[Accessed: May 24, 2005].[4] W. D. Nance, "Management information system," in The BlackwellEncyclopedic Dictionary of Management Information Systems,G.B. Davis, Ed. Malden MA: Blackwell, 1999, pp. 138-144.[E-book]. Available: NetLibrary e-book.E-JournalsStandard formatJournal article abstract accessed from online database[1] M. T. Kimour and D. Meslati, "Deriving objects from use casesin real-time embedded systems," Information and SoftwareTechnology, vol. 47, no. 8, p. 533, June 2005. [Abstract].Available: ProQuest, /proquest/.[Accessed May 12, 2005].Note: Abstract citations are only included in a reference list if the abstract is substantial or if the full-text of the article could not be accessed.Journal article from online full-text databaseNote: When including the internet address of articles retrieved from searches in full-text databases, please use the Recommended URLs for Full-text Databases, which are the URLs for the main entrance to the service and are easier to reproduce.[2] H. K. Edwards and V. Sridhar, "Analysis of software requirementsengineering exercises in a global virtual team setup,"Journal of Global Information Management, vol. 13, no. 2, p.21+, April-June 2005. [Online]. Available: Academic OneFile, . [Accessed May 31, 2005].[3] A. 
Holub, "Is software engineering an oxymoron?" SoftwareDevelopment Times, p. 28+, March 2005. [Online]. Available: ProQuest, . [Accessed May 23, 2005].Journal article in a scholarly journal (published free of charge on the internet)[4] A. Altun, "Understanding hypertext in the context of readingon the web: Language learners' experience," Current Issues in Education, vol. 6, no. 12, July 2003. [Online]. Available: /volume6/number12/. [Accessed Dec. 2, 2004].Journal article in electronic journal subscription[5] P. H. C. Eilers and J. J. Goeman, "Enhancing scatterplots withsmoothed densities," Bioinformatics, vol. 20, no. 5, pp.623-628, March 2004. [Online]. Available:. [Accessed Sept. 18, 2004].Newspaper article from online database[6] J. Riley, "Call for new look at skilled migrants," TheAustralian, p. 35, May 31, 2005. Available: Factiva,. [Accessed May 31, 2005].Newspaper article from the Internet[7] C. Wilson-Clark, "Computers ranked as key literacy," The WestAustralian, para. 3, March 29, 2004. [Online]. Available:.au. [Accessed Sept. 18, 2004].Internet DocumentsStandard formatProfessional Internet site[1] European Telecommunications Standards Institute, 揇igitalVideo Broadcasting (DVB): Implementation guidelines for DVBterrestrial services; transmission aspects,?EuropeanTelecommunications Standards Institute, ETSI TR-101-190,1997. [Online]. Available: . [Accessed:Aug. 17, 1998].Personal Internet site[2] G. Sussman, "Home page - Dr. Gerald Sussman," July 2002.[Online]. Available:/faculty/Sussman/sussmanpage.htm[Accessed: Sept. 12, 2004].General Internet site[3] J. Geralds, "Sega Ends Production of Dreamcast," ,para. 2, Jan. 31, 2001. [Online]. Available:/news/1116995. [Accessed: Sept. 12,2004].Internet document, no author given[4] 揂憀ayman抯?explanation of Ultra Narrow Band technology,?Oct.3, 2003. [Online]. Available:/Layman.pdf. [Accessed: Dec. 3, 2003].Non-Book FormatsPodcasts[1] W. Brown and K. Brodie, Presenters, and P. George, Producer, 揊rom Lake Baikal to the Halfway Mark, Yekaterinburg? Peking to Paris: Episode 3, Jun. 4, 2007. [Podcast television programme]. Sydney: ABC Television. Available:.au/tv/pekingtoparis/podcast/pekingtoparis.xm l. [Accessed Feb. 4, 2008].[2] S. Gary, Presenter, 揃lack Hole Death Ray? StarStuff, Dec. 23, 2007. [Podcast radio programme]. Sydney: ABC News Radio. Available: .au/newsradio/podcast/STARSTUFF.xml. [Accessed Feb. 4, 2008].Other FormatsMicroform[3] W. D. Scott & Co, Information Technology in Australia:Capacities and opportunities: A report to the Department ofScience and Technology. [Microform]. W. D. Scott & CompanyPty. Ltd. in association with Arthur D. Little Inc. Canberra:Department of Science and Technology, 1984.Computer game[4] The Hobbit: The prelude to the Lord of the Rings. [CD-ROM].United Kingdom: Vivendi Universal Games, 2003.Software[5] Thomson ISI, EndNote 7. [CD-ROM]. Berkeley, Ca.: ISIResearchSoft, 2003.Video recording[6] C. Rogers, Writer and Director, Grrls in IT. [Videorecording].Bendigo, Vic. : Video Education Australasia, 1999.A reference list: what should it look like?The reference list should appear at the end of your paper. Begin the list on a new page. The title References should be either left justified or centered on the page. The entries should appear as one numerical sequence in the order that the material is cited in the text of your assignment.Note: The hanging indent for each reference makes the numerical sequence more obvious.[1] A. Rezi and M. 
Allam, "Techniques in array processing by meansof transformations, " in Control and Dynamic Systems, Vol.69, Multidemsional Systems, C. T. Leondes, Ed. San Diego: Academic Press, 1995, pp. 133-180.[2] G. O. Young, "Synthetic structure of industrial plastics," inPlastics, 2nd ed., vol. 3, J. Peters, Ed. New York:McGraw-Hill, 1964, pp. 15-64.[3] S. M. Hemmington, Soft Science. Saskatoon: University ofSaskatchewan Press, 1997.[4] N. Osifchin and G. Vau, "Power considerations for themodernization of telecommunications in Central and Eastern European and former Soviet Union (CEE/FSU) countries," in Second International Telecommunications Energy SpecialConference, 1997, pp. 9-16.[5] D. Sarunyagate, Ed., Lasers. New York: McGraw-Hill, 1996.[8] O. B. R. Strimpel, "Computer graphics," in McGraw-HillEncyclopedia of Science and Technology, 8th ed., Vol. 4. New York: McGraw-Hill, 1997, pp. 279-283.[9] K. Schwalbe, Information Technology Project Management, 3rd ed.Boston: Course Technology, 2004.[10] M. N. DeMers, Fundamentals of Geographic Information Systems,3rd ed. New York: John Wiley, 2005.[11] L. Vertelney, M. Arent, and H. Lieberman, "Two disciplines insearch of an interface: Reflections on a design problem," in The Art of Human-Computer Interface Design, B. Laurel, Ed.Reading, MA: Addison-Wesley, 1990. Reprinted inHuman-Computer Interaction (ICT 235) Readings and Lecture Notes, Vol. 1. Murdoch: Murdoch University, 2005, pp. 32-37.[12] E. P. Wigner, "Theory of traveling wave optical laser,"Physical Review, vol.134, pp. A635-A646, Dec. 1965.[13] J. U. Duncombe, "Infrared navigation - Part I: An assessmentof feasibility," IEEE Transactions on Electron Devices, vol.ED-11, pp. 34-39, Jan. 1959.[14] M. Bell, et al., Universities Online: A survey of onlineeducation and services in Australia, Occasional Paper Series 02-A. Canberra: Department of Education, Science andTraining, 2002.[15] T. J. van Weert and R. K. Munro, Eds., Informatics and theDigital Society: Social, ethical and cognitive issues: IFIP TC3/WG3.1&3.2 Open Conference on Social, Ethical andCognitive Issues of Informatics and ICT, July 22-26, 2002, Dortmund, Germany. Boston: Kluwer Academic, 2003.[16] I. S. Qamber, "Flow graph development method,"Microelectronics Reliability, vol. 33, no. 9, pp. 1387-1395, Dec. 1993.[17] Australia. Attorney-Generals Department. Digital AgendaReview, 4 Vols. Canberra: Attorney- General's Department, 2003.[18] C. Rogers, Writer and Director, Grrls in IT. [Videorecording].Bendigo, Vic.: Video Education Australasia, 1999.[19] L. Bass, P. Clements, and R. Kazman. Software Architecture inPractice, 2nd ed. Reading, MA: Addison Wesley, 2003. [E-book] Available: Safari e-book.[20] D. Ince, "Acoustic coupler," in A Dictionary of the Internet.Oxford: Oxford University Press, 2001. [Online]. Available: Oxford Reference Online, .[Accessed: May 24, 2005].[21] H. K. Edwards and V. Sridhar, "Analysis of softwarerequirements engineering exercises in a global virtual team setup," Journal of Global Information Management, vol. 13, no. 2, p. 21+, April-June 2005. [Online]. Available: AcademicOneFile, . [Accessed May 31,2005].[22] A. Holub, "Is software engineering an oxymoron?" SoftwareDevelopment Times, p. 28+, March 2005. [Online]. Available: ProQuest, . [Accessed May 23, 2005].[23] H. Zhang, "Delay-insensitive networks," M.S. thesis,University of Waterloo, Waterloo, ON, Canada, 1997.[24] P. H. C. Eilers and J. J. Goeman, "Enhancing scatterplots withsmoothed densities," Bioinformatics, vol. 20, no. 
5, pp. 623-628, March 2004. [Online]. Available:. [Accessed Sept. 18, 2004].[25] J. Riley, "Call for new look at skilled migrants," The Australian, p. 35, May 31, 2005. Available: Factiva,. [Accessed May 31, 2005].[26] European Telecommunications Standards Institute, "Digital Video Broadcasting (DVB): Implementation guidelines for DVB terrestrial services; transmission aspects," European Telecommunications Standards Institute, ETSI TR-101-190, 1997. [Online]. Available: . [Accessed: Aug. 17, 1998].[27] J. Geralds, "Sega Ends Production of Dreamcast," , para. 2, Jan. 31, 2001. [Online]. Available: /news/1116995. [Accessed Sept. 12, 2004].[28] W. D. Scott & Co, Information Technology in Australia: Capacities and opportunities: A report to the Department of Science and Technology. [Microform]. W. D. Scott & Company Pty. Ltd. in association with Arthur D. Little Inc. Canberra: Department of Science and Technology, 1984.
Abbreviations
Standard abbreviations may be used in your citations. A list of appropriate abbreviations can be found below:
Experimental Comparisons of Semi-Supervised Dimensionality Reduction Methods
软件学报ISSN 1000-9825, CODEN RUXUEW E-mail: jos@Journal of Software,2011,22(1):28−43 [doi: 10.3724/SP.J.1001.2011.03928] +86-10-62562563 ©中国科学院软件研究所版权所有. Tel/Fax:∗半监督降维方法的实验比较陈诗国, 张道强+(南京航空航天大学计算机科学与工程系,江苏南京 210016)Experimental Comparisons of Semi-Supervised Dimensional Reduction MethodsCHEN Shi-Guo, ZHANG Dao-Qiang+(Department of Computer Science and Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China)+ Corresponding author: E-mail: dqzhang@Chen SG, Zhang DQ. Experimental comparisons of semi-supervised dimensional reduction methods. Journalof Software, 2011,22(1):28−43. /1000-9825/3928.htmAbstract: Semi-Supervised learning is one of the hottest research topics in the technological community, whichhas been developed from the original semi-supervised classification and semi-supervised clustering to thesemi-supervised regression and semi-supervised dimensionality reduction, etc. At present, there have been severalexcellent surveys on semi-supervised classification: Semi-Supervised clustering and semi-supervised regression, e.g.Zhu’s semi-supervised learning literature survey. Dimensionality reduction is one of the key issues in machinelearning, pattern recognition, and other related fields. Recently, a lot of research has been done to integrate the ideaof semi-supervised learning into dimensionality reduction, i.e. semi-supervised dimensionality reduction. In thispaper, the current semi-supervised dimensionality reduction methods are reviewed, and their performances areevaluated through extensive experiments on a large number of benchmark datasets, from which some empiricalinsights can be obtained.Key words: semi-supervised dimensionality reduction; dimensionality reduction; semi-supervised learning; classlabel; pairwise constraint摘要: 半监督学习是近年来机器学习领域中的研究热点之一,已从最初的半监督分类和半监督聚类拓展到半监督回归和半监督降维等领域.目前,有关半监督分类、聚类和回归等方面的工作已经有了很好的综述,如Zhu的半监督学习文献综述.降维一直是机器学习和模式识别等相关领域的重要研究课题,近年来出现了很多将半监督思想用于降维,即半监督降维方面的工作.有鉴于此,试图对目前已有的一些半监督降维方法进行综述,然后在大量的标准数据集上对这些方法的性能进行实验比较,并据此得出了一些经验性的启示.关键词: 半监督降维;降维;半监督学习;类别标号;成对约束中图法分类号: TP181文献标识码: A在很多机器学习和模式识别的实际应用中,人们经常会遇到高维数据,如人脸图像、基因表达数据、文本∗基金项目: 国家自然科学基金(60875030); 模式识别国家重点实验室开放课题(20090044)收稿时间: 2009-12-18; 定稿时间: 2010-07-28CNKI网络优先出版: 2010-11-05 11:50, /kcms/detail/11-2560.TP.20101105.1150.001.html陈诗国 等:半监督降维方法的实验比较29数据等.直接对这些高维数据进行处理是非常费时且费力的,而且由于高维数据空间的特点,容易出现所谓的“维数灾难”问题[1].降维(dimensionality reduction)是根据某一准则,将高维数据变换到有意义的低维表示[2].因此,降维能够在某种意义上克服维数灾难.根据是否使用了类别标号,传统的降维方法可以分为两类:无监督降维,如主成分分析(PCA)[3];有监督降维,如线性判别分析(LDA),或称为Fisher 判别分析(FDA)[4].在很多实际任务中,无标号的数据往往很容易获取,而有标号的数据则很难获取.为了获得更好的学习精度同时又要充分地利用现有的数据,出现了一种新的学习形式,即半监督学习(semi-supervised learning).相比较传统的学习方法,半监督学习可以同时利用无标号数据和有标号数据,只需较少的人工参与就能获得更精确的学习精度,因此,无论在理论上,还是在实践中,都受到越来越多的关注.目前,半监督学习已从最初的半监督分类和半监督聚类拓展到半监督回归和半监督降维等领域.有关半监督分类、聚类和回归等方面的工作已经有了很好的综述,如Zhu 的半监督学习文献综述[5].然而据我们所知,目前国内外尚未专门针对半监督降维方面的综述工作.因此,本文试图对目前已有的一些半监督降维方法进行综述,然后在大量的标准数据集上比较这些方法的性能,并根据实验得出一些经验性的启示.本文第1节介绍降维的概念,并对已有的降维方法进行分类.第2节详细介绍当今较为流行的几种典型的半监督降维方法,本文将这些方法大致分为3类:基于类别标号的方法、基于成对约束的方法和基于其他监督信息的方法.第3节分别在UCI 标准数据集、半监督学习数据集[6]和标准人脸数据上比较第2节中介绍的半监督降维方法,并作一些经验性的讨论.最后一节对本文的工作进行总结并指出下一步研究的方向.1 降 维给定一批观察样本,记X ∈R D ×N ,包含N 个样本,每个样本有D 个特征.降维的目标是:根据某个准则,找到数据的低维表示Z ={z i }∈R d (d <D ),同时保持数据的“内在信息(intrinsic information)”[3].当降维方法为线性时,降 维的过程就转变为学习一个投影矩阵1{}d D d i i W w R ×==∈,使得Z =W T X (1)其中,T 表示矩阵的转置操作.当降维方法为非线性时,不需要学习这样一个投影矩阵W ,而直接从原始数据中学习得到低维的数据表示Z .图1对当今流行的一些降维方法进行了分类.首先,根据是否使用数据中的监督信息,将所有方法分成监督的(supervised)、半监督的(semi-supervised)和无监督的(unsupervised)降维这样3类.其中,根据监督信息的不同,半监督降维又可分为基于类别标号(class label)的、基于成对约束(pairwise 
constraints)的和基于其他监督信息的3类方法.然后再根据算法模型的不同,又可以将所有的降维方法分成线性(linear)降维和非线性(nonlinear)降维.最后列出了一些有代表性的降维方法以及该方法所出自的文献.下面对图中出现的每一种降维方法进行简要的说明.线性判别分析(LDA)也叫Fisher 判别分析(FDA)[4],是当今最流行的监督降维方法之一.其主要思想是,寻找一个投影矩阵,使得降维之后同类数据之间尽量紧凑,而不同类别数据之间尽量分离.Baudat 和Anouar 使用核技巧,把LDA 扩展到非线性形式,即广义判别分析(GDA)[7].间隔Fisher 判别分析(MFA)[8]和局部Fisher 判别分析(LFDA)[9]是传统FDA 的两个扩展版本.与上面的方法不同,判别成分分析(DCA)[10]是利用成对约束进行度量学习(成对约束的定义将在第2.2节中给出介绍).其主要思想与LDA 类似:寻找一个投影矩阵,使得降维之后正约束数据之间尽量紧凑,而负约束数据之间尽量分离.KDCA 是它的核化版本[10].上述降维方法都是有监督的,需要知道数据的某种监督信息,如类别标号或者成对约束等.无监督降维则不需要这些监督信息,它直接利用无标号的数据,在降维过程中保持数据的某种结构信息.主成分分析(PCA)[3]是一种典型的无监督降维方法,其目的是寻找在最小平方意义下最能够代表原始数据的投影[1].与PCA 不同,多维尺度分析(MDS)[11]使得变换后的低维数据点之间的欧氏距离与原数据点之间的欧氏距离尽量保持一致.非负矩阵分解(NMF)[12]则基于这样的假设:数据矩阵可以分解为两个非负矩阵的乘积——基矩阵和系数矩阵.核PCA(KPCA)[13]是传统PCA 方法的核化版本.KPCA 中的核函数需要人为地指定,而最大方差展开(MVU)[14]则通30 Journal of Software 软件学报 V ol.22, No.1, January 2011过对数据的学习直接得到核矩阵.除了上面提到的一些非监督降维方法以外,流形学习(manifold learning)是最近发展起来的一种新的降维方法,它假设数据采样于高维空间中的一个潜在流形上,通过寻找这样一个潜在的流形很自然地找到高维数据的低维表示.ISOMAP [15]、局部线性嵌入(LLE)[16]、拉普拉斯特征映射(LE)[17]和局部保持投影(LPP)[18]是流形学习的代表性方法.Fig.1 Taxonomy of dimensionality reduction methods图1 降维方法的分类图上面分别介绍了监督降维和无监督降维的一些具有代表性的方法,下一节重点介绍半监督降维以及一些典型的半监督降维方法.2 半监督降维把半监督学习思想用于降维,就形成了半监督学习的一个新的分支,即半监督降维.半监督降维是传统降维方法的有效综合,它既可以像监督降维方法那样利用数据标号,又可以像无监督降维方法那样保持数据的某种结构信息,如数据的全局方差、局部结构等等.因此,半监督降维能够克服传统降维方法的缺点,有重要的研究价值和广阔的应用前景.根据使用监督信息的不同,半监督降维方法可以大致分成3类:(1) 基于类别标号的方法;(2) 基于成对约束的方法;(3) 基于其他监督信息的方法. 2.1 基于类别标号的半监督降维首先,我们给出基于类别标号半监督降维方法的数学描述.假设有N 个数据1{}N i X x ==,每个数据的维数为D ,即x i ∈R D (i =1,2,…,N ).在所有数据中,L 个数据已经知道类别标号,记为11{(,)}L i i i X x y ==,其中,x i 表示第i 个数据, y i 是x i 的类别标号,总共有C 个类;剩下的数据没有类别标号,记为21{}N j j L X x =+=.半监督降维的目的是,利用有类 别标号的数据和无类别标号的数据X ={X 1,X 2},寻找数据的低维表示Z ={z i }∈R d (d <D ).近年来,研究者们已经提出了多种基于类别标号的半监督降维方法.Yu 等人在概率PCA 模型[19]的基础上加入了类别标号信息,提出了监督和半监督形式的概率PCA 模型[20].Costa 和Hero 在构造拉普拉斯图时引入了类别标号信息,得到了拉普拉斯特征映射算法的一种监督和半监督版本[21].Cai 等人在传统的LDA 方法中加入流形正则化项,提出了一种半监督的判别分析方法SDA [22].Song 等人提出了一个半监督降维方法框架[23],SDA 可以看成是该框架下的一个例子.Zhang 等人在文献[24]中提出的半监督降维方法与SDA 方法相似,也是使用陈诗国 等:半监督降维方法的实验比较31正则化项来保持数据的流形结构.所不同的是,它使用了一种基于路径的鲁棒的相似性来构造邻接图.文献[25]在最大化LDA 准则中加入没有类别标号的数据,使用约束凹凸过程解决最终的优化问题.Chen 等人把LDA 重写成最小平方的形式,通过加入拉普拉斯正则化项,该模型可以转化为一个正则化的最小平方问题[26].最近,Sugiyama 把局部Fisher 判别分析(LFDA)和PCA 结合起来,提出了一种半监督局部降维Fisher 判别分析SELF [27].Chatpatanasiri 等人从流形学习的角度提出了一个半监督降维框架.在该框架下,可以很容易地把传统的Fisher 判别分析扩展到半监督的形式[28].下面我们简要介绍5种半监督降维方法:半监督概率PCA(S2PPCA)[19]、分类约束降维(CCDR)[21]、半监督判别分析(SDA)[23]和两个半监督Fisher 判别分析SELF [27],SSLFDA [28].最后,对这几种方法的属性以及它们之间的关系进行分析.2.1.1 半监督概率PCA(S2PPCA)S2PPCA 是概率PCA 模型[20]的半监督版本.首先,仅考虑有标号的样本X 1.假设样本(x ,y )由下列隐变量模型生成:x =W x z +μx +εx , y =f (z ,Θ)+εy (2)这里,f (z ,Θ)=[f 1(z ,θ1),…,f c (z ,θC )]T 是关于类别标号的函数,其中Θ={θ1,…,θC }表示C 个确定性函数f 1,…,f C 的参数(假设每个函数f c ,c =1,…,C ,都是关于隐变量z 的线性函数(,)cT cc c y y f z w z θμ=+,则f (z ,Θ)=W x z +μy ).z ~N (0,I ).z ~N (0,I )是输入x 与输出y 所共享的隐变量.两个相互独立的噪声模型被定义成各向同性的高斯函数,即22~(0,),~(0,)x x x x N I N I εδεδ.因此,对隐变量z 求积分,得到样本(x ,y )的似然函数:(,)(,|)()d (|)(|)()d P x y P x y z P z z P x z P y z P z z ==∫∫ (3)这里,22|~(,),|~(,)x x x y y y x z N W z I y z N W z I μθμθ++.如果样本之间相互独立,则有11()(,)L i i i P X P x y ==∏.最后,所有需要估计的参数向量表示为22{,,,,,}x y x y x y W W Ωμμδδ=.然后,考虑所有样本X ={X 1,X 2}的情况.因为样本之间假设是相互独立的,那么关于有所有样本的似然函数为1211()()()(,)()L Ni i j i j L P X P X P X P x y P x ==+==∏∏ (4)这里,P (x i ,y i )可以由公式(3)计算得到,而()(|)()d j j j j j P x P x z P z z =∫可以由概率PCA 模型计算得到.2.1.2 分类约束降维(CCDR)CCDR 的主要思想如下:将每个类所有样本的中心点作为新的数据节点加入到邻接图中,然后同类样本点与它们的中心点之间加入一条权重为1的边.这样,CCDR 可以形式化地写成最小化下面的目标函数:22()||||||||n ki k i ij i j kiijE Z a z y w y y β=−+−∑∑ (5)这里,z k 表示嵌入到低维空间后第k 个类的中心;A ={a ki }表示类别关系矩阵(如果数据x i 属于第k 类,则a ik =1;否则,a ik =0);W ={w ij 
}表示数据的邻接图,它的构造方法有很多种,可以参考文献[29];y i 是指嵌入到低维空间后的数据向量,其中,Z n =[z 1,…,z C ,y 1,…,y N ].2.1.3 半监督判别分析(SDA)SDA 是基于线性判别分析的一个半监督降维版本,它通过在LDA 的目标函数中添加正则化项,使得SDA 在最大化类间离散度的同时可以保持数据的局部结构信息.SDA 能够优化下面的目标函数:arg maxarg max ()()T T b b T T T wat t w S w w S ww S w J w w S XLX I w ααβ=+++ (6)式中,S b 表示带标号数据的类间离散度矩阵,S t 表示总体离散度,J (W )是正则化项(通过构造k -近邻图保持数据的流形),L 表示拉普拉斯矩阵.像LDA 一样,SDA 的目标函数也可以转化为一个广义特征分解问题.2.1.4 半监督局部Fisher 判别分析(SELF)SELF 是由文献[27]提出来的另一种半监督降维方法,它是局部Fisher 判别分析[9]的半监督版本.SELF 通过32 Journal of Software 软件学报 V ol.22, No.1, January 2011把PCA 和LFDA 综合起来,可以保持无类别标号数据的全局结构,同时保留LFDA 方法的优点(比如类内的数据为多模态分布、LDA 的维数限制等).SELF 可以表示为求解下面的优化问题:()()1arg max[(())]T rlb T rlw WWopt tr W S W W S W −= (7)上式中,W 是映射矩阵.S(rlb )是正则化局部类间离散矩阵,S (rlw )是正则化局部类内离散矩阵,定义如下:S (rlb )=(1−β)S (lb )+βS (t ), S (rlw )=(1−β)S (lw )+βI d (8)其中,S (lb )和S (lw )分别是LFDA 算法中的局部类间离散矩阵和局部类内离散矩阵,S (t )是离散度矩阵(数据方差矩阵).β∈[0,1]是模型的调节参数,当β=1时,SELF 就退化为PCA;而当β=0时,SELF 就退化为LFDA.2.1.5 基于流形学习的半监督局部Fisher 判别分析(SSLFDA)与第2.1.4节中的SELF 算法类似,SSLFDA 也是LFDA 算法的一个半监督版本,它是根据文献[28]中提出的半监督降维框架直接推导得到的.它与SELF 在利用无类别标号数据方面有所不同:SSLFDA 算法保持数据的流形结构,而SELF 保持数据的全局结构.在这个框架里面,半监督降维方法可以简单地表示为求解下面的优化问题:*arg min ()()l T u T WW f W X f W X γ=+ (9)其中,f l(⋅)和f u(⋅)分别表示关于有类别数据和无类别标号数据的函数,γ是调节因子.通常,f 可以写成成对数据加权距离的函数,最终,问题(9)转化为矩阵的特征分解问题.通过定义不同的权值,该框架可以导出不同的半监督降维方法.2.1.6 方法总结表1中列出了第2.1.1节~第2.1.5节中5个半监督降维方法的一些属性.k 表示数据邻接图近邻样本点的数目,t 是高斯核函数(Gaussian-kernel)中的带宽,i 表示迭代的次数,p 表示稀疏矩阵中的非零元素的个数.因为在S2PPCA 方法中提供了两个互为对偶的EM 算法,所以表中列出了它的两个计算复杂度和存储复杂度.S2PPCA 是概率模型,算法的性能一方面依赖于模型假设,即数据分布是高斯的或混合高斯的;另一方面还依赖于样本数据的个数.如果样本数目太少,参数的估计就不可信,从而导致算法的性能下降.CCDR 是一种非线性方法,与LE 方法一样,邻接图构造的好坏会直接影响算法的性能.SDA,SELF 和SSLFDA 是3种线性降维方法.SDA 通过增加正则化项,使得在降维的过程中能够保持数据的局部结构;SELF 需要最大化数据的协方差(PCA 准则),因此它在降维的过程中利用无标号的数据保持数据的全局结构;SSLFDA 利用无标号的数据保持数据的流形结构(LPP 准则),使数据的局部结构得到保持.Table 1 Properties of semi-supervised dimensionality reduction methods based on class label表1 基于类别标号的半监督降维方法的属性Methods Basic idea ParametersComputational MemoryS2PPCA CCDR SDA SELF SSLFDA Label+PPCALabel+LELDA+Adjacency graph LFDA+PCA LFDA+LPP None k , t , β k , α, β k , β k , βO (i (D +L )Nd ) or O (N 2(id +D ))O (p (N +C )2) O (D 3) O (D 3) O (D 3) O ((D +L )N ) or O (N 2)O (p (N +C )2)O (N 2+D 2) O (D 2)O (D 2)2.2 基于成对约束的半监督降维在半监督学习中,除了类别标号信息,还可以利用其他形式的先验信息,比如成对约束.在很多情况下,人们往往不知道样本的具体类别标号,却知道两个样本属于同一个类别,或者不属于同一个类别,我们称这样的监督信息为成对约束.成对约束往往分为两种:正约束(must-link)和负约束(cannot-link).正约束指的是两个样本属于同一个类别;相反地,负约束指的是两个样本属于不同的类别.本文中,把所有正约束的集合记为ML ,所有负约束的集合记为CL .我们首先回顾一些基于成对约束的半监督降维方法.Tang 等人提出用约束指导降维过程[30],他们的方法仅仅用到约束而忽略了无标号的数据.Bar-Hillel 等人提出一种约束FDA(cFDA)[31]方法对数据进行预处理,它是陈诗国 等:半监督降维方法的实验比较33作为相关成分分析算法(RCA)的一个中间步骤.Zhang 等人从一个更为直观的角度同时利用正约束和负约束指导降维过程,提出了一个半监督降维框架(SSDR)[32].Cevikalp 等人在局部保持投影方法(LPP)中引入约束信息,提出了约束局部保持投影算法(cLPP)[33].Wei 等人提出了一种邻居保持降维(NPSSDR)方法[34],在利用约束指导降维的同时,保持数据的局部结构信息.Baghshah 等人将NPSSDR 方法用于度量学习[35],他们使用了一种二分搜索方法来优化求解过程.Chen 等人在文献[36]中提出了一个基于约束信息的半监督非负矩阵分解(NMF)框架.彭岩等人在传统的典型相关分析算法中加入成对约束,提出了一种半监督典型相关分析算法[37].最近,Davidson 提出了一个基于图的降维框架[38],在该框架中,首先构造一个约束图,然后根据构建出来的图来指导降维.下面我们简要介绍4种基于成对约束的半监督降维方法:约束Fisher 判别分析(cFDA)[31]、基于约束的半监督降维框架(SSDR)[32]、约束的局部保持投影(cLPP)[33]和邻域保持半监督降维(NPSSDR)[34,35].最后,对这几种方法的属性以及它们之间的关系进行分析.2.2.1 约束Fisher 判别分析(cFDA)cFDA 是度量学习算法相关成分分析(RCA)[31]的一个中间步骤.cFDA 的具体做法是:首先,使用正约束(文献[31]中称为equivalence constraints)把数据聚成若干个簇(cluster);然后,类似于LDA 构建簇内散布矩阵S w 和总体散布矩阵S t ;最后,最大化下面的比率:max T t TW wW S WW S W (10) 其中,W 是映射矩阵,T 是矩阵转置符号.优化目标W 可以简单地由矩阵1w t S S −的前d 个特征向量组成.2.2.2 基于约束的半监督降维框架(SSDR)不同于cFDA 利用约束信息来构造散布矩阵,SSDR [32]直接使用约束来指导降维.SSDR 在降维的过程中保持数据之间的约束关系,同时像PCA 一样保持数据内部结构信息.SSDR 最大化下面的目标方程:2222,(,)(,)1()()()()222i j i j T T T T T T i j i j i j i j x x CL x x MLCL 
ML J w w x w x w x w x w x w x n n n αβ∈∈=−+−−−∑∑∑ (11) 上式中,第1项可以使得降维后两两数据之间的距离保持最大,它等价于PCA 准则,即最大化数据的方差;n CL 和n ML 分别表示负约束和正约束的个数. 2.2.3 约束的局部保持投影(cLPP)与SSDR 不同,cLPP 在降维过程中保持数据的局部结构信息.cLPP 的具体步骤如下:首先,构造数据的邻接矩阵;然后,利用约束信息修改邻接矩阵中的权值,使得正约束数据之间的权值增大,负约束数据之间的权值变小,同时修改与有约束的数据点直接相连点的相关权值,对约束信息进行传播;最后,cLPP 的目标函数可以显示地写成下面的形式:222,,,1()()()()2i j ij i j i j i j i j ML i j CL J w z z A z z z z ∈∈⎛⎞=−+−−−⎜⎟⎜⎟⎝⎠∑∑∑% (12) 这里,ijA %代表修改后的数据邻接矩阵,z i 是原始数据x i 映射到低维空间后所对应的点. 2.2.4 邻域保持半监督降维(NPSSDR)∗∗NPSSDR 利用正约束和负约束降维,同时保持数据的局部结构信息.与cLPP 方法不同,NPSSDR 不需要构造数据的邻接矩阵,而是通过添加正则化项的方法来实现.这里,我们给出了文献[35]中的目标函数:2(,)*(,)()arg max()()i j T i j i j z z CLW W Ii j z z MLz z W z z J W α∈=∈−=−+∑(13)∗∗ Wei等人在文献[34]中提出NPSSDR 方法.Baghshah 等人在文献[35]中提出的算法思想与文献[34]相似,我们统一称他们的方法为NPSSDR.实验中比较的NPSSDR 方法是按照文献[35]中的求解方法实现的.34 Journal of Software 软件学报 V ol.22, No.1, January 2011式中,J (W )是正则化项.若使用局部线性嵌入(LLE)的思想构造正则化项,则有J (W )=tr (W T XMX T W ),其中,M 是数据重构矩阵.于是,目标函数(13)最终转化为求解一个迹比(trace radio)问题,该问题可以使用一种二分搜索算法来求解[39].2.2.5 方法总结表2中显示了基于成对约束的半监督降维方法的一些属性,k 表示数据邻接图近邻样本点的数目,t 是高斯核函数中的带宽,i 表示迭代的次数,p 表示稀疏矩阵中的非零元素的个数.cFDA 仅仅利用正约束而没有用负约束,它对约束的选取有很强的依赖性;并且,当S W 奇异的时候,cFDA 的求解过程还需要作特殊的处理.SSDR 既可以利用正约束,也可以利用负约束,但是它在降维过程中只保持数据的全局结构,并且约束的信息也没有进行传播.cLPP 与SSDR 相比,保持了数据的局部结构,并且把约束信息传播到邻近的数据点.但是,cLPP 与LPP 方法一样,降维性能的好坏需要依赖于数据邻接图的构建.NPSSDR 使用一种二分搜索算法近似求解最终的问题,这种搜索算法需要反复求解特征值问题,计算复杂度较大,并且算法有可能会遇到不收敛的问题.NPSSDR 使用LLE 策略保持无标号数据的局部结构,它会遇到LLE 同样的问题,比如局部结构塌陷问题等等[2].Table 2 Properties of semi-supervised dimensionality reduction methods based on pairwise constraints表2 基于成对约束的半监督降维方法的属性Methods Basic idea Structure Parameters Computational MemorycFDA SSDR cLPP NPSSDR must-link+LDA cannot-link+must-link+PCA cannot-link+must-link+LPP cannot-link+must-link+LLE Global Global Local Local None α, β k , tk , αO (D 3) O (D 3) O (D 3) O (iD 3+pN 2) O (D 2)O (N 2+D 2) O (N 2+D 2) O (D 2+pN 2)2.3 基于其他监督信息的方法半监督降维方法除了可以利用类别标号和成对约束作为监督信息以外,还有很多其他形式的监督信息,我们把它们统一分在了第3类.扩充关系嵌入(ARE)[40]、语义子空间映射(SSP)[41]和相关集成映射(RAP)[42]是利用图像检索中的检索与被检索图像间的相关关系作为监督信息指导特征抽取的过程.Yang 等人使用流形上的嵌入关系,把一些无监督的流形方法扩展到半监督的形式[43],例如半监督的ISOMAP(SS-ISOMAP)和半监督的局部线性嵌入(SS-LLE).Memisevic 等人提出了一种半监督降维框架多关系嵌入(MRE)[44],可以综合利用多种相似性关系.图2展示了一个使用流形上的嵌入关系的例子[43].图中相对较大的实心样本点表示已知嵌入关系的样本,即在嵌入过程中,这些样本在低维空间中的位置是已知的,而其他样本的位置是未知的.文献[43]就是利用这种已知的嵌入关系作为先验知识,把几种经典的流形学习方法扩展到半监督的形式.(a) Original samples (b) Low dimensional embedded samples [43](a) 原始数据 (b) 低维嵌入[43]Fig.2 Prior information in the form of on-manifold coordinates图2 已知低维嵌入关系作为先验知识3 实 验在这一节中,我们在大量的标准数据集上比较了第2节中介绍的几种半监督降维方法.这些标准数据集包−11042−2−4−4−224陈诗国等:半监督降维方法的实验比较35括UCI标准数据集∗∗∗、半监督学习数据集[5]和人脸标准数据集.实验中,对于基于类别标号的降维方法,我们按照不同的比例将数据集随机划分为两个部分:一部分(占数据总数的10%,30%或50%)作为有标号的数据,另外一部分作为无标号的数据.对于基于成对约束的降维方法,随机选取一定数目(占数据总数的10%,30%或50%)的成对约束,而把整个数据集当作无标号的数据.对于UCI标准数据集和半监督数据集,选用最大似然估计方法[45]计算数据集的内在维数,文献[2]中的实验中也采用了相同的处理方法.实验采用简单的最近邻分类器(nearest neighbor,简称NN)的分类精度作为降维方法的评价指标,使用留一交叉验证法估计最终的实验结果.各种降维方法的参数设置见表3,除了部分参数经验设定以外,其他参数都采用网格式搜索的方法确定最优设置.T able 3Parameter settings for experiments表3实验中的参数设置Methods ParametersettingsLDA NoneCCDR SDA SELF SSLFDA 1≤k≤15, t=1, 0<β≤10 1≤k≤15, β=0, 0<α≤10 1≤k≤15, 0<β<11≤k≤15, 0<γ≤10, α=8cFDA DCA SSDR cLPP NPSSDRNoneNoneα=1, 1≤β≤301≤k≤15, t=51≤k≤15, α=0.2 or 0.023.1 UCI标准数据集上的实验比较本实验中使用了8个UCI标准数据集用于测试降维算法的性能.表4中列出了这些数据集的属性(C表示类别数目,D表示维数,N表示样本数目).这些数据集中既包括简单的数据集,例如iris和soybean_small,也包括一些复杂的数据集,例如手写数据集digits0.05和letter0.05.Table 4Properties of 8 UCI datasets表48个UCI数据集的属性iris digits0.05 letter0.05protein soybean_small letter_0.1_IJL ionosphere zooC D N34150101655026161 
00062011643547316227234351716101表5和表6分别显示了基于类别标号和成对约束的半监督降维方法在UCI标准数据集上的实验结果.表中列出来的结果是经过降维后的数据再用最近邻分类器得到的分类精度,每个实验结果都是算法各自运行20遍后的平均值.PCA,LDA和只利用约束的DCA作为比较的基准方法也被列了出来.表中的最左列表示的是数据集的名称、最大似然估计得到的维数值D和类别数C.表5中的nL表示的是有标号的数据在整个数据集中所占的比例(10%,30%或50%),表6中的nC表示的是约束的数目占数据总数的比例(10%,30%或50%),其中,正约束和负约束的数目相同.黑色粗体数值表示的是该数据集上最好的实验结果.∗∗∗实验中使用的UCI标准数据集可以从网站/ml/上得到,其中,digits0.05,letter0.05和letter_0.1_IJL是从原始数据集digits,letter中进行抽样得到的.具体的抽样方法可参照工具包Weka(/ml/weka/)中的数据集.。
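The evaluation protocol described in the survey above (reduce the dimensionality first, then report the accuracy of a simple nearest-neighbor classifier under leave-one-out cross-validation) can be sketched as follows. scikit-learn is assumed to be available; PCA merely stands in for whichever supervised or semi-supervised reduction method is being compared, and the data set and dimensionality are illustrative.

```python
# Minimal sketch of the evaluation protocol: dimensionality reduction followed
# by 1-NN classification accuracy under leave-one-out cross-validation.
# scikit-learn is assumed; PCA is only a placeholder for the method under test.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]                       # small subset to keep the loop fast

Z = PCA(n_components=10).fit_transform(X)     # low-dimensional representation

nn = KNeighborsClassifier(n_neighbors=1)      # simple 1-NN classifier
scores = cross_val_score(nn, Z, y, cv=LeaveOneOut())
print("leave-one-out 1-NN accuracy:", scores.mean())
```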
Global Discriminant and Local Sparse Preserving Semi-Supervised Feature Extraction for HSI
HUANG Dongmei, ZHANG Xiaotong, ZHANG Minghua, et al. Global discriminant and local sparse preserving semi-supervised feature extraction for HSI. Computer Engineering and Applications, 2019, 55(20): 184-191.
Global Discriminant and Local Sparse Preserving Semi-Supervised Feature Extraction for HSI HUANG Dongmei1,2, ZHANG Xiaotong1, ZHANG Minghua1, SONG Wei1
Abstract: To address the "curse of dimensionality" in hyperspectral images, a global discriminant and local sparse preserving semi-supervised feature extraction algorithm (GLSSFE) is proposed. The algorithm uses the scatter matrices of LDA to preserve the global within-class and between-class discriminant information of the labeled samples, and combines this with a semi-supervised PCA that performs principal component analysis on both labeled and unlabeled samples to preserve the global structure of the data. A sparse representation optimization model is used to adaptively reveal the nonlinear structure among samples, and local between-class and within-class discriminant weights are embedded into a semi-supervised LPP algorithm to preserve the local structure of the data, thereby maximizing the similarity of same-class samples and the separability of different-class samples. The effectiveness of the proposed feature extraction method is verified by classifying two public hyperspectral image data sets, Indian Pines and Pavia University, with 1-NN and SVM classifiers. Experimental results show that the GLSSFE algorithm achieves the highest overall classification accuracies of 89.10% and 92.09%, respectively, outperforming existing feature extraction algorithms; it effectively exploits both the global and local features of hyperspectral images and greatly improves land-cover classification. Keywords: hyperspectral image; semi-supervised global discriminant analysis; semi-supervised local sparse preserving; feature extraction; spatial correlation. Document code: A. CLC number: TP751. doi: 10.3778/j.issn.1002-8331.1806-0261
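The abstract above blends supervised scatter information from labeled samples with structure estimated from unlabeled samples. The sketch below is a minimal, hypothetical illustration of that general idea only, not the authors' GLSSFE algorithm: it blends the labeled between-/within-class scatter of LDA with the total scatter of all samples (a PCA-like term) and solves a generalized eigenproblem for the projection. NumPy and SciPy are assumed; the blend parameter beta and the toy data are illustrative.

```python
# Minimal sketch: a semi-supervised projection that blends labeled scatter
# matrices with the total scatter of all (labeled + unlabeled) samples.
# Illustrates the general idea only; this is NOT the GLSSFE algorithm.
import numpy as np
from scipy.linalg import eigh

def semi_supervised_projection(X_l, y_l, X_u, dim=10, beta=0.5):
    X_all = np.vstack([X_l, X_u])
    mean_all = X_all.mean(axis=0)
    d = X_l.shape[1]

    # Supervised part: between- and within-class scatter of labeled samples.
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in np.unique(y_l):
        Xc = X_l[y_l == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)
        Sw += (Xc - mc).T @ (Xc - mc)

    # Unsupervised part: total scatter of all samples (PCA-like term).
    Xc_all = X_all - mean_all
    St = Xc_all.T @ Xc_all

    # Blend the two and solve the generalized eigenproblem A w = lambda B w.
    A = (1 - beta) * Sb + beta * St
    B = (1 - beta) * Sw + beta * np.eye(d)
    vals, vecs = eigh(A, B)              # eigenvalues in ascending order
    return vecs[:, ::-1][:, :dim]        # top-`dim` projection directions

# Toy usage with random data (purely illustrative).
rng = np.random.default_rng(0)
X_l = rng.normal(size=(40, 20))
y_l = rng.integers(0, 3, size=40)
X_u = rng.normal(size=(200, 20))
W = semi_supervised_projection(X_l, y_l, X_u, dim=5)
Z = np.vstack([X_l, X_u]) @ W            # low-dimensional embedding
```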
Research on Image Classification Algorithms Based on Semi-Supervised Deep Learning
With the continuous development of technology, image classification has been widely applied across many industries. However, owing to uncertain factors such as the complex sources and sheer volume of image data, the difficulties faced by traditional supervised image classification algorithms have become increasingly apparent. Research on image classification algorithms based on semi-supervised deep learning has therefore become a popular topic.
1. Semi-supervised learning theory
Semi-supervised learning is a learning paradigm that lies between supervised learning and unsupervised learning. In semi-supervised learning, the training set contains both labeled and unlabeled data. Because labeled data are limited in quantity while unlabeled data are plentiful, the classification performance of a model can be improved by making sensible use of the unlabeled data.
2. Deep learning theory
Deep learning is a machine learning approach based on artificial neural networks. Compared with traditional machine learning algorithms, deep learning offers stronger generalization and can learn features automatically, avoiding the tedious process of manual feature engineering. Deep learning models have therefore been widely applied in image processing.
3. Semi-supervised deep learning for image classification
When studying image classification algorithms, the training data are usually divided into three parts: a labeled training set, an unlabeled training set, and a validation set. First, the labeled training set is used for supervised training of the model; then, the unlabeled training set is used for semi-supervised training, i.e., learning feature patterns from the unlabeled data; finally, the validation set is used to tune and optimize the model. A minimal sketch of this workflow is given below.
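The sketch below illustrates the three-way split and one combined supervised/unsupervised training pass, assuming PyTorch is available. The network, the entropy-minimization term used for the unlabeled data, and all hyperparameters are placeholders for illustration rather than the specific method described in the article.

```python
# Minimal sketch of the labeled / unlabeled / validation split and a
# semi-supervised training pass. Purely illustrative; the unlabeled loss
# here is a simple entropy-minimization term, not the article's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_classes, n_features = 10, 64

# Hypothetical data: a small labeled set, a larger unlabeled set, a validation set.
x_lab, y_lab = torch.randn(100, n_features), torch.randint(0, n_classes, (100,))
x_unl = torch.randn(1000, n_features)
x_val, y_val = torch.randn(200, n_features), torch.randint(0, n_classes, (200,))

model = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, n_classes))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    model.train()
    opt.zero_grad()
    # Supervised loss on the labeled training set.
    sup_loss = F.cross_entropy(model(x_lab), y_lab)
    # Unsupervised loss on the unlabeled set (entropy minimization as a placeholder).
    probs = F.softmax(model(x_unl), dim=1)
    unsup_loss = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
    loss = sup_loss + 0.1 * unsup_loss
    loss.backward()
    opt.step()

# Validation accuracy, used to tune hyperparameters such as the 0.1 weight above.
model.eval()
with torch.no_grad():
    val_acc = (model(x_val).argmax(dim=1) == y_val).float().mean().item()
print(f"validation accuracy: {val_acc:.3f}")
```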
At present, the most popular approach to semi-supervised deep learning for image classification is semi-supervised learning with generative adversarial networks (GANs). By setting up an adversarial model consisting of a generator and a discriminator, a GAN can effectively generate realistic samples. When a GAN is used for semi-supervised learning, the generator is treated as the classifier for the unlabeled data and the discriminator as the classifier for the labeled data. Through the adversarial training process of the GAN, the classification performance of the model can be effectively improved; a sketch of one common formulation follows.
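As a minimal illustration of semi-supervised GAN training, the sketch below uses a discriminator with K+1 outputs (K real classes plus one "fake" class) and shows how a labeled batch, an unlabeled batch, and a generated batch each contribute to the discriminator loss. PyTorch is assumed; the network sizes, generator, and data are placeholders, and this is one common formulation rather than the exact algorithm of the article.

```python
# Minimal sketch of a (K+1)-class semi-supervised GAN discriminator loss.
# Placeholders throughout; one common formulation, not the article's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
K, n_features, z_dim = 10, 64, 16            # K real classes + 1 "fake" class

disc = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, K + 1))
gen = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, n_features))

x_lab, y_lab = torch.randn(32, n_features), torch.randint(0, K, (32,))
x_unl = torch.randn(32, n_features)
x_fake = gen(torch.randn(32, z_dim)).detach()  # generated samples, detached for the D step

fake_class = K                                  # index of the extra "fake" class
logits_lab, logits_unl, logits_fake = disc(x_lab), disc(x_unl), disc(x_fake)

# 1) Supervised term: labeled samples must receive their true (real) class.
loss_sup = F.cross_entropy(logits_lab, y_lab)

# 2) Unlabeled samples are real, so they should NOT be assigned to the fake class.
p_fake_unl = F.softmax(logits_unl, dim=1)[:, fake_class]
loss_unl = -torch.log(1.0 - p_fake_unl + 1e-8).mean()

# 3) Generated samples should be assigned to the fake class.
loss_fake = F.cross_entropy(logits_fake, torch.full((32,), fake_class, dtype=torch.long))

d_loss = loss_sup + loss_unl + loss_fake
d_loss.backward()   # one discriminator step; the generator is updated in a separate step
```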
4. Algorithm implementation
When implementing a semi-supervised deep learning image classification algorithm, an appropriate deep learning framework must be chosen for development. Popular deep learning frameworks today include TensorFlow, Keras, and PyTorch. These frameworks provide not only many deep learning models but also a variety of practical tools and function libraries that simplify the development process.
Semi-Supervised Laplace Discriminant Embedding for Hyperspectral Image Classification
LI Zhimin; ZHANG Jie; HUANG Hong; MA Zezhong
Journal: Journal of Electronics & Information Technology
Year (Volume), Issue: 2015(000)004
Abstract: In order to effectively extract the discriminant characteristics of hyperspectral remote sensing image data, this paper presents a Semi-Supervised Laplace Discriminant Embedding (SSLDE) algorithm based on the discriminant information of labeled samples and the local structural information of unlabeled samples. The proposed algorithm makes use of the class information of labeled samples to maintain the separability of the sample set, and discovers the local manifold structure of the sample set by constructing a Laplacian matrix over both labeled and unlabeled samples, thereby achieving semi-supervised manifold discrimination. Experimental results on the KSC and Urban databases show that the algorithm has higher classification accuracy and can effectively extract discriminant feature information. In overall classification accuracy, it improves on the Semi-Supervised Maximum Margin Criterion (SSMMC) algorithm by 6.3%~7.4% and on the Semi-Supervised Sub-Manifold Preserving Embedding (SSSMPE) algorithm by 1.6%~4.4%.
Supervised learning and unsupervised learning
Common machine learning methods fall mainly into supervised learning and unsupervised learning. Supervised learning is what people usually mean by classification: an optimal model (one member of some family of functions, where "optimal" means best under a given evaluation criterion) is trained from existing training samples, that is, known data together with their corresponding outputs. The model is then used to map any input to a corresponding output, and a simple decision on that output achieves classification, so the model gains the ability to classify previously unseen data.
In how people come to know the world, from childhood we are taught by adults that this is a bird, that is a pig, that is a house, and so on. The scenes we see are the input data, and the adults' judgments about them (house or bird) are the corresponding outputs. Once we have seen enough, our minds gradually form generalized models, which are precisely the trained function (or functions), so that even without an adult pointing things out we can tell which things are houses and which are birds. Typical examples of supervised learning are KNN and SVM; a minimal sketch of both follows.
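The sketch below trains the two supervised classifiers just mentioned on a small labeled data set, assuming scikit-learn is available; the data set and parameters are illustrative only.

```python
# Minimal sketch: training the two supervised classifiers mentioned above
# (KNN and SVM) on a toy labeled data set. scikit-learn is assumed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)

print("KNN accuracy:", knn.score(X_test, y_test))
print("SVM accuracy:", svm.score(X_test, y_test))
```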
Unsupervised learning (some call it non-supervised learning; the terms are much the same) is another heavily studied learning approach. It differs from supervised learning in that no training samples are available beforehand, and the data must be modeled directly. That may sound implausible, but unsupervised learning is used in many places as we ourselves come to understand the world. For instance, suppose we visit an art exhibition knowing nothing at all about art; after viewing many works we can still group them into different schools (say, which are more hazy and which more realistic, even if we do not know the terms "impressionist" or "realist"; at the very least we can split them into two groups). The classic example of unsupervised learning is clustering. The goal of clustering is to gather similar things together, without caring what each group actually is. A clustering algorithm therefore usually only needs to know how to compute similarity before it can start working, as in the sketch below.
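A minimal k-means clustering sketch, assuming scikit-learn is available; the synthetic data and the number of clusters are illustrative.

```python
# Minimal sketch: clustering unlabeled data with k-means. scikit-learn is assumed.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Unlabeled data drawn from two blobs; the algorithm never sees any labels.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])        # cluster index assigned to the first ten points
print(kmeans.cluster_centers_)    # the two discovered cluster centers
```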
So when should supervised learning be adopted, and when unsupervised learning? I, too, only began to consider the answer seriously after being asked this question in an interview.
A Semi-Supervised Classification Algorithm Utilizing Universum
YANG Wei; HOU Chenping; WU Yi
Journal: Computer Engineering and Applications
Year (Volume), Issue: 2012(048)006
Abstract: Classification is an important branch of machine learning. How to attain better classification with less labeled data remains a hot issue in recent research. Traditional semi-supervised classification can take advantage of the training samples, either labeled or unlabeled, but ignores related non-samples, called the Universum. Combining the advantages of traditional semi-supervised methods and the Universum, Semi-Supervised Classification with the Universum (SSCU), via linear regression and subspace learning, can effectively improve the classification of the original high-dimensional data while adding no labels. The effectiveness is verified by both simulation and real-world data.
Pages: 4 (P155-157, 176)
Authors: YANG Wei; HOU Chenping; WU Yi
Affiliation: Department of Mathematics and Systems Science, National University of Defense Technology, Changsha 410073, China
Language: Chinese
CLC number: TP391
Related works:
1. A semi-supervised multi-label learning document classification algorithm based on Tri-training [J], GAO Jiawei; LIANG Jiye; LIU Yanglei; LI Ru
2. SSC_MCC: a semi-supervised classification algorithm with multiple collaborating classifiers [J], LIU Ning; ZHAO Jianhua
3. A multi-class semi-supervised classification algorithm based on evidence theory [J], SHENG Kai; LIU Zhong; ZHOU Dechao; WEI Qihang; FENG Chengxu
4. A safe semi-supervised classification algorithm based on sample selection [J], ZHAO Jianhua; LIU Ning
5. A semi-supervised classification algorithm based on co-training [J], WANG Yu; LI Yanhui
Fast Half-Pixel Motion Estimation Based on Minimum Matching Error Direction Prediction
DONG Haiyan; ZHANG Qishan
Journal: Computer Science
Year (Volume), Issue: 2005(032)009
Abstract: To reduce the computational cost of half-pixel search, this paper proposes a fast half-pixel motion estimation algorithm based on minimum matching error direction prediction. The proposed algorithm exploits the unimodal-surface property of the matching error within the sub-pixel search window to predict the direction of the minimum matching error in the half-pixel search region, thereby avoiding a large amount of unnecessary matching computation. Experimental results show that, for video sequences with various degrees of motion and spatial detail, the proposed algorithm saves 73% of the computation on average while maintaining the same image quality as half-pixel full search, making it well suited to real-time applications.
Pages: 3 (P218-220)
Authors: DONG Haiyan; ZHANG Qishan
Affiliation: School of Electronic and Information Engineering, Beihang University, Beijing 100083, China (both authors)
Language: Chinese
CLC number: TP3
Related works:
1. Half-pixel motion estimation based on linear prediction [J], ZHANG Weiming; XU Yuanxin; WANG Kuang
2. A fast hardware-oriented half-pixel search algorithm for motion estimation [J], ZHAO Bo; WU Chengke; ZHANG Fang
3. VLSI implementation of a fast and efficient half-pixel motion estimation algorithm [J], BAO Lin; LI Weixiang
4. A fast motion estimation algorithm based on motion direction prediction [J], XIANG Youjun; WU Zongze; XIE Shengli
5. Optimization of fast integer-pixel motion estimation based on direction information [J], XIONG Chengyi; BAI Yun
Semi-supervised and unsupervised extreme learningmachinesGao Huang,Shiji Song,Jatinder N.D.Gupta,and Cheng WuAbstract—Extreme learning machines(ELMs)have proven to be an efficient and effective learning paradigm for pattern classification and regression.However,ELMs are primarily applied to supervised learning problems.Only a few existing research studies have used ELMs to explore unlabeled data. In this paper,we extend ELMs for both semi-supervised and unsupervised tasks based on the manifold regularization,thus greatly expanding the applicability of ELMs.The key advantages of the proposed algorithms are1)both the semi-supervised ELM (SS-ELM)and the unsupervised ELM(US-ELM)exhibit the learning capability and computational efficiency of ELMs;2) both algorithms naturally handle multi-class classification or multi-cluster clustering;and3)both algorithms are inductive and can handle unseen data at test time directly.Moreover,it is shown in this paper that all the supervised,semi-supervised and unsupervised ELMs can actually be put into a unified framework. This provides new perspectives for understanding the mechanism of random feature mapping,which is the key concept in ELM theory.Empirical study on a wide range of data sets demonstrates that the proposed algorithms are competitive with state-of-the-art semi-supervised or unsupervised learning algorithms in terms of accuracy and efficiency.Index Terms—Clustering,embedding,extreme learning ma-chine,manifold regularization,semi-supervised learning,unsu-pervised learning.I.I NTRODUCTIONS INGLE layer feedforward networks(SLFNs)have been intensively studied during the past several decades.Most of the existing learning algorithms for training SLFNs,such as the famous back-propagation algorithm[1]and the Levenberg-Marquardt algorithm[2],adopt gradient methods to optimize the weights in the network.Some existing works also use forward selection or backward elimination approaches to con-struct network dynamically during the training process[3]–[7].However,neither the gradient based methods nor the grow/prune methods guarantee a global optimal solution.Al-though various methods,such as the generic and evolutionary algorithms,have been proposed to handle the local minimum This work was supported by the National Natural Science Foundation of China under Grant61273233,the Research Fund for the Doctoral Program of Higher Education under Grant20120002110035and20130002130010, the National Key Technology R&D Program under Grant2012BAF01B03, the Project of China Ocean Association under Grant DY125-25-02,and Tsinghua University Initiative Scientific Research Program under Grants 2011THZ07132.Gao Huang,Shiji Song,and Cheng Wu are with the Department of Automation,Tsinghua University,Beijing100084,China(e-mail:huang-g09@;shijis@; wuc@).Jatinder N.D.Gupta is with the College of Business Administration,The University of Alabama in Huntsville,Huntsville,AL35899,USA.(e-mail: guptaj@).problem,they basically introduce high computational cost. 
One of the most successful algorithms for training SLFNs is the support vector machines(SVMs)[8],[9],which is a maximal margin classifier derived under the framework of structural risk minimization(SRM).The dual problem of SVMs is a quadratic programming and can be solved conveniently.Due to its simplicity and stable generalization performance,SVMs have been widely studied and applied to various domains[10]–[14].Recently,Huang et al.[15],[16]proposed the extreme learning machines(ELMs)for training SLFNs.In contrast to most of the existing approaches,ELMs only update the output weights between the hidden layer and the output layer, while the parameters,i.e.,the input weights and biases,of the hidden layer are randomly generated.By adopting squared loss on the prediction error,the training of output weights turns into a regularized least squares(or ridge regression)problem which can be solved efficiently in closed form.It has been shown that even without updating the parameters of the hidden layer,the SLFN with randomly generated hidden neurons and tunable output weights maintains its universal approximation capability[17]–[19].Compared to gradient based algorithms, ELMs are much more efficient and usually lead to better generalization performance[20]–[22].Compared to SVMs, solving the regularized least squares problem in ELMs is also faster than solving the quadratic programming problem in standard SVMs.Moreover,ELMs can be used for multi-class classification problems directly.The predicting accuracy achieved by ELMs is comparable with or even higher than that of SVMs[16],[22]–[24].The differences and similarities between ELMs and SVMs are discussed in[25]and[26], and new algorithms are proposed by combining the advan-tages of both models.In[25],an extreme SVM(ESVM) model is proposed by combining ELMs and the proximal SVM(PSVM).The ESVM algorithm is shown to be more accurate than the basic ELMs model due to the introduced regularization technique,and much more efficient than SVMs since there is no kernel matrix multiplication in ESVM.In [26],the traditional RBF kernel are replaced by ELM kernel, leading to an efficient algorithm with matched accuracy of SVMs.In the past years,researchers from variesfields have made substantial contribution to ELM theories and applications.For example,the universal approximation ability of ELMs has been further studied in a classification context[23].The gen-eralization error bound of ELMs has been investigated from the perspective of the Vapnik-Chervonenkis(VC)dimension theory and the initial localized generalization error model(LGEM)[27],[28].Varies extensions have been made to the basic ELMs to make it more efficient and more suitable for specific problems,such as ELMs for online sequential data [29]–[31],ELMs for noisy/missing data[32]–[34],ELMs for imbalanced data[35],etc.From the implementation aspect, ELMs has recently been implemented using parallel tech-niques[36],[37],and realized on hardware[38],which made ELMs feasible for large data sets and real time reasoning. 
Though ELMs have become popular in a wide range of domains,they are primarily used for supervised learning tasks such as classification and regression,which greatly limits their applicability.In some cases,such as text classification, information retrieval and fault diagnosis,obtaining labels for fully supervised learning is time consuming and expensive, while a multitude of unlabeled data are easy and cheap to collect.To overcome the disadvantage of supervised learning al-gorithms that they cannot make use of unlabeled data,semi-supervised learning(SSL)has been proposed to leverage both labeled and unlabeled data[39],[40].The SSL algorithms assume that the input patterns from both labeled and unlabeled data are drawn from the same marginal distribution.Therefore, the unlabeled data naturally provide useful information for exploring the data structure in the input space.By assuming that the input data follows some cluster structure or manifold in the input space,SSL algorithms can incorporate both la-beled and unlabeled data into the learning process.Since SSL requires less effort to collect labeled data and can offer higher accuracy,it has been applied to various domains[41]–[43].In some other cases where no labeled data are available,people may be interested in exploring the underlying structure of the data.To this end,unsupervised learning(USL)techniques, such as clustering,dimension reduction or data representation, are widely used to fulfill these tasks.In this paper,we extend ELMs to handle both semi-supervised and unsupervised learning problems by introducing the manifold regularization framework.Both the proposed semi-supervised ELM(SS-ELM)and unsupervised ELM(US-ELM)inherit the computational efficiency and the learn-ing capability of traditional pared with existing algorithms,SS-ELM and US-ELM are not only inductive (straightforward extension for out-of-sample examples at test time),but also can be used for multi-class classification or multi-cluster clustering directly.We test our algorithms on a variety of data sets,and make comparisons with other related algorithms.The results show that the proposed algorithms are competitive with state-of-the-art algorithms in terms of accuracy and efficiency.It is worth to mention that all the supervised,semi-supervised and unsupervised ELMs can actually be put into a unified framework,that is all the algorithms consist of two stages:1)random feature mapping;and2)output weights solving.Thefirst stage is to construct the hidden layer using randomly generated hidden neurons.This is the key concept in the ELM theory,which differs it from many existing feature learning methods.Generating feature mapping randomly en-ables ELMs for fast nonlinear feature learning and alleviates the problem of over-fitting.The second stage is to solve the weights between the hidden layer and the output layer, and this is where the main difference of supervised,semi-supervised and unsupervised ELMs lies.We believe that the unified framework for the three types of ELMs might provide us a new perspective to understand the underlying behavior of the random feature mapping in ELMs.The rest of the paper is organized as follows.In Section II,we give a brief review of related existing literature on semi-supervised and unsupervised learning.Section III and IV introduce the basic formulation of ELMs and the man-ifold regularization framework,respectively.We present the proposed SS-ELM and US-ELM algorithms in Sections V and VI.Experiment results are given in Section VII,and 
Section VIII concludes the paper.II.R ELATED WORKSOnly a few existing research studies on ELMs have dealt with the problem of semi-supervised learning or unsupervised learning.In[44]and[45],the manifold regularization frame-work was introduce into the ELMs model to leverage both labeled and unlabeled data,thus extended ELMs for semi-supervised learning.However,both of these two works are limited to binary classification problems,thus they haven’t explore the full power of ELMs.Moreover,both algorithms are only effective when the number of training patterns is more than the number of hidden neurons.Unfortunately,this condition is usually violated in semi-supervised learning since the training data is relatively scarce compared to the hidden neurons,whose number is commonly set to several hundreds or several thousands.Recently,a co-training approach have been proposed to train ELMs in a semi-supervised setting [46].In this algorithm,the labeled training sets are augmented gradually by moving a small set of most confidently predicted unlabeled data to the labeled set at each loop,and ELMs are trained repeatedly on the pseudo-labeled set.Since the algo-rithm need to train ELMs repeatedly,it introduces considerable extra computational cost.The proposed SS-ELM is related to a few other mani-fold assumption based semi-supervised learning algorithms, such as the Laplacian support vector machines(LapSVMs) [47],the Laplacian regularized least squares(LapRLS)[47], semi-supervised neural networks(SSNNs)[48],and semi-supervised deep embedding[49].It has been shown in these works that manifold regularization is effective in a wide range of domains and often leads to a state-of-the-art performance in terms of accuracy and efficiency.The US-ELM proposed in this paper are related to the Laplacian Eigenmaps(LE)[50]and spectral clustering(SC) [51]in that they both use spectral techniques for embedding and clustering.In all these algorithms,an affinity matrix is first built from the input patterns.The SC performs eigen-decomposition on the normalized affinity matrix,and then embeds the original data into a d-dimensional space using the first d eigenvectors(each row is normalized to have unit length and represents a point in the embedded space)corresponding to the d largest eigenvalues.The LE algorithm performs generalized eigen-decomposition on the graph Laplacian,anduses the d eigenvectors corresponding to the second through the(d+1)th smallest eigenvalues for embedding.When LE and SC are used for clustering,then k-means is adopted to cluster the data in the embedded space.Similar to LE and SC,the US-ELM are also based on the affinity matrix,and it is converted to solving a generalized eigen-decomposition problem.However,the eigenvectors obtained in US-ELM are not used for data representation directly,but are used as the parameters of the network,i.e.,the output weights.Note that once the US-ELM model is trained,it can be applied to any presented data in the original input space.In this way,US-ELM provide a straightforward way for handling new patterns without recomputing eigenvectors as in LE and SC.III.E XTREME LEARNING MACHINES Consider a supervised learning problem where we have a training set with N samples,{X,Y}={x i,y i}N i=1.Herex i∈R n i,y i is a n o-dimensional binary vector with only one entry(correspond to the class that x i belongs to)equal to one for multi-classification tasks,or y i∈R n o for regression tasks,where n i and n o are the dimensions of input and output respectively.ELMs aim to 
learn a decision rule or an approximation function based on the training data. Generally,the training of ELMs consists of two stages.The first stage is to construct the hidden layer using afixed number of randomly generated mapping neurons,which can be any nonlinear piecewise continuous functions,such as the Sigmoid function and Gaussian function given below.1)Sigmoid functiong(x;θ)=11+exp(−(a T x+b));(1)2)Gaussian functiong(x;θ)=exp(−b∥x−a∥);(2) whereθ={a,b}are the parameters of the mapping function and∥·∥denotes the Euclidean norm.A notable feature of ELMs is that the parameters of the hidden mapping functions can be randomly generated ac-cording to any continuous probability distribution,e.g.,the uniform distribution on(-1,1).This makes ELMs distinct from the traditional feedforward neural networks and SVMs. The only free parameters that need to be optimized in the training process are the output weights between the hidden neurons and the output nodes.By doing so,training ELMs is equivalent to solving a regularized least squares problem which is considerately more efficient than the training of SVMs or backpropagation algorithms.In thefirst stage,a number of hidden neurons which map the data from the input space into a n h-dimensional feature space (n h is the number of hidden neurons)are randomly generated. We denote by h(x i)∈R1×n h the output vector of the hidden layer with respect to x i,andβ∈R n h×n o the output weights that connect the hidden layer with the output layer.Then,the outputs of the network are given byf(x i)=h(x i)β,i=1,...,N.(3)In the second stage,ELMs aim to solve the output weights by minimizing the sum of the squared losses of the prediction errors,which leads to the following formulationminβ∈R n h×n o12∥β∥2+C2N∑i=1∥e i∥2s.t.h(x i)β=y T i−e T i,i=1,...,N,(4)where thefirst term in the objective function is a regularization term which controls the complexity of the model,e i∈R n o is the error vector with respect to the i th training pattern,and C is a penalty coefficient on the training errors.By substituting the constraints into the objective function, we obtain the following equivalent unconstrained optimization problem:minβ∈R n h×n oL ELM=12∥β∥2+C2∥Y−Hβ∥2(5)where H=[h(x1)T,...,h(x N)T]T∈R N×n h.The above problem is widely known as the ridge regression or regularized least squares.By setting the gradient of L ELM with respect toβto zero,we have∇L ELM=β+CH H T(Y−Hβ)=0(6) If H has more rows than columns and is of full column rank,which is usually the case where the number of training patterns are more than the number of the hidden neurons,the above equation is overdetermined,and we have the following closed form solution for(5):β∗=(H T H+I nhC)−1H T Y,(7)where I nhis an identity matrix of dimension n h.Note that in practice,rather than explicitly inverting the n h×n h matrix in the above expression,we can use Gaussian elimination to directly solve a set of linear equations in a more efficient and numerically stable manner.If the number of training patterns are less than the number of hidden neurons,then H will have more columns than rows, which often leads to an underdetermined least squares prob-lem.In this case,βmay have infinite number of solutions.To handle this problem,we restrictβto be a linear combination of the rows of H:β=H Tα(α∈R N×n o).Notice that when H has more columns than rows and is of full row rank,then H H T is invertible.Multiplying both side of(6) by(H H T)−1H,we getα+C(Y−H H Tα)=0,(8) This yieldsβ∗=H Tα∗=H T(H H T+I NC)−1Y(9)where I N is an 
identity matrix of dimension N. Therefore,in the case where training patterns are plentiful compared to the hidden neurons,we use(7)to compute the output weights,otherwise we use(9).IV.T HE MANIFOLD REGULARIZATION FRAMEWORK Semi-supervised learning is built on the following two assumptions:(1)both the label data X l and the unlabeled data X u are drawn from the same marginal distribution P X ;and (2)if two points x 1and x 2are close to each other,then the conditional probabilities P (y |x 1)and P (y |x 2)should be similar as well.The latter assumption is widely known as the smoothness assumption in machine learning.To enforce this assumption on the data,the manifold regularization framework proposes to minimize the following cost functionL m=12∑i,jw ij ∥P (y |x i )−P (y |x j )∥2,(10)where w ij is the pair-wise similarity between two patterns x iand x j .Note that the similarity matrix W =[w ij ]is usually sparse,since we only place a nonzero weight between two patterns x i and x j if they are close,e.g.,x i is among the k nearest neighbors of x j or x j is among the k nearest neighbors of x i .The nonzero weights are usually computed using Gaussian function exp (−∥x i −x j ∥2/2σ2),or simply fixed to 1.Intuitively,the formulation (10)penalizes large variation in the conditional probability P (y |x )when x has a small change.This requires that P (y |x )vary smoothly along the geodesics of P (x ).Since it is difficult to compute the conditional probability,we can approximate (10)with the following expression:ˆLm =12∑i,jw ij ∥ˆyi −ˆy j ∥2,(11)where ˆyi and ˆy j are the predictions with respect to pattern x i and x j ,respectively.It is straightforward to simplify the above expression in a matrix form:ˆL m =Tr (ˆY T L ˆY ),(12)where Tr (·)denotes the trace of a matrix,L =D −W isknown as the graph Laplacian ,and D is a diagonal matrixwith its diagonal elements D ii =l +u∑j =1w i,j .As discussed in [52],instead of using L directly,we can normalize it byD −12L D −12or replace it by L p (p is an integer),based on some prior knowledge.V.S EMI -SUPERVISED ELMIn the semi-supervised setting,we have few labeled data and plenty of unlabeled data.We denote the labeled data in the training set as {X l ,Y l }={x i ,y i }l i =1,and unlabeled dataas X u ={x i }ui =1,where l and u are the number of labeled and unlabeled data,respectively.The proposed SS-ELM incorporates the manifold regular-ization to leverage unlabeled data to improve the classification accuracy when labeled data are scarce.By modifying the ordinary ELM formulation (4),we give the formulation ofSS-ELM as:minβ∈R n h ×n o12∥β∥2+12l∑i =1C i ∥e i ∥2+λ2Tr (F T L F )s.t.h (x i )β=y T i −e T i ,i =1,...,l,f i =h (x i )β,i =1,...,l +u(13)where L ∈R (l +u )×(l +u )is the graph Laplacian built fromboth labeled and unlabeled data,and F ∈R (l +u )×n o is the output matrix of the network with its i th row equal to f (x i ),λis a tradeoff parameter.Note that similar to the weighted ELM algorithm (W-ELM)introduced in [35],here we associate different penalty coeffi-cient C i on the prediction errors with respect to patterns from different classes.This is because we found that when the data is skewed,i.e.,some classes have significantly more training patterns than other classes,traditional ELMs tend to fit the classes that having the majority of patterns quite well but fits other classes poorly.This usually leads to poor generalization performance on the testing set (while the prediction accuracy may be high,but the some classes are neglected).Therefore,we 
propose to alleviate this problem by re-weighting instances from different classes.Suppose that x i belongs to class t i ,which has N t i training patterns,then we associate e i with a penalty ofC i =C 0N t i.(14)where C 0is a user defined parameter as in traditional ELMs.In this way,the patterns from the dominant classes will not be over fitted by the algorithm,and the patterns from a class with less samples will not be neglected.We substitute the constraints into the objective function,and rewrite the above formulation in a matrix form:min β∈R n h×n o 12∥β∥2+12∥C 12( Y −Hβ)∥2+λ2Tr (βT H TL Hβ)(15)where Y∈R (l +u )×n o is the training target with its first l rows equal to Y l and the rest equal to 0,C is a (l +u )×(l +u )diagonal matrix with its first l diagonal elements [C ]ii =C i ,i =1,...,l and the rest equal to 0.Again,we compute the gradient of the objective function with respect to β:∇L SS −ELM =β+H T C ( Y−H β)+λH H T L H β.(16)By setting the gradient to zero,we obtain the solution tothe SS-ELM:β∗=(I n h +H T C H +λH H T L H )−1H TC Y .(17)As in Section III,if the number of labeled data is fewer thanthe number of hidden neurons,which is common in SSL,we have the following alternative solution:β∗=H T (I l +u +C H H T +λL L H H T )−1C Y .(18)where I l +u is an identity matrix of dimension l +u .Note that by settingλto be zero and the diagonal elements of C i(i=1,...,l)to be the same constant,(17)and (18)reduce to the solutions of traditional ELMs(7)and(9), respectively.Based on the above discussion,the SS-ELM algorithm is summarized as Algorithm1.Algorithm1The SS-ELM algorithmInput:The labeled patterns,{X l,Y l}={x i,y i}l i=1;The unlabeled patterns,X u={x i}u i=1;Output:The mapping function of SS-ELM:f:R n i→R n oStep1:Construct the graph Laplacian L from both X l and X u.Step2:Initiate an ELM network of n h hidden neurons with random input weights and biases,and calculate the output matrix of the hidden neurons H∈R(l+u)×n h.Step3:Choose the tradeoff parameter C0andλ.Step4:•If n h≤NCompute the output weightsβusing(17)•ElseCompute the output weightsβusing(18)return The mapping function f(x)=h(x)β.VI.U NSUPERVISED ELMIn this section,we introduce the US-ELM algorithm for unsupervised learning.In an unsupervised setting,the entire training data X={x i}N i=1are unlabeled(N is the number of training patterns)and our target is tofind the underlying structure of the original data.The formulation of US-ELM follows from the formulation of SS-ELM.When there is no labeled data,(15)is reduced tomin β∈R n h×n o ∥β∥2+λTr(βT H T L Hβ)(19)Notice that the above formulation always attains its mini-mum atβ=0.As suggested in[50],we have to introduce addtional constraints to avoid a degenerated solution.Specifi-cally,the formulation of US-ELM is given bymin β∈R n h×n o ∥β∥2+λTr(βT H T L Hβ)s.t.(Hβ)T Hβ=I no(20)Theorem1:An optimal solution to problem(20)is given by choosingβas the matrix whose columns are the eigenvectors (normalized to satisfy the constraint)corresponding to thefirst n o smallest eigenvalues of the generalized eigenvalue problem:(I nh +λH H T L H)v=γH H T H v.(21)Proof:We can rewrite the problem(20)asminβ∈R n h×n o,ββT Bβ=I no Tr(βT Aβ),(22)Algorithm2The US-ELM algorithmInput:The training data:X∈R N×n i;Output:•For embedding task:The embedding in a n o-dimensional space:E∈R N×n o;•For clustering task:The label vector of cluster index:y∈N N×1+.Step1:Construct the graph Laplacian L from X.Step2:Initiate an ELM network of n h hidden neurons withrandom input weights,and calculate the output 
matrix of thehidden neurons H∈R N×n h.Step3:•If n h≤NFind the generalized eigenvectors v2,v3,...,v no+1of(21)corresponding to the second through the n o+1smallest eigenvalues.Letβ=[ v2, v3,..., v no+1],where v i=v i/∥H v i∥,i=2,...,n o+1.•ElseFind the generalized eigenvectors u2,u3,...,u no+1of(24)corresponding to the second through the n o+1smallest eigenvalues.Letβ=H T[ u2, u3,..., u no+1],where u i=u i/∥H H T u i∥,i=2,...,n o+1.Step4:Calculate the embedding matrix:E=Hβ.Step5(For clustering only):Treat each row of E as a point,and cluster the N points into K clusters using the k-meansalgorithm.Let y be the label vector of cluster index for allthe points.return E(for embedding task)or y(for clustering task);where A=I nh+λH H T L H and B=H T H.It is easy to verify that both A and B are Hermitianmatrices.Thus,according to the Rayleigh-Ritz theorem[53],the above trace minimization problem attains its optimum ifand only if the column span ofβis the minimum span ofthe eigenspace corresponding to the smallest n o eigenvaluesof(21).Therefore,by stacking the normalized eigenvectors of(21)corresponding to the smallest n o generalized eigenvalues,we obtain an optimal solution to(20).In the algorithm of Laplacian eigenmaps,thefirst eigenvec-tor is discarded since it is always a constant vector proportionalto1(corresponding to the smallest eigenvalue0)[50].In theUS-ELM algorithm,thefirst eigenvector of(21)also leadsto small variations in embedding and is not useful for datarepresentation.Therefore,we suggest to discard this trivialsolution as well.Letγ1,γ2,...,γno+1(γ1≤γ2≤...≤γn o+1)be the(n o+1)smallest eigenvalues of(21)and v1,v2,...,v no+1be their corresponding eigenvectors.Then,the solution to theoutput weightsβis given byβ∗=[ v2, v3,..., v no+1],(23)where v i=v i/∥H v i∥,i=2,...,n o+1are the normalizedeigenvectors.If the number of labeled data is fewer than the numberTABLE ID ETAILS OF THE DATA SETS USED FOR SEMI-SUPERVISED LEARNINGData set Class Dimension|L||U||V||T|G50C2505031450136COIL20(B)2102440100040360USPST(B)225650140950498COIL2020102440100040360USPST1025650140950498of hidden neurons,problem(21)is underdetermined.In this case,we have the following alternative formulation by using the same trick as in previous sections:(I u+λL L H H T )u=γH H H T u.(24)Again,let u1,u2,...,u no +1be generalized eigenvectorscorresponding to the(n o+1)smallest eigenvalues of(24), then thefinal solution is given byβ∗=H T[ u2, u3,..., u no +1],(25)where u i=u i/∥H H T u i∥,i=2,...,n o+1are the normal-ized eigenvectors.If our task is clustering,then we can adopt the k-means algorithm to perform clustering in the embedded space.We summarize the proposed US-ELM in Algorithm2. Remark:Comparing the supervised ELM,the semi-supervised ELM and the unsupervised ELM,we can observe that all the algorithms have two similar stages in the training process,that is the random feature learning stage and the out-put weights learning stage.Under this two-stage framework,it is easy tofind the differences and similarities between the three algorithms.Actually,all the algorithms share the same stage of random feature learning,and this is the essence of the ELM theory.This also means that no matter the task is a supervised, semi-supervised or unsupervised learning problem,we can always follow the same step to generate the hidden layer. 
The differences of the three types of ELMs lie in the second stage on how the output weights are computed.In supervised ELM and SS-ELM,the output weights are trained by solving a regularized least squares problem;while the output weights in the US-ELM are obtained by solving a generalized eigenvalue problem.The unified framework for the three types of ELMs might provide new perspectives to further develop the ELM theory.VII.E XPERIMENTAL RESULTSWe evaluated our algorithms on wide range of semi-supervised and unsupervised parisons were made with related state-of-the-art algorithms, e.g.,Transductive SVM(TSVM)[54],LapSVM[47]and LapRLS[47]for semi-supervised learning;and Laplacian Eigenmap(LE)[50], spectral clustering(SC)[51]and deep autoencoder(DA)[55] for unsupervised learning.All algorithms were implemented using Matlab R2012a on a2.60GHz machine with4GB of memory.TABLE IIIT RAINING TIME(IN SECONDS)COMPARISON OF TSVM,L AP RLS,L AP SVM AND SS-ELMData set TSVM LapRLS LapSVM SS-ELMG50C0.3240.0410.0450.035COIL20(B)16.820.5120.4590.516USPST(B)68.440.9210.947 1.029COIL2018.43 5.841 4.9460.814USPST68.147.1217.259 1.373A.Semi-supervised learning results1)Data sets:We tested the SS-ELM onfive popular semi-supervised learning benchmarks,which have been widely usedfor evaluating semi-supervised algorithms[52],[56],[57].•The G50C is a binary classification data set of which each class is generated by a50-dimensional multivariate Gaus-sian distribution.This classification problem is explicitlydesigned so that the true Bayes error is5%.•The Columbia Object Image Library(COIL20)is a multi-class image classification data set which consists1440 gray-scale images of20objects.Each pattern is a32×32 gray scale image of one object taken from a specific view.The COIL20(B)data set is a binary classification taskobtained from COIL20by grouping thefirst10objectsas Class1,and the last10objects as Class2.•The USPST data set is a subset(the testing set)of the well known handwritten digit recognition data set USPS.The USPST(B)data set is a binary classification task obtained from USPST by grouping thefirst5digits as Class1and the last5digits as Class2.2)Experimental setup:We followed the experimental setup in[57]to evaluate the semi-supervised algorithms.Specifi-cally,each of the data sets is split into4folds,one of which was used for testing(denoted by T)and the rest3folds for training.Each of the folds was used as the testing set once(4-fold cross-validation).As in[57],this random fold generation process were repeated3times,resulted in12different splits in total.Every training set was further partitioned into a labeled set L,a validation set V,and an unlabeled set U.When we train a semi-supervised learning algorithm,the labeled data from L and the unlabeled data from U were used.The validation set which consists of labeled data was only used for model selection,i.e.,finding the optimal hyperparameters C0andλin the SS-ELM algorithm.The characteristics of the data sets used in our experiment are summarized in Table I. 
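In the semi-supervised setting described here, a graph Laplacian has to be built from both the labeled and unlabeled inputs before the SS-ELM output weights can be solved via Eq. (17) or (18) (Step 1 of Algorithm 1). The sketch below shows one simple way to construct a k-NN graph Laplacian with binary edge weights; NumPy is assumed, and the neighborhood size is an illustrative choice rather than the hyperparameter setting used in the paper.

```python
# Minimal sketch: build a k-NN graph Laplacian L = D - W from labeled and
# unlabeled inputs, as needed by SS-ELM (Step 1 of Algorithm 1). NumPy is
# assumed; binary edge weights and k are illustrative choices.
import numpy as np

def knn_graph_laplacian(X, k=5):
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X**2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    W = np.zeros((n, n))
    for i in range(n):
        # Indices of the k nearest neighbors of sample i (excluding itself).
        neighbors = np.argsort(dist[i])[1:k + 1]
        W[i, neighbors] = 1.0
    W = np.maximum(W, W.T)          # symmetrize: keep an edge if either point is a k-NN of the other
    D = np.diag(W.sum(axis=1))
    return D - W                    # graph Laplacian L

# Toy usage: labeled and unlabeled patterns share one Laplacian.
rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(20, 8))
X_unlabeled = rng.normal(size=(180, 8))
L = knn_graph_laplacian(np.vstack([X_labeled, X_unlabeled]), k=7)
print(L.shape)   # (200, 200)
```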
The training of SS-ELM consists of two stages:1)generat-ing the random hidden layer;and2)training the output weights using(17)or(18).In thefirst stage,we adopted the Sigmoid function for nonlinear mapping,and the input weights and biases were generated according to the uniform distribution on(-1,1).The number of hidden neurons n h wasfixed to 1000for G50C,and2000for the rest four data sets.In the second stage,wefirst need to build the graph Laplacian L.We followed the methods discussed in[52]and[57]to compute L,and the hyperparameter settings can be found in[47],[52] and[57].The trade off parameters C andλwere selected from。