Support Vector Machines QSAR for the Toxicity of Organic Chemicals to Chlorella vulgaris wi
Support vector machine reference manual
snsv
ascii2bin bin2ascii
The rest of this document will describe these programs. To nd out more about SVMs, see the bibliography. We will not describe how SVMs work here. The rst program we will describe is the paragen program, as it speci es all parameters needed for the SVM.
sv
- the main SVM program - program for generating parameter sets for the SVM - load a saved SVM and classify a new data set
paragen loadsv
rm sv
- special SVM program for image recognition, that implements virtual support vectors BS97]. - program to convert SN format to our format - program to convert our ASCII format to our binary format - program to convert our binary format to our ASCII format
Support Vector Machines and Kernel Methods
Slack variables
4 3.5 3 2.5 2 1.5 1 0.5 0 −0.5 −3
−2
−1
0
1
2
3
If not linearly separable, add slack variable s ≥ 0 y (x · w + c) + s ≥ 1 Then
i si is total amount by which constraints are violated i si as small as possible
So try to make
Perceptron as convex program
The final convex program for the perceptron is: min
i si subject to
(y i x i ) · w + y i c + s i ≥ 1 si ≥ 0 We will try to understand this program using convex duality
10 8
6
4
2
0
−2
−4
−6
−8
−10 −10
−8
−6
−4
−2
0
2
4
6
8
10
Classification problem
100
10
% Middle & Upper Class
. . .
95
8
6
90
4
85
2
80
0
75
−2
70
−4
−6
65
X
Building support vector machines with reduced classifier complexity
Journal of Machine Learning Research7(2006)1493–1515Submitted10/05;Revised3/06;Published7/06 Building Support Vector Machines withReduced Classifier ComplexityS.Sathiya Keerthi SELVARAK@ Yahoo!Research3333Empire Avenue,Building4Burbank,CA91504,USAOlivier Chapelle CHAPELLE@TUEBINGEN.MPG.DE MPI for Biological Cybernetics72076T¨u bingen,GermanyDennis DeCoste DECOSTED@ Yahoo!Research3333Empire Avenue,Building4Burbank,CA91504,USAEditors:Kristin P.Bennett and Emilio Parrado-Hern´a ndezAbstractSupport vector machines(SVMs),though accurate,are not preferred in applications requiring great classification speed,due to the number of support vectors being large.To overcome this problem we devise a primal method with the following properties:(1)it decouples the idea of basis functions from the concept of support vectors;(2)it greedilyfinds a set of kernel basis functions of a specified maximum size(d max)to approximate the SVM primal cost function well;(3)it is efficient and roughly scales as O(nd2max)where n is the number of training examples;and,(4)the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors.Keywords:SVMs,classification,sparse design1.IntroductionSupport Vector Machines(SVMs)are modern learning systems that deliver state-of-the-art perfor-mance in real world pattern recognition and data mining applications such as text categorization, hand-written character recognition,image classification and bioinformatics.Even though they yield very accurate solutions,they are not preferred in online applications where classification has to be done in great speed.This is due to the fact that a large set of basis functions is usually needed to form the SVM classifier,making it complex and expensive.In this paper we devise a method to overcome this problem.Our method incrementallyfinds basis functions to maximize accuracy.The process of adding new basis functions can be stopped when the classifier has reached some limiting level of complexity.In many cases,our method efficiently forms classifiers which have an order of magnitude smaller number of basis functions compared to the full SVM,while achieving nearly the same level of accuracy.SVM solution and post-processing simplification Given a training set{(x i,y i)}n i=1,y i∈{1,−1}, the Support Vector Machine(SVM)algorithm with an L2penalization of the training errors consistsK EERTHI,C HAPELLE AND D E C OSTE of solving the following primal problemmin λ2w 2+12n∑i=1max(0,1−y i w·φ(x i))2.(1)Computations involvingφare handled using the kernel function,k(x i,x j)=φ(x i)·φ(x j).For conve-nience the bias term has not been included,but the analysis presented in this paper can be extended in a straightforward way to include it.The quadratic penalization of the errors makes the primal objective function continuously differentiable.This is a great advantage and becomes necessary for developing a primal algorithm,as we will see below.The standard way to train an SVM is to introduce Lagrange multipliersαi and optimize them by solving a dual problem.The classifier function for a new input x is then given by the sign of ∑iαi y i k(x,x i).Because there is aflat part in the loss function,the vectorαis usually sparse.The x i for whichαi=0are called support vectors(SVs).Let n SV denote the number of SVs for a given problem.A recent theoretical result by Steinwart(Steinwart,2004)shows that n SV grows as a linear function of n.Thus,for large problems,this number can be large and the training and testing complexities might become prohibitive since they are respectively,O(n n SV+n SV3)and O(n SV).Several methods have been proposed for reducing the number of support vectors.Burges and Sch¨o lkopf(1997)apply nonlinear optimization methods to seek sparse representations after building the SVM classifier.Along similar lines,Sch¨o lkopf et al.(1999)use L1regularization onβto obtain sparse approximations.These methods are expensive since they involve the solution of hard non-convex optimization problems.They also become impractical for large problems.Downs et al. (2001)give an exact algorithm to prune the support vector set after the SVM classifier is built. Thies and Weber(2004)give special ideas for the quadratic kernel.Since these methods operate as a post-processing step,an expensive standard SVM training is still required.Direct simplification via basis functions and primal Instead offinding the SVM solution by maximizing the dual problem,one approach is to directly minimize the primal form after invoking the representer theorem to represent w asw=n∑i=1βiφ(x i).(2)If we allowβi=0for all i,substitute(2)in(1)and solve for theβi’s then(assuming uniqueness of solution)we will getβi=y iαi and thus we will precisely retrieve the SVM solution(Chapelle, 2005).But our aim is to obtain approximate solutions that have as few non-zeroβi’s as possible. For many classification problems there exists a small subset of the basis functions1suited to the complexity of the problem being solved,irrespective of the training size growth,that will yield pretty much the same accuracy as the SVM classifier.The evidence for this comes from the empir-ical performance of other sparse kernel classifiers:the Relevance Vector Machine(Tipping,2001), Informative Vector Machine(Lawrence et al.,2003)are probabilistic models in a Bayesian setting; and Kernel Matching Pursuit(Vincent and Bengio,2002)is a discriminative method that is mainly developed for the least squares loss function.These recent non-SVM works have laid the claim that they can match the accuracy of SVMs,while also bringing down considerably,the number of basis functions as well as the training cost.Work on simplifying SVM solution has not caught up well 1.Each k(x,x i)will be referred to as a basis function.B UILDING SVM S WITH R EDUCEDC OMPLEXITYwith those works in related kernelfields.The method outlined in this paper makes a contribution to fill this gap.We deliberately use the variable name,βi in(2)so as to interpret it as a basis weight as opposed to viewing it as y iαi whereαi is the Lagrange multiplier associated with the i-th primal slack con-straint.While the two are(usually)one and the same at exact optimality,they can be very different when we talk of sub-optimal primal solutions.There is a lot of freedom when we simply think of theβi’s as basis weights that yield a good suboptimal w for(1).First,we do not have to put any bounds on theβi.Second,we do not have to think of aβi corresponding to a particular location relative to the margin planes to have a certain value.Going even one more step further,we do not even have to restrict the basis functions to be a subset of the training set examples.Osuna and Girosi(1998)consider such an approach.They achieve sparsity by including the L1 regularizer,λ1 β 1in the primal objective.But they do not develop an algorithm(for solving the modified primal formulation and for choosing the rightλ1)that scales efficiently to large problems.Wu et al.(2005)write w asw=l∑i=1βiφ(˜x i)where l is a chosen small number and optimize the primal objective with theβi as well as the ˜x i as variables.But the optimization can become unwieldy if l is not small,especially since the optimization of the˜x i is a hard non-convex problem.In the RSVM algorithm(Lee and Mangasarian,2001;Lin and Lin,2003)a random subset of the training set is chosen to be the˜x i and then only theβi are optimized.2Because basis functions are chosen randomly,this method requires many more basis functions than needed in order to achieve a level of accuracy close to the full SVM solution;see Section3.A principled alternative to RSVM is to use a greedy approach for the selection of the subset of the training set for forming the representation.Such an approach has been popular in Gaussian processes(Smola and Bartlett,2001;Seeger et al.,2003;Keerthi and Chu,2006).Greedy meth-ods of basis selection also exist in the boosting literature(Friedman,2001;R¨a tsch,2001).These methods entail selection from a continuum of basis functions using either gradient descent or linear programming column generation.Bennett et al.(2002)and Bi et al.(2004)give modified ideas for kernel methods that employ a set of basis functionsfixed at the training points.Particularly relevant to the work in this paper are the kernel matching pursuit(KMP)algo-rithm of Vincent and Bengio(2002)and the growing support vector classifier(GSVC)algorithm of Parrado-Hern´a ndez et al.(2003).KMP is an effective greedy discriminative approach that is mainly developed for least squares problems.GSVC is an efficient method that is developed for SVMs and uses a heuristic criterion for greedy selection of basis functions.Our approach The main aim of this paper is to give an effective greedy method SVMs which uses a basis selection criterion that is directly related to the training cost function and is also very efficient.The basic theme of the method is forward selection.It starts with an empty set of basis functions and greedily chooses new basis functions(from the training set)to improve the primal objective function.We develop efficient schemes for both,the greedy selection of a new basis function,as well as the optimization of theβi for a given selection of basis functions.For choosing upto d max basis functions,the overall compuational cost of our method is O(nd2max).The different 2.For convenience,in the RSVM method,the SVM regularizer is replaced by a simple L2regularizer onβ.K EERTHI,C HAPELLE AND D E C OSTESpSVM-2SVMData Set TestErate#Basis TestErate n SVBanana10.87(1.74)17.3(7.3)10.54(0.68)221.7(66.98)Breast29.22(2.11)12.1(5.6)28.18(3.00)185.8(16.44)Diabetis23.47(1.36)13.8(5.6)23.73(1.24)426.3(26.91)Flare33.90(1.10)8.4(1.2)33.98(1.26)629.4(29.43)German24.90(1.50)14.0(7.3)24.47(1.97)630.4(22.48)Heart15.50(1.10) 4.3(2.6)15.80(2.20)166.6(8.75)Ringnorm 1.97(0.57)12.9(2.0) 1.68(0.24)334.9(108.54)Thyroid 5.47(0.78)10.6(2.3) 4.93(2.18)57.80(39.61)Titanic22.68(1.88) 3.3(0.9)22.35(0.67)150.0(0.0)Twonorm 2.96(0.82)8.7(3.7) 2.42(0.24)330.30(137.02)Waveform10.66(0.99)14.4(3.3)10.04(0.67)246.9(57.80)Table1:Comparison of SpSVM-2and SVM on benchmark data sets from(R¨a tsch).For TestErate, #Basis and n SV,the values are means over ten different training/test splits and the values in parantheses are the standard deviations.components of the method that we develop in this paper are not new in themselves and are inspired from the above mentioned papers.However,from a practical point of view,it is not obvious how to combine and tune them in order to get a very efficient SVM training algorithm.That is what we achieved in this paper through numerous and careful experiments that validated the techniques employed.Table1gives a preview of the performance of our method(called SpSVM-2in the table)in comparison with SVM on several UCI data sets.As can be seen there,our method gives a competing generalization performance while reducing the number of basis functions very significantly.(More specifics concerning Table1will be discussed in Section4.)The paper is organized as follows.We discuss the details of the efficient optimization of the primal objective function in Section2.The key issue of selecting basis functions is taken up in Section3.Sections4-7discuss other important practical issues and give computational results that demonstrate the value of our method.Section8gives some concluding remarks.The appendix gives details of all the data sets used for the experiments in this paper.2.The Basic OptimizationLet J⊂{1,...,n}be a given index set of basis functions that form a subset of the training set.We consider the problem of minimizing the objective function in(1)over the set of vectors w of the form3w=∑βjφ(x j).(3)j∈J3.More generally,one can consider expansion on points which do not belong to the training set.B UILDING SVM S WITH R EDUCEDC OMPLEXITY2.1Newton OptimizationLet K i j =k (x i ,x j )=φ(x i )·φ(x j )denote the generic element of the n ×n kernel matrix K .The notation K IJ refers to the submatrix of K made of the rows indexed by I and the columns indexed by J .Also,for a n -dimensional vector p ,let p J denote the |J |dimensional vector containing {p j :j ∈J }.Let d =|J |.With w restricted to (3),the primal problem (1)becomes the d dimensional mini-mization problem of finding βJ that solvesmin βJf (βJ )=λ2β⊤J K JJ βJ +12n ∑i =1max (0,1−y i o i )2(4)where o i =K i ,J βJ .Except for the regularizer being more general,i.e.,β⊤J K JJ βJ (as opposed to thesimple regularizer, βJ 2),the problem in (4)is very much the same as in a linear SVM design.Thus,the Newton method and its modification that are developed for linear SVMs (Mangasarian,2002;Keerthi and DeCoste,2005)can be used to solve (4)and obtain the solution βJ .Newton Method1.Choose a suitable starting vector,β0J .Set k =0.2.If βk J is the optimal solution of (4),stop.3.Let I ={i :1−y i o i ≥0}where o i =K i ,J βk J is the output of the i -th example.Obtain ¯βJ as the result of a Newton step or equivalently as the solution of the regularized least squares problem,min βJ λ2β⊤J K JJ βJ+12∑i ∈I (1−y i K i ,J βJ )2.(5)4.Take βk +1J to be the minimizer of f on L ,the line joining βk J and ¯βJ .Set k :=k +1and goback to step 2for another iteration.The solution of (5)is given by¯βJ =βk J −P −1g ,where P =λK JJ +K JI K ⊤JIand g =λK JJ βJ −K JI (y I −o I ).(6)P and g are also the (generalized)Hessian and gradient of the objective function (4).Because the loss function is piecewise quadratic,Newton method converges in a finite number of iterations.The number of iterations required to converge to the exact solution of (4)is usually very small (less than 5).Some Matlab code is available online at http://www.kyb.tuebingen.mpg.de/bs/people/chapelle/primal .2.2Updating the HessianAs already pointed out in Section 1,we will mainly need to solve (4)in an incremental mode:4with the solution βJ of (4)already available,solve (4)again,but with one more basis function added,i.e.,J incremented by one.Keerthi and DeCoste (2005)show that the Newton method is very efficient4.In our method basis functions are added one at a time.K EERTHI,C HAPELLE AND D E C OSTEfor such seeding situations.Since the kernel matrix is dense,we maintain and update a Cholesky factorization of P,the Hessian defined in(6).Even with Jfixed,during the course of solving(4) via the Newton method,P will undergo changes due to changes in I.Efficient rank one schemes can be used to do the updating of the Cholesky factorization(Seeger,2004).The updatings of the factorization of P that need to be done because of changes in I are not going to be expensive because such changes mostly occur when J is small;when J is large,I usually undergoes very small changes since the set of training errors is rather well identified by that stage.Of course P and its factorization will also undergo changes(their dimensions increase by one)each time an element is added to J. This is a routine updating operation that is present in most forward selection methods.2.3Computational ComplexityIt is useful to ask:what is the complexity of the incremental computations needed to solve(4) when its solution is available for some J,at which point one more basis element is included in it and we want to re-solve(4)?In the best case,when the support vector set I does not change,the cost is mainly the following:computing the new row and column of K JJ(d+1kernel evaluations); computing the new row of K JI(n kernel computations);5computing the new elements of P(O(nd) cost);and the updating of the factorization of P(O(d2)cost).Thus the cost can be summarized as: (n+d+1)kernel evaluations and O(nd)cost.Even when I does change and so the cost is more, it is reasonable to take the above mentioned cost summary as a good estimate of the cost of the incremental work.Adding up these costs till d max basis functions are selected,we get a complexity of O(nd2max).Note that this is the basic cost given that we already know the sequence of d max basis functions that are to be used.Thus,O(nd2max)is also the complexity of the method in which basis functions are chosen randomly.In the next section we discuss the problem of selecting the basis functions systematically and efficiently.3.Selection of New Basis ElementSuppose we have solved(4)and obtained the minimizerβJ.Obviously,the minimum value of the objective function in(4)(call it f J)is greater than or equal to f⋆,the optimal value of(1).If the difference between them is large we would like to continue on and include another basis function. Take one j∈J.How do we judge its value of inclusion?The best scoring mechanism is the following one.3.1Basis Selection Method1Include j in J,optimize(4)fully using(βJ,βj),andfind the improved value of the objective func-tion;call it˜f j.Choose the j that gives the least value of˜f j.We already analyzed in the earlier section that the cost of doing one basis element inclusion is O(nd).So,if we want to try all elements out-side J,the cost is O(n2d);the overall cost of such a method of selecting d max basis functions is O(n2d2max),which is much higher than the basic cost,O(nd2max)mentioned in the previous section. Instead,if we work only with a random subset of sizeκchosen from outside J,then the cost in one basis selection step comes down to O(κnd),and the overall cost is limited to O(κnd2max).Smola and Bartlett(2001)have successfully tried such random subset choices for Gaussian process regression, usingκ=59.However,note that,even with this scheme,the cost of new basis selection(O(κnd)) 5.In fact this is not n but the size of I.Since we do not know this size,we upper bound it by n.B UILDING SVM S WITH R EDUCEDC OMPLEXITYis still disproportionately higher(byκtimes)than the cost of actually including the newly selected basis function(O(nd)).Thus we would like to go for cheaper methods.3.2Basis Selection Method2This method computes a score for a new element j in O(n)time.The idea has a parallel in Vincent and Bengio’s work on Kernel Matching Pursuit(Vincent and Bengio,2002)for least squares loss functions.They have two methods called prefitting and backfitting;see equations(7),(3)and(6) of Vincent and Bengio(2002).6Their prefitting is parallel to Basis Selection Method1that we described earlier.The cheaper method that we suggest below is parallel to their backfitting idea. SupposeβJ is the solution of(4).Including a new element j and its corresponding variable,βj yields the problem of minimizingλ2(β⊤Jβj) K JJ K J jK jJ K j j βJβj+12n∑i=1max(0,1−y i(K iJβJ+K i jβj)2,(7)WefixβJ and optimize(7)using only the new variableβj and see how much improvement in the objective function is possible in order to define the score for the new element j.This one dimensional function is piecewise quadratic and can be minimized exactly in O(n log n) time by a dichotomy search on the different breakpoints.But,a very precise calculation of the scoring function is usually unnecessary.So,for practical solution we can simply do a few Newton-Raphson-type iterations on the derivative of the function and get a near optimal solution in O(n) time.Note that we also need to compute the vector K J j,which requires d kernel evaluations.Though this cost is subsumed in O(n),it is a factor to remember if kernel evaluations are expensive.If all j∈J are tried,then the complexity of selecting a new basis function is O(n2),which is disproportionately large compared to the cost of including the chosen basis function,which is O(nd).Like in Basis Selection Method1,we can simply chooseκrandom basis functions to try. If d max is specified,one can chooseκ=O(d max)without increasing the overall complexity beyond O(nd2max).More complex schemes incorporating a kernel cache can also be tried.3.3Kernel CachingFor upto medium size problems,say n<15,000,it is a good idea to have cache for the entire kernel matrix.If additional memory space is available and,say a Gaussian kernel is employed,then the values of x i−x j 2can also be cached;this will help significantly reduce the time associated with the tuning of hyperparameters.For larger problems,depending on memory space available,it is a good idea to cache as many as possible,full kernel rows corresponding to j that get tried,but do not get chosen for inclusion.It is possible that they get called in a later stage of the algorithm,at which time,this cache can be useful.It is also possible to think of variations of the method in which full kernel rows corresponding to a large set(as much that canfit into memory)of randomly chosen training basis is pre-computed and only these basis functions are considered for selection.3.4ShrinkingAs basis functions get added,the SVM solution w and the margin planes start stabilizing.If the number of support vectors form a small fraction of the training set,then,for a large fraction of 6.For least squares problems,Adler et al.(1996)had given the same ideas as Vincent and Bengio in earlier work.K EERTHI,C HAPELLE AND D E C OSTE(well-classified)training examples,we can easily conclude that they will probably never come into the active set I.Such training examples can be left out of the calculations without causing any undue harm.This idea of shrinking has been effectively used to speed-up SVM training(Joachims,1999; Platt,1998).3.5Experimental EvaluationWe now evaluate the performance of basis selection methods1and2(we will call them as SpSVM-1, SpSVM-2)on some sizable benchmark data sets.A full description of these data sets and the kernel functions used is given in the appendix.The value ofκ=59is used.To have a baseline,we also consider the method,Random in which the basis functions are chosen randomly.This is almost the same as the RSVM method(Lee and Mangasarian,2001;Lin and Lin,2003),the only difference being the regularizer(β⊤J K J,JβJ in(4)versus βJ 2in RSVM).For another baseline we consider the(more systematic)unsupervised learning method in which an incomplete Cholesky factorization with pivoting(Meijerink and van der V orst,1977;Bach and Jordan,2005)is used to choose basis functions.7For comparison we also include the GSVC method of Parrado-Hern´a ndez et al.(2003). This method,originally given for SVM hinge loss,uses the following heuristic criterion to select the next basis function j∗∈J:j∗=arg minj∈I,j∈J maxl∈J|K jl|(8)with the aim of encouraging new basis functions that are far from the basis functions that are already chosen;also,j is restricted only to the support vector indices(I in(5)).For a clean comparison with our methods,we implemented GSVC for SVMs using quadratic penalization,max(0,1−y i o i)2.We also tried another criterion,suggested to us by Alex Smola,that is more complex than(8):j∗=arg maxj∈I,j∈J(1−y j o j)2d2j(9)where d j is the distance(in feature space)of the j-th training point from the subspace spanned by the elements of J.This criterion is based on an upper bound on the improvement to the training cost function obtained by including the j-th basis function.It also makes sense intuitively as it selects basis functions that are both not well approximated by the others(large d j)and for which the error incurred is large.8Below,we will refer to this criterion as BH.It is worth noting that both(8)and (9)can be computed very efficiently.Figures1and2compare the six methods on six data sets.9Overall,SpSVM-1and SpSVM-2 give the best performance in terms of achieving good reduction of test error rate with respect to the number of basis functions.Although SpSVM-2slightly lags SpSVM-1in terms of performance in the early stages,it does equally well as more basis functions are added.Since SpSVM-2is significantly less expensive,it is the best method to use.Since SpSVM-1is quite cheap in the early stages,it is also appropriate to think of a hybrid method in which SpSVM-1is used in the early stages and,when it becomes expensive,switch to SpSVM-2.The other methods sometimes do well,but,overall,they are inferior in comparison to SpSVM-1and SpSVM-2.Interestingly,on the IJCNN and Vehicle data7.We also tried the method of Bach and Jordan(2005)which uses the training labels,but we noticed little improvement.8.Note that when the set of basis functions is not restricted,the optimalβsatisfiesλβi y i=max(0,1−y i o i).9.Mostfigures given in this paper appear in pairs of two plots.One plot gives test error rate as a function of the numberof basis functions,to see how effective the compression is.The other plot gives the test error rate as a function of CPU time,and is used to indicate the efficiency of the method.B UILDING SVM S WITH R EDUCEDC OMPLEXITYFigure1:Comparison of basis selection methods on Adult,IJCNN&Shuttle.On Shuttle some methods were terminated because of ill-conditioning in the matrix P in(6).K EERTHI,C HAPELLE AND D E C OSTEFigure2:Comparison of basis selection methods on M3V8,M3VOthers&Vehicle.sets,Cholesky,GSVC and BH are even inferior to Random.A possible explanation is as follows: these methods give preference to points that are furthest away in feature space from the points already selected.Thus,they are likely to select points which are outliers(far from the rest of the training points);but outliers are probably unsuitable points for expanding the decision function.As we mentioned in Section1,there also exist other greedy methods of kernel basis selection that are motivated by ideas from boosting.These methods are usually given in a setting different from that we consider:a set of(kernel)basis functions is given and a regularizer(such as β 1)is directly specified on the multiplier vectorβ.The method of Bennett et al.(2002)called MARK is given for least squares problems.It is close to the kernel matching pursuit method.We compare SpSVM-2with kernel matching pursuit and discuss MARK in Section5.The method of Bi et al. (2004)uses column generation ideas from linear and quadratic programming to select new basis functions and so it requires the solution of,both,the primal and dual problems.10Thus,the basis selection process is based on the sensitivity of the primal objective function to an incoming basis function.On the other hand,our SpSVM methods are based on computing an estimate of the de-crease in the primal objective function due to an incoming basis function;also,the dual solution is not needed.4.Hyperparameter TuningIn the actual design process,the values of hyperparameters need to be determined.This can be done using k-fold cross validation.Cross validation(CV)can also be used to choose d,the number of basis functions.Since the solution given by our method approaches the SVM solution as d becomes large,there is really no need to choose d at all.One can simply choose d to be as big a value as possible.But,to achieve good reduction in the classifier complexity(as well as computing time) it is a good idea to track the validation performance as a function of d and stop when this function becomes nearlyflat.We proceed as follows.First an appropriate value for d max is chosen.For a given choice of hyperparameters,the basis selection method(say,SpSVM-2)is then applied on each training set formed from the k-fold partitions till d max basis functions are chosen.This gives an estimate of the k-fold CV error for each value of d from1to d max.We choose d to be the number of basis functions that gives the lowest k-fold CV error.This computation can be repeated for each set of hyperparameter values and the best choice can be decided.Recall that,at stage d,our basis selection methods choose the(d+1)-th basis function from a set ofκrandom basis functions.To avoid the effects of this randomness on hyperparameter tuning, it is better to make thisκ-set to be dependent only on d.Thus,at stage d,the basis selection methods will choose the same set ofκrandom basis functions for all hyperparameter values.We applied the above ideas on11benchmark data sets from(R¨a tsch)using SpSVM-2as the basis selection method.The Gaussian kernel,k(x i,x j)=1+exp(−γ x i−x j 2)was used.The hyperparameters,λandγwere tuned using3-fold cross validation.The values,2i,i=−7,···,7 were used for each of these parameters.Ten different train-test partitions were tried to get an idea of the variability in generalization performance.We usedκ=25and d max=25.(The Titanic data set has three input variables,which are all binary;hence we set d max=8for this data set.) Table1(already introduced in Section1)gives the results.For comparison we also give the results for the SVM(solution of(1));in the case of SVM,the number of support vectors(n SV)is the 10.The CPLEX LP/QP solver is used to obtain these solutions.。
support-vector-machine
1
Figure 2: The two partial cost terms belonging to the cost function J (θ) for logistic regression: in the left, the positive case for y = 1 is − log 1+1 ; in the right, the negative case for y = 0, is e−z − log 1 −
m
y (i) log hθ x(i) + 1 − y (i) log 1 − hθ x(i)
i=1
+
λ 2m
n 2 θj j =1
(2)
you find that each example, (x, y ), contributes the term (forgetting averaging with the ห้องสมุดไป่ตู้/m weight) − (y log(hθ (x)) + (1 − y ) log(1 − hθ (x))) to the overall cost function, J (θ). If I take the definition of my hypothesis (1), and plug it in the above cost term, what I get is that each training example contributes with the quantity −y log 1 1 + e−θT x − (1 − y ) log 1 − 1 1 + e−θT x (3)
in the objective
Recall that z = θT x. If we plot − log
Support vector machines for multiple-instance learning
Stuart Andrews,Ioannis Tsochantaridis and Thomas HofmannDepartment of Computer Science,Brown University,Providence,RI02912{stu,it,th}@AbstractThis paper presents two new formulations of multiple-instancelearning as a maximum margin problem.The proposed extensionsof the Support Vector Machine(SVM)learning approach lead tomixed integer quadratic programs that can be solved heuristically.Our generalization of SVMs makes a state-of-the-art classificationtechnique,including non-linear classification via kernels,availableto an area that up to now has been largely dominated by specialpurpose methods.We present experimental results on a pharma-ceutical data set and on applications in automated image indexingand document categorization.1IntroductionMultiple-instance learning(MIL)[4]is a generalization of supervised classification in which training class labels are associated with sets of patterns,or bags,instead of individual patterns.While every pattern may possess an associated true label,it is assumed that pattern labels are only indirectly accessible through labels attached to bags.The law of inheritance is such that a set receives a particular label,if at least one of the patterns in the set possesses the label.In the important case of binary classification,this implies that a bag is“positive”if at least one of its member patterns is a positive differs from the general set-learning problem in that the set-level classifier is by design induced by a pattern-level classifier.Hence the key challenge in MIL is to cope with the ambiguity of not knowing which of the patterns in a positive bag are the actual positive examples and which ones are not. The MIL setting has numerous interesting applications.One prominent applica-tion is the classification of molecules in the context of drug design[4].Here, each molecule is represented by a bag of possible conformations.The efficacy of a molecule can be tested experimentally,but there is no way to control for indi-vidual conformations.A second application is in image indexing for content-based image retrieval.Here,an image can be viewed as a bag of local image patches[9] or image regions.Since annotating whole images is far less time consuming then marking relevant image regions,the ability to deal with this type of weakly anno-tated data is very desirable.Finally,consider the problem of text categorization for which we are thefirst to apply the MIL ually,documents which contain a relevant passage are considered to be relevant with respect to a particular cate-gory or topic,yet class labels are rarely available on the passage level and are most commonly associated with the document as a whole.Formally,all of the above applications share the same type of label ambiguity which in our opinion makes a strong argument in favor of the relevance of the MIL setting.We present two approaches to modify and extend Support Vector Machines(SVMs) to deal with MIL problems.Thefirst approach explicitly treats the pattern labels as unobserved integer variables,subjected to constraints defined by the(positive) bag labels.The goal then is to maximize the usual pattern margin,or soft-margin, jointly over hidden label variables and a linear(or kernelized)discriminant func-tion.The second approach generalizes the notion of a margin to bags and aims at maximizing the bag margin directly.The latter seems most appropriate in cases where we mainly care about classifying new test bags,while thefirst approach seems preferable whenever the goal is to derive an accurate pattern-level classifier. In the case of singleton bags,both methods are identical and reduce to the standard soft-margin SVM formulation.Algorithms for the MIL problem werefirst presented in[4,1,7].These methods(and related analytical results)are based on hypothesis classes consisting of axis-aligned rectangles.Similarly,methods developed subsequently(e.g.,[8,12])have focused on specially tailored machine learning algorithms that do not compare favorably in the limiting case of the standard classification setting.A notable exception is[10]. More recently,a kernel-based approach has been suggested which derives MI-kernels on bags from a given kernel defined on the pattern-level[5].While the MI-kernel approach treats the MIL problem merely as a representational problem,we strongly believe that a deeper conceptual modification of SVMs as outlined in this paper is necessary.However,we share the ultimate goal with[5],which is to make state-of-the-art kernel-based classification methods available for multiple-instance learning. 2Multiple-Instance LearningIn statistical pattern recognition,it is usually assumed that a training set of la-beled patterns is available where each pair(x i,y i)∈ d×Y has been generated independently from an unknown distribution.The goal is to induce a classifier,i.e., a function from patterns to labels f: d→Y.In this paper,we will focus on the binary case of Y={−1,1}.Multiple-instance learning(MIL)generalizes this problem by making significantly weaker assumptions about the labeling informa-tion.Patterns are grouped into bags and a label is attached to each bag and not to every pattern.More formally,given is a set of input patterns x1,...,x n grouped into bags B1,...,B m,with B I={x i:i∈I}for given index sets I⊆{1,...,n}(typ-ically non-overlapping).With each bag B I is associated a label Y I.These labels are interpreted in the following way:if Y I=−1,then y i=−1for all i∈I,i.e.,no pattern in the bag is a positive example.If on the other hand Y I=1,then at least one pattern x i∈B I is a positive example of the underlying concept.Notice that the information provided by the label is asymmetric in the sense that a negative bag label induces a unique label for every pattern in a bag,while a positive label does not.In general,the relation between pattern labels y i and bag labels Y I can be expressed compactly as Y I=max i∈I y i or alternatively as a set of linear constraintsy i+1i∈I14Maximum Bag Margin Formulation of MILAn alternative way of applying maximum margin ideas to the MIL setting is to extend the notion of a margin from individual patterns to sets of patterns.It is natural to define the functional margin of a bag with respect to a hyperplane byγI≡Y I maxi∈I( w,x i +b).(3)This generalization reflects the fact that predictions for bag labels take the form ˆYI=sgn max i∈I( w,x i +b).Notice that for a positive bag the margin is defined by the margin of the“most positive”pattern,while the margin of a negative bag is de-fined by the“least negative”pattern.The difference between the two formulations of maximum-margin problems is illustrated in Figure1.For the pattern-centered mi-SVM formulation,the margin of every pattern in a positive bag matters,although one has the freedom to set their label variables so as to maximize the margin.In the bag-centered formulation,only one pattern per positive bag matters,since it will determine the margin of the bag.Once these“witness”patterns have been identified,the relative position of other patterns in positive bags with respect to the classification boundary becomes ing the above notion of a bag margin,we define an MIL version of the soft-margin classifier byMI-SVM minw,b,ξ12 w 2+CIξI(5)s.t.∀I:Y I=−1∧− w,x i −b≥1−ξI,∀i∈I,or Y I=1∧ w,x s(I) +b≥1−ξI,andξI≥0.(6) In this formulation,every positive bag B I is thus effectively represented by a single member pattern x I≡x s(I).Notice that“non-witness”patterns(x i,i∈I with i=s(I))have no impact on the objective.For given selector variables,it is straightforward to derive the dual objective function which is very similar to the standard SVM Wolfe dual.The only major difference is that the box constraints for the Lagrange parametersαare modified compared to the standard SVM solution,namely one gets0≤αI≤C,for I s.t.Y I=1and0≤i∈Iαi≤C,for I s.t.Y I=−1.(7) Hence,the influence of each bag is bounded by C.5Optimization HeuristicsAs we have shown,both formulations,mi-SVM and MI-SVM,can be cast as mixed-integer programs.In deriving optimization heuristics,we exploit the fact that forinitialize y i =Y I for i ∈IREPEATcompute SVM solution w ,b for data set with imputed labelscompute outputs f i = w ,x i +b for all x i in positive bagsset y i =sgn (f i )for every i ∈I ,Y I =1FOR (every positive bag B I )IF ( i ∈I (1+y i )/2==0)compute i ∗=arg max i ∈I f i set y i ∗=1ENDENDWHILE (imputed labels have changed)OUTPUT (w ,b )Figure 2:Pseudo-code for mi-SVM optimization heuristics (synchronous update).initialize x I = i ∈I x i /|I |for every positive bag B I REPEATcompute QP solution w ,b for data set withpositive examples {x I :Y I =1}compute outputs f i = w ,x i +b for all x i in positive bagsset x I =x s (I ),s (I )=arg max i ∈I f i for every I ,Y I =1WHILE (selector variables s (I )have changed)OUTPUT (w ,b )Figure 3:Pseudo-code for MI-SVM optimization heuristics (synchronous update).given integer variables,i.e.the hidden labels in mi-SVM and the selector variables in MI-SVM,the problem reduces to a QP that can be solved exactly.Of course,all the derivations also hold for general kernel functions K .A general scheme for a simple optimization heuristic may be described as follows.Alternate the following two steps:(i)for given integer variables,solve the associated QP and find the optimal discriminant function,(ii)for a given discriminant,update one,several,or all integer variables in a way that (locally)minimizes the objective.The latter step may involve the update of a label variable y i of a single pattern in mi-SVM,the update of a single selector variable s (I )in MI-SVM,or the simultaneous update of all integer variables.Since the integer variables are essentially decoupled given the discriminant (with the exception of the bag constraints in mi-SVM),this can be done very efficiently.Also notice that we can re-initialize the QP-solver at every iteration with the previously found solution,which will usually result in a significant speed-up.In terms of initialization of the optimization procedure,we suggest to impute positive labels for patterns in positive bags as the initial configuration in mi-SVM.In MI-SVM,x I is initialized as the centroid of the bag patterns.Figure 2and 3summarize pseudo-code descriptions for the algorithms utilized in the experiments.There are many possibilities to refine the above heuristic strategy,for example,by starting from different initial conditions,by using branch and bound techniques to explore larger parts of the discrete part of the search space,by performing stochas-tic updates (simulated annealing)or by maintaining probabilities on the integer variables in the spirit of deterministic annealing.However,we have been able to achieve competitive results even with the simpler optimization heuristics,which val-EMDD[12]MI-NN[10]mi-SVMMUSK184.888.987.484.089.284.3Table1:Accuracy results for various methods on the MUSK data sets. idate the maximum margin formulation of SVM.We will address further algorithmic improvements in future work.6Experimental ResultsWe have performed experiments on various data sets to evaluate the proposed tech-niques and compare them to other methods for MIL.As a reference method wehave implemented the EM Diverse Density(EM-DD)method[12],for which very competitive results have been reported on the MUSK benchmark1.6.1MUSK Data SetThe MUSK data sets are the benchmark data sets used in virtually all previous approaches and have been described in detail in the landmark paper[4].Bothdata sets,MUSK1and MUSK2,consist of descriptions of molecules using multiplelow-energy conformations.Each conformation is represented by a166-dimensionalfeature vector derived from surface properties.MUSK1contains on average ap-proximately6conformation per molecule,while MUSK2has on average more than60conformations in each bag.The averaged results of ten10-fold cross-validationruns are summarized in Table1.The SVM results are based on an RBF kernelK(x,y)=exp(−γ x−y 2)with coarsely optimizedγ.For both MUSK1andMUSK2data sets,mi-SVM achieves competitive accuracy values.While MI-SVM outperforms mi-SVM on MUSK2,it is significantly worse on MUSK1.Althoughboth methods fail to achieve the performance of the best method(iterative APR)2,they compare favorably with other approaches to MIL.6.2Automatic Image AnnotationWe have generated new MIL data sets for an image annotation task.The originaldata are color images from the Corel data set that have been preprocessed and segmented with the Blobworld system[2].In this representation,an image consistsof a set of segments(or blobs),each characterized by color,texture and shape descriptors.We have utilized three different categories(“elephant”,“fox”,“tiger”)in our experiments.In each case,the data sets have100positive and100negative example images.The latter have been randomly drawn from a pool of photos ofother animals.Due to the limited accuracy of the image segmentation,the relativesmall number of region descriptors and the small training set size,this ends up beingquite a hard classification problem.We are currently investigating alternative imageData Set MI-SVMCategory poly linear rbf 1391/23078.382.280.079.0 Fox55.257.858.8 1220/23072.178.478.981.6Dims EM-DD mi-SVMinst/feat linear rbf polyTST192.593.993.7 3344/684284.078.274.384.4TST383.382.277.4 3391/662680.582.869.682.9TST778.778.064.5 3300/698265.567.555.263.7TST1078.379.569.1 Table3:Classification accuracy of different methods on the TREC9document categorization sets.representations in the context of applying MIL to content-based image retrieval and automated image indexing,for which we hope to achieve better(absolute) classification accuracies.However,these data sets seem legitimate for a comparative performance analysis.The results are summarized in Table2.They show that both, mi-SVM and MI-SVM achieve a similar accuracy and outperform EM-DD by a few percent.While MI-SVM performed marginally better than mi-SVM,both heuristic methods were susceptible to other nearby local minima.Evidence of this effect was observed through experimentation with asynchronus updates,as described in Section5,where we varied the number of integer variables updated at each iteration.6.3Text CategorizationFinally,we have generated MIL data sets for text categorization.Starting from the publicly available TREC9data set,also known as OHSUMED,we have split documents into passages using overlapping windows of maximal50words each. The original data set consists of several years of selected MEDLINE articles.We have worked with the1987data set used as training data in the TREC9filtering task which consists of approximately54,000documents.MEDLINE documents are annotated with MeSH terms(Medical Subject Headings),each defining a binary concept.The total number of MeSH terms in TREC9was4903.While we are currently performing a larger scale evaluation of MIL techniques on the full data set,we report preliminary results here on a smaller,randomly subsampled data set.We have been using thefirst seven categories of the pre-test portion with at least100positive pared to the other data sets the representation is extremely sparse and high-dimensional,which makes this data an interesting addi-tional benchmark.Again,using linear and polynomial kernel functions,which are generally known to work well for text categorization,both methods show improved performance over EM-DD in almost all cases.No significant difference between the two methods is clearly evident for the text classification task.7Conclusion and Future WorkWe have presented a novel approach to multiple-instance learning based on two alternative generalizations of the maximum margin idea used in SVM classification. Although these formulations lead to hard mixed integer problems,even simple lo-cal optimization heuristics already yield quite competitive results compared to the baseline approach.We conjecture that better optimization techniques,that can for example avoid unfavorable local minima,may further improve the classification accuracy.Ongoing work will also extend the experimental evaluation to include larger scale problems.As far as the MIL research problem is concerned,we have considered a wider range of data sets and applications than is usually done and have been able to obtain very good results across a variety of data sets.We strongly suspect that many MIL methods have been optimized to perform well on the MUSK benchmark and we plan to make the data sets used in the experiments available to the public to encourage further empirical comparisons.AcknowledgmentsThis work was sponsored by an NSF-ITR grant,award number IIS-0085836. References[1]P.Auer.On learning from multi-instance examples:Empirical evaluation of a the-oretical approach.In Proc.14th International Conf.on Machine Learning,pages 21–29.Morgan Kaufmann,San Francisco,CA,1997.[2] C.Carson,M.Thomas,S.Belongie,J.M.Hellerstein,and J.Malik.Blobworld:Asystem for region-based image indexing and retrieval.In Proceedings Third Interna-tional Conference on Visual Information Systems.Springer,1999.[3] A.Demirez and K.Bennett.Optimization approaches to semisupervised learning.In M.Ferris,O.Mangasarian,and J.Pang,editors,Applications and Algorithms of Complementarity.Kluwer Academic Publishers,Boston,2000.[4]T.G.Dietterich,throp,and T.Lozano-Perez.Solving the multiple instanceproblem with axis-parallel rectangles.Artificial Intelligence,89(1-2):31–71,1997. [5]T.G¨a rtner,P.A.Flach,A.Kowalczyk,and A.J.Smola.Multi-instance kernels.InProc.19th International Conf.on Machine Learning.Morgan Kaufmann,San Fran-cisco,CA,2002.[6]T.Joachims.Transductive inference for text classification using support vector ma-chines.In Proceedings16th International Conference on Machine Learning,pages 200–209.Morgan Kaufmann,San Francisco,CA,1999.[7]P.M.Long and L.Tan.PAC learning axis aligned rectangles with respect to productdistributions from multiple-instance examples.In p.Learning Theory,1996.[8]O.Maron and T.Lozano-P´e rez.A framework for multiple-instance learning.InAdvances in Neural Information Processing Systems,volume10.MIT Press,1998. [9]O.Maron and A.L.Ratan.Multiple-instance learning for natural scene classifica-tion.In Proc.15th International Conf.on Machine Learning,pages341–349.Morgan Kaufmann,San Francisco,CA,1998.[10]J.Ramon and L.De Raedt.Multi instance neural networks.In Proceedings of ICML-2000,Workshop on Attribute-Value and Relational Learning,2000.[11] B.Sch¨o lkopf and A.Smola.Learning with Kernels.Support Vector Machines,Regu-larization,Optimization and Beyond.MIT Press,2002.[12]Qi Zhang and Sally A.Goldman.EM-DD:An improved multiple-instance learningtechnique.In Advances in Neural Information Processing Systems,volume14.MIT Press,2002.。
SVMPPT课件
以简单的理解为问题的复杂程度,VC维越高, 一个问题就越复杂。正是因为SVM关注的是VC 维,后面我们可以看到,SVM解决问题的时候, 和样本的维数是无关的(甚至样本是上万维的 都可以,这使得SVM很适合用来解决像文本分 类这样的问题,当然,有这样的能力也因为引 入了核函数)。
11
SVM简介
置信风险:与两个量有关,一是样本数
量,显然给定的样本数量越大,我们的 学习结果越有可能正确,此时置信风险 越小;二是分类函数的VC维,显然VC维 越大,推广能力越差,置信风险会变大。
12
SVM简介
泛化误差界的公式为:
R(w)≤Remp(w)+Ф(n/h) 公式中R(w)就是真实风险,Remp(w)表示 经验风险,Ф(n/h)表示置信风险。此时 目标就从经验风险最小化变为了寻求经 验风险与置信风险的和最小,即结构风 险最小。
4
SVM简介
支持向量机方法是建立在统计学习理论 的VC 维理论和结构风险最小原理基础上 的,根据有限的样本信息在模型的复杂 性(即对特定训练样本的学习精度, Accuracy)和学习能力(即无错误地识 别任意样本的能力)之间寻求最佳折衷, 以期获得最好的推广能力(或称泛化能 力)。
5
SVM简介
10
SVM简介
泛化误差界:为了解决刚才的问题,统计学
提出了泛化误差界的概念。就是指真实风险应 该由两部分内容刻画,一是经验风险,代表了 分类器在给定样本上的误差;二是置信风险, 代表了我们在多大程度上可以信任分类器在未 知样本上分类的结果。很显然,第二部分是没 有办法精确计算的,因此只能给出一个估计的 区间,也使得整个误差只能计算上界,而无法 计算准确的值(所以叫做泛化误差界,而不叫 泛化误差)。
Support vector machine_A tool for mapping mineral prospectivity
Support vector machine:A tool for mapping mineral prospectivityRenguang Zuo a,n,Emmanuel John M.Carranza ba State Key Laboratory of Geological Processes and Mineral Resources,China University of Geosciences,Wuhan430074;Beijing100083,Chinab Department of Earth Systems Analysis,Faculty of Geo-Information Science and Earth Observation(ITC),University of Twente,Enschede,The Netherlandsa r t i c l e i n f oArticle history:Received17May2010Received in revised form3September2010Accepted25September2010Keywords:Supervised learning algorithmsKernel functionsWeights-of-evidenceTurbidite-hosted AuMeguma Terraina b s t r a c tIn this contribution,we describe an application of support vector machine(SVM),a supervised learningalgorithm,to mineral prospectivity mapping.The free R package e1071is used to construct a SVM withsigmoid kernel function to map prospectivity for Au deposits in western Meguma Terrain of Nova Scotia(Canada).The SVM classification accuracies of‘deposit’are100%,and the SVM classification accuracies ofthe‘non-deposit’are greater than85%.The SVM classifications of mineral prospectivity have5–9%lowertotal errors,13–14%higher false-positive errors and25–30%lower false-negative errors compared tothose of the WofE prediction.The prospective target areas predicted by both SVM and WofE reflect,nonetheless,controls of Au deposit occurrence in the study area by NE–SW trending anticlines andcontact zones between Goldenville and Halifax Formations.The results of the study indicate theusefulness of SVM as a tool for predictive mapping of mineral prospectivity.&2010Elsevier Ltd.All rights reserved.1.IntroductionMapping of mineral prospectivity is crucial in mineral resourcesexploration and mining.It involves integration of information fromdiverse geoscience datasets including geological data(e.g.,geologicalmap),geochemical data(e.g.,stream sediment geochemical data),geophysical data(e.g.,magnetic data)and remote sensing data(e.g.,multispectral satellite data).These sorts of data can be visualized,processed and analyzed with the support of computer and GIStechniques.Geocomputational techniques for mapping mineral pro-spectivity include weights of evidence(WofE)(Bonham-Carter et al.,1989),fuzzy WofE(Cheng and Agterberg,1999),logistic regression(Agterberg and Bonham-Carter,1999),fuzzy logic(FL)(Ping et al.,1991),evidential belief functions(EBF)(An et al.,1992;Carranza andHale,2003;Carranza et al.,2005),neural networks(NN)(Singer andKouda,1996;Porwal et al.,2003,2004),a‘wildcat’method(Carranza,2008,2010;Carranza and Hale,2002)and a hybrid method(e.g.,Porwalet al.,2006;Zuo et al.,2009).These techniques have been developed toquantify indices of occurrence of mineral deposit occurrence byintegrating multiple evidence layers.Some geocomputational techni-ques can be performed using popular software packages,such asArcWofE(a free ArcView extension)(Kemp et al.,1999),ArcSDM9.3(afree ArcGIS9.3extension)(Sawatzky et al.,2009),MI-SDM2.50(aMapInfo extension)(Avantra Geosystems,2006),GeoDAS(developedbased on MapObjects,which is an Environmental Research InstituteDevelopment Kit)(Cheng,2000).Other geocomputational techniques(e.g.,FL and NN)can be performed by using R and Matlab.Geocomputational techniques for mineral prospectivity map-ping can be categorized generally into two types–knowledge-driven and data-driven–according to the type of inferencemechanism considered(Bonham-Carter1994;Pan and Harris2000;Carranza2008).Knowledge-driven techniques,such as thosethat apply FL and EBF,are based on expert knowledge andexperience about spatial associations between mineral prospec-tivity criteria and mineral deposits of the type sought.On the otherhand,data-driven techniques,such as WofE and NN,are based onthe quantification of spatial associations between mineral pro-spectivity criteria and known occurrences of mineral deposits ofthe type sought.Additional,the mixing of knowledge-driven anddata-driven methods also is used for mapping of mineral prospec-tivity(e.g.,Porwal et al.,2006;Zuo et al.,2009).Every geocomputa-tional technique has advantages and disadvantages,and one or theother may be more appropriate for a given geologic environmentand exploration scenario(Harris et al.,2001).For example,one ofthe advantages of WofE is its simplicity,and straightforwardinterpretation of the weights(Pan and Harris,2000),but thismodel ignores the effects of possible correlations amongst inputpredictor patterns,which generally leads to biased prospectivitymaps by assuming conditional independence(Porwal et al.,2010).Comparisons between WofE and NN,NN and LR,WofE,NN and LRfor mineral prospectivity mapping can be found in Singer andKouda(1999),Harris and Pan(1999)and Harris et al.(2003),respectively.Mapping of mineral prospectivity is a classification process,because its product(i.e.,index of mineral deposit occurrence)forevery location is classified as either prospective or non-prospectiveaccording to certain combinations of weighted mineral prospec-tivity criteria.There are two types of classification techniques.Contents lists available at ScienceDirectjournal homepage:/locate/cageoComputers&Geosciences0098-3004/$-see front matter&2010Elsevier Ltd.All rights reserved.doi:10.1016/j.cageo.2010.09.014n Corresponding author.E-mail addresses:zrguang@,zrguang1981@(R.Zuo).Computers&Geosciences](]]]])]]]–]]]One type is known as supervised classification,which classifies mineral prospectivity of every location based on a training set of locations of known deposits and non-deposits and a set of evidential data layers.The other type is known as unsupervised classification, which classifies mineral prospectivity of every location based solely on feature statistics of individual evidential data layers.A support vector machine(SVM)is a model of algorithms for supervised classification(Vapnik,1995).Certain types of SVMs have been developed and applied successfully to text categorization, handwriting recognition,gene-function prediction,remote sensing classification and other studies(e.g.,Joachims1998;Huang et al.,2002;Cristianini and Scholkopf,2002;Guo et al.,2005; Kavzoglu and Colkesen,2009).An SVM performs classification by constructing an n-dimensional hyperplane in feature space that optimally separates evidential data of a predictor variable into two categories.In the parlance of SVM literature,a predictor variable is called an attribute whereas a transformed attribute that is used to define the hyperplane is called a feature.The task of choosing the most suitable representation of the target variable(e.g.,mineral prospectivity)is known as feature selection.A set of features that describes one case(i.e.,a row of predictor values)is called a feature vector.The feature vectors near the hyperplane are the support feature vectors.The goal of SVM modeling is tofind the optimal hyperplane that separates clusters of feature vectors in such a way that feature vectors representing one category of the target variable (e.g.,prospective)are on one side of the plane and feature vectors representing the other category of the target variable(e.g.,non-prospective)are on the other size of the plane.A good separation is achieved by the hyperplane that has the largest distance to the neighboring data points of both categories,since in general the larger the margin the better the generalization error of the classifier.In this paper,SVM is demonstrated as an alternative tool for integrating multiple evidential variables to map mineral prospectivity.2.Support vector machine algorithmsSupport vector machines are supervised learning algorithms, which are considered as heuristic algorithms,based on statistical learning theory(Vapnik,1995).The classical task of a SVM is binary (two-class)classification.Suppose we have a training set composed of l feature vectors x i A R n,where i(¼1,2,y,n)is the number of feature vectors in training samples.The class in which each sample is identified to belong is labeled y i,which is equal to1for one class or is equal toÀ1for the other class(i.e.y i A{À1,1})(Huang et al., 2002).If the two classes are linearly separable,then there exists a family of linear separators,also called separating hyperplanes, which satisfy the following set of equations(KavzogluandFig.1.Support vectors and optimum hyperplane for the binary case of linearly separable data sets.Table1Experimental data.yer A Layer B Layer C Layer D Target yer A Layer B Layer C Layer D Target1111112100000 2111112200000 3111112300000 4111112401000 5111112510000 6111112600000 7111112711100 8111112800000 9111012900000 10111013000000 11101113111100 12111013200000 13111013300000 14111013400000 15011013510000 16101013600000 17011013700000 18010113811100 19010112900000 20101014010000R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]2Colkesen,2009)(Fig.1):wx iþb Zþ1for y i¼þ1wx iþb rÀ1for y i¼À1ð1Þwhich is equivalent toy iðwx iþbÞZ1,i¼1,2,...,nð2ÞThe separating hyperplane can then be formalized as a decision functionfðxÞ¼sgnðwxþbÞð3Þwhere,sgn is a sign function,which is defined as follows:sgnðxÞ¼1,if x400,if x¼0À1,if x o08><>:ð4ÞThe two parameters of the separating hyperplane decision func-tion,w and b,can be obtained by solving the following optimization function:Minimize tðwÞ¼12J w J2ð5Þsubject toy Iððwx iÞþbÞZ1,i¼1,...,lð6ÞThe solution to this optimization problem is the saddle point of the Lagrange functionLðw,b,aÞ¼1J w J2ÀX li¼1a iðy iððx i wÞþbÞÀ1Þð7Þ@ @b Lðw,b,aÞ¼0@@wLðw,b,aÞ¼0ð8Þwhere a i is a Lagrange multiplier.The Lagrange function is minimized with respect to w and b and is maximized with respect to a grange multipliers a i are determined by the following optimization function:MaximizeX li¼1a iÀ12X li,j¼1a i a j y i y jðx i x jÞð9Þsubject toa i Z0,i¼1,...,l,andX li¼1a i y i¼0ð10ÞThe separating rule,based on the optimal hyperplane,is the following decision function:fðxÞ¼sgnX li¼1y i a iðxx iÞþb!ð11ÞMore details about SVM algorithms can be found in Vapnik(1995) and Tax and Duin(1999).3.Experiments with kernel functionsFor spatial geocomputational analysis of mineral exploration targets,the decision function in Eq.(3)is a kernel function.The choice of a kernel function(K)and its parameters for an SVM are crucial for obtaining good results.The kernel function can be usedTable2Errors of SVM classification using linear kernel functions.l Number ofsupportvectors Testingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0.2580.00.00.0180.00.00.0 1080.00.00.0 10080.00.00.0 100080.00.00.0Table3Errors of SVM classification using polynomial kernel functions when d¼3and r¼0. l Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0.25120.00.00.0160.00.00.01060.00.00.010060.00.00.0 100060.00.00.0Table4Errors of SVM classification using polynomial kernel functions when l¼0.25,r¼0.d Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)11110.00.0 5.010290.00.00.0100230.045.022.5 1000200.090.045.0Table5Errors of SVM classification using polynomial kernel functions when l¼0.25and d¼3.r Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0120.00.00.01100.00.00.01080.00.00.010080.00.00.0 100080.00.00.0Table6Errors of SVM classification using radial kernel functions.l Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0.25140.00.00.01130.00.00.010130.00.00.0100130.00.00.0 1000130.00.00.0Table7Errors of SVM classification using sigmoid kernel functions when r¼0.l Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0.25400.00.00.01400.035.017.510400.0 6.0 3.0100400.0 6.0 3.0 1000400.0 6.0 3.0R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]3to construct a non-linear decision boundary and to avoid expensive calculation of dot products in high-dimensional feature space.The four popular kernel functions are as follows:Linear:Kðx i,x jÞ¼l x i x j Polynomial of degree d:Kðx i,x jÞ¼ðl x i x jþrÞd,l40Radial basis functionðRBFÞ:Kðx i,x jÞ¼exp fÀl99x iÀx j992g,l40 Sigmoid:Kðx i,x jÞ¼tanhðl x i x jþrÞ,l40ð12ÞThe parameters l,r and d are referred to as kernel parameters. The parameter l serves as an inner product coefficient in the polynomial function.In the case of the RBF kernel(Eq.(12)),l determines the RBF width.In the sigmoid kernel,l serves as an inner product coefficient in the hyperbolic tangent function.The parameter r is used for kernels of polynomial and sigmoid types. The parameter d is the degree of a polynomial function.We performed some experiments to explore the performance of the parameters used in a kernel function.The dataset used in the experiments(Table1),which are derived from the study area(see below),were compiled according to the requirementfor Fig.2.Simplified geological map in western Meguma Terrain of Nova Scotia,Canada(after,Chatterjee1983;Cheng,2008).Table8Errors of SVM classification using sigmoid kernel functions when l¼0.25.r Number ofSupportVectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0400.00.00.01400.00.00.010400.00.00.0100400.00.00.01000400.00.00.0R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]4classification analysis.The e1071(Dimitriadou et al.,2010),a freeware R package,was used to construct a SVM.In e1071,the default values of l,r and d are1/(number of variables),0and3,respectively.From the study area,we used40geological feature vectors of four geoscience variables and a target variable for classification of mineral prospec-tivity(Table1).The target feature vector is either the‘non-deposit’class(or0)or the‘deposit’class(or1)representing whether mineral exploration target is absent or present,respectively.For‘deposit’locations,we used the20known Au deposits.For‘non-deposit’locations,we randomly selected them according to the following four criteria(Carranza et al.,2008):(i)non-deposit locations,in contrast to deposit locations,which tend to cluster and are thus non-random, must be random so that multivariate spatial data signatures are highly non-coherent;(ii)random non-deposit locations should be distal to any deposit location,because non-deposit locations proximal to deposit locations are likely to have similar multivariate spatial data signatures as the deposit locations and thus preclude achievement of desired results;(iii)distal and random non-deposit locations must have values for all the univariate geoscience spatial data;(iv)the number of distal and random non-deposit locations must be equaltoFig.3.Evidence layers used in mapping prospectivity for Au deposits(from Cheng,2008):(a)and(b)represent optimum proximity to anticline axes(2.5km)and contacts between Goldenville and Halifax formations(4km),respectively;(c)and(d)represent,respectively,background and anomaly maps obtained via S-Afiltering of thefirst principal component of As,Cu,Pb and Zn data.R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]5the number of deposit locations.We used point pattern analysis (Diggle,1983;2003;Boots and Getis,1988)to evaluate degrees of spatial randomness of sets of non-deposit locations and tofind distance from any deposit location and corresponding probability that one deposit location is situated next to another deposit location.In the study area,we found that the farthest distance between pairs of Au deposits is71km,indicating that within that distance from any deposit location in there is100%probability of another deposit location. However,few non-deposit locations can be selected beyond71km of the individual Au deposits in the study area.Instead,we selected random non-deposit locations beyond11km from any deposit location because within this distance from any deposit location there is90% probability of another deposit location.When using a linear kernel function and varying l from0.25to 1000,the number of support vectors and the testing errors for both ‘deposit’and‘non-deposit’do not vary(Table2).In this experiment the total error of classification is0.0%,indicating that the accuracy of classification is not sensitive to the choice of l.With a polynomial kernel function,we tested different values of l, d and r as follows.If d¼3,r¼0and l is increased from0.25to1000,the number of support vectors decreases from12to6,but the testing errors for‘deposit’and‘non-deposit’remain nil(Table3).If l¼0.25, r¼0and d is increased from1to1000,the number of support vectors firstly increases from11to29,then decreases from23to20,the testing error for‘non-deposit’decreases from10.0%to0.0%,whereas the testing error for‘deposit’increases from0.0%to90%(Table4). In this experiment,the total error of classification is minimum(0.0%) when d¼10(Table4).If l¼0.25,d¼3and r is increased from 0to1000,the number of support vectors decreases from12to8,but the testing errors for‘deposit’and‘non-deposit’remain nil(Table5).When using a radial kernel function and varying l from0.25to 1000,the number of support vectors decreases from14to13,but the testing errors of‘deposit’and‘non-deposit’remain nil(Table6).With a sigmoid kernel function,we experimented with different values of l and r as follows.If r¼0and l is increased from0.25to1000, the number of support vectors is40,the testing errors for‘non-deposit’do not change,but the testing error of‘deposit’increases from 0.0%to35.0%,then decreases to6.0%(Table7).In this experiment,the total error of classification is minimum at0.0%when l¼0.25 (Table7).If l¼0.25and r is increased from0to1000,the numbers of support vectors and the testing errors of‘deposit’and‘non-deposit’do not change and the total error remains nil(Table8).The results of the experiments demonstrate that,for the datasets in the study area,a linear kernel function,a polynomial kernel function with d¼3and r¼0,or l¼0.25,r¼0and d¼10,or l¼0.25and d¼3,a radial kernel function,and a sigmoid kernel function with r¼0and l¼0.25are optimal kernel functions.That is because the testing errors for‘deposit’and‘non-deposit’are0%in the SVM classifications(Tables2–8).Nevertheless,a sigmoid kernel with l¼0.25and r¼0,compared to all the other kernel functions,is the most optimal kernel function because it uses all the input support vectors for either‘deposit’or‘non-deposit’(Table1)and the training and testing errors for‘deposit’and‘non-deposit’are0% in the SVM classification(Tables7and8).4.Prospectivity mapping in the study areaThe study area is located in western Meguma Terrain of Nova Scotia,Canada.It measures about7780km2.The host rock of Au deposits in this area consists of Cambro-Ordovician low-middle grade metamorphosed sedimentary rocks and a suite of Devonian aluminous granitoid intrusions(Sangster,1990;Ryan and Ramsay, 1997).The metamorphosed sedimentary strata of the Meguma Group are the lower sand-dominatedflysch Goldenville Formation and the upper shalyflysch Halifax Formation occurring in the central part of the study area.The igneous rocks occur mostly in the northern part of the study area(Fig.2).In this area,20turbidite-hosted Au deposits and occurrences (Ryan and Ramsay,1997)are found in the Meguma Group, especially near the contact zones between Goldenville and Halifax Formations(Chatterjee,1983).The major Au mineralization-related geological features are the contact zones between Gold-enville and Halifax Formations,NE–SW trending anticline axes and NE–SW trending shear zones(Sangster,1990;Ryan and Ramsay, 1997).This dataset has been used to test many mineral prospec-tivity mapping algorithms(e.g.,Agterberg,1989;Cheng,2008). More details about the geological settings and datasets in this area can be found in Xu and Cheng(2001).We used four evidence layers(Fig.3)derived and used by Cheng (2008)for mapping prospectivity for Au deposits in the yers A and B represent optimum proximity to anticline axes(2.5km) and optimum proximity to contacts between Goldenville and Halifax Formations(4km),yers C and D represent variations in geochemical background and anomaly,respectively, as modeled by multifractalfilter mapping of thefirst principal component of As,Cu,Pb,and Zn data.Details of how the four evidence layers were obtained can be found in Cheng(2008).4.1.Training datasetThe application of SVM requires two subsets of training loca-tions:one training subset of‘deposit’locations representing presence of mineral deposits,and a training subset of‘non-deposit’locations representing absence of mineral deposits.The value of y i is1for‘deposits’andÀ1for‘non-deposits’.For‘deposit’locations, we used the20known Au deposits(the sixth column of Table1).For ‘non-deposit’locations(last column of Table1),we obtained two ‘non-deposit’datasets(Tables9and10)according to the above-described selection criteria(Carranza et al.,2008).We combined the‘deposits’dataset with each of the two‘non-deposit’datasets to obtain two training datasets.Each training dataset commonly contains20known Au deposits but contains different20randomly selected non-deposits(Fig.4).4.2.Application of SVMBy using the software e1071,separate SVMs both with sigmoid kernel with l¼0.25and r¼0were constructed using the twoTable9The value of each evidence layer occurring in‘non-deposit’dataset1.yer A Layer B Layer C Layer D100002000031110400005000061000700008000090100 100100 110000 120000 130000 140000 150000 160100 170000 180000 190100 200000R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]] 6training datasets.With training dataset1,the classification accuracies for‘non-deposits’and‘deposits’are95%and100%, respectively;With training dataset2,the classification accuracies for‘non-deposits’and‘deposits’are85%and100%,respectively.The total classification accuracies using the two training datasets are97.5%and92.5%,respectively.The patterns of the predicted prospective target areas for Au deposits(Fig.5)are defined mainly by proximity to NE–SW trending anticlines and proximity to contact zones between Goldenville and Halifax Formations.This indicates that‘geology’is better than‘geochemistry’as evidence of prospectivity for Au deposits in this area.With training dataset1,the predicted prospective target areas occupy32.6%of the study area and contain100%of the known Au deposits(Fig.5a).With training dataset2,the predicted prospec-tive target areas occupy33.3%of the study area and contain95.0% of the known Au deposits(Fig.5b).In contrast,using the same datasets,the prospective target areas predicted via WofE occupy 19.3%of study area and contain70.0%of the known Au deposits (Cheng,2008).The error matrices for two SVM classifications show that the type1(false-positive)and type2(false-negative)errors based on training dataset1(Table11)and training dataset2(Table12)are 32.6%and0%,and33.3%and5%,respectively.The total errors for two SVM classifications are16.3%and19.15%based on training datasets1and2,respectively.In contrast,the type1and type2 errors for the WofE prediction are19.3%and30%(Table13), respectively,and the total error for the WofE prediction is24.65%.The results show that the total errors of the SVM classifications are5–9%lower than the total error of the WofE prediction.The 13–14%higher false-positive errors of the SVM classifications compared to that of the WofE prediction suggest that theSVMFig.4.The locations of‘deposit’and‘non-deposit’.Table10The value of each evidence layer occurring in‘non-deposit’dataset2.yer A Layer B Layer C Layer D110102000030000411105000060110710108000091000101110111000120010131000140000150000161000171000180010190010200000R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]7classifications result in larger prospective areas that may not contain undiscovered deposits.However,the 25–30%higher false-negative error of the WofE prediction compared to those of the SVM classifications suggest that the WofE analysis results in larger non-prospective areas that may contain undiscovered deposits.Certainly,in mineral exploration the intentions are notto miss undiscovered deposits (i.e.,avoid false-negative error)and to minimize exploration cost in areas that may not really contain undiscovered deposits (i.e.,keep false-positive error as low as possible).Thus,results suggest the superiority of the SVM classi-fications over the WofE prediction.5.ConclusionsNowadays,SVMs have become a popular geocomputational tool for spatial analysis.In this paper,we used an SVM algorithm to integrate multiple variables for mineral prospectivity mapping.The results obtained by two SVM applications demonstrate that prospective target areas for Au deposits are defined mainly by proximity to NE–SW trending anticlines and to contact zones between the Goldenville and Halifax Formations.In the study area,the SVM classifications of mineral prospectivity have 5–9%lower total errors,13–14%higher false-positive errors and 25–30%lower false-negative errors compared to those of the WofE prediction.These results indicate that SVM is a potentially useful tool for integrating multiple evidence layers in mineral prospectivity mapping.Table 11Error matrix for SVM classification using training dataset 1.Known All ‘deposits’All ‘non-deposits’TotalPrediction ‘Deposit’10032.6132.6‘Non-deposit’067.467.4Total100100200Type 1(false-positive)error ¼32.6.Type 2(false-negative)error ¼0.Total error ¼16.3.Note :Values in the matrix are percentages of ‘deposit’and ‘non-deposit’locations.Table 12Error matrix for SVM classification using training dataset 2.Known All ‘deposits’All ‘non-deposits’TotalPrediction ‘Deposits’9533.3128.3‘Non-deposits’566.771.4Total100100200Type 1(false-positive)error ¼33.3.Type 2(false-negative)error ¼5.Total error ¼19.15.Note :Values in the matrix are percentages of ‘deposit’and ‘non-deposit’locations.Table 13Error matrix for WofE prediction.Known All ‘deposits’All ‘non-deposits’TotalPrediction ‘Deposit’7019.389.3‘Non-deposit’3080.7110.7Total100100200Type 1(false-positive)error ¼19.3.Type 2(false-negative)error ¼30.Total error ¼24.65.Note :Values in the matrix are percentages of ‘deposit’and ‘non-deposit’locations.Fig.5.Prospective targets area for Au deposits delineated by SVM.(a)and (b)are obtained using training dataset 1and 2,respectively.R.Zuo,E.J.M.Carranza /Computers &Geosciences ](]]]])]]]–]]]8。
SUPER VECTOR MACHINE
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
4 Classification Example: IRIS data 25
4.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5 Support Vector Regression 29
5.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.1.1 −Insensitive Loss Function . . . . . . . . . . . . . . . . . . . . . . 30
7 Conclusions 43
A Implementation Issues 45
A.1 Support Vector Classification . . . . . . . . . . . . . . . . . . . . . . . . . 45
A.2 Support Vector Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . . . . . . . .
【support vector machines】
~ ~
这样分类间隔就等于 ,因此要求分类间隔最大,就要求 最大.而要求分类面对所有样本正确分类,就是要求满足
yi ( w xi + b) ≥ 1, i i 使等号成立的样本点称为支持向量
2 w
Linear SVM Mathematically
x+
M=Margin Width
X-
What we know: w . x+ + b = +1 w . x- + b = -1 w . (x+-x-) = 2
首先建立Lagrange函数 函数 首先建立 w J ( w, b, α ) = ∑ α [ y ( w x + b) 1] 2
2 l i i i i =1
J ( w, b, α ) =0 条件1: w J ( w, b, α ) =0 条件2: b 最终可得到
1 l l Q(α ) = J ( w, b, α ) = ∑ α i ∑∑ α iα jyiyj ( xi xj ) 2 i =1 j =1 i =1 寻找最大化目标函数Q (α )的Lagrange乘子{α i }il =1 , 满足约束条件 (1)
SVM的理论基础
由于SVM 的求解最后转化成二次规划问题的求 解,因此SVM 的解是全局唯一的最优解 SVM在解决小样本、非线性及高维模式识别问题 中表现出许多特有的优势,并能够推广应用到函 数拟合等其他机器学习问题中.
线性判别函数和判别面
一个线性判别函数(discriminant function)是指 由x的各个分量的线性组合而成的函数
l l
∑α y = 0
i =1 i i l * , i i,
对偶问题
基于支持向量机的弗兰克-赫兹实验曲线拟合
本期推荐本栏目责任编辑:王力基于支持向量机的弗兰克-赫兹实验曲线拟合周祉煜1,孟倩2(1.河北师范大学物理学院,河北石家庄050024;2.江苏师范大学计算机科学与技术学院,江苏徐州221116)摘要:弗兰克-赫兹实验是“近代物理实验”中的重要实验之一,数据量大且数据处理复杂。
支持向量机是一种广泛应用于函数逼近、模式识别、回归等领域的机器学习算法。
本文将支持向量机算法应用于弗兰克-赫兹实验数据的拟合,过程简单,在python 环境下验证该方法拟合精度高,效果好。
支持向量机算法还可应用于其他的物理实验曲线拟合。
关键词:支持向量机;曲线拟合;弗兰克-赫兹实验;Python 中图分类号:TP18文献标识码:A文章编号:1009-3044(2021)13-0001-02开放科学(资源服务)标识码(OSID ):Curve Fitting of Frank Hertz Experiment Based on Support Vector Machine ZHOU Zhi-yu 1,MENG Qian 2(1.Hebei Normal University,College of physics.,Shijiazhuang 050024,China;2.School of Computer Science and technology,Jiang⁃su Normal University,Xuzhou 221116,China)Abstract:Frank-Hertz experiment is a classical experiment in modern physics experiments.It has a large amount of experimental data and a complicated data processing process.Support Vector Machine is a machine learning algorithm which widely used in function approximation,pattern recognition,regression and other fields.In this paper,support vector machine is used to do curve fitting for the experimental data of Frank-Hertz experiment.The process is simple,and the method is verified to have high curve fit⁃ting accuracy and good effect in python environment.SVM can also be applied to curve fitting in other physics experiments.Key words:support vector machine,curve fitting,Frank Hertz experiment ,python 1998年,Vapnik V N.等人[1]提出了一种新型的基于小样本和统计学习理论的机器学习方法-支持向量机(Support Vector Machine,SVM),该方法可以从有限的训练样本出发寻找“最优函数规律”,使它能够对未知输出作尽可能准确的预测,可应用于函数逼近、模式识别、回归等领域。
Text Categorization with Support Vector Machines
Features
Thorsten Joachims
Universitat Dortmund Informatik LS8, Baroper Str. 301
3 Support Vector Machines
Support vector machines are based on the Structural Risk Minimization principle 9 from computational learning theory. The idea of structural risk minimization is to nd a hypothesis h for which we can guarantee the lowest true error. The true error of h is the probability that h will make an error on an unseen and randomly selected test example. An upper bound can be used to connect the true error of a hypothesis h with the error of h on the training set and the complexity of H measured by VC-Dimension, the hypothesis space containing h 9 . Support vector machines nd the hypothesis h which approximately minimizes this bound on the true error by e ectively and e ciently controlling the VC-Dimension of H.
Evolutionary Support Vector Machines and their
U N I V E R S I T ¨AT D O R T M U N D REIHE COMPUTATIONAL INTELLIGENCE S O N D E R F O R S C H U N G S B E R E I C H 531Design und Management komplexer technischer Prozesse und Systeme mit Methoden der Computational Intelligence Evolutionary Support Vector Machines and theirApplication for ClassificationRuxandra Stoean,Mike Preuss,Catalin Stoeanand D.DumitrescuNr.CI-212/06Interner Bericht ISSN 1433-3325June 2006Sekretariat des SFB 531·Universit ¨a t Dortmund ·Fachbereich Informatik/XI 44221Dortmund ·GermanyDiese Arbeit ist im Sonderforschungsbereich 531,”Computational Intelligence“,der Universit ¨a t Dortmund entstanden und wurde auf seine Veranlassung unter Verwendung der ihm von der Deutschen Forschungsgemeinschaft zur Verf ¨u gung gestellten Mittel gedruckt.Evolutionary Support Vector Machines and theirApplication for ClassificationRuxandra Stoean1,Mike Preuss2,Catalin Stoean1,and D.Dumitrescu31Department of Computer Science,Faculty of Mathematics and Computer Science,University of Craiova,Romania{ruxandra.stoean,catalin.stoean}@inf.ucv.ro2Chair of Algorithm Engineering,Department of Computer Science,University of Dortmund,Germanymike.preuss@uni-dortmund.de3Department of Computer Science,Faculty of Mathematics and Computer Science,”Babes-Bolyai”University of Cluj-Napoca,Romaniaddumitr@cs.ubbcluj.roAbstract.We propose a novel learning technique for classification as result ofthe hybridization between support vector machines and evolutionary algorithms.Evolutionary support vector machines consider the classification task as in sup-port vector machines but use evolutionary algorithms to solve the optimizationproblem of determining the decision function.They can acquire the coefficientsof the separating hyperplane,which is often not possible within classical tech-niques.More important,ESVMs obtain the coefficients directly from the evolu-tionary algorithm and can refer them at any point during a run.The concept isfurthermore extended to handle large amounts of data,a problem frequently oc-curring e.g.in spam mail detection,one of our test cases.Evolutionary supportvector machines are validated on this and three other real-world classificationtasks;obtained results show the promise of this new technique.Keywords:support vector machines,coefficients of decision surface,evolution-ary algorithms,evolutionary support vector machines,parameter tuning1IntroductionSupport vector machines(SVMs)represent a state-of-the-art learning technique that has managed to reach very competitive results in different types of classification and regression tasks.Their engine,however,is quite complicated,as far as proper under-standing of the calculus and correct implementation of the mechanisms are concerned. This paper presents a novel approach,evolutionary support vector machines(ESVMs), which offers a simpler alternative to the standard technique inside SVMs,delivered by evolutionary algorithms(EAs).Note that this is not thefirst attempt to hybridize SVMs and EAs.Existing alternatives are discussed in§2.2.Nevertheless,we claim that our approach is significantly different from these.ESVMs as presented here are constructed solely based on SVMs applied for classi-fication.Validation is achieved by considering four real-world classification tasks.Be-sides comparing results,the potential of the utilized,simplistic EA through parametriza-tion is investigated.To enable handling large data sets,thefirst approach is enhanced2Ruxandra Stoean,Mike Preuss,Catalin Stoean,D.Dumitrescuby use of a chunking technique,resulting in a more versatile algorithm.A second so-lution for dealing with a high number of samples is brought by a reconsideration of the elements of thefirst evolutionary algorithm.Obtained results prove suitability and competitiveness of the new approach,so ESVMs qualify as viable simpler alternative to standard SVMs in this context.However,this is only afirst attempt with the new approach.Many of its components still remain to be improved.The paper is organized as follows:§2outlines the concepts of classical SVMs to-gether with existing evolutionary approaches aimed at boosting their performance.§3 presents the new approach of ESVMs.Their validation is achieved on real-world exam-ples in§4.Improved ESVMs with two new mechanisms for reducing problem size in case of large data sets are presented in§5.The last section comprises conclusions and outlines ideas for further improvement.2PrerequisitesGiven{f t∈T|,f t:R n→{1,2,...,k}},a set of functions,and{(x i,y i)}i=1,2,...,m, a training set where every x i∈R n represents a data vector and each y i corresponds to a class,a classification task consists in learning the optimal function f t∗that mini-mizes the discrepancy between the given classes of data vectors and the actual classes produced by the learning machine.Finally,accuracy of the machine is computed on previously unseen test data vectors.In the classical architecture,SVMs reduce k-class classification problems to many binary classification tasks that are separately consid-ered and solved.A voting system then decides the class for data vectors in the test set. SVMs regard classification from a geometrical point of view,i.e.they assume the ex-istence of a separating surface between two classes labelled as-1and1,respectively. The aim of SVMs then becomes the discovery of this decision hyperplane,i.e.its coef-ficients.2.1Classical Support Vector MachinesIf training data is known to be linearly separable,then there exists a linear hyperplane that performs the partition,i.e. w,x −b=0,where w∈R n is the normal to the hy-perplane and represents hyperplane orientation and b∈R denotes hyperplane location. The separating hyperplane is thus determined by its coefficients,w and b.Consequently, the positive data vectors lie on the corresponding side of the hyperplane and their neg-ative counterparts on the opposite side.As a stronger statement for linear separability, the positive and negative vectors each lie on the corresponding side of a matching sup-porting hyperplane for the respective class(Figure1a)[1],written in brief as:y i( w,x i −b)>1,i=1,2,...,m.In order to achieve the classification goal,SVMs must determine the optimal values for the coefficients of the decision hyperplane that separates the training data with as few exceptions as possible.In addition,according to the principle of Structural Risk Minimization from Statistical Learning Theory[2],separation must be performed withEvolutionary Support Vector Machines for Classification3 a maximal margin between classes.Summing up,classification of linear separable data with a linear hyperplane through SVMs leads thus to the following optimization prob-lem: find w∗and b∗as to minimize w∗ 22subject to y i( w∗,x i −b∗)≥1,i=1,2,...,m.(1)Training data are not linearly separable in general and it is obvious that a linear sepa-rating hyperplane is not able to build a partition without any errors.However,a linear separation that minimizes training error can be tried as a solution to the classification problem.Training errors can be traced by observing the deviations of data vectors from the corresponding supporting hyperplane,i.e.from the ideal condition of data separabil-ity.Such a deviation corresponds to a value of±ξiw ,ξi≥0.These values may indicatedifferent nuanced digressions(Figure1b),but only aξi higher than unity signals a classification error.Minimization of training error is achieved by adding the indicator of error for every training data vector into the separability statement and,at the same time,by minimizing the sum of indicators for errors.Adding up,classification of lin-ear nonseparable data with a linear hyperplane through SVMs leads to the following optimization problem,where C is a hyperparameter representing the penalty for errors: find w∗and b∗as to minimize w∗ 22+C m i=1ξi,C>0subject to y i( w∗,x i −b∗)≥1−ξi,ξi≥0,i=1,2,...,m.(2)If a linear hyperplane does not provide satisfactory results for the classification task, then a nonlinear decision surface can be appointed.The initial space of training data vectors can be nonlinearly mapped into a high enough dimensional feature space,where a linear decision hyperplane can be subsequently built.The separating hyperplane will achieve an accurate classification in the feature space which will correspond to a non-linear decision function in the initial space(Figure1c).The procedure leads therefore to the creation of a linear separating hyperplane that would,as before,minimize train-ing error,only this time it would perform in the feature space.Accordingly,a nonlinear mapΦ:R n→H is considered and data vectors from the initial space are mapped into H.w is also mapped throughΦinto H.As a result,the squared norm that is in-volved in maximizing the margin of separation is now Φ(w) 2.Also,the equation of the hyperplane consequently changes to Φ(w),Φ(x i) −b=0.Nevertheless,as simple in theory,the appointment of an appropriate mapΦwith the above properties is a difficult task.As in the training algorithm vectors appear only as part of dot products,the issue would be simplified if there were a kernel function K that would obey K(x,y)= Φ(x),Φ(y) ,where x,y∈R n.In this way,one would use K everywhere and would never need to explicitly even know whatΦis.The remaining problem is that kernel functions with this property are those that obey the corresponding conditions of Mercer’s theorem from functional analysis,which are not easy to check. There are,however,a couple of classical kernels that had been demonstrated to meet Mercer’s condition,i.e.the polynomial classifier of degree p:K(x,y)= x,y p andthe radial basis function classifier:K(x,y)=e x−y 2σ,where p andσare also hyper-parameters of SVMs.In conclusion,classification of linear nonseparable data with a4Ruxandra Stoean,Mike Preuss,Catalin Stoean,D.Dumitrescu(a)(b)(c)Fig.1:(a)Decision hyperplane(continuous line)that separates between circles(positive)and squares(negative)and supporting hyperplanes(dotted lines).(b)Position of data and corre-sponding indicators for errors-correct placement,ξi=0(label1)margin position,ξi<1 (label2)and classification error,ξi>1(label3).(c)Initial data space(left),nonlinear map into the higher dimension/its linear separation(right),and corresponding nonlinear surface(bottom). nonlinear hyperplane through SVMs leads to the same optimization problem as in(2) which is now considered in the feature space and with the use of a kernel function: find w∗and b∗as to minimize K(w∗,w∗)2+C m i=1ξi,C>0(3)subject to y i(K(w∗,x i)−b∗)≥1−ξi,ξi≥0,i=1,2,...,m.The optimization problem(corresponding to either situation above)is subsequently solved.Accuracy on the test set in then computed,i.e.the side of the decision boundary on which each new data vector lies is determined.Classical SVMs approach the op-timization problem through a generalized form of the method of Lagrange multipliers [3].But the mathematics of the technique can be found to be very difficult both to grasp and apply.This is the reason why present approach aims to simplify(and improve) SVMs through a hybridization with EAsby utilizing these in order to determine optimal values for the coefficients of the separating hyperplane(w and b)directly.2.2Evolutionary Approaches to Support Vector MachinesEAs have been widely used in hybridization with SVMs in order to boost performance of classical architecture.Their combination envisaged two different directions:model selection and feature selection.Model selection concerns adjustment of hyperparame-ters(free parameters)within SVMs,i.e.the penalty for errors,C,and parameters of the kernel,p orσwhich,in standard variants,is performed through grid search or gradi-ent descent methods.Evolution of hyperparameters can be achieved through evolution strategies[4].When dealing with high dimensional classification problems,feature se-lection regards the choice of the most relevant features as input for a SVM.The optimal subset of features can be evolved using genetic algorithms in[5]and genetic program-ming in[6].To the best of our knowledge,evolution of coefficients of the separating hyperplane within SVMs has not been accomplished yet.3Evolutionary Support Vector Machines for ClassificationWithin the new hybridized technique of ESVMs separation of positive and negative vectors proceeds as in standard SVMs,while the optimization problem is solved byEvolutionary Support Vector Machines for Classification5 means of EAs.Therefore,the coefficients of the separating hyperplane,i.e.w and b,are encoded in the representation of the EA and their evolution is performed with respect to the objective function and the constraints in the optimization problem(3)within SVMs, which is considered for reasons of generality.3.1Evolving the Coefficients of the Separating HyperplaneRepresentation An individual encodes the coefficients of the separating hyperplane,w and b.Since indicators for errors of classification,ξi,i=1,2,...,m,appear in the con-ditions for hyperplane optimality,ESVMs handle them through inclusion in the struc-ture of an individual,as well:c=(w1,...,w n,b,ξ1,....,ξm).(4) After termination of the algorithm,the best individual from all generations gives ap-proximately optimal values for the coefficients of the decision hyperplane.If proper values for parameters of the evolutionary algorithm are chosen,training errors of clas-sification can also result from the optimal individual(those indicators whose values are higher than unity)but with some loss in accuracy;otherwise,indicators grow in the direction of errors driving the evolutionary cycle towards its goal,but do not reach the limit of unity when the evolutionary process stops.An example of an ESVM which also provides the errors of classification is nevertheless exhibited for simple artificial 2-dimensional data sets separated by various kernels(Figure2).In such a situation,the number of samples and the dimensionality of data are both low,thus accuracy and run-time are not affected by the choice of parameters which leads to the discovery of all training errors.Initial population Individuals are randomly generated such that w i∈[−1,1],i= 1,2,...,n,b∈[−1,1]andξj∈[0,1],j=1,2,...,m.Fitness evaluation Thefitness function derives from the objective function of the optimization problem and has to be minimized.Constraints are handled by penalizing the infeasible individuals by appointing t:R→R which returns the value of the argu-ment if negative while otherwise0.The expression of the function is thus as follows:f(w,b,ξ)=K(w,w)+Cmi=1ξi+m i=1[t(y i(K(w,x i)−b)−1+ξi)]2.(5)Genetic operators Operators were chosen experimentally.Tournament selection, intermediate crossover and mutation with normal perturbation are applied.Mutation of errors is constrained,preventing theξi s from taking negative values.Stop condition The algorithm stops after a predefined number of generations.As the coefficients of the separating hyperplane are found,the class for a new,unseen test data vector can be determined,following class(x)=sgn(K(w,x)−b).This is unlike classical SVMs where it is seldom the case that coefficients can be determined following the standard technique,as the mapΦcannot be always explicitly determined. In this situation,the class for a new vector follows from computational artifices.6Ruxandra Stoean,Mike Preuss,Catalin Stoean,D.Dumitrescu(a)(b)(c)Fig.2:Vizualization of ESVMs on 2-dimensional points.Errors of classification are squared.Odd (a),even (b)polynomial and radial (c)kernels3.2ESVMs for Multi-class ClassificationMulti-class ESVMs employ one classical and very successful SVM technique,the ONE-AGAINST-ONE (1-a-1)[7].As the classification problem is k -class,k >2,1-a-1considers k (k−1)2SVMs,where each machine is trained on data from every two classes,i and j ,where i corresponds to 1and j to -1.For every SVM,the class of x is computed and if x is in class i ,the vote for the i -th class is incremented by one;conversely,the vote for class j is added by one.Finally,x is taken to belong to the class with the largest vote.In case two classes have identical number of votes,the one with the smaller index is selected.Consequently,1-a-1multi-class ESVMs are straightforward.k (k−1)2ESVMs are built for every two classes and voting is subsequently conducted.4Experimental Evaluation:Real-World ClassificationExperiments have been conducted on four data sets (with no missing values)concerning real-world problems coming from the UCI Repository of Machine Learning Databases 4,i.e.diabetes mellitus diagnosis,spam detection,iris recognition and soybean disease diagnosis.The motivation for the choice of test cases was manifold.Diabetes and spam are two-class problems,while soybean and iris are multi-class.Differentiating,on the one hand,diagnosis is a better-known benchmark,but filtering is an issue of current major concern;moreover,the latter has a lot more features and samples,which makes a huge difference for classification as well as for optimization.On the other hand,iris has a lot more samples while soybean has a lot more attributes.For all reasons mentioned above,the selection of test problems certainly contains all the variety of situations that is necessary for the objective validation of the new approach of ESVMs.Brief information on the classification tasks and SVM and EA parameter values are given in Table 1.The error penalty was invariably set to 1.For each data set,30runs of the ESVM were conducted;in every run 75%ran-dom cases were appointed to the training set and the remaining 25%went into test.Experiments showed the necessity for data normalization in diabetes,spam and iris.4Available at /mlearn/MLRepository.htmlEvolutionary Support Vector Machines for Classification7 Table1:Data set properties and ESVM algorithm parameter values.Rightmost columns hold tuned parameter sets for modified ESVMs,ESVMs with chunking,and utilized parameter bounds.Diabetes Iris Soybean SpamData set description modif.chunk.bounds Number of samples768150474601(434)(10/500) Number of attributes843557Number of classes2342ESVM parameter values man.tun.man.tun.man.tun.man.tun.tun.tun.Kernel parameter(p orσ)p=2σ=1p=1p=1p=1 Population size10019810046100162100154519010/200 Number of generations25029610022010029325028718028650/300 Crossover probability0.400.870.300.770.300.040.300.840.950.110.01/1 Mutation probability0.400.210.500.570.500.390.500.200.030.080.01/1 Error mutation probability0.500.200.500.020.500.090.500.07—0.800.01/1 Mutation strength0.104.110.104.040.100.160.103.32 3.750.980.001/5 Error mutation strength0.100.020.103.110.103.800.100.01—0.010.001/5 No further modification of the data was carried out and all data was used in the ex-periments.Obtained test accuracies are presented in Table2.Differentiated(spam/non spam for spamfiltering and ill/healthy for diabetes)accuracies are also depicted.In or-der to validate the manually found EA parameter values,the parameter tuning method SPO[8]was applied with a budget of1000optimization runs.The last4columns of Table2hold performances and standard deviations of the best configuration of an ini-tial latin hypersquare sample(LHS)with size10×#parameters,and thefinally found best parameter setting.Resulting parameter values are depicted in Table1.They indi-cate that for all cases,except for soybean data and chunking enhanced ESVM on the spam data,crossover probabilities were dramatically increased,while often reducing mutation probabilities,especially for errors.It must be stated that in most cases,results achieved with manually determined parameter values are only improved by increasing effort(population size or number of generations).Comparison to worst and best found results of different techniques,either SVMs or others,was conducted.Assessment cannot be objective,however,as outlined meth-ods either use different sizes for the training/test sets or other types of cross-validation and number of runs or employ various preprocessing procedures for feature or sam-ple selection.Diabetes diagnosis was approached in SV M light[9]where an accuracy of76.95%was obtained and Critical SVMs[10]with a result of82.29%.Accuracy on spam detection is reported in[11]where k-Nearest Neighbourhood on non preprocessed data resulted in66.5%accuracy and in[12]where functional link network wrapped into a genetic algorithm for input and output feature selection gave a result of92.44%.1-a-1 multi-class SVMs on the Iris data set were perfomed in[7](accompanied by a shrink-ing technique)and[13];obtained results were of97.33%and98.67%,respectively. Results for the Soybean small data set were provided in[14],where,among others, Naive Bayes was applied and provided an accuracy of95.50%,reaching100%when a pair-wise classification strategy was employed.8Ruxandra Stoean,Mike Preuss,Catalin Stoean,D.DumitrescuTable2:Accuracies of different ESVM versions on the considered test sets,in percent.Average Worst Best StD LHS best StD SPO StD ESVMsDiabetes(overall)76.3071.3580.73 2.2475.82 3.2777.312.45 Diabetes(ill)50.8139.1960.27 4.5349.357.4752.645.32 Diabetes(healthy)90.5484.8096.00 2.7189.60 2.3690.212.64 Iris(overall)95.1891.11100.0 2.4895.11 2.9595.112.95 Soybean(overall)99.0294.11100.0 2.2399.61 1.4799.801.06 Spamfiltering(overall)87.7485.7489.83 1.0689.27 1.3790.590.98 Spamfiltering(spam)77.4870.3182.50 2.7780.63 3.5183.762.21 Spamfiltering(non spam)94.4192.6296.300.8994.820.9495.060.62 ESVMs with ChunkingSpamfiltering(overall)87.3083.1390.00 1.7787.52 1.3188.371.15 Spamfiltering(spam)83.4775.5486.81 2.7886.26 2.6686.352.70 Spamfiltering(non spam)89.7884.2292.52 2.1188.33 2.4889.682.06 Modified ESVMsSpamfiltering(overall)88.4086.5290.35 1.0290.060.9991.250.83 Spamfiltering(spam)79.6375.3984.70 2.1682.73 2.2885.522.08 Spamfiltering(non spam)94.1791.3495.84 1.0594.890.9295.000.72 5Improving Training TimeObtained results for the tasks we have undertaken to solve have proven to be competitive as compared to accuracies of different powerful classification techniques.However,for large problems,i.e.spamfiltering,the amount of runtime needed for training is≈800s. This stems from the large genomes employed,as indicators for errors of every sample in the training set are included in the representation.Consequently,we tackle this prob-lem with two approaches:an adaptation of a chunking procedure inside ESVMs and a modified version of the evolutionary approach.5.1Reducing Samples for Large ProblemsWe propose a novel algorithm to reduce the number of samples for one run of ESVMs which is an adaptation of the widely known shrinking technique within SVMs,called chunking[15].In standard SVMs,this technique implies the identification of Lagrange multipliers that denote samples which do not fulfill restrictions.As we renounced the standard solving of the optimization problem within SVMs for the EA-based approach, the chunking algorithm was adapted tofit our technique(Algorithm1).ESVM with chunking was applied to the spam data set.Values for parameters were set as before,except the number of generations for each EA which is now set to100. The chunk size,i.e N,was chosen as200and the number of iterations with no im-provement was designated to be5.Results are shown in Table2.Average runtime was of103.2s/run.Therefore,the novel algorithm of ESVM with chunking reached its goal, running8times faster than previous one,at a cost of loss in accuracy of0.4%.Evolutionary Support Vector Machines for Classification9Algorithm1ESVM with ChunkingRandomly choose N samples from the training data set,equally distributed,to make a chunk; while a predefined number of iterations passes with no improvement doiffirst chunk thenRandomly initialize population of a new EA;elseUse best evolved hyperplane coefficients and random indicators for errors tofill half of the population of a new EA and randomly initialize the other half;end ifApply EA andfind coefficients of the hyperplane;Compute side of all samples in the training set with evolved hyperplane coefficients;From incorrectly placed,randomly choose(if available)N/2samples,equally distributed;Randomly choose the rest up to N from the current chunk and add all to the new chunk if obtained training accuracy if higher than the best one obtained so far then Update best accuracy and best evolved hyperplane coefficients;set improvement to true;end ifend whileApply best obtained coefficients on the test set and compute accuracy5.2A Reconsideration of the Evolutionary AlgorithmSince ESVMs directly provide hyperplane coefficients at all times,we propose to drop the indicators for errors from the EA representation and,instead,compute their val-ues in a simple geometrical fashion.Consequently,this time,individual representation contains only w and b.Next,all indicatorsξi,i=1,2,...,m are computed in order to be referred in thefitness function(5).The current individual(which is the current separating hyperplane)is considered and,following[1],supporting hyperplanes are de-termined.Then,for every training vector,deviation to the corresponding supporting hyperplane,following its class,is calculated.If sign of deviation equals class,corre-spondingξi=0;else,deviation is returned.The EA proceeds with the same values for parameters as in Table1(certainly except probabilities and step sizes for theξi s)and,in the end of the run,hyperplane coefficients are again directly acquired.Empirical results on the spam data set(Table2)and the average runtime of600s seem to support the new approach.It is interesting to remark that the modified algorithm is not that much faster, but provides some improvement in accuracy.In contrast to the chunking approach,it also seems especially better suited for achieving high non spam recognition rates,in or-der to prevent erroneous deletion of good mail.Surprisingly,SPO here decreases effort while increasing accuracies(Table1),resulting in further speedup.6Conclusions and Future WorkProposed new hybridized learning technique incorporates the vision upon classification of SVMs but solves the inherent optimization problem by means of evolutionary algo-rithms.ESVMs present many advantages as compared to SVMs.First of all,they are definitely much easier to understand and use.Secondly and more important,the evo-lutionary solving of the optimization problem enables the acquirement of hyperplane10Ruxandra Stoean,Mike Preuss,Catalin Stoean,D.Dumitrescucoefficients directly and at all times within a run.Thirdly,accuracy on several bench-mark real-world problems is comparable to those of state-of-the-art SVM methods or to results of other powerful techniques from different other machine learningfields.In order to enhance suitability of the new technique for any classification issue,two novel mechanisms for reducing size in large problems are also proposed;obtained results support their employment.Although already competitive,the novel ESVMs for classification can still be im-proved.Other kernels may be found and used for better separation;also,the two ap-pointed classical kernels may have parameters evolved by means of evolutionary algo-rithms as in some approaches for model selection.Also,other preprocessing techniques can be considered;feature selection through an evolutionary algorithm as found in lit-erature can surely boost accuracy and runtime.Definitely,other ways to handle large data sets can be imagined;a genetic algorithm can also be used for sample selection. Finally,the construction of ESVMs for regression problems is a task for future work. References1.Bosch,R.A.,Smith,J.A.:Separating Hyperplanes and the Authorship of the Disputed Feder-alist Papers.American Mathematical Monthly,V ol.105,No.7,(1998),601-6082.Vapnik,V.,Statistical Learning Theory.Wiley,New York(1998)3.Haykin,S.:Neural Networks:A Comprehensive Foundation.Prentice Hall,New Jersey(1999)4.Friedrichs,F.,Igel,C.:Evolutionary tuning of multiple SVM parameters,In Proc.12th Euro-pean Symposium on Artificial Neural Networks(2004)519-5245.Feres de Souza,B.,Ponce de Leon F.de Carvalho,A.:Gene selection based on multi-class support vector machines and genetic algorithms.Journal of Genetics and Molecular Research, V ol.4,No.3(2005)599-6076.Eads,D.,Hill,D.,Davis,S.,Perkins,S.,Ma,J.,Porter,R.,Theiler,J.:Genetic Algorithms and Support Vector Machines for Time Series Classification.5th Conf.on the Applications and Science of Neural Networks,Fuzzy Systems,and Evolutionary Computation,Proc.Symposium on Optical Science and Technology,SPIE,Seattle,W A,4787(2002)74-857.Hsu,C.-W.,Lin,C.-J.:A Comparison of Methods for Multi-class Support Vector Machines. IEEE Transactions on Neural Networks,V ol.13,No.2(2002)415-4258.Bartz-Beielstein,T.:Experimental research in evolutionary computation-the new experimen-talism.Natural Computing Series,Springer-Verlag(2006)9.Joachims,T.:Making Large-Scale Support Vector Machine Learning Practical.In Advances in Kernel Methods:Support Vector Learning(1999)169-18410.Raicharoen,T.and Lursinsap,C.:Critical Support Vector Machine Without Kernel Function. In Proc.9th Intl.Conf.on Neural Information Processing,Singapore,V ol.5(2002)2532-2536 11.Tax,D.M.J.:DDtools,the data description toolbox for Matlab(2005)http://www-ict.ewi.tudelft.nl/davidt/occ/index.html12.Sierra,A.,Corbacho,F.:Input and Output Feature Selection.ICANN2002,LNCS2415 (2002)625-63013.Weston,J.,Watkins,C.:Multi-class Support Vector Machines.Technical Report,CSD-TR-98-04,Royal Holloway,University of London,Department of Computer Science(1998) 14.Bailey,J.,Manoukian,T.,Ramamohanarao,K.:Classification Using Constrained Emerging Patterns.In Proc.of W AIM2003(2003)226-23715.Perez-Cruz,F.,Figueiras-Vidal,A.R.,Artes-Rodriguez,A.:Double chunking for solving SVMs for very large datasets.Learning’04,Elche,Spain(2004)/ archive/00001184/01/learn04.pdf.。
NGUYEN DUNG DUC
Studies on Improving the Efficiency of Support Vector MachinesbyNGUYEN DUNG DUCsubmitted toJapan Advanced Institute of Science and Technology in partial fulfillment of the requirementsfor the degree ofDoctor of PhilosophySupervisor:Professor Ho Bao TuSchool of Knowledge ScienceJapan Advanced Institute of Science and TechnologyMarch,2006AbstractMotivation and Objective:In recent years support vector machine(SVM)has emerged as a powerful learning approach and successfully be applied in a wide variety of applications.The high generalization ability of SVMs is guaranteed by special properties of the optimal hyperplane and the use of kernel.However,SVM is considered slower than other learning approaches in both testing and training phases.In testing phase SVMs have to compare the test pattern with every support vectors included in their solutions.When the number of support vectors increases,the speed of testing phase decreases proportionally.To reduce this computational expense,reduced set methods try to replace original SVM solution by a simplified one which consists of much fewer number of vectors,called reduced vectors.However,the main drawback of former reduced set methods lies in the construction of each new reduced vector:it is required to minimize a multivariate function with local minima.Thus,in order to achieve a good simplified solution the construction must be repeated many times with different initial values.Our first objective was aiming at building a better reduced set method which overcomes the mentioned local minima problem.The second objective was tofind a simple and effective way to reduce the training time in a model selection process.This objective was motivated by the fact that the selection of a good SVM for a specific application is a very time consuming task.It generally demands a series of SVM training with different parameter settings;and each SVM training solves a very expensive optimization problem.Methodology:Starting from a mechanical point of view,we proposed to simplify support vector solutions by iteratively replacing two support vectors with a newly created vector;or to substitute two member forces in an equilibrium system by an equivalent force.This approach also faces the difficulties caused by the so called pre-image problem of kernel-based methods where generally there is no exact substitution of two support vectors in a kernel-induced feature space by image of a vector in input space.However this bottom-up approach possess a big advantage that the computation of the new vector involves only two support vectors being replaced,not to involve all vectors as in the former top-down approach.The extra task of the bottom-up method is tofind a heuristic to select a good pair of support vectors to substitute in each iteration.This heuristic aims at minimizing the difference between the original solution and the simplified one.Also,it is necessary to design a stopping condition to terminate the simplification process before it makes the simplified solution too different from the original one,thus the possible loss in generalization performance can get out of control.For the second problem,our intensive investigation reconfirmed that different SVMs trained by different parameter settings share a big portion of common support vectors.This observation suggests a simple technique to use the results of previously trained SVMs to initialize the search in training a new machine.In a general decomposition framework for SVM training, this initialization makes the initial local optimized solution closer to the global optimized solution;hence the optimization process for SVM training converges more quickly.Finding and Conclusion:The bottom-up approach leads to a conceptually simpler and computationally less expensive method for simplifying SVM solutions.We found that it is reasonable to select a close support vector pair to replace with a newly constructed vector,and this construction only requiresfinding the unique maximum point of a uni-variate function.The uniqueness of solution does not only make the algorithm run faster, but it also makes the reduce set method easier to use in ers do not have to run many trials and wonder about different results returned in different runs.Experimental results on real life datasets shown that our proposed method can reduce a large number of support vectors and keeps generalization performance paring with for-mer methods,the proposed one produced slightly better results,and more importantly it is much more efficient.For the second problem,experiments on various real life datasets showed that by initializing thefirst working set using the result of trained SVMs,the training time for each subsequent SVM can be reduced by22.8-85.5%.This reduction is significant in speeding up the whole model selection process.AcknowledgmentsThis work was carried out at Knowledge Creating Methodology Lab,School of Knowl-edge Science,Japan Advanced Institute of Science and Technology.I wish to express my gratitude to the many people who have supported me during my work.I am most grateful to my supervisor,Prof.Ho Tu Bao,for providing me with his help, supervision and motivation throughout the course of this work.His insight and breadth of knowledge have been invaluable to me.Without his care,supervision and friendship I would not be able to complete this work.I want to thank Prof.Kenji Satou,who has kindly accepted me to do a minor theme research under his supervision.I wish to express my gratefulness to the official referees of the dissertation,Prof.Kenji Satou,Prof.Yoshiteru Nakamori,Prof.Tsutomu Fujinami,and Prof.Hiroshi Motoda, for their valuable comments and suggestions on this dissertation.I would like to express my appreciation to the Ministry of Education,Culture,Sports, Science,and Technology of Japan,and the International Information Science Foundation for providing me the scholarship and thefinancial support for attending international conferences.My special thank goes to the members of the Knowledge Creating Laboratory,and the many friends of mine in JAIST for providing their helps,a friendly and enjoyable environment.Finally,I am indebted to my parents for their forever affection,patience,and constant encouragement,to my wife for sharing me difficulties and happiness.To my son,the greatest source of inspiration.ContentsAbstract iAcknowledgments iii1Introduction11.1Efforts in Improving the Efficiency of Support Vector Learning (1)1.2Problem and Contribution (4)1.3Thesis Outline (6)2Preliminaries on Support Vector Machines72.1Introduction (7)2.2Linear Support Vector Classification (7)2.2.1The Maximal Margin Hyperplane (7)2.2.2Finding the Maximal Margin Classifier (12)2.2.3Soft Margin Classifiers (12)2.2.4Optimization (13)2.3Nonlinear Support Vector Classification (17)2.3.1Learning in Feature Space (17)2.3.2Kernels (19)2.3.3VC Dimension and Generalization Ability of Support Vector Machine212.4Support Vector Regression (23)2.5Implementation Techniques (26)2.6Summary (30)3Simplifying Support Vector Solutions313.1Introduction (31)3.2Simplifying Support Vector Machines (32)3.2.1Reducing Complexity of SVMs in Testing Phase (32)3.2.2Reduced Set Construction (33)3.2.3Reduced Set Selection (35)3.3A Bottom-up Method for Simplifying Support Vector Solutions (36)3.3.1Simplification of Two Support Vectors (37)3.3.2Simplification of Support Vector Solution (43)3.3.3Pursuing a Better Approximation (46)3.4Experiment (46)3.5Discussion (50)4Speeding-up Support Vector Training in Model Selection544.1Introduction (54)4.2Model Selection for Support Vector Machine (55)4.2.1What is Model Selection (55)4.2.2Model Selection for Support Vector Machines (57)4.3Speeding-up Model Selection SVM (60)4.3.1Speeding-up by Improving Search Strategy (60)4.3.2Speeding-up by Improving Model Evaluation (61)4.4Speeding-up SVM Training in Model Selection (61)4.4.1The General Decomposition Algorithm for SVM Training (61)4.4.2Initializing Working Set (63)4.5Experiments (66)4.6Discussion (68)5Conclusions and Future Work69 References72 Publications80List of Figures2.1Margin of a set of examples with respect to a hyperplane.The origin has−bperpendicular Euclidian distance to the hyperplane (8)w2.2Among liner machines,the maximal margin classifier is intuitively preferable.102.3Two-dimensional example of a classification problem:separate’o’from’+’using a straight line.Suppose that we add bounded noise to each pattern.If the optimal margin hyperplane has marginρ,and the noise is boundedby r<ρ,then the line will correctly separates even the noisy patterns.[53]102.4Noisy pattern will be treated softly by permitting constraint violation(e.g.having functional marginξ<1),but the objective function will be penalizea cost C(1−ξ),whereξis functional margin (13)2.5An illustration of kernel-based algorithms.By mapping the original inputspace to other high dimensional feature space,the linearly inseparable datamay become linearly separable in the feature space (18)2.6Three points in R2shattered by oriented lines (21)2.7Gaussian RBF SVMs of sufficiently small width can classify an arbitrarylarge number of training points correctly,and thus have infinite VC dimen-sion[50] (23)2.8In -SV regression,training examples inside the tube of radius are notconsidered as mistakes.The trade-offbetween model complexity(or theflatness of the hyperplane)and points lying outside the tube is controlledby weighted -insensitive losses (24)+(1−m)C k2ij with m=0.4,C ij=0.7 (39)3.1f(k)=mC(1−k)2ij3.2Projection of vector z on the plane(x i,x j)in the input space (40)3.3Illustration of the marginal difference of a(original)support vector x withrespect to the original and simplified solutions (44)3.4Illustration of simplified support vector solution using proposed method.The decision boundaries constructed by the simplified machines with4SVs(right-top)and20SVs(right-bottom)are almost identical with thoseconstructed by the original machines with61SVs(left-top)and75SVs(left-bottom).The cracked lines represent vectors with approximately1marginal distance to the optimal hyperplane (46)3.5Thefirst100digits in the USPS dataset (47)3.6Performance comparison between the former top-down the the proposedbottom-up approach on the USPS dataset.With the same reduction ratethe bottom-up preserved better predictive accuracy,while computationalefficiency is guaranteed by theoretical result.Note:Top-down:the resultoffix-point iteration method in[37](Phase I);bottom-up:the result of pro-posed method(Phase I);Phase II:the result of proposed method runningwith both two phases optimization (52)3.7Display of all vectors in simplified solutions.The original10classifierstrained with polynomial kernel of degree three and the cost C=10consistof4538SVs and produce88errors(on2007testing data).The simplified10classifiers consist of270vectors and produce95errors.The numberbelow each image indicates the new weight of a reduced vector (53)4.1Relations among model complexity(horizontal axis),empirical risk(thedotted line),and expected risk(the solid line).The dash-dotted line is theupper-bound on the complexity term(confidence).[73] (56)4.2Different kernels produce different type of discriminant function (58)4.3Trade offbetween model complexity and empirical risk (59)4.4Common support vectors in two different machines learned from threedatasets sat-image,letter recognition,and shuttle:(a)linear machineslearned with different error penalties C=1and C=2,(b)polynomialmachines of degree two and three learned with the same C=1,(c)RBFmachines learned with different error penalties C=1and C=2 (64)4.5Illustration of initializing working set using result of previously trainedSVM.The optimized solution for machine(γ=10,C=10)(d)can bereached normally from an random initial solution(a),or more efficientlyfrom solution of a trained machine(γ=5,C=10)or(γ=10,C=1) (65)4.6Reduction in number of required optimization loops and training time onthree datasets sat-image(a-d-g),letter recognition(b-e-h),and shuttle(c-f-i),and in different situations:the same linear kernel with different cost(a-b-c),polynomial kernels of different degree with the same cost,and differentRBF kernels with different costs.”org.”denotes the original method withrandomly working set selection;”WS”denotes the proposed method.Allmeasures(average number of loops and training time)are normalized in to[0,1] (67)List of Tables2.1Decomposition algorithm for SVM training (28)3.1The simplification algorithm (45)3.2Reduction in number of support vectors and the corresponding loss in gen-eralization performance with different values of MMD.Original machines(the3rd and14th lines)were trained on the USPS training data usingGaussian and polynomial kernels.Errors were evaluated on the testing data.483.3Experimental results on45binary classifiers learned from the USPS datasetusing thefirst phase of the proposed method.Left-bottom:number of sup-port vectors in original classifiers/number of vectors in simplified classifiers.Right-top:number of errors on the test data of original classifiers-simpli-fied classifiers (49)3.4Experimental results on various applications (50)4.1Datasets used in experiments (66)Chapter1IntroductionIn this chapter wefirstly review the many efforts currently being made to improve the efficiency of the support vector learning approach.After that we mention some limitations of the previous methods and briefly introduce our solutions in simplifying support vector solutions and in speeding-up support vector training in a model selection process.Outline of this thesis will be given in the last section of this chapter.1.1Efforts in Improving the Efficiency of SupportVector LearningThe support vector learning[1,2,3,4]implements the following idea:itfinds an optimal hyperplane in feature space according to some optimization criterion,e.g.it is the optimal hyperplane that maximizes the distance to training examples in a two-class classification task,or maximizes theflatness of a function in regression.Thus,training a support vector machine(SVM)is equivalent to solving an optimization problem in which the number of variables to be optimized is l,and the number of parameter is l2,where l is the size of training data.This is apparently an expensive task in both memory requirement and computational power.Moreover,the optimal hyperplane lies in feature space which is constructed based on the choice of kernel.Selecting a suitable kernel for a specific application is still an open problem and SVM users have to do intensive trials of training and testing with different types of kernel and different values of parameters.Also,since the feature space does not exist explicitly the hyperplane,e.g.a classifier or a regressor,is characterized by a set of training examples called support vectors.To test a new pattern SVMs have to compare it with all these support vectors and this becomes a very time consuming work when the number of support vectors is large.In short,support vector is a rather computationally demanding learning approach,and in return,it can producehigh generalization ability machines in many practical applications.There have been different directions to deal with the high resource-demanding prop-erty of support vector training.The algorithmic approach tries tofind intelligent solutions for a quick convergence to the optimal solution with a limited memory available.From the observation that the SVM solutions are sparse,or many of training examples do not play any role in the forming of SVM solutions,chunking and decomposition methods[1,5] decompose the original quadratic programming(QP)problem into a series of smaller QP problems.These method has been shown to be able to handle problems with size exceeding the capacity of the computer,e.g.RAM memory.The sequential minimal optimization (SMO)[6]can be viewed as the extreme case of decomposition methods.In each iteration SMO solves a QP problem of size two using an analytical solution,thus no optimizer is required.The remaining problem of SMO is to choose a good pair of variable to optimize. The original heuristics presented in[6]are based on the level of violating the optimal con-dition.There have been several works,e.g.[7,8],trying to improve these heuristics.The general decomposition framework and some other implementation techniques like shrink-ing,kernel caching have been implemented in most currently available SVM softwares, e.g.SVM light[9],LIBSVM[10],SVMTorch[11],HeroSvm[12].The main obstacle for this approach is still the huge memory required for storing kernel elements when the number of training example exceeds a few hundreds of thousands.The second approach to solving large scale SVM is to parallelize the optimization.The main idea is to split training data into subsets and perform optimization on these subsets separately.The partial results are then combined andfiltered again into a”cascade”of SVMs[13,14],or a mixture of SVMs[15].However,the price we must pay is the possibility of losing predictive accu-racy because the combination of partial SVMs does not guarantee an optimal hyperplane, thus we might get a machine with lower performance than those trained by other learn-ing approaches[16].The third approach is to properly remove”unnecessary”examples from training data,thus simultaneously reducing the memory requirement as well as the training time.The reduced support vector machines method[17,18]reduce the size of the original kernel matrix from l×l to l×m,where m is the size of a randomly selected subset of training data considered as candidates of support vectors.The smaller matrix of size l×m(with m is much smaller than l)can be stored in memory,so optimization algorithms such as Newton method can be applied.Instead of random sampling,different techniques have been used to intelligently sample a small number of training examples from training data,ing cross-training[19],boosting[19],clustering[20,21],active learning[22,23,24],on-line and active learning[22].Another way to reduce the size of the optimization problem is applying different techniques to obtain low-rank approxima-tions on the kernel matrix using Nystr¨o m method[25],greedy approximation[26],matrix sampling[26]or matrix decomposition[27].The drawback of this approach still is that the resulted machines can only achieve a”similar”or a comparable performance with the machines trained on the original training data.There have been also many other efficient implementation techniques to achieve approximate support vector solutions with a low cost.The core support machines in[28]reformulates the optimization in SVM training as a minimum enclosing ball(MEB)problem in computational geometry,and then adopt an efficient approximate MEB algorithm to obtain approximately optimal solution.In [29]the authors consider the application of quantum computing to solve the problem of effective SVM training.Though training SVMs is computationally very expensive,SVM users have to spend most time for choosing a suitable kernel and appropriate parameter setting for their applications,or to deal with the model selection problem.In order to achieve a good machine,model selection has to solve two main tasks:to conduct a search in model space (a space of all available SVMs),and to evaluate the goodness of a model.Different search strategies have been proposed to improve the search,including grid search with different grid size[30],pattern search[31],and all common search strategies when applicable like gradient descent[32,32,33],genetic algorithms[34].The difficulty in conducting the search in model space is that there have been no theories to suggest this type of kernel will work better than the other on a given domain,or to determine the region of parameter values where we canfind the best one.Another way to speed-up model selection process is to efficiently evaluate each model in our consideration.In[35],the author proposed ξα-estimator specially designed for support vector machines.Theξα-estimator is based on the general leave-one-out method,but it is much more efficient because it does not require to perform re-sampling and retraining.The open question for model evaluation is that there is no dominated method in estimating the goodness of a model.In practice, SVM users estimate error rate of a machine mainly based on cross validation techniques like k-fold cross validation,which is very time consuming.One common property between support vector learning and instance-based learning is that they have to compare all instances included in their solution with the new pattern in testing phase(these instances are support vectors in SVMs and all training examples in nearest neighbor machines).Except for linear SVMs where the norm vector of the optimal hyperplane can be represented by a vector in input space,the solution of a nonlinear SVM is characterized by a linear combination of support vectors in feature space.Thus to classify a new pattern,SVMs have to compare it with every support vectors via kernel calculations.This computation becomes very expensive when the number of supportvector is large.The reduced set methods,e.g.[36,37,38],try to replace the original SVM by a simplified SVM which consists of fewer number of support vectors,called reduced SVs.The support vectors in the simplified solution can be newly created,or selected from the set of original support vectors.The limitation of this approach lies in the construction/selection of reduced SVs that faces local minimum problem.Another approach to speed-up SVMs is to approximate the comparison in the testing phase.In [39,40],the authors proposed to treat kernel machines as a special form of k-nearest neighbor machines.The result of testing phase is based on comparisons with nearest support vectors,where these SVs are determined in a pre-query analysis.These methods have been shown to produce very promising speed-up rate,but they require an extensive pre-query analysis and depend much on very sensitive parameters,thus cause practical difficulties for real life applications.In summary,support vector learning is a resource demanding learning approach.There have been a huge number of works trying to make support vector machines run faster in all training,model selection,and in testing phases.Our effort described in this dissertation is two folds:making SVMs run faster in testing phase and speeding-up the support vector training in a model selection process.1.2Problem and ContributionIn comparing with making support vector training and model selection run faster,speeding-up SVMs in testing phase is practically important,especially for real-time or on-line appli-cations like detection of objects in streaming video or in image[41,42,14,43,44,45,46], abnormal events detection[46,47],real-time business intelligence systems[20].In these applications,it is possible to train the machines in hours,or days,but the respond time must be limited in a restrictive period.The reduced set methods briefly introduced above have been successfully used for reducing the complexity of SVMs in many applications like handwritten character recognition[48,49],face detection in a large collection of im-ages[14].However,the main difficulty still lies in the fact that it is impossible to exactly replace a complicated linear combination of many vectors in feature space by a simple one,except for linear SVMs.For linear SVMs we can represent the optimal hyperplane by only two parameters:the norm vector which is also a vector in the input space,and the bias.For nonlinear SVMs,because the feature space is constructed implicitly then the normal vector must be represented by a linear combination of images of input support vectors.The reduced set approach has no way but approximates the original combination by a fewer number of SVs,called the reduced SVs.In previous methods,constructingeach new support vector requires to minimize a multivariate function with local minima. Because we cannot know the global minimum has been reached or not,the construction has to repeat the search many times with different initial guesses.This repetition must be applied for every reduced SV in order to arrive at thefinal reduced solution,and there is also no way but to determine the goodness of the reduced solution experimentally. Our attempt in this research direction is to propose a conceptually simpler and compu-tationally less expensive method to simplify support vector solutions.Starting from a mechanical point of view in which if each SV exerts a force on the optimal hyperplane then support vector solutions satisfy the conditions of mechanical equilibrium[50],and in an equilibrium system if we replace two member forces by an equivalent one,the stable state will not change.Thus,instead of constructing reduced vectors set incrementally like in the previous reduced set methods,two nearest SVs will be iteratively considered and replaced by a newly constructed vector.This approach leads to the construction of each new vector only requiring tofind the unique maximum point of a one-variable function on(0,1),and thefinal reduced set is unique for each running time.Experimental results showed that this method is effective in reducing the number of support vectors and preserving generalization performance.To control the possible lost in generalization performance,we propose a quantity called maximal marginal difference to estimate the difference between the original SVM solution and the simplified one.The simplification process will stop before it makes the estimated difference exceed a given threshold.Our second contribution is devoted for speeding-up the support vector training in a model selection process.By conducting intensive experiments we reconfirm that two different machines trained by two different parameter settings,or even two different choices of kernel,share a big number of support vectors.This observation suggests an inheritance mechanism in which training a new SVM in a model selection process can benefit from the results of previously trained machines.In the general decomposition framework,we propose to initialize each new working set by a set of all SVs found in previously trained machines.Moreover,if two machines use the same kernel function then one’s solution can be adjusted and used as the initial point in searching for the the other’s solution. This initialization makes thefirst local solution closer to the global solution,and the decomposition algorithm converges more quickly.Experimental results indicated that we can reduce22.8-85.5%the training time without any impact on the result of model selection.1.3Thesis OutlineChapter2introduces basic concepts in support vector learning.Especially it em-phasizes critical properties of the optimal hyperplane and the use of kernel in classical classification and regression tasks.We intended to discuss in more detail the two most commonly used kernels:Gaussian RBF and polynomial,and the decomposition algorithm for SVM training.These fundamentals will be used in other chapters.Chapter3describes attempts in making SVMs run faster in testing phase.Firstly it reviews existing methods for reducing the complexity of SVMs by reducing the number of necessary SVs included in SVM solutions.It then describes our proposed bottom-up method for replacing two SVs by a new one and the whole iterative simplification process, including a selection heuristic and a stopping condition.Experiments will be reported next,and this chapter ends with conclusions.Chapter4introduces the model selection problem for support vector machines and the many efforts in making this process more efficient.It then describes a technique to speed-up SVM training in a model selection process by inheriting the result among different SVMs under consideration.Experiments on various benchmark datasets are described next for illustrating the effectiveness of the proposed method.Chapter5concludes this dissertation with summarization of methodology,contribu-tion,as well as limitation of the proposed methods.It alsofigures out open problems for a further research in future.。
support vector machine for histogram-based image classification
IEEE TRANSACTIONS ON NEURAL NETWORKS,VOL.10,NO.5,SEPTEMBER19991055 Support Vector Machines forHistogram-Based Image ClassificationOlivier Chapelle,Patrick Haffner,and Vladimir N.VapnikAbstract—Traditional classification approaches generalizepoorly on image classification tasks,because of the highdimensionality of the feature space.This paper shows thatsupport vector machines(SVM’s)can generalize well on difficultimage classification problems where the only features arehigh dimensional histograms.Heavy-tailed RBF kernels ofthe form K(x;y)=e00y jPublisher Item Identifier S1045-9227(1056IEEE TRANSACTIONS ON NEURAL NETWORKS,VOL.10,NO.5,SEPTEMBER 1999bythesolution of the maxi-mization problem (3)has been found,theOSHhas the followingexpansion:(4)The support vectors are the points forwhichwith(6)to allow the possibility of examples that violate (2).The purpose of thevariablesis chosen by the user,alargerin (7),the penalty termfor misclassifications.When dealing with images,most of the time,the dimension of the input space is large(has in this case little impact on performance.C.Nonlinear Support Vector MachinesThe input data is mapped into a high-dimensional feature space through some nonlinear mapping chosen a priori [8].In this feature space,the OSH is constructed.If wereplace ,(3)becomesisneeded in the training algorithm and themappingsuchthatsatisfying Mercer’s condition has beenchosen,the training algorithm consists ofminimizing(8)and the decision functionbecomeshyperplanes areconstructed,wheredecisionfunctionsis givenby ,i.e.,the class withthe largest decision function.We made the assumption that every point has a single label.Nevertheless,in image classification,an image may belong to several classes as its content is not unique.It would be possible to make multiclass learning more robust,and extend it to handle multilabel classification problems by using error correcting codes [12].This more complex approach has not been experimented in this paper.III.T HE D ATAAND I TSR EPRESENTATIONAmong the many possible features that can be extracted from an image,we restrict ourselves to ones which are global and low-level (the segmentation of the image into regions,objects or relations is not in the scope of the present paper).CHAPELLE et al.:SVM’S FOR HISTOGRAM-BASED IMAGE CLASSIFICATION 1057The simplest way to represent an image is to consider its bitmap representation.Assuming the sizes of the images inthe database are fixedto(for the width),then the input data for the SVM are vectorsofsizefor grey-level images and3for color images.Each component of the vector is associated to a pixel in the image.Some major drawbacks of this representation are its large size and its lack of invariance with respect to translations.For these reasons,our first choice was the histogram representation which is described presently.A.Color HistogramsIn spite of the fact that the color histogram technique is a very simple and low-level method,it has shown good results in practice [2]especially for image indexing and retrieval tasks,where feature extraction has to be as simple and as fast as possible.Spatial features are lost,meaning that spatial relations between parts of an image cannot be used.This also ensures full translation and rotation invariance.A color is represented by a three dimensional vector corre-sponding to a position in a color space.This leaves us to select the color space and the quantization steps in this color space.As a color space,we chose the hue-saturation-value (HSV)space,which is in bijection with the red–green–blue (RGB)space.The reason for the choice of HSV is that it is widely used in the literature.HSV is attractive in theory.It is considered more suitable since it separates the color components (HS)from the lu-minance component (V)and is less sensitive to illumination changes.Note also that distances in the HSV space correspond to perceptual differences in color in a more consistent way than in the RGB space.However,this does not seem to matter in practice.All the experiments reported in the paper use the HSV space.For the sake of comparison,we have selected a few experiments and used the RGB space instead of the HSV space,while keeping the other conditions identical:the impact of the choice of the color space on performance was found to be minimal compared to the impacts of the other experimental conditions (choice of the kernel,remapping of the input).An explanation for this fact is that,after quantization into bins,no information about the color space is used by the classifier.The number of bins per color component has been fixedto 16,and the dimension of each histogram is.Some experiments with a smaller number of bins have been undertaken,but the best results have been reached with 16bins.We have not tried to increase this number,because it is computationally too intensive.It is preferable to compute the histogram from the highest spatial resolution available.Subsampling the image too much results in significant losses in performance.subsampling,the histogram loses its sharp peaks,as pixel colors turn into averages (aliasing).B.Selecting Classes of Images in the Corel Stock Photo CollectionThe Corel stock photo collection consists of a set of photographs divided into about 200categories,each one with100images.For our experiments,the original 200categories have been reduced using two different labeling approaches.In the first one,named Corel14,we chose to keep the cat-egories defined by Corel.For the sake of comparison,we chose the same subset of categories as [13],which are:air shows,bears,elephants,tigers,Arabian horses,polar bears,African specialty animals,cheetahs-leopards-jaguars,bald eagles,mountains,fields,deserts,sunrises-sunsets,night scenes .It is important to note that we had no influence on the choices made in Corel14:the classes were selected by [13]and the examples illustrating a class are the 100images we found in a Corel category.In [13],some images which were visually deemed inconsistent with the rest of their category were removed.In the results reported in this paper,we use all 100images in each category and kept many obvious outliers:see for instance,in Fig.2,the “polar bear alert”sign which is considered to be an image of a polar bear.With 14categories,this results in a database of 1400images.Note that some Corel categories come from the same batch of photographs:a system trained to classify them may only have to classify color and exposure idiosyncracies.In an attempt to avoid these potential problems and to move toward a more generic classification,we also defined a second labeling approach,Corel7,in which we designed our own seven categories:airplanes,birds,boats,buildings,fish,people,vehicles .The number of images in each category varies from 300to 625for a total of 2670samples.For each category images were hand-picked from several original Corel categories.For example,the airplanes category includes images of air shows,aviation photography,fighter jets and WW-II planes .The representation of what is an airplane is then more general.Table I shows the origin of the images for each category.IV.S ELECTINGTHEK ERNELA.IntroductionThe design of the SVM classifier architecture is very simple and mainly requires the choice of the kernel (the only other parameter isandresults in a classifier which has a polynomial decisionfunction.1058IEEE TRANSACTIONS ON NEURAL NETWORKS,VOL.10,NO.5,SEPTEMBER1999Fig.1.Corel14:each row includes images from the following seven categories:air shows,bears,Arabian horses,night scenes,elephants,bald eagles, cheetahs-leopards-jaguars.Encouraged by the positive results obtainedwithwhereIt is not known if the kernel satisfies Mercer’s condition.1Another obvious alternative is theCHAPELLE et al.:SVM’S FOR HISTOGRAM-BASED IMAGE CLASSIFICATION1059Fig.2.Corel14:each row includes images from the following seven categories:Tigers,African specialty animals,mountains,fields,deserts,sun-rises-sunsets,polar bears.B.ExperimentsThe first series of experiments are designed to roughly assess the performance of the aforementioned input represen-tations and SVM kernels on our two Corel tasks.The 1400examples of Corel14were divided into 924training examples and 476test examples.The 2670examples of Corel7were split evenly between 1375training and test examples.The SVM error penaltyparametervalues were selectedheuristically.More rigorous procedures will be described in the second series of experiments.Table II shows very similar results for both the RBG and HSV histogram representations,and also,with HSV histograms,similar behaviors between Corel14and Corel7.The “leap”in performance does not happen,as normally expected by using RBF kernels but with the proper choice of metric within the RBF placian or1060IEEE TRANSACTIONS ON NEURAL NETWORKS,VOL.10,NO.5,SEPTEMBER 1999TABLE IH AND -L ABELEDC ATEGORIES U SED WITHTHECOREL DATABASETABLE IIE RROR R ATES U SING THEF OLLOWING K ERNELS :L INEAR ,P OLYNOMIAL OFD EGREE 2,G AUSSIAN RBF,L APLACIAN RBF AND 2RBFTABLE IIIE RROR R ATES WITHKNNSVM’s.To demonstrate this,we conducted some experiments of image histogram classification with aK-nearest neighbors(KNN)algorithm with the distancesgave the best results.Table III presents the results.As expected,the64images.Except in the linearcase,the convergence of the support vector search process was problematic,often finding a hyperplane where every sample is a support vector.The same database has been used by [13]with a decision tree classifier and the error rate was about 50%,to 47.7%error rate obtained with the traditional combination of an HSV histogram and a KNN classifier.The 14.7%error rate obtained with the Laplacian or-pixel bin in the histogram accounts for a singleuniform color region in the image (with histogrampixels to aneighboring bin,resulting in a slightly different histogram,the kernel values arewithThe decay rate around zero is given by:decreasing the value ofwould provide for a slower decay.A data-generating interpretation of RBF’s is that they corre-spond to a mixture of local densities (generally in this case,lowering the value of(Gaussian)to (Laplacian)oreven(Sublinear)[16].Note that if we assume that histograms are often distributed around zero (only a few bins have nonzero values),decreasing the value of.22Aneven more general type of Kernel is K (x ;y )=e 0 dc:Decreasing the value of c does not improve performance as much as decreasing a and b ,and significantly increases the number of support vectors.CHAPELLE et al.:SVM’S FOR HISTOGRAM-BASED IMAGE CLASSIFICATION1061Fig.3.Corel7:each row includes images from the following categories:airplanes,birds,boats,buildings,fish,people,cars.The choice ofdoes not have to be interpreted in terms of kernelproducts.One can see it as the simplest possible nonlinearremapping of the input that does not affect the dimension.The following gives us reasons to believe that,thenumber of pixels is multiplied by,and-exponentiation could lower thisquadratic scaling effect to a more reasonable,which transforms all thecomponents which are not zero to one(we assume that1062IEEE TRANSACTIONS ON NEURAL NETWORKS,VOL.10,NO.5,SEPTEMBER 1999For the reasons stated in Section III.A,the only imagerepresentation we consider here is the1616HSV histogram.Our second series of experiments attempts todefine a rigor-ous procedure to choosehas to be chosen large enoughcompared to the diameter of the sphere containingthe input data (The distance betweenis equal to .With proper renormalization ofthe input data,we can setas In the linear case,the diameter of the data depends on theway it is normalized.The choice ofwith,(7)becomes(10)Similar experimental conditions are applied to both Corel7and Corel14.Each category is divided into three sets,each containing one third of the images,used as training,validation and test sets.For each value of the input renormalization,support vectors are obtained from the training set and tested on the validation set.Theand sum up to or 1.Nonoptimal10and can be computed in advance.sqrt square root exp exponentialExcept in the sublinear RBF case,the number of flt is the dominating factor.In the linear case,the decision function (5)allows the support vectors to be linearly combined:there is only one flt per class and component.In the RBF case,there is one flt per class,component and support vector.Because of the normalization by 7.In the sublinear RBF case,the number of sqrt is dom-inating.sqrt is in theory required for each component of the kernel product:this is the number we report.It is a pessimistic upper bound since computations can be avoided for components with value zero.E.ObservationsThe analysis of the Tables IV–VI shows the following characteristics that apply consistently to both Corel14and Corel7:CHAPELLE et al.:SVM’S FOR HISTOGRAM-BASED IMAGE CLASSIFICATION 1063TABLE VIC OMPUTATIONAL R EQUIREMENTS FOR C OREL 7,R EPORTED AS THEN UMBER OF O PERATIONS FOR THE R ECOGNITION OF O NE E XAMPLE ,D IVIDED BY 724096•As anticipated,decreasing.(comparecolumntolineandto 0.25makes linear SVM’s a very attractivesolution for many applications:its error rate is only 30%higher than the best RBF-based SVM,while its compu-tational and memory requirements are several orders of magnitude smaller than for the most efficient RBF-based SVM.•Experiments withwith the validation set,a solutionwith training misclassifications was preferred (around 1%error on the case of Corel14and 5%error in the case of Corel7).Table VII presents the class-confusion matrix corresponding to the use of the Laplacian kernel on Corel7with(these values yield the best results for both Corel7and Corel14).The most common confusions happen between birds and airplanes ,which is consistent.VI.S UMMARYIn this paper,we have shown that it is possible to push the classification performance obtained on image histograms to surprisingly high levels with error rates as low as 11%for the classification of 14Corel categories and 16%for a more generic set of objects.This is achieved without any other knowledge about the task than the fact that the input is some sort of color histogram or discrete density.TABLE VIIC LASS -C ONFUSION M ATRIX FOR a =0:25AND b =1:0.F ORE XAMPLE ,R OW (1)I NDICATES T HAT ON THE 386I MAGES OF THE A IRPLANES C ATEGORY ,341H A VE B EEN C ORRECTLY C LASSIFIED ,22H A VE B EEN C LASSIFIED IN B IRDS ,S EVEN IN B OATS ,F OUR IN B UILDINGS ,AND 12IN V EHICLESThis extremely good performance is due to the superior generalization ability of SVM’s in high-dimensional spaces to the use of heavy-tailed RBF’s as kernels and to nonlin-ear transformations applied to the histogram bin values.We studied how the choice of the-exponentiation withandimproves the performance of linearSVM’s to such an extent that it makes them a valid alternative to RBF kernels,giving comparable performance for a fraction of the computational and memory requirements.This suggests a new strategy for the use of SVM’s when the dimension of the input space is extremely high.kernels intended at making this dimension even higher,which may not be useful,it is recommended to first try nonlinear transformations of the input components in combination with linear SVM’s.The computations may be orders of magnitude faster and the performances comparable.This work can be extended in several ways.Higher-level spatial features can be added to the histogram features.Al-lowing for the detection of multiple objects in a single image would make this classification-based technique usable for image retrieval:an image would be described by the list of objects it contains.Histograms are used to characterize other types of data than images,and can be used,for instance,for fraud detection applications.It would be interesting to investigate if the same type of kernel brings the same gains in performance.R EFERENCES[1]W.Niblack,R.Barber,W.Equitz,M.Flickner, D.Glasman, D.Petkovic,and P.Yanker,“The qbic project:Querying image by content1064IEEE TRANSACTIONS ON NEURAL NETWORKS,VOL.10,NO.5,SEPTEMBER1999using color,texture,and shape,”SPIE,vol.1908,pp.173–187,Feb.1993.[2]M.Swain and D.Ballard,“Indexing via color histograms,”Int.J.Comput.Vision,vol.7,pp.11–32,1991.[3]V.Vapnik,The Nature of Statistical Learning Theory.New York:Springer-Verlag,1995.[4]V.V apnik,Statistical Learning Theory.New York:Wiley,1998.[5]P.Bartlett and J.Shawe-Taylor,“Generalization performance of supportvector machines and other pattern classifiers,”in Advances in Ker-nel Methods—Support Vector Learning.Cambridge,MA:MIT Press, 1998.[6]M.Bazaraa and C.M.Shetty,Nonlinear Programming New York:Wiley,1979.[7] C.Cortes and V.Vapnik,“Support vector networks,”Machine Learning,vol.20,pp.1–25,1995.[8] B.E.Boser,I.M.Guyon,and V.N.Vapnik,“A training algorithm foroptimal margin classifier,”in Proc.5th ACM put.Learning Theory,Pittsburgh,PA,July1992,pp.144–152.[9]J.Weston and C.Watkins,“Multiclass support vector machines,”Univ.London,U.K.,Tech.Rep.CSD-TR-98-04,1998.[10]M.Pontil and A.Verri,“Support vector machines for3-d objectrecognition,”in Pattern Anal.Machine Intell.,vol.20,June1998. [11]V.Blanz,B.Sch¨o lkopf,H.B¨u lthoff,C.Burges,V.Vapnik,and T.Vetter,“Comparison of view-based object recognition algorithms using realistic3d models,”in Artificial Neural Networks—ICANN’96,Berlin, Germany,1996,pp.251–256.[12]R.Schapire and Y.Singer,“Improved boosting algorithms usingconfidence-rated predictions,”in put.Learning Theory,1998.[13] C.Carson,S.Belongie,H.Greenspan,and J.Malik,“Color-andtexture-based images segmentation using em and its application to image querying and classification,”submitted to Pattern Anal.Machine Intell., 1998.[14] B.Sch¨o lkopf,K.Sung,C.Burges,F.Girosi,P.Niyogi,T.Poggio,andV.Vapnik,“Comparing support vector machines with Gaussian kernels to radial basis function classifiers,”Massachusetts Inst.Technol.,A.I.Memo1599,1996.[15] B.Schiele and J.L.Crowley,“Object recognition using multidimen-sional receptivefield histograms,”in ECCV’96,4th European Conf.Comput.Vision,vol.I,1996,pp.610–619.[16]S.Basu and C.A.Micchelli,“Parametric density estimation for theclassification of acoustic feature vectors in speech recognition,”in Nonlinear Modeling:Advanced Black-Box Techniques,J.A.K.Suykens and J.Vandewalle,Eds.Boston,MA:Kluwer,1998.[17] E.Osuna,R.Freund,and F.Girosi,“Training support vector machines:An application to face detection,”in IEEE CVPR’97,Puerto Rico,June 17–19,1997.[18] E.Osuna,R.Freund,and F.Girosi,“Improved training algorithm forsupport vector machines,”in IEEE NNSP’97,Amelia Island,FL,Sept.24–26,1997.Olivier Chapelle received the B.Sc.degree in com-puter science from the Ecole Normale Sup´e rieurede Lyon,France,in1998.He is currently studyingthe M.Sc.degree in computer vision at the EcoleNormale Superieure de Cachan,France.He has been a visiting scholar in the MOVIcomputer vision team at Inria,Grenoble,France,in1997and at AT&T Research Labs,Red Bank,NJ,in the summer of1998.For the last three months,hehas been working at AT&T Research Laboratorieswith V.Vapnik in thefield of machine learning. His research interests include learning theory,computer vision,and support vectormachines.Patrick Haffner received the bachelor’s degreefrom Ecole Polytechnique,Paris,France,in1987and from Ecole Nationale Sup´e rieure desT´e l´e communications(ENST),Paris,France,in1989.He received the Ph.D.degree in speechand signal processing from ENST in1994.In1988and1990,he worked with A.Waibelon the design of the TDNN and the MS-TDNNarchitectures at A TR,Japan,and Carnegie MellonUniversity,Pittsburgh,PA.From1989to1995,asa Research Scientist for CNET/France-T´e l´e com in Lannion,France,he developed connectionist learning algorithms for telephone speech recognition.In1995,he joined AT&T Bell Laboratories and worked on the application of Optical Character Recognition and transducers to the processing offinancial documents.In1997,he joined the Image Processing Services Research Department at AT&T Labs-Research.His research interests include statistical and connectionist models for sequence recognition,machine learning,speech and image recognition,and information theory.Vladimir N.Vapnik,for a photograph and biography,see this issue,p.999.。
Robust twin support vector machine for pattern classification
Robust twin support vector machine for pattern classificationZhiquan Qi,Yingjie Tian n ,Yong Shi nResearch Center on Fictitious Economy &Data Science,Chinese Academy of Sciences,Beijing 100190,Chinaa r t i c l e i n f oArticle history:Received 27December 2011Received in revised form 22June 2012Accepted 27June 2012Keywords:ClassificationTwin support vector machine Second order cone programming Robusta b s t r a c tIn this paper,we proposed a new robust twin support vector machine (called R -TWSVM)via second order cone programming formulations for classification,which can deal with data with measurement noise efficiently.Preliminary experiments confirm the robustness of the proposed method and its superiority to the traditional robust SVM in both computation time and classification accuracy.Remarkably,since there are only inner products about inputs in our dual problems,this makes us apply kernel trick directly for nonlinear cases.Simultaneously we does not need to solve the extra inverse of matrices,which is totally different with existing TWSVMs.In addition,we also show that the TWSVMs are the special case of our robust model and simultaneously give a new dual form of TWSVM by degenerating R-TWSVM,which successfully overcomes the existing shortcomings of TWSVM.&2012Elsevier Ltd.All rights reserved.1.IntroductionFor the last decade,support vector machines (SVMs)[1–3],as powerful tools for pattern classification and regression,have already successfully applied in a wide variety of fields [4–12].For the standard support vector classification (SVC),the basic idea is to find the optimal separating hyperplane between the positive and negative examples.The optimal hyperplane may be obtained by maximizing the margin between two parallel hyperplanes,which involves the minimization of a quadratic programming problem (QPP).By introducing kernel trick into the dual QPP,SVC can also solve nonlinear classification problem successfully.Recently,Jayadeva et al.[13]proposed a twin support vector machine (TWSVM)classifier for binary classification,motivated by GEPSVM [14].TWSVMs generate two nonparallel planes such that each plane is closer to one of two classes and is at least one distance from the other.It is implemented by solving two smaller QPPs rather than a single large QPP,which makes the learning speed of TWSVM is more faster than that of a classical SVM.Experimental results in [13,15]show the superiority of TWSVM over both standard SVM and GEPSVM on UCI datasets.Some extensions to the TWSVM can be found in [15–18].For the methods mentioned above,the parameters in the training sets are implicitly assumed to be known exactly.However,in real world applications,the parameters have perturbations since they are estimated from the data subject to measurement and statistical errors [19,20].Goldfarb et al.pointed out that the solutions to optimizationproblems are typically sensitive to parameter perturbations,errors in the input parameters tend to get amplified in the decision function,which often results in misclassification.For instance,for the fixed examples,original discriminants can correctly separate them (see Fig.1(a)).When each example is allowed to move in a sphere,original decision function cannot separate samples in the worst case (see Fig.1(b)).So the goal is to explore a robust model which can deal with data set with measurement or statistical errors (see Fig.1(c)).There are many methods of constructing the robust SVMs.Bi and Zhang derived a general statistical formulation where unobserved input is modeled as a hidden mixture component [21];Literatures [20,22–24]employed second order cone programming (SOCP)meth-ods to handle the missing and uncertain data;literatures [25–27]construct the robust models by the ramp loss functions.Xu et al.solved the robust classification problem for a class of non-box-typed uncertainty sets,and providing a linkage between robust classifica-tion and the standard regularization scheme of SVMs [28,29].Other related works also can found in [30–34].In this paper,following the line of the research in [19,20,24,30],we proposed a new robust twin support vector machines for data with measurement errors (called R -TWSVM),which is represented as a second-order cone programming (SOCP)[35].Second-order cone (SOC)is also called the Lorentz cone:Definition 1.1(Second-order cone ).The cone K is called a second-order cone L m ifK ¼f u ¼u 1A R 9u 1Z 0g ,m ¼1;f u ¼ðu 1,u 2,...,u m ÞT A R m 9u 1Z ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiu 22þÁÁÁþu 2m q g ,m Z 2:8<:The SOCP is a special convex optimization problem involvingSOC constraints,which can be efficiently solved by interior pointContents lists available at SciVerse ScienceDirectjournal homepage:/locate/prPattern Recognition0031-3203/$-see front matter &2012Elsevier Ltd.All rights reserved./10.1016/j.patcog.2012.06.019nCorresponding authors.E-mail addresses:qizhiquan@ (Z.Qi),tyj@ (Y.Tian),yshi@ (Y.Shi).Pattern Recognition ](]]]])]]]–]]]methods.Related work can be found in [20,36–38].The proposed R -TWSVM has the following compelling properties.To our knowledge,R -TWSVM is the first TWSVM implementa-tion of dealing with data with measurement noise,which is an useful extension of TWSVM.We show that the TWSVM and TBSVM [39](TWSVM with the regular terms)are the special cases of our robust models.This provides an alternative explanation to the success of R -TWSVM.Although TWSVMs have had great success in classification,they have the following shortcomings:(1)unlike standard SVMs,the dual problem of TWSVMs does not have inner products form about samples,which makes TWSVMs have to use an approximate technology [13]to solve the nonlinear case.This means that TWSVMs need to solve two problems for linear case and two other problems for nonlinear case sepa-rately.At the same time,in order to realize the structural risk minimization (SRM)principle,TWSVMs have to add different regularization terms for linear and nonlinear cases owing to the reason above [39].So far,there is not a clear theoretical explanation for these regularization terms,especially for the nonlinear case.(2)Although TWSVMs only solve two smaller QPPs,they have to compute the inverse of matrices,1it is in practice intractable or even impossible for a large data set.In this paper,we overcome the shortcomings above successfully.There are only inner products about samples in our dual problems of R -TWSVM,this make us not need to add theother regularization term J u J 2þb 2[39]for nonlinear case.Correspondingly,kernel trick can also be applied to our model directly due to the reason above.In addition,we also give a new dual form of TWSVM by degenerating R -TWSVM,which successfully overcomes the shortcomings mentioned above as well.Compared with traditional robust optimization models such as [20,22–24],we solve two SOCP problems of a smaller size instead of a large sized one.This makes R -TWSVM almost faster than these algorithms above.Theoretical analysis and the results of all the experiments show that the R -TWSVM is approximately four times faster than the usual R-SVM [20].Different with exiting robust SVM models,R -TWSVM use two nonparallel hyperplanes for two classes to construct the final decision function,which factually exploits the data’s structural information while margin maximizing to the missing and uncertain data.This can improve the model’s generalized capability efficiently [14,13,39].In addition,R -TWSVM employs two different loss function:a quadratic loss function making the proximal hyperplane close enough to the class itself,and a soft-margin loss function making the hyperplaneas far as possible from the other class,which results that almost all the points in this class and some points in the other class contribute to each final decision function.This makes R -TWSVM have stronger insensitivity to missing or uncertain data with label noise.The remaining content is organized as follows.In Section 2,we briefly introduces the background of SVM and TWSVM.In Section 3,describe the detail of R -TWSVM.In Section 4,we show experiments of R -TWSVM on various data sets.We conclude this work in Section 5.2.Background2.1.Support vector classification (SVC)For classification about the training data T ¼fðx 1,y 1Þ,...,ðx l ,y l Þg A ðR n ÂY Þl ,ð1Þwhere x i A R n ,y i A Y ¼f 1,À1g ,i ¼1,...,l .SVM’s linear softmargin algorithm is to solve the following primal QPP:minw ,b ,x1J w J 22þC X l i ¼1x i s :t :y i ðw >x i þb ÞZ 1Àx i ,x i Z 0,i ¼1;2,...,l ,ð2Þwhere C is a penalty parameter and x i are the slack variables.Thegoal is to find an optimal separating hyperplane w >x þb ¼0,ð3Þwhere x A R n .The Wolfe dual of (2)can be expressed as maxaX l j ¼1a j À1Xl i ¼1X lj ¼1y i y j ðx i Áx j Þa i ajs :t :X l i ¼1y i a i ¼0,0r a i r C ,i ¼1,...,l ,ð4Þwhere a A R l are lagrangian multipliers.The optimal separating hyperplane of (3)can be given byw ¼X l i ¼1a n i y i x i ,b ¼1sv y j ÀX N sv i ¼1a n i y i ðx i Áx j Þ !,ð5Þwhere a n is the solution of the dual (4),N sv represents the number of support vectors such that 0o a o C .A new sample is classified as þ1or À1according to the finally decision function f ðx Þ¼sgn ððw Áx Þþb Þ.2.2.Twin support vector machine (TWSVM)Consider a binary classification problem of l 1positive points and l 2negative points ðl 1þl 2¼l Þ.Suppose that data points belong to positive class are denoted by A A R l 1Ân ,where each row A i A R n represents a data point.Similarly,B A R l 2Ân represents all thedataFig.1.(a)The original examples and discriminants;(b)the effect of measurement noises;(c)the result of robust model.1Although literature [40]uses the Sherman–Morison–Woodbury formula [41]to simply the matrix inversion’s computational complexity,it is still a difficult task when the dimension or sizes of data is very high.Z.Qi et al./Pattern Recognition ](]]]])]]]–]]]2points belong to negative class.For the linear case,the TWSVM [13]determines two nonparallel hyperplanes:f þðx Þ¼w >þx þb þ¼0andf Àðx Þ¼w >Àx þb À¼0,ð6Þwhere w þ,w ÀA R n ,b þ,b ÀA R .Here,each hyperplane is closer toone of the two classes and is at least one distance from the other.A new data point is assigned to positive class or negative class depending upon its proximity to the two nonparallel hyperplanes.Formally,for finding the positive and negative hyperplanes,the TWSVM optimizes the following two respective QPPs:minw þ,b þ,x12J Aw þþe þb þJ 2þc 1e >Àxs :t :ÀðBw þþe Àb þÞþx Z e À,x Z 0ð7Þandminw À,b À,Z12J Bw Àþe Àb ÀJ 2þc 2e >þZ s :t :ðAw Àþe þb ÀÞþZ Z e þ,Z Z 0,ð8Þwhere c 1,c 2Z 0are the pre-specified penalty factors,e þ,e Àarevectors of ones of appropriate dimensions.By introducing the Lagrangian multipliers,the Wolfe dual of QPPs (7)and (8)can be represented as follows:maxae >Àa À1a >G ðH >H ÞÀ1G >a s :t :0r a r c 1e Àð9Þand maxbe >þb À12b >P ðQ >Q ÞÀ1P >bs :t :0r b r c 2e þ,ð10Þwhere G ¼½B e À ,H ¼½A e þ ,P ¼½A e þand Q ¼½B e À ,a A R m 2,b A R m 1are Lagrangian multipliers.The non-parallel hyperplanes (6)can be obtained from the solutions a and b of (9)and (10)by v 1¼ÀðH >H ÞÀ1G >a where v 1¼½w >þb þ >,v 2¼ÀðQ >Q ÞÀ1P >bwhere v 1¼½w >Àb À >:ð11ÞFor the nonlinear case,we can refer to the literature [13].3.Robust twin support vector machine ðR -TWSVM Þ3.1.Linear R -TWSVMWe firstly give the formal representation of robust classifica-tion learning problem.Given a training set T ¼fðX 1,y 1Þ,...,ðX l ,y l Þg ,ð12Þwhere y i A Y ¼f 1,À1g ,i ¼1,...,l ,and input set X i is a sphere within r i radius of the x i center:X i ¼f x i 9x i ¼x i þr i u i g ,i ¼1,...,l ,J u i J r 1,ð13Þx i is the true value of the training data,u i A R n ,r i is a given constant.The goal is to induce a real-valued function y ¼sgn ðg ðx ÞÞð14Þto infer the label y corresponding to any example x in R n space.Generally,such problem is caused by measurement errors,where r i reflects the measurement accuracy.In order to obtain the optimization decision function of (14),by introducing 12J w þJ 2,(7)can be written as the following robust optimization problem:minw þ,b þ,x12J w þJ 2þ12J ½ðw þÁx 1Þþb þ,...,ðw þÁx l 1Þþb þ J 2þc 1Xl i ¼l 1þ1x i s :t :Àððw þÁðx i þr i u i ÞÞþb þÞþZ 1Àx i ,8J u i J r 1i ¼l 1þ1,...,l ,x i Z 0,i ¼l 1þ1,...,l :ð15ÞSincemin f y i r i ðw Áu i Þ,J u i J r 1g ¼Àr i J w þJ ,ð16Þproblem (15)can be converted to minw þ,b þ,x1J w þJ 2þ1J ½ðw þÁx 1Þþb þ,...,ðw þÁx l 1Þþb þ J 2þc 1Xl i ¼l 1þ1x is :t :Àððw þÁx i Þþb þÞÀr i J w þJ Z 1Àx i ,i ¼l 1þ1,...,l x i Z 0,i ¼l 1þ1,...,l :ð17ÞBy introducing new variables t 1,t 2and setting J w þJ r t 1,J ½ðw þÁx 1Þþb þ,...,ðw þÁx l 1Þþb þ J r t 2,The above problem becomes minw þ,b þ,x ,t 1,t 21t 21þ1t 22þc 1X l i ¼l 1þ1x i s :t :Àððw þÁx i Þþb þÞÀr i t 1Z 1Àx i ,i ¼l 1þ1,...,l ,x i Z 0,i ¼l 1þ1,...,l J w þJ r t 1,J ½ðw þÁx 1Þþb þ,...,ðw þÁx l 1Þþb þ J r t 2:ð18ÞFor replacing t 21,t 22in the objective function (18),we introduce new variables u 1,u 2,v 1,v 2and satisfy the linear constraintsu i þv i ¼1,i ¼1;2and second order cone constraints ffiffiffiffiffiffiffiffiffiffiffiffiffiffit 2i þv 2i q ru i .Therefore,problem (18)can be reformulated as the followingsecond order cone program (SOCP):minY 112ðu 1Àv 1Þþ12ðu 2Àv 2Þþc 1X l i ¼l 1þ1x i s :t :Àððw þÁx i ÞÞþb þÞÀr i t 1Z 1Àx i ,i ¼l 1þ1,...,l ,x i Z 0,i ¼l 1þ1,...,l ,u 1þv 1¼1,u 2þv 2¼1,J w þJ r t 1,ffiffiffiffiffiffiffiffiffiffiffiffiffiffit 21þv 21q r u 1,ffiffiffiffiffiffiffiffiffiffiffiffiffiffit 22þv 22q r u 2,J ½ðw þÁx 1Þþb þ,...,ðw þÁx l 1Þþb þ J r t 2,ð19Þwhere Y 1¼½w >þ,b þ,x >,t 1,t 2,u 1,v 1,u 2,v 2 >.By the optimization theory [42],the dual problem of (19)can be expressed asmaxY 2b 1þb 2þX l i ¼l 1þ1a is :t :b 1þz u 1¼12,b 1þz v 1¼À12,b 2þz u 2¼1,b 2þz v 2¼À1,X l i ¼l 1þ1a i ÀX l 1i ¼1l i ¼0,X l 1i ¼1l i x i ÀX l j ¼l 1þ1a j x jr X l i ¼l 1þ1r i a i Àg 1,J l J r Àg 2,ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffig 21þz 2v 1q r z u 1,ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffig 22þz 2v 2q r z u 2,0r a i r c 1,i ¼l 1þ1,...,l ,ð20Þwhere Y 2¼½a >,b 1,b 2,g 1,g 2,z u 1,z v 1,z u 2,z v 2,l >>.Z.Qi et al./Pattern Recognition ](]]]])]]]–]]]3Theorem 3.1.Suppose that Y n 2is a solution of the dual problem(20),where Y n 2¼½a n >,b n 1,b n 2,g n 1,g n 2,z n u 1,z n v 1,z n u 2,z nv 2,ln > >.If there exists 0o a n j o c 1,we will obtain the solution ðw n ,b nÞof the primal problem (15):w n þ¼g n 1ðP l i ¼l 1þ1r i a n i Àg n1ÞX l l 1þ1a n i x i ÀX l 1i ¼1l n i x i 0@1A ,ð21Þb nþ¼À1þg n 1r j Àg n 1ðP i ¼l 1þ1r i a n i Àg n1ÞX l l 1þ1a n i ðx i Áx j ÞÀXl 1i ¼1l ni ðx i Áx j Þ0@1A :ð22ÞThe proof of Theorem 3.1can be found in the Appendix.Similarly,the dual of (10)can be written as maxY 3b 1þb 2þX l i ¼l 1þ1a is :t :b 1þz u 1¼1,b 1þz v 1¼À1,b 2þz u 2¼12,b 2þz v 2¼À12,ÀX l 1i ¼1a i ÀXl i ¼l 1þ1l i ¼0,X l i ¼l 1þ1l i x i þX l 1j ¼1a j x jr X l 1i ¼1r i a i Àg 1,J l J r Àg 2,ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffig 21þz 2v 1q r z u 1,ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffig 22þz 2v 2q r z u 2,0r a i r c 2,i ¼1,...,l 1,ð23Þwhere Y 3¼½a >,b 1,b 2,g 1,g 2,z u 1,z v 1,z u 2,z v 2,l > >.The correspondingsolution isw n À¼g n 1ðP l 1i ¼1r i a n i Àg n1ÞÀX l 1i ¼1a n i x i ÀX l i ¼l 1þ1l n i x i 0@1A ,ð24Þb nÀ¼À1þg n1r j Àg n 1ðP l 1i ¼1r i a n i Àg n1ÞÀX l 1i ¼1a n i ðx i Áx j ÞÀX l i ¼l 1þ1l n i ðx i Áx j Þ0@1A :ð25ÞOnce vectors w þ,b þand w À,b Àare obtained from (20)and (23),the separating planes w >þx þb þ¼0,w >Àx þb À¼0ð26Þare known.A new data point x A R n is then assigned to the positive or negative class,depending on which of the two hyperplanes given by (26)it lies closest to,i.e.f ðx Þ¼argmin þ,Àf d þðx Þ,d Àðx Þg ,ð27Þwhered þðx Þ¼9w >þx þb þ9,d Àðx Þ¼9w >Àx þb À9,ð28Þwhere 9Á9is the perpendicular distance of point x from the planesw >þx þb þand w >Àx þb À.3.2.Nonlinear R -TWSVMThe above discussion is restricted to the linear case.Here,we will analyze nonlinear R -TWSVM by introducing kernel functionK ðx ,x 0Þ¼ðF ðx ÞÁF ðx 0ÞÞ,and the corresponding transformation:x ¼F ðx Þ,ð29Þwhere x A H ,H is the Hilbert space.So the training set (12)becomesT ¼fðX i ,y i Þ,...,ðX l ,y l Þg ,ð30ÞwhereX i ¼f F ð~xÞ9~x is in the sphere of the radius r and the center x i g .So when J ~xi Àx i J r r i and choosing RBF,we have J F ð~xi ÞÀF ðx i ÞJ 2¼ðF ð~x i ÞÀF ðx i ÞÞÁðF ð~x i ÞÀF ðx i ÞÞ¼K ð~x i ,~x i ÞÀ2K ð~x i ,x i ÞþK ðx i ,x i Þ¼2À2exp ðÀJ ~xi Àx i J 2=2s 2Þr r 2i ,ð31Þwhere r i ¼ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi2À2exp ðÀJ ~xi Àx i J 2=2s 2Þq .Thus X i becomes a sphere of the center F ðx i Þand the radius r i X i ¼f ~x9J ~x ÀF ðx i ÞJ r r i g :ð32ÞFor nonlinear case of R -TWSVM,J P l 1i ¼1l i F ðx i ÞÀP lj ¼l 1þ1a j F ðx j ÞJ 2can be expressed asX l 1i ¼1X l 1j ¼1l i l j K ðx i Áx j ÞÀ2X l 1i ¼1X l j ¼l 1þ1l i a j K ðx i Áx j ÞþXl i ¼l 1þ1Xl j ¼l 1þ1a i a j K ðx i Áx j Þ:ð33ÞSimilarly,JP li ¼l 1þ1l i x i þP l 1j ¼1a j x j J 2can be expressed asXl i ¼l 1þ1Xl j ¼l 1þ1l i l j K ðx i Áx j Þþ2X l i ¼l 1þ1X l 1j ¼1l i a j K ðx i Áx j ÞþX l 1i ¼1Xl 1j ¼1a i a j K ðx i Áx j Þ:ð34ÞSo we can easily obtain the nonlinear R -TWSVM only bytaking K ðx ,x 0Þinstead of ðx Áx 0Þof the optimization problem (20)and (23).3.3.Discussion3.3.1.Relationship with R-SVMBoth R -TWSVM and R-SVM use the assumption of the sphere within r to the missing or uncertain data and obtain the final classifier by solving the related SOCP problem.However,there are several differences as follows.(1)R -TWSVM is approximately four times faster than the usual R-SVM.Now suppose the computational complexity is no more than O ðr 3log ðe À1ÞÞ[37],where e is parameters corresponding to a specific algorithm,and r denotes the number of variables.R-SVM contains l þ4variables and each problem of R -TWSVM contains 8þl =2variables.Sup-pose l 1¼l 2¼l =2,the ratio r of runtimes can be written asr ¼ðl þ4Þ3log ðe À1Þ2ð8þl =2Þlog ðÀ1Þ¼ðl þ4Þ32ð8þl =2Þ:ð35ÞSo when l is large,r is roughly 4.(2)R -TWSVM uses two nonparallel hyperplanes for two classes to construct the final decision function,which factually exploits the data’s structural information while margin maximizing to the missing and uncer-tain data.This can directly improve the model’s generalized capability.(3)In fact,R-SVM is only efficient to the data with feature noise,which will lose its robustness when the data is with label noise.Unlike R -SVM,R -TWSVM employs two different loss function:a quadratic loss function making the proximal hyper-plane close enough to the class itself,and a soft-margin loss function making the hyperplane as far as possible from the otherZ.Qi et al./Pattern Recognition ](]]]])]]]–]]]4class,which results that almost all the points in this class and some points in the other class contribute to eachfinal decision function.This makes R-TWSVM have stronger insensitivity to data with label noise.3.3.2.Relationship with TWSVMConsider the primal optimization(15)of R-TWSVM,suppose r i¼0,i¼1,...,l1,our mode will degenerate to the TBSVM;drop-ping terms:12J w J 2simultaneously,our model will degenerate tothe TWSVM.Therefore,TWSVM and TBSVM are the special cases of our models.Although TWSVMs have had great success in classification,it exists the following shortcomings:(1)TWSVMs have to solve two problems for linear case and two other problems for nonlinear case separately.In order to realize the structural risk minimiza-tion(SRM)principle,TWSVMs have to add different regulariza-tion terms for linear and nonlinear cases owing to the reason mentioned in Section1.So far,there is not a clear theoretical explanation for these regularization terms,especially for the nonlinear case.(2)TWSVMs have to compute the inverse of matrices,it is in practice intractable or even impossible for a large data set by the classical methods.In this paper,R-TWSVM successfully overcomes the short-comings above.There are only inner products about samples in the dual problems,this makes us no need to add the other regularization term J u J2þb2[39]for nonlinear case and no need to explain its meaning.Correspondingly,kernel trick can also be applied to our model directly due to the reason above.Further-more,our model does not need to solve the extra inverse of matrices.More importantly,we can derive a new dual form of TWSVM by degenerating R-TWSVM.For simplicity but without loss of generality,we only analyze one of dual problems of R-TWSVM.Consider(20)and suppose r i¼0,i¼1,...,l1,the optimization problem will become a standard quadratic program-ming problem as follows(for simplify,we omit the proof):max Y4À1X l1i¼1l i x iÀX lj¼l1þ1a j x j2À1J l J2þX li¼l1þ1a is:t:X li¼l1þ1a iÀX l1i¼1l i¼0,0r a i r c1,i¼l1þ1,...,l,ð36Þwhere Y4¼½a>,l> >.The corresponding solution isw nþ¼ÀX ll1þ1a nix iÀX l1i¼1l nix i@1A,ð37Þb n þ¼À1þX ll1þ1a niðx iÁx jÞÀX l1i¼1l niðx iÁx jÞ@1A:ð38ÞAccording to Eq.(33),the kernel trick can also be applied tothe model directly.For the research of nonparallel classifiers,Mangasarian and Jayadeva et al.did a series of pioneering work.Up to now, hundreds of paper followed have been published.Unfortunately, none of them solves the fundamental disadvantages mentioned above.Encouragingly,the new model derived by R-TWSVM successfully overcomes these shortcomings as well.This is also an important contribution in this paper.4.ExperimentWe compared R-TWSVM against TWSVM and R-SVM on various data sets in this section.For simplicity,we set all r i in (13)to be a constant r and data of all experiments are normalized toÀ1and 1.All codes were wrote in MATLAB2010.The experiment environment:Intel Core I5CPU,2GB memory.The SeDuMi2software is employed to solve the SOCP problems of R-SVM and R-TWSVM.The‘‘quadprog’’function in MATLAB is used to solve the related optimization problems of TWSVM.The testing accuracies for our method are computed using standard 10-fold cross validation.The parameters c1,c2and the RBF kernel parameter s are selected from the set f2i9i¼À7,...,7g by10-fold cross validation on the tuning set comprising random10%of the training data.Once the parameters are selected,the tuning set was returned to the training set to learn thefinal decision function.4.1.Toy dataTo give an intuitive performance of R-TWSVM,we construct two sets of2-D data:one is to test the influence of r to the accuracy;the other is to show the difference between R-TWSVM and R-SVM when some samples are labeled incorrectly.Thefirst data is generated randomly from two normal dis-tributionðu1¼À0:5,s1¼0:4,u2¼0:5,s2¼0:4Þ.The noise u i is generated randomly from the normal distribution and scaled on the unit sphere.We add many noises for the dataset by x i¼x iþru i.From Figs.2and3,it is not difficult tofind that R-TWSVM is trying tofind a robust classifier which is able to both separate the existing data and all of virtual data included in the circle of radius r.The second data is also generated randomly from two normal distribution mentioned above.Different with thefirst data,we add the noises of labels for the data randomly.Figs.4and5show that the R-TWSVM is superior to R-SVM when there are the label errors of data.The main reason is because that the classifier of R-SVM is decided by the support vectors(SVs)and are easily disturbed when SVs are labeled incorrectly.4.2.Image datasetsFace recognition and image classification are two popular problems in pattern recognition In this subsection,we test our proposed method in face recognition and image classification.The ‘‘1vs r’’method[42]is used to solve the multi-class classification. Because our goal is only to compare the performance between R-TWSVM and other algorithms,all experiment are carried out on raw pixel features.(1)AR Face Database(see Fig.6)[43]:there are seven images of each individual:a neutral face,three variations in lighting (from the left,from the right,from both sides)and three varia-tions in expression(smile,frown,scream).We,respectively,take 70smile,frown,scream’s person aligned and its cropped faces as training set and30other persons as testing set.Each image is resized to be80Â60pixels.For smile faces,we add Gaussian white noise of mean0and variance0.01,0.02,0.03,0.04;for frown faces,we add‘‘salt and pepper’’noise of noise density0.05, 0.06,0.07,0.08;for scream faces,we add multiplicative noise of mean0and variance0.01,0.02,0.03,0.04.Fig.7shows thefinal result.(2)PASCAL2006dataset(see Fig.8)[44]:it contains10object categories(bicycles,buses,cats,cars,cows,dogs,horses, 2/Z.Qi et al./Pattern Recognition](]]]])]]]–]]]5。
数据挖掘十大经典算法总结
AdaBoost
AdaBoost是一种迭代算法,其核心思想是针对同一个训练集训练不同的分类器(弱分类器), 然后把这些弱分类器集合起来,构成一个更强的最终分类器 (强分类器)。其算法本身是通 过改变数据分布来实现的,它根据每次训练集之中每个样本的分类是否正确,以及上次的 总体分类的准确率,来确定每个样本的权值。将修改过权值的新数据集送给下层分类器进 行训练,最后将每次训练得到的分类器最后融合起来,作为最后的决策分类器。使用 AdaBoost分类器可以排除一些不必要的训练数据特征,并将关键放在关键的训练数据上面。
目前,对AdaBoost算法的研究以及应用大多集中于分类问题,同时近年也出 现了一些在 回归问题上的应用。就其应用AdaBoost系列主要解决了: 两类问题、 多类单标签问题、多 类多标签问题、大类单标签问题,回归问题。它用全部的训练样本进行学习。
我们考虑以下形式的样本点.
Outline
Main concept of the algorithm Description of the algorithm Application of the algorithm – dataset and parameter setting Example Advantages and disadvantages Variations of the algorithm Further research issues References
18
References
[1]Support Vector Machine & Its Applications,Mingyue Tan,2004 [2]王景南,古思明,龐金宗,”多類支向機之研究” ,全國碩博士論文網,2002 [3]INCE, H. and T.B. TRAFALIS, “Kernel principal component analysis and support vector machines for stock price prediction”, Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, Volume 3, pages 2053-2058. [4]ONGSRITRAKUL, P. and N. SOONTHORNPHISAJ,” Apply decision tree and support vector regression to predict the gold price”, Proceedings of the International Joint Conference on Neural Networks, 2003, Volume 4, Pages 24882492. [5]/mirror/hladr4pred/
ML_Lecture12_SVM
Support Vector Machines支持向量机主要内容⏹一. 支持向量机的理论基础——统计学习理论二. 支持向量机的基本思想⏹.⏹三. 支持向量机存在问题与研究展望一. 支持向量机的理论基础——统计学习理论SVM的理论基础——统计学习理论⏹机器学习问题G:产生器,产生随机向量x产器产随机向量;S:训练器,对给定输入x输出y相应的;LM:学习机器,从给定的函数集中选择最能逼近训练器的函数。
⏹机器学习目的通过有限的观测数据(xi,yi)来估计输入与输出的函数关系,并有定的预测推广能力的函数关系,并有一定的预测推广能力⏹传统的机器学习理论基础——统计学⏹缺点:统计学研究的是样本数目趋于无穷大时的渐近理论⏹实际问题:样本有限(小样本)⏹统计学习理论对小样本统计估计和预测学习的最佳理论V Vapnik六七十年代创立九十年代在此基础上创立V.Vapnik 六、七十年代创立,九十年代在此基础上创立支持向量机(SVM)统计学习理论(SLT)⏹问题表示⏹根据n个独立同分布的观测样本在一组函数集{f(x,w)}中求最优函数f(x,w0)对依赖关系进行估计,使期望风险最小。
三类机器学习(1)模式识别问题:y={0,1}(2)回归估计问题(函数逼近):y输出为实数(3)密度估计问题⏹由于样本的有限,使用经验风险代替期望风险⏹经验风险最小化(ERM)准则经验风险最小是否真的使真实风险最小?问题事实上,训练误差小并不总能导致好的预测效果,某事实上训练误差小并不总能导致好的预测效果某些情况下,训练误差小导致推广能力下降,即真实风险增加,这就是过学习问题推广性的界置信范围l:样本数h:VC维VC维如果存在h个样本能够被函数集里的函数按所有h2的种形式分开,称函数集能够把h个样本打散。
VC维就是能够打散的最大样本数VC维无通用的计算方法。
特别的,N维实空间线性函数VC维是N+1结构风险最小化(SRM)原则在函数集中折中考虑经验风险和置信范围,取得实际风险的最小。
支持向量机的新发展
第19卷第5期V ol.19N o.5 控 制 与 决 策 Control and Decision 2004年5月 M ay2004 文章编号:1001-0920(2004)05-0481-04支持向量机的新发展许建华1,张学工2,李衍达2(1.南京师范大学数学与计算机学院,江苏南京210097; 2.清华大学自动化系,北京100084)摘 要:Vapnik等学者首先提出了实现统计学习理论中结构风险最小化原则的实用算法—支持向量机,比较成功地解决了模式分类问题.其后,机器学习界兴起了研究统计学习理论和支持向量机的热潮,引人瞩目的研究分支有从最优化技术出发改进或改造支持向量机,依据统计学习理论和支持向量机的优点设计新的非线性机器学习算法等.对此,较为系统地回顾了近10年来算法研究领域的新发展.关键词:机器学习;统计学习理论;支持向量机中图分类号:T P18 文献标识码:AAdvances in support vector machinesX U J ian-hua1,ZH A N G X ue-gong2,L I Yan-da2(1.Schoo l of M athematical and Computer Science,N anjing N o rmal U niver sity,N anjing210097,China; 2.Depar tment o f A uto matio n,T sing hua U niv ersit y,Beijing100084,China.Cor r espondent:XU Jian-hua,E-mail: xujianhua@ema )Abstract:V apnik and his collabo rato r s pro po sed a useful alg or it hm:suppor t vecto r machines,w hich can implement the str uctur al risk minimizat ion pr inciple in statistical learning theor y.T his nov el algo rithm handles the classification pro blems successfully.Since then mo r e attentio ns hav e been paid to statistica l lear ning theo ry and suppo rt v ector machines.T he attr activ e r esearch ar ea s ar e to impr ov e or modify suppo rt v ect or machines by optimizat ion techniques,and to desig n t he no vel no n-linea r machine lear ning algo rithms based o n sta tistical lear ning theor y and so me ideas in suppor t vecto r machines,etc.T he adv ances in such algo rithm studies in the last ten y ear s are r eviewed.Key words:ma chine lear ning;stat istical lear ning t heor y;suppo rt vecto r machines1 引 言 Vapnik等学者提出的支持向量机算法(SVM)[1~4]和机器学习的统计学分析专著——统计学习理论(SLT)[5~7],是近10年来机器学习、模式识别以及神经网络界最有影响力的成果之一.支持向量机分类算法具有4个显著特点:1)利用大间隔的思想降低分类器的V C维,实现结构风险最小化原则,控制分类器的推广能力;2)利用M ercer核实现线性算法的非线性化;3)稀疏性,即少量样本(支持向量)的系数不为零,就推广性而言,较少的支持向量数在统计意义上对应好的推广能力,从计算角度看,支持向量减少了核形式判别式的计算量;4)算法设计成凸二次规划问题,避免了多解性.自1995年以来,在实用算法研究、设计和实现方面已取得丰硕的成果,可归结为几个大的研究方向:1)提高SVM的计算速度,以便于处理大规模问题,如序列最小化算法[8]等;2)利用最优化技术改进或改造支持向量机形式,简化计算过程,如线性 收稿日期:2003-03-31;修回日期:2003-06-09. 基金项目:国家自然科学基金资助项目(60275007). 作者简介:许建华(1962—),男,浙江长兴人,高级工程师,博士,从事模式识别、机器学习等研究;李衍达(1936—),男,广东南海人,教授,中国科学院院士,从事信号处理、智能控制和生物信息学等研究.SVM[7],LS-SVM[9]等;3)依据结构风险最小化原则和支持向量机的某些思想提出新的算法,如M -SVM [10],广义SVM [11]等算法;4)利用结构风险最小原则、核思想和正则化技术等改造传统的线性算法,构造出相应的核形式,如核主成份分析[12]等.本文主要介绍上述与支持向量机二次规划形式密切相关的2)和3)两个研究方向的新进展.2 支持向量机分类和回归算法 依据统计学习理论中的结构风险最小化原则,Vapnik 等人首先提出了模式分类算法[1,2]以及回归分析算法[3,4].假设训练样本集为(x 1,y 1),(x 2,y 2),…,(x l ,y l ),(1)其中:x i∈R n,i =1,2,…,l (R 为实数域);对于两类的分类问题,y i ∈{+1,-1};对于回归问题,y i ∈R .支持向量机分类算法的原始形式可归结为下列二次规划问题:min12(w ,w )+C ∑li =1N i ,s.t.y i ((w ,x i )+b )-1+N i ≥0,(2)其中:(õ,õ)为两向量之间的内积[13];N i ≥0为松弛项,表示错分样本的惩罚程度;C 为常数,用于控制对错分样本惩罚的程度,实现在错分样本数与模型复杂性之间的折衷;w 和b 为判决函数f (x )=(w ,x )+b 中的权向量和阈值.当无错分样本时,最小化目标函数的第一项等价于最大化两类间的间隔,可降低分类器的VC 维,实现结构风险最小化原则.上述二次规划的对偶形式为m ax ∑li =1A i-12∑li ,j =1A i A j y i y j (x i ,x j ),s.t.∑l i =1A i y i =0,0≤Ai ≤C ,i =1,2,…,l ,(3)其中Ai 为Lagrange 乘子.根据最优化理论中的KKT 条件,只有少量样本(判决函数值等于±1的样本和错分样本)的A i 值不为零,Vapnik 等人称之为支持向量,这便是支持向量机名称的由来.由于对偶形式(3)中只出现两向量间的内积运算,Vapnik 等人用满足M ercer 条件的核函数k (x i ,x j )来代替内积运算(x i ,x j ),实现线性算法的非线性化.常用的核函数包括:多项式核,径向基核以及二层神经网[5~7].核形式的判别函数为f (x )=∑li =1A iy ik (x i,x )+b . 在SVM 分类算法取得成功后,Vapnik 等人将分类算法的思想推广到回归分析,提出了支持向量机回归算法,采用E 线性不敏感损失函数、E 平方不敏感损失函数和Huber 损失函数代替经典回归分析中的平方损失函数,从而使支持向量机回归算法仍可用二次规划来表达.其中采用E 线性不敏感损失函数的支持向量机回归算法的对偶形式为m ax -E ∑li =1(A *i+Ai )+∑li =1y i(A*i-A i )- 12∑li ,j =1(A *i -A i )(A *j -A j )(x i ,x j ),s.t.∑li =1A *i=∑li =1A i, 0≤A i ,A *i ≤C ,i =1,2,…,l .(4)这时,位于E 管道边缘和外侧的样本便成为支持向量,E 的大小可控制支持向量的数目.与分类算法类似,两向量间的内积运算可用核函数来代替,从而实现算法的非线性化.采用其他损失函数的支持向量机回归算法的形式参见文献[3~7].基于核的回归函数形式为f (x )=∑li =1(A*i-A i )k (x i ,x )+b .对于支持向量机分类和回归算法可参考两篇有影响的指导性文献[14,15].3 支持向量机的变形算法 支持向量机的成功促使众多研究人员投入到算法的设计和研究,其中一个研究方向是对SVM 的形式作某些改动,简化求解过程或使算法参数具有一定的物理意义.文献[2]建议采用N ki (k ≥0)形式的松弛项,为便于计算,支持向量机算法取k = 1.Keer thi 等[16]将SVM 分类算法的松弛项改成平方型,即N 2i ,此时其对偶形式的目标函数为max ∑li =1A i -12∑li ,j =1A i A j y i y j (x i ,x j )+D ijC ,(5)这使得对偶形式(5)中的矩阵为正定阵(一般核矩阵至少是半正定的).然而对于可分问题,它的解只有在C →∞时才是最大间隔分类面[17].文献[9]将不等式约束条件改成等式约束,提出了SVM 的最小二乘形式(LS-SVM ),其对偶问题仅为一线性方程组,即8+I /C y y TA B =u 0.(6)其中:A =[A 1,A 2,…,A l ]T ,y =[y 1,y 2,…,y l ]T,u =482控 制 与 决 策第19卷[1,1,…,1]T,矩阵8的元素为8ij =y i y j k (x i ,x j ).每个样本的A 值与误差N i 成正比,没有支持向量.为获得稀疏表达,文献[18]提出训练后去掉一些小A 值再训练的思路,使多数样本的A 值为零.为避开求解二次规划问题,文献[7]用线性规划来表达SVM 回归算法的基本思想,其形式为m in∑li =1A i+∑li =1A*i+C∑li =1N i+∑li =1N*i,s.t.y i -∑lj =1(A*j-A j )k (x i ,x j )-b ≤E -N *i,∑lj =1(A *j -A j )k (x i ,x j )+b -y i ≤E -N i ,A i ,A *i ,N i ,N *i ≥0,i =1,2,…,l .(7)这是用核形式回归函数直接构成的线性规划问题,可采用熟悉的单纯形方法来求解.由于支持向量机中的C 和E 等参数缺乏一定的物理含义,Scholkopf 等人[10]用M 代替原来的这些参数,提出了M -SVM 分类和回归算法.其中分类算法的对偶形式为max -12∑li ,j =1A i A j y i y j k (x i ,x j ),s.t ∑li =1A i y i =0,0≤A i ≤1l, ∑li =1A i ≥M,i =1,2,…,l .(8)其中M 的物理含义是:错分样本数占总样本数的上界和支持向量数占总样本数的下界.这种改造是非常合理的,因为原始SVM 中支持向量由边界上的样本和错分样本组成,支持向量数一定大于错分样本数,M 正好表示二者之间的某个中间值.原始SV M 的解主要取决于两类分界面附近的样本点,实际数据不可避免地带有噪声或存在野值.文献[19]提出了中心SVM ,以减小随机噪声的影响;[20]则提出了中值SVM ,以减小野值点的影响.从有限的实验结果看,它们都能有效地提高算法的鲁棒性.4 广义SVM 及其特例 从最优化理论和方法的角度出发,M ang a-sarian 等人提出了广义SVM 的框架(GSVM )[11],并从中派生出若干特例.在广义SVM 框架中,直接以A ,B 和核矩阵K (其元素为k (x i ,x j ))构造出一个不等式约束的非线性优化问题,即m in C u TN +f (A ),s.t.D (KD A -u B )≥u -N .(9)其中:f (A )是一凸函数,如某一范数或半范数;N =[N 1,N 2,…,N l ]T,D =diag(y 1,y 2,…,y l ).他们证明了式(9)的对偶形式与SVM 对偶形式(3)之间的等价关系,所以称为广义支持向量机.但GSVM 并不是直接求解式(9)或对偶形式,而是构造出若干特例:1)将广义SVM 中目标函数的第1项改成C N TN ,第2项取为(A TA +B 2)/2,并将其转化为无约束问题,用二次收敛的New ton-Armijo 求解,称为光滑SVM [21];2)如果从光滑SVM 问题的Lag rang e 函数出发,直接构造一迭代算法求解,这样的算法称为Lagr ang e SVM [22];3)简化SVM [23]则利用分块思想,从样本集中随机选取非常少的样本,求解光滑SVM 的无约束目标函数;4)近似SVM [24]采用光滑SVM 的目标函数,将不等式约束改为等式约束,其解满足一线性方程组,并用递推公式求解;5)增量SVM [25]求解线性近似SVM 问题,通过去除旧样本加入新样本来修正线性分类器,同时完成矩阵更新.广义SVM 算法的特殊形式都采用迭代方法求解,可处理不同规模样本集的分类问题.5 分位数估计与新奇检测算法 文献[26]提出了形式上类似于SVM 算法的新奇检测和高维分位数估计方法,本质上是一种无监督模式分类算法.其目的是找到一个非线性的判别函数,在包含多数样本的小区域内取值为+1,而在区域外取值为- 1.算法实现分为两步:1)利用核函数将样本数据映射到一特征空间;2)使原点和训练样本之间的间隔最大.算法的对偶形式为max 12∑li ,j =1A i A j k (x i ,x j ),s.t.∑li =1A i=1,0≤A i ≤1M l,i =1,2,…,l ,(10)其中M ∈(0,1]具有明确的物理含义:当Q ≠0时,它既是野值样本数占总样本数的上界,也是支持向量数占总样本数的下界.最终的判别函数为f (x )=sign∑li =1A ik (x ,x i)+Q ,其中Q 的含义见文献[26].第5期许建华等:支持向量机的新发展4836 讨 论 自20世纪90年代中期以来,综合统计学习理论、支持向量机算法的优点和最优化技术,设计实现新算法成为机器学习领域的一个研究热点,并取得了丰硕的成果.本文主要综述了从SVM算法发展而来的一些算法,算法形式可分为4类:1)二次规划问题,如SVM的一些变型算法和新奇检测算法;2)线性方程组,如LS-SVM;3)线性规划问题,如线性SVM;4)迭代求解的广义SVM特例.在算法设计思路上,多数算法的目标函数折衷考虑数据的拟合程度(或错分样本数目)与模型的复杂程度,通过正则化参数将二者线性组合起来,可有效地控制算法的推广能力;然后将优化问题转换为对偶形式,使其只出现两样本向量的内积运算;最后用核函数来代替内积,实现算法的非线性化.另外,采用满足M er cer条件的核函数,可保证核矩阵是半正定的,从而确保解的唯一性.对于不等式约束,支持向量的出现是KKT条件的直接结果.线性SVM和广义SVM则利用核形式的判决函数或回归函数直接构造算法,可以体现支持向量机算法的主要思想(即控制模型复杂性和核思想),但不会自动出现支持向量.这是另一条应引起人们关注的算法研究思路.大量仿真实验结果表明,本文介绍的算法具有较好的推广能力.但实际应用中将面临核函数类型及参数选取问题,这是一个既有实际价值又有理论意义的研究课题.因此,利用问题的启发式知识(即背景知识)来研究核参数的选择问题是值得重视的.参考文献(Refer ences):[1]Boser B E,Guyo n I M,V apnik V N.A tr ainingalg or ithm for optimal marg in classifiers[A].T he5thA nnual A CM W or ks hop on COL T[C].Pitt sbur gh:A CM P ress,1992.144-152.[2]Co rtes C,Vapnik V N.Suppo rt v ect or netw o rks[J].M achine L ear ning,1995,20(3):273-297.[3]D rucker H,Bur ges C J C,K aufm an L,et al.Suppo rtv ector r egr ession machines[A].A dvances in N eur alI nf ormation P rocess ing Sy stems[C].Cambr idge:M ITP ress,1997.155-161.[4]V apnik V N,Go lo w ich S,Smola A.Suppo rt vecto rmet ho d for functio n appro ximatio n,reg ressio n estimatio n and signal pr ocessing[A].A dv ances in N eur al I nf ormation Pr ocessing Sy stems[C].Cambr idge:M IT P ress,1997.281-287.[5]V apnik V N.T he N atur e of S tatistical L ear ningT heory[M].N ew Yo rk:Spr inger-V er lag,1995.[6]V apnik V N.Statistical L ear ning T heory[M].NewY or k:Wiley,1998.[7]V apnik V N.T he N atur e of Statistical L ear ningT heory[M].2nd editio n.N ew Y or k:Spring er-V erlag,1999.[8]Platt J.F ast tr aining of suppor t v ecto r machines usingsequential minimal optimizat ion[A].A dvances in K ernel M ethod s—S u p p or t V ector L ear ning[C].Cambridg e:M IT Pr ess,1999.185-208.[9]Suykens J A K,V andew alle J.L east squares suppor tvecto r machines[J].N eur al Pr ocessing L etter s,1999,9(3):293-300.[10]Scholkopf B,Smola A J,William so n R C,et al.Newsuppo rt vecto r alg o rit hm s[J].N eur al Com p utation,2000,12(5):1207-1245.[11]M ang asarian O L.G ener alized suppor t v ectormachines[A].A d vances in L arg e M arg in Classif ier s[C].Cambr idge:M IT Pr ess,2000.135-146.[12]Scholkopf B,Smo la A,M uller K R.No nlinearco mpo nent analysis a s a ker nel eig env alue pr oblem[J].N eur al Comp utation,1998,10(5):1299-1319. [13]张恭庆,林源渠.泛函分析讲义[M].北京:北京大学出版社,1987.[14]Burg es C J C.A tuto ria l on suppo rt v ect or machinesfo r pat tern r eco gnit ion[J].D ata M ining and K now ledg e D iscov ery,1998,2(2):1-43.[15]Smo la J,Schob lko pf B.A tuto rial on suppor t v ectorr egr ession[R].L ondo n:U niv ersit y o f Lo ndo n,1998.[16]K eer thi S S,Shev ade S K,Bhattachar yya C,et al.Afast it erat ive nearest point a lg or ithm for suppor t v ector machine classifier desig n[R].Bang alo re:I ndian Institute of Science,1999.[17]L in C J.F ounda tio ns of suppor t vecto r machines:Ano te fro m an o ptimizatio n point v iew[J].N eur al Comp utation,2001,13(2):307-317.[18]Suykens J A K,L ukas L,Va ndew alle J.Sparse leastsquar es suppor t v ect or machines classifier s[A].T he 8th Europ ean Sy mp osium on A rtif icial N eur al N etw or ks[C].Br ug ers,2000.37-42.[19]Z hang X.U sing class center v ect or s to build suppor tv ector m achines[A].N eural N etw or k s f or Signal P roces sing[C].N ew Yo r k:IEEE Pr ess,1999.3-11.[20]K ou Z,Xu J,Z hang X,et al.A n impr ov ed suppor tv ector ma chine using class-median vecto rs[A].P r oc of8th I nt Conf on N eur al I nf or mation Pr ocessing[C].Shanghai:F udan U niver sity P ress,2001.2:883-887.(下转第495页)484控 制 与 决 策第19卷对x 1和x 2分别施加幅值为0.0001和0.01的白噪声.图2和图3分别显示了采用FPE 和PDC 控制方法得到的控制效果.图中,虚线为设定值,实线为跟踪值.通过对比图2和图3可以看出,本文给出的基于性能评估器的设计方法具有更好的控制性能和鲁棒性.图3 文献[3]方法跟踪设定曲线图2 本文方法跟踪设定曲线7 结 语 通过对一类非线性时滞系统的分析,提出了一种基于模糊性能评估器的鲁棒控制器设计方法.该方法利用模糊性能评估器实现了3个功能:1)验证模糊模型的有效性;2)为控制器提供干扰抑制量信息;3)为模糊模型以及部分控制器参数调整提供一种无损调试的手段. 理论分析和仿真结果都证明了该方法具有良好的鲁棒性和实用性.参考文献(References ):[1]Br ier ley S D,Chiasso n J N ,L ee E B,et al.Onstability independent of delay fo r linea r systems [J ].I EEE T rans on A utomatic Contr ol ,1982,27(2):252-254.[2]M ahmo ud M S ,Al-M ut ha iri N F.Desig n o f r obustcontr oller fo r time -delay systems [J ].I EEE T r ans on A utomatic Contr ol ,1994,39(8):995-999.[3]Cao Y Y,Fr ank P M .A na ly sis and sy nthesis ofnonlinea r time-delay systems via fuzzy contr ol appr oach [J ].I E EE T r ans on Fuz z y Sy s tems ,2000,8(2):200-211.[4]张化光,黎明.基于H ∞观测器原理的模糊自适应控制器设计[J ].自动化学报,2002,28(1):27-33.(Zhang H G ,L i M .A daptive fuzzy contr oller design based on the pr inciple of H∞observ er [J ].A ctaA utomatica S inica ,2002,28(1):27-33.)[5]L i M ,Z hang H G ,He X Q .Adaptiv e fuzzy contr ollerdesig n based on H∞perfo rmance ev aluat or [A ].P roc ofthe 2002I nt Conf on Contr ol and A utomation [C ].Xia men ,2002.55-59.[6]L ehm an B,Bentsman J,Lunel S V ,et al.V ibra tio nalcontr ol of nonlinea r time lag systems with bounded delay :A ver agingtheor y ,stability ,and t ransientbehav io r [J ].I EEE T r ans on A utomatic Control ,1994,39(7):898-912.[7]Xu L D ,Cheng C W,T ang B Y.A linear matr ixinequality a ppro ach fo r r obust co nt ro l of systems w ith delay ed states [J ].Eur op ean J of Op er ational Res earch ,2000,124(2):332-341. (上接第484页)[21]L ee Y J,M ang asa ria n O L.SSVM :A smoo th suppo rtvecto r machines [R ].Wisco nsin :U niv ersityofWisco nsin,1999.[22]M angasar ian O L ,M usicant D R.L agr angian suppo rtvecto r machines [R ].Wisco nsin :U niv ersityofWisco nsin ,2000.[23]L ee Y J,M ang asar ian O L.RSV M :R educed suppo rtvecto r machines [R ].Wisco nsin:U niv ersityofWisco nsin ,2000.[24]Fung G ,M ang asarian O L .P ro x imal suppor t vecto rmachine cla ssifiers [R ].Wisco nsin:U niv ersit y ofW isconsin ,2001.[25]F ung G ,M ang asar ian O L.I ncr ement suppor t v ectormachine classification [R ].Wisco nsin:U niv ersit y of W isconsin ,2001.[26]Scholkopf B ,Platt J ,Shaw e -T o ylor J ,et al .Estimating thesuppor t ofhig hdimensio naldistr ibutio n [R ].R edmo nd:M icr oso ft Resear ch,1999.第5期黎明等:一种新型时滞系统鲁棒控制器设计方法493。
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
CODEN IEJMAT Internet Electronic Journal of Molecular Design2007,6, 229–236ISSN 1538–6414 B io C hem Press Inter net Electronic Journal ofMolecular DesignAugust 2007, Volume 6, Number 8, Pages 229–236Editor: Ovidiu IvanciucSupport Vector Machines QSAR for the Toxicity of Organic Chemicals to Chlorella vulgaris with SVM ParametersOptimized with SimplexZhong–Sheng Yi 1 and Li–Tang Qin 11 Department of Material and Chemical Engineering, Guilin University of Technology, Guilin541004, P. R. ChinaReceived: September 23, 2006; Accepted: December 5, 2006; Published: August 31, 2007Citation of the article:Z.–S. Yi and L.–T. Qin, Support Vector Machines QSAR for the Toxicity of Organic Chemicals to Chlorella vulgaris with SVM Parameters Optimized with Simplex,Internet Electron. J. Mol.Des.2007,6, 229–236, .Copyright © 2007io hem PressZ.–S. Yi and L.–T. QinInternet Electronic Journal of Molecular Design2007,6, 229–236Inter net Electronic Journal of Molecular Designio hem Press Support Vector Machines QSAR for the Toxicity of Organic Chemicals to Chlorella vulgaris with SVM ParametersOptimized with SimplexZhong–Sheng Yi 1,* and Li–Tang Qin 11 Department of Material and Chemical Engineering, Guilin University of Technology, Guilin541004, P. R. ChinaReceived: September 23, 2006; Accepted: December 5, 2006; Published: August 31, 2007Internet Electron. J. Mol. Des.2007,6 (8), 229–236AbstractMotivation.The key to a successful application of support vector machines (SVM) is to selecte proper parameters,but there is no general method for selecting the best set of SVM parameters. The predictive power of SVM models depends strongly on the set of parameters that control the model.In this paper we used the simplex optimization method to search for the optimum set of SVM parameters, namely the capacity parameter C, the insensitive loss parameter H and the parameter J that controls the shape of the RBF kernel.Method. The leave–one–out cross–validation correlation coefficient q2 is used as objective function for the simplex optimization of SVM parameters.Results.SVM quantitative structure–activity relationships(QSAR) models were built for the toxicity of organic chemicals to Chlorella vulgaris. The SVM models with simplex optimized parameters are compared with multi–linear regression QSAR models obtained in the same conditions.Conclusions. A series of QSAR models with one to three variables were obtained for the acute toxicity of 91 organic chemicals to Chlorella vulgaris. The SVM models with parameters optimized with simplex have better statistics than the multi–linear regression QSAR equations. The results from the present investigation demonstrate that the simplex algorithm is an efficient approach in finding the best set of SVM parameters for QSAR models.Keywords. SVM; support vector machines;QSAR;quantitative structure–activity relationships; simplex optimization;Chlorella vulgaris; acute toxicity.1 INTRODUCTIONIn recent years, with the development of modern industrial technology, a large number of organical chemicals have entered into the natural environment, on which human life relies for existence, and some chemicals have seriously polluted the environment. Due to the high cost and time, it is difficult to determine experimentally the toxicity of a large number of chemicals. Due to their high predictive power, quantitative structure–activity relationships (QSAR) are powerful toolsB ioC hem Press * Correspondence author; E–mail: yzs@ and yizhsh@.Support Vector Machines QSAR for the Toxicity of Organic Chemicals to Chlorella vulgarisInternet Electronic Journal of Molecular Design2007,6, 229–236for predicting chemical toxicity of environmental pollutants [1,2]. Many interesting applications of QSAR models were developed in environmental toxicology, demonstrating their theoretical and practical importance.Recently, by using multi–linear regression (MLR), Cronin and his co–workers [3] assessed and modeled the toxicity of selected 91 organic compounds to Chlorella vulgaris using three descriptors, which included hydrophobicity expressed by the 1–octanol/water partition coefficient (log K ow), electrophilicity expressed by the lowest unoccupied molecular orbital (LUMO) and a function of molecular size corrected for the presence of hereroatoms expressed by the first–order delta valence connectivity index ('1F v), and concluded that method selection in QSAR is task–dependent and there should be clear indications supporting the need of more complied, nonlinear, methods that may deliver better QSAR. As a power classified and regression tool, the support vector machines (SVM) have wide application in QSAR, and we attempt to re–build this model using it.SVM is a popular algorithm developed from the statistics learning theory by Vapnik [4,5]. Due to its remarkable generalization performance, the SVM has attracted attention and gained extensive application such as MOA prediction [6–8], classification of microarray gene expression data [9], estimation of aqueous solubility [10], classification of organophosphate nerve agent simulants [11]. The key of the SVM’s models quality is the selection of SVM parameters, similarly with artificial neural networks (ANN), and the molecular structure descriptors. Many studies demonstrated the importance of the SVM parameters in developing a predictive QSAR [12–14]. In this study, a modified simplex optimization was used to optimize SVM parameters and various QSAR models were built with difference descriptors for the toxicity of selected chemicals to Chlorella vulgaris. Compared with the result of multi–linear regression (MLR), SVM QSAR models have higher prediction power.2 MATERIALS AND METHODS2.1 Data Set and Structural DescriptorsThe toxicity data (pEC50) of a total of 91 chemicals covering a wide range of physicochemical properties and structural features, were collected from literature [3] and are listed in Table 1. The data set was split into a training set of 73 chemicals and a test set of 18 chemicals. The training set and test set were used to test the reliability of SVM model using the parameters through simplex optimization.Cronin [3]obtained a successful MLR model by using logarithm of the octanol–water partition coefficient (log K ow), the energy of the lowest unoccupied molecular orbital (E LUMO) and the first–order delta valence connectivity index ('1F v). In order to compare the performance of the MLR QSAR with that of SVM models, the same descriptors were used in this study.230B ioC hem Press Z.–S. Yi and L.–T. QinInternet Electronic Journal of Molecular Design 2007,6, 229–236Table 1. The Acute Chlorella vulgaris Toxicities and Descriptors for 91 CompoundsNo Name log K ow E LUMO '1F vpEC 50(Exp)pEC 50Eq. (1)pEC 50Eq. (2) Training Set2ethanol –0.31 3.565–0.130–3.32–3.94–3.744butan–2–ol 0.61 3.554–0.500–2.98–3.06–2.9252–hydroxyethyl methacrylate 0.47–0.074–2.275–2.82–1.71–1.7262–hydroxyethyl acrylate –0.21–0.102–2.044–2.79–2.34–2.367methyl acrylate 0.800.001–1.509–2.75–1.67–1.699butanone 0.290.882–0.686–2.51–2.56–2.5710methyl methacrylate 1.380.055–1.736–2.24–1.14–1.1411pentan–3–one 0.990.910–0.717–2.23–1.97–1.9812crotonaldehyde 0.52–0.141–0.972–1.98–2.02–2.0314trans–2–pentenal 1.05–0.115–1.024–1.88–1.56–1.5815phenol 1.470.398–1.477–1.46–1.22–1.2316allyl methacrylate 1.680.045–2.240–1.42–0.74–0.7317aniline 0.900.639–1.260–1.34–1.83–1.8419anisole 2.110.483–1.644–1.09–0.66–0.66202–fluorophenol 1.710.013–2.166–1.08–0.73–0.72212–fluoroaniline 1.260.266–1.998–1.05–1.22–1.23223–cresol 1.960.396–1.622–1.01–0.77–0.77242–hydroxyaniline 0.620.474–1.806–0.91–1.86–1.88252–methoxyphenol 1.320.392–2.194–0.88–1.15–1.15262,6–dimethylaniline 1.840.595–1.476–0.87–0.97–0.9727benzaldehyde 1.47–0.435–1.732–0.81–0.93–0.93292–hydroxybenzaldehyde 1.81–0.434–2.282–0.80–0.49–0.4730nitrobenzene 1.85–1.068–2.293–0.78–0.29–0.2531methidathion 2.42–2.550–1.436–0.730.350.42324–cresol 1.940.429–1.622–0.66–0.80–0.80344–tolualdehyde 1.99–0.430–1.866–0.65–0.46–0.44352–ethoxyphenol 1.850.422–2.184–0.62–0.72–0.71363–cyanobenzaldehyde 1.18–0.917–2.452–0.57–0.84–0.82373–nitrotoluene 2.42–1.017–2.481–0.500.230.28392,4–dinitroaniline 1.72–1.475–3.840–0.360.150.26404–bromophenol 2.590.020–1.578–0.35–0.16–0.14414–bromoaniline 2.260.218–1.289–0.33–0.57–0.56423–chloroaniline 1.880.263–1.350–0.31–0.88–0.88443,5–dinitroaniline 1.89–1.780–3.8460.030.370.50452–chlorobenzaldehyde 2.33–0.683–1.8380.06–0.11–0.09464–iodophenol 2.910.024–1.5900.160.110.13474–ethylbenzaldehyde 2.52–0.423–1.8420.16–0.020.00492–isopropylphenol 2.880.408–1.7540.170.030.05503,5–dichloroaniline 2.90–0.042–1.4280.240.080.10511,3,5–trimethyl–2–nitrobenzene 3.22–0.857–2.8080.250.95 1.01522,6–dichloroaniline 2.82–0.006–1.4160.260.000.01541,2–dichlorobenzene 3.43–0.142–1.0110.370.430.46551,3–dinitrobenzene 1.49–1.911–3.5320.38–0.020.09562,4–dinitrophenol 1.67–1.807–3.9760.400.230.36571,4–dinitrobenzene 1.47–2.208–3.5320.410.050.1659phosmet 2.78–2.349–2.3220.470.840.9360methylparathion 2.86–2.068–3.0770.60 1.051.15B ioC hem PressSupport Vector Machines QSAR for the Toxicity of Organic Chemicals to Chlorella vulgaris Internet Electronic Journal of Molecular Design2007,6, 229–236Table 1. (Continued)No Name log K ow E LUMO'1F v50(Exp)50Eq. (1)50Eq. (2)61622,6–dichloro–4–nitroaniline 2.80–1.096–2.9700.640.710.78 642,4–dinitrotoluene 1.98–1.841–3.7600.700.440.5665cyanophos 2.75–1.832–2.3490.790.690.76666–chloro–2,4–dinitroaniline 2.46–1.667–4.0800.800.88 1.02 672,6–dibromo–4–nitrophenol 3.57–1.452–3.3150.81 1.54 1.63 692,5–dichloronitrobenzene 3.03–1.296–2.6330.970.860.9270piperine 2.70–0.767–3.9850.970.820.92714–tert–butylbenzaldehyde 3.32–0.391–2.210 1.000.740.78 722,4,6–trichloroaniline 3.69–0.240–1.485 1.110.810.83744–chloro–2,6–dinitroaniline 2.46–1.895–4.080 1.190.94 1.09 751,2–dinitrobenzene 1.69–1.840–3.526 1.230.130.24761–chloro–4–nitrobenzene 2.39–1.344–2.476 1.250.290.3577dicapthon 3.58–2.124–3.256 1.36 1.71 1.82 792,3,5,6–tetrachloroaniline 4.47–0.560–1.535 1.48 1.56 1.58 802,4–dichloro–6–nitrophenol 3.07–1.431–3.174 1.50 1.08 1.17 812,6–dichlorobenzaldehyde 3.08–0.473–1.931 1.500.480.5282fenthion 4.09–1.628–1.486 1.56 1.52 1.5684pentachlorophenol 5.12–0.978–1.931 1.69 2.33 2.3385fenitrothion 3.30–2.027–3.256 1.71 1.45 1.56 861,2,4–trichloro–5–nitrobenzene 3.47–1.536–2.774 1.88 1.33 1.41 871,3,5–trichloro–2,4–dinitrobenzene 2.97–2.037–4.179 1.89 1.44 1.59894–(dibutylamino)benzaldehyde 5.06–0.097–2.392 2.18 2.17 2.16 902,3,5,6–tetrachloronitrobenzene 4.38–1.419–2.895 2.34 2.10 2.1591pentabromophenol 4.85–1.193–1.459 3.10 2.03 2.04Testing set38butan–1–ol0.88 3.425–0.428–2.73–2.82–2.6913trans–2–hexenal 1.58–0.115–1.094–1.94–1.10–1.10182–heptanone 1.980.879–0.902–1.18–1.09–1.08234–methoxyphenol 1.340.313–2.200–0.97–1.11–1.11282–cresol 1.950.396–1.616–0.81–0.78–0.78 333,4–dimethylphenol 2.230.436–1.750–0.65–0.52–0.52384–chlorophenol 2.390.095–1.603–0.42–0.34–0.3343benzyl Methacrylate 2.530.079–3.044–0.210.190.23482–methyl–1,4–naphthoquinone 2.20–1.493–3.0450.160.330.41532–tert–butyl phenol 3.270.407–1.7470.290.360.37583–nitrobenzaldehyde 1.47–1.404–3.1020.45–0.29–0.2263methyl azinphos 2.75–2.494–2.1860.690.820.9068thiometon 3.15–2.632 1.1340.940.270.37732–chloro–6–nitrotoluene 3.09–1.219–2.636 1.170.890.95 782,6–di–tert–butyl–4–methylphenol 5.890.464–2.040 1.45 2.62 2.55 832,4–di–tert–butylphenol 4.360.431–1.936 1.60 1.32 1.3388phenylazophenol 3.96–0.768–3.417 2.16 1.71 1.782.2 Support Vector Machines RegressionInitially, SVM is developed for pattern recognition problems, but now, with the introduction of insensitive loss function, SVM has been extended to solve nonlinear regression estimation with excellent performances [10]. Here, SVM can be called support vector regression (SVR). A detailed description of the regression theory of SVM can be seen in several excellent books and tutorials [15–18]. For this reason, we will only briefly describe the main ideas of SVM regression here.232B ioC hem Press Z.–S. Yi and L.–T. QinInternet Electronic Journal of Molecular Design 2007,6, 229–236A support vector machine is first trained on a group of objects having known target values. After training, the SVM model is used to predict or estimate target values for objects where these values are unknown. A kernel–induced feature space with function is used for the mapping of objects onto target values. Thus a non–linear feature mapping will allow the treatment of non–linear problems in linear space. The prediction or approximation function used by a basis SVM is, where ),(x x k i ¦ D D li i i i b x x k x f 1*),()()(i D and D are Lagrange multipliers which are mostly zeroand have , where is a feature vector corresponding to a training object, and isa kernel function such as linear, polynomial, radical basis function and sigmoid. The components of vector D and the constantb represent the hypothesis and are optimized during training. It may be useful to think of kernel as comparing patterns, or as evaluating the proximity of objects in their feature space. Thus a test point is evaluated by comparing it to all training points. Training points with non–zero weights are called the support vectors.*i 0* D u D i i i x ,(x k i ),(x x k i )i x D For a given dataset and one of the two SVM methods, mu–SVR and nu–SVR, the kernel function with parameters respectively and the capacity parameter C of SVM must be selected to determine a specific SVM model. All SVM models used mu–SVR and the radical basis function (RBF)2),(ix x i ex x k J , and all calculations were performed with LIBSVM software by Changand Lin [17]. Therefore, there are 3 parameters to be optimized, namely the capacity parameter C ,the insensitive loss parameter H and the parameter J that controls the shape of the RBF kernel. The target function of the simplex optimization was the correlation coefficient of leave–one–out cross validation,q 2.3 RESULTS AND DISCUSSION 3.1 Multi–linear RegressionThree descriptors, log K ow ,E LUMO and '1F v , were chosen from 110 descriptors in Ref. [3] to characterize 91 compounds. The multi–linear models built by various combinations of those descriptors are list in Table 2. This result show that log K ow is the most important descriptor, the next is E LUMO and then '1F v , because q 2 for these descriptors is 0.7410, 0.4648, 0.2437 and r 2 is 0.7561, 0.4819, 0.2881, respectively. From the model built with three descriptors (log K ow ,E LUMO ,and '1F v ), we can find that the importance of descriptors '1F v is very small. When '1F v is used in the three parameters model, the q 2 only increase about 0.02 and R 2 only 0.022 compared with the two parameters (log K ow and E LUMO ) model. The three variables model of log K ow ,E LUMO and '1F v is:v150)0664.02786.0()0540.02678.0(log )0474.08376.0()1735.07602.2(F 'r r r r LUMO ow E K pEC n = 91, R 2 = 0.8903, q 2 = 0.8752, RMSE = 0.4826, F = 244(1)B ioC hem PressSupport Vector Machines QSAR for the Toxicity of Organic Chemicals to Chlorella vulgarisInternet Electronic Journal of Molecular Design 2007,6, 229–236The predicted toxicities of 91 organic chemicals from Eq. (1) are list in Table 1.Table 2. The Result of Building Linear Models for Different Combination between Three DescriptorsNo No.samplesq 2R 2S F Descriptors 1910.74100.75610.7196279logK ow 2910.46480.4819 1.048983E LUMO 3910.24370.2881 1.229636'1F v4910.85770.86820.5291296logK ow E LUMO 5910.84170.85930.5466275logKow '1F v 6910.46990.4966 1.034045E LUMO '1F v7910.87540.89030.4826244logK ow E LUMO and '1F vTable 3. The Initial Simplex and the Best Parameters of SVMH J 220.065540.72940.01080.804230.0226402.26430.01080.795740.0226128.00000.04720.8414620.002162.24260.00390.88793.2 Simplex Optimization of the SVM ParametersThe QSAR models were obtained by considering log K ow ,E LUMO and '1F v as SVR inputdescriptors, the toxicity of organic chemicals as output, the parameters C ,H and J as simplex factors,q 2 of SVR as simplex object function. The starting simplexes were generated by gold section. The optimization converges when the difference is lower than 10–4 among each peak of the last simplex. The initial simplex (No. 1 to 4) and the best parameters (No. 62) of SVR are listed in Table 3. The convergence parameters were q 2 = 0.8879, C = 62.2426, H = 0.0021, J = 0.0039. The other sets of optimum parameters are listed in Table 4 (No. 1 to 7).Table 4. The Best Parameters of SVM for the Different Combination between Three Descriptors No.No.samples H C Jq 2R 2S Descriptors 1910.0034129.21700.00420.78110.79220.4492logK ow 2910.045615.67730.00790.50990.5139 1.0368E LUMO 3910.38340.71210.00690.38760.3891 1.3095'1F v 4910.088117.53860.05460.88550.89970.2141logK ow , E LUMO 5910.021211.60730.03120.86730.88580.2434logK ow ,'1F v 6910.006626.18150.02660.49740.5081 1.0496E LUMO ,'1F v 7910.002162.24260.00390.88790.89560.2238logK ow , E LUMO ,'1F v 8730.002162.24260.00390.88510.89310.2317logK ow , E LUMO ,'1F v 9730.005063.99710.01490.88960.89640.2229logK ow , E LUMO ,'1F vAfter choosing the SVR parameters, we generated the following prediction or approximation function of three descriptors¦ J D D 7212**500.9778)),exp()((i i ii x x pEC (2)234B ioC hem Presswhere support vector samples is 72, J * is the optimum parameter that controls the RBF shape. TheZ.–S. Yi and L.–T. QinInternet Electronic Journal of Molecular Design 2007,6, 229–236estimated toxicities of organic chemicals by Eq. (2) are listed in Table 1, and the plot of estimated by Eq. (1) versus observed toxicity of organic chemicals is shown in Figure 1. A comparison of the results from Tables 2 and 4 show that q 2 and R 2 of SVR models are larger than that of the corresponding linear models.p E C 50 e s t i m a t e dp EC 50 observedp E C 50P r e d i c t e dp EC 50 observedFigure 1. Plot of the observed algal toxicity against thatpredicted in Eq. (2)Figure 2. Plot of the observed algal toxicity against thatpredicted in Eq. (3)A good QSAR model should have not only excellent estimation ability for the training set but also a better predictive power for the external samples. In order to validate the predictive ability of the model, 73 samples are selected from all 91 organic chemicals to construct a training set and the remaining 18 formed a test set. The 73 samples in the training set are employed to develop a QSAR model and then the SVR model is used to predict the toxicity of organic chemicals in the test set. The statistical parameters of the training set, optimized with simplex, are presented in Table 4: No. 7 is obtained from 91 chemicals, No. 8 is obtained from 73 chemicals using the same parameters with No. 7, and No. 9 is from 73 chemicals. The statistics of the models with 91 and 73 compounds are similar.¦ 5212**500868.0)),exp()((i i i i x x pEC J D D (3)4 CONCLUSIONSThe predictive power of SVM models depends strongly on the set of parameters that control the model. In this paper we used the simplex optimization method to search for the optimum set of SVM parameters, namely the capacity parameter C , the insensitive loss parameter H and the parameter J that controls the shape of the RBF kernel. The leave–one–out cross–validation correlation coefficient q 2 is used as objective function for the simplex optimization of SVM parameters. SVM quantitative structure–activity relationships models were built for the toxicity of organic chemicals to Chlorella vulgaris . A series of QSAR models with one to three variables were obtained for the acute toxicity of 91 organic chemicals to Chlorella vulgaris . The SVM models withB ioC hem PressSupport Vector Machines QSAR for the Toxicity of Organic Chemicals to Chlorella vulgarisInternet Electronic Journal of Molecular Design2007,6, 229–236parameters optimized with simplex have better statistics than the multi–linear regression QSAR equations. The results from the present investigation demonstrate that the simplex algorithm is an efficient approach in finding the best set of SVM parameters for QSAR models.5 REFERENCES[1]L. S. Wang,Chemistry of Organic Pollution. Higher Education Press, Beijing, 2004.[2]T. W. Schultz, M. T. D. Cronin, T. I. Netzeva,The present status of QSAR in toxicology.J Mol.Struct.(Theochem)2003,622, 23–38.[3]M.T.D. Cronin, T. I. Netzeva, J. C. Dearden, R. Edwards, A. D. P. Worgan, Assessment and Modeling of theToxicity of Organic Chemicals to Chlorella vulgaris: Development of a Novel Database.Chem.Res.Toxicol.2004,17, 545–554.[4] V. Vapnik,The Nature of Statistical Learning Theory. Springer Verlag: New York, 1995.[5]V. N. Vapnik, Statistical Learning Theory. Wiley: New York, 1998.[6]O. Ivanciuc, Support Vector Machine Identification of the Aquatic Toxicity Mechanism of Organic Compounds,Internet Electron.J.Mol.Des.2002,1, 157–172, .[7]O. Ivanciuc, Aquatic Toxicity Prediction for Polar and Nonpolar Narcotic Pollutants with Support VectorMachines,Internet Electron.J.Mol.Des.2003,2, 195–208, .[8]O. Ivanciuc, Support Vector Machines Prediction of the Mechanism of Toxic Action from Hydrophobicity andExperimental Toxicity Against Pimephales promelas and Tetrahymena pyriformis,Internet Electron.J.Mol.Des.2004,3, 802–821, .[9]M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini,C.Sugnet,J.Manuel Ares,D.Haussler Support VectorMachine Classification of Microarray Gene Expression Data; UCSC–CRL–99–09; University of California: Santa Cruz, 9, 1999.[10]P. Lind, T. Maltseva, Support Vector Machines for the Estimation of Aqueous Solubility. put.Sci.2003,43, 1855–1859.[11]O.Sadik,J.Walker H. Land, A. K. Wanekaya, M. Uematsu, M. J. Embrechts, L. Wong, D. Leibensperger, A.Volykin, Detection and Classification of Organophosphate Nerve Agent Simulants Using Support Vector Machines with Multiarray put.Sci.2004,44, 499–507.[12] C.X. Xue, R. S. Zhang, H. X. Liu, M. C. Liu, Z. D. Hu, B. T. Fan, Support Vector Machines–Based QuantitativeStructure–Property Relationship for the Prediction of Heat put.Sci.2004,44, 1267–1274.[13]Z.–S. Yi and S.–S. Liu, Support Vector Machines for Prediction of Mechanism of Toxic Action from MultivariateClassification of Phenols Based on MEDV Descriptors,Internet Electron.J.Mol.Des.2005,4, 835–849, .[14]K. R. Muller, G. Ratsch, S. Sonnenburg, S. Mika, M.Grimm,N.Heinrich,Classifying‘Drug–likeness’withKernel–based Learning Methods.J.Chem.Inf.Model.2005,45, 249–53.[15]N. Cristianini, J. Shawe–Taylor,An Introduction to Support Vector Machines and Other Kernel–based LearningMethods. Cambridge University Press: Cambridge, 2000.[16]S. R. Gunn. Support Vector Machines for Classification and Regression;Image Speech and Intelligent SystemsResearch Group, University of Southampton: 1998.[17] C.–C. Chang, C.–J. Lin LIBSVM – a library for support vector machines, 2.6; 2001..tw/~cjlin/libsvm.[18]O.Ivanciuc,Applications of support vector machines in chemistry. In:Reviews in Computational Chemistry, K.B. Lipkowitz and T. R. Cundari, (eds.), Wiley–VCH, Weinheim, 2007; Vol. 23, pp 291–400.236B ioC hem Press 。