Support Vector Machines for Classification and Regression
University of Southampton

by Steve R. Gunn

Technical Report
Faculty of Engineering, Science and Mathematics
School of Electronics and Computer Science
10 May 1998

Contents

Nomenclature
1 Introduction
  1.1 Statistical Learning Theory
    1.1.1 VC Dimension
    1.1.2 Structural Risk Minimisation
2 Support Vector Classification
  2.1 The Optimal Separating Hyperplane
    2.1.1 Linearly Separable Example
  2.2 The Generalised Optimal Separating Hyperplane
    2.2.1 Linearly Non-Separable Example
  2.3 Generalisation in High Dimensional Feature Space
    2.3.1 Polynomial Mapping Example
  2.4 Discussion
3 Feature Space
  3.1 Kernel Functions
    3.1.1 Polynomial
    3.1.2 Gaussian Radial Basis Function
    3.1.3 Exponential Radial Basis Function
    3.1.4 Multi-Layer Perceptron
    3.1.5 Fourier Series
    3.1.6 Splines
    3.1.7 B splines
    3.1.8 Additive Kernels
    3.1.9 Tensor Product
  3.2 Implicit vs. Explicit Bias
  3.3 Data Normalisation
  3.4 Kernel Selection
4 Classification Example: IRIS data
  4.1 Applications
5 Support Vector Regression
  5.1 Linear Regression
    5.1.1 ε-insensitive Loss Function
    5.1.2 Quadratic Loss Function
    5.1.3 Huber Loss Function
    5.1.4 Example
  5.2 Non Linear Regression
    5.2.1 Examples
    5.2.2 Comments
6 Regression Example: Titanium Data
  6.1 Applications
7 Conclusions
A Implementation Issues
  A.1 Support Vector Classification
  A.2 Support Vector Regression
B MATLAB SVM Toolbox
Bibliography

List of Figures

1.1 Modelling Errors
1.2 VC Dimension Illustration
2.1 Optimal Separating Hyperplane
2.2 Canonical Hyperplanes
2.3 Constraining the Canonical Hyperplanes
2.4 Optimal Separating Hyperplane
2.5 Generalised Optimal Separating Hyperplane
2.6 Generalised Optimal Separating Hyperplane Example (C = 1)
2.7 Generalised Optimal Separating Hyperplane Example (C = 10^5)
2.8 Generalised Optimal Separating Hyperplane Example (C = 10^-8)
2.9 Mapping the Input Space into a High Dimensional Feature Space
2.10 Mapping input space into Polynomial Feature Space
3.1 Comparison between Implicit and Explicit bias for a linear kernel
4.1 Iris data set
4.2 Separating Setosa with a linear SVC (C = ∞)
4.3 Separating Virginica with a polynomial SVM (degree 2, C = ∞)
4.4 Separating Virginica with a polynomial SVM (degree 10, C = ∞)
4.5 Separating Virginica with a Radial Basis Function SVM (σ = 1.0, C = ∞)
4.6 Separating Virginica with a polynomial SVM (degree 2, C = 10)
4.7 The effect of C on the separation of Versicolor with a linear spline SVM
5.1 Loss Functions
5.2 Linear regression
5.3 Polynomial Regression
5.4 Radial Basis Function Regression
5.5 Spline Regression
5.6 B-spline Regression
5.7 Exponential RBF Regression
6.1 Titanium Linear Spline Regression (ε = 0.05, C = ∞)
6.2 Titanium B-Spline Regression (ε = 0.05, C = ∞)
6.3 Titanium Gaussian RBF Regression (ε = 0.05, σ = 1.0, C = ∞)
6.4 Titanium Gaussian RBF Regression (ε = 0.05, σ = 0.3, C = ∞)
6.5 Titanium Exponential RBF Regression (ε = 0.05, σ = 1.0, C = ∞)
6.6 Titanium Fourier Regression (ε = 0.05, degree 3, C = ∞)
6.7 Titanium Linear Spline Regression (ε = 0.05, C = 10)
6.8 Titanium B-Spline Regression (ε = 0.05, C = 10)

List of Tables

2.1 Linearly Separable Classification Data
2.2 Non-Linearly Separable Classification Data
5.1 Regression Data

Listings

A.1 Support Vector Classification MATLAB Code
A.2 Support Vector Regression MATLAB Code

Nomenclature

0            Column vector of zeros
(x)+         The positive part of x
C            SVM misclassification tolerance parameter
D            Dataset
K(x, x')     Kernel function
R[f]         Risk functional
R_emp[f]     Empirical risk functional

Chapter 1
Introduction

The problem of empirical data modelling is germane to many engineering applications.
In empirical data modelling a process of induction is used to build up a model of the system, from which it is hoped to deduce responses of the system that have yet to be observed. Ultimately the quantity and quality of the observations govern the performance of this empirical model. By its observational nature, the data obtained is finite and sampled; typically this sampling is non-uniform and, due to the high dimensional nature of the problem, the data will form only a sparse distribution in the input space. Consequently the problem is nearly always ill posed (Poggio et al., 1985) in the sense of Hadamard (Hadamard, 1923). Traditional neural network approaches have suffered difficulties with generalisation, producing models that can overfit the data. This is a consequence of the optimisation algorithms used for parameter selection and the statistical measures used to select the 'best' model. The foundations of Support Vector Machines (SVM) have been developed by Vapnik (1995) and are gaining popularity due to many attractive features and promising empirical performance. The formulation embodies the Structural Risk Minimisation (SRM) principle, which has been shown to be superior (Gunn et al., 1997) to the traditional Empirical Risk Minimisation (ERM) principle employed by conventional neural networks. SRM minimises an upper bound on the expected risk, whereas ERM minimises the error on the training data. It is this difference which equips SVM with a greater ability to generalise, which is the goal in statistical learning. SVMs were developed to solve the classification problem, but recently they have been extended to the domain of regression problems (Vapnik et al., 1997).

In the literature the terminology for SVMs can be slightly confusing. The term SVM is typically used to describe classification with support vector methods, and support vector regression is used to describe regression with support vector methods. In this report the term SVM will refer to both classification and regression
methods, and the terms Support Vector Classification (SVC) and Support Vector Regression (SVR) will be used for specification. This section continues with a brief introduction to the structural risk minimisation principle. In Chapter 2 the SVM is introduced in the setting of classification, being both historical and more accessible. This leads onto mapping the input into a higher dimensional feature space by a suitable choice of kernel function. The report then considers the problem of regression. Illustrative examples are given to show the properties of the techniques.

1.1 Statistical Learning Theory

This section is a very brief introduction to statistical learning theory. For a much more in-depth look at statistical learning theory, see (Vapnik, 1998).

Figure 1.1: Modelling Errors

The goal in modelling is to choose a model from the hypothesis space which is closest (with respect to some error measure) to the underlying function in the target space. Errors in doing this arise from two cases:

Approximation Error is a consequence of the hypothesis space being smaller than the target space, and hence the underlying function may lie outside the hypothesis space. A poor choice of the model space will result in a large approximation error, and is referred to as model mismatch.

Estimation Error is the error due to the learning procedure, which results in a technique selecting the non-optimal model from the hypothesis space.

Together these errors form the generalisation error. Ultimately we would like to find the function, $f$, which minimises the risk,

\[ R[f] = \int_{X \times Y} L(y, f(x))\, P(x, y)\, dx\, dy. \tag{1.1} \]

However, $P(x, y)$ is unknown. It is possible to find an approximation according to the empirical risk minimisation principle,

\[ R_{emp}[f] = \frac{1}{l} \sum_{i=1}^{l} L(y_i, f(x_i)), \tag{1.2} \]

which minimises the empirical risk,

\[ \hat{f}_{n,l}(x) = \arg\min_{f \in H_n} R_{emp}[f]. \tag{1.3} \]

Empirical risk minimisation makes sense only if,

\[ \lim_{l \to \infty} R_{emp}[f] = R[f], \tag{1.4} \]

which is true from the law of large numbers. However, it must also satisfy,

\[ \lim_{l \to \infty} \min_{f \in H_n} R_{emp}[f] = \min_{f \in H_n} R[f], \tag{1.5} \]

which is only valid when $H_n$ is 'small' enough. This condition is less intuitive and requires that the minima also converge. The following bound holds with probability $1 - \delta$,

\[ R[f] \le R_{emp}[f] + \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\delta}{4}}{l}}. \tag{1.6} \]

Remarkably, this expression for the expected risk is independent of the probability distribution.

1.1.1 VC Dimension

The VC dimension is a scalar value that measures the capacity of a set of functions.

Figure 1.2: VC Dimension Illustration

Definition 1.1 (Vapnik–Chervonenkis). The VC dimension of a set of functions is $p$ if and only if there exists a set of points $\{x_i\}_{i=1}^{p}$ such that these points can be separated in all $2^p$ possible configurations, and that no set $\{x_i\}_{i=1}^{q}$ exists where $q > p$ satisfying this property.

Figure 1.2 illustrates how three points in the plane can be shattered by the set of linear indicator functions whereas four points cannot. In this case the VC dimension is equal to the number of free parameters, but in general that is not the case; e.g. the function $A \sin(bx)$ has an infinite VC dimension (Vapnik, 1995). The set of linear indicator functions in $n$ dimensional space has a VC dimension equal to $n + 1$.

1.1.2 Structural Risk Minimisation

Create a structure such that $S_h$ is a hypothesis space of VC dimension $h$; then,

\[ S_1 \subset S_2 \subset \ldots \subset S_\infty. \tag{1.7} \]

SRM consists in solving the following problem,

\[ \min_{S_h} \; R_{emp}[f] + \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\delta}{4}}{l}}. \tag{1.8} \]

If the underlying process being modelled is not deterministic the modelling problem becomes more exacting, and consequently this chapter is restricted to deterministic processes. Multiple output problems can usually be reduced to a set of single output problems that may be considered independent. Hence it is appropriate to consider processes with multiple inputs from which it is desired to predict a single output.

Chapter 2
Support Vector Classification

The classification problem can be restricted to consideration of the two-class problem without loss of generality. In this problem the goal is to separate the two classes by a function which is induced
from available examples. The goal is to produce a classifier that will work well on unseen examples, i.e. it generalises well. Consider the example in Figure 2.1. Here there are many possible linear classifiers that can separate the data, but there is only one that maximises the margin (maximises the distance between it and the nearest data point of each class). This linear classifier is termed the optimal separating hyperplane. Intuitively, we would expect this boundary to generalise well as opposed to the other possible boundaries.

Figure 2.1: Optimal Separating Hyperplane

2.1 The Optimal Separating Hyperplane

Consider the problem of separating the set of training vectors belonging to two separate classes,

\[ D = \{(x_1, y_1), \ldots, (x_l, y_l)\}, \quad x \in \mathbb{R}^n, \; y \in \{-1, 1\}, \tag{2.1} \]

with a hyperplane,

\[ \langle w, x \rangle + b = 0. \tag{2.2} \]

The set of vectors is said to be optimally separated by the hyperplane if it is separated without error and the distance from the hyperplane to the closest vector is maximal. There is some redundancy in Equation 2.2, and without loss of generality it is appropriate to consider a canonical hyperplane (Vapnik, 1995), where the parameters $w$, $b$ are constrained by,

\[ \min_i \left| \langle w, x_i \rangle + b \right| = 1. \tag{2.3} \]

This incisive constraint on the parameterisation is preferable to alternatives in simplifying the formulation of the problem. In words it states that the norm of the weight vector should be equal to the inverse of the distance of the nearest point in the data set to the hyperplane. The idea is illustrated in Figure 2.2, where the distance from the nearest point to each hyperplane is shown.

Figure 2.2: Canonical Hyperplanes

A separating hyperplane in canonical form must satisfy the following constraints,

\[ y_i \left[ \langle w, x_i \rangle + b \right] \ge 1, \quad i = 1, \ldots, l. \tag{2.4} \]

The distance $d(w, b; x)$ of a point $x$ from the hyperplane $(w, b)$ is,

\[ d(w, b; x) = \frac{\left| \langle w, x \rangle + b \right|}{\|w\|}. \tag{2.5} \]

The optimal hyperplane is given by maximising the margin, $\rho$, subject to the constraints of Equation 2.4. The margin is given by,

\[ \begin{aligned}
\rho(w, b) &= \min_{x_i : y_i = -1} d(w, b; x_i) + \min_{x_i : y_i = 1} d(w, b; x_i) \\
&= \min_{x_i : y_i = -1} \frac{\left| \langle w, x_i \rangle + b \right|}{\|w\|} + \min_{x_i : y_i = 1} \frac{\left| \langle w, x_i \rangle + b \right|}{\|w\|} \\
&= \frac{1}{\|w\|} \left( \min_{x_i : y_i = -1} \left| \langle w, x_i \rangle + b \right| + \min_{x_i : y_i = 1} \left| \langle w, x_i \rangle + b \right| \right) \\
&= \frac{2}{\|w\|}.
\end{aligned} \tag{2.6} \]

Hence the hyperplane that optimally separates the data is the one that minimises

\[ \Phi(w) = \frac{1}{2} \|w\|^2. \tag{2.7} \]

It is independent of $b$ because, provided Equation 2.4 is satisfied (i.e. it is a separating hyperplane), changing $b$ will move it in the normal direction to itself. Accordingly the margin remains unchanged but the hyperplane is no longer optimal in that it will be nearer to one class than the other. To consider how minimising Equation 2.7 is equivalent to implementing the SRM principle, suppose that the following bound holds,

\[ \|w\| < A. \tag{2.8} \]

Then from Equations 2.4 and 2.5,

\[ d(w, b; x) \ge \frac{1}{A}. \tag{2.9} \]

Accordingly the hyperplanes cannot be nearer than $1/A$ to any of the data points, and intuitively it can be seen in Figure 2.3 how this reduces the possible hyperplanes, and hence the capacity.

Figure 2.3: Constraining the Canonical Hyperplanes

The VC dimension, $h$, of the set of canonical hyperplanes in $n$ dimensional space is bounded by,

\[ h \le \min[R^2 A^2, n] + 1, \tag{2.10} \]

where $R$ is the radius of a hypersphere enclosing all the data points. Hence minimising Equation 2.7 is equivalent to minimising an upper bound on the VC dimension. The solution to the optimisation problem of Equation 2.7 under the constraints of Equation 2.4 is given by the saddle point of the Lagrange functional (Lagrangian) (Minoux, 1986),

\[ \Phi(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i \left( y_i \left[ \langle w, x_i \rangle + b \right] - 1 \right), \tag{2.11} \]

where $\alpha$ are the Lagrange multipliers. The Lagrangian has to be minimised with respect to $w$, $b$ and maximised with respect to $\alpha \ge 0$. Classical Lagrangian duality enables the primal problem, Equation 2.11, to be transformed to its dual problem, which is easier to solve. The dual problem is given by,

\[ \max_{\alpha} W(\alpha) = \max_{\alpha} \left( \min_{w, b} \Phi(w, b, \alpha) \right). \tag{2.12} \]

The minimum with respect to $w$ and $b$ of the Lagrangian, $\Phi$, is given by,

\[ \begin{aligned}
\frac{\partial \Phi}{\partial b} = 0 &\;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0, \\
\frac{\partial \Phi}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i.
\end{aligned} \tag{2.13} \]

Hence from
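Equations 2.11, 2.12 and 2.13, the dual problem is,

\[ \max_{\alpha} W(\alpha) = \max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_{k=1}^{l} \alpha_k, \tag{2.14} \]

and hence the solution to the problem is given by,

\[ \alpha^* = \arg\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_{k=1}^{l} \alpha_k, \tag{2.15} \]

with constraints,

\[ \alpha_i \ge 0, \quad i = 1, \ldots, l, \qquad \sum_{j=1}^{l} \alpha_j y_j = 0. \tag{2.16} \]

Solving Equation 2.15 with constraints Equation 2.16 determines the Lagrange multipliers, and the optimal separating hyperplane is given by,

\[ w^* = \sum_{i=1}^{l} \alpha_i y_i x_i, \qquad b^* = -\frac{1}{2} \langle w^*, x_r + x_s \rangle, \tag{2.17} \]

where $x_r$ and $x_s$ are any support vector from each class satisfying,

\[ \alpha_r, \alpha_s > 0, \quad y_r = -1, \quad y_s = 1. \tag{2.18} \]

The hard classifier is then,

\[ f(x) = \mathrm{sgn}(\langle w^*, x \rangle + b). \tag{2.19} \]

Alternatively, a soft classifier may be used which linearly interpolates the margin,

\[ f(x) = h(\langle w^*, x \rangle + b) \quad \text{where} \quad h(z) = \begin{cases} -1 & : z < -1 \\ z & : -1 \le z \le 1 \\ +1 & : z > 1 \end{cases} \tag{2.20} \]

This may be more appropriate than the hard classifier of Equation 2.19, because it produces a real valued output between $-1$ and $1$ when the classifier is queried within the margin, where no training data resides. From the Kuhn-Tucker conditions,

\[ \alpha_i \left( y_i \left[ \langle w, x_i \rangle + b \right] - 1 \right) = 0, \quad i = 1, \ldots, l, \tag{2.21} \]

and hence only the points $x_i$ which satisfy,

\[ y_i \left[ \langle w, x_i \rangle + b \right] = 1 \tag{2.22} \]

will have non-zero Lagrange multipliers. These points are termed Support Vectors (SV).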
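As a minimal sketch of Equations 2.19 to 2.22, the Python below evaluates the hard and soft classifiers and lists the training points that satisfy Equation 2.22. The seven points are those the report lists in Table 2.1. The hyperplane w = (-2, 4), b = -5 was worked out by hand for this illustration and is an assumption, not a value quoted in the report; the function names are likewise hypothetical.

```python
# Seven training points (x1, x2, y) as listed in Table 2.1.
DATA = [
    (1.0, 1.0, -1), (3.0, 3.0, 1), (1.0, 3.0, 1), (3.0, 1.0, -1),
    (2.0, 2.5, 1), (3.0, 2.5, -1), (4.0, 3.0, -1),
]

# Assumed hyperplane in canonical form, derived by hand for this sketch.
W = (-2.0, 4.0)
B = -5.0


def decision(x, w=W, b=B):
    """Return <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hard_classifier(x):
    """Equation 2.19; a value of exactly zero is assigned to +1 (arbitrary choice)."""
    return 1 if decision(x) >= 0 else -1


def soft_classifier(x):
    """Equation 2.20: the decision value clipped to the interval [-1, 1]."""
    return max(-1.0, min(1.0, decision(x)))


def on_margin(tol=1e-9):
    """Points satisfying Equation 2.22, the only ones that can have alpha_i > 0."""
    return [(x1, x2) for x1, x2, y in DATA
            if abs(y * decision((x1, x2)) - 1.0) <= tol]


if __name__ == "__main__":
    print(on_margin())
    print(soft_classifier((2.5, 2.5)))
```

Under these assumed parameters every point satisfies Equation 2.4, and four of the seven lie exactly on the margin.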
If the data is linearly separable all the SV will lie on the margin and hence the number of SV can be very small. Consequently the hyperplane is determined by a small subset of the training set; the other points could be removed from the training set and recalculating the hyperplane would produce the same answer. Hence SVM can be used to summarise the information contained in a data set by the SV produced. If the data is linearly separable the following equality will hold,

\[ \|w\|^2 = \sum_{i=1}^{l} \alpha_i = \sum_{i \in SVs} \alpha_i = \sum_{i \in SVs} \sum_{j \in SVs} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle. \tag{2.23} \]

Hence from Equation 2.10 the VC dimension of the classifier is bounded by,

\[ h \le \min\left[ R^2 \sum_{i \in SVs} \alpha_i, \; n \right] + 1, \tag{2.24} \]

and if the training data, $x$, is normalised to lie in the unit hypersphere,

\[ h \le 1 + \min\left[ \sum_{i \in SVs} \alpha_i, \; n \right]. \tag{2.25} \]

Table 2.1: Linearly Separable Classification Data

x1    x2    y
1     1     -1
3     3     1
1     3     1
3     1     -1
2     2.5   1
3     2.5   -1
4     3     -1

2.1.1 Linearly Separable Example

To illustrate the method consider the training set in Table 2.1. The SVC solution is shown in Figure 2.4, where the dotted lines describe the locus of the margin and the circled data points represent the SV, which all lie on the margin.

Figure 2.4: Optimal Separating Hyperplane

2.2 The Generalised Optimal Separating Hyperplane

So far the discussion has been restricted to the case where the training data is linearly separable. However, in general this will not be the case, Figure 2.5.

Figure 2.5: Generalised Optimal Separating Hyperplane

There are two approaches to generalising the problem, which are dependent upon prior knowledge of the problem and an estimate of the noise on the data. In the case where it is expected (or possibly even known) that a hyperplane can correctly separate the data, a method of introducing an additional cost function associated with misclassification is appropriate.
Alternatively a more complex function can be used to describe the boundary, as discussed in Chapter 2.1. To enable the optimal separating hyperplane method to be generalised, Cortes and Vapnik (1995) introduced non-negative variables, $\xi_i \ge 0$, and a penalty function,

\[ F_\sigma(\xi) = \sum_i \xi_i^\sigma, \quad \sigma > 0, \tag{2.26} \]

where the $\xi_i$ are a measure of the misclassification errors. The optimisation problem is now posed so as to minimise the classification error as well as minimising the bound on the VC dimension of the classifier. The constraints of Equation 2.4 are modified for the non-separable case to,

\[ y_i \left[ \langle w, x_i \rangle + b \right] \ge 1 - \xi_i, \quad i = 1, \ldots, l, \tag{2.27} \]

where $\xi_i \ge 0$. The generalised optimal separating hyperplane is determined by the vector $w$ that minimises the functional,

\[ \Phi(w, \xi) = \frac{1}{2} \|w\|^2 + C \sum_i \xi_i, \tag{2.28} \]

(where $C$ is a given value) subject to the constraints of Equation 2.27. The solution to the optimisation problem of Equation 2.28 under the constraints of Equation 2.27 is given by the saddle point of the Lagrangian (Minoux, 1986),

\[ \Phi(w, b, \alpha, \xi, \beta) = \frac{1}{2} \|w\|^2 + C \sum_i \xi_i - \sum_{i=1}^{l} \alpha_i \left( y_i \left[ \langle w, x_i \rangle + b \right] - 1 + \xi_i \right) - \sum_{j=1}^{l} \beta_j \xi_j, \tag{2.29} \]

where $\alpha$, $\beta$ are the Lagrange multipliers. The Lagrangian has to be minimised with respect to $w$, $b$, $\xi$ and maximised with respect to $\alpha$, $\beta$. As before, classical Lagrangian duality enables the primal problem, Equation 2.29, to be transformed to its dual problem.
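To make the role of the slack variables concrete, the following sketch computes the smallest $\xi_i$ that satisfy Equation 2.27 for a fixed hyperplane and then evaluates the functional of Equation 2.28. The nine points are the data of Table 2.2 (the seven points of Table 2.1 plus two extra points). The hyperplane w = (-2, 4), b = -5 is an assumption chosen by hand for illustration; it is not the solution of the soft-margin problem, and the function names are hypothetical.

```python
# Nine training points (x1, x2, y): the data of Table 2.2.
DATA = [
    (1.0, 1.0, -1), (3.0, 3.0, 1), (1.0, 3.0, 1), (3.0, 1.0, -1),
    (2.0, 2.5, 1), (3.0, 2.5, -1), (4.0, 3.0, -1),
    (1.5, 1.5, 1), (1.0, 2.0, -1),
]

# Assumed hyperplane, fixed by hand for this sketch (not an optimised solution).
W = (-2.0, 4.0)
B = -5.0


def slacks(w, b, data):
    """Smallest xi_i >= 0 with y_i * (<w, x_i> + b) >= 1 - xi_i (Equation 2.27)."""
    return [max(0.0, 1.0 - y * (w[0] * x1 + w[1] * x2 + b))
            for x1, x2, y in data]


def primal_objective(w, b, data, C):
    """Equation 2.28 evaluated with the minimal slacks: 0.5 * ||w||^2 + C * sum(xi)."""
    return 0.5 * (w[0] ** 2 + w[1] ** 2) + C * sum(slacks(w, b, data))


if __name__ == "__main__":
    print(slacks(W, B, DATA))
    print(primal_objective(W, B, DATA, C=1.0))
```

Only the two added points receive non-zero slack here, which shows how $C$ trades the margin term against the total slack: a larger $C$ makes those two violations more expensive relative to $\frac{1}{2}\|w\|^2$.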
The dual problem is given by,

\[ \max_{\alpha, \beta} W(\alpha, \beta) = \max_{\alpha, \beta} \left( \min_{w, b, \xi} \Phi(w, b, \alpha, \xi, \beta) \right). \tag{2.30} \]

The minimum with respect to $w$, $b$ and $\xi$ of the Lagrangian, $\Phi$, is given by,

\[ \begin{aligned}
\frac{\partial \Phi}{\partial b} = 0 &\;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0, \\
\frac{\partial \Phi}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i, \\
\frac{\partial \Phi}{\partial \xi} = 0 &\;\Rightarrow\; \alpha_i + \beta_i = C.
\end{aligned} \tag{2.31} \]

Hence from Equations 2.29, 2.30 and 2.31, the dual problem is,

\[ \max_{\alpha} W(\alpha) = \max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_{k=1}^{l} \alpha_k, \tag{2.32} \]

and hence the solution to the problem is given by,

\[ \alpha^* = \arg\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_{k=1}^{l} \alpha_k, \tag{2.33} \]

with constraints,

\[ 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \qquad \sum_{j=1}^{l} \alpha_j y_j = 0. \tag{2.34} \]

The solution to this minimisation problem is identical to the separable case except for a modification of the bounds of the Lagrange multipliers. The uncertain part of Cortes's approach is that the coefficient $C$ has to be determined. This parameter introduces additional capacity control within the classifier. $C$ can be directly related to a regularisation parameter (Girosi, 1997; Smola and Schölkopf, 1998). Blanz et al. (1996) use a value of $C = 5$, but ultimately $C$ must be chosen to reflect the knowledge of the noise on the data. This warrants further work, but a more practical discussion is given in Chapter 4.

Table 2.2: Non-Linearly Separable Classification Data

x1    x2    y
1     1     -1
3     3     1
1     3     1
3     1     -1
2     2.5   1
3     2.5   -1
4     3     -1
1.5   1.5   1
1     2     -1

2.2.1 Linearly Non-Separable Example

Two additional data points are added to the separable data of Table 2.1 to produce a linearly non-separable data set, Table 2.2. The resulting SVC is shown in Figure 2.6, for $C = 1$. The SV are no longer required to lie on the margin, as in Figure 2.4, and the orientation of the hyperplane and the width of the margin are different.

Figure 2.6: Generalised Optimal Separating Hyperplane Example (C = 1)

In the limit $C \to \infty$ the solution converges towards the solution obtained by the optimal separating hyperplane (on this non-separable data), Figure 2.7. In the limit $C \to 0$ the solution converges to one where the margin maximisation term dominates, Figure 2.8. Beyond a certain point the Lagrange multipliers will all take on the value of $C$. There is now less
emphasis on minimising the misclassification error; the solution instead concentrates purely on maximising the margin, producing a large width margin. Consequently as $C$ decreases the width of the margin increases. The useful range of $C$ lies between the point where all the Lagrange multipliers are equal to $C$ and the point where only one of them is just bounded by $C$.

Figure 2.7: Generalised Optimal Separating Hyperplane Example (C = 10^5)

Figure 2.8: Generalised Optimal Separating Hyperplane Example (C = 10^-8)

2.3 Generalisation in High Dimensional Feature Space

In the case where a linear boundary is inappropriate the SVM can map the input vector, $x$, into a high dimensional feature space, $z$. By choosing a non-linear mapping a priori, the SVM constructs an optimal separating hyperplane in this higher dimensional space, Figure 2.9. The idea exploits the method of Aizerman et al. (1964), which enables the curse of dimensionality (Bellman, 1961) to be addressed.

Figure 2.9: Mapping the Input Space into a High Dimensional Feature Space

There are some restrictions on the non-linear mapping that can be employed, see Chapter 3, but it turns out, surprisingly, that most commonly employed functions are acceptable. Among acceptable mappings are polynomials, radial basis functions and certain sigmoid functions. The optimisation problem of Equation 2.33 becomes,

\[ \alpha^* = \arg\min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{k=1}^{l} \alpha_k, \tag{2.35} \]

where $K(x, x')$ is the kernel function performing the non-linear mapping into feature space, and the constraints are unchanged,

\[ 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \qquad \sum_{j=1}^{l} \alpha_j y_j = 0. \tag{2.36} \]

Solving Equation 2.35 with constraints Equation 2.36 determines the Lagrange multipliers, and a hard classifier implementing the optimal separating hyperplane in the feature space is given by,

\[ f(x) = \mathrm{sgn}\left( \sum_{i \in SVs} \alpha_i y_i K(x_i, x) + b \right), \tag{2.37} \]

where

\[ \langle w^*, x \rangle = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x), \qquad b^* = -\frac{1}{2} \sum_{i=1}^{l} \alpha_i y_i \left[ K(x_i, x_r) + K(x_i, x_s) \right]. \tag{2.38} \]

The bias is computed here using two support vectors, but can be computed using all the SV on the margin for
stability (Vapnik et al., 1997). If the kernel contains a bias term, the bias can be accommodated within the kernel, and hence the classifier is simply,

\[ f(x) = \mathrm{sgn}\left( \sum_{i \in SVs} \alpha_i y_i K(x_i, x) \right). \tag{2.39} \]

Many employed kernels have a bias term and any finite kernel can be made to have one (Girosi, 1997). This simplifies the optimisation problem by removing the equality constraint of Equation 2.36. Chapter 3 discusses the necessary conditions that must be satisfied by valid kernel functions.

2.3.1 Polynomial Mapping Example

Consider a polynomial kernel of the form,

\[ K(x, x') = (\langle x, x' \rangle + 1)^2, \tag{2.40} \]

which maps a two dimensional input vector into a six dimensional feature space. Applying the non-linear SVC to the linearly non-separable training data of Table 2.2 produces the classification illustrated in Figure 2.10 ($C = \infty$). The margin is no longer of constant width due to the non-linear projection into the input space. The solution is in contrast to Figure 2.7, in that the training data is now classified correctly. However, even though SVMs implement the SRM principle and hence can generalise well, a careful choice of the kernel function is necessary to produce a classification boundary that is topologically appropriate. It is always possible to map the input space into a dimension greater than the number of training points and produce a classifier with no classification errors on the training set. However, this will generalise badly.

Figure 2.10: Mapping input space into Polynomial Feature Space

2.4 Discussion

Typically the data will only be linearly separable in some, possibly very high dimensional, feature space. It may not make sense to try and separate the data exactly, particularly when only a finite amount of training data is available which is potentially corrupted by noise. Hence in practice it will be necessary to employ the non-separable approach which places an upper bound on the Lagrange multipliers. This raises the question of how to determine the parameter $C$. It is similar to the
problem in regularisation where the regularisation coefficient has to be determined, and it has been shown that the parameter $C$ can be directly related to a regularisation parameter for certain kernels (Smola and Schölkopf, 1998). A process of cross-validation can be used to determine this parameter, although more efficient and potentially better methods are sought after. In removing the training patterns that are not support vectors, the solution is unchanged and hence a fast method for validation may be available when the support vectors are sparse.

Chapter 3
Feature Space

This chapter discusses the method that can be used to construct a mapping into a high dimensional feature space by the use of reproducing kernels. The idea of the kernel function is to enable operations to be performed in the input space rather than the potentially high dimensional feature space. Hence the inner product does not need to be evaluated in the feature space. This provides a way of addressing the curse of dimensionality. However, the computation is still critically dependent upon the number of training patterns, and to provide a good data distribution for a high dimensional problem will generally require a large training set.

3.1 Kernel Functions

The following theory is based upon Reproducing Kernel Hilbert Spaces (RKHS) (Aronszajn, 1950; Girosi, 1997; Heckman, 1997; Wahba, 1990). An inner product in feature space has an equivalent kernel in input space,

\[ K(x, x') = \langle \phi(x), \phi(x') \rangle, \tag{3.1} \]

provided certain conditions hold. If $K$ is a symmetric positive definite function, which satisfies Mercer's conditions,

\[ K(x, x') = \sum_{m}^{\infty} a_m \phi_m(x) \phi_m(x'), \quad a_m \ge 0, \tag{3.2} \]

\[ \iint K(x, x')\, g(x)\, g(x')\, dx\, dx' > 0, \quad g \in L_2, \tag{3.3} \]

then the kernel represents a legitimate inner product in feature space. Valid functions that satisfy Mercer's conditions are now given, which unless stated are valid for all real $x$ and $x'$.

3.1.1 Polynomial

A polynomial mapping is a popular method for non-linear modelling,

\[ K(x, x') = \langle x, x' \rangle^d, \tag{3.4} \]

\[ K(x, x') = (\langle x, x' \rangle + 1)^d. \tag{3.5} \]

The second kernel is usually preferable as it avoids problems with the Hessian becoming zero.

3.1.2 Gaussian Radial Basis Function

Radial basis functions have received significant attention, most commonly with a Gaussian of the form,

\[ K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right). \tag{3.6} \]

Classical techniques utilising radial basis functions employ some method of determining a subset of centres. Typically a method of clustering is first employed to select a subset of centres. An attractive feature of the SVM is that this selection is implicit, with each support vector contributing one local Gaussian function, centred at that data point. By further considerations it is possible to select the global basis function width, $\sigma$, using the SRM principle (Vapnik, 1995).

3.1.3 Exponential Radial Basis Function

A radial basis function of the form,

\[ K(x, x') = \exp\left( -\frac{\|x - x'\|}{2\sigma^2} \right), \tag{3.7} \]

produces a piecewise linear solution which can be attractive when discontinuities are acceptable.

3.1.4 Multi-Layer Perceptron

The long established MLP, with a single hidden layer, also has a valid kernel representation,

\[ K(x, x') = \tanh\left( \rho \langle x, x' \rangle + \varrho \right), \tag{3.8} \]

for certain values of the scale, $\rho$, and offset, $\varrho$, parameters. Here the SV correspond to the first layer and the Lagrange multipliers to the weights.
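The kernels of Equations 3.5 to 3.8 can be written down directly. The sketch below is an illustration in plain Python, with hypothetical function and parameter names, and it also checks numerically the claim of Section 2.3.1 that the degree-2 polynomial kernel on two dimensional inputs corresponds to an inner product in a six dimensional feature space. The explicit map `phi_poly2` is one standard choice of such a feature map and is an assumption here, since the report does not write it out.

```python
import math


def dot(x, z):
    """Inner product <x, z> of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, z))


def poly_kernel(x, z, d=2):
    """Equation 3.5: (<x, z> + 1)^d."""
    return (dot(x, z) + 1.0) ** d


def gaussian_rbf(x, z, sigma=1.0):
    """Equation 3.6: exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def exponential_rbf(x, z, sigma=1.0):
    """Equation 3.7: exp(-||x - z|| / (2 sigma^2))."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
    return math.exp(-dist / (2.0 * sigma ** 2))


def mlp_kernel(x, z, scale, offset):
    """Equation 3.8: tanh(scale * <x, z> + offset); valid only for certain values."""
    return math.tanh(scale * dot(x, z) + offset)


def phi_poly2(x):
    """Assumed explicit six-dimensional feature map for poly_kernel with d = 2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)


if __name__ == "__main__":
    x, z = (1.0, 2.0), (3.0, -1.0)
    # Both numbers should agree up to floating point rounding.
    print(poly_kernel(x, z), dot(phi_poly2(x), phi_poly2(z)))
```

Evaluating `poly_kernel` costs one two dimensional inner product, whereas the explicit route builds two six dimensional vectors first; this is the saving that the kernel formulation is designed to exploit.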
Learning Statistical Models of Time-Varying Relational Data

Sumit Sanghai, Pedro Domingos and Daniel Weld
Department of Computer Science and Engineering
University of Washington, Seattle, WA 98195

1 Introduction

Formalisms that can represent objects and relations, as opposed to just variables, have a long history in AI. Recently, significant progress has been made in combining them with a principled treatment of uncertainty. In particular, probabilistic relational models or PRMs [4] are an extension of Bayesian networks that allows reasoning with classes, objects and relations. Although PRMs have been successfully applied to many different domains, they do not capture the temporal dynamics of the real world. In most real world systems, objects get created, modified and even deleted over time. Similarly, the relationships between objects change as time progresses. For example, consider the problem of predicting the set of research topics that become "hot" (e.g., as measured by the number of papers published about them) over time, the changing distribution of these topics among conferences, and the interests and collaborations between authors. It would be difficult to learn a PRM that modeled this time-varying behavior.

Currently the most powerful representation available for capturing sequential phenomena is dynamic Bayesian networks (DBNs) [1], but DBNs are unable to compactly represent many real-world domains that contain multiple objects and classes of objects, as well as multiple kinds of relations among them. DBNs are even more awkward if one wishes to model objects and relations that appear and disappear over time. Thus, our research has focused on a new representation, dynamic probabilistic relational models (DPRMs), which combines PRMs with DBNs. Previously, we have explored the problem of efficient inference [8]; this paper outlines our thoughts on learning DPRMs.

2 Dynamic Probabilistic Relational Models

We start by briefly summarizing the definition of PRMs and DPRMs, adapted from [4; 8]. A PRM encodes
a probability distribution over the set of all possible instantiations $I$ of a schema. In the simplest case, the relational attributes of all objects are assumed to be known, and the PRM specifies a probability distribution for each propositional attribute $A$ of each class $C$. The parents of each attribute (i.e., the variables it depends on) can be other attributes of $C$, or attributes of classes that are related to $C$ by some slot chain. Thus, by knowing the relational attributes one can get the joint probability distribution by computing the set of parents for each object and its attributes and calculating the probability through the distribution specified. More generally, only the object skeleton might be known, in which case the PRM also needs to specify a distribution over the relational attributes [5].

Now, we extend PRMs to handle the time domain in the same way that DBNs extend Bayesian networks. Given a relational schema $S$, we first extend each class $C$ with the relational attribute $C$.previous, with domain $C$. As before, we initially assume that the relational skeleton at each time slice is known.

Definition 1. A two-time-slice PRM (2TPRM) for a relational schema $S$ is defined as follows. For each class $C$ and each propositional attribute $A \in \mathcal{A}(C)$, we have:
• A set of parents $Pa(C.A) = \{Pa_1, Pa_2, \ldots, Pa_l\}$, where each $Pa_i$ has the form $C.B$ or $f(C.\tau.B)$, where $\tau$ is a slot chain containing the attribute previous at most once, and $f()$ is an aggregation function.
• A conditional probability model for $P(C.A \mid Pa(C.A))$.

Definition 2. A dynamic probabilistic relational model (DPRM) for a relational schema $S$ is a pair $(M_0, M_\rightarrow)$, where $M_0$ is a PRM over $I_0$, representing the distribution $P_0$ over the initial instantiation of $S$, and $M_\rightarrow$ is a 2TPRM representing the transition distribution $P(I_t \mid I_{t-1})$ connecting successive instantiations of $S$.

DPRMs are extended to the case where only the object skeleton for each time slice is known in the same way that PRMs are, by adding to Definition 1 a set of parents and conditional
probability model for each relational attribute, where the parents can be in the same or the previous time slice. When the object skeleton is not known (e.g., if objects can appear and disappear over time), the 2TPRM includes in addition a Boolean existence variable for each possible object, again with parents from the same or the previous time slice.

3 Inference in DPRMs

Just as a PRM can be expanded into a Bayesian network, so can a DPRM be unrolled into a DBN. In principle, we can then perform inference using particle filtering [2], the most widely used approximate inference algorithm for DBNs. Particle filtering maintains a set of samples (particles) to approximate the distribution of any state; the distribution for the next state is obtained by importance sampling and resampling. Unfortunately, for DPRMs, particle filtering is likely to perform poorly, because the state space will be huge. We overcome this by adapting Rao-Blackwellisation [7] to the relational setting. Rao-Blackwellisation divides the state variables into two sets: one in which values are inferred using a particle filter and the other in which values are calculated analytically from the values of the variables in the first set.
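To fix ideas about the sampling and resampling cycle described above, here is a minimal sketch of one step of a plain (not Rao-Blackwellised) particle filter. It is an illustration only: the two-state model, the probabilities 0.9 and 0.8, and all function names are assumptions made for this example and do not come from the paper.

```python
import random


def particle_filter_step(particles, transition, likelihood, observation, rng):
    """Propagate, weight by the observation, then resample with replacement."""
    # 1. Sample each particle's next state from the transition model.
    proposed = [transition(p, rng) for p in particles]
    # 2. Importance weights: how well each proposed state explains the observation.
    weights = [likelihood(observation, p) for p in proposed]
    total = sum(weights)
    if total == 0.0:
        # No particle can explain the observation, so the filter cannot continue.
        raise ValueError("observation is inconsistent with every particle")
    weights = [w / total for w in weights]
    # 3. Resample in proportion to the normalised weights.
    return rng.choices(proposed, weights=weights, k=len(proposed))


def transition(state, rng):
    """Hypothetical two-state chain: keep the state with probability 0.9."""
    return state if rng.random() < 0.9 else 1 - state


def likelihood(observation, state):
    """Hypothetical noisy sensor: reports the true state with probability 0.8."""
    return 0.8 if observation == state else 0.2


if __name__ == "__main__":
    rng = random.Random(0)
    particles = [0] * 50 + [1] * 50
    particles = particle_filter_step(particles, transition, likelihood, 1, rng)
    print(sum(particles) / len(particles))  # estimated probability of state 1
```

In a Rao-Blackwellised variant, each particle would carry sampled values for only part of the state together with an exactly maintained distribution over the remainder, so that step 1 samples fewer variables.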
We make the major assumption that relational attributes do not appear anywhere in the DPRM as parents of unobserved attributes, and that each reference slot can be occupied by at most one object. Then, a Rao-Blackwellised particle is composed of sampled values for all propositional attributes of all objects, plus a probability vector for each relational attribute of each object which is inferred exactly.

While this technique can vastly reduce the size of the state space which particle filtering needs to sample, storing and updating all the requisite probabilities can still become quite expensive. This expense can be ameliorated if context-specific independences exist. We can then replace the vector of probabilities with a novel tree structure whose leaves represent probabilities for entire sets of objects [8].

Our experiments evaluated the efficiency of several inference schemes applied to an assembly-plan execution monitoring task in a simplified manufacturing domain. Even with hundreds of thousands of particles, standard particle filtering failed (i.e. terminated due to inconsistent observations which could not be explained) on datasets with around 100 objects and 500 time steps. In contrast, our inference algorithm yielded accurate predictions on similar problems with only 5000 particles, and ran more quickly and with less storage [8].

Much work remains to improve inference. For example, we will endeavor to lift the assumptions mentioned above and more effectively use a DPRM's structure during inference.

4 Learning in DPRMs

When a DPRM consists of only a single time slice it becomes equivalent to a PRM, and when the DPRM is devoid of relations it is a DBN. Thus we look to combine the learning algorithms already developed for PRMs and DBNs. The first step, parameter learning, appears to be relatively straightforward when no data is missing, since the parameters associated with different types of nodes can be estimated individually.
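As a rough illustration of parameter estimation with complete data, the sketch below estimates one conditional probability table by counting. It assumes maximum-likelihood estimation over records pooled from all objects of a class, a standard choice that the text does not confirm; the function `estimate_cpt` and the attribute names `prev` and `cur` are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_cpt(records, child, parents):
    """Maximum-likelihood CPT for `child` given `parents` from complete data.

    `records` is a list of dicts, one per (object, time step) pair, pooled
    over every object of the class so that the parameters are shared.
    """
    counts = defaultdict(Counter)
    for rec in records:
        parent_values = tuple(rec[p] for p in parents)
        counts[parent_values][rec[child]] += 1

    cpt = {}
    for parent_values, child_counts in counts.items():
        total = sum(child_counts.values())
        # Relative frequencies are the maximum-likelihood estimates.
        cpt[parent_values] = {v: n / total for v, n in child_counts.items()}
    return cpt


# Hypothetical complete data: an attribute's value in the previous
# time slice ("prev") and in the current one ("cur").
records = (
    [{"prev": "ok", "cur": "ok"}] * 3
    + [{"prev": "ok", "cur": "broken"}]
    + [{"prev": "broken", "cur": "broken"}] * 2
)

if __name__ == "__main__":
    print(estimate_cpt(records, child="cur", parents=["prev"]))
```

Each table is estimated from its own counts, independently of every other table, which is what makes the complete-data case straightforward.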
However, there is a subtlety which makes the problem more complex than in a DBN:
• A DPRM can generate a unique state in multiple ways, and each way must be considered during parameter estimation.
For example, if in the new state objects get created, the order of creation can affect the likelihood of the data, as the newly-created objects can interact with each other. There may be a combinatorial number of ways in which a DPRM may generate each state, so we are developing methods to do parameter estimation efficiently. One possibility is to impose a canonical ordering, and another is to greedily compute the most likely order(s) in which the data could have been generated.

In order to learn the DPRM structure, we have to take care of several more issues:
• Defining constraints to eliminate illegal DPRMs is essential when navigating the space of structures. A cycle in a PRM is illegal, and this constraint extends to the two parts of a DPRM. There are additional constraints on a 2TPRM; specifying these in a way that allows creation of an unbounded number of dynamic objects is challenging.
• There are several strategies for searching the space of DPRM structures. The simplest idea is to add and delete edges in the two components, PRM and 2TPRM, to generate candidate DPRMs. One could do the search by first learning a PRM which gives a good intra-time-slice connectivity, before learning the inter-time-slice connectivity.
• An important task is scoring a DPRM, e.g. with a likelihood-based measure. To compute the likelihood of the data given a candidate DPRM, fast DPRM inference is required. While our particle filtering algorithm is quite fast, we wish to extend it so that we can efficiently explore the space of DPRMs, incrementally updating the likelihood scores. We believe the two-phase search strategy suggested previously will simplify this task.
• Since the space of candidate DPRM models is huge, we are considering pruning mechanisms. Note that some of the methods stated above actually prune the
space (e.g. learning the PRM first, followed by time dependencies). One may also impose priors on the models to bias towards simplicity by limiting the number of edges. We plan to design priors over DPRM structures by extending the approach of Heckerman et al. [6] who exponentially penalize arc differences from a "best" prior structure. We will compare the relative benefits of doing this at the class vs. instance level.
• We plan to extend the learning algorithm to work in the presence of missing values and hidden variables. EM is easiest to apply when the observations are relational but the hidden state is not. Solving this problem with full generality would require an extension of structural EM [3], but this needs to be done for PRMs first.

Acknowledgements

This work was partly supported by an NSF CAREER Award to the second author, by ONR grant N00014-02-1-0932, and by NASA grant NAG2-1538.

References

[1] T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 1989.
[2] A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[3] N. Friedman. The Bayesian structural EM algorithm. UAI-98.
[4] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. IJCAI-99.
[5] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of relational structure. ICML-01.
[6] D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 1995.
[7] K. Murphy and S. Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In A. Doucet, et al., editors, Sequential Monte Carlo Methods in Practice. Springer, 2001.
[8] S. Sanghai, P. Domingos, and D. Weld. Dynamic probabilistic relational models. Submitted to IJCAI-03.