Gene Selection Using a New Error Bound for Support Vector Machines
Xin Zhou¹, David P. Tuck¹
Keywords: Gene selection, Microarray analysis, Support vector machines, Error bound

1 Introduction.
The microarray technique has the power to simultaneously measure the expression levels of tens of thousands of genes in a single experiment [1, 2]. Selection of relevant genes for sample classification (for example, to distinguish cancerous from normal tissues) is a common task in most microarray data studies. In the past few years, the support vector machine (SVM) [5] has been widely used for solving pattern classification problems. As the SVM is well suited to high-dimensional data, it has been extensively applied to microarray data analysis. In this work, we develop a new error bound for support vector machines and propose a gene selection algorithm that uses this bound to identify informative genes.
2 Method.
The leave-one-out error is an almost unbiased estimate of the true error rate of a classifier. Several bounds on the leave-one-out error rate of the SVM classifier have been proposed; the radius/margin bound [5] and the span bound [4] are two frequently used examples. However, both are applicable only to separable cases, while most real applications are non-separable. Furthermore, the radius/margin bound is very loose, and the span bound relies on the assumption that the set of support vectors remains unchanged during the leave-one-out procedure, which does not always hold. In this work, we prove a new bound that applies to both separable and non-separable cases without any such assumption.
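To make the quantity being bounded concrete, the following sketch computes the exact leave-one-out error rate of an SVM by retraining n times. This is an illustration only, not part of the authors' method; the use of scikit-learn and a linear kernel are assumptions, as the abstract does not specify an implementation.

```python
# Sketch: the exact leave-one-out (LOO) error rate that the bounds above
# try to upper-bound without retraining. scikit-learn and a linear kernel
# are assumptions for illustration; the paper specifies neither.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loo_error(X, y, C=1.0):
    """Retrain the SVM n times, each time leaving out one sample."""
    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel="linear", C=C)
        clf.fit(X[train_idx], y[train_idx])
        errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])
    return errors / len(y)

# Small synthetic two-class example (hypothetical data, not microarray data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 5)), rng.normal(1, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)
print(loo_error(X, y))
```

The n retrainings are exactly what makes direct LOO evaluation expensive for feature selection, which is the motivation for a closed-form bound.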
For the SVM, the leave-one-out error rate L satisfies the following inequality:

L \le \frac{1}{n^2}\, W(\alpha^*) \left[ \sum_p \left( D_p^2 + \bar{D}_p^2 \right) + 4\, n_{SV}\, C \right], \qquad (1)

where n is the number of samples, n_{SV} is the number of support vectors, C is the regularization parameter of the SVM, and W(\alpha^*) is the optimal objective value of the dual QP problem arising from SVM training. \sum_p(\cdot) indicates that the sum is taken only over the support vectors x_p. D_p is the Euclidean distance between support vector x_p and its nearest support vector in the same class, while \bar{D}_p is the Euclidean distance between x_p and its farthest sample (not a support vector) in the other class.
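The quantities in the bound (1) can all be read off a single trained SVM. The sketch below evaluates them for a linear kernel with scikit-learn; the exact grouping of terms in (1) is uncertain in the extracted text, so this reading (a 1/n² factor multiplying W(α*) times the bracketed sum) is an assumption, and the function name `loo_bound` is hypothetical.

```python
# Hypothetical sketch: evaluate the terms of bound (1) from one trained
# linear SVM. The precise form of (1) is an assumption recovered from the
# garbled extraction; treat this as an illustration, not the authors' code.
import numpy as np
from sklearn.svm import SVC

def loo_bound(X, y, C=1.0):
    n = len(y)
    clf = SVC(kernel="linear", C=C).fit(X, y)
    sv = clf.support_                       # indices of support vectors
    a = clf.dual_coef_.ravel()              # alpha_p * y_p for each support vector
    K = X[sv] @ X[sv].T                     # linear kernel among support vectors
    W = np.abs(a).sum() - 0.5 * a @ K @ a   # dual objective value W(alpha*)
    is_sv = np.zeros(n, dtype=bool)
    is_sv[sv] = True
    total = 0.0
    for p in sv:
        dists = np.linalg.norm(X - X[p], axis=1)
        same = is_sv & (y == y[p])          # other support vectors, same class
        same[p] = False
        other = ~is_sv & (y != y[p])        # non-support samples, other class
        D_p = dists[same].min() if same.any() else 0.0      # nearest same-class SV
        Dbar_p = dists[other].max() if other.any() else 0.0 # farthest other-class non-SV
        total += D_p**2 + Dbar_p**2
    return W * (total + 4 * len(sv) * C) / n**2

# Hypothetical two-class data for a quick check.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (15, 4)), rng.normal(1, 1, (15, 4))])
y = np.array([-1] * 15 + [1] * 15)
print(loo_bound(X, y))
```

Unlike the exact LOO error, this requires only one SVM training per candidate gene subset, which is what makes it usable inside a search procedure.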
The right-hand side of inequality (1) is the new upper bound we developed. When the bound is used for gene (feature) selection, the gene subset with the minimal bound value is preferred. In the present work, the new bound is combined with sequential forward selection (SFS) to form a gene (feature) selection algorithm. SFS is a bottom-up search procedure that starts from the empty set and adds the single best gene at each step.
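The SFS wrapper described above can be sketched generically: at each step, every remaining gene is scored in combination with the genes already chosen, and the gene giving the smallest criterion value is added. The paper's criterion is the bound (1); the stand-in criterion below (SVM cross-validation error) and all names are illustrative assumptions.

```python
# Sketch of sequential forward selection (SFS). The selection criterion is
# pluggable; the paper uses the new error bound, while cv_error here is a
# stand-in for a self-contained demo. scikit-learn usage is an assumption.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def cv_error(X, y):
    """Stand-in criterion: 3-fold cross-validation error of a linear SVM."""
    return 1.0 - cross_val_score(SVC(kernel="linear"), X, y, cv=3).mean()

def sfs(X, y, n_genes, criterion=cv_error):
    """Greedily grow a gene subset, minimizing the criterion at each step."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_genes:
        scores = [(criterion(X[:, selected + [g]], y), g) for g in remaining]
        best_score, best_gene = min(scores)
        selected.append(best_gene)
        remaining.remove(best_gene)
    return selected

# Hypothetical data: gene 0 is made informative, the rest are noise.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (40, 8))
y = np.array([0] * 20 + [1] * 20)
X[:20, 0] -= 2.0
X[20:, 0] += 2.0
print(sfs(X, y, n_genes=3))
```

Each SFS step trains one classifier per remaining gene, so a cheap criterion such as the closed-form bound keeps the whole search tractable for tens of thousands of genes.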
¹Department of Pathology, Yale University School of Medicine, New Haven, Connecticut 06510, USA. E-mail: xin.zhou@ , david.tuck@