生物信息文献
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
A Novel Method for Protein Secondary Structure Prediction Using Dual-Layer SVM and Profiles
Jian Guo,1,2†Hu Chen,1†Zhirong Sun1*,and Yuanlie Lin2
1Institute of Bioinformatics,State Key Laboratory of Biomembrane and Membrane Biotechnology,
Department of Biological Sciences and Biotechnology,Tsinghua University,Beijing,China
2Department of Mathematical Sciences,Tsinghua University,Beijing,China
ABSTRACT A high-performance method was developed for protein secondary structure predic-tion based on the dual-layer support vector machine (SVM)and position-specific scoring matrices (PSSMs).SVM is a new machine learning technology that has been successfully applied in solving prob-lems in thefield of bioinformatics.The SVM’s perfor-mance is usually better than that of traditional machine learning approaches.The performance was further improved by combining PSSM profiles with the SVM analysis.The PSSMs were generated from PSI-BLAST profiles,which contain important evolu-tion information.Thefinal prediction results were generated from the second SVM layer output.On the CB513data set,the three-state overall per-residue accuracy,Q3,reached75.2%,while segment overlap (SOV)accuracy increased to80.0%.On the CB396 data set,the Q3of our method reached74.0%and the SOV reached78.1%.A web server utilizing the method has been constructed and is available at /pmsvm.Proteins 2004;54:738–743.©2004Wiley-Liss,Inc.
Key words:protein structure prediction;protein secondary structure;support vector ma-
chine;position-specific scoring matri-
ces;PSI-BLAST
INTRODUCTION
A large number of genome sequences have been pro-duced in high-throughput experiments.The next step is to analyze these genome and protein sequences tofind new gene functions.1The prediction of protein structure and function from amino acid sequences is one of the most important problems in molecular biology.This problem is becoming more pressing as the number of known protein sequences is explored as a result of genome and other sequencing projects,and the protein sequence–structure gap is widening rapidly.2,3Therefore,computational tools to predict protein structures are badly needed to narrow the widening gap.Although the prediction of three-dimensional(3D)protein structures is the ultimate goal, the structure still cannot be accurately predicted directly from sequences.An intermediate but useful step is to predict the protein secondary structure,which provides some knowledge and simplifies the complicated3D struc-ture prediction problem.
The fundamental elements of the secondary structure of
proteins are␣-helices,-sheets,coils,and turns.Some methods have been developed for defining various protein secondary structure elements from the atomic coordinates in the Protein Data Bank(PDB),such as DSSP,4STRIDE,5 and DEFINE.6According to DSSP,8types of protein secondary structure elements were classified and denoted by letters:H(␣-helix),E(extended-strand),G(3
10
helix), I(-helix),B(isolated-strand),T(turn),S(bend)and“_”
(coil).The8classes are usually reduced to three states, helix(H),sheet(E),and coil(C)by different reduction methods.7Thus,the secondary structure prediction can be analyzed as a typical three-state pattern recognition or classification problem,where the secondary structure class of a given amino acid residue in a protein is predicted based on its sequence features.
Since the1970s,many methods have been developed for predicting protein secondary structures.Early works usu-ally relied on the single-residue statistics in various second-ary structural elements,for example,the Chou–Fasman method8and the Garnier–Osguthorpe–Robson(GOR I) method.9Nearly20years later,a significant improvement was made in the PHD method,10which is a three-level neural network including some machine learning tech-niques.After the PHD method,many further neural networks and machine learning refinements were devel-oped.11–13Several machine learning approaches have suc-cessfully predicted protein secondary structures,and pre-diction accuracies were further improved.In2001,Hua and Sun14introduced a new method,support vector ma-chine(SVM),which is based on statistical learning theory (SLT).The SVM method achieved good segment overlap accuracy,SOVϭ76.2%,and good three-state overall
per-residue accuracy,Q
3ϭ73.5%.14
Here,we describe an improved dual-layer SVM com-bined with a position-specific scoring matrix(PSSM)gener-
†These two authors contributed equally in this work.
Grant sponsor:Fondational Science Research Grant of Tsinghua University(JC2001043);863Projects(2002AA234041);973Project (2003CB715903);NSFC(90303017).
*Correspondence to:Zhirong Sun,Institute of Bioinformatics,State Key Laboratory of Biomembrane and Membrane Biotechnology,De-partment of Biological Sciences and Biotechnology,Tsinghua Univer-sity,Beijing10084,China.E-mail:sunzhr@
Received18June2003;Accepted3September2003
PROTEINS:Structure,Function,and Bioinformatics54:738–743(2004)
©2004WILEY-LISS,INC.