Vol. 22 no. 2 2006, pages 245–247
doi:10.1093/bioinformatics/bti760 BIOINFORMATICS APPLICATIONS NOTE
Gene expression
PCP: a program for supervised classification of gene expression profiles
Ljubomir J. Buturović
San Francisco State University, 1600 Holloway Avenue, San Francisco, CA 94132, USA
Received on August 10, 2005; revised on September 18, 2005; accepted on November 2, 2005
Advance Access publication November 8, 2005
Associate Editor: Alvis Brazma
ABSTRACT
Summary: PCP (Pattern Classification Program) is an open-source machine learning program for supervised classification of patterns (vectors of measurements). The principal use of PCP in bioinformatics is the design and evaluation of classifiers for use in clinical diagnostic tests based on measurements of gene expression. PCP implements leading pattern classification and gene selection algorithms and incorporates cross-validation estimation of classifier performance. Importantly, the implementation integrates the gene selection and class prediction stages, which is vital for computing reliable performance estimates in small-sample scenarios. Additionally, the program includes automated and efficient model selection (optimization of parameters) for the support vector machine (SVM) classifier. The distribution includes Linux and Windows/Cygwin binaries. The program can easily be ported to other platforms.
Availability: Free download at
Contact: ljubomir@
1 INTRODUCTION
Clinical diagnostic tests based on measurements of gene expression are starting to be offered commercially and have the potential to gradually enter widespread clinical practice (van de Vijver et al., 2002; Soonmyung et al., 2004; Moraleda et al., 2004). The research and development of automated diagnostic tools based on genome-wide gene expression patterns include the gene selection and classifier design phases. The gene selection phase chooses the optimal (according to some suitable criterion) subset of genes thought to be the most relevant for discriminating among the disease categories. The task of the classifier is to assign the specimen being interrogated to one of the previously defined diagnostic classes, using measurements of expression of the selected genes.
The Pattern Classification Program (PCP) package has been designed to assist in the development of these two stages. It is a stand-alone, open-source application which implements leading algorithms for gene selection and classifier learning, prediction and performance evaluation. The principal applications of PCP are evaluation of the inherent discrimination (i.e. diagnostic power) of the datasets under study, identification of optimal gene subsets, and comparison of proprietary or novel algorithms with the more established ones. The program can be used and redistributed without restrictions in binary and source forms.
2 ARCHITECTURE
PCP source code is written in the C and C++ programming languages, strictly conforming to the corresponding ANSI standards. The distribution contains pre-compiled binaries for Linux and Windows (the latter requires installation of the free Cygwin environment). The program links with the LAPACK linear algebra library (Anderson et al., 1999), which is available on many platforms.
PCP is a desktop application which uses hierarchically organized menus for user interaction. The main menu is shown in Figure 1. The program is started from a command prompt. The menus are controlled interactively from the keyboard, by pressing keys corresponding to the menu actions (e.g. Learning, Cross-validation, Prediction, etc.). All processing parameters are entered from the keyboard in response to the program prompts.
Input data are read from whitespace-delimited text files. The data files are assumed to contain normalized expression values of individual genes for all specimens. The expression values will typically be generated using software specific to the microarray platform. For example, for the Affymetrix GeneChip platform, the gene expression values may be produced from CEL files using the Affymetrix Microarray Suite (MAS) software or an open-source program implementing the RMA algorithm (Irizarry et al., 2003) for normalization and summarization.
Results are presented in tabular form on the screen and also saved in text files in easily parsable formats. The files can then be used for graphical display or further analyses by other programs.
PCP also supports batch mode, in which it reads commands (corresponding to the navigation controls and processing parameters) from a text file. This makes it convenient to incorporate PCP processing in a complex data analysis dataflow driven from a scripting language such as Perl. This mode of operation is strengthened by the robust error handling facility, which stores all diagnostics in an easily parsable text file.
Fig. 1. PCP Main Menu. The functions are activated by pressing the corresponding keyboard keys. For example, to enter the Pattern Classification Menu, press the ‘b’ key.
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@
3 ALGORITHMS AND METHODS
The algorithms supported by PCP are shown in Table 1.
3.1 Gene selection
PCP offers two groups of algorithms for projecting input gene expression values into a lower-dimensional space (a process often referred to as dimensionality reduction): gene (feature) extraction and gene (feature) selection.
Gene extraction refers to algorithms which build a linear mapping transformation for reducing the dimensionality of the input gene space. Note that a diagnostic test incorporating such a transformation utilizes all of the input gene expression measurements.
Gene selection chooses an optimal subset of genes for further processing (classification). In contrast to gene extraction, a diagnostic test incorporating gene selection only utilizes the expression values of the genes in the subset. This permits the use of a more accurate expression measurement technology (e.g. RT–PCR) for the genes in the chosen subset. In addition, the process potentially identifies a biologically meaningful subset of genes. For these reasons, gene selection is usually the preferred dimensionality reduction method in expression-based diagnostics. Nevertheless, on occasion gene extraction provides superior classification performance, and may be used to evaluate the discrimination power of all available measurements.
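The difference between the two approaches can be illustrated with a minimal Python sketch. It is not taken from PCP (which is written in C and C++); the function names, the projection matrix W and the chosen gene indices are hypothetical values used only for illustration.

```python
def extract(x, W):
    """Gene extraction: every output feature is a weighted sum of ALL genes."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def select(x, idx):
    """Gene selection: keep only the measurements of the chosen genes."""
    return [x[i] for i in idx]


# Expression values of 4 genes measured in one specimen (made-up numbers).
x = [2.0, 0.5, 1.0, 4.0]

# Hypothetical 2 x 4 linear mapping; a real one would come from, e.g., PCA.
W = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, -1.0]]

extracted = extract(x, W)      # needs all 4 measurements
selected = select(x, [0, 3])   # needs only genes 0 and 3
```

Both results are two-dimensional, but only the selected vector could be obtained by measuring just two genes.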
Gene selection can further be subdivided into algorithms which evaluate and compare the predictive power of individual genes, and algorithms which compare groups (subsets) of genes. The first group of algorithms is known as Gene Ranking (Su, Y. et al., http:// /yangsu/rankgene). The algorithms in this group assume independence among the genes’ expressions and differ by the gene ranking criterion. The criteria currently available in PCP are listed in Table 1. The second group of gene selection algorithms includes forward selection and backward elimination, both supported in PCP. These methods may be able to identify complex relationships among genes, at the cost of significantly higher computational complexity. PCP uses the 1-NN error rate estimate, the Bayes error estimate and the inter-intra distance as gene subset evaluation criteria for this group of gene selection algorithms.
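As an illustration of gene ranking, the following Python sketch scores each gene independently with the Golub signal-to-noise criterion, |mean1 - mean2| / (sd1 + sd2), and sorts the genes by that score. It is a sketch under stated assumptions rather than PCP's implementation: the function names, the two-class 0/1 labelling, the use of the population standard deviation and the toy data are all assumptions.

```python
from statistics import mean, pstdev


def golub_score(values, labels):
    """Signal-to-noise score of one gene for two classes labelled 0 and 1."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    spread = pstdev(a) + pstdev(b)
    # Assumption: a gene with zero spread in both classes gets score 0.
    return abs(mean(a) - mean(b)) / spread if spread > 0 else 0.0


def rank_genes(X, labels):
    """Return gene indices ordered from most to least discriminating.

    X is a list of specimens; each specimen is a list of gene expressions.
    Each gene is scored on its own, i.e. genes are treated as independent.
    """
    n_genes = len(X[0])
    scores = [golub_score([row[g] for row in X], labels)
              for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)


# Toy data: 4 specimens x 3 genes; gene 2 separates the classes best.
X = [[3.0, 5.0, 1.0],
     [1.0, 5.5, 2.0],
     [2.0, 5.2, 8.0],
     [2.5, 4.9, 9.0]]
y = [0, 0, 1, 1]

ranking = rank_genes(X, y)
```

Subset methods such as forward selection would instead score whole groups of genes, which is why they are considerably more expensive.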
3.2 Performance evaluation
PCP utilizes cross-validation to evaluate classifier performance. One of the challenges in cross-validation-based estimation of classifier performance is the integration of the gene selection stage. It has been demonstrated in the context of microarray data analysis (Molinari et al., 2005) that a reliable estimate requires repeated gene subset selection for each training resampling subset. PCP rigorously implements this requirement. Thus, for each cross-validation fold, the gene selection is performed anew using the training subset, and the chosen genes are then extracted from the training and test subsets. This processing significantly increases the overall computational complexity, but is vital to avoid severe underestimation of classifier error rates.
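The procedure described above can be sketched in Python as follows. This is not PCP's code: the function names, the interleaved fold assignment, the signal-to-noise ranking criterion and the 1-NN classifier are assumptions chosen to keep the example short. The essential point is that the gene subset is chosen inside the loop, from the training part of each fold only.

```python
from statistics import mean, pstdev


def snr(values, labels):
    """Signal-to-noise score of one gene for two classes labelled 0 and 1."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    # The small constant only guards against division by zero.
    return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-12)


def top_genes(X, y, n):
    """Indices of the n best-scoring genes, computed from X and y only."""
    scores = [snr([row[g] for row in X], y) for g in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:n]


def predict_1nn(train_X, train_y, x):
    """Label of the training specimen closest to x (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(t, x)) for t in train_X]
    return train_y[dists.index(min(dists))]


def cv_error(X, y, n_genes, n_folds):
    """Cross-validated error rate with gene selection redone in each fold."""
    errors = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        # Gene selection sees ONLY the training part of this fold.
        genes = top_genes(train_X, train_y, n_genes)
        reduced_train = [[row[g] for g in genes] for row in train_X]
        for i in test_idx:
            reduced_x = [X[i][g] for g in genes]
            if predict_1nn(reduced_train, train_y, reduced_x) != y[i]:
                errors += 1
    return errors / len(X)


# Toy data: 6 specimens x 2 genes; gene 0 is informative, gene 1 is noise.
X = [[1.0, 7.0], [9.0, 3.0], [2.0, 4.0],
     [8.0, 6.0], [1.5, 5.0], [8.5, 5.5]]
y = [0, 1, 0, 1, 0, 1]

error_rate = cv_error(X, y, n_genes=1, n_folds=3)
```

Selecting the genes once on the full dataset and then cross-validating only the classifier would let the test specimens influence the gene subset, which is the source of the optimistic bias the text warns about.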
3.3 SVM model selection
Model selection refers to the process of choosing optimal parameters of a classifier. For example, MLP model selection consists in determining the optimal number of hidden nodes. For algorithms with a single discrete-valued parameter, such as k-NN and neural networks, the process is straightforward and amounts to cross-validation of the classifier for a relatively small set of values of the parameter. The optimal parameter is the value which gives the best cross-validation performance. This analysis can often be executed interactively or within a simple driver script.
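For the single-parameter case, such a driver script might look like the following Python sketch, which picks k for a k-NN classifier by leave-one-out cross-validation. It is illustrative only and does not call PCP; the function names, the candidate values of k and the toy data are assumptions.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training specimens closest to x."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_error(X, y, k):
    """Leave-one-out cross-validation error rate of k-NN."""
    wrong = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, X[i], k) != y[i]:
            wrong += 1
    return wrong / len(X)


def select_k(X, y, candidates):
    """Cross-validate each candidate k and return the best one."""
    errors = {k: loo_error(X, y, k) for k in candidates}
    best = min(candidates, key=lambda k: errors[k])
    return best, errors


# Toy 1-D data: two clusters plus one class-1 specimen inside the class-0
# cluster, which hurts k = 1 more than k = 3.
X = [[1.0], [1.2], [1.4], [5.0], [5.2], [5.4], [1.3]]
y = [0, 0, 0, 1, 1, 1, 1]

best_k, errors = select_k(X, y, [1, 3, 5])
```

With only a handful of candidate values, the whole search is a short loop, which is why this case needs no special optimization.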
In contrast, SVM presents considerable challenges for model selection. The various incarnations of SVM usually have two or more continuously valued parameters. An exhaustive search of the parameter space is computationally demanding and requires automation. The situation is further complicated by the addition of a gene selection phase in the classifier design. A rigorous SVM model selection involves classifier cross-validation for each set of parameter values, including the repeated gene selection for each cross-validation subset, as explained in Section 3.2. This requirement dramatically increases the computational complexity. Given the complexity of the gene selection algorithms themselves, the total computational burden may be impractical even for small datasets.
PCP efficiently solves the problem of SVM model selection and reduces complexity to manageable levels. The improvements are achieved by employing a heuristic based on the Simplex algorithm (Press et al., 1992) in the parameter search and by pre-computing the gene selection subsets before starting the parameter search. As a result, PCP makes it practical to use SVM for analysis of the high-dimensional, small-sample datasets typically encountered during the development of gene expression-based diagnostics.
ACKNOWLEDGEMENTS
Sasha Jaksic of San Francisco State University contributed the code to the gene selection functionality and suggested the pre-computation of gene selection subsets. Professor Milan Milosavljević of the University of Belgrade collaborated on an earlier version of the program.
Table 1. Algorithms implemented in PCP
Classification
    Multi-layer Perceptron
    Linear parametric classifier
    Quadratic parametric classifier
    Linear discriminant classifier
    Support vector machine
    Nearest-neighbor classifier
Gene selection
    Gene ranking
    Forward selection (Theodoridis and Koutroumbas, 2003)
    Backward elimination (Theodoridis and Koutroumbas, 2003)
Gene selection criteria
    Euclidean distance
    Golub criterion (Golub et al., 1999)
    Pearson correlation coefficient
    1-NN error rate estimate
    Bayes error estimate (Fukunaga and Hummels, 1987)
    Inter-intra distance (Theodoridis and Koutroumbas, 2003)
Gene extraction
    Fisher’s linear discriminant
    Principal component analysis
    Singular value decomposition
Conflict of Interest: none declared.
REFERENCES
Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A. and Sorensen, D. (1999) LAPACK Users’ Guide, Third Edition. Philadelphia: Society for Industrial and Applied Mathematics.
Fukunaga, K. and Hummels, D.M. (1987) Bayes error estimation using Parzen and k-NN procedures. IEEE Trans. Pattern Anal. Mach. Intell., PAMI-9, 634–643.
Golub, T. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Irizarry, R.A. et al. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res., 31, 345–349.
Molinari, A.M. et al. (2005) Prediction error estimation: a comparison of resampling methods. Bioinformatics, 21, 3301–3307.
Moraleda, J., Grove, N., Tran, Q., Doan, J., Hull, J., Nguyen, L., Pattin, A. and Anderson, G. (2004) Gene expression data analytics with interlaboratory validation for identifying anatomical sites of origin of metastatic carcinomas. In Proceedings of the American Society of Clinical Oncology Annual Meeting, New Orleans, LA, 23, 862.
Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P. (1992) Numerical Recipes, Second Edition. Cambridge: Cambridge University Press, Section 10.4.
Soonmyung, P. et al. (2004) A multigene assay to predict recurrence of Tamoxifen-treated, Node-negative breast cancer. N. Engl. J. Med., 351, 2817–2826.
Su, Y. et al. (2003) RankGene: identification of diagnostic genes based on expression data. Bioinformatics, 19, 1578–1579.
Theodoridis, S. and Koutroumbas, K. (2003) Pattern Recognition, Second Edition. Amsterdam: Academic Press.
van de Vijver, M. et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347, 1999–2009.