Acta Scientiarum Naturalium Universitatis Pekinensis, Vol. 45, No. 1 (Jan. 2009)
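Supported by the Peking University President's Fund, the National Natural Science Foundation of China (30570425), the National High Technology Research and Development Program of China (2006AA02Z331, 2008AA02Z306), the National Key Basic Research Program of China (2003CB715903) and Microsoft Research Asia
Received: 2007-08-06; Revised: 2008-11-04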
Searching for Differentially Expressed Genes by PLS-VIP Method
HAN Fang 1, WU Jingchen 1, XU Jiangfeng 1, DENG Minghua 1,2
1. School of Mathematical Science, Peking University, Beijing 100871; 2. Center for Theoretical Biology, Peking University, Beijing 100871; Corresponding Author, E-mail: dengmh@
Abstract A new approach called PLS-VIP, based on the variable importance of projection (VIP), is proposed for detecting differentially expressed genes. It sums up the contributions of the components obtained by the partial least squares (PLS) method. Because it takes the correlation between genes into consideration, it is more suitable than classification methods that examine each gene separately. The genes extracted by ordinary PLS, discriminant PLS and the proposed method are compared by how well they classify multi-class tumors. Results show that the error rate of the new method is clearly lower than that of the other two methods in most cases.
Key words microarray; differentially expressed genes; reduction of dimension; PLS-VIP
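Chinese abstract A new method based on variable weights is proposed for finding differentially expressed genes. The ultimate goal of the method is to extract the core variables (genes) from microarray data. The ability of the differentially expressed genes extracted by this method to discriminate samples is compared with that of the ordinary PLS method and the discriminant partial least squares method; the results show that the error rate of the new method is clearly lower than that of the two traditional methods. PLS-VIP is therefore a suitable method for extracting differentially expressed genes and discriminating samples.
Key words microarray; differentially expressed genes; dimension reduction; PLS-VIP CLC number O213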
Microarrays have become increasingly common in biological and medical research. They enable the simultaneous study of thousands of genes and provide gene expression information at the whole-genome level. With the advent of DNA microarray technology, it became possible to detect differentially expressed genes on a genome-wide scale. A major question in microarray studies is how to select genes associated with a specific tumor subtype or with patient survival. In a clinical context, such differentially expressed (DE) genes are often referred to as clinical markers. Identification of clinical markers may lead to improved diagnosis and treatment guidance, early disease detection and prediction of clinical outcomes.
The most commonly used tools for identifying differentially expressed genes include the t-test, logistic regression and linear regression analysis, some of which require more samples than variables. However, DNA microarray gene expression data typically contain many variables (genes) and only a few samples. Therefore, these methods may lose their effectiveness in real applications.
In this context, we consider principal component analysis (PCA) and partial least squares (PLS) for dimension reduction. They attempt to find a set of principal components (linear combinations of the original variables) that account for maximum variation (PCA) or maximum correlation with the class information (PLS). Both can work effectively without requiring large samples. Compared with PCA, PLS takes the information provided by the response variables into account when constructing principal components, and PCA is generally less powerful than PLS in classification. PLS is therefore introduced to identify differentially expressed genes. PLS can be regarded as a compromise between PCA and ordinary linear regression. It is of particular interest for analyzing data with strongly collinear (correlated), noisy and numerous predictors.
Discriminant PLS (D-PLS) is another popular PLS-related method for detecting differentially expressed genes; it extracts genes by sorting the sum of squared correlation coefficients between gene expression and each of the response variables. D-PLS preserves more information than traditional PLS and therefore seems to perform better in classification. This advantage can also be seen in our experimental results.
In this paper, we use PLS to reduce dimension and introduce a new calculation called variable importance of projection (VIP) to obtain the DE genes. VIP sums the PLS component coefficient scores of each variable and combines the information in the data effectively. To compare gene extraction by different methods, we use linear discriminant analysis (LDA) and measure performance by the classification error rate. The results show that DE genes extracted by our method, combined with LDA, largely decrease the misclassification error rate compared with other methods.
1 Methods
With the advent of DNA microarray technology, it became possible to detect differentially expressed genes on a genome-wide scale[1-4]. However, how to analyze microarray data with appropriate statistical methods is still a challenge. PLS and D-PLS have been proposed to deal with the raw data, and both have some disadvantages[5-11].
In this paper, a new method is proposed to find differentially expressed genes based on microarray data. We take microarray data as input. To explore which predictors contribute more to the response, we look at the summation of PLS component coefficients for the standardized data. Genes with large absolute coefficients can be regarded as differentially expressed. The variable importance for projection (VIP) is proposed to summarize the contribution of variables. The VIP of a predictor is a synthesis of its PLS component coefficients. This score is used for ranking genes.
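As an illustration of where the PLS component coefficients could come from, here is a minimal sketch of single-response PLS fitted with the NIPALS iteration on standardized data. The paper does not state which PLS algorithm was used, so the choice of NIPALS, the function name `pls1_weights`, and the use of the squared norm of X^T y as the "eigenvalue" of each component are all assumptions made for this example.

```python
import numpy as np

def pls1_weights(X, y, k):
    """Sketch of PLS1 via NIPALS (an assumed algorithm, not confirmed by the paper).

    X: (n_samples, n_genes) expression matrix, y: (n_samples,) class coding.
    Returns W of shape (k, n_genes), one unit-norm weight vector per component,
    and r of shape (k,), taken here to be ||X^T y||^2 at each step (assumption).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize predictors and response (constant columns would divide by zero)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = (y - y.mean()) / y.std()
    W, r = [], []
    for _ in range(k):
        c = X.T @ y                   # covariance direction between genes and response
        norm = np.linalg.norm(c)
        w = c / norm                  # unit-norm weight vector of this component
        t = X @ w                     # component scores
        tt = t @ t
        p_load = X.T @ t / tt         # loadings used to deflate X
        q = (y @ t) / tt              # regression of y on the scores
        X = X - np.outer(t, p_load)   # remove what this component explains
        y = y - q * t
        W.append(w)
        r.append(norm ** 2)           # nonzero eigenvalue of (X^T y)(X^T y)^T
    return np.array(W), np.array(r)
```

The rows of `W` are mutually orthogonal by construction, so each component contributes a distinct direction in gene space.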
In detail, we use only a certain number (k) of top principal components to compute the VIP score. Let $w_i$ be the i-th component, with $w_{ij}$ its coefficient for the j-th variable, and let $r_i$ be the eigenvalue of the i-th component. Then we define
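$$W_{ij} = w_{ij} / \mathrm{uss}(w_j), \tag{1}$$
$$R_i = r_i / \mathrm{uss}(r), \tag{2}$$
where $\mathrm{uss}(w_j) = \sum_{i=1}^{k} w_{ij}^2$ and $\mathrm{uss}(r) = \sum_{i=1}^{k} r_i^2$.
The VIP score of the j-th variable is defined as
$$\mathrm{VIP}_j = \frac{\sum_{i=1}^{k} R_i W_{ij}^2}{\sum_{i=1}^{k} R_i}\, p, \tag{3}$$
where p is the dimension of variables.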
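The definitions in Eqs. (1)-(3) can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' code: the function name `vip_scores` is hypothetical, and Eq. (3) is read here as p times the R-weighted mean of the squared normalized coefficients, which is an assumption about the typeset formula.

```python
import numpy as np

def vip_scores(w, r):
    """VIP scores under one reading of Eqs. (1)-(3) (assumption, see lead-in).

    w: (k, p) array, w[i, j] is the coefficient of variable j in component i.
    r: (k,) array, eigenvalue of each component.
    """
    w = np.asarray(w, dtype=float)
    r = np.asarray(r, dtype=float)
    k, p = w.shape
    uss_w = np.sum(w ** 2, axis=0)     # uss(w_j): sum over components, per variable
    uss_r = np.sum(r ** 2)             # uss(r)
    W = w / uss_w                      # Eq. (1); a variable with all-zero weights would divide by zero
    R = r / uss_r                      # Eq. (2)
    return p * (R @ W ** 2) / R.sum()  # Eq. (3), one score per variable

# Ranking genes from largest to smallest score:
# order = np.argsort(-vip_scores(w, r))
```

Sorting the scores in decreasing order gives the gene ranking used for selection.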
Next, we use discriminant analysis (DA) to classify samples using the selected variables. DA is one of the most important and popular statistical methods for classification, and we therefore adopt it both as a classifier and as a tool to test the effect of dimension reduction.
2 Results
2.1 Data
We apply our method to two real data sets. One is the leukemia data of Golub et al.[1], which involves 7129 genes in 47 acute lymphoblastic leukemia (ALL) and 25 acute myeloid leukemia (AML) samples. It is divided into a training set of 27 ALL and 11 AML samples and a testing set of 20 ALL and 14 AML samples. The other is the human lung carcinoma data (B) of Bhattacharjee et al.[12], which involves 12600 genes in 156 samples: 139 adenocarcinoma and 17 normal lung samples.
2.2 Gene Selection
2.2.1 The Leukemia data
We rank genes by VIP scores learned from the training data and compare the ranked list with the genes selected by Golub et al.[1]. Among the top 50 genes chosen by our method, twenty-four are shared with Golub's prediction. Table 1 lists the genes shared by the two lists. Golub et al.[1] showed that genes such as CD33 (M23917) and c-myb (U22376) are known to be related to the two types of leukemia; these are also included in the PLS-VIP list. Zyxin (X95735), described as a new marker of acute leukemia subtype in Golub's paper, is ranked first by PLS-VIP.
Table 1 Twenty-four genes selected both by PLS-VIP and Golub et al.
Gene accession number    Rank by PLS-VIP    VIP score
X95735                   1                  2.830914
U50136                   2                  2.8278869
Y12670                   3                  2.7361162
M23917                   4                  2.7323231
M55150                   5                  2.7239477
M16038                   7                  2.6226533
U82759                   9                  2.5835153
L08246                   11                 2.5061075
M62762                   12                 2.5053049
M80254                   13                 2.4937764
U46751                   14                 2.4823027
X04085                   15                 2.4822681
Y00787                   16                 2.4597593
M96326                   17                 2.4511953
M27891                   18                 2.4493376
X85116                   19                 2.4249357
M63138                   20                 2.4036770
M28130                   22                 2.3138276
M83652                   23                 2.3690988
M57710                   24                 2.3566580
M81695                   25                 2.3531780
M69043                   29                 2.3261821
M19045                   30                 2.3195164
U22376                   49                 2.1845380
VIP score is calculated by the PLS-VIP method using the top 8 components.
2.2.2 The lung cancer data
By learning from the data and sorting the VIP value of each gene, we obtain a ranking of genes, in other words, a criterion for how important a gene is. The genes in the lung cancer data of Bhattacharjee et al.[12] that rank in the top 10 by PLS-VIP are listed in Table 2.
2.3 Comparison with other methods
We compared the efficiency of PLS-VIP with the traditional PLS and D-PLS methods using Bayes linear discriminant analysis (LDA). First, we extract three groups of top n genes (n varies), ranked by PLS-VIP, traditional PLS and D-PLS respectively. Second, using only these n genes, we classify samples by LDA. The error rate is then obtained via leave-one-out cross validation (LOOCV). The results are illustrated in Fig. 1 and Fig. 2.
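The evaluation procedure described above (take the top n genes from a ranking, classify with LDA, estimate the error by LOOCV) might be sketched as follows. The function names `lda_predict` and `loocv_error` are hypothetical, and the LDA shown is a generic pooled-covariance version with class priors; the paper does not specify the exact implementation, and the pseudo-inverse is an assumption added here to cope with more genes than samples.

```python
import numpy as np

def lda_predict(X_train, y_train, x_new):
    """Classify x_new with LDA (pooled covariance, class priors from frequencies)."""
    classes = np.unique(y_train)
    n, p = X_train.shape
    pooled = np.zeros((p, p))
    means, priors = [], []
    for c in classes:
        Xc = X_train[y_train == c]
        mu = Xc.mean(axis=0)
        means.append(mu)
        priors.append(len(Xc) / n)
        pooled += (Xc - mu).T @ (Xc - mu)
    pooled /= max(n - len(classes), 1)
    prec = np.linalg.pinv(pooled)  # pseudo-inverse: covariance may be singular
    # Linear discriminant score of each class; the largest score wins
    scores = [x_new @ prec @ mu - 0.5 * mu @ prec @ mu + np.log(pr)
              for mu, pr in zip(means, priors)]
    return classes[int(np.argmax(scores))]

def loocv_error(X, y, ranking, n_top):
    """LOOCV error rate of LDA using only the n_top best-ranked genes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    Xs = X[:, np.asarray(ranking)[:n_top]]
    wrong = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i  # hold out sample i
        wrong += int(lda_predict(Xs[keep], y[keep], Xs[i]) != y[i])
    return wrong / len(y)
```

Calling `loocv_error` for a range of `n_top` values and for each ranking method would give error curves of the kind compared in the figures. In this sketch the ranking is taken as given, as in the text; recomputing it inside each fold would be a stricter variant.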
Table 2 Top ten genes selected by PLS-VIP to distinguish normal and adenocarcinoma samples
Gene accession number    Rank by PLS-VIP    VIP value
40994                    1                  3.268897
1135                     2                  3.261932
38995                    3                  3.208274
268                      4                  3.191018
35868                    5                  3.139615
37247                    6                  3.074748
36119                    7                  3.068207
1596                     8                  3.058278
1955                     9                  3.041014
36569                    10                 3.025643
Fig. 1 Comparison of the classification performance using genes ranked by different methods
Fig. 2 Comparison of the classification performance using genes ranked by different methods
From Fig. 1 and Fig. 2, we can see that the prediction error is lowest when about 300-500 genes are selected. The error increases at the beginning, drops more or less significantly at about 50-100 genes, and stays at about the same level when the number of genes is larger than 250. This phenomenon was also observed by Yan et al.[13] and is reasonable for the following three reasons. First, when few genes are selected, they achieve a low error rate at the expense of failing to fit a model well to the original data. Second, when informative genes are gradually added to the selected gene set, more and more information is included, causing prediction errors to decrease. Third, when the set of differentially expressed genes is saturated, adding other genes contributes little further information and the error rate stays at the same level.
Additionally, the error rate of the PLS-VIP method is clearly lower than that of the traditional PLS and D-PLS methods.
2.4 Effect of the number of components on the classification
In the discussion above, we took the top 8 principal components to compute the VIP score. This choice calls for explanation. We compare the classification error rates obtained with 2, 4, 6 and 8 PLS principal components on the leukemia data[1] in Fig. 3. Results show that classification with 8 components is better than with the others.
Fig. 3 Comparison of the classification performance using top 2, 4, 6 and 8 principal components
3 Discussion
PCA and PLS are the most popular and successful approaches for dimension reduction. They attempt to find a set of principal components (linear combinations of the original variables) that account for maximum variation or maximum correlation with the class information. However, both methods may leave much information out.
The PLS-VIP method, on the other hand, takes the most influential n components extracted by PLS into consideration. It discards much less information than the traditional PLS method. Working as a modified PLS, the PLS-VIP method performs better in variable selection. Our analysis shows that PLS-VIP is able to preserve more information for classifying the Golub leukemia data[1] and for separating normal from adenocarcinoma samples in the lung cancer data[12].
The comparison between our method and D-PLS also indicates that our method is better than D-PLS when N is comparatively large and no worse when N is small. Accordingly, we believe the PLS-VIP method is effective and expect it to be useful in more settings.
In this manuscript, we use LDA as the classifier. Logistic analysis is another traditional way of classifying. However, it fails to work well when the sample is not large enough. In this respect, Bayes discriminant analysis has some merits, as it reduces the required number of samples.
Next, we want to modify this method and state it more explicitly in statistical language. Further, the PLS-VIP method should be tested on more data, and more comparisons should be made between PLS-VIP and other methods.
References
[1] Golub T R, Slonim D K, Tamayo P, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999, 286: 531-537
[2] Young R A. Biomedical discovery with DNA arrays. Cell, 2000, 102(1): 9-15
[3] Shi L. Arrays, molecular diagnostics, personalized therapy and informatics. Expert Rev Mol Diagn, 2001, 1(4): 363-365
[4] Gershon D. Microarray technology: An array of opportunities. Nature, 2002, 416: 885-891
[5] Park P J, Tian L, Kohane I S. Linking gene expression data with patient survival times using partial least squares. Bioinformatics, 2002, 18(12): 120-127
[6] Wold S, Sjostrom M, Eriksson L. PLS-regression: A basic tool of chemometrics. Chemom Intell Lab Sys, 2001, 58(2): 109-130
[7] Barker M, Rayens W. Partial least squares for discrimination. J Chemom, 2003, 17(3): 166-173
[8] Song X H, Hopke P K. Pattern recognition of soil samples based on the microbial fatty acid contents. Environ Sci Technol, 1999, 33(20): 3524-3530
[9] Nguyen D V, Rocke D M. Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 2002, 18(9): 1216-1226
[10] Ang X H, Pan W. Linear regression and two-class classification with gene expression data. Bioinformatics, 2003, 19(16): 2072-2078
[11] Tan Y X, Shi L M, Tong W D, et al. Multi-class tumor classification by discriminant partial least squares using microarray gene expression data and assessment of classification models. Computational Biology and Chemistry, 2004, 28(3): 25-244
[12] Bhattacharjee A, Richard W, Staunton J, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences, 2001, 98(24): 13790-13795
[13] Yan X T, Deng M, Fung W, et al. Detecting differentially expressed genes by relative entropy. Journal of Theoretical Biology, 2005, 234(3): 395-402