Acta Scientiarum Naturalium Universitatis Pekinensis, Vol. 45, No. 1 (Jan. 2009)
Searching for Differentially Expressed Genes by PLS-VIP Method
HAN Fang 1, WU Jingchen 1, XU Jiangfeng 1, DENG Minghua
1. School of Mathematical Science, Peking University, Beijing 100871; 2. Center for Theoretical Biology, Peking University, Beijing 100871; Corresponding Author, E-mail: dengmh@
Abstract A new approach called PLS-VIP, based on the variable importance of projection (VIP), is proposed for detecting differentially expressed genes. It sums up the contributions of the components obtained by the partial least squares (PLS) method. Because it takes the correlation among genes into consideration, it is more suitable than classification methods that examine each gene separately. The genes extracted by ordinary PLS, discriminant PLS and the proposed method are compared by how well they classify multi-class tumors. The results show that the error rate of the new method is clearly lower than that of the other two methods in most cases.
Key words microarray; differentially expressed genes; reduction of dimension; PLS-VIP
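Screening Differentially Expressed Genes Using the PLS-VIP Method
Abstract A new method for finding differentially expressed genes based on variable weights is proposed.
The ability of the differentially expressed genes extracted by this method to discriminate samples is compared with that of the ordinary PLS method and the discriminant least squares method. The results show that the error rate of this method is clearly lower than that of the other two traditional methods.
Therefore, the PLS-VIP method is a fairly suitable method for extracting differentially expressed genes and discriminating samples.
Key words microarray; differentially expressed genes; dimension reduction; PLS-VIP CLC number O213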
Microarrays have become increasingly common in biological and medical research. They enable the simultaneous study of thousands of genes and provide gene expression information at the whole-genome level on an unprecedented scale. With the advent of DNA microarray technology, it became possible to detect differentially expressed genes on a genome-wide scale. A major question in microarray studies is how to select genes associated with a specific tumor subtype or with patient survival. In a clinical context, such differentially expressed genes (DE) are often referred to as clinical markers. Identification of clinical markers may lead to improved diagnosis and treatment guidance, early disease detection and prediction of clinical outcomes.
The most commonly used tools for identifying differentially expressed genes include the t-test, regression and linear regression analysis, some of which often require more samples than variables. However, DNA microarray gene expression data typically contain many variables (genes) but only a few samples. Therefore, these methods may lose their effectiveness in real applications.
In this context, we consider principal component analysis (PCA) and partial least squares (PLS) for dimension reduction. Both attempt to find a set of principal components (linear combinations of the original variables): PCA components account for maximum variation, whereas PLS components account for maximum correlation with the class information. Both can work effectively without requiring large samples. Compared with PCA, PLS takes the information provided by the response variables into account when constructing the components, and PCA is generally less powerful than PLS in classification. The PLS method is therefore introduced to identify differentially expressed genes. PLS can be regarded as a compromise between PCA and ordinary linear regression. It is of particular interest for analyzing data with strongly collinear (correlated), noisy and numerous predictors.
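The difference between PCA and PLS described here can be made concrete with a small sketch. It assumes the standard single-response form of PLS, in which the first weight vector is proportional to the covariance of each centered variable with the centered response. The function name `first_pls_weight` and the toy data are hypothetical and are not taken from the paper.

```python
import math


def first_pls_weight(X, y):
    """First PLS weight vector for a single response (sketch).

    X: list of n samples, each a list of p variable values.
    y: list of n numeric responses (e.g. 0/1 class labels).
    Returns a unit-length list of p weights proportional to the
    covariance between each centered variable and the centered response.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    w = [
        sum((X[i][j] - col_means[j]) * (y[i] - y_mean) for i in range(n))
        for j in range(p)
    ]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("no variable covaries with the response")
    return [v / norm for v in w]


# Toy data: variable 0 tracks the class label; variable 1 has a much
# larger variance but no covariance with the label.
X_toy = [[1.0, 10.0], [2.0, -10.0], [3.0, -10.0], [4.0, 10.0]]
y_toy = [0.0, 0.0, 1.0, 1.0]
print(first_pls_weight(X_toy, y_toy))  # → [1.0, 0.0]
```

In this toy case a variance-driven direction would be dominated by the second variable, while the response-driven weight ignores it.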
Discriminant PLS (D-PLS) is another popular PLS-related method for detecting differentially expressed genes. It extracts genes by sorting the sum of squared correlation coefficients between gene expression and each of the response variables. D-PLS preserves more of the data than traditional PLS and therefore seems to perform better in classification. This advantage can also be seen in our experimental results.
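The ranking criterion described for D-PLS can be sketched as follows. This is only an illustration of the "sum of squared correlation coefficients" score, not a full D-PLS implementation. It assumes the response variables are class-indicator columns, and the names `pearson` and `rank_by_squared_correlation` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    if s_aa == 0 or s_bb == 0:
        return 0.0
    return s_ab / math.sqrt(s_aa * s_bb)


def rank_by_squared_correlation(X, Y):
    """Rank genes by the sum of squared correlations with each response column.

    X: n samples x p genes; Y: n samples x q response columns
    (assumed here to be 0/1 class indicators).
    Returns (gene indices sorted from highest to lowest score, scores).
    """
    p, q = len(X[0]), len(Y[0])
    scores = []
    for j in range(p):
        gene = [row[j] for row in X]
        scores.append(
            sum(pearson(gene, [row[c] for row in Y]) ** 2 for c in range(q))
        )
    order = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data: gene 0 separates the two classes, gene 1 does not.
X_toy = [[1.0, 10.0], [2.0, -10.0], [3.0, -10.0], [4.0, 10.0]]
Y_toy = [[1, 0], [1, 0], [0, 1], [0, 1]]
order, scores = rank_by_squared_correlation(X_toy, Y_toy)
print(order, scores)
```

Each gene is scored on its own here, which is the gene-by-gene behavior that the proposed method aims to improve on.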
In this paper, we use PLS to reduce the dimension and introduce a new calculation method called variable importance of projection (VIP) to obtain the DE. VIP sums the PLS component coefficient scores of each variable and is effective in combining the information in the data. To compare the gene extraction of the different methods, we use linear discriminant analysis (LDA) and measure the effect by the classification error rate. The results show that the DE extracted by our method, combined with LDA, largely decrease the misclassification error rate compared with the other methods.
With the advent of DNA microarray technology, it became possible to detect differentially expressed genes on a genome-wide scale [1-4]. However, how to analyze microarray data with appropriate statistical methods is still a challenge. The PLS and D-PLS methods have been proposed to deal with the raw data, and both have some disadvantages [5-11].
In this paper, a new method is proposed to find differentially expressed genes based on microarray data. We take microarray data as our input. To explore which predictors contribute more to the response, we look at the summation of the PLS component coefficients for the standardized data. Genes with large absolute coefficients can be regarded as differentially expressed. The variable importance for projection (VIP) is proposed to summarize the contribution of the variables. The VIP of a predictor is a synthesis of its PLS component coefficients. This score is used for ranking genes.
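In detail, we use only a certain number ($k$) of top principal components to compute the VIP score. Let $w_i$ be the $i$-th component, with $w_{ij}$ its coefficient for the $j$-th variable, and let $r_i$ be the eigenvalue of the $i$-th component. Then we define
$$W_{ij} = \frac{w_{ij}}{\mathrm{uss}(w_j)}, \qquad (1)$$
$$R_i = \frac{r_i}{\mathrm{uss}(r)}, \qquad (2)$$
where $\mathrm{uss}(w_j) = \sum_{i=1}^{k} w_{ij}^2$ and $\mathrm{uss}(r) = \sum_{i=1}^{k} r_i^2$.
The VIP score of the $j$-th variable is defined as
$$\mathrm{VIP}_j = \sqrt{\frac{p \sum_{i=1}^{k} R_i W_{ij}^2}{\sum_{i=1}^{k} R_i}},$$
where $p$ is the dimension of the variables.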
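A minimal sketch of this score computation follows. It transcribes the definitions of $W_{ij}$ and $R_i$ as written, with both sums of squares taken over the $k$ components. The square root and the factor $p$ are assumed from the usual form of the VIP statistic. The function name `vip_scores`, the input layout and the toy numbers are hypothetical and are not the authors' code.

```python
import math


def vip_scores(weights, eigenvalues):
    """VIP score of each variable from k PLS components (sketch).

    weights[i][j]  : w_ij, coefficient of variable j in component i (k x p).
    eigenvalues[i] : r_i, eigenvalue of component i (assumed positive).
    Returns a list of p scores.
    """
    k = len(weights)
    p = len(weights[0])
    uss_r = sum(r * r for r in eigenvalues)      # uss(r) = sum_i r_i^2
    R = [r / uss_r for r in eigenvalues]         # definition (2)
    total_R = sum(R)
    scores = []
    for j in range(p):
        uss_wj = sum(weights[i][j] ** 2 for i in range(k))  # uss(w_j)
        if uss_wj == 0:
            scores.append(0.0)                   # variable unused by every component
            continue
        W = [weights[i][j] / uss_wj for i in range(k)]      # definition (1)
        weighted = sum(R[i] * W[i] ** 2 for i in range(k))
        # Assumed form: square root of p times the R-weighted mean of W_ij^2.
        scores.append(math.sqrt(p * weighted / total_R))
    return scores


# Toy input: 2 components, 2 variables.
toy_weights = [[0.6, 0.8], [0.8, -0.6]]
toy_eigenvalues = [3.0, 1.0]
scores = vip_scores(toy_weights, toy_eigenvalues)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

Sorting the scores in descending order gives the gene ranking. Because $R_i$ appears in both numerator and denominator, the scaling by $\mathrm{uss}(r)$ cancels and only the relative sizes of the eigenvalues matter.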
Next, we use discriminant analysis (DA) to classify samples with the selected variables. DA is one of the most important and popular statistical methods for classification, so we adopt it both as the classifier and as a tool to test the effect of the dimension reduction.
We apply our method to two real data sets. One is the leukemia data of Golub et al. [1], which involves 7129 genes in 47 acute lymphoblastic leukemia (ALL) and 25 acute myeloid leukemia (AML) samples. It is divided into a training set of 27 ALL and 11 AML samples and a testing set of 20 ALL and 14 AML samples. The other is the human lung carcinoma data (B) of Bhattacharjee et al. [12], which involves 12600 genes in 156 samples: 139 adenocarcinoma and 17 normal lung samples.
2.2 Gene Selection
2.2.1 The Leukemia data
We rank genes by the VIP scores learned from the training data and compare the resulting list with the genes selected by Golub et al. [1]. Among the top 50 genes chosen by our method, twenty-four are shared with Golub's prediction. Table 1 lists the genes shared by the two lists. Golub et al. [1] showed that genes such as CD33 (M23917) and c-myb (U22376) are known to be related to the two types of leukemia; these genes are also included in the list produced by PLS-VIP. Zyxin (X95735), described as a new marker of acute leukemia subtype in Golub's paper, is ranked 1st by PLS-VIP.
Table 1 Twenty-four genes selected both by PLS-VIP and Golub et al.
Gene accession number    Rank by PLS-VIP    VIP score
X95735                   1                  2.830914
U22376                   49                 2.1845380
VIP score is calculated by the PLS-VIP method using the top 8 components.
2.2.2 The lung cancer data
By learning from the data and sorting the VIP value of each gene, we obtain a ranking of the genes, in other words, a criterion for how important each gene is. Table 2 lists the genes in the lung cancer data of Bhattacharjee et al. [12] that rank in the top 10 by the PLS-VIP method.
2.3 Comparison with other methods
We compared the efficiency of PLS-VIP with that of the traditional PLS and D-PLS methods by using each in combination with Bayes linear discriminant analysis (LDA). First, we extract three groups of top n genes (n varies), ranked by the PLS-VIP, traditional PLS and D-PLS methods respectively. Second, using only these n genes, we classify the samples by the LDA approach. The error rate is then obtained by leave-one-out cross-validation (LOOCV). The results are illustrated in Fig. 1 and Fig. 2.
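The evaluation loop described above can be sketched as follows. To keep the example free of dependencies, a nearest-class-mean rule stands in for LDA (it coincides with LDA only under an identity covariance and equal priors), so this is not the classifier used in the paper. The names `nearest_mean_predict` and `loocv_error_rate` and the toy data are hypothetical.

```python
def nearest_mean_predict(train_X, train_y, x):
    """Stand-in classifier: assign x to the class whose mean is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, mean))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_error_rate(X, y, gene_idx, predict=nearest_mean_predict):
    """Leave-one-out error rate using only the gene columns in gene_idx."""
    Xs = [[row[j] for j in gene_idx] for row in X]
    errors = 0
    for i in range(len(Xs)):
        train_X = Xs[:i] + Xs[i + 1:]   # hold out sample i
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, Xs[i]) != y[i]:
            errors += 1
    return errors / len(Xs)


# Toy data: gene 0 separates the classes, gene 1 is uninformative.
X_toy = [[0.0, 5.0], [0.2, 1.0], [0.1, 3.0], [1.0, 4.0], [1.2, 2.0], [0.9, 0.0]]
y_toy = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
for genes in ([0], [1]):
    print(genes, loocv_error_rate(X_toy, y_toy, genes))
```

Calling `loocv_error_rate` with the top n indices from each ranking, for a range of n, gives one error curve per ranking method. For an unbiased estimate the gene ranking would normally be recomputed inside each fold; the text does not say whether that was done.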
Table 2 Ten top genes selected by PLS-VIP to distinguish normal and adenocarcinoma samples
Gene accession number    Rank by PLS-VIP    VIP value
40994                    1                  3.268897
Fig. 1 Comparison of the classification performance using genes ranked by different methods
Fig. 2 Comparison of the classification performance using genes ranked by different methods
From Fig. 1 and Fig. 2, we can see that the prediction error is lowest when about 300-500 genes are selected. The error rate increases at the beginning, drops more or less markedly at about 50-100 genes, and stays at about the same level once the number of genes exceeds 250. This phenomenon was also observed by Yan et al. [13] and is reasonable for the following three reasons. First, when few genes are selected, they achieve a low error rate at the expense of failing to fit a model well to the original data. Second, as informative genes are gradually added to the selected gene set, more and more information is included, causing the prediction error to decrease. Third, once the set of differentially expressed genes is saturated, additional genes contribute little further information and the error rate stays at the same level.
Additionally, the error rate of the PLS-VIP method is clearly lower than that of the traditional PLS and D-PLS methods.
2.4 Effect of the number of components on classification
In the discussion above, we extracted the top 8 principal components by default to compute the VIP score. This choice calls for explanation. We compare the classification error rates obtained with 2, 4, 6 and 8 PLS principal components on the leukemia data [1] in Fig. 3. The results show that classification with 8 components is better than with the others.
Fig. 3 Comparison of the classification performance using the top 2, 4, 6 and 8 principal components
PCA and PLS are the most popular and successful approaches for dimension reduction. They attempt to find a set of principal components (linear combinations of the original variables) that account for maximum variation or for maximum correlation with the class information. However, both methods may leave much information out.
The PLS-VIP method, on the other hand, takes the most influential n components extracted by PLS into consideration. It discards much less information than the traditional PLS method. Working as a modified PLS, the PLS-VIP method performs better in variable selection. Our analysis shows that PLS-VIP is able to preserve more information for classifying the Golub leukemia data [1] and for separating normal from adenocarcinoma samples in the lung cancer data [12].
The comparison between our method and the D-PLS method also indicates that our method is better than D-PLS when N is comparatively large and no worse when N is small. Accordingly, we believe the PLS-VIP method is effective and expect it to be useful in more settings.
In this manuscript, we use LDA as the classifier. Logistic analysis is another traditional approach to classification. However, it does not work well when the sample is not large enough. In this respect, Bayes discriminant analysis has an advantage, as it reduces the required number of samples.
Next, we want to refine this method and state it more explicitly in statistical language. Furthermore, the PLS-VIP method should be tested on more data, and more comparisons should be made between PLS-VIP and other methods.
