Figure Legends

合集下载

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

A Hidden Markov Model web application for analysing bacterial genomotyping DNA microarray experiments Richard Newton1,Jason Hinds2and Lorenz Wernisch1
1School of Crystallography,Birkbeck College,University of London,Malet St.,London WC1E7HX,UK
2Bacterial Microarray Group,Department of Cellular and Molecular Medicine,St George’s Hospital Medical School,London SW17ORE,UK
Running Title:Hidden Markov Model for bacterial DNA experiments
1
Acknowledgments
The TB data set was provided by Jaqueline Inwald,Veterinary Laboratories Agency,the YP data set by Stewart Hinchliﬀe,London School of Hygiene and Tropical Medicine and the SA data set by Jodi Lindsay,St.George’s Hospital Medical School.RN was funded as part of a Wellcome Trust Functional Genomics programme grant.The Wellcome Trust funds the BµG@S multi-collaborative microbial pathogen microarray facility under its Functional Genomics Resources Initiative.
Corresponding Author:
Richard Newton,
School of Crystallography,
Birkbeck College,
University of London,
Malet St.,
London WC1E7HX,UK
email:r.newton@ telephone:+44(0)2076316869 fax:+44(0)2076316803
Figure Legends
Figure1-Plot of M values with genome position
A typical plot of M values with genome position for an array from the T
B data set(test strain2122/97).
Figure2-Density distribution of M values
Density distribution of M values showing two normal distributionsﬁtted by the HMM(test strain2122/97).
Figure3-Plot of M values with genome position in the RD8region of M.tuberculosis H37Rv
Plot of M values with genome position in the RD8region of M.tuberculosis H37Rv,showing the gene assignments of the HMM and the position of a cut-oﬀof minus three standard deviations.
Abstract
Abstract:Whole genome DNA microarray experiments compare the gene content of dif-ferent species or strains of bacteria.A statistical approach to analysing the results of these experiments was developed,based on a Hidden Markov model,that takes adjacency of genes along the genome into account when calling genes present or absent.The model was im-plemented in the statistical language R and applied to three data sets.The method is numerically stable with good convergence properties.Error rates are reduced compared to approaches that ignore spatial information.Moreover,the HMM circumvents a problem encountered in a conventional analysis:determining the cutoﬀvalue to use to classify a gene as absent.An Apache Struts web interface for the R script was created for the beneﬁt of users unfamiliar with R.
Availability:The application may be found at /hmmgd. Source code illustrating how to run R scripts from a Struts based web application is available from the authors on request.The application is also available for local installation if required.
Introduction
DNA microarrays are commonly used to monitor gene expression proﬁles at the mRNA level. However,microarrays have also become an established technique in bacterial comparative genomics,in which comparisons are made at the genomic DNA rather than mRNA level, studying gene content rather than gene expression.In numerous studies of this type,DNA microarrays have been used to compare a test strain,of unknown gene content,to a sequenced reference strain.[1–11]Bacterial species or strains may vary phenotypically in their degree of pathogenicity,virulence,transmission or host speciﬁcity.The ability to compare genomes using microarrays means that it is possible to identify the genes potentially associated with these diﬀerences.
However there are problems intrinsic to the method,namely sequence divergence and cross-hybridization,plus the noise inherent in any usage of microarray technology,that require a statistical analysis of the data.In this paper an analysis is described based on aﬁrst order Hidden Markov model(HMM)in which each gene is represented by a hidden variable that can be in one of two states,absent or present.
The acquisition,loss or rearrangement of DNA in bacterial genomes commonly involves whole blocks of contiguous genes.For example,in the comparison of Mycobacterium bovis AF2122/97with Mycobacterium tuberculosis H37Rv there are around80absent genes,but arranged in just11segments.Two of these segments contain one gene only,the remaining segments range in size from3genes up to17genes.[24]This feature of bacterial genomes results from events such as the horizontal transfer of DNA,recombination between repetitive DNA or phage-mediated events.So the state of a gene often depends on the states of the adjacent genes.The HMM exploits this fact and we show that this reduces error rates compared to approaches that ignore spatial information.
A further advantage of an HMM over a conventional analysis,which calls genes present or absent based on cut-oﬀvalues,is that the latter requires use of either an arbitrary cut-oﬀ
value or requires additional empirical determination of such values.In contrast,an HMM provides a probability for each of the two possible states of a gene,absent or present,making a decision straightforward without the need for further experimental work.
The HMM was written in the statistical language R.[12]This programming language is of tremendous importance in theﬁeld of bioinformatics.As a high-level mathematical scripting language it allows the rapid development of programs and now includes a vast range of pre-programmed functions.The Bioconductor[13]project contains many packages speciﬁcally for bioinformatics.A disadvantage of R is that many potential users of bioinformatics programs written in R,who come from a non-bioinformatics background,are unfamiliar with the language.A solution to this problem is to provide a user-friendly web interface for R scripts.
Whole genome DNA microarrays
Whole genome DNA microarrays are typically constructed so that there is a reporter element on the array representing each gene in the genome of the sequenced reference strain.Reporter elements are speciﬁc fragments of DNA able to hybridize complementary DNA present in either of the two genomic DNA samples being compared in the experiment.Following labeling of the two DNA samples with diﬀerentﬂuorescent dyes,the two labeled samples are co-hybridized to a single array.Measurement of theﬂuorescent intensities of the two dyes bound to a particular reporter element on the microarray enables the relative abundance of hybridizing DNA in each sample to be determined.For each reporter element,the relative signal intensity levels of the test strain compared to the reference strain give an indication whether a gene that is present in the reference strain is present,absent or divergent in the test strain.The result for a gene is expressed as a log2ratio of the test strain signal intensity compared to the reference strain signal intensity,so a lower ratio is indicative of divergence or absence of that particular gene in the test strain.Typical experiments may compare dozens of diﬀerent test strains to a reference strain and generate large amounts of data.
In practice the data is not as clear-cut as may be expected,whereby genes would clearly segregate into either present or absent categories.Generally the microarray process in-troduces experimental noise,but there are additional technical artifacts,caused by cross-hybridization and sequence divergence,that reduce the ability to clearly determine whether genes are present or not.
Cross-hybridization results from an inability to select reporter elements that are truly unique for each gene.For a small minority of reporters,the signal measured may not be speciﬁc for the desired target gene but may also include signal from paralogues in the genome that are able to cross-hybridize to the reporter to some degree.This can have two eﬀects: 1)if the target gene is absent but cross-hybridizing paralogues are not,then the ratio for the target gene will not be as low as for absent genes not aﬀected by cross-hybridization,2) if a cross-hybridizing paralogue is absent but the target gene is not,then the ratio for the target gene will be reduced and the target gene may incorrectly appear absent,due to the loss of cross-hybridizing DNA.
Sequence divergence between the test and reference strains also contributes to a lack of clarity.A gene in the test strain with a reduced amount of sequence similarity to the reference strain will generate a lower ratio than a gene with identical sequence in the test and reference strains.
Various analysis approaches have been used to determine the genes considered absent or highly divergent in the test strain.These range from aﬁxed arbitrary ratio cut-oﬀset at a two-fold diﬀerence,[4,11]an empirically determined cut-oﬀbased on independent experimental determination of presence or absence[5]or availability of comparative sequence information.[2,7,8]Other methods used to analyse comparative genomics data have accounted for variation in the spread of data to set more dynamic and appropriate cut-oﬀs.These include using the variation of ratios for sub-sets of genes considered universally conserved or absent in all the test strains to determine the cut-oﬀs appropriate for a particular strain[6]
and assessing shape properties of the normalized ratio distribution to assign cut-oﬀs to detect outliers in the tail of the distribution that represent absent or highly divergent genes.[14] If several strains of a bacteria have been sequenced then it is possible to design microar-rays with reporter elements representing more than just the genes in the genome of the reference strain in the experiment.[15]The genomes of all the sequenced strains can be used in the design.As well as absences the microarray can now measure additions,that is,those genes that are present in the test strain but absent in the reference strain.The additions have positive log ratio values.
Hidden Markov Model
Hidden Markov Models(HMM)wereﬁrst applied in theﬁeld of speech recognition[16]but in recent years have found numerous uses in theﬁeld of bioinformatics in particular for homology searching and gene prediction.[17,18]Hidden Markov Models are a method for modeling an observed,sequential signal in terms of a sequence of hidden variables in order to identify the underlying sequence of source states producing the observed signal.In the context of this paper the sequence of log ratio values of genes along the genome,measured from the microarray experiment is the observed signal.Fridlyand et al.[19,20]have used a HMM approach to analyse microarray comparative genomic hybridization data for copy number alterations in tumours.This paper describes a related model designed speciﬁcally for analysing bacterial genomotyping microarray data.
Apache Struts
In order to create a user-friendly web interface for the HMM R script we chose to use Apache Struts.[21]Struts is an open source and widely used Java based framework for constructing web applications.The Struts framework separates out the three main components of a web application,the View(the way in which information is presented to the user),the Model(the
data processing)and the Controller(controlling theﬂow of the application).This produces a well-organised,stable and extensible web application.
Methods
Data
Three sets of microarray data from BµG@S(Bacterial Microarray Group at St.George’s Hospital Medical School)were used in the development of the Hidden Markov model.The ar-rays in all three data sets contained approximately3000-4000gene-speciﬁc reporter elements for genomes of3-4Mb in size,so on average the resolution of reporters was every1Kb in the genome.Array designs can be accessed in ArrayExpress(/arrayexpress) with Accession Numbers A-BUGS-1(TB),A-BUGS-11(YP)and A-BUGS-16(SA).In the ﬁrst set,referred to as the TB data set,the microarray used was based on the sequenced genome of Mycobacterium tuberculosis H37Rv.[22]Three strains of Mycobacterium bovis, namely BCG Pasteur,AN5and2122/97,were compared to the M.tuberculosis H37Rv reference strain with three replicate hybridizations performed for each strain.[7] Figure1plots the log2ratio values(denoted by M in the following[23])against position along the genome for a microarray from the TB data set(test strain2122/97).Each point on the graph corresponds to a single gene.The majority of genes are present in the test strain and are found in a distribution centred around M=0.Some genes have M values in the region below this distribution and it is these genes which are either absent in the test strain or have a signiﬁcantly diverged sequence compared to the reference strain gene sequence.A plot of the density of M values from this data set can be seen in Figure2.
The TB data set was used to assess the results from the HMM as there were already inde-pendent comparative studies for two of the strains analysed.Mycobacterium bovis2122/97 used in this experiment has been sequenced[24]and therefore any gene absences or sequence
divergence indicated by the microarray results can be validated at the sequence level.In addition,gene absences in M.bovis BCG Pasteur have been determined previously in another microarray study.[1]The insertion sequence IS6110occurs in multiple copies throughout the genome,which cross-hybridise to each other extensively.It has therefore been excluded from further analysis.
In the second data set,referred to as the YP data set,all genes in the genome of Yersinia pestis strain CO-92were used as the basis of the array.This sequenced reference strain of Y.pestis was compared to22further strains of Y.pestis with two replicate hybridizations performed for each strain.[8]The results from a previous analysis of the data,[8]that included PCR analysis of speciﬁc genes,were used to assess the performance of the HMM in detecting gene absences.
The third data set,referred to as the SA data set,used microarrays designed from seven strains of Staphylococcus aureus.[15]The data set comprised six microarrays each comparing the same sequenced reference strain to a diﬀerent sequenced test strain.Because the microarray was designed from more than one sequenced strain,as well as measuring absences in the test strain relative to the reference strain,the experiment can measure additions in the test strain relative to the reference strain.
To correct for spatial variation across an array two-dimensional loess(local weighted re-gression)normalization was applied to the data.[25]To correct for variation in the red/green bias with spot intensity,one dimensional loess normalization was used.[23]The robust mea-sures of location and scale,the median and mad estimators,were used for between-array normalization,so that intensity values between arrays become comparable.
Hidden Markov Model
Structure of the HMM
The HMM models the test strain genome as a sequence of hidden variables.Each hidden variable may be in one of two states,depending on whether the corresponding gene is present or absent in the strain.We try to determine the sequence of hidden states from the observed sequence of log ratio values.This type of HMM is described by transition probabilities and emission densities.The four transition probabilities refer to the probabilities of either staying in the same state or switching to the alternative state when moving between adjacent positions in the genome.
A probability density function for the emission of log ratio values is associated with each one of the two states.We assume that genes are ordered corresponding to their position in the genome,each gene deﬁned by a position t.The observed log ratio value of a gene at position t is denoted by y t.The hidden variable corresponding to the gene at position t is X t which can take on values0(present)or1(absent).The emission probabilities are given by P(y t|X t).Two normal distributions were used to model the emission probabilities,that is P(y t|x t)=N(y t;µx t,σx t)for x t0or1.Determining the emission probabilities of the two states involves determining the mean and standard deviation of the two distributions.
By including both emission and transition probabilities the HMM uses not only the log ratio value of an individual gene to classify the gene as absent or present but also takes into account the values of neighbouring genes.
There are two steps to using an HMM to analyze experimental data.[16]Firstly the model parameters,that is the emission densities and transition probabilities,must be determined from the observed sequence.The model is then used to predict the most probable state sequence that could have generated the observed sequence.
For the SA data set which measures additions as well as absences a three state HMM is required,with a state corresponding to genes absent in the test strain but present in
the reference strain,a state corresponding to genes present in both strains,and a state corresponding to genes present in the test strain but absent from the reference strain.
Fitting the model,prediction of states
The HMM parameters(emission densities and transition probabilities)are estimated using an expectation maximization(EM)approach.[26]This is an extension of the maximum likelihood method forﬁtting the parameters of a distribution to a set of data.
Essentially,in the E-step,expected values for the hidden state variables are derived using initial estimates of all parameters.In practice this step is implemented using a forwards-backwards algorithm.A modiﬁcation of this algorithm as described by Murphy,[27]which helps to reduce problems with data underﬂow,was used.Once expected values of hidden variables are calculated,model parameters can be inferred by maximizing the expected log-likelihood in the M-step.For normal emission densities the M-step is straightforward because simple analytical expressions are obtained for re-estimating the means and standard deviations.
The expected values of hidden states are then updated using the re-estimates of the model parameters,in a new E-step,followed by another M-step.The cycle is repeated until the model’s parameter values have stabilized.At each iteration the percentage change from the previous iteration in the log likelihood of the data was calculated and the algorithm was considered to have converged when the percentage change was less than0.01.Around 10iterations of the EM algorithm were found to be suﬃcient for convergence,depending on the data set.Progammed in R this takes about30seconds to complete.Convergence was found to be insensitive to initial estimates of model parameters.The HMM converged quickly with general initial estimates such as means of0(present)and-5.0(absent),standard deviations of1.0,and transition probabilities all set to0.5.
Predicting the sequence of hidden states(that is,absence or presence of genes)is a matter
ofﬁnding the most probable sequence of states,given the model parameters and the observed data,by using the Viterbi algorithm in theﬁnal Viterbi step.[27]
Because of the relatively high level of noise in microarray experiments replication of arrays is essential for increasing the reliability of results.The HMM can be run separately on the replicate arrays and the results combined using an‘all replicates agree rule’which assigns a gene as absent if and only if all replicates agree it is absent.
Running the R script from a Struts web application
The Model part of the Struts web application,a Java program,is responsible for data processing.This program must run the R script,passing user entered information to the script,and then wait for the script toﬁnish.The web application then needs to display the results of running the script.If the R script takes some time to run the web application should display real-time progress information for the user.The web application should also be able to retrieve and display any R errors.
The R script is run using the Java Runtime class.Each Java application has one instance of the Runtime class which allows the application to interface with the environment in which it is running.The R script is run using R batch mode(using R CMD BATCH).This batch command is placed in a shell script.The shell script in turn is run as a Process using the Model program’s instance of the Runtime class.In Java an instance of the Process class has a method that causes the current thread to wait until the Process has completed.
The web application and the R script communicate viaﬁles,saved in a directory.There is one such uniquely named directory for each submission of data.The information the user enters on the application’s web pages needs to be passed as variables to the R script.In our application the Model program writes this information,as R assignments,into a text ﬁle which is concatenated with the main R script prior to execution.Alternatively the user entered data may be written to aﬁle that the R script then reads on execution.To pass
results back to the web application the R script writes the results toﬁles and also saves a list of theseﬁlenames.The web application,on completion of the R script,reads the list of ﬁlenames and displays them as links on the web page of results.
Pseudo real-time progress information is achieved by having the R script append infor-mation to a textﬁle at stages during its execution.This textﬁle is displayed by a Javascript pop-up window which refreshes itself atﬁxed intervals.
Struts is particularly eﬃcient at validating data input and if necessary,returning error messages to the web page,but it is also important to cater for errors that may occur whilst the R program is running.This is achieved by using the options command in R.This is set so that if an error does occur the error message is written to a textﬁle.The web application checks thisﬁle when the Process terminates and if it is not empty,displays the contents as an error message on the web page.
As well as online processing the web application provides the option for oﬄine processing when the results are e-mailed to the user rather than displayed on the web page.When data is submitted for oﬄine processing its submission ID is added to a list in a textﬁle.A continuously running Java program checks thisﬁle at regular intervals for additions to the list,in which case it processes the relevantﬁles by running the R script,and on completion e-mails the results to the user.
Results and Discussion
A graph showing the two normal distributionsﬁtted by the HMM to the data from the T
B data set(2122/97strain),is shown in Figure2.Because the actual gene absences are known for the TB data set it is possible to classify the HMM gene assignments as either true positives(TP),false negatives(FN)or false positives(FP).The results of applying the HMM to the TB data set are given in Table1.The table shows the results for the HMM compared to a conventional analysis using a cut-oﬀof minus three standard deviations.The
average of the results for the three strains show that the HMM gives gene assignments with a lower FP rate(2.3)than a conventional analysis using a cut-oﬀof three standard deviations (3.3).The FN rate is also lower for the HMM(3.3compared to6.7).The overall error rate FP+FN is5.6for the HMM and10.0for the cut-oﬀmethod.
The FP rate of the conventional cut-oﬀanalysis may be reduced by lowering the cut-oﬀvalue.A cut-oﬀof minus four standard deviations was found to minimize the FP rate of the conventional analysis to1.0,which is lower than the FP rate of the HMM(2.3).There is however a concomitant increase in the FN rate to9.3,which is far higher than that of the HMM(3.3),and the overall error rate(FP+FN)is now10.3.
The lowest overall error rate possible for the conventional cut-oﬀmethod was10.0,corre-sponding to a cut-oﬀof minus three standard deviations,so the HMM performs better,with a lower error rate,than the best conventional analysis.And in practice extra experimental work is required to determine the optimum cut-oﬀto use in a conventional analysis,whereas the HMM requires no experimentally determined cut-oﬀ.
In some arrays extreme negative M values caused problems with machine underﬂow. The probability that a gene belongs in the population of present genes becomes too small for the computer and the program fails.To prevent this the application trims extreme negative M values.Extreme positive values can also cause problems.At high positive M values the tails of the two normal distributions may crossover so that the absent distribution becomes, incorrectly,more probable for these high positive M values.To prevent this the application trims high positive M values and also checks theﬁnal list of absent genes for any that may have positive M values.
Figure3shows the M values for a section of the genome of M.tuberculosis H37Rv in the absent region RD8(genes Rv3617to Rv3622)which is absent in M.bovis2122/97.The two genes in the middle of the block(Rv3619and Rv3620)with M values close to zero are in fact absent,but their M values are elevated due to cross-hybridization.The graphs show
how the HMM recognizes that the genes are absent,whereas a conventional three standard deviation cut-oﬀanalysis will mistakenly assign Rv3620as present.
The results for eleven strains from the YP data set are given in Table1.In the YP data set the distributions of M values of present and absent genes are not in general as well deﬁned as for the TB data set.It was noted that some strains have very few gene absences and a few of the strains have signiﬁcant numbers of genes with some degree of sequence divergence.
Results are compared to that obtained from a simple cut-oﬀin log2ratio values of-2.32, which corresponds to a ratio of0.2as used in the analysis of Hinchliﬀe et al.[8]The average results for the eleven strains show that the HMM gave slightly higher FP(3.0)than the conventional cut-oﬀmethod(1.3)but gave greatly reduced FN(5.5compared to14.0).The HMM was tested on5strains from the YP data set which have only one gene absence and was found not to work.Similarly for the strains in the data set that contain no absences. In these cases the HMM either fails to converge or else the absent distributionﬁts to the sequence diverged tail of the distribution of present genes.The HMM is also likely not to work if applied to data from a bacterial genomotyping experiment where there is very little similarity between the strains or species being compared at the level of DNA sequence divergence or genome organisation.
A three state HMM was applied to the SA data set.The HMM gave a9.5%reduction in the total error rate(FP+FN)averaged over the six arrays in the experiment.
Conclusion
An HMM analysis for gene absences achieves a lower overall error rate than a conventional analysis without the need to determine any arbitrary parameters such as cut-oﬀvalues.All necessary parameters are estimated from the data alone.The conventional cut-oﬀanalysis requires independent experimental conﬁrmation of absences to determine the optimum cut-
oﬀvalue to use.
The HMM analysis was found to work well even when only1%of the test genome is absent.The HMM approach was found to be eminently suitable as the basis of an automatic bacterial strain microarray analysis tool,in that it was stable and converged in only a very few iterations.
The R language makes programming the matrix algebra required for a Hidden Markov Model quick and simple,but many potential users of the resulting program will be unfamiliar with using R.Providing a web interface for the R script solves this problem.It also simpliﬁes distribution and updating of the application since the R program runs on the server,the user only requiring a web browser to use the program.
Apache Struts provides a rigorous and well-tested framework for creating web appli-cations.Running R scripts from a Struts based web application is robust and simple to implement.Although quick to program,R code can be slow to run compared to,for exam-ple,C++or ing the Struts framework means that a prototype R script,if demand merits,can be rewritten in C++or Java and simply slotted into the application in place of the R code.。