Prediction of signal peptides and signal anchors by a hidden Markov model


Improved prediction of signal peptides SignalP 3.0.


J. Mol. Biol., to appear 2004.

Improved prediction of signal peptides: SignalP 3.0

Jannick Dyrløv Bendtsen1, Henrik Nielsen1, Gunnar von Heijne3 and Søren Brunak1∗

1 Center for Biological Sequence Analysis, BioCentrum-DTU, Building 208, Technical University of Denmark, DK-2800 Lyngby, Denmark
3 Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, SE-10691 Stockholm, Sweden
∗ To whom correspondence should be addressed (email: brunak@cbs.dtu.dk)

Keywords: Signal peptide, signal peptidase I, neural network, hidden Markov model, SignalP
Running title: Signal peptide prediction by SignalP

We describe improvements of the currently most popular method for prediction of classically secreted proteins, SignalP. SignalP consists of two different predictors based on neural network and hidden Markov model algorithms, where both components have been updated. Motivated by the idea that the cleavage site position and the amino acid composition of the signal peptide are correlated, new features have been included as input to the neural network.
This addition, combined with a thorough error-correction of a new data set, has improved the performance of the predictor significantly over SignalP version 2. In version 3, correctness of the cleavage site predictions has increased notably for all three organism groups: eukaryotes, Gram-negative and Gram-positive bacteria. The accuracy of cleavage site prediction has increased by 6-17% over the previous version, whereas the signal peptide discrimination improvement is mainly due to the elimination of false positive predictions, as well as the introduction of a new discrimination score for the neural network. The new method has also been benchmarked against other available methods. Predictions can be made at the publicly available web server http://www.cbs.dtu.dk/services/SignalP/.

1 Introduction

Numerous methods for predicting the correct subcellular location of proteins using machine learning techniques have been developed [1–…]. Computational methods for prediction of N-terminal signal peptides were published around 20 years ago, initially using a weight matrix approach [1,2]. Development of prediction methods shifted to machine learning algorithms in the mid 1990s [10,11], with a significant increase in performance [12]. SignalP, one of the currently most used methods, predicts the presence of signal peptidase I cleavage sites. For signal peptidase II cleavage sites found in lipoproteins the LipoP predictor has been constructed [13]. SignalP produces both classification and cleavage site assignment, while most of the other methods only classify proteins as secretory or non-secretory.

A consistent assessment of the predictive performance requires a reliable benchmark data set. This is particularly important in this area where the predictive performance is approaching the performance calculated from interpretation of experimental data, which is not always perfect. Incorrect annotation of signal peptide cleavage sites in the databases stems not only from trivial database errors, but also from peptide
sequencing where it may be hard to control the level of post-processing of the protein by other peptidases after signal peptidase I has made its initial cleavage. Such post-processing typically leads to cleavage site assignments shifted downstream relative to the true signal peptidase I cleavage site.

In the process of training the new version of SignalP we have generated a new, thoroughly curated data set based on the extraction and redundancy reduction method published earlier [14]. Other methods were used for cleaning the new data set, and we found a surprisingly high error rate in Swiss-Prot, where, for example, in the order of 7% of the Gram-positive entries had either wrong cleavage site position and/or wrong annotation of the experimental evidence. Also, we found many errors in a previously used benchmark set [12] (stemming from automatic extraction from Swiss-Prot), and it appears that some programs are in fact better than the performance reported (predictions are correct, while feature annotation is incorrect). For comparison, we made use of this independent benchmark data set that was initially used for evaluation of five different signal peptide predictors [12].

In the new version of SignalP we have introduced novel amino acid composition units as well as sequence position units in the neural network input layer in order to obtain better performance. Moreover, we have slightly changed the window sizes compared to the previous version. We have used fivefold cross-validation tests for direct comparison to the previous version of SignalP [10]. In the previous version of SignalP a combination score, Y, was created from the cleavage site score, C, and the signal peptide score, S, and used to obtain a better prediction of the position of the cleavage site. In the new version, we also use the C-score to obtain a better discrimination between secreted and non-secreted sequences, and have constructed a new D-score for this classification task. The architecture of the hidden Markov model SignalP has
not changed, but the models have been retrained on the new data set, and have also significantly increased their performance.

2 Results and discussion

Generation of data sets

As the predictive performance of the earlier SignalP method was quite high, assessment of potential improvements is critically dependent on the quality of the data annotation. We generated a new positive signal peptide data set from Swiss-Prot [15] release 40.0, retaining the negative data set extracted from the previous work. The method for redundancy reduction was the same as in the previous work [14], and was based on the reduction principle developed by Hobohm et al. [16]. Our final positive signal peptide data sets contain 1192, 334 and 153 sequences for eukaryotes, Gram-negative and Gram-positive bacteria, respectively.

In the previous work, we found many errors by detailed inspection of hard-to-learn examples during training and wrongly predicted examples. Nevertheless, we were quite sure that even after careful examination in this manner, the data set would probably still contain errors obtained from incorrect database annotation and wrongly interpreted laboratory results. Therefore, we developed a new feature based approach where abnormal examples can be detected by inspecting rare amino acid occurrences and outlier physical-chemical properties of signal peptides. In the following, we show that the isoelectric point of signal peptides can help in finding possible annotation errors and other errors, where these errors may be due to the fact that some (long) signal peptides annotated in Swiss-Prot actually include probable propeptides. In such cases, convertase cleavage sites are mixed together with signal peptidase I cleavage sites.

Removal of spurious cleavage site residues

Experimental assessment of the effect of certain amino acids in the cleavage site region has shown that rare residues do not allow for efficient cleavage [17,18]. Examination of amino acids around the signal peptidase I cleavage site in the data set revealed a
number of sequences containing amino acids which very rarely appear at the cleavage site. In the eukaryotic data set we found and removed seven sequences containing lysines (K) and 13 sequences containing arginines (R) at the −1 position. All sequences with either a lysine or an arginine at position −1 were investigated manually. All of them except one had a predicted cleavage site upstream of the annotated one. Most of these sequences probably undergo N-terminal maturation by different proteases, either in the Trans Golgi Network (TGN) or after release from the cell as mentioned below in the section on propeptide analysis. In one clear case we found an obvious error in the Swiss-Prot entry NPAB_LOCMI. According to the annotation the cleavage site is located between residues 24-25 (arginine in position −1), but in the original paper the authors identified the cleavage to occur between amino acids 22-23. In this case, the two amino acids, ER, are removed by a dipeptidase [19].

Furthermore, we removed sequences where other amino acids appeared at position −1 in very few of the sequences. For the eukaryotic data set, the only allowed residues at position −1 were alanine (A), cysteine (C), glycine (G), leucine (L), proline (P), glutamine (Q), serine (S) and threonine (T). By allowing only the latter amino acids we might have removed a few true, unusual sequences. For instance, tyrosine (Y) and histidine (H) at position −1 were found only in one case each in the entire eukaryotic data set. We removed eight sequences with aspartic acid (D) and eight with phenylalanine (F), seven each with glutamic acid (E) and asparagine (N), respectively. Five with methionine (M), three containing isoleucine (I) and two sequences containing tryptophan (W) at position −1 were also removed. Some of these are in fact provable errors: in one of the aspartic acid examples, CLUS_BOVIN [20], the N-terminal peptide sequencing in the paper reports the cleavage as MKTLLLLMGLLLSWESGWA---ISDKELQEMST···, while Swiss-Prot annotates the sequence as being cleaved between D and
K, thereby changing a common position −1 amino acid, alanine, into a rare one. Interestingly, SignalP predicts the cleavage site as reported in the paper.

For Gram-positive and Gram-negative bacteria, only four residues were allowed at position −1. These residues were alanine (A), glycine (G), serine (S) and threonine (T) [17,18]. For the Gram-positive data set, this approach removed four sequences containing arginines (R), three containing valines (V), two containing lysines (K) and one sequence each of glutamic acid (E), leucine (L), asparagine (N), glutamine (Q), threonine (T) and tyrosine (Y). In the Gram-negative data set, we removed two sequences containing valine (V) at position −1 and one sequence for each of the following amino acids: glutamic acid (E), lysine (K), leucine (L), asparagine (N), glutamine (Q).

Isoelectric point calculations

Previous studies have shown differences in amino acid composition between signal peptide and mature protein [21,22]. Thus, we examined to what extent the isoelectric point (pI) could be used as a unique feature of signal peptides. We calculated the pI for all signal peptides and the corresponding mature proteins in the data set and presented this in three scatter plots (Figure 1). In the scatter plot for Gram-positive bacteria two very distinct clusters appear. Only three signal peptide outliers were found and by manual inspection of the corresponding Swiss-Prot entries, we found that these proteins most likely were either not carrying signal peptides, or were annotated wrongly.

[Figure 1 (scatter plots; caption only partly recovered): ... protein, indicated by s and m, respectively. Clusters of outlier examples for bacteria are indicated on the two plots.]

These outliers, having pI values below 8, had the following Swiss-Prot IDs: CWLA_BACSP, IAA2_STRGS, COTT_BACSU. The three entries have annotated signal peptides, but it is doubtful whether the annotation is correct. According to the prediction from SignalP and PSORT, CWLA_BACSP does not carry a signal peptide. CWLA_BACSP was in the paper described as a "putative" signal peptide [23] and later it was indicated that
cwlA is part of an ancestral prophage, still remnant in the Bacillus subtilis genome [24]. All phage and virus sequences were initially removed from the SignalP training set, which could result in the negative prediction for this prophage sequence.

The cleavage site in the alpha-amylase inhibitor IAA2_STRGS turns out not to be experimentally verified. It is predicted to have a cleavage site at position 26 (SignalP) or 24 (PSORT). Calculation of pI using the SignalP predicted signal peptide length gave a new result of 8.66, closer to the average for Gram-positive bacteria. The paper proposes two other cleavage site positions, but none of these have been verified experimentally [25].

The last entry, COTT_BACSU, is a spore coat protein from B. subtilis [26,27] and no BLAST homologs in Swiss-Prot were found to contain an experimentally verified signal peptide. CotT is proteolytically processed from a 10 kD precursor protein and is localized to the spore coat where it controls the assembly. By N-terminal sequencing the N-terminus of the mature and processed protein was identified, although nowhere in the two papers is an SPase I cleavage site indicated, thus no signal peptide is mentioned [26,27]. With the current knowledge about spore coats, spore coat assembly does not involve translocation of coat protein across any membrane [28–30]. Hence, it is very unlikely for CotT to carry an N-terminal signal peptide as annotated in Swiss-Prot.

The average isoelectric point of signal peptides and mature proteins in the entire Gram-positive data set was 10.59 and 6.24, respectively. This is consistent with the fact that Gram-positive bacteria are known to have the longest signal peptides, which carry more basic residues (K/R) in the n-region than those of Gram-negatives and eukaryotes [11].

When inspecting the scatter plot for Gram-negative bacteria, we find the same overall clustering as observed for the Gram-positive bacteria, although not as distinct. Here the major group of signal peptides have pIs between 8 and 13, although the
variation is larger than in the Gram-positive scatter plot. A few sequence entries with acidic signal peptides were investigated in detail. Sequence entry SFMA_ECOLI, having a pI of 4.78, was found to be an obvious erroneous annotation in Swiss-Prot. This entry had an annotated cleavage site at position 22, but a predicted cleavage site at position 34. As seen from Figure 2 we found an internal methionine at position 12. Since the signal peptide-ness is very low until position 12 we assumed that this was an incorrectly annotated start codon. If the initial 11 amino acids until the internal methionine were removed, SignalP correctly predicted the cleavage to be at position 22 and the pI of the signal peptide increased from 4.78 to 9.99. Indeed, in release 41.0 of Swiss-Prot this entry was corrected and the signal peptide marked "POTENTIAL".

[Figure 2 plot: SignalP-NN prediction (gram- networks) for SFMA_ECOLI; C, S and Y scores (0.0 to 1.0) against sequence position (0 to 70). Sequence shown: MESINEIEGIYMKLRFISSALAAALFAATGSYAAVVDGGTIHFEGELVNAACSVNTDSADQVVTLGQYRT]

Figure 2: Alternative start codon assignment. The graphical output from SignalP strongly indicates erroneous annotation of the signal peptide from Swiss-Prot entry SFMA_ECOLI. Further investigation showed a wrong annotation of the start codon (see text for details). C, S, and Y-score indicate cleavage site, "signal peptide-ness" and combined cleavage site predictions, respectively.

For eukaryotes on the other hand, we were not able to distinguish the pI of the signal peptide and the mature protein. Eukaryotes have the shortest signal peptides and the amount of basic residues is much lower than for bacteria.

Propeptide or signal peptide?

For the eukaryotic data we examined whether annotated signal peptides could possibly include propeptides. In secreted proteins, propeptides are often found immediately downstream of the signal peptidase I cleavage site and their cleavage site is defined by a conserved set of basic amino acids. Propeptides can be hard to detect by
N-terminal Edman degradation, as the propeptides are cleaved off in the TGN before the release of the mature protein to the surroundings [31]. We used a new propeptide predictor, ProP, to predict propeptide cleavage sites [32] in the eukaryotic data set. In ten sequences we found a predicted cleavage site for a propeptide at the same position where a signal peptidase I cleavage site was annotated in Swiss-Prot. In all ten cases SignalP predicted a shorter signal peptide than annotated, thus making room for a short propeptide between the predicted signal peptide and the mature protein. The ten sequences, AMYH_SACFI, CRYP_CRYPA, FINC_RAT, GUX2_TRIRE, LIGC_TRAVE, MDLA_PENCA, RNMG_ASPRE, RNT1_ASPOR, XYN2_TRIRE and XYNA_THELA, were reassigned according to the prediction of SignalP version 2.0. This is an exceptional case where we tend to rate the computational analysis higher than experimental evidence, which must be considered weak, as the propeptide processing takes place before the proteins have been subjected to experimental, N-terminal peptide sequencing.

After the signal peptide in these cases had been reassigned, we got marginally higher correlation coefficients when retraining the neural network on the reassigned data set (data not shown).

Optimization of window sizes

As in the earlier SignalP approach, the signal peptide discrimination and the signal peptidase I cleavage site prediction were handled using two different types of neural networks [10,33].

We used a brute force approach to optimize the window sizes for the neural networks by calculating single position correlation coefficients for all possible combinations of symmetric and asymmetric windows. Using this approach we trained approximately 6500 neural networks for window optimization for a single organism group. This was furthermore done for different combinations where amino acid composition and position information was included in the input to the network or not, leading to approximately 27000 neural networks being tested in all.

For eukaryotes, these data
are shown in Figure 3. It is clear that optimal signal peptide discrimination prediction requires symmetric (or nearly symmetric) windows, whereas cleavage site training needs asymmetric windows with more positions upstream of the cleavage site included in the input to the network. The optimal window size for cleavage site prediction for the eukaryote network included 20 positions upstream and 4 positions downstream of the cleavage site. The window sizes for the Gram-positive networks were retained as previously found [10], whereas the Gram-negative cleavage site network included one more position downstream of the cleavage site, resulting in a window of 11 positions upstream and 3 positions downstream of the cleavage site. The eukaryote discrimination network performs best when using a symmetric window of 27 positions. For both Gram-positive and Gram-negative bacteria the discrimination network is based on a symmetric window of 19 positions. This brute force approach changed the optimal window sizes of the cleavage site network slightly from those used in SignalP 2.0 [10,33].

Network performance

We have evaluated the performance of SignalP version 3.0 using the same performance measures as used for the previous two versions of SignalP, see Table 1. The performance values were calculated using fivefold cross-validation, i.e. testing on sequences not present in the training set (all data split into five subsets of approximately the same size). The most significant performance increase was obtained for the cleavage site prediction as seen in Table 1. A performance increase of 6-17% for all three organism classes was obtained.

We were able to optimize the signal peptide discrimination performance by introducing a new score, termed the D-score, replacing the earlier used mean S-score quantifying the "signal peptide-ness" of a given sequence segment. In the earlier versions of SignalP the scores from the two types of networks were combined for cleavage site assignment, and not for the task of
discrimination. In the new version 3, the D-score is calculated as the average of the mean S-score and the maximal Y-score, and the two types of networks are then used for both purposes (see Material and Methods for details).

[Figure 3 plots: single position level correlation coefficients for all window-size combinations, cleavage site and discrimination networks.]

Figure 3: Window optimization. These plots show single position level correlation coefficients for all combinations of window sizes for the signal peptide cleavage and discrimination networks used for eukaryotic signal peptide prediction. The optimal window size for cleavage site for the eukaryotic network included 20 positions to the left and 4 positions to the right of the cleavage site. For reasons of computational efficiency we have selected a discrimination network with a symmetric window of 27 amino acids, although networks with larger windows have slightly higher single position level correlation coefficients.

Version          Cleavage site (Y-score), %     Discrimination (SP/non-SP), cc
                 Euk     Gram−   Gram+          Euk     Gram−   Gram+
SignalP 1 NN     70.2    79.3    67.9           0.97    0.88    0.96
SignalP 2 NN     72.4    83.4    67.4           0.97    0.90    0.96
SignalP 2 HMM    69.5    81.4    64.5           0.94    0.93    0.96
SignalP 3 NN     79.0    92.5    85.0           0.98    0.95    0.98
SignalP 3 HMM    75.7    90.2    81.6           0.94    0.94    0.98

Table 1: Performances of three different SignalP versions. The most significant improvement was for the cleavage site predictions. Cleavage site performances are presented as % and discrimination values (based on D-score) as correlation coefficients. NN and HMM indicate neural network and hidden Markov model, respectively. Results are based on five-fold cross validation for all SignalP versions.

Improvement by position information and composition features

In order to improve the performance of the neural network version of SignalP, we introduced two new features into the network input: information about the position of the sliding window as well as information on the amino acid composition of the entire sequence. This information was encoded by additional input units in the neural network.
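To make the idea of additional input units concrete, the following is a minimal sketch of how one network input vector could be assembled. It is not the SignalP implementation: the one-hot window encoding with an extra "outside the sequence" column, the normalisation of the position unit by a hypothetical `max_pos` of 70, and the function name `encode_window` are all assumptions made for illustration. Only the symmetric 27-position window is taken from the text.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_window(seq, center, left, right, max_pos=70):
    """Build one input vector for the window around `center` (0-based).

    Layout (illustrative assumption):
      - (left + right + 1) x 21 one-hot window units; column 20 marks
        positions outside the sequence or non-standard residues
      - 20 composition units (residue frequencies over the whole sequence)
      - 1 position unit (window position, scaled to [0, 1])
    """
    n = len(AMINO_ACIDS)
    window = np.zeros((left + right + 1, n + 1))
    for row, pos in enumerate(range(center - left, center + right + 1)):
        if 0 <= pos < len(seq) and seq[pos] in AA_INDEX:
            window[row, AA_INDEX[seq[pos]]] = 1.0
        else:
            window[row, n] = 1.0

    # Amino acid composition of the entire sequence.
    composition = np.array(
        [seq.count(aa) for aa in AMINO_ACIDS], dtype=float
    ) / max(len(seq), 1)

    # Position of the sliding window, capped and scaled (hypothetical scaling).
    position = np.array([min(center, max_pos) / max_pos])

    return np.concatenate([window.ravel(), composition, position])


# Symmetric window of 27 positions (13 on each side), as used for the
# eukaryote discrimination network.
vec = encode_window("MKTLLLLMGLLLSWESGWAISDKELQEMST", center=10, left=13, right=13)
print(vec.shape)
```

With this layout, dropping the composition or position units is a matter of omitting the corresponding slice, which is how the feature combinations compared in the text could be set up under the same assumptions.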
The new position information units were found to be important for both the cleavage site and discrimination networks, whereas the amino acid composition information only improved the discrimination network.

The idea of including compositional information is based on the observation that the composition of secreted and non-secreted proteins differs [21,22]. The average length of signal peptides ranges from 22 (eukaryotes) and 24 (Gram-negatives) to 32 amino acids for Gram-positives, and the new network encoding the position of the sliding window uses these averages to penalize prediction of extremely long or short signal peptides. Therefore, twin arginine signal peptides often receive a below threshold D-score as they tend to be quite long (average 37 amino acids) [34,35]. This also means that a few cases of ordinary signal peptides with extreme length are not predicted correctly by the neural networks. The structure of the HMM also penalizes long signal peptides, and similarly the SignalP 3 HMM is not able to predict these cases correctly.
One example [36] is NUC_STAAU, with a 63 amino acid long signal peptide that is not predicted correctly by any of the SignalP 3 models. SignalP 3 does not always fail to predict long signal peptides correctly, e.g. the 56 amino acids long signal peptide of CYGD_BOVIN [37] is handled correctly by the neural network version, both in terms of cleavage site and discrimination. However, great care should be taken when interpreting the scores for long potential signal peptides.

From Figure 4 the importance of the new approach where position and amino acid composition information is included can be assessed. Including information of the position of the sliding window during training increased the neural network cleavage site prediction performance slightly (left panel of the figure). Composition information did not increase the performance of the cleavage site prediction, therefore it is excluded from the left panel in Figure 4. But composition information did increase the performance of the discrimination network slightly (right panel of the figure), whereas information of the position of the sliding window together with composition increased the discrimination significantly (right panel). Another improvement of the discrimination stems from the new D-score (see Table 2). The final prediction method uses both position and composition information.

Figure 4: Improvement of the neural network by introducing length and composition features. Position of the sliding window in the neural network input increased cleavage site prediction performance slightly (left panel). Amino acid composition information together with information of the position of the sliding window improved the discrimination network significantly as seen in the right panel. The performance improvement was evaluated as single position level correlations during training on the individual networks for cleavage and discrimination, respectively.

Effect of the new discrimination score

In SignalP version 3.0 we have introduced a new discrimination score for
the neural network, termed the D-score. Based on the mean S-score and maximal Y-score it was found to give increased discriminative performance over the mean S-score, used in SignalP version 2.0. In Table 2, the D-score shows superior performance over the mean S-score for the novel part of the benchmark set defined by Menne et al. (see below).

Dataset    sensitivity    specificity    accuracy       cc
Gram−      0.94 (0.93)    0.88 (0.81)    0.95 (0.93)    0.88 (0.82)
Gram+      0.98 (0.98)    0.98 (0.98)    0.98 (0.98)    0.96 (0.95)

Table 2: D-score outperforms the mean S-score for discrimination of signal peptide versus non-signal peptide. Using the novel part of the Menne test set [12], we tested the D-score for discrimination compared to the mean S-score. The mean S-score performances are shown in parentheses.

The above mentioned 56 amino acid long signal peptide in CYGD_BOVIN is an example where the D-score leads to a correct classification, while the mean S-score is below the threshold. In this case the strong cleavage site score adds to a weaker signal peptide-ness in the C-terminal part of the leader sequence.

Performance comparison to other prediction methods

As described in a recent review of signal peptide prediction methods it is hard to find an ideal benchmark set, as methods have been frozen at different times [12]. The data used to train a method is in general "easier" than genuine test sequences that are novel to a particular method. Since we have used a more recent version of Swiss-Prot than did Menne et al. in their assessment, we have merely retained Menne set sequences that are not present in the SignalP version 3.0 training set. In this manner, we do not give an advantage to SignalP, as some of these sequences possibly have been included in the training set for other methods.

We did not test the performance of the weight matrix-based methods SigCleave or SPScan as the earlier report shows that these are outperformed by machine learning methods [12]. SigCleave is based on von Heijne's weight matrix [2] from 1986. SPScan is also based on the weight matrix from
von Heijne, but in addition to this it uses McGeoch's criteria for a minimal, acceptable signal peptide [1]. We have tested other methods which are made available, one problem being that they do not necessarily predict the same organism classes, e.g. the PSORT-B method [8] predicts only on Gram-negative data, and not on the two other SignalP organism classes.

The comparative results are given in Table 3. For the PSORT-II method [38,39], which predicts on eukaryotic sequences, the subcellular localization classes "endoplasmic reticulum (ER)", "extracellular" and "Golgi" were merged into one category of secretory proteins, whereas the rest ("cytoplasmic", "mitochondrial", "nuclear", "peroxisomal" and "vacuolar") were merged into a single "non-secretory" category. The performance reported in the paper is 57% correct for all categories. In Table 3 it can be seen that SignalP 3 outperforms PSORT-II on this particular set by a significant margin. PSORT-II does not assign cleavage sites, and we have therefore only compared the discrimination performance. We believe that the minor decrease in discrimination performance of SignalP 3 on this set, when compared to the cross-validation performance reported above in Table 1, is a result of errors in the Menne set (originating from Swiss-Prot) together with its redundancy (see below), but more importantly, the presence of transmembrane helices within the first 60 amino acids in more than 10% of the novel negative test sequences from this set (when analyzed by TMHMM [40]).

The new version of PSORT (PSORT-B) has been trained on five subcellular localization classes in Gram-negative bacteria and was reported to obtain a 97% specificity and 75% sensitivity [8]. PSORT-B was optimized for specificity over sensitivity. Another recent method, SubLoc [5], predicts three subcellular compartments for prokaryotes and four compartments […]

Data set      Method         sensitivity    specificity    accuracy    cc
Eukaryotes    SignalP3-NN    0.99           0.85           0.93        0.87
Eukaryotes    PSORT-II       0.65           0.75           0.80        0.56
Eukaryotes    SubLoc         0.58           0.70           0.77        0.47
Gram−         PSORT-B        0.99           0.64           0.75        0.58
Gram−         SubLoc         0.90           0.79           0.91        0.78
Gram+         SignalP3-NN    0.95           0.93           0.97        0.92
Gram+         PSORT          0.86           0.80           0.91        0.77
Gram+         SubLoc         0.82           0.92           0.86        0.76

Table 3: Performance measures for signal peptide discrimination. Using the novel part of the Menne et al. test set [12] we obtained the results shown in the table. Note that the values for PSORT-B are calculated on the part of the data set where PSORT-B produces a classification. Around 55% of the sequences were classified as "Unknown", and the actual performance is therefore much lower than indicated here. For a given organism class the relevant version of PSORT has been used to make the predictions and calculate the performance.
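The D-score and the four measures reported in Tables 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the published procedure: the text only says that the D-score is the average of the mean S-score and the maximal Y-score, so taking the mean S-score over the positions before the maximal Y-score is an assumption, as are the function names. Specificity is computed here as TN/(TN+FP) and cc as the Matthews correlation coefficient; papers of this period sometimes define specificity as TP/(TP+FP) instead, and the text does not say which definition was used.

```python
import math


def d_score(s_scores, y_scores):
    """Average of the mean S-score and the maximal Y-score.

    Assumption: the mean S-score is taken over the positions before the
    position with the maximal Y-score (the putative signal peptide).
    """
    cut = max(range(len(y_scores)), key=y_scores.__getitem__)
    segment = s_scores[:cut] or s_scores[:1]
    mean_s = sum(segment) / len(segment)
    return (mean_s + y_scores[cut]) / 2.0


def discrimination_measures(predicted, actual):
    """Sensitivity, specificity, accuracy and correlation coefficient
    from two equally long lists of booleans (True = signal peptide)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0  # assumed definition
    accuracy = (tp + tn) / len(actual) if actual else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, cc


# Toy per-position scores for one sequence (made-up numbers).
s = [0.9, 0.8, 0.7, 0.2, 0.1]
y = [0.1, 0.2, 0.3, 0.8, 0.1]
print(d_score(s, y))

# Toy predictions against annotations for eight sequences.
pred = [True, True, True, False, False, False, False, True]
true = [True, True, False, False, False, False, True, True]
print(discrimination_measures(pred, true))
```

A sequence would be called a signal peptide when its D-score exceeds a cutoff; the cutoff values are not given in this section, so none is hard-coded here.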

PrediSi prediction of signal peptides and their cleavage positions


PrediSi: prediction of signal peptides and their cleavage positions

Karsten Hiller 1, Andreas Grote 1, Maurice Scheer 1,2, Richard Münch 1 and Dieter Jahn 1,*

1 Institut für Mikrobiologie, Technische Universität Braunschweig, Spielmannstrasse 7, D-38106 Braunschweig, Germany and 2 Fachbereich für Informatik, Fachhochschule Wolfenbüttel, Am Exer, D-38302 Wolfenbüttel, Germany

Received February 13, 2004; Revised and Accepted March 15, 2004

ABSTRACT

We have developed PrediSi (Prediction of Signal peptides), a new tool for predicting signal peptide sequences and their cleavage positions in bacterial and eukaryotic amino acid sequences. In contrast to previous prediction tools, our new software is especially useful for the analysis of large datasets in real time with high accuracy. PrediSi allows the evaluation of whole proteome datasets, which are currently accumulating as a result of numerous genome projects and proteomics experiments. The method employed is based on a position weight matrix approach improved by a frequency correction which takes into consideration the amino acid bias present in proteins. The software was trained using sequences extracted from the most recent version of the SwissProt database. PrediSi is accessible via a web interface. An extra Java package was designed for the integration of PrediSi into other software projects. The tool is freely available on the World Wide Web at http://www.predisi.de.

INTRODUCTION

Signal peptides direct proteins to their proper cellular and extracellular locations (1). One major example of such a process is the translocation of proteins across the cytoplasmic membrane via the well-established sec pathway found in both eukaryotic and prokaryotic cells (2). In this secretory pathway, proteins designated for export from the cell are labeled by an N-terminal signal sequence. This signal sequence directs its protein to the secretion apparatus. After translocation of the protein across the cell membrane, the N-terminal signal peptide is usually
cleaved off by an extracellular signal peptidase. Signal peptides for the sec pathway generally consist of the following three domains: (i) a positively charged n-region, (ii) a hydrophobic h-region and (iii) an uncharged but polar c-region. The cleavage site for the signal peptidase is located in the c-region (3). However, the degree of signal sequence conservation and length, as well as the cleavage site position, varies significantly between different proteins. Moreover, major differences were observed between eukaryotic and bacterial signal sequences.

For various purposes it is desirable to identify signal peptides and their corresponding cleavage positions. For the calculation of sequence length-dependent features such as the molecular weight and the isoelectric point of a protein, the presence or absence of a signal peptide leads to considerably different results. We used the SignalP (4,5) signal peptide prediction tool in combination with the proteomics software JVirGel (6) to improve the calculation of virtual two-dimensional (2D) protein gels with respect to the position of protein spots. However, the resulting application was time consuming and limited to 10 requests of up to 2000 sequences per day using SignalP's free version via the Internet (http://www.cbs.dtu.dk/services/SignalP-2.0/). Moreover, most of the existing prediction tools for the analysis of huge datasets such as whole proteomes are either based on old training datasets or not freely accessible. Finally, a recent evaluation of signal peptide prediction programs revealed that the majority of available tools do not meet today's standards of performance and compatibility (7).

Therefore, we set out to develop a new piece of software including the following features: (i) accurate and fast prediction of signal peptides and their corresponding cleavage positions, (ii) a user-friendly web interface, freely available on the World Wide Web for the analysis of unlimited datasets, (iii) presentation of the results in user- as well as
computer-friendly formats such as HTML, XML and CSV and (iv) free availability as a Java package for integration into other software projects.

*To whom correspondence should be addressed. Tel: +49 531 391 5801; Fax: +49 531 391 5854; Email: d.jahn@tu-bs.de

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated.

© 2004, the authors. Nucleic Acids Research, Vol. 32, Web Server issue © Oxford University Press 2004; all rights reserved.

Nucleic Acids Research, 2004, Vol. 32, Web Server issue W375-W379. DOI: 10.1093/nar/gkh378

SYSTEM AND METHODS

Dataset of secreted proteins with experimentally determined cleavage positions

For the generation of the position weight matrices (PWMs) of PrediSi, datasets of secreted proteins with experimentally determined cleavage positions were constructed. Three different datasets were employed: one set for eukaryotes, one for Gram-negative and one for Gram-positive bacteria. Amino acid sequences with annotated signal peptides were extracted from the XML version of SwissProt release 42.9 (8). All proteins denoted as 'fragments', 'putative', 'found by similarity', 'probable' or with similar descriptions were removed. Furthermore, all proteins from organelles were excluded.
From the prokaryotic datasets, signal peptides which are subject to signal peptidase II cleavage were excluded. The training datasets were aligned according to the annotated experimentally determined cleavage position of each sequence.

In parallel, we constructed control datasets of cytoplasmic and nuclear proteins which are clearly devoid of secretory signal peptides for the sec pathway. For this purpose amino acid sequences of proteins with determined appropriate cellular location were extracted from the SwissProt database. Sequences consisting of protein fragments shorter than 70 amino acids or indicated with comments such as 'potential' or 'probable' were excluded.

Identical sequences with regard to the initial 100 N-terminal amino acids were eliminated from all datasets. Integration of similar amino acid sequences that differ only in a few amino acids increased the performance of the self-consistency test. All generated datasets are available for download and as supplementary information (http://www.predisi.de/download.html).
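The redundancy filter described above (dropping sequences that share the same first 100 N-terminal residues) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `(name, sequence)` record layout and the choice to keep the first record seen are all assumptions.

```python
def deduplicate_by_n_terminus(records, prefix_len=100):
    """Keep one record per distinct N-terminal prefix.

    records: iterable of (name, sequence) tuples (hypothetical layout).
    prefix_len: number of N-terminal residues compared (100 in the text).
    """
    seen = set()
    kept = []
    for name, seq in records:
        key = seq[:prefix_len].upper()
        if key not in seen:  # first record with this prefix wins (assumption)
            seen.add(key)
            kept.append((name, seq))
    return kept


example = [
    ("a", "MKK" + "A" * 100),
    ("b", "MKK" + "A" * 100 + "TAIL"),  # same first 100 residues as "a"
    ("c", "MKR" + "A" * 100),
]
unique = deduplicate_by_n_terminus(example)
```

Sequences that differ only beyond position 100 collapse into one entry, while sequences differing within the prefix are all kept, which matches the statement that similar but non-identical sequences remained in the datasets.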
The resulting training datasets consist of 2783 amino acid sequences from eukaryotes, 557 sequences from Gram-negative bacteria and 236 sequences from Gram-positive bacteria. The control datasets consist of 5547 amino acid sequences from eukaryotes, 2013 sequences from Gram-negative bacteria and 1077 sequences from Gram-positive bacteria.

Algorithms

The algorithm employed is based on a position weight matrix approach. We generated three different frequency matrices built on the constructed and aligned datasets described above. The position weight matrices are based on the amino acid frequency of parts of the signal sequences in addition to up to four amino acid residues from the N-terminus. We estimated the optimal size of the PWMs by calculating the accuracy of all meaningful combinations. Before calculating the score, we applied a frequency correction to adjust the amino acid bias present in proteins (9). The score was calculated according to Equation 1. We simplified the frequency correction by determining the amino acid distribution within only one group of organisms (eukaryotes, Gram-negative, Gram-positive bacteria). The group-specific amino acid composition was estimated via calculating the amino acid frequency of all the proteins in the corresponding control dataset.

$$S = \sum_{i=1}^{I_{\mathrm{PWM}}} \log\left(P_i \, \frac{P_{\mathrm{ideal}}}{P_{\mathrm{obs}}}\right) \qquad (1)$$

where $S$ is the score, $P_i$ is the observed amino acid frequency at position $i$ of the PWM, $P_{\mathrm{ideal}}$ is set to 0.05 (statistical ideal amino acid frequency) and $P_{\mathrm{obs}}$ is the observed frequency of that amino acid in the corresponding control dataset.
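To make Equation 1 concrete, the following sketch scores one sequence window against a position frequency matrix with the background correction. It is an illustration under stated assumptions, not the PrediSi implementation: the function names are hypothetical, and the add-one pseudocounts (used so that no logarithm of zero occurs) are not described in the text.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
P_IDEAL = 0.05  # "statistical ideal" frequency from the text (1/20)


def position_frequencies(aligned, pseudocount=1.0):
    """Per-position amino acid frequencies of equally long, aligned windows.

    The pseudocount is an assumption added to avoid zero frequencies.
    """
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for i in range(len(aligned[0])):
        counts = Counter(seq[i] for seq in aligned)
        matrix.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return matrix


def background_frequencies(control_seqs, pseudocount=1.0):
    """Overall amino acid frequencies in the control set (P_obs)."""
    counts = Counter(aa for seq in control_seqs for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def window_score(window, matrix, background):
    """S = sum over positions of log(P_i * P_ideal / P_obs)."""
    return sum(
        math.log(matrix[i][aa] * P_IDEAL / background[aa])
        for i, aa in enumerate(window)
    )


matrix = position_frequencies(["AAL", "AAL", "AVL"])
background = background_frequencies(["ACDEFGHIKLMNPQRSTVWY"])  # uniform toy control set
score_match = window_score("AAL", matrix, background)
score_mismatch = window_score("WWW", matrix, background)
```

With a uniform background the correction factor equals 1 and the score reduces to the sum of log frequencies; a background enriched in a given residue lowers that residue's contribution, which is the bias adjustment the text describes.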
Web interface

The main program for signal peptide prediction was written in Java to take advantage of its object-oriented technology and to allow integration of its output into dynamic web sites using Java Server Page (JSP) technology. Using this strategy it was possible to smoothly combine and reuse the Java classes with JSP. Jakarta Tomcat was chosen as the servlet container and web server (stable release, version 4.1.29). It is the official reference implementation of the Java Servlet (version 2.3) and JSP (version 1.2) technologies and is available as an open source tool (http:// /tomcat). Besides these Java packages the javax.servlet was employed for Tomcat JSP core functionality, and mons.fileupload was used for uploading input files. The web server runs on a personal computer (1.8 GHz CPU, 512 MB working memory) with Linux as the operating system (SuSE 9.0, Kernel 2.4.20).

Use of the web interface

The web interface allows the user to easily search a list of sequences (provided in FASTA format) for the presence of potential signal peptides. There are two ways to submit this input list: either pasting the list into the query field or transferring it as a file upload. The user has the option of setting several parameters manually. First, a PWM is selected by taking the organism-specific background into account. For that purpose three matrices for the analysis of sequences from eukaryotes, Gram-positive and Gram-negative bacteria are offered. Second, the user can define the maximal length of the signal peptide. Biologically meaningful values for this parameter lie between 60 and 100 amino acid residues. The default parameter is a length of 70 amino acids. Third, the output format is selectable. Depending on the need for further processing of the resulting data, the user can choose between an HTML table, an easily parseable CSV (comma-separated values) file to port the data to Excel and related applications, and XML format. The output can be shown in the web browser or saved as a file on the local machine. Output parameters
given are the overall estimation of whether the investigated amino acid sequence possesses a signal peptide (Y/N), the underlying score and the putative signal peptidase cleavage position.

RESULTS AND DISCUSSION

The prediction of signal peptides has become an important application of genomics and proteomics investigations. SignalP is currently the most efficient and widely used tool for this task. A comparison of most available software in this field underscored the unique performance of this program (7). However, non-commercial utilization of SignalP via the Internet is limited to 10 requests of up to 2000 sequences per day. Response to such requests takes several minutes. This means that SignalP is not suited to fast whole proteome analysis approaches. Finally, the program is not available as public domain software for integration into other software projects. Therefore, we decided to implement an alternative efficient prediction tool which meets the described criteria. The algorithm employed also represents an alternative approach to the neural network and Hidden Markov solutions implemented by SignalP. The fidelity of the employed method was significantly improved by the introduction of a frequency correction in order to adjust the amino acid bias as described by Schreiber and Brown (9).

To check the accuracy of PrediSi, we performed a self-consistency test. For this purpose we constructed three test datasets containing proteins carrying signal peptides: for eukaryotes, Gram-negative and Gram-positive bacteria. The test datasets consist of all the amino acid sequences from a training dataset extracted from SwissProt and the same number of randomly chosen amino acid sequences without signal peptides from a corresponding control set. We compared the results obtained with the accuracy of SignalP (Table 1).
Predictions were only considered as correct if both the existence and the cleavage position of the signal peptide were predicted correctly. The results of the analysis showed that PrediSi was slightly less accurate in the prediction of eukaryotic and Gram-negative signal sequences [85.49% PrediSi versus 90.66% SignalP-Neural Network (NN) and 88.24% SignalP-Hidden Markov Model (HMM); 91.12% versus 91.39% NN and 93.09% HMM, respectively] but slightly better at predicting Gram-positive signal peptides (88.14% versus 85.61% NN and 87.29% HMM) (Table 1). Interestingly, if we allowed a tolerance of two positions between the cleavage position, the accuracy of returning the correct cleavage position increased significantly. Probably some of these falsely predicted cleavage positions are due to database errors as mentioned before (10). PrediSi provides a normalized score on a scale between 0 and 1. A score greater than 0.5 means that the examined sequence very likely contains a signal peptide. The advantage of this user-friendly score is that it is comparable between different weight matrices.

Figure 1. Sequence logos based on the aligned amino acid sequences of signal peptides. The signal peptide is cleaved off between position -1 and 0. (A) Gram-negative bacteria, (B) Gram-positive bacteria, (C) eukaryotes. Shaded area represents PWM region.

The optimal PWM size differs between the three examined groups of organisms. The optimal size for the eukaryotic PWM is -16/+4 (with the cleavage position between positions -1 and +1), for Gram-negatives -16/+2 and for Gram-positives -21/+1. Figure 1 depicts sequence logos (11) of signal peptides for the three different groups. The estimated matrix size correlates well with the information content of the observed sequences. Agreeing with earlier analysis, signal peptides of Gram-positives are larger than those of other organisms (12).
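As a quick consistency check on the accuracies quoted from Table 1, the sketch below recomputes each "Overall" value as the unweighted mean of the "Positive" and "Control" accuracies. That reading is an assumption; it fits the statement that each test set contains equal numbers of positive and control sequences, and the numbers agree to within rounding. The dictionary layout and function name are hypothetical.

```python
# (positive, control, overall) accuracies in percent, copied from Table 1.
TABLE1 = {
    ("PrediSi", "Eukarya"): (72.66, 98.31, 85.49),
    ("PrediSi", "Gram-positive"): (78.39, 97.89, 88.14),
    ("PrediSi", "Gram-negative"): (86.54, 95.70, 91.12),
    ("NN (SignalP)", "Eukarya"): (82.11, 99.21, 90.66),
    ("NN (SignalP)", "Gram-positive"): (77.97, 93.25, 85.61),
    ("NN (SignalP)", "Gram-negative"): (86.54, 96.24, 91.39),
    ("HMM (SignalP)", "Eukarya"): (78.73, 97.74, 88.24),
    ("HMM (SignalP)", "Gram-positive"): (75.42, 99.16, 87.29),
    ("HMM (SignalP)", "Gram-negative"): (87.07, 99.10, 93.09),
}


def overall_accuracy(positive, control):
    """Unweighted mean, valid when both subsets have the same size."""
    return (positive + control) / 2


# Largest gap between the recomputed and the reported overall value.
max_deviation = max(
    abs(overall_accuracy(pos, ctl) - reported)
    for pos, ctl, reported in TABLE1.values()
)
```

The largest deviation stays at about half a hundredth of a percentage point, which is what two-decimal rounding of the published values would produce.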
In summary, accuracy of prediction with PrediSi is similar to that with SignalP.

The use of a very fast algorithm for the prediction of the signal peptides enables our web interface to finish the necessary calculations nearly in real time. For example, the analysis of 20000 eukaryotic sequences takes only about 10 s and is, therefore, limited only by the data transfer via the Internet. To our knowledge, this is the fastest public method available for predicting signal peptides. Using PrediSi it is not necessary to deliver the results by email or to install queues, because the results are directly presented in the web browser (Figure 2). Other methods such as Markovian models and neural networks need much more calculation time to perform such a task.

ACKNOWLEDGEMENTS

We would like to thank Dr Barbara Schulz for critical proofreading of the manuscript. This work was funded by the German Bundesministerium für Bildung und Forschung (BMBF) for the Bioinformatics Competence Center 'Intergenomics' (Grant No. 031U110A/031U210A).
Table 1. Statistical examination of the accuracy of the different models promoted by SignalP and the accuracy of the new weight matrix approach

Dataset | Eukarya: Positive | Control | Overall | Gram-positive: Positive | Control | Overall | Gram-negative: Positive | Control | Overall
PrediSi | 72.66 | 98.31 | 85.49 | 78.39 | 97.89 | 88.14 | 86.54 | 95.7 | 91.12
NN (SignalP) | 82.11 | 99.21 | 90.66 | 77.97 | 93.25 | 85.61 | 86.54 | 96.24 | 91.39
HMM (SignalP) | 78.73 | 97.74 | 88.24 | 75.42 | 99.16 | 87.29 | 87.07 | 99.1 | 93.09

Scores for the various predictions are given separately for Gram-positive bacteria, Gram-negative bacteria and eukaryotes. The values provided are the percentage of correctly identified signal peptides including the correct positions of their cleavage site. The positive dataset consists of proteins carrying signal peptides; the control consists of proteins without signal peptides. The overall score combines the obtained values for the positive and control datasets.

Figure 2. Screenshot of the PrediSi web interface.

REFERENCES

1. Zheng,N. and Gierasch,L.M. (1996) Signal sequences: the same yet different. Cell, 86, 849-852.
2. Rapoport,T.A., Jungnickel,B. and Kutay,U. (1996) Protein transport across the eukaryotic endoplasmic reticulum and bacterial inner membranes. Annu. Rev. Biochem., 65, 271-303.
3. von Heijne,G. (1985) Signal sequences. The limits of variation. J. Mol. Biol., 184, 99-105.
4. Nielsen,H. and Krogh,A. (1998) Prediction of signal peptides and signal anchors by a hidden Markov model. Proc. Int. Conf. Intell. Syst. Mol. Biol., 6, 122-130.
5. Nielsen,H., Engelbrecht,J., Brunak,S. and von Heijne,G. (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng., 10, 1-6.
6. Hiller,K., Schobert,M., Hundertmark,C., Jahn,D. and Münch,R. (2003) JVirGel: calculation of virtual two-dimensional protein gels. Nucleic Acids Res., 31, 3862-3865.
7. Menne,K.M.L., Hermjakob,H. and Apweiler,R. (2000) A comparison of signal sequence prediction methods using a test set of signal peptides. Bioinformatics, 16, 741-742.
8. Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.-C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O'Donovan,C., Phan,I. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365-370.
9. Schreiber,M. and Brown,C. (2002) Compensation for nucleotide bias in a genome by representation as a discrete channel with noise. Bioinformatics, 18, 507-512.
10. Nielsen,H., Engelbrecht,J., von Heijne,G. and Brunak,S. (1996) Defining a similarity threshold for a functional protein sequence pattern: the signal peptide cleavage site. Proteins, 24, 165-177.
11. Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18, 6097-6100.
12. Tjalsma,H., Bolhuis,A., Jongbloed,J.D., Bron,S. and van Dijl,J.M. (2000) Signal peptide-dependent protein transport in Bacillus subtilis: a genome-based survey of the secretome. Microbiol. Mol. Biol. Rev., 64, 515-547.

Bioinformatics analysis of the protein encoded by the Mycobacterium avium MAV2928 gene


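Journal of Pathogen Biology (中国病原生物学杂志), Jan. 2021, Vol. 16, No. 1, p. 17. DOI: 10.13350/j.cjpb.210104. Original article.

Bioinformatics analysis of the protein encoded by the Mycobacterium avium MAV_2928 gene

陈晓文, 陈越, 高婧华, 吴利先 (Department of Microbiology and Immunology, School of Basic Medicine, Dali University, Dali, Yunnan 671000, China)

[Abstract] Objective: To analyze the structure and function of the Mycobacterium avium PPE25-MAV protein using bioinformatics methods.

Methods: The amino acid sequence of the protein encoded by the MAV_2928 gene was obtained from the NCBI database. ProtParam and ProtScale were used to analyze the physicochemical properties and the hydrophilicity/hydrophobicity of the encoded PPE25-MAV protein. SignalP 4.1 Server, TMHMM Server v.2.0 and PSORT Prediction were used to predict the signal peptide, transmembrane regions and subcellular localization of PPE25-MAV, respectively. NetPhos 3.1 Server was used to predict phosphorylation sites, and NCBI Conserved Domains was used to analyze the conserved domain structure. The secondary structure of PPE25-MAV was analyzed online with SOPMA, and a tertiary structure model was built with SWISS-MODEL.

ABCpred and IEDB were used to predict the B-cell epitopes of the protein.

Results: The MAV_2928 gene is 1266 bp long and encodes the PPE25-MAV protein of 421 amino acids, an unstable hydrophobic protein with an aliphatic index of 72.66, no signal peptide and no transmembrane region, localized in the cytoplasm.

The protein contains 52 phosphorylation sites and one conserved domain belonging to the PPE superfamily; its secondary structure is dominated by random coil and is relatively loose.

The protein is predicted to contain 7 dominant B-cell epitopes and 24 Th-cell epitopes.

Conclusion: The PPE25-MAV protein contains multiple phosphorylation sites, participates in cellular signal transduction, and contains 7 dominant B-cell epitopes, which provides a theoretical basis for its use as a candidate vaccine antigen.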

Bioinformatics analysis of the structure and function of chicken Myf5 protein

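[Result] The protein was composed of 258 amino acids with the molecular formula C1221H1915N359O391S18. There were 34 negatively charged amino acid residues (Asp+Glu) and 29 positively charged amino acid residues (Arg+Lys). The theoretical isoelectric point (pI) was 5.86,
The instability index was 97.59 (>40), indicating that the protein structure is highly unstable; the grand average of hydropathicity (GRAVY) was -0.701 and the aliphatic index was 54.92, indicating that chicken Myf5 protein is a hydrophilic protein.
2.2 Prediction of the secondary and tertiary structure of chicken Myf5 protein. The SOPMA online tool was used to predict the secondary structure of chicken Myf5 protein. The results showed that chicken Myf5 pro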
Amino acid: Pro (P), Ser (S), Thr (T), Trp (W), Tyr (Y), Val (V), Pyl (O), Sec (U)
Number: 21, 21, 5, 9, 11, 9, 25, 15, 7, 5, 20, 8, 7, 7, 28, 32, 11, 2, 7, 8, 0, 0
Frequency (%): 8.1, 8.1, 1.9, 3.5, 4.3, 3.5, 9.7, 5.8, 2.7
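The residue counts and percentages listed above can be cross-checked against the figures in the abstract. The sketch below assumes that the 22 counts follow the alphabetical three-letter order that ProtParam uses in its composition table (Ala, Arg, Asn, ... Val, Pyl, Sec). That ordering is an assumption, since only the names from Pro onward survive in the text, but it reproduces the 258 residues, the 34 Asp+Glu and 29 Arg+Lys residues, and the 18 sulfur atoms (Cys+Met) of the reported formula.

```python
# Assumed residue order (ProtParam-style alphabetical three-letter codes).
ORDER = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile", "Leu",
    "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val", "Pyl", "Sec",
]
# Counts exactly as they appear in the listing above.
COUNTS = [21, 21, 5, 9, 11, 9, 25, 15, 7, 5, 20, 8, 7, 7, 28, 32, 11, 2, 7, 8, 0, 0]

composition = dict(zip(ORDER, COUNTS))
total_residues = sum(COUNTS)


def frequency_percent(residue):
    """Share of one residue type in percent, rounded to one decimal."""
    return round(100 * composition[residue] / total_residues, 1)


negatively_charged = composition["Asp"] + composition["Glu"]
positively_charged = composition["Arg"] + composition["Lys"]
sulfur_atoms = composition["Cys"] + composition["Met"]  # one S per Cys or Met
first_frequencies = [frequency_percent(r) for r in ORDER[:9]]
```

Under this ordering the first nine computed percentages match the nine frequency values that survive in the listing.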

7. Analysis of fault diagnosis expert systems

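Fault diagnosis researchers in China have also actively explored the application of expert systems. During the Seventh and Eighth Five-Year Plan periods the state listed key research projects in this area, and some progress was made. Overall, however, most of the work so far is laboratory research; practical applications under field conditions, and successful ones in particular, remain rare.
Fault Diagnosis Expert System
Artificial neural networks
I. Overview
1. Definition and characteristics 2. Current applications
[Figure: artificial neuron diagram; surviving labels: input x1, weight w1, index i]
II. Basic principles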
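3) Production (rule) representation
Its general form is P → Q (i.e. IF … THEN …).
The left-hand side represents the premise (a condition or state); the right-hand side represents one or more conclusions.
For example: if abnormal vibration occurs, then the amplitude is large. Complex faults are represented as a tree.
[Figure: fault tree. Root: large vibration peak. Vibration types: fundamental-frequency vibration, low-frequency vibration, double-frequency vibration, broadband harmonic vibration. Causes: imbalance, thermal bow, oil whirl, support problems, shaft crack, misalignment, rubbing, oil whip.]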
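The IF … THEN production rules described above can be illustrated with a small forward-chaining sketch. Only the first rule (abnormal vibration implies large amplitude) comes from the text; the second rule, all fact names and the function name are hypothetical and serve only to show how rules chain.

```python
# Each rule is (set of premises, conclusion), i.e. IF premises THEN conclusion.
RULES = [
    ({"abnormal_vibration"}, "large_amplitude"),  # example from the text
    # Hypothetical rule, for illustration only:
    ({"large_amplitude", "fundamental_frequency_dominant"}, "suspect_imbalance"),
]


def forward_chain(facts, rules):
    """Apply rules repeatedly until no new conclusion can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


derived = forward_chain({"abnormal_vibration", "fundamental_frequency_dominant"}, RULES)
```

Keeping rules as plain data separate from the inference loop mirrors the split between knowledge base and inference mechanism that the slides describe.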
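IV. Inference mechanism
1. Types of inference 2. Inference control strategies 3. Inference search strategies 4. Plausible reasoning
V. Applications
Starting with the development of the turbine-generator expert system GenAID, Westinghouse (USA) has set up an automatic diagnosis center at its power generation headquarters in Orlando, Florida, which performs remote automatic diagnosis of Westinghouse-built turbine generators at various sites. The diagnostic targets have gradually expanded from turbine generators to steam turbines, boilers and auxiliary equipment. Westinghouse and Carnegie Mellon University jointly developed an expert system for turbine-generator monitoring, used to monitor around the clock the operating condition of seven turbine-generator units at three major power plants in Texas. This expert system can quickly and accurately analyze the signals sent by the instruments and then immediately tell the operators what measures to take.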
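II. Knowledge base
1. Definition: a store of expert knowledge, experience and textbook knowledge
2. Knowledge representation
1) Basic requirements for knowledge representation (three basic requirements): (1) the representation scheme should make it easy to modify and extend the knowledge; (2) the representation scheme should be as simple and understandable as possible; (3) the representation method should be clear and unambiguous, because the construction of an expert system is a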

Bioinformatics analysis of the Pseudomonas aeruginosa YfiB protein


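12 restricted CTL epitopes and 6 restricted Th epitopes, and 10 proteins may interact with it. Conclusion: Bioinformatics methods predict that the YfiB protein has multiple B-cell and T-cell epitopes, which may provide a reference for studying the functional mechanism by which the Pseudomonas aeruginosa YfiBNR signaling system promotes biofilm formation.

Materials and Methods

1 Materials
The YfiB gene (PA1119, Gene ID: 881938) was retrieved from the NCBI GenBank database, together with its protein sequence (accession number NP_249810.1), its coding gene sequence (accession number NC_002516.2), and the complete genome of Pseudomonas aeruginosa PAO1 and other related information.

2 Methods
2.1 YfiB gene sequence analysis. The YfiB gene sequence and its sequence information were obtained from the NCBI GenBank database, and the gene prediction software ORF Finder was used to search the sequence for ORF (open reading frame) regions.
2.2 Basic physicochemical properties of the YfiB protein. The protein analysis software ProtParam was used to predict the molecular formula, amino acid composition, pI, instability index and other parameters of the protein; SOSUI and ExPASy ProtScale were used to analyze the solubility and the hydrophilicity/hydrophobicity of YfiB.
2.3 YfiB signal peptide. The amino acid sequence of YfiB was entered into SignalP 4.1 Server with the "Gram-negative bacteria" option selected and all other options left at their defaults. TargetP-2.0 Server was used to predict the signal peptide sequence and its cleavage.
2.4 Transmembrane regions, phosphorylation sites and conserved domains. The YfiB protein was submitted to TMHMM Server v.2.0 to obtain the predicted transmembrane regions, and NetPhos 3.1 Server was used to identify post-translational phosphorylation sites of YfiB. NCBI BLAST was used to predict the domains of the YfiB protein.
2.5 YfiB protein structure. NPS: SOPMA and COILS were used to predict the secondary structure of YfiB. SWISS-MODEL was used to predict the tertiary structure of YfiB and to build a model.
2.6 YfiB antigenic epitopes. ABCpred and SYFPEITHI were used to predict the antigenic epitopes that may be present in the Pseudomonas aeruginosa YfiB protein.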
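The open reading frame search mentioned in step 2.1 can be illustrated with a minimal sketch. It is not the algorithm of NCBI ORF Finder: it scans only the three forward frames, accepts only ATG as the start codon and uses the standard stop codons, and the function name and `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ATG...stop ORFs on the forward strand.

    start is 0-based, end is exclusive, and the stop codon is included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


orfs = find_orfs("CCATGAAATAGGG")
```

A fuller search would also scan the reverse complement and allow alternative bacterial start codons, which this sketch leaves out for brevity.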

nar_34_1_1__1


Explanatory Notes for Supplementary Annotation Table

Each field is populated with information either generated or collected by the workshop participants. Multiple entries in a field are separated by the symbol "-!-" to assist in parsing. The term "predicted" was included for gene products whose function was not experimentally verified.

Gene Nomenclature and Identifiers

Genes are identified by their feature (CDS or RNA). Pseudogenes are identified in the comment fields for the two strains. Some pseudogenes are fragmented by inserts, others are frameshifts. Gene names are given both in conventional Demerec format ((1); Gene column) and in a format not restricted to the Demerec format (Locus Name column). Locus names for 2- and 3-part pseudogene fragments are given extensions of "_1", "_2", numbering from the N-terminal to the C-terminal end. Multiple copies of IS proteins are given locus names with "-1", "-2" extensions based on their location on the chromosome relative to the first instance of the specific IS protein. Synonyms of gene names found in the literature and collected from several database sources are provided.

Loci of MG1655 and W3110 are described in terms of their gene boundaries (left end, right end) and direction of transcription (clockwise (+) or counterclockwise (-)). The boundary is defined as the nucleotide number of the start/end codon of a transcript, pseudogene (fragment) or functional RNA. The start and end positions between MG1655 and W3110 differ due to a difference in the start position and inversion, insertion or deletion of regions. Locus tags specific to MG1655 (b numbers) and W3110 (JW numbers) are listed. For MG1655, locus tags have been assigned to 21 entities representing fused pseudogene fragments (ancestral version of the gene). Locus tags were not assigned for the fused pseudogene fragments in W3110.

ECK (E. coli K-12) numbers are identifiers assigned to E. coli K-12 genes by the workshop participants.
ECK numbers are given to unique CDSs, RNAs and pseudogenes. Individual fragments of divided pseudogenes are given the same ECK identifier. The fused pseudogene fragments were assigned the same ECK identifier as the corresponding fragments. One ECK number is used for multiple copies of an IS protein, resulting in a one-to-many mapping for these CDSs. This "one to many" nomenclature is limited to mobile elements and does not include ribosomal RNA genes. The ECK identifiers are numbered sequentially in the order of the MG1655 map beginning with thrL.

Gene Product Type

Assignment of the type of gene product was attempted. Clearly the major types of proteins in E. coli are enzymes followed by transporters and regulators. To tally the relative proportions that occupy the genome, gene products were labeled according to type. Assignments are often difficult because of the complexity of biology. A few new categories were added (see Table 2 for complete list), but the most difficult assignments concerned gene products that could be described correctly with more than one function. Examples of such complexities include the phosphotransferases of the PTS system (enzymes or transporters), the sigma factors (factors or regulators), DNA polymerase (enzyme or cell process protein), and flagellae (structure of the cell or cell process protein). Complex enzyme subunits such as the four types in the succinate dehydrogenase enzyme could all be labeled as the dehydrogenase enzyme, as is current practice. On the other hand, each subunit can be labeled more accurately according to its essential character, for instance as an inner membrane subunit or an electron transport subunit (carrier). Carrying these properties forward would be more useful to annotation of unknown genes in other organisms.
We have made an effort to make assignments that reflect the nature of the individual gene product when it is a part of a larger complex.

Gene Product Descriptions, Comments, Evidence

The assignment of gene product description occupied a major fraction of time and effort by the workshop participants. Starting with existing descriptions from databases and web sites offering full genome predictions, groups of 1, 2 and 3 participants reexamined data, checked for new information in the literature, new sequence matches, and new kinds of sequence analysis. Web sites new and old were consulted. Whether the product description was derived experimentally or by computation was noted as a measure of the reliability of the assignment. The gene product descriptions were kept succinct. Additional remarks were lodged in the associated comment field.

An attempt was made towards describing the gene products in a uniform format. Enzymes were described by their common name and information on cofactor requirement was included where available. Enzyme complexes were described by the name of the enzyme complex followed by the name of the subunit itself (b0784, MoaD; molybdopterin synthase, small subunit). Some enzymes encode multiple functions either as a result of gene fusion events or as a result of multiple activities encoded at the same site of the protein. The term "fused" was included for the fused proteins and their activities were listed using "-!-" to separate the activities (b0002, ThrA; fused aspartokinase I -!- homoserine dehydrogenase I). Other enzymes with more than one function were listed as bifunctional (b0025, RibF; bifunctional riboflavin kinase -!- FAD synthetase) or as multifunctional (b0494, TesA; multifunctional acyl-CoA thioesterase I -!- protease I -!- lysophospholipase L1). Transport proteins were listed with the substrate transported (b0336, CodB; cytosine transporter).
For the ABC superfamily transport complexes the substrate, complex, and subunit information was listed (b0199, MetN; DL-methionine transporter subunit -!- ATP-binding component of ABC superfamily). Transcriptional regulators were described as DNA-binding transcriptional repressor, activator, dual regulator (may act as both activator and repressor), or regulator (not known whether repressor or activator). A uniform format was given to the two-component regulatory systems for the response regulators (b0620, CitB; DNA-binding response regulator in two-component regulatory system with CitA), sensory histidine kinases (b0619, CitA; sensory histidine kinase in two-component regulatory system with CitB) and for the fused two-component regulators (b2218, RcsC; hybrid sensory kinase in two-component system with RcsB and YojN).

For genes encoded in the 10 cryptic prophages or prophage-like elements (9 in W3110 due to lack of CPZ-55), the name of the prophage was listed following the name of the gene product (b0246, YafW; CP4-6 prophage; antitoxin of the YkfI-YafW toxin-antitoxin system). Gene product descriptions for the tRNAs included information on their anticodon (b0536, ArgU; tRNA-Arg(UCU) (Arginine tRNA4)). For pseudogenes, the term "(pseudogene)" was included in the gene product description as well as "N-ter fragment", "middle fragment" or "C-ter fragment" for fragmented pseudogenes.

Gene product predictions were based on the data collected from several E. coli databases and from specialized databases listed in the text (i.e., transmembrane helix predictions, protein family and protein domain predictions, sequence similar homologs, etc.).
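Because multi-valued fields use "-!-" as the separator, a parser only needs to split on that token and trim whitespace. The sketch below is a minimal illustration; the function name is hypothetical, and the example strings are the product descriptions quoted above.

```python
SEPARATOR = "-!-"  # multi-entry separator defined in the explanatory notes


def split_entries(field):
    """Split one table field into its individual entries."""
    return [part.strip() for part in field.split(SEPARATOR) if part.strip()]


thr_a = split_entries("fused aspartokinase I -!- homoserine dehydrogenase I")
tes_a = split_entries(
    "multifunctional acyl-CoA thioesterase I -!- protease I -!- lysophospholipase L1"
)
cod_b = split_entries("cytosine transporter")
```

Fields without the separator come back as a single-element list, so downstream code can treat every field uniformly as a list of entries.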
Gene products whose functions could not be predicted were either annotated as conserved proteins (had homologs beyond Escherichia and Salmonella) or predicted proteins (did not have homologs outside of Escherichia and Salmonella).

Literature

Literature given is an incomplete collection derived from GenProtEC and from the Cyber Cell Database.

Cell Location

Locations of the gene products were individually determined through careful evaluation of the literature; transmembrane helix predictions, HMMTOP (2) and TMHMM (3); signal peptide predictions, SignalP (4) and LipoP (5); and have been taken from the EchoLocation section of EchoBASE (6). The cell location data were translated into Gene Ontology (GO) terms (7) and are presented in the GO cellular component field.

Context

Names of IS elements and prophages are listed for the loci that belong to these elements.

Enzyme Nomenclature

EC numbers were collected for the E. coli enzymes from EcoCyc (8), GenProtEC (9), BRENDA (10), and from the literature. The IUBMB Enzyme Nomenclature database (/iubmb/) was consulted for the assigned EC numbers.

Cofactor, Protein Complexes

Information on cofactors used by enzymes was collected from EcoCyc (8). EcoCyc also provided data on protein complexes (homomultimers and heteromultimers) for over 950 proteins. The name of the complex and its components are provided.

Transporter Classification

Information on the transport proteins was collected from the Transport Classification Database (/). Both the Transport Classification (TC) number and Superfamily membership are given.

Regulator Family, Transcriptional Units Regulated

Data on transcriptional regulators were from RegulonDB (11) and J. Collado-Vides and Heladia Salgado (personal communications). The family membership of regulators and transcriptional units controlled by the regulators are given.

Proteases

Identification of peptide bond hydrolysis characteristics in proteins has allowed prediction of proteases (peptidases).
Information on known and predicted proteases of E. coli K-12 proteins has been extracted from the MEROPS database (12).

Signal Peptides, Membrane Helices, C-terminus location

The amino acids predicted to encode the signal peptide according to SignalP (4) are listed. In addition, literature-based signal peptide cleavage sites collected from EcoGene (13) are presented. The predicted number of transmembrane helices is provided based on the two algorithms, HMMTOP (2) and TMHMM (3). Location of the C-terminal end of transmembrane proteins, either outside in the periplasm or inside in the cytoplasm, is based on experimental methods (14).

Attenuation Regulation

Information on regulation by transcription attenuation was included (15). Operons are predicted to be regulated by attenuation based on the presence of possible stem and loop RNA structures in advance of the first gene of the operon. The Attenuation field contains information for the first gene of the operon believed to be regulated by attenuation, and the set of genes in the operon is listed (b0463: AceB; regulated by attenuation (aceB-aceA)).

Fused Proteins

Fused proteins are encoded by genes which have undergone a gene fusion event. The resulting gene product encodes two or more functions in separate regions of the protein. Such proteins are known to contribute to errors in annotation when alignment regions are not considered for transfer of functions between homologous sequences. The 108 fused E. coli proteins (9) are listed with functions and location of functions separated by "-!-".

Structure

Structure data for E. coli proteins are presented in the form of PDB IDs from the Protein Data Bank (16).

COG assignments

Membership of proteins in COGs (Clusters of Orthologous Genes; (17)) is presented by the COG IDs and their annotations. Some E. coli proteins contain more than one COG. These were provided directly by E.V.
Koonin as more than one per gene cannot easily be retrieved from the NCBI Web site.

Superfamily (SCOP domain) assignments
SCOP superfamily domains identify structural elements in protein sequences, some of which have known functions that can help characterize otherwise unknown proteins. The presence of structural domains, based on similarity to known SCOP superfamily domains, is shown in the table. Information on structural domains was obtained from the Superfamily database (18) and is listed with superfamily ID and domain annotation.

Pfam assignments
Information on Pfam assignments for the E. coli proteins was obtained from the Pfam database (19). Pfam represents a large collection of multiple sequence alignments and hidden Markov models for many common protein domains and families. The Pfam IDs, annotations, e-value, and amino acid range are shown.

TIGRFAM assignments
Membership in TIGRFAM protein families was obtained from the TIGRFAM database (20). TIGRFAMs are curated protein families developed for use in annotation.

GO assignments
Data on cellular component were obtained by translating data from the "Cell Location" field to GO terminology. GO assignments for the cellular process and molecular function levels were obtained by transferring MultiFun cell role/pathway assignments (9) to Gene Ontology terminology (21). The mapping can be obtained at: /external2go/multifun2go. Current MultiFun assignments are present at GenProtEC (/).

References
1. Demerec,M., Adelberg,E.A., Clark,A.J. and Hartman,P.E. (1966) A proposal for a uniform nomenclature in bacterial genetics. Genetics, 54, 61-76.
2. Tusnady,G.E. and Simon,I. (2001) The HMMTOP transmembrane topology prediction server. Bioinformatics, 17, 849-850.
3. Krogh,A., Larsson,B., von Heijne,G. and Sonnhammer,E.L. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol., 305, 567-580.
4. Bendtsen,J.D., Nielsen,H., von Heijne,G. and Brunak,S. (2004) Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol., 340, 783-795.
5. Juncker,A.S., Willenbrock,H., von Heijne,G., Brunak,S., Nielsen,H. and Krogh,A. (2003) Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci., 12, 1652-1662.
6. Misra,R.V., Horler,R.S., Reindl,W., Goryanin,I.I. and Thomas,G.H. (2005) EchoBASE: an integrated post-genomic database for Escherichia coli. Nucleic Acids Res., 33, D329-D333.
7. Fujimoto,S. and Clewell,D.B. (1998) Regulation of the pAD1 sex pheromone response of Enterococcus faecalis by direct interaction between the cAD1 peptide mating signal and the negatively regulating, DNA-binding TraA protein. Proc. Natl. Acad. Sci. USA, 95, 6430-6435.
8. Keseler,I.M., Collado-Vides,J., Gama-Castro,S., Ingraham,J., Paley,S., Paulsen,I.T., Peralta-Gil,M. and Karp,P.D. (2005) EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res., 33, D334-D337.
9. Serres,M.H., Goswami,S. and Riley,M. (2004) GenProtEC: an updated and improved analysis of functions of Escherichia coli K-12 proteins. Nucleic Acids Res., 32, D300-D302.
10. Schomburg,I., Chang,A., Ebeling,C., Gremse,M., Heldt,C., Huhn,G. and Schomburg,D. (2004) BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res., 32, D431-D433.
11. Salgado,H., Gama-Castro,S., Martinez-Antonio,A., Diaz-Peredo,E., Sanchez-Solano,F., Peralta-Gil,M., Garcia-Alonso,D., Jimenez-Jacinto,V., Santos-Zavaleta,A., Bonavides-Martinez,C. et al. (2004) RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res., 32, D303-D306.
12. Rawlings,N.D., Tolle,D.P. and Barrett,A.J. (2004) MEROPS: the peptidase database. Nucleic Acids Res., 32, D160-D164.
13. Rudd,K.E. (2000) EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Res., 28, 60-64.
14. Daley,D.O., Rapp,M., Granseth,E., Melen,K., Drew,D. and von Heijne,G. (2005) Global topology analysis of the Escherichia coli inner membrane proteome. Science, 308, 1321-1323.
15. Merino,E. and Yanofsky,C. (2005) Transcription attenuation: a highly conserved regulatory strategy used by bacteria. Trends Genet., 21, 260-264.
16. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235-242.
17. Tatusov,R.L., Fedorova,N.D., Jackson,J.D., Jacobs,A.R., Kiryutin,B., Koonin,E.V., Krylov,D.M., Mazumder,R., Mekhedov,S.L., Nikolskaya,A.N. et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.
18. Madera,M., Vogel,C., Kummerfeld,S.K., Chothia,C. and Gough,J. (2004) The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res., 32, D235-D239.
19. Bateman,A., Coin,L., Durbin,R., Finn,R.D., Hollich,V., Griffiths-Jones,S., Khanna,A., Marshall,M., Moxon,S., Sonnhammer,E.L. et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138-D141.
20. Haft,D.H., Selengut,J.D. and White,O. (2003) The TIGRFAMs database of protein families. Nucleic Acids Res., 31, 371-373.
21. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25-29.

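Bioinformatics: analysis of protein properties and structure

PredictProtein: https:///
(2) Analyzing protein secondary structure. Secondary structure is maintained mainly by hydrogen bonds: α-helix, β-sheet,
turn, loop,
random coil
Chou-Fasman method
Protein pI, Mw, amino acid composition, etc.
2. Analyzing protein hydrophobicity: open /tools/, choose the "ProtScale" tool under "Primary structure analysis", then on the ProtScale home page (/protscale/) paste the sequence and select the analysis method
(3) Analyzing protein tertiary structure. 1. Inferring an unknown protein structure from known protein structures
BLAST search: search the Protein Data Bank (PDB) for structures of homologous proteins
2. Analyzing protein structure by molecular modeling
The analysis is complex and suited to specialists
Phyre2 /phyre2/html/page.cgi?id=index
Analysis of protein properties and structure
ExPASy (Expert Protein Analysis System)
Nucleic Acids Research 2003, 31:3784-8
Analysis tools of the Swiss Institute of Bioinformatics (SIB)
The hydrophilicity and hydrophobicity results are available both as text and as a plot
3. Analyzing conserved protein domains
Paste the sequence into the text box "Scan a sequence against PROSITE patterns and profiles"
Use the default parameters (exclude patterns with a high probability of occurrence)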

大连蛇岛蝮蛇类凝血酶基因克隆与表达研究


( 1. 大连理工大学 生物工程系 辽宁 大连 116012; 2. 中国科学院化工冶金研究所 生物化学工程国家重点实验室 3. 瑞典乌普萨拉大学 生物医学研究中心 乌普萨拉 75 123 )
北京
100080;
摘要: 根据同源性设计引物 通过 RT-PCR 方法从大连蛇岛蝮蛇毒腺总 RNA 中合成扩增出
第2期
杨 青等, 大连蛇岛蝮蛇类凝血酶基因克隆与表达研究
157
1. 2 RT-PCR 合成扩增 cDN A 取 1 pL 新鲜制备的总 RNA 在 65 下保温
1O min, 然后立即转移至 冰 上. cDNA 的 合 成 以 及 扩 增 采 用 RT-PCR kit( 5/ -f ull RACE Core Set, Takara, 日本) . 具体操作根据试剂盒生产商 的建议完成. 基于同源性[5~ 6], 设计合成用于扩增 反应的 PCR 引物序列是
O引 言
毒蛇的毒液也许是迄今为止所发现的脊椎动 物中最高度浓缩的分泌产物 而蛋白水解酶是蛇 毒 中最重要的成分. 从生物化学上讲 蛇毒蛋白 水解酶可被划分为两大类: 丝氨酸蛋白水解酶和 金属蛋白水解酶{1]. 类 凝 血 酶 属 于 丝 氨 酸 蛋 白 水 解酶 它的特点是能将纤维蛋白原转化为纤维蛋 白凝胶. 由于此纤维蛋白凝胶是非交联的并极易 为纤维溶解酶所降解 类凝血酶在临床上可以用 于改善血液的流动性 从而达到治疗和防治血栓 的目 的{2~ 3]. 目 前 已 有 20 余 种 不 同 种 蛇 的 类 凝 血酶被分离出来并进行了性质的鉴定 其中有 8 种 类 凝 血 酶 的 氨 基 酸 全 序 列 已 见 报 道 {4] .
GTT V
ACT T
TAT Y
AGA R

绵羊CTSB基因过表达载体的构建及生物信息学分析


第44卷第2期2021年3月河北农业大学学报JOURNAL OF HEBEI AGRICULTURAL UNIVERSITYVol.44 No.2Mar.2021绵羊CTSB基因过表达载体的构建及生物信息学分析韩红叶1,张丽萌2,刘爱菊1,马晓菲1,李悦欣1,高 旭1,王志刚1,田树军1,3(1. 河北农业大学 动物科技学院,河北 保定071001; 2. 郑州师范学院分子生物学实验室,郑州450044; 3.河北省牛羊胚胎技术创新中心, 河北 保定071001)摘要:为了构建组织蛋白酶B(Cathepsin B,CTSB)基因的真核表达载体,进一步研究其蛋白的结构和功能,本试验以绵羊卵巢组织的cDNA为模板,扩增绵羊卵巢CTSB基因的CDS编码区,将其插入真核表达载体中成功获得重组质粒pcDNA3.1-CTSB,利用生物学信息学分析软件对绵羊卵巢CTSB基因的结构和功能进行分析鉴定。

研究结果表明:成功构建了重组质粒pcDNA3.1-CTSB,发现CTSB基因主要在细胞外行使功能,CTSB蛋白具有翻译后磷酸化及糖基化修饰特性,CTSB与LGMN、NLRP3、CTSD蛋白之间存在一定互作关系,上述发现将为进一步探究绵羊CTSB基因的功能提供重要线索。

关键词:绵羊;CTSB;过表达载体;蛋白结构;蛋白功能中图分类号:S826开放科学(资源服务)标识码(OSID):文献标志码:AConstruction of overexpression vector and bioinformaticsanalysis of sheep CTSB geneHAN Hongye1, ZHANG Limen2, LIU Aiju1, MA Xiaofei1, LI Yuexin1, GAO Xu1, WANG Zhigang1, TIAN Shujun1(1.College of Animal Science and Technology, Hebei Agricultural University, Baoding 071001, China; 2. Laboratoryof Molecular Biology, Zhengzhou Normal University, Zhengzhou 450044, China; 3. Hebei Technology InnovationCenter of Cattle and Sheep Embryo, Baoding 071001, China )Abstract: In order to construct the eukaryotic expression vector of cathepsin B (CTSB) gene and further study thestructure and function of CTSB protein, the CDS coding region of sheep CTSB gene was amplified by using thecDNA of sheep ovary tissue as a template, and it was inserted into the eukaryotic expression vector to successfullyobtain PCDNA3.1-CTSB. We analyzed the structure and function of the CTSB gene by bioinformatics software. Theresults showed that the sheep recombinant plasmid pcDNA3.1-CTSB was successfully constructed, and the CTSBgene was highly conserved during evolution and functions outside the cell. The CTSB protein had post-translationalphosphorylation and glycosylation modification, and there was a certain interaction between CTSB and LGMN,NLRP3, CTSD, which provided clues for further exploring the function of sheep CTSB gene.Keywords: sheep; CTSB ; overexpression vector; protein structure; protein function收稿日期:2020-12-25基金项目:河北农业产业技术体系(HBCT2018140202);河北省重点研发计划(20326349D).第一作者:韩红叶(1993- ),女,河北石家庄人,硕士研究生,从事动物繁殖研究.E-mail:******************** 通信作者:王志刚(1968- ),男,河北衡水人,硕士,研究员,从事动物繁殖调控技术研究.E-mail:**************.cn 田树军(1970- ),男,河北张家口人,博士,教授,从事动物胚胎工程及羊生产学工作研究.E-mail:***************本刊网址:http: // hauxb. hebau. edu. cn文章编号:1000-1573(2021)02-0093-04DOI:10.13320/ki.jauh.2021.003194第44卷河北农业大学学报组织蛋白酶B(CTSB)是半胱氨酸蛋白酶家族中的一员,在所有半胱氨酸组织蛋白酶中含量丰富,在生理和病理(如细胞凋亡、肿瘤的浸润转移等)中发挥作用[1]。

多房棘球绦虫钙网蛋白的真核表达及其T、B细胞表位预测


1)01:10.13350/j.cjpb.201207•论著•多房棘球绦虫钙网蛋白的真核表达及其T、B细胞表位预测*陈路娟'.程詰;.王彦海:….赵利美(1.内蒙古科技大学包头医学院基础医学与法医学院病原生物学教研室.内蒙古包头014060;2.厦门大学生命科学学院寄生动物研究室)籃珂目的构建多房棘球绦虫钙网蛋白(EmCRT)的真核表达载体.鉴定重组EmCRT蛋白在Hela细胞中的表达,并对其进行T、B细胞衣位预测等生物信息学分析。

方法根据多房棘球绦虫钙网蛋白基因序列设计特异引物.以多房棘球绦虫原头坳cDNA为模板,PCR扩增EmCRT基因片段。

将此片段插入真核表达载体pcDNA3.3-HA,构建重组质粒pcDNA3.3-HA-EmCRT,并将PCR、酶切和测序鉴定正确的质粒转染Hela细胞.应用Western blot和细胞免疫荧光法检测重组EmCRT蛋白在细胞中的表达。

利用ProtParam预测EmCRT的理化性质.SignalP4.1Server预测其信号肽序列,PS()RT II Prediction预测其亚细胞定位.TMHMM 2.0预测其跨膜结构域.SOPMA和SWISS-MODEL预测其二、三级结构.DNAStar软件分析其亲水性、柔韧性、抗原指数以及表面可及性.并推测可能的B细胞表位;采用SYF-PEUTHI的T细胞表位预测工具分别预测细胞毒性T细胞(CTL)和辅助T细胞(Th)表位。

结果成功构建了Em-CRT的真核表达载体,经Western blot和细胞免疫荧光检测EmCRT在Hela细胞中高效表达。

EmCRT蛋白的相对分子质量为45.44X101,等电点为4.47,预测该蛋白含有1个信号肽序列和1个跨膜区域.可能定位于细胞质,具有6个T、B细胞联合表位,分别为51-88aa、112-154aa.149-178aa、185-198aa,244-263aa、280-309aa q结论真核表达的重组多房棘球绦虫钙网蛋白相对分子质量为45.44 X1O1.预测含有T、B细胞表位.为高效抗多房棘球绦虫表位疫苗的研发奠定了基础。

苎麻expansin_家族成员鉴定与表达分析


㊀㊀㊀2023年第45卷第4期㊀㊀中国麻业科学㊀㊀PLANTFIBERSCIENCESINCHINA㊀㊀㊀㊀文章编号:1671-3532(2023)04-0145-07苎麻expansin家族成员鉴定与表达分析石亚亮1ꎬ2ꎬ钟意成2ꎬ黄坤勇2ꎬ牛娟2ꎬ孙志民2ꎬ陈建华2∗ꎬ栾明宝2∗(1.三亚中国农业科学院国家南繁研究院ꎬ海南三亚572024ꎻ2.中国农业科学院麻类研究所ꎬ湖南长沙410221)摘㊀要:苎麻是一种重要的纤维作物ꎬ纤维被用作纺织工业的原料ꎮ扩展蛋白具有诱导依赖pH的细胞壁伸长和压力松弛的特性ꎮ为了对苎麻基因组中expansin家族成员数量和类型进行鉴定ꎬ并研究在纤维细度不同的苎麻茎皮中扩展蛋白基因的表达情况ꎬ研究首先在苎麻中苎1号基因组数据库挖掘到24个扩展蛋白基因家族成员ꎬ通过生物信息学方法对其分子量㊁等电点㊁信号肽㊁亚细胞定位㊁motif及基因结构进行分析和预测ꎮ系统进化树结果显示:扩展蛋白包括α亚族成员17个㊁β亚族4个㊁α类亚族1个和β类亚族2个ꎬ共分为3类ꎻexpansin家族同一进化支蛋白结构域比较保守且有一个特有的motifꎮ然后ꎬ利用苎麻茎皮扩展蛋白基因家族成员转录组表达分析和qRT-PCR验证ꎬ发现两个基因在两个纤维细度差异显著的品种及其5个不同生长期表达量差异显著ꎮ该结果有助于了解苎麻expansin家族的进化ꎬ为影响苎麻纤维发育和纤维细度相关基因及其功能的研究奠定基础ꎮ关键词:苎麻ꎻexpansin家族ꎻ生物信息学分析ꎻ表达分析中图分类号:S563.1㊀文献标识码:A㊀开放科学(资源服务)标识码(OSID):㊀收稿日期:2022-03-16基金项目:国家自然科学基金(31671744)ꎻ2060299-2-23年科技创新工程-基础研究-南繁育种中心-麻类作物专用品种繁育及种质创新项目(ZDXM2306)作者简介:石亚亮(1995 )ꎬ男ꎬ硕士研究生ꎬ研究方向为苎麻种质资源鉴定与评价ꎮE-mail:shi1072548663@163.com∗通信作者:栾明宝(1978 )ꎬ男ꎬ研究员ꎬ主要从事作物种质资源研究ꎮE-mail:luanmingbao@caas.cnꎻ陈建华(1963 )ꎬ男ꎬ研究员ꎬ主要从事作物种质资源研究ꎮE-mail:cjhbt@sina.comIdentificationandExpressionAnalysisofExpansinFamilyinRamie(BoehmerianiveaL.)SHIYaliang1ꎬ2ꎬZHONGYicheng2ꎬHUANGKunyong2ꎬNIUJuan2ꎬSUNZhimin2ꎬCHENJianhua2∗ꎬLUANMingbao2∗(1.NationalNanfanResearchInstitute(Sanya)ꎬChineseAcademyofAgriculturalSciencesꎬSanya572024ꎬHainanꎬChinaꎻ2.InstituteofBastFiberCropsꎬChineseAcademyofAgriculturalSciencesꎬChangsha410221ꎬHunanꎬChina)Abstract:Ramieisanimportantfibercropꎬandfiberisusedastherawmaterialoftextileindustry.ExpansinhavethecharacterofinducingPH-dependentcellwallelongationandstressrelaxation.Inordertoexplorethenumberandtypeoftheexpansinfamilymembersinramiegenomeandtostudytheexpres ̄sionpatternofexpansingenesinthestembarkoframievarietieswithdifferentfiberfinesseꎬ24membersoftheexpansingenefamilywerefirstlydetectedinramiefromtheZhongzhuNo.1genomicdatabase.A ̄nalysisandpredictionofproteinmolecularweightꎬisoelectricpointꎬsignalpeptideandsubcellularlocali ̄zationwereconductedonthefamilymembersusingbioinformaticsinthisstudy.Byconstructingthephy 
̄logenetictreeꎬ17Expansinαsubfamilyꎬ4Expansinβsubfamilyꎬ1Expansinα-likesubfamilyand2Expansinβ-likesubfamiliesmembersweredividedinto3categories.Insameevolutionarybranchofex ̄541641㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀中国麻业科学㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀第45卷pansinfamilyꎬproteindomainisrelativeconservativeandtheproteinsequencesofeachgenesubfamilyhasaspecificmotif.Expressionpatternofexpansingenefamilymembersinstembarkoframiewasana ̄lyzedusingtranscriptomedataandsignificantexpressiondifferenceoftwogeneswasconfirmedatfivegrowthstagesofthetwovarietiesthroughqRT-PCR.Theresultsarehelpfulforunderstandingtheevolu ̄tionoftheramieexpansingenefamilyꎬaswellasfuturestudiesofgenesandtheirfunctionsthataffectfi ̄berdevelopmentandfiberfinesse.Keywords:ramieꎻexpansinfamilyꎻbioinformaticanalysisꎻgeneexpressionanalysis扩展蛋白最早被鉴定为一种具有细胞壁松弛作用的蛋白ꎬ部分介导了植物细胞壁延伸和细胞生长ꎮ扩展蛋白主要分为两个蛋白家族ꎬ即α和β扩展蛋白亚族ꎬ他们不仅作用于细胞膨胀ꎬ还参与调节包括形态发生㊁果实软化㊁花粉管生长等多种植物生长过程[1]ꎮ蛋白功能被细胞壁酸性条件激活ꎬ在植物中存在一些反应机制会诱导细胞壁pH值发生变化而影响细胞的生长[2]ꎮ由于扩展蛋白对细胞壁具有独特的修饰作用ꎬ先后在棉花和苎麻关于纤维发育的研究中得到关注ꎬ已有研究发现3种编码扩展蛋白的基因在苎麻的上部茎皮上调表达[3-4]ꎮ一些家族成员基因在其他植物中的功能相继被证实ꎬ涉及促进生长㊁改善纤维品质和盐胁迫响应ꎬ例如水稻OsEXP4[5]㊁棉花GhEXPA8[6]和GbEXPATR[7]㊁小麦OsEXPB23[8]等ꎮ扩展蛋白在苎麻中的研究才刚刚起步ꎬ该家族基因具体的分子功能及其对纤维细胞的发育和纤维品质的影响尚不清楚ꎮ在Swiss-Prot数据库中:水稻的56个编码扩展蛋白的基因ꎬ包括α亚族34个ꎬβ亚族19个ꎻ拟南芥36个编码扩展蛋白的基因包括α亚族25个和β亚族7个ꎻ玉米4个基因都属于β亚族ꎮ苎麻中通过转录组序列拼接和同源基因克隆的方法发现了12个α亚族和4个β亚族ꎬ共16个编码扩展蛋白的基因[9]ꎮ然而ꎬ目前还未对苎麻expansin家族成员的具体数目和类型进行全面鉴定ꎮ苎麻基因组测序草图的顺利完成[10]ꎬ使利用生物信息学在全基因组水平上分离和鉴定expansin家族成为可能ꎮ因此ꎬ本研究利用苎麻基因组数据库对苎麻expansin家族成员及其类型进一步鉴定与分析ꎬ旨在为苎麻纤维发育候选基因挖掘奠定基础ꎮ1㊀材料与方法1.1㊀苎麻expansin家族成员鉴定在Pfam数据库下载扩展基因家族保守结构域数据文件(登录号分别为PF03330和PF01357)ꎬ利用hmmerv3.0软件构建苎麻expansin家族专有的保守结构域隐马尔可夫模型文件ꎬ然后在苎麻基因组蛋白数据中搜索家族成员[11]ꎮ将家族蛋白氨基酸序列在HMMER网站(ht 
̄tps://www.ebi.ac.uk/Tools/hmmer/search/phmmer)进行序列分析ꎮ1.2㊀理化性质分析和亚细胞定位通过ExPASy网站(https://www.expasy.org/)中的Protparam程序分析蛋白质的分子量㊁等电点ꎬ利用SignalP-5.0预测蛋白的信号肽ꎬ采用pLoc-mPlant进行蛋白的亚细胞定位预测ꎮ1.3㊀扩展基因结构及蛋白序列分析利用在线软件MEME(http://meme-suite.org/)对扩展蛋白保守结构域进行分析ꎮ最大motif数量设为20ꎬ其他参数为默认值ꎮ根据苎麻基因组的DNA序列和编码区序列ꎬ使用TBtools软件[12]分析和绘制基因家族成员的基因结构图ꎮ1.4㊀基因家族系统进化分析利用分子进化分析软件MEGA7的ClustalW程序对鉴定的苎麻扩展蛋白氨基酸序列进行多重序列比对ꎬ采用邻接法(NJꎬneighbor-joining)构建系统发育树ꎮ在Swiss-Prot数据库下载水稻和拟南芥的expansin家族的蛋白氨基酸序列ꎬ与苎麻的蛋白氨基酸序列进行比对ꎬ默认参数ꎬbootstrap值设为500ꎬ构建NJ进化树ꎮ1.5㊀表达分析依据本课题组已经完成的苎麻转录组测序结果ꎬ即2个纤维细度差异大的品种ꎬ编号为2-25(纤维细度1312m/g)和3-4(2788m/g)ꎬ发芽2周(T1)㊁4周(T2)㊁6周(T3)㊁8周(T4)㊁10周(T5)5个不同纤维发育时期茎皮中各基因的FPKM值[13-14]ꎬ分析其5个发育时期的expansin家族基因表达模式ꎮ利用与转录组相同的样品材料(液氮冷冻ꎬ-80ħ保存)ꎬ提取苎麻茎皮总RNAꎬRNA提取方法和模板cDNA的合成均按照试剂盒操作手册ꎮTaKaRaMiniBESTUniversalRNAEx ̄tractionKit用于茎皮总RNA提取ꎬThermoScientificRevertAidfirst-strandcDNAsynthesiskit(Ther ̄moScientificꎬVilniusꎬLithuania)用于cDNA合成ꎬ合成20μL体系cDNAꎬ用ddH2O稀释一倍后备用ꎮ根据扩展蛋白基因家族表达差异的结果在Primer5设计相关基因的特异性引物ꎬ以苎麻18SrRNA为内参基因ꎬ于Bio-RadiQ5Real-TimePCRSystem(Bio-RadꎬCAꎬUSA)分析仪上进行qRT-PCR分析ꎬ25μLqPCR体系ꎬ即1μLcDNAꎬ12.5μL2ˑSYBRqPCRMix(北京艾德莱生物公司)ꎬ各1μL上下游引物(10μmol/L)和10.5μLddH2Oꎬ程序为95ħ2min㊁95ħ15s和55ħ30sꎬ40个循环ꎬ每个样品设3个重复ꎬ并按照2-ΔΔCT的方法计算基因的相对表达量[15]ꎮ利用GraphPadPrism8软件的Holm-Sidak方法对表达量进行t测验ꎬ以比较基因在品种间和时期间的差异显著性ꎮ2㊀结果与分析2.1㊀基因家族成员的鉴定及理化性质分析通过全基因组基因家族成员挖掘ꎬ去掉结构和功能冗余的2个候选序列ꎬ共获得了27个ex 
̄pansin基因候选序列ꎮ在Pfam数据库对这些序列进行结构域注释ꎬ发现24个具有家族保守结构域的DPBB_1和Pollen_allerg_1ꎬ3个缺失第二个结构域的候选序列ꎮ对此24个具有完整结构域基因的命名沿用苎麻基因组蛋白数据库注释ꎮ比较和分析24个苎麻扩展蛋白序列的理化性质(表1)ꎬ氨基酸数目213aa(Expansin-A23)~589aa(Expansin-B15_3)ꎬ分子量23253.44Da(Expansin-A23)~62928.18Da(Expansin-B15_3)ꎬ等电点4.78(Expansin-like_2)~9.96(Expansin-A12_1)ꎮ预测发现5个蛋白序列没有信号肽ꎬ分别为Expansin-A23㊁Expansin-A12_2㊁Expansin-A4_1㊁Expansin-B15_3㊁Expansin-A12_1ꎮ亚细胞定位结果显示24个苎麻扩展蛋白均位于细胞壁ꎮ表1㊀苎麻扩展蛋白基本理化性质Table1㊀Physicalandchemicalcharacteristicsoframieexpansins基因ID蛋白名称编码的氨基酸数目(aa)相对分子质量/Da理论等电点(pI)信号肽亚细胞定位Maker00000009Expansin-A1524726611.339.57有细胞壁Maker00000787Expansin-A2321323253.448.63无细胞壁Maker00002900Expansin-B327129305.429.30有细胞壁Maker00003265Expansin-A2_127129335.686.18有细胞壁Maker00012521Expansin-A12_221623752.349.57无细胞壁Maker00013219Expansin-A1326428426.297.55有细胞壁Maker00016792Expansin-A4_130533474.319.73无细胞壁Maker00030840Expansin-A4_226027872.679.71有细胞壁Maker00032883Expansin-A1628832138.599.58有细胞壁Maker00050605Expansin-A8_125327034.448.65有细胞壁Maker00052718Expansin-like_125227733.366.29有细胞壁Maker00053376Expansin-A2_227029264.606.18有细胞壁Maker00053410Expansin-like_227229987.704.78有细胞壁Maker00054238Expansin-B15_125326763.885.90有细胞壁Maker00054946Expansin-B15_227529240.696.80有细胞壁Maker00056345Expansin-A4_326027993.809.38有细胞壁Maker00064735Expansin-A124725970.149.81有细胞壁Maker00065621Expansin-A1126928528.599.49有细胞壁Maker00075780Expansin-A8_225026428.506.38有细胞壁Maker00076261Expansin-B15_358962928.188.63无细胞壁Maker00077261Expansin-A726328391.229.51有细胞壁741第4期㊀石亚亮等:苎麻expansin家族成员鉴定与表达分析续表1基因ID蛋白名称编码的氨基酸数目(aa)相对分子质量/Da理论等电点(pI)信号肽亚细胞定位Maker00083697Expansin-A12_121623969.509.96无细胞壁Maker00083906Expansin-A2025227973.718.29有细胞壁Maker00084600Expansin-like_326228555.678.71有细胞壁2.2㊀基因家族的基因结构及蛋白质基序预测苎麻expansin家族成员外显子数目为2~9个ꎬ内含子数目为1~8个ꎮ根据MEME分析结果ꎬ选择其中E-value最低为8.4e-003的17个保守基序作图(图1)ꎮ同一进化支蛋白的结构域均保守且每个基因亚族蛋白含特有的保守基序(motif)ꎬα亚族蛋白特有motif3ꎬβ亚族特有motif8ꎬEXPL支(EXPLA亚族和EXPLB亚族)特有motif15ꎮ图1㊀扩展家族蛋白序列保守基序和基因结构与3个特有保守基序Fig.1㊀Proteinsequenceconservedmotifsandgenestructureofexpansinfamilyand3specifi
cmotifs841㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀中国麻业科学㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀第45卷2.3㊀苎麻expansin家族的系统进化分析在Swiss-Prot数据库下载水稻和拟南芥的expansin家族的蛋白氨基酸序列ꎬ分别为36和56条ꎬ与24条苎麻蛋白氨基酸序列一起构建进化树(图2)ꎮexpansin家族包括EXPA㊁EXPB㊁EXPLA㊁EXPLB等4个亚族ꎬ在进化树中可分为EXPA㊁EXPB㊁EXPL3大类ꎮEXPA类包含苎麻expansin家族的17个基因(α亚族)ꎬEXPB类包含4个基因(β亚族)ꎬEXPL类有3个基因(1个α类亚族和2个β类亚族)ꎮ图2㊀苎麻expansin家族进化树Fig.2㊀Expansinfamilyphylogenetictreeinramie2.4㊀基因家族成员的表达分析5个发育时期的expansin家族基因表达模式如图3所示ꎬ9个基因在两个品种间存在明显的差异表达情况ꎬ其中BnEXPA16㊁BnEXP-like-3和BnEXPA20仅在品种3-4的T3㊁T4及T5期上调表达ꎬBnEXP-like-1和BnEXPA7仅在品种2-25的T4期上调表达ꎬBnEXPA8-2㊁BnEXPB3㊁BnEX ̄PA23和BnEXPA11仅在品种2-25的T2期上调表达ꎮ两个苎麻品种5个生长期内ꎬ在T1和T2期都上调表达的有BnEXPA2-1㊁BnEXPA2-2㊁BnEXP-like-2㊁BnEXPA8-1㊁BnEXPA12-1㊁BnEXPA8-2㊁BnEXPA1㊁BnEXPA15㊁BnEXPA4-1㊁BnEXPA13ꎮ在T3㊁T4和T5期都上调表达的有BnEXPB15-1㊁BnEXPB15-2和BnEXPB15-3ꎮ根据转录组数据分析结果和相关差异表达基因的功能分析ꎬ选择BnEXP-like-3和BnEXPB15-2这两个基因做进一步分析(特异性引物序列见表2)ꎬ荧光定量PCR结果(图4)显示ꎬ5个时期苎麻茎皮中两个基因的表达量呈现显著差异(p<0.01)ꎬ两个基因在品种3-4表达量明显高于2-25的表达量ꎮ941第4期㊀石亚亮等:苎麻expansin家族成员鉴定与表达分析图3㊀苎麻expansin家族基因表达量热图Fig.3㊀Geneexpressionheatmapofexpansinfamilyinramie表2㊀expansin家族基因特异qRT-PCR引物(5 ң3 )Table2㊀ExpansinfamilygenesspecificqRT-PCRprimers(5 ң3 )Gene正向引物反向引物BnEXP-like-3ATCAAGCAACAAGCCACACTATGAGCAACATCAATGCCAACABnEXPB15-2AACGAGGCGACCCACTGTACTGAATCTGCAACACCC18SrRNATGACGGAGAATTAGGGTTCGACCGTGTCAGGATTGGGTAATTT注: ∗ 表示p<0.05的显著差异性ꎬ ∗∗ 表示p<0.01ꎬ ∗∗∗ 表示p<0.001ꎬ ꎮ图4㊀BnEXP-like-3(左)和BnEXPB15-2(右)在两个苎麻品种茎皮5个时期的相对表达量Fig.4㊀ExpressionlevelofBnEXP-like-3(left)andBnEXPB15-2(right)ofstembarkatfivegrowthstagesof2varieties051㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀中国麻业科学㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀第45卷3㊀讨论与结论苎麻中鉴定出的4种扩展蛋白亚族的蛋白ꎬ包括α亚族17个㊁β亚族4个㊁α类亚族1个和β类亚族2个ꎮ蛋白质的氨基酸序列分析表明ꎬ可依据其特有的保守基序区分不同亚族的扩展蛋白ꎬ苎麻expansin家族α亚族蛋白特有motif3ꎬβ亚族特有motif8ꎬα类亚族和β类亚族共特有motif15ꎮ表达分析结果显示ꎬexpansin家族成员在不同品种和时期存在表达差异ꎬ且BnEXP-like-3与BnEXPB15-2两个基因在品种3-4中各个时期的相对表达量均高于2-25ꎮ目前被报道与纤维发育相关的扩展蛋白基本属于如EXP1㊁EXPA2㊁EXPA8等所在的成员数量占有优势的α亚族[3-4ꎬ6-7]ꎬ而鲜有α类亚族和β亚族的成员ꎮ另外ꎬ有研究发现ꎬ苎麻纤维细度还受光照㊁湿度㊁温度等生态环境因素的影响[16]ꎮ本研究首次发现在两个在纤维细度不同的两个品种间呈表达差异的BnEXP-like-3与BnEXPB15-2ꎬ其所在亚族的蛋白被报道与植物抗逆性相关ꎬ如启动子序列存在脱落酸㊁生长素㊁水杨酸等激素诱导元件和对干旱㊁高温等非生物胁迫的响应元件的OfEX 
̄LA1[17]ꎬ及被证实由磷酸盐饥饿诱导且能够提高大豆磷效率的GmEXPB2[18]ꎮ因此ꎬBnEXP-like-3和BnEXPB15-2是否具有通过对苎麻生长环境响应而影响纤维细度的功能值得深入探讨ꎮ棉花中发现了两个扩展蛋白基因ꎬGbEXPA2通过增加结晶纤维素含量影响纤维细胞厚度ꎬGbEXPATR是编码一个缺失Pollen_allerg_1结构域的蛋白基因ꎬ同属于扩展蛋白α亚族ꎬ过表达GbEXPATR纤维会产生更长㊁更结实的薄壁纤维[7]ꎮ在本研究中也发现了3个缺失第二结构域的蛋白ꎬ其中两个有信号肽ꎬ但与扩展蛋白家族各个亚族的蛋白相似性不高ꎬ其功能有待进一步研究ꎮ参考文献:[1]LEEYꎬCHOIDꎬKENDEH.Expansins:ever-expandingnumbersandfunctions[J].CurrentOpinioninPlantBiologyꎬ2001ꎬ4(6):527-532.[2]COSGROVEDJ.Growthoftheplantcellwall[J].NatureReviewsMolecularCellBiologyꎬ2005ꎬ6(11):850-861.[3]RUANYLꎬLLEWELLYNDJꎬFURBANKRT.Thecontrolofsingle-celledcottonfiberelongationbydevelopmentallyreversiblega ̄tingofplasmodesmataandcoordinatedexpressionofsucroseandK+transportersandexpansin[J].PlantCellꎬ2001ꎬ13(1):47-60.[4]CHENJꎬPEIZHꎬDAILJꎬetal.Transcriptomeprofilingusingpyrosequencingshowsgenesassociatedwithbastfiberdevelopmentinramie(BoehmerianiveaL.)[J].BMCGenomicsꎬ2014ꎬ15:919.[5]CHOIDSꎬLEEYꎬCHOHTꎬetal.Regulationofexpansingeneexpressionaffectsgrowthanddevelopmentintransgenicriceplants[J].PlantCellꎬ2003ꎬ15(6):1386-1398.[6]BAJWAKSꎬSHAHIDAAꎬRAOAQꎬetal.StabletransformationandexpressionofGhEXPA8fiberexpansingenetoimprovefiberlengthandmicronairevalueincotton[J].FrontiersinPlantScienceꎬ2015ꎬ6:838.[7]LIYꎬTULLꎬPETTOLINOFAꎬetal.GbEXPATRꎬaspecies-specificexpansinꎬenhancescottonfibreelongationthroughcellwallre ̄structuring[J].PlantBiotechnologyJournalꎬ2016ꎬ14(3):951-963.[8]HANYYꎬLIAXꎬLIFꎬetal.Characterizationofawheat(TriticumaestivumL.)expansingeneꎬTaEXPB23ꎬinvolvedintheabioticstressresponseandphytohormoneregulation[J].PlantPhysiologyandBiochemistryꎬ2012ꎬ54:49-58.[9]陈杰.苎麻纤维发育相关转录组测序及expansin家族功能分析[D].武汉:华中农业大学ꎬ2017.[10]LUANMBꎬJIANJBꎬCHENPꎬetal.DraftgenomesequenceoframieꎬBoehmerianivea(L.)Gaudich[J].MolecularEcologyRe 
̄sourcesꎬ2018ꎬ18(3):639-645.[11]LOZANORꎬHAMBLINMTꎬPROCHNIKSꎬetal.IdentificationanddistributionoftheNBS-LRRgenefamilyintheCassavagenome[J].BMCGenomicsꎬ2015ꎬ16:360.[12]CHENCJꎬCHENHꎬZHANGYꎬetal.TBtools:anintegrativetoolkitdevelopedforinteractiveanalysesofbigbiologicaldata[J].MolecularPlantꎬ2020ꎬ13(8):1194-1202.[13]CHENKMꎬMINGYꎬLUANMBꎬetal.Thechromosome-levelassemblyoframie(BoehmeriaNiveaL.)genomeprovidesinsightsin ̄tomolecularregulationoffiberfineness[J].JournalofNaturalFibersꎬ2023ꎬ20(1):2168819.[14]黄坤勇.苎麻纤维细度全基因组关联分析及候选基因筛选[D].北京:中国农业科学院ꎬ2020.[15]谭龙涛.苎麻氮代谢高效基因型筛选及表达分析[D].北京:中国农业科学院ꎬ2015.[16]湖南省苎麻优质高产的气象条件研究协作组ꎬ刘淑梅.苎麻优质高产的气象生态条件试验研究[J].中国麻作ꎬ1991ꎬ2:13-20.[17]高晓月ꎬ董彬ꎬ张超ꎬ等.桂花扩展蛋白基因OfEXPA2㊁OfEXPA4和OfEXLA1启动子克隆及活性分析[J].浙江大学学报(农业与生命科学版)ꎬ2019ꎬ45(1):23-29.[18]ZHOUJꎬXIEJNꎬLIAOHꎬetal.Overexpressionofbeta-expansingeneGmEXPB2improvesphosphorusefficiencyinsoybean[J].PhysiologiaPlantarumꎬ2014ꎬ150(2):194-204.151第4期㊀石亚亮等:苎麻expansin家族成员鉴定与表达分析。

肺炎链球菌耐药相关蛋白Sp_0010生物信息学分析及结晶尝试


•1268 •中国病原生物学杂志2020年11月第15卷第11期J o u rn a l o f Pathogen B io lo g y Nov. 2020, Vol. 15, No. 11D()I:10. 13350/j. cjpb. 201106 •论著•肺炎链球菌耐药相关蛋白S p_0010生物信息学分析及结晶尝试+柏晓辉1…,刘雪朱雯培1,俞晨忻1,滕芸芸1,周颖1(1.黄山学院生命与环境科学学院,安徽黄山245041 ;2.清华大学医学院传染病研究中心)【摘要】目的克隆表达肺炎链球菌中一个潜在(3-内酰胺酶基因SP_0010并进行晶体学研究和生物信息学分析,以期阐明该蛋白的生理功能,为其疫苗研究奠定基础。

方法从数据库中查询获得肺炎链球菌TIG R4菌株基因SP_0010及其编码蛋白SP_0010的序列信息,利用生物信息学分析软件对Sp_0010的生物学功能进行预测和分析;利用异丙基-(H>硫代 半乳糖苷(IPTG)诱导已构建的菌株Rosetta (DE3)/ pET28a(+)-SP_0010中目标蛋白Sp_0010表达,经Ni2+柱亲和层析、分子筛纯化及浓缩后用晶体初筛试剂盒进行初筛,对筛出的蛋白Sp_0010晶体进行优化。

结果基因SP_0010编码422 个氨基酸残基,生物信息学分析显示其编码蛋白Sp_0010具有信号肽,是一种表达后分泌至细胞外的亲水性蛋白;同源建模显示Sp_0010具有N端和C端两个结构域,且C端结构域形成具有“p■内酰胺酶折叠类似”结构;抗原表位分析表明Sp_ 0010含有7个优势抗原表位,具有免疫原性;晶体筛选显示Sp_0010的初试结晶条件为25%polyethylene glycol 3350,0. 2 m d/L MgCl” 0.1 mol/L HepeS,pH 7.5。

结论蛋白Sp_0010可在大肠杆菌中重组表达,表达蛋白可形成晶体,这为研究其结构与功能奠定了基础。

SignalP: a signal peptide prediction tool _ Public Library of Bioinformatics

SignalP: a signal peptide prediction tool, 3 July 2012, Bioinformatics. SignalP is a signal peptide prediction server. It predicts whether a given amino acid sequence contains a potential signal peptide cleavage site and where that site lies, for both prokaryotic and eukaryotic sequences. The server currently provides SignalP version 4.0. Online server: http://www.cbs.dtu.dk/services/SignalP/
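Version 3.0 also reports two further values, S-mean and D.
S-mean is the mean S-score over the residues from the N-terminus up to the cleavage site.
D is the average of S-mean and Y-max; it is important for deciding whether a protein is secreted.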
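The two definitions above can be sketched in a few lines. This is a minimal illustration, not SignalP's own code: the function name `d_score`, the example score lists, and the convention that the predicted cleavage site is the 0-based index of the Y-max position (treated as the first mature residue) are all assumptions made for the example.

```python
def d_score(s_scores, y_scores):
    """Return (site, s_mean, d) from per-residue S- and Y-scores.

    Assumption for this sketch: the predicted cleavage site is the position
    of Y-max, and `site` is the 0-based index of the first mature residue.
    """
    if len(s_scores) != len(y_scores) or not y_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    # Position with the highest Y-score (Y-max).
    site = max(range(len(y_scores)), key=lambda i: y_scores[i])
    y_max = y_scores[site]
    # S-mean: average S-score from the N-terminus up to the cleavage site.
    signal_part = s_scores[:site]
    s_mean = sum(signal_part) / len(signal_part) if signal_part else 0.0
    # D: plain average of S-mean and Y-max.
    return site, s_mean, (s_mean + y_max) / 2.0


# Made-up scores for a five-residue toy sequence.
site, s_mean, d = d_score([0.9, 0.8, 0.7, 0.2, 0.1], [0.1, 0.2, 0.3, 0.8, 0.1])
```

A sequence would then be called secretory when `d` exceeds a cutoff; the cutoff itself is not given in the text and is left out here.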
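The hidden Markov model (HMM) mainly estimates whether the sequence contains a signal peptide. For eukaryotic sequences it also reports a signal anchor value (a signal anchor is comparable to a signal peptide), and it divides the signal peptide into three parts: the n-region, the h-region and the c-region.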
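Here I simply use the example results provided by the server.
In the prediction for a secreted protein, the NN method reports "yes" in the Signal peptide column and gives the potential cleavage site based on the C-, S- and Y-scores. The legend in the upper right corner of the plot shows the curve colors for the C-, S- and Y-scores, and the plot shows how each score varies along the sequence.
The text output of the HMM method gives the probability that the sequence contains a signal peptide, together with the potential cleavage site; the plot shows the predicted division of the signal peptide into regions.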

Predicting signal peptides with information-theoretic methods (thesis)

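Chapter 2. Several established prediction methods

2.1.3 Accuracy
The weight matrix method predicts signal peptide cleavage sites successfully, and it is still a standard against which researchers test whether a new method works. In von Heijne's 1986 paper, the method reached the following accuracies on proteins with known cleavage sites from his own database: 61% for eukaryotes, 81% for Gram-positive bacteria and 69% for Gram-negative bacteria. For proteins with unknown cleavage sites, the prediction accuracy reached 75%-80%.

2.2 Sequence-encoded algorithm [9]

2.2.1 Method
Signal peptide length differs between proteins: the shortest signal peptides may have 8 amino acids and the longest 90, while most are between 18 and 25 amino acids long.

Suppose that a signal peptide and its cleavage site can be described by a notional sequence segment labelled $[-L_1, +L_2]$, where $L_1$ is the number of residues in the signal part and $L_2$ is the number of residues in the mature part of the protein. The cleavage site must lie within this segment, called the "benchmark window", between the two residues labelled $-1$ and $+1$.

The author of [9] first chose $L_1 = 6$ and $L_2 = 2$, which gives the benchmark window $[-6, +2]$ (the algorithm extends easily to other values of $L_1$ and $L_2$).

A $[-6, +2]$ segment can be written as

$R_{-6}R_{-5}R_{-4}R_{-3}R_{-2}R_{-1}R_{+1}R_{+2}$

where $R_i$ is the amino acid residue at position $i$ of the nascent protein sequence.

The position between $-1$ and $+1$ is the cleavage site used during secretion; the residues before it make up the signal part.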
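The benchmark window described above can be sketched as a simple slice. This is an illustration under assumptions rather than the thesis author's code: the function name `benchmark_window`, its parameters, and the example sequence are hypothetical, and the cleavage site is given as the length of the signal part.

```python
def benchmark_window(sequence, signal_length, l1=6, l2=2):
    """Return the [-l1, +l2] segment around a cleavage site.

    `signal_length` is the number of residues before the cleavage site, so
    residue -1 is sequence[signal_length - 1] and residue +1 is
    sequence[signal_length] (0-based indexing).
    """
    start = signal_length - l1
    end = signal_length + l2
    if start < 0 or end > len(sequence):
        raise ValueError("window does not fit inside the sequence")
    return sequence[start:end]


# Illustrative precursor: a 21-residue signal part followed by "APKD".
precursor = "MKKTAIAIAVALAGFATVAQAAPKD"
window = benchmark_window(precursor, signal_length=21)
```

With the default $L_1 = 6$ and $L_2 = 2$ the result has eight residues: the first six come from the signal part and the last two from the mature protein.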

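Figure 2-1: Schematic of a signal peptide and its cleavage site

Chapter 5. Results and discussion

5.1 Signal peptide characteristics
Signal peptides from different groups of organisms differ in length.

For eukaryotes the mean signal peptide length is 23.4 amino acids; for Gram-negative bacteria it is 25.9; Gram-positive signal peptides are longer still, with a mean length of 32.7.

The length distribution for each group is shown in Figure 5-1.

[Figure 5-1: Length distribution of signal peptides; x-axis: length of signal peptide]

The amino acids near the cleavage site of a signal peptide follow the (-3, -1) rule [10]: the residue at position -1 must be a small amino acid, such as Ala, Ser, Gly, Cys, Thr or Gln, and the residue at position -3 must not be aromatic (Phe, His, Tyr, Trp), charged (Asp, Glu, Lys, Arg), or large and polar (Asn, Gln).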
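The (-3, -1) rule stated above can be checked directly on a sequence. The sketch below is a hedged illustration: the residue sets are copied from the rule as worded in this text, while the function names, the length bounds used as defaults (8 and 90, the shortest and longest signal peptide lengths quoted in Chapter 2), and the example sequence are assumptions made for the example.

```python
# Residue sets taken from the rule as stated in the text (one-letter codes).
ALLOWED_AT_MINUS_1 = set("ASGCTQ")       # Ala, Ser, Gly, Cys, Thr, Gln
FORBIDDEN_AT_MINUS_3 = set("FHYW" "DEKR" "NQ")  # aromatic, charged, large polar


def obeys_minus3_minus1_rule(sequence, signal_length):
    """True if cleaving after `signal_length` residues satisfies the rule."""
    if signal_length < 3 or signal_length > len(sequence):
        raise ValueError("need at least three residues before the site")
    minus_1 = sequence[signal_length - 1]
    minus_3 = sequence[signal_length - 3]
    return minus_1 in ALLOWED_AT_MINUS_1 and minus_3 not in FORBIDDEN_AT_MINUS_3


def candidate_sites(sequence, min_length=8, max_length=90):
    """Signal lengths within the bounds whose cleavage site obeys the rule."""
    upper = min(max_length, len(sequence))
    return [n for n in range(max(min_length, 3), upper + 1)
            if obeys_minus3_minus1_rule(sequence, n)]


precursor = "MKKTAIAIAVALAGFATVAQAAPKD"  # illustrative sequence
sites = candidate_sites(precursor)
```

The rule is a filter, not a predictor: several positions in one sequence usually pass it, which is why scoring methods are still needed to pick a single cleavage site.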

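The signal peptide hypothesis 1

Main points of the revised signal hypothesis
The supplemented and revised signal hypothesis is more reasonable than the early version. Its core is that binding of the ribosome to the endoplasmic reticulum (ER) is governed by a specific coding sequence in the mRNA (which is translated into the signal peptide): only a nascent peptide carrying this sequence can attach, together with its ribosome, to specific sites on the ER membrane. Ribosome-ER binding is therefore functional and transient, and is limited in time and space. It is this binding that ensures the vectorial release of newly synthesized protein. The signal sequence has two basic roles: (1) through recognition and binding by the signal recognition particle (SRP), it directs the ribosome to the ER; (2) through its hydrophobicity, it directs the transmembrane transport of the nascent peptide.

3. Transport mechanism of nuclear-localized proteins
a. Proteins synthesized in the cytoplasm generally enter the nucleus through the nuclear pores.
b. All ribosomal proteins are first synthesized in the cytoplasm, transported into the nucleus, and assembled in the nucleolus into the 40S and 60S ribosomal subunits, which are then transported back to the cytoplasm to function as the protein synthesis machinery.
c. RNA and DNA polymerases, histones, topoisomerases and a large number of transcription and replication regulatory factors must enter the nucleus from the cytoplasm in order to function normally.

The signal peptide hypothesis
Signal peptide: usually the N-terminal amino acid sequence of a newly synthesized polypeptide chain that directs the transfer of the protein across a membrane (it is sometimes not located at the N-terminus). Basis of the hypothesis: the information for protein localization resides in the structure of the protein itself and is expressed through interaction with specific receptors on the membrane.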

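Identification and expression analysis of the uncoupling protein gene family in sweet potato

DOI: 10.3724/SP.J.1006.2022.14126
Identification and expression analysis of the uncoupling protein gene family in sweet potato
CHEN Lu, ZHOU Shuqian, LI Yongxin, CHEN Gang, LU Guoquan, YANG Huqing*
College of Food and Health, Zhejiang A&F University, Hangzhou 311300, Zhejiang, China

Abstract: This study aimed to identify and analyze the members of the uncoupling protein (UCP) gene family in sweet potato (Ipomoea batatas (L.) Lam), and to examine their tissue-specific expression and their responses to low temperature (4℃), high salt (NaCl) and drought (PEG-6000) stress.

The sweet potato UCP (IbUCP) family was found to contain five genes, named IbUCP1 (GenBank accession MW753000), IbUCP2 (MW753004), IbUCP3 (MW753001), IbUCP4 (MW753002) and IbUCP5 (MW753003).

The predicted theoretical isoelectric points of the IbUCP proteins are 8.53-9.86, and the proteins contain 261-375 amino acid residues. The IbUCPs are localized to the mitochondria. They are hydrophilic proteins and members of the mitochondrial carrier protein superfamily; their secondary structure consists mainly of α-helix and random coil, which agrees with the tertiary structure prediction. The IbUCPs have no transmembrane helices and no signal peptide. The family has five members, which are closely related to those of Ipomoea triloba and morning glory and are conserved to some degree. Promoter prediction showed that the IbUCP genes carry basic transcription elements together with signal-responsive elements, transcription factor binding elements and stress-responsive cis-acting elements.

Expression analysis showed that the IbUCP family members are tissue-specific: IbUCP4 is expressed most highly in the stem, and the other IbUCPs most highly in the storage root. IbUCP1, IbUCP4 and IbUCP5 respond to low-temperature stress, and all family members respond to high-salt stress. Under drought stress, IbUCP1, IbUCP4 and IbUCP5 all respond, peaking at different times.

Several stresses regulate the expression of the IbUCPs. This study provides a theoretical basis for functional studies of the sweet potato UCP genes and for screening stress-tolerant sweet potato varieties.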


In J.Glasgow et al.,eds.,Proc.Sixth Int.Conf.on Intelligent Systems for Molecular Biology,122-130.AAAI Press,1998.1Prediction of signal peptides and signal anchors by a hidden Markov modelHenrik Nielsen and Anders KroghCenter for Biological Sequence AnalysisTechnical University of DenmarkBuilding206,2800Lyngby,Denmarkhnielsen@cbs.dtu.dk and krogh@cbs.dtu.dkAbstractA hidden Markov model of signal peptides has been devel-oped.It contains submodels for the N-terminal part,the hy-drophobic region,and the region around the cleavage site.Forknown signal peptides,the model can be used to assign objec-tive boundaries between these three regions.Applied to ourdata,the length distributions for the three regions are signifi-cantly different from expectations.For instance,the assignedhydrophobic region is between8and12residues long in al-most all eukaryotic signal peptides.This analysis also makesobvious the difference between eukaryotes,Gram-positivebacteria,and Gram-negative bacteria.The model can be usedto predict the location of the cleavage site,which itfinds cor-rectly in nearly70%of signal peptides in a cross-validatedtest—almost the same accuracy as the best previous method.One of the problems for existing prediction methods is thepoor discrimination between signal peptides and uncleavedsignal anchors,but this is substantially improved by the hid-den Markov model when expanding it with a very simple sig-nal anchor model.IntroductionThe general secretory pathway is a mechanism for proteinsecretion found in both eukaryotic and prokaryotic cells.The entry to the general secretory pathway is controlled bythe signal peptide,an N-terminal peptide typically between15and40amino acids long,which is cleaved from the ma-ture part of the protein during translocation across the mem-brane,see Figure1.The most characteristic common feature of signal pep-tides is a stretch of hydrophobic amino acids called the h-region.The region between the initiator Met and the h-region,the 
n-region,is typically one tofive amino acids inlength,and normally carries positive charge.Between theh-region and the cleavage site is the c-region,which consistsof three to seven polar,but mostly uncharged,amino acids.Close to the cleavage site a more specific pattern of aminoacids is found:the residues at positions3and1(relativeto the cleavage site)must be small and neutral for cleavageto occur correctly(von Heijne1985).Translocation takes place via a multiprotein com-plex known as the translocon or translocation apparatusFigure1:Cartoons of a signal peptide(above)and a signal anchor(below),and how they are translocated by the translocon. After translocation the signal peptide is cleaved off and the mature protein released,whereas the signal anchor is not cleaved off and the protein is anchored to the membrane.Heijne1994).Signal peptide prediction involves two tasks:(1)Given that the sequence is a signal peptide,locate the cleavage site;and(2)discriminate between secretory proteins with signal peptides and non-secretory proteins.Prediction of the cleavage site has been performed with a weight matrix (von Heijne1986b)and by a neural network method,Sig-nalP(Nielsen et al.1997),which also performs the discrim-ination task.SignalP has been available as a WWW and mail server since1996and is very widely used.In this paper we apply a hidden Markov model(HMM) for both prediction tasks.An HMM for proteins consists of a number of states that are connected by transition probabil-ities.Associated with each state is a distribution over the20 amino acids.It is often useful to think of HMMs as gener-ative models that can‘emit’protein sequences by randomly going from state to state,and in each state emit an amino acid according to the distribution for that state.For a given sequence one can calculate for instance the most probable way this sequence was generated by the model,or the total probability that it was generated by the model at all.Because it is a probabilistic 
model,one can use standard methods like maximum likelihood to determine the model parame-ters.Introductions to HMMs can be found in(Rabiner1989; Krogh1998;Durbin et al.1998).In computational biology the most commonly used HMM type is probably the profile HMM(Krogh et al.1994;Eddy1996),which has a struc-ture inspired by profiles(Gribskov,McLachlan,&Eisenberg1987).However,HMMs are more general,and the model structures used in this work are not of the profile type.One of the advantages of HMMs is that it is usually very easy to build biological knowledge into the model in an in-tuitive way—in contrast to e.g.neural networks.For the sig-nal peptides we design the model so that it has parts corre-sponding to each of the three regions of a signal peptide and such that reasonable length constraints are hard-wired in the model.Another advantage of the HMM approach is that the HMM can easily be extended by adding other modules to the model.In this work we combine the signal peptide model with a model of signal anchors,in order to make a model that is good at discriminating between signal peptides and anchors.There are very few known examples of signal an-chors,and therefore it is hard to make good models of these.For this situation,the HMMs have another big advantage:it is very easy to control the model complexity by making the model simple enough to be estimated from the amount of data available.MethodsData setsData were extracted from SWISS-PROT version35(Bairoch &Apweiler1997).Data sets were made for four types of proteins:signal peptides,signal anchors,cytoplasmic,and (for eukaryotes)nuclear.All sets were grouped in subsets for eukaryotes,Gram-positive bacteria,and Gram-negative 2Signal Nuclearproteins anchorsred.red.red.red. 
Euk247716142060164G neg498697G pos222280Figure2:The model used for signal peptides.The states in a shaded box are tied to each other.bution to capture the pattern of amino acids just before the cleavage site(states c1to c6in Figure2).To allow for longer c-regions,four more states(c7to c10)are added,which are tied to each other in order to capture the over-all amino acid distribution of c-regions longer than six.One of these states has a transition to itself so long c-regions are modeled by a geometric distribution.From the last h-state there are transi-tions to all the c-states except the two just before the cleav-age site,making the minimum length of c-regions equal to three.After the cleavage site,four states model the posi-tion specific amino acid distributions before a transition is made to thefinal state with an amino acid distribution equal to a standard background distribution.The six states prior to the cleavage site plus the four states after the cleavage site correspond approximately the weight matrix used ear-lier for signal peptide prediction(von Heijne1986b).The difference is that the states c4through c6can be skipped, which means that the weight matrix-like part does not have to model hydrophobic residues of signal peptides with very short c-regions.Models were estimated from the training data by the Baum-Welch algorithm(Rabiner1989;Durbin et al.1998), which is a maximum likelihood procedure that iteratively increases the total likelihood of the training data.The train-ing was done with the labeled data,such that the cleav-age site was always correctly positioned during training, but the model was left tofind out for itself where to put the boundaries between n-,h-,and c-regions.However, to help the modelfind a sensible partition into regions, we initialized the models:for each of the three regions, the initial distributions were set to the amino acid fre-quencies in the regions as assigned by the simple proce-dure described above.Pseudocounts(Krogh 
et al. 1994; Durbin et al. 1998) were also added, which were obtained by multiplying the same amino acid frequencies by 100. The size of this number is not critical. Each distribution is obtained from more than 1000 amino acids, so the pseudocounts are relatively small.

To predict the cleavage site for a new sequence, the most probable path through the trained model is found by the standard Viterbi algorithm (Rabiner 1989). The most probable path was also used for assigning a region to each amino acid in the sequence.

To discriminate between signal peptides, signal anchors and soluble non-secretory proteins, the model was augmented by a model of anchors as shown in Figure 3. The structure of this model is like the model for signal peptides, but the n- and h-regions are simpler and the c-region is of course omitted. The whole model was now trained from all types of sequences (signal peptides, anchors, cytoplasmic and nuclear). The most likely path through the combined model yields a prediction of which of the three classes the protein belongs to.

Neural network method

The neural network method implemented in the SignalP server is described in detail elsewhere (Nielsen et al. 1997). In the present work, we made no modifications to the architecture of the networks, the training scheme, or the output interpretation; we merely retrained the networks on the new data set (the present version of SignalP is based on SWISS-PROT release 29). In the context of this work, it is important to note that SignalP combines two types of network: the C-score (raw cleavage site score) is the output from a network trained solely on signal peptide sequences to recognize cleavage sites from non-cleavage sites; while the S-score (signal peptide score) is the output from a network trained to recognize windows within signal peptides from windows after the cleavage site and windows in non-secretory proteins. The prediction of cleavage site location is optimized by observing where the C-score is high and the S-score changes from a high to a low value. This is formally implemented by the Y-score (combined cleavage site score), a geometric average of the C-score and a smoothed derivative of the S-score.

Discrimination between signal peptides and non-secretory proteins is done by using either the maximal value of the Y-score or the mean value of the S-score, averaged from position 1 to the most likely cleavage site.

Figure 3: The block diagram (top) shows how the combined model is put together from the signal peptide model and the anchor model. The final states shown in the shaded box are tied to each other, and model all residues not in a signal peptide or an anchor. The model of signal anchors (bottom) has only two types of states (grouped by the shaded boxes) apart from the Met state.

Results and discussion

Performance of the HMM method

The performances of the trained hidden Markov model and neural networks are shown in Table 2. All the results reported are obtained by five-fold cross validation. For cleavage site location, the neural networks are slightly better than the HMM. The observation that the neural networks, even using only the C-score, are able to locate the cleavage site a few percent more precisely than the HMM suggests that there might be a weak non-linear feature involved in the cleavage site recognition.

Discrimination between signal peptides and soluble non-secretory proteins is performed with a version of the HMM where the anchor model is omitted. If the three-module HMM including the signal anchor model is used instead, a few signal peptides are falsely classified as signal anchors, bringing the correlation coefficient for eukaryotic sequences down by 0.02. The simple neural network (the C-score network alone) is poorer than the HMM for discrimination, which is not remarkable, since the non-secretory proteins were not used in the training of this network. The combination of C-score and S-score networks has a discrimination performance comparable to that of the
HMM: for eukaryotes the networks are slightly better, while for Gram-negative bacteria the HMM is slightly better.

The neural network performances were in general comparable to those obtained with the data from SWISS-PROT release 29 as reported in (Nielsen et al. 1997), but the cleavage site location was two percent better for eukaryotes and four percent better for Gram-negative bacteria. Since the number of signal peptide sequences extracted has not grown very much, this suggests that the quality of signal peptide annotations has improved.

Discrimination between cleaved signal peptides and uncleaved signal anchors is shown in the rightmost column of Table 2. The HMM correlation coefficient of 0.74 corresponds to a sensitivity of 71% and a specificity of 81%, a far better performance for this problem than hitherto reported. For the neural network, uncleaved signal anchors can to some degree be identified by intermediate values of the mean S-score, but even when the threshold is optimized specifically for this task, the correlation coefficient does not exceed 0.4. Interestingly, the cleavage site scores provided an even worse discrimination between signal peptides and signal anchors, suggesting that cryptic cleavage sites are not uncommon in signal anchors. These results should not be taken as a claim that the neural network method is unable to

Task: Discrimination sig/anc
Method  Euk  G neg  G pos
HMM  0.94  0.93  0.96  71.8%  81.7%  66.9%  (0.18)
NN (combined)  0.97  0.89  0.96

Figure 4: The length distributions of the n-, h-, and c-regions of signal peptides, and n- and h-regions of signal anchors, as assigned by the trained HMM models. The x-axis is length, and the histograms display the number of sequences in percent. Panels: n-, h-, and c-regions for Euk., Gram-neg., and Gram-pos. signal peptides; n- and h-regions for Euk. signal anchors.

h-region: anc, signal peptide (Euk, Euk, G neg, G pos)
A  9.7 11.6 23.9 18.8  2.7 1.5 0.4 0.3 2.8 1.6 0.0
D  0.1 0.1 0.0 0.1  6.2 2.3 1.2 1.1 3.6 0.5 2.4
F  9.2 8.0 5.1 5.5  6.1 7.2 2.3 3.5 10.7 8.9 7.1
H  0.1 0.3 0.0 0.0  3.0 2.4 6.7 5.0 2.6 3.6 3.6
K  0.0 0.0 0.0 0.0  9.3 8.6 7.2 6.5 8.0 5.3 5.0
M  2.2 1.7 3.1 3.3  3.2 2.7 6.3 5.8 2.1 2.9 5.6
P  0.6 1.1 1.4 1.0  4.1 3.8 3.7 2.9 5.1 3.1 5.1
R  0.0 0.1 0.2 0.1  9.7 10.8 7.2 6.9 11.9 19.3 9.1
T  5.3 3.7 5.4 8.5  3.2 4.5 3.8 3.7 6.2 4.8 6.5
W  1.2 1.6 0.4 0.5  2.2 1.7 2.1 1.9 1.5 1.2 0.7

Table 3: Amino acid distributions in the n-, h-, and c-regions of signal peptides assigned by the trained HMM. Results are shown for eukaryotes (Euk), Gram-negative bacteria (G neg), and Gram-positive bacteria (G pos). For the eukaryotes, n- and h-regions of signal anchors (anc) are also included (the concept of c-regions does not apply to signal anchors). The c-regions do not include the cleavage site consensus (positions -3 through -1).

bacteria in Table 3: the positive charge in the n-region is more dominant in bacteria (up to 40% Lys+Arg for the Gram-positives), while eukaryotes have the most hydrophobic h-region with almost 40% Leu. In the c-region, the most conspicuous feature is the high occurrence of Gly and Pro; again, the Gram-positives stand out as the most extreme group with almost 16% Pro.

Note also the difference between eukaryotic signal anchors and signal peptides: the n-regions of signal anchors are more tolerant to the negatively charged residues Asp and Glu; and the h-region is less dominated by Leu, allowing higher proportions of other hydrophobic residues such as Ile and Val.

Conclusion

In terms of accuracy of the cleavage site prediction, the neural network-based SignalP is slightly better than the hidden Markov model described here. However, the HMM can be used to label the
three different regions of a signal peptide, which yields quite surprising results. It was also demonstrated that the HMM can discriminate well between signal peptides, signal anchors, and other proteins. Because of the small number of known signal anchors, it is not likely that a neural network could be trained to discriminate so well.

An important application for the signal peptide HMM will be analysis of whole genomes and other large datasets derived from single species. Here, we have only considered differences between three large groups of organisms, but it is conceivable that further differences can be found within these groups. Statistical analysis suggests a difference between mammalian and plant signal peptides (von Heijne & Abrahmsén 1989), and there is experimental evidence that a yeast signal peptide can be non-functional in mammalian cells (Bird, Gething, & Sambrook 1987). The HMM can be used to divide the signal peptides into regions and thereby facilitate comparisons between these regions.

Archaea represent a special problem, since very few signal peptides are known experimentally from this domain of life, and therefore it is not clear which, if any, of the SignalP versions will apply. An analysis of signal peptide-like sequences from Methanococcus jannaschii suggests that its signal peptides differ from both their eukaryotic and bacterial counterparts (manuscript in preparation).

When analysing unknown sequences, it is important to note that the type II membrane proteins addressed in this work comprise only a small fraction of the transmembrane proteins. In particular, we have not tested the performance of either the HMM or the NN method on N-terminal parts of multispanning (type IV) transmembrane proteins. A combined model of signal peptides, signal anchors, and other transmembrane helices is clearly needed.

Finally, it has not escaped our notice that the two-peaked length distributions of h-regions might be correlated to a difference in translocation mechanism for two
classes of signal peptides; but this question demands further investigation before anything definitive can be said.

Acknowledgments

We thank Gunnar von Heijne and Erik Sonnhammer for helpful discussions. This work was supported by the Danish National Research Foundation.

References

Bairoch, A., and Apweiler, R. 1997. The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Res. 25:31–36.
Bird, P.; Gething, M.-J.; and Sambrook, J. 1987. Translocation in yeast and mammalian cells: not all signal sequences are functionally equivalent. J. Cell Biol. 105:2905–2914.
Bird, P.; Gething, M.-J.; and Sambrook, J. 1990. The functional efficiency of a mammalian signal peptide is directly related to its hydrophobicity. J. Biol. Chem. 265:8420–8425.
Chou, M. M., and Kendall, D. A. 1990. Polymeric sequences reveal a functional interrelationship between hydrophobicity and length of signal peptides. J. Biol. Chem. 265:2873–2880.
Durbin, R. M.; Eddy, S. R.; Krogh, A.; and Mitchison, G. 1998. Biological Sequence Analysis. Cambridge University Press. To appear.
Eddy, S. R. 1996. Hidden Markov models. Current Opinion in Structural Biology 6:361–365.
Gribskov, M.; McLachlan, A. D.; and Eisenberg, D. 1987. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 84:4355–4358.
Krogh, A.; Brown, M.; Mian, I. S.; Sjölander, K.; and Haussler, D. 1994. Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235:1501–1531.
Krogh, A. 1998. An introduction to hidden Markov models for biological sequences. In Salzberg, S.; Searls, D.; and Kasif, S., eds., Computational Methods in Molecular Biology. Elsevier. chapter 4. To appear.
Mathews, B. W. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405:442–451.
Nielsen, H.; Engelbrecht, J.; von Heijne, G.; and Brunak, S. 1996. Defining a similarity threshold for a functional protein sequence pattern: The signal peptide cleavage site. Proteins 24:165–177.
Nielsen, H.; Brunak, S.; Engelbrecht, J.; and von Heijne, G. 1997. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10:1–6.
Nilsson, I.; Whitley, P.; and von Heijne, G. 1994. The COOH-terminal ends of internal signal and signal-anchor sequences are positioned differently in the ER translocase. J. Cell Biol. 126:1127–1132.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2):257–286.
Rapoport, T. A.; Jungnickel, B.; and Kutay, U. 1996. Protein transport across the eukaryotic endoplasmic reticulum and bacterial inner membranes. Annu. Rev. Biochem. 65:271–303.
Varshavsky, A. 1996. The N-end rule: functions, mysteries, uses. Proc. Natl. Acad. Sci. USA 93:12142–12149.
von Heijne, G., and Abrahmsén, L. 1989. Species-specific variation in signal peptide design. FEBS Lett. 244:439–446.
von Heijne, G. 1985. Signal sequences. The limits of variation. J. Mol. Biol. 184:99–105.
von Heijne, G. 1986a. Net N–C charge imbalance may be important for signal sequence function in bacteria. J. Mol. Biol. 192:287–290.
von Heijne, G. 1986b. A new method for predicting signal sequence cleavage sites. Nucleic Acids Res. 14:4683–4690.
von Heijne, G. 1988. Transcending the impenetrable: How proteins come to terms with membranes. Biochim. Biophys. Acta 947:307–333.
