Bayesian Segmentation of Protein Secondary Structure Scott C. Schmidler
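Use of Wenwang Yizhibi (文王一支笔) Extract in Preparing Drugs for Treating Hepatitis [Invention Patent]
Patent title: Use of Wenwang Yizhibi (文王一支笔) extract in preparing drugs for treating hepatitis
Patent type: Invention patent
Inventors: Yuan Chengfu (袁成福), Xiao Li (肖莉), He Yumin (何毓敏), Peng Fan (彭帆), Wang Yanhua (王艳华)
Application number: CN201911039488.8
Filing date: 20191029
Publication number: CN110772547A
Publication date: 20200211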
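Abstract: The invention discloses the use of Wenwang Yizhibi (文王一支笔) extracts in preparing drugs for treating hepatitis. In AML12 cells in vitro, the total extract obtained by water decoction of Wenwang Yizhibi, its water-extraction/alcohol-precipitation fraction, and its ethyl acetate fraction markedly inhibit PA-stimulated secretion of the inflammatory factors TNF‑α and IL‑1β and suppress mRNA expression of TNF‑α, IL‑1β, and other inflammatory factors; in vivo, all three markedly inhibit HFD-induced mRNA expression of TNF‑α, IL‑1β, and other inflammatory factors in mouse liver tissue.
In addition, the water-extraction/alcohol-precipitation fraction effectively reduces lipid deposition in the hepatocytes of mice with alcoholic hepatitis, reduces inflammatory cell infiltration, and lowers serum ALT and AST activity and the expression levels of the inflammatory factors TNF‑α, IL‑1β, IL‑6, and NLRP3. It thus protects against alcoholic hepatitis by inhibiting lipid deposition and counteracting the inflammatory response.
No mouse deaths and no obvious abnormalities of liver or kidney function were observed during the animal experiments with these extracts, which indicates that the drug has a good safety profile.
Applicant: China Three Gorges University (三峡大学)
Address: 8 Daxue Road, Xiling District, Yichang, Hubei 443002
Nationality: CN
Patent agency: Yichang Sanxia Patent Office (宜昌市三峡专利事务所)
Agent: Cheng Gang (成钢)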
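Machine Learning Methods in Protein Structure Prediction
Proteins are among the most important molecules in living systems, because their structure and function are essential to the normal operation of cells.
Predicting protein structure is an important problem in the biological sciences, because it helps us better understand the biological functions of proteins, the action of drugs, and related questions.
Understanding the machine learning methods used in protein structure prediction gives a clearer view of this problem.
Before turning to protein structure prediction, it is worth reviewing the basics of protein structure.
Proteins have four levels of structure: primary, secondary, tertiary, and quaternary.
Primary structure is the linear amino acid sequence, which is established as the protein is synthesized.
Secondary structure consists of local elements such as α-helices and β-sheets within the protein.
Tertiary structure describes the three-dimensional conformation of the protein, including the arrangement of helices, β-sheets, coils, and other structural features.
Finally, quaternary structure describes the structure of complexes assembled from multiple protein chains.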
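Machine learning methods are very useful in protein structure prediction.
Machine learning makes predictions, classifications, and decisions from data and models rather than from hand-specified rules.
These methods have the computer learn from large amounts of data in order to predict and classify new inputs.
In protein structure prediction, machine learning can help us better understand the patterns in protein structure.
A commonly used machine learning method is the neural network.
A neural network is a model that uses interconnected units to mimic the networks of neurons in the human brain.
In protein structure prediction, neural networks can be used to predict a protein's secondary structure.
A popular approach to secondary structure prediction with neural networks is the fully convolutional neural network, which maps the entire input sequence to an output sequence (for example, one structure label per residue).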
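Another machine learning method is the support vector machine (SVM).
An SVM is an algorithm that can map input data into a high-dimensional space and construct a separating hyperplane there.
In protein structure prediction, SVMs can be used to predict tertiary structure.
They predict the spatial conformation of a protein from extracted structural features, so the raw protein sequence must be processed before prediction.
This processing encodes the sequence in a specific way and uses a feature extraction algorithm to convert the structural information in the protein sequence into feature vectors.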
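The encoding step described above can be sketched as follows. This is a minimal illustration, not the method of any particular predictor: it assumes a one-hot encoding of the 20 standard amino acids and a centered sliding window, and the names `one_hot`, `window_features`, and the `window` parameter are hypothetical choices made for this example.

```python
# Illustrative sketch only: one-hot encoding with a centered sliding window.
# All names here are hypothetical and chosen for this example.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue):
    """Return a 20-element 0/1 list; unknown residues map to all zeros."""
    vec = [0] * len(AMINO_ACIDS)
    idx = AA_INDEX.get(residue)
    if idx is not None:
        vec[idx] = 1
    return vec


def window_features(sequence, window=3):
    """Build one feature vector per residue from a centered window.

    Window positions that fall outside the sequence are padded with zeros.
    `window` must be odd so that the window has a well-defined center.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    features = []
    for i in range(len(sequence)):
        vec = []
        for j in range(i - half, i + half + 1):
            if 0 <= j < len(sequence):
                vec.extend(one_hot(sequence[j]))
            else:
                vec.extend([0] * len(AMINO_ACIDS))
        features.append(vec)
    return features


if __name__ == "__main__":
    feats = window_features("ACD", window=3)
    print(len(feats), len(feats[0]))  # one vector per residue, 3 * 20 values each
```

Each residue yields a vector of `window * 20` values, which is the kind of fixed-length input an SVM or a simple neural network expects. Real pipelines often use richer encodings than this one.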
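In summary, machine learning methods are very important in protein structure prediction.
Predicting protein structure is a large computational task that consumes substantial computing resources and data.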
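Protein Domain Prediction
Protein domain prediction is an important task in protein function annotation.
A protein domain is a contiguous sequence segment within a protein that has a specific structure and function.
Accurately predicting protein domains helps us understand the function and mechanism of action of proteins, and is important for fields such as drug design and disease treatment.
With the rapid development of high-throughput sequencing, large amounts of protein sequence data have accumulated, and protein domain prediction methods have advanced considerably.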
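Alignment-based methods align the query sequence against the sequences in a library of known domains and use the alignment results to decide whether the query contains a particular domain.
This approach can identify sequences belonging to known domains, but it performs poorly on newly discovered domains and on sequences with low similarity to known domains.
Machine-learning-based methods use sequences of known domains and non-domain sequences as a training set, build a predictive model with a machine learning algorithm, and then apply that model to the query sequence.
This approach can predict newly discovered domains as well as sequences with low similarity to known domains.
At present, machine-learning-based methods dominate protein domain prediction.
Common machine learning algorithms include SVM (support vector machine), DT (decision tree), and RF (random forest).
These algorithms learn the features of known domains and of non-domain regions in order to distinguish domain sequences from non-domain sequences.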
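Besides these classical machine learning algorithms, artificial neural networks (ANNs) are also commonly used as predictive models.
An ANN model builds a multilayer neural network and computes the relationship between input and output by adjusting its weight and threshold parameters.
Training on labeled samples optimizes the network's parameters so that it can make accurate predictions for query sequences.
In addition, some emerging prediction methods are gradually coming into use.
One example is consensus prediction, which integrates the results of different predictors.
This approach can exploit the strengths of multiple prediction methods and improve prediction accuracy.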
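One simple way to integrate several predictors, as described above, is a per-residue majority vote. The sketch below is an assumption-laden illustration rather than a published consensus method: it assumes each predictor outputs a label string of the same length as the sequence (here `D` for domain and `N` for non-domain, both hypothetical labels), and it breaks ties in favor of the predictor listed first.

```python
# Illustrative sketch only: per-residue majority vote over several predictors.
# The label alphabet ("D"/"N") and the tie-breaking rule are assumptions.
from collections import Counter


def consensus_labels(predictions):
    """Combine equal-length label strings by majority vote at each position.

    Ties are broken in favor of the earliest predictor in `predictions`.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")

    consensus = []
    for column in zip(*predictions):
        counts = Counter(column)
        best = max(counts.values())
        # Pick the first label, in predictor order, that reaches the top count.
        winner = next(label for label in column if counts[label] == best)
        consensus.append(winner)
    return "".join(consensus)


if __name__ == "__main__":
    print(consensus_labels(["DDNN", "DNNN", "DDDN"]))
```

Weighted voting, where each predictor's vote is scaled by its measured accuracy, is a common refinement. The unweighted version is shown here because it needs no extra assumptions about the predictors.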
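Meanwhile, deep-learning-based methods are also gradually being applied to protein domain prediction.
Deep learning uses multilayer neural network models for feature and representation learning and can discover hidden regularities and patterns in massive data sets, further improving prediction performance.
In summary, accurate prediction of protein domains is important for life science research and drug design.
Alignment-based and machine-learning-based methods have made significant progress, and with continued innovation and technical advances, prediction methods will become more accurate and effective.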
2003-Nanoparticle-based bio-bar codes for the ultrasensitive detection of proteins
Jwa-Min Nam et al., Science 301, 1884 (2003). DOI: 10.1126/science.1088755
REPORTS. Science, vol. 301, 26 September 2003, p. 1884.

Jwa-Min Nam,* C. Shad Thaxton,* Chad A. Mirkin†

Department of Chemistry and Institute for Nanotechnology, Northwestern University, 2145 Sheridan Road, Evanston, IL 60201, USA.
*These authors contributed equally to the work.
†To whom correspondence should be addressed. E-mail: camirkin@

An ultrasensitive method for detecting protein analytes has been developed. The system relies on magnetic microparticle probes with antibodies that specifically bind a target of interest [prostate-specific antigen (PSA) in this case] and nanoparticle probes that are encoded with DNA that is unique to the protein target of interest and antibodies that can sandwich the target captured by the microparticle probes. Magnetic separation of the complexed probes and target followed by dehybridization of the oligonucleotides on the nanoparticle probe surface allows the determination of the presence of the target protein by identifying the oligonucleotide sequence released from the nanoparticle probe. Because the nanoparticle probe carries with it a large number of oligonucleotides per protein binding event, there is substantial amplification and PSA can be detected at 30 attomolar concentration. Alternatively, a polymerase chain reaction on the oligonucleotide bar codes can boost the sensitivity to 3 attomolar. Comparable clinically accepted conventional assays for detecting the same target have sensitivity limits of ~3 picomolar, six orders of magnitude less sensitive than what is observed with this method.

The polymerase chain reaction (PCR) and other forms of target amplification have enabled rapid advances in the development of powerful tools for detecting and quantifying DNA targets of interest for research, forensic, and clinical applications (1–3). The development of comparable target amplification methods for proteins could substantially improve medical diagnostics and the developing field of proteomics (4–7). Although one cannot yet chemically duplicate protein targets, it is possible to tag such targets with oligonucleotide markers that can be subsequently amplified with PCR and then use DNA detection to identify the target of interest (8–13). This approach, often referred to as immuno-PCR, allows the detection of proteins with DNA markers in a variety of different formats. Thus far, all immuno-PCR approaches involve initial immobilization of a target analyte to a surface and subsequent detection with an antibody (Ab) with a DNA marker. The DNA marker is typically strongly bound to the Ab (either through covalent interactions or streptavidin-biotin binding). Although these approaches are considerable advances in protein detection, they have several drawbacks: (i) a low ratio of DNA identification sequence to detection Ab, which limits sensitivity, (ii) slow target-binding kinetics because of the heterogeneous nature of the target-capture procedure, which increases assay time and decreases assay sensitivity, (iii) complex conjugation chemistries that are required to link the Ab and DNA markers, and (iv) PCR requirements (14).

Herein, we report a nanoparticle-based bio–bar-code approach to detect a protein target, free prostate-specific antigen (PSA), at low attomolar concentrations (Fig. 1). PSA was chosen as the initial target for these studies because of its importance in the detection of prostate and breast cancer, the most common cancers and the second leading cause of cancer death among American men and women, respectively (15–18). Identification of disease relapse after the surgical treatment of prostate cancer using PSA as a marker present at low levels (10s of copies) could be extremely beneficial and enable the delivery of curative adjuvant therapies (17, 19). Furthermore, PSA is found in the sera of breast cancer patients, and it is beginning to be explored as a breast cancer screening target (16). Because the concentration of free PSA is much lower in women's serum as compared to that of men, an ultrasensitive test is needed for breast cancer screening and diagnosis.

The bio–bar-code assay reported herein uses two types of probes, magnetic microparticles (MMPs, 1-μm diameter polyamine particles with magnetic iron oxide cores) functionalized with PSA monoclonal antibodies (mAbs) (Fig. 1A) (20) and gold nanoparticles (NP) heavily functionalized with hybridized oligonucleotides (the bio–bar codes; 5′ ACACAACTGTGTTCACTAGCGTTGAACGTGGATGAAGTTG 3′) (7, 21, 22) and polyclonal detection Abs to recognize PSA (Fig. 1A) (20). In a typical PSA detection experiment (Fig. 1B), the gold NPs and the MMPs sandwich the PSA target, generating a complex with a large ratio of bar-code DNA to protein target (23). Application of a magnetic field draws the MMPs to the wall of the reaction vessel in a matter of seconds, allowing the separation of all of the MMPs but only the reacted NPs from the reaction mixture. Washing the aggregate structures in NANOpure water (18 megohm; Barnstead International, Dubuque, IA) dehybridizes bar-code DNA from NP-immobilized complements. With the use of the magnetic separator, we readily removed the aggregate from the assay solution to leave only the bar-code DNA, which can be quickly identified by standard DNA detection methodologies [e.g., gel electrophoresis, fluorophore-labeling, and scanometric (24) approaches] that may or may not rely on PCR (Fig. 1B).

Although gel electrophoresis was routinely used to analyze the results of the assay (20), in general the scanometric method provided higher sensitivity and was easier to implement than the gel-based method. Therefore, the results of the scanometric assay are reported herein. In the case of PCR-less detection, 30-nm gold particles were used during the detection step instead of 13-nm gold particles to increase the amount of detectable bar-code DNA (Fig. 1B, step 2). For bar-code DNA identification, chip-immobilized DNA 20-mers [5′ SH-(CH2)6-A10-CAACTTCATCCACGTTCAAC 3′], which are complementary with half of the target bar-code sequence, were used to capture the isolated bar-code DNA sequences, and oligonucleotide-modified 13-nm gold NPs [5′ GCTAGTGAACACAGTTGTGT-A10-(CH2)3-SH 3′-Au] were used to label the other half of the sequence in a sandwich assay format. Chips with hybridized NP probes are then subjected to silver amplification (25), which results in gray spots that can be read with a Verigene ID (identification) system (Nanosphere, Incorporated, Northbrook, IL) that measures light scattered from the developed spots. Target PSA concentrations from 300 fM to 3 aM were detected with the use of the PCR-coupled approach (Fig. 2). The use of this approach in a more complicated medium such as goat serum provided a detection sensitivity of 30 aM, with clear differentiation from background signal (fig. S4). The selectivity for the bar-code DNA sequence was excellent, as evidenced by the lack of signal from the control spots with noncomplementary capture DNA [5′ SH-(CH2)6-A10-GGCAGCTCGTGGTGA 3′] and the observation that there is little discernible signal when PSA is absent (Fig. 2).

Importantly, one can eliminate the PCR step and still obtain a high sensitivity assay by using larger nanoparticles (30 nm), which can support larger absolute amounts of bar-code DNA. With such an assay, it was possible to detect PSA at 30 aM concentration in a 10-μl sample (Fig. 2, inset). This substantially simplifies the overall complexity of the assay and still yields a sensitivity that is five orders of magnitude greater than the cited commercial assay sensitivity (19) and two orders of magnitude greater than that cited for immuno-PCR on the same target under near-identical conditions (13).

The bio–bar-code method offers several advantages over current protein detection methods. First, the target-binding portion of the assay is homogeneous (in the sense that the capture antibodies on the MMPs are dispersed in solution as opposed to the flat surface of a microarray or titer plate). Therefore, we can add a large quantity of MMPs to the reaction vessel to facilitate the binding kinetics between the detection antibody and target analyte. Homogeneous mixing makes this assay faster than heterogeneous immuno-PCR systems and also can increase sensitivity because the capturing step is more efficient (the equilibrium can be pushed toward the captured protein state by increasing the concentration of MMP probe, which cannot be done in the heterogeneous assay). Second, the use of the NP bio–bar codes provides a high ratio of PCR-amplifiable DNA to labeling Ab that can substantially increase assay sensitivity. Third, this assay obviates the need for complicated conjugation chemistry for attaching DNA to the labeling Abs. Bar-code DNA is bound to the NP probe through hybridization at the start of the labeling reaction and liberated for subsequent identification with a simple wash step. Because the labeling Ab and DNA are present on the same particle, there is no need for the addition of further antibodies or DNA-protein conjugates before the identification of bar-code DNA. In addition, the bar-code DNA is removed from the detection assay, and direct detection or PCR is carried out on samples of bar-code DNA that are free from PSA, most of the biological sample, microparticles, and nanoparticles. This step substantially reduces background signal. Finally, this protein detection scheme has the potential for massive multiplexing and the simultaneous detection of many analytes in one solution, especially in the PCR-less form. Although the PSA system is used for the proof of concept, the approach should be general for almost any target with known Abs, and, by using the NP-based bio–bar-code approach (7), one could prepare a distinct identifiable bar code for nearly every target of interest.

Fig. 1. The bio–bar-code assay method. (A) Probe design and preparation. (B) PSA detection and bar-code DNA amplification and identification. In a typical PSA-detection experiment, an aqueous dispersion of MMP probes functionalized with mAbs to PSA (50 μl of 3 mg/ml magnetic probe solution) was mixed with an aqueous solution of free PSA (10 μl of PSA) and stirred at 37°C for 30 min (Step 1). A 1.5-ml tube containing the assay solution was placed in a BioMag microcentrifuge tube separator (Polysciences, Incorporated, Warrington, PA) at room temperature. After 15 s, the MMP-PSA hybrids were concentrated on the wall of the tube. The supernatant (solution of unbound PSA molecules) was removed, and the MMPs were resuspended in 50 μl of 0.1 M phosphate-buffered saline (PBS) (repeated twice). The NP probes (for 13-nm NP probes, 50 μl at 1 nM; for 30-nm NP probes, 50 μl at 200 pM), functionalized with polyclonal Abs to PSA and hybridized bar-code DNA strands, were then added to the assay solution. The NPs reacted with the PSA immobilized on the MMPs and provided DNA strands for signal amplification and protein identification (Step 2). This solution was vigorously stirred at 37°C for 30 min. The MMPs were then washed with 0.1 M PBS with the magnetic separator to isolate the magnetic particles. This step was repeated four times, each time for 1 min, to remove everything but the MMPs (along with the PSA-bound NP probes). After the final wash step, the MMP probes were resuspended in NANOpure water (50 μl) for 2 min to dehybridize bar-code DNA strands from the nanoparticle probe surface. Dehybridized bar-code DNA was then easily separated and collected from the probes with the use of the magnetic separator (Step 3). For bar-code DNA amplification (Step 4), isolated bar-code DNA was added to a PCR reaction mixture (20-μl final volume) containing the appropriate primers, and the solution was then thermally cycled (20). The bar-code DNA amplicon was stained with ethidium bromide and mixed with gel-loading dye (20). Gel electrophoresis or scanometric DNA detection (24) was then performed to determine whether amplification had taken place. Primer amplification was ruled out with appropriate control experiments (20). Notice that the number of bound NP probes for each PSA is unknown and will depend upon target protein concentration.

Fig. 2. Scanometric detection of PSA-specific bar-code DNA. PSA concentration (sample volume of 10 μl) was varied from 300 fM to 3 aM and a negative control sample where no PSA was added (control) is shown. For all seven samples, 2 μl of antidinitrophenyl (10 pM) and 2 μl of β-galactosidase (10 pM) were added as background proteins. Also shown is PCR-less detection of PSA (30 aM and control) with 30 nm NP probes (inset). Chips were imaged with the Verigene ID system (20).

References and Notes
1. R. K. Saiki et al., Science 230, 1350 (1985).
2. R. A. Gibbs, Curr. Opin. Biotechnol. 2, 69 (1991).
3. S. A. Bustin, J. Mol. Endocrinol. 29, 23 (2002).
4. G. MacBeath, S. L. Schreiber, Science 289, 1760 (2000).
5. H. Zhu et al., Science 293, 2101 (2001); published online 26 July 2001; 10.1126/science.1062191.
6. B. B. Haab, M. J. Dunham, P. O. Brown, Genome Biol. 2(2): research0004.1 (2001).
7. J.-M. Nam, S.-J. Park, C. A. Mirkin, J. Am. Chem. Soc. 124, 3820 (2002).
8. T. Sano, C. L. Smith, C. R. Cantor, Science 258, 120 (1992).
9. A. McKie, D. Samuel, B. Cohen, N. A. Saunders, J. Immunol. Methods 270, 135 (2002).
10. H. Zhou, R. J. Fisher, T. S. Papas, Nucl. Acids Res. 21, 6038 (1993).
11. E. R. Hendrickson, T. M. Hatfield-Truby, R. D. Joerger, W. R. Majarian, R. C. Ebersole, Nucl. Acids Res. 23, 522 (1995).
12. C. M. Niemeyer et al., Nucl. Acids Res. 27, 4553 (1999).
13. B. Schweitzer et al., Proc. Natl. Acad. Sci. U.S.A. 97, 10113 (2000).
14. C. M. Niemeyer, Trends Biotechnol. 20, 395 (2002).
15. J. L. Stanford et al., Prostate Cancer Trends 1973–1995 [National Institutes of Health (NIH) Pub. No. 99-4543, Surveillance, Epidemiology, and End Results (SEER) Program, National Cancer Institute, Bethesda, MD, 1999].
16. M. H. Black et al., Clin. Cancer Res. 6, 467 (2000).
17. R. A. Ferguson, H. Yu, M. Kalyvas, S. Zammit, E. P. Diamandis, Clin. Chem. 42, 675 (1996).
18. L. A. G. Ries et al., SEER Cancer Statistics Review 1973–1999 (SEER Program, National Cancer Institute, Bethesda, MD, 2002).
19. H. Yu, E. P. Diamandis, A. F. Prestigiacomo, T. A. Stamey, Clin. Chem. 41, 430 (1995).
20. Detailed materials and methods can be found on Science Online.
21. C. A. Mirkin, R. L. Letsinger, R. C. Mucic, J. J. Storhoff, Nature 382, 607 (1996).
22. Y. C. Cao, R. Jin, C. A. Mirkin, Science 297, 1536 (2002).
23. For 13-nm NPs, each NP can support up to 100 strands of DNA with the proteins present (26); this value represents the upper limit, and the exact number for these particles has not yet been determined.
24. T. A. Taton, C. A. Mirkin, R. L. Letsinger, Science 289, 1757 (2000).
25. Silver enhancement kit was purchased from Ted Pella (Redding, CA), and silver enhancement time was 6 min.
26. L. M. Demers et al., Anal. Chem. 72, 5535 (2000).
27. C.A.M. acknowledges the Air Force Office of Scientific Research, the Defense Advanced Research Projects Agency, and the NSF for support of this research. C.S.T. acknowledges the Howard Hughes Medical Institute for a Training fellowship. A. Schaeffer is acknowledged for helpful discussions, and V.evenson assisted in PCR optimization.

Supporting Online Material
/cgi/content/full/301/5641/1884/DC1
Materials and Methods
Figs. S1 to S5
References and Notes

3 July 2003; accepted 28 August 2003
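The sensitivity comparisons quoted in the bio–bar-code report above (roughly 3 pM for conventional assays, 3 aM with PCR, and 30 aM without PCR) can be checked with a short calculation. The sketch below only restates those figures from the text; the function and constant names are hypothetical.

```python
# Illustrative check of the "orders of magnitude" statements in the report.
# Concentrations are taken from the text; all names are hypothetical.
import math

CONVENTIONAL_M = 3e-12   # ~3 picomolar, conventional clinical assays
PCR_M = 3e-18            # 3 attomolar, PCR-coupled bio-bar-code assay
PCR_LESS_M = 30e-18      # 30 attomolar, PCR-less assay with 30-nm particles


def orders_of_magnitude(less_sensitive, more_sensitive):
    """Return log10 of the ratio between two detection limits."""
    return math.log10(less_sensitive / more_sensitive)


if __name__ == "__main__":
    # Conventional vs. PCR-coupled: about six orders of magnitude.
    print(round(orders_of_magnitude(CONVENTIONAL_M, PCR_M)))
    # Conventional vs. PCR-less: about five orders of magnitude.
    print(round(orders_of_magnitude(CONVENTIONAL_M, PCR_LESS_M)))
```

Both results agree with the report's wording: six orders of magnitude for the PCR-coupled assay and five for the PCR-less assay.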
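Cell-Free DNA Methylation Patterns for Disease and Condition Analysis [Invention Patent]
Patent title: Cell-free DNA methylation patterns for disease and condition analysis
Patent type: Invention patent
Inventors: Xianghong Jasmine Zhou (向红·婕思敏·周), Shuli Kang (康舒里), Wenyuan Li (李文渊), Steven Dubinett (史蒂文·杜比尼特), Qingjiao Li (李青娇)
Application number: CN201780047763.3
Filing date: 20170607
Publication number: CN110168099A
Publication date: 20190823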
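Abstract: Disclosed herein are methods and systems that use sequencing reads to detect and quantify the presence of a tissue type or cancer type in cell-free DNA prepared from a blood sample.
Applicants: The Regents of the University of California; University of Southern California
Address: California, USA
Nationality: US
Patent agency: Beijing Baishansong Intellectual Property Agency (General Partnership) (北京柏杉松知识产权代理事务所(普通合伙))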
Application of Chromatin Immunoprecipitation Assay in Deciphering DNA-Protein Interactions
HEREDITAS (Beijing) 27(5): 801-807, 2005. Techniques and Methods.
WANG Chun-Yu1, SHI Jian-Dang1, ZHU Yan1,2, ZHANG Ju1
1. The Key Laboratory of Bioactive Materials, Ministry of Education, Institute for Molecular Biology, Nankai University, Tianjin 300071, China
2. Laboratory of Molecular Physiology, Division of Cardiovascular Research, St. Elizabeth's Medical Center, Tufts University School of Medicine, Boston, MA 02135-2997, USA
Abstract: In the post-genomic era, identifying and characterizing various DNA-protein interactions is a major challenge in research on gene transcriptional regulation.
Although many techniques can be used for this purpose, the chromatin immunoprecipitation assay (ChIP) is ideally suited for studying DNA-protein interactions in vivo.
In recent years, the standard ChIP assay has been modified to uncover the unknown target sequences of known factors, especially when combined with DNA microarray and molecular cloning strategies. These high-throughput ChIP assays are increasingly used to reveal the distribution profile of trans-acting factor binding sites throughout the genome, which may yield many new insights into the DNA-protein interaction network.
This article summarizes the methods of the ChIP assay and highlights recent progress in the application of this technique.
Key words: DNA-protein interactions; chromatin immunoprecipitation assay (ChIP); DNA microarray; ChIP cloning
CLC number: Q78. Document code: A. Article ID: 0253-9772(2005)05-0801-07.
Received: 2004-07-08; revised: 2004-08-04.
Funding: Supported by the National Natural Science Foundation of China (No. 30471733, 30271297), the Fund for Chinese Scholars Abroad to Work and Lecture in China, Two Base Project (No. 30410403237), and the 2002 Specialized Research Fund for the Doctoral Program of Higher Education (No. 20020055026).
About the author: WANG Chun-Yu (born 1979), male, from Tianjin, doctoral student; research area: medical molecular biology.
Global landscape of protein complexes in the yeast Saccharomyces cerevisiae
Global landscape of protein complexes in the yeast Saccharomyces cerevisiaeNevan J.Krogan1,2*†,Gerard Cagney1,3*,Haiyuan Yu4,Gouqing Zhong1,Xinghua Guo1,Alexandr Ignatchenko1, Joyce Li1,Shuye Pu5,Nira Datta1,Aaron P.Tikuisis1,Thanuja Punna1,Jose´M.Peregrı´n-Alvarez5,Michael Shales1,Xin Zhang1,Michael Davey1,Mark D.Robinson1,Alberto Paccanaro4,James E.Bray1, Anthony Sheung1,Bryan Beattie6,Dawn P.Richards6,Veronica Canadien6,Atanas Lalev1,Frank Mena6,Peter Wong1,Andrei Starostine1,Myra M.Canete1,James Vlasblom5,Samuel Wu5,Chris Orsi5,Sean R.Collins7, Shamanta Chandran1,Robin Haw1,Jennifer J.Rilstone1,Kiran Gandi1,Natalie J.Thompson1,Gabe Musso1, Peter St Onge1,Shaun Ghanny1,Mandy m1,2,Gareth Butland1,Amin M.Altaf-Ul8,Shigehiko Kanaya8,Ali Shilatifard9,Erin O’Shea10,Jonathan S.Weissman7,C.James Ingles1,2,Timothy R.Hughes1,2,John Parkinson5, Mark Gerstein4,Shoshana J.Wodak5,Andrew Emili1,2&Jack F.Greenblatt1,2Identification of protein–protein interactions often provides insight into protein function,and many cellular processes are performed by stable protein complexes.We used tandem affinity purification to process4,562different tagged proteins of the yeast Saccharomyces cerevisiae.Each preparation was analysed by both matrix-assisted laser desorption/ ionization–time offlight mass spectrometry and liquid chromatography tandem mass spectrometry to increase coverage and accuracy.Machine learning was used to integrate the mass spectrometry scores and assign probabilities to the protein–protein interactions.Among4,087different proteins identified with high confidence by mass spectrometry from 2,357successful purifications,our core data set(median precision of0.69)comprises7,123protein–protein interactions involving2,708proteins.A Markov clustering algorithm organized these interactions into547protein complexes averaging4.9subunits per complex,about half of them absent from the MIPS database,as well as429additional interactions between pairs of complexes.The data(all of 
which are available online)will help future studies on individual proteins as well as functional genomics and systems biology.Elucidation of the budding yeast genome sequence1initiated a decade of landmark studies addressing key aspects of yeast cell biology on a system-wide level.These included microarray-based analysis of gene expression2,screens for various biochemical activi-ties3,4,identification of protein subcellular locations5,6,and identify-ing effects of single and pairwise gene disruptions7–10.Other efforts were made to catalogue physical interactions among yeast proteins, primarily using the yeast two-hybrid method11,12and direct purifi-cation via affinity tags13,14;many of these interactions are conserved in other organisms15.Data from the yeast protein–protein interaction studies have been non-overlapping to a surprising degree,a fact explained partly by experimental inaccuracy and partly by indications that no single screen has been comprehensive16.Proteome-wide purification of protein complexesOf the various high throughput experimental methods used thus far to identify protein–protein interactions11–14,tandem affinity purification(TAP)of affinity-tagged proteins expressed from their natural chromosomal locations followed by mass spectrometry13,17 has provided the best coverage and accuracy16.To map more completely the yeast protein interaction network(interactome), S.cerevisiae strains were generated with in-frame insertions of TAP tags individually introduced by homologous recombination at the 30end of each predicted open reading frame(ORF)(http:// /)18,19.Proteins were purified from4L yeast cultures under native conditions,and the identities of the co-purifying proteins(preys)determined in two complementary ways17.Each purified protein preparation was electrophoresed on an SDS polyacrylamide gel,stained with silver,and visible bands removed and identified by trypsin digestion and peptide mass fingerprinting using matrix-assisted laser 
desorption/ionization–time offlight(MALDI–TOF)mass spectrometry.In parallel,another aliquot of each purified protein preparation was digested in solution and the peptides were separated and sequenced by data-dependent liquid chromatography tandem mass spectrometry(LC-MS/ MS)17,20–22.Because either mass spectrometry method often fails toARTICLES1Banting and Best Department of Medical Research,Terrence Donnelly Centre for Cellular and Biomolecular Research,University of Toronto,160College St,Toronto,OntarioM5S3E1,Canada.2Department of Medical Genetics and Microbiology,University of Toronto,1Kings College Circle,Toronto,Ontario M5S1A8,Canada.3Conway Institute, University College Dublin,Belfield,Dublin4,Ireland.4Department of Molecular Biophysics and Biochemistry,266Whitney Avenue,Yale University,PO Box208114,New Haven, Connecticut06520,USA.5Hospital for Sick Children,555University Avenue,Toronto,Ontario M4K1X8,Canada.6Affinium Pharmaceuticals,100University Avenue,Toronto, Ontario M5J1V6,Canada.7Howard Hughes Medical Institute,Department of Cellular and Molecular Pharmacology,UCSF,Genentech Hall S472C,60016th St,San Francisco, California94143,USA.8Comparative Genomics Laboratory,Nara Institute of Science and Technology8916-5,Takayama,Ikoma,Nara630-0101,Japan.9Department of Biochemistry,Saint Louis University School of Medicine,1402South Grand Boulevard,St Louis,Missouri63104,USA.10Howard Hughes Medical Institute,Department of Molecular and Cellular Biology,Harvard University,7Divinity Avenue,Cambridge,Massachusetts02138,USA.†Present address:Department of Cellular and Molecular Pharmacology,UCSF,San Francisco,California94143,USA.*These authors contributed equally to this work.identify a protein,we used two independent mass spectrometry methods to increase interactome coverage and confidence.Among the attempted purifications of4,562different proteins(Supplemen-tary Table S1),including all predicted non-membrane proteins,2,357 purifications were successful(Supplementary 
Table S2) in that at least one protein was identified (in 1,613 cases by MALDI–TOF mass spectrometry and in 2,001 cases by LC-MS/MS; Fig. 1a) that was not present in a control preparation from an untagged strain. In total, 4,087 different yeast proteins were identified as preys with high confidence (≥99%; see Methods) by MALDI–TOF mass spectrometry and/or LC-MS/MS, corresponding to 72% of the predicted yeast proteome (Supplementary Table S3). Smaller proteins with a relative molecular mass (Mr) of <35,000 were less likely to be identified (Fig. 1b), perhaps because they generate fewer peptides suited for identification by mass spectrometry. We were more successful in identifying smaller proteins by LC-MS/MS than by MALDI–TOF mass spectrometry, probably because smaller proteins stain less well with silver or ran off the SDS gels. Our success in protein identification was unrelated to protein essentiality (data not shown) and ranged from 80% for low abundance proteins to over 90% for high abundance proteins (Fig. 1c). Notably, we identified 47% of the proteins not detected by genome-wide western blotting [18], indicating that affinity purification followed by mass spectrometry can be more sensitive. Many hypothetical proteins not detected by western blotting [18] or our mass spectrometry analyses may not be expressed in our standard cell growth conditions. Although our success rates for identifying proteins were 94% and 89% for nuclear and cytosolic proteins, respectively, and at least 70% in most cellular compartments (Fig. 1d), they were lower (61% and 59%, respectively) for the endoplasmic reticulum and vacuole. However, even though we had not tagged or purified most proteins with transmembrane domains, we identified over 70% of the membrane-associated proteins, perhaps because our extraction and purification buffers contained 0.1% Triton X-100. Our identification success rate was lowest (49%) with proteins for which localization was not established [5,6], many of which may not be expressed. We had high success in identifying proteins
involved in all biological processes, as defined by gene ontology (GO) nomenclature, or possessing any broadly defined GO molecular function (Fig. 1e, f). We were less successful (each about 65% success) with transporters and proteins of unknown function; many of the latter may not be expressed.
A high-quality data set of protein–protein interactions
Deciding whether any two proteins interact based on our data must encompass results from two purifications (plus repeat purifications, if performed) and integrate reliability scores from all protein identifications by mass spectrometry. Removed from consideration as likely nonspecific contaminants were 44 preys detected in ≥3% of the purifications and nearly all cytoplasmic ribosomal subunits (Supplementary Table S4). Although the cytosolic ribosomes and pre-ribosomes, as well as some associated translation factors, are not represented in the interaction network and protein complexes we subsequently identified, we previously described the interactome for proteins involved in RNA metabolism and ribosome biogenesis [22]. We initially generated an ‘intersection data set’ of 2,357 protein–protein interactions based only on proteins identified in at least one purification by both MALDI–TOF mass spectrometry and LC-MS/MS with relatively low thresholds (70%) (Supplementary Table S5). This intersection data set containing 1,210 proteins was of reasonable quality but limited in scope (Fig. 2b). Our second approach added to the intersection data set proteins identified either reciprocally or repeatedly by only a single mass spectrometry method to generate the ‘merged data set’. The merged data set containing 2,186 proteins and 5,496 protein–protein interactions (Supplementary Table S6) had better coverage than the intersection network (Fig. 2b).
Figure 1 | The yeast interactome encompasses a large proportion of the predicted proteome. a, Summary of our screen for protein interactions. PPI, protein–protein interactions. b–f, The proportions of proteins identified in the screen as baits or preys are shown in relation to protein mass (b), expression level (c), intracellular localization (d) and annotated GO molecular function (e) and GO biological process (f).
ARTICLES, Nature, Vol 440, 30 March 2006
To deal objectively with noise in the raw data and improve precision and recall, we used machine learning algorithms with two rounds of learning. All four classifiers were validated by the hold-out method (66% for training and 33% for testing) and ten-times tenfold cross-validation, which gave similar results. Because our objective was to identify protein complexes, we used the hand-curated protein complexes in the MIPS reference database [23] as our training set. Our goal was to assign a probability that each pairwise interaction is true based on experimental reproducibility and mass spectrometry scores from the relevant purifications (see Methods). In the first round of learning, we tested bayesian inference networks and 28 different kinds of decision trees [24], settling on bayesian networks and C4.5-based and boosted stump decision trees as providing the most reliable predictions (Fig. 2a). We then improved performance by using the output of the three methods as input for a second round of learning with a stacking algorithm in which logistic regression was the learner [25]. We used a probability cut-off of 0.273 (average 0.68; median 0.69) to define a ‘core’ data set of 7,123 protein–protein interactions involving 2,708 proteins (Supplementary Table S7) and a cut-off of 0.101 (average 0.42; median 0.27) for an ‘extended’ data set of 14,317 protein–protein interactions involving 3,672 proteins (Supplementary Table S8). The interaction probabilities in Supplementary Tables S7 and S8 are likely to be underestimated because the MIPS
complexes used as a ‘gold standard’ are themselves imperfect [26]. We subsequently used the core protein–protein interaction data set to define protein complexes (see below), but the extended data set probably contains at least 1,000 correct interactions (as well as many more false interactions) not present in the core data set.
The complete set of protein–protein interactions and their associated probabilities (Supplementary Table S9) were used to generate a ROC curve with a performance (area under the curve) of 0.95 (Fig. 2b). Predictive sensitivity (true positive rate) or specificity (1 − false positive rate), or both, are better for our learned data set than for the intersection and merged data sets, each previous high-throughput study of yeast protein–protein interactions [11–14], or a bayesian combination of the data from all these studies [27] (Fig. 2b).
Identification of complexes within the interaction network
In the protein interaction network generated by our core data set of 7,123 protein–protein interactions, the average degree (number of interactions per protein) is 5.26 and the distribution of the number of interactions per protein follows an inverse power law (Fig. 2c), indicating scale-free network topology [28]. These protein–protein interactions could be represented as a weighted graph (not shown) in which individual proteins are nodes and the weight of the arc connecting two nodes is the probability that the interaction is correct. Because the 2,357 successful purifications underlying such a graph would represent >50% of the detectably expressed proteome [18], we have typically purified multiple subunits of a given complex. To identify highly connected modules within the global protein–protein interaction network, we used the Markov cluster algorithm, which simulates random walks within graphs [29]. We chose values for the expansion and inflation operators of the Markov cluster procedure that optimized overlap with the hand-curated MIPS complexes [23].
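The Markov cluster procedure described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function name `markov_cluster`, the toy graph, the self-loop weight of 1.0, and the default expansion and inflation values are all assumptions chosen for illustration (the text does not state the parameter values that were actually selected). Edge weights stand in for interaction probabilities.

```python
def _normalize_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= total
    return m


def _matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(n, edges, inflation=2.0, expansion=2,
                   max_iter=100, tol=1e-9, eps=1e-6):
    """Cluster an undirected weighted graph with n nodes (0..n-1).

    edges maps (i, j) pairs to weights. All defaults are illustrative.
    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    m = [[0.0] * n for _ in range(n)]
    for (i, j), w in edges.items():
        m[i][j] = m[j][i] = w
    for i in range(n):
        m[i][i] = 1.0  # self-loops (assumed weight) stabilise the random walk
    _normalize_columns(m)

    for _ in range(max_iter):
        # Expansion: take the matrix power, i.e. let the random walk spread.
        expanded = m
        for _ in range(expansion - 1):
            expanded = _matmul(expanded, m)
        # Inflation: element-wise power, then renormalise, which
        # strengthens strong flows and weakens weak ones.
        inflated = [[v ** inflation for v in row] for row in expanded]
        _normalize_columns(inflated)
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break

    # Nodes linked by any remaining flow belong to the same cluster.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)

    clusters = {}
    for node in range(n):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


# Toy graph: two tightly connected triangles joined by one weak edge.
toy_edges = {(0, 1): 0.9, (0, 2): 0.9, (1, 2): 0.9,
             (3, 4): 0.9, (3, 5): 0.9, (4, 5): 0.9,
             (2, 3): 0.1}
print(markov_cluster(6, toy_edges))
```

In this sketch a higher inflation value tends to give more, smaller clusters, which is why the choice of expansion and inflation matters when tuning against a reference set of complexes.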
Although the Markov cluster algorithm displays good convergence and robustness, it does not necessarily separate two or more complexes that have shared subunits (for example, RNA polymerases I and III, or chromatin modifying complexes Rpd3C(S) and Rpd3C(L)) [30,31].
The Markov cluster procedure identified 547 distinct (non-overlapping) heteromeric protein complexes (Supplementary Table S10), about half of which are not present in MIPS or two previous high-throughput studies of yeast complexes using affinity purification and mass spectrometry (Fig. 3a). New subunits or interacting proteins were identified for most complexes that had been identified previously (Fig. 3a). Overlap of our Markov-cluster-computed complexes with the MIPS complexes was evaluated (see Supplementary Information) by calculating the total precision (measure of the extent to which proteins belonging to one reference MIPS complex are grouped within one of our complexes, and vice versa) and homogeneity (measure of the extent to which proteins from the same MIPS complex are distributed across our complexes, and vice versa) (Fig. 3b). Both precision and homogeneity were higher for the complexes generated in this study (even for the extended set of protein–protein interactions) than for complexes generated by both previous high-throughput studies of yeast complexes, perhaps because the increased number of successful purifications in this study increased the density of connections within most modules. The average number of different proteins per complex is 4.9, but the distribution (Fig. 3c), which follows an inverse power law, is characterized by a large number of small complexes, most often containing only two to four different polypeptides, and a much smaller number of very large complexes.
Figure 2 | Machine learning generates a core data set of protein–protein interactions. a, Reliability of observed protein–protein interactions was estimated using probabilistic mass spectra database search scores and measures of experimental reproducibility (see Methods), followed by machine learning. b, Precision-sensitivity ROC plot for our protein–protein interaction data set generated by machine learning. Precision/sensitivity values are also shown for the ‘intersection’ and ‘merged’ data sets (see text) and for other large-scale affinity tagging [13,14] and two-hybrid [11,12] data sets, and a bayesian networks combination of those data sets [27], all based on comparison to MIPS complexes. FP, false positive; TP, true positive. c, Plot of the number of nodes against the number of edges per node demonstrates that the core data set protein–protein interaction network has scale-free properties.
Figure 3 | Organization of the yeast protein–protein interaction network into protein complexes. a, Pie charts showing how many of our 547 complexes have the indicated percentages of their subunits appearing in individual MIPS complexes or complexes identified by other affinity-based purification studies [13,14]. b, Precision and homogeneity (see text) in comparison to MIPS complexes for three large-scale studies. c, The relationship between complex size (number of different subunits) and frequency. d, Graphical representation of the complexes. This Cytoscape/GenePro screenshot displays patterns of evolutionary conservation of complex subunits. Each pie chart represents an individual complex, its relative size indicating the number of proteins in the complex. The thicknesses of the 429 edges connecting complexes are proportional to the number of protein–protein interactions between connected nodes. Complexes lacking connections shown at the bottom of this figure have <2 interactions with any other complex. Sector colours (see panel f) indicate the proportion of subunits sharing significant sequence similarity to various taxonomic groups (see Methods). Insets provide views of two selected complexes (the kinetochore machinery and a previously uncharacterized, highly conserved fructose-1,6-bisphosphatase-degrading complex; see text for details), detailing specific interactions between proteins identified within the complex (purple borders) and with other proteins that interact with at least one member of the complex (blue borders). Colours indicate taxonomic similarity. e, Relationship between protein frequency in the core data set and degree of connectivity or betweenness as a function of conservation. Colours of the bars indicate the evolutionary grouping. f, Colour key indicating the taxonomic groupings (and their phylogenetic relationships). Numbers indicate the total number of ORFs sharing significant sequence similarity with a gene in at least one organism associated with that group and, importantly, not possessing similarity to any gene from more distantly related organisms.
Proteins in the same complex should have similar function and co-localize to the same subcellular compartment. To evaluate this, we calculated the weighted average of the fraction of proteins in each complex that maps to the same localization categories [5] (see Supplementary Information). Co-localization was better for the complexes in our study than for previous high-throughput studies but, not unexpectedly, less than that for the curated MIPS complexes (Supplementary Fig. S1). We also evaluated the extent of semantic similarity [32] for the GO terms in the ‘biological process’ category for pairs of interacting proteins within our complexes (Supplementary Fig. S2), and found that semantic similarity was lower for our core data set than for the MIPS complexes or the previous study using TAP tags [13], but higher than for a study using protein overproduction [14]. This might be expected if the previous TAP tag study
significantly influenced the semantic classifications in GO.
To analyse and visualize our entire collection of complexes, the highly connected modules identified by Markov clustering for the global core protein–protein interaction network were displayed (b.sickkids.ca) using our GenePro plug-in for the Cytoscape software environment [33] (Fig. 3d). Each complex is represented as a pie-chart node, and the complexes are connected by a limited number (429) of high-confidence interactions. Assignment of connecting proteins to a particular module can therefore be arbitrary, and the limited number of connecting proteins could just as well be part of two or more distinct complexes. The size and colour of each section of a pie-chart node can be made to represent the fraction of the proteins in each complex that maps into a given complex from the hand-curated MIPS complexes (Supplementary Fig. S3). Similar displays can be generated when highlighting instead the subcellular localizations or GO biological process functional annotations of proteins in each complex. Furthermore, the protein–protein interaction details of individual complexes can readily be visualized (see Supplementary Information).
Figure 4 | Characterization of three previously unreported protein complexes and Iwr1, a novel RNAPII-interacting factor. a, Identification of three novel complexes by SDS–PAGE, silver staining and mass spectrometry. The same novel complex containing Vid30 was obtained after purification from strains with other tagged subunits (data not shown). b, Identification of Iwr1 (interacts with RNAPII). Tagging and purification of unique RNAPII subunits identified YDL115C (Iwr1) as a novel RNAPII-associated factor (Supplementary Fig. S5a). Purification of Iwr1 is shown here. c, Genetic interactions of Iwr1 with various transcription factors. Lines connect genes with synthetic lethal/sick genetic interactions. d, Microarray analysis on the indicated deletion strains. Pearson correlation coefficients were calculated for the effects on gene expression of each deletion pair and organized by two-dimensional hierarchical clustering. e, Antibody generated against the amino-terminal amino acid sequence (DDDDDDDSFASADGE) of the Drosophila homologue of Iwr1 (CG10528) and a monoclonal antibody (H5) against RNAPII subunit Rpb1 phosphorylated on S5 of the heptapeptide repeat of its carboxy-terminal domain [48] were used for co-localization studies on polytene chromosomes as previously described [47].
Evolutionary conservation of protein complexes
ORFs encoding each protein were placed into nine distinct evolutionary groups (Fig. 3f) based on their taxonomic profiles (see Methods), and the complexes displayed so as to show the evolutionary conservation of their components (Fig. 3d). Insets highlight the kinetochore complex required for chromosome segregation and a novel, highly conserved complex involved in degradation of fructose-1,6-bisphosphatase. Strong co-evolution was evident for components of some large and essential complexes (for example, 19S and 20S proteasomes involved in protein degradation, the exosome involved in RNA metabolism, and the ARP2/3 complex required for the motility and integrity of cortical actin patches). Conversely, the kinetochore complex, the mediator complex required for regulated transcription, and the RSC complex that remodels chromatin have a high proportion of fungi-specific subunits. Previous studies have shown that highly connected proteins within a network tend to be more highly conserved [17,34], a consequence of either functional constraints or preferential interaction of new proteins with existing highly connected proteins [28]. For the network as a whole, and consistent with earlier studies, Fig. 3e reveals that the frequency of ORFs with a large number (>10) of connections is proportional to the relative distance of the evolutionary group. ‘Betweenness’ provides a measure of how ‘central’ a protein is in a network, typically calculated as the fraction of shortest paths between node pairs passing through a node of interest. Figure 3e shows that highly conserved proteins tend to have higher values of betweenness.
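To make the definition of betweenness concrete, the following is a minimal sketch using Brandes' algorithm on an unweighted, undirected graph. It is an illustration only, not the procedure used in the study: the function name `betweenness`, the adjacency-dictionary input and the four-node toy network are assumptions, and the values returned are unnormalized (summed over node pairs rather than divided by the number of pairs).

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness for an unweighted, undirected graph.

    adj maps each node to a list of its neighbours. For every node v the
    result sums, over unordered pairs (s, t), the fraction of shortest
    s-t paths that pass through v.
    """
    score = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search that counts shortest paths from source.
        order = []
        preds = {v: [] for v in adj}
        n_paths = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:            # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back.
        dependency = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair was visited from both ends in an undirected graph.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical linear network A - B - C - D.
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(betweenness(toy_network))
```

In the toy chain the two interior nodes each lie on the shortest paths of two node pairs, while the end nodes lie on none, which matches the intuition that central proteins score higher.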
Despite these average network properties, the subunits of some complexes (for example, the kinetochore complex) display a high degree of connectedness despite restriction to hemiascomycetes. These findings suggest caution in extrapolating network properties to the properties of individual complexes. We also investigated the relationship between an ORF’s essentiality and its conservation, degree of connectivity and betweenness (Supplementary Fig. S4). Consistent with previous studies [17,35], essential genes tend to be more highly conserved, highly connected and central to the network (as defined by betweenness), presumably reflecting their integrating role.
Examples of new protein complexes and interactions
Among the 275 complexes not in MIPS that we identified, three are shown in Fig. 4a. One contains Tbf1, Vid22 and YGR071C. Tbf1 binds subtelomeric TTAGGG repeats and insulates adjacent genes from telomeric silencing [36,37], suggesting that this trimeric complex might be involved in this process. Consistent with this, a hypomorphic DAmP allele [10] (3′ untranslated region (UTR) deletion) of the essential TBF1 gene causes a synthetic growth defect when combined with a deletion of VID22 (data not shown), suggesting that Tbf1 and Vid22 have a common function. Vid22 and YGR071C are the only yeast proteins containing BED Zinc-finger domains, thought to mediate DNA binding or protein–protein interactions [38], suggesting that each uses its BED domain to interact with Tbf1 or enhance DNA binding by Tbf1. Another novel complex in Fig. 4a contains Vid30 and six other subunits (also see Fig. 3d inset). Five of its subunits (Vid30, Vid28, Vid24, Fyv10, YMR135C) have been genetically linked to proteasome-dependent, catabolite-induced degradation of fructose-1,6-bisphosphatase [39], suggesting that the remaining two subunits (YDL176W, YDR255C), hypothetical proteins of hitherto unknown function, are probably involved in the same process. Vid24 was reported to be in a complex with a Mr of approximately 600,000 (ref. 39), similar to the sum of the
apparent Mr values of the subunits of the Vid30-containing complex. The third novel complex contains Rtt109 and Vps75. Because Vps75 is related to nucleosome assembly protein Nap1, and Rtt109 is involved in Ty transposition [40], this complex may be involved in chromatin assembly or function.
Our systematic characterization of complexes by TAP and mass spectrometry has often led to the identification of new components of established protein complexes (Fig. 3a) [41–43]. Figure 4 highlights Iwr1 (YDL115C), which co-purifies with RNA polymerase II (RNAPII) along with general initiation factor TFIIF and transcription elongation factors Spt4/Spt5 and Dst1 (TFIIS) (Figs 4b and 3d (inset); see also Supplementary Fig. S5a). We used synthetic genetic array (SGA) technology [9] in a quantified, high-density E-MAP format [10] to systematically identify synthetic genetic interactions for iwr1Δ with deletions of the elongation factor gene DST1, the SWR complex that assembles the variant histone Htz1 into chromatin [44], an Rpd3-containing histone deacetylase complex (Rpd3(L)) that mediates promoter-specific transcriptional repression [30,31], the histone H3K4 methyltransferase complex (COMPASS), the activity of which is linked to elongation by RNAPII [45], and other transcription-related genes (Fig. 4c). Moreover, DNA microarray analyses of the effects on gene expression of deletions of IWR1 and other genes involved in transcription by RNAPII, followed by clustering of the genes according to the similarity of their effects on gene expression, revealed that deletion of IWR1 is most similar in its effects on mRNA levels to deletion of RPB4 (Fig. 4d), a subunit of RNAPII with multiple roles in transcription [46]. We also made use of the fact that Iwr1 is highly conserved (Supplementary Fig. S5b), with a homologue, CG10528, in Drosophila melanogaster. Fig. 4e shows that Drosophila Iwr1 partly co-localizes with phosphorylated, actively transcribing RNAPII on polytene chromosomes, suggesting that Iwr1 is an evolutionarily conserved transcription
factor.
Conclusions
We have described the interactome and protein complexes underlying most of the yeast proteome. Our results comprise 7,123 protein–protein interactions for 2,708 proteins in the core data set. Greater coverage and accuracy were achieved compared with previous high-throughput studies of yeast protein–protein interactions as a consequence of four aspects of our approach: first, unlike a previous study using affinity purification and mass spectrometry [14], we avoided potential artefacts caused by protein overproduction; second, we were able to ensure greater data consistency and reproducibility by systematically tagging and purifying both interacting partners for each protein–protein interaction; third, we enhanced coverage and reproducibility, especially for proteins of lower abundance, by using two independent methods of sample preparation and complementary mass spectrometry procedures for protein identification (in effect, up to four spectra were available for statistically evaluating the validity of each PPI); and finally, we used rigorous computational procedures to assign confidence values to our predictions. It is important to note, however, that our data represent a ‘snapshot’ of protein–protein interactions and complexes in a particular yeast strain subjected to particular growth conditions.
Both the quality of the mass spectrometry spectra used for protein identification and the approximate stoichiometry of the interacting protein partners can be evaluated by accessing our publicly available comprehensive database (http://tap.med.utoronto.ca/) that reports gel images, protein identifications, protein–protein interactions and supporting mass spectrometry data (Supplementary Information and Supplementary Fig. S6). Soon to be linked to our database will be thousands of sites of post-translational modification tentatively identified during our LC-MS/MS analyses (manuscript in preparation). The protein interactions and assemblies we identified provide entry points for studies on individual gene products, many of which are evolutionarily conserved, as well as ‘systems biology’ approaches to cell physiology in yeast and other eukaryotic organisms.
METHODS
Experimental procedures and mass spectrometry. Proteins were tagged, purified and prepared for mass spectrometry as previously described [43]. Gel images, mass spectra and confidence scores for protein identification by mass spectrometry are found in our database (http://tap.med.utoronto.ca/). Confidence scores for protein identification by LC-MS/MS were calculated as described previously [43]. After processing 72 database searches for each spectrum, a score of 1.25, corresponding to 99% confidence (A.P.T. and N.J.K., unpublished data), was used as a cut-off for protein identification by MALDI–TOF mass spectrometry. Synthetic genetic interactions and effects of deletion mutations on gene expression were identified as described previously [30]. Drosophila polytene chromosomes were stained with dIwr1 anti-peptide antibody and H5 monoclonal antibody as previously described [47].
Identification of protein complexes. Details of the methods for identification of protein complexes and calculating their overlaps with various data sets are described in Supplementary Information.
Protein property analysis. We used previously published yeast protein localization
data [5,6], and yeast protein properties were obtained from the SGD (http:// /) and GO () databases. Proteins expressed at high, medium or low levels have expression log values of >4, 3–4, or <3, respectively [18].
Phylogenetic analysis. For each S. cerevisiae sequence a BLAST and TBLASTX
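Molecular Biology Glossary
Key concepts explained
A
Abundance (of an mRNA): the number of molecules of that mRNA per cell.
Abundant mRNA: a small number of distinct mRNA species, each present in a large number of copies per cell.
Acceptor splicing site: the boundary between the right end of an intron and the left end of the adjacent exon.
Acentric fragment: a chromosome fragment (generated by breakage) that lacks a centromere and is therefore lost at cell division.
Active site: the restricted region of a protein to which a substrate binds.
Allele: one of the alternative forms of a gene occupying a given locus on a chromosome.
Allelic exclusion: the expression, in a particular lymphocyte, of only one of the alleles coding for the immunoglobulin.
Allosteric control: the ability of an interaction at one site of a protein to influence the activity of another site.
Alu-equivalent family: a set of sequences in a mammalian genome that is related to the human Alu family.
Alu family: a set of dispersed, related sequences in the human genome, each about 300 bp long. Individual members have Alu cleavage sites at each end (hence the name).
α-Amanitin: a bicyclic octapeptide from the poisonous mushroom Amanita phalloides that inhibits transcription by eukaryotic RNA polymerases, especially RNA polymerase II.
Amber codon: the nucleotide triplet UAG, one of the three codons that cause termination of protein synthesis.
Amber mutation: any change in DNA that creates an amber codon at a site previously occupied by a codon representing an amino acid in a protein.
Amber suppressors: mutant genes coding for tRNAs whose anticodons have been altered so that they recognize the UAG codon as well as, or instead of, their original codons.
Aminoacyl-tRNA: a transfer RNA carrying an amino acid; the covalent linkage is between the carboxyl (COOH) group of the amino acid and the 3′- or 2′-OH group of the terminal nucleotide of the tRNA.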
Compendium of English Bioinformatics Terms and Definitions
Abstract Syntax Notation (ASN.1) (the internal format used by many programs developed at NCBI, such as Cn3D for displaying three-dimensional protein structures): A language that is used to describe structured data types formally. Within bioinformatics, it has been used by the National Center for Biotechnology Information to encode sequences, maps, taxonomic information, molecular structures, and bibliographical information in such a way that it can be easily accessed and exchanged by computer software.
Accession number: A unique identifier that is assigned to a single database entry for a DNA or protein sequence.
Affine gap penalty: A gap penalty score that is a linear function of gap length, consisting of a gap opening penalty and a gap extension penalty multiplied by the length of the gap. Using this penalty scheme greatly enhances the performance of dynamic programming methods for sequence alignment. See also Gap penalty.
Algorithm: A systematic procedure for solving a problem in a finite number of steps, typically involving a repetition of operations. Once specified, an algorithm can be written in a computer language and run as a program.
Alignment: Refers to the procedure of comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences. Of the two types of alignment, local and global, a local alignment is generally the most useful. See also Local and Global alignments.
Alignment score: An algorithmically computed score based on the number of matches, substitutions, insertions, and deletions (gaps) within an alignment. Scores for matches and substitutions are derived from a scoring matrix such as the BLOSUM and PAM matrices for proteins, and affine gap penalties suitable for the matrix are chosen. Alignment scores are in log odds units, often bit units (log to the base 2). Higher scores denote better alignments.
See also Similarity score, Distance in sequence analysis.
Alphabet: The total number of distinct symbols available in a sequence: 4 for DNA sequences and 20 for protein sequences.
Annotation: The prediction of genes in a genome, including the location of protein-encoding genes, the sequence of the encoded proteins, any significant matches to other proteins of known function, and the location of RNA-encoding genes. Predictions are based on gene models; e.g., hidden Markov models of introns and exons in protein-encoding genes, and models of secondary structure in RNA.
Anonymous FTP: When an FTP service allows anyone to log in, it is said to provide anonymous FTP service. A user can log in to an anonymous FTP server by typing anonymous as the user name and his or her e-mail address as a password. Most Web browsers now negotiate anonymous FTP logon without asking the user for a user name and password. See also FTP.
ASCII: The American Standard Code for Information Interchange (ASCII) encodes unaccented letters a-z, A-Z, the numbers 0-9, most punctuation marks, space, and a set of control characters such as carriage return and tab. ASCII specifies 128 characters that are mapped to the values 0-127. ASCII files are commonly called plain text, meaning that they only encode text without extra markup.
BAC clone: Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100–200 kb. Most of the large-insert clones sequenced in the project were BAC clones.
Back-propagation: When training feed-forward neural networks, a back-propagation algorithm can be used to modify the network weights. After each training input pattern is fed through the network, the network’s output is compared with the desired output and the amount of error is calculated. This error is back-propagated through the network by using an error function to correct the network weights.
See also Feed-forward neural network.
Baum-Welch algorithm: An expectation maximization algorithm that is used to train hidden Markov models.
Bayes’ rule: Forms the basis of conditional probability by calculating the likelihood of an event occurring based on the history of the event and relevant background information. In terms of two parameters A and B, the theorem is stated in an equation: the conditional probability of A, given B, P(A|B), is equal to the probability of A, P(A), times the conditional probability of B, given A, P(B|A), divided by the probability of B, P(B). P(A) is the historical or prior distribution value of A, P(B|A) is a new prediction for B for a particular value of A, and P(B) is the sum of the newly predicted values for B. P(A|B) is a posterior probability, representing a new prediction for A given the prior knowledge of A and the newly discovered relationships between A and B.
Bayesian analysis: A statistical procedure used to estimate parameters of an underlying distribution based on an observed distribution. See also Bayes’ rule.
Biochips: Miniaturized arrays of large numbers of molecular substrates, often oligonucleotides, in a defined pattern. They are also called DNA microarrays and microchips.
Bioinformatics: (1) The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology. (2) The discipline of obtaining information about genomic or protein sequence data. This may involve similarity searches of databases, comparing your unidentified sequence to the sequences in a database, or making predictions about the sequence based on current knowledge of similar sequences. Databases are frequently made publicly available through the Internet, or locally at your institution.
Bit score: The value S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account.
Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
Bit units: From information theory, a bit denotes the amount of information required to distinguish between two equally likely possibilities. The number of bits of information, N, required to convey a message that has M possibilities is log2 M = N bits.
BLAST (Basic Local Alignment Search Tool, a major database search program): A set of programs used to perform fast similarity searches. Nucleotide sequences can be compared with nucleotide sequences in a database using BLASTN, for example. Complex statistics are applied to judge the significance of each match. Reported sequences may be homologous to, or related to, the query sequence. The BLASTP program is used to search a protein database for a match against a query protein sequence. There are several other flavours of BLAST. BLAST2 is a newer release of BLAST that allows for insertions or deletions in the sequences being aligned. Gapped alignments may be more biologically significant.
Block (a conserved region in a protein family): Conserved ungapped patterns approximately 3-60 amino acids in length in a set of related proteins.
BLOSUM matrices (block substitution matrices, a major family of substitution matrices): An alternative to PAM tables, BLOSUM tables were derived using local multiple alignments of more distantly related sequences than were used for the PAM matrix. These are used to assess the similarity of sequences when performing alignments.
Boltzmann distribution: Describes the number of molecules that have energies above a certain level, based on the Boltzmann gas constant and the absolute temperature.
Boltzmann probability function: See Boltzmann distribution.
Bootstrap analysis: A method for testing how well a particular data set fits a model. For example, the validity of the branch arrangement in a predicted phylogenetic tree can be tested by resampling columns in a multiple sequence alignment to create many new alignments.
The appearance of a particular branch in trees generated from these resampled sequences can then be measured. Alternatively, a sequence may be left out of an analysis to determine how much the sequence influences the results of an analysis.

Branch length (分支长度)
In sequence analysis, the number of sequence changes along a particular branch of a phylogenetic tree.

CDS or cds (编码序列)
Coding sequence.

Chebyshev's inequality
The probability that a random variable differs from its mean by at least a given number of standard deviations is less than or equal to the square of 1 over that number of standard deviations.

Clone (克隆)
Population of identical cells or molecules (e.g. DNA), derived from a single ancestor.

Cloning vector (克隆载体)
A molecule that carries a foreign gene into a host, and allows or facilitates the multiplication of that gene in a host. When sequencing a gene that has been cloned using a cloning vector (rather than by PCR), care should be taken not to include the cloning vector sequence when performing similarity searches. Plasmids, cosmids, phagemids, YACs and PACs are example types of cloning vectors.

Cluster analysis (聚类分析)
A method for grouping together a set of objects that are most similar from a larger group of related objects. The relationships are based on some criterion of similarity or difference. For sequences, a similarity or distance score or a statistical evaluation of those scores is used.

Cobbler
A single sequence that represents the most conserved regions in a multiple sequence alignment. The BLOCKS server uses the cobbler sequence to perform a database similarity search as a way to reach sequences that are more divergent than would be found using the single sequences in the alignment for searches.

Coding system (neural networks)
Regarding neural networks, a coding system needs to be designed for representing input and output.
The level of success found when training the model will be partially dependent on the quality of the coding system chosen.

Codon usage
Analysis of the codons used in a particular gene or organism.

COG (直系同源簇)
Clusters of orthologous groups: a set of groups of related sequences in microorganisms and yeast (S. cerevisiae). These groups are found by whole proteome comparisons and include orthologs and paralogs. See also Orthologs and Paralogs.

Comparative genomics (比较基因组学)
A comparison of gene numbers, gene locations, and biological functions of genes in the genomes of diverse organisms, one objective being to identify groups of genes that play a unique biological role in a particular organism.

Complexity (of an algorithm) (算法的复杂性)
Describes the number of steps required by the algorithm to solve a problem as a function of the amount of data; for example, the length of sequences to be aligned.

Conditional probability (条件概率)
The probability of a particular result (or of a particular value of a variable) given one or more events or conditions (or values of other variables).

Conservation (保守)
Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.

Consensus (一致序列)
A single sequence that represents, at each subsequent position, the variation found within corresponding columns of a multiple sequence alignment.

Context-free grammars
A recursive set of production rules for generating patterns of strings. These consist of a set of terminal characters that are used to create strings, a set of nonterminal symbols that correspond to rules and act as placeholders for patterns that can be generated using terminal characters, a set of rules for replacing nonterminal symbols with terminal characters, and a start symbol.

Contig (序列重叠群/拼接序列)
A set of clones that can be assembled into a linear order. A DNA sequence that overlaps with another contig.
The full set of overlapping sequences (contigs) can be put together to obtain the sequence for a long region of DNA that cannot be sequenced in one run in a sequencing assay. Important in genetic mapping at the molecular level.

CORBA (国际对象管理协作组制定的使OOP对象与网络接口统一起来的一套跨计算机、操作系统、程序语言和网络的共同标准)
The Common Object Request Broker Architecture (CORBA) is an open industry standard for working with distributed objects, developed by the Object Management Group. CORBA allows the interconnection of objects and applications regardless of computer language, machine architecture, or geographic location of the computers.

Correlation coefficient (相关系数)
A numerical measure, falling between -1 and 1, of the degree of the linear relationship between two variables. A positive value indicates a direct relationship, a negative value indicates an inverse relationship, and the distance of the value away from zero indicates the strength of the relationship. A value near zero indicates no linear relationship between the variables.

Covariation (in sequences) (共变)
Coincident change at two or more sequence positions in related sequences that may influence the secondary structures of RNA or protein molecules.

Coverage (or depth) (覆盖率/厚度)
The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a 'high-quality base' is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).

Database (数据库)
A computerized storehouse of data that provides a standardized way for locating, adding, removing, and changing data. See also Object-oriented database, Relational database.

Dendrogram
A form of a tree that lists the compared objects (e.g., sequences or genes in a microarray analysis) in a vertical order and joins related ones by levels of branches extending to one side of the list.

Depth (厚度)
See Coverage.

Dirichlet mixtures
Defined as the conjugate prior of a multinomial distribution.
One use is for predicting the expected pattern of amino acid variation found in the match state of a hidden Markov model (representing one column of a multiple sequence alignment of proteins), based on prior distributions found in conserved protein domains (blocks).

Distance in sequence analysis (序列距离)
The number of observed changes in an optimal alignment of two sequences, usually not counting gaps.

DNA sequencing (DNA测序)
The experimental process of determining the nucleotide sequence of a region of DNA. This is done by labelling each nucleotide (A, C, G or T) with either a radioactive or fluorescent marker which identifies it. There are several methods of applying this technology, each with its advantages and disadvantages. For more information, refer to a current textbook. High-throughput laboratories frequently use automated sequencers, which are capable of rapidly reading large numbers of templates. Sometimes, the sequences may be generated more quickly than they can be characterised.

Domain (功能域)
A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.

Dot matrix (点标矩阵图)
Dot matrix diagrams provide a graphical method for comparing two sequences. One sequence is written horizontally across the top of the graph and the other along the left-hand side. Dots are placed within the graph at the intersection of the same letter appearing in both sequences. A series of diagonal lines in the graph indicates regions of alignment.
The matrix may be filtered to reveal the most-alike regions by scoring a minimal threshold number of matches within a sequence window.

Draft genome sequence (基因组序列草图)
The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.

DUST (一种低复杂性区段过滤程序)
A program for filtering low complexity regions from nucleic acid sequences.

Dynamic programming (动态规划法)
A dynamic programming algorithm solves a problem by combining solutions to sub-problems that are computed once and saved in a table or matrix. Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found. This algorithm is used for producing sequence alignments, given a scoring system for sequence comparisons.

EMBL (欧洲分子生物学实验室,EMBL数据库是主要公共核酸序列数据库之一)
European Molecular Biology Laboratory. Maintains the EMBL database, one of the major public sequence databases.

EMBnet (欧洲分子生物学网络)
European Molecular Biology Network. It was established in 1988 and provides services including local molecular databases and software for molecular biologists in Europe. There are several large outposts of EMBnet, including ExPASy.

Entropy (熵)
From information theory, a measure of the unpredictable nature of a set of possible elements. The higher the level of variation within the set, the higher the entropy.

Erdos and Renyi law
In a toss of a "fair" coin, the number of heads in a row that can be expected is the logarithm of the number of tosses to the base 2. The law may be generalized for more than two possible outcomes by changing the base of the logarithm to the number of outcomes.
This law was used to analyze the number of matches and mismatches that can be expected between random sequences as a basis for scoring the statistical significance of a sequence alignment.

EST (表达序列标签的缩写)
See Expressed Sequence Tag.

Expect value (E) (E值)
E value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. In a database similarity search, it reflects how often an alignment score as good as the one found between a query sequence and a database sequence would be found in as many comparisons between random sequences as were done to find the matching sequence. In other types of sequence analysis, E has a similar meaning.

Expectation maximization (sequence analysis)
An algorithm for locating similar sequence patterns in a set of sequences. A guessed alignment of the sequences is first used to generate an expected scoring matrix representing the distribution of sequence characters in each column of the alignment, this pattern is matched to each sequence, and the scoring matrix values are then updated to maximize the alignment of the matrix to the sequences. The procedure is repeated until there is no further improvement.

Exon (外显子)
Coding region of DNA. See CDS.

Expressed Sequence Tag (EST) (表达序列标签)
Randomly selected, partial cDNA sequence; represents its corresponding mRNA. dbEST is a large database of ESTs at GenBank, NCBI.

FASTA (一种主要数据库搜索程序)
The first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt".
The sensitivity and speed of the search are inversely related and controlled by the "k-tup" variable, which specifies the size of a "word". (Pearson and Lipman)

Extreme value distribution (极值分布)
Some measurements are found to follow a distribution that has a long tail which decays at high values much more slowly than that found in a normal distribution. This slow-falling type is called the extreme value distribution. The alignment scores between unrelated or random sequences are an example. These scores can reach very high values, particularly when a large number of comparisons are made, as in a database similarity search. The probability of a particular score may be accurately predicted by the extreme value distribution, which follows a double negative exponential function after Gumbel.

False negative (假阴性)
A negative data point collected in a data set that was incorrectly reported due to a failure of the test. If the test had correctly measured the data point, the data would have been recorded as positive.

False positive (假阳性)
A positive data point collected in a data set that was incorrectly reported due to a failure of the test. If the test had correctly measured the data point, the data would have been recorded as negative.

Feed-forward neural network (前馈神经网络)
Organizes nodes into sequential layers in which the nodes in each layer are fully connected with the nodes in the next layer, except for the final output layer. Input is fed from the input layer through the layers in sequence in a "feed-forward" direction, resulting in output at the final layer. See also Neural network.

Filtering (window size)
During pair-wise sequence alignment using the dot matrix method, random matches can be filtered out by using a sliding window to compare the two sequences. Rather than comparing a single sequence position at a time, a window of adjacent positions in the two sequences is compared and a dot, indicating a match, is generated only if a certain minimal number of matches occur.

Filtering (过滤)
Also known as Masking.
The process of hiding regions of (nucleic acid or amino acid) sequence having characteristics that frequently lead to spurious high scores. See SEG and DUST.

Finished sequence (完成序列)
Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.

Fourier analysis
Studies the approximations and decomposition of functions using trigonometric polynomials.

Format (file) (格式)
Different programs require that information be specified to them in a formal manner, using particular keywords and ordering. This specification is a file format.

Forward-backward algorithm
Used to train a hidden Markov model by aligning the model with training sequences. The expected state and transition counts it computes are then used to refine the model so that it fits the given data more closely (see Baum-Welch algorithm).

FTP (File Transfer Protocol) (文件传输协议)
Allows a person to transfer files from one computer to another across a network using an FTP-capable client program. The FTP client program can only communicate with machines that run an FTP server. The server, in turn, will make a specific portion of its file system available for FTP access, provided that the client is able to supply a recognized user name and password to the server.

Full shotgun clone (鸟枪法克隆)
A large-insert clone for which full shotgun sequence has been produced.

Functional genomics (功能基因组学)
Assessment of the function of genes identified by between-genome comparisons. The function of a newly identified gene is tested by introducing mutations into the gene and then examining the resultant mutant organism for an altered phenotype.

Gap (空位/间隙/缺口)
A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score.
Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.

Gap penalty (空位罚分)
A numeric score used in sequence alignment programs to penalize the presence of gaps within an alignment. The value of a gap penalty affects how often gaps appear in alignments produced by the algorithm. Most alignment programs suggest gap penalties that are appropriate for particular scoring matrices.

Genetic algorithm (遗传算法)
A kind of search algorithm that was inspired by the principles of evolution. A population of initial solutions is encoded and the algorithm searches through these by applying a pre-defined fitness measurement to each solution, selecting those with the highest fitness for reproduction. New solutions can be generated during this phase by crossover and mutation operations, defined in the encoded solutions.

Genetic map (遗传图谱)
A genome map in which polymorphic loci are positioned relative to one another on the basis of the frequency with which they recombine during meiosis. The unit of distance is the centimorgan (cM), one centimorgan denoting a 1% chance of recombination.

Genome (基因组)
The genetic material of an organism, contained in one haploid set of chromosomes.

Gibbs sampling method
An algorithm for finding conserved patterns within a set of related sequences. A guessed alignment of all but one sequence is made and used to generate a scoring matrix that represents the alignment. The matrix is then matched to the left-out sequence, and a probable location of the corresponding pattern is found. This prediction is then input into a new alignment and another scoring matrix is produced and tested on a new left-out sequence.
The process is repeated until there is no further improvement in the matrix.

Global alignment (整体联配)
Attempts to match as many characters as possible, from end to end, in a set of two or more sequences.

Gopher (一个文档发布系统,允许检索和显示文本文件)
A document distribution system that allows text files to be retrieved and displayed.

Graph theory (图论)
A branch of mathematics which deals with problems that involve a graph or network structure. A graph is defined by a set of nodes (or points) and a set of arcs (lines or edges) joining the nodes. In sequence and genome analysis, graph theory is used for sequence alignments and for clustering similar genes.

GSS (基因综述序列)
Genome survey sequence.

GUI (图形用户界面)
Graphical user interface.

H (相对熵值)
H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990). H can be thought of as a measure of the average information (in bits) available per position that distinguishes an alignment from chance. At high values of H, short alignments can be distinguished from chance, whereas at lower H values, a longer alignment may be necessary (Altschul, 1991).

Half-bits
Some scoring matrices are in half-bit units. These units are logarithms to the base 2 of odds scores times 2.

Heuristic (启发式方法)
A procedure that progresses along empirical lines by using rules of thumb to reach a solution. The solution is not guaranteed to be optimal.

Hexadecimal system (16进制系统)
The base 16 counting system that uses the digits 0-9 followed by the letters A-F.

HGMP (人类基因组图谱计划)
Human Genome Mapping Project.

Hidden Markov Model (HMM) (隐马尔可夫模型)
In sequence analysis, an HMM is usually a probabilistic model of a multiple sequence alignment, but can also be a model of periodic patterns in a single sequence, representing, for example, patterns found in the exons of a gene. In a model of multiple sequence alignments, each column of symbols in the alignment is represented by a frequency distribution of the symbols called a state, and insertions and deletions by other states.
One then moves through the model along a particular path from state to state trying to match a given sequence. The next matching symbol is chosen from each state, recording its probability (frequency) and also the probability of going to that particular state from a previous one (the transition probability). State and transition probabilities are then multiplied to obtain a probability of the given sequence. Generally speaking, an HMM is a statistical model for an ordered sequence of symbols, acting as a stochastic state machine that generates a symbol each time a transition is made from one state to the next. Transitions between states are specified by transition probabilities.

Hidden layer (隐藏层)
An inner layer within a neural network that receives its input from, and sends its output to, other layers within the network. One function of the hidden layer is to detect covariation within the input data, such as patterns of amino acid covariation that are associated with a particular type of secondary structure in proteins.

Hierarchical clustering (分级聚类)
The clustering or grouping of objects based on some single criterion of similarity or difference. An example is the clustering of genes in a microarray experiment based on the correlation between their expression patterns. The distance method used in phylogenetic analysis is another example.

Hill climbing
A nonoptimal search algorithm that selects the single best possible solution at a given state or step. The result may be a locally best solution that is not a globally best solution.

Homology (同源性)
A similar component in two organisms (e.g., genes with strongly similar sequences) that can be attributed to a common ancestor of the two organisms during evolution.

Horizontal transfer (水平转移)
The transfer of genetic material between two distinct species that do not ordinarily exchange genetic material.
The transferred DNA becomes established in the recipient genome and can be detected by a novel phylogenetic history and codon content compared to the rest of the genome.

HSP (高比值片段对)
High-scoring segment pair. Local alignments with no gaps that achieve one of the top alignment scores in a given search.

HTGS/HGT (高通量基因组序列)
High-throughput genome sequences.
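
A few of the quantitative entries in the glossary above (Bayes' rule, Bit units, Entropy, and the PHRED threshold mentioned under Coverage) can be checked numerically. The following is a minimal sketch, not part of the original glossary; the function names and the example probabilities are illustrative assumptions.

```python
import math


def bayes_posterior(prior_a, likelihood_b_given_a, prob_b):
    """Bayes' rule: P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior_a * likelihood_b_given_a / prob_b


def bits_needed(num_possibilities):
    """Bits required to single out one of M equally likely possibilities: log2(M)."""
    return math.log2(num_possibilities)


def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution (zero-probability terms contribute 0)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def phred_to_accuracy(q):
    """Base-call accuracy implied by a PHRED quality score q: 1 - 10**(-q/10)."""
    return 1.0 - 10 ** (-q / 10)


if __name__ == "__main__":
    # Hypothetical numbers: P(A) = 0.01, P(B|A) = 0.9, P(B|not A) = 0.05.
    p_b = 0.01 * 0.9 + 0.99 * 0.05          # P(B) summed over A and not-A
    print(bayes_posterior(0.01, 0.9, p_b))  # posterior P(A|B)
    print(bits_needed(4))                   # four equally likely nucleotides
    print(shannon_entropy([0.25] * 4))      # uniform distribution over A, C, G, T
    print(phred_to_accuracy(20))            # PHRED 20 threshold for a high-quality base
```

A uniform distribution over the four nucleotides gives the maximum entropy for that alphabet, which matches the glossary's statement that more variation means higher entropy.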
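Evans Blue (EB) Assay of Blood-Brain Barrier (BBB) Permeability
1. Principle of measuring blood-brain barrier integrity
Evans blue is a commonly used azo dye that binds plasma albumin with high affinity in the blood. Because plasma albumin cannot cross the blood-brain barrier under normal conditions, albumin-bound Evans blue does not stain the nervous system when the barrier is intact.
Conversely, if the blood-brain barrier is disrupted, Evans blue can enter the nervous system and stain it.
It has strong peaks at fluorescence wavelengths of 470 and 540 nm and a weak peak at 680 nm.
Its content in tissue is usually measured by chemical dialysis and colorimetry.
Disruption of the blood-brain barrier increases capillary permeability, so that EB-bound albumin can pass through the BBB into brain tissue. Measuring the amount of EB that has leaked into brain tissue by formamide-based colorimetry reflects how far the blood-brain barrier has opened, and on that basis the relationship between different cell-transplantation interventions and BBB permeability can be examined further.
Evans blue perfusion staining, combined with confocal laser scanning microscopy of EB fluorescence intensity in brain sections, can detect changes in vascular morphology at the blood-brain barrier and also supports quantitative analysis of the EB that has entered the brain tissue.
The two approaches complement each other, one morphological and the other quantitative.
2. Experimental preparation
Reagents: 2% EB, 1% sodium pentobarbital, 0.9% sodium chloride, 20 U/ml heparin sodium, dimethylformamide.
Equipment: cryostat microtome, confocal laser scanning microscope, spectrophotometer, incubator, centrifuge, syringes, infusion sets, dissection instruments, etc.
3. Measurement method
1. 0.5 h before sacrifice (some protocols use 1 h or 2 h), inject 2% EB (2 ml/kg; some protocols use 3 or 4 ml/kg) into the tail vein (or femoral vein) of the animals in each group. (Is the dose inversely related to the lead time?)
2. After anesthesia with 1% sodium pentobarbital at 30-40 mg/kg, open the chest and perfuse through the heart with 200-300 ml of heparinized saline (0.9% sodium chloride + 20 U/ml heparin sodium); some reports suggest stopping the perfusion once the fluid leaving the right atrium runs clear. Decapitate, remove the brain, cut it sagittally, and isolate the hippocampus from one half. Use one half of the brain tissue for cryosectioning: cut 10-20 μm sections on the cryostat and observe EB penetration under the confocal laser scanning microscope.
3. Weigh the other half of the brain tissue, mince it, and incubate it in dimethylformamide (1 ml per 100 mg of brain tissue; trichloroacetic acid or formamide may also be used) at 60 °C for 24 h. Centrifuge at 1000 r/min for 5 min (some authors report that brain tissue homogenized in formamide becomes gel-like, so that no supernatant can be obtained by centrifugation, and instead take the supernatant directly after the water bath for colorimetry), then measure the absorbance at 620 nm with a spectrophotometer.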
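Prusiner's Scientific Contributions
Stanley B. Prusiner is a prominent American biologist and neurologist. He received the 1997 Nobel Prize in Physiology or Medicine for his research and discoveries on infectious protein diseases (transmissible spongiform encephalopathies) and for proposing the prion (PrPSc) hypothesis.
Prusiner's contributions profoundly changed the understanding of infectious disease and provided valuable ideas for research on, and treatment of, neurodegenerative and protein-aggregation diseases.
How did these contributions unfold and come to influence the field? First, in a paper published in 1982, Prusiner proposed the groundbreaking hypothesis that these diseases are transmitted by an abnormal protein.
He named this abnormal protein PrPSc; the normal protein on the surface of neurons is called PrPC.
The hypothesis put forward an entirely new mechanism of infection and challenged the prevailing view that infectious agents had to be pathogens such as viruses or bacteria.
Prusiner defined PrPSc as an infectious agent that spreads through the accumulation of protein and goes on to cause neuronal loss and neurodegeneration.
This was the first hypothesis to link protein aggregation explicitly with disease, and it laid a solid foundation for later research on related diseases.
Next, Prusiner and his team set out to test the PrPSc hypothesis with a series of experiments.
Using highly purified PrPSc, they succeeded under experimental conditions in converting the normal protein PrPC into PrPSc, and they studied this conversion through protein-based transmission.
The experiments showed that PrPSc can propagate by self-replication, converting PrPC into the abnormal form and spreading to surrounding cells.
They provided strong evidence for the PrPSc hypothesis and an experimental basis for later research on related diseases.
Prusiner's research later extended to other protein-aggregation diseases, such as Alzheimer's disease and Parkinson's disease.
His team found that the abnormal proteins involved in these diseases resemble PrPSc and show similar processes of propagation and spread.
Through this work Prusiner broadened the understanding of protein-aggregation diseases, proposed the "infectious protein hypothesis", and offered new ideas about their pathological mechanisms and treatment.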
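A Brief Overview of Quantitative Proteomics Techniques
Quantitative proteomics is the identification and quantification of all the proteins expressed by a genome, or of all the proteins in a complex system.
Dynamic changes in the abundance of the proteome have an important influence on many life processes.
For example, the onset and progression of many diseases are often accompanied by abnormal expression of certain proteins.
Traditional 2D and 2D-DIGE techniques based on two-dimensional electrophoresis are gradually being replaced by liquid chromatography-mass spectrometry based on NanoLC-MS/MS; the latter needs less sample (25 µg of protein), is more sensitive (ng level), and has higher throughput (more than 5000 proteins can be identified and quantified in a single analysis).
Common quantitative proteomics techniques include iTRAQ/TMT and label-free quantification.
Here is a brief introduction to these methods. iTRAQ (Isobaric Tags for Relative and Absolute Quantitation) and TMT (Tandem Mass Tags) are in vitro peptide-labeling quantification technologies developed by AB Sciex and Thermo Fisher, respectively.
The technique uses several (2-10) stable-isotope tags that specifically label the amino groups of peptides for tandem mass spectrometry. It can compare the relative protein content of up to 10 different samples at once and can be used to study differences in protein expression between tissue samples under different pathological conditions or at different developmental stages.
Principle of analysis
An iTRAQ/TMT tag has three parts:
1. Reporter group: indicates the abundance of the protein sample.
2. Balance group: offsets the mass differences between reporter groups so that the isobaric tags have the same total mass, which ensures that a given labeled peptide has the same m/z.
3. Amine-specific reactive group: forms a covalent link with the peptide N-terminus and the side-chain amino group of lysine, thereby labeling the peptide.
After labeling, the same peptide from different samples has the same mass and appears as a single peak in the first-stage mass spectrum (MS1).
When this peak is selected for fragmentation, the different reporter groups are released in the second-stage mass spectrum (MS2), and the signal intensities of their peaks represent the expression level of that peptide, and of its corresponding protein, in each of the source samples.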
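
The last step above, reading relative abundance from reporter-ion intensities in MS2, can be illustrated with a small sketch. It assumes a single peptide, treats one channel as the reference, and ignores normalization and isotope-impurity correction that real pipelines apply; the function name, channel labels and intensity values are hypothetical.

```python
def reporter_ratios(intensities, reference_channel):
    """Express each channel's reporter-ion intensity relative to a reference channel.

    intensities: dict mapping a channel label to the MS2 reporter-ion intensity
    reference_channel: label of the channel used as the denominator
    """
    reference = intensities[reference_channel]
    if reference <= 0:
        raise ValueError("reference channel intensity must be positive")
    return {channel: value / reference for channel, value in intensities.items()}


if __name__ == "__main__":
    # Hypothetical reporter-ion intensities for one peptide across three samples.
    peptide = {"114": 1000.0, "115": 2000.0, "116": 500.0}
    print(reporter_ratios(peptide, "114"))
```

In this toy case the peptide appears twice as abundant in the sample labeled "115" and half as abundant in the sample labeled "116", relative to the reference sample.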
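Methods for Treating Inflammation Associated with Amyloid Deposits and Brain Inflammation Involving Activated Microglia
Patent type: invention patent
Inventors: 贝卡·所罗门, 奥尔纳·戈伦
Application number: CN200680006773.4
Filing date: 20060131
Publication number: CN101166536A
Publication date: 20080423
Patent content provided by the Intellectual Property Publishing House (知识产权出版社).
Abstract: A filamentous bacteriophage that does not display an antibody or a non-filamentous-bacteriophage antigen on its surface is used to inhibit or treat brain inflammation associated with amyloid deposits and/or involving activated microglia, to inhibit the formation of amyloid deposits, and to disaggregate pre-formed amyloid deposits.
Applicant: Ramot at Tel Aviv University Ltd. (雷蒙特亚特特拉维夫大学有限公司)
Address: Tel Aviv, Israel
Country: IL
Patent agency: 北京集佳知识产权代理有限公司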
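The Pearson Coefficient in Proteomics: Overview and Explanation
1. Introduction
1.1 Overview
Proteomics is the scientific field that studies the complete set of proteins in an organism, together with their composition and function.
Proteins are among the most important molecules in living organisms and play key roles in a wide range of biological processes.
The development of proteomics has provided new technical means for understanding the roles of proteins within cells, within tissues, and across the whole organism.
In proteomics research, the Pearson coefficient is a commonly used statistic that describes the degree of linear correlation between two sets of data.
Applied to proteomics, it can help researchers discover relationships between proteins and reveal their cooperative roles and regulatory mechanisms in various biological processes.
This article focuses on the application of the Pearson coefficient in proteomics and discusses its importance and how it is calculated.
We hope that readers come away with a deeper understanding of proteomics and the Pearson coefficient, and that the article offers some reference and inspiration for further research in proteomics.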
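
As a concrete illustration of the Pearson coefficient described in this overview, the sketch below computes it for two abundance profiles using only the standard library. It is a minimal example under stated assumptions: the function name and the abundance values are hypothetical and are not taken from the article.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


if __name__ == "__main__":
    # Hypothetical abundances of two proteins measured in the same five samples.
    protein_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    protein_b = [2.1, 3.9, 6.2, 8.1, 9.8]
    print(pearson_r(protein_a, protein_b))
```

A value close to 1 or -1 indicates a strong linear relationship between the two profiles; a value near 0 indicates no linear relationship, which does not rule out a nonlinear one.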
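1.2 Article structure
This article has three main parts: introduction, main text, and conclusion.
The introduction presents the concepts of proteomics and the Pearson coefficient, together with the purpose and structure of the article.
The main text explains what proteomics is and how the Pearson coefficient is applied in proteomics.
It also describes how the Pearson coefficient is calculated in a proteomics setting.
Finally, the conclusion summarizes the article, looks ahead to future research directions, and draws conclusions.
This structure is intended to give readers a clear view of the framework and content of the article.
1.3 Purpose
Proteomics is an interdisciplinary field that draws on bioinformatics, biochemistry, and biology, and it is important for unraveling the complexity of the proteome.
This article aims to discuss the application of the Pearson coefficient to the study of protein interaction networks in proteomics and to describe how it is calculated.
Through this discussion we hope to help readers understand the basic concepts and methods of proteomics and their value in related fields.
We also hope to encourage more researchers to take an interest in the Pearson coefficient in proteomics and related areas, and so to promote further development and exploration of the field.
2. Main text
2.1 What is proteomics
Proteomics is the field that studies the overall properties and functions of proteins in biological systems.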
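N-Glycans Containing Galactose-α-1,3-Galactose in Glycoprotein Products Derived from CHO Cells
Patent type: invention patent
Inventors: C·鲍斯克斯, J·墨菲, H·萨维亚, N·瓦斯博恩, C·刘, X-J·徐
Application number: CN200980155081.X
Filing date: 20090122
Publication number: CN102292640A
Publication date: 20111221
Patent content provided by the Intellectual Property Publishing House (知识产权出版社).
Abstract: The invention provides methods for evaluating a population of Chinese hamster ovary (CHO) cells by measuring glycans, produced by the cells, that contain terminal galactose-α-1,3-galactose residues, wherein the CHO cells have not been genetically engineered to express an α-galactosyltransferase coding sequence.
Applicant: Momenta Pharmaceuticals (动量制药公司)
Address: Massachusetts, USA
Country: US
Patent agency: 北京纪凯知识产权代理有限公司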
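The Value of Bayes Stepwise Discriminant Analysis in the Clinical Classification of Pediatric Viral Hepatitis
Authors: 吕晴; 陈惠黎
Journal: 《微型电脑应用》 (Microcomputer Applications)
Year (volume), issue: 1990(000)004
Total pages: 4 (P56-59)
Author affiliations: not stated
Language of the text: Chinese
Chinese Library Classification: R512.6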
Bayesian Segmentation of Protein Secondary Structure

Scott C. Schmidler*
Section on Medical Informatics and Departments of Biochemistry and Statistics
Stanford University School of Medicine, Stanford, CA 94305, USA

Jun S. Liu
Department of Statistics
Stanford University, Stanford, CA 94305, USA

Douglas L. Brutlag
Department of Biochemistry
Stanford University School of Medicine, Stanford, CA 94305, USA

Keywords: Protein Secondary Structure Prediction, Bayesian Methods, Probabilistic Modeling

Running Head: Bayesian Segmentation of Secondary Structure

*Corresponding author. Address: Section on Medical Informatics, Medical School Office Building, X215, Stanford University School of Medicine, Stanford, CA 94305. Phone: (650) 723-5976. Fax: (650) 725-6044

Abstract

We present a novel method for predicting the secondary structure of a protein from its amino acid sequence. Most existing methods predict each position in turn based on a local window of residues, sliding this window along the length of the sequence. In contrast, we develop a probabilistic model of protein sequence/structure relationships in terms of structural segments, and formulate secondary structure prediction as a general Bayesian inference problem. A distinctive feature of our approach is the ability to develop explicit probabilistic models for α-helices, β-strands, and other classes of secondary structure, incorporating experimentally and empirically observed aspects of protein structure such as helical capping signals, side chain correlations, and segment length distributions. Our model is Markovian in the segments, permitting efficient exact calculation of the posterior probability distribution over all possible segmentations of the sequence using dynamic programming. The optimal segmentation is computed and compared to a predictor based on marginal posterior modes, and the latter is shown to provide significant improvement in predictive accuracy.
The marginalization procedure provides exact secondary structure probabilities at each sequence position, which are shown to be reliable estimates of prediction uncertainty. We apply this model to a database of 452 non-homologous structures, achieving accuracies as high as the best currently available methods. We conclude by discussing an extension of this framework to model non-local interactions in protein structures, providing a clear direction for future improvements in secondary structure prediction accuracy.

1. Introduction

Prediction of the secondary structure of a protein from its amino acid sequence remains an important and difficult task. Not only can successful predictions provide a starting point for direct tertiary structure modeling (Friesner & Gunn, 1996; Jones et al., 1994; Monge et al., 1994; Rost et al., 1996), but they can also significantly improve sequence analysis and sequence-structure threading (Fischer & Eisenberg, 1996; Russell et al., 1996) for aiding in structure and function determination. However, despite considerable progress in secondary structure prediction over the last decade (see Barton, 1995, for a recent survey), the current best methods reach accuracies of about 75% when multiple homologous sequences are available (Frishman & Argos, 1997), and 71% for single sequence predictions (Salamov & Solovyev, 1997). New methods which more accurately reflect features of protein structure folding and stabilization may be necessary to advance prediction beyond these levels.

Since early attempts to predict secondary structure (Garnier et al., 1978), most efforts have focused on development of mappings from a local window of residues in the sequence to the structural state of the central residue in the window, and a large number of methods for estimating such mappings have been developed.
Early approaches scored individual amino acids by frequency of occurrence in each structural state, combining them in ways corresponding to conditional independence models (Chou & Fasman, 1974; Garnier et al., 1978). Improvements in accuracy were achieved by methods considering correlations among positions within the window, either implicitly using semi- and non-parametric statistical models such as neural networks (Holley & Karplus, 1989; Qian & Sejnowski, 1988; Stolorz et al., 1992) and nearest-neighbor classifiers (Yi & Lander, 1993; Zhang et al., 1992), or explicitly (Garnier et al., 1996; Munson et al., 1994; Riis & Krogh, 1996). Further improvements were demonstrated by the inclusion of evolutionary information via multiple alignments of homologous sequences (Di Francesco et al., 1996; Rost & Sander, 1993a; Rost & Sander, 1994; Salamov & Solovyev, 1995), although the relative contribution of such information has been debated (Benner, 1995; Frishman & Argos, 1996). Interestingly, most recent improvements in accuracy have come from methods which are capable of considering non-local interactions in the sequence which occur outside a fixed length window (Frishman & Argos, 1996; Frishman & Argos, 1997; Salamov & Solovyev, 1997). Here we take a model-based approach, formulating secondary structure prediction as a general Bayesian inference problem. Such an approach avoids many of the problems associated with window-based predictions, such as the need for post-prediction "filtering" (Frishman & Argos, 1996; Rost & Sander, 1993b), and provides a general framework for incorporation of the growing body of scientific knowledge about protein structure into the prediction process.

2. Methods

We begin by choosing a representation of sequence/structure relationships in proteins which is based on segments of secondary structure. We parameterize this model in a convenient fashion by representing the segment positions and structural types.
We denote segment locations by the position of the last residue in the segment, following (Auger & Lawrence, 1989; Liu & Lawrence, 1996). Because segments are required to be contiguous, this parameterization uniquely identifies a set of segment locations for a given sequence. Let $R = (R_1, R_2, \ldots, R_n)$ be a sequence of $n$ amino acid residues, $S = \{ i : \mathrm{Struct}(R_i) \neq \mathrm{Struct}(R_{i+1}) \}$ be a sequence of $m$ positions denoting the end of each individual structural segment (so that $S_m = n$), and $T = (T_1, T_2, \ldots, T_m)$ be the sequence of secondary structural types for each respective segment. An example is given in Figure 1. We will concern ourselves with the 3-state problem, where $T_i \in \{H, E, L\}$ for all $i$, although generalizations may be desirable. Together $m$, $S$ and $T$ completely determine a secondary structure assignment for a given amino acid sequence. In the case of secondary structure prediction, the quantities of interest are thus the values of $m$, $S = (S_1, S_2, \ldots, S_m)$ and $T = (T_1, T_2, \ldots, T_m)$ corresponding to the known amino acid sequence $R = (R_1, R_2, \ldots, R_n)$, i.e. the locations and types of the secondary structural segments.

The problem is to infer the values of $(m, S, T)$ given a residue sequence $R$. We take a Bayesian approach to the assignment of these parameter values, by defining a joint probability distribution $P(R, m, S, T)$ for an amino acid sequence and its secondary structure assignment. We then compute the conditional or posterior probability distribution over structural assignments given a new sequence, $P(m, S, T \mid R)$, via Bayesian inference, and predict those secondary structure assignments $(m, S, T)$ which maximize this posterior distribution. In Section 2.1, we define a general segment-based joint probability model which lends itself to efficient exact calculation of the posterior.
Section 2.2 provides specific models for α-helices, β-strands, and loops, and shows how such models can be used to capture key aspects of protein structure formation observed in experimental settings and database analyses. Section 2.3 describes an algorithm for computation of quantities of interest under the posterior distribution.

2.1 Basic model

A key aspect of our approach is the choice of a joint probability model $P(R, m, S, T)$ which is decomposable into individual segment terms. In other words, the joint distribution may be factored by conditional independence of inter-segment residues, so that the sequence likelihood can be written as a product of segment likelihoods:

$$P(R \mid m, S, T) = \prod_{j=1}^{m} P\left(R_{[S_{j-1}+1:S_j]} \mid S, T\right) \qquad (1)$$

where the $j$th term on the right-hand side of (1) is the likelihood of the subsequence of $R$ beginning at position $S_{j-1}+1$ and ending at position $S_j$, in other words the amino acids in segment $j$. The exact form of this segment likelihood is structure-dependent, and the specification of this form for each structural type amounts to developing a probabilistic model of the given type of segment. The particular models used in this paper are developed in Section 2.2, but some general comments are appropriate here. First, note that this model does not assume conditional independence of intra-segment residues; in fact, as described in Section 2.2 an explicit goal of our approach is to choose a form which allows us to model correlation among positions within a segment. Moreover, the terms $P\left(R_{[S_{j-1}+1:S_j]} \mid S, T\right)$ for individual segments can take on arbitrary form, and may depend on general properties of a segment (such as hydrophobic moment or helix dipole) beyond properties of individual residues.

Given (1), we need only provide the prior distribution $P(m, S, T)$ to completely specify the joint distribution $P(R, m, S, T)$.
A computationally convenient choice is to factor $P(S, T \mid m)$ as a Markov process:

$$P(m, S, T) = P(m) \prod_{j=1}^{m} P(T_j \mid T_{j-1})\, P(S_j \mid S_{j-1}, T_j) \qquad (2)$$

where each segment type depends only on its nearest neighbors, and the conditioning of $S_j$ on $(S_{j-1}, T_j)$ allows explicit modeling of the differing length distributions of each segment type observed in the Protein Data Bank (Bernstein et al., 1977), as shown in Figure 2. Here we take $P(m)$ to be improper uniform; more informative priors on $m$ are possible, but have little impact (Schmidler, 2000). By the choice of (2), our model becomes closely related to the class of hidden semi-Markov or semi-Markov source models discussed in (Levinson, 1986; Rabiner, 1989; Russell & Moore, 1985) for applications in speech recognition. In the speech recognition literature however, observations during a given state occupancy are typically modeled as iid (independent and identically distributed). As described in Section 2.2, the ability to model both non-independence and non-identity of distributions is the major motivation for our segment-based approach. We note also that a model very similar to that given in (1,2) has been developed independently by (Burge & Karlin, 1997) and applied to gene parsing in eukaryotic DNA with great success.

2.2 Probabilistic models for protein structure

Our goal here is to choose a specific form of the segment likelihood $P\left(R_{[S_{j-1}+1:S_j]} \mid S, T\right)$ which captures core aspects of protein secondary structure formation: hydrophobicity patterns, side chain interactions, and helical capping signals. In other words, we wish to develop probabilistic models for protein structural segments. For example, the function $P\left(R_{[i:j]} \mid i, j, H\right)$ provides the likelihood of the subsequence $R_{[i:j]}$ under the assumption that a helix begins at position $i$ and ends at position $j$.
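The factorization in (1) and (2) can be sketched as a sum of per-segment log terms. Everything named here is an illustrative assumption: `log_joint`, the callables passed to it, the separate `log_init` table for the first segment type (equation (2) does not spell out how $T_0$ is handled), and the toy uniform distributions in the usage lines. The flat prior $P(m)$ is dropped as a constant.

```python
import math


def log_joint(R, S, T, seg_loglik, log_trans, log_len, log_init):
    """log P(R, m, S, T) under (1) and (2), up to the flat prior on m.

    seg_loglik(sub, t)     -> log-likelihood of subsequence `sub` as a type-t segment
    log_trans[(prev_t, t)] -> log P(T_j = t | T_{j-1} = prev_t)
    log_len(length, t)     -> log P(S_j | S_{j-1}, T_j = t), a length distribution
    log_init[t]            -> log-probability of the first segment type (assumption)
    """
    total, prev_end, prev_t = 0.0, 0, None
    for end, t in zip(S, T):
        total += log_init[t] if prev_t is None else log_trans[(prev_t, t)]
        total += log_len(end - prev_end, t)       # prior term from (2)
        total += seg_loglik(R[prev_end:end], t)   # likelihood term from (1)
        prev_end, prev_t = end, t
    return total


# Toy placeholders: uniform residues, uniform lengths up to 30, no self-transitions.
uniform_aa = lambda sub, t: len(sub) * math.log(1 / 20)
log_trans = {(a, b): math.log(0.5) for a in "HEL" for b in "HEL" if a != b}
log_init = {t: math.log(1 / 3) for t in "HEL"}
log_len = lambda length, t: math.log(1 / 30)

value = log_joint("AAAAAA", [2, 6], ["L", "H"],
                  uniform_aa, log_trans, log_len, log_init)
```

Because the segment likelihood is passed in as a function of the whole subsequence, it is free to model dependence among positions inside a segment, which is the property the text emphasizes.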
Given such a segment likelihood for each structural class (H, E, L), computing the likelihood of a sequence under any given structural assignment is trivially done by evaluating the product of (1) and (2). Here we provide the exact forms for these segment likelihoods used in this paper.

Helix model: The presence of correlated side chain mutations in α-helices has been well studied, deriving from both environmental constraints such as hydropathy (Eisenberg et al., 1984) and from stabilizing side chain interactions (Klingler & Brutlag, 1994). These correlations in non-adjacent sequence positions are induced by their spatial proximity in the folded protein molecule, and hence provide an important source of information about the underlying structure. Figure 3 shows an example of an amphipathic helix which exhibits periodicity in sequence hydrophobicity. Because of the differing rates of rotation in helices and strands, this side chain periodicity can be an important clue for identifying the underlying backbone conformation.

[Fig. 1] [Fig. 2]

Another important source of information for identifying α-helical segments in protein sequences is the existence of helical capping signals, the preference for particular amino acids at the N- and C-terminal ends which terminate helices through side chain-backbone hydrogen bonds or hydrophobic interactions. Such signals have been well characterized experimentally in terms of their stabilizing effect in helical peptides (Doig & Baldwin, 1995; Presta & Rose, 1988; Richardson & Richardson, 1988) (see (Aurora & Rose, 1998) for a review), as well as empirically through observed correlations (Klingler & Brutlag, 1994). This capping effect results in amino acid distributions at end-segment positions which differ significantly from that of internal positions.
Figure 4 shows some of these informative distributions.

Our goal is to develop a helical segment model which captures such position-specific preferences and probabilistic dependence of intra-segment residues, in addition to standard amino acid propensities. The model must also account for helices of various lengths. In this paper we use the following form of this distribution:

$$\begin{aligned}
P\left(R_{[S_{j-1}+1:S_j]} \mid S_{j-1}, S_j, H\right) = {} & \prod_{i=S_{j-1}+1}^{S_{j-1}+\ell_N^H} P^H_{N_{i-S_{j-1}}}\left(R_i \mid R_{[S_{j-1}+1:i-1]}\right) \\
& \times \prod_{i=S_{j-1}+\ell_N^H+1}^{S_j-\ell_C^H} P^H_{I}\left(R_i \mid R_{[S_{j-1}+1:i-1]}\right) \\
& \times \prod_{i=S_j-\ell_C^H+1}^{S_j} P^H_{C_{S_j-i+1}}\left(R_i \mid R_{[S_{j-1}+1:i-1]}\right)
\end{aligned} \qquad (3)$$

Here $\ell_N^H$ indicates the length of the helix N-cap model; $N_i$, $C_i$ indicate the $i$th position from the N- and C-termini respectively; and $I$ indicates an internal (non-cap) position. Figure 5 shows graphically how this model is applied to the particular amino acid subsequence of the helix in Figure 3: the first product term in (3) models the distribution of amino acids at each of the first $\ell_N^H$ N-terminal positions (N-cap, N1, N2, N3, …), and similarly the last term for the C-terminal positions (…, C3, C2, C1, C-cap), while the middle term models all internal positions as identically distributed but dependent.

In choosing the length of the helix cap models $\ell_N^H, \ell_C^H$, we considered caps of up to 4 positions at each end of the segment. The first 4 such positions at each terminus in α-helices are of particular interest due to their inability to form intra-helical hydrogen bonds, their propensity for acidic/basic side chains, and stabilization effects of the helical dipole moment. Figure 4 shows distributions for these positions in α-helices. As described in Section 3, we use secondary structure assignments provided by DSSP (Kabsch & Sander, 1983) which do not include the first and last hydrogen bonded residues in a helix. Hence the N-cap and C-cap positions are not typically included by DSSP.
(To correct for this, we allow the segment transition term in (2) to depend on the last residue of the previous segment.) Nevertheless, Figure 4a displays previously observed patterns such as the prevalence of Pro at position N1 and Glu and Asp at position N2, while Figure 4b shows the expected prevalence of Ala and various hydrophobic residues at internal positions of helices.

Table 1 shows the statistical deviance between the amino acid distribution at each end-segment position and the amino acid distribution at internal positions, calculated using the data set described in Section 3. The strongest signal appears in the first two positions of the helical N-terminus (N1 and N2), while β-strands and loops show little change in these positions. The positions we included in each structural model for predictive purposes are highlighted (so that $\ell_N^H = 4$, $\ell_C^H = 1$), capturing the positions that are significantly different. It should be noted that such information is inherently difficult to include in window-based prediction methods, which must scan a residue across each position in the window in turn.

Equation (3) provides everything except the exact intra-segment residue dependencies in the model. For α-helices, these are given by:

$$P^H\left(R_i \mid R_{[j:i-1]}\right) = P^H\left(R_i \mid h_i\right)\, P^H\left(h_i \mid h_{i-2}, h_{i-3}, h_{i-4}\right) \qquad (4)$$

where $h_i \in \{\text{hydrophobic}, \text{neutral}, \text{hydrophilic}\}$ indicates the hydrophobicity class of residue $R_i$ assigned by (Klingler & Brutlag, 1994). In other words, dependency between positions is modeled using a reduced alphabet in order to avoid combinatorial explosion of parameters. Figure 6 provides a graphical model representation (Whittaker, 1990) of the dependency structure given by (4). This form of the distribution allows us to explicitly capture the previously described intra-segment residue correlations corresponding to the periodicity of an α-helix, by conditioning the probability of a particular residue on the $i-4$, $i-3$, and $i-2$ residues.
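The reduced-alphabet dependency in (4) can be sketched as a two-table lookup. The three-class grouping in `HCLASS` below is a made-up placeholder and not the Klingler & Brutlag assignment, and the uniform probability tables exist only to make the sketch runnable; in practice both tables would be estimated from counts in the structural database.

```python
import math
from itertools import product

# Hypothetical hydrophobicity classes (assumption, NOT the published table):
# 'o' = hydrophobic, 'n' = neutral, 'i' = hydrophilic.
HCLASS = {**dict.fromkeys("AVLIMFWC", "o"),
          **dict.fromkeys("GPSTYH", "n"),
          **dict.fromkeys("DEKRNQ", "i")}


def log_p_internal(seq, i, p_res_given_class, p_class_given_context):
    """log P(R_i | R_[j:i-1]) for an internal helix position i (0-based, i >= 4).

    Implements the form of (4): P(R_i | h_i) * P(h_i | h_{i-2}, h_{i-3}, h_{i-4}).
    """
    h = HCLASS[seq[i]]
    context = tuple(HCLASS[seq[i - k]] for k in (2, 3, 4))
    return (math.log(p_res_given_class[(seq[i], h)])
            + math.log(p_class_given_context[(h, context)]))


# Toy uniform tables so the sketch runs end to end.
sizes = {c: sum(1 for v in HCLASS.values() if v == c) for c in "oni"}
p_res = {(aa, c): 1.0 / sizes[c] for aa, c in HCLASS.items()}
p_ctx = {(h, ctx): 1.0 / 3 for h in "oni" for ctx in product("oni", repeat=3)}

val = log_p_internal("ALKELLKK", 4, p_res, p_ctx)
```

With three classes, the context table has 27 contexts times 3 classes, or 81 entries, whereas conditioning on three full residues would need 20^3 contexts times 20 residues, or 160,000 entries. This is the parameter saving the reduced alphabet is meant to provide.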
Internal positions are therefore modeled as identically distributed, but dependent. We note that (Stultz et al., 1993) also provide a model for amphipathicity in α-helices in their development of structured hidden Markov models for particular tertiary folds.

β-Strand and Loop models: The general form of (3) is convenient for modeling variable-length segments, and we retain such a form for β-strand and loop segments. However the utility of distinguishing end-capping residues in β-strands and loops is less obvious than in the case of α-helices. In choosing $\ell_N^E, \ell_C^E, \ell_N^L, \ell_C^L$ we once again considered up to 4 positions for loops and 3 for β-strands (due to sparse data). Again Table 1 shows the statistical deviance, and it is seen that β-strands and loops show little change in these positions. Accordingly we set $\ell_N^E = 1$, $\ell_C^E = 1$, $\ell_N^L = 2$, $\ell_C^L = 1$. Figures 3a and 3b reveal some expected patterns in the associated amino acid distributions, such as Pro in position 2 as an initial (1) helix-terminating position in loops, a prevalence of Gly in internal loop positions, and various hydrophobic residues in strands.

1 The occurrence of the peak at position 2 rather than position 1 is again an artifact of the helix boundaries defined by DSSP.

[Table 1] [Fig. 4]

Another difference between the models for α-helices, β-strands, and loops lies in the exact form of (4), reflecting the differing intra-segment correlations induced by the underlying backbone-side chain geometry being modeled. Reflecting the periodicity of β-strand side chains, conditioning is done on residues $i-1$ and $i-2$, and loops are modeled similarly.

Finally, we note that no algorithmic model selection was done to select the best models of form (3,4), and so the models described above, while effective, are not necessarily optimal. Moreover, (3,4) is only one of many conceivable forms for the segment models, and many other possibilities exist.
For example, many of the statistical models used for window-based prediction methods might be adapted for this purpose. Thus we view the development of new models for structural segments to be a promising area of research. So long as the factorization given by (1) is maintained, our general framework holds and the computational methods described in the next section are applicable. Generalizations of (1) itself are discussed in Section 4.2.

2.3 Computation and inference

Assuming the probability model given by (1-4), we wish to infer the secondary structure assignment parameters $(m, S, T)$ for a new protein sequence $R$. Thus we wish to find $(m, S, T)$ such that $P(m, S, T \mid R)$ is maximized (2). As mentioned in Section 2.1, our class of models is structurally similar to the class of semi-Markov source models described in (Rabiner, 1989). Thus computation can be done exactly using a slight generalization of the standard forward-backward algorithm for hidden Markov models (HMMs), as described in (Rabiner, 1989), using the forward and backward variables defined as follows:

$$\alpha(j,t) = \sum_{v=1}^{j-1} \sum_{l \in SS} \alpha(v,l)\, P\left(R_{[v+1:j]} \mid S_{prev}=v, S=j, T=t, \theta\right) \times P\left(S=j \mid T=t, S_{prev}=v, \theta\right) P\left(T=t \mid T_{prev}=l, \theta\right) \qquad (5)$$

$$\beta(j,t) = \sum_{v=j+1}^{n} \sum_{l \in SS} \beta(v,l)\, P\left(R_{[j+1:v]} \mid S_{next}=v, S=j, T_{next}=l, \theta\right) \times P\left(S_{next}=v \mid S=j, T_{next}=l, \theta\right) P\left(T_{next}=l \mid T=t, \theta\right) \qquad (6)$$

where $SS = \{H, E, L\}$ is the set of possible secondary structural types and $\theta$ represents the model parameters. This yields an $O(n^3)$ algorithm, but in practice we limit the maximum size considered for any one segment to some length $D$. Thus the first summation in (5) begins at $(j - D)$ and the first summation in (6) ends at $(j + D)$, yielding an algorithm which is linear ($O(nD^2)$) in the length of the input sequence for fixed $D$. All experiments in this paper use a value of $D = 30$, which is sufficiently large to account for nearly all observed structural segments as can be seen from Figure 2. We note that the model given by (3) allows further reduction of the $D^2$ term to $D$, yielding $O(nD)$; however, this does not hold in general and in practice the additional computational savings provided by this form are unnecessary.

2 Throughout, the parameters of the probability model will be assumed to be fixed, and we discuss only computation of predictive quantities of interest. Estimation of these probability parameters from the structural database described in Section 3 is straightforward using maximum likelihood or maximum a posteriori methods, and amounts to counting observed frequencies for the desired quantities (the length of alpha-helices, or the occurrence of particular amino acids in the C-terminal capping position of an alpha-helix, for example). Because the database contains only sequences with known structures, no Baum-Welch type iteration is required during estimation. This contrasts with the use of HMM-like models in many (footnote continues below)

We can therefore compute the maximum a posteriori values of $(m, S, T)$:

$$\mathrm{Struct}_{MAP} = \arg\max_{(m,S,T)} P(m, S, T \mid R, \theta)$$

using a procedure analogous to the Viterbi algorithm for HMMs (see (Rabiner, 1989)), simply by replacing the summations in (5) with maximization. We refer to these values of $(m, S, T)$ as the MAP segmentation. A similar approach is taken by (Burge & Karlin, 1997) to find the optimal parse of a DNA sequence. We note however that many different segmentations may exist which, although not optimal, may have significant probability mass. In addition, the most commonly reported measure of accuracy for protein secondary structure prediction is the $Q_3$ value, the percentage correct on a per-residue basis.
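A minimal sketch of the Viterbi-style recursion for the MAP segmentation, with segment length capped at $D$, is given below. It is not the authors' implementation: the function name, the callable arguments, the separate `log_init` table for the first segment, and the two-state toy model are all assumptions made for illustration.

```python
import math


def map_segmentation(R, types, seg_loglik, log_trans, log_len, log_init, D=30):
    """Return (S, T, score) maximizing the joint log-probability.

    delta[j][t] is the best log-score of any segmentation of R[:j]
    whose last segment has type t and ends at position j (1-based).
    """
    n = len(R)
    delta = [dict.fromkeys(types, -math.inf) for _ in range(n + 1)]
    back = [dict.fromkeys(types) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for t in types:
            for v in range(max(0, j - D), j):          # previous segment end
                local = log_len(j - v, t) + seg_loglik(R[v:j], t)
                if v == 0:                              # first segment
                    cands = [(log_init[t] + local, None)]
                else:
                    cands = [(delta[v][l] + log_trans[(l, t)] + local, l)
                             for l in types if (l, t) in log_trans]
                for score, l in cands:
                    if score > delta[j][t]:
                        delta[j][t], back[j][t] = score, (v, l)
    # Trace back from the best final segment type.
    t = max(types, key=lambda s: delta[n][s])
    S, T, j = [], [], n
    while j > 0:
        S.append(j)
        T.append(t)
        j, t = back[j][t]
    return S[::-1], T[::-1], max(delta[n].values())


# Toy two-state model: 'H' prefers residue A, 'L' prefers residue G.
emit = {"H": {"A": 0.9, "G": 0.1}, "L": {"A": 0.1, "G": 0.9}}
seg_loglik = lambda sub, t: sum(math.log(emit[t][aa]) for aa in sub)
log_trans = {("H", "L"): 0.0, ("L", "H"): 0.0}   # always switch type
log_init = {"H": math.log(0.5), "L": math.log(0.5)}
log_len = lambda length, t: -math.log(30)        # uniform over lengths 1..30

S, T, score = map_segmentation("AAAAGGGG", "HL", seg_loglik,
                               log_trans, log_len, log_init)
```

In this sketch each of the $n$ end positions considers at most $D$ start positions, and evaluating a segment likelihood from scratch costs up to $D$ more, which corresponds to the $O(nD^2)$ bound quoted in the text.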
Thus the MAP segmentation is not as desirable as the marginal posterior mode at each position:

$$\mathrm{Struct}_{Mode} = \left\{ \arg\max_{T} P\left(T_{R_{[i]}} \mid R, \theta\right) \right\}_{i=1}^{n}$$

where $P\left(T_{R_{[i]}} \mid R, \theta\right)$ represents the marginal posterior distribution over structural types at position $i$. Fortunately, this is easily calculated from (5,6) above:

$$P\left(T_{R_{[i]}} = t \mid R, \theta\right) = \sum_{j=i-D+1}^{i-1}\ \sum_{k=i}^{j+D-1}\ \sum_{l \in SS} \alpha(j,l)\, \beta(k,t)\, P\left(R_{[j+1:k]} \mid S_{prev}=j, S=k, T=t, \theta\right) \times P\left(S=k \mid S_{prev}=j, T=t, \theta\right) P\left(T=t \mid T_{prev}=l, \theta\right) / Z \qquad (7)$$

where $Z$ is the normalizing constant (or partition function) $P(R \mid \theta)$ which is available directly from the forward pass (5). The calculation in (7) yields the marginal posterior distribution at each position in the sequence in $O(nD)$ time. We show in Section 3 that this marginal mode strategy significantly outperforms the MAP segmentation strategy by the $Q_3$ measure.

It is worth reiterating that (7) gives us the exact marginal posterior distribution over secondary structural types at each position, averaging over all possible segmentations, and hence provides an exact measure of the uncertainty of prediction at each position (subject to modeling assumptions). Figure 8 in Section 3 shows that this measure correlates very strongly with prediction accuracy, and is still somewhat conservative. Figure 7 shows a typical sequence prediction, where we see that segment endpoints are the regions of highest uncertainty, as we would expect given the variability of assignments in such positions (Colloc'h et al., 1993).

2 (continued) other applications (such as multiple sequence alignment), where the underlying model is assumed unknown,

Lastly, we note that our approach can easily incorporate prior knowledge about regions or positions in the sequence if such is available. That is, the methods described in this section can be easily modified to calculate probabilities conditional on certain positions or segments taking on known conformations.
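The marginal posterior computation can be sketched by accumulating, for every candidate segment, its posterior probability onto the residues it covers. This is a naive illustration rather than the authors' algorithm: it works in plain probability space (so it would underflow on long sequences), it handles the first segment through an assumed `init` table, and its innermost loop makes it slower than the bound quoted for (7). All names and the toy model are assumptions.

```python
def posterior_marginals(R, types, seg_lik, trans, length_p, init, D=30):
    """Per-residue posterior P(T_i = t | R), summing over all segmentations."""
    n = len(R)

    def local(v, j, t):
        # Length prior times likelihood of a type-t segment covering R[v:j].
        return length_p(j - v, t) * seg_lik(R[v:j], t)

    def inflow(alpha, v, t):
        # Mass arriving at a type-t segment that starts right after position v.
        if v == 0:
            return init[t]
        return sum(alpha[v][l] * trans.get((l, t), 0.0) for l in types)

    # alpha[j][t]: P(R[:j], a segment of type t ends at position j)
    alpha = [dict.fromkeys(types, 0.0) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for t in types:
            alpha[j][t] = sum(inflow(alpha, v, t) * local(v, j, t)
                              for v in range(max(0, j - D), j))

    # beta[j][t]: P(R[j:] | a segment of type t ends at position j)
    beta = [dict.fromkeys(types, 0.0) for _ in range(n + 1)]
    for t in types:
        beta[n][t] = 1.0
    for j in range(n - 1, 0, -1):
        for t in types:
            beta[j][t] = sum(trans.get((t, l), 0.0) * local(j, v, l) * beta[v][l]
                             for v in range(j + 1, min(n, j + D) + 1)
                             for l in types)

    Z = sum(alpha[n].values())                     # P(R), from the forward pass
    post = [dict.fromkeys(types, 0.0) for _ in range(n)]
    for k in range(1, n + 1):                      # segment end
        for j in range(max(0, k - D), k):          # previous segment end
            for t in types:
                w = inflow(alpha, j, t) * local(j, k, t) * beta[k][t] / Z
                for i in range(j, k):              # residues in R[j:k]
                    post[i][t] += w
    return post, Z


# Toy two-state model: 'H' prefers residue A, 'L' prefers residue G.
emit = {"H": {"A": 0.9, "G": 0.1}, "L": {"A": 0.1, "G": 0.9}}


def seg_lik(sub, t):
    p = 1.0
    for aa in sub:
        p *= emit[t][aa]
    return p


trans = {("H", "L"): 1.0, ("L", "H"): 1.0}
init = {"H": 0.5, "L": 0.5}
length_p = lambda length, t: 1.0 / 30 if length <= 30 else 0.0

post, Z = posterior_marginals("AAAAGGGG", "HL", seg_lik, trans, length_p, init)
mode = "".join(max(p, key=p.get) for p in post)
```

Since exactly one segment covers each residue in any segmentation, the values in each `post[i]` sum to one, and the largest entry at each position is the marginal posterior mode used as the per-residue prediction.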
Such might be the case, for example, if experimental evidence exists such as circular dichroism data or footprinting experiments, or if highly significant motif hits occur on the sequence and we wish to include them, for example with helix-turn-helix DNA binding motifs. Again, such information is inherently difficult to include in most existing secondary structure prediction methods.

2 (continued) or the data is not fully observable.

[Fig. 5]

3. Results

In order to evaluate the accuracy of our approach, we created a non-redundant set of 452 globular protein structures from the Protein Data Bank (Bernstein et al., 1977) using OBSTRUCT (Heringa et al., 1992). We created a maximal set of structures determined at better than 2.5 angstroms resolution with less than 25% sequence identity, removing those structures classified as membrane proteins within the SCOP hierarchy (Murzin et al., 1995) and those sequences less than 50 amino acids in length. Table 2 reports the results of cross-validation experiments whereby each structure was predicted in turn, using parameters of (2,3,4) estimated from the remaining 451 structures. Quantities reported are the total percent correct ($Q_3$), percent of each structural type predicted correctly (sensitivity), percent of predictions for each type which were correct (positive predictive value), and the Matthews correlation (Matthews, 1975). Computation time on an SGI 195MHz Octane ranges from 0.2 seconds for the shortest sequence (50 residues) to 6.4 seconds for the longest (869 residues). The gold-standard secondary structure assignments were taken to be those provided by DSSP (Kabsch & Sander, 1983), with adjustments following (Frishman & Argos, 1996) to restrict the minimum β-strand length to 3, and the minimum α-helix length to 5. Our Bayesian segmentation algorithm (BSPSS) achieves a $Q_3$ accuracy of 68.8%, as high as most published single-sequence methods (Frishman & Argos, 1996) and only slightly below the best such method (71%) (Salamov & Solovyev, 1997).
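The four reported quantities can be computed from a predicted and a reference per-residue string as sketched below. The function name and the two example strings are illustrative assumptions and are unrelated to the data set or results in this paper.

```python
import math


def per_state_metrics(pred, true, states="HEL"):
    """Q3 plus per-state sensitivity, positive predictive value, and Matthews correlation."""
    assert len(pred) == len(true)
    n = len(true)
    pairs = list(zip(pred, true))
    q3 = sum(p == t for p, t in pairs) / n
    out = {}
    for s in states:
        tp = sum(p == s and t == s for p, t in pairs)
        fp = sum(p == s and t != s for p, t in pairs)
        fn = sum(p != s and t == s for p, t in pairs)
        tn = n - tp - fp - fn
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        out[s] = {
            "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
            "ppv": tp / (tp + fp) if tp + fp else float("nan"),
            "mcc": (tp * tn - fp * fn) / denom if denom else float("nan"),
        }
    return q3, out


# Hypothetical 10-residue example with two mispredicted positions.
q3, stats = per_state_metrics(pred="HHHLLLEEEL", true="HHHHLLEELL")
```

Sensitivity and positive predictive value answer different questions (how much of a true state is recovered versus how trustworthy a predicted state is), which is why both are reported per structural type alongside the single $Q_3$ figure.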
Of the two predictors described in Section 2.3, the marginal mode at each position significantly outperforms the MAP segmentation on a per-residue basis.

As described in Section 2.3, the BSPSS algorithm calculates the exact posterior distribution over structural types at each position. Figure 8 shows the $Q_3$ accuracy as a function of the probability assigned to the predicted structure at each position. As can be seen from the strong correlation, a clear advantage of our explicit probabilistic approach is the ability to accurately estimate the confidence in prediction at each position. At a threshold prediction probability of 0.6, we make predictions for 58% of positions and achieve an accuracy of 80.6%. At a threshold probability of 0.8 we achieve an accuracy of 91.4%, but predict only 21% of positions with this level of confidence.

[Fig. 6] [Table 2]

It is worth noting that according to (Rost & Schneider, 1997), these threshold percentages indicate that the BSPSS algorithm performs 6 times as well as other