BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btk005 Gene expression


BIOINFORMATICS APPLICATIONS NOTE
Vol. 19 no. 12 2003, pages 1594–1595
DOI: 10.1093/bioinformatics/btg198

Digital extractor: analysis of digital differential display output

Stephen F. Madden 1, Barry O'Donovan 2, Simon J. Furney 1, Hugh R. Brady 1, Guenole Silvestre 2 and Peter P. Doran 1,∗

1 Human Genomics and Bioinformatics Research Unit, Department of Medicine and Therapeutics, University College Dublin, Mater Misericordiae Hospital, The Dublin Molecular Medicine Centre, 41 Eccles St, Dublin 7, Ireland and 2 Department of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland

Received on June 24, 2002; accepted on March 3, 2003

∗To whom correspondence should be addressed.

Bioinformatics 19(12) © Oxford University Press 2003; all rights reserved.

ABSTRACT

Summary: Digital Extractor is a program for the high-throughput processing of data sets derived from digital differential display-based comparisons of EST libraries. These comparisons can be utilized to identify discrete subsets of genes whose expression is restricted to distinct tissue types. The program facilitates these investigations by permitting parallel annotation of genes identified as being differentially expressed.

Availability: The executable program, suitable for use on all UNIX-based platforms, is freely available to non-profit users.

Contact: pdoran.genome@mater.ie

Digital Differential Display (DDD) is an Internet based resource for the identification of genes whose expression is altered between different tissue types (/UniGene). This resource exploits the large number of publicly available cDNA libraries corresponding to different tissues, cancers, etc. This online system permits: (a) selection of cDNA libraries to be compared (e.g. cancer versus normal tissue); (b) comparison of the constituent sequences; and (c) output of a list of differentially expressed sequences. This resource offers exciting new avenues for exploration in the search for novel genes in health and disease; indeed it has recently been applied to the identification of cancer-associated transcripts (Scheurle et al., 2000). Whilst DDD represents an important tool for the biomedical research community, its main limitation rests in the cumbersome nature of the subsequent data analysis. Here we present a new program, Digital Extractor, for the processing of data obtained from DDD-based investigations of differential gene expression. This program refines the DDD approach by permitting rapid, automated annotation of output gene lists.

Extractor is written in PERL and can be implemented on all UNIX platforms with PERL version 5.0 or greater. The application can be executed using either a Java application or a command line interface. Digital Extractor integrates and utilizes a number of tools including: (a) CAP3 (Huang and Madan, 1999), for assembly of EST clusters into contigs; (b) RepeatMasker (Smit, 1996), for masking of repetitive elements within the assembled EST contigs; and (c) BLAST (Altschul et al., 1990), for homology searching.

The results file produced from a DDD experiment is an HTML page detailing all of the ESTs whose expression differs between the conditions of interest, with a link provided to each UNIGENE cluster. Digital Extractor uses the DDD output HTML page as input. The page is loaded into the application and scanned to extract the accession numbers of the UNIGENE clusters representing differentially expressed genes. These accession numbers are then used to complete automated extraction of the UNIGENE clusters from the locally-stored databases. The user can specify application parameters such as the database to be searched and the e-values for the BLAST searches.

Each UNIGENE cluster extracted from the database contains all of the cDNA sequences that correspond to an individual gene. The number of sequences in each cluster can range from a few dozen to many thousand. To date no attempt has been made to produce contigs from the representative sequences in these clusters due to a number of reasons, including the presence in the UNIGENE cluster of all the splicing variants of the gene of interest and the inclusion of 5′ and 3′ reads from the same gene. However, if the information produced from the DDD is to be of use in the identification of differentially expressed genes, in particular the annotation of unknown transcripts, it is necessary to produce contiguous sequences, for database searching, by cluster assembly. This step is crucial to reduce the number of BLAST runs per experiment, thus improving throughput.

Having obtained the complete UNIGENE clusters representing each hit in the DDD experiment, Digital Extractor integrates and utilizes the CAP3 EST assembly program to produce contiguous sequences. The result of this procedure is the production of a sequence for BLAST analysis. This step obviates the need for BLAST searches of all the individual sequences within each UNIGENE cluster, thus dramatically increasing the efficiency of the application.

To facilitate the rapid, parallel identification of the gene corresponding to each assembled contig, by means of the BLAST algorithm, it is necessary to optimize the sequence inputs. To achieve this, the assembled contigs are masked for repeat elements and low complexity DNA using the Repeat Masker Application. This step substantially improves the performance of the BLAST-based sequence identification.

Having assembled and masked each of the contiguous sequences corresponding to genes whose expression is altered between the conditions of interest, BLAST is used to search for homologous sequences in the non-redundant nucleotide database. The output of Digital Extractor is a results page with the identity of all annotated sequences, with links to the NCBI database for further information.

To determine the speed and accuracy of Digital Extractor, a test analysis was performed on a data set produced from a DDD-based comparison of cDNA libraries from human kidney and human pancreas. The output from this comparison was a web page containing 171 links to UNIGENE clusters. The web page was downloaded and submitted to the program via the Java GUI. The option to extract only un-annotated clusters was chosen, which focused the analysis on a subset of 15 UNIGENE clusters. It was also decided to compare the clusters to the nr database with an e-value of 0.001 (choosing a smaller database, and a lower e-value would obviously decrease run time of this aspect of the program). It takes approximately 6 minutes, to extract, assemble, mask, and blast each cluster, on a 1 GHz processor with 216 MB of memory. The major limiting factor being the assembly process. A detailed analysis of the data set can be viewed on our web-site, /worked example.htm.

In summary Digital Extractor represents an efficient, user-friendly platform for the rapid annotation of data derived from DDD-based experiments to analyze differential gene expression.

ACKNOWLEDGEMENTS

We acknowledge the National Centre for Biotechnology Information for provision and curation of sequence databases, used herein. These studies were supported by the Irish Government's Programme for Research in Third Level Institutions, the European Union Fifth Framework Programme and the Punchestown Kidney Research Fund.

REFERENCES

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Huang, X. and Madan, A. (1999) CAP3: a DNA sequence assembly program. Genome Res., 9, 868–877.
Scheurle, D., DeYoung, M.P., Binninger, D.M., Page, H., Jahanzeb, M. and Naraynan, R. (2000) Cancer gene discovery using digital differential display. Cancer Res., 60, 4037–4043.
Smit, A.F.A. (1996) The origin of interspersed repeats in the human genome. Curr. Op. Gen. Dev., 6, 743–748.
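The first step of the Digital Extractor note above, scanning the DDD output HTML page for UNIGENE cluster accession numbers, can be illustrated with a short sketch. This is not the authors' PERL code: the layout of the DDD page is not given in the note, so the sample HTML below is hypothetical, and the sketch assumes only that cluster accessions appear in the page text in the usual UniGene form of an organism prefix and a number (for example `Hs.1234`).

```python
import re
from html.parser import HTMLParser

# Assumed accession pattern: organism prefix, dot, number (e.g. "Hs.1234").
CLUSTER_ID = re.compile(r"\b[A-Z][a-z]{1,2}\.\d+\b")


class TextCollector(HTMLParser):
    """Collects the visible text chunks of an HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_cluster_ids(html):
    """Return cluster accessions in order of first appearance, without duplicates."""
    parser = TextCollector()
    parser.feed(html)
    seen = set()
    ordered = []
    for chunk in parser.chunks:
        for match in CLUSTER_ID.finditer(chunk):
            accession = match.group(0)
            if accession not in seen:
                seen.add(accession)
                ordered.append(accession)
    return ordered


# Hypothetical stand-in for a DDD results page; the real page layout may differ.
SAMPLE_PAGE = (
    "<p>Kidney vs pancreas</p>"
    '<a href="#">Hs.1234</a> 0.002 '
    '<a href="#">Hs.99</a>'
    '<a href="#">Hs.1234</a>'
)

print(extract_cluster_ids(SAMPLE_PAGE))
```

The resulting list would then drive the later stages described in the note (cluster retrieval, CAP3 assembly, repeat masking and BLAST), each of which is an external program and is not reproduced here. The pattern is deliberately loose and could also match unrelated strings of the same shape, so a real parser would restrict itself to the link text of cluster entries.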

01-Introduction to Bioinformatics (Bioinformatics international course, 2010 edition) PPT slides


Textbook
The course has no required textbook. I wrote Bioinformatics and Functional Genomics (Wiley-Blackwell, 2nd edition 2009). The lectures in this course correspond closely to its chapters.
The textbook website is: This has powerpoints, URLs, etc. organized by chapter. This is most useful to find “web documents” corresponding to each chapter.
I will make pdfs of the chapters available to everyone.
You can also purchase a copy at the bookstore, at (now $60), or at Wiley with a 20% discount through the book's website.
Literature references
You are encouraged to read original source articles (posted on moodle). They will enhance your understanding of the material. Readings are optional but recommended.
Web sites
The course website is reached via moodle: /moodle (or Google “moodle bioinformatics”)
--This site contains the powerpoints for each lecture, including black & white versions for printing
--The weekly quizzes are here
--You can ask questions via the forum
--Audio files of each lecture will be posted here

Bioinformatics: English-language textbooks


1. "Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins" (Third Edition), edited by Andreas D. Baxevanis and B. F. Francis Ouellette
This textbook provides a comprehensive introduction to bioinformatics, covering topics such as sequence analysis, genomics, transcriptomics, proteomics, and systems biology. It includes practical examples and exercises to help readers apply the concepts.

2. "Introduction to Bioinformatics" (Second Edition) by Arthur M. Lesk
This book offers a broad overview of bioinformatics, including sequence analysis, database searching, phylogenetic inference, and genome analysis. It also covers bioinformatics tools and techniques used in experimental biology.

3. "Bioinformatics for Dummies" by Jean-Michel Claverie and Cedric Notredame
This beginner-friendly guide introduces the fundamentals of bioinformatics in an easy-to-understand manner. It covers topics like sequence alignment, database searching, and phylogenetic trees, with a focus on practical applications.

4. "Computational Biology: A Practical Introduction to Bioinformatics and its Applications" by Udit Sharma and Navdeep Kaur
This textbook provides a comprehensive overview of bioinformatics, including sequence analysis, genome annotation, protein structure prediction, and biological networks. It includes real-life examples and case studies.

These textbooks offer in-depth coverage of bioinformatics concepts and techniques, and they can serve as valuable references for students, researchers, and professionals in the field. The specific choice of a textbook may depend on the reader's background, level of expertise, and specific interests within bioinformatics.

cMAP


BIOINFORMATICS APPLICATIONS NOTE
Vol. 25 no. 22 2009, pages 3040–3042
doi:10.1093/bioinformatics/btp458

Databases and ontologies

CMap 1.01: a comparative mapping application for the Internet

Ken Youens-Clark 1,∗, Ben Faga 1, Immanuel V. Yap 2, Lincoln Stein 1 and Doreen Ware 1,3

1 Cold Spring Harbor Laboratories, Cold Spring Harbor, NY 11724, 2 Department of Plant Breeding and Genetics, Cornell University, Ithaca and 3 USDA-ARS NAA Plant, Soil & Nutrition Laboratory Research Unit, Ithaca, NY 14853, USA

Received on June 10, 2009; revised on July 17, 2009; accepted on July 20, 2009
Advance Access publication July 30, 2009
Associate Editor: Alex Bateman

∗To whom correspondence should be addressed.

© The Author(s) 2009. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Summary: CMap is a web-based tool for displaying and comparing maps of any type and from any species. A user can compare an unlimited number of maps, view pair-wise comparisons of known correspondences, and search for maps or for features by name, species, type and accession. CMap is freely available, can run on a variety of database engines and uses only free and open software components.

Availability: /cmap

Contact: kclark@

1 INTRODUCTION

CMap is a generic and extensible comparative map viewer that runs in standard web browsers and aims to assist biological researchers seeking to extrapolate known map data into unknown [...]. Comparison of genetic, physical and sequence maps allows researchers to fill in gaps and extend knowledge both within and across species. For example, comparison of map fragments such as FPC contigs to an assembled sequence map or high-quality genetic map can help order, orient and assemble the fragments. Feature order in one map can help aid the selection of additional markers for mapping in another population of interest.

CMap is used by the Gramene project (Liang et al., 2008) to visualize and compare over 200 map sets of various types from 30 plant species; it is also used by many other projects comparing data from plants and animals. Data providers can extensively customize CMap to suit their tastes, configuring the definitions of species, map types, how maps are grouped into sets, how maps are drawn, the types of features displayed, their positions on the maps, how the features are drawn, what correspondences are made between features, how correspondences between maps are aggregated and colored, the evidence codes supporting these correspondences, and more. CMap relies on a relational database and open-source software.

The history of interactive graphical maps includes an early version from AceDB (Stein et al., 1999) which included 'Multi-map' to display comparisons. The similarly named 'cMap' application (Fang et al., 2003) from the MaizeDB was one of the first comparative map viewers to allow cross-species comparisons, but the application appears to be unavailable. National center for genome resources (NCGR) also created the comparative map and trait viewer (CMTV) (Sawkins et al., 2004) to allow multiple cross-species comparisons. The SOL genomics network (SGN) comparative map viewer (Mueller et al., 2008) is perhaps the closest in features to CMap, but it allows a user to compare only two maps at a time, a limitation not shared by this software.

CMap was originally written in 2001 for Gramene, a comparative mapping resource for crop grasses, and has since been contributed to the generic model organism web site (GMOD) project under the GNU Public License. CMap has been downloaded well over 1000 times and adapted by many other groups working on a range of different organisms, such as the legume information system, CottonDB, GrainGenes, the nematode Pristionchus and BeeBase. This article discusses version 1.01 of CMap, released on July 1, 2008.

2 METHODS

2.1 Data concepts

The main data components in CMap are species, map types, map sets, maps, features and correspondences. The data administrator decides what constitutes each. Any species or map type is allowed. A map set is simply a collection of maps, and maps are any linear ordered set of features such as a linkage group, chromosome or an FPC contig. A feature is any point or interval positioned on a map such as a genetic marker, an in/del, a centromere, a QTL or a gene. A correspondence can be anything from a shared synonym to a sequence similarity such as a BLAST hit. All data can be loaded from tab-delimited files or manually inserted with tools provided in the CMap distribution.

2.2 Using CMap

To use CMap, the researcher will find there are four main paths for entering the comparative map viewer: the map viewer, the feature search, the correspondence matrix and the map search.

If the user wishes to view a particular map or map set, then the map viewer is a logical point of entry. After clicking the 'Maps' link in the CMap menu, the user is presented with a form to select a species and a map set. The user may choose to view any number of maps in the set to serve as the starting point. Next, the user may select from lists of comparative maps to place on the right and/or left of the reference map(s). The number of correspondences to each comparative map is shown in square brackets to aid the user in deciding which maps to select. After adding comparative maps, the user may choose to add new maps to the right and/or left of the outermost maps; remove, flip or crop maps; change which features and/or labels are displayed; change the size of the image, and more.

If the user has a set of features (markers, QTL, BACs, etc.) he wishes to locate on maps, then he can search for features by clicking on the 'Feature Search' link in the CMap menu. For each feature found, the feature's name, type, species, map set, map, position and aliases are displayed as well as two links for each feature, one that takes the user to the 'map details' page for the map on which the feature occurs with the feature highlighted and another which takes the user to the 'feature details' page showing all the known information on the feature and all of its correspondences.

Fig. 1. Many of the key concepts of CMap are shown. Five maps of varying types from QTL to genetic to sequence and from two species are displayed, and more could be added. Map features range from QTLs to genetic markers to bins to genes, and correspondences based on different types of evidence are shown in varying colors (/db/cmap).

If the user is interested to know the total number of correspondences from any map set to any other map set, then he can click on the 'Matrix' link in the CMap menu. A table shows a cross-tabulated comparison of all the correspondences among map sets. By clicking on any map set name, the user can limit the matrix to a particular map set. Clicking on the numbers of correspondences takes the user to the map viewer showing the two maps in relation to each other.

The last main method to enter CMap is by choosing the 'Map Search' link in the CMap menu. Here, a user can search for particular maps based on the map name or the number of related maps. The results include the map names, the number of related maps, the number of correspondences and the number of features by type present on the map. Clicking a map's name takes the user to the map viewer page so he may add comparative maps.

In addition to the above four most direct routes to view comparative maps, other entry points are available through the CMap menu. The 'Species' link allows the user to browse the available species and all the map sets from each. From there or from the CMap menu, the user can choose the 'Map Set Info' link to learn more about a particular map set, and from there choose to view one or all of the maps in the map viewer or the matrix. In addition, the user may choose the links for 'Feature Types', 'Map Types' or 'Evidence Types'. Depending on the questions that the user seeks to answer, other routes may prove more [...]. Lastly, a proactive data provider may choose to present pre-formulated views which he knows will be of interest to his community of users by simply copying and pasting a URL into a web page and making a link with text explaining the view or storing this via the 'Saved Links' section of CMap, a function that also allows users to recall previous views.

The data provider has the ability to create multiple CMap databases that are entirely separate from each other. If this has been done, the user can switch among these databases using the 'Data Source' control in the upper-right corner. Users can download data of maps or whole map sets in GFF or CMap's tab-delimited or XML formats. Links to the download page are available on the 'Map Set Info' page and from the map menu buttons in the map viewer. There is also a 'Help' section and a 'Tutorial' to show users how CMap works and its terminology.

2.3 Implementation

CMap is a Perl application that runs on an Apache web server (versions 1 or 2) on Windows and UNIX variants. CMap has a simple relational schema that can be implemented in MySQL, Oracle, PostgreSQL or Sybase. It relies on no proprietary software or SQL extensions, and uses all freely available software. The user interface is a basic HTML/JavaScript page that works with any modern web browser. No registration or permission is required for its use.

Administration of CMap is accomplished through the following three methods: simple text configuration files define system settings such as databases and directories for templates, sessions and temporary files as well as how maps, features and correspondences are drawn. The 'cmap_admin.pl' tool is used for importing and exporting data in various text formats (tab, XML, SQL), creating and deleting maps or correspondences, reloading the correspondence matrix or purging CMap's web data cache after loading new data. Lastly, a browser-based administrative tool allows the data provider a basic CRUD (Create/Read/Update/Delete) interface for all the different objects (map sets, maps, features, etc.) in the database.

2.4 Future plans

CMap continues development as a GMOD project. Work is underway on version 2.0 which will feature a major rewrite of the internals, an improved and streamlined database, graphical output in scalable vector graphics, and integration of the Circos circular genome viewer (Krzywinski et al., 2009).

Funding: United States Department of Agriculture (USDA) Initiative for Future Agriculture and Food Systems (IFAFS) (grant number 00-52100-9622); Cooperative State Research and Education Service (CSREES) agreement through the USDA Agricultural Research Service (ARS) (grant number 58-1907-0-041); National Science Foundation (NSF) PGI (grant number 0321685, for the years 2004-2007); NSF Plant Genome Research Resource (grant number 0703908, work from 2004 till now); USDA ARS (grant number 413089).

Conflict of Interest: none declared.

REFERENCES

Fang, Z. et al. (2003) cMap: the comparative genetic map viewer. Bioinformatics, 19, 416–417.
Krzywinski, M. et al. (2009) Circos: an information aesthetic for comparative genomics. Genome Res. [Epub ahead of print, doi:10.1101/gr.092759.109, July 24, 2009].
Liang, C. et al. (2008) Gramene: a growing plant comparative genomics resource. Nucleic Acids Res., 36, D947–D953.
Mueller, L. et al. (2008) The SGN comparative map viewer. Bioinformatics, 24, 422–423.
Sawkins, M.C. et al. (2004) Comparative map and trait viewer (CMTV): an integrated bioinformatic tool to construct consensus maps and compare QTL and functional genomics data across genomes and experiments. Plant Mol. Biol., 56, 465–480.
Stein, L.D. et al. (1999) AceDB: a genome database management system. Comput. Sci. Engineering, 1, 44–52.
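The data concepts in the CMap note (features placed on maps, maps grouped into map sets, and a cross-tabulated matrix of correspondences among map sets) can be made concrete with a small sketch. It is an illustration only, not CMap code: the column names `map_set`, `map_name`, `feature_name` and `position` are hypothetical and are not CMap's actual import format, and the only kind of correspondence considered is the simplest one the note mentions, a shared feature name.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations


def correspondence_matrix(tsv_text):
    """Count shared-name correspondences between every pair of map sets.

    Returns a dict mapping (map_set_a, map_set_b), sorted alphabetically,
    to the number of feature pairs that share a name across the two sets.
    """
    placements = defaultdict(set)  # lower-cased feature name -> placements
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        key = row["feature_name"].lower()
        placements[key].add(
            (row["map_set"], row["map_name"], float(row["position"]))
        )

    counts = defaultdict(int)
    for placed in placements.values():
        for a, b in combinations(sorted(placed), 2):
            if a[0] != b[0]:  # only count correspondences across map sets
                counts[tuple(sorted((a[0], b[0])))] += 1
    return dict(counts)


# Hypothetical tab-delimited input; names and positions are made up.
ROWS = [
    ("map_set", "map_name", "feature_name", "position"),
    ("Rice-Genetic", "1", "RM1", "10.5"),
    ("Rice-Seq", "Chr1", "RM1", "4500000"),
    ("Rice-Genetic", "1", "RM2", "20.0"),
    ("Maize-Genetic", "3", "rm2", "55.0"),
    ("Rice-Seq", "Chr1", "RM3", "900"),
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in ROWS)

for pair, n in sorted(correspondence_matrix(SAMPLE_TSV).items()):
    print(pair, n)
```

In CMap itself the correspondences are stored in a relational database and may carry evidence codes such as sequence similarity; the dictionary here only mirrors the idea of the 'Matrix' view.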

A brief introduction to transcriptome sequencing and common algorithms


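Transcriptome sequencing, also known as whole-transcriptome shotgun sequencing (WTSS), is also called deep sequencing because of its high coverage.

It uses next-generation high-throughput sequencing to sequence the cDNA library obtained by reverse transcription of the RNA of a species or tissue, and thereby obtains information about that RNA.

Its object of study is the sum of all RNA that a particular cell can transcribe in a given functional state, including mRNA and non-coding RNA.

Depending on whether a reference genome sequence is available, transcriptome sequencing is divided into reference-based transcriptome sequencing and de novo sequencing without a reference genome.

If a reference genome is available, the transcripts can be mapped back to the genome to determine transcript positions, splicing patterns and other more comprehensive genetic information, which can be widely applied in biological, medical and clinical research.

Although the steps of transcriptome sequencing and genome sequencing are broadly the same, the two differ considerably in library preparation and analysis methods.

In bioinformatics, sequence alignment is an effective means of identifying similar regions of DNA, RNA and proteins, and helps us to better study the relationships among their structure, function and evolution.

The figure below outlines the main workflow of transcriptome sequencing: first, all the RNA in the cell is reverse-transcribed into a cDNA library; the cDNA is then randomly sheared into short DNA fragments, and adapters are added to both ends; the resulting sequences are aligned (when a reference genome is available) or assembled de novo (when it is not) to produce a genome-wide transcription profile.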

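Figure 1. Transcriptome sequencing workflow

Introduction to common algorithms

TopHat (/software/tophat/index.shtml)

TopHat is a Bowtie-based alignment algorithm for transcriptome sequencing, published in Bioinformatics in 2009 by Cole Trapnell et al. It was developed jointly by the Center for Bioinformatics and Computational Biology at the University of Maryland, the Departments of Mathematics and of Molecular and Cell Biology at the University of California, Berkeley, and the Department of Stem Cell and Regenerative Biology at Harvard University.

It identifies splice sites by aligning RNA reads with an ultra-fast, high-throughput short-read aligner.

Figure 2. TopHat workflow

TopHat first uses Bowtie to align the RNA reads against the whole reference genome and find the reads that map, then uses Maq to merge the mapped reads, and finally joins the resulting exons selectively.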
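The merging step in the TopHat description above, where mapped reads are combined into contiguous covered regions that serve as candidate exons, can be sketched as a plain interval merge. This is a simplified stand-in and not what Maq actually does (Maq builds a consensus sequence from the reads); the function name, the half-open coordinate convention and the `max_gap` parameter are assumptions made for the illustration.

```python
def merge_into_islands(alignments, max_gap=0):
    """Merge read alignments into covered regions ("islands").

    alignments: iterable of (chrom, start, end) tuples, half-open coordinates.
    max_gap: reads separated by at most this many bases are still merged.
    Returns a list of (chrom, start, end) islands sorted by position.
    """
    islands = []
    for chrom, start, end in sorted(alignments):
        if islands:
            last_chrom, last_start, last_end = islands[-1]
            if chrom == last_chrom and start <= last_end + max_gap:
                # Overlapping or adjacent read: extend the current island.
                islands[-1] = (last_chrom, last_start, max(last_end, end))
                continue
        islands.append((chrom, start, end))
    return islands


# Hypothetical 36-base read alignments on two chromosomes.
READS = [
    ("chr1", 100, 136),
    ("chr1", 120, 156),
    ("chr1", 400, 436),
    ("chr1", 150, 186),
    ("chr2", 100, 136),
]

for island in merge_into_islands(READS):
    print(island)
```

In the workflow described above, neighbouring islands on the same chromosome would then be the candidates between which splice junctions are sought; that search is not shown here.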

QDD a user-friendly program to select microsatellite markers and


BIOINFORMATICS APPLICATIONS NOTE Vol.26no.32010,pages403–404doi:10.1093/bioinformatics/btp670 Sequence analysisQDD:a user-friendly program to select microsatellite markers and design primers from large sequencing projectsEmese Meglécz1,∗,Caroline Costedoat1,Vincent Dubut1,AndréGilles1,Thibaut Malausa2,Nicolas Pech1and Jean-François Martin31Aix-Marseille Université,CNRS,IRD,UMR6116–IMEP,Equipe Evolution,Génome et Environnement,Centre Saint-Charles,Case36,3Place Victor Hugo,13331Marseille Cedex3,2Institut National de la Recherche Agronomique,UMR1301,INRA/UNSA/CNRS,Equipe BPI,400,route des Chappes.BP167.06903Sophia-Antipolis Cedex and3Montpellier SupAgro,INRA,CIRAD,IRD,Centre de Biologie et de Gestion des Populations,Campus International de Baillarguet,CS30016,34988Montferrier-sur-Lez,FranceReceived on August6,2009;revised on November24,2009;accepted on December1,2009Advance Access publication December10,2009Associate Editor:Limsoon WongABSTRACTSummary:QDD is an open access program providing a user-friendly tool for microsatellite detection and primer design from large sets of DNA sequences.The program is designed to deal with all steps of treatment of raw sequences obtained from pyrosequencing of enriched DNA libraries,but it is also applicable to data obtained through other sequencing methods,using FASTAfiles as input.The following tasks are completed by QDD:tag sorting,adapter/vector removal,elimination of redundant sequences,detection of possible genomic multicopies(duplicated loci or transposable elements), stringent selection of target microsatellites and customizable primer design.It can treat up to one million sequences of a few hundred base pairs in the tag-sorting step,and up to50000sequences in a single inputfile for the steps involving estimation of sequence similarity. 
Availability:QDD is freely available under the GPL licence for Windows and Linux from the following web site:http://www.univ-provence.fr/gsite/Local/egee/dir/meglecz/QDD.htmlContact:emese.meglecz@univ-provence.frSupplementary information:Supplementary data are available at Bioinformatics online.1INTRODUCTIONMicrosatellites are among the most informative and thus the most frequently used molecular markers in population biology.So far their isolation for non-model species has been dependent on time-and labour-consuming laboratory protocols providing typically a few hundred sequences.New advances in pyrosequencing now produce large amounts of sequences from any DNA sample at low cost and open new avenues for the setup of markers for species with no previous genomic information.Availability of thousands of sequences may help to optimize the setup of microsatellite markers by allowing the selection of microsatellites that are not compound or interrupted and likely follow a simple mutation model.Moreover, microsatellite amplification by polymerase chain reaction(PCR) To whom correspondence should be addressed.can be seriously affected by microsatellite association with mobile element.The large number of sequences allows the detection of putative mobile elements by spotting sequence similarity groups and eliminating them.As a result,the proportion of primers allowing specific amplification at unique loci with a clear banding will be higher and will avoid time-consuming primer tests.The presence of null alleles is also a recurrent problem in microsatellite amplification.If several copies of the same loci are identified,and the sequencing was based on pooled DNA from several individuals, some of the null alleles can be detected and avoided if consensus sequence construction is stringent.Several programs for microsatellite detection(Tandem Repeats Finder,Benson,1999;SciRoKo,Kofler et al.,2007)sequence clustering(BLASTclust,Phrap)and primer design(Primer3,Rozen and 
Skaletsky,2000;FastPCR,Kalendar et al.,2009)are currently available for users.A recent paper by Castoe et al.(2009)even provides Perl scripts for microsatellite detection,primer design and attempts to eliminate some of the sequence redundancies,but it lacks potentially necessary tag/adapter/vector/cleaning,and their method to detect multiple copies is based on only perfect primer site matches.Hence,a complete analysis pipeline is needed for the routine treatment of thousands or millions of sequences coming from next generation sequencing.QDD is designed to treat all bioinformatics steps from large quantities of raw sequences to the design of PCR primers for microsatellites amplification.We provide results on efficiency of the primers selected by QDD and a glossary of unusual terms in the Supplementary Materials.2IMPLEMENTATIONQDD is written in Perl and can be run as a standalone application on Windows or Linux systems.The Windows interface provides a user-friendly version.It is a collection of small modules that use the following freely available programs: ActivePerl(/activeperl/),BLAST (ftp:///blast/executables/),CLUSTALw2(Larkin et al.,2007;ftp:///pub/software/clustalw2/)and Primer3©The Author2009.Published by Oxford University Press.All rights reserved.For Permissions,please email:journals.permissions@403 at National Huaqiao University on December 20, 2015 /Downloaded fromE.Meglécz et al.(Rozen and Skaletsky,2000;/).They should be installed in order to be able to use QDD.QDD proceeds in three successive stages:2.1Sequence cleaning and microsatellite detectionThe inputfiles of QDD are the following fastafiles(i)tag.fas is a file containing all user tags(optional),(ii)adapter.fas contains all user adapter/vector sequences when suitable(optional),(iii)one or several fastafiles containing all raw sequence reads.If sequences have been tagged,the program prepares one fasta file for each tag containing all sequences cleaned from the given tag,plus onefile with all 
sequences for which no tag was detected. This step is optional.Removal of adaptors or possible vector contaminations if needed. Selection of sequences that are longer than a user-defined limit. Selection of sequences that have perfect microsatellites with a minimum number of repeats defined by the user.These sequences are written in a fastafile that will be the input for stage2.2.2Sequence similarity detectionThis stage is the most time consuming but essential part of the program.Removing redundancy is obviously necessary,but eliminating sequences that are part of a repetitive region of the genome is often neglected by researchers.This omission is due to the lack of genomic background information for most species.QDD detects sequence groups that potentially fall in this category.It is not designed to describe mobile elements,but to be a conservative method that eliminates many of the sequences that might be problematic for microsatellite amplifications(e.g.involving null alleles).This stage is composed of the following steps:‘All against all’BLAST of the inputfile with soft-masked microsatellites.Concatenation of100%identical sequences.Elimination of potential minisatellites(more than one BLAST hit between two compared sequences).If similarity was detected by BLAST with a user-defined limit, QDD calculates pair wise identity alongflanking regions.In this way,variability in the microsatellite length is not taken into account in the percentage of pair wise identity.Establishment of contigs if pair wise identity along theflanking regions is higher than a user-defined limit(e.g.95%). 
Reconstruction of consensus sequences for each contig,where ambiguous sites are replaced by‘Ns’.‘All against all’BLAST offile containing all consensus sequences plus all original sequences that are not unique,not potential minisatellites and not included in the contigs.Selection of contig consensus sequences that did not have hit to any other sequence in the previous BLAST.Preparation of afile with selected consensus sequences and all original‘unique’sequences(either did not have a BLAST hit or only to100%identical sequences).Thisfile will be the inputfile for stage3.2.3Microsatellite selection and primer designThe mutation model of compound,interrupted microsatellites or short repetitions(‘nanosatellites’)in theflanking region is complicated.Therefore,it is preferable to use perfect microsatellites with nanosatellite freeflanking regions.However, since nanosatellites are abundant,most of the selected sequence regions for primer design are short.This can be problematic in multiplexing.Therefore,QDD automatically runs Primer3to design primers several times,each time for a different PCR product size range.Stage3of QDD accomplishes the following steps: Selection of sequences that contain a target microsatellite and aflanking region free from nanosatellites(users set the minimum number of repeats for the target microsatellites,the maximum number of allowed repeats for nanosatellites,the minimum length offlanking region and the minimum length of PCR product).An option is available to select compound or interrupted microsatellite as a target.Preparation of an inputfile for Primer3and a fastafile with all target microsatellites and nanosatellites printed in lower case. 
Primer3 runs for all user-defined PCR product ranges. Primer3 parameters can be set by the user.
Results of the different runs, including all output parameters of Primer3 and information on the target microsatellites, are summarized in a user-friendly table including sequence code.

ACKNOWLEDGEMENTS

We thank Frédéric Calendini for invaluable help in building the Windows interface.

Funding: University of Provence, Montpellier SupAgro, and a grant from the French Institut National de la Recherche Agronomique (INRA), AIP BioRessources EcoMicro.

Conflict of Interest: none declared.

REFERENCES

Benson,G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573–580.
Castoe,T.A. et al. (2009) Rapid identification of thousands of copperhead snake (Agkistrodon contortrix) microsatellite loci from modest amounts of 454 shotgun genome sequence. Mol. Ecol. Resources, [Epub ahead of print, doi:10.1111/j.1755-0998.2009.02750.x, Jul 30, 2009].
Kalendar,R. et al. (2009) Invited review: FastPCR software for PCR primer and probe design and repeat search. Genes Genomes Genomics, 3 (In press).
Kofler,R. et al. (2007) SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics, 23, 1683–1685.
Larkin,M.A. et al. (2007) ClustalW and ClustalX version 2. Bioinformatics, 23, 2947–2948.
Rozen,S. and Skaletsky,H. (2000) Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol., 132, 365–386.

SNPStats tutorial

Vol. 22 no. 15 2006, pages 1928–1929
doi:10.1093/bioinformatics/btl268
BIOINFORMATICS APPLICATIONS NOTE
Genetics and population analysis

SNPStats: a web tool for the analysis of association studies

Xavier Solé1, Elisabet Guinó1, Joan Valls1,2, Raquel Iniesta1 and Víctor Moreno1,2,*
1Catalan Institute of Oncology, IDIBELL, Epidemiology and Cancer Registry, L'Hospitalet, Barcelona, Spain and 2Autonomous University of Barcelona, Laboratory of Biostatistics and Epidemiology, Bellaterra, Barcelona, Spain

Received on March 6, 2006; revised on May 16, 2006; accepted on May 18, 2006
Advance Access publication May 23, 2006
Associate Editor: Charlie Hodgman

ABSTRACT

Summary: A web-based application has been designed from a genetic epidemiology point of view to analyze association studies. Main capabilities include descriptive analysis, test for Hardy–Weinberg equilibrium and linkage disequilibrium. Analysis of association is based on linear or logistic regression according to the response variable (quantitative or binary disease status, respectively). Analysis of single SNPs: multiple inheritance models (co-dominant, dominant, recessive, over-dominant and log-additive), and analysis of interactions (gene–gene or gene–environment). Analysis of multiple SNPs: haplotype frequency estimation, analysis of association of haplotypes with the response, including analysis of interactions.
Availability: /SNPstats. Source code for local installation is available under GNU license.
Contact: v.moreno@
Supplementary Information: Figures with a sample run are available on Bioinformatics online. A detailed online tutorial is available within the application.

The analysis of association between genetic polymorphisms and diseases allows identifying susceptibility genes (Cordell and Clayton, 2005). The proper analysis of these studies can be performed with general purpose statistical packages, but the researcher usually needs the assistance of additional software to perform specific analysis, like haplotype estimation, and results from different packages
are difficult to integrate.

We present a free web-based tool to help researchers in the analysis of association studies based on SNPs or biallelic markers. Both the selection of analysis and the output have been designed from a genetic epidemiology perspective. This application can also be used for learning purposes. We have written (in Spanish) an analysis guide with detailed explanations (Iniesta et al., 2005). A similar extensive help in English can also be found on the website.

The software is used following three steps, with the possibility of performing multiple analyses in one session. The steps are as follows.

(1) Data entry. Raw data in tabular form can be pasted in a window or uploaded from a text file. Variables can be named and the user can choose the field delimiter and the missing value code (Supplementary Figure 1). SNPs should be coded as genotypes with each allele separated by a slash (e.g. 'T/T', 'T/C', 'C/C').

(2) Data processing. A list with the variables read by the application is presented with an initial suggestion about the type: quantitative, categorical or SNP, which can be modified (Supplementary Figure 2). The user is prompted to select those needed for the analysis and to specify which one is the response, which may be binary (disease status) or quantitative. For categorical variables, including SNPs, the user can reorder the categories. The first one will be treated as reference category in the analysis. The application assumes that the main interest is the analysis of the SNPs in relation to the response. Other variables selected with type quantitative or categorical will be added to the regression models for analysis as covariates and treated as potential confounders.

(3) Analyses customization. The third step requests the selection of the desired statistical analyses that will be described later in this article (Supplementary Figure 3).

Regarding the statistical analysis, the association with disease is modeled depending on the response variable. If binary, the application
assumes an unmatched case–control design and unconditional logistic regression models are used. If the response is quantitative, then a unique population is assumed and linear regression models are used to assess the proportion of variation in the response explained by the SNPs. The association for each SNP is analyzed in turn and adjusted for the selected covariates.

If more than one SNP is selected, then the application assumes that haplotype analysis is appropriate. Haplotype frequencies are estimated using the implementation of the EM algorithm coded into the haplo.stats package (Sinnwell and Schaid, 2005, /mayo/research/biostat/schaid.cfm). Association between haplotypes and disease appropriately accounts for the uncertainty in the estimation of haplotypes for individuals with multiple heterozygous loci when phase is unknown or when missing values are present (Schaid et al., 2002). Individuals with missing values in the response, in all SNPs or in any covariate are excluded from analysis.

The software main page can be found online at http://bioinfo. /SNPstats. The application uses the PHP server programming language to build the input forms, upload data, call the statistical analysis procedures and process the output. The statistical analyses are performed in a batch call to the R package (R Development Core Team, 2005, ). The contributed packages genetics (Warnes and Leisch, 2005) and haplo.stats (Sinnwell and Schaid, 2005, http://mayoresearch.
/mayo/research/biostat/schaid.cfm) are called to perform some of the analysis. Anonymous use is guaranteed and data are treated as confidential. Source code for local installation (Linux and Windows) is also available under GNU license.

*To whom correspondence should be addressed.
© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

SNPStats returns a complete set of results for the analysis, covering from the descriptive statistics to the haplotype analysis. The descriptive statistics returned are the absolute frequencies and proportions for categorical variables, and mean, standard deviation and a list of percentiles for the quantitative ones. The total valid sample size and the count of missing values are always displayed (Supplementary Figure 4).

Each SNP is described as allele and genotype frequencies. An exact test for Hardy–Weinberg equilibrium is performed (Supplementary Figure 5). When the response variable is binary, these statistics can be displayed by each response group. The user usually will be interested in checking Hardy–Weinberg equilibrium in the control population.

The analysis of association for each SNP can be performed both for quantitative and binary response variables. For binary responses, the logistic regression analysis is summarized with genotype frequencies, proportions, odds ratios (OR) and 95% confidence intervals (CI) (Fig. 1). For quantitative responses, linear regression is summarized by means, standard errors, mean differences with respect to a reference category and 95% CI of the differences.

SNPStats can also perform analyses of interactions. For simplicity, models with only one pair of variables interacting can be selected at a time. Three summary tables are shown (Supplementary Figure 7). The first one is the cross-classification that
uses a common reference category for both interacting variables. ORs or mean differences are estimated, together with 95% CI, for all other combinations. The next tables use the margins as reference category and estimate ORs or mean differences of one variable nested within the other one. A global test for interaction is performed, as well as a test for the interaction in the linear trend of the nested variable. This assumes that the nested variable is ordinal and tests for different trend among categories. This test might be more sensitive than the global one due to the reduction in degrees of freedom.

When more than one SNP is included in the analysis, SNPStats offers the possibility of performing linkage disequilibrium (LD) and haplotype analysis. For LD, matrices with selected statistics (D, D′, Pearson's r and associated P-values) are shown (Supplementary Figure 8). In the analysis of haplotypes, descriptive statistics show the estimated relative frequency for each haplotype (Supplementary Figure 9). Cumulative frequencies are also shown to help in the selection of the threshold cut point to group rare haplotypes for further analysis. The association analysis of haplotypes is similar to that of genotypes in that either logistic regression results are shown as OR and 95% CI or linear regression results with differences in means and 95% CI. The most frequent haplotype is automatically selected as the reference category and rare haplotypes are pooled together in a group. The analysis of haplotypes assumes a log-additive model by default, but dominant and recessive models are available as alternative choices. When haplotypes are selected for interaction, tables similar to the genotype interaction ones are shown, replacing the genotypes by haplotypes (Supplementary Figure 10). This analysis of interactions and presentation of the results is unique to the available alternatives explored and is an important contribution to the analysis of genetic epidemiology studies, often focused on testing for
gene–environment interactions (Lake et al., 2003).

As a limitation, we are aware that the selection of the available analysis has been done for the most frequent profile but might not be adequate in some instances. We plan to implement in future versions more response types: survival data for studies of prognosis, multinomial data for categorical responses with more than two categories and paired designs (matched case–control or nested case–control).

ACKNOWLEDGEMENTS

Funding support from the Spanish Instituto de Salud Carlos III (networks of centres RCESP C03/09 and RTICCC C03/10). Funding to pay the Open Access publication charges for this article was provided by Instituto de Salud Carlos III (FIS03/0114).

Conflict of Interest: none declared.

REFERENCES

Cordell,H.J. and Clayton,D.G. (2005) Genetic association studies. Lancet, 366, 1121–1131.
Iniesta,R. et al. (2005) Análisis estadístico de polimorfismos genéticos en estudios epidemiológicos. Gac. Sanit., 19, 333–341.
Lake,S. et al. (2003) Estimation and tests of haplotype–environment interaction when linkage phase is ambiguous. Human Heredity, 55, 56–65.
R Development Core Team (2005) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Schaid,D.J. et al. (2002) Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet., 70, 425–434.
Sinnwell,J.P. and Schaid,D.J. (2005) haplo.stats: statistical analysis of haplotypes with traits and covariates when linkage phase is ambiguous. R package version 1.2.2.
Warnes,G. and Leisch,F. (2005) Genetics: Population Genetics. R package version 1.2.0.

Fig. 1. Sample output of association models for binary response. Five inheritance models are fitted, which correspond to different parameterizations or groupings of the genotypes. Akaike's Information Criterion (AIC) and Bayesian Information Criterion (BIC) are calculated to help the user in the selection of the best model for a specific SNP.
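
The five inheritance models named in the summary and in the Fig. 1 caption are different groupings of the same three genotypes. The sketch below illustrates that coding for a slash-separated genotype such as 'T/C'. It is not SNPStats code: the function name `code_genotype` and the `variant_allele` argument are hypothetical, and it assumes the user has already decided which allele counts as the variant.

```python
from typing import Dict, Union


def code_genotype(genotype: str, variant_allele: str) -> Dict[str, Union[int, str]]:
    """Encode one genotype (e.g. 'T/C') under five inheritance models.

    Illustrative only; `variant_allele` is an assumption of this sketch.
    """
    alleles = genotype.strip().split("/")
    if len(alleles) != 2:
        raise ValueError(f"expected two alleles separated by '/', got {genotype!r}")

    # Number of copies of the variant allele: 0, 1 or 2.
    n = sum(1 for allele in alleles if allele == variant_allele)

    return {
        # Co-dominant: each genotype is its own category.
        "codominant": ("ref/ref", "het", "var/var")[n],
        # Dominant: one copy of the variant allele is enough.
        "dominant": int(n >= 1),
        # Recessive: two copies are needed.
        "recessive": int(n == 2),
        # Over-dominant: heterozygotes versus both homozygotes.
        "overdominant": int(n == 1),
        # Log-additive: effect scales with the number of copies.
        "log_additive": n,
    }


if __name__ == "__main__":
    for g in ("T/T", "T/C", "C/C"):
        print(g, code_genotype(g, variant_allele="C"))
```

Each coded column could then be entered as the predictor in a logistic or linear regression, and the fits compared by AIC or BIC as the caption describes.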

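Bioinformatics: case studies

Cases of impaired spermatogenesis and their control population
Genome-wide DNA methylation profiling
Roche-NimbleGen HG18 Meth 3×720K CpG plus Promoter
Screening of genes with aberrant DNA methylation patterns; validation in the population
Analysis of interactions between genes with aberrant DNA methylation patterns and EDCs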
Genetics
Biochemistry
Bioinformatics
Molecular biology
Computer science
molecular biology genetics biochemistry other biological science computer science mathematics statistics ……
Bioinformatics
What is Bioinformatics?
Bioinformatics is a new subject of genetic data collection, analysis and dissemination to the research community (Dr. Hwa A. Lim, 1987).
Application in bioinformatics
or Case Control
Case
Control
Thousands of genes, Thousands of problems, while Thousands of papers
1
2 Gene annotation
2 Microarray presentation
A very brief history of bioinformatics:
1965: Margaret Dayhoff and collaborators at the National Biomedical Research Foundation develop the first protein sequence database
1970: Saul Needleman and Christian Wunsch develop the first sequence alignment algorithm
1974: Chou and Fasman develop the first protein structure prediction algorithm
~1979: GenBank established as a nucleic acid sequence database at Los Alamos National Laboratory

Bioinformatics (near-complete edition) Microsoft Word document (2)1

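1. What is bioinformatics?

Genome informatics is a scientific discipline that encompasses all aspects of genome information acquisition, processing, storage, distribution, analysis, and interpretation. (The U.S. Human Genome Project: The First Five Years FY 1991-1995, by NIH and DOE)

Bioinformatics takes the analysis of genomic DNA sequence information as its starting point and deciphers the genetic language hidden in DNA sequences, in particular the nature of the non-coding regions; once new gene information has been found, it goes on to model and predict the three-dimensional structure of the proteins.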

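The research goal of bioinformatics is to reveal "the complexity of the information structure of the genome and the fundamental laws of the genetic language".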

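It is the organic combination of three major scientific problems in the natural and technical sciences of this century: "genome", "information structure" and "complexity".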

How to find the coding regions in a raw DNA sequence? By signals or by content.

Among the types of functional sites in genomic DNA that researchers have sought to recognize are splice sites, start and stop codons, branch points, promoters and terminators of transcription, polyadenylation sites, ribosomal binding sites, topoisomerase II binding sites, topoisomerase I cleavage sites, and various transcription factor binding sites. Local sites such as these are called signals, and methods for detecting them may be called signal sensors.

2. Discovery and identification of new genes and new SNPs

Most new genes are predicted by theoretical methods.
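
As a minimal illustration of the "signal sensor" idea described above, the sketch below scans one strand of a DNA sequence for start and stop codons of the standard genetic code. The function name `find_codon_signals` is hypothetical, and real gene finders combine many such signals with content statistics rather than relying on exact string matches.

```python
from typing import List, Tuple

START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_codon_signals(sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based position, 'start' or 'stop') for every codon signal.

    All three reading frames of the given strand are scanned; the reverse
    strand is ignored to keep the sketch short.
    """
    seq = sequence.upper()
    hits: List[Tuple[int, str]] = []
    for i in range(len(seq) - 2):
        codon = seq[i:i + 3]
        if codon == START_CODON:
            hits.append((i, "start"))
        elif codon in STOP_CODONS:
            hits.append((i, "stop"))
    return hits


if __name__ == "__main__":
    # The TGA at position 3 overlaps the ATG in a different reading frame.
    print(find_codon_signals("ccATGaaaTAGg"))
```

The overlapping ATG and TGA in the example show why signals alone produce many spurious hits, which is one reason content-based measures are used alongside them.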

Data and text mining


Vol. 21 no. 24 2005, pages 4432–4433
doi:10.1093/bioinformatics/bti696
BIOINFORMATICS APPLICATIONS NOTE
Data and text mining

Medusa: a simple tool for interaction graph analysis

Sean D. Hooper1,* and Peer Bork1,2
1European Molecular Biology Laboratory, Heidelberg, Germany and 2Max Delbrück Center for Molecular Medicine, Berlin-Buch, Germany

Received on June 28, 2005; revised on September 13, 2005; accepted on September 26, 2005
Advance Access publication September 27, 2005

ABSTRACT

Summary: Medusa is a Java application for visualizing and manipulating graphs of interaction, such as data from the STRING database. It features an intuitive user interface developed with the help of biologists. Medusa is optimized for accessing protein interaction data from STRING, but can be used for any type of graph from any scientific field.
Availability: Medusa, along with sample datasets and instructions, can be downloaded from http://www.bork.embl.de/medusa
Contact: hooper@embl.de

There are many graph analysis applications available, including Pajek (Batagelj and Mrvar, 1998), Cytoscape (Shannon et al., 2003), Osprey (Breitkreutz et al., 2003), ProViz (Iragne et al., 2005), CNPlot (Batada, 2004) and SHARKview (Pinney et al., 2005), each with their strengths but also drawbacks. Here, we present Medusa, a tool which addresses many of these issues. A quick rundown of its features and other programs that lack them (shown in parentheses) is as follows: Medusa displays up to 10 multiple edges concurrently between nodes using Bezier curves (all above applications) and allows users to add and delete nodes and edges by simply clicking the mouse (Pajek, Cytoscape, others). Background images can be inserted to enhance figure quality (Osprey, Cytoscape). Edges can be hidden or shown depending on the type of interaction (Osprey, Cytoscape, Pajek). Node properties can be described, such as color, annotation, position and shape (Osprey, Cytoscape). Medusa requires no additional packages (ProViz, CNPlot) and runs on any machine with Java 1.4.2 installed (ProViz, Pajek). It runs as
standalone (SHARKView) and as an applet for use in web interfaces (most other applications). Graphs can be exported to image or PostScript files. Overall, Medusa (screenshot shown as Fig. 1) is designed to be a simple and intuitive tool for customization of interaction graphs of any kind.

With the ever-increasing mass of biological data, as exemplified by large-scale cross-species comparisons of proteins and genes, an easy visualization of the often highly complex linkage between, e.g. proteins becomes more and more important. For instance, the interactions between core proteins and alternative functional modules of complexes (work in progress) can be represented as a graph of nodes (proteins) and edges (interactions between core and modules). Other graphs of protein interaction may be more abstract; for instance, the STRING ('search tool for the retrieval of interacting genes/proteins') database predicts protein–protein associations and includes a variety of indirect (non-physical) evidence types (von Mering et al., 2005). The results from the STRING database can be studied using its web interface, or alternatively handled directly by the Medusa application, which is run on the client side.

The handling of large networks is made easier by layout algorithms and the option to hide or show certain edge types. Moreover, nodes and edges can be deleted or added either manually via mouse clicks, or added directly from STRING. Users can create their own data files which can then be easily appended to existing graphs. These data files are simply tab-delimited text fields describing edge relationships.

Used solely as a graph visualization tool, Medusa is designed to be immediately accessible, focusing on construction of figures. For graph analysis of a more mathematical nature, Pajek is highly recommended, although it has a much steeper learning curve.
Medusa can export graphs to Pajek format.

Medusa is also available as an applet version for use directly in web pages. This version has a lower degree of functionality, since the main goal is to allow an easy display of graphs from a web-based database. It is already implemented in the STRING website and is planned to be incorporated into other web databases (work in progress). The graph is simply passed to the applet as parameters, along with other preferences, making it highly suitable as an interactive enhancement of an existing web server.

Medusa has been used in a number of EMBL projects as an in-house tool, ranging from yeast cell cycle studies (http://www.cbs.dtu.dk/cellcycle/yeast_complexes/figure1.html; de Lichtenberg et al., 2005) to genotype–phenotype links (Korbel et al., 2005). It is now publicly available and is designed to be a simple but efficient tool for quickly analysing graphs and producing figures. Medusa is easy to install and requires only Java 1.4.2, making it accessible to a wide variety of platforms. Medusa is free for academic use.

ACKNOWLEDGEMENTS

We thank Lars Steinmetz, Lars Juhl-Jensen, Fabiana Perocchi, Eliza Izaurralde and Rob Russel for their feedback. S.D.H. was supported by the Knut and Alice Wallenberg Foundation.

Conflict of Interest: none declared.

*To whom correspondence should be addressed.
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@

REFERENCES

Batada,N.N. (2004) CNplot: visualizing pre-clustered networks. Bioinformatics, 20, 1455–1456.
Batagelj,V. and Mrvar,A. (1998) Pajek - program for large network analysis. Connections, 2, 47–57.
Breitkreutz,B.J. et al. (2003) Osprey: a network visualization system. Genome Biol., 4, R22.
de Lichtenberg,U. et al. (2005) Dynamic complex formation during the yeast cell cycle. Science, 307, 724–727.
Iragne,F. et al. (2005) ProViz: protein interaction visualization and exploration. Bioinformatics, 21, 272–274.
Korbel,J.O. et al. (2005) Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol., 3, e134.
Pinney,J.W. et al. (2005) metaSHARK: software for automated metabolic network prediction from DNA sequence and its application to the genomes of Plasmodium falciparum and Eimeria tenella. Nucleic Acids Res., 33, 1399–1409.
Shannon,P. et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res., 13, 2498–2504.
von Mering,C. et al. (2005) STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Res., 33 (Database Issue), D433–D437.

Fig. 1. Investigating the evidential and physical links between proteins traversing the mitochondrial-cytoplasmic boundary in yeast. Details from work in progress (Lars Steinmetz, personal communication). For instance, ECM10 (a homolog to heat shock protein 70) is linked by gene neighbourhood, co-expression and text mining to the mitochondrial precursor protein MDJ1.

BIOINFORMATICS APPLICATIONS NOTE


BIOINFORMATICS APPLICATIONS NOTE
Vol. 19 no. 2 2003
Pages 317–318

Visualization and analysis of protein interactions

Byong-Hyon Ju1, Byungkyu Park1, Jong H. Park2 and Kyungsook Han1,*
1Department of Computer Science & Engineering, Inha University, Inchon, 402-751, South Korea and 2MRC-DUNN, Human Nutrition Unit, Cambridge CB2 2XY, UK

Received on May 31, 2002; revised on July 23, 2002; accepted on August 14, 2002

ABSTRACT

Summary: We have developed a new program called InterViewer for drawing large-scale protein interaction networks in three-dimensional space. Unique features of InterViewer include (1) it is much faster than other recent implementations of drawing algorithms; (2) it can be used not only for visualizing protein interactions but also for analyzing them interactively; and (3) it provides an integrated framework for querying protein interaction databases and directly visualizes the query results.
Availability: http://wilab.inha.ac.kr/protein/
Contact: khan@inha.ac.kr

INTRODUCTION

Recent improvements in high-throughput proteomics techniques such as the yeast two-hybrid system (Uetz et al., 2000) have produced a rapidly expanding volume of protein interaction data. Protein interaction data can be visualized as a graph in which nodes represent proteins and edges represent their interactions. Most protein interaction data have the following characteristics: (1) When visualized as a graph, the data yields a disconnected graph with many connected components. (2) The data yields a nonplanar graph with a large number of edge crossings that cannot be removed in a two-dimensional drawing. (3) The number of interacting proteins varies widely within the same set of data, resulting in nodes of very high degree as well as very low degree of interaction. (4) The data often contains protein interactions corresponding to self-loops. Therefore, interaction data demands robust and diverse features in visualizing and analyzing complex information. This paper describes an algorithm and its implementation called
InterViewer. Its integrated approach to databases can handle rapidly increasing diverse interaction data effectively compared to conventional flat files. InterViewer can directly query databases and visualizes the results in three-dimensional space at runtime. Visualized networks can be further refined or navigated to explore protein interactions.

*To whom correspondence should be addressed.

ALGORITHM AND IMPLEMENTATION

InterViewer's layout is based on the force-directed layout of Walshaw's algorithm (Walshaw, 2000), but differs from Walshaw's algorithm in the following sense: (1) Walshaw's algorithm groups nodes into clusters, whereas InterViewer does not. (2) Walshaw's algorithm initially places nodes randomly, whereas InterViewer places nodes on the surface of a sphere for better results. (3) Walshaw's algorithm iteratively updates layouts until the graph size falls below a certain threshold value, whereas InterViewer iterates 20 times unless specified otherwise by a user. At each iteration, the node positions are updated based on global spring forces between nonadjacent nodes (line 8 in Algorithm Layout) as well as local spring forces between adjacent nodes (line 10).

Algorithm Layout
 1: r = 1
 2: repeat
 3:   g = 0.01 · r · k^2   {k: natural spring length}
 4:   for v ∈ V do
 5:     D = 0   {D: displacement vector of node v}
 6:     for u ∈ V, u ≠ v do
 7:       Δ = pos[u] − pos[v]   {pos[u]: position of node u}
 8:       D = D − g · (Δ/|Δ|) · (|u|/|Δ|)   {|u|: distance of u from the origin}
 9:       if u ∈ Γ(v) then
10:         D = D − Δ/|Δ| · (1 − |Δ|/k)/|Γ(v)|   {Γ(v): set of vertices adjacent to v}
11:       end if
12:     end for
13:     pos[v_new] = pos[v_old] + D
14:   end for
15:   r = 0.98 · r
16: until T times   {T: user-specified number or 20}

Let T be the total number of iterations of the outer loop (line 2 of Algorithm Layout). For a graph with n nodes, O(n) time is required to compute the displacement D of a node, and O(n^2) time is required to compute D of all nodes. Therefore, the total time required is O(T · n^2) = O(n^2) since T is a constant. Compared to the asymptotic time complexity O(n^3) of Kamada and Kawai's algorithm (1989), InterViewer has a lower order of time complexity (see below for actual running times).

© Oxford University Press 2003

The layout algorithm was implemented in Borland Delphi 6.0, and databases of protein–protein interactions were constructed using Microsoft Data Access Components 2.7. The program runs on any PC with Windows 2000/XP/Me/98/NT 4.0.

Figure 1 shows the drawing of the entire MIPS physical interaction data. The drawing appears to have edge crossings, but it actually contains no edge crossing in the three-dimensional drawing. InterViewer allows the user to interactively explore three-dimensional drawings by rotating or by zooming in or out of them. It also enables the user to extract connected components of a disconnected graph, proteins interacting with a certain protein within a specified distance level, or proteins sharing a certain function. A protein interaction network can be saved either in an image file, the local database or a text file in GML format (http://www.uni-passau.de/Graphlet/GML).

Table 1. Running times of graph drawing programs on 3 test cases on a Pentium IV 1.7 GHz processor

Program               Y2H                       MIPS-G                    MIPS-P
(layout algorithm)    (1005 nodes, 905 edges)   (888 nodes, 1093 edges)   (2167 nodes, 2948 edges)
InterViewer           5 s                       4 s                       23 s
Pajek (F–R)           3 min 17 s                1 min 48 s                12 min 42 s
Tulip (GEM)           26 s                      19 s                      27 min 0 s
Tulip (S–E)           3 min 40 s                3 min 43 s                95 min 21 s

F–R: Fruchterman–Reingold's layout, S–E: Spring-Electrical Force layout, MIPS-G: MIPS genetic interaction data, MIPS-P: MIPS physical interaction data

For the purpose of comparison of actual running times, we ran two other graph-drawing programs, Pajek (Batagelj and Mrvar, 2001) and Tulip (David, 2001). Table 1 shows the running times of InterViewer, Pajek, and Tulip on the same set of test cases. It follows from this result that InterViewer is an order of magnitude faster than Pajek (Fruchterman–Reingold layout) and Tulip with Spring-Electrical Force layout and is significantly faster than Tulip with GEM layout.

We also implemented Kamada and Kawai's algorithm to compare its actual running
times with ours. Kamada and Kawai's algorithm produces 2D drawings only, so we extended it to 3D drawings. Since their algorithm cannot visualize a disconnected graph, we tested both their algorithm and ours on the largest connected components of Y2H data (473 nodes), MIPS genetic interaction data (531 nodes), and MIPS physical interaction data (1526 nodes) on a Pentium IV 1.7 GHz processor. The running times of Kamada and Kawai's algorithm on these test cases are 3.7 s, 4.9 s, and 1 min 4.3 s, respectively, while those of InterViewer are 1.2 s, 1.5 s, and 12.7 s, respectively. Thus, InterViewer is faster than Kamada and Kawai's algorithm in actual running times, too.

Fig. 1. Drawing of the MIPS physical interaction data with 2167 nodes and 2948 edges. Node labels are not shown in this drawing.

ACKNOWLEDGEMENTS

This work was supported by the Ministry of Information and Communication of South Korea under grant number IMT2000-C3-4.

REFERENCES

Batagelj,V. and Mrvar,A. (2001) Pajek - analysis and visualization of large networks. In Mutzel,P. et al. (ed.), Graph Drawing, Lecture Notes Comput. Sci., 2265, pp. 477–478.
David,A. (2001) Tulip. In Mutzel,P. et al. (ed.), Graph Drawing, Lecture Notes Comput. Sci., 2265, pp. 435–437.
Kamada,T. and Kawai,S. (1989) An algorithm for drawing general undirected graphs. Information Processing Letters, 31, 7–15.
Uetz,P., Giot,L., Cagney,G. et al. (2000) A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature, 403, 623–627.
Walshaw,C. (2000) A multilevel algorithm for force directed graph drawing. In Marks,J. (ed.), Graph Drawing, Lecture Notes in Computer Science, 1984, pp. 171–182.
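
The Algorithm Layout pseudocode of the InterViewer note can be sketched in Python as follows. This is a reading of the pseudocode, not the authors' Delphi implementation. Several details are assumptions of the sketch: the symbols lost in extraction are taken to be Δ (the vector from v to u) and Γ(v) (the neighbours of v), the initial sphere has radius k, nodes are numbered 0..n-1, and positions are updated in place as line 13 suggests. As in the pseudocode, the global term of line 8 is applied to every pair of distinct nodes.

```python
import math
import random
from typing import Iterable, List, Tuple

Vec = List[float]


def sphere_points(n: int, radius: float = 1.0, seed: int = 0) -> List[Vec]:
    """Place n points uniformly at random on the surface of a sphere."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        z = rng.uniform(-1.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        s = math.sqrt(1.0 - z * z)
        points.append([radius * s * math.cos(phi),
                       radius * s * math.sin(phi),
                       radius * z])
    return points


def layout(n: int, edges: Iterable[Tuple[int, int]], k: float = 1.0,
           iterations: int = 20, seed: int = 0) -> List[Vec]:
    """Force-directed 3D layout following the structure of Algorithm Layout."""
    adjacent = [set() for _ in range(n)]
    for a, b in edges:
        if a != b:  # a self-loop exerts no spring force in this sketch
            adjacent[a].add(b)
            adjacent[b].add(a)

    pos = sphere_points(n, radius=k, seed=seed)
    r = 1.0
    for _ in range(iterations):                       # lines 2 and 16
        g = 0.01 * r * k * k                          # line 3
        for v in range(n):                            # line 4
            d = [0.0, 0.0, 0.0]                       # line 5
            for u in range(n):                        # line 6
                if u == v:
                    continue
                delta = [pos[u][i] - pos[v][i] for i in range(3)]    # line 7
                dist = math.sqrt(sum(c * c for c in delta))
                if dist == 0.0:
                    continue  # coincident nodes: direction undefined, skip
                norm_u = math.sqrt(sum(c * c for c in pos[u]))
                # line 8: g * (delta/|delta|) * (|u|/|delta|)
                scale = g * norm_u / (dist * dist)
                if u in adjacent[v]:                  # lines 9-10
                    scale += (1.0 - dist / k) / (dist * len(adjacent[v]))
                for i in range(3):
                    d[i] -= scale * delta[i]
            for i in range(3):                        # line 13
                pos[v][i] += d[i]
        r *= 0.98                                     # line 15
    return pos


if __name__ == "__main__":
    coords = layout(4, [(0, 1), (1, 2), (2, 3)])
    for node, xyz in enumerate(coords):
        print(node, [round(c, 3) for c in xyz])
```

The two nested loops over nodes make one iteration cost O(n^2), which matches the complexity argument in the text; with a fixed number of iterations the total stays O(n^2).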

Bioinformatics paper

SNVMix: predicting single nucleotide variants from
next-generation sequencing of tumors
Rodrigo Goya1,2, Mark G.F. Sun1, Ryan D. Morin2, Gillian Leung1, Gavin Ha1, Kimberley C. Wiegand3,4, Janine Senz3,4, Anamaria Crisan1, Marco A. Marra2, Martin Hirst2, David Huntsman3,4, Kevin P. Murphy5, Sam Aparicio1 and Sohrab P. Shah1,3,4,∗
Associate Editor: Alex Bateman
ABSTRACT
Motivation: Next-generation sequencing (NGS) has enabled whole genome and transcriptome single nucleotide variant (SNV) discovery in cancer. NGS produces millions of short sequence reads that, once aligned to a reference genome sequence, can be interpreted for the presence of SNVs. Although tools exist for SNV discovery from NGS data, none are specifically suited to work with data from tumors, where altered ploidy and tumor cellularity impact the statistical expectations of SNV discovery.
Results: We developed three implementations of a probabilistic Binomial mixture model, called SNVMix, designed to infer SNVs from NGS data from tumors to address this problem. The first models allelic counts as observations and infers SNVs and model parameters using an expectation maximization (EM) algorithm and is therefore capable of adjusting to deviation of allelic frequencies inherent in genomically unstable tumor genomes. The second models nucleotide and mapping qualities of the reads by probabilistically weighting the contribution of a read/nucleotide to the inference of a SNV based on the confidence we have in the base call and the read alignment. The third combines filtering out low-quality data in addition to probabilistic weighting of the qualities. We quantitatively evaluated these approaches on 16 ovarian cancer RNASeq datasets with matched genotyping arrays and a human breast cancer genome sequenced to >40× (haploid) coverage with ground truth data and show systematically that the SNVMix models outperform competing approaches.
Availability: Software and data are available at http://compbio.bccrc.ca
Contact: sshah@bccrc.ca
Supplementary information: Supplementary data are available at Bioinformatics online.
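
To make the idea of a Binomial mixture over allelic counts concrete, the sketch below computes the posterior probability of each genotype at one position from the number of reads matching the reference. It is an illustration under stated assumptions, not the SNVMix code: the three-component structure (aa, ab, bb), the function name `genotype_posterior`, and the values of `mu` and `pi` are all chosen here for the example, whereas the abstract states that the model parameters are fitted by EM.

```python
from math import comb
from typing import List, Sequence


def genotype_posterior(n_ref: int, depth: int,
                       mu: Sequence[float] = (0.995, 0.5, 0.005),
                       pi: Sequence[float] = (0.9, 0.05, 0.05)) -> List[float]:
    """Posterior over genotypes (aa, ab, bb) for one position.

    n_ref : number of reads showing the reference allele
    depth : total number of reads covering the position
    mu    : assumed probability of a reference read under each genotype
    pi    : assumed prior probability of each genotype
    """
    if not 0 <= n_ref <= depth:
        raise ValueError("n_ref must lie between 0 and depth")

    # Prior times Binomial likelihood for each mixture component.
    weighted = [
        p * comb(depth, n_ref) * m ** n_ref * (1.0 - m) ** (depth - n_ref)
        for m, p in zip(mu, pi)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


if __name__ == "__main__":
    print(genotype_posterior(n_ref=10, depth=20))  # balanced counts
    print(genotype_posterior(n_ref=20, depth=20))  # all reads match the reference
```

A position could then be called an SNV when the combined posterior of the ab and bb components exceeds a chosen threshold; in an EM setting these posteriors would be the E-step responsibilities used to re-estimate `mu` and `pi`.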

Bioinformatics-based exploration of immune-related genes and immune cell infiltration in systemic sclerosis-related pulmonary arterial hypertension

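doi:10.3969/j.issn.1000-484X.2023.09.024

Bioinformatics-based exploration of immune-related genes and immune cell infiltration in systemic sclerosis-related pulmonary arterial hypertension

LI Xiaoqun, XUAN Haochen, SHI Xiangxiang, WANG Chaofan, LI Lei, XU Tongda (The Affiliated Hospital of Xuzhou Medical University, Xuzhou 221000)

CLC number: R593.2  Document code: A  Article ID: 1000-484X(2023)09-1937-06

[Abstract] Objective: To identify immune-related genes (IRGs) in systemic sclerosis-related pulmonary arterial hypertension (SSc-PAH) by bioinformatics analysis, and to explore in depth the role of immune cell infiltration in SSc-PAH.

Methods: The GSE22356 dataset was downloaded from the GEO database and analysed for differentially expressed genes (DEGs) with R.

A human immune gene dataset was downloaded from the ImmPort database and intersected with the DEGs to obtain differentially expressed immune-related genes (DEIRGs), which were then subjected to GO and KEGG enrichment analysis.

A protein–protein interaction (PPI) network of the DEIRGs was constructed with Cytoscape, and hub immune genes were selected according to their Degree value.

The CIBERSORT algorithm was used to evaluate immune cell infiltration in the blood of SSc and SSc-PAH patients.

Correlation analysis was performed between the hub immune genes and the infiltrating immune cells.

Results: A total of 56 DEIRGs were obtained.

Functional enrichment showed that the DEIRGs are important in signal transduction, immune response and chemotaxis.

Based on the PPI network and the Degree value, the hub immune genes were FCGR3B, CD28, LYN and LCK.

The immune cell infiltration results showed that, compared with SSc, the proportions of monocytes and neutrophils in the blood of SSc-PAH patients were higher, whereas the proportions of naive CD4 T cells, resting CD4 memory T cells and γδ T cells were lower.

The correlation analysis showed that CD28 and LCK were positively correlated with γδ T cells and negatively correlated with monocytes.

FCGR3B was positively correlated with neutrophils and negatively correlated with naive CD4 T cells; LYN was positively correlated with monocytes and negatively correlated with γδ T cells.

Conclusion: Hub immune genes and differences in immune cell infiltration play an important role in the onset and progression of SSc-PAH.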

[关键词]系统性硬化病相关肺动脉高压;生物信息学分析;免疫细胞浸润Identification of key immune-related genes and immune cells infiltration in systemic sclerosis-related pulmonary arterial hypertension based on bioinfor-matics analysisLI Xiaoqun,XUAN Haochen,SHI Xiangxiang,WANG Chaofan,LI Lei,XU Tongda. The Affiliated Hospital of Xuzhou Medical University, Xuzhou 221000, China[Abstract]Objective:To identify immune-related genes (IRGs) and explore role of immune cells infiltration in systemic scle‐rosis-related pulmonary arterial hypertension (SSc-PAH)based on bioinformatics analysis. Methods:GSE22356 dataset was down‐loaded from GEO database and conducted with differentially expressed genes (DEGs) analysis by R software. Human immunity gene dataset was downloaded from ImmPort database, which were intersected with DEGs to obtain differentially expressed immune-related genes (DEIRGs). GO and KEGG enrichment analysis were performed with DEIRGs. Protein-protein interaction (PPI)network of DEIRGs was built by Cytoscape, and key immune genes were screened according to Degree value. CIBERSORT algorithm was used to evaluate immune cells infiltration between SSc and SSc-PAH patients. Correlation analysis between key DEIRGs and infiltrating immune cells was performed. Results:A total of 56 DEIRGs were detected. Enrichment function demonstrated that DEIRGs were significant in signal transduction,immune response and chemotaxis. Key DEIRGs assessed by PPI network and degree score were FCGR3B,CD28,LYN,LCK. Immune cells infiltration findings indicated that compared with SSc,SSc-PAH patients contained a higher proportion of monocytes and neutrophils, while a lower proportion of naive CD4 T cells, inactive CD4 memory T cells and γδ T cells. Correlation analysis results demonstrated that CD28 and LCK were positively correlated with γδ T cells and negatively correlated with monocytes. FCGR3B was positively correlated with neutrophils and negatively correlated with naive CD4 T cells. 
LYN was posi‐tively correlated with monocytes and negatively correlated with γδ T cells. Conclusion:Key immune-related genes and differences in immune cells infiltration play an essential role in occurrence and progression of SSc-PAH.[Key words]Systemic sclerosis-related pulmonary arterial hypertension;Bioinformatics analysis;Immune cells infiltration①本文受江苏省科技厅重点研发计划(BE2019639);江苏省中医药管理局课题(YB201988);江苏省卫生健康委科研项目(M2020015)资助。
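Two steps of the Methods above, intersecting the DEGs with an immune gene list and ranking the resulting genes by Degree in a PPI network, can be sketched in a few lines. The study itself used R, ImmPort and Cytoscape; the Python functions below, their names and the placeholder gene identifiers (G1, G2, ...) are purely illustrative assumptions and do not reproduce any result of the paper.

```python
from collections import Counter


def intersect_genes(degs, immune_genes):
    """Return the genes present in both lists, sorted alphabetically."""
    return sorted(set(degs) & set(immune_genes))


def hub_genes_by_degree(edges, genes, top_n):
    """Rank genes by degree in an undirected network restricted to `genes`.

    edges: iterable of (gene_a, gene_b) pairs; duplicate edges and self-loops are ignored.
    Ties are broken alphabetically so the output is deterministic.
    """
    keep = set(genes)
    degree = Counter()
    seen = set()
    for a, b in edges:
        if a == b or a not in keep or b not in keep:
            continue
        pair = frozenset((a, b))
        if pair in seen:
            continue
        seen.add(pair)
        degree[a] += 1
        degree[b] += 1
    ranked = sorted(keep, key=lambda g: (-degree[g], g))
    return [(g, degree[g]) for g in ranked[:top_n]]


# Hypothetical placeholder data, not taken from GSE22356 or ImmPort.
degs = ["G1", "G2", "G3", "G4", "G5"]
immune = ["G2", "G3", "G4", "G5", "G9"]
edges = [("G2", "G3"), ("G2", "G4"), ("G2", "G5"),
         ("G3", "G4"), ("G4", "G3"), ("G1", "G2")]

deirgs = intersect_genes(degs, immune)
hubs = hub_genes_by_degree(edges, deirgs, top_n=2)
```

Degree is only one of several centrality measures available for hub selection; it is used here because it is the criterion named in the abstract.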

interactomes

BIOINFORMATICS APPLICATIONS NOTE
Vol. 21 no. 8 2005, pages 1741–1742
doi:10.1093/bioinformatics/bti237
Systems biology

An enhanced Java graph applet interface for visualizing interactomes

Aaron N. Chang 1,2, Jason McDermott 2 and Ram Samudrala 2,∗
1 Biomedical and Health Informatics and 2 Department of Microbiology, University of Washington, Seattle, WA 98195, USA
∗ To whom correspondence should be addressed.
Received on September 16, 2004; revised on November 8, 2004; accepted on December 15, 2004
Advance Access publication December 21, 2004
© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@

ABSTRACT
Summary: We have developed several new navigation features for a Java graph applet previously released for visualizing protein–protein interactions. This graph viewer can be used to navigate any molecular interactome dataset. We have successfully implemented this tool for exploring protein networks stored in the Bioverse interaction database.
Availability: /viewer
Contact: ram@

Modeling protein–protein interaction (PPI) networks as graphs with nodes and edges to depict proteins and their interactions is an important paradigm for systems analysis. The development of simple, intuitive interfaces to PPI and other molecular interaction networks is becoming increasingly important as these datasets grow exponentially. The ubiquity of web browsers on all modern computing platforms makes them the ideal interface to deploy such tools.

A Java applet to assist with the visualization of PPI data as interactive graphs was first released by Mrowka (2001). This tool was originally based on the Graph.java applet code distributed by Sun Microsystems. It utilizes a relaxation layout algorithm which attempts to prevent overlapping of nodes. It can run as a stand-alone applet or, more commonly, inside a Java-enabled web browser. Navigation of graphs is mediated by setting the desired neighborhood (depth) level and clicking on individual nodes. Double-clicking on nodes isolates the subgraph containing nodes of the specified neighborhood. A 'freeze' function is provided to halt the layout algorithm to aid visual analysis.

We have implemented feature improvements to this applet to enhance navigation (Fig. 1). We have added a simple search field to the toolbar which allows for asterisk-based wildcarding (i.e. S∗ will show all nodes starting with S). The search feature can locate exact or similar node matches based on the identifiers used for node labels. Next, we added a sorted, enumerated list of functional groups in the drop-down menu. In our implementation, we employed Gene Ontology (GO) functional annotations for exploring PPI networks (Ashburner et al., 2000). In practice, any functional annotation system can be used. An indication of the total number of nodes and edges present in the current graph is also displayed. In the graph window, a color-coding scheme for edge weights has been implemented. These colors represent quartiles of interaction confidence values between 0.00 and 1.00 as stored in the Bioverse database (McDermott and Samudrala, 2003, 2004). The colors range from red (0.00–0.24), orange (0.25–0.49), yellow (0.50–0.74) to green (0.75–1.00). Finally, we added a right-click mouse feature, which in addition to expanding nodes, also opens hyperlinks to detailed protein records specific to a given node. This is based on URL forwarding and can be easily modified for hyperlinking to other destinations.

This updated graph viewer was used successfully as an interface to the Bioverse database, which contains PPI data for more than 50 different organisms. We have used the viewer to depict contextual, as well as evolutionary, relationships between proteins. Contextual relationships are defined as physical interactions or functional linkages between proteins. Evolutionary relationships are based on sequence, structure and functional similarities. Our testing of this viewer indicates that it is effective in visualizing PPI networks, as well as any other molecular interactome. These can include genes, metabolites or other small molecules with pairwise interaction relationships. Gene interactions derived from microarray analysis, synthetic lethal screens or combinations with PPI data are some examples (Ideker et al., 2001; Tong et al., 2001). Such datasets can easily be adapted as input values.

This viewer is small, stable, multi-platform and simple to use. It can function as a stand-alone applet or be integrated into a web application. Detailed support information on the installation, usage and code development is available on our servers. In future versions, we plan to extend many of these interactive features into a server-based web application framework to provide more robust searching of PPI networks.

Fig. 1. Enhanced Java graph viewer applet as integrated in the Bioverse framework. Here, the viewer is shown in the context of PPI data from Bacillus subtilis stored in the Bioverse database. (A) The tool bar. From left to right, this contains the search field, freeze button, drop-down functional group menu, depth selector (distance), refresh button and total counts of nodes and edges. Search terms can be exact matches or similar matches using asterisk wildcards. The drop-down menu enumerates the number of nodes for each functional group in reverse sort order. (B) Undirected, weighted edges are depicted by a color code for confidence values ranging from 0.00 to 1.00. The colors range from red (0.00–0.24), orange (0.25–0.49), yellow (0.50–0.74), to green (0.75–1.00). Clicking on a node will select it and expand its local neighbors according to the 'distance' setting in the toolbar. A selected node is green and its neighbors are yellow. Double-clicking a node will isolate the subgraph, removing unconnected nodes from the window. Right-clicking on a node will expand the node and bring up hyperlinked web pages pertinent to the node. In our case, the user is provided with a protein record page stored in the Bioverse database.

ACKNOWLEDGEMENTS
A.N.C. was supported by an NLM Medical Informatics Training Grant (1T15LM07441-01). This work was also supported in part by a Searle Scholar Award and an NSF Grant (DBI-0217241) to R.S.

REFERENCES
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.
Ideker, T., Thorsson, V., Ranish, J.A., Christmas, R., Buhler, J., Eng, J.K., Bumgarner, R., Goodlett, D.R., Aebersold, R. and Hood, L. (2001) Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292, 929–934.
McDermott, J. and Samudrala, R. (2003) Bioverse: functional, structural and contextual annotation of proteins and proteomes. Nucleic Acids Res., 31, 3736–3737.
McDermott, J. and Samudrala, R. (2004) Enhanced functional information from predicted protein networks. Trends Biotechnol., 22, 60–62; Discussion 62–63.
Mrowka, R. (2001) A Java applet for visualizing protein–protein interaction. Bioinformatics, 17, 669–671.
Tong, A.H., Evangelista, M., Parsons, A.B., Xu, H., Bader, G.D., Page, N., Robinson, M., Raghibizadeh, S., Hogue, C.W., Bussey, H. et al. (2001) Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science, 294, 2364–2368.
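Two of the navigation features described in the note above, the quartile color code for edge confidence and the asterisk-based label search, are simple enough to sketch. The applet itself is written in Java; the Python below is only an illustration, and the function names, the exact treatment of boundary values (the published ranges leave small gaps such as 0.245), the case-sensitive matching and the node labels are assumptions.

```python
import re


def confidence_color(confidence):
    """Map an interaction confidence in [0, 1] to the quartile color described in the note."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie between 0.00 and 1.00")
    if confidence < 0.25:
        return "red"
    if confidence < 0.50:
        return "orange"
    if confidence < 0.75:
        return "yellow"
    return "green"


def search_nodes(labels, pattern):
    """Return labels matching a pattern in which '*' stands for any run of characters.

    All other characters are matched literally and case-sensitively (an assumption).
    """
    regex = re.compile(".*".join(re.escape(part) for part in pattern.split("*")))
    return [label for label in labels if regex.fullmatch(label)]


# Hypothetical node labels for illustration.
labels = ["S1", "S22", "T1", "xS"]
starts_with_s = search_nodes(labels, "S*")
exact = search_nodes(labels, "T1")
colors = [confidence_color(c) for c in (0.10, 0.25, 0.74, 0.75, 1.00)]
```

A pattern without an asterisk reduces to an exact match, which mirrors the statement that search terms can be exact or wildcard matches.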

BIOINFORMATICS APPLICATIONS NOTE
Vol. 26 no. 18 2010, pages 2347–2348
doi:10.1093/bioinformatics/btq430
Systems biology
Advance Access publication July 23, 2010

Cytoscape Web: an interactive web-based network browser

Christian T. Lopes, Max Franz, Farzana Kazi, Sylva L. Donaldson, Quaid Morris and Gary D. Bader∗
Banting and Best Department of Medical Research, Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, 160 College Street, Toronto, ON M5S 3E1, Canada
Associate Editor: Joaquin Dopazo
∗ To whom correspondence should be addressed.
Received on May 3, 2010; revised on July 1, 2010; accepted on July 20, 2010
© The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT
Summary: Cytoscape Web is a web-based network visualization tool, modeled after Cytoscape, which is open source, interactive, customizable and easily integrated into web sites. Multiple file exchange formats can be used to load data into Cytoscape Web, including GraphML, XGMML and SIF.
Availability and Implementation: Cytoscape Web is implemented in Flex/ActionScript with a JavaScript API and is freely available at /
Contact: **********************
Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION
Increasing amounts of high-throughput data are being collected, stored, shared and analyzed on the web, highlighting the need for effective web-based data visualization. Network visualization components are especially valuable to help researchers interpret their data as part of data analysis tools. However, current web-based network visualization components lack many useful features of their desktop counterparts. Medusa (Hooper and Bork, 2005) is a Java applet originally used in the STRING database (Jensen et al., 2009) and by many other web sites for network visualization, but it lacks advanced features, such as detailed customization of the network view. jSquid (Klammer et al., 2008) expands Medusa's functionality, but does not provide an easy way for the client web site to change and interact with the network view after it has been rendered. TouchGraph (/navigator.html) is another Java applet for network visualization, but it provides only one mode of network interaction designed for exploration and is not easily customizable. yFiles Flex (/en/products_yfilesflex_about.html) is a rich Internet application with a feature-rich user interface and an architecture that balances client/server work and supports efficient data communication. This commercial software is customizable within the bounds of the code already written, but is not open source. Cytoscape (/) is an open source Java network visualization and analysis tool that provides a large array of useful features (Shannon et al., 2003), but it is not specifically designed for use on the web except via Java WebStart or as a library to generate static network images for web display. The field of network visualization is lacking an interactive, easily customizable, open source, web-based visualization component.

Fig. 1. A Cytoscape Web network with a customized visual style.

Cytoscape Web is an interactive, web-based network visualization tool, modeled after the popular Cytoscape software (Fig. 1). Using basic programming skills, Cytoscape Web can be customized and incorporated into any web site. Cytoscape Web is not intended as a replacement for the Cytoscape desktop application; for example, it contains none of the plugin architecture functionality of Cytoscape. Instead, it is intended as a low-overhead tool to add network visualization to a web application.

2 IMPLEMENTATION
Cytoscape Web is a client-side component that requires no server-side implementation, and allows developers to choose any server-side technology, if necessary. The main network display component of Cytoscape Web is implemented in Flex/ActionScript, but a JavaScript application programming interface (API) is provided so that all the customization and interaction with the network view can be easily built in JavaScript without needing to change and compile the Flash code. This architecture has the advantage of using the Flash platform to implement complex and interactive vector images that behave consistently across major browsers, but without requiring the web site to be entirely built with this technology. In other words, the web site itself can rely on web standards (HTML, CSS and JavaScript) for embedding and interacting with Cytoscape Web. This design also offers the possibility of migrating the implementation to other technologies, such as scalable vector graphics (SVG) and HTML5, in the future without making major API changes. The choice of Flash rather than Java is motivated by the fact that Java applets can be slow to launch, require the download of the large Java runtime and make it difficult to create custom (non-Swing) user interfaces without writing low-level graphics code.

3 AVAILABLE FEATURES
3.1 Features
Similarly to Cytoscape, Cytoscape Web allows the client application to define a network of nodes and edges and customize their attributes. Data can be loaded into Cytoscape Web through one of the supported XML-based exchange formats (GraphML or XGMML) or a simple tab-delimited text format (Cytoscape's SIF format). The network data can also be exported to any of the above-mentioned formats. The client can dynamically change node and edge visual styles (e.g. color, size and opacity), using any of the following methods: (i) specifying default visual properties for all elements; (ii) mapping node and edge attributes (e.g. interaction type and weight) to visual styles; and (iii) overriding default or mapped styles by setting a bypass style. For instance, different types of interactions can be mapped to edge colors (e.g. protein–protein to blue, protein–DNA to green), and the edge width can be used to represent interaction weight. Then the developer can use the bypass mechanism to create a first neighbors highlight feature, for example, by setting the nodes and edges that belong to the neighbor set to the color red. When the first neighbors' bypass is removed, the colors are automatically restored to their default or mapped values.

These three options, combined with more than 20 visual properties for nodes and edges, provide flexibility and enable each Cytoscape Web-based application to define its own semantics, styles and features. For example, iRefWeb (/iRefWeb/), an interface to the interaction Reference Index (iRefIndex) database (Razick et al., 2008), uses a basic implementation of Cytoscape Web to display all interactions in which a single query gene participates. Alternatively, GeneMANIA (; Warde-Farley et al., 2010), a gene function prediction tool, uses a more complex implementation of Cytoscape Web to extend a user's input gene list and display interactions among the genes. Cytoscape Web communicates with the GeneMANIA server, mediated by client-side JavaScript, to display gene or network-specific highlights and associated information interactively.

Cytoscape Web's API can be used to implement the following features: a filter for nodes and edges, which temporarily removes the filtered-out elements based on attribute data; functions for adding and deleting nodes and edges at runtime; and the ability to export the whole network as an image, either to a PNG (Portable Network Graphics) file or to a publication-quality vector format (PDF). Cytoscape Web provides the ability to pan and zoom the network and choose different network layouts, including force directed. The layout parameters can be customized, but if none of the available layouts produces the desired results, the web application can run an external layout algorithm (in JavaScript or on the server side, for instance) and pass the results, via node positions, back to Cytoscape Web for visualization.

3.2 Performance
Cytoscape Web works best with small- to medium-sized networks, generally with up to a few hundred nodes and edges. Larger networks can be visualized, but the user interaction can become sluggish around 2000 elements (nodes or edges), for example 800 nodes and 1200 edges (tested on an Apple laptop computer with a 2 GHz dual core CPU and 4 GB RAM). Use of the force-directed layout is the major bottleneck in the initial rendering of a typical network. However, faster layouts are available, and overall performance is dependent upon the client web site implementation and the end user configuration. Additional performance statistics for Cytoscape Web are available in the Supplementary Material.

3.3 Documentation
Cytoscape Web is actively developed as an open source project and is freely available at /. This web site includes a tutorial with ready-to-use sample code, the API documentation and a showcase of major Cytoscape Web features. The online examples can be freely used as a template for building web sites containing Cytoscape Web.

3.4 Future directions
Future plans include the implementation of custom graphics for nodes and edges, additional network layouts, and closer integration with Cytoscape [e.g. importing/exporting networks in the Cytoscape (.cys) file format].

ACKNOWLEDGEMENTS
Cytoscape is developed through an ongoing collaboration between the University of California at San Diego, the Institute for Systems Biology, Memorial Sloan-Kettering Cancer Center, Institut Pasteur, Agilent Technologies, Unilever, the University of Michigan, the University of California at San Francisco and the University of Toronto. We gratefully acknowledge the contributions of the many Cytoscape developers who developed software that Cytoscape Web was based on. We thank the entire GeneMANIA team for support during the development of Cytoscape Web.

Funding: Genome Canada through the Ontario Genomics Institute (grant number 2007-OGI-TD-05); the U.S. National Institute of General Medical Sciences of the National Institutes of Health (grant number 2R01GM070743-06).

Conflict of Interest: none declared.

REFERENCES
Hooper, S.D. and Bork, P. (2005) Medusa: a simple tool for interaction graph analysis. Bioinformatics, 21, 4432–4433.
Jensen, L.J. et al. (2009) STRING 8–a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res., 37, D412–D416.
Klammer, M. et al. (2008) jSquid: a Java applet for graphical on-line network exploration. Bioinformatics, 24, 1467–1468.
Razick, S. et al. (2008) iRefIndex: a consolidated protein interactions database with provenance. BMC Bioinformatics, 9, 405.
Shannon, P. et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res., 13, 2498–2504.
Warde-Farley, D. et al. (2010) The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res., 38, W214–W220.
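Section 3.1 of the note above mentions the tab-delimited SIF format and a three-level style scheme (default, attribute mapping, bypass). The sketch below illustrates both ideas in Python. It is not the Cytoscape Web JavaScript API: the function names, the interaction codes `pp` and `pd`, the default color and the toy network are assumptions chosen for illustration, while the blue/green/red colors follow the example given in the text.

```python
def parse_sif(text):
    """Parse SIF text into (sorted node list, list of (source, interaction, target)).

    Each line reads: source, interaction type, then one or more targets.
    A line holding a single name declares an unconnected node.
    Tabs are used as the delimiter when present, otherwise any whitespace.
    """
    nodes, edges = set(), []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        fields = line.split("\t") if "\t" in line else line.split()
        fields = [f.strip() for f in fields if f.strip()]
        source = fields[0]
        nodes.add(source)
        if len(fields) >= 3:
            interaction = fields[1]
            for target in fields[2:]:
                nodes.add(target)
                edges.append((source, interaction, target))
    return sorted(nodes), edges


def resolve_edge_color(edge, default, mapping, bypass):
    """Pick an edge color: bypass overrides the mapped value, which overrides the default."""
    if edge in bypass:
        return bypass[edge]
    return mapping.get(edge[1], default)


# Hypothetical network: 'pp' = protein-protein, 'pd' = protein-DNA.
sif_text = "A\tpp\tB\tC\nB\tpd\tD\nE\n"
nodes, edges = parse_sif(sif_text)

mapping = {"pp": "blue", "pd": "green"}
bypass = {("A", "pp", "B"): "red"}  # e.g. a temporary first-neighbor highlight
highlighted = [resolve_edge_color(e, "gray", mapping, bypass) for e in edges]
restored = [resolve_edge_color(e, "gray", mapping, {}) for e in edges]
```

Removing the bypass entry restores the mapped colors without touching the mapping itself, which is the behavior the text describes for the first-neighbor highlight.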

Sequence analysis

BIOINFORMATICS APPLICATIONS NOTE
Vol. 21 no. 24 2005, pages 4416–4419
doi:10.1093/bioinformatics/bti715
Sequence analysis

Improving disulfide connectivity prediction with sequential distance between oxidized cysteines

Chi-Hung Tsai 1, Bo-Juen Chen 1, Chen-hsiung Chan 1, Hsuan-Liang Liu 2 and Cheng-Yan Kao 1,3,∗
1 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan 106, 2 Department of Chemical Engineering and Graduate Institute of Biotechnology, National Taipei University of Technology, Taipei, Taiwan 10608 and 3 Institute for Information Industry, Taipei, Taiwan 106
Received on August 19, 2005; revised and accepted on October 11, 2005
Advance Access publication October 13, 2005

ABSTRACT
Summary: Predicting disulfide connectivity precisely helps towards the solution of protein structure prediction. In this study, a descriptor derived from the sequential distance between oxidized cysteines (denoted as DOC) is proposed. An approach using the support vector machine (SVM) method based on weighted graph matching was further developed to predict the disulfide connectivity pattern in proteins. When DOC was applied, a prediction accuracy of 63% could be achieved for our SVM models, which is significantly higher than those obtained from previous approaches. The results show that using the non-local descriptor DOC coupled with local sequence profiles significantly improves the prediction accuracy. These improvements demonstrate that DOC, with a proper scaling scheme, is an effective feature for the prediction of disulfide connectivity. The method developed in this work is available at the web server PreCys (prediction of cys–cys linkages of proteins).
Availability: .tw:5433/Disulfide/
Contact: cykao@.tw
Supplementary information: Supplementary data, detailed results, tables and information are available at .tw:5433/Disulfide/

1 INTRODUCTION
Disulfide bonds, commonly found in extracellular proteins, stabilize folded conformations as they contribute to the stability of the three-dimensional structures with respect to thermodynamics (Wedemeyer et al., 2000). Since disulfide bonds impose length and angle constraints on the backbone of a protein, correct prediction of disulfide connectivity can be employed to dramatically reduce the search in conformational space and greatly raise the accuracy of protein structure prediction (Huang et al., 1999). Different methods (Fariselli and Casadio, 2001; Fariselli et al., 2002; Vullo and Frasconi, 2004) have been developed to predict disulfide connectivity with prior knowledge of the oxidization states of cysteine residues. These methods can be classified into two categories: (1) patternwise or (2) pairwise. The major difference between them is whether the methodology is developed to deal with alternative disulfide connectivity patterns (Vullo and Frasconi, 2004; Zhao et al., 2005) or with the relationships between cysteine pairs (Fariselli and Casadio, 2001; Baldi et al., 2005; Ferrè and Clote, 2005). This difference decides how the information is encoded. However, the prediction accuracies of these methods are still limited so far (∼50%).

Besides the methodology used, another critical factor determining the predictive performance is the descriptor employed. Fariselli and Casadio (2001) computed residue contact potentials according to the nearest-neighbor residues of bonded cysteines. Secondary structure (Baldi et al., 2005; Ferrè and Clote, 2005) and solvent accessibility (Baldi et al., 2005) were also used as descriptors to represent input information. All these descriptors only describe the local environments of bonded cysteines. However, a disulfide bridge is a long-range interaction between two linearly distant cysteines.
Descriptors containing local information only are insufficient for predicting disulfide connectivity accurately. Therefore, information regarding relationships between cysteines is highly desired. Harrison and Sternberg (1994) have suggested that sequence separation between bonded cysteines correlates strongly with specific connectivity patterns. Zhao et al. (2005) also observed that the disulfide connectivity pattern is highly conserved among proteins with the same cysteine-separation pattern of oxidized cysteines. Although there have been some attempts (Vullo, 2004; Baldi et al., 2005) to take advantage of such information by using descriptors such as positions of cysteines or relative sequence length, no emphasis has been placed on the effects of these features so far.

In this paper, a descriptor derived from the linear sequence distance between oxidized cysteines (denoted as DOC) was used to demonstrate its power in predicting disulfide connectivity. A pairwise method using a support vector machine (SVM) to generate bonding potentials of cysteine pairs was developed. This method was further validated with a dataset derived from Swiss-Prot 39 (SP39), and significant improvements were obtained when the non-local descriptor DOC coupled with local sequence profiles was applied. These results reveal that DOC is an effective feature in disulfide connectivity prediction. The web interface service of the method proposed in this study for disulfide connectivity prediction is available at .tw:5433/Disulfide/

∗ To whom correspondence should be addressed.
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@

2 METHODOLOGY
2.1 Prediction of the connectivity pattern of disulfide bridges
With prior knowledge of the oxidation states of cysteine residues, a prediction strategy similar to previous studies (Fariselli and Casadio, 2001; Baldi et al., 2005; Ferrè and Clote, 2005) was applied. The whole problem was mapped to an undirected complete graph, where oxidized cysteines were considered as vertices and the probabilities of connectivity between cysteine pairs were assigned as the weights of the edges between corresponding vertices. The disulfide connectivity pattern can then be inferred by solving the maximum weight matching of this graph, which implies maximum probabilities for the bonding pairs of the resulting pattern.

2.1.1 SVM
In this work, SVM was employed to predict the potential of connectivity between cysteines. SVM has been applied broadly within the field of computational biology to pattern-recognition problems and is a promising technique for data classification (Vapnik, 1998). Given training data $x_1, \ldots, x_l$, we set the label $y_i$ to $+1$ if $x_i$ is in class 1 and to $-1$ if $x_i$ belongs to class 2. With these training data, SVM solves an optimization problem for binary classification:

$$\min_{\omega, b, \xi} \ \frac{1}{2}\,\omega^{T}\omega + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i\left(\omega^{T}\phi(x_i) + b\right) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \tag{1}$$

where $x_i$ is mapped to a higher dimensional space by the function $\phi$; $\xi_i$ is the training error allowed and $C$ is the cost of error. Moreover, the SVM output can further be used to approximate the posterior class probability $P(y_i = 1 \mid x_i)$ with a sigmoid function (Platt, 2000):

$$P(y_i = 1 \mid f_i) = \frac{1}{1 + \exp(A f_i + B)}, \tag{2}$$

where $A$ and $B$ are parameters and $f_i = \omega^{T}\phi(x_i) + b$. Using (2), we can infer the bonding probability for each pair of cysteines. The software LIBSVM (Chang and Lin, 2000), a library for SVMs, was adopted in our experiments.

2.1.2 Data encoding
Two descriptors were mainly considered to encode input data for the SVM: (1) local sequence profiles (evolutionary information) around target cysteines from multiple sequence alignments and (2) the linear DOC. We generated sequence profiles by performing multiple sequence alignments with the widely used program PSI-BLAST (Altschul et al., 1997). For each cysteine pair Cys(i, j), profiles were extracted using a window centered at cysteines i and j. The window size indicates the scope of the vicinity of the target cysteine and determines how much information is provided to our models. In our experiments, the window size was set to 13, and the values of elements in the profiles were scaled to [0, 1].

For a cysteine pair with sequence indexes i and j, the corresponding DOC is defined as follows:

$$\mathrm{DOC}(i, j) = \lVert i - j \rVert. \tag{3}$$

Since scaling approaches may affect the performance of SVM, three scaling schemes for DOC were tested:
(1) DOC_L, DOC normalized with the protein sequence length L.
(2) DOC_max, DOC normalized with the maximum value of the whole dataset.
(3) DOC_log, DOC values normalized with the logarithm function.

2.1.3 Maximum weight matching
Features were encoded with respect to each pair of cysteines, and SVM models were trained with these data to generate posterior probabilities that indicate the potential of connectivity between cysteine pairs. After the bonding probability of each cysteine pair was produced by the SVM models, an implementation of Gabow's algorithm (Gabow, 1973), wmatch (Rothberg, http://elib.zib.de/pub/Packages/mathprog/matching/weighted/), was used to find the maximum weight matching. Finally, the matching with maximum weight was transformed to the corresponding disulfide connectivity pattern.

2.2 Evaluation criteria
Our models were evaluated by $Q_p$ and $Q_c$, which are defined as follows:

$$Q_p = \frac{C_p}{T_p}, \qquad Q_c = \frac{C_c}{T_c}, \tag{4}$$

where $C_p$ is the number of proteins whose connectivity patterns are correctly predicted; $T_p$ is the total number of proteins in the test set; $C_c$ is the number of disulfide bridges correctly predicted and $T_c$ is the total number of disulfide bridges in the test proteins.

3 IMPLEMENTATION AND RESULTS
3.1 Dataset
In order to compare our method with the approaches reported previously (Vullo and Frasconi, 2004; Baldi et al., 2005), the same dataset extracted from SP39 (Bairoch and Apweiler, 2002) was employed. The same filtering procedure (Fariselli and Casadio, 2001) was applied to ensure that only high-quality and experimentally verified intra-chain disulfide bridge annotations were included. For cross-validation, this dataset was further divided into four subsets so that sequence homology between any two subsets did not exceed 30%.

3.2 Cross-validation of SP39
Table 1 lists the accuracies of 4-fold cross-validation performed with the dataset SP39 for our model along with the results reported previously. Using sequence profiles only, our SVM models obtained a $Q_p$ of 59%, which is better than those obtained in previous works. This may benefit from the generality of SVM, which avoids over-fitting during the training process. Another reason for the improvement is the enlarging of the window size when extracting sequence profiles. We tried different window sizes to build SVM models, and the accuracy of the predictions is shown in Figure 1. The overall $Q_p$ increases with enlarging window size and peaks at 13, which was adopted in this work. Using the same window size of 5 as used by Vullo and Frasconi (2004) and Baldi et al. (2005), a similar accuracy of 52% was also obtained using our method.
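The matching step of Section 2.1.3 and the evaluation measures of Section 2.2 can be illustrated with a short sketch. The authors used wmatch, an implementation of Gabow's algorithm; the code below instead enumerates all perfect matchings by brute force, which gives the same maximum-weight answer for the small numbers of cysteines considered here (945 candidate patterns for five bridges) but would not scale to large graphs. The function names and the pair probabilities are hypothetical, not SVM outputs from the paper.

```python
def perfect_matchings(items):
    """Yield every way of pairing up an even-length list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail


def best_connectivity(prob):
    """Return the pairing with maximum total weight.

    prob: dict mapping (i, j) with i < j to a bonding probability; it must
    contain every pair of cysteines (complete graph).
    """
    cysteines = sorted({c for pair in prob for c in pair})
    if len(cysteines) % 2:
        raise ValueError("an even number of oxidized cysteines is required")
    return max(perfect_matchings(cysteines),
               key=lambda matching: sum(prob[pair] for pair in matching))


def q_scores(predicted, actual):
    """Compute (Q_p, Q_c) as fractions: correct patterns and correct bridges."""
    correct_proteins = correct_bridges = total_bridges = 0
    for pred, true in zip(predicted, actual):
        pred_set, true_set = set(pred), set(true)
        correct_proteins += pred_set == true_set
        correct_bridges += len(pred_set & true_set)
        total_bridges += len(true_set)
    return correct_proteins / len(actual), correct_bridges / total_bridges


# Hypothetical bonding probabilities for a protein with four oxidized cysteines.
prob = {(1, 2): 0.2, (1, 3): 0.9, (1, 4): 0.1,
        (2, 3): 0.3, (2, 4): 0.8, (3, 4): 0.4}
pattern = best_connectivity(prob)

# Two hypothetical test proteins: one predicted perfectly, one with 1 of 3 bridges right.
predicted = [pattern, [(1, 2), (3, 4), (5, 6)]]
actual = [[(1, 3), (2, 4)], [(1, 2), (3, 5), (4, 6)]]
q_p, q_c = q_scores(predicted, actual)
```

The example also shows why $Q_c$ is never lower than $Q_p$ in spirit: a protein whose pattern is only partly right still contributes its correct bridges to $Q_c$ while counting as a failure for $Q_p$.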
Moreover, when DOC was used, the prediction accuracy was further improved. To explore the effects of scaling schemes on DOC, three scaling functions were considered: DOC_L, DOC_max and DOC_log. The trend of DOC between cysteine bonding pairs in dataset SP39 is shown in Figure 2a, and the distributions of DOC_L, DOC_max and DOC_log are shown in Figure 2b–d, respectively. As can be seen, DOC_max retains the distribution of DOC, since the scaling is simply performed by dividing the distance by a fixed value. On the other hand, the originally skewed distribution of DOC becomes close to a normal distribution after the logarithm function is applied, and the distribution of DOC_L becomes blurred due to the variation of sequence lengths. Prediction accuracies of 59% and 61% were obtained with the scaling functions DOC_L and DOC_max, respectively. The highest prediction accuracy of 63% was obtained with the scaling function DOC_log, which was selected to build our SVM models for disulfide connectivity prediction. These results suggest that the scaling of DOC can affect its contribution to our models. With a proper scaling function, DOC can enhance the performance of SVM models.

3.3 PreCys (prediction of cys–cys linkages in proteins) web server
The PreCys server (at .tw:5433/Disulfide/) provides the service of disulfide connectivity prediction by the method developed in this work. In addition, a simple CSP search can also be accessed on the website. This server provides two SVM models built from Swiss-Prot releases 39 and 47. Given the sequence and, optionally, the positions of oxidized cysteines as input, the server generates the bonding probabilities of cysteine pairs and the final connectivity pattern. Additional experimental results and the chain lists used can be found at this website.

4 DISCUSSION AND CONCLUSION
There are two major categories for the methods of disulfide connectivity prediction. The 'patternwise' approaches take the whole protein as a unit
directly and rank alternative connectivity patterns (Vullo and Frasconi, 2004). They can easily include global information, such as the sequence length, amino acid contents or the positions of all cysteines. On the other hand, the 'pairwise' methods (Baldi et al., 2005; Ferrè and Clote, 2005) lack the overview of the whole protein and are usually limited to the scope of local environments of cysteines. However, the patternwise methods often suffer from the problem of insufficient data, especially when the number of disulfide bonds increases.

Fig. 1. The accuracy (Q_p) of predictions using different window sizes to extract sequence profiles on the dataset SP39.

Fig. 2. Histogram of the fraction of chains versus (a) the original distribution of DOC without normalization, (b) DOC_L, (c) DOC_max and (d) DOC_log in the dataset SP39.

Table 1. Results of cross-validation on the data extracted from SP39

Methods                       B = 2           B = 3           B = 4           B = 5          B = 2–5
                          Q_p(%)  Q_c(%)  Q_p(%)  Q_c(%)  Q_p(%)  Q_c(%)  Q_p(%)  Q_c(%)  Q_p(%)  Q_c(%)
MC graph-matching (a)       56      56      21      36      17      37       2      21      29      38
NN graph-matching (b)       68      68      22      37      20      37       2      26      34      42
BiRnn-2 profile (c)         73      73      41      51      24      37      13      30      44      49
2D-Rnn profile (d)          74      74      51      61      27      44      11      41      49      56
dNN2 (e)                    62       -      40       -      55       -      26       -      49       -
CSP                         72      72      54      66      33      50      18      36      52      58
SVM profile                 76      76      53      62      48      62      44      60      59      65
SVM profile + DOC_log       79      79      53      62      55      70      58      71      63      70

(a) Fariselli and Casadio (2001). (b) Fariselli et al. (2002). (c) Vullo and Frasconi (2004). (d) Baldi et al. (2005). (e) Ferrè and Clote (2005), only results of Q_p are available.

For proteins with five disulfide bonds, there are some patterns that only have one instance in the dataset. These patterns are not likely to be predicted correctly by patternwise methods because there is not enough information for model training. For example, the connectivity patterns of the protein chains CTRA_BOVIN (PDB: 1HJA, pattern: [1–4, 2–3, 5–9, 6–7, 8–10], Fig. 3) and UROK_HUMAN (PDB: 1LMW, pattern: [1–3, 2–4, 5–9, 6–7, 8–10]) only appear once in the dataset SP39. The patternwise method CSP fails to predict the disulfide connectivity of these chains, because no template is
available for the patterns to be predicted. On the other hand, our pairwise SVM models can still predict their connectivity correctly, since the pattern can be assembled from the bonding pairs predicted.

In addition, the imbalance between the positive and negative data differs for pairwise and patternwise methods. For a protein with B disulfide bonds, the positive/negative ratio is 1:(2B − 2) for pairwise encoding. However, for the patternwise encoding, the imbalance is more severe, since there is only one correct pattern among the (2B − 1)!! generated entries. Taking B = 5 as an example, the positive/negative ratio is only 1:8 in pairwise encoding. With the same bond number B in patternwise encoding, there are 945 entries, where the positive/negative ratio is 1:944. Such severe imbalance can bias the learning process and result in poor models. Due to the insufficiency of data and the severe imbalance issue of patternwise encoding, we adopted the pairwise approach in our method.

In this paper, we developed a method to predict disulfide connectivity based on SVMs. The non-local descriptor DOC describing the distance between oxidized cysteines was proposed to encode additional information for our input. For the dataset SP39, the prediction accuracy can be improved significantly with the combination of local sequence profiles and the non-local descriptor DOC. The significant improvement in prediction accuracy over previous approaches has the following reasons. First, SVMs can avoid over-fitting problems commonly seen in neural networks and other machine learning methods. Second, we explored the local environments of oxidized cysteines and found the optimum window size with best Q_p values. Third, the non-local descriptor DOC_log also contributes to the prediction accuracies. Our method achieved an accuracy of 63% in dataset SP39 when DOC was used, which outperforms other previous approaches. Consistent improvements were also obtained on other datasets; detailed results can be found
in the Supplementary data. These results imply that the formation of disulfide linkages between cysteines is determined not only by the local information of cysteines but also by the relationships between them. The descriptor DOC contains important information about the relationships between oxidized cysteines and is an effective feature for predicting disulfide connectivity accurately. This descriptor can be additionally applied to other problems where the knowledge of disulfide bridges is required. The web interface of our program is provided on the PreCys website. The results from our method may be useful for advanced studies in protein structure prediction, protein structure modeling and protein engineering.

ACKNOWLEDGEMENTS
We would like to thank Jianlin Cheng for generously sharing datasets and useful comments and Shih-Chieh Chen for enlightening discussion. Funding to pay the Open Access publication charges for this article was provided by the Institute for Information Industry.
Conflict of Interest: none declared.

REFERENCES
Altschul, S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bairoch, A. and Apweiler, R. (2000) The Swiss-Prot protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
Baldi, P., Cheng, J. and Vullo, A. (2005) Large-scale prediction of disulphide bond connectivity. In Saul, L.K., Weiss, Y. and Bottou, L. (eds), Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, pp. 97–104.
Chang, C.-C. and Lin, C.-J. (2000) LIBSVM: introduction and benchmarks. Technical Report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.
Fariselli, P. and Casadio, R. (2001) Prediction of disulfide connectivity in proteins. Bioinformatics, 17, 957–964.
Fariselli, P., Riccobelli, P. and Casadio, R. (2002) A neural network based method for predicting the disulfide connectivity in proteins. In Damiani, E., Jain, L.C., Howlett, R.J. and
Ichalkaranje, N. (eds), Knowledge based intelligent information engineering systems and allied technologies (KES 2002). IOS Press, Amsterdam, 1, pp. 464–468.
Ferrè, F. and Clote, P. (2005) Disulfide connectivity prediction using secondary structure information and diresidue frequencies. Bioinformatics, 21, 2336–2346.
Gabow, H.N. (1973) Implementation of algorithms for maximum matching on non-bipartite graphs. PhD Thesis, Stanford University, CA.
Harrison, P.M. and Sternberg, M.J.E. (1994) Analysis and classification of disulphide connectivity in proteins. J. Mol. Biol., 244, 448–463.
Huang, E.S. et al. (1999) Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J. Mol. Biol., 290, 267–281.
Platt, J. (2000) Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Smola, A.J., Bartlett, P.L., Schölkopf, B. and Schuurmans, D. (eds), Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, pp. 61–74.
Rothberg, E. (1985) wmatch: a C Program to solve maximum weight matching.
Vapnik, V. (1998) Statistical Learning Theory. Wiley, New York, NY.
Vullo, A. and Frasconi, P. (2004) Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics, 20, 653–659.
Wedemeyer, W.J. et al. (2000) Disulfide bonds and protein folding. Biochemistry, 39, 4207–4216.
Zhao, E. et al. (2005) Cysteine separations profiles on protein sequences infer disulfide connectivity. Bioinformatics, 21, 1415–1420.

Fig. 3. (a) The structure and the connectivity pattern of disulfide bridges and (b) the bonding potential P(i, j) for each cysteine pair cys(i, j) generated by SVM model for chymotrypsinogen A (PDB id 1HJA). Selected bonding pairs are boxed.
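To make the pairwise pipeline and the imbalance arithmetic of Section 4 concrete, the following sketch enumerates every perfect matching of 2B cysteines, picks the one with the maximum summed bonding probability, and recomputes the counts quoted for B = 5. Exhaustive enumeration is used here only for clarity and stands in for Gabow's algorithm and wmatch, which the paper actually uses; the probabilities and all function names are made up for illustration.

```python
from itertools import combinations


def perfect_matchings(items):
    """Yield every way to split `items` (even length) into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail


def best_pattern(prob):
    """Return the connectivity pattern with maximum summed pair probability.

    `prob` maps a pair (i, j) with i < j to a bonding probability.
    Brute force: only practical for small numbers of cysteines.
    """
    cysteines = sorted({c for pair in prob for c in pair})
    return max(perfect_matchings(cysteines),
               key=lambda m: sum(prob[p] for p in m))


def imbalance(bonds):
    """Return (negative pairs per positive pair, number of candidate patterns)."""
    n_pairs = len(list(combinations(range(2 * bonds), 2)))
    n_patterns = sum(1 for _ in perfect_matchings(list(range(2 * bonds))))
    return (n_pairs - bonds) // bonds, n_patterns


# Hypothetical probabilities for a chain with two disulfide bonds (B = 2).
toy_prob = {(1, 2): 0.10, (1, 3): 0.80, (1, 4): 0.20,
            (2, 3): 0.15, (2, 4): 0.90, (3, 4): 0.05}
print(best_pattern(toy_prob))  # [(1, 3), (2, 4)]
print(imbalance(5))            # (8, 945): 1:8 pairwise, 945 candidate patterns
```

For B = 5 there are 45 cysteine pairs, 5 of them bonded, which gives the 1:8 ratio, while the 945 candidate patterns correspond to (2B − 1)!! = 9 × 7 × 5 × 3 × 1.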

haplotypes


BIOINFORMATICS APPLICATIONS NOTE
Vol. 21 no. 8 2005, pages 1730–1732
doi:10.1093/bioinformatics/bth488

Genetics and population analysis

HaploPainter: a tool for drawing pedigrees with complex haplotypes
Holger Thiele 1,3,∗ and Peter Nürnberg 1,2,3
1 Gene Mapping Center, Max Delbrueck Center (MDC) for Molecular Medicine, Berlin-Buch, Germany, 2 Institute of Medical Genetics, Charité - University Medicine of Berlin, Berlin, Germany and 3 Cologne Center for Genomics (CCG), University of Cologne, Germany

Received on July 6, 2004; revised and accepted on August 16, 2004
Advance Access publication September 17, 2004

ABSTRACT
Summary: HaploPainter is a user-friendly pedigree-drawing application with special features for easy visualization of complex haplotype information. It has been developed to facilitate gene mapping in Mendelian diseases in terms of fast and reliable definition of the smallest critical interval harbouring the underlying gene defect. HaploPainter is written in Perl and may be used for visualization of haplotypes calculated by any of the common linkage programs.
With special features like haplotype compression or the ability of marker section cut-out, it particularly addresses the requirements for viewing large haplotypes, as obtained when high-density marker panels of many thousands of single nucleotide polymorphisms (SNPs) are used for genome scans.
Availability: /
Contact: holger.thiele@uni-koeln.de
∗To whom correspondence should be addressed.

Construction and visualization of haplotypes is an essential step in gene mapping projects. It is often instrumental in choosing the right region for further follow-up. Moreover, it defines precisely the size of the critical interval by pinpointing the location of the recombination breakpoints in the families under investigation. While a graphical presentation of haplotypes is very useful for interpretation and publication of genotyping data, there is only inadequate software support available. Existing solutions as implemented in Cyrillic (Chapman, 1990), Pedigree/Draw (Mamelka et al., 1990), CoPE (Brun-Samarcq et al., 1999) or Pelican (Dudbridge et al., 2004) either suffer from lack of user-friendliness, limited data compatibility, or insufficient drawing alternatives for haplotypes.

Recently, we performed several genome scans with the Affymetrix GeneChip® Human Mapping 10K SNP (single nucleotide polymorphism) array (Janecke et al., 2004; Kaindl et al., 2004; Uhlenberg et al., 2004). The new approach, using more than 10 000 SNPs instead of about 400 microsatellites as DNA markers in genome-wide linkage studies, clearly revealed the shortcomings of the existing solutions for haplotyping. Dealing with very long haplotypes composed of hundreds of SNPs was impossible with any existing software and turned out to be a time-consuming bottleneck in our data analysis. This prompted us to develop HaploPainter as a user-friendly tool for the handling of haplotype information in extended pedigrees. In particular, we aimed at a clearly arranged presentation of large marker blocks and have therefore implemented features like selective definition and
narrowing of marker regions or haplotype compression at a chosen length.

The program is written in Perl/Tk, running under Windows and Linux. The algorithms are all guided by practical considerations. The majority of families should be drawable without extensive computational time. Every family is represented in a single multidimensional array data structure which is built up following a top-to-bottom strategy. In case of loops the order of sibs is randomly chosen, keeping loop-starting and connecting family members close to each other. Further optimization is performed by minimization of line crossings. From this point different pedigree drawing solutions may be found. Although the majority of simple and moderately complex pedigrees were drawn correctly, there are limitations, for example, when a person occurs in different generations, that is, in the typical 'backcross' situation of animal breeds. This can be avoided by allowing for person duplications. We will implement this feature into future versions of HaploPainter.

HaploPainter accepts haplotype outputs from Simwalk (Weeks et al., 1995), Allegro (Gudbjartsson et al., 2000), Genehunter (Kruglyak et al., 1996) and Merlin (Abecasis et al., 2002). Any supplementary information provided by programs like Simwalk is discarded at this stage. Points of recombination are recognized using HaploPainter's own algorithm. Starting at the p-telomere, the program identifies the first informative marker revealing that linkage phase has changed. Pedigree data are imported either in a pre- or post-makeped format as provided by standard linkage files. Map information can be easily added in a file format used by Mega2 [Mukhopadhyay et al. (2001) in /Menu/Help/mega2/]. Up to three lines of case information may be attached to each person using an Excel sheet-like table format. After drawing of the pedigree, the following modifications are possible. Symbols can be moved by drag-and-drop and haplotype phases can be switched by double clicking on uninformative marker
alleles. Many different drawing styles are selectable from the configuration menu. The output from HaploPainter can be similar to what is shown in Figure 1. The graphic is directly printable or can be exported as a postscript file.

Fig. 1. Using the Affymetrix GeneChip® Human Mapping 10K SNP array, more than 10 000 SNPs were genotyped in 13 family members of a consanguineous pedigree. Data conversion, handling and haplotype calculation were performed with the programs ALOHOMORA (Ruschendorf and Nürnberg, 2005) and Genehunter-Imprinting (Strauch et al., 2000). The left window demonstrates HaploPainter's ability to draw even very long haplotypes, composed of 100 SNPs in this case. Cutting out most of the homozygous region between the critical recombination events results in a clearly arranged graphic with all necessary information as shown in the right window. Filled symbols represent affected individuals and open symbols non-affected ones. Persons with an unknown affected status are shown in grey with a question mark inside. A diagonal slash indicates that the person is deceased. Marker names and positions are displayed on the left side of each generation. The crucial recombinations occurred in individuals 302 and 303. Their disease haplotype fragments are boxed.

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@

In conclusion, there is a growing need for the support of haplotype drawing in pedigrees. Many programs suitable for haplotype calculation like Genehunter, Merlin, Simwalk or Allegro are insufficient in proper visual presentation. We offer a platform-independent, Perl-based solution, HaploPainter, as an open source software for the scientific community to fill this gap. Features like haplotype compression and the ability of marker section cut-out are particularly helpful for viewing large SNP-derived haplotypes. The software is equipped with an intuitive graphical interface
and is powerful enough to draw even complex consanguineous pedigrees. It is addressed to geneticists and physicians handling human pedigrees with the necessity of graphical haplotype representation.

ACKNOWLEDGEMENTS
We would like to thank our colleagues and guests of the Gene Mapping Center at the MDC for advice and software testing. This work was funded by the Federal Ministry of Science and Education of Germany through the National Genome Research Network.

REFERENCES
Abecasis, G.R., Cherny, S.S., Cookson, W.O. and Cardon, L.R. (2002) Merlin - rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet., 30, 97–101.
Brun-Samarcq, L., Gallina, S., Philippi, A., Demenais, F., Vaysseix, G. and Barillot, E. (1999) CoPE: a collaborative pedigree drawing environment. Bioinformatics, 15, 345–346.
Chapman, C.J. (1990) A visual interface to computer programs for linkage analysis. Am. J. Med. Genet., 36, 155–160.
Dudbridge, F., Carver, T. and Williams, G.W. (2004) Pelican: pedigree editor for linkage computer analysis. Bioinformatics, 20, 2327–2328.
Gudbjartsson, D.F., Jonasson, K., Frigge, M.L. and Kong, A. (2000) Allegro, a new computer program for multipoint linkage analysis. Nat. Genet., 25, 12–13.
Janecke, A.R., Thompson, D.A., Utermann, G., Becker, C., Hübner, C.A., Schmid, E., McHenry, C.L., Nair, A.R., Rüschendorf, F., Heckenlively, J. et al. (2004) Mutations in RDH12 encoding a photoreceptor cell retinol dehydrogenase cause childhood-onset severe retinal dystrophy. Nat. Genet., 36, 850–854.
Kaindl, A.M., Rüschendorf, F., Krause, S., Goebel, H.H., Koehler, K., Becker, C., Pongratz, D., Müller-Höcker, J., Nürnberg, P., Stoltenburg-Didinger, G., Huebner, A. (2004) Missense mutations of ACTA1 cause dominant congenital myopathy with cores. J. Med. Genet., 41, 842–848.
Kruglyak, L., Daly, M.J., Reeve-Daly, M.P. and Lander, E.S. (1996) Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet., 58, 1347–1363.
Mamelka, P.M., Dyke, B. and MacCluer, J.W. (1990) Pedigree/Draw for the Apple Macintosh, Version 4.0. PGL Technical Report No. 1, Pop. Genetic Laboratory, Department of Genetics, South-west Foundation for Biomedical Research, San Antonio, Texas 78284, USA.
Ruschendorf, F. and Nürnberg, P. (2005) ALOHOMORA: A tool for linkage analysis using 10K SNP array data. Bioinformatics, Jan 12. [Epub ahead of print] doi:10.1093/bioinformatics/bti264
Strauch, K., Fimmers, R., Kurz, T., Deichmann, K.A., Wienker, T.F. and Baur, M.P. (2000) Parametric and nonparametric multipoint linkage analysis with imprinting and two-locus-trait models: application to mite sensitization. Am. J. Hum. Genet., 66, 1945–1957.
Uhlenberg, B., Schuelke, M., Rüschendorf, F., Ruf, N., Kaindl, A.M., Henneke, M., Thiele, H., Stoltenburg-Didinger, G., Aksu, F., Topaloğlu, H. et al. (2004) Mutations in the gene encoding gap junction protein α12 (connexin 46.6) cause Pelizaeus-Merzbacher-like disease. Am. J. Hum. Genet., 75, 251–260.
Weeks, D.E., Sobel, E., O'Connell, J.R. and Lange, K. (1995) Computer programs for multilocus haplotyping of general pedigrees. Am. J. Hum. Genet., 56, 1506–1507.
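The recombination scan described in the text (walk the markers from the p-telomere and report the first informative marker at which the linkage phase has changed) can be sketched as follows. This is not HaploPainter's own algorithm, whose details the note does not give, and it is written in Python rather than Perl; the function name, the list-based haplotype representation and the handling of missing genotypes are all assumptions.

```python
def find_recombinations(parent_hap_a, parent_hap_b, child_hap):
    """Return indices of markers at which the transmitted phase switches.

    All three lists hold alleles ordered from the p-telomere. A marker is
    informative when the parent's two alleles differ; other markers are skipped.
    """
    breakpoints = []
    phase = None  # 'a' or 'b' once the first informative marker is seen
    for idx, (a, b, c) in enumerate(zip(parent_hap_a, parent_hap_b, child_hap)):
        if a == b:
            continue  # parent is homozygous here: cannot reveal the phase
        if c == a:
            current = "a"
        elif c == b:
            current = "b"
        else:
            continue  # missing or inconsistent genotype: ignore this marker
        if phase is not None and current != phase:
            breakpoints.append(idx)  # first informative marker after a switch
        phase = current
    return breakpoints


# Hypothetical six-marker example with one crossover between markers 2 and 4.
hap_a = [1, 1, 2, 1, 2, 2]
hap_b = [2, 1, 1, 1, 1, 2]
child = [1, 1, 2, 1, 1, 2]
print(find_recombinations(hap_a, hap_b, child))  # [4]
```

Note that the reported index is the first informative marker after the switch; the true breakpoint lies somewhere between it and the previous informative marker (markers 2 and 4 in the example), which is why uninformative alleles in between can be reassigned to either phase.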


Vol. 22 no. 4 2006, pages 507–508
doi:10.1093/bioinformatics/btk005
BIOINFORMATICS APPLICATIONS NOTE

Gene expression

EDGE: extraction and analysis of differential gene expression
Jeffrey T. Leek∗, Eva Monsen, Alan R. Dabney and John D. Storey
Department of Biostatistics, University of Washington, Seattle 98195, USA

Received on October 18, 2005; revised on December 10, 2005; accepted on December 11, 2005
Advance Access publication December 15, 2005
Associate Editor: John Quackenbush

ABSTRACT
Summary: EDGE (Extraction of Differential Gene Expression) is an open source, point-and-click software program for the significance analysis of DNA microarray experiments. EDGE can perform both standard and time course differential expression analysis. The functions are based on newly developed statistical theory and methods. This document introduces the EDGE software package.
Availability: EDGE is freely available for non-commercial users. EDGE can be downloaded for Windows, Macintosh and Linux/UNIX from /jstorey/edge
Contact: jtleek@

1 INTRODUCTION
DNA microarrays have become a standard tool used in identifying and characterizing gene expression variation across differing biological conditions. A variety of software packages are available for the significance analysis of microarray experiments. Many of these packages are closed source, difficult to use or available for only one operating system. Most are unable to analyze data from time course microarray experiments. EDGE is a user-friendly software package that includes functions for missing data imputation, data transformation and visualization, eigengene/eigenarray analysis, hierarchical clustering, differential expression analysis (static and time course) and automatic internet-based NCBI queries of user-chosen genes. EDGE can be used to analyze microarray data across all platforms, although interpretation of the results may depend on the experimental design. The EDGE interface is multithreaded, and reports real-time updates for the time remaining in lengthy calculations. Many of these calculations
are performed through C++ extensions for R that dramatically reduce computation time. Differential expression analyses in EDGE are based on newly developed statistical methodology, including the Optimal Discovery Procedure for static differential expression (Storey, 2005, http:// /uwbiostat/paper259). EDGE is open source and is available for Windows, Macintosh and Linux/UNIX operating systems.

2 EDGE
EDGE runs on top of the statistical software package R (R Development Core Team, 2005). Detailed downloading and installation instructions are available from the EDGE website. At the beginning of each EDGE session, the main menu should appear as in Figure 1. The first step in an EDGE analysis is to load the pre-normalized expression data and covariate files using the Load/Save Expression Data and Covariates menu. (The covariate file contains information about the experimental design, such as the biological group from which each array comes.) If the expression matrix has missing values, they can be imputed using the KNN imputation algorithm from the Impute Missing Data menu (Troyanskaya et al., 2001). After loading expression data and covariate information, the covariates can be checked for accuracy using the View Covariates menu. It is also possible to center, scale or log transform the expression values using the Transform Data menu.

Fig. 1. The main menu of EDGE.

∗To whom correspondence should be addressed.
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@

Several tools for visual exploratory analysis are included in the EDGE interface. Boxplots and eigengenes (Alter et al., 2000) can be displayed for each array, or stratified by a covariate using the Display Boxplots option and Display Eigengenes and Eigenarrays options, respectively. EDGE also allows the user to plot clusters of genes with similar expression patterns (Eisen et al., 1998) from the Display Hierarchical Clustering menu. Clustering can be performed on the entire
set of genes, or only on the significant genes from a differential expression analysis. A variety of plotting options are available for visualizing the clusters.

The Identify Differentially Expressed Genes menu allows users to set options for performing both static and time course differential expression analyses. For a static analysis, the user should select a class variable indicating the biological group assignment, or the option None (within class differential expression) to identify differentially expressed genes in a single biological sample. In the static setting, significance calculations are based on the Optimal Discovery Procedure (Storey, 2005), which estimates the optimal rule for identifying differentially expressed genes (Storey et al., 2005a, /uwbiostat/paper260).

For time course data, the user can perform either a 'between class' analysis by selecting a variable distinguishing biological groups, or a 'within class' analysis by selecting None (within class differential expression) for the class variable. A 'between class' analysis assesses the evidence for a difference in expression over time between two or more biological groups, while a 'within class' analysis looks for any differential expression over time within a single group. The user must specify a covariate for the time points, and if necessary, should also specify a covariate corresponding to which individuals were sampled. EDGE implements statistical methodology specifically designed for time course experiments (Storey et al., 2005b).

For either type of analysis, the user should specify the number of permutations to be used in the significance calculations and, in some cases, set a seed for reproducible results. For time course analyses, the user can also specify the type of spline used in fitting the longitudinal model, the dimension of the basis for the spline model and whether to include the baseline expression level in the time course analysis. If the baseline level is included, EDGE will not only identify genes showing
different patterns of expression over time, but will also identify genes with different baseline levels of expression.

Once the appropriate options have been selected and the user clicks GO, the expression analysis is performed and the Differential Expression Results menu is displayed. A significance measure is assigned to each gene via the Q-value methodology (Storey and Tibshirani, 2003). The user can select a Q- or P-value cutoff to display the genes that meet that significance threshold. For advanced users, optional Q-value arguments can also be adjusted. The user can plot a histogram of the P-values from all significance tests, create a Q-plot, or cluster significant genes based on similarities in their expression profiles. If the EDGE session is being performed on a computer with internet access, the user can select a significant gene in the results window, and access NCBI information for that gene name. Results of differential expression analyses can be saved for further analysis or reporting.

3 RESULTS
Figure 2 shows the results of a differential expression analysis on a subset of 3170 genes on 15 arrays from the Hedenfalk et al. (2001) study. The analysis compared expression levels for BRCA1 and BRCA2 tumors. EDGE shows substantial improvements over five leading methodologies.

ACKNOWLEDGEMENTS
This software development was supported in part by NIH grant R01 HG002913-01.
Conflict of Interest: none declared.

REFERENCES
Alter, O. et al. (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci., 97, 10101–10106.
Cui, X. et al. (2005) Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics, 6, 59–75.
Dudoit, S. et al. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, 77–87.
Efron, B. et al. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc., 96, 1151–1160.
Eisen, M.B. et al. (1998) Cluster analysis and
display of genome-wide expression patterns. Proc. Natl Acad. Sci., 95, 14863–14868.
Hedenfalk, I. et al. (2001) Gene-expression profiles in hereditary breast cancer. N. Engl. J. Med., 344, 539–548.
Lonnstedt, I. and Speed, T. (2002) Replicated microarray data. Stat. Sinica, 12, 31–46.
R Development Core Team (2005) R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria.
Storey, J.D. (2005) The optimal discovery procedure: a new approach to simultaneous significance testing. UW Biostatistics Working Paper Series. Working Paper 259.
Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genome-wide studies. Proc. Natl Acad. Sci., 100, 9440–9445.
Storey, J.D., Dai, J.Y. and Leek, J.T. (2005a) The Optimal Discovery Procedure for Large-Scale Significance Testing, with Applications to Comparative Microarray Experiments. UW Biostatistics Working Paper Series. Working Paper 260.
Storey, J.D. et al. (2005b) Significance analysis of time course microarray experiments. Proc. Natl Acad. Sci., 102, 12837–12842.
Troyanskaya, O. et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525.

Fig. 2. A comparison between EDGE and five leading procedures for identifying differentially expressed genes applied to the Hedenfalk et al. (2001) study. For each Q-value (false discovery rate) cutoff, the number of genes found to be significant is plotted for each procedure. See Storey et al. (2005a) for comparisons based on a 3-sample analysis, where improvements are even greater.
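The Q-value step mentioned in Section 2 can be illustrated with a short sketch that converts a list of P-values into Q-values. It is a simplified reading of the approach of Storey and Tibshirani (2003), not EDGE's implementation: the proportion of true nulls is estimated at a single fixed tuning value instead of being smoothed over a range of values, and the function name, the `lam` parameter and the example P-values are assumptions.

```python
def q_values(p_values, lam=0.5):
    """Estimate q-values from p-values using a fixed-lambda pi0 estimate."""
    m = len(p_values)
    # pi0: estimated proportion of true null hypotheses, capped at 1.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda k: p_values[k])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each q-value is the smallest
    # estimated false discovery rate over all cutoffs that include the gene.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[k] / rank)
        q[k] = running_min
    return q


# Hypothetical P-values for five genes.
example_p = [0.01, 0.02, 0.03, 0.6, 0.9]
example_q = q_values(example_p)
significant = [k for k, qv in enumerate(example_q) if qv <= 0.05]
print(significant)  # genes 0, 1 and 2 pass a Q-value cutoff of 0.05
```

Selecting genes by a Q-value cutoff, as in the results menu described above, then amounts to keeping every gene whose Q-value does not exceed the chosen threshold.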
