Sequence-length requirements for phylogenetic methods
一步一步教你做转录组分析(HISAT--StringTie-and-Ballgown)
一步一步教你做转录组分析(HISAT, StringTie andBallgown)该分析流程主要根据2016年发表在Nature Prot ocols上的一篇名为Transcript-level expressionanalysis of RNA-seq experiments with HISAT,StringTie and Ballgown的文章撰写的,主要用到以下三个软件:HISAT (http://ccb.jhu.edu/software/hisat/index.shtml)利用大量FM索引,以覆盖整个基因组,能够将RNA-Seq的读取与基因组进行快速比对,相较于STAR、Tophat,该软件比对速度快,占用内存少。
StringTie(http://ccb.jhu.edu/software/stringtie/)能够应用流神经网络算法和可选的de novo组装进行转录本组装并预计表达水平。
与Cufflin ks等程序相比,StringTie实现了更完整、更准确的基因重建,并更好地预测了表达水平。
Ballgown(https://github.com/alyssafrazee/ballgown)是R语言中基因差异表达分析的工具,能利用RNA-Seq实验的数据(StringTie, RSEM,Cufflinks)的结果预测基因、转录本的差异表达。
然而Ballgown并没有不能很好地检测差异外显子,而DEXseq、rMATS和MISO可以很好解决该问题。
一、数据下载Linux系统下常用的下载工具是wget,但该工具是单线程下载,当使用它下载较大数据时比较慢,所以选择axel,终端中输入安装命令:$sudo yum install axel然后提示输入密码获得root权限后即可自动安装,安装完成后,输入命令axel,终端会显示如下内容,表示安装成功。
Axel工具常用参数有:axel[选项][下载目录][下载地址]-s:指定每秒下载最大比特数-n:指定同时打开的线程数-o:指定本地输出文件-S:搜索镜像并从Xservers服务器下载-N:不使用代理服务器-v:打印更多状态信息-a:打印进度信息-h:该版本命令帮助-V:查看版本信息号#Axel安装成功后在终端中输入命令:$axel ftp://ftb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz此时在终端中会显示如下图信息,如果不想该信息刷屏,添加参数q,采用静默模式即可。
biopython的使用 -回复
biopython的使用-回复Biopython的使用指南Biopython是Python编程语言中一个非常强大的生物信息学库,它提供了丰富的工具和函数,用于处理DNA、RNA、蛋白质和其他生物数据。
本文将以Biopython的使用为主题,逐步回答关于该库的使用。
第一步:安装和导入Biopython要使用Biopython,首先需要在计算机上安装它。
可以通过pip命令来安装Biopython,运行以下命令:pip install biopython安装完成后,就可以在Python代码中导入Biopython了:pythonimport Bio第二步:读取和处理生物数据Biopython提供了一些函数和类,用于读取和处理生物数据。
例如,可以使用SeqIO模块中的read函数来读取FASTA和GenBank格式的文件。
pythonfrom Bio import SeqIOrecord = SeqIO.read("sequence.fasta", "fasta")上面的代码将读取名为“sequence.fasta”的FASTA格式文件,并将结果存储在变量“record”中。
SeqIO.read函数还接受其他格式的文件,如GenBank。
第三步:处理DNA和蛋白质序列Biopython允许对DNA和蛋白质序列进行各种操作。
例如,可以使用seq 对象的方法来计算序列长度、转录和翻译。
pythonsequence = record.seqsequence_length = len(sequence)mRNA = sequence.transcribe()protein = mRNA.translate()上面的代码首先从record对象获取序列,然后使用len函数计算序列的长度。
接下来,通过调用transcribe方法将DNA序列转录为mRNA序列,并使用translate方法将mRNA序列翻译为蛋白质序列。
生物信息学中的序列比对算法使用方法解析
生物信息学中的序列比对算法使用方法解析序列比对在生物信息学中是一项重要的技术,用于寻找DNA、RNA或蛋白质序列之间的相似性和差异性。
它是理解生物学结构和功能的基石之一。
在本文中,我们将解析生物信息学中常用的序列比对算法的使用方法。
序列比对算法主要分为全局比对和局部比对。
全局比对用于比较完整的序列,而局部比对则更适用于在序列中查找相似区域。
在这两个主要类别中,有几种经典的序列比对算法,包括Pairwise Sequence Alignment、BLAST、Smith-Waterman算法和Needleman-Wunsch算法等。
首先,我们来看Pairwise Sequence Alignment(两两序列比对)算法。
这个算法是基本的序列比对方法,通过比较两个序列中的每一个碱基、氨基酸或核苷酸,并根据其相似性和差异性对它们进行排列。
Pairwise Sequence Alignment算法使用动态规划的思想,通过计算匹配、替代和插入/删除的分数,来确定两个序列的最佳匹配方案。
在生物信息学中,常用的实现包括Needleman-Wunsch算法和Smith-Waterman算法。
Needleman-Wunsch算法是一种全局比对算法,用于比较两个序列的整个长度。
它是通过填充一个二维矩阵来计算最佳匹配路径的。
算法的核心思想是,通过评估每个格子的分数,根据路径选择的最佳分数进行全局比对。
这个算法不仅可以计算序列的相似性,还可以计算每个位置的分数,从而获得两个序列的对应二面的对应关系。
Smith-Waterman算法是一种局部比对算法,用于寻找两个序列中的最佳匹配片段(子序列)。
它与Needleman-Wunsch算法的计算思路相同,但不同之处在于允许负分数,这使得算法能够确定具有高分数的局部匹配片段。
通过动态规划计算,Smith-Waterman算法可以寻找到两个序列中的相似片段,并生成比对的结果。
另一种常用的序列比对算法是基本本地搜索工具(BLAST)。
trimAl Phylogenetics Alignment Trimming Tool说明书
trimAl: a tool for automated alignment trimming in large-scale phylogenetics analyses Salvador Capella-Gutiérrez, Jose M. Silla-Martínez and Toni GabaldónTutorialVersion 1.2trimAl tutorialtrimAl is a tool for the automated trimming of Multiple Sequence Alignments. A format inter-conversion tool, called readAl, is included in the package. You can use the program either in the command line or webserver versions. The command line version is faster and has more possibilities,so it is recommended if you are going to use trimAl extensively.The trimAl webserver included in Phylemon 2.0 provides a friendly user interface and the opportunity to perform many different downstream phylogenetic analyses on your trimmed alignment. This document is a short tutorial that will guide you through the different possibilities of the program.Additional information can be obtained from where a more comprehensive documentation is available.If you use trimAl or readAl please cite our paper:trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.Salvador Capella-Gutierrez;Jose M.Silla-Martinez;Toni Gabaldon.Bioinformatics 2009 25: 1972-1973.If you use the online webserver phylemon or phylemon2, please cite also this reference:Phylemon:a suite of web tools for molecular evolution,phylogenetics and phylogenomics.Tárraga J, Medina I, Arbiza L, Huerta-Cepas J, Gabaldón T, Dopazo J, Dopazo H. Nucleic Acids Res. 2007 Jul;35 (Web Server issue):W38-42.1. Program Installation.If you have chosen the trimAl command line version you can download the source code from the Download Section in trimAl's wikipage.For Windows OS users, we have prepared a pre-compiled trimAl version to use in this OS. Once the user has uncompressed the package, the user can find a directory,called trimAl/bin, where trimAl and readAl pre-compiled version can be found.Meanwhile for the OS based on Unix platform, e.g. GNU/Linux or MAC OS X, the user should compile the source code before to use these programs. To compile the source code, you have to change your current directory to trimAl/source and just execute "make".Once you have the trimAl and readAl binaries program, you should check if trimAl is running in appropriate way executing trimal program before starting this tutorial.2. trimAl. Multiple Sequence Alignment dataset.In order to follow this tutorial, we have prepared some examples. These examples have been taken from and you can use the codes from these files to get more information about it in this database.You can find three different directories called Api0000038, Api0000040 and Api0000080 with different files. The directory contains these files:A file .seqs with all the unaligned sequences.A file .tce with the Multiple Sequence Alignment produced by T-Coffee1.A file .msl with the Multiple Sequence Alignment produced by Muscle2.A file .mft with the Multiple Sequence Alignment produced by Mafft3.A file .clw with the Multiple Sequence Alignment produced by Clustalw4.A file .cmp with the different names of the MSAs in the directory. This file would be used by trimAl to get the most consistent MSA among the different alignments.You can use any directory to follow the present tutorial.3. Useful trimAl's features.Among the different trimAl parameters, there are some features that can be useful to interpret your alignment results:-htmlout filename. Use this parameter to have the trimAl output in an html file. In this way you can see the columns/sequences that trimAl maintains in the new alignment in grey color while the columns/sequences that have been deleted from the original alignment are in white color.-colnumbering. This parameter will provide you the relationship between the column numbers in the trimmed and the original alignment.-complementary. This parameter lets the user get the complementary alignment, in other words,when the user uses this parameter trimAl will render the columns/sequences that would be deleted from the original alignment.-w number. The user can change the windows size, by default 1, to take into account the surrounding columns in the trimAl's manual methods. When this parameter is fixed, trimAl take into account number columns to the right and to the left from the current position to compute any value, e.g. gap score, similarity score, etc. If the user wants to change a specific windows size value should use the correspond parameter-gw to change window size applied only a gap score assessments, -sc to change window size applied only to similiraty score calculations or -cw to change window size applied only to consistency part.4. Useful trimAl's/readAl's features.Both programs, trimAl and readAl, share common features related to the MSA conversion. It is possible to change the output format for a given alignment, by default the output format is the same than the input one, you can produce an output in different format with these options: -clustal. Output in CLUSTAL format.-fasta. Output in FASTA format.-nbrf. Output in PIR/NBRF format.-nexus. Output in NEXUS format.-mega. Output in MEGA format.-phylip3.2. Output in Phylip NonInterleaved format.-phylip. Output in Phylip Interleaved format.5. Getting Information from Multiple Sequence Alignment.trimAl computes different scores, such as gap score or similarity score distribution, from a given MSA. In order to obtain this information, we can use different parameters through the command line version.To do this part,we are going to use the MSA called Api0000038.msl.This file is in the Api0000038 directory.$ cd Api0000038$ trimal -in Api0000038.msl -sgt$ trimal -in Api0000038.msl -sgc$ trimal -in Api0000038.msl -sct$ trimal -in Api0000038.msl -scc$ trimal -in Api0000038.msl -sidentYou can redirect the trimAl output to a file. This file can be used in subsequent steps as input of other programs, e.g.gnuplot,,microsoft excel,etc,to do plots of this information.$ trimal -in Api0000038.msl -scc > SimilarityColumnsFor instance, in the lines below you can see how to plot the information generated by trimAl using the GNUPLOT program.$ gnuplotplot 'SimilarityColumns' u 1:2 w lp notitleset yrange [-0.05:1.05]set xrange [-1:1210]set xlabel 'Columns'set ylabel 'Residue Similarity Score'plot 'SimilarityColumns' u 1:2 w lp notitleexitIn this other example you can see the gaps distribution from the alignment. This plot also was generated using GNUPLOT$ trimal -in Api0000038.msl -sgt > gapsDistribution$ gnuplotset xlabel '% Alignment'set ylabel 'Gaps Score'plot 'gapsDistribution' u 7:4 w lp notitleexit6. Using user-defined thresholds.If you do not want to use any of the automated procedures included in trimAl (see sections 7 and 8) you can set your own thresholds to trim your alignment. We will use the parameter -htmlout filename for each example so differences can be visualized. In this example, we will use the Api0000038.msl file from the Api0000038 directory.Firstly, we are going to trim the alignment only using the -gt value which is defined in the [0 - 1] range. In this specific example, those columns that do not achieve a gap score, at least, equal to 0.190, meaning that the fraction of gaps on these columns are smaller than this value, will be deleted from the input alignment.$ trimal -in Api0000038.msl -gt 0.190 -htmlout ex01.htmlYou can see different parts of the alignment in the image below.This figure has been generated from the trimAl's HTML file for the previous example.In this other example, we can see the effect to be more strict with our threshold. An usual consequence of higher stringency is that the trimmed MSA has fewer columns. Be careful so you do not remove too much signal$ trimal -in Api0000038.msl -gt 0.8 -htmlout ex02.htmlTo be on the safe side, you can set a minimal fraction of your alignment to be conserved. In this example,we have reproduced the previous example with the difference that here we required to the program that, at least, conserve the 80% of the columns from the original alignment. This will remove the most gappy 20% of the columns or stop at the gap threshold set.$ trimal -in Api0000038.msl -gt 0.8 -cons 80 -htmlout ex03.htmlSecondly,we are going to introduce other manual threshold-st value.In this case,this threshold,also defined in the[0-1]range,is related to the similarity score.This score measures the similarity value for each column from the alignment using the Mean Distance method, by default we use Blosum62 similarity matrix but you can introduce any other matrix (see the manual). In the example below, we have used a smaller threshold to know its effect over the example.$ trimal -in Api0000038.msl -st 0.003 -htmlout ex04.htmlIn this example, similar to the previous example, we have required to conserve a minimum percentage of the original alignment in a independent way to fixed by the similarity threshold.A given threshold maintains a larger number of columns than the cons threshold, trimAl selects this first one.$ trimal -in Api0000038.msl -st 0.003 -cons 30 -htmlout ex05.htmlThirdly, we are going to see the effect of combining two different thresholds. In this case, trimAl only maintains those columns that achieve or pass both thresholds.$ trimal -in Api0000038.msl -st 0.003 -gt 0.19 -htmlout ex06.htmlFinally, we are going to see the effect of combining two different thresholds with the cons parameter. In this case, if the number of columns that achieve or pass both thresholds is equal or greater than the percentage fixed by cons parameter, trimAl chose these columns. However, if the number of columns that achieve or pass both thresholds is less than the number of columns fixed by cons parameter, trimAl relaxes both to thresholds in order to retrieve those columns that lets to achieve this minimum percentage.$ trimal -in Api0000038.msl -st 0.003 -gt 0.19 -cons 60 -htmlout ex07.html7. Selection of the most consistent alignment.trimAl can select the most consistent alignment when more than one alignment is provided for the same sequences (and in the same order) using the -compareset filename parameter. To do this part, we are going to move to Api0000040 directory, we can find there a file calledApi0000040.cmp listing the alignment paths. Using this file, we execute the instruction below to select the most consistent alignment among the alignment provided$ trimal -compareset Api0000040.cmpAs in previous section, once trimAl has selected the most consistent alignment, we can get information about the alignment selected using the appropriate parameters. For example, we can use the follow instructions to know the consistency value for each column in the alignment or its consistency values distribution$ trimal -compareset Api0000040.cmp -sct$ trimal -compareset Api0000040.cmp -sccAlso, we can trim the selected alignment using a specific threshold related to the consistency value. To do that, we should use the -ct value where the value is a number defined in the [0 - 1] range. This number refers to the average conservation of residue pars in that column with respect to the other alignments.$ trimal -compareset Api0000040.cmp -ct 0.6 -htmlout ex08.htmlOn the same way than the previous section, we can define a minimum percentage of columns that should be conserve in the new alignment. For this purpose, we have to use the cons parameter as we explained before.$ trimal -compareset Api0000040.cmp -ct 0.6 -cons 50 -htmlout ex09.htmlFinally, we can combine different thresholds, in fact, we can use all of them as well as we can define a minimum percentage of columns that should be conserve in the output alignment. In the line below, you can see an example of this situation.$ trimal -compareset Api0000040.cmp -ct 0.6 -cons 50 -gt 0.8 -st 0.01-htmlout ex10.html8. Applying automated methods.One of the most powerful aspects of trimAl is that it provides you with several automated options.This option will automatically select the most appropriate thresholds for your alignment after examining the distribution of various parameters along your alignment. Among the alignment features that trimAl takes into account to compute these optimal cut-off are the gap distribution, the similarity distribution, the identity score, etc.You can find a complete explanation about all of these methods in the trimAl's Publications Section.Here,we provide some examples on how to use these methods.The automated methods, gappyout, strict and strictpus, can be used independently if you are working with one or more than one alignment, in the last case, for the same sequences.In the lines below, you can see how to use the gappyout method in both ways. This method will eliminate the most gappy fraction of the columns from your alignment. For this, we are going to continue using the same directory than the previous section.$ trimal -compareset Api0000040.cmp -gappyout -htmlout ex11.html$ trimal -in Api0000040.mft -gappyout -htmlout ex12.htmlIn this case, we are going to use the same files than in the example before but we have changed the method to trim the alignmnet. Now, we are using strict and strictplus methods. These two methods combine the information on the fraction of gaps in a column and their similarity scores, being strictplus for more stringent than strict method.$ trimal -compareset Api0000040.cmp -strict -htmlout ex13.html$ trimal -in Api0000040.clw -strictplus -htmlout ex14.htmling an heuristic method to decide which is the best automated method for a given MSA.Finally, we implemented an heuristic method to decide which is the best automated method to trim a given alignment. The heuristic method takes into account alignment features such as the number of sequences in the alignment as well as some measures about the identity score among the sequences in the alignment or among the best pairwise sequences in that MSA. According to these characteristics trimAl will decide upon one of the two automated methods (gappyout or strictplus).To illustrate how to use this method, we provide a couple of example using the same directory than the section before. First, we used trimAl to selecte the most consistent alignment and then we trimmed that alignmnet using our heuristic method.$ trimal -compareset Api0000040.cmp -automated1 -htmlout ex15.htmlThen, we trim a single MSA using the previously mentioned method.$ trimal -in Api0000040.msl -automated1 -htmlout ex16.html10. Getting more information.We hope that this short introduction to trimAl's features has been useful to you.We advise you to visit periodically the trimAl's wikipage()where you could get the latest news about the program as well as more information, examples, etc, about trimAl's package. You can also subscribe to the mailing list if you want to be updated in new trimAl developing.11. References.1.T-Coffee: A novel method for fast and accurate multiple sequence alignment.Notredame C, Higgins DG, Heringa J. J Mol Biol. 2000 Sep 8;302(1):205-17.2.MUSCLE:multiple sequence alignment with high accuracy and highthroughput. Edgar RC.Nucleic Acids Res. 2004 Mar 19;32(5):1792-7.3.MAFFT: a novel method for rapid multiple sequence alignment based on fastFourier transform. Katoh K, Misawa K, Kuma K, Miyata T. Nucleic Acids Res. 2002 Jul 15;30(14):3059-66.4.CLUSTAL W:improving the sensitivity of progressive multiple sequencealignment through sequence weighting,position-specific gap penalties and weight matrix choice. Thompson JD, Higgins DG, Gibson TJ. Nucleic Acids Res. 1994 Nov 11;22(22):4673-80.。
印度碗状红菇——一个中国新纪录种(英文)
热带作物学报2021, 42(9): 2542 2548 Chinese Journal of Tropical Crops收稿日期 2021-02-23;修回日期 2021-03-20基金项目 国家自然科学基金项目(No. 31770657,No. 31570544,No. 31900016)。
作者简介 陈 彬(1990—),男,博士研究生,研究方向:森林微生物资源遗传多样性。
*通信作者(Corresponding author ):梁俊峰(Liang Junfeng ),E-mail :*******************。
Russula indocatillus , a New Record Species in ChinaCHEN Bin 1, 2, SONG Jie 1, WANG Qian 1, LIANG Junfeng 1*1. Research Institute of Tropical Forestry, Chinese Academy of Forestry, Guangzhou, Guangdong 510520, China;2. Nanjing For-estry University, Nanjing, Jiangsu 210037, ChinaAbstract: Russula indocatillus was reported as new species to China. A detailed morphological description, illustrations and phylogeny are provided, and comparisons with related species are made. It is morphologically characterized by a brownish orange to yellow ochre pileus center with butter yellow to pale yellow margin, white to cream spore print, subglobose to broadly ellipsoid to ellipsoid basidiospores with bluntly conical to subcylindrical isolated warts, always one-celled pileocystidia, and short, slender, furcated and septated terminal elements of pileipellis. The combination of detailed morphological features and phylogenetic analysis based on ITS-nrLSU-RPB2 sequences dataset indicated that the species belonged to Russula subg. Heterphyllidia sect. Ingratae . Keywords: Russulaceae; new record species; phylogeny; taxonomy DOI 10.3969/j.issn.1000-2561.2021.09.014印度碗状红菇——一个中国新纪录种陈 彬1,2,宋 杰1,王 倩1,梁俊峰1*1. 中国林业科学研究院热带林业研究所,广东广州 510520;2. 南京林业大学,江苏南京 210037摘 要:本研究报道一个中国红菇属新记录种——印度碗状红菇(Russula indocatillus )。
PHYML Online — a web server for fast maximum likelihood-based phylogenetic inference
PHYML Online—a web server for fast maximumlikelihood-based phylogenetic inferenceSt´e phane Guindon a,Franck Lethiec b,Patrice Duroux b and Olivier Gascuel b1a Bioinformatics Institute&Allan Wilson Centre,University of Auckland,Private Bag92019,Auckland,New Zealand,b Projet M´e thodes et Algorithmes pour la Bioinformatique, LIRMM-CNRS,161Rue Ada,34392-Montpellier Cedex5,France.AbstractPHYML Online is a web interface to PHYML,a software that implements a fast and accurate heuristic for estimating maximum likelihood phylogenies from DNA and protein sequences.This tool provides the user with a number of options,e.g.nonparametric bootstrap and estimation of various evolutionary parameters,in order to perform comprehensive phylogenetic analyses on large data sets in reasonable computing time.The server and its documentation are available from http://atgc.lirmm.fr/phyml.IntroductionThe ever-increasing size of homologous sequence data sets and complexity of substitution models stimulate the development of better methods for building phylogenetic trees.Likelihood-based approaches(including Bayesian)provided arguably the most successful advances in this area in the last decade.Unfortunately,these methods are hampered with computational difficulties. Different strategies have then been used to tackle this problem,mostly based on stochastic approaches.Markov chain Monte Carlo methods are probably the most valuable tools in this context as they provide computationally tractable solutions to Bayesian estimation of phyloge-nies(1;2).Stochastic approaches have also been used to address optimisation issues in the maximum likelihood framework.Hence,simulated annealing(3)and genetic algorithms(4;5)were pro-posed to estimate maximum likelihood phylogenies from large data sets.However,the hill climbing principle is usually considered faster than stochastic optimisation and sufficient for numerous combinatorial optimisation problems(6).Recently,Guindon and Gascuel(2003)de-scribed a fast and simple heuristic based on this principle,for building maximum likelihood phylogenies.Several simulation studies(7;8)demonstrated that the tree topologies estimated with this approach are as accurate as those inferred using the best tree building methods cur-rently available.These studies also showed that this new method is considerably faster than the other likelihood-based ing this heuristic,the analysis of large data sets is now achieved in reasonable computing time on any standard personal computer;e.g.,only12 min were required to analyse a data set consisting of500rbc L sequences with1,428base pairs from plant plastids.This paper introduces PHYML Online,a web interface to the PHYML(PHYlogenetic infer-ences using Maximum Likelihood)software that implements the heuristic described by Guindon and Gascuel(2003).PHYML Online provides a number of useful options(e.g.nonparamet-ric bootstrap),and proposes quite recent models of sequence evolution(e.g.,WAG(9)and DCMut(10)).Wefirst give an overview of the algorithm and present the web server thereafter.AlgorithmThe core of the heuristic is based on a well-known tree-swapping operation,namely“nearest neighbour interchange”(NNI),which defines three possible topological configurations around each internal branch(see(11)).For each of these configurations,the length of the internal branch that maximises the likelihood is estimated using numerical optimisation.The difference of likelihood obtained under the best alternative topological configuration and the current one defines a score.A score with positive value indicates that the best alternative topological con-figuration yields an improvement of likelihood.A score with negative value indicates that the current topological configuration cannot be improved at this stage and only the length of the internal branch is adjusted.Each internal branch is examined in this manner and ranked accord-ing to its score.The optimal length of external branches is also computed.These calculations are performed independently for every branch and define a set of(topological or numerical) modifications,each of which corresponding to an improvement of the current tree regarding the likelihood function.The standard approach would only apply one of these modifications,typically that corre-sponding to the internal branch with best score.Here,a large proportion of all modifications computed previously is performed instead.This proportion is adjusted so as to increase the likelihood at each step,ensuring convergence of the algorithm.This way,the current tree is im-proved at each step,both in terms of topology and branch length,and only a few steps(usually a few dozen or less)are necessary to reach an optimum of the likelihood function.This explains the speed of this algorithm whose time complexity is O(pns),where p represents the number of refinement steps that have been performed and n is the number of sequences of length s.PHYML OnlinePHYML Online is a web interface to the PHYML algorithm(Figure1).By default,the input data consists of a single textfile containing one or more alignments of DNA or protein sequences in PHYLIP(12)interleaved or sequential format.Examples of sequence data sets in PHYLIP format are given in the‘User’s guide’section of the web site.Setting the parameters of a phylogenetic analysis through the interface is straightforward. Thefirst step is the selection of the substitution model of interest.Alignments of homologous DNA and amino-acid sequences can be examined under a wide range of models(JC69,K80,F81, F84,HKY85,TN93and GTR for nucleotides,and Dayhoff,JTT,mtREV,WAG and DCMut for amino acids).Variability of substitution rates across sites and invariable sites can also be taken into account.The parameters that model the intensity of the variation of rates across sites and the proportion of invariables sites can befixed by the user or estimated by maximum likelihood.Note that the parameters of the substitution model can be estimated under afixed tree topology or not.Thefixed topology option is useful when describing the evolutionary process is more important than estimating the history of sequences.An option is available to assess the reliability of internal branches using nonparametricbootstrap(13)which is possible to achieve for even large data sets thanks to the speed of PHYML optimisation algorithm.The number of bootstrap replicates isfixed by the user. The bootstrap values are displayed on the maximum likelihood phylogeny estimated from the original data set.Trees estimated from each bootstrap replicate,as well as the corresponding substitution parameters,can also be saved in separatefiles for further analysis(e.g.,computation of confidence intervals for the substitution parameters or estimation of a consensus bootstrap tree,as performed by PHYLIP’s CONSENSE for instance).Several data sets can be analysed in a single run.This option is especially useful in multiple gene studies.Multiple trees can be also be used as input and further optimised by the algorithm described above.This might prevent the tree searching heuristic to be trapped in local maxima. When combined with thefixed tree option,the multiple input trees approach also facilitates the comparison of thefit of different phylogenies estimated from a single data set.The‘User’s guide’section gives details on the format of multiple sequence and treefiles.Sequences(and starting tree(s)if provided)are uploaded on our server,a16-processor IBM computer running Linux2.6.8-1.521custom SMP,and a maximum likelihood analysis is performed using the PHYML algorithm.Results are then sent to the user by electronic mail. Thefirstfile presents a summary of the options selected by the user,maximum likelihood estimates of the parameters of the substitution model that were adjusted,and the log likelihood of the model given the data.The secondfile shows the maximum likelihood phylogeny(ies) in NEWICK format.Trees can be viewed through an applet available on the PHYML Online server.This applet runs the program ATV(14)that provides numerous options to display and a manipulate large phylogenetic trees.AvailabilityThe PHYML Online server is located at“Laboratoire d’Informatique,de Robotique et de Mi-cro´e lectronique de Montpellier”:http://atgc.lirmm.fr/phymlPHYML can also be downloaded for local installation at http://atgc.lirmm.fr/phyml/binaries. html.The PHYML software has been implemented in C ANSI and is available under GNU general public licence.Sources are available upon request.Binaries,example data sets,sources and documentation are distributed free of charge for academic purpose only.AcknowledgementsThanks to Emmanuel Douzery and Stephanie Pl¨o n for carefully reading this article.This work was funded by ACI IMPBIO(French Ministry of Research).S.G.is supported by a postdoctoral fellowship from the Allan Wilson Centre for Molecular Ecology and Evolution,New Zealand.References1.Rannala,B.and Yang,Z.(1996)Probability distribution of molecular evolutionary trees:a new method of phylogenetic inference.J.Mol.Evol.,43,304–311.2.Huelsenbeck,J.P.and Ronquist,F.(2001)MrBayes:Bayesian inference of phylogeny.Bioinformatics,17,754–755.3.Salter,L.and Pearl,D.(2001)Stochastic search strategy for estimation of maximum like-lihood phylogenetic trees.Syst.Biol.,50,7–17.4.Lewis,P.(1998)A genetic algorithm for maximum likelihood phylogeny inference usingnucleotide sequence data.Mol.Biol.Evol.,15,277–283.5.Lemmon,A.and Milinkovitch,M.(2002)The metapopulation genetic algorithm:an efficientsolution for the problem of large phylogeny A.,99, 10516–10521.6.Aarts,E.and Lenstra,J.K.(1997)Local search in combinatorial optimization,Wiley,Chichester.7.Guindon,S.and Gascuel,O.(2003)A simple,fast and accurate algorithm to estimate largephylogenies by maximum likelihood.Syst.Biol.,52,696–704.8.Vinh,L.S.and vonHaeseler,A.(2004)IQPNNI:Moving fast through tree space and stop-ping in time.Mol.Biol.Evol.,21,1565–1571.9.Whelan,S.and Goldman,N.(2001)A general empirical model of protein evolution derivedfrom multiple protein families using a maximum-likelihood approach.Mol.Biol.Evol.,18, 691–699.10.Kosiol,C.and Goldman,N.(2004)Different versions of the dayhoffrate matrix.Mol.Biol.Evol.,In press.11.Swofford,D.,Olsen,G.,Waddel,P.,and Hillis,D.(1996)Phylogenetic inference.In Hillis,D.,Moritz,C.,and Mable,B.,(eds.),Molecular Systematics,chapter11Sinauer Sunderland,MA.12.Felsenstein,J.(1993)PHYLIP(PHYLogeny Inference Package)version3.6a2,Distributedby the author,Department of Genetics,University of Washington,Seattle.13.Felsenstein,J.(1985)Confidence limits on phylogenies:an approach using the bootstrap.Evolution,39,783–791.14.Zmasek,C.and Eddy,S.(2001)ATV:display and manipulation of annotated phylogenetictrees.Bioinformatics,17,383–384.Figure1.The PHYML Online interface.。
应用PHYLIP构建进化树的完整详细过程
应用PHYLIP构建进化树的完整详细过程一、获取序列一般自己通过测序得到一段序列(已知或未知的都可以),通过NCBI的BLAST 获取相似性较高的一组序列,下载保存为FASTA格式。
用BIOEDIT等软件编辑序列名称,注意PHYLIP在DOS下运行,文件名不能超过10位,超过的会自动截留前面10位。
二、多序列比对目前一般应用CLASTAL X进行,注意输出格式选用PHY格式。
生成的指导树文件(DND文件)可以直接用TREEIEW打开编辑,形式上和最终生成的进化树类似,但是注意不是真正的进化树。
三、构建进化树1.N-J法建树依次应用PHYLIP软件中的SEQBOOT.EXE、DNADIST.EXE、NEIGHBOR.EXE和CONSENSE.EXE打开。
具体步骤如下:(1)打开seqboot.exe输入文件名:输入你用CLASTAL X生成的PHY文件(*.phy)。
R为bootstrap的次数,一般为1000 (设你输入的值为M,即下两步DNADIST.EXE、NEIGHBOR.EXE中的M值也为1000)odd number: (4N+1)(eg: 1、5、9…)改好了y得到outfile(在phylip文件夹内)改名为2(2)打开Dnadist.EXE输入2修改M值,再按D,然后输入1000(M值)Y得到outfile(在phylip文件夹内)改名为3(3)打开Neighboor.EXE输入3M=1000(M值)按Y得到outfile和outtree(在phylip文件夹内)改outtree为4,outfile改为402(4)打开consense.exe输入4Y得到outfile和outtree(在phylip文件夹内)Outfile可以改为*.txt文件,用记事本打开阅读。
三、进化树编辑和阅读outtree可改为*.tre文件,直接双击在treeiew里看;也可以不改文件扩展名,直接用treeiew、PHYLODRAW、NJPLOT等软件打开编辑。
分子生态学名词解释
一、翻译并解释名词:10x4分1.allele 等位基因一个位点的序列变异.2.Effective population size Ne 有效种群大小在一个具有相等性比、随机交配的理想种群中表现出与特定统计全部成体数目规模相对应的真实的种群杂合性随时间丧失的速率相同的个体数.3.F-statistics F 统计检验用于评估个体间、亚种群间和整个种群间杂合性的分布的统计方法,被广泛应用于定量亚种群的遗传分化.4.Genetic load 遗传负荷相对于理论最佳值来说降低了的基因型适合度.5.Hardy-Weiberg equilibrium哈温平衡当所有等位基因频率是已知的时候,在一个大的随机交配种群中的纯合子和杂合子的预期比例.假设没有迁移、突变或选择作用,哈温平衡定律则认为等位基因频率从一个世代到下一个世代应该保持不变.6.Bottleneck effect瓶颈效应种群的规模大为缩小,随后常常有一个种群的恢复.7.Selection sweep选择扫荡.课件:Occurrence of a beneficial mutation,Only individuals carrying the mutation reproduce,‘Population bottleneck’,Mainly affects linked loci.8.IAM 无限等位基因模型其中突变不是以可预料的方式一个接一个发生,而大多数突变是像产生SNP单核苷酸多态性那样出现的.9.Linkage disequilibrium LD 连锁不平衡.术语表:Linkage equilibrium 连锁平衡:由重组促成的情形,其中遗传位点在繁殖期相互独立分离.当两个位点上的等位基因一起分离时,如他们在同一个染色体上的物理位置太接近时,则发生不平衡.百度:连锁平衡:不同的各在人群中以一定的出现.在某一群体中,不同座位上某两个出现在同一条染色体上的高于预期的随机频率的现象,称连锁不平衡 linkage disequilibrium .由于 HLA 不同的某些经常连锁在一起遗传,而连锁的基因并非完全随机地组成单体型,有些基因总是较多地在一起出现,致使某些单体型在群体中呈现较高的,从而引起连锁不平衡.10.Metapopulation复合种群种群再分为多个同类群,至少其中的一些偶尔灭绝,随后通过从其他同类群迁入再建立种群.11.Microsatellite微卫星带有单序列通常为2-,3-或4-核苷酸重复多次的遗传位点.12.MtDNA 线粒体DNA存在于线粒体中的环状染色体.13.Non-synomous mutation非同义突变由一个三联密码变化使特定氨基酸改变的突变.14.PCR聚合酶链式反应用寡核苷酸引物和耐热的DNA聚合酶扩增大量DNA序列的一种方法.15.SNP单核苷酸多态性在DNA序列中一个特殊位点上出现不同核苷酸碱基的等位基因.16.RFLP限制性片段长度多态性用限制性酶和凝胶电泳鉴定DNA序列多态性的方法.17.Transition转换一个嘌呤核苷酸被另一个嘌呤核苷酸替代,或一个嘧啶核苷酸被另一个嘧啶核苷酸取代的突变.18.Transversion颠换一个嘌呤核苷酸被另一个嘧啶核苷酸取代或相反过程的突变.19.Molecular ecology分子生态学课件:分子生物学是应用分子生物学的原理和方法来研究生命系统与环境系统相互作用的生态机理及其分子机制的科学.它是生态学与分子生物学相互渗透而形成的一门新兴交叉学科,其研究内容包括种群在分子水平的遗传多样性及遗传结构,生物器官变异的分子机制、生物体内有机大分子对环境因子变化的响应、生物大分子结构、功能演变与环境长期变化的关系以及其它生命层次生态现象的分子机理等.分子生态学的理论和方法对传统学科有巨大的促进作用,同时,对解决诸如转基因、克隆技术应用中的生态安全、环境与人类健康等重大问题将产生深刻的影响.20.Functional ecological and evolutionary genomicsFEEG生态和进化基因组学21.PhylogeographicPhylogeographiy 亲缘地理学研究调控系谱世系的地理分布的原理和过程的科学.22.Monophyly 单系类群中的所有个体是从同一个祖先来的,并且从这个祖先来的所有存活的后代都在这个类群中.23.Intron 内含子真核生物的结构基因间的非编码DNA序列.内含子被转录但它们的RNA拷贝在功能产生期间被切除.24.Introgression 渐渗杂交等位基因从一个种群或物种向另一个种群或者物种的扩散,而产生像种群间或物种间的近交或杂交这样的结果.25.Genetic drift 遗传漂变生物多样性导论:在有性生殖的群体中,每个世代的基因库是对上一个世代基因库的随机抽样和复制.在世代交替过程中,不同的等位基因遗传到下一代的偶然性对种群遗传结构有可能产生显着影响. Wright把这种由于配子产生及结合过程中的随机性导致的基因频率的波动称为遗传漂变genetic drift,也称随机漂变、遗传偏离或Wright效应.26.Haplotype 单倍体型源于同一染色体或染色体单倍体组的一套等位基因.27.ESUEvolutionary Significant Unit 进化显着单元分类学上对保护重要类群的一种尝试性定义.28.Metagenomics 宏基因组学百度:宏基因组学Metagenomics又叫微生物环境基因组学、元基因组学.它通过直接从环境样品中提取全部微生物的DNA,构建宏基因组文库,利用基因组学的研究策略研究环境样品所包含的全部微生物的遗传组成及其群落功能.它是在微生物基因组学的基础上发展起来的一种研究微生物多样性、开发新的生理活性物质或获得新基因的新理念和新方法.其主要含义是:对特定环境中全部为生物的总DNA也称宏基因组,metagenomic进行克隆,并通过构建宏基因组文库和筛选等手段获得新的生理活性物质;或者根据rDNA数据库设计引物,通过系统学分析获得该环境中微生物的遗传多样性和分子生态学信息.29.CpDNA 叶绿体DNA存在于叶绿体内的环状染色体.30.Heterosis 杂种优势,杂合体优势杂合子比纯合子有较高的适合度的情形.31.Recombination 重组二倍体生物中减数分裂期间同源配对染色体间的DNA交换.32.rRNA 核糖体RNA核糖体中的RNA分子…………33.SNP 单核苷酸多态性在DNA序列中一个特殊位点上出现不同核苷酸碱基的等位基因.34.Sympatric speciation 同域发生物种形成生活在同一地区的个体中形成的新物种.35.Vicariance 地理隔离种群或物种被环境事件,如山脉形成造成的物理隔离注:描黄部分为术语表中无相关解释的,仅供参考.分子生态学简答题1. 什么是分子生态学, 分子生态学的主要研究内容是什么教材的定义:分子生物学是应用分子生物学的原理和方法来研究生命系统与环境系统相互作用的生态机理及其分子机制的科学.它是生态学与分子生物学相互渗透而形成的一门新兴交叉学科.研究内容:包括种群在分子水平的遗传多样性及遗传结构,生物器官变异的分子机制、生物体内有机大分子对环境因子变化的响应、生物大分子结构、功能演变与环境长期变化的关系以及其它生命层次生态现象的分子机理等.分子生态学的理论和方法对传统学科有巨大的促进作用,同时,对解决诸如转基因、克隆技术应用中的生态安全、环境与人类健康等重大问题将产生深刻的影响.2. 什么是宏基因组学,其主要研究过程如何宏基因组学Metagenomics又叫微生物环境基因组学、元基因组学.它通过直接从环境样品中提取全部微生物的DNA,构建宏基因组文库,利用基因组学的研究策略研究环境样品所包含的全部微生物的遗传组成及其群落功能.宏基因组学的研究过程:对特定环境中全部为生物的总DNA也称宏基因组,metagenomic进行克隆,并通过构建宏基因组文库和筛选等手段获得新的生理活性物质;或者根据rDNA数据库设计引物,通过系统学分析获得该环境中微生物的遗传多样性和分子生态学信息.3. 什么是生态和进化基因组学,其主要研究内容如何研究环境条件与基因组结构、功能、动态及进化相互关系的学科.4. 什么是亲缘地理学,其研究内容是什么研究调控系谱世系的地理分布的原理和过程的科学.5. 什么是数量性状数量性状有那些类型数量性状又称为适应性性状,通常由互相影响的多个基因控制,并会明显受到环境的影响.在足够大的群体中,数量性状的分布基本符合正态规律.数量性状的类型:连续型、阈值型、间断型或离散型连续型:数量性状的表现为连续分布,群体中个体数足够多时,连续型数量性状基本符合正态分布,如:身高、体重、奶牛产奶量等.阈值型数量性状:此种数量性状的表现会受到潜在风险因素的影响,具有最低或最高阈值,例如:生物的性成熟年龄,疾病的易感率等.间断型数量性状:只能用离散的数值或数组表示其表现型的数量性状,如动物每胎产仔个数、果蝇足上的刚毛数等.6. 影响种群等位基因频率变化的进化过程有哪些种群进化的主要动力有:突变、遗传漂变、选择、基因的迁移基因流动.突变:生物核酸序列上碱基种类或数目发生改变,导致新等位基因产生或原有等位基因丧失.遗传漂变:由于进行有性生殖的生物的配子形成过程具有随机性,亲代的部分遗传信息在通过配子传递给子代时会由于这种不确定性而丢失,导致部分等位基因在整个种群的下一代中频率降低甚至消失.遗传漂变现象在小种群中表现尤其明显.选择:外部自然环境对种群等位基因的选择往往是定向的,若等位基因所控制的形状不能使生物较好地适应生存环境,自然选择就会逐步淘汰这些等位基因.基因流动:种群内个体在不同种群之间进行迁移时,会造成不同种群之间等位基因的混合.整体而言,大规模的种群之间个体迁移能够调和等位基因频率分布的不均衡现象,有利于提高种群等位基因多样性.7. 什么叫选择扫荡是如何产生的A selective sweep is the reduction or elimination of variation among the nucleotides in neighboring DNA of a mutation as the result of recent and strong positive natural selection.A selective sweep can occur when a new mutation occurs that increases the fitness of the carrier relative to other members of the population. Natural selection will favour individuals that have a higher fitness and with time the newly mutated variant allele will increase in frequency relative to other alleles. As its prevalence increases, neutral and nearly neutral genetic variation linked to the new mutation will also become more prevalent. This phenomenon is called genetic hitchhiking.A strong selective sweep results in a region of the genome where the positively selected haplotype the mutated allele and its neighbours is essentially the onlyone that exists in the population, resulting in a large reduction of the total genetic variation in that chromosome region.选择扫荡或选择性清除:种群中产生了能够提高个体适合度的有利变异后,该有利变异以及与之连锁的中性和近中性变异的等位基因频率逐渐提高,从而导致其他等位基因频率的降低乃至丧失,最终造成种群遗传多样性整体显着降低的现象.当种群中产生了可提高个体适合度的新变异即有利变异时,选择性扫荡就可能发生.携带了这种有利变异的个体在自然选择中更具优势,拥有更高繁殖成功率,因此随着时间推移,该有利变异在同位点等位基因中的频率逐渐提高,并且与其可连锁遗传的其他中性和近中性等位基因频率也会相应提高.强烈的选择性扫荡作用会导致基因组整体遗传多样性的大幅降低乃至丧失,因为它往往会最终导致种群中只剩下具有少数几种有利突变基因型的个体.8. 什么是瓶颈效应产生瓶颈效应的原因有哪些规模较大、具有高度遗传多样性的种群,由于干扰等因素导致种群个体数量急剧减少后,虽然个体数目可以在后来逐渐恢复至原来的水平,但由于大量个体丧失而同时导致的种群遗传多样性的损失却再也无法恢复.导致瓶颈效应产生的原因:破坏性自然灾害、传染病、大规模捕杀或采集、栖息地的破坏等均可造成种群数量急剧减少,从而导致瓶颈效应.9. 什么是FST,怎样用FST 衡量种群分化程度FST 是用于衡量种群分化程度的F-统计方法所用的统计量之一,定义为:T S ST H H F -=1 ,其中:HS 为亚种群期望杂合度,HT 为整个种群的期望杂合度.它反映的是一个大种群内部各个亚种群之间的分化程度.FST 取值范围为:0≤FST <1若FST=0,则有HS=HT,原种群无分化现象;若FST→1,则有HS→0,表示亚种群杂合度极低,原种群已发生高度分化.10. 举例说明常用的分子标记及其在分子生态学研究中的应用.常用分子标记有:RFLP Restriction fragment length polymorphism限制性片段长度多态性SSLP Simple sequence length polymorphism简单序列长度多态性AFLP Amplified fragment length polymorphism扩增片段长度多态性RAPD Random amplification of polymorphic DNA随机扩增的多态性DNAVNTR Variable number tandem repeat可变数目串联重复序列Microsatellite polymorphism, SSR Simple sequence repeat简单重复序列多态性微卫星SNP Single nucleotide polymorphism简单核苷酸多态性STR Short tandem repeat短串联重复序列SFP Single feature polymorphism单一特征多态性DArT Diversity Arrays Technology多样性阵列技术RAD markers Restriction site associated DNA markers与DNA相关的限制性位点分子标记分子标记的应用:研究动植物交配机制、遗传漂变、亲缘地理学、种群生态学、保护生物学、宏基因组学研究例如微生物群落组成与功能的研究等.分子标记的特点:普遍性、稳定性、高度多态性、可遗传性、可区分由于遗传和生存环境导致的相似性区分同源性与相似性;对于绝大多数种类难以进行培养的微生物,可利用分子标记,直接建立环境样品中的微生物宏基因组文库,以便研究环境中微生物群落的组成、结构与功能.此外,分子标记还是基因组学分析的重要工具.11. 什么叫适应,适应性状有哪些特点定义:使生物个体对其生境适应性提高的现象.适应性状:又称为数量性状,其表现型一般由多个基因共同控制,并且受环境影响较明显.补充:群体很大时,数量性状的表型分布规律基本为正态的.12. 什么叫突变突变有那些种类定义:生物核酸序列上核苷酸种类的改变或少数几个核苷酸的插入或缺失现象.突变种类:按突变发生机制,可以分为碱基替换Substitution、插入与删除Indels碱基替换:又称错误配对.分为转换一种嘌呤替换为另一种嘌呤,或一种嘧啶替换为另一种嘧啶和颠换嘌呤和嘧啶之间的替换替换不改变核苷酸数目,可能导致对应密码子和氨基酸的改变.按突变结果,可以分为:同义突变或沉默突变、错义突变对应氨基酸改变、无义突变对应的密码子变为终止子,使转录停止.插入与删除:即在DNA新链合成时,由于复制错误,在新链中添加或遗漏了若干个核苷酸.这将导致新链核苷酸数目改变,造成密码子读取顺序随之发生变化,又称为移码突变.按突变发生场所,亦可分为体细胞突变与生殖细胞突变.分子生态学论述题回答参考思路1. 举例说明分子生态学在解决经典生态学问题中的优势和局限是什么经典生态学的研究层次:个体、种群、群落、生态系统.个体生态学:个体的形态结构、生理功能、生态习性多样性及分布;种群生态学:种群组成与结构、种群动态出生与死亡、迁移、种群行为交配机制、种间关系等;群落生态学:群落的物种组成与结构、群落动态群落演替、群落稳定性;生态系统生态学:生态系统的组成与结构生产者、消费者、分解者、生态过程物质循环、能量流动、信息传递※分子生态学的直接研究对象是生物大分子如核酸序列、蛋白质.优势:从分子生态学的主要研究手段方面考虑,分子生态学的最大特点是应用分子生物学原理和技术手段,上述研究方向,哪些可以应用局限性:起步相对较晚,缺乏原始数据没有建立完善的数据库,因此进行基础性研究仍存在一定难度.2. 举例说明分子生态学方法在微生物学中的应用.1宏基因组学:提取环境样品中的全部微生物基因组信息,构建环境微生物群落的宏基因组文库.传统研究方法:微生物培养;仅能培养约1%的微生物种类2分子标记技术:进行微生物的分类学研究.使用微生物的16SrRNA中的可变序列,利用PCR技术、凝胶电泳、放射自显影的方法对环境品种的微生物进行遗传序列鉴定,可以有效地进行微生物的分类学研究.传统分类学通过对生物个体表性特征的鉴别进行分类,对于个体及其微小的微生物,传统方法显然是不适合的.3基因工程与微生物育种:利用生物突变原理,对微生物进行诱变育种;利用基因工程技术培育杂交菌种等.3. 从种群遗传学和保育生物学角度来说明小种群为什么易于灭绝、遗传漂变、瓶颈效应、奠基者效应、近交现象4. 人类谱系地理学证据如何表明人类起源于非洲如何解释用线粒体和Y-染色体来推断人类最近共同祖先进化时间上的差异科学家利用遵循母系遗传的线粒体DNA和遵循父系遗传的Y染色体片段作为分子标记,对采集自世界各地的人类样本进行了遗传序列比对,分析出样本之间遗传序列的差异情况,并根据分子钟原理和系统发育树的建立原理,得到了人类的大致谱系发育图.大量这方面的相关研究表明,除非洲外的其他区域采集到的样本都曾来源于共同的祖先同一个谱系,而只有在非洲采集到的样本中,发现有的样本不属于上述共同的谱系,因此科学家推断,人类可能更早起源于非洲,后通过迁移扩散分布于世界各地.进化时间差异:用线粒体DNA估计出的人类母系祖先出现的时间为距今约20万年前,早于用Y染色体估计出的人类父系祖先出现时间为距今约6~14万年<=¥▽¥=>.产生差异的主要原因:线粒体DNA和Y染色体的突变速率不同,线粒体DNA的突变速率更快.建议大家背题的时候,遇到不懂的地方自己查找一下有关的资料.论述题只需要建立清晰的条理,明白地表达自己的观点即可,因此建议:对上述四个问题,请大家选择自己容易理解的角度,尽可能完全弄懂.考试时有得写才是最重要的_:37∠_。
RAxML建立极大似然进化树简明指南
用RAxML构建极大似然进化树RAxML是用极大似然法建立进化树的软件之一,可以处理超大规模的序列数据,包括上千至上万个物种,几百至上万个已经比对好的碱基序列。
作者是德国慕尼黑大学的 A. Stamatak博士。
RAxML有若干版本(有的版本支持在多个CPU上运行),本文以最常用的单机版raxmlHPC为例。
1 下载和安装RAxML可以在Linux, MacOS, DOS下运行,下载网址为http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm也可以使用的超级计算机运行。
对于Linux和Mac用户下载RAxML-7.0.4.tar.gz 用gcc编译即可make –f Makefile.gccWindows用户可以下载编译好的exe文件,而无需安装。
2 数据的输入RAxML的数据位PHYLIP格式,但是其名字可以增加至256个字符。
“RAxML对PHYLIP文件中的tabs,inset不敏感”。
输入的树的格式为NewickRAxML的查错功能1 序列的名称有重复,即不同的碱基却拥有一致的名称。
2 序列的内容重复,即两条不同名称的序列,碱基完全一致。
3 某个位点完全由序列完全由未知符号组成,如氨基酸序列完全由X,?,*,-组成,DNA序列完全由N,O,X,?,-组成。
4序列完全由未知符号组成,如氨基酸序列完全由X,?,*,-组成,DNA序列完全由N,O,X,?,-组成。
5 序列名称中禁用的字符如包括空格、制表符、换行符、:,(),[]等3 RAxMLHPC下的选项-s sequenceFileName 要处理的phy文件-n outputFileName 输出的文件-m substitutionModel 模型设定方括号中的为可选项:[-a weightFileName] 设定每个位点的权重,必须在同一文件夹中给出相应位点的权重[-b bootstrapRandomNumberSeed] 设定bootstrap起始随机数[-c numberOfCategories] 设定位点变化率的等级[-d] -d 完全随机的搜索进化树,而不是从maximum parsimony tree开始。
reads数,测序深度笔记
reads数,测序深度笔记前⾔对于测序深度,reads的计算,以及相关的数据量不太清楚,找了⼀个篇⽂章,源于公众号“嘉因”,后⽅有链接。
Sequencing depth.The amount of sequencing needed for a given sample is determined by the goals of the experiment and the nature of the RNA sample.Experiments whose purpose is to evaluate the similarity between the transcriptional profiles of two polyA+ samples may require only modest depths of sequencing (e.g. 30M pair-end reads of length > 30NT, of which 20-25M are mappable to the genome or known transcriptome, Experiments whose purpose is discovery of novel transcribed elements and strong quantification of known transcript isoforms requires more extensive sequencing. The ability to detect reliably low copy number transcripts/isoforms depends upon the depth of sequencing and on a sufficiently complex library. For experiments from a typical mammalian tissue or in which sensitivity of detection is important, a minimum depth of 100-200 M 2 x 76 bp or longer reads is currently recommended. [Specialized studies in which the prevalence of different RNAs has been intentionally altered (e.g. “normalizing” using DSN) as part of sample preparation need more than the read amounts (>30M paired end reads) used for simple comparison (see above). Reasons for this include: (1) overamplification of inserts as a result of an additional round of PCR after DSN and (2) much more broad coverage given the nature of A(-) and low abundance transcripts.上述⽂字翻译如下:根据研究⽬的决定测序深度:⽬的1:通过抓取polyA尾建库(只测那些带有polyA尾巴的基因,⼤多是蛋⽩编码基因),寻找样品间基因转录谱的相似性,只需要30M reads,长度⼤于30nt即可,双端测序,其中20-25M能够回帖(map)到已知转录组上。
PAUP使用说明
PAUP(摘自生物信息学-基因和蛋白质分析的使用指南第九章,包括PHYLIP、PAUP、FastDNAml, MACCLADE, MEGA plus METREE, MOLPHY和PAML。
)开发PAUP(Swofford, 1997)的目的是为系统发育分析提供一个简单的,带有菜单界面的,与平台无关的,拥有多种功能(包括进化树图)的程序。
在苹果机(Macintosh)上使用过PAUP程序(版本3)的人对这个程序的菜单界面都会很熟悉,虽然这个版本已经不再发行了。
PAUP 3.0只建立于MP相关的进化树及其分析功能;而PAUP 4.0已经可以针对核苷酸数据进行与距离方法和ML方法相关的分析功能,以及其它一些特色。
获取和编译程序在商业版本发行之前,现行的出版物中,有成打的分析使用了PAUP 4.0测试版本(由原作者通过blue@提供)。
菜单界面的测试版本已经在Macintosh 68K 、PRC 计算机和微软的视窗操作系统上编译通过。
命令行版本已经在Sun Sparc、Supersparc、DEC Alpha(OSF1和OPENVMS)、SGI(32位和64位)以及linux上编译通过。
初学的用户应该将其中一个菜单版本浏览一遍。
在这些版本中也可以使用命令行,这样会使得命令教程会变得容易一些。
通常而言,命令都有缩写。
比如,要执行启发式进化树搜索的命令可以键入“hs[earch]”(大小写不敏感;括弧内的字符为选项)。
而且,因为文件在各个平台之间都是可移植的,菜单版本可以用来测试数据文件。
如果希望在一个很快的Unix机器上跑一个分析程序,这个协议就显得非常重要。
如果文件格式出错,菜单版本不仅仅报告文件格式的错误,而且还会打开文件,将错误的地方高亮度显示。
数据格式PAUP使用一种称为NEXUS的数据格式,这种格式还可以被MACCLADE程序使用,当然PAUP也可以输入PHYLIP, GCG-MSF, NBRF-PIR, HENNIG86数据格式以及文本比对(形如“{ name } <tab or space> { same-length sequences } <ret>”的列表,以“;<ret> end”结束)。
甘蔗组培苗_2_种污染细菌的分离与鉴定
热带作物学报2021, 42(2): 519-526Chinese Journal of Tropical Crops甘蔗组培苗2种污染细菌的分离与鉴定刘红坚,李松,何为中,刘俊仙,刘丽敏,卢曼曼,游建华,黄诚华,林善海*广西农业科学院甘蔗研究所/中国农业科学院甘蔗研究中心/农业农村部广西甘蔗生物技术与遗传改良重点实验室,广西南宁 530007摘要:细菌污染是甘蔗组培苗生产上的一个主要问题。
为明确甘蔗组培中常出现的污染细菌的种类,本研究通过梯度稀释的方法分离污染细菌,结合生化特性、16S rRNA和phaC基因序列比对及系统发育分析鉴定其分类地位。
结果表明,共分离获得2种不同类型污染细菌,除了甘露醇,其余9个生化指标完全一致。
基因序列比对发现,2种类型分离物的种内2个基因序列完全一致,而种间16S rRNA序列只有2个碱基差异,种间phaC序列有23个碱基差异。
2个基因序列构建的系统发育树显示,2种细菌遗传距离较近,并均能形成独立的2个分支。
结合生化特征、序列比对及系统发育树,将2种污染细菌分别鉴定为巨大芽孢杆菌(Bacillus megaterium)和阿氏芽孢杆菌(B. aryabhattai),且首次报道阿氏芽孢杆菌为甘蔗组织培养中的污染细菌。
本研究揭示了巨大芽孢杆菌可能来源于甘蔗的内生菌,而阿氏芽孢杆菌来源于环境。
关键词:甘蔗;组培苗;污染细菌;芽孢杆菌;系统发育树中图分类号:S566.1 文献标识码:AIsolation and Identification of Two Bacteria Contaminated Sugarcane Tissue Culture SeedlingsLIU Hongjian, LI Song, HE Weizhong, LIU Junxian, LIU Limin, LU Manman, YOU Jianhua,HUANG Chenghua, LIN Shanhai*Sugarcane Research Institute, Guangxi Academy of Agricultural Sciences / Sugarcane Research Center, Chinese Academy of Agri-cultural Sciences / Key Laboratory of Sugarcane Biotechnology and Genetic Improvement (Guangxi), Ministry of Agriculture and Rural Affairs, Nanning, Guangxi 530007, ChinaAbstract: Contamination is a main issue in sugarcane tissue culture. The purpose of this study was to determine the contaminative bacteria frequently occurred during the production of sugarcane tissue culture. The gradient dilution method was used to isolate bacteria, and the toxonomic position was identified combining the biochemical, sequence alignment of 16S rDNA and phaC genes, phylogeomic analysis. The results showed that two type contaminative bacteria were obtained from the sugarcane tissue culture nutrient solution, and nine tested biochemical characters were the same between the two bacteria except mannitol utilization. Sequence alignment showed that intraspecific sequences of 16S rDNA and phaC genes were completely uniformity respectively in the two bacteria, but the difference of the two nu-cleotide bases in 16S rDNA and 23 in phaC sequences existed between the two bacteria. The phylogenic tree constructed based on 16S rDNA and phaC genes sequences respectively also indicated that the two bacteria were closest each other and clustered into two absolutely independent branches. Combining the biochemical characters, sequence alignment and phylogenetic analysis, the two contaminative bacteria were identified as Bacillus megaterium and B. aryabhattai, re-收稿日期 2020-02-18;修回日期 2020-04-07基金项目 广西创新驱动发展专项资金项目(桂科AA17202042-3);广西农业科学院稳定资助团队项目(桂农科2018YT04);广西农业科学院基本科研业务专项(桂农科2017YM05)。
Phylogenies with Annotations 1.3 软件说明书
Package‘phylotate’October14,2022Title Phylogenies with AnnotationsVersion1.3Date2019-06-29Author Daniel Beer[aut],Anusha Beer[aut]Maintainer Daniel Beer<****************>Description Functions to read and write APE-compatible phylogenetictrees in NEXUS and Newick formats,while preserving annotations.Depends R(>=3.0.0)Suggests apeLicense MIT+file LICENSECollate'utility.R''tokenize.R''newick.R''nexus.R''format.R''mbattrs.R'NeedsCompilation noRepository CRANDate/Publication2019-06-2921:00:03UTCR topics documented:phylotate-package (2)finches (2)mb_attrs (2)parse_annotated (3)print_annotated (4)read_annotated (5)write_annotated (6)Index812mb_attrs phylotate-package Phylogenies with AnnotationsDescriptionphylotate provides functions that allow you to read and write NEXUS and Newick trees containing annotations,including those produced by MrBayes.It does this by extending ape’s phylo object type with extra data members containing per-node annotation information.This information is stored in such a way that it can be manipulated easily and will survive most manipulations using standard ape functions(e.g.reorder,chronopl).See the documentation for the parse_annotated function for more information on how annotations are stored internally.The functions you probably want to use for most things are read_annotated and write_annotated.Author(s)Daniel Beer<****************>,Anusha Beer<******************>finches NEXUS data exampleDescriptionA simple tree generated by MrBayes using the sequences for Darwin’sfinches from the exampledistributed with BEAST.Usagedata(finches)mb_attrs Parse MrBayes-supplied attributes from a NEXUSfileDescriptionThis function takes a tree object and produces a dataframe containing attributes attached to each node by MrBayes.Usagemb_attrs(tree)parse_annotated3Argumentstree an object of type"phylo"DetailsThe returned dataframe contains one row per node,and one column per attribute.The attributesparsed are prob,prob_stddev,length_mean,length_median,length_95_HPD_low,and length_95_HPD_high.Attributes which are derivable from the others are not parsed(for example,the prob_percentattribute is not parsed,since it’s prob times100.ValueA dataframe of attributes.Author(s)Anusha Beer<******************>See Alsoparse_annotated,read_annotatedExamples#Parse the example data included with this packagedata(finches)t<-parse_annotated(finches,format="nexus")#Obtain a table of MrBayes attributes for each nodeattrs<-mb_attrs(t)parse_annotated Parse an annotated phylogenetic treeDescriptionThis function takes the given text string,containing data in either NEXUS or Newick format,andreturns annotated phylogenetic trees.Usageparse_annotated(str,format="nexus")Argumentsstr a text string,containing tree dataformat a format specifier;either"nexus"or"newick"4print_annotatedDetailsThe givenfile text is parsed and a tree object is constructed which can be used with the functions in the ape package.Annotations of the kind produced by,for example,MrBayes,are parsed and preserved in the returned object.In addition to edge,edge.length and bel,two additional vectors are added.These are ment and ment.These contain annotations associated with nodes and their distance values.These arrays are indexed by node number,not by edge.The reason for this is that this ensures that the object will remain in a valid state after a call to reorder which might change the ordering of the edge arrays without being aware of annotations.If you need to obtain annotations in edge-order,subset by the second column of the edge array.Valuean object of type"phylo"or"multiPhylo",augmented with node annotations.Author(s)Daniel Beer<****************>ReferencesParadis,E.Definition of Formats for Coding Phylogenetic Trees in R.http://ape-package.ird.fr/misc/FormatTreeR_24Oct2012.pdfSee Alsoprint_annotated,read_annotated,write_annotated,finchesExamples#Parse the example data included with this packagedata(finches)t<-parse_annotated(finches,format="nexus")#Obtain annotations in edge-order,rather than node-orderment<-t$ment[t$edge[,2]]print_annotated Serialize an annotated phylogenetic treeDescriptionThis function takes the given tree object and returns a string representing the tree in the requested format.The difference between the"newick"and"d"formats is that the former uses only node numbers in its output,whereas the latter uses the tip labels(sanitized and deduplicated if necessary).read_annotated5 Usageprint_annotated(tree,format="nexus")Argumentstree a phylogentic tree,with optional annotationsformat a format specifier;either"nexus","newick",or"d"DetailsThe tree object should be either a"phylo"or"multiPhylo"object.It may optionally be aug-mented with annotations,as described in the documentation for the parse_annotated function.The output is a string suitable for writing to afile.Valuea string containing a serialized tree.Author(s)Daniel Beer<****************>See Alsoparse_annotated,read_annotated,write_annotatedread_annotated Read an annotated phylogenetic treeDescriptionThis function takes the givenfile,containing data in either NEXUS or Newick format,and returns annotated phylogenetic trees.Usageread_annotated(filename,format="nexus")Argumentsfilename afile to read tree data fromformat a format specifier;either"nexus"or"newick"DetailsThe givenfile text is parsed and a tree object is constructed which can be used with the functions in the ape package.Annotations of the kind produced by,for example,MrBayes,are parsed and preserved in the returned object.See parse_annotated for more information about the structure of the returned value.Valuean object of type"phylo"or"multiPhylo",augmented with node annotations.Author(s)Daniel Beer<****************>See Alsoprint_annotated,parse_annotated,write_annotatedwrite_annotated Write an annotated phylogenetic tree to afileDescriptionThis function takes the given tree object and returns a string representing the tree in the requested format.The difference between the"newick"and"d"formats is that the former uses only node numbers in its output,whereas the latter uses the tip labels(sanitized and deduplicated if necessary).Usagewrite_annotated(tree,filename,format="nexus")Argumentstree a phylogentic tree,with optional annotationsfilename afile to write noformat a format specifier;either"nexus","newick",or"d"DetailsThe tree object should be either a"phylo"or"multiPhylo"object.It may optionally be aug-mented with annotations,as described in the documentation for the parse_annotated function. Author(s)Daniel Beer<****************>See Alsoparse_annotated,print_annotated,read_annotatedIndex∗datasetsfinches,2∗phylomb_attrs,2parse_annotated,3phylotate-package,2print_annotated,4read_annotated,5write_annotated,6finches,2,4mb_attrs,2parse_annotated,2,3,3,5–7phylotate(phylotate-package),2phylotate-package,2print_annotated,4,4,6,7read_annotated,2–5,5,7reorder,2write_annotated,2,4–6,68。
phylip格式序列 -回复
phylip格式序列-回复如何将FASTA格式的序列转换为PHYLIP格式的序列[引言]在生物信息学领域,序列比对是许多研究的基础。
然而,不同的生物信息学工具和软件使用不同的序列格式,这就需要我们进行序列格式的转换。
本文将重点介绍如何将常见的FASTA格式的序列转换为PHYLIP格式的序列。
我们将分步骤地指导您完成这一转换过程。
[步骤一:了解FASTA格式的序列]FASTA(The Fast-All) 是一种常见的序列格式,其基本格式如下所示:>序列名称序列其中,">"为注释行,以">" 开始,后面跟着序列名称;序列为一行或多行,包含碱基或氨基酸序列。
[步骤二:了解PHYLIP格式的序列]PHYLIP(PHYLogeny Inference Package)是一种用于构建系统进化树的软件包,其使用的序列格式如下所示:N(序列数)L(序列长度)序列名称1序列1序列名称2序列2...其中,N 表示序列的数量,L 表示每个序列的长度,序列名称和序列分别按照顺序排列。
[步骤三:准备转换工具]为了将FASTA格式的序列转换为PHYLIP格式的序列,我们需要准备一个序列格式转换工具。
这里我们推荐使用BioPython,一个强大的生物信息学Python库,您可以在其官方网站上下载并安装。
[步骤四:利用BioPython进行转换]1. 首先,在Python中导入BioPython库:from Bio import SeqIO2. 然后,使用SeqIO模块读取FASTA文件并将其转换为PHYLIP格式,代码如下:input_file = "input.fasta"output_file = "output.phy"fasta_sequences = SeqIO.parse(open(input_file), 'fasta')with open(output_file, "w") as output_handle:SeqIO.write(fasta_sequences, output_handle, "phylip")在代码中,"input.fasta" 是您要转换的FASTA文件的文件名,"output.phy" 是转换后的PHYLIP文件的文件名。
二代测序 实验流程
二代测序实验流程英文回答:Sequencing is a fundamental technique in molecular biology that allows us to determine the order of nucleotides in a DNA molecule. The second generation sequencing, also known as next-generation sequencing (NGS), has revolutionized the field of genomics with its high-throughput and cost-effective approach. In this response, I will explain the general workflow of a typical second-generation sequencing experiment.The first step in a second-generation sequencing experiment is the preparation of the DNA sample. This involves isolating the DNA of interest and fragmenting it into smaller pieces. Various methods can be used for DNA fragmentation, such as enzymatic digestion or mechanical shearing. Once the DNA is fragmented, adapters are added to the ends of the DNA fragments. These adapters serve as priming sites for the subsequent steps in the sequencingprocess.After the DNA sample preparation, the next step is library construction. In this step, the DNA fragments with adapters are amplified through PCR (polymerase chain reaction) to generate a large number of identical copies of each fragment. This step is crucial as it increases the amount of DNA available for sequencing and ensures that each fragment is represented adequately.Once the library is constructed, it is loaded onto the sequencing platform. There are several different sequencing platforms available, such as Illumina, Ion Torrent, and Pacific Biosciences. Each platform has its unique sequencing chemistry and technology. For instance, Illumina sequencing utilizes reversible terminators andfluorescently labeled nucleotides to determine the sequence of the DNA fragments.During the sequencing run, the DNA fragments in the library are sequenced in parallel, generating millions of short reads. These short reads are then processed andanalyzed using bioinformatics tools to reconstruct the original DNA sequence. This involves aligning the reads to a reference genome or assembling the reads de novo to generate a consensus sequence.Once the sequencing run is completed, the data analysis step begins. This involves quality control, read mapping, variant calling, and downstream analysis. Quality control ensures that the sequencing data is of high quality and free from artifacts. Read mapping involves aligning the reads to a reference genome to determine their genomic location. Variant calling identifies genetic variations, such as single nucleotide polymorphisms (SNPs) orinsertions/deletions (indels), in the sequenced DNA.In summary, the workflow of a second-generation sequencing experiment involves DNA sample preparation, library construction, sequencing, and data analysis. This process allows us to obtain high-quality DNA sequences and gain insights into the genetic makeup of an organism.中文回答:测序是分子生物学中的一项基本技术,可以确定DNA分子中核苷酸的顺序。
primerTree 1.0.6 用户指南说明书
Package‘primerTree’October14,2022Title Visually Assessing the Specificity and Informativeness of PrimerPairsVersion1.0.6Description Identifies potential target sequences for a givenset of primers and generates phylogenetic trees annotated with thetaxonomies of the predicted amplification products.License GPL-2Depends R(>=3.5.0),directlabels,gridExtraImports ape,foreach,ggplot2,grid,httr,lubridate,plyr,reshape2,scales,stringr,XMLEncoding UTF-8LazyData trueRoxygenNote7.1.2NeedsCompilation yesAuthor Jim Hester[aut],Matt Cannon[aut,cre]Maintainer Matt Cannon<********************>Repository CRANDate/Publication2022-04-0514:30:02UTCR topics documented:accession2taxid (2)bryophytes_trnL (2)calc_rank_dist_ave (3)clustalo (4)filter_seqs (4)get_sequence (5)get_sequences (6)get_taxonomy (7)identify.primerTree_plot (7)12bryophytes_trnL layout_tree_ape (8)mammals_16S (8)parse_primer_hits (8)plot.primerTree (9)plot_tree (9)plot_tree_ranks (10)primerTree (11)primer_search (12)search_primer_pair (13)seq_lengths (14)seq_lengths.primerTree (15)summary.primerTree (15)tree_from_alignment (16)Index17accession2taxid Maps a nucleotide database accession to a taxonomy database taxIdDescriptionMaps a nucleotide database accession to a taxonomy database taxIdUsageaccession2taxid(accessions)Argumentsaccessions accessions character vector to lookup.Valuenamed vector of taxIds.bryophytes_trnL PrimerTree results for the bryophyte trnL primersDescriptionPrimerTree results for the bryophyte trnL primerscalc_rank_dist_ave3 calc_rank_dist_ave Summarize pairwise differences.DescriptionSummarize pairwise differences.Usagecalc_rank_dist_ave(x,ranks=common_ranks)Argumentsx a primerTree objectranks ranks to show unique counts for,defaults to the common ranksDetailsThe purpose of this function is to calculate the average number of nucleotide differences between species within each taxa of given taxonomic level.For example,at the genus level,the function calculates the average number of nucleotide differences between all species within each genus and reports the mean of those values.There are several key assumptions and calculations made in this function.First,the function randomly selects one sequence from each species in the primerTree results.This is to keep any one species(e.g.human,cow,etc.)with many hits from skewing the results.Second,for each taxonomic level tested,the function divides the sequences by each taxon at that level and calculates the mean number of nucleotide differences within that taxa,then returns the mean of those values.Third,when calculating the average distance,any taxa for which there is only one species is omitted, as the number of nucleotide differences will always be0.Valuereturns a data frame of resultsExamples##Not run:calc_rank_dist_ave(mammals_16S)calc_rank_dist_ave(bryophytes_trnL)#Note that the differences between the results from these two primers#the mean nucleotide differences is much higher for the mammal primers#than the byrophyte primers.This suggests that the mammal primers have#better resolution to distinguish individual species.##End(Not run)4filter_seqs clustalo Multiple sequence alignment with clustal omegaDescriptionCalls clustal omega to align a set of sequences of class DNAbin.Run without any arguments to see all the options you can pass to the command line clustal omega.Usageclustalo(x,exec="clustalo",quiet=TRUE,original.ordering=TRUE,...)Argumentsx an object of class’DNAbin’exec a character string with the name or path to the programquiet whether to supress output to stderr or stdoutoriginal.orderinguse the original ordering of the sequences...additional arguments passed to the command line clustalofilter_seqs Filter out sequences retrieved by search_primer_pair()that are eithertoo short or too long.The alignment and tree will be recalculated afterremoving unwanted reads.DescriptionFilter out sequences retrieved by search_primer_pair()that are either too short or too long.The alignment and tree will be recalculated after removing unwanted reads.Usagefilter_seqs(x,...)##S3method for class primerTreefilter_seqs(x,min_length=0,max_length=Inf,...)Argumentsx a primerTree object...additional arguments passed to methods.min_length the minimum sequence length to keepmax_length the maximum sequence length to keepValuea primerTree objectMethods(by class)•primerTree:Method for primerTree objectsExamples##Not run:#filter out sequences longer or shorter than desired:mammals_16S_filtered<-filter_seqs(mammals_16S,min_length=131,max_length=156)##End(Not run)get_sequence Retrieves a fasta sequence from NCBI nucleotide database.DescriptionRetrieves a fasta sequence from NCBI nucleotide database.Usageget_sequence(accession,start=NULL,stop=NULL,api_key=Sys.getenv("NCBI_API_KEY"))Argumentsaccession nucleotide accession to retrieve.start start base to retrieve,numbered beginning at1.If NULL the beginning of the sequence.stop last base to retrieve,numbered beginning at1.if NULL the end of the sequence.api_key NCBI api-key to allow faster sequence retrieval.Valuean DNAbin object.See AlsoDNAbinget_sequences Retrieves fasta sequences from NCBI nucleotide database.DescriptionRetrieves fasta sequences from NCBI nucleotide database.Usageget_sequences(accession,start=NULL,stop=NULL,api_key=Sys.getenv("NCBI_API_KEY"),simplify=TRUE,.parallel=FALSE,.progress="none")Argumentsaccession the accession number of the sequence to retrievestart start bases to retrieve,numbered beginning at1.If NULL the beginning of the sequence.stop stop bases to retrieve,numbered beginning at1.if NULL the stop of the se-quence.api_key NCBI api-key to allow faster sequence retrieval.simplify simplify the FASTA headers to include only the genbank accession..parallel if’TRUE’,perform in parallel,using parallel backend provided by foreach .progress name of the progress bar to use,see’create_progress_bar’Valuean DNAbin object.See AlsoDNAbinget_taxonomy7get_taxonomy Retrieve the taxonomy information from NCBI for a set of nucleotidegis.DescriptionRetrieve the taxonomy information from NCBI for a set of nucleotide gis.Usageget_taxonomy(accessions)Argumentsaccessions a character vector of the accessions to retrieveValuedata.frame of the’accessions,taxIds,and taxonomyidentify.primerTree_plotidentify the point closest to the mouse click only works on single ranksDescriptionidentify the point closest to the mouse click only works on single ranksUsage##S3method for class primerTree_plotidentify(x,...)Argumentsx the plot to identify...additional arguments passed to annotate8parse_primer_hits layout_tree_ape layout a tree using ape,return an object to be plotted by plot_treeDescriptionlayout a tree using ape,return an object to be plotted by plot_treeUsagelayout_tree_ape(tree,...)Argumentstree The phylo tree to be plotted...additional arguments to plot.phyloValueedge list of x,y and xend,yend coordinates as well as ids for the edgestips list of x,y,label and id for the tipsnodes list of x,y and id for the nodesmammals_16S PrimerTree results for the mammalian16S primersDescriptionPrimerTree results for the mammalian16S primersparse_primer_hits Parse the primer hitsDescriptionParse the primer hitsUsageparse_primer_hits(response)Argumentsresponse a httr response object obtained from primer_searchplot.primerTree9 plot.primerTree plot function for a primerTree object,calls plot_tree_ranksDescriptionplot function for a primerTree object,calls plot_tree_ranksUsage##S3method for class primerTreeplot(x,ranks=NULL,main=NULL,...)Argumentsx primerTree object to plotranks The ranks to include,defaults to all common ranks,if NULL print all ranks.If ’none’just print the layout.main an optional title to display,if NULL displays the name as the title...additional arguments passed to plot_tree_ranksSee Alsoplot_tree_ranks,plot_treeExampleslibrary(gridExtra)library(directlabels)#plot with all common ranksplot(mammals_16S)#plot only the classplot(mammals_16S, class )#plot the layout onlyplot(mammals_16S, none )plot_tree plots a tree,optionally with colored and labeled points by taxonomicrankDescriptionplots a tree,optionally with colored and labeled points by taxonomic rank10plot_tree_ranksUsageplot_tree(tree,type="unrooted",main=NULL,guide_size=NULL,rank=NULL,taxonomy=NULL,size=2,legend_cutoff=25,...)Argumentstree to be plotted,use layout_tree to layout tree.type The type of tree to plot,default unrooted.main An optional title for the plotguide_size The size of the length guide.If NULL auto detects a reasonable size.rank The rank to include,if null only the tree is plottedtaxonomy A data.frame with an accessionfield corresponding to the tree tip labels.size The size of the colored pointslegend_cutoff The number of different taxa names after which the names are no longer printed....additional arguments passed to layout_tree_apeValueplot to be printed.plot_tree_ranks plots a tree along with a series of taxonomic ranksDescriptionplots a tree along with a series of taxonomic ranksUsageplot_tree_ranks(tree,taxonomy,main=NULL,type="unrooted",ranks=common_ranks,primerTree11 size=2,guide_size=NULL,legend_cutoff=25,...)Argumentstree to be plotted,use layout_tree to layout tree.taxonomy A data.frame with an accessionfield corresponding to the tree tip labels.main An optional title for the plottype The type of tree to plot,default unrooted.ranks The ranks to include,defaults to all common ranks,if null print all ranks.size The size of the colored pointsguide_size The size of the length guide.If NULL auto detects a reasonable size.legend_cutoff The number of different taxa names after which the names are no longer printed....additional arguments passed to layout_tree_apeSee Alsoplot_tree to plot only a single rank or the just the tree layout.Exampleslibrary(gridExtra)library(directlabels)#plot all the common ranksplot_tree_ranks(mammals_16S$tree,mammals_16S$taxonomy)#plot specific ranks,with a larger dot sizeplot_tree_ranks(mammals_16S$tree,mammals_16S$taxonomy,ranks=c( kingdom , class , family ),size=3)primerTree primerTree Visually Assessing the Specificity and Informativeness ofPrimer PairsDescriptionprimerTree has two main commands:search_primer_pair which takes a primer pair and re-turns an primerTree object of the search results plot.primerTree a S3method for plotting the primerTree object obtained using search_primer_pair12primer_searchprimer_search Query a pair of primers using ncbi’s Primer-BLAST,if primers containiupacDescriptionambiguity codes,enumerate all possible combinations and combine the results.Usageprimer_search(forward,reverse,num_aligns=500,num_permutations=25,...,.parallel=FALSE,.progress="none")Argumentsforward forward primer to search by5’-3’on plus strandreverse reverse primer to search by5’-3’on minus strandnum_aligns number of alignment results to keepnum_permutationsthe number of primer permutations to search,if the degenerate bases cause morethan this number of permutations to exist,this number will be sampled from allpossible permutations....additional arguments passed to Primer-Blast.parallel if’TRUE’,perform in parallel,using parallel backend provided by foreach .progress name of the progress bar to use,see’create_progress_bar’Valuehttr response object of the query,pass to parse_primer_hits to parse the results.search_primer_pair13search_primer_pair Automatic primer searching Search a given primer pair,retrieving thealignment results,their product sequences,the taxonomic informationfor the sequences,a multiple alignment of the productsDescriptionAutomatic primer searching Search a given primer pair,retrieving the alignment results,their prod-uct sequences,the taxonomic information for the sequences,a multiple alignment of the productsUsagesearch_primer_pair(forward,reverse,name=NULL,num_aligns=500,num_permutations=25,simplify=TRUE,clustal_options=list(),distance_options=list(model="N",pairwise.deletion=T),api_key=Sys.getenv("NCBI_API_KEY"),...,.parallel=FALSE,.progress="none")Argumentsforward forward primer to search by5’-3’on plus strandreverse reverse primer to search by5’-3’on minus strandname name to give to the primer pairnum_aligns number of alignment results to keepnum_permutationsthe number of primer permutations to search,if the degenerate bases cause morethan this number of permutations to exist,this number will be sampled from allpossible permutations.simplify use simple names for primer hit results or complexclustal_optionsa list of options to pass to clustal omega,see link{clustalo}for a list ofoptionsdistance_optionsa list of options to pass to dist.dna,see link{dist.dna}for a list of optionsapi_key NCBI api-key to allow faster sequence retrieval...additional arguments passed to Primer-Blast14seq_lengths .parallel if’TRUE’,perform in parallel,using parallel backend provided by foreach .progress name of the progress bar to use,see create_progress_barValueA list with the following elements,name name of the primer pairBLAST_result html blast results from Primer-BLAST as’a responseobject.taxonomy taxonomy for the primer products from NCBIsequence sequence of the primer productsalignment multiple alignment of the primer productstree phylogenetic tree of the reconstructed from the’multiple alignmentSee Alsoprimer_search,clustaloExamples##Not run:#simple searchmammals_16S=search_primer_pair(name= Mammals16S ,CGGTTGGGGTGACCTCGGA , GCTGTTATCCCTAGGGTAACT )#returning1000alignments,allow up to3mismatches in primermammals_16S=search_primer_pair(name= Mammals16S ,CGGTTGGGGTGACCTCGGA , GCTGTTATCCCTAGGGTAACT ,num_aligns=1000,total_primer_specificity_mismatch=3)##End(Not run)seq_lengths Get a summary of sequence lengths from a primerTree objectDescriptionGet a summary of sequence lengths from a primerTree objectUsageseq_lengths(x,summarize=TRUE)Argumentsx a primerTree object.summarize a logical indicating if a summary should be displayedseq_lengths.primerTree15Valuea table of sequence length frequenciesExamples#Show the counts for each lengthseq_lengths(mammals_16S)#Plot the distribution of lengthsseqLengths<-seq_lengths(mammals_16S)barplot(seqLengths,main="Frequency of sequence lengths for16S mammal primers",xlab="Amplicon length(in bp)",ylab=("Frequency"))seq_lengths.primerTreeMethod for primerTree objectsDescriptionMethod for primerTree objectsUsage##S3method for class primerTreeseq_lengths(x,summarize=TRUE)Argumentsx a primerTree object.summarize a logical indicating if a summary should be displayedsummary.primerTree Summarize a primerTree result,printing quantiles of sequence lengthand pairwise differences.DescriptionSummarize a primerTree result,printing quantiles of sequence length and pairwise differences. Usage##S3method for class primerTreesummary(object,...,probs=c(0,0.05,0.5,0.95,1),ranks=common_ranks)16tree_from_alignmentArgumentsobject the primerTree object to summarise...Ignored optionsprobs quantile probabilities to compute,defaults to0,5,50,95,and100probabilities.ranks ranks to show unique counts for,defaults to the common ranksValueinvisibly returns a list containing the printed resultstree_from_alignment Construct a neighbor joining tree from a dna alignmentDescriptionConstruct a neighbor joining tree from a dna alignmentUsagetree_from_alignment(dna,pairwise.deletion=TRUE,...)Argumentsdna fasta dna object the tree is to be constructed frompairwise.deletiona logical indicating if the distance matrix should be constructed using pairwisedeletion...furthur arguments to dist.dnaSee Alsodist.dna,njIndexaccession2taxid,2bryophytes_trnL,2calc_rank_dist_ave,3clustalo,4,14create_progress_bar,14dist.dna,16DNAbin,5,6filter_seqs,4get_sequence,5get_sequences,6get_taxonomy,7identify.primerTree_plot,7layout_tree_ape,8,10,11mammals_16S,8nj,16parse_primer_hits,8,12phylo,8plot.phylo,8plot.primerTree,9,11plot_tree,8,9,9,11plot_tree_ranks,9,10primer_search,8,12,14primerTree,11response,14search_primer_pair,11,13seq_lengths,14seq_lengths.primerTree,15summary.primerTree,15tree_from_alignment,1617。
设计引物原则以及如何查找基因序列
引物设计原则:1.找出这种细胞物种的PTN全长核苷酸序列2.采用primer premier 5.0软件设计引物设计应注意如下要点:● 1. 引物的长度一般为15-30 bp,常用的是18-27 bp,但不应大于38,因为过长会导致其延伸温度大于74℃,不适于Taq DNA聚合酶进行反应[2]。
● 2. 引物序列在模板内应当没有相似性较高,尤其是3’端相似性较高的序列,否则容易导致错配。
引物3’端出现3个以上的连续碱基,如GGG或CCC,也会使错误引发机率增加[2]。
● 3. 引物3’端的末位碱基对Taq酶的DNA合成效率有较大的影响。
不同的末位碱基在错配位置导致不同的扩增效率,末位碱基为A的错配效率明显高于其他3个碱基,因此应当避免在引物的3’端使用碱基A[3][4]。
另外,引物二聚体或发夹结构也可能导致PCR 反应失败。
5’端序列对PCR影响不太大,因此常用来引进修饰位点或标记物[2]。
● 4. 引物序列的GC含量一般为40-60%,过高或过低都不利于引发反应。
上下游引物的GC含量不能相差太大[2][5]。
● 5. 引物所对应模板位置序列的Tm值在72℃左右可使复性条件最佳。
Tm值的计算有多种方法,如按公式Tm=4(G+C)+2(A+T),在Oligo软件中使用的是最邻近法(the nearest neighbor method) [6][7]。
● 6. ΔG值是指DNA双链形成所需的自由能,该值反映了双链结构内部碱基对的相对稳定性。
应当选用3’端ΔG值较低(绝对值不超过9),而5’端和中间ΔG值相对较高的引物。
引物的3’端的ΔG值过高,容易在错配位点形成双链结构并引发DNA聚合反应[6]。
●7. 引物二聚体及发夹结构的能值过高(超过4.5kcal/mol)易导致产生引物二聚体带,并且降低引物有效浓度而使PCR反应不能正常进行[8]。
●8. 对引物的修饰一般是在5’端增加酶切位点,应根据下一步实验中要插入PCR产物的载体的相应序列而确定。
faure sequence 的python实现 -回复
faure sequence 的python实现-回复Faure序列的Python实现Faure序列是一种用于生成高维随机数的序列,并被广泛应用于蒙特卡洛方法、数值积分和优化问题等领域。
本文将以Faure序列的Python实现为主题,逐步解释如何生成Faure序列,并介绍其在实际应用中的一些特性。
Faure序列是基于一类特殊的归一化分数函数的序列。
通过一系列变换,Faure序列能够更加均匀地填充高维空间,从而提供更好的随机样本。
在实现Faure序列之前,我们需要了解一些相关概念。
首先,我们需要知道Faure序列是如何定义的。
Faure序列的定义基于一种称为基数的概念。
基数是一个正整数,用于标识序列中的元素的索引。
对于一个给定的基数,每个索引对应的元素都是一个唯一的序列。
在Faure序列中,基数必须是一个素数,并且选取适当的素数可以显著影响序列的质量。
为了生成Faure序列,我们需要一个庞大的基数集合,这些基数之间必须互质。
通常情况下,我们选择的基数集合是一个包含素数的序列。
对于一个给定的素数,Faure序列的定义就是基于这个素数的归一化分数函数。
现在我们来实现Faure序列的Python代码。
首先,我们需要导入一些数学函数和工具库,例如numpy和sympy。
numpy是一个常用的数值计算库,而sympy则提供了一些符号计算的功能。
pythonimport numpy as npfrom sympy import ZZ, Matrix, symbols, simplifydef faure_sequence(d, n):prime_set = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29] # 基数集合bases = prime_set[:d] # 所需基数的子集seq = np.zeros((d, n)) # 存储序列的数组for i in range(d):p = bases[i]inv_p = pow(p, -1, p)seq[i] = [(inv_pj p) / p for j in range(n)]return seq在这段代码中,我们定义了一个名为`faure_sequence`的函数,该函数接受两个参数,`d`表示维度,`n`表示所需的序列长度。
PHYLIP和PAUP建立系统树的详细步骤
1距离法构建系统系树1.将要分析的所有序列存在“txt”文件里。
序列名为“>XXXXX”,“X”不能为汉字、标点。
2.打开CLUSTALX.EXE 软件,①file→→load sequences→→选择你要比对的序列文件;②alignment→→output format options→→在Output Files里选择PHYLIP format(Phylip 建树用的文件格式)、NEXUS format(PAUP 建树用的格式),CLUSTAL format(可以直接看到树),可以多选几个,下边的两个OFF 改成ON →→CLOSE.。
③alignment→→do complete alignment,会自动产生3个文件:“*.nxs”(PAUP 建树用的格式)、“*.aln”、“*.phy”(Phylip 建树用的文件格式)。
此时简单的方法是将“*.aln”文件直接拉到TREEVIEW中,即可产生一个树状图(距离分析法NJ 模式的树状图)。
我们一般不用,只是参考。
3.手工修改,用BioEdit软件打开“*.phy”文件进行手工修改。
(这步很重要)。
4.打开SALAMAND软件。
5. 在SALAMAND软件下打开打开PHYLIP软件,将先前已存好的*.phy文件复制到PHYLIP 下并重命名为infile文件。
6. 运行SEQBOOT.exe文件,给一个运行数字,一般是4N+1,以保证每次都按此数字运行。
然后按R后回车,以更改重复次数,一般不低于1000,然后按Y回车。
这样就自动产生一个outfile文件,按F3查看此文件,将outfile 改为infile。
7.距离分析法:运行DNADIST,进入DNADIST后,进行如下操作:①按D,选择一种运算方法,有四种距离模式可以选择,分别是Kimura 2-parameter、Jin/Nei、Maximum-likelihood 和Jukes-Cantor(J-C) ,可以任选一种。
序列搜索_比对以及进化树的构建
Clustalx的输出结果
• .aln格式文件
– 这个文件是默认输出,可以转换成各种格式, 而且很多软件都支持这种格式。
• .dnd格式文件
– 引导树。就是根据两两序列相似值构建的一个 指导后面多重联配的启发树 – 不能做进化分析。进化分析要考虑的所有同源 位点的一个综合效应,因此应该用.aln格式文 件专门做进化分析。
• Blastn : 应该是出现较早的算法。比对的速度慢, 但允许更短序列的比对(如短到7个碱基的序列)。 • MEGABLAST : 主要用来鉴定一段新的核酸序列, 它并不注重比对各个碱基的不同和序列片断的同 源性,而只注重被比对序列是否是数据库未收录 的,是否为新的提交序列或基因。 速度快。同一 物种间的。 • Discontiguous MEGABLAST : 灵敏度 (sensitivity)更高,用于更精确的比对。主要用 于跨物种之间的同源比对。
• dnadist 计算核苷酸距离矩阵 • 把刚才的outfile改名,如dnadistinfile • 双击dnadist,输入dnadistinfile,回车
输入D,选择模型, 如改成kimura-2 输入M,然后输入 D,再输入1000, 和上面步骤要一致 即自举值 bootstrap=1000
• NCBI负责管理GenBank。 GenBank是
美国国立卫生研究院维护的基因序列数据库, 汇集并注释了所有公开的核酸序列。
• GenBank与日本DNA数据库(DNA Data Bank of Japan, DDBJ)以及欧洲生物信息研究所的欧洲 分子生物学实验室核苷酸数据库(European Molecular Biology Laboratory, EMBL),所有这 3个中心都可以独立地接受数据提交,而3个中心 之间则逐日交换信息,并制成相同的充分详细的 数据库向公众开放。因此他们是相等的。
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
Sequence-length requirements for phylogenetic methodsBernard M.E.Moret1,Usman Roshan2,and Tandy Warnow21Department of Computer Science,University of New Mexico,Albuquerque,NM87131moret@2Department of Computer Sciences,University of Texas,Austin,TX78712{usman,tandy}@Abstract.We study the sequence lengths required by neighbor-joining,greedyparsimony,and a phylogenetic reconstruction method(DCM NJ+MP)based ondisk-covering and the maximum parsimony criterion.We use extensive simula-tions based on random birth-death trees,with controlled deviations from ultra-metricity,to collect data on the scaling of sequence-length requirements for eachof the three methods as a function of the number of taxa,the rate of evolutionon the tree,and the deviation from ultrametricity.Our experiments show thatDCM NJ+MP has consistently lower sequence-length requirements than the othertwo methods when trees of high topological accuracy are desired,although allmethods require much longer sequences as the deviation from ultrametricity orthe height of the tree grows.Our study has significant implications for large-scalephylogenetic reconstruction(where sequence-length requirements are a crucialfactor),but also for future performance analyses in phylogenetics(since devia-tions from ultrametricity are proving pivotal).1IntroductionThe inference of phylogenetic trees is basic to many research problems in bi-ology and is assuming more importance as the usefulness of an evolutionary perspective is becoming more widely understood.The inference of very large phylogenetic trees,such as might be needed for an attempt on the“Tree of Life,”is a big computational challenge.Indeed,such large-scale phylogenetic analysis is currently beyond our reach:the most pop-ular and accurate approaches are based on hard optimization problems such as maximum parsimony and maximum likelihood.Even heuristics for these opti-mization problems sometimes take weeks tofind local optima;some datasets have been analyzed for years without a definite global optimum being found.Polynomial-time methods would seem to offer a reasonable alternative,since they can be run to completion even on large datasets.Recent simulation studies have examined the performance of fast methods(specifically,the distance-based method neighbor-joining[21]and a simple greedy heuristic for the maximum parsimony problem)and have shown that both methods might perform rather well(with respect to topological accuracy),even on very large model trees.These studies have focused on trees of low diameter that also obey the strong molecular clock hypothesis(so that the expected number of evolutionary events is roughly proportional to time).Our study seeks to determine whether these, and other,polynomial-time methods really do scale well with increasing num-ber of taxa under more difficult conditions—when the model trees have large diameters and do not obey the molecular clock hypothesis.Our paper is organized as follows.In Section2we review earlier studies and the issues they addressed.In Section3,we define our terms and present the basic concepts.In Section4we outline our experimental design and in Section5 we report the results of our experiments.We conclude with some remarks on the implications of our work.2BackgroundThe sequence-length requirement of a method is the sequence length that this method needs in order to reconstruct the true tree topology with high probabil-ity.Earlier studies established analytical upper bounds on the sequence-length requirements of various methods,including the popular neighbor-joining(NJ) [21]method.These studies showed that standard methods,such as NJ,recover the true tree with high probability from sequences of lengths that are exponen-tial in the evolutionary diameter of the true tree.Based upon these studies,we defined a parameterization of model trees in which the longest and shortest edge lengths arefixed[6,7],so that the sequence-length requirement of a method can be expressed as a function of the number,n,of taxa.This parameterization led us to define fast-converging methods,methods that recover the true tree with high probability from sequences of lengths bounded by a polynomial in n once f and g,the minimum and maximum edge lengths,are bounded.Several fast-converging methods were subsequently developed[4,5,11,24].We and others analyzed the sequence-length requirements of standard methods,such as NJ, under the assumptions that f and g arefixed.These studies[1,7]showed that NJ and many other methods can recover the true tree with high probability when given sequences of lengths bounded by a function that grows exponentially in n.We then performed studies on a different parameterization of the model tree space,where wefixed the evolutionary diameter of the tree and let the number of taxa vary[17].This parameterization,suggested to us by J.Huelsenbeck, allowed us to examine the differential performance of methods with respect to “taxon-sampling”strategies[10].In these studies,we evaluated the performance of NJ and DCM NJ+MP on simulated data,obtained from random birth-death trees with bounded deviation from ultrametricity.We found that DCM NJ+MP consistently dominated NJ throughout the parameter space;we also found thatthe difference in performance increased as the number of taxa increased or the evolutionary rate increased.In previous studies[15]we had studied the sequence-length requirements for NJ,Weighbor[3],Greedy MP,and DCM NJ+MP as a function of the num-bers of taxa,withfixed height and deviation.We had found that DCM NJ+MP requires shorter sequences than the other four methods when trees of high topo-logical accuracy are desired.However,we had used datasets of modest size and had not explored a very large range of evolutionary rates and deviations from ultrametricity.In this paper we remedy these limitations.3Basics3.1Simulation studySimulation studies are the standard technique used in phylogenetic performance studies[9,10,14].In a simulation study,a DNA sequence at the root of the model tree is evolved down the tree under some assumed stochastic model of evolution.This process generates a set of sequences at the leaves of the tree. The sequences are then given to phylogenetic reconstruction methods,with each method producing a tree for the set of sequences.These reconstructed trees are then compared against the model tree for topological accuracy.The process is repeated many times in order to obtain a statistically significant test of the performance of the methods under these conditions.3.2Model treesWe used random birth-death trees as model trees for our experiments.They were generated using the program r8s[22],where we specified the option to generate birth-death trees using a backward Yule(coalescent)process with waiting times taken from an exponential distribution.An edge e in a tree generated by this process has lengthλe,a value that indicates the expected number of times a ran-dom site changes on the edge.Trees produced by this process are by construc-tion ultrametric(that is,the path length from the root to each leaf is identical). Furthermore,a random site is expected to change once on any root-to-leaf path; that is,the(evolutionary)height of the tree is1.In our experiments we scale the edge lengths of the trees to give differ-ent heights.To scale the tree to height h,we multiply the length of each edge of T by h.We also modify the edge lengths as follows in order to deviate the tree away from ultrametricity.First we pick a deviation factor,c;then,for each edge e∈T,we choose a number,x,uniformly at random from the in-terval[−lg(c),lg(c)].We then multiply the branch lengthλe by e x.For de-viation factor c,the expected deviation is(c−1/c)/2lnc,with a variance of(c2−1c ) /4ln c.For instance,for an expected deviation of1.36,the standard deviation has the rather high value of1.01.3.3DNA sequence evolutionWe conducted our experiments under the Kimura2-parameter[13]+Gamma [25]model of DNA sequence evolution,one of the standard models for studying the performance of phylogenetic reconstruction methods.In this model a site evolves down the tree under the Markov assumption and all site substitutions are partitioned into two classes:transitions,which substitute a purine for a purine; and transversions,which substitute a pyrimidine for a pyrimidine.This model has a parameter to indicate the transition/transversion ratio;in our experiments, we set it to2(the standard setting).We assume that the sites have different rates of evolution drawn from a known distribution.In our experiments we use the gamma distribution with shape parameterα(which we set to1,the standard setting),which is the in-verse of the coefficient of variation of the substitution rate.3.4Measures of accuracySince the trees returned by each of the three methods are binary,we use the Robinson-Foulds(RF)distance[20],which is defined as follows.Every edge e in a leaf-labeled tree T defines a bipartition,πe,of the leaves(induced by the deletion of e);the tree T is uniquely encoded by the set C(T)={πe:e∈E(T)}, where E(T)is the set of all internal edges of T.If T is a model tree and T′is the tree obtained by a phylogenetic reconstruction method,then the set of False Positives is C(T′)−C(T)and the set of False Negatives is C(T)−C(T′).The RF distance is then the average of the number of false positives and the number of false negatives.We plot the RF rates in ourfigures,which are obtained by dividing the RF distance by n−3,the number of internal edges in a binary tree on n leaves.Thus the RF rate varies between0and1(or0%and100%).Rates below5%are quite good,while rates above25%are unacceptably large.We focus our attention on the sequence lengths needed to get at least75%accuracy.3.5Phylogenetic reconstruction methodsNeighbor-Joining Neighbor-Joining(NJ)[21]is one of the most popular poly-nomial time distance-based methods[23].The basic technique is as follows:first,a matrix of estimated distances between every pair of sequences is com-puted.(This step is standardized for each model of evolution,and takes O(kn2) time,where k is the sequence length and n the number of taxa.)The NJ methodthen uses a type of agglomerative clustering to construct a tree from this ma-trix of distances.The NJ method is provably statistically consistent under most models of evolution examined in the phylogenetics literature,and in particular under the K2P model that we use.Greedy Maximum Parsimony Maximum Parsimony is an NP-hard optimiza-tion problem for phylogenetic tree reconstruction[8].Given a set of aligned sequences(in our case,DNA sequences),the objective is to reconstruct a tree with leaves labelled by the input sequences and internal nodes labelled by other sequences so as to minimize the total“length”of the tree.This length is calcu-lated by summing up the Hamming distances on the edges.Although the MP problem is NP-hard,many effective heuristics(most based on iterative local re-finement)are offered in popular software packages.Earlier studies[19,2]have suggested that a very simple greedy heuristic for MP might produce as accurate a reconstructed tree as the best polynomial-time methods.Our study tests this hypothesis by explicitly looking at this greedy heuristic.The greedy heuristic begins by randomly shuffling the sequences and building an optimal tree on the first four sequences.Each successive sequence is inserted into the current tree, so as to minimize the total length of the resulting,larger tree.This method takes O(kn2)time,sincefinding the best place to insert the next sequence takes O(kn) time,where k is the sequence length and n is the number of taxa.DCM NJ+MP The DCM NJ+MP method was developed by us in a series of papers[16].In our previous simulation studies,DCM NJ+MP outperformed,in terms of topological accuracy,both DCM∗NJ(a provably fast-converging method of which DCM NJ+MP is a variant)and NJ.Let d i j represent the distance be-tween taxa i and j;DCM NJ+MP operates in two phases.–Phase1:For each choice of threshold q∈{d i j},compute a binary tree T q, by using the Disk-Covering Method from[11],followed by a heuristic for refining the resultant tree into a binary tree.Let T={T q:q∈{d i j}}.(For details of Phase I,the reader is referred to[11].)–Phase2:Select the tree from T that optimizes the maximum parsimony criterion.If we consider all n2 values for the threshold q in Phase1,then DCM NJ+MP takes O(n6)time;however,if we consider only afixed number p of thresholds, it only takes O(pn4)time.In our experiments we use p=10(and p=5for 800taxa),so that the running time of DCM NJ+MP is O(n4).Experiments(not shown here)indicate that choosing10thresholds does not reduce the perfor-mance of the method significantly(in terms of topological accuracy).4Experimental DesignWe explore a broad portion of the parameter space in order to understand the fac-tors that affect topological accuracy of the three reconstruction methods under study.In particular,we are interested in the type of datasets that might be needed in order to attempt large-scale phylogenetic analyses that span distantly related taxa,as will be needed for the“Tree of Life.”Such datasets will have large heights and large numbers of taxa.Therefore we have created random model trees with heights that vary from quite small(0.05)to quite large(up to4),and with a number of taxa ranging from100to800.We have also explored the ef-fect of deviations from the molecular clock,by using model trees that obey the molecular clock(expected deviation1)as well as model trees with significant violations of the molecular clock(expected deviation2.87).In order to keep the experiment tractable,we limited our analysis to datasets in which the sequence lengths are bounded by16,000;this has the additional benefit of not exceeding by too much what we might expect to have available in a phylogenetic analysis in the coming years.In contrast the trees in[2]have smaller heights(up to a maximum of0.5)and smaller expected deviations with less variance(their ac-tual deviation is truncated if it exceeds a preset value),providing much milder experimental conditions than ours.The trees studied in[19]are even simpler, being ultrametric(no deviation from molecular clock)with a very small height of0.15.For the two distance-based methods(NJ and DCM NJ+MP),we computed evolutionary distances appropriately for the K2P+gamma model with our set-tings forαand the transition/transversion ratio.Because we explore perfor-mance on datasets with high rates of evolution,we had to handle cases in which the standard distance correction cannot be applied because of saturation.In such a case,we used the technique developed in[12],called“fix-factor1:”distances that cannot be set using the distance-correction formula are simply assigned the largest corrected distance in the matrix.We used seqgen[18]to generate the sequences,and PAUP*4.0to construct Greedy MP trees as described.The software for the basic DCM NJ was written by D.Huson.We generated the rest of the software(a combination of C++ programs and Perl scripts)explicitly for these experiments.For job management across the cluster and public laboratory machines,we used the Condor software package.5Experimental resultsWe present a cross-section of our results in three major categories.All address the sequence lengths required by each reconstruction method to reach prescribedlevels of topological accuracy on at least80%of the reconstructed trees as a function of three major parameters,namely the deviation from ultrametric-ity,the height of the tree,and the number of taxa.We compute the sequence-length requirements to achieve a given accuracy as follows.For each method and each collection of parameter values(number of taxa,height,and deviation), we perform40runs to obtain40FN rates at each sequence length of the input dataset.We then use linear interpolation on each of the40curves of FN rates vs.sequence length to derive the sequence-length requirement at a given accu-racy level.(Some curves may not reach the desired accuracy level even with sequences of length16,000,in which case we exclude those runs.)We then average the computed sequence lengths to obtain the average sequence-length requirement at the given settings.Points not plotted in our curves indicate that less than80%of the trees returned by the method did not have the required topological accuracy even at16000characters.In all cases the relative requirements of the three methods follow the same ordering:DCM NJ+MP requires the least,followed by NJ,with greedy MP a distant third.However,even the best of the three,DCM NJ+MP,demonstrates a very sharp,exponential rise in its requirements as the deviation from ultra-metricity increases—and the worst of the three,Greedy MP,cannot be used at all with a deviation larger than1or with a height larger than0.5.Similarly sharp rises are also evident when the sequence-length requirements are plotted as a function of the height of the trees.(In contrast,increases in the number of taxa cause only mild increases in the sequence length requirements,even some decreases in the case of the greedy MP method.)5.1Requirements as a function of deviationFigure1shows plots of the sequence-length requirements of the three methods for various levels of accuracy as a function of the deviation.Most notable is the very sharp and fast rise in the requirements of all three methods as the devi-ation from ultrametricity increases.The greedy MP approach suffers so much from such increases that we could not obtain even80%accuracy at expected deviations larger than1.36.DCM NJ+MP shows the best behavior among the three methods,even though it requires unrealistically long sequences(signif-icantly longer than16,000characters)at high levels of accuracy for expected deviations above2.5.2Requirements as a function of heightFigure2parallels Figure1in all respects,but this time the variable is the height of the tree,with the expected deviation being held to a mild1.36.The rises inS e q u e n c e L e n g t hFig.1.Sequence-length requirements as a function of the deviation of the tree,for a height of 0.5,two numbers of taxa,and three levels of accuracy.S e q u e n c e L e n g t hFig.2.Sequence-length requirements as a function of the height of the tree,for an expected deviation of 1.36,two numbers of taxa,and three levels of accuracy.Height S e q u e n c e L e n g t h(a)80%accuracy (b)85%accuracyFig.3.Sequence-length requirements as a function of the height of the tree,for an expected deviation of 2.02,100taxa,and two levels of accuracy.sequence-length requirements are not quite as sharp as those observed in Fig-ure 1,but remain severe enough that accuracy levels beyond 85%are not achiev-able for expected deviations above 2.Once again,the best method is clearly DCM NJ +MP and greedy MP is clearly inadequate.Figure 3repeats the exper-iment with a larger expected deviation of 2.02.At this deviation value,greedy MP cannot guarantee 80%accuracy even with strings of length 16,000,so the plots only show curves for NJ and DCM NJ +MP ,the latter again proving best.But note that the rise is now sharper,comparable to that of Figure 1.Taken to-gether,these two figures suggest that the product of height and deviation is a very strong predictor of sequence-length requirements and that these require-ments grow exponentially as a function of that product.The number of taxa does not play a significant role,although,for the larger height,we note that increasing the number of taxa decreases sequence-length requirements—a be-havior that we attribute to the gradual disappearance of long tree edges under the influence of better taxon sampling.5.3Requirements as a function of the number of taxaFigure 4shows the sequence-length requirements as a function of the number of taxa over a significant range (up to 800taxa)at four different levels of accuracy for our three methods.The tree height here is quite small and the deviation mod-est,so that the three methods place only modest demands on sequence length for all but the highest level of accuracy.(Note that parts (a)and (b)of the figure use a different vertical scale from that used in parts (c)and (d).)At high ac-curacy (90%),NJ shows distinctly worse behavior than the other two methods,with nearly twice the sequence-length requirements.Figure 5focuses on just1002003004005006007008000200400600800100012001400160018002000Number of taxa S e q u e n c e L e n g t hNJ Greedy MP DCM NJ+MP(a)75%accuracy(b)80%accuracy(c)85%accuracy (d)90%accuracyFig.4.Sequence-length requirements as a function of the number of taxa,for an expected devia-tion of 1.36,a height of 0.05,and four levels of accuracy.Fig.5.Sequence-length requirements for Greedy MP (60runs)as a function of the number of taxa,for an expected deviation of 1.36,a height of 0.5,and two levels of accuracy.the greedy MP method for trees with larger height.On such trees,initial length requirements are quite high even at80%accuracy,but they clearly decrease as the number of taxa increases.This curious behavior can be explained by taxon sampling:as we increase the number of taxa for afixed tree height and devi-ation,we decrease the expected(and the maximum)edge length in the model tree.Breaking up long edges in this manner avoids one of the known pitfalls for MP:its sensitivity to the simultaneous presence of both short and long edges.Figure6gives a different view on our data.It shows how DCM NJ+MP and NJ are affected by the number of taxa as well as by the height of the model tree. Note that the curves are not stacked in order of tree heights because Figures2 and3show that the sequence-length requirements decrease from height0.05 to0.25and then start to increase gradually.The juxtaposition of the graphsNJFig.6.Sequence-length requirements(for at least70%of trees)as a function of the number of taxa,for an expected deviation of1.36,various heights,and two levels of accuracy,for each of DCM NJ+MP and NJ.clearly demonstrate the superiority of DCM NJ+MP over NJ:the curves for the former are both lower andflatter than those for the latter—that is,DCM NJ+MP can attain the same accuracy with significantly shorter sequences than does NJ and the growth rate in the length of the required sequences is negligible for DCM NJ+MP whereas it is clearly visible for NJ.6Conclusions and Future Work6.1SummaryWe have explored the performance of three polynomial-time phylogenetic re-construction methods,of which two(greedy maximum parsimony and neighbor-joining)are well known and widely used.In contrast to earlier studies,we have found that neighbor-joining and greedy maximum parsimony have very differ-ent performance on large trees and that both suffer substantially in the presence of significant deviations from ultrametricity.The third method,DCM NJ+MP, also suffers when the model tree deviates significantly from ultrametricity,but it is significantly more robust than either NJ or greedy MP.Indeed,for large trees(containing more than100taxa)with large evolutionary diameters(where a site changes more than once across the tree),even small deviations from ultra-metricity can be catastrophic for greedy MP and difficult for NJ.Several inferences can be drawn from our data.First,it is clearly not the case that greedy MP and NJ are equivalent:while they both respond poorly to increasing heights and deviations from ultrametricity,greedy MP has much the worse reaction.Secondly,neither of them has good performance on large trees with high diameters—both require more characters(longer sequences)than are likely to be easily obtained.Finally,the newer method,DCM NJ+MP,clearly outperforms both others and does so by a significant margin on large trees with high rates of evolution.6.2Future workA natural question is whether a better,albeit slower,heuristic for maximum parsimony might demonstrate better behavior.Other polynomial-time methods could also be explored in this same experimental setting,and some might out-perform our new method DCM NJ+MP.Maximum-likelihood methods,while typically slower,should also be compared to the maximum-parsimony methods; to date,next to nothing is known about their sequence-length requirements.Also of interest is the impact of various aspects of our experimental setup on the results.For example,a different way of deviating a model tree from the molecular clock might be more easily tolerated by these reconstructionmethods—and might also be biologically more realistic.The choice of the spe-cific model of site evolution(K2P)can also be tested,to see whether it has a significant impact on the relative performance of these(and other)methods.7AcknowledgmentsThis work is supported by the National Science Foundation under grants ACI 00-81404(Moret),EIA99-85991(Warnow),EIA01-13095(Moret),EIA01-21377(Moret),EIA01-21680(Warnow),and DEB01-20709(Moret and Warnow), and by the David and Lucile Packard Foundation(Warnow).References[1]K.Atteson.The performance of the neighbor-joining methods of phylogenetic reconstruc-tion.Algorithmica,25:251–278,1999.[2]O.R.P.Bininda-Emonds,S.G.Brady,J.Kim,and M.J.Sanderson.Scaling of accuracy inextremely large phylogenetic trees.In Proc.6th Pacific Symp.Biocomputing PSB2002,pages547–558.World Scientific Pub.,2001.[3]W.J.Bruno,N.Socci,and A.L.Halpern.Weighted neighbor joining:A likelihood-basedapproach to distance-based phylogeny reconstruction.Mol.Biol.Evol.,17(1):189–197,2000.[4]M.Cs˝u r¨o s.Fast recovery of evolutionary trees with thousands of nodes.To appear inRECOMB01,2001.[5]M.Cs˝u r¨o s and M.Y.Kao.Recovering evolutionary trees through harmonic greedy triplets.Proceedings of ACM-SIAM Symposium on Discrete Algorithms(SODA99),pages261–270,1999.[6]P.L.Erd˝o s,M.Steel,L.Sz´e k´e ly,and T.Warnow.A few logs suffice to build almost alltrees–I.Random Structures and Algorithms,14:153–184,1997.[7]P.L.Erd˝o s,M.Steel,L.Sz´e k´e ly,and T.Warnow.A few logs suffice to build almost alltrees–p.Sci.,221:77–118,1999.[8]L.R.Foulds and R.L.Graham.The Steiner problem in phylogeny is NP-complete.Ad-vances in Applied Mathematics,3:43–49,1982.[9]J.Huelsenbeck.Performance of phylogenetic methods in simulation.Syst.Biol.,44:17–48,1995.[10]J.Huelsenbeck and D.Hillis.Success of phylogenetic methods in the four-taxon case.Syst.Biol.,42:247–264,1993.[11] D.Huson,tles,and T.Warnow.Disk-covering,a fast-converging method for phylo-genetic tree put.Biol.,6:369–386,1999.[12] D.Huson,K.A.Smith,and T.Warnow.Correcting large distances for phylogenetic recon-struction.In Proceedings of the3rd Workshop on Algorithms Engineering(WAE),1999.London,England.[13]M.Kimura.A simple method for estimating evolutionary rates of base substitutionsthrough comparative studies of nucleotide sequences.J.Mol.Evol.,16:111–120,1980. [14]K.Kuhner and J.Felsenstein.A simulation comparison of phylogeny algorithms underequal and unequal evolutionary rates.Mol.Biol.Evol.,11:459–468,1994.[15]L.Nakhleh,B.M.E.Moret,U.Roshan,K.St.John,and T.Warnow.The accuracy of fastphylogenetic methods for large datasets.In Proc.7th Pacific Symp.Biocomputing PSB 2002,pages211–222.World Scientific Pub.,2002.[16]L.Nakhleh,U.Roshan,K.St.John,J.Sun,and T.Warnow.Designing fast convergingphylogenetic methods.In Proc.9th Int’l Conf.on Intelligent Systems for Molecular Biol-ogy(ISMB01),volume17of Bioinformatics,pages S190–S198.Oxford U.Press,2001.[17]L.Nakhleh,U.Roshan,K.St.John,J.Sun,and T.Warnow.The performance of phy-logenetic methods on trees of bounded diameter.In O.Gascuel and B.M.E.Moret,ed-itors,Proc.1st Int’l Workshop Algorithms in Bioinformatics(WABI’01),pages214–226.Springer-Verlag,2001.[18] A.Rambaut and N.C.Grassly.Seq-gen:An application for the Monte Carlo simulationof DNA sequence evolution along phylogenetic p.Appl.Biosci.,13:235–238, 1997.[19] B.Rannala,J.P.Huelsenbeck,Z.Yang,and R.Nielsen.Taxon sampling and the accuracyof large phylogenies.Syst.Biol.,47(4):702–719,1998.[20] D.F.Robinson and parison of phylogenetic trees.Mathematical Bio-sciences,53:131–147,1981.[21]N.Saitou and M.Nei.The neighbor-joining method:A new method for reconstructingphylogenetic trees.Mol.Biol.Evol.,4:406–425,1987.[22]M.J.Sanderson.r8s software package.Available from /r8s/.[23]M.J.Sanderson,B.G.Baldwin,G.Bharathan,C.S.Campbell,D.Ferguson,J.M.Porter,C.V on Dohlen,M.F.Wojciechowski,and M.J.Donoghue.The growth of phylogeneticinformation and the need for a phylogenetic database.Systematic Biology,42:562–568, 1993.[24]T.Warnow,B.Moret,and K.St.John.Absolute convergence:true trees from short se-quences.Proceedings of ACM-SIAM Symposium on Discrete Algorithms(SODA01),pages 186–195,2001.[25]Z.Yang.Maximum likelihood estimation of phylogeny from DNA sequences when sub-stitution rates differ over sites.Mol.Biol.Evol.,10:1396–1401,1993.。