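BRIG Comparative Genomics Operation Manual
Gene-Editing Tools: Operating Guide and Practical Tips
Gene editing is one of the key technologies in modern life science. It allows the genome of an organism to be modified precisely, which gives it great potential in both research and applications.
For beginners, however, working with gene-editing tools can be confusing and challenging.
This article shares operating guidance and practical tips to help readers get to grips with the technique.
I. Choosing a suitable gene-editing tool
Several gene-editing tools are available, including CRISPR-Cas9, TALEN and ZFN.
The choice depends on the experimental requirements, the properties of the target gene and the technical capacity of the laboratory.
For beginners, CRISPR-Cas9 is a relatively simple and widely used option.
It is efficient, inexpensive and easy to design, and it suits most gene-editing experiments.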
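II. Designing a suitable guide RNA
When editing with CRISPR-Cas9, a well-designed guide RNA (gRNA) is critical.
The guide should be highly specific to the target site so that editing is both accurate and efficient.
Guide length and the amount delivered to the cells may also need adjusting to improve the result.
Online tools and software such as Benchling and SnapGene can help optimize the design.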
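As a rough illustration of guide design, the sketch below scans the forward strand of a target sequence for 20-nt protospacers that are immediately followed by an NGG PAM, the motif recognized by the commonly used SpCas9. The function names, the default length of 20 and the GC helper are assumptions made for this example and are not taken from Benchling or SnapGene; real design tools also search the reverse strand and score off-target risk.

```python
import re


def find_protospacers(seq, length=20):
    """List candidate protospacers on the forward strand.

    A candidate is `length` bases followed directly by an NGG PAM.
    A lookahead is used so that overlapping candidates are all reported.
    """
    seq = seq.upper()
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % length)
    hits = []
    for match in pattern.finditer(seq):
        hits.append({
            "start": match.start(),          # 0-based position of the protospacer
            "protospacer": match.group(1),
            "pam": match.group(2),
        })
    return hits


def gc_fraction(seq):
    """Fraction of G and C bases; often used as a simple design filter."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    target = "ACGT" * 5 + "TGG" + "AAAA"
    for hit in find_protospacers(target):
        print(hit["start"], hit["protospacer"], hit["pam"],
              gc_fraction(hit["protospacer"]))
```

Candidates found this way still need a specificity check against the whole genome before use.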
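III. Optimizing transfection conditions
Transfection is the step that delivers the gene-editing components into the cells.
Transfection efficiency can be improved by optimizing the conditions.
For example, try different cell densities, transfection reagent concentrations and transfection times.
Including suitable positive and negative controls helps to assess transfection and to identify the best conditions.
IV. Screening and verifying edited cells
After editing, the edited cells need to be screened and verified.
Common methods include fluorescence-based detection, PCR and sequencing.
Fluorescence-based detection can be used to select cells that express a particular protein, while PCR and sequencing detect the edit in the genome.
To reduce false positives, verify the edit in several independent clones.
V. Improving editing efficiency
Improving editing efficiency is a concern for many researchers.
Besides optimizing transfection as described above, other strategies can be tried.
For example, small-molecule treatments and light-controlled (optogenetic) Cas9 systems have been reported to improve or better control CRISPR-Cas9 editing.
Combining editing with other techniques, such as single-cell sequencing and high-throughput screening, can also help to optimize the outcome.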
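Beginner's Guide to Comparative Bacterial Genome Analysis
Beginner's guide: comparative genome analysis using next-generation sequence data
Abstract
High-throughput sequencing is now fast and cheap enough to be considered part of the toolbox for investigating bacteria, and thousands of bacterial genome sequences are available for comparison in the public domain.
Bacterial genome analysis is increasingly performed by diverse groups in research, clinical and public health laboratories, who are interested in a wide array of topics related to bacterial genetics and evolution.
Examples include outbreak analysis and the study of pathogenicity and antimicrobial resistance.
In this beginner's guide, we aim to provide an entry point for individuals with a biology background who want to perform their own bioinformatics analysis of bacterial genome data, so that they can answer their own research questions.
We assume readers are familiar with genetics and the basic nature of sequence data, but we do not assume any computer programming skills.
The main topics covered are assembly, ordering of contigs, annotation, genome comparison and extracting common typing information.
Each section includes worked examples using publicly available E. coli data and free software tools, all of which can be run on a desktop computer.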
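Introduction and aims
High-throughput sequencing is now fast and cheap enough to be considered part of the toolbox for investigating bacteria.
Bacterial genome analysis is increasingly performed by diverse groups in research, clinical and public health laboratories, who are interested in a wide array of topics related to bacterial genetics and evolution.
Examples include outbreak analysis and the study of pathogenicity and antimicrobial resistance.
Bacterial genome sequences can now be generated in-house in many laboratories within hours or days, using benchtop sequencers such as the Illumina MiSeq, Ion Torrent PGM or Roche 454 FLX Junior.
Much of this data is available in public databases, which allows extensive comparative analysis. For example, as of February 2013 GenBank contained more than 6,500 bacterial genomes, two thirds of which were in draft form (that is, presented as a set of sequence fragments rather than a single sequence representing the whole genome).
In this beginner's guide, we aim to provide an entry point for individuals who want to use whole-genome sequence data for de novo genome assembly in order to answer questions within their broader research goals.
The guide is not aimed at those who wish to process hundreds of genomes at a time in an automated fashion; some discussion of the use of sequencing in routine diagnostic microbiology laboratories is available in the literature [8].
We assume readers are familiar with genetics and the basic nature of sequence data, but we do not assume any computer programming skills, and we use examples that can be run on a desktop computer (Mac, Windows or Linux).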
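GCBI User Manual 1.2.1
Document version: V1.0.1
Prepared by: Ru Ping (汝平)
Date edited: 2015-12-01
Reviewed by: Zhou Xiaoguang (周晓光)
Applies to: GCBI v1.1.3

Contents
GCBI overview
Chapter 1: Searching with GCBI
  I. Literature search
  II. Gene search
  III. Sample search
Chapter 2: Registration
Chapter 3: Using the laboratory
  I. Creating a laboratory
  II. Creating a project
  III. Creating a scheme
  IV. Obtaining samples
    1. Uploading your own samples
    2. Using public data
  V. Data analysis
    1. Scheme design
    2. Selecting samples
    3. Setting parameters
    4. Running the scheme
    5. Viewing results
Chapter 4: GCBI applications
  I. Worked example: breast cancer research
Appendix 1: Parameter definitions
Appendix 2: Chip platforms supported by GCBI

GCBI overview
GCBI (Gene-Cloud of Biotechnology Information) is a place for discovering and creating knowledge about genes.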
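GCBI emphasizes knowledge that is scientifically sound, authoritative, comprehensive and innovative, and aims to give researchers better and more convenient resources.
Whether the information is historical or recent, whether the focus is disease, drugs or physiological state, and whatever the life stage, organ or cell type, GCBI can help uncover more of what a gene may do.
GCBI draws on big data, new knowledge and new technology to provide a driving engine for medical innovation.
GCBI integrates literature, gene information, sample information, data algorithms and bioinformatics processing to build a "gene knowledge base" that spans biology, medicine, informatics, computer science, mathematics, imaging and other disciplines.
GCBI currently holds more than 20 TB of knowledge data, including more than 1.2 million whole-genome samples, more than 260,000 journals, more than 70 million publications and more than 170,000 gene records.
GCBI makes analysis and computation tools easier to reach, advancing the understanding of genetic material and its role in health and disease.
GCBI aims to make scientific research less forbidding, more enjoyable and simpler to carry out.
Chapter 1: Searching with GCBI
I. Literature search
1. Searching the literature
On the home page, select "Literature" and enter the search keywords in the search box. Both Chinese and English keywords are accepted.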
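Gene Manipulation Guide (Complete Reference)
Virus and Gene Therapy Laboratory Operating Guide (Molecular Biology Section)
Mix 700 µl of bacterial culture with an equal volume of 60% glycerol and store at -20 °C or -70 °C.
Method 1: Pick a loopful of glycerol stock and streak it onto LB agar containing the appropriate concentration of antibiotic (to revive the strain). Incubate overnight at 37 °C (about 12 to 16 hours).
Pick a single colony into LB broth containing the appropriate antibiotic and shake overnight at 37 °C (about 12 to 16 hours).
Method 2: Pipette 10 to 20 µl of glycerol stock directly into LB broth containing the appropriate antibiotic and shake overnight at 37 °C (about 12 to 16 hours).
Suitable for preparing 20 µg of high-copy plasmid smaller than 10 kb from 1 to 5 ml of culture.
(1) Place buffer EB in a 37 °C incubator to pre-warm.
(2) Collect the culture in an Ep tube, centrifuge at 10,000 rpm for 1 min and discard the supernatant.
(3) Add 250 µl of buffer P1 (containing RNase) and resuspend the cells.
(4) Add 250 µl of buffer P2, mix by gently inverting 4 to 6 times, and leave at room temperature for 2 to 3 minutes (no more than 5 minutes, to avoid over-lysis).
(5) Add 350 µl of buffer N3, mix immediately by inverting 4 to 6 times, and centrifuge at 13,000 rpm for 10 minutes.
(6) Load the supernatant onto a QIAprep column, centrifuge at 13,000 rpm for 30 to 60 seconds and discard the flow-through.
(7) Wash the column with 500 µl of buffer PB and centrifuge at 13,000 rpm for 30 to 60 seconds.
(8) Wash the column with 750 µl of buffer PE, centrifuge at 13,000 rpm for 30 to 60 seconds, discard the flow-through, and centrifuge for a further 1 minute to remove residual PE.
(9) Transfer the column to a fresh Ep tube, add buffer EB pre-warmed to 37 °C, let it stand in the 37 °C incubator for 1 minute (longer for larger plasmids), and centrifuge at 13,000 rpm for 1 minute. Store the eluate at -20 °C or -70 °C until use.
(1) Transfer the culture to an Ep tube and centrifuge in a benchtop centrifuge at 10,000 rpm for 1 min to collect the cells.
(2) Add 200 µl of Solution I (50 mM glucose, 25 mM Tris-HCl pH 8.0, 10 mM EDTA) and vortex to resuspend the cells.
(3) Add 200 µl of Solution II (0.2 N NaOH, 1% SDS), close the lid tightly, invert 4 to 5 times to mix, and let stand for 3 to 4 minutes.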
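BRIG (BLAST Ring Image Generator) Software Guide
BRIG uses BLAST (Basic Local Alignment Search Tool) to compare genomes and CGView to generate the image.
1. First install Java and BLAST/BLAST+, unpack the BRIG archive, and double-click BRIG.jar to start BRIG.
(1) Under Preferences > BRIG options, set the path to the BLAST+ executables.
(2) Load the example files: select the reference genome, set the input and output folders, click "Add to data pool" to add the sequences in the input folder to the BRIG data pool, then click Next.
(3) Configure each ring: click "Add new ring" to add a ring, select an entry in the data pool, and click "Add data" to attach the selected sequence.
When all rings are configured, click Next.
A typical layout is:
Ring1  GC Skew
Ring2  GC Content
Ring3  refgene.fna
Ring4  genome1.fna
Ring5  genome2.fna
...
(4) Submit the job: choose the image output format (jpg, png, svg or svgz) and click Submit to run the comparison and write the image.
(5) View the final output image.
2. The image produced so far is missing one thing: annotation.
Once the sequences have been added, click "Add custom feature" to open the annotation editor.
Annotation can be added in four ways, listed in the "Input data" drop-down: Single entry (edit one gene at a time), Tab-delimited (import a tab-delimited file, here list.xls), Genbank (add a gbk file) and Embl (add an embl file).
(1) Import genes of interest from a tab-delimited file.
Prepare the annotation file (list.xls) with three columns: start, end and label. The header line must begin with #, and each gene's start position must be smaller than its end position.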
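The annotation file described above can be generated with a few lines of code. The sketch below builds the three-column, tab-delimited text with a header that begins with # and rejects any feature whose start is not smaller than its end. The function name and the example genes are hypothetical, and the layout follows only what this guide states; whether BRIG accepts further optional columns is not covered here.

```python
def format_feature_table(features):
    """Return tab-delimited annotation text for a list of (start, end, label).

    The first line is a header beginning with '#', as required above.
    Every feature is validated before any text is returned.
    """
    lines = ["#start\tend\tlabel"]
    for start, end, label in features:
        start, end = int(start), int(end)
        if start >= end:
            raise ValueError(
                f"start must be smaller than end for {label!r}: {start} >= {end}"
            )
        lines.append(f"{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    genes = [(100, 900, "geneA"), (1500, 2400, "geneB")]  # made-up coordinates
    with open("list.xls", "w", newline="") as handle:
        handle.write(format_feature_table(genes))
```

Writing the file from a script avoids the hidden formatting that spreadsheet programs sometimes add when saving.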
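Bioinformatics Analysis Tools: Operating Guide and Usage Tips
In recent years, as biological research has shifted towards deep learning and big data, bioinformatics analysis tools have become increasingly important.
These tools process and interpret very large biological datasets and so give deeper insight into the function of genes, proteins and other biomolecules.
To help researchers apply them, this article gives operating guidance and usage tips for several bioinformatics tools.
I. BLAST
BLAST (Basic Local Alignment Search Tool) is one of the most widely used bioinformatics tools. It aligns nucleotide or protein sequences and finds similar sequences.
To run BLAST:
1. Go to the NCBI (National Center for Biotechnology Information) website and select the "BLAST" tab.
2. Choose the appropriate BLAST program, such as nucleotide BLAST (for nucleotide sequences) or protein BLAST (for protein sequences).
3. Enter the query sequence or upload a sequence file.
4. Choose a suitable database to search. For human genes, for example, the "Human genome" database can be selected.
5. Adjust BLAST parameters such as the expect threshold (E-value) and alignment length to refine the results.
6. Submit the job and wait for the results. BLAST returns the alignments and their similarity scores.
Usage tips:
- Choose the right database so that the hits are biologically relevant.
- Adjust the parameters to the needs of the study, for example to increase sensitivity or to apply a strict similarity threshold.
- When reading the results, look for high BLAST scores and low E-values to identify the most relevant sequences.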
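The last tip can be applied automatically once results are saved as a table. The sketch below assumes the hits are in the 12-column tabular layout that BLAST+ writes with `-outfmt 6` (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the function names, the E-value cutoff of 1e-5 and the sample rows are illustrative assumptions, not values recommended by this guide.

```python
FIELDS = ("qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore").split()


def parse_hits(text):
    """Parse tab-separated BLAST hits into dictionaries.

    Blank lines and comment lines starting with '#' are skipped.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        hit = dict(zip(FIELDS, line.split("\t")))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def best_hits(hits, max_evalue=1e-5):
    """Keep hits at or below the E-value cutoff, highest bit score first."""
    kept = [h for h in hits if h["evalue"] <= max_evalue]
    return sorted(kept, key=lambda h: (-h["bitscore"], h["evalue"]))


if __name__ == "__main__":
    sample = (
        "q1\tsubjA\t98.5\t200\t3\t0\t1\t200\t11\t210\t1e-50\t350\n"
        "q1\tsubjB\t80.0\t60\t12\t0\t5\t64\t1\t60\t0.5\t30.1\n"
        "q1\tsubjC\t99.0\t300\t3\t0\t1\t300\t1\t300\t1e-80\t540\n"
    )
    for hit in best_hits(parse_hits(sample)):
        print(hit["sseqid"], hit["evalue"], hit["bitscore"])
```

Sorting on bit score with the E-value as a tie-breaker mirrors the advice above: strong, statistically significant alignments come first.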
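II. DNA sequence editors
DNA sequence editors are common tools in bioinformatics work and are used to edit, manipulate and analyze DNA sequences.
To use a DNA sequence editor:
1. Download and install a suitable editor, such as ApE (A plasmid Editor) or SnapGene.
2. Open the editor and create a new project.
3. Type or paste the DNA sequence into the sequence window.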
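In Situ Hybridization: Principles and Procedure
Principles and methods of the in situ hybridization experiment
I. Purpose
The purpose of this experiment is to learn how to perform in situ hybridization and to understand the basic principles, advantages and disadvantages of the different in situ hybridization methods.
II. Principle
In situ hybridization histochemistry (ISHH, or in situ hybridization for short) is a form of molecular hybridization. A labeled probe is hybridized to the target nucleic acid in tissue or cells, and a detection system matched to the label then reveals the nucleic acid at its original location.
In essence, at a defined temperature and ionic strength, a single-stranded probe with a specific sequence anneals by base pairing to the target nucleic acid inside the tissue or cells. The specific nucleic acid is thereby localized and is visualized in place through the label carried by the probe.
Formation of a hybrid does not require the two strands to be perfectly complementary. Single strands from different sources can form a hybrid duplex as long as they share some degree of complementary sequence (that is, some degree of homology).
By label, probes fall into two classes: isotope-labeled and non-isotope-labeled.
At present most radioactive labeling is done enzymatically, by incorporating labeled nucleotides into DNA. Commonly used isotopes are 3H, 35S, 125I and 32P.
Isotope labels offer high sensitivity and low background, but because radioisotopes are hazardous to people and the environment they are increasingly being replaced by non-isotopic labels.
The three most common non-isotopic labels are biotin, digoxigenin and fluorescein.
By type of nucleic acid, probes are classed as DNA probes, cDNA probes, cRNA probes and synthetic oligonucleotide probes.
cDNA probes are further divided into double-stranded and single-stranded cDNA probes.
In situ hybridization is also divided into colony in situ hybridization and tissue in situ hybridization.
Colony in situ hybridization: bacteria are transferred from the culture plate to a nitrocellulose filter, and the colonies on the filter are lysed to release their DNA.
The DNA is fixed to the membrane by baking and hybridized with a 32P-labeled probe. Colony hybridization signals are detected by autoradiography and matched to the colonies on the plate.
Tissue in situ hybridization: usually called simply in situ hybridization, this refers to hybridization in tissue or cells and differs from colony in situ hybridization.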
User Guide for the Half-Sib Family Genome Analysis Package
hsphase: an R package for identification of recombination events, pedigree and haplotype reconstruction and imputation using SNP data from half-sib families
Mohammad H. Ferdosi and Cedric Gondro
November 26, 2018

Contents
1 Overview
2 Data Input Format
  2.1 Reading the Genotype File
3 Main Functions: Block Partitioning, Sire Imputation and Phasing of a Half-Sib Family
  3.1 Block Partitioning (bmh)
  3.2 Sire Imputation and Phasing (ssp)
  3.3 Half-Sib Family Phasing (phf)
  3.4 All-in-one Phasing (aio)
4 Auxiliary Functions
  4.1 Input Files
    4.1.1 Genotype File
    4.1.2 Pedigree File
    4.1.3 Map File
  4.2 Functions
    4.2.1 Half-sib Family Separator (hss)
    4.2.2 Chromosome Separator (cs)
5 Parallel Data Analysis (para)
6 Visualisation
  6.1 Blocking Structure Plot (imageplot)
  6.2 Recombination Plot (rplot)
  6.3 Heatmap of Half-sib Families (hh)
  6.4 Plot of Opposing Homozygotes (ohplot)
7 Diagnostic Tools
  7.1 Probability Matrix (pm)
  7.2 Haplotype Block of Phased Data (hbp)
  7.3 Identification of Recombination Events (recombinations)
8 Pedigree Reconstruction
  8.1 Matrix of Opposing Homozygotes (ohg)
  8.2 Pedigree Reconstruction of Half-sib Families (rpoh)
  8.3 Fix Pedigree Errors (pedigreeNaming)
  8.4 Parentage Assignment (pogc)
9 Imputation
10 Quick Guide
  10.1 Half-sib Family Analysis
  10.2 Multi-family Analysis
11 How to Cite the hsphase

1 Overview
hsphase comprises a suite of functions to reconstruct haplotypes using SNP data from half-sib families (a data structure widely used in livestock genomics). The package can be used to identify the paternal strand of origin (blocks) which the half-sibs inherited from their sire (i.e. which chromosomal regions in an individual come from either the paternal or maternal strand of the sire). These blocks define the recombination events that occurred in the sire to form the offspring's haplotype. Blocks can then be used to count recombination events and detect hot and cold spot regions. The package also includes a function to phase the half-sibs using an algorithm based on opposing homozygotes and another function to impute and phase ungenotyped haplotypes of the sires. If the pedigree is not available or not reliable, it can be inferred from the raw genotypes. Diagnostic images can be generated to evaluate the quality of results.
Note 1: Auxiliary functions hss, cs and para (Sections 4.2 and 5) can be used to analyse large datasets (they are generally used in this order). cs and hss parse genotype, pedigree and map files into a format that can be used downstream by para.
Note 2: These functions are meant for autosomes only. The package will not phase sex chromosomes.

2 Data Input Format
hsphase uses a simple numeric matrix of animals (rows) and SNP (columns). SNP should be coded as 0, 1 and 2 for AA, AB and BB respectively. Use 9 for missing data. The matrix should have rownames which are the sample IDs and colnames which are the SNP names. The matrix must only contain individuals from one half-sib family and one chromosome, and the SNP should be ordered in ascending order by base pair position. There should be at least four samples in the dataset. A large genotype file can be split into half-sib groups and chromosomes with hss and cs respectively. The cs output can be analysed with para.
This matrix can be used directly with the bmh, ssp, phf, aio (Section 3), rplot (Section 6.2) and pm (Section 7.1) functions. Auxiliary functions to read a large genotype dataset into R and parse it into family groups and chromosomes are readGenotype (Section 2.1), hss and cs (Section 4.2).
Note: We have found that we get better results using just the raw data with missing values than data which has previously had genotypes imputed.
Toy example of a half-sib genotype matrix:
    genotype <- matrix(c(
      0,0,0,0,1,2,2,2,0,0,2,0,0,0,
      2,2,2,2,1,0,0,0,2,2,2,2,2,2,
      2,2,2,2,1,2,2,2,0,0,2,2,2,2,
      2,2,2,2,0,0,0,0,2,2,2,2,2,2,
      0,0,0,0,0,2,2,2,2,2,2,0,0,0
    ), ncol = 14, byrow = TRUE)
    AnimalID <- paste("ID-", 1:5, sep = "")
    SNPID <- paste("SNP-", letters[1:14], sep = "")
    rownames(genotype) <- AnimalID
    colnames(genotype) <- SNPID
    genotype[1:5, 1:5]

2.1 Reading the Genotype File
The readGenotype function reads and performs a sanity check on a genotype file. The input is the name of the genotype file.
    readGenotype(genotypePath, separatorGenotype = " ", check = TRUE)
If the check option is set to TRUE, the genotype file will be checked for possible errors.
Note 1: The SNP and animal IDs should not contain double quotes.
Note 2: This function uses the scan function in R to improve read speeds for large genotype files.

3 Main Functions: Block Partitioning, Sire Imputation and Phasing of a Half-Sib Family
3.1 Block Partitioning (bmh)
The bmh function creates the block structure for the half-sibs. The result is a matrix (one row for each half-sib and one column for each SNP, in the same order as the input matrix) that shows which of the sire's haplotypes each half-sib inherited. Paternal strands are arbitrarily coded as 1 and 2; i.e. individuals that have the same number (e.g. 1, 1) for a given SNP inherited the same haplotype from the sire. Alternatively, if two half-sibs have e.g. a 1 and a 2, they inherited different haplotypes from the sire. 0 is used when the paternal strand of origin could not be determined.
    recombinationBlocks <- bmh(genotype)

3.2 Sire Imputation and Phasing (ssp)
The ssp function infers (imputes) and phases the sire's genotype. It requires the half-sib's original genotype matrix and the block structure generated with bmh.
The function returns a matrix with two rows, one for each haplotype of the sire (columns are SNP in the order of the genotype matrix). Alleles are coded as 0 (A) and 1 (B). Alleles that could not be imputed are coded as 9.
Note: Across chromosomes there is no relationship between the sire's first and second haplotype (i.e. the paternal/maternal haplotypes of the sire can be swapped between chromosomes).
    sireHaplotype <- ssp(bmh(genotype), genotype)

3.3 Half-Sib Family Phasing (phf)
The phf function phases the half-sib data. It uses as input the half-sib's genotype matrix, the block structure (generated with bmh) and the matrix of the imputed sire (generated with ssp). The output is a matrix that contains the phased haplotype each half-sib inherited from the sire. The resulting matrix has the same number of rows as the input genotype matrix. It uses 0, 1 and 9 for A, B and missing. The maternal haplotype (the haplotype the offspring inherited from the dam) can be created by subtracting this matrix from the genotype matrix, wherever neither value is missing. The function aio (below) can be used to generate both haplotypes in a single matrix.
    familyPaternalHaplotypes <- phf(genotype, bmh(genotype), ssp(recombinationBlocks, genotype))

3.4 All-in-one Phasing (aio)
The aio function uses the previous functions to generate the half-sib's haplotypes (from both the sire and dam). The output is a matrix containing 0 and 1 for alleles A and B (with 9 for missing/unknown phase). The IDs of each animal (rownames) are duplicated with p and m suffixes, which stand for paternal (inherited from the sire) and maternal (from the dam) haplotypes.
    familyPhased <- aio(genotype)

4 Auxiliary Functions
The objective of these functions is to help users convert their data to the format used by hsphase. Essentially, these functions simply split a matrix of genotypes into a list of chromosomes and half-sib families.

4.1 Input Files
Three input files are required to use the hss and cs functions:

4.1.1 Genotype File
A flat file with genotypes (ID × SNP), where the SNP names are in the first row and the ID of each individual is in the first column (the ID column should not have a header). The delimiter can be specified via the readGenotype function (Section 2.1). The IDs and SNP names should be unique. SNP are denoted by 0, 1, 2 and 9 for AA, AB, BB and missing respectively.
         SNP1  SNP2  SNP3  SNP4
    IND1    0     2     1     1
    IND2    1     9     0     2
    IND3    2     0     1     0

4.1.2 Pedigree File
The pedigree file should contain at least two columns, one for the half-sibs and one for the sires, in this order. Other columns are ignored. Unknown sires can be specified by 0 (but results will simply be a string of 9s) and offspring IDs should be unique. This file must not have a header.

4.1.3 Map File
The map file must contain a column for each of: SNP name, chromosome and position in base pairs. The first line must contain the header "Name Chr Position". Additional columns are ignored.
Space is the default separator for all files.

4.2 Functions
4.2.1 Half-sib Family Separator (hss)
The hss function generates a list of matrices (ID × SNP), one for each of the half-sib groups that have at least 4 offspring per sire. The inputs are the genotype and pedigree files (path to the file, matrix or data.frame). The names of the list are the names of the sires in the pedigree file.
    halfsib <- hss(pedigree, genotype)

4.2.2 Chromosome Separator (cs)
After splitting the data into family groups, the cs function can be used to separate the hss data into the different chromosomes based on a map file. The output is also a list of matrices (ID × SNP, one for each family and chromosome).
The names in the R output list are the half-sib group names (sire ID) and their chromosome numbers separated by an underscore (_). Subsets can be found using e.g. grep (Section 10). Data in the correct format can also be built manually; it is simply a split list of numeric matrices (ID × SNP with 0, 1, 2 and 9), with SNP ordered by BP position. One matrix for each chromosome and sire.
    halfsib <- cs(halfsib, mapPath, mapSeparator)

5 Parallel Data Analysis (para)
The para function uses the list of matrices (the output of cs) and runs one of the options below, on each element of the list, in parallel. This function requires the snowfall package. To run use e.g.:
    blocks <- para(halfsib, cpus = 20, option = "bmh", type = "SOCK")
    sireImpute <- para(halfsib, cpus = 20, option = "ssp", type = "SOCK")
    phasedSibs <- para(halfsib, cpus = 20, option = "aio", type = "SOCK")
cpus sets the number of CPUs to use; option sets the type of analysis and can be any of the above described bmh, ssp, aio or pm (7.1), and also rec. The rec option returns a list with the sum of recombinations between SNP for each chromosome in each half-sib group. The parameter type sets the type of cluster for parallel analysis (for more information refer to the snowfall documentation).
The output is a list of the same length as the input data and its contents depend on the option selected (Section 3). rec is just a wrapper function which is equivalent to:
    result <- pm(bmh(genotype))
    result <- apply(result, 2, sum)
The result can be plotted to visualise the recombinations.

6 Visualisation
6.1 Blocking Structure Plot (imageplot)
The imageplot function creates a plot of the blocking structure created by either bmh or hbp (Sections 3.1 and 7.2). White indicates regions of unknown origin; red and blue correspond to the two sire strands.
Note: across chromosomes the colours are not comparable, i.e. blue in chromosomes 1 and 2 may not relate to the same strand of origin in the sire (paternal/maternal).
    imageplot(bmh(genotype))
    imageplot(blocks[[1]])  # from para results
[Figure: Imageplot of simulated half-sib family]

6.2 Recombination Plot (rplot)
This function creates a plot which shows the sum of all recombination events across the half-sib family. It uses the half-sib genotypes and needs a vector of SNP positions for each SNP on the chromosome.
    rplot(genotype, distance)
[Figure: Recombination of simulated half-sib family. x-axis: Markers; y-axis: Sum of the probability of recombinations]

6.3 Heatmap of Half-sib Families (hh)
The hh function generates a heatmap of the half-sib families using the matrix of opposing homozygotes (Section 8).
    hh(ohg(genotype), inferredPedigree, realPedigree)
The heatmap assists identification of pedigree errors and can help to check whether the pedigree reconstruction results seem correct. The inferredPedigree (Section 8.2) and realPedigree are used by the hh function to create, respectively, the RowSideColors and ColSideColors. In practice either can be freely interchanged to colour code the side bars of the heatmap. A maximum of 21 colours are used by the heatmap. If there are more half-sib families, colours will be repeated.

6.4 Plot of Opposing Homozygotes (ohplot)
The ohplot function plots the vectorized and sorted opposing homozygote matrix.
    ohplot(ohg(genotype), genotype, pedigree, check = TRUE)
[Figure: sorted opposing homozygotes. Annotation: separation value: 0.117, cut off (number of SNP): 48; y-axis: opposing homozygotes]

7 Diagnostic Tools
7.1 Probability Matrix (pm)
The pm function utilises the block structure matrix (from bmh) to find recombination sites. It returns a matrix of 0s and 1s: 1 indicates that a recombination occurred between two consecutive SNP and 0 indicates no recombination. If the exact position of the recombination cannot be determined, the region of uncertainty is filled with 1s.
    genotype <- matrix(c(
      0,2,0,1,0,
      2,0,1,2,2,
      2,2,1,0,2,
      2,2,1,1,1,
      0,0,2,1,0
    ), ncol = 5, byrow = T)
    (result <- bmh(genotype))
    pm(result)
This function can be useful to identify mapping errors, for example if (very) large recombination rates are observed at a particular SNP across families.

7.2 Haplotype Block of Phased Data (hbp)
The hbp function creates a block structure of the half-sib family based on phased data from the sire and its half-sib family. It requires a haplotype matrix from the sire (ssp) and one from the half-sib family (aio). This function can be used as a diagnostic tool to evaluate the result of other phasing algorithms (provided there are parent-offspring pairs in the data).
    sire <- matrix(c(
      0,0,0,0,0,1,  # Haplotype one of the sire
      0,1,1,1,1,0   # Haplotype two of the sire
    ), byrow = T, ncol = 6)
    haplotypeHalfsib <- matrix(c(
      1,0,1,1,1,1,  # Individual one, haplotype one
      0,1,0,0,0,0,  # Individual one, haplotype two
      0,1,1,0,1,1,  # Individual two, haplotype one
      1,0,0,1,0,0   # Individual two, haplotype two
    ), byrow = T, ncol = 6)  # 0s and 1s are allele A and B
    hbp(haplotypeHalfsib, sire)
Note: The results can be plotted with imageplot.

7.3 Identification of Recombination Events (recombinations)
The recombinations function counts the number of recombinations for each individual in a half-sib family.
    genotype <- matrix(c(
      2,1,0,0,
      2,0,2,2,
      0,0,2,2,
      0,2,0,0
    ), byrow = TRUE, ncol = 4)
    recombinations(bmh(genotype))
Note: This function can be used to detect pedigree errors and is robust if there are at least 10 half-sibs in the family. Individuals that show a disproportionate number of recombinations do not belong to that family group.

8 Pedigree Reconstruction
8.1 Matrix of Opposing Homozygotes (ohg)
The ohg function creates a matrix of the number of opposing homozygotes between all pairs of individuals. The result is a square matrix where the rownames and colnames are the IDs of individuals.
    ohg(genotype, cpus = 2)
Note: This function utilises OpenMP in GNU/Linux and the number of cpus is only valid in GNU/Linux.

8.2 Pedigree Reconstruction of Half-sib Families (rpoh)
The rpoh function reconstructs half-sib families; i.e. it splits the individuals into half-sib groups. Four methods (simple, recombinations, calus and manual) can be utilised to reconstruct the pedigree. For more details please refer to the article.
    pedigree1 <- rpoh(oh = oh, snpnooh = 732, method = "simple")
    pedigree2 <- rpoh(genotypeMatrix = genotypeChr1, oh = ohg(genotype),
                      maxRec = 10, method = "recombinations")
    pedigree3 <- rpoh(genotypeMatrix = genotype, oh = oh, method = "calus")
    pedigree4 <- rpoh(oh = oh, maxsnpnooh = 31662, method = "manual")
Note 1: The functions ohg and rpoh with the recombinations method can be slow with very large datasets. A genotype matrix with only one chromosome is usually sufficient to separate the individuals into half-sib groups and can speed up the process.
Note 2: The first argument of the rpoh function (genotypeMatrix) for the recombinations method must use only genotypic data from a single chromosome.
Note 3: snpnooh is the number of SNPs (divided by 1000) used to create the opposing homozygote matrix.
Note 4: maxsnpnooh is the maximum number of opposing homozygotes allowed in a family.

8.3 Fix Pedigree Errors (pedigreeNaming)
The pedigreeNaming function tries to link the inferred half-sib family groups to the sire IDs in the original pedigree and fix errors. This works well if the original pedigree is relatively correct, but individuals will be misassigned if most of the individuals originally allocated to a sire are not its offspring.

8.4 Parentage Assignment (pogc)
The pogc function utilises the opposing homozygote matrix and returns a pedigree of parent-offspring assignments.
    pedigree <- pogc(oh, genotypeError)
The genotypeError argument is the maximum number of mismatches allowed.

9 Imputation
The impute function imputes low density SNP markers to high density markers utilising the high density haplotype of the sire.
    impute(halfsib_genotype_ld, sire_hd)
halfsib_genotype_ld is the genotype of the half-sibs with low density markers. sire_hd is the haplotype of the sire, which can be either the phased sire haplotype or the haplotype of the sire assembled from sequence data.
Note: halfsib_genotype_ld and sire_hd must have colnames which are the SNP names.

10 Quick Guide
10.1 Half-sib Family Analysis
The aio, ssp and bmh functions phase a half-sib family, impute the sire and create the block structure respectively. The half-sibs must be a numeric matrix with animal IDs as rownames and SNP IDs as colnames. If the genotype file contains only one half-sib family, it can be read with the readGenotype function (Section 2.1).
    genotype <- readGenotype("path to the genotype file", separator)  # Section 2.1
    recombinationBlocks <- bmh(genotype)                               # Section 3.1
    sireHaplotype <- ssp(bmh(genotype), genotype)                      # Section 3.2
    familyPhased <- aio(genotype)                                      # Section 3.4

10.2 Multi-family Analysis
The input genotype file must have the same structure as the half-sib genotype file discussed above (Section 2).
    # read in the genotype file (Section 2.1)
    genotype <- readGenotype("path and name of the genotype file", separator)
    # hss generates a list of half-sibs based on a pedigree file (Section 4.2.1)
    halfsib <- hss(pedigree, genotype)
    # splits the output from hss into the various chromosomes (Section 4.2.2)
    halfsib <- cs(halfsib, mapPath)
    # Block Partitioning (Section 5 and 3.1)
    recombinationBlocksList <- para(halfsib, cpus = 20, option = "bmh", type = "SOCK")
    # Sire Imputation (Section 5 and 3.2)
    sireHaplotypeList <- para(halfsib, cpus = 20, option = "ssp", type = "SOCK")
    # Half-Sib Family Phasing (Section 5 and 3.4)
    familyPhasedList <- para(halfsib, cpus = 20, option = "aio", type = "SOCK")
Results can be concatenated by using a simple function such as:
    chromosomeMatch <- function(listHalfsibs, numberChr) {
      chr <- list()
      for (i in 1:numberChr) {
        chr[[i]] <- listHalfsibs[grep(paste("_", i, "$", sep = ""), names(listHalfsibs))]
        chr[[i]] <- do.call(rbind, chr[[i]])
      }
      phasedGenotype <- do.call(cbind, chr)
      phasedGenotype
    }
numberChr is the number of chromosomes.
Note: A comprehensive demo and example dataset is available from Cedric Gondro's home page or by running demo(hsphase).

11 How to Cite the hsphase
To cite the package please type
    citation("hsphase")
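To make the opposing-homozygote idea behind Section 8.1 concrete, here is a small sketch in plain Python. For every pair of individuals it counts the SNP at which one is homozygous AA (coded 0) and the other homozygous BB (coded 2); heterozygous (1) and missing (9) genotypes never count. This only illustrates the definition: the function name is an assumption, and it is not the package's ohg implementation, which according to the note in Section 8.1 uses OpenMP on GNU/Linux.

```python
def opposing_homozygotes(genotypes):
    """Pairwise opposing-homozygote counts.

    `genotypes` is a list of rows (one per individual), each a list of
    SNP codes: 0 = AA, 1 = AB, 2 = BB, 9 = missing.
    Returns a symmetric n x n list of lists with zeros on the diagonal.
    """
    n = len(genotypes)
    counts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(
                1
                for a, b in zip(genotypes[i], genotypes[j])
                if {a, b} == {0, 2}   # one AA and one BB at this SNP
            )
            counts[i][j] = counts[j][i] = shared
    return counts


if __name__ == "__main__":
    toy = [
        [0, 2, 1, 9],
        [2, 2, 0, 0],
        [2, 0, 1, 2],
    ]
    for row in opposing_homozygotes(toy):
        print(row)
```

A parent and its offspring always share an allele at a given SNP, so a parent-offspring pair should show no opposing homozygotes apart from genotyping errors; this is the kind of signal the pedigree functions in Section 8 build on.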
evobiR 1.1: Comparative Biology and Genetic Analysis Package Manual
Package 'evobiR'
October 13, 2022
Type Package
Title Comparative and Population Genetic Analyses
Version 1.1
Date 2015-8-25
Author Heath Blackmon and Richard H. Adams
Maintainer Heath Blackmon <******************>
URL /karyodb/evobiR/
Description Comparative analysis of continuous traits influencing discrete states, and utility tools to facilitate comparative analyses. Implementations of ABBA/BABA type statistics to test for introgression in genomic data. Wright-Fisher, phylogenetic tree, and statistical distribution Shiny interactive simulations for use in teaching.
License GPL (>= 2)
Imports seqinr, ape, geiger, shiny, phytools
NeedsCompilation no
Repository CRAN
Date/Publication 2015-09-06 19:30:55

R topics documented:
evobiR-package
1.fasta
AICc
AncCond
CalcD
Even
FuzzyMatch
horn.beetle.csv
hym.tree
mcmc2
mite.trait
Mode
PPSDiscrete
ReorderData
ResSel
SampleTrees
SlidingWindow
SuperMatrix
trees
trees.mite
trees.nex
ViewEvo
WinCalcD
Index

evobiR-package    evobiR: Evolutionary Biology in R
Description
evobiR is a collection of tools for use in evolutionary biology. Some of the functions manipulate data in a way not implemented by other functions, while others calculate sequence statistics or perform simulations, either of data across trees or genetic and genomic simulations.
Details
Package: evobiR
Type: Package
Version: 1.1
Date: 2013-10-08
License: GPL (>= 2)
More information on evobiR is available at http://coleoguy.github.io/software.html
Author(s)
Heath Blackmon
Maintainer: Heath Blackmon <******************>

1.fasta    simulated SNP data
Description
This file contains simulated SNP data.
Author(s)
Heath Blackmon
References
http://coleoguy.github.io/

AICc    Computes an AICc score
Description
Supplied with a log likelihood, the number of model parameters, and the sample size, calculates the small sample size version of the AIC score.
Usage
    AICc(loglik, K, N)
Arguments
loglik  log likelihood.
K       the number of parameters in the model.
N       the sample size.
Details
Returns an AICc score.
Author(s)
Heath Blackmon
References
http://coleoguy.github.io/
Examples
    AICc(-32, 3, 100)

AncCond    Calculate the mean of a continuous character at the origin of the derived state of a binary character
Description
This function uses stochastic mapping and ancestral state reconstruction to determine if the derived state of a binary trait originates when a continuous trait has an extreme value.
Usage
    AncCond(trees, data, derived.state, iterations = 1000)
Arguments
trees          tree(s) of class phylo or multiPhylo.
data           a dataframe with 3 columns. The first should match the taxa names in the tree, the second should have the continuous trait values and the third the states for the binary character.
derived.state  the derived condition for the binary trait.
iterations     the number of iterations to be used in estimating significance.
Details
This function uses stochastic mapping and ancestral state reconstruction as implemented in phytools to determine if the derived state of a binary trait originates when a continuous trait has an extreme value. This test assumes that the derived state of the binary character may lead to correlated selection in the continuous trait. Because of this, the ancestral state reconstruction of the continuous trait is based only on data from species that remain in the ancestral condition for the binary trait.
Value
Returns a plot of the null distribution and the observed data, as well as an empirical p-value for the observed data.
Author(s)
Heath Blackmon and Richard H. Adams
References
http://coleoguy.github.io/
Examples
    ## Not run:
    data(mite.trait)
    data(trees.mite)
    AncCond(trees, mite.trait, derived.state = "haplodiploidy", iterations = 100)
    ## End(Not run)

CalcD    Calculate Patterson's D-statistic
Description
These functions calculate Patterson's D-statistic to compare the frequencies of discordant SNP genealogies. These tests assume equal substitution rates and unlinked loci. D-statistics significantly different from 0 suggest that introgression has occurred.
Usage
    CalcD(alignment = "alignment.fasta", sig.test = "N", block.size = 1000, replicate = 1000)
    CalcPopD(alignment = "alignment.fasta")
Arguments
alignment   This is an alignment in fasta format. Sequences should be in the order: P1, P2, P3, Outgroup.
sig.test    This indicates whether to test for significance. Options are "B" bootstrap, "J" jackknife, or "N" none.
block.size  The number of sites to be dropped in the jackknife approach.
replicate   Number of replicates to be used in estimating variance.
Details
The functions CalcD and CalcPopD are implementations of the algorithm described in Durand et al. 2011. Significance of the D-stat can be calculated either through bootstrapping or jackknifing. Bootstrapping is appropriate for datasets where SNPs are unlinked, for instance unmapped RADSeq data. Jackknifing is the appropriate approach when SNPs are potentially in linkage, for instance gene alignments or genome alignments.
Value
Returns the number of each type of site, Z scores and p-values.
Author(s)
Heath Blackmon
References
http://coleoguy.github.io/
Durand, Eric Y., et al. Testing for ancient admixture between closely related populations. Molecular Biology and Evolution 28.8 (2011): 2239-2252.
Eaton, D. A. R., and R. H. Ree. 2013. Inferring phylogeny and introgression using RADseq data: An example from flowering plants (Pedicularis: Orobanchaceae). Syst. Biol. 62: 689-706.
Examples
    CalcD(alignment = system.file("1.fasta", package = "evobiR"), sig.test = "N")
    CalcPopD(alignment = system.file("3.fasta", package = "evobiR"))

Even    Tests whether a number is even
Description
Just a simple function that returns True if a number is even and False otherwise.
Usage
    Even(x)
Arguments
x  a numerical vector.
Details
Returns a vector of logical values of the same length as the input vector. If the input value is not a number it will return an error message.
Author(s)
Heath Blackmon
References
http://coleoguy.github.io/
Examples
    Even(c(1, 2, 3, 4, 5, 6, 2, 5))

FuzzyMatch    Find Close Matches in a tree and dataset
Description
When assembling data from different sources, typos can sometimes cause a loss of perfect matches between trees and datasets. This function helps you find these close matches, which can be hand curated to keep as many species as possible in your analysis.
Usage
    FuzzyMatch(tree, data, max.dist)
Arguments
tree      a phylogenetic tree of the class "phylo".
data      character vector with the names from your dataset.
max.dist  This is the maximum number of characters that can differ between your tree and data and still be recognized as a close match.
Value
A dataframe with the following rows:
Name in data
Name in tree
Number of differences
Author(s)
Heath Blackmon
References
http://coleoguy.github.io/
Examples
    data(hym.tree)
    names <- c("Pepsis_elegans", "Plagiolepis_alluaudi", "Pheidele_lucreti",
               "Meliturgula_scriptifronsi", "Andrena_afimbriat")
    FuzzyMatch(tree = hym.tree, data = names, max.dist = 3)

horn.beetle.csv    Gnatocerus measurements
Description
A csv file containing measurements of horn and body size for the beetle Gnatocerus cornutus.

hym.tree    Phylogenetic tree
Description
This is a phylogenetic tree with 5 species of hymenoptera.

mcmc2    mcmc log file
Description
An mcmc log file. The first column is the tree used during the iteration; the remaining columns are the rate parameters of the Q matrix listed by column order.

mite.trait    phenotype data for mites
Description
Dataframe of sexual system and chromosome number data for mites.

Mode    Calculates the mode of a numeric vector
Description
R's base package function mode returns the type of object ('numeric', 'character' etc.). This gives the option of an easy to remember workaround for that.
Usage
    Mode(x)
Arguments
x  a numerical vector.
Details
Returns the most frequently occurring value in a vector. In the case of a tie it will return the mode which has the earliest initial occurrence in the vector.
Value
Returns the most frequently occurring value in a series of numbers.
Author(s)
Heath Blackmon
References
http://coleoguy.github.io/
Examples
    Mode(c(1, 2, 3, 4, 5, 6, 2, 5))

PPSDiscrete    Create Simulated Datasets via PPS
Description
This function performs posterior predictive simulations of discrete traits. The function is written to work with the output of Bayesian programs that produce a collection of rate matrix parameter estimates based on either one tree or a collection of trees.
Usage
    PPSDiscrete(trees, MCMC, states, N = 2)
Arguments
trees   an object of class "multiPhylo" or "phylo" containing the trees used in generating the rate estimates.
MCMC    this will normally be a log file that is brought into R with read.csv. The columns for a three state character should be: tree, qAA, qBA, qCA, qAB, qBB, qCB, qAC, qBC, qCC. If your analysis involves only a single tree then the tree column should be excluded.
states  a vector of root probabilities.
N       the number of PPS datasets desired.
Value
A matrix is returned with the rownames being the species names from the tree and each column containing the result of a single PPS.
Author(s)
Heath Blackmon
References
http://coleoguy.github.io/
Examples
    data(trees)
    data(mcmc2)
    data(mcmc3)
    # 1 tree 100 q-mats 3 states
    PPSDiscrete(trees[[1]], MCMC = mcmc3[, 2:10], states = c(.5, .2, .3), N = 2)
    # 10 trees 100 q-mats 3 states
    PPSDiscrete(trees, MCMC = mcmc3, states = c(.5, .2, .3), N = 10)
    # 10 trees 100 q-mats 2 states
    PPSDiscrete(trees, MCMC = mcmc2, states = c(.5, .5), N = 10)

ReorderData    Reorders trait data to match the order of tips in a tree
Description
This function takes a vector, matrix, or dataframe and reorders the data to match the order of tips in a phylo object.
Usage
    ReorderData(tree, data, s = "row names")
Arguments
tree  a phylo object.
data  a vector, matrix, or dataframe. The set of taxa names present in the tree and in the data must match. If data is a vector it should be a named vector. If the data is a matrix or dataframe the taxa names may be row names or present in a column.
s     If taxa names are present in a column, the column number should be supplied. If taxa names are the row names, the argument can be set to "row names" (default setting). If the data is being supplied in a vector this argument is not used.
Details
Returns data in the same format as supplied but reordered to match
the order of tip labels.Author(s)Heath BlackmonReferenceshttp://coleoguy.github.io/ResSel Selection on ResidualsDescriptionThis function takes measurements of multiple traits and performs a linear regression and identifies those records with the largest and smallest residual.Originally it was written to perform a regression of horn size on body size allowing for high and low selection lines.UsageResSel(data,traits,percent=10,identifier=1,model="linear")12SampleTreesArgumentsdata this is a dataframe with subject identifiers and phenotypic trait valuestraits a numeric vector indicating the column containing the predictor and response variables in that orderpercent the percentage of highest and lowest residuals that should be identifiedidentifier the column which contains the record numbers to identify individualsmodel currently this is not usedValueThis function returns a listhigh line the ID numbers for the individuals selected for the high linelow line the ID numbers of the individuals selected for the low lineAuthor(s)Heath BlackmonReferenceshttp://coleoguy.github.io/Examplesdata<-read.csv(file=system.file("horn.beetle.csv",package="evobiR"))ResSel(data=data,traits=c(2,3),percent=15,identifier=1,model="linear") SampleTrees Select a random sample of treesDescriptionThis function takes as its input a large collection of trees from a program like MrBayes or Beast and allows the user to select the number of randomly drawn trees they wish to retrieveUsageSampleTrees(trees,burnin,final.number,format,prefix)Argumentstrees a nexus formatfile containing trees that the user wants to sample fromburnin the proportion of trees to remove as burninfinal.number the number of trees desiredformat options are"new"or"nex"indicating to save the trees in newick format or nexus formatprefix a text string to assing to the new treefile nameSlidingWindow13Valuean object of the class"multiPhylo"is returnedAuthor(s)Heath 
BlackmonReferenceshttp://coleoguy.github.io/ExamplesSampleTrees(trees=system.file("trees.nex",package="evobiR"),burnin=.1,final.number=20,format= new ,prefix= sample ) SlidingWindow Sliding window analysisDescriptionApplies a function within a sliding window of a numeric vector.Both the step size and the window size can be set by the user.UsageSlidingWindow(FUN,data,window,step)ArgumentsFUN a function to be applied within each window.data a numerical vector.window an integer setting the size of the window.step an integer setting the size of step between windows.DetailsReturns a vector of numeric values representing the applying the selected function within each window.The length will be unequal to the original data and will be determined primarily by the step size.Author(s)Heath BlackmonReferenceshttp://coleoguy.github.io/14SuperMatrixExamplesdata<-c(1,2,1,2,10,2,1,2,1,2,3,4,5,6,2,5)SlidingWindow("mean",data,3,1)SuperMatrix creates a supermatrix from multiple gene alignmentsDescriptioncombines all alignments in a folder into a single supermatrixUsageSuperMatrix(missing="-",prefix="concatenated",save=T)Argumentsmissing the character to use when no data is available for a taxaprefix prefix for the resulting supermatrixsave if True then supermatrix and partitionfile will be savedDetailsThis function reads all fasta format alignments in the working directory and constructs a single supermatrix that includes all taxa present in any of the fastafiles and inserts missing symbols for taxa that are missing sequences for some loci.ValueA list with two elements is returned.Thefirst element contains partition data that records thealignment positions of each input fastafile in the combined supermatrix.The second element is a dataframe version of the supermatrix.If the argument save is set to True then both of thesefiles are also saved to the working directory.Author(s)Heath BlackmonReferenceshttp://coleoguy.github.io/Examples##Not 
run:SuperMatrix(missing="N",prefix="DATASET2",save=T)##End(Not run)trees15 trees10Phylogenetic treesDescriptionThis is a collection of10simulated phylogenetic trees with200tips each.trees.mite10Phylogenetic treesDescriptionThese are trees from a previously published work on mite sexual system evolution.trees.nex100Phylogenetic treesDescriptionThis is a collection of100simulated phylogenetic trees with10tips each.ViewEvo Learning ResourcesDescriptionThis uses the shiny app to produce interactive pages.UsageViewEvo(simulation)Argumentssimulation Text string indicating the application to run.Currently options are"wf.model", "bd.model","dist.model"DetailsThe wf.model was implemented to illustrate to students the effects of genetic drift.In particular the high likelihood of losing a beneficial allele when population size isfinite.The bd.model will plot 4phylogenetic trees based on a birth death model with a single set of parameters.This application was developed to illustrate the high variability of a birth death process as a generating model for phylogenies and the inherint difficulty in detecting differential diversification rates.Finally the dist.model was developed to help illustrate the relationship between common statistical distributions often used as priors and the way that parameters effect the density distribution.ValueThis function returns an interactive webpage.Author(s)Heath BlackmonReferenceshttp://coleoguy.github.io/Wright-Fisher Simulator:https://evobir.shinyapps.io/wf_model/Birth-death Simulator:https://evobir.shinyapps.io/bd_modelStatistical Distribution:https://evobir.shinyapps.io/dist_modelExamples##Not run:ViewEvo("wf.model")ViewEvo("bd.model")ViewEvo("dist.model")##End(Not run)WinCalcD Calculate Patterson’s D-statistic in sliding windowsDescriptionThis functions calculate Patterson’s D-statistic in windows.UsageWinCalcD(alignment="alignment.fasta",win.size=100,step.size=50,boot=F,replicate=1000)Argumentsalignment This is an alignment in fasta 
formatwin.size This is the size of the window usedstep.size This is the size of steps in the sliding windowboot This indicates whether or not bootstrapping should be performed to estimate variancereplicate Number of replicates to be used in estimating varianceDetailsThis function is just an extension of CalcD and calculates D statistic for windows.ValueReturns a table with the number of each type of site,Z scores and p-values for each window in the genomeAuthor(s)Heath BlackmonReferenceshttp://coleoguy.github.io/Durand,Eric Y.,et al.Testing for ancient admixture between closely related populations.Molecular biology and evolution28.8(2011):2239-2252.Eaton,D.A.R.,and R.H.Ree.2013.Inferring phylogeny and introgression using RADseq data: An example fromflowering plants(Pedicularis:Orobanchaceae).Syst.Biol.62:689-706ExamplesWinCalcD(alignment=system.file("1.fasta",package="evobiR"),win.size=100,step.size=50,boot=TRUE,replicate=10)Index∗ABBACalcD,5WinCalcD,16∗AICcAICc,3∗D-statisticCalcD,5WinCalcD,16∗PPSPPSDiscrete,10∗SNPCalcD,5WinCalcD,16∗SuperMatrix,concatenation,alignment SuperMatrix,14∗basic statsEven,6Mode,9SlidingWindow,13∗comparative analysisReorderData,11∗comparative methodsFuzzyMatch,7∗comparative methodAncCond,4∗concatention alignment DNA fasta SuperMatrix,14∗continuous traitAncCond,4∗data mungingReorderData,11∗datasets1.fasta,2horn.beetle.csv,8hym.tree,8mcmc2,8mite.trait,8trees,15trees.mite,15trees.nex,15∗discrete traitAncCond,4∗hard selectionResSel,11∗hybridizationCalcD,5WinCalcD,16∗interactiveViewEvo,15∗introgressionCalcD,5WinCalcD,16∗model comparisonAICc,3∗modeEven,6Mode,9∗pedagogicalViewEvo,15∗phylogeneticsFuzzyMatch,7SampleTrees,12∗residualResSel,11∗sliding windowSlidingWindow,13WinCalcD,16∗teachingViewEvo,151.fasta,22.fasta(1.fasta),23.fasta(1.fasta),2AICc,3AncCond,4CalcD,5CalcPopD(CalcD),518INDEX19 
Even,6evobiR(evobiR-package),2evobiR-package,2FuzzyMatch,7horn.beetle.csv,8hym.tree,8mcmc2,8mcmc3(mcmc2),8mite.trait,8Mode,9PPSDiscrete,10ReorderData,11ResSel,11SampleTrees,12SlidingWindow,13SuperMatrix,14trees,15trees.mite,15trees.nex,15ViewEvo,15WinCalcD,16。
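The Mode and SlidingWindow entries above describe two small algorithms: a mode that breaks ties by earliest occurrence, and a function applied over successive windows of a vector. The Python sketch below illustrates both ideas. It is not the evobiR implementation; the names mode_first and sliding_window are hypothetical, and it assumes that only complete windows are evaluated, which the manual does not state.

```python
from collections import Counter


def mode_first(values):
    """Return the most frequent value; ties go to the value that appears first."""
    if not values:
        raise ValueError("mode_first() needs at least one value")
    counts = Counter(values)
    highest = max(counts.values())
    # Scan in original order so the earliest tied value wins.
    for value in values:
        if counts[value] == highest:
            return value


def sliding_window(func, data, window, step):
    """Apply func to each complete window of `window` items, advancing `step` items.

    Assumption: a trailing partial window is dropped.
    """
    starts = range(0, len(data) - window + 1, step)
    return [func(data[start:start + window]) for start in starts]


if __name__ == "__main__":
    print(mode_first([1, 2, 3, 4, 5, 6, 2, 5]))
    print(sliding_window(max, [1, 2, 1, 2, 10], 3, 1))
```

With a window of 3 and a step of 1, a vector of 16 values yields 14 windows under this assumption, which matches the manual's remark that the result is shorter than the input.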
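Steps and Recommended Tools for Comparative Genomics Research Using Biological Big Data
Comparative genomics studies differences in genome structure and function between species. It can reveal evolutionary relationships among organisms, the origin and function of gene families, and events such as gene rearrangements during evolution.
Over the past few decades, advances in high-throughput sequencing and the accumulation of biological big data have made comparative genomics research more efficient and more comprehensive.
This article describes the steps of a comparative genomics study that uses biological big data and recommends some commonly used tools.
Step 1: Data acquisition
The first step is to obtain biological data that suit the study.
Public databases (such as NCBI and Ensembl) hold genome sequences and annotations for a large number of species, and these can be used for comparative genomics.
Researchers can access the data through the database websites or their API interfaces and download them for local analysis.
Step 2: Sequence alignment
Sequence alignment is a key step in comparative genomics.
It compares the genome sequences of different species to find where they are similar and where they differ.
To do this, researchers need to choose a suitable alignment tool.
Commonly used tools include Bowtie and BWA (short-read mappers) and BLAST (local similarity search).
These tools differ in their alignment algorithms and parameters, so they suit sequences of different types and lengths.
Step 3: Gene annotation
Gene annotation is another important step.
Once alignment is complete, the aligned sequences need to be annotated to learn which functional regions and gene features they fall in.
Tools commonly used at this step include ANNOVAR, GATK and the Ensembl Variant Effect Predictor; these are oriented toward variant calling and variant annotation.
They can help researchers predict functional effects and explore genotype-phenotype associations.
Step 4: Gene family analysis
An important application of comparative genomics is studying the origin and functional evolution of gene families.
A gene family is a group of genes with similar sequence and function, usually descended from a single ancestral gene.
To study gene families, researchers can use tools for phylogenetic analysis and gene family classification.
Commonly used tools include Phylogenetic Analysis by Maximum Likelihood (PAML), Geneious and OrthoMCL.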
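NTM: Orthologous Genes and Sequence Comparison Methods (Gold Standard)
"NTM" usually refers to non-tuberculous mycobacteria, a group of bacteria that are taxonomically close to Mycobacterium tuberculosis but differ from it in disease presentation and treatment.
Finding orthologous genes in NTM, or comparing NTM sequences, usually involves bioinformatics and molecular biology methods.
Comparing orthologous genes
1. Bidirectional best hits (BBH): a commonly used method for identifying orthologous genes between two species.
BBH is based on reciprocal best BLAST hits between the two species: if gene B in the second species is the best hit for gene A in the first, and gene A is in turn the best hit for gene B, then A and B are treated as putative orthologs.
2. Gene families and evolutionary analysis: building gene families and analysing their evolution can establish homology relationships between NTM and other species.
This can be done with tools such as OrthoMCL, MAFFT and FastTree.
3. Conserved genes and conserved regions: comparing the genomes of several species identifies a set of conserved genes or conserved DNA regions, which are usually functionally important.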
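The bidirectional best hit rule described in item 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming each BLAST search has already been reduced to (query, subject, bit score) tuples; the function names and the gene identifiers in the example are hypothetical, and ties between equally scored hits are resolved by input order.

```python
def best_hits(hits):
    """Map each query to the subject with the highest bit score.

    hits: iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Hypothetical hits: species A genes searched against species B, and back.
    a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0),
              ("a2", "b2", 90.0), ("a3", "b1", 80.0)]
    b_vs_a = [("b1", "a1", 198.0), ("b2", "a1", 140.0), ("b2", "a2", 95.0)]
    print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

Here a2 is not paired with b2 because b2's own best hit is a1, which is the asymmetry the reciprocal check is designed to catch.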
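Sequence comparison methods
1. Whole genome alignment: aligning the complete genomes of two species identifies homologous regions and possible recombination events.
Commonly used tools include Mauve, BRIG and MUMmer.
2. Local alignment tools: tools such as BLAST, BLAT and LAST compare the local similarity between two sequences.
3. Multiple sequence alignment: when several species or sequences are involved, multiple sequence alignment methods such as Clustal, MUSCLE and MAFFT can be used.
Gold standard
In bioinformatics, a "gold standard" usually means a widely accepted and reliable method or dataset that is used to assess the accuracy and reliability of other methods or datasets.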
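Points to Note When Using Genome Alignment Techniques in Bioinformatics
Genome alignment is an important task in bioinformatics and plays a key role in studying the structure, function and evolution of genomes.
Genome alignment compares one genome sequence with another. It helps researchers identify shared gene segments, detect variation, and infer evolutionary relationships between species.
However, genome alignment is not a simple task, and there are several points researchers need to keep in mind.
First, a suitable alignment algorithm must be chosen.
Commonly used options include BLAST, Smith-Waterman and BWA.
BLAST suits fast searches against large datasets, while Smith-Waterman suits accurate alignment of smaller sequences.
BWA is well suited to genome-scale alignment tasks.
Choosing an algorithm that matches the task is essential for alignment quality and efficiency.
Second, the reference genome sequence and the sample sequences to be aligned must be prepared.
The reference is usually a known, complete genome sequence; it may be the human genome, the genome of a model organism, or the genome of another species.
The sample sequences may be individual genome sequences obtained by sequencing, transcriptome data, or other modified genome sequences.
Choosing an appropriate reference genome and appropriate sample sequences is essential for accurate alignment.
Third, suitable parameter settings must be chosen.
Parameter settings directly affect the alignment result.
For example, an alignment task needs scores for matches and penalties for deletions and insertions, and a decision on whether multiple alignment results should be combined in a multiple sequence alignment. These parameters should be adjusted to the needs of the task.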
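To make the effect of match, mismatch and gap settings concrete, the following Python sketch computes a Smith-Waterman local alignment score. It is a teaching sketch under stated assumptions: it uses a single linear gap penalty (production aligners typically use affine gap penalties and substitution matrices), it returns only the best score rather than the alignment itself, and the default parameter values are arbitrary illustrations.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Assumption: linear gap penalty, the same cost for every gapped position.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at zero.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GGACGTGG", "TTACGTTT"))
    print(smith_waterman_score("ACGT", "ACGT", match=1))
```

Changing `match`, `mismatch` or `gap` changes which local alignment scores highest, which is why the same pair of sequences can give different results under different parameter settings.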
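In addition, genome alignment has to deal with repetitive sequences.
Genomes contain large numbers of repeats, which can interfere with alignment and make the results inaccurate.
When aligning genomes, care is therefore needed to identify and handle repeats correctly so that the results are accurate and reliable.
Another point is that the alignment results need a quality assessment.
Results can be assessed with metrics such as alignment rate, alignment quality, coverage and confidence.
Assessing the quality of the results helps researchers judge how reliable the alignment data are, and provides the basis for downstream analysis.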
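Two of the metrics named above, alignment rate and coverage, are simple to compute once the alignments are known. The Python sketch below is one possible formulation, assuming alignments are given as 0-based half-open (start, end) intervals that lie within the reference; the function names are hypothetical and real pipelines would normally read these values from a SAM/BAM file.

```python
def mapping_rate(mapped_reads, total_reads):
    """Fraction of reads that aligned to the reference."""
    if total_reads == 0:
        raise ValueError("total_reads must be greater than zero")
    return mapped_reads / total_reads


def coverage_stats(ref_length, alignments):
    """Return (breadth, mean_depth) for a reference of ref_length bases.

    alignments: list of (start, end) intervals, 0-based, end exclusive,
    assumed to satisfy 0 <= start <= end <= ref_length.
    """
    # Difference array: +1 where an alignment starts, -1 just past its end.
    delta = [0] * (ref_length + 1)
    for start, end in alignments:
        delta[start] += 1
        delta[end] -= 1
    covered = 0
    total_depth = 0
    depth = 0
    for position in range(ref_length):
        depth += delta[position]
        total_depth += depth
        if depth > 0:
            covered += 1
    return covered / ref_length, total_depth / ref_length


if __name__ == "__main__":
    print(mapping_rate(90, 100))
    print(coverage_stats(10, [(0, 5), (3, 8)]))
```

Breadth is the fraction of reference positions covered at least once, while mean depth is the average number of alignments per position; a sample can have high depth and still leave regions uncovered.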
Finally, genome alignment requires substantial computing and storage resources.
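Gene Manipulation (25 slides)
Plating (Ampr)
Recovery (culture at 37 °C / Amp-)
Selecting / screening transformants: resistance selection; blue-white screening (α-complementation). lacZ′: encodes the α-peptide of β-galactosidase (amino acids 3-92). Host cell: β-galactosidase mutant.
[Figure: replica plating and selection. Labels: pBR322 with ampR and tetR, insert DNA, tetS, ampicillin and tetracycline plates; pUC119 with ampR and lacZ′, insert DNA, Amp + IPTG + X-gal plate.]
Identification of transformants:
Restriction digest: size and orientation of the inserted fragment
[Figure: restriction maps and digest pattern. Site labels H, E, S, A, X, B, C, D; size marks 6000, 4000, 2000.]
Digestion reaction: plasmid DNA (<1 µg); 10× digestion buffer, 2 µL; restriction endonuclease (1 U); sterile water to 20 µL; 37 °C (30 °C)
[Figure: maps with sites E1 to E5, X and B.]
Agarose gel electrophoresis
Principle: charge effect and molecular-sieve effect. Direction of DNA migration: from the negative to the positive electrode. Migration rate: depends on the size and conformation of the DNA molecule.
Restriction endonucleases; several other enzyme tools
S1 nuclease (hydrolyses single-stranded DNA; sticky ends become blunt ends)
Alkaline phosphatase (removes the phosphate groups from DNA ends)
Reverse transcriptase (mRNA → cDNA) ... C; blunt ends become sticky ends)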
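Workflow for Using Gene Data
1. Importing gene data
• Make sure the required gene data have been obtained; the usual formats are FASTA, FASTQ and similar.
• Use a suitable bioinformatics tool, such as Biopython (its Bio.SeqIO module), to load the data into a program for processing and analysis.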
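As an illustration of the import step, the sketch below reads FASTA records using only the Python standard library, so it runs without extra dependencies. In practice Biopython's `Bio.SeqIO.parse(handle, "fasta")` does this job and handles many more formats; the function name read_fasta and the example sequences here are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA text handle.

    The identifier is the first whitespace-delimited token after '>'.
    """
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped; join them back together.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


if __name__ == "__main__":
    example = io.StringIO(">seq1 demo record\nACGT\nAC\n>seq2\nGG\n")
    for identifier, sequence in read_fasta(example):
        print(identifier, len(sequence))
```

Writing the reader as a generator means large files are processed one record at a time instead of being loaded into memory at once.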
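2. Data preprocessing
• Check the quality and completeness of the gene data, looking for problems such as noise, missing values and duplicated sequences.
• Use suitable methods to clean the data, remove unnecessary information and repair errors.
• Normalise or convert the data so that they meet the format and requirements of further analysis.
3. Gene sequence analysis
3.1 Sequence alignment
• Use a suitable alignment tool (such as BLAST, Bowtie or BWA) to align the gene sequences against a known reference sequence or against sequences in other databases.
• From the alignment results, assess sequence similarity, degree of matching and variation.
3.2 Sequence annotation
• Use bioinformatics tools (such as Geneious or NCBI resources) to annotate the gene sequences and obtain key biological information.
• Annotation covers gene function, protein structure, evolutionary information and so on.
Annotation can be done by comparison with known annotation databases (such as GenBank and UniProt).
3.3 Gene expression analysis
• Use transcriptome analysis tools (such as DESeq2 or edgeR) to quantify gene expression.
• Compare expression levels between samples (for example, control and treatment groups) to find differentially expressed genes, and analyse them statistically.
• Use PCA (Principal Component Analysis), cluster analysis and similar methods to visualise expression patterns.
3.4 Functional enrichment analysis
• Perform functional enrichment analysis on the differentially expressed genes.
• Use bioinformatics tools (such as DAVID or GOseq) to run GO (Gene Ontology) enrichment and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway analysis on the differentially expressed genes.
• This reveals the potential biological functions and pathways of the differentially expressed genes.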
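Enrichment analysis asks whether a functional category appears among the differentially expressed genes more often than chance would predict. One common formulation is the one-sided hypergeometric test, sketched below in Python. This is an illustration of the statistical idea only: the tools named above apply their own variants and corrections, the function name is hypothetical, and the numbers in the example are invented for demonstration.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes carrying the annotation (e.g. a GO term)
    sample:     number of differentially expressed genes
    hits:       differentially expressed genes carrying the annotation
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Invented example: 40 of 200 selected genes carry a term that
    # annotates 500 of 20000 background genes.
    print(enrichment_p_value(20000, 500, 200, 40))
```

Because many terms are usually tested at once, the resulting p-values would normally be adjusted for multiple testing before interpretation.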
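4. Interpreting and visualising results
• Combine the results of the analyses above to interpret the biological meaning and findings in the gene data.
• Use suitable visualisation tools (such as R, or Python's Matplotlib and Seaborn) to plot the results so that the data can be presented and explained more intuitively.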
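Tutorial on Using Gene Sequencing Technology
Gene sequencing is an important tool in biology that helps us understand how living systems work.
This article walks through the use of gene sequencing technology, covering preparation, experimental steps and data analysis, in the hope of giving readers some useful information.
I. Preparation
Before sequencing, some experimental materials and equipment need to be prepared.
First, the DNA sample to be sequenced must be extracted; it can come from human tissue, cell cultures or other organisms.
Second, a high-throughput sequencer is needed, for example an Illumina HiSeq or a PacBio Sequel.
Consumables such as reagent kits, test tubes and centrifuge tubes are also required.
While preparing materials and equipment, attention must be paid to laboratory safety and hygiene.
II. Experimental steps
1. DNA sample preparation: purify and amplify the extracted DNA to obtain enough material for sequencing.
This step usually uses PCR, with specific primers chosen to amplify the target DNA fragments.
2. Library construction: ligate the amplified DNA fragments into a sequencing library.
A library is a collection of DNA fragments that will be read and analysed during sequencing.
A suitable library preparation method must be chosen, such as Illumina TruSeq or NEBNext Ultra II.
3. Sequencing reaction: load the library onto the sequencer and run the sequencing reaction.
Different sequencers use different technologies: Illumina instruments use bridge amplification and fluorescently labelled bases, while PacBio instruments use single-molecule real-time sequencing.
4. Data generation: the sequencer reads the DNA fragments in the library and converts them into digital sequencing data.
These data are stored in FASTQ format, which contains the sequence and quality values of each DNA fragment.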
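The FASTQ format mentioned in step 4 stores each read as four lines: an identifier line starting with "@", the sequence, a "+" separator, and a quality string with one character per base. The Python sketch below reads such records and decodes the quality characters. It assumes unwrapped four-line records and the Phred+33 encoding; the function name and the example read are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, qualities) from an open FASTQ text handle.

    Assumptions: four lines per record, Phred+33 quality encoding.
    """
    lines = (line.strip() for line in handle)
    for header in lines:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(lines, "")
        next(lines, "")  # the '+' separator line is not needed
        quality = next(lines, "")
        # Phred+33: quality score = ASCII code of the character minus 33.
        yield header[1:], sequence, [ord(char) - 33 for char in quality]


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nII5!\n")
    for identifier, sequence, qualities in read_fastq(example):
        print(identifier, sequence, qualities)
```

A Phred score Q corresponds to an estimated base-call error probability of 10^(-Q/10), so a score of 20 means roughly a 1 in 100 chance that the base is wrong.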
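III. Data analysis
1. Data preprocessing: the sequencing data need to be preprocessed before analysis.
This includes removing low-quality sequences, removing adapter sequences and aligning the reads.
Commonly used tools for these steps include Trimmomatic, Cutadapt and BWA.
2. Sequence alignment: align the sequencing data to a reference genome to determine where each DNA fragment comes from.
Commonly used alignment tools include Bowtie, BWA and STAR.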
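The removal of low-quality sequence described in the preprocessing step can be illustrated with a simple trailing-quality trim: bases are dropped from the 3' end of a read until one meets a minimum quality. The Python sketch below shows the idea only. It is not the algorithm of any particular tool, the threshold of 20 is an arbitrary illustration, and the function name is hypothetical.

```python
def trim_trailing(sequence, qualities, min_quality=20):
    """Remove bases from the 3' end while their quality is below min_quality.

    sequence:  read bases as a string
    qualities: list of integer Phred scores, one per base
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


if __name__ == "__main__":
    print(trim_trailing("ACGTAC", [30, 30, 30, 10, 30, 5]))
```

Note that the low-quality base in the middle of the example read is kept, because this approach only trims from the end; sliding-window trimmers are one way to handle internal quality drops.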
BRIG 0.95 Manual
Nabil Alikhan
June 27, 2011
/projects/brig/

Contents
1 Introduction
2 Licence
3 Installation
  3.1 Installing BLAST
4 Warning when using BLAST
  4.1 Low complexity filtering
  4.2 Expected values (e-values) and bit scores
5 Visualising whole genome comparisons
  5.1 Step 1: Load in sequences
  5.2 Step 2: Configure rings
  5.3 Step 3: Review and submit
6 Working with a Multi-FASTA reference
  6.1 Step 1: Load in sequences
  6.2 Step 2: Configure rings, annotations and spacer value
  6.3 Step 3: Configure image settings and submit
7 Visualising graphs and genome assemblies
  7.1 Walkthrough for visualising SAM file mapping coverage
  7.2 Walk through for visualising ace file assembly coverage
8 Walkthroughs on creating custom annotations
  8.1 Adding custom annotations from a tab-delimited file, GenBank or EMBL file
    8.1.1 Step 1: Load in sequences
    8.1.2 Step 2: Configure rings
    8.1.3 Step 3: Adding annotations
    8.1.4 Step 4: Review and submit
  8.2 How to create tab-delimited files for BRIG
9 Configuration options
  9.1 Saving and reopening your work
  9.2 BLAST options
  9.3 Setting BRIG options
  9.4 Setting Image options
  9.5 Loading a preset image template

1 Introduction
The BLAST Ring Image Generator (BRIG) is a cross-platform desktop application written in Java 1.6. It uses CGView [5] for image rendering and the Basic Local Alignment Search Tool (BLAST) for genome comparisons. It has a graphical user interface programmed on the Swing framework, which takes the user step-by-step through the configuration of a circular image generation. Figure 1 is an example of an image BRIG can create.

Figure 1: BRIG example output image of a simulated draft E. coli O157:H7 genome. The figure shows BLAST comparisons of 28 published E. coli and Salmonella genomes against the simulated draft genome.

Figure 2 shows a magnified view of the same example image, showing similarity between a central reference genome and other query sequences
as a set of concentric rings, where colour indicates a BLAST match of a particular percentage identity. BRIG does not represent sequences that are not present in the reference genome. The image shows:
• GC skew,
• GC content,
• Genome coverage and contig boundaries (calculated from an assembly file),
• Genome alignment results, custom annotations.

Figure 2: A magnified view of BRIG example image

How to use this manual
This manual contains a set of detailed walk throughs where readers are taken step by step through a worked example. Each walkthrough highlights different features of BRIG and users should work through each one. If you are interested in a particular aspect of BRIG, please turn to the relevant walkthrough:
• Whole genome comparisons, including how to load in coverage graphs, e.g. Figures 1 & 3, see Section 5.
• Using a user-defined list of genes as a reference (in Multi-FASTA), e.g. Figure 4, see Section 6.
• Creating and visualising graphs generated from assembly (.ace) or read mapping coverage (.SAM), e.g. Figure 5, see Section 7.
• Labeling images with information from GenBank, Tab-delimited or Multi-FASTA files, like those seen in Figures 3, 4 & 5, see Section 8.
The manual also has detailed instructions for how to install and configure BRIG:
• For instructions on how to install BRIG, see Section 3.
• For instructions on how to configure BRIG and save BRIG settings, see Section 9.

Figure 3: Reference: Published E. coli O157:H7 Sakai genome. Query: Complete genome sequences of related strains, listed in the key. The prophage regions from the Sakai genome are marked in alternating black & blue. To make an image like this please refer to Section 5.

Figure 4: Reference: A list of translated genes that make up the Locus of Enterocyte Effacement (LEE), which encodes a Type III secretion system. Query: Raw sequencing reads simulated from several complete LEE+ published genomes (nucleotide sequence) and E. coli K12 (negative control; LEE-). You can clearly see gene presence/absence, and divergence (the colour represents sequence identity on a sliding scale; the greyer it gets, the lower the percentage identity). To make an image like this please refer to Section 6.

Figure 5: Reference: Published E. coli O157:H7 Sakai genome. Query: Read mapping coverage of sequencing reads simulated from complete genomes, indicated in the key. Simulated sequencing reads were mapped onto the published complete Sakai genome using BWA. The read coverage for each genome was generated from the resulting SAM file. Compare this with Figure 3, which is based on the original published genome sequences. To make an image like this please refer to Section 7.

2 Licence
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see </licenses/>. Please note that these restrictions do not apply to the third party libraries bundled with this software.

3 Installation
There's no real "Installation" process for BRIG itself. However, BLAST+ [2] or BLAST legacy [1] must already be installed and BRIG needs to be able to locate the BLAST executables (see Section 3.1). To run BRIG users need to:
1. Download the latest version (BRIG-x.xx-dist.zip) from /projects/brig/
2. Unzip BRIG-x.xx-dist.zip to a desired location.
3. Run BRIG.jar, by double clicking.
Users who wish to run BRIG from the command-line need to:
1. Navigate to the unpacked BRIG folder in a command-line interface (terminal, console, command prompt).
2. Run "java -Xmx1500M -jar BRIG.jar", where -Xmx specifies the amount of memory allocated to BRIG.

3.1 Installing BLAST
The latest version of BLAST+ [2] can be downloaded from:
ftp:///blast/executables/blast+/LATEST/
BLAST+ offers a number of improvements on the original BLAST implementation and comes as a bundled installer, which will walk users through the installation process. Please read the published paper on BLAST+:
Camacho, C., G. Coulouris, et al. (2009). "BLAST+: architecture and applications." BMC Bioinformatics 10(1): 421. Available online at: http://www. /1471-2105/10/421
The latest version of BLAST legacy [1] can be downloaded from:
ftp:///blast/executables/release/LATEST/
BLAST legacy comes as a compressed package, which will unzip the BLAST binaries wherever the package is. We advise users to first create a BLAST directory (in either the home or applications directory), copy the downloaded BLAST package to that directory and unzip the package.
BRIG supports both BLAST+ & BLAST legacy. Users can specify the location of their BLAST installation in the BRIG options menu, which is:
Main window > Preferences > BRIG options.
The window is shown in Figure 6. If BRIG cannot find BLAST it will prompt users at runtime.

PRO TIP 1: BRIG uses BLAST, do not use wwwblast or netblast with BRIG.
PRO TIP 2: If BOTH BLAST+ and legacy versions are in the same location, BRIG will prefer BLAST+.

Figure 6: You can change where BRIG looks for BLAST in the BRIG options window. For more information about BRIG options see Section 9.3.

4 Warning when using BLAST
BRIG relies on the Basic Local Alignment Search Tool (BLAST) for genome comparisons. BLAST has a number of behaviours that may seem counterintuitive and we encourage users to learn about local alignment and the BLAST algorithm to fully understand the images that BRIG produces. There are a few concepts to keep in mind when using BRIG:

4.1 Low complexity filtering
PRO TIP 3: BLAST filters may cause gaps in alignments, which will show up as blank regions in BRIG images.
BLAST filters (BLAST legacy -F flag or BLAST+ -dust/-seg no flag) filter the query sequence for low-complexity sequences by default. This includes sequences that are highly repetitive or contain the same nucleotide for long lengths of the sequences. Low-complexity filtering is generally a good idea, but it may break long matches into several smaller matches. This is often shown in BRIG images as truncations or gaps in alignments; it is particularly obvious in very small reference sequences where alignments are shown on a gene-by-gene level. To prevent this, either turn off filtering or use soft masking.

4.2 Expected values (e-values) and bit scores
PRO TIP 4: BLAST's bitscore filtering may cause different results in BRIG if users swap the query and reference sequences, particularly if these are very different sizes.
BLAST uses statistical thresholds to filter out "bad alignments": alignment matches that appear random to BLAST. One of these thresholds is the e-value, which is the number of alignments with an equal or better score that would be expected to occur by chance, given the complexity of the match, sequence composition and the size of the database. It is more likely in a larger sequence that an alignment could occur by chance, so BLAST is more critical of these matches. This can create different expected values if BLAST is used with the same reference sequence against databases of different sizes and may potentially filter out significant matches or include poor scoring ones.
Because of this, users might notice different results in BRIG images if they swap the order of the database and reference sequences around in the BLAST, especially if the two sequences are quite different in size. The differences are often due to a few very low-scoring hits.
Users should consider what an appropriate e-value threshold is for the comparisons that they run. Remember that BLAST runs with an e-value of 10 by default; we recommend that users change this. Users can set the final threshold (e-value) with the -e flag in BLAST legacy or the -evalue flag in BLAST+.

PRO TIP 5: BLAST does not handle spaces in filenames, BRIG will prompt users if they have spaces in file locations.

5 Visualising whole genome comparisons
In this section we will walk through the basics of generating an image. This walk through will be comparing an E. coli genome with five other E. coli genomes and mapping the read coverage from the underlying genome assembly onto the same image.
For this walk through, users will need BRIG examples.zip, which is available from the BRIG website (/projects/brig/ files/). This contains all the genomes and files needed to follow along with this walk through. Unzip it somewhere easily accessible, like the home directory or desktop.

About the reference genome
The reference genome used in this walk through is a simulated E. coli genome assembly. We took the published E. coli O157:H7 Sakai genome (Accession number BA000007) sequence and had assembly reads simulated by METASIM [4] and then assembled these using Newbler version 2.3. The resulting contiguous sequences were ordered using Mauve [3] against the published Sakai genome. This simulated E. coli is useful for illustrating some of BRIG's graphing features for assembly read coverage.
Enterohemorrhagic E. coli are gram-negative, enteric bacterial pathogens. They can cause diarrhea, hemorrhagic colitis, and hemolytic uremic syndrome. The particular genome we are using in this example was based on an E. coli O157:H7 isolated from the Sakai, Japan outbreak.

5.1 Step 1: Load in sequences
The walk through will work out of the unzipped BRIG examples.zip in the Chapter568wholeGenomeExamples folder. The walk through and related figures will use C:\BRIG examples\Chapter568wholeGenomeExamples as that location.
To keep the final image consistent with the walk through, please open "ExampleProfile.xml" from the Chapter568wholeGenomeExamples folder. This file configures BRIG to the same image settings in the walk through.
1. First, set BRIGExample.fna as the reference sequence.
2. Set <unzipped BRIG examples folder>\Chapter568wholeGenomeExamples as the query sequence folder.
3. Press "add to data pool"; this should load several items into the pool list. There should be nine files.
4. Set the Chapter568wholeGenomeExamples as the output folder.
5. The BLAST options box should be left blank.
6. Click next.

PRO TIP 6: Users can add individual files to the data pool too.

5.2 Step 2: Configure rings
The next step is to configure what information is shown on each of the concentric rings in BRIG. Create six rings; for each ring:
1. Set the legend text for each ring.
2. Select the required sequences from the data pool and click on "add data" to add to the ring list.
3. Choose a colour.
4. Set the upper (90) and lower (70) identity threshold.
5. Click on "add new ring" and repeat steps for each new ring required.
The values required for each ring are detailed in the table below. Notice that sequences can be collated into a single ring, like the example of K12 & HS. The ring will show BLAST matches from both HS and K12.

Legend text         Required sequences          Colour
GC Content          GC Content                  Ignore
GC Skew             GC Skew                     Ignore
Coverage            BRIGExample.graph           153,0,0
O157:H7             E coli O157H7Sakai.gbk      0,0,153
HS and K12          E coli HS.fna               0,153,0
                    E coli K12MG1655.fna
CFT073 and UTI89    E coli CFT073.fna           153,0,153
                    E coli UTI89.fna

PRO TIP 7: Rings can be reordered by dragging them in the Ring List pane.
PRO TIP 8: You can set default threshold values in "BRIG options". See Section 9.3 for more details.
PRO TIP 9: When using a Genbank/EMBL file as a reference, users can choose whether to use the protein or nucleotide sequence.

5.3 Step 3: Review and submit
The last window allows us to change the BLAST options, the location of the image file and set the image title, which will appear in the centre of the ring. For the walkthrough configure the third window as:
1. Set the image title as "BRIG example image".
2. Hit submit.
3. The image will be created in the specified output directory and should look something like Figure 7.
BRIG will format Genbank files, run BLAST, parse the results and render the image. The final image (Figure 7) shows GC Content and Skew, the genome coverage, contig boundaries, and the BLAST results against the other E. coli genomes.
The results for HS and K12 have been collated into a single ring, likewise for UTI89 and CFT073.

Figure 7: The final BRIG image

PRO TIP 10: Image settings, like size, fonts, etc. can be configured in: Main window > Preferences > Image options.

6 Working with a Multi-FASTA reference
6.1 Step 1: Load in sequences
This section is a walk through of how to use BRIG to generate an image using a list of genes in Multi-FASTA format as a reference. The multi-FASTA file in this example is a number of virulence genes from enterohemorrhagic and uropathogenic E. coli, which includes EHEC polar fimbriae (ecpA to ecpR), EHEC Locus of Enterocyte Effacement (espF to espG) and the UPEC F1C fimbriae (focA to focI), which will be compared against the whole genome sequences of E. coli strains O157:H7 Sakai, K12 MG1655, O126:H7 and CFT073. Start a new session in BRIG and load in the files from the Chapter568wholeGenomeExamples folder in the unzipped BRIG-Example folder:
1. Set the reference sequence as "Ecoli vir.fna". Users can use the browse button to traverse the file system.
2. Set <unzipped BRIG examples folder>/Chapter568wholeGenomeExamples as the query sequence folder.
3. Press "add to data pool"; this should load several items into the pool list.
4. Set the output folder as unzipped BRIG-Example folder.
5. Make sure the BLAST options box is blank.
6. Click "Next".

6.2 Step 2: Configure rings, annotations and spacer value
The next step is to configure what information is shown on each concentric ring in BRIG. Figure 8 is an example of how one of the windows should be set up. There should be five rings. Do the following for each ring, according to the table below:
1. Set legend text for each ring.
2. Select the required sequences from the data pool and click "add data" to add.
3. Choose a colour.
4. Set upper (90) and lower (70) identity thresholds.
5. Click "add new ring" and repeat these steps for each new ring.

Legend text    Required sequences          Colour
O157:H7        E coli O157H7Sakai.gbk      172,14,225
O126:H7        E coli O126.fna             255,0,51
CFT073         E coli CFT073.fna           0,0,102
K12            E coli K12MG1655.fna        161,221,231
null           none                        ignore

After each ring is configured, users need to make the following changes:
1. Set the spacer field to 50 base pairs.
2. Set the ring size of ring 5 as "2".

PRO TIP 11: The Spacer field determines the number of base pairs to leave between FASTA sequences.

The next step is to add the gene annotations, which will be fetched from the Multi-FASTA headers:
1. Click Add custom features in the second BRIG window to bring up the custom annotation window (Figure 9).
2. Double click "Ring 5".
3. Set "input data" as Multi-FASTA.
4. Set "colour" as alternating red-blue.
5. Click add.

Figure 8: Ring set-up for Multi-FASTA file

Figure 9: Custom annotation window - adding gene annotations

This step colours the gaps between FASTA entries; the gaps are calculated from the Multi-FASTA file (Figure 10). For each genome ring, do the following:
1. Set "input data" as Multi-FASTA.
2. Set "colour" as black.
3. Check "load gaps only".
4. Click add.
The results should be similar to Figure 10 in the left hand pane. Close the window when this is done.

PRO TIP 12: A spacer value can be set when using protein sequences from a Genbank/EMBL file.

Figure 10: Custom annotation window - adding spacers

6.3 Step 3: Configure image settings and submit
There are a few more steps to complete and then the image is finished. In the customize ring window:
1. Make the following changes in Preferences > Image options:
   (a) Set "show shading" in "Global settings" as false.
   (b) Set "featureSlot" spacing in "Feature settings" as x-small.
2. Return to the customize ring window, click "Next" to go to the final BRIG confirmation window.
3. Set the image title as "Various E. coli virulence genes" and press submit.
The output image should be something like Figure 11. The alternating red-blue option has automatically alternated the red and blue colours for the gene labels. This option is available whenever a Multi-FASTA file is used as a reference sequence. The same option could be used to show contig or genome scaffold boundaries.

This image shows some real biological information very clearly:

1. CFT073 (UPEC) and K12 MG1655 (commensal) do not carry the Locus of Enterocyte Effacement. These virulence factors are specific to EHEC and EPEC.
2. All E. coli shown carry the common pilus (ecpA-R).
3. Only CFT073 carries the F1C fimbriae.

PRO TIP 13: You can use protein sequences as a Multi-FASTA reference and use blastx to improve alignment accuracy for divergent sequences.

Figure 11: Output image from Multi-FASTA walkthrough. This was generated using BLAST+ [2]; BLAST legacy [1] will produce slightly different results.

7 Visualising graphs and genome assemblies

BRIG can draw any user-specified graph, e.g. coverage, read mapping, expression data, etc. For example, the coverage graph in Figure 2 was produced from a tab-delimited text file, with a start, stop and value for each range.

BRIG supports .ace files (produced by Newbler, 454/Roche’s proprietary assembler, and used by PHRAP/Consed) and SAM files (used for read mapping and some de novo assemblers). BRIG has a number of modules for handling assembly information. These tools are:

• Contig mapping: BRIG will use BLAST to try to map contigs from an .ace or Multi-FASTA file onto a reference genome. It will produce a .graph file showing the frequency of BLAST hits and the best BLAST hit position of contigs, and another .graph file, with the suffix “rep.graph”, showing all the other BLAST hits.

• Coverage graph: BRIG requires an .ace or .sam file and an output location (Figure 7.1). BRIG will calculate coverage values over a user-defined window and produce a tab-delimited .graph file in the output folder, which can be loaded back into BRIG.

• Convert graph: A draft genome is usually modified post assembly: adding spacers, reordering contigs, etc. These changes are often not reflected in the original .ace files BRIG uses to generate coverage graphs. BRIG can use BLAST to align the original assembly output with the newer sequence and map the coverage information to the new sequence. BRIG requires:
  - The original 454AllContigs.fna produced by Newbler.
  - The graph file created by BRIG’s “Coverage graph” module, based on Newbler’s ace file.
  - The modified sequence or another suitable reference genome.
  BRIG will produce a new .graph file in the output folder, using the filename of the original file, with “new.graph” appended to the end.

To create or work with graph files: Main window > Modules > Create graph files.

Using graph files in BRIG images: .graph files should be visible when users load a directory into the query sequence pool (Figure 7.1). Graphs can be treated like any other sequence file in BRIG; the example from Figure 7.1 shows a graph file loaded into the first ring of a particular BRIG session.

PRO TIP 14: Graph files cannot be shown on the same ring as sequence files (protein or nucleotide).

7.1 Walkthrough for visualising SAM file mapping coverage
This section gives a worked example of producing a BRIG image showing read mapping coverage from a SAM input file. The final image will look like Figure 12. This walkthrough requires BRIG examples.zip from the BRIG website (/projects/brig/files/). Unzip this somewhere convenient.

The general procedure is to first generate the graph files from the SAM file, add additional files to the data pool, edit the rings and annotation, then render the image.

1. Open a new BRIG session.
2. Create the graph file from the graph files module: Main window > Modules > Create graph files.
   (a) Set the drop down to coverage graph and fill in the fields (Figure 7.1).
   (b) Set Assembly file as “Mu50.sam” from the BRIG examples/Chapter7-sam-examples folder.
   (c) Set Output folder as the location of the Chapter7-sam-examples folder.
   (d) Set Window size as “1”.
   (e) Click Create Graph. This will add the graph file to the data pool when it has finished.

Close the coverage graph window and return to the first main window.

1. Set the reference file as “S.aureus.Mu50-plasmid-AP003367.gbk” from the Chapter7-sam-examples folder. Users can use the browse button to traverse the file system.
2. Set <unzipped BRIG examples folder>/Chapter7-sam-examples as the query sequence folder.
3. Press “add to data pool”; this should load several items into the pool list.
4. Set the output folder as the unzipped Chapter7-sam-examples folder.
5. Make sure the BLAST options box is blank.

Click next to move to the next window to configure the rings and add in annotations.

1. Create 4 rings and name them “Mapping Coverage”, “pSK57”, “SAP014A” and “CDS”.
2. Ring 1 Settings:
   (a) Add “Mu50.sam.graph” from the data pool to Ring 1.
   (b) Set graph maximum value as “10”.
   (c) Set colour as rgb(204,0,0).
   (d) Set legend title as “Mapping coverage”.
   (e) Check show red/blue.
3. Ring 2 Settings:
   (a) Add “S.aureus.pSK57-plasmid-GQ900493.gbk” from the data pool to Ring 2.
   (b) Set colour as rgb(0,0,102).
   (c) Set legend title as “pSK57”.
4. Ring 3 Settings:
   (a) Add “S.aureus.SAP014A-plasmid-GQ900379” from the data pool to Ring 3.
   (b) Set colour as rgb(102,0,102).
   (c) Set legend title as “SAP014A”.
5. Ring 4 Settings:
   (a) Set colour as rgb(0,0,0).
   (b) Set legend title as “CDS”.

Click “Add Custom features”.

1. Double-click Ring 4.
2. Set Input data to “Genbank”.
3. Set colour to “black”.
4. Set Draw feature as “default”.
5. Set Genbank file location to the location of “S.aureus.Mu50-plasmid-AP003367.gbk”.
6. Set Feature as “CDS”.
7. Click add.

This will load all the coding sequences from the Genbank file. These annotations will be drawn as arrows, indicating orientation. Close this window and click next on the second window.

1. Set title as “S. aureus Mu50 plasmid”.
2. Click Submit.

This will generate the final image, which should look like Figure 12.

Figure 12: S. aureus Mu50 plasmid, showing read mapping from simulated 454 reads, CDSs, and genome comparisons to other S. aureus plasmids, pSK57 & SAP014A. Alignments were performed with BLAST+.

7.2 Walkthrough for visualising ace file assembly coverage
This section gives a worked example of producing a BRIG image showing assembly coverage read from an ace file. The final image will look like Figure 13. This walkthrough requires BRIG examples.zip from the BRIG website (/projects/brig/files/). Unzip this somewhere convenient.

The general procedure is to first generate the graph files from the ace file, convert the coverage information to the reference sequence if necessary, add additional files to the data pool, edit rings and annotation, then render the image.

Draft genome sequences are often modified after the initial assembly to be consistent with other information (e.g. genome scaffolding, PCR sequencing of gaps). This may change the order and size of the final genome sequence compared to the original assembly.

To show the read coverage from the assembly on the final sequence correctly, the “Convert graph” module within BRIG can be used to map the coverage information from the ace file onto the new sequence. This module can also be used to map read coverage from an assembly onto a closely related reference genome.

First, produce the coverage graph file based on the assembly (ace file):

1. Open a new BRIG session.
2. Create the graph file from the graph files module: Main window > Modules > Create graph files.
3. Set the drop down to coverage graph and fill in the fields (Figure 7.2).
   (a) Set Assembly file as “454-S.aureus.Mu50.ace” from the BRIGEXAMPLE2-ace folder.
   (b) Set Output folder as the location of the BRIGEXAMPLE2-ace folder.
   (c) Set window size as “1”.
4. Click “Create Graph”. This will add the graph file to the data pool when it has finished.

Next, map the coverage generated in the previous graph file to the modified genome sequence.

1. Remain in the “Create custom graph” window: Main window > Modules > Create graph files.
2. Set the drop down to convert graph and fill in the fields as below.
   (a) Set Original sequence as “454AllContigs-S.aureus.Mu50.fna”.
   (b) Set New sequence as “S.aureus.Mu50-plasmid-AP003367.fna”.
   (c) Set graph file as “454-S.aureus.Mu50.ace.graph”.
   (d) Set Output folder as the location of the BRIGEXAMPLE2-ace folder.
   (e) Set Window size as “1”.
3. Click “Create Graph”. This will add the graph file to the data pool when it has finished.

Close the “Create custom graph” window and return to the main window.

1. Set the reference file as “S.aureus.Mu50-plasmid-AP003367.gbk” from the BRIGEXAMPLE2-ace folder. Users can use the browse button to traverse the file system.
2. Set <unzipped BRIGEXAMPLE2-ace folder>/genomes as the query sequence folder.
3. Press “add to data pool”; this should load several items into the pool list.
4. Set the output folder as the unzipped BRIGEXAMPLE2-ace folder.
5. Make sure the BLAST options box is blank.
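The coverage graphs used in this chapter are tab-delimited text files with a start, stop and value for each range, calculated over a user-defined window. As a rough illustration of that idea, the Python sketch below derives per-window coverage from SAM records. It is not BRIG's implementation: the function names, the use of the mean depth per window, the absence of a header line in the output, and the simplification to a single reference sequence of known length are all assumptions made for this example.

```python
import re
from typing import Iterable, List, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def per_base_depth(sam_lines: Iterable[str], ref_length: int) -> List[int]:
    """Count aligned reads over each reference base (single reference assumed)."""
    depth = [0] * ref_length
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        flag, pos, cigar = int(fields[1]), int(fields[3]), fields[5]
        if flag & 4 or cigar == "*":  # read is unmapped
            continue
        ref_pos = pos - 1  # SAM POS is 1-based
        for length, op in CIGAR_RE.findall(cigar):
            n = int(length)
            if op in "M=X":  # aligned bases add to the depth
                for i in range(max(ref_pos, 0), min(ref_pos + n, ref_length)):
                    depth[i] += 1
            if op in "MDN=X":  # operations that consume the reference
                ref_pos += n
    return depth


def windowed_coverage(depth: List[int], window: int = 1) -> List[Tuple[int, int, float]]:
    """Return (start, stop, mean depth) rows, 1-based inclusive (assumed)."""
    rows = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        rows.append((start + 1, start + len(chunk), sum(chunk) / len(chunk)))
    return rows


def format_graph(rows: List[Tuple[int, int, float]]) -> str:
    """Render rows as tab-delimited start, stop, value lines."""
    return "\n".join(f"{s}\t{e}\t{v:g}" for s, e, v in rows)


if __name__ == "__main__":
    # Two hypothetical reads on a 10 bp reference.
    sam = [
        "@HD\tVN:1.0",
        "r1\t0\tref\t1\t60\t5M\t*\t0\t0\tACGTA\t*",
        "r2\t0\tref\t3\t60\t4M\t*\t0\t0\tGTAC\t*",
    ]
    print(format_graph(windowed_coverage(per_base_depth(sam, 10), window=5)))
```

A window size of “1”, as used in the walkthroughs, would give one row per base; larger windows smooth the graph at the cost of resolution.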