Transcriptome Analysis Workflows and an Introduction to Common Software


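Transcriptome analysis workflows and how to use common software (de novo and reference-based)
Novogene, Sun Fuming (孙福明), 2015.1.12
De novo (reference-free) transcriptome analysis workflow
Reference-based analysis workflow
OUTLINE
• Assembly: Trinity (de novo)
• Alignment and quantification: RSEM (de novo)
• Alignment software: Tophat2 (reference-based)
• Quantification: HTSeq (reference-based)
• Common databases (NCBI, ENSEMBL)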
rsem-calculate-expression [options] --sam/--bam [--paired-end] input reference_name sample_name
--bowtie2
--bowtie2-path
--forward-prob 0.0    # library-construction (strandedness) parameter
-p                    # number of CPUs to use
• --min_contig_length (default: 200; step1, filter contigs)
– minimum assembled contig length to report
• --min_glue (default: 2; step2, contigs to components)
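Alignment: Tophat2
Features:
- Does not depend on splice-site information; can align using only the reference genome
- Can handle spliced reads
- Can use annotation information for alignment
- Read alignment engine: Bowtie2 (Bowtie can also be used)
- Handles misalignments, e.g. reads from pseudogenes
- Handles indels
Alignment: Tophat2
--phred33-quals, --phred64-quals, --paired-end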
RSEM usage
• extract-transcript-to-gene-map-from-trinity trinity_out_dir.Trinity.fasta isoform2gene
• rsem-prepare-reference --bowtie2 --bowtie2-path /PUBLIC/software/RNA/bowtie2-2.2.3 --transcript-to-gene-map isoform2gene --no-polyA trinity_out_dir.Trinity.fasta Trinity.fasta
• rsem-calculate-expression --bowtie2 --bowtie2-path /PUBLIC/software/RNA/bowtie2-2.2.3/ --phred33-quals -p 8 --forward-prob 0.0 --time --paired-end reads.left.fq reads.right.fq Trinity.fasta Sample
Trinity example
• Output
– Trinity.fasta file
– unigene.fasta file
From Trinity.fasta, the longest transcript of each gene is selected as the unigene, as recommended by the Trinity authors.
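The unigene selection described above can be sketched in a few lines of Python. This is only an illustration, not the script used in the course: it assumes Trinity-style IDs of the form `c1_g1_i1`, where everything before the final `_i<number>` identifies the gene. The helper names `read_fasta`, `gene_id`, and `longest_per_gene` and the toy sequences are made up for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header):
    """Map a Trinity ID such as 'c1_g1_i1' to its gene ID 'c1_g1' (assumed ID layout)."""
    transcript_id = header.split()[0]
    return transcript_id.rsplit("_i", 1)[0]


def longest_per_gene(records):
    """Keep the longest transcript for each gene; the first one wins ties."""
    best = {}
    for header, seq in records:
        gid = gene_id(header)
        if gid not in best or len(seq) > len(best[gid][1]):
            best[gid] = (header, seq)
    return best


# Toy input for illustration only.
example = """>c1_g1_i1 len=8
ACGTACGT
>c1_g1_i2 len=12
ACGTACGTACGT
>c2_g1_i1 len=5
GGGCC
""".splitlines()

unigenes = longest_per_gene(read_fasta(example))
for gid, (header, seq) in unigenes.items():
    print(gid, header.split()[0], len(seq))
```

For a real assembly, the same functions could be fed an open file handle on Trinity.fasta instead of the toy lines.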
Interpreting Trinity assembly results
• >c1_g1_i1 len=233 path=[94:0-232]
c1: sequence is derived from Chrysalis component 1
g1: sequence also corresponds to Butterfly subcomponent #1 (during graph compaction and pruning, some components are partitioned into disconnected subcomponents)
i1: sequence count from Chrysalis component 1, Butterfly subcomponent 1. If this subcomponent yields multiple sequences, these will have different seq numbers.
1. Graph simplification
2. Graph resolution: determine the transcript sequences
Butterfly
2016-1-11
Trinity parameters
• --jaccard_clip
– for gene-dense compact genomes, such as fungal genomes, where transcripts may often overlap in UTR regions
• --SS_lib_type (strand-specific library type)
– Paired reads:
• RF: the first read (/1) of the fragment pair is sequenced as antisense (reverse, R), and the second read (/2) is in the sense strand (forward, F); typical of sequencing method.
• FR: the reverse of RF (first read in the sense orientation, second read antisense)
– Unpaired (single) reads:
• F: the single read is in the sense (forward) orientation
• R: the single read is in the antisense (reverse) orientation
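step1: Inchworm
1. Decompose the sequencing reads and build a k-mer dictionary
2. Remove error-containing k-mers from the k-mer dictionary
3. Select a seed k-mer
4. Extend the seed k-mer to form a contig
5. Repeat seed selection and bidirectional k-mer extension until the k-mer dictionary is exhausted
6. Filter the contigs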
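The six steps above can be illustrated with a toy greedy assembler. This is a simplified sketch, not Trinity's implementation. It assumes that "error-containing" k-mers can be approximated by a minimum count threshold, that the seed is the most abundant remaining k-mer, and that extension always follows the most abundant overlapping k-mer. Reverse complements are ignored and k must be at least 2. All function names are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Step 1: build the k-mer dictionary (k-mer -> occurrence count)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend(contig, kmers, k, forward):
    """Steps 4-5: greedily extend in one direction, consuming k-mers as they are used."""
    while True:
        if forward:
            stem = contig[-(k - 1):]
            candidates = [stem + base for base in "ACGT"]
        else:
            stem = contig[:k - 1]
            candidates = [base + stem for base in "ACGT"]
        candidates = [c for c in candidates if c in kmers]
        if not candidates:
            return contig
        best = max(candidates, key=lambda c: kmers[c])
        del kmers[best]
        contig = contig + best[-1] if forward else best[0] + contig


def inchworm_sketch(reads, k=5, min_kmer_cov=2, min_contig_length=8):
    kmers = count_kmers(reads, k)
    # Step 2 (assumption): treat rare k-mers as likely sequencing errors.
    kmers = Counter({km: n for km, n in kmers.items() if n >= min_kmer_cov})
    contigs = []
    while kmers:  # Step 5: repeat until the dictionary is exhausted.
        seed = max(kmers, key=lambda km: kmers[km])  # Step 3: pick a seed.
        del kmers[seed]
        contig = extend(seed, kmers, k, forward=True)
        contig = extend(contig, kmers, k, forward=False)
        contigs.append(contig)
    # Step 6: filter out short contigs.
    return [c for c in contigs if len(c) >= min_contig_length]


# Two overlapping reads seen twice each, plus one read carrying a single error.
reads = ["ATGGCGTACG", "GCGTACGTTAGC"] * 2 + ["ATGGCGTTCG"]
contigs = inchworm_sketch(reads)
print(contigs)
```

In this toy input the k-mers unique to the erroneous read occur only once, so the count threshold removes them before extension starts.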
--bowtie2-path    # path to bowtie2
RSEM usage
• 3. [rsem-calculate-expression]
rsem-calculate-expression [options] upstream_read_file(s) reference_name sample_name
rsem-calculate-expression [options] --paired-end upstream_read_file(s) downstream_read_file(s) reference_name sample_name
RSEM results
• 1. Sample.transcript.bam
• 2. Sample.isoforms.results
• 3. Sample.genes.results
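As a rough illustration of how the gene-level results file might be consumed downstream, the sketch below reads a tab-separated table with Python's `csv` module. The column names (`gene_id`, `transcript_id(s)`, `length`, `effective_length`, `expected_count`, `TPM`, `FPKM`) are an assumption about the RSEM output layout and are not listed in the slides, so check them against your own Sample.genes.results. The two data rows are invented for the example.

```python
import csv
import io

# Invented rows for illustration only; real files are written by rsem-calculate-expression.
SAMPLE_GENES_RESULTS = (
    "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n"
    "c1_g1\tc1_g1_i1,c1_g1_i2\t850.00\t700.00\t120.00\t35.50\t28.10\n"
    "c2_g1\tc2_g1_i1\t233.00\t90.00\t0.00\t0.00\t0.00\n"
)


def read_gene_results(handle):
    """Parse a genes.results-style table into a list of dictionaries."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "gene_id": row["gene_id"],
            "transcripts": row["transcript_id(s)"].split(","),
            "expected_count": float(row["expected_count"]),
            "TPM": float(row["TPM"]),
            "FPKM": float(row["FPKM"]),
        })
    return rows


genes = read_gene_results(io.StringIO(SAMPLE_GENES_RESULTS))
# Keep only genes with non-zero estimated expression.
expressed = [g["gene_id"] for g in genes if g["TPM"] > 0]
print(expressed)
```

With a real run, `io.StringIO(...)` would be replaced by `open("Sample.genes.results")`.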
Trinity parameters
– --min_glue: minimum number of reads needed to glue two Inchworm contigs together
Trinity example
ln -s /BJPROJ/RNA/rna_test/TR_bioinfomatics1/prepare/sunfuming/lession5/trinitydata/reads.*.fq .
perl /PUBLIC/software/public/Assembly/trinityrnaseq_r20140413p1/Trinity \
--seqType fq \
--JM 2G \
--left reads.left.fq \
--right reads.right.fq \
--SS_lib_type RF \
--CPU 4 \
--min_kmer_cov 2 \
--min_glue 2 \
--full_cleanup    # delete intermediate files
Introduction to RSEM
• Features
• Can handle ambiguous reads
• Does not depend on a reference genome file
RSEM usage
• An RSEM run consists of three main steps
Help documentation: http://deweylab.biostat.wisc.edu/rsem/README.html#usage
1. [extract-transcript-to-gene-map-from-trinity]
2. [rsem-prepare-reference]
rsem-prepare-reference [options] reference_fasta_file(s) reference_name
reference_name              prefix for the output files
--transcript-to-gene-map    # output file from the previous step
--bowtie2                   # use bowtie2 for alignment
step2: Chrysalis
1. Group the contigs into connected components
2. Build a de Bruijn graph for each component
3. Map the reads back
4. Filter
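The first Chrysalis step, grouping contigs into connected components, can be sketched with a union-find structure. This is a deliberate simplification: it joins any two contigs that share a (k-1)-mer, whereas Trinity additionally requires read support for the join (the --min_glue parameter). The function name and toy contigs are hypothetical.

```python
def shared_overlap_components(contigs, k):
    """Group contigs that share at least one (k-1)-mer, using union-find."""
    parent = list(range(len(contigs)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # (k-1)-mer -> index of the first contig containing it
    for idx, contig in enumerate(contigs):
        for pos in range(len(contig) - (k - 1) + 1):
            sub = contig[pos:pos + k - 1]
            if sub in first_seen:
                union(first_seen[sub], idx)
            else:
                first_seen[sub] = idx

    groups = {}
    for idx, contig in enumerate(contigs):
        groups.setdefault(find(idx), []).append(contig)
    return list(groups.values())


# Toy contigs: the first two overlap, the third is unrelated.
toy_contigs = ["ATGGCGTACG", "CGTACGTTAGC", "TTTTCCCCAA"]
components = shared_overlap_components(toy_contigs, k=5)
print(components)
```

Each resulting group would then be turned into its own de Bruijn graph in the next step.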
step3:Butterfly
Butterfly resolves alternatively spliced and paralogous transcripts
-p    number of threads; default 1
-G    structural annotation file
• Other optional Tophat2 parameters
Alignment: Tophat2
• Tophat2 example
Modified parameter: set mismatch to 3
tophat2 -p 1 \
-G /BJPROJ/RNA/rna_test/TR_bioinfomatics1/prepare/sunfuming/lession5/youcan/use.gtf \
-o test use.fa \
/BJPROJ/RNA/rna_test/TR_bioinfomatics1/prepare/sunfuming/lession5/youcan/pair_1.fq \
/BJPROJ/RNA/rna_test/TR_bioinfomatics1/prepare/sunfuming/lession5/youcan/pair_2.fq \
--library-type fr-unstranded -N 3 --read-edit-dist 3
• Principle
Alignment: Tophat2
• Tophat2 usage
Alignment: Tophat2
• Tophat2 example
1) Build index
bowtie2-build use.fa use.fa
Alignment: Tophat2
2) Mapping
tophat2 -p 1 -G /BJPROJ/RNA/rna_test/TR_bioinfomatics1/prepare/sunfuming/lession5/youcan/use.gtf -o test use.fa /BJPROJ/RNA/rna_test/TR_bioinfomatics1/prepare/sunfuming/lession5/youcan/pair_1.fq
len: length of the transcript sequence
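Assessing Trinity assembly quality
• N50/N90: sort the assembled transcripts by length from longest to shortest and accumulate their lengths; the length of the transcript at which the running total first reaches at least 50%/90% of the total length is the N50/N90.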
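The N50/N90 definition above translates directly into code. The following is a minimal sketch; the function name `nx` and the example lengths are made up for illustration.

```python
def nx(lengths, fraction):
    """Return the Nx value (e.g. fraction=0.5 for N50) of a list of transcript lengths."""
    ordered = sorted(lengths, reverse=True)   # longest first
    threshold = fraction * sum(ordered)       # 50% or 90% of the total length
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:              # first transcript reaching the threshold
            return length
    raise ValueError("lengths must be non-empty")


# Toy transcript lengths (total = 1500).
lengths = [100, 200, 300, 400, 500]
n50 = nx(lengths, 0.5)   # 500 + 400 = 900 >= 750, so N50 is 400
n90 = nx(lengths, 0.9)   # 500 + 400 + 300 + 200 = 1400 >= 1350, so N90 is 200
print(n50, n90)
```

In practice the lengths could be taken from the `len=` field of each Trinity.fasta header or computed from the sequences themselves.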
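• Tophat2
Alignment proceeds in 3 steps:
(1) Alignment to the transcriptome
If annotation is provided, the reads are aligned to the transcripts.
(2) Alignment to the genome
If (1) was performed, the reads that were unaligned or possibly misaligned are aligned to the genome; if (1) was not performed, all reads are aligned to the reference genome.
(3) Spliced alignment
Following the GT-AG/GC-AG/AT-AC rule, new splice sites are searched for among the reads left unaligned in (2).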
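The GT-AG/GC-AG/AT-AC rule in step (3) amounts to checking the first two and last two bases of a candidate intron. The sketch below shows only that check; it is not how Tophat2 is implemented. It assumes forward-strand sequence with 0-based, end-exclusive intron coordinates. The function names and the toy genome are hypothetical.

```python
# Donor/acceptor dinucleotide pairs named in the rule above.
CANONICAL_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def intron_motif(genome, intron_start, intron_end):
    """Return (donor, acceptor) dinucleotides for an intron at [intron_start, intron_end)."""
    donor = genome[intron_start:intron_start + 2]
    acceptor = genome[intron_end - 2:intron_end]
    return donor, acceptor


def is_plausible_intron(genome, intron_start, intron_end):
    """True if the candidate intron starts and ends with an accepted motif pair."""
    return intron_motif(genome, intron_start, intron_end) in CANONICAL_MOTIFS


# Toy genome: exon, a 12-base intron (GT...AG), exon.
genome = "AAACCC" + "GTAAGTTTTCAG" + "GGGTTT"
print(intron_motif(genome, 6, 18))
print(is_plausible_intron(genome, 6, 18))
print(is_plausible_intron(genome, 7, 18))
```

A candidate junction whose boundaries are shifted by even one base, as in the last call, no longer matches any of the three motif pairs.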