Introduction to Next-Generation Sequencing Data Analysis


Duplicate Sequences
• This module will issue a warning if non-unique sequences make up more than 20% of the total
• This module will issue an error if non-unique sequences make up more than 50% of the total
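The two thresholds above can be illustrated with a minimal sketch. It assumes that "non-unique" means every read beyond the first copy of a sequence, and it counts over all reads given to it; FastQC's own estimate works on a subset of the file, so its numbers may differ. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def duplicate_percentage(sequences: Iterable[str]) -> float:
    """Percentage of reads that repeat a sequence already seen (assumed definition)."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)  # every copy after the first one
    return 100.0 * duplicates / total


def duplicate_status(percentage: float) -> str:
    """Apply the 20% (warning) and 50% (error) thresholds from the slide."""
    if percentage > 50:
        return "error"
    if percentage > 20:
        return "warning"
    return "ok"


reads = ["ACGT", "ACGT", "ACGT", "TTTT", "GGGG"]
pct = duplicate_percentage(reads)
print(pct, duplicate_status(pct))
```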
American Standard Code for Information Interchange (ASCII)
FastQC
• http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Double-click "run_fastqc.bat" to run FastQC
• The analysis results cover 11 modules
• Green tick: normal
• Orange triangle: slightly abnormal
• Red cross: very unusual
Introduction to Next-Generation Sequencing Data Analysis
Tong Chunfa (童春发), 2013.12.23
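Outline
• Principles and workflow of resequencing
• Data structure and quality assessment
• The SRA database and data retrieval
• Using Bowtie2, BWA, and SAMtools
Principles and Workflow of Resequencing
Data Structure and Quality Assessment
• FASTQ format
• FastQC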
FASTQ format
http://en.wikipedia.org
Filtered Sequences: 0
Sequence length: 100
%GC: 37
Per Base Sequence Quality
• The central red line is the median value
• The yellow box represents the inter-quartile range (25-75%)
• The upper and lower whiskers represent the 10% and 90% points
• The blue line represents the mean quality
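The quantities drawn in this plot can be computed per read position. The sketch below is illustrative only: it assumes Sanger-encoded quality strings (ASCII offset 33) and uses a simple nearest-rank percentile, which may not match FastQC's exact interpolation. All function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence


def percentile(ordered: Sequence[int], p: int) -> int:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, -(-p * len(ordered) // 100))  # ceiling of p * n / 100
    return ordered[rank - 1]


def summarize_position(scores: Sequence[int]) -> Dict[str, float]:
    """Mean, median, quartiles and 10%/90% points for one read position."""
    ordered = sorted(scores)
    return {
        "mean": mean(ordered),
        "p10": percentile(ordered, 10),
        "q1": percentile(ordered, 25),
        "median": percentile(ordered, 50),
        "q3": percentile(ordered, 75),
        "p90": percentile(ordered, 90),
    }


def per_base_summary(quality_strings: List[str], offset: int = 33) -> List[Dict[str, float]]:
    """Group decoded scores by position, then summarize each position."""
    columns: Dict[int, List[int]] = {}
    for quality in quality_strings:
        for i, ch in enumerate(quality):
            columns.setdefault(i, []).append(ord(ch) - offset)
    return [summarize_position(columns[i]) for i in range(len(columns))]


for position, stats in enumerate(per_base_summary(["I5", "I5", "5I"]), start=1):
    print(position, stats)
```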
Per Base N Content
• This module raises a warning if any position shows an N content of >5%
• This module will raise an error if any position shows an N content of >20%
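As a rough illustration of this check, the sketch below computes the percentage of N calls at each read position and applies the 5% and 20% thresholds. The function names are hypothetical, and the input is assumed to be a plain list of read sequences.

```python
from typing import List


def n_content_per_position(sequences: List[str]) -> List[float]:
    """Percentage of N base calls at each position across all reads."""
    longest = max((len(s) for s in sequences), default=0)
    n_counts = [0] * longest
    totals = [0] * longest
    for seq in sequences:
        for i, base in enumerate(seq):
            totals[i] += 1
            if base in "Nn":
                n_counts[i] += 1
    return [100.0 * n / t for n, t in zip(n_counts, totals)]


def n_content_status(sequences: List[str]) -> str:
    """Warning above 5% N at any position, error above 20%."""
    worst = max(n_content_per_position(sequences), default=0.0)
    if worst > 20:
        return "error"
    if worst > 5:
        return "warning"
    return "ok"


reads = ["ACGN", "ACGT", "NCGT", "ACGT"]
print(n_content_per_position(reads), n_content_status(reads))
```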
Downloading FASTQ Data Directly
• ftp://ftp.era.ebi.ac.uk/vol1/fastq/SRR576/SRR576183
Aligning Reads to a Reference Sequence
• BWA
• Bowtie2
• SOAP
• SAMtools
BWA
• http://bio-bwa.sourceforge.net/
• https://github.com/lh3/bwa
• wget http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.5a.tar.bz2
• tar -xjvf bwa-0.7.5a.tar.bz2
• cd bwa-0.7.5a
• make
• Download test.tar.gz from ftp://202.119.214.193
Saving a Report
NHS066-47_L4_1.fq_fastqc.zip
The SRA Database and Data Retrieval
Viewing and Downloading SRR576183
Converting SRA Files to FASTQ Format with fastq-dump
• fastq-dump --split-files -DQ "+" ./SRR576183.sra
• fastq-dump --split-files -DQ "+" --gzip ./SRR576183.sra
Illumina sequence identifiers
@HWUSI-EAS100R:6:73:941:1973#0/1
Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag.
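The older identifier shown above can be split into its parts with a regular expression. The field names used here (instrument, lane, tile, x, y, multiplex index, pair member) follow the commonly documented layout of this header and are an assumption, since the slide does not name them. The parser is a sketch, not an official implementation.

```python
import re
from typing import Dict

# Assumed layout: @instrument:lane:tile:x:y#index/pair
OLD_ILLUMINA_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_old_illumina_id(line: str) -> Dict[str, str]:
    """Return the identifier fields as strings, or raise ValueError."""
    match = OLD_ILLUMINA_ID.match(line.strip())
    if match is None:
        raise ValueError("not an old-style Illumina identifier: " + line)
    return match.groupdict()


print(parse_old_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```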
Sequence Length Distribution
• This module will raise a warning if the sequences are not all the same length
• This module will raise an error if any of the sequences have zero length
Per Base Sequence Content
• This module issues a warning if the difference between A and T, or G and C, is greater than 10% in any position
• This module will fail if the difference between A and T, or G and C, is greater than 20% in any position
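A minimal sketch of this check follows. It computes the percentage of each base at every position and compares A with T and G with C, using the 10% and 20% thresholds. Function names are hypothetical, and ambiguous bases such as N simply count toward the position total.

```python
from collections import Counter
from typing import Dict, List


def base_percentages(sequences: List[str]) -> List[Dict[str, float]]:
    """Percentage of A, C, G and T at each read position."""
    longest = max((len(s) for s in sequences), default=0)
    result = []
    for pos in range(longest):
        counts = Counter(s[pos].upper() for s in sequences if len(s) > pos)
        total = sum(counts.values())
        result.append({b: 100.0 * counts[b] / total for b in "ACGT"})
    return result


def sequence_content_status(sequences: List[str]) -> str:
    """Warning if |A-T| or |G-C| exceeds 10% anywhere, fail above 20%."""
    worst = 0.0
    for pct in base_percentages(sequences):
        worst = max(worst, abs(pct["A"] - pct["T"]), abs(pct["G"] - pct["C"]))
    if worst > 20:
        return "fail"
    if worst > 10:
        return "warning"
    return "ok"


print(sequence_content_status(["ACGT", "TGCA", "CATG", "GTAC"]))
```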
With Casava 1.8 the format of the '@' line has changed
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
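The Casava 1.8 header above can be split on the space and the colons. The field names below (instrument, run ID, flowcell ID, lane, tile, x, y, then read number, filter flag, control bits, index sequence) follow the commonly documented layout and are an assumption, as the slide shows only the example line. This is a sketch with hypothetical names.

```python
from typing import NamedTuple


class CasavaHeader(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    read: int
    is_filtered: bool
    control_bits: int
    index: str


def parse_casava18_header(line: str) -> CasavaHeader:
    """Parse '@instr:run:flowcell:lane:tile:x:y read:filtered:control:index'."""
    if not line.startswith("@"):
        raise ValueError("header must start with '@'")
    location, _, flags = line[1:].strip().partition(" ")
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, index = flags.split(":")
    return CasavaHeader(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(read), filtered == "Y", int(control), index,
    )


print(parse_casava18_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))
```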
Quality
• A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
• Phred quality score: $Q = -10 \log_{10} p$
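To make the relationship between Q and p concrete, here is a small sketch assuming the standard Phred definition, Q = -10 log10(p). The function names are illustrative and do not come from any library.

```python
import math


def phred_from_error_prob(p: float) -> float:
    """Phred quality for a base-call error probability p, with 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def error_prob_from_phred(q: float) -> float:
    """Inverse mapping: error probability for a Phred quality q."""
    return 10 ** (-q / 10)


# Each step of 10 in Q lowers the error probability tenfold.
for q in (10, 20, 30, 40):
    print(q, error_prob_from_phred(q))
```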
Basic Statistics
Filename: NHS066-47_L4_1.fq.gz
File type: Conventional base calls
Encoding: Sanger / Illumina 1.9
Total Sequences: 3992798
Filtered Sequences
Overrepresented Sequences
Sequence | Count | Percentage | Possible Source
AATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCG | 65311 | 1.636 | TruSeq Adapter, Index 10 (97% over 36bp)
ATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT | 6464 | 0.162 | TruSeq Adapter, Index 10 (97% over 36bp)
AATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT | 4633 | 0.116 | TruSeq Adapter, Index 10 (97% over 36bp)
AATTAGTCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT | 4463 | 0.112 | TruSeq Adapter, Index 10 (97% over 34bp)
AATTATGGATAATTAAAGTATTCCCCCCTTTTTTTTATGATATTTTTGAC | 3994 | 0.100 | No Hit
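The percentage column is the count divided by the total number of sequences. The sketch below reproduces it using the counts listed above, the Total Sequences value of 3992798 from the Basic Statistics slide, and the thresholds stated in this deck (warning above 0.1%, failure above 1%). Function names are hypothetical.

```python
def percent_of_total(count: int, total: int) -> float:
    """Share of all reads taken up by one sequence, in percent."""
    return 100.0 * count / total


def overrepresented_status(count: int, total: int) -> str:
    """Warning above 0.1% of the total, failure above 1%."""
    pct = percent_of_total(count, total)
    if pct > 1:
        return "failure"
    if pct > 0.1:
        return "warning"
    return "ok"


TOTAL_SEQUENCES = 3992798
for count in (65311, 6464, 4633, 4463, 3994):
    pct = percent_of_total(count, TOTAL_SEQUENCES)
    print(count, round(pct, 3), overrepresented_status(count, TOTAL_SEQUENCES))
```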
Per Sequence GC Content
• A warning is raised if the sum of the deviations from the normal distribution represents more than 15% of the reads
• This module will indicate a failure if the sum of the deviations from the normal distribution represents more than 30% of the reads
A FASTQ file containing a single sequence might look like this
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
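A FASTQ record like the one above occupies four lines: an '@' identifier line, the sequence, a '+' separator line, and a quality string of the same length as the sequence. The reader below is a minimal sketch that assumes exactly four lines per record (no wrapped sequences); the names are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield FastqRecord(header[1:], sequence, quality)


EXAMPLE = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(io.StringIO(EXAMPLE)))
print(records[0].seq_id, len(records[0].sequence))
```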
Per Base GC Content
• This module issues a warning if the GC content of any base strays more than 5% from the mean GC content
• This module will fail if the GC content of any base strays more than 10% from the mean GC content
Per Sequence Quality Scores
• A warning is raised if the most frequently observed mean quality is below 27; this equates to a 0.2% error rate
• An error is raised if the most frequently observed mean quality is below 20; this equates to a 1% error rate
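The link between these quality cut-offs and the quoted error rates follows from p = 10^(-Q/10): Q = 27 gives roughly 0.002 (0.2%) and Q = 20 gives 0.01 (1%). The sketch below computes the most frequent per-read mean quality and applies the two thresholds. It assumes Sanger encoding (offset 33) and truncates each mean to an integer before counting, which is an illustrative choice and not necessarily what FastQC does.

```python
from collections import Counter
from typing import List


def mean_read_quality(quality: str, offset: int = 33) -> float:
    """Mean Phred score of one read's quality string."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def most_frequent_mean_quality(quality_strings: List[str], offset: int = 33) -> int:
    """Mode of the per-read mean qualities, binned to whole numbers."""
    bins = Counter(int(mean_read_quality(q, offset)) for q in quality_strings)
    return bins.most_common(1)[0][0]


def per_sequence_quality_status(mode_quality: int) -> str:
    """Warning below 27, error below 20."""
    if mode_quality < 20:
        return "error"
    if mode_quality < 27:
        return "warning"
    return "ok"


print(10 ** (-27 / 10), 10 ** (-20 / 10))  # error rates for Q27 and Q20
mode = most_frequent_mean_quality(["IIII", "IIII", "5555"])
print(mode, per_sequence_quality_status(mode))
```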
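• Warning: an overrepresented sequence makes up more than 0.1% of the total
• Failure: an overrepresented sequence makes up more than 1% of the total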
Overrepresented Kmers
• This module will issue a warning if any k-mer is enriched more than 3 fold overall, or more than 5 fold at any individual position
• This module will issue an error if any k-mer is enriched more than 10 fold at any individual base position
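Quality
• The Solexa pipeline (i.e., the software delivered with the Illumina Genome Analyzer) earlier used a different mapping, encoding the odds p/(1-p) instead of the probability p: $Q_{\text{solexa}} = -10 \log_{10} \frac{p}{1-p}$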
Encoding
• Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126
• Illumina's newest version (1.8) of their pipeline CASAVA will directly produce FASTQ in Sanger format
• Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII 59 to 126
• Starting with Illumina 1.3 and before Illumina 1.8, the format encoded a Phred quality score from 0 to 62 using ASCII 64 to 126
• Starting in Illumina 1.5 and before Illumina 1.8, the Phred scores 0 to 2 have a slightly different meaning
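The ASCII ranges above imply an offset of 33 for Sanger and Illumina 1.8+ files, and an offset of 64 for Illumina 1.3 to 1.7 and for Solexa files (whose scores start at -5, hence ASCII 59). The sketch below decodes a quality string for a given offset and includes a cautious heuristic for spotting Sanger encoding. The names are hypothetical, and the heuristic is an assumption that cannot decide every case from a single read.

```python
from typing import List, Optional

SANGER_OFFSET = 33    # Sanger and Illumina 1.8+
ILLUMINA_OFFSET = 64  # Illumina 1.3 to 1.7 (Solexa uses 64 as well)


def decode_quality(quality: str, offset: int = SANGER_OFFSET) -> List[int]:
    """Convert each quality character to its integer score."""
    return [ord(ch) - offset for ch in quality]


def guess_offset(quality: str) -> Optional[int]:
    """Return 33 if a character below ASCII 59 occurs, else None (undecided).

    Characters below ASCII 59 cannot occur in the offset-64 formats, so their
    presence points to Sanger encoding. Without them, a high-quality Sanger
    read and an offset-64 read can look alike, so no guess is made.
    """
    if quality and min(ord(ch) for ch in quality) < 59:
        return SANGER_OFFSET
    return None


print(decode_quality("!5I"))                    # Sanger: scores 0, 20, 40
print(decode_quality("@Th", ILLUMINA_OFFSET))   # Illumina 1.3+: scores 0, 20, 40
print(guess_offset("!''*(((("))
```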