Introduction to Next-Generation Sequencing Data Analysis
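• The Solexa pipeline (i.e., the software delivered with the Illumina Genome Analyzer) earlier used a different mapping, encoding the odds p/(1-p) instead of the probability p: Q_solexa = -10 log10(p / (1 - p))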
Quality
Encoding
• Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126
• Illumina's newest version (1.8) of their pipeline CASAVA will directly produce fastq in Sanger format
• Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII 59 to 126
• Starting with Illumina 1.3 and before Illumina 1.8, the format encoded a Phred quality score from 0 to 62 using ASCII 64 to 126
• Starting in Illumina 1.5 and before Illumina 1.8, the Phred scores 0 to 2 have a slightly different meaning
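The ASCII ranges above imply a fixed offset for each encoding: 33 for Sanger (score 0 at ASCII 33) and 64 for both the Solexa and the Illumina 1.3+ formats (score -5 at ASCII 59, score 0 at ASCII 64). A minimal sketch of decoding a quality string under that reading; the names `OFFSETS` and `decode_quality` are illustrative and not part of any tool:

```python
# Offsets derived from the ASCII ranges listed above (illustrative names).
OFFSETS = {
    "sanger": 33,        # Phred 0..93   -> ASCII 33..126 (also CASAVA 1.8)
    "solexa": 64,        # Solexa -5..62 -> ASCII 59..126
    "illumina-1.3": 64,  # Phred 0..62   -> ASCII 64..126
}


def decode_quality(quality_string, encoding="sanger"):
    """Return the integer scores encoded in a FASTQ quality string."""
    offset = OFFSETS[encoding]
    return [ord(ch) - offset for ch in quality_string]


print(decode_quality("!5I"))                  # [0, 20, 40]
print(decode_quality("@Th", "illumina-1.3"))  # [0, 20, 40]
print(decode_quality(";", "solexa"))          # [-5]
```

The same characters decode to different scores under different offsets, so the encoding has to be known (or guessed from the observed ASCII range) before any quality filtering.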
Tong Chunfa (童春发), 2013.12.23
Outline
• Principles and workflow of resequencing
• Data structure and quality assessment
• The SRA database and data retrieval
• Using Bowtie2, BWA, and SAMtools
Principles and Workflow of Resequencing
Data Structure and Quality Assessment
• FASTQ format
• FastQC
FASTQ format
Sequence Length Distribution
• This module will raise a warning if the sequences are not all the same length
• This module will raise an error if any of the sequences have zero length
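The two length rules can be sketched in a few lines. This illustrates the rules as stated on the slide and is not FastQC's own code; `sequence_length_status` is a hypothetical helper name:

```python
from collections import Counter


def sequence_length_status(sequences):
    """Classify a set of reads using the two length rules stated above."""
    lengths = Counter(len(seq) for seq in sequences)
    if 0 in lengths:
        return "fail"  # at least one zero-length sequence
    if len(lengths) > 1:
        return "warn"  # sequences are not all the same length
    return "pass"


print(sequence_length_status(["ACGT", "ACGA"]))  # pass
print(sequence_length_status(["ACGT", "ACG"]))   # warn
print(sequence_length_status(["ACGT", ""]))      # fail
```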
With Casava 1.8 the format of the '@' line has changed
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
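A minimal sketch of splitting this identifier into its parts. The slide does not name the fields; the names used here (instrument, run, flowcell, lane, tile, x, y, then read, filter flag, control number, index) follow the commonly cited description of the Casava 1.8 layout and should be treated as assumptions:

```python
def parse_casava18_header(line):
    """Split a Casava 1.8 style '@' line into named fields (assumed names)."""
    left, right = line.lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = left.split(":")
    read, is_filtered, control, index = right.split(":")
    return {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(read),
        "is_filtered": is_filtered == "Y",
        "control": int(control),
        "index": index,
    }


header = parse_casava18_header(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
)
print(header["flowcell"], header["lane"], header["index"])  # FC706VJ 2 ATCACG
```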
Quality
• A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
• Phred quality score: Q = -10 log10(p)
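The standard Phred definition, Q = -10 log10(p), can be inverted to recover the error probability from a score. A short sketch of both directions, with illustrative function names:

```python
import math


def phred_to_error_prob(q):
    """Error probability p for a Phred score Q, from Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

Each 10-point step in Q corresponds to a tenfold drop in error probability: Q10 is 1 in 10, Q20 is 1 in 100, and Q30 is 1 in 1000.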
Per Base GC Content
• This module issues a warning if the GC content of any base strays more than 5% from the mean GC content.
• This module will fail if the GC content of any base strays more than 10% from the mean GC content.
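A rough sketch of the per-position GC check. It assumes upper-case reads and takes "mean GC content" to be the mean of the per-position percentages; FastQC's exact calculation may differ, and both function names are hypothetical:

```python
def per_base_gc(sequences):
    """Percentage of G or C at each read position."""
    length = max(len(seq) for seq in sequences)
    result = []
    for i in range(length):
        column = [seq[i] for seq in sequences if i < len(seq)]
        gc = sum(base in "GC" for base in column)
        result.append(100.0 * gc / len(column))
    return result


def per_base_gc_status(sequences):
    """Apply the 5% / 10% deviation thresholds stated above."""
    gc = per_base_gc(sequences)
    mean_gc = sum(gc) / len(gc)  # assumption: mean over positions
    worst = max(abs(value - mean_gc) for value in gc)
    if worst > 10:
        return "fail"
    if worst > 5:
        return "warn"
    return "pass"


print(per_base_gc(["GA", "CT", "AG", "TC"]))         # [50.0, 50.0]
print(per_base_gc_status(["GA", "CT", "AG", "TC"]))  # pass
```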
Overrepresented Sequences
Sequence                                            Count  Percentage  Possible Source
AATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCG  65311  1.636       TruSeq Adapter, Index 10 (97% over 36bp)
ATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT  6464   0.162       TruSeq Adapter, Index 10 (97% over 36bp)
AATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT  4633   0.116       TruSeq Adapter, Index 10 (97% over 36bp)
AATTAGTCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT  4463   0.112       TruSeq Adapter, Index 10 (97% over 34bp)
AATTATGGATAATTAAAGTATTCCCCCCTTTTTTTTATGATATTTTTGAC  3994   0.100       No Hit
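The percentages in this listing appear to be each sequence's count divided by the total number of reads. As a consistency check, the sketch below recomputes them from the five counts, taking the total of 3,992,798 from the Basic Statistics of the same report; `percent_of_total` is an illustrative helper:

```python
TOTAL_SEQUENCES = 3992798  # "Total Sequences" in the Basic Statistics of this report


def percent_of_total(count, total=TOTAL_SEQUENCES):
    """Share of all reads taken up by one sequence, rounded to 3 decimals."""
    return round(100.0 * count / total, 3)


for count in (65311, 6464, 4633, 4463, 3994):
    print(count, percent_of_total(count))
```

The recomputed values agree with the reported ones to three decimal places.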
Basic Statistics
Filename: NHS066-47_L4_1.fq.gz
File type: Conventional base calls
Encoding: Sanger / Illumina 1.9
Total Sequences: 3992798
Filtered Sequences: 0
Downloading FASTQ data directly
• ftp:///vol1/fastq/SRR576/SRR576183
Aligning reads to a reference sequence
• BWA
• Bowtie2
• SOAP
• SAMtools
BWA
• /
• https:///lh3/bwa
• wget /projects/biobwa/files/bwa-0.7.5a.tar.bz2
• tar -xjvf bwa-0.7.5a.tar.bz2
• cd bwa-0.7.5a
• make
• Download test.tar.gz from ftp://202.119.214.193
Illumina sequence identifiers
@HWUSI-EAS100R:6:73:941:1973#0/1
Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag.
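A sketch of parsing this older identifier style with a regular expression. The slide does not label the fields; the group names below (instrument, lane, tile, x, y, multiplex index, pair member) follow the usual reading of this layout and are assumptions:

```python
import re

# Field names are assumptions; the slide shows only the raw identifier.
OLD_ILLUMINA_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_old_illumina_id(line):
    """Return the identifier's fields as a dict, or None if it does not match."""
    match = OLD_ILLUMINA_ID.match(line)
    return match.groupdict() if match else None


print(parse_old_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```

Because the index group accepts any run of characters up to the slash, the same pattern handles both the `#0` form and the `#NNNNNN` form used since pipeline 1.4.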
Per Sequence Quality Scores
• A warning is raised if the most frequently observed mean quality is below 27 - this equates to a 0.2% error rate.
• An error is raised if the most frequently observed mean quality is below 20 - this equates to a 1% error rate.
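A simplified sketch of this rule. It assumes Sanger (offset 33) encoding and truncates each read's mean quality to an integer before finding the most frequent value; how FastQC bins the means internally is not stated on the slide, so that step is an assumption:

```python
from collections import Counter


def mean_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores in one quality string."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def per_sequence_quality_status(quality_strings, offset=33):
    """Apply the 27 / 20 thresholds to the most frequent mean quality."""
    # Assumption: means are truncated to integers before counting.
    means = Counter(int(mean_quality(q, offset)) for q in quality_strings)
    most_common_mean, _ = means.most_common(1)[0]
    if most_common_mean < 20:
        return "fail"
    if most_common_mean < 27:
        return "warn"
    return "pass"


print(per_sequence_quality_status(["IIII", "IIII", "5555"]))  # pass
```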
Warning: >0.1% Failure: >1%
Overrepresented Kmers
• This module will issue a warning if any k-mer is enriched more than 3-fold overall, or more than 5-fold at any individual position
• This module will issue an error if any k-mer is enriched more than 10-fold at any individual base position
Per Base N Content
• This module raises a warning if any position shows an N content of >5%
• This module will raise an error if any position shows an N content of >20%
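The N-content rule reduces to a per-position percentage and two thresholds. A minimal sketch with hypothetical helper names, assuming upper-case reads:

```python
def per_base_n_percent(sequences):
    """Percentage of N calls at each read position."""
    length = max(len(seq) for seq in sequences)
    result = []
    for i in range(length):
        column = [seq[i] for seq in sequences if i < len(seq)]
        result.append(100.0 * column.count("N") / len(column))
    return result


def n_content_status(sequences):
    """Apply the >5% / >20% thresholds stated above."""
    worst = max(per_base_n_percent(sequences))
    if worst > 20:
        return "fail"
    if worst > 5:
        return "warn"
    return "pass"


reads = ["ACGT"] * 9 + ["NCGT"]
print(per_base_n_percent(reads))  # [10.0, 0.0, 0.0, 0.0]
print(n_content_status(reads))    # warn
```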
Duplicate Sequences
• This module will issue a warning if non-unique sequences make up more than 20% of the total
• This module will issue an error if non-unique sequences make up more than 50% of the total
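One simple reading of "non-unique" is any read whose exact sequence occurs more than once in the file. The sketch below uses that reading, which is an assumption; FastQC's own duplication estimate is computed differently in detail:

```python
from collections import Counter


def duplicate_percent(sequences):
    """Percentage of reads whose exact sequence occurs more than once."""
    counts = Counter(sequences)
    non_unique = sum(n for n in counts.values() if n > 1)
    return 100.0 * non_unique / len(sequences)


def duplicate_status(sequences):
    """Apply the 20% / 50% thresholds stated above."""
    percent = duplicate_percent(sequences)
    if percent > 50:
        return "fail"
    if percent > 20:
        return "warn"
    return "pass"


print(duplicate_percent(["AC", "AC", "GT", "TT", "GG"]))  # 40.0
print(duplicate_status(["AC", "AC", "GT", "TT", "GG"]))   # warn
```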
A FASTQ file containing a single sequence might look like this:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
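Each FASTQ record occupies four lines: an `@` identifier line, the sequence, a `+` separator line, and a quality string of the same length as the sequence. A minimal reader sketch that assumes this unwrapped four-line layout; `read_fastq` is an illustrative name, not a library function:

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], sequence, quality


EXAMPLE = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

for name, seq, qual in read_fastq(io.StringIO(EXAMPLE)):
    print(name, len(seq), len(qual))  # SEQ_ID 60 60
```

The same generator works on a file opened in text mode, or on a handle from `gzip.open(path, "rt")` for `.fq.gz` files.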
Per Sequence GC Content
• A warning is raised if the sum of the deviations from the normal distribution represents more than 15% of the reads
• This module will indicate a failure if the sum of the deviations from the normal distribution represents more than 30% of the reads
Saving a Report
NHS066-47_L4_1.fq_fastqc.zip
SRA Database and Data Retrieval
Viewing and downloading SRR576183
Converting SRA files to FASTQ format with fastq-dump
• fastq-dump --split-files -DQ "+" ./SRR576183.sra
• fastq-dump --split-files -DQ "+" --gzip ./SRR576183.sra
American Standard Code for Information Interchange (ASCII)
FastQC
• / projects/fastqc/
• Double-click "run_fastqc.bat" to run FastQC
• Analysis results are reported for 11 modules:
• Green tick: normal
• Orange triangle: slightly abnormal
• Red cross: very unusual
Basic Statistics (continued)
Sequence length: 100
%GC: 37
Per Base Sequence Quality
• The central red line is the median value
• The yellow box represents the inter-quartile range (25-75%)
• The upper and lower whiskers represent the 10% and 90% points
• The blue line represents the mean quality
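The elements of each box in the plot are summary statistics of the quality scores observed at one read position. A sketch using Python's `statistics` module; the percentile interpolation method (`"inclusive"`) is a choice made here, and FastQC's own percentile calculation may differ slightly:

```python
import statistics


def position_quality_summary(scores):
    """Box-plot statistics for the quality scores seen at one read position."""
    quartiles = statistics.quantiles(scores, n=4, method="inclusive")
    deciles = statistics.quantiles(scores, n=10, method="inclusive")
    return {
        "p10": deciles[0],                    # lower whisker
        "q1": quartiles[0],                   # bottom of the yellow box
        "median": statistics.median(scores),  # red line
        "q3": quartiles[2],                   # top of the yellow box
        "p90": deciles[8],                    # upper whisker
        "mean": statistics.mean(scores),      # blue line
    }


print(position_quality_summary([10, 20, 30, 40, 50]))
```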
Per Base Sequence Content
• This module issues a warning if the difference between A and T, or G and C is greater than 10% in any position.
• This module will fail if the difference between A and T, or G and C is greater than 20% in any position.
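The base-composition rule compares the percentage of A with T, and of G with C, at each position. A minimal sketch of the rule as stated, with hypothetical helper names and assuming upper-case reads:

```python
def base_percent(sequences, position, base):
    """Percentage of reads carrying `base` at the given position."""
    column = [seq[position] for seq in sequences if position < len(seq)]
    return 100.0 * column.count(base) / len(column)


def sequence_content_status(sequences):
    """Apply the 10% / 20% A-T and G-C difference thresholds stated above."""
    length = max(len(seq) for seq in sequences)
    worst = 0.0
    for i in range(length):
        at_diff = abs(base_percent(sequences, i, "A") - base_percent(sequences, i, "T"))
        gc_diff = abs(base_percent(sequences, i, "G") - base_percent(sequences, i, "C"))
        worst = max(worst, at_diff, gc_diff)
    if worst > 20:
        return "fail"
    if worst > 10:
        return "warn"
    return "pass"


reads = ["A"] * 7 + ["T"] * 4 + ["G"] * 5 + ["C"] * 4
print(sequence_content_status(reads))  # warn (A 35% vs T 20%)
```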