Lanzhou University Bioinformatics Courseware: 6 - Genome Assembly - Wang Mingcheng (王明成)

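Because L/G is very small and the number of reads n_r is very large, the number of times a given k-mer is sequenced approximately follows a Poisson distribution. So,
d_k = n_r * (L-K+1)/G   (expected k-mer depth)
n_k = n_r * (L-K+1)     (total number of k-mers sequenced)
then, G = n_k/d_k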
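The relation G = n_k/d_k can be tried on a toy data set. The sketch below is only an illustration: the function names are hypothetical, and it takes d_k as the most frequent k-mer multiplicity, which assumes error-free reads (with real data the low-depth error peak would have to be excluded first).

```python
from collections import Counter

def estimate_genome_size(reads, k):
    """Estimate G = n_k / d_k from raw reads (toy version, assumes no sequencing errors)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    n_k = sum(counts.values())                # total number of k-mers sequenced
    histogram = Counter(counts.values())      # multiplicity -> number of distinct k-mers
    # d_k: the peak of the multiplicity histogram (ties resolved toward the larger depth)
    d_k = max(histogram, key=lambda depth: (histogram[depth], depth))
    return n_k / d_k, n_k, d_k

def genome_size_from_formula(n_r, read_len, k, d_k):
    """Same estimate written directly as G = n_r * (L - K + 1) / d_k."""
    return n_r * (read_len - k + 1) / d_k

size, n_k, d_k = estimate_genome_size(["ACGTA"] * 4, 3)
print(size, n_k, d_k)
```

Here four identical reads of length 5 with K = 3 give three distinct k-mers, each seen four times, so both functions return the same estimate.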
Quality control and filtering
Remove paired-end reads whose read 1 and read 2 are both completely identical to those of another pair; such pairs are considered to be the products of PCR duplication.
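A minimal sketch of this duplicate filter, assuming each pair is given as a (read1, read2) tuple of sequence strings and that the first copy of each duplicated pair is kept (the slides do not say which copy is retained). The function name is hypothetical.

```python
def remove_duplicate_pairs(pairs):
    """Keep the first occurrence of every (read1, read2) pair, drop exact repeats."""
    seen = set()
    unique = []
    for read1, read2 in pairs:
        key = (read1, read2)
        if key in seen:
            continue          # both mates identical to an earlier pair: treat as PCR duplicate
        seen.add(key)
        unique.append(key)
    return unique

pairs = [("ACGT", "TTGA"), ("ACGT", "TTGA"), ("ACGT", "CCCC")]
print(remove_duplicate_pairs(pairs))
```

A pair is dropped only when both mates match; sharing a single mate, as in the third pair, is not enough.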
Error correction before assembly
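Overlap: contig
Ge + en + no + om + mi + ic + cs -> Genomics
Pair-end: scaffold
The paired ends "nom" and "sem" map to the contigs "Genome" and "assembly" and link them into the scaffold Genome****assembly (**** marks the gap).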
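The overlap example (Ge + en + no + ... -> Genomics) can be written as a small sketch. It assumes the fragments are already in the correct order and share a fixed overlap length; a real assembler has to discover both. The function name is hypothetical.

```python
def merge_by_overlap(fragments, overlap=1):
    """Merge ordered fragments whose neighbours share `overlap` characters."""
    if overlap < 1:
        raise ValueError("overlap must be at least 1")
    merged = fragments[0]
    for fragment in fragments[1:]:
        if merged[-overlap:] != fragment[:overlap]:
            raise ValueError(f"{fragment!r} does not overlap the end of {merged!r}")
        merged += fragment[overlap:]   # append only the non-overlapping part
    return merged

print(merge_by_overlap(["Ge", "en", "no", "om", "mi", "ic", "cs"]))  # Genomics
```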
De Bruijn graph construction
Reads: AGATCTTGTTATT
Remove reads in which 'N' bases make up more than 10% of the read length.
Remove reads from short insert-size libraries in which more than 65% of the bases have a quality ≤ 7, and reads from large insert-size libraries in which more than 80% of the bases have a quality ≤ 7.
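The two thresholds above can be expressed as one predicate. This is a sketch under stated assumptions: quality values are passed as a list of integer scores, one per base, and the function and parameter names are hypothetical.

```python
def passes_filter(seq, quals, large_insert=False, max_n_frac=0.10, low_quality=7):
    """Return True if a read survives the N-content and low-quality filters."""
    if not seq or len(seq) != len(quals):
        return False
    # Filter 1: more than 10% of the bases are 'N'
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    # Filter 2: too many bases with quality <= 7
    # (65% cutoff for short insert-size libraries, 80% for large ones)
    max_low_frac = 0.80 if large_insert else 0.65
    low = sum(1 for q in quals if q <= low_quality)
    return low / len(quals) <= max_low_frac

print(passes_filter("ACGTACGTAC", [30] * 10))
print(passes_filter("NNACGTACGT", [30] * 10))
```

The same read can pass for a large insert-size library and fail for a short one, because only the low-quality cutoff differs between the two.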
Contigs: AGATCTTGTTATTGATCTCC
3. Pair-end reads mapping to contigs
4. Construct scaffolds
Note: 1. For mate-pair libraries (>= 2 kb), the read order is just the opposite. 2. A reliable link is built between two contigs when the pair-end/mate-pair read
[Figure residue: de Bruijn graph nodes TTGTT, TATTG, TGTTA, TTATT, GTTAT, ATCTC, TCTCC]
[Figure: simplified graph with four nodes (numbered 1 to 4 in the figure): AGATC, GATCT, ATCTTGTTATTGATC, ATCTCC]
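The four nodes of the simplified graph can be reproduced by collapsing every non-branching chain of 5-mers into one sequence. The sketch below is an illustration only, not SOAPdenovo's implementation: function names are hypothetical, reverse complements are ignored, and isolated cycles are skipped.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Collect k-mer nodes plus successor and predecessor links."""
    nodes, succ, pred = set(), defaultdict(set), defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return nodes, succ, pred

def compact(reads, k):
    """Merge linear (non-branching) chains of k-mers into single sequences."""
    nodes, succ, pred = build_graph(reads, k)

    def starts_chain(node):
        # A node starts a chain unless it is the only successor of its only predecessor.
        if len(pred[node]) != 1:
            return True
        (parent,) = pred[node]
        return len(succ[parent]) != 1

    sequences = []
    for node in sorted(nodes):
        if not starts_chain(node):
            continue
        seq, current = node, node
        while len(succ[current]) == 1:
            (nxt,) = succ[current]
            if len(pred[nxt]) != 1:
                break                 # next node is a merge point: stop here
            seq += nxt[-1]            # extend by the one new base
            current = nxt
        sequences.append(seq)
    return sequences

reads = ["AGATCTTGTTATT", "GTTATTGATCTCC"]
print(compact(reads, 5))
```

GATCT stays a separate node because it has two predecessors (AGATC, TGATC) and two successors (ATCTT, ATCTC); it is the repeat that prevents the two reads from collapsing into a single path.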
Read 1: AGATCTTGTTATT Read 2: GTTATTGATCTCC
set -R parameter
Contigs: GATCTTGTTATTGATCT GATCTCC AGATCT
1. De Bruijn graph construction
Sequence assembly refers to aligning and merging fragments from a much longer DNA sequence in order to reconstruct the original sequence.
(1) Close gaps using pair-end information (one end maps to a contig, the other end falls in the gap).
(2) Do a local assembly using the reads that fall in the gap to obtain a sequence that connects the edges of the two contigs. Note: Gap closure here also means extending contigs.
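Step (1) amounts to collecting the mates that did not map while their partner did. A minimal sketch, assuming each mate is given as a (sequence, contig) tuple with contig set to None when the read is unmapped; this data layout and the function name are hypothetical, and the local assembly of step (2) is not shown.

```python
def reads_falling_in_gap(pairs):
    """Return the unmapped mates whose partner maps to a contig."""
    gap_reads = []
    for (seq1, contig1), (seq2, contig2) in pairs:
        if contig1 is not None and contig2 is None:
            gap_reads.append(seq2)     # mate 1 anchors the pair, mate 2 falls in the gap
        elif contig1 is None and contig2 is not None:
            gap_reads.append(seq1)     # mate 2 anchors the pair, mate 1 falls in the gap
    return gap_reads

pairs = [
    (("AGATC", "contig1"), ("TTGTT", None)),
    (("GATCT", "contig1"), ("TCTCC", "contig2")),
    (("CCCCC", None), ("GGGGG", None)),
]
print(reads_falling_in_gap(pairs))
```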
II. SOAPdenovo algorithm
SOAPdenovo was developed to assemble large genomes, such as the human genome; it also works well for small genomes, such as those of bacteria. It includes five major steps:
◦ Genome assembly
Wang Mingcheng, 2015.10.29
I. Genome survey
K-mer: a contiguous nucleic acid sequence of length K bp.
Suppose every K-mer in the genome is unique, so that the genome contains about G different K-mers. When a read of length L is generated, the probability that a given K-mer is sequenced is (L-K+1)/G.
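A read of length L contains L-K+1 K-mers, which is where the numerator of this probability comes from. The sketch below illustrates this with a 13 bp read and K = 5; the function names and the example values for L, K and G in the last line are hypothetical.

```python
def kmers(read, k):
    """All K-mers of a read, taken with a sliding window of step 1."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def kmer_hit_probability(read_len, k, genome_size):
    """Probability that one read contains a given K-mer: (L - K + 1) / G."""
    return (read_len - k + 1) / genome_size

read = "AGATCTTGTTATT"
print(len(kmers(read, 5)))            # 13 - 5 + 1 = 9
print(kmer_hit_probability(100, 17, 1_000_000))
```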
• De Bruijn graph construction
• Graph simplification to obtain contigs
• Pair-end reads mapping to contigs
• Scaffold construction
• Gap filling with pair-end reads
2. If a K-mer already exists in the graph, merge its links with those of the first occurrence.
De Bruijn graph
2. Graph simplification
[Figure residue: de Bruijn graph nodes AGATC, GATCT, ATCTT, TCTTG, CTTGT, TGATC, TTGAT, ATTGA]
support is larger than the threshold that was set. 3. The gap size is estimated from the insert size of each read pair.
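The link rule and the gap estimate in the note can be sketched as follows. Everything about the data layout is an assumption made for illustration: each pair is given as (contig_a, start_a, contig_b, end_b), where start_a is the position of the first mate on contig A and end_b is the position on contig B where the second mate ends, and the gap is taken as the insert size minus the parts of the insert that lie on the two contigs. This is not SOAPdenovo's actual code.

```python
from collections import defaultdict

def build_links(pairs, contig_len, insert_size, support_cutoff=2):
    """Keep contig links supported by more than `support_cutoff` pairs and estimate gap sizes."""
    gap_estimates = defaultdict(list)
    for contig_a, start_a, contig_b, end_b in pairs:
        if contig_a == contig_b:
            continue                                  # both mates on one contig: no link
        on_a = contig_len[contig_a] - start_a         # insert bases lying on contig A
        gap_estimates[(contig_a, contig_b)].append(insert_size - on_a - end_b)
    links = {}
    for key, gaps in gap_estimates.items():
        if len(gaps) > support_cutoff:                # "support larger than the number set"
            links[key] = sum(gaps) / len(gaps)        # mean gap estimate over all pairs
    return links

pairs = [("A", 60, "B", 30), ("A", 70, "B", 40), ("A", 50, "B", 20), ("A", 90, "C", 10)]
print(build_links(pairs, {"A": 100, "B": 80, "C": 50}, insert_size=200))
```

In this toy input, three pairs support the A-B link and only one supports A-C, so only A-B is kept.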
5. Gap closure
• Get the reads located in the gap and then do a local assembly.
[Figure: de Bruijn graph (K = 5) built from the reads AGATCTTGTTATT and GTTATTGATCTCC. Nodes: AGATC, GATCT, ATCTT, TCTTG, CTTGT, TTGTT, TGTTA, GTTAT, TTATT, TATTG, ATTGA, TTGAT, TGATC, ATCTC, TCTCC]
1. Slide a window along each read to take K-mers, storing the links between neighboring K-mers.
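The sliding step, together with the rule that a K-mer seen again merges its links into the existing node, can be sketched with a dictionary of sets. This is a simplified illustration with a hypothetical function name; it stores only forward links and ignores reverse complements.

```python
def build_de_bruijn(reads, k):
    """Map each K-mer to the set of K-mers that follow it in some read."""
    graph = {}
    for read in reads:
        previous = None
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # An already known K-mer reuses its node, so links from all reads are merged.
            graph.setdefault(kmer, set())
            if previous is not None:
                graph[previous].add(kmer)
            previous = kmer
    return graph

graph = build_de_bruijn(["AGATCTTGTTATT", "GTTATTGATCTCC"], 5)
print(len(graph))                # 15 distinct 5-mers
print(sorted(graph["GATCT"]))    # ['ATCTC', 'ATCTT']
```

GATCT occurs in both reads, and its merged link set is what creates the branch in the graph.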