Maize Gene Mining and Annotation
454 transcript sequence illustrates potential alternative splicing.
• Align (MAGI + gene models) to 454 sequences (GeneSeqer).
– 1334 have a minimum of 2 individual ESTs that align and overlap with a 454 predicted splice site.
– 285 loci with PASA detected conflicts - 358 potential splice variants.
– Coverage simulations
– Empirical data
• Transcript representation, coverage
• Novel gene / transcript isoform discovery
– Process and pitfalls
• High throughput polymorphism discovery
454 Life Sciences GS20 sequencing process
• Shear DNA.
• Attach linkers.
• Denature and immobilize ssDNA onto beads.
• Amplify templates on beads.
• Template coated beads are deposited into a multi-well picotitre plate.
• Each well is a mini-reactor for sequence generation - massively parallel.
Assembly(0): orient(a+/s+) align: 2384-2507()>GT....AG<2900-2907()
Assembly(0): orient(a+/s+) align: 2384-2507()>GT....AG<2713-2770()
454 Life Sciences GS20 sequencing process
• Uses a pyrosequencing protocol.
• Sequencing run consists of 42 cycles.
• Each cycle contains 4 nucleotide wash, extension and rinse sub cycles.
• Photon emissions are processed to provide base calls.
• Signal amplitude is proportional to the number of bases incorporated.
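The relationship between flows and signal can be illustrated with a minimal, noise-free sketch. It assumes a fixed wash order of T, A, C, G (the slides do not state the order) and treats the template string as the sequence being read; `flowgram` and `call_bases` are hypothetical helper names, not part of any 454 software.

```python
FLOW_ORDER = "TACG"  # assumption: the slides do not give the actual wash order


def flowgram(template, cycles=42, flow_order=FLOW_ORDER):
    """Return an ideal signal per flow as (nucleotide, amplitude) pairs.

    The amplitude equals the number of bases incorporated in that flow,
    i.e. the length of the homopolymer run at the current position.
    """
    signals = []
    pos = 0
    for _ in range(cycles):
        for nt in flow_order:
            run = 0
            while pos < len(template) and template[pos] == nt:
                pos += 1
                run += 1
            signals.append((nt, run))
    return signals


def call_bases(signals):
    """Convert (nucleotide, amplitude) pairs to a sequence by rounding amplitudes."""
    return "".join(nt * round(amp) for nt, amp in signals)


ideal = flowgram("TTAG", cycles=1)   # [('T', 2), ('A', 1), ('C', 0), ('G', 1)]
noisy = [("T", 2.1), ("A", 0.9), ("C", 0.1), ("G", 1.2)]
print(call_bases(ideal), call_bases(noisy))
```

Because each flow reports only an amplitude, long homopolymer runs are the case where a rounding step like this is most likely to miscall the number of bases.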
• 200 intron retention - possible unprocessed mRNA
• 50 exon-skip; 18 alt. donor; 33 alt. acceptor; 23 alt. donor + acceptor; 39 mixed.
cluster: MAGI_90118
189128_2748_3485,+,0,2460-2507,2713-2770
040340_0449_0448,+,0,2384-2434
069655_3271_1710,+,0,2390-2479
044934_2827_2354,+,0,2417-2507,2900-2907
056536_0609_1282,+,0,2417-2507,2900-2907
//
HEADER: cluster: MAGI_90118
Individual Alignments: (5)
0 ------- (a+/s+)040340_0449_0448
1 ----------(a+/s+)069655_3271_1710
2 ---------->
3 ---------->
4 -----> <------
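The alignment records for cluster MAGI_90118 can be read as comma-separated exon segments, and the gaps between segments give the implied introns. The sketch below is a hypothetical parser, not PASA code; it assumes the fields are name, strand, an unexplained third field (ignored here), and then `start-end` segments.

```python
RECORDS = [
    "189128_2748_3485,+,0,2460-2507,2713-2770",
    "040340_0449_0448,+,0,2384-2434",
    "069655_3271_1710,+,0,2390-2479",
    "044934_2827_2354,+,0,2417-2507,2900-2907",
    "056536_0609_1282,+,0,2417-2507,2900-2907",
]


def parse_alignment(record):
    """Split a record into (name, strand, [(start, end), ...])."""
    fields = record.split(",")
    name, strand = fields[0], fields[1]
    # fields[2] is skipped: its meaning is not stated on the slide
    segments = [tuple(int(x) for x in seg.split("-")) for seg in fields[3:]]
    return name, strand, segments


def introns(segments):
    """Return the gaps between consecutive aligned segments."""
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(segments, segments[1:])]


def group_by_donor(records):
    """Map intron start -> intron end -> names of reads supporting that intron."""
    by_donor = {}
    for record in records:
        name, _, segments = parse_alignment(record)
        for start, end in introns(segments):
            by_donor.setdefault(start, {}).setdefault(end, []).append(name)
    return by_donor


for donor, acceptors in group_by_donor(RECORDS).items():
    if len(acceptors) > 1:
        print("intron start", donor, "has alternative ends", sorted(acceptors))
```

In this cluster both spliced groups share the exon ending at 2507 but resume at different positions (2713 versus 2900), which is the pattern the two reported assemblies reflect.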
Hit Rate, coverage and contig length potential.
B73 Transcript Sequencing
• Isolated RNA from B73 shoot apical meristem tissue obtained by laser-capture microdissection.
• Provided cDNA to 454 Life Sciences.
• Obtained 288,992 sequences (260,736 >50 bp after trimming poly(A) tails and removing contaminants - 28.8 Mb).
• Removed 14,126 organelle sequences -- 246,460 seqs.
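The trimming and length filter described above might look like the following minimal sketch. The poly(A) threshold of 8 bases and the helper names are assumptions made for illustration; contaminant and organelle screening, which the slide also mentions, would need a separate similarity search and is not modelled.

```python
import re

MIN_LEN = 50                           # keep reads longer than 50 bp, as on the slide
POLY_A_TAIL = re.compile(r"A{8,}$")    # assumption: a run of 8+ trailing A's is a tail


def trim_poly_a(seq):
    """Remove a trailing poly(A) run from an upper-cased read."""
    return POLY_A_TAIL.sub("", seq.upper())


def filter_reads(reads, min_len=MIN_LEN):
    """Trim each read and keep those still longer than min_len."""
    kept = {}
    for name, seq in reads.items():
        trimmed = trim_poly_a(seq)
        if len(trimmed) > min_len:
            kept[name] = trimmed
    return kept


reads = {
    "r1": "ACGT" * 20,                 # 80 bp, no tail: kept
    "r2": "ACGT" * 5 + "A" * 40,       # 20 bp after trimming: dropped
}
print(sorted(filter_reads(reads)))
```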
GS20 OUTPUT
• Advantages
– No clone libraries.
– ~20M bases in a single 4 hour run.
• Disadvantages
– Sequences are small - 105bp avg.
– Error rate somewhat higher than Sanger sequencing.
– Transcript collections are small collections of high information content sequence.
Computer Simulations
• Can 454 capture a transcriptome with sufficient depth of sampling to identify and annotate transcripts?
• Simulate identification and coverage with increasing sequence depth (sequencing runs).
• Think about potential assembly issues.
Computer Simulations
• Start with a set of maize unigenes
• Model transcript abundance based on member frequency
• Select a ‘template’ - influenced by the abundance profile.
• Randomly select a start site >= 50bp from the template end*.
• ‘Generate’ sequence from that point assuming 42 cycles, with a defined nucleotide wash order.
• Repeat 200,000 times (1 run), or more to simulate multiple runs.
• Map sequences back to the templates to examine representation and coverage.
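The simulation steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the wash order T, A, C, G is assumed, abundance is taken as a plain weight per unigene, and the mapping step is replaced by recording where each simulated read came from. All function names are hypothetical.

```python
import random
from collections import Counter

FLOW_ORDER = "TACG"   # assumption: the slide only says the order is "defined"
CYCLES = 42
MIN_FROM_END = 50     # start site must be at least 50 bp from the template end


def flow_read_length(template, start, cycles=CYCLES, flow_order=FLOW_ORDER):
    """Number of bases read from start after the given number of flow cycles."""
    pos = start
    for _ in range(cycles):
        for nt in flow_order:
            # one flow extends through the whole homopolymer run of that base
            while pos < len(template) and template[pos] == nt:
                pos += 1
    return pos - start


def simulate_run(templates, abundance, n_reads, rng):
    """Sample reads from templates in proportion to abundance.

    Returns (reads per template, per-base coverage per template).
    """
    names = list(templates)
    weights = [abundance[name] for name in names]
    coverage = {name: [0] * len(templates[name]) for name in names}
    hits = Counter()
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        seq = templates[name]
        if len(seq) < MIN_FROM_END:
            continue  # template too short to place a start site
        start = rng.randrange(0, len(seq) - MIN_FROM_END + 1)
        length = flow_read_length(seq, start)
        for i in range(start, start + length):
            coverage[name][i] += 1
        hits[name] += 1
    return hits, coverage


templates = {"unigene_a": "ACGT" * 50, "unigene_b": "TTGCA" * 40}
abundance = {"unigene_a": 9, "unigene_b": 1}   # e.g. EST member counts
hits, coverage = simulate_run(templates, abundance, 1000, random.Random(0))
print(hits.most_common())
```

Raising `n_reads` to 200,000 corresponds to one simulated run in the slide's terms; the per-base `coverage` lists can then be inspected for how evenly each template is covered along its length.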
<(a+/s+)044934_2827_2354
<(a+/s+)056536_0609_1282
(a+/s+)189128_2748_3485
ASSEMBLIES: (2)
-------------->
-------------->
<------
<(a+/s+)040340_0449_0448,044934_2827_2354,056
(a+/s+)040340_0449_0448,069655_3271_1710,189128_2748_3485
Maize gene discovery and annotation using 454 transcriptome sequencing
Brad Barbazuk, Scott Emrich, Patrick Schnable
OUTLINE
• An introduction to the 454 Life Sciences GS20 sequencer.
• Examining 454 mediated transcript sequencing
– 18,558 clusters and singletons (Apex lib.)
• Simulating one run suggests that coverage will be biased towards interior portions of the template.
• This trend remains, but diminishes with higher coverage.
Capture of Genic sequences vs. previous collections.
Source database | No. of matching 454 ESTs | No. of novel 454 ESTs
UGA-ISU Apex unigenes (N = 18,558) | 73,145 | 187,591 (71.9%)
GenBank maize ESTs (N = 647,685) | 179,912 | 80,824 (31.0%)
ESTs + ISU MAGI 3.1 (GSS) (N = 862,158) | 239,113 | 21,623 (8.3%)
ESTs + ISU MAGI 3.1 + Organelle genomes + repeats (N = 877,431) | 244,328 | 16,408 (6.3%)
ESTs + ISU MAGI 3.1 + Organelle + Monocot ESTs (N = 1,282,226) | 245,339 | 15,397 (5.9%)
15,397 (5.9%): novel transcript sequence?
SNP detection pipeline
GS20 and Maize
• Maize repeat complexity limits usefulness of small reads.
• Large genome and small read size aggravate assembly.
• GS20 is well suited for transcript sequencing:
Transcript Result Summary
• Over 70% of the 454 SAM ESTs were not represented within a collection of ESTs collected from maize apex tissue.
• Over 30% were not represented within ~600K maize ESTs in GenBank (12/05).
• 15,000 MAGI assemblies that previously did not align to available maize EST sequences align to B73 454 SAM ESTs.
• 390 MAGIs are potential orphans.
Currently Underway:
• Performing wet bench validation of orphans and a selection of potential splice variants.
• Examining the use of 454 transcript sequence to identify gene associated polymorphism between maize inbred lines.