Mining Genes in DNA Using GeneScout

合集下载
  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。

Mining Genes in DNA Using GeneScout

Michael M.Yin

Department of Computer Science

New Jersey Institute of Technology University Heights,Newark,NJ07102,USA

mxy3100@

Jason T.L.Wang

Department of Computer Science

New Jersey Institute of Technology University Heights,Newark,NJ07102,USA

wangj@

Abstract

In this paper,we present a new system,called GeneScout,for predicting gene structures in vertebrate ge-nomic DNA.The system contains specially designed hidden

Markov models(HMMs)for detecting functional sites in-cluding protein-translation start sites,mRNA splicing junc-tion donor and acceptor sites,etc.Our main hypothesis

is that,given a vertebrate genomic DNA sequence,it is always possible to construct a directed acyclic graph

such that the path for the actual coding region of is in the set of all paths on.Thus,the gene detection prob-lem is reduced to that of analyzing the paths in the graph .A dynamic programming algorithm is used tofind the optimal path in.The proposed system is trained using an expectation-maximization(EM)algorithm and its per-

formance on vertebrate gene prediction is evaluated using the10-way cross-validation method.Experimental results show the good performance of the proposed system and its complementarity to a widely used gene detection system. Keywords:Bioinformatics,Genefinding,Hidden Markov models,Knowledge discovery,Data mining

1Introduction

Data mining,or knowledge discovery from data,refers to the process of extracting interesting,non-trivial,implicit, previously unknown and potentially useful information or patterns from data.In life sciences,this process could re-fer tofinding clustering rules for gene expression,discover-ing classifications rules for proteins,detecting associations between metabolic pathways,predicting genes in genomic DNA sequences,etc.[6,7].

Our research is targeted toward developing effective and accurate methods for automatically detecting gene struc-tures in the genomes of high eukaryotic organisms.This paper presents a data mining system for automated gene dis-covery.2Our Approach

2.1HMM Models for Predicting Functional Sites

Our proposed GeneScout system contains several spe-cially designed HMM models for predicting functional sites as well as an HMM model for calculating coding poten-tials[8].Often,the functional sites include(almost)in-variant(consensus)nucleotides and other degenerate fea-tures.Thus,the invariant nucleotides themselves do not completely characterize a functional site.For example,a start codon is always a sequence of ATG and it is the start position in mRNA for protein translation,so the start codon is thefirst3bases of the coding region of a gene.ATG is also the codon for Methionine,a regular amino acid occur-ring at many positions in all of the known proteins.This means one is unable to detect the start codon by simply searching for ATG in a genomic DNA sequence.

It is reported that there are some statistic relations be-tween a start codon ATG and the13nucleotides immedi-ately preceding it and the3bases immediately following it[5].We call these19bases containing a start codon a start site.We build an HMM model,called the Start Site Model,to model the start site.Figure1illustrates the model. As shown in Figure1,there are19states for the Start Site Model.Except for states14,15and16,there are four pos-sible bases at each state,and a base at one state may have four possible ways to transit to the next state.States14,15 and16are constant states(representing a start codon),and the transitions from state14to15and from state15to16 are also constant with a probability of1.With the Start Site Model,we can use the HMM algorithms described in our previously published paper[9]to detect a start site.1

相关文档
最新文档