生物序列分析中几个典型算法介绍
Research background and directions of bioinformatics
Profile HMMs for sequence families
Motif Discovery
Liu Lifang, School of Computer Science, Xidian University
Background
DNA (deoxyribonucleic acid)
1. Molecular composition of DNA
A nucleotide consists of:
•a phosphate
•a sugar
•one of four bases:
  adenine (A)
  guanine (G)
  cytosine (C)
  thymine (T)
2. Base-pairing rules
•A (adenine) pairs with T (thymine)
•C (cytosine) pairs with G (guanine)
3. A purine and a pyrimidine join through hydrogen bonds to form a base pair.
4. Directionality of the DNA molecule: 5'→3'
5. The double-helix structure of DNA
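The base-pairing rules and 5'→3' directionality above can be expressed in a few lines of code. This is a minimal sketch; the function name and example sequences are ours, not from the slides.

```python
# Watson-Crick base-pairing rules (A-T, C-G) as a lookup table.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the complementary strand of a 5'->3' sequence, also read 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

print(reverse_complement("ATGC"))  # GCAT
```

Reversing before complementing reflects the antiparallel orientation of the two strands: the complement of a 5'→3' strand runs 3'→5', so it is read backwards.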
RNA, transcription, and translation
1. RNA (ribonucleic acid): single-stranded; uracil (U) replaces thymine (T); found in both the nucleus and the cytoplasm.
2. Transcription: DNA strand → RNA strand; messenger RNA (mRNA); promoters.
3. Translation: the process by which the genetic information carried on mRNA directs protein synthesis in the ribosome.
Mutation
1. Local changes in DNA content caused by imperfect copying during evolution.
2. The three main kinds of mutation are:
•substitution
•insertion or deletion
•rearrangement
Genes
A gene consists of coding regions (exons) interleaved with non-coding regions (introns).
Genome
Every chromosome carries many genes; a single chromosome of a higher organism may carry thousands of them. The complete set of gene sequences in a cell, together with the intervening sequences, is called the genome.
Human Genome Project
The genetic code
1. The genetic code is a logical mapping that determines which protein sequence the gene information stored in DNA and mRNA specifies.
2. Each base triplet is called a codon.
3. There are 4^3 = 64 possible base triplets but only 20 amino acid types, so different codons can encode the same amino acid.
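The many-to-one mapping from 64 codons to 20 amino acids can be illustrated with a small lookup table. Only a handful of the 64 entries of the standard genetic code are shown here; the function name is ours.

```python
# A few entries of the standard genetic code (the full table has 64 codons).
# Synonymous codons such as UUU/UUC show why 4^3 = 64 triplets can encode
# only 20 amino acids: the mapping is many-to-one.
CODON_TABLE = {
    "UUU": "Phe", "UUC": "Phe",            # two codons, one amino acid
    "AUG": "Met",                          # also the start codon
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def translate_codon(codon):
    """Map an mRNA codon to its amino acid (partial table, for illustration)."""
    return CODON_TABLE.get(codon, "?")

print(translate_codon("UUU"), translate_codon("UUC"))  # Phe Phe
```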
The central dogma of molecular biology (DNA → RNA → protein)
Resulting computational problems
1. Sequence alignment
2. Genome rearrangement
3. Prediction of protein structure and function
4. Gene finding (exons and introns)
5. Sequence assembly
...
Genome sequence assembly
Gene identification
Gene function prediction
Gene polymorphism analysis
Gene evolution
mRNA structure prediction
Gene chip (microarray) design
Microarray data analysis
Disease-related gene analysis
Protein sequence analysis
Protein family classification
Protein structure prediction
Protein folding
Metabolic pathway analysis
Transcriptional regulation mechanisms
Protein chip design
Protein chip data analysis
Drug design
Bioinformatics: research directions
Bioinformatics
Subject areas covered by the journal Bioinformatics:
GENOME ANALYSIS
SEQUENCE ANALYSIS
STRUCTURAL BIOINFORMATICS
GENE EXPRESSION
GENETICS AND POPULATION ANALYSIS
SYSTEMS BIOLOGY
DATA AND TEXT MINING
DATABASES AND ONTOLOGIES
Multiple sequence alignment
Sequence analysis is the primary task of bioinformatics; multiple sequence alignment and motif discovery are its two main methods.
Optimization models for the multiple sequence alignment problem
1) SP score (weighted sums-of-pairs with affine gap penalties)
The SP score of a multiple alignment A of n sequences S is defined as

    COST(A) = \sum_{i=2}^{n} \sum_{j=1}^{i-1} w_{ij} \, COST(s_i, s_j)

where COST(s_i, s_j) is the score of the pairwise alignment of s_i and s_j induced by A, and w_{ij} is the weight of the sequence pair. An alignment A' of S is called optimal if COST(A') = \max_A COST(A).
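The SP definition above can be sketched directly in code. This is a minimal illustration: the match/mismatch/gap values are hypothetical, and real SP scoring uses a substitution matrix and affine gap penalties rather than a flat gap cost.

```python
# Pairwise cost of two aligned rows under a simple linear scheme
# (values are illustrative, not from the slides).
def pair_cost(a, b, match=1, mismatch=-1, gap=-2):
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue                      # a gap-gap column contributes nothing
        elif x == "-" or y == "-":
            score += gap
        else:
            score += match if x == y else mismatch
    return score

# SP score: weighted sum of pair_cost over all sequence pairs (i > j).
def sp_score(alignment, weights=None):
    n = len(alignment)
    total = 0.0
    for i in range(1, n):
        for j in range(i):
            w = weights[i][j] if weights else 1.0
            total += w * pair_cost(alignment[i], alignment[j])
    return total

print(sp_score(["ACG-T", "AC-GT", "ACGGT"]))  # 3.0
```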
2) COFFEE score (consistency based objective function for alignment evaluation)
The COFFEE score of a multiple alignment A is defined as

    COST(A) = \left[ \sum_{i=2}^{n} \sum_{j=1}^{i-1} w_{ij} \, SCORE(A_{ij}) \right] \Big/ \left[ \sum_{i=2}^{n} \sum_{j=1}^{i-1} w_{ij} \, LEN(A_{ij}) \right]    (2.6)

where A_{ij} is the pairwise alignment of s_i and s_j induced by A, LEN(A_{ij}) is its length, and SCORE(A_{ij}) is the consistency of A_{ij} with the library's optimal alignment of s_i and s_j, i.e. the number of aligned residue pairs that A_{ij} shares with the library alignment. The weight w_{ij} is the similarity of the two sequences: the number of identical aligned residues in the optimal pairwise alignment of s_i and s_j divided by the length of the shorter sequence, min{l_i, l_j}. An alignment A' of S is called optimal if COST(A') = \max_A COST(A).
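A sketch of the COFFEE evaluation follows. Everything here is a simplified assumption: the "library" is a plain dict mapping a sequence pair to its set of aligned residue-position pairs, and for simplicity the denominator counts aligned residue pairs rather than full alignment columns. The function names are ours.

```python
# Residue-position pairs aligned by rows i and j of the MSA
# (gap-containing columns align no residue pair).
def induced_pairs(row_i, row_j):
    pi = pj = 0
    pairs = set()
    for a, b in zip(row_i, row_j):
        if a != "-" and b != "-":
            pairs.add((pi, pj))
        if a != "-":
            pi += 1
        if b != "-":
            pj += 1
    return pairs

# COFFEE-style score: weighted consistency with a library of precomputed
# pairwise alignments, normalised by the number of aligned residue pairs.
def coffee_score(alignment, library, weights):
    num = den = 0.0
    n = len(alignment)
    for i in range(1, n):
        for j in range(i):
            aij = induced_pairs(alignment[i], alignment[j])
            num += weights[(i, j)] * len(aij & library[(i, j)])
            den += weights[(i, j)] * len(aij)
    return num / den

msa = ["AC-T", "ACGT"]
library = {(1, 0): {(0, 0), (1, 1), (3, 2)}}   # assumed precomputed library
print(coffee_score(msa, library, {(1, 0): 1.0}))  # 1.0
```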
3) Profile hidden Markov models (Profile HMMs)
Workflow: start from aligned sequences and build a profile HMM (training). The trained model is then used for:
•multiple alignment: align new sequences against the profile (Viterbi)
•database search: query a sequence against a database of profile HMMs (Forward)
Including BEGIN and END states
Two new states can be added to the Markov model; both are treated as silent states (they emit no symbols).
Adding the END state means the model now defines a probability distribution over all possible sequences of any length.
Prediction of fair/loaded die
HMM formal definition
The objective is to distinguish the sequence of states from the sequence of symbols; call the state sequence the path, \pi. The i-th state in the path is \pi_i. The chain is defined by the transition probabilities

    a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)

and the emission probabilities

    e_k(b) = P(x_i = b \mid \pi_i = k),

so the joint probability of an observed sequence x and a state path \pi is

    P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \, a_{\pi_i \pi_{i+1}}

However, to use this we must already know the path.
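The joint probability P(x, \pi) can be evaluated directly when the path is known. The sketch below uses the fair/loaded die HMM of the following slides; the transition and emission values are the standard textbook illustration, assumed here, and the terminating transition into the END state is omitted for simplicity.

```python
# Fair (F) / loaded (L) die HMM: assumed illustrative parameters.
trans = {("F", "F"): 0.95, ("F", "L"): 0.05,
         ("L", "L"): 0.9,  ("L", "F"): 0.1}
emit = {"F": {s: 1 / 6 for s in "123456"},           # fair die
        "L": {**{s: 0.1 for s in "12345"}, "6": 0.5}}  # loaded die favours 6
start = {"F": 0.5, "L": 0.5}                         # a_{0,k}: assumed uniform

def joint_prob(x, path):
    """P(x, pi) = a_{0,pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_{i-1}, pi_i}."""
    p = start[path[0]] * emit[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[(path[i - 1], path[i])] * emit[path[i]][x[i]]
    return p

print(joint_prob("666", "LLL"))  # 0.5 * 0.5 * (0.9 * 0.5)**2 = 0.050625
```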
The three core problems of hidden Markov models
Evaluation: given an HMM whose transition probabilities a_{kl} and emission probabilities e_k(x_i) are all known, compute the probability P(x) that the model generates a particular observation sequence x = x_1 x_2 ... x_L.
Decoding: given an HMM and an observation sequence x = x_1 x_2 ... x_L it produced, determine the state sequence most likely to have generated it, i.e. the path \pi = \pi_1 ... \pi_L.
Learning: given only the rough structure of an HMM (e.g. the number of states and the number of visible symbols per state), with a_{kl} and e_k(x_i) unknown, determine these parameters from a set of training sequences of visible symbols.
An HMM for a DNA motif alignment: transitions are shown as arrows whose thickness indicates their probability, and in each state a histogram shows the emission probabilities of the four bases.
Example aligned sequences:
ACAC--AGC
AGA---ATC
ACCG--ATC
The model is specified by transition probabilities and output (emission) probabilities, with dedicated insertion states.
Building the final topology: match states, deletion states, and insertion states; the number of match states is set to the average sequence length in the family.
PFAM: a database of protein families.
The Viterbi Algorithm
Given an observation sequence x = x_1 x_2 ... x_L and an HMM, we would like to know the most probable state path that led to the generation of the observed sequence, i.e., the state sequence that is optimal in some meaningful sense.
Formal techniques exist, e.g., the Viterbi algorithm: a dynamic programming algorithm that finds the best path (state sequence), the one with the highest probability.
Viterbi algorithm: finding the path
The most probable path

    \pi^* = \arg\max_\pi P(x, \pi)

can be found recursively. We want to choose the path with the highest probability, as in the dynamic programming examples. Suppose the probability v_k(i) of the most probable path ending in state k with observation x_i is known for all states k; then

    v_l(i+1) = e_l(x_{i+1}) \max_k \left( v_k(i) \, a_{kl} \right)

The running time is O(L |Q|^2) for |Q| states and sequence length L.
Viterbi Algorithm
All sequences have the same start state, so v_0(0) = 1. By keeping pointers backwards, the actual path can be found by backtracking. The full algorithm:

Initialisation (i = 0):    v_0(0) = 1;  v_k(0) = 0 for k > 0
Recursion (i = 1..L):      v_l(i) = e_l(x_i) \max_k (v_k(i-1) a_{kl});
                           ptr_i(l) = \arg\max_k (v_k(i-1) a_{kl})
Termination:               P(x, \pi^*) = \max_k (v_k(L) a_{k0});
                           \pi^*_L = \arg\max_k (v_k(L) a_{k0})
Traceback (i = L..1):      \pi^*_{i-1} = ptr_i(\pi^*_i)
The Forward algorithm
We are also interested in calculating the probability of a sequence, P(x), given a model of the system's behavior. The number of possible paths increases exponentially with length, so it is not possible to enumerate all of them. P(x) can instead be calculated by a dynamic programming algorithm like the Viterbi algorithm, replacing maximization with a sum:

    P(x) = \sum_\pi P(x, \pi)
    f_k(i) = P(x_1 ... x_i, \pi_i = k)
    f_l(i+1) = e_l(x_{i+1}) \sum_k f_k(i) \, a_{kl}
Backward algorithm, or posterior state probabilities
First calculate the probability of producing the entire observed sequence with the i-th symbol produced by state k:

    P(x, \pi_i = k) = P(x_1 ... x_i, \pi_i = k) \, P(x_{i+1} ... x_L \mid \pi_i = k)
                    = f_k(i) \, b_k(i)

where f_k(i) = P(x_1 ... x_i, \pi_i = k) comes from the forward algorithm, and

    b_k(i) = P(x_{i+1} ... x_L \mid \pi_i = k)
Backward Algorithm
Initialisation (i = L):        b_k(L) = a_{k0} for all k
Recursion (i = L-1, ..., 1):   b_k(i) = \sum_l a_{kl} \, e_l(x_{i+1}) \, b_l(i+1)
Termination:                   P(x) = \sum_l a_{0l} \, e_l(x_1) \, b_l(1)

Calculation of the backward term is similar to that of the forward term, except that it starts from the end of the sequence.
Posterior probability of a fair die
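The backward recursion can be sketched on the same fair/loaded die HMM (assumed parameters). For simplicity the end transitions a_{k0} are taken to be 1, so b_k(L) = 1; the termination step mirrors the forward algorithm's initialisation, and the result equals the forward value of P(x).

```python
# Backward algorithm: b_k(i) = sum_l a_{kl} * e_l(x_{i+1}) * b_l(i+1).
def backward_prob(x, states, start, trans, emit):
    b = {k: 1.0 for k in states}                 # b_k(L) = 1 (end transitions assumed 1)
    for i in range(len(x) - 1, 0, -1):
        b = {k: sum(trans[(k, l)] * emit[l][x[i]] * b[l] for l in states)
             for k in states}
    # termination: P(x) = sum_l a_{0l} e_l(x_1) b_l(1)
    return sum(start[l] * emit[l][x[0]] * b[l] for l in states)

trans = {("F", "F"): 0.95, ("F", "L"): 0.05, ("L", "L"): 0.9, ("L", "F"): 0.1}
emit = {"F": {s: 1 / 6 for s in "123456"},
        "L": {**{s: 0.1 for s in "12345"}, "6": 0.5}}
start = {"F": 0.5, "L": 0.5}
print(backward_prob("66", "FL", start, trans, emit))  # 19/144
```

Combining forward and backward terms gives the posterior P(\pi_i = k \mid x) = f_k(i) b_k(i) / P(x), which is what the fair-die posterior plot on the slide shows.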
Parameter estimation for HMMs
The most difficult problem is specifying the model:
•what are the states?
•how are they connected?
•what are the transition and emission probabilities?
Start with estimating the probabilities. From training sequences x^1 to x^n, the log likelihood of the model is

    l(x^1, ..., x^n \mid \theta) = \log P(x^1, ..., x^n \mid \theta) = \sum_{j=1}^{n} \log P(x^j \mid \theta)

where \theta represents all of the transition and emission probabilities.
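When the state path of each training sequence is known, the maximum-likelihood estimates of \theta are simply normalised counts. The sketch below shows that case; the function name and example data are ours, and the pseudocounts used in practice to avoid zero probabilities are omitted.

```python
from collections import Counter

# ML estimation from labelled sequences: transition and emission
# probabilities are normalised counts (pseudocounts omitted).
def estimate_params(labelled, states, alphabet):
    t_counts, e_counts = Counter(), Counter()
    for x, path in labelled:
        for i, (sym, st) in enumerate(zip(x, path)):
            e_counts[(st, sym)] += 1
            if i > 0:
                t_counts[(path[i - 1], st)] += 1
    trans = {(k, l): t_counts[(k, l)] / max(1, sum(t_counts[(k, m)] for m in states))
             for k in states for l in states}
    emit = {(k, s): e_counts[(k, s)] / max(1, sum(e_counts[(k, a)] for a in alphabet))
            for k in states for s in alphabet}
    return trans, emit

trans, emit = estimate_params([("6616", "LLFL")], "FL", "123456")
print(trans[("L", "L")], emit[("L", "6")])  # 0.5 1.0
```

When the paths are unknown, the counts are replaced by expected counts computed with the forward and backward algorithms (Baum-Welch).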
More occasionally dishonest casino
Figure: the real model compared with models estimated from 300 rolls and from 30,000 rolls.