Evolutionary Models
Markov Processes
Motivation
1. Understand dynamical processes in nature.
2. Deterministic (celestial motion of stars and planets, trajectories of cannonballs)
3. Probabilistic (stock prices, DNA mutation, random walks)
Examples
1. Random Walks
Start at zero and jump up or down, each with probability 1/2.
This is a Markov process: only the current state, and no information about the past, influences the next step.
The process of the running maximum is non-Markovian: its next value depends on the current position of the walk, which the current maximum alone does not determine.
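A minimal sketch of the random walk and its running maximum, assuming unit steps and Python's standard `random` module. The function names and the seed are illustrative choices, not part of the original notes.

```python
import random

def random_walk(n_steps, seed=None):
    """Return the positions X_0..X_n of a symmetric +/-1 walk starting at 0."""
    rng = random.Random(seed)
    positions = [0]
    for _ in range(n_steps):
        # The next position depends only on the current one (Markov property).
        positions.append(positions[-1] + rng.choice((-1, 1)))
    return positions

def running_maximum(positions):
    """Return M_t = max(X_0..X_t) for every t."""
    maxima = []
    current_max = positions[0]
    for x in positions:
        # Updating M_t needs the walk position x, not just the previous maximum.
        current_max = max(current_max, x)
        maxima.append(current_max)
    return maxima

walk = random_walk(20, seed=1)
maxima = running_maximum(walk)
print(walk)
print(maxima)
```

The update `max(current_max, x)` makes the point visible: the pair (position, maximum) evolves as a Markov process, but the maximum on its own does not.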
2. DNA Evolution
Independent biochemical processes
Here we have the four players: the purines adenine (A) and guanine (G), and the pyrimidines thymine (T) and cytosine (C).
The most common substitutions are transitions, which exchange one purine for the other or one pyrimidine for the other, since the two purines and the two pyrimidines are chemically more similar to each other.
Less common are the transversions, which exchange a purine with a pyrimidine or vice versa.
In double-stranded DNA an A is always paired with a T and a C with a G. If one partner changes its identity, the other has to change accordingly to preserve Watson-Crick base pairing.
For example, if an A changes to a C, the paired T has to change to a G.
For neutral evolution the two strands are equivalent, so the rate of a substitution and the rate of its complementary substitution are assumed to be the same.
Thus we are left with 6 independent rates, 2 for transitions and 4 for transversions.
Four of those processes change the GC content, as indicated by the blue and red arrows.
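The counting argument above can be checked with a short sketch. The helper names are illustrative; the grouping pairs each substitution with the change it forces on the opposite strand.

```python
PURINES = {"A", "G"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def substitution_type(old, new):
    """Classify a substitution as a transition or a transversion."""
    if old == new:
        raise ValueError("not a substitution")
    # Same chemical class (both purines or both pyrimidines) means transition.
    return "transition" if (old in PURINES) == (new in PURINES) else "transversion"

def complementary_substitution(old, new):
    """Return the change forced on the opposite strand."""
    return COMPLEMENT[old], COMPLEMENT[new]

def changes_gc_content(old, new):
    """True if exactly one of the two bases is G or C."""
    return (old in "GC") != (new in "GC")

# Group the 12 possible substitutions into strand-equivalent classes.
classes = set()
for old in "ACGT":
    for new in "ACGT":
        if old != new:
            classes.add(frozenset([(old, new), complementary_substitution(old, new)]))

# Both members of a class share type and GC effect, so one representative suffices.
n_transitions = sum(substitution_type(*next(iter(c))) == "transition" for c in classes)
n_gc_changing = sum(changes_gc_content(*next(iter(c))) for c in classes)
print(len(classes), n_transitions, n_gc_changing)  # prints: 6 2 4
```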
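Jukes-Cantor Model
The simplest nucleotide substitution model was proposed by Jukes and Cantor (1969).
It assumes that substitutions occur at the same rate at every site, and that at each site the nucleotide changes into one of the other three nucleotides with probability p per year (or per any other unit of time).
Variants: with rate uniformity among sites; with rate variation among sites.
Substitution rates are the same across sites, the substitution rates among the four nucleotides are equal, and the nucleotide frequencies are equal as well.
However many mutations accumulate, on average 25% of the sites remain identical.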
With time more mutations happen. However, due to back-mutations, the divergence saturates.
The evolutionary distance between two sequences can be estimated from the observed divergence.
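The formula itself did not survive in these notes. As a hedged sketch, the standard Jukes-Cantor results are d = -(3/4) ln(1 - 4p/3) for the distance and p(d) = (3/4)(1 - exp(-4d/3)) for the expected divergence, where p is the observed fraction of differing sites and d is the expected number of substitutions per site. The function names are illustrative.

```python
import math

def jc69_distance(p):
    """Jukes-Cantor distance from the observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75); beyond that the divergence is saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc69_expected_divergence(d):
    """Expected fraction of differing sites after d substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

for d in (0.1, 0.5, 1.0, 5.0):
    print(d, round(jc69_expected_divergence(d), 4))
```

As d grows the expected divergence approaches 0.75, which is the saturation described above: 25% of sites still agree.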
Kimura 2 Parameter Model
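The Kimura 2-parameter model is a generalization of the Jukes-Cantor model.
It assumes that transitions and transversions have different substitution rates, while the nucleotide frequencies are equal and the substitution rate is uniform across sites.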
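A hedged sketch of the standard Kimura 2-parameter distance, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitional and transversional differences. The function names and the example sequences are illustrative, and the sequences are assumed to be aligned without gaps.

```python
import math

PURINES = {"A", "G"}

def count_differences(seq1, seq2):
    """Return (P, Q): proportions of transitional and transversional differences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1    # purine <-> pyrimidine
    n = len(seq1)
    return transitions / n, transversions / n

def k80_distance(P, Q):
    """Kimura 2-parameter distance in substitutions per site."""
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)

P, Q = count_differences("AAGCT", "GACCA")
print(P, Q, k80_distance(P, Q))
```

When transversions are exactly twice as frequent as transitions (Q = 2P), the expression reduces to the Jukes-Cantor distance for p = P + Q.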
Transitions and transversions occur at different rates.
Felsenstein Model
The Felsenstein model is another generalization of the Jukes-Cantor model.
Hasegawa Kishino Yano (HKY) Model
The HKY model generalizes the Felsenstein model further.
It accounts for general stationary states and for differences between transitions and transversions.
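A minimal sketch of an HKY rate matrix, assuming the common parameterisation in which the rate from base i to base j is kappa * pi_j for transitions and pi_j for transversions. The matrix is left unnormalised, and the frequencies and kappa below are hypothetical values chosen only for illustration.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def hky_rate_matrix(freqs, kappa):
    """Return the 4x4 HKY rate matrix (rows and columns in BASES order)."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i == j:
                continue
            is_transition = (a in PURINES) == (b in PURINES)
            q[i][j] = (kappa if is_transition else 1.0) * freqs[b]
        # Diagonal entry makes the row sum to zero (it is still 0.0 when summed).
        q[i][i] = -sum(q[i])
    return q

freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # hypothetical frequencies
Q = hky_rate_matrix(freqs, kappa=2.0)             # hypothetical kappa

# Stationarity check: sum_i pi_i * Q[i][j] should vanish for every j.
flux = [sum(freqs[a] * Q[i][j] for i, a in enumerate(BASES)) for j in range(4)]
print(flux)
```

Setting kappa = 1 gives the Felsenstein model, and setting all four frequencies to 1/4 gives the Kimura 2-parameter model.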
Tamura Nei Model (1993)
Variants: with rate uniformity and pattern homogeneity; with rate variation among sites; with pattern heterogeneity between lineages; with rate variation and pattern heterogeneity.
The model assumes that the substitution rate is the same across sites, but that transition and transversion rates differ, and that purine transitions and pyrimidine transitions each have their own rate.
The TN93 model has four frequency parameters. It accounts for the difference between transitions and transversions and distinguishes the two kinds of transitions.
General Time Reversible Model
A reversible model is sometimes preferred because it simplifies certain expressions when computing the likelihood of a model.
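Reversibility means detailed balance: pi_i * q_ij = pi_j * q_ji for every pair of states. The sketch below builds a GTR-style matrix from six symmetric exchangeabilities and checks that condition. The exchangeability values, the frequencies and the cyclic counterexample are hypothetical.

```python
def gtr_rate_matrix(pi, exchangeabilities):
    """Build q_ij = s_ij * pi_j from symmetric exchangeabilities s_ij (keys i < j)."""
    n = len(pi)
    q = [[0.0] * n for _ in range(n)]
    for (i, j), s in exchangeabilities.items():
        q[i][j] = s * pi[j]
        q[j][i] = s * pi[i]
    for i in range(n):
        q[i][i] = -sum(q[i])  # rows sum to zero
    return q

def is_time_reversible(q, pi, tol=1e-12):
    """Check detailed balance pi_i * q_ij == pi_j * q_ji for all i < j."""
    n = len(pi)
    return all(abs(pi[i] * q[i][j] - pi[j] * q[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))

pi = [0.1, 0.2, 0.3, 0.4]                        # hypothetical frequencies
s = {(0, 1): 1.0, (0, 2): 2.0, (0, 3): 0.5,
     (1, 2): 1.5, (1, 3): 3.0, (2, 3): 0.7}      # hypothetical exchangeabilities
gtr = gtr_rate_matrix(pi, s)

# A cyclic process 0 -> 1 -> 2 -> 3 -> 0 has a uniform stationary state
# but violates detailed balance.
cyclic = [[-1.0, 1.0, 0.0, 0.0],
          [0.0, -1.0, 1.0, 0.0],
          [0.0, 0.0, -1.0, 1.0],
          [1.0, 0.0, 0.0, -1.0]]

print(is_time_reversible(gtr, pi))          # True
print(is_time_reversible(cyclic, [0.25] * 4))  # False
```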
Dynamics on dsDNA are coupled through the repair process.
Reverse Complement Symmetric Model
12-Parameter Model
The most general model has all 12 rate parameters independent of each other.
This model has the most parameters, which are therefore harder to estimate from a given amount of data. However, it is useful for detecting violations of the strand symmetry or reversibility of DNA substitutions.
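A small sketch of what a strand-symmetry check on 12 independent rates could look like. The rates are stored in a dictionary keyed by (old, new) base pairs; the numeric values are arbitrary placeholders, not estimates from data, and averaging complementary rates is just one simple way to impose the symmetry.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complementary(pair):
    """Map a substitution (old, new) to the one seen on the opposite strand."""
    old, new = pair
    return COMPLEMENT[old], COMPLEMENT[new]

def is_strand_symmetric(rates, tol=1e-12):
    """True if every rate equals the rate of its complementary substitution."""
    return all(abs(rate - rates[complementary(pair)]) <= tol
               for pair, rate in rates.items())

def symmetrise(rates):
    """Average each rate with its complementary rate (12 -> 6 free parameters)."""
    return {pair: 0.5 * (rate + rates[complementary(pair)])
            for pair, rate in rates.items()}

pairs = [(a, b) for a in "ACGT" for b in "ACGT" if a != b]
rates = {pair: float((k + 1) ** 2) for k, pair in enumerate(pairs)}  # placeholder values
symmetric_rates = symmetrise(rates)

print(is_strand_symmetric(rates), is_strand_symmetric(symmetric_rates))
print(len(set(symmetric_rates.values())))
```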
CpG Methylation Deamination (occurs mainly in vertebrates)
Neighbor Dependencies
We must take into account the state of at least two neighboring nucleotides.
With neighbor dependencies, including the CpG → CpA / TpG process:
For a DNA sequence of length L there are 4^L states.
Even if the rate matrix Q is sparse, the matrix exponential has no vanishing entries. The computation of exp(Q) involves matrix multiplications, which get exponentially more time-consuming with L. However, in practice we can make use of a cluster approximation and only need to consider 4^4 x 4^4 = 256 x 256 matrices.
Overfitting Data
Models with more parameters can fit data better.
Likelihood Ratio Test
Should we therefore use ever more complicated, parameter-rich models? No!
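A hedged sketch of a likelihood ratio test between two nested models that differ by one free parameter, for example Jukes-Cantor versus Kimura 2-parameter. It relies on the usual asymptotic result that twice the log-likelihood difference follows a chi-square distribution with one degree of freedom. The log-likelihood values are invented for illustration.

```python
import math

def likelihood_ratio_statistic(loglik_simple, loglik_complex):
    """Return 2 * (lnL_complex - lnL_simple) for nested models."""
    return 2.0 * (loglik_complex - loglik_simple)

def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    # For one degree of freedom, P(Z^2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

loglik_jc = -1250.0    # hypothetical maximised log-likelihood, simpler model
loglik_k80 = -1245.0   # hypothetical maximised log-likelihood, richer model

statistic = likelihood_ratio_statistic(loglik_jc, loglik_k80)
p_value = chi2_sf_1df(statistic)
print(statistic, p_value)
```

A small p-value suggests the extra parameter is justified by the data; otherwise the simpler model is kept. Models that differ by more than one parameter need the chi-square tail for the matching number of degrees of freedom, which this sketch does not cover.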
Maximum Likelihood Estimation
Bayesian Analysis: includes prior knowledge and returns distributions of possible parameter values.
Maximum Likelihood: disregards prior knowledge and returns one estimate for each parameter.
Given a model with some undetermined parameters, maximum likelihood finds the parameter values under which the model best explains the observed data.
The fraction of red balls
Draw balls from a bag and estimate the fraction of red balls.
Red balls: R; green balls: G
Asymptotic normality, consistent estimator.
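A minimal sketch of the maximum likelihood estimate for the ball example, assuming independent draws with replacement. With R red and G green balls observed, the log-likelihood of a red fraction f is R ln f + G ln(1 - f), which is maximised at f = R / (R + G). The counts below are made up.

```python
import math

def log_likelihood(f, red, green):
    """Log-likelihood of red fraction f given observed counts (0 < f < 1)."""
    return red * math.log(f) + green * math.log(1.0 - f)

def ml_estimate(red, green):
    """Closed-form maximum likelihood estimate of the red fraction."""
    return red / (red + green)

red, green = 3, 7  # hypothetical draw
grid = [k / 1000 for k in range(1, 1000)]
best_on_grid = max(grid, key=lambda f: log_likelihood(f, red, green))
print(ml_estimate(red, green), best_on_grid)
```

The grid search is only there to confirm numerically that the closed-form estimate sits at the maximum of the likelihood.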
Bayesian Analysis:
Assuming a uniform prior: Prob(model) = 1
Mean a posteriori estimator:
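The estimator's formula is missing from these notes. As a hedged sketch for the ball example: with a uniform prior on the red fraction f and R red, G green balls observed, the posterior is proportional to f^R (1 - f)^G, and its mean is (R + 1) / (R + G + 2). The code compares this closed form with a numerical integration; the names and counts are illustrative.

```python
def posterior_mean_uniform_prior(red, green):
    """Closed-form posterior mean of the red fraction under a uniform prior."""
    return (red + 1) / (red + green + 2)

def posterior_mean_numeric(red, green, steps=10000):
    """Posterior mean by midpoint-rule integration over f in (0, 1)."""
    numerator = 0.0
    normaliser = 0.0
    for k in range(steps):
        f = (k + 0.5) / steps
        weight = f ** red * (1.0 - f) ** green  # likelihood times uniform prior
        numerator += f * weight
        normaliser += weight
    return numerator / normaliser

red, green = 3, 7  # hypothetical draw
print(posterior_mean_uniform_prior(red, green), posterior_mean_numeric(red, green))
```

Unlike the maximum likelihood estimate R / (R + G), the posterior mean never reaches exactly 0 or 1, and with no data at all it returns 1/2.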
The Likelihood Function
Consider a sequence which evolved for some finite time T:
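The likelihood expression is not preserved here. As an illustration for the simplest case, the sketch below assumes sites evolve independently under the Jukes-Cantor model, so the likelihood of an aligned pair of sequences is a product over sites of (1/4) * P_ab(d), where d is the expected number of substitutions per site accumulated over the time T. The sequences, the function names and the grid search are hypothetical choices.

```python
import math

def jc69_transition_probability(a, b, d):
    """Probability that base a is observed as base b after distance d."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def log_likelihood(seq1, seq2, d):
    """Log-likelihood of two aligned sequences separated by distance d."""
    total = 0.0
    for a, b in zip(seq1, seq2):
        # 0.25 is the stationary frequency of the starting base.
        total += math.log(0.25 * jc69_transition_probability(a, b, d))
    return total

seq1 = "ACGTACGTAC"  # hypothetical aligned sequences, 2 of 10 sites differ
seq2 = "ACGTTCGTAA"
grid = [k / 1000 for k in range(1, 2000)]
best_d = max(grid, key=lambda d: log_likelihood(seq1, seq2, d))
print(best_d)
```

Maximising this function over d is the maximum likelihood estimate of the distance. With neighbour dependencies the sites no longer factorise, so the per-site product has to be replaced by probabilities over clusters of sites.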
With neighbour dependencies: