机器学习_生物计算
合集下载
相关主题
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
23
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 1 0 -1 -2 -3 -4 C -2 0 -1 1 0 -1 -2 C -3 -1 -2 0 2 1 0 A -4 -2 -3 -1 1 0 -1 T -5 -3 -4 -2 0 2 1 T -6 -4 -5 -3 -1 1 3
i
j
P(x, y | M ) pxi yi
i
Score for alignment:
S s(xi , yi )
i
where: s(a, b) log( pa,b ) qa qb
Can be computed for each pair
of letters
16
Scoring Alignments
10
Sequence analysis techniques
• A major area of research within computational biology.
• Initially, based on deterministic or heuristic alignment methods
• Alignments can be scored by comparing the resulting alignment to a background (random) model.
In oInthdeeprewndoerndts, we are trying to find Raenlaatelidgnment
AG CATT
15
Scoring Alignments
• Alignments can be scored by comparing the resulting alignment to a background (random) model.
Independent
Related
P(x, y | I ) qxi qxj
Can be measured using microarrays
transcription
mRNA CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
Can be measured using
mass speห้องสมุดไป่ตู้trometry
4
5
FDA Approves Gene-Based Breast Cancer Test*
24
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
AGCC TT A CCATT
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 1 0 -1 -2 -3 -4 C -2 0 -1 1 0 -1 -2 C -3 -1 -2 0 2 1 0 A -4 -2 -3 -1 1 0 -1 T -5 -3 -4 -2 0 2 1 T -6 -4 -5 -3 -1 1 3
F(i,j) = max
F(i-1,j-1)+s(xi,xj) F(i-1,j)+d
F(i,j-1)+d
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 1 C -2 C -3 A -4 T -5 T -6
20
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
Substitutions Insertions Deletions
13
Pairwise sequence alignment
ACATTG AACATT
A CATTG
AGCCTT AGCATT
AA C A T T AGCCT T
AG CATT
14
Pairwise sequence alignment
• More recently, based on probabilistic inference methods
11
Sequence analysis
• Traditional - Dynamic programming
• Probabilsitic - Profile HMMs
12
Alignment: Possible reasons for differences
Hard Easier
9
Function from sequence homology
• We have a query gene: ACTGGTGTACCGAT • Given a database containing genes with known
function, our goal is to find similar genes from this database (possibly in another organism) • When we find such gene we predict the function of the query gene to be similar to the resulting database gene • Problems - How do we determine similarity?
F(i,j) = max
F(i-1,j-1)+s(xi,xj) F(i-1,j)+d
F(i,j-1)+d
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 1 0 C -2 0 C -3 A -4 T -5 T -6
21
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
F(i,j) = max
F(i-1,j-1)+s(xi,xj) F(i-1,j)+d
F(i,j-1)+d
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 1 0 -1 -2 -3 -4 C -2 0 -1 C -3 -1 A -4 -2 T -5 -3 T -6 -4
22
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 1 0 -1 -2 -3 -4 C -2 0 -1 1 0 -1 -2 C -3 -1 -2 0 2 1 0 A -4 -2 -3 -1 1 0 -1 T -5 -3 -4 -2 0 2 1 T -6 -4 -5 -3 -1 1 3
AGCCTT ACCATT
AGCC TT
• We cannot expect the alAignmenCts tCo bAe pTerfeTct. AGCCTT
•(inBAsueGtrwtCioeAnn,TedTeedlettoiodneoter rsmubinsetitwuhtioant i)s. the reason for the difference AGCCT T
i
where: s(a, b) log( pa,b ) qa qb
17
Computing optimal alignment: The Needham-Wuncsh algorithm
F(i,j) = max
F(i-1,j-1)+s(xi,xj) F(i-1,j)+d F(i,j-1)+d
d is a penalty for a gap
translation
Protein
PEPTIDE
2
Growth in biological data
Lu et al Bioinformatics 2009
3
Central dogma
Can be measured using sequencing techniques
DNA
CCTGAGCCAACTATTGATGAA
10-601 Machine Learning
Computational biology: Sequence alignment and profile HMMs
Central dogma
DNA
CCTGAGCCAACTATTGATGAA
transcription
mRNA CCUGAGCCAACUAUUGAUGAA
A G C CTT
A C C A T T
F(i-1,j-1)
F(i-1,j)
F(i,j-1)
F(i,j)
18
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 C -2 C -3 A -4 T -5 T -6
19
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
*Washington Post, 2/06/2007
Input – Output HMM For Data Integration
I
g
H
HH
H
0
1
2
3
O
O
O
O
0
1
2
3
Active Learning
8
Assigning function to proteins
• One of the main goals of molecular (and computational) biology.
25
Running time
• The running time of an alignment algorithms if O(n2) • This doesn’t sound too bad, or is it?
• The time requirement for doing global sequence alignment is too high in many cases. • Consider a database with tens of thousands of sequences. Looking through all these sequences for the best alignment is too time consuming. • In many cases, a much faster heuristic approach can achieve equally good results.
that maximizes the likelihood ratio of the aligned
Pp(xa,iry |cIo)m parqexidtoqxtjhe backgroPu(nxd, ym| Mod)el pxi yi
i
j
i
Score for alignment:
S s(xi , yi )
• There are 25000 human genes and the vast majority of their functions is still unknown
• Several ways to determine function
- Direct experiments (knockout, overexpression) - Interacting partners - 3D structures - Sequence homology
“ MammaPrint is a DNA microarray-based test that measures the activity of 70 genes in a sample of a woman's breast-cancer tumor and then uses a specific formula to determine whether the patient is deemed low risk or high risk for the spread of the cancer to another site.”
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 1 0 -1 -2 -3 -4 C -2 0 -1 1 0 -1 -2 C -3 -1 -2 0 2 1 0 A -4 -2 -3 -1 1 0 -1 T -5 -3 -4 -2 0 2 1 T -6 -4 -5 -3 -1 1 3
i
j
P(x, y | M ) pxi yi
i
Score for alignment:
S s(xi , yi )
i
where: s(a, b) log( pa,b ) qa qb
Can be computed for each pair
of letters
16
Scoring Alignments
10
Sequence analysis techniques
• A major area of research within computational biology.
• Initially, based on deterministic or heuristic alignment methods
• Alignments can be scored by comparing the resulting alignment to a background (random) model.
In oInthdeeprewndoerndts, we are trying to find Raenlaatelidgnment
AG CATT
15
Scoring Alignments
• Alignments can be scored by comparing the resulting alignment to a background (random) model.
Independent
Related
P(x, y | I ) qxi qxj
Can be measured using microarrays
transcription
mRNA CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
Can be measured using
mass speห้องสมุดไป่ตู้trometry
4
5
FDA Approves Gene-Based Breast Cancer Test*
24
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
AGCC TT A CCATT
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 1 0 -1 -2 -3 -4 C -2 0 -1 1 0 -1 -2 C -3 -1 -2 0 2 1 0 A -4 -2 -3 -1 1 0 -1 T -5 -3 -4 -2 0 2 1 T -6 -4 -5 -3 -1 1 3
F(i,j) = max
F(i-1,j-1)+s(xi,xj) F(i-1,j)+d
F(i,j-1)+d
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 1 C -2 C -3 A -4 T -5 T -6
20
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
Substitutions Insertions Deletions
13
Pairwise sequence alignment
ACATTG AACATT
A CATTG
AGCCTT AGCATT
AA C A T T AGCCT T
AG CATT
14
Pairwise sequence alignment
• More recently, based on probabilistic inference methods
11
Sequence analysis
• Traditional - Dynamic programming
• Probabilsitic - Profile HMMs
12
Alignment: Possible reasons for differences
Hard Easier
9
Function from sequence homology
• We have a query gene: ACTGGTGTACCGAT • Given a database containing genes with known
function, our goal is to find similar genes from this database (possibly in another organism) • When we find such gene we predict the function of the query gene to be similar to the resulting database gene • Problems - How do we determine similarity?
F(i,j) = max
F(i-1,j-1)+s(xi,xj) F(i-1,j)+d
F(i,j-1)+d
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 1 0 C -2 0 C -3 A -4 T -5 T -6
21
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
F(i,j) = max
F(i-1,j-1)+s(xi,xj) F(i-1,j)+d
F(i,j-1)+d
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 1 0 -1 -2 -3 -4 C -2 0 -1 C -3 -1 A -4 -2 T -5 -3 T -6 -4
22
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 1 0 -1 -2 -3 -4 C -2 0 -1 1 0 -1 -2 C -3 -1 -2 0 2 1 0 A -4 -2 -3 -1 1 0 -1 T -5 -3 -4 -2 0 2 1 T -6 -4 -5 -3 -1 1 3
AGCCTT ACCATT
AGCC TT
• We cannot expect the alAignmenCts tCo bAe pTerfeTct. AGCCTT
•(inBAsueGtrwtCioeAnn,TedTeedlettoiodneoter rsmubinsetitwuhtioant i)s. the reason for the difference AGCCT T
i
where: s(a, b) log( pa,b ) qa qb
17
Computing optimal alignment: The Needham-Wuncsh algorithm
F(i,j) = max
F(i-1,j-1)+s(xi,xj) F(i-1,j)+d F(i,j-1)+d
d is a penalty for a gap
translation
Protein
PEPTIDE
2
Growth in biological data
Lu et al Bioinformatics 2009
3
Central dogma
Can be measured using sequencing techniques
DNA
CCTGAGCCAACTATTGATGAA
10-601 Machine Learning
Computational biology: Sequence alignment and profile HMMs
Central dogma
DNA
CCTGAGCCAACTATTGATGAA
transcription
mRNA CCUGAGCCAACUAUUGAUGAA
A G C CTT
A C C A T T
F(i-1,j-1)
F(i-1,j)
F(i,j-1)
F(i,j)
18
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
AGCCT T 0 -1 -2 -3 -4 -5 -6 A -1 C -2 C -3 A -4 T -5 T -6
19
Example
Assume a simple model where S(a,b) = 1 if a=b and -5 otherwise. Also, assume that d = -1
*Washington Post, 2/06/2007
Input – Output HMM For Data Integration
I
g
H
HH
H
0
1
2
3
O
O
O
O
0
1
2
3
Active Learning
8
Assigning function to proteins
• One of the main goals of molecular (and computational) biology.
25
Running time
• The running time of an alignment algorithms if O(n2) • This doesn’t sound too bad, or is it?
• The time requirement for doing global sequence alignment is too high in many cases. • Consider a database with tens of thousands of sequences. Looking through all these sequences for the best alignment is too time consuming. • In many cases, a much faster heuristic approach can achieve equally good results.
that maximizes the likelihood ratio of the aligned
Pp(xa,iry |cIo)m parqexidtoqxtjhe backgroPu(nxd, ym| Mod)el pxi yi
i
j
i
Score for alignment:
S s(xi , yi )
• There are 25000 human genes and the vast majority of their functions is still unknown
• Several ways to determine function
- Direct experiments (knockout, overexpression) - Interacting partners - 3D structures - Sequence homology
“ MammaPrint is a DNA microarray-based test that measures the activity of 70 genes in a sample of a woman's breast-cancer tumor and then uses a specific formula to determine whether the patient is deemed low risk or high risk for the spread of the cancer to another site.”