Multiple Sequence Alignment as a Facility Location Problem 1
Part 5 - Multiple sequence alignment and phylogeny
Part 7 - Multiple sequence alignment & Phylogenetic
Multipቤተ መጻሕፍቲ ባይዱe sequence alignment
• One amino acid sequence plays a coy; pair of homologous sequences whisper; many aligned sequences shout out loud. • How does multiple sequence alignment make the information regarding protein fold more intelligible and useful. • To be informative, a multiple alignment should contain a distribution of closely and distantly related sequences.
PSI-BLAST
• Search databank for sequences similar to a query sequence.
What information does multiple alignment provide?
• Multiple alignment is more informative about functional conservation. • Two amino acids may be aligned between two protein sequences, but it may just come down to chance. • If a residue is conserved throughout a family of sequences that are diverse, it might actually play a key structural or functional role.
MultipleSequenceAlignments
• Optimum alignment can be found by dynamic programming
– Extension of Needleman-Wunsch
• Spaces are only added to msa – never removed
– Once a gap, always a gap
1 2 3 4 5
K .75 .25 .75 L .75 .75 M .25 .25 .50 .25 .25 .25 .25
K 1
K -
L 2
3
L 4
M 5
Column 1 score: 0.75 s(K,K) + 0.25 s(K,M)
Profile-to-sequence alignments
Multiple Sequence Alignments
Profiles and Progressive Alignment
Profiles for families of sequences can be built from MSAs
1 1 C 2 G 3 — A C 25% 2 0% 3 0% 0% 50% 75% 25%
• To compare a sequence to a profile, need to assign a score for each amino acid • The score the profile for amino acid a at position p is 20 M ( p, a ) f (p, b ) s (a , b )
multiple sequence alignment 序列
multiple sequence alignment 序列什么是多序列比对(multiple sequence alignment)?多序列比对是一种在生物信息学中常用的方法,旨在将多个相关的生物序列进行比较和对齐。
这些序列可以是DNA、RNA或蛋白质序列,它们可能来自不同物种、同一物种的不同亚种或同一家族中的不同成员。
多序列比对用于发现序列之间的相似性和差异性,从而揭示它们之间的功能和进化关系。
通过将多个序列对齐,我们可以识别出保守区域和变异区域,并从中推断出序列的共同祖先。
为什么要进行多序列比对?多序列比对在许多生物学研究领域中都是非常重要的工具。
首先,它可以帮助我们理解复杂的生物过程,比如蛋白质结构与功能之间的关系。
在多序列比对中,我们可以观察到在保守区域中存在相同的氨基酸或核苷酸,这暗示了它们在结构和功能上的重要性。
其次,多序列比对还可以帮助我们预测新序列的功能。
如果一个新的序列与已知的序列具有高度相似的区域,那么我们可以合理地假设它们在功能上可能是相似的。
还有,多序列比对对于生物进化研究也是至关重要的。
通过比较不同物种的序列,我们可以跟踪进化过程中的变化,并推断出它们的共同祖先。
多序列比对的方法实现多序列比对的方法有许多,其中最常用的方法是基于动态规划的方法,例如Clustal系列软件,如ClustalW和Clustal Omega。
这些算法通过优化一个得分函数,尽量使序列在各个位置上对齐。
动态规划算法的基本原理是通过计算一个得分矩阵,并利用矩阵中的值来选择最佳的序列对齐方式。
得分矩阵中的每个元素代表了相应位置上的比对得分,得分越高表示对齐得越好。
在进行序列比对时,动态规划算法考虑了多个因素,如序列的相似性分数、罚分矩阵(用于惩罚不同类型的差异)和间隙的惩罚分数(用于对齐中的间隙进行惩罚)。
通过调整这些参数,我们可以在比对过程中进行不同类型的优化。
此外,还有一些其他的多序列比对算法,如T-Coffee、MAFFT和MUSCLE 等,它们使用了不同的策略来解决比对问题。
Multiple sequence alignment
Page 320
Proportion of structurally superposable residues in pairwise alignments as a function of sequence identity
Proportion of residues in common core
Hale Waihona Puke Fig. 10.2 Page 323
Progressive MSA stage 2 of 3: generate guide tree
( ( gi|5803139|ref|NP_006735.1|:0.04284, ( gi|6174963|sp|Q00724|RETB_MOUS:0.00075, gi|132407|sp|P04916|RETB_RAT:0.00423) five closely :0.10542) related lipocalins :0.01900, gi|89271|pir||A39486:0.01924, gi|132403|sp|P18902|RETB_BOVIN:0.01902);
0.75
0.5
0.25
Globin Cytochrome c Serine protease Immunoglobulin domain
100
75
50
25
0
Sequence identity (%)
After Chothia & Lesk (1986)
Multiple sequence alignment: features
Multiple sequence alignment
Monday, October 16, 2006
Introduction to Bioinformatics J. Pevsner pevsner@
Lecture5 multiple sequence alignment
(1)标定1的各列,如果s1在比对中对应位置的编辑操作不是插入或删 除,则这些列分别标记为s1对应位置上的字符a1、a2、…、als1(ls1为序列 s1的长度); (2)标定2的各列,如果t1在比对中对应的位置编辑操作不是插入或删 除,则这些列分别标记为t1对应位置上的字符b1、b2、…、blt1(lt1为序列t1 的长度); (3)对a1、a2、…、als1和b1、b2、…、blt1进行比对; (4)在所得到的比对中,对于1、2和中原来有插入或删除操作的位置, 恢复其原有的实际字符或空位字符“-”。
sc=s1
S1-S2
S1-S3
S1-S4
S1-S5
ATTGCCATT ATGGCCATT
ATTGCCATT-ATC-CAATTTT
ATTGCCATT ATCTTC-TT
ATTGCCATT ACTGACC--
ATTGCCATT-ATGGCCATT-ATC-CAATTTT ATCTTC-TT-ACTGACC---18
16
How to find the core sequence?
Multiple sequence alignment
(2) Directly find the sequence that is the most similar to all the rest Find Sc such that
4
Multiple sequence alignment
5
2、SP(Sum-of-Pairs)model
Multiple sequence alignment
Evaluation of the results of multiple sequence alignment
04-Multiple sequence alignment(生物信息学国外教程2010版)
Page 179
Multiple sequence alignment: outline
[1] Introduction to MSA Exact methods Progressive (ClustalW) Iterative (MUSCLE) Consistency (ProbCons) Structure-based (Expresso) Conclusions: benchmarking studies [2] Hidden Markov models (HMMs), Pfam and CDD
Multiple sequence alignment: outline
[1] Introduction to MSA Exact methods Progressive (ClustalW) Iterative (MUSCLE) Consistency (ProbCons) Structure-based (Expresso) Conclusions: benchmarking studies [2] Hidden Markov models (HMMs), Pfam and CDD
Page 185
Use ClustalW to do a progressive MSA
http://www.ebi. /clustalw/
Page 186
Feng-Doolittle MSA occurs in 3 stages
[1] Do a set of global pairwise alignments (Needleman and Wunsch’s dynamic programming algorithm) [2] Create a guide tree
This insertion could be due to alternative splicing
Multiple Sequence Alignment
• si,j,k = max
si-1,j-1,k-1 + δ(vi, wj, uk) si-1,j-1,k + δ (vi, wj, _ ) si-1,j,k-1 + δ (vi, _, uk) si,j-1,k-1 + δ (_, wj, uk) si-1,j,k + δ (vi, _ , _) si,j-1,k + δ (_, wj, _) si,j,k-1 + δ (_, _, uk)
• Progressive alignment is a variation of greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments. • Use profiles to compare sequences • Gaps in consensus string are permanent
Step 2: Guide Tree (cont’d)
v1 v2 v3 v4 v1 v2 v3 v4 .17 .87 .28 .59 .33 .62 v1 v3 v4 v2
04-Multiple sequence alignment(生物信息学国外教程2010版)
Page 185
Use ClustalW to do a progressive MSA
http://www.ebi. /clustalw/
Page 186
Feng-Doolittle MSA occurs in 3 stages
[1] Do a set of global pairwise alignments (Needleman and Wunsch’s dynamic programming algorithm) [2] Create a guide tree
[3] MEGA to make a multiple sequence alignment
[4] Multiple alignment of genomic DNA
Multiple sequence alignment: definition
• a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned • homologous residues are aligned in columns across the length of the sequences • residues are homologous in an evolutionary sense • residues are homologous in a structural sense
Page 185
Multiple sequence alignment: methods
Example of MSA using ClustalW: two data sets Five distantly related globins (human to plant)
基因序列分析及资料库的使用
基因序列分析及資料庫的使用陳福旗屏東科技大學農園系基因序列之分析可以從其核苷酸序列或氨基酸序列著手,序列分析之目的有下列各點:1.與資料庫的基因比對(BLAST),確認所選殖到的基因為試驗之標的基因。
2.將資料庫中相似的基因序列複製出來,以比對試驗標的基因與基因庫中同一類基因之相似度(多序列排比,multiple alignment),並可獲得親緣關係之樹狀圖(dendrogram)。
3.由多序列排比,找出具有相同序列(保守性) 的區域,據以設計退化因子(degenerate primers),進行聚合酶連鎖反應(PCR),選殖基因。
4.比較兩個DNA(基因)之核苷酸序列具差異之處。
多序列排比,multiple alignmentUpload multiple sequences in Fasta format, which consists of a one-line header starting with a ">" symbol, followed by the sequence name/ description,例如,如果要比對花青素合成基因dihyroflavonol 4-reductase,可以到資料庫中,把不同物種相同基因之編碼及其氨基酸序列以下列方式列出,即為所謂Fasta format (先標出一個大於符號,其後將基因編碼列出,換行後,將氨基酸序列貼上去:>AAU93766 MENEKKGPVVVTGASGYVGSWLVMKLLKKGYVVRATVRDPTNLTKVKPLLDLPRSNELLSIWKADLDDV EGSFDEVIRGSIGVFHVATPMNFQSKDPENEVIKPAINGLLGILRSCKKAGSVQRVIFTSSAGTVNVEE HQAAVYDESCWSDLDFVNRVKMTGWMYFLSKTLAEKAAWEFVKDNDIHLITIIPTLVVGSFITSEMPPS MITALSLITGNDAHYSILKQVQFVHLDDVCDAHIFLFEHPKANGRYICSSYDSTIYGLAEMLKNRYPTY VIPQKFKEIDPDIKCVSFSSKKLLELGFKYKYSMEEMFDDAINTCRDKKLIPLNTDQE IVLAAEKFE EVKEQIAVK> AAB62873MENEKKGPVVVTGASGYVGSWLVMKLLQKGYDVRATIRDPTNLEKVKPLLDLPRSNELLSIWKADLNDI EGSFDEVIRGCVGVFHVATPMNFQSKDPENEVIKPAINGLLGILTSCKKAGSVKRVIFTSSAGTVNVEE HQAAVYDENSWSDLHFVTRVKMTGWMYFVSKTLAEKAAWEFVKENAIHFIAIIPTLVVGSFITNEMPPS LITALSLISGNEAHYSILKQAQFVHLDDLCDAHIFVYEHPEANGRYICSSHDSTIYDLANMLKNRYATY AIPQKFKEIDPNIKSVSFSSKKLMDLGFKYKYTIEEMFDDAIKTCRDKNLMPLNTEEL VLAAEKYD EVKEQIAVK>AAC17843 METERKGPVVVTGASGYVGSWLVMKLLQKGYEVRAAVRDSTNFEKVKPLLDLPGSNELLSIWKADLNDI DETFDEVTRGSVGLFHVATPMNFQSEDPENEVIKPTISGLLGILRSCKRVGTVKRVIFTSSAGTVNVEE HQATVYDESSWSDLDFVTRVKMTGWMYFVSKTLAEKAAWEFVSDNDIHFITIIPTLVVGSFLISRMPPS LITALSLITGNEAHYSILRQAQFVHLDDLCDAHIFLFEHHKANGRYICSSHDSTIYSLAKMLKNRYATY DIPLKFKEIDPNIESVSFSSKKLLDLGFKYKYKYTMEEMFDDAIKTCRDKNLIPLHTE EMVSANEK FDEVKEQIAVK>AAQ83576 MENVKGPVVVTGASGYVGSWLVMKLLQYGYAIRATVRDPRDLRKTKPLLDIPGADERLTIWKADLSEDA SFDEAINGCTGVYHVATPMDFDSKDPENEVIQPTINGVLGIMKSCKKAGTVKRVIFTSSAGTVNVQENQ MPEYDESSWSDVDFCRRVKMTGWMYFVSKTLAEKAAWEFAKENDIQLISIIPTLVVGPFITSTMPPSML TALSLITGNEAHYSILKQIQLVHLDDVCKAHIFLFENPEASGRYICSSYDATIYDLARKIKDRYPQYAI PQKFEGIDDQIKPVHFSSKKLMDLGFKYQYTFEEMFDEGIRSCIEKKLIPHQTQERYY VHDELDLG CSKMTNDKLDLGGSKLNSMDEMVRGHNEQVSVALQ>AAO63025 MMKEIGAAGGAVVVTGAGGYVGSWLVMKLLHYGYTVRATLRDSSDEAKTKPLLELPGADTRLSLWEADL LQDGSFDHVISGSIAVFHVATPMDFDSIDPENEVIKPAVNGMLSIMKSCKKAGTVKRVIFTSSAGTVNV EEHQKPEYDENSWSDIDFCRRVKMTGWMYFVSKSLAEKAAWEFAKANGIDLVTIIPTLVVGAFITTAMP PSMITALSLITGNEAHYSIIKQAQLVHLDDLCEAHILLLNHPKAEGRYICSSHDVTIYDMAKMIRQNYP QYYIPQQFEGIDKGIQPVRFSSKKLVDLGFRYKYSMESMFDEAIKTCVERKFIPLQTA VELQLKPYE LLEHNNKNGVVTNTIKIVGQMVNTKAMITEHEENEPIATH要進行比對時,將這些序列選取及複製,貼到多序比對之網站EMBL 之ClustalW (/clustalw/):比對結果會顯示多序列排列,相同的氨基酸會以星號在最下面一行顯示,亦可用彩色方式標出,相同的氨基酸會顯示相同顏色。
multiple_align_03_3_English_2013
Y
Sequence 2
W ? Optimal score in 3 dimensions
Alignment of three sequences by dynamic programming
• For three protein sequences each 300 amino acids
in length and excluding gaps, the number of comparisons to be made by dynamic programming is equal to 3003 = 2.7 ×107, whereas only 3002 = 9 ×104 is required for two sequences of this length.
(The number of steps and memory required for N M-amino-acid sequences: MN)
• Carrillo and Lipman (1988) found a way (the sum
of pairs, SP method, the MSA program) to reduce the number of comparisons that must be made without compromising the attempt to find an optimal alignment.
AGGCTT AAGCTA AGACTT AAACTA
1 2 3 4
The flowchart of multisequence alignment
Challenges
• Finding an optimal alignment of more than
Multiple sequence alignment
Steven Prestwich Des Higgins and Orla O’Sullivan s.prestwich@cs.ucc.ie d.higgins@ucc.ieCork Constraint Computation Centre ojos@student.ucc.ieDepartment of Computer Science Department of BiochemistryUniversity College,Cork,IrelandAbstractMultiple sequence alignment is a central problem in Bioinformatics and a challenging one for optimisation algorithms.An established integer programming approach is to applybranch-and-cut to a graph-theoretical model.The models are exponentially large but arerepresented intensionally,and violated constraints can be located in polynomial time.Thisreport describes a new integer program formulation that generates polynomial-sized modelssmall enough to be passed to generic solvers.It is a hybrid formulation relating the sparsealignment graph with a compact encoding of the alignment matrix via channelling constraints.Alignments obtained with a pseudo-Boolean local search algorithm are competitive with thoseof state-of-the-art algorithms.Execution times are much longer,but in future work we aim todevelop a more efficient specialised algorithm.1IntroductionMultiple sequence alignment(MSA)is a central problem in Bioinformatics and is known to be NP-complete[15].Given a number of sequences of symbols from an alphabet,the aim is to align them while maximizing some function.Gaps may be introduced between symbols,and in some MSA formulations the objective function includes a measure of the number and length of gaps.A common data structure is the alignment matrix which contains one sequence per row, including gaps;aligned symbols occur in the same column.Numerous heuristic methods have been proposed for multiple alignment,of which by far the most widely used is so-called progressive alignment.This involves clustering the sequences first to give a guide tree and then building up the alignment gradually,following the branching order in the guide tree.This is very fast even for hundreds of sequences,and the most widely used software is the well-known ClustalW package[28].The T-Coffee package[20]also uses a progressive heuristic but has been shown to be more accurate than ClustalW,at the expense ofsequence 1sequence 2sequence 3Figure1:An alignment graph with small cyclesextra computing time.There are also several methods based on optimising the WSP(weighted sums of pairs)objective function which use Genetic Algorithms[19]or iteration[11].These vary in the extent to which they are practical for more than a few sequences or in the quality of the optimisation.Dynamic programming[18]has been used for MSA problems but is known to scale poorly to more than a few sequences.More successful is the Complete Maximum Weight Trace(CMWT) formulation in which the symbols are viewed as vertices in an alignment graph(actually an extended graph that includes edges between adjacent symbols in each sequence). Each vertex is a position in sequence,which we shall denote by.Each edge connects two vertices from different sequences:we shall denote an edge between and by where by convention.Each edge has a weight representing the usefulness of aligning its two symbols.An alignment realises an edge if it aligns its two symbols,and the aim is to maximize the sum of the weights of the realised edges.The set of realised edges is a trace if certain constraints are satisfied,and an alignment matrix can always be constructed from a trace.The CMWT generates large models which can be reduced by using the Sparse Maximum Weight Trace(MWT)formulation.This restricts attention to a carefully chosen subgraph,defining only those edges that are used in pairwise alignments of high quality.Besides reducing the size of the models,the MWT provides an opportunity to input biological knowledge via the choice of subgraph.The usual way of ensuring that the realised edges form a valid trace is to enumerate all critical mixed cycles in the graph,adding a constraint to prohibit each cycle [1,16,24](other constraints may also be added).Figure1shows examples of cycles that prevent the trace from representing a alignment matrix.The MWT and related formulations have natural integer linear program(ILP)models.The number of constraints is exponential in the size of the graph[1]but these are not passed en masse to a solver.Instead a branch-and-cut approach is used,generating violated constraints as required in order to derive cutting planes.Generating the relevant constraints is known as the separation problem and can be done in polynomial time.In this report we explore an alternative ILP approach to the MSA.Instead of accessing an exponentially large model in polynomial time,we use a model that is only polynomial in size and can be passed to a generic ILP solver.To avoid the use of cycle constraints we model the alignment matrix directly,and relate it to the sparse alignment graph by channelling constraints [5].The Constraint Programming literature contains many examples of problems in which extra variables are introduced to facilitate the expression of certain constraints,which must be related to the original variables via channelling constraints.The motivation is that some constraintscan be more easily or economically expressed on different sets of variables.Introducing extra variables is not necessarily counter-productive,and hybrid models sometimes give better results than those with fewer variables.Given such a model,we then face the choice of how tofind solutions.Instead of applying branch-and-cut we transform the model to pseudo-Boolean form and pass it to a local search algorithm,an approach that has given good results on a variety of combinatorial problems.We also exploit modelling techniques aimed specifically at local search.The report is organised as follows.Section2describes the model,Section3describes the techniques used to solve the problem,Section4evaluates the usefulness of the model on exam-ples,and Section5draws conclusions and discusses future work.A condensed version of this report in published in[23].2A hybrid ILP modelThis section describes the new ILP model of the MSA,drawing on techniques from the Con-straint,Satisfiability and Integer Programming literatures.Section2.1describes an initial model, Section2.2discusses symmetry breaking,Section2.3adds implied constraints to improve con-straint propagation,and Section2.4discusses properties of the model.2.1Initial modelAs in the MWT define a binary variable for each edge.The problem is then:maximizesubject to constraints ensuring the construction of an alignment matrix.The matrix has rows (one per sequence)and columns where max and is the length of sequence.We allow each sequence position to be placed anywhere in the corresponding row of the matrix, subject to constraints.A matrix entry with no sequence position placed in it implicitly contains a gap.The symbols are not explicitly modelled,only the way in which sequence positions are mapped to matrix columns:define an integer variable with domain for each se-quence position denoting its matrix position.We need only define for sequence positions that lie on an edge of the sparse alignment graph,but we shall assume that they all do.We require a0/1ILP model so we next replace the by0/1variables.To enable the mod-elling of long sequences we use a log encoding by modelling binary representations of the. Several researchers have used log encodings to model constraint satisfaction problems(CSPs)as SAT(propositional satisfiability)and found that both local and backtrack search performance is usually worse than on other encodings[8,9,12],and[30]has shown that propagation is weaker on log encodings.Nevertheless,this type of model seems worth trying because of its compact-ness.Let log so that the matrix columns can be represented using bits.Define0/1 variables where,and,such thatThat is is the bit of the binary representation of.The initial model has two sets of constraints,expressed in0/1form by substituting the above formula for.To ensure that sequence positions are placed in the alignment matrix in an ordered way,add ordering constraintswhere and.To relate the and variables add channelling constraintswhere.Each channelling constraint can be implemented by two linear constraintsThat is,if then is both and the sum of.If then both constraints are trivially true.Note that the channelling constraints only work one way,so that .The reason is discussed in Section2.2.2.2Symmetry breakingMany combinatorial problems exhibit symmetries,for example the N-queens problem has eight: each solution may be rotated and reflected.Removing symmetries from a problem can reduce its search space by a large factor,greatly reducing the time tofind a solution and to prove it optimal. Probably the most popular approach is to add constraints to the model,so that each equivalence class of symmetric solutions to the original problem corresponds to a single solution in the new problem.An obvious symmetry in our model involves gaps in the alignment matrix.Any MSA solution may have many symmetric solutions in which gaps are inserted between columns.These can be avoided by adding constraints to restrict the width of the alignment matrix.A less obvious form of symmetry involves the relationship between the and variables.An assignment implies that the sequence positions connected by edge are aligned in the matrix,but the reverse is not true:does not imply that the two sequence positions connected by are unaligned. Thus any solution in which two symbols on an edge are unaligned has two symmetric solutions: with and(unless the latter case would cause the total weight to be less than a lower bound).This symmetry can be removed by adding constraints to enforce the reverse implication, though this probably requires either a large number of linear constraints or the introduction of further variables.However,recent work[21]has shown that adding symmetry breaking constraints to a model is often counter-productive when a local search algorithm is to be applied.For local search performance the size of the search space appears to be less important than the number and dis-tribution of solutions in the space,and symmetry appears to be harmless.We therefore do not break symmetries in our model.2.3Implied constraintsThe initial model ensures a correct alignment,but it can be improved by adding implied con-straints—constraints that are implied by others already in the model.These have been shown to speed up both backtracking[27]and local search[4,14,21].2.3.1Stronger ordering constraintsA larger set of ordering constraints can replace those of the initial model:where and.Similar constraints were found to improve local search performance on SAT-encoded clique problems[21],and they also help on this model.2.3.2Stronger channelling constraintsThe channelling constraints can also be strengthened.Every integer has a unique binary repre-sentation,so if and only if where.Therefore the following constraints can be used:where and.Given an assignment a constraint propagation algorithm can then remove a value from the domain of any bit[]for which the corresponding bit[]is currently assigned.Under the weaker channelling constraints,domain values can only be pruned on the larger[]when several of the []have been assigned.2.3.3Positioning constraintsWe can restrict the range of each by adding constraintsfor and.In any correct alignment the will always lie within these ranges.In tests these constraints made little difference,perhaps because the stronger ordering constraints provide similar information.2.3.4Small cycle constraintsThough the MWT constraints are unnecessary in our model,we can use some of them as im-plied constraints.For example to prevent cycles involving2sequences we could add SAT-style constraints to enumerate pairs of crossing edges:where,,and either and or and.ILP allows a more compact representation:where and is the set of edges crossing(as defined above).These constraints were also found to have little effect.2.3.5Transitivity constraintsIn[1]transitivity constraints are used,based on the observation that if symbols A and B are aligned in the matrix,and if B and C are aligned,then so are A and C.It therefore seems reason-able to add transitivity constraints to our model:where,and. However,in our model transitivity on the is not implied because of the one-way nature of the channelling constraints:the assignment forces the symbols on edge to be aligned but does not force them to be unaligned.Thus the transitivity constraints are symmetry breaking constraints in our model,not implied constraints.The distinction is unimportant if backtracking is to be applied because both types of constraint are usually beneficial.But if local search is to be applied then the former are probably harmful and the latter beneficial.Interestingly,the following theorem holds.Theorem.In an optimal solution the transitivity constraints on the are implied.Proof.Suppose that in an optimal solution where symbols A and B occur on edge and symbols B and C on edge.Then we know that the pairs A,B and B,C are aligned in the matrix,and therefore so are A,C.This means that can be set to either0or1,where A and C are on edge.If the solution is less than optimal,contradicting the initial assumption. Hence and transitivity holds on the.QED.Such constraints could be called optimality-implied and to our knowledge they have not been described previously.As optimality is approached they undergo a transition from being sym-metry breaking constraints to implied constraints,so for local search we might expect them to switch from being harmful to helpful.This would be an interesting result,analogous to the fact that different models can be best forfinding an optimal solution and for proving it optimal by backtrack search[26].In experiments they seemed to be harmful but less so as alignment weight increased,and we do not use them in the model.2.4Discussion of the modelWe call this the Sparse Graph Log Matrix(SGLM)model,indicating its combination of the sparse alignment graph with a log encoding of the alignment matrix.It contains only the stronger ordering and channelling constraints from Section2.3.Both the SGLM and the MWT have variables;the SGLM has in addition log variables which is not prohibitive.A motivation for the SGLM was to avoid the exponential number of con-straints in the MWT so we analyse its space complexity.It has ordering constraints and log channelling constraints,where is the matrix width,is the number of se-quences and is the sequence length(we assume that the sequences are of similar length). In practice[24]and for some constant,so the model has log constraints.Perhaps a better measure of model size is the total number of literals,that is where ranges over the set of constraints and is the number of literals in constraint.The SGLM has ordering constraints of size log and log channelling constraints of size3,giving a total model size of log.Typically so this reduces to log literals.Even if the transitivity and2-cycle constraints were found to be useful,they greatly increase the model size and are best avoided:there are transitivity constraints, and though there are only2-cycle constraints they are of size,so they contain literals.It is fortunate that these constraints seem to contribute little to local search performance.However,the stronger ordering and channelling constraints do greatly improve performance,so we use them even though they increase the model size:the initial model with weaker ordering and channelling constraints has only log literals.3Solving the optimisation problemWe reduce an MSA optimisation problem to a series of CSPs,each with a cost constraintfor some integer lower bound.The CSPs have increasing values of,each being the weight of the previous solution,and the initial value of is set to0.We now describe the algorithm used to solve each CSP.3.1Pseudo-Boolean local searchILP contain constraints of the formwhere is one of the operators,the weights and the constant are(possibly negative)integers,and the variables have domain.Other integer values can be permitted for the but such problems can always be transformed into0/1form.0/1ILP models can in turn be transformed into linear pseudo-Boolean form,involving only positive and,and literals which are either variables or their negations.The interest of this form is that it is only a slight extension of SAT,and many SAT algorithms generalise easily to it.This makes sense when trying to solve problems with a strong logical content,which are sometimes better solved by SAT approaches than by cost-based methods.The well-known Davis-Logemann-Loveland-Putnam SAT backtracking algorithm[6]was extended to pseudo-Boolean problems in[2,3]and applied to integer programming benchmarks and randomly-generated problems.The Walksat local search algorithm for SAT[25]was generalised in[29]to WSAT(OIP)and applied to radar surveillance benchmarks and the Progressive Party problem.The Saturn hybrid local search algorithm was generalised in[22]and applied to hardware verification,block design and a sports scheduling problem.In each case good results were obtained.We apply Saturn to the SGLM.Saturn performs local search in a space of partial variable assignments that are consistent under constraint propagation.On SAT problems the propagation algorithm used is unit propagation which generalises naturally to pseudo-Boolean problems(see [2,3,22]for details).The aim is to combine the scalability of local search with the pruning ability of constraint-based algorithms.Like most local search algorithms,Saturn has a runtime noise parameter that must be tuned to each problem or problem class.For this work we modified Saturn slightly.Each CSP can in principle be solved independently;instead we use each solution as a starting point for the next CSP,by assigning as many variables as possible under a random variable ordering,without violating any constraints.This technique was not previously usedwith Saturn but greatly speeds up convergence on MSA instances.Best results were obtained by executing Saturn in a rapid-restart mode with variable noise.On failing tofind a solution of a given weight within a short time,it is restarted from the previous solution with slightly higher noise.Onfinding a solution the noise is reduced again.This strategy allows Saturn to make many small improvements in a short time,and on becoming trapped in a local minimum it increases its noise in order to escape.3.2Post-processing the matrixOn solving the optimisation problem an alignment matrix is constructed,then post-processed by applying simple hill-climbing and plateau traversal techniques.This improves the matrix weight significantly and is done in three phases:Randomly-chosen symbols are moved randomly one position to the left or right,where this is possible and does not decrease the score of the matrix;score is defined here as a lexicographical ordering on total weight then number of non-weighted symbol matches.Symbols are moved randomly one position to the left as many times as possible,where this does not decrease the score.If two adjacent columns and contain no symbols in the same row then the two columns can be merged by shifting the rightmost symbols one place to the left(this in-cludes the case of an empty column).This phase eliminates unnecessary gaps in the matrix without decreasing the weight,and possibly increasing it.If Saturn fails to solve a CSP in the allotted time,it is restarted from the best solution so far,after feeding back the matrix improvements into the solution.4ExperimentsWe take MSA instances from the HOMSTRAD[17]database of protein alignments.These are generated by considering the three-dimensional structure of sets of proteins and provide very ac-curate test cases.We measure the accuracy of our alignments by simply counting the percentage of columns in our alignment matrix that are identical with the test cases.For each test case we generate the sparse alignment graph using T-Coffee with default settings.This takes every pair of sequences and outputs weighted pairs of symbols.These are produced by aligning each pair of sequences using dynamic programming and recording all of the pairs of aligned symbols.The weights are simply the percent identity of the parent sequences for each pair.Wefirst applied Saturn to four fairly small problems:ChtBD is a family of chitin binding domains which are structural proteins in plant cells,hla consists of a group of histocompatibility proteins involved in the immune system,TIG contains a group of glucanotransferases which are involved in metabolism,and ch is a family of calponin homology domain proteins which are involved in actin binding in the cell.These have between4and6similar sequences of between 43and178residues,apart from ch which has4very dissimilar sequences.In each case Saturn finds the same solutions as ClustalW and T-Coffee in a few minutes,except for ch where itfinds a less optimal solution:only57.9%correct,while ClustalW’s solution is61.7%correct.Next we consider two larger problems,starting with the mmp problem,a family of matrix metalloproteineases which are important proteins in the cytoskeleton of the cell.This has6se-quences,55%identical to each other,and an average length of164residues.The SGLM modelhas6723variables,7872variables,80311ordering constraints and107569channelling constraints.The alignment found by Saturn is shown in Figure2.It is95.5%correct,equalling that found by T-Coffee,while ClustalW’s alignment is92.4%correct.Secondly we take the oxi-doredadvances[7].AcknowledgmentThe Cork Constraint Computation Centre is supported by Science Foundation Ireland.Orla O’Sullivan was paid from a Basic Research grant from Enterprise Ireland to D.Higgins. References[1]E.Althaus,A.Caprara,H.-P.Lenhof,K.Reinert.Multiple Sequence Alignment With Arbi-trary Gap Costs:Computing an Optimal Solution Using Polyhedral Combinatorics.Bioin-formatics Suppl2:S4–S16.[2]F.A.Aloul,A.Ramani,I.Markov,K.Sakallah.PBS:a Backtrack-Search Pseudo-BooleanSolver and Optimizer.Fifth International Symposium on Theory and Applications of Satis-fiability Testing,Cincinnati,Ohio,USA,2002,pp.346–353.[3]P.Barth.A Davis-Putnam Based Enumeration Algorithm for Linear Pseudo-Boolean Opti-mization.Research Report mpi-i-95-2-003,Max-Plank Institut fur Informatik,Saarbrucken 1995.[4]B.Cha,K.Iwama.Adding New Clauses for Faster Local Search.Fourteenth National Con-ference on Artificial Intelligence,American Association for Artificial Intelligence1996,pp.332–337.[5]B.M.W.Cheng,K.M.F.Choi,J.H.M.Lee,J.C.K.Wu.Increasing Constraint Propaga-tion by Redundant Modeling:an Experience Report.Constraints vol.4,1999,pp.167–192.[6]M.Davis,G.Logemann,D.Loveland.A Machine Program for Theorem mu-nications of the ACM vol.5,1962,pp.394–397.[7]H.E.Dixon,M.L.Ginsberg,A.J.Parkes.Likely Near-term Advances in SAT Solvers.Workshop on Microprocessor Test and Verification,Austin,Texas,USA,2002.[8]A.Frisch,T.Peugniez.Solving Non-Boolean Satisfiability Problems with Stochastic Lo-cal Search.Seventeenth International Joint Conference on Artificial Intelligence,Seattle, Washington,USA,2001.[9]A.van Gelder.Another Look at Graph Coloring via Propositional Satisfiputa-tional Symposium on Graph Coloring and its Generalizations,Ithaca,NY,2002,pp.48–54.[10]M.L.Ginsberg,A.J.Parkes.Satisfiability Algorithms and Finite Quantification.Sev-enth International Conference on Principles of Knowledge Representation and Reasoning, Breckenridge,Colorado,USA,2000.[11]O.Gotoh.Significant Improvement in Accuracy of Multiple Protein Sequence Alignmentsby Iterative Refinement as Assessed by Reference to Structural Alignments.Journal of Molecular Biology vol.264,1996,pp.823–838.[12]H.Hoos.SAT-Encodings,Search Space Structure,and Local Search Performance.Six-teenth International Joint Conference on Artificial Intelligence,Morgan Kaufmann,1999, pp.296–302.[13]D.Joslin,A.Roy.Exploiting Symmetry in Lifted CSPs.Proceedings of the FourteenthNational Conference on Artificial Intelligence,American Association for Artificial Intelli-gence1997,pp.197–203.[14]K.Kask,R.Dechter.GSAT and Local Consistency.Fourteenth International Joint Confer-ence on Artificial Intelligence,Morgan Kaufmann1995,pp.616–622.[15]J.D.Kececioglu.Exact and Approximation Algorithms for DNA Sequence Reconstruction.PhD thesis,University of Arizona,1991.[16]J.D.Kececioglu,H.-P.Lenhof,K.Mehlhorn,P.Mutzel,K.Reinert,M.Vingron.A Polyhe-dral Approach to Sequence Alignment Problems.Discrete Applied Mathematics vol.104, 2000,pp.143–186.[17]K.Mizuguchi,C.M.Deane,T.L.Blundell,J.P.Overington.HOMSTRAD:A Database ofProtein Structure Alignments for Homologous Families.Protein Science vol.7,1998,pp.2469–2471.[18]S.B.Needleman,C.D.Wunsch.A General Method Applicable to the Search of Similaritiesin the Amino Acid Sequences of Two Proteins.Journal of Molecular Biology vol.48,1970, pp.443–453.[19]C.Notredame,D.G.Higgins.SAGA:Sequence Alignment by Genetic Algorithm.NucleicAcids Research vol.2,1996,pp.1515–1524.[20]C.Notredame,D.G.Higgins,J.Heringa.T-COFFEE:A Novel Method for Fast and Ac-curate Multiple Sequence Alignment.Journal of Molecular Biology vol.302,2000,pp.205–217.[21]S.D.Prestwich.Negative Effects of Modeling Techniques on Search Performance.Annalsof Operations Research vol.18,Kluwer Academic Publishers,2003,pp.137–150. [22]S.D.Prestwich.Randomised Backtracking for Linear Pseudo-Boolean Constraint Prob-lems.Fourth International Workshop on Integration of AI and OR techniques in Constraint Programming for Combinatorial Optimisation Problems,le Croisic,France,2002,pp.7–20.[23]S.D.Prestwich,D.Higgins,O.O’Sullivan.A SAT-Based Approach to Multiple SequenceAlignment.Poster,Ninth International Conference on Principles and Practice of Con-straint Programming,Kinsale,Ireland,2003(to appear).[24]K.Reinert,H.-P.Lenhof,P.Mutzel,K.Mehlhorn,J.Kececioglu.A Branch-and-Cut Algo-rithm for Multiple Sequence Alignment.First Annual International Conference on Compu-tational Molecular Biology,1997,pp.241–249.[25]B.Selman,H.Kautz,B.Cohen.Noise Strategies for Improving Local Search.TwelfthNational Conference on Artificial Intelligence,AAAI Press1994,pp.337–343.[26]B.M.Smith,K.E.Petrie,I.P.Gent.Models and Symmetry Breaking for Peaceable Armiesof Queens.Workshop on Modelling and Solving Problems with Constraints,ECAI,2002.[27]B.Smith,K.Stergiou,ing Auxiliary Variables and Implied Constraints toModel Non-Binary Problems.Seventeenth National Conference on Artificial Intelligence and Twelfth Innovative Applications of AI Conference,2000,pp.182–187.[28]J.D.Thompson,D.G.Higgins,T.J.Gibson.CLUSTAL W:Improving the Sensitivity ofProgressive Multiple Sequence Alignment Through Sequence Weighting,Position-Specific Gap Penalties and Weight Matrix Choice.Nucleic Acids Research vol.22,1994,pp.4673–80.[29]J.P.Walser.Solving Linear Pseudo-Boolean Constraints With Local Search.Eleventh Con-ference on Artificial Intelligence,1997,pp.269–274.[30]T.Walsh.SAT v CSP.Sixth International Conference on Principles and Practice of Con-straint Programming,Lecture Notes in Computer Science vol.1894,Springer-Verlag2000, pp.441–456.1mnc------GPKWERTNLTYRIRNYTPQLSEAEVERAIKDAFELWSVASPLIFTGISQ-----1hfc-------PRWEQTHLTYRIENYTPDLPRADVDHAIEKAFQLWSNVTPLTFTKVSE-----1bqoa YSLFPNSPKWTSKVVTYRIVSYTRDLPHITVDRLVSKALNMWGKEIPLHFRKVVW-----456ca FRTFPGIPKWRKTHLTYRIVNYTPDLPKDAVDSAVEKALKVWEEVTPLTFSRLYE-----1mmpa------TLKWSKMNLTYRIVNYTPDMTHSEVEKAFKKAFKVWSDVTPLNFTRLHD-----1bqqm----IQGLKWQHNEITFCIQNYTPKVGEYATYEAIRKAFRVWESATPLRFREVPYAYIRE1mnc G---EADINIAFYQRDHGDGSPFDGPNGILAHAFQPGQGIGGDAHFDAEETWTNTSA---1hfc G---QADIMISFVRGDHRDNSPFDGPGGNLAHAFQPGPGIGGDAHFDEDERWTNNFR---1bqoa G---TADIMIGFARGAHGDSYPFDGPGNTLAHAFAPGTGLGGDAHFDEDERWTDGSS-L-456ca G---EADIMISFAVREHGDFYPFDGPGNVLAHAYAPGPGINGDAHFDDDEQWTKDTT---1mmpa G---IADIMISFGIKEHGDFYPFDGPSGLLAHAFPPGPNYGGDAHFDDDETWTSSSK---1bqqm GHEKQADIMIFFAEGFHGDSTPFDGEGGFLAHAYFPGPNIGGDTHFDSAEPWTVRNEDLN1mnc NYNLFLVAAHEFGHSLGLAHSSDPGALMYPNYAF-RETSNYSLPQDDIDGIQAIYG----1hfc EYNLHRVAAHELGHSLGLSHSTDIGALMYPSYTFSGD---VQLAQDDIDGIQAIYGRS--1bqoa GINFLYAATHELGHSLGMGHSSDPNAVMYPTYGN-GDPQNFKLSQDDIKGIQKLYGK---456ca GTNLFLVAAHEIGHSLGLFHSANTEALMYPLYHSLTDLTRFRLSQDDINGIQSLYGPPPD 1mmpa GYNLFLVAAHEFGHSLGLDHSKDPGALMFPIYTY-TGKSHFMLPDDDVQGIQSLYGPG--1bqqm GNDIFLVAVHELGHALGLEHSSDPSAIMAPFYQW-MDTENFVLPDDDRRGIQQLYGGES-Figure2:Result for the mmp problem1frfs---KHRPSVVWLHNAECTGCTEAAIRTIKPYIDALILDTISLDYQETIMAAAGETSEAAL 2frva---KKRPSVVYLHNAECTGCSESVLRTVDPYVDELILDVISMDYHETLMAGAGHAVEEAL 1e3da----SRPSVVYLHAAECTGCSEALLRTYQPFIDTLILDTISLDYHETIMAAAGEAAEEAL 1h2rs LMGPRRPSVVYLHNAECTGCSESVLRAFEPYIDTLILDTLSLDYHETIMAAAGDAAEAAL 1cc1s----KKAPVIWVQGQGCTGCSVSLLNAVHPRIKEILLDVISLEFHPTVMASEGEMALAHM 1frfs HEALEGKDG-YYLVVEGGLPTIDGGQWGMVAG-------HPMIETCKKAAAKAKGIICIG 2frva HEAIKGD---FVCVIEGGIPMGDGGYWGKVGG-------RNMYDICAEVAPKAKAVIAIG 1e3da QAAVNGPDG-FICLVEGAIPTGMDNKYGYIAG-------HTMYDICKNILPKAKAVVSIG 1h2rs EQAVNSPHG-FIAVVEGGIPTAANGIYGKVAN-------HTMLDICSRILPKAQAVIAYG 1cc1s YEIAEKFNGNFFLLVEGAIPTAKEGRYCIVGEAKAHHHEVTMMELIRDLAPKSLATVAVG1frfs TCSPYGGVQKAKPNPSQAKGVSEAL---G--VKTINIPGCPPNPINFVGAVVHVLT----2frva TCATYGGVQAAKPNPTGTVGVNEALGKLG--VKAINIAGCPPNPMNFVGTVVHLLT----1e3da TCACYGGIQAAKPNPTAAKGINDCYADLG--VKAINVPGCPPNPLNMVGTLVAFLK----1h2rs TCATFGGVQAAKPNPTGAKGVNDALKHLG--VKAINIAGCPPNPYNLVGTIVYYLKN---1cc1s TCSAYGGIPAAEGNVTGSKSVRDFFADEKIEKLLVNVPGCPPHPDWMVGTLVAAWSHVLN 1frfs K---GIPDLDENGRPKLFYGELVHDNCPRLPHFEASEFAPSFDSEEAKKGFCLYELGCKG 2frva K---GMPELDKQGRPVMFFGETVHDNCPRLKHFEAGEFATSFGSPEAKKGYCLYELGCKG 1e3da G---QKIELDEVGRPVMFFGQSVHDLCERRKHFDAGEFAPSFNSEEARKGWCLYDVGCKG 1h2rs K---AAPELDSLNRPTMFFGQTVHEQCPRLPHFDAGEFAPSFESEEARKGWCLYELGCKG 1cc1s PTEHPLPELDDDGRPLLFFGDNIHENCPYLDKYDNSEFAETFTK-----PGCKAELGCKG1frfs PVTYNNCPKVLFNQ-VNWPVQAGHPCLGCSEPDFWDTMTPFYEQG2frva PDTYNNCPKQLFNQ-VNWPVQAGHPCIACSEPNFWDLYSPFYSA-1e3da PETYNNCPKVLFNE-TNWPVAAGHPCIGCSEPNFWDDMTPFYQN-1h2rs PVTMNNCPKIKFNQ-TNWPVDAGHPCIGCSEPDFWDAMTPFYQN-1cc1s PSTYADCAKRRWNNGINWCVEN-AVCIGCVEPDFPDGKSPFYVAEFigure3:Result for the oxidored。
MultipleSequenceAlignment(MSA)
x GGGCACTGCAT y GGTTACGTC-z GGGAACTGCAG
w GGACGTACC-v GGACCT-----
Alignment 1 Alignment 2
Aligning alignments/profiles
-AGGCTATCACCTG TAG–CTACCA---G CAG–CTACCA---G CAG–CTATCAC–GG CAG–CTATCGC–GG
A
1
1
.8
C
.6
1
.4 1 .6 .2
G
1 .2
.2
.4 1
T
.2
1 .6
.2
-
.2
.8
.4 .8 .4
Aligning alignments/profiles
SeqA GARFIELD THE LAST FAT CAT SeqB GARFIELD THE ---- FAST CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqA GARFIELD THE LAST FA-T CAT SeqB GARFIELD THE FAST CAT SeqC GARFIELD THE VERY FAST CAT SeqD -------- THE FA-T CAT
AAA
ACC
An alignment with 3 columns
ACG
ACT
0
Consistency-based approaches
▪ T-Coffee
– M-Coffee & 3D-Coffee (Expresso)
多序列比对——精选推荐
多序列⽐对⽂章转载于 Original 2017-07-11 liuhui ⽣信百科多序列⽐对(或多序列联配,multiple sequence alignment,MSA),是指把多条(3 条或以上)有系统进化关系的蛋⽩质或核酸序列进⾏⽐对,尽可能地把相同的碱基或氨基酸残基排在同⼀列上。
这样做的意义是,对齐的碱基或氨基酸残基在进化上是同源的,即来⾃共同祖先(common ancestor)。
下图是⼀个 MSA 的例⼦。
MSA 有许多⽤途,如构建,选择压分析,基因家族的保守结构域分析,等。
从软件的速度和准确性出发,mafft 和 muscle 是不错的选择。
这⾥介绍 mafft 的使⽤⽅法。
mafft 安装(⾮ root)下载wget http://mafft.cbrc.jp/alignment/software/mafft-7.310-with-extensions-src.tgz解压tar -zxvf mafft-7.310-with-extensions-src.tgz编辑 Makefile ⽂件的第⼀⾏cd mafft-7.310-with-extensions/core/vim Makefile (或⽤ nano 等进⾏编辑)编辑:PREFIX = /usr/local为:PREFIX = /home/your_home/somewhere(如:PREFIX = /home/liuhui/bin/mafft-7.310编译和安makemake install安装最后安装在/home/liuhui/bin/mafft-7.310/bin下,将这个路径放到.bashrc中即可mafft 使⽤⽅法mafft 的⼀般⽤法为:mafft [arguments] input > outputinput 可以是 fasta 格式的蛋⽩质或核苷酸序列。
对于 200 条序列以内且序列长度⼩于 2,000 bp 或 aa 的⽂件,可以使⽤mafft-linsimafft-linsi input > output⽂件较⼩时,也可以使⽤在线版:http://mafft.cbrc.jp/alignment/server/。
多序列比对 简书
多序列比对简书摘要:一、多序列比对简介1.多序列比对的概念2.多序列比对的作用3.多序列比对的应用领域二、多序列比对方法1.传统的多序列比对方法2.基于进化树的多序列比对方法3.基于统计模型的多序列比对方法三、多序列比对软件1.Clustal Omega2.MUSCLE3.MAFFT4.ProbCons四、多序列比对的结果分析1.一致性序列的确定2.进化树构建3.功能域分析五、多序列比对在生物学研究中的应用1.基因进化分析2.蛋白质结构预测3.基因组注释正文:多序列比对(Multiple Sequence Alignment, MSA)是一种将多个氨基酸或核苷酸序列在同一坐标系中进行比较的方法,通过比较不同序列之间的相似性和差异性,探讨它们的进化关系和生物学功能。
多序列比对在分子进化、蛋白质结构预测、基因组注释等领域有着广泛的应用。
多序列比对方法主要分为传统的多序列比对方法和基于统计模型的多序列比对方法。
传统的多序列比对方法通常采用基于距离的方法,例如最长公共子序列(Longest Common Subsequence, LCS)方法和动态规划方法(例如Needleman-Wunsch 算法和Smith-Waterman 算法)等。
基于统计模型的多序列比对方法则通过建立序列间的统计模型来进行比对,例如利用核苷酸或氨基酸的组成、序列长度分布、局部序列比对等特征来建模。
多序列比对软件有很多,其中比较常用的有Clustal Omega、MUSCLE、MAFFT 和ProbCons 等。
这些软件在算法原理和实现上有所不同,但都能有效地完成多序列比对任务。
Clustal Omega 是一个基于迭代算法和优化聚类的软件,适用于大规模的多序列比对;MUSCLE 是一个采用简化分子进化模型和逐步比对策略的软件,适用于中等规模的多序列比对;MAFFT 是一个基于增量比对和优化搜索策略的软件,适用于小规模的多序列比对;ProbCons 是一个基于概率模型和蒙特卡洛搜索策略的软件,适用于高质量的多序列比对。
MULTIPLE SEQUENCE ALIGNMENT
MULTIPLE SEQUENCE ALIGNMENTOfer Gill (gill@)Alyssa Lees(lees@)Aris Tsirigos(tsirigos@)MULTIPLE SEQUENCE ALIGNMENT Introduction: Multiple Sequence Alignment poses a challenging and fascinating endeavor in computational biology. The computational methods use in multiple sequence alignments have tremendous importance in solving a series of relatedbiological problemsFor example, a great discovery in molecular biology was the surprising similarity between the DNA sequences of distinct organisms. In fact, similar genes often perform identical functions across different species. In other instances, similar genes evolve or mutate in different species and therefore perform altered functions. Simultaneous alignment of nucleic acid or amino acid sequences allows the examination of the evolutions of mutations among similar genes (Mount, 141).The ability to mathematically distinguish between gene sequences is also of great use in phylogenetic analysis. Aligned sequences yield predictions in each column of the mutation that occurred at a given site during the evolution of the sequence family. While prior knowledge of the actual nature of an evolutionary tree is needed for proper assessment, multiple sequence alignment algorithms are of great use in constructing the exact nature of evolutionary mutations.One subsection of multiple sequence alignment algorithms focuses solely on the localized alignments of a subset of each sequence. Instead of examining the global alignment, which includes the relationship of all parts of a sequence, localized analysis removes highly conserved regions in the alignments. These highly conserved subsequences are used to build a “profile”. Once established a profile is used to search a target sequence for possible matches (Mount, 162). A related type of analysis uses blocks to identify motifs in protein sequences. Proteins have higher level structures and a motif is a certain combination of these structures (such as a helix-loop-helix structure). ( Setubal, 253). Motifs are extremely important as potential binding sites and the discovery of such structures by pattern matching in related proteins is of great importance.Generally, if the structure of one of more members of an alignment is known, it is possible to predict the amino acids that hold the same relationships in otherproteins. These alignments can uncover both underlying structural and functional relationships. One example, is that aligned promoters of a set of similarly regulated genes may reveal consensus-binding sites for regulatory proteins. In addition, once a consensus pattern has been found between sequence, database searching can be performed to find related sequence in other organisms.The importance of multiple string alignment is indisputable. However, implementing efficient algorithms poses several computational difficulties.As expected, the problem of sequencing several genes simultaneously poses new obstacles when compared to the standard two-sequence case. The different types of multiple sequence alignment algorithms can be divided into classes: 1) global alignment algorithms that progressively align a series of sequences 2) iterative methods that make an initial alignment and then revise 3) statistical methods. In this project we are mainly focusing on the comparison between progressive global alignments, iterative methods and statistical algorithms.Below is a table we created to summarize the differences between the varying approaches to multiple sequence alignment.MSA TYPE SIZE METHOD SOFTWARE FUNCTION PROBLEMSGlobal Alignments3-4 sequences or<10 small ones Modified dynamicprogramming for pairwise algorithm with aset scoring matrixCLUSTALW, MSA Buildingphylogenetic trees,Comparing distancebetween proteinsequencesOnly small # ofsequences can beused. Final msadependent on initialpair-wise alignment.Hard to makeaccurate gappenalties.Iterative Methods< 20 sequence Correct problem inglobal alignments offinal msa based oninitial pair-wisealignment. Geneticalgorithm generatesmany alignments thatsimulate gapinsertion andrecombinationevents. Revisealignment iteratively.DIALIGN, PRRP,SAGA geneticalgorithm, MultAlinImproves overallalignment score ofmsa. Basically usedfor global alignmentpurposes such asbuildingphylogenetic trees.Produces alignmentsby simulatingevolutionary changesin sequences.Slow. Alignmentsare not guaranteed tobe optimal.Statistical Algorithms (HMM)Large # of ing priors and atraining set ofsequences, train setmodel until itconverges. Oncemodel has beencreated, use otherdata sequences tocreate msa. ProfileHMMs used for bothglobal/localalignmentEM/MEMEalgorithms used forlocaleMOTIF, GIBBS,HMMER, MACAW,MEME, SAMGenerally used forlocal alignment. In agiven sequence ofproteins findcommon substringsto all sequences(motifs).HMMs also used tocreate most probableMSA for a givenfamily of sequences.Model used to searchdatabase for commonsequencesLarge number ofsequences arerequired. HMMsprovide better msathan other methods,no sequence order isrequired, however ifinappropriate initialor prior conditionsare used resultscould be nonsense.Problem: In this project, we explore the different methods of multiple sequence alignment. Using existing data for a variety of protein sequences, we sample algorithms that use progressive global alignment methods (i.e. global dynamic programming), iterative methods, and statistical methods. In attempting to program, a couple of the methods ourselves, we have drawn conclusions about some of the computational difficulties that multiple sequence alignment include. In addition, by exploring examples from each subtype of algorithms, we can draw conclusions on the usefulness and effectiveness of all such methods.Background on Multiple Sequence Alignment Methods:Global Alignment (dynamic programming) –When solving the computational problem of multiple sequence alignment, a natural generalization is to expand the two-sequence case. The basic algorithm for the global comparison of two sequences, is to assign a score. A positive number is assigned for a match between the two base pairs or proteins in a sequence, a negative number for a mismatch and a more severe penalty for a gap. A standard application of dynamic programming is used to calculate the score with gap penalties/etc for the entire sequence (Setubal, 50).In the multiple sequence case, the issue of scoring becomes more complex.With several sequences, scores are calculated by assigning a score to each column. The sum-of-pairs or SP measure is often used to calculate values for a sequence. The sum-of-pairs function is the sum of pairwise scores of all pairs of symbols in a given column. A k-dimensional array would be an obvious solution for dynamic programming, but it will quickly eat up time and space. A standard application of dynamic programming for multiple sequences takes exponential time. Even with time saving measures, a multiple sequence alignment of three sequences takes O(n^3) time. As a result, only 3 or 4 sequences can be realistically used when implementing a global dynamic programming algorithm.Due to the slow nature of dynamic programming algorithms, heuristics are often used to give reasonable approximations quickly.. One such algorithm is Star Alignment, which consists of building a multiple alignment based upon pairwise alignments between a fixed sequence of the input set. The result is a feasible polynomial time algorithm. Another approach to the global alignment problem is Tree alignments, where a weight is computed for each edge of a tree. The star alignment is a particular case of the tree alignment algorithm, in which the star is the tree(Setubal, 79).In general, the major problems with the global alignment approach are that only a few sequences (even with speedy heuristics) can be sequenced at once. Pure dynamic programming yields accurate results, but is simply too slow with out heuristics to approximate the heavy computation. In addition the heuristic approaches, yield good answers, but are often dependent on the initial sequences compared. A bad choice for the first sequence can yield an inaccurate result.Our representative of this genre of multiple sequence alignment algorithms is the implementation of software CLUSLAW on the gene data (collected and listed below). CLUSTLAW works as follows :ClustalW is a multiple sequence alignment program which aligns proteins in the following steps:1.Pairwise alignments of all the sequences are performed with scores recorded2.A phylogenetic tree is produced based on these scores, using progressivealignment and dynamic programming methods. In general, the closer-relatedsequences are placed closer to one another in the tree.ing this tree, the sequences are aligned.The genetic distance is the number of mismatched positions in an alignment (not including gaps) divided by the total number of matched positions. Sequences contributions are weighted based on their relationships from the tree. Weights are based on the distance of each sequence from the root of the tree.Gaps provide a penalty to the score. There is a penalty for opening a gap within a sequence alignment, opening a gap adjacent to another gap, and enhancing or shrinking already present gaps. Tables based on observed gaps for certain proteins are also utilized in calculating this penalty, and these tables also provide an intuition of the correct alignment. The penalties are customizable, but for purposes of this experiment, the default values for ClustalW 1.82 have been chosen.ClustalW works well in alignment when protein sequences are closely matched to one another. However, in cases where the sequences are far apart from one another when pairwise-aligned, more errors are made, and these errors are propagated to the tree construction, as well as the final results.Iterative Methods:The main of example of iterative methods in the paper is our basic implementation of the SAGA algorithm. SAGA is a genetic algorithm that simulatedchanges in a group of alignments in an evolutionary manner. In this project, we wrote a simplified version of this algorithm (the code is attached at the end of the paper as Appendix A). The program is written in the language K.Statistical Methods:HMMLocal Alignment : Searching for Motifs and BlocksPrimarily consisting of the EM algorithm(or MEME algorithm), this category of methods are used on a set of sequences that are expected to have a common subsequence that is not readily recognizable. Initially a guess is made to the location and size of a site of interest in each of the sequences. These sites are aligned for all sequences. The EM algorithms then consists of the two steps: the 'E' or expect ion step and the 'M', maximization step. In the expectation step, the conditional probabilities of finding the site at each of the positions in the sequence are calculated. In the maximization step, the next counts of bases or amino acids for each position in the site found in the expectation step are substituted for the previous set. The expectation is step is then repeated with the new counts. The algorithm repeats the two steps until the results converge (Koski, 87).MEME (Multiple EM for Motif Elicitation) is a common software tool to perform local multiple sequence alignments using the EM algorithm. The software algorithm searches over a range of possible motif widths. The most probably motif width for each profile is found by calculating the loglikelihood score after the first iteration of the EM algorithm. The remaining iterations of the algorithm find the best EM estimate for the given width. The software supports searching for one expected occurrence of a motif per a sequence, a search for zero or one occurrence of a motif per sequence, and finally for a motif to appear any number of times per sequence.Global Alignment (and local): Building Model and creating MSA with HMMA Hidden Markov Model uses statistics and prior information about a set of sequences to create all possible combinations of matches, mismatches, ans gaps in ultimating creating an alignment. HMMs require a large number of sequences (typically 20-100) in order to “train” the model, but no insertion/deletion penalties are need and no sequence ordering is required. The results of Hidden Markov Models are comparable if not better than other multiple sequence alignment methods (Mount, 185).In particular, a certain structured Hidden Markov Model called a profile HMM is used to model multiple alignments. For the moment we will assume that a correct multiple alignment is given. Using this alignment, a profile HMM can be built and used to find and score potential matches to new sequences. The standard structure of the profile HMM consists of a series of N identical match states that are separated by transition probabilities of 1. The model deals with gaps, by treating insertions and deletions separately. The standard model is illustrated below, where match state represent the mostly conserved positions in the sequence family, the insert states represent subsequences that have been inserted in some members of the family and the delete states are silent states representing subsequences that have been deleted in some members of the family (Durbin, 104).For a given sequence alignment, the most probable path through the model is determined by a decoding algorithm. A common solution is the Viterbi algorithm which finds the most probable path through the profile HMM recursively.A multiple sequence alignment is created by by calculating a viterbi alignment for each individual sequence. The residues aligned to the same profile HMM match state are aligned in columns. This gives rise to an important difference between HMM multiple alignments and msa's created by traditional methods. The choice of where or how to put the “insert” residues in the alignment is arbitrary. The insert states are considered to represent segments of a sequence which are unconverted and not meaningfully alignable. According to Durbin, this is a biologically more realistic view of alignment. Other msa algorithms align entire sequences regardless of whether the sequences are meaningfully alignable (Durbin, 151).In this project, two different methods of exploring HMMs were utilized. Our group initially begin by attempting to code a Multiple Sequence Alignment Algorithm using profile HMMs. The algorithm followed the following procedure: 1)First a length of the profile HMM was determined by averaging the length of allthe sequences in a given gene family. All other parameters were initialized.2)The model was trained with a subset of the data sequences. The estimated modelwas create using a variation of the Baum-Welch algorithm. First a forwardalgorithm for profile HMMs was used and then a backward algorithm. Theforward and backward algorithms were then combined to estimate the emission and transition probabilities using the Baum-Welch Re-Estimation Equations. The emissions probability are defined to be the probability of seeing a certain symbol (in this case an amino acid), given the fact that we are in a given state. Thetransition probabilities are the probabilities of entering state i, given that we have started in state j. With Baum Welch, we improve with each iteration of theprobability of a given sequence being observed. The iteration stop when a limiting probability is achieved (the model should eventually converge).3)Finally, all sequences not in the training set are aligned to the final model using theViterbi algorithm and a multiple alignment is built.Unfortunately, the project of coding a HMM algorithm in matlab was tooambitious for the limited time alloted. While we were somewhat successfully in attempting to write code to train a model, Matlab was simply too slow to getany meaningful output. For example, we were able to write preliminary codethat created a HMM profile of an imaginary sequence of a few states. However, even the smallest gene sequences in our data set required a profile of aminimum of 1500 states. This was far too difficult to manage efficiently inmatlab.Unable to write functional matlab code to create multiple sequence alignments with HMMs, we tested the data using the program HMMER. The HMMER software uses an approach listed above to create profile hidden Markov models (profile HMMs) to do sensitive database searching using statical descriptions of a sequence family's consensus.Data:The data for all experiments with existing software and code we constructed ourselves are amino acid sequences found on various websites Below we have listed brief descriptions of each of the genes used in this project.Gene: BACEDescription: Beta-secretase precursor protein found in brain tissue. Responsible for proteolytic processing of the Amyloid Percursor Protein (APP), which generates amyloid beta peptide (Abeta). And, Abeta is an early and critical feature of Alzheimer's disease. Development of BACE inhibitors might assist treatment of Alzheimer's disease.BACE cleaves at the Amino Terminus of the Abeta peptide sequence, between residues 671 and 672 of APP, leading to the generation and extracellular release of beta-cleaved soluble APP, as well as cell-associated caboxy-terminal fragment which is later released by gamma-secretase.Gene: CTSE or CATEDescription: The predicted sequence of human gastric cathepsin E (CTSE) was determined by analysis of cDNA clones isolated from a library constructed with poly(A+) RNA from a gastric adenocarcinoma cell line. The CTSE cDNA clones were identified using a set of complementary 18-base oligonucleotide probes specific for a 6-residue sequence surrounding the first active site of all previously characterized human aspartic proteinases. The CTSE gene was localized to human chromosome 1 by concurrent cytogenetic and cDNA probe analyses of a panel of human x mouse somatic cell hybrids.The intracellular location and distribution of CTSE in lymphoid-associated tissue causes scientists to believe that it may have a role in immune function. CTSE is a member of the Petidase family A1 (Pepsin).Gene: GLTA or CISYDescription: This gene is a member of the citrate synthase family, which is found in nearly all cells capable of oxidative metabolism. GLTA encodes citrate synthase from a novel bacterial isolate (DS2-3R) from Antarctica, and has been cloned, sequenced and over-expressed in Escherichia coli.Both the recombinant and the native enzymes were purified from DS2-3R, and are cold-active with a temperature optimum of 31 degrees C. In addition the enzymes are rapidly inactivated at 45 degrees C, and show significant activity at 10 degrees C and below.Gene: CS (Citrate synthase, mitochondrial [Precursor])Description: This gene is another member of the citrate synthase family. The nucleotide sequence encoding the CS gene was determined from the sequencing of the CS cDNA isolated from a human heart cDNA library. The primary sequence of CS deduced from its nucleotide sequence reveals a highly conserved, albeit slightly larger, protein of 466 amino acids, with 95% homology to its pig homologue. The data also indicates that the human genomic CS gene contains no introns, and confirms the location of the human CS gene on chromosome 12.Gene: POL (Protease)Description: This gene belongs to the Peptidase family A2, AKA the RetroPepsin family. This gene is found in immune-deficiency viruses such as HIV-1, HIV-2 / SIVSMM / MAC, SIVAGM, SIVMND, LV1-1KS1, FIV, EAIV, and BIV. It is beinganalyzed for purposes of treating HIV. A branching order has yet to be found among differing groups of immune-deficiency viruses.Experiments:1.Global Alignment – CLUSTALW SOFTWARE:The experiment consisted of simply running software on all of the gene data listed above. The print outs for each of the CLUSTALW runs on each gene family are attached at the end of this document.Since the experiments conducted in this report involve closely related proteins, ClustalW worked well on them. Based on the results, within the BACE, CS, GLTA, CATE, and POL proteins, the sequences aligned fairly well. In the BACE, CATE, and POL experiments, the sequences aligned with 10% distances from one another or less.In the CS experiment, all sequences aligned with less than 10% distances, except CISY_DAUCA which came up around 34.6% distance. In the GLTA experiment, only four genes aligned with less than 10% distance, with all the others aligning with 10-28% distance.In the CISY experiment, which uses both GLTA and CS proteins, the tree structures produced matched those of GLTA and CS, with the exception of CISY_DAUCA. Furthermore, the sequence distances measured were roughly the same as the GLTA and CS counterparts, with the exception ofCISY_DAUCA, whose distance came out fairly lower in the CISY experiment. The explanation for CISY_DAUCA's change of location in the tree and lower distance is probably due to its close match with the GLTA proteins, in spite CISY_DAUCA being a CS protein.2.Iterative – SAGA genetic algorithm implementationsimple Genetic Algorithm for the MSA problemWe developed a program in K that implements a very basic version of the genetic algorithm described in [1]. Given the initial sequences, a population of MSAs is created and then evolved through various random “evolutiona ry”events called operators. In this program, we incorporated three operators: Recombination (crossover)Gap insertionBlock shufflingThe operators act on MSAs that are chosen randomly according to their weights (scores). Each generation of MSAs comprises the best scoring 50% of MSAs from the previous generation and another 50% of mutated MSAs generated with the use of the three operators. We tested our program with the first 60 characters of the following proteins:Name: CARP_CANTRWebID: Q00663Gene: SAPT1.FoundIn: Candida tropicalis (Yeast)Size=394Sequence=MATIFLFTKN VFIALAFALF AQGLTIPDGI EKRTDKVVSL DFTVIRKPFN ATAHRLIQKRSDVPTTLINE GPSYAADIVV GSNQQKQTVV IDTGSSDLWV VDTDAECQVT YSGQTNNFCKQEGTFDPSSS SSAQNLNQDF SIEYGDLTSS QGSFYKDTVG FGGISIKNQQ FADVTTTSVDQGIMGIGFTA VEAGYNLYSN VPVTLKKQGI INKNAYSCDL NSEDASTGKI IFGGVDNAKYTGTLTALPVT SSVELRVHLG SINFDGTSVS TNADVVLDSG TTITYFSQST ADKFARIVGATWDSRNEIYR LPSCDLSGDA VVNFDQGVKI TVPLSELILK DSDSSICYFG ISRNDANILGDNFLRRAYIV YDLDDKTISL AQVKYTSSSD ISALName: CARP_CRYPAWebID: P11838Gene: EAPAFoundIn: parasiticaSize=419Sequence=MSSPLKNALV TAMLAGGALS SPTKQHVGIP VNASPEVGPG KYSFKQVRNP NYKFNGPLSVKKTYLKYGVP IPAWLEDAVQ NSTSGLAERS TGSATTTPID SLDDAYITPV QIGTPAQTLNLDFDTGSSDL WVFSSETTAS EVDGQTIYTP SKSTTAKLLS GATWSISYGD GSSSSGDVYTDTVSVGGLTV TGQAVESAKK VSSSFTEDST IDGLLGLAFS TLNTVSPTQQ KTFFDNAKASLDSPVFTADL GYHAPGTYNF GFIDTTAYTG SITYTAVSTK QGFWEWTSTG YAVGSGTFKSTSIDGIADTG TTLLYLPATV VSAYWAQVSG AKSSSSVGGY VFPCSATLPS FTFGVGSARIVIPGDYIDFG PISTGSSSCF GGIQSSAGIG INIFGDVALK AAFVVFNGAT TPTLGFASKName: CARP_NEUCRWebID: Q01294Gene: PEP-4.FoundIn: Neurospora crassaSize=396Sequence=MKGALLTAAM LLGSAQAGVH TMKLKKVPLA DELESVPIDV QVQHLGQKYT GLRTESHTQAMFKATDAQVS GNHPVPITNF MNAQYFSEIT IGTPPQTFKV VLDTGSSNLW VPSSQCGSIACYLHNKYESS ESSTYKKNGT SFKIEYGSGS LSGFVSQDRM TIGDITINDQ LFAEATSEPGLAFAFGRFDG ILGLGYDRLA VPGITPPFYK MVEQKLVDEP VFSFYLADQD GESEVVFGGVNKDRYTGKIT TIPLRRKAYW EVDFDAIGYG KDFAELEGHG VILDTGTSLI ALPSQLAEMLNAQIGAKKSW NGQFTIDCGK KSSLEDVTFT LAGYNFTLGP EDYILEASGS CLSTFMGMDMPAPVGPLAIL GDAFLRKYYS IYDLGADTVG IATAKRName: CARP_RHICHWebID: P06026Gene:FoundIn: Rhizopus chinensis (Bread mold)Size=393Sequence=MKFTLISSCI AIAALAVAVD AAPGEKKISI PLAKNPNYKP SAKNAIQKAI AKYNKHKINTSTGGIVPDAG VGTVPMTDYG NDVEYYGQVT IGTPGKKFNL DFDTGSSDLW IASTLCTNCG SRQTKYDPKQ SSTYQADGRT WSISYGDGSS ASGILAKDNV NLGGLLIKGQ TIELAKREAASFANGPNDGL LGLGFDTITT VRGVKTPMDN LISQGLISRP IFGVYLGKAS NGGGGEYIFGGYDSTKFKGS LTTVPIDNSR GWWGITVDRA TVGTSTVASS FDGILDTGTT LLILPNNVAASVARAYGASD NGDGTYTISC DTSRFKPLVF SINGASFQVS PDSLVFEEYQ GQCIAGFGYGNFDFAIIGDT FLKNNYVVFN QGVPEVQIAP VAQWe used 30 generations of 100 MSAs each and we achieved a score of 63 compared to 84 by CLUSTALW. The scoring array used was a standard PAM250 array. Here is the output:s: optimize[Snum;30;100]"BEST SCORE: -32"("-MATIFLFTKNVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR-------" "------MSSPLKNALVTAMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKFNGPLSV--" "--------MKGALLTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTGLRTESHTQA" "-----MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYNKHKINT---")1"BEST SCORE: -32"("-MATIFLFTKNVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR-------" "------MSSPLKNALVTAMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKFNGPLSV--" "--------MKGALLTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTGLRTESHTQA" "-----MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYNKHKINT---")2"BEST SCORE: -32"("-MATIFLFTKNVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR-------" "------MSSPLKNALVTAMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKFNGPLSV--" "--------MKGALLTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTGLRTESHTQA" "-----MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYNKHKINT---")3"BEST SCORE: -32"("-MATIFLFTKNVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR-------" "------MSSPLKNALVTAMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKFNGPLSV--" "--------MKGALLTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTGLRTESHTQA" "-----MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYNKHKINT---")4"BEST SCORE: -13"("MATIFLF-TKNVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR--""M-SSPLK-NALVTAMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKFNGPLSV-" "M-KGALLTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTGLRT-ESHTQ-A" "M-KFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYN-KHKIN-T")5"BEST SCORE: -13"("MATIFLF-TKNVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR--""M-SSPLK-NALVTAMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKFNGPLSV-" "M-KGALLTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTGLRT-ESHTQ-A" "M-KFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYN-KHKIN-T")6"BEST SCORE: 13"("MATIF--LFTK-NVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR---" "MSSPLKNALVT-AMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKFNGPL--SV---" "MK---GALLTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTG-LRTE--SHTQA" "MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYN-KHKI--NT---")7"BEST SCORE: 13"("MATIF--LFTK-NVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR---" "MSSPLKNALVT-AMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKFNGPL--SV---" "MK---GALLTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTG-LRTE--SHTQA" "MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYN-KHKI--NT---")8"BEST SCORE: 13"("MATIF--LFTK-NVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR---" "MSSPLKNALVT-AMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKFNGPL--SV---" "MK---GALLTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTG-LRTE--SHTQA" "MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYN-KHKI--NT---")9"BEST SCORE: 17"("M-ATIF---LFTKNVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR""M-SSPLKNALVTAMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKF---NGPLSV" "M-KGAL---LTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTGLRTESHTQA" "MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYNK---HKINT-")10"BEST SCORE: 17"("M-ATIF---LFTKNVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR""M-SSPLKNALVTAMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKF---NGPLSV" "M-KGAL---LTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTGLRTESHTQA" "MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYNK---HKINT-")11"BEST SCORE: 17"("M-ATIF---LFTKNVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR""M-SSPLKNALVTAMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKF---NGPLSV" "M-KGAL---LTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTGLRTESHTQA" "MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYNK---HKINT-")12"BEST SCORE: 17"("M-ATIF---LFTKNVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR""M-SSPLKNALVTAMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKF---NGPLSV" "M-KGAL---LTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTGLRTESHTQA" "MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYNK---HKINT-")13"BEST SCORE: 17"("M-ATIF---LFTKNVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR""M-SSPLKNALVTAMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKF---NGPLSV" "M-KGAL---LTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTGLRTESHTQA" "MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYNK---HKINT-")14"BEST SCORE: 17"("M-ATIF---LFTKNVFIALAFALFAQGLTIPDGIEKRTDKVVSLDFTVIRKPFNATAHRLIQKR""M-SSPLKNALVTAMLAGGALSSPTKQHVGIPVNASPEVGPGKYSFKQVRNPNYKF---NGPLSV" "M-KGAL---LTAAMLLGSAQAGVHTMKLKKVPLADELESVPIDVQVQHLGQKYTGLRTESHTQA" "MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPNYKPSAKNAIQKAIAKYNK---HKINT-")15"BEST SCORE: 41"("MATIFLFTKNVFIALAFALFAQGL--TIPDGIEKRTDKV-VSLDFTVIRKPFNATAHRLIQKR-" "MS-SPLKNALVTAMLAGGALSSPTKQHVGIPVNASPE---VGPGKYSFKQVRNPNYKFNGPLSV" "MKGALLTAAMLLGSAQAGVHTMKLKK---VPLADELESVPIDVQVQHLGQKYTGLRTESHTQA-" "MKFTLISSCIAIAALAVAVDAAPGEKKISIPLAKNPN---YKPSAKNAIQKAIAKYNKHKINT-")16。
实习四:多序列比对(Multiple alignment)
实习四:多序列比对(Multiple alignment)学号姓名专业年级实验时间提交报告时间实验目的:1. 学会利用MegAlign进行多条序列比对2. 学会使用ClustalX、MUSCLE 和T-COFFEE进行多条序列比对分析3. 学会使用HMMER进行HMM模型构建,数据库搜索和序列比对实验内容:多序列比对是将多条序列同时比对,使尽可能多的相同(或相似)字符出现在同一列中。
多序列比对的目标是发现多条序列的共性。
如果说序列两两比对主要用于建立两条序列的同源关系,从而推测它们的结构和功能,那么,同时比对多条序列对于研究分子结构、功能及进化关系更为有用。
例如,某些在生物学上有重要意义的相似区域只能通过将多个序列同时比对才能识别。
只有在多序列比之后,才能发现与结构域或功能相关的保守序列片段,而两两序列比对是无法满足这样的要求的。
多序列比对对于系统发育分析、蛋白质家族成员鉴定、蛋白质结构预测、保守模块的搜寻以及PCR引物设计等具有非常重要的作用。
作业:1.Align the orthologous nucleotide and protein sequences from 5 organisms you found from first practice with MegAlign. Describe the sequences you used (the title of each sequence), explain whether the phylogenetic tree is consistent with the species tree from NCBI taxonomy database. Set the alignment report to show consensus strength and decorate the residues different from consensus with green shade.(Hint: use the taxonomy common tree from NCBI to get the evolutionary relationship among the organisms. Save your organism name in a text file with each organism name in a line, and upload the file, choose Add from file, and you will see the relationship among the specified organisms) /Taxonomy/CommonTree/wwwcmt.cgiHint 2:Change the accession number in your fasta or genPept format sequence file to organism name, so that the phylogenetic tree can be easily understood.方法与结果:打开Megalign,选择FILE下的Enter sequence ,打开之前保存的来自于五个物种的蛋白(或核酸)序列;首先选择打分矩阵,点击“Align”,选择Set residue Weight Table 选择矩阵:PAM100(核酸则设为weighted),通过“method parameters”查看参数,使用Clustal V的默认值;其次进行序列的比对,选择Align下的“by Clustal V Method”开始比对,再次待其结束后,进行比对结果的显示,选择view下的“Phylogenetic Tree”,显示出树形图;(图)与NCBI上找到的树形图进行对比(图);接下来点击View 下的“Alignment reports ”,选择OPTIONS下的“Alignment report contents”勾中“show consensus strength”,即在序列中显示出相似性条块;在OPTIONS下选择“New decorations”对decoration parameters 下选“shade—residues differing from—the consensus”把字符选择现实的颜色为绿色,结果显示如下:(图)同法可以得到核酸的树形图:(图)分析:系统发育树与NCBI上的物种树有很大的差异,因为可能这些物种间含有很多同源序列,我们不能单凭几条相似序列的同源关系来判断物种的亲缘关系,而应该考虑到物种更多相似序列的同源关系。
Multiplesequencealignment
Lab 4: Multiple Sequence Alignment (MSA) The objective of this lab is to become familiar with the features of several multiple alignment and visualization tools, including the data input and output, basic visualization and editing functions, alignment options, and different between nucleotide and amino acid alignments.A large number of alignment tools are available, but we don’t have time to go through them all one by one. Thus we are going to focus on two of them: MAFFT and MUSCLE, since multiple studies have proved both to be of the best performance.For visualization purpose, we use the program Jalview, from which you can invoke these two alignment tools.You are allowed to finish the lab by yourself or with a partner.[Exercise 1] Jalview: Basic Functions1.Start Jalview2.choose the file 1ped.fasta in the data directory by going to “File > Inputalignment > from File”. Have a look at the data. Are the sequences well-aligned?3.Try some basic commands(1)To select a taxa, click on the taxon name on the left side(2)To select all sequences at once, use CTRL-A(3)To deselect all, go to “Select > Deselect All”(4)To move selected sequences to another point in the data set, hightlight thesequences and use the arrow keys on your keyboard to move the sequences up/down/left/right-ward.4.You can edit the sequences manually bya)Left click and drag to select where you wish to begin editingb)Right click the hightlighted sequence and then choose “Select>Edit>EditSequence”c)Enter the characters you want to insert, or a space to denote a gap5.You can undo changes by choosing “Edit>Undo”6.Close the file with/without saving.[Exercise 2] MSA: MAFFT and MUSCLE Comparison[Part 1: MAFFT]1.Run a progressive alignment in MAFFT by entering the commandmafft –retree 2 1ped.fasta > mafft_dna.fasta2.Once the alignment process is completed, open the result in Jalview. Theconcensus graph shows the percentage of agreement for each column of the alignment.[Part 2: MUSCLE]1.Run a standard alignment in MUSCLE by entering the commandmuscle –verbose –log muscle_dna.log –in 1ped.fasta –out muscle_dna.fastaNote that –verbose and –log are not necessary but you can use them to see the default options in MUSCLE.2.Open the results in Jalview[Part C: Comparison]pare the alignment resulting from the previous two parts. Are they different?Which one do you prefer, the MAFFT or MUSCLE alignment? What may be wrong with both? (Hint: this file contains protein-coding genes)2.Build 2 trees, one from each of your alignments. Go to the alignment window (forboth MAFFT and MUSCLE alignments), then click “Calculate > Calculate Tree > Neighbor Joining Using %Identity” (Note: these trees are great for helping to evaluate your alignments, but this program should never be your tree-building choice)pare the trees from both alignments. Do the topologies and/or branchlengths differ?[Exercise 3] MAFFT for Nucleotide and Protein SequenceIn this exercise we will convert the nucleotide sequences to their equivalent protein sequencesand align these instead. Note that because we are running the alignment programs externally to JalView you cannot convert back to the original DNA sequences post alignment. MAFFT and MUSCLE run through the program SeaView can do this, but will not be covered here.[Part 1: Iterative Strategy usin MAFFT]1.Find the original 1ped.fasta window.2.Click “Calculate > translate cDNA”3.Click “File > Save as” to save the translated sequence as “1ped_aa.fasta”4.Run an iterative alignment in MAFFT by using the command:mafft --maxiterate 1000 1ped_aa.fasta > mafft_aa_iter.fastaLoad this resulted file into JalView. Notice that two new graphs appear along with the alignment: conservation and quality. Conservation measures the number of changes in the physio-chemical properties of the amino acids in any given column of the alignment. Quality is a score that measures the likelihood of changes in each column, given the substitution matrix used to calculate the alignment. For more detail, click “Help > Documentation > Alignment Annotations”.5.Build a tree out using your nucleotide alignment by selecting “Calculate >Calculate Tree > Neighbor Joining using BLOSUM62”.[Part 2: Automatic Strategy usin MAFFT]1.Find the original 1ped.fasta window, then click “Calculate > translate cDNA”2.Click ‘File’>’Save as” and save the file as ’1ped_aa.fasta’ with Fasta as the fileformat.3.Employ the MAFFT automatic selection of alignment strategy by using thecommand:mafft --auto 1ped_aa.fasta > mafft_aa_auto.fastaCan you determine which kind of strategy was employed? (hint: mafft outputs data to the screen and the strategy should be listed near the end).Load this file into JalView. Notice that also two new graphs appear along with the alignment: conservation and quality.4.Build a tree out using your nucleotide alignment by selecting “Calculate >Calculate Tree > Neighbor Joining using BLOSUM62”.[Part 3: Comparison]Compare amino acid alignments and trees. Which one do you prefer? Does it make sense to align protein-coding sequences using the protein translation, or should you instead build alignments from nucleotide sequences?[Exercise 4] MUSCLE: Does the gap penalty affect the alignment?Here we will run MUSCLE with different gap penalties to observe how this changes the alignment results.[Part 1]1.Run MUSCLE with a gap-open penalty of -20:muscle –verbose –log muscle_gap-20.log –in 1ped_aa.fasta –out muscle_aa_gap-20.fasta –gapopen -202.Load the alignment results into Jalview and build a tree as above.[Part 2]1.Run MUSCLE with a gap penalty of -1:muscle -verbose -log muscle_gap-1.log -in 1ped_aa.fasta -out muscle_aa_gap-1.fasta -gapopen -12.Load the alignment into JalView and build a tree.[Part 3]1.Run MUSCLE with the default gap penalty:muscle -verbose -log muscle_defGap.log -in 1ped_aa.fasta -out muscle_aa.fasta2.Load the alignment into JalView and build a tree.[Part 4]Compare the modified gap penalty alignments to the default one. Which one of the three alignments do you prefer, and why? Which has the most gaps? Can you guess the default gap penalty? Can you find in the log files of MUSCLE the gap penalty used?[Exercise 5] BMGE: Trimming AlignmentsTrimming of an alignment removes the ambiguous columns later analysis. We shall use BMGE for this task. This program calculates column entropy to determine columns that are within biologically expected variation. Any columns above a given entropy threshold are removed.1.We shall trim the MUSCLE amino acid alignment (muscle_aa.fasta) using BMGEwith a conserved entropy threshold of 0.7 and use the BLOSUM62 matrix to determine substitution weights:java -jar BMGE.jar -i muscle_aa.fasta -ofaa muscle_trimmed.fasta -h .7 -m BLOSUM62 -t AAHow many columns were removed by BMGE?2.Load the resulting alignment into JalView and build a tree. Compare to theuntrimmed MUSCLE alignment. Has the tree changed? Have the branch lengths changed? Which alignment do you prefer?[Exercise 6] Loading sequences from a public databaseJalview can search the EMBL, PDB, PFAM, and Uniprot databases and load sequences so that you may align and analyze them. Try searching through one of these databases and finding sequences you are interested in working with. Make a note of the accession numbers and enter them in step 4, or use the numbers provided below.1.Close all alignments you have open in Jalview.2.Click on “File > Fetch Sequence(s)”.3.Click “EMBL” in the drop-down menu (or a different database).4.Enter “X53828; X53829; X53930; X5831” (or the numbers of the sequences youfound yourself) in the box and then submit5.Save these sequences in a variety of different formats to analyze as you wish.[Exercise 7] Saving alignments as graphicsYou may need an image of your alignment for publication. Jalview will allow you to save one inHTML, EPS or PNG format. Open any of the alignments you have worked with so far.1.To wrap the alignment on the page, click on “Format > Wrap”.2.Click on “File > Export > HTML” and enter a name for your new image.。
multiple alignments
X
coincides with the number F (L1 ; : : : ; LN ; L) of all standard alignments of total length L without any column consisting of blanks only. Remark: The standard proof for this fact runs as follows: for X f1; : : : ; Lg as above, let f (X; L) denote the number of alignments of total length L with exactly those columns consisting of blanks only which are indexed by elements j 2 X ; then, if x := #X , we have
;$V f1;:::;N g
F (L1 ; : : : ; LN ) =
X
;$V f1;:::;N g
F (L1 ?
V (1); : : : ; LN ? V (N ))
Abstract
1 Introduction
Sequence alignment is one of the most important tools for data analysis in molelcular biology. There are di erent notions of what an alignment is: By standard theory, an alignment of N sequences s1 ; : : : ; sN of length P1 ; : : : ; LN L is de ned to be an N L matrix A with max(L1 ; : : : ; LN ) L 1 i N Li whose rows are obtained from the original sequences by insertion of so-called `blanks' or `gap characters' { with the additional requirement that no column of the matrix A consists exclusively of blanks (cf. 1]; p. 186). Recently, Morgenstern et al. 2] have proposed a di erent way of de ning alignments (see also 3] and 1], p. 188, for the case of two sequences and 4] for a thorough discussion of this concept for any number of sequences). In their de nition, an alignment of the sequences s1 ; : : : ; sN is a consistent equivalence relation de ned on the so-called site space S := f ijj ] 1 i N; 1 j Li g. This de nition avoids a certain redundancy inherent in the standard de nition and allows to apply the mathematical theory of sets and relations to investigate the state space associated with an alignment problem. To distinguish these alignments from standard alignments, we will refer to them as e ective alignments. No matter which de nition is preferred, in either case the alignment problem is the problem of nding an optimal alignment { according to some well-de ned criterion { and the search space for this optimization problem is the set of all possible alignments of a given set of sequences. 1
MultipleSequenceAlignmentIntroduction:多序列比对的介绍
Multiple Sequence AlignmentIntroductionWhat is an alignment?• An arrangement of two or more DNA, RNA or protein sequences– A multiple sequence alignment is one of more than two sequences• Homologous sites between sequences are aligned– This is achieved by inserting gapsWhy do we align?• Alignments allow for identification of regions of similarity between sequences• Identify indels (insertion and deletions) caused by DNA/RNA replicationWhat is an alignment used for? • Building phylogenetic trees• Looking for sites of interest/conservation within a gene (motifs, binding sites, etc.)• Identifying positive/negative selection• Using as references for short read analysisHow do we align?• The goal of alignment programs is to maximise a score based on 3 modifiers:– rewarding matches (+ score)– penalising rare substitutions (- score)• requires a substitution matrix– penalising gaps (- score)• requires gap opening and extension penalty scheme • Different programs go about it through different methodsSubstitution matrices• Places weights upon comparison of characters• Most simple is +1 for a match and 0 for a mismatch• As some substitutions are more acceptable than others these must be weighted• Substitution matrices assign scores to each substitution– Built from alignments of given similarity• BLOSUMx where x is the similarity• PAMx where x is the number of subsitutions/100 amino acidsBLOSUM62Gaps• Represent an insertion in 1 or more sequences or a deletion in the remaining sequences• Two types of penalties are associated with gaps:– Gap opening• To make an initial opening of a gap in a sequence – Gap extension• To add an extra gap character to an existing gap• Gap extension penalty usually smaller than gap openingPairwise alignment• Dot-matrix– 1 sequence as a row, 1 sequence as a column– A dot where two characters match• Dynamic programming– Use a scoring function to optimally align sequences – Either a global or local algorithm used• Word– Heuristic method using ‘words’ of a given size– Find matching words and extend alignments until 1sequences ends or score drops below a thresholdGlobal versus local• A global alignment method attempts to align sequences end-to-end– Useful when sequences are of approximately the same length– Needleman-Wunsch algorithm• A local alignment method attempts to find one or more stretches of similar sequences– Useful when one sequence is significantly longer than the other or there are small similar motifs within large dissimilar sequences– Smith-Waterman algorithmMultiple Sequence Alignment • Progressive– Do a pairwise alignment– Use a clustering method to create a guide tree– Using the guide tree create a succession of pairwisealignments starting with the two closest sequences and ending with the most distant from these• Iterative– Given a MSA remove a sequence and realign to theothers– May also optimise weights and distance measures– Repeat to convergenceMUSCLE• A progressive alignment is created starting with a word-based pairwise method• A new distance matrix is created from this• The old and new trees are compared and sequences realigned to reflect new guide tree– If old and new tree are the same we stop• The alignment is split into 2 profiles and these are aligned as above– Different bipartitions are tried until convergence isreachedMAAFT• Options for a standard progressive alignment (word based)• Iterative alignment available using guide tree reconstruction and realignment• Can use dynamic programming (local or global) instead of word based initial pairwise alignmentEditing alignments• Trimming– Removal of poorly aligned regions can improvesubsequent analysis– Cut-offs of gap proportions or amino acid variation(entropy) are used to remove columns• Manual– Some sequences are difficult for optimal automatedaligning– Manual editing of alignments based on users biological knowledge may improve alignments。
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
Multiple Sequence Alignment as a FacilityLocation Problem1Winfried Just and Gianluca Della VedovaDepartment of Mathematics,Ohio University,Athens,Ohio45701,U.S.A.Dipartimento di Informatica,Sistemistica e Comunicazione,Universit´a degli Studi di Milano-Bicocca,20126Milano,Italye-mail:just@,dellavedova@disco.unimib.it Abstract.A connection is made between certain multiple sequence alignmentproblems and facility location problems,and the existence of a PTAS for theseproblems is shown.Moreover,it is shown that multiple sequence alignmentwith SP-score andfixed gap penalties is MAX SNP-hard.Key words:multiple sequence alignment,SP-score,Space-L Alignment prob-lem,Switchboard Location problem,PTAS,smooth polynomial programming1IntroductionRecent advances in the availability of biological data(i.e.DNA,RNA or protein) has led to tremendous improvements in Molecular Biology.This huge amount of data has also given a tremendous boost to a newfield of Computer Science called Bioinformatics.Pattern matching is a basic tool in Molecular Biology,as sequence similarity usually implies homology and functional similarity of the proteins or genes encoded by such sequences.Another crucial application of sequence comparison are searches of biological data bases.All known biological sequences are stored in huge data bases(e.g.EMBL,Swiss-Prot),and all recent papers in Molecular Biology that report the discovery of a new sequence include a detailed comparison of the novel sequences with those stored in the publicly available data bases.These facts reveal the importance of developing efficient algorithms for aligning a set of sequences.It is standard practice to represent biological sequences as sequences over afixed alphabet(4symbols for DNA and RNA sequences,20symbols for pro-teins).An alignment of a set S of sequences is basically a matrix where the rows correspond to the sequences in the set,possibly with some spaces inserted,and the cost of an alignment is the sum of the costs of all columns.The goal is to compute the alignment of S of minimum cost(or,in an equivalent formulation preferred by many biologists,maximum score).This general definition allows different specifica-tions of the problem,according to the definition of cost of a column in the alignment we choose.In practice at least two definitions make sense,thefirst(called Tree Align-ment)requires a tree whose node set is exactly S and where the cost of a column is the 1This work was supported by grant XXXX.1Paper Submitted for PSCWsum of the costs of the pairs of symbols of the two sequences that are adjacent in the tree.The other definition(called SP-Alignment)will be the one studied in this paper and defines the cost of a column as the sum of all pairs of symbols in the column. Equivalently,the SP-score can be defined as the sum over the pairwise alignment scores of all induced alignments of pairs of the sequences.The pairwise alignment scores are defined as follows.LetΣbe afixed alphabet and let∆/∈Σdenote the space,then a scoring scheme is a symmetric scoring function Modified definition of function from mathbb N to N d M:(Σ∪∆)×(Σ∪∆)→N together with specifications how to handle gaps.A scoring function d M can be conveniently repre-sented by a scoring matrix M.The cost of a pair of symbols s1,s2under the scoring matrix M is d M(s1,s2).A gap is a string of the form∆i.Most scoring schemes used in practice are affine,i.e.,they specify afixed gap opening penalty g(possibly0)that is added to the score calculated according to d M for each newly created gap in the alignment.In this context,the numbers d M(s,∆)for s∈Σare called gap extension penalties.Note that if all gap extension penalties are zero,then we have a scoring scheme withfixed gap penalties.If d M(s,∆)>0for all s∈Σ,then we will say that the scoring scheme specifies strictly positive gap extension penalties.Both Tree Alignment and SP-Alignment problems have been proved to be NP-hard by Wang and Jiang[17].Hence reasearch has focused on heuristic algorithms or approximation algorithms for such problems and onfinding restrictions that are efficiently solvable.A restriction which has a natural interpretation is the one where the scoring function is a metric;moreover,optimal alignments in actual biological sequences tend to have relatively few gaps[18,7,4,14].Unfortunately even in such restricted cases the SP-problem is still NP-hard[6,12].In[12],several such modifications of the SP-Alignment problem were studied.In the Gap-0Alignment problem,spaces may be inserted at the beginning and at the end of sequences,but not between characters fromΣ,and the Gap-0-1Alignment problem is the restriction of Gap-0Alignment where at most one space can be inserted in each sequence.It turns out that SP-Alignment,Gap-0Alignment and Gap-0-1Alignment problems are all NP-hard for practically every affine scoring scheme with strictly positive gap extension penalties used by molecular biologists[12].This leaves open the case of other ways of calculating the gap penalties that are sometimes used in Molecular Biology.In particular,this leaves open the interesting case offixed gap penalties,where all gaps are penalized equally,no matter where they occur and how long they are.Moreover,it had been shown in[12]that for some scoring matrix M the three problems mentioned above are MAX SNP-hard.The scoring matrix M used in the the latter result does not penalize all character mismatches,and thus not metric.In [11],Jiang et al.ask whether a particular restriction of the SP-Alignment problem(the case of metric scoring matrix)has a PTAS.In our paper we will give a positive answer to a closely related question by showing that there is a restriction of the problem that does admit a PTAS;more precisely,a PTAS exists when the total number of spaces that can be inserted is bounded by a constant and the ratio of the costs between each pairwise alignment is in afixed interval.Our results trivially hold also for Gap-0-1 Alignment.Moreover,we will show that at least for some scoring scheme withfixed gap penalty,the Gap-0Alignment and the SP-Alignment problems are MAX SNP-hard. Since the optimal alignment in the example that yields the latter result contains only2Multiple Sequence Alignment as a Facility Location Problem one space in each sequence,the requirement of bounded cost ratio cannot be dropped from the construction of the PTAS we describe in the paper.2PreliminariesLetΣbe afinite alphabet,let∆/∈Σbe the space symbol,Modified definition of function from mathbb N to N let d M:(Σ∪∆)×(Σ∪∆)→N be a function called scoring function,and let g be a nonnegative integer constant called gap opening penalty.The symbolα(M)will denote the maximum value d M(a1,a2) between two different symbols a1,a2∈Σ∪{∆}.Given a sequence a overΣ∪{∆}, the symbol a[i]will denote the i-th character of a and|a|will denote the length n of a sequence a=a[1]...a[n].Then given two sequences s1=s1[1],···,s1[m], s2=s2[1],···,s2[m]of m symbols over(Σ∪∆),the cost of aligning s1and s2 is d M(s1,s2)=g(G1+G2)+ m i=1d M(s1[i],s2[i]),where G j is the number of gaps (consecutive runs of space symbols)in s j.Given a set<t1,...,t k>of sequences overthe alphabetΣ∪{∆},a multiple alignment is a set<at1,...,at k>of equal-length sequences(where at i stands for aligned t i)over the alphabetΣ∪{∆}such that each at i can be obtained from t i by inserting some space symbols into the sequences without altering the order of symbols of t i.Given two equal-length sequences at1,at2,their pairwise alignment is the pair of sequences bt1,bt2that is obtained from at1,at2byremoving all columns containing only∆s.If L is a nonnegative integer,By d optM,L (t1,t2)we will denote the minimum cost among all pairwise alignments of<t1,t2>that insert at most L spaces into each of the sequences t1,t2.The SP-Alignment problem for a given scoring scheme(d M,g)is tofind the multiple alignment<at1,...,at k> that minimizes SP(<at1,...at k>)= 1≤i<j≤k d(bt i,bt j)among all possible multiple alignments of<t1,...,t k>.Here we will study a restriction of SP-Alignment that captures to some extent the pattern of space insertions observed in real biomolecular sequences and is different from the restrictions studied in[12].A space-L alignment A of a set<t1,...,t k>of sequences is an alignment<at1,...,at k>of<t1,...,t k>such that|at i|≤|t i|+L for each sequence t i.Note that space-L-alignments exist only if the length of the shortest of these sequences is at least n−L,where n is the length of the longest among the sequences t1,...,t k.Please also note that there are no restriction about where the space symbols can be inserted.The Space-L Multiple Alignment problem asks tofind,for a set of sequences<t1,...,t k>and a scoring scheme(d M,g), a space-L multiple alignment that minimizes the SP-score with respect to(d M,g). Given an instance I=<<t1,...,t k>,(d M,g)>of the Space-L Multiple Alignment problem we define the variability of I,denoted by v(I),asv(I)=max{nα(M)+Lgd optM,L(t i,t j):1≤i<j≤k},Please note that the value v(I)of the instance I can be computed in polynomial time.The Space-L Multiple Alignment(σ)problem is the restriction of the Space-L Multiple Alignment problem to instances I with s(I)≤σ.A few comments are in order.The most common multiple alignment problem in Molecular Biology is the alignment of homologous protein sequences from different3Paper Submitted for PSCWspecies.For a pair<t i,t j>of such sequences,<a(i,j)t i,a(i,j)t j>will be small only if the sequences are very similar,which usually happens only if the two species of origin have a relatively recent(in the timescale of evolution)common ancestor, and will be close to the average distance of random sequences if the species diverged a long time ago,or if the optimal alignment requires more than L spaces.For scoring matrices used in practice,the average distance of random sequences is usually a number of about the same order of magnitude as nα(M).The algorithms used in practice for multiple sequence alignment tend to perform well if all sequences are closely related to each other,while ourfirst theorem covers one of the cases that are difficult in practice and quite common,namely the case where none of the sequences are closely related to each other.3The PTASThe main results of this section is the following:Theorem1Letσbe a constant.Then the Space-L Multiple Alignment(σ)problem has a polynomial time approximation scheme.Note that in the above theorem,the scoring scheme(d M,g)is considered part of the input,thus the theorem works for all affine scoring schemes,no matter whether the scoring function is a metric and the gap penalties arefixed or variable.This does not contradict the results about MAX SNP-hardness from[12]though,since the spread of the instances used to obtain the latter results was not bounded.Theorem1will be proved by reformulating it as a kind of facility location problem. To see the connection,suppose a communication network is to be set up in a country that consists of k regions.In each region,there should be one switchboard of the network,and each switchboard is to be connected by expensive,high quality cable to every other switchboard.If in each region there are several possible locations for the switchboard that are equally good for the operation of the network within this region,then the locations of switchboards should be chosen in such a way as to minimize overall cost of cable between them.The question of choosing optimal locations for the switchboards can then be formalized as follows.The Switchboard Location problem has as instance some disjoint sets R1,...,R k called regions,as well as a distance function d between all pairs of points x i,x j in R1∪···∪R k.The distance function gives strictly positive values whenever the two points are distinct.A feasible solution is a set<x1...,x k>of points such that x i∈R i for1≤i≤k.The problem asks for a feasible solution that minimizes 1≤i<j≤k d(x i,x j).While facility location problems with objective functions similar to those of Switch-board Location have been studied for regions of the real line(see e.g.[2],[16]),we are not aware of any published results concerning the general formulation of Switchboard Location given above.We will discuss later how instances of Space-L Alignment(σ)can be mapped to suitable instances of Switchboard Location in order to have a(1+ )approximation algorithm.Butfirst we have to introduce a restriction of Switchboard Location similar to the one introduced for Space-L Alignment.Let I={R1,...,R k,d}be an instance4Multiple Sequence Alignment as a Facility Location Problemof the Switchboard Location problem.We define the spread s (I )of I ass (I )=max {d (x i ,x j ):1≤i <j ≤k,x i ∈R i ,x j ∈R j }min {d (x i ,x j ):1≤i <j ≤k,x i ∈R i ,x j ∈R j }.It is immediate from the definition that s (i )≥1.For any pair of constants P,σ,the Switchboard Location P (σ)problem is the Switchboard Location problem restricted to instances of spread at most σand where each region contains at most P points.Theorem 2Let P,σbe two constants,then the Switchboard Location P (σ)problem admits a PTAS.Proof .The PTAS for Switchboard Location is based on the smooth polynomial pro-gramming technique of Arora et.al [3].We will briefly recall the relevant material from those papers.A c -smooth polynomial integer program (or PIP)is a problem of the formminimizep 0(x 1,...,x n )(1)subject to l j ≤p i (x 1,...,x n )≤u jx i ∈{0,1}for i ={1,...,n }where each p j is an n -variate polynomial of maximum degree d ,and each coefficient of each degree monomial (term)is at most c ·n d − .The fundamental result that we will use,Theorem 1.10of [3],asserts that,for eachδ>0,there exists an approximation algorithm running in time O (n 1δ2)that computesa 0/1assignment <y 1,...,y n >to the variables x i of a c -smooth PIP such that,for n -variate degree-d polynomials,the value of p 0(y 1,...,y n )is within an additive error is at most δn d of minimum for 0/1solutions that satisfy all constraints p 1,...,p m ,and such that <y 1,...,y n >satisfies each linear constraint within an additive error of O (δ√n log n ).Now let σbe a fixed constant,and suppose we have an instance I of the Switch-board Location P (σ)problem,where {R i :1≤i ≤k }are the regions of I ,and R i ={x i,j :1≤j ≤P }.(Since one can always add dummy points to the regions,we do not lose generality by assuming the regions to be exactly of cardinality P .)Let D be the value min {d (x i,j ,x h, ):1≤i <h ≤k,1≤j, ≤k }.Now we can formulate the Switchboard Location problem as PIP:minimize1≤h<i ≤k,1≤j, ≤k d (x i,j ,x h, )D y i,j y h, (2)subject to jky i,j =ki =1,...,ky i,j ∈{0,1}i =1,...,k ;j =1,...,P Please note that the total number of variables is at most kP .Since s (I )≤σ,all coefficients of the objective functions are between 1and σ.Thus the PIP is σ-smooth.Now suppose we want to find a solution to the Switchboard Location problem thatis within a factor of (1+ )of minimum.Setting δ= 2P 2,and running the algorithmof [3]on the PIP defined above,we find a 0/1solution that satisfies all constraints within an additive error of O (δ√kP log kP ).Since for 0/1solutions the left hand sides of our side constraints are multiples of k ;for sufficiently large k we can assume that5Paper Submitted for PSCWthese side constraints are satisfied exactly .But then for each region R i ,exactly one of the numbers y i,j is equal to 1.Thus the corresponding x i,j ’s form a feasible solution of instance I of the Switchboard Location problem,and the sum of the distances is within an additive error of D k 2 .By the choice of D ,the minimum value for the sum of all distances in any feasible solution of instance I of the Switchboard Location problem cannot be less than D k 2,and thus we have found,in polynomial time,an approximation within a factor of (1+ ).2Now let us show how Theorem 2implies Theorem 1.Suppose we are given an instance I =<<t 1,...,t k >,(d M ,g )>of the Space-L Alignment (σ)problem,and let >0.We want to find a space-L multiple alignment of these sequences that scores within (1+ )of optimum.Let N = 4Lσ and note that N is a constant.Let n be the length of the longest among the sequences t 1,...,t k ,and let K = 2N +gN α(M ) .First assume that n ≤K .In this case we let R i be the set of all sequences x i,j that are obtainable by inserting L spaces into t i (at the beginning,end,or between symbols).This set contains at most K +LL elements.Note that K +LL is a constant that does not depend on the number of sequences k .Thus the family {R i :1≤i ≤k }together with the distances d (x i,j ,x i ,j )defined by the scoring scheme is an instance of the Switchboard Location problem where the cardinality of all regions is bounded by the constant K +LL .Feasible solutions of the Switchboard Location problem are exactly all space-L alignments of our sequences,and the objective function of the Switchboard Location problem is exactly the SP-score of the alignment.Since the variability of the Space-L Alignment problem is bounded by σ,the spread of the corresponding Switchboard Location problem that we just constructed is also bounded by σ.Thus the PTAS for Switchboard Location (K +LL )(σ)finds a solution within (1+ )of optimum.Now assume that n >K .In this case we partition each sequence t i into consecu-tive chunks <s i,h :1≤h ≤N >,where the length of each chunk differs from n N byno more than 1.Modified definition of function from mathbb N to N With each function f :{1,...,N +1}→N such that 1≤i ≤N +1f (i )≤L we associate a sequence t i,f by inserting f (h )space symbols to the left of each chunk s i,h .In other words,t i,f =∆f (1)s i,1∆f (2)s i,2...∆f (N )s i,N ∆f (N +1)Now we let R i be the set of all t i,f for functions Modified definition of function from mathbb N to N f :{1,...,N +1}→N that satisfy 1≤i ≤N +1f (i )≤L .We run the approximation algorithm for Switchboard Location N +1(σ)that finds a solutionwithin (1+ 3)on the instance <R 1,...,R k ,(d M ,g )>.The algorithm returns a space-L multiple alignment <t 1,f 1,...,t k,f k >of the sequences <t 1,...,t k >.It remains to show that the alignment <t 1,f 1,...,t k,f k >scores within (1+ )of optimum.Let <at 1,...,at k >denote the space-L multiple alignment with optimal SP-score.For each i ,let Modified definition of function from mathbb N to N g i :{1,...,N +1}→N be the function such that for each 1≤i ≤k and 1≤h ≤N ,g i is equal to the number of spaces in at i inserted immediately to the left of the chunk s i,h or between characters of s i,h .Instead of t i,g i we will write bt i .Since bt i ∈R i for each i ,we haveSP (<t 1,f 1,...,t k,f k >)≤(1+ 3)SP (<bt 1,...,bt k >).6Multiple Sequence Alignment as a Facility Location Problem Since1+ >(1+ /2)(1+ /3)whenever <1,it now suffices to show thatSP(<bt1,...,bt k>)≤(1+2)SP(<at1,...,at k>).Let us split the sequences at i,bt i into N+1chunks at i,h,bt i,h for1≤h≤N+1 where bt i,h=∆g i(h)s i,h,s i,N+1is the empty string,and|bt i,h|=|at i,h|,so that at i= at i,1at i,2···at i,N+1and bt i=bt i,1bt i,2···bt i,N+1.From the definition of g i,whenever g i(h)=g j(h)=0,the pairwise alignment<at i,h,at j,h>is the same as<bt i,h,bt j,h>. Since at most L spaces are inserted into each sequence t i,and since the maximum penalty on each chunk(excluding the newly inserted spaces)is equal to the lengthof the chunk(i.e.at most nN +1)multiplied byα(M),and there are globally only Lextra spaces,we get the inequalityd M(bt i,bt j)≤d M(at i,at j)+α(M)L(2+nN)+Lg.Since n>2N+gNα(M)andα(M)LnN≥2Lα(M)+Lg,we get d M(bt i,bt j)≤d M(at i,at j)+α(M)n2LN.By the choice of N,the latter yieldsd M(bt i,bt j)≤d M(at i,at j)+α(M)n2σ.Since the spread of our instance was assumed to be at mostσ,we have d M(at i,at j)≥α(M)n,and we getd M(bt i,bt j)≤d M(at i,at j)(1+ 2 ),as required.4MAX SNP-hardnessThe following theorem shows that the assumption of bounded spread cannot be simply dropped in Theorem1.Theorem3There exists a scoring scheme(d M,g)withfixed gap penalties such that: (a)For the scoring scheme(d M,g)and for every L>0the Space-L Multiple Alignment problem is MAX SNP-hard.(b)For the scoring scheme(d M,g)the Gap-0Alignment problem is MAX SNP-hard.(c)For the scoring scheme(d M,g),the SP-Alignment problem is MAX SNP-hard.Here is the scoring scheme mentioned in the above theorem.The alphabet will be Σ={A,C,T},the gap opening penalty will be g=2,and the scoring matrix M will be:∆A C T∆0000A0001C0000T01007Paper Submitted for PSCWProof.We will prove Theorem3by reducing the Max Cut problem on cubic graphs (denoted by3-Max Cut)to the respective multiple alignment problems.Recall that an instance of size k of the3-Max Cut problem is a simple graph G=<V,E>such that |V|=k and each vertex of G has degree exactly3.The problem is tofind a partition of the set of vertices V into disjoint sets V0and V1such that the number of edges that connect a vertex in V0with a vertex in V1,i.e.,the size of the cut determined by<V0,V1>,is as large as possible.It is well known that the3-Max Cut problem is MAX SNP-hard[1].For our purposes,it is most important to note that the latter implies that there exists a real >0such that no polynomial-time approximation algorithm canfind a cut such that the number of edges that are NOT cut is within an additive constant of k of minimum.Given a cubic graph G=<V,E>with k vertices,we define a2k-tuple¯t G=< t1,...,t2k>of sequences as follows:Enumerate V={v1,...,v k},E={e0,...,e −1}. Each sequence t i will have length2(k4+ αk ),whereαis a constant that will be defined below.Intuitively speaking,for1≤i≤k,the sequence t i will encode the vertex v i.Edge e m={v i,v h}will be encoded by characters t h[j],t i[j],where j=2( αk m+1),...,2 αk (m+1).More precisely,we define t i[j],the j-th character in t i,as follows.For1≤m≤ ,e m={v h,v i},h<i, αk m<j< αk (m+1)we put:t h[2j]=A,t i[2j]=T,t p[2j]=C for p/∈{i,h}.The sequence t k+i will act as a“mirror image”of t i.The purpose of mirror images is to neutralize the effects of unbalanced cuts on the scores of aligments.For1≤i≤k and αk +ik3<j< αk +(i+1)k3we put:t i[2j]=A,t k+i[2j]=T,t p[2j]=C for p/∈{i,k+i}.For all p,j,we let t p[2j−1]=C.Let us illustrate this construction with a picture. We exhibit a situation where e m={v h,v i}.t[2( αk m+1)]t[2 αk +2(h−1)k3]t[2 αk +2(i−1)k3]↓↓↓t h:...A C A C A...A C A C A...C C C C C...t i:...T C T C T...C C C C C...A C A C A...t k+h:...C C C C C...T C T C T...C C C C C...t k+i:...C C C C C...C C C C C...T C T C T...Let us define a“benchmark alignment”of the above sequences.We will define this alignment by partitioning the sequences into two sets L and R and inserting one space to the left of each sequence in L and one space to the right of each sequence in R.Let<V1,V2>be a cut of G,we will show how to associate a benchmark alignment to such cut.For each1≤i≤k we let t i∈L ifft k+i∈R.Moreover for each1≤i≤k we let t i∈L iffv i∈V1.Note that the score for the benchmark alignment is4k2+αkU,where U is the number of edges that are not in the cut<V1,V2>.Moreover,the benchmark alignment is a gap-0-1alignment,and hence both a gap-0alignment and a space-1 alignment.We will show that there exists afixedδ>0such that if an alignment a of the above sequences is found that scores within a factor of(1+δ)of the benchmark alignment, then it will be possible to reconstruct,in polynomial time,from this alignment a8Multiple Sequence Alignment as a Facility Location Problem partition of the vertex set that induces a cut whose size is within a additive constant of k of maximum.Suppose we have any alignment a that scores within1+δof our benchmark alignment,whereδis sufficiently small and will be determined later.Let us say that a sequence pair<t p,t q>is static in a if there is no space in the induced pairwise alignment<bt p,bt q>.Being static in a is easily seen to be an equivalence relation.Let T1and T2denote the two largest equivalence classes of the“static”relation,and let T3denote the set of sequences that are neither in T1nor in T2.Note that none of the sequence pairs<t i,t k+i>can be static in a,otherwise the cost of the alignment of<t i,t k+i>is too large.Thus the size of T1and T2is at most k.Let |T1|=k−k1,|T2|=k−k2.Then|T3|=k1+k2.Since each pair of sequences from different equivalence classes adds at least4to the SP-score of a,we haveSP(<at1,...,at2k>)≥4((k−k1)(k−k2)+(k−k1)(k1+k2)+(k−k2)(k1+k2))≥4(k2+k1k2+k(k1+k2)−(k1+k2)2)=4(k2+k1k2+(k−|T3|)|T3|).Thus the numbers k1and k2must be such that k1k2+(k−|T3|)|T3|<δk2+δαkU,where U is the number of edges that are not cut by the partition used in the benchmarkalignment.Note that U≤3k.We will chooseδ< α100.It follows that as long asαissufficiently small,we can assume that|T3|<k6.Now letα,δbe as above,and letV i be the set of all vertices such that t i∈T i for i∈{1,2}.Consider the partition <V1,V\V1>.Let W be the number of edges of G that are not cut by<V1,V\V1>. Note that this number differs from the number Z of edges{v i,v j}such that<t i,t j> is static by at most3|T3|,since every edge in the difference must have an endpoint in T3and the degree of the graph is3.If the SP-score of the alignment is within a factor of(1+δ)of that of the benchmark alignment,then we have:4k2+αkW≤4k2+αk(Z+k2)≤(1+δ)(4k2+αkU)+α2k2.By the choice ofδand since U≤3k,we getαkW−αkU<4δk2+δαkU+α2k2.Assuming,as we may,thatα≤1,and noting that U≤3k,our choice ofδgives:W−U<4100k+3100αk+2k< k.2The following results on hardness of Switchboard Location problems are not covered by Theorem3.Theorem4For every constantσ>1,the Switchboard Location2(σ)problem is NP-hard.Proof.Letσ>1.Since the number of instances of Switchboard Location2(σ)in-creases withσ,we may without loss of generality assume thatσ≤2.We prove the theorem by reducing the Max-Cut problem to Switchboard Location(2(σ).Given a graph G=<V,E>with vertices V={v1,...,v k},construct a metric space X={x1,...,x k,y1,...,y k}as follows:For i=j,we let d(x i,x j)=d(y i,y j)=1.9Paper Submitted for PSCWIf {v i ,v j }∈E ,then d (x i ,y j )=σ;if {v i ,v j }/∈E ,then d (x i ,y j )=1.(Note that for our choice of σ,the distance function is actually a metric.)For 1≤i ≤k ,the region R i is defined as {x i ,y i }.This gives us an instance I of the Switchboard Location (2(σ))problem.Every solution ¯x of I induces a partition <V x ,V y >,where V x ={v i :x i ∈¯x }and V y ={v i :y i ∈¯x }.If c ¯x denotes the size of the cut induced by the partition <V x ,V y >,then the measure of ¯x is equal to k 2+(σ−1)(|E |−c ¯x ) ,and the theorem follows from NP-hardness of the Max-Cut problem (see [8]).2Theorem 5The Switchboard Location 2problem is MAX SNP-hard.In view of our observation that Gap-0-1Alignment is a special case of Switchboard Location,Theorem 5is a corollary of Theorem 3(c)of [12].AcknowledgementsWe thank Tao Jiang for helpful discussions about this paper and Arie Tamir for bringing references [2]and [16]to our attention.References[1]P.Alimonti and V.Kann.Hardness of approximating problems on cubic graphs.CIAC 97,288-298,volume 1203of LNCS,1997.[2]E.M.Arkin and M.Hassin.Minimum diameter covering problems.Technicalreport,1997.[3]S.Arora, D.Karger,and M.Karpinski.Polynomial Time ApproximationSchemes for Dense Instances of N P -Hard puter and Systems Sci.,58,193-210,1999.[4]S.A.Benner,M.A.Cohen,and G.H.Gonnet.Empirical and Structural Modelsfor Insertions and Deletions in the Divergent Evolution of Proteins.J.Mol.Biol.229:1065-1082,1993.[5]P.Bonizzoni and G.Della Vedova.The complexity of multiple alignment withSP-score that is a metric.To appear in Theoretical Computer Science.[6]W.M.Fitch.Letter to the Editor:Commentary on the letter by Ward C.Wheeler.Mol.Biol.Evol.10(3):713-714,1993.[7]M.R.Garey and puters and Intractability:A Guide to theTheory of NP-Completeness.Freeman 1979.[8]T.Jiang,P.Kearney,and M.Li.Orchestrating Quartets:Approximation andData Correction.Proc.39th IEEE Symp.on Found.of Comp.Science,416-425,1998.[9]T.Jiang,P.Kearney,and M.Li.Some Open Problems in Computational Molec-ular Biology.SIGACT News 30(3):43-49,1999.10。