2012 Bioinformatics Exam (English)

Compendium of English Bioinformatics Terms and Definitions

Abstract Syntax Notation (ASN.1): The internal format used by many programs developed at NCBI, such as Cn3D for displaying three-dimensional protein structures. A language that is used to describe structured data types formally. Within bioinformatics, it has been used by the National Center for Biotechnology Information to encode sequences, maps, taxonomic information, molecular structures, and bibliographic information in such a way that it can be easily accessed and exchanged by computer software.

Accession number: A unique identifier that is assigned to a single database entry for a DNA or protein sequence.

Affine gap penalty: A gap penalty score that is a linear function of gap length, consisting of a gap opening penalty plus a gap extension penalty multiplied by the length of the gap. Using this penalty scheme greatly enhances the performance of dynamic programming methods for sequence alignment. See also Gap penalty.

Algorithm: A systematic procedure for solving a problem in a finite number of steps, typically involving a repetition of operations. Once specified, an algorithm can be written in a computer language and run as a program.

Alignment: The procedure of comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences. Of the two types of alignment, local and global, a local alignment is generally the more useful. See also Local and Global alignments.

Alignment score: An algorithmically computed score based on the number of matches, substitutions, insertions, and deletions (gaps) within an alignment. Scores for matches and substitutions are derived from a scoring matrix such as the BLOSUM and PAM matrices for proteins, and affine gap penalties suitable for the matrix are chosen. Alignment scores are in log-odds units, often bit units (log to the base 2). Higher scores denote better alignments.
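To make the Affine gap penalty and Alignment score entries concrete, here is a minimal sketch that scores an alignment that has already been made. It assumes a simple match/mismatch scheme in place of a BLOSUM or PAM matrix, and the function name `alignment_score` and all default penalty values are illustrative choices, not values taken from this glossary. A gap of length L costs the opening penalty plus L times the extension penalty.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Score two aligned rows of equal length; '-' marks a gap position.

    Hypothetical scoring scheme: a flat match/mismatch score stands in
    for a substitution matrix, and gaps use an affine penalty.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    prev_gap = None  # row ("a" or "b") holding the gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            if side != prev_gap:
                score += gap_open      # charged once per gap
            score += gap_extend        # charged for every gapped position
            prev_gap = side
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score


if __name__ == "__main__":
    print(alignment_score("ACGT--A", "ACGTTTA"))
```

With these assumed defaults, one gap of length 2 is penalized less than two separate gaps of length 1, which is the behaviour the affine scheme is meant to produce.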
See also Similarity score, Distance in sequence analysis.

Alphabet: The total number of distinct symbols used in a sequence: 4 for DNA sequences and 20 for protein sequences.

Annotation: The prediction of genes in a genome, including the location of protein-encoding genes, the sequence of the encoded proteins, any significant matches to other proteins of known function, and the location of RNA-encoding genes. Predictions are based on gene models, e.g., hidden Markov models of introns and exons in protein-encoding genes, and models of secondary structure in RNA.

Anonymous FTP: When an FTP service allows anyone to log in, it is said to provide anonymous FTP service. A user can log in to an anonymous FTP server by typing anonymous as the user name and an e-mail address as the password. Most Web browsers now negotiate anonymous FTP logon without asking the user for a user name and password. See also FTP.

ASCII: The American Standard Code for Information Interchange (ASCII) encodes unaccented letters a-z, A-Z, the numbers 0-9, most punctuation marks, space, and a set of control characters such as carriage return and tab. ASCII specifies 128 characters that are mapped to the values 0-127. ASCII files are commonly called plain text, meaning that they only encode text without extra markup.

BAC clone: Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100–200 kb. Most of the large-insert clones sequenced in the project were BAC clones.

Back-propagation: When training feed-forward neural networks, a back-propagation algorithm can be used to modify the network weights. After each training input pattern is fed through the network, the network's output is compared with the desired output and the amount of error is calculated. This error is back-propagated through the network by using an error function to correct the network weights.
See also Feed-forward neural network.

Baum-Welch algorithm: An expectation maximization algorithm that is used to train hidden Markov models.

Bayes' rule: Forms the basis of conditional probability by calculating the likelihood of an event occurring based on the history of the event and relevant background information. In terms of two parameters A and B, the theorem is stated in an equation: the conditional probability of A given B, P(A|B), is equal to the probability of A, P(A), times the conditional probability of B given A, P(B|A), divided by the probability of B, P(B); that is, P(A|B) = P(A) P(B|A) / P(B). P(A) is the historical or prior distribution value of A, P(B|A) is a new prediction for B for a particular value of A, and P(B) is the sum of the newly predicted values for B. P(A|B) is a posterior probability, representing a new prediction for A given the prior knowledge of A and the newly discovered relationships between A and B.

Bayesian analysis: A statistical procedure used to estimate parameters of an underlying distribution based on an observed distribution. See also Bayes' rule.

Biochips: Miniaturized arrays of large numbers of molecular substrates, often oligonucleotides, in a defined pattern. They are also called DNA microarrays and microchips.

Bioinformatics: The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology. Alternatively, the discipline of obtaining information from genomic or protein sequence data. This may involve similarity searches of databases, comparing an unidentified sequence to the sequences in a database, or making predictions about the sequence based on current knowledge of similar sequences. Databases are frequently made publicly available through the Internet, or locally at an institution.

Bit score: The value S' is derived from the raw alignment score S by taking into account the statistical properties of the scoring system used.
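The normalization described in the Bit score entry is commonly written as S' = (lambda * S - ln K) / ln 2, where lambda and K are the statistical parameters of the scoring system. The glossary does not state this formula, so the sketch below supplies it, along with the related estimate E = m * n * 2^(-S'), as background from the standard Karlin-Altschul statistics. The function names and the parameter values in the usage example are illustrative assumptions.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S into a bit score S'.

    lam and k are the Karlin-Altschul parameters of the scoring system;
    the caller must supply them (they are not given in the glossary).
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance alignments with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    # Illustrative parameter values only; real values depend on the
    # substitution matrix and gap penalties in use.
    s_prime = bit_score(raw_score=100, lam=0.267, k=0.041)
    print(s_prime, expect_value(s_prime, query_len=250, db_len=1_000_000))
```

Because lambda and K are divided out, two bit scores can be compared even if the searches used different scoring matrices.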
Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.

Bit units: From information theory, a bit denotes the amount of information required to distinguish between two equally likely possibilities. The number of bits of information, N, required to convey a message that has M possibilities is log2 M = N bits.

BLAST: Basic Local Alignment Search Tool, one of the main database search programs. A set of programs used to perform fast similarity searches. Nucleotide sequences can be compared with nucleotide sequences in a database using BLASTN, for example. Complex statistics are applied to judge the significance of each match. Reported sequences may be homologous to, or related to, the query sequence. The BLASTP program is used to search a protein database for a match against a query protein sequence. There are several other flavours of BLAST. BLAST2 is a newer release of BLAST that allows for insertions or deletions in the sequences being aligned. Gapped alignments may be more biologically significant.

Block: Conserved ungapped patterns approximately 3-60 amino acids in length in a set of related proteins.

BLOSUM matrices: Blocks substitution matrices, one of the main families of substitution matrices. An alternative to PAM tables, BLOSUM tables were derived using local multiple alignments of more distantly related sequences than were used for the PAM matrix. These are used to assess the similarity of sequences when performing alignments.

Boltzmann distribution: Describes the number of molecules that have energies above a certain level, based on the Boltzmann constant and the absolute temperature.

Boltzmann probability function: See Boltzmann distribution.

Bootstrap analysis: A method for testing how well a particular data set fits a model. For example, the validity of the branch arrangement in a predicted phylogenetic tree can be tested by resampling columns in a multiple sequence alignment to create many new alignments.
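The column resampling step in the Bootstrap analysis entry can be sketched as follows. This is a minimal illustration that assumes the alignment is held as a list of equal-length strings. The function name `bootstrap_alignment` is hypothetical, and tree building on each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of a multiple sequence alignment.

    Columns are drawn with replacement until the replicate has as many
    columns as the original; each row keeps its identity.
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    picked = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in picked) for row in alignment]


if __name__ == "__main__":
    aln = ["ACGT-A", "ACGTTA", "TCGATA"]
    rng = random.Random(42)  # fixed seed so the run is repeatable
    for _ in range(3):
        print(bootstrap_alignment(aln, rng))
```

In practice one would generate many replicates, often hundreds, and build a tree from each.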
The appearance of a particular branch in trees generated from these resampled sequences can then be measured. Alternatively, a sequence may be left out of an analysis to determine how much the sequence influences the results of an analysis.

Branch length: In sequence analysis, the number of sequence changes along a particular branch of a phylogenetic tree.

CDS or cds: Coding sequence.

Chebyshev's inequality: The probability that a random variable deviates from its mean by k or more standard deviations is less than or equal to 1/k^2, the square of 1 over the number of standard deviations from the mean.

Clone: Population of identical cells or molecules (e.g., DNA), derived from a single ancestor.

Cloning vector: A molecule that carries a foreign gene into a host and allows or facilitates the multiplication of that gene in the host. When sequencing a gene that has been cloned using a cloning vector (rather than by PCR), care should be taken not to include the cloning vector sequence when performing similarity searches. Plasmids, cosmids, phagemids, YACs and PACs are example types of cloning vectors.

Cluster analysis: A method for grouping together a set of objects that are most similar from a larger group of related objects. The relationships are based on some criterion of similarity or difference. For sequences, a similarity or distance score or a statistical evaluation of those scores is used.

Cobbler: A single sequence that represents the most conserved regions in a multiple sequence alignment. The BLOCKS server uses the cobbler sequence to perform a database similarity search as a way to reach sequences that are more divergent than would be found using the single sequences in the alignment for searches.

Coding system (neural networks): Regarding neural networks, a coding system needs to be designed for representing input and output.
The level of success found when training the model will be partially dependent on the quality of the coding system chosen.

Codon usage: Analysis of the codons used in a particular gene or organism.

COG: Clusters of Orthologous Groups, a set of groups of related sequences in microorganisms and yeast (S. cerevisiae). These groups are found by whole-proteome comparisons and include orthologs and paralogs. See also Orthologs and Paralogs.

Comparative genomics: A comparison of gene numbers, gene locations, and biological functions of genes in the genomes of diverse organisms, one objective being to identify groups of genes that play a unique biological role in a particular organism.

Complexity (of an algorithm): Describes the number of steps required by the algorithm to solve a problem as a function of the amount of data; for example, the length of sequences to be aligned.

Conditional probability: The probability of a particular result (or of a particular value of a variable) given one or more events or conditions (or values of other variables).

Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.

Consensus: A single sequence that represents, at each successive position, the variation found within the corresponding column of a multiple sequence alignment.

Context-free grammars: A recursive set of production rules for generating patterns of strings. These consist of a set of terminal characters that are used to create strings, a set of nonterminal symbols that correspond to rules and act as placeholders for patterns that can be generated using terminal characters, a set of rules for replacing nonterminal symbols with terminal characters, and a start symbol.

Contig: A set of clones that can be assembled into a linear order. A DNA sequence that overlaps with another contig.
The full set of overlapping sequences (contigs) can be put together to obtain the sequence for a long region of DNA that cannot be sequenced in one run in a sequencing assay. Important in genetic mapping at the molecular level.

CORBA: The Common Object Request Broker Architecture (CORBA) is an open industry standard for working with distributed objects, developed by the Object Management Group. CORBA allows the interconnection of objects and applications regardless of computer language, machine architecture, or geographic location of the computers.

Correlation coefficient: A numerical measure, falling between -1 and 1, of the degree of the linear relationship between two variables. A positive value indicates a direct relationship, a negative value indicates an inverse relationship, and the distance of the value away from zero indicates the strength of the relationship. A value near zero indicates no relationship between the variables.

Covariation (in sequences): Coincident change at two or more sequence positions in related sequences that may influence the secondary structures of RNA or protein molecules.

Coverage (or depth): The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a "high-quality base" is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).

Database: A computerized storehouse of data that provides a standardized way for locating, adding, removing, and changing data. See also Object-oriented database, Relational database.

Dendrogram: A form of a tree that lists the compared objects (e.g., sequences or genes in a microarray analysis) in a vertical order and joins related ones by levels of branches extending to one side of the list.

Depth: See Coverage.

Dirichlet mixtures: Defined as the conjugate prior of a multinomial distribution.
One use is for predicting the expected pattern of amino acid variation found in the match state of a hidden Markov model (representing one column of a multiple sequence alignment of proteins), based on prior distributions found in conserved protein domains (blocks).

Distance in sequence analysis: The number of observed changes in an optimal alignment of two sequences, usually not counting gaps.

DNA sequencing: The experimental process of determining the nucleotide sequence of a region of DNA. This is done by labelling each nucleotide (A, C, G or T) with either a radioactive or fluorescent marker which identifies it. There are several methods of applying this technology, each with its advantages and disadvantages. For more information, refer to a current textbook. High-throughput laboratories frequently use automated sequencers, which are capable of rapidly reading large numbers of templates. Sometimes the sequences may be generated more quickly than they can be characterised.

Domain: A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.

Dot matrix: Dot matrix diagrams provide a graphical method for comparing two sequences. One sequence is written horizontally across the top of the graph and the other along the left-hand side. Dots are placed within the graph at the intersection of the same letter appearing in both sequences. A series of diagonal lines in the graph indicates regions of alignment.
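The dot matrix construction can be sketched in a few lines. This is a minimal illustration, not a reference implementation. The function names `dot_matrix` and `render` are hypothetical, and the optional `window` and `min_matches` parameters are an assumed way of requiring a minimum number of matches within a sliding window before a dot is drawn.

```python
def dot_matrix(seq_x, seq_y, window=1, min_matches=1):
    """Return the set of (row, col) positions that receive a dot.

    seq_x runs across the top (columns) and seq_y down the left (rows).
    A dot is placed at (i, j) when the windows starting at seq_y[i] and
    seq_x[j] agree in at least min_matches positions.
    """
    dots = set()
    for i in range(len(seq_y) - window + 1):
        for j in range(len(seq_x) - window + 1):
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the matrix as text, with '*' for a dot and '.' for no dot."""
    lines = ["  " + seq_x]
    for i, ch in enumerate(seq_y):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_x)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    x, y = "GATTACAGATTACA", "GATTACA"
    print(render(x, y, dot_matrix(x, y, window=3, min_matches=3)))
```

With `window=1` every single-letter match is drawn; raising `window` and `min_matches` suppresses isolated dots so that diagonals stand out.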
The matrix may be filtered to reveal the most-alike regions by scoring a minimal threshold number of matches within a sequence window.

Draft genome sequence: The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.

DUST: A program for filtering low-complexity regions from nucleic acid sequences.

Dynamic programming: A dynamic programming algorithm solves a problem by combining solutions to subproblems that are computed once and saved in a table or matrix. Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found. This algorithm is used for producing sequence alignments, given a scoring system for sequence comparisons.

EMBL: European Molecular Biology Laboratory. Maintains the EMBL database, one of the major public nucleic acid sequence databases.

EMBnet: European Molecular Biology Network. Established in 1988, it provides services including local molecular databases and software for molecular biologists in Europe. There are several large outposts of EMBnet, including ExPASy.

Entropy: From information theory, a measure of the unpredictable nature of a set of possible elements. The higher the level of variation within the set, the higher the entropy.

Erdős and Rényi law: In a series of tosses of a "fair" coin, the length of the longest run of heads that can be expected is the logarithm of the number of tosses to the base 2. The law may be generalized for more than two possible outcomes by changing the base of the logarithm to the number of outcomes.
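The logarithmic relationship in the Erdős and Rényi law can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, all outcomes are assumed to be equally likely, and the simulation covers the two-outcome coin case only.

```python
import math
import random


def expected_longest_run(n_trials, n_outcomes=2):
    """Predicted longest run: log of n_trials to the base n_outcomes."""
    return math.log(n_trials) / math.log(n_outcomes)


def longest_run(flips, target=True):
    """Length of the longest unbroken run of `target` in a sequence."""
    best = current = 0
    for flip in flips:
        current = current + 1 if flip == target else 0
        best = max(best, current)
    return best


def simulate(n_trials, repeats, rng):
    """Average longest run of heads over `repeats` fair-coin experiments."""
    total = 0
    for _ in range(repeats):
        flips = [rng.random() < 0.5 for _ in range(n_trials)]
        total += longest_run(flips)
    return total / repeats


if __name__ == "__main__":
    n = 1024
    print("predicted:", expected_longest_run(n))
    print("simulated:", simulate(n, repeats=200, rng=random.Random(1)))
```

The simulated average should land near the prediction rather than exactly on it, since the law describes the expected order of magnitude of the longest run.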
This law was used to analyze the number of matches and mismatches that can be expected between random sequences as a basis for scoring the statistical significance of a sequence alignment.

EST: See Expressed Sequence Tag.

Expect value (E): The E value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. In a database similarity search, it is the number of alignments scoring as well as the one found between a query sequence and a database sequence that would be expected in as many comparisons between random sequences as were made to find the matching sequence. In other types of sequence analysis, E has a similar meaning.

Expectation maximization (sequence analysis): An algorithm for locating similar sequence patterns in a set of sequences. A guessed alignment of the sequences is first used to generate an expected scoring matrix representing the distribution of sequence characters in each column of the alignment. This pattern is matched to each sequence, and the scoring matrix values are then updated to maximize the alignment of the matrix to the sequences. The procedure is repeated until there is no further improvement.

Exon: Coding region of DNA. See CDS.

Expressed Sequence Tag (EST): Randomly selected, partial cDNA sequence; represents its corresponding mRNA. dbEST is a large database of ESTs at GenBank, NCBI.

FASTA: One of the main database search programs and the first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt".
The sensitivity and speed of the search are inversely related and controlled by the "k-tup" variable, which specifies the size of a "word". (Pearson and Lipman)

Extreme value distribution: Some measurements are found to follow a distribution that has a long tail which decays at high values much more slowly than that found in a normal distribution. This slow-falling type is called the extreme value distribution. The alignment scores between unrelated or random sequences are an example. These scores can reach very high values, particularly when a large number of comparisons are made, as in a database similarity search. The probability of a particular score may be accurately predicted by the extreme value distribution, which follows a double negative exponential function after Gumbel.

False negative: A negative data point in a data set that was incorrectly reported because of a failure of the test. If the test had correctly measured the data point, the data would have been recorded as positive.

False positive: A positive data point in a data set that was incorrectly reported because of a failure of the test. If the test had correctly measured the data point, the data would have been recorded as negative.

Feed-forward neural network: Organizes nodes into sequential layers in which the nodes in each layer are fully connected with the nodes in the next layer, except for the final output layer. Input is fed from the input layer through the layers in sequence in a "feed-forward" direction, resulting in output at the final layer. See also Neural network.

Filtering (window size): During pair-wise sequence alignment using the dot matrix method, random matches can be filtered out by using a sliding window to compare the two sequences. Rather than comparing a single sequence position at a time, a window of adjacent positions in the two sequences is compared and a dot, indicating a match, is generated only if a certain minimal number of matches occur.

Filtering: Also known as masking.
The process of hiding regions of (nucleic acid or amino acid) sequence having characteristics that frequently lead to spurious high scores. See SEG and DUST.

Finished sequence: Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.

Fourier analysis: Studies the approximation and decomposition of functions using trigonometric polynomials.

Format (file): Different programs require that information be specified to them in a formal manner, using particular keywords and ordering. This specification is a file format.

Forward-backward algorithm: Used to train a hidden Markov model by aligning the model with training sequences. The resulting state probabilities are then used to refine the model so that it fits the given data better, as in the Baum-Welch algorithm.

FTP (File Transfer Protocol): Allows a person to transfer files from one computer to another across a network using an FTP-capable client program. The FTP client program can only communicate with machines that run an FTP server. The server, in turn, will make a specific portion of its file system available for FTP access, provided that the client is able to supply a recognized user name and password to the server.

Full shotgun clone: A large-insert clone for which full shotgun sequence has been produced.

Functional genomics: Assessment of the function of genes identified by between-genome comparisons. The function of a newly identified gene is tested by introducing mutations into the gene and then examining the resultant mutant organism for an altered phenotype.

Gap: A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score.
Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.

Gap penalty: A numeric score used in sequence alignment programs to penalize the presence of gaps within an alignment. The value of a gap penalty affects how often gaps appear in alignments produced by the algorithm. Most alignment programs suggest gap penalties that are appropriate for particular scoring matrices.

Genetic algorithm: A kind of search algorithm that was inspired by the principles of evolution. A population of initial solutions is encoded and the algorithm searches through these by applying a pre-defined fitness measurement to each solution, selecting those with the highest fitness for reproduction. New solutions can be generated during this phase by crossover and mutation operations applied to the encoded solutions.

Genetic map: A genome map in which polymorphic loci are positioned relative to one another on the basis of the frequency with which they recombine during meiosis. The unit of distance is the centimorgan (cM), denoting a 1% chance of recombination.

Genome: The genetic material of an organism, contained in one haploid set of chromosomes.

Gibbs sampling method: An algorithm for finding conserved patterns within a set of related sequences. A guessed alignment of all but one sequence is made and used to generate a scoring matrix that represents the alignment. The matrix is then matched to the left-out sequence, and a probable location of the corresponding pattern is found. This prediction is then input into a new alignment and another scoring matrix is produced and tested on a new left-out sequence.
The process is repeated until there is no further improvement in the matrix.

Global alignment: Attempts to match as many characters as possible, from end to end, in a set of two or more sequences.

Gopher: A document publishing system that allows text files to be retrieved and displayed.

Graph theory: A branch of mathematics which deals with problems that involve a graph or network structure. A graph is defined by a set of nodes (or points) and a set of arcs (lines or edges) joining the nodes. In sequence and genome analysis, graph theory is used for sequence alignments and for clustering similar genes.

GSS: Genome survey sequence.

GUI: Graphical user interface.

H: H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990). H can be thought of as a measure of the average information (in bits) available per position that distinguishes an alignment from chance. At high values of H, short alignments can be distinguished from chance, whereas at lower H values, a longer alignment may be necessary (Altschul, 1991).

Half-bits: Some scoring matrices are in half-bit units. These units are logarithms to the base 2 of odds scores, times 2.

Heuristic: A procedure that progresses along empirical lines by using rules of thumb to reach a solution. The solution is not guaranteed to be optimal.

Hexadecimal system: The base 16 counting system that uses the digits 0-9 followed by the letters A-F.

HGMP: Human Genome Mapping Project.

Hidden Markov Model (HMM): In sequence analysis, an HMM is usually a probabilistic model of a multiple sequence alignment, but can also be a model of periodic patterns in a single sequence, representing, for example, patterns found in the exons of a gene. In a model of multiple sequence alignments, each column of symbols in the alignment is represented by a frequency distribution of the symbols called a state, and insertions and deletions by other states.
One then moves through the model along a particular path from state to state trying to match a given sequence. The next matching symbol is chosen from each state, recording its probability (frequency) and also the probability of going to that particular state from a previous one (the transition probability). State and transition probabilities are then multiplied to obtain a probability of the given sequence. Generally speaking, an HMM is a statistical model for an ordered sequence of symbols, acting as a stochastic state machine that generates a symbol each time a transition is made from one state to the next. Transitions between states are specified by transition probabilities.

Hidden layer: An inner layer within a neural network that receives its input from, and sends its output to, other layers within the network. One function of the hidden layer is to detect covariation within the input data, such as patterns of amino acid covariation that are associated with a particular type of secondary structure in proteins.

Hierarchical clustering: The clustering or grouping of objects based on some single criterion of similarity or difference. An example is the clustering of genes in a microarray experiment based on the correlation between their expression patterns. The distance method used in phylogenetic analysis is another example.

Hill climbing: A nonoptimal search algorithm that selects the single best possible solution at a given state or step. The result may be a locally best solution that is not a globally best solution.

Homology: A similar component in two organisms (e.g., genes with strongly similar sequences) that can be attributed to a common ancestor of the two organisms during evolution.

Horizontal transfer: The transfer of genetic material between two distinct species that do not ordinarily exchange genetic material.
The transferred DNA becomes established in the recipient genome and can be detected by a novel phylogenetic history and codon content compared to the rest of the genome.

HSP: High-scoring segment pair. Local alignments with no gaps that achieve one of the top alignment scores in a given search.

HTGS/HTG: High-throughput genome sequences.

Bioinformatics Exam Reference Questions

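1. When running a BLAST sequence comparison at NCBI, you must enter information about the query sequence. Which of the following is NOT a valid input format? ( C )
A. The accession number of the sequence
B. The gi of the sequence
C. The ID of the gene corresponding to the sequence
D. A sequence in FASTA format

2. The following sequence is: ( B )
>gi|24646620|ref|NM_057587.3| Drosophila melanogaster RNA-binding protein 4 CG9654-RA, transcript variant A (Rbp4), mRNA
GGATTTTCTTGCCTGTCATTCAATTTGTGGTTGGCTTCACCTGAGTGCTGTAGT
A. A DNA sequence
B. An RNA sequence
C. A protein sequence
D. A gene

3. What type of service does the ProtParam tool on ExPASy provide? ( B )
A. Protein tertiary structure analysis
B. Prediction of physicochemical properties from a protein sequence
C. Protein secondary structure analysis
D. Transmembrane structure analysis

4. Suppose you have two distantly related proteins. Which scoring matrices are best for comparing them? ( A )
A. BLOSUM45 or PAM250
B. BLOSUM45 or PAM1
C. BLOSUM80 or PAM250
D. BLOSUM10 or PAM1

5. To construct a phylogenetic tree, you should use: ( C )
A. BLAST
B. FASTA
C. UPGMA
D. Entrez

6. What format is the following protein sequence in? ( D )
>gi|4506183|ref|NP_002779.1| proteasome alpha 3 [Homo sapiens]
MSSIGTGYDLSASTFSPDGRVFQVEYAMKAVENSSTAIGIRCKDGVVFGVEKLVLSKLYEEGSNKRLFNVDRHVGMAVAGLLADARSLADIAREEASNFRSNFGYNIPLKHLADRVAMYVHAYTLYSAVRPFGCSFMLGS
A. GBFF
B. TEXT
C. PDB
D. FASTA

7. Orthologs are defined as: ( A )
A. Homologous sequences in different species that share a common ancestor
B. Homologous sequences with low amino acid identity but high structural similarity
C. Homologous sequences within the same species that arose by gene duplication
D. Homologous sequences within the same species that have similar, usually redundant, functions

8. The DNA sequence database maintained and provided by the US NIH is: ( A )
A. GenBank
B. Protein
C. dbEST
D. dbSNP

9. The English abbreviation for a high-scoring segment pair is: ( A )
A. HSP
B. HMP
C. HCP
D. HDP

10. A BLAST report includes a statistical value, the E value. How does its size relate to the quality of the match? ( B )
A. The smaller the value, the poorer the match
B. The smaller the value, the better the match
C. The two are not inherently related
D. None of the above

11. NCBI provides many sequence analysis tools. Which one is used to find potential protein-coding regions in a DNA sequence? ( A )
A. ORF Finder
B. BLAST
C. Scan Prosite
D. SAGEmap

12. Entrez is the retrieval system for the databases of which site? ( A )
A. NCBI
B. PROSITE
C. EBI
D. PDB

13. If you want to find a protein that is distantly related to a query protein, which of the following approaches is most likely to succeed? ( B )
A. PHI-BLAST, because you can choose a signal sequence related to the query protein yourself
B. PSI-BLAST, because this algorithm uses position-specific scoring matrices and is the most sensitive
C. BLASTP, because you can adjust the scoring matrix to maximize the sensitivity of the search
D. A specialized species database, because it may contain such distantly related sequences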

生物信息学主要英文术语及释义

生物信息学主要英文术语及释义生物信息学主要英文术语及释义Abstract Abstract Syntax Syntax Syntax Notation Notation Notation (ASN.l)(ASN.l)（NCBI 发展的许多程序，如显示蛋白质三维结构的Cn3D等所使用的内部格式）等所使用的内部格式） A A language language language that that that is is is used used used to to to describe describe describe structured structured structured data data data types types types formally, formally, formally, Within Within Within bioinformatits,it bioinformatits,it bioinformatits,it has has been been used used used by by by the the the National National National Center Center Center for for for Biotechnology Biotechnology Biotechnology Information Information Information to to to encode encode encode sequences, sequences, sequences, maps, maps, taxonomic information, molecular structures, and biographical information in such a way that it can be easily accessed and exchanged by computer software. Accession number （记录号）（记录号）A unique identifier that is assigned to a single database entry for a DNA or protein sequence. Affine gap penalty （一种设置空位罚分策略）（一种设置空位罚分策略）（一种设置空位罚分策略） A gap penalty score that is a linear function of gap length, consisting of a gap opening penalty and a a gap gap gap extension extension extension penalty penalty penalty multiplied multiplied multiplied by by by the the the length length length of of of the the the gap. gap. gap. Using Using Using this this this penalty penalty penalty scheme scheme scheme greatly greatly enhances enhances the the the performance performance performance of of of dynamic dynamic dynamic programming programming programming methods methods methods for for for sequence sequence sequence alignment. alignment. alignment. See See See also also Gap penalty. Algorithm （算法）（算法）A A systematic systematic systematic procedure procedure procedure for solving for solving a a problem problem problem in in in a a a finite finite finite number number number of of of steps, steps, steps, typically typically typically involving involving involving a a repetition of operations. 
Once specified, an algorithm can be written in a computer language and run as a program. Alignment （联配/比对/联配）联配）Refers to the procedure of comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences. Of the two types of alignment, alignment, local local local and and and global, global, global, a a a local local local alignment alignment alignment is is is generally generally generally the the the most most most useful. useful. useful. See See See also also also Local Local Local and and Global alignments. Alignment score （联配/比对/联配值）联配值）An algorithmically computed score based on the number of matches, substitutions, insertions, and deletions deletions (gaps) (gaps) (gaps) within within within an an an alignment. alignment. alignment. Scores Scores Scores for for for matches matches matches and and and substitutions substitutions substitutions Are Are Are derived derived derived from from from a a scoring scoring matrix matrix matrix such such such as as as the the the BLOSUM BLOSUM BLOSUM and and and P AM P AM matrices matrices matrices for for for proteins, proteins, proteins, and and and aftine aftine aftine gap gap gap penalties penalties suitable for the matrix are chosen. Alignment scores are in log odds units, often bit units (log to the the base base base 2). 2). 2). Higher Higher Higher scores scores scores denote denote denote better better better alignments. alignments. alignments. See See See also also also Similarity Similarity Similarity score, score, score, Distance Distance Distance in in sequence analysis. Alphabet （字母表）（字母表）The total number of symbols in a sequence-4 for DNA sequences and 20 for protein sequences. 
Annotation: The prediction of genes in a genome, including the location of protein-encoding genes, the sequence of the encoded proteins, any significant matches to other proteins of known function, and the location of RNA-encoding genes. Predictions are based on gene models, e.g., hidden Markov models of introns and exons in protein-encoding genes, and models of secondary structure in RNA.
Anonymous FTP: When an FTP service allows anyone to log in, it is said to provide anonymous FTP service. A user can log in to an anonymous FTP server by typing anonymous as the user name and an e-mail address as the password. Most Web browsers now negotiate anonymous FTP logon without asking the user for a user name and password. See also FTP.
ASCII: The American Standard Code for Information Interchange (ASCII) encodes unaccented letters a-z, A-Z, the numbers 0-9, most punctuation marks, space, and a set of control characters such as carriage return and tab. ASCII specifies 128 characters that are mapped to the values 0-127.
ASCII files are commonly called plain text, meaning that they only encode text without extra markup.
BAC clone: Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100–200 kb. Most of the large-insert clones sequenced in the project were BAC clones.
Back-propagation: When training feed-forward neural networks, a back-propagation algorithm can be used to modify the network weights. After each training input pattern is fed through the network, the network's output is compared with the desired output and the amount of error is calculated. This error is back-propagated through the network by using an error function to correct the network weights. See also Feed-forward neural network.
Baum-Welch algorithm: An expectation maximization algorithm that is used to train hidden Markov models.
Bayes' rule: Forms the basis of conditional probability by calculating the likelihood of an event occurring based on the history of the event and relevant background information.
In terms of two parameters A and B, the theorem is stated in an equation: the conditional probability of A, given B, P(A|B), is equal to the probability of A, P(A), times the conditional probability of B, given A, P(B|A), divided by the probability of B, P(B); that is, P(A|B) = P(A) P(B|A) / P(B). P(A) is the historical or prior distribution value of A, P(B|A) is a new prediction for B for a particular value of A, and P(B) is the sum of the newly predicted values for B. P(A|B) is a posterior probability, representing a new prediction for A given the prior knowledge of A and the newly discovered relationships between A and B.
Bayesian analysis: A statistical procedure used to estimate parameters of an underlying distribution based on an observed distribution. See also Bayes' rule.
Biochips: Miniaturized arrays of large numbers of molecular substrates, often oligonucleotides, in a defined pattern. They are also called DNA microarrays and microchips.
Bioinformatics: The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology. Also, the discipline of obtaining information about genomic or protein sequence data.
This may involve similarity searches of databases, comparing your unidentified sequence to the sequences in a database, or making predictions about the sequence based on current knowledge of similar sequences. Databases are frequently made publicly available through the Internet, or locally at your institution.
Bit score: The value S' derived from the raw alignment score S, in which the statistical properties of the scoring system used have been taken into account. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
Bit units: From information theory, a bit denotes the amount of information required to distinguish between two equally likely possibilities. The number of bits of information, N, required to convey a message that has M possibilities is log2 M = N bits.
BLAST (one of the main database search programs): Basic Local Alignment Search Tool. A set of programs used to perform fast similarity searches. Nucleotide sequences can be compared with nucleotide sequences in a database using BLASTN, for example.
Complex statistics are applied to judge the significance of each match. Reported sequences may be homologous to, or related to, the query sequence. The BLASTP program is used to search a protein database for a match against a query protein sequence. There are several other flavours of BLAST. BLAST2 is a newer release of BLAST that allows for insertions or deletions in the sequences being aligned. Gapped alignments may be more biologically significant.
Block (a conserved region shared within a protein family): Conserved ungapped patterns approximately 3-60 amino acids in length in a set of related proteins.
BLOSUM matrices (blocks substitution matrices, one of the main families of substitution matrices): An alternative to PAM tables, BLOSUM tables were derived using local multiple alignments of more distantly related sequences than were used for the PAM matrix. These are used to assess the similarity of sequences when performing alignments.
Boltzmann distribution: Describes the number of molecules that have energies above a certain level, based on the Boltzmann gas constant and the absolute temperature.
Boltzmann probability function: See Boltzmann distribution.
Bootstrap analysis: A method for testing how well a particular data set fits a model.
For example, the validity of the branch arrangement in a predicted phylogenetic tree can be tested by resampling columns in a multiple sequence alignment to create many new alignments. The appearance of a particular branch in trees generated from these resampled sequences can then be measured. Alternatively, a sequence may be left out of an analysis to determine how much the sequence influences the results of an analysis.
Branch length: In sequence analysis, the number of sequence changes along a particular branch of a phylogenetic tree.
CDS or cds: Coding sequence.
Chebyshev's inequality: The probability that a random variable deviates from its mean by at least k standard deviations is less than or equal to 1/k^2, the square of 1 over the number of standard deviations from the mean.
Clone: Population of identical cells or molecules (e.g., DNA), derived from a single ancestor.
Cloning vector: A molecule that carries a foreign gene into a host, and allows or facilitates the multiplication of that gene in a host. When sequencing a gene that has been cloned using a cloning vector (rather than by PCR), care should be taken not to include the cloning vector sequence when performing similarity searches. Plasmids, cosmids, phagemids, YACs and PACs are example types of cloning vectors.
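
The column resampling described in the Bootstrap analysis entry above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alignment is given as a list of equal-length strings, the function names are hypothetical, and building a tree from each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the number of columns."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Pick n_cols column indices at random; the same column may be picked twice.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[i] for i in picks) for row in alignment]


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Return a list of resampled alignments; the seed makes the run repeatable."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

Each replicate would then be passed to the same tree-building method as the original alignment, and the fraction of replicate trees that contain a given branch gives its bootstrap support.
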
Cluster analysis: A method for grouping together a set of objects that are most similar from a larger group of related objects. The relationships are based on some criterion of similarity or difference. For sequences, a similarity or distance score or a statistical evaluation of those scores is used.
Cobbler: A single sequence that represents the most conserved regions in a multiple sequence alignment. The BLOCKS server uses the cobbler sequence to perform a database similarity search as a way to reach sequences that are more divergent than would be found using the single sequences in the alignment for searches.
Coding system (neural networks): Regarding neural networks, a coding system needs to be designed for representing input and output. The level of success found when training the model will be partially dependent on the quality of the coding system chosen.
Codon usage: Analysis of the codons used in a particular gene or organism.
COG: Clusters of orthologous groups, a set of groups of related sequences in microorganisms and yeast (S. cerevisiae). These groups are found by whole proteome comparisons and include orthologs and paralogs. See also Orthologs and Paralogs.
Comparative genomics: A comparison of gene numbers, gene locations, and biological functions of genes in the genomes of diverse organisms, one objective being to identify groups of genes that play a unique biological role in a particular organism.
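
As a small illustration of the Codon usage entry above, the following Python sketch counts the codons of a single coding sequence. It assumes the sequence is already in frame from its first base, and the function names are hypothetical.

```python
from collections import Counter


def codon_usage(cds):
    """Count each codon in a coding sequence read in frame from its first base."""
    seq = cds.upper().replace("U", "T")   # accept RNA input as well
    usable = len(seq) - len(seq) % 3      # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def codon_frequencies(cds):
    """Relative frequency of each codon among all codons in the sequence."""
    counts = codon_usage(cds)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}
```

Summing such counts over all genes of an organism, instead of one sequence, would give an organism-level codon usage table.
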
Complexity (of an algorithm): Describes the number of steps required by the algorithm to solve a problem as a function of the amount of data; for example, the length of sequences to be aligned.
Conditional probability: The probability of a particular result (or of a particular value of a variable) given one or more events or conditions (or values of other variables).
Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
Consensus (consensus sequence): A single sequence that represents, at each subsequent position, the variation found within corresponding columns of a multiple sequence alignment.
Context-free grammars: A recursive set of production rules for generating patterns of strings. These consist of a set of terminal characters that are used to create strings, a set of nonterminal symbols that correspond to rules and act as placeholders for patterns that can be generated using terminal characters, a set of rules for replacing nonterminal symbols with terminal characters, and a start symbol.
Contig (a group of overlapping sequences; an assembled sequence): A set of clones that can be assembled into a linear order.
A DNA sequence that overlaps with another contig. The full set of overlapping sequences (contigs) can be put together to obtain the sequence for a long region of DNA that cannot be sequenced in one run in a sequencing assay. Important in genetic mapping at the molecular level.
CORBA: The Common Object Request Broker Architecture (CORBA) is an open industry standard for working with distributed objects, developed by the Object Management Group. CORBA allows the interconnection of objects and applications regardless of computer language, machine architecture, or geographic location of the computers.
Correlation coefficient: A numerical measure, falling between -1 and 1, of the degree of the linear relationship between two variables. A positive value indicates a direct relationship, a negative value indicates an inverse relationship, and the distance of the value away from zero indicates the strength of the relationship. A value near zero indicates no relationship between the variables.
Covariation (in sequences): Coincident change at two or more sequence positions in related sequences that may influence the secondary structures of RNA or protein molecules.
Coverage (or depth): The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence.
Operationally, a 'high-quality base' is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).
Database: A computerized storehouse of data that provides a standardized way for locating, adding, removing, and changing data. See also Object-oriented database, Relational database.
Dendrogram: A form of a tree that lists the compared objects (e.g., sequences or genes in a microarray analysis) in a vertical order and joins related ones by levels of branches extending to one side of the list.
Depth: See Coverage.
Dirichlet mixtures: Defined as the conjugate prior of a multinomial distribution. One use is for predicting the expected pattern of amino acid variation found in the match state of a hidden Markov model (representing one column of a multiple sequence alignment of proteins), based on prior distributions found in conserved protein domains (blocks).
Distance in sequence analysis: The number of observed changes in an optimal alignment of two sequences, usually not counting gaps.
DNA sequencing: The experimental process of determining the nucleotide sequence of a region of DNA.
This is done by labelling each nucleotide (A, C, G or T) with either a radioactive or fluorescent marker which identifies it. There are several methods of applying this technology, each with its advantages and disadvantages. For more information, refer to a current textbook. High-throughput laboratories frequently use automated sequencers, which are capable of rapidly reading large numbers of templates. Sometimes, the sequences may be generated more quickly than they can be characterised.
Domain (functional domain): A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.
Dot matrix (dot matrix plot): Dot matrix diagrams provide a graphical method for comparing two sequences. One sequence is written horizontally across the top of the graph and the other along the left-hand side. Dots are placed within the graph at the intersection of the same letter appearing in both sequences. A series of diagonal lines in the graph indicates regions of alignment. The matrix may be filtered to reveal the most-alike regions by scoring a minimal threshold number of matches within a sequence window.
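
The Dot matrix entry above translates almost directly into code. The Python sketch below builds the unfiltered matrix and a window-filtered version; the function names, the default window size, and the choice to mark the dot at the start of each qualifying diagonal window are assumptions made for this example.

```python
def dot_matrix(seq_top, seq_left):
    """Cell [i][j] is True when seq_left[i] and seq_top[j] are the same letter."""
    return [[a == b for b in seq_top] for a in seq_left]


def filtered_dot_matrix(seq_top, seq_left, window=3, min_matches=3):
    """Keep a dot only where the diagonal window starting there has enough matches."""
    rows, cols = len(seq_left), len(seq_top)
    dots = [[False] * cols for _ in range(rows)]
    for i in range(rows - window + 1):
        for j in range(cols - window + 1):
            matches = sum(seq_left[i + k] == seq_top[j + k] for k in range(window))
            dots[i][j] = matches >= min_matches
    return dots


def render(dots):
    """Draw the matrix as text, one row per line: '*' for a dot, '.' otherwise."""
    return "\n".join("".join("*" if d else "." for d in row) for row in dots)
```

Calling `print(render(filtered_dot_matrix("GATTACA", "TTAC")))` draws the filtered plot as text; lowering `min_matches` below `window` tolerates mismatches inside a window.
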
Draft genome sequence: The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.
DUST: A program for filtering low-complexity regions from nucleic acid sequences.
Dynamic programming: A dynamic programming algorithm solves a problem by combining solutions to sub-problems that are computed once and saved in a table or matrix. Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found. This algorithm is used for producing sequence alignments, given a scoring system for sequence comparisons.
EMBL: European Molecular Biology Laboratory. Maintains the EMBL database, one of the major public nucleotide sequence databases.
EMBnet: European Molecular Biology Network. Established in 1988, it provides services including local molecular databases and software for molecular biologists in Europe. There are several large outposts of EMBnet, including ExPASy.
Entropy: From information theory, a measure of the unpredictable nature of a set of possible elements. The higher the level of variation within the set, the higher the entropy.
Erdős and Rényi law: In a toss of a "fair" coin, the number of heads in a row that can be expected is the logarithm of the number of tosses to the base 2.
The law may be generalized for more than two possible outcomes by changing the base of the logarithm to the number of outcomes. This law was used to analyze the number of matches and mismatches that can be expected between random sequences as a basis for scoring the statistical significance of a sequence alignment.
EST (abbreviation of expressed sequence tag): See Expressed Sequence Tag.
Expect value (E): E value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. In a database similarity search, the probability that an alignment score as good as the one found between a query sequence and a database sequence would be found in as many comparisons between random sequences as was done to find the matching sequence. In other types of sequence analysis, E has a similar meaning.
Expectation maximization (sequence analysis): An algorithm for locating similar sequence patterns in a set of sequences. A guessed alignment of the sequences is first used to generate an expected scoring matrix representing the distribution of sequence characters in each column of the alignment; this pattern is matched to each sequence, and the scoring matrix values are then updated to maximize the alignment of the matrix to the sequences. The procedure is repeated until there is no further improvement.
Exon
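
The Erdős and Rényi law entry above amounts to a one-line formula, sketched here in Python. The sketch assumes equally likely outcomes, as in the glossary's statement of the law; the function names are hypothetical, and `longest_run` is included only so that the estimate can be compared with an observed sequence of tosses.

```python
import math


def expected_longest_run(n_trials, n_outcomes=2):
    """Expected longest run of one outcome: log of n_trials to the base n_outcomes."""
    if n_trials < 1 or n_outcomes < 2:
        raise ValueError("need n_trials >= 1 and n_outcomes >= 2")
    return math.log(n_trials) / math.log(n_outcomes)


def longest_run(tosses, outcome="H"):
    """Length of the longest unbroken run of `outcome` in a sequence of tosses."""
    best = current = 0
    for toss in tosses:
        current = current + 1 if toss == outcome else 0
        best = max(best, current)
    return best
```

For 1,024 tosses of a fair coin the estimate is about 10 heads in a row; with four equally likely outcomes, as for nucleotides, 4,096 trials give an estimate of about 6.
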

2012 Bioinformatics Review Questions

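I. Multiple-choice questions
1. According to the PAM scoring matrix, which of the following amino acids is least likely to mutate? A) Alanine B) Glutamine C) Methionine D) Cysteine
2. Which of the following sentences best describes the difference between global and local alignment of two sequences? A) Global alignment is usually used for DNA sequences, whereas local alignment is usually used for protein sequences; B) Global alignment allows gaps, whereas local alignment does not; C) Global alignment seeks a global maximum, whereas local alignment seeks a local maximum; D) Global alignment aligns the entire sequences, whereas local alignment finds the best-matching subsequences
3. Compared with PAM scoring matrices, what is the most important difference of BLOSUM scoring matrices? A) They are best used to align closely related sequences; B) They are based on global multiple sequence alignments of closely related proteins; C) They are based on local multiple sequence alignments of distantly related proteins; D) They combine local and global alignment information
4. A global alignment algorithm (such as the Needleman-Wunsch algorithm) is an algorithm that: A) places the two proteins being compared in a matrix and then finds the best-scoring alignment by exhaustively searching every possible alignment; B) places the two proteins being compared in a matrix and then finds the best score by an iterative, recursive method; C) places the two proteins being compared in a matrix and then finds the best alignment by searching for the best subsequences; D) can be used for proteins but not for DNA sequences
5. In database searching or pairwise sequence alignment, sensitivity is defined as: A) the ability of the search algorithm to find true positives (i.e., homologous sequences) and avoid false positives (i.e., unrelated sequences that have high similarity scores); B) the ability of the search algorithm to find true positives (i.e., homologous sequences) and avoid false positives (i.e., homologous sequences not reported by the search algorithm); C) the ability of the search algorithm to find true positives (i.e., homologous sequences) and avoid false negatives (i.e., unrelated sequences that have high similarity scores); D) the ability of the search algorithm to find true positives (i.e., homologous sequences) and avoid false negatives (i.e., homologous sequences not reported by the search algorithm)
6. Given a short DNA sequence, how many proteins can it encode in principle? A) 1 B) 2 C) 3 D) 6
7. Given a DNA sequence, which program would you choose to find out which entry in the major protein databases is closest to the protein encoded by that DNA? A) blastn B) blastp C) blastx D) tblastx E) tblastn
8. Which output of a BLAST search estimates the number of false positives? A) E value B) Bit score C) Percent identity D) Percent positives
9. Changing which of the following BLAST parameters yields fewer search results? A) Turning off the low-complexity filter B) Changing the expect value from 1 to 0 C) Raising the threshold D) Changing the scoring matrix from PAM30 to PAM70
10. The extreme value distribution: A) describes the distribution of the scores of a query against a database B) has a larger total area than the normal distribution C) is symmetric D) has a shape that can be described by two parameters, µ (the mean) and λ (the decay constant)
11. When the E value of a BLAST search decreases: A) the K value also decreases B) the score becomes larger C) the probability p value becomes larger D) the skewness of the extreme value distribution decreases
12. Normalized BLAST scores (also called bit scores): A) have no units B) can be compared between different BLAST searches, even when different scoring matrices are used C) are independent of the scoring matrix used D) can be compared between different BLAST searches, but only if the same scoring matrix is used
13. In the EMBL and NCBI databases, the raw DNA sequences (as opposed to the annotated sequences): A) overlap completely B) overlap to a large extent, although the sequences differ C) overlap only slightly
14. For which of the following tasks is a PSI-BLAST search most effective? A) Finding the homolog of a human protein in mouse B) Finding more matching proteins in a database query C) Finding more matching DNA sequences in a database query D) Using a pattern or signature sequence to strengthen a database search
15. Which of the following BLAST programs uses an amino acid signature sequence to find matches within a protein family? A) PSI-BLAST B) PHI-BLAST C) MS BLAST D) WormBLAST
16. Which of the following BLAST programs is best for analyzing immunoglobulins? A) RPS-BLAST B) PHI-BLAST C) IgBLAST D) ProDom
17. In a position-specific scoring matrix, a column can contain 20 kinds of amino acids.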

Big Genomic Data and Bioinformatics: English Text and Translation

Big Genomic Data in Bioinformatics Cloud

Abstract

The completion of the Human Genome Project has led to a proliferation of genomic sequencing data. Together with next-generation sequencing, it has helped to reduce the cost of sequencing, which has further increased the demand for analysis of these large genomic data sets. These data and their processing have aided medical research. Expertise is therefore required to deal with biological big data. Cloud computing and big data technologies such as the Apache Hadoop project are needed to store, handle, and analyse these data, because they provide distributed and parallelized data processing and can efficiently analyse even petabyte (PB) scale data sets. However, there are drawbacks too, chiefly the long time needed to transfer data and limited network bandwidth.

An Introduction to Bioinformatics in English

Introduction to Bioinformatics

Bioinformatics is an interdisciplinary field that combines biology, computer science, mathematics, statistics, and other disciplines to analyze and interpret biological data. At its core, bioinformatics leverages computational tools and algorithms to process, manage, and mine biological information, enabling a deeper understanding of the molecular basis of life and its diverse phenomena.

The field of bioinformatics has exploded in recent years, driven by the exponential growth of biological data generated by high-throughput sequencing technologies, proteomics, genomics, and other omics approaches. This data deluge has presented both challenges and opportunities for researchers. On one hand, the sheer volume and complexity of the data require sophisticated computational methods for analysis. On the other hand, the wealth of information contained within these data holds the promise of transformative insights into the functions, interactions, and evolution of biological systems.

The core tasks of bioinformatics encompass genome annotation, sequence alignment and comparison, gene expression analysis, protein structure prediction and function annotation, and the integration of multi-omic data. These tasks require a range of computational tools and algorithms, often developed by bioinformatics experts in collaboration with biologists and other researchers.

Genome annotation, for example, involves the identification of genes and other genetic elements within a genome and the prediction of their functions. This process involves the use of bioinformatics algorithms to identify protein-coding genes, non-coding RNAs, and regulatory elements based on sequence patterns and other features. The resulting annotations provide a foundation for understanding the genetic basis of traits and diseases.

Sequence alignment and comparison are crucial for understanding the evolutionary relationships between species and for identifying conserved regions within genomes. Bioinformatics algorithms, such as BLAST and multiple sequence alignment tools, are widely used for these purposes. These algorithms enable researchers to compare sequences quickly and accurately, revealing patterns of conservation and divergence that inform our understanding of biological diversity and function.

Gene expression analysis is another key area of bioinformatics. It involves the quantification of the levels of mRNAs, proteins, and other molecules within cells and tissues, and the interpretation of these data to understand the regulation of gene expression and its impact on cellular phenotypes. Bioinformatics tools and algorithms are essential for processing and analyzing the vast amounts of data generated by high-throughput sequencing and other experimental techniques.

Protein structure prediction and function annotation are also important areas of bioinformatics. The structure of a protein determines its function, and bioinformatics methods can help predict the three-dimensional structure of a protein based on its amino acid sequence. These predictions can then be used to infer the protein's function and to understand how it interacts with other molecules within the cell.

The integration of multi-omic data is a rapidly emerging area of bioinformatics. It involves the integration and analysis of data from different omics platforms, such as genomics, transcriptomics, proteomics, and metabolomics. This approach enables researchers to understand the interconnectedness of different biological processes and to identify complex relationships between genes, proteins, and metabolites.

In addition to these core tasks, bioinformatics also plays a crucial role in translational research and personalized medicine. It enables the identification of disease-associated genes and the development of targeted therapeutics. By analyzing genetic and other biological data from patients, bioinformatics can help predict disease outcomes and guide treatment decisions.

The future of bioinformatics is bright. With the continued development of high-throughput sequencing technologies and other omics approaches, the amount of biological data available for analysis will continue to grow. This will drive the need for more sophisticated computational methods and algorithms to process and interpret these data. At the same time, the integration of bioinformatics with other disciplines, such as artificial intelligence and machine learning, will open up new possibilities for understanding the complex systems that underlie life.

In conclusion, bioinformatics is an essential field for understanding the molecular basis of life and its diverse phenomena. It leverages computational tools and algorithms to process, manage, and mine biological information, enabling a deeper understanding of the functions, interactions, and evolution of biological systems. As the amount of biological data continues to grow, the role of bioinformatics in research and medicine will become increasingly important.

Bioinformatics (English)

Bioinformatics

Bioinformatics is a rapidly growing field that combines biology, computer science, and information technology to analyze and interpret biological data. It has become an essential tool in modern scientific research, particularly in genomics, proteomics, and molecular biology. The advent of high-throughput sequencing technologies has led to an exponential increase in the amount of biological data available, and bioinformatics provides the means to manage, analyze, and interpret this information.

At its core, bioinformatics uses computational methods and algorithms to store, retrieve, organize, analyze, and interpret biological data. This includes tasks such as DNA and protein sequence analysis, gene and protein structure prediction, and the identification of functional relationships between different biological molecules. Bioinformatics also plays a crucial role in the development of new drugs, the understanding of disease mechanisms, and the study of evolutionary processes.

One of the primary applications of bioinformatics is in genomics, where it is used to analyze and interpret DNA sequence data. Researchers can use bioinformatics tools to identify genes, predict their function, and study the genetic variation within and between species. This information is essential for understanding the genetic basis of diseases, developing personalized medicine, and exploring the evolutionary history of different organisms.

Bioinformatics is also widely used in proteomics, the study of the structure and function of proteins. Bioinformatics tools can be used to predict the three-dimensional structure of proteins, identify protein-protein interactions, and study the role of proteins in biological processes. This information is crucial for understanding the mechanisms of disease, developing new drugs, and engineering proteins for industrial and medical applications.

Another important application is systems biology, which aims to understand the complex interactions between the biological components of a living organism. By using computational models and simulations, bioinformaticians can study how these components work together to produce the observed behavior of a biological system. This knowledge can be used to develop new treatments for diseases, optimize agricultural practices, and gain a deeper understanding of the fundamental principles of life.

Bioinformatics also plays a crucial role in personalized medicine, where the goal is to tailor medical treatment to the genetic and molecular profile of each patient. By analyzing a patient's genetic and molecular data, healthcare providers can identify the most effective treatments and predict the likelihood of adverse reactions to certain drugs. This approach has the potential to improve patient outcomes, reduce healthcare costs, and pave the way for more targeted and effective therapies.

Beyond its scientific applications, bioinformatics is having a significant impact on agriculture, environmental science, and forensics. In agriculture, it is used to improve crop yields, develop new pest-resistant varieties, and optimize the use of resources. In environmental science, it is used to study the impact of human activities on ecosystems, identify new sources of renewable energy, and monitor the health of the planet. In forensics, it is used to analyze DNA evidence and identify individuals involved in criminal activities.

Despite these benefits, there are significant challenges and ethical considerations to address. As the amount of biological data continues to grow, there is an increasing need for efficient data storage, management, and security systems. In addition, interpreting biological data requires a deep understanding of both biological and computational principles, which can be a barrier for some researchers and healthcare providers.

Furthermore, the use of bioinformatics in areas such as personalized medicine and forensics raises important ethical questions about privacy, informed consent, and the potential for discrimination based on genetic information. The development and application of bioinformatics technologies should be guided by strong ethical principles and robust regulatory frameworks so that they are used responsibly and equitably.

In conclusion, bioinformatics is a rapidly evolving field that is transforming the way we understand and interact with the biological world. By combining computer science and information technology with the insights of biology, it is enabling new discoveries, improving human health, and shaping the future of scientific research. As the field continues to grow, researchers, policymakers, and the public will need to work together to ensure that its benefits are realized in a responsible and ethical manner.

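Bioinformatics Review Questions (Answers Included)

The answers in this paper are for reference only; if you have any doubts, please raise them.

For the supplementary review questions that follow, you will need to work out the answers yourselves.

Bioinformatics Review Questions

I. Fill in the blanks

1. There are two main approaches to identifying genes: exon recognition in genomic DNA and gene identification based on an EST strategy.

2. Expressed sequence tags are short sequences (300-500 bp) generated from mRNA; they represent genes expressed in a particular tissue or at a particular developmental stage.

3. The basic idea of sequence alignment is to find the similarity between the query sequence and the target sequence by inserting gaps so that the compared sequences reach the same length. Mathematical models of alignment fall broadly into two classes: global alignment and local alignment.

4. The principle of 2-DE is to separate proteins by two rounds of electrophoresis according to their isoelectric point and molecular weight. The first dimension is isoelectric focusing; the second dimension is SDS-PAGE.

5. The three key core technologies of proteome research are two-dimensional gel electrophoresis, mass-spectrometric identification, and computer image analysis together with protein databases.

II. True or false

1. The more complex the structure and function of an organism, the more genes it needs and the larger its C-value; this is one of the characteristics of eukaryotic genomes. (True)

2. A CDS is always an ORF. (True)

3. Whether two sequences share a common ancestor can be determined from sequence homology; if two genes or proteins have almost identical sequences, they are highly homologous and therefore share a common ancestor. (False)

4. An STS is a specific DNA sequence of 200-300 bp whose sequence is known and which is present as a single copy in the genome. (True)

5. Non-coding DNA is "junk DNA": it has no analytical value and is of little use to the cell. (False)

6. Gene trees and species trees are both phylogenetic trees, and the two can be treated as equivalent. (False)

7. The coding sequences of a gene are arranged discontinuously on the DNA molecule, separated by non-coding sequences. (True)

8. For any DNA sequence, when it is not known which base marks the start of the CDS, six-frame translation can be used to obtain six potential protein sequences. (True)

9. An organism has only one defined genome, but the conditions under which each gene in the genome is expressed, and the level of expression, vary with time, location, and environment. (True)

10. There is no absolute distinction between exons and introns: an intron of one gene can be an exon of another, and the exon composition of the same gene can differ between physiological states or developmental stages.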
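True-or-false item 8 above refers to six-frame translation. A minimal Python sketch of the idea follows; it assumes the standard genetic code, and the function names (`six_frame_translation`, `translate`, `reverse_complement`) are illustrative choices, not part of the exam material.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return six translations: three forward frames, then three reverse frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


if __name__ == "__main__":
    for frame in six_frame_translation("ATGGCCTAA"):
        print(frame)
```

Stop codons are written as `*`; a real ORF finder would additionally split each frame at stops and look for start codons.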

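2012 Bioinformatics Question Bank

I. Multiple choice

1. Which of the following is the accession number of an mRNA entry? A. J01536 B. NM_15392 C. NP_52280 D. AAB134506

2. The most direct way to find out in which tissues a gene is expressed is: A. UniGene B. Entrez C. LocusLink D. PCR

3. Can one gene correspond to two UniGene clusters? A. Yes B. No

4. Which of the following databases is derived from mRNA information? A. dbEST B. PDB C. OMIM D. HTGS

5. Which of the following databases is built around human disease? A. EST B. PDB C. OMIM D. HTGS

6. What is the difference between RefSeq and GenBank? A. RefSeq contains the DNA sequences submitted by laboratories and sequencing projects worldwide B. GenBank provides non-redundant sequences C. RefSeq is derived from GenBank and provides non-redundant sequence information D. GenBank is derived from RefSeq

7. If you need to look up literature, which database is your best choice? A. OMIM B. Entrez C. PubMed D. PROSITE

8. Comparing the retrieval of protein sequence information from Entrez and from ExPASy, which statement is correct? A. Because GenBank holds more data than EMBL, Entrez will return more results B. The results will probably be the same, because GenBank and EMBL hold practically the same sequence data C. The results should be comparable, but the SwissProt records in ExPASy are output in a different format

9. The one-letter codes for asparagine, tryptophan, and tyrosine are, respectively: A. N/W/Y B. Q/W/Y C. F/W/Y D. Q/N/W

10. Orthologs are defined as: A. Homologous sequences in different species that share a common ancestor B. Homologous sequences with low amino acid identity but high structural similarity C. Homologous sequences within one species that arose by gene duplication D. Homologous sequences within one species that have similar and usually redundant functions

11. Which of the following amino acids is least mutable? A. Alanine B. Glutamine C. Methionine D. Cysteine

12. The evolutionary distance defined by the PAM250 matrix corresponds to what percentage of amino acids having changed between two homologous sequences over the given time? A. 1% B. 20% C. 80% D. 250%

13. Which sentence best describes the difference between global and local alignment of two sequences? A. Global alignment is usually used for DNA sequences, local alignment for protein sequences B. Global alignment allows gaps, local alignment does not C. Global alignment seeks a global maximum, local alignment a local maximum D. Global alignment aligns the sequences over their whole length, local alignment finds the best-matching subsequences

14. Suppose you have two distantly related protein sequences.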

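Bioinformatics Question Bank Answers

UTR means ( B ). A. coding region B. non-coding region C. low-complexity region D. open reading frame

motif means ( A ). A. 基序 (sequence motif) B. 跨叠克隆群 (contig) C. 碱基对 (base pair) D. 结构域 (domain)

algorithm means ( B ). A. 登录号 (accession number) B. 算法 (algorithm) C. 比对 (alignment) D. 类推 (analogy)

RGP is ( D ). A. Online Mendelian Inheritance in Man B. a national nucleic acid database C. the Human Genome Project D. the Rice Genome Project

Which of the following is correct FASTA format? ( B )
A. seq1: agcggatccagacgctgcgtttgctggctttgatgaaaactctaactaaacactccctta
B. >seq1 agcggatccagacgctgcgtttgctggctttgatgaaaactctaactaaacactccctta
C. seq1:agcggatccagacgctgcgtttgctggctttgatgaaaactctaactaaacactccctta
D. >seq1agcggatccagacgctgcgtttgctggctttgatgaaaactctaactaaacactccctta

To analyze the subcellular localization of a protein, we should use ( D ). A. the NDB database B. the PDB database C. the GenBank database D. the SWISS-PROT database

Bioinformatics means ( A ). A. 生物信息学 (bioinformatics) B. 基因组学 (genomics) C. 蛋白质组学 (proteomics) D. 表观遗传学 (epigenetics)

The GenBank division code PLN denotes ( D ). A. mammalian sequences B. bacterial sequences C. phage sequences D. plant, fungal, and algal sequences

ortholog means ( A ). A. 直系同源 (ortholog) B. 旁系同源 (paralog) C. 直接进化 (direct evolution) D. 间接进化 (indirect evolution)

The short sequences obtained from a cDNA library are ( D ). A. STS B. UTR C. CDS D. EST

contig means ( B ).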

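Bioinformatics Exam Questions

Bioinformatics Exam (2012 edition)

I. Fill in the blanks (10 points, 1 point per blank)

1. In October 1990 the US government launched a 15-year, US$3 billion research program that was expected to complete the full sequence of the roughly 3 billion bases of the human genome by 2005. This is the Human Genome Project, known as the "moon-landing program" of the life sciences.

2. The research goal of bioinformatics: taking databases of biological macromolecules such as nucleic acids and proteins as its main object, mathematics, informatics, and computer science as its main methods, and computer hardware, software, and networks as its main tools, to store, manage, annotate, and process the vast body of raw data so that it becomes biological information with a clear biological meaning.

3. With the birth and application of bioinformatics, the starting point of future biological research projects will be theoretical: a scientist will begin with a theoretical conjecture and then turn to experiment to pursue or test that hypothesis.

4. As an interdisciplinary field, bioinformatics has become one of the major frontiers of the life sciences and of natural science as a whole, and will be one of the core fields of natural science in the 21st century.

5. The Human Genome Project, the Manhattan Project, and the Apollo moon-landing program are known as the three great projects of the 20th century. In 1999 China took on 1% of the work, namely the sequencing of 30 million base pairs on chromosome 3.

6. The main tasks of the Human Genome Project are mapping, sequencing, and gene identification for the human genome and the genomes of several model organisms (bacteria, yeast, nematode, fruit fly, etc.).

II. True or false (10 points, 1 point each)

1. Biology is purely an experimental science: all research conclusions come from experiment and are verified by experiment. (False)

2. Comparison is the most common method in scientific research; in bioinformatics, alignment is the most commonly used and most classic research technique. (True)

3. Two protein sequences with more than 30% similarity are homologous proteins. (False)

4. Protein sequence similarity means that the amino acid residues in the primary sequence are identical. (False)

5. Protein sequence similarity means that the amino acid residues have similar properties, such as side-chain size, charge, and hydrophobicity. (True)

6. Nucleic acid sequence similarity refers to the proportion of identical bases in the sequences. (True)

7. Predicting the function of a DNA fragment of unknown function requires three-frame translation. (False)

8. Predicting the function of a DNA fragment of unknown function requires six-frame translation. (True)

9. Similarity is a direct quantitative relationship and needs no experimental verification.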

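Molecular Biology Exam Questions, Complete Edition (Felisa)

Molecular Biology Past Exam, Class of 2005

I. Multiple choice (answers shown in parentheses)

1. An activator has two functional domains: one is the transcription-activation domain, the other is (the DNA-binding domain).
2. Transcription factors include general transcription factors and (gene-specific transcription factors).
3. G-protein activation needs (GTP) as energy.
4. Promoters and (enhancers) are cis-acting elements.
5. Phages integrate into the host by (site-specific recombination).
6. In bacteria, once the leader region of the tryptophan operon has been transcribed, (translation) begins.
7. mRNA splicing resembles the splicing of group (II) introns.
8. UCE is a recognition sequence of class (I) promoters.
9. In which of the following promoters is TATA box binding protein present? (all three classes)
10. (5S rRNA) is transcribed from an internal promoter.
11. The size of the whole human genome: (3,200,000,000 bp)
12. The spliceosome component that base-pairs with the sequence around the branch site: (U2 snRNP)
13. tRNA genes are transcribed by RNA polymerase (III).
14. In bacteria, once the leader region of the tryptophan operon has been transcribed, (translation) begins.
15. The substance that binds the repressor of the lactose operon is (allolactose).
16. Intron splicing of nuclear mRNA resembles (group II intron splicing).
17. A feature of genes during transcription: (no nucleosomes on the promoter).
18. RNA interference is also called (post-transcriptional gene silencing, PTGS).
19. Introns are found mainly in (eukaryotes).
20. In which of the following reactions does snRNA act as a catalyst? (mRNA splicing)

II. True or false

1. Prokaryotes have three RNA polymerases.
2. Antitermination proteins work by making RNA polymerase ignore terminators.
3. When RNA polymerase II binds to the promoter, the carboxy-terminal domain (CTD) of its subunit is phosphorylated.
4. Operon is a group of contiguous, coordinately controlled genes.
5. The concept of the RNA polymerase holoenzyme applies only to prokaryotes.
6. Polyadenylation takes place before mRNA splicing.
7. σ converts the transcription initiation complex from the open to the closed state. (Correct version: from closed to open.)
8. The spliceosome works through assembly, action, and disassembly, which together form a cycle.

III. Short answer

1. The two modes of transcription termination in prokaryotes.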

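Self-Test: Bioinformatics

1. The English abbreviation for the transcription start site is ( ). A. CDS B. ORF C. TSS D. SRA. Correct answer: C
2. In the nucleotide code table used by bioinformatics databases, the code D stands for ( ). A. not A B. not C C. A or T or C or G D. C. Correct answer: B
3. Which of the following Latin names denotes the mouse? A. Homo sapiens B. Xenopus laevis C. Mus musculus D. Sus scrofa. Correct answer: C
4. CDS means: A. coding region B. non-coding region C. low-complexity region D. non-regulatory region. Correct answer: A
5. Genomics means: A. 生物信息学 (bioinformatics) B. 基因组学 (genomics) C. 蛋白质组学 (proteomics) D. 表观遗传学 (epigenetics). Correct answer: B
6. base pair means: A. 基序 (motif) B. 跨叠克隆群 (contig) C. 碱基对 (base pair) D. 结构域 (domain). Correct answer: C
7. In eukaryotes, the sequence around the start codon AUG at the 5′ end of a gene's cDNA follows the ( ) rule. A. Kozak B. GU…AG C. SD D. Poly(A)n. Correct answer: A
8. The 3′ end of mRNA has a ( ) structure. A. cap B. tail C. cap and tail D. poly-cytosine. Correct answer: B
9. ORF means: A. regulatory region B. non-coding region C. low-complexity region D. open reading frame. Correct answer: D
10. The 3′ end of mRNA has a ( ) structure. A. cap B. tail C. cap and tail D. poly-cytosine. Correct answer: B
11. In eukaryotes, the two ends of an intron, that is, the exon/intron splice boundaries, follow the ( ) rule. A. Kozak B. GU…AG C. SD D. Poly(A)n. Correct answer: B
12. The 5′ end of mRNA has a ( ) structure. A. cap B. tail C. cap and tail D. polynucleotide. Correct answer: A
13. The 5′ end of mRNA has a ( ) structure. A. cap B. tail C. cap and tail D. polynucleotide. Correct answer: A
14. Proteomics means: A. 生物信息学 (bioinformatics) B. 基因组学 (genomics) C. 蛋白质组学 (proteomics) D. 表观遗传学 (epigenetics). Correct answer: C
15. STS means: A. expressed sequence tag B. sequence-tagged site C. high-throughput genomic sequence D. artificially synthesized sequence. Correct answer: B
16. EST means: A. expressed sequence tag B. sequence-tagged site C. high-throughput genomic sequence D. artificially synthesized sequence. Correct answer: A
17. In eukaryotes, the two ends of an intron, that is, the exon/intron splice boundaries, follow the ( ) rule. A. Kozak B. GU…AG C. SD D. Poly(A)n. Correct answer: B
18. The short sequences obtained from a cDNA library are: A. STS B. UTR C. CDS D. EST. Correct answer: D
19. Proteomics means: A. 生物信息学 (bioinformatics) B. 基因组学 (genomics) C. 蛋白质组学 (proteomics) D. 表观遗传学 (epigenetics). Correct answer: C
20. UTR means: A. coding region B. untranslated region C. low-complexity region D. open reading frame. Correct answer: B
21. In eukaryotes, the sequence around the start codon AUG at the 5′ end of a gene's cDNA follows the ( ) rule. A. Kozak B. GU…AG C. SD D. Poly(A)n. Correct answer: A
22. base pair means: A. 基序 (motif) B. 跨叠克隆群 (contig) C. 碱基对 (base pair) D. 结构域 (domain). Correct answer: C
23. In ( ) the US Congress approved the formal launch of the Human Genome Project; in ( ) the draft sequence was published. A. 1990, 2004 B. 1990, 2001 C. 1988, 2004 D. 1988, 2001. Correct answer: B
24. Restriction fragment length polymorphism markers are ( ).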

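2012 Bioinformatics Question Bank (1)(2)

I. Multiple choice (with answers)

1. Which of the following is the accession number of an mRNA entry? A. J01536 B. NM_15392 C. NP_52280 D. AAB134506. Answer: B
2. The most direct way to find out in which tissues a gene is expressed is: A. UniGene B. Entrez C. LocusLink D. PCR. Answer: A
3. Can one gene correspond to two UniGene clusters? A. Yes B. No. Answer: A
4. Which of the following databases is derived from mRNA information? A. dbEST B. PDB C. OMIM D. HTGS. Answer: A
5. Which of the following databases is built around human disease? A. EST B. PDB C. OMIM D. HTGS. Answer: C
6. What is the difference between RefSeq and GenBank? A. RefSeq contains the DNA sequences submitted by laboratories and sequencing projects worldwide B. GenBank provides non-redundant sequences C. RefSeq is derived from GenBank and provides non-redundant sequence information D. GenBank is derived from RefSeq. Answer: C
7. If you need to look up literature, which database is your best choice? A. OMIM B. Entrez C. PubMed D. PROSITE. Answer: C
8. Comparing the retrieval of protein sequence information from Entrez and from ExPASy, which statement is correct? A. Because GenBank holds more data than EMBL, Entrez will return more results B. The results will probably be the same, because GenBank and EMBL hold practically the same sequence data C. The results should be comparable, but the SwissProt records in ExPASy are output in a different format. Answer: C
9. The one-letter codes for asparagine, tryptophan, and tyrosine are, respectively: A. N/W/Y B. Q/W/Y C. F/W/Y D. Q/N/W. Answer: A
10. Orthologs are defined as: A. Homologous sequences in different species that share a common ancestor B. Homologous sequences with low amino acid identity but high structural similarity C. Homologous sequences within one species that arose by gene duplication D. Homologous sequences within one species that have similar and usually redundant functions. Answer: A
11. Which of the following amino acids is least mutable? A. Alanine B. Glutamine C. Methionine D. Cysteine. Answer: D
12. The evolutionary distance defined by the PAM250 matrix corresponds to what percentage of amino acids having changed between two homologous sequences over the given time? A. 1% B. 20% C. 80% D. 250%. Answer: C
13. Which sentence best describes the difference between global and local alignment of two sequences? A. Global alignment is usually used for DNA sequences, local alignment for protein sequences B. Global alignment allows gaps, local alignment does not C. Global alignment seeks a global maximum, local alignment a local maximum D. Global alignment aligns the sequences over their whole length, local alignment finds the best-matching subsequences. Answer: D
14. Suppose you have two distantly related protein sequences.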

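Bioinformatics End-of-Chapter Questions and Answers

Bioinformatics end-of-chapter exercises and answers (compiled by the class representatives of Biotechnology Classes 1 and 2, 2010 intake)

I. Introduction

1. In your view, what is bioinformatics?

A discipline that uses information science and technology, drawing on the theories and methods of mathematics and biology, to collect, process, store, analyze, and interpret all kinds of biological information (including nucleic acids and proteins).

2. What do you think bioinformatics is useful for? Does it affect your life or research?

(1) Main uses. In genome analysis: biological sequence similarity comparison and database searching, gene prediction, genome evolution and molecular evolution, protein structure prediction, and so on. In medicine: new drug design, rapid disease diagnosis with gene chips, and epidemiological research (SARS). In genome projects: the Human Genome Project; gene chips.

(2) It guides research and experimental design and reduces the amount of bench work; it validates experimental results; and it provides additional supporting data for experimental results.

3. How is the Human Genome Project related to bioinformatics?

The Human Genome Project drove the rapid development of sequencing technology, which sharply increased the volume of experimental data and usable information, so that managing and analyzing information became an important part of the genome project. The management, analysis, interpretation, and use of these data in turn gave rise to bioinformatics and drove its rapid development.

4. Briefly describe the course of the Human Genome Project.

Through international collaboration, over 15 years (1990-2005) and with an investment of at least US$3 billion, the aim was to construct detailed genetic and physical maps of the human genome, determine the complete nucleotide sequence of human DNA, locate about 100,000 genes, and carry out similar studies on other organisms.

1990: The Human Genome Project is formally launched.

1996: Genetic mapping for the Human Genome Project is completed; model-organism genome projects are launched.

1998: Physical mapping for the Human Genome Project is completed and large-scale sequencing of the human genome begins. Celera joins and competes with the public sector; the rice genome project is launched.

1999: The fifth international public-sector human genome sequencing meeting; sequencing is accelerated.

2000: Celera announces completion of the Drosophila genome sequence; the international public sector announces completion of the first whole plant genome sequence, that of Arabidopsis thaliana.

2001: The "Chinese chapter" of the human genome is declared complete.

2003: Scientists from six countries (China, the United States, Japan, Germany, France, and the United Kingdom) announce that the human genome sequence map has been completed and that all goals of the Human Genome Project have been achieved.

2004: The finished human genome sequence is published.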

Compendium of English Bioinformatics Terms and Definitions

Abstract Syntax Notation (ASN.1) (the internal format used by many NCBI-developed programs, such as Cn3D for displaying three-dimensional protein structures): A language that is used to describe structured data types formally. Within bioinformatics, it has been used by the National Center for Biotechnology Information to encode sequences, maps, taxonomic information, molecular structures, and bibliographic information in such a way that it can be easily accessed and exchanged by computer software.

Accession number（记录号）: A unique identifier that is assigned to a single database entry for a DNA or protein sequence.

Affine gap penalty (a gap-penalty scoring scheme): A gap penalty score that is a linear function of gap length, consisting of a gap opening penalty and a gap extension penalty multiplied by the length of the gap. Using this penalty scheme greatly enhances the performance of dynamic programming methods for sequence alignment. See also Gap penalty.

Algorithm（算法）: A systematic procedure for solving a problem in a finite number of steps, typically involving a repetition of operations. Once specified, an algorithm can be written in a computer language and run as a program.

Alignment（联配/比对）: The procedure of comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences. Of the two types of alignment, local and global, a local alignment is generally the most useful. See also Local and Global alignments.

Alignment score（联配/比对值）: An algorithmically computed score based on the number of matches, substitutions, insertions, and deletions (gaps) within an alignment. Scores for matches and substitutions are derived from a scoring matrix such as the BLOSUM and PAM matrices for proteins, and affine gap penalties suitable for the matrix are chosen. Alignment scores are in log odds units, often bit units (log to the base 2). Higher scores denote better alignments.
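The Affine gap penalty and Alignment score entries can be illustrated with a short Python sketch. It scores an already-aligned pair of rows; the flat match/mismatch values stand in for a real scoring matrix such as BLOSUM, and all numeric defaults and function names are assumptions chosen for illustration.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for one gap: opening cost plus extension cost times gap length."""
    return gap_open + gap_extend * length


def score_alignment(row_a, row_b, match=5, mismatch=-4, gap_open=10, gap_extend=1):
    """Score two aligned rows of equal length, where '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row = None  # which row holds the current gap run: "a", "b", or None
    run = 0         # length of the current gap run
    for x, y in zip(row_a, row_b):
        current = "a" if x == "-" else "b" if y == "-" else None
        if gap_row is not None and current != gap_row:
            # The previous gap run just ended, so charge for it once.
            score -= affine_gap_penalty(run, gap_open, gap_extend)
            run = 0
        gap_row = current
        if current is None:
            score += match if x == y else mismatch
        else:
            run += 1
    if gap_row is not None:
        score -= affine_gap_penalty(run, gap_open, gap_extend)
    return score


if __name__ == "__main__":
    print(score_alignment("AC--GT", "ACTTGT"))
```

Because the opening cost is charged once per gap, one gap of length two costs less than two separate gaps of length one, which is the behavior the affine scheme is meant to encourage.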
See also Similarity score, Distance in sequence analysis.

Alphabet（字母表）: The set of symbols from which a sequence is built: 4 for DNA sequences and 20 for protein sequences.

Annotation（注释）: The prediction of genes in a genome, including the location of protein-encoding genes, the sequence of the encoded proteins, any significant matches to other proteins of known function, and the location of RNA-encoding genes. Predictions are based on gene models; e.g., hidden Markov models of introns and exons in protein-encoding genes, and models of secondary structure in RNA.

Anonymous FTP（匿名FTP）: When an FTP service allows anyone to log in, it is said to provide anonymous FTP service. A user can log in to an anonymous FTP server by typing anonymous as the user name and an e-mail address as the password. Most Web browsers now negotiate anonymous FTP logon without asking the user for a user name and password. See also FTP.

ASCII: The American Standard Code for Information Interchange (ASCII) encodes unaccented letters a-z, A-Z, the numbers 0-9, most punctuation marks, space, and a set of control characters such as carriage return and tab. ASCII specifies 128 characters that are mapped to the values 0-127. ASCII files are commonly called plain text, meaning that they only encode text without extra markup.

BAC clone（细菌人工染色体克隆）: Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100–200 kb. Most of the large-insert clones sequenced in the project were BAC clones.

Back-propagation（反向传输）: When training feed-forward neural networks, a back-propagation algorithm can be used to modify the network weights. After each training input pattern is fed through the network, the network's output is compared with the desired output and the amount of error is calculated. This error is back-propagated through the network by using an error function to correct the network weights.
See also Feed-forward neural network.

Baum-Welch algorithm（Baum-Welch算法）: An expectation maximization algorithm that is used to train hidden Markov models.

Bayes' rule（贝叶斯法则）: Forms the basis of conditional probability by calculating the likelihood of an event occurring based on the history of the event and relevant background information. In terms of two parameters A and B, the theorem is stated in an equation: the conditional probability of A, given B, P(A|B), is equal to the probability of A, P(A), times the conditional probability of B, given A, P(B|A), divided by the probability of B, P(B). P(A) is the historical or prior distribution value of A, P(B|A) is a new prediction for B for a particular value of A, and P(B) is the sum of the newly predicted values for B. P(A|B) is a posterior probability, representing a new prediction for A given the prior knowledge of A and the newly discovered relationships between A and B.

Bayesian analysis（贝叶斯分析）: A statistical procedure used to estimate parameters of an underlying distribution based on an observed distribution. See also Bayes' rule.

Biochips（生物芯片）: Miniaturized arrays of large numbers of molecular substrates, often oligonucleotides, in a defined pattern. They are also called DNA microarrays and microchips.

Bioinformatics（生物信息学）: The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology. Also, the discipline of obtaining information about genomic or protein sequence data. This may involve similarity searches of databases, comparing an unidentified sequence to the sequences in a database, or making predictions about the sequence based on current knowledge of similar sequences. Databases are frequently made publicly available through the Internet, or locally at an institution.

Bit score（二进制值/Bit值）: The value S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account.
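As a hedged illustration of how S' is usually obtained from S, the sketch below uses the standard Karlin-Altschul normalization, S' = (λS − ln K) / ln 2. The parameters `lam` (λ) and `k` (K) are the statistical parameters of the scoring system; they are not defined in this glossary and are introduced here as assumptions, as is the helper `expect_value`, which applies the usual relation E = m · n · 2^(−S') for a query of length m and a database of length n.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    # Illustrative parameter values only; real values depend on the scoring system.
    s_prime = bit_score(raw_score=100, lam=0.267, k=0.041)
    print(round(s_prime, 1), expect_value(s_prime, query_len=250, db_len=1_000_000))
```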
Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.

Bit units: From information theory, a bit denotes the amount of information required to distinguish between two equally likely possibilities. The number of bits of information, N, required to convey a message that has M possibilities is log2 M = N bits.

BLAST (Basic Local Alignment Search Tool, a major database search program): A set of programs used to perform fast similarity searches. Nucleotide sequences can be compared with nucleotide sequences in a database using BLASTN, for example. Complex statistics are applied to judge the significance of each match. Reported sequences may be homologous to, or related to, the query sequence. The BLASTP program is used to search a protein database for a match against a query protein sequence. There are several other flavours of BLAST. BLAST2 is a newer release of BLAST that allows for insertions or deletions in the sequences being aligned. Gapped alignments may be more biologically significant.

Block (a conserved region in a protein family): Conserved ungapped patterns approximately 3-60 amino acids in length in a set of related proteins.

BLOSUM matrices (block substitution matrices, a major family of substitution matrices): An alternative to PAM tables, BLOSUM tables were derived using local multiple alignments of more distantly related sequences than were used for the PAM matrix. These are used to assess the similarity of sequences when performing alignments.

Boltzmann distribution（Boltzmann分布）: Describes the number of molecules that have energies above a certain level, based on the Boltzmann gas constant and the absolute temperature.

Boltzmann probability function（Boltzmann概率函数）: See Boltzmann distribution.

Bootstrap analysis: A method for testing how well a particular data set fits a model. For example, the validity of the branch arrangement in a predicted phylogenetic tree can be tested by resampling columns in a multiple sequence alignment to create many new alignments.
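The column-resampling step described in the Bootstrap analysis entry can be sketched in Python as follows. The alignment is assumed to be a list of equal-length strings, and the function names are illustrative; building and comparing the trees is outside the scope of this sketch.

```python
import random


def bootstrap_alignment(alignment, rng=random):
    """Resample alignment columns with replacement, keeping the alignment width."""
    n_cols = len(alignment[0])
    picked = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in picked) for row in alignment]


def columns(alignment):
    """Return the alignment as a list of column strings."""
    return ["".join(col) for col in zip(*alignment)]


if __name__ == "__main__":
    aln = ["ACGTAC", "ACGAAC", "TCGAAT"]
    rng = random.Random(42)  # seeded so that a run can be repeated
    for _ in range(3):
        print(bootstrap_alignment(aln, rng))
```

Each replicate has the same number of sequences and columns as the input, but some original columns appear several times and others not at all.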
The appearance of a particular branch in trees generated from these resampled sequences can then be measured. Alternatively, a sequence may be left out of an analysis to determine how much the sequence influences the results of an analysis.

Branch length（分支长度）: In sequence analysis, the number of sequence changes along a particular branch of a phylogenetic tree.

CDS or cds（编码序列）: Coding sequence.

Chebyshev's inequality: The probability that a random variable deviates from its mean by at least k standard deviations is less than or equal to 1/k², the square of 1 over the number of standard deviations from the mean.

Clone（克隆）: Population of identical cells or molecules (e.g. DNA), derived from a single ancestor.

Cloning vector（克隆载体）: A molecule that carries a foreign gene into a host, and allows or facilitates the multiplication of that gene in a host. When sequencing a gene that has been cloned using a cloning vector (rather than by PCR), care should be taken not to include the cloning vector sequence when performing similarity searches. Plasmids, cosmids, phagemids, YACs and PACs are example types of cloning vectors.

Cluster analysis（聚类分析）: A method for grouping together a set of objects that are most similar from a larger group of related objects. The relationships are based on some criterion of similarity or difference. For sequences, a similarity or distance score or a statistical evaluation of those scores is used.

Cobbler: A single sequence that represents the most conserved regions in a multiple sequence alignment. The BLOCKS server uses the cobbler sequence to perform a database similarity search as a way to reach sequences that are more divergent than would be found using the single sequences in the alignment for searches.

Coding system (neural networks): Regarding neural networks, a coding system needs to be designed for representing input and output.
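One common coding system, offered here only as an example and not as the scheme the glossary has in mind, is one-hot encoding, in which each sequence symbol becomes a vector with a single 1. The function name and the choice to map unknown symbols to an all-zero vector are assumptions of this sketch.

```python
def one_hot_encode(sequence, alphabet="ACGT"):
    """Encode each symbol as a 0/1 vector; symbols outside the alphabet give all zeros."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = []
    for symbol in sequence.upper():
        vector = [0] * len(alphabet)
        if symbol in index:
            vector[index[symbol]] = 1
        encoded.append(vector)
    return encoded


if __name__ == "__main__":
    for base, vector in zip("ACGN", one_hot_encode("ACGN")):
        print(base, vector)
```

For proteins the same function can be called with a 20-letter alphabet string.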
The level of success found when training the model will be partially dependent on the quality of the coding system chosen.

Codon usage: Analysis of the codons used in a particular gene or organism.

COG（直系同源簇）: Clusters of orthologous groups, a set of groups of related sequences in microorganisms and yeast (S. cerevisiae). These groups are found by whole proteome comparisons and include orthologs and paralogs. See also Orthologs and Paralogs.

Comparative genomics（比较基因组学）: A comparison of gene numbers, gene locations, and biological functions of genes in the genomes of diverse organisms, one objective being to identify groups of genes that play a unique biological role in a particular organism.

Complexity (of an algorithm)（算法的复杂性）: Describes the number of steps required by the algorithm to solve a problem as a function of the amount of data; for example, the length of sequences to be aligned.

Conditional probability（条件概率）: The probability of a particular result (or of a particular value of a variable) given one or more events or conditions (or values of other variables).

Conservation（保守）: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.

Consensus（一致序列）: A single sequence that represents, at each subsequent position, the variation found within corresponding columns of a multiple sequence alignment.

Context-free grammars: A recursive set of production rules for generating patterns of strings. These consist of a set of terminal characters that are used to create strings, a set of nonterminal symbols that correspond to rules and act as placeholders for patterns that can be generated using terminal characters, a set of rules for replacing nonterminal symbols with terminal characters, and a start symbol.

Contig（序列重叠群/拼接序列）: A set of clones that can be assembled into a linear order. A DNA sequence that overlaps with another contig.
The full set of overlapping sequences (contigs) can be put together to obtain the sequence for a long region of DNA that cannot be sequenced in one run in a sequencing assay. Important in genetic mapping at the molecular level.

CORBA (a set of common standards, drawn up by the Object Management Group, that unify OOP objects and network interfaces across computers, operating systems, programming languages, and networks): The Common Object Request Broker Architecture (CORBA) is an open industry standard for working with distributed objects, developed by the Object Management Group. CORBA allows the interconnection of objects and applications regardless of computer language, machine architecture, or geographic location of the computers.

Correlation coefficient（相关系数）: A numerical measure, falling between -1 and 1, of the degree of the linear relationship between two variables. A positive value indicates a direct relationship, a negative value indicates an inverse relationship, and the distance of the value away from zero indicates the strength of the relationship. A value near zero indicates no relationship between the variables.

Covariation (in sequences)（共变）: Coincident change at two or more sequence positions in related sequences that may influence the secondary structures of RNA or protein molecules.

Coverage (or depth)（覆盖率/厚度）: The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a "high-quality base" is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).

Database（数据库）: A computerized storehouse of data that provides a standardized way for locating, adding, removing, and changing data. See also Object-oriented database, Relational database.

Dendrogram: A form of a tree that lists the compared objects (e.g., sequences or genes in a microarray analysis) in a vertical order and joins related ones by levels of branches extending to one side of the list.

Depth（厚度）: See Coverage.

Dirichlet mixtures: Defined as the conjugate prior of a multinomial distribution.
One use is for predicting the expected pattern of amino acid variation found in the match state of a hidden Markov model (representing one column of a multiple sequence alignment of proteins), based on prior distributions found in conserved protein domains (blocks).

Distance in sequence analysis（序列距离）: The number of observed changes in an optimal alignment of two sequences, usually not counting gaps.

DNA sequencing（DNA测序）: The experimental process of determining the nucleotide sequence of a region of DNA. This is done by labelling each nucleotide (A, C, G or T) with either a radioactive or fluorescent marker which identifies it. There are several methods of applying this technology, each with its advantages and disadvantages. For more information, refer to a current textbook. High-throughput laboratories frequently use automated sequencers, which are capable of rapidly reading large numbers of templates. Sometimes the sequences may be generated more quickly than they can be characterised.

Domain（功能域）: A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.

Dot matrix（点标矩阵图）: Dot matrix diagrams provide a graphical method for comparing two sequences. One sequence is written horizontally across the top of the graph and the other along the left-hand side. Dots are placed within the graph at the intersection of the same letter appearing in both sequences. A series of diagonal lines in the graph indicate regions of alignment.
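The basic dot matrix construction just described can be sketched in a few lines of Python. This is a minimal, unfiltered version; the function names and the `*` and `.` display characters are illustrative choices.

```python
def dot_matrix(seq_top, seq_left):
    """Return rows of booleans: cell [i][j] is True when seq_left[i] == seq_top[j]."""
    return [[left == top for top in seq_top] for left in seq_left]


def render(seq_top, seq_left):
    """Draw the matrix as text, with seq_top across the top and seq_left down the side."""
    lines = ["  " + seq_top]
    for symbol, row in zip(seq_left, dot_matrix(seq_top, seq_left)):
        lines.append(symbol + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("GATTACA", "GATCACA"))
```

Runs of `*` along a diagonal correspond to stretches where the two sequences match position by position.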
The matrix may be filtered to reveal the most-alike regions by scoring a minimal threshold number of matches within a sequence window.

Draft genome sequence（基因组序列草图）: The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.

DUST (a program for filtering low-complexity regions): A program for filtering low complexity regions from nucleic acid sequences.

Dynamic programming（动态规划法）: A dynamic programming algorithm solves a problem by combining solutions to sub-problems that are computed once and saved in a table or matrix. Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found. This algorithm is used for producing sequence alignments, given a scoring system for sequence comparisons.

EMBL (European Molecular Biology Laboratory; the EMBL database is one of the major public nucleic acid sequence databases): European Molecular Biology Laboratory. Maintains the EMBL database, one of the major public sequence databases.

EMBnet（欧洲分子生物学网络）: European Molecular Biology Network. It was established in 1988 and provides services including local molecular databases and software for molecular biologists in Europe. There are several large outposts of EMBnet, including ExPASy.

Entropy（熵）: From information theory, a measure of the unpredictable nature of a set of possible elements. The higher the level of variation within the set, the higher the entropy.

Erdős and Rényi law: In a toss of a "fair" coin, the number of heads in a row that can be expected is the logarithm of the number of tosses to the base 2. The law may be generalized for more than two possible outcomes by changing the base of the logarithm to the number of outcomes.
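The statement of the law above amounts to a one-line formula, which the Python sketch below evaluates and compares with a simulated run of coin tosses. The function names are illustrative, and the generalization to `n_outcomes` assumes equally likely outcomes, as in the glossary's wording.

```python
import math
import random


def expected_longest_run(n_trials, n_outcomes=2):
    """Expected longest run of one outcome: log of n_trials to the base n_outcomes."""
    return math.log(n_trials) / math.log(n_outcomes)


def longest_run(symbols, target):
    """Length of the longest unbroken run of `target` in a sequence of symbols."""
    best = current = 0
    for symbol in symbols:
        current = current + 1 if symbol == target else 0
        best = max(best, current)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    tosses = [rng.choice("HT") for _ in range(1024)]
    print("expected:", expected_longest_run(1024))
    print("observed:", longest_run(tosses, "H"))
```

For random DNA with four equally likely bases, `expected_longest_run(n, 4)` gives the corresponding estimate for the longest run of a single base.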
This law was used to analyze the number of matches and mismatches that can be expected between random sequences as a basis for scoring the statistical significance of a sequence alignment.

EST (abbreviation of expressed sequence tag): See Expressed Sequence Tag.

Expect value (E)（E值）: E value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. In a database similarity search, the probability that an alignment score as good as the one found between a query sequence and a database sequence would be found in as many comparisons between random sequences as was done to find the matching sequence. In other types of sequence analysis, E has a similar meaning.

Expectation maximization (sequence analysis): An algorithm for locating similar sequence patterns in a set of sequences. A guessed alignment of the sequences is first used to generate an expected scoring matrix representing the distribution of sequence characters in each column of the alignment, this pattern is matched to each sequence, and the scoring matrix values are then updated to maximize the alignment of the matrix to the sequences. The procedure is repeated until there is no further improvement.

Exon（外显子）: Coding region of DNA. See CDS.

Expressed Sequence Tag (EST)（表达序列标签）: Randomly selected, partial cDNA sequence; represents its corresponding mRNA. dbEST is a large database of ESTs at GenBank, NCBI.

FASTA (a major database search program): The first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt".
The sensitivity and speed of the search are inversely related and controlled by the "k-tup" variable, which specifies the size of a "word". (Pearson and Lipman)

Extreme value distribution（极值分布）: Some measurements are found to follow a distribution that has a long tail which decays at high values much more slowly than that found in a normal distribution. This slow-falling type is called the extreme value distribution. The alignment scores between unrelated or random sequences are an example. These scores can reach very high values, particularly when a large number of comparisons are made, as in a database similarity search. The probability of a particular score may be accurately predicted by the extreme value distribution, which follows a double negative exponential function after Gumbel.

False negative（假阴性）: A negative data point collected in a data set that was incorrectly reported due to a failure of the test. If the test had correctly measured the data point, the data would have been recorded as positive.

False positive（假阳性）: A positive data point collected in a data set that was incorrectly reported due to a failure of the test. If the test had correctly measured the data point, the data would have been recorded as negative.

Feed-forward neural network（前馈神经网络）: Organizes nodes into sequential layers in which the nodes in each layer are fully connected with the nodes in the next layer, except for the final output layer. Input is fed from the input layer through the layers in sequence in a "feed-forward" direction, resulting in output at the final layer. See also Neural network.

Filtering (window size): During pair-wise sequence alignment using the dot matrix method, random matches can be filtered out by using a sliding window to compare the two sequences. Rather than comparing a single sequence position at a time, a window of adjacent positions in the two sequences is compared and a dot, indicating a match, is generated only if a certain minimal number of matches occur.

Filtering（过滤）: Also known as masking.
The process of hiding regions of (nucleic acid or amino acid) sequence having characteristics that frequently lead to spurious high scores. See SEG and DUST.
Finished sequence（完成序列） Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.
Fourier analysis: Studies the approximation and decomposition of functions using trigonometric polynomials.
Format (file)（格式） Different programs require that information be specified to them in a formal manner, using particular keywords and ordering. This specification is a file format.
Forward-backward algorithm: Used to train a hidden Markov model on a set of training sequences. For each sequence, the algorithm computes the probability of being in each state at each position, given the whole sequence. These probabilities give the expected counts from which the transition and emission probabilities of the model are re-estimated (the Baum-Welch procedure), and the cycle is repeated until the fit to the training data stops improving.
FTP (File Transfer Protocol)（文件传输协议） Allows a person to transfer files from one computer to another across a network using an FTP-capable client program. The FTP client program can only communicate with machines that run an FTP server. The server, in turn, will make a specific portion of its file system available for FTP access, provided that the client is able to supply a recognized user name and password to the server.
Full shotgun clone（鸟枪法克隆） A large-insert clone for which full shotgun sequence has been produced.
Functional genomics（功能基因组学） Assessment of the function of genes identified by between-genome comparisons. The function of a newly identified gene is tested by introducing mutations into the gene and then examining the resultant mutant organism for an altered phenotype.
Gap（空位/间隙/缺口） A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score.
Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.
Gap penalty（空位罚分） A numeric score used in sequence alignment programs to penalize the presence of gaps within an alignment. The value of a gap penalty affects how often gaps appear in alignments produced by the algorithm. Most alignment programs suggest gap penalties that are appropriate for particular scoring matrices.
Genetic algorithm（遗传算法） A kind of search algorithm that was inspired by the principles of evolution. A population of initial solutions is encoded and the algorithm searches through these by applying a pre-defined fitness measurement to each solution, selecting those with the highest fitness for reproduction. New solutions can be generated during this phase by crossover and mutation operations, defined in the encoded solutions.
Genetic map（遗传图谱） A genome map in which polymorphic loci are positioned relative to one another on the basis of the frequency with which they recombine during meiosis. The unit of distance is the centimorgan (cM), denoting a 1% chance of recombination.
Genome（基因组） The genetic material of an organism, contained in one haploid set of chromosomes.
Gibbs sampling method: An algorithm for finding conserved patterns within a set of related sequences. A guessed alignment of all but one sequence is made and used to generate a scoring matrix that represents the alignment. The matrix is then matched to the left-out sequence, and a probable location of the corresponding pattern is found. This prediction is then input into a new alignment and another scoring matrix is produced and tested on a new left-out sequence.
The process is repeated until there is no further improvement in the matrix.
Global alignment（整体联配） Attempts to match as many characters as possible, from end to end, in a set of two or more sequences.
Gopher（一个文档发布系统，允许检索和显示文本文件） A document publishing system that allows text files to be retrieved and displayed.
Graph theory（图论） A branch of mathematics which deals with problems that involve a graph or network structure. A graph is defined by a set of nodes (or points) and a set of arcs (lines or edges) joining the nodes. In sequence and genome analysis, graph theory is used for sequence alignment and for clustering similar genes.
GSS（基因综述序列） Genome survey sequence.
GUI（图形用户界面） Graphical user interface.
H（相对熵值） H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990). H can be thought of as a measure of the average information (in bits) available per position that distinguishes an alignment from chance. At high values of H, short alignments can be distinguished from chance, whereas at lower H values, a longer alignment may be necessary (Altschul, 1991).
Half-bits: Some scoring matrices are in half-bit units. These units are logarithms to the base 2 of odds scores times 2.
Heuristic（启发式方法） A procedure that progresses along empirical lines by using rules of thumb to reach a solution. The solution is not guaranteed to be optimal.
Hexadecimal system（十六进制系统） The base 16 counting system that uses the digits 0-9 followed by the letters A-F.
HGMP（人类基因组图谱计划） Human Genome Mapping Project.
Hidden Markov Model (HMM)（隐马尔可夫模型） In sequence analysis, an HMM is usually a probabilistic model of a multiple sequence alignment, but can also be a model of periodic patterns in a single sequence, representing, for example, patterns found in the exons of a gene. In a model of multiple sequence alignments, each column of symbols in the alignment is represented by a frequency distribution of the symbols called a state, and insertions and deletions by other states.
One then moves through the model along a particular path from state to state trying to match a given sequence. The next matching symbol is chosen from each state, recording its probability (frequency) and also the probability of going to that particular state from a previous one (the transition probability). State and transition probabilities are then multiplied to obtain a probability of the given sequence. Generally speaking, an HMM is a statistical model for an ordered sequence of symbols, acting as a stochastic state machine that generates a symbol each time a transition is made from one state to the next. Transitions between states are specified by transition probabilities.
Hidden layer（隐藏层） An inner layer within a neural network that receives its input from and sends its output to other layers within the network. One function of the hidden layer is to detect covariation within the input data, such as patterns of amino acid covariation that are associated with a particular type of secondary structure in proteins.
Hierarchical clustering（分级聚类） The clustering or grouping of objects based on some single criterion of similarity or difference. An example is the clustering of genes in a microarray experiment based on the correlation between their expression patterns. The distance method used in phylogenetic analysis is another example.
Hill climbing: A nonoptimal search algorithm that selects the single best possible solution at a given state or step. The result may be a locally best solution that is not a globally best solution.
Homology（同源性） A similar component in two organisms (e.g., genes with strongly similar sequences) that can be attributed to a common ancestor of the two organisms during evolution.
Horizontal transfer（水平转移） The transfer of genetic material between two distinct species that do not ordinarily exchange genetic material.
The transferred DNA becomes established in the recipient genome and can be detected by a novel phylogenetic history and codon content compared to the rest of the genome.
HSP（高比值片段对） High-scoring segment pair. Local alignments with no gaps that achieve one of the top alignment scores in a given search.
HTGS/HTG（高通量基因组序列） High-throughput genome sequences.
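The Expect value and Extreme value distribution entries above can be tied together numerically. The following is a minimal sketch, assuming the Karlin-Altschul form E = K·m·n·e^(-λS), which this glossary does not spell out. The function names are hypothetical, and the values of λ (`lam`) and K (`k`) in the usage lines are illustrative only; in practice they depend on the scoring matrix and gap penalties.

```python
import math

def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score
    (assumed Karlin-Altschul form: E = K * m * n * exp(-lambda * S))."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def p_value(e_value):
    """Probability of seeing at least one such chance alignment.
    This is the tail of the extreme value (Gumbel) distribution."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E), stable for small E

def bit_score(raw_score, lam, k):
    """Normalized score in bits, so that E = m * n * 2 ** (-bits)."""
    return (lam * raw_score - math.log(k)) / math.log(2)

if __name__ == "__main__":
    # Illustrative parameter values only (assumptions, not from the glossary).
    e = expect_value(raw_score=60, query_len=250, db_len=5_000_000, lam=0.267, k=0.041)
    print(f"E = {e:.3g}, P = {p_value(e):.3g}, bits = {bit_score(60, 0.267, 0.041):.1f}")
```

For small E the two quantities are nearly equal, which is why E and P are often used interchangeably for significant hits.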

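Bioinformatics Exam Questions

Bioinformatics exam (2012 edition)
I. Fill in the blanks (10 points in total, 1 point per blank)
1. In October 1990 the United States government launched a 15-year research program costing 3 billion US dollars, which was expected to complete the full sequencing of the roughly 3 billion bases of the human genome by 2005. This is the Human Genome Project, known as the "moon-landing program" of the life sciences.

2. The research goal of bioinformatics: taking databases of biological macromolecules such as nucleic acids and proteins as the main object, mathematics, information science, and computer science as the main methods, and computer hardware, software, and computer networks as the main tools, to store, manage, annotate, and process the vast body of raw data so that it becomes biological information with clear biological meaning.

3. With the birth and application of bioinformatics, the starting point of future biological research projects will be theoretical: a scientist will begin with a theoretical conjecture and then turn to experiment to follow up or test that hypothesis.

4. As an interdisciplinary field, bioinformatics has become one of the major frontiers of today's life sciences and of the natural sciences as a whole, and it will also be one of the core fields of natural science in the 21st century.

5. The Human Genome Project, the Manhattan Project, and the Apollo moon-landing program are together known as the three great programs of the 20th century. In 1999 China took on 1% of the research task, namely sequencing 30 million base pairs on chromosome 3.

6. The main tasks of the Human Genome Project are genome mapping, sequencing, and gene identification for the human genome and for some model organisms (bacteria, yeast, nematode, fruit fly, etc.).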

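II. True or false (10 points in total, 1 point each)
1. Biology is purely an experimental science; all research conclusions come from experiments and are verified in experiments. (False)

2. Comparison is the most common method in scientific research; in bioinformatics research, alignment is the most commonly used and most classic technique. (True)

3. Two protein sequences with more than 30% similarity are homologous proteins. (False)

4. Protein sequence similarity means that the amino acid residues in the primary sequence are identical. (False)

5. Protein sequence similarity means that amino acid residues have similar properties: the same side-chain size, charge, hydrophobicity, and so on. (True)

6. Nucleic acid sequence similarity refers to the proportion of identical bases in the sequences. (True)

7. Predicting the function of a DNA fragment of unknown function requires a three-frame translation. (False)

8. Predicting the function of a DNA fragment of unknown function requires a six-frame translation. (True)

9. Similarity is a very direct quantitative relationship and requires no experimental verification.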
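The six-frame translation mentioned in items 7 and 8 can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and an input containing only A, C, G, and T; the function names are hypothetical and no particular tool is implied.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Translate three frames on the given strand and three on its reverse complement."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCC").items():
        print(name, protein)
```

All six frames are needed because neither the coding strand nor the reading frame of an uncharacterized fragment is known in advance.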

Principal Bioinformatics Terms in English, with Definitions (continued, final part)

These substitutions may be found in an amino acid substitution matrix such as the Dayhoff PAM and Henikoff BLOSUM matrices. Columns in the alignment that include gaps are not scored in the calculation.
Perceptron（感知器，模拟人类视神经控制系统的图形识别机） A neural network in which input and output states are directly connected without intervening hidden layers.
PHRED（一种广泛应用的原始序列分析程序，可以对序列的各个碱基进行识别和质量评价） A widely used computer program that analyses raw sequence to produce a 'base call' with an associated 'quality score' for each position in the sequence. A PHRED quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a PHRED quality score of 30 corresponds to 99.9% accuracy for the base call in the raw read.
PHRAP（一种广泛应用的原始序列组装程序） A widely used computer program that assembles raw sequence into sequence contigs and assigns to each position in the sequence an associated 'quality score', on the basis of the PHRED scores of the raw sequence reads. A PHRAP quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a PHRAP quality score of 30 corresponds to 99.9% accuracy for a base in the assembled sequence.
Phylogenetic studies（系统发育研究）
PIR（主要蛋白质序列数据库之一，翻译自GenBank） A database of translated GenBank nucleotide sequences. PIR is a redundant (see Redundancy) protein sequence database. The database is divided into four categories: PIR1 - Classified and annotated. PIR2 - Annotated. PIR3 - Unverified. PIR4 - Unencoded or untranslated.
Poisson distribution（帕松分布） Used to predict the occurrence of infrequent events over a long period of time or when there are a large number of trials. In sequence analysis, it is used to calculate the chance that one pair of a large number of pairs of unrelated sequences may give a high local alignment score.
Position-specific scoring matrix (PSSM)（特定位点记分矩阵，PSI-BLAST等搜索程序使用） The PSSM gives the log-odds score for finding a particular matching amino acid in a target sequence.
Represents the variation found in the columns of an alignment of a set of related sequences. Each subsequent matrix column corresponds to the next column in the alignment and each row corresponds to a particular sequence character (one of four bases in DNA sequences or 20 amino acids in protein sequences). Matrix values are log odds scores obtained by dividing the counts of the residue in the alignment by the expected number of counts based on sequence composition, and converting the ratio to a log score. The matrix is moved along sequences to find similar regions by adding the matching log odds scores and looking for high values. There is no allowance for gaps. Also called a weight matrix or scoring matrix.
Posterior (Bayesian analysis): A conditional probability based on prior knowledge and newly evaluated relationships among variables using Bayes rule. See also Bayes rule.
Prior (Bayesian analysis): The expected distribution of a variable based on previous data.
Profile（分布型） A matrix representation of a conserved region in a multiple sequence alignment that allows for gaps in the alignment. The rows include scores for matching sequential columns of the alignment to a test sequence. The columns include substitution scores for amino acids and gap penalties. See also PSSM.
Profile hidden Markov model（分布型隐马尔可夫模型） A hidden Markov model of a conserved region in a multiple sequence alignment that includes gaps and may be used to search new sequences for similarity to the aligned sequences.
Proteome（蛋白质组） The entire collection of proteins that are encoded by the genome of an organism. Initially the proteome is estimated by gene prediction and annotation methods but eventually will be revised as more information on the sequence of the expressed genes is obtained.
Proteomics（蛋白质组学） Systematic analysis of protein expression in normal and diseased tissues that involves the separation, identification and characterization of all of the proteins in an organism.
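The position-specific scoring matrix entry above describes building log odds scores from column counts and sliding the matrix along a sequence without gaps. The following is a minimal sketch for DNA, assuming a uniform background composition and a simple pseudocount (see the Pseudocounts entry) to avoid taking the log of zero. The function names and the example motif are hypothetical.

```python
import math
from collections import Counter

def build_pssm(aligned, alphabet="ACGT", background=None, pseudocount=1.0):
    """Build one {character: log2 odds} dict per column of an ungapped alignment."""
    background = background or {c: 1.0 / len(alphabet) for c in alphabet}
    n = len(aligned)
    pssm = []
    for column_chars in zip(*aligned):
        counts = Counter(column_chars)
        column = {}
        for c in alphabet:
            # Observed frequency, smoothed by a background-weighted pseudocount.
            freq = (counts[c] + pseudocount * background[c]) / (n + pseudocount)
            column[c] = math.log2(freq / background[c])
        pssm.append(column)
    return pssm

def best_window(pssm, sequence):
    """Slide the matrix along the sequence (characters must be in the alphabet)
    and return (best score, start position)."""
    width = len(pssm)
    scores = [
        (sum(col[ch] for col, ch in zip(pssm, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)

if __name__ == "__main__":
    motif_sites = ["TATA", "TATA", "TATT", "TACA"]  # hypothetical aligned sites
    matrix = build_pssm(motif_sites)
    print(best_window(matrix, "GGCTATAGG"))
```

The score of a window is the sum of the log odds values of its characters, so a positive total means the window looks more like the motif than like background sequence.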
Pseudocounts: Small number of counts that is added to the columns of a scoring matrix to increase the variability, either to avoid zero counts or to add more variation than was found in the sequences used to produce the matrix.
PSI-BLAST（BLAST系列程序之一） Position-Specific Iterative BLAST. An iterative search using the BLAST algorithm. A profile is built after the initial search, which is then used in subsequent searches. The process may be repeated, if desired, with new sequences found in each cycle used to refine the profile. (Altschul et al.)
PSSM（特定位点记分矩阵） See Position-specific scoring matrix and Profile.
Public sequence databases（公共序列数据库，指GenBank、EMBL和DDBJ） The three coordinated international sequence databases: GenBank, the EMBL data library and DDBJ.
Q20 (Quality score 20): A quality score of 20 or more indicates that there is at most a 1 in 100 chance that the base call is incorrect. These are consequently high-quality bases. Specifically, the quality value "q" assigned to a basecall is defined as q = -10 × log10(p), where p is the estimated error probability for that basecall. Note that high quality values correspond to low error probabilities, and conversely.
Quality trimming: This is an algorithm which uses a sliding window of 50 bases and trims from the 5' end of the read followed by the 3' end. With each window, the number of low quality (10 or less) bases is determined. If more than 5 bases are below the threshold quality, the window is incremented by one base and the process is repeated. When the low quality test fails, the position where it stopped is recorded. The parameters for window length, low quality threshold, and number of low quality bases tolerated are fixed. The positions of the 5' and 3' boundaries of the quality region are noted in the plot of quality values presented in the "Chromatogram Details" report.
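The Q20 and Quality trimming entries above can be sketched in a few lines. This is one possible reading of the description, not the implementation of any particular tool: the function names are hypothetical, the defaults (window of 50, threshold 10, at most 5 low-quality bases) are taken from the entry, and the 3' boundary is found by running the same scan on the reversed read.

```python
import math

def phred_to_error_prob(q):
    """Error probability for quality value q, from q = -10 * log10(p)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Quality value for an estimated error probability p."""
    return -10 * math.log10(p)

def first_good_window(quals, window=50, low_q=10, max_low=5):
    """Start of the first window holding at most max_low bases of quality <= low_q."""
    for start in range(len(quals) - window + 1):
        low = sum(1 for q in quals[start:start + window] if q <= low_q)
        if low <= max_low:
            return start
    return None  # no window passes the test

def trim_bounds(quals, window=50, low_q=10, max_low=5):
    """Return (start, end) of the retained region as a half-open interval, or None."""
    left = first_good_window(quals, window, low_q, max_low)
    if left is None:
        return None
    right_rev = first_good_window(quals[::-1], window, low_q, max_low)
    return left, len(quals) - right_rev

if __name__ == "__main__":
    print(phred_to_error_prob(20))  # error probability at Q20
    read_quals = [5] * 4 + [30] * 10 + [5] * 3  # toy read, small window for display
    print(trim_bounds(read_quals, window=4, max_low=1))
```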
Query（待查序列/搜索序列） The input sequence (or other type of search term) with which all of the entries in a database are to be compared.
Radiation hybrid (RH) map（辐射杂交图谱） A genome map in which STSs are positioned relative to one another on the basis of the frequency with which they are separated by radiation-induced breaks. The frequency is assayed by analysing a panel of human-hamster hybrid cell lines, each produced by lethally irradiating human cells and fusing them with recipient hamster cells such that each carries a collection of human chromosomal fragments. The unit of distance is the centiray (cR), denoting a 1% chance of a break occurring between two loci.
Raw Score（初值，指最初得到的联配值S） The score of an alignment, S, calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (see PAM, BLOSUM). Gap scores are typically calculated as the sum of G, the gap opening penalty, and L, the gap extension penalty. For a gap of length n, the gap cost would be G + Ln. The choice of gap costs G and L is empirical, but it is customary to choose a high value for G (10-15) and a low value for L (1-2).
Raw sequence（原始序列/读胶序列） Individual unassembled sequence reads, produced by sequencing of clones containing DNA inserts.
Receiver operator characteristic: The receiver operator characteristic (ROC) curve describes the probability that a test will correctly declare the condition present against the probability that the test will declare the condition present when actually absent. This is shown through a graph of the test's sensitivity against one minus the test specificity for different possible threshold values.
Redundancy（冗余） The presence of more than one identical item represents redundancy. In bioinformatics, the term is used with reference to the sequences in a sequence database. If a database is described as being redundant, more than one identical (redundant) sequence may be found.
If the database is said to be non-redundant (nr), the database managers have attempted to reduce the redundancy. The term is ambiguous with reference to genetics, and as such, the degree of non-redundancy varies according to the database manager's interpretation of the term. One can argue whether or not two alleles of a locus define the limit of redundancy, or whether the same locus in different, closely related organisms constitutes redundancy. Non-redundant databases are, in some ways, superior, but are less complete. These factors should be taken into consideration when selecting a database to search.
Regular expressions: This computational tool provides a method for expressing the variations found in a set of related sequences, including a range of choices at one position, insertions, repeats, and so on. For example, these expressions are used to characterize variations found in protein domains in the PROSITE catalog.
Regularization: A set of techniques for reducing data overfitting when training a model. See also Overfitting.
Relational database（关系数据库） Organizes information into tables where each column represents the fields of information that can be stored in a single record. Each row in the table corresponds to a single record. A single database can have many tables and a query language is used to access the data. See also Object-oriented database.
Scaffold（支架，由序列重叠群拼接而成） The result of connecting contigs by linking information from paired-end reads from plasmids, paired-end reads from BACs, known messenger RNAs or other sources. The contigs in a scaffold are ordered and oriented with respect to one another.
Scoring matrix（记分矩阵） See Position-specific scoring matrix.
SEG（一种蛋白质程序低复杂性区段过滤程序） A program for filtering low complexity regions in amino acid sequences. Residues that have been masked are represented as "X" in an alignment. SEG filtering is performed by default in the blastp subroutine of BLAST 2.0.
(Wootton and Federhen)
Selectivity (in database similarity searches)（数据库相似性搜索的选择准确性） The ability of a search method to locate members of a protein family without making a false-positive classification of members of other families.
Sensitivity (in database similarity searches)（数据库相似性搜索的灵敏性） The ability of a search method to locate as many members of a protein family as possible, including distant members of limited sequence similarity.
Sequence Tagged Site（序列标签位点） Short DNA sequences of regions that have been physically mapped. STSs provide unique landmarks, or identifiers, throughout the genome. Useful as a framework for further sequencing.
Significance（显著水平） A significant result is one that has not simply occurred by chance, and therefore is probably true. Significance levels show how likely a result is due to chance, expressed as a probability. In sequence analysis, the significance of an alignment score may be calculated as the chance that such a score would be found between random or unrelated sequences. See Expect value.
Similarity score (sequence alignment)（相似性值） Similarity means the extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST, similarity refers to a positive matrix score. As a score, similarity is the sum of the number of identical matches and conservative (high scoring) substitutions in a sequence alignment divided by the total number of aligned sequence characters. Gaps are usually ignored.
Simulated annealing: A search algorithm that attempts to solve the problem of finding global extrema. The algorithm was inspired by the physical cooling process of metals and the freezing process in liquids, where atoms slow down in movement and line up to form a crystal.
The algorithm traverses the energy levels of a function, always accepting energy levels that are smaller than previous ones, but sometimes accepting energy levels that are greater, according to the Boltzmann probability distribution.
Single-linkage cluster analysis: An analysis of a group of related objects, e.g., similar proteins in different genomes, to identify both close and more distant relationships, represented on a tree or dendrogram. The method joins the most closely related pairs by the neighbor-joining algorithm by representing these pairs as outer branches on the tree. More distant objects are then progressively added to lower tree branches. The method is also used to predict phylogenetic relationships by distance methods. See also Hierarchical clustering, Neighbor-joining method.
Smith-Waterman algorithm（Smith-Waterman算法） Uses dynamic programming to find local alignments between sequences. The key feature is that all negative scores calculated in the dynamic programming matrix are changed to zero in order to avoid extending poorly scoring alignments and to assist in identifying local alignments starting and stopping anywhere within the matrix.
SNP（单核苷酸多态性） Single nucleotide polymorphism, or a single nucleotide position in the genome sequence for which two or more alternative alleles are present at appreciable frequency (traditionally, at least 1%) in the human population.
Space or time complexity（时间或空间复杂性） An algorithm's complexity is the maximum amount of computer memory or time required for the number of algorithmic steps to solve a problem.
Specificity (in database similarity searches)（数据库相似性搜索的特异性） The ability of a search method to locate members of one protein family, including distantly related members.
SSR（简单序列重复） Simple sequence repeat, a sequence consisting largely of a tandem repeat of a specific k-mer (such as (CA)15). Many SSRs are polymorphic and have been widely used in genetic mapping.
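The key feature named in the Smith-Waterman entry above, resetting negative cells to zero, is easiest to see in code. The following is a minimal sketch that returns only the best local score, assuming simple match and mismatch scores and a linear gap penalty rather than a substitution matrix with affine gaps; the function name and default scores are illustrative.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (score only, no traceback)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for ca in a:
        curr = [0]  # first column is always zero
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Negative values are replaced by zero, so an alignment can start anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)  # an alignment can also end anywhere
        prev = curr
    return best

if __name__ == "__main__":
    print(smith_waterman_score("ACACACTA", "AGCACACA"))
```

Keeping only two rows is enough for the score; recovering the alignment itself would require storing the full matrix and tracing back from the best cell.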
Stochastic context-free grammar: A formal representation of groups of symbols in different parts of a sequence; i.e., not in the same context. An example is complementary regions in RNA that will form secondary structures. The stochastic feature introduces variability into such regions.
Stringency: Refers to the minimum number of matches required within a window. See also Filtering.
STS（序列标签位点的缩写） See Sequence Tagged Site.
Substitution（替换） The presence of a non-identical amino acid at a given position in an alignment. If the aligned residues have similar physico-chemical properties the substitution is said to be "conservative".
Substitution Matrix（替换矩阵） A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Such matrices are constructed by assembling a large and diverse sample of verified pairwise alignments of amino acids. If the sample is large enough to be statistically significant, the resulting matrices should reflect the true probabilities of mutations occurring through a period of evolution.
Sum of pairs method: Sums the substitution scores of all possible pair-wise combinations of sequence characters in one column of a multiple sequence alignment.
SWISS-PROT（主要蛋白质序列数据库之一） A non-redundant (see Redundancy) protein sequence database. Thoroughly annotated and cross referenced. A subdivision is TrEMBL.
Synteny: The presence of a set of homologous genes in the same order on two genomes.
Threading: In protein structure prediction, the aligning of the sequence of a protein of unknown structure with a known three-dimensional structure to determine whether the amino acid sequence is spatially and chemically compatible with that structure.
TrEMBL（蛋白质数据库之一，翻译自EMBL） A protein sequence database of translated EMBL nucleotide sequences.
Uncertainty（不确定性） From information theory, a logarithmic measure of the average number of choices that must be made for identification purposes.
See also Information content.
Unified Modeling Language (UML): A standard sanctioned by the Object Management Group that provides a formal notation for describing object-oriented design.
UniGene（人类基因数据库之一） Database of unique human genes, at NCBI. Entries are selected by near identical presence in GenBank and dbEST databases. The clusters of sequences produced are considered to represent a single gene.
Unitary Matrix（一元矩阵） Also known as Identity Matrix. A scoring system in which only identical characters receive a positive score.
URL（统一资源定位符） Uniform resource locator.
Viterbi algorithm: Calculates the optimal path of a sequence through a hidden Markov model of sequences using a dynamic programming algorithm.
Weight matrix: See Position-specific scoring matrix.
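The Viterbi algorithm entry above can be illustrated with a small dynamic programming sketch. It works in log space and assumes that every start, transition, and emission probability is greater than zero; the two-state model in the usage example is a hypothetical toy, not a model from this glossary.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log probability) for the observations."""
    # Log probability of the best path ending in each state after the first symbol.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev_state = max(states, key=lambda r: V[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (V[-1][prev_state] + math.log(trans_p[prev_state][s])
                         + math.log(emit_p[s][o]))
            pointers[s] = prev_state
        V.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, V[-1][last]

if __name__ == "__main__":
    states = ("A", "B")  # hypothetical toy model
    start = {"A": 0.5, "B": 0.5}
    trans = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}
    emit = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
    print(viterbi("xxyy", states, start, trans, emit))
```

Working with logarithms turns the products of state and transition probabilities described under Hidden Markov Model into sums, which avoids numerical underflow on long sequences.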

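Bioinformatics (Final Exam) - Biotechnology 08

Qiqihar University Examination Paper
Subject: Bioinformatics. For: Biotechnology 08 (undergraduate). Term: 2011-2012-1, seventh semester. Course code: 05113019. Total: 80 points. Pages: 2.
1) Notes for candidates:
2) Your name must be written to the left of the binding line; a name written anywhere else is void.

3) First check whether any pages are missing; if so, report it to the invigilator, otherwise the candidate bears the consequences.

4) Write all answers on the answer sheet; the questions need not be copied, but the question numbers must be clearly marked.

5) Answer in blue or black fountain pen or ballpoint pen.

Notes for invigilators: Place the two copies of the question sheet on top and bind them together with the answer sheets.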

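I. Term definitions (3 points each, 4 items, 12 points)
Expressed sequence tag, outgroup, open reading frame, proteomics

II. Multiple choice (1 point each, 10 items, 10 points)
1. Which of the following is not part of the research content of the Human Genome Project? ( )
A. Drawing chemical maps and physical maps
B. Obtaining the complete sequence of the human genome
C. Obtaining a transcript map
D. Obtaining the sequences of all proteins in the human body
2. Which pair in the figure is orthologous? ( )
A. HA1 and HA2
B. HA1 and WA2
C. HA1 and HB
D. WA1 and WA2
3. Which of the following programs can be used to build a phylogenetic tree? ( )
A CLUSTAL
B BLAST
C Assembler
D Treeview
4. In which period did nucleic acid sequence data grow fastest? ( )
A 1970-1980
B 1980-1990
C 1990-2000
D 2000-2008
5. When studying a DNA sequence obtained by sequencing, what must be done first? ( )
A. Mask repetitive sequences
B. Remove sequence contamination
C. Find open reading frames
D. Find codon bias
6. For the sequences ATGCCCCGA and ATCCGA, which of the following is the correct alignment? ( )
A ATGCCCCGA
  AT_CC__GA
B ATGCCCCGA
  AT_CCG__A
C ATGCCCCGA
  AT_CC_G_A
D ATGCCCCGA
  AT_C__G_A
7. The BLAST family of programs can be found on the same website as which of the following? ( )
A GenBank database
B DDBJ database
C EMBL database
D CLUSTAL W
8. In what form are bioinformatics data stored? ( )
A. File system
B. Program software
C. Database
D. Manual management
9. Which of the following statements is incorrect? ( )
A PIR-PSD is the largest protein sequence database in the world
B Database retrieval is divided into keyword searching and sequence searching
C An STS is a landmark commonly used in genome mapping
D ACeDB stores only Caenorhabditis elegans data
10. When CLUSTAL is used for alignment, how are nucleotide positions that are identical in all sequences of the multiple alignment marked? ( )
A Different colors
B "*"
C "-"
D "_"

III. True or false (1 point each, 10 items, 10 points; mark correct statements with "√" and incorrect ones with "×")
1. The Phred software from the University of Washington is used to handle data redundancy ( )
2. The NCBI website cannot be used to search for articles ( )
3. CLUSTAL X has a Chinese-localized version ( )
4. EcoCyc is a knowledge-base database system for Escherichia coli ( )
5. Amphioxus is one of the five model organisms for humans ( )
6. Bioinformatics studies species information and does not include sequences ( )
7. When studying a DNA sequence obtained by sequencing, contaminating sequence should be removed first ( )
8. Two-dimensional gel electrophoresis is a key technique in proteome research ( )
9. CAP3 is an assembly program for EST sequences ( )
10. The order of the amino acids determines the conformation of a protein; that is, the primary structure of a protein determines its secondary structure.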