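China Journal of Bioinformatics (生物信息学)
Monographs and Reviews
Applications of Natural Language Processing Techniques in Bioinformatics
(School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)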
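Abstract: From an information-processing point of view, many problems in bioinformatics closely resemble problems in natural language processing, so some classic natural language processing methods can be applied to bioinformatics. This paper introduces problems shared by natural language processing and bioinformatics, such as alignment, classification, and prediction, together with the methods used to solve them. An analysis of similar problems in the two fields shows that effective natural language processing techniques can also be used to solve bioinformatics problems, and that some natural language understanding techniques not yet applied in bioinformatics have potential value there as well. Finally, a solution to a classification problem is presented, demonstrating how the algorithm can be applied to biological data in experiments.
Keywords: bioinformatics; natural language processing; alignment; classification
CLC number: P 1. 8 14 T 32 Q l .  Document code: A  Article number: 7 5 ̄(o 6 一O —0 1 4 1 2— 5 20 ) 1 4 —0 6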
[…]disciplinary science, which integrates biology, computer science, applied mathematics and information technology into a single discipline. With the development of bioinformatics, scientists can understand the protocol of biology and would solve many current problems in biology. However, bioinformatics scientists have already applied NLP techniques to their research, and the performance of these […]



Main English Bioinformatics Terms and Definitions

Abstract Syntax Notation (ASN.1) (the internal format used by many programs developed at NCBI, such as Cn3D for displaying three-dimensional protein structures): A language that is used to describe structured data types formally. Within bioinformatics, it has been used by the National Center for Biotechnology Information to encode sequences, maps, taxonomic information, molecular structures, and bibliographic information in such a way that it can be easily accessed and exchanged by computer software.
Accession number (记录号): A unique identifier that is assigned to a single database entry for a DNA or protein sequence.
Affine gap penalty (a strategy for setting gap penalties): A gap penalty score that is a linear function of gap length, consisting of a gap opening penalty and a gap extension penalty multiplied by the length of the gap. Using this penalty scheme greatly enhances the performance of dynamic programming methods for sequence alignment. See also Gap penalty.
Algorithm (算法): A systematic procedure for solving a problem in a finite number of steps, typically involving a repetition of operations. Once specified, an algorithm can be written in a computer language and run as a program.
Alignment (联配/比对): The procedure of comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences. Of the two types of alignment, local and global, a local alignment is generally the more useful. See also Local and Global alignments.
Alignment score (联配/比对值): An algorithmically computed score based on the number of matches, substitutions, insertions, and deletions (gaps) within an alignment. Scores for matches and substitutions are derived from a scoring matrix such as the BLOSUM and PAM matrices for proteins, and affine gap penalties suitable for the matrix are chosen. Alignment scores are in log odds units, often bit units (log to the base 2). Higher scores denote better alignments.
See also Similarity score, Distance in sequence analysis.
Alphabet (字母表): The total number of symbols in a sequence: 4 for DNA sequences and 20 for protein sequences.
Annotation (注释): The prediction of genes in a genome, including the location of protein-encoding genes, the sequence of the encoded proteins, any significant matches to other proteins of known function, and the location of RNA-encoding genes. Predictions are based on gene models, e.g., hidden Markov models of introns and exons in protein-encoding genes, and models of secondary structure in RNA.
Anonymous FTP (匿名FTP): When an FTP service allows anyone to log in, it is said to provide anonymous FTP service. A user can log in to an anonymous FTP server by typing anonymous as the user name and his or her e-mail address as the password. Most Web browsers now negotiate anonymous FTP logon without asking the user for a user name and password. See also FTP.
ASCII: The American Standard Code for Information Interchange (ASCII) encodes unaccented letters a-z and A-Z, the digits 0-9, most punctuation marks, space, and a set of control characters such as carriage return and tab. ASCII specifies 128 characters that are mapped to the values 0-127. ASCII files are commonly called plain text, meaning that they only encode text without extra markup.
BAC clone (细菌人工染色体克隆): Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100–200 kb. Most of the large-insert clones sequenced in the project were BAC clones.
Back-propagation (反向传输): When training feed-forward neural networks, a back-propagation algorithm can be used to modify the network weights. After each training input pattern is fed through the network, the network's output is compared with the desired output and the amount of error is calculated. This error is back-propagated through the network by using an error function to correct the network weights.
See also Feed-forward neural network.
Baum-Welch algorithm (Baum-Welch算法): An expectation maximization algorithm that is used to train hidden Markov models.
Bayes' rule (贝叶斯法则): Forms the basis of conditional probability by calculating the likelihood of an event occurring based on the history of the event and relevant background information. In terms of two parameters A and B, the theorem is stated in an equation: the conditional probability of A given B, P(A|B), is equal to the probability of A, P(A), times the conditional probability of B given A, P(B|A), divided by the probability of B, P(B). P(A) is the historical or prior distribution value of A, P(B|A) is a new prediction for B for a particular value of A, and P(B) is the sum of the newly predicted values for B. P(A|B) is a posterior probability, representing a new prediction for A given the prior knowledge of A and the newly discovered relationships between A and B.
Bayesian analysis (贝叶斯分析): A statistical procedure used to estimate parameters of an underlying distribution based on an observed distribution. See also Bayes' rule.
Biochips (生物芯片): Miniaturized arrays of large numbers of molecular substrates, often oligonucleotides, in a defined pattern. They are also called DNA microarrays and microchips.
Bioinformatics (生物信息学): The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology. Also, the discipline of obtaining information about genomic or protein sequence data. This may involve similarity searches of databases, comparing an unidentified sequence to the sequences in a database, or making predictions about the sequence based on current knowledge of similar sequences. Databases are frequently made publicly available through the Internet, or locally at an institution.
Bit score (二进制值/Bit值): The value S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account.
Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
Bit units: From information theory, a bit denotes the amount of information required to distinguish between two equally likely possibilities. The number of bits of information, N, required to convey a message that has M possibilities is log2 M = N bits.
BLAST (基本局部联配搜索工具; a major database search program): Basic Local Alignment Search Tool. A set of programs used to perform fast similarity searches. Nucleotide sequences can be compared with nucleotide sequences in a database using BLASTN, for example. Complex statistics are applied to judge the significance of each match. Reported sequences may be homologous to, or related to, the query sequence. The BLASTP program is used to search a protein database for a match against a query protein sequence. There are several other flavours of BLAST. BLAST2 is a newer release of BLAST that allows for insertions or deletions in the sequences being aligned. Gapped alignments may be more biologically significant.
Block (a block of conserved regions within a protein family): Conserved ungapped patterns approximately 3-60 amino acids in length in a set of related proteins.
BLOSUM matrices (模块替换矩阵; a major family of substitution matrices): An alternative to PAM tables, BLOSUM tables were derived using local multiple alignments of more distantly related sequences than were used for the PAM matrix. These are used to assess the similarity of sequences when performing alignments.
Boltzmann distribution (Boltzmann分布): Describes the number of molecules that have energies above a certain level, based on the Boltzmann gas constant and the absolute temperature.
Boltzmann probability function (Boltzmann概率函数): See Boltzmann distribution.
Bootstrap analysis: A method for testing how well a particular data set fits a model. For example, the validity of the branch arrangement in a predicted phylogenetic tree can be tested by resampling columns in a multiple sequence alignment to create many new alignments.
The appearance of a particular branch in trees generated from these resampled sequences can then be measured. Alternatively, a sequence may be left out of an analysis to determine how much the sequence influences the results of the analysis.
Branch length (分支长度): In sequence analysis, the number of sequence changes along a particular branch of a phylogenetic tree.
CDS or cds (编码序列): Coding sequence.
Chebyshev's inequality: The probability that a random variable deviates from its mean by at least k standard deviations is less than or equal to 1/k², the square of 1 over the number of standard deviations from the mean.
Clone (克隆): Population of identical cells or molecules (e.g., DNA) derived from a single ancestor.
Cloning vector (克隆载体): A molecule that carries a foreign gene into a host and allows or facilitates the multiplication of that gene in the host. When sequencing a gene that has been cloned using a cloning vector (rather than by PCR), care should be taken not to include the cloning vector sequence when performing similarity searches. Plasmids, cosmids, phagemids, YACs and PACs are examples of cloning vectors.
Cluster analysis (聚类分析): A method for grouping together a set of objects that are most similar from a larger group of related objects. The relationships are based on some criterion of similarity or difference. For sequences, a similarity or distance score or a statistical evaluation of those scores is used.
Cobbler: A single sequence that represents the most conserved regions in a multiple sequence alignment. The BLOCKS server uses the cobbler sequence to perform a database similarity search as a way to reach sequences that are more divergent than would be found using the single sequences in the alignment for searches.
Coding system (neural networks): Regarding neural networks, a coding system needs to be designed for representing input and output. The level of success found when training the model will be partially dependent on the quality of the coding system chosen.
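As an illustration of the Coding system entry, the sketch below shows one common way to represent a DNA sequence as neural-network input: one-hot encoding, in which each base becomes a vector with a single 1. This is only one possible coding system and is not prescribed by the glossary. The names `DNA_ALPHABET` and `one_hot_encode` are hypothetical, and mapping unknown symbols (such as N) to an all-zero vector is an assumption made here for simplicity.

```python
# Hypothetical helper names; one-hot encoding is just one possible coding system.
DNA_ALPHABET = "ACGT"


def one_hot_encode(sequence, alphabet=DNA_ALPHABET):
    """Return one vector per symbol, with a 1 at the symbol's alphabet index.

    Symbols outside the alphabet (e.g. 'N') become all-zero vectors;
    this is an assumption of this sketch, not a standard.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = []
    for symbol in sequence.upper():
        vector = [0] * len(alphabet)
        if symbol in index:
            vector[index[symbol]] = 1
        encoded.append(vector)
    return encoded


if __name__ == "__main__":
    for base, vector in zip("GATN", one_hot_encode("GATN")):
        print(base, vector)
```

A protein sequence can be encoded the same way by passing a 20-letter amino acid alphabet as `alphabet`, which matches the alphabet sizes given in the Alphabet entry.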
Codon usage: Analysis of the codons used in a particular gene or organism.
COG (直系同源簇): Clusters of orthologous groups, a set of groups of related sequences in microorganisms and yeast (S. cerevisiae). These groups are found by whole-proteome comparisons and include orthologs and paralogs. See also Orthologs and Paralogs.
Comparative genomics (比较基因组学): A comparison of gene numbers, gene locations, and biological functions of genes in the genomes of diverse organisms, one objective being to identify groups of genes that play a unique biological role in a particular organism.
Complexity (of an algorithm) (算法的复杂性): Describes the number of steps required by the algorithm to solve a problem as a function of the amount of data; for example, the length of sequences to be aligned.
Conditional probability (条件概率): The probability of a particular result (or of a particular value of a variable) given one or more events or conditions (or values of other variables).
Conservation (保守): Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
Consensus (一致序列): A single sequence that represents, at each subsequent position, the variation found within corresponding columns of a multiple sequence alignment.
Context-free grammars: A recursive set of production rules for generating patterns of strings. These consist of a set of terminal characters that are used to create strings, a set of nonterminal symbols that correspond to rules and act as placeholders for patterns that can be generated using terminal characters, a set of rules for replacing nonterminal symbols with terminal characters, and a start symbol.
Contig (序列重叠群/拼接序列): A set of clones that can be assembled into a linear order. Also, a DNA sequence that overlaps with another contig. The full set of overlapping sequences (contigs) can be put together to obtain the sequence for a long region of DNA that cannot be sequenced in one run in a sequencing assay.
Important in genetic mapping at the molecular level.
CORBA (a common standard, defined by the Object Management Group, that unifies OOP objects and network interfaces across computers, operating systems, programming languages, and networks): The Common Object Request Broker Architecture (CORBA) is an open industry standard for working with distributed objects, developed by the Object Management Group. CORBA allows the interconnection of objects and applications regardless of computer language, machine architecture, or geographic location of the computers.
Correlation coefficient (相关系数): A numerical measure, falling between -1 and 1, of the degree of the linear relationship between two variables. A positive value indicates a direct relationship, a negative value indicates an inverse relationship, and the distance of the value away from zero indicates the strength of the relationship. A value near zero indicates no linear relationship between the variables.
Covariation (in sequences) (共变): Coincident change at two or more sequence positions in related sequences that may influence the secondary structures of RNA or protein molecules.
Coverage (or depth) (覆盖率/厚度): The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a "high-quality base" is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).
Database (数据库): A computerized storehouse of data that provides a standardized way for locating, adding, removing, and changing data. See also Object-oriented database, Relational database.
Dendrogram: A form of a tree that lists the compared objects (e.g., sequences or genes in a microarray analysis) in a vertical order and joins related ones by levels of branches extending to one side of the list.
Depth (厚度): See Coverage.
Dirichlet mixtures: Defined as the conjugate prior of a multinomial distribution.
One use is for predicting the expected pattern of amino acid variation found in the match state of a hidden Markov model (representing one column of a multiple sequence alignment of proteins), based on prior distributions found in conserved protein domains (blocks).
Distance in sequence analysis (序列距离): The number of observed changes in an optimal alignment of two sequences, usually not counting gaps.
DNA sequencing (DNA测序): The experimental process of determining the nucleotide sequence of a region of DNA. This is done by labelling each nucleotide (A, C, G or T) with either a radioactive or fluorescent marker which identifies it. There are several methods of applying this technology, each with its advantages and disadvantages. For more information, refer to a current textbook. High-throughput laboratories frequently use automated sequencers, which are capable of rapidly reading large numbers of templates. Sometimes, the sequences may be generated more quickly than they can be characterised.
Domain (功能域): A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.
Dot matrix (点标矩阵图): Dot matrix diagrams provide a graphical method for comparing two sequences. One sequence is written horizontally across the top of the graph and the other along the left-hand side. Dots are placed within the graph at the intersection of the same letter appearing in both sequences. A series of diagonal lines in the graph indicates regions of alignment.
The matrix may be filtered to reveal the most-alike regions by scoring a minimal threshold number of matches within a sequence window.
Draft genome sequence (基因组序列草图): The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.
DUST (a program for filtering low-complexity regions): A program for filtering low-complexity regions from nucleic acid sequences.
Dynamic programming (动态规划法): A dynamic programming algorithm solves a problem by combining solutions to sub-problems that are computed once and saved in a table or matrix. Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found. This algorithm is used for producing sequence alignments, given a scoring system for sequence comparisons.
EMBL (欧洲分子生物学实验室; the EMBL database is one of the major public nucleotide sequence databases): European Molecular Biology Laboratory. Maintains the EMBL database, one of the major public sequence databases.
EMBnet (欧洲分子生物学网络): European Molecular Biology Network. It was established in 1988 and provides services including local molecular databases and software for molecular biologists in Europe. There are several large outposts of EMBnet, including ExPASy.
Entropy (熵): From information theory, a measure of the unpredictable nature of a set of possible elements. The higher the level of variation within the set, the higher the entropy.
Erdos and Renyi law: In a series of tosses of a "fair" coin, the number of heads in a row that can be expected is the logarithm of the number of tosses to the base 2. The law may be generalized for more than two possible outcomes by changing the base of the logarithm to the number of outcomes.
This law was used to analyze the number of matches and mismatches that can be expected between random sequences as a basis for scoring the statistical significance of a sequence alignment.
EST (表达序列标签): See Expressed Sequence Tag.
Expect value (E) (E值): E value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. In a database similarity search, the probability that an alignment score as good as the one found between a query sequence and a database sequence would be found in as many comparisons between random sequences as were done to find the matching sequence. In other types of sequence analysis, E has a similar meaning.
Expectation maximization (sequence analysis): An algorithm for locating similar sequence patterns in a set of sequences. A guessed alignment of the sequences is first used to generate an expected scoring matrix representing the distribution of sequence characters in each column of the alignment, this pattern is matched to each sequence, and the scoring matrix values are then updated to maximize the alignment of the matrix to the sequences. The procedure is repeated until there is no further improvement.
Exon (外显子)
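To make the Dynamic programming entry above more concrete, here is a minimal sketch of a global alignment score computed with a table of sub-problem solutions, in the style of the Needleman-Wunsch algorithm. It is an illustration only: the function name `global_alignment_score` and the scoring values (match +1, mismatch -1, gap -2) are hypothetical choices, and it uses a simple linear gap penalty, not the affine gap penalty or the BLOSUM/PAM matrices described in the glossary.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b.

    table[i][j] holds the best score for aligning a[:i] with b[:j];
    each cell is computed once from its three neighbouring sub-problems.
    Scoring values are illustrative assumptions, with a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,               # gap in b
                table[i][j - 1] + gap,               # gap in a
            )
    return table[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

With these assumed scores, aligning ACGT against AGT gives three matches and one gap. The table has (len(a)+1) × (len(b)+1) cells, so the number of steps grows with the product of the sequence lengths, which is the sense of "complexity" used in the Complexity (of an algorithm) entry.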











abbreviation 缩写 [省略语]ablative 夺格(的)abrupt 突发音accent 口音/{Phonetics}重音accusative 受格(的)acoustic phonetics 声学语音学acquisition 习得action verb 动作动词active 主动语态active chart parser 活动图句法剖析程序active knowledge 主动知识active verb 主动动词actor-action-goal 施事(者)-动作-目标actualization 实现(化)acute 锐音address 地址{信息科学}/称呼(语){语言学} adequacy 妥善性adjacency pair 邻对adjective 形容词adjunct 附加语 [附加修饰语]adjunction 加接adverb 副词adverbial idiom 副词词组affective 影响的affirmative 肯定(的;式)affix 词缀affixation 加缀affricate 塞擦音agent 施事agentive-action verb 施事动作动词agglutinative 胶着(性)agreement 对谐AI (artificial intelligence) 人工智能 [人工智能] AI language 人工智能语言 [人工智能语言]Algebraic Linguistics 代数语言学algorithm 算法 [算法]alienable 可分割的alignment 对照 [多国语言文章词;词组;句子翻译的] allo- 同位-allomorph 同位语素allophone 同位音位alpha notation alpha 标记alphabetic writing 拼音文字alternation 交替alveolar 齿龈音ambiguity 歧义ambiguity resolution 歧义消解ambiguous 歧义American structuralism 美国结构主义analogy 类推analyzable 可分析的anaphor 照应语 [前方照应词]animate 有生的A-not-A question 正反问句antecedent 先行词anterior 舌前音anticipation 预期 (音变)antonym 反义词antonymy 反义A-over-A A-上-A 原则apposition 同位语appositive construction 同位结构appropriate 恰当的approximant 无擦通音approximate match 近似匹配arbitrariness 任意性archiphoneme 大音位argument 论元 [变元]argument structure 论元结构 [变元结构] arrangement 配列array 数组articulatory configuration 发音结构articulatory phonetics 发音语音学artificial intelligence (AI) 人工智能 [人工智能] artificial language 人工语言ASCII 美国标准信息交换码aspect 态 [体]aspirant 气音aspiration 送气assign 指派assimilation 同化association 关联associative phrase 联想词组asterisk 标星号ATN (augmented transition network) 扩充转移网络attested 经证实的attribute 属性attributive 属性auditory phonetics 听觉语音学augmented transition network 扩充转移网络automatic document classification 自动文件分类automatic indexing 自动索引automatic segmentation 自动切分automatic training 自动训练automatic word segmentation 自动分词automaton 自动机autonomous 自主的auxiliary 助动词axiom 公理baby-talk 儿语back-formation 逆生构词(法)backtrack 回溯Backus-Naur Form 巴科斯诺尔形式 [巴科斯诺尔范式] backward deletion 逆向删略ba-construction 把─字句balanced corpus 平衡语料库base 词基Bayesian learning 贝式学习Bayesian 
statistics 贝式统计behaviorism 行为主义belief system 信念系统benefactive 受益(格;的)best first parser 最佳优先句法剖析器bidirectional linked list 双向串行bigram 双连词bilabial 双唇音bilateral 双边的bilingual concordancer 双语关键词前后文排序程序binary feature 双向特征[二分征性]binding 约束bit 位 [二进制制;比特]biuniqueness 双向唯一性blade 舌叶blend 省并词block 封阻[封杀]Bloomfieldian 布隆菲尔德(学派)的body language 肢体语言Boolean lattice 布尔网格 [布尔网格]borrow 借移Bottom-up 由下而上bottom-up parsing 由下而上剖析bound 附着(的)bound morpheme 附着语素 [黏着语素]boundary marker 界线标记boundary symbol 界线符号bracketing 方括号法branching 分枝法breadth-first search 广度优先搜寻 [宽度优先搜索]breath group 换气单位breathy 气息音的buffer 缓冲区byte 字节CAI (Computer Assisted Instruction) 计算机辅助教学CALL (computer assisted language learning) 计算机辅助语言学习canonical 典范的capacity 能力cardinal 基数的cardinal vowels 基本元音case 格位case frame 格位框架Case Grammar 格位语法case marking 格位标志CAT (computer assisted translation) 计算机辅助翻译cataphora 下指Categorial Grammar 范畴语法Categorial Unification Grammar 范畴连并语法 [范畴合一语法] causative 使动causative verb 使役动词causativity 使役性centralization 央元音化chain 炼chart parsing 表式剖析 [图表句法分析]checked 受阻的checking 验证Chinese character code 中文编码 [汉字代码]Chinese character code for information interchange 中文信息交换码[汉字交换码]Chinese character coding input method 中文输入法 [汉字编码输入] choice 选择Chomsky hierarchy 杭士基阶层 [Chomsky 层次结构]citation form 基本形式CKY algorithm (Cocke-Kasami-Younger) CKY 算法classifier 类别词cleft sentence 分裂句click 啧音clitic 附着词closed world assumption 封闭世界假说cluster 音群Cocke-Kasami-Younger algorithm CKY 算法coda 音节尾code conversion 代码变换cognate 同源(的;词)Cognitive Linguistics 认知语言学coherence 一致性cohesion 凝结性 [黏着性;结合力]collapse 合并collective 集合的collocation 连用语 [同现;搭配]combinatorial construction 合并结构combinatorial insertion 合并中插combinatorial word 合并词Combinatory Categorial Grammar 组合范畴语法comment 评论commissive 许诺[语行]common sense semantics 常识语意学Communication Theory 通讯理论 [通讯论;信息论]Comparative Linguistics 比较语言学comparison 比较competence 语言知能compiler 编译器complement 补语complementary 互补complementary distribution 互补分布complementizer 补语标记complex predicate 复杂谓语complex stative construction 
复杂状态结构complex symbol 复杂符号complexity 复杂度component 成分compositionality 语意合成性 [合成性]compound word 复合词Computational Lexical Semantics 计算词汇语意学Computational Lexicography 计算词典编纂学Computational Linguistics 计算语言学Computational Phonetics 计算语音学Computational Phonology 计算声韵学Computational Pragmatics 计算语用学Computational Semantics 计算语意学Computational Syntax 计算句法学computer language 计算器语言computer-aided translation 计算机辅助翻译 [计算器辅助翻译]computer-assisted instruction (CAI) 计算机辅助教学computer-assisted language learning 计算机辅助语言学习[计算器辅助语言学习] concatenation 串联concept classification 概念分类concept dependency 概念依存conceptual hierarchy 概念阶层concord 谐和concordance 关键词 (前后文) 排序concordancer 关键词 (前后文) 排序的程序concurrent parsing 并行句法剖析conditional decision 条件决定 [条件决策]conjoin 连接conjunction 连接词 (合取;逻辑积;"与";连词)conjunctive 连接的connected speech 连续语言Connectionist model 类神经网络模型Connectionist model for natural language 自然语言类神经网络模型[自然语言连接模型]connotation 隐涵意义consonant 子音 [辅音]constituent 成分constituent structure tree 词组结构树constraint 限制constraint propagation 限制条件的传递 [限定因素增殖]constraint-based grammar formalism 限制为本的语法形式Construct Grammar 句构语法content word 实词context 语境context-free language 语境自由语言 [上下文无关语言]context-sensitive language 语境限定语言 [上下文有关语言;上下文敏感语言] continuant 连续音continuous speech recognition 连续语音识别contraction 缩约control agreement principle 控制一致原理control structure 控制结构control theory 控制论convention 约定俗成[规约]convergence 收敛[趋同现象]conversational implicature 会话含义converse 相反(词;的)cooccurrence relation 共现关系 [同现关系]co-operative principle 合作原则coordination 对称连接词 [同等;并列连接]copula 系词co-reference 同指涉 [互指]co-referential 同指涉coronal 前舌音corpora 语料库corpus 语料库Corpus Linguistics 语料库语言学corpus-based learning 语料库为本的学习correlation 相关性counter-intuitive 违反语感的courseware 课程软件 [课件]coverb 动介词C-structure 成分结构data compression 数据压缩 [数据压缩]data driven analysis 资料驱动型分析 [数据驱动型分析]data structure 数据结构 [数据结构]database 数据库 [数据库]database knowledge representation 数据库知识表示 [数据库知识表示]data-driven 资料驱动 [数据驱动]dative 与格declarative knowledge 陈述性知识decomposition 分解deductive database 演译数据库 
[演译数据库]default 默认值 [默认;缺省]definite 定指Definite Clause Grammar 确定子句语法definite state automaton 有限状态自动机Definite State Grammar 有限状态语法definiteness 定指degree adverb 程度副词degree of freedom 自由度deixis 指示delimiter 定界符号 [定界符]denotation 外延denotic logic 符号逻辑dependency 依存关系Dependency Grammar 依存关系语法dependency relation 依存关系depth-first search 深度优先搜寻derivation 派生derivational bound morpheme 派生性附着语素Descriptive Grammar 描述型语法 [描写语法]Descriptive Linguistics 描述语言学 [描写语言学] desiderative 意愿的determiner 限定词deterministic algorithm 决定型算法 [确定性算法] deterministic finite state automaton 决定型有限状态机deterministic parser 决定型语法剖析器 [确定性句法剖析程序] developmental psychology 发展心理学Diachronic Linguistics 历时语言学diacritic 附加符号dialectology 方言学dictionary database 辞典数据库 [词点数据库]dictionary entry 辞典条目digital processing 数字处理 [数值处理]diglossia 双言digraph 二合字母diminutive 指小词diphone 双连音directed acyclic graph 有向非循环图disambiguation 消除歧义 [歧义消除]discourse 篇章discourse analysis 篇章分析 [言谈分析]discourse planning 篇章规划Discourse Representation Theory 篇章表征理论 [言谈表示理论] discourse strategy 言谈策略discourse structure 言谈结构discrete 离散的disjunction 选言dissimilation 异化distributed 分布式的distributed cooperative reasoning 分布协调型推理distributed text parsing 分布式文本剖析disyllabic 双音节的ditransitive verb 双宾动词 [双宾语动词;双及物动词] divergence 扩散[分化]D-M (Determiner-Measure) construction 定量结构D-N (determiner-noun) construction 定名结构document retrieval system 文件检索系统 [文献检索系统] domain dependency 领域依存性 [领域依存关系]double insertion 交互中插double-base 双基downgrading 降级dummy 虚位duration 音长{语音学}/时段{语法学/语意学}dynamic programming 动态规划Earley algorithm Earley 算法echo 回声句egressive 呼气音ejective 紧喉音electronic dictionary 电子词典elementary string 基本字符串 [基本单词串]ellipsis 省略EM algorithm EM算法embedding 崁入emic 功能关系的empiricism 经验论Empty Category Principle 虚范畴原则 [空范畴原理]empty word 虚词enclitics 后接成份end user 终端用户 [最终用户]endocentric 同心的endophora 语境照应entailment 蕴涵entity 实体entropy 熵entry 条目episodic memory 情节性记忆epistemological network 认识论网络ergative verb 作格动词ergativity 作格性Esperando 世界语etic 无功能关系etymology 词源学event 事件event driven control 
事件驱动型控制example-based machine translation 以例句为本的机器翻译exclamation 感叹exclusive disjunction 排它性逻辑 “或”experiencer case 经验者格expert system 专家系统extension 外延external argument 域外论元extraposition 移外变形 [外置转换]facility value 易度值feature 特征feature bundle 特征束feature co-occurrence restriction 特征同现限制 [特性同现限制] feature instantiation 特征体现feature structure 特征结构 [特性结构]feature unification 特征连并 [特性合一]feedback 回馈felicity condition 妥适条件file structure 档案结构finite automaton 有限状态机 [有限自动机]finite state 有限状态Finite State Morphology 有限状态构词法 [有限状态词法]finite-state automata 有限状态自动机finite-state language 有限状态语言finite-state machine 有限状态机finite-state transducer 有限状态置换器flap 闪音flat 降音foreground information 前景讯息 [前景信息]Formal Language Theory 形式语言理论Formal Linguistics 形式语言学Formal Semantics 形式语意学forward inference 前向推理 [向前推理]forward-backward algorithm 前前后后算法frame 框架frame based knowledge representation 框架型知识表示Frame Theory 框架理论free morpheme 自由语素Fregean principle Fregean 原则fricative 擦音F-structure 功能结构full text searching 全文检索function word 功能词Functional Grammar 功能语法functional programming 函数型程序设计 [函数型程序设计]functional sentence perspective 功能句子观functional structure 功能结构functional unification 功能连并 [功能合一]functor 功能符fundamental frequency 基频garden path sentence 花园路径句GB (Government and Binding) 管辖约束geminate 重叠音gender 性Generalized Phrase Structure Grammar 概化词组结构语法 [广义短语结构语法] Generative Grammar 衍生语法Generative Linguistics 衍生语言学 [生成语言学]generic 泛指genetic epistemology 发生认识论genetive marker 属格标记genitive 属格gerund 动名词Government and Binding Theory 管辖约束理论GPSG (Generalized Phrase Structure Grammar) 概化词组结构语法[广义短语结构语法]gradability 可分级性grammar checker 文法检查器grammatical affix 语法词缀grammatical category 语法范畴grammatical function 语法功能grammatical inference 文法推论grammatical relation 语法关系grapheme 字素haplology 类音删略head 中心语head driven phrase structure 中心语驱动词组结构 [中心词驱动词组结构] head feature convention 中心语特征继承原理 [中心词特性继承原理] Head-Driven Phrase Structure Grammar 中心语驱动词组结构律heteronym 同形heuristic parsing 经验式句法剖析Heuristics 经验知识hidden Markov model 隐式马可夫模型hierarchical 
structure 阶层结构 [层次结构]holophrase 单词句homograph 同形异义词homonym 同音异义词homophone 同音词homophony 同音异义homorganic 同部位音的Horn clause Horn 子句HPSG (Head-Driven Phrase Structure Grammar) 中心语驱动词组结构语法human-machine interface 人机界面hypernym 上位词hypertext 超文件 [超文本]hyponym 下位词hypotactic 主从结构的IC (immediate constituent) 直接成份ICG (Information-based Case Grammar) 讯息为本的格位语法idiom 成语 [熟语]idiosyncrasy 特异性illocutionary 施为性immediate constituent 直接成份imperative 祈使句implicative predicate 蕴含谓词implicature 含意indexical 标引的indirect object 间接宾语indirect speech act 间接言谈行动 [间接言语行为]Indo-European language 印欧语言inductional inference 归纳推理inference machine 推理机器infinitive 不定词 [to 不定式]infix 中缀inflection/inflexion 屈折变化inflectional affix 屈折词缀information extraction 信息撷取information processing 信息处理 [信息处理]information retrieval 信息检索Information Science 信息科学 [信息科学; 情报科学] Information Theory 信息论 [信息论]inherent feature 固有特征inherit 继承inheritance 继承inheritance hierarchy 继承阶层 [继承层次]inheritance of attribute 属性继承innateness position 语法天生假说insertion 中插inside-outside algorithm 里里外外算法instantiation 体现instrumental (case) 工具格integrated parser 集成句法剖析程序integrated theory of discourse analysis 篇章分析综合理论[言谈分析综合理论]intelligence intensive production 知识密集型生产intensifier 加强成分intensional logic 内含逻辑Intensional Semantics 内涵语意学intensional type 内含类型interjection/exclamation 感叹词inter-level 中间成分interlingua 中介语言interlingual 中介语(的)interlocutor 对话者internalise 内化International Phonetic Association (IPA) 国际语音学会internet 网际网络Interpretive Semantics 诠释性语意学intonation 语调intonation unit (IU) 语调单位IPA (International Phonetic Association) 国际语音学会IR (information retrieval) 信息检索IS-A relation IS-A 关系isomorphism 同形现象IU (intonation unit) 语调单位junction 连接keyword in context 上下文中关键词[上下文内关键词] kinesics 体势学knowledge acquisition 知识习得knowledge base 知识库knowledge based machine translation 知识为本之机器翻译knowledge extraction 知识撷取 [知识题取]knowledge representation 知识表示KWIC (keyword in context) 关键词前后文 [上下文内关键词] label 卷标labial 唇音labio-dental 唇齿音labio-velar 软颚唇音LAD (language acquisition device) 语言习得装置lag 
发声延迟language acquisition 语言习得language acquisition device 语言习得装置language engineering 语言工程language generation 语言生成language intuition 语感language model 语言模型language technology 语言科技left-corner parsing 左角落剖析 [左角句法剖析]lemma 词元lenis 弱辅音letter-to-phone 字转音lexeme 词汇单位lexical ambiguity 词汇歧义lexical category 词类lexical conceptual structure 词汇概念结构lexical entry 词项lexical entry selection standard 选词标准lexical integrity 词语完整性Lexical Semantics 词汇语意学Lexical-Functional Grammar 词汇功能语法Lexicography 词典学Lexicology 词汇学lexicon 词汇库 [词典;词库]lexis 词汇层LF (logical form) 逻辑形式LFG (Lexical-Functional Grammar) 词汇功能语法liaison 连音linear bounded automaton 线性有限自主机linear precedence 线性次序lingua franca 共通语linguistic decoding 语言译码linguistic unit 语言单位linked list 串行loan 外来语local 局部的localism 方位主义localizer 方位词locus model 轨迹模型locution 惯用语logic 逻辑logic array network 逻辑数组网络logic programming 逻辑程序设计 [逻辑程序设计] logical form 逻辑形式logical operator 逻辑算子 [逻辑算符]Logic-Based Grammar 逻辑为本语法 [基于逻辑的语法] long term memory 长期记忆longest match principle 最长匹配原则 [最长一致法] LR (left-right) parsing LR 剖析machine dictionary 机器词典machine language 机器语言machine learning 机器学习machine translation 机器翻译machine-readable dictionary (MRD) 机读辞典Macrolinguistics 宏观语言学Markov chart 马可夫图Mathematical Linguistics 数理语言学maximum entropy 最大熵M-D (modifier-head) construction 偏正结构mean length of utterance (MLU) 语句平均长度measure of information 讯习测度 [信息测度] memory based 根据记忆的mental lexicon 心理词汇库mental model 心理模型mental process 心理过程 [智力过程;智力处理] metalanguage 超语言metaphor 隐喻metaphorical extension 隐喻扩展metarule 律上律 [元规则]metathesis 语音易位Microlinguistics 微观语言学middle structure 中间式结构minimal pair 最小对Minimalist Program 微言主义MLU (mean length of utterance) 语句平均长度modal 情态词modal auxiliary 情态助动词modal logic 情态逻辑modifier 修饰语Modular Logic Grammar 模块化逻辑语法modular parsing system 模块化句法剖析系统modularity 模块性(理论)module 模块monophthong 单元音monotonic 单调monotonicity 单调性Montague Grammar 蒙泰究语法 [蒙塔格语法]mood 语气morpheme 词素morphological affix 构词词缀morphological decomposition 语素分解morphological pattern 词型morphological processing 
词素处理morphological rule 构词律 [词法规则] morphological segmentation 语素切分Morphology 构词学Morphophonemics 词音学 [形态音位学;语素音位学] morphophonological rule 形态音位规则Morphosyntax 词句法Motor Theory 肌动理论movement 移位MRD (machine-readable dictionary) 机读辞典MT (machine translation) 机器翻译multilingual processing system 多语讯息处理系统multilingual translation 多语翻译multimedia 多媒体multi-media communication 多媒体通讯multiple inheritance 多重继承multistate logic 多态逻辑mutation 语音转换mutual exclusion 互斥mutual information 相互讯息nativist position 语法天生假说natural language 自然语言natural language processing (NLP) 自然语言处理natural language understanding 自然语言理解negation 否定negative sentence 否定句neologism 新词语nested structure 套结构network 网络neural network 类神经网络Neurolinguistics 神经语言学neutralization 中立化n-gram n-连词n-gram modeling n-连词模型NLP (natural language processing) 自然语言处理node 节点nominalization 名物化nonce 暂用的non-finite 非限定non-finite clause 非限定式子句non-monotonic reasoning 非单调推理normal distribution 常态分布noun 名词noun phrase 名词组NP (noun phrase) completeness 名词组完全性object 宾语{语言学}/对象{信息科学}object oriented programming 对象导向程序设计 [面向对向的程序设计] official language 官方语言one-place predicate 一元述语on-line dictionary 线上查询词典 [联机词点]onomatopoeia 拟声词onset 节首音ontogeny 个体发生Ontology 本体论open set 开放集operand 操作数 [操作对象]optimization 最佳化 [最优化]overgeneralization 过度概化overgeneration 过度衍生paradigmatic relation 聚合关系paralanguage 附语言parallel construction 并列结构Parallel Corpus 平行语料库parallel distributed processing (PDP) 平行分布处理paraphrase 转述 [释意;意译;同意互训]parole 言语parser 剖析器 [句法剖析程序]parsing 剖析part of speech (POS) 词类particle 语助词PART-OF relation PART-OF 关系part-of-speech tagging 词类标注pattern recognition 型样识别P-C (predicate-complement) insertion 述补中插PDP (parallel distributed processing) 平行分布处理perception 知觉perceptron 感觉器 [感知器]perceptual strategy 感知策略performative 行为句periphrasis 用独立词表达perlocutionary 语效性的permutation 移位Petri Net Grammar Petri 网语法philology 语文学phone 语音phoneme 音素phonemic analysis 因素分析phonemic stratum 音素层Phonetics 语音学phonogram 音标Phonology 声韵学 [音位学;广义语音学]Phonotactics 音位排列理论phrasal verb 词组动词 [短语动词]phrase 词组 
[短语]phrase marker 词组标记 [短语标记]pitch 音调pitch contour 调形变化Pivot Grammar 枢轴语法pivotal construction 承轴结构plausibility function 可能性函数PM (phrase marker) 词组标记 [短语标记]polysemy 多义性POS-tagging 词类标记postposition 方位词PP (preposition phrase) attachment 介词依附Pragmatics 语用学Precedence Grammar 优先级语法precision 精确度predicate 述词predicate calculus 述词计算predicate logic 述词逻辑 [谓词逻辑]predicate-argument structure 述词论元结构prefix 前缀premodification 前置修饰preposition 介词Prescriptive Linguistics 规定语言学 [规范语言学]presentative sentence 引介句presupposition 前提Principle of Compositionality 语意合成性原理privative 二元对立的probabilistic parser 概率句法剖析程序problem solving 解决问题program 程序programming language 程序设计语言 [程序设计语言]proofreading system 校对系统proper name 专有名词prosody 节律prototype 原型pseudo-cleft sentence 准分裂句Psycholinguistics 心理语言学punctuation 标点符号pushdown automata 下推自动机pushdown transducer 下推转换器qualification 后置修饰quantification 量化quantifier 范域词Quantitative Linguistics 计量语言学question answering system 问答系统queue 队列radical 字根 [词干;词根;部首;偏旁]radix of tuple 元组数基random access 随机存取rationalism 理性论rationalist (position) 理性论立场 [唯理论观点]reading laboratory 阅读实验室real time 实时real time control 实时控制 [实时控制]recursive transition network 递归转移网络reduplication 重叠词 [重复]reference 指涉referent 指称对象referential indices 指针referring expression 指涉词 [指示短语]register 缓存器 [寄存器]{信息科学}/调高{语音学}/语言的场合层级{社会语言学} regular language 正规语言 [正则语言]relational database 关系型数据库 [关系数据库]relative clause 关系子句relaxation method 松弛法relevance 相关性Restricted Logic Grammar 受限逻辑语法resumptive pronouns 复指代词retroactive inhibition 逆抑制rewriting rule 重写规则rheme 述位rhetorical structure 修辞结构rhetorics 修辞学robust 强健性robust processing 强健性处理robustness 强健性schema 基朴school grammar 教学语法scope 范域 [作用域;范围]script 脚本search mechanism 检索机制search space 检索空间searching route 检索路径 [搜索路径]second order predicate 二阶述词segmentation 分词segmentation marker 分段标志selectional restriction 选择限制semantic field 语意场semantic frame 语意架构semantic network 语意网络semantic representation 语意表征 [语义表示]semantic representation language 语意表征语言semantic restriction 语意限制semantic 
structure 语意结构Semantics 语意学sememe 意素Semiotics 符号学sender 发送者sensorimotor stage 感觉运动期sensory information 感官讯息 [感觉信息]sentence 句子sentence generator 句子产生器 [句子生成程序]sentence pattern 句型separation of homonyms 同音词区分sequence 序列serial order learning 顺序学习serial verb construction 连动结构set oriented semantic network 集合导向型语意网络 [面向集合型语意网络] SGML (Standard Generalized Markup Language) 结构化通用标记语言shift-reduce parsing 替换简化式剖析short term memory 短程记忆sign 信号signal processing technology 信号处理技术simple word 单纯词situation 情境Situation Semantics 情境语意学situational type 情境类型social context 社会环境sociolinguistics 社会语言学software engineering 软件工程 [软件工程]sort 排序speaker-independent speech recognition 非特定语者语音识别spectrum 频谱speech 口语speech act assignment 言语行为指定speech continuum 言语连续体speech disorder 语言失序 [言语缺失]speech recognition 语音辨识speech retrieval 语音检索speech situation 言谈情境 [言语情境]speech synthesis 语音合成speech translation system 语音翻译系统speech understanding system 语音理解系统spreading activation model 扩散激发模型standard deviation 标准差Standard Generalized Markup Language 标准通用标示语言start-bound complement 接头词state of affairs algebra 事态代数state transition diagram 状态转移图statement kernel 句核static attribute list 静态属性表statistical analysis 统计分析Statistical Linguistics 统计语言学statistical significance 统计意义stem 词干stimulus-response theory 刺激反应理论stochastic approach to parsing 概率式句法剖析 [句法剖析的随机方法] stop 爆破音Stratificational Grammar 阶层语法 [层级语法]string 字符串[串;字符串]string manipulation language 字符串操作语言string matching 字符串匹配 [字符串]structural ambiguity 结构歧义Structural Linguistics 结构语言学structural relation 结构关系structural transfer 结构转换structuralism 结构主义structure 结构structure sharing representation 结构共享表征subcategorization 次类划分 [下位范畴化]subjunctive 假设的sublanguage 子语言subordinate 从属关系subordinate clause 从属子句 [从句;子句]subordination 从属substitution rule 代换规则 [置换规则]substrate 底层语言suffix 后缀superordinate 上位的superstratum 上层语言suppletion 异型[不规则词型变化] suprasegmental 超音段的syllabification 音节划分syllable 音节syllable structure constraint 音节结构限制symbolization and verbalization 符号化与字句化synchronic 
同步的synonym 同义词syntactic category 句法类别syntactic constituent 句法成分syntactic rule 语法规律 [句法规则]Syntactic Semantics 句法语意学syntagm 句段syntagmatic 组合关系 [结构段的;组合的]Syntax 句法Systemic Grammar 系统语法tag 标记target language 目标语言 [目标语言]task sharing 课题分享 [任务共享]tautology 套套逻辑 [恒真式;重言式;同义反复] taxonomical hierarchy 分类阶层 [分类层次] telescopic compound 套装合并template 模板temporal inference 循序推理 [时序推理] temporal logic 时间逻辑 [时序逻辑]temporal marker 时貌标记tense 时态terminology 术语text 文本text analyzing 文本分析text coherence 文本一致性text generation 文本生成 [篇章生成]Text Linguistics 文本语言学text planning 文本规划text proofreading 文本校对text retrieval 文本检索text structure 文本结构 [篇章结构]text summarization 文本自动摘要 [篇章摘要]text understanding 文本理解text-to-speech 文本转语音thematic role 题旨角色thematic structure 题旨结构theorem 定理thesaurus 同义词辞典theta role 题旨角色theta-grid 题旨网格token 实类 [标记项]tone 音调tone language 音调语言tone sandhi 连调变换top-down 由上而下 [自顶向下]topic 主题topicalization 主题化 [话题化]trace 痕迹Trace Theory 痕迹理论training 训练transaction 异动 [处理单位]transcription 转写 [抄写;速记翻译]transducer 转换器transfer 转移transfer approach 转换方法transfer framework 转换框架transformation 变形 [转换]Transformational Grammar 变形语法 [转换语法]transitional state term set 转移状态项集合transitivity 及物性translation 翻译translation equivalence 翻译等值性translation memory 翻译记忆transparency 透明性tree 树状结构 [树]Tree Adjoining Grammar 树形加接语法 [树连接语法]treebank 树图数据库[语法关系树库]trigram 三连词t-score t-数turing machine 杜林机 [图灵机]turing test 杜林测试 [图灵试验]type 类型type/token node 标记类型/实类节点type-feature structure 类型特征结构typology 类型学ultimate constituent 终端成分unbounded dependency 无界限依存underlying form 基底型式underlying structure 基底结构unification 连并 [合一]Unification-based Grammar 连并为本的语法 [基于合一的语法] Universal Grammar 普遍性语法universal instantiation 普遍例式universal quantifier 全称范域词unknown word 未知词 [未定义词]unrestricted grammar 非限制型语法usage flag 使用旗标user interface 使用者界面 [用户界面]Valence Grammar 结合价语法Valence Theory 结合价理论valency 结合价variance 变异数 [方差]verb 动词verb phrase 动词组 [动词短语]verb resultative compound 动补复合词verbal association 词语联想verbal phrase 动词组verbal production 言语生成vernacular 本地话V-O 
construction (verb-object) 动宾结构vocabulary 字汇vocabulary entry 词条vocal tract 声道vocative 呼格voice recognition 声音辨识 [语音识别]vowel 元音vowel harmony 元音和谐waveform 波形weak verb 弱化动词Whorfian hypothesis Whorfian 假说word 词word frequency 词频word frequency distribution 词频分布word order 词序word segmentation 分词word segmentation standard for Chinese 中文分词规范word segmentation unit 分词单位 [切词单位]word set 词集working memory 工作记忆 [工作存储区]world knowledge 世界知识writing system 书写系统X-Bar Theory X标杠理论 ["x"阶理论]Zipf's Law 利夫规律 [齐普夫定律]


CLC number: TP312, Q811.4  Document code: A  Article ID: 1672-5565(2006)-01-041-04
Received: 2005-01-21; revised: 2005-02-25
About the author: XU Ji-wei, School of Computer Science and Technology, Harbin Institute of Technology, e-mail: jiweixu@

Applications of natural language processing techniques to bioinformatics
XU Ji-wei
(Department of Computer Science, Harbin Institute of Technology, Harbin 150001, China)

Abstract: Many problems in Bioinformatics are very similar to those in Natural Language Processing (NLP) from the viewpoint of information processing, so classic NLP methods can be applied to Bioinformatics problems. This paper introduces several problems common to Bioinformatics and NLP, such as alignment, classification and prediction, together with their solutions. The discussion of these similarities suggests that NLP methods are useful for Bioinformatics problems and have further potential applications there. A step-by-step solution to a classification problem is given to explain how to perform experiments with biological data.
Key words: Bioinformatics; Natural Language Processing; Alignment; Classification

1 Introduction
Bioinformatics is a novel interdisciplinary research field that integrates biology, computer science, applied mathematics and information technology into a single discipline. With the development of Bioinformatics, scientists can understand the protocols of the biological world and solve many current problems in biology. However, with the rapid accumulation of biological sequence data, there is an urgent need for sound computational methods and advanced Bioinformatics techniques for reliable, large-scale biological data processing and analysis. Biological sequence processing is similar to text processing in NLP, as shown in Figure 1, if the biological sequences being processed and analyzed are treated as strings, regardless of their physical 3D structures. Some Bioinformatics scientists have already applied NLP techniques to their research, and the performances
of these methods show that many NLP techniques are applicable in Bioinformatics.

2 Basic Concepts
Although people in Bioinformatics process biological data such as DNA, RNA and proteins, and people in NLP process real-world texts from news, reports, papers, speeches, etc., the real "objects" to be manipulated in both fields are the same. It is reasonable to believe that there are some common problems and methods in both research fields; however, the two fields use different idioms, so for the convenience of readers, some frequently used concepts and notations are recalled here.

Bioinformatics (China Journal of Bioinformatics), Monographs and Reviews
Figure 1: Mapping between human language and protein sequences (http://)

2.1 N-gram
A consecutive substring of n symbols, with n ≤ L (here L is the length of the whole string), is called an n-gram (in some Bioinformatics references it is also called an n-word or n-tuple). The statistics of n-grams are usually obtained by traversing the string of length L with an n-wide sliding window, from position 1 to L-n+1; the content of each n-wide window is an n-gram. A set of tools is available for extracting both word n-grams (for Latin-alphabet languages such as English) and character n-grams (for Eastern languages such as Chinese) from raw corpora, and they are free for downloading at http://ww.../zhangle/ngram.html#download.

2.2 Vector space model
Perhaps the most commonly used data representation in information processing is the so-called Vector Space Model (VSM) (Salton et al. 1983), and many machine learning methods are based on VSM, such as K-NN (Masand et al. 1992), SVM (Vapnik 1999; Joachims 1998), Naïve Bayes (Tzeras et al. 1993), Decision Trees (Lewis et al. 1994), neural networks (Wiener et al. 1995) and so on. The idea of VSM is to use a vector <W1, W2, …, Wn> to represent a sequence or text, where Wi is the weight of the i-th feature of the sequence. For example, in the sequence X = "ATCGGCTT" with trigrams, a feature can be any element of {ATC, TCG, CGG, GGC, GCT, CTT}. Given a
collection of sequences and selected features, these sequences can be represented by a feature-by-sequence matrix M, where each entry represents the weight of a feature in a sequence. The weight of a feature can be calculated by one of several weighting schemes, such as tf-idf weighting, tfc-weighting and so on. The two basic principles for weighting are (1) the more times a feature appears in a sequence, the more closely it is related to that sequence; (2) the more times a feature appears in the collection of all sequences, the more poorly it discriminates between sequences.

3 Similar problems
There are several similar problems in Bioinformatics and NLP, so knowing the solution to one might help with the other; there are also some methods commonly used in both fields, and we can learn to apply well-established methods to new problems. We introduce these similar topics in this section and give a step-by-step case study of classification in the next section.

3.1 Alignment
Sequence Alignment (SA) in Bioinformatics and Bilingual Corpora Alignment (BCA) in NLP are the two problems most similar to each other. Both help people understand unknown things from known ones, that is, to understand English sentences with the help of the Chinese ones, or to learn the function of new protein sequences with the help of known ones. SA can be applied to DNA and protein sequence database searches, motif searches, gene identification searches and the analysis of multiple regions of similarity in long DNA sequences; it is very useful for finding homology with known sequence families and for providing important information about the function of newly sequenced genes. On the other hand, BCA plays an important role in computational language research, especially in Machine Translation, and is believed to be a promising direction for solving knowledge acquisition, the bottleneck of NLP.
Looking deeply into the two problems, we can see other
similarities. For example, SA can be broadly classified into global alignment and local alignment, while BCA includes paragraph alignment, sentence alignment and word alignment; the measures used by SA and BCA are also similar, so there remains room for people in Bioinformatics to learn BCA methods from machine translation and improve the quality of sequence alignment.
Commonly used sentence alignment methods include length-based alignment (Brown et al. 1991), part-of-speech-based alignment (Kay et al. 1993) and bilingual-dictionary-based alignment (Wang 1999). Here the function of the bilingual dictionary is similar to that of the PAM and BLOSUM matrices. Furthermore, combining the above methods can improve performance (Wu 1994; Simard et al. 1992).

3.2 Classification
Another common problem in Bioinformatics and NLP is classification. The rapidly increasing data from newspapers, digital libraries and the Internet are in great need of automatic classification systems. The outburst of biological data, such as DNA and protein sequences, also calls for classification to determine their categories. We can see from the protein class tables at http://mips.gsf.de/genre/proj/yeastl that a large percentage of proteins are currently of unknown class, so we need to determine their categories and functions with the help of effective classification systems.
Many statistical classification and machine learning techniques have been applied to text classification, including nearest neighbour classifiers, decision trees, Bayesian classifiers, neural networks and Support Vector Machines. These methods are suitable for the task of classification.
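The vector space representation of Section 2.2 and a nearest neighbour classifier can be sketched together in a few lines. The following is a minimal illustration only, not the system used in this paper: the toy sequences, the labels "A" and "B", the smoothed idf formula and all function names are assumptions made for the example.

```python
import math
from collections import Counter

def ngrams(seq, n=3):
    """Slide an n-wide window over seq, from position 1 to L-n+1."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def fit_idf(train_seqs, n=3):
    """Inverse document frequency of each n-gram over the training sequences.

    Assumption: a smoothed idf, log((1+N)/(1+df)) + 1, so that n-grams
    present in every sequence still keep a small positive weight.
    """
    df = Counter()
    for s in train_seqs:
        df.update(set(ngrams(s, n)))
    total = len(train_seqs)
    return {f: math.log((1 + total) / (1 + d)) + 1.0 for f, d in df.items()}

def vectorize(seq, idf, n=3):
    """Sparse tf-idf vector {feature: weight}; unseen n-grams are dropped."""
    tf = Counter(ngrams(seq, n))
    return {f: c * idf[f] for f, c in tf.items() if f in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify_1nn(seq, train, idf, n=3):
    """Assign the label of the most similar training sequence (1-NN)."""
    query = vectorize(seq, idf, n)
    best = max(train, key=lambda item: cosine(query, vectorize(item[0], idf, n)))
    return best[1]

# Hypothetical toy training set: (sequence, label) pairs invented for illustration.
TRAIN = [
    ("ATCGGCTT", "A"),
    ("ATCGGCAA", "A"),
    ("GGGTTTCC", "B"),
    ("GGGTTTAA", "B"),
]
```

With this toy data, `fit_idf([s for s, _ in TRAIN])` builds the weights once, and `classify_1nn` then compares a new sequence against every labeled one. A real experiment would replace the 1-NN rule with any of the classifiers listed above and would use a held-out test set.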
Additionally, people have tried to combine several techniques to improve classification performance; for example, (Li et al. 1991) presented a new method for web-page classification that combines SVM with unsupervised clustering, and the combination obtained better classification accuracy.

3.3 Other potential applications
Other techniques in NLP, such as Word Sense Disambiguation (WSD) (Ide et al. 1998) and Dependency Grammar Analysis (DGA) (Liu 1997), have not yet been applied to Bioinformatics problems but might be valuable references for biologists in the future.
WSD in NLP determines the meaning of words that are spelled the same but have different meanings in different contexts; for example, the word "bank" has different meanings in the following sentences:
(1) I take a walk along the bank every morning.
(2) I bank at the National Bank.
WSD can help a lot with the problems of natural language understanding and machine translation. In Bioinformatics, it is also possible that snippets with the same structure function quite differently in different protein sequences, so WSD techniques might be applicable in the future.
The Dependency Grammar Analysis problem is to find the dependency relationships between words and phrases. DGA can find that the words "according" and "to" have a high dependency on each other, and this dependency is useful in error detection. Consider, for example, the erroneous sentence:
According too my watch it is 10 o'clock.
The error detection system can easily find the error with the help of dependency information. In Bioinformatics, a single snippet of a sequence might not function correctly without the help of other snippets; that is to say, the snippets might have dependency relationships with each other, and only their co-occurrence can lead to a biological function. In this way, DGA techniques can be applied.
There are other similar problems in the two research fields, such as named entity detection in NLP and drug target detection in Bioinformatics; we will discuss them in other papers because of space limitations.

4 Case study
As mentioned above, with the accelerated accumulation of biological data, proteins are in great need of classification systems. We use RNA-binding protein classification as the example here to show how to perform experiments with biological data. The RNA-binding proteins are downloaded from the protein sequence database Swiss-Prot (http://www.expasy.ch/sprot), and the classifier can be built with any method mentioned in Section 3.2. We will describe the data set division, experimental results and discussion in detail in a following paper.
The protein classification here is a supervised learning task, defined as assigning predefined category labels to new proteins based on the likelihood obtained from a training set of labeled proteins. Assume proteins P1, P2, … and Pn constitute a training set labeled Proteases and Pt is a new protein. If the comparison of Pt with all of P1, P2, … and Pn yields a high score, then the classifier determines that Pt belongs to the class Proteases and assigns the label Proteases to Pt. The architecture of the classification system is shown in Figure 2:

Figure 2: Architecture of the classification system

In Figure 2, the features might be the collection of n-grams mentioned above or any other user-defined form. Both the labeled sequences and the new sequences are represented as
vectors with the same features before learning and classifying. The experiment includes these steps:
(1) Extract features, such as n-grams, from the labeled RNA-binding proteins and store them. Features do not contribute equally to the function of proteins, and a high-dimensional vector space will degrade system performance, so an additional process is needed to eliminate useless features and reduce the feature dimension.
(2) Build the vector representation of the labeled proteins with the features remaining after the dimension reduction in step (1).
(3) Learn and build the classifier with the vectors obtained in step (2).
(4) Build the vector representation of the new (testing) proteins with the features remaining after the dimension reduction in step (1).
(5) Classify with the classifier built in step (3) and the vector representation of the new proteins built in step (4), and assign labels to the new proteins according to the likelihood obtained.

5 Conclusion
In this paper, some similarities between Bioinformatics and Natural Language Processing are discussed, such as the problems of alignment and classification. Many experimental results show that NLP techniques are applicable to solving Bioinformatics problems. Further study of NLP techniques will help the development of Bioinformatics, and we believe that some NLP techniques that have not yet been applied to Bioinformatics will be useful tools for Bioinformatics in the future.

References:
[1] Brown, P. F., Lai, J. C. and Mercer, R. L. Aligning sentences in parallel corpora [R]. Proc. of the 29th Annual Meeting of the ACL, 1991, 169-176.
[2] Ide, N. and Veronis, J. Introduction to the special issue on word sense disambiguation: The state of the art [J]. Computational Linguistics, 1998, 24(1): 1-40.
[3] Joachims, T. Text categorization with support vector machines: Learning with many relevant features [R]. In Proc. 10th European Conference on Machine Learning (ECML), Springer Verlag, 1998.
[4] Kay, M. and Röscheisen, M. Text-translation alignment [J]. Computational Linguistics, 1993, 19(1): 121-142.
[5] Lewis, D. D. and Ringuette, M. Comparison of two learning algorithms for text categorization [R]. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR'94), 1994.
[6] Liu, H. T. "Dependency Grammar and Machine Translation" [J] (in Chinese). Language Application, 1997, 3, 89-93.
[7] Masand, B., Linoff, G. and Waltz, D. Classifying news stories using memory based reasoning [R]. In 15th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'92), 1992, 59-64.
[8] Tzeras, K. and Hartman, S. Automatic indexing based on Bayesian inference networks [R]. In Proc 16th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'93), 1993, 22-34.
[9] Wang, B. "Chinese-English Bilingual Corpora Automatic Alignment Research" (in Chinese) [D]. Ph.D. thesis, 1999.
[10] Wiener, E., Pedersen, J. O. and Weigend, A. S. A neural network approach to topic spotting [R]. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval (SDAIR'95), 1995.
[11] Wu, D. Aligning a parallel English-Chinese corpus statistically with lexical criteria [R]. Proc. of the 32nd Annual Conference of ACL, 1994, 80-87.
