Speech Recognition Systems: Foreign Literature Translation for a Graduation Thesis, with Parallel Chinese and English Text
Speech Recognition
Victor Zue, Ron Cole, & Wayne Ward
MIT Laboratory for Computer Science, Cambridge, Massachusetts, USA
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
1 Defining the Problem
Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in section.
Speech recognition systems can be characterized by many parameters, some of the more important of which are shown in Figure. An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies, and is much more difficult to recognize than speech read from a script. Some systems require speaker enrollment, in which a user must provide samples of his or her speech before using them, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words.
The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar.
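As a minimal sketch of the finite-state idea, the network can be stored as a mapping from each word to the set of words allowed to follow it. The vocabulary, the `<s>` and `</s>` boundary markers, and the name `is_permitted` are invented for illustration and do not come from any real task.

```python
# Hypothetical word-pair network: each word maps to the words allowed to follow it.
ALLOWED_NEXT = {
    "<s>": {"show", "list"},
    "show": {"all", "the"},
    "list": {"all", "the"},
    "all": {"ships"},
    "the": {"ships"},
    "ships": {"</s>"},
}

def is_permitted(words):
    """Return True if every adjacent word pair is allowed by the network."""
    path = ["<s>"] + list(words) + ["</s>"]
    return all(nxt in ALLOWED_NEXT.get(cur, set()) for cur, nxt in zip(path, path[1:]))

accepted = is_permitted(["show", "all", "ships"])
rejected = is_permitted(["ships", "show"])
```

A recognizer constrained this way only needs to consider the listed successors at each step, which is what makes the search space small.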
One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (see section for a discussion of language modeling in general and perplexity in particular). Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and the placement of the microphone.
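The loose definition of perplexity above can be made concrete. The sketch below, under the assumption that the language model supplies a probability for each word in a test sequence, computes perplexity as the inverse geometric mean of those probabilities. The function name and the numbers are illustrative.

```python
import math

def perplexity(word_probs):
    """Perplexity of a word sequence given the model probability of each word.

    Equals the inverse geometric mean of the per-word probabilities.
    """
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / len(word_probs))

# If every one of 20 words is predicted with probability 1/11 (hypothetical numbers),
# the model behaves as if it chose uniformly among 11 words at each step.
uniform_choice = perplexity([1 / 11] * 20)
```

When the model is equally uncertain among k words at every position, the perplexity is k, which matches the reading of perplexity as an average branching factor.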
Table: Typical parameters used to characterize the capability of speech recognition systems
Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the phoneme. At word boundaries, contextual variations can be quite dramatic, making gas shortage sound like gash shortage in American English, and devo andare sound like devandare in Italian.
Second, acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.
Figure shows the major components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10 to 20 msec (see sections and 11.3 for signal representation and digital signal processing, respectively). These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.
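The fixed-rate measurement step can be illustrated with a small sketch. It assumes a 25 ms analysis window advanced every 10 ms and uses log energy as the only feature. Both choices are illustrative assumptions, since the text specifies only the frame rate, and real front ends compute richer spectral features.

```python
import math

def frame_log_energies(samples, sample_rate, frame_ms=25, step_ms=10):
    """Slice a signal into overlapping frames and return one log energy per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    features = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        features.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return features

# One second of a 440 Hz tone sampled at 8 kHz (synthetic test signal).
rate = 8000
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
feats = frame_log_energies(tone, rate)
```

Each entry of `feats` corresponds to one frame, so the search stage sees a sequence of feature values rather than raw samples.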
Figure: Components of a typical speech recognition system.
Speech recognition systems attempt to model the sources of variability described above in several ways. At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent features of the signal, and de-emphasize speaker-dependent characteristics. At the acoustic-phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent acoustic models to those of the current speaker during system use (see section). Effects of linguistic context at the acoustic-phonetic level are typically handled by training separate models for phonemes in different contexts; this is called context-dependent acoustic modeling.
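One common way to realize context-dependent modeling is to give each phoneme a separate label for every left and right neighbor, often called a triphone. The sketch below assumes a `left-center+right` label format and a `sil` padding symbol at utterance edges. Both are illustrative conventions and not something the text prescribes.

```python
def to_triphones(phones):
    """Label each phoneme with its left and right neighbors (left-center+right)."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

# Hypothetical phoneme string for the word "gas".
labels = to_triphones(["g", "ae", "s"])
```

Each distinct label would then receive its own acoustic model, so the same phoneme in different contexts is no longer forced to share parameters.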
Word-level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations of words, as well as effects of dialect and accent, are handled by allowing search algorithms to find alternate paths of phonemes through these networks. Statistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words.
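A statistical language model of the kind described can be sketched as relative-frequency estimates over word pairs. The corpus below is invented, and the estimates are unsmoothed, which is an assumption made for brevity: practical systems smooth these counts so that unseen word pairs do not receive zero probability.

```python
from collections import Counter

def bigram_probabilities(sentences):
    """Estimate P(next | current) by relative frequency over a training corpus."""
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for cur, nxt in zip(words, words[1:]):
            pair_counts[(cur, nxt)] += 1
            context_counts[cur] += 1
    return {pair: n / context_counts[pair[0]] for pair, n in pair_counts.items()}

# Tiny invented corpus, for illustration only.
probs = bigram_probabilities(["call home", "call the office", "call home now"])
```

During search, a hypothesis that continues "call" with "home" would be favored over one that continues it with "the", because the first pair was seen twice as often in training.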
The dominant recognition paradigm of the past fifteen years is the hidden Markov model (HMM). An HMM is a doubly stochastic model, in which the generation of the underlying phoneme string and the frame-by-frame, surface acoustic realizations are both represented probabilistically as Markov processes, as discussed in sections, and 11.2. Neural networks have also been used to estimate the frame-based scores; these scores are then integrated into HMM-based system architectures, in what have come to be known as hybrid systems, as described in section 11.5.
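The text does not name a search algorithm. A common choice for finding the most probable hidden state sequence in an HMM is the Viterbi algorithm, sketched below on a toy model. The two states, the discrete observation symbols, and all probabilities are invented for illustration, whereas real recognizers work with continuous acoustic features and log probabilities.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in state s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    final_prob = best[-1][state][0]
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1], final_prob

# Toy model: two phoneme-like states emitting "lo" or "hi" energy frames.
states = ["a", "b"]
start_p = {"a": 0.8, "b": 0.2}
trans_p = {"a": {"a": 0.6, "b": 0.4}, "b": {"a": 0.1, "b": 0.9}}
emit_p = {"a": {"lo": 0.9, "hi": 0.1}, "b": {"lo": 0.2, "hi": 0.8}}
path, prob = viterbi(["lo", "lo", "hi", "hi"], states, start_p, trans_p, emit_p)
```

The two sets of probabilities mirror the doubly stochastic structure: `trans_p` governs the hidden state sequence, and `emit_p` governs the frame-level observations given each state.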
An interesting feature of frame-based HMM systems is that speech segments are identified during the search process, rather than explicitly. An alternate approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance in several tasks.
2 State of the Art
Comments about the state-of-the-art need to be made in the context of specific applications which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary is small, the entire word can be modeled as a single unit. Such an approach is not practical for large vocabularies, where word models must be built up from subword units.
Performance of speech recognition systems is typically described in terms of word error rate E, defined as:
$$E = \frac{S + I + D}{N} \times 100\%$$
where N is the total number of words in the test set, and S, I, and D are the total number of substitutions, insertions, and deletions, respectively.
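The counts S, I, and D are normally obtained by aligning the recognizer output against the reference transcript. The sketch below assumes a standard minimum-edit-distance alignment over words and reports the error rate as a percentage of the N reference words. The function name and the example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Percentage word error rate from a minimum edit distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)

# One deletion ("it") and one substitution ("card" -> "cart") over 9 reference words.
wer = word_error_rate(
    "i want to charge it to my calling card",
    "i want to charge to my calling cart",
)
```

Because insertions count as errors, this measure can exceed 100% when the hypothesis is much longer than the reference.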
The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to drop by a factor of 2 every two years. Substantial progress has been made in the basic technology, leading to the lowering of barriers to speaker independence, continuous speech, and large vocabularies. There are several factors that have contributed to this rapid progress. First, there is the coming of age of the HMM. The HMM is powerful in that, with the availability of training data, the parameters of the model can be trained automatically to give optimal performance.
Second, much effort has gone into the development of large speech corpora for system development, training, and testing. Some of these corpora are designed for acoustic phonetic research, while others are highly task specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing. These corpora permit researchers to quantify the acoustic cues important for phonetic contrasts and to determine parameters of the recognizers in a statistically meaningful way. While many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency (ARPA) to spur human language technology development among its contractors, they have nevertheless gained world-wide acceptance (e.g., in Canada, France, Germany, Japan, and the U.K.) as standards on which to evaluate speech recognition.
Third, progress has been brought about by the establishment of standards for performance evaluation. Only a decade ago, researchers trained and tested their systems using locally collected data, and were not very careful in delineating training and testing sets. As a result, it was very difficult to compare performance across systems, and a system's performance typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, thus contributing to greater reliability in monitoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13, respectively).
Finally, advances in computer technology have also indirectly influenced our progress. The availability of fast computers with inexpensive mass storage capabilities has enabled researchers to run many large-scale experiments in a short amount of time. This means that the elapsed time between an idea and its implementation and evaluation is greatly reduced. In fact, speech recognition systems with reasonable performance can now run in real time using high-end workstations without additional hardware, a feat unimaginable only a few years ago.
One of the most popular, and potentially most useful tasks with low perplexity (PP=11) is the recognition of digits. For American English, speaker-independent recognition of digit strings spoken continuously and restricted to telephone bandwidth can achieve an error rate of 0.3% when the string length is known.
One of the best-known moderate-perplexity tasks is the 1,000-word so-called Resource Management (RM) task, in which inquiries can be made concerning various naval vessels in the Pacific Ocean. The best speaker-independent performance on the RM task is an error rate of less than 4%, using a word-pair language model that constrains the possible words following a given word (PP=60). More recently, researchers have begun to address the issue of recognizing spontaneously generated speech. For example, in the Air Travel Information Service (ATIS) domain, word error rates of less than 3% have been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15.
High-perplexity tasks with a vocabulary of thousands of words are intended primarily for the dictation application. After working on isolated-word, speaker-dependent systems for many years, the community has since 1992 moved towards very-large-vocabulary (20,000 words and more), high-perplexity (PP≈200), speaker-independent, continuous speech recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences drawn from North American business news.
With the steady improvements in speech recognition performance, systems are now being deployed within telephone and cellular networks in many countries. Within the next few years, speech recognition will be pervasive in telephone networks around the world. There are tremendous forces driving the development of the technology; in many countries, touch-tone penetration is low, and voice is the only option for controlling automated services. In voice dialing, for example, users can dial 10 to 20 telephone numbers by voice (e.g., call home) after having enrolled their voices by saying the words associated with telephone numbers. AT&T, on the other hand, has installed a call routing system using speaker-independent word-spotting technology that can detect a few key phrases (e.g., person to person, calling card) in sentences such as: I want to charge it to my calling card.
At present, several very large vocabulary dictation systems are available for document generation. These systems generally require speakers to pause between words. Their performance can be further enhanced if one can apply constraints of the specific domain such as dictating medical reports.
Even though much progress is being made, machines are a long way from recognizing conversational speech. Word recognition rates on telephone conversations in the Switchboard corpus are around 50%. It will be many years before unlimited-vocabulary, speaker-independent, continuous dictation capability is realized.
3 Future Directions
In 1992, the U.S. National Science Foundation sponsored a workshop to identify the key research challenges in the area of human language technology, and the infrastructure needed to support the work. The key research challenges are summarized in. The following research areas for speech recognition were identified:
Robustness:
In a robust system, performance degrades gracefully (rather than catastrophically) as conditions become more different from those under which it was trained. Differences in channel characteristics and acoustic environment should receive particular attention.
Portability:
Portability refers to the goal of rapidly designing, developing and deploying systems for new applications. At present, systems tend to suffer significant degradation when moved to a new task. In order to return to peak performance, they must be trained on examples specific to the new task, which is time consuming and expensive.
Adaptation:
How can systems continuously adapt to changing conditions (new speakers, microphone, task, etc.) and improve through use? Such adaptation can occur at many levels in systems: subword models, word pronunciations, language models, etc.
Language Modeling:
Current systems use statistical language models to help reduce the search space and resolve acoustic ambiguity. As vocabulary size grows and other constraints are relaxed to create more habitable systems, it will be increasingly important to get as much constraint as possible from language models, perhaps by incorporating syntactic and semantic constraints that cannot be captured by purely statistical models.
Confidence Measures:
Most speech recognition systems assign scores to hypotheses for the purpose of rank ordering them. These scores do not provide a good indication of whether a hypothesis is correct or not, just that it is better than the other hypotheses. As we move to tasks that require actions, we need better methods to evaluate the absolute correctness of hypotheses.
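The gap between rank ordering and absolute correctness can be illustrated with a simple heuristic: normalizing the log scores of an N-best list so that they sum to one. This is only a sketch of one possible starting point, not a method the text proposes, and the scores below are invented. The normalized values still depend entirely on which hypotheses happen to be in the list.

```python
import math

def nbest_posteriors(log_scores):
    """Normalize N-best log scores into values that sum to one."""
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]  # subtract max for numerical stability
    total = sum(weights)
    return [w / total for w in weights]

# Two hypothetical N-best lists with the same top raw score but different margins.
clear_winner = nbest_posteriors([-100.0, -110.0, -112.0])
close_call = nbest_posteriors([-100.0, -100.2, -100.5])
```

Both lists have the same best raw score, yet the normalized value of the top hypothesis is close to one in the first case and below one half in the second, which is the kind of distinction a raw ranking score cannot express.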
Out-of-Vocabulary Words:
Systems are designed for use with a particular set of words, but system users may not know exactly which words are in the system vocabulary. This leads to a certain percentage of out-of-vocabulary words in natural conditions. Systems must have some method of detecting such out-of-vocabulary words, or they will end up mapping a word from the vocabulary onto the unknown word, causing an error.
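The percentage of out-of-vocabulary words mentioned above is simple to measure once transcripts are available. The sketch below uses an invented vocabulary and utterance, and assumes case-insensitive matching. Detecting such words from audio alone, which is the harder problem the text describes, is not attempted here.

```python
def oov_rate(words, vocabulary):
    """Fraction of spoken words that are missing from the recognizer vocabulary."""
    unknown = [w for w in words if w.lower() not in vocabulary]
    return len(unknown) / len(words), unknown

# Hypothetical vocabulary and utterance.
vocab = {"i", "want", "to", "fly", "boston", "from", "denver"}
rate, missing = oov_rate("I want to fly to Pittsburgh from Denver".split(), vocab)
```

A recognizer without any rejection mechanism would be forced to output some in-vocabulary word in place of "Pittsburgh", producing at least one error.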
Spontaneous Speech:
Systems that are deployed for real use must deal with a variety of spontaneous speech phenomena, such as filled pauses, false starts, hesitations, ungrammatical constructions and other common behaviors not found in read speech. Development on the ATIS task has resulted in progress in this area, but much work remains to be done.
Prosody:
Prosody refers to acoustic structure that extends over several segments or words. Stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger). Current systems do not capture prosodic structure. How to integrate prosodic information into the recognition architecture is a critical question that has not yet been answered.
Modeling Dynamics:
Systems assume a sequence of input frames which are treated as if they were independent. But it is known that perceptual cues for words and phonemes require the integration of features that reflect the movements of the articulators, which are dynamic in nature. How to model dynamics and incorporate this information into recognition systems is an unsolved problem.
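Speech Recognition
Victor Zue, Ron Cole, and Wayne Ward
MIT Laboratory for Computer Science, Cambridge, Massachusetts, USA
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA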
1 Defining the Problem
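Speech recognition is the process of converting an acoustic signal, captured by a telephone or a microphone, into a set of words. The recognized words can be the final result, as in applications such as command and control, data entry, and document preparation. They can also serve as input to further linguistic processing in order to achieve speech understanding, a subject covered in a separate section.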
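Speech recognition systems can be characterized by many parameters, some of the more important of which are shown in the figure. An isolated-word speech recognition system requires a brief pause between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies and is much harder to recognize than speech read from a script. Some systems require speaker enrollment, in which a user must provide speech samples before using the system, whereas other systems are said to be speaker-independent because no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when the vocabulary is large or contains many similar-sounding words. When speech is produced as a sequence of words, language models or artificial grammars are used to restrict the combinations of words.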
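The simplest language model can be specified as a finite-state network, in which the words permitted to follow each word are given explicitly. More general language models that approximate natural language are specified in terms of a context-sensitive grammar.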
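One popular measure of task difficulty, combining vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (see the section on language modeling for a discussion of language models in general and perplexity in particular). Finally, some external parameters can affect the performance of a speech recognition system, including the characteristics of the environmental noise and the type and placement of the microphone.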
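Enrollment: speaker-dependent to speaker-independent
Vocabulary: small (<20 words) to large (>20,000 words)
Language model: finite-state to context-sensitive
Perplexity: small (<10) to large (>100)
SNR: high (>30 dB) to low (<10 dB)
Transducer: noise-cancelling microphone to telephone
Table: Typical parameters used to characterize the capability of speech recognition systems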
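Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, depend heavily on the context in which they appear. This phonetic variability is exemplified by the acoustic differences among realizations of a phoneme. At word boundaries, contextual variations can be quite dramatic, making gas shortage sound like gash shortage in American English, and devo andare sound like devandare in Italian.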
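Second, acoustic variability can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variability can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape contribute to across-speaker variability.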
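The figure shows the major components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10 to 20 milliseconds (see section 11.3 on signal representation and digital signal processing). These measurements are then used to search for the most likely word candidates, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.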
Figure: Components of a typical speech recognition system.
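Speech recognition systems attempt to model the sources of variability described above in several ways. At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent features of the signal and de-emphasize speaker-dependent characteristics. At the acoustic-phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent acoustic models to the current speaker during system use (see the relevant section). Effects of linguistic context at the acoustic-phonetic level are typically handled by training separate models for phonemes in different contexts; this is called context-dependent acoustic modeling.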
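Word-level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations, as well as the effects of dialect and accent, are handled by allowing search algorithms to find alternate paths of phonemes through these networks. Statistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words.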
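The dominant recognition paradigm of the past fifteen years is the hidden Markov model (HMM). An HMM is a doubly stochastic model in which the generation of the underlying phoneme string and the frame-by-frame surface acoustic realizations are both represented probabilistically as Markov processes, as discussed in this chapter and in section 11.2. Neural networks have also been used to estimate the frame-based scores; these scores are then integrated into HMM-based system architectures, in what have come to be known as hybrid systems (see section 11.5).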
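An interesting feature of frame-based HMM systems is that speech segments are identified during the search process rather than explicitly. An alternate approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance on several tasks.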
2 State of the Art
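Comments about the state of the art need to be made in the context of specific applications, which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary is small, an entire word can be modeled as a single unit. Such an approach is not practical for large vocabularies, where word models must be built up from subword units.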
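The performance of speech recognition systems is typically described in terms of the word error rate E, defined as:
$$E = \frac{S + I + D}{N} \times 100\%$$
where N is the total number of words in the test set, and S, I, and D are the total numbers of substitutions, insertions, and deletions, respectively.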
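The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to fall by 50% every two years. Substantial progress has been made in the basic technology, lowering the barriers to speaker independence, continuous speech, and large vocabularies. Several factors have contributed to this rapid progress. First, the HMM has come of age. The HMM is powerful in that, given training data, the parameters of the model can be trained automatically to give optimal performance.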
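Second, much effort has gone into the development of large speech corpora for system development, training, and testing. Some of these corpora are designed for acoustic-phonetic research, while others are highly task-specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing. These corpora permit researchers to quantify the acoustic cues important for phonetic contrasts and to determine the parameters of the recognizers in a statistically meaningful way. Although many of these corpora (e.g., TIMIT, RM, and ATIS; see section 12.3) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency to spur human language technology development among its contractors, they have nevertheless gained worldwide acceptance (e.g., in the U.K., Canada, France, Germany, and Japan) as standards on which to evaluate speech recognition.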
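Third, progress has been brought about by the establishment of standards for performance evaluation. A decade ago, researchers trained and tested their systems using locally collected data and were not very careful in separating training and test sets. As a result, it was very difficult to compare performance across systems, and a system's performance typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, contributing to greater reliability in monitoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13, respectively).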
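Finally, advances in computer technology have also indirectly influenced progress. The availability of fast, inexpensive computers with mass storage has enabled researchers to run many large-scale experiments in a short amount of time. This means that the elapsed time between an idea and its implementation and evaluation is greatly reduced. In fact, speech recognition systems with reasonable performance can now run in real time on high-end workstations without additional hardware, a feat unimaginable only a few years ago.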
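One of the most popular, and potentially most useful, low-perplexity tasks is the recognition of digits. For American English, speaker-independent recognition of digit strings spoken continuously and restricted to telephone bandwidth can achieve an error rate of 0.3% when the string length is known.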
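One of the best-known moderate-perplexity tasks is the 1,000-word Resource Management (RM) task, in which inquiries can be made concerning various naval vessels in the Pacific Ocean. The best speaker-independent performance on the RM task is an error rate below 4%, using a word-pair language model that constrains the words that may follow a given word. More recently, researchers have begun to address the recognition of spontaneously generated speech. For example, in the Air Travel Information Service (ATIS) domain, word error rates below 3% have been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15.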
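High-perplexity tasks with vocabularies of thousands of words are intended primarily for dictation. After working on isolated-word, speaker-dependent systems for many years, the community has since 1992 moved toward very-large-vocabulary (20,000 words and more), high-perplexity (PP≈200), speaker-independent, continuous speech recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences drawn from North American business news.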
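With the steady improvement in speech recognition performance, systems are now being deployed within telephone and cellular networks in many countries. Within the next few years, speech recognition will be pervasive in telephone networks around the world. Tremendous forces are driving the development of the technology: in many countries touch-tone penetration is low, and voice is the only option for controlling automated services. In voice dialing, for example, users can dial 10 to 20 telephone numbers by voice (e.g., call home) after enrolling their voices by saying the words associated with those numbers. AT&T, on the other hand, has installed a call routing system using speaker-independent word-spotting technology that can detect a few key phrases (e.g., person to person, calling card) in sentences such as: I want to charge it to my calling card.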
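At present, several very-large-vocabulary dictation systems are available for document generation. These systems generally require speakers to pause between words. Their performance can be further enhanced if the constraints of a specific domain can be applied, such as dictating medical reports.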
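Even though much progress is being made, machines are a long way from recognizing conversational speech. Word recognition rates on telephone conversations in the Switchboard corpus are around 50%. It will be many years before unlimited-vocabulary, speaker-independent, continuous dictation capability is realized.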
3 Future Directions
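In 1992, the U.S. National Science Foundation sponsored a workshop to identify the key research challenges in the area of human language technology and the infrastructure needed to support the work. The key research challenges for speech recognition were summarized in the following areas: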
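Robustness:
In a robust system, performance degrades gracefully (rather than catastrophically) as conditions become more different from those under which it was trained. Differences in channel characteristics and acoustic environment should receive particular attention.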
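Portability:
Portability refers to the goal of rapidly designing, developing, and deploying systems for new applications. At present, systems tend to suffer significant degradation when moved to a new task. To return to peak performance, they must be trained on examples specific to the new task, which is time-consuming and expensive.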
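Adaptation:
How can systems continuously adapt to changing conditions (new speakers, microphone, task, etc.) and improve through use? Such adaptation can occur at many levels in systems: subword models, word pronunciations, language models, etc.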
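Language Modeling:
Current systems use statistical language models to help reduce the search space and resolve acoustic ambiguity. As vocabulary size grows and other constraints are relaxed to create more habitable systems, it will be increasingly important to get as much constraint as possible from language models, perhaps by incorporating syntactic and semantic constraints that cannot be captured by purely statistical models.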
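Confidence Measures:
Most speech recognition systems assign scores to hypotheses for the purpose of rank ordering them. These scores do not provide a good indication of whether a hypothesis is correct, only that it is better than the other hypotheses. As we move to tasks that require actions, we need better methods to evaluate the absolute correctness of hypotheses.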
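Out-of-Vocabulary Words:
Systems are designed for use with a particular set of words, but system users may not know exactly which words are in the system vocabulary. This leads to a certain percentage of out-of-vocabulary words in natural conditions. Systems must have some method of detecting such out-of-vocabulary words, or they will end up mapping a word from the vocabulary onto the unknown word, causing an error.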
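Spontaneous Speech:
Systems that are deployed for real use must deal with a variety of spontaneous speech phenomena, such as filled pauses, false starts, hesitations, ungrammatical constructions, and other common behaviors not found in read speech. Development on the ATIS task has resulted in progress in this area, but much work remains to be done.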
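Prosody:
Prosody refers to acoustic structure that extends over several segments or words. Stress, intonation, and rhythm convey important information for word recognition and about the user's intentions (e.g., sarcasm, anger). Current systems do not capture prosodic structure. How to integrate prosodic information into the recognition architecture is a critical question that has not yet been answered.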
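Modeling Dynamics:
Systems assume a sequence of input frames that are treated as if they were independent. But it is known that perceptual cues for words and phonemes require the integration of features that reflect the movements of the articulators, which are dynamic in nature. How to model dynamics and incorporate this information into recognition systems is an unsolved problem.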