Foreign Literature Translation: Speaker Recognition

Appendix A: English Source Text
Speaker Recognition
By Judith A. Markowitz, J. Markowitz Consultants
Speaker recognition uses features of a person's voice to identify or verify that person. It is a well-established biometric, with commercial systems that are more than 10 years old and deployed non-commercial systems that are more than 20 years old. This paper describes how speaker recognition systems work and how they are used in applications.
1. Introduction
Speaker recognition (also called voice ID and voice biometrics) is the only human-biometric technology in commercial use today that extracts information from sound patterns. It is also one of the most well-established biometrics, with deployed commercial applications that are more than 10 years old and non-commercial systems that are more than 20 years old.
2. How Do Speaker-Recognition Systems Work?
Speaker-recognition systems use features of a person's voice and speaking style to:
●attach an identity to the voice of an unknown speaker
●verify that a person is who she/he claims to be
●separate one person's voice from other voices in a multi-speaker environment
The first operation is called speaker identification or speaker recognition; the second has many names, including speaker verification, speaker authentication, voice verification, and voice recognition; the third is speaker separation or, in some situations, speaker classification. This paper focuses on speaker verification, the most highly commercialized of these technologies.
2.1 Overview of the Process
Speaker verification is a biometric technology used for determining whether the person is who she or he claims to be. It should not be confused with speech recognition, a non-biometric technology used for identifying what a person is saying. Speech recognition products are not designed to determine who is speaking.
Speaker verification begins with a claim of identity (see Figure A1). Usually, the claim entails manual entry of a personal identification number (PIN), but a growing number of products allow spoken entry of the PIN and use speech recognition to identify the numeric code. Some applications replace manual or spoken PIN entry with bank cards, smartcards, or the number of the telephone being used. PINs are also eliminated when a speaker-verification system contacts the user, an approach typical of systems used to monitor home-incarcerated criminals.
Figure A1.
Once the identity claim has been made, the system retrieves the stored voice sample (called a voiceprint) for the claimed identity and requests spoken input from the person making the claim. Usually, the requested input is a password. The newly input speech is compared with the stored voiceprint and the results of that comparison are measured against an acceptance/rejection threshold. Finally, the system accepts the speaker as the authorized user, rejects the speaker as an impostor, or takes another action determined by the application. Some systems report a confidence level or other score indicating how confident they are in the decision.
If the verification is successful the system may update the acoustic information in the stored voiceprint. This process is called adaptation. Adaptation is an unobtrusive solution for keeping voiceprints current and is used by many commercial speaker verification systems.
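The flow described in this section (identity claim, voiceprint lookup, comparison, threshold decision, optional adaptation) can be sketched roughly as follows. This is a minimal illustration, not how any commercial product works: the `Voiceprint` class, the cosine-similarity comparison, the 0.9 threshold, and the running-average adaptation are all assumptions made for the example, and real systems compare far richer acoustic models.

```python
import math
from dataclasses import dataclass


@dataclass
class Voiceprint:
    """Hypothetical stored voiceprint: a feature vector plus a sample count."""
    features: list[float]
    samples: int = 1


def similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify(store, claimed_id, spoken, threshold=0.9, adapt=True):
    """Return (accepted, score) for an identity claim.

    store      -- dict mapping a claimed ID (e.g. a PIN) to a Voiceprint
    spoken     -- feature vector extracted from the newly input speech
    threshold  -- acceptance/rejection threshold (illustrative value)
    """
    voiceprint = store.get(claimed_id)
    if voiceprint is None:          # no enrolment for this claim
        return False, 0.0

    score = similarity(voiceprint.features, spoken)
    accepted = score >= threshold

    # Adaptation: after a successful verification, fold the new sample
    # into the stored voiceprint as a running average (an assumption).
    if accepted and adapt:
        n = voiceprint.samples
        voiceprint.features = [
            (f * n + s) / (n + 1) for f, s in zip(voiceprint.features, spoken)
        ]
        voiceprint.samples = n + 1

    return accepted, score
```

The returned score plays the role of the confidence value that some systems report alongside the accept/reject decision.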
2.2 The Speech Sample
As with all biometrics, before verification (or identification) can be performed the person must provide a sample of speech (called enrolment). The sample is used to create the stored voiceprint.
Systems differ in the type and amount of speech needed for enrolment and verification. The basic divisions among these systems are
●text dependent
●text independent
●text prompted
2.2.1 Text Dependent
Most commercial systems are text dependent. Text-dependent systems expect the speaker to say a pre-determined phrase, password, or ID. By controlling the words that are spoken the system can look for a close match with the stored voiceprint. Typically, each person selects a private password, although some administrators prefer to assign passwords. Passwords offer extra security, requiring an impostor to know the correct PIN and password and to have a matching voice. Some systems further enhance security by not storing a human-readable representation of the password.
A global phrase may also be used. In its 1996 pilot of speaker verification Chase Manhattan Bank used 'Verification by Chemical Bank'. Global phrases avoid the problem of forgotten passwords, but lack the added protection offered by private passwords.
2.2.2 Text Independent
Text-independent systems ask the person to talk. What the person says is different every time. It is extremely difficult to accurately compare utterances that are totally different from each other, particularly in noisy environments or over poor telephone connections. Consequently, commercial deployment of text-independent verification has been limited.
2.2.3 Text Prompted
Text-prompted systems (also called challenge response) ask speakers to repeat one or more randomly selected numbers or words (e.g. “43516”, “27, 46”, or “Friday, computer”). Text prompting adds time to enrolment and verification, but it enhances security against tape recordings. Since the items to be repeated cannot be predicted, it is extremely difficult to play a recording. Furthermore, there is no problem of forgetting a password, even though the PIN, if used, may still be forgotten.
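The unpredictability that text prompting relies on can be illustrated with a short sketch. It assumes that a separate speech recognizer has already turned the response into text; the function names and the five-digit prompt length are hypothetical, and the voice comparison itself is out of scope here.

```python
import secrets


def make_challenge(n_digits=5):
    """Return an unpredictable digit string for the speaker to repeat."""
    return "".join(secrets.choice("0123456789") for _ in range(n_digits))


def transcript_matches(challenge, recognized_text):
    """Check that the recognized response contains exactly the prompted digits.

    recognized_text is assumed to come from a speech recognizer; spaces and
    punctuation between digits are ignored.
    """
    digits = "".join(ch for ch in recognized_text if ch.isdigit())
    return digits == challenge
```

Because each prompt is drawn fresh, a recording of an earlier session would contain the wrong digits and fail the content check before the voice is even compared.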
2.3 Anti-speaker Modelling
Most systems compare the new speech sample with the stored voiceprint for the claimed identity. Other systems also compare the newly input speech with the voices of other people. Such techniques are called anti-speaker modelling. The underlying philosophy of anti-speaker modelling is that under any conditions a voice sample from a particular speaker will be more like other samples from that person than voice samples from other speakers. If, for example, the speaker is using a bad telephone connection and the match with the speaker's voiceprint is poor, it is likely that the scores for the cohorts (or world model) will be even worse.
The most common anti-speaker techniques are
●discriminate training
●cohort modelling
●world models
Discriminate training builds the comparisons into the voiceprint of the new speaker using the voices of the other speakers in the system. Cohort modelling selects a small set of speakers whose voices are similar to that of the person being enrolled. Cohorts are, for example, always the same sex as the speaker. When the speaker attempts verification, the incoming speech is compared with his/her stored voiceprint and with the voiceprints of each of the cohort speakers. World models (also called background models or composite models) contain a cross-section of voices. The same world model is used for all speakers.
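One way to picture the cohort and world-model idea is to judge the claimed speaker's score relative to the competing scores rather than on its own. The sketch below is an assumption about how such scores might be combined (a simple difference against the best competitor); the paper does not specify a formula, and the function names and margin are illustrative.

```python
def normalized_score(target_score, competitor_scores):
    """Claimed speaker's score minus the best competing score.

    competitor_scores holds the scores of the cohort speakers, or a
    single-element list with the world-model score.
    """
    return target_score - max(competitor_scores)


def accept(target_score, competitor_scores, margin=0.0):
    """Accept when the claimed speaker beats every competitor by the margin."""
    return normalized_score(target_score, competitor_scores) > margin
```

For instance, on a bad telephone line the match with the true speaker's voiceprint might drop to 0.55, but if the cohort scores drop to 0.40 or lower the relative margin is still positive, so `accept(0.55, [0.40, 0.35, 0.30], margin=0.1)` succeeds where a fixed absolute threshold might not.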
2.4 Physical and Behavioural Biometrics
Speaker recognition is often characterized as a behavioural biometric. This description is set in contrast with physical biometrics, such as fingerprinting and iris scanning. Unfortunately, its classification as a behavioural biometric promotes the misunderstanding that speaker recognition is entirely (or almost entirely) behavioural. If that were the case, good mimics would have no difficulty defeating speaker-recognition systems. Early studies determined this was not the case and identified mimic-resistant factors. Those factors reflect the size and shape of a speaker's speaking mechanism (called the vocal tract).
The physical/behavioural classification also implies that performance of physical biometrics is not heavily influenced by behaviour. This misconception has led to the design of biometric systems that are unnecessarily vulnerable to careless and resistant users. This is unfortunate because it has delayed good human-factors design for those biometrics.
3. How is Speaker Verification Used?
Speaker verification is well-established as a means of providing biometric-based security for:
●telephone networks
●site access
●data and data networks
and monitoring of:
●criminal offenders in community release programmes
●outbound calls by incarcerated felons
●time and attendance
3.1 Telephone Networks
Toll fraud (theft of long-distance telephone services) is a growing problem that costs telecommunications services providers, government, and private industry US$3-5 billion annually in the United States alone. The major types of toll fraud include the following:
●Hacking CPE
●Calling card fraud
●Call forwarding
●Prisoner toll fraud
●Hacking 800 numbers
●Call sell operations
●900 number fraud
●Switch/network hits
●Social engineering
●Subscriber fraud
●Cloning wireless telephones
Among the most damaging are theft of services from customer premises equipment (CPE), such as PBXs, and cloning of wireless telephones. Cloning involves stealing the ID of a telephone and programming other phones with it. Subscriber fraud, a growing problem in Europe, involves enrolling for services, usually under an alias, with no intention of paying for them.
Speaker verification has two features that make it ideal for telephone and telephone network security: it uses voice input and it is not bound to proprietary hardware. Unlike most other biometrics that need specialized input devices, speaker verification operates with standard wireline and/or wireless telephones over existing telephone networks. Reliance on input devices created by other manufacturers for a purpose other than speaker verification also means that speaker verification cannot expect the consistency and quality offered by a proprietary input device. Speaker verification must overcome differences in input quality and the way in which speech frequencies are processed. This variability is produced by differences in network type (e.g. wireline vs. wireless), unpredictable noise levels on the line and in the background, transmission inconsistency, and differences in the microphones in telephone handsets. Sensitivity to such variability is reduced through techniques such as speech enhancement and noise modelling, but products still need to be tested under expected conditions of use.
Applications of speaker verification on wireline networks include secure calling cards, interactive voice response (IVR) systems, and integration with security for proprietary network systems. Such applications have been deployed by organizations as diverse as the University of Maryland, the Department of Foreign Affairs and International Trade Canada, and AMOCO. Wireless applications focus on preventing cloning but are being extended to subscriber fraud. The European Union is also actively applying speaker verification to telephony in various projects, including Caller Verification in Banking and Telecommunications, COST250, and Picasso.
3.2 Site Access
The first deployment of speaker verification more than 20 years ago was for site access control. Since then, speaker verification has been used to control access to office buildings, factories, laboratories, bank vaults, homes, pharmacy departments in hospitals, and even access to the US and Canada. Since April 1997, the US Immigration and Naturalization Service (INS) and other US and Canadian agencies have been using speaker verification to control after-hours border crossings at the Scobey, Montana port-of-entry. The INS is now testing a combination of speaker verification and face recognition in the commuter lanes of other ports-of-entry.
3.3 Data and Data Networks
Growing threats of unauthorized penetration of computing networks, concerns about security of the Internet, and increases in off-site employees with data access needs have produced an upsurge in the application of speaker verification to data and network security.
The financial services industry has been a leader in using speaker verification to protect proprietary data networks, electronic funds transfer between banks, access to customer accounts for telephone banking, and employee access to sensitive financial information. The Illinois Department of Revenue, for example, uses speaker verification to allow secure access to tax data by its off-site auditors.
3.4 Corrections
In 1993, there were 4.8 million adults under correctional supervision in the United States and that number continues to increase. Community release programmes, such as parole and home detention, are the fastest growing segments of this industry. It is no longer possible for corrections officers to provide adequate monitoring of those people.
In the US, corrections agencies have turned to electronic monitoring systems. Since the late 1980s speaker verification has been one of those electronic monitoring tools. Today, several products are used by corrections agencies, including an alcohol breathalyzer with speaker verification for people convicted of driving while intoxicated and a system that calls offenders on home detention at random times during the day.
Speaker verification also controls telephone calls made by incarcerated felons. Inmates place a lot of calls. In 1994, US telecommunications services providers made $1.5 billion on outbound calls from inmates. Most inmates have restrictions on whom they can call. Speaker verification ensures that an inmate is not using another inmate's PIN to make a forbidden contact.
3.5 Time and Attendance
Time and attendance applications are a small but growing segment of the speaker-verification market. SOC Credit Union in Michigan has used speaker verification for time and attendance monitoring of part-time employees for several years. Like many others, SOC Credit Union first deployed speaker verification for security and later extended it to time and attendance monitoring for part-time employees.
4. Standards
This paper concludes with a short discussion of application programming interface (API) standards. An API contains the function calls that enable programmers to use speaker verification to create a product or application. Until April 1997, when the Speaker Verification API (SV API) standard was introduced, all available APIs for biometric products were proprietary. SV API remains the only API standard covering a specific biometric. It is now being incorporated into proposed generic biometric API standards. SV API was developed by a cross-section of speaker-recognition vendors, consultants, and end-user organizations to address a spectrum of needs and to support a broad range of product features. Because it supports both high level functions (e.g. calls to enrol) and low level functions (e.g. choices of audio input features) it facilitates development of different types of applications by both novice and experienced developers.
Why is it important to support API standards? Developers using a product with a proprietary API face difficult choices if the vendor of that product goes out of business, fails to support its product, or does not keep pace with technological advances. One of those choices is to rebuild the application from scratch using a different product. Given the same events, developers using an SV API-compliant product can select another compliant vendor and need perform far fewer modifications. Consequently, SV API makes development with speaker verification less risky and less costly. The advent of generic biometric API standards further facilitates integration of speaker verification with other biometrics. All of this helps speaker-verification vendors because it fosters growth in the marketplace. In the final analysis, active support of API standards by developers and vendors benefits everyone.
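The portability argument can be illustrated with a small sketch in which application code is written against a shared interface and the vendor behind it is swapped without touching the application. Everything named here is hypothetical: `SpeakerVerifier`, `VendorA`, `VendorB`, and `clock_in` are invented for the example and are not the function calls defined by SV API, and the byte comparison merely stands in for real voice matching.

```python
import hashlib
from typing import Protocol


class SpeakerVerifier(Protocol):
    """Hypothetical common interface that compliant products would expose."""

    def enrol(self, user_id: str, audio: bytes) -> None: ...

    def verify(self, user_id: str, audio: bytes) -> bool: ...


class VendorA:
    """Toy product that keeps the raw enrolment sample."""

    def __init__(self):
        self._samples = {}

    def enrol(self, user_id, audio):
        self._samples[user_id] = audio

    def verify(self, user_id, audio):
        return self._samples.get(user_id) == audio


class VendorB:
    """Toy product with different internals behind the same interface."""

    def __init__(self):
        self._digests = {}

    def enrol(self, user_id, audio):
        self._digests[user_id] = hashlib.sha256(audio).digest()

    def verify(self, user_id, audio):
        return self._digests.get(user_id) == hashlib.sha256(audio).digest()


def clock_in(verifier: SpeakerVerifier, user_id: str, audio: bytes) -> str:
    """Application code: depends only on the shared interface."""
    return "accepted" if verifier.verify(user_id, audio) else "rejected"
```

Replacing `VendorA()` with `VendorB()` requires no change to `clock_in`, which is the kind of reduced rework the paragraph above attributes to a standard API.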
Appendix B: Chinese Translation
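Speaker Recognition
Author: Judith A. Markowitz, J. Markowitz Consultants
Speaker recognition uses the features of a person's voice to identify or verify that person. With commercial systems more than 10 years old and deployed non-commercial systems more than 20 years old, it is a well-established biometric. This paper describes how speaker-recognition systems work and how they are used in applications.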

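1. Introduction
Speaker recognition (also called voice ID and voice biometrics) is the only human biometric technology in commercial use today that extracts information from sound patterns. With deployed commercial applications more than 10 years old and non-commercial systems more than 20 years old, it is also one of the most well-established biometrics.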

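2. How Do Speaker-Recognition Systems Work?
Speaker-recognition systems use a person's voice and speaking style to:
●attach an identity to the voice of an unknown speaker
●verify that a person is who he or she claims to be
●separate each individual's voice from the other voices in a multi-speaker environment
The first operation is called speaker identification or speaker recognition; the second has many names, including speaker verification, speaker authentication, voice verification, and voice recognition; the third is speaker separation, also called speaker classification in some situations. This paper focuses on speaker verification, the most highly commercialized of these technologies.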

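2.1 Overview of the Process
Speaker verification is a biometric technology for determining whether a person is who he or she claims to be. It should not be confused with speech recognition, a non-biometric technology used to determine what a person is saying. Speech-recognition products are not designed to determine who is speaking.
Speaker verification begins with a claim of identity (see Figure B1). Usually the claim requires manual entry of a personal identification number (PIN), but a growing number of products allow the PIN to be spoken and use speech recognition to determine the numeric code. Some applications replace manual or spoken PIN entry with bank cards, smartcards, or the number of the telephone being used. The PIN is also dispensed with when the speaker-verification system contacts the user, an approach typical of systems that monitor offenders serving sentences at home.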

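User: claims an identity
System: retrieves the stored voiceprint for that identity
System: prompts the user for the password
User: speaks the password
System: compares the spoken password with the stored sample
System: compares the result with the threshold
System: accepts or rejects the identity claim
Figure B1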
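Once the identity claim has been made, the system retrieves the stored voice sample (called a voiceprint) for the claimed identity and requests spoken input from the claimant. Usually the requested input is a password. The newly input speech is compared with the stored voiceprint, and the result of the comparison is measured against an acceptance/rejection threshold. Finally, the system accepts the speaker as the authorized user, rejects the speaker as an impostor, or takes another action defined by the application. Some systems report a confidence level or other score indicating how reliable the decision is.
If verification succeeds, the system may update the acoustic information in the stored voiceprint. This process is called adaptation. Adaptation is an unobtrusive way of keeping voiceprints current, and it is used in many commercial speaker-verification systems.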

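2.2 The Speech Sample
As with all biometrics, a speech sample must be provided before verification (or identification) can be performed; this step is called enrolment. The sample is used to create the stored voiceprint.
Systems differ in the type and amount of speech they need for enrolment and verification. The basic categories of these systems are:
●text dependent
●text independent
●text prompted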
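2.2.1 Text Dependent
Most commercial systems are text dependent. Text-dependent systems expect the user to say a predetermined phrase, password, or identifier. By controlling the words that are spoken, the system can look for a close match with the stored voiceprint. Typically, each user chooses a private password, although some administrators prefer to assign passwords. Passwords provide extra security, because an impostor must know both the correct PIN and the password and must also have a matching voice. Some systems enhance security further by not storing a human-readable form of the password.
A global phrase can also be used. In its 1996 pilot of speaker verification, Chase Manhattan Bank used 'Verification by Chemical Bank'. Global phrases avoid the problem of forgotten passwords but lack the extra protection that private passwords provide.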

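2.2.2 Text Independent
Text-independent systems ask the user to talk. What the user says is different every time. Accurately matching utterances that are completely different from one another is very difficult, especially in noisy environments or over very poor telephone connections. Consequently, commercial deployment of text-independent verification has been limited.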

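2.2.3 Text Prompted
Text-prompted systems (also called challenge response) ask the speaker to repeat one or more randomly selected numbers or words (for example “43516”, “27, 46”, or “Friday, computer”). Text prompting adds time to enrolment and verification, but it improves security against tape recordings. Because the items to be repeated cannot be predicted, playing back a recording is very difficult. In addition, there is no problem of forgotten passwords, although a PIN, if one is used, may still be forgotten.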

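2.3 Anti-speaker Modelling
Most systems compare the new speech sample with the stored voiceprint for the claimed identity. Other systems also compare the newly input speech with the voices of other people. Such techniques are called anti-speaker modelling. The underlying principle is that, under any conditions, a voice sample from a particular speaker will resemble other samples from that speaker more than it resembles samples from other speakers. For example, if the speaker is using a bad telephone connection and the match with the speaker's voiceprint is poor, the scores for the cohorts (or world model) are likely to be even worse.
The most common anti-speaker techniques are:
●discriminate training
●cohort modelling
●world models
Discriminate training builds the comparisons into the new speaker's voiceprint using the voices of the other speakers in the system. Cohort modelling selects a small set of speakers whose voices are similar to that of the person being enrolled. Cohorts are, for example, usually speakers of the same sex. When the speaker attempts verification, the incoming speech is compared with his or her voiceprint and with the voiceprint of each cohort speaker. World models (also called background models or composite models) contain a cross-section of voices. The same world model is used for all speakers.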

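2.4 Physical and Behavioural Biometrics
Speaker recognition is often characterized as a behavioural biometric. That description is set in contrast with physical biometrics, such as fingerprinting and iris scanning. Unfortunately, classifying it as a behavioural biometric encourages the misunderstanding that speaker recognition is entirely (or almost entirely) behavioural. If that were so, good mimics would defeat speaker-recognition systems without difficulty. Early studies established that this is not the case and identified mimic-resistant factors. Those factors reflect the size and shape of the speaker's speech-producing mechanism (called the vocal tract).
The physical/behavioural classification also implies that the performance of physical biometrics is not strongly influenced by behaviour. This misconception has led to the design of biometric systems that are unnecessarily vulnerable to careless and resistant users. This is unfortunate, because it has delayed good human-factors design for those biometrics.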

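3. How Is Speaker Verification Used?
Speaker verification is a well-established means of providing biometric-based security for:
●telephone networks
●site access
●data and data networks
and for monitoring:
●criminal offenders in community release programmes
●outbound calls by incarcerated felons
●time and attendance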
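3.1 Telephone Networks
Toll fraud (theft of long-distance telephone services) is a growing problem that costs telecommunications service providers, government, and private industry US$3-5 billion a year in the United States alone. The major types of toll fraud include the following:
●Hacking CPE
●Calling card fraud
●Call forwarding
●Prisoner toll fraud
●Hacking 800 numbers
●Call sell operations
●900 number fraud
●Switch/network hits
●Social engineering
●Subscriber fraud
●Cloning wireless telephones
Among the most damaging are theft of services from customer premises equipment (CPE), such as PBXs, and the cloning of wireless telephones. Cloning involves stealing the ID of a telephone and programming other phones with it. Subscriber fraud, a growing problem in Europe, involves signing up for services, usually under an alias, with no intention of paying for them.
Speaker verification has two features that make it well suited to telephone and telephone-network security: it uses voice input, and it is not tied to proprietary hardware. Unlike most other biometrics, which need special input devices, speaker verification works with standard wireline and/or wireless telephones over existing telephone networks. Relying on input devices that other manufacturers built for purposes other than speaker verification also means that it cannot count on the consistency and quality a proprietary input device would provide. Speaker verification must cope with differences in input quality and in the way speech frequencies are processed. This variability arises from differences in network type (for example, wireline versus wireless), unpredictable noise levels on the line and in the background, inconsistent transmission, and differences among the microphones in telephone handsets. Sensitivity to such variability can be reduced with techniques such as speech enhancement and noise modelling, but products still need to be tested under the expected conditions of use.
Applications of speaker verification on wireline networks include secure calling cards, interactive voice response (IVR) systems, and integration with security for proprietary network systems. Such applications have been deployed by organizations as varied as the University of Maryland, the Department of Foreign Affairs and International Trade Canada, and AMOCO. Wireless applications focus on preventing cloning but are being extended to subscriber fraud. The European Union is also actively applying speaker verification to telephony in various projects, including Caller Verification in Banking and Telecommunications, COST250, and Picasso.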

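3.2 Site Access
The first deployment of speaker verification, more than 20 years ago, was for site access control. Since then, speaker verification has been used to control access to office buildings, factories, laboratories, bank vaults, homes, hospital pharmacy departments, and even entry into the US and Canada. Since 1997, the US Immigration and Naturalization Service (INS) and other US and Canadian agencies have used speaker verification to control after-hours border crossings at the Scobey, Montana port of entry. The INS is now testing a combination of speaker verification and face recognition in the commuter lanes of other ports of entry.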

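3.3 Data and Data Networks
Growing threats of unauthorized penetration of computer networks, concerns about Internet security, and an increase in off-site employees who need access to data have produced an upsurge in the application of speaker verification to data and network security.
The financial services industry has been a leader in using speaker verification to protect proprietary data networks, electronic funds transfers between banks, access to customer accounts for telephone banking, and employee access to sensitive financial information. The Illinois Department of Revenue, for example, uses speaker verification to give its off-site auditors secure access to tax data.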

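3.4 Corrections
In 1993, 4.8 million adults were under correctional supervision in the United States, and that number continues to grow. Community release programmes, such as parole and home detention, are the fastest-growing segments of this industry. Corrections officers can no longer monitor these people adequately.
In the US, corrections agencies have turned to electronic monitoring systems. Since the late 1980s, speaker verification has been one of those electronic monitoring tools. Today corrections agencies use several products, including an alcohol breathalyzer with speaker verification for people convicted of driving while intoxicated and a system that calls offenders on home detention at random times during the day.
Speaker verification also controls telephone calls made by incarcerated felons. Inmates place a lot of calls. In 1994, US telecommunications service providers earned $1.5 billion from outbound inmate calls. Most inmates are restricted in whom they may call. Speaker verification ensures that an inmate is not using another inmate's PIN to make a forbidden contact.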

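3.5 Time and Attendance
Time and attendance applications are a small but growing segment of the speaker-verification market. SOC Credit Union in Michigan has used speaker verification for time and attendance monitoring of part-time employees for several years. Like many other organizations, SOC Credit Union first deployed speaker verification for security and later extended it to time and attendance monitoring of part-time employees.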

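4. Standards
This paper concludes with a short discussion of application programming interface (API) standards. An API contains the function calls that programmers use to build a speaker-verification product or application. Until April 1997, when the Speaker Verification API (SV API) standard was introduced, all available APIs for biometric products were proprietary. SV API remains the only API standard covering a specific biometric. It is now being incorporated into proposed generic biometric API standards. SV API was developed by a cross-section of speaker-recognition vendors, consultants, and end-user organizations to address a range of needs and to support a broad set of product features. Because it supports both high-level functions (such as calls to enrol) and low-level functions (such as choices of audio input features), it helps both novice and experienced developers build different types of applications.
Why is it important to support API standards? Developers using a product with a proprietary API face difficult choices if the product's vendor goes out of business, stops supporting the product, or fails to keep pace with technological advances. One of those choices is to rebuild the application from scratch with a different product. In the same circumstances, developers using an SV API-compliant product can choose another compliant vendor and need to make far fewer modifications. Consequently, SV API makes development with speaker verification less risky and less costly. The advent of generic biometric API standards further eases the integration of speaker verification with other biometrics. All of this benefits speaker-verification vendors because it fosters growth in the marketplace. In the final analysis, active support of API standards by developers and vendors benefits everyone.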