TEXT-INDEPENDENT SPEAKER RECOGNITION USING PROBABILISTIC ABSTRACT SVM WITH GMM ADJUSTMENT

Foreign Literature Translation: Speaker Recognition

附录A 英文文献Speaker RecognitionBy Judith A. Markowitz, J. Markowitz ConsultantsSpeaker recognition uses features of a person‟s voice to identify or verify that person. It is a well-established biometric with commercial systems that are more than 10 years old and deployed non-commercial systems that are more than 20 years old. This paper describes how speaker recognition systems work and how they are used in applications.1. IntroductionSpeaker recognition (also called voice ID and voice biometrics) is the only human-biometric technology in commercial use today that extracts information from sound patterns. It is also one of the most well-established biometrics, with deployed commercial applications that are more than 10 years old and non-commercial systems that are more than 20 years old.2. How do Speaker-Recognition Systems WorkSpeaker-recognition systems use features of a person‟s voice and speaking style to:●attach an identity to the voice of an unknown speaker●verify that a person is who she/ he claims to be●separate one person‟s voice from other voices in a multi-speakerenvironmentThe first operation is called speak identification or speaker recognition; the second has many names, including speaker verification, speaker authentication, voice verification, and voice recognition; the third is speaker separation or, in some situations, speaker classification. This papers focuses on speaker verification, the most highly commercialized of these technologies.2.1 Overview of the ProcessSpeaker verification is a biometric technology used for determining whether the person is who she or he claims to be. It should not be confused with speech recognition, a non-biometric technology used for identifying what a person is saying. Speech recognition products are not designed to determine who is speaking.Speaker verification begins with a claim of identity (see Figure A1). Usually, the claim entails manual entry of a personal identification number (PIN), but a growing number of products allow spoken entry of the PIN and use speech recognition to identify the numeric code. Some applications replace manual or spoken PIN entry with bank cards, smartcards, or the number of the telephone being used. PINS are also eliminated when a speaker-verification system contacts the user, an approach typical of systems used to monitor home-incarcerated criminals.Figure A1.Once the identity claim has been made, the system retrieves the stored voice sample (called a voiceprint) for the claimed identity and requests spoken input from the person making the claim. Usually, the requested input is a password. The newly input speech is compared with the stored voiceprint and the results of that comparison are measured against an acceptance/rejection threshold. Finally, the system accepts the speaker as the authorized user, rejects the speaker as an impostor, or takes another action determined by the application. Some systems report a confidence level or other score indicating how confident it about its decision.If the verification is successful the system may update the acoustic information in the stored voiceprint. This process is called adaptation. Adaptation is an unobtrusive solution for keeping voiceprints current and is used by many commercial speaker verification systems.2.2 The Speech SampleAs with all biometrics, before verification (or identification) can be performed the person must provide a sample of speech (called enrolment). 
The sample is used to create the stored voiceprint.Systems differ in the type and amount of speech needed for enrolment and verification. The basic divisions among these systems are●text dependent●text independent●text prompted2.2.1 Text DependentMost commercial systems are text dependent.Text-dependent systems expect the speaker to say a pre-determined phrase, password, or ID. By controlling the words that are spoken the system can look for a close match with the stored voiceprint. Typically, each person selects a private password, although some administrators prefer to assign passwords. Passwords offer extra security, requiring an impostor to know the correct PIN and password and to have a matching voice. Some systems further enhance security by not storing a human-readable representation of the password.A global phrase may also be used. In its 1996 pilot of speaker verification Chase Manhattan Bank used …Verification by Chemical Bank‟. Global phrases avoid the problem of forgotten passwords, but lack the added protection offered by private passwords.2.2.2 Text IndependentText-independent systems ask the person to talk. What the person says is different every time. It is extremely difficult to accurately compare utterances that are totally different from each other - particularly in noisy environments or over poor telephone connections. Consequently, commercial deployment of text-independentverification has been limited.2.2.3 Text PromptedText-prompted systems (also called challenge response) ask speakers to repeat one or more randomly selected numbers or words (e.g. “43516”, “27,46”, or “Friday, c omputer”). Text prompting adds time to enrolment and verification, but it enhances security against tape recordings. Since the items to be repeated cannot be predicted, it is extremely difficult to play a recording. Furthermore, there is no problem of forgetting a password, even though the PIN, if used, may still be forgotten.2.3 Anti-speaker ModellingMost systems compare the new speech sample with the stored voiceprint for the claimed identity. Other systems also compare the newly input speech with the voices of other people. Such techniques are called anti-speaker modelling. The underlying philosophy of anti-speaker modelling is that under any conditions a voice sample from a particular speaker will be more like other samples from that person than voice samples from other speakers. If, for example, the speaker is using a bad telephone connection and the match with the speaker‟s voiceprint is poor, it is likely that the scores for the cohorts (or world model) will be even worse.The most common anti-speaker techniques are●discriminate training●cohort modeling●world modelsDiscriminate training builds the comparisons into the voiceprint of the new speaker using the voices of the other speakers in the system. Cohort modelling selects a small set of speakers whose voices are similar to that of the person being enrolled. Cohorts are, for example, always the same sex as the speaker. When the speaker attempts verification, the incoming speech is compared with his/her stored voiceprint and with the voiceprints of each of the cohort speakers. World models (also called background models or composite models) contain a cross-section of voices. The same world model is used for all speakers.2.4 Physical and Behavioural BiometricsSpeaker recognition is often characterized as a behavioural biometric. This description is set in contrast with physical biometrics, such as fingerprinting and iris scanning. 
Unfortunately, its classification as a behavioural biometric promotes the misunderstanding that speaker recognition is entirely (or almost entirely) behavioural. If that were the case, good mimics would have no difficulty defeating speaker-recognition systems. Early studies determined this was not the case and identified mimic-resistant factors. Those factors reflect the size and shape of a speaker‟s speaking mechanism (called the vocal tract).The physical/behavioural classification also implies that performance of physical biometrics is not heavily influenced by behaviour. This misconception has led to the design of biometric systems that are unnecessarily vulnerable to careless and resistant users. This is unfortunate because it has delayed good human-factors design for those biometrics.3. How is Speaker Verification Used?Speaker verification is well-established as a means of providing biometric-based security for:●telephone networks●site access●data and data networksand monitoring of:●criminal offenders in community release programmes●outbound calls by incarcerated felons●time and attendance3.1 Telephone NetworksToll fraud (theft of long-distance telephone services) is a growing problem that costs telecommunications services providers, government, and private industry US$3-5 billion annually in the United States alone. The major types of toll fraud include the following:●Hacking CPE●Calling card fraud●Call forwarding●Prisoner toll fraud●Hacking 800 numbers●Call sell operations●900 number fraud●Switch/network hits●Social engineering●Subscriber fraud●Cloning wireless telephonesAmong the most damaging are theft of services from customer premises equipment (CPE), such as PBXs, and cloning of wireless telephones. Cloning involves stealing the ID of a telephone and programming other phones with it. Subscriber fraud, a growing problem in Europe, involves enrolling for services, usually under an alias, with no intention of paying for them.Speaker verification has two features that make it ideal for telephone and telephone network security: it uses voice input and it is not bound to proprietary hardware. Unlike most other biometrics that need specialized input devices, speaker verification operates with standard wireline and/or wireless telephones over existing telephone networks. Reliance on input devices created by other manufacturers for a purpose other than speaker verification also means that speaker verification cannot expect the consistency and quality offered by a proprietary input device. Speaker verification must overcome differences in input quality and the way in which speech frequencies are processed. This variability is produced by differences in network type (e.g. wireline v wireless), unpredictable noise levels on the line and in the background, transmission inconsistency, and differences in the microphone in telephone handset. Sensitivity to such variability is reduced through techniques such as speech enhancement and noise modelling, but products still need to be tested under expected conditions of use.Applications of speaker verification on wireline networks include secure calling cards, interactive voice response (IVR) systems, and integration with security forproprietary network systems. Such applications have been deployed by organizations as diverse as the University of Maryland, the Department of Foreign Affairs and International Trade Canada, and AMOCO. Wireless applications focus on preventing cloning but are being extended to subscriber fraud. 
The European Union is also actively applying speaker verification to telephony in various projects, including Caller Verification in Banking and Telecommunications, COST250, and Picasso.3.2 Site accessThe first deployment of speaker verification more than 20 years ago was for site access control. Since then, speaker verification has been used to control access to office buildings, factories, laboratories, bank vaults, homes, pharmacy departments in hospitals, and even access to the US and Canada. Since April 1997, the US Department of Immigration and Naturalization (INS) and other US and Canadian agencies have been using speaker verification to control after-hours border crossings at the Scobey, Montana port-of-entry. The INS is now testing a combination of speaker verification and face recognition in the commuter lane of other ports-of-entry.3.3 Data and Data NetworksGrowing threats of unauthorized penetration of computing networks, concerns about security of the Internet, and increases in off-site employees with data access needs have produced an upsurge in the application of speaker verification to data and network security.The financial services industry has been a leader in using speaker verification to protect proprietary data networks, electronic funds transfer between banks, access to customer accounts for telephone banking, and employee access to sensitive financial information. The Illinois Department of Revenue, for example, uses speaker verification to allow secure access to tax data by its off-site auditors.3.4 CorrectionsIn 1993, there were 4.8 million adults under correctional supervision in the United States and that number continues to increase. Community release programmes, such as parole and home detention, are the fastest growing segments of this industry. It is no longer possible for corrections officers to provide adequate monitoring ofthose people.In the US, corrections agencies have turned to electronic monitoring systems. Since the late 1980s speaker verification has been one of those electronic monitoring tools. Today, several products are used by corrections agencies, including an alcohol breathalyzer with speaker verification for people convicted of driving while intoxicated and a system that calls offenders on home detention at random times during the day.Speaker verification also controls telephone calls made by incarcerated felons. Inmates place a lot of calls. In 1994, US telecommunications services providers made $1.5 billion on outbound calls from inmates. Most inmates have restrictions on whom they can call. Speaker verification ensures that an inmate is not using another inmate‟s PIN to make a forbidden contact.3.5 Time and AttendanceTime and attendance applications are a small but growing segment of the speaker-verification market. SOC Credit Union in Michigan has used speaker verification for time and attendance monitoring of part-time employees for several years. Like many others, SOC Credit Union first deployed speaker verification for security and later extended it to time and attendance monitoring for part-time employees.4. StandardsThis paper concludes with a short discussion of application programming interface (API) standards. An API contains the function calls that enable programmers to use speaker-verification to create a product or application. Until April 1997, when the Speaker Verification API (SV API) standard was introduced, all available APIs for biometric products were proprietary. 
SV API remains the only API standard covering a specific biometric. It is now being incorporated into proposed generic biometric API standards. SV API was developed by a cross-section of speaker-recognition vendors, consultants, and end-user organizations to address a spectrum of needs and to support a broad range of product features. Because it supports both high level functions (e.g. calls to enrol) and low level functions (e.g. choices of audio input features) itfacilitates development of different types of applications by both novice and experienced developers.Why is it important to support API standards? Developers using a product with a proprietary API face difficult choices if the vendor of that product goes out of business, fails to support its product, or does not keep pace with technological advances. One of those choices is to rebuild the application from scratch using a different product. Given the same events, developers using a SV API-compliant product can select another compliant vendor and need perform far fewer modifications. Consequently, SV API makes development with speaker verification less risky and less costly. The advent of generic biometric API standards further facilitates integration of speaker verification with other biometrics. All of this helps speaker-verification vendors because it fosters growth in the marketplace. In the final analysis active support of API standards by developers and vendors benefits everyone.附录B 中文翻译说话人识别作者:Judith A. Markowitz, J. Markowitz Consultants 说话人识别是用一个人的语音特征来辨认或确认这个人。

Voiceprint Recognition (1)

Template matching: dynamic time warping (DTW) is used to align the training and test feature sequences; it is mainly used for fixed-phrase applications, which are usually text-dependent tasks. Nearest-neighbour methods: all feature vectors are kept at training time, and at recognition time the K nearest training vectors are found for every test vector and the decision is based on them; model storage and similarity computation are usually both very large. Neural-network methods: these take many forms, such as the multilayer perceptron and radial basis function (RBF) networks; they can be trained explicitly to discriminate a speaker from background speakers, but the training cost is high and the models do not generalize well. Polynomial classifiers: these achieve high accuracy, but model storage and computation are both relatively large.
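As a minimal illustration of the template-matching approach mentioned above (a sketch, not code from any of the systems cited here), the following Python function aligns two per-frame feature sequences with classical dynamic time warping; seq_a and seq_b are assumed to be NumPy arrays of shape (frames, dimensions).

import numpy as np

def dtw_distance(seq_a, seq_b):
    """Classical dynamic time warping between two feature sequences."""
    n, m = len(seq_a), len(seq_b)
    acc = np.full((n + 1, m + 1), np.inf)   # accumulated cost matrix
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # local frame distance
            acc[i, j] = cost + min(acc[i - 1, j],      # insertion
                                   acc[i, j - 1],      # deletion
                                   acc[i - 1, j - 1])  # match
    return acc[n, m]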
In theory, a voiceprint is like a fingerprint: it is rare for two people to have identical voiceprint characteristics.
US research institutions have shown that, under certain specific conditions, voiceprints can be used as valid evidence. The US Federal Bureau of Investigation analysed 2,000 voiceprint-related cases and found an error rate of only 0.31% when voiceprints were used as evidence. The use of voiceprints to distinguish between different people is now widely accepted and is applied in many fields. China's Ministry of Public Security currently uses a similar method for voiceprint identification, with spectrograms still rendered in grey scale. The main parameters extracted are the pitch spectrum and its envelope, the energy of pitch frames, and the frequencies and trajectories of the formants; these are then combined with traditional matching methods such as pattern recognition to perform voiceprint recognition.
The vocal organs used in speech differ greatly in size and shape from person to person, so the voiceprint spectrograms of any two people differ, mainly in the following respects:
●Resonance characteristics: pharyngeal resonance, nasal resonance, and oral resonance.
●Voice purity: different voices differ in purity, which can roughly be graded as high purity (bright), low purity (hoarse), and medium purity.
●Average pitch: whether the voice is, as commonly described, high or low.
●Pitch range: whether the voice sounds full or thin.
Different speakers' voices show different formant distributions in the spectrogram. Voiceprint recognition judges whether two utterances come from the same person by comparing how the two speakers pronounce the same phonemes, thereby "recognizing the person by the voice".
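A hedged sketch of how the pitch and spectral-envelope parameters listed above could be extracted with the librosa library; the file name and parameter values are illustrative assumptions, not taken from the text.

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)        # illustrative file name
f0, voiced_flag, voiced_probs = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
mean_pitch = np.nanmean(f0)                            # "average pitch" feature
frame_energy = librosa.feature.rms(y=y)[0]             # per-frame energy
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # spectral-envelope summary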
Watersheds in the development of voiceprint recognition
The third watershed came in 2011, when Li Deng presented his results on DNN-based speech recognition at Microsoft at the 11th National Conference on Man-Machine Speech Communication, reporting a 30% improvement in recognition rate; this lifted the accuracy of voiceprint recognition to a new level. A DNN can learn highly abstract speaker characteristics from large numbers of samples and is strongly robust to noise. From that point deep learning was adopted by industry, and attention to voiceprint recognition technology in China also turned to deep learning.

Chinese-English Translation

附录英文原文:Chinese Journal of ElectronicsVo1.15,No.3,July 2006A Speaker--Independent Continuous SpeechRecognition System Using Biomimetic Pattern RecognitionWANG Shoujue and QIN Hong(Laboratory of Artificial Neural Networks,Institute ol Semiconductors,Chinese Academy Sciences,Beijing 100083,China)Abstract—In speaker-independent speech recognition,the disadvantage of the most diffused technology(HMMs,or Hidden Markov models)is not only the need of many more training samples,but also long train time requirement. This Paper describes the use of Biomimetic pattern recognition(BPR)in recognizing some mandarin continuous speech in a speaker-independent Manner. A speech database was developed for the course of study.The vocabulary of the database consists of 15 Chinese dish’s names, the length of each name is 4 Chinese words.Neural networks(NNs)based on Multi-weight neuron(MWN) model are used to train and recognize the speech sounds.The number of MWN was investigated to achieve the optimal performance of the NNs-based BPR.This system, which is based on BPR and can carry out real time recognition reaches a recognition rate of 98.14%for the first option and 99.81%for the first two options to the Persons from different provinces of China speaking common Chinese speech.Experiments were also carried on to evaluate Continuous density hidden Markov models(CDHMM ),Dynamic time warping(DTW)and BPR for speech recognition.The Experiment results show that BPR outperforms CDHMM and DTW especially in the cases of samples of a finite size.Key words—Biomimetic pattern recognition, Speech recogniton,Hidden Markov models(HMMs),Dynamic time warping(DTW).I.IntroductionThe main goal of Automatic speech recognition(ASR)is to produce a system which will recognize accurately normal human speech from any speaker.The recognition system may be classified as speaker-dependent or speaker-independent.The speaker dependence requires that the system be personally trained with the speech of the person that will be involved with its operation in order to achieve a high recognition rate.For applications on the public facilities,on the other hand,the system must be capable of recognizing the speech uttered by many different people,with different gender,age,accent,etc.,the speaker independence has many more applications,primarily in the general area of public facilities.The most diffused technology in speaker-independent speech recognition is Hidden Markov Models,the disadvantage of it is not only the need of many more training samples,but also long train time requirement.Since Biomimetic pattern recognition(BPR) was first proposed by Wang Shoujue,it has already been applied to object recognition, face identification and face recognition etc.,and achieved much better performance.With some adaptations,such modeling techniques could be easily used within speech recognition too.In this paper,a real-time mandarin speech recognition system based on BPR is proposed,which outperforms HMMs especially in the cases of samples of a finite size.The system is a small vocabulary speaker independent continuous speech recognition one. The whole system is implemented on the PC under windows98/2000/XPenvironment with CASSANN-II neurocomputer.It supports standard 16-bit sound card .II .Introduction of Biomimetic Pattern Recognition and Multi —Weights Neuron Networks1. 
Biomimetic pattern recognition
Traditional pattern recognition aims at the optimal classification of different classes of samples in the feature space. BPR, by contrast, intends to find the optimal coverage of the samples of the same type. It is based on the Principle of Homology-Continuity: if there are two samples of the same class, the difference between them must change gradually, so a gradual-change sequence must exist between the two samples. In BPR theory, the construction of the sample subspace of each type of samples depends only on the type itself. More specifically, the construction of the subspace of a certain type of samples depends on analyzing the relations between the trained types of samples and on utilizing the methods of "coverage of objects with complicated geometrical forms in the multidimensional space".
2. Multi-weights neuron and multi-weights neuron networks
A multi-weights neuron can be described as follows:
Y = f[Φ(W1, W2, ..., Wm, X) − θ]
where W1, W2, ..., Wm are m weight vectors, X is the input vector, Φ is the neuron's computation function, θ is the threshold, and f is the activation function. According to dimension theory, in the feature space R^n with X ∈ R^n, the equation Φ(W1, W2, ..., Wm, X) = θ constructs an (n−1)-dimensional hypersurface in the n-dimensional space, determined by the weights W1, W2, ..., Wm; it divides the n-dimensional space into two parts. If Φ(W1, W2, ..., Wm, X) = θ is a closed hypersurface, it constructs a finite subspace. According to the principle of BPR, the subspace of a certain type of samples is determined by that type of samples itself. If we can find a set of multi-weights neurons (a multi-weights neuron network) covering all the training samples, the subspace of the neural network represents the sample subspace. When an unknown sample lies in that subspace, it can be determined to be of the same type as the training samples. Moreover, if a new type of samples is added, it is not necessary to retrain any of the already trained types; the training of a certain type of samples has nothing to do with the other types.
III. System Description
The speech recognition system is divided into two main blocks. The first is the signal pre-processing and speech feature extraction block. The other is the multi-weights neuron network, which performs the task of BPR.
1. Speech feature extraction
Mel-based cepstral coefficients (MFCC) are used as speech features. They are calculated as follows: A/D conversion; endpoint detection using short-time energy and zero-crossing rate (ZCR); pre-emphasis and Hamming windowing; fast Fourier transform; DCT transform. The number of features extracted for each frame is 16, and 32 frames are chosen for every utterance, so a 512-dimensional Mel-cepstral feature vector (16 × 32 numerical values) represents the pronunciation of every word.
2. Multi-weights neuron network architecture
As a new general-purpose theoretical model of pattern recognition, BPR is here realized by multi-weights neuron networks. In training a certain class of samples, a multi-weights neuron subnetwork is established. The subnetwork consists of one input layer, one multi-weights neuron hidden layer, and one output layer. Such a subnetwork can be considered as a mapping F: R^512 → R with
F(X) = min(Y1, Y2, ..., Ym)
where Yi is the output of a multi-weights neuron, there are m hidden multi-weights neurons, i = 1, 2, ..., m, and X ∈ R^512 is the input vector.
IV. Training for MWN Networks
1. Basics of MWN network training
Training one multi-weights neuron subnetwork requires calculating the weights of the multi-weights neuron layer. The multi-weights neuron and the training algorithm used are those of Ref. [4]. In this algorithm, if the number of training samples of each class is N, we can use N − 2 neurons; in this paper N = 30. Each neuron Yi = f[Φ(s_i, s_{i+1}, s_{i+2}, x)] is a function with multi-vector input and one scalar output.
2. Optimization method
According to the comments in IV.1, if there are many training samples the number of neurons will be very large, which reduces the recognition speed. When learning several classes of samples, knowledge of the class membership of the training samples is available; we use this information in a supervised training algorithm to reduce the network scale. When training class A, we regard the remaining training samples of the other 14 classes as class B. So there are 30 training samples in set A: A = {a_1, a_2, ..., a_30}, and 420 training samples in set B: B = {b_1, b_2, ..., b_420}. First select 3 samples from A, giving a neuron Y_1 = f[Φ(a_k1, a_k2, a_k3, x)]. Let A^(0) = A, Y_1_Ai = f[Φ(a_k1, a_k2, a_k3, a_i)] for i = 1, 2, ..., 30, and Y_1_Bj = f[Φ(a_k1, a_k2, a_k3, b_j)] for j = 1, 2, ..., 420; let V_1 = min_j(Y_1_Bj). We specify a value r, 0 < r < 1. If Y_1_Ai < r · V_1, remove a_i from set A, thus obtaining a new set A^(1). We continue until the set A^(k) is empty, A^(k) = {φ}; then the training is ended, and the hidden layer of the subnetwork for class A consists of the neurons constructed in this way.
V. Experiment Results
A speech database consisting of 15 Chinese dish names was developed for this study. The length of each name is 4 Chinese words; that is, each speech sample is a continuous string of 4 words, such as "yu xiang rou si" or "gong bao ji ding". It was organized into two sets: a training set and a test set. The speech signal is sampled at 16 kHz with 16-bit resolution.
Table 1. Experimental results at different values of r
450 utterances constitute the training set used to train the multi-weights neuron networks. They belong to 10 speakers (5 males and 5 females) from different Chinese provinces; each speaker uttered each name 3 times. The test set had a total of 539 utterances from another 4 speakers, who uttered the 15 names arbitrarily. The tests made to evaluate the recognition system were carried out for r from 0.5 to 0.95 with a step increment of 0.05. The experimental results at different values of r are shown in Table 1. Obviously, the networks were able to achieve full recognition of the training set at any r. From the experiments it was found that r = 0.5 achieved nearly the same recognition rate as the basic algorithm, while far fewer MWNs are used in the networks than in the basic algorithm.
Table 2. Experimental results of the BPR basic algorithm
Experiments were also carried out to evaluate continuous-density hidden Markov models (CDHMM), dynamic time warping (DTW), and biomimetic pattern recognition (BPR) for speech recognition, emphasizing the performance of each method across decreasing amounts of training samples as well as the required training time. The CDHMM system was implemented with 5 states per word; the Viterbi algorithm and Baum-Welch re-estimation are used for training and recognition. The reference templates for the DTW system are the training samples themselves.
Both the CDHMM and DTW technique are implemented using the programs in Ref.[11].We give in Table 2 the experiment results comparison of BPR Basic algorithm ,Dynamic time warping (DTW)and Hidden Markov models (HMMs) method .The HMMs system was based on Continuous density hidden Markov models(CDHMMs),and was implemented with 5 states per name.VI.Conclusions and AcknowledgmentsIn this paper, A mandarin continuous speech recognition system based on BPR is established.Besides,a training samples selection method is also used to reduce the networks scales. As a new general purpose theoretical model of pattern Recognition,BPR could be used in speech recognition too, and the experiment results show that it achieved a higher performance than HMM s and DTW.References[1]WangShou-jue,“Blomimetic (Topological) pattern recognit ion-A new model of pattern recognition theoryand its application”,Acta Electronics Sinica,(inChinese),Vo1.30,No.10,PP.1417-1420,2002.[2]WangShoujue,ChenXu,“Blomimetic (Topological) pattern recognition-A new model of patternrecognition theory and its app lication”, Neural Networks,2003.Proceedings of the International Joint Conference on Neural Networks,Vol.3,PP.2258-2262,July 20-24,2003.[3]WangShoujue,ZhaoXingtao,“Biomimetic pattern recognition theory and its applications”,Chinese Journalof Electronics,V0l.13,No.3,pp.373-377,2004.[4]Xu Jian.LiWeijun et a1,“Architecture research and hardware implementation on simplified neuralcomputing system for face identification”,Neuarf Networks,2003.Proceedings of the Intern atonal Joint Conference on Neural Networks,Vol.2,PP.948-952,July 20-24 2003.[5]Wang Zhihai,Mo Huayi et al,“A method of biomimetic pattern recognition for face recognition”,Neural Networks,2003.Proceedings of the International Joint Conference on Neural Networks,Vol.3,pp.2216-2221,20-24 July 2003.[6]WangShoujue,WangLiyan et a1,“A General Purpose Neuron Processor with Digital-Analog Processing”,Chinese Journal of Electornics,Vol.3,No.4,pp.73-75,1994.[7]Wang Shoujue,LiZhaozhou et a1,“Discussion on the basic mathematical models of neurons in gen eralpurpose neuro-computer”,Acta Electronics Sinica(in Chinese),Vo1.29,No.5,pp.577-580,2001.[8]WangShoujue,Wang Bainan,“Analysis and theory of high-dimension space geometry of artificial neuralnetworks”,Acta Electronics Sinica (in Chinese),Vo1.30,No.1,pp.1-4,2001.[9]WangShoujue,Xujian et a1,“Multi-camera human-face personal identiifcation system based on thebiomimetic pattern recognition”,Acta Electronics Sinica (in Chinese),Vo1.31,No.1,pp.1-3,2003.[10]Ryszard Engelking,Dimension Theory,PWN-Polish Scientiifc Publishers—Warszawa,1978.[11]QiangHe,YingHe,Matlab Porgramming,Tsinghua University Press,2002.中文翻译:电子学报2006年7月15卷第3期基于仿生模式识别的非特定人连续语音识别系统王守觉秦虹(中国,北京100083,中科院半导体研究所人工神经网络实验室)摘要:在非特定人语音识别中,隐马尔科夫模型(HMMs)是使用最多的技术,但是它的不足之处在于:不仅需要更多的训练样本,而且训练的时间也很长。
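As an illustration only (not the authors' implementation), the following sketch mimics the BPR decision rule F(X) = min(Y_1, ..., Y_m) described in Section II: each class is covered by simple neurons built from consecutive training samples, with the distance to the segment between two samples standing in for the paper's computation function Φ, and a test vector is accepted by a class when it falls inside that class's coverage radius.

import numpy as np

def segment_distance(x, p, q):
    """Distance from x to the segment [p, q]; a simple stand-in for Phi."""
    d = q - p
    t = 0.0 if np.allclose(d, 0) else np.clip(np.dot(x - p, d) / np.dot(d, d), 0.0, 1.0)
    return np.linalg.norm(x - (p + t * d))

def bpr_score(x, samples):
    """F(X) = min_i Y_i over neurons built from consecutive training samples."""
    return min(segment_distance(x, samples[i], samples[i + 1])
               for i in range(len(samples) - 1))

def classify(x, class_samples, radius):
    """Assign x to the class whose coverage it falls inside (score <= radius)."""
    scores = {label: bpr_score(x, s) for label, s in class_samples.items()}
    label, score = min(scores.items(), key=lambda kv: kv[1])
    return label if score <= radius else None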

Speaker Recognition

Frames of the speech signal
Speech signal pre-processing
Hamming window
W(n) = 0.54 − 0.46 cos( 2πn / (N − 1) ) for 0 ≤ n ≤ N − 1, and W(n) = 0 otherwise
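A small NumPy sketch of building the Hamming window above and applying it to one frame; the frame length N is an illustrative choice.

import numpy as np

N = 256                                             # frame length in samples
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))   # Hamming window, as above
# frame = speech_samples[start:start + N]           # one frame of the signal
# windowed = frame * w                              # apply the window
# np.allclose(w, np.hamming(N)) is True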
Normalization
Pre-emphasis
Lyapunov exponent algorithm (Liapunov exponents)
For a discrete speech signal x_0, x_1, x_2, ..., x_n, take any sample x_i (0 ≤ i ≤ n) and search for the sample x_j (0 ≤ j ≤ n) whose value is closest to it. Compute the initial distance d_ij = | x_i − x_j |, and observe the distance after a time t has elapsed: dt_ij = | x_{i+t} − x_{j+t} |. The Lyapunov exponent of x_i is then λ_i = (1/t) ln | dt_ij / d_ij |.
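A rough NumPy sketch of the Lyapunov-exponent estimate described above; the function and variable names are illustrative.

import numpy as np

def lyapunov_exponent(x, i, t):
    """Estimate the Lyapunov exponent at sample i of the 1-D signal x.
    Assumes i + t < len(x)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    candidates = [j for j in range(n - t) if j != i]
    j = min(candidates, key=lambda k: abs(x[i] - x[k]))   # nearest neighbour in value
    d0 = abs(x[i] - x[j]) + 1e-12                         # initial distance d_ij
    dt = abs(x[i + t] - x[j + t]) + 1e-12                 # distance after time t
    return np.log(dt / d0) / t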

Speech input modes: the text-dependent mode and the text-independent mode
Speaker recognition flow: input speech signal → speech signal sampling → speech signal pre-processing → speech signal feature extraction → speaker recognition classifier → classification.
Operating modes: speaker identification mode and speaker verification (matching) mode.
Enrollment: the enrollee's speech features are stored in a speech-feature database.
2. Compute the linear predictive coding (LPC) coefficients
Levinson-Durbin algorithm, for 1 ≤ i ≤ P:
    k_i = [ r(i) − Σ_{j=1}^{i−1} a_j^(i−1) r(i−j) ] / E(i−1)
    a_i^(i) = k_i
    for j = 1 : i−1
        a_j^(i) = a_j^(i−1) − k_i a_{i−j}^(i−1)
    end
    E(i) = (1 − k_i^2) E(i−1)
end
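A compact Python sketch of the Levinson-Durbin recursion above, assuming r is the autocorrelation sequence r(0), ..., r(P) of one analysis frame.

import numpy as np

def levinson_durbin(r, P):
    """Solve for LPC coefficients a[1..P] from autocorrelations r[0..P]."""
    a = np.zeros(P + 1)
    E = r[0]                                              # prediction error E(0)
    for i in range(1, P + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E    # reflection coefficient k_i
        a_prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = a_prev[j] - k * a_prev[i - j]          # a_j(i) = a_j(i-1) - k_i a_{i-j}(i-1)
        E = (1.0 - k * k) * E                             # E(i) = (1 - k_i^2) E(i-1)
    return a[1:], E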
Output of the band-pass filters

Research on i-vector Speaker Recognition Based on Total Variability Subspace Adaptation

短文自动化学报第40卷第8期2014年8月Brief Paper ACTA AUTOMATICA SINICA Vol.40,No.8August,2014基于总体变化子空间自适应的i-vector说话人识别系统研究栗志意1张卫强1何亮1刘加1摘要在说话人识别研究中,基于身份认证矢量(identity vector, i-vector)的子空间建模被证明是目前最前沿最有效的说话人建模技术,其中如何有效准确地估计总体变化子空间矩阵T成为影响系统性能好坏的关键问题.本文针对i-vector技术如何在新的应用环境下进行总体变化子空间矩阵T的自适应估计问题进行了研究,并提出了两种行之有效的自适应估计算法.在由美国国家标准技术局(American National Institute of Standard and Technology,NIST)组织的2008年说话人识别核心评测数据库以及自行采集的测试数据库上的实验结果显示,不论采用测试集数据本身还是与测试集较匹配的开发集数据,通过本文所提的自适应算法来更新总体变化子空间矩阵均可以使更新后的子空间更有利于新测试数据下的低维子空间描述,在新的测试环境下都更有利于说话人分类.此外实验结果还表明基于多子空间拼接的子空间自适应方法性能明显优于迭代自适应方法,而且两者的结合可达到最优的识别性能,且此时利用开发集数据进行自适应可以接近其利用测试集数据进行自适应得到的最优性能.关键词身份认证矢量,总体变化子空间,自适应,说话人识别引用格式栗志意,张卫强,何亮,刘加.基于总体变化子空间自适应的i-vector说话人识别系统研究.自动化学报,2014,40(8):1836−1840 DOI10.3724/SP.J.1004.2014.01836Total Variability Subspace Adaptation Based Speaker RecognitionLI Zhi-Yi1ZHANG Wei-Qiang1HE Liang1LIU Jia1Abstract In text-independent speaker recognition,the iden-tity vector(i-vector)based modeling method has recently been proved to be the most popular and efficient method.It is a key problem to estimate the total variability subspace T effi-ciently and accurately.In this paper,two adaptation algorithms are proposed in order to improve the performance of the i-vector base system in practical environments.Experiments on the2008 core speaker recognition evaluation dataset of American NIST and Technology and the self-collected speaker recognition eval-uation dataset demonstrate that using the proposed adaptation algorithms to adapt to the total variability subspace T from ei-ther the test dataset or the developing dataset is effective for improving the performance.In addition,the combination of the two adaptation algorithms can achieve almost the best perfor-mance using the developing dataset rather than the test dataset. Key words i-vector,total variability subspace,adaptation, speaker recognitionCitation Li Zhi-Yi,Zhang Wei-Qiang,He Liang,Liu Jia.To-tal variability subspace adaptation based speaker recognition. 
Acta Automatica Sinica,2014,40(8):1836−1840收稿日期2013-11-13录用日期2013-11-23Manuscript received November13,2013;accepted November23, 2013本文责任编委吴玺宏Recommended by Associate Editor WU Xi-Hong国家自然科学基金(61370034,61273268,61005019,90920302),北京市自然科学基金项目(KZ201110005005)资助说话人识别技术是指利用从说话人语音信号中提取出的声纹特征进行辨识或确认说话人身份的一项技术.作为一项重要的生物特征身份鉴定技术,该技术可广泛应用于国家安全、司法鉴定、语音拨号、电话银行等诸多领域[1].近几年来,以身份认证矢量i-vector为基础的说话人建模技术取得了非常大的成功,使得说话人识别系统的性能有了非常显著的提升[2−3],在由美国国家标准技术局(Ameri-can National Institute of Standards and Technology,NIST)组织的国际说话人评测中,基于该技术的说话人识别系统的性能明显优于之前广泛采用的高斯混合模型超矢量-支持矢量机(Gaussian mixture model super vector-support vector machine,GSV-SVM)[4]、联合因子分析(Joint factor analy-sis,JFA)[5−6],成为目前占主导地位的说话人识别系统.基于身份认证矢量i-vector的说话人建模方法与先前的GSV-SVM、JFA建模方法一样,都是基于高斯混合模型-通用背景模型(Gaussian mixture model-universal background model,GMM-UBM)[7],其基本思想是假设说话人信息以及信道信息同时处于高斯混合模型高维均值超矢量空间中的一个低维线性子空间流形结构中,如式(1)所示.M=m+Tw w(1)其中,M表示高斯混合模型均值超矢量,m表示一个与特定说话人和信道都无关的超矢量.而总体变化子空间矩阵T完成从高维空间到低维空间的映射,从而使得降维后的矢量更有利于进一步地分类和识别.因此该建模过程中,首先通过因子分析的方法,训练得到矩阵T,然后再将高维的高斯混合模型均值超矢量在该子空间上进行投影,得到低维的总体变化因子矢量,也称之为身份认证矢量i-vector,最后将得到的低维身份认证矢量i-vector进行线性鉴别性分析(Linear discriminate analysis,LDA)降维和类内协方差归一化(Within class covariance normalization,WCCN).前者线性鉴别性分析技术LDA的目的在于在满足最小化类内说话人距离和最大化类间说话人距离的鉴别性优化准则下进一步降低i-vector的维数,后者类内协方差归一化技术WCCN的目的是通过白化协方差矩阵,使得变换后的子空间的基尽可能正交.最终将经过LDA和WCCN变换过后的i-vector矢量,作为输入特征送入后续的分类器进行分类判决.经典的i-vector分类器包括余弦距离打分(Cosine distance scoring,CDS)分类器和SVM分类器[8]等,本文采用与文献[2]中一致的余弦距离打分CDS分类器.从i-vector建模方法的基本假设中可以看出,如何准确地估计总体变化子空间T是一个非常基础性且关键性的环节.准确的估计子空间T意味着投影后的低维i-vector矢量对于说话人和信道信息描述的更具有区分性,更有利于进一步的分类.在近年来由NIST组织的说话人评测中,我们发现i-vector系统的性能一致性优于GSV-SVM系统,其中一个因素在于NIST评测数据库的数据资源充足,数据质量较好,数据来源比较一致,因此会对训练矩阵带来很多利好条件.而在实际应用部署i-vector系统的过程中发现,Supported by National Natural Science Foundation of China (61370034,61273268,61005019,90920302)and Beijing Natural Sci-ence Foundation(KZ201110005005)1.清华大学电子工程系清华信息与科学技术国家实验室北京1000841.Tsinghua National Laboratory for Information Science and Tech-nology,Department of Electronic Engineering,Tsinghua University, Beijing1000848期栗志意等:基于总体变化子空间自适应的i-vector说话人识别系统研究1837由于实际测试场景复杂,数据资源短缺,数据质量较差,常导致子空间估计的不甚稳健[9],使得i-vector说话人系统在实际部署测试时的识别性能在很多情况下反而没有GSV-SVM 系统稳健.由于i-vector建模方法的核心和基础是对总体变化子空间T矩阵的估计,因此本文针对i-vector说话人识别系统在子空间自适应估计问题进行了深入研究,在此基础上提出了两种行之有效的子空间自适应算法,并通过实验对比给出了最优的自适应策略.本文安排如下:第1节介绍了i-vector说话人系统中总体变化子空间T矩阵的估计过程;第2节介绍了i-vector说话人系统中模型的训练和测试过程;第3节介绍了本文提出的两种总体变化子空间矩阵T的自适应算法,并给出所对应的系统框图;第4节给出本文所提算法在测试数据集上的实验结果和分析;最后在第5节给出总结和结论.1总体变化子空间T矩阵估计1.1统计量估计在i-vector系统总体变化子空间T的估计过程中,由于高斯混合模型均值超矢量是通过计算声学特征相对于通用背景模型UBM均值超矢量的零阶、一阶和二阶统计量得到的,因此本小节将首先给出声学特征各阶统计量的估计过程.为了估计各阶统计量,需要首先利用一些训练数据通过期望最大化(Expectation maximum,EM)算法训练得到通用背景模型UBM,该模型提供了一个统一的参考坐标空间,并且可以在一定程度上解决由于说话人训练数据较少导致的小样本问题.而高斯混合模型则可通过训练数据在该UBM上面进行最大后验概率(Maximum a posterior,MAP)自适应得到.各阶统计量的估计过程如下所示,假设说话人s的声学特征表示为x s,t,则其相对于UBM均值超矢量m的零阶统计量N c,s,一阶统计量F c,s以及二阶统计量S c,s可如式(2)所示.N c,s=tγc,s,tF c,s=tγc,s,t(x s,t−m c)S c,s=diag{tγc,s,t(x s,t−m c)(x s,t−m c)T}(2)式中m c代表UBM均值超矢量m中的第c个高斯均值分量.t表示时间帧索引.γc,s,t表示UBM第c个高斯分量的后验概率.diag{·}表示取对角运算.假设单高斯模型的维数为F,则将所有C个高斯模型的均值矢量拼接成的高维均值超矢量维数为F C.1.2子空间T估计对于上述得到的各阶统计量,子空间T的估计可以采用如下的期望最大化(Expectation maximum,EM)算法得到,首先随机初始化子空间矩阵T,然后固定T,在最大似然准则下估计隐变量w的一阶和二阶统计量,估计过程如式(3)所示.其中超矢量F s是由F c,s矢量拼接成的F C×1维的矢量.N s是由N c,s作为主对角元拼接成的F C×F C维的矩阵.L s=I+T TΣ−1N s TE[w s]=L−1s T TΣ−1F sE[w s w T s]=E[w s]E[w T s]+L−1s(3)式中L s是临时变量,Σ是UBM的协方差矩阵.接着更新T矩阵和协方差矩阵Σ.T矩阵的更新过程可利用式(4)来实现,也可根据文献[10]中的快速算法来实现.sN s T E[w s w T s]=sF s E[w s](4)对UBM协方差矩阵Σ的更新过程如式(5)所示.Σ=N−1sS s−N−1diag{sF s E[w T s]T T}(5)式中S s是由S c,s进行矩阵对角拼接成的F C×F C维的矩阵,N=N s为所有说话人的零阶统计量之和.对于上述步骤反复进行迭代6∼8次后,可近似认为T 
和Σ收敛.2i-vector模型训练和测试本文对于i-vector模型的训练和测试采用与文献[2]一致的过程,即首先通过LDA和WCCN对于上述子空间投影后的i-vector矢量进行进一步的鉴别性降维和白化,然后利用余弦距离打分对处理后的i-vector进行最终打分和判决.2.1线性鉴别性分析线性鉴别性分析[11](Linear discriminant analysis, LDA)是模式识别领域广泛采用的一种鉴别性降维技术.在基于身份认证矢量i-vector的说话人系统中,由于前述基于因子分析的方法没有采用鉴别性的准则,因此通常首先采用LDA对因子分析后的i-vector矢量进行鉴别性降维.训练LDA矩阵的过程主要通过优化如式(6)所示的目标函数,即在最小化类内说话人距离和最大化类间说话人距离的鉴别性准则下求该目标函数的最优解.J(w)=w T S B ww T S W w(6)式(6)中类间协方差矩阵S B和类内协方差矩阵S W的计算过程如式(7)和式(8)所示.S B=Ss=1(w s−w)(w s−w)T(7) S W=Ss=11n sn si=1(w s i−w s)(w s i−w s)T(8)其中w s=(1/n s)n si=1w s i代表第s个说话人的i-vector 均值矢量.S表示说话人的个数,n s表示第s个说话人的i-vector段数.最终求解式(6)中最优化目标函数的过程可转换为求解如式(9)所示广义特征值的过程.S B w=λS W w(9)2.2类内方差归一化类内方差归一化[12](Within class covariance normal-ization,WCCN)通过白化说话人因子使得变换后说话人子空间的基尽可能正交.WCCN矩阵可由式(10)估计.W=1SSs=11n sn si=1(w s i−w s)(w s i−w s)T(10)1838自动化学报40卷其中w s =(1/n s ) n s i =1w s i 代表第s 个说话人的i-vector 均值矢量.S 表示说话人个数,n s 表示第s 个说话人的i-vector 段数.2.3余弦距离打分余弦距离打分[2]是一种对称式的核函数分类器,即说话人模型i-vector 矢量与测试段i-vector 矢量交换后打分结果不变.在对i-vector 分类时,该分类器将说话人矢量w tar 和测试段矢量w tst 的余弦距离分数直接作为判决分数,并与阈值θ进行比较,给出判决结果,如式(11)所示.score (w tar ,w tst )=w tar ,w tst||w tar ||·||w tst ||θ(11)本质上讲,该分类器通过归一化矢量的模去除了矢量幅度的影响,将两个矢量的余弦角度作为分类的依据,计算简单快捷,而且该分类器可以通过合并计算,加快后续分数归一化的过程.3总体变化子空间T 的自适应算法与联合因子分析JFA 子空间建模方法不同,i-vector 子空间建模过程中T 矩阵的估计不需要任何标签信息,属于无监督训练过程,这为在实际应用中通过大量的无标注数据来自适应估计T 矩阵来提高子空间估计的性能提供了可能.在实际应用中,为了在新测试环境下对子空间T 矩阵进行稳健地估计,本文源于不同的出发点,提出了两种子空间自适应算法,并提出了两者结合的自适应算法.3.1迭代自适应算法该算法的基本思想类似于高斯混合模型-通用背景模型的思想,首先离线利用已有的训练数据来按照1.2中的算法流程估计一个与测试条件不相关的子空间矩阵T o ,称之为通用子空间矩阵,然后从通用子空间矩阵T o 上进行子空间自适应,完成子空间的迁移变换,如图1所示.图1总体变化子空间T 的迭代自适应算法示意图Fig.1Total variability diagram of total variability subspace Titeration adaptation algorithm具体实施过程中,首先通过已有的训练数据得到通用子空间矩阵T o ,然后将该矩阵作为初始化种子,利用EM 算法在新的测试数据集上进行迭代自适应.该迭代过程与1中的迭代过程类似,其自适应过程如算法1所示.算法1.总体变化子空间T 的迭代自适应算法步骤1.用已有数据训练得到通用子空间变化矩阵T o 和UBM 协方差矩阵Σ,作为初始化种子;步骤2.根据T ,计算临时变量L ,并估计总体变化矢量w 的一阶统计量E[w s ]和二阶统计量E[w s w T s ];步骤3.根据步骤2中的统计量对T 矩阵进行更新,如式(4)所示;步骤4.根据步骤2和步骤3中结果更新UBM 协方差矩阵Σ,更新过程如式(5)所示;步骤5.未达到迭代次数则返回步骤2中继续;否则结束退出.3.2拼接自适应算法该算法的基本思想源于文献中对联合因子分析建模技术中信道空间拼接的思想[13−14],在联合因子分析建模过程中,通过拼接信道子空间,从而可以更有效地去掉信道因子分量.而在i-vector 建模过程中,我们认为在利用原始训练数据和新环境下的数据集可以分别训练得到两个反映不同角度的总体变化子空间,意味着通过在这两个总体变化子空间构成的联合子空间中对高维矢量进行投影,可以从不同角度来反映低维的总体变化因子.图2总体变化子空间T 拼接自适应算法示意图Fig.2Diagram of total variability subspace Tcombination adaptation algorithm该算法如图2所示,首先通过原始训练数据和新数据集利用第1.2中的估计算法分别训练得到两个不同的总体变化子空间矩阵T o 和T n ,然后将两个子空间进行拼接,得到自适应后的子空间矩阵.该算法流程如下所示:算法2.总体变化子空间T 拼接自适应算法步骤1.利用原始数据训练得到子空间变化矩阵T o ;步骤2.利用新的测试数据训练得到子空间变化矩阵T n ;步骤3.拼接T o 和T n 得到最终的自适应子空间T .3.3两者结合的自适应算法针对上述提出的两种子空间自适应算法,我们进一步提出了两者相结合的自适应算法,该结合算法可以实现之前两种算法的有效互补,更有利于总体变化因子在低维子空间中的表示.图3给出了两者结合的自适应算法流程结构图.图3总体变化子空间T 的迭代自适应和拼接自适应结合的自适应算法示意图Fig.3Diagram of total variability subspace T integration algorithm of iteration adaptation and subspace combinationadaptation8期栗志意等:基于总体变化子空间自适应的i-vector说话人识别系统研究1839该算法如图3所示,首先通过已有的训练数据利用第1.2节中的估计算法训练得到总体变化子空间矩阵T o,然后再根据算法1在该子空间上对新数据集进行迭代得到T n,最后对两个子空间进行拼接,得到最终的子空间矩阵.该算法流程如下所示.算法3.总体变化子空间T拼接自适应算法步骤1.利用原始数据训练得到子空间变化矩阵T o;步骤2.利用迭代自适应算法1,对上述矩阵在新数据集上迭代得到子空间变化矩阵T n;步骤3.拼接T o和T n得到最终的自适应子空间T.3.4算法复杂度分析通过分析以上三种自适应算法的流程可以看出,与不进行自适应相比,每种自适应算法都需要增加额外的自适应时间,而具体的时间与自适应所用的数据量大小呈线性关系.此外,三种子空间自适应算法本身的时间复杂度是一样的,而不同点在于基于子空间拼接的自适应算法在后续低维投影上的时间和空间是第一种自适应算法的一倍左右,因此在实际应用中也需要根据系统实时率和响应需求来采用最合适的自适应算法.4实验配置与实验结果为了验证本文所提算法的性能,本文将针对所提子空间自适应i-vector说话人系统性能进行实验验证和结果对比,并分析得出结论.在实际部署中,不同的应用环境使得我们是否可以将新数据用于自适应也有所不同.比如离线测试环境或者说话人检索情况下,我们可以获得所有的测试数据,因此可以直接利用测试数据.而对于某些应用环境下,比如在线测试情况下,我们不能得到所有的测试集数据,但我们可以退而求其次,提前获得一些接近该测试集数据的开发集数据.4.1实验配置本文实验中采用两个数据集,一个是NIST发布的SRE 2008的核心数据集,原始训练子空间数据来自Switchboard I和II约20000条,这些数据同样用于训练UBM模型, 
ZTnorm伙伴集以及训练LDA和WCCN矩阵,新数据集包括新的说话人训练数据集以及12922条的测试数据集,以及3000条左右的开发集.另一个是自行采集的数据集,原始训练子空间数据和新数据来源于不同的地域和采集卡.原始数据采集了约20000条,这些数据用于训练UBM模型, ZTnorm伙伴集以及训练LDA和WCCN矩阵,新数据集包括新的说话人训练数据集以及8000条的测试数据集,以及2000条左右的开发集.实验中采用Mel频率倒谱特征(Mel-frequency cepstral coefficients,MFCC)作为声学层特征.在预处理阶段采用G.723.1进行有效语音端点检测(Voice activity detection, VAD)以及采用倒谱均值减(Cepstral mean subtraction, CMS)技术来去除或抑制信道的卷积噪声,并设置了3s窗长用于特征弯折(Feature warping),进行了25%低能量删减以及预加重(因子为0.95).在上述预处理基础上,首先提取13维基本特征,并与一阶、二阶差分特征一起构成最终的39维MFCC声学层特征.实验中使用UBM的混合数设置为1024,高斯概率密度的方差采用对角阵.i-vector中总体变化子空间矩阵T的子空间维数也即列数均设置为400,训练时迭代次数取6次.LDA矩阵降维后的维数设置为200.4.2实验结果本文中衡量系统性能的指标采用等错误率(Equal error rate,EER)和最小检测代价函数(Minimum detection cost function,MinDCF).表1和表2给出了在两个数据集上由原始数据集训练得到子空间T和迭代自适应算法训练T在新开发集数据以及新测试集数据上的性能比较结果.可以看出,不管利用新开发集还是新测试集的数据进行自适应训练,均可以较之有性能提升.从表中数据还可以看出,在实际应用中,虽然开发集数据较之原始数据更接近测试集数据,但是由于开发集数据的获取有限,所以采用开发集数据进行迭代自适应获得的性能提升也有限.而由于测试集数据本身的匹配程度最高,因此可以得到最好的自适应性能.因此在实际应用中应该首先选择利用测试集数据本身来进行子空间T的自适应.在某些在线测试应用下,若无法利用测试集数据,也可以考虑采用开发集数据来做自适应.表1原始数据训练T与本文所提迭代自适应T在NIST SRE2008核心数据集上的性能比较Table1Performance comparison of baseline training T algorithm and the proposed iteration adaptation T algorithmon NIST SRE2008core dataset算法EER MinDCF 原始数据训练T 5.410.029新开发集数据自适应T 4.920.026新测试集数据自适应T 4.670.023表2原始数据训练T与本文所提迭代自适应T在自行采集数据集上的性能比较Table2Performance comparison of baseline training T algorithm and the proposed iteration adaptation T algorithmon actual application dataset算法EER MinDCF 原始数据训练T 3.000.014新开发集数据自适应T 2.990.013新测试集数据自适应T 2.000.011表3原始数据训练T与本文所提迭代自适应和子空间拼接自适应相结合的自适应算法在NIST SRE2008核心数据集上的性能比较Table3Performance comparison of baseline training T algorithm and the proposed integration algorithm of iteration adaptation and subspace combination adaptation on NISTSRE2008core dataset算法EER MinDCF 原始数据训练T 5.410.029新开发集数据自适应T 4.010.021新测试集数据自适应T 3.890.020从表1和表2中的实验结果可以看出,迭代自适应在两个数据集上均可以一致性地提高系统的性能.因此接下来直接对第3.3节中所提出的迭代自适应与拼接自适应相结合的自适应算法上进行实验比较.如表3和表4所示实验结果所示,通过与空间拼接自适应相结合,识别性能有更进一步的改善.且此时利用开发集数据进行自适应可以接近其利用测试集数据进行自适应得到的最优性能.因此实际应用中,如1840自动化学报40卷果在可表4原始数据训练T与本文所提迭代自适应和子空间拼接自适应相结合的自适应算法在自行采集数据集上性能比较Table4Performance comparison of baseline training T algorithm and the proposed integration algorithm of iteration adaptation and subspace combination adaptation on actualapplication dataset算法EER MinDCF原始数据训练T 3.000.014新开发集数据自适应T 1.990.012新测试集数据自适应T 1.990.010以获得测试集数据情况下,利用测试集数据进行自适应可以取得最优的自适应效果.当测试集数据不可用于训练子空间的情况下,可以退而求其次,利用与测试集较为匹配的开发集,可以取得同样不错的性能.这样也为我们在实际应用环境中,如何有效地通过自适应来提高i-vector说话人识别系统的性能提供了参考依据.5讨论与结论从基于身份认证矢量i-vector建模的说话人识别的原理假设来看,有效准确的估计子空间总体变化矩阵T是一个基本性和关键性的问题,会直接影响系统识别性能的好坏,同时也是影响该建模技术在实际应用中稳健性的关键问题.本文针对i-vector技术如何在实际应用中根据新数据来自适应子空间矩阵T进行了深入研究,提出几种切实可行的自适应估计算法,并针对不同的测试条件下给出了最优的自适应策略.本文所提算法在NIST SRE2008核心测试数据集和自行采集的测试数据库上的实验结果均显示,不论采用测试集本身还是与测试集较匹配的开发集数据,通过本文所提的自适应算法均可以使更新后的子空间更有利于新测试数据下的低维子空间描述,从而更有利于说话人识别.此外实验结果还表明基于多子空间拼接的自适应算法的性能明显优于迭代自适应算法,而且两者的结合可或得到最优的系统性能,且此时利用开发集数据进行自适应可以接近其利用测试集数据进行自适应得到的性能提升,这样为在实际应用环境下如何有效地通过子空间自适应来提高i-vector说话人识别系统的性能提供了重要的参考依据.References1Kinnunen T,Li H Z.An overview of text-independent speaker recognition:from features to supervectors.Speech Communication,2010,52(1):12−402Dehak N,Kenny P,Ouellet P,Dumouchel P.Front-end factor analysis for speaker verification.IEEE Transactions on Audio,Speech and Language Processing,2011,19(4): 788−7983Li Zhi-Yi,He Liang,Zhang Wei-Qiang,Liu Jia.Speaker recognition based on discriminant i-vector local distance pre-serving projection.Journal of Tsinghua University(Science and Technology),2012,52(5):598−601(栗志意,何亮,张卫强,刘加.基于鉴别性i-vector局部距离保持映射的说话人识别.清华大学学报(自然科学版),2012,52(5): 598−601)4Campbell W M,Campbell J 
P,Reynolds D A,Singer E,Torres-Carrasquillo P A.Support vector machines for speaker and language puter Speech and Language,2006,20(2−3):210−2295Kenny P,Boulianne G,Ouellet P,Dumouchel P.Speaker and session variability in GMM-based speaker verification.IEEE Transactions on Audio,Speech and Language Processing, 2007,15(4):1448−14606Kenny P,Boulianne G,Ouellet P,Dumouchel P.Joint factor analysis versus eigenchannels in speaker recognition.IEEE Transactions on Audio,Speech and Language Processing, 2007,15(4):1435−14477Reynolds D A,Quatieri T F,Dunn R B.Speaker verifica-tion using adapted Gaussian mixture models.Digital Signal Processing,2000,10(1−3):19−418Cortes C,Vapnik V.Support vector networks.Machine Learning,1995,20(3):273−2979Zhang Wen-Lin,Zhang Wei-Qiang,Liu Jia,Li Bi-Cheng, Qu Dan.A new subspace based speaker adaptation method.Acta Automatica Sinica,2011,37(12):1495−1502(张文林,张卫强,刘加,李弼程,屈丹.一种新的基于子空间的说话人自适应方法.自动化学报,2011,37(12):1495−1502)10Kenny P,Boulianne G,Dumouchel P.Eigenvoice model-ing with sparse training data.IEEE Transactions on Audio, Speech,and Language Processing,2005,13(3):345−354 11Bishop C M.Pattern Recognition and Machine Learning.Berlin:Springer,200812Hatch A O,Kajarekar S,Stolcke A.Within-class covari-ance normalization for SVM-based speaker recognition.In: Proceedings of the International Conference on Spoken Lan-guage Processing.Pittsburgh,PA,2006.1471−147413He Liang,Shi Yong-Zhe,Liu Jia.Eigenchannel space com-bination method of joint factor analysis Acta Automatica Sinica,2011,37(7):849−856(何亮,史永哲,刘加.联合因子分析中的本征信道空间拼接方法.自动化学报,2011,37(7):849−856)14Guo Wu,Li Yi-Jie,Dai Li-Rong,Wang Ren-Hua.Factor analysis and space assembling in speaker recognition.Acta Automatica Sinica,2009,35(9):1193−1198(郭武,李轶杰,戴礼荣,王仁华.说话人识别中的因子分析以及空间拼接.自动化学报,2009,35(9):1193−1198)栗志意清华大学电子工程系博士研究生.主要研究方向为说话人识别与语种识别.本文通信作者.E-mail:************************(LI Zhi-Yi Ph.D.candidate in the Department of Electronic Engineering,Tsinghua University.His research interest covers speaker recognition and language recognition.Corresponding author of this paper.)张卫强清华大学电子工程系助理研究员.主要研究方向为说话人识别与语种识别.E-mail:********************.cn(ZHANG Wei-Qiang Assistant professor in the Department of Electronic Engineering,Tsinghua University.His research in-terest covers speaker recognition and language recognition.)何亮清华大学电子工程系助理研究员.主要研究方向为说话人识别与语种识别.E-mail:********************.cn(HE Liang Assistant professor in the Department of Elec-tronic Engineering,Tsinghua University.His research interest covers speaker recognition and language recognition.)刘加清华大学电子工程系教授.主要研究方向为语音识别和信号处理.E-mail:*****************.cn(LIU Jia Professor in the Department of Electronic Engineer-ing,Tsinghua University.His research interest covers speech recognition and signal processing.)。

Gaussian Mixture Model

Both nonparametric and parametric models are used in speaker recognition tasks. Nearest-neighbour and vector quantization models are the most common nonparametric models, while the Gaussian mixture model (GMM) is the representative parametric model and is widely used in speaker recognition tasks. The general structure of speaker recognition systems is described in Figure 1. The speaker recognition task falls under the general problem of pattern classification.
Figure 1. General structure of a speaker recognition system: speech frames are passed to feature extraction, which feeds training (building models for the reference speakers) and recognition, ending in a decision.
1. INTRODUCTION
Speaker recognition is the task of automatically recognizing who is speaking by identifying an unknown speaker among several reference speakers, using speaker-specific information contained in the speech wave [1]. A speaker recognition system is useful wherever speakers are unknown and their identities matter: it makes machine identification of participants in meetings, conferences, or conversations possible. The task can be text-independent or text-dependent. Text-independent means the recognition procedure should work for any text in either training or testing; this differs from text-dependent recognition, where the text in training and testing is the same or is known in advance. Speaker recognition can also be divided into two further categories, closed-set and open-set problems. The closed-set problem is to identify a speaker from a set of N known speakers, while the open-set problem is to decide whether the speaker of an unknown test utterance belongs to a set of N speakers. There are two basic tasks in speaker recognition: speaker identification and speaker verification. In speaker identification the system must decide the unknown speaker's identity among N reference speakers, whereas in speaker verification the system must decide whether the unknown speaker is who he or she claims to be; this is a binary decision problem (accept or reject). Speaker verification can also be viewed as a special case of the open-set problem.
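A hedged sketch of the GMM-based closed-set identification idea described above (assuming scikit-learn; the per-speaker feature matrices, e.g. MFCC frames, and all names are illustrative, not part of any system described here).

import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(train_features, n_components=16):
    """train_features: dict mapping speaker id -> (n_frames, n_dims) feature array."""
    models = {}
    for spk, feats in train_features.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(feats)
        models[spk] = gmm
    return models

def identify(test_features, models):
    """Closed-set identification: pick the speaker whose GMM gives the
    highest average per-frame log-likelihood on the test utterance."""
    scores = {spk: gmm.score(test_features) for spk, gmm in models.items()}
    return max(scores, key=scores.get)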

New Era Vocational English: Artificial Intelligence English Teaching Courseware (AI English) - Unit 2

START WITH AI
Person identification
JOURNEY WITH AI
• Manual Reading: the key's functions and specifications
• Passage Reading: to recognize voice and speech
• Comparative Reading: installation and features
Manual Reading
Words & Phrases
Words: echo n. 回声 | frequency n. 频率 | pixel n. 像素 | resolution n. 分辨率 | restore v. 恢复 | reset n. 恢复原位; 重新设置
Phrases: plug in 接上; 插上 | power up 使(机器)启动
2 Your smart speaker will say “Restored to factory settings”. Wait 10 seconds and remove the power cable from AC mains input (7).
3 Plug the power cable back in. Your smart speaker will display "First power-up" and then "Wi-Fi mode". It will then say: "Entering setup mode. Follow the instructions in the app to complete the installation."
4 Follow "Connect to a Wi-Fi network" to complete the setup.
Specifications: Model WM120; main unit dimensions 190 (H) x 110 (W) x 185 (D) mm; screen size 7 inches; weight 1.5 kg; camera 2 megapixels.

5 - Automation, Foreign Literature Translation: An Improved Speech Recognition Method for Intelligent Robots

Appendix 1: Translation of the foreign literature
An Improved Speech Recognition Method for Intelligent Robots
2. Overview of speech recognition
Recently, speech recognition has received more and more attention because of its great theoretical significance and practical value.

Up to now, most speech recognition has been based on conventional linear-system theory, for example hidden Markov models (HMM) and dynamic time warping (DTW).

As research on speech recognition has deepened, researchers have found that the speech signal is a complex nonlinear process; if speech recognition research is to achieve a breakthrough, nonlinear system-theoretic methods must be introduced.

Recently, with the development of nonlinear system theories such as artificial neural networks and chaos and fractals, it has become possible to apply these theories to speech recognition.

Therefore, this paper describes the speech recognition process on the basis of neural networks and chaos and fractal theory.

Speech recognition can be divided into speaker-independent and speaker-dependent types.

Speaker-dependent recognition means that the pronunciation model is trained by a single person; it recognizes that person's commands quickly, but recognizes other people's commands slowly or not at all.

Speaker-independent recognition means that the pronunciation model is trained by people of different ages, genders, and regions, so it can recognize the commands of a whole group of users.

In general, because users do not need to perform any training, speaker-independent systems are more widely used.

Therefore, in speaker-independent systems, extracting speech features from the speech signal is a fundamental problem of the speech recognition system.

Speech recognition includes training and recognition; it can be viewed as a pattern recognition task.

Usually, the speech signal can be regarded as a time series characterized by a hidden Markov model (HMM).

Through feature extraction, the speech signal is converted into feature vectors that serve as observations; in the training procedure, these observations are fed into the estimation of the HMM model parameters.

These parameters include the probability density functions of the observations for their corresponding states, the transition probabilities between states, and so on.

After parameter estimation, the trained model can be applied to the recognition task.

The input signal is then recognized as the words it contains, and the recognition accuracy can be evaluated.

The whole process is shown in Figure 1.
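As a rough illustration of the training-and-recognition procedure just described (assuming the hmmlearn library and one Gaussian HMM per word; the names and parameter values are illustrative):

import numpy as np
from hmmlearn import hmm

def train_word_models(word_features, n_states=5):
    """word_features: dict mapping word -> list of (n_frames, n_dims) observation arrays."""
    models = {}
    for word, utterances in word_features.items():
        X = np.vstack(utterances)
        lengths = [len(u) for u in utterances]
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)              # Baum-Welch estimation of the HMM parameters
        models[word] = model
    return models

def recognize(features, models):
    """Score the observation sequence against every word model and pick the best."""
    return max(models, key=lambda w: models[w].score(features))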

Figure 1. Block diagram of the speech recognition system
3. Theory and methods
Extracting speaker-independent features from the speech signal is a fundamental problem in speech recognition systems.

The most popular methods for solving this problem use linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC).

Both methods are based on an assumed linear process, namely the assumption that a speaker's vocal characteristics are produced by vocal-tract resonances.
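A brief sketch of extracting both kinds of coefficients, assuming the librosa library; the LPC-to-cepstrum recursion follows the standard textbook form and the file name is an illustrative assumption.

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)      # illustrative file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # Mel-frequency cepstral coefficients

frame = y[:400]                                      # one 25 ms frame at 16 kHz
p = 12
a = librosa.lpc(frame, order=p)                      # LPC error-filter coefficients [1, a1, ..., ap]
c = np.zeros(p + 1)                                  # LPC cepstral coefficients (LPCC)
for n in range(1, p + 1):
    c[n] = -a[n] + sum((k / n) * c[k] * -a[n - k] for k in range(1, n))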

Research on Kernel-Based IVEC-SVM Speaker Recognition

Speaker recognition is a technique that identifies or verifies a speaker's identity by extracting voiceprint features from the speaker's speech signal. As an important biometric identity-authentication technique, speaker recognition is now widely used in national security, forensic identification, voice dialing, telephone banking, and many other fields. In recent years, speaker modelling based on the Gaussian mixture model - universal background model (GMM-UBM) [1] has been very successful and has markedly improved the performance of speaker recognition systems [2-3].
Citation Li Zhi-Yi, Zhang Wei-Qiang, He Liang, Liu Jia. Speaker recognition with kernel based IVEC-SVM. Acta Automatica Sinica, 2014, 40(4): 780−784
Manuscript received September 12, 2012; accepted January 18, 2013. Recommended by Associate Editor ZONG Cheng-Qing. Supported by National Natural Science Foundation of China (61005019, 61273268, 90920302, 61370034). 1. Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
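As a hedged sketch of the IVEC-SVM idea named in the title (assuming scikit-learn, that i-vectors have already been extracted elsewhere, and a cosine kernel chosen purely for illustration; none of these choices are taken from the paper itself):

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import cosine_similarity

def train_ivec_svm(ivectors_train, labels):
    """ivectors_train: (n_utterances, dim) array; labels: speaker id per utterance."""
    K = cosine_similarity(ivectors_train)        # precomputed cosine kernel matrix
    clf = SVC(kernel="precomputed")
    clf.fit(K, labels)
    return clf

def predict_speakers(clf, ivectors_train, ivectors_test):
    """Score test i-vectors against the training set through the same kernel."""
    K_test = cosine_similarity(ivectors_test, ivectors_train)
    return clf.predict(K_test)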

Speech Recognition: Foreign Literature Translation

Speech RecognitionVictor Zue, Ron Cole, & Wayne WardMIT Laboratory for Computer Science, Cambridge, Massachusetts, USA Oregon Graduate Institute of Science & Technology, Portland, Oregon, USACarnegie Mellon University, Pittsburgh, Pennsylvania, USA1 Defining the ProblemSpeech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as commands & control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in section.Speech recognition systems can be characterized by many parameters, some of the more important of which are shown in Figure. An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies, and is much more difficult to recognize than speech read from script. Some systems require speaker enrollment---a user must provide samples of his or her speech before using them, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words.The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar.One popular measure of the difficulty of the task, combining the vocabulary size and the 1 language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (see section for a discussion of language modeling in general and perplexity in particular). Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and the placement of the microphone.Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the phoneme,At word boundaries, contextual variations can be quite dramatic---making gas shortage sound like gash shortage in American English, and devo andare sound like devandare in Italian.Second, acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.Figure shows the major components of a typical speech recognition system. 
The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, 2 typically once every 10--20 msec (see sectionsand 11.3 for signal representation and digital signal processing, respectively). These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.Speech recognition systems attempt to model the sources of variability described above in several ways. At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent features of the signal, and de-emphasize speaker-dependent characteristics. At theacoustic phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent acoustic models to those of the current speaker during system use, (see section). Effects of linguistic context at the acoustic phonetic level are typically handled by training separate models for phonemes in different contexts; this is called context dependent acoustic modeling.Word level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations of words, as well as effects of dialect and accent are handled by allowing search algorithms to find alternate paths of phonemes through these networks. Statistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words.The dominant recognition paradigm in the past fifteen years is known as hidden Markov models (HMM). An HMM is a doubly stochastic model, in which the generation of the underlying phoneme string and the frame-by-frame, surface acoustic realizations are both represented probabilistically as Markov processes, as discussed in sections,and 11.2. Neural networks have also been used to estimate the frame based scores; these scores are then integrated into HMM-based system architectures, in what has come to be known as hybrid systems, as described in section 11.5.An interesting feature of frame-based HMM systems is that speech segments are identified during the search process, rather than explicitly. An alternate approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance in several tasks.2 State of the ArtComments about the state-of-the-art need to be made in the context of specific applications which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary issmall, the entire word can be modeled as a single unit. Such an approach is not practical for large vocabularies, where word models must be built up from subword units.The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to drop by a factor of 2 every two years. Substantial progress has been made in the basic technology, leading to the lowering of barriers to speaker independence, continuous speech, and large vocabularies. 
There are several factors that have contributed to this rapid progress. First, there is the coming of age of the HMM. The HMM is powerful in that, given training data, the parameters of the model can be estimated automatically to give optimal performance.

Second, much effort has gone into the development of large speech corpora for system development, training, and testing. Some of these corpora are designed for acoustic-phonetic research, while others are highly task-specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing. These corpora permit researchers to quantify the acoustic cues important for phonetic contrasts and to determine parameters of the recognizers in a statistically meaningful way. While many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3 of the original survey) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency (ARPA) to spur human language technology development among its contractors, they have nevertheless gained worldwide acceptance (e.g., in Canada, France, Germany, Japan, and the U.K.) as standards on which to evaluate speech recognition.

Third, progress has been brought about by the establishment of standards for performance evaluation. Only a decade ago, researchers trained and tested their systems using locally collected data and had not been very careful in delineating training and testing sets. As a result, it was very difficult to compare performance across systems, and a system's performance typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, thus contributing to greater reliability in monitoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13 of the original survey, respectively).

Finally, advances in computer technology have also indirectly influenced progress. The availability of fast computers with inexpensive mass storage has enabled researchers to run many large-scale experiments in a short amount of time, so the elapsed time between an idea and its implementation and evaluation is greatly reduced. In fact, speech recognition systems with reasonable performance can now run in real time on high-end workstations without additional hardware---a feat unimaginable only a few years ago.

One of the most popular, and potentially most useful, tasks with low perplexity (PP=11) is the recognition of digits. For American English, speaker-independent recognition of digit strings spoken continuously and restricted to telephone bandwidth can achieve an error rate of 0.3% when the string length is known.

One of the best known moderate-perplexity tasks is the 1,000-word Resource Management (RM) task, in which inquiries can be made concerning various naval vessels in the Pacific Ocean. The best speaker-independent word error rate on the RM task is less than 4%, using a word-pair language model that constrains the possible words following a given word (PP=60). More recently, researchers have begun to address the issue of recognizing spontaneously generated speech.
For example, in the Air Travel Information Service (ATIS) domain, word error rates of less than 3% have been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15.

High-perplexity tasks with vocabularies of thousands of words are intended primarily for the dictation application. After working on isolated-word, speaker-dependent systems for many years, the community has since 1992 moved towards very-large-vocabulary (20,000 words and more), high-perplexity (PP≈200), speaker-independent, continuous speech recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences drawn from North American business news.

With the steady improvement in speech recognition performance, systems are now being deployed within telephone and cellular networks in many countries. Within the next few years, speech recognition will be pervasive in telephone networks around the world. There are tremendous forces driving the development of the technology; in many countries, touch-tone penetration is low, and voice is the only option for controlling automated services. In voice dialing, for example, users can dial 10--20 telephone numbers by voice (e.g., "call home") after having enrolled their voices by saying the words associated with the telephone numbers. AT&T, on the other hand, has installed a call-routing system using speaker-independent word-spotting technology that can detect a few key phrases (e.g., "person to person", "calling card") in sentences such as: "I want to charge it to my calling card."

At present, several very-large-vocabulary dictation systems are available for document generation. These systems generally require speakers to pause between words. Their performance can be further enhanced if one can apply the constraints of a specific domain, such as dictating medical reports.

Even though much progress is being made, machines are a long way from recognizing conversational speech. Word recognition rates on telephone conversations in the Switchboard corpus are around 50%. It will be many years before unlimited-vocabulary, speaker-independent continuous dictation capability is realized.

3 Future Directions

In 1992, the U.S. National Science Foundation sponsored a workshop to identify the key research challenges in the area of human language technology and the infrastructure needed to support the work. The key research challenges identified for speech recognition are summarized below.

Robustness: In a robust system, performance degrades gracefully (rather than catastrophically) as conditions become more different from those under which it was trained. Differences in channel characteristics and acoustic environment should receive particular attention.

Portability: Portability refers to the goal of rapidly designing, developing, and deploying systems for new applications. At present, systems tend to suffer significant degradation when moved to a new task. In order to return to peak performance, they must be trained on examples specific to the new task, which is time-consuming and expensive.

Adaptation: How can systems continuously adapt to changing conditions (new speakers, microphone, task, etc.) and improve through use? Such adaptation can occur at many levels in systems: subword models, word pronunciations, language models, etc.

Language Modeling: Current systems use statistical language models to help reduce the search space and resolve acoustic ambiguity.
As vocabulary size grows and other constraints are relaxed to create more habitable systems, it will be increasingly important to get as much constraint as possible from language models, perhaps by incorporating syntactic and semantic constraints that cannot be captured by purely statistical models. (A toy perplexity computation for a bigram language model is sketched just after this list.)

Confidence Measures: Most speech recognition systems assign scores to hypotheses for the purpose of rank ordering them. These scores do not provide a good indication of whether a hypothesis is correct or not, just that it is better than the other hypotheses. As we move to tasks that require actions, we need better methods to evaluate the absolute correctness of hypotheses.

Out-of-Vocabulary Words: Systems are designed for use with a particular set of words, but system users may not know exactly which words are in the system vocabulary. This leads to a certain percentage of out-of-vocabulary words in natural conditions. Systems must have some method of detecting such out-of-vocabulary words, or they will end up mapping a word from the vocabulary onto the unknown word, causing an error.

Spontaneous Speech: Systems that are deployed for real use must deal with a variety of spontaneous-speech phenomena, such as filled pauses, false starts, hesitations, ungrammatical constructions, and other common behaviors not found in read speech. Development on the ATIS task has resulted in progress in this area, but much work remains to be done.

Prosody: Prosody refers to acoustic structure that extends over several segments or words. Stress, intonation, and rhythm convey important information for word recognition and for the user's intentions (e.g., sarcasm, anger). Current systems do not capture prosodic structure. How to integrate prosodic information into the recognition architecture is a critical question that has not yet been answered.

Modeling Dynamics: Systems assume a sequence of input frames which are treated as if they were independent. But it is known that perceptual cues for words and phonemes require the integration of features that reflect the movements of the articulators, which are dynamic in nature. How to model dynamics and incorporate this information into recognition systems is an unsolved problem.
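The survey above leans on perplexity as its task-difficulty measure. The following toy sketch, built on an invented two-sentence corpus, shows one common way the quantity is computed for a bigram model, via the average negative log-probability of the test words; for a uniform branching factor this reduces to the geometric-mean definition given earlier. The smoothing scheme and the corpus are assumptions for the example only.

```python
import math
from collections import Counter

# Tiny invented corpus purely for illustration.
train = [["<s>", "show", "flights", "to", "boston", "</s>"],
         ["<s>", "show", "fares", "to", "denver", "</s>"]]
test = [["<s>", "show", "flights", "to", "denver", "</s>"]]

unigrams, bigrams, vocab = Counter(), Counter(), set()
for sent in train:
    vocab.update(sent)
    for w1, w2 in zip(sent, sent[1:]):
        unigrams[w1] += 1
        bigrams[(w1, w2)] += 1

def bigram_logprob(w1, w2, alpha=1.0):
    # Add-alpha smoothing so unseen bigrams still get non-zero probability.
    return math.log((bigrams[(w1, w2)] + alpha) /
                    (unigrams[w1] + alpha * len(vocab)))

log_prob, n_words = 0.0, 0
for sent in test:
    for w1, w2 in zip(sent, sent[1:]):
        log_prob += bigram_logprob(w1, w2)
        n_words += 1

perplexity = math.exp(-log_prob / n_words)
print(f"bigram test perplexity: {perplexity:.2f}")
```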


Preparing references with BibTeX: a good article on handling references in LaTeX that answers most of the basic questions.

LaTeX's treatment of references has the following advantages:

1. You can maintain a single .bib file throughout your research career, like a database in which every reference is a record described by a unique ID (for example MartinDSP00 below). A typical entry in my myreference.bib file looks like this:

   @article{MartinDSP00,
     author  = "A. Martin and M. Przybocki",
     title   = "The {NIST} 1999 speaker recognition evaluation --- an overview",
     journal = "Digital Signal Processing",
     volume  = "10",
     pages   = "1--18",
     year    = "2000",
   }

   The braces around {NIST} are not printed; they guarantee that NIST keeps its original capitalization in the generated bibliography instead of being lower-cased.

2. When you need to cite, add the following to the body of the document:

   \bibliographystyle{ieeetr}
   \bibliography{myreference}

   You can then cite papers from the database with \cite{}, for example \cite{MartinDSP00}. The first of the two lines controls the formatting of the bibliography, discussed later. The myreference.bib file should sit in the same directory as the main .tex file so that LaTeX can find it.

3. After compiling the document, an .aux file is generated; running bibtex in the same directory then generates the .bbl file, and compiling the document twice more produces the complete .dvi. In this way LaTeX guarantees that every reference cited with \cite is listed and automatically numbered.

说话人识别 (Speaker Recognition): textbook excerpts
Text-dependent: the speaker is required to utter given key words or key sentences as the training text, and at recognition time the same content must be spoken.

Text-independent: the content of the speech is not prescribed either during training or during recognition; the object of recognition is an unconstrained speech signal.

Text-prompted: at each recognition attempt the system selects a prompt from a very large text collection and asks the speaker to utter it; recognition and the decision are based on the speaker pronouncing the prompted text correctly, which prevents a recording of the speaker's voice from being misused.

6.9.4 Some practical issues in speaker recognition

2. Robust speaker recognition. Changes in the speaker's own psychological or physiological state, in the recording environment, or in the channel transmission characteristics can all cause the acoustic features of the speaker's voice to vary and thus lower the recognition rate of a speaker recognition system. Temporal variation of the recognition parameters is caused mainly by changes in the characteristics of the voice source; by separating the source from the vocal tract and using only the latter, one can build a speaker recognition system that withstands long-term variation of the voice.

Multi-threshold decision amounts to a sequential decision method: several thresholds are used to decide whether to accept or reject (a small sketch of such a sequential decision follows these excerpts).

6.9.5 Topics in speaker recognition that still need further exploration

6.10 Robust speech recognition

6.10.1 Overview. A speech recognition system trained in a relatively quiet laboratory environment usually shows a marked drop in performance when it is used in a real environment that does not match the training conditions. If the drop in recognition performance under such a mismatch is small, the system is called a robust speech recognition system. The changing conditions include: (1) the speaker, from speaker-dependent to speaker-independent; (2) the speaking style, from isolated-word to continuous speech recognition; (3) the vocabulary, from small-vocabulary to large-vocabulary tasks; (4) the domain, from a fixed vocabulary to an unrestricted vocabulary and from a domain-specific grammar to a domain-independent grammar; (5) the environment, from a fixed environment to arbitrary environments; and (6) pronunciation variation produced by the speaker under physiological, psychological, or emotional influences.
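As an illustration of the multi-threshold (sequential) decision idea mentioned above, here is a minimal sketch; the threshold values and the per-segment log-likelihood-ratio scores are invented for the example and are not taken from the textbook.

```python
def verify_sequential(segment_scores, accept_thr=2.0, reject_thr=-2.0):
    """Sequential accept/reject decision on a claimed identity.

    segment_scores: per-utterance log-likelihood ratios, e.g.
        log p(X | claimed speaker) - log p(X | background model).
    The claim is accepted as soon as the accumulated evidence exceeds
    accept_thr, rejected once it falls below reject_thr, and more speech
    is requested while the score stays between the two thresholds.
    """
    total = 0.0
    for i, s in enumerate(segment_scores, start=1):
        total += s
        if total >= accept_thr:
            return f"accept after {i} segment(s)"
        if total <= reject_thr:
            return f"reject after {i} segment(s)"
    return "undecided: ask the speaker for another utterance"

print(verify_sequential([0.9, 0.7, 0.8]))   # accept after 3 segment(s)
print(verify_sequential([-1.2, -1.5]))      # reject after 2 segment(s)
```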

語者辨識 (Speaker Recognition)
張智星 Jang@.tw .tw/~jang

Outline: what speaker recognition is; applications of speaker recognition; methods of speaker recognition; speed and recognition rate of the telephone voiceprint tracking system; conclusions.

What is speaker recognition. Speaker recognition identifies or verifies a speaker's identity from the characteristics of the voice.

Telephone voiceprint tracking system: function. The system compares large numbers of intercepted telephone conversations against the targets and ranks them by the probability computed with GMMs, so that the true target speech is pushed into the top 10% of the ranking. In other words, if 1,000 telephone conversations are intercepted, the system can screen out 900 of them, and the monitoring staff only needs to listen to the remaining 100 most likely conversations to obtain the same result, achieving much more with much less effort.

Telephone voiceprint tracking system: recognition rate. Data behind the rough estimate of the recognition rate: 10 target speakers; total length of target speech: 1 minute; test speech of 1 minute each, 100 calls in all; recording format: 8 kHz sampling rate, 8-bit resolution.

Telephone voiceprint tracking system: computation speed. Platform: Pentium 2.4 GHz CPU, 512 MB DDR RAM. Preprocessing: feature extraction on 1 minute of target speech takes about 2.8 s, and building an 8-Gaussian GMM takes about 2.1 s, so about 4.9 s in total. Real-time processing: feature extraction on one 1-minute test call takes about 2.8 s, and evaluating the probabilities under 10 8-Gaussian GMMs (assuming 10 target speakers) takes about 0.7 s, so about 3.5 s in total. On these figures, with 10 target speakers, a single personal computer running continuously can keep processing 20 conversations at a time, and 5 computers working in parallel can keep processing 100 conversations at a time.
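A minimal sketch of the score-and-rank step described above, with scikit-learn's GaussianMixture standing in for the original system's own GMM code; the 8-component setting follows the slides, while the feature arrays, names, and data are invented placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-ins for MFCC matrices (frames x coefficients); real features would come
# from a front end such as the one sketched in the MFCC section further below.
target_feats = {f"target_{i}": rng.normal(i, 1.0, size=(600, 13)) for i in range(10)}
calls = {f"call_{j}": rng.normal(j % 10, 1.0, size=(600, 13)) for j in range(30)}

# One 8-Gaussian GMM per target speaker, as in the slides.
models = {name: GaussianMixture(n_components=8, covariance_type="diag",
                                random_state=0).fit(feats)
          for name, feats in target_feats.items()}

def best_target_score(call_feats):
    # Average per-frame log-likelihood under the best-matching target model.
    return max(model.score(call_feats) for model in models.values())

# Rank intercepted calls so the most target-like ones come first.
ranking = sorted(calls, key=lambda c: best_target_score(calls[c]), reverse=True)
print(ranking[:3])   # the calls a human listener would check first
```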

The LD332X chip (LD332X 芯片)

I. Overview

1. Chip introduction. The LD332X series are speech recognition / voice control chips based on speaker-independent automatic speech recognition (SI-ASR) technology. The series provides a true single-chip speech recognition solution. An LD332X chip integrates high-precision A/D and D/A interfaces and needs no external support chips such as Flash, RAM, or encryption chips, nor any PC-side software for recognition training; it can be integrated directly into an existing product to provide speech recognition, voice control, and spoken human-machine dialogue. Moreover, the list of keywords to be recognized can be edited dynamically at any time. With an LD332X chip, speech recognition / voice control can easily be added to any electronic product, even a system whose main controller is the simplest 8051-class MCU, giving every electronic product a more human-friendly voice user interface (VUI).

2. Speech recognition on the chip. The ASR technology used here is keyword-list based. One only has to define the list of keywords to be recognized and send these keywords, as character strings, into the LD3320; the chip can then recognize the keywords spoken by the user, with no recording or training required of the user. The main practical significance of this ASR technology is that it provides a voice-based user interface (VUI) that frees the user from buttons, keyboards, and mice. Each recognition pass converts the user's speech, via its spectrum, into speech features and matches them one by one against the entries of the keyword list; the best-matching entry is returned as the recognition result. In a mobile-phone application, for example, the keyword list would contain the names in the phone book, the phone's menu commands, or the song titles on the memory card. Whatever the entries are, the user only needs to set the relevant registers to pass the corresponding entries, as character strings, to the recognition engine. The LD3320 can recognize any keyword in the list, the user may speak any of them, and no training is needed before recognition. The recognition engine does not care what the keywords are; they can be commands, personal names, song titles, operating instructions, or any other Chinese character strings. A conceptual sketch of keyword-list matching follows below.
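The following sketch only illustrates the keyword-list matching idea in ordinary Python; it is not the LD3320's actual register interface or API, and the pinyin-style representation, the keyword list, and the similarity measure are all invented for the example.

```python
from difflib import SequenceMatcher

# Hypothetical keyword list, e.g. phone-book entries expressed as pinyin strings.
keywords = {"da dian hua": "make a call",
            "bo fang yin yue": "play music",
            "da kai lan ya": "turn on Bluetooth"}

def recognize(decoded_syllables: str, min_score: float = 0.6):
    """Match a decoded syllable string against every keyword and return the
    action of the best match, mimicking best-entry selection over the list."""
    best, best_score = None, 0.0
    for phrase, action in keywords.items():
        score = SequenceMatcher(None, decoded_syllables, phrase).ratio()
        if score > best_score:
            best, best_score = action, score
    return best if best_score >= min_score else None

# Pretend the acoustic front end decoded these syllables from the user's speech.
print(recognize("bo fang yin ye"))   # -> "play music"
print(recognize("guan ji"))          # -> None (no keyword close enough)
```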

An introduction to MFCC (MFCC 介绍)

In speech recognition and speaker recognition, the most widely used speech features are the Mel-scale Frequency Cepstral Coefficients (MFCC). Studies of the human auditory mechanism show that the ear has different sensitivity to sound waves of different frequencies, and that components between about 200 Hz and 5000 Hz have the greatest influence on the intelligibility of speech. When two sounds of unequal loudness reach the ear, the louder frequency components affect the perception of the quieter ones and make them harder to notice; this phenomenon is called the masking effect. Because low-frequency sounds travel further along the basilar membrane of the cochlea than high-frequency sounds, low tones mask high tones more easily than the reverse, and the critical bandwidth of masking is smaller at low frequencies than at high frequencies. Accordingly, a bank of band-pass filters is arranged over the band from low to high frequency, densely spaced at the low end and sparsely at the high end according to the critical bandwidth, and the input signal is filtered through them. The signal energy at the output of each band-pass filter is taken as a basic feature of the signal, and after further processing these features can serve as the input features for speech. Because such features do not depend on the nature of the signal, make no assumptions about or restrictions on the input, and exploit the findings of auditory modeling, they are more robust than the vocal-tract-model-based LPCC, agree better with the hearing characteristics of the human ear, and still recognize well when the signal-to-noise ratio drops.

MFCCs are cepstral parameters extracted in the mel-scale frequency domain. The mel scale describes the nonlinear frequency resolution of the human ear; its relation to frequency is usually approximated as Mel(f) ≈ 2595 · log10(1 + f / 700), with f in Hz.
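A minimal sketch of extracting MFCCs with the librosa library; the file name is a placeholder, and 13 coefficients with roughly 32 ms windows and a 10 ms hop are just common choices rather than values specified by the text above.

```python
import numpy as np
import librosa

def hz_to_mel(f_hz):
    # The mel-scale approximation quoted above.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# "utterance.wav" is a placeholder path.
y, sr = librosa.load("utterance.wav", sr=8000)           # mono waveform at 8 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,        # 13 cepstral coefficients
                            n_fft=256, hop_length=80)     # ~32 ms windows, 10 ms hop
print(mfcc.shape)         # (13, number_of_frames); these frames feed GMM/SVM back ends
print(hz_to_mel(1000.0))  # ~1000 mel, by construction of the scale
```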

说话人识别 (Speaker Recognition): GMM-based modeling, lecture notes
Another use of the UBM is enrollment with very little data: when only a small amount of training speech is available for an in-set speaker, the speaker model can be obtained by adapting the UBM under the maximum a posteriori (MAP) criterion. For mixture component $m$, the sufficient statistics of the adaptation data $O = \{o_1, \dots, o_T\}$ are

$$n_m = \sum_{t=1}^{T} P(q_t = m \mid o_t, \lambda), \qquad E_m(O) = \frac{1}{n_m} \sum_{t=1}^{T} P(q_t = m \mid o_t, \lambda)\, o_t ,$$

where $P(q_t = m \mid o_t, \lambda)$ is the posterior probability that frame $o_t$ was generated by component $m$ of the model $\lambda$.

The GMM itself is essentially a multivariate probability density function. The density of an $M$-component GMM is

$$P(o \mid \lambda) = \sum_{i=1}^{M} P(o, i \mid \lambda) = \sum_{i=1}^{M} c_i\, P(o \mid i, \lambda), \qquad \sum_{i=1}^{M} c_i = 1,$$

where each component is a $K$-dimensional Gaussian

$$P(o \mid i, \lambda) = N(o; \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{K/2}\, \lvert \Sigma_i \rvert^{1/2}} \exp\!\Big(-\tfrac{1}{2}\,(o - \mu_i)^{\mathsf T} \Sigma_i^{-1} (o - \mu_i)\Big).$$

In maximum-likelihood (EM) training, the variance of dimension $k$ of component $i$ is re-estimated as

$$\sigma_{ik}^2 = \frac{\sum_{t=1}^{T} P(q_t = i \mid o_t, \lambda)\,(o_{tk} - \mu_{ik})^2}{\sum_{t=1}^{T} P(q_t = i \mid o_t, \lambda)} .$$
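A small sketch of the statistics above, followed by the usual relevance-factor MAP update of the component means; the relevance factor $r$ and the update rule are the standard GMM-UBM recipe and are an assumption here, since the notes above only list the sufficient statistics. scikit-learn stands in for the UBM training, and all data are synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
ubm_data = rng.normal(0.0, 1.0, size=(5000, 13))       # background speech features
spk_data = rng.normal(0.4, 1.0, size=(200, 13))        # a little enrollment data

# Universal background model (UBM).
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(ubm_data)

# Sufficient statistics n_m and E_m(O) from the equations above.
post = ubm.predict_proba(spk_data)                      # P(q_t = m | o_t, lambda)
n_m = post.sum(axis=0)                                  # shape (M,)
E_m = (post.T @ spk_data) / np.maximum(n_m[:, None], 1e-10)

# MAP adaptation of the means with relevance factor r (r = 16 is a common choice).
r = 16.0
alpha = n_m / (n_m + r)
adapted_means = alpha[:, None] * E_m + (1.0 - alpha[:, None]) * ubm.means_

# Build the adapted speaker model and score a test utterance against it and the UBM.
spk = GaussianMixture(n_components=8, covariance_type="diag")
spk.weights_, spk.means_, spk.covariances_ = ubm.weights_, adapted_means, ubm.covariances_
spk.precisions_cholesky_ = 1.0 / np.sqrt(ubm.covariances_)
test = rng.normal(0.4, 1.0, size=(300, 13))
llr = spk.score(test) - ubm.score(test)                 # average per-frame log-likelihood ratio
print(f"average frame LLR: {llr:.3f}")                  # > 0 favours the adapted speaker model
```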
Model initialization (flowchart): start with a given model order $M$; initialize the model parameters $\lambda_0$; then, for each feature vector $o_t$, find the best-matching component $n^* = \arg\max_{1 \le n \le N} p(o_t \mid \lambda_n)$ (the usual assignment step before EM re-estimation).

The text-prompted recognition method (block diagram): speaker-independent subword (unit) models are obtained offline through training, unit-model generation, and threshold setting. At recognition time, the input speech passes through endpoint detection and feature extraction; the specified (prompted) text is used to generate a text model from the unit models; matching scores between the features and the text model are computed, compared against the threshold, and the accept/reject decision is output.

语音知识英语 (Speech recognition basics, English-language notes)

语音知识英语Speech recognition, also known as automatic speech recognition (ASR) or voice recognition, is the technology that allows a computer or device to convert spoken words into written text. This technology has become increasingly popular and widely used in recent years, as it offers convenient and efficient ways for people to interact with their devices.There are two main types of speech recognition: speaker-dependent and speaker-independent. Speaker-dependent speech recognition requires the system to be trained on the specific voice of the user in order to accurately recognize their speech. This type of technology is commonly used in applications such as voice dialing on mobile phones, where the user trains the system to recognize their voice before using the feature.Speaker-independent speech recognition, on the other hand, does not require any training or customization and is designed to recognize any user's voice. This type of speech recognition is typically used in applications such as smart home devices, virtual assistants, and automated customer service systems.The process of speech recognition involves several steps. First, the system captures the user's speech using a microphone or other audio input device. The audio signal is then processed and analyzed to identify the individual sounds, or phonemes, that make up the speech. This is done using techniques such as pattern recognition and statistical modeling.Once the phonemes have been identified, the system matches themto words in its database to determine the most likely interpretation of the spoken words. This is known as the decoding process. The system also takes into account the context and grammar of the language being spoken to improve the accuracy of its interpretation.Speech recognition technology has advanced significantly in recent years, thanks to improvements in machine learning and artificial intelligence algorithms. This has allowed for more accurate and reliable speech recognition, even in noisy environments or with accents and dialects. It has also opened up new possibilities for voice-controlled interfaces and applications, such as voice assistants like Apple's Siri, Google Assistant, and Amazon's Alexa.Speech recognition is not without its limitations, however. Accurate speech recognition can be challenging for certain languages or dialects, and background noise or unclear speech can also affect its performance. Additionally, speech recognition systems may struggle with homophones, words that sound the same but have different meanings.Despite these challenges, speech recognition technology continues to improve and evolve. As the technology becomes more advanced and accessible, we can expect to see even more applications and uses for speech recognition in the future. From voice-controlled cars and homes to translation services and accessibility features, speech recognition is poised to revolutionize the way we interact with technology.。


The NIST Speaker Recognition Evaluations: 1996-2001Alvin F Martin, Mark A. PrzybockiNational Institute of Standards and TechnologyGaithersburg, MD 20899 USAalvin.martin@ mark.przybocki@AbstractWe discuss the history and purposes of the NIST evaluations of speaker recognition performance. We cover the sites that have participated, the performance measures used, and the formats used to report results. We consider the extent to which there has been measurable progress over the years. In particular, we examine apparent performance improvements seen in the 2001 evaluation. Information for prospective participants is included.1.IntroductionNIST (The National Institute of Standards and Technology) has coordinated evaluations of text independent speaker recognition using conversational telephone speech over the past six years. Some discussion of these evaluations may be found in [1], [2], and [3]. These evaluations have had as primary objectives:•Exploring promising new ideas in speaker recognition,•Developing advanced technology incorporating these ideas, and•Measuring the performance of this technology.Key features of these evaluations have been that they be: •Simple,•Focused on core technology issues,•Fully supported, and•Accessible.The evaluations have all included the basic one-speaker detection task consisting of a series of trials. Each trial presents the system with a target speaker, defined by some speech by the speaker (usually two minutes in duration), and with a test segment of up to one minute in duration, spoken by a single unknown speaker. For each trial, the system must decide whether or not the unknown speaker is the target, producing both a yes-or-no hard decision and a likelihood score.There are two types of trials: target trials where the unknown speaker is the target, and non-target trials where the unknown speaker is someone else. System errors for the first type are misses; for the second type false alarms. System performance may then be characterized by the two error rate types: miss rate and false alarm rate. The requirement for likelihood scores for all trials using a common scale allows these two error rates to be determined at multiple system operating points. [8]The 1999-2001 evaluations have included additional tasks beyond that of one-speaker detection. These tasks have been set in the context of test segments containing speech by multiple speakers. See [4 ] for further information. This paper restricts its discussion to the one-speaker detection task included in all of these evaluations.Most of the data used in these evaluations have come from the Switchboard Corpora of conversational telephone speech, available from the Linguistic Data Consortium (LDC) [5]. The 2000 and 2001 evaluations also used, in a separate test, non-conversational telephone speech from the Castilian Spanish AHUMADA Corpus [6] made available by Javier Ortega-Garcia of the Universidad Politecnica de Madrid.2.Evaluation ParticipantsAs shown in Table 1, the participants over the past six years have been from 12 countries on five continents, making these truly worldwide evaluations.In some cases two or more participants have worked in cooperation while submitting individual results for separate systems. Most notable has been the ELISA Consortium. 
This is an organization of primarily European sites that has created common system components while allowing individual sites to pursue variants to some components of particular interest to them.3.Evaluation HistoryThe present basic format of these evaluations using conversational telephone data was adopted in 1996. Subsequent evaluations have increased the numbers of speakers and trials, and have added other tasks as mentioned above. The AHUMADA data was added in 2000, and some of the newly collected Switchboard cellular data was included in 2001.Each evaluation has included certain types of trial specified as being the primary condition of particular interest. For example, earlier evaluations included segments of three different durations, namely 3, 10, or 30 seconds, and those of one particular duration were specified to be part of the primary condition. Likewise, there were multiple types of training data for each target speaker, with one specified as primary.From the beginning, it was recognized that different telephone handsets could greatly affect recognition performance. In particular, target trials would be easier if the training and test handsets used by the target speaker were identical. Both same and different target trial handsets were part of the primary condition in different years.It also became apparent, largely because of work done by MIT-Lincoln Lab, a participating site, that the microphone type of the handset was an important factor in performance. In the United States both electret and carbon button microphones are common. Performance is affected both by type (electret microphones enhance performance) and by whether the training and test types are the same. More recent evaluations have specified electret type as part of the primary condition.Table 2 lists the primary conditions, numbers of speakers and trials, and some of the distinguishing features of the several evaluations. It should be noted that the relatively large numbers of speakers and trials have been distinguishing features of the NIST evaluations. This has enhanced confidence not only in the overall evaluation results, but in examinations of various contributing factors that involve observing performance on small subsets of the data. These subsets need to be large enough for meaningful results, as suggested by Doddington’s “Rule of 30” (see [7]):To be 90 percent confident that the true error rate is within +/- 30 percent of the observed error rate, there must be at least 30 errors.4. Presentation of ResultsThe official performance measure for the NIST evaluations has been a weighted average, denoted C DET , of the miss andfalse alarm error rates as defined in figure 1.Figure 1: C DET function with current parameter values.The primary means of presenting system performance, however, has been with the use of DET (Detection Error Tradeoff) Curves [8]. These show the full range of possible operating points of the system based on the likelihood scores each system provides for all trials. A typical plot of such curves is shown in figure 2. The use of a normal deviate scale on both axes results in the curves being (approximately) linear if the underlying distributions of likelihood scores for both the target and non-target trials are (approximately) normal.Note that specific operating points may high-lighted on each DET Curve. Generally the point that correspond to the actual (hard) decisions (denoted by ‘•‘) and the point on the curve for which the C DET value is minimal (denoted by ‘♦‘) are plotted. 
A good choice of likelihood threshold value for the actual decisions will result in these points being identical. The C_DET values corresponding to these points are then sometimes shown in a bar chart plot as in Figure 3.

5. Measuring Progress

A key question of interest is whether, and how much, progress in recognition performance has been achieved over the course of the evaluations. This can be frustratingly difficult to determine accurately. Although all evaluations included the basic task of one-speaker detection, there have been both major and minor changes from one evaluation to another in the primary recognition condition of interest, for which sites were asked to optimize their systems (see Table 1). Moreover, different test sets, even though selected in exactly the same manner, can easily be quite different in inherent difficulty. This has been noted in recent NIST-coordinated evaluations of speech recognition [9].

Figure 2: A typical DET curve plot is shown. As noted below, these are actually performance results for one site in successive years.

Figure 3: A typical bar graph plot corresponding to the DET curves of Figure 2. The left bar in each pair shows the hard-decision cost; the right bar the minimum C_DET. The lower part of each bar shows the cost of missed detections; the upper part the cost of false alarms.

Figure 2 in fact shows performance results for one site from 1997-2000. For each year, the plot shown is for the subset of trials corresponding to that year's primary condition. Although the 1997 and 1998 subsets were basically the same, there were unavoidable changes in segment durations and training procedures in 1999 and 2000, confounding performance comparisons. From Figure 3 it is clear that the site did improve its threshold-setting procedure over this period, producing actual C_DET values that better approximate the minimum values.

The best indicator of performance improvement can be observed when a site provides results for both a previous and a current system on a given test set. This has been available in limited instances. Figure 4, for example, shows performance on the 1999 primary-condition data for systems from one site (different from the Figure 2 site) developed for the 1997, 1998, and 1999 evaluations. The three DET curves show evidence of real, if small, performance improvement from 1997 to 1999.

Figure 4: One site's progress from 1997-1999.

The 2001 main one-speaker detection test set was primarily a repetition of that for 2000. Some additional trials using the same speakers and test segments were also included. A small additional test set involving newly collected cellular phone data was added as well. The decision to rely primarily on a repeat of the 2000 test was made because of the lack of large quantities of fresh test data. This is a continuing problem for ongoing evaluations. But it did offer the advantage of identical test sets with which to measure year-to-year system progress. On the other hand, there is a legitimate concern that systems may have adapted themselves to the old data. It is generally believed that the large size of the test set limits the extent to which this is likely to be the case, but this requires further examination in the future.

Figure 5 shows DET curves for 2000 and 2001 on the same set of trials for systems of six sites. Subject to the caveat noted above, there certainly appears to have been significant improvement by each of these sites.

The 2001 evaluation also included what was known as the extended data task, which used the original Switchboard-1 Corpus.
Here the test segments consisted of single entire conversation sides, with speaker training data consisting of 1 to 16 such entire conversation sides of the given speaker. Moreover, participants could use machine generated transcripts of all of these conversation sides as part of their systems. Dragon Systems provided transcripts created by their ASR system for this purpose.The extended data task was included as a result of some work by George Doddington and others showing the possibility that dramatic progress on the speaker detection task might be obtained by using such extended data including transcripts. The initial evaluation results appear very promising [10], [11]. This could represent a significant performance breakthrough for those limited applications for which such extended data would be available. Figure 5: Comparative performance in 2000 and 2001 of systems from six sites on the identical set of trials.6.Future PlansNIST is now considering plans for the tests to be included in the 2002 evaluation. Suggestions in this regard are welcome. Especially welcome would be leads and suggestions on appropriate conversational telephone type data that might be available for use.The NIST evaluations are open to all, and new participants are welcome. Potential participants may obtain data sets from previous evaluations for development work. These data sets are available from the LDC or from NIST. Sites that are not LDC members are asked to sign a license agreement limiting data use to research purposes over a specified time period. Evaluation information is available on the NIST web site: /speech/tests/spk/index.htm References[1]Przybocki, M. and Martin, A., NIST speaker recognitionevaluation – 1997”, RLA2C, Avignon, April 1998, pp.120-123[2]Doddington, G., et al., “The NIST speaker recognitionevaluation – Overview, methodology, systems, results, perspective”, Speech Communication 31 (2000), pp. 225-254[3]Martin, A. and Przybocki, M., “The NIST 1999 SpeakerRecognition Evaluation - An Overview”, Digital Signal Processing, Vol. 10, Num. 1-3. January/April/July 2000,pp. 1-18[4]Martin, A. and Przybocki, M.,“Speaker Recognition in aMulti-Speaker Environment”, Proc. Eurospeech ‘01 [5]Switchboard Corpora are available from the LDC at http:///[6]J. Ortega-Garcia et al., “AHUMADA: A Large SpeechCorpus in Spanish for Speaker Identification and Verification”, Proc. ICASSP ’98, Vol. II, pp. 773-776 [7]Doddington, G., “Speaker recognition evaluationmethodology: a review and perspective”,RLA2C, Avignon, April 1998, pp. 60-66[8]Martin, A., et al., “The DET curve in assessment ofdetection task performance”. Proc. EuroSpeeech Vol. 4 (1997), pp. 1895-1898[9]Fiscus, J., et al., “2000 NIST Evaluation ofConversational Speech Recognition Over the Telephone”, Proc. 2000 Speech Transcription Workshop, http: ///speech/publications/tw00/html/cts10/cts10 .htm[10]Doddington, G., “Some experiments on IdiolectalDifferences among Speakers”, /speech /tests/spk/ 2000/doc/n-gram_experiments-v06.pdf[11]G. Doddington, “Speaker Recognition based on IdiolectalDifferences between Speakers”, Proc. 
Eurospeech '01.

Table 1: Participating sites in the NIST Speaker Recognition Evaluations, 1996-2000.

Africa: Datavoice (South Africa)
Asia: Amdocs (Israel); Indian Institute of Technology Madras
Australia: Defence Science & Technology Organization; Queensland University of Technology
Europe: Ensigma (UK); Laboratoire Informatique de Paris (France); Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (France); Nijmegen University (Netherlands); ELISA Consortium: Ecole Polytechnique Federale de Lausanne (Switzerland), Ecole Nationale Superieure des Telecommunications (France), Faculté Polytechnique de Mons & Rice University (Belgium and USA), Institut Dalle Molle d'Intelligence Artificielle Perceptive (Switzerland), Institut de Recherche et Informatique et Systemes Aleatoires (France), Laboratoire Informatique d'Avignon (France), Brno University of Technology (Czech Republic)
North America: Atlantic Coast Technologies (USA); Air Force Research Laboratory (USA); Aegir Systems (USA); BBN Technologies (USA); Dragon Systems (USA); ITT (USA); INRS-Telecom (Canada); MIT-Lincoln Laboratory (USA); Nuance (USA); Oregon Graduate Institute (USA); RGMM (Department of Defense, USA); Rutgers University (USA); SRI International (USA)

Table 2: Information on the evaluations, 1996-2001 (year; primary condition; target speakers / target trials; evaluation features).

1996. Primary condition: not defined. Speakers/trials: 40 / 3999. Features: tests of 3 durations, 3 training conditions; Switchboard-1 data.
1997. Primary condition: train/test using different handsets; 30-second durations; train on 1 min. from each of two conversations using different handsets. Speakers/trials: ~400 / 3050. Features: tests of 3 durations, 3 training conditions; Switchboard-2 Phase 1 data.
1998. Primary condition: train/test using the same handset; 30-second durations; train on 1 min. from each of two conversations using the same handset. Speakers/trials: ~500 / 2687. Features: tests of 3 durations, 3 training conditions; Switchboard-2 Phase 2 data; handset type detector info made available.
1999. Primary condition: train/test use different electret handsets; test durations 15-45 seconds; train on 1 min. from each of two conversations using the same handset. Speakers/trials: 233 / 479. Features: added multi-speaker tasks; variable durations used in main test trials; Switchboard-2 Phase 3 data.
2000. Primary condition: train/test use different electret handsets; test durations 15-45 seconds; train on 2 min. from one conversation. Speakers/trials: 804 / 4209. Features: resegmented 1997 and 1998 test data for reuse; extra test on AHUMADA Spanish data.
2001. Primary condition: train/test use different electret handsets; test durations 15-45 seconds; train on 2 min. from one conversation. Speakers/trials: 804 / 4209. Features: repeated the 2000 main test with some additional trials; additional test on Switchboard cellular data; additional test allowing human or machine transcripts with extended training data.
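To make the C_DET measure concrete, here is a small sketch that sweeps a decision threshold over trial scores and computes miss and false-alarm rates together with the detection cost. The cost weights and target prior (C_miss = 10, C_fa = 1, P_target = 0.01) are the values commonly quoted for these evaluations and are an assumption here, as are the synthetic scores.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic likelihood scores: target trials score higher on average.
target_scores = rng.normal(1.0, 1.0, size=500)
nontarget_scores = rng.normal(-1.0, 1.0, size=5000)

C_MISS, C_FA, P_TARGET = 10.0, 1.0, 0.01   # assumed NIST-style parameters

def det_points(tgt, non):
    """Miss and false-alarm rates at every candidate threshold (the DET curve points)."""
    thresholds = np.sort(np.concatenate([tgt, non]))
    p_miss = np.array([(tgt < t).mean() for t in thresholds])
    p_fa = np.array([(non >= t).mean() for t in thresholds])
    return thresholds, p_miss, p_fa

def c_det(p_miss, p_fa):
    # Weighted cost of misses and false alarms, as used for the official measure.
    return C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1.0 - P_TARGET)

thr, p_miss, p_fa = det_points(target_scores, nontarget_scores)
costs = c_det(p_miss, p_fa)
best = int(np.argmin(costs))
print(f"min C_DET = {costs[best]:.4f} at threshold {thr[best]:.2f} "
      f"(P_miss = {p_miss[best]:.3f}, P_fa = {p_fa[best]:.4f})")
```

The gap between the cost at a system's actual decision threshold and this minimum is exactly what the bar charts described above (Figure 3) visualize.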

联合因子分析中的本征信道空间拼接方法 (An eigenchannel-space combination method in joint factor analysis)

Abstract: For the application of joint factor analysis under multiple-channel conditions in text-independent speaker recognition, this paper proposes an eigenchannel-space orthogonal combination method. Under multiple-channel conditions the eigenchannel space can be estimated either from mixed data or by simple combination; however, the former suffers from space masking effects while the latter introduces space overlapping effects. The paper proves that the core computation of speaker enrollment and testing is an oblique projection, and on the basis of this proof the space overlapping effects can be removed by an orthogonalization method. On the NIST SRE 2008 core-task corpus, the proposed method performs better than both the mixed-data method and the simple combination method.

Key words: Speaker recognition, factor analysis, space combination, projection analysis

Speaker recognition is a branch of speech recognition; its basic task is to decide whether two speech segments were produced by the same speaker, and it is widely applied in information security and human-computer interaction [1].

In joint factor analysis (JFA), the speaker model can be built by: 1) joint estimation of U, V, and D; 2) joint estimation of U and V; or 3) a Gauss-Seidel-like iterative estimation. At test time one can use [9]: 1) frame-by-frame scoring; 2) point estimation; or 3) integration. Depending on the method chosen at each step and on how the steps are combined, JFA has many variants; among them, the better-performing variants are the methods given in [3, 7].
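The abstract above does not give the algorithmic details, so the following is only a rough sketch of one way two per-condition eigenchannel matrices could be concatenated and then orthogonalized (via SVD) to remove overlapping directions; the supervector dimension, ranks, and matrices are all invented, and this is not presented as the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
sv_dim, rank_a, rank_b = 512, 20, 20          # supervector dim and per-condition ranks (invented)

U_a = rng.normal(size=(sv_dim, rank_a))       # eigenchannel matrix estimated on condition A
U_b = np.hstack([U_a[:, :5],                  # condition B shares 5 channel directions with A
                 rng.normal(size=(sv_dim, rank_b - 5))])

# Simple concatenation keeps both subspaces, but shared directions are counted twice.
U_cat = np.hstack([U_a, U_b])

# Orthogonal combination: keep an orthonormal basis spanning both subspaces,
# discarding directions whose singular values are numerically zero.
left, s, _ = np.linalg.svd(U_cat, full_matrices=False)
keep = s > 1e-8 * s.max()
U_comb = left[:, keep]

print(U_cat.shape, U_comb.shape)              # (512, 40) vs (512, 35): the overlap is removed
# A channel-compensated supervector would then be projected with U_comb
# instead of U_cat, avoiding double-counting of the shared channel directions.
```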

说话人识别研究综述 (A survey of speaker recognition research), Wang Shuzhao

... the feature extraction, model training, and classification methods are reviewed, and the trends and difficulties are also discussed.

Key words: speaker recognition; feature extraction; model training; classification

Figure 1 (block diagram of a speaker recognition system): the input speech passes through preprocessing and feature extraction; in training, the features are used to generate and store the speaker models, while in recognition the features are matched against the stored models under a similarity criterion and the decision is output.

2.3 Preprocessing [5]. The input speech signal is normally preprocessed, and the quality of the preprocessing affects the recognition performance of the system to some extent. Typical steps include: ... (4) windowing, in which each frame is multiplied by a Hamming window to remove the discontinuities at the two ends of the frame so that the analysis is not affected by the neighbouring frames; and (5) passing the frame through a low-pass filter to remove abnormally strong noise.

3 Feature extraction. After preprocessing, even a few seconds of speech produce a very large amount of data. Extracting speaker features is in effect the process of removing the redundant information in the original speech and reducing the amount of data. The speaker feature parameters extracted from the speech signal should satisfy the following criteria: insensitivity to extraneous variables (for example the speaker's health and mood, or the transmission characteristics of the system); long-term stability; frequent occurrence; ease of measurement; and lack of correlation with other features. ... Cepstral features exploit the principle that, after appropriate homomorphic filtering of the speech signal, the excitation signal and the vocal-tract signal can be separated: the lower-order components of the cepstrum correspond to the vocal-tract component of the speech signal, and the higher-order components correspond to the source-excitation component, so the cepstrum of the speech signal can be used to separate the two from each other.
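A minimal sketch of the framing-and-windowing step described above; the frame length, frame shift, and pre-emphasis coefficient are typical textbook values, not figures taken from the survey.

```python
import numpy as np

def preprocess(signal, sr=8000, frame_ms=25, shift_ms=10, preemph=0.97):
    """Pre-emphasize, split into overlapping frames, and apply a Hamming window."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - a * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)
    frame_shift = int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    window = np.hamming(frame_len)    # removes the discontinuities at the frame edges
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames                      # shape: (n_frames, frame_len)

x = np.random.default_rng(4).normal(size=8000)   # 1 s of fake speech at 8 kHz
print(preprocess(x).shape)                        # (98, 200)
```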

TEXT-INDEPENDENT SPEAKER RECOGNITION USING PROBABILISTIC SVM WITH GMM ADJUSTMENT

Fenglei Hou and Bingxi Wang
Department of Information Science, Institute of Information Engineering, Information Engineering University, Zhengzhou 450002, P.R. China
Email: houfl@ bingxiwang@

ABSTRACT

Discriminative classifiers and generative-model classifiers are the two most popular techniques in pattern recognition, and combining them can improve the performance of a recognition system. This paper presents a novel method for text-independent speaker recognition in which the output of a Gaussian mixture model is used to adjust the probabilistic output of a support vector machine. The new probabilistic SVM/GMM speaker recognition system is tested on the NIST 2003 speaker recognition evaluation database; results on text-independent speaker identification and verification demonstrate the effectiveness of such systems.

Keywords: Speaker recognition, probabilistic support vector machine, Gaussian mixture model

1. INTRODUCTION

There are two major kinds of models for speaker recognition: discriminative models and generative models. Discriminative models include Artificial Neural Networks (ANN) and Support Vector Machines (SVM); generative models include the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM). Either kind can be used to construct speaker models for speaker recognition tasks, and previous studies show good performance for both. Discriminative models make full use of the discriminative information between different classes, while generative models use the statistical information within each class; in other words, discriminative models exploit inter-class information and generative models exploit intra-class information. Since each kind of model has its own advantages, each also has the disadvantage of not using the other kind of information, and experiments show that the two kinds of models make different kinds of errors [1]. Combining a discriminative model with a generative model can therefore improve the performance of a speaker recognition system. There are several ways to combine two models A and B: (1) A and B run in parallel; (2) A is embedded into B; (3) B is embedded into A.

This paper presents a new method that combines a discriminative model and a generative model so as to use both kinds of information. The support vector machine has recently proved to be a good discriminative classifier for many kinds of pattern recognition applications and shows advantages in training and performance compared to artificial neural networks. Gaussian mixture models have become the dominant approach for modeling in text-independent speaker recognition over the past several years because GMMs represent a speaker's statistical characteristics properly.

A classifier that can produce a posterior probability is very useful in practical recognition situations. Many decision methods are based on Bayes' rule, and probabilistic classifier output makes it possible to use the existing results of these theories, especially when a classifier contributes only a small part of an overall decision and the classification outputs must be combined for that decision. Standard SVMs do not provide probabilistic output, and several methods have recently been proposed to deal with this problem. Vapnik [2] proposes fitting the posterior probability with a sum of cosine terms, which shows promising results but requires the solution of a linear system for every evaluation of the SVM. Platt [3] uses a parametric sigmoid to fit the posterior probability; this method is quite simple in form, but its parameters must be determined by training.

This paper combines the SVM and the GMM to exploit the abilities of both models: the GMM's output is used to adjust the probabilistic output of the SVM. In the training phase, the SVMs and GMMs are trained independently; in the testing phase, the probability outputs of the GMMs are used to adjust the posterior probability outputs of the SVMs, and likelihood scoring is performed with the new hybrid model. Section 2 describes the algorithm in detail, Section 3 presents experiments on the NIST 2003 speaker recognition evaluation data, Section 4 discusses the results, and Section 5 contains the concluding remarks.

2. PROBABILISTIC SVM/GMM MODEL

Let $x_i$, $i = 1, \dots, N$, be a set of data points and $y_i$, $i = 1, \dots, N$, the corresponding target classes. The output of the standard SVM is

$$f(\mathbf{x}) = \sum_i y_i \alpha_i\, k(\mathbf{x}_i, \mathbf{x}) + b . \qquad (2)$$

One method of producing probabilistic outputs was proposed by Wahba [4], which uses a logistic link function,

$$P(C \mid \mathbf{x}) = p(\mathbf{x}) = \frac{1}{1 + \exp(-f(\mathbf{x}))} . \qquad (3)$$

For an SVM there are two outputs, corresponding to the two classes $C_+$ and $C_-$, so the posterior probability outputs of the SVM are

$$P(C_+ \mid \mathbf{x}) = \frac{1}{1 + \exp(-f(\mathbf{x}))} , \qquad (4)$$

$$P(C_- \mid \mathbf{x}) = \frac{1}{1 + \exp(f(\mathbf{x}))} . \qquad (5)$$

Here $f(\mathbf{x})$ can be viewed as the distance from $\mathbf{x}$ to the support vectors [5]; that is, the posterior probability outputs of the SVM are based on the distance between the test vectors and the support vectors, so they reflect the inter-class information. The outputs of the Gaussian mixture model, which models the intra-class information properly, are then used to adjust the output of the SVM so that it reflects the real data more accurately. The outputs of the GMM are [6]:
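The excerpt stops before the GMM output equation and the exact adjustment rule, so the following sketch only illustrates the ingredients named above: a sigmoid-mapped SVM score as in Eqs. (3)-(5) and per-class GMM likelihoods, combined here by a simple weighted average that is an assumption of this example, not the paper's formula. The data, weight w, and model settings are likewise invented.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X_pos = rng.normal(+1.0, 1.0, size=(300, 13))   # frames of the claimed speaker
X_neg = rng.normal(-1.0, 1.0, size=(300, 13))   # background/impostor frames
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 300 + [-1] * 300)

svm = SVC(kernel="rbf", gamma="scale").fit(X, y)
gmm_pos = GaussianMixture(n_components=4, random_state=0).fit(X_pos)
gmm_neg = GaussianMixture(n_components=4, random_state=0).fit(X_neg)

def posterior_plus(x, w=0.5):
    """Sigmoid SVM posterior (Eq. 4) adjusted by a GMM posterior; w is an assumed weight."""
    f = svm.decision_function(x.reshape(1, -1))[0]          # f(x), a distance-like score
    p_svm = 1.0 / (1.0 + np.exp(-f))                        # P(C+ | x) as in Eq. (4)
    lp = gmm_pos.score_samples(x.reshape(1, -1))[0]         # log p(x | speaker GMM)
    ln = gmm_neg.score_samples(x.reshape(1, -1))[0]         # log p(x | background GMM)
    p_gmm = 1.0 / (1.0 + np.exp(ln - lp))                   # GMM-based posterior
    return w * p_svm + (1.0 - w) * p_gmm                    # assumed combination rule

test_frame = rng.normal(+1.0, 1.0, size=13)
print(f"adjusted P(C+ | x) = {posterior_plus(test_frame):.3f}")
```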