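A Survey of Speech Recognition Technology
Zhang Yongshuang, Soochow University, Suzhou, Jiangsu
Abstract: This paper reviews the history of speech recognition technology, surveys the structure, classification and basic methods of speech recognition systems, and analyzes the problems the technology faces and the directions in which it is developing.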
Keywords: speech recognition; feature; matching
Introduction
Speech recognition is the technology that enables a machine to convert a speech signal into the corresponding text or command through a process of recognition and understanding.
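In the 1960s the spread of computers drove the development of speech recognition and produced two major research results: dynamic programming (DP) and linear prediction analysis (Linear Prediction, LP). The latter gave a good solution to the problem of modelling speech signal production and had a profound influence on the development of speech recognition technology.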
Speech recognition
Louise Wang

Computer- and network-based language teaching has become an effective aid to traditional language teaching, and speech recognition is a relatively new technology in computer-aided language learning. Its application to language learning and to spoken human-computer interaction practice, however, is still at the exploratory stage. Speech recognition was counted among the ten most important developing technologies in information technology for 2000 to 2010, and it is becoming a key technology for human-computer interaction.

In the speech recognition experience class, the teacher guides the students in assembling the "smart fish lamp", explains the corresponding graphical program, and uses the software and hardware to teach the concept and distinguishing features of voiceprint recognition. Combining speech recognition with speech synthesis lets people operate devices by voice command, without a keyboard. Voice technology has become a competitive, emerging high-tech industry.

The new teaching method, with its vivid treatment of speech recognition, software and hardware, and programming, helps students understand the working principle of AI speech recognition and exercises their thinking, hands-on and teamwork skills.

The purpose of speech recognition is to convert the lexical content of human speech into computer-readable input. Applications include voice dialing, voice navigation, indoor device control, voice document retrieval and simple dictation data entry, and these can be built into more complex applications.

Speech recognition is a technique for solving the problem of machines "understanding" human language. At present, research on speech recognition technology has made breakthroughs.
Speech recognition is used in voice telephone exchanges, information network inquiry, home services, hotel services, medical services, banking services, industrial control, voice communication systems and more, touching almost every industry and every aspect of society.

The course unveils the mystery of speech recognition and links artificial intelligence closely to students' learning and daily life. By taking the speech recognition course, students come to appreciate how speech recognition works and the profound impact it has on people's lives.
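The words, phrases and sentences that humans perceive as discrete units with clear boundaries are in fact delivered as a continuous stream of signal, not as the separate words they seem to be: "I went to the store yesterday."
Words can also blend together: "Whaddayawa?" stands for "What do you want?"
There are more than twenty distinct vowels, although the exact count depends on the speaker's accent.
For example, the word "water" may be pronounced wadder, watter, woader, wattah and so on.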
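Figure 1. Block diagram of a speech recognition system
3. Theory and Methods
Extracting speaker-independent features from the speech signal is a fundamental problem in speech recognition systems.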
Then it proposes two new algorithms that combine the heuristic weighting and the partition normalized distance measure with group vector quantization discriminative training, to take advantage of both approaches. Experiments using the TIMIT corpus suggest that the new combined approach is superior to current VQ-based solutions (50% error reduction). It also outperforms the Gaussian Mixture Model using the Wavelet features tested in a similar setting.

1. Introduction
Vector quantization (VQ) based classification algorithms play an important role in speech-independent speaker identification (SI) systems. Although in its baseline form the VQ-based solution is less accurate than the Gaussian Mixture Model (GMM), it is computationally simple. For a large database of hundreds or thousands of speakers, both accuracy and speed are important issues. Here we discuss VQ enhancements aimed at accuracy and fast computation.

1.1 VQ Based Speaker Identification System
Fig. 1 shows the VQ-based speaker identification system. It contains an offline training sub-system that produces VQ codebooks and an online testing sub-system that generates the identification decision. Both sub-systems contain a preprocessing, or feature extraction, module that converts an audio utterance into a set of feature vectors. Features of interest in the recent literature include the Mel-frequency cepstral coefficients (MFCC), the Line spectra pairs (LSP), the Wavelet packet parameter (WPP), and PCA and ICA features. Although the WPP and ICA have been shown to offer advantages, we used MFCC in this paper to focus our attention on the other modules of the system.

Fig. 1. A VQ-based speaker identification system features an online sub-system for identifying a testing audio utterance, and an offline training sub-system, which uses training audio utterances to generate a codebook for each speaker in the database.

A VQ codebook normally consists of the centroids of partitions over a speaker's feature vector space.
The effects on SI of different partition clustering algorithms, such as the LBG and the RLS, have been studied. The average error, or distortion, of the feature vectors $\{X_t, 1 \le t \le T\}$ of length $T$ with respect to the codebook of speaker $k$ is given by

$$e_k = \frac{1}{T} \sum_{t=1}^{T} \min_{1 \le j \le S} d(X_t, C_{k,j}), \qquad 1 \le k \le L \tag{1}$$

where $d(\cdot,\cdot)$ is a distance function between two vectors, $C_{k,j} = (c_{k,j,1}, \ldots, c_{k,j,D})^T$ is the $j$-th code of dimension $D$, $S$ is the codebook size, and $L$ is the total number of speakers in the database. The baseline VQ algorithm for SI simply uses the LBG to generate codebooks and the square of the Euclidean distance as $d(\cdot,\cdot)$.

Many improvements to the baseline VQ algorithm have been published. Among them there are two independent approaches: (1) choose a weighted distance function, such as the F-ratio and IHM weights, the Partition Normalized Distance Measure (PNDM), and the Bhattacharyya Distance; (2) exploit the discriminative power of inter-speaker characteristics using the entire set of speakers, such as the Group Vector Quantization (GVQ) discriminative training and the Speaker Discriminative Weighting. Experimentally we have found PNDM and GVQ to be two very effective methods in their respective groups.

1.2 Review of Partition Normalized Distance Measure
The Partition Normalized Distance Measure is defined as the square of the weighted Euclidean distance

$$d_p(X, C_{k,j}) = \sum_{i=1}^{D} w_{k,j,i} \, (x_i - c_{k,j,i})^2 \tag{2}$$

The weighting coefficients are determined by minimizing the average error of the training utterances of all the speakers, subject to the constraint that the geometric mean of the weights for each partition is equal to 1. Let $X_{k,j} = (x_{k,j,1}, \ldots, x_{k,j,D})^T$ be a random training feature vector of speaker $k$, which is assigned to partition $j$ via the minimization process in Equation (1).
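It has mean and variance vectors

$$C_{k,j} = E[X_{k,j}], \qquad V_{k,j} = E[(X_{k,j} - C_{k,j})^T (X_{k,j} - C_{k,j})] \tag{3}$$

where $v_{k,j,i}$, the $i$-th component of $V_{k,j}$, is the variance of the $i$-th feature component. The constrained optimization criterion to be minimized in order to derive the weights is

$$\begin{aligned}
\xi &= \frac{1}{LS} \sum_{k=1}^{L} \sum_{j=1}^{S} \Big\{ E[d_p(X_{k,j}, C_{k,j})] + \lambda_{k,j} \Big(1 - \prod_{i=1}^{D} w_{k,j,i}\Big) \Big\} \\
&= \frac{1}{LS} \sum_{k=1}^{L} \sum_{j=1}^{S} \Big\{ \sum_{i=1}^{D} w_{k,j,i} \, E[(x_{k,j,i} - c_{k,j,i})^2] + \lambda_{k,j} \Big(1 - \prod_{i=1}^{D} w_{k,j,i}\Big) \Big\} \\
&= \frac{1}{LS} \sum_{k=1}^{L} \sum_{j=1}^{S} \Big\{ \sum_{i=1}^{D} w_{k,j,i} \, v_{k,j,i} + \lambda_{k,j} \Big(1 - \prod_{i=1}^{D} w_{k,j,i}\Big) \Big\}
\end{aligned} \tag{4}$$

where $L$ is the number of speakers and $S$ is the codebook size. Letting

$$\frac{\partial \xi}{\partial w_{k,j,i}} = 0 \quad \text{and} \quad \frac{\partial \xi}{\partial \lambda_{k,j}} = 0 \tag{5}$$

we have

$$\lambda_{k,j} = \Big( \prod_{i=1}^{D} v_{k,j,i} \Big)^{1/D} \quad \text{and} \quad w_{k,j,i} = \frac{\lambda_{k,j}}{v_{k,j,i}} \tag{6}$$

where the subscript $i$ is the feature vector component index, and $k$ and $j$ are the speaker and partition indices respectively. Because $k$ and $j$ appear on both sides of the equations, the weights depend only on the data from one partition of one speaker.

1.3 Review of Group Vector Quantization
Discriminative training uses the data of all the speakers to train the codebook, so that it can achieve more accurate identification results by exploiting the inter-speaker differences.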
The GVQ training algorithm is described as follows.

Group Vector Quantization Algorithm:
(1) Randomly choose a speaker $j$.
(2) Select $N$ vectors $\{X_{j,t}, 1 \le t \le N\}$.
(3) Calculate the error for all the codebooks. If the following conditions are satisfied, go to (4):
  a) $e_i = \min_{\forall k} \{e_k\}$, but $i \ne j$;
  b) $(e_j - e_i)/e_j < W$, where $W$ is a window size;
  else go to (5).
(4) For each $\{X_{j,t}, 1 \le t \le N\}$:
  $C_{j,m} \Leftarrow (1 - \alpha) \cdot C_{j,m} + \alpha \cdot X_{j,t}$, where $C_{j,m} = \arg\min_{C_{j,l}, \forall l} d(X_{j,t}, C_{j,l})$;
  $C_{i,n} \Leftarrow (1 + \alpha) \cdot C_{i,n} - \alpha \cdot X_{j,t}$, where $C_{i,n} = \arg\min_{C_{i,l}, \forall l} d(X_{j,t}, C_{i,l})$.
(5) For each $\{X_{j,t}, 1 \le t \le N\}$:
  $C_{j,m} \Leftarrow (1 - \varepsilon\alpha) \cdot C_{j,m} + \varepsilon\alpha \cdot X_{j,t}$, where $C_{j,m} = \arg\min_{C_{j,l}, \forall l} d(X_{j,t}, C_{j,l})$.

2. Enhancements
We propose the following steps to further enhance the VQ-based solution: (1) a Heuristic Weighted Distance (HWD), (2) combination of HWD and GVQ, and (3) combination of PNDM and GVQ.

2.1 Heuristic Weighted Distance
The PNDM weights are inversely proportional to the partition variances of the feature components, as shown in Equation (6). It has been shown that the variances of the cepstral components decrease with the component index; clearly $v_i > v_{i+1}$, $1 \le i \le D-1$, where $i$ is the vector element index, which reflects the frequency band. The higher the index, the smaller the feature value and its variance.

We considered a Heuristic Weighted Distance

$$d_h(X, C_{k,j}) = \sum_{i=1}^{D} w_i(S, D) \cdot (x_i - c_{k,j,i})^2 \tag{7}$$

The weights are calculated by

$$w_i(S, D) = 1 + c(S, D) \cdot (i - 1) \tag{8}$$

where $c(S, D)$ is a function of both the codebook size $S$ and the feature vector dimension $D$. For a given codebook, $S$ and $D$ are fixed, and thus $c(S, D)$ is a constant.
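The distance measures in Equations (2), (6), (7) and (8), together with the identification rule implied by Equation (1), can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the array layout (one codebook per speaker, shape `(S, D)`) and the rule of picking the speaker with the smallest average error are assumptions made for this example.

```python
import numpy as np

def pndm_weights(variances):
    """Eq. (6): w_i = lambda / v_i, where lambda is the geometric mean of the v_i."""
    v = np.asarray(variances, dtype=float)
    lam = np.exp(np.mean(np.log(v)))      # geometric mean, computed in log space
    return lam / v

def hwd_weights(c, dim):
    """Eq. (8): w_i = 1 + c * (i - 1) for i = 1..D."""
    return 1.0 + c * np.arange(dim)

def weighted_sq_distance(x, codes, w):
    """Eqs. (2)/(7): weighted squared Euclidean distance from x to every code.

    codes has shape (S, D); w has shape (D,) for HWD or (S, D) for PNDM.
    """
    diff = np.asarray(codes, dtype=float) - np.asarray(x, dtype=float)
    return np.sum(w * diff ** 2, axis=-1)

def average_distortion(frames, codes, w):
    """Eq. (1): mean over all frames of the distance to the nearest code."""
    return float(np.mean([weighted_sq_distance(x, codes, w).min() for x in frames]))

def identify(frames, codebooks, weights):
    """Assumed decision rule: return the index of the speaker with the smallest error."""
    errors = [average_distortion(frames, c, w) for c, w in zip(codebooks, weights)]
    return int(np.argmin(errors))
```

Because the PNDM weights are normalized by the geometric mean, their product over the components is 1, which is the constraint used to derive Equation (6); the HWD weights instead grow linearly with the component index.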
The value of $c(S, D)$ is estimated experimentally by performing an exhaustive search for the maximum identification rate on a given sample test dataset.

2.2 Combination of HWD and GVQ
The HWD and the GVQ are combined by simply replacing the original square of the Euclidean distance with the HWD of Equation (7), and adjusting the GVQ updating parameter $\alpha$ whenever needed.

2.3 Combination of PNDM and GVQ
Combining PNDM with the GVQ requires slightly more work, because the GVQ alters the partition and thus its component variances. We have used the following algorithm to overcome this problem.

Algorithm to Combine PNDM with the GVQ Discriminative Training:
(1) Use the LBG algorithm to generate initial LBG codebooks;
(2) Calculate PNDM weights using the LBG codebooks, and produce PNDM-weighted LBG codebooks, which are LBG codebooks appended with the PNDM weights;
(3) Perform GVQ training with the PNDM distance function, and generate the initial PNDM+GVQ codebooks by replacing the LBG codes with the GVQ codes;
(4) Recalculate the PNDM weights using the PNDM+GVQ codebooks, and produce the final PNDM+GVQ codebooks by replacing the old PNDM weights with the new ones.

3. Experimental Comparison of VQ-based Algorithms
3.1 Testing Data and Procedures
The 168 speakers in the TEST section of the TIMIT corpus are used for the SI experiment, and 190 speakers from DR1, DR2 and DR3 of the TRAIN section are used for estimating the $c(S, D)$ parameter. Each speaker has 10 good-quality recordings at 16 kHz and 16 bits/sample, stored as WAVE files in NIST format. Two of them, SA1.WAV and SA2.WAV, are used for testing, and the rest for training codebooks. We did not perform silence removal on the WAVE files, so that others can reproduce the environment without the additional complication of VAD algorithms and their parameters.

An MFCC program converts all the WAVE files in a directory into one feature vector file, in which every feature vector is indexed with its speaker and recording.
For each value of the feature vector dimension, D = 30, 40, 50, 60, 70, 80, 90, one training file and one testing file are created. They are used by all the algorithms to train codebooks of size S = 16, 32, 64, and to perform the identification test, respectively.

The MFCC feature vectors are calculated as follows:
1) divide the entire utterance into blocks of 512 samples with an overlap of 256;
2) perform pre-emphasis filtering with coefficient 0.97;
3) multiply by a Hamming window, and perform a short-time FFT;
4) apply the standard mel-frequency triangular filter banks to the squared magnitude of the FFT;
5) apply the logarithm to the sum of all the outputs of each individual filter;
6) apply the DCT to the entire set of data resulting from all filters;
7) drop the zeroth coefficient, to produce the cepstral coefficients;
8) after all the blocks have been processed, calculate the mean over the entire time duration and subtract it from the cepstral coefficients;
9) calculate the first-order time derivatives of the cepstral coefficients, and concatenate them after the cepstral coefficients to form a feature vector.
For example, a filter bank of size 16 will produce 30-dimensional feature vectors.

Due to project time constraints, the HWD parameter $c(S, D)$ was estimated only at S = 16, 32, 64 and D = 40, 80, so that it achieves the highest identification rate on the 190-speaker dataset of the TRAIN section. For other values of S and D, it was interpolated or extrapolated from the optimized samples. The results are shown in the bottom section of Table 1. The identification experiment was then performed using the 168-speaker dataset from the TEST section. We used different datasets for $c(S, D)$ estimation, codebook training, and identification rate testing, to produce objective results.

3.2 Testing Results
Table 1 shows identification rates for various algorithms. The value of the learning parameter $\alpha$ is displayed after the GVQ title, and the parameter $c(S, D)$ is displayed in the bottom section.
Combinations of the algorithms are indicated by a "+" sign between their name abbreviations.

Table 1. Identification rates (%) and parameters for various VQ-based algorithms tested, where the 1st row is the feature vector dimension D, and the 1st column is the codebook size S.

The baseline algorithm performs worst, as expected. The plain HWD, PNDM, and GVQ all show enhancements over the baseline. The combination methods further improve on the plain methods. The PNDM+GVQ performs best when the codebook size is 16 or 32, while the HWD+GVQ is better at codebook size 64. The highest score of the test is 99.7%, which corresponds to a single miss in 336 utterances of 168 speakers. It outperforms the reported rate of 98.4% obtained using the GMM with WPP features.

4. Conclusion
A new approach combining the weighted distance measure and discriminative training is proposed to enhance VQ-based solutions for speech-independent speaker identification. An alternative heuristic weighted distance measure was explored, which lifts the higher-order MFCC feature vector components using a linear formula.

Abstract: Weighted distance measures and discriminative training are two different approaches to enhancing VQ-based solutions for speaker identification.
Rabiner et al., 1993]. Vector Quantization (VQ) is the fundamental and most successful technique used in speech coding, image coding, speech recognition, speech synthesis and speaker recognition [S. Furui, 1986]. These techniques were first applied in the analysis of speech, where a large vector space is mapped into a finite number of regions of that space. VQ techniques are commonly applied to develop discrete or semi-continuous HMM-based speech recognition systems.

In VQ, an ordered set of signal samples or parameters can be efficiently coded by matching the input vector to a similar pattern or codevector (codeword) in a predefined codebook [Tzu-Chuen Lu et al., 2010].

VQ techniques are also known as data clustering methods in various disciplines. VQ is an unsupervised learning procedure widely used in many applications. Data clustering methods are classified as hard and soft clustering methods. These are centroid-based parametric clustering techniques based on a large class of distortion functions known as Bregman divergences [Arindam Banerjee et al., 2005].

In hard clustering, each data point belongs to exactly one partition, giving a disjoint partitioning of the data, whereas in soft clustering each data point has a certain probability of belonging to each of the partitions. Parametric clustering algorithms are very popular because of their simplicity and scalability. Hard clustering algorithms are based on iterative relocation schemes. The classical K-means algorithm is based on the Euclidean distance and the Linde-Buzo-Gray (LBG) algorithm is based on the Itakura-Saito distance. The performance of vector quantization techniques depends on the existence of a good codebook of representative vectors.

In this paper, an efficient VQ codebook design algorithm is proposed, known as the Modified K-meansLBG algorithm. This algorithm provides superior performance compared to the classical K-means algorithm and the LBG algorithm.
Section 2 describes the theoretical details of VQ. Section 3 elaborates on the LBG algorithm. Section 4 explains the classical K-means algorithm. Section 5 presents the proposed modified K-meansLBG algorithm. The experimental work and results are discussed in Section 6, and concluding remarks are made at the end of the paper.

2. Vector Quantization
The main objective of data compression is to reduce the bit rate for transmission or data storage while maintaining the necessary fidelity of the data. The feature vector may represent a number of different possible speech coding parameters, including linear predictive coding (LPC) coefficients and cepstrum coefficients. VQ can be considered a generalization of scalar quantization to the quantization of a vector. The VQ encoder encodes a given set of k-dimensional data vectors with a much smaller subset. The subset $C$ is called a codebook and its elements $C_i$ are called codewords, codevectors, reproducing vectors, prototypes or design samples. Only the index $i$ is transmitted to the decoder. The decoder has the same codebook as the encoder, and decoding is performed by a table look-up procedure.

The commonly used vector quantizers are based on the nearest neighbor and are called Voronoi or nearest neighbor vector quantizers. Both the classical K-means algorithm and the LBG algorithm belong to the class of nearest neighbor quantizers.

A key component of pattern matching is the measurement of dissimilarity between two feature vectors. A metric satisfies three properties: positive definiteness, symmetry and the triangle inequality. Each metric has three main characteristics: computational complexity, analytical tractability and feature evaluation reliability. The metrics used in speech processing are derived from the Minkowski metric [J. S. Pan et al. 1996].
The Minkowski metric can be expressed as

$$D_p(X, Y) = \sqrt[p]{\sum_{i=1}^{k} |x_i - y_i|^p}$$

where $X = \{x_1, x_2, \ldots, x_k\}$ and $Y = \{y_1, y_2, \ldots, y_k\}$ are vectors and $p$ is the order of the metric.

The city block (Manhattan) metric, with p = 1, and the Euclidean metric, with p = 2, are special cases of the Minkowski metric. These metrics are essential in the computation of distortion measures.

A distortion measure is one that satisfies only the positive definiteness property of the measurement of dissimilarity. There are many kinds of distortion measures, including the Euclidean distance, the Itakura distortion measure and the likelihood distortion measure.

The Euclidean metric [Tzu-Chuen Lu et al., 2010] is commonly used because it fits the physical meaning of distance or distortion. In some applications the square root is not required; to avoid computing it, the squared Euclidean metric is employed instead of the Euclidean metric in pattern matching.

The quadratic metric [Marcel R. Ackermann et al., 2010] is an important generalization of the Euclidean metric. The weighted cepstral distortion measure is a kind of quadratic metric. Its key feature is that it equalizes the importance of each dimension of the cepstrum coefficients. In speech recognition, the weighted cepstral distortion can be used to equalize the performance of the recognizer across different talkers. The Itakura-Saito distortion measure [Arindam Banerjee et al., 2005] computes a distortion between two input vectors by using their spectral densities.

The performance of the vector quantizer can be evaluated by a distortion measure $D$, which is a non-negative cost $D(X_j, \hat{X}_j)$ associated with quantizing any input vector $X_j$ with a reproduction vector $\hat{X}_j$. Usually, the Euclidean distortion measure is used.
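As a small illustration of the metrics above, the following Python sketch computes the Minkowski metric for a given order p, the squared Euclidean distortion, and the nearest-neighbor encoding step that returns a codeword index. The function names and the NumPy array layout (a codebook with one codeword per row) are assumptions made for this example, not part of the paper.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski metric of order p between two k-dimensional vectors."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

def squared_euclidean(x, y):
    """Squared Euclidean distortion; skips the square root of the p = 2 case."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.dot(diff, diff))

def nearest_codeword(x, codebook):
    """Encoder step: index of the codeword with the smallest squared distortion."""
    diffs = np.asarray(codebook, dtype=float) - np.asarray(x, dtype=float)
    return int(np.argmin(np.sum(diffs ** 2, axis=1)))
```

Since squaring is monotonic on non-negative values, the squared Euclidean distortion selects the same nearest codeword as the Euclidean metric, which is why the cheaper form can be used for pattern matching.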
The performance of a quantizer is always quantified by an average distortion $D_v = E[D(X_j, \hat{X}_j)]$ between the input vectors and the final reproduction vectors, where $E$ represents the expectation operator. Normally, the performance of the quantizer is good if the average distortion is small.

Another important factor in VQ is the codeword search problem. As the vector dimension increases, the search complexity increases exponentially; this is a major limitation of VQ codeword search. It limits the fidelity of coding for real-time transmission.

A full search algorithm is applied in VQ encoding and recognition. It is a time-consuming process when the codebook size is large.

In the codeword search problem, assigning a codeword to the test vector means finding the codeword with the smallest distortion to the test vector among all codewords. Given one codeword $C_t$ and the test vector $X$ in the k-dimensional space, the distortion under the squared Euclidean metric can be expressed as

$$D(X, C_t) = \sum_{i=1}^{k} (x_i - c_{t,i})^2$$

where $C_t = \{c_{t,1}, c_{t,2}, \ldots, c_{t,k}\}$ and $X = \{x_1, x_2, \ldots, x_k\}$.

There are three ways of generating and designing a good codebook, namely the random method, pair-wise nearest neighbor clustering and the splitting method. A wide variety of distortion functions, such as the squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. There are three major procedures in VQ, namely codebook generation, the encoding procedure and the decoding procedure. The LBG algorithm is an efficient VQ clustering algorithm. This algorithm is based either on a known probabilistic model or on a long training sequence of data.

3. Linde-Buzo-Gray (LBG) algorithm
The LBG algorithm is also known as the Generalised Lloyd algorithm (GLA). It is an easy and rapid algorithm used as an iterative nonvariational technique for designing the scalar quantizer.
It is a vector quantization algorithm that derives a good codebook by finding the centroids of partitioned sets and the minimum distortion partitions. In LBG, the initial centroids are generated from all of the training data by applying the splitting procedure. All the training vectors are incorporated into the training procedure at each iteration. The GLA algorithm is applied to generate the centroids, and the centroids cannot change with time. The GLA algorithm starts from one cluster and then separates this cluster into two clusters, four clusters, and so on until N clusters are generated, where N is the desired number of clusters or codebook size. Therefore, the GLA algorithm is a divisive clustering approach. The classification at each stage uses the full-search algorithm to find the nearest centroid to each vector. The LBG is a local optimization procedure and has been solved through various approaches such as directed search binary-splitting, mean-distance-ordered partial codebook search [Linde et al., 1980, Modha et al., 2003], enhanced LBG, GA-based algorithms [Tzu-Chuen Lu et al., 2010, Chin-Chen Chang et al. 2006], an evolution-based tabu search approach [Shih-Ming Pan et al., 2007], and a codebook generation algorithm [Buzo et al., 1980].

In speech processing, vector quantization is used, for instance, for bit-stream reduction in coding or in tasks based on HMMs. Initialization is an important step in codebook estimation. Two approaches used for initialization are random initialization, where L vectors are randomly chosen from the training vector set, and initialization from a smaller codebook by splitting the chosen vectors.

The detailed LBG algorithm for an unknown distribution is described below.
Step 1: Design a 1-vector codebook. Set m = 1.
Calculate the centroid

$$C_1 = \frac{1}{T} \sum_{j=1}^{T} X_j$$

where $T$ is the total number of data vectors.
Step 2: Double the size of the codebook by splitting. Divide each centroid $C_i$ into two close vectors $C_{2i-1} = C_i \times (1 + \delta)$ and $C_{2i} = C_i \times (1 - \delta)$, $1 \le i \le m$. Here $\delta$ is a small fixed perturbation scalar. Let m = 2m. Set n = 0, where n is the iteration count.
Step 3: Nearest-Neighbor Search. Find the nearest neighbor to each data vector. Put $X_j$ in the partitioned set $P_i$ if $C_i$ is the nearest neighbor to $X_j$.
Step 4: Find Average Distortion. After obtaining the partitioned sets $P = (P_i, 1 \le i \le m)$, set n = n + 1. Calculate the overall average distortion

$$D_n = \frac{1}{T} \sum_{i=1}^{m} \sum_{j=1}^{T_i} D(X_j^{(i)}, C_i)$$

where $P_i = \{X_1^{(i)}, X_2^{(i)}, \ldots, X_{T_i}^{(i)}\}$.
Step 5: Centroid Update. Find the centroids of all disjoint partitioned sets $P_i$ by

$$C_i = \frac{1}{T_i} \sum_{j=1}^{T_i} X_j^{(i)}$$

Step 6: Iteration 1. If $(D_{n-1} - D_n)/D_n > \epsilon$, go to step 3; otherwise go to step 7. Here $\epsilon$ is a threshold.
Step 7: Iteration 2. If m = N, then take the codebook $C_i$ as the final codebook; otherwise, go to step 2. Here N is the codebook size.

The LBG algorithm has limitations: the quantized space is not optimized at each iteration, and the algorithm is very sensitive to initial conditions.

4. Classical K-means Algorithm
The K-means algorithm was proposed by MacQueen in 1967. It is a well-known iterative procedure for solving clustering problems. It is also known as the C-means algorithm or the basic ISODATA clustering algorithm. It is an unsupervised learning procedure that classifies objects automatically by the criterion of minimum distance to the centroid. In the K-means algorithm, the initial centroids are selected randomly from the training vectors and the training vectors are added to the training procedure one at a time. The training procedure terminates when the last vector is incorporated. The K-means algorithm is used to group data, and the groups can change with time. The algorithm can be applied to VQ codebook design.
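Before turning to the K-means steps, the seven LBG steps of Section 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses the squared Euclidean distortion, assumes the requested codebook size is a power of two (the codebook doubles on each split), and the names `lbg`, `nearest`, `delta`, `eps` and `max_iter` are illustrative.

```python
import numpy as np

def nearest(data, codebook):
    """Step 3: nearest codeword index and squared distortion for each vector."""
    d = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1), d.min(axis=1)

def lbg(data, size, delta=0.01, eps=1e-3, max_iter=100):
    """Grow a codebook by repeated splitting; size is assumed to be a power of two."""
    data = np.asarray(data, dtype=float)
    codebook = data.mean(axis=0, keepdims=True)            # Step 1: global centroid
    while len(codebook) < size:                             # Step 7: stop when m = N
        # Step 2: split every centroid into two perturbed copies.
        # (Multiplicative splitting assumes no centroid is exactly zero.)
        codebook = np.vstack([codebook * (1 + delta), codebook * (1 - delta)])
        prev = None
        for _ in range(max_iter):
            labels, dist = nearest(data, codebook)          # Step 3
            distortion = dist.mean()                        # Step 4: D_n
            for i in range(len(codebook)):                  # Step 5: centroid update
                members = data[labels == i]
                if len(members):                            # keep empty cells unchanged
                    codebook[i] = members.mean(axis=0)
            # Step 6: stop once (D_{n-1} - D_n) / D_n <= eps (written without division)
            if prev is not None and (prev - distortion) <= eps * distortion:
                break
            prev = distortion
    return codebook
```

The inner loop (assign each vector to its nearest centroid, then recompute the centroids) is the same kind of relocation iteration that the K-means description relies on; what LBG adds is the initialization by splitting rather than by random selection.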
The K-means algorithm can be described as follows:
Step 1: Randomly select N training data vectors as the initial codevectors $C_i$, $i = 1, 2, \ldots, N$, from the T training data vectors.
Step 2: For each training data vector $X_j$, $j = 1, 2, \ldots, T$, assign $X_j$ to the partitioned set $S_i$ if $i = \arg\min_{i'} D(X_j, C_{i'})$.
Step 3: Compute the centroid of each partitioned set, that is, the codevector, using

$$C_i = \frac{1}{|S_i|} \sum_{X_j \in S_i} X_j$$

where $|S_i|$ denotes the number of training data vectors in the partitioned set $S_i$. If there is no change in the clustering centroids, then terminate the program; otherwise, go to step 2.

The K-means algorithm has several limitations. Firstly, it requires a large amount of data to determine the clusters. Secondly, the number of clusters, K, must be determined beforehand. Thirdly, if the amount of data is small, it is difficult to find the real clusters. Lastly, each attribute is assumed to have the same weight, so it is quite difficult to know which attribute contributes more to the grouping process.

It is an algorithm that classifies or groups objects, based on their attributes or features, into K groups, where K is a positive integer. The grouping is done by minimizing the sum of squared distances between the data and the corresponding cluster centroids. The main aim of K-means clustering is to classify the data. In practice, the number of iterations is generally much less than the number of points.

5. Proposed Modified K-meansLBG Algorithm
The objective of the proposed algorithm is to overcome the limitations of the LBG algorithm and the K-means algorithm. The proposed modified K-meansLBG algorithm combines the advantages of the LBG and K-means algorithms. The K-meansLBG algorithm is described below:
Step 1: Randomly select N training data vectors as the initial codevectors.
Step 2: Calculate the number of centroids.
Step 3: Double the size of the codebook by splitting.
Step 4: Nearest-Neighbor Search.
Step 5: Find Average Distortion.
Step 6: Update the centroids. If there is no change in the clustering centroids, terminate the program; otherwise go to step 1.

6. Experimentation and Results
The TI46 database [NIST, 1991] is used for experimentation. There are 16 speakers, 8 male and 8 female, and each speaker utters each word 26 times. The vocabulary is the 10 English digits (1 to 9 and 0), sampled at a rate of 8000 Hz, which gives 4160 utterances in total, of which 1600 were used for training and the remainder for testing. A feature vector of 12-dimensional Linear Predictive Coding cepstrum coefficients was obtained and provided as input to vector quantization to find codewords for each class.

Five figures show comparative graphs of the distortion measure obtained using the LBG algorithm, the K-means algorithm and the proposed K-meansLBG algorithm. The distortion measure obtained by the proposed algorithm is the smallest of the three.

The proposed modified K-meansLBG algorithm gives the minimum distortion measure compared to the K-means algorithm and the LBG algorithm, which increases the performance of the system. The smaller distortion yields better performance than either algorithm, with an improvement of about 1% to 4% for every digit.

7. Conclusion
Vector Quantization techniques are efficiently applied in the development of speech recognition systems. This paper proposed a novel vector quantization algorithm called the K-meansLBG algorithm. It is used efficiently to increase the performance of the speech recognition system. The recognition accuracy obtained using the K-meansLBG algorithm is better than that of the K-means and LBG algorithms.

Abstract: The main task of vector quantization is to generate a good codebook.
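II. Forms of Speech-Recognition-Based Text Translation
Text translation based on speech recognition generally takes one of the following three forms:
1. Single-mode speech translation.
2. Dual-mode speech translation.
3. Text translation.
III. Advantages of Speech-Recognition-Based Text Translation
Text translation based on speech recognition has the following advantages:
1. High accuracy. Because it uses advanced speech recognition technology that has been optimized with a variety of algorithms, the recognition rate of the conversion is very high.
2. High speed. With speech recognition, translation is immediate, so real-time translation is possible and people can communicate more smoothly.
3. Rich modes. Text translation based on speech recognition supports multiple modes, and users can choose whichever mode suits their needs for the best experience.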
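This article introduces the basic concepts and applications of Natural Language Processing (NLP), and its importance in modern society.
NLP is the discipline that studies how to enable computers to understand and process human language.
NLP is widely applied in many fields, including machine translation, speech recognition, sentiment analysis and information retrieval.
In machine translation, for example, NLP techniques let a computer automatically translate one language into another, which makes cross-language communication easier.
In sentiment analysis, NLP can help identify the emotional tendency of a text and analyze the user's sentiment.
As artificial intelligence develops, NLP is becoming increasingly important in society.
Advances in NLP not only improve communication between computers and humans but can also bring innovation and progress to many industries.
In the future, NLP is expected to play a larger role in healthcare, finance, intelligent customer service and other fields.
In short, NLP is a frontier technical discipline that matters greatly for improving communication between computers and humans and for driving social progress.
As it develops, NLP is expected to have an even greater impact and to be widely applied across fields.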
AI-Assisted Automated Speech Evaluation Technology (English-Chinese bilingual edition)

Automated speech evaluation technology refers to the process of using computers and artificial intelligence to evaluate human speech. With the continuing development of artificial intelligence, it has come into wide use. This article discusses AI-assisted automated speech evaluation in depth, covering technical principles, application scenarios, and future development trends.

1. Technical principle

The core technologies of automated speech evaluation are speech signal processing and artificial intelligence. The main process consists of four steps: speech signal acquisition, preprocessing, feature extraction, and evaluation.

First, speech signal acquisition uses dedicated equipment or software to record the human voice and convert the speech signal into a digital signal for computer processing.

Second, preprocessing applies noise reduction, filtering, and similar operations to the raw speech signal to improve the accuracy and stability of subsequent processing.

Third, feature extraction is an important part of automated speech evaluation. It analyzes the speech signal and extracts features such as frequency, volume, and pitch for subsequent model training and evaluation.

Finally, evaluation is the ultimate goal of the technology: artificial intelligence techniques analyze and judge the speech signal, assess the quality and accuracy of the speech, and provide corresponding feedback and suggestions for improvement.

2. Application scenarios

Automated speech evaluation technology has a wide range of application scenarios, such as:

1. Education: automated speech evaluation can be used for students' oral examinations and pronunciation correction, helping students improve their spoken English.
2. Business: automated speech evaluation can be used for speech recognition and speech synthesis in customer service calls, improving the quality and efficiency of customer service.
3. Medicine: automated speech evaluation can be used to support doctors' diagnoses and to monitor patients' voices, helping doctors diagnose and intervene early.
4. Security: automated speech evaluation can be used for voiceprint recognition and identity verification, improving security and preventing fraud.

3. Future development trend

The future development of automated speech evaluation technology can be expected along the following lines:

1. Continued development of artificial intelligence: as artificial intelligence advances, the accuracy and efficiency of automated speech evaluation will improve greatly, and the quality and accuracy of speech can be judged more precisely.
2. Development of multi-modal speech evaluation: multi-modal evaluation can combine information from various sensors and modalities, such as video and gesture, to evaluate and analyze speech more comprehensively and to improve the accuracy and stability of the evaluation.
3. Development of personalized speech evaluation: personalized evaluation can provide feedback tailored to the user's voice characteristics and individual needs, helping users improve their speaking ability and skills faster.
4. Application of speech evaluation in smart hardware: with the spread of smart hardware such as smart homes and smart speakers, automated speech evaluation will be used more widely to give users a more intelligent and natural interactive experience.

In short, automated speech evaluation technology will be applied in a wider range of fields and will bring more intelligent and convenient services to society. At the same time, continued technical innovation and application exploration are needed to keep the technology developing.

Automated speech evaluation technology refers to the process of using computers and artificial intelligence to evaluate human speech.
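The acquisition, preprocessing, and feature extraction steps described under "Technical principle" can be illustrated with a small sketch. This is only a hedged illustration, not the method of any particular evaluation system: the function names, the 20 ms frame and 10 ms hop, and the synthetic 200 Hz tone standing in for recorded speech are all assumptions. It frames a digitized signal and computes two of the features the text mentions, volume (as RMS energy) and pitch (via a simple autocorrelation peak search).

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into fixed-length, overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def rms_volume(frame):
    """Root-mean-square amplitude, used here as a stand-in for volume."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def autocorr_pitch(frame, sample_rate, fmin=50, fmax=400):
    """Estimate pitch as the lag with the largest autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def extract_features(samples, sample_rate, frame_ms=20, hop_ms=10):
    """Return one feature dictionary per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [{"volume": rms_volume(f),
             "pitch_hz": autocorr_pitch(f, sample_rate)}
            for f in frame_signal(samples, frame_len, hop)]


# Hypothetical input: one second of a 200 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE + 0.1)
        for n in range(SAMPLE_RATE)]
features = extract_features(tone, SAMPLE_RATE)
```

A real system would add noise reduction before framing and would feed the per-frame features into a trained scoring model; both are left out here to keep the sketch short.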
Speech Recognition: Chinese-English Parallel Translation of Foreign Literature

Speech Recognition
Victor Zue, Ron Cole, & Wayne Ward
MIT Laboratory for Computer Science, Cambridge, Massachusetts, USA
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

Parameters        Range
Speaking Mode     Isolated words to continuous speech
Speaking Style    Read speech to spontaneous speech
Enrollment        Speaker-dependent to speaker-independent
Vocabulary        Small (<20 words) to large (>20,000 words)
Language Model    Finite-state to context-sensitive
Perplexity        Small (<10) to large (>100)
SNR               High (>30 dB) to low (<10 dB)
Transducer        Voice-cancelling microphone to telephone

Table: Typical parameters used to characterize the capability of speech recognition systems
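Broader and more in-depth discussions of these areas can be found in [12], [16], [19], [23], [24], [27], [32], [33], [41], [42], and [47]. Readers may also consult the following websites: the IEEE History Center's Automatic Speech Synthesis and Recognition section, and the Saras Institute's History of Speech Language Technology Project at t .

Infrastructure

Moore's Law describes the long-term progress of computer development and predicts that the amount of computation achievable for a given cost doubles every 12 to 18 months, with a comparable reduction in the cost of memory.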
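To overcome the limitations of statistical language models, structural information from natural language (syntactic and semantic structure) is incorporated into the language model to improve it, which has led to research on language model adaptation [10].
Idea: language model adaptation usually combines predictions from a background text corpus with text corpora from the same period or the same domain as the speech, in order to train a more robust adaptive language model.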
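The passage above does not say how the background corpus and the in-domain text are combined. One common option, used here purely as an assumption, is linear interpolation of n-gram probabilities. The sketch below trains two add-one-smoothed bigram models on tiny made-up corpora, interpolates them with a hypothetical weight `lam`, and compares perplexity on an in-domain sentence. All corpora, names, and the weight 0.7 are illustrative.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count history words and bigrams, with sentence boundary markers."""
    histories, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        histories.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return histories, bigrams


def bigram_prob(model, vocab_size, prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    histories, bigrams = model
    return (bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size)


def interpolated_prob(background, domain, vocab_size, lam, prev, word):
    """Adapted model: weighted mix of in-domain and background estimates."""
    return (lam * bigram_prob(domain, vocab_size, prev, word)
            + (1 - lam) * bigram_prob(background, vocab_size, prev, word))


def perplexity(prob_fn, sentences):
    """Exponential of the average negative log probability per word."""
    log_sum, count = 0.0, 0
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            log_sum += math.log(prob_fn(prev, word))
            count += 1
    return math.exp(-log_sum / count)


# Hypothetical toy corpora.
background_text = ["show me the news", "play the music", "call the office"]
domain_text = ["show me flights to boston", "show me flights to denver"]
test_text = ["show me flights to boston"]

vocab = {w for s in background_text + domain_text for w in s.split()} | {"</s>"}
V = len(vocab)
bg = train_bigram(background_text)
dm = train_bigram(domain_text)

pp_background = perplexity(lambda p, w: bigram_prob(bg, V, p, w), test_text)
pp_adapted = perplexity(
    lambda p, w: interpolated_prob(bg, dm, V, 0.7, p, w), test_text)
```

On this toy data the adapted model assigns the in-domain sentence a lower perplexity than the background model alone, which is the effect adaptation aims for. In practice the interpolation weight would be tuned on held-out data.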
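d. Training stage
There is still no reliable closed-form solution for estimating HMM parameter values in speech recognition. Iterative training is generally used instead: each iteration starts from the previous HMM and optimizes the parameters under the maximum likelihood criterion [7].
Classical algorithms: the expectation-maximization (EM) algorithm and the forward-backward (Baum-Welch, BW) algorithm.
Characteristics of each:
The EM algorithm effectively handles HMM parameter updates when the data are incomplete because the state sequence is hidden.
The BW algorithm accumulates statistics from the training data very efficiently, providing the information needed to update the HMM parameters.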
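The iterative training described above can be sketched for a small discrete-observation HMM. This is a minimal illustration under stated assumptions, not a production trainer: real recognizers typically use continuous output densities, many training utterances, and scaled or log-domain arithmetic. The function names and the toy parameters `pi`, `A`, `B`, and `obs` are hypothetical. The forward and backward passes accumulate the statistics, and one re-estimation step updates the parameters from them.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(o_1..o_t, state_t = i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1}..o_T | state_t = i)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(obs, pi, A, B):
    """One EM iteration; returns updated parameters and the old likelihood."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    # gamma[t][i]: posterior probability of being in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: posterior probability of the transition i -> j at time t.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
            / likelihood for j in range(n)] for i in range(n)]
          for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1))
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
              / sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood


# Hypothetical two-state model with two observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1, 0, 1, 1, 0, 0]

new_pi, new_A, new_B, before = baum_welch_step(obs, pi, A, B)
after = sum(forward(obs, new_pi, new_A, new_B)[-1])
```

Repeating `baum_welch_step` until the likelihood stops improving gives the iterative maximum likelihood training the text describes. Each step is guaranteed not to decrease the likelihood of the training data.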
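Research progress
1. Speech input technology: speech input technology is the foundation of natural language translation.
2. Machine translation technology: machine translation is another core technology and a key part of a natural language translation system.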
Chinese-English Parallel Translation of Foreign Literature (document contains the English original and a Chinese translation)

Speech Recognition

1 Defining the Problem

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as commands & control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in section.

Speech recognition systems can be characterized by many parameters, some of the more important of which are shown in Figure. An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies, and is much more difficult to recognize than speech read from script. Some systems require speaker enrollment (a user must provide samples of his or her speech before using them), whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words.

The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar.

One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (see section for a discussion of language modeling in general and perplexity in particular).
Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and the placement of the microphone.

Table: Typical parameters used to characterize the capability of speech recognition systems

Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the phoneme. At word boundaries, contextual variations can be quite dramatic, making "gas shortage" sound like "gash shortage" in American English, and "devo andare" sound like "devandare" in Italian.

Second, acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.

Figure shows the major components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10-20 msec (see sections and 11.3 for signal representation and digital signal processing, respectively). These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.

Figure: Components of a typical speech recognition system.

Speech recognition systems attempt to model the sources of variability described above in several ways.
At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent features of the signal, and de-emphasize speaker-dependent characteristics. At the acoustic phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent acoustic models to those of the current speaker during system use (see section). Effects of linguistic context at the acoustic phonetic level are typically handled by training separate models for phonemes in different contexts; this is called context dependent acoustic modeling.

Word level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations of words, as well as effects of dialect and accent, are handled by allowing search algorithms to find alternate paths of phonemes through these networks. Statistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words.

The dominant recognition paradigm in the past fifteen years is known as hidden Markov models (HMM). An HMM is a doubly stochastic model, in which the generation of the underlying phoneme string and the frame-by-frame, surface acoustic realizations are both represented probabilistically as Markov processes, as discussed in sections and 11.2. Neural networks have also been used to estimate the frame based scores; these scores are then integrated into HMM-based system architectures, in what has come to be known as hybrid systems, as described in section 11.5.

An interesting feature of frame-based HMM systems is that speech segments are identified during the search process, rather than explicitly.
An alternate approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance in several tasks.

2 State of the Art

Comments about the state-of-the-art need to be made in the context of specific applications which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary is small, the entire word can be modeled as a single unit. Such an approach is not practical for large vocabularies, where word models must be built up from subword units.

Performance of speech recognition systems is typically described in terms of word error rate E, defined as:

$$E = \frac{S + I + D}{N} \times 100\%$$

where N is the total number of words in the test set, and S, I, and D are the total number of substitutions, insertions, and deletions, respectively.

The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to drop by a factor of 2 every two years. Substantial progress has been made in the basic technology, leading to the lowering of barriers to speaker independence, continuous speech, and large vocabularies. There are several factors that have contributed to this rapid progress. First, there is the coming of age of the HMM. HMM is powerful in that, with the availability of training data, the parameters of the model can be trained automatically to give optimal performance.

Second, much effort has gone into the development of large speech corpora for system development, training, and testing. Some of these corpora are designed for acoustic phonetic research, while others are highly task specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing.
These corpora permit researchers to quantify the acoustic cues important for phonetic contrasts and to determine parameters of the recognizers in a statistically meaningful way. While many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency (ARPA) to spur human language technology development among its contractors, they have nevertheless gained world-wide acceptance (e.g., in Canada, France, Germany, Japan, and the U.K.) as standards on which to evaluate speech recognition.

Third, progress has been brought about by the establishment of standards for performance evaluation. Only a decade ago, researchers trained and tested their systems using locally collected data, and had not been very careful in delineating training and testing sets. As a result, it was very difficult to compare performance across systems, and a system's performance typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, thus contributing to greater reliability in monitoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13 respectively).

Finally, advances in computer technology have also indirectly influenced our progress. The availability of fast computers with inexpensive mass storage capabilities has enabled researchers to run many large scale experiments in a short amount of time. This means that the elapsed time between an idea and its implementation and evaluation is greatly reduced.
In fact, speech recognition systems with reasonable performance can now run in real time using high-end workstations without additional hardware, a feat unimaginable only a few years ago.

One of the most popular, and potentially most useful, tasks with low perplexity (PP=11) is the recognition of digits. For American English, speaker-independent recognition of digit strings spoken continuously and restricted to telephone bandwidth can achieve an error rate of 0.3% when the string length is known.

One of the best known moderate-perplexity tasks is the 1,000-word so-called Resource Management (RM) task, in which inquiries can be made concerning various naval vessels in the Pacific ocean. The best speaker-independent performance on the RM task is less than 4%, using a word-pair language model that constrains the possible words following a given word (PP=60). More recently, researchers have begun to address the issue of recognizing spontaneously generated speech. For example, in the Air Travel Information Service (ATIS) domain, word error rates of less than 3% have been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15.

High perplexity tasks with a vocabulary of thousands of words are intended primarily for the dictation application. After working on isolated-word, speaker-dependent systems for many years, the community has since 1992 moved towards very-large-vocabulary (20,000 words and more), high-perplexity (PP≈200), speaker-independent, continuous speech recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences drawn from North American business news.

With the steady improvements in speech recognition performance, systems are now being deployed within telephone and cellular networks in many countries. Within the next few years, speech recognition will be pervasive in telephone networks around the world.
There are tremendous forces driving the development of the technology; in many countries, touch tone penetration is low, and voice is the only option for controlling automated services. In voice dialing, for example, users can dial 10-20 telephone numbers by voice (e.g., "call home") after having enrolled their voices by saying the words associated with telephone numbers. AT&T, on the other hand, has installed a call routing system using speaker-independent word-spotting technology that can detect a few key phrases (e.g., "person to person", "calling card") in sentences such as: "I want to charge it to my calling card."

At present, several very large vocabulary dictation systems are available for document generation. These systems generally require speakers to pause between words. Their performance can be further enhanced if one can apply constraints of the specific domain such as dictating medical reports.

Even though much progress is being made, machines are a long way from recognizing conversational speech. Word recognition rates on telephone conversations in the Switchboard corpus are around 50%. It will be many years before unlimited vocabulary, speaker-independent continuous dictation capability is realized.

3 Future Directions

In 1992, the U.S. National Science Foundation sponsored a workshop to identify the key research challenges in the area of human language technology, and the infrastructure needed to support the work. The key research challenges are summarized in. Research in the following areas for speech recognition was identified:

Robustness:
In a robust system, performance degrades gracefully (rather than catastrophically) as conditions become more different from those under which it was trained. Differences in channel characteristics and acoustic environment should receive particular attention.

Portability:
Portability refers to the goal of rapidly designing, developing and deploying systems for new applications.
At present, systems tend to suffer significant degradation when moved to a new task. In order to return to peak performance, they must be trained on examples specific to the new task, which is time-consuming and expensive.
Adaptation: How can systems continuously adapt to changing conditions (new speakers, microphone, task, etc.) and improve through use? Such adaptation can occur at many levels in systems: subword models, word pronunciations, language models, and so on.
Language Modeling: Current systems use statistical language models to help reduce the search space and resolve acoustic ambiguity. As vocabulary size grows and other constraints are relaxed to create more habitable systems, it will be increasingly important to get as much constraint as possible from language models, perhaps by incorporating syntactic and semantic constraints that cannot be captured by purely statistical models.
Confidence Measures: Most speech recognition systems assign scores to hypotheses for the purpose of rank ordering them. These scores do not provide a good indication of whether a hypothesis is correct or not, only that it is better than the other hypotheses. As we move to tasks that require actions, we need better methods to evaluate the absolute correctness of hypotheses.
Out-of-Vocabulary Words: Systems are designed for use with a particular set of words, but system users may not know exactly which words are in the system vocabulary. This leads to a certain percentage of out-of-vocabulary words in natural conditions. Systems must have some method of detecting such out-of-vocabulary words, or they will end up mapping a word from the vocabulary onto the unknown word, causing an error.
Spontaneous Speech: Systems that are deployed for real use must deal with a variety of spontaneous speech phenomena, such as filled pauses, false starts, hesitations, ungrammatical constructions and other common behaviors not found in read speech.
Development on the ATIS task has resulted in progress in this area, but much work remains to be done.
Prosody: Prosody refers to acoustic structure that extends over several segments or words. Stress, intonation, and rhythm convey important information for word recognition and about the user's intentions (e.g., sarcasm, anger). Current systems do not capture prosodic structure. How to integrate prosodic information into the recognition architecture is a critical question that has not yet been answered.
Modeling Dynamics: Systems assume a sequence of input frames, which are treated as if they were independent. But it is known that perceptual cues for words and phonemes require the integration of features that reflect the movements of the articulators, which are dynamic in nature. How to model dynamics and incorporate this information into recognition systems is an unsolved problem.
Speech Recognition
1 Defining the Problem
Speech recognition is the process of converting an audio signal, captured by a telephone or a microphone, into a series of messages.