Segmentation and classification of broadcast news audio


1. Introduction
Broadcast news data as distributed by the Linguistic Data Consortium (LDC) is a set of complete television or radio shows such as CNN Headline News or MPR Marketplace. The various types of speech present in a typical broadcast are denoted by the focus conditions (Table 1). In contrast to F0, F1 and F5, the conditions F3, F4 and FX are sometimes severely distorted by non-speech sounds. F2 most commonly labels segments containing telephone interviews. Since using different approaches for the various conditions has been shown to be effective, the segmentation stage also has to label segments according to bandwidth and speaker gender. The transcription of broadcast news requires techniques to deal with the large variety of data types present. Of particular importance are the presence of varying channel types (wide-band and telephone); data portions containing speech and/or music, often simultaneously; and a wide
[Figure 1: segment processing steps. Recoverable block labels: Audio Stream; Coding; Audio Type Classification; M, MS, S, T Tagged Segments; Adapt Models Using MLLR; Relabelling; Discard Music; Discarded Music Segments.]
The audio classification uses Gaussian mixture models (GMM) with 1024 mixture components and diagonal covariance matrices. Four models are trained with approximately 3 hours of audio each. The models used are S for pure wide-band speech, T for pure narrow-band speech, MS for music and speech, and M for music. The use of a separate model for music and speech has been beneficial in decreasing the loss of speech data. Using an additional model for the various other background noises present in the data (e.g. helicopter or battlefield noise) turned out to be impossible due to the lack of training data and the large diversity in the nature of the data. Moreover, some of the material contains background speakers or speech in different languages, which adds to the confusion with the speech classes.
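As a rough illustration of the classification step described above, the following sketch scores a block of feature frames against diagonal-covariance GMMs and picks the class with the highest summed log-likelihood. The toy two-component models, the two-dimensional features and all function names are hypothetical. The paper's models use 1024 components, and the block-level maximum-likelihood decision shown here is an assumption rather than the authors' confirmed decision rule.

```python
import math

def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, gmm):
    """Log-likelihood of frame x under a GMM given as (weight, mean, var) triples."""
    logs = [math.log(w) + diag_gauss_logpdf(x, m, v) for w, m, v in gmm]
    top = max(logs)
    # log-sum-exp keeps the mixture sum numerically stable
    return top + math.log(sum(math.exp(l - top) for l in logs))

def classify(frames, models):
    """Return (best_label, scores) using summed per-frame log-likelihoods."""
    scores = {
        label: sum(gmm_loglik(x, gmm) for x in frames)
        for label, gmm in models.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical toy models standing in for the four trained GMMs (S, T, MS, M).
MODELS = {
    "S":  [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "T":  [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 6.0], [1.0, 1.0])],
    "MS": [(0.5, [-5.0, 5.0], [1.0, 1.0]), (0.5, [-6.0, 6.0], [1.0, 1.0])],
    "M":  [(0.5, [5.0, -5.0], [1.0, 1.0]), (0.5, [6.0, -6.0], [1.0, 1.0])],
}

if __name__ == "__main__":
    label, _ = classify([[0.1, -0.2], [0.9, 1.1]], MODELS)
    print(label)
```

Working in the log domain avoids underflow when many frames are accumulated, which matters far more with 1024 components and real feature dimensions than in this toy setting.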
SEGMENTATION AND CLASSIFICATION OF BROADCAST NEWS AUDIO
T. Hain P.C. Woodland
Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, UK.
Table 1: Broadcast news focus conditions.
variety of background noises from, for example, live outside broadcasts. Furthermore, if a transcription system is to deal with complete broadcasts, it must be able to handle a continuous audio stream containing a mixture of any of the above data types. To deal with this type of data, transcription systems generally use a segmentation stage that splits the audio stream into discrete portions of the same audio type for further processing. Ideally, segments should be homogeneous (i.e. same speaker and channel conditions) and should contain the complete utterance by the particular speaker. Because of the large variety of audio types present, the data segments should be tagged with additional information so that subsequent stages can perform suitable processing. If possible, non-speech segments should be completely removed from the audio stream, but it is important not to delete segments that in fact contain speech to be transcribed. The following section gives a brief system overview, which is followed by a more detailed description and evaluation. Finally, recognition experiments using the 1997 HTK broadcast news transcription system are presented on the November 1997 broadcast news evaluation set (BNeval97).
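The idea of splitting a stream into portions of the same audio type and dropping non-speech can be sketched as follows. This is a minimal illustration that assumes per-frame labels are already available. The function name, the frame-index output format and the choice to discard only the label M are hypothetical, and the sketch does not model the speaker-change or gender labelling that a full system needs.

```python
from itertools import groupby

def frames_to_segments(labels, discard=("M",)):
    """Collapse per-frame labels into (start_frame, end_frame, label) runs.

    end_frame is exclusive. Runs whose label is in `discard` are dropped,
    but their frames still advance the running frame index.
    """
    segments = []
    start = 0
    for label, run in groupby(labels):
        end = start + sum(1 for _ in run)
        if label not in discard:
            segments.append((start, end, label))
        start = end
    return segments

if __name__ == "__main__":
    frame_labels = ["S", "S", "S", "M", "M", "T", "T", "S"]
    print(frames_to_segments(frame_labels))
```

Keeping the original frame indices, rather than renumbering after the discard, lets later stages map each retained segment back to its position in the broadcast.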
2. System Overview
The overall segment processing can be subdivided into audio type classification and segmentation. The segment processing steps are shown in Figure 1. The classification stage labels audio frames according to bandwidth and discards non-speech segments, while the segmentation step
Focus  Description
F0     baseline broadcast speech (clean, planned)
F1     spontaneous broadcast speech (clean)
F2     low fidelity speech (wideband/narrowband)
F3     speech in the presence of background music
F4     speech under degraded acoustical conditions
F5     non-native speakers (clean, planned)
FX     all other speech (e.g. spontaneous non-native)
{th223, pcw}@eng.cam.ac.uk
ABSTRACT
Broadcast news contains a wide variety of different speakers and audio conditions (channel and background noise). This paper describes a segmentation, gender detection and audio classification scheme and presents experimental results on the DARPA 1997 broadcast news evaluation set. The goal of the segment processing algorithm is to convert the continuous input audio stream into reasonably sized speech segments, each labelled as either narrow-band or wide-band speech and as belonging to either a female or a male speaker. Ideally, each segment should be homogeneous (i.e. same speaker and channel conditions), and the removal of non-speech segments should be designed to minimise incorrectly discarded speech. Since the algorithm was developed to enable recognition of broadcast news data, recognition performance has been tested using various segmentation sources. On the evaluation data, the first pass of the HTK broadcast news transcription system using gender-independent HMMs gave 23.0% word error using this segmentation scheme, compared to 22.9% using manual segmentation and 23.9% using the CMU segmentation software distributed as the reference algorithm by NIST.