14 - Audio Feature Extraction and Representation
Zero Crossing Rate (ZCR)
Count the number of times that the audio waveform crosses the zero axis.
ZCR is one of the most indicative and robust measures for discerning unvoiced speech. Typically, unvoiced speech has a low volume but a high ZCR. Using ZCR and volume together, one can prevent low-energy unvoiced speech frames from being classified as silent.
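As a minimal sketch of the definition above, the following Python function counts sign changes within one frame. The function name `zero_crossing_rate`, the normalization by the number of adjacent sample pairs, and the treatment of exact zeros as positive are assumptions made for illustration; the slides do not specify them.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (hypothetical helper)."""
    frame = np.asarray(frame, dtype=float)
    # Assumption: exact zeros count as positive, so a zero sample is not counted twice.
    signs = np.where(frame >= 0, 1, -1)
    crossings = np.count_nonzero(signs[1:] != signs[:-1])
    return crossings / (len(frame) - 1)

# One 20 ms frame at 16 kHz: a low tone crosses zero far less often than a high tone.
sr = 16000
t = np.arange(320) / sr
low_tone = np.sin(2 * np.pi * 100 * t + 0.1)
high_tone = np.sin(2 * np.pi * 3000 * t + 0.1)
print(zero_crossing_rate(low_tone), zero_crossing_rate(high_tone))
```

A frame with low volume but a high value from this function would, following the text above, be a candidate for unvoiced speech rather than silence.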
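Sampling rate, time scale, samples, format, encoding, ...
Physical sample level
1 Introduction to Audio Signals
Introduction to Audio Signals
Audio is a carrier of information. Research on audio signals generally covers three aspects: how audio signals are produced, how they propagate, and how they are perceived by humans. Audio retrieval involves several key steps: feature extraction, audio segmentation, audio recognition and classification, and indexing and retrieval. The audio heard by the human ear is a continuous analog signal, whereas a computer can only process digitized information. An analog audio signal must therefore be discretized, i.e. sampled, into discrete sample points that the computer can process. To recover the signal correctly, the sampling frequency used in digitization must be higher than twice the signal bandwidth (Nyquist theorem).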
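A small numerical illustration of the sampling condition, assuming NumPy and two hypothetical helpers (`sampled_tone`, `dominant_frequency`) that are not part of the slides. A 1000 Hz tone sampled at 8000 Hz is below half the sampling frequency and is recovered at its true frequency, while a 5000 Hz tone violates the condition and shows up at 3000 Hz instead.

```python
import numpy as np

def sampled_tone(freq_hz, sample_rate, n_samples=8000):
    """Samples of a sine tone (hypothetical helper for this illustration)."""
    t = np.arange(n_samples) / sample_rate
    return np.sin(2 * np.pi * freq_hz * t)

def dominant_frequency(signal, sample_rate):
    """Frequency of the largest peak in the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

fs = 8000
print(dominant_frequency(sampled_tone(1000, fs), fs))  # tone below fs / 2
print(dominant_frequency(sampled_tone(5000, fs), fs))  # tone above fs / 2 is aliased
```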
Subband Energy Ratio
The ratio of the energy in a frequency subband to the total energy
When the sampling rate is 22050Hz, the frequency ranges for the four subbands are 0-630Hz, 630-1720Hz, 1720-4400Hz and 4400-11025Hz.
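The ratio can be sketched as follows, using the four subband edges listed above for a 22050 Hz sampling rate. The function name `subband_energy_ratios` and the use of an unwindowed FFT power spectrum are assumptions for illustration.

```python
import numpy as np

# Subband edges in Hz for a 22050 Hz sampling rate, as listed above.
SUBBAND_EDGES_HZ = [0, 630, 1720, 4400, 11025]

def subband_energy_ratios(frame, sample_rate=22050, edges=SUBBAND_EDGES_HZ):
    """Energy in each subband divided by the total energy of the frame."""
    frame = np.asarray(frame, dtype=float)
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = power.sum()
    ratios = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        is_last = i == len(edges) - 2
        # The last band includes its upper edge so that every bin is counted once.
        upper = (freqs <= hi) if is_last else (freqs < hi)
        mask = (freqs >= lo) & upper
        ratios.append(power[mask].sum() / total if total > 0 else 0.0)
    return np.array(ratios)

# A 1000 Hz tone should fall almost entirely in the second subband (630-1720 Hz).
t = np.arange(2205) / 22050
print(subband_energy_ratios(np.sin(2 * np.pi * 1000 * t)))
```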
0-200s: speech, 201-350: music, 351-450: environment sound
Spectral Rolloff
The 95th percentile of the power spectral distribution. This measure distinguishes voiced from unvoiced speech. The value is higher for right-skewed distributions.
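A minimal sketch of the rolloff computation, assuming the power spectrum is taken from an unwindowed FFT of the frame. The function name `spectral_rolloff` and the `fraction` parameter are illustrative choices; only the 95% value comes from the text.

```python
import numpy as np

def spectral_rolloff(frame, sample_rate, fraction=0.95):
    """Frequency below which `fraction` of the frame's spectral power lies."""
    frame = np.asarray(frame, dtype=float)
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    cumulative = np.cumsum(power)
    if cumulative[-1] == 0:
        return 0.0
    # First bin at which the cumulative power reaches the requested fraction.
    index = np.searchsorted(cumulative, fraction * cumulative[-1])
    return float(freqs[min(index, len(freqs) - 1)])

sr = 16000
t = np.arange(320) / sr
low_only = np.sin(2 * np.pi * 500 * t)
low_and_high = low_only + np.sin(2 * np.pi * 6000 * t)
print(spectral_rolloff(low_only, sr), spectral_rolloff(low_and_high, sr))
```

Adding high-frequency energy pushes the rolloff frequency upward, which is the behavior the feature relies on.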
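Short-Time Average Energy
Short-Time Energy: Preliminaries
The K samples of an audio signal are divided into overlapping audio frames. The overlap between adjacent frames is typically 30% to 50%; all short-time frames in audio processing are obtained this way. A discrete signal sequence is truncated by multiplying it by a window function. Let $x(i : i+N)$ be a short-time frame of N samples and $w(i)$ a window function of length N. Truncating $x(i : i+N)$ with $w(i)$ yields the sequence $\tilde{x}(i : i+N)$, that is,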
Pitch
Valleys exist in voiced and music frames and vanish in noise and unvoiced frames
Spectral Features
Spectrum: the Fourier transform of the samples in a frame. The difference among these three clips is more noticeable in the frequency domain than in the waveform domain.
$$X_N(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\theta})\, W(e^{j(\omega-\theta)})\, d\theta$$
where X and W denote the spectra of the signal and of the window function, respectively.
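The relation between windowing and the spectra can be checked numerically in its discrete form: the DFT of the product of signal and window equals the circular convolution of their DFTs divided by the frame length. The sketch below assumes NumPy, a Hamming window, and a hypothetical helper `circular_convolve`; none of these are prescribed by the slides.

```python
import numpy as np

def circular_convolve(a, b):
    """out[k] = sum over l of a[l] * b[(k - l) mod n] (hypothetical helper)."""
    n = len(a)
    idx = np.arange(n)
    return np.array([np.sum(a * b[(k - idx) % n]) for k in range(n)])

rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(N)   # an arbitrary test frame
w = np.hamming(N)            # assumption: Hamming window

spectrum_of_product = np.fft.fft(x * w)
convolved_spectra = circular_convolve(np.fft.fft(x), np.fft.fft(w)) / N
print(np.allclose(spectrum_of_product, convolved_spectra))
```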
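Short-Time Energy: Concept
Short-time average energy is the average energy accumulated by the sample signal within one short-time audio frame. Suppose a continuous audio stream x yields K samples, and these K samples are divided into M short-time frames with a 50% overlap.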
Unvoiced speech has a high proportion of its energy in the high-frequency range of the spectrum. This is a measure of the “skewness” of the spectral shape.
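Applications of Zero Crossing Rate
Detecting the start and end of speech: Consonants are heavily concentrated at the beginning and end of a speech signal, so the zero crossing rate rises markedly in those parts. The zero crossing rate can therefore be used to decide whether speech has started or ended.
Distinguishing speech from music: Most of the energy of music signals is concentrated in the low-frequency range, and their zero crossing rate does not show abrupt rises or drops. The zero crossing rate is therefore sometimes used to distinguish speech from music.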
Clips are either fixed-length (1 to 2 seconds) or variable-length. Both frames and clips may overlap with their predecessors.
Frame-Level Features
Most frame-level features are inherited from speech signal processing. They fall into time-domain features and frequency-domain features. We use N to denote the frame length and sn(i) to denote the i-th sample in the n-th audio frame.
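Sound production
Sound propagation
Sound reception
Introduction to Audio Signal Processing
In audio processing, it is generally assumed that the characteristics of an audio signal change slowly over a very short time, so features extracted within that interval remain stable. The discrete audio signal is therefore divided into units of a fixed length for processing, that is, the discrete audio samples are grouped into audio frames. In audio processing, no "key audio frame" is defined for a continuous audio signal. Suppose a continuous audio stream x is sampled into the discrete signal $x = (x(1), \ldots, x(n), \ldots, x(K))$. This means K samples are obtained from the continuous signal, where $x(n)$ is the sample taken at time n. The K samples are divided into L groups; each group is one frame and contains $[K/L]$ samples. From the $[K/L]$ samples of each frame, nFeature features can be extracted, giving an $L \times \text{nFeature}$ set of audio frame features. The "short-time" feature method for audio signals works on sets of samples to extract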
$$\tilde{x}(i : i+N) = w(n)\, x(i : i+N)$$
In this way, the N samples $x(i : i+N)$ of each original short-time frame are transformed into $\tilde{x}(i : i+N)$.
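Framing with overlap and multiplication by a window can be sketched as follows. The helper names `frame_signal` and `apply_window`, the Hamming window default, and the choice to drop trailing samples that do not fill a whole frame are assumptions for illustration.

```python
import numpy as np

def frame_signal(x, frame_len, overlap=0.5):
    """Split x into overlapping frames of frame_len samples (hypothetical helper).

    Trailing samples that do not fill a complete frame are dropped (assumption).
    """
    x = np.asarray(x, dtype=float)
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([x[s:s + frame_len] for s in starts])

def apply_window(frames, window=None):
    """Multiply every frame by a window of the same length."""
    if window is None:
        window = np.hamming(frames.shape[1])  # assumption: Hamming window
    return frames * window

frames = frame_signal(np.arange(10), frame_len=4, overlap=0.5)
print(frames)
print(apply_window(frames))
```

With a 50% overlap and a frame length of 4, each frame starts 2 samples after the previous one.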
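Short-Time Energy: Preliminaries
Because multiplication in the time domain corresponds to convolution in the frequency domain, the windowing operation can also be expressed as follows: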
A clip consists of a sequence of frames, and clip-level features usually characterize how frame-level features change over a clip.
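One simple way to summarize how a frame-level feature behaves over a clip is to take its mean and standard deviation per clip. This is only an illustrative choice; the slides do not name specific clip-level statistics, and the function name `clip_level_stats` is hypothetical.

```python
import numpy as np

def clip_level_stats(frame_features, frames_per_clip):
    """Mean and standard deviation of a frame-level feature over each clip.

    Frames left over after the last complete clip are ignored (assumption).
    """
    values = np.asarray(frame_features, dtype=float)
    n_clips = len(values) // frames_per_clip
    clips = values[: n_clips * frames_per_clip].reshape(n_clips, frames_per_clip)
    return clips.mean(axis=1), clips.std(axis=1)

means, stds = clip_level_stats([1, 1, 1, 3, 5, 7], frames_per_clip=3)
print(means, stds)
```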
Frame and clip
features, rather than representing the data by features extracted from each key sample, as is done in video processing. Why?
Frame and clip
Short-term frame level vs. long-term clip level
A frame is defined as a group of neighboring samples which last about 10 to 40 ms
The volume of an audio signal depends on the gain of the recording and digitizing devices. We may normalize the volume of a frame by the maximum volume of some previous frames.
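The normalization can be sketched as a running maximum over a window of recent frames. The function name `normalize_volumes`, the `history` length, and the decision to include the current frame in the maximum (so that the result never exceeds 1) are assumptions; the slides only say "some previous frames".

```python
import numpy as np

def normalize_volumes(volumes, history=50):
    """Divide each frame volume by the maximum over the last `history` frames."""
    volumes = np.asarray(volumes, dtype=float)
    normalized = np.zeros_like(volumes)
    for n in range(len(volumes)):
        start = max(0, n - history)
        # Assumption: the current frame is included so the ratio stays at or below 1.
        peak = volumes[start:n + 1].max()
        normalized[n] = volumes[n] / peak if peak > 0 else 0.0
    return normalized

print(normalize_volumes([1.0, 2.0, 1.0, 4.0], history=2))
```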
Spectral Flux
Spectral flux (SF) is defined as the average variation of the spectrum between two adjacent frames.
The SF values of speech are higher than those of music. The SF of environmental sound is among the highest and changes more dramatically than that of the other two signal types.
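The slides do not give the exact formula, so the sketch below uses one common variant as an assumption: the mean squared difference of log-magnitude spectra between adjacent frames. The function name `spectral_flux` and the small offset `delta` (to avoid taking the logarithm of zero) are illustrative.

```python
import numpy as np

def spectral_flux(frames, delta=1e-6):
    """Mean squared change of the log-magnitude spectrum between adjacent frames."""
    frames = np.asarray(frames, dtype=float)
    log_magnitude = np.log(np.abs(np.fft.rfft(frames, axis=1)) + delta)
    changes = np.diff(log_magnitude, axis=0)  # frame n minus frame n - 1
    return float(np.mean(changes ** 2))

rng = np.random.default_rng(0)
steady = np.tile(np.sin(2 * np.pi * 5 * np.arange(64) / 64), (4, 1))
varying = rng.standard_normal((4, 64))
print(spectral_flux(steady), spectral_flux(varying))
```

A signal whose frames are identical has zero flux, while a signal whose spectrum changes from frame to frame has a positive value.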
Spectral Features
Let Sn(w) denote the power spectrum (i.e. the squared magnitude of the spectrum) of frame n. If we think of w as a random variable and of Sn(w), normalized by the total power, as the probability density function of w, we can define the mean and standard deviation of w.
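Treating the normalized power spectrum as a probability distribution over frequency, the mean and standard deviation can be sketched as follows. The function name `spectral_mean_and_std` and the use of an unwindowed FFT are assumptions for illustration.

```python
import numpy as np

def spectral_mean_and_std(frame, sample_rate):
    """Mean and standard deviation of frequency, weighted by normalized power."""
    frame = np.asarray(frame, dtype=float)
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = power.sum()
    if total == 0:
        return 0.0, 0.0
    density = power / total  # plays the role of the probability density of w
    mean = np.sum(freqs * density)
    std = np.sqrt(np.sum((freqs - mean) ** 2 * density))
    return float(mean), float(std)

sr = 16000
t = np.arange(320) / sr
print(spectral_mean_and_std(np.sin(2 * np.pi * 1000 * t), sr))
```

For a single pure tone the mean sits at the tone frequency and the standard deviation is close to zero; broadband signals give a larger standard deviation.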
Pitch
Pitch is the fundamental frequency of an audio waveform. Normally only voiced speech and harmonic music have well-defined pitch. Temporal estimation methods rely on computing the short-time autocorrelation function Rn(l) or the average magnitude difference function (AMDF) An(l).
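Both temporal functions can be sketched in a few lines. The helper names, the unnormalized autocorrelation sum, the search range `fmin` to `fmax`, and the choice to estimate pitch from the autocorrelation peak are assumptions for illustration, not the method prescribed by the slides.

```python
import numpy as np

def autocorrelation(frame, max_lag):
    """R(l) = sum over n of s(n) * s(n + l), for l = 0 .. max_lag."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.sum(frame[:n - l] * frame[l:]) for l in range(max_lag + 1)])

def amdf(frame, max_lag):
    """A(l) = mean of |s(n + l) - s(n)|, for l = 0 .. max_lag."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.mean(np.abs(frame[l:] - frame[:n - l])) for l in range(max_lag + 1)])

def estimate_pitch(frame, sample_rate, fmin=50.0, fmax=500.0):
    """Pitch in Hz from the autocorrelation peak within the allowed lag range."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    r = autocorrelation(frame, max_lag)
    best_lag = min_lag + int(np.argmax(r[min_lag:max_lag + 1]))
    return sample_rate / best_lag

sr = 16000
tone = np.sin(2 * np.pi * 200 * np.arange(640) / sr)  # 40 ms frame, 200 Hz
print(estimate_pitch(tone, sr))
print(amdf(tone, 100)[80], amdf(tone, 100)[40])  # valley at one period, high at half a period
```

For a periodic frame the AMDF drops to a valley at lags equal to the period, while the autocorrelation peaks there.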
For audio clips with sampling frequency 16kHz, how many samples are in a 20ms audio frame?
Within an audio frame we can assume that the audio signal is stationary.
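The frame-length question above is a direct multiplication of sampling frequency and frame duration. A minimal sketch, with a hypothetical helper name `samples_per_frame`:

```python
def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of samples in a frame of the given duration."""
    return int(round(sample_rate_hz * frame_ms / 1000))

print(samples_per_frame(16000, 20))  # 16000 samples/s * 0.020 s = 320 samples
```

At the same sampling frequency, the 10 to 40 ms range for frames corresponds to 160 to 640 samples.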
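Assume that each short-time frame and the window function both have length N. For the m-th short-time frame, the short-time average energy can be computed as
$$E_m = \frac{1}{N} \sum_{n} \left[ x(n)\, w(n-m) \right]^2$$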
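A minimal sketch of the short-time average energy, computed frame by frame: each frame is multiplied by the window, squared, summed, and divided by the frame length N. The function name `short_time_energy`, the `hop` parameter, and the rectangular window default are assumptions for illustration.

```python
import numpy as np

def short_time_energy(x, frame_len, hop, window=None):
    """Average energy of each windowed frame of frame_len samples."""
    x = np.asarray(x, dtype=float)
    if window is None:
        window = np.ones(frame_len)  # assumption: rectangular window
    energies = []
    for start in range(0, len(x) - frame_len + 1, hop):
        windowed = x[start:start + frame_len] * window
        energies.append(np.sum(windowed ** 2) / frame_len)
    return np.array(energies)

# Frames of length 4 with a 50% overlap (hop of 2 samples).
print(short_time_energy(2.0 * np.ones(8), frame_len=4, hop=2))
```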
Volume (Loudness, Energy)
Volume is a reliable indicator for silence detection, which may help to segment an audio sequence and to determine clip boundaries. It is approximated by the root mean square of the signal magnitude within each frame.
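The root-mean-square approximation can be sketched as follows. The function names and the silence `threshold` value are hypothetical; a practical threshold depends on the recording gain.

```python
import numpy as np

def frame_volume(frame):
    """Root mean square of the samples in one frame."""
    frame = np.asarray(frame, dtype=float)
    return float(np.sqrt(np.mean(frame ** 2)))

def is_silent(frame, threshold=0.01):
    """Flag a frame as silent when its volume is below the threshold (assumed value)."""
    return frame_volume(frame) < threshold

tone = np.sin(2 * np.pi * 5 * np.arange(100) / 100)
print(frame_volume(tone), is_silent(tone), is_silent(np.zeros(100)))
```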
Audio Signal Feature Extraction and Representation
Visual VS Audio
Visual (70~80%): Color, Texture, Shape, Motion, Location
Audio (10%): tone, volume, melody
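Hierarchical Description of Audio Content
Music narrative, audio object description, speech recognition text, events, ...
Semantic level
Perceptual features: tone, pitch, melody, rhythm. Acoustic features: energy, zero crossing rate, LPC coefficients
Acoustic feature level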