Speech Recognition and Deep Learning


Speech Recognition + Deep Neural Networks?
[Pipeline diagram: DSP → DNN → Language Model]
Speech Recognition + Deep Neural Networks!
[Pipeline diagram: DSP → DNN → Language Model]
3 months of work: 10% relative word error rate reduction on Voice Search.
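The headline metric here is word error rate (WER). As a concrete illustration, here is a minimal Python sketch of how WER is typically computed: word-level edit distance (substitutions, insertions, deletions) normalized by reference length. The function name and example strings are my own, not from the talk.

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # cost of deleting all reference words up to i
    for j in range(len(h) + 1):
        d[0][j] = j  # cost of inserting all hypothesis words up to j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

wer("call mom now", "call dad now")  # one substitution in three words: 1/3
```

A "10% relative reduction" means, for example, going from 15.0% to 13.5% WER.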
[Diagram: W = word sequence; the p(O|W) term is the Acoustic Model, the p(W) term is the Language Model.]
Speech Recognition as Probabilistic Transduction
[Diagram: cascade from Meaning and Sentence down to audio, with the Language Model and Acoustic Model as transduction steps.]
Decision Trees
• Also an (ugly) FST composition operation.
• These 1,000-10,000 states are the targets the acoustic model will attempt to predict, and the hidden states of the Hidden Markov Model (HMM).
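The bullet above is the crux of the hybrid approach: the acoustic model is a classifier over a few thousand tied HMM states. A minimal sketch of the idea (the toy sizes, random weights, single linear layer, and uniform priors are my own assumptions for illustration): a network produces a posterior over states for each frame, and dividing by the state priors yields scaled likelihoods that plug into the HMM.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 1000   # toy tied-state inventory (real systems: 1,000-10,000)

def state_posteriors(frame_features, W, b):
    """Softmax over HMM states for one frame of acoustic features."""
    logits = frame_features @ W + b
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

W = rng.normal(size=(40, n_states)) * 0.01  # toy weights, one linear layer
b = np.zeros(n_states)
frame = rng.normal(size=40)                 # one frame of acoustic features

post = state_posteriors(frame, W, b)
priors = np.full(n_states, 1.0 / n_states)  # toy uniform state priors
scaled_likelihood = post / priors           # proportional to p(frame | state)
```

Real hybrid systems estimate the priors from state alignments on training data rather than assuming them uniform.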
Speech Recognition
[Pipeline diagram: DSP → Feature Extraction → Acoustic Model → Language Model]
Speech Recognition + Deep Neural Networks?
[Pipeline diagram: DSP → Feature Extraction → Acoustic Model → Language Model, with a neural network as the feature extractor]
• TANDEM (2000):
Hynek Hermansky, Daniel P. W. Ellis, and Sangita Sharma. "Tandem connectionist feature extraction for conventional HMM systems." ICASSP 2000.
Transduction via Composition (Sentence → Word → Phoneme)
• Map output labels of the Lexicon to input labels of the Language Model.
[Composed FST fragment, arcs labeled "output input / weight": ϵ ϵ/0.5, ϵ silence/0.5, ϵ k/1.0, ϵ ao/1.0, ...]
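The label-matching step above can be sketched in a few lines of Python. This is a deliberately simplified composition: it handles epsilon outputs on the left machine only and omits the epsilon-sequencing filter that a real WFST library (e.g. OpenFst) needs; the representation and toy machines are my own illustration.

```python
from collections import deque

def compose(A, B, start_a, start_b):
    """Naive FST composition: match A's output labels to B's input labels.
    A, B: dict state -> list of (in_label, out_label, weight, next_state)."""
    arcs = {}
    queue = deque([(start_a, start_b)])
    seen = {(start_a, start_b)}
    while queue:
        sa, sb = queue.popleft()
        out = arcs.setdefault((sa, sb), [])
        for inp, outp, w, na in A.get(sa, []):
            if outp == 'eps':
                # A emits nothing: advance A alone.
                tgt = (na, sb)
                out.append((inp, 'eps', w, tgt))
                if tgt not in seen:
                    seen.add(tgt); queue.append(tgt)
            else:
                # Match A's output against B's input labels.
                for inp2, outp2, w2, nb in B.get(sb, []):
                    if inp2 == outp:
                        tgt = (na, nb)
                        out.append((inp, outp2, w * w2, tgt))
                        if tgt not in seen:
                            seen.add(tgt); queue.append(tgt)
    return arcs

# Toy lexicon path for "call" (phonemes in, word out on the last arc)
lex = {0: [('k', 'eps', 1.0, 1)], 1: [('ao', 'eps', 1.0, 2)], 2: [('l', 'call', 1.0, 3)]}
# Toy one-arc grammar accepting "call"
lm = {0: [('call', 'call', 0.5, 1)]}
C = compose(lex, lm, 0, 0)  # phonemes k ao l now map straight to "call"/0.5
```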
• Hybrid Systems (1993):
Nelson Morgan, Herve Bourlard, Steve Renals, Michael Cohen, and Horacio Franco. "Hybrid neural network/hidden Markov model systems for continuous speech recognition." International Journal of Pattern Recognition and Artificial Intelligence 7, no. 04 (1993): 899-916.
• Bidirectional Recurrent Neural Networks (1997):
Mike Schuster and Kuldip K. Paliwal. "Bidirectional recurrent neural networks." IEEE Transactions on Signal Processing 45, no. 11 (1997): 2673-2681.
Similar Stories across the Industry
• Microsoft: Li Deng, Frank Seide, Dong Yu
• IBM: Tara Sainath, Brian Kingsbury
• Google: Andrew Senior, Georg Heigold, Marc’Aurelio Ranzato
• University of Toronto: Geoff Hinton, George Dahl, Abdel-rahman Mohamed
And many others...
Deep Neural Networks for Acoustic Modeling in Speech Recognition. Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, Brian Kingsbury. IEEE Signal Processing Magazine, Vol. 29, No. 6, November 2012.
Word Lexicon (Word → Phoneme)
• call: k ao l
• dad: d ae d
• mom: m aa m
[Lexicon FST: three parallel paths from a shared start state, arcs labeled "output input / weight": ϵ k/0.33 → ϵ ao/1.0 → call l/1.0; ϵ d/0.33 → ϵ ae/1.0 → dad d/1.0; ϵ m/0.33 → ϵ aa/1.0 → mom m/1.0]
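The lexicon diagram can be generated programmatically. A minimal sketch (function name and arc representation are my own): each pronunciation becomes a chain of arcs, with a uniform 1/N word prior on the first arc and the word label emitted on the last arc, matching the slide's picture.

```python
def lexicon_fst(prons):
    """Toy lexicon transducer: phonemes in, word label out on the final arc.
    prons: dict word -> list of phonemes.
    Returns arcs as (src, dst, in_label, out_label, weight)."""
    arcs = []
    n_words = len(prons)
    next_id = 1  # state 0 is the shared start state
    for word, phones in prons.items():
        src = 0
        for i, ph in enumerate(phones):
            dst = next_id; next_id += 1
            last = (i == len(phones) - 1)
            out = word if last else 'eps'            # word emitted on last arc
            w = 1.0 / n_words if i == 0 else 1.0     # uniform prior on first arc
            arcs.append((src, dst, ph, out, w))
            src = dst
    return arcs

arcs = lexicon_fst({'call': ['k', 'ao', 'l'],
                    'dad':  ['d', 'ae', 'd'],
                    'mom':  ['m', 'aa', 'm']})
```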
Language Model (Sentence → Word)
• Toy Example: "call mom", "call dad"
[FST fragment: two paths, call call/0.5 → mom mom/1.0 and call call/0.5 → dad dad/1.0]
• In reality:
– 100M+ tokens, estimated using billions of training examples.
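To make the toy grammar concrete, here is a sketch that assigns each sentence its probability. The dict encoding and function name are my own; the grammar below is the determinized equivalent of the slide's two-path machine, with the 0.5 split pushed onto the second word.

```python
# Toy grammar: states map words to (probability, next_state); state 2 is final.
lm = {
    0: {'call': (1.0, 1)},
    1: {'mom': (0.5, 2), 'dad': (0.5, 2)},
    2: {},
}

def sentence_prob(lm, words, state=0, final=2):
    """Probability of a word sequence: product of arc probabilities,
    zero if any word is missing or we do not end in the final state."""
    p = 1.0
    for w in words:
        if w not in lm[state]:
            return 0.0
        q, state = lm[state][w]
        p *= q
    return p if state == final else 0.0

sentence_prob(lm, ['call', 'mom'])  # 0.5
```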
• Hierarchical Neural Networks (1998):
Jürgen Fritsch and Michael Finke. "ACID/HNN: Clustering hierarchies of neural networks for context-dependent connectionist acoustic modeling." ICASSP 1998.
http://vincent.vanhoucke.com

A personal story
Summer 2011...
The (better) Internship Story
• Geoff Hinton student:
Navdeep Jaitly and Geoffrey E. Hinton. "Learning a better representation of speech soundwaves using restricted Boltzmann machines." ICASSP 2011.
• Competitive results on TIMIT.
• Deep Belief Network (DBN, generative model) used to pre-train a Deep Neural Network (DNN, discriminative model).
• NO feature engineering.
Neural Networks for Speech in the 00’s
? ...
Speech Recognition
A very quick and inaccurate introduction...

[Diagram: Audio Waveform O → Word Sequence W]
max_W p(W|O) = max_W p(O|W) p(W)   (Bayes' rule; p(O) is constant with respect to W)
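The decoding rule above reads: score every candidate word sequence by the product of its acoustic-model and language-model probabilities and keep the best. A toy sketch, with all scores hypothetical and kept in the log domain so the product becomes a sum:

```python
import math

# Hypothetical scores for illustration only:
# log p(O|W) from an acoustic model, log p(W) from a language model.
log_p_O_given_W = {'call mom': -12.0, 'call dad': -11.5}
log_p_W        = {'call mom': math.log(0.5), 'call dad': math.log(0.5)}

def decode(candidates):
    """argmax_W p(O|W) p(W), computed as a sum of log-probabilities."""
    return max(candidates, key=lambda W: log_p_O_given_W[W] + log_p_W[W])

decode(['call mom', 'call dad'])  # -> 'call dad' (better acoustic score, equal prior)
```

Real decoders search this argmax over a graph of hypotheses rather than an explicit list.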
• Recurrent Neural Networks (1992):
Tony Robinson. "A real-time recurrent error propagation network word recognition system." ICASSP 1992.
• Hybrid Systems
• ~O(100) states per phoneme (context-dependent variants such as ao_onset.1, ao_onset.2, ..., ao_steady.1, ao_steady.2, ...).
– Too few: not enough modeling power.
– Too many: overfitting.
Mehryar Mohri, Fernando Pereira, and Michael Riley. "Weighted finite-state transducers in speech recognition." Computer Speech & Language 16, no. 1 (2002): 69-88.
Acoustic Modeling and Deep Learning
June 19th, 2013
Vincent Vanhoucke

A quick introduction
• Former Technical Lead of the Speech Recognition Quality Research team.
• Now a member of the Deep Learning Infrastructure team.
• Join and optimize the end-to-end graph.
[Composed Lexicon ∘ Language Model FST for "call mom" / "call dad"; arcs ("output input / weight") include call l/1.0, ϵ ϵ/0.5, ϵ silence/0.5, ϵ d/0.5, ϵ m/0.5, ϵ ae/1.0, ϵ aa/1.0, dad d/1.0, mom m/1.0]
• Other operations: Minimization, Determinization, Epsilon removal, Weight pushing.
• Well-understood algorithms run over these graphs (e.g. Viterbi, forward-backward).
• Powerful algorithms to optimize these graphs.
• Pretty pictures: each arc is labeled "Output Label : Input Label / Weight".
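Viterbi, mentioned above, finds the single best path through such a weighted graph. A toy sketch over a small labeled graph; the representation, the emission function, and all the numbers are my own illustration, not the talk's:

```python
import math

def viterbi(arcs, start, final, n_frames, emit):
    """Best label sequence over n_frames frames of a weighted label graph.
    arcs: list of (src, dst, label, transition_prob);
    emit(label, t): likelihood of frame t under that label."""
    # best[t][state] = (log score, (previous_state, label)) backpointer table
    best = [{start: (0.0, None)}] + [dict() for _ in range(n_frames)]
    for t in range(n_frames):
        for src, dst, label, p in arcs:
            if src in best[t]:
                score = best[t][src][0] + math.log(p) + math.log(emit(label, t))
                if dst not in best[t + 1] or score > best[t + 1][dst][0]:
                    best[t + 1][dst] = (score, (src, label))
    # Trace back the winning label sequence from the final state.
    labels, state = [], final
    for t in range(n_frames, 0, -1):
        _, (prev, label) = best[t][state]
        labels.append(label)
        state = prev
    return list(reversed(labels))

arcs = [(0, 0, 'a', 0.5), (0, 1, 'b', 0.5), (1, 1, 'b', 1.0)]
emit = lambda label, t: 0.9 if label == 'ab'[t] else 0.1  # frame 0 looks like 'a', frame 1 like 'b'
viterbi(arcs, 0, 1, 2, emit)  # -> ['a', 'b']
```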
[Transduction cascade: Audio → (Feature Extraction) → Frame → (Acoustic Model) → State → Phoneme → (Lexicon) → Word; NLP continues above the word level.]
Weighted Finite State Transducers (FSTs)
• One of many ways to express this ‘probabilistic transduction’ idea.
• A mathematically sound way to express probabilistic graphs and algorithms over them.
A historical detour
Everything old is new again...
Neural Networks for Speech in the 90’s
• Time-Delay Neural Networks (1989):
Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang. "Phoneme recognition using time-delay neural networks." IEEE Transactions on Acoustics, Speech and Signal Processing 37, no. 3 (1989): 328-339.
Connecting to the Acoustic Model (Phoneme → State)
[FST fragment: ao expands into ao_onset → ao_steady → ao_offset, each arc with weight 1.0; the remaining arc labels are ϵ]
• Phonemes are mapped to a richer inventory of phonetic states.
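The expansion in the diagram can be sketched mechanically. A minimal illustration (the function is my own; whether the phoneme label sits on the first or the last arc is a diagram detail — here I emit it on the first arc and use ϵ for the rest):

```python
def expand_phoneme(phoneme, subphones=('onset', 'steady', 'offset')):
    """Expand a phoneme into its chain of HMM states, as in the slide:
    ao -> ao_onset, ao_steady, ao_offset, each arc with weight 1.0.
    Returns arcs as (out_label, state_label, weight)."""
    arcs = []
    for i, part in enumerate(subphones):
        out = phoneme if i == 0 else 'eps'  # phoneme emitted once per chain
        arcs.append((out, f'{phoneme}_{part}', 1.0))
    return arcs

expand_phoneme('ao')
# [('ao', 'ao_onset', 1.0), ('eps', 'ao_steady', 1.0), ('eps', 'ao_offset', 1.0)]
```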
Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition. Navdeep Jaitly, Patrick Nguyen, Andrew Senior, Vincent Vanhoucke. Interspeech 2012.