cch2-LanguageModel


Language Model
What is a Language Model?
Want to build models which assign scores to sentences.
p(“Today is Wednesday”) ≈ 0.001
p(“Today Wednesday is”) ≈ 0.0000000000001
p(“The eigenvalue is positive”) ≈ 0.00001
Alternatively, we want to compute P(w5 | w1, w2, w3, w4), the probability of a word given some previous words.
Context/topic dependent!
Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model.
Where Are Language Models Used?
Provides a principled way to quantify the uncertainties associated with natural language
Language Models can be used for:
Speech recognition
Pinyin-to-Hanzi conversion
Spelling correction
Optical character recognition
Machine translation
Natural language generation
Information retrieval
Any problem that involves sequences?
Other Noisy-Channel Processes
Handwriting recognition
OCR
Spelling Correction
Probabilistic Language Modeling
Assigns a probability p(t) to a word sequence t = w1 w2 w3 w4 w5 w6 …
How to compute this joint probability p(t)? Intuition: let's rely on the Chain Rule of Probability.
The chain rule leads to a history-based model: we predict following things from past things.
$$p(t) = p(w_1 w_2 \cdots w_n) = p(w_1)\, p(w_2 \mid w_1) \cdots p(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1})$$
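For example, applied to the first sentence above:
$$p(\text{Today is Wednesday}) = p(\text{Today}) \cdot p(\text{is} \mid \text{Today}) \cdot p(\text{Wednesday} \mid \text{Today, is})$$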
Unfortunately
There are a lot of possible sentences
In general, we’ll never be able to get enough data to compute the statistics for those longer prefixes
Same problem we had for the strings themselves
Independence Assumption
Make the simplifying assumption
P(rabbit | Yesterday I saw a) = P(rabbit | a)
Or maybe
P(rabbit | Yesterday I saw a) = P(rabbit | saw, a)
That is, the probability in question is to some degree independent of its earlier history.
Independence Assumption
This particular kind of independence assumption is called a Markov assumption after the Russian mathematician Andrei Markov.
Markov Assumption
So for each component in the product replace with the approximation (assuming a prefix of N-1)
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$
Bigram version
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$$
N-Gram Language Models
N-gram solution: assume each word depends only on a short linear history (a Markov assumption).
No loss of generality to break the sentence probability down with the chain rule.
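With the bigram Markov assumption, the chain-rule factorization becomes (taking w_0 to be the start symbol <s>):
$$P(w_1 w_2 \cdots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$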
The N-Gram Model: Terminology Notes
N-gram model: “N元语法模型” is the Chinese term for “N-Gram Model”.

An N-Gram is a string of N words; in Chinese it may be called an “N元组” (N-tuple of words) or “N元词串” (N-word string).

A language model built on N-Grams is called an N-gram model (“N元语法模型”, N-Gram Model).

“Gram” is not an abbreviation of “Grammar”.

There is no such term as “N-Grammar” in English.

In Chinese, “N元语法” on its own sometimes refers to an “N-Gram” (the word string) and sometimes to the “N-Gram Model”;
which is meant has to be judged from context.

Unigram Models
Simplest case: unigrams
Generative process: pick a word, pick a word, …
As a graphical model:
To make this a proper distribution over sentences, we have to generate a special STOP symbol last. (Why?)
Bigram Models
Big problem with unigrams: P(the the the the) >> P(I like ice cream)!
Obviously, conditioning on more context should help – in probabilistic terms, we’re using weaker conditional independence assumptions (what’s the cost?)
Condition on previous word:
Estimating bigram probabilities:
The maximum likelihood estimate
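Written as a count ratio, the maximum likelihood estimate of a bigram probability is the relative frequency:
$$P_{\mathrm{MLE}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$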
<s> I am Sam </s>
 <s> Sam I am </s>
 <s> I do not like green eggs and ham </s>
This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set|Model)
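A minimal sketch of computing these maximum likelihood bigram estimates from the three-sentence corpus above (function and variable names are illustrative):

```python
from collections import Counter

# The three training sentences from the slide, with <s> and </s> markers.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """Maximum likelihood estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3
print(p_mle("Sam", "am"))  # 1/2
print(p_mle("do", "I"))    # 1/3
```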
Berkeley Restaurant Project Sentences
can you tell me about any good cantonese restaurants
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Bigram Counts
Out of 9222 sentences: e.g., “I want” occurred 827 times.
Bigram Probabilities
Divide bigram counts by prefix unigram counts to get probabilities.
Bigram Estimates of Sentence Probabilities
P(<s> I want english food </s>)
= P(i|<s>) * P(want|i) * P(english|want) * P(food|english) * P(</s>|food)
= .000031
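A quick numerical check of this product; the three factors marked as assumed below are not shown in the surviving text and are taken from the same Berkeley Restaurant Project example:

```python
# Bigram estimates for each factor in the product above. P(i|<s>) = .25 and
# P(english|want) = .0011 appear in these notes; the other three are assumed.
factors = {
    "P(i|<s>)":        0.25,
    "P(want|i)":       0.33,    # assumed
    "P(english|want)": 0.0011,
    "P(food|english)": 0.5,     # assumed
    "P(</s>|food)":    0.68,    # assumed
}

prob = 1.0
for value in factors.values():
    prob *= value

print(f"{prob:.6f}")  # 0.000031
```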
Kinds of Knowledge
As crude as they are, N-gram probabilities capture a range of interesting facts about language.
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat|to) = .28
P(food|to) = 0
P(want|spend) = 0
P(i|<s>) = .25
Shannon’s Method
Assigning probabilities to sentences is all well and good, but it’s not terribly illuminating. A more entertaining task is to turn the model around and use it to generate random sentences that are like the sentences from which the model was derived. Generally attributed to Claude Shannon.
Shannon’s Method
Sample a random bigram (<s>, w) according to its probability
Now sample a random bigram (w, x) according to its probability, where the prefix w matches the suffix of the first.
And so on until we randomly choose a (y, </s>)
Then string the words together
<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
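A minimal sketch of this generation procedure for a bigram model; the bigram_probs table below is a toy illustration (its words and numbers are assumptions, not estimates from any corpus):

```python
import random

# Toy bigram model: bigram_probs[prev][w] = P(w | prev).
bigram_probs = {
    "<s>":     {"I": 0.6, "can": 0.4},
    "I":       {"want": 0.7, "am": 0.3},
    "am":      {"</s>": 1.0},
    "can":     {"I": 1.0},
    "want":    {"to": 0.8, "Chinese": 0.2},
    "to":      {"eat": 1.0},
    "eat":     {"Chinese": 1.0},
    "Chinese": {"food": 1.0},
    "food":    {"</s>": 1.0},
}

def shannon_sentence():
    """Sample bigrams (prev, w) until </s> is drawn, then join the words."""
    words, prev = [], "<s>"
    while True:
        choices = bigram_probs[prev]
        w = random.choices(list(choices), weights=list(choices.values()))[0]
        if w == "</s>":
            return " ".join(words)
        words.append(w)
        prev = w

print(shannon_sentence())  # e.g. "I want to eat Chinese food"
```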
Shakespeare
Shakespeare as a Corpus
N=884,647 tokens, V=29,066
Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams...
So, 99.96% of the possible bigrams were never seen
(have zero entries in the table)
This is the biggest problem in language modeling;
we’ll come back to it.
Quadrigrams are worse
Limitations of the Maximum Likelihood Estimator
Problem: often infinitely surprised when unseen word appears, P(unseen) = 0
–Problem (Zero Counts): this happens commonly
–Estimates for high count words are fairly accurate
–Estimates for low count words are unstable
Dealing with Sparsity
For most N-grams, we have few observations
General approach: modify observed counts to improve estimates
–Discounting: allocate probability mass for unobserved events by discounting counts for observed events
–Interpolation: approximate counts of an N-gram using a combination of estimates from related, denser histories
–Back-off: approximate counts of an unobserved N-gram based on the proportion of back-off events (e.g., the (N-1)-gram)
Laplace estimate:
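The add-one (Laplace) estimates, written here for a unigram and a bigram, where N is the number of word tokens and V the vocabulary size:
$$P_{\mathrm{Laplace}}(w_i) = \frac{c_i + 1}{N + V} \qquad\qquad P_{\mathrm{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$$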
Laplace-Smoothed Bigram Counts
Laplace-Smoothed Bigram Probabilities
Reconstituted Counts
Reconstituted Counts (2)
Big Change to the Counts!
C(want to) went from 608 to 238!
P(to|want) from .66 to .26!
Discount d = c*/c
d for “Chinese food”: .10, a 10x reduction!
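A sketch of where these numbers come from, assuming C(want) = 927 and vocabulary size V = 1446 as in the same restaurant example (only C(want to) = 608 appears above):

```python
# Laplace (add-one) smoothing for a single bigram.
C_want_to, C_want, V = 608, 927, 1446   # C(want) and V are assumed values

p_mle     = C_want_to / C_want                    # ≈ .66
p_laplace = (C_want_to + 1) / (C_want + V)        # ≈ .26

# "Reconstituted" count: the smoothed probability re-expressed as a count.
c_star = (C_want_to + 1) * C_want / (C_want + V)  # ≈ 238

print(round(p_mle, 2), round(p_laplace, 2), round(c_star))
```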
So in general, Laplace is a blunt instrument
Could use more fine-grained method (add-k)
But Laplace smoothing not used for N-grams, as we have much better methods
Despite its flaws, Laplace (add-k) is still used to smooth other probabilistic models in NLP, especially
For pilot studies
In domains where the number of zeros isn’t so huge.
Better Smoothing
Intuition used by many smoothing algorithms (Good-Turing, Kneser-Ney, Witten-Bell):
use the count of things we’ve seen once to help estimate the count of things we’ve never seen.
Good-Turing Smoothing
Basic idea: seeing something once is roughly the same as not seeing it at all
Count the number of times you observe an event once; use this as an estimate for unseen events
Distribute unseen events’probability equally over all unseen events
Adjust all other estimates downward, so that the set of probabilities sums to 1
Good-Turing: Josh Goodman’s Intuition
Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass.
You have caught so far: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.
How likely is it that the next fish caught will be a new species?
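The answer the intuition gives: three of the eighteen fish (trout, salmon, eel) belong to species seen exactly once, so
$$P(\text{next fish is a new species}) \approx \frac{N_1}{N} = \frac{3}{18}$$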
GT Fish Example
Bigram Frequencies of Frequencies and
GT Re-estimates
GT Smoothed Bigram Probabilities
In practice, assume large counts (c > k for some k) are reliable:
Also, the N_k need to be non-zero, so we need to smooth (interpolate) the N_k counts before computing c* from them.
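A minimal sketch of the basic Good-Turing re-estimate c* = (c + 1) · N_{c+1} / N_c, applied to the fishing example above (names are illustrative):

```python
from collections import Counter

# Catch from the Josh Goodman fishing example.
catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())                    # 18 fish in total

# N_c = number of species observed exactly c times ("count of counts").
count_of_counts = Counter(catch.values())  # {10: 1, 3: 1, 2: 1, 1: 3}

def gt_count(c):
    """Basic Good-Turing re-estimated count c* = (c + 1) * N_{c+1} / N_c."""
    # For large c, N_{c+1} may be zero, hence the need to smooth the N_k first.
    return (c + 1) * count_of_counts[c + 1] / count_of_counts[c]

p_unseen = count_of_counts[1] / N  # mass reserved for unseen species: 3/18
c_star_1 = gt_count(1)             # re-estimated count for once-seen species: 2/3

print(p_unseen, c_star_1)
```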
GT Complications
Problem
Both Add-1 and basic GT are trying to solve two distinct problems with the same hammer
How much probability mass to reserve for the zeros
–How much to take from the rich
How to distribute that mass among the zeros
–Who gets how much
Example
Consider the zero bigrams
“The X”
“of X”
With GT they’re both zero and will get the same fraction of the reserved mass...
Backoff and Interpolation
Use what you do know...
If we are estimating:
trigram p(z|x,y)
but count(xyz) is zero
Use info from:
Bigram p(z|y)
Or even:
Unigram p(z)
How to combine this trigram, bigram, unigram info in a valid fashion?
Backoff Vs. Interpolation
Backoff: use trigram if you have it, otherwise bigram, otherwise unigram
Interpolation: mix all three
Interpolation
Simple interpolation
Lambdas conditional on context:
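The two schemes written out: simple interpolation mixes the trigram, bigram, and unigram estimates with fixed weights that sum to one, while the second variant lets the weights depend on the bigram context:
$$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n), \qquad \textstyle\sum_i \lambda_i = 1$$
$$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1(w_{n-2}^{n-1})\, P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2(w_{n-2}^{n-1})\, P(w_n \mid w_{n-1}) + \lambda_3(w_{n-2}^{n-1})\, P(w_n)$$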
