A Maximum Entropy Approach to Natural Language Processing


The optimal model:
$P_{S \cup \hat{f}} = \arg\max_{p \in C(S \cup \hat{f})} H(p)$
Gain:
$\Delta L(S, \hat{f}) \equiv L(P_{S \cup \hat{f}}) - L(P_S)$,
where L is the log-likelihood of training data
Algorithm: Feature Selection
1. Start with S as an empty set; $P_S$ is uniform
2. For each feature f, compute $P_{S \cup f}$ and $\Delta L(S, f)$
3. Check the termination condition (specified by the user)
4. Select $\hat{f} = \arg\max_f \Delta L(S, f)$
$\tilde{p}(f) \equiv \sum_{x,y} \tilde{p}(x, y)\, f(x, y)$
The expected value of f with respect to the conditional probability $p(y \mid x)$:
$p(f) \equiv \sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f(x, y)$
Constraint Equation
Set equal the two expected values:
Transition function:
$t_j(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } y_{i-1} = \text{IN and } y_i = \text{NNP} \\ 0 & \text{otherwise} \end{cases}$
Difference from MEMM
If the state features are dropped, we obtain an MEMM model
The drawback of MEMM
$\tilde{P}_{S \cup f} = P^{\hat{\alpha}}_{S \cup f}$, where $\hat{\alpha} = \arg\max_{\alpha} G_{S \cup f}(\alpha)$
Conditional Random Field (CRF)
CRF
The probability of a label sequence y given an observation sequence x is the normalized product of potential functions,
$H(p) \equiv -\sum_{x,y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x)$,
where p is chosen from
$C = \{\, p \mid p(f_i) = \tilde{p}(f_i),\ i = 1, 2, \ldots, n \,\}$
Constrained Optimization Problem
The Lagrangian
Maximum Entropy (ME)
Maximum Entropy Markov Model (MEMM)
Conditional Random Field (CRF)
Boltzmann-Gibbs Distribution
Given:
States $s_1, s_2, \ldots, s_n$
Density $p(s) = p_s$
each of the form
$\exp\Big(\sum_j \lambda_j\, t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k\, s_k(y_i, x, i)\Big)$,
where $y_{i-1}$ and $y_i$ are the labels at positions $i-1$ and $i$,
$t_j(y_{i-1}, y_i, x, i)$ is a transition feature function, and
$s_k(y_i, x, i)$ is a state feature function
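For concreteness, the sketch below evaluates such potentials on a toy tagged sentence; the feature functions, weights, and tags are assumptions for illustration, not part of the original slides.

```python
import math

# Minimal sketch of how a CRF scores a label sequence: at each position i it sums
# weighted transition features t_j(y_{i-1}, y_i, x, i) and state features
# s_k(y_i, x, i), then exponentiates. All features and weights below are made up.

def t_in_nnp(y_prev, y_curr, x, i):
    """Transition feature: previous label IN followed by NNP."""
    return 1.0 if y_prev == "IN" and y_curr == "NNP" else 0.0

def s_september(y_curr, x, i):
    """State feature: current word is 'September'."""
    return 1.0 if x[i] == "September" else 0.0

transition_features = [(0.8, t_in_nnp)]   # (lambda_j, t_j)
state_features = [(0.5, s_september)]     # (mu_k, s_k)

def potential(y_prev, y_curr, x, i):
    """Un-normalized potential exp(sum_j lambda_j t_j + sum_k mu_k s_k) at position i."""
    score = sum(lam * t(y_prev, y_curr, x, i) for lam, t in transition_features)
    score += sum(mu * s(y_curr, x, i) for mu, s in state_features)
    return math.exp(score)

def sequence_score(y, x):
    """Product of per-position potentials; dividing by Z(x) would give p(y | x)."""
    score = 1.0
    for i in range(1, len(y)):
        score *= potential(y[i - 1], y[i], x, i)
    return score

x = ["in", "September", "2024"]
y = ["IN", "NNP", "CD"]
print(sequence_score(y, x))
```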
Boltzmann-Gibbs (Cnt’d)
Consider the Lagrangian
$L = -\sum_s p_s \log p_s + \sum_i \lambda_i \Big(\sum_s p_s f_i(s) - D_i\Big) + \mu \Big(\sum_s p_s - 1\Big)$
Take partial derivatives of L with respect to ps
HMM is a generative model. In order to define a joint distribution, this model must enumerate all possible observation sequences and their corresponding label sequences. This task is intractable unless the observation elements are represented as isolated units.
$\tilde{p}(f) = p(f)$
or equivalently,
$\sum_{x,y} \tilde{p}(x, y)\, f(x, y) = \sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f(x, y)$
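As a small numeric illustration of this constraint, the sketch below computes both expectations on toy data; the samples, the feature, and the stand-in conditional model are assumptions, not anything from the original slides.

```python
from collections import Counter

# The two expectations being equated: the empirical expectation of f under
# p~(x, y), and the model expectation under p~(x) p(y | x).

# Toy training pairs (English context word, French translation).
samples = [("in", "en"), ("in", "dans"), ("in", "en"), ("in", "pendant")]
n = len(samples)
p_tilde_xy = {pair: c / n for pair, c in Counter(samples).items()}          # p~(x, y)
p_tilde_x = {x: c / n for x, c in Counter(x for x, _ in samples).items()}   # p~(x)

def f(x, y):
    """Example binary feature: the translation is 'en'."""
    return 1.0 if y == "en" else 0.0

def model_p(y, x):
    """Stand-in conditional model p(y | x); a real model would be the ME solution."""
    return {"en": 0.5, "dans": 0.25, "pendant": 0.25}.get(y, 0.0)

labels = ["en", "dans", "pendant"]

empirical = sum(p * f(x, y) for (x, y), p in p_tilde_xy.items())            # p~(f)
model = sum(p_tilde_x[x] * model_p(y, x) * f(x, y)
            for x in p_tilde_x for y in labels)                              # p(f)

print(empirical, model)   # the constraint requires these two values to be equal
```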
Maximum Entropy Principle
Given n feature functions fi, we want p(y|x) to maximize the entropy measure
5. Add $\hat{f}$ to S
6. Update $P_S$
7. Go to step 2
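As a concrete illustration of this loop, here is a minimal, self-contained sketch in Python. The toy translation data, the two candidate features, and the use of plain gradient ascent as the model-fitting step are assumptions for illustration; they stand in for the full maximum entropy training of each candidate model.

```python
import math
from itertools import product

# Greedy incremental feature selection (steps 1-7 above) on a toy problem.

samples = [("in", "en"), ("in", "en"), ("in", "dans"), ("April in", "en")]
X = sorted({x for x, _ in samples})
Y = sorted({y for _, y in samples})
n = len(samples)
p_xy = {(x, y): sum(1 for s in samples if s == (x, y)) / n for x, y in product(X, Y)}
p_x = {x: sum(1 for s in samples if s[0] == x) / n for x in X}

candidates = {
    "y_is_en": lambda x, y: 1.0 if y == "en" else 0.0,
    "april_context": lambda x, y: 1.0 if "April" in x and y == "en" else 0.0,
}

def model(weights, S):
    """Conditional model p(y|x) = exp(sum_i w_i f_i(x, y)) / Z(x) over the active set S."""
    p = {}
    for x in X:
        scores = {y: math.exp(sum(weights[name] * candidates[name](x, y) for name in S))
                  for y in Y}
        Z = sum(scores.values())
        for y in Y:
            p[(x, y)] = scores[y] / Z
    return p

def log_likelihood(p):
    return sum(p_xy[(x, y)] * math.log(p[(x, y)]) for x, y in p_xy if p_xy[(x, y)] > 0)

def fit(S, iters=200, lr=0.5):
    """Crude gradient ascent on the log-likelihood; stands in for full ME training."""
    w = {name: 0.0 for name in S}
    for _ in range(iters):
        p = model(w, S)
        for name in S:
            emp = sum(p_xy[(x, y)] * candidates[name](x, y) for x, y in p_xy)
            mod = sum(p_x[x] * p[(x, y)] * candidates[name](x, y) for x, y in p_xy)
            w[name] += lr * (emp - mod)
    return w

S, remaining = [], set(candidates)                # step 1: S empty, P_S uniform
base_ll = log_likelihood(model(fit(S), S))
while remaining:
    gains = {}
    for f in remaining:                           # step 2: compute P_{S u f} and gain
        w = fit(S + [f])
        gains[f] = log_likelihood(model(w, S + [f])) - base_ll
    best = max(gains, key=gains.get)              # step 4: select feature with largest gain
    if gains[best] < 1e-4:                        # step 3: termination condition
        break
    S.append(best)                                # steps 5-6: add feature, update model
    remaining.remove(best)
    base_ll += gains[best]
print("selected features:", S)
```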
Approximation
Computation of the maximum entropy model is costly for each candidate feature f
Simplification assumption:
The multipliers λ associated with S do not change when f is added to S
Approximation (cnt’d)
The approximate solution for $S \cup f$ then has the form
$P^{\alpha}_{S \cup f}(y \mid x) = \frac{1}{Z_{\alpha}(x)}\, P_S(y \mid x)\, e^{\alpha f(x, y)}$
Maximum entropy Markov model (MEMM)
The posterior consists of transition probability densities p(s | s′, X)
Boltzmann-Gibbs (Cnt’d)
Conditional random field (CRF)
Maximum Entropy Approach
An Example
Five possible French translations of the English word "in": dans, en, à, au cours de, pendant
Certain constraints obeyed:
Feature Selection
Motivation:
For a large collection of candidate features, we want to select a small subset
Incremental growth
Incremental Learning
Adding feature $\hat{f}$ to S to obtain $S \cup \hat{f}$
Consider $C(S \cup \hat{f}) = \{\, p \mid p(f_i) = \tilde{p}(f_i),\ i = 1, 2, \ldots, n \,\}$
y: French word, x: English context
Indicator function of a context feature f:
$f(x, y) = \begin{cases} 1 & \text{if } y = \text{``en'' and April follows ``in''} \\ 0 & \text{otherwise} \end{cases}$
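A tiny sketch of this indicator feature, with an assumed token-list encoding of the English context:

```python
# The feature fires only when the proposed translation is "en" and the English
# context has the word "April" immediately following "in".

def f(x, y):
    """x: English context as a list of tokens, y: candidate French translation."""
    april_follows_in = any(
        x[i] == "in" and i + 1 < len(x) and x[i + 1] == "April" for i in range(len(x))
    )
    return 1 if y == "en" and april_follows_in else 0

print(f(["in", "April"], "en"))      # 1: the constraint's context matches
print(f(["in", "May"], "en"))        # 0: different month
print(f(["in", "April"], "dans"))    # 0: different translation
```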
Expected Values of f
The expected value of f with respect to the empirical distribution $\tilde{p}(x, y)$
$p^{(n)}(f_i) = \sum_{x,y} \tilde{p}(x)\, p^{(n)}(y \mid x)\, f_i(x, y)$
Update Lagrange multipliers
$\exp\big(\lambda_i^{(n+1)} - \lambda_i^{(n)}\big) = \frac{\tilde{p}(f_i)}{p^{(n)}(f_i)}$
Update probability functions
$p^{(n+1)}(y \mid x) = \frac{1}{Z^{(n+1)}(x)} \exp\Big(\sum_i \lambda_i^{(n+1)} f_i(x, y)\Big)$
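A compact sketch of these updates on toy data is given below. The samples and the two complementary binary features (chosen so the feature sum is constant, which lets the simple multiplier update above apply without the usual GIS slack feature) are assumptions for illustration.

```python
import math
from itertools import product

# Iterative scaling on a toy translation problem with two binary features.

samples = [("in", "en"), ("in", "en"), ("in", "dans"), ("in", "pendant")]
X = sorted({x for x, _ in samples})
Y = sorted({y for _, y in samples})
n = len(samples)
p_xy = {(x, y): sum(1 for s in samples if s == (x, y)) / n for x, y in product(X, Y)}
p_x = {x: sum(1 for s in samples if s[0] == x) / n for x in X}

features = [
    lambda x, y: 1.0 if y == "en" else 0.0,    # f_1
    lambda x, y: 1.0 if y != "en" else 0.0,    # f_2 (so sum_i f_i(x, y) = 1 everywhere)
]
lam = [0.0, 0.0]

def p_cond(lam):
    """p(y|x) = exp(sum_i lam_i f_i(x, y)) / Z(x)."""
    p = {}
    for x in X:
        score = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, features))) for y in Y}
        Z = sum(score.values())
        for y in Y:
            p[(x, y)] = score[y] / Z
    return p

p_tilde_f = [sum(p_xy[xy] * f(*xy) for xy in p_xy) for f in features]     # p~(f_i)

for _ in range(100):
    p = p_cond(lam)
    # Expectation of f_i under the current estimate p^(n)(y|x)
    p_n_f = [sum(p_x[x] * p[(x, y)] * f(x, y) for x, y in p_xy) for f in features]
    # Update multipliers: exp(lam^(n+1) - lam^(n)) = p~(f_i) / p^(n)(f_i)
    lam = [l + math.log(t / m) for l, t, m in zip(lam, p_tilde_f, p_n_f)]

print({(x, y): round(p_cond(lam)[(x, y)], 3) for x, y in p_xy})
```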
Use the optimal posterior to classify
Boltzmann-Gibbs (Cnt’d)
Maximum Entropy (ME)
The posterior is the state probability density p(s | X), where X = (x1, x2, …, xn)
When April follows in, the proper translation is en
How do we choose the proper French translation y given an English context x?
Formalism
Probability assignment p(y|x):
The posterior consists of both transition probability densities p(s | s′, X) and state probability densities p(s | X)
$\Lambda(p, \lambda) \equiv H(p) + \sum_i \lambda_i \big(p(f_i) - \tilde{p}(f_i)\big)$
Solutions
$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$
$Z(x) = \sum_y \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$
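A small numeric check of this form, with assumed features and multipliers:

```python
import math

# Exponentiate the weighted feature sum, then normalize by Z(x) within each context x.

def features(x, y):
    return [1.0 if y == "en" else 0.0,
            1.0 if ("April" in x and y == "en") else 0.0]

lam = [0.4, 1.1]   # assumed multipliers, e.g. produced by the iterative solution

def p(y, x, labels=("en", "dans", "pendant")):
    score = lambda yy: math.exp(sum(l * f for l, f in zip(lam, features(x, yy))))
    return score(y) / sum(score(yy) for yy in labels)   # divide by Z(x)

print(p("en", "in April"), p("en", "in May"))
```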
Iterative Solution
Compute the expectation of $f_i$ under the current estimate of the probability function
The state probabilities are not learnt, but inferred
Bias can be generated, since the transition features dominate during training
Difference from HMM
and set them to zero, we obtain the Boltzmann-Gibbs density functions
$p_s = \frac{1}{Z} \exp\Big(\sum_i \lambda_i f_i(s)\Big)$,
where Z is the normalizing factor
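A worked sketch of the omitted steps, under the sign conventions of the Lagrangian as reconstructed above:
$\frac{\partial L}{\partial p_s} = -\log p_s - 1 + \sum_i \lambda_i f_i(s) + \mu = 0$
$\Rightarrow \log p_s = \sum_i \lambda_i f_i(s) + (\mu - 1)$
$\Rightarrow p_s = e^{\mu - 1} \exp\Big(\sum_i \lambda_i f_i(s)\Big) = \frac{1}{Z} \exp\Big(\sum_i \lambda_i f_i(s)\Big)$,
where the constraint $\sum_s p_s = 1$ fixes $Z = e^{1 - \mu} = \sum_s \exp\big(\sum_i \lambda_i f_i(s)\big)$.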
Exercise
From the Lagrangian
$L = -\sum_s p_s \log p_s + \sum_i \lambda_i \Big(\sum_s p_s f_i(s) - D_i\Big) + \mu \Big(\sum_s p_s - 1\Big)$,
derive
$p_s = \frac{1}{Z} \exp\Big(\sum_i \lambda_i f_i(s)\Big)$
Boltzmann-Gibbs (Cnt’d)
Classification Rule
Use of Boltzmann-Gibbs as prior distribution
Compute the posterior for given observed data and features fi
Maximum entropy principle:
Without any information, one chooses the density ps to maximize the entropy
$-\sum_s p_s \log p_s$
subject to the constraints
$\sum_s p_s f_i(s) = D_i, \quad \forall i$
$s_k(y_i, x, i)$ is a state function
Feature Functions
Example:
A feature given by
$b(x, i) = \begin{cases} 1 & \text{if the observation sequence at position } i \text{ is the word ``September''} \\ 0 & \text{otherwise} \end{cases}$
$P^{\alpha}_{S \cup f}(y \mid x) = \frac{P_S(y \mid x)\, e^{\alpha f(x, y)}}{Z_{\alpha}(x)}$,
$Z_{\alpha}(x) = \sum_y P_S(y \mid x)\, e^{\alpha f(x, y)}$
Approximate Solution
The approximate gain is
$G_{S \cup f}(\alpha) \equiv L(P^{\alpha}_{S \cup f}) - L(P_S) = -\sum_x \tilde{p}(x) \log Z_{\alpha}(x) + \alpha\, \tilde{p}(f)$
The approximate solution is then $P^{\hat{\alpha}}_{S \cup f}$, with $\hat{\alpha} = \arg\max_{\alpha} G_{S \cup f}(\alpha)$.
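A minimal sketch of this one-parameter search, assuming toy data, a uniform current model $P_S$, and a crude grid search over α:

```python
import math
from itertools import product

# Holding the current model P_S fixed, the gain of one candidate feature f is
# approximated by G(alpha) = -sum_x p~(x) log Z_alpha(x) + alpha p~(f), and only
# the single weight alpha is optimized.

samples = [("in", "en"), ("in", "en"), ("in", "dans"), ("in", "pendant")]
X = sorted({x for x, _ in samples})
Y = sorted({y for _, y in samples})
n = len(samples)
p_xy = {(x, y): sum(1 for s in samples if s == (x, y)) / n for x, y in product(X, Y)}
p_x = {x: sum(1 for s in samples if s[0] == x) / n for x in X}

def P_S(y, x):
    """Current model; uniform here, standing in for the model over the active set S."""
    return 1.0 / len(Y)

def f(x, y):
    """Candidate feature."""
    return 1.0 if y == "en" else 0.0

p_tilde_f = sum(p_xy[xy] * f(*xy) for xy in p_xy)                  # p~(f)

def gain(alpha):
    """Approximate gain: -sum_x p~(x) log Z_alpha(x) + alpha p~(f)."""
    total = 0.0
    for x in X:
        Z_alpha = sum(P_S(y, x) * math.exp(alpha * f(x, y)) for y in Y)
        total -= p_x[x] * math.log(Z_alpha)
    return total + alpha * p_tilde_f

# Crude grid search for alpha-hat = argmax_alpha G(alpha).
alpha_hat = max((a / 100.0 for a in range(-500, 501)), key=gain)
print(alpha_hat, gain(alpha_hat))
```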