An Introduction to Restricted Boltzmann Machines


Yingying Zhang

October 27, 2014

1 Restricted Boltzmann Machines

1.1 The structure of the Restricted Boltzmann Machines
An RBM is an MRF associated with a bipartite undirected graph as shown in Figure 1. It consists of m visible units V = (V1, ..., Vm) to represent observable data and n hidden units H = (H1, ..., Hn) to capture dependencies between the observed variables. In binary RBMs, the random variables (V, H) take values (v, h) ∈ {0, 1}^(m+n), and the joint probability distribution under the model is given by the Gibbs distribution p(v, h) = (1/Z) exp(−E(v, h)) with the energy function

E(v, h) = − Σ_{i=1}^n Σ_{j=1}^m hi wij vj − Σ_{j=1}^m bj vj − Σ_{i=1}^n ci hi    (1)
For all i ∈ {1, ..., n} and j ∈ {1, ..., m}, wij is a real-valued weight, and ci and bj are real-valued bias terms. The graph of an RBM has connections only between the layer of hidden and the layer of visible variables, but not between two variables of the same layer.

Figure 1: The undirected graph of an RBM with n hidden and m visible variables.
So we have the following relation:

p(h|v) = ∏_{i=1}^n p(hi|v)   and   p(v|h) = ∏_{j=1}^m p(vj|h)    (2)
The absence of connections between hidden variables makes the marginal distribution of the visible variables easy to calculate:
p(v) = Σ_h p(v, h) = (1/Z) Σ_h e^(−E(v, h)) = (1/Z) ∏_{j=1}^m e^(bj vj) ∏_{i=1}^n (1 + e^(ci + Σ_{j=1}^m wij vj))    (3)

The conditional probability of a single variable being one can be interpreted as the firing rate of a (stochastic) neuron with a sigmoid activation function, because it holds that

p(Hi = 1|v) = σ(Σ_{j=1}^m wij vj + ci)    (4)

and

p(Vj = 1|h) = σ(Σ_{i=1}^n wij hi + bj)    (5)

An RBM can therefore be reinterpreted as a standard feed-forward neural network with one layer of non-linear processing units. From this perspective the RBM is viewed as a deterministic function {0, 1}^m → R^n that maps an input v ∈ {0, 1}^m to y ∈ R^n with yi = p(Hi = 1|v). That is, an observation is mapped to the expected value of the hidden neurons given the observation.
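To make equations (4) and (5) concrete, the following is a minimal NumPy sketch, not part of the original text: the array names W, b, c, their shapes and the helper functions are assumptions. It computes the two conditionals and one Gibbs transition v → h → v', which is also the basic sampling step used in Section 2.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    # Eq. (4): p(Hi = 1 | v) = sigma(sum_j wij vj + ci); W has shape (n, m)
    return sigmoid(W @ v + c)

def p_v_given_h(h, W, b):
    # Eq. (5): p(Vj = 1 | h) = sigma(sum_i wij hi + bj)
    return sigmoid(W.T @ h + b)

def gibbs_step(v, W, b, c):
    # One Gibbs transition: sample h ~ p(h|v), then v' ~ p(v|h)
    h = (rng.random(c.shape) < p_h_given_v(v, W, c)).astype(float)
    v_new = (rng.random(b.shape) < p_v_given_h(h, W, b)).astype(float)
    return v_new, h

Note that p_h_given_v(v, W, c) is exactly the deterministic feed-forward mapping y with yi = p(Hi = 1|v) described above.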
1.2 The Gradient of the Log-Likelihood
The derivative of the log-likelihood of a single training pattern v w.r.t. the weight wij can be written as:
∂ln L(θ|v)/∂wij = − Σ_h p(h|v) ∂E(v, h)/∂wij + Σ_{v,h} p(v, h) ∂E(v, h)/∂wij
               = p(Hi = 1|v) vj − Σ_v p(v) p(Hi = 1|v) vj    (6)

Analogously to (6) we get the derivative w.r.t. the bias parameter bj of the jth visible variable:

∂ln L(θ|v)/∂bj = vj − Σ_v p(v) vj    (7)

and w.r.t. the bias parameter ci of the ith hidden variable:

∂ln L(θ|v)/∂ci = p(Hi = 1|v) − Σ_v p(v) p(Hi = 1|v)    (8)

If we calculate the gradient of the parameters θ by means of formulas (6), (7) and (8), the sum over all values of the visible variables (or over all values of the hidden variables, if one decides to factorize over the visible variables first) has exponential complexity. One way to deal with this problem is to approximate this expectation by samples from the model distribution.
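For very small models the expectations over p(v) in (6), (7) and (8) can still be evaluated exactly by enumerating all 2^m visible configurations. The following brute-force NumPy sketch is not from the original text (names and shapes are assumptions); it illustrates both the formulas and why this approach becomes infeasible as m grows.

import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def exact_log_likelihood_grad(v, W, b, c):
    # Exact gradients (6)-(8) for one training pattern v; W has shape (n, m).
    n, m = W.shape
    ph_data = sigmoid(W @ v + c)                       # p(Hi = 1 | v), positive phase

    # Unnormalized p(v) from Eq. (3) for every visible configuration.
    configs = np.array(list(itertools.product([0.0, 1.0], repeat=m)))
    unnorm = np.exp(configs @ b) * np.prod(1.0 + np.exp(configs @ W.T + c), axis=1)
    p_v = unnorm / unnorm.sum()                        # model distribution over v

    ph_all = sigmoid(configs @ W.T + c)                # p(Hi = 1 | v) for every v
    grad_W = np.outer(ph_data, v) - np.einsum('s,si,sj->ij', p_v, ph_all, configs)  # Eq. (6)
    grad_b = v - p_v @ configs                         # Eq. (7)
    grad_c = ph_data - p_v @ ph_all                    # Eq. (8)
    return grad_W, grad_b, grad_c

The enumeration grows as 2^m, which is exactly the exponential cost that motivates the sampling-based approximations below.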
2 Approximating the RBM Log-Likelihood Gradient
2.1 Contrastive Divergence
The idea of k-step contrastive divergence learning (CD-k) is quite simple: instead of approximating the second term in the log-likelihood gradient by a sample from the RBM distribution (which would require running a Markov chain until the stationary distribution is reached), a Gibbs chain is run for only k steps (usually k = 1). The Gibbs chain is initialized with a training example v(0) from the training set and yields the sample v(k) after k steps. Each step t consists of sampling h(t) from p(h|v(t)) and subsequently sampling v(t+1) from p(v|h(t)). The gradient w.r.t. θ of the log-likelihood for one training pattern v(0) is then approximated by:
CDk(θ, v(0)) = − Σ_h p(h|v(0)) ∂E(v(0), h)/∂θ + Σ_h p(h|v(k)) ∂E(v(k), h)/∂θ    (9)
The derivatives in the direction of the single parameters are obtained by estimating the expectations over p(v) in (6), (7) and (8) by the single sample v(k). A batch version of CD-k is given in Algorithm 1.
Algorithm 1 k-step contrastive divergence
Input: RBM (V1, ..., Vm, H1, ..., Hn), training batch S
Output: gradient approximation ∆wij, ∆bj and ∆ci for i = 1, ..., n, j = 1, ..., m
1: init ∆wij = ∆bj = ∆ci = 0 for i = 1, ..., n, j = 1, ..., m
2: for all the v ∈ S do
3:     v(0) ← v
4:     for t = 0, ..., k − 1 do
5:         for i = 1, ..., n do
6:             sample hi(t) ∼ p(hi|v(t))
7:         end for
8:         for j = 1, ..., m do
9:             sample vj(t+1) ∼ p(vj|h(t))
10:        end for
11:    end for
12:    for i = 1, ..., n, j = 1, ..., m do
13:        ∆wij ← ∆wij + p(Hi = 1|v(0)) · vj(0) − p(Hi = 1|v(k)) · vj(k)
           ∆bj ← ∆bj + vj(0) − vj(k)
           ∆ci ← ∆ci + p(Hi = 1|v(0)) − p(Hi = 1|v(k))
14:    end for
15: end for
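As an illustration, here is a minimal NumPy sketch of the batch CD-k procedure of Algorithm 1. It is not part of the original text; the parameter names, shapes and the convention of returning unscaled gradient accumulators are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(S, W, b, c, k=1):
    # k-step contrastive divergence for a batch S of binary visible vectors.
    # S has shape (batch, m); W has shape (n, m). Returns (dW, db, dc) as in Algorithm 1.
    dW = np.zeros_like(W)
    db = np.zeros_like(b)
    dc = np.zeros_like(c)
    for v0 in S:
        v = v0.copy()
        for _ in range(k):                          # lines 4-11: k Gibbs steps
            ph = sigmoid(W @ v + c)
            h = (rng.random(c.shape) < ph).astype(float)
            pv = sigmoid(W.T @ h + b)
            v = (rng.random(b.shape) < pv).astype(float)
        ph0 = sigmoid(W @ v0 + c)                   # p(Hi = 1 | v(0))
        phk = sigmoid(W @ v + c)                    # p(Hi = 1 | v(k))
        dW += np.outer(ph0, v0) - np.outer(phk, v)  # line 13, weight term
        db += v0 - v                                # line 13, visible bias term
        dc += ph0 - phk                             # line 13, hidden bias term
    return dW, db, dc

In practice the returned accumulators would be scaled by a learning rate and added to the current parameters after each batch.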
2.2 Parallel Tempering

One of the most promising sampling techniques used for RBM training so far is parallel tempering, also known as replica exchange MCMC sampling. It can be formalized in the following way: given an ordered set of M temperatures T1, T2, ..., TM with 1 = T1 < T2 < ... < TM, we define a set of M Markov chains with stationary distributions
pr(v, h) = (1/Zr) e^(−(1/Tr) E(v, h))    (10)

where Zr = Σ_{v,h} e^(−(1/Tr) E(v, h)) is the corresponding partition function, and p1 is exactly the model distribution.
In each step of the algorithm, we run k (usually k = 1) Gibbs sampling steps in each tempered Markov chain, yielding samples (v1, h1), ..., (vM, hM). After this, two neighboring Gibbs chains with temperatures Tr and Tr−1 may exchange particles (vr, hr) and (vr−1, hr−1) with an exchange probability based on the Metropolis ratio.
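A minimal sketch of such a particle exchange, assuming the standard Metropolis acceptance ratio for replica exchange, the energy function (1), and the same NumPy conventions as in the earlier sketches (the function and variable names are illustrative, not from the original text):

import numpy as np

rng = np.random.default_rng(0)

def energy(v, h, W, b, c):
    # Eq. (1): E(v, h) = -h^T W v - b^T v - c^T h
    return -(h @ W @ v) - (b @ v) - (c @ h)

def maybe_swap(states, r, temps, W, b, c):
    # Metropolis swap between the neighboring tempered chains r and r-1.
    # states[r] = (v, h) is the current particle of the chain with temperature temps[r].
    (vr, hr), (vs, hs) = states[r], states[r - 1]
    e_r = energy(vr, hr, W, b, c)
    e_s = energy(vs, hs, W, b, c)
    log_ratio = (1.0 / temps[r] - 1.0 / temps[r - 1]) * (e_r - e_s)
    if rng.random() < min(1.0, np.exp(log_ratio)):
        states[r], states[r - 1] = states[r - 1], states[r]   # exchange particles
    return states

After all swaps have been proposed, the sample from the chain with temperature T1 (the model distribution) is the one used to estimate the negative term of the gradient.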