Restricted Boltzmann Machines: Detailed Lecture Slides (PPT)
– Sample $\langle s_i s_j \rangle$ for all pairs of units
– Repeat for all data vectors in the training set.
• Negative phase
– Do not clamp any of the units
– Let the whole network reach thermal equilibrium at a temperature of 1 (where do we start?)
– Sample $\langle s_i s_j \rangle$ for all pairs of units
– Repeat many times to get good estimates
• Weight updates
– Update each weight by an amount proportional to the difference in $\langle s_i s_j \rangle$ in the two phases.
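The weight-update step can be sketched in a few lines of NumPy. This is a minimal illustration, not the slides' own code: the function names, the `learning_rate` parameter, and the toy sample arrays are assumptions, and the samples stand in for states that would really be collected at thermal equilibrium in each phase.

```python
import numpy as np

def pairwise_correlations(states):
    """Average s_i * s_j over a batch of binary state vectors (one per row)."""
    states = np.asarray(states, dtype=float)
    return states.T @ states / states.shape[0]

def boltzmann_weight_update(clamped_states, free_states, learning_rate=0.1):
    """Delta w_ij proportional to <s_i s_j>_clamped - <s_i s_j>_free."""
    delta = pairwise_correlations(clamped_states) - pairwise_correlations(free_states)
    np.fill_diagonal(delta, 0.0)  # units have no connection to themselves
    return learning_rate * delta

# Hypothetical samples: rows are network states gathered in each phase.
clamped = np.array([[1, 1, 0], [1, 1, 1]])
free = np.array([[1, 0, 0], [0, 1, 1]])
delta_w = boltzmann_weight_update(clamped, free)
```

The update is symmetric because the weights are symmetric, and a pair of units that is equally correlated in both phases receives no change.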
– This is a big advantage over directed belief nets
[Figure: RBM with a layer of hidden units j connected to a layer of visible units i]
Maximizing the training data log likelihood
Standard PoE form
• We want the maximizing parameters
• We can observe some of the variables, and we would like to solve two problems:
• The inference problem: Infer the states of the unobserved variables.
• The learning problem: Adjust the interactions between variables to make the network more likely to generate the observed data.
• The derivation is nasty.
Frank Wood - fwood@
Equilibrium Is Hard to Achieve
• With:
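$$\frac{\partial \log p(D \mid \theta_1, \ldots, \theta_n)}{\partial \theta_m} = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty}$$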
$$p(s_i = 1) = \frac{1}{1 + e^{-\sum_j s_j w_{ij} / T}} = \frac{1}{1 + e^{-\Delta E_i / T}}$$
where $T$ is the temperature.
Energy gap: $\Delta E_i = E(s_i = 0) - E(s_i = 1)$
The Energy of a joint configuration
$s_i^{vh}$: binary state of unit i in joint configuration v, h
– The temperature controls the amount of noise.
– Decreasing all the energy gaps between configurations is equivalent to raising the noise level.
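The effect of temperature on a stochastic unit can be checked numerically. The sketch below assumes the logistic form of the unit's turn-on probability in terms of its energy gap; the function name `p_on` and the sample values are illustrative only.

```python
import numpy as np

def p_on(energy_gap, temperature=1.0):
    """Probability that a stochastic binary unit turns on, given its energy gap."""
    return 1.0 / (1.0 + np.exp(-energy_gap / temperature))

# Higher temperature pushes the decision toward a coin flip (more noise).
gap = 2.0
probabilities = {T: p_on(gap, T) for T in (0.5, 1.0, 4.0)}

# Halving the energy gap has the same effect as doubling the temperature.
same_noise = np.isclose(p_on(gap / 2.0, 1.0), p_on(gap, 2.0))
```

Only the ratio of energy gap to temperature enters the formula, which is why shrinking all gaps and raising the noise level are interchangeable.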
$$p(v, h) \propto e^{-E(v, h)}$$
• The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.
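For a network small enough to enumerate, this summation can be carried out directly. The sketch below assumes a restricted (bipartite) network with energy $E(v,h) = -a \cdot v - b \cdot h - v^\top W h$; the names `rbm_energy`, `visible_distribution`, `W`, `a`, and `b` are illustrative and not taken from the slides.

```python
import itertools
import numpy as np

def rbm_energy(v, h, W, a, b):
    """Assumed bipartite energy: E(v, h) = -a.v - b.h - v^T W h."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def visible_distribution(W, a, b):
    """Return every visible configuration and its probability p(v)."""
    n_visible, n_hidden = W.shape
    visibles = [np.array(bits, dtype=float)
                for bits in itertools.product([0, 1], repeat=n_visible)]
    hiddens = [np.array(bits, dtype=float)
               for bits in itertools.product([0, 1], repeat=n_hidden)]
    # Sum e^{-E(v, h)} over all hidden configurations for each v.
    unnormalized = np.array([
        sum(np.exp(-rbm_energy(v, h, W, a, b)) for h in hiddens)
        for v in visibles
    ])
    # Dividing by the total is the normalization over all joint configurations.
    return visibles, unnormalized / unnormalized.sum()

# With all weights and biases at zero, every visible vector is equally likely.
visibles, p_v = visible_distribution(np.zeros((2, 2)), np.zeros(2), np.zeros(2))
```

The enumeration grows exponentially in the number of units, which is why it only serves as a check on toy models.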
Restricted Boltzmann Machines and Deep Belief Networks
Presented by Matt Luciw
USING A VAST, VAST MAJORITY OF SLIDES ORIGINALLY FROM:
Geoffrey Hinton, Sue Becker, Yann Le Cun, Yoshua Bengio, Frank Wood
• In an RBM, the hidden units are conditionally independent given the visible states.
– So we can quickly get an unbiased sample from the posterior distribution when given a data-vector.
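Because the hidden units do not interact once the visible states are fixed, the whole hidden layer can be sampled in one vectorized step. This is a minimal sketch under that assumption; `sample_hidden`, `W`, `b_hidden`, and the toy values are hypothetical names and numbers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b_hidden, rng):
    """Draw all hidden units independently given the visible vector v."""
    p_hidden = sigmoid(b_hidden + v @ W)  # one probability per hidden unit
    h = (rng.random(p_hidden.shape) < p_hidden).astype(float)
    return h, p_hidden

rng = np.random.default_rng(0)
v = np.array([1.0, 0.0, 1.0])
W = np.zeros((3, 2))                 # 3 visible units, 2 hidden units
b_hidden = np.array([50.0, -50.0])   # extreme biases make the outcome near-certain
h, p_hidden = sample_hidden(v, W, b_hidden, rng)
```

No iterative settling is needed here, in contrast to a network with hidden-to-hidden connections.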
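$$p(s_i = 1) = \frac{1}{1 + \exp\left(-b_i - \sum_j s_j w_{ji}\right)}$$
[Figure: logistic curve of $p(s_i = 1)$ (vertical axis, 0 to 1) against the total input $b_i + \sum_j s_j w_{ji}$ (horizontal axis); $p(s_i = 1) = 0.5$ when the total input is 0]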
Stochastic units
• Replace the binary threshold units by binary stochastic units that make biased random decisions.
• Unsupervised learning could do “local-learning” (each module tries its best to model what it sees)
• Inference (+ learning) is intractable in directed graphical models with many hidden variables
Restricted Boltzmann Machines
• Restrict the connectivity to make learning easier.
– Only one layer of hidden units.
• Deal with more layers later
– No connections between hidden units.
$\langle s_i s_j \rangle_{\text{free}}$: expected value of product of states at thermal equilibrium when nothing is clamped
The (theoretical) batch learning algorithm
• Positive phase
– Clamp a data vector on the visible units.
– Let the hidden units reach thermal equilibrium at a temperature of 1
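$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle s_i s_j \rangle_v - \langle s_i s_j \rangle_{\text{free}}$$
Left-hand side: derivative of log probability of one training vector
$\langle s_i s_j \rangle_v$: expected value of product of states at thermal equilibrium when the training vector is clamped on the visible units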
• We can now train our PoE model.
• But… there’s a problem:
– $P^\infty$ is computationally infeasible to obtain (esp. in an inner gradient ascent loop).
– The sampling Markov chain must converge to the target distribution. Often this takes a very long time!
• Current unsupervised learning methods don’t easily extend to learn multiple levels of representation
Belief Nets
• A belief net is a directed acyclic graph composed of stochastic variables.
– The energy is determined by the weights and biases (as in a Hopfield net).
• The energy of a joint configuration of the visible and hidden units determines its probability:
$$E(v, h) = -\sum_{i \in \text{units}} s_i^{vh} b_i - \sum_{i<j} s_i^{vh} s_j^{vh} w_{ij}$$
$E(v, h)$: energy with configuration v on the visible units and h on the hidden units
$b_i$: bias of unit i
$w_{ij}$: weight between units i and j
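The energy of a joint configuration can be computed directly from the biases and a symmetric weight matrix. The sketch below is illustrative: the function name `joint_energy` and the toy numbers are assumptions, and the upper triangle of `W` is used so that each pair of units is counted once.

```python
import numpy as np

def joint_energy(s, b, W):
    """E = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij for a binary state vector s."""
    s = np.asarray(s, dtype=float)
    upper = np.triu(W, k=1)  # keep only entries with i < j
    return -(s @ b) - (s @ upper @ s)

# Hypothetical 3-unit network with symmetric weights.
b = np.array([0.5, -1.0, 0.25])
W = np.array([[0.0, 3.0, 2.0],
              [3.0, 0.0, -1.0],
              [2.0, -1.0, 0.0]])
energy = joint_energy([1, 0, 1], b, W)  # units 0 and 2 on: -(0.5 + 0.25) - 2.0
```

Units that are off contribute nothing, so only the biases of active units and the weights between pairs of active units lower or raise the energy.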
Stochastic binary neurons
• These have a state of 1 or 0 which is a stochastic function of the neuron’s bias, b, and the input it receives from other neurons.
A very surprising fact
• Everything that one weight needs to know about the other weights and the data in order to do maximum likelihood learning is contained in the difference of two correlations.
Solution: Contrastive Divergence!
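$$\frac{\partial \log p(D \mid \theta_1, \ldots, \theta_n)}{\partial \theta_m} \approx \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1}$$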
• Now we don’t have to run the sampling Markov chain to convergence; instead, we can stop after 1 iteration (or, more typically, perhaps a few iterations)
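For a binary RBM, stopping the chain after one iteration gives the familiar CD-1 update. The following is a minimal sketch under several assumptions not stated in the slides: logistic units, a weight matrix `W` of shape (visible, hidden), visible biases `a`, hidden biases `b`, and hidden probabilities (rather than sampled states) used when collecting the statistics, which is a common practical choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, rng, learning_rate=0.1):
    """One CD-1 step on a batch v0 (rows are data vectors)."""
    # Positive statistics: hidden units driven by the data (distribution P^0).
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One step of the Markov chain: reconstruct visibles, then hiddens (P^1).
    pv1 = sigmoid(a + h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)
    n = v0.shape[0]
    dW = (v0.T @ ph0 - v1.T @ ph1) / n   # difference of the two correlations
    da = (v0 - v1).mean(axis=0)
    db = (ph0 - ph1).mean(axis=0)
    return W + learning_rate * dW, a + learning_rate * da, b + learning_rate * db

rng = np.random.default_rng(0)
data = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]], dtype=float)
W, a, b = np.zeros((3, 2)), np.zeros(3), np.zeros(2)
W, a, b = cd1_update(data, W, a, b, rng)
```

Running more Gibbs steps before collecting the negative statistics would give CD-k; the structure of the update stays the same.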
Motivations
• Supervised training of deep models (e.g. many-layered NNets) is difficult (optimization problem)
• Shallow models (SVMs, one-hidden-layer NNets, boosting, etc…) are unlikely candidates for learning high-level abstractions needed for AI
$i < j$: indexes every non-identical pair of i and j once
Weights → Energies → Probabilities
• Each possible joint configuration of the visible and hidden units has an energy
[Figure: belief net with stochastic hidden causes connected by directed edges to visible effects]
Use nets composed of layers of stochastic binary variables with weighted connections. Later, we will generalize to other types of variable.
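$$\arg\max_{\theta_1, \ldots, \theta_n} \log p(D \mid \theta_1, \ldots, \theta_n) = \arg\max_{\theta_1, \ldots, \theta_n} \log \prod_{d \in D} \frac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)}$$
The product over $d \in D$ runs over all training data.
Assuming the d’s are drawn independently from $p(\cdot)$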
• Differentiate w.r.t. all parameters and perform gradient ascent to find optimal parameters.
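The differentiation step can be written out as follows. This is a sketch that assumes each parameter $\theta_m$ appears only in its own expert $f_m$, writes $P^0$ for the empirical data distribution and $P^\infty$ for the model's equilibrium distribution $p(c \mid \theta_1, \ldots, \theta_n)$, and divides by the constant $|D|$, which does not change the ascent direction.

```latex
\begin{align*}
\log p(D \mid \theta_1,\ldots,\theta_n)
  &= \sum_{d \in D} \Big[ \sum_{m} \log f_m(d \mid \theta_m)
     - \log \sum_{c} \prod_{m'} f_{m'}(c \mid \theta_{m'}) \Big] \\
\frac{1}{|D|} \frac{\partial \log p(D \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m}
  &= \frac{1}{|D|} \sum_{d \in D} \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m}
     - \sum_{c} p(c \mid \theta_1,\ldots,\theta_n)\,
       \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \\
  &= \Big\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \Big\rangle_{P^0}
     - \Big\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \Big\rangle_{P^\infty}
\end{align*}
```

The second term comes from differentiating the log of the normalizing sum, which turns into an expectation under the model distribution; that expectation is the part that requires sampling.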