Deep Learning Literature Review
PhD Literature Review Report (1)
Yannan (Summary of methods to optimize DNNs)
1. Machine learning and related deep learning
The subject of my PhD is a deep learning based neuromorphic system with applications, so the category of deep learning algorithm has to be selected carefully depending on the type of real-world problem as well as on the neuromorphic platform.
When we set up a NN, the most important aspects of its performance are training speed, training set accuracy and validation set accuracy, and the main concern is preventing the results from overfitting. The optimization methods found in recent literature and online tutorials can be summarised as follows:
1. L1/L2 Regularization
Define the cost function we are trying to minimise as
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} F(Y_{out}^{(i)}, Y^{(i)})
L2 regularization adds the squared Euclidean norm of the parameter vector w to the cost (the low-variance bias parameter b is usually omitted) in order to reduce the effect of high variance:
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} F(Y_{out}^{(i)}, Y^{(i)}) + \lambda \, \|w\|_2^2
where
\|w\|_2^2 = \sum_{i=1}^{n} w_i^2 = w^T w
L1 regularization drives more parameters to exactly zero and makes the model sparse:
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} F(Y_{out}^{(i)}, Y^{(i)}) + \lambda \, \|w\|_1
where
\|w\|_1 = \sum_{i=1}^{n} |w_i|
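A minimal NumPy sketch of the two penalty terms above (the data cost, the weight matrices W1, W2 and the value of λ are placeholders, not part of the original text):

```python
import numpy as np

def l2_penalty(weights, lambd):
    # lambda * ||w||_2^2, summed over all weight matrices of the network
    return lambd * sum(np.sum(np.square(W)) for W in weights)

def l1_penalty(weights, lambd):
    # lambda * ||w||_1, summed over all weight matrices of the network
    return lambd * sum(np.sum(np.abs(W)) for W in weights)

# usage sketch: J = data_cost + l2_penalty([W1, W2], lambd=0.01)
```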
2. Dropout
Dropout is widely used to stop a deep NN from overfitting: a manually set keep-probability is used to randomly eliminate neurons from a layer during training. This is usually implemented by multiplying the previous layer's output by a matrix of the same shape containing ones and zeros. Dropout shrinks the weights and therefore has a regularizing effect that helps prevent overfitting, similar to L2 regularization. However, dropout can be shown to be an adaptive form of regularization, whereas the L2 penalty differs from weight to weight depending on the size of the activations being calculated.
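A minimal sketch of the usual "inverted dropout" implementation of this idea (the activation matrix A and the keep-probability value are assumed inputs):

```python
import numpy as np

def inverted_dropout(A, keep_prob=0.8, rng=None):
    """Randomly eliminate neurons from a layer's activations A during training."""
    rng = rng or np.random.default_rng()
    mask = (rng.random(A.shape) < keep_prob).astype(A.dtype)  # matrix of ones and zeros
    A = A * mask          # drop neurons
    A = A / keep_prob     # scale up so the expected activation is unchanged
    return A, mask
```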
3. Data augmentation
This method is useful when the data set is small but each example contains many features, such as colour images. Flipping, rotating or zooming an image and adding small distortions can generate additional training data from the original set (see the sketch after the figure captions below).
Figure. 1: Dropout sample with (a) before dropout (b) after dropout
Figure. 2: Horizontally flipped images
Figure. 3: Rotated zoomed image
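A minimal NumPy sketch of the flipping and rotating operations mentioned above, applied to an H x W x C image array (zooming and other distortions would follow the same pattern):

```python
import numpy as np

def augment(image, rng=None):
    """Return a randomly flipped and rotated copy of an H x W x C image."""
    rng = rng or np.random.default_rng()
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                         # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))    # rotate by 0/90/180/270 degrees
    return out
```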
4. Early stopping
As shown in Figure 4, the validation set accuracy does not always increase with the training set accuracy, and a good stopping point can be found before the total number of iterations is completed. Early stopping usually improves validation set accuracy at the cost of some training set accuracy, and at the same time prevents the network from overfitting.
Figure. 4: Early stopping description
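A sketch of the early-stopping loop, assuming hypothetical `train_one_epoch`, `validate`, `get_weights` and `set_weights` callbacks (these names are placeholders, not a specific library API):

```python
def train_with_early_stopping(model, train_one_epoch, validate, max_epochs=100, patience=5):
    """Stop training once validation accuracy stops improving for `patience` epochs."""
    best_val, best_weights, wait = 0.0, model.get_weights(), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_acc = validate(model)
        if val_acc > best_val:
            best_val, best_weights, wait = val_acc, model.get_weights(), 0
        else:
            wait += 1
            if wait >= patience:
                break                        # validation accuracy has plateaued
    model.set_weights(best_weights)          # restore the best checkpoint
    return best_val
```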
5. Normalize input
Normalizing the input usually speeds up training and improves the performance of the neural network. The usual steps are to subtract the mean and scale by the variance so that all input features of the training set have the same range. The learning rate then does not need to be adapted at every gradient descent step, and normalization helps the GD algorithm find the optimal parameters more quickly and accurately.
Figure. 5: Left: after data normalization; Right: before normalization
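A minimal sketch of this normalization step, applying the training-set statistics to both sets (X is assumed to be samples x features; dividing by the standard deviation is one common choice):

```python
import numpy as np

def normalize_inputs(X_train, X_test, eps=1e-8):
    """Zero-mean, unit-variance scaling; statistics come from the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    X_train_n = (X_train - mu) / (sigma + eps)
    X_test_n = (X_test - mu) / (sigma + eps)   # reuse the training statistics at test time
    return X_train_n, X_test_n
```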
6. Weight initialization for vanishing/exploding gradients
When training a very deep neural network, the derivatives can sometimes become either very large or very small. A very deep network with the bias terms ignored can be viewed as a stacked product of the weight matrices of each layer:
Y = W_n \cdot W_{n-1} \cdot W_{n-2} \cdots W_3 \cdot W_2 \cdot W_1 \cdot X
where weight values consistently greater than 1 or consistently less than 1 make the product behave like W^{n-1}, which is a huge or a tiny value.
Multiplying the initialised weights by the square root of an appropriate variance reduces the vanishing/exploding problem; the variance depends on the activation function:
\text{tanh (Xavier initialization)}: \sqrt{\frac{1}{n^{[l-1]}}}
\text{ReLU}: \sqrt{\frac{2}{n^{[l-1]}}}
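A minimal sketch of the two initialization rules above for a fully connected network (`layer_dims` lists the layer sizes; the parameter-dictionary layout is just one convenient convention):

```python
import numpy as np

def init_weights(layer_dims, activation="relu", rng=None):
    """Xavier initialization for tanh, He-style initialization for ReLU."""
    rng = rng or np.random.default_rng()
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        scale = np.sqrt(2.0 / fan_in) if activation == "relu" else np.sqrt(1.0 / fan_in)
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], fan_in)) * scale
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params
```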
7. Mini-batch gradient descent
When the training set becomes very large, traditional stochastic gradient descent results in a very slow training process because gradient descent is performed on individual inputs. Mini-batch gradient descent splits the whole training set into several batches of a chosen batch size (for 10000 inputs with a batch size of 100, the number of batches is 100), stacks the inputs within each batch into a matrix/vector, and trains on them together. If the batch size is set to 1, this is exactly stochastic gradient descent, which updates on every single input rather than on a group of inputs. One epoch means that all the batches have been passed through the NN once.
Typical mini-batch sizes are 64, 128, 256, 512 or 1024, usually a power of 2, for large training data sets.
Figure. 6: Mini-batch for 10 batches
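A minimal sketch of the batch-splitting step (X is assumed to be samples x features, Y the matching labels):

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=64, rng=None):
    """Shuffle the training set and split it into mini-batches."""
    rng = rng or np.random.default_rng()
    m = X.shape[0]
    perm = rng.permutation(m)
    X, Y = X[perm], Y[perm]
    return [(X[i:i + batch_size], Y[i:i + batch_size]) for i in range(0, m, batch_size)]
```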
8. Momentum
In every iteration, momentum computes dW and db on the current mini-batch and then computes
V_{dW} = \beta V_{dW} + (1-\beta)\, dW
V_{db} = \beta V_{db} + (1-\beta)\, db
It then updates the weights and bias by:
W = W - \alpha V_{dW}
b = b - \alpha V_{db}
Momentum can be understood as applying an exponentially weighted average (EWA) to gradient descent, so each update is an average of the current gradient and the previous updates, controlled by the parameter \beta (the EWA coefficient, distinct from the learning rate \alpha). The usual choice is \beta = 0.9, which corresponds to averaging over roughly the last \frac{1}{1-\beta} \approx 10 gradients and gives suitable updates.
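A minimal sketch of one momentum step for a single weight/bias pair (the gradients dW, db are assumed to have been computed on the current mini-batch):

```python
def momentum_update(W, b, dW, db, vdW, vdb, alpha=0.01, beta=0.9):
    """One momentum step; vdW/vdb hold the exponentially weighted average of past gradients."""
    vdW = beta * vdW + (1 - beta) * dW
    vdb = beta * vdb + (1 - beta) * db
    W = W - alpha * vdW
    b = b - alpha * vdb
    return W, b, vdW, vdb
```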
9. RMSprop
RMSprop also computes dW and db on the current mini-batch in every iteration, and then computes
S_{dW} = \beta S_{dW} + (1-\beta)\, dW^2
S_{db} = \beta S_{db} + (1-\beta)\, db^2
RMSprop then updates the parameters as follows:
W = W - \alpha \frac{dW}{\sqrt{S_{dW}}}
b = b - \alpha \frac{db}{\sqrt{S_{db}}}
RMSprop effectively speeds up learning by adapting the step size separately for the weights and the bias: a parameter whose gradients are consistently large receives smaller steps and one with small gradients receives larger steps, which makes GD converge more quickly.
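A minimal sketch of one RMSprop step (ε is added to the denominator purely for numerical stability, as is common practice):

```python
import numpy as np

def rmsprop_update(W, b, dW, db, sdW, sdb, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSprop step; sdW/sdb track an EWA of the squared gradients."""
    sdW = beta * sdW + (1 - beta) * dW ** 2
    sdb = beta * sdb + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(sdW) + eps)
    b = b - alpha * db / (np.sqrt(sdb) + eps)
    return W, b, sdW, sdb
```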
10. Adam
Adam is essentially a combination of momentum and RMSprop: it computes dW and db on the current mini-batch, and then computes the same quantities as momentum and RMSprop:
V_{dW} = \beta_1 V_{dW} + (1-\beta_1)\, dW
V_{db} = \beta_1 V_{db} + (1-\beta_1)\, db
S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\, dW^2
S_{db} = \beta_2 S_{db} + (1-\beta_2)\, db^2
with the two separate hyperparameters \beta_1 and \beta_2. On the n-th iteration Adam applies the EWA bias correction:
V_{dW}^{corrected} = \frac{V_{dW}}{1-\beta_1^n}
V_{db}^{corrected} = \frac{V_{db}}{1-\beta_1^n}
S_{dW}^{corrected} = \frac{S_{dW}}{1-\beta_2^n}
S_{db}^{corrected} = \frac{S_{db}}{1-\beta_2^n}
W and b are then updated as
W = W - \alpha \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}} + \varepsilon}
b = b - \alpha \frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}} + \varepsilon}
The general hyperparameter choices for Adam are: the learning rate \alpha needs to be tuned; \beta_1 = 0.9; \beta_2 = 0.999; \varepsilon does not really affect performance and is usually set to 10^{-8}.
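A minimal sketch of one Adam step for a single parameter tensor, combining the momentum and RMSprop terms with the bias correction above (t is the 1-based iteration count):

```python
import numpy as np

def adam_update(W, dW, vdW, sdW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a single parameter tensor W."""
    vdW = beta1 * vdW + (1 - beta1) * dW
    sdW = beta2 * sdW + (1 - beta2) * dW ** 2
    vdW_corr = vdW / (1 - beta1 ** t)       # EWA bias correction
    sdW_corr = sdW / (1 - beta2 ** t)
    W = W - alpha * vdW_corr / (np.sqrt(sdW_corr) + eps)
    return W, vdW, sdW
```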
11. Learning rate decay
A fixed learning rate usually results in a noisy learning process that never settles at the optimal point. A learning rate decay algorithm reduces the learning rate as the iterations proceed, which allows the NN to end up at a relatively accurate optimum.
This can be implemented as a function of the epoch number:
\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}} \, \alpha_0
or alternatively
\alpha = \text{decay\_rate}^{\text{epoch\_num}} \cdot \alpha_0 \quad \text{(exponential decay)}
\alpha = \frac{\alpha_0}{\sqrt{\text{epoch\_num}}}, \quad \text{or a discrete staircase schedule}
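A minimal sketch of the schedules above gathered into one helper (the schedule names are placeholders chosen for illustration):

```python
def decayed_lr(alpha0, epoch, decay_rate=1.0, schedule="inverse"):
    """Return the learning rate for a given epoch under the chosen decay schedule."""
    if schedule == "inverse":        # alpha0 / (1 + decay_rate * epoch)
        return alpha0 / (1.0 + decay_rate * epoch)
    if schedule == "exponential":    # decay_rate**epoch * alpha0, with decay_rate < 1
        return (decay_rate ** epoch) * alpha0
    if schedule == "sqrt":           # alpha0 / sqrt(epoch)
        return alpha0 / max(epoch, 1) ** 0.5
    raise ValueError(f"unknown schedule: {schedule}")
```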
12. Picking hyperparameters
A common pain point with DNNs is picking a large set of hyperparameters, which may include: the learning rate, the momentum factor \beta, the Adam factors \beta_1, \beta_2 and \varepsilon, the number of layers, the number of hidden units, the learning rate decay rate, the batch size, and so on.
The range of each hyperparameter depends on the problem to be solved. The usual approach is to sample randomly over a reasonable scale, test a few of the sampled values, and then narrow the range or change the sampling scale to refine the decision (see the sketch below).
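One common way of changing the sampling scale is to sample on a log scale; a minimal sketch for the learning rate (the range 10^-4 to 10^-1 is only an example, not a recommendation from the text):

```python
import numpy as np

def sample_learning_rates(n, low=1e-4, high=1e-1, rng=None):
    """Randomly sample n learning rates uniformly on a log scale between low and high."""
    rng = rng or np.random.default_rng()
    exponents = rng.uniform(np.log10(low), np.log10(high), size=n)
    return 10.0 ** exponents
```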
13. Batch normalization
Similar to input normalization, a well-distributed set of activations saves computation and makes the algorithm run faster. Batch normalization normalizes the outputs of the previous hidden layer (i.e. the inputs of the next hidden layer), which speeds up the computation in that layer. It is implemented by computing the mean and variance of the values and normalizing:
Z^{(i)}_{norm} = \frac{Z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}
To give the hidden units an alternative mean and variance,
Z^{(i)}_N = \gamma Z^{(i)}_{norm} + \beta
where \gamma and \beta are parameters learned by the model; if \gamma = \sqrt{\sigma^2 + \varepsilon} and \beta = \mu, then Z^{(i)}_N = Z^{(i)}.
Implementing batch normalization is as simple as adding a BN layer with the additional parameters \beta and \gamma for each normalized unit; these are updated by the optimizer (SGD, RMSprop, etc.) just like the weights. One thing to note is that subtracting the mean cancels any bias added in the operation, which means the bias parameter b can be removed from the layer in front of the BN layer.
The mean and variance are usually estimated with an EWA across the mini-batches of the training set, and these running estimates are then used at test time.
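A minimal sketch of the batch-normalization forward pass described above (Z is assumed to be features x batch; the running EWA of \mu and \sigma^2 for test time is omitted for brevity):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize pre-activations Z and apply the learnable scale/shift gamma, beta."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta        # learnable mean and variance
    return Z_tilde, (Z_norm, mu, var)      # cache for backprop / running averages
```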