Random Matrix Theory Models of Deep Learning
▪ Gene expression testing: N = 6033 genes, n = 102 subjects
▪ Power grid monitoring: N = 3000–10000 PMUs, n sampled observations
▪ A. N. Kolmogorov, asymptotic theory (1970–1974)
▪ High-dimensional covariance matrices
$$X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1T} \\ X_{21} & X_{22} & \cdots & X_{2T} \\ \vdots & \vdots & \ddots & \vdots \\ X_{N1} & X_{N2} & \cdots & X_{NT} \end{pmatrix}$$
Deep Learning Theory - A Review
Forward propagation: $f(\cdots f(f(f(z, w_1, b_1), w_2, b_2)\cdots), w_n, b_n)$
Sigmoid: $f(z) = \dfrac{1}{1 + e^{-z}}$
Tanh: $f(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
ReLU: $f(z) = \begin{cases} z, & z \ge 0 \\ 0, & z < 0 \end{cases}$
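A minimal NumPy sketch of the forward pass and the three activations above (layer sizes, initialization, and function names here are illustrative assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # f(z) = 1 / (1 + e^{-z})

def tanh(z):
    return np.tanh(z)                       # f(z) = (e^z - e^{-z}) / (e^z + e^{-z})

def relu(z):
    return np.maximum(z, 0.0)               # f(z) = z for z >= 0, else 0

def forward(z, weights, biases, f=relu):
    """Nested forward pass f(... f(f(z, w1, b1), w2, b2) ..., wn, bn)."""
    for W, b in zip(weights, biases):
        z = f(z @ W + b)
    return z

# Toy 3-layer pass with random weights, just to exercise the shapes.
rng = np.random.default_rng(0)
dims = [4, 8, 8, 2]
Ws = [rng.standard_normal((dims[i], dims[i + 1])) / np.sqrt(dims[i]) for i in range(3)]
bs = [np.zeros(dims[i + 1]) for i in range(3)]
print(forward(rng.standard_normal((1, 4)), Ws, bs))
```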
Today: >1000 layers
[7] Andoni, Panigraphy, Valiant, Zhang. Learning Polynomials with Neural Networks. ICML 2014.
[8] Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014.
[9] Choromanska, Henaff, Mathieu, Arous, LeCun. "The Loss Surfaces of Multilayer Networks." AISTATS 2015.
[10] Chaudhuri, Soatto. The Effect of Gradient Noise on the Energy Landscape of Deep Networks. arXiv 2015.
[11] Haeffele, Vidal. Global Optimality in Tensor Factorization, Deep Learning and Beyond. arXiv 2015.
[12] Janzamin, Sedghi, Anandkumar. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods. arXiv 2015.
[13] Dauphin, Yann N., et al. "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization." Advances in Neural Information Processing Systems. 2014.
[1] Advani, Madhu S., and Andrew M. Saxe. "High-dimensional dynamics of generalization error in neural networks." arXiv preprint arXiv:1710.03667 (2017).
[1] Pennington, Jeffrey, and Yasaman Bahri. "Geometry of Neural Network Loss Surfaces via Random Matrix Theory." International Conference on Machine Learning. 2017.
RMT of Deep Learning
▪ Pearson, Fisher, Neyman: classical statistics (1900-1940s)
▪ Correlation of infinite vectors (Karl Pearson, 1905)
▪ Correlation of finite vectors (Fisher, 1924)
▪ Low-dimensional problems: random-variable dimension N = 2-10
▪ High-dimensional hypothesis testing
Test $H_0: \rho(\cdot,\cdot) = 0$ as $N, n \to \infty$ with $N/n \to c$
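A small NumPy experiment (my own illustration, not from the slides) of why classical tests of ρ = 0 break down when the dimension N is comparable to the sample size n: even for completely independent variables, large spurious sample correlations appear.

```python
import numpy as np

# N independent variables, only n samples each; every true correlation is 0.
rng = np.random.default_rng(0)
N, n = 1000, 100
X = rng.standard_normal((N, n))

R = np.corrcoef(X)                                  # N x N sample correlation matrix
off_diag = np.abs(R[~np.eye(N, dtype=bool)])
print(off_diag.max())                               # roughly 0.4-0.5, purely by chance
```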
▪ The bulk of the eigenvalues depends on the architecture
▪ The top discrete eigenvalues depend on the data
[1] Sagun, Levent, Léon Bottou, and Yann LeCun. "Singularity of the Hessian in Deep Learning." arXiv preprint arXiv:1611.07476 (2016).
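A toy illustration of this bulk-plus-outliers picture (my own sketch, not from [1]) in the simplest case where the Hessian is explicit: for linear least squares the Hessian is the sample covariance (1/n) XᵀX, so a strong direction planted in the data creates an outlier eigenvalue while the rest form a Marchenko-Pastur-like bulk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 400
spike = rng.standard_normal(p)
spike /= np.linalg.norm(spike)

# Isotropic noise plus one strong data direction (population covariance I + 25 * spike spike^T).
X = rng.standard_normal((n, p)) + 5.0 * rng.standard_normal((n, 1)) * spike

H = X.T @ X / n                        # Hessian of (1/2n) * ||X w - y||^2 with respect to w
eigs = np.sort(np.linalg.eigvalsh(H))
print(eigs[-1])                        # ~26: outlier eigenvalue carrying the planted data direction
print(eigs[0], eigs[-2])               # ~0.31 and ~2.09: bulk edges, close to Marchenko-Pastur for p/n = 0.2
```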
Hessian Matrix
Gradient descent (green) and Newton's method (red) for minimizing a function.
[1] Y. LeCun, L. Bottou, G. B. Orr, K.-R. Müller. Efficient backprop. Lecture notes in computer science, pages 9–50, 1998.
[2] Sagun, L., Evci, U., Guney, V. U., Dauphin, Y., & Bottou, L. (2017). Empirical analysis of the Hessian of over-parametrized neural networks.
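A minimal sketch of the comparison in the figure above (the quadratic objective and step size are arbitrary choices of mine): on an ill-conditioned quadratic, gradient descent crawls along the low-curvature direction, while a single Newton step, which rescales the gradient by the inverse Hessian, jumps straight to the minimum.

```python
import numpy as np

# f(x) = 0.5 * x^T A x, minimizer at x = 0, condition number 100.
A = np.diag([1.0, 100.0])
x0 = np.array([1.0, 1.0])

x_gd = x0.copy()
lr = 1.0 / 100.0                        # step size limited by the high-curvature direction
for _ in range(50):
    x_gd = x_gd - lr * (A @ x_gd)       # gradient step

x_nt = x0 - np.linalg.solve(A, A @ x0)  # one Newton step: x - H^{-1} grad

print(x_gd)   # ~[0.61, 0.00]: still far from the minimum along the flat direction
print(x_nt)   # [0., 0.]: exact minimum after one step on a quadratic
```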
$X_{ij} \sim \mathcal{N}(0, 1)$, $X \in \mathbb{R}^{N \times T}$
$$\frac{1}{T} X X^{T} \;\to\; \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}_{N \times N}$$

$\lambda_1 = \lambda_2 = \cdots = \lambda_N = 1$??
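A NumPy check of this question (a sketch of my own, not from the slides): when T ≫ N the sample covariance eigenvalues do all sit near 1, but once c = N/T is not small they spread over the Marchenko-Pastur interval [(1-√c)², (1+√c)²], so the naive answer "all eigenvalues equal 1" fails in the high-dimensional regime.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cov_eigs(N, T):
    X = rng.standard_normal((N, T))     # X_ij ~ N(0, 1), true covariance = I_N
    return np.linalg.eigvalsh(X @ X.T / T)

# Classical regime (N fixed, T large): eigenvalues concentrate near 1.
print(sample_cov_eigs(N=10, T=100_000).round(2))

# High-dimensional regime, c = N/T = 0.5: eigenvalues spread over
# the Marchenko-Pastur support [(1 - sqrt(0.5))^2, (1 + sqrt(0.5))^2] ~ [0.09, 2.91].
eigs = sample_cov_eigs(N=500, T=1000)
print(eigs.min().round(3), eigs.max().round(3))
```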
$$Y = X_1 X_2 \cdots X_L = \prod_{i=1}^{L} X_i$$

L = 1    L = 5
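A short NumPy sketch (my own illustration) of how the spectrum changes between the two cases labeled above: the singular values of the normalized product Y = X₁⋯X_L spread out and become increasingly ill-conditioned as L grows.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

def product_singular_values(L):
    """Singular values of Y = X_1 X_2 ... X_L with iid N(0, 1/N) entries."""
    Y = np.eye(N)
    for _ in range(L):
        Y = Y @ (rng.standard_normal((N, N)) / np.sqrt(N))
    return np.linalg.svd(Y, compute_uv=False)

for L in (1, 5):
    s = product_singular_values(L)
    # The largest singular value grows with L while the smallest collapse toward 0.
    print(L, s.max().round(2), np.median(s).round(3), s.min().round(6))
```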
Neuroscience vs. Deep Network:
▪ Simple cells ↔ First layer
▪ Complex cells ↔ Pooling layer
▪ Grandmother cells ↔ Last layer
[1] Cybenko. Approximations by superpositions of sigmoidal functions, Mathematics of Control, Signals, and Systems, 2(4), 303-314, 1989.
[2] Hornik, Stinchcombe, White. Multilayer feedforward networks are universal approximators, Neural Networks, 2(3), 359-366, 1989.
[3] Hornik. Approximation Capabilities of Multilayer Feedforward Networks, Neural Networks, 4(2), 251–257, 1991.
[4] Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[5] P. Baldi, K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima, Neural Networks, 1989.
[6] Brady, Raghavan, Slawny. Back propagation fails to separate where perceptrons succeed. IEEE Trans. Circuits & Systems, 36(5):665–674, 1989.
[7] Gori, Tesi. On the problem of local minima in backpropagation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(1):76–86, 1992.
[8] Frasconi, Gori, Tesi. Successes and failures of backpropagation: A theoretical. Progress in Neural Networks: Architecture, 5:205, 1997.
[1] Razavian, Azizpour, Sullivan, Carlsson. CNN Features off-the-shelf: an Astounding Baseline for Recognition. CVPRW'14.
[2] Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
[3] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456.
[1] Sagun, L., Evci, U., Guney, V. U., Dauphin, Y., & Bottou, L. (2017). Empirical analysis of the hessian of over-parametrized neural networks.
$H = H_0 + H_1$
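For the squared loss studied in [1], this is the standard Gauss-Newton split (sketched here from the usual definitions; $e_i$ is the residual on sample $i$, $\hat{y}_i$ the network output, $m$ the sample count):

$[H_0]_{\mu\nu} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial \hat{y}_i}{\partial \theta_\mu}\,\frac{\partial \hat{y}_i}{\partial \theta_\nu}, \qquad [H_1]_{\mu\nu} = \frac{1}{m}\sum_{i=1}^{m} e_i\, \frac{\partial^2 \hat{y}_i}{\partial \theta_\mu \partial \theta_\nu}$

$H_0$ is positive semi-definite, so any negative curvature must come from $H_1$, whose scale is set by the residuals; this is what lets [1] model the two pieces with random matrix ensembles.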
[1] Bruna, Mallat. Classification with scattering operators, CVPR'11. Invariant scattering convolution networks, arXiv'12. Mallat, Waldspurger. Deep Learning by Scattering, arXiv'13.
[2] Wiatowski, Bölcskei. A mathematical theory of deep convolutional neural networks for feature extraction. arXiv 2015.
[3] Giryes, Sapiro, Bronstein. Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? arXiv:1504.08291.
[4] Sokolic. Margin Preservation of Deep Neural Networks, 2015.
[5] Montufar. Geometric and Combinatorial Perspectives on Deep Neural Networks, 2015.
[6] Neyshabur. The Geometry of Optimization and Generalization in Neural Networks: A Path-based Approach, 2015.