Parallel optimization methods
[Figure labels: Typical parallel L-BFGS; Sandblaster L-BFGS (Dean J, NIPS'12); Parameter Server; Model Replicas; Data]
Parallel optimization methods
Facebook Framework
• Model parallelism on multi-GPU platforms
• Data parallelism on multi-GPU platforms (a combined toy sketch follows below)
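As a hedged toy of combining the two (plain NumPy in place of GPUs; the two-way weight split, the two replicas, and all sizes are assumptions for illustration, not Facebook's code): each replica splits its weights across two simulated devices (model parallelism) while two replicas train on different batches and average their gradients (data parallelism).

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.ones(8)

def replica_grad(W_gpu0, W_gpu1, X, y):
    # model parallelism: each simulated "GPU" owns half of the weight vector
    pred = X[:, :4] @ W_gpu0 + X[:, 4:] @ W_gpu1
    r = pred - y
    return X[:, :4].T @ r / len(y), X[:, 4:].T @ r / len(y)

W0, W1 = np.zeros(4), np.zeros(4)
for step in range(300):
    g0s, g1s = [], []
    for _ in range(2):                              # data parallelism: two replicas
        X = rng.standard_normal((32, 8))
        g0, g1 = replica_grad(W0, W1, X, X @ true_w)
        g0s.append(g0)
        g1s.append(g1)
    W0 -= 0.1 * np.mean(g0s, axis=0)                # average gradients per weight shard
    W1 -= 0.1 * np.mean(g1s, axis=0)
print(np.round(np.concatenate([W0, W1]), 2))        # ≈ true_w
```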
Facebook Framework
• Model and data parallelism
– Too many parameters to train
• GoogLeNet – Top-1 of the Large Scale Visual Recognition Challenge 2014
– Iterative algorithms
Background
Optimize_method
• Basic steps of some optimization methods
[Flowchart: while opt_epoch < opt_max_epoch, loop over minibatches (minibatch < minibatch_num); CG/L-BFGS: compute a search direction, run a Wolfe line search (wolfe_line_search with polyinterp), calculate ΔW, update W; SGD: calculate ΔW and update W with no line search; stop when the epoch loop ends. A sketch follows below.]
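A hedged sketch of the flowchart above, not the original code: a least-squares loss stands in for the network, and a simple backtracking (Armijo) search stands in for the Wolfe line search / polyinterp box.

```python
import numpy as np

def loss_and_grad(W, X, y):
    r = X @ W - y
    return 0.5 * np.mean(r ** 2), X.T @ r / len(y)      # "Cal ΔW"

def backtracking_line_search(W, d, X, y, t=1.0, beta=0.5):
    # stand-in for the Wolfe line search / polyinterp step
    f0, g0 = loss_and_grad(W, X, y)
    while loss_and_grad(W + t * d, X, y)[0] > f0 + 1e-4 * t * (g0 @ d):
        t *= beta
    return t

def train(X, y, opt="sgd", opt_max_epoch=5, minibatch=32, lr=0.1):
    W = np.zeros(X.shape[1])
    for epoch in range(opt_max_epoch):                   # opt_epoch < opt_max_epoch
        for i in range(0, len(y), minibatch):            # minibatch < minibatch_num
            Xb, yb = X[i:i + minibatch], y[i:i + minibatch]
            _, dW = loss_and_grad(W, Xb, yb)             # the most time-consuming step
            if opt == "sgd":
                W = W - lr * dW                          # SGD: no line search
            else:
                d = -dW                                  # stand-in for a CG/L-BFGS direction
                W = W + backtracking_line_search(W, d, Xb, yb) * d
    return W                                             # stop

X = np.random.randn(256, 10)
y = X @ np.ones(10)
print(np.round(train(X, y, opt="lbfgs"), 2))             # ≈ all ones
```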
• Use GPUs and CPUs
• Data Parallelization
• Model can be parallelized as well
• Use Parameter Server to coordinate
Baidu PADDLE (Kaiyu, CIKM 2013)
• Flexible model structures
– Model replicas asynchronously fetch parameters w and push gradients Δw to the parameter server
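A minimal threaded sketch of that fetch/push pattern; the class, the learning rate, and the least-squares "replica" are invented for illustration and are not DistBelief's API.

```python
import threading
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def fetch(self):                       # replicas fetch the current w
        with self.lock:
            return self.w.copy()

    def push(self, dw):                    # replicas push gradients Δw
        with self.lock:
            self.w -= self.lr * dw         # applied asynchronously, no barrier

def replica(server, X, y, steps=100):
    for _ in range(steps):
        w = server.fetch()
        r = X @ w - y
        server.push(X.T @ r / len(y))      # local gradient on this data shard

dim = 10
true_w = np.ones(dim)
server = ParameterServer(dim)
threads = []
for _ in range(4):                         # one model replica per data shard
    X = np.random.randn(200, dim)
    t = threading.Thread(target=replica, args=(server, X, X @ true_w))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
print(np.round(server.fetch(), 2))         # ≈ true_w despite the stale reads
```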
Parallel optimization methods
• Downpour SGD (Dean J, NIPS’12)
Parallel optimization methods
Typical parallel L-BFGS
• Each node computes the gradient on a specific subset of the dataset.
• The gradients are sent back to a central server.
• Need to wait for the slowest machine.
• Does not scale well to large shared clusters.
Sandblaster L-BFGS (Dean J, NIPS'12)
• Adds a coordinator to balance the data.
• The coordinator assigns each of the N model replicas a small portion of work, much smaller than 1/N of the total batch size (as sketched below).
• Faster nodes do more work than slower nodes.
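A rough sketch of the coordinator idea under stated assumptions: threads play nodes of different speeds, a least-squares gradient plays the per-slice work, and the slice size and speed values are invented for the demo.

```python
import queue
import threading
import time
import numpy as np

def batch_gradient(X, y, w, slices_per_worker=8):
    speeds = (1, 1, 2, 4)                         # simulated slow/fast nodes
    n = len(y)
    slice_size = max(1, n // (len(speeds) * slices_per_worker))  # << n / N
    work = queue.Queue()
    for i in range(0, n, slice_size):
        work.put((i, min(i + slice_size, n)))     # coordinator's work queue
    grad, lock = np.zeros_like(w), threading.Lock()

    def worker(speed):
        while True:
            try:
                lo, hi = work.get_nowait()        # pull the next small slice
            except queue.Empty:
                return                            # nobody waits on a 1/N chunk
            time.sleep(0.001 * speed)             # slower nodes take longer ...
            g = X[lo:hi].T @ (X[lo:hi] @ w - y[lo:hi])
            with lock:
                grad[:] += g                      # ... and just do fewer slices

    threads = [threading.Thread(target=worker, args=(s,)) for s in speeds]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return grad / n                               # full-batch gradient for L-BFGS

X = np.random.randn(512, 10)
y = X @ np.ones(10)
print(np.round(batch_gradient(X, y, np.zeros(10)), 2))
```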
Tencent Mariana (Yongqiang Zou, Proceedings of the VLDB Endowment 2014)
• A multi-GPU data parallelism framework for deep neural networks (DNNs).
• A multi-GPU model parallelism and data parallelism framework for deep convolutional neural networks (CNNs), similar to Facebook's.
• A CPU cluster framework for large-scale DNNs.
Parallelism in Deep Learning
Yueqing Wang
Supervisor: Prof. Dou
Contents
• Background
• Data Parallelism
• Model Parallelism
• Parallel Optimization Methods
• Parallel Framework
[Figure: per-node timelines along the server time axis — in typical parallel L-BFGS each of the N nodes processes its 1/N slice of Batch 1, waits for the slowest node, then starts Batch 2; in Sandblaster L-BFGS the nodes keep pulling much smaller slices with no waiting]
– Downpour SGD
– Sandblaster L-BFGS
Google DistBelief (Dean.J NIPS’12)
• Model Parallelism
• Distributed optimization algorithms
Baidu PADDLE (Kaiyu, CIKM 2013)
– Conjugate Gradient (CG), L-BFGS, stochastic gradient descent (SGD)
– W means the parameters we want to train
– "Cal ΔW" is the most time-consuming step
Tencent Mariana
• Model and data parallelism
Tencent Mariana
• A CPU cluster framework for large scale DNNs
Discussion
• What can we do next?
– New parallel algorithms?
– Improve current parallel algorithms?
– Design parallel algorithms for new platforms?
Background
• Why need parallelism in DL?
– Big data
• ImageNet: 1,000 categories, 1.2 million images for training and 150,000 images for testing and validation.
• MIC/FPGA
• Sandblaster L-BFGS (Dean J, NIPS'12)
Parallel Framework
• Google DistBelief
• Baidu PADDLE
• Tencent Mariana
• Facebook Parallel Framework
Google DistBelief (Dean.J NIPS’12)
Typical parallel L-BFGS
• Need to wait for the slowest machine: #Slice i-1 = … = #Slice i-N = (1/N)·size(Batch 1)
Sandblaster L-BFGS (Dean J, NIPS'12)
• There is no waiting time in the figure: #Slice i-1, …, #Slice i-MN ≪ (1/N)·size(Batch 1)
Parallel optimization methods
• Utilize computing clusters with thousands of machines to train large models
• Model Parallelism
• Develop two algorithms for large distributed training
Data Parallelism
• Each thread has one model replica.
• Replicas use different data batches to calculate ΔW (see the sketch below).
• A parameter server controls W.
• A classical algorithm: Downpour SGD.
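For contrast with the asynchronous Downpour SGD variant, here is a hedged sketch of the synchronous flavour of this pattern (serial loops stand in for the replicas and the parameter server; the least-squares model is only for illustration): every replica computes ΔW on its own batch and the averaged gradient updates the shared W.

```python
import numpy as np

def data_parallel_step(W, batches, lr=0.1):
    grads = []
    for X, y in batches:                       # each replica: its own data batch
        r = X @ W - y
        grads.append(X.T @ r / len(y))         # each replica: Cal ΔW on that batch
    return W - lr * np.mean(grads, axis=0)     # parameter server: average, update W

dim = 10
true_w = np.ones(dim)
W = np.zeros(dim)
for step in range(200):
    batches = []
    for _ in range(4):                         # four model replicas
        X = np.random.randn(64, dim)
        batches.append((X, X @ true_w))
    W = data_parallel_step(W, batches)
print(np.round(W, 2))                          # ≈ true_w
```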
Model Parallelism
• Data is shared.
• Different nodes have different parameters to train.
• Communication among nodes is needed (see the toy sketch below).
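A toy sketch of this split under assumptions (a two-layer network divided across two simulated nodes; the layer sizes are made up): each node owns and would train only its own parameters, both see the same data, and the activation at the boundary is what has to be communicated between nodes.

```python
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.standard_normal((10, 32))    # parameters owned (and trained) by node 0
W1 = rng.standard_normal((32, 5))     # parameters owned (and trained) by node 1

def node0_forward(X):
    return np.maximum(X @ W0, 0.0)    # node 0 computes its layer (ReLU)

def node1_forward(h):
    return h @ W1                     # node 1 computes its layer

X = rng.standard_normal((64, 10))     # both nodes see the same data batch
h = node0_forward(X)                  # "communication": h is sent from node 0 to node 1
out = node1_forward(h)
print(out.shape)                      # (64, 5)
```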
Parallel optimization methods
• Downpour SGD (Dean J, NIPS’12)