MLA 2016: Distributed Learning Algorithms
Analysis of Distributed Learning Algorithms

Ding-Xuan Zhou
City University of Hong Kong
E-mail: mazhou@cityu.edu.hk

Supported in part by Research Grants Council of Hong Kong

November 5, 2016
Outline of the Talk

I. Distributed learning with big data
II. Least squares regression and regularization
III. Distributed learning with regularization schemes
IV. Optimal rates for regularization
V. Other distributed learning algorithms
VI. Further topics
I. Distributed learning with big data

Big data leads to scientific challenges: storage bottleneck, algorithmic scalability, ...

Distributed learning: based on a divide-and-conquer approach. A distributed learning algorithm consists of three steps:
(1) partitioning the data into disjoint subsets;
(2) applying a learning algorithm, implemented on an individual machine or processor, to each data subset to produce an individual output;
(3) synthesizing a global output by taking some average of the individual outputs.

Advantages: reducing the memory and computing costs of handling big data.
If we divide a sample $D = \{(x_i, y_i)\}_{i=1}^N$ of input-output pairs into disjoint subsets $\{D_j\}_{j=1}^m$, applying a learning algorithm to the much smaller data subset $D_j$ gives an output $f_{D_j}$, and the global output might be the average $\overline{f}_D = \frac{1}{m} \sum_{j=1}^m f_{D_j}$.

The distributed learning method has been observed to be very successful in many practical applications. This raises a challenging theoretical question: if we had a "big machine" which could apply the same learning algorithm to the whole data set $D$ to produce an output $f_D$, could $\overline{f}_D$ be as efficient as $f_D$?

Recent work: Zhou-Chawla-Jin-Williams, Zhang-Duchi-Wainwright, Shamir-Srebro, ...
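To make the divide-and-conquer scheme concrete, here is a minimal sketch (an illustration added to these notes, not part of the original slides): the data are split into m disjoint subsets, a base learner is fitted on each, and the local predictors are averaged. The linear ridge base learner and the helper names fit_ridge and distributed_fit are assumptions made only for this sketch.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-2):
    """Illustrative base learner: linear ridge regression on one data subset."""
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * len(y) * np.eye(d), X.T @ y)
    return lambda Xnew: Xnew @ w

def distributed_fit(X, y, m, base_fit=fit_ridge):
    """Divide-and-conquer: split D into m disjoint subsets, fit locally, average."""
    parts = np.array_split(np.arange(len(y)), m)
    local_models = [base_fit(X[idx], y[idx]) for idx in parts]
    # global output: bar{f}_D = (1/m) * sum_j f_{D_j}
    return lambda Xnew: np.mean([f(Xnew) for f in local_models], axis=0)

# Usage: compare the averaged estimator with the single-machine one
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)
f_bar = distributed_fit(X, y, m=10)
f_full = fit_ridge(X, y)
```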
II. Least squares regression and regularization

II.1. Model for the least squares regression

Learn $f: X \to Y$ from a random sample $D = \{(x_i, y_i)\}_{i=1}^N$. Take $X$ to be a compact metric space and $Y = \mathbb{R}$, with $y \approx f(x)$.

Due to noise or other uncertainty, we assume an (unknown) probability measure $\rho$ on $Z = X \times Y$ governs the sampling:
the marginal distribution $\rho_X$ on $X$: $x = \{x_i\}_{i=1}^N$ is drawn according to $\rho_X$;
the conditional distribution $\rho(\cdot|x)$ at $x \in X$.

Learning the regression function: $f_\rho(x) = \int_Y y \, d\rho(y|x)$, so that $y_i \approx f_\rho(x_i)$.
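The following small simulation is an added illustration of this model, with $\rho_X$, $\rho(\cdot|x)$ and $f_\rho$ chosen arbitrarily: $x$ is drawn from a uniform $\rho_X$ on $[0,1]$, $y$ from a Gaussian conditional distribution, and the regression function (the conditional mean) is recovered empirically by local averaging.

```python
import numpy as np

# Hypothetical instance of the sampling model: X = [0, 1], Y = R,
# rho_X uniform, rho(.|x) Gaussian centered at a chosen regression function.
rng = np.random.default_rng(1)

def f_rho(x):                              # regression function f_rho(x) = E[y | x]
    return np.sin(2 * np.pi * x)

N = 5000
x = rng.uniform(0.0, 1.0, N)               # x_i drawn according to rho_X
y = f_rho(x) + 0.2 * rng.normal(size=N)    # y_i drawn from rho(.|x_i)

# Averaging y over a small neighbourhood of x0 approximates f_rho(x0)
x0 = 0.25
mask = np.abs(x - x0) < 0.02
print(y[mask].mean(), f_rho(x0))           # the two values should be close
```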
II.2. Error decomposition and ERM

$\mathcal{E}^{ls}(f) = \int_Z (f(x) - y)^2 \, d\rho$ is minimized by $f_\rho$:
$\mathcal{E}^{ls}(f) - \mathcal{E}^{ls}(f_\rho) = \|f - f_\rho\|_{L^2_{\rho_X}}^2 =: \|f - f_\rho\|_\rho^2 \ge 0.$

Classical approach of Empirical Risk Minimization (ERM): let $\mathcal{H}$ be a compact subset of $C(X)$ called the hypothesis space (model selection). The ERM algorithm is given by
$f_D = \arg\min_{f \in \mathcal{H}} \mathcal{E}_D^{ls}(f), \qquad \mathcal{E}_D^{ls}(f) = \frac{1}{N} \sum_{i=1}^N (f(x_i) - y_i)^2.$

Target function $f_{\mathcal{H}}$: the best approximation of $f_\rho$ in $\mathcal{H}$,
$f_{\mathcal{H}} = \arg\min_{f \in \mathcal{H}} \mathcal{E}^{ls}(f) = \arg\inf_{f \in \mathcal{H}} \int_Z (f(x) - y)^2 \, d\rho.$
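As a hedged illustration of ERM (not from the slides), the sketch below takes $\mathcal{H}$ to be a small linear hypothesis space, polynomials of degree at most 3, for which the empirical risk minimizer $f_D$ is an ordinary least squares solution; the choice of $\mathcal{H}$ and of the simulated data is an assumption made for the example.

```python
import numpy as np

# Illustrative ERM over a simple hypothesis space:
# H = polynomials of degree <= 3 on [0, 1]. The empirical risk
# E_D(f) = (1/N) sum_i (f(x_i) - y_i)^2 is minimized by least squares.
rng = np.random.default_rng(2)
N = 200
x = rng.uniform(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)

Phi = np.vander(x, 4, increasing=True)           # features 1, x, x^2, x^3
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # ERM minimizer f_D in H

def f_D(t):
    return np.vander(np.atleast_1d(t), 4, increasing=True) @ coef

emp_risk = np.mean((Phi @ coef - y) ** 2)        # E_D^{ls}(f_D)
```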
II.3. Approximation error

Analysis: $\|f_D - f_\rho\|_{L^2_{\rho_X}}^2 = \int_X (f_D(x) - f_\rho(x))^2 \, d\rho_X$ is bounded by
$2 \sup_{f \in \mathcal{H}} \left| \mathcal{E}_D^{ls}(f) - \mathcal{E}^{ls}(f) \right| + \mathcal{E}^{ls}(f_{\mathcal{H}}) - \mathcal{E}^{ls}(f_\rho).$

Approximation Error. Smale-Zhou (Anal. Appl. 2003):
$\mathcal{E}^{ls}(f_{\mathcal{H}}) - \mathcal{E}^{ls}(f_\rho) = \|f_{\mathcal{H}} - f_\rho\|_{L^2_{\rho_X}}^2 = \inf_{f \in \mathcal{H}} \int_X (f(x) - f_\rho(x))^2 \, d\rho_X,$
so $f_{\mathcal{H}} \approx f_\rho$ when $\mathcal{H}$ is rich.

Theorem 1. Let $B$ be a Hilbert space (such as a Sobolev space or a reproducing kernel Hilbert space). If $B \subset L^2_{\rho_X}$ is dense and $\theta > 0$, then
$\inf_{\|f\|_B \le R} \|f - f_\rho\|_{L^2_{\rho_X}} = O(R^{-\theta})$
if and only if $f_\rho$ lies in the interpolation space $(B, L^2_{\rho_X})_{\frac{\theta}{1+\theta}, \infty}$.
II.4. Examples of hypothesis spaces

Sobolev spaces: if $X \subset \mathbb{R}^n$, $\rho_X$ is the normalized Lebesgue measure, and $B$ is the Sobolev space $H^s$ with $s > n/2$, then $(H^s, L^2_{\rho_X})_{\frac{\theta}{1+\theta}, \infty}$ is the Besov space $B^{s\frac{\theta}{1+\theta}}_{2,\infty}$, and
$H^{s\frac{\theta}{1+\theta}} \subset B^{s\frac{\theta}{1+\theta}}_{2,\infty} \subset H^{s\frac{\theta}{1+\theta} - \epsilon}$ for any $\epsilon > 0$.

Range of a power of the integral operator: if $K: X \times X \to \mathbb{R}$ is a Mercer kernel (continuous, symmetric and positive semidefinite), then the integral operator $L_K$ on $L^2_{\rho_X}$ is defined by
$L_K(f)(x) = \int_X K(x, y) f(y) \, d\rho_X(y), \qquad x \in X.$
The $r$-th power $L_K^r$ is well defined for any $r \ge 0$. Its range $L_K^{1/2}(L^2_{\rho_X})$ gives the RKHS $\mathcal{H}_K = L_K^{1/2}(L^2_{\rho_X})$, and for $0 < r \le 1/2$,
$L_K^r(L^2_{\rho_X}) \subset (\mathcal{H}_K, L^2_{\rho_X})_{2r, \infty} \quad \text{and} \quad (\mathcal{H}_K, L^2_{\rho_X})_{2r, \infty} \subset L_K^{r - \epsilon}(L^2_{\rho_X})$
for any $\epsilon > 0$ when the support of $\rho_X$ is $X$. So we may assume
$f_\rho = L_K^r(g_\rho) \quad \text{for some } r > 0, \ g_\rho \in L^2_{\rho_X}.$
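A possible numerical counterpart (an added sketch, under the assumption that $L_K$ may be approximated on a sample by the scaled Gram matrix $\frac{1}{N}K$): fractional powers $L_K^r$ are then applied through an eigendecomposition, which gives one way to produce functions $f_\rho = L_K^r(g_\rho)$ satisfying the regularity assumption above. The Gaussian kernel and the choice of $g_\rho$ are illustrative.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=0.5):
    """A Mercer kernel used for illustration."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
N = 200
x = rng.uniform(0, 1, size=(N, 1))            # sample from rho_X (uniform here)
K = gaussian_kernel(x, x)                      # Gram matrix of the Mercer kernel
LK = K / N                                     # empirical approximation of L_K

# Fractional power via eigendecomposition (LK is symmetric positive semidefinite)
evals, evecs = np.linalg.eigh(LK)
evals = np.clip(evals, 0.0, None)
r = 0.3
g = np.sin(2 * np.pi * x[:, 0])                # some g_rho in L^2_{rho_X}
f_rho = evecs @ (evals ** r * (evecs.T @ g))   # sample values of f_rho = L_K^r(g_rho)
```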
II.5. Least squares regularization

$f_{D,\lambda} := \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{N} \sum_{i=1}^N (f(x_i) - y_i)^2 + \lambda \|f\|_K^2 \right\}, \qquad \lambda > 0.$

A large literature in learning theory: books by Vapnik, Schölkopf-Smola, Wahba, Anthony-Bartlett, Shawe-Taylor-Cristianini, Steinwart-Christmann, Cucker-Zhou, ...; many papers: Cucker-Smale, Zhang, De Vito-Caponnetto-Rosasco, Smale-Zhou, Lin-Zeng-Fang-Xu, Yao, Chen-Xu, Shi-Feng-Zhou, Wu-Ying-Zhou, ...

The analysis involves:
the regularity of $f_\rho$;
the complexity of $\mathcal{H}_K$: covering numbers, decay of the eigenvalues $\{\lambda_i\}$ of $L_K$, effective dimension, ...;
the decay of $y$: $|y| \le M$, exponential decay, a moment decaying condition, $E[|y|^q] < \infty$ for some $q > 2$, $\sigma_\rho^2 \in L^p_{\rho_X}$ for the conditional variance $\sigma_\rho^2(x) = \int_Y (y - f_\rho(x))^2 \, d\rho(y|x)$, ...
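A minimal implementation sketch of this regularization scheme, assuming the standard representer-theorem form of the minimizer, $f_{D,\lambda} = \sum_i c_i K(x_i, \cdot)$ with $c = (K + \lambda N I)^{-1} y$ for the Gram matrix $K$ on the sample; the Gaussian kernel and the helper name krr_fit are choices made for the sketch, not part of the slides.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=0.5):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def krr_fit(X, y, lam, kernel=gaussian_kernel):
    """Least squares regularization in H_K (kernel ridge regression).
    Minimizing (1/N) sum_i (f(x_i) - y_i)^2 + lam * ||f||_K^2 over H_K gives
    f_{D,lam}(x) = sum_i c_i K(x_i, x) with c = (K + lam * N * I)^{-1} y."""
    N = len(y)
    K = kernel(X, X)
    c = np.linalg.solve(K + lam * N * np.eye(N), y)
    return lambda Xnew: kernel(Xnew, X) @ c

# Usage on simulated data
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(300, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=300)
f_D_lam = krr_fit(X, y, lam=1e-3)
```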