Research on Algorithms Based on Approximate Dynamic Programming


System with added noise disturbance
Added noise: $0.3\sin(x(t))$
$$x(t+1) = f(x(t)) + g(x(t))u(t) + 0.3\sin(x(t))$$
[Figure: x1 and x1 (with disturbance) versus time step]
1. Introduction
Dynamic programming and Bellman's principle of optimality
System description and performance index:
$$J[x(i), i] = \sum_{k=i}^{\infty} \gamma^{k-i}\, U[x(k), u(k), k]$$
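Critic training error:
$$\|E_h\| = \sum_t E_h(t) = \frac{1}{2} \sum_t \left[ \hat J(t) - U(t) - \gamma \hat J(t+1) \right]^2$$
When $E_h(t) = 0$ for all $t$:
$$\hat J(t) = U(t) + \gamma \hat J(t+1) = U(t) + \gamma \left[ U(t+1) + \gamma \hat J(t+2) \right] = \cdots = \sum_{k=t}^{\infty} \gamma^{k-t} U(k)$$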
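The recursion between the discounted cost and the critic error can be checked numerically. The following is a minimal sketch, assuming a finite horizon (the cost beyond the last step is taken as zero) and a hypothetical utility sequence and discount factor; the function names are illustrative and not from the slides.

```python
def discounted_cost(utilities, gamma):
    """Return J(t) = sum_k gamma**(k - t) * U(k) for every t.

    Uses the backward recursion J(t) = U(t) + gamma * J(t + 1),
    with J beyond the horizon taken as 0 (finite-horizon assumption).
    """
    costs = [0.0] * (len(utilities) + 1)
    for t in range(len(utilities) - 1, -1, -1):
        costs[t] = utilities[t] + gamma * costs[t + 1]
    return costs[:-1]


def critic_error(j_hat, utilities, gamma):
    """Return 0.5 * sum_t [J_hat(t) - U(t) - gamma * J_hat(t + 1)]**2."""
    padded = list(j_hat) + [0.0]  # J_hat past the horizon is taken as 0
    return 0.5 * sum(
        (padded[t] - utilities[t] - gamma * padded[t + 1]) ** 2
        for t in range(len(utilities))
    )


utilities = [1.0, 2.0, 3.0]  # hypothetical U(0), U(1), U(2)
gamma = 0.5                  # hypothetical discount factor
costs = discounted_cost(utilities, gamma)
```

A critic that outputs exactly these costs has zero training error, while any other estimate is penalized by the squared one-step mismatch.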
[Diagram: Action Network, input x(t), output u(t)]
[Figure: State trajectory x1 versus time step]
[Figure: The cost and The control versus time step]
Variants of the BP algorithm
• Batching
• Momentum BP (MOBP)
• Variable learning rate BP (VLBP)
• Conjugate gradient BP (CGBP)
• Levenberg-Marquardt BP (LMBP)
3. Adaptive Critic Design

• HDP (Heuristic dynamic programming)
• DHP (Dual heuristic dynamic programming)
[Figure: State trajectory panels, including x2 and x2 (with disturbance), versus time step]
Research on Algorithms Based on Approximate Dynamic Programming
Research on an iterative algorithm for approximate optimal control based on adaptive critic design
Name: Cao Ning (曹宁)
Advisor: Prof. Zhang Huaguang (张化光)

Main contents
1. Introduction
2. Theory of Neural Network
3. Adaptive Critic Design
4. Discrete Time Nonlinear HJB Solution
5. Neural Network Modeling
Output layer:
$$a = f^2\left( W^2 f^1(W^1 p + b^1) + b^2 \right)$$
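Error backpropagation algorithm
1. Forward propagation
2. Error backpropagation
Compute the sensitivities
$$s^M = -2 \dot F^M(n^M)(t - a), \qquad s^m = \dot F^m(n^m) (W^{m+1})^T s^{m+1}$$
where $\dot F^m(n^m)$ is the diagonal matrix whose entries are the activation derivatives $\dot f^m(n^m_j)$.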
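4. Discrete Time Nonlinear HJB Solution
Solution of the HJB equation for discrete-time systems
System equation:
$$x(t+1) = f(x(t)) + g(x(t))u(x(t))$$
Objective function:
$$V(x(t)) = \sum_{i=t}^{\infty} \left( x(i)^T Q x(i) + u(i)^T R u(i) \right)$$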
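$$\frac{\partial \left( x(t)^T Q x(t) + \hat u_{i(j)}^T R \hat u_{i(j)} + \hat V_i(x(t+1)) \right)}{\partial W_{ui(j)}} = \frac{\partial \left( \hat u_{i(j)}^T R \hat u_{i(j)} \right)}{\partial W_{ui(j)}} + \frac{\partial \hat V_i(x(t+1))}{\partial W_{ui(j)}}$$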
$$V^*(x(t)) = \min_{u(t)} \left( x(t)^T Q x(t) + u(t)^T R u(t) + V^*(x(t+1)) \right)$$
$$u^*(x(t)) = -\frac{1}{2} R^{-1} g(x(t))^T \frac{\partial V^*(x(t+1))}{\partial x(t+1)}$$
$$V^*(x(t)) = x(t)^T Q x(t) + \frac{1}{4} \frac{\partial V^*(x(t+1))^T}{\partial x(t+1)} g(x(t)) R^{-1} g(x(t))^T \frac{\partial V^*(x(t+1))}{\partial x(t+1)} + V^*(x(t+1))$$
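The optimal control and the resulting HJB equation follow from the first-order condition of the Bellman equation. A sketch of that step, assuming $R$ is symmetric positive definite, $V^*$ is differentiable, and $x(t+1) = f(x(t)) + g(x(t))u(t)$ so that $\partial x(t+1)^T / \partial u(t) = g(x(t))^T$:

$$
\begin{aligned}
0 &= \frac{\partial}{\partial u(t)} \left[ x(t)^T Q x(t) + u(t)^T R u(t) + V^*(x(t+1)) \right]
   = 2 R u(t) + g(x(t))^T \frac{\partial V^*(x(t+1))}{\partial x(t+1)} \\
\Rightarrow \quad u^*(x(t)) &= -\frac{1}{2} R^{-1} g(x(t))^T \frac{\partial V^*(x(t+1))}{\partial x(t+1)} \\
u^{*T} R u^* &= \frac{1}{4} \frac{\partial V^*(x(t+1))^T}{\partial x(t+1)} g(x(t)) R^{-1} g(x(t))^T \frac{\partial V^*(x(t+1))}{\partial x(t+1)}
\end{aligned}
$$

Substituting the last line for the control cost in the Bellman equation gives the quarter-weighted quadratic term in the expression for $V^*(x(t))$.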
[Figure: The cost, v and v (with disturbance), versus time step]
[Figure: The control, u and u (with disturbance), versus time step]
Output:
$$a = f(n)$$
where
$$n = w_{1,1} p_1 + w_{1,2} p_2 + \cdots + w_{1,R} p_R + b$$
Network architectures
[Diagram: multilayer network with an input layer (p1, ..., pR), hidden layers, and outputs (a1, ..., aS); weights w1(i,j) and w2(j,t)]
$$U[x(t), u(t), t] + \gamma J^*[x(t+1), t+1]$$
$$J^*[x(t), t] = \min_{u(t)} \left( U[x(t), u(t), t] + \gamma J^*[x(t+1), t+1] \right)$$
$$u^*(t) = \arg\min_{u(t)} \left( U[x(t), u(t), t] + \gamma J^*[x(t+1), t+1] \right)$$
[Diagram: Action Network, input x(t), output u(t)]
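Training the HDP critic network
$$J(t) = \sum_{k=t}^{\infty} \gamma^{k-t}\, U[x(k), u(k), k]$$
[Diagram: the Model Network produces x(t+1), which the Critic Network maps to Ĵ(t+1)]
$$\|E_h\| = \sum_t E_h(t)$$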
Neuron model
f: activation function
• Hard limit
• Linear
• Log-sigmoid
[Diagram: multiple-input neuron with inputs p1, p2, ..., pR, weights w1,1, ..., w1,R, bias b, net input n, and output a = f(Wp + b)]
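The multiple-input neuron a = f(Wp + b) with the three activation functions listed above can be sketched as follows. The function names `hardlim`, `purelin` and `logsig` are borrowed naming conventions assumed here for illustration, and the weights and inputs are hypothetical.

```python
import math


def hardlim(n):
    """Hard limit: 1 when the net input is non-negative, otherwise 0."""
    return 1.0 if n >= 0 else 0.0


def purelin(n):
    """Linear: the output equals the net input."""
    return n


def logsig(n):
    """Log-sigmoid: squashes the net input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-n))


def neuron(weights, inputs, bias, activation):
    """Compute a = f(n) with n = w_1 p_1 + ... + w_R p_R + b."""
    n = sum(w * p for w, p in zip(weights, inputs)) + bias
    return activation(n)


weights = [1.0, -2.0]  # hypothetical w_{1,1}, w_{1,2}
inputs = [3.0, 1.0]    # hypothetical p_1, p_2
outputs = {f.__name__: neuron(weights, inputs, 0.5, f)
           for f in (hardlim, purelin, logsig)}
```

The same net input n = 1.5 yields three different outputs, which shows that the activation function alone decides the output range.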
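$$\frac{\partial \left( \hat u_{i(j)}^T R \hat u_{i(j)} \right)}{\partial W_{ui(j)}} + \frac{\partial \hat V_i(x(t+1))}{\partial W_{ui(j)}} = \frac{\partial \hat u_{i(j)}^T}{\partial W_{ui(j)}} \frac{\partial \left( \hat u_{i(j)}^T R \hat u_{i(j)} \right)}{\partial \hat u_{i(j)}} + \frac{\partial \hat u_{i(j)}^T}{\partial W_{ui(j)}} \frac{\partial x(t+1)^T}{\partial \hat u_{i(j)}} \frac{\partial \phi(x(t+1))^T}{\partial x(t+1)} \frac{\partial \hat V_i(x(t+1))}{\partial \phi(x(t+1))}$$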
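HDP iterative algorithm
Start
Initialization: $V_0 = 0$
Solve the minimization problem:
$$u_i(x(t)) = \arg\min_{u} \left( x(t)^T Q x(t) + u^T(x(t)) R u(x(t)) + V_i(x(t+1)) \right)$$
Update the value function:
$$V_{i+1}(x(t)) = x(t)^T Q x(t) + u_i^T(x(t)) R u_i(x(t)) + V_i\left( f(x(t)) + g(x(t)) u_i(x(t)) \right) = x(t)^T Q x(t) + u_i^T(x(t)) R u_i(x(t)) + V_i(x(t+1))$$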
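Least-squares update of the value function weights:
$$W_{Vi+1} = \arg\min_{W_{Vi+1}} \left\{ \int \left| W_{Vi+1}^T \phi(x(t)) - d(\phi(x(t)), W_{Vi}^T) \right|^2 dx(t) \right\}.$$
$$W_{Vi+1} = \left( \int \phi(x(t)) \phi(x(t))^T dx \right)^{-1} \int \phi(x(t))\, d^T(\phi(x(t)), W_{Vi}^T)\, dx$$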
Third- and fifth-order terms of $\sigma(x)$:
$$x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3,\ x_1^5,\ x_1^4 x_2,\ x_1^3 x_2^2,\ x_1^2 x_2^3,\ x_1 x_2^4,\ x_2^5$$
[Figure: state trajectory x2 versus time step]
Simulation experiment
$$x(t+1) = f(x(t)) + g(x(t))u(t)$$
$$f(x(t)) = \begin{bmatrix} 0.2\, x_1(t) \exp(x_2^2(t)) \\ 0.3\, x_2^3(t) \end{bmatrix}, \qquad g(x(t)) = \begin{bmatrix} 0 \\ -0.2 \end{bmatrix}$$
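One step of the simulated system can be sketched as below. This reads the garbled source as f(x) = [0.2 x1 exp(x2^2), 0.3 x2^3] and g(x) = [0, -0.2] (the minus sign is an assumption, since operators were lost in extraction), and it assumes the 0.3 sin(x(t)) disturbance is applied element-wise. The function `step` and its `disturbed` flag are hypothetical names.

```python
import math


def f(x):
    """Drift term f(x), as read from the simulation example (assumption)."""
    x1, x2 = x
    return [0.2 * x1 * math.exp(x2 ** 2), 0.3 * x2 ** 3]


def g(x):
    """Input vector g(x); constant, with the sign of -0.2 assumed."""
    return [0.0, -0.2]


def step(x, u, disturbed=False):
    """Return x(t+1) = f(x) + g(x) u, optionally plus 0.3 sin(x) element-wise."""
    x_next = [fi + gi * u for fi, gi in zip(f(x), g(x))]
    if disturbed:
        x_next = [xn + 0.3 * math.sin(xi) for xn, xi in zip(x_next, x)]
    return x_next


x0 = [1.0, 0.0]  # hypothetical state, chosen so the result is easy to check
nominal = step(x0, 1.0)
perturbed = step(x0, 1.0, disturbed=True)
```

Running both variants from the same state isolates the effect of the disturbance, which here changes only the first component because sin(0) = 0.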
$$\hat V_i(x(t), W_{Vi}) = W_{Vi}^T \phi(x(t))$$
$$\hat u_i(x(t), W_{ui}) = W_{ui}^T \sigma(x(t))$$
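$$\phi(x) = [\,x_1^2 \;\; x_1 x_2 \;\; x_2^2 \;\; x_1^4 \;\; x_1^3 x_2 \;\; x_1^2 x_2^2 \;\; x_1 x_2^3 \;\; x_2^4 \;\; x_1^6 \;\; x_1^5 x_2 \;\; x_1^4 x_2^2 \;\; x_1^3 x_2^3 \;\; x_1^2 x_2^4 \;\; x_1 x_2^5 \;\; x_2^6\,]^T$$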
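$$\sigma(x) = [\,x_1 \;\; x_2 \;\; x_1^3 \;\; x_1^2 x_2 \;\; x_1 x_2^2 \;\; x_2^3 \;\; x_1^5 \;\; x_1^4 x_2 \;\; x_1^3 x_2^2 \;\; x_1^2 x_2^3 \;\; x_1 x_2^4 \;\; x_2^5\,]^T$$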
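The two polynomial bases follow a regular pattern: the critic basis collects all monomials in x1 and x2 of even total degree up to six, and the action basis collects those of odd total degree up to five. A minimal sketch that generates both, with `monomials` and `evaluate_basis` as hypothetical helper names:

```python
def monomials(degrees):
    """Return exponent pairs (i, j) of x1**i * x2**j for each total degree.

    Within a degree the terms run from the highest power of x1 down to
    the highest power of x2, e.g. degree 2 gives x1^2, x1*x2, x2^2.
    """
    return [(d - j, j) for d in degrees for j in range(d + 1)]


def evaluate_basis(exponents, x1, x2):
    """Evaluate every monomial of the basis at the point (x1, x2)."""
    return [x1 ** i * x2 ** j for i, j in exponents]


phi_exponents = monomials([2, 4, 6])    # critic basis, even orders up to six
sigma_exponents = monomials([1, 3, 5])  # action basis, odd orders up to five
phi_values = evaluate_basis(phi_exponents, 2.0, 3.0)
```

The generated lists contain 15 and 12 terms respectively, matching the number of entries in the two basis vectors.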
Control signal
$$\hat u_i(x(t), W_{ui}) = W_{ui}^T \sigma(x(t))$$
$$W_{ui(j+1)} = W_{ui(j)} - \alpha \frac{\partial \left( x(t)^T Q x(t) + \hat u_{i(j)}^T R \hat u_{i(j)} + \hat V_i(x(t+1)) \right)}{\partial W_{ui(j)}}$$
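Drawback of dynamic programming:
the curse of dimensionality
Remedy: use structures such as artificial neural networks to approximate the objective function and thereby obtain an approximate solution of the dynamic programming problem, i.e. approximate dynamic programming (Adaptive Critic Design, ACD).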
2. Theory of Neural Network

• GDHP (Globalized dual heuristic dynamic programming)
• AD (action dependent) forms of HDP, DHP, GDHP

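HDP and ADHDP
[Diagram, HDP: the Model Network produces x(t+1), which the Critic Network maps to Ĵ(t+1)]
[Diagram, ADHDP: the Critic Network outputs Q(t); a new critic network is combined with the Model Network output x(t+1)]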
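$$\frac{\partial \left( x(t)^T Q x(t) + \hat u_{i(j)}^T R \hat u_{i(j)} + \hat V_i(x(t+1)) \right)}{\partial W_{ui(j)}} = 2 \sigma(x(t)) R \hat u_{i(j)} + \sigma(x(t))\, g(x(t))^T \frac{\partial \phi(x(t+1))^T}{\partial x(t+1)} W_{Vi}$$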
$$W_{ui(j+1)} = W_{ui(j)} - \alpha \left( 2 \sigma(x(t)) R \hat u_{i(j)} + \sigma(x(t))\, g(x(t))^T \frac{\partial \phi(x(t+1))^T}{\partial x(t+1)} W_{Vi} \right).$$
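The gradient-descent update of the action weights can be sketched for a scalar control as follows. This is an illustration under stated assumptions, not the implementation used in the slides: the bases are reduced to sigma(x) = [x1, x2] and phi(x) = [x1^2, x1 x2, x2^2] for brevity, Q is the identity, R is the scalar `r`, g is read as [0, -0.2], and all function names and numeric values are hypothetical.

```python
import math

G = [0.0, -0.2]  # input vector g(x), sign of -0.2 assumed


def f(x):
    x1, x2 = x
    return [0.2 * x1 * math.exp(x2 ** 2), 0.3 * x2 ** 3]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sigma(x):
    """Reduced action basis [x1, x2] (assumption)."""
    return list(x)


def phi(x):
    """Reduced critic basis [x1^2, x1*x2, x2^2] (assumption)."""
    x1, x2 = x
    return [x1 * x1, x1 * x2, x2 * x2]


def dphi_t_times_w(x, w_v):
    """(d phi / d x)^T W_V, the gradient of W_V^T phi(x) with respect to x."""
    x1, x2 = x
    return [2 * w_v[0] * x1 + w_v[1] * x2, w_v[1] * x1 + 2 * w_v[2] * x2]


def next_state(x, u):
    return [fi + gi * u for fi, gi in zip(f(x), G)]


def objective(w_u, w_v, x, r=1.0):
    """x^T Q x + u^T R u + V_hat(x(t+1)) with Q = I and scalar R = r."""
    u = dot(w_u, sigma(x))
    return dot(x, x) + r * u * u + dot(w_v, phi(next_state(x, u)))


def action_gradient(w_u, w_v, x, r=1.0):
    """sigma(x) * (2 R u + g^T (d phi / d x)^T W_V), evaluated at x(t+1)."""
    u = dot(w_u, sigma(x))
    common = 2 * r * u + dot(G, dphi_t_times_w(next_state(x, u), w_v))
    return [s * common for s in sigma(x)]


def action_update(w_u, w_v, x, alpha, r=1.0):
    """One step W_u <- W_u - alpha * gradient."""
    grad = action_gradient(w_u, w_v, x, r)
    return [w - alpha * gr for w, gr in zip(w_u, grad)]


x = [0.5, -0.3]        # hypothetical state sample
w_u = [0.1, -0.2]      # hypothetical action weights
w_v = [1.0, 0.5, 2.0]  # hypothetical critic weights
w_u_new = action_update(w_u, w_v, x, alpha=0.1)
```

Comparing `action_gradient` with a finite-difference gradient of `objective` is a quick way to confirm that the closed-form expression matches the cost being minimized.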
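$$\dot F^m(n^m) = \begin{bmatrix} \dot f^m(n^m_1) & 0 & \cdots & 0 \\ 0 & \dot f^m(n^m_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \dot f^m(n^m_{S^m}) \end{bmatrix}$$
Sensitivities are propagated backward through the layers:
$$s^M \to s^{M-1} \to \cdots \to s^2 \to s^1$$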
Weight and bias updates
$$W^m(k+1) = W^m(k) - \alpha s^m (a^{m-1})^T, \qquad b^m(k+1) = b^m(k) - \alpha s^m$$
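The forward pass, sensitivity recursion, and weight and bias updates can be sketched for a small network with one scalar input, a log-sigmoid hidden layer and a linear output. The initial weights, the training pair and the learning rate `alpha` are hypothetical, and the function names are illustrative.

```python
import math


def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))


def forward(params, p):
    """Forward pass: log-sigmoid hidden layer followed by a linear output."""
    w1, b1, w2, b2 = params
    a1 = [logsig(w * p + b) for w, b in zip(w1, b1)]
    a2 = sum(w * a for w, a in zip(w2, a1)) + b2
    return a1, a2


def backprop_step(params, p, target, alpha):
    """One steepest-descent update of the squared error (target - a2)**2."""
    w1, b1, w2, b2 = params
    a1, a2 = forward(params, p)
    # Output sensitivity s^M = -2 F'(n^M)(t - a); a linear output has F' = 1.
    s2 = -2.0 * (target - a2)
    # Hidden sensitivities s^m = F'(n^m)(W^{m+1})^T s^{m+1}; logsig' = a(1 - a).
    s1 = [a * (1.0 - a) * w * s2 for a, w in zip(a1, w2)]
    # W^m <- W^m - alpha s^m (a^{m-1})^T and b^m <- b^m - alpha s^m.
    new_w2 = [w - alpha * s2 * a for w, a in zip(w2, a1)]
    new_b2 = b2 - alpha * s2
    new_w1 = [w - alpha * s * p for w, s in zip(w1, s1)]
    new_b1 = [b - alpha * s for b, s in zip(b1, s1)]
    return (new_w1, new_b1, new_w2, new_b2), s1, s2


params = ([-0.27, -0.41], [-0.48, -0.13], [0.09, -0.17], 0.48)  # hypothetical
p, target = 1.0, 1.5                                            # hypothetical
err_before = (target - forward(params, p)[1]) ** 2
params_after, s1, s2 = backprop_step(params, p, target, alpha=0.1)
err_after = (target - forward(params_after, p)[1]) ** 2
```

Because each sensitivity is the derivative of the squared error with respect to a net input, it equals the gradient with respect to the corresponding bias, which gives a simple numerical check.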
Convergence test: if $|V_{i+1} - V_i| < \varepsilon$, finish; otherwise set $i = i + 1$ and return to the minimization step.
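The HDP iteration (initialize V_0 = 0, minimize over the control, update the value function, repeat until the change falls below a tolerance) can be illustrated on a scalar linear-quadratic problem, where each step has a closed form. This is a simplified stand-in, not the neural network version: it assumes x(t+1) = a x + b u, utility q x^2 + r u^2, and V_i(x) = p_i x^2, and the function name and parameters are hypothetical.

```python
def hdp_scalar_lq(a, b, q, r, eps=1e-12, max_iter=10_000):
    """Value iteration for x(t+1) = a x + b u with utility q x^2 + r u^2.

    Returns (p, k, iterations), where V(x) = p x^2 and u = -k x.
    """
    p = 0.0  # V_0 = 0
    for i in range(max_iter):
        # Minimizing q x^2 + r u^2 + p (a x + b u)^2 over u gives u = -k x.
        k = p * a * b / (r + p * b * b)
        # Value update: V_{i+1}(x) = q x^2 + r u_i^2 + V_i(a x + b u_i).
        p_next = q + r * k * k + p * (a - b * k) ** 2
        if abs(p_next - p) < eps:
            # Report the gain that is greedy with respect to the final value.
            k_final = p_next * a * b / (r + p_next * b * b)
            return p_next, k_final, i + 1
        p = p_next
    raise RuntimeError("value iteration did not converge")


p_star, k_star, n_iter = hdp_scalar_lq(a=1.0, b=1.0, q=1.0, r=1.0)
```

For a = b = q = r = 1 the fixed point satisfies p^2 - p - 1 = 0, so the iteration should approach the positive root (1 + sqrt(5)) / 2.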
Neural network implementation: value function
$$\hat V_i(x(t), W_{Vi}) = W_{Vi}^T \phi(x(t))$$
$$d(\phi(x(t)), W_{Vi}^T) = x(t)^T Q x(t) + \hat u_i^T(x(t)) R \hat u_i(x(t)) + \hat V_i(x(t+1)) = x(t)^T Q x(t) + \hat u_i^T(x(t)) R \hat u_i(x(t)) + W_{Vi}^T \phi(x(t+1))$$