Research on Algorithms Based on Approximate Dynamic Programming


System with an added noise disturbance
Added disturbance: $0.3\sin(x(t))$
$$x(t+1) = f(x(t)) + g(x(t))u(t) + 0.3\sin(x(t))$$
[Figure: state trajectory x1 versus time step, without and with disturbance]
1. Introduction
Dynamic programming and Bellman's principle of optimality
System description and system performance index:
$$J[x(i), i] = \sum_{k=i}^{\infty} \gamma^{k-i}\, U[x(k), u(k), k]$$
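Critic training error, summed over time:
$$\frac{1}{2}\sum_t \left[\hat J(t) - U(t) - \gamma \hat J(t+1)\right]^2$$
When this error is zero for all t:
$$\hat J(t) = U(t) + \gamma \hat J(t+1) = U(t) + \gamma\left[U(t+1) + \gamma \hat J(t+2)\right] = \cdots = \sum_{k=t}^{\infty} \gamma^{k-t} U(k)$$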
[Diagram: action network with input x(t) and output u(t)]
[Figure: state trajectory x1 versus time step]
[Figure: the cost and the control versus time step]
Variants of the BP algorithm
• Batching
• Momentum BP (MOBP)
• Variable learning rate BP (VLBP)
• Conjugate gradient BP (CGBP)
• Levenberg-Marquardt BP (LMBP)
3. Adaptive Critic Design

HDP (Heuristic dynamic programming)
DHP (Dual heuristic dynamic programming)
[Figure: state trajectory x2 versus time step, without and with disturbance]
Research on Algorithms Based on Approximate Dynamic Programming
Research on an iterative algorithm for approximate optimal control based on adaptive critic design
Author: Cao Ning (曹宁)
Advisor: Prof. Zhang Huaguang (张化光)
Outline
1. Introduction
2. Theory of Neural Network
3. Adaptive Critic Design
4. Discrete Time Nonlinear HJB Solution
5. Neural Network Modeling
Network output (hidden layer 1, output layer 2):
$$a = f^2\left(W^2 f^1(W^1 p + b^1) + b^2\right)$$
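Error backpropagation algorithm
1. Forward propagation
2. Backward propagation of the error
Compute the sensitivities
$$s^M = -2\dot F^M(n^M)(t - a), \qquad s^m = \dot F^m(n^m)(W^{m+1})^T s^{m+1},$$
where
$$\dot F^m(n^m) = \begin{bmatrix} \dot f^m(n_1^m) & 0 & \cdots & 0 \\ 0 & \dot f^m(n_2^m) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \dot f^m(n_{S^m}^m) \end{bmatrix}$$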
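4. Discrete Time Nonlinear HJB Solution
Solution of the HJB equation for discrete-time systems
System equation:
$$x(t+1) = f(x(t)) + g(x(t))u(x(t))$$
Cost function:
$$V(x(t)) = \sum_{i=t}^{\infty}\left(x(i)^T Q x(i) + u(i)^T R u(i)\right)$$
Gradient of the action network objective with respect to the weights $W_{ui(j)}$ (the term $x(t)^T Q x(t)$ does not depend on them):
$$\frac{\partial\left(x(t)^T Q x(t) + \hat u_{i(j)}^T R \hat u_{i(j)} + \hat V_i(x(t+1))\right)}{\partial W_{ui(j)}} = \frac{\partial\left(\hat u_{i(j)}^T R \hat u_{i(j)}\right)}{\partial W_{ui(j)}} + \frac{\partial \hat V_i(x(t+1))}{\partial W_{ui(j)}}$$
Bellman optimality equation:
$$V^*(x(t)) = \min_{u(t)}\left(x(t)^T Q x(t) + u(t)^T R u(t) + V^*(x(t+1))\right)$$
Optimal control:
$$u^*(x(t)) = -\frac{1}{2} R^{-1} g(x(t))^T \frac{\partial V^*(x(t+1))}{\partial x(t+1)}$$
Discrete-time HJB equation:
$$V^*(x(t)) = x(t)^T Q x(t) + \frac{1}{4}\frac{\partial V^*(x(t+1))^T}{\partial x(t+1)}\, g(x(t)) R^{-1} g(x(t))^T \frac{\partial V^*(x(t+1))}{\partial x(t+1)} + V^*(x(t+1))$$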
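The optimal control follows from the first-order stationarity condition of the expression being minimized. A short derivation sketch, assuming that R is symmetric positive definite and that V* is differentiable (assumptions not stated on the slide):

```latex
\begin{aligned}
0 &= \frac{\partial}{\partial u(t)}\Big( x(t)^T Q x(t) + u(t)^T R u(t) + V^*(x(t+1)) \Big) \\
  &= 2 R u(t) + \Big( \frac{\partial x(t+1)}{\partial u(t)} \Big)^T \frac{\partial V^*(x(t+1))}{\partial x(t+1)} \\
  &= 2 R u(t) + g(x(t))^T \frac{\partial V^*(x(t+1))}{\partial x(t+1)} \\
\Rightarrow \quad u^*(x(t)) &= -\frac{1}{2} R^{-1} g(x(t))^T \frac{\partial V^*(x(t+1))}{\partial x(t+1)}
\end{aligned}
```

Substituting u* back into $u^T R u$ gives the quadratic term with the factor 1/4 in the HJB equation.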
[Figure: the cost v and the control u versus time step, without and with disturbance]
Output: $a = f(n)$, where
$$n = w_{1,1}p_1 + w_{1,2}p_2 + \cdots + w_{1,R}p_R + b$$
Neural network model (network architectures)
[Diagram: multilayer feedforward network with inputs p1 ... pR, an input layer, hidden layers, and outputs a1 ... aS]
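By Bellman's principle of optimality, the optimal cost from time t is obtained by minimizing $U[x(t), u(t), t] + \gamma J^*[x(t+1), t+1]$ over u(t):
$$J^*[x(t), t] = \min_{u(t)}\left(U[x(t), u(t), t] + \gamma J^*[x(t+1), t+1]\right)$$
$$u^*(t) = \arg\min_{u(t)}\left(U[x(t), u(t), t] + \gamma J^*[x(t+1), t+1]\right)$$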
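[Diagram: action network with input x(t) and output u(t)]
Training of the HDP critic network
$$J(t) = \sum_{k=t}^{\infty} \gamma^{k-t}\, U[x(k), u(k), k]$$
[Diagram: the model network predicts x(t+1), and the critic network outputs Ĵ(t+1)]
$$\|E_h\| = \sum_t E_h(t)$$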
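A minimal sketch of critic training, minimizing the error $E_h(t) = \frac{1}{2}[\hat J(t) - U(t) - \gamma \hat J(t+1)]^2$ by gradient steps. Everything specific here is an assumption for illustration, not taken from the slides: the scalar model x(t+1) = 0.5 x(t), the utility U = x², the one-weight critic Ĵ(x) = w x², the discount 0.9, and the choice to hold the target U(t) + γĴ(t+1) fixed while differentiating.

```python
import numpy as np

GAMMA = 0.9  # discount factor (assumed value)


def model(x):
    """Hypothetical model network: predicts x(t+1) from x(t)."""
    return 0.5 * x


def utility(x):
    """Hypothetical utility U(t) = x(t)^2."""
    return x ** 2


def critic(w, x):
    """Hypothetical critic J_hat(x) = w * x^2 with one trainable weight."""
    return w * x ** 2


def train_critic(w=0.0, lr=0.5, epochs=200):
    """Gradient descent on E_h(t) = 0.5 * (J_hat(t) - U(t) - gamma * J_hat(t+1))^2."""
    states = np.linspace(-1.0, 1.0, 21)
    for _ in range(epochs):
        for x in states:
            # Target U(t) + gamma * J_hat(t+1), treated as a constant.
            target = utility(x) + GAMMA * critic(w, model(x))
            error = critic(w, x) - target
            # d J_hat(t) / d w = x^2, so the gradient of E_h is error * x^2.
            w -= lr * error * x ** 2
    return w


w = train_critic()
# For this toy problem the exact cost-to-go is x^2 / (1 - gamma * 0.25).
w_exact = 1.0 / (1.0 - GAMMA * 0.25)
print(w, w_exact)
```

For this toy problem the fixed point of the update coincides with the exact discounted cost, which makes the sketch easy to check by hand.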
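Neuron model
f: activation function
• Threshold (hard limit)
• Linear
• S-shaped (log-sigmoid)
[Diagram: multiple-input neuron with inputs p1, p2, ..., pR, weights w1,1 ... w1,R, bias b, summing junction Σ, net input n, activation function f, and output a = f(Wp + b)]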
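The multiple-input neuron a = f(Wp + b) with the three activation functions listed can be sketched as follows. The function names `hardlim`, `purelin` and `logsig` and the numeric values are illustrative choices, not part of the slides.

```python
import numpy as np


def hardlim(n):
    """Threshold (hard limit): 1 if n >= 0, else 0."""
    return np.where(n >= 0, 1.0, 0.0)


def purelin(n):
    """Linear activation: a = n."""
    return n


def logsig(n):
    """Log-sigmoid activation: a = 1 / (1 + exp(-n))."""
    return 1.0 / (1.0 + np.exp(-n))


def neuron(p, w, b, f):
    """Multiple-input neuron: n = w . p + b, a = f(n)."""
    n = np.dot(w, p) + b
    return f(n)


# Illustrative inputs, weights and bias (R = 3); here n = 0.5.
p = np.array([1.0, -2.0, 0.5])
w = np.array([0.5, 0.25, -1.0])
b = 1.0

a_hard = neuron(p, w, b, hardlim)
a_lin = neuron(p, w, b, purelin)
a_sig = neuron(p, w, b, logsig)
print(a_hard, a_lin, a_sig)
```

Only the activation function changes between the three calls; the net input n is the same weighted sum plus bias.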
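Chain-rule expansion of the action network gradient:
$$\frac{\partial\left(\hat u_{i(j)}^T R \hat u_{i(j)}\right)}{\partial W_{ui(j)}} + \frac{\partial \hat V_i(x(t+1))}{\partial W_{ui(j)}} = \frac{\partial \hat u_{i(j)}^T}{\partial W_{ui(j)}}\frac{\partial\left(\hat u_{i(j)}^T R \hat u_{i(j)}\right)}{\partial \hat u_{i(j)}} + \frac{\partial \hat u_{i(j)}^T}{\partial W_{ui(j)}}\frac{\partial x(t+1)^T}{\partial \hat u_{i(j)}}\frac{\partial \phi(x(t+1))^T}{\partial x(t+1)}\frac{\partial \hat V_i(x(t+1))}{\partial \phi(x(t+1))}$$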
HDP iterative algorithm
Start
Initialization: $V_0 = 0$
Solve the minimization problem:
$$u_i(x(t)) = \arg\min_{u}\left(x(t)^T Q x(t) + u^T(x(t)) R u(x(t)) + V_i(x(t+1))\right)$$
Update the value function:
$$V_{i+1}(x(t)) = x(t)^T Q x(t) + u_i^T(x(t)) R u_i(x(t)) + V_i\big(f(x(t)) + g(x(t))u_i(x(t))\big) = x(t)^T Q x(t) + u_i^T(x(t)) R u_i(x(t)) + V_i(x(t+1))$$
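The critic weights are obtained by least squares:
$$W_{V(i+1)} = \arg\min_{W_{V(i+1)}}\left\{\int \left|W_{V(i+1)}^T\phi(x(t)) - d(\phi(x(t)), W_{Vi}^T)\right|^2 dx(t)\right\}.$$
The solution is
$$W_{V(i+1)} = \left(\int \phi(x(t))\phi(x(t))^T dx\right)^{-1}\int \phi(x(t))\, d^T(\phi(x(t)), W_{Vi}^T)\, dx$$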
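In practice the integrals in the least-squares weight formula are usually approximated by sums over sampled states. A minimal sketch under stated assumptions: the three-term basis, the sampling grid, and the target d (generated from known weights so that the fit can be checked) are all hypothetical and are not the setup used in the slides.

```python
import numpy as np


def phi(x):
    """Hypothetical truncated basis: quadratic monomials in (x1, x2)."""
    x1, x2 = x
    return np.array([x1 ** 2, x1 * x2, x2 ** 2])


def fit_value_weights(samples, targets):
    """Solve (sum phi phi^T) W = sum phi d, the sampled least-squares formula."""
    Phi = np.array([phi(x) for x in samples])   # N x L matrix of basis values
    A = Phi.T @ Phi                             # approximates the integral of phi phi^T
    b = Phi.T @ targets                         # approximates the integral of phi d
    return np.linalg.solve(A, b)


# Sample states on a grid (an assumed region of interest).
grid = np.linspace(-1.0, 1.0, 5)
samples = [(a, b) for a in grid for b in grid]

# Synthetic targets d(x) generated from known weights, for checking only.
true_w = np.array([2.0, 0.5, 3.0])
targets = np.array([true_w @ phi(x) for x in samples])

w_fit = fit_value_weights(samples, targets)
print(w_fit)
```

In the HDP iteration the targets would instead be the values d(φ(x(t)), W_Vi^T) computed from the previous critic weights and the current control.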
The action network basis $\sigma(x)$ consists of the 12 odd-degree monomials in $x_1$ and $x_2$ up to degree 5.
[Figure: state trajectory x2 versus time step]
Simulation experiment
$$x(t+1) = f(x(t)) + g(x(t))u(t)$$
$$f(x(t)) = \begin{bmatrix} 0.2\,x_1(t)\exp(x_2^2(t)) \\ 0.3\,x_2^3(t) \end{bmatrix}, \qquad g(x(t)) = \begin{bmatrix} 0 \\ -0.2 \end{bmatrix}$$
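A minimal simulation sketch of the example system, with and without the additive disturbance 0.3 sin(x(t)). The initial state, the zero control policy and the helper names are hypothetical, and the entry -0.2 in g is an assumption about a sign that is not legible in the text. With zero control the sign of g has no effect on the trajectories.

```python
import numpy as np


def f(x):
    """Drift term f(x) = [0.2 x1 exp(x2^2), 0.3 x2^3]."""
    x1, x2 = x
    return np.array([0.2 * x1 * np.exp(x2 ** 2), 0.3 * x2 ** 3])


def g(x):
    """Input gain g(x) = [0, -0.2] (sign assumed)."""
    return np.array([0.0, -0.2])


def simulate(x0, policy, steps=10, disturbed=False):
    """Iterate x(t+1) = f(x) + g(x) u, optionally adding 0.3 sin(x)."""
    x = np.asarray(x0, dtype=float)
    trajectory = [x]
    for _ in range(steps):
        u = policy(x)
        x_next = f(x) + g(x) * u
        if disturbed:
            x_next = x_next + 0.3 * np.sin(x)
        x = x_next
        trajectory.append(x)
    return np.array(trajectory)


def zero_policy(x):
    """Placeholder control law; a trained action network would go here."""
    return 0.0


x0 = [2.0, 1.0]  # hypothetical initial state
nominal = simulate(x0, zero_policy)
noisy = simulate(x0, zero_policy, disturbed=True)
print(nominal[-1], noisy[-1])
```

Replacing `zero_policy` with the learned control û(x) = W_u^T σ(x) would give the closed-loop trajectories.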
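Neural network approximations of the value function and the control:
$$\hat V_i(x(t), W_{Vi}) = W_{Vi}^T\phi(x(t))$$
$$\hat u_i(x(t), W_{ui}) = W_{ui}^T\sigma(x(t))$$
Basis vectors:
$$\phi(x) = [x_1^2,\ x_1x_2,\ x_2^2,\ x_1^4,\ x_1^3x_2,\ x_1^2x_2^2,\ x_1x_2^3,\ x_2^4,\ x_1^6,\ x_1^5x_2,\ x_1^4x_2^2,\ x_1^3x_2^3,\ x_1^2x_2^4,\ x_1x_2^5,\ x_2^6]^T$$
$$\sigma(x) = [x_1,\ x_2,\ x_1^3,\ x_1^2x_2,\ x_1x_2^2,\ x_2^3,\ x_1^5,\ x_1^4x_2,\ x_1^3x_2^2,\ x_1^2x_2^3,\ x_1x_2^4,\ x_2^5]^T$$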
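Control signal
$$\hat u_i(x(t), W_{ui}) = W_{ui}^T\sigma(x(t))$$
The weights are updated by gradient descent with learning rate $\alpha$:
$$W_{ui(j+1)} = W_{ui(j)} - \alpha\,\frac{\partial\left(x(t)^T Q x(t) + \hat u_{i(j)}^T R \hat u_{i(j)} + \hat V_i(x(t+1))\right)}{\partial W_{ui(j)}}$$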
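Drawback of dynamic programming:
The curse of dimensionality
Remedy: use a structure such as an artificial neural network to approximate the cost function and thereby obtain an approximate solution of the dynamic programming problem. This is approximate dynamic programming (Adaptive Critic Design, ACD).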
2. Theory of Neural Network


GDHP (Globalized dual heuristic dynamic programming)
AD (action dependent) forms of HDP, DHP, GDHP

HDP and ADHDP
[Diagram: HDP structure, in which the model network supplies x(t+1) to the critic network, which outputs Ĵ(t+1); ADHDP structure, in which the critic network outputs Q(t)]
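The gradient of the action network objective evaluates to
$$2\sigma(x(t)) R \hat u_{i(j)} + \sigma(x(t))\, g(x(t))^T \frac{\partial \phi(x(t+1))^T}{\partial x(t+1)} W_{Vi}$$
so the action weights are updated with learning rate $\alpha$ as
$$W_{ui(j+1)} = W_{ui(j)} - \alpha\left(2\sigma(x(t)) R \hat u_{i(j)} + \sigma(x(t))\, g(x(t))^T \frac{\partial \phi(x(t+1))^T}{\partial x(t+1)} W_{Vi}\right).$$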
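$\dot F^m(n^m)$ is the diagonal matrix with entries $\dot f^m(n_1^m), \dot f^m(n_2^m), \ldots, \dot f^m(n_{S^m}^m)$.
The sensitivities are propagated backward through the network:
$$s^M \rightarrow s^{M-1} \rightarrow \cdots \rightarrow s^2 \rightarrow s^1$$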
Weight and bias update (learning rate $\alpha$):
$$W^m(k+1) = W^m(k) - \alpha\, s^m (a^{m-1})^T, \qquad b^m(k+1) = b^m(k) - \alpha\, s^m$$
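The backpropagation steps (forward pass, sensitivities, weight and bias update) can be sketched for a small two-layer network. The 1-4-1 architecture, the log-sigmoid hidden layer with linear output, the target function, the learning rate and the random seed are all assumptions chosen for illustration.

```python
import numpy as np


def logsig(n):
    """Log-sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-n))


def forward(params, p):
    """Forward pass a = f2(W2 f1(W1 p + b1) + b2), f1 = logsig, f2 = linear."""
    W1, b1, W2, b2 = params
    a0 = np.array([[p]])
    a1 = logsig(W1 @ a0 + b1)
    a2 = W2 @ a1 + b2
    return a0, a1, a2


def mse(params, inputs, targets):
    """Mean squared error over the training set."""
    errors = [(t - forward(params, p)[2].item()) ** 2
              for p, t in zip(inputs, targets)]
    return float(np.mean(errors))


def train(params, inputs, targets, alpha=0.05, epochs=1000):
    """Per-sample steepest descent using backpropagated sensitivities."""
    W1, b1, W2, b2 = [m.copy() for m in params]
    for _ in range(epochs):
        for p, t in zip(inputs, targets):
            a0, a1, a2 = forward((W1, b1, W2, b2), p)
            # Output sensitivity: s2 = -2 F2'(n2) (t - a), with F2' = 1 (linear).
            s2 = -2.0 * (t - a2)
            # Hidden sensitivity: s1 = F1'(n1) W2^T s2, logsig' = a1 (1 - a1).
            s1 = a1 * (1.0 - a1) * (W2.T @ s2)
            # Updates: W <- W - alpha s a_prev^T, b <- b - alpha s.
            W2 -= alpha * s2 @ a1.T
            b2 -= alpha * s2
            W1 -= alpha * s1 @ a0.T
            b1 -= alpha * s1
    return W1, b1, W2, b2


rng = np.random.default_rng(0)
hidden = 4
initial_params = (rng.uniform(-0.5, 0.5, (hidden, 1)),
                  rng.uniform(-0.5, 0.5, (hidden, 1)),
                  rng.uniform(-0.5, 0.5, (1, hidden)),
                  rng.uniform(-0.5, 0.5, (1, 1)))

inputs = np.linspace(-2.0, 2.0, 21)
targets = 1.0 + np.sin(np.pi * inputs / 4.0)  # illustrative target function

initial_mse = mse(initial_params, inputs, targets)
trained_params = train(initial_params, inputs, targets)
final_mse = mse(trained_params, inputs, targets)
print(initial_mse, final_mse)
```

The hidden sensitivity is computed before W2 is updated, so both layers use the weights from the same forward pass.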
Convergence test: if $|V_{i+1} - V_i| < \varepsilon$, finish. Otherwise set i = i + 1 and return to the minimization step.
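The loop of the HDP iteration (initialize V_0 = 0, minimize, update the value function, test convergence) can be illustrated on a scalar linear special case where every step has a closed form. This is a sketch under assumptions: f(x) = a x, g(x) = b, Q = q, R = r and V_i(x) = p_i x², with the numeric values chosen only for the example. No neural networks are needed in this special case.

```python
def hdp_iteration(a, b, q, r, eps=1e-10, max_iter=1000):
    """Value iteration for x(t+1) = a x + b u with cost q x^2 + r u^2.

    With V_i(x) = p_i x^2 the minimizing control is u_i(x) = -k_i x, where
    k_i = a b p_i / (r + b^2 p_i), and the update is
    p_{i+1} = q + r k_i^2 + p_i (a - b k_i)^2.
    """
    p = 0.0  # initialization V_0 = 0
    for i in range(max_iter):
        k = a * b * p / (r + b * b * p)              # minimization step
        p_next = q + r * k ** 2 + p * (a - b * k) ** 2  # value update
        if abs(p_next - p) < eps:                    # |V_{i+1} - V_i| < eps
            k_final = a * b * p_next / (r + b * b * p_next)
            return p_next, k_final, i + 1
        p = p_next                                   # i = i + 1
    raise RuntimeError("value iteration did not converge")


# Illustrative values: an open-loop unstable plant with unit weights.
a, b, q, r = 1.2, 1.0, 1.0, 1.0
p_star, k_star, iterations = hdp_iteration(a, b, q, r)

# Residual of the scalar discrete-time Riccati equation at the result.
residual = q + a * a * p_star - (a * b * p_star) ** 2 / (r + b * b * p_star) - p_star
print(p_star, k_star, iterations, residual)
```

In this special case the limit satisfies the scalar discrete-time Riccati equation and the resulting feedback u = -k x is stabilizing, which gives a simple check on the iteration.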
Neural network implementation: value function
$$\hat V_i(x(t), W_{Vi}) = W_{Vi}^T\phi(x(t))$$
$$d(\phi(x(t)), W_{Vi}^T) = x(t)^T Q x(t) + \hat u_i^T(x(t)) R \hat u_i(x(t)) + \hat V_i(x(t+1)) = x(t)^T Q x(t) + \hat u_i^T(x(t)) R \hat u_i(x(t)) + W_{Vi}^T\phi(x(t+1))$$