Markov Decision Process (MDP)
MDP
Markov decision process (MDP)
MDP = Markov process + actions + reward functions + discount factor
[Figure: from the current state, action A1 (reward 20) and action A2 (reward 5) lead to possible future states, with transition probabilities 0.1 and 0.9 on the branches.]
- The principle of optimality: whatever the initial state and initial decision, the remaining decisions must form an optimal policy with respect to the state that results from the first decision.
Outline
- Markov Process
- Markov Reward Process (MRP)
- Markov Decision Process (MDP)
- Solving the MDP
- MDP applications
MDP example
[Figure: a chain of four states (State 0 – State 3); action Left moves one state to the left, action Right one state to the right.]
S = {0, 1, 2, 3}, A = {Left, Right}
Reward: -1 for every step moved
Discount factor: 0.5
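To make the example concrete, here is a minimal Python sketch of this chain MDP as plain data structures. The array layout (one row per current state) and the assumption that state 0 is terminal (both actions keep it in place with reward 0) are mine, chosen to match the P(A = Left)/P(A = Right) matrices and the value-iteration trace shown later in the deck.

```python
import numpy as np

# Example chain MDP from the slide: 4 states, actions Left/Right,
# reward -1 per move, discount factor 0.5.
# Assumption: state 0 is terminal (both actions keep it in place, reward 0).
states = [0, 1, 2, 3]
actions = ["Left", "Right"]
gamma = 0.5

# P[a][s, s'] = probability of moving from state s to s' under action a
P = {
    "Left":  np.array([[1, 0, 0, 0],
                       [1, 0, 0, 0],
                       [0, 1, 0, 0],
                       [0, 0, 1, 0]], dtype=float),
    "Right": np.array([[1, 0, 0, 0],
                       [0, 0, 1, 0],
                       [0, 0, 0, 1],
                       [0, 0, 0, 1]], dtype=float),
}

# u[a][s] = immediate reward for taking action a in state s
u = {a: np.array([0.0, -1.0, -1.0, -1.0]) for a in actions}
```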
[Figure: from the current state, action A1 (reward 20) and action A2 (reward 5) lead to possible future states, with transition probabilities 0.9 and 0.1 on the branches.]
Markov decision process (MDP)
MDP = Markov process + actions + reward functions + discount factor
(Components: action, state transition prob., reward function, discount factor.)

Bellman equation for the MRP:
    H(S) = u(S) + γ Σ_{S'∈S} P(S, S') H(S')

Bellman equation for the MDP:
    H(S, A) = u(S, A) + γ Σ_{S'∈S} P(S, A, S') U(S')
    U(S) = max_{A∈A} H(S, A)
    π*(S) = arg max_{A∈A} H(S, A)
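As a small illustration of the MDP Bellman equation above, the sketch below computes H(S, A) by one-step lookahead and U(S) as the maximum over actions. The function names and the two-state, two-action numbers are hypothetical (they only loosely echo the A1/A2 figure), not taken from the slides.

```python
import numpy as np

def q_value(s, a, U, P, u, gamma):
    """One-step lookahead: H(S,A) = u(S,A) + gamma * sum_S' P(S,A,S') U(S')."""
    return u[a][s] + gamma * P[a][s] @ U

def state_value(s, U, P, u, gamma, actions):
    """U(S) = max_A H(S,A)."""
    return max(q_value(s, a, U, P, u, gamma) for a in actions)

# Hypothetical two-state, two-action example just to exercise the formulas.
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.1, 0.9], [0.5, 0.5]])}
u = {0: np.array([20.0, 5.0]), 1: np.array([5.0, 20.0])}
U = np.zeros(2)
print(state_value(0, U, P, u, gamma=0.9, actions=[0, 1]))  # 20.0 when U = 0
```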
MDP
- An MDP extends the Markov reward process with "actions/decisions": in each state an action is chosen, and the transition probabilities and rewards depend on that choice.
- An MDP is defined by the tuple < S, A, P, U, γ >:
  - S: state space; P: state transition probabilities
  - A: action space
  - U: reward function; γ: discount factor
Backward induction
[Figure: the example MRP with states 0–5; state 3 (reward 6) transitions to state 4 with probability 0.2 and to state 5 with probability 0.8; states 4 (reward 2) and 5 (reward 9) are absorbing; the remaining states carry rewards 20, 5, and 0, with transition probabilities 0.1/0.9 and 1.0 as labeled.]
Start from the absorbing states:
    H(S = 4) = u(S = 4) = 2,    H(S = 5) = u(S = 5) = 9
MRP
[Figure: the example MRP (states 0–5, rewards 6, 5, 20, 2, 9, 0, transition probabilities 0.1/0.9, 0.2/0.8, 1.0).]
Bellman equation for the MRP:
    H(S_t) = E[ u(S_t) + γ H(S_{t+1}) ]
    H(S) = u(S) + γ Σ_{S'∈S} P(S, S') H(S')
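A quick numerical check of this recursion on the part of the example MRP that is fully specified (state 3 with reward 6 feeding the absorbing states 4 and 5 with probabilities 0.2 and 0.8). The value γ = 0.9 is an arbitrary illustrative choice, since the slides keep the discount factor symbolic.

```python
# Recoverable fragment of the example MRP: state 3 has reward 6 and moves to the
# absorbing states 4 and 5 with probabilities 0.2 and 0.8, where H(4)=u(4)=2 and
# H(5)=u(5)=9 (from the backward-induction slide).
gamma = 0.9                      # illustrative value; the slides keep gamma symbolic
H4, H5 = 2.0, 9.0
H3 = 6.0 + gamma * (0.2 * H4 + 0.8 * H5)   # Bellman recursion for state 3
print(H3)                        # 6 + 0.9 * 7.6 = 12.84
```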
Markov decision process (MDP)
Email: yangzhang@whut.edu.cn
Outline
- Markov Process
- Markov Reward Process (MRP)
- Markov Decision Process (MDP)
- Solving the MDP
- MDP applications
MRP
[Figure: the example MRP, as above.]
Continue the backward induction:
    H(S = 3) = u(S = 3) + γ [ 0.2 H(S = 4) + 0.8 H(S = 5) ]
             = 6 + γ [ 0.2 · 2 + 0.8 · 9 ]
Then compute H(S = 2), H(S = 1), … in the same way.
[Figure: MDP decision at the current state — action A1 (reward 20) and action A2 (reward 5) lead to possible future states, with transition probabilities 0.1 and 0.9 on the branches.]
MDP extensions
- CMDP
- Partially observable MDP (POMDP)
An MDP is the tuple < S, A, P, U, γ >: state space S, action space A, state transition probabilities P, reward function U, discount factor γ.
1) Example: random walk
MDP application: mobile energy gateway
- Modeled as a Markov decision process (building on the Markov reward process, MRP).
- Actions: moving and charging / buy/sell energy.
- "Buy": the energy gateway buys from chargers (charging).
Policy iteration algorithm and value iteration
- Policy iteration: start from an initial policy π_0(S), ∀S ∈ S; alternately evaluate the current policy with its Bellman equation and improve it greedily to obtain π_{n+1}(S) : S → A, ∀S ∈ S (a generic Python sketch follows this list).
- A greedy improvement step costs O(|A| · |S|) per state, with |A| actions and |S| states.
- Recall: for an MRP, backward induction from the absorbing states solves the Bellman equation directly.
- Value iteration algorithm: initialize U_0(S) ≜ 0, ∀S ∈ S, then repeatedly apply the Bellman-equation update based on u(S, A) + γ Σ_{S'∈S} P(S, A, S') U_n(S').
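The slide names policy iteration without spelling out its steps, so the following is only a generic sketch of the usual evaluate-then-improve loop (with iterative evaluation rather than an exact linear solve), not the course's exact formulation. It reuses the P, u, gamma conventions from the value-iteration notation above.

```python
import numpy as np

def policy_iteration(P, u, gamma, n_states, actions, eval_sweeps=100):
    """Generic policy-iteration sketch: alternate policy evaluation and greedy improvement."""
    policy = [actions[0]] * n_states                  # pi_0: arbitrary initial policy
    while True:
        # Policy evaluation: sweep the Bellman equation for the fixed policy
        H = np.zeros(n_states)
        for _ in range(eval_sweeps):
            H = np.array([u[policy[s]][s] + gamma * P[policy[s]][s] @ H
                          for s in range(n_states)])
        # Policy improvement: greedy one-step lookahead, O(|A|*|S|) work per state
        new_policy = [max(actions, key=lambda a: u[a][s] + gamma * P[a][s] @ H)
                      for s in range(n_states)]
        if new_policy == policy:                      # stop when the policy is stable
            return policy, H
        policy = new_policy
```

On the 4-state chain MDP sketched earlier, policy_iteration(P, u, 0.5, 4, actions) converges to the Left action in states 1–3 with values approximately (0, −1, −1.5, −1.75), matching the value-iteration trace later in the deck.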
State transition matrices for the two actions (rows = current state 0–3, columns = next state):

P(A = Left) =
[ 1 0 0 0
  1 0 0 0
  0 1 0 0
  0 0 1 0 ]

P(A = Right) =
[ 1 0 0 0
  0 0 1 0
  0 0 0 1
  0 0 0 1 ]
Period 1
Value:  H = 0.0   0.0   0.0   0.0
Action: —
An MRP is the tuple < S, P, U, γ >:
- S: state space
- P: state transition prob.
- U: reward function
- γ: discount factor
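For a small MRP, the Bellman equation H = u + γPH can also be solved in closed form as a linear system. This direct solve is not shown on the slides; the three-state chain below only mirrors the recoverable fragment of the running example (state 3 feeding absorbing states 4 and 5), with γ = 0.9 chosen arbitrarily.

```python
import numpy as np

# Fragment of the running example, renumbered 0..2: state 0 (~state 3, reward 6)
# moves to states 1 (~state 4, reward 2) and 2 (~state 5, reward 9) with probs 0.2/0.8.
# The absorbing states are modeled with no continuation so that H equals the
# immediate reward there, matching the slides' convention H(S=4) = u(S=4).
P = np.array([[0.0, 0.2, 0.8],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
u = np.array([6.0, 2.0, 9.0])
gamma = 0.9                      # illustrative; the slides keep gamma symbolic

# Bellman equation H = u + gamma * P H  <=>  (I - gamma * P) H = u
H = np.linalg.solve(np.eye(3) - gamma * P, u)
print(H)                         # [12.84  2.    9.  ]
```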
MRP
[Figure: the example MRP; each state S carries a reward u(S) (e.g. u(S = 3), u(S = 4), rewards 5, 20, …), with transition probabilities 0.1/0.9, 0.2/0.8, and 1.0 as labeled.]
Solving the MDP
- Policy: a mapping from states to actions, π : S → A.
- Goal: find the optimal policy π* : S → A.
- Based on the Bellman equation, e.g. for a fixed policy H(S) = u(S) + γ Σ_{S'∈S} P(S, S') H(S').
Value iteration algorithm
For each state S: U_0(S) ≜ 0
Repeat until convergence:
    For each state S:
        For each action A:
            Compute H_{n+1}(S, A) = u(S, A) + γ Σ_{S'∈S} P(S, A, S') U_n(S')
        Compute and store U_{n+1}(S) = max_{A∈A} H_{n+1}(S, A)
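A compact Python rendering of this value-iteration pseudocode. The function signature and the convergence test are my own choices; the update itself follows the H_{n+1}(S, A) and U_{n+1}(S) formulas above.

```python
import numpy as np

def value_iteration(P, u, gamma, n_states, actions, tol=1e-8):
    """Value iteration: U_{n+1}(S) = max_A [ u(S,A) + gamma * sum_S' P(S,A,S') U_n(S') ]."""
    U = np.zeros(n_states)                     # U_0(S) = 0 for every state
    while True:
        # H_{n+1}(S, A) for every state/action pair, as a |A| x |S| array
        H = np.array([u[a] + gamma * P[a] @ U for a in actions])
        U_next = H.max(axis=0)                 # U_{n+1}(S) = max_A H_{n+1}(S, A)
        if np.max(np.abs(U_next - U)) < tol:   # stop once the values have converged
            policy = H.argmax(axis=0)          # greedy policy: index of the best action
            return U_next, policy
        U = U_next
```

Applied to the 4-state chain MDP sketched earlier, value_iteration(P, u, 0.5, 4, actions) converges to U ≈ (0, −1, −1.5, −1.75) with a greedy policy that chooses Left in states 1–3, consistent with the iteration trace shown on the following slides.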
MRP: absorbing states
- An absorbing state only transitions to itself (self-loop probability 1.0); in this formulation its value equals its immediate reward, H(S) = u(S).
[Figure: the example MRP with the absorbing states marked by self-loops of probability 1.0 (rewards 6, 5, 20, 0 as labeled).]
MRP: value iteration
[Figure: the example MRP (states 0–5, rewards 20, 6, 5, 0, …, transition probabilities 0.1/0.9, 0.2/0.8, 1.0).]
MRP
[Figure: the example MRP; backward induction starts from the absorbing states ("start from here").]
- u(S) is the "immediate" reward collected in state S.
- H(S) is the value of state S (the expected cumulative reward from S onward).
Period 2
Value:  H = 0.0   -1.0   -1.0   -1.0
Action (states 1–3): ←/→   ←/→   ←

Period 3
Value:  H = 0.0   -1.0   -1.5   -1.5
Action (states 1–3): ←   ←/→   ←

…
Converged action (states 1–3): ←   ←   ←
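The short standalone script below reproduces this table under the same assumptions as before (state 0 terminal with reward 0, reward −1 per move, γ = 0.5). The tie-breaking of equally good actions is arbitrary, so where the slide shows ←/→ the script simply prints one of the two.

```python
import numpy as np

# 4-state chain MDP: actions 0 = Left, 1 = Right; state 0 assumed terminal.
P = np.array([[[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],   # Left
              [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]]],  # Right
             dtype=float)
u = np.array([[0, -1, -1, -1],      # reward for Left in states 0..3
              [0, -1, -1, -1]],     # reward for Right in states 0..3
             dtype=float)
gamma, arrows = 0.5, "←→"

U = np.zeros(4)                                   # Period 1: U_0 = 0 everywhere
for period in range(2, 5):
    H = u + gamma * P @ U                         # H[a, s] = u(s,a) + gamma * E[U(s')]
    U = H.max(axis=0)
    acts = "".join(arrows[a] for a in H.argmax(axis=0)[1:])
    print(f"Period {period}: U = {U}, actions (states 1-3) = {acts}")
# Period 2: U = (0, -1, -1, -1); Period 3: U = (0, -1, -1.5, -1.5)
```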
MDP application example
[Figure: wireless power transfer scenario (image copyright: Forbes).]
Scenario: RF energy Tx/Rx
- Powercaster Tx and Rx
- Charging station
- Received power P given by Friis' formula (written out below)
- Beamforming
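For reference, the standard free-space form of Friis' transmission formula, which the slide cites by name (P_t, P_r: transmit and received power; G_t, G_r: antenna gains; λ: wavelength; d: Tx–Rx distance):

    P_r = P_t \, G_t \, G_r \left( \frac{\lambda}{4 \pi d} \right)^{2}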
- Electricity chargers: at different fixed locations, e.g., power outlets, base stations
- End users of energy: those who need energy, but are not covered by chargers
- Mobile energy gateway: transferring energy (wirelessly)
Markov property: P(X_{t+1} | X_t, X_{t-1}, X_{t-2}, …)
[Figure: a Markov chain over states 0–5; the transition from state 3 to state 4 is labeled P(4|3).]
Random walk
- In general the next step may depend on more history, e.g. a second-order process: P(X_{t+1} | X_t, X_{t-1}, X_{t-2}, …) = P(X_{t+1} | X_t, X_{t-1}).
- Such a process becomes first-order Markov after augmenting the state: S_t = (X_t, X_{t-1}), with S_t ∈ {(s, s), (s, r), (r, s), (r, r)}. A small sketch follows below.
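A tiny Python sketch of this state-augmentation idea. The two symbols 's'/'r' and the second-order transition table are made-up illustrative values; the point is only that the pair S_t = (X_t, X_{t-1}) behaves as a first-order Markov state.

```python
import random

# Hypothetical second-order chain over {'s', 'r'}:
# P(X_{t+1} = 's' | X_t, X_{t-1}) depends on the last TWO observations.
p_next_s = {('s', 's'): 0.9, ('s', 'r'): 0.7, ('r', 's'): 0.4, ('r', 'r'): 0.2}

def step(state):
    """One transition of the augmented first-order chain S_t = (X_t, X_{t-1})."""
    x_t, x_prev = state
    x_next = 's' if random.random() < p_next_s[(x_t, x_prev)] else 'r'
    return (x_next, x_t)          # new augmented state (X_{t+1}, X_t)

state = ('s', 'r')                # S_0 = (X_0, X_{-1})
trajectory = [state]
for _ in range(10):
    state = step(state)           # next state depends only on the current pair
    trajectory.append(state)
print(trajectory)
```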
Value iteration algorithm (continued)
        Compute and store π*_{n+1}(S) = arg max_{A∈A} H_{n+1}(S, A),
            where H_{n+1}(S, A) = u(S, A) + γ Σ_{S'∈S} P(S, A, S') U_n(S')
        Compute and store U_{n+1}(S) = max_{A∈A} H_{n+1}(S, A)
Return π(S), U(S), ∀S ∈ S
The update is the Bellman equation of the MDP.
Markov Process
- A stochastic process (X_t, t ∈ I) is Markov if the next state X_{t+1} depends on the history X_t, X_{t-1}, X_{t-2}, … only through the current state:
    P(X_{t+1} | X_t, X_{t-1}, X_{t-2}, …) = P(X_{t+1} | X_t)
[Figure: the example MRP (states 0–5, rewards 6, 5, 20, 2, 9, 0, transition probabilities 0.2/0.8 and 1.0).]
With discount factor γ ∈ [0, 1):
    H(S = 3) = u(S = 3) + γ [ 0.2 H(S = 4) + 0.8 H(S = 5) ]
             = 6 + γ [ 0.2 · 2 + 0.8 · 9 ]
MRP = Markov process + reward/utility functions + discount factor
[Figure: the example MRP; each state S carries a reward u(S) (e.g. u(S = 3), u(S = 4)), with transition probabilities 0.1/0.9, 0.2/0.8, and 1.0 as labeled.]
MRP: iterative solution
- Initialize H(S) ≜ 0, ∀S ∈ S
- Repeatedly update H(S) ← u(S) + γ Σ_{S'∈S} P(S, S') H(S')
[Figure: the example MRP (rewards 2, 0, 9, … as labeled).]
[Figure: MDP decision at the current state — action A1 (reward 20) and action A2 (reward 5) lead to possible future states, with transition probabilities 0.9 and 0.1 on the branches.]
    H_{n+1}(S, A) = u(S, A) + γ Σ_{S'∈S} P(S, A, S') U_n(S')
    U_{n+1}(S) = max_{A∈A} H_{n+1}(S, A)