MIT OpenCourseWare: Dynamic Programming Lecture (2)
• View x̃k = (xk, yk) as the new state.
• DP algorithm for the reformulated problem:
Jk(xk, xk−1) = min_{uk ∈ Uk(xk)} E_{wk} [ gk(xk, uk, wk) + Jk+1( fk(xk, xk−1, uk, wk), xk ) ]
• System: xk+1 = (1 − a) xk + a uk, k = 0, 1, where a is a given scalar from the interval (0, 1).
• Cost: r(x2 − T)^2 + u0^2 + u1^2, where r is a given positive scalar.
• DP Algorithm:
J2(x2) = r(x2 − T)^2
J1(x1) = min_{u1} [ u1^2 + r( (1 − a) x1 + a u1 − T )^2 ]
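For reference, the minimization over u1 can be carried out in closed form by setting the derivative to zero. This completion is not on the slide, just a short check of the last line above:

```latex
% Minimize over u_1:  u_1^2 + r((1-a)x_1 + a u_1 - T)^2.
% First-order condition: 2u_1 + 2ra((1-a)x_1 + a u_1 - T) = 0, which gives
\[
u_1^{*} = \frac{ra\,\bigl(T-(1-a)x_1\bigr)}{1+ra^{2}},
\qquad
J_1(x_1) = \frac{r\,\bigl((1-a)x_1-T\bigr)^{2}}{1+ra^{2}}.
\]
```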
PROOF OF THE INDUCTION STEP
• Let πk = {µk, µk+1, ..., µN−1} denote a tail policy from time k onward
• Assume that Jk+1(xk+1) = J*k+1(xk+1). Then

J*k(xk) = min_{(µk, πk+1)} E_{wk,...,wN−1} [ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) ]
= min_{µk} E_{wk} [ gk(xk, µk(xk), wk) + min_{πk+1} E_{wk+1,...,wN−1} [ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) ] ]
= min_{µk} E_{wk} [ gk(xk, µk(xk), wk) + J*k+1( fk(xk, µk(xk), wk) ) ]
= min_{µk} E_{wk} [ gk(xk, µk(xk), wk) + Jk+1( fk(xk, µk(xk), wk) ) ]
= min_{uk ∈ Uk(xk)} E_{wk} [ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) ]
= Jk(xk)
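The induction argument can also be checked numerically on a toy problem: run the backward DP recursion and, separately, enumerate every policy and compute its expected cost exactly; the two optimal values should agree. The sketch below does this under made-up problem data (the functions f, g, gN and all numbers are illustrative, not from the lecture):

```python
# Sanity check (not from the slides): on a tiny finite problem, the backward DP
# value J_0(x_0) should coincide with the optimal cost from brute-force policy
# enumeration. All problem data below is made up for illustration.
import itertools

N = 3
STATES = [0, 1]
CONTROLS = [0, 1]
DISTURBANCES = [(0, 0.5), (1, 0.5)]             # pairs (w, probability)

def f(x, u, w):                                  # system x_{k+1} = f(x_k, u_k, w_k)
    return (x + u + w) % 2

def g(x, u, w):                                  # stage cost g_k
    return (x - u) ** 2 + w

def gN(x):                                       # terminal cost g_N
    return 2 * x

# Backward DP: J_N = g_N, then J_k(x) = min_u E_w[ g + J_{k+1}(f(x, u, w)) ].
J = {x: gN(x) for x in STATES}
for _ in range(N):
    J = {x: min(sum(p * (g(x, u, w) + J[f(x, u, w)]) for w, p in DISTURBANCES)
                for u in CONTROLS)
         for x in STATES}

# Brute force: exact expected cost of a fixed Markov policy, summed over all
# disturbance sequences, then minimized over all policies.
def policy_cost(policy, x0):
    total = 0.0
    for ws in itertools.product(DISTURBANCES, repeat=N):
        prob, cost, x = 1.0, 0.0, x0
        for mu_k, (w, p) in zip(policy, ws):
            u = mu_k[x]
            prob *= p
            cost += g(x, u, w)
            x = f(x, u, w)
        total += prob * (cost + gN(x))
    return total

stage_maps = [dict(zip(STATES, us)) for us in itertools.product(CONTROLS, repeat=len(STATES))]
best = {x0: min(policy_cost(pi, x0) for pi in itertools.product(stage_maps, repeat=N))
        for x0 in STATES}

print(J)      # DP values J_0(x_0)
print(best)   # brute-force optimal costs, should agree with J
```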
LINEAR-QUADRATIC ANALYTICAL EXAMPLE
[Figure: two ovens in series. The initial temperature x0 enters Oven 1 (temperature u0), producing x1; x1 enters Oven 2 (temperature u1), producing the final temperature x2.]
• Then J0(x0), generated at the last step, is equal to the optimal cost J*(x0). Also, the policy π* = {µ*0, ..., µ*N−1}, where each µ*k(xk) attains the minimum in the DP recursion for each xk and k, is optimal.
STOCHASTIC INVENTORY EXAMPLE
[Figure: inventory system. Stock xk at period k, stock ordered uk at period k, demand wk at period k; stock at period k + 1: xk+1 = xk + uk − wk.]
• Cost of period k: c uk + r(xk + uk − wk)
• DP recursion:
Jk(xk) = min_{uk ≥ 0} E_{wk} [ c uk + r(xk + uk − wk) + Jk+1(xk + uk − wk) ]
DP ALGORITHM
• Start with JN(xN) = gN(xN), and go backwards using
Jk(xk) = min_{uk ∈ Uk(xk)} E_{wk} [ gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) ], k = 0, 1, ..., N − 1.
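A direct transcription of this recursion, for a problem described by finite enumerations, might look as follows. This is a generic sketch; the function name solve_dp and its argument conventions are illustrative, not from the course:

```python
# Sketch of the backward recursion J_N = g_N,
# J_k(x) = min_{u in U_k(x)} E_w[ g_k(x, u, w) + J_{k+1}(f_k(x, u, w)) ].
def solve_dp(states, horizon, controls, disturbances, f, g, gN):
    """Return cost-to-go tables J[k][x] and a minimizing policy mu[k][x]."""
    J = [dict() for _ in range(horizon + 1)]
    mu = [dict() for _ in range(horizon)]
    J[horizon] = {x: gN(x) for x in states}            # J_N(x_N) = g_N(x_N)
    for k in reversed(range(horizon)):                 # go backwards: k = N-1, ..., 0
        for x in states:
            best_u, best_val = None, float("inf")
            for u in controls(k, x):                   # u_k in U_k(x_k)
                # E over w_k of [ g_k(x, u, w) + J_{k+1}(f_k(x, u, w)) ]
                val = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                          for w, p in disturbances(k, x, u))
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], mu[k][x] = best_val, best_u
    return J, mu
```

Here J[0][x0] plays the role of the optimal cost J*(x0), and mu[k][x] records the minimizing control, i.e., a policy of the form µ*k(xk).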
BASIC PROBLEM
• System xk+1 = fk(xk, uk, wk), k = 0, ..., N − 1
• Control constraints uk ∈ Uk(xk)
• Probability distribution Pk(· | xk, uk) of wk
• Policies π = {µ0, ..., µN−1}, where µk maps states xk into controls uk = µk(xk) and is such that µk(xk) ∈ Uk(xk) for all xk
• Expected cost of π starting at x0 is
Jπ(x0) = E [ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) ]
• Optimal cost function: J*(x0) = min_π Jπ(x0)
• Optimal policy π* is one that satisfies Jπ*(x0) = J*(x0)
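For one fixed policy π = {µ0, ..., µN−1}, the expected cost Jπ(x0) can also be estimated by straightforward simulation, with no DP involved. A minimal sketch, assuming the system, costs, and a disturbance sampler are supplied by the caller (all argument names are illustrative):

```python
# Sketch: Monte Carlo estimate of
# J_pi(x0) = E[ g_N(x_N) + sum_k g_k(x_k, mu_k(x_k), w_k) ] for a fixed policy.
def estimate_policy_cost(x0, mu, f, g, gN, sample_w, N, num_runs=10_000):
    total = 0.0
    for _ in range(num_runs):
        x, cost = x0, 0.0
        for k in range(N):
            u = mu[k](x)                  # u_k = mu_k(x_k), with mu_k(x_k) in U_k(x_k)
            w = sample_w(k, x, u)         # draw w_k from P_k(. | x_k, u_k)
            cost += g(k, x, u, w)
            x = f(k, x, u, w)             # x_{k+1} = f_k(x_k, u_k, w_k)
        total += cost + gN(x)
    return total / num_runs               # sample average approximates J_pi(x0)
```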
[Figure: state-transition graph for the deterministic scheduling example. Nodes are partial schedules (Initial State, C, CA, CD, CAB, CAD, CDA, ...); arcs carry the transition costs shown in the diagram.]
• Start from the last tail subproblem and go backwards • At each state-time pair, we record the optimal cost-to-go and the optimal decision
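One way to carry out exactly that bookkeeping in code is a memoized recursion over partial schedules, which solves each tail subproblem once and records its cost-to-go and chosen operation. The sketch below uses hypothetical unit setup costs, since the actual arc costs appear only in the figure:

```python
# Sketch: DP over tail subproblems for sequencing A, B, C, D with A before B
# and C before D. setup[(last, next)] is a HYPOTHETICAL placeholder cost.
from itertools import product

OPS = ("A", "B", "C", "D")
PREC = {"B": "A", "D": "C"}                # B requires A first, D requires C first
setup = {(i, j): 1 for i, j in product(("start",) + OPS, OPS) if i != j}

def allowed(done, op):
    return op not in done and (PREC.get(op) is None or PREC[op] in done)

cost_to_go, decision = {}, {}              # optimal cost-to-go and decision per state

def J(done):                               # tail subproblem from the partial schedule `done`
    if len(done) == len(OPS):
        return 0
    if done not in cost_to_go:             # each tail subproblem is solved once
        last = done[-1] if done else "start"
        best = min((setup[(last, op)] + J(done + (op,)), op)
                   for op in OPS if allowed(done, op))
        cost_to_go[done], decision[done] = best
    return cost_to_go[done]

print(J(()))          # optimal total cost from the initial (empty) state
print(decision[()])   # first operation of an optimal sequence
```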
J0(x0) = min_{u0} [ u0^2 + J1( (1 − a) x0 + a u0 ) ]
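These recursions are easy to check numerically by minimizing over a grid of candidate temperatures; below is a sketch with arbitrarily chosen values of a, r, and T (none of these numbers come from the lecture), compared against the closed-form J1 derived in the earlier sketch:

```python
# Sketch: numerical check of the two-oven recursions with illustrative a, r, T.
a, r, T = 0.6, 2.0, 100.0
U = [float(u) for u in range(0, 301)]          # coarse grid of candidate oven temperatures

def J2(x2):
    return r * (x2 - T) ** 2

def J1(x1):
    return min(u1 ** 2 + J2((1 - a) * x1 + a * u1) for u1 in U)

def J0(x0):
    return min(u0 ** 2 + J1((1 - a) * x0 + a * u0) for u0 in U)

# Closed form for comparison: J1(x1) = r((1 - a)x1 - T)^2 / (1 + r a^2)
x1 = 30.0
print(J1(x1), r * ((1 - a) * x1 - T) ** 2 / (1 + r * a ** 2))   # approximately equal
print(J0(20.0))                                 # optimal cost from initial temperature 20
```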
STATE AUGMENTATION
• When assumptions of the basic problem are violated (e.g., disturbances are correlated, cost is nonadditive, etc.), reformulate/augment the state.
• Example: Time lags xk+1 = fk(xk, xk−1, uk, wk)
• Introduce an additional state variable yk = xk−1. The new system takes the form
( xk+1, yk+1 ) = ( fk(xk, yk, uk, wk), xk )
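Mechanically, the augmentation just carries the previous state along as a second component. A minimal sketch, with an invented lagged system for illustration (the helper name augment and the example dynamics are not from the lecture):

```python
# Sketch: wrapping a time-lag system x_{k+1} = f_k(x_k, x_{k-1}, u_k, w_k)
# as a first-order system over the augmented state (x_k, y_k), y_k = x_{k-1}.
def augment(f):
    """Given f(k, x, x_prev, u, w), return the augmented dynamics over (x, y)."""
    def F(k, state, u, w):
        x, y = state                        # y plays the role of x_{k-1}
        return (f(k, x, y, u, w), x)        # (x_{k+1}, y_{k+1}) = (f_k(...), x_k)
    return F

# Usage with a made-up lagged linear system:
f = lambda k, x, x_prev, u, w: 0.5 * x + 0.3 * x_prev + u + w
F = augment(f)
state = (1.0, 0.0)                          # (x_0, y_0), with y_0 = x_{-1} taken as given
state = F(0, state, 0.2, 0.0)               # one step of the augmented system
print(state)
```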
• Stock ordered at period k: uk
• Tail Subproblems of Length 1:
JN−1(xN−1) = min_{uN−1 ≥ 0} E_{wN−1} [ c uN−1 + r(xN−1 + uN−1 − wN−1) ]
• Tail Subproblems of Length N − k:
Jk(xk) = min_{uk ≥ 0} E_{wk} [ c uk + r(xk + uk − wk) + Jk+1(xk + uk − wk) ]
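For a discretized version of the problem, these tail subproblems can be computed exactly with a few lines of code. The sketch below uses an arbitrary demand distribution, unit cost c, and penalty r(·), none of which come from the lecture:

```python
# Sketch: tail subproblems for a discretized inventory problem. All numbers
# and the penalty function are illustrative choices.
N = 3
c = 1.0                                     # unit purchase cost (illustrative)

def r(s):                                   # penalty on end-of-period stock s (illustrative)
    return 0.1 * s * s

STOCK = range(-5, 11)                       # discretized stock levels (negative = backlog)
ORDERS = range(0, 6)                        # u_k >= 0
DEMAND = [(0, 0.2), (1, 0.5), (2, 0.3)]     # pairs (w_k, probability)

def clip(s):                                # keep next-period stock on the grid
    return max(min(s, max(STOCK)), min(STOCK))

J = {x: 0.0 for x in STOCK}                 # terminal cost taken as zero here
for _ in range(N):                          # tail subproblems of length 1, 2, ..., N
    J = {x: min(sum(p * (c * u + r(x + u - w) + J[clip(x + u - w)])
                    for w, p in DEMAND)
                for u in ORDERS)
         for x in STOCK}

print({x: round(J[x], 2) for x in list(STOCK)[:5]})   # J_0 for a few stock levels
```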
DETERMINISTIC SCHEDULING EXAMPLE
• Find optimal sequence of operations A, B, C, D (A must precede B and C must precede D)
[Figure: graph of partial schedules (A, AB, AC, ABC, ACB, ACD, ...) with arc costs.]
• Justification: proof by induction that Jk(xk) is equal to J*k(xk), defined as the optimal cost of the tail subproblem that starts at time k at state xk.
• Note that ALL the tail subproblems are solved (in addition to the original problem); hence the intensive computational requirements.
PRINCIPLE OF OPTIMALITY
• Let π* = {µ*0, µ*1, ..., µ*N−1} be an optimal policy
• Consider the “tail subproblem” whereby we are at xi at time i and wish to minimize the “cost-to-go” from time i to time N
6.231 DYNAMIC PROGRAMMING
LECTURE 2
LECTURE OUTLINE
• Principle of optimality
• DP example: Deterministic problem
• DP example: Stochastic problem
• The general algorithm
• State augmentation
E [ gN(xN) + Σ_{k=i}^{N−1} gk(xk, µk(xk), wk) ]
and the “tail policy” {µ*i, µ*i+1, ..., µ*N−1}
[Figure: time line from 0 to N; the tail subproblem starts at state xi at time i.]
• Principle of optimality: The tail policy is optimal for the tail subproblem
• DP first solves all tail subproblems of the final stage
• At the generic step, it solves all tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length