Algorithms for MDP Planning
Shivaram Kalyanakrishnan
Department of Computer Science and Engineering, Indian Institute of Technology Bombay
shivaram@cse.iitb.ac.in
August 2019
Overview
1. Value Iteration
2. Linear Programming
3. Policy Iteration
Policy Improvement Theorem
Value Iteration

$V_0 \leftarrow$ arbitrary, element-wise bounded, $n$-length vector. $t \leftarrow 0$.
Repeat:
    For $s \in S$:
        $V_{t+1}(s) \leftarrow \max_{a \in A} \sum_{s' \in S} T(s, a, s')\left(R(s, a, s') + \gamma V_t(s')\right)$.
    $t \leftarrow t + 1$.
Until $V_t \approx V_{t-1}$ (up to machine precision).

Convergence to $V^{\star}$ is guaranteed by a max-norm contraction argument.
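As a concrete rendering of this loop, here is a minimal NumPy sketch. The array names `T`, `R` and the $(n, k, n)$ layout are assumptions for illustration, not fixed by the slides.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-12):
    """Value iteration for an MDP with n states and k actions.

    T[s, a, s2] is the transition probability, R[s, a, s2] the reward;
    both are assumed to be (n, k, n) arrays. Returns the converged values.
    """
    n, k, _ = T.shape
    V = np.zeros(n)                 # any element-wise bounded start works
    while True:
        # Q[s, a] = sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V(s'))
        Q = np.einsum('ijk,ijk->ij', T, R + gamma * V)
        V_next = Q.max(axis=1)      # greedy backup over actions
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
```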
Linear Programming

Minimise $\sum_{s \in S} V(s)$
subject to $V(s) \geq \sum_{s' \in S} T(s, a, s')\left(R(s, a, s') + \gamma V(s')\right), \; \forall s \in S, \forall a \in A$.

Let $|S| = n$ and $|A| = k$: $n$ variables, $nk$ constraints.
Can also be posed as a dual LP with $nk$ variables and $n$ constraints.
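This LP can be handed to an off-the-shelf solver. Below is a sketch using scipy.optimize.linprog, with each constraint rearranged into the $A_{ub} x \leq b_{ub}$ form the solver expects; `T`, `R`, `gamma` follow the same assumed layout as above.

```python
import numpy as np
from scipy.optimize import linprog

def plan_by_lp(T, R, gamma):
    """Minimise sum_s V(s) subject to
    V(s) >= sum_{s'} T(s,a,s') (R(s,a,s') + gamma V(s')) for all s, a."""
    n, k, _ = T.shape
    c = np.ones(n)                   # objective: sum of state values
    # Rearranged constraint: (gamma * T[s,a,:] - e_s) . V <= -T[s,a,:] . R[s,a,:]
    A_ub = np.zeros((n * k, n))
    b_ub = np.zeros(n * k)
    for s in range(n):
        for a in range(k):
            row = gamma * T[s, a]
            row[s] -= 1.0
            A_ub[s * k + a] = row
            b_ub[s * k + a] = -np.dot(T[s, a], R[s, a])
    # Values may be negative, so lift linprog's default non-negativity bounds.
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
    return res.x                     # the optimal value vector V*
```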
Policy Improvement

Given $\pi$:
pick one or more improvable states, and in each, switch to an arbitrary improving action.
Let the resulting policy be $\pi'$.
(A state $s$ is improvable under $\pi$ if some action $a$ satisfies $Q^{\pi}(s, a) > Q^{\pi}(s, \pi(s)) = V^{\pi}(s)$; such an $a$ is an improving action.)

[Figure: policies $\pi$ and $\pi'$ over states $s_1, s_2, \dots, s_8$, with improvable states and improving actions highlighted, illustrated by comparisons of $Q^{\pi}(s_3, \cdot)$ and $Q^{\pi}(s_7, \cdot)$.]

Policy Improvement Theorem:
(1) If $\pi$ has no improvable states, then it is optimal; else
(2) if $\pi'$ is obtained as above, then
$\forall s \in S: V^{\pi'}(s) \geq V^{\pi}(s)$ and $\exists s \in S: V^{\pi'}(s) > V^{\pi}(s)$.
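One way to realise this step in code: evaluate $\pi$ exactly by solving the linear system $(I - \gamma T_{\pi}) V^{\pi} = R_{\pi}$, then compare $Q^{\pi}(s, a)$ against $V^{\pi}(s)$ in every state. A sketch under the same assumed array layout (function names are illustrative):

```python
import numpy as np

def evaluate_policy(T, R, gamma, pi):
    """Exact policy evaluation: solve (I - gamma * T_pi) V = R_pi."""
    n = T.shape[0]
    T_pi = T[np.arange(n), pi]                   # (n, n): T(s, pi(s), s')
    R_pi = np.einsum('ij,ij->i', T_pi, R[np.arange(n), pi])
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

def improve_policy(T, R, gamma, pi):
    """Switch to an (arbitrary) improving action in every improvable state."""
    V = evaluate_policy(T, R, gamma, pi)
    Q = np.einsum('ijk,ijk->ij', T, R + gamma * V)   # Q^pi(s, a)
    pi_next = pi.copy()
    for s in range(T.shape[0]):
        improving = np.flatnonzero(Q[s] > Q[s, pi[s]] + 1e-12)
        if improving.size > 0:                   # s is improvable
            pi_next[s] = improving[0]            # pick any improving action
    return pi_next
```

Here the first improving action is chosen in every improvable state; the theorem licenses any such choice.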
Definitions and Basic Facts

For $X: S \to \mathbb{R}$ and $Y: S \to \mathbb{R}$, we define $X \succeq Y$ if $\forall s \in S: X(s) \geq Y(s)$, and we define $X \succ Y$ if $X \succeq Y$ and $\exists s \in S: X(s) > Y(s)$.

For policies $\pi_1, \pi_2 \in \Pi$, we define $\pi_1 \succeq \pi_2$ if $V^{\pi_1} \succeq V^{\pi_2}$, and we define $\pi_1 \succ \pi_2$ if $V^{\pi_1} \succ V^{\pi_2}$.

Bellman Operator. For $\pi \in \Pi$, we define $B^{\pi}: (S \to \mathbb{R}) \to (S \to \mathbb{R})$ as follows: for $X: S \to \mathbb{R}$ and $\forall s \in S$,
$(B^{\pi}(X))(s) \stackrel{\text{def}}{=} \sum_{s' \in S} T(s, \pi(s), s')\left(R(s, \pi(s), s') + \gamma X(s')\right)$.

Fact 1 (monotonicity). For $\pi \in \Pi$, $X: S \to \mathbb{R}$, and $Y: S \to \mathbb{R}$: if $X \succeq Y$, then $B^{\pi}(X) \succeq B^{\pi}(Y)$.

Fact 2. For $\pi \in \Pi$ and $X: S \to \mathbb{R}$: $\lim_{l \to \infty} (B^{\pi})^{l}(X) = V^{\pi}$ (from Banach's fixed-point theorem).
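Fact 2 is easy to observe numerically: applying $B^{\pi}$ repeatedly to any bounded starting vector drives it to $V^{\pi}$. A sketch, again under the assumed `(n, k, n)` array layout:

```python
import numpy as np

def bellman_operator(T, R, gamma, pi, X):
    """(B^pi(X))(s) = sum_{s'} T(s,pi(s),s') (R(s,pi(s),s') + gamma X(s'))."""
    n = T.shape[0]
    T_pi = T[np.arange(n), pi]       # (n, n): T(s, pi(s), s')
    R_pi = R[np.arange(n), pi]       # (n, n): R(s, pi(s), s')
    return np.einsum('ij,ij->i', T_pi, R_pi + gamma * X)

def iterate_to_fixed_point(T, R, gamma, pi, iters=1000):
    """Repeated application converges to V^pi from any start (Fact 2)."""
    X = np.random.randn(T.shape[0])  # arbitrary bounded starting vector
    for _ in range(iters):
        X = bellman_operator(T, R, gamma, pi, X)
    return X                         # approximately V^pi
```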
Proof of Policy Improvement Theorem

Observe that for $\pi, \pi' \in \Pi$, $\forall s \in S: (B^{\pi'}(V^{\pi}))(s) = Q^{\pi}(s, \pi'(s))$.

$\pi$ has no improvable states
$\implies \forall \pi' \in \Pi: V^{\pi} \succeq B^{\pi'}(V^{\pi})$
$\implies \forall \pi' \in \Pi: V^{\pi} \succeq B^{\pi'}(V^{\pi}) \succeq (B^{\pi'})^{2}(V^{\pi})$ (by Fact 1)
$\implies \forall \pi' \in \Pi: V^{\pi} \succeq B^{\pi'}(V^{\pi}) \succeq (B^{\pi'})^{2}(V^{\pi}) \succeq \cdots \succeq \lim_{l \to \infty} (B^{\pi'})^{l}(V^{\pi})$
$\implies \forall \pi' \in \Pi: V^{\pi} \succeq V^{\pi'}$ (by Fact 2).

$\pi$ has improvable states and policy improvement yields $\pi'$
$\implies B^{\pi'}(V^{\pi}) \succ V^{\pi}$
$\implies (B^{\pi'})^{2}(V^{\pi}) \succeq B^{\pi'}(V^{\pi}) \succ V^{\pi}$ (by Fact 1)
$\implies \lim_{l \to \infty} (B^{\pi'})^{l}(V^{\pi}) \succeq \cdots \succeq (B^{\pi'})^{2}(V^{\pi}) \succeq B^{\pi'}(V^{\pi}) \succ V^{\pi}$
$\implies V^{\pi'} \succ V^{\pi}$ (by Fact 2).
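The dominance claims can also be sanity-checked numerically: on a random MDP, one greedy improvement step should yield $V^{\pi'} \succeq V^{\pi}$, strictly somewhere whenever a switch occurred. A self-contained sketch (all names and the random MDP are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, gamma = 5, 3, 0.9

# Random MDP: each T[s, a, :] is a distribution over next states.
T = rng.random((n, k, n)); T /= T.sum(axis=2, keepdims=True)
R = rng.random((n, k, n))

def V_of(pi):
    """Exact evaluation: solve (I - gamma * T_pi) V = R_pi."""
    T_pi, R_pi = T[np.arange(n), pi], R[np.arange(n), pi]
    return np.linalg.solve(np.eye(n) - gamma * T_pi,
                           np.einsum('ij,ij->i', T_pi, R_pi))

pi = np.zeros(n, dtype=int)                      # arbitrary initial policy
V = V_of(pi)
Q = np.einsum('ijk,ijk->ij', T, R + gamma * V)   # Q^pi(s, a)
# Greedy switching (one legal choice of improving action) where improvable.
pi_next = np.where(Q.max(axis=1) > V + 1e-12, Q.argmax(axis=1), pi)

V_next = V_of(pi_next)
improved = np.any(pi_next != pi)
assert np.all(V_next >= V - 1e-9)                # weak dominance everywhere
assert (not improved) or np.any(V_next > V + 1e-9)  # strict somewhere
```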
Policy Iteration Algorithm

$\pi \leftarrow$ arbitrary policy.
While $\pi$ has improvable states:
    $\pi \leftarrow$ PolicyImprovement($\pi$).

The number of iterations depends on the switching strategy; current bounds are quite loose.
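Putting the pieces together, here is a minimal sketch of the full loop under the same assumed array layout. It switches greedily at every improvable state (Howard's policy iteration), which is just one possible switching strategy.

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Howard's policy iteration: greedy switching at all improvable states."""
    n, k, _ = T.shape
    pi = np.zeros(n, dtype=int)      # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * T_pi) V = R_pi exactly.
        T_pi = T[np.arange(n), pi]
        R_pi = np.einsum('ij,ij->i', T_pi, R[np.arange(n), pi])
        V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
        # Policy improvement: switch wherever some action beats pi(s).
        Q = np.einsum('ijk,ijk->ij', T, R + gamma * V)
        pi_next = np.where(Q.max(axis=1) > V + 1e-12, Q.argmax(axis=1), pi)
        if np.array_equal(pi_next, pi):
            return pi, V             # no improvable states: pi is optimal
        pi = pi_next
```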