Algorithms for MDP Planning
Shivaram Kalyanakrishnan
Department of Computer Science and Engineering, Indian Institute of Technology Bombay
shivaram@cse.iitb.ac.in
August 2019
Overview
1. Value Iteration
2. Linear Programming
3. Policy Iteration
Policy Improvement Theorem
Value Iteration

$V_0 \leftarrow$ arbitrary, element-wise bounded, $n$-length vector. $t \leftarrow 0$.
Repeat:
    For $s \in S$:
        $V_{t+1}(s) \leftarrow \max_{a \in A} \sum_{s' \in S} T(s, a, s')\left(R(s, a, s') + \gamma V_t(s')\right)$.
    $t \leftarrow t + 1$.
Until $V_t \approx V_{t-1}$ (up to machine precision).

Convergence to $V^{\star}$ is guaranteed by a max-norm contraction argument.
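As a concrete rendering of this loop, here is a minimal NumPy sketch. The array names `T`, `R` and the $(n, k, n)$ layout are assumptions for illustration, not fixed by the slides.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-12):
    """Value iteration for an MDP with n states and k actions.

    T[s, a, s2] is the transition probability, R[s, a, s2] the reward;
    both are assumed to be (n, k, n) arrays. Returns the converged values.
    """
    n, k, _ = T.shape
    V = np.zeros(n)                 # any element-wise bounded start works
    while True:
        # Q[s, a] = sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V(s'))
        Q = np.einsum('ijk,ijk->ij', T, R + gamma * V)
        V_next = Q.max(axis=1)      # greedy backup over actions
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
```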
Linear Programming

Minimise $\sum_{s \in S} V(s)$
subject to $V(s) \geq \sum_{s' \in S} T(s, a, s')\left(R(s, a, s') + \gamma V(s')\right), \; \forall s \in S, \forall a \in A$.

Let $|S| = n$ and $|A| = k$: $n$ variables, $nk$ constraints.
Can also be posed as a dual LP with $nk$ variables and $n$ constraints.
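This LP can be handed to an off-the-shelf solver. Below is a sketch using scipy.optimize.linprog, with each constraint rearranged into the $A_{ub} x \leq b_{ub}$ form the solver expects; `T`, `R`, `gamma` follow the same assumed layout as above.

```python
import numpy as np
from scipy.optimize import linprog

def plan_by_lp(T, R, gamma):
    """Minimise sum_s V(s) subject to
    V(s) >= sum_{s'} T(s,a,s') (R(s,a,s') + gamma V(s')) for all s, a."""
    n, k, _ = T.shape
    c = np.ones(n)                   # objective: sum of state values
    # Rearranged constraint: (gamma * T[s,a,:] - e_s) . V <= -T[s,a,:] . R[s,a,:]
    A_ub = np.zeros((n * k, n))
    b_ub = np.zeros(n * k)
    for s in range(n):
        for a in range(k):
            row = gamma * T[s, a]
            row[s] -= 1.0
            A_ub[s * k + a] = row
            b_ub[s * k + a] = -np.dot(T[s, a], R[s, a])
    # Values may be negative, so lift linprog's default non-negativity bounds.
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
    return res.x                     # the optimal value vector V*
```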
Policy Improvement

Given $\pi$:
pick one or more improvable states, and in each, switch to an arbitrary improving action.
Let the resulting policy be $\pi'$.
(A state $s$ is improvable under $\pi$ if some action $a$ satisfies $Q^{\pi}(s, a) > Q^{\pi}(s, \pi(s)) = V^{\pi}(s)$; such an $a$ is an improving action.)

[Figure: policies $\pi$ and $\pi'$ over states $s_1, s_2, \dots, s_8$, with improvable states and improving actions highlighted, illustrated by comparisons of $Q^{\pi}(s_3, \cdot)$ and $Q^{\pi}(s_7, \cdot)$.]

Policy Improvement Theorem:
(1) If $\pi$ has no improvable states, then it is optimal; else
(2) if $\pi'$ is obtained as above, then
$\forall s \in S: V^{\pi'}(s) \geq V^{\pi}(s)$ and $\exists s \in S: V^{\pi'}(s) > V^{\pi}(s)$.
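One way to realise this step in code: evaluate $\pi$ exactly by solving the linear system $(I - \gamma T_{\pi}) V^{\pi} = R_{\pi}$, then compare $Q^{\pi}(s, a)$ against $V^{\pi}(s)$ in every state. A sketch under the same assumed array layout (function names are illustrative):

```python
import numpy as np

def evaluate_policy(T, R, gamma, pi):
    """Exact policy evaluation: solve (I - gamma * T_pi) V = R_pi."""
    n = T.shape[0]
    T_pi = T[np.arange(n), pi]                   # (n, n): T(s, pi(s), s')
    R_pi = np.einsum('ij,ij->i', T_pi, R[np.arange(n), pi])
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

def improve_policy(T, R, gamma, pi):
    """Switch to an (arbitrary) improving action in every improvable state."""
    V = evaluate_policy(T, R, gamma, pi)
    Q = np.einsum('ijk,ijk->ij', T, R + gamma * V)   # Q^pi(s, a)
    pi_next = pi.copy()
    for s in range(T.shape[0]):
        improving = np.flatnonzero(Q[s] > Q[s, pi[s]] + 1e-12)
        if improving.size > 0:                   # s is improvable
            pi_next[s] = improving[0]            # pick any improving action
    return pi_next
```

Here the first improving action is chosen in every improvable state; the theorem licenses any such choice.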
Definitions and Basic Facts

For $X: S \to \mathbb{R}$ and $Y: S \to \mathbb{R}$, we define $X \succeq Y$ if $\forall s \in S: X(s) \geq Y(s)$, and we define $X \succ Y$ if $X \succeq Y$ and $\exists s \in S: X(s) > Y(s)$.

For policies $\pi_1, \pi_2 \in \Pi$, we define $\pi_1 \succeq \pi_2$ if $V^{\pi_1} \succeq V^{\pi_2}$, and we define $\pi_1 \succ \pi_2$ if $V^{\pi_1} \succ V^{\pi_2}$.

Bellman Operator. For $\pi \in \Pi$, we define $B^{\pi}: (S \to \mathbb{R}) \to (S \to \mathbb{R})$ as follows: for $X: S \to \mathbb{R}$ and $\forall s \in S$,
$(B^{\pi}(X))(s) \stackrel{\text{def}}{=} \sum_{s' \in S} T(s, \pi(s), s')\left(R(s, \pi(s), s') + \gamma X(s')\right)$.

Fact 1 (monotonicity). For $\pi \in \Pi$, $X: S \to \mathbb{R}$, and $Y: S \to \mathbb{R}$: if $X \succeq Y$, then $B^{\pi}(X) \succeq B^{\pi}(Y)$.

Fact 2. For $\pi \in \Pi$ and $X: S \to \mathbb{R}$: $\lim_{l \to \infty} (B^{\pi})^{l}(X) = V^{\pi}$ (from Banach's fixed-point theorem).
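Fact 2 is easy to observe numerically: applying $B^{\pi}$ repeatedly to any bounded starting vector drives it to $V^{\pi}$. A sketch, again under the assumed `(n, k, n)` array layout:

```python
import numpy as np

def bellman_operator(T, R, gamma, pi, X):
    """(B^pi(X))(s) = sum_{s'} T(s,pi(s),s') (R(s,pi(s),s') + gamma X(s'))."""
    n = T.shape[0]
    T_pi = T[np.arange(n), pi]       # (n, n): T(s, pi(s), s')
    R_pi = R[np.arange(n), pi]       # (n, n): R(s, pi(s), s')
    return np.einsum('ij,ij->i', T_pi, R_pi + gamma * X)

def iterate_to_fixed_point(T, R, gamma, pi, iters=1000):
    """Repeated application converges to V^pi from any start (Fact 2)."""
    X = np.random.randn(T.shape[0])  # arbitrary bounded starting vector
    for _ in range(iters):
        X = bellman_operator(T, R, gamma, pi, X)
    return X                         # approximately V^pi
```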
Proof of Policy Improvement Theorem

Observe that for $\pi, \pi' \in \Pi$, $\forall s \in S: (B^{\pi'}(V^{\pi}))(s) = Q^{\pi}(s, \pi'(s))$.

$\pi$ has no improvable states
$\implies \forall \pi' \in \Pi: V^{\pi} \succeq B^{\pi'}(V^{\pi})$
$\implies \forall \pi' \in \Pi: V^{\pi} \succeq B^{\pi'}(V^{\pi}) \succeq (B^{\pi'})^{2}(V^{\pi})$ (by Fact 1)
$\implies \forall \pi' \in \Pi: V^{\pi} \succeq B^{\pi'}(V^{\pi}) \succeq (B^{\pi'})^{2}(V^{\pi}) \succeq \cdots \succeq \lim_{l \to \infty} (B^{\pi'})^{l}(V^{\pi})$
$\implies \forall \pi' \in \Pi: V^{\pi} \succeq V^{\pi'}$ (by Fact 2).

$\pi$ has improvable states and policy improvement yields $\pi'$
$\implies B^{\pi'}(V^{\pi}) \succ V^{\pi}$
$\implies (B^{\pi'})^{2}(V^{\pi}) \succeq B^{\pi'}(V^{\pi}) \succ V^{\pi}$ (by Fact 1)
$\implies \lim_{l \to \infty} (B^{\pi'})^{l}(V^{\pi}) \succeq \cdots \succeq (B^{\pi'})^{2}(V^{\pi}) \succeq B^{\pi'}(V^{\pi}) \succ V^{\pi}$
$\implies V^{\pi'} \succ V^{\pi}$ (by Fact 2).
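The dominance claims can also be sanity-checked numerically: on a random MDP, one greedy improvement step should yield $V^{\pi'} \succeq V^{\pi}$, strictly somewhere whenever a switch occurred. A self-contained sketch (all names and the random MDP are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, gamma = 5, 3, 0.9

# Random MDP: each T[s, a, :] is a distribution over next states.
T = rng.random((n, k, n)); T /= T.sum(axis=2, keepdims=True)
R = rng.random((n, k, n))

def V_of(pi):
    """Exact evaluation: solve (I - gamma * T_pi) V = R_pi."""
    T_pi, R_pi = T[np.arange(n), pi], R[np.arange(n), pi]
    return np.linalg.solve(np.eye(n) - gamma * T_pi,
                           np.einsum('ij,ij->i', T_pi, R_pi))

pi = np.zeros(n, dtype=int)                      # arbitrary initial policy
V = V_of(pi)
Q = np.einsum('ijk,ijk->ij', T, R + gamma * V)   # Q^pi(s, a)
# Greedy switching (one legal choice of improving action) where improvable.
pi_next = np.where(Q.max(axis=1) > V + 1e-12, Q.argmax(axis=1), pi)

V_next = V_of(pi_next)
improved = np.any(pi_next != pi)
assert np.all(V_next >= V - 1e-9)                # weak dominance everywhere
assert (not improved) or np.any(V_next > V + 1e-9)  # strict somewhere
```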
Policy Iteration Algorithm

$\pi \leftarrow$ arbitrary policy.
While $\pi$ has improvable states:
    $\pi \leftarrow$ PolicyImprovement($\pi$).

The number of iterations depends on the switching strategy; current bounds are quite loose.
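Putting the pieces together, here is a minimal sketch of the full loop under the same assumed array layout. It switches greedily at every improvable state (Howard's policy iteration), which is just one possible switching strategy.

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Howard's policy iteration: greedy switching at all improvable states."""
    n, k, _ = T.shape
    pi = np.zeros(n, dtype=int)      # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * T_pi) V = R_pi exactly.
        T_pi = T[np.arange(n), pi]
        R_pi = np.einsum('ij,ij->i', T_pi, R[np.arange(n), pi])
        V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
        # Policy improvement: switch wherever some action beats pi(s).
        Q = np.einsum('ijk,ijk->ij', T, R + gamma * V)
        pi_next = np.where(Q.max(axis=1) > V + 1e-12, Q.argmax(axis=1), pi)
        if np.array_equal(pi_next, pi):
            return pi, V             # no improvable states: pi is optimal
        pi = pi_next
```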