
Algorithms for MDP Planning

Shivaram Kalyanakrishnan

Department of Computer Science and Engineering, Indian Institute of Technology Bombay

shivaram@cse.iitb.ac.in

August 2019


Overview

1. Value Iteration

2. Linear Programming

3. Policy Iteration

   Policy Improvement Theorem


Value Iteration

$V_0 \leftarrow$ arbitrary, element-wise bounded, $n$-length vector. $t \leftarrow 0$.

Repeat:

For $s \in S$:

$V_{t+1}(s) \leftarrow \max_{a \in A} \sum_{s' \in S} T(s, a, s') \big( R(s, a, s') + \gamma V_t(s') \big)$.

$t \leftarrow t + 1$.

Until $V_t \approx V_{t-1}$ (up to machine precision).

Convergence to $V^*$ is guaranteed by a max-norm contraction argument.
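As a concrete illustration, here is a minimal sketch of the update above in Python, assuming the MDP is supplied as NumPy arrays T[s, a, s'] (transition probabilities) and R[s, a, s'] (rewards) with discount $\gamma \in [0, 1)$; the function name, array layout, and stopping tolerance are illustrative, not from the slides.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-10):
    """Value iteration on a finite MDP given as arrays T[s,a,s'] and R[s,a,s']."""
    n, k, _ = T.shape              # n states, k actions
    V = np.zeros(n)                # V_0: any element-wise bounded n-vector works
    while True:
        # Q[s, a] = sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V(s'))
        Q = np.einsum('sap,sap->sa', T, R + gamma * V[None, None, :])
        V_next = Q.max(axis=1)     # V_{t+1}(s) = max_a Q(s, a)
        if np.max(np.abs(V_next - V)) < tol:   # max-norm stopping test
            return V_next
        V = V_next
```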


Linear Programming

Minimise $\sum_{s \in S} V(s)$

subject to $V(s) \ge \sum_{s' \in S} T(s, a, s') \big( R(s, a, s') + \gamma V(s') \big)$, $\forall s \in S, \forall a \in A$.

Let $|S| = n$ and $|A| = k$: $n$ variables, $nk$ constraints.

Can also be posed as a dual with $nk$ variables and $n$ constraints.
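For concreteness, here is a minimal sketch of this primal LP using scipy.optimize.linprog, with each constraint rearranged into the solver's $A_{ub}\,x \le b_{ub}$ form; the array conventions are the same assumptions as in the value-iteration sketch.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, R, gamma):
    """Solve the primal LP: minimise sum_s V(s) s.t. V(s) >= sum_{s'} T (R + gamma V)."""
    n, k, _ = T.shape
    c = np.ones(n)                                   # objective: sum_s V(s)
    A_ub, b_ub = [], []
    for a in range(k):
        # V(s) >= sum_{s'} T(s,a,s')(R(s,a,s') + gamma V(s'))  rewritten as
        # (gamma * T[:, a, :] - I) V <= -sum_{s'} T(s,a,s') R(s,a,s')
        A_ub.append(gamma * T[:, a, :] - np.eye(n))
        b_ub.append(-np.sum(T[:, a, :] * R[:, a, :], axis=1))
    res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=[(None, None)] * n)         # V(s) may be negative
    return res.x                                     # the optimal values V*(s)
```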


Policy Improvement

Given $\pi$:

Pick one or more improvable states, and in them, switch to an arbitrary improving action.

Let the resulting policy be $\pi'$.

[Figure: a policy $\pi$ over states $s_1, \dots, s_8$, with $Q^{\pi}(s_3, \cdot)$ and $Q^{\pi}(s_7, \cdot)$ compared across actions to mark improvable states and improving actions; applying the switch yields the improved policy $\pi'$.]

Policy Improvement Theorem:

(1) If $\pi$ has no improvable states, then it is optimal, else
(2) if $\pi'$ is obtained as above, then
$\forall s \in S: V^{\pi'}(s) \ge V^{\pi}(s)$ and $\exists s \in S: V^{\pi'}(s) > V^{\pi}(s)$.
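One such improvement step could be sketched in Python as below, under the same assumed array conventions; choosing the greedy action in each improvable state is just one valid instance of the "arbitrary improving action" above.

```python
import numpy as np

def q_values(T, R, gamma, pi):
    """Q^pi for a deterministic policy pi (integer array of length n)."""
    n = len(pi)
    T_pi = T[np.arange(n), pi, :]                             # T(s, pi(s), s')
    R_pi = np.sum(T_pi * R[np.arange(n), pi, :], axis=1)
    V_pi = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)    # exact V^pi
    # Q^pi(s, a) = sum_{s'} T(s,a,s') (R(s,a,s') + gamma V^pi(s'))
    return np.einsum('sap,sap->sa', T, R + gamma * V_pi[None, None, :])

def policy_improvement(T, R, gamma, pi, tol=1e-12):
    """Switch to an improving action (here: the greedy one) in every improvable state."""
    Q = q_values(T, R, gamma, pi)
    improvable = Q.max(axis=1) > Q[np.arange(len(pi)), pi] + tol
    pi_new = pi.copy()
    pi_new[improvable] = Q[improvable].argmax(axis=1)
    return pi_new, bool(improvable.any())
```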


Definitions and Basic Facts

For $X: S \to \mathbb{R}$ and $Y: S \to \mathbb{R}$, we define $X \succeq Y$ if $\forall s \in S: X(s) \ge Y(s)$, and we define $X \succ Y$ if $X \succeq Y$ and $\exists s \in S: X(s) > Y(s)$.

For policies $\pi_1, \pi_2 \in \Pi$, we define $\pi_1 \succeq \pi_2$ if $V^{\pi_1} \succeq V^{\pi_2}$, and we define $\pi_1 \succ \pi_2$ if $V^{\pi_1} \succ V^{\pi_2}$.

Bellman Operator. For $\pi \in \Pi$, we define $B^{\pi}: (S \to \mathbb{R}) \to (S \to \mathbb{R})$ as follows: for $X: S \to \mathbb{R}$ and $\forall s \in S$,

$(B^{\pi}(X))(s) \stackrel{\text{def}}{=} \sum_{s' \in S} T(s, \pi(s), s') \big( R(s, \pi(s), s') + \gamma X(s') \big)$.

Fact 1. For $\pi \in \Pi$, $X: S \to \mathbb{R}$, and $Y: S \to \mathbb{R}$: if $X \succeq Y$, then $B^{\pi}(X) \succeq B^{\pi}(Y)$.

Fact 2. For $\pi \in \Pi$ and $X: S \to \mathbb{R}$: $\lim_{l \to \infty} (B^{\pi})^l(X) = V^{\pi}$. (From Banach's fixed-point theorem.)
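Under the same assumed conventions as the earlier sketches, one application of $B^{\pi}$ to a value estimate $X$ could look like this; iterating it converges to $V^{\pi}$, as stated in Fact 2.

```python
import numpy as np

def bellman_pi(T, R, gamma, pi, X):
    """(B^pi X)(s) = sum_{s'} T(s, pi(s), s') (R(s, pi(s), s') + gamma X(s'))."""
    n = len(pi)
    T_pi = T[np.arange(n), pi, :]        # T(s, pi(s), s')
    R_pi = R[np.arange(n), pi, :]        # R(s, pi(s), s')
    return np.sum(T_pi * (R_pi + gamma * X[None, :]), axis=1)
```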


Proof of Policy Improvement Theorem

Observe that for $\pi, \pi' \in \Pi$, $\forall s \in S: B^{\pi'}(V^{\pi})(s) = Q^{\pi}(s, \pi'(s))$.

$\pi$ has no improvable states

$\implies \forall \pi' \in \Pi: V^{\pi} \succeq B^{\pi'}(V^{\pi})$

$\implies \forall \pi' \in \Pi: V^{\pi} \succeq B^{\pi'}(V^{\pi}) \succeq (B^{\pi'})^2(V^{\pi})$

$\implies \forall \pi' \in \Pi: V^{\pi} \succeq B^{\pi'}(V^{\pi}) \succeq (B^{\pi'})^2(V^{\pi}) \succeq \cdots \succeq \lim_{l \to \infty} (B^{\pi'})^l(V^{\pi})$

$\implies \forall \pi' \in \Pi: V^{\pi} \succeq V^{\pi'}$.

$\pi$ has improvable states and policy improvement yields $\pi'$

$\implies B^{\pi'}(V^{\pi}) \succ V^{\pi}$

$\implies (B^{\pi'})^2(V^{\pi}) \succeq B^{\pi'}(V^{\pi}) \succ V^{\pi}$

$\implies \lim_{l \to \infty} (B^{\pi'})^l(V^{\pi}) \succeq \cdots \succeq (B^{\pi'})^2(V^{\pi}) \succeq B^{\pi'}(V^{\pi}) \succ V^{\pi}$

$\implies V^{\pi'} \succ V^{\pi}$.


Policy Iteration Algorithm

$\pi \leftarrow$ arbitrary policy.

While $\pi$ has improvable states:

$\pi \leftarrow$ PolicyImprovement($\pi$).

The number of iterations depends on the switching strategy; current bounds are quite loose.
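Putting the pieces together, a minimal standalone sketch of this loop under the same assumed array conventions might look as follows; it evaluates $\pi$ exactly, improves greedily, and stops when no state is improvable.

```python
import numpy as np

def policy_iteration(T, R, gamma, tol=1e-12):
    """Policy iteration on a finite MDP given as arrays T[s,a,s'] and R[s,a,s']."""
    n, k, _ = T.shape
    pi = np.zeros(n, dtype=int)                       # arbitrary initial policy
    while True:
        # Policy evaluation: V^pi = (I - gamma T^pi)^{-1} R^pi.
        T_pi = T[np.arange(n), pi, :]
        R_pi = np.sum(T_pi * R[np.arange(n), pi, :], axis=1)
        V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
        # Q^pi(s, a) and the set of improvable states.
        Q = np.einsum('sap,sap->sa', T, R + gamma * V[None, None, :])
        improvable = Q.max(axis=1) > Q[np.arange(n), pi] + tol
        if not improvable.any():                      # no improvable states: pi is optimal
            return pi, V
        pi[improvable] = Q[improvable].argmax(axis=1) # switch to improving actions
```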
