(1)

13. Reinforcement Learning

[Read Chapter 13]

[Exercises 13.1, 13.2, 13.4]

Control learning

Control policies that choose optimal actions

Q learning

Convergence

(2)

Control Learning

Consider learning to choose actions, e.g.,

Robot learning to dock on battery charger

Learning to choose actions to optimize factory output

Learning to play Backgammon

Note several problem characteristics:

Delayed reward

Opportunity for active exploration

Possibility that state only partially observable

Possible need to learn multiple tasks with same sensors/effectors

(3)

One Example: TD-Gammon

[Tesauro, 1995]

Learn to play Backgammon

Immediate reward:

+100 if win

-100 if lose

0 for all other states

Trained by playing 1.5 million games against itself

Now approximately equal to best human player

(4)

Reinforcement Learning Problem

[Figure: agent-environment interaction loop. The agent observes the current state and reward from the environment and responds with an action.]

s0 --a0--> s1 --a1--> s2 --a2--> ... with rewards r0, r1, r2, ...

Goal: Learn to choose actions that maximize

r_0 + γ r_1 + γ^2 r_2 + ..., where 0 ≤ γ < 1
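As a concrete illustration of this objective, the short Python sketch below sums a finite reward sequence with discount γ. The reward values and the choice γ = 0.9 are made up for illustration, not taken from the slides.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Made-up example: later rewards count for less because gamma < 1.
print(discounted_return([0, 0, 100], gamma=0.9))  # 0 + 0 + 0.81*100 = 81.0
```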

(5)

Markov Decision Processes

Assume

finite set of states S

set of actions A

at each discrete time, agent observes state s_t ∈ S and chooses action a_t ∈ A

then receives immediate reward r_t

and state changes to s_{t+1}

Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)

- i.e., r_t and s_{t+1} depend only on current state and action

- functions δ and r may be nondeterministic

- functions δ and r not necessarily known to agent
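A minimal sketch of this formalism for the deterministic case, with δ and r written as plain Python functions. The three-state chain used here is invented purely for illustration; only its shape (a goal-reward structure like the grid-world example later) follows the slides.

```python
# Hypothetical deterministic MDP: a three-state chain ending in an absorbing goal.
STATES = ["s0", "s1", "goal"]
ACTIONS = ["left", "right"]

def delta(s, a):
    """State transition function delta(s, a) -> next state."""
    order = {"s0": 0, "s1": 1, "goal": 2}
    i = order[s]
    i = min(i + 1, 2) if a == "right" else max(i - 1, 0)
    return STATES[i]

def r(s, a):
    """Immediate reward r(s, a): 100 for entering the goal, 0 otherwise."""
    return 100 if delta(s, a) == "goal" and s != "goal" else 0
```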

(6)

Agent's Learning Task

Execute actions in environment, observe results, and learn action policy π : S → A that maximizes

E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...]

from any starting state in S

here 0 ≤ γ < 1 is the discount factor for future rewards

Note something new:

Target function is π : S → A

but we have no training examples of form ⟨s, a⟩

training examples are of form ⟨⟨s, a⟩, r⟩

(7)

Value Function

To begin, consider deterministic worlds...

For each possible policy π the agent might adopt, we can define an evaluation function over states

V^π(s) ≡ r_t + γ r_{t+1} + γ^2 r_{t+2} + ...

       ≡ Σ_{i=0}^∞ γ^i r_{t+i}

where r_t, r_{t+1}, ... are generated by following policy π starting at state s

Restated, the task is to learn the optimal policy π*

π* ≡ argmax_π V^π(s), (∀s)
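In a deterministic world, V^π(s) can be approximated by simply following the policy from s and summing discounted rewards. The sketch below reuses the hypothetical delta and r from the earlier chain example and truncates the infinite sum after a fixed horizon; it is an illustration of the definition, not part of the slides.

```python
def V_pi(s, policy, delta, r, gamma=0.9, horizon=100):
    """Approximate V^pi(s) by following `policy` from s and summing
    discounted rewards, truncated after `horizon` steps."""
    total = 0.0
    for t in range(horizon):
        a = policy(s)
        total += gamma**t * r(s, a)
        s = delta(s, a)
    return total

# With the chain MDP above and policy = lambda s: "right":
#   V_pi("s1", ...) = 100 and V_pi("s0", ...) = 0.9 * 100 = 90.
```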

(8)

[Figure: a simple grid world with absorbing goal state G, shown in four panels: r(s, a) (immediate reward) values, Q(s, a) values, V*(s) values, and one optimal policy. Immediate reward is 100 for actions entering G and 0 otherwise; with γ = 0.9 the remaining Q and V* values are 90, 81, and 72.]

(9)

What to Learn

We might try to have agent learn the evaluation function V^{π*} (which we write as V*)

It could then do a lookahead search to choose best action from any state s because

π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

A problem:

This works well if agent knows δ : S × A → S, and r : S × A → ℝ

But when it doesn't, it can't choose actions this way

(10)

Q Function

Define new function very similar to V*

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If agent learns Q, it can choose optimal action even without knowing δ!

π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

π*(s) = argmax_a Q(s, a)

Q is the evaluation function the agent will learn
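Once Q is stored as a table, extracting the estimated optimal action is a single argmax over actions and needs neither δ nor r. A minimal sketch, assuming Q is kept in a dict keyed by (state, action) and that the list of available actions is supplied by the caller (both assumptions, not from the slides):

```python
def greedy_action(Q, s, actions):
    """pi*(s) = argmax_a Q(s, a), read straight off a Q table.

    `Q` is a dict keyed by (state, action); unseen pairs default to 0.0.
    """
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```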

(11)

Training Rule to Learn Q

Note Q and V* closely related:

V*(s) = max_{a'} Q(s, a')

Which allows us to write Q recursively as

Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))

            = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Nice! Let Q̂ denote learner's current approximation to Q. Consider training rule

Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

where s' is the state resulting from applying action a in state s

(12)

Q Learning for Deterministic Worlds

For each s, a initialize table entry Q̂(s, a) ← 0

Observe current state s

Do forever:

  Select an action a and execute it

  Receive immediate reward r

  Observe the new state s'

  Update the table entry for Q̂(s, a) as follows:

    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

  s ← s'
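A minimal Python sketch of this algorithm, restated episodically. The environment interface (reset, step, actions) is hypothetical and not part of the slides; any action-selection strategy that visits every ⟨s, a⟩ infinitely often would do, so plain random exploration is used here.

```python
import random
from collections import defaultdict

def q_learning_deterministic(env, gamma=0.9, episodes=1000):
    """Tabular Q-learning for a deterministic world (the rule above).

    Assumed (hypothetical) environment interface:
      env.reset() -> initial state
      env.step(s, a) -> (reward, next_state, done)   # deterministic delta and r
      env.actions(s) -> list of actions available in state s
    """
    Q = defaultdict(float)  # Q-hat table, 0 for every (s, a) by default

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Select an action a and execute it (random exploration).
            a = random.choice(env.actions(s))
            r, s_next, done = env.step(s, a)

            # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            best_next = max((Q[(s_next, a2)] for a2 in env.actions(s_next)),
                            default=0.0)
            Q[(s, a)] = r + gamma * best_next

            s = s_next
    return Q
```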

(13)

Updating Q̂

[Figure: two snapshots of the grid world as the robot R executes a_right, moving from initial state s1 to next state s2; the arrows are labeled with current Q̂ values (63, 72, 81, 90, 100).]

Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')

               ← 0 + 0.9 max{63, 81, 100}

               ← 90

Notice if rewards are non-negative, then

(∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)

and

(∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)
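The single update above can be checked numerically; the Q̂ values for s2 and γ = 0.9 are taken directly from the example.

```python
gamma = 0.9
r = 0                    # immediate reward for this move
q_s2 = [63, 81, 100]     # current Q-hat values for actions available in s2

q_s1_right = r + gamma * max(q_s2)
print(q_s1_right)        # 90.0, matching the updated table entry in the figure
```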

(14)

Q̂ converges to Q

Consider case of deterministic world where we see each ⟨s, a⟩ visited infinitely often.

Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval the largest error in the Q̂ table is reduced by a factor of γ.

Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is

Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|

For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

|Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|

                          = γ |max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a')|

                          ≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|

                          ≤ γ max_{s'', a'} |Q̂_n(s'', a') − Q(s'', a')|

|Q̂_{n+1}(s, a) − Q(s, a)| ≤ γ Δ_n

(15)

Note we used general fact that

|max_a f1(a) − max_a f2(a)| ≤ max_a |f1(a) − f2(a)|

(16)

Nondeterministic Case

What if reward and next state are non-deterministic?

We redefine V, Q by taking expected values

V^π(s) ≡ E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...]

       ≡ E[ Σ_{i=0}^∞ γ^i r_{t+i} ]

Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]

(17)

Nondeterministic Case

Q learning generalizes to nondeterministic worlds

Alter training rule to

Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a'} Q̂_{n−1}(s', a')]

where

α_n = 1 / (1 + visits_n(s, a))

Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]
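A sketch of one such update in Python, with the decaying learning rate α_n = 1 / (1 + visits_n(s, a)) from the rule above. The calling convention (defaultdict tables, an explicit list of actions available in s') is an assumption for illustration, not part of the slides.

```python
from collections import defaultdict

def nondet_q_update(Q, visits, s, a, r, s_next, actions, gamma=0.9):
    """One Q-learning update for a nondeterministic world.

    `Q` is a defaultdict(float) and `visits` a defaultdict(int), both keyed
    by (state, action); `actions` lists the actions available in s_next.
    """
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])

    best_next = max((Q[(s_next, a2)] for a2 in actions), default=0.0)

    # Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [r + gamma max_a' Q_{n-1}(s',a')]
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]
```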

(18)

Temporal Difference Learning

Q learning: reduce discrepancy between successive Q estimates

One step time difference:

Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)

Why not two steps?

Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ^2 max_a Q̂(s_{t+2}, a)

Or n?

Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)

Blend all of these:

Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ^2 Q^(3)(s_t, a_t) + ... ]

(19)

Temporal Difference Learning

Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ^2 Q^(3)(s_t, a_t) + ... ]

Equivalent recursive expression:

Q^λ(s_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]

TD(λ) algorithm uses above training rule

Sometimes converges faster than Q learning

Converges for learning V for any 0 ≤ λ ≤ 1 (Dayan, 1992)

Tesauro's TD-Gammon uses this algorithm
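A sketch of the two pieces defined above: an n-step return Q^(n) and the λ-weighted blend. The function signatures and the choice to give the leftover weight to the last available return (so a finite list of returns still has weights summing to 1) are illustration-level assumptions, not prescribed by the slides.

```python
def n_step_return(rewards, Q_hat, s_n, actions, gamma, n):
    """Q^(n)(s_t, a_t): n discounted rewards plus a bootstrapped tail.

    `rewards` holds r_t, ..., r_{t+n-1}; `s_n` is s_{t+n}; `Q_hat` is a dict
    keyed by (state, action); `actions` lists the actions available in s_n.
    """
    ret = sum(gamma**i * r for i, r in enumerate(rewards[:n]))
    ret += gamma**n * max((Q_hat.get((s_n, a), 0.0) for a in actions),
                          default=0.0)
    return ret


def lambda_return(n_step_returns, lam):
    """Blend Q^(1), Q^(2), ... with weights (1 - lambda) * lambda^(n-1).

    With a finite list, the remaining weight goes to the last return so the
    weights still sum to 1 (one reasonable truncation).
    """
    total, weight_used = 0.0, 0.0
    for n, q_n in enumerate(n_step_returns[:-1], start=1):
        w = (1 - lam) * lam**(n - 1)
        total += w * q_n
        weight_used += w
    total += (1 - weight_used) * n_step_returns[-1]
    return total
```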

(20)

Subtleties and Ongoing Research

Replace Q̂ table with neural net or other generalizer

Handle case where state only partially observable

Design optimal exploration strategies

Extend to continuous action, state

Learn and use δ̂ : S × A → S

Relationship to dynamic programming
