• No results found

UCB Algorithm

N/A
N/A
Protected

Academic year: 2022

Share "UCB Algorithm"

Copied!
47
0
0

Loading.... (view fulltext now)

Full text

(1)

1/9

Regret Bound for UCB

(Adapted from the proof by Auer et al., 2002)

Shivaram Kalyanakrishnan

shivaram@cse.iitb.ac.in

Department of Computer Science and Engineering Indian Institute of Technology Bombay

August 2018

(2)

2/9

UCB

- Pull each arm once.

- At timet∈ {n,n+1, . . .}, for every arma,ucbtadef= ˆpta+ r2 ln(t)

uta ; pullargmaxaucbta.

R 1

0

pa t ucb at

Shivaram Kalyanakrishnan (2018) UCB Regret 2 / 9

(3)

2/9

UCB Algorithm

UCB

- Pull each arm once.

- At timet∈ {n,n+1, . . .}, for every arma,ucbtadef= ˆpta+ r2 ln(t)

uta ; pullargmaxaucbta.

R 1

0

pa t ucb at

(4)

2/9

UCB

- Pull each arm once.

- At timet∈ {n,n+1, . . .}, for every arma,ucbtadef= ˆpta+ r2 ln(t)

uta ; pullargmaxaucbta.

R 1

0

pa t ucb at

Recall thatRT=Tp−PT−1 t=0 E[rt].

Shivaram Kalyanakrishnan (2018) UCB Regret 2 / 9

(5)

2/9

UCB Algorithm

UCB

- Pull each arm once.

- At timet∈ {n,n+1, . . .}, for every arma,ucbtadef= ˆpta+ r2 ln(t)

uta ; pullargmaxaucbta.

R 1

0

pa t ucb at

Recall thatRT=Tp−PT−1 t=0 E[rt].

We shall show that UCB achievesRT=O P

a:pa6=p 1

p−palog(T) .

(6)

3/9

a=p−pa(instance-specificconstant).

Shivaram Kalyanakrishnan (2018) UCB Regret 3 / 9

(7)

3/9

Notation

a

=defp−pa(instance-specificconstant).

LetZat be theeventthat armais pulled at timet.

(8)

3/9

a=p−pa(instance-specificconstant).

LetZat be theeventthat armais pulled at timet.

Letztabe arandom variablethat takes value 1 if armais pulled at timet, and 0 otherwise.

Shivaram Kalyanakrishnan (2018) UCB Regret 3 / 9

(9)

3/9

Notation

a

=defp−pa(instance-specificconstant).

LetZat be theeventthat armais pulled at timet.

Letztabe arandom variablethat takes value 1 if armais pulled at timet, and 0 otherwise.

Observe thatE[zta] =P{Zat}(1) + (1−P{Zat})(0) =P{Zat}.

(10)

3/9

a=p−pa(instance-specificconstant).

LetZat be theeventthat armais pulled at timet.

Letztabe arandom variablethat takes value 1 if armais pulled at timet, and 0 otherwise.

Observe thatE[zta] =P{Zat}(1) + (1−P{Zat})(0) =P{Zat}.

As in the algorithm,utais arandom variablethat denotes the number of pulls armahas received up to (and excluding) timet:

uta=

t−1

X

i=0

zia.

Shivaram Kalyanakrishnan (2018) UCB Regret 3 / 9

(11)

3/9

Notation

a

=defp−pa(instance-specificconstant).

LetZat be theeventthat armais pulled at timet.

Letztabe arandom variablethat takes value 1 if armais pulled at timet, and 0 otherwise.

Observe thatE[zta] =P{Zat}(1) + (1−P{Zat})(0) =P{Zat}.

As in the algorithm,utais arandom variablethat denotes the number of pulls armahas received up to (and excluding) timet:

uta=

t−1

X

i=0

zia.

We define an instance-specificconstant

¯uTa=def 8

(∆a)2 ln(T)

that will serve in our proof as a “sufficient” number of pulls of armafor horizonT.

(12)

4/9

Shivaram Kalyanakrishnan (2018) UCB Regret 4 / 9

(13)

4/9

Step 1: Show that R

T

= P

a:pa6=p

E [u

Ta

]∆

a

.

RT=Tp

T−1

X

t=0

E[rt]

(14)

4/9

RT=Tp

T−1

X

t=0

E[rt]

=Tp

T−1

X

t=0

X

a∈A

P{Zat}E[rt|Zta]

Shivaram Kalyanakrishnan (2018) UCB Regret 4 / 9

(15)

4/9

Step 1: Show that R

T

= P

a:pa6=p

E [u

Ta

]∆

a

.

RT=Tp

T−1

X

t=0

E[rt]

=Tp

T−1

X

t=0

X

a∈A

P{Zat}E[rt|Zta]

=Tp

T−1

X

t=0

X

a∈A

E[zta]pa

(16)

4/9

RT=Tp

T−1

X

t=0

E[rt]

=Tp

T−1

X

t=0

X

a∈A

P{Zat}E[rt|Zta]

=Tp

T−1

X

t=0

X

a∈A

E[zta]pa

= X

a∈A

E[uTa]

!

p−X

a∈A

E[uTa]pa

Shivaram Kalyanakrishnan (2018) UCB Regret 4 / 9

(17)

4/9

Step 1: Show that R

T

= P

a:pa6=p

E [u

Ta

]∆

a

.

RT=Tp

T−1

X

t=0

E[rt]

=Tp

T−1

X

t=0

X

a∈A

P{Zat}E[rt|Zta]

=Tp

T−1

X

t=0

X

a∈A

E[zta]pa

= X

a∈A

E[uTa]

!

p−X

a∈A

E[uTa]pa

=X

a∈A

E[uTa](p−pa)

(18)

4/9

RT=Tp

T−1

X

t=0

E[rt]

=Tp

T−1

X

t=0

X

a∈A

P{Zat}E[rt|Zta]

=Tp

T−1

X

t=0

X

a∈A

E[zta]pa

= X

a∈A

E[uTa]

!

p−X

a∈A

E[uTa]pa

=X

a∈A

E[uTa](p−pa)

= X

a:pa6=p

E[uTa]∆a.

Shivaram Kalyanakrishnan (2018) UCB Regret 4 / 9

(19)

4/9

Step 1: Show that R

T

= P

a:pa6=p

E [u

Ta

]∆

a

.

RT=Tp

T−1

X

t=0

E[rt]

=Tp

T−1

X

t=0

X

a∈A

P{Zat}E[rt|Zta]

=Tp

T−1

X

t=0

X

a∈A

E[zta]pa

= X

a∈A

E[uTa]

!

p−X

a∈A

E[uTa]pa

=X

a∈A

E[uTa](p−pa)

= X

a:pa6=p

E[uTa]∆a.

To show the regret bound, we shall show for each sub-optimal armathat

E[uTa] =O 1

(∆a)2log(T)

.

(20)

5/9

Shivaram Kalyanakrishnan (2018) UCB Regret 5 / 9

(21)

5/9

Step 2: Split sub-optimal pulls into two regimes.

To proveE[uTa] =O 1

2alog(T)

, we showE[uTa]≤¯uTa+Cfor some constantC.

(22)

5/9

To proveE[uTa] =O 1

2alog(T)

, we showE[uTa]≤¯uTa+Cfor some constantC.

E[uTa] =

T−1

X

t=0

E[zta]

Shivaram Kalyanakrishnan (2018) UCB Regret 5 / 9

(23)

5/9

Step 2: Split sub-optimal pulls into two regimes.

To proveE[uTa] =O 1

2alog(T)

, we showE[uTa]≤¯uTa+Cfor some constantC.

E[uTa] =

T−1

X

t=0

E[zta]

=

T−1

X

t=0

P{Zat}

(24)

5/9

To proveE[uTa] =O 1

2alog(T)

, we showE[uTa]≤¯uTa+Cfor some constantC.

E[uTa] =

T−1

X

t=0

E[zta]

=

T−1

X

t=0

P{Zat}

=

T−1

X

t=0

P{Zat and(uta<¯uTa)}+

T−1

X

t=0

P{Zat and(uta≥¯uTa)}

Shivaram Kalyanakrishnan (2018) UCB Regret 5 / 9

(25)

5/9

Step 2: Split sub-optimal pulls into two regimes.

To proveE[uTa] =O 1

2alog(T)

, we showE[uTa]≤¯uTa+Cfor some constantC.

E[uTa] =

T−1

X

t=0

E[zta]

=

T−1

X

t=0

P{Zat}

=

T−1

X

t=0

P{Zat and(uta<¯uTa)}+

T−1

X

t=0

P{Zat and(uta≥¯uTa)}

=A+B.

(26)

5/9

To proveE[uTa] =O 1

2alog(T)

, we showE[uTa]≤¯uTa+Cfor some constantC.

E[uTa] =

T−1

X

t=0

E[zta]

=

T−1

X

t=0

P{Zat}

=

T−1

X

t=0

P{Zat and(uta<¯uTa)}+

T−1

X

t=0

P{Zat and(uta≥¯uTa)}

=A+B.

We showAis upper-bounded by¯uTa andBis upper-bounded by a constant.

Shivaram Kalyanakrishnan (2018) UCB Regret 5 / 9

(27)

6/9

Step 3: Bounding A.

(28)

6/9

A=

T−1

X

t=0

P{Zat and(uta<¯uTa)}

Shivaram Kalyanakrishnan (2018) UCB Regret 6 / 9

(29)

6/9

Step 3: Bounding A.

A=

T−1

X

t=0

P{Zat and(uta<¯uTa)}

=

T−1

X

t=0

¯uTa−1

X

m=0

P{Zatand(uta=m)}

(30)

6/9

A=

T−1

X

t=0

P{Zat and(uta<¯uTa)}

=

T−1

X

t=0

¯uTa−1

X

m=0

P{Zatand(uta=m)}

=

¯ uTa−1

X

m=0 T−1

X

t=0

P{Zatand(uta=m)}

Shivaram Kalyanakrishnan (2018) UCB Regret 6 / 9

(31)

6/9

Step 3: Bounding A.

A=

T−1

X

t=0

P{Zat and(uta<¯uTa)}

=

T−1

X

t=0

¯uTa−1

X

m=0

P{Zatand(uta=m)}

=

¯ uTa−1

X

m=0 T−1

X

t=0

P{Zatand(uta=m)}

=

¯ uTa−1

X

m=0

P{(Za0and(u0a=m))or(Za1and(u1a=m))or. . . or(ZaT−1and(uT−1a =m))}

(32)

6/9

A=

T−1

X

t=0

P{Zat and(uta<¯uTa)}

=

T−1

X

t=0

¯uTa−1

X

m=0

P{Zatand(uta=m)}

=

¯ uTa−1

X

m=0 T−1

X

t=0

P{Zatand(uta=m)}

=

¯ uTa−1

X

m=0

P{(Za0and(u0a=m))or(Za1and(u1a=m))or. . . or(ZaT−1and(uT−1a =m))}

¯ uTa−1

X

m=0

1

Shivaram Kalyanakrishnan (2018) UCB Regret 6 / 9

(33)

6/9

Step 3: Bounding A.

A=

T−1

X

t=0

P{Zat and(uta<¯uTa)}

=

T−1

X

t=0

¯uTa−1

X

m=0

P{Zatand(uta=m)}

=

¯ uTa−1

X

m=0 T−1

X

t=0

P{Zatand(uta=m)}

=

¯ uTa−1

X

m=0

P{(Za0and(u0a=m))or(Za1and(u1a=m))or. . . or(ZaT−1and(uT−1a =m))}

¯ uTa−1

X

m=0

1

= ¯uTa.

(34)

6/9

A=

T−1

X

t=0

P{Zat and(uta<¯uTa)}

=

T−1

X

t=0

¯uTa−1

X

m=0

P{Zatand(uta=m)}

=

¯ uTa−1

X

m=0 T−1

X

t=0

P{Zatand(uta=m)}

=

¯ uTa−1

X

m=0

P{(Za0and(u0a=m))or(Za1and(u1a=m))or. . . or(ZaT−1and(uT−1a =m))}

¯ uTa−1

X

m=0

1

= ¯uTa.

We have used the fact that for 0≤i<j≤t−1,(Zai and(uia=m))and (Zaj and(uja=m))are mutually exclusive.

Shivaram Kalyanakrishnan (2018) UCB Regret 6 / 9

(35)

7/9

Step 4.1: Bounding B.

(36)

7/9

B=

T−1

X

t=0

P{Zat and(uta ≥¯uTa)}

Shivaram Kalyanakrishnan (2018) UCB Regret 7 / 9

(37)

7/9

Step 4.1: Bounding B.

B=

T−1

X

t=0

P{Zat and(uta ≥¯uTa)}

T−1

X

t=0

P (

ˆ pta+

r2 uta

ln(t)≥ˆpt+ r2

utln(t)

!

and(uta≥¯uTa) )

(38)

7/9

B=

T−1

X

t=0

P{Zat and(uta ≥¯uTa)}

T−1

X

t=0

P (

ˆ pta+

r2 uta

ln(t)≥ˆpt+ r2

utln(t)

!

and(uta≥¯uTa) )

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

P (

ˆ pa(x) +

r2

xln(t)≥ˆp(y) + r2

yln(t) )

where

ˆ

pa(x)is the empirical mean of the firstxpulls of arma, and ˆ

p(y)is the empirical mean of the firstypulls of arm⋆.

Shivaram Kalyanakrishnan (2018) UCB Regret 7 / 9

(39)

8/9

Step 4.2: Bounding B.

Fixx∈ {¯uTa,¯uTa+1, . . . ,t}andy∈ {1,2, . . . ,t}.

(40)

8/9

We have:

ˆpa(x) + r2

xln(t)≥ˆp(y) + r2

yln(t)

=⇒ ˆpa(x) + r2

xln(t)≥p

! or

ˆ p(y) +

r2

yln(t)<p

.

Shivaram Kalyanakrishnan (2018) UCB Regret 8 / 9

(41)

8/9

Step 4.2: Bounding B.

Fixx∈ {¯uTa,¯uTa+1, . . . ,t}andy∈ {1,2, . . . ,t}.

We have:

ˆpa(x) + r2

xln(t)≥ˆp(y) + r2

yln(t)

=⇒ ˆpa(x) + r2

xln(t)≥p

! or

ˆ p(y) +

r2

yln(t)<p

.

Sincex≥¯uTa, we have q2

xln(t)≤q

2

¯uTa ln(t)≤2a, and so ˆpa(x) +

r2

xln(t)≥p =⇒ ˆpa(x)≥pa+∆a

2 .

(42)

8/9

We have:

ˆpa(x) + r2

xln(t)≥ˆp(y) + r2

yln(t)

=⇒ ˆpa(x) + r2

xln(t)≥p

! or

ˆ p(y) +

r2

yln(t)<p

.

Sincex≥¯uTa, we have q2

xln(t)≤q

2

¯uTa ln(t)≤2a, and so ˆpa(x) +

r2

xln(t)≥p =⇒ ˆpa(x)≥pa+∆a

2 .

In summary:

ˆ pa(x)+

r2

xln(t)ˆp(y)+

s 2

yln(t) =

ˆ

pa(x)pa+a

2

or ˆp(y)<p s

2 yln(t)

! .

Shivaram Kalyanakrishnan (2018) UCB Regret 8 / 9

(43)

9/9

Step 4.3: Bounding B.

Continuing from Step 4.1, and now invoking Hoeffding’s Inequality:

B≤

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

P (

ˆ pa(x) +

r2

xln(t)≥ˆp(y) + r2

yln(t) )

(44)

9/9

Continuing from Step 4.1, and now invoking Hoeffding’s Inequality:

B≤

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

P (

ˆ pa(x) +

r2

xln(t)≥ˆp(y) + r2

yln(t) )

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

P

ˆ

pa(x)≥pa+∆a

2

+P

ˆ

p(y)<p− r2

yln(t)

Shivaram Kalyanakrishnan (2018) UCB Regret 9 / 9

(45)

9/9

Step 4.3: Bounding B.

Continuing from Step 4.1, and now invoking Hoeffding’s Inequality:

B≤

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

P (

ˆ pa(x) +

r2

xln(t)≥ˆp(y) + r2

yln(t) )

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

P

ˆ

pa(x)≥pa+∆a

2

+P

ˆ

p(y)<p− r2

yln(t)

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

e−2x(∆a2 )2+e−2y

q2 yln(t)2

(46)

9/9

Continuing from Step 4.1, and now invoking Hoeffding’s Inequality:

B≤

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

P (

ˆ pa(x) +

r2

xln(t)≥ˆp(y) + r2

yln(t) )

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

P

ˆ

pa(x)≥pa+∆a

2

+P

ˆ

p(y)<p− r2

yln(t)

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

e−2x(∆a2 )2+e−2y

q2 yln(t)2

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

e−4 ln(t)+e−4 ln(t)

T−1

X

t=0

t2 2

t4

X

t=0

2 t22

3.

Shivaram Kalyanakrishnan (2018) UCB Regret 9 / 9

(47)

9/9

Step 4.3: Bounding B.

Continuing from Step 4.1, and now invoking Hoeffding’s Inequality:

B≤

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

P (

ˆ pa(x) +

r2

xln(t)≥ˆp(y) + r2

yln(t) )

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

P

ˆ

pa(x)≥pa+∆a

2

+P

ˆ

p(y)<p− r2

yln(t)

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

e−2x(∆a2 )2+e−2y

q2 yln(t)2

T−1

X

t=0 t

X

x=¯uTa t

X

y=1

e−4 ln(t)+e−4 ln(t)

T−1

X

t=0

t2 2

t4

X

t=0

2 t22

3.

We are done.

References

Related documents

Memory locations accessed: local variables/arrays of functions Statically allocated in stack segment when function is called.. Quick Recap of

Choice of comparison operator crucially determines sorting order (increasing/decreasing), and also how equal elements

• Decide which half of array to recurse on based on output of comparison

• Recall how we accessed member data values of structures V3 p, *ptrP;. cin

• Uses dynamically allocated array to store elements Array can grow or shrink in size. • Dynamic memory management

Sort remaining unsorted sub-array using the same technique. Selection Sort

• “cin” (for keyboard input) and “cout” (for console output) work because of instructions in “iostream” header file. Compiler Directives..

Department of Computer Science &amp; Engineering INDIAN INSTITUTE OF TECHNOLOGY, DELHI.