1/9
Regret Bound for UCB
(Adapted from the proof by Auer et al., 2002)
Shivaram Kalyanakrishnan
shivaram@cse.iitb.ac.in
Department of Computer Science and Engineering Indian Institute of Technology Bombay
August 2018
2/9
UCB
- Pull each arm once.
- At timet∈ {n,n+1, . . .}, for every arma,ucbtadef= ˆpta+ r2 ln(t)
uta ; pullargmaxaucbta.
R 1
0
pa t ucb at
Shivaram Kalyanakrishnan (2018) UCB Regret 2 / 9
2/9
UCB Algorithm
UCB
- Pull each arm once.
- At timet∈ {n,n+1, . . .}, for every arma,ucbtadef= ˆpta+ r2 ln(t)
uta ; pullargmaxaucbta.
R 1
0
pa t ucb at
2/9
UCB
- Pull each arm once.
- At timet∈ {n,n+1, . . .}, for every arma,ucbtadef= ˆpta+ r2 ln(t)
uta ; pullargmaxaucbta.
R 1
0
pa t ucb at
Recall thatRT=Tp⋆−PT−1 t=0 E[rt].
Shivaram Kalyanakrishnan (2018) UCB Regret 2 / 9
2/9
UCB Algorithm
UCB
- Pull each arm once.
- At timet∈ {n,n+1, . . .}, for every arma,ucbtadef= ˆpta+ r2 ln(t)
uta ; pullargmaxaucbta.
R 1
0
pa t ucb at
Recall thatRT=Tp⋆−PT−1 t=0 E[rt].
We shall show that UCB achievesRT=O P
a:pa6=p⋆ 1
p⋆−palog(T) .
3/9
∆a=p⋆−pa(instance-specificconstant).
Shivaram Kalyanakrishnan (2018) UCB Regret 3 / 9
3/9
Notation
∆a
=defp⋆−pa(instance-specificconstant).
LetZat be theeventthat armais pulled at timet.
3/9
∆a=p⋆−pa(instance-specificconstant).
LetZat be theeventthat armais pulled at timet.
Letztabe arandom variablethat takes value 1 if armais pulled at timet, and 0 otherwise.
Shivaram Kalyanakrishnan (2018) UCB Regret 3 / 9
3/9
Notation
∆a
=defp⋆−pa(instance-specificconstant).
LetZat be theeventthat armais pulled at timet.
Letztabe arandom variablethat takes value 1 if armais pulled at timet, and 0 otherwise.
Observe thatE[zta] =P{Zat}(1) + (1−P{Zat})(0) =P{Zat}.
3/9
∆a=p⋆−pa(instance-specificconstant).
LetZat be theeventthat armais pulled at timet.
Letztabe arandom variablethat takes value 1 if armais pulled at timet, and 0 otherwise.
Observe thatE[zta] =P{Zat}(1) + (1−P{Zat})(0) =P{Zat}.
As in the algorithm,utais arandom variablethat denotes the number of pulls armahas received up to (and excluding) timet:
uta=
t−1
X
i=0
zia.
Shivaram Kalyanakrishnan (2018) UCB Regret 3 / 9
3/9
Notation
∆a
=defp⋆−pa(instance-specificconstant).
LetZat be theeventthat armais pulled at timet.
Letztabe arandom variablethat takes value 1 if armais pulled at timet, and 0 otherwise.
Observe thatE[zta] =P{Zat}(1) + (1−P{Zat})(0) =P{Zat}.
As in the algorithm,utais arandom variablethat denotes the number of pulls armahas received up to (and excluding) timet:
uta=
t−1
X
i=0
zia.
We define an instance-specificconstant
¯uTa=def 8
(∆a)2 ln(T)
that will serve in our proof as a “sufficient” number of pulls of armafor horizonT.
4/9
Shivaram Kalyanakrishnan (2018) UCB Regret 4 / 9
4/9
Step 1: Show that R
T= P
a:pa6=p⋆
E [u
Ta]∆
a.
RT=Tp⋆−
T−1
X
t=0
E[rt]
4/9
RT=Tp⋆−
T−1
X
t=0
E[rt]
=Tp⋆−
T−1
X
t=0
X
a∈A
P{Zat}E[rt|Zta]
Shivaram Kalyanakrishnan (2018) UCB Regret 4 / 9
4/9
Step 1: Show that R
T= P
a:pa6=p⋆
E [u
Ta]∆
a.
RT=Tp⋆−
T−1
X
t=0
E[rt]
=Tp⋆−
T−1
X
t=0
X
a∈A
P{Zat}E[rt|Zta]
=Tp⋆−
T−1
X
t=0
X
a∈A
E[zta]pa
4/9
RT=Tp⋆−
T−1
X
t=0
E[rt]
=Tp⋆−
T−1
X
t=0
X
a∈A
P{Zat}E[rt|Zta]
=Tp⋆−
T−1
X
t=0
X
a∈A
E[zta]pa
= X
a∈A
E[uTa]
!
p⋆−X
a∈A
E[uTa]pa
Shivaram Kalyanakrishnan (2018) UCB Regret 4 / 9
4/9
Step 1: Show that R
T= P
a:pa6=p⋆
E [u
Ta]∆
a.
RT=Tp⋆−
T−1
X
t=0
E[rt]
=Tp⋆−
T−1
X
t=0
X
a∈A
P{Zat}E[rt|Zta]
=Tp⋆−
T−1
X
t=0
X
a∈A
E[zta]pa
= X
a∈A
E[uTa]
!
p⋆−X
a∈A
E[uTa]pa
=X
a∈A
E[uTa](p⋆−pa)
4/9
RT=Tp⋆−
T−1
X
t=0
E[rt]
=Tp⋆−
T−1
X
t=0
X
a∈A
P{Zat}E[rt|Zta]
=Tp⋆−
T−1
X
t=0
X
a∈A
E[zta]pa
= X
a∈A
E[uTa]
!
p⋆−X
a∈A
E[uTa]pa
=X
a∈A
E[uTa](p⋆−pa)
= X
a:pa6=p⋆
E[uTa]∆a.
Shivaram Kalyanakrishnan (2018) UCB Regret 4 / 9
4/9
Step 1: Show that R
T= P
a:pa6=p⋆
E [u
Ta]∆
a.
RT=Tp⋆−
T−1
X
t=0
E[rt]
=Tp⋆−
T−1
X
t=0
X
a∈A
P{Zat}E[rt|Zta]
=Tp⋆−
T−1
X
t=0
X
a∈A
E[zta]pa
= X
a∈A
E[uTa]
!
p⋆−X
a∈A
E[uTa]pa
=X
a∈A
E[uTa](p⋆−pa)
= X
a:pa6=p⋆
E[uTa]∆a.
To show the regret bound, we shall show for each sub-optimal armathat
E[uTa] =O 1
(∆a)2log(T)
.
5/9
Shivaram Kalyanakrishnan (2018) UCB Regret 5 / 9
5/9
Step 2: Split sub-optimal pulls into two regimes.
To proveE[uTa] =O 1
∆2alog(T)
, we showE[uTa]≤¯uTa+Cfor some constantC.
5/9
To proveE[uTa] =O 1
∆2alog(T)
, we showE[uTa]≤¯uTa+Cfor some constantC.
E[uTa] =
T−1
X
t=0
E[zta]
Shivaram Kalyanakrishnan (2018) UCB Regret 5 / 9
5/9
Step 2: Split sub-optimal pulls into two regimes.
To proveE[uTa] =O 1
∆2alog(T)
, we showE[uTa]≤¯uTa+Cfor some constantC.
E[uTa] =
T−1
X
t=0
E[zta]
=
T−1
X
t=0
P{Zat}
5/9
To proveE[uTa] =O 1
∆2alog(T)
, we showE[uTa]≤¯uTa+Cfor some constantC.
E[uTa] =
T−1
X
t=0
E[zta]
=
T−1
X
t=0
P{Zat}
=
T−1
X
t=0
P{Zat and(uta<¯uTa)}+
T−1
X
t=0
P{Zat and(uta≥¯uTa)}
Shivaram Kalyanakrishnan (2018) UCB Regret 5 / 9
5/9
Step 2: Split sub-optimal pulls into two regimes.
To proveE[uTa] =O 1
∆2alog(T)
, we showE[uTa]≤¯uTa+Cfor some constantC.
E[uTa] =
T−1
X
t=0
E[zta]
=
T−1
X
t=0
P{Zat}
=
T−1
X
t=0
P{Zat and(uta<¯uTa)}+
T−1
X
t=0
P{Zat and(uta≥¯uTa)}
=A+B.
5/9
To proveE[uTa] =O 1
∆2alog(T)
, we showE[uTa]≤¯uTa+Cfor some constantC.
E[uTa] =
T−1
X
t=0
E[zta]
=
T−1
X
t=0
P{Zat}
=
T−1
X
t=0
P{Zat and(uta<¯uTa)}+
T−1
X
t=0
P{Zat and(uta≥¯uTa)}
=A+B.
We showAis upper-bounded by¯uTa andBis upper-bounded by a constant.
Shivaram Kalyanakrishnan (2018) UCB Regret 5 / 9
6/9
Step 3: Bounding A.
6/9
A=
T−1
X
t=0
P{Zat and(uta<¯uTa)}
Shivaram Kalyanakrishnan (2018) UCB Regret 6 / 9
6/9
Step 3: Bounding A.
A=
T−1
X
t=0
P{Zat and(uta<¯uTa)}
=
T−1
X
t=0
¯uTa−1
X
m=0
P{Zatand(uta=m)}
6/9
A=
T−1
X
t=0
P{Zat and(uta<¯uTa)}
=
T−1
X
t=0
¯uTa−1
X
m=0
P{Zatand(uta=m)}
=
¯ uTa−1
X
m=0 T−1
X
t=0
P{Zatand(uta=m)}
Shivaram Kalyanakrishnan (2018) UCB Regret 6 / 9
6/9
Step 3: Bounding A.
A=
T−1
X
t=0
P{Zat and(uta<¯uTa)}
=
T−1
X
t=0
¯uTa−1
X
m=0
P{Zatand(uta=m)}
=
¯ uTa−1
X
m=0 T−1
X
t=0
P{Zatand(uta=m)}
=
¯ uTa−1
X
m=0
P{(Za0and(u0a=m))or(Za1and(u1a=m))or. . . or(ZaT−1and(uT−1a =m))}
6/9
A=
T−1
X
t=0
P{Zat and(uta<¯uTa)}
=
T−1
X
t=0
¯uTa−1
X
m=0
P{Zatand(uta=m)}
=
¯ uTa−1
X
m=0 T−1
X
t=0
P{Zatand(uta=m)}
=
¯ uTa−1
X
m=0
P{(Za0and(u0a=m))or(Za1and(u1a=m))or. . . or(ZaT−1and(uT−1a =m))}
≤
¯ uTa−1
X
m=0
1
Shivaram Kalyanakrishnan (2018) UCB Regret 6 / 9
6/9
Step 3: Bounding A.
A=
T−1
X
t=0
P{Zat and(uta<¯uTa)}
=
T−1
X
t=0
¯uTa−1
X
m=0
P{Zatand(uta=m)}
=
¯ uTa−1
X
m=0 T−1
X
t=0
P{Zatand(uta=m)}
=
¯ uTa−1
X
m=0
P{(Za0and(u0a=m))or(Za1and(u1a=m))or. . . or(ZaT−1and(uT−1a =m))}
≤
¯ uTa−1
X
m=0
1
= ¯uTa.
6/9
A=
T−1
X
t=0
P{Zat and(uta<¯uTa)}
=
T−1
X
t=0
¯uTa−1
X
m=0
P{Zatand(uta=m)}
=
¯ uTa−1
X
m=0 T−1
X
t=0
P{Zatand(uta=m)}
=
¯ uTa−1
X
m=0
P{(Za0and(u0a=m))or(Za1and(u1a=m))or. . . or(ZaT−1and(uT−1a =m))}
≤
¯ uTa−1
X
m=0
1
= ¯uTa.
We have used the fact that for 0≤i<j≤t−1,(Zai and(uia=m))and (Zaj and(uja=m))are mutually exclusive.
Shivaram Kalyanakrishnan (2018) UCB Regret 6 / 9
7/9
Step 4.1: Bounding B.
7/9
B=
T−1
X
t=0
P{Zat and(uta ≥¯uTa)}
Shivaram Kalyanakrishnan (2018) UCB Regret 7 / 9
7/9
Step 4.1: Bounding B.
B=
T−1
X
t=0
P{Zat and(uta ≥¯uTa)}
≤
T−1
X
t=0
P (
ˆ pta+
r2 uta
ln(t)≥ˆpt⋆+ r2
ut⋆ln(t)
!
and(uta≥¯uTa) )
7/9
B=
T−1
X
t=0
P{Zat and(uta ≥¯uTa)}
≤
T−1
X
t=0
P (
ˆ pta+
r2 uta
ln(t)≥ˆpt⋆+ r2
ut⋆ln(t)
!
and(uta≥¯uTa) )
≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
P (
ˆ pa(x) +
r2
xln(t)≥ˆp⋆(y) + r2
yln(t) )
where
ˆ
pa(x)is the empirical mean of the firstxpulls of arma, and ˆ
p⋆(y)is the empirical mean of the firstypulls of arm⋆.
Shivaram Kalyanakrishnan (2018) UCB Regret 7 / 9
8/9
Step 4.2: Bounding B.
Fixx∈ {¯uTa,¯uTa+1, . . . ,t}andy∈ {1,2, . . . ,t}.
8/9
We have:
ˆpa(x) + r2
xln(t)≥ˆp⋆(y) + r2
yln(t)
=⇒ ˆpa(x) + r2
xln(t)≥p⋆
! or
ˆ p⋆(y) +
r2
yln(t)<p⋆
.
Shivaram Kalyanakrishnan (2018) UCB Regret 8 / 9
8/9
Step 4.2: Bounding B.
Fixx∈ {¯uTa,¯uTa+1, . . . ,t}andy∈ {1,2, . . . ,t}.
We have:
ˆpa(x) + r2
xln(t)≥ˆp⋆(y) + r2
yln(t)
=⇒ ˆpa(x) + r2
xln(t)≥p⋆
! or
ˆ p⋆(y) +
r2
yln(t)<p⋆
.
Sincex≥¯uTa, we have q2
xln(t)≤q
2
¯uTa ln(t)≤∆2a, and so ˆpa(x) +
r2
xln(t)≥p⋆ =⇒ ˆpa(x)≥pa+∆a
2 .
8/9
We have:
ˆpa(x) + r2
xln(t)≥ˆp⋆(y) + r2
yln(t)
=⇒ ˆpa(x) + r2
xln(t)≥p⋆
! or
ˆ p⋆(y) +
r2
yln(t)<p⋆
.
Sincex≥¯uTa, we have q2
xln(t)≤q
2
¯uTa ln(t)≤∆2a, and so ˆpa(x) +
r2
xln(t)≥p⋆ =⇒ ˆpa(x)≥pa+∆a
2 .
In summary:
ˆ pa(x)+
r2
xln(t)≥ˆp⋆(y)+
s 2
yln(t) =⇒
ˆ
pa(x)≥pa+∆a
2
or ˆp⋆(y)<p⋆− s
2 yln(t)
! .
Shivaram Kalyanakrishnan (2018) UCB Regret 8 / 9
9/9
Step 4.3: Bounding B.
Continuing from Step 4.1, and now invoking Hoeffding’s Inequality:
B≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
P (
ˆ pa(x) +
r2
xln(t)≥ˆp⋆(y) + r2
yln(t) )
9/9
Continuing from Step 4.1, and now invoking Hoeffding’s Inequality:
B≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
P (
ˆ pa(x) +
r2
xln(t)≥ˆp⋆(y) + r2
yln(t) )
≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
P
ˆ
pa(x)≥pa+∆a
2
+P
ˆ
p⋆(y)<p⋆− r2
yln(t)
Shivaram Kalyanakrishnan (2018) UCB Regret 9 / 9
9/9
Step 4.3: Bounding B.
Continuing from Step 4.1, and now invoking Hoeffding’s Inequality:
B≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
P (
ˆ pa(x) +
r2
xln(t)≥ˆp⋆(y) + r2
yln(t) )
≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
P
ˆ
pa(x)≥pa+∆a
2
+P
ˆ
p⋆(y)<p⋆− r2
yln(t)
≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
e−2x(∆a2 )2+e−2y
q2 yln(t)2
9/9
Continuing from Step 4.1, and now invoking Hoeffding’s Inequality:
B≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
P (
ˆ pa(x) +
r2
xln(t)≥ˆp⋆(y) + r2
yln(t) )
≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
P
ˆ
pa(x)≥pa+∆a
2
+P
ˆ
p⋆(y)<p⋆− r2
yln(t)
≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
e−2x(∆a2 )2+e−2y
q2 yln(t)2
≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
e−4 ln(t)+e−4 ln(t)
≤
T−1
X
t=0
t2 2
t4
≤
∞
X
t=0
2 t2 =π2
3.
Shivaram Kalyanakrishnan (2018) UCB Regret 9 / 9
9/9
Step 4.3: Bounding B.
Continuing from Step 4.1, and now invoking Hoeffding’s Inequality:
B≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
P (
ˆ pa(x) +
r2
xln(t)≥ˆp⋆(y) + r2
yln(t) )
≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
P
ˆ
pa(x)≥pa+∆a
2
+P
ˆ
p⋆(y)<p⋆− r2
yln(t)
≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
e−2x(∆a2 )2+e−2y
q2 yln(t)2
≤
T−1
X
t=0 t
X
x=¯uTa t
X
y=1
e−4 ln(t)+e−4 ln(t)
≤
T−1
X
t=0
t2 2
t4
≤
∞
X
t=0
2 t2 =π2
3.
We are done.