Training algorithms for Structured Learning
Sunita Sarawagi IIT Bombay
http://www.cse.iitb.ac.in/~sunita
Training
Given
  N input-output pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N)
  Features f(x, y) = Σ_c f(x, y_c, c)
  Error of an output: E(y_i, y)
    Use the short form E(y_i, y) = E_i(y)
    Also decomposes over smaller parts: E_i(y) = Σ_{c∈C} E_{i,c}(y_c)
Find w such that
  training error is small,
  it generalizes to unseen instances, and
  training is efficient for structured models.
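For concreteness, here is a tiny sketch (my own illustration, not from the slides) of an error that decomposes over parts: the Hamming error over a label sequence, where each part c is a single position.

```python
# Minimal sketch (assumption: parts c are individual positions of a label sequence,
# so E_i(y) = sum_c E_{i,c}(y_c) with E_{i,c}(y_c) = 1 whenever y_c differs from the gold label).
def hamming_error(y_gold, y):
    return sum(1 for g, p in zip(y_gold, y) if g != p)

print(hamming_error(["B", "I", "O"], ["B", "O", "O"]))  # -> 1
```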
Outline
1 Likelihood-based training
2 Max-margin training
Cutting-plane approaches
Decomposition-based approaches
Probability distribution from scores
Convert scores into a probability distribution:
  Pr(y|x) = (1/Z_w(x)) exp(w·f(x, y)),   where Z_w(x) = Σ_{y'} exp(w·f(x, y'))
When y is a vector of variables y_1, ..., y_n and the decomposition parts c are subsets of variables, we get a graphical model:
  Pr(y|x) = (1/Z_w(x)) exp( Σ_c w·f(x, y_c, c) ) = (1/Z) Π_c ψ_c(y_c),   with clique potential ψ_c(y_c) = exp(w·f(x, y_c, c))
These are called Conditional Random Fields (CRFs).
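As a sanity check of the definition above, a brute-force sketch (feasible only for tiny label spaces) that builds Pr(y|x) for a small chain; the label set, chain length, weights, and feature function below are hypothetical toy choices.

```python
import itertools
import numpy as np

# Brute-force CRF for a tiny chain: each part c is an edge (y_{c-1}, y_c).
LABELS = [0, 1]
n = 3  # number of output variables

def feat(x, y_prev, y_cur, c):
    # Toy emission + transition features, purely illustrative.
    return np.array([x[c] * (y_cur == 1), float(y_prev == y_cur)])

def score(w, x, y):
    # w . f(x, y) = sum over parts c of w . f(x, y_c, c)
    return sum(w @ feat(x, y[c - 1], y[c], c) for c in range(1, n))

def crf_prob(w, x, y):
    Z = sum(np.exp(score(w, x, yp)) for yp in itertools.product(LABELS, repeat=n))
    return np.exp(score(w, x, y)) / Z

w = np.array([1.0, 0.5])
x = [0.2, -0.3, 0.9]
print(crf_prob(w, x, (0, 1, 1)))
```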
Training via gradient descent
  L(w) = Σ_ℓ log Pr(y_ℓ | x_ℓ, w) = Σ_ℓ ( w·f(x_ℓ, y_ℓ) − log Z_w(x_ℓ) )
Add a regularizer to prevent over-fitting:
  max_w  Σ_ℓ ( w·f(x_ℓ, y_ℓ) − log Z_w(x_ℓ) ) − ||w||²/C
Concave in w ⇒ gradient descent methods will work.
Gradient:
  ∇L(w) = Σ_ℓ ( f(x_ℓ, y_ℓ) − Σ_{y'} f(x_ℓ, y') exp(w·f(x_ℓ, y')) / Z_w(x_ℓ) ) − 2w/C
        = Σ_ℓ ( f(x_ℓ, y_ℓ) − E_{Pr(y'|x_ℓ,w)} f(x_ℓ, y') ) − 2w/C
Training algorithm
1: Initialize w^0 = 0
2: for t = 1 ... T do
3:   for ℓ = 1 ... N do
4:     g_{k,ℓ} = f_k(x_ℓ, y_ℓ) − E_{Pr(y'|x_ℓ,w)} f_k(x_ℓ, y'),   k = 1 ... K
5:   end for
6:   g_k = Σ_ℓ g_{k,ℓ},   k = 1 ... K
7:   w_k^t = w_k^{t−1} + γ_t (g_k − 2 w_k^{t−1}/C)
8:   Exit if ||g|| ≈ 0
9: end for
Running time of the algorithm is O(I N n (m² + K)), where I is the total number of iterations, n the number of output variables per example, and m the number of labels per variable.
Calculate E_{Pr(y'|x_ℓ,w)} f_k(x_ℓ, y') using (sum-marginal) inference.
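A minimal sketch of this training loop, assuming a toy problem where the expected features can be computed by enumerating all outputs rather than by sum-marginal inference; data, feat, labels, and the step size are placeholder arguments.

```python
import itertools
import numpy as np

def train_crf(data, feat, K, labels, n, C=10.0, T=100, gamma=0.1):
    """Gradient ascent on the regularized conditional log-likelihood.
    data: list of (x, y) pairs with y an n-tuple of labels;
    feat(x, y): length-K feature vector. The expectation E_{Pr(y'|w)} f(x, y')
    is computed by brute-force enumeration here; in practice it comes from
    sum-product inference on the graphical model."""
    w = np.zeros(K)
    for t in range(T):
        g = np.zeros(K)
        for x, y in data:
            scores = {yp: w @ feat(x, yp) for yp in itertools.product(labels, repeat=n)}
            Z = sum(np.exp(s) for s in scores.values())
            expected = sum((np.exp(s) / Z) * feat(x, yp) for yp, s in scores.items())
            g += feat(x, y) - expected
        w += gamma * (g - 2 * w / C)   # gradient step including the L2 regularizer term
        if np.linalg.norm(g) < 1e-6:   # exit when the gradient (nearly) vanishes
            break
    return w
```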
Likelihood-based trainer
1 Penalizes all wrong y's the same way; does not exploit E_i(y).
2 Requires the computation of sum-marginals, which is not possible in all kinds of structured learning, e.g.:
  1 Collective extraction
  2 Sentence alignment
  3 Ranking
Outline
1 Likelihood-based training
2 Max-margin training
Cutting-plane approaches
Decomposition-based approaches
Two formulations
1 Margin scaling:
  min_{w,ξ}  (1/2)||w||² + (C/N) Σ_{i=1}^N ξ_i
  s.t.  w^T f(x_i, y_i) − w^T f(x_i, y) ≥ E_i(y) − ξ_i   ∀y, i
2 Slack scaling:
  min_{w,ξ}  (1/2)||w||² + (C/N) Σ_{i=1}^N ξ_i
  s.t.  w^T f(x_i, y_i) − w^T f(x_i, y) ≥ 1 − ξ_i / E_i(y)   ∀y, i
Max-margin loss surrogates
Let w·δf(x_i, y) = w·f(x_i, y_i) − w·f(x_i, y).
True error:  E_i( argmax_y w·f(x_i, y) )
1. Margin loss:  max_y [ E_i(y) − w·δf(x_i, y) ]_+
2. Slack loss:   max_y E_i(y) [ 1 − w·δf(x_i, y) ]_+
[Figure: margin and slack surrogates plotted against w·δf(x_i, y), compared with the ideal (true) error, for an output with E(y) = 4.]
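To make the two surrogates concrete, here is a small sketch (my own, assuming an enumerable set of candidate outputs) that evaluates both losses for one training example; feat, err, and candidates are placeholder arguments.

```python
import numpy as np

def surrogate_losses(w, feat, x, y_true, candidates, err):
    """Margin- and slack-scaled hinge losses for one example.
    delta(y) = w.f(x, y_true) - w.f(x, y);  err(y) = E_i(y)."""
    margin_loss, slack_loss = 0.0, 0.0   # the [.]_+ clamp: losses never go below 0
    for y in candidates:
        delta = w @ feat(x, y_true) - w @ feat(x, y)
        margin_loss = max(margin_loss, err(y) - delta)         # max_y [E_i(y) - w.delta_f]_+
        slack_loss = max(slack_loss, err(y) * (1.0 - delta))   # max_y E_i(y)[1 - w.delta_f]_+
    return margin_loss, slack_loss
```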
Max-margin training: margin scaling
The Primal (P):
  min_{w,ξ}  (1/2)||w||² + (C/N) Σ_{i=1}^N ξ_i
  s.t.  w^T δf_i(y) ≥ E_i(y) − ξ_i   ∀y, i = 1 ... N
Good news: convex in w, ξ.
Bad news: exponential number of constraints.
Two main lines of attack:
1 Cutting plane: generate constraints on the fly.
2 Decomposition: polynomial-sized rewrite of the objective in terms of parts of y.
1 The Primal (P):
  min_{w,ξ}  (1/2)||w||² + (C/N) Σ_{i=1}^N ξ_i
  s.t.  w^T δf_i(y) ≥ E_i(y) − ξ_i   ∀y, i = 1 ... N
2 The Dual (D) of (P):
  max_{α_i(y)}  −(1/2) Σ_{i,y} α_i(y) δf_i(y) · Σ_{j,y'} α_j(y') δf_j(y') + Σ_{i,y} E_i(y) α_i(y)
  s.t.  Σ_y α_i(y) = C/N,   α_i(y) ≥ 0,   i = 1 ... N
Properties of the Dual
1 Strong duality holds: Primal (P) solution = Dual (D) solution.
2 w = Σ_{i,y} α_i(y) δf_i(y)
3 Dual (D) is concave in α, and its constraints are simpler.
4 Size of α is still intractably large ⇒ cannot solve via standard libraries.
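The relation w = Σ_{i,y} α_i(y) δf_i(y) maps directly to code; a tiny sketch with hypothetical names, where alpha stores only the non-zero (active) dual variables.

```python
import numpy as np

def primal_from_dual(alpha, delta_f, K):
    """w = sum_{i,y} alpha_i(y) * delta_f_i(y).
    alpha: dict mapping (i, y) -> dual weight (non-zero entries only);
    delta_f(i, y): length-K vector f(x_i, y_i) - f(x_i, y)."""
    w = np.zeros(K)
    for (i, y), a in alpha.items():
        w += a * delta_f(i, y)
    return w
```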
Cutting plane method
1 The exponentiated gradient approach requires the computation of sum-marginals and decomposable losses.
2 Cutting plane is a more general approach that only requires MAP inference.
Cutting-plane algorithm [TJHA05]
1: Initialize w^0 = 0, active constraint set = ∅
2: for t = 1 ... T do
3:   for ℓ = 1 ... N do
4:     ŷ = argmax_y ( E_ℓ(y) + w^t · f(x_ℓ, y) )
5:     if w^t · δf_ℓ(ŷ) < E_ℓ(ŷ) − ξ_ℓ^t − ε then
6:       Add (x_ℓ, ŷ) to the set of active constraints.
7:       w^t, ξ^t = solve QP with active constraints.
8:     end if
9:   end for
10:  Exit if no new constraint was added.
11: end for
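A sketch of the loop above with the loss-augmented argmax and the restricted QP treated as black boxes; loss_aug_map and solve_qp are hypothetical helpers supplied by the caller, and outputs are assumed to be hashable (e.g., tuples of labels).

```python
import numpy as np

def cutting_plane_train(data, feat, err, loss_aug_map, solve_qp, K, T=50, eps=1e-3):
    """Working-set training sketch.
    loss_aug_map(w, x, y_true) -> argmax_y [E(y) + w.f(x, y)]   (loss-augmented MAP)
    solve_qp(active, data, feat, err, K) -> (w, xi): QP restricted to the active
    constraints, returning the weights and the per-example slacks."""
    w, xi = np.zeros(K), np.zeros(len(data))
    active = set()  # pairs (example index, violating output)
    for t in range(T):
        added = False
        for i, (x, y) in enumerate(data):
            y_hat = loss_aug_map(w, x, y)
            delta = w @ feat(x, y) - w @ feat(x, y_hat)
            if delta < err(i, y_hat) - xi[i] - eps:     # constraint violated by more than eps
                active.add((i, y_hat))
                w, xi = solve_qp(active, data, feat, err, K)
                added = True
        if not added:   # a full pass added no constraint: done
            break
    return w
```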
Efficient solution in the dual space
Solve the QP in the dual space:
1 Initially α_i^t(y) = 0 for all y ≠ y_i, and α_i^t(y_i) = C/N.
2 For t = 1, ..., T:
  1 Choose an i from 1, ..., N.
  2 ŷ = argmax_y ( E_i(y) + w^t · f(x_i, y) ),   where w^t = Σ_{i,y} α_i(y) δf_i(y)
  3 α_i(ŷ) is the coordinate with the highest gradient.
  4 Optimize J(α) over the set of y's in the active set (SMO is applicable here).
Convergence results
Let R² = max δf_i(y)·δf_j(y') and ∆ = max_{i,y} E_i(y).
Theorem
J(α^{t+1}) − J(α^t) ≥ min( Cε/(2N), ε²/(8R²∆²) )
Theorem
The number of constraints that the cutting-plane algorithm adds is at most max( 2N∆/ε, 8C∆R²/ε² )
Single-slack formulation [JFY09]
Theorem
The number of constraints that the cutting-plane algorithm adds in the single-slack formulation is at most max( log₂( ∆/(4R²C²) ), 16CR²/ε )
Decomposition-based approaches
1 δf_i(y) = Σ_c δf_{i,c}(y_c)
2 E_i(y) = Σ_c E_{i,c}(y_c)
Rewrite the dual as
  max_{μ_{i,c}(y_c)}  −(1/2) Σ_{i,c,y_c} δf_{i,c}(y_c) μ_{i,c}(y_c) · Σ_{j,d,y'_d} δf_{j,d}(y'_d) μ_{j,d}(y'_d) + Σ_{i,c,y_c} E_{i,c}(y_c) μ_{i,c}(y_c)
  s.t.  μ_{i,c}(y_c) = Σ_{y ∼ y_c} α_i(y),   Σ_y α_i(y) = C/N,   α_i(y) ≥ 0,   i = 1 ... N
αs as probabilities
Rescale the α's by their total mass C/N, so that each α_i becomes a probability distribution and the μ_{i,c} are its marginals:
  max_{μ_{i,c}(y_c)}  −(C/2N) Σ_{i,c,y_c} δf_{i,c}(y_c) μ_{i,c}(y_c) · Σ_{j,d,y'_d} δf_{j,d}(y'_d) μ_{j,d}(y'_d) + Σ_{i,c,y_c} E_{i,c}(y_c) μ_{i,c}(y_c)
  s.t.  μ_{i,c}(y_c) ∈ marginals of some valid distribution over y
Solve via the exponentiated gradient method.
Exponentiated gradient algorithm
1 Initially μ_{i,c}(y_{i,c}) = 1 and μ_{i,c}(y_c) = 0 for y_c ≠ y_{i,c}.
2 For t = 1, ..., T:
  1 Choose an i from 1, ..., N.
  2 Ignore the constraints and perform a gradient-based update:
      s_{i,c}(y_c) = μ^t_{i,c}(y_c) + η ( E_{i,c}(y_c) − w^t · δf_{i,c}(y_c) ),   where w^t = Σ_{i,c,y_c} μ^t_{i,c}(y_c) δf_{i,c}(y_c)
  3 Define a distribution α by exponentiating the updates:
      α_i^{t+1}(y) = (1/Z) exp( Σ_c s_{i,c}(y_c) ),   where Z = Σ_y exp( Σ_c s_{i,c}(y_c) )
  4 The new feasible values are the marginals of α:
      μ^{t+1}_{i,c}(y_c) = Σ_{y ∼ y_c} α_i^{t+1}(y)
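A brute-force sketch of one update for a single example, following the steps as stated above; enumeration over y stands in for the sum-product inference used in practice, and the chain structure, label set, and feature/error helpers are hypothetical.

```python
import itertools
import numpy as np

def eg_update(mu, w, delta_f_part, err_part, labels, n, eta=0.1):
    """One exponentiated-gradient step for one training example on a chain.
    Parts c are the chain edges, so y_c = (y[c], y[c+1]).
    mu[(c, y_c)]: current marginals; delta_f_part(c, y_c): part feature difference;
    err_part(c, y_c): decomposed error E_{i,c}(y_c)."""
    parts = range(n - 1)
    pair_labels = list(itertools.product(labels, repeat=2))
    # 1. Gradient-style update of the part scores, ignoring the marginalization constraints.
    s = {(c, yc): mu[(c, yc)] + eta * (err_part(c, yc) - w @ delta_f_part(c, yc))
         for c in parts for yc in pair_labels}
    # 2. Exponentiate to get a distribution alpha over full outputs y.
    ys = list(itertools.product(labels, repeat=n))
    weights = np.array([np.exp(sum(s[(c, (y[c], y[c + 1]))] for c in parts)) for y in ys])
    alpha = weights / weights.sum()
    # 3. The new feasible marginals are the marginals of alpha.
    return {(c, yc): sum(a for y, a in zip(ys, alpha) if (y[c], y[c + 1]) == yc)
            for c in parts for yc in pair_labels}
```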
Convergence results
Theorem
J(α^{t+1}) − J(α^t) ≥ (1/η) KL(α^t, α^{t+1}),   for step size η ≤ 1/(nR²), where R = max δf_i(y)·δf_j(y')
Theorem
Let α* be the dual optimum. Then at the T-th iteration,
  J(α*) − (1/(Tη)) KL(α*; α⁰) ≤ J(α^{T+1}) ≤ J(α*)
Theorem
The number of iterations needed to reach within ε of the optimum is at most (N R²/ε) KL(α*; α⁰)
Training SPENs
Training objective:
  L(W, y_T) = [ E(W, y_i) − E(W, y_T) + ∆(y_T, y_i) ]_+
where y_{t+1} = y_t − η_t ∇_y E(W, y_t) and y_0 = F(W, x).
To compute the gradient of L(W, y_T) with respect to W, we need to backpropagate through all stages where W or y_t appears. That gives
  dL/dW = ∇_W L(W, y_T) + Σ_{t=0}^{T} (∇_W y_t) (dL/dy_t)
We compute the above expression iteratively as follows:
  dL/dy_t = (dy_{t+1}/dy_t) (dL/dy_{t+1}),   with  dy_{t+1}/dy_t = d( y_t − η_t ∇_y E(W, y_t) ) / dy_t = I − η_t H_{y,y}(E(W, y_t))
  ∇_W y_{t+1} = ∇_W ( y_t − η_t ∇_y E(W, y_t) ) = −η_t H_{W,y}(E(W, y_t)),   i.e.  ∇_W y_t = −η_{t−1} H_{W,y}(E(W, y_{t−1}))
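For reference, a sketch of the unrolled inference that this gradient differentiates through; energy_grad_y, init_F, and the step sizes are placeholder callables and values.

```python
import numpy as np

def unrolled_inference(W, x, energy_grad_y, init_F, etas):
    """y_0 = F(W, x);  y_{t+1} = y_t - eta_t * grad_y E(W, y_t).
    Returns the whole trajectory y_0, ..., y_T, which the backward pass below revisits.
    energy_grad_y(W, y) and init_F(W, x) are hypothetical helpers."""
    ys = [init_F(W, x)]
    for eta in etas:
        ys.append(ys[-1] - eta * energy_grad_y(W, ys[-1]))
    return ys  # ys[-1] is y_T, which enters the hinge loss L(W, y_T)
```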
Backpropagated computation of dL/dW
1: dL/dW = ∇_W L(W, y_T)
2: dL/dy_T = −∇_y E(W, y_T) + ∇_y ∆(y_T, y_i)  if L(W, y_T) > 0,  and 0 otherwise
3: for t = T−1, ..., 0 do
4:   dL/dy_t = dL/dy_{t+1} − η_t H_{y,y}(E(W, y_t)) dL/dy_{t+1}
5:   if t > 0 then
6:     dL/dW −= η_{t−1} H_{W,y}(E(W, y_{t−1})) dL/dy_t
7:   else
8:     dL/dW += ∇_W F(W, x) dL/dy_0
9:   end if
10: end for
Efficient computation with Hessians
Instead of automatic differentiation, which gives rise to explicit Hessians, the method of finite differences is more efficient:
  (df(θ)/dθ) · v = lim_{r→0} (1/r) [ f(θ + r v) − f(θ) ]
  H_{W,y}(E(W, y)) · v = lim_{r→0} (1/r) [ ∇_W E(W, y + r v) − ∇_W E(W, y) ],   where v = dL/dy
  H_{y,y}(E(W, y)) · v = lim_{r→0} (1/r) [ ∇_y E(W, y + r v) − ∇_y E(W, y) ],   where v = dL/dy
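The finite-difference trick maps directly to code; a sketch using a small fixed r in place of the limit, assuming grad_W_E(W, y) and grad_y_E(W, y) return gradients as NumPy arrays.

```python
import numpy as np

def hvp_yy(grad_y_E, W, y, v, r=1e-6):
    """Approximate H_{y,y}(E(W, y)) . v by a one-sided finite difference."""
    return (grad_y_E(W, y + r * v) - grad_y_E(W, y)) / r

def hvp_Wy(grad_W_E, W, y, v, r=1e-6):
    """Approximate H_{W,y}(E(W, y)) . v by perturbing y and differencing grad_W E."""
    return (grad_W_E(W, y + r * v) - grad_W_E(W, y)) / r
```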
Summary
1 Two very efficient families of algorithms for training structured models that avoid the problem of an exponential output space.
2 Other alternatives:
  1 Online training, e.g. MIRA and Collins' perceptron trainer
  2 Stochastic trainers: LaRank
  3 Local training: SEARN
3 Extensions to slack scaling and other loss functions.
Peter L. Bartlett, Michael Collins, Ben Taskar, and David McAllester.
Exponentiated gradient algorithms for large-margin structured classification.
In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 113–120. MIT Press, Cambridge, MA, 2005.
Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett.
Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks.
Journal of Machine Learning Research, 9:1775–1822, 2008.
Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu.
Cutting-plane training of structural SVMs.
Machine Learning, 77(1):27–59, 2009.
I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun.
Large margin methods for structured and interdependent output variables.
Journal of Machine Learning Research (JMLR), 6(Sep):1453–1484, 2005.