(1)

Training algorithms for Structured Learning

Sunita Sarawagi IIT Bombay

http://www.cse.iitb.ac.in/~sunita

(2)

Training

Given

N input-output pairs (x_1,y_1), (x_2,y_2), ..., (x_N,y_N)

Features f(x,y) = Σ_c f(x,y_c,c)

Error of output: E(y_i,y)

I (Use short form: E(y_i,y) = E_i(y))

I Also decomposes over smaller parts: E_i(y) = Σ_{c∈C} E_{i,c}(y_c)

Find w such that

training error is small,

the model generalizes to unseen instances, and

training is efficient for structured models.

(3)

Outline

1 Likelihood-based training

2 Max-margin training

Cutting-plane approaches

Decomposition-based approaches

(4)

Probability distribution from scores

Convert scores into a probability distribution:

Pr(y|x) = (1/Z_w(x)) exp(w·f(x,y))   where Z_w(x) = Σ_{y'} exp(w·f(x,y'))

When y is a vector of variables, say y_1, ..., y_n, and the decomposition parts c are subsets of variables, we get a graphical model:

Pr(y|x) = (1/Z_w(x)) exp( Σ_c w·f(x,y_c,c) ) = (1/Z) Π_c ψ_c(y_c)

with clique potential ψ_c(y_c) = exp(w·f(x,y_c,c)).

These are called Conditional Random Fields (CRFs).
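To make these definitions concrete, here is a brute-force sketch of a tiny chain CRF; all numbers and names are hypothetical, and enumeration stands in for proper inference:

```python
import itertools, math

# Hypothetical tiny chain CRF: n variables, each taking labels {0, 1}.
# Parts c are the edges (t, t+1); the part feature is a one-hot over
# label pairs, so w.f(x, y_c, c) reduces to a table lookup.
n = 3
labels = [0, 1]
w_edge = {(0, 0): 1.0, (0, 1): -0.5, (1, 0): -0.5, (1, 1): 1.0}

def score(y):
    # w.f(x, y) = sum over parts c of w.f(x, y_c, c)
    return sum(w_edge[(y[t], y[t + 1])] for t in range(n - 1))

# Z_w(x) = sum over all y' of exp(w.f(x, y'))
ys = list(itertools.product(labels, repeat=n))
Z = sum(math.exp(score(y)) for y in ys)

def prob(y):
    return math.exp(score(y)) / Z

# The scores now define a valid distribution over all 2^n labelings.
assert abs(sum(prob(y) for y in ys) - 1.0) < 1e-9
print(prob((0, 0, 0)))  # (0,0,0) and (1,1,1) tie for the highest probability
```

With these edge potentials, agreement between neighbors is rewarded, so the all-0 and all-1 labelings get the most mass.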

(7)

Training via gradient descent

Maximize the conditional log-likelihood:

L(w) = Σ_ℓ log Pr(y_ℓ | x_ℓ, w) = Σ_ℓ ( w·f(x_ℓ,y_ℓ) − log Z_w(x_ℓ) )

Add a regularizer to prevent over-fitting:

max_w Σ_ℓ ( w·f(x_ℓ,y_ℓ) − log Z_w(x_ℓ) ) − ||w||²/C

Concave in w ⇒ gradient-based optimization will work.

Gradient:

∇L(w) = Σ_ℓ ( f(x_ℓ,y_ℓ) − Σ_{y'} f(x_ℓ,y') exp(w·f(x_ℓ,y')) / Z_w(x_ℓ) ) − 2w/C

       = Σ_ℓ ( f(x_ℓ,y_ℓ) − E_{Pr(y'|x_ℓ,w)} f(x_ℓ,y') ) − 2w/C

(8)

Training algorithm

1: Initialize w⁰ = 0
2: for t = 1...T do
3:   for ℓ = 1...N do
4:     g_{k,ℓ} = f_k(x_ℓ,y_ℓ) − E_{Pr(y'|w)} f_k(x_ℓ,y'),  k = 1...K
5:   end for
6:   g_k = Σ_ℓ g_{k,ℓ},  k = 1...K
7:   w_kᵗ = w_k^{t−1} + γ_t (g_k − 2 w_k^{t−1}/C), with step size γ_t
8:   Exit if ||g|| ≈ 0
9: end for

Running time of the algorithm is O(I N n (m² + K)), where I is the total number of iterations.
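The loop above can be sketched on a made-up two-example problem, with the feature expectations computed by brute-force enumeration rather than sum-product inference:

```python
import itertools, math

# Hypothetical problem: N=2 examples, K=2 features, outputs are pairs of
# binary labels. Enumeration replaces inference for this toy size.
ys = list(itertools.product([0, 1], repeat=2))
def f(x, y):
    # made-up features; x is ignored for simplicity
    return [float(y[0] == y[1]), float(sum(y) > 0)]

data = [((0,), (1, 1)), ((1,), (0, 0))]
K, C, gamma = 2, 10.0, 0.5
w = [0.0] * K

for t in range(200):
    g = [0.0] * K
    for x, y_true in data:
        scores = [math.exp(sum(w[k] * f(x, y)[k] for k in range(K))) for y in ys]
        Z = sum(scores)
        for k in range(K):
            expect = sum(s * f(x, y)[k] for s, y in zip(scores, ys)) / Z
            g[k] += f(x, y_true)[k] - expect      # accumulate g_{k,l} into g_k
    w = [w[k] + gamma * (g[k] - 2 * w[k] / C) for k in range(K)]
    if sum(gk * gk for gk in g) ** 0.5 < 1e-4:    # exit if ||g|| ~ 0
        break
```

Both true outputs have agreeing labels, so the agreement feature ends with a clearly positive weight.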

(12)

Calculating E_{Pr(y'|w)} f_k(x_ℓ,y') using inference.

(13)

Likelihood-based trainer

1 Penalizes all wrong y's the same way; does not exploit E_i(y).

2 Requires the computation of sum-marginals, which is not possible in all kinds of structured learning, e.g.:

1 Collective extraction

2 Sentence alignment

3 Ranking

(14)

Outline

1 Likelihood-based training

2 Max-margin training

Cutting-plane approaches

Decomposition-based approaches

(15)

Two formulations

1 Margin scaling:

min_{w,ξ} (1/2)||w||² + (C/N) Σ_{i=1}^N ξ_i

s.t. wᵀf(x_i,y_i) − wᵀf(x_i,y) ≥ E_i(y) − ξ_i  ∀y, i

2 Slack scaling:

min_{w,ξ} (1/2)||w||² + (C/N) Σ_{i=1}^N ξ_i

s.t. wᵀf(x_i,y_i) − wᵀf(x_i,y) ≥ 1 − ξ_i/E_i(y)  ∀y, i

(17)

Max-margin loss surrogates

Let w·δf_i(y) = w·f(x_i,y_i) − w·f(x_i,y)

True error: E_i(argmax_y w·f(x_i,y))

1. Margin loss: max_y [E_i(y) − w·δf_i(y)]₊

2. Slack loss: max_y E_i(y)[1 − w·δf_i(y)]₊

[Figure: the margin and slack surrogates plotted against w·δf_i(y), compared with the ideal (true) loss, for E(y) = 4.]
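A small numeric illustration (invented numbers) of how the two surrogates differ on one example:

```python
# For one example i, enumerate candidate outputs with their errors E_i(y)
# and score separations w.delta_f_i(y); y = y_i has E = 0 and delta = 0.
candidates = [
    # (E_i(y), w.delta_f_i(y))
    (0.0, 0.0),   # the true output y_i
    (4.0, 3.0),   # high-error output, but well separated
    (1.0, 0.2),   # low-error output, barely separated
]

# Margin scaling: max_y [E_i(y) - w.delta_f_i(y)]+
margin_loss = max(max(E - d, 0.0) for E, d in candidates)
# Slack scaling:  max_y E_i(y) [1 - w.delta_f_i(y)]+
slack_loss = max(E * max(1.0 - d, 0.0) for E, d in candidates)

print(margin_loss)  # driven by the high-error output (4 - 3 = 1.0)
print(slack_loss)   # driven by the barely separated output (1 * 0.8)
```

Margin scaling keeps penalizing a high-error output until its separation exceeds its full error, while slack scaling stops caring about any output separated by more than 1.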

(18)

Max-margin training: margin-scaling

The Primal (P):

min_{w,ξ} (1/2)||w||² + (C/N) Σ_{i=1}^N ξ_i

s.t. wᵀδf_i(y) ≥ E_i(y) − ξ_i  ∀y, i: 1...N

Good news: convex in w, ξ.

Bad news: exponential number of constraints.

Two main lines of attack:

1 Cutting-plane: generate constraints on the fly.

2 Decomposition: polynomial-sized rewrite of the objective in terms of parts of y.

(20)

1 The Primal (P):

min_{w,ξ} (1/2)||w||² + (C/N) Σ_{i=1}^N ξ_i

s.t. wᵀδf_i(y) ≥ E_i(y) − ξ_i  ∀y, i: 1...N

2 The Dual (D) of (P):

max_{α_i(y)} −(1/2) Σ_{i,y} Σ_{j,y'} α_i(y) δf_i(y) · α_j(y') δf_j(y') + Σ_{i,y} E_i(y) α_i(y)

s.t. Σ_y α_i(y) = C/N,  α_i(y) ≥ 0,  i: 1...N

(22)

Properties of Dual

1 Strong duality holds: Primal (P) solution = Dual (D) solution.

2 w = Σ_{i,y} α_i(y) δf_i(y)

3 Dual (D) is concave in α, and its constraints are simpler.

4 The size of α is still intractably large ⇒ cannot solve via standard QP libraries.

(23)

Cutting plane method

1 The exponentiated gradient approach requires the computation of sum-marginals and decomposable losses.

2 Cutting plane: a more general approach that requires only MAP inference.

(24)

Cutting-plane algorithm [TJHA05]

1: Initialize w⁰ = 0, active constraint set = ∅.
2: for t = 1...T do
3:   for ℓ = 1...N do
4:     ŷ = argmax_y (E_ℓ(y) + wᵗ·f(x_ℓ,y))
5:     if wᵗ·δf_ℓ(ŷ) < E_ℓ(ŷ) − ξ_ℓᵗ − ε then
6:       Add (x_ℓ, ŷ) to the set of active constraints.
7:       wᵗ, ξᵗ = solve QP with active constraints.
8:     end if
9:   end for
10: Exit if no new constraint added.
11: end for
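The loss-augmented argmax in line 4 (the separation oracle) can be sketched by brute force; the labels, features, and weights below are all made up:

```python
import itertools

# Toy output space: triples of binary labels, enumerated exhaustively
# (real implementations use Viterbi-style loss-augmented inference).
ys = list(itertools.product([0, 1], repeat=3))
y_true = (1, 1, 0)

def f(y):
    return [float(y[t] == y[t + 1]) for t in range(2)]    # K=2 edge features

def E(y):
    return float(sum(a != b for a, b in zip(y, y_true)))  # Hamming error

w = [0.4, -0.1]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# y_hat = argmax_y (E(y) + w.f(x, y)): the most violated constraint
y_hat = max(ys, key=lambda y: E(y) + dot(w, f(y)))

# The constraint for y_hat is violated (up to slack xi and tolerance eps) when
# w.delta_f(y_hat) < E(y_hat) - xi - eps
xi, eps = 0.0, 1e-3
delta = dot(w, f(y_true)) - dot(w, f(y_hat))
violated = delta < E(y_hat) - xi - eps
print(y_hat, violated)
```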

(26)

Efficient solution in the dual space

Solve the QP in the dual space.

1 Initially α_iᵗ(y) = 0 ∀y ≠ y_i, and α_iᵗ(y_i) = C/N

2 For t = 1,...,T:

1 Choose an i from 1,...,N.

2 ŷ = argmax_y (E_i(y) + wᵗ·f(x_i,y)), where wᵗ = Σ_{i,y} α_i(y) δf_i(y)

3 α_i(ŷ) is the coordinate with the highest gradient.

4 Optimize J(α) over the set of y's in the active set (SMO is applicable here).

(27)

Convergence results

Let R² = max δf_i(y)·δf_j(y'), Δ = max_{i,y} E_i(y), and let ε be the tolerance of the cutting-plane algorithm.

Theorem

J(α^{t+1}) − J(αᵗ) ≥ min( Cε/(2N), ε²/(8R²) )

Theorem

The number of constraints that the cutting-plane algorithm adds is at most max( 2NΔ/ε, 8CΔR²/ε² )

(28)

Single slack formulation [JFY09]

Theorem

The number of constraints that the cutting-plane algorithm adds in the single-slack formulation is at most max( log₂(Δ/(4R²C²)), 16CR²/ε² )

(29)

Decomposition-based approaches

1 δf_i(y) = Σ_c δf_{i,c}(y_c)

2 E_i(y) = Σ_c E_{i,c}(y_c)

Rewrite the dual as

max_{μ_{i,c}(y_c)} −(1/2) Σ_{i,c,y_c} Σ_{j,d,y'_d} δf_{i,c}(y_c) μ_{i,c}(y_c) · δf_{j,d}(y'_d) μ_{j,d}(y'_d)

                 + Σ_{i,c,y_c} E_{i,c}(y_c) μ_{i,c}(y_c)

s.t. μ_{i,c}(y_c) = Σ_{y∼y_c} α_i(y),  Σ_y α_i(y) = C/N,  α_i(y) ≥ 0,  i: 1...N

(31)

αs as probabilities

Rescale the α's by C/N so that each α_i becomes a probability distribution.

max_{μ_{i,c}(y_c)} −(C/(2N)) Σ_{i,c,y_c} Σ_{j,d,y'_d} δf_{i,c}(y_c) μ_{i,c}(y_c) · δf_{j,d}(y'_d) μ_{j,d}(y'_d)

                 + Σ_{i,c,y_c} E_{i,c}(y_c) μ_{i,c}(y_c)

s.t. μ_{i,c}(y_c) ∈ marginals of some valid distribution

Solve via the exponentiated gradient method.

(33)

Exponentiated gradient algorithm

1 Initially μ_{i,c}(y_{i,c}) = 1, and μ_{i,c}(y_c) = 0 for y_c ≠ y_{i,c}

2 For t = 1,...,T:

1 Choose an i from 1,...,N.

2 Ignore the constraints and perform a gradient-based update:

s_{i,c}(y_c) = μᵗ_{i,c}(y_c) + η ( E_{i,c}(y_c) − wᵗ·δf_{i,c}(y_c) )

where wᵗ = Σ_{i,c,y_c} μᵗ_{i,c}(y_c) δf_{i,c}(y_c)

3 Define a distribution α by exponentiating the updates:

α_i(y)^{t+1} = (1/Z) exp( Σ_c s_{i,c}(y_c) )

where Z = Σ_y exp( Σ_c s_{i,c}(y_c) )

4 The new feasible values are the marginals of α:

μ^{t+1}_{i,c}(y_c) = Σ_{y∼y_c} α_i(y)^{t+1}
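Steps 3 and 4 above can be sketched by brute-force enumeration for a single example, with hypothetical part scores s_{i,c}:

```python
import itertools, math

# One example with n=3 binary variables; parts c are the chain edges.
# The part scores s are made up for illustration.
n = 3
ys = list(itertools.product([0, 1], repeat=n))
parts = [(0, 1), (1, 2)]
s = {c: {(a, b): 0.3 * a - 0.2 * b + (0.5 if a == b else 0.0)
         for a in (0, 1) for b in (0, 1)} for c in parts}

def total(y):
    # sum over parts c of s_{i,c}(y_c)
    return sum(s[c][(y[c[0]], y[c[1]])] for c in parts)

# Step 3: alpha_i(y) = (1/Z) exp(sum_c s_{i,c}(y_c))
Z = sum(math.exp(total(y)) for y in ys)
alpha = {y: math.exp(total(y)) / Z for y in ys}

# Step 4: mu_{i,c}(y_c) = sum over y consistent with y_c of alpha(y)
mu = {(c, yc): sum(a for y, a in alpha.items() if (y[c[0]], y[c[1]]) == yc)
      for c in parts for yc in itertools.product((0, 1), repeat=2)}

# alpha is a distribution, so each part's marginals sum to 1
for c in parts:
    assert abs(sum(mu[(c, yc)]
                   for yc in itertools.product((0, 1), repeat=2)) - 1.0) < 1e-9
```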

(37)

Convergence results

Theorem

J(α^{t+1}) − J(αᵗ) ≥ (1/η) KL(αᵗ, α^{t+1}), where η ≤ 1/(nR²) and R = max δf_i(y)·δf_j(y')

Theorem

Let α* be the dual optimal. Then at the Tth iteration,

J(α*) − (1/(Tη)) KL(α*, α⁰) ≤ J(α^{T+1}) ≤ J(α*)

Theorem

The number of iterations needed to come within ε of the dual optimum is at most (nR²/ε) KL(α*, α⁰)


Training SPENs

Training objective:

L(W,y_T) = [ E(W,y_i) − E(W,y_T) + Δ(y_T,y_i) ]₊

where y_{t+1} = y_t − η_t ∇_y E(W,y_t) and y_0 = F(W,x).

For computing the gradient of L(W,y_T) with respect to W, we need to backpropagate through all stages where W or y_t appears. That gives

dL/dW = ∇_W L(W,y_T) + Σ_{t=0}^{T} (dL/dy_t) ∇_W y_t

We compute the above expression iteratively as follows:

dL/dy_t = (dy_{t+1}/dy_t) (dL/dy_{t+1})

dy_{t+1}/dy_t = d( y_t − η_t ∇_y E(W,y_t) )/dy_t = I − η_t H_{y,y}(E(W,y_t))

∇_W y_t = ∇_W ( y_{t−1} − η_{t−1} ∇_y E(W,y_{t−1}) ) = −η_{t−1} H_{W,y}(E(W,y_{t−1}))
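The unrolled inference y_{t+1} = y_t − η_t ∇_y E(W,y_t) can be sketched on a hypothetical quadratic energy whose minimizer is known in closed form:

```python
# Hypothetical energy E(W, y) = 0.5 * ||y - W x||^2; its minimizer is W x,
# so the unrolled gradient steps should converge there. Names and numbers
# are made up for illustration.
Wx = [1.0, -2.0]           # W x, treated as a fixed vector here
y = [0.0, 0.0]             # y_0 = F(W, x), here initialized to zero
eta = 0.5

def grad_E(y):
    # gradient of 0.5 * ||y - Wx||^2 with respect to y
    return [yi - wi for yi, wi in zip(y, Wx)]

for t in range(20):
    g = grad_E(y)
    y = [yi - eta * gi for yi, gi in zip(y, g)]

# After enough steps, y_T approaches the energy minimizer W x.
assert all(abs(yi - wi) < 1e-4 for yi, wi in zip(y, Wx))
```

In SPEN training the loss is then evaluated at this y_T, and its gradient flows back through every one of these update steps.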

(41)

Backpropagated computation of dL/dW

1: dL/dW = ∇_W L(W,y_T)
2: dL/dy_T = −∇_y E(W,y_T) + ∇_y Δ(y_T,y_i) if L(W,y_T) > 0, and 0 otherwise
3: for t = T−1,...,0 do
4:   dL/dy_t = dL/dy_{t+1} − η_t H_{y,y}(E(W,y_t)) dL/dy_{t+1}
5:   if t > 0 then
6:     dL/dW −= η_t H_{W,y}(E(W,y_{t−1})) dL/dy_t
7:   else
8:     dL/dW += ∇_W F(W,x) dL/dy_0  (since y_0 = F(W,x))
9:   end if
10: end for

(42)

Efficient computation with Hessians

Instead of automatic differentiation, which gives rise to explicit Hessians, the required Hessian-vector products are computed by the method of finite differences, which is more efficient:

(df(θ)/dθ)·v = lim_{r→0} (1/r) [ f(θ + rv) − f(θ) ]

H_{W,y}(E(W,y))·v = lim_{r→0} (1/r) [ ∇_W E(W,y + rv) − ∇_W E(W,y) ]  where v = dL/dy

H_{y,y}(E(W,y))·v = lim_{r→0} (1/r) [ ∇_y E(W,y + rv) − ∇_y E(W,y) ]  where v = dL/dy
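The finite-difference Hessian-vector product can be checked against the exact product on a hypothetical quadratic energy, where H_{y,y} is simply the matrix A:

```python
# Hypothetical quadratic energy E(y) = 0.5 * y^T A y, so grad_y E = A y
# and H_{y,y} = A. All numbers are made up for illustration.
A = [[2.0, 0.5], [0.5, 1.0]]
y = [1.0, -1.0]
v = [0.3, 0.7]              # stands in for v = dL/dy
r = 1e-6

def grad_E(y):
    return [sum(A[i][j] * y[j] for j in range(2)) for i in range(2)]

# Finite differences: (1/r) [grad E(y + r v) - grad E(y)]  ~=  H v
y_shift = [yi + r * vi for yi, vi in zip(y, v)]
hv_fd = [(g1 - g0) / r for g1, g0 in zip(grad_E(y_shift), grad_E(y))]

# Exact Hessian-vector product: A v
hv_exact = [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]
assert all(abs(a - b) < 1e-4 for a, b in zip(hv_fd, hv_exact))
```

Each Hessian-vector product then costs only two gradient evaluations, with no Hessian ever materialized.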

(43)

Summary

1 Two very efficient algorithms for training structured models that avoid the problem of the exponential output space.

2 Other alternatives:

1 Online training, e.g. MIRA and the Collins trainer

2 Stochastic trainers: LaRank

3 Local training: SEARN

3 Extensions to slack scaling and other loss functions.

(44)

Peter L. Bartlett, Michael Collins, Ben Taskar, and David McAllester.

Exponentiated gradient algorithms for large-margin structured classification.

In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 113–120. MIT Press, Cambridge, MA, 2005.

Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett.

Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks.

J. Mach. Learn. Res., 9:1775–1822, 2008.

[JFY09] Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu.

Cutting-plane training of structural SVMs.

Machine Learning, 77(1):27–59, 2009.

(45)

[TJHA05] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun.

Large margin methods for structured and interdependent output variables.

Journal of Machine Learning Research (JMLR), 6(Sep):1453–1484, 2005.
