
Convex and non-convex worlds in machine learning

Anna Choromanska

Courant Institute of Mathematical Sciences, New York University


Convex and non-convex worlds

Machine learning and optimization: many machine learning problems are formulated as the minimization of some loss function on a training set of examples. The loss function expresses the discrepancy between the predictions of the model being trained and the actual problem instances. Optimization algorithms can then minimize this loss. (Wikipedia)

Convex world
- local min = global min; strictly convex: unique min
- efficient solvers
- strong theoretical guarantees

Non-convex world
- multiple local min ≠ global min
- many solvers come from the convex world
- weak theoretical guarantees, if any



Layout of the talk

Optimization solvers: generic optimization vs bound majorization; partition function-based objectives
- Design (convex) solver: quadratic bound majorization [JC12, CKJ12, ACJK14]

Challenging problems: multi-class classification
- Design objective: statistical and computational constraints; online multi-class partition trees for logarithmic-time predictions [CL14, CAL13, CCB15, CCJM15, CCJM13, CJKMM13, BCCL15, CM12]

Non-convex problems: deep learning, highly non-convex objective
- Build understanding: new theoretical results [CHMCL15, CLB15, ZCL15]



How to build a good, efficient convex solver?


Generic optimization techniques and majorization methods

Batch:
- steepest descent, conjugate gradient, Newton
- (L)BFGS [B70]

Stochastic:
- SGD [RB51]
- ASGD [PJ92]
- SAG [LRSB12]
- SDCA [SSZ13]
- SVRG [JZ13]

Semi-stochastic:
- hybrid deterministic-stochastic methods [FS12]

Majorization methods:
- MISO [M13]
- iterative scaling [DR72]
- EM [DLR77]
- quadratic lower bound principle [BL88]
- ...


Bound majorization

If the cost minimizer θ̂ = arg min_θ C(θ) has no closed-form solution, majorization uses a surrogate Q with a closed-form solution and monotonically improves from an initial θ_0:
- Find a bound Q(θ, θ_i) ≥ C(θ) where Q(θ_i, θ_i) = C(θ_i)
- Update θ_{i+1} = arg min_θ Q(θ, θ_i)
- Repeat until converged
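To make the loop concrete, here is a minimal Python sketch of the majorize-minimize pattern just described; the surrogate used (the standard Lipschitz-gradient quadratic upper bound) is an illustrative choice, not the partition-function bound introduced later in the talk.

import numpy as np

def majorize_minimize(grad_C, L, theta0, max_iter=100, tol=1e-8):
    # Majorize-minimize with the quadratic surrogate
    #   Q(theta, theta_i) = C(theta_i) + grad_C(theta_i)^T (theta - theta_i) + (L/2) ||theta - theta_i||^2,
    # valid whenever grad_C is L-Lipschitz; its minimizer is a 1/L gradient step.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_next = theta - grad_C(theta) / L        # arg min_theta Q(theta, theta_i)
        if np.linalg.norm(theta_next - theta) < tol:  # repeat until converged
            return theta_next
        theta = theta_next
    return theta

# Toy usage: C(theta) = 0.5 * ||A theta - b||^2, whose gradient is L-Lipschitz with L = lambda_max(A^T A).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A.T @ A).max()
theta_hat = majorize_minimize(lambda th: A.T @ (A @ th - b), L, np.zeros(2))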



Generic optimization techniques vs majorization methods

Majorization methods were preferred until [W03, AG07].

...Why? They are slower than other optimizers because of loose and complicated bounds.

Let's fix this!


Partition Function

Log-linear model partition function:

    Z(θ) = Σ_y h(y) exp(θᵀ f(y))

The partition function ensures that p(y|θ) normalizes. It is a central quantity to optimize in:
- maximum likelihood and the e-family [P36]
- maximum entropy [J57]
- conditional random fields [LMP01]
- log-linear models [DR72]
- graphical models, HMMs [JGJS99]

Problem: it is ugly to minimize; we much prefer quadratics.
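As a concrete illustration (my own toy example, not from the talk), the partition function of a small discrete log-linear model can be evaluated stably in log-space; the arrays F, log_h, and the numbers below are made up.

import numpy as np

def log_partition(theta, F, log_h):
    # log Z(theta) for Z(theta) = sum_y h(y) exp(theta^T f(y)).
    # F is a (|Y|, d) array whose rows are the feature vectors f(y);
    # log_h is a (|Y|,) array of log base measures log h(y).
    # Uses the log-sum-exp trick for numerical stability.
    scores = log_h + F @ theta          # log h(y) + theta^T f(y), one entry per y
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())

# Toy usage: 3 outcomes, 2-dimensional features.
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
log_h = np.zeros(3)
theta = np.array([0.5, -0.2])
log_p = log_h + F @ theta - log_partition(theta, F, log_h)   # log p(y | theta)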


Partition Function Bound

The bound

    ln Z(θ) ≤ ln z + ½ (θ − θ̃)ᵀ Σ (θ − θ̃) + (θ − θ̃)ᵀ μ

is tight at θ̃ and holds for parameters given by:

    Input: θ̃, f(y), h(y) for all y ∈ Ω
    Init: z → 0⁺, μ = 0, Σ = zI
    For each y ∈ Ω {
        α = h(y) exp(θ̃ᵀ f(y))
        r = f(y) − μ
        Σ ← Σ + [tanh(½ ln(α/z)) / (2 ln(α/z))] r rᵀ
        μ ← μ + [α / (z + α)] r
        z ← z + α
    }
    Output: z, μ, Σ

[Figure: log Z(θ) and its quadratic bounds over θ ∈ [−5, 5].]

Computing the bound is O(nd²); the update θ ← θ̃ − Σ⁻¹ μ is O(d³).
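A NumPy sketch of the per-outcome bound computation above; the function name, the λ → 0 guard, and the data layout are mine, and the code is meant to mirror the displayed updates rather than reproduce the authors' released implementation.

import numpy as np

def partition_bound(theta_tilde, F, h):
    # Returns (z, mu, Sigma) such that
    #   ln Z(theta) <= ln z + 0.5 (theta - theta_tilde)^T Sigma (theta - theta_tilde) + (theta - theta_tilde)^T mu,
    # tight at theta_tilde.  F is (|Omega|, d) with rows f(y); h is (|Omega|,) with h(y) > 0.
    d = F.shape[1]
    z = 1e-12                          # z -> 0+
    mu = np.zeros(d)
    Sigma = z * np.eye(d)
    for f_y, h_y in zip(F, h):
        alpha = h_y * np.exp(theta_tilde @ f_y)
        r = f_y - mu
        lam = np.log(alpha / z)
        weight = 0.25 if abs(lam) < 1e-10 else np.tanh(0.5 * lam) / (2.0 * lam)  # limit 1/4 as lam -> 0
        Sigma += weight * np.outer(r, r)
        mu += alpha / (z + alpha) * r
        z += alpha
    return z, mu, Sigma

# One majorization update (the O(d^3) step from the slide):
#   theta_new = theta_tilde - np.linalg.solve(Sigma, mu)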



Conditional Random Fields (CRFs)

Trained on i.i.d. data {(x_1, y_1), ..., (x_t, y_t)}.

Each CRF is a log-linear model

    p(y | x_j, θ) = (1 / Z_{x_j}(θ)) h_{x_j}(y) exp(θᵀ f_{x_j}(y))

The regularized maximum likelihood objective is

    J(θ) = Σ_{j=1}^t [ log( h_{x_j}(y_j) / Z_{x_j}(θ) ) + θᵀ f_{x_j}(y_j) ] − (tλ/2) ‖θ‖²
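Continuing the toy setup from the earlier sketch, the regularized CRF objective could be evaluated as follows; the per-example (F_j, h_j, y_j) data layout is an assumption made for illustration.

import numpy as np

def crf_objective(theta, examples, lam):
    # J(theta) = sum_j [ log( h_{x_j}(y_j) / Z_{x_j}(theta) ) + theta^T f_{x_j}(y_j) ] - (t * lam / 2) ||theta||^2
    # `examples` is a list of (F_j, h_j, y_j) triples, one per training point.
    t = len(examples)
    J = -0.5 * t * lam * float(theta @ theta)
    for F_j, h_j, y_j in examples:
        log_h_j = np.log(h_j)
        J += log_h_j[y_j] + theta @ F_j[y_j] - log_partition(theta, F_j, log_h_j)
    return float(J)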


Maximum Likelihood Algorithm for CRFs

While not converged:
    For j = 1, ..., t:
        Compute the bound parameters μ_j, Σ_j from h_{x_j}, f_{x_j}, θ̃
    Set θ̃ = arg min_{θ∈Λ} Σ_j ½ (θ − θ̃)ᵀ (Σ_j + λI) (θ − θ̃) + Σ_j θᵀ (μ_j − f_{x_j}(y_j) + λθ̃)

Theorem

The algorithm outputs a θ̂ such that

    J(θ̂) − J(θ_0) ≥ (1 − ε) max_{θ∈Λ} ( J(θ) − J(θ_0) )

within ⌈ log(1/ε) / log(1 + λε / (2r log² n)) ⌉ steps.



Experiments - Markov CRFs

Bound admits a low-rank version (O(tnd)):
- As in L-BFGS, use rank-k storage Σ = V S Vᵀ + D
- Absorb the residual into the diagonal D; low-rank is still a bound

Graphical models, e.g. Markov CRFs:
- Build a junction tree and run a Collect algorithm
- Only needs O(td² Σ_c |Y_c|) rather than O(td² n)

CoNLL dataset:

Algorithm    time      passes
L-BFGS       1.00t     17
CG           3.47t     23
Bound        0.64t     4

Algorithm    time      passes
L-BFGS       1.00t     22
CG           5.94t     27
IIS          ≥ 6.35t   ≥ 150
Bound [W03]

Latent models:
- Objective function is non-concave: a ratio of partition functions
- Apply Jensen to the numerator and our bound to the denominator
- Often a better solution than BFGS, Newton, CG, SD, ...



Experiments - Latent models

Bounding also simplifies mixture models with hidden variables (mixtures of Gaussians, HMMs, latent graphical models).

Assume exponential-family mixture components (Gaussian, multinomial, Poisson, Laplace).

Latent CRF or log-linear model [Quattoni et al. '07]:

    L(θ) = Π_{j=1}^t [ Σ_m exp(θᵀ f_{j,y_j,m}) / Σ_{y,m} exp(θᵀ f_{j,y,m}) ] ≥ Q(θ, θ̃)

Apply Jensen to the numerator and our bound to the denominator.


Experiments - (Semi-)Stochastic Bound Majorization

Computing the bound is O(t) → intractable for large t.

Semi-stochastic: compute the bound on data mini-batches.
- convergence to a stationary point under weak assumptions (in particular, convexity is not required)
- linear convergence rate for the logistic regression problem when the batch size grows sufficiently fast

Theorem

For each iteration we have (for any ε > 0)

    J(θ_k) − J(θ*) ≤ (1 − ρ/L)^k [J(θ_0) − J(θ*)] + O(C_k),

with C_k = max{B_k, (1 − ρ/L + ε)^k} and B_k = ‖∇J(θ)|_{θ=θ_k} − g_k^T‖², where T is the mini-batch.
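A hedged sketch of one semi-stochastic step, reusing the hypothetical partition_bound helper from above: the quadratic bound is built on a random mini-batch T and the closed-form regularized update from the CRF algorithm is applied. Batch-size growth and convergence checks are omitted.

import numpy as np

def semi_stochastic_bound_step(theta_tilde, data, lam, batch_size, rng):
    # `data` is a list of (F_j, h_j, y_j) triples as in the CRF sketch.
    idx = rng.choice(len(data), size=batch_size, replace=False)
    d = theta_tilde.size
    A = np.zeros((d, d))   # accumulates (Sigma_j + lam I) over the mini-batch
    b = np.zeros(d)        # accumulates (mu_j - f_{x_j}(y_j) + lam theta_tilde)
    for j in idx:
        F_j, h_j, y_j = data[j]
        _, mu_j, Sigma_j = partition_bound(theta_tilde, F_j, h_j)
        A += Sigma_j + lam * np.eye(d)
        b += mu_j - F_j[y_j] + lam * theta_tilde
    return theta_tilde - np.linalg.solve(A, b)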



Datasets: rcv1, adult, protein

[Figure: for each dataset (rcv1, adult, protein), objective minus optimum (training), test logistic loss, and test error (%) vs. effective passes, comparing LBFGS, SGD, ASGD, SAG, SAGls, and SQB.]


How to design a good objective function?


eXtreme multi-class classification problem

Problem setting:
- classification with a large number of classes
- data is accessed online

Goal:
- a good predictor with logarithmic training and testing time
- reduction to tree-structured binary classification
- top-down approach for class partitioning allowing gradient-descent-style optimization


What was already done...

Intractable:
- one-against-all [RK04]
- variants of ECOC [DB95], e.g. PECOC [LB05]
- clustering-based approaches [BWG10, WMY13]

Choice of partition not addressed:
- Filter Tree and error-correcting tournaments [BLR09]

Choice of partition addressed, but dedicated to conditional probability estimation:
- conditional probability tree [BLLSS09]

Splitting criteria not well-suited to the large-class setting:
- decision trees [KM95]
- ...


How do you learn the structure?

Not all partitions are equally difficult, e.g.:
- if you do {1,7} vs {3,8}, the next problem is hard;
- if you do {1,8} vs {3,7}, the next problem is easy;
- if you do {1,3} vs {7,8}, the next problem is easy.

[BWG10]: Better to confuse near leaves than near root. Intuition: the root predictor tends to be overconstrained while the leafward predictors are less constrained.



How do you learn the structure?

Our approach:
- top-down approach for class partitioning
- splitting criterion guaranteeing
  - balanced tree ⇒ logarithmic training and testing time, and
  - small classification error


Pure split and balanced split

- k_r(x): number of data points in the same class as x on the right side of the partitioning
- k(x): total number of data points in the same class as x
- n_r: number of data points on the right side of the partitioning
- n: total number of data points

Measure of balancedness: n_r / n
Measure of purity: k_r(x) / k(x)

Intro Convex solver: bound majorization (Convex) objective: multi-classification Non-convexity: deep learning Summary Splitting criterion

Pure split and balanced split

kr(x): number of data points in the same class asx on the right side of the partitioning

k(x): total number of data points in the same class asx nr: number of data points on the right side of the partitioning n: total number of data points

Measure of balanceness: nnr

(36)

Intro Convex solver: bound majorization (Convex) objective: multi-classification Non-convexity: deep learning Summary Splitting criterion

Pure split and balanced split

kr(x): number of data points in the same class asx on the right side of the partitioning

k(x): total number of data points in the same class asx nr: number of data points on the right side of the partitioning n: total number of data points

Measure of balanceness: nnr

Measure of purity: kk(x)r(x)

(37)

Intro Convex solver: bound majorization (Convex) objective: multi-classification Non-convexity: deep learning Summary Splitting criterion

Pure split and balanced split

kr(x): number of data points in the same class asx on the right side of the partitioning

k(x): total number of data points in the same class asx nr: number of data points on the right side of the partitioning n: total number of data points

Measure of balanceness: nnr Measure of purity: kk(x)r(x)


Pure split and balanced split

k: number of classes
H: hypothesis class (typically: linear classifiers)
π_y = |X_y| / n

balance = Pr(h(x) > 0)
purity = Σ_{y=1}^k π_y min( Pr(h(x) > 0 | y), Pr(h(x) < 0 | y) )

Definition (Balanced split)
The hypothesis h ∈ H induces a balanced split iff ∃ c ∈ (0, 0.5] such that c ≤ balance ≤ 1 − c.

Definition (Pure split)
The hypothesis h ∈ H induces a pure split iff ∃ δ ∈ [0, 0.5) such that purity ≤ δ.



Objective function

    J(h) = 2 Σ_{y=1}^k π_y | P(h(x) > 0) − P(h(x) > 0 | y) |
         = 2 E_{x,y}[ | P(h(x) > 0) − P(h(x) > 0 | y) | ]

J(h) ⇒ splitting criterion (objective function): given a set of n examples, each with one of k labels, find a partitioner h that maximizes the objective.

Lemma

For any hypothesis h : X → {−1, 1}, the objective J(h) satisfies J(h) ∈ [0, 1]. Furthermore, h induces a maximally pure and balanced partition iff J(h) = 1.
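A small empirical check of these quantities (function and variable names are mine): given scores h(x) and class labels, it computes the balance, the purity, and J(h). The toy example realizes the lemma's J(h) = 1 case, a maximally pure and balanced split.

import numpy as np

def split_statistics(scores, labels):
    # balance = P(h(x) > 0)
    # purity  = sum_y pi_y * min( P(h(x) > 0 | y), P(h(x) < 0 | y) )
    # J(h)    = 2 * sum_y pi_y * | P(h(x) > 0) - P(h(x) > 0 | y) |
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    right = scores > 0
    balance = right.mean()
    purity, J = 0.0, 0.0
    for y in np.unique(labels):
        mask = labels == y
        pi_y = mask.mean()
        p_right_y = right[mask].mean()
        purity += pi_y * min(p_right_y, 1.0 - p_right_y)
        J += pi_y * abs(balance - p_right_y)
    return float(balance), float(purity), float(2.0 * J)

# Classes 0,1 go right and classes 2,3 go left: balance = 0.5, purity = 0, J(h) = 1.
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
scores = np.array([+1, +1, +1, +1, -1, -1, -1, -1])
print(split_statistics(scores, labels))   # (0.5, 0.0, 1.0)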


Balancing and purity factors

Balancing factor:

    balance ∈ [ (1 − √(1 − J(h))) / 2 ,  (1 + √(1 − J(h))) / 2 ]

[Figure: the two bounds (1 ± √(1 − J(h)))/2 plotted against J(h); the admissible interval narrows to 1/2 as J(h) → 1.]


Purity factor:

    purity ≤ (2 − J(h)) / (4 · balance) − balance

[Figure: the bound plotted against J(h) for balance = 1/2, decreasing from 1/2 at J(h) = 0 to 0 at J(h) = 1.]


What is the quality of the obtained tree?

In each node of the tree T, optimize the splitting criterion; apply recursively to construct a tree structure.

Measure the quality of the tree using the entropy

    G_T = Σ_{l ∈ leaves of T} w_l Σ_{y=1}^k π_{l,y} ln(1 / π_{l,y})

Why? Small entropy of the leaves ⇒ pure leaves.
Goal: maximizing the objective reduces the entropy.
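A tiny sketch computing G_T from per-leaf weights and class distributions; the (w_l, pi_l) data layout is an assumption made for illustration.

import numpy as np

def tree_entropy(leaves):
    # G_T = sum over leaves l of  w_l * sum_y pi_{l,y} * ln(1 / pi_{l,y})
    # `leaves` is a list of (w_l, pi_l) pairs: the fraction of data reaching the
    # leaf and the class distribution at that leaf.
    G = 0.0
    for w_l, pi_l in leaves:
        pi_l = np.asarray(pi_l, dtype=float)
        nz = pi_l > 0                      # treat 0 * ln(1/0) as 0
        G += w_l * np.sum(pi_l[nz] * np.log(1.0 / pi_l[nz]))
    return float(G)

# Two pure leaves give zero entropy; one uniform leaf over k classes gives ln k.
print(tree_entropy([(0.5, [1.0, 0.0]), (0.5, [0.0, 1.0])]))   # 0.0
print(tree_entropy([(1.0, [0.25, 0.25, 0.25, 0.25])]))        # ln 4, about 1.386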


Definition (Weak Hypothesis Assumption)

Let m denote any node of the tree T, and let β_m = P(h_m(x) > 0) and P_{m,i} = P(h_m(x) > 0 | i). Furthermore, let γ ∈ R⁺ be such that for all m, γ ∈ (0, min(β_m, 1 − β_m)]. We say that the weak hypothesis assumption is satisfied when, for any distribution P over X, at each node m of the tree T there exists a hypothesis h_m ∈ H such that J(h_m)/2 = Σ_{i=1}^k π_{m,i} |P_{m,i} − β_m| ≥ γ.

Theorem

Under the Weak Hypothesis Assumption, for any ε ∈ [0, 1], to obtain G_T ≤ ε it suffices to make (1/ε)^{4(1−γ)² ln k / γ²} splits.

Tree depth ≈ log[(1/ε)^{4(1−γ)² ln k / γ²}] = O(ln k) ⇒ logarithmic training and testing time.
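Since the tree built this way is (approximately) balanced, its depth is roughly the logarithm of the number of splits; a short LaTeX note making that step explicit (standard balanced-tree reasoning, not quoted from the talk):

% A balanced binary tree with N internal nodes (splits) has depth about log_2 N.
% With N = (1/\epsilon)^{4(1-\gamma)^2 \ln k / \gamma^2} splits,
\[
  \text{depth} \;\approx\; \log_2 N
  \;=\; \frac{4(1-\gamma)^2 \ln k}{\gamma^2}\,\log_2\frac{1}{\epsilon}
  \;=\; O(\ln k) \qquad (\epsilon,\ \gamma\ \text{fixed}),
\]
% which is what yields logarithmic training and testing time.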



LOMtree algorithm

Recall the objective function we consider at every tree node:

    J(h) = 2 E_y[ | E_x[1(h(x) > 0)] − E_x[1(h(x) > 0) | y] | ]

Problem: discrete optimization.

Relaxation: drop the indicator and look at margins. The objective function becomes

    J(h) = 2 E_y[ | E_x[h(x)] − E_x[h(x) | y] | ]

Keep online empirical estimates of these expectations. The sign of the difference of the two expectations decides whether to send an example to the left or right child node.



LOMtree algorithm

Let e = 0 and, for all y, e_y = 0, n_y = 0.
For each example (x, y):
    if e_y < e then b = −1 else b = 1
    Update w using (x, b)
    n_y ← n_y + 1;  n ← n + 1
    e_y ← ((n_y − 1) e_y + w·x) / n_y
    e ← ((n − 1) e + w·x) / n

Apply recursively to construct a tree structure.
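A Python sketch of the per-node routine above; the perceptron-style step stands in for the unspecified 'Update w using (x, b)', and the class name and learning rate are illustrative.

import numpy as np

class LOMtreeNode:
    # Online partitioner for a single tree node: maintains the running means
    # e (over all examples) and e_y (per class) of the margin w.x, routes each
    # example to the side suggested by comparing e_y with e, and updates w on
    # the induced binary label b.
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self.e, self.n = 0.0, 0
        self.e_y, self.n_y = {}, {}

    def update(self, x, y):
        e_y = self.e_y.get(y, 0.0)
        b = -1 if e_y < self.e else 1          # if e_y < e then b = -1 else b = 1
        if b * (self.w @ x) <= 0:              # perceptron update of w using (x, b)
            self.w += self.lr * b * x
        margin = self.w @ x
        self.n_y[y] = self.n_y.get(y, 0) + 1
        self.n += 1
        self.e_y[y] = ((self.n_y[y] - 1) * e_y + margin) / self.n_y[y]
        self.e = ((self.n - 1) * self.e + margin) / self.n
        return b                               # routing decision for this example

As in the slide, the same routine is applied recursively at every node to grow the tree.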



Experiments

Table: Training time on selected problems.

             Isolet    Sector    Aloi
LOMtree      16.27s    12.77s    51.86s
OAA          19.58s    18.37s    11m2.43s

Table: Per-example test time on all problems.

             Isolet    Sector    Aloi      ImNet     ODP
LOMtree      0.14ms    0.13ms    0.06ms    0.52ms    0.26ms
OAA          0.16ms    0.24ms    0.33ms    0.21s     1.05s

Table: Test error (%) and confidence interval on all problems.

                 LOMtree        Rtree          Filter tree
Isolet (26)      6.36±1.71      16.92±2.63     15.10±2.51
Sector (105)     16.19±2.33     15.77±2.30     17.70±2.41
Aloi (1000)      16.50±0.70     83.74±0.70     80.50±0.75
ImNet (22K)      90.17±0.05     96.99±0.03     92.12±0.04
ODP (105K)       93.46±0.12     93.85±0.12     93.76±0.12


[Figure: LOMtree vs one-against-all. Left: accuracy vs. number of classes (26, 105, 1000, 21841, 105033). Right: log2(time ratio) vs. log2(number of classes).]


How to understand non-convex optimization?


Why non-convex optimization and deep learning?

State-of-the-art results on a number of problems:

image recognition [KSH12, CMGS10]

speech recognition [HDYDMJSVNSK12, GMH13]

natural language processing [WCA14]

video recognition [KTSLSF-F14, SZ14]


ImageNet 2014 Challenge: mostly convolutional networks


Challenge

Goal: understanding the loss function in deep learning.

Recent related works: Choromanska et al., 2015; Goodfellow et al., 2015; Dauphin et al., 2014; Saxe et al., 2014.

Questions:
- Why do multiple experiments with multilayer networks consistently give very similar performance despite the presence of many local minima?
- What is the role of saddle points in the optimization problem?
- Is the surface of the loss function of multilayer networks structured?



Multilayer network and spin-glass model

Can we use the spin-glass theory to explain the optimization paradigm with large multilayer networks?

What assumptions need to be made?



Loss function in deep learning and assumptions

[Diagram: a feed-forward network with inputs X_1, ..., X_d, hidden layers of sizes n_1, n_2, ..., and scalar output Y.]

    Y = (1 / Λ^{(H−1)/2}) Σ_{i=1}^Ψ X_i A_i Π_{k=1}^H w_i^(k)

- Ψ: number of input-output paths
- Λ = Ψ^{1/H} (assume Λ ∈ Z⁺)
- H − 1: number of hidden layers
- w_i^(k): the weight of the k-th segment of the i-th path
- A_i: Bernoulli r.v. denoting path activation (0/1)
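To see where this path-sum form comes from, here is a small NumPy check (my own construction; the 1/Λ^{(H−1)/2} normalization and the Bernoulli model for A_i are omitted, and the activation indicators are computed from an actual forward pass): for a ReLU network, the output equals the sum over input-output paths of the input feeding the path, times an indicator that every ReLU on the path is active, times the product of the path's weights.

import itertools
import numpy as np

def relu_net(x, Ws):
    # Forward pass of a ReLU network; Ws are the weight matrices, the last layer is linear with a scalar output.
    a = x
    for W in Ws[:-1]:
        a = np.maximum(W @ a, 0.0)
    return float((Ws[-1] @ a)[0])

def path_sum(x, Ws):
    # Same output written as  Y = sum_i X_i * A_i * prod_k w_i^(k),
    # where A_i indicates that every hidden unit on path i is active.
    acts, a = [], x
    for W in Ws[:-1]:
        z = W @ a
        acts.append(z > 0)
        a = np.maximum(z, 0.0)
    layer_sizes = [len(x)] + [W.shape[0] for W in Ws[:-1]]   # choices: input unit, then one unit per hidden layer
    total = 0.0
    for path in itertools.product(*[range(s) for s in layer_sizes]):
        w_prod, active = 1.0, True
        for k, W in enumerate(Ws):
            row = path[k + 1] if k + 1 < len(path) else 0    # single output unit at the end
            w_prod *= W[row, path[k]]
            if k < len(Ws) - 1 and not acts[k][row]:
                active = False
        total += x[path[0]] * (w_prod if active else 0.0)
    return total

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((4, 4)), rng.standard_normal((1, 4))]
x = rng.standard_normal(3)
assert np.isclose(relu_net(x, Ws), path_sum(x, Ws))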
