• No results found

Machine Learning Applications

N/A
N/A
Protected

Academic year: 2022

Share "Machine Learning Applications"

Copied!
59
0
0

Loading.... (view fulltext now)

Full text

(1)

First Order Methods for Convex Optimization

Saketh (IIT Bombay)

(2)

Topics

β€’ Part – I

β€’ Optimal methods for unconstrained convex programs

β€’ Smooth objective

β€’ Non-smooth objective

β€’ Part – II

β€’ Optimal methods for constrained convex programs

β€’ Projection based

β€’ Frank-Wolfe based

β€’ Functional constraint based

β€’ Prox-based methods for structured non-smooth programs

(3)

Non-Topics

β€’ Step-size schemes

β€’ Bundle methods

β€’ Stochastic methods

β€’ Inexact oracles

β€’ Non-Euclidean extensions (Mirror-friends)

(4)

Motivation & Example Applications

(9)

Machine Learning Applications

β€’ Goal: Construct f : X β†’ Y

β€’ Input data: {(x_1, y_1), …, (x_m, y_m)}

β€’ Model: f(x) = w^T Ο†(x)

β€’ Algorithm: Find simple functions that explain the data

min_{w ∈ R^n}  Ξ©(w) + βˆ‘_{i=1}^m l(w^T Ο†(x_i), y_i)

(12)

π‘€βˆˆπ‘Šmin Ξ©(𝑀) +

𝑖=1 π‘š

𝑙(π‘€π‘‡πœ™ π‘₯𝑖 , 𝑦𝑖)

s.t. 𝐺 𝑀 ≀ 0

Typical Program – Machine Learning

Smooth/Non-Smooth Smooth/Non-Smooth

β€’ Unconstrained/Constrained

β€’ Simple domains, smooth functional constraints

β€’ Smooth/Non-smooth/Composite Objectives

Domain

e.g. simplex Constraints

e.g. multiple models
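To make the template concrete, here is a minimal NumPy sketch of one instance of this program, with Ξ©(w) = (Ξ»/2)β€–wβ€–Β² (smooth) and the hinge loss (non-smooth). The data, the identity feature map, and all names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# One concrete instance of  min_w  Omega(w) + sum_i l(w^T phi(x_i), y_i):
# Omega(w) = 0.5*lam*||w||^2 (smooth) and l = hinge loss (non-smooth).
rng = np.random.default_rng(0)
m, n = 200, 10                        # number of examples, feature dimension
Phi = rng.standard_normal((m, n))     # row i plays the role of phi(x_i)
y = np.sign(rng.standard_normal(m))   # labels in {-1, +1}
lam = 0.1

def objective(w):
    scores = Phi @ w                           # w^T phi(x_i) for every i
    hinge = np.maximum(0.0, 1.0 - y * scores)  # non-smooth loss term
    return 0.5 * lam * w @ w + hinge.sum()

print(objective(np.zeros(n)))   # value of the program at w = 0
```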

(13)

Scale is the issue!

β€’ m, n, as well as the number of models, may run into millions!

β€’ Even a single iteration of IPM/Newton-variants is infeasible.

β€’ β€œSlower” but β€œcheaper” methods are the alternative

β€’ Decomposition based

β€’ First order methods

(14)

First Order Methods - Overview

β€’ Iterative, gradient-like information, O(n) per iteration

β€’ E.g. Gradient method, Cutting planes, Conjugate gradient

β€’ Very old methods (1950s)

β€’ Far slower than IPM:

β€’ Sub-linear rate (not crucial for ML)

β€’ But (nearly) n-independent

β€’ Widely employed in state-of-the-art ML systems

β€’ Choice of variant depends on problem structure

(16)

Smooth unconstrained

min_{w ∈ R^n}  βˆ‘_{i=1}^m (w^T Ο†(x_i) βˆ’ y_i)Β²

(19)

Smooth Convex Functions

β€’ Continuously differentiable

β€’ Gradient is Lipschitz continuous

β€’ β€–βˆ‡f(x) βˆ’ βˆ‡f(y)β€– ≀ L β€–x βˆ’ yβ€–

β€’ E.g. g(x) ≑ xΒ² is not Lipschitz continuous over R, but is over [0,1] with L = 2

β€’ E.g. g(x) ≑ |x| is Lipschitz continuous with L = 1

Theorem: Let f be convex and twice differentiable. Then f is smooth with constant L ⇔ βˆ‡Β²f(x) βͺ― L I_n.
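The theorem gives a concrete handle on L: for the least-squares objective on the previous slide, f(w) = β€–Ξ¦w βˆ’ yβ€–Β², the Hessian is 2Ξ¦^TΞ¦, so L = 2Ξ»_max(Ξ¦^TΞ¦). Below is a small numerical check of the Lipschitz-gradient inequality on made-up data; all names are illustrative assumptions.

```python
import numpy as np

# Check ||grad f(a) - grad f(b)|| <= L ||a - b|| for the smooth quadratic
# f(w) = ||Phi w - y||^2, whose Hessian 2*Phi^T Phi gives L = 2*lambda_max(Phi^T Phi).
rng = np.random.default_rng(1)
m, n = 50, 5
Phi, y = rng.standard_normal((m, n)), rng.standard_normal(m)

grad = lambda w: 2.0 * Phi.T @ (Phi @ w - y)
L = 2.0 * np.linalg.eigvalsh(Phi.T @ Phi).max()   # smoothness constant from the theorem

for _ in range(5):
    a, b = rng.standard_normal(n), rng.standard_normal(n)
    assert np.linalg.norm(grad(a) - grad(b)) <= L * np.linalg.norm(a - b) + 1e-9
print("Lipschitz bound holds; L =", L)
```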

(21)

Gradient Method

[Cauchy1847]

β€’ Move iterate in direction of instantaneous decrease

β€’ π‘₯π‘˜+1 = π‘₯π‘˜ βˆ’ π‘ π‘˜π›»π‘“ π‘₯π‘˜ , sk > 0

β€’ Regularized minimization of first order approx.

β€’ π‘₯π‘˜+1 = argminπ‘₯βˆˆπ‘…π‘› 𝑓 π‘₯π‘˜ + 𝛻𝑓 π‘₯π‘˜ 𝑇(π‘₯ βˆ’ π‘₯π‘˜) + 1

2π‘ π‘˜ π‘₯ βˆ’ π‘₯π‘˜ 2

β€’ Various step-size schemes

β€’ Constant (1/L)

β€’ Diminishing (s_k ↓ 0, βˆ‘ s_k = ∞)

β€’ Exact or back-tracking line search
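As a rough illustration of these pieces, here is a minimal sketch of the update x_{k+1} = x_k βˆ’ s_k βˆ‡f(x_k) with a back-tracking line search, applied to the least-squares objective from earlier. The data, the sufficient-decrease constant, and all names are illustrative assumptions.

```python
import numpy as np

# Gradient method x_{k+1} = x_k - s_k * grad f(x_k), with back-tracking line search
# for s_k, on the smooth least-squares objective f(w) = ||Phi w - y||^2.
rng = np.random.default_rng(0)
m, n = 100, 20
Phi, y = rng.standard_normal((m, n)), rng.standard_normal(m)
f = lambda w: np.sum((Phi @ w - y) ** 2)
grad = lambda w: 2.0 * Phi.T @ (Phi @ w - y)

def gradient_method(w, iters=200, s0=1.0, beta=0.5):
    for _ in range(iters):
        g, s = grad(w), s0
        # back-tracking: shrink s until a sufficient-decrease condition holds
        while f(w - s * g) > f(w) - 0.5 * s * g @ g:
            s *= beta
        w = w - s * g
    return w

w_hat = gradient_method(np.zeros(n))
w_star = np.linalg.lstsq(Phi, y, rcond=None)[0]   # exact minimizer, for comparison
print(f(w_hat) - f(w_star))                       # should be close to zero
```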

(26)

Convergence rate – Gradient method

Theorem [Ne04]: If f is smooth with constant L and s_k = 1/L, then the gradient method generates x_k such that:

f(x_k) βˆ’ f(x*) ≀ 2L β€–x_0 βˆ’ x*β€–Β² / (k + 4)

Proof sketch (with Ξ”_k = f(x_k) βˆ’ f(x*) and r_0 = β€–x_0 βˆ’ x*β€–):

β€’ f(x_{k+1}) ≀ f(x_k) + βˆ‡f(x_k)^T (x_{k+1} βˆ’ x_k) + (L/2) β€–x_{k+1} βˆ’ x_kβ€–Β²   (majorization minimization)

β€’ f(x_{k+1}) β‰₯ f(x_k) + βˆ‡f(x_k)^T (x_{k+1} βˆ’ x_k) + (1/(2L)) β€–βˆ‡f(x_{k+1}) βˆ’ βˆ‡f(x_k)β€–Β²

β€’ β€–x_{k+1} βˆ’ x*β€–Β² ≀ β€–x_k βˆ’ x*β€–Β² βˆ’ (1/LΒ²) β€–βˆ‡f(x_k)β€–Β²

β€’ Ξ”_{k+1} ≀ Ξ”_k βˆ’ Ξ”_kΒ² / (2L r_0Β²)   (solve the recursion)
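The bound can be eyeballed numerically: run the constant-step (1/L) gradient method on a small least-squares instance and compare the gap f(x_k) βˆ’ f(x*) with 2L β€–x_0 βˆ’ x*β€–Β² / (k + 4). A sketch under illustrative data and names:

```python
import numpy as np

# Numerically compare f(x_k) - f(x*) with the bound 2*L*||x0 - x*||^2 / (k + 4)
# for the constant-step (1/L) gradient method on a least-squares instance.
rng = np.random.default_rng(1)
m, n = 80, 15
Phi, y = rng.standard_normal((m, n)), rng.standard_normal(m)
f = lambda w: np.sum((Phi @ w - y) ** 2)
grad = lambda w: 2.0 * Phi.T @ (Phi @ w - y)

L = 2.0 * np.linalg.eigvalsh(Phi.T @ Phi).max()       # smoothness constant
x_star = np.linalg.lstsq(Phi, y, rcond=None)[0]
f_star, x = f(x_star), np.zeros(n)
r0_sq = np.sum((x - x_star) ** 2)

ratios = []
for k in range(1, 51):
    x = x - grad(x) / L                               # s_k = 1/L
    ratios.append((f(x) - f_star) / (2 * L * r0_sq / (k + 4)))
print(max(ratios))   # stays below 1: the gap sits under the theorem's bound
```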

(31)

Comments on rate of convergence

β€’ Sub-linear, very slow compared to IPM

β€’ Applies to conjugate gradient and other traditional variants

β€’ Additionally, strong convexity gives a linear rate: O( ((Q βˆ’ 1)/(Q + 1))^(2k) )

β€’ Sub-optimal (maybe?):

Theorem [Ne04]: For any k ≀ (n βˆ’ 1)/2 and any x_0, there exists a smooth function f, with constant L, such that for any first order method:

f(x_k) βˆ’ f(x*) β‰₯ 3L β€–x_0 βˆ’ x*β€–Β² / (32 (k + 1)Β²)

Proof sketch: Choose the function such that x_k ∈ lin{βˆ‡f(x_0), …, βˆ‡f(x_{kβˆ’1})} βŠ‚ R^{k,n}

(32)

Intuition for non-optimality

β€’ All variants are descent methods

β€’ Descent essential for proof

β€’ Descent is overkill and leads to restrictive movements

β€’ Try non-descent alternatives!

(35)

Towards optimality

[Moritz Hardt]

β€’ f(x) = (1/2) x^T A x βˆ’ b^T x;  x_0 = b

β€’ Gradient method: x_k = x_{kβˆ’1} βˆ’ (1/L)(A x_{kβˆ’1} βˆ’ b) = βˆ‘_{i=0}^k (I βˆ’ A/L)^i b/L

Sub-optimal: O( (1 βˆ’ 1/Q)^k )

Lemma [Mo12]: There is a (Chebyshev) polynomial p of degree O(√Q log(1/Ξ΅)) such that p(0) = 1 and |p(x)| ≀ Ξ΅ for all x ∈ [ΞΌ, L].

β€’ Chebyshev polynomials have a two-term recursive formula, hence we expect

x_k = x_{kβˆ’1} βˆ’ s_{kβˆ’1} βˆ‡f(x_{kβˆ’1}) + Ξ»_{kβˆ’1} βˆ‡f(x_{kβˆ’2})

to be optimal (acceleration)

Optimal: O( (1 βˆ’ 1/√Q)^k )

(37)

Accelerated Gradient Method

[Ne83,88,Be09]

β€’ π‘¦π‘˜ = π‘₯π‘˜βˆ’1 + π‘˜βˆ’2

π‘˜+1 π‘₯π‘˜βˆ’1 βˆ’ π‘₯π‘˜βˆ’2 (Extrapolation or momentum)

β€’ π‘₯π‘˜ = π‘¦π‘˜ βˆ’ π‘ π‘˜π›»π‘“(π‘¦π‘˜) (Usual gradient step)

Two step history
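A minimal sketch of these two steps with s_k = 1/L, reusing the least-squares setup from earlier; the data and all names are illustrative assumptions.

```python
import numpy as np

# Accelerated gradient: y_k = x_{k-1} + ((k-2)/(k+1)) (x_{k-1} - x_{k-2}),
#                       x_k = y_k - (1/L) grad f(y_k).
rng = np.random.default_rng(2)
m, n = 100, 20
Phi, b = rng.standard_normal((m, n)), rng.standard_normal(m)
grad = lambda w: 2.0 * Phi.T @ (Phi @ w - b)          # f(w) = ||Phi w - b||^2
L = 2.0 * np.linalg.eigvalsh(Phi.T @ Phi).max()

x_prev = x = np.zeros(n)
for k in range(1, 201):
    y = x + (k - 2) / (k + 1) * (x - x_prev)          # extrapolation (two-step history)
    x_prev, x = x, y - grad(y) / L                    # usual gradient step at y
print(np.linalg.norm(grad(x)))                        # small gradient norm at the last iterate
```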

(39)

Rate of Convergence – Accelerated gradient

Theorem [Be09]: If f is smooth with constant L and s_k = 1/L, then the accelerated gradient method generates x_k such that:

f(x_k) βˆ’ f(x*) ≀ 2L β€–x_0 βˆ’ x*β€–Β² / (k + 1)Β²

Proof sketch:

β€’ f(x_k) ≀ f(z) + L (x_k βˆ’ y_k)^T (z βˆ’ x_k) + (L/2) β€–x_k βˆ’ y_kβ€–Β²  for all z ∈ R^n

β€’ A convex combination of this at z = x_{kβˆ’1} and z = x* leads to:

((k + 1)Β² / 2L) (f(x_k) βˆ’ f*) + β€–y_k βˆ’ x*β€–Β² ≀ (kΒ² / 2L) (f(x_{kβˆ’1}) βˆ’ f*) + β€–y_{kβˆ’1} βˆ’ x*β€–Β² ≀ β€–x_0 βˆ’ x*β€–Β²

Indeed optimal!

(42)

A Comparison of the two gradient methods

π‘₯βˆˆπ‘…min1000log

𝑖=1 2000

𝑒 π‘Žπ‘–π‘‡π‘₯+𝑏𝑖

[L. Vandenberghe EE236C Notes]
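A comparison of this kind is easy to reproduce on a synthetic instance of the same log-sum-exp problem. The sketch below (illustrative data, names, and iteration budget) tracks the objective for the plain and accelerated gradient methods; the accelerated run typically ends up noticeably lower.

```python
import numpy as np

# Gradient vs. accelerated gradient on f(x) = log sum_i exp(a_i^T x + b_i),
# a synthetic stand-in for the instance shown on the slide.
rng = np.random.default_rng(3)
m, n = 2000, 1000
A, b = rng.standard_normal((m, n)) / np.sqrt(n), rng.standard_normal(m)

def f(x):
    z = A @ x + b
    return z.max() + np.log(np.exp(z - z.max()).sum())     # stable log-sum-exp

def grad(x):
    z = A @ x + b
    p = np.exp(z - z.max()); p /= p.sum()                  # softmax weights
    return A.T @ p

L = np.linalg.eigvalsh(A.T @ A).max()                      # upper bound on the smoothness const.

def run(accelerated, iters=150):
    x_prev = x = np.zeros(n)
    for k in range(1, iters + 1):
        y = x + (k - 2) / (k + 1) * (x - x_prev) if accelerated else x
        x_prev, x = x, y - grad(y) / L
    return f(x)

print(run(False), run(True))   # accelerated is expected to reach a lower value
```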

(43)

Junk variants other than Accelerated gradient?

β€’ Accelerated gradient is

β€’ Less robust than gradient method [Moritz Hardt]

β€’ Accumulates error with inexact oracles [De13]

β€’ Who knows what will happen in your application?

(44)

Summary of un-constrained smooth convex programs

β€’ Gradient method and friends: Ξ΅ β‰ˆ O(1/k)

β€’ Sub-linear and sub-optimal rate

β€’ Additionally, strong convexity gives Ξ΅ β‰ˆ O( ((Q βˆ’ 1)/(Q + 1))^(2k) ): sub-optimal, but a linear rate

β€’ Accelerated gradient methods: Ξ΅ β‰ˆ O(1/kΒ²)

β€’ Sub-linear but optimal

β€’ O(n) computation per iteration

β€’ Additionally, strong convexity gives Ξ΅ β‰ˆ O( ((√Q βˆ’ 1)/(√Q + 1))^(2k) ): optimal, and still a linear rate

(46)

Non-smooth unconstrained

min_{w ∈ R^n}  βˆ‘_{i=1}^m |w^T Ο†(x_i) βˆ’ y_i|

(48)

What is first order info?

𝐿 π‘₯ ≑ 𝑓 π‘₯0 + 𝛻𝑓 π‘₯0 𝑇 π‘₯ βˆ’ π‘₯0

(π‘₯0, 𝑓(π‘₯0))

(49)

What is first order info?

(π‘₯1, 𝑓(π‘₯1))

Canonical form: 𝐿 π‘₯ ≑ 𝑓 π‘₯1 + 𝑔𝑇 π‘₯ βˆ’ π‘₯1 . Multiple 𝑔 exist such that 𝐿 π‘₯ ≀ 𝑓 π‘₯ βˆ€π‘₯

g is defined as a sub-gradient

(50)

First Order Methods (Non-smooth)

Theorem: Let 𝑓 be a closed convex function. Then

β€’ At any π‘₯ ∈ π‘Ÿπ‘–(π‘‘π‘œπ‘š 𝑓), sub-gradient exists and set of all sub-gradients (denoted by πœ•π‘“(π‘₯); sub-differential set) is closed convex.

β€’ If 𝑓 is differentiable at π‘₯ ∈ 𝑖𝑛𝑑(π‘‘π‘œπ‘š 𝑓), then gradient is the only sub- gradient.

Theorem: π‘₯ ∈ 𝑅𝑛 minimizes 𝑓 if and only if 0 ∈ πœ•π‘“(π‘₯).

(51)

Sub-gradient Method

β€’ Assume an oracle that returns a sub-gradient.

β€’ Sub-gradient method:

β€’ x_{k+1} = argmin_{x ∈ R^n}  f(x_k) + g_f(x_k)^T (x βˆ’ x_k) + (1/(2 s_k)) β€–x βˆ’ x_kβ€–Β²

β€’ x_{k+1} = x_k βˆ’ s_k g_f(x_k)

(52)

Can sub-gradient replace gradient?

β€’ No majorization minimization

β€’ βˆ’π‘”π‘“ π‘₯ not even descent direction

β€’ E.g. 𝑓 π‘₯1, π‘₯2 ≑ π‘₯1 + 2|π‘₯2|

(1,0) (1,2)
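This example can be verified in a few lines; the snippet below (purely illustrative) shows that stepping from (1, 0) along βˆ’g = βˆ’(1, 2) increases f for any small positive step.

```python
import numpy as np

# f(x1, x2) = |x1| + 2*|x2|.  At (1, 0), g = (1, 2) lies in the sub-differential
# {1} x [-2, 2], yet -g is not a descent direction.
f = lambda x: abs(x[0]) + 2 * abs(x[1])
x, g = np.array([1.0, 0.0]), np.array([1.0, 2.0])

for t in [1e-1, 1e-2, 1e-3]:
    print(t, f(x - t * g) - f(x))   # equals 3*t > 0: the objective goes up along -g
```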

(53)

How far can sub-gradient take?

Expect slower than O(1/k)

(55)

How far can sub-gradient take?

Theorem [Ne04]: Let β€–x_0 βˆ’ x*β€– ≀ R and let L be the Lipschitz constant of f over this ball (such an L always exists). Then the sequence generated by sub-gradient descent satisfies:

min_{i ∈ {1,…,k}} f(x_i) βˆ’ f(x*) ≀ L R / √(k + 1)

Proof sketch (with Ξ”_k = f(x_k) βˆ’ f(x*) and r_k = β€–x_k βˆ’ x*β€–):

β€’ 2 s_k Ξ”_k ≀ r_kΒ² βˆ’ r_{k+1}Β² + s_kΒ² β€–g_f(x_k)β€–Β²

β€’ LHS ≀ (r_0Β² + βˆ‘_{i=0}^k s_iΒ² β€–g_f(x_i)β€–Β²) / (2 βˆ‘_{i=0}^k s_i) ≀ (RΒ² + LΒ² βˆ‘_{i=0}^k s_iΒ²) / (2 βˆ‘_{i=0}^k s_i);  choose s_k = R / (L √(k + 1))

(56)

Is this optimal?

Theorem [Ne04]: For any k ≀ n βˆ’ 1 and any x_0 such that β€–x_0 βˆ’ x*β€– ≀ R, there exists a convex f, with constant L over the ball, such that for any first order method:

f(x_k) βˆ’ f(x*) β‰₯ L R / (2 (1 + √(k + 1)))

Proof sketch: Choose the function such that x_k ∈ lin{g_f(x_0), …, g_f(x_{kβˆ’1})} βŠ‚ R^{k,n}

(57)

Summary of non-smooth unconstrained

β€’ Sub-gradient descent method: Ξ΅ β‰ˆ O(1/√k)

β€’ Sub-linear, slower than smooth case

β€’ But, optimal!

β€’ Can do better if additional structure (later)

(58)

Summary of Unconstrained Case

[Chart: error vs. iteration count, comparing the non-smooth, smooth gradient, and smooth accelerated gradient rates]

(59)

Bibliography

β€’ [Ne04] Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004. http://hdl.handle.net/2078.1/116858.

β€’ [Ne83] Nesterov, Yurii. A method of solving a convex programming problem with convergence rate O(1/kΒ²). Soviet Mathematics Doklady, Vol. 27(2), 1983, pp. 372-376.

β€’ [Mo12] Moritz Hardt, Guy N. Rothblum and Rocco A. Servedio. Private data release via learning thresholds. SODA 2012, pp. 168-187.

β€’ [Be09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, Vol. 2(1), 2009, pp. 183-202.

β€’ [De13] Olivier Devolder, François Glineur and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 2013.
