
(1)

First order methods

FOR CONVEX OPTIMIZATION

Saketh (IIT Bombay)

(2)

Topics

• Part I
  - Optimal methods for unconstrained convex programs
    - Smooth objective
    - Non-smooth objective

• Part II
  - Optimal methods for constrained convex programs
    - Projection based
    - Frank-Wolfe based
    - Functional constraint based
  - Prox-based methods for structured non-smooth programs

(3)

Constrained Optimization - Illustration

[Figure: a convex objective $f$, a feasible set $F$, and a point $x \in F$.]

(4)

Constrained Optimization - Illustration

$x^*$ is optimal $\;\Leftrightarrow\; \nabla f(x^*)^T u \ge 0 \;\;\forall\, u \in T_F(x^*)$, where $T_F(x^*)$ is the tangent cone of $F$ at $x^*$.

[Figure: the objective $f$ and feasible set $F$, with the gradient $\nabla f$ drawn at the optimal point.]
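For instance (a standard special case, not from the slides), when $F = \mathbb{R}^n_+$ the tangent cone at $x^*$ is $T_F(x^*) = \{u : u_i \ge 0 \text{ whenever } x^*_i = 0\}$, and the condition above reduces to the familiar
$$\nabla f(x^*)_i \ge 0 \quad\text{and}\quad \nabla f(x^*)_i\, x^*_i = 0 \qquad \forall\, i.$$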

(5)

Two Strategies

• Stay feasible and minimize
  - Projection based
  - Frank-Wolfe based

(6)

Two Strategies

• Alternate between
  - minimization steps
  - steps that move towards the feasible set

(7)

Projection Based Methods

CONSTRAINED CONVEX PROGRAMS

(8)

Projected Gradient Method

$\min_{x\in X} f(x)$, where $X$ is closed and convex.

• $x_{k+1} = \operatorname{argmin}_{x\in X}\; f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2 s_k}\|x - x_k\|^2$
  $\;= \operatorname{argmin}_{x\in X}\; \|x - (x_k - s_k \nabla f(x_k))\|^2$
  $\;\equiv \Pi_X\big(x_k - s_k \nabla f(x_k)\big)$

• $X$ is assumed simple: an oracle for the projection $\Pi_X$ is available.


(11)

Projected Gradient Method: illustration

$\min_{x\in X} f(x)$, $\quad x_{k+1} = \Pi_X\big(x_k - s_k \nabla f(x_k)\big)$

[Figure: from $x_k$, the gradient step $x_k - s_k \nabla f(x_k)$ and its projection back onto $X$, which gives $x_{k+1}$.]
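As a concrete illustration of the update $x_{k+1} = \Pi_X(x_k - s_k \nabla f(x_k))$, here is a minimal Python sketch using the Euclidean unit ball as the "simple" set $X$; the function names (`project_onto_ball`, `projected_gradient`), the constant step size, and the quadratic test objective are illustrative choices, not from the slides.

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    # Euclidean projection onto X = {x : ||x|| <= radius}, a closed convex set.
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def projected_gradient(grad_f, project, x0, step, n_iters=100):
    # x_{k+1} = Pi_X(x_k - s_k * grad f(x_k)), with a constant step s_k = step.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = project(x - step * grad_f(x))
    return x

# Minimize f(x) = ||x - c||^2 over the unit ball, with c outside the ball;
# the solution is then c / ||c||, the projection of c onto the ball.
c = np.array([2.0, 1.0])
grad_f = lambda x: 2.0 * (x - c)
x_star = projected_gradient(grad_f, project_onto_ball, x0=np.zeros(2), step=0.25)
```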

(12)

Will it work?

• $\|x_{k+1} - x^*\|^2 = \|\Pi_X(x_k - s_k \nabla f(x_k)) - x^*\|^2 \le \|(x_k - s_k \nabla f(x_k)) - x^*\|^2$ (Why?)

• The remaining analysis is exactly the same as in the unconstrained case (smooth and non-smooth).

• The analysis is a bit more involved for the projected accelerated gradient method.

• Define the gradient map: $h(x_k) = \dfrac{x_k - \Pi_X\big(x_k - s_k \nabla f(x_k)\big)}{s_k}$. It satisfies the same fundamental properties as the gradient!
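The "Why?" is the non-expansiveness of projection onto a closed convex set; a short standard derivation (not spelled out on the slide): for any $u, v$, the projection optimality condition $(u - \Pi_X(u))^T(z - \Pi_X(u)) \le 0$ for all $z \in X$ (and its counterpart for $v$), evaluated at $z = \Pi_X(v)$ and $z = \Pi_X(u)$ respectively and added together, gives
$$\|\Pi_X(u) - \Pi_X(v)\|^2 \le (u - v)^T\big(\Pi_X(u) - \Pi_X(v)\big) \le \|u - v\|\,\|\Pi_X(u) - \Pi_X(v)\|,$$
hence $\|\Pi_X(u) - \Pi_X(v)\| \le \|u - v\|$. Taking $u = x_k - s_k \nabla f(x_k)$ and $v = x^*$ (so that $\Pi_X(v) = x^*$, since $x^* \in X$) yields the displayed inequality.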


(14)

Simple sets

• Non-negative orthant

• Ball, ellipse

• Box, simplex

• Cones

• PSD matrices

• Spectrahedron
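For example, projection onto the probability simplex (one of the simple sets listed above) has a well-known $O(n \log n)$ sorting-based routine. A minimal Python sketch; the function name and interface are illustrative.

```python
import numpy as np

def project_onto_simplex(v):
    # Euclidean projection onto {x : x >= 0, sum(x) = 1} via sorting, O(n log n).
    u = np.sort(v)[::-1]                        # sort in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

print(project_onto_simplex(np.array([0.5, 1.2, -0.3])))  # -> non-negative, sums to 1
```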

(15)

Summary of Projection Based Methods

• Rates of convergence remain exactly the same as in the unconstrained case

• A projection oracle is needed (simple sets)

• Use caution when the projection has no analytic (closed-form) solution

(16)

Frank-Wolfe Methods

CONSTRAINED CONVEX PROGRAMS

(17)

Avoid Projections

[FW59]

• $y_{k+1} = \operatorname{argmin}_{x\in X}\; f(x_k) + \nabla f(x_k)^T (x - x_k)$ (the proximity term $\frac{1}{2 s_k}\|x - x_k\|^2$ of the projected-gradient update is dropped)
  $\;= \operatorname{argmin}_{x\in X}\; \nabla f(x_k)^T x$ (a support-function computation)

• Restrict how far the iterate moves: $x_{k+1} \equiv s_k\, y_{k+1} + (1 - s_k)\, x_k$ (see the sketch below)
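A minimal Python sketch of this iteration over the $\ell_1$ ball, where the support-function (linear minimization) step is a coordinate search over the ball's vertices; the names `lmo_l1` and `frank_wolfe`, the quadratic test objective, and the step size $s_k = \frac{2}{k+2}$ (taken from the rate theorem a few slides ahead) are illustrative choices.

```python
import numpy as np

def lmo_l1(grad, radius=1.0):
    # Linear minimization oracle for X = {x : ||x||_1 <= radius}:
    # argmin_{x in X} <grad, x> is a signed vertex of the l1 ball.
    i = np.argmax(np.abs(grad))
    y = np.zeros_like(grad)
    y[i] = -radius * np.sign(grad[i])
    return y

def frank_wolfe(grad_f, x0, n_iters=200, radius=1.0):
    x = np.asarray(x0, dtype=float)
    for k in range(n_iters):
        y = lmo_l1(grad_f(x), radius)   # y_{k+1}: minimize the linear model over X
        s = 2.0 / (k + 2.0)             # s_k from the convergence theorem
        x = s * y + (1.0 - s) * x       # x_{k+1} = s_k y_{k+1} + (1 - s_k) x_k
    return x

# Example: least squares restricted to the l1 ball.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_hat = frank_wolfe(lambda x: A.T @ (A @ x - b), x0=np.zeros(2))
```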


(19)

Illustration

[Martin Jaggi, ICML 2014]

[Figure: Frank-Wolfe step: from the current iterate $x$, the point $y$ minimizes the linear model over $X$ (i.e., lies as far as possible along $-\nabla f(x)$), and the next iterate $x^+$ moves from $x$ towards $y$.]


(21)

On Conjugates and Support Functions

• Convex $f$ is the point-wise maximum of its affine minorants. This provides a dual definition:
  $f(x) = \max_{y\in Y}\; a_y^T x - b_y$, or equivalently: $\exists\, f^*$ such that $f(x) = \max_{y\in \mathrm{dom}\, f^*}\; y^T x - f^*(y)$

• $f^*$ is called the conjugate, Fenchel dual, or Legendre transformation ($f^{**} = f$).

• If $f^*$ is the indicator of a set $S$, we get a "conic" (positively homogeneous) $f$, the support function of $S$: $f(x) = \max_{y\in S}\; y^T x$

• If $S$ is a norm ball, we get the dual norm.
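A small worked example of the last two points (standard, not from the slides): let $f^*$ be the indicator of $S = \{y : \|y\|_\infty \le 1\}$, the unit $\ell_\infty$ ball. Then
$$f(x) = \max_{\|y\|_\infty \le 1} y^T x = \sum_i |x_i| = \|x\|_1,$$
so the support function of a norm ball is indeed the dual norm ($\ell_1$ is dual to $\ell_\infty$), and $f$ is positively homogeneous.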

(24)

Connection with sub-gradient

• Let $\bar{y} \in \operatorname{argmax}_{y\in \mathrm{dom}\, f^*}\; y^T x - f^*(y)$, i.e., $f(x) + f^*(\bar{y}) = x^T \bar{y}$.

• Then $\bar{y}$ must be a sub-gradient of $f$ at $x$: the dual form exposes sub-gradients.

• In particular, if $f^*$ is the indicator of a set $S$, then $f(x) = \max_{y\in S}\; y^T x$ and any maximizer $\bar{y} \in S$ is a sub-gradient of $f$ at $x$.
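Continuing the $\ell_1$ example (standard, not from the slides): for $f(x) = \|x\|_1 = \max_{\|y\|_\infty \le 1} y^T x$, a maximizer is $\bar{y} = \mathrm{sign}(x)$ (with any value in $[-1, 1]$ at coordinates where $x_i = 0$), and indeed $\mathrm{sign}(x) \in \partial\|x\|_1$; this is exactly how the dual form exposes a sub-gradient.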

(25)

Conjugates e.g.

$f(x)$              Conjugate / dual            Projection?
$\|x\|_p$           $\|x\|_{p/(p-1)}$           No ($p \notin \{1, 2, \infty\}$)
$\|\sigma(X)\|_p$   $\|\sigma(X)\|_{p/(p-1)}$   No ($p \notin \{1, 2, \infty\}$)

• $\|x\|_1$: projection costs $O(n \log n)$, the conjugate (support-function) step costs $O(n)$
• $\|\sigma(X)\|_1$ (nuclear norm): projection needs a full SVD, the conjugate step needs only the first (leading) singular pair
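To illustrate the last bullet: the conjugate / linear-minimization step for the nuclear-norm ball only needs the leading singular pair, which an iterative method can supply cheaply, whereas projection requires a full SVD. A rough Python sketch; the power-iteration helper, the function names, and the matrix sizes are illustrative, not from the slides.

```python
import numpy as np

def top_singular_pair(G, n_iters=50):
    # Power iteration on G^T G: returns the leading left/right singular vectors.
    v = np.random.default_rng(0).standard_normal(G.shape[1])
    for _ in range(n_iters):
        v = G.T @ (G @ v)
        v /= np.linalg.norm(v)
    u = G @ v
    sigma = np.linalg.norm(u)
    return u / sigma, sigma, v

def lmo_nuclear(G, radius=1.0):
    # argmin of <G, X> over {X : ||X||_* <= radius}: a rank-one matrix built
    # from the first singular pair only ("first SVD").
    u, _, v = top_singular_pair(G)
    return -radius * np.outer(u, v)

G = np.random.default_rng(1).standard_normal((60, 40))
V = lmo_nuclear(G)                           # conjugate step: leading pair only
U_full, S_full, Vt_full = np.linalg.svd(G)   # what a projection would require
```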

(26)

Rate of Convergence

Theorem [Ma11]: If $X$ is a compact convex set, $f$ is smooth with constant $L$, and $s_k = \frac{2}{k+2}$, then the iterates generated by Frank-Wolfe satisfy:
$f(x_k) - f(x^*) \le \dfrac{4 L\, d(X)^2}{k+2}$, where $d(X)$ is the diameter of $X$. (Sub-optimal compared to the $O(1/k^2)$ accelerated rate for smooth problems.)

Proof sketch:
• $f(x_{k+1}) \le f(x_k) + s_k \nabla f(x_k)^T (y_{k+1} - x_k) + \frac{s_k^2 L}{2}\, d(X)^2$
• $\Delta_{k+1} \le (1 - s_k)\, \Delta_k + \frac{s_k^2 L}{2}\, d(X)^2$, where $\Delta_k = f(x_k) - f(x^*)$ (solve the recursion; see below)

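Filling in the "solve the recursion" step (a standard induction, under the convention $s_k = \frac{2}{k+2}$ and with $C = \frac{L}{2} d(X)^2$): assume $\Delta_k \le \frac{4C}{k+2}$. Then
$$\Delta_{k+1} \le \Big(1 - \tfrac{2}{k+2}\Big)\frac{4C}{k+2} + \frac{4C}{(k+2)^2} = \frac{4C(k+1)}{(k+2)^2} \le \frac{4C}{k+3},$$
using $(k+1)(k+3) \le (k+2)^2$. The base case follows from the first descent inequality with $s_0 = 1$, which gives $\Delta_1 \le C \le \frac{4C}{3}$. Hence $\Delta_k \le \frac{4C}{k+2} = \frac{2 L\, d(X)^2}{k+2}$, which in particular implies the $\frac{4 L\, d(X)^2}{k+2}$ bound stated in the theorem.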

(28)

Sparse Representation – Optimality

• If $x_0 = 0$ and the domain is the $\ell_1$ ball, then $x_k$ is a combination of at most $k$ extreme points and hence has at most $k$ non-zeros: we get exact sparsity! (unlike projected gradient)

• Sparse representation by extreme points

• To reach accuracy $\epsilon \approx O\big(L\, d(X)^2 / k\big)$ one needs at least $k$ non-zeros [Ma11]

• Optimal in terms of the accuracy-sparsity trade-off, though not in terms of accuracy vs. iterations

[Figure: the $\ell_1$ ball in $\mathbb{R}^2$ with extreme points $(1,0)$, $(0,1)$, $(-1,0)$, $(0,-1)$.]


(30)

Summary comparison of always feasible methods

Property                Projected Gradient    Frank-Wolfe
Rate of convergence     +                     -
Sparse solutions        -                     +
Iteration complexity    -                     +
Affine invariance       -                     +

(31)

Functional Constraint

BASED METHODS

(32)

Assumptions

$\min_{x\in\mathbb{R}^n}\; f_0(x) \quad \text{s.t.}\;\; f_i(x) \le 0 \;\;\forall\, i = 1, \dots, n$

• All $f_0, f_i$ are smooth

• $L$ is the maximum smoothness constant among all of them

(33)

Algorithm

At iteration $1 \le k \le N$:

• Check whether $f_i(x_k) \le \frac{R}{\sqrt{N}}\, \|\nabla f_i(x_k)\|$ for all $i$ (here $R$ is the bound on the feasible region from the theorem on the next slide)
  - If yes, then a "productive" step: $i(k) = 0$ (use the objective)
  - If no, then a "non-productive" step: $i(k)$ is set to a violated constraint

• $x_{k+1} = x_k - \dfrac{R}{\sqrt{N}}\, \dfrac{\nabla f_{i(k)}(x_k)}{\|\nabla f_{i(k)}(x_k)\|}$

• Output: $x_N$, the best iterate among the productive steps.
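A minimal Python sketch of this scheme as reconstructed above; the check threshold and step length $R/\sqrt{N}$ follow that reconstruction, and the function names and the small test problem are illustrative.

```python
import numpy as np

def constrained_subgradient(f0, grad0, fis, gradis, x0, R, N):
    # "Productive" step: all constraints satisfy f_i(x) <= gamma * ||grad f_i(x)||,
    # so step along the objective gradient; otherwise ("non-productive") step
    # along a violated constraint's gradient.  Return the best productive iterate.
    x = np.asarray(x0, dtype=float)
    gamma = R / np.sqrt(N)
    best_x, best_val = None, np.inf
    for _ in range(N):
        violator = None
        for fi, gi in zip(fis, gradis):
            if fi(x) > gamma * np.linalg.norm(gi(x)):
                violator = gi
                break
        g = grad0(x) if violator is None else violator(x)
        if violator is None and f0(x) < best_val:   # productive step
            best_x, best_val = x.copy(), f0(x)
        x = x - gamma * g / max(np.linalg.norm(g), 1e-12)
    return best_x

# Example: minimize (x1 - 2)^2 + (x2 - 2)^2 subject to x1 + x2 - 1 <= 0.
f0 = lambda x: (x[0] - 2) ** 2 + (x[1] - 2) ** 2
grad0 = lambda x: np.array([2 * (x[0] - 2), 2 * (x[1] - 2)])
f1 = lambda x: x[0] + x[1] - 1
grad1 = lambda x: np.array([1.0, 1.0])
x_hat = constrained_subgradient(f0, grad0, [f1], [grad1], np.zeros(2), R=3.0, N=5000)
```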

(34)

Does it converge?

Theorem [Ju12]: Let $X$ be bounded (with bound $R$) and $L$ be the smoothness constant (upper bound). Then
• $f_0(x_N) - f_0(x^*) \le \dfrac{L R}{\sqrt{N}}$
• $f_i(x_N) \le \dfrac{L R}{\sqrt{N}} \;\;\forall\, i$

Proof sketch: Suppose $f_0(x_N) - f_0(x^*) > \frac{L R}{\sqrt{N}}$. The standard analysis of the normalized step gives
$\sum_{k=1}^{N} (x_k - x^*)^T \dfrac{\nabla f_{i(k)}(x_k)}{\|\nabla f_{i(k)}(x_k)\|} \le R\sqrt{N}.$
• Non-productive steps: $\frac{R}{\sqrt{N}} (x_k - x^*)^T \dfrac{\nabla f_{i(k)}(x_k)}{\|\nabla f_{i(k)}(x_k)\|} \ge \frac{R^2}{N}$ (by convexity and the failed check, since $f_{i(k)}(x^*) \le 0$)
• Productive steps: $\frac{R}{\sqrt{N}} (x_k - x^*)^T \dfrac{\nabla f_{i(k)}(x_k)}{\|\nabla f_{i(k)}(x_k)\|} > \frac{R^2}{N}$ (by convexity, the assumed sub-optimality, and $\|\nabla f_0\| \le L$)
Summing over the $N$ iterations contradicts the bound above.

(35)

Composite Objective

PROX BASED METHODS

(36)

Composite Objectives

$\min_{w\in\mathbb{R}^n}\; \Omega(w) + \sum_{i=1}^{m} l\big(w^T \phi(x_i),\, y_i\big)$

Non-smooth part: $g(w) = \Omega(w)$. Smooth part: $f(w) = \sum_{i=1}^{m} l\big(w^T \phi(x_i), y_i\big)$.

Key idea: do not approximate the non-smooth part.

(37)

Proximal Gradient Method

• $x_{k+1} = \operatorname{argmin}_{x}\; f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2 s_k}\|x - x_k\|^2 + g(x)$

• If $g$ is an indicator function, this is the same as projected gradient.

• If $g$ is a support function, $g(x) = \max_{y\in S}\; x^T y$, then (assuming the min-max interchange is valid)
  $x_{k+1} = x_k - s_k \nabla f(x_k) - s_k\, \Pi_S\!\Big(\tfrac{1}{s_k}\big(x_k - s_k \nabla f(x_k)\big)\Big)$
  so the prox step is, again, a projection.
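A minimal Python sketch of this method for $g(x) = \lambda\|x\|_1$, which is the support function of $S = \{y : \|y\|_\infty \le \lambda\}$, so $\Pi_S$ is a coordinate-wise clip and the displayed formula reduces to soft-thresholding; the names `prox_l1_via_projection` and `ista`, and the small lasso instance, are illustrative choices.

```python
import numpy as np

def prox_l1_via_projection(z, s, lam):
    # prox_{s*g}(z) for g(x) = lam * ||x||_1, written in the slide's form
    # z - s * Pi_S(z / s) with S = {y : ||y||_inf <= lam}; this equals
    # soft-thresholding of z at level s * lam.
    return z - s * np.clip(z / s, -lam, lam)

def ista(grad_f, x0, step, lam, n_iters=200):
    # Proximal gradient for f(x) + lam * ||x||_1.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        z = x - step * grad_f(x)                   # gradient step on the smooth part
        x = prox_l1_via_projection(z, step, lam)   # prox step on the non-smooth part
    return x

# Example: l1-regularized least squares (lasso).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2                      # smoothness constant of the quadratic
x_hat = ista(grad, np.zeros(5), step=1.0 / L, lam=0.5)
```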


(39)

Rate of Convergence

Theorem [Ne04]: If $f$ is smooth with constant $L$ and $s_k = \frac{1}{L}$, then the proximal gradient method generates $x_k$ such that:
$f(x_k) - f(x^*) \le \dfrac{L\, \|x_0 - x^*\|^2}{2k}.$

• Can be accelerated to $O(1/k^2)$

• Composite problems achieve the same rate as smooth ones, provided a proximal oracle exists!

(40)

Bibliography

[Ne04] Nesterov, Yurii. Introductory lectures on convex optimization : a basic course. Kluwer Academic Publ., 2004. http://hdl.handle.net/2078.1/116858.

[Ne83] Nesterov, Yurii. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, Vol. 27(2), 372-376 pages.

[Mo12] Moritz Hardt, Guy N. Rothblum and Rocco A. Servedio. Private data release via learning thresholds. SODA 2012, 168-187 pages.

[Be09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal of Imaging Sciences, Vol. 2(1), 2009. 183-202 pages.

[De13] Olivier Devolder, François Glineur and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming 2013.

[FW59] Marguerite Frank and Philip Wolfe. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 1959, Vol 3, 95-110 pages.

(41)

Bibliography

[Ma11] Martin Jaggi. Sparse Convex Optimization Methods for Machine Learning. PhD Thesis, 2011.

[Ju12] A Juditsky and A Nemirovski. First Order Methods for Non-smooth Convex Large-Scale Optimization, I: General Purpose Methods. Optimization methods for machine learning. The MIT Press, 2012. 121-184 pages.
