First order methods
FOR CONVEX OPTIMIZATION
Saketh (IIT Bombay)
Topics
• Part I
  • Optimal methods for unconstrained convex programs
    • Smooth objective
    • Non-smooth objective
• Part II
  • Optimal methods for constrained convex programs
    • Projection based
    • Frank-Wolfe based
    • Functional constraint based
  • Prox-based methods for structured non-smooth programs
Non-Topics
• Step-size schemes
• Bundle methods
• Stochastic methods
• Inexact oracles
• Non-Euclidean extensions (Mirror-friends)
Motivation
& EXAMPLE APPLICATIONS
Machine Learning Applications
• Goal: Construct f : X → Y
• Input data: {(x_1, y_1), …, (x_m, y_m)}
• Model: f(x) = wᵀφ(x)
• Algorithm: Find simple functions that explain the data
  min_{w∈ℝⁿ} Ω(w) + Σ_{i=1}^{m} ℓ(wᵀφ(x_i), y_i)
Typical Program – Machine Learning
• Unconstrained/Constrained
  • Simple domains (e.g. the simplex), smooth functional constraints (e.g. multiple models)
• Smooth/Non-smooth/Composite objectives (both the regularizer Ω and the loss ℓ may be smooth or non-smooth)
  min_{w∈D} Ω(w) + Σ_{i=1}^{m} ℓ(wᵀφ(x_i), y_i)   s.t. G(w) ≤ 0
Scale is the issue!
• m, n, as well as the number of models, may run into millions!
• Even a single iteration of IPM/Newton-variants is infeasible.
• "Slower" but "cheaper" methods are the alternative:
  • Decomposition based
  • First order methods
First Order Methods - Overview
• Iterative, gradient-like information, O(n) per iteration
  • E.g. gradient method, cutting planes, conjugate gradient
• Very old methods (1950s)
• Far slower than IPM:
  • Sub-linear rate (not crucial for ML)
  • But (nearly) n-independent
• Widely employed in state-of-the-art ML systems
• Choice of variant depends on problem structure
Smooth Unconstrained
  min_{w∈ℝⁿ} Σ_{i=1}^{m} (wᵀφ(x_i) − y_i)²
Smooth Convex Functions
• Continuously differentiable
• Gradient is Lipschitz continuous:
  • ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖
  • E.g. f(x) ≡ x² is not L-continuous over ℝ, but is over [0,1] with L = 2
  • E.g. f(x) ≡ |x| is L-continuous with L = 1
Theorem: Let f be convex and twice differentiable. Then f is smooth with constant L ⇔ f″(x) ⪯ L·I_n for all x.
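As an illustration of this Hessian criterion (my own addition), applied to the least-squares objective from the "Smooth Unconstrained" slide, with Φ the matrix whose i-th row is φ(x_i)ᵀ:

```latex
f(w) = \sum_{i=1}^{m}\bigl(w^\top\phi(x_i) - y_i\bigr)^2 = \|\Phi w - y\|^2
\;\Rightarrow\;
\nabla^2 f(w) = 2\,\Phi^\top\Phi \preceq 2\,\lambda_{\max}(\Phi^\top\Phi)\, I_n
\;\Rightarrow\;
L = 2\,\lambda_{\max}(\Phi^\top\Phi).
```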
Gradient Method [Cauchy 1847]
• Move iterate in the direction of instantaneous decrease
  • x_{k+1} = x_k − s_k ∇f(x_k),  s_k > 0
• Equivalently, regularized minimization of the first order approximation
  • x_{k+1} = argmin_{x∈ℝⁿ} f(x_k) + ∇f(x_k)ᵀ(x − x_k) + (1/(2s_k)) ‖x − x_k‖²
• Various step-size schemes
  • Constant (1/L)
  • Diminishing (s_k → 0, Σ_k s_k = ∞)
  • Exact or back-tracking line search
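A minimal sketch of the method with the constant step s_k = 1/L, run on the least-squares example from the earlier slide (my own illustration; the synthetic data and all names are assumptions, not from the slides):

```python
import numpy as np

def gradient_method(grad, L, x0, iters=500):
    """Gradient method with the constant step size s_k = 1/L."""
    x = x0.copy()
    for _ in range(iters):
        x = x - (1.0 / L) * grad(x)        # x_{k+1} = x_k - s_k * grad f(x_k)
    return x

# Least-squares objective f(w) = ||Phi w - y||^2 from the "Smooth Unconstrained" slide,
# on synthetic data (for illustration only).
rng = np.random.default_rng(0)
m, n = 50, 10
Phi, y = rng.standard_normal((m, n)), rng.standard_normal(m)
grad = lambda w: 2 * Phi.T @ (Phi @ w - y)
L = 2 * np.linalg.eigvalsh(Phi.T @ Phi).max()   # smoothness constant (see previous slide)

w = gradient_method(grad, L, np.zeros(n))
print(np.linalg.norm(grad(w)))                   # should be near 0 at the minimizer
```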
Convergence rate – Gradient method
Theorem [Ne04]: If f is smooth with constant L and s_k = 1/L, then the gradient method generates x_k such that
  f(x_k) − f(x*) ≤ 2L‖x₀ − x*‖² / (k + 4).
Proof Sketch:
• f(x_{k+1}) ≤ f(x_k) + ∇f(x_k)ᵀ(x_{k+1} − x_k) + (L/2)‖x_{k+1} − x_k‖²   (majorization minimization)
• f(x_{k+1}) ≥ f(x_k) + ∇f(x_k)ᵀ(x_{k+1} − x_k) + (1/(2L))‖∇f(x_{k+1}) − ∇f(x_k)‖²
• ‖x_{k+1} − x*‖² ≤ ‖x_k − x*‖² − (1/L²)‖∇f(x_k)‖²
• Δ_{k+1} ≤ Δ_k − Δ_k²/(2L‖x₀ − x*‖²), where Δ_k = f(x_k) − f(x*)   (solve the recursion)
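To make the final "solve the recursion" step explicit (my own filling-in, writing r₀ = ‖x₀ − x*‖ and dividing the recursion by Δ_k Δ_{k+1}, then using Δ_{k+1} ≤ Δ_k and Δ₀ ≤ (L/2)r₀²):

```latex
\frac{1}{\Delta_{k+1}} \;\ge\; \frac{1}{\Delta_k} + \frac{\Delta_k}{2Lr_0^2\,\Delta_{k+1}}
\;\ge\; \frac{1}{\Delta_k} + \frac{1}{2Lr_0^2}
\;\;\Longrightarrow\;\;
\frac{1}{\Delta_k} \;\ge\; \frac{1}{\Delta_0} + \frac{k}{2Lr_0^2} \;\ge\; \frac{k+4}{2Lr_0^2}
\;\;\Longrightarrow\;\;
\Delta_k \;\le\; \frac{2Lr_0^2}{k+4}.
```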
Comments on rate of convergence
• Sub-linear, very slow compared to IPM
• Applies to conjugate gradient and other traditional variants
• Additionally, strong convexity gives a linear rate: ε ∈ O(((Q−1)/(Q+1))^{2k}), where Q = L/μ
• Sub-optimal (maybe?):
Theorem [Ne04]: For any k ≤ (n−1)/2 and any x₀, there exists a smooth function f, with constant L, such that any first order method satisfies
  f(x_k) − f(x*) ≥ 3L‖x₀ − x*‖² / (32(k + 1)²).
Proof Sketch: Choose the function such that x_k ∈ Lin{∇f(x₀), …, ∇f(x_{k−1})} ⊆ ℝ^{k,n}.
Intuition for non-optimality
• All variants are descent methods
• Descent essential for proof
• Overkill leading to restrictive movements
• Try non-descent alternatives!
Towards optimality [Moritz Hardt]
• f(x) = ½ xᵀAx − bᵀx;  x₀ = 0
• x_k = x_{k−1} − (1/L)(Ax_{k−1} − b) = Σ_{i=0}^{k} (I − A/L)^i (b/L)
  • i.e., x_k − x* = p(A)(x₀ − x*) for a degree-k polynomial p with p(0) = 1. Sub-optimal: error O(e^{−k/Q})
Lemma [Mo12]: There is a (Chebyshev) polynomial of degree O(√Q log(1/ε)) such that p(0) = 1 and |p(x)| ≤ ε ∀ x ∈ [μ, L].
• Chebyshev polynomials have a two-term recursive formula (each one is built from the previous two), hence we expect:
  • x_k = x_{k−1} − s_{k−1}∇f(x_{k−1}) + c_{k−1}∇f(x_{k−2}) to be optimal (acceleration). Optimal: error O(e^{−k/√Q})
Accelerated Gradient Method [Ne83, Ne88, Be09]
• y_k = x_{k−1} + ((k−2)/(k+1)) (x_{k−1} − x_{k−2})   (extrapolation or momentum; uses a two-step history)
• x_k = y_k − s_k ∇f(y_k)   (usual gradient step)
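A minimal sketch of this scheme with s_k = 1/L (my own illustration; the synthetic data and names are assumptions, reusing the least-squares example from the gradient-method sketch):

```python
import numpy as np

def accelerated_gradient(grad, L, x0, iters=500):
    """Accelerated gradient method: momentum (extrapolation) step, then a gradient step."""
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(1, iters + 1):
        beta = (k - 2.0) / (k + 1.0)              # momentum coefficient from the slide
        y = x + beta * (x - x_prev)               # y_k = x_{k-1} + beta * (x_{k-1} - x_{k-2})
        x_prev, x = x, y - (1.0 / L) * grad(y)    # x_k = y_k - s_k * grad f(y_k)
    return x

# Same synthetic least-squares objective as in the earlier gradient-method sketch.
rng = np.random.default_rng(0)
m, n = 50, 10
Phi, targets = rng.standard_normal((m, n)), rng.standard_normal(m)
grad = lambda w: 2 * Phi.T @ (Phi @ w - targets)
L = 2 * np.linalg.eigvalsh(Phi.T @ Phi).max()

w = accelerated_gradient(grad, L, np.zeros(n))
print(np.linalg.norm(grad(w)))   # typically much closer to 0 than plain gradient descent
```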
Rate of Convergence – Accelerated gradient
Theorem [Be09]: If f is smooth with constant L and s_k = 1/L, then the accelerated gradient method generates x_k such that
  f(x_k) − f(x*) ≤ 2L‖x₀ − x*‖² / (k + 1)².   (Indeed optimal!)
Proof Sketch:
• f(x_k) ≤ f(z) + L(x_k − y_k)ᵀ(z − x_k) + (L/2)‖x_k − y_k‖²  ∀ z ∈ ℝⁿ
• A convex combination of this inequality at z = x_k and z = x* leads to (with u_k an auxiliary extrapolation sequence):
  ((k+1)²/(2L)) (f(x_k) − f*) + ‖u_k − x*‖² ≤ (k²/(2L)) (f(x_{k−1}) − f*) + ‖u_{k−1} − x*‖² ≤ … ≤ ‖x₀ − x*‖²
A Comparison of the two gradient methods
• Test problem: min_{x∈ℝ^1000} log Σ_{i=1}^{2000} exp(a_iᵀx + b_i)
• [Figure from L. Vandenberghe's EE236C notes: objective error versus iteration count for the gradient and accelerated gradient methods]
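The plot itself cannot be reproduced here, but an experiment in the same spirit on a smaller random instance of the same log-sum-exp objective (entirely my own sketch; the dimensions, data, and the conservative choice L = ‖A‖₂² are assumptions) looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 100                                   # the slide uses 2000 x 1000
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

def f(x):                                         # f(x) = log sum_i exp(a_i^T x + b_i)
    z = A @ x + b
    zmax = z.max()
    return zmax + np.log(np.exp(z - zmax).sum())

def grad(x):
    z = A @ x + b
    p = np.exp(z - z.max())
    p /= p.sum()
    return A.T @ p

L = np.linalg.norm(A, 2) ** 2                     # a conservative smoothness bound for this f
x_g = np.zeros(n)                                 # gradient-method iterate
x_a, x_prev = np.zeros(n), np.zeros(n)            # accelerated-method iterates
for k in range(1, 501):
    x_g = x_g - grad(x_g) / L                     # gradient method
    y = x_a + (k - 2.0) / (k + 1.0) * (x_a - x_prev)
    x_prev, x_a = x_a, y - grad(y) / L            # accelerated gradient method
print(f"gradient: {f(x_g):.4f}   accelerated: {f(x_a):.4f}")
```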
Junk variants other than Accelerated gradient?
• Accelerated gradient is
  • Less robust than gradient method [Moritz Hardt]
  • Accumulates error with inexact oracles [De13]
• Who knows what will happen in your application?
Summary of unconstrained smooth convex programs
• Gradient method and friends: ε ∈ O(1/k)
  • Sub-linear and sub-optimal rate
  • Additionally, strong convexity gives ε ∈ O(((Q−1)/(Q+1))^{2k}): sub-optimal but linear rate
• Accelerated gradient methods: ε ∈ O(1/k²)
  • Sub-linear but optimal
  • O(n) computation per iteration
  • Additionally, strong convexity gives ε ∈ O(((√Q−1)/(√Q+1))^{2k}): optimal, and still a linear rate
Non-smooth unconstrained
  min_{w∈ℝⁿ} Σ_{i=1}^{m} |wᵀφ(x_i) − y_i|
What is first order info?
• At a point x₀ where f is differentiable: the tangent model ℓ(x) ≡ f(x₀) + ∇f(x₀)ᵀ(x − x₀)
• At a point x₁ of non-differentiability, canonical form: ℓ(x) ≡ f(x₁) + gᵀ(x − x₁). Multiple g exist such that ℓ(x) ≤ f(x) ∀x.
• Any such g is defined as a sub-gradient.
First Order Methods (Non-smooth)
Theorem: Let f be a closed convex function. Then:
• At any x ∈ ri(dom f), a sub-gradient exists, and the set of all sub-gradients (denoted ∂f(x); the sub-differential set) is closed and convex.
• If f is differentiable at x ∈ int(dom f), then the gradient is the only sub-gradient.
Theorem: x ∈ ℝⁿ minimizes f if and only if 0 ∈ ∂f(x).
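As a concrete example of both statements (my own addition): for f(x) = |x| on ℝ,

```latex
\partial f(x) =
\begin{cases}
\{-1\}, & x < 0,\\
[-1,\,1], & x = 0,\\
\{+1\}, & x > 0,
\end{cases}
\qquad\text{and}\qquad
0 \in \partial f(0) \;\Longleftrightarrow\; x = 0 \text{ minimizes } |x|.
```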
Sub-gradient Method
• Assume an oracle that throws a sub-gradient.
• Sub-gradient method:
  • x_{k+1} = argmin_{x∈ℝⁿ} f(x_k) + g_f(x_k)ᵀ(x − x_k) + (1/(2s_k)) ‖x − x_k‖²
  • i.e., x_{k+1} = x_k − s_k g_f(x_k)
Can sub-gradient replace gradient?
• No majorization minimization
• −g_f(x) is not even a descent direction
  • E.g. f(x₁, x₂) ≡ |x₁| + 2|x₂|: at the point (1, 0), g = (1, 2) is a valid sub-gradient, yet −g is an ascent direction.
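A minimal sketch of the sub-gradient method on the L1-regression objective from the "Non-smooth unconstrained" slide (my own illustration; the step-size rule follows the analysis on the next slide, and R, L, and the data are assumptions):

```python
import numpy as np

def subgradient_method(oracle, x0, R, L, iters):
    """Sub-gradient method with constant step s = R / (L * sqrt(iters + 1)).
    Tracks the best objective value seen, since the iterates need not decrease f."""
    x = x0.copy()
    best_x, best_val = x.copy(), np.inf
    s = R / (L * np.sqrt(iters + 1))
    for _ in range(iters):
        g, val = oracle(x)                 # a sub-gradient and the objective value at x
        if val < best_val:
            best_x, best_val = x.copy(), val
        x = x - s * g                      # x_{k+1} = x_k - s_k * g_f(x_k)
    return best_x, best_val

# L1 regression: f(w) = sum_i |w^T phi(x_i) - y_i| on synthetic data (an assumption).
rng = np.random.default_rng(0)
m, n = 50, 10
Phi, y = rng.standard_normal((m, n)), rng.standard_normal(m)

def oracle(w):
    r = Phi @ w - y
    return Phi.T @ np.sign(r), np.abs(r).sum()

# R and L below are rough, hand-picked bounds for this synthetic instance.
w_best, f_best = subgradient_method(oracle, np.zeros(n), R=10.0, L=np.abs(Phi).sum(), iters=5000)
print(f_best)
```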
How far can sub-gradient take?
Expect slower than O(1/k)
Theorem [Ne04]: Let ‖x₀ − x*‖ ≤ R and let L be the Lipschitz constant of f over this ball (such an L always exists). Then the sequence generated by sub-gradient descent satisfies:
  min_{i∈{1,…,k}} f(x_i) − f(x*) ≤ LR / √(k + 1).
Proof Sketch:
• 2s_i Δ_i ≤ r_i² − r_{i+1}² + s_i² ‖g_f(x_i)‖²,  where Δ_i = f(x_i) − f(x*) and r_i = ‖x_i − x*‖
• Summing: min_i Δ_i ≤ (r₀² + Σ_{i=0}^{k} s_i² ‖g_f(x_i)‖²) / (2 Σ_{i=0}^{k} s_i) ≤ (R² + L² Σ_{i=0}^{k} s_i²) / (2 Σ_{i=0}^{k} s_i); choose s_i = R / (L√(k + 1))
Is this optimal?
Theorem [Ne04]: For any k ≤ n − 1, and any x₀ such that ‖x₀ − x*‖ ≤ R, there exists a convex f, with Lipschitz constant L over the ball, such that any first order method satisfies:
  f(x_k) − f(x*) ≥ LR / (2(1 + √(k + 1))).
Proof Sketch: Choose the function such that x_k ∈ Lin{g_f(x₀), …, g_f(x_{k−1})} ⊆ ℝ^{k,n}.
Summary of non-smooth unconstrained
• Sub-gradient descent method: ε ∈ O(1/√k)
  • Sub-linear, slower than the smooth case
  • But, optimal!
  • Can do better with additional structure (later)
Summary of Unconstrained Case
[Figure: accuracy versus iteration count for the three regimes: non-smooth (sub-gradient), smooth (gradient), and smooth (accelerated gradient)]
Bibliography
• [Ne04] Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004. http://hdl.handle.net/2078.1/116858
• [Ne83] Nesterov, Yurii. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, Vol. 27(2), pp. 372-376, 1983.
• [Mo12] Moritz Hardt, Guy N. Rothblum and Rocco A. Servedio. Private data release via learning thresholds. SODA 2012, pp. 168-187.
• [Be09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, Vol. 2(1), pp. 183-202, 2009.
• [De13] Olivier Devolder, François Glineur and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 2013.