
(1)

Part 2: First-Order Methods for Convex Optimization

Mário A. T. Figueiredo¹ and Stephen J. Wright²

¹Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal

²Computer Sciences Department, University of Wisconsin, Madison, WI, USA

HIM, January 2016

(2)

Focus (Initially) on Smooth Convex Functions

Consider $\min_{x \in \mathbb{R}^n} f(x)$, with $f$ smooth and convex.

Usually assume $\mu I \preceq \nabla^2 f(x) \preceq L I$, $\forall x$, with $0 \le \mu \le L$ (thus $L$ is a Lipschitz constant of $\nabla f$).

If $\mu > 0$, then $f$ is $\mu$-strongly convex (as seen in Part 1) and
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2} \|y - x\|_2^2.$$
Define the conditioning (or condition number) as $\kappa := L/\mu$.

We are often interested in convex quadratics:
$$f(x) = \tfrac{1}{2} x^T A x, \quad \mu I \preceq A \preceq L I, \qquad \text{or} \qquad f(x) = \tfrac{1}{2}\|Bx - b\|_2^2, \quad \mu I \preceq B^T B \preceq L I.$$

(3)

What’s the Setup?

We consider iterative algorithms: generate $\{x_k\}$, $k = 0, 1, 2, \dots$, from $x_{k+1} = \Phi(x_k)$, or $x_{k+1} = \Phi(x_k, x_{k-1})$, or $x_{k+1} = \Phi(x_k, x_{k-1}, \dots, x_1, x_0)$.

For now, assume we can evaluate $f(x_k)$ and $\nabla f(x_k)$ at each iteration.

Later, we look at broader classes of problems:

nonsmooth $f$;

$f$ not available (or too expensive to evaluate exactly);

only an estimate of the gradient is available;

a constraint $x \in \Omega$, usually for a simple $\Omega$ (e.g. ball, box, simplex);

nonsmooth regularization; i.e., instead of simply $f(x)$, we want to minimize $f(x) + \tau\psi(x)$.

We focus on algorithms that can be adapted to those scenarios.

(4)

Steepest Descent

Steepest descent (a.k.a. gradient descent):

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k), \quad \text{for some } \alpha_k > 0.$$

Different ways to select an appropriate $\alpha_k$:

1. Interpolating scheme with safeguarding to identify an approximately minimizing $\alpha_k$.

2. Backtrack: try $\bar{\alpha}, \frac{1}{2}\bar{\alpha}, \frac{1}{4}\bar{\alpha}, \frac{1}{8}\bar{\alpha}, \dots$ until sufficient decrease in $f$.

3. Don't test for function decrease; use rules based on $L$ and $\mu$.

4. Set $\alpha_k$ based on experience with similar problems, or adaptively.

Analysis for 1 and 2 usually yields global convergence at unspecified rate.

The “greedy” strategy of getting good decrease in the current search direction may lead to better practical results.

Analysis for 3: Focuses on convergence rate, and leads to accelerated multi-step methods.

(5)

Line Search

Seek $\alpha_k$ that satisfies the Wolfe conditions: "sufficient decrease" in $f$:
$$f(x_k - \alpha_k \nabla f(x_k)) \le f(x_k) - c_1 \alpha_k \|\nabla f(x_k)\|^2, \quad (0 < c_1 \ll 1)$$
while "not being too small" (significant increase in the directional derivative):
$$\nabla f(x_{k+1})^T \nabla f(x_k) \ge -c_2 \|\nabla f(x_k)\|^2, \quad (c_1 < c_2 < 1).$$
(Works for nonconvex $f$.) Can show that accumulation points $\bar{x}$ of $\{x_k\}$ are stationary: $\nabla f(\bar{x}) = 0$ (thus minimizers, if $f$ is convex).

Can do a one-dimensional line search for $\alpha_k$, taking minima of quadratic or cubic interpolations of the function and gradient at the last two values tried. Use bracketing to stabilize. Usually finds a suitable $\alpha$ within 3 attempts. (Nocedal and Wright, 2006, Chapter 3)

(6)

Backtracking

Try $\alpha_k = \bar{\alpha}, \frac{\bar{\alpha}}{2}, \frac{\bar{\alpha}}{4}, \frac{\bar{\alpha}}{8}, \dots$ until the sufficient decrease condition is satisfied.

No need to check the second Wolfe condition: the $\alpha_k$ thus identified is "within striking distance" of an $\alpha$ that's too large — so it is not too short.

Backtracking is widely used in applications, but doesn’t work on nonsmooth problems, or when f is not available / too expensive.
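To make this concrete, here is a minimal Python sketch of steepest descent with the backtracking rule just described; the function and parameter names (and the choice c1 = 1e-4) are illustrative, not from the slides.

```python
import numpy as np

def gradient_descent_backtracking(f, grad, x0, alpha_bar=1.0, c1=1e-4,
                                  max_iter=500, tol=1e-8):
    """Steepest descent with backtracking: try alpha_bar, alpha_bar/2, ...
    until the sufficient decrease condition holds."""
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        gnorm2 = g @ g
        if np.sqrt(gnorm2) < tol:
            break
        alpha = alpha_bar
        # Halve alpha until f(x - alpha*g) <= f(x) - c1*alpha*||g||^2.
        while f(x - alpha * g) > f(x) - c1 * alpha * gnorm2:
            alpha *= 0.5
        x = x - alpha * g
    return x

# Example: convex quadratic f(x) = 0.5 x^T A x with A = diag(1, 10).
A = np.diag([1.0, 10.0])
print(gradient_descent_backtracking(lambda x: 0.5 * x @ A @ x,
                                    lambda x: A @ x, np.array([1.0, 1.0])))
```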

(7)

Constant (Short) Steplength

By elementary use of Taylor's theorem, and since $\nabla^2 f(x) \preceq LI$,
$$f(x_{k+1}) \le f(x_k) - \alpha_k \left(1 - \frac{\alpha_k}{2} L\right) \|\nabla f(x_k)\|_2^2.$$
For $\alpha_k \equiv 1/L$,
$$f(x_{k+1}) \le f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|_2^2, \quad \text{thus} \quad \|\nabla f(x_k)\|^2 \le 2L\left[f(x_k) - f(x_{k+1})\right].$$
Summing for $k = 0, 1, \dots, N$ and telescoping the sum,
$$\sum_{k=0}^{N} \|\nabla f(x_k)\|^2 \le 2L\left[f(x_0) - f(x_{N+1})\right].$$
It follows that $\nabla f(x_k) \to 0$ if $f$ is bounded below.

(8)

Rate Analysis

Suppose that the minimizer $x^*$ is unique.

Another elementary use of Taylor's theorem shows that
$$\|x_{k+1} - x^*\|^2 \le \|x_k - x^*\|^2 - \alpha_k \left(\frac{2}{L} - \alpha_k\right) \|\nabla f(x_k)\|^2,$$
so that $\{\|x_k - x^*\|\}$ is decreasing.

Define for convenience: $\Delta_k := f(x_k) - f(x^*)$. By convexity, we have
$$\Delta_k \le \nabla f(x_k)^T (x_k - x^*) \le \|\nabla f(x_k)\| \, \|x_k - x^*\| \le \|\nabla f(x_k)\| \, \|x_0 - x^*\|.$$
From the previous page (subtracting $f(x^*)$ from both sides of the inequality), and using the inequality above, we have
$$\Delta_{k+1} \le \Delta_k - \frac{1}{2L} \|\nabla f(x_k)\|^2 \le \Delta_k - \frac{1}{2L\|x_0 - x^*\|^2} \Delta_k^2.$$

(9)

Weakly convex: 1/k sublinear rate

Take the reciprocal of both sides and manipulate (using $(1-\epsilon)^{-1} \ge 1 + \epsilon$):
$$\frac{1}{\Delta_{k+1}} \ge \frac{1}{\Delta_k} + \frac{1}{2L\|x_0 - x^*\|^2} \ge \frac{1}{\Delta_0} + \frac{k+1}{2L\|x_0 - x^*\|^2} \ge \frac{k+1}{2L\|x_0 - x^*\|^2},$$
which yields
$$f(x_{k+1}) - f(x^*) \le \frac{2L\|x_0 - x^*\|^2}{k+1}.$$
The classic $1/k$ convergence rate!

(10)

Strongly convex: Linear rate

From the strong convexity condition, we have for any $z$:
$$f(z) \ge f(x_k) + \nabla f(x_k)^T (z - x_k) + \frac{\mu}{2} \|z - x_k\|^2.$$
By minimizing both sides w.r.t. $z$ we obtain
$$f(x^*) \ge f(x_k) - \frac{1}{2\mu} \|\nabla f(x_k)\|^2, \quad \text{so that} \quad \|\nabla f(x_k)\|^2 \ge 2\mu \left(f(x_k) - f(x^*)\right).$$
Recall too that for step $\alpha_k \equiv 1/L$ we have
$$f(x_{k+1}) \le f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|_2^2.$$
By subtracting $f(x^*)$ from both sides of this expression we have
$$f(x_{k+1}) - f(x^*) \le \left(1 - \frac{\mu}{L}\right) \left(f(x_k) - f(x^*)\right).$$

(11)

Linear convergence without strong convexity

The linear convergence analysis depended on two bounds:

$$f(x_{k+1}) \le f(x_k) - a_1 \|\nabla f(x_k)\|^2, \qquad (1)$$
$$\|\nabla f(x_k)\|^2 \ge a_2 \left(f(x_k) - f(x^*)\right), \qquad (2)$$
for some positive $a_1, a_2$. In fact, many algorithms that use first derivatives (or estimates) satisfy a bound like (1).

We derived (2) from strong convexity, but it also holds for interesting cases that are not strongly convex:

Quadratic growth condition: $f(x) - f^* \ge a_2\,\mathrm{dist}(x, \text{solution set})^2$, for some $a_2 > 0$. Allows nonunique solutions.

(2) is a special case of a Kurdyka-Łojasiewicz property, which holds in many interesting situations — even for nonconvex $f$, near a local min.

$f(x) = \sum_{i=1}^{m} h(a_i^T x)$, where $h : \mathbb{R} \to \mathbb{R}$ is strongly convex, even when $m < n$, in which case $\nabla^2 f(x)$ is singular.

(12)

Exact minimizing $\alpha_k$: Faster rate?

Question: does takingαk as the exact minimizer off along−∇f(xk) yield better rate of linear convergence?

Considerf(x) = 12xTA x (thus x = 0 andf(x) = 0.) We have ∇f(xk) =A xk. Exactly minimizing w.r.t. αk,

αk = arg min

α

1

2(xk −αAxk)TA(xk −αAxk) = xkTA2xk xkTA3xk

∈ 1

L,1 µ

Thus

f(xk+1)≤f(xk)−1 2

(xkTA2xk)2 (xkTAxk)(xkTA3xk), so, defining zk :=Axk, we have

f(xk+1)−f(x)

f(xk)−f(x) ≤1− kzkk4

(zkTA−1zk)(zkTAzk).

(13)

Exact minimizing $\alpha_k$: Faster rate?

Using the Kantorovich inequality:
$$(z^T A z)(z^T A^{-1} z) \le \frac{(L+\mu)^2}{4L\mu} \|z\|^4.$$
Thus
$$\frac{f(x_{k+1}) - f(x^*)}{f(x_k) - f(x^*)} \le 1 - \frac{4L\mu}{(L+\mu)^2} = \left(1 - \frac{2}{\kappa+1}\right)^2, \quad \text{where } \kappa := L/\mu.$$

Only a small factor of improvement in the linear rate over constant steplength.

(14)

The slow linear rate is typical!

Not just a pessimistic bound!

(15)

Multistep Methods: The Heavy-Ball

Enhance the search direction using a contribution from the previous step.

(known as heavy-ball, momentum, or the two-step method)

Consider first a constant step length $\alpha$, and a second parameter $\beta$ for the "momentum" term:
$$x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta (x_k - x_{k-1}).$$
Analyze by defining a composite iterate vector:
$$w_k := \begin{bmatrix} x_k - x^* \\ x_{k-1} - x^* \end{bmatrix}.$$
Thus
$$w_{k+1} = B w_k + o(\|w_k\|), \qquad B := \begin{bmatrix} -\alpha \nabla^2 f(x^*) + (1+\beta) I & -\beta I \\ I & 0 \end{bmatrix}.$$

(16)

Multistep Methods: The Heavy-Ball

Matrix $B$ has the same eigenvalues as
$$\begin{bmatrix} -\alpha \Lambda + (1+\beta) I & -\beta I \\ I & 0 \end{bmatrix}, \qquad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n),$$
where $\lambda_i$ are the eigenvalues of $\nabla^2 f(x^*)$.

Choose $\alpha, \beta$ to explicitly minimize the max eigenvalue of $B$; obtain
$$\alpha = \frac{4}{L} \frac{1}{(1 + 1/\sqrt{\kappa})^2}, \qquad \beta = \left(1 - \frac{2}{\sqrt{\kappa}+1}\right)^2.$$
Leads to linear convergence for $\|x_k - x^*\|$ with rate approximately
$$1 - \frac{2}{\sqrt{\kappa}+1}.$$
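As a concrete illustration, here is a minimal Python sketch of the heavy-ball iteration with the constants above; it assumes $L$ and $\mu$ are known, and all names are illustrative.

```python
import numpy as np

def heavy_ball(grad, x0, L, mu, max_iter=500):
    """Heavy-ball: x_{k+1} = x_k - alpha*grad f(x_k) + beta*(x_k - x_{k-1}),
    with the alpha, beta that minimize the max eigenvalue of B."""
    kappa = L / mu
    alpha = (4.0 / L) / (1.0 + 1.0 / np.sqrt(kappa))**2
    beta = (1.0 - 2.0 / (np.sqrt(kappa) + 1.0))**2
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(max_iter):
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    return x

# Example: quadratic with Hessian diag(1, 100), so kappa = 100.
A = np.diag([1.0, 100.0])
print(heavy_ball(lambda x: A @ x, np.ones(2), L=100.0, mu=1.0))  # near x* = 0
```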

(17)

Summary: Linear Convergence, Strictly Convex f

Steepest descent: linear rate approx $1 - \frac{2}{\kappa}$; Heavy-ball: linear rate approx $1 - \frac{2}{\sqrt{\kappa}}$.

Big difference! To reduce $\|x_k - x^*\|$ by a factor $\epsilon$, need $k$ large enough that
$$\left(1 - \frac{2}{\kappa}\right)^k \le \epsilon \quad \Leftarrow \quad k \ge \frac{\kappa}{2} |\log \epsilon| \quad \text{(steepest descent)}$$
$$\left(1 - \frac{2}{\sqrt{\kappa}}\right)^k \le \epsilon \quad \Leftarrow \quad k \ge \frac{\sqrt{\kappa}}{2} |\log \epsilon| \quad \text{(heavy-ball)}$$
A factor of $\sqrt{\kappa}$ difference; e.g. if $\kappa = 1000$ (not at all uncommon in inverse problems), need $\sim 30$ times fewer steps.

(18)

Conjugate Gradient

Basic conjugate gradient (CG) step is

$$x_{k+1} = x_k + \alpha_k p_k, \qquad p_k = -\nabla f(x_k) + \gamma_k p_{k-1}.$$
Can be identified with heavy-ball, with $\beta_k = \frac{\alpha_k \gamma_k}{\alpha_{k-1}}$.

However, CG can be implemented in a way that doesn't require knowledge (or estimation) of $L$ and $\mu$:

Choose $\alpha_k$ to (approximately) minimize $f$ along $p_k$;

Choose $\gamma_k$ by a variety of formulae (Fletcher-Reeves, Polak-Ribière, etc.), all of which are equivalent if $f$ is a convex quadratic, e.g.
$$\gamma_k = \frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}.$$
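For the convex quadratic case the whole scheme fits in a few lines; the sketch below (illustrative names, quadratic $f(x) = \frac12 x^T A x - b^T x$) uses the exact minimizing step along $p_k$ and the $\gamma_k$ formula above.

```python
import numpy as np

def cg_quadratic(A, b, x0, max_iter=50, tol=1e-10):
    """Conjugate gradient for f(x) = 0.5 x^T A x - b^T x, with
    gamma_k = ||grad f(x_k)||^2 / ||grad f(x_{k-1})||^2."""
    x = x0.copy()
    g = A @ x - b                         # gradient
    p = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        Ap = A @ p
        alpha = (g @ g) / (p @ Ap)        # exact minimizer of f along p
        x = x + alpha * p
        g_new = g + alpha * Ap
        gamma = (g_new @ g_new) / (g @ g)
        p = -g_new + gamma * p
        g = g_new
    return x

A = np.diag([1.0, 4.0, 9.0])
print(cg_quadratic(A, np.ones(3), np.zeros(3)))  # solves A x = 1 in <= 3 steps
```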

(19)

Conjugate Gradient

Nonlinear CG: variants include Fletcher-Reeves, Polak-Ribière, Hestenes.

Restarting periodically with $p_k = -\nabla f(x_k)$ is useful (e.g. every $n$ iterations, or when $p_k$ is not a descent direction).

For quadratic $f$, convergence analysis is based on the eigenvalues of $A$ and Chebyshev polynomials, via min-max arguments. Get:

Finite termination in as many iterations as there are distinct eigenvalues;

Asymptotic linear convergence with rate approx $1 - \frac{2}{\sqrt{\kappa}}$ (like heavy-ball).

(Nocedal and Wright, 2006, Chapter 5)

(20)

Accelerated First-Order Methods

Accelerate the rate to $1/k^2$ for weakly convex, while retaining the linear rate (related to $\sqrt{\kappa}$) for the strongly convex case.

Nesterov (1983) describes a method that requires $\kappa$.

Initialize: Choose $x_0$, $\alpha_0 \in (0,1)$; set $y_0 \leftarrow x_0$.
Iterate: $x_{k+1} \leftarrow y_k - \frac{1}{L} \nabla f(y_k)$; (*short-step*)
find $\alpha_{k+1} \in (0,1)$: $\alpha_{k+1}^2 = (1 - \alpha_{k+1}) \alpha_k^2 + \alpha_{k+1}/\kappa$;
set $\beta_k = \frac{\alpha_k (1 - \alpha_k)}{\alpha_k^2 + \alpha_{k+1}}$;
set $y_{k+1} \leftarrow x_{k+1} + \beta_k (x_{k+1} - x_k)$.

Still works for weakly convex ($\kappa = \infty$).
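A minimal Python sketch of this scheme follows; solving the quadratic for $\alpha_{k+1}$ in closed form (taking the root in $(0,1)$) is an assumption of the sketch, and all names are illustrative.

```python
import numpy as np

def nesterov_constant_step(grad, x0, L, kappa, alpha0=0.5, max_iter=500):
    """Nesterov's constant-step scheme; kappa = L/mu (np.inf if weakly convex)."""
    x = x0.copy()
    y = x0.copy()
    alpha = alpha0
    q = 0.0 if np.isinf(kappa) else 1.0 / kappa
    for _ in range(max_iter):
        x_new = y - grad(y) / L                        # short gradient step
        # Root in (0,1) of: a^2 = (1 - a)*alpha^2 + a*q.
        c = alpha**2
        alpha_new = 0.5 * (q - c + np.sqrt((c - q)**2 + 4.0 * c))
        beta = alpha * (1.0 - alpha) / (alpha**2 + alpha_new)
        y = x_new + beta * (x_new - x)
        x, alpha = x_new, alpha_new
    return x

A = np.diag([1.0, 100.0])
print(nesterov_constant_step(lambda x: A @ x, np.ones(2), L=100.0, kappa=100.0))
```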

(21)

[Figure: geometry of the Nesterov iterates, showing $y_k$, $x_{k+1}$, $x_k$, $y_{k+1}$, $x_{k+2}$, $y_{k+2}$.]

Separates the "gradient descent" and "momentum" step components.

(22)

Convergence Results: Nesterov

If $\alpha_0 \ge 1/\sqrt{\kappa}$, have
$$f(x_k) - f(x^*) \le c_1 \min\left( \left(1 - \frac{1}{\sqrt{\kappa}}\right)^k, \; \frac{4L}{(\sqrt{L} + c_2 k)^2} \right),$$
where the constants $c_1$ and $c_2$ depend on $x_0$, $\alpha_0$, $L$.

Linear convergence at the "heavy-ball" rate for strongly convex $f$; $1/k^2$ sublinear rate otherwise.

In the special case of $\alpha_0 = 1/\sqrt{\kappa}$, this scheme yields
$$\alpha_k \equiv \frac{1}{\sqrt{\kappa}}, \qquad \beta_k \equiv 1 - \frac{2}{\sqrt{\kappa}+1}.$$

(23)

FISTA

Beck and Teboulle (2009) propose a similar algorithm, with a fairly short and elementary analysis (though still not intuitive).

Initialize: Choose $x_0$; set $y_1 = x_0$, $t_1 = 1$.
Iterate: $x_k \leftarrow y_k - \frac{1}{L} \nabla f(y_k)$;
$t_{k+1} \leftarrow \frac{1}{2}\left(1 + \sqrt{1 + 4 t_k^2}\right)$;
$y_{k+1} \leftarrow x_k + \frac{t_k - 1}{t_{k+1}} (x_k - x_{k-1})$.

For (weakly) convex $f$, converges with $f(x_k) - f(x^*) \sim 1/k^2$.

When $L$ is not known, increase an estimate of $L$ until it's big enough.

Beck and Teboulle (2009) do the convergence analysis in 2-3 pages; elementary, but "technical."

(24)

A Non-Monotone Gradient Method: Barzilai-Borwein

Barzilai and Borwein (1988) (BB) proposed an unusual choice of $\alpha_k$. Allows $f$ to increase (sometimes a lot) on some steps: non-monotone.
$$x_{k+1} = x_k - \alpha_k \nabla f(x_k), \qquad \alpha_k := \arg\min_\alpha \|s_k - \alpha z_k\|^2,$$
where
$$s_k := x_k - x_{k-1}, \qquad z_k := \nabla f(x_k) - \nabla f(x_{k-1}).$$
Explicitly, we have
$$\alpha_k = \frac{s_k^T z_k}{z_k^T z_k}.$$
Note that for $f(x) = \frac{1}{2} x^T A x$, we have
$$\alpha_k = \frac{s_k^T A s_k}{s_k^T A^2 s_k} \in \left[\frac{1}{L}, \frac{1}{\mu}\right].$$
BB can be viewed as a quasi-Newton method, with the Hessian approximated by $\alpha_k^{-1} I$.
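A sketch of the plain (unsafeguarded) BB iteration in Python; names are illustrative, and as noted, $f$ may increase on some steps.

```python
import numpy as np

def barzilai_borwein(grad, x0, alpha0=1.0, max_iter=200, tol=1e-8):
    """Non-monotone BB gradient method with alpha_k = s^T z / z^T z."""
    x = x0.copy()
    g = grad(x)
    alpha = alpha0
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        x_new = x - alpha * g
        g_new = grad(x_new)
        s = x_new - x               # s_k = x_k - x_{k-1}
        z = g_new - g               # z_k = grad f(x_k) - grad f(x_{k-1})
        alpha = (s @ z) / (z @ z)   # BB steplength
        x, g = x_new, g_new
    return x

A = np.diag([1.0, 100.0])
print(barzilai_borwein(lambda x: A @ x, np.ones(2)))  # near x* = 0
```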

(25)

Comparison: BB vs Greedy Steepest Descent

(26)

There Are Many BB Variants

use $\alpha_k = \frac{s_k^T s_k}{s_k^T z_k}$ in place of $\alpha_k = \frac{s_k^T z_k}{z_k^T z_k}$;

alternate between these two formulae;

hold $\alpha_k$ constant for a number (2, 3, 5) of successive steps;

take $\alpha_k$ to be the steepest descent step from the previous iteration.

Nonmonotonicity appears essential to performance. Some variants get global convergence by requiring a sufficient decrease in $f$ over the worst of the last $M$ (say 10) iterates.

The original 1988 analysis in BB's paper is nonstandard and illuminating (just for a 2-variable quadratic).

In fact, most analyses of BB and related methods are nonstandard, and consider only special cases. The precursor of such analyses is Akaike (1959). More recently, see Ascher, Dai, Fletcher, Hager, and others.

(27)

Extending to the Constrained Case: x ∈ Ω

How to change these methods to handle the constraint $x \in \Omega$? (assuming that $\Omega$ is a closed convex set)

Some algorithms and theory stay much the same,

...if we can involve the constraint $x \in \Omega$ explicitly in the subproblems.

Example: Nesterov's constant-step scheme requires just one calculation to be changed from the unconstrained version.

Initialize: Choose $x_0$, $\alpha_0 \in (0,1)$; set $y_0 \leftarrow x_0$.
Iterate: $x_{k+1} \leftarrow \arg\min_{y \in \Omega} \frac{1}{2} \left\|y - \left[y_k - \frac{1}{L} \nabla f(y_k)\right]\right\|_2^2$;
find $\alpha_{k+1} \in (0,1)$: $\alpha_{k+1}^2 = (1 - \alpha_{k+1}) \alpha_k^2 + \alpha_{k+1}/\kappa$;
set $\beta_k = \frac{\alpha_k (1 - \alpha_k)}{\alpha_k^2 + \alpha_{k+1}}$;
set $y_{k+1} \leftarrow x_{k+1} + \beta_k (x_{k+1} - x_k)$.

Convergence theory is unchanged.

(28)

Regularized Optimization

How to change these methods to handle regularized optimization?

$$\min_x f(x) + \tau \psi(x),$$
where $f$ is convex and smooth, while $\psi$ is convex but usually nonsmooth.

Often, all that is needed is to change the update step to
$$x_{k+1} = \arg\min_x \|x - \Phi(x_k)\|_2^2 + \lambda \psi(x),$$
where $\Phi(x_k)$ is a gradient descent step, or something more complicated (such as heavy-ball, or some other accelerated method).

This is the shrinkage/thresholding step; how to solve it with a nonsmooth $\psi$? That's the topic of the following slides.

(29)

Handling Nonsmoothness (e.g. $\ell_1$ Norm)

Convexity $\Rightarrow$ continuity (on the domain of the function).

Convexity $\not\Rightarrow$ differentiability (e.g., $\psi(x) = \|x\|_1$).

Subgradients generalize gradients for general convex functions:

$v$ is a subgradient of $f$ at $x$ if $f(x') \ge f(x) + v^T (x' - x)$.

Subdifferential: $\partial f(x) = \{\text{all subgradients of } f \text{ at } x\}$.

If $f$ is differentiable, $\partial f(x) = \{\nabla f(x)\}$.

[Figure, two panels: the linear lower bound; the nondifferentiable case.]

(30)

More on Subgradients and Subdifferentials

The subdifferential is a set-valued function: for $f : \mathbb{R}^d \to \mathbb{R}$, $\partial f : \mathbb{R}^d \to$ subsets of $\mathbb{R}^d$. Example:
$$f(x) = \begin{cases} -2x - 1, & x \le -1 \\ -x, & -1 < x \le 0 \\ x^2/2, & x > 0 \end{cases} \qquad\qquad \partial f(x) = \begin{cases} \{-2\}, & x < -1 \\ [-2, -1], & x = -1 \\ \{-1\}, & -1 < x < 0 \\ [-1, 0], & x = 0 \\ \{x\}, & x > 0 \end{cases}$$
Fermat's rule: $x \in \arg\min_x f(x) \;\Leftrightarrow\; 0 \in \partial f(x)$.

(31)

A Key Tool: Moreau’s Proximity Operators

Moreau (1962) proximity operator:
$$\hat{x} \in \arg\min_x \frac{1}{2} \|x - y\|_2^2 + \psi(x) =: \mathrm{prox}_\psi(y)$$
...well defined for convex $\psi$, since $\|\cdot - y\|_2^2$ is coercive and strictly convex.

Example (seen above): $\mathrm{prox}_{\tau|\cdot|}(y) = \mathrm{soft}(y, \tau) = \mathrm{sign}(y) \max\{|y| - \tau, 0\}$.

Block separability: for $x = (x_1, \dots, x_N)$ (a partition of the components of $x$),
$$\psi(x) = \sum_i \psi_i(x_i) \;\Rightarrow\; (\mathrm{prox}_\psi(y))_i = \mathrm{prox}_{\psi_i}(y_i).$$
Relationship with the subdifferential: $z = \mathrm{prox}_\psi(y) \;\Leftrightarrow\; y - z \in \partial\psi(z)$.

Resolvent: $z = \mathrm{prox}_\psi(y) \;\Leftrightarrow\; 0 \in \partial\psi(z) + (z - y) \;\Leftrightarrow\; y \in (\partial\psi + I)(z)$, that is,
$$\mathrm{prox}_\psi(y) = (\partial\psi + I)^{-1}(y).$$

(32)

Important Proximity Operators

Soft-thresholding is the proximity operator of the $\ell_1$ norm.

Consider the indicator $\iota_S$ of a convex set $S$:
$$\mathrm{prox}_{\iota_S}(u) = \arg\min_x \frac{1}{2} \|x - u\|_2^2 + \iota_S(x) = \arg\min_{x \in S} \frac{1}{2} \|x - u\|_2^2 = P_S(u)$$
...the Euclidean projection on $S$.

Squared Euclidean norm (separable, smooth):
$$\mathrm{prox}_{\tau\|\cdot\|_2^2}(y) = \arg\min_x \|x - y\|_2^2 + \tau \|x\|_2^2 = \frac{y}{1 + \tau}.$$
Euclidean norm (not separable, nonsmooth):
$$\mathrm{prox}_{\tau\|\cdot\|_2}(y) = \begin{cases} \dfrac{y}{\|y\|_2} (\|y\|_2 - \tau), & \text{if } \|y\|_2 > \tau \\[4pt] 0, & \text{if } \|y\|_2 \le \tau. \end{cases}$$
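These operators are one-liners in code; here is a minimal Python sketch (illustrative names), with a box-indicator prox included as an example of prox = projection.

```python
import numpy as np

def prox_l1(y, tau):
    """Soft-thresholding: prox of tau*||.||_1, applied componentwise."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def prox_l2(y, tau):
    """Prox of tau*||.||_2: shrink y toward 0 along its own direction."""
    norm = np.linalg.norm(y)
    return (1.0 - tau / norm) * y if norm > tau else np.zeros_like(y)

def prox_box(u, lo, hi):
    """Prox of the indicator of the box [lo,hi]^n = Euclidean projection."""
    return np.clip(u, lo, hi)

y = np.array([3.0, -0.5, 1.5])
print(prox_l1(y, 1.0))          # [2. -0.  0.5]
print(prox_l2(y, 1.0))
print(prox_box(y, -1.0, 1.0))   # [1. -0.5  1.]
```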

(33)

More Proximity Operators

(Combettes and Pesquet, 2011)

(34)

Another Key Tool: Fenchel-Legendre Conjugates

The Fenchel-Legendre conjugate of a proper convex function $f$, denoted $f^* : \mathbb{R}^n \to \bar{\mathbb{R}}$, is defined by
$$f^*(u) = \sup_x \; x^T u - f(x).$$
Main properties and relationship with proximity operators:

Biconjugation: if $f$ is convex and proper, $f^{**} = f$.

Moreau's decomposition: $\mathrm{prox}_f(u) + \mathrm{prox}_{f^*}(u) = u$

...meaning that, if you know $\mathrm{prox}_f$, you know $\mathrm{prox}_{f^*}$, and vice-versa.

Conjugate of an indicator: if $f(x) = \iota_C(x)$, where $C$ is a convex set,
$$f^*(u) = \sup_x \; x^T u - \iota_C(x) = \sup_{x \in C} x^T u \equiv \sigma_C(u)$$
(the support function of $C$).

(35)

From Conjugates to Proximity Operators

Notice that $|u| = \sup_{x \in [-1,1]} xu = \sigma_{[-1,1]}(u)$, thus $|\cdot|^* = \iota_{[-1,1]}$. Using Moreau's decomposition, we easily derive the soft-threshold:
$$\mathrm{prox}_{\tau|\cdot|} = I - \mathrm{prox}_{\iota_{[-\tau,\tau]}} = I - P_{[-\tau,\tau]} = \mathrm{soft}(\cdot, \tau).$$
Conjugate of a norm: if $f(x) = \tau\|x\|_p$ then $f^* = \iota_{\{x : \|x\|_q \le \tau\}}$, where $\frac{1}{q} + \frac{1}{p} = 1$ (a Hölder pair, or Hölder conjugates).

That is, $\|\cdot\|_p$ and $\|\cdot\|_q$ are dual norms:
$$\|z\|_q = \sup\{x^T z : \|x\|_p \le 1\} = \sup_{x \in B_p(1)} x^T z = \sigma_{B_p(1)}(z).$$

(36)

From Conjugates to Proximity Operators

Proximity operator of a norm:
$$\mathrm{prox}_{\tau\|\cdot\|_p} = I - P_{B_q(\tau)}, \quad \text{where } B_q(\tau) = \{x : \|x\|_q \le \tau\} \text{ and } \frac{1}{q} + \frac{1}{p} = 1.$$
Example: computing $\mathrm{prox}_{\|\cdot\|_\infty}$ (notice $\ell_\infty$ is not separable). Since $\frac{1}{\infty} + \frac{1}{1} = 1$,
$$\mathrm{prox}_{\tau\|\cdot\|_\infty} = I - P_{B_1(\tau)}$$
...the proximity operator of the $\ell_\infty$ norm is the residual of the projection on an $\ell_1$ ball.

Projection on the $\ell_1$ ball has no closed form, but there are efficient (linear cost) algorithms (Brucker, 1984; Maculan and de Paula, 1989).
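A simple O(n log n) sort-based projection is sketched below (the linear-cost algorithms cited above avoid the sort); composing it with the identity gives the $\ell_\infty$ prox. Names are illustrative.

```python
import numpy as np

def project_l1_ball(u, tau):
    """Euclidean projection of u onto {x : ||x||_1 <= tau} (sort-based)."""
    if np.abs(u).sum() <= tau:
        return u.copy()
    a = np.sort(np.abs(u))[::-1]          # magnitudes, descending
    cumsum = np.cumsum(a)
    # Largest index k with a[k]*(k+1) > cumsum[k] - tau fixes the threshold.
    k = np.nonzero(a * np.arange(1, u.size + 1) > cumsum - tau)[0][-1]
    theta = (cumsum[k] - tau) / (k + 1.0)
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def prox_linf(y, tau):
    """prox of tau*||.||_inf = I - P_{B_1(tau)}, as derived above."""
    return y - project_l1_ball(y, tau)

y = np.array([3.0, -1.0, 0.5])
print(project_l1_ball(y, 2.0))  # [2. -0.  0.], lies on the l1 ball
print(prox_linf(y, 2.0))        # [1. -1.  0.5]
```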

(37)

Geometry and Effect of $\mathrm{prox}_{\ell_\infty}$

Whereas $\ell_1$ promotes sparsity, $\ell_\infty$ promotes equality (in absolute value).

From Conjugates to Proximity Operators

The dual of the $\ell_2$ norm is the $\ell_2$ norm.

(39)

Group Norms and their Prox Operators

Group-norm regularizer:
$$\psi(x) = \sum_{m=1}^{M} \lambda_m \|x_{G_m}\|_p.$$
In the non-overlapping case ($G_1, \dots, G_M$ is a partition of $\{1, \dots, n\}$), simply use separability:
$$\left(\mathrm{prox}_\psi(u)\right)_{G_m} = \mathrm{prox}_{\lambda_m\|\cdot\|_p}(u_{G_m}).$$
In the tree-structured case, can get a complete ordering of the groups: $G_1 \preceq G_2 \preceq \cdots \preceq G_M$, where $(G \preceq G') \Leftrightarrow (G \subset G')$ or $(G \cap G' = \emptyset)$.

Define $\Pi_m : \mathbb{R}^n \to \mathbb{R}^n$ by
$$(\Pi_m(u))_{G_m} = \mathrm{prox}_{\lambda_m\|\cdot\|_p}(u_{G_m}), \qquad (\Pi_m(u))_{\bar{G}_m} = u_{\bar{G}_m}, \quad \text{where } \bar{G}_m = \{1, \dots, n\} \setminus G_m.$$
Then
$$\mathrm{prox}_\psi = \Pi_M \circ \cdots \circ \Pi_2 \circ \Pi_1$$
...only valid for $p \in \{1, 2, \infty\}$ (Jenatton et al., 2011).

(40)

Matrix Nuclear Norm and its Prox Operator

Recall the trace/nuclear norm:
$$\|X\|_* = \sum_{i=1}^{\min\{m,n\}} \sigma_i.$$
The dual of a Schatten $p$-norm is a Schatten $q$-norm, with $\frac{1}{q} + \frac{1}{p} = 1$. Thus, the dual of the nuclear norm is the spectral norm:
$$\|X\| = \max\{\sigma_1, \dots, \sigma_{\min\{m,n\}}\}.$$
If $Y = U \Lambda V^T$ is the SVD of $Y$, we have
$$\mathrm{prox}_{\tau\|\cdot\|_*}(Y) = U \Lambda V^T - P_{\{X :\, \max\{\sigma_1, \dots, \sigma_{\min\{m,n\}}\} \le \tau\}}(U \Lambda V^T) = U\,\mathrm{soft}(\Lambda, \tau)\,V^T.$$
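In code this is a two-line recipe: compute an SVD and soft-threshold the singular values. A minimal sketch (illustrative names):

```python
import numpy as np

def prox_nuclear(Y, tau):
    """Prox of tau * nuclear norm: soft-threshold the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 4))
Z = prox_nuclear(Y, 1.0)
print(np.linalg.svd(Z, compute_uv=False))  # each sigma_i reduced by 1, floored at 0
```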

(41)

Atomic Norms: A Unified View

(42)

Another Use of Fenchel-Legendre Conjugates

The original problem: $\min_x f(x) + \psi(x)$. Often this has the form
$$\min_x g(Ax) + \psi(x).$$
Using the definition of the conjugate, $g(Ax) = \sup_u \; u^T A x - g^*(u)$:
$$\min_x g(Ax) + \psi(x) = \inf_x \sup_u \; u^T A x - g^*(u) + \psi(x)$$
$$= \sup_u \; (-g^*(u)) + \inf_x \; u^T A x + \psi(x)$$
$$= \sup_u \; (-g^*(u)) - \underbrace{\sup_x \left( -x^T A^T u - \psi(x) \right)}_{\psi^*(-A^T u)}$$
$$= -\inf_u \; g^*(u) + \psi^*(-A^T u).$$
The dual $\inf_u g^*(u) + \psi^*(-A^T u)$ is sometimes easier to handle.

(43)

Basic Proximal-Gradient Algorithm

Use the basic structure
$$x_{k+1} = \arg\min_x \|x - \Phi(x_k)\|_2^2 + \psi(x),$$
with $\Phi(x_k)$ a simple gradient descent step, thus
$$x_{k+1} = \mathrm{prox}_{\alpha_k\psi}\left(x_k - \alpha_k \nabla f(x_k)\right).$$
This approach goes by many names, such as

"proximal gradient algorithm" (PGA),

"iterative shrinkage/thresholding" (IST),

"forward-backward splitting" (FBS).

It has been reinvented several times in different communities: optimization, partial differential equations, convex analysis, signal processing, machine learning.
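A minimal Python sketch of the iteration, applied to the $\ell_2$-$\ell_1$ problem treated on the following slides; the interface (a `prox_psi(v, t)` returning $\mathrm{prox}_{t\psi}(v)$) and all names are illustrative.

```python
import numpy as np

def prox_gradient(grad_f, prox_psi, x0, alpha, max_iter=500):
    """IST/FBS/PGA: x_{k+1} = prox_{alpha*psi}(x_k - alpha*grad f(x_k))."""
    x = x0.copy()
    for _ in range(max_iter):
        x = prox_psi(x - alpha * grad_f(x), alpha)
    return x

# Example: min 0.5*||Bx - b||^2 + tau*||x||_1.
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
tau = 0.5
L = np.linalg.norm(B, 2) ** 2           # Lipschitz constant of grad f
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - tau * t, 0.0)
x = prox_gradient(lambda x: B.T @ (B @ x - b), soft, np.zeros(50), 1.0 / L)
print(np.count_nonzero(np.abs(x) > 1e-8), "nonzeros")
```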

(44)

Convergence of the Proximal-Gradient Algorithm

Basic algorithm: $x_{k+1} = \mathrm{prox}_{\alpha_k\psi}\left(x_k - \alpha_k \nabla f(x_k)\right)$; generalized (possibly inexact) version:
$$x_{k+1} = (1 - \lambda_k) x_k + \lambda_k \left( \mathrm{prox}_{\alpha_k\psi}\left(x_k - \alpha_k \nabla f(x_k) + b_k\right) + a_k \right),$$
where $a_k$ and $b_k$ are "errors" in computing the prox and the gradient; $\lambda_k$ is an over-relaxation parameter.

Convergence is guaranteed (Combettes and Wajs, 2006) if

$0 < \inf \alpha_k \le \sup \alpha_k < 2/L$;

$\lambda_k \in (0, 1]$, with $\inf \lambda_k > 0$;

$\sum_k \|a_k\| < \infty$ and $\sum_k \|b_k\| < \infty$.

(45)

Proximal-Gradient Algorithm: Quadratic Case

Consider the quadratic case (of great interest): $f(x) = \frac{1}{2}\|Bx - b\|_2^2$. Here, $\nabla f(x) = B^T(Bx - b)$ and the IST/PGA/FBS algorithm
$$x_{k+1} = \mathrm{prox}_{\alpha_k\psi}\left(x_k - \alpha_k B^T(Bx_k - b)\right)$$
can be implemented with only matrix-vector multiplications with $B$ and $B^T$.

This is a very important feature in large-scale applications, such as image processing, where fast algorithms exist for computing these products (e.g. fast Fourier transforms or wavelet transforms), but these matrices cannot be formed and stored explicitly.

In this case, some more refined convergence results are available.

Even more refined results are available if $\psi(x) = \tau\|x\|_1$.

(46)

More on IST/FBS/PGA for the $\ell_2$-$\ell_1$ Case

Problem: $\hat{x} \in G = \arg\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Bx - b\|_2^2 + \tau\|x\|_1$ (recall $B^T B \preceq LI$). IST/FBS/PGA becomes
$$x_{k+1} = \mathrm{soft}\left(x_k - \alpha B^T(Bx_k - b), \; \alpha\tau\right)$$
with $\alpha < 2/L$.

The zero set $Z \subseteq \{1, \dots, n\}$: $\hat{x} \in G \Rightarrow \hat{x}_Z = 0$.

Zeros are found in a finite number of iterations (Hale et al., 2008): after a finite number of iterations, we have $(x_k)_Z = 0$.

After that, if $B_{\bar{Z}}^T B_{\bar{Z}} \succeq \mu I$ with $\mu > 0$, where $\bar{Z}$ is the support (thus $\kappa(B_{\bar{Z}}^T B_{\bar{Z}}) = L/\mu$):
$$\|x_{k+1} - \hat{x}\|_2 \le \frac{\kappa - 1}{\kappa + 1} \|x_k - \hat{x}\|_2 \quad \text{(linear convergence)}$$
for the optimal choice $\alpha = 2/(L + \mu)$. (A weaker condition suffices for linear convergence of $\{f(x_k)\}$; see above.)

(47)

FISTA with prox operations

Recall that FISTA — fast iterative shrinkage-thresholding algorithm — ((Beck and Teboulle, 2009), based on (Nesterov, 1983)) is a heavy-ball-type acceleration of IST:

Initialize: Choose $\alpha \le 1/L$ and $x_0$; set $y_1 = x_0$, $t_1 = 1$.
Iterate: $x_k \leftarrow \mathrm{prox}_{\tau\alpha\psi}\left(y_k - \alpha \nabla f(y_k)\right)$;
$t_{k+1} \leftarrow \frac{1}{2}\left(1 + \sqrt{1 + 4 t_k^2}\right)$;
$y_{k+1} \leftarrow x_k + \frac{t_k - 1}{t_{k+1}} (x_k - x_{k-1})$.

Acceleration:
$$\text{FISTA: } f(x_k) - f(\hat{x}) \sim O\left(\frac{1}{k^2}\right); \qquad \text{IST: } f(x_k) - f(\hat{x}) \sim O\left(\frac{1}{k}\right).$$
When $L$ is not known, increase an estimate of $L$ until it's big enough.
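The accelerated version changes only a few lines relative to the IST sketch above; again, the `prox_psi(v, t)` interface and all names are illustrative.

```python
import numpy as np

def fista(grad_f, prox_psi, x0, alpha, max_iter=500):
    """FISTA with a fixed step alpha (assumed <= 1/L)."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(max_iter):
        x = prox_psi(y - alpha * grad_f(y), alpha)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev

# Same l2-l1 example as the IST sketch above.
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
tau = 0.5
L = np.linalg.norm(B, 2) ** 2
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - tau * t, 0.0)
x = fista(lambda x: B.T @ (B @ x - b), soft, np.zeros(50), 1.0 / L)
print(np.count_nonzero(np.abs(x) > 1e-8), "nonzeros")
```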

(48)

Heavy Ball Acceleration: TwIST

TwIST (two-step iterative shrinkage-thresholding; Bioucas-Dias and Figueiredo, 2007) is a heavy-ball-type acceleration of IST, for
$$\min_x \frac{1}{2}\|Bx - b\|_2^2 + \tau\psi(x).$$
Iterations (with $\alpha < 2/L$):
$$x_{k+1} = (\gamma - \beta) x_k + (1 - \gamma) x_{k-1} + \beta\,\mathrm{prox}_{\alpha\tau\psi}\left(x_k - \alpha B^T(Bx_k - b)\right).$$
Analysis in the strongly convex case: $\mu I \preceq B^T B \preceq LI$, with $\mu > 0$. Conditioning (as above): $\kappa = L/\mu < \infty$.

Optimal parameters $\gamma = \rho^2 + 1$ and $\beta = \frac{2\gamma}{\mu + L}$, where $\rho = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$, yield linear convergence
$$\|x_{k+1} - \hat{x}\|_2 \le \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \|x_k - \hat{x}\|_2$$
versus $\frac{\kappa - 1}{\kappa + 1}$ for IST.

(49)

Illustration of the TwIST Acceleration

(50)

Acceleration via Larger Steps: SpaRSA

The standard step size $\alpha_k \le 2/L$ in IST is too timid.

The SpaRSA (sparse reconstruction by separable approximation) framework proposes bolder choices of $\alpha_k$ (Wright et al., 2009):

Barzilai-Borwein (see above), to mimic Newton steps — or at least get the scaling right;

keep increasing $\alpha_k$ until monotonicity is violated: backtrack.

Convergence to critical points (minima in the convex case) is guaranteed for a safeguarded version: ensure sufficient decrease w.r.t. the worst value in the previous $M$ iterations.

(51)

Another Approach: GPSR

$\min_x \frac{1}{2}\|Bx - b\|_2^2 + \tau\|x\|_1$ can be written as a standard QP:
$$\min_{u,v} \frac{1}{2}\|B(u - v) - b\|_2^2 + \tau u^T \mathbf{1} + \tau v^T \mathbf{1} \quad \text{s.t. } u \ge 0, \; v \ge 0,$$
where $u_i = \max\{0, x_i\}$ and $v_i = \max\{0, -x_i\}$.

With $z = \begin{bmatrix} u \\ v \end{bmatrix}$, the problem can be written in canonical form
$$\min_z \frac{1}{2} z^T Q z + c^T z \quad \text{s.t. } z \ge 0.$$
Solving this problem with projected gradient using Barzilai-Borwein steps: GPSR (gradient projection for sparse reconstruction) (Figueiredo et al., 2007).

(52)

Speed Comparisons

Lorenz (2011) proposed a way of generating problem instances with known solution $\hat{x}$: useful for speed comparisons.

Define: $R_k = \frac{\|x_k - \hat{x}\|_2}{\|\hat{x}\|_2}$ and $r_k = \frac{\mathcal{L}(x_k) - \mathcal{L}(\hat{x})}{\mathcal{L}(\hat{x})}$, where $\mathcal{L}(x) = f(x) + \tau\psi(x)$.

(53)

More Speed Comparisons

(54)

Even More Speed Comparisons

(55)

Acceleration by Continuation

IST/FBS/PGA can be very slow if $\tau$ is very small and/or $f$ is poorly conditioned.

A very simple acceleration strategy: continuation/homotopy.

Initialization: Set $\tau_0 \gg \tau_f$, starting point $\bar{x}$, factor $\sigma \in (0,1)$, and $k = 0$.
Iterations: Find an approximate solution $x(\tau_k)$ of $\min_x f(x) + \tau_k \psi(x)$, starting from $\bar{x}$;
if $\tau_k = \tau_f$, STOP;
set $\tau_{k+1} \leftarrow \max(\tau_f, \sigma\tau_k)$ and $\bar{x} \leftarrow x(\tau_k)$.

Often the solution path $x(\tau)$, for a range of values of $\tau$, is desired anyway (e.g., within an outer method to choose an optimal $\tau$). Shown to be very effective in practice (Hale et al., 2008; Wright et al., 2009). Recently analyzed by Xiao and Zhang (2012).
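The loop itself is tiny; here is a sketch in which `solve(tau, x)` stands for any inner solver (e.g., the IST or FISTA sketches above) warm-started at x. The interface is an assumption of the sketch.

```python
import numpy as np

def continuation(solve, x_bar, tau0, tau_f, sigma=0.5):
    """Continuation: solve a sequence of problems with decreasing tau,
    warm-starting each solve at the previous solution."""
    tau = tau0
    while True:
        x_bar = solve(tau, x_bar)          # approx solution at current tau
        if tau == tau_f:
            return x_bar
        tau = max(tau_f, sigma * tau)      # decrease tau toward tau_f

# Toy check: for f(x) = 0.5*||x - b||^2 and psi = ||.||_1, the exact solution
# at each tau is soft(b, tau), so that serves as the inner "solver" here.
b = np.array([3.0, -1.0, 0.2])
solve = lambda tau, x: np.sign(b) * np.maximum(np.abs(b) - tau, 0.0)
print(continuation(solve, np.zeros(3), tau0=2.0, tau_f=0.5))  # soft(b, 0.5)
```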

(56)

Acceleration by Continuation: An Example

Classical sparse reconstruction problem (Wright et al., 2009):
$$\hat{x} \in \arg\min_x \frac{1}{2}\|Bx - b\|_2^2 + \tau\|x\|_1$$
with $B \in \mathbb{R}^{1024 \times 4096}$ (thus $x \in \mathbb{R}^{4096}$ and $b \in \mathbb{R}^{1024}$).

(57)

A Final Touch: Debiasing

Consider problems of the form
$$\hat{x} \in \arg\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Bx - b\|_2^2 + \tau\|x\|_1.$$
Often, the original goal was to minimize the quadratic term, after the support of $x$ had been found. But the $\ell_1$ term can cause the nonzero values of $x_i$ to be "suppressed."

Debiasing:

find the zero set (complement of the support of $\hat{x}$): $Z(\hat{x}) = \{1, \dots, n\} \setminus \mathrm{supp}(\hat{x})$;

solve $\min_x \|Bx - b\|_2^2$ s.t. $x_{Z(\hat{x})} = 0$. (Fix the zeros and solve an unconstrained problem over the support.)

Often, this problem has to be solved using an algorithm that only involves products by $B$ and $B^T$, since this matrix cannot be partitioned.

(58)

Effect of Debiasing

[Figure, four panels ($n = 4096$): Original (number of nonzeros = 160); SpaRSA reconstruction ($m = 1024$, $\tau = 0.08$, MSE = 0.0072); Minimum norm solution (MSE = 1.568); Debiased (MSE = 3.377e-005).]

(59)

Example: Matrix Recovery (Toh and Yun, 2010)

(60)

Conditional Gradient

Also known as “Frank-Wolfe” after the authors who devised it in the 1950s. Later analysis by Dunn (around 1990). Suddenly a topic of enormous renewed interest; see for example (Jaggi, 2013).

$$\min_{x \in \Omega} f(x),$$
where $f$ is a convex function and $\Omega$ is a closed, bounded, convex set.

Start at $x_0 \in \Omega$. At iteration $k$:
$$v_k := \arg\min_{v \in \Omega} v^T \nabla f(x_k); \qquad x_{k+1} := x_k + \alpha_k (v_k - x_k), \quad \alpha_k = \frac{2}{k+2}.$$
Potentially useful when it is easy to minimize a linear function over the original constraint set $\Omega$;

Admits an elementary convergence theory: $1/k$ sublinear rate.

The same convergence theory holds if we use a line search for $\alpha_k$.
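A minimal sketch for the $\ell_1$-ball case (which also previews the atomic-norm view on the next slides): the linear subproblem is solved in closed form by the best signed coordinate atom. Names are illustrative.

```python
import numpy as np

def frank_wolfe_l1(grad, x0, tau, max_iter=200):
    """Conditional gradient over {x : ||x||_1 <= tau}, step alpha_k = 2/(k+2)."""
    x = x0.copy()
    for k in range(max_iter):
        g = grad(x)
        i = np.argmax(np.abs(g))             # best coordinate atom
        v = np.zeros_like(x)
        v[i] = -tau * np.sign(g[i])          # v_k = argmin_{||v||_1 <= tau} v^T g
        x = x + (2.0 / (k + 2.0)) * (v - x)
    return x

# Example: min 0.5*||Bx - b||^2 subject to ||x||_1 <= 1.
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
x = frank_wolfe_l1(lambda x: B.T @ (B @ x - b), np.zeros(50), tau=1.0)
print(np.abs(x).sum())  # <= 1 by construction
```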

(61)

Conditional Gradient for Atomic-Norm Constraints

Conditional Gradient is particularly useful for optimization over atomic-norm constraints.

$$\min f(x) \quad \text{s.t. } \|x\|_{\mathcal{A}} \le \tau.$$
Reminder: Given the set of atoms $\mathcal{A}$ (possibly infinite) we have
$$\|x\|_{\mathcal{A}} := \inf\left\{ \sum_{a \in \mathcal{A}} c_a \; : \; x = \sum_{a \in \mathcal{A}} c_a a, \; c_a \ge 0 \right\}.$$
The search direction $v_k$ is $\tau \bar{a}_k$, where
$$\bar{a}_k := \arg\min_{a \in \mathcal{A}} \langle a, \nabla f(x_k) \rangle.$$
That is, we seek the atom that lines up best with the negative gradient direction $-\nabla f(x_k)$.

(62)

Generating Atoms

We can think of each step as the “addition of a new atom to the basis.”

Note that $x_k$ is expressed in terms of $\{\bar{a}_0, \bar{a}_1, \dots, \bar{a}_k\}$.

If few iterations are needed to find a solution of acceptable accuracy, then we have an approximate solution that’s represented in terms of few atoms, that is, sparse or compactly represented.

For many atomic sets A of interest, the new atom can be found cheaply.

Example: For the constraint $\|x\|_1 \le \tau$, the atoms are $\{\pm e_i : i = 1, 2, \dots, n\}$. If $i_k$ is the index at which $|[\nabla f(x_k)]_i|$ attains its maximum, we have
$$\bar{a}_k = -\mathrm{sign}\left([\nabla f(x_k)]_{i_k}\right) e_{i_k}.$$
Example: For the constraint $\|x\|_\infty \le \tau$, the atoms are the $2^n$ vectors with entries $\pm 1$. We have
$$[\bar{a}_k]_i = -\mathrm{sign}\,[\nabla f(x_k)]_i, \quad i = 1, 2, \dots, n.$$

(63)

More Examples

Example: Nuclear Norm. For the constraint $\|X\|_* \le \tau$, for which the atoms are the rank-one matrices, we have $\bar{A}_k = u_k v_k^T$, where $u_k$ and $v_k$ are the first columns of the matrices $U_k$ and $V_k$ obtained from the SVD $\nabla f(X_k) = U_k \Sigma_k V_k^T$.

Example: sum-of-$\ell_2$. For the constraint
$$\sum_{i=1}^{m} \|x_{[i]}\|_2 \le \tau,$$
the atoms are the vectors $a$ that contain all zeros except for a vector $u_{[i]}$ with unit 2-norm in the $[i]$ block position. (Infinitely many.) The atom $\bar{a}_k$ contains nonzero components in the block $i_k$ for which $\|[\nabla f(x_k)]_{[i]}\|$ is maximized, and the nonzero part is
$$u_{[i_k]} = -[\nabla f(x_k)]_{[i_k]} / \|[\nabla f(x_k)]_{[i_k]}\|.$$

(64)

Other Enhancements

Reoptimizing. Instead of fixing the contributionαk from each atom at the time it joins the basis, we can periodically and approximately reoptimize over the current basis.

This is a finite-dimensional optimization problem over the (nonnegative) coefficients of the basis atoms.

It need only be solved approximately.

If any coefficient is reduced to zero, it can be dropped from the basis.

Dropping Atoms. Sparsity of the solution can be improved by dropping atoms from the basis, if doing so does not degrade the value of f too much (see (Rao et al., 2013)).

In the important least-squares case, the effect of dropping can be evaluated efficiently.

(65)

Interior-Point Methods

Interior-point methods were tried early for compressed sensing, regularized least squares, support vector machines.

SVM with hinge loss formulated as a QP, solved with a primal-dual interior-point method. Included in the OOQP distribution (Gertz and Wright, 2003); see also (Ferris and Munson, 2002).

Compressed sensing and LASSO variable selection formulated as bound-constrained QPs and solved with primal-dual methods; or as second-order cone programs solved with barrier methods (Candès and Romberg, 2005). However, they were mostly superseded by first-order methods.

Stochastic gradient in machine learning (low accuracy, simple data access);

Gradient projection (GPSR) and prox-gradient (SpaRSA, FPC) in compressed sensing (require only matrix-vector multiplications).

Is it time to reconsider interior-point methods?

(66)

Compressed Sensing: Splitting and Conditioning

Consider the $\ell_2$-$\ell_1$ problem
$$\min_x \frac{1}{2}\|Bx - b\|_2^2 + \tau\|x\|_1,$$
where $B \in \mathbb{R}^{m \times n}$. Recall the bound-constrained convex QP formulation:
$$\min_{u \ge 0, \, v \ge 0} \frac{1}{2}\|B(u - v) - b\|_2^2 + \tau \mathbf{1}^T (u + v).$$
$B$ has special properties associated with compressed sensing matrices (e.g. RIP) that make the problem well conditioned.

(Though the objective is only weakly convex, RIP ensures that when restricted to the optimal support, the active Hessian submatrix is well conditioned.)

(67)

Compressed Sensing via Primal-Dual Interior-Point

Fountoulakis et al. (2012) describe an approach that solves the bound-constrained QP formulation.

Uses a vanilla primal-dual interior-point framework.

Solves the linear system at each interior-point iteration with a conjugate gradient (CG) method.

Preconditions CG with a simple matrix that exploits the RIP properties of $B$.

The matrix for each linear system in the interior-point solver has the form
$$M := \begin{bmatrix} B^T B & -B^T B \\ -B^T B & B^T B \end{bmatrix} + \begin{bmatrix} U^{-1} S & 0 \\ 0 & V^{-1} T \end{bmatrix},$$
where $U = \mathrm{diag}(u)$, $V = \mathrm{diag}(v)$, and $S = \mathrm{diag}(s)$ and $T = \mathrm{diag}(t)$ are constructed from the Lagrange multipliers for the bounds $u \ge 0$, $v \ge 0$.

(68)

The preconditioner replaces $B^T B$ by $(m/n)I$; this makes sense according to the RIP properties of $B$:
$$P := \frac{m}{n} \begin{bmatrix} I & -I \\ -I & I \end{bmatrix} + \begin{bmatrix} U^{-1} S & 0 \\ 0 & V^{-1} T \end{bmatrix}.$$
Convergence of preconditioned CG depends on the eigenvalue distribution of $P^{-1} M$. Gondzio and Fountoulakis (2013) show that the gap between the largest and smallest eigenvalues actually decreases as the interior-point iterates approach a solution. (The gap blows up to $\infty$ for the non-preconditioned system.)

Overall, the strategy is competitive with first-order methods on random test problems.

(69)

Preconditioning: Effect on Eigenvalue Spread / Solve Time

Red = preconditioned, Blue = non-preconditioned.

(70)

References I

Akaike, H. (1959). On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Annals of the Institute of Statistical Mathematics, Tokyo, 11:1–17.

Barzilai, J. and Borwein, J. (1988). Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8:141–148.

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Bioucas-Dias, J. and Figueiredo, M. (2007). A new TwIST: two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing, 16:2992–3004.

Brucker, P. (1984). An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3:163–166.

Candès, E. and Romberg, J. (2005). ℓ1-MAGIC: Recovery of sparse signals via convex programming. Technical report, California Institute of Technology.

Combettes, P. and Pesquet, J.-C. (2011). Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer.

Combettes, P. and Wajs, V. (2006). Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4:1168–1200.

(71)

References II

Ferris, M. C. and Munson, T. S. (2002). Interior-point methods for massive support vector machines. SIAM Journal on Optimization, 13(3):783–804.

Figueiredo, M., Nowak, R., and Wright, S. (2007). Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing: Special Issue on Convex Optimization Methods for Signal Processing, 1:586–598.

Fountoulakis, K., Gondzio, J., and Zhlobich, P. (2012). Matrix-free interior point method for compressed sensing problems. Technical Report, University of Edinburgh.

Gertz, E. M. and Wright, S. J. (2003). Object-oriented software for quadratic programming. ACM Transactions on Mathematical Software, 29:58–81.

Gondzio, J. and Fountoulakis, K. (2013). Second-order methods for L1-regularization. Talk at the Optimization and Big Data Workshop, Edinburgh.

Hale, E., Yin, W., and Zhang, Y. (2008). Fixed-point continuation for ℓ1-minimization: methodology and convergence. SIAM Journal on Optimization, 19:1107–1130.

Jaggi, M. (2013). Revisiting Frank-Wolfe: projection-free sparse convex optimization. Technical Report, École Polytechnique, France.

Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. (2011). Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334.

Lorenz, D. (2011). Constructing test instances for basis pursuit denoising. arXiv:1103.2897.

(72)

References III

Maculan, N. and de Paula, G. G. (1989). A linear-time median-finding algorithm for projecting a vector on the simplex of $\mathbb{R}^n$. Operations Research Letters, 8:219–222.

Moreau, J. (1962). Fonctions convexes duales et points proximaux dans un espace hilbertien. CR Acad. Sci. Paris Sér. A Math., 255:2897–2899.

Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Math. Doklady, 27:372–376.

Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer, New York.

Rao, N., Shah, P., Wright, S. J., and Nowak, R. (2013). A greedy forward-backward algorithm for atomic-norm-constrained optimization. In Proceedings of ICASSP.

Toh, K.-C. and Yun, S. (2010). An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific Journal of Optimization, 6:615–640.

Wright, S., Nowak, R., and Figueiredo, M. (2009). Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57:2479–2493.

Xiao, L. and Zhang, T. (2012). A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization. (To appear; available at http://arxiv.org/abs/1203.3002.)
