
First-Order Methods of Smooth Convex Optimization with Inexact Oracle.

Olivier Devolder · François Glineur · Yurii Nesterov

Abstract We introduce the notion of inexact first-order oracle and analyze the behaviour of several first-order methods of smooth convex optimization used with such an oracle. This notion of inexact oracle naturally appears in the context of smoothing techniques, Moreau-Yosida regularization, Augmented Lagrangians and many other situations.

We derive complexity estimates for primal, dual and fast gradient methods, and study in particular their dependence on the accuracy of the oracle and the desired accuracy of the objective function. We observe that the superiority of fast gradient methods over the classical ones is no longer absolute when an inexact oracle is used. We prove that, contrary to simple gradient schemes, fast gradient methods must necessarily suffer from error accumulation.

Finally, we show that the notion of inexact oracle allows the application of first-order methods of smooth convex optimization to solve non-smooth or weakly smooth convex problems.

The first author is a F.R.S.-FNRS Research Fellow. This text presents research results of the Belgian Program on Interuniversity Poles of Attraction initiated by the Belgian State, Prime Minister’s Office, Science Policy Programming. The scientific responsibility rests with its authors.

O. Devolder
Université catholique de Louvain, ICTEAM Institute, B-1348 Louvain-La-Neuve, Belgium;
Université catholique de Louvain, CORE, B-1348 Louvain-La-Neuve, Belgium.
Tel.: +32-10-479422, Fax: +32-10-474301, E-mail: Olivier.Devolder@uclouvain.be

F. Glineur
Université catholique de Louvain, ICTEAM Institute, B-1348 Louvain-La-Neuve, Belgium;
Université catholique de Louvain, CORE, B-1348 Louvain-La-Neuve, Belgium.
E-mail: Francois.Glineur@uclouvain.be

Y. Nesterov
Université catholique de Louvain, ICTEAM Institute, B-1348 Louvain-La-Neuve, Belgium;
Université catholique de Louvain, CORE, B-1348 Louvain-La-Neuve, Belgium.
E-mail: Yurii.Nesterov@uclouvain.be


Keywords Smooth convex optimization, first-order methods, inexact oracle, gradient methods, fast gradient methods, complexity bounds.

Mathematics Subject Classification (2000) 90C06, 90C25, 90C60

1 Introduction

In large-scale convex optimization, first-order methods are the methods of choice due to their cheap iteration cost. When the objective function is assumed to be smooth, for example when its gradient is Lipschitz-continuous with constant $L$, the simplest numerical schemes to be considered are the gradient method and its variants. If accuracy $\epsilon$ is desired for the objective function, these methods require $O\left(\frac{L}{\epsilon}\right)$ iterations.

However, it is well-known that in the black-box framework [11], first-order methods can achieve the lower complexity bound of $O\left(\sqrt{\frac{L}{\epsilon}}\right)$ iterations. Such optimal methods, called Fast Gradient Methods (FGM), have been developed for various classes of problems since 1983 [12, 13, 14, 15] and outperform theoretically, and often in practice, the classical gradient methods. Interest in these methods has been renewed recently with the development of smoothing techniques for non-smooth convex problems (see [15, 16, 17, 4]), where FGMs are used to minimize a smooth approximation of a non-smooth objective function.

Standard analysis of first-order methods assumes the availability of exact first-order information. Namely, the oracle must provide at each given point the exact values of the function and its gradient. However, in many convex problems, including those obtained by smoothing techniques, the objective function and its gradient are computed by solving another auxiliary optimization problem. In practice, we are often only able to solve these subproblems approximately. Hence, in that context, numerical methods solving the outer problem are provided with inexact first-order information. This led us to investigate the behavior of first-order methods working with an inexact oracle.

We introduce in Section 2 a new definition of inexact first-order oracle and list a few simple examples. In Section 3, we show how our concept is applicable to situations when the inexact oracle is computed by an auxiliary optimization problem. In particular, we consider convex-concave saddle point problems, augmented Lagrangians, and Moreau-Yosida regularization.

In Sections 4 and 5, we consider classical (primal and dual) and fast gradient methods, designed for the class of convex functions with Lipschitz-continuous gradient. We obtain efficiency estimates when these methods are used with an inexact first-order oracle. We also study the link between the desired accuracy for the objective function and the necessary accuracy for the oracle. We observe that the superiority of the fast gradient methods over the classical ones is no longer absolute when an inexact oracle is used, because FGMs suffer from error accumulation. In particular, fast methods require first-order information with higher accuracy than standard gradient methods to obtain a solution with a given accuracy. Therefore, the choice between these methods depends on the availability and relative cost of an inexact oracle at different levels of accuracy, as is explained in Section 6.

In Section 7, we compare our approach with other definitions of inexact oracle, as applied to the smoothed max-representable functions typically obtained by the smoothing techniques [3, 1]. We show that our definition can give better complexity results.

Our definition of inexact oracle is applicable to non-smooth and weakly smooth convex problems. Section 8 shows how to apply first-order methods designed for smooth convex optimization to functions with a weaker level of smoothness. For that, we show that (exact) first-order information for a non-smooth problem, such as subgradients, can be viewed as an inexact oracle, so that the methods of Sections 4 and 5 can be applied. We obtain in this way "universal" first-order methods possessing optimal rates of convergence for objective functions with different levels of smoothness. We also prove lower bounds on the rate of error accumulation for any first-order method using an inexact oracle, which shows that all methods discussed in this paper have the lowest possible rate of error accumulation. In particular, it appears that while slower standard gradient methods are able to maintain an error comparable to the oracle accuracy, any optimal method must suffer from error accumulation.

2 Inexact first-order oracle

2.1 Motivation and definition

We consider the following convex optimization problem:

$$f^* = \min_{x \in Q} f(x), \qquad (1)$$

where $Q$ is a closed convex set in a finite-dimensional space $E$, and function $f$ is convex on $Q$. Space $E$ is endowed with the norm $\|\cdot\|_E$, and $E^*$, the dual space of $E$, with the dual norm $\|g\|_{E^*} = \sup_{y \in E}\{|\langle g, y\rangle| : \|y\|_E \le 1\}$, where $\langle \cdot, \cdot\rangle$ denotes the dual pairing. For example, by fixing a positive definite self-adjoint operator $B : E \to E^*$, we can define the following Euclidean norms:

$$\|h\|_E = \|h\|_2 = \langle Bh, h\rangle^{1/2} \quad \forall h \in E, \qquad \|s\|_{E^*} = \|s\|_2 = \langle s, B^{-1}s\rangle^{1/2} \quad \forall s \in E^*.$$

We assume that problem (1) is solvable, with optimal solution $x^*$.

Consider $F_L^{1,1}(Q)$, the class of convex functions on the convex set $Q$ whose gradient is Lipschitz-continuous with constant $L$. It is well-known that functions belonging to this class satisfy

$$0 \le f(x) - \big(f(y) + \langle \nabla f(y), x - y\rangle\big) \le \frac{L}{2}\|x - y\|_E^2 \quad \text{for all } x, y \in Q, \qquad (2)$$

see the top of Figure 1. Moreover, it is easy to check that, for a given $y$, the quantities $f(y)$ and $\nabla f(y)$ are uniquely determined by this pair of inequalities.


Therefore, membership in $F_L^{1,1}(Q)$ can be characterized by the existence of an oracle returning for each point $y \in Q$ a pair $(f_L(y), g_L(y)) \in \mathbb{R} \times E^*$, necessarily equal to $(f(y), \nabla f(y))$, satisfying

$$0 \le f(x) - \big(f_L(y) + \langle g_L(y), x - y\rangle\big) \le \frac{L}{2}\|x - y\|_E^2 \quad \text{for all } x \in Q$$

(both zeroth-order and first-order information are included in the oracle). Our definition of an inexact oracle simply consists in introducing a given amount $\delta$ of tolerance in this pair of inequalities (see the bottom of Figure 1).

Definition 1 Let function $f$ be convex on the convex set $Q$. We say that it is equipped with a first-order $(\delta, L)$-oracle if for any $y \in Q$ we can compute a pair $(f_{\delta,L}(y), g_{\delta,L}(y)) \in \mathbb{R} \times E^*$ such that

$$0 \le f(x) - \big(f_{\delta,L}(y) + \langle g_{\delta,L}(y), x - y\rangle\big) \le \frac{L}{2}\|x - y\|_E^2 + \delta \quad \text{for all } x \in Q. \qquad (3)$$

A function $f$ belongs to $F_L^{1,1}(Q)$ if and only if it admits a $(0, L)$-oracle, namely $(f_{0,L}(y), g_{0,L}(y)) = (f(y), \nabla f(y))$. However, the class of functions admitting a $(\delta, L)$-oracle is strictly larger, and includes non-smooth functions, as we will see later.
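To make Definition 1 concrete, here is a minimal Python sketch (ours, not from the paper) that encodes an inexact first-order answer as a tuple and tests the two inequalities of (3) numerically at random points; the names `check_oracle` and `Q_sample` are illustrative.

```python
import numpy as np

def check_oracle(f, oracle, y, Q_sample, n_trials=1000):
    """Numerically test the two (delta, L)-oracle inequalities (3) at a point y.

    f        : exact objective (used only for testing)
    oracle   : y -> (f_dL, g_dL, delta, L), the inexact first-order answer
    Q_sample : callable returning random test points x in Q
    """
    f_dL, g_dL, delta, L = oracle(y)
    for _ in range(n_trials):
        x = Q_sample()
        gap = f(x) - (f_dL + g_dL @ (x - y))
        assert gap >= -1e-10                                          # lower bound in (3)
        assert gap <= 0.5 * L * np.dot(x - y, x - y) + delta + 1e-10  # upper bound in (3)

# Example: the exact oracle of a smooth quadratic is a (0, L)-oracle.
A = np.diag([1.0, 4.0])                    # Hessian; L = largest eigenvalue = 4
f = lambda x: 0.5 * x @ A @ x
oracle = lambda y: (f(y), A @ y, 0.0, 4.0)
check_oracle(f, oracle, np.array([1.0, -2.0]), lambda: np.random.randn(2))
```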

2.2 Properties

We list here a few important properties of (δ, L)-oracles.

– A $(\delta, L)$-oracle provides a lower $\delta$-approximation of the function value. Indeed, taking $x = y$ in (3), we obtain
$$f_{\delta,L}(y) \le f(y) \le f_{\delta,L}(y) + \delta. \qquad (4)$$

– A $(\delta, L)$-oracle provides a $\delta$-subgradient of $f$ at $y \in Q$, i.e.
$$g_{\delta,L}(y) \in \partial_\delta f(y) = \{z \in E^* : f(x) \ge f(y) + \langle z, x - y\rangle - \delta \ \ \forall x \in Q\}.$$
Indeed, using the first inequality in (3) and (4), we have for all $x, y \in Q$
$$f(x) \ge f_{\delta,L}(y) + \langle g_{\delta,L}(y), x - y\rangle \ge f(y) + \langle g_{\delta,L}(y), x - y\rangle - \delta. \qquad (5)$$
Methods of non-smooth convex optimization based on $\delta$-subgradients have a long history (see e.g. [20, 19, 2, 9] for subgradient methods, and [2, 6, 7] for proximal point and bundle methods). We will show later that a standard subgradient can also satisfy the second inequality in (3), which opens the possibility of using the concept of inexact oracle in the context of non-smooth convex optimization.

– A $(\delta, L)$-oracle can certify that an approximate solution has accuracy $\delta$. Indeed, assuming $g_{\delta,L}(y)$ satisfies $\langle g_{\delta,L}(y), x - y\rangle \ge 0$ for all $x \in Q$, we have that $f_{\delta,L}(y) \le f(x^*) = f^*$ and therefore, using (4), we have $f(y) \le f^* + \delta$.


Fig. 1 Illustration of the lower and upper bounds (blue lines) implied by the definition of an exact (top) and inexact (bottom) oracle.

– If $f$ admits a $(\delta, L)$-oracle, then $cf$ admits a $(c\delta, cL)$-oracle for any value of the constant $c > 0$. If $f_i$ admits a $(\delta_i, L_i)$-oracle, $i = 1, 2$, then $f_1 + f_2$ admits a $(\delta_1 + \delta_2, L_1 + L_2)$-oracle.

– When $Q = E$, the difference between $g_{\delta,L}(y)$ and any subgradient $g_y \in \partial f(y)$ is bounded as follows:
$$\|g_y - g_{\delta,L}(y)\|_{E^*} \le [2\delta L]^{1/2}. \qquad (6)$$
Indeed, for any $x \in Q$ we have $f(x) \ge f(y) + \langle g_y, x - y\rangle \ge f_{\delta,L}(y) + \langle g_y, x - y\rangle$. Subtracting this inequality from the second part of (3), we get that
$$\langle g_y - g_{\delta,L}(y), x - y\rangle \le \frac{L}{2}\|x - y\|_E^2 + \delta$$
holds for all $x \in Q$. If $z \in E$ is such that $\|g_y - g_{\delta,L}(y)\|_{E^*} = |\langle g_y - g_{\delta,L}(y), z\rangle|$ and $\|z\|_E = 1$, and if we choose $x \in Q$ such that $x - y = t\,\mathrm{sign}(\langle g_y - g_{\delta,L}(y), z\rangle)\,z$ with $t > 0$, we obtain:
$$t\,\|g_y - g_{\delta,L}(y)\|_{E^*} \le \frac{L}{2}t^2 + \delta \ \Leftrightarrow\ \|g_y - g_{\delta,L}(y)\|_{E^*} \le \frac{L}{2}t + \frac{\delta}{t}. \qquad (7)$$
This upper bound attains its minimum $[2\delta L]^{1/2}$ when $t = \left[\frac{2\delta}{L}\right]^{1/2}$. In particular, when $Q = E$, parameter $t$ is free to take any positive value, and we obtain inequality (6). For constrained problems, a similar bound can be obtained in terms of the distance $d(y, \partial Q)$ between $y$ and the boundary of $Q$: letting
$$d(y, \partial Q) = \max\{r \ : \ \|x - y\|_E \le r \Rightarrow x \in Q\},$$
we have that (7) holds for all $t$ such that $0 < t \le d(y, \partial Q)$, so that
$$\|g_y - g_{\delta,L}(y)\|_{E^*} \le \begin{cases} \frac{L}{2}\,d(y, \partial Q) + \frac{\delta}{d(y, \partial Q)} & \text{when } 0 < d(y, \partial Q) \le \left[\frac{2\delta}{L}\right]^{1/2}, \\[4pt] [2\delta L]^{1/2} & \text{when } d(y, \partial Q) \ge \left[\frac{2\delta}{L}\right]^{1/2}. \end{cases}$$

– If $E$ is endowed with the Euclidean norm $\|\cdot\|_2$, the distance between the exact and inexact gradient mappings can be bounded by the same quantities as the distance between exact and inexact (sub)gradients. Recall that for any $\gamma > 0$, $g \in E^*$ and $y \in E$, the gradient mapping $M_\gamma(y, g)$, which replaces the gradient for constrained problems, is defined by
$$T_\gamma(y, g) = \arg\min_{x \in Q}\left\{\langle g, x - y\rangle + \frac{\gamma}{2}\|x - y\|_E^2\right\} \qquad (8)$$
$$M_\gamma(y, g) = \gamma\big(y - T_\gamma(y, g)\big). \qquad (9)$$
If $f$ is subdifferentiable at point $y$, the exact gradient mapping for any subgradient $g_y \in \partial f(y)$ is equal to $M_\gamma(y, g_y)$. Similarly, if an inexact $(\delta, L)$-oracle returns $(f_{\delta,L}(y), g_{\delta,L}(y))$ at point $y$, we call $M_\gamma(y, g_{\delta,L}(y))$ the inexact gradient mapping. We are going to prove that the following holds:
$$\|M_\gamma(y, g_y) - M_\gamma(y, g_{\delta,L}(y))\|_2 \le \|g_y - g_{\delta,L}(y)\|_2. \qquad (10)$$
First-order optimality conditions for (8) can be written as
$$\langle g + \gamma B(T_\gamma(y, g) - y), x - T_\gamma(y, g)\rangle \ge 0 \quad \forall x \in Q. \qquad (11)$$
Applying those to $T_\gamma(y, g_y)$ and $T_\gamma(y, g_{\delta,L}(y))$ leads to
$$\langle g_y - BM_\gamma(y, g_y), x - T_\gamma(y, g_y)\rangle \ge 0 \quad \forall x \in Q,$$
$$\langle g_{\delta,L}(y) - BM_\gamma(y, g_{\delta,L}(y)), x - T_\gamma(y, g_{\delta,L}(y))\rangle \ge 0 \quad \forall x \in Q,$$
and specializing these respectively to $x = T_\gamma(y, g_{\delta,L}(y))$ and $x = T_\gamma(y, g_y)$ gives
$$\langle g_y - BM_\gamma(y, g_y), T_\gamma(y, g_{\delta,L}(y)) - T_\gamma(y, g_y)\rangle \ge 0,$$
$$\langle g_{\delta,L}(y) - BM_\gamma(y, g_{\delta,L}(y)), T_\gamma(y, g_y) - T_\gamma(y, g_{\delta,L}(y))\rangle \ge 0.$$
Using now (9) in the inner products, multiplying by $\gamma$ and summing, we obtain
$$\langle g_y - BM_\gamma(y, g_y) - g_{\delta,L}(y) + BM_\gamma(y, g_{\delta,L}(y)),\ M_\gamma(y, g_y) - M_\gamma(y, g_{\delta,L}(y))\rangle \ge 0,$$
and assuming that $E$ is endowed with the Euclidean norm $\|\cdot\|_2$ gives
$$\langle g_y - g_{\delta,L}(y), M_\gamma(y, g_y) - M_\gamma(y, g_{\delta,L}(y))\rangle \ge \|M_\gamma(y, g_y) - M_\gamma(y, g_{\delta,L}(y))\|_2^2,$$
from which the desired inequality (10) follows by Cauchy-Schwarz.
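In the Euclidean setting with $B = I$, the minimization in (8) has the closed form $T_\gamma(y, g) = \pi_Q(y - g/\gamma)$, where $\pi_Q$ is the projection onto $Q$. The following sketch (our illustration, for a box-shaped $Q$) computes the gradient mapping and checks inequality (10); since Euclidean projections are 1-Lipschitz, the bound holds automatically here.

```python
import numpy as np

def grad_mapping(y, g, gamma, lo, hi):
    """Gradient mapping (8)-(9) for Q = {x : lo <= x <= hi} and B = I:
    T_gamma(y, g) = proj_Q(y - g/gamma),  M_gamma(y, g) = gamma * (y - T_gamma(y, g))."""
    T = np.clip(y - g / gamma, lo, hi)     # Euclidean projection onto the box
    return T, gamma * (y - T)

# Inequality (10): the mapping is 1-Lipschitz in its gradient argument.
y = np.array([0.3, -0.7]); gamma = 2.0
g_exact = np.array([1.0, 0.5])
g_inexact = g_exact + np.array([0.05, -0.02])
_, M1 = grad_mapping(y, g_exact, gamma, -1.0, 1.0)
_, M2 = grad_mapping(y, g_inexact, gamma, -1.0, 1.0)
assert np.linalg.norm(M1 - M2) <= np.linalg.norm(g_exact - g_inexact) + 1e-12
```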

Characterizing the class of functions that can be endowed with a $(\delta, L)$-oracle is an interesting open question. We provide below some necessary conditions in the simple case where $Q = E$ and $E$ is endowed with the Euclidean norm $\|\cdot\|_2$. First of all, we establish the following inequality:

Theorem 1 If $f$ is equipped with a $(\delta, L)$-oracle, we have
$$\frac{1}{2L}\|g_{\delta,L}(x) - g_{\delta,L}(y)\|_2^2 \le f(y) - f_{\delta,L}(x) - \langle g_{\delta,L}(x), y - x\rangle + \delta \quad \forall x, y \in E.$$

Proof Let $x \in E$ and consider the function $F(y) = f(y) - \langle g_{\delta,L}(x), y\rangle$.

As $(f_{\delta,L}(y), g_{\delta,L}(y))$ is a $(\delta, L)$-oracle for $f$ and $(-\langle g_{\delta,L}(x), y\rangle, -g_{\delta,L}(x))$ is a $(0, 0)$-oracle for $-\langle g_{\delta,L}(x), y\rangle$, the resulting sum of oracles $(F_{\delta,L}(y), G_{\delta,L}(y)) = (f_{\delta,L}(y) - \langle g_{\delta,L}(x), y\rangle,\ g_{\delta,L}(y) - g_{\delta,L}(x))$ is a $(\delta, L)$-oracle for $F(y)$. Using the lower bound in the definition of the oracle, $F_{\delta,L}(x) + \langle G_{\delta,L}(x), y - x\rangle \le F(y)$, valid for any $y$, and the fact that $G_{\delta,L}(x) = 0$, we derive
$$F_{\delta,L}(x) \le F\left(y - \tfrac{1}{L}B^{-1}G_{\delta,L}(y)\right) \le F_{\delta,L}(y) + \left\langle G_{\delta,L}(y), -\tfrac{1}{L}B^{-1}G_{\delta,L}(y)\right\rangle + \frac{L}{2}\left\|\tfrac{1}{L}G_{\delta,L}(y)\right\|_2^2 + \delta = F_{\delta,L}(y) - \frac{1}{2L}\|G_{\delta,L}(y)\|_2^2 + \delta,$$
which allows us to obtain
$$\frac{1}{2L}\|g_{\delta,L}(y) - g_{\delta,L}(x)\|_2^2 \le f_{\delta,L}(y) - f_{\delta,L}(x) - \langle g_{\delta,L}(x), y - x\rangle + \delta \le f(y) - f_{\delta,L}(x) - \langle g_{\delta,L}(x), y - x\rangle + \delta. \qquad \square$$

As a corollary, we have:

Corollary 1 If $f$ is equipped with a $(\delta, L)$-oracle, then we have for all $x, y \in E$
$$\|g_{\delta,L}(x) - g_{\delta,L}(y)\|_2 \le \sqrt{L^2\|x - y\|_2^2 + 4L\delta},$$
and for any $g_x \in \partial f(x)$ and any $g_y \in \partial f(y)$,
$$\|g_x - g_y\|_2 \le (2\sqrt{2} + 2)\sqrt{L\delta} + L\|x - y\|_2.$$


Proof Our first claim directly follows from
$$\frac{1}{2L}\|g_{\delta,L}(x) - g_{\delta,L}(y)\|_2^2 \le f(y) - f_{\delta,L}(x) - \langle g_{\delta,L}(x), y - x\rangle + \delta \overset{(3)}{\le} \frac{L}{2}\|x - y\|_2^2 + 2\delta.$$
Furthermore, for any $g_x \in \partial f(x)$ and $g_y \in \partial f(y)$, we have by (6):
$$\|g_x - g_{\delta,L}(x)\|_2 \le \sqrt{2\delta L}, \qquad \|g_y - g_{\delta,L}(y)\|_2 \le \sqrt{2\delta L},$$
and therefore (using the $\|\cdot\|_2 \le \|\cdot\|_1$ inequality for the last step)
$$\|g_x - g_y\|_2 \le \|g_x - g_{\delta,L}(x)\|_2 + \|g_{\delta,L}(x) - g_{\delta,L}(y)\|_2 + \|g_{\delta,L}(y) - g_y\|_2 \le 2\sqrt{2\delta L} + \sqrt{L^2\|x - y\|_2^2 + 4L\delta} \le (2\sqrt{2} + 2)\sqrt{\delta L} + L\|x - y\|_2. \qquad \square$$

We conclude that the variation of the subgradients of $f$ is locally bounded, i.e.
$$\|g_x - g_y\|_2 \le (2\sqrt{2} + 2)\sqrt{L\delta} + LR \quad \forall x, y \ \text{s.t.}\ \|x - y\|_2 \le R.$$
Note however that this property is true for any subdifferentiable convex function defined on the whole space $E$. Assume now that function $f$ is endowed with a family of $(\delta, L(\delta))$-oracles and consider the following situations:

– $\lim_{\delta \to 0} L(\delta) = \bar{L} < +\infty$. In this case we have $\|g_x - g_y\|_2 \le \bar{L}\|x - y\|_2$, and $f$ must be a smooth convex function with Lipschitz-continuous gradient.

– $\lim_{\delta \to \infty} L(\delta) = 0$ and $\lim_{\delta \to \infty} L(\delta)\delta = \bar{C} < +\infty$, which is the case for example when $L(\delta) = \frac{\bar{C}}{\delta}$. We have $\|g_x - g_y\|_2 \le (2\sqrt{2} + 2)\sqrt{\bar{C}}$, so that $f$ must be a convex function with bounded variation of subgradients.

– $\lim_{\delta \to \infty} L(\delta) = 0$ and $\lim_{\delta \to \infty} L(\delta)\delta = 0$, which would happen for example when $L(\delta) = \frac{\bar{C}}{\delta^2}$. We have in that case $\|g_x - g_y\|_2 \le 0$, and $f$ must be a constant function.


2.3 Examples

To conclude this section, we consider four simple examples of inexact oracles. More sophisticated examples will be given in Section 3.

a. Computations at shifted points. Let function $f \in F_M^{1,1}(Q)$ be endowed with an oracle providing at each point $y \in Q$ the exact values of the function and its gradient, albeit computed at a shifted point $\hat{y}$ different from $y$. Let us show that such an oracle can be converted into a $(\delta, L)$-oracle with
$$\delta = M\|y - \hat{y}\|_E^2, \qquad L = 2M.$$

Convexity of $f$ implies the following inequality for any $x \in Q$:
$$f(x) \ge f(\hat{y}) + \langle \nabla f(\hat{y}), x - \hat{y}\rangle = f(\hat{y}) + \langle \nabla f(\hat{y}), y - \hat{y}\rangle + \langle \nabla f(\hat{y}), x - y\rangle.$$
Therefore, to satisfy the first inequality in (3) we can choose $f_{\delta,L}(y) \overset{\text{def}}{=} f(\hat{y}) + \langle \nabla f(\hat{y}), y - \hat{y}\rangle$ and $g_{\delta,L}(y) \overset{\text{def}}{=} \nabla f(\hat{y})$. In order to prove the second inequality in (3), note that we have for all $x \in Q$
$$f(x) \overset{(2)}{\le} f(\hat{y}) + \langle \nabla f(\hat{y}), x - \hat{y}\rangle + \frac{M}{2}\|x - \hat{y}\|_E^2 = f(\hat{y}) + \langle \nabla f(\hat{y}), y - \hat{y}\rangle + \langle \nabla f(\hat{y}), x - y\rangle + \frac{M}{2}\|x - \hat{y}\|_E^2.$$
Since $\|\cdot\|_E^2$ is a convex function, we have
$$\|x - \hat{y}\|_E^2 = \left\|\tfrac{1}{2}\,2(x - y) + \tfrac{1}{2}\,2(y - \hat{y})\right\|_E^2 \qquad (12)$$
$$\le \tfrac{1}{2}\|2(x - y)\|_E^2 + \tfrac{1}{2}\|2(y - \hat{y})\|_E^2 = 2\|y - \hat{y}\|_E^2 + 2\|x - y\|_E^2. \qquad (13)$$
Therefore,
$$f(x) \le f_{\delta,L}(y) + \langle g_{\delta,L}(y), x - y\rangle + M\|x - y\|_E^2 + M\|y - \hat{y}\|_E^2.$$
We can therefore choose $L = 2M$ and $\delta = M\|y - \hat{y}\|_E^2$ to satisfy the $(\delta, L)$-oracle definition.
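A small Python sketch (ours) of this construction: given exact values at the shifted point $\hat{y}$, it returns the pair $(f(\hat{y}) + \langle \nabla f(\hat{y}), y - \hat{y}\rangle, \nabla f(\hat{y}))$ together with the certified parameters $\delta = M\|y - \hat{y}\|^2$ and $L = 2M$.

```python
import numpy as np

def shifted_oracle(f, grad_f, y, y_hat, M):
    """(delta, L)-oracle built from exact information computed at a shifted
    point y_hat, for f in F^{1,1}_M(Q) (gradient M-Lipschitz).
    Returns (f_dL(y), g_dL(y), delta, L) with delta = M ||y - y_hat||^2, L = 2M."""
    g = grad_f(y_hat)
    f_dL = f(y_hat) + g @ (y - y_hat)
    delta = M * np.dot(y - y_hat, y - y_hat)
    return f_dL, g, delta, 2.0 * M

# Example on f(x) = 1/2 ||x||^2 (so M = 1), queried at a perturbed point.
y = np.array([1.0, 2.0])
y_hat = y + 0.01 * np.random.randn(2)
f_dL, g_dL, delta, L = shifted_oracle(lambda x: 0.5 * x @ x, lambda x: x, y, y_hat, M=1.0)
```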


b. Convex problems with a weaker level of smoothness. Let us show that the notion of $(\delta, L)$-oracle can be useful for solving problems with exact first-order information but with a lower level of smoothness. Let function $f$ be convex and subdifferentiable on $Q$. For each $y \in Q$, denote by $g(y)$ an arbitrary element of the subdifferential $\partial f(y)$. Assume that $f$ satisfies the following Hölder condition:
$$\|g(x) - g(y)\|_{E^*} \le L_\nu\|x - y\|_E^\nu \quad \forall x, y \in Q, \qquad (14)$$
where $\nu \in [0, 1]$ and $L_\nu < +\infty$. This condition leads to the following inequality:
$$f(x) \le f(y) + \langle g(y), x - y\rangle + \frac{L_\nu}{1 + \nu}\|x - y\|_E^{1+\nu} \quad \forall x, y \in Q. \qquad (15)$$
Denote the class of such functions by $F_{L_\nu}^{1,\nu}(Q)$. When $\nu = 1$, we get functions with Lipschitz-continuous gradient. For $\nu < 1$, we get a lower level of smoothness. In particular, when $\nu = 0$, we obtain functions whose subgradients have bounded variation. Clearly, the latter class includes functions whose subgradients are uniformly bounded by $M$ (just take $L_0 = 2M$).

Let us fix $\nu \in [0, 1)$ and an arbitrary $\delta > 0$. We are going to find a constant $A(\delta, \nu)$ such that for any function $f \in F_{L_\nu}^{1,\nu}(Q)$ we have
$$f(x) - f(y) - \langle g(y), x - y\rangle \le \frac{A(\delta, \nu)}{2}\|x - y\|_E^2 + \delta \quad \forall x, y \in Q. \qquad (16)$$
Comparing (15) and (16), we need to choose $A(\delta, \nu)$ such that
$$\frac{L_\nu}{1 + \nu}\|x - y\|_E^{1+\nu} \le \frac{A(\delta, \nu)}{2}\|x - y\|_E^2 + \delta.$$
Since $t = \|x - y\|_E$ can take any nonnegative value, we may choose
$$A(\delta, \nu) = 2\max_{t \ge 0}\left\{\frac{L_\nu}{1 + \nu}t^{\nu - 1} - \delta t^{-2}\right\} = L_\nu\left[\frac{L_\nu}{2\delta}\cdot\frac{1 - \nu}{1 + \nu}\right]^{\frac{1 - \nu}{1 + \nu}}$$
(the latter expression is obtained after straightforward computations, the optimal value of $t$ in the maximization being $t = \left[\frac{2\delta}{L_\nu}\cdot\frac{1 + \nu}{1 - \nu}\right]^{\frac{1}{1 + \nu}}$). This means that the exact first-order information $(f(y), g(y))$ also constitutes an inexact $(\delta, A(\delta, \nu))$-oracle. We will therefore be able to apply the methods of Sections 4 and 5, initially designed for smooth problems, to the minimization of a non-smooth or weakly smooth objective $f$.

For example, for functions with bounded variation of subgradients ($\nu = 0$) we have
$$A(\delta, 0) = \frac{L_0^2}{2\delta}, \qquad (17)$$
so that a $(\delta, \frac{L_0^2}{2\delta})$-oracle is available for all values of $\delta > 0$.

Note that the parameter $\delta$ does not represent an actual accuracy: it can be chosen arbitrarily, independently of the answer of the oracle. In particular, $\delta$ can be chosen as small as we want, at the price of a larger value for the Lipschitz constant $L$ of the $(\delta, L)$-oracle, which grows as $O\left(\left(\frac{1}{\delta}\right)^{\frac{1 - \nu}{1 + \nu}}\right)$. Section 8 will describe the details and consequences of the application of first-order methods of smooth convex optimization to non-smooth or weakly smooth functions.

(11)

Remark 1 This analysis can easily be extended to the case where $\delta$-subgradients with bounded variation are used instead of exact subgradients. We obtain in this case a $(2\delta, A(\delta, \nu))$-oracle.
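For reference, the trade-off (16) is easy to tabulate. The helper below (our illustration) evaluates $A(\delta, \nu) = L_\nu\left[\frac{L_\nu}{2\delta}\cdot\frac{1-\nu}{1+\nu}\right]^{\frac{1-\nu}{1+\nu}}$ and recovers the special case (17).

```python
def A(delta, nu, L_nu):
    """Lipschitz constant of the (delta, A(delta, nu))-oracle obtained from
    the exact first-order information of a Hoelder-smooth function (14)."""
    if nu == 1.0:
        return L_nu                        # already smooth: no inflation needed
    expo = (1.0 - nu) / (1.0 + nu)
    return L_nu * (L_nu / (2.0 * delta) * (1.0 - nu) / (1.0 + nu)) ** expo

# (17): A(delta, 0) = L_0^2 / (2 delta)
assert abs(A(0.1, 0.0, 2.0) - 2.0 ** 2 / (2 * 0.1)) < 1e-12
```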

c. Function smoothed by local averaging. Another typical approach for applying first-order methods of $F_L^{1,1}(E)$ to a non-smooth function consists in smoothing the function by averaging first-order information. Assume that $E$ is endowed with a Euclidean norm and consider a non-smooth convex function $f \in F_M^{1,0}(E)$. Let $r > 0$, $y \in E$, and define
$$f_r(y) = \frac{1}{V_r}\int_{\|z - y\|_2 \le r} f(z)\,dz, \qquad \nabla f_r(y) = g_r(y) = \frac{1}{V_r}\int_{\|z - y\|_2 \le r} g(z)\,dz,$$
where $V_r$ denotes the volume of a Euclidean ball with radius $r$, and $\{g(z) : \|z - y\|_2 \le r\}$ is a measurable selection of subgradients of $f$ in this ball. As $f$ is convex and Lipschitz-continuous with constant $M$, we have
$$0 \le f(x) - f(z) - \langle g(z), x - z\rangle \le M\|x - z\|_2 \quad \forall x, z \in E,$$
and therefore
$$f(x) \ge f(z) + \langle g(z), x - y\rangle + \langle g(z), y - z\rangle \quad \forall x, y, z \in E,$$
$$f(x) \le f(z) + \langle g(z), x - y\rangle + \langle g(z), y - z\rangle + M\|x - z\|_2 \quad \forall x, y, z \in E.$$
Averaging now these two inequalities with respect to $z$ over the ball $\{z : \|z - y\|_2 \le r\}$, we obtain for all $x, y \in E$
$$f(x) \ge f_r(y) + \langle g_r(y), x - y\rangle - Mr,$$
$$f(x) \le f_r(y) + \langle g_r(y), x - y\rangle + Mr + \frac{M}{V_r}\int_{\|z - y\|_2 \le r}\|x - z\|_2\,dz$$
(where we used that $|\langle g(z), y - z\rangle| \le \|g(z)\|_2\|y - z\|_2 \le Mr$). Furthermore, we have
$$\|x - z\|_2 \overset{(13)}{\le} \sqrt{2\|x - y\|_2^2 + 2\|z - y\|_2^2} \le \frac{2\|x - y\|_2^2 + 2\|z - y\|_2^2}{2r} + \frac{r}{2}$$
(where the second inequality comes from the arithmetic-geometric inequality), and therefore
$$f(x) \le f_r(y) + \langle g_r(y), x - y\rangle + \frac{3}{2}Mr + \frac{M\|x - y\|_2^2}{r} + \frac{M}{rV_r}\int_{\|z - y\|_2 \le r}\|z - y\|_2^2\,dz \le f_r(y) + \langle g_r(y), x - y\rangle + \frac{5}{2}Mr + \frac{M\|x - y\|_2^2}{r}.$$
Finally, choosing $f_{\delta,L}(y) = f_r(y) - Mr$, $g_{\delta,L}(y) = g_r(y)$, $\delta = \frac{7Mr}{2}$ and $L = \frac{2M}{r}$, we obtain a $(\delta, L) = \left(\frac{7Mr}{2}, \frac{2M}{r}\right) = \left(\delta, \frac{7M^2}{\delta}\right)$-oracle. Note that the dependence of $L$ on $M$ and $\delta$ is similar to that of the previous example, where subgradients are used directly instead of being averaged.


d. Functions approximated by a smooth function. When a function $f$ can be well approximated by a smooth convex function $\bar{f}$, in the sense that their difference is bounded, the exact values of $\bar{f}$ and its gradient provide an inexact oracle for $f$. Indeed, assume that there exists a smooth convex function $\bar{f} \in F_L^{1,1}(Q)$ such that $\bar{f}$ is a $\delta$-lower approximation of $f$ on all of $Q$, i.e.
$$0 \le f(y) - \bar{f}(y) \le \delta \quad \forall y \in Q.$$
We can then show that
$$f(x) \ge \bar{f}(x) \ge \bar{f}(y) + \langle \nabla\bar{f}(y), x - y\rangle \quad \forall x, y \in Q$$
(using convexity of $\bar{f}$), and
$$f(x) \le \bar{f}(x) + \delta \le \bar{f}(y) + \langle \nabla\bar{f}(y), x - y\rangle + \frac{L}{2}\|x - y\|_E^2 + \delta \quad \forall x, y \in Q$$
(using Lipschitz continuity of the gradient of $\bar{f}$), which shows that $(\bar{f}(y), \nabla\bar{f}(y))$ is a $(\delta, L)$-oracle for $f$.

One might wonder whether all inexact oracles can be obtained in that fashion, i.e. whether any inexact oracle can be seen as an exact oracle for a smooth approximation $\bar{f}$. It turns out that this is not the case: indeed, as we have seen earlier, when $f$ has subgradients with bounded variation, its exact function values and subgradients can be seen as a $(\delta, L)$-oracle (for an arbitrary value of $\delta$). Clearly, such an oracle cannot at the same time be equal to the exact function values and gradients of any smooth function $\bar{f}$.

Finally, note that the above result can be readily extended to the case when the $\delta$-lower approximation $\bar{f}$ is not necessarily smooth but is equipped with an inexact $(\delta_0, L)$-oracle: we can then show that the inexact oracle of $\bar{f}$ also constitutes an inexact $(\delta + \delta_0, L)$-oracle for $f$.
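As a concrete instance of example d (our illustration, not taken from the paper), the soft-max function $\bar{f}(x) = \mu\log\big(\sum_i e^{(a_i^Tx + b_i)/\mu}\big) - \mu\log m$ is a smooth convex $\delta$-lower approximation of $f(x) = \max_i(a_i^Tx + b_i)$ with $\delta = \mu\log m$, and its gradient is Lipschitz-continuous with constant at most $\max_i\|a_i\|_2^2/\mu$, so its exact first-order information is a $(\delta, L)$-oracle for $f$:

```python
import numpy as np
from scipy.special import logsumexp, softmax

def softmax_oracle(A, b, y, mu):
    """(delta, L)-oracle for f(x) = max_i (a_i^T x + b_i), built from the exact
    first-order information of the smooth delta-lower approximation
    f_bar(x) = mu * logsumexp((Ax + b)/mu) - mu * log(m)."""
    m = A.shape[0]
    z = (A @ y + b) / mu
    f_bar = mu * logsumexp(z) - mu * np.log(m)   # satisfies 0 <= f - f_bar <= mu log m
    g_bar = A.T @ softmax(z)                     # gradient of f_bar at y
    delta = mu * np.log(m)
    L = np.max(np.sum(A * A, axis=1)) / mu       # valid Lipschitz constant for grad f_bar
    return f_bar, g_bar, delta, L
```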

3 Inexact oracle for functions defined by an optimization problem

3.1 Accuracy measures for approximate solutions

In this section, we consider smooth convex optimization problems of the form (1) whose objective function is defined by another optimization problem:
$$f(x) = \max_{u \in U}\Psi(x, u), \qquad (18)$$
where $U$ is a convex set of a finite-dimensional space $F$ endowed with the norm $\|\cdot\|_F$, and for any $x \in Q$ the function $\Psi(x, \cdot)$ is smooth and (strongly) concave with concavity parameter $\kappa \ge 0$. Computation of $f$ and its gradient requires the exact solution of this auxiliary problem. However, in practice, such a solution might often be impossible or too costly to compute, so that an approximate solution has to be used instead.


We will measure the accuracy of an approximate solution $u_x$ for problem (18) in three different ways:
$$V_1(u_x) = \max_{u \in U}\langle \nabla_2\Psi(x, u_x), u - u_x\rangle,$$
$$V_2(u_x) = \max_{u \in U}\left[\Psi(x, u) - \Psi(x, u_x) + \frac{\kappa}{2}\|u_x - u\|_F^2\right], \qquad (19)$$
$$V_3(u_x) = \max_{u \in U}\left[\Psi(x, u) - \Psi(x, u_x)\right].$$
Since $\Psi(x, \cdot)$ is strongly concave, we have:
$$\Psi(x, u) \le \Psi(x, u_x) + \langle \nabla_2\Psi(x, u_x), u - u_x\rangle - \frac{\kappa}{2}\|u - u_x\|_F^2 \quad \forall u \in U.$$
Therefore our three measures are related by
$$V_3(u_x) \le V_2(u_x) \le V_1(u_x).$$
For a given level of accuracy $\delta > 0$, the condition $V_1(u_x) \le \delta$ is the strongest, and the condition $V_3(u_x) \le \delta$ is the most relaxed.

We describe below three classes of max-type functions for which an approximate solution of subproblem (18) satisfying one of the conditions $V_i(u_x) \le \delta$ allows the construction of a $(\delta, L)$-oracle.

Let us show first how to satisfy the stopping criteria (19) in practice. The most common criterion is the third one: it amounts to estimating the optimality gap in the value of the objective function, and many optimization methods offer direct control of this criterion. The other criteria might be more difficult to handle. Therefore, let us describe a "brute force" approach designed to satisfy the strongest criterion, $V_1$ (here we assume that $F$ is endowed with a Euclidean norm).

Let $D_U < \infty$ be the diameter of $U$. Let us choose $u_0 \in U$ and form the new function
$$\bar{\Psi}(x, u) = \Psi(x, u) - \tfrac{1}{2}\mu\|u - u_0\|_2^2.$$
Denote by $\bar{V}_i(u)$ the corresponding accuracy measures, and let $\bar{u}_x = \arg\max_{u \in U}\bar{\Psi}(x, u)$. For an approximate solution $u_x$ and any $u \in U$ we obtain
$$0 \ge \langle \nabla_2\bar{\Psi}(x, \bar{u}_x), u - \bar{u}_x\rangle = \langle \nabla_2\bar{\Psi}(x, \bar{u}_x), u_x - \bar{u}_x\rangle + \langle \nabla_2\bar{\Psi}(x, \bar{u}_x), u - u_x\rangle$$
$$\ge -\bar{V}_3(u_x) + \langle \nabla_2\bar{\Psi}(x, \bar{u}_x) - \nabla_2\bar{\Psi}(x, u_x), u - u_x\rangle + \langle \nabla_2\bar{\Psi}(x, u_x), u - u_x\rangle$$
$$\ge -\bar{V}_3(u_x) - \|\nabla_2\bar{\Psi}(x, \bar{u}_x) - \nabla_2\bar{\Psi}(x, u_x)\|_2\,D_U + \langle \nabla_2\bar{\Psi}(x, u_x), u - u_x\rangle,$$
so that $\bar{V}_1(u_x) \le \bar{V}_3(u_x) + \|\nabla_2\bar{\Psi}(x, \bar{u}_x) - \nabla_2\bar{\Psi}(x, u_x)\|_2\,D_U$. Since $\nabla_2\bar{\Psi}(x, \cdot)$ is Lipschitz-continuous on $U$ with constant $L$, we get
$$\frac{1}{2L}\left\|\nabla_2\bar{\Psi}(x, u_x) - \nabla_2\bar{\Psi}(x, \bar{u}_x)\right\|_2^2 \le \bar{\Psi}(x, \bar{u}_x) - \bar{\Psi}(x, u_x) + \langle \nabla_2\bar{\Psi}(x, \bar{u}_x), u_x - \bar{u}_x\rangle \le \bar{\Psi}(x, \bar{u}_x) - \bar{\Psi}(x, u_x) = \bar{V}_3(u_x),$$
and therefore, since $\nabla_2\Psi(x, u_x) = \nabla_2\bar{\Psi}(x, u_x) + \mu(u_x - u_0)$ implies $V_1(u_x) \le \bar{V}_1(u_x) + \mu D_U^2$, we obtain
$$V_1(u_x) \le \bar{V}_1(u_x) + \mu D_U^2 \le \bar{V}_3(u_x) + D_U\left[2L\bar{V}_3(u_x)\right]^{1/2} + \mu D_U^2.$$
Thus, if we choose $\mu = \frac{\delta}{3D_U^2}$, we can reach the desired level of $V_1(u_x)$ by ensuring $\bar{V}_3(u_x) \le \frac{\delta^2}{18LD_U^2}$. Note that the function $\bar{\Psi}(x, \cdot)$ is strongly concave. Therefore, the complexity of its maximization in the scale $\bar{V}_3$ depends logarithmically on the desired accuracy. If this is done, for example, by a FGM, it requires at most $O\left(\frac{L^{1/2}}{\delta^{1/2}}\ln\frac{1}{\delta}\right)$ iterations (see Section 2.2 in [14]).

3.2 Functions obtained by smoothing techniques

Let $U$ be a closed convex set of a finite-dimensional space $F$ endowed with the norm $\|\cdot\|_F$, and let
$$\Psi(x, u) = G(u) + \langle Au, x\rangle,$$
where $A : F \to E^*$ is a linear operator and $G(u)$ is a differentiable, strongly concave function with concavity parameter $\kappa > 0$. Under these assumptions, optimization problem (18) has a unique optimal solution $u_x$. Moreover, $f$ is convex and smooth with Lipschitz-continuous gradient $\nabla f(x) = Au_x$. The corresponding Lipschitz constant is equal to
$$L(f) = \frac{1}{\kappa}\|A\|_{F \to E^*}^2, \qquad (20)$$
where $\|A\|_{F \to E^*} = \max\{\|Au\|_{E^*} : \|u\|_F = 1\}$. The importance of this class of functions is justified by the smoothing approach for non-smooth convex optimization (see [15, 16, 17, 4]).

Suppose that for all $y \in Q$ we can find a point $\bar{u}_y \in U$ satisfying the condition
$$V_3(\bar{u}_y) = \Psi(y, u_y) - \Psi(y, \bar{u}_y) \le \frac{\delta}{2}, \qquad (21)$$
where $u_y$ denotes the exact solution of (18) at $y$. Let us show that this allows us to construct a $(\delta, 2L(f))$-oracle. Indeed, since $\Psi(\cdot, u)$ is convex for all $u \in U$, we have
$$f(x) = \Psi(x, u_x) \ge \Psi(x, \bar{u}_y) \ge \Psi(y, \bar{u}_y) + \langle \nabla_1\Psi(y, \bar{u}_y), x - y\rangle = f_{\delta,L}(y) + \langle g_{\delta,L}(y), x - y\rangle, \qquad (22)$$
where $f_{\delta,L}(y) \overset{\text{def}}{=} \Psi(y, \bar{u}_y)$, $g_{\delta,L}(y) \overset{\text{def}}{=} \nabla_1\Psi(y, \bar{u}_y) = A\bar{u}_y$, and $L$ will be specified later. Further, note that
$$\langle \nabla_1\Psi(y, u_y), x - y\rangle = \langle g_{\delta,L}(y), x - y\rangle + \langle A(u_y - \bar{u}_y), x - y\rangle. \qquad (23)$$
Since $f$ has a Lipschitz-continuous gradient, we have:
$$f(x) \le f(y) + \langle \nabla f(y), x - y\rangle + \frac{L(f)}{2}\|x - y\|_E^2 = f(y) + \langle \nabla_1\Psi(y, u_y), x - y\rangle + \frac{L(f)}{2}\|x - y\|_E^2 \overset{(23)}{=} f(y) + \langle g_{\delta,L}(y), x - y\rangle + \frac{L(f)}{2}\|x - y\|_E^2 + \langle A(u_y - \bar{u}_y), x - y\rangle.$$
On the other hand, we have:
$$\langle A(u_y - \bar{u}_y), x - y\rangle \le \|u_y - \bar{u}_y\|_F\,\|A\|_{F \to E^*}\|x - y\|_E \overset{(20)}{\le} \frac{\kappa}{2}\|u_y - \bar{u}_y\|_F^2 + \frac{L(f)}{2}\|x - y\|_E^2.$$
Therefore,
$$f(x) \le f(y) + \langle g_{\delta,L}(y), x - y\rangle + L(f)\|x - y\|_E^2 + \frac{\kappa}{2}\|u_y - \bar{u}_y\|_F^2.$$
Since $\Psi(y, \cdot)$ is strongly concave, $\frac{\kappa}{2}\|u_y - \bar{u}_y\|_F^2 \le \Psi(y, u_y) - \Psi(y, \bar{u}_y)$. Thus,
$$f(x) \le \Psi(y, \bar{u}_y) + 2\big(\Psi(y, u_y) - \Psi(y, \bar{u}_y)\big) + \langle g_{\delta,L}(y), x - y\rangle + L(f)\|x - y\|_E^2.$$
In view of conditions (21) and (22), we have proved that the pair $(\Psi(y, \bar{u}_y), A\bar{u}_y)$, with $\bar{u}_y$ satisfying condition (21), corresponds to a $(\delta, L)$-oracle with $L = 2L(f)$.

3.3 Moreau-Yosida regularization

In this section, we consider functions of the form
$$f(x) = \min_{u \in U}\left\{\mathcal{L}(x, u) \overset{\text{def}}{=} h(u) + \frac{\kappa}{2}\|u - x\|_2^2\right\}, \qquad (24)$$
where $h$ is a smooth convex function on a convex set $U \subset \mathbb{R}^n$ endowed with the usual Euclidean norm $\|x\|_2^2 = \langle x, x\rangle$. The function $f$ is convex with Lipschitz-continuous gradient $\nabla f(x) = \kappa(x - u_x)$, where $u_x$ denotes the unique optimal solution of problem (24). The Lipschitz constant of the gradient is equal to $\kappa$.

Instead of solving problem (24) exactly, we compute a feasible solution $\bar{u}_x$ satisfying
$$V_2(\bar{u}_x) = \max_{u \in U}\left\{\mathcal{L}(x, \bar{u}_x) - \mathcal{L}(x, u) + \frac{\kappa}{2}\|u - \bar{u}_x\|_2^2\right\} \le \delta. \qquad (25)$$
(Since $\mathcal{L}$ is convex in $u$, we inverted the sign in the definition of $V_2$ in (19).) Let us show that for all $x \in Q$ the objects
$$f_{\delta,L}(x) = \mathcal{L}(x, \bar{u}_x) - \delta = h(\bar{u}_x) + \frac{\kappa}{2}\|\bar{u}_x - x\|_2^2 - \delta, \qquad g_{\delta,L}(x) = \nabla_1\mathcal{L}(x, \bar{u}_x) = \kappa(x - \bar{u}_x) \qquad (26)$$
correspond to an answer of a $(\delta, L)$-oracle with $L = \kappa$. Indeed,
$$f(x) = \mathcal{L}(x, u_x) = \mathcal{L}(y, u_x) + \frac{\kappa}{2}\langle y - x, 2u_x - x - y\rangle$$
$$\overset{(25)}{\ge} \mathcal{L}(y, \bar{u}_y) + \frac{\kappa}{2}\|u_x - \bar{u}_y\|_2^2 - \delta + \frac{\kappa}{2}\langle y - x, 2u_x - x - y\rangle$$
$$= \mathcal{L}(y, \bar{u}_y) + \kappa\langle y - \bar{u}_y, x - y\rangle + \frac{\kappa}{2}\|u_x - \bar{u}_y\|_2^2 - \delta + \frac{\kappa}{2}\langle y - x, 2u_x - 2\bar{u}_y + y - x\rangle$$
$$= \mathcal{L}(y, \bar{u}_y) + \kappa\langle y - \bar{u}_y, x - y\rangle - \delta + \frac{\kappa}{2}\left[\|u_x - \bar{u}_y\|_2^2 + \|y - x\|_2^2 + 2\langle y - x, u_x - \bar{u}_y\rangle\right]$$
$$\ge \mathcal{L}(y, \bar{u}_y) + \kappa\langle y - \bar{u}_y, x - y\rangle - \delta.$$
Thus, we satisfy the first inequality in (3) with the values defined by (26).

Further, for all $x, y \in Q$ we have
$$f(x) = h(u_x) + \frac{\kappa}{2}\|u_x - x\|_2^2 \le h(\bar{u}_y) + \frac{\kappa}{2}\|\bar{u}_y - x\|_2^2 = h(\bar{u}_y) + \frac{\kappa}{2}\|\bar{u}_y - y\|_2^2 + \frac{\kappa}{2}\langle x - y, x + y - 2\bar{u}_y\rangle = \mathcal{L}(y, \bar{u}_y) + \kappa\langle y - \bar{u}_y, x - y\rangle + \frac{\kappa}{2}\|y - x\|_2^2.$$
Thus, in view of definition (26), we have proved the second inequality in (3) with $L = \kappa$.

3.4 Functions defined by Augmented Lagrangians

Consider the following convex problem:
$$\max_{u \in U}\{h(u) : Au = 0\}, \qquad (27)$$
where $h$ is a smooth concave function on the convex set $U \subset F$, $F$ is a finite-dimensional space, and $A : F \to E^*$ is a linear operator. Let $E^*$ be endowed with the Euclidean norm $\|\cdot\|_2$. In the Augmented Lagrangian approach, we need to solve the dual problem
$$\min_{x \in E} f(x), \qquad (28)$$
where
$$f(x) \overset{\text{def}}{=} \max_{u \in U}\left[\Psi(x, u) \overset{\text{def}}{=} h(u) + \langle Au, x\rangle - \frac{\kappa}{2}\|Au\|_2^2\right]. \qquad (29)$$
It is well-known that $f$ is a convex smooth function with Lipschitz-continuous gradient
$$\nabla f(x) = Au_x,$$
where $u_x$ denotes any optimal solution of the optimization problem (29). The Lipschitz constant of the gradient is equal to $\frac{1}{\kappa}$.

Assume that, instead of solving (29) exactly, we compute an approximate solution $\bar{u}_x \in U$ such that
$$V_1(\bar{u}_x) = \max_{u \in U}\langle \nabla_2\Psi(x, \bar{u}_x), u - \bar{u}_x\rangle = \max_{u \in U}\langle \nabla h(\bar{u}_x) + A^Tx - \kappa A^TB^{-1}A\bar{u}_x, u - \bar{u}_x\rangle \le \delta. \qquad (30)$$
Let us show that the objects
$$f_{\delta,L}(x) = \Psi(x, \bar{u}_x), \qquad g_{\delta,L}(x) = \nabla_1\Psi(x, \bar{u}_x) = A\bar{u}_x \qquad (31)$$
correspond to a $(\delta, L)$-oracle with $L = \frac{1}{\kappa}$. Indeed, for all $x, y \in E$ we have
$$f(x) = \max_{u \in U}\left[h(u) + \langle Au, x\rangle - \frac{\kappa}{2}\|Au\|_2^2\right] \ge h(\bar{u}_y) + \langle A\bar{u}_y, x\rangle - \frac{\kappa}{2}\|A\bar{u}_y\|_2^2 = \Psi(y, \bar{u}_y) + \langle A\bar{u}_y, x - y\rangle.$$
Thus, in view of definition (31), the first inequality in (3) is proved. Further,
$$f(x) \le \max_{u \in U}\left\{h(\bar{u}_y) + \langle \nabla h(\bar{u}_y), u - \bar{u}_y\rangle + \langle Au, x\rangle - \frac{\kappa}{2}\|Au\|_2^2\right\}$$
$$\overset{(30)}{\le} \max_{u \in U}\left\{h(\bar{u}_y) - \langle A^Ty - \kappa A^TB^{-1}A\bar{u}_y, u - \bar{u}_y\rangle + \langle Au, x\rangle - \frac{\kappa}{2}\|Au\|_2^2\right\} + \delta$$
$$= \Psi(y, \bar{u}_y) + \langle A\bar{u}_y, x - y\rangle + \max_{u \in U}\left\{\langle A(u - \bar{u}_y), x - y\rangle - \frac{\kappa}{2}\|A(u - \bar{u}_y)\|_2^2\right\} + \delta.$$
Since $\langle s, x - y\rangle - \frac{\kappa}{2}\|s\|_2^2 \le \frac{1}{2\kappa}\|x - y\|_2^2$ for any $s$, the last maximum is bounded by $\frac{1}{2\kappa}\|x - y\|_2^2$. Thus, in view of (31), we have proved the second inequality in (3) with $L = \frac{1}{\kappa}$.
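A sketch (ours) of this construction with the toy choices $h(u) = -\frac{1}{2}\|u - a\|_2^2$, $B = I$ and a box $U$, for which $V_1$ — the maximum of a linear function over a box — is available in closed form:

```python
import numpy as np

def augmented_lagrangian_oracle(A, a, kappa, x, lo, hi, n_inner=50):
    """(V1, 1/kappa)-oracle (31) for the dual function (29), with the toy
    choices h(u) = -1/2 ||u - a||^2, B = I and U = {u : lo <= u <= hi}.
    The inner problem is solved inexactly by projected gradient ascent."""
    grad = lambda u: -(u - a) + A.T @ x - kappa * A.T @ (A @ u)  # grad_2 Psi(x, u)
    step = 1.0 / (1.0 + kappa * np.linalg.norm(A, 2) ** 2)
    u = np.clip(a, lo, hi)
    for _ in range(n_inner):
        u = np.clip(u + step * grad(u), lo, hi)
    d = grad(u)
    V1 = np.sum(np.maximum(d * (lo - u), d * (hi - u)))  # max of <d, v - u> over the box
    f_dL = -0.5 * np.dot(u - a, u - a) + (A @ u) @ x - 0.5 * kappa * np.dot(A @ u, A @ u)
    return f_dL, A @ u, V1, 1.0 / kappa
```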

4 Gradient methods with inexact oracle

Consider problem (1), where $f$ is endowed with an inexact $(\delta, L)$-oracle. In this section, we will use the Euclidean norm $\|\cdot\|_2$. As usual when dealing with constrained problems, we assume that the gradient mapping $T_L(x, g)$ is computable for any $x \in Q$ and $g \in E^*$; see (8).

4.1 Primal gradient method

The classical (primal) gradient method can be adapted in a straightforward manner to accept first-order information from an inexact oracle: it is enough to replace the true gradient by its approximate counterpart $g_{\delta,L}$. Moreover, we allow the parameters $(\delta_k, L_k)$ of the inexact oracle to be different at each iteration $k$. We obtain:

Initialization: Choose $x_0 \in Q$.
Iteration ($k \ge 0$):
1. Choose $\delta_k$ and $L_k$.
2. Compute $(f_{\delta_k,L_k}(x_k), g_{\delta_k,L_k}(x_k))$.
3. Compute $x_{k+1} = T_{L_k}(x_k, g_{\delta_k,L_k}(x_k))$.   (PGM)
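A minimal Python sketch of (PGM) (ours), assuming an oracle returning `(f_dL, g_dL, delta, L)` and a Euclidean projection onto $Q$, so that $T_L(x, g) = \pi_Q(x - g/L)$; it returns the averaged point $\hat{x}_k$ analyzed after Theorem 2 below.

```python
import numpy as np

def primal_gradient_method(oracle, project, x0, n_iters):
    """(PGM) with an inexact oracle and a Euclidean projection onto Q, so that
    T_L(x, g) = proj_Q(x - g/L); returns the averaged point hat{x}_k."""
    x = x0.copy()
    num = np.zeros_like(x0); den = 0.0
    for _ in range(n_iters):
        _, g, _, L = oracle(x)                 # step 2: inexact first-order answer
        x = project(x - g / L)                 # step 3: x_{k+1} = T_{L_k}(x_k, g_k)
        num += x / L; den += 1.0 / L
    return num / den                           # hat{x}_k, weighted by 1/L_i
```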

Theorem 2 For $k \ge 1$, we have
$$\sum_{i=0}^{k-1}\frac{1}{L_i}\left[f(x_{i+1}) - f(x^*)\right] \le \frac{1}{2}\|x_0 - x^*\|_2^2 + \sum_{i=0}^{k-1}\frac{\delta_i}{L_i}. \qquad (32)$$

Proof Denote $r_k = \|x_k - x^*\|_2$, $f_k = f_{\delta_k,L_k}(x_k)$, and $g_k = g_{\delta_k,L_k}(x_k)$. Then
$$r_{k+1}^2 = r_k^2 + 2\langle B(x_{k+1} - x_k), x_{k+1} - x^*\rangle - \|x_{k+1} - x_k\|_2^2$$
$$\overset{(11)}{\le} r_k^2 + \frac{2}{L_k}\langle g_k, x^* - x_{k+1}\rangle - \|x_{k+1} - x_k\|_2^2$$
$$= r_k^2 + \frac{2}{L_k}\langle g_k, x^* - x_k\rangle - \frac{2}{L_k}\left[\langle g_k, x_{k+1} - x_k\rangle + \frac{L_k}{2}\|x_{k+1} - x_k\|_2^2\right]$$
$$\overset{(3)}{\le} r_k^2 + \frac{2}{L_k}\left[f(x^*) - f_k\right] - \frac{2}{L_k}\left[f(x_{k+1}) - f_k - \delta_k\right].$$
Summing up these inequalities for $i = 0, \ldots, k - 1$, we obtain (32). $\square$

Summing up these inequalities fori= 0, . . . , k−1, we obtain (32). ut When exact first-order information is used (δi = 0, Li = L), it is well- known that sequence{f(xi)}must be decreasing. This is no longer true when an inexact oracle is used. Therefore, let us define

ˆ xk =

Pk−1 i=0L−1i xi+1 Pk−1

i=0L−1i ∈ Q.

Sincef is convex, we have

f(ˆxk)−f(x)≤

1

2kx0−xk22+Pk−1 i=0L−1i δi

Pk−1

i=0L−1i . (33)

In the case when the oracle accuracy is constant (δi=δ,Li=L), we have f(ˆxk)−f(x)≤ LR2k2 +δ, R def= kx0−xk2. (34) Thus, there is no error accumulation, and the upper bound for the objective function accuracy decreases with k and asymptotically tends toδ. Hence, if an accuracy on the objective function is required (with > δ),k = 2(−δ)LR2 iterations are sufficient. In particular, we see that PGM allows the oracle accuracy to be of the same order as the desired accuracy for the objective function.


4.2 Dual gradient method

This method [18] generates two sequences $\{x_k\}_{k \ge 0}$ and $\{y_k\}_{k \ge 0}$:

Initialization: Choose $x_0 \in Q$.
Iteration ($k \ge 0$):
1. Choose $\delta_k$ and $L_k$.
2. Compute $(f_{\delta_k,L_k}(x_k), g_{\delta_k,L_k}(x_k))$.
3. Compute $x_{k+1} = \arg\min_{x \in Q}\left\{\sum_{i=0}^{k}\frac{1}{L_i}\langle g_{\delta_i,L_i}(x_i), x - x_i\rangle + \frac{1}{2}\|x - x_0\|_2^2\right\}$.   (35)

Define $y_k = T_{L_k}(x_k, g_{\delta_k,L_k}(x_k))$, $k \ge 0$.
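With the Euclidean prox term, the minimization (35) reduces to projecting $x_0$ minus the weighted sum of the inexact gradients; a minimal sketch (ours, same conventions as the PGM sketch above):

```python
import numpy as np

def dual_gradient_method(oracle, project, x0, n_iters):
    """Method (35) with Euclidean prox term: x_{k+1} = proj_Q(x0 - sum_i g_i/L_i).
    Returns the averaged point hat{y}_k of (38)."""
    x = x0.copy()
    s = np.zeros_like(x0)                      # running sum of g_i / L_i
    num = np.zeros_like(x0); den = 0.0
    for _ in range(n_iters):
        _, g, _, L = oracle(x)
        y = project(x - g / L)                 # y_k = T_{L_k}(x_k, g_k)
        s += g / L
        x = project(x0 - s)                    # x_{k+1} from (35)
        num += y / L; den += 1.0 / L
    return num / den
```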

Theorem 3 For any $k \ge 0$ we have
$$\sum_{i=0}^{k}\frac{1}{L_i}\left[f(y_i) - f(x^*)\right] \le \frac{1}{2}\|x_0 - x^*\|_2^2 + \sum_{i=0}^{k}\frac{\delta_i}{L_i}. \qquad (36)$$

Proof For $k \ge 0$, denote $f_k = f_{\delta_k,L_k}(x_k)$, $g_k = g_{\delta_k,L_k}(x_k)$, and
$$\psi_k(x) = \sum_{i=0}^{k}\frac{1}{L_i}\left[f_i + \langle g_i, x - x_i\rangle\right] + \frac{1}{2}\|x - x_0\|_2^2, \qquad \psi_k^* = \min_{x \in Q}\psi_k(x).$$
In view of the first inequality in (3), we have for all $x \in Q$
$$\psi_k^* \le \psi_k(x) \le \sum_{i=0}^{k}\frac{1}{L_i}f(x) + \frac{1}{2}\|x - x_0\|_2^2. \qquad (37)$$
Let us prove that $\psi_k^* \ge \sum_{i=0}^{k}\frac{1}{L_i}\left[f(y_i) - \delta_i\right]$. Indeed, this inequality is valid for $k = 0$:
$$f(y_0) \overset{(3)}{\le} f_0 + \langle g_0, y_0 - x_0\rangle + \frac{L_0}{2}\|y_0 - x_0\|_2^2 + \delta_0 = L_0\psi_0^* + \delta_0.$$
Assume it is valid for some $k \ge 0$. Since $\psi_k(x)$ is strongly convex with parameter 1 and is minimized by $x_{k+1}$, we have:
$$\psi_k(x) \ge \psi_k^* + \frac{1}{2}\|x - x_{k+1}\|_2^2, \quad x \in Q.$$
Therefore,
$$\psi_{k+1}^* = \min_{x \in Q}\left\{\psi_k(x) + \frac{1}{L_{k+1}}\left[f_{k+1} + \langle g_{k+1}, x - x_{k+1}\rangle\right]\right\}$$
$$\ge \psi_k^* + \frac{1}{L_{k+1}}\min_{x \in Q}\left\{f_{k+1} + \langle g_{k+1}, x - x_{k+1}\rangle + \frac{L_{k+1}}{2}\|x - x_{k+1}\|_2^2\right\}$$
$$\overset{(3)}{\ge} \psi_k^* + \frac{1}{L_{k+1}}\left(f(y_{k+1}) - \delta_{k+1}\right).$$
Hence, using our inductive assumption, we conclude that $\psi_k^* \ge \sum_{i=0}^{k}\frac{1}{L_i}\left[f(y_i) - \delta_i\right]$ for all $k \ge 0$. To conclude, we combine this fact with inequality (37) for $x = x^*$. $\square$

As for the primal gradient method, we define
$$\hat{y}_k = \frac{\sum_{i=0}^{k}L_i^{-1}y_i}{\sum_{i=0}^{k}L_i^{-1}} \in Q,$$
and obtain the same upper bound
$$f(\hat{y}_k) - f(x^*) \le \frac{\frac{1}{2}\|x_0 - x^*\|_2^2 + \sum_{i=0}^{k}L_i^{-1}\delta_i}{\sum_{i=0}^{k}L_i^{-1}}, \quad k \ge 0. \qquad (38)$$
Since we obtain the same convergence results for both the primal and the dual gradient method, we will refer to both as Classical Gradient Methods (CGM) in the rest of this paper.

5 Fast gradient method with inexact oracle

5.1 Convergence analysis

In this section, we adapt one of the latest versions of the Fast Gradient Method (FGM) developed in [15]. Let $d(x)$ be a prox-function, differentiable and strongly convex on $Q$, and let $x_0 = \arg\min_{x \in Q}d(x)$ be its prox-center. Translating and scaling $d$ if necessary, we can always ensure that
$$d(x_0) = 0, \qquad d(x) \ge \frac{1}{2}\|x - x_0\|_E^2 \quad \forall x \in Q \qquad (39)$$
(here $\|\cdot\|_E$ denotes any norm on $E$). Let $\{\alpha_k\}_{k=0}^{\infty}$ be a sequence of reals such that
$$\alpha_0 \in (0, 1], \qquad \frac{\alpha_k^2}{L_k} \le A_k \overset{\text{def}}{=} \sum_{i=0}^{k}\frac{\alpha_i}{L_i}, \quad k \ge 0. \qquad (40)$$
Define $\tau_k = \frac{\alpha_{k+1}}{A_{k+1}L_{k+1}}$, $k \ge 0$, and consider the following method:

Initialization: Choose $\delta_0$, $L_0$, and $x_0 = \arg\min_{x \in Q}d(x)$.
Iteration ($k \ge 0$):
1. Compute $(f_{\delta_k,L_k}(x_k), g_{\delta_k,L_k}(x_k))$.
2. Compute $y_k = T_{L_k}(x_k, g_{\delta_k,L_k}(x_k))$.
3. Compute $z_k = \arg\min_{x \in Q}\left\{d(x) + \sum_{i=0}^{k}\frac{\alpha_i}{L_i}\langle g_{\delta_i,L_i}(x_i), x - x_i\rangle\right\}$.
4. Choose $\delta_{k+1}$ and $L_{k+1}$. Define $x_{k+1} = \tau_k z_k + (1 - \tau_k)y_k$.   (FGM)

Denote $\psi_k^* = \min_{x \in Q}\left\{d(x) + \sum_{i=0}^{k}\frac{\alpha_i}{L_i}\left[f_{\delta_i,L_i}(x_i) + \langle g_{\delta_i,L_i}(x_i), x - x_i\rangle\right]\right\}$.
