Proving Convergence of Descent Algorithm

Convergence of Descent Algorithms: Generic and Specific Cases

Generic Convergence of Descent Algorithm

Consider the general descent algorithm ($\nabla^T f(x_k)\,\Delta x_k < 0$ for each $k$) with each step

\[
x_{k+1} = x_k + t_k\,\Delta x_k.
\]

Suppose $f$ is bounded below in $\Re^n$, $f$ is continuously differentiable in an open set $N$ containing the level set $\{x \mid f(x) \le f(x_0)\}$, and $\nabla f$ is Lipschitz continuous with constant $L$. Then

\[
\sum_{k=1}^{\infty} \frac{\left(\nabla^T f(x_k)\,\Delta x_k\right)^2}{\|\Delta x_k\|^2} < \infty \qquad \text{(that is, the sum is finite).}
\]

Each term of the sum is the square of a normalized directional derivative; overall, the theorem says that the sum of squares of normalized directional derivatives is finite.

Proof:

For any descent algorithm, $\nabla^T f(x_k)\,\Delta x_k < 0$ for each $k$, with each step

\[
x_{k+1} = x_k + t_k\,\Delta x_k.
\]

From the second Strong Wolfe condition,

\[
\left|\nabla^T f(x_k + t_k\,\Delta x_k)\,\Delta x_k\right| \le c_2 \left|\nabla^T f(x_k)\,\Delta x_k\right| \tag{38}
\]

Proving Convergence of Descent Algorithm

Since $c_2 > 0$ and $\nabla^T f(x_k)\,\Delta x_k < 0$,

\[
\nabla^T f(x_k + t_k\,\Delta x_k)\,\Delta x_k \ge c_2\,\nabla^T f(x_k)\,\Delta x_k \tag{39}
\]

Subtracting $\nabla^T f(x_k)\,\Delta x_k$ from both sides of (39),

\[
\left[\nabla f(x_k + t_k\,\Delta x_k) - \nabla f(x_k)\right]^T \Delta x_k \ge (c_2 - 1)\,\nabla^T f(x_k)\,\Delta x_k \tag{40}
\]

By the Cauchy-Schwarz inequality and Lipschitz continuity of $\nabla f$,

\[
\left[\nabla f(x_k + t_k\,\Delta x_k) - \nabla f(x_k)\right]^T \Delta x_k \le \left\|\nabla f(x_k + t_k\,\Delta x_k) - \nabla f(x_k)\right\| \|\Delta x_k\| \le L\,t_k\,\|\Delta x_k\|^2 \tag{41}
\]

Proving Convergence of Descent Algorithm (contd.)

Combining (40) and (41),

\[
t_k \ge \frac{c_2 - 1}{L}\,\frac{\nabla^T f(x_k)\,\Delta x_k}{\|\Delta x_k\|^2} \tag{42}
\]

Substituting (42) into the first Wolfe condition $f(x_k + t\,\Delta x_k) < f(x_k) + c_1\,t\,\nabla^T f(x_k)\,\Delta x_k$ (while recalling that $\nabla^T f(x_k)\,\Delta x_k < 0$),

\[
f(x_{k+1}) < f(x_k) - c_1\,\frac{1 - c_2}{L}\,\frac{\left(\nabla^T f(x_k)\,\Delta x_k\right)^2}{\|\Delta x_k\|^2} \tag{43}
\]

Substituting $c = c_1\,\frac{1 - c_2}{L}$ (note that $c > 0$, since $0 < c_1 < c_2 < 1$) and applying (43) successively,

\[
f(x_{k+1}) < f(x_0) - c \sum_{i=0}^{k} \frac{\left(\nabla^T f(x_i)\,\Delta x_i\right)^2}{\|\Delta x_i\|^2} \tag{44}
\]

Proving Convergence of Descent Algorithm (contd.)

Taking limits of (44) as $k \to \infty$,

\[
\lim_{k\to\infty}\, c \sum_{i=0}^{k} \frac{\left(\nabla^T f(x_i)\,\Delta x_i\right)^2}{\|\Delta x_i\|^2} \;\le\; \lim_{k\to\infty}\big(f(x_0) - f(x_{k+1})\big) \;<\; \infty \tag{45}
\]

where the last inequality holds because the descent algorithm proceeds only if $f(x_{k+1}) \le f(x_k)$, and we have assumed that $f$ is bounded below in $\Re^n$. This proves finiteness of the summation.

Thus,

\[
\lim_{k\to\infty} \frac{\nabla^T f(x_k)\,\Delta x_k}{\|\Delta x_k\|} = 0.
\]

If we additionally assume that the descent direction never becomes orthogonal to the gradient, i.e., $\frac{\left|\nabla^T f(x_k)\,\Delta x_k\right|}{\|\nabla f(x_k)\|\,\|\Delta x_k\|} \ge \Gamma$ for some $\Gamma > 0$, then we can show (making use of the Cauchy-Schwarz inequality) that

\[
\lim_{k\to\infty} \|\nabla f(x_k)\| = 0.
\]

This shows convergence for a generic descent algorithm, but it says nothing about how quickly $\|\nabla f(x_k)\|$ goes to zero, that is, for what $k$ it becomes small. What we are more interested in, however, is the rate of convergence of specific descent algorithms.
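To make the theorem concrete, here is a minimal numerical check (a sketch of ours, not code from the notes): run a descent method whose steps are chosen by a strong-Wolfe line search (scipy.optimize.line_search) and accumulate the partial sums of squared normalized directional derivatives. On the assumed convex quadratic below, which is bounded below and has a Lipschitz-continuous gradient, the accumulated sum stays bounded, as (44)-(45) predict. The matrix Q, the starting point, and the constants c1, c2 are illustrative assumptions.

```python
# Sketch: the sum of squared normalized directional derivatives stays finite
# when the steps satisfy the (strong) Wolfe conditions.  Test problem is assumed.
import numpy as np
from scipy.optimize import line_search  # strong-Wolfe line search

Q = np.array([[3.0, 0.5], [0.5, 1.0]])           # positive definite: f bounded below
f = lambda z: 0.5 * z @ Q @ z                    # f(x) = 0.5 x^T Q x
grad = lambda z: Q @ z                           # Lipschitz-continuous gradient

x = np.array([4.0, -2.0])                        # x_0 (assumed)
partial_sum = 0.0
for k in range(50):
    g = grad(x)
    dx = -g                                      # a descent direction: grad^T dx < 0
    t = line_search(f, grad, x, dx, c1=1e-4, c2=0.9)[0]
    if t is None:                                # line search failed; stop
        break
    partial_sum += (g @ dx) ** 2 / (dx @ dx)     # (grad^T f(x_k) dx_k)^2 / ||dx_k||^2
    x = x + t * dx
    print(k, partial_sum)                        # partial sums remain bounded
```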

But to discuss the rate of convergence (as against an abstract notion of convergence), we will need to assume (a) convexity and (b) a specific form of descent algorithm. Concretely, we study the rate of convergence of GRADIENT DESCENT for convex functions, under the Strong Wolfe conditions and Lipschitz continuity of the gradient.

General Algorithm: Steepest Descent (contd.)

Find a starting point $x^{(0)} \in D$.
repeat
  1. Set $\Delta x^{(k)} = \operatorname{argmin}\{\nabla^T f(x^{(k)})\,v \mid \|v\| = 1\}$.
  2. Choose a step size $t^{(k)} > 0$ using exact or backtracking ray search.
  3. Obtain $x^{(k+1)} = x^{(k)} + t^{(k)}\,\Delta x^{(k)}$.
  4. Set $k = k + 1$.
until stopping criterion (such as $\|\nabla f(x^{(k+1)})\| \le \epsilon$) is satisfied.

Figure 9: The steepest descent algorithm.

Two examples of the steepest descent method are the gradient descent method (for the Euclidean or $L_2$ norm) and the coordinate-descent method (for the $L_1$ norm). One fact, however, is that no two norms can give exactly opposite steepest descent directions, though they may point in different directions.
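As an illustration of the two special cases just mentioned (a sketch of ours, not from the notes), the normalized steepest descent direction has a closed form under both norms: $-\nabla f(x)/\|\nabla f(x)\|_2$ for the Euclidean norm, and $-\operatorname{sign}(\partial f/\partial x_i)\,u_i$, with $i$ a coordinate of largest absolute partial derivative, for the $L_1$ norm. The helper name and the example gradient below are assumptions.

```python
# Sketch: normalized steepest descent direction argmin { g^T v : ||v|| = 1 }
# under the L2 and L1 norms, for an assumed gradient vector g.
import numpy as np

def normalized_steepest_direction(g, norm="l2"):
    if norm == "l2":                        # gradient descent direction
        return -g / np.linalg.norm(g)
    if norm == "l1":                        # coordinate descent direction
        i = np.argmax(np.abs(g))            # coordinate with largest |partial derivative|
        v = np.zeros_like(g)
        v[i] = -np.sign(g[i])
        return v
    raise ValueError(f"unsupported norm: {norm}")

g = np.array([0.3, -2.0, 1.1])
print(normalized_steepest_direction(g, "l2"))   # -g / ||g||_2
print(normalized_steepest_direction(g, "l1"))   # [0. 1. 0.]: step along the second axis
```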

The coordinate-descent method corresponds exactly to the choice of the $L_1$ norm for the steepest descent method. The steepest descent direction using the $L_1$ norm is given by $\Delta x = -\frac{\partial f(x)}{\partial x_i}\,u_i$, where $i$ is a coordinate for which $\left|\frac{\partial f(x)}{\partial x_i}\right| = \|\nabla f(x)\|_{\infty}$ and $u_i$ is defined as the unit vector pointing along the $i^{\text{th}}$ axis.

Thus each iteration of the coordinate descent method involves optimizing over one component of the vector $x^{(k)}$ (the one having the largest absolute value in the gradient vector). A sketch in code follows the listing below.

Find a starting point $x^{(0)} \in D$.
Select an appropriate norm $\|\cdot\|$.
repeat
  1. Let $i$ be such that $\left|\frac{\partial f(x^{(k)})}{\partial x^{(k)}_i}\right| = \|\nabla f(x^{(k)})\|_{\infty}$.
  2. Set $\Delta x^{(k)} = -\frac{\partial f(x^{(k)})}{\partial x^{(k)}_i}\,u_i$.
  3. Choose a step size $t^{(k)} > 0$ using exact or backtracking ray search.
  4. Obtain $x^{(k+1)} = x^{(k)} + t^{(k)}\,\Delta x^{(k)}$.
  5. Set $k = k + 1$.
until stopping criterion (such as $\|\nabla f(x^{(k+1)})\| \le \epsilon$) is satisfied.
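Below is a minimal sketch of the coordinate descent loop above, using backtracking ray search in place of exact search. The quadratic test problem, the backtracking factor beta, the sufficient-decrease constant c1, and the tolerance are all assumptions for illustration, not values from the notes.

```python
# Sketch of greedy coordinate descent (L1-norm steepest descent) with backtracking.
import numpy as np

def coordinate_descent(f, grad, x0, beta=0.5, c1=1e-4, eps=1e-6, max_iter=1000):
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:               # stopping criterion ||grad f|| <= eps
            break
        i = np.argmax(np.abs(g))                   # |df/dx_i| = ||grad f||_inf
        dx = np.zeros_like(x)
        dx[i] = -g[i]                              # Delta x = -(df/dx_i) u_i
        t = 1.0
        while f(x + t * dx) > f(x) + c1 * t * (g @ dx):   # backtracking ray search
            t *= beta
        x = x + t * dx
    return x

Q = np.array([[3.0, 0.5], [0.5, 1.0]])             # assumed convex quadratic
x_min = coordinate_descent(lambda z: 0.5 * z @ Q @ z, lambda z: Q @ z, [4.0, -2.0])
print(x_min)                                       # approaches the minimizer (the origin)
```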

Algorithms: Gradient Descent

This classic greedy algorithm for minimization uses the negative of the gradient of the function at the current point $x$ as the descent direction $\Delta x$. This choice of $\Delta x$ corresponds to the direction of steepest descent under the $L_2$ (Euclidean) norm, and follows from the Cauchy-Schwarz inequality. A sketch in code follows the listing below.

Find a starting point $x^{(0)} \in D$.
repeat
  1. Set $\Delta x^{(k)} = -\nabla f(x^{(k)})$.
  2. Choose a step size $t^{(k)} > 0$ using exact or backtracking ray search.
  3. Obtain $x^{(k+1)} = x^{(k)} + t^{(k)}\,\Delta x^{(k)}$.
  4. Set $k = k + 1$.
until stopping criterion (such as $\|\nabla f(x^{(k+1)})\|_2 \le \epsilon$) is satisfied.

The steepest descent method can be thought of as changing the coordinate system in a particular way and then applying the gradient descent method in the changed coordinate system.
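The listing above translates directly into code. The sketch below (ours, not the notes') uses an assumed convex quadratic $f(x) = \frac{1}{2} x^T Q x$, for which the exact ray search along $-\nabla f$ has the closed form $t = (g^T g)/(g^T Q g)$; the matrix, starting point, and tolerance are illustrative assumptions.

```python
# Sketch of gradient descent with exact ray search on an assumed quadratic.
import numpy as np

Q = np.array([[3.0, 0.5], [0.5, 1.0]])       # positive definite, so f is convex
grad = lambda z: Q @ z

x = np.array([4.0, -2.0])                    # x^(0) (assumed)
eps = 1e-8
while np.linalg.norm(grad(x)) > eps:         # stopping criterion ||grad f(x)||_2 <= eps
    g = grad(x)
    dx = -g                                  # Delta x^(k) = -grad f(x^(k))
    t = (g @ g) / (g @ Q @ g)                # exact ray search along dx for this quadratic
    x = x + t * dx                           # x^(k+1) = x^(k) + t^(k) Delta x^(k)
print(x)                                     # converges to the minimizer, the origin
```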

Convergence of the Gradient Descent Algorithm

We recap the (necessary) inequality (36) resulting from Lipschitz continuity of $\nabla f(x)$:

\[
f(y) \le f(x) + \nabla^T f(x)\,(y - x) + \frac{L}{2}\,\|y - x\|^2
\]

Considering $x_k \equiv x$ and $x_{k+1} = x_k - t_k\,\nabla f(x_k) \equiv y$, we get

\[
f(x_{k+1}) \le f(x_k) - t_k\,\|\nabla f(x_k)\|^2 + \frac{L\,t_k^2}{2}\,\|\nabla f(x_k)\|^2
\]

\[
\implies f(x_{k+1}) \le f(x_k) - \left(1 - \frac{L\,t_k}{2}\right) t_k\,\|\nabla f(x_k)\|^2
\]

We desire to have the following (46); it holds if the step size is chosen appropriately.

\[
f(x_{k+1}) \le f(x_k) - \frac{\hat{t}}{2}\,\|\nabla f(x_k)\|^2 \tag{46}
\]

For general descent algorithms, exact and backtracking searches for $t$ were motivated earlier; for gradient descent with Lipschitz continuity of the gradient, here is another way of choosing $t$:

With a fixed step size $t_k = \hat{t}$, we ensure that $0 < \hat{t} \le \frac{1}{L}$, so that $1 - \frac{L\,\hat{t}}{2} \ge \frac{1}{2}$ and (46) holds.

With backtracking step search, (46) holds with $\hat{t} = \min\left\{1, \frac{2\beta(1 - c_1)}{L}\right\}$.
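A quick numerical sanity check of (46) under the fixed-step rule (a sketch with an assumed quadratic, not the notes' code): for $f(x) = \frac{1}{2}x^T Q x$ the Lipschitz constant of the gradient is the largest eigenvalue of $Q$, and every iterate with $\hat{t} = 1/L$ satisfies (46).

```python
# Sketch: verify inequality (46) with the fixed step t_hat = 1/L on an assumed quadratic.
import numpy as np

Q = np.array([[3.0, 0.5], [0.5, 1.0]])
f = lambda z: 0.5 * z @ Q @ z
grad = lambda z: Q @ z
L = np.linalg.eigvalsh(Q).max()              # Lipschitz constant of grad f for a quadratic
t_hat = 1.0 / L                              # fixed step size, 0 < t_hat <= 1/L

x = np.array([4.0, -2.0])
for k in range(20):
    g = grad(x)
    x_next = x - t_hat * g
    # (46): f(x_{k+1}) <= f(x_k) - (t_hat / 2) * ||grad f(x_k)||^2
    assert f(x_next) <= f(x) - 0.5 * t_hat * (g @ g) + 1e-12
    x = x_next
print(f(x))                                  # the objective decreases toward 0
```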

Convergence of the Gradient Descent Algorithm (contd.)

Using convexity, we have $f(x^*) \ge f(x_k) + \nabla^T f(x_k)\,(x^* - x_k)$

\[
\implies f(x_k) \le f(x^*) + \nabla^T f(x_k)\,(x_k - x^*)
\]

Thus,

\[
f(x_{k+1}) \le f(x_k) - \frac{\hat{t}}{2}\,\|\nabla f(x_k)\|^2
\]

\[
\implies f(x_{k+1}) \le f(x^*) + \nabla^T f(x_k)\,(x_k - x^*) - \frac{\hat{t}}{2}\,\|\nabla f(x_k)\|^2
\]

\[
\implies f(x_{k+1}) \le f(x^*) + \frac{1}{2\hat{t}}\,\|x_k - x^*\|^2 + \nabla^T f(x_k)\,(x_k - x^*) - \frac{\hat{t}}{2}\,\|\nabla f(x_k)\|^2 - \frac{1}{2\hat{t}}\,\|x_k - x^*\|^2
\]

\[
\implies f(x_{k+1}) \le f(x^*) + \frac{1}{2\hat{t}}\left(\|x_k - x^*\|^2 - \|x_k - x^* - \hat{t}\,\nabla f(x_k)\|^2\right)
\]

\[
\implies f(x_{k+1}) \le f(x^*) + \frac{1}{2\hat{t}}\left(\|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2\right)
\]

\[
\implies f(x_{k+1}) - f(x^*) \le \frac{1}{2\hat{t}}\left(\|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2\right) \tag{47}
\]
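The notes stop at (47). For completeness, summing (47) over $i = 0, \ldots, k-1$ and using the fact that $f(x_{i+1})$ is non-increasing gives the familiar $O(1/k)$ rate; this short completion follows the standard telescoping argument and is not taken verbatim from the notes.

\[
\sum_{i=0}^{k-1}\big(f(x_{i+1}) - f(x^*)\big) \le \frac{1}{2\hat{t}}\sum_{i=0}^{k-1}\big(\|x_i - x^*\|^2 - \|x_{i+1} - x^*\|^2\big) = \frac{1}{2\hat{t}}\big(\|x_0 - x^*\|^2 - \|x_k - x^*\|^2\big) \le \frac{\|x_0 - x^*\|^2}{2\hat{t}}
\]

\[
\implies f(x_k) - f(x^*) \le \frac{1}{k}\sum_{i=0}^{k-1}\big(f(x_{i+1}) - f(x^*)\big) \le \frac{\|x_0 - x^*\|^2}{2\,\hat{t}\,k}
\]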
