(1)

Optimization for Machine Learning

Lecture: Introduction to Convexity

S. V. N. (Vishy) Vishwanathan

UCSC, vishy@ucsc.edu

June 10, 2015

(2)

Machine Learning

We want to build a model which predicts well on data A model’s performance is quantified by a loss function

a sophisticated discrepancy score

Our model must generalize to unseen data

Avoid over-fitting by penalizing complex models (Regularization)

More Formally

Training data: {x_1, . . . , x_m}
Labels: {y_1, . . . , y_m}
Learn a vector: w

$$\min_w \; J(w) := \underbrace{\lambda \, \Omega(w)}_{\text{Regularizer}} \;+\; \underbrace{\frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, w)}_{\text{Empirical risk } R_{\text{emp}}}$$
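As a concrete illustration, a minimal Python sketch of this objective, assuming the squared loss l(x, y, w) = ½(⟨w, x⟩ − y)² and the regularizer Ω(w) = ½‖w‖² (both choices are illustrative, not fixed by the slide):

import numpy as np

def J(w, X, y, lam):
    # lam * Omega(w) with Omega(w) = 1/2 ||w||^2
    regularizer = lam * 0.5 * np.dot(w, w)
    # empirical risk: (1/m) sum_i l(x_i, y_i, w), here with squared loss
    risk = 0.5 * np.mean((X @ w - y) ** 2)
    return regularizer + risk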

(7)

Outline

1 Convex Functions and Sets

2 Operations Which Preserve Convexity

3 First Order Properties

4 Subgradients

5 Constraints

6 Warmup: Minimizing a 1-d Convex Function

7 Warmup: Coordinate Descent

(8)

Focus of my Lectures


(11)

Disclaimer

My focus is on showing connections between various methods

I will sacrifice mathematical rigor and focus on intuition

(12)

Convex Function

A function f is convex if, and only if, for all x, x′ and λ ∈ (0, 1)

$$f(\lambda x + (1-\lambda)x') \le \lambda f(x) + (1-\lambda) f(x')$$

(13)

Convex Function

A function f is strictly convex if, and only if, for all x, x′ and λ ∈ (0, 1)

$$f(\lambda x + (1-\lambda)x') < \lambda f(x) + (1-\lambda) f(x')$$

(14)

Convex Function

A function f is σ-strongly convex if, and only if, f(·) − (σ/2)‖·‖² is convex. That is, for all x, x′ and λ ∈ (0, 1)

$$f(\lambda x + (1-\lambda)x') \le \lambda f(x) + (1-\lambda) f(x') - \frac{\sigma}{2}\,\lambda(1-\lambda)\,\|x - x'\|^2$$
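These definitions are easy to sanity-check numerically. A small sketch, assuming f(x) = ½‖x‖², which is σ-strongly convex with σ = 1 (for this f the inequality in fact holds with equality):

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 0.5 * np.dot(x, x)  # strongly convex with sigma = 1
sigma = 1.0
for _ in range(1000):
    x, xp = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform()
    lhs = f(lam * x + (1 - lam) * xp)
    rhs = (lam * f(x) + (1 - lam) * f(xp)
           - 0.5 * sigma * lam * (1 - lam) * np.dot(x - xp, x - xp))
    assert lhs <= rhs + 1e-9  # the strong convexity inequality holds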

(15)

Exercise: Jensen’s Inequality

Extend the definition of convexity to show that if f is convex, then for all λ_i ≥ 0 such that Σ_i λ_i = 1 we have

$$f\Big(\sum_i \lambda_i x_i\Big) \le \sum_i \lambda_i f(x_i)$$
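Before attempting the proof, the inequality can be checked numerically; a sketch with the illustrative choice f(x) = x² and random convex weights:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2  # a convex function
lam = rng.uniform(size=5)
lam /= lam.sum()  # lam_i >= 0 and sum_i lam_i = 1
x = rng.normal(size=5)
# f(sum_i lam_i x_i) <= sum_i lam_i f(x_i)
assert f(np.dot(lam, x)) <= np.dot(lam, f(x)) + 1e-12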

(16)

Some Familiar Examples

f(x) = ½ x² (square norm)

(17)

Some Familiar Examples

$$f(x, y) = \frac{1}{2} \begin{pmatrix} x & y \end{pmatrix} \begin{pmatrix} 10 & 1 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$$

(18)

Some Familiar Examples

f(x) = x log x + (1 − x) log(1 − x) (negative entropy)

(19)

Some Familiar Examples

f(x, y) = x log x + y log y − x − y (un-normalized negative entropy)

(20)

Some Familiar Examples

f(x) = max(0, 1 − x) (hinge loss)

(21)

Some Other Important Examples

Linear functions: f(x) = ax + b

Softmax: $f(x) = \log \sum_i \exp(x_i)$

Norms: for example the 2-norm $f(x) = \sqrt{\sum_i x_i^2}$
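A practical aside on the softmax function: evaluating log Σ_i exp(x_i) directly overflows for large x_i. A common numerically stable evaluation is the max-shift trick; the sketch below (names are mine, not from the slides) shows it:

import numpy as np

def log_sum_exp(x):
    # log sum_i exp(x_i), shifted by max(x) to avoid overflow
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

print(log_sum_exp(np.array([1000.0, 1000.0])))  # ~1000.6931, no overflow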

(22)

Convex Sets

A set C is convex if, and only if, for all x, x′ ∈ C and λ ∈ (0, 1) we have

$$\lambda x + (1 - \lambda)x' \in C$$

(23)

Convex Sets and Convex Functions

A function f is convex if, and only if, its epigraph {(x, t) : t ≥ f(x)} is a convex set

(24)

Convex Sets and Convex Functions

Indicator functions of convex sets are convex:

$$I_C(x) = \begin{cases} 0 & \text{if } x \in C \\ \infty & \text{otherwise.} \end{cases}$$

(25)

Below sets of Convex Functions

f(x, y) = x² + y²

(26)

Below sets of Convex Functions

[Surface plot of f(x, y) = x² + y² with contour lines]

(27)

Below sets of Convex Functions

[Contour plot of f(x, y) = x² + y² viewed from above]

(28)

Below sets of Convex Functions

f(x, y) = x log x + y log y − x − y

(29)

Below sets of Convex Functions

[Surface plot of f(x, y) = x log x + y log y − x − y with contour lines]

(30)

Below sets of Convex Functions

[Contour plot of f(x, y) = x log x + y log y − x − y viewed from above]

(31)

Below sets of Convex Functions

If f is convex, then all its below sets are convex

Is the converse true? (Exercise: construct a counter-example)

(32)

Minima on Convex Sets

The set of minima of a convex function is a convex set

Proof: consider the below set {x : f(x) ≤ f*}, where f* is the minimum value

(33)

Minima on Convex Sets

Set of minima of a strictly convex function is a singleton

Proof: try this at home!

(34)

Outline

1 Convex Functions and Sets

2 Operations Which Preserve Convexity

3 First Order Properties

4 Subgradients

5 Constraints

6 Warmup: Minimizing a 1-d Convex Function

7 Warmup: Coordinate Descent

(35)

Set Operations

Intersection of convex sets is convex

Image of a convex set under a linear transformation is convex

Inverse image of a convex set under a linear transformation is convex

(36)

Function Operations

Linear combination with non-negative weights: $f(x) = \sum_i w_i f_i(x)$ with $w_i \ge 0$

Pointwise maximum: f(x) = max_i f_i(x)

Composition with an affine function: f(x) = g(Ax + b)

Projection along a direction: f(η) = g(x_0 + ηd)

Restricting the domain to a convex set: f(x) s.t. x ∈ C

(37)

One Quick Example

The piecewise linear function $f(x) := \max_i \langle u_i, x \rangle$ is convex

(38)

Outline

1 Convex Functions and Sets

2 Operations Which Preserve Convexity

3 First Order Properties

4 Subgradients

5 Constraints

6 Warmup: Minimizing a 1-d Convex Function

7 Warmup: Coordinate Descent

(39)

First Order Taylor Expansion

The first order Taylor approximation globally lower bounds the function: for any x and x′ we have

$$f(x) \ge f(x') + \langle x - x', \nabla f(x') \rangle$$

(40)

Bregman Divergence

For any x and x′ the Bregman divergence defined by f is given by

$$\Delta_f(x, x') = f(x) - f(x') - \langle x - x', \nabla f(x') \rangle.$$

(41)

Euclidean Distance Squared

Bregman Divergence

For any x and x′ the Bregman divergence defined by f is given by

$$\Delta_f(x, x') = f(x) - f(x') - \langle x - x', \nabla f(x') \rangle.$$

Use f(x) = ½‖x‖² and verify that

$$\Delta_f(x, x') = \frac{1}{2}\|x - x'\|^2$$

(42)

Unnormalized Relative Entropy

Bregman Divergence

For any x and x′ the Bregman divergence defined by f is given by

$$\Delta_f(x, x') = f(x) - f(x') - \langle x - x', \nabla f(x') \rangle.$$

Use $f(x) = \sum_i x_i \log x_i - x_i$ and verify that

$$\Delta_f(x, x') = \sum_i x_i \log x_i - x_i - x_i \log x'_i + x'_i$$
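Both verifications can also be done numerically. A sketch with a generic Bregman divergence, where the function and its gradient are supplied by the caller:

import numpy as np

def bregman(f, grad_f, x, xp):
    # Delta_f(x, x') = f(x) - f(x') - <x - x', grad f(x')>
    return f(x) - f(xp) - np.dot(x - xp, grad_f(xp))

rng = np.random.default_rng(0)
x, xp = rng.uniform(0.1, 2.0, 4), rng.uniform(0.1, 2.0, 4)

# f(x) = 1/2 ||x||^2 gives the squared Euclidean distance
d = bregman(lambda v: 0.5 * v @ v, lambda v: v, x, xp)
assert np.isclose(d, 0.5 * np.dot(x - xp, x - xp))

# f(x) = sum_i x_i log x_i - x_i gives the unnormalized relative entropy
d = bregman(lambda v: np.sum(v * np.log(v) - v), lambda v: np.log(v), x, xp)
assert np.isclose(d, np.sum(x * np.log(x / xp) - x + xp))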

(43)

Identifying the Minimum

Let f : X → R be a differentiable convex function. Then x is a minimizer of f, if, and only if,

$$\langle x' - x, \nabla f(x) \rangle \ge 0 \quad \text{for all } x'.$$

One way to ensure this is to set ∇f(x) = 0

Minimizing a smooth convex function is the same as finding an x such that ∇f(x) = 0

(44)

Outline

1 Convex Functions and Sets

2 Operations Which Preserve Convexity

3 First Order Properties

4 Subgradients

5 Constraints

6 Warmup: Minimizing a 1-d Convex Function

7 Warmup: Coordinate Descent

(45)

What if the Function is Non-Smooth?

The piecewise linear function

$$f(x) := \max_i \langle u_i, x \rangle$$

is convex but not differentiable at the kinks!

(46)

Subgradients to the Rescue

A subgradient at x′ is any vector s which satisfies

$$f(x) \ge f(x') + \langle x - x', s \rangle \quad \text{for all } x$$

The set of all subgradients at x′ is denoted ∂f(x′)


(49)

Example

f(x) = |x| and ∂f(0) = [−1, 1]
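For the piecewise linear function f(x) = max_i ⟨u_i, x⟩ from earlier, any u_i that attains the maximum at x′ is a valid subgradient there. A sketch verifying the subgradient inequality at random points (the setup is illustrative):

import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))  # rows are the u_i

f = lambda x: np.max(U @ x)  # f(x) = max_i <u_i, x>

def subgradient(x):
    # any maximizing u_i is a subgradient of f at x
    return U[np.argmax(U @ x)]

xp = rng.normal(size=3)
s = subgradient(xp)
for _ in range(1000):
    x = rng.normal(size=3)
    assert f(x) >= f(xp) + np.dot(x - xp, s) - 1e-9  # f(x) >= f(x') + <x - x', s>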

(50)

Identifying the Minimum

Let f : X → R be a convex function. Then x is a minimizer of f, if, and only if, there exists a µ ∈ ∂f(x) such that

$$\langle x' - x, \mu \rangle \ge 0 \quad \text{for all } x'.$$

One way to ensure this is to ensure that 0 ∈ ∂f(x)

(51)

Outline

1 Convex Functions and Sets

2 Operations Which Preserve Convexity

3 First Order Properties

4 Subgradients

5 Constraints

6 Warmup: Minimizing a 1-d Convex Function

7 Warmup: Coordinate Descent

(52)

A Simple Example

Minimize ½x² s.t. 1 ≤ x ≤ 2

(53)

Projection

$$P_C(x') := \operatorname*{argmin}_{x \in C} \|x - x'\|^2$$

(54)

First Order Conditions For Constrained Problems

At the optimum, x = P_C(x − ∇f(x))

If x − ∇f(x) ∈ C, then P_C(x − ∇f(x)) = x implies that ∇f(x) = 0

Otherwise, the condition shows that the constraints are preventing further progress in the direction of descent
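For the earlier example (minimize ½x² subject to 1 ≤ x ≤ 2, where projecting onto the interval is just clipping), the fixed-point condition can be checked with a short projected gradient iteration; the step size η below is an assumed choice:

import numpy as np

project = lambda x: np.clip(x, 1.0, 2.0)  # P_C for C = [1, 2]
grad = lambda x: x  # gradient of f(x) = 1/2 x^2

x, eta = 1.7, 0.5  # eta: assumed step size
for _ in range(50):
    x = project(x - eta * grad(x))  # x <- P_C(x - eta * grad f(x))

print(x)  # 1.0, the constrained minimizer
print(np.isclose(x, project(x - grad(x))))  # True: the fixed-point condition holds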

(55)

Outline

1 Convex Functions and Sets

2 Operations Which Preserve Convexity

3 First Order Properties

4 Subgradients

5 Constraints

6 Warmup: Minimizing a 1-d Convex Function

7 Warmup: Coordinate Descent

(56)

Problem Statement

Given a black-box which can compute J : R → R and J′ : R → R, find the minimum value of J

(57)

Increasing Gradients

From the first order conditions,

$$J(w) \ge J(w') + (w - w') \cdot J'(w') \qquad \text{and} \qquad J(w') \ge J(w) + (w' - w) \cdot J'(w)$$

Adding the two,

$$(w - w') \cdot (J'(w) - J'(w')) \ge 0$$

so w ≥ w′ implies that J′(w) ≥ J′(w′)

(58)

Increasing Gradients

[Figures: a convex function J(w) and its non-decreasing derivative J′(w)]

(62)

Problem Restatement

Identify the point where the increasing function J′ crosses zero

(63)

Bisection Algorithm

[Figures: bisection maintains an interval [L, U] that brackets the zero of J′; each step evaluates J′ at the midpoint M and moves L or U to M]

(68)

Interval Bisection

Require: L, U, ε
1: maxgrad ← J′(U)
2: while (U − L) · maxgrad > ε do
3:   M ← (L + U)/2
4:   if J′(M) > 0 then
5:     U ← M
6:   else
7:     L ← M
8:   end if
9: end while
10: return (L + U)/2
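A direct Python transcription of the algorithm; J_prime is the black-box derivative, [L, U] must bracket the zero of the increasing function J′, and eps is the tolerance:

def bisect_minimize(J_prime, L, U, eps=1e-8):
    # find the zero of the increasing derivative J' on [L, U]
    maxgrad = J_prime(U)
    while (U - L) * maxgrad > eps:
        M = (L + U) / 2.0
        if J_prime(M) > 0:  # the zero lies to the left of M
            U = M
        else:  # the zero lies to the right of M
            L = M
    return (L + U) / 2.0

# example: J(w) = 1/2 (w - 3)^2, so J'(w) = w - 3 and the minimizer is w = 3
print(bisect_minimize(lambda w: w - 3.0, 0.0, 10.0))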

(69)

Outline

1 Convex Functions and Sets

2 Operations Which Preserve Convexity

3 First Order Properties

4 Subgradients

5 Constraints

6 Warmup: Minimizing a 1-d Convex Function

7 Warmup: Coordinate Descent

(70)

Problem Statement

Given a black-box which can compute J : R^n → R and J′ : R^n → R^n, find the minimum value of J

(71)

Concrete Example

$$f(x, y) = \frac{1}{2} \begin{pmatrix} x & y \end{pmatrix} \begin{pmatrix} 10 & 1 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$$

(72)

Concrete Example

$$f(x, 3) = \frac{1}{2} \begin{pmatrix} x & 3 \end{pmatrix} \begin{pmatrix} 10 & 1 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} x \\ 3 \end{pmatrix}$$

(73)

Concrete Example

f(x, 3) = 5x² + (9/2)x + 9/2

(74)

Concrete Example

f(x, 3) = 5x² + (9/2)x + 9/2

Minimum: x = −9/20


(76)

Concrete Example

f(−9/20, y) = ½y² − (27/40)y + 81/80

(77)

Concrete Example

−3 −2 −1 0 1 2 3 2

4 6 8

y f (− 9

20 , y) = 1

2 y 2 − 27 40 y + 81

80 Minima: y = 27

40

(78)

Concrete Example

[Plot of f(x, 27/40) as a function of x]

Are we done?

(79)

Concrete Example

[Plot of f(x, 27/40) with the previous iterate x = −9/20 marked]

Are we done?
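No: minimizing over x again moves the iterate, so the sweeps continue. The alternation just traced is coordinate descent, and for this quadratic each 1-d minimization has a closed form. A minimal sketch; note that only the symmetric part S = (A + Aᵀ)/2 of A enters the quadratic form:

import numpy as np

A = np.array([[10.0, 1.0], [2.0, 1.0]])
S = 0.5 * (A + A.T)  # f(z) = 1/2 z^T A z = 1/2 z^T S z

z = np.array([0.0, 3.0])  # start at y = 3, as on the slides
for sweep in range(5):
    for k in range(2):
        j = 1 - k
        z[k] = -S[k, j] * z[j] / S[k, k]  # exact minimizer along coordinate k
    print(sweep, z)  # sweep 0: x = -9/20 = -0.45, then y = 27/40 = 0.675

# the iterates shrink toward the true minimizer (0, 0)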
