[Read Ch. 4]
[Recommended exercises 4.1, 4.2, 4.5, 4.9, 4.11]
Threshold units
Gradient descent
Multilayer networks
Backpropagation
Hidden layer representations
Example: Face Recognition
Advanced topics
Connectionist Models
Consider humans:
Neuron switching time ~ .001 second
Number of neurons ~ 10^10
Connections per neuron ~ 10^4–10^5
Scene recognition time ~ .1 second
100 inference steps doesn't seem like enough
→ much parallel computation
Properties of artificial neural nets (ANNs):
Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed process
Emphasis on tuning weights automatically
When to Consider Neural Networks
Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
Output is discrete or real valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of result is unimportant

Examples:
Speech phoneme recognition [Waibel]
Image classification [Kanade, Baluja, Rowley]
Financial prediction
ALVINN drives 70 mph on highways
[Figure: the ALVINN network, with a 30x32 sensor input retina, 4 hidden units, and 30 output units spanning Sharp Left through Straight Ahead to Sharp Right]
Perceptron
[Figure: perceptron unit with inputs x0 = 1, x1, ..., xn and weights w0, w1, ..., wn, computing net = Σ_{i=0}^{n} w_i x_i and outputting 1 if net > 0, -1 otherwise]
(x
1;:::;x
n) =8
>
>
<
>
>
:
1 if
w
0 +w
1x
1 + +w
nx
n>
0?1 otherwise.
Sometimes we'll use simpler vector notation:
o(x) =  1 if w · x > 0
       -1 otherwise.
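In code, this thresholded output might be sketched as follows (a minimal sketch in plain Python; the function name is ours):

```python
def perceptron_output(w, x):
    """o(x) = 1 if w . x > 0 else -1, where x[0] is assumed to be the
    constant input x0 = 1 so that w[0] plays the role of w0."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net > 0 else -1
```

Note that x must already include the constant input x0 = 1, and that net = 0 falls on the -1 side of the threshold.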
Decision Surface of a Perceptron
[Figure: (a) a set of + and - examples in the (x1, x2) plane separated by a perceptron decision surface, a line; (b) a set of + and - examples (XOR-like) that no line can separate]
Represents some useful functions
What weights represent g(x_1, x_2) = AND(x_1, x_2)?

But some functions not representable
e.g., not linearly separable
Therefore, we'll want networks of these...
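For the AND question above, one choice of weights that works (these particular values are only an illustration; many other choices work too):

```python
def threshold_unit(w0, w1, w2, x1, x2):
    # o = 1 if w0 + w1*x1 + w2*x2 > 0, else -1
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

# w0 = -0.8, w1 = w2 = 0.5 implements AND on 0/1 inputs: only
# x1 = x2 = 1 pushes the sum (-0.8 + 0.5 + 0.5 = 0.2) above zero
def g_and(x1, x2):
    return threshold_unit(-0.8, 0.5, 0.5, x1, x2)
```

No such weights exist for XOR, since its positive and negative examples are not linearly separable.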
Perceptron training rule
w_i ← w_i + Δw_i

where

Δw_i = η (t − o) x_i

Where:
t = c(x) is target value
o is perceptron output
η is small constant (e.g., .1) called learning rate
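A sketch of one application of this rule (plain Python; the function name and the default η are ours):

```python
def perceptron_update(w, x, t, eta=0.1):
    """One perceptron training step: w_i <- w_i + eta * (t - o) * x_i,
    where x[0] = 1 plays the role of the constant input x0."""
    o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
```

When the output is already correct, t − o = 0 and the weights are left unchanged.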
Perceptron training rule
Can prove it will converge
If training data is linearly separable
and η sufficiently small

Gradient Descent
To understand, consider simpler linear unit, where
o = w_0 + w_1 x_1 + ... + w_n x_n

Let's learn w_i's that minimize the squared error

E[w] ≡ (1/2) Σ_{d∈D} (t_d − o_d)^2

Where D is set of training examples

Gradient Descent
[Figure: the error surface E[w] plotted over (w0, w1), a parabolic bowl with a single global minimum]
Gradient
∇E[w] ≡ [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]

Training rule:

Δw = −η ∇E[w]

i.e.,

Δw_i = −η ∂E/∂w_i

Gradient Descent
∂E/∂w_i = ∂/∂w_i (1/2) Σ_d (t_d − o_d)^2
        = (1/2) Σ_d ∂/∂w_i (t_d − o_d)^2
        = (1/2) Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = Σ_d (t_d − o_d) ∂/∂w_i (t_d − w · x_d)

∂E/∂w_i = Σ_d (t_d − o_d) (−x_{i,d})

Gradient Descent
Gradient-Descent(training examples, η)

Each training example is a pair of the form ⟨x, t⟩, where x is the vector of input values, and t is the target output value. η is the learning rate (e.g., .05).

Initialize each w_i to some small random value
Until the termination condition is met, Do
  Initialize each Δw_i to zero.
  For each ⟨x, t⟩ in training examples, Do
    Input the instance x to the unit and compute the output o
    For each linear unit weight w_i, Do
      Δw_i ← Δw_i + η (t − o) x_i
  For each linear unit weight w_i, Do
    w_i ← w_i + Δw_i

Summary
Perceptron training rule guaranteed to succeed if
Training examples are linearly separable
Sufficiently small learning rate
Linear unit training rule uses gradient descent
Guaranteed to converge to hypothesis with minimum squared error
Given sufficiently small learning rate
Even when training data contains noise
Even when training data not separable by H
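The Gradient-Descent procedure for the linear unit can be sketched as follows (the toy dataset, epoch count, and η are illustrative choices of ours):

```python
import random

def gradient_descent(training_examples, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w . x.
    training_examples is a list of (x, t) pairs; x[0] = 1 is the bias input."""
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(epochs):
        delta = [0.0] * n                         # initialize each delta_w_i to zero
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]  # delta_w_i += eta (t - o) x_i
        w = [wi + di for wi, di in zip(w, delta)]  # w_i <- w_i + delta_w_i
    return w

# toy dataset drawn from t = 2*x1 - 1, so w should approach [-1, 2]
data = [([1.0, x1], 2.0 * x1 - 1.0) for x1 in (-1.0, 0.0, 0.5, 1.0, 2.0)]
w = gradient_descent(data)
```

For this small, noise-free dataset the learned weights converge essentially to the generating weights, illustrating the minimum-squared-error guarantee.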
Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied
  1. Compute the gradient ∇E_D[w]
  2. w ← w − η ∇E_D[w]

Incremental mode Gradient Descent:
Do until satisfied
  For each training example d in D
    1. Compute the gradient ∇E_d[w]
    2. w ← w − η ∇E_d[w]

E_D[w] ≡ (1/2) Σ_{d∈D} (t_d − o_d)^2

E_d[w] ≡ (1/2) (t_d − o_d)^2

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η made small enough

Multilayer Networks of Sigmoid Units
[Figure: decision regions of a multilayer network trained to recognize the vowel sounds in head, hid, who'd, hood from formant inputs F1 and F2]
Sigmoid Unit
[Figure: sigmoid unit with inputs x0 = 1, x1, ..., xn and weights w0, w1, ..., wn, computing net = Σ_{i=0}^{n} w_i x_i and output o = σ(net) = 1 / (1 + e^(−net))]
σ(x) is the sigmoid function: 1 / (1 + e^(−x))

Nice property: dσ(x)/dx = σ(x) (1 − σ(x))

We can derive gradient descent rules to train
One sigmoid unit
Multilayer networks of sigmoid units → Backpropagation
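Both the function and its derivative property are easy to check numerically (a quick sketch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # the "nice property": d sigma(x)/dx = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# compare against a central finite-difference estimate at a few points
for x in (-2.0, 0.0, 1.5):
    h = 1e-6
    fd = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(fd - sigmoid_deriv(x)) < 1e-6
```

This derivative form is what makes the gradient computations below so compact: o(1 − o) appears wherever σ is differentiated.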
Error Gradient for a Sigmoid Unit
∂E/∂w_i = ∂/∂w_i (1/2) Σ_{d∈D} (t_d − o_d)^2
        = (1/2) Σ_d ∂/∂w_i (t_d − o_d)^2
        = (1/2) Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = Σ_d (t_d − o_d) (−∂o_d/∂w_i)
        = −Σ_d (t_d − o_d) (∂o_d/∂net_d) (∂net_d/∂w_i)

But we know:

∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d (1 − o_d)

∂net_d/∂w_i = ∂(w · x_d)/∂w_i = x_{i,d}

So:

∂E/∂w_i = −Σ_{d∈D} (t_d − o_d) o_d (1 − o_d) x_{i,d}

Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
  For each training example, Do
    1. Input the training example to the network and compute the network outputs
    2. For each output unit k
         δ_k ← o_k (1 − o_k) (t_k − o_k)
    3. For each hidden unit h
         δ_h ← o_h (1 − o_h) Σ_{k∈outputs} w_{h,k} δ_k
    4. Update each network weight w_{i,j}
         w_{i,j} ← w_{i,j} + Δw_{i,j}
         where Δw_{i,j} = η δ_j x_{i,j}

More on Backpropagation
Gradient descent over entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global error minimum
  In practice, often works well (can run multiple times)

Often include weight momentum α:

Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n − 1)

Minimizes error over training examples
  Will it generalize well to subsequent examples?

Training can take thousands of iterations → slow!
Using network after training is very fast
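For one hidden layer of sigmoid units, the algorithm might be sketched as follows (the weight layout, function names, and default η are our choices; each weight list's first entry is its bias weight):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W_hid, W_out):
    """Compute hidden and output activations for one example."""
    xb = [1.0] + list(x)
    hid = [sigmoid(sum(w * v for w, v in zip(wh, xb))) for wh in W_hid]
    hb = [1.0] + hid
    out = [sigmoid(sum(w * v for w, v in zip(wk, hb))) for wk in W_out]
    return hid, out

def backprop_step(x, t, W_hid, W_out, eta=0.3):
    """One training-example update (steps 1-4 of the algorithm)."""
    xb = [1.0] + list(x)
    hid, out = forward(x, W_hid, W_out)
    hb = [1.0] + hid
    # step 2: error term for each output unit k
    d_out = [o * (1 - o) * (tk - o) for o, tk in zip(out, t)]
    # step 3: error term for each hidden unit h, summing over downstream units
    d_hid = [hid[h] * (1 - hid[h]) *
             sum(W_out[k][h + 1] * d_out[k] for k in range(len(W_out)))
             for h in range(len(W_hid))]
    # step 4: w_ij <- w_ij + eta * delta_j * x_ij
    for h, wh in enumerate(W_hid):
        for i in range(len(wh)):
            wh[i] += eta * d_hid[h] * xb[i]
    for k, wk in enumerate(W_out):
        for i in range(len(wk)):
            wk[i] += eta * d_out[k] * hb[i]
```

Repeated calls with the same example drive the output toward its target, which is the behavior the convergence plots below illustrate on the full training set.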
Learning Hidden Layer Representations
[Figure: a network with eight inputs and eight outputs]
A target function:
Input Output
Input    → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001

Can this be learned??
Learning Hidden Layer Representations
A network:

[Figure: a network with eight inputs, three hidden units, and eight outputs]
Learned hidden layer representation:
Input Hidden Output
Values
Input    → Hidden Values → Output
10000000 → .89 .04 .08   → 10000000
01000000 → .01 .11 .88   → 01000000
00100000 → .01 .97 .27   → 00100000
00010000 → .99 .97 .71   → 00010000
00001000 → .03 .05 .02   → 00001000
00000100 → .22 .99 .99   → 00000100
00000010 → .80 .01 .98   → 00000010
00000001 → .60 .94 .01   → 00000001
Training

[Figure: sum of squared errors for each output unit, plotted over 0 to 2500 training epochs]

[Figure: hidden unit encoding for input 01000000, plotted over 0 to 2500 training epochs]

[Figure: weights from inputs to one hidden unit, plotted over 0 to 2500 training epochs]
Convergence of Backpropagation
Gradient descent to some local minimum
Perhaps not global minimum...
Add momentum
Stochastic gradient descent
Train multiple nets with different initial weights

Nature of convergence
Initialize weights near zero
Therefore, initial networks near-linear
Increasingly non-linear functions possible as training progresses
Expressive Capabilities of ANNs
Boolean functions:
Every boolean function can be represented by network with single hidden layer
but might require exponential (in number of inputs) hidden units
Continuous functions:
Every bounded continuous function can be approximated with arbitrarily small error, by network with one hidden layer [Cybenko 1989;
Hornik et al. 1989]
Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
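As a concrete instance of the boolean-function claim, XOR (which no single perceptron represents) falls out of one hidden layer of threshold units; the weights below are one hand-picked choice, not the only one:

```python
def step(net):
    # threshold unit with 0/1 outputs
    return 1 if net > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit computing OR
    h2 = step(x1 + x2 - 1.5)    # hidden unit computing AND
    return step(h1 - h2 - 0.5)  # output: OR and not AND, i.e. XOR

for a in (0, 1):
    for b in (0, 1):
        assert xor_net(a, b) == (a ^ b)
```

The hidden layer re-represents the inputs (as OR and AND) so that the output unit's job becomes linearly separable.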
Overfitting in ANNs
[Figure: error versus weight updates (example 1), showing training set error and validation set error]

[Figure: error versus weight updates (example 2), showing training set error and validation set error]
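One standard response to the behavior in these plots is validation-based early stopping: keep the weights from the point where validation error was lowest. A sketch of that stopping rule (the `patience` heuristic is a common convention of ours, not from the text):

```python
def early_stop(validation_errors, patience=5):
    """Return the index of the weight snapshot to keep: the point where
    validation error was lowest, stopping once it has failed to improve
    for `patience` consecutive checks."""
    best_i, best_err, waited = 0, float("inf"), 0
    for i, err in enumerate(validation_errors):
        if err < best_err:
            best_i, best_err, waited = i, err, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_i
```

In practice the network's weights would be checkpointed at each validation check so the snapshot at the returned index can be restored.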
[Figure: face-recognition network with 30x32 inputs and four outputs (left, strt, rght, up), shown with typical input images]

90% accurate learning head pose, and recognizing 1-of-20 faces

[Figure: the same network's learned weights, shown alongside typical input images]

http://www.cs.cmu.edu/~tom/faces.html
Alternative Error Functions
Penalize large weights:
E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_{kd} − o_{kd})^2 + γ Σ_{i,j} w_{ji}^2

Train on target slopes as well as values:

E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} [ (t_{kd} − o_{kd})^2 + μ Σ_{j∈inputs} (∂t_{kd}/∂x_{jd} − ∂o_{kd}/∂x_{jd})^2 ]
Tie together weights:
e.g., in phoneme recognition network
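The weight-penalty variant can be written down directly from its formula (a sketch; the function name and the penalty coefficient name `gamma` are ours):

```python
def penalized_error(targets, outputs, weights, gamma):
    """E(w) = 0.5 * sum_d sum_k (t_kd - o_kd)^2 + gamma * sum_ij w_ij^2.
    targets/outputs: per-example lists of per-output values; weights: a flat
    list of all network weights."""
    sq = sum((t - o) ** 2
             for ts, os in zip(targets, outputs)
             for t, o in zip(ts, os))
    return 0.5 * sq + gamma * sum(w * w for w in weights)
```

Because the penalty term's gradient is simply 2γw, it shows up in the weight update as a decay of each weight toward zero.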
Recurrent Networks
[Figure: (a) a feedforward network mapping input x(t) to output y(t + 1); (b) a recurrent network whose hidden layer receives both x(t) and a context c(t), with connection b feeding the hidden activations back as the next context; (c) the recurrent network unfolded in time, so y(t + 1) depends on x(t) and c(t), and through the context on x(t − 1), c(t − 1), x(t − 2), c(t − 2), ...]
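An Elman-style step is one common way to realize panel (b): the context c(t) is a copy of the hidden activations from the previous time step. A sketch (all names and sizes are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def recurrent_step(x, c, W_in, W_ctx, b):
    """One time step: hidden h(t) depends on input x(t) and context c(t);
    h(t) is then fed back as the next context c(t + 1)."""
    return [sigmoid(bj +
                    sum(w * v for w, v in zip(W_in[j], x)) +
                    sum(w * v for w, v in zip(W_ctx[j], c)))
            for j, bj in enumerate(b)]

def run_sequence(xs, W_in, W_ctx, b):
    """Unfold the same step over a sequence, threading the context through."""
    c = [0.0] * len(b)
    outputs = []
    for x in xs:
        c = recurrent_step(x, c, W_in, W_ctx, b)
        outputs.append(c)
    return outputs
```

Unfolding the loop this way is exactly what makes the recurrent network trainable with ordinary backpropagation over the unrolled copies.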