[Read Ch. 4]
[Recommended exercises 4.1, 4.2, 4.5, 4.9, 4.11]
Threshold units
Gradient descent
Multilayer networks
Backpropagation
Hidden layer representations
Example: Face Recognition
Advanced topics
Connectionist Models
Consider humans:
Neuron switching time ~ .001 second
Number of neurons ~ 10^10
Connections per neuron ~ 10^4–10^5
Scene recognition time ~ .1 second
100 inference steps doesn't seem like enough
→ much parallel computation
Properties of artificial neural nets (ANNs):
Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed process
Emphasis on tuning weights automatically
When to Consider Neural Networks
Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
Output is discrete or real valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of result is unimportant

Examples:
Speech phoneme recognition [Waibel]
Image classification [Kanade, Baluja, Rowley]
Financial prediction
ALVINN drives 70 mph on highways
[Figure: the ALVINN network, with a 30x32 sensor input retina, 4 hidden units, and 30 output units spanning Sharp Left through Straight Ahead to Sharp Right]
Perceptron
[Figure: perceptron unit with inputs x0 = 1, x1, ..., xn and weights w0, w1, ..., wn, computing net = Σ_{i=0}^{n} w_i x_i and outputting 1 if net > 0, -1 otherwise]
(x
1;:::;x
n) =8
>
>
<
>
>
:
1 if
w
0 +w
1x
1 + +w
nx
n>
0?1 otherwise.
Sometimes we'll use simpler vector notation:
o(x) =  1 if w · x > 0
       -1 otherwise.
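In code, this thresholded output might be sketched as follows (a minimal sketch in plain Python; the function name is ours):

```python
def perceptron_output(w, x):
    """o(x) = 1 if w . x > 0 else -1, where x[0] is assumed to be the
    constant input x0 = 1 so that w[0] plays the role of w0."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net > 0 else -1
```

Note that x must already include the constant input x0 = 1, and that net = 0 falls on the -1 side of the threshold.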
Decision Surface of a Perceptron
[Figure: (a) a set of + and - examples in the (x1, x2) plane separated by a perceptron decision surface, a line; (b) a set of + and - examples (XOR-like) that no line can separate]
Represents some useful functions
What weights represent g(x_1, x_2) = AND(x_1, x_2)?

But some functions not representable
e.g., not linearly separable
Therefore, we'll want networks of these...
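For the AND question above, one choice of weights that works (these particular values are only an illustration; many other choices work too):

```python
def threshold_unit(w0, w1, w2, x1, x2):
    # o = 1 if w0 + w1*x1 + w2*x2 > 0, else -1
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

# w0 = -0.8, w1 = w2 = 0.5 implements AND on 0/1 inputs: only
# x1 = x2 = 1 pushes the sum (-0.8 + 0.5 + 0.5 = 0.2) above zero
def g_and(x1, x2):
    return threshold_unit(-0.8, 0.5, 0.5, x1, x2)
```

No such weights exist for XOR, since its positive and negative examples are not linearly separable.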
Perceptron training rule
w_i ← w_i + Δw_i

where

Δw_i = η (t − o) x_i

Where:
t = c(x) is target value
o is perceptron output
η is small constant (e.g., .1) called learning rate
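A sketch of one application of this rule (plain Python; the function name and the default η are ours):

```python
def perceptron_update(w, x, t, eta=0.1):
    """One perceptron training step: w_i <- w_i + eta * (t - o) * x_i,
    where x[0] = 1 plays the role of the constant input x0."""
    o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
```

When the output is already correct, t − o = 0 and the weights are left unchanged.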
Perceptron training rule
Can prove it will converge
If training data is linearly separable
and η sufficiently small

Gradient Descent
To understand, consider simpler linear unit, where
o = w_0 + w_1 x_1 + ... + w_n x_n

Let's learn w_i's that minimize the squared error

E[w] ≡ (1/2) Σ_{d∈D} (t_d − o_d)^2

Where D is set of training examples

Gradient Descent
[Figure: the error surface E[w] plotted over (w0, w1), a parabolic bowl with a single global minimum]
Gradient
∇E[w] ≡ [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]

Training rule:

Δw = −η ∇E[w]

i.e.,

Δw_i = −η ∂E/∂w_i

Gradient Descent
∂E/∂w_i = ∂/∂w_i (1/2) Σ_d (t_d − o_d)^2
        = (1/2) Σ_d ∂/∂w_i (t_d − o_d)^2
        = (1/2) Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = Σ_d (t_d − o_d) ∂/∂w_i (t_d − w · x_d)

∂E/∂w_i = Σ_d (t_d − o_d) (−x_{i,d})

Gradient Descent
Gradient-Descent(training examples, η)

Each training example is a pair of the form ⟨x, t⟩, where x is the vector of input values, and t is the target output value. η is the learning rate (e.g., .05).

Initialize each w_i to some small random value
Until the termination condition is met, Do
  Initialize each Δw_i to zero.
  For each ⟨x, t⟩ in training examples, Do
    Input the instance x to the unit and compute the output o
    For each linear unit weight w_i, Do
      Δw_i ← Δw_i + η (t − o) x_i
  For each linear unit weight w_i, Do
    w_i ← w_i + Δw_i

Summary
Perceptron training rule guaranteed to succeed if
Training examples are linearly separable
Sufficiently small learning rate
Linear unit training rule uses gradient descent
Guaranteed to converge to hypothesis with minimum squared error
Given sufficiently small learning rate
Even when training data contains noise
Even when training data not separable by H
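The Gradient-Descent procedure for the linear unit can be sketched as follows (the toy dataset, epoch count, and η are illustrative choices of ours):

```python
import random

def gradient_descent(training_examples, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w . x.
    training_examples is a list of (x, t) pairs; x[0] = 1 is the bias input."""
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(epochs):
        delta = [0.0] * n                         # initialize each delta_w_i to zero
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]  # delta_w_i += eta (t - o) x_i
        w = [wi + di for wi, di in zip(w, delta)]  # w_i <- w_i + delta_w_i
    return w

# toy dataset drawn from t = 2*x1 - 1, so w should approach [-1, 2]
data = [([1.0, x1], 2.0 * x1 - 1.0) for x1 in (-1.0, 0.0, 0.5, 1.0, 2.0)]
w = gradient_descent(data)
```

For this small, noise-free dataset the learned weights converge essentially to the generating weights, illustrating the minimum-squared-error guarantee.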
Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied
  1. Compute the gradient ∇E_D[w]
  2. w ← w − η ∇E_D[w]

Incremental mode Gradient Descent:
Do until satisfied
  For each training example d in D
    1. Compute the gradient ∇E_d[w]
    2. w ← w − η ∇E_d[w]

E_D[w] ≡ (1/2) Σ_{d∈D} (t_d − o_d)^2

E_d[w] ≡ (1/2) (t_d − o_d)^2

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η made small enough

Multilayer Networks of Sigmoid Units
[Figure: decision regions of a multilayer network trained to recognize the vowel sounds in head, hid, who'd, hood from formant inputs F1 and F2]
Sigmoid Unit
[Figure: sigmoid unit with inputs x0 = 1, x1, ..., xn and weights w0, w1, ..., wn, computing net = Σ_{i=0}^{n} w_i x_i and output o = σ(net) = 1 / (1 + e^(−net))]
σ(x) is the sigmoid function: 1 / (1 + e^(−x))

Nice property: dσ(x)/dx = σ(x) (1 − σ(x))

We can derive gradient descent rules to train
One sigmoid unit
Multilayer networks of sigmoid units → Backpropagation
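Both the function and its derivative property are easy to check numerically (a quick sketch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # the "nice property": d sigma(x)/dx = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# compare against a central finite-difference estimate at a few points
for x in (-2.0, 0.0, 1.5):
    h = 1e-6
    fd = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(fd - sigmoid_deriv(x)) < 1e-6
```

This derivative form is what makes the gradient computations below so compact: o(1 − o) appears wherever σ is differentiated.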
Error Gradient for a Sigmoid Unit
∂E/∂w_i = ∂/∂w_i (1/2) Σ_{d∈D} (t_d − o_d)^2
        = (1/2) Σ_d ∂/∂w_i (t_d − o_d)^2
        = (1/2) Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = Σ_d (t_d − o_d) (−∂o_d/∂w_i)
        = −Σ_d (t_d − o_d) (∂o_d/∂net_d) (∂net_d/∂w_i)

But we know:

∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d (1 − o_d)

∂net_d/∂w_i = ∂(w · x_d)/∂w_i = x_{i,d}

So:

∂E/∂w_i = −Σ_{d∈D} (t_d − o_d) o_d (1 − o_d) x_{i,d}

Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
  For each training example, Do
    1. Input the training example to the network and compute the network outputs
    2. For each output unit k
         δ_k ← o_k (1 − o_k) (t_k − o_k)
    3. For each hidden unit h
         δ_h ← o_h (1 − o_h) Σ_{k∈outputs} w_{h,k} δ_k
    4. Update each network weight w_{i,j}
         w_{i,j} ← w_{i,j} + Δw_{i,j}
         where Δw_{i,j} = η δ_j x_{i,j}

More on Backpropagation
Gradient descent over entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global error minimum
  In practice, often works well (can run multiple times)

Often include weight momentum α:

Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n − 1)

Minimizes error over training examples
  Will it generalize well to subsequent examples?

Training can take thousands of iterations → slow!
Using network after training is very fast
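For one hidden layer of sigmoid units, the algorithm might be sketched as follows (the weight layout, function names, and default η are our choices; each weight list's first entry is its bias weight):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W_hid, W_out):
    """Compute hidden and output activations for one example."""
    xb = [1.0] + list(x)
    hid = [sigmoid(sum(w * v for w, v in zip(wh, xb))) for wh in W_hid]
    hb = [1.0] + hid
    out = [sigmoid(sum(w * v for w, v in zip(wk, hb))) for wk in W_out]
    return hid, out

def backprop_step(x, t, W_hid, W_out, eta=0.3):
    """One training-example update (steps 1-4 of the algorithm)."""
    xb = [1.0] + list(x)
    hid, out = forward(x, W_hid, W_out)
    hb = [1.0] + hid
    # step 2: error term for each output unit k
    d_out = [o * (1 - o) * (tk - o) for o, tk in zip(out, t)]
    # step 3: error term for each hidden unit h, summing over downstream units
    d_hid = [hid[h] * (1 - hid[h]) *
             sum(W_out[k][h + 1] * d_out[k] for k in range(len(W_out)))
             for h in range(len(W_hid))]
    # step 4: w_ij <- w_ij + eta * delta_j * x_ij
    for h, wh in enumerate(W_hid):
        for i in range(len(wh)):
            wh[i] += eta * d_hid[h] * xb[i]
    for k, wk in enumerate(W_out):
        for i in range(len(wk)):
            wk[i] += eta * d_out[k] * hb[i]
```

Repeated calls with the same example drive the output toward its target, which is the behavior the convergence plots below illustrate on the full training set.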
Learning Hidden Layer Representations
[Figure: a network with eight inputs and eight outputs]
A target function:
Input Output
Input    → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001

Can this be learned??
Learning Hidden Layer Representations
A network:

[Figure: a network with eight inputs, three hidden units, and eight outputs]
Learned hidden layer representation:
Input Hidden Output
Values
Input    → Hidden Values → Output
10000000 → .89 .04 .08   → 10000000
01000000 → .01 .11 .88   → 01000000
00100000 → .01 .97 .27   → 00100000
00010000 → .99 .97 .71   → 00010000
00001000 → .03 .05 .02   → 00001000
00000100 → .22 .99 .99   → 00000100
00000010 → .80 .01 .98   → 00000010
00000001 → .60 .94 .01   → 00000001
Training

[Figure: sum of squared errors for each output unit, plotted over 0 to 2500 training epochs]

[Figure: hidden unit encoding for input 01000000, plotted over 0 to 2500 training epochs]

[Figure: weights from inputs to one hidden unit, plotted over 0 to 2500 training epochs]
Convergence of Backpropagation
Gradient descent to some local minimum
Perhaps not global minimum...
Add momentum
Stochastic gradient descent
Train multiple nets with different initial weights

Nature of convergence
Initialize weights near zero
Therefore, initial networks near-linear
Increasingly non-linear functions possible as training progresses
Expressive Capabilities of ANNs
Boolean functions:
Every boolean function can be represented by network with single hidden layer
but might require exponential (in number of inputs) hidden units
Continuous functions:
Every bounded continuous function can be approximated with arbitrarily small error, by network with one hidden layer [Cybenko 1989;
Hornik et al. 1989]
Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
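As a concrete instance of the boolean-function claim, XOR (which no single perceptron represents) falls out of one hidden layer of threshold units; the weights below are one hand-picked choice, not the only one:

```python
def step(net):
    # threshold unit with 0/1 outputs
    return 1 if net > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit computing OR
    h2 = step(x1 + x2 - 1.5)    # hidden unit computing AND
    return step(h1 - h2 - 0.5)  # output: OR and not AND, i.e. XOR

for a in (0, 1):
    for b in (0, 1):
        assert xor_net(a, b) == (a ^ b)
```

The hidden layer re-represents the inputs (as OR and AND) so that the output unit's job becomes linearly separable.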
Overfitting in ANNs
[Figure: error versus weight updates (example 1), showing training set error and validation set error]

[Figure: error versus weight updates (example 2), showing training set error and validation set error]
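One standard response to the behavior in these plots is validation-based early stopping: keep the weights from the point where validation error was lowest. A sketch of that stopping rule (the `patience` heuristic is a common convention of ours, not from the text):

```python
def early_stop(validation_errors, patience=5):
    """Return the index of the weight snapshot to keep: the point where
    validation error was lowest, stopping once it has failed to improve
    for `patience` consecutive checks."""
    best_i, best_err, waited = 0, float("inf"), 0
    for i, err in enumerate(validation_errors):
        if err < best_err:
            best_i, best_err, waited = i, err, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_i
```

In practice the network's weights would be checkpointed at each validation check so the snapshot at the returned index can be restored.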
[Figure: face-recognition network with 30x32 inputs and four outputs (left, strt, rght, up), shown with typical input images]

90% accurate learning head pose, and recognizing 1-of-20 faces

[Figure: the same network's learned weights, shown alongside typical input images]

http://www.cs.cmu.edu/~tom/faces.html
Alternative Error Functions
Penalize large weights:
E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_{kd} − o_{kd})^2 + γ Σ_{i,j} w_{ji}^2

Train on target slopes as well as values:

E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} [ (t_{kd} − o_{kd})^2 + μ Σ_{j∈inputs} (∂t_{kd}/∂x_{jd} − ∂o_{kd}/∂x_{jd})^2 ]
Tie together weights:
e.g., in phoneme recognition network
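The weight-penalty variant can be written down directly from its formula (a sketch; the function name and the penalty coefficient name `gamma` are ours):

```python
def penalized_error(targets, outputs, weights, gamma):
    """E(w) = 0.5 * sum_d sum_k (t_kd - o_kd)^2 + gamma * sum_ij w_ij^2.
    targets/outputs: per-example lists of per-output values; weights: a flat
    list of all network weights."""
    sq = sum((t - o) ** 2
             for ts, os in zip(targets, outputs)
             for t, o in zip(ts, os))
    return 0.5 * sq + gamma * sum(w * w for w in weights)
```

Because the penalty term's gradient is simply 2γw, it shows up in the weight update as a decay of each weight toward zero.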
Recurrent Networks
[Figure: (a) a feedforward network mapping input x(t) to output y(t + 1); (b) a recurrent network whose hidden layer receives both x(t) and a context c(t), with connection b feeding the hidden activations back as the next context; (c) the recurrent network unfolded in time, so y(t + 1) depends on x(t) and c(t), and through the context on x(t − 1), c(t − 1), x(t − 2), c(t − 2), ...]
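An Elman-style step is one common way to realize panel (b): the context c(t) is a copy of the hidden activations from the previous time step. A sketch (all names and sizes are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def recurrent_step(x, c, W_in, W_ctx, b):
    """One time step: hidden h(t) depends on input x(t) and context c(t);
    h(t) is then fed back as the next context c(t + 1)."""
    return [sigmoid(bj +
                    sum(w * v for w, v in zip(W_in[j], x)) +
                    sum(w * v for w, v in zip(W_ctx[j], c)))
            for j, bj in enumerate(b)]

def run_sequence(xs, W_in, W_ctx, b):
    """Unfold the same step over a sequence, threading the context through."""
    c = [0.0] * len(b)
    outputs = []
    for x in xs:
        c = recurrent_step(x, c, W_in, W_ctx, b)
        outputs.append(c)
    return outputs
```

Unfolding the loop this way is exactly what makes the recurrent network trainable with ordinary backpropagation over the unrolled copies.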