CS623: Introduction to
Computing with Neural Nets (lecture-5)
Pushpak Bhattacharyya
Computer Science and Engineering Department
IIT Bombay
Backpropagation algorithm
• Fully connected feed forward network
• Pure FF network (no jumping of connections over layers)
[Figure: fully connected feed-forward network. Input layer (n i/p neurons), hidden layers, output layer (m o/p neurons); w_ji is the weight on the connection from neuron i to neuron j in the next layer.]
Gradient Descent Equations
∆w_ji = -η ∂E/∂w_ji        (η = learning rate, 0 ≤ η ≤ 1)

∂E/∂w_ji = (∂E/∂net_j) × (∂net_j/∂w_ji)        (net_j = the input at the j-th layer)

Defining δ_j = -∂E/∂net_j and noting that ∂net_j/∂w_ji = o_i (the output of neuron i),

∆w_ji = η δ_j o_i
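As a purely illustrative numerical sketch (not from the slides; the values are arbitrary) of the update rule ∆w_ji = η δ_j o_i, assuming δ_j has already been computed by backpropagation:

```python
# Minimal sketch of the gradient-descent weight update: delta_w_ji = eta * delta_j * o_i
eta = 0.5        # learning rate, 0 <= eta <= 1
w_ji = 0.2       # current weight on the connection from neuron i to neuron j
o_i = 0.8        # output of neuron i (the input carried by this connection)
delta_j = 0.05   # delta_j = -dE/dnet_j, assumed already computed by backpropagation

delta_w_ji = eta * delta_j * o_i   # weight change prescribed by gradient descent
w_ji += delta_w_ji                 # move the weight so that the error E decreases

print(f"delta_w_ji = {delta_w_ji:.4f}, updated w_ji = {w_ji:.4f}")
```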
Backpropagation – for outermost layer
δ_j = -∂E/∂net_j = -(∂E/∂o_j) × (∂o_j/∂net_j)

E = (1/2) Σ_{p=1..m} (t_p - o_p)²        (summed over the m output neurons)

∂E/∂o_j = -(t_j - o_j)

∂o_j/∂net_j = o_j (1 - o_j)        (net_j = the input at the j-th layer)

Hence, δ_j = (t_j - o_j) o_j (1 - o_j)

and ∆w_ji = η (t_j - o_j) o_j (1 - o_j) o_i
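A small sketch of the outermost-layer formula δ_j = (t_j - o_j) o_j (1 - o_j), assuming sigmoid output neurons; the array shapes and values are illustrative assumptions, not from the lecture:

```python
import numpy as np

# Output-layer deltas for a sigmoid network: delta_j = (t_j - o_j) * o_j * (1 - o_j)
t = np.array([1.0, 0.0])             # target outputs t_j
o = np.array([0.7, 0.2])             # actual sigmoid outputs o_j
o_prev = np.array([0.9, 0.1, 0.5])   # outputs o_i of the previous layer

delta_out = (t - o) * o * (1 - o)    # one delta per output neuron

eta = 0.5
# delta_w_ji = eta * delta_j * o_i for every (j, i) pair, written as an outer product
delta_W = eta * np.outer(delta_out, o_prev)
print(delta_out)
print(delta_W)
```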
Backpropagation for hidden layers
[Figure: the same feed-forward network (input layer with n i/p neurons, hidden layers, output layer with m o/p neurons), with a hidden neuron j fed by neuron i and feeding a next-layer neuron k.]
δk is propagated backwards to find value of δj
Backpropagation – for hidden layers
∆w_ji = η δ_j o_i

δ_j = -∂E/∂net_j = -(∂E/∂o_j) × (∂o_j/∂net_j) = -(∂E/∂o_j) × o_j (1 - o_j)

-∂E/∂o_j = -Σ_{k ∈ next layer} (∂E/∂net_k) × (∂net_k/∂o_j) = Σ_{k ∈ next layer} δ_k w_kj

Hence, δ_j = o_j (1 - o_j) Σ_{k ∈ next layer} δ_k w_kj
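A sketch of the hidden-layer rule δ_j = o_j (1 - o_j) Σ_{k ∈ next layer} δ_k w_kj, reusing deltas from the next layer; the weights, deltas, and shapes below are illustrative assumptions:

```python
import numpy as np

# Hidden-layer deltas: delta_j = o_j * (1 - o_j) * sum_k (delta_k * w_kj)
o_hidden = np.array([0.9, 0.1, 0.5])       # outputs o_j of the hidden layer
delta_next = np.array([0.063, -0.032])     # deltas delta_k of the next layer (here: output layer)
W_next = np.array([[0.3, -0.1, 0.4],       # w_kj: weight from hidden neuron j to next-layer neuron k
                   [0.2,  0.5, -0.3]])

# The matrix-vector product sums delta_k * w_kj over the next layer, propagating the error backwards
delta_hidden = o_hidden * (1 - o_hidden) * (W_next.T @ delta_next)
print(delta_hidden)
```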
General Backpropagation Rule
• General weight updating rule:
∆w_ji = η δ_j o_i
• Where
δ_j = (t_j - o_j) o_j (1 - o_j)        for the outermost layer
δ_j = o_j (1 - o_j) Σ_{k ∈ next layer} δ_k w_kj        for hidden layers
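Putting the two cases together, here is a minimal, self-contained sketch of one backpropagation step for a small 2-2-1 sigmoid network; the initial weights, learning rate, and training pattern are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = 0.5
x = np.array([1.0, 0.0])          # input pattern
t = np.array([1.0])               # target output

W1 = np.array([[0.2, -0.4],       # hidden-layer weights (2 hidden neurons x 2 inputs)
               [0.7,  0.1]])
W2 = np.array([[0.5, -0.3]])      # output-layer weights (1 output neuron x 2 hidden neurons)

# Forward pass: propagate the input through the layers
o_hidden = sigmoid(W1 @ x)
o_out = sigmoid(W2 @ o_hidden)

# Backward pass: apply the general rule for each layer
delta_out = (t - o_out) * o_out * (1 - o_out)                   # outermost layer
delta_hidden = o_hidden * (1 - o_hidden) * (W2.T @ delta_out)   # hidden layer

# Weight updates: delta_w_ji = eta * delta_j * o_i
W2 += eta * np.outer(delta_out, o_hidden)
W1 += eta * np.outer(delta_hidden, x)
print(o_out, delta_out, delta_hidden)
```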
How does it work?
• Input propagation forward and error propagation backward (e.g. XOR)
[Figure: a two-layer threshold network for XOR. Inputs x1 and x2 feed two hidden threshold neurons computing x1x2' (x1 AND NOT x2) and x1'x2 (NOT x1 AND x2); the output neuron has threshold 0.5 and weights w1 = w2 = 1.]
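To see forward input propagation and backward error propagation working together on XOR, the sketch below trains a 2-2-1 sigmoid network with the rules derived above. The architecture, learning rate, and epoch count are assumptions; unlike the hand-designed threshold network in the figure, this one learns its weights and may occasionally land in a local minimum (see the next slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

# Small random weights; the extra column in each matrix holds the neuron's bias (threshold)
W1 = rng.uniform(-1, 1, (2, 3))   # hidden layer: 2 neurons, 2 inputs + bias
W2 = rng.uniform(-1, 1, (1, 3))   # output layer: 1 neuron, 2 hidden inputs + bias
eta = 0.5

for epoch in range(10000):
    for x, t in zip(X, T):
        xb = np.append(x, 1.0)                                  # input plus bias term
        o_h = sigmoid(W1 @ xb)                                  # forward: hidden layer
        ob = np.append(o_h, 1.0)
        o = sigmoid(W2 @ ob)                                    # forward: output layer

        delta_o = (t - o) * o * (1 - o)                         # backward: output-layer delta
        delta_h = o_h * (1 - o_h) * (W2[:, :2].T @ delta_o)     # backward: hidden-layer delta

        W2 += eta * np.outer(delta_o, ob)
        W1 += eta * np.outer(delta_h, xb)

for x in X:
    o_h = sigmoid(W1 @ np.append(x, 1.0))
    print(x, sigmoid(W2 @ np.append(o_h, 1.0)))
```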
Issues in the training algorithm
• The algorithm is greedy: it always changes the weights so that E reduces.
• The algorithm may get stuck in a local minimum.
Local Minima
Due to the greedy nature of BP, it can get stuck in a local minimum m and will never be able to reach the global minimum g, since the error can only decrease with each weight change.
[Figure: error surface with local minimum m and global minimum g; getting stuck in the local minimum.]
Reasons for no progress in training
1. Stuck in local minimum.
2. Network paralysis. (High -ve or +ve inputs make the neurons saturate.)
3. η (the learning rate) is too small.

Diagnostics in action (1)
1) If stuck in a local minimum, try the following (a small restart sketch follows this list):
– Re-initialize the weight vector.
– Increase the learning rate.
– Introduce more neurons in the hidden layer.
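A minimal sketch of the first two remedies, restarting training with re-initialized weights and a larger learning rate whenever the error stops improving; the `train` function below is a placeholder standing in for the backpropagation loop of the previous slides, and all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def train(W_init, eta, epochs=1000):
    """Placeholder for the backpropagation loop; returns (trained_weights, final_error)."""
    # ... run backpropagation from W_init with learning rate eta ...
    return W_init, rng.uniform(0.0, 0.5)   # dummy error, for illustration only

eta = 0.1
best_W, best_err = None, float("inf")

for attempt in range(5):                  # a handful of restarts
    W0 = rng.uniform(-1, 1, (2, 3))       # remedy 1: re-initialize the weight vector
    W, err = train(W0, eta)
    if err < best_err:
        best_W, best_err = W, err
    if best_err < 0.01:                   # training escaped the local minimum: stop restarting
        break
    eta *= 1.5                            # remedy 2: increase the learning rate
print("best error over restarts:", best_err)
```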
Diagnostics in action (1) contd.
2) If it is network paralysis, then increase the number of neurons in the hidden layer.
Problem: How to configure the hidden layer?
Known: One hidden layer seems to be sufficient. [Kolmogorov (1960s)]
Diagnostics in action (2)
Kolmogorov's statement:
More hidden layers reduce the size of individual layers.
Diagnostics in action (3)
3) Observe the outputs: if they are close to 0 or 1, try the following (see the sketch after this list):
1. Scale the inputs or divide by a normalizing factor.
2. Change the shape and size of the sigmoid.
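A brief sketch of both remedies: scaling the inputs by a normalizing factor, and flattening the sigmoid with a slope parameter so that large net inputs no longer push the neurons into saturation. The slope parameter `lam` and the sample values are assumptions for illustration:

```python
import numpy as np

def sigmoid(z, lam=1.0):
    # lam < 1 stretches the sigmoid, keeping large net inputs out of the flat, saturated region
    return 1.0 / (1.0 + np.exp(-lam * z))

X = np.array([[120.0, -300.0], [80.0, 450.0]])   # raw inputs with large magnitudes
w = np.array([0.5, 0.5])

# Remedy 1: scale the inputs (divide by a normalizing factor)
X_scaled = X / np.abs(X).max()

# Remedy 2: change the shape of the sigmoid via the slope parameter
print(sigmoid(X @ w))                 # saturates at outputs very close to 0 or 1
print(sigmoid(X_scaled @ w))          # scaled inputs stay in the responsive region
print(sigmoid(X @ w, lam=0.005))      # flatter sigmoid also avoids saturation
```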
Answers to Quiz-1
• Q1: Show that of the 256 Boolean functions of 3 variables, only half are computable by a threshold perceptron
• Ans: The characteristic equation for 3 variables is W1X1 + W2X2 + W3X3 = θ ... (E)
The 8 Boolean value combinations, when inserted in (E), produce 8 hyperplanes passing through the origin in the <W1, W2, W3, θ> space.
Q1 (contd)
The maximum number of functions computable by this perceptron is the number of regions produced by the intersection of these 8 hyperplanes in the 4-dimensional space:
R(8,4) = R(7,4) + R(7,3)    (1)
with boundary conditions R(1,n) = 2 and R(m,2) = 2m.
[Figure: the recursion tree expanding R(8,4) down to the boundary cases, with the value of each node written beside it: R(1,2) = R(1,3) = R(1,4) = 2; R(m,2) = 2m gives 4, 6, 8, 10, 12 for m = 2..6; R(2,3) = R(2,4) = 4; R(3,3) = R(3,4) = 8; R(4,3) = 14, R(4,4) = 16; R(5,3) = 22, R(5,4) = 30; R(6,3) = 32, R(6,4) = 52; R(7,3) = 44, R(7,4) = 84; and at the root R(8,4) = 128.]
Each non-leaf node has 2 children, as per the recurrence relation, and the value of R(m,n) is stored beside the node (m,n).
Answer: R(8,4) = 128 = 256/2, so only half of the 256 Boolean functions of 3 variables are computable.
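The recurrence can also be checked mechanically; a small sketch (the function name and memoization are mine) that evaluates R(m, n) from the stated boundary conditions:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def R(m, n):
    """Number of regions produced by m hyperplanes through the origin in n dimensions."""
    if m == 1:
        return 2            # boundary condition R(1, n) = 2: one hyperplane gives 2 regions
    if n == 2:
        return 2 * m        # boundary condition R(m, 2) = 2m
    return R(m - 1, n) + R(m - 1, n - 1)

print(R(8, 4))   # 128 = 256 / 2, so only half of the 3-variable Boolean functions are computable
```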
Answer to Quiz1 (contd)
• Q2: Prove whether a perceptron with sin(x) as its i-o relation can compute X-OR.
• Ans:
[Figure: a perceptron with inputs weighted by W1 and W2 and output y = sin(net input); yl and yu are the lower and upper thresholds used to read the output as 0 or 1.]
Q2 (contd)
Input <0,0>: y < yl, i.e., sin(θ) < yl    (1)
Input <0,1>: y > yu, i.e., sin(W1 + θ) > yu    (2)
Input <1,0>: y > yu, i.e., sin(W2 + θ) > yu    (3)
Input <1,1>: y < yl, i.e., sin(W1 + W2 + θ) < yl    (4)
Q2 (contd)
Taking yl = 0.1, yu = 0.9, W1 = W2 = π/2 and θ = 0:
sin(0) = 0 < 0.1, sin(π/2) = 1 > 0.9, and sin(π) = 0 < 0.1,
so all four conditions hold and the perceptron can compute X-OR.
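A quick numerical check of the four conditions with these values (variable names are illustrative):

```python
import math

W1 = W2 = math.pi / 2
theta = 0.0
y_l, y_u = 0.1, 0.9

def y(x1, x2):
    # Perceptron with sin as the input-output relation
    return math.sin(W1 * x1 + W2 * x2 + theta)

# XOR: the output should exceed y_u exactly when x1 != x2, and stay below y_l otherwise
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    out = y(x1, x2)
    print((x1, x2), round(out, 3), "fires" if out > y_u else "does not fire")
```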
Answer to Quiz-1 (contd)
• Q3: In the perceptron training algorithm, if the failed vector is chosen again by the algorithm, will there be any problem?
• Ans:
In PTA,
Wn = Wn-1 + Xfail
After this, Xfail is chosen again for testing and is added again if it fails. This continues until Wk.Xfail > 0. Will this terminate?
Q3 (contd)
It will, because:
Wn = Wn-1 + Xfail
Wn-1 = Wn-2 + Xfail
. . .
Wn = W0 + n.Xfail
Therefore, Wn.Xfail = W0.Xfail + n.(Xfail)²
The first term may be negative, but the second term is positive and grows with n, so it overtakes the first after some iterations. Hence "no problem" is the answer.
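A small sketch illustrating the argument: repeatedly adding the same failed vector makes Wn.Xfail = W0.Xfail + n.(Xfail)² grow linearly in n, so it eventually becomes positive. The initial weight vector and the failed vector below are illustrative choices:

```python
import numpy as np

W = np.array([-3.0, 1.0])        # W0, chosen so that the test initially fails
x_fail = np.array([1.0, 0.5])    # the failed vector, added on every failure

n = 0
while W @ x_fail <= 0:           # keep adding Xfail until W.Xfail > 0
    W = W + x_fail
    n += 1
    print(f"n = {n}, W.Xfail = {W @ x_fail:.2f}")

# Wn.Xfail = W0.Xfail + n * |Xfail|^2, so the loop is guaranteed to terminate
print("terminated after", n, "additions")
```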