## Lecture 20 contd: Neural Network Training using Backpropagation, Convolutional And Recurrent Neural Networks

Instructor: Prof. Ganesh Ramakrishnan


## Feed-forward Neural Nets

[Figure: a feed-forward network. Inputs $x_1, x_2, \dots, x_n$ and a bias input $1$ feed, through weights $w_{01}, w_{11}, w_{21}, \dots, w_{n1}$ and $w_{02}, w_{12}, w_{22}, \dots, w_{n2}$, into hidden units $z_1 = g(\sum)$ and $z_2 = g(\sum)$, which produce outputs $f_1 = g(\cdot)$ and $f_2 = g(\cdot)$.]

October 10, 2016 2 / 19

## Training a Neural Network

STEP 0: Pick a network architecture.

Number of input units: dimension of the features $\phi(\mathbf{x}^{(i)})$.

Number of output units: number of classes.

Reasonable default: 1 hidden layer; if more than 1 hidden layer, have the same number of hidden units in every layer.

Number of hidden units in each layer: a constant factor (3 or 4) of the dimension of $\mathbf{x}$.

We will use

▶ the smooth sigmoidal function $g(s) = \frac{1}{1+e^{-s}}$: **we have now learnt how to train a single-node sigmoidal (LR) neural network**

▶ instead of the non-smooth step function $g(s) = 1$ if $s \in [\theta, \infty)$ and $g(s) = 0$ otherwise.
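As a concrete aside, the two activation choices above can be sketched in a few lines of NumPy (a minimal illustration; the function names are ours, not from the lecture):

```python
import numpy as np

def sigmoid(s):
    """Smooth sigmoidal activation g(s) = 1 / (1 + e^{-s})."""
    return 1.0 / (1.0 + np.exp(-s))

def step(s, theta=0.0):
    """Non-smooth step activation: 1 if s is in [theta, inf), 0 otherwise."""
    return np.where(s >= theta, 1.0, 0.0)
```

The sigmoid is differentiable everywhere, which is what makes gradient-based training possible; the step function's derivative is zero almost everywhere, so gradient descent gets no signal from it.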


## High-Level Overview of the Backpropagation Algorithm for Training NN

1. Randomly initialize weights $w^l_{ij}$ for $l = 1, \dots, L$, $i = 1, \dots, s_l$, $j = 1, \dots, s_{l+1}$.

2. Implement **forward propagation** to get $f_{\mathbf{w}}(\mathbf{x})$ for any $\mathbf{x} \in \mathcal{D}$.

3. Execute **backpropagation**
   1. by computing partial derivatives $\frac{\partial}{\partial w^l_{ij}} E(\mathbf{w})$ for $l = 1, \dots, L$, $i = 1, \dots, s_l$, $j = 1, \dots, s_{l+1}$,
   2. and using gradient descent to try to minimize the (non-convex) $E(\mathbf{w})$ as a function of the parameters $\mathbf{w}$:
   $$w^l_{ij} = w^l_{ij} - \eta \frac{\partial}{\partial w^l_{ij}} E(\mathbf{w})$$

4. Verify that the cost function $E(\mathbf{w})$ has indeed reduced; else resort to some random perturbation of the weights $\mathbf{w}$.
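Step 2 (forward propagation) can be sketched as follows — a minimal NumPy version that ignores bias units and uses illustrative layer sizes and random weights (all names and sizes are our assumptions, not from the lecture):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, weights):
    """Forward propagation: sigma^0 = x; sigma^l = sigmoid(W_l^T sigma^{l-1})."""
    sigma = x
    for W in weights:           # W has shape (s_{l-1}, s_l)
        sigma = sigmoid(sigma @ W)
    return sigma                # the network output f_w(x)

rng = np.random.default_rng(0)
# illustrative two-layer net: 3 inputs -> 4 hidden units -> 2 outputs
weights = [rng.standard_normal((3, 4)), rng.standard_normal((4, 2))]
out = forward(np.array([1.0, -0.5, 2.0]), weights)
```

Each iteration of the loop implements one layer's $sum^l_j = \sum_k w^l_{kj}\sigma^{l-1}_k$ followed by the sigmoid, so the intermediate `sigma` arrays are exactly the $\sigma^l$ values that backpropagation will later reuse.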


## The Backpropagation Algorithm


## Setting Notation for Backpropagation

[Figure: the $(l-1)^{th}$ layer, with units $\sigma^{l-1}_1, \sigma^{l-1}_2, \dots, \sigma^{l-1}_i, \dots, \sigma^{l-1}_{s_{l-1}}$, feeds unit $j$ of layer $l$ through weights $w^l_{1j}, w^l_{2j}, \dots, w^l_{ij}, \dots, w^l_{s_{l-1}j}$, giving $\sigma^l_j = \sigma(sum^l_j)$; similarly, weights $w^l_{1s_l}, w^l_{2s_l}, \dots, w^l_{is_l}, \dots, w^l_{s_{l-1}s_l}$ give $\sigma^l_{s_l} = \sigma(sum^l_{s_l})$. The final layer $L$ produces outputs $\sigma^L_1, \dots, \sigma^L_K$.]

## Gradient Computation

The Neural Network objective to be minimized:

$$E(\mathbf{w}) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y^{(i)}_k \log\!\left(\sigma^L_k(\mathbf{x}^{(i)})\right) + \left(1 - y^{(i)}_k\right) \log\!\left(1 - \sigma^L_k(\mathbf{x}^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L} \sum_{i=1}^{s_{l-1}} \sum_{j=1}^{s_l} \left(w^l_{ij}\right)^2 \quad (1)$$

$$sum^l_j = \sum_{k=1}^{s_{l-1}} w^l_{kj}\,\sigma^{l-1}_k \qquad \text{and} \qquad \sigma^l_i = \frac{1}{1 + e^{-sum^l_i}}$$

$$\frac{\partial E}{\partial w^l_{ij}} = \frac{\partial E}{\partial \sigma^l_j} \frac{\partial \sigma^l_j}{\partial sum^l_j} \frac{\partial sum^l_j}{\partial w^l_{ij}} + \frac{\lambda}{2m} w^l_{ij}$$

$$\frac{\partial \sigma^l_j}{\partial sum^l_j} = \left(\frac{1}{1 + e^{-sum^l_j}}\right)\left(1 - \frac{1}{1 + e^{-sum^l_j}}\right) = \sigma^l_j\left(1 - \sigma^l_j\right)$$

$$\frac{\partial sum^l_j}{\partial w^l_{ij}} = \frac{\partial}{\partial w^l_{ij}} \sum_{k=1}^{s_{l-1}} w^l_{kj}\,\sigma^{l-1}_k = \sigma^{l-1}_i$$
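The identity $\frac{\partial \sigma^l_j}{\partial sum^l_j} = \sigma^l_j(1 - \sigma^l_j)$ derived above is easy to sanity-check numerically; a small sketch of our own (not from the lecture), comparing against a central finite difference:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_deriv(s):
    """d sigma / d sum = sigma * (1 - sigma), as derived above."""
    sig = sigmoid(s)
    return sig * (1.0 - sig)

# check against a central finite difference at a few points
eps = 1e-6
for s in [-2.0, 0.0, 1.5]:
    numeric = (sigmoid(s + eps) - sigmoid(s - eps)) / (2 * eps)
    assert abs(numeric - sigmoid_deriv(s)) < 1e-8
```

This closed-form derivative is what makes the sigmoid convenient: backpropagation never needs to differentiate the exponential explicitly, only to reuse the forward-pass activations.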



For a single example $(\mathbf{x}, y)$:

$$E = -\sum_{k=1}^{K} \left[ y_k \log\!\left(\sigma^L_k(\mathbf{x})\right) + \left(1 - y_k\right) \log\!\left(1 - \sigma^L_k(\mathbf{x})\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L} \sum_{i=1}^{s_{l-1}} \sum_{j=1}^{s_l} \left(w^l_{ij}\right)^2 \quad (2)$$

$$\frac{\partial E}{\partial \sigma^l_j} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial sum^{l+1}_p} \frac{\partial sum^{l+1}_p}{\partial \sigma^l_j} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial \sigma^{l+1}_p} \frac{\partial \sigma^{l+1}_p}{\partial sum^{l+1}_p}\, w^{l+1}_{jp} \qquad \text{since } \frac{\partial sum^{l+1}_p}{\partial \sigma^l_j} = w^{l+1}_{jp}$$

$$\frac{\partial E}{\partial \sigma^L_j} = -\frac{y_j}{\sigma^L_j} + \frac{1 - y_j}{1 - \sigma^L_j}$$
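The output-layer gradient $\frac{\partial E}{\partial \sigma^L_j}$ translates directly into code; a minimal sketch (the function name is ours, not from the lecture):

```python
import numpy as np

def dE_dsigma_L(sigma_L, y):
    """dE/dsigma^L_j = -y_j/sigma^L_j + (1 - y_j)/(1 - sigma^L_j), per output unit."""
    return -y / sigma_L + (1.0 - y) / (1.0 - sigma_L)

# example: three output units, one-hot target
sigma_L = np.array([0.8, 0.3, 0.5])
y = np.array([1.0, 0.0, 0.0])
grad = dE_dsigma_L(sigma_L, y)   # -1.25, ~1.4286, 2.0
```

Note the signs: for the "on" unit ($y_j = 1$) the gradient is negative when $\sigma^L_j < 1$, pushing the activation up; for "off" units it is positive, pushing the activation down.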


## Backpropagation in Action

[Figure: units $\sigma^{l-1}_1, \sigma^{l-1}_2, \dots, \sigma^{l-1}_i, \dots, \sigma^{l-1}_{s_{l-1}}$ of the $(l-1)^{th}$ layer feed layer $l$ through weights $w^l_{1j}, \dots, w^l_{s_{l-1}j}$ and $w^l_{1s_l}, \dots, w^l_{s_{l-1}s_l}$, with outputs $\sigma^L_1, \dots, \sigma^L_K$ at layer $L$.]

$$\frac{\partial E}{\partial \sigma^l_j} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial \sigma^{l+1}_p} \frac{\partial \sigma^{l+1}_p}{\partial sum^{l+1}_p}\, w^{l+1}_{jp}$$

$$\frac{\partial E}{\partial \sigma^l_{s_l}} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial \sigma^{l+1}_p} \frac{\partial \sigma^{l+1}_p}{\partial sum^{l+1}_p}\, w^{l+1}_{s_l p}$$

## Backpropagation in Action

[Figure: the quantities $\sigma^{l-1}_i$ and $\frac{\partial E}{\partial \sigma^l_j}$ highlighted on the network, showing the terms combined in the chain rule; outputs $\sigma^L_1, \dots, \sigma^L_K$ at layer $L$.]

$$\frac{\partial E}{\partial w^l_{ij}} = \frac{\partial E}{\partial \sigma^l_j} \frac{\partial \sigma^l_j}{\partial sum^l_j} \frac{\partial sum^l_j}{\partial w^l_{ij}}$$

## Recall and Substitute

$$sum^l_j = \sum_{k=1}^{s_{l-1}} w^l_{kj}\,\sigma^{l-1}_k \qquad \text{and} \qquad \sigma^l_i = \frac{1}{1 + e^{-sum^l_i}}$$

$$\frac{\partial E}{\partial w^l_{ij}} = \frac{\partial E}{\partial \sigma^l_j} \frac{\partial \sigma^l_j}{\partial sum^l_j} \frac{\partial sum^l_j}{\partial w^l_{ij}} + \frac{\lambda}{2m} w^l_{ij}$$

$$\frac{\partial \sigma^l_j}{\partial sum^l_j} = \sigma^l_j\left(1 - \sigma^l_j\right), \qquad \frac{\partial sum^l_j}{\partial w^l_{ij}} = \sigma^{l-1}_i$$

$$\frac{\partial E}{\partial \sigma^l_j} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial \sigma^{l+1}_p} \frac{\partial \sigma^{l+1}_p}{\partial sum^{l+1}_p}\, w^{l+1}_{jp}$$

$$\frac{\partial E}{\partial \sigma^L_j} = -\frac{y_j}{\sigma^L_j} + \frac{1 - y_j}{1 - \sigma^L_j}$$

## Backpropagation in Action

[Figure: with $\frac{\partial E}{\partial \sigma^l_j}$ and $\sigma^l_j$ available at unit $j$ of layer $l$, and $\sigma^{l-1}_i$ from the $(l-1)^{th}$ layer, the weight gradient is assembled; outputs $\sigma^L_1, \dots, \sigma^L_K$ at layer $L$.]

$$\frac{\partial E}{\partial w^l_{ij}} = \frac{\partial E}{\partial \sigma^l_j}\, \sigma^l_j\left(1 - \sigma^l_j\right) \sigma^{l-1}_i + \frac{\lambda}{2m} w^l_{ij}$$

## Backpropagation in Action

[Figure: the gradients $\frac{\partial E}{\partial \sigma^l_j}, \dots, \frac{\partial E}{\partial \sigma^l_{s_l}}$ at layer $l$, used in the weight update; outputs $\sigma^L_1, \dots, \sigma^L_K$ at layer $L$.]

$$w^l_{ij} = w^l_{ij} - \eta \frac{\partial E}{\partial w^l_{ij}}$$

## The Backpropagation Algorithm for Training NN

1. Randomly initialize weights $w^l_{ij}$ for $l = 1, \dots, L$, $i = 1, \dots, s_l$, $j = 1, \dots, s_{l+1}$.

2. Implement **forward propagation** to get $f_{\mathbf{w}}(\mathbf{x})$ for every $\mathbf{x} \in \mathcal{D}$.

3. Execute **backpropagation** on any misclassified $\mathbf{x} \in \mathcal{D}$ by performing gradient descent to minimize the (non-convex) $E(\mathbf{w})$ as a function of parameters $\mathbf{w}$.

4. $\frac{\partial E}{\partial \sigma^L_j} = -\frac{y_j}{\sigma^L_j} + \frac{1 - y_j}{1 - \sigma^L_j}$ for $j = 1$ to $s_L$.

5. For $l = L - 1$ down to $2$:
   1. $\frac{\partial E}{\partial \sigma^l_j} = \sum_{p=1}^{s_{l+1}} \frac{\partial E}{\partial \sigma^{l+1}_p}\, \sigma^{l+1}_p\left(1 - \sigma^{l+1}_p\right) w^{l+1}_{jp}$
   2. $\frac{\partial E}{\partial w^l_{ij}} = \frac{\partial E}{\partial \sigma^l_j}\, \sigma^l_j\left(1 - \sigma^l_j\right) \sigma^{l-1}_i + \frac{\lambda}{2m} w^l_{ij}$
   3. $w^l_{ij} = w^l_{ij} - \eta \frac{\partial E}{\partial w^l_{ij}}$

6. Keep picking misclassified examples until the cost function $E(\mathbf{w})$ shows significant reduction.
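Putting the forward and backward passes together, one update of the algorithm might look like the following minimal NumPy sketch (single example, bias units and the regularization term omitted; all names and sizes are our assumptions, not code from the lecture):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_step(x, y, weights, eta=0.5):
    """One gradient-descent step on a single example (regularization omitted)."""
    # forward pass: store activations sigma^0 .. sigma^L
    sigmas = [x]
    for W in weights:
        sigmas.append(sigmoid(sigmas[-1] @ W))
    # output layer: dE/dsigma^L for the cross-entropy loss
    d_sigma = -y / sigmas[-1] + (1 - y) / (1 - sigmas[-1])
    # backward pass: propagate dE/dsigma and update each layer's weights
    for l in range(len(weights) - 1, -1, -1):
        delta = d_sigma * sigmas[l + 1] * (1 - sigmas[l + 1])  # dE/dsum at layer l+1
        d_sigma = weights[l] @ delta                           # dE/dsigma one layer down
        weights[l] -= eta * np.outer(sigmas[l], delta)         # dE/dw^l_{ij} = sigma^{l-1}_i * delta_j
    return weights
```

Here `delta` is the vector $\frac{\partial E}{\partial \sigma^l_j}\sigma^l_j(1-\sigma^l_j)$, so `weights[l] @ delta` is the sum over $p$ in step 5.1 and `np.outer(sigmas[l], delta)` is the weight gradient of step 5.2, both vectorized over the layer.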


## Challenges with Neural Network Training


## What Changed with Neural Networks?

Origin: the computational model of threshold logic from Warren McCulloch and Walter Pitts (1943).

Big leap: on the ImageNet challenge, AlexNet achieved 85% accuracy (NIPS 2012). The previous best was 75% (CVPR 2011).

The present best is 96.5%, by MSRA (arXiv 2015), comparable to human-level accuracy.

The challenges involved were varied backgrounds, the same object in different colors (e.g. cats), varied sizes and postures of the same objects, and varied illumination conditions.

Tasks like OCR and speech recognition are now possible without segmenting the word image/signal into character images/signals.


## LeNet (1989 and 1998) vs AlexNet (2012)


## Reasons for Big Leap

Why was LeNet not as successful as AlexNet, even though the algorithm was the same?

Right algorithm at the wrong time.

Modern features.

Advances in machine learning.

Realistic data collection in huge amounts, due to regular competitions, evaluation metrics, and challenging problem statements.

Advances in computational resources: GPUs, industrial-scale clusters.

Evolution of tasks: classification of 10 objects, to 100 objects, to the "structure of classes".


## Convolutional Neural Network

A variation of the multi-layer feed-forward neural network, designed to use minimal preprocessing, with wide application in image recognition and natural language processing.

Traditional multilayer perceptron (MLP) models do not take the spatial structure of the data into account and suffer from the curse of dimensionality.

A convolutional neural network has a smaller number of parameters due to **local connections** and **weight sharing**.
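To make the parameter savings concrete, a back-of-the-envelope comparison (the image and layer sizes here are our own illustrative choices, not from the lecture):

```python
# Parameter count: fully connected layer vs. convolutional layer
# on a 32x32 grayscale image (illustrative sizes).
H, W = 32, 32

# fully connected: every input pixel connects to each of 100 hidden units
fc_params = (H * W) * 100      # 102400 weights (biases ignored)

# convolutional: 100 feature maps, each sharing one 5x5 kernel
# (local connections + weight sharing)
conv_params = 100 * (5 * 5)    # 2500 weights (biases ignored)

print(fc_params, conv_params)  # prints "102400 2500"
```

The convolutional layer needs roughly 40x fewer weights here, and the saving grows with image size, since the kernel size is fixed while the fully connected count scales with the number of pixels.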
