CS460/626 : Natural Language Processing/Speech, NLP and the Web
(Lecture 29, 30 – Linear and Logistic Regression; Dimensionality Reduction; Principal Component Analysis)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
26th and 27th March, 2012
Least Square Method: fitting a line
(following Manning and Schütze, Foundations of Statistical NLP, 1999)
Given a set of N points (x_1, y_1), (x_2, y_2), …, (x_N, y_N)
Find a line f(x)=mx+b that best fits the data
m and b are the parameters to be found
W: <m, b> is the weight vector
The line that best fits the data is the one that minimizes the sum of squared residuals SS(m,b) = Σ_i (y_i − (m·x_i + b))²
Values of m and b
Partial differentiation of SS(m,b) with respect to b and m yields, respectively,
b = ȳ − m·x̄  and  m = [Σ_i (x_i − x̄)(y_i − ȳ)] / [Σ_i (x_i − x̄)²]
where x̄ and ȳ are the means of the x and y values
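A minimal sketch of this closed-form fit (the function and variable names here are illustrative, not from the lecture):

```python
# Closed-form least-squares fit of f(x) = m*x + b (illustrative sketch).
import numpy as np

def fit_line(x, y):
    """Return slope m and intercept b that minimize SS(m, b)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_mean, y_mean = x.mean(), y.mean()
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b = y_mean - m * x_mean
    return m, b

# Example: points lying near the line y = 2x + 1
m, b = fit_line([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8])
print(m, b)   # roughly 1.94 and 1.15
```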
Example
(Manning and Schütze, FSNLP, 1999)
Implication of the “line” fitting
[Figure: four points 1, 2, 3, 4 and their projections A, B, C, D on the fitted line; O is a reference point on the line]
•1, 2, 3, 4: are the points
•A, B, C, D: are their projections on the fitted line
•Suppose 1, 2 form a class and 3, 4 another class
•Of course, it is easy to set up a hyperplane that will separate 1 and 2 from 3 and 4
•That will be classification in 2 dimensions
•But suppose we form another attribute of these points, viz., the distances of their projections on the line from "O" (see the sketch after this list)
•Then the points can be classified by a threshold on these distances
•This effectively is classification in the reduced dimension (1 dimension)
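A small numpy sketch of the idea in these bullets (the point coordinates, the fitted line, and the threshold are made-up values for illustration):

```python
# Reduce 2-D points to 1-D: project each point onto the fitted line y = m*x + b
# and use the signed distance of the projection from a reference point O.
import numpy as np

m, b = 1.0, 0.0                        # assumed fitted line (here y = x)
points = np.array([[1.0, 1.2],         # points "1" and "2" (one class)
                   [1.5, 1.4],
                   [4.0, 4.2],         # points "3" and "4" (another class)
                   [4.5, 4.3]])

d = np.array([1.0, m]) / np.sqrt(1 + m ** 2)   # unit vector along the line
O = np.array([0.0, b])                         # reference point O on the line

t = (points - O) @ d       # the new 1-D attribute: distance of projection from O
print(t)                   # approx [1.56 2.05 5.80 6.22]
labels = (t > 3.0).astype(int)   # a single threshold now separates the classes
print(labels)              # [0 0 1 1]
```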
Dimensionality Reduction
- is the process of reducing the number of attributes
- reducing does not mean discarding, but combining attributes to form new attributes
- Let <p_i, q_i> be the points on the 2-dim plane. There are TWO attributes of each point
- If we take the projection of these points on the best-fitting line and CHARACTERISE them by their distance from a specified point on the line, the dimensionality has been reduced (from 2 to 1)
Original data (two attributes per point):

Y    X    Class
q1   p1   1
q2   p2   2
q3   p3   2
…    …    …
qn   pn   1

Reduced data (one attribute per point, the distance D(x,y,O) = f(p,q,m,c) of the projection from O):

D(x,y,O)          Class
f(p1,q1,m,c)      1
f(p2,q2,m,c)      2
f(p3,q3,m,c)      2
…                 …
f(pn,qn,m,c)      1

Systematic Approach to Dimensionality Reduction
• Principal Component Analysis
• Singular Value Decomposition
•Latent Semantic Analysis (Latent Semantic Indexing)
When the dimensionality is more than 2
Let X be the matrix of input vectors: M × N (M input vectors, each with N features)
y_j = w_0 + w_1·x_j1 + w_2·x_j2 + w_3·x_j3 + … + w_n·x_jn
Find the weight vector W: <w_0, w_1, w_2, w_3, …, w_n>
It can be shown that the least-squares solution is given by the normal equations, W = (XᵀX)⁻¹ Xᵀ Y (with a column of 1s prepended to X to account for w_0)
The multivariate data
f1    f2    f3    f4    f5    …   fn
x11   x12   x13   x14   x15   …   x1n   y1
x21   x22   x23   x24   x25   …   x2n   y2
x31   x32   x33   x34   x35   …   x3n   y3
x41   x42   x43   x44   x45   …   x4n   y4
…
xm1   xm2   xm3   xm4   xm5   …   xmn   ym
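The closed-form solution mentioned above ("It can be shown that…") is the usual normal-equations result; a minimal numpy sketch on synthetic data (all names and numbers here are illustrative):

```python
# Multivariate least squares: y_j = w0 + w1*x_j1 + ... + wn*x_jn.
# With a column of 1s prepended for w0, the minimizer is W = (X^T X)^(-1) X^T y.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # M=100 inputs, N=3 features
true_w = np.array([0.5, 2.0, -1.0, 3.0])       # [w0, w1, w2, w3]
y = true_w[0] + X @ true_w[1:] + 0.01 * rng.normal(size=100)

Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the bias column
W = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)       # solve the normal equations
print(W)                                       # close to [0.5, 2.0, -1.0, 3.0]
```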
Logistic Regression
Linear regression: predicting a real-valued outcome
Classification: output takes a value from a small set of discrete values
Simplest classification: Two classes (0/1 or true/false)
Predict the class and also give the
probability of belongingness to the
class
Linear to logistic regression
P(y=true|x) = Σ_{i=0..n} w_i·f_i = w·f
But this is not a legal probability value! It can take any value from −∞ to +∞
Predict instead the ratio of the probability of being in the class to the probability of not being in the class
Odds Ratio:
If an event has probability 0.75 of occurring and probability 0.25 of not occurring, we say the odds of it occurring are 0.75/0.25 = 3.
Odds Ratio (following Jurafsky and Martin, Speech and Language Processing, 2009)
The ratio of probabilities can lie between 0 and ∞, but the RHS (w·f) lies between −∞ and +∞.
Introduce a log: set ln[p/(1−p)] = w·f, then solve to get the expression for p(y=true|x)
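Written out, the standard derivation the slide refers to is:

```latex
\ln\frac{p(y=\mathrm{true}\mid x)}{1 - p(y=\mathrm{true}\mid x)} = w\cdot f
\;\Rightarrow\;
\frac{p}{1-p} = e^{w\cdot f}
\;\Rightarrow\;
p(y=\mathrm{true}\mid x) = \frac{e^{w\cdot f}}{1 + e^{w\cdot f}} = \frac{1}{1 + e^{-w\cdot f}}
```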
Logistic function for p(y=true|x)
The form p(y=true|x) = e^{w·f} / (1 + e^{w·f}) = 1 / (1 + e^{−w·f}) is called the logistic function
It maps values from −∞ to +∞ to lie between 0 and 1
Classification using logistic regression
Assign the "true" class when p(y=true|x) > p(y=false|x)
This gives w·f > 0
In other words, this is equivalent to placing a hyperplane w·f = 0 to separate the two classes
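Spelled out, the decision rule follows from comparing the two class probabilities:

```latex
p(y=\mathrm{true}\mid x) > p(y=\mathrm{false}\mid x)
\;\Leftrightarrow\;
\frac{p(y=\mathrm{true}\mid x)}{1 - p(y=\mathrm{true}\mid x)} > 1
\;\Leftrightarrow\;
e^{w\cdot f} > 1
\;\Leftrightarrow\;
w\cdot f > 0
```

So the separating hyperplane is w·f = 0.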
Learning in logistic regression
In linear regression we minimized the sum of squared errors (SSE)
In logistic regression, we use maximum likelihood estimation
Choose the weights such that the conditional probability p(y|x) of the observed data is maximized
Steps of learning w
For a particular <x, y>: choose w to maximize p(y|x)
For all <x, y> pairs: ŵ = argmax_w Π_i p(y^(i)|x^(i))
Working with log: ŵ = argmax_w Σ_i log p(y^(i)|x^(i))
Substituting the values of the p's, this can be converted to Σ_i [ y^(i) log p(y=true|x^(i)) + (1 − y^(i)) log (1 − p(y=true|x^(i))) ], the conditional log-likelihood to be maximized
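A minimal sketch of this maximum-likelihood training (batch gradient ascent on the conditional log-likelihood; the toy data, learning rate, and iteration count are assumptions, not from the lecture):

```python
# Logistic regression trained by maximizing the conditional log-likelihood
# sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ] with batch gradient ascent.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)   # toy linearly separable labels

F = np.hstack([np.ones((X.shape[0], 1)), X])    # features f, with f0 = 1
w = np.zeros(F.shape[1])
lr = 0.1
for _ in range(500):
    p = sigmoid(F @ w)                  # p(y=true|x) for every example
    w += lr * F.T @ (y - p) / len(y)    # gradient of the log-likelihood
print(w)    # roughly proportional to (0, 1, 2), the true decision direction
```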
Principal Component Analysis
Example: IRIS Data (only 3 rows out of 150)

ID    Petal Length (a1)   Petal Width (a2)   Sepal Length (a3)   Sepal Width (a4)   Classification
001   5.1                 3.5                1.4                 0.2                Iris-setosa
051   7.0                 3.2                4.7                 1.4                Iris-versicolor
101   6.3                 3.3                6.0                 2.5                Iris-virginica
Training and Testing Data
Training: 80% of the data; 40 from each class: total 120
Testing: Remaining 30
Do we have to consider all the 4 attributes for classification?
Fewer attributes are likely to increase the generalization performance (Occam's Razor: a simpler hypothesis generalizes better)
The multivariate data
X1    X2    X3    X4    X5    …   Xp
x11   x12   x13   x14   x15   …   x1p
x21   x22   x23   x24   x25   …   x2p
x31   x32   x33   x34   x35   …   x3p
x41   x42   x43   x44   x45   …   x4p
…
xn1   xn2   xn3   xn4   xn5   …   xnp
Some preliminaries
Sample mean vector: <µ_1, µ_2, µ_3, …, µ_p>
For the i-th variable: µ_i = (Σ_{j=1..n} x_ij) / n
Variance for the i-th variable:
σ_i² = [Σ_{j=1..n} (x_ij − µ_i)²] / (n − 1)
Sample covariance:
c_ab = [Σ_{j=1..n} (x_aj − µ_a)(x_bj − µ_b)] / (n − 1)
This measures the correlation in the data. In fact, the correlation coefficient
r_ab = c_ab / (σ_a σ_b)
Standardize the variables
For each variable x_ij replace the values by
y_ij = (x_ij − µ_i) / σ_i
Correlation Matrix
R =
| 1      r_12   r_13   …   r_1p |
| r_21   1      r_23   …   r_2p |
| r_31   r_32   1      …   r_3p |
| …                             |
| r_p1   r_p2   r_p3   …   1    |
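A small numpy sketch of these preliminaries (random placeholder data; ddof=1 gives the (n − 1) denominators used above):

```python
# Standardize each variable and form the p x p correlation matrix R.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))         # n=100 samples, p=4 variables

mu = X.mean(axis=0)                   # sample mean vector
sigma = X.std(axis=0, ddof=1)         # sample standard deviations
Y = (X - mu) / sigma                  # standardized variables y_ij

R = (Y.T @ Y) / (len(X) - 1)          # covariances of the standardized data
print(np.allclose(R, np.corrcoef(X, rowvar=False)))   # True: R is the correlation matrix
```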
Short digression: Eigenvalues and Eigenvectors
AX=λX
a_11 x_1 + a_12 x_2 + a_13 x_3 + … + a_1p x_p = λ x_1
a_21 x_1 + a_22 x_2 + a_23 x_3 + … + a_2p x_p = λ x_2
…
a_p1 x_1 + a_p2 x_2 + a_p3 x_3 + … + a_pp x_p = λ x_p
Here, the λs are the eigenvalues, and the solution <x_1, x_2, x_3, …, x_p> for each λ is the corresponding eigenvector
Short digression: To find the Eigenvalues and Eigenvectors
Solve the characteristic function det(A – λI)=0
Example:
A = | -9   4 |
    |  7  -6 |
Characteristic equation: (-9 − λ)(-6 − λ) − 28 = 0, i.e. λ² + 15λ + 26 = 0
Real eigenvalues: -13, -2
Eigenvector of eigenvalue -13: (-1, 1)
Eigenvector of eigenvalue -2: (4, 7)
Verify:
| -9   4 | | -1 |   =   -13 | -1 |
|  7  -6 | |  1 |           |  1 |
(Here λI = | λ  0 ; 0  λ |)
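The same example checked numerically (numpy returns unit-length eigenvectors, so they agree with (-1, 1) and (4, 7) only up to scaling):

```python
# Eigenvalues and eigenvectors of A = [[-9, 4], [7, -6]].
import numpy as np

A = np.array([[-9.0, 4.0],
              [7.0, -6.0]])
vals, vecs = np.linalg.eig(A)
print(sorted(vals))            # approximately [-13.0, -2.0]

# Verify A x = lambda x for the eigenvector (-1, 1) of eigenvalue -13
x = np.array([-1.0, 1.0])
print(A @ x, -13 * x)          # both print [ 13. -13.]

# ... and for the eigenvector (4, 7) of eigenvalue -2
x = np.array([4.0, 7.0])
print(A @ x, -2 * x)           # both print [ -8. -14.]
```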
Next step in finding the PCs
R =
| 1      r_12   r_13   …   r_1p |
| r_21   1      r_23   …   r_2p |
| …                             |
| r_p1   r_p2   r_p3   …   1    |
Find the eigenvalues and eigenvectors of R
Example
49 birds: 21 survived in a storm and 28 died.
5 body characteristics given
X1: body length; X2: alar extent; X3: beak and head length; X4: humerus length; X5: keel length
Could we have predicted the fate from the body characteristics?
R =
| 1.000   0.735   0.662   0.645   0.605 |
|         1.000   0.674   0.769   0.529 |
|                 1.000   0.763   0.526 |
|                         1.000   0.607 |
|                                 1.000 |
(upper triangle shown; R is symmetric)
Eigenvalues and Eigenvectors of R
Component   Eigenvalue   Eigenvector (coefficients of X1 … X5)
1           3.612         0.452   0.462   0.451   0.471   0.398
2           0.532        -0.051   0.300   0.325   0.185  -0.877
3           0.386         0.691   0.341  -0.455  -0.411  -0.179
4           0.302        -0.420   0.548  -0.606   0.388   0.069
5           0.165         0.374  -0.530  -0.343   0.652  -0.192
Which principal components are important?
Total variance in the data = λ_1 + λ_2 + λ_3 + λ_4 + λ_5
= sum of the diagonal elements of R = 5
First eigenvalue = 3.612 ≈ 72% of the total variance
Second ≈ 10.6%, Third ≈ 7.7%, Fourth ≈ 6.0% and Fifth ≈ 3.3%
First PC is the most important and sufficient for studying the
classification
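A quick arithmetic check of the percentages quoted above, using the eigenvalues from the table:

```python
# Proportion of total variance explained by each principal component.
eigenvalues = [3.612, 0.532, 0.386, 0.302, 0.165]   # from the table above
total = sum(eigenvalues)                            # = 4.997, i.e. ~5 = trace of R
for i, lam in enumerate(eigenvalues, start=1):
    print(f"PC{i}: {100 * lam / total:.1f}%")
# PC1: 72.3%, PC2: 10.6%, PC3: 7.7%, PC4: 6.0%, PC5: 3.3%
```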
Forming the PCs
Z_1 = 0.452 X_1 + 0.462 X_2 + 0.451 X_3 + 0.471 X_4 + 0.398 X_5
Z_2 = -0.051 X_1 + 0.300 X_2 + 0.325 X_3 + 0.185 X_4 − 0.877 X_5
For all the 49 birds, find the first two principal components
This becomes the new data
Classify using them
For the first bird
X_1 = 156, X_2 = 245, X_3 = 31.6, X_4 = 18.5, X_5 = 20.5
After standardizing:
Y_1 = (156 − 157.98)/3.65 = -0.54
Y_2 = (245 − 241.33)/5.1 = 0.73
Y_3 = (31.6 − 31.5)/0.8 = 0.17
Y_4 = (18.5 − 18.46)/0.56 = 0.05
Y_5 = (20.5 − 20.8)/0.99 = -0.33
PC_1 for the first bird:
Z_1 = 0.45×(-0.54) + 0.46×(0.725) + 0.45×(0.17) + 0.47×(0.05) + 0.39×(-0.33) = 0.064
Similarly, Z_2 = 0.602
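The same computation as a numpy sketch (the means and standard deviations are the rounded values quoted above, so the results differ slightly from 0.064 and 0.602):

```python
# First two principal component scores for the first bird.
import numpy as np

x  = np.array([156.0, 245.0, 31.6, 18.5, 20.5])      # X1..X5 for bird 1
mu = np.array([157.98, 241.33, 31.5, 18.46, 20.8])   # sample means (from above)
sd = np.array([3.65, 5.1, 0.8, 0.56, 0.99])          # sample standard deviations

y = (x - mu) / sd                                    # standardized values Y1..Y5

v1 = np.array([0.452, 0.462, 0.451, 0.471, 0.398])   # first eigenvector
v2 = np.array([-0.051, 0.300, 0.325, 0.185, -0.877]) # second eigenvector

print(y @ v1, y @ v2)   # roughly 0.06 and 0.56
```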
Reduced Classification Data
Instead of X_1, X_2, X_3, X_4, X_5 for the 49 birds, the new data consists of the first two principal components Z_1 and Z_2