(1)

CS460/626: Natural Language Processing/Speech, NLP and the Web

(Lecture 29, 30 – Linear and Logistic Regression; Dimensionality Reduction; Principal Component Analysis)

Pushpak Bhattacharyya, CSE Dept., IIT Bombay

26th and 27th March, 2012

(2)

Least Square Method: fitting a line

(following Manning and Schütze, Foundations of Statistical NLP, 1999)

Given a set of N points (x_1, y_1), (x_2, y_2), …, (x_N, y_N)

Find a line f(x) = m·x + b that best fits the data

m and b are the parameters to be found

W: <m, b> is the weight vector

The line that best fits the data is the one that minimizes the sum of squares of the distances

(3)

Values of m and b

Here SS(m, b) = Σ_{i=1..N} (y_i − (m·x_i + b))² is the sum of squared distances.

Partial differentiation of SS(m, b) w.r.t. b and m (setting each derivative to zero) yields respectively:

b = y̅ − m·x̅

m = [N·Σ x_i·y_i − (Σ x_i)(Σ y_i)] / [N·Σ x_i² − (Σ x_i)²]
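The closed-form values of m and b above can be checked numerically. Below is a minimal sketch (not part of the lecture material) using NumPy and a few made-up points; the function name fit_line is invented for illustration.

```python
# Least-squares fit of f(x) = m*x + b via the closed form; illustrative sketch.
import numpy as np

def fit_line(x, y):
    """Return (m, b) minimising SS(m, b) = sum_i (y_i - (m*x_i + b))^2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
    b = np.mean(y) - m * np.mean(x)
    return m, b

# Tiny usage example with made-up points
m, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.1])
print(m, b)   # m ≈ 2.03, b ≈ 0.0
```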

(4)

Example

(Manning and Schütze, FSNLP, 1999)

(5)

Implication of the “line” fitting

[Figure: four points 1, 2, 3, 4 with their projections A, B, C, D on the fitted line, and a reference point O on the line]

•1, 2, 3, 4 are the points

•A, B, C, D are their projections on the fitted line

•Suppose 1, 2 form one class and 3, 4 another class

•Of course, it is easy to set up a hyperplane that separates 1 and 2 from 3 and 4

•That would be classification in 2 dimensions

•But suppose we form another attribute of these points, viz., the distances of their projections on the line from "O"

•Then the points can be classified by a threshold on these distances (see the sketch below)

•This is effectively classification in the reduced dimension (1 dimension)
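The idea on this slide can be sketched in code. The following example is entirely illustrative (the line, the points and the threshold are assumed, not taken from the lecture): it projects 2-D points onto a fitted line and classifies them by a threshold on the distance of their projections from a reference point O.

```python
# Project 2-D points onto the line y = m*x + b and classify them by a
# threshold on the signed distance of the projection from a point O on the line.
import numpy as np

m, b = 1.0, 0.0                       # assume this line has already been fitted
points = np.array([[1.0, 1.2],        # points 1 and 2: one class
                   [1.5, 1.4],
                   [4.0, 4.3],        # points 3 and 4: another class
                   [4.5, 4.2]])

d = np.array([1.0, m]) / np.hypot(1.0, m)   # unit vector along the line
O = np.array([0.0, b])                      # reference point O on the line

dist = (points - O) @ d                     # distance along the line from O to each projection

threshold = 3.0                             # chosen by inspecting the 1-D distances
labels = (dist > threshold).astype(int)
print(dist, labels)                         # classes come out as [0 0 1 1]
```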

(6)

Dimensionality Reduction

- is the process of reducing the number of attributes

- reducing does not mean discarding, but combining attributes to form new attributes

- Let <p_i, q_i> be the points on a 2-dim plane. There are TWO attributes of each point.

- If we take the projection of these points on the best fitting line and CHARACTERISE them based on their distance from a specified point O on the line, the dimensionality has been reduced.

Original data (two attributes per point):

Y     X     Class
q1    p1    1
q2    p2    2
q3    p3    2
…     …     …
qn    pn    1

Reduced data (one attribute per point):

D(x, y, O)          Class
f(p1, q1, m, c)     1
f(p2, q2, m, c)     2
f(p3, q3, m, c)     2
…                   …
f(pn, qn, m, c)     1

Systematic Approaches to Dimensionality Reduction

• Principal Component Analysis

• Singular Value Decomposition

• Latent Semantic Analysis (Latent Semantic Indexing)

(7)

When the dimensionality is more than 2

Let X be the input vectors: M × N (M input vectors with N features)

y_j = w_0 + w_1·x_j1 + w_2·x_j2 + w_3·x_j3 + … + w_n·x_jn

Find the weight vector W: <w_0, w_1, w_2, w_3, …, w_n>

It can be shown that W = (XᵀX)⁻¹XᵀY (the normal equation), where X is augmented with a column of 1s for w_0.
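Assuming the standard normal-equation solution above, a minimal NumPy sketch on synthetic data (not from the lecture) would look like this; the column of 1s supplies the bias weight w_0.

```python
# Closed-form multivariate linear regression: solve (X^T X) W = X^T Y.
import numpy as np

rng = np.random.default_rng(0)
M, N = 100, 3                                     # M input vectors, N features
X = rng.normal(size=(M, N))
true_w = np.array([0.5, 2.0, -1.0, 3.0])          # w0 (bias) followed by w1..w3
Y = true_w[0] + X @ true_w[1:] + 0.01 * rng.normal(size=M)

Xb = np.hstack([np.ones((M, 1)), X])              # prepend a column of 1s for w0
W = np.linalg.solve(Xb.T @ Xb, Xb.T @ Y)          # solves the normal equation
print(W)                                          # approximately [0.5, 2.0, -1.0, 3.0]
```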

(8)

The multivariate data

      f_1    f_2    f_3    f_4    f_5   …   f_n

      x_11   x_12   x_13   x_14   x_15  …   x_1n    y_1
      x_21   x_22   x_23   x_24   x_25  …   x_2n    y_2
      x_31   x_32   x_33   x_34   x_35  …   x_3n    y_3
      x_41   x_42   x_43   x_44   x_45  …   x_4n    y_4
      …
      x_m1   x_m2   x_m3   x_m4   x_m5  …   x_mn    y_m

(9)

Logistic Regression

Linear regression: predicting a real-valued outcome

Classification: output takes a value from a small set of discrete values

Simplest classification: two classes (0/1 or true/false)

Predict the class and also give the probability of belonging to the class

(10)

Linear to logistic regression

P(y=true|x) = Σ_{i=0..n} w_i·f_i = w·f

But this is not a legal probability value! It ranges from −∞ to +∞.

Instead, predict the ratio of the probability of being in the class to the probability of not being in the class.

Odds Ratio:

If an event has probability 0.75 of occurring and probability 0.25 of not occurring, we say the odds of occurring are 0.75/0.25 = 3.

(11)

Odds Ratio (following Jurafsky and Martin, Speech and Language Processing, 2009)

The ratio of probabilities, p/(1 − p), can lie between 0 and ∞, but the RHS, w·f, lies between −∞ and +∞.

Introduce log: ln[ p(y=true|x) / (1 − p(y=true|x)) ] = w·f. From this we get the expression for p(y=true|x).

(12)

Logistic function for p(y=true|x)

The form p(y=true|x) = e^(w·f) / (1 + e^(w·f)) = 1 / (1 + e^(−w·f)) is called the logistic function.

It maps values from −∞ to +∞ to values between 0 and 1.
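A small sketch of the logistic function applied to an assumed weight and feature vector (both invented for illustration, not from the lecture):

```python
# The logistic (sigmoid) function maps w·f in (-inf, +inf) to a probability in (0, 1).
import numpy as np

def logistic(z):
    """p(y=true|x) = e^z / (1 + e^z) = 1 / (1 + e^(-z)) for z = w·f."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.2, 2.0])     # assumed weight vector
f = np.array([1.0, 0.3, 0.7])      # assumed feature vector for one input x
z = w @ f
print(z, logistic(z))              # z = 1.54, p(y=true|x) ≈ 0.82
```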

(13)

Classification using logistic regression

For belonging to the true class: classify y = true if p(y=true|x) > p(y=false|x), i.e. if p(y=true|x) / (1 − p(y=true|x)) > 1.

This gives e^(w·f) > 1, i.e. w·f > 0.

In other words, this is equivalent to placing a hyperplane w·f = 0 to separate the two classes.

(14)

Learning in logistic regression

In linear regression we minimized the sum of squared errors (SSE)

In logistic regression, we use maximum likelihood estimation

Choose the weights such that the conditional probability p(y|x) of the observed data is maximized

(15)

Steps of learning w

For a particular <x, y>:  ŵ = argmax_w p(y|x)

For all <x, y> pairs (assumed independent):  ŵ = argmax_w Π_j p(y^(j)|x^(j))

Working with log:  ŵ = argmax_w Σ_j log p(y^(j)|x^(j))

Substituting the values of the p's, this can be converted to

ŵ = argmax_w Σ_j [ y^(j)·log p(y^(j)=true|x^(j)) + (1 − y^(j))·log(1 − p(y^(j)=true|x^(j))) ]
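A minimal sketch of this maximum-likelihood learning on synthetic data (not the lecture's), using plain batch gradient ascent on the log-likelihood; the learning rate and iteration count are arbitrary choices:

```python
# Learn w by maximising sum_j [ y_j log p_j + (1 - y_j) log(1 - p_j) ] with gradient ascent.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(float)    # synthetic 0/1 labels

Xb = np.hstack([np.ones((len(X), 1)), X])          # include a bias feature f_0 = 1
w = np.zeros(Xb.shape[1])
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Xb @ w)))            # p(y=true|x) for every example
    w += lr * Xb.T @ (y - p) / len(y)              # gradient of the average log-likelihood
print(w)                                           # weights roughly aligned with (0, 1, -0.5)
```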

(16)

Principal Component Analysis

(17)

Example: IRIS Data (only 3 values out of 150)

ID    Sepal Length (a_1)   Sepal Width (a_2)   Petal Length (a_3)   Petal Width (a_4)   Classification
001   5.1                  3.5                 1.4                  0.2                 Iris-setosa
051   7.0                  3.2                 4.7                  1.4                 Iris-versicolor
101   6.3                  3.3                 6.0                  2.5                 Iris-virginica

(18)

Training and Testing Data

Training: 80% of the data; 40 from each class: total 120

Testing: remaining 30

Do we have to consider all 4 attributes for classification?

Fewer attributes are likely to increase generalization performance (Occam's Razor: a simpler hypothesis generalizes better)

(19)

The multivariate data

X_1    X_2    X_3    X_4    X_5   …   X_p

x_11   x_12   x_13   x_14   x_15  …   x_1p
x_21   x_22   x_23   x_24   x_25  …   x_2p
x_31   x_32   x_33   x_34   x_35  …   x_3p
x_41   x_42   x_43   x_44   x_45  …   x_4p
…
x_n1   x_n2   x_n3   x_n4   x_n5  …   x_np

(20)

Some preliminaries

Sample mean vector: <µ_1, µ_2, µ_3, …, µ_p>

For the i-th variable: µ_i = (Σ_{j=1..n} x_ij) / n

Variance for the i-th variable:

σ_i² = [Σ_{j=1..n} (x_ij − µ_i)²] / (n − 1)

Sample covariance:

c_ab = [Σ_{j=1..n} (x_aj − µ_a)(x_bj − µ_b)] / (n − 1)

This measures the correlation in the data. In fact, the correlation coefficient

r_ab = c_ab / (σ_a·σ_b)

(21)

Standardize the variables

For each variable x_ij, replace the values by

y_ij = (x_ij − µ_i) / σ_i

Correlation Matrix

R = | 1      r_12   r_13   …   r_1p |
    | r_21   1      r_23   …   r_2p |
    | …                            |
    | r_p1   r_p2   r_p3   …   1   |
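A short sketch of standardization and of forming the correlation matrix R, on randomly generated stand-in data (not the lecture's):

```python
# Standardise each variable and form the correlation matrix R.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # 100 samples, 4 correlated variables

mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=1)          # sample standard deviation (n - 1 in the denominator)
Y = (X - mu) / sigma                   # y_ij = (x_ij - mu_i) / sigma_i

R = (Y.T @ Y) / (len(X) - 1)           # correlation matrix; equivalently np.corrcoef(X, rowvar=False)
print(np.round(R, 3))
```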

(22)

Short digression: Eigenvalues and Eigenvectors

AX = λX

a_11·x_1 + a_12·x_2 + a_13·x_3 + … + a_1p·x_p = λ·x_1
a_21·x_1 + a_22·x_2 + a_23·x_3 + … + a_2p·x_p = λ·x_2
…
a_p1·x_1 + a_p2·x_2 + a_p3·x_3 + … + a_pp·x_p = λ·x_p

Here, the λs are the eigenvalues, and for each λ the solution <x_1, x_2, x_3, …, x_p> is the corresponding eigenvector.

(23)

Short digression: To find the Eigenvalues and Eigenvectors

Solve the characteristic equation det(A − λI) = 0

Example:

A = | −9    4 |        λI = | λ   0 |
    |  7   −6 |             | 0   λ |

Characteristic equation: (−9 − λ)(−6 − λ) − 28 = 0

Real eigenvalues: −13, −2

Eigenvector of eigenvalue −13: (−1, 1)

Eigenvector of eigenvalue −2: (4, 7)

Verify:

| −9    4 | | −1 |   =   −13 | −1 |
|  7   −6 | |  1 |           |  1 |
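The worked example can be verified with NumPy (this check is an addition, not part of the slide); note that numerical eigenvectors come back normalised, and only up to sign and order.

```python
# Eigenvalues/eigenvectors of A = [[-9, 4], [7, -6]].
import numpy as np

A = np.array([[-9.0, 4.0],
              [7.0, -6.0]])
vals, vecs = np.linalg.eig(A)
print(vals)                          # -13 and -2 (order may vary)
print(vecs)                          # columns proportional to (-1, 1) and (4, 7)
print(A @ np.array([-1.0, 1.0]))     # [13, -13] = -13 * (-1, 1), as on the slide
```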

(24)

Next step in finding the PCs

R = | 1      r_12   r_13   …   r_1p |
    | r_21   1      r_23   …   r_2p |
    | …                            |
    | r_p1   r_p2   r_p3   …   1   |

Find the eigenvalues and eigenvectors of R

(25)

Example

49 birds: 21 survived in a storm and 28 died.

5 body characteristics given:

X1: body length; X2: alar extent; X3: beak and head length; X4: humerus length; X5: keel length

Could we have predicted the fate from the body characteristics?

R = | 1.000   0.735   0.662   0.645   0.605 |
    |         1.000   0.674   0.769   0.529 |
    |                 1.000   0.763   0.526 |
    |                         1.000   0.607 |
    |                                 1.000 |

(26)

Eigenvalues and Eigenvectors of R

Component   Eigenvalue   Eigenvector (coefficients of X1 … X5)
1           3.612         0.452    0.462    0.451    0.471    0.398
2           0.532        -0.051    0.300    0.325    0.185   -0.877
3           0.386         0.691    0.341   -0.455   -0.411   -0.179
4           0.302        -0.420    0.548   -0.606    0.388    0.069
5           0.165         0.374   -0.530   -0.343    0.652   -0.192
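As a cross-check (not part of the lecture), the table above can be reproduced by eigen-decomposing the correlation matrix R reconstructed on the previous slide; eigenvector signs may be flipped, since eigenvectors are determined only up to sign.

```python
# Eigen-decomposition of the birds' correlation matrix R.
import numpy as np

R = np.array([[1.000, 0.735, 0.662, 0.645, 0.605],
              [0.735, 1.000, 0.674, 0.769, 0.529],
              [0.662, 0.674, 1.000, 0.763, 0.526],
              [0.645, 0.769, 0.763, 1.000, 0.607],
              [0.605, 0.529, 0.526, 0.607, 1.000]])

vals, vecs = np.linalg.eigh(R)        # eigh: R is symmetric
order = np.argsort(vals)[::-1]        # sort by decreasing eigenvalue
vals, vecs = vals[order], vecs[:, order]
print(np.round(vals, 3))              # largest eigenvalue ≈ 3.6 (≈ 72% of the total variance 5)
print(np.round(vecs[:, 0], 3))        # first PC loadings ≈ (0.452, 0.462, 0.451, 0.471, 0.398), up to sign
```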

(27)

Which principal components are important?

Total variance in the data = λ_1 + λ_2 + λ_3 + λ_4 + λ_5 = sum of the diagonal of R = 5

First eigenvalue = 3.616 ≈ 72% of the total variance (3.616/5)

Second ≈ 10.6%, Third ≈ 7.7%, Fourth ≈ 6.0% and Fifth ≈ 3.3%

The first PC is the most important and sufficient for studying the classification

(28)

Forming the PCs

Z_1 = 0.452·X_1 + 0.462·X_2 + 0.451·X_3 + 0.471·X_4 + 0.398·X_5

Z_2 = −0.051·X_1 + 0.300·X_2 + 0.325·X_3 + 0.185·X_4 − 0.877·X_5

For all the 49 birds, find the first two principal components.

This becomes the new data.

Classify using them.

(29)

For the first bird

X_1 = 156, X_2 = 245, X_3 = 31.6, X_4 = 18.5, X_5 = 20.5

After standardizing:

Y_1 = (156 − 157.98)/3.65 = −0.54
Y_2 = (245 − 241.33)/5.1 = 0.73
Y_3 = (31.6 − 31.5)/0.8 = 0.17
Y_4 = (18.5 − 18.46)/0.56 = 0.05
Y_5 = (20.5 − 20.8)/0.99 = −0.33

PC_1 for the first bird:

Z_1 = 0.45×(−0.54) + 0.46×(0.73) + 0.45×(0.17) + 0.47×(0.05) + 0.39×(−0.33) = 0.064

Similarly, Z_2 = 0.602
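The same arithmetic written out with NumPy (the means, standard deviations and eigenvector coefficients below are taken from the slides; small differences from 0.064 and 0.602 come from rounding on the slide):

```python
# Standardise the first bird's measurements and project onto the first two eigenvectors.
import numpy as np

x     = np.array([156.0, 245.0, 31.6, 18.5, 20.5])      # X1..X5 for the first bird
mu    = np.array([157.98, 241.33, 31.5, 18.46, 20.8])   # sample means (from the slide)
sigma = np.array([3.65, 5.1, 0.8, 0.56, 0.99])          # sample std deviations (from the slide)
y = (x - mu) / sigma                                    # ≈ [-0.54, 0.72, 0.12, 0.07, -0.30]

v1 = np.array([0.452, 0.462, 0.451, 0.471, 0.398])      # first eigenvector of R
v2 = np.array([-0.051, 0.300, 0.325, 0.185, -0.877])    # second eigenvector of R
print(y @ v1, y @ v2)                                   # Z1 ≈ 0.06, Z2 ≈ 0.56
```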

(30)

Reduced Classification Data

Instead of

X_1   X_2   X_3   X_4   X_5     (49 rows)

use

Z_1   Z_2     (49 rows)
