Paper: Regression Analysis III

Academic year: 2022


Subject: Statistics

Paper: Regression Analysis III

Module: The analysis of 2x2 tables


Development Team

Principal investigator: Dr. Bhaswati Ganguli, Professor, Department of Statistics, University of Calcutta

Paper co-ordinator: Dr. Bhaswati Ganguli, Professor, Department of Statistics, University of Calcutta

Content writer: Sayantee Jana, Graduate student, Department of Mathematics and Statistics, McMaster University; Sujit Kumar Ray, Analytics professional, Kolkata

Content reviewer: Department of Statistics, University of Calcutta

Regression Analysis III 2 / 27


Definition of Contingency Table

I Let us define two categorical random variables:

I X = education level (predictor), Y = jobs (outcome)

I X: X1, X2, ..., Xp → categories of X

I Y: Y1, Y2, ..., Yq → categories of Y

Definition 1

A rectangular table having p rows for the categories of X and q columns for the categories of Y, with the cells of the table containing frequency counts of outcomes for a sample, is called a contingency table, a term introduced by Karl Pearson in 1904. A contingency table with p rows and q columns is called a p×q or p-by-q table.

1Agresti, A. (2014). Categorical Data Analysis. John Wiley & Sons.
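As a minimal sketch of this definition (with made-up data, not from the module), R's table() function cross-tabulates two categorical variables into a contingency table of frequency counts:

```r
## hypothetical sample of six subjects: education level and job type
edu <- factor(c("school", "college", "college", "school", "college", "school"))
job <- factor(c("manual", "clerical", "clerical", "manual", "manual", "clerical"))

## each cell of ct counts how many subjects fall in that (edu, job) pair
ct <- table(edu, job)
print(ct)   ## a 2-by-2 contingency table
```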


A 2×2 contingency table

This is how a 2×2 contingency table looks:

Table: Response variable Y vs. predictor variable X

Predictor          Response variable
variable        Yi = 0        Yi = 1
Xi = 0          n11           n12
Xi = 1          n21           n22

Here, n22 is the number of subjects who have Xi = Yi = 1, and so on.

Example

Here is an example of a 2×2 contingency table²:

Table: Study of the relationship between wafer quality and the presence of particles on the wafer.

Quality    No Particles    Particles
Good       320             14
Bad        80              36

2Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.

More examples

I So in a 2×2 table both the response and the predictor variable have two categories.

I But they can also have more than two categories.

I For example, a 4×4 table would be:³

Table: Unaided distance vision of 7477 women aged 30-39.

Grade of       Grade of left eye
right eye   Highest   Second   Third   Lowest
Highest     1520      266      124     66
Second      234       1512     432     78
Third       117       362      1772    205
Lowest      36        82       179     492

3Plackett, R.L. (1981). The Analysis of Categorical Data. Macmillan Publishing Co., Inc., New York.


Notes

I All the examples in this presentation are two-way contingency tables.

I A contingency table may also be three-way.


Joint distribution

The joint distribution of the two random variables X and Y is the joint probability of X taking a particular value, say, Xi, and Y taking another value, say, Yj:

πij = P(X = Xi and Y = Yj), i = 1(1)p, j = 1(1)q


Marginal distribution

I The marginal distribution of X is the probability that X assumes a value, say, Xi for all values of Y, and vice-versa for the marginal distribution of Y.

I The marginal distribution of X is:

πi0 = Σ_{j=1}^{q} πij = P(X = Xi), for i = 1(1)p

I The marginal distribution of Y is:

π0j = Σ_{i=1}^{p} πij = P(Y = Yj), for j = 1(1)q
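As a numerical sketch (with a made-up joint probability matrix, not from the module), the two marginal distributions are simply the row and column sums of the matrix of πij:

```r
## hypothetical joint distribution pi_ij for a 2x2 table (entries sum to 1)
pmat <- matrix(c(0.4, 0.1,
                 0.2, 0.3), nrow = 2, byrow = TRUE)

pX <- rowSums(pmat)   ## marginal of X: pi_{i0} = sum over j of pi_ij
pY <- colSums(pmat)   ## marginal of Y: pi_{0j} = sum over i of pi_ij
print(pX)
print(pY)
```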

Conditional distribution

I The conditional distribution of Y given X = Xi is the conditional probability that Y assumes some value, say Yj, given that we know that X has assumed the value Xi.

I The conditional distribution of Y given X = Xi:

πj|i = πij / πi0, for j = 1(1)q, ∀ i = 1(1)p

I The conditional distribution of X given Y = Yj:

πi|j = πij / π0j, for i = 1(1)p, ∀ j = 1(1)q
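On a table of counts the same division by the marginal is what prop.table() does; a sketch with a hypothetical 2×2 table of counts:

```r
## hypothetical 2x2 table of counts
ncounts <- matrix(c(40, 10,
                    20, 30), nrow = 2, byrow = TRUE)

## conditional distribution of Y given X = X_i:
## each cell n_ij divided by its row total n_{i0}
condY <- prop.table(ncounts, margin = 1)
print(condY)    ## every row sums to 1
```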


Independence and Chi-square test

I If X and Y are independent random variables then πij = πi0 π0j, or equivalently, πj|i = π0j

I Our primary objective in the analysis of a contingency table is testing the independence of the two categorical variables.

I So in a 2×2 table the test statistic popularly used for testing the hypothesis H0: πij = πi0 π0j, ∀ i = 1(1)p, ∀ j = 1(1)q, is

χ² = n (n11 n22 − n12 n21)² / [(n11 + n12)(n21 + n22)(n11 + n21)(n12 + n22)]

where nij are the corresponding cell counts and n is the total count.
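This closed-form 2×2 statistic can be sketched in R with the wafer counts from the earlier example; it agrees with chisq.test() when Yates' continuity correction is switched off:

```r
## wafer example: cell counts
n11 <- 320; n12 <- 14; n21 <- 80; n22 <- 36
n <- n11 + n12 + n21 + n22

## closed-form Pearson chi-square statistic for a 2x2 table
chi2 <- n * (n11 * n22 - n12 * n21)^2 /
  ((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22))

## the same statistic from chisq.test without the continuity correction
tab <- matrix(c(n11, n12, n21, n22), nrow = 2, byrow = TRUE)
all.equal(chi2, unname(chisq.test(tab, correct = FALSE)$statistic))
```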

Fisher's exact test

I But the Pearsonian Chi-square statistic mentioned above is not suitable for small samples.

I For small samples Fisher's exact test is recommended. The test statistic is⁴

P = [(n11 + n12)! (n11 + n21)! (n21 + n22)! (n12 + n22)!] / [n! n11! n12! n21! n22!]

4Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
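The probability of a single table under fixed margins can be sketched with small made-up counts (the factorial form overflows for large n) and checked against R's hypergeometric density dhyper():

```r
## hypothetical small 2x2 table
n11 <- 3; n12 <- 1; n21 <- 1; n22 <- 3
n <- n11 + n12 + n21 + n22

## probability of this table under fixed margins (Fisher's formula)
p_table <- (factorial(n11 + n12) * factorial(n11 + n21) *
            factorial(n21 + n22) * factorial(n12 + n22)) /
           (factorial(n) * factorial(n11) * factorial(n12) *
            factorial(n21) * factorial(n22))

## the same probability via the hypergeometric density
p_check <- dhyper(n11, n11 + n12, n21 + n22, n11 + n21)
all.equal(p_table, p_check)
```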

Cumulative distribution and Stochastically larger

I Let X be fixed.

I The cumulative distribution of Y|X = Xi is defined as:

Fj|i = Σ_{k≤j} πk|i

I For i and i′, if Fj|i ≤ Fj|i′ ∀ j = 1(1)q, then the distribution of Y|X = Xi is called stochastically larger than the distribution of Y|X = Xi′
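A sketch with hypothetical conditional distributions over three ordered categories of Y: cumsum() gives the conditional CDFs and the stochastic-ordering check is a single vector comparison:

```r
## hypothetical conditional distributions pi_{j|i} over 3 ordered categories
p_given_1 <- c(0.2, 0.3, 0.5)   ## pi_{j|1}
p_given_2 <- c(0.4, 0.4, 0.2)   ## pi_{j|2}

F1 <- cumsum(p_given_1)         ## F_{j|1}
F2 <- cumsum(p_given_2)         ## F_{j|2}

## Y|X=X1 is stochastically larger than Y|X=X2 when F_{j|1} <= F_{j|2} for all j
all(F1 <= F2)
```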


The multinomial distribution

I The pmf of the multinomial distribution is

[n! / (x1! x2! ... xk!)] (p1)^{x1} (p2)^{x2} ... (pk)^{xk}, where x1 + x2 + ... + xk = n and p1 + p2 + ... + pk = 1

I It is the joint probability of k mutually exclusive outcomes in n independent trials, where pi is the probability of the ith outcome and Xi is the number of times (out of the n trials) the ith outcome occurs.

I E(Xi) = n pi and Var(Xi) = n pi (1 − pi)
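The pmf can be sketched directly and checked against R's dmultinom() (the counts and probabilities below are made up):

```r
## pmf of the multinomial distribution, checked against dmultinom
x <- c(2, 1, 1)          ## counts of each of k = 3 outcomes in n = 4 trials
p <- c(0.5, 0.3, 0.2)    ## hypothetical outcome probabilities
n <- sum(x)

pmf <- factorial(n) / prod(factorial(x)) * prod(p^x)
all.equal(pmf, dmultinom(x, prob = p))
```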


Pearson Chi-square statistic

I Suppose we have n samples that have been cross-classified into pq categories (p categories of X and q categories of Y). Then the distribution of the cell counts nij is a multinomial distribution with n trials and probabilities pij. Thus the joint probability of the cell counts nij, i = 1(1)p, j = 1(1)q is

[n! / (∏i ∏j nij!)] ∏i ∏j (pij)^{nij},

where Σi Σj nij = n and Σi Σj pij = 1

Pearson Chi-square statistic

I Pearson Chi-square statistic for the multinomial distribution

I Under independence, μij = E(nij) = n pi pj = ni· n·j / n

I So the required Chi-square statistic for multinomial data is

χ² = Σi Σj (nij − μij)² / μij, with d.f. = (p − 1)(q − 1)
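For a general p×q table the expected counts μij = ni· n·j / n can be formed with outer(); a sketch using the wafer counts again:

```r
## wafer counts as a 2x2 table
tab <- matrix(c(320, 14,
                80, 36), nrow = 2, byrow = TRUE)
n <- sum(tab)

## expected counts under independence: mu_ij = n_{i.} * n_{.j} / n
mu <- outer(rowSums(tab), colSums(tab)) / n

chi2 <- sum((tab - mu)^2 / mu)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
c(chi2 = chi2, df = df)
```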


Summary

I A contingency table is a cross-classification table of two or more categorical variables.

I Our major objective while analyzing a contingency table is testing for independence of the response and predictor variables.

I For large samples we can use the Pearsonian Chi-square test for independence.

I For small samples we can use Fisher's exact test to test for independence.


Example 1⁶

## creating a dataset in R, data on quality improvement study
## of a sample of wafers
y <- c(320,14,80,36)
particle <- gl(2,1,4,labels=c("no","yes"))
quality <- gl(2,2,labels=c("good","bad"))
wafer <- data.frame(y,particle,quality)
wafer

## creating a contingency table of the dataset
ov <- xtabs(y ~ quality+particle)
print(ov)

## calculating the proportions
pp <- prop.table(xtabs(y ~ particle))
print(pp)
qp <- prop.table(xtabs(y ~ quality))
print(qp)

6Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.

Example 1 contd ...

## expected cell counts under independence
fv <- outer(qp,pp)*450
print(fv)

## calculation of the Frequency Chi-square statistic
sum((ov-fv)^2/fv)

## Frequency Chi-square statistic with Yates' continuity correction
prop.test(ov)

## Fisher's Exact Test
fisher.test(ov)

Regression Analysis III Sample R script to complement this module 19 / 27

Example 2⁷

## install.packages("faraway")
require(faraway)
data(haireye)
haireye

## contingency table of the haireye dataset
ct <- xtabs(y ~ hair + eye, haireye)
print(ct)

## Frequency chi-square statistic along with the p-value
summary(ct)

## graphical display of contingency table
dotchart(ct)

## illustrative plot
mosaicplot(ct,color=TRUE,main=NULL,las=1)

7Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.

Example 3: matched pairs⁸

## data on vision of a sample of women
data(eyegrade)

## contingency table
ct <- xtabs(y ~ right+left, eyegrade)
print(ct)

## Frequency chi-square statistic along with the p-value
summary(ct)

## graphical display of contingency table
dotchart(ct)

## illustrative plot
mosaicplot(ct,color=TRUE,main=NULL,las=1)

8Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.

Example 4: Three-Way Contingency Tables⁹

data(femsmoke)
femsmoke

## contingency table of smoking versus dead
ct <- xtabs(y ~ smoker+dead, femsmoke)
print(ct)

## contingency table of proportions
prop.table(ct,1)

## Frequency chi-square statistic along with the p-value,
## testing for independence of smoking against dead
summary(ct)

9Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.

Example contd ...

## contingency table for a given age-group 55-64
cta <- xtabs(y ~ smoker+dead, femsmoke, subset=(age=="55-64"))
print(cta)
summary(cta)

## contingency table of proportions
prop.table(cta,1)

## contingency table of smoking versus age
ct <- xtabs(y ~ smoker+age, femsmoke)
print(ct)

## Frequency chi-square statistic along with the p-value,
## for testing independence of smoking against age
summary(ct)

Example contd ...

## contingency table of proportions
prop.table(xtabs(y ~ smoker+age, femsmoke),2)

## three-way contingency table
ct3 <- xtabs(y ~ smoker+dead+age, femsmoke)

## Frequency chi-square statistic along with the p-value,
## for testing independence of smoking, death and age
summary(ct3)

Example 5: Another function to calculate the chi-square statistic¹⁰

## creating a frequency table with the name depsmok
depsmok <- matrix(c(144,1729,50,1290),byrow=T,ncol=2)
dimnames(depsmok) <- list(Ever_Smoker=c("Yes","No"),
                          Depression=c("Yes","No"))
addmargins(depsmok)

## computes proportions table and marginal proportions
addmargins(prop.table(depsmok))

## computes the chi-square test of independence
chisq.test(depsmok)

## computes the expected frequencies
chisq.test(depsmok)$expected

10Kateri, M. (2010). Contingency Table Analysis: Methods and Implementation Using R. Springer.

Visualizing contingency tables: barplots, sieve plots and mosaic plots¹¹

## creating a dataset with the name confinan:
## data on confidence in banks and financial institutions,
## cross-classified according to gender
confinan <- matrix(c(98,363,153,165,443,128),byrow=T,ncol=3)
dimnames(confinan) <- list(Gender=c("males","females"),
                           Conf=c("great deal","only some","hardly any"))

barplot(prop.table(confinan),density=30,legend.text=T,
        main="Confidence in Banks and Financial Institutions by Gender (GSS2008)",
        xlab="Confidence level", ylab="Proportions", ylim=c(0,0.65))

11Source: Kateri, M. (2010). Contingency Table Analysis: Methods and Implementation Using R. Springer.

Visualizing contingency tables contd ...

## install.packages("vcd")
require(vcd)

sieve(confinan, shade=T)

mosaic(confinan, gp=shading_hcl, residuals_type="deviance",
       labeling = labeling_residuals)

## mosaic in the vcd package is an alternative function to
## mosaicplot in the graphics package
