Subject: Statistics
Paper: Regression Analysis III
Module: The analysis of 2x2 tables
Development Team
Principal investigator: Dr. Bhaswati Ganguli, Professor, Department of Statistics, University of Calcutta
Paper co-ordinator: Dr. Bhaswati Ganguli, Professor, Department of Statistics, University of Calcutta
Content writer: Sayantee Jana, Graduate student, Department of Mathematics and Statistics, McMaster University; Sujit Kumar Ray, Analytics professional, Kolkata
Content reviewer: Department of Statistics, University of Calcutta
Regression Analysis III 2 / 27
Definition of Contingency Table
I Let us define two categorical random variables:
I X = education level (predictor), Y = jobs (outcome)
I X: X1, X2, ..., Xp → categories of X
I Y: Y1, Y2, ..., Yq → categories of Y
Definition 1
A rectangular table having p rows for the categories of X and q columns for the categories of Y, with the cells of the table containing frequency counts of outcomes for a sample, is called a contingency table, a term introduced by Karl Pearson in 1904. A contingency table with p rows and q columns is called a p×q or p-by-q table.
1 Agresti, A. (2014). Categorical Data Analysis. John Wiley & Sons.
A 2×2 contingency table
This is how a 2×2 contingency table looks:

Table: Response variable Y vs. predictor variable X

Predictor    Response variable
variable     Yi = 0    Yi = 1
Xi = 0       n11       n12
Xi = 1       n21       n22

Here, n22 is the number of subjects who have Xi = Yi = 1, and so on.
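As a minimal sketch, such a table can be laid out in R as a named matrix; the counts n11, ..., n22 below are made up purely to illustrate the layout:

```r
## hypothetical 2x2 counts, laid out row-by-row as n11, n12, n21, n22
tab <- matrix(c(30, 10, 20, 40), byrow = TRUE, nrow = 2,
              dimnames = list(X = c("0", "1"), Y = c("0", "1")))
tab
## tab["1", "1"] is n22, the count of subjects with Xi = Yi = 1
```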
Example
Here is an example of a 2×2 contingency table2:

Table: Study of the relationship between wafer quality and the presence of particles on the wafer.

Quality   No Particles   Particles
Good      320            14
Bad       80             36

2 Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
More examples
I So in a 2×2 table both the response and the predictor variable have two categories.
I But they can also have more than two categories.
I For example, a 4×4 table would be3
Table:Unaided distance of vision of 7477 women aged 30-39.
Grade of the Grade of left eye
right eye Highest Second Third Lowest
Highest 1520 266 124 66
Second 234 1512 432 78
Third 117 362 1772 205
Lowest 36 82 179 492
3 Plackett, R.L. (1981). The Analysis of Categorical Data. Macmillan Publishing Co., Inc., New York.
Notes
I All the examples in this presentation are two-way contingency tables.
I A contingency table may also be three-way (or higher).
Joint distribution
The joint distribution of the two random variables X and Y is the joint probability of X taking a particular value, say Xi, and Y taking another value, say Yj:

πij = P(X = Xi and Y = Yj), i = 1(1)p, j = 1(1)q
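As a sketch, the empirical joint distribution of a table of counts can be obtained in R with prop.table(); the counts here are hypothetical:

```r
## hypothetical 2x2 counts
tab <- matrix(c(30, 10, 20, 40), byrow = TRUE, nrow = 2)
pi_hat <- prop.table(tab)  # estimates pi_ij = n_ij / n
pi_hat
sum(pi_hat)                # the joint probabilities sum to 1
```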
Marginal distribution
I The marginal distribution of X is the probability that X assumes a value, say Xi, over all values of Y, and vice versa for the marginal distribution of Y.
I The marginal distribution of X is:

πi0 = Σ_{j=1}^{q} πij = P(X = Xi), for i = 1(1)p

I The marginal distribution of Y is:

π0j = Σ_{i=1}^{p} πij = P(Y = Yj), for j = 1(1)q
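The two marginal distributions πi0 and π0j can be sketched in R by summing the empirical joint distribution along rows or columns; the counts are hypothetical:

```r
## hypothetical 2x2 counts
tab <- matrix(c(30, 10, 20, 40), byrow = TRUE, nrow = 2)
pi_hat <- prop.table(tab)
rowSums(pi_hat)  # pi_{i0}: marginal distribution of X
colSums(pi_hat)  # pi_{0j}: marginal distribution of Y
## margin.table(pi_hat, 1) and margin.table(pi_hat, 2) give the same values
```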
Conditional distribution
I The conditional distribution of Y given X = Xi is the conditional probability that Y assumes some value, say Yj, given that we know that X has assumed the value Xi.
I The conditional distribution of Y given X = Xi:

πj|i = πij / πi0, for j = 1(1)q, ∀ i = 1(1)p

I The conditional distribution of X given Y = Yj:

πi|j = πij / π0j, for i = 1(1)p, ∀ j = 1(1)q
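In R, prop.table() with a margin argument divides each cell by the corresponding marginal, which is exactly πj|i or πi|j; counts hypothetical:

```r
## hypothetical 2x2 counts
tab <- matrix(c(30, 10, 20, 40), byrow = TRUE, nrow = 2)
prop.table(tab, 1)  # pi_{j|i}: each row sums to 1 (Y given X = Xi)
prop.table(tab, 2)  # pi_{i|j}: each column sums to 1 (X given Y = Yj)
```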
Independence and Chi-square test
I If X and Y are independent random variables then πij = πi0 π0j, or equivalently πj|i = π0j.
I Our primary objective in the analysis of a contingency table is testing the independence of the two categorical variables.
I So in a 2×2 table, the test statistic popularly used for testing the hypothesis H0: πij = πi0 π0j, ∀ i = 1(1)p, ∀ j = 1(1)q, is

χ2 = n (n11 n22 − n12 n21)^2 / [(n11 + n12)(n21 + n22)(n11 + n21)(n12 + n22)]

where nij are the corresponding cell counts and n is the total count.
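As a check, for the wafer data introduced earlier the closed-form 2×2 statistic agrees with chisq.test() run without the continuity correction:

```r
ov <- matrix(c(320, 14, 80, 36), byrow = TRUE, nrow = 2)  # wafer table
n <- sum(ov)
## closed-form 2x2 Pearson chi-square
chi2 <- n * (ov[1, 1] * ov[2, 2] - ov[1, 2] * ov[2, 1])^2 /
        (prod(rowSums(ov)) * prod(colSums(ov)))
chi2
chisq.test(ov, correct = FALSE)$statistic  # same value
```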
Fisher’s exact test
I But the Pearson Chi-square statistic mentioned above is not suitable for small samples.
I For small samples, Fisher’s exact test is recommended. With the margins held fixed, the probability of a given 2×2 table under H0 is4

(n11 + n12)! (n21 + n22)! (n11 + n21)! (n12 + n22)! / (n! n11! n12! n21! n22!)
4 Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
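The probability above is hypergeometric, so for the wafer table it can be evaluated directly with dhyper(); fisher.test() then sums such probabilities over all tables at least as extreme:

```r
ov <- matrix(c(320, 14, 80, 36), byrow = TRUE, nrow = 2)
## P(observing this table | fixed margins): draw the column-1 total from the
## two row totals and count how many come from row 1
p_tab <- dhyper(ov[1, 1], m = sum(ov[1, ]), n = sum(ov[2, ]), k = sum(ov[, 1]))
p_tab
fisher.test(ov)$p.value
```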
Cumulative distribution and stochastically larger distributions
I Suppose X is fixed at Xi.
I The cumulative distribution of Y | X = Xi is defined as:

Fj|i = Σ_{k ≤ j} πk|i

I For two rows i and i′, if Fj|i ≤ Fj|i′ ∀ j = 1(1)q, then the distribution of Y | X = Xi is called stochastically larger than the distribution of Y | X = Xi′.
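A minimal sketch of this comparison in R, using a made-up 2×3 table: cumulate each conditional row and compare the two rows pointwise.

```r
## hypothetical 2x3 counts; row 1 puts more mass on the higher categories
tab <- matrix(c(10, 20, 30,
                30, 20, 10), byrow = TRUE, nrow = 2)
cond <- prop.table(tab, 1)            # pi_{k|i}, rows sum to 1
Fcond <- t(apply(cond, 1, cumsum))    # F_{j|i}, cumulative along each row
Fcond
## row 1's cdf lies below row 2's everywhere, so Y | X = x1 is
## stochastically larger than Y | X = x2
all(Fcond[1, ] <= Fcond[2, ])
```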
The multinomial distribution
I The pmf of the multinomial distribution is

n! / (x1! x2! ... xk!) (p1)^{x1} (p2)^{x2} ... (pk)^{xk}, where x1 + x2 + ... + xk = n and p1 + p2 + ... + pk = 1

I It is the joint probability of k mutually exclusive outcomes in n independent trials, where pi is the probability of the ith outcome and Xi is the number of times (out of the n trials) the ith outcome occurs.
I E(Xi) = n pi and Var(Xi) = n pi (1 − pi)
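R's dmultinom() evaluates this pmf directly; a small hypothetical check against the formula:

```r
## P(X1 = 3, X2 = 2, X3 = 1) in n = 6 trials with cell probabilities .5, .3, .2
dmultinom(c(3, 2, 1), prob = c(0.5, 0.3, 0.2))
## same value from the pmf: 6!/(3! 2! 1!) * 0.5^3 * 0.3^2 * 0.2^1
factorial(6) / (factorial(3) * factorial(2) * factorial(1)) *
  0.5^3 * 0.3^2 * 0.2^1    # 0.135
```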
Pearson Chi-square statistic
I Suppose we have n samples that have been cross-classified into pq categories (p categories of X and q categories of Y). Then the distribution of the cell counts nij is a multinomial distribution with n trials and probabilities pij. Thus the joint probability of the cell counts nij, i = 1(1)p, j = 1(1)q, is

n! / (Πi Πj nij!) Πi Πj (pij)^{nij},

where Σi Σj nij = n and Σi Σj pij = 1
Pearson Chi-square statistic
I Pearson Chi-square statistic for the multinomial distribution
I Under independence, μij = E(nij) = n pi0 p0j, estimated by ni0 n0j / n
I So the required Chi-square statistic for multinomial data is

χ2 = Σi Σj (nij − μij)^2 / μij, with d.f. = (p − 1)(q − 1)
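As a sketch, the statistic can be computed by hand for the 4×4 vision table shown earlier and referred to the χ² distribution with (4 − 1)(4 − 1) = 9 d.f.:

```r
## unaided vision data (right eye grade x left eye grade), from the module
vision <- matrix(c(1520, 266, 124, 66,
                   234, 1512, 432, 78,
                   117, 362, 1772, 205,
                   36, 82, 179, 492), byrow = TRUE, nrow = 4)
mu <- outer(rowSums(vision), colSums(vision)) / sum(vision)  # mu_ij = ni0 n0j / n
chi2 <- sum((vision - mu)^2 / mu)
df <- (nrow(vision) - 1) * (ncol(vision) - 1)   # 9
pchisq(chi2, df, lower.tail = FALSE)            # p-value, effectively 0 here
```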
Summary
I A contingency table is a cross-classification table of two or more categorical variables.
I Our major objective while analyzing a contingency table is testing for independence of the response and predictor variables.
I For large samples we can use the Pearson Chi-square test for independence.
I For small samples we can use Fisher's exact test for independence.
Example 16
## creating a dataset in R: data on a quality improvement study
## of a sample of wafers
y <- c(320,14,80,36)
particle <- gl(2,1,4,labels=c("no","yes"))
quality <- gl(2,2,labels=c("good","bad"))
wafer <- data.frame(y,particle,quality)
wafer
## creating a contingency table of the dataset
ov <- xtabs(y ~ quality+particle)
print(ov)
## calculating the marginal proportions
pp <- prop.table(xtabs(y ~ particle))
print(pp)
qp <- prop.table(xtabs(y ~ quality))
print(qp)
6 Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
Example 1 contd ...
## expected cell counts under independence (450 = total count)
fv <- outer(qp,pp)*450
print(fv)
## calculation of the Pearson Chi-square statistic
sum((ov-fv)^2/fv)
## Chi-square test implementing Yates' continuity correction
prop.test(ov)
## Fisher's exact test
fisher.test(ov)
Regression Analysis III Sample R script to complement this module 19 / 27
Example 27
## install.packages("faraway")
require(faraway)
data(haireye)
haireye
## contingency table of the haireye dataset
ct <- xtabs(y ~ hair + eye, haireye)
print(ct)
## Pearson Chi-square statistic along with the p-value
summary(ct)
## graphical display of the contingency table
dotchart(ct)
## illustrative plot
mosaicplot(ct,color=TRUE,main=NULL,las=1)
7 Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
Example 3: matched pairs 8
## data on vision of a sample of women
data(eyegrade)
## contingency table
ct <- xtabs(y ~ right+left, eyegrade)
print(ct)
## Pearson Chi-square statistic along with the p-value
summary(ct)
## graphical display of the contingency table
dotchart(ct)
## illustrative plot
mosaicplot(ct,color=TRUE,main=NULL,las=1)
8 Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
Example 4: Three-Way Contingency Tables 9
data(femsmoke)
femsmoke
## contingency table of smoking versus survival status (dead)
ct <- xtabs(y ~ smoker+dead, femsmoke)
print(ct)
## contingency table of proportions
prop.table(ct,1)
## Pearson Chi-square statistic along with the p-value,
## testing for independence of smoking and survival status
summary(ct)
9 Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
Example 4 contd ...
## contingency table for a given age group, 55-64
cta <- xtabs(y ~ smoker+dead, femsmoke, subset=(age=="55-64"))
print(cta)
summary(cta)
## contingency table of proportions
prop.table(cta,1)
## contingency table of smoking versus age
ct <- xtabs(y ~ smoker+age, femsmoke)
print(ct)
## Pearson Chi-square statistic along with the p-value,
## for testing independence of smoking and age
summary(ct)
Example 4 contd ...
## contingency table of proportions
prop.table(xtabs(y ~ smoker+age, femsmoke),2)
## three-way contingency table
ct3 <- xtabs(y ~ smoker+dead+age, femsmoke)
## Pearson Chi-square statistic along with the p-value,
## for testing mutual independence of smoking, death and age
summary(ct3)
Example 5: Another function to calculate the Chi-square statistic 10
## creating a frequency table with the name depsmok
depsmok <- matrix(c(144,1729,50,1290),byrow=T,ncol=2)
dimnames(depsmok) <- list(Ever_Smoker=c("Yes","No"),
                          Depression=c("Yes","No"))
addmargins(depsmok)
## computes the proportions table and marginal proportions
addmargins(prop.table(depsmok))
## computes the Chi-square test of independence
chisq.test(depsmok)
## computes the expected frequencies
chisq.test(depsmok)$expected
10 Kateri, M. (2010). Contingency Table Analysis: Methods and Implementation Using R. Springer.
Visualizing contingency tables: barplots, sieve plots and mosaic plots 11
## creating a dataset with the name confinan
confinan <- matrix(c(98,363,153,165,443,128),byrow=T,ncol=3)
dimnames(confinan) <- list(Gender=c("males","females"),
                           Conf=c("great deal","only some","hardly any"))
## data on confidence in banks and financial
## institutions, cross-classified according to gender
barplot(prop.table(confinan),density=30,legend.text=T,
        main="Confidence in Banks and Financial Institutions by Gender (GSS2008)",
        xlab="Confidence level", ylab="Proportions", ylim=c(0,0.65))
11 Source: Kateri, M. (2010). Contingency Table Analysis: Methods and Implementation Using R. Springer.
Visualizing contingency tables contd ...
## install.packages("vcd")
require(vcd)
sieve(confinan, shade=T)
mosaic(confinan, gp=shading_hcl, residuals_type="deviance",
       labeling=labeling_residuals)
## mosaic in the vcd package is an alternative to
## mosaicplot in the graphics package