Subject: Statistics
Paper: Regression Analysis III
Module: The analysis of 2x2 tables
Development Team
Principal investigator: Dr. Bhaswati Ganguli, Professor, Department of Statistics, University of Calcutta
Paper co-ordinator: Dr. Bhaswati Ganguli, Professor, Department of Statistics, University of Calcutta
Content writer: Sayantee Jana, Graduate student, Department of Mathematics and Statistics, McMaster University; Sujit Kumar Ray, Analytics professional, Kolkata
Content reviewer: Department of Statistics, University of Calcutta
Regression Analysis III 2 / 27
Definition of Contingency Table
I Let us define two categorical random variables:
I X = education level (predictor), Y = jobs (outcome)
I X: X1, X2, ..., Xp → categories of X
I Y: Y1, Y2, ..., Yq → categories of Y
Definition 1
A rectangular table having p rows for the categories of X and q columns for the categories of Y, with the cells of the table containing frequency counts of outcomes for a sample, is called a contingency table, a term introduced by Karl Pearson in 1904. A contingency table with p rows and q columns is called a p×q or p-by-q table.
1 Agresti, A. (2014). Categorical Data Analysis. John Wiley & Sons.
A 2×2 contingency table
This is how a 2×2 contingency table looks:

Table: Response variable Y vs. predictor variable X

Predictor    Response variable
variable     Yi = 0    Yi = 1
Xi = 0       n11       n12
Xi = 1       n21       n22

Here, n22 is the number of subjects who have Xi = Yi = 1, and so on.
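As a minimal sketch, such a table can be laid out in R as a named matrix; the counts n11, ..., n22 below are made up purely to illustrate the layout:

```r
## hypothetical 2x2 counts, laid out row-by-row as n11, n12, n21, n22
tab <- matrix(c(30, 10, 20, 40), byrow = TRUE, nrow = 2,
              dimnames = list(X = c("0", "1"), Y = c("0", "1")))
tab
## tab["1", "1"] is n22, the count of subjects with Xi = Yi = 1
```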
Example
Here is an example of a 2×2 contingency table2:

Table: Study of the relationship between wafer quality and the presence of particles on the wafer.

Quality   No Particles   Particles
Good      320            14
Bad       80             36

2 Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
More examples
I So in a 2×2 table both the response and the predictor variable have two categories.
I But they can also have more than two categories.
I For example, a 4×4 table would be3
Table:Unaided distance of vision of 7477 women aged 30-39.
Grade of the Grade of left eye
right eye Highest Second Third Lowest
Highest 1520 266 124 66
Second 234 1512 432 78
Third 117 362 1772 205
Lowest 36 82 179 492
3 Plackett, R.L. (1981). The Analysis of Categorical Data. Macmillan Publishing Co., Inc., New York.
Notes
I All the examples in this presentation are two-way contingency tables.
I A contingency table may also be three-way (or higher).
Joint distribution
The joint distribution of the two random variables X and Y is the joint probability of X taking a particular value, say Xi, and Y taking another value, say Yj:

πij = P(X = Xi and Y = Yj), i = 1(1)p, j = 1(1)q
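As a sketch, the empirical joint distribution of a table of counts can be obtained in R with prop.table(); the counts here are hypothetical:

```r
## hypothetical 2x2 counts
tab <- matrix(c(30, 10, 20, 40), byrow = TRUE, nrow = 2)
pi_hat <- prop.table(tab)  # estimates pi_ij = n_ij / n
pi_hat
sum(pi_hat)                # the joint probabilities sum to 1
```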
Marginal distribution
I The marginal distribution of X is the probability that X assumes a value, say Xi, over all values of Y, and vice versa for the marginal distribution of Y.
I The marginal distribution of X is:

πi0 = Σ_{j=1}^{q} πij = P(X = Xi), for i = 1(1)p

I The marginal distribution of Y is:

π0j = Σ_{i=1}^{p} πij = P(Y = Yj), for j = 1(1)q
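The two marginal distributions πi0 and π0j can be sketched in R by summing the empirical joint distribution along rows or columns; the counts are hypothetical:

```r
## hypothetical 2x2 counts
tab <- matrix(c(30, 10, 20, 40), byrow = TRUE, nrow = 2)
pi_hat <- prop.table(tab)
rowSums(pi_hat)  # pi_{i0}: marginal distribution of X
colSums(pi_hat)  # pi_{0j}: marginal distribution of Y
## margin.table(pi_hat, 1) and margin.table(pi_hat, 2) give the same values
```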
Conditional distribution
I The conditional distribution of Y given X = Xi is the conditional probability that Y assumes some value, say Yj, given that we know that X has assumed the value Xi.
I The conditional distribution of Y given X = Xi:

πj|i = πij / πi0, for j = 1(1)q, ∀ i = 1(1)p

I The conditional distribution of X given Y = Yj:

πi|j = πij / π0j, for i = 1(1)p, ∀ j = 1(1)q
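In R, prop.table() with a margin argument divides each cell by the corresponding marginal, which is exactly πj|i or πi|j; counts hypothetical:

```r
## hypothetical 2x2 counts
tab <- matrix(c(30, 10, 20, 40), byrow = TRUE, nrow = 2)
prop.table(tab, 1)  # pi_{j|i}: each row sums to 1 (Y given X = Xi)
prop.table(tab, 2)  # pi_{i|j}: each column sums to 1 (X given Y = Yj)
```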
Independence and Chi-square test
I If X and Y are independent random variables then πij = πi0 π0j, or equivalently πj|i = π0j.
I Our primary objective in the analysis of a contingency table is testing the independence of the two categorical variables.
I So in a 2×2 table, the test statistic popularly used for testing the hypothesis H0: πij = πi0 π0j, ∀ i = 1(1)p, ∀ j = 1(1)q, is

χ2 = n (n11 n22 − n12 n21)^2 / [(n11 + n12)(n21 + n22)(n11 + n21)(n12 + n22)]

where nij are the corresponding cell counts and n is the total count.
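As a check, for the wafer data introduced earlier the closed-form 2×2 statistic agrees with chisq.test() run without the continuity correction:

```r
ov <- matrix(c(320, 14, 80, 36), byrow = TRUE, nrow = 2)  # wafer table
n <- sum(ov)
## closed-form 2x2 Pearson chi-square
chi2 <- n * (ov[1, 1] * ov[2, 2] - ov[1, 2] * ov[2, 1])^2 /
        (prod(rowSums(ov)) * prod(colSums(ov)))
chi2
chisq.test(ov, correct = FALSE)$statistic  # same value
```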
Fisher’s exact test
I But the Pearson Chi-square statistic mentioned above is not suitable for small samples.
I For small samples, Fisher’s exact test is recommended. With the margins held fixed, the probability of a given 2×2 table under H0 is4

(n11 + n12)! (n21 + n22)! (n11 + n21)! (n12 + n22)! / (n! n11! n12! n21! n22!)
4 Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
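The probability above is hypergeometric, so for the wafer table it can be evaluated directly with dhyper(); fisher.test() then sums such probabilities over all tables at least as extreme:

```r
ov <- matrix(c(320, 14, 80, 36), byrow = TRUE, nrow = 2)
## P(observing this table | fixed margins): draw the column-1 total from the
## two row totals and count how many come from row 1
p_tab <- dhyper(ov[1, 1], m = sum(ov[1, ]), n = sum(ov[2, ]), k = sum(ov[, 1]))
p_tab
fisher.test(ov)$p.value
```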
Cumulative distribution and stochastically larger distributions
I Suppose X is fixed at Xi.
I The cumulative distribution of Y | X = Xi is defined as:

Fj|i = Σ_{k ≤ j} πk|i

I For two rows i and i′, if Fj|i ≤ Fj|i′ ∀ j = 1(1)q, then the distribution of Y | X = Xi is called stochastically larger than the distribution of Y | X = Xi′.
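A minimal sketch of this comparison in R, using a made-up 2×3 table: cumulate each conditional row and compare the two rows pointwise.

```r
## hypothetical 2x3 counts; row 1 puts more mass on the higher categories
tab <- matrix(c(10, 20, 30,
                30, 20, 10), byrow = TRUE, nrow = 2)
cond <- prop.table(tab, 1)            # pi_{k|i}, rows sum to 1
Fcond <- t(apply(cond, 1, cumsum))    # F_{j|i}, cumulative along each row
Fcond
## row 1's cdf lies below row 2's everywhere, so Y | X = x1 is
## stochastically larger than Y | X = x2
all(Fcond[1, ] <= Fcond[2, ])
```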
The multinomial distribution
I The pmf of the multinomial distribution is

n! / (x1! x2! ... xk!) (p1)^{x1} (p2)^{x2} ... (pk)^{xk}, where x1 + x2 + ... + xk = n and p1 + p2 + ... + pk = 1

I It is the joint probability of k mutually exclusive outcomes in n independent trials, where pi is the probability of the ith outcome and Xi is the number of times (out of the n trials) the ith outcome occurs.
I E(Xi) = n pi and Var(Xi) = n pi (1 − pi)
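R's dmultinom() evaluates this pmf directly; a small hypothetical check against the formula:

```r
## P(X1 = 3, X2 = 2, X3 = 1) in n = 6 trials with cell probabilities .5, .3, .2
dmultinom(c(3, 2, 1), prob = c(0.5, 0.3, 0.2))
## same value from the pmf: 6!/(3! 2! 1!) * 0.5^3 * 0.3^2 * 0.2^1
factorial(6) / (factorial(3) * factorial(2) * factorial(1)) *
  0.5^3 * 0.3^2 * 0.2^1    # 0.135
```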
Pearson Chi-square statistic
I Suppose we have n samples that have been cross-classified into pq categories (p categories of X and q categories of Y). Then the distribution of the cell counts nij is a multinomial distribution with n trials and probabilities pij. Thus the joint probability of the cell counts nij, i = 1(1)p, j = 1(1)q, is

n! / (Πi Πj nij!) Πi Πj (pij)^{nij},

where Σi Σj nij = n and Σi Σj pij = 1
Pearson Chi-square statistic
I Pearson Chi-square statistic for the multinomial distribution
I Under independence, μij = E(nij) = n pi0 p0j, estimated by ni0 n0j / n
I So the required Chi-square statistic for multinomial data is

χ2 = Σi Σj (nij − μij)^2 / μij, with d.f. = (p − 1)(q − 1)
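As a sketch, the statistic can be computed by hand for the 4×4 vision table shown earlier and referred to the χ² distribution with (4 − 1)(4 − 1) = 9 d.f.:

```r
## unaided vision data (right eye grade x left eye grade), from the module
vision <- matrix(c(1520, 266, 124, 66,
                   234, 1512, 432, 78,
                   117, 362, 1772, 205,
                   36, 82, 179, 492), byrow = TRUE, nrow = 4)
mu <- outer(rowSums(vision), colSums(vision)) / sum(vision)  # mu_ij = ni0 n0j / n
chi2 <- sum((vision - mu)^2 / mu)
df <- (nrow(vision) - 1) * (ncol(vision) - 1)   # 9
pchisq(chi2, df, lower.tail = FALSE)            # p-value, effectively 0 here
```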
Summary
I A contingency table is a cross-classification table of two or more categorical variables.
I Our major objective while analyzing a contingency table is testing for independence of the response and predictor variables.
I For large samples we can use the Pearson Chi-square test for independence.
I For small samples we can use Fisher's exact test for independence.
Example 16
## creating a dataset in R: data on a quality improvement study
## of a sample of wafers
y <- c(320,14,80,36)
particle <- gl(2,1,4,labels=c("no","yes"))
quality <- gl(2,2,labels=c("good","bad"))
wafer <- data.frame(y,particle,quality)
wafer
## creating a contingency table of the dataset
ov <- xtabs(y ~ quality+particle)
print(ov)
## calculating the marginal proportions
pp <- prop.table(xtabs(y ~ particle))
print(pp)
qp <- prop.table(xtabs(y ~ quality))
print(qp)
6 Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
Example 1 contd ...
## expected cell counts under independence (450 = total count)
fv <- outer(qp,pp)*450
print(fv)
## calculation of the Pearson Chi-square statistic
sum((ov-fv)^2/fv)
## Chi-square test implementing Yates' continuity correction
prop.test(ov)
## Fisher's exact test
fisher.test(ov)
Regression Analysis III Sample R script to complement this module 19 / 27
Example 27
## install.packages("faraway")
require(faraway)
data(haireye)
haireye
## contingency table of the haireye dataset
ct <- xtabs(y ~ hair + eye, haireye)
print(ct)
## Pearson Chi-square statistic along with the p-value
summary(ct)
## graphical display of the contingency table
dotchart(ct)
## illustrative plot
mosaicplot(ct,color=TRUE,main=NULL,las=1)
7 Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
Example 3: matched pairs 8
## data on vision of a sample of women
data(eyegrade)
## contingency table
ct <- xtabs(y ~ right+left, eyegrade)
print(ct)
## Pearson Chi-square statistic along with the p-value
summary(ct)
## graphical display of the contingency table
dotchart(ct)
## illustrative plot
mosaicplot(ct,color=TRUE,main=NULL,las=1)
8 Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
Example 4: Three-Way Contingency Tables 9
data(femsmoke)
femsmoke
## contingency table of smoking versus survival status (dead)
ct <- xtabs(y ~ smoker+dead, femsmoke)
print(ct)
## contingency table of proportions
prop.table(ct,1)
## Pearson Chi-square statistic along with the p-value,
## testing for independence of smoking and survival status
summary(ct)
9 Faraway, J.J. (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
Example 4 contd ...
## contingency table for a given age group, 55-64
cta <- xtabs(y ~ smoker+dead, femsmoke, subset=(age=="55-64"))
print(cta)
summary(cta)
## contingency table of proportions
prop.table(cta,1)
## contingency table of smoking versus age
ct <- xtabs(y ~ smoker+age, femsmoke)
print(ct)
## Pearson Chi-square statistic along with the p-value,
## for testing independence of smoking and age
summary(ct)
Example 4 contd ...
## contingency table of proportions
prop.table(xtabs(y ~ smoker+age, femsmoke),2)
## three-way contingency table
ct3 <- xtabs(y ~ smoker+dead+age, femsmoke)
## Pearson Chi-square statistic along with the p-value,
## for testing mutual independence of smoking, death and age
summary(ct3)
Example 5: Another function to calculate the Chi-square statistic 10
## creating a frequency table with the name depsmok
depsmok <- matrix(c(144,1729,50,1290),byrow=T,ncol=2)
dimnames(depsmok) <- list(Ever_Smoker=c("Yes","No"),
                          Depression=c("Yes","No"))
addmargins(depsmok)
## computes the proportions table and marginal proportions
addmargins(prop.table(depsmok))
## computes the Chi-square test of independence
chisq.test(depsmok)
## computes the expected frequencies
chisq.test(depsmok)$expected
10 Kateri, M. (2010). Contingency Table Analysis: Methods and Implementation Using R. Springer.
Visualizing contingency tables: barplots, sieve plots and mosaic plots 11
## creating a dataset with the name confinan
confinan <- matrix(c(98,363,153,165,443,128),byrow=T,ncol=3)
dimnames(confinan) <- list(Gender=c("males","females"),
                           Conf=c("great deal","only some","hardly any"))
## data on confidence in banks and financial
## institutions, cross-classified according to gender
barplot(prop.table(confinan),density=30,legend.text=T,
        main="Confidence in Banks and Financial Institutions by Gender (GSS2008)",
        xlab="Confidence level", ylab="Proportions", ylim=c(0,0.65))
11 Source: Kateri, M. (2010). Contingency Table Analysis: Methods and Implementation Using R. Springer.
Visualizing contingency tables contd ...
## install.packages("vcd")
require(vcd)
sieve(confinan, shade=T)
mosaic(confinan, gp=shading_hcl, residuals_type="deviance",
       labeling=labeling_residuals)
## mosaic in the vcd package is an alternative to
## mosaicplot in the graphics package