Paper: Regression Analysis III
Module: Models with constant Coefficient of
Variation
Principal investigator: Dr. Bhaswati Ganguli,Professor, Department of Statistics, University of Calcutta
Paper co-ordinator: Dr. Bhaswati Ganguli,Professor, Department of Statistics, University of Calcutta
Content writer: Sayantee Jana, Graduate student, Department of Mathematics and Statistics, McMaster University Sujit Kumar Ray,Analytics professional, Kolkata
Content reviewer: Department of Statistics, University of Calcutta
I
Usual linear model for continuous data assumes that the variance is constant.
I
The normality assumption of errors is then generally made.
I
Very often the variance may increase with the mean - true
both for continuous data and discrete data.
I
Usual linear model for continuous data assumes that the variance is constant.
I
The normality assumption of errors is then generally made.
I
Very often the variance may increase with the mean - true
both for continuous data and discrete data.
I
Usual linear model for continuous data assumes that the variance is constant.
I
The normality assumption of errors is then generally made.
I
Very often the variance may increase with the mean - true
both for continuous data and discrete data.
I
Let
Ybe the response variable.
I E(Y) =µ
and
V ar(Y) = Φµ2.
I Φ = V ar(Yµ2 ) = (CV)2
I Φ=constant : variance changes with the mean but CV is
constant.
I
Let
Ybe the response variable.
I E(Y) =µ
and
V ar(Y) = Φµ2.
I Φ = V ar(Yµ2 ) = (CV)2
I Φ=constant : variance changes with the mean but CV is
constant.
I
Let
Ybe the response variable.
I E(Y) =µ
and
V ar(Y) = Φµ2.
I Φ = V ar(Yµ2 ) = (CV)2
I Φ=constant : variance changes with the mean but CV is
constant.
I
Let
Ybe the response variable.
I E(Y) =µ
and
V ar(Y) = Φµ2.
I Φ = V ar(Yµ2 ) = (CV)2
I Φ=constant : variance changes with the mean but CV is
constant.
I
One technique : If
Yis continuous make the transformation -
Y∗ =log(Y)E(Y∗) =log(µ)
V ar(Y∗) = Φ
I Y∗
has a constant variance.
I Y
- log normal
⇒Y∗normal
I
One technique : If
Yis continuous make the transformation -
Y∗ =log(Y)E(Y∗) =log(µ)
V ar(Y∗) = Φ
I Y∗
has a constant variance.
I Y
- log normal
⇒Y∗normal
I
One technique : If
Yis continuous make the transformation -
Y∗ =log(Y)E(Y∗) =log(µ)
V ar(Y∗) = Φ
I Y∗
has a constant variance.
I Y
- log normal
⇒Y∗normal
I
Second technique : Use original
Yand the GLM setup instead of making a transformation.
I
Gamma model :
f(y) =Γ(ν)αν e−αyyν−1I µ=E(Y) = αν ⇒α= νµ
I f(y) = (ν/µ)Γ(ν)νeνµyyν−1
I
Second technique : Use original
Yand the GLM setup instead of making a transformation.
I
Gamma model :
f(y) =Γ(ν)αν e−αyyν−1I µ=E(Y) = αν ⇒α= νµ
I f(y) = (ν/µ)Γ(ν)νeνµyyν−1
I
Second technique : Use original
Yand the GLM setup instead of making a transformation.
I
Gamma model :
f(y) =Γ(ν)αν e−αyyν−1I µ=E(Y) = αν ⇒α= νµ
I f(y) = (ν/µ)Γ(ν)νeνµyyν−1
I
Second technique : Use original
Yand the GLM setup instead of making a transformation.
I
Gamma model :
f(y) =Γ(ν)αν e−αyyν−1I µ=E(Y) = αν ⇒α= νµ
I f(y) = (ν/µ)Γ(ν)νeνµyyν−1
I
The pdf can be expressed in an exponential form.
I E(Y) =µ
I V ar(Y) = Φµ2
I
Gamma distribution leads to the constant CV model.
I
The canonical link for this distribution is the inverse link but a
much preferred link is the log link.
I
The pdf can be expressed in an exponential form.
I E(Y) =µ
I V ar(Y) = Φµ2
I
Gamma distribution leads to the constant CV model.
I
The canonical link for this distribution is the inverse link but a
much preferred link is the log link.
I
The pdf can be expressed in an exponential form.
I E(Y) =µ
I V ar(Y) = Φµ2
I
Gamma distribution leads to the constant CV model.
I
The canonical link for this distribution is the inverse link but a
much preferred link is the log link.
I
The pdf can be expressed in an exponential form.
I E(Y) =µ
I V ar(Y) = Φµ2
I
Gamma distribution leads to the constant CV model.
I
The canonical link for this distribution is the inverse link but a
much preferred link is the log link.
I
The pdf can be expressed in an exponential form.
I E(Y) =µ
I V ar(Y) = Φµ2
I
Gamma distribution leads to the constant CV model.
I
The canonical link for this distribution is the inverse link but a
much preferred link is the log link.
I
Very often in a model the variance may increase with the mean but the CV may be constant.
I
One technique to overcome this is to use transformation of variable.
I
Another technique is to use the Gamma distribution which leads to the constant CV model.
I
The canonical link for this distribution is the inverse link but a
much preferred link is the log link.
I
Very often in a model the variance may increase with the mean but the CV may be constant.
I
One technique to overcome this is to use transformation of variable.
I
Another technique is to use the Gamma distribution which leads to the constant CV model.
I
The canonical link for this distribution is the inverse link but a
much preferred link is the log link.
I
Very often in a model the variance may increase with the mean but the CV may be constant.
I
One technique to overcome this is to use transformation of variable.
I
Another technique is to use the Gamma distribution which leads to the constant CV model.
I
The canonical link for this distribution is the inverse link but a
much preferred link is the log link.
I
Very often in a model the variance may increase with the mean but the CV may be constant.
I
One technique to overcome this is to use transformation of variable.
I
Another technique is to use the Gamma distribution which leads to the constant CV model.
I
The canonical link for this distribution is the inverse link but a
much preferred link is the log link.
# install.packages("faraway") require(faraway)
data(clot)
clotting <- data.frame(
u = c(5,10,15,20,30,40,60,80,100), lot1 = c(118,58,42,35,27,25,21,19,18), lot2 = c(69,35,26,21,18,16,13,12,12))
## lot1 represents the time taken for blood to clot for the
## first lot and similarly for lot2, it represents the time
## taken for blood clot for the second lot u
## represents the percentage of plasma concnetration
1Source of data : McCullagh Nelder (1989, pp. 300-2)
summary(glm(lot1 ~ log(u), data = clotting, family = Gamma)) summary(glm(lot2 ~ log(u), data = clotting, family = Gamma))
## The times for clotting seems to be significant for both
## the lots
## we will fit one model with lot (as a factor representing
## which lot the sample unit belongs to) as one of the
## predictor variables. For that we need to create a new
## dataset which combines the data for the two lots.
## creating plasma, lot and clot.time variables from the data
## frame (i.e., transform from two columns of clotting times
## to one). Alternatively we can also use the 'clot' data
## available in the faraway library in R.
plasma <- rep(clotting$u, 2)
lot <- factor( c(rep(1,length(clotting$u)), rep(2,length(clotting$u))) )
clot.time <- c(clotting$lot1, clotting$lot2) log.plasma <- log10(plasma)
4the blood clotting example, Stat 7430 by Peter F. Craigmile, Mar 2015
## summary plots of the data
plot(plasma, clot.time, xlab="plasma %", ylab="clotting time", pch=c(16,17),col=c(2,4))
legend(60, 110, c("Lot 1", "Lot 2"), pch=c(16,17),col=c(2,4))
## Interpretation : a straight-line model for mean response
## as a function of the percentage concentration would be
## inappropriate.
plot(plasma,1/clot.time,xlab="plasma %",ylab="1/(clotting time)", pch=c(16,17),col=c(2,4), ylim=c(0,0.1))
legend("bottomright",c("Lot 1","Lot 2"),pch=c(16,17),col=c(2,4))
## Interpretation: the plot of the inverse of the clotting times
## against the percentage of concentration is non-linear as well.
plot(log.plasma, clot.time, xlab="log10(plasma %)", ylab="(clotting time)",pch=c(16,17),col=c(2,4)) legend("topright", c("Lot 1", "Lot 2"), pch=c(16,17),
col=c(2,4))
## Interpretation : the plot of the clotting times against the
## log of concentration also shows a non-linear relationship.
plot(log.plasma, log(clot.time), xlab="plasma %", ylab="log(clotting time)",pch=c(16,17),col=c(2,4))
legend("topright", c("Lot 1", "Lot 2"), pch=c(16,17),col=c(2,4))
## Interpretation : relationship is roughly linear plot(log.plasma, 1/clot.time,
xlab="log10(plasma %)", ylab="1/(clotting time)", pch=c(16,17),col=c(2,4),ylim=c(0,0.1))
legend("bottomright",c("Lot 1","Lot 2"),pch=c(16,17),col=c(2,4))
## Interpretation : the plot looks more linear, so the
## relationship of log of concentration and inverse of
## the clotting time seems linear.
## that we have considered. Also we have to keep in mind , that
## a sensible model for mean response time would be
## one that keeps the positivity restriction for the response.
## Fitting a Gamma GLM model to the data, including the
## interactions.
model <- glm(clot.time ~ log.plasma * lot, family=Gamma(link = "inverse"))
## It is to be noted that it was not necessary to specify the
## link function here since the inverse link is the default
## link for the Gamma family. But the link has been mentioned
## in this excercise to emphasize that an inverse link is being
## used here.
summary(model)## summarize the model
## Interpretation : All the predictor variables seem to be
## significant including the interaction term.
sigma2.hat <- summary(model)$dispersion
sigma2.hat ## the estimate of the dispersion parameter nu.hat <- 1/summary(model)$dispersion
nu.hat ## the corresponding estimate of nu
## The estimated dispersion parameter, which for the gamma
## is the estimated coefficient of variation (CV) is very
## small 0.002. This indicates that the estimated variation
## for the fitted gamma model, relative to the mean, is
## small - indicating a good fit.
## the analysis of deviance table, adjusting for overdispersion anova(model, test="F")
## Interpretation : It seems more appropriate to use the model
## with all the three predictors - log of concentration of plasma,
## lot and the interaction between the log of concentration of
## plasma and lot. One reason for the above conclusion is the
## least deviance of the last model and another reason being
## that it is significant and all the terms it includes are
## significant, which is evident from the first two models
## where each term is being added sequentially.
fits <- fitted(model) ## fitted response values dev.resids <- resid(model) ## model residuals
## Diagnostics plots
par(mfrow=c(1,3), bty="L", cex=0.7, mar=c(4.1, 4.1, 1, 1)) plot(fits, dev.resids,
xlab="fitted values", ylab="deviance residuals", ylim=c(-0.1, 0.1),pch=c(16,17),col=c(2,4)) abline(h=0, lty=2)
plot(log.plasma[lot=="1"], dev.resids[lot=="1"], pch=16,col=2, xlab="log10(plasma %)", ylab="deviances",ylim=c(-0.1, 0.1)) points(log.plasma[lot=="2"], dev.resids[lot=="2"], pch=17,col=4) abline(h=0, lty=2)
qqnorm(dev.resids, main="", ylim=c(-0.1, 0.1)) qqline(dev.resids)
## Interpretation : the overall fit seems to be fairly good.
## fucntion to calculate the hat matrix for a GLM object 'model' calculate.hat.matrix <- function (model) {
w <- model$weights X <- model.matrix(model) A <- X * sqrt(w)
A %*% solve(crossprod(A)) %*% t(A)}
H <- calculate.hat.matrix(model)
# Calculating the hat matrix for our model
plot(log.plasma[lot=="1"], diag(H)[lot=="1"], pch=16,col=2, xlab="log10(plasma %)", ylab="Diagonal elements of the Hat matrix",ylim=c(0, 1))
points(log.plasma[lot=="2"], diag(H)[lot=="2"], pch=17,col=4) legend("topright", c("Lot 1", "Lot 2"), pch=c(16,17),col=c(2,4))
n <- length(log.plasma) ## Number of cases
Di<- cooks.distance(model) ## Cook's distance for our Gamma GLM
## Plotting Cook's Distance to identify outliers
par(mfrow=c(1,1), bty="L", cex=0.75, mar=c(4.1, 4.1, 1, 1)) plot(log.plasma[lot=="1"], Di[lot=="1"], pch=16,col=2,
xlab="log10(plasma %)",ylab="Cook's Distance",ylim=c(0, 20)) points(log.plasma[lot=="2"], Di[lot=="2"], pch=17,col=4)
abline(h=qf(0.5, p, n-p), lty=2, col="green")
text(1.75, qf(0.5, p, n-p)+0.6, "50th percentile of F(p, n-p) distribution", cex=1.0, col="lightgreen")
abline(h=qf(0.95, p, n-p), lty=2, col="green")
text(1.75, qf(0.95, p, n-p)+0.6, "95th percentile of F(p, n-p) distribution", cex=1.0, col="lightgreen")
## There seems to be two outliers, one from each lot, since
## their Cook's Distance are substantially larger than the rest.
beta = model$coefficients ## regression coefficients beta
gg=summary(model)
cov=gg$cov.scaled # covariance matrix of regression coefficients cov
xd = model.matrix(model) ## the design matrix for our model xd
lp = xd %*% beta ## Creates estimated linear predictor Xd*beta vlp=xd %*%cov%*%t(xd) # Covariance matrix of linear predictor q = sqrt( diag(vlp) ) ## s.e. of linear predictors, as vector
## approximate lower and upper 95% CI for linear predictors low = lp - 1.96*q
up = lp + 1.96*q
## plot of data on inverse response scale with fit and CI
## then data on original scale but fit and CI on inverse scale par(mfrow=c(1,2))
plot(log(plasma),1/clot.time);
lines(log(plasma),lp); lines(log(plasma),up,lty=3);
lines(log(plasma),low,lty=3) plot(log(plasma),clot.time)
lines(log(plasma),1/lp); lines(log(plasma),1/up,lty=3);
lines(log(plasma),1/low,lty=3)
## The plots suggest that the fit is pretty good but it looks
## very messy, so we will fit the models seperately for each
## lot and plot them.
fit1=glm(lot1 ~ log(u), data = clotting, family = Gamma) beta = fit1$coefficients ## regression coefficients beta
gg=summary(fit1)
cov=gg$cov.scaled # covariance matrix of regression coefficients cov
## Creating design matrix for the model nlot=as.numeric(lot)
xd = cbind(rep(1,9),log(clotting$u)) xd
lp = xd %*% beta ## Creates estimated linear predictor Xd*beta vlp=xd %*%cov%*%t(xd) # Covariance matrix of linear predictor q = sqrt( diag(vlp) ) ## s.e. of linear predictors, as vector
## approximate lower and upper 95% CI for linear predictors low = lp - 1.96*q
## plot of data for Lot 1 on inverse response scale,
## then fit and CI par(mfrow=c(1,2))
plot(log(clotting$u),1/clotting$lot1);
lines(log(clotting$u),lp); lines(log(clotting$u),up,lty=3);
lines(log(clotting$u),low,lty=3) plot(log(clotting$u),clotting$lot1);
lines(log(clotting$u),1/lp); lines(log(clotting$u),1/up,lty=3);
lines(log(clotting$u),1/low,lty=3)
fit2=glm(lot2 ~ log(u), data = clotting, family = Gamma) beta = fit2$coefficients ## regression coefficients beta
gg=summary(fit2)
cov=gg$cov.scaled # covariance matrix of regression coefficients cov
## Creating design matrix for the model nlot=as.numeric(lot)
xd = cbind(rep(1,9),log(clotting$u)) xd
lp = xd %*% beta ## Creates estimated linear predictor Xd*beta vlp=xd %*%cov%*%t(xd) # Covariance matrix of linear predictor q = sqrt( diag(vlp) ) ## s.e. of linear predictors, as vector
## approximate lower and upper 95% CI for linear predictors low = lp - 1.96*q
## plot of data for Lot 2 on inverse response scale,
## then fit and CI par(mfrow=c(1,2))
plot(log(clotting$u),1/clotting$lot2);
lines(log(clotting$u),lp); lines(log(clotting$u),up,lty=3);
lines(log(clotting$u),low,lty=3) plot(log(clotting$u),clotting$lot2);
lines(log(clotting$u),1/lp); lines(log(clotting$u),1/up,lty=3);
lines(log(clotting$u),1/low,lty=3)
## The plots are lot less messy and more comprehendible as
## they have been split for the two different lots. And it is
## very evident from the plots for both the lots that a Gamma
## model with inverse link is a very good fit for this data.