• No results found

Usual linear model for continuous data assumes that the variance is constant.

N/A
N/A
Protected

Academic year: 2022

Share "Usual linear model for continuous data assumes that the variance is constant."

Copied!
43
0
0

Loading.... (view fulltext now)

Full text

(1)

Paper: Regression Analysis III

Module: Models with constant Coefficient of

Variation

(2)

Principal investigator: Dr. Bhaswati Ganguli,Professor, Department of Statistics, University of Calcutta

Paper co-ordinator: Dr. Bhaswati Ganguli,Professor, Department of Statistics, University of Calcutta

Content writer: Sayantee Jana, Graduate student, Department of Mathematics and Statistics, McMaster University Sujit Kumar Ray,Analytics professional, Kolkata

Content reviewer: Department of Statistics, University of Calcutta

(3)

I

Usual linear model for continuous data assumes that the variance is constant.

I

The normality assumption of errors is then generally made.

I

Very often the variance may increase with the mean - true

both for continuous data and discrete data.

(4)

I

Usual linear model for continuous data assumes that the variance is constant.

I

The normality assumption of errors is then generally made.

I

Very often the variance may increase with the mean - true

both for continuous data and discrete data.

(5)

I

Usual linear model for continuous data assumes that the variance is constant.

I

The normality assumption of errors is then generally made.

I

Very often the variance may increase with the mean - true

both for continuous data and discrete data.

(6)

I

Let

Y

be the response variable.

I E(Y) =µ

and

V ar(Y) = Φµ2

.

I Φ = V ar(Yµ2 ) = (CV)2

I Φ=constant : variance changes with the mean but CV is

constant.

(7)

I

Let

Y

be the response variable.

I E(Y) =µ

and

V ar(Y) = Φµ2

.

I Φ = V ar(Yµ2 ) = (CV)2

I Φ=constant : variance changes with the mean but CV is

constant.

(8)

I

Let

Y

be the response variable.

I E(Y) =µ

and

V ar(Y) = Φµ2

.

I Φ = V ar(Yµ2 ) = (CV)2

I Φ=constant : variance changes with the mean but CV is

constant.

(9)

I

Let

Y

be the response variable.

I E(Y) =µ

and

V ar(Y) = Φµ2

.

I Φ = V ar(Yµ2 ) = (CV)2

I Φ=constant : variance changes with the mean but CV is

constant.

(10)

I

One technique : If

Y

is continuous make the transformation -

Y =log(Y)

E(Y) =log(µ)

V ar(Y) = Φ

I Y

has a constant variance.

I Y

- log normal

⇒Y

normal

(11)

I

One technique : If

Y

is continuous make the transformation -

Y =log(Y)

E(Y) =log(µ)

V ar(Y) = Φ

I Y

has a constant variance.

I Y

- log normal

⇒Y

normal

(12)

I

One technique : If

Y

is continuous make the transformation -

Y =log(Y)

E(Y) =log(µ)

V ar(Y) = Φ

I Y

has a constant variance.

I Y

- log normal

⇒Y

normal

(13)

I

Second technique : Use original

Y

and the GLM setup instead of making a transformation.

I

Gamma model :

f(y) =Γ(ν)αν e−αyyν−1

I µ=E(Y) = αν ⇒α= νµ

I f(y) = (ν/µ)Γ(ν)νeνµyyν−1

(14)

I

Second technique : Use original

Y

and the GLM setup instead of making a transformation.

I

Gamma model :

f(y) =Γ(ν)αν e−αyyν−1

I µ=E(Y) = αν ⇒α= νµ

I f(y) = (ν/µ)Γ(ν)νeνµyyν−1

(15)

I

Second technique : Use original

Y

and the GLM setup instead of making a transformation.

I

Gamma model :

f(y) =Γ(ν)αν e−αyyν−1

I µ=E(Y) = αν ⇒α= νµ

I f(y) = (ν/µ)Γ(ν)νeνµyyν−1

(16)

I

Second technique : Use original

Y

and the GLM setup instead of making a transformation.

I

Gamma model :

f(y) =Γ(ν)αν e−αyyν−1

I µ=E(Y) = αν ⇒α= νµ

I f(y) = (ν/µ)Γ(ν)νeνµyyν−1

(17)

I

The pdf can be expressed in an exponential form.

I E(Y) =µ

I V ar(Y) = Φµ2

I

Gamma distribution leads to the constant CV model.

I

The canonical link for this distribution is the inverse link but a

much preferred link is the log link.

(18)

I

The pdf can be expressed in an exponential form.

I E(Y) =µ

I V ar(Y) = Φµ2

I

Gamma distribution leads to the constant CV model.

I

The canonical link for this distribution is the inverse link but a

much preferred link is the log link.

(19)

I

The pdf can be expressed in an exponential form.

I E(Y) =µ

I V ar(Y) = Φµ2

I

Gamma distribution leads to the constant CV model.

I

The canonical link for this distribution is the inverse link but a

much preferred link is the log link.

(20)

I

The pdf can be expressed in an exponential form.

I E(Y) =µ

I V ar(Y) = Φµ2

I

Gamma distribution leads to the constant CV model.

I

The canonical link for this distribution is the inverse link but a

much preferred link is the log link.

(21)

I

The pdf can be expressed in an exponential form.

I E(Y) =µ

I V ar(Y) = Φµ2

I

Gamma distribution leads to the constant CV model.

I

The canonical link for this distribution is the inverse link but a

much preferred link is the log link.

(22)

I

Very often in a model the variance may increase with the mean but the CV may be constant.

I

One technique to overcome this is to use transformation of variable.

I

Another technique is to use the Gamma distribution which leads to the constant CV model.

I

The canonical link for this distribution is the inverse link but a

much preferred link is the log link.

(23)

I

Very often in a model the variance may increase with the mean but the CV may be constant.

I

One technique to overcome this is to use transformation of variable.

I

Another technique is to use the Gamma distribution which leads to the constant CV model.

I

The canonical link for this distribution is the inverse link but a

much preferred link is the log link.

(24)

I

Very often in a model the variance may increase with the mean but the CV may be constant.

I

One technique to overcome this is to use transformation of variable.

I

Another technique is to use the Gamma distribution which leads to the constant CV model.

I

The canonical link for this distribution is the inverse link but a

much preferred link is the log link.

(25)

I

Very often in a model the variance may increase with the mean but the CV may be constant.

I

One technique to overcome this is to use transformation of variable.

I

Another technique is to use the Gamma distribution which leads to the constant CV model.

I

The canonical link for this distribution is the inverse link but a

much preferred link is the log link.

(26)

# install.packages("faraway") require(faraway)

data(clot)

clotting <- data.frame(

u = c(5,10,15,20,30,40,60,80,100), lot1 = c(118,58,42,35,27,25,21,19,18), lot2 = c(69,35,26,21,18,16,13,12,12))

## lot1 represents the time taken for blood to clot for the

## first lot and similarly for lot2, it represents the time

## taken for blood clot for the second lot u

## represents the percentage of plasma concnetration

1Source of data : McCullagh Nelder (1989, pp. 300-2)

(27)

summary(glm(lot1 ~ log(u), data = clotting, family = Gamma)) summary(glm(lot2 ~ log(u), data = clotting, family = Gamma))

## The times for clotting seems to be significant for both

## the lots

(28)

## we will fit one model with lot (as a factor representing

## which lot the sample unit belongs to) as one of the

## predictor variables. For that we need to create a new

## dataset which combines the data for the two lots.

## creating plasma, lot and clot.time variables from the data

## frame (i.e., transform from two columns of clotting times

## to one). Alternatively we can also use the 'clot' data

## available in the faraway library in R.

plasma <- rep(clotting$u, 2)

lot <- factor( c(rep(1,length(clotting$u)), rep(2,length(clotting$u))) )

clot.time <- c(clotting$lot1, clotting$lot2) log.plasma <- log10(plasma)

4the blood clotting example, Stat 7430 by Peter F. Craigmile, Mar 2015

(29)

## summary plots of the data

plot(plasma, clot.time, xlab="plasma %", ylab="clotting time", pch=c(16,17),col=c(2,4))

legend(60, 110, c("Lot 1", "Lot 2"), pch=c(16,17),col=c(2,4))

## Interpretation : a straight-line model for mean response

## as a function of the percentage concentration would be

## inappropriate.

plot(plasma,1/clot.time,xlab="plasma %",ylab="1/(clotting time)", pch=c(16,17),col=c(2,4), ylim=c(0,0.1))

legend("bottomright",c("Lot 1","Lot 2"),pch=c(16,17),col=c(2,4))

## Interpretation: the plot of the inverse of the clotting times

## against the percentage of concentration is non-linear as well.

(30)

plot(log.plasma, clot.time, xlab="log10(plasma %)", ylab="(clotting time)",pch=c(16,17),col=c(2,4)) legend("topright", c("Lot 1", "Lot 2"), pch=c(16,17),

col=c(2,4))

## Interpretation : the plot of the clotting times against the

## log of concentration also shows a non-linear relationship.

(31)

plot(log.plasma, log(clot.time), xlab="plasma %", ylab="log(clotting time)",pch=c(16,17),col=c(2,4))

legend("topright", c("Lot 1", "Lot 2"), pch=c(16,17),col=c(2,4))

## Interpretation : relationship is roughly linear plot(log.plasma, 1/clot.time,

xlab="log10(plasma %)", ylab="1/(clotting time)", pch=c(16,17),col=c(2,4),ylim=c(0,0.1))

legend("bottomright",c("Lot 1","Lot 2"),pch=c(16,17),col=c(2,4))

## Interpretation : the plot looks more linear, so the

## relationship of log of concentration and inverse of

## the clotting time seems linear.

(32)

## that we have considered. Also we have to keep in mind , that

## a sensible model for mean response time would be

## one that keeps the positivity restriction for the response.

## Fitting a Gamma GLM model to the data, including the

## interactions.

model <- glm(clot.time ~ log.plasma * lot, family=Gamma(link = "inverse"))

## It is to be noted that it was not necessary to specify the

## link function here since the inverse link is the default

## link for the Gamma family. But the link has been mentioned

## in this excercise to emphasize that an inverse link is being

## used here.

summary(model)## summarize the model

## Interpretation : All the predictor variables seem to be

## significant including the interaction term.

(33)

sigma2.hat <- summary(model)$dispersion

sigma2.hat ## the estimate of the dispersion parameter nu.hat <- 1/summary(model)$dispersion

nu.hat ## the corresponding estimate of nu

## The estimated dispersion parameter, which for the gamma

## is the estimated coefficient of variation (CV) is very

## small 0.002. This indicates that the estimated variation

## for the fitted gamma model, relative to the mean, is

## small - indicating a good fit.

(34)

## the analysis of deviance table, adjusting for overdispersion anova(model, test="F")

## Interpretation : It seems more appropriate to use the model

## with all the three predictors - log of concentration of plasma,

## lot and the interaction between the log of concentration of

## plasma and lot. One reason for the above conclusion is the

## least deviance of the last model and another reason being

## that it is significant and all the terms it includes are

## significant, which is evident from the first two models

## where each term is being added sequentially.

(35)

fits <- fitted(model) ## fitted response values dev.resids <- resid(model) ## model residuals

## Diagnostics plots

par(mfrow=c(1,3), bty="L", cex=0.7, mar=c(4.1, 4.1, 1, 1)) plot(fits, dev.resids,

xlab="fitted values", ylab="deviance residuals", ylim=c(-0.1, 0.1),pch=c(16,17),col=c(2,4)) abline(h=0, lty=2)

plot(log.plasma[lot=="1"], dev.resids[lot=="1"], pch=16,col=2, xlab="log10(plasma %)", ylab="deviances",ylim=c(-0.1, 0.1)) points(log.plasma[lot=="2"], dev.resids[lot=="2"], pch=17,col=4) abline(h=0, lty=2)

qqnorm(dev.resids, main="", ylim=c(-0.1, 0.1)) qqline(dev.resids)

## Interpretation : the overall fit seems to be fairly good.

(36)

## fucntion to calculate the hat matrix for a GLM object 'model' calculate.hat.matrix <- function (model) {

w <- model$weights X <- model.matrix(model) A <- X * sqrt(w)

A %*% solve(crossprod(A)) %*% t(A)}

H <- calculate.hat.matrix(model)

# Calculating the hat matrix for our model

plot(log.plasma[lot=="1"], diag(H)[lot=="1"], pch=16,col=2, xlab="log10(plasma %)", ylab="Diagonal elements of the Hat matrix",ylim=c(0, 1))

points(log.plasma[lot=="2"], diag(H)[lot=="2"], pch=17,col=4) legend("topright", c("Lot 1", "Lot 2"), pch=c(16,17),col=c(2,4))

(37)

n <- length(log.plasma) ## Number of cases

Di<- cooks.distance(model) ## Cook's distance for our Gamma GLM

## Plotting Cook's Distance to identify outliers

par(mfrow=c(1,1), bty="L", cex=0.75, mar=c(4.1, 4.1, 1, 1)) plot(log.plasma[lot=="1"], Di[lot=="1"], pch=16,col=2,

xlab="log10(plasma %)",ylab="Cook's Distance",ylim=c(0, 20)) points(log.plasma[lot=="2"], Di[lot=="2"], pch=17,col=4)

abline(h=qf(0.5, p, n-p), lty=2, col="green")

text(1.75, qf(0.5, p, n-p)+0.6, "50th percentile of F(p, n-p) distribution", cex=1.0, col="lightgreen")

abline(h=qf(0.95, p, n-p), lty=2, col="green")

text(1.75, qf(0.95, p, n-p)+0.6, "95th percentile of F(p, n-p) distribution", cex=1.0, col="lightgreen")

## There seems to be two outliers, one from each lot, since

## their Cook's Distance are substantially larger than the rest.

(38)

beta = model$coefficients ## regression coefficients beta

gg=summary(model)

cov=gg$cov.scaled # covariance matrix of regression coefficients cov

xd = model.matrix(model) ## the design matrix for our model xd

lp = xd %*% beta ## Creates estimated linear predictor Xd*beta vlp=xd %*%cov%*%t(xd) # Covariance matrix of linear predictor q = sqrt( diag(vlp) ) ## s.e. of linear predictors, as vector

## approximate lower and upper 95% CI for linear predictors low = lp - 1.96*q

up = lp + 1.96*q

(39)

## plot of data on inverse response scale with fit and CI

## then data on original scale but fit and CI on inverse scale par(mfrow=c(1,2))

plot(log(plasma),1/clot.time);

lines(log(plasma),lp); lines(log(plasma),up,lty=3);

lines(log(plasma),low,lty=3) plot(log(plasma),clot.time)

lines(log(plasma),1/lp); lines(log(plasma),1/up,lty=3);

lines(log(plasma),1/low,lty=3)

## The plots suggest that the fit is pretty good but it looks

## very messy, so we will fit the models seperately for each

## lot and plot them.

(40)

fit1=glm(lot1 ~ log(u), data = clotting, family = Gamma) beta = fit1$coefficients ## regression coefficients beta

gg=summary(fit1)

cov=gg$cov.scaled # covariance matrix of regression coefficients cov

## Creating design matrix for the model nlot=as.numeric(lot)

xd = cbind(rep(1,9),log(clotting$u)) xd

lp = xd %*% beta ## Creates estimated linear predictor Xd*beta vlp=xd %*%cov%*%t(xd) # Covariance matrix of linear predictor q = sqrt( diag(vlp) ) ## s.e. of linear predictors, as vector

## approximate lower and upper 95% CI for linear predictors low = lp - 1.96*q

(41)

## plot of data for Lot 1 on inverse response scale,

## then fit and CI par(mfrow=c(1,2))

plot(log(clotting$u),1/clotting$lot1);

lines(log(clotting$u),lp); lines(log(clotting$u),up,lty=3);

lines(log(clotting$u),low,lty=3) plot(log(clotting$u),clotting$lot1);

lines(log(clotting$u),1/lp); lines(log(clotting$u),1/up,lty=3);

lines(log(clotting$u),1/low,lty=3)

(42)

fit2=glm(lot2 ~ log(u), data = clotting, family = Gamma) beta = fit2$coefficients ## regression coefficients beta

gg=summary(fit2)

cov=gg$cov.scaled # covariance matrix of regression coefficients cov

## Creating design matrix for the model nlot=as.numeric(lot)

xd = cbind(rep(1,9),log(clotting$u)) xd

lp = xd %*% beta ## Creates estimated linear predictor Xd*beta vlp=xd %*%cov%*%t(xd) # Covariance matrix of linear predictor q = sqrt( diag(vlp) ) ## s.e. of linear predictors, as vector

## approximate lower and upper 95% CI for linear predictors low = lp - 1.96*q

(43)

## plot of data for Lot 2 on inverse response scale,

## then fit and CI par(mfrow=c(1,2))

plot(log(clotting$u),1/clotting$lot2);

lines(log(clotting$u),lp); lines(log(clotting$u),up,lty=3);

lines(log(clotting$u),low,lty=3) plot(log(clotting$u),clotting$lot2);

lines(log(clotting$u),1/lp); lines(log(clotting$u),1/up,lty=3);

lines(log(clotting$u),1/low,lty=3)

## The plots are lot less messy and more comprehendible as

## they have been split for the two different lots. And it is

## very evident from the plots for both the lots that a Gamma

## model with inverse link is a very good fit for this data.

References

Related documents

In this study serum beta HCG estimated in early second trimester, women with elevated levels, categorized under high risk group.. So it is easy to identify the high

The purpose of this study is to determine whether elevated Serum Beta HCG in (early) second trimester in comparison with roll over test is a better predictor

One-way analysis variance for pharmacokinetic parameters and PROC GLM (procedure general linear model) followed by Dunnett's t and t test (parametric test) and

The plot of q e versus C e for the adsorption of MB onto raw and MWSBD at room temperature according to the non-linear form of Langmuir isotherm model reveals a good

Figure lb shows a linear plot, but the Seebeck coefficient increased slightly with temperature, as against an expected constant value of ct for localized electrons

For beta*parti(;h»s of continuous energy, an exponential absorption law is approximately valid and this typi^ of exiioiUiiitial absorxition is observed down to 1% ol beta

Subsequently, multiple linear regression analysis was carried out between the obtained EC e values and S2 data, for the prediction of soil salinity models.. The relationship

Systems of linear differential equations, types of linear systems, differential operators, an operator method for solving linear systems with constant coefficients,