

Subject: Statistics

Paper: Regression Analysis III

Module: Logit and probit models


Development Team

Principal investigator: Dr. Bhaswati Ganguli, Professor, Department of Statistics, University of Calcutta

Paper co-ordinator: Dr. Bhaswati Ganguli, Professor, Department of Statistics, University of Calcutta

Content writers: Sayantee Jana, Graduate student, Department of Mathematics and Statistics, McMaster University; Sujit Kumar Ray, Analytics professional, Kolkata

Content reviewer: Department of Statistics, University of Calcutta



Introduction

- Q: When do we use logistic regression?

- A: When the response variable is binary and the explanatory variable(s) are numeric and/or categorical. A logistic regression model can have one or more predictor variables.

- Logistic regression is often used because the relationship between the discrete response variable and a predictor is non-linear.

- Example [1]: The probability of heart disease changes very little with a 10-point difference in blood pressure among people with low blood pressure, but the same 10-point difference can mean a drastic change in that probability among people with high blood pressure.

[1] Agresti, A. (2014). Categorical Data Analysis. John Wiley & Sons.
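As a quick illustration of this non-linearity, here is a minimal R sketch on simulated data; the variable names (bp, disease) and the coefficient values are hypothetical, chosen only to mimic the heart-disease example.

## Minimal sketch on simulated data (names and coefficients are hypothetical)
set.seed(1)
bp <- rnorm(500, mean = 130, sd = 20)            # blood pressure
disease <- rbinom(500, 1, plogis(-12 + 0.08 * bp))
fit <- glm(disease ~ bp, family = binomial)
## a 10-point difference matters little at low bp, but a lot at high bp:
predict(fit, data.frame(bp = c(100, 110, 160, 170)), type = "response")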



Assumptions of logistic regression model

- The only “real” limitation of logistic regression is that the outcome must be discrete.

- If the distributional assumptions are met, then discriminant function analysis may be more powerful, although it has been shown to overestimate the association when the predictors are discrete.

- If the outcome is continuous, then multiple regression is more powerful, provided its assumptions are met.

- Ratio of cases to variables: using discrete predictors requires enough responses in every category (see the sketch below).

- If there are too many cells with no responses, the parameter estimates and standard errors are likely to blow up.
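The cases-to-variables point can be checked before fitting; a minimal sketch, assuming a hypothetical data frame d with a binary outcome y and a categorical predictor grp:

## Cross-tabulate the outcome against each categorical predictor and
## look for empty (or nearly empty) cells before fitting the model.
set.seed(1)
d <- data.frame(y   = rbinom(200, 1, 0.3),
                grp = sample(c("A", "B", "C"), 200, replace = TRUE))
xtabs(~ y + grp, data = d)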



Assumptions Continued ...

- Too many cells can make the groups perfectly separable (e.g., when predictors are multicollinear), which makes maximum likelihood estimation impossible (see the sketch below).

- Linearity in the logit: the regression equation should have a linear relationship with the logit of the response variable. There is no assumption that the predictors are linearly related to each other.

- Absence of multicollinearity.

- No outliers.

- Independence of errors: this assumes a between-subjects design. Other forms of the model exist for within-subjects designs.
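Perfect separation is easy to demonstrate; in the sketch below (simulated data, hypothetical names), x separates the two outcome classes completely, glm() warns that fitted probabilities of 0 or 1 occurred, and the estimates and standard errors blow up.

x <- 1:6
y <- c(0, 0, 0, 1, 1, 1)                 # perfectly separated at x > 3.5
sep.fit <- glm(y ~ x, family = binomial)
summary(sep.fit)                         # huge coefficients and standard errors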



The logistic regression model

- $Y$: binary response variable

- $X = (x_1, x_2, \ldots, x_p)$: explanatory variables

- Linear Probability Model: $Y_i = X_i'\beta + u_i$, where

$$Y_i = \begin{cases} 1 & \text{if the event happens} \\ 0 & \text{if the event does not happen} \end{cases}$$

- $P_i = P[Y_i = 1 \mid X_i]$, with $0 \le P_i \le 1$

- $E(Y_i \mid X_i) = 1 \cdot P_i + 0 \cdot (1 - P_i) = P_i$

- Setting $P_i = X_i'\beta$ does not guarantee $0 \le P_i \le 1$ and hence violates the laws of probability.
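The last point can be seen directly; in this minimal sketch (simulated data, hypothetical names) the fitted values of a linear probability model fall outside [0, 1].

set.seed(2)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(2 * x))
lpm <- lm(y ~ x)                         # linear probability model
range(fitted(lpm))                       # typically extends below 0 and above 1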



Utility function

- We define $U_i = X_i'\beta$.

- Latent variable: $U \sim F(\cdot)$.

- $F(\cdot)$ is likely to be a symmetric bell-shaped distribution.

$$Y_i = \begin{cases} 1 & \text{if } U_i \ge U \\ 0 & \text{if } U_i < U \end{cases}$$

$$P_i = P[Y_i = 1 \mid X_i] = P[U_i \ge U \mid X_i] = P[U \le X_i'\beta \mid X_i] = F(X_i'\beta)$$
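The latent-utility formulation can be simulated directly; a minimal sketch (hypothetical names), taking the threshold $U = 0$ and standard normal errors so that a probit fit recovers $\beta$:

set.seed(3)
x <- rnorm(500)
u <- 1.5 * x + rnorm(500)                          # latent utility: x*beta + error
y <- as.integer(u >= 0)                            # only the sign of u is observed
probit.fit <- glm(y ~ x, family = binomial(link = "probit"))
coef(probit.fit)                                   # slope estimate should be near 1.5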


Logit model

- $F = \Phi$, i.e. $P_i = \Phi(X_i'\beta)$: probit model.

- $F$ = logistic distribution, i.e.

$$P_i = F(X_i'\beta) = \frac{1}{1 + e^{-X_i'\beta}}$$

- $\log\left(\dfrac{P_i}{1 - P_i}\right) = X_i'\beta$: logit model.

- Merits of the logit model:

  - It models the log of the odds of an event.

  - $F(\cdot)$ is symmetric.
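In R, the two models differ only in the link argument; a minimal sketch on simulated data (hypothetical names):

set.seed(4)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(1.2 * x))
logit.fit  <- glm(y ~ x, family = binomial(link = "logit"))
probit.fit <- glm(y ~ x, family = binomial(link = "probit"))
## the fitted curves are very similar; logit coefficients are roughly
## 1.6-1.8 times the probit coefficients because of the different scales
cbind(logit = coef(logit.fit), probit = coef(probit.fit))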



Calculating Odds Ratio

Calculating the odds ratio from a contingency table: let us consider a 2 x 2 table of counts.

Table I: Response variable Y vs. predictor variable X

Predictor variable | Diseased | Non-diseased
-------------------|----------|-------------
Exposed            | a        | b
Unexposed          | c        | d

$$OR = \frac{a \times d}{b \times c}$$
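For concreteness, a numeric sketch with hypothetical counts a = 20, b = 10, c = 5, d = 25:

ct <- matrix(c(20, 10,
                5, 25),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("Exposed", "Unexposed"),
                             c("Diseased", "Non-diseased")))
(ct[1, 1] * ct[2, 2]) / (ct[1, 2] * ct[2, 1])   # OR = (a*d)/(b*c) = 10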



Log of odds

- Let $\pi_0 = P(E \mid D)$ and $\pi_1 = P(E \mid ND)$: the probabilities of exposure among the diseased and the non-diseased, respectively.

- $\log\left(\dfrac{\pi}{1 - \pi}\right) = \text{intercept} + \text{disease status} \times \beta$

- disease status = 1 if diseased; disease status = 0 if non-diseased.

- Log odds$(\pi_0)$ = intercept $+\ \beta$; log odds$(\pi_1)$ = intercept.
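Continuing with the hypothetical counts above (a = 20, b = 10, c = 5, d = 25), the two log-odds can be computed directly, and their difference is $\beta$:

pi0 <- 20 / (20 + 5)       # P(E | D)  = a / (a + c)
pi1 <- 10 / (10 + 25)      # P(E | ND) = b / (b + d)
log(pi0 / (1 - pi0))       # intercept + beta
log(pi1 / (1 - pi1))       # intercept
log(pi0 / (1 - pi0)) - log(pi1 / (1 - pi1))   # beta = log(OR) = log(10)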



Calculating OR from a logit model

- Log of the odds ratio comparing exposed to unexposed
  = log odds for the exposed $-$ log odds for the unexposed
  = (intercept $+ \beta \times 1$) $-$ (intercept $+ \beta \times 0$) = $\beta$.

- $H_0: \beta = 0$ is equivalent to $\log(OR) = 0$, i.e. $OR = 1$.

- Thus, when we have multiple covariates and multiple ORs to calculate and test for association between each predictor and the response variable, it is more convenient to test each parameter in a logit model than to calculate each OR from the corresponding contingency table.

- This convenience also applies when any of the predictor variables has multiple categories.
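The equivalence is easy to verify in R with the same hypothetical counts; since the odds ratio is symmetric, fitting disease given exposure recovers the same OR = 10:

counts <- data.frame(diseased = c(20, 5), nondiseased = c(10, 25), exposed = c(1, 0))
fit <- glm(cbind(diseased, nondiseased) ~ exposed, family = binomial, data = counts)
exp(coef(fit)["exposed"])          # 10, identical to (a*d)/(b*c) from the table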



Summary

- In a logistic model we require the response variable to be binary.

- The predictor variables can be discrete or continuous.

- In logistic regression we model the log of the odds of an event.

- Hence a logistic regression model is easy to interpret.

- Testing a parameter of a logit model is equivalent to testing the corresponding odds ratio.

- When we have multiple predictor variables, or multiple categories of a predictor variable, it is more convenient to test the parameters of a logit model than to calculate many odds ratios from contingency tables and test each of them.


Example 1 [2]: sample R script to complement this module

mydata = read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")

## view the first few rows of the data
head(mydata)

## logistic regression of admission on (categorical) rank
mylogit = glm(admit ~ as.factor(rank), data = mydata, family = "binomial")
summary(mylogit)
mylogit$coefficients

## contingency table of admission vs. rank
ct = xtabs(~ admit + rank, data = mydata)

# install.packages("epitools")
require(epitools)                  # contains the function oddsratio
oddsratio(ct)$measure

## odds ratios from the logit model, for comparison
exp(mylogit$coefficients)

[2] Source of data: http://www.ats.ucla.edu/stat/r/dae/logit.htm


Example 2 [3]

# install.packages("coin")
require(coin)                      ## contains the dataset alzheimer
data(alzheimer)
head(alzheimer)

## graphical display
layout(matrix(1:2, ncol = 2))
spineplot(disease ~ smoking, data = alzheimer,
          subset = gender == "Male", main = "Male")
spineplot(disease ~ smoking, data = alzheimer,
          subset = gender == "Female", main = "Female")

## logistic regression of disease on gender and smoking
mod2 = glm(as.factor(disease) ~ as.factor(gender) + as.factor(smoking),
           data = alzheimer, family = binomial(link = logit))
summary(mod2)
mod2$coefficients

## three-way contingency table
ct = xtabs(~ disease + gender + smoking, data = alzheimer)
ct

[3] Help file for the dataset alzheimer in the coin package, R documentation.
