Chapter 4 Logistic Regression (Binary)

Sources for this chapter:

Data for this chapter:

  • The directmktg data is used from the MKT4320BGSU course package. Load the package and use the data() function to load the data.

    # Load the course package
    library(MKT4320BGSU)
    # Load the data
    data(directmktg)

4.1 Introduction

Base R is typically sufficient for performing the basics of logisitic regression, but to get some of the outputs required, I have created some user defined functions that are available in the course package.

  • or_table.R() produces Odds Ratio Coefficients
  • logreg_cm() produces the Classification Matrix
  • logreg_roc() produces the ROC Curves
  • gainlift() produces Gain/Lift tables and charts
  • logreg_cut() produces the Sensitivity/Specificity Plot

Video Tutorial: Logistic Regression - Introduction

4.2 The glm() Function

  • glm stands for Generalized Linear Model, and this function can be used with a variety of dependent variables by specifying different family="FAMILY" options *lm(dv ~ iv1 + iv2, data) is the same as
    `glm(dv ~ iv1 + iv2, data, family=“gaussian”)
  • Usage: glm(formula, data, family)

4.3 Logistic Regression using glm()

  • Binary logistic regression is performed using the glm() function when the family="binomial" is specified.

  • Usage: glm(formula, data, family="binonmial")

    • family="binomial" tells R to use logistic regression on a binary dependent variable
  • If glm(formula, data, family="binomial") is run by itself, R only outputs the **Logit formulation* coefficients (and some other measures)

    glm(buy ~ age + gender, directmktg, family="binomial")
    
    Call:  glm(formula = buy ~ age + gender, family = "binomial", data = directmktg)
    
    Coefficients:
     (Intercept)           age  genderFemale  
        -8.02069       0.18954      -0.09468  
    
    Degrees of Freedom: 399 Total (i.e. Null);  397 Residual
    Null Deviance:      521.6 
    Residual Deviance: 336.1    AIC: 342.1

    Table 4.1: Logit Coefficients from glm() call

Video Tutorial: glm() Basic Usage

  • However, if the results of the glm() call are assigned to an object, the Base R summary() function can be used to get much more detailed output, but requires manual calculation of McFadden’s Pseudo-\(R^2\)

    prelim <- glm(buy ~ age + gender, directmktg, family="binomial")
    summary(prelim)
    
    Call:
    glm(formula = buy ~ age + gender, family = "binomial", data = directmktg)
    
    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  -8.02069    0.78700 -10.191   <2e-16 ***
    age           0.18954    0.01926   9.842   <2e-16 ***
    genderFemale -0.09468    0.27354  -0.346    0.729    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 521.57  on 399  degrees of freedom
    Residual deviance: 336.14  on 397  degrees of freedom
    AIC: 342.14
    
    Number of Fisher Scoring iterations: 5
    # Manually calculate McFadden's Pseudo R-sq
    Mrsq <- 1-prelim$deviance/prelim$null.deviance
    cat("McFadden's Pseudo-Rsquared = ", Mrsq)
    McFadden's Pseudo-Rsquared =  0.3555239

    Table 4.2: Summary results from glm() call

Video Tutorial: glm() Output

  • For “nicer” looking results, the package jtools can be used

    library(jtools)
    summ(prelim,   # Saved object from before
         digits=4, # How many digits to display in each column
         model.info = FALSE)  # Suppress extraneous information

    χ²(2)

    185.4317

    Pseudo-R² (Cragg-Uhler)

    0.5092

    Pseudo-R² (McFadden)

    0.3555

    AIC

    342.1413

    BIC

    354.1157

    Est.

    S.E.

    z val.

    p

    (Intercept)

    -8.0207

    0.7870

    -10.1915

    0.0000

    age

    0.1895

    0.0193

    9.8422

    0.0000

    genderFemale

    -0.0947

    0.2735

    -0.3461

    0.7293

    Standard errors: MLE

    Table 4.3: Summary results from glm() using jtools package

Video Tutorial: glm() Output Using the jtools Package

4.3.1 Odds Ratio Coefficients

  • To get the Odds Ratio coefficients, use the or_table() function from the MKT4320BGSU course package.

  • Usage: or_table(MODEL, DIGITS, CONF)

    • MODEL is the name of the object with results of the glm() call
    • DIGITS is the number of digits to round the values to in the table
    • CONF is the confidence interval level (e.g., 95 = 95%)
    or_table(prelim,  # Saved logistic regression model from above
                 4,      # Number of digits to round output to
                 95)    # Level of confidence
                    Parameter OR Est      p   2.5%  97.5%
    (Intercept)   (Intercept) 0.0003 0.0000 0.0001 0.0015
    age                   age 1.2087 0.0000 1.1639 1.2552
    genderFemale genderFemale 0.9097 0.7293 0.5322 1.5550

    Table 4.4: Odds Ratio Coefficients

  • Odds Ratio coefficients can also be obtained from the summ function of the jtools package

    summ(prelim,   # Saved object from before
         digits=4, # How many digits to display in each column
         exp = TRUE,  # Obtain exponentiated coefficients (i.e., Odds Ratio)
         model.info = FALSE)  # Suppress extraneous information

    χ²(2)

    185.4317

    Pseudo-R² (Cragg-Uhler)

    0.5092

    Pseudo-R² (McFadden)

    0.3555

    AIC

    342.1413

    BIC

    354.1157

    exp(Est.)

    2.5%

    97.5%

    z val.

    p

    (Intercept)

    0.0003

    0.0001

    0.0015

    -10.1915

    0.0000

    age

    1.2087

    1.1639

    1.2552

    9.8422

    0.0000

    genderFemale

    0.9097

    0.5322

    1.5550

    -0.3461

    0.7293

    Standard errors: MLE

    Table 4.5: Odds Ratio Coefficients using jtools package

Video Tutorial: glm() Odds Ratio Coefficients

4.4 Estimate with Training Sample Only

Generally, it is good practice to use a training sample and a testing/holdout sample to see how well the model performs “out of sample”. To do this, the data must be split into two groups: train and test

4.4.1 Creating train and test Samples

  • While this can be done in a number of ways, the package caret has a very useful function that attempts to balance the percent of “positive” cases in both samples, while still creating random samples
  • After running this procedure, two new data frames will be created
    • One containing the training data, train
    • One containing the testing/holdout data, test
  • Two methods are available: using the splitsample function from the MKT4320BGSU package or manually using caret
    • To use the splitsample function
      • Usage: splitsample(data, outcome, p=0.75, seed=4320)
        • data is the name of the dataframe to be used to estimate the model
        • outcome is the name of the outcome variable (i.e., dependent variable) for the model to be estimated
        • p is the percent of the data to be assigned to the training sample; defaults to 0.75
        • seed is an integer to be used for random number generation; defaults to 4320
      • Requires the caret package
      • Will create two new dataframes in your environment: train and test
    splitsample(directmktg, buy)
    • To use the caret package manually, follow the code below
    # Use 'caret' package to create training and test/holdout samples
    # This will create two separate dataframes: train and test
    
    library(caret)    # Load the 'caret package'
    
    # Start by setting a random number seed so that the same samples
    # will be created each time
    set.seed(4320)
    
    # Create a dataframe of row numbers that should be in the 'train' sample
    inTrain <- createDataPartition(y=directmktg$buy, # Outcome variable
                                   p=.75, # Percent of sample to be in 'train'
                                   list=FALSE)  # Put result in matrix
    
    # Select only rows from data that are in 'inTrain'; assign to 'train'
    train <- directmktg[inTrain,]
    # Select only rows from data that not in 'inTrain'; assign to 'test'
    test <- directmktg[-inTrain,]

Video Tutorial: Creating Training and Test Samples

4.4.2 Run Model with only Training Sample

  • Once completed, run the model on the training sample only
    • Goodness-of-Fit measures can be obtained from the results
    model <- glm(buy ~ age + gender, train, family="binomial")
    summ(model,   # 'jtools' results; use 'summary' for Base R
         4, model.info=FALSE)
    χ²(2) 130.57
    p 0.00
    Pseudo-R² (Cragg-Uhler) 0.48
    Pseudo-R² (McFadden) 0.33
    AIC 268.37
    BIC 279.49
    Est. S.E. z val. p
    (Intercept) -7.71 0.87 -8.81 0.00
    age 0.18 0.02 8.52 0.00
    genderFemale -0.01 0.31 -0.04 0.97
    Standard errors: MLE; Continuous predictors are mean-centered and scaled by 1 s.d. The outcome variable remains in its original units.
    or_table(model)
                    Parameter OR Est     p   2.5%  97.5%
    (Intercept)   (Intercept) 0.0004 0.000 0.0001 0.0025
    age                   age 1.1996 0.000 1.1505 1.2509
    genderFemale genderFemale 0.9873 0.967 0.5399 1.8056
    Table 4.6: Logistic Regression Results for Training Sample

Video Tutorial: Model Fit - Goodness-of-Fit Measures

4.5 Classification Matrix

  • To get the Classification Matrix, use the logreg_cm() function from the MKT4320BGSU course package.

  • Usage: logreg_cm(MOD, DATA, "POS", CUTOFF=)

    • MOD is the name of the object with results of the glm() call
    • DATA is the data set for which the Classification Matrix should be produced (i.e., original, training, or testing)
    • "POS" is the level of the factor variable that is the SUCCESS level
    • CUTOFF= is the cutoff threshold; default is 0.5
  • This function requires the package caret, which should already be loaded from the splitting of the data

    # Classification Matrix for training sample
    logreg_cm(model,   # Name of the results object
              train,   # Use training sample
              "Yes")   # Level of 'buy' that represents success/true
    Confusion Matrix and Statistics
    
              Reference
    Prediction  No Yes
           No  179  37
           Yes  14  71
    
                   Accuracy : 0.8306          
                     95% CI : (0.7833, 0.8712)
        No Information Rate : 0.6412          
        P-Value [Acc > NIR] : 3.179e-13       
    
                      Kappa : 0.6136          
    
     Mcnemar's Test P-Value : 0.002066        
    
                Sensitivity : 0.6574          
                Specificity : 0.9275          
             Pos Pred Value : 0.8353          
             Neg Pred Value : 0.8287          
                 Prevalence : 0.3588          
             Detection Rate : 0.2359          
       Detection Prevalence : 0.2824          
          Balanced Accuracy : 0.7924          
    
           'Positive' Class : Yes             
    
    PCC = 53.99%

    Table 4.7: Classification Matrix for Training Sample

Video Tutorial: Model Fit - Classification Matrix

4.6 ROC Curve

  • To get the ROC curve, use the user defined function ``logreg_roc() function from the MKT4320BGSU course package.

  • Usage: logreg_roc(MOD, DATA)

    • MOD is the name of the object with results of the glm() call
    • DATA is the data set for which the ROC should be produced (i.e., original, training, or testing)
  • This function requires the package pROC, which needs to be loaded

    # Load 'pROC' package
    library(pROC)
    
    # ROC Curve for the training sample
    logreg_roc(model,   # Name of the results object
               train)   # Use training sample
    ROC Curve for Training Sample

    Figure 4.1: ROC Curve for Training Sample

Video Tutorial: Model Fit - ROC Curve

4.7 Gain/Lift Charts and Tables

  • To get the Gain/Lift charts and tables , use the gainlift() function from the MKT4320BGSU course package.

  • Usage: OBJ.NAME <- gainlift(MOD, TRAIN, TEST, "POS")

    • OBJ.NAME is the name of the object to assign the results to
    • MOD is the name of the object with results of the glm() call
    • TRAIN is the name of the data frame containing the training sample
    • TEST is the name of the data frame containing the test/holdout sample
    • "POS" is the level of the factor variable that is the SUCCESS level
  • This function requires the ggplot2, dplyr, and tidyr packages, which need to be loaded first

  • This user defined function returns a list of four objects

    • To use, assign the result to an object name, and then call one of the four objects returned from the function
     # Load necessary packages (if not already loaded)
    library(ggplot2)
    library(dplyr)
    library(tidyr)
    
    # Call the function and assign to object named 'glresults'
    glresults <- gainlift(model,  # Name of the glm results object
                          train,  # Name of the training data frame
                          test,   # Name of the testing data frame
                          "Yes")  # Level that represents success/true
    • OBJ.NAME$gaintable returns the Gain table
    glresults$gaintable
    # A tibble: 20 × 3
       `% Sample` Holdout Training
            <dbl>   <dbl>    <dbl>
     1       0.05   0.114    0.130
     2       0.1    0.229    0.269
     3       0.15   0.371    0.370
     4       0.2    0.457    0.5  
     5       0.25   0.6      0.602
     6       0.3    0.714    0.667
     7       0.35   0.8      0.704
     8       0.4    0.8      0.731
     9       0.45   0.829    0.769
    10       0.5    0.886    0.815
    11       0.55   0.914    0.861
    12       0.6    0.943    0.870
    13       0.65   0.971    0.889
    14       0.7    1        0.944
    15       0.75   1        0.954
    16       0.8    1        0.972
    17       0.85   1        0.991
    18       0.9    1        1    
    19       0.95   1        1    
    20       1      1        1    

    Table 4.8: Gain Table

    • OBJ.NAME$gainplot returns the Gain plot
    glresults$gainplot
    Gain Plot

    Figure 4.2: Gain Plot

    • OBJ.NAME$lifttable returns the Lift table
    glresults$lifttable
    # A tibble: 20 × 3
       `% Sample` Holdout Training
            <dbl>   <dbl>    <dbl>
     1       0.05    2.83     2.60
     2       0.1     2.51     2.69
     3       0.15    2.63     2.48
     4       0.2     2.38     2.51
     5       0.25    2.48     2.42
     6       0.3     2.44     2.23
     7       0.35    2.33     2.02
     8       0.4     2.03     1.83
     9       0.45    1.86     1.71
    10       0.5     1.79     1.64
    11       0.55    1.68     1.57
    12       0.6     1.58     1.46
    13       0.65    1.50     1.37
    14       0.7     1.43     1.35
    15       0.75    1.34     1.28
    16       0.8     1.25     1.22
    17       0.85    1.18     1.17
    18       0.9     1.11     1.11
    19       0.95    1.05     1.06
    20       1       1        1   

    Table 4.9: Lift Table

    • OBJ.NAME$liftplot returns the Lift plot
    glresults$liftplot
    Lift Plot

    Figure 4.3: Lift Plot

Video Tutorial: Model Performance - Gain and Lift

4.8 Sensitivity/Specificity Plot

  • To get the Sensitivity/Specificity plot, use the logreg_cut() function from the MKT4320BGSU course package.

  • Usage: logreg_cut(MOD, DATA, "POS")

    • MOD is the name of the object with results of the glm() call
    • DATA is the data set for which the plot should be produced (i.e., original, training, or testing)
    • "POS" is the level of the factor variable that is the SUCCESS level
  • This function requires the ggplot2 package, which needs to be loaded first

    # Load necessary packages
    library(ggplot2)
    
    # Sensitivity/Specificity Plot for Training Sample
    logreg_cut(model,  # Name of the results object
              train,   # Use training sample
              "Yes")   # Level of 'buy' that represents success/true
    Sensitivity/Specificity Plot for Training Sample

    Figure 4.4: Sensitivity/Specificity Plot for Training Sample

Video Tutorial: Logistic Regression - Sensitivity/Specificity Plots

4.9 Margin Plots

  • Margin plots are done in much the same way as with linear regression margin plots

    # Create plot for 'age' and assign to 'p1'
    p1 <- mymp(model, "age")$plot
    
    # Create plot for 'gender' and assign to 'p2'
    p2 <- mymp(model, "gender")$plot
    
    #Use 'cowplot' to put into single plot
    cowplot::plot_grid(p1,p2)
    Margin plots for variables in model

    Figure 4.5: Margin plots for variables in model

Video Tutorial: Logistic Regression - Margin Plots