Chapter 9 Binary Logistic Regression

9.1 Why Logistic Regression in Marketing Analytics

Many important marketing decisions involve binary outcomes:

  • Did the customer buy or not?
  • Did the customer respond to a promotion?
  • Did the customer churn?

In these cases, the dependent variable takes on only two possible values. A standard linear regression model is not appropriate because it can produce predicted values below 0 or above 1 and does not model probabilities correctly.

Binary logistic regression is designed specifically for situations where the outcome is binary. Rather than predicting the outcome directly, logistic regression models the probability that the outcome occurs.
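
Concretely, logistic regression assumes the log-odds of the outcome are a linear function of the predictors and converts them to probabilities with the logistic (inverse-logit) function. A quick base-R illustration (logit_to_prob is a hypothetical helper, not part of any package):

```r
# The logistic (inverse-logit) function maps any log-odds value to (0, 1)
logit_to_prob <- function(log_odds) 1 / (1 + exp(-log_odds))

logit_to_prob(c(-4, 0, 4))
# Strongly negative log-odds give probabilities near 0, a log-odds of zero
# gives exactly 0.50, and strongly positive log-odds give values near 1
```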

From a marketing perspective, this is powerful: instead of simply predicting “buy” or “not buy,” we can estimate how likely a customer is to buy and then use those probabilities for targeting and decision-making.


9.2 The Direct Marketing Data

In this chapter, we use the directmktg dataset, which contains information on a direct marketing campaign.

The dataset consists of 400 prospects and the following variables:

  • userid: Unique identifier for each prospect
  • age: Prospect age in years
  • gender: Prospect gender (coded as provided)
  • salary: Prospect salary in thousands of dollars
  • buy: Purchase decision (“Yes” or “No”)

The marketing objective is:

Predict whether a prospect will purchase and estimate the probability of purchase.

data(directmktg)

9.3 Splitting Data into Training and Test Samples

To evaluate predictive performance, we split the data into training and test samples using splitsample() from the MKT4320BGSU package.

Usage:

  • splitsample(data, outcome = NULL, group = NULL, choice = NULL, alt = NULL,
    p = 0.75, seed = 4320)
  • where:
    • data is the data frame to split.
    • outcome is the outcome variable used for stratification. It is required when group is NULL, which is always the case for binary logistic regression.
    • group is NOT USED FOR BINARY LOGISTIC
    • choice is NOT USED FOR BINARY LOGISTIC
    • alt is NOT USED FOR BINARY LOGISTIC
    • p is the proportion of observations to place in the training set. Must be strictly between 0 and 1. Default is 0.75.
    • seed is the random seed for reproducibility. Default is 4320.
sp <- splitsample(directmktg, outcome = "buy")
train <- sp$train
test  <- sp$test

The training data are used to estimate the model. The test data are reserved for out-of-sample evaluation.
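
The stratified logic behind splitsample() can be sketched in base R. This is a simplified sketch with made-up data, not the package's actual implementation:

```r
# Simplified sketch of a stratified 75/25 split (illustrative data, not directmktg)
set.seed(4320)
df <- data.frame(buy = rep(c("Yes", "No"), times = c(140, 260)))
# Sample 75% of the row indices within each outcome level
idx <- unlist(lapply(split(seq_len(nrow(df)), df$buy),
                     function(i) sample(i, size = round(0.75 * length(i)))))
train_sketch <- df[idx, , drop = FALSE]
test_sketch  <- df[-idx, , drop = FALSE]
# Both samples keep the same "Yes" share (35%) as the full data
prop.table(table(train_sketch$buy))
```

Sampling within each outcome level is what keeps the purchase rate (roughly) equal across the training and test samples.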


9.4 Estimating a Binary Logistic Regression Model

We estimate a logistic regression model using glm() with family = "binomial". Remember, we want to estimate the model with the train data we just created.

mod <- glm(buy ~ age + gender + salary, data = train, family = "binomial")
summary(mod)

Call:
glm(formula = buy ~ age + gender + salary, family = "binomial", 
    data = train)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -13.166094   1.621695  -8.119 4.71e-16 ***
age            0.250215   0.032095   7.796 6.38e-15 ***
genderFemale  -0.406881   0.349811  -1.163    0.245    
salary         0.040632   0.006742   6.026 1.68e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 392.94  on 300  degrees of freedom
Residual deviance: 210.68  on 297  degrees of freedom
AIC: 218.68

Number of Fisher Scoring iterations: 6

The coefficients indicate how each variable affects the log-odds of purchase. The sign of each coefficient tells us whether the variable increases or decreases the likelihood of purchase.

For “nicer” looking results, the summ() function from the jtools package can be used.

library(jtools)
summ(mod, digits = 4)
Observations: 301
Dependent variable: buy
Type: Generalized linear model
Family: binomial
Link: logit

χ²(3) = 182.2574, p = 0.0000
Pseudo-R² (Cragg-Uhler) = 0.6231
Pseudo-R² (McFadden) = 0.4638
AIC = 218.6842
BIC = 233.5126

                     Est.    S.E.  z val.       p
(Intercept)      -13.1661  1.6217 -8.1187  0.0000
age                0.2502  0.0321  7.7961  0.0000
genderFemale      -0.4069  0.3498 -1.1631  0.2448
salary             0.0406  0.0067  6.0265  0.0000

Standard errors: MLE

9.5 Interpreting Odds Ratios

Because log-odds are difficult to interpret, we often convert coefficients to odds ratios by exponentiating them (i.e., \(e^{coeff}\), or exp(coef(mod)) in R). More conveniently, the summ() function from jtools reports odds ratios directly when given the exp = TRUE option.

summ(mod, exp = TRUE, digits = 4)
Observations: 301
Dependent variable: buy
Type: Generalized linear model
Family: binomial
Link: logit

χ²(3) = 182.2574, p = 0.0000
Pseudo-R² (Cragg-Uhler) = 0.6231
Pseudo-R² (McFadden) = 0.4638
AIC = 218.6842
BIC = 233.5126

              exp(Est.)    2.5%   97.5%  z val.       p
(Intercept)      0.0000  0.0000  0.0000 -8.1187  0.0000
age              1.2843  1.2060  1.3677  7.7961  0.0000
genderFemale     0.6657  0.3354  1.3215 -1.1631  0.2448
salary           1.0415  1.0278  1.0553  6.0265  0.0000

Standard errors: MLE

An odds ratio greater than 1 indicates higher odds of purchase as the predictor increases. An odds ratio less than 1 indicates lower odds.
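
As a worked example, take the age coefficient from the output in Section 9.4:

```r
# Odds ratio for age, computed from the reported coefficient
age_coef <- 0.250215
exp(age_coef)          # ≈ 1.284
# Each additional year of age multiplies the odds of purchase by about 1.28,
# i.e., roughly a 28% increase in the odds of buying
```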


9.6 Predicted Probabilities

Logistic regression naturally produces predicted probabilities.

train$phat <- predict(mod, type = "response")
head(train$phat)
           1            2            3            4            5            6 
0.0004805918 0.0267022049 0.0109742358 0.0048495633 0.0170625797 0.0321671256 

These probabilities are often more useful than hard classifications because they allow ranking and targeting. They are also used to evaluate model fit through the classification matrix. NOTE: You do not have to create these probabilities manually as shown above.


9.7 Classification and Cutoff Values

To classify prospects, we choose a probability cutoff (commonly 0.50). That is, a prospect is predicted to be “positive” if their predicted probability of the positive outcome is 0.50 or greater. We use the classify_logistic() function from the MKT4320BGSU package to create classification matrices.

Usage:

  • classify_logistic(MOD, DATA, POSITIVE, CUTOFF = 0.5, DATA2 = NULL,
    LABEL1 = "Sample 1", LABEL2 = "Sample 2", digits = 3, ft = FALSE)
  • where:
    • MOD is a fitted binary logistic regression glm object (i.e., family = “binomial”).
    • DATA is the data frame for which the classification matrix should be produced. Usually train.
    • POSITIVE is the level in the outcome variable representing the positive outcome.
    • CUTOFF is the probability cutoff for classification (default = 0.5).
    • DATA2 is an optional second data frame (e.g., test/holdout sample).
    • LABEL1 is the label for the first data set (default = “Sample 1”).
    • LABEL2 is the label for the second data set, if provided (default = “Sample 2”).
    • digits is the number of decimal places (default = 3) to show.
    • ft is a logical operator to return a nicer looking flextable (default = FALSE).
classify_logistic(MOD = mod, DATA = train, POSITIVE = "Yes")

Classification Matrix - Sample 1 (Cutoff = 0.50)
Accuracy = 0.844
PCC = 0.540

       No Yes Total
No    176  30   206
Yes    17  78    95
Total 193 108   301

Selected Statistics (Positive Class):
  Sensitivity (TPR): 0.722
  Specificity (TNR): 0.912
  Precision (PPV):   0.821
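
The reported statistics can be recovered by hand from the matrix, where rows are predicted classes and columns are actual classes:

```r
# Counts from the Training classification matrix above
TN <- 176; FN <- 30   # predicted "No":  176 actual No, 30 actual Yes
FP <- 17;  TP <- 78   # predicted "Yes": 17 actual No,  78 actual Yes
n  <- TN + FN + FP + TP                       # 301
accuracy    <- (TP + TN) / n                  # 0.844
sensitivity <- TP / (TP + FN)                 # 0.722 (true positive rate)
specificity <- TN / (TN + FP)                 # 0.912 (true negative rate)
precision   <- TP / (TP + FP)                 # 0.821 (positive predictive value)
# PCC (proportional chance criterion) is based on the actual class shares
p_yes <- (FN + TP) / n
pcc   <- p_yes^2 + (1 - p_yes)^2              # 0.540
```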

We can also evaluate classification performance on the test sample and get the output in a “nicer” table.

classify_logistic(MOD = mod, DATA = train, DATA2 = test, POSITIVE = "Yes",
                  LABEL1 = "Training", LABEL2 = "Test", ft=TRUE)

9.8 Choosing a Cutoff

The default cutoff of 0.50 may not be optimal. The cutoff_logistic() function helps visualize tradeoffs between sensitivity, specificity, and accuracy.

Usage:

  • cutoff_logistic(MOD, DATA, POSITIVE, LABEL1 = "Sample 1",
    DATA2 = NULL, LABEL2 = "Sample 2")
  • where:
    • MOD is a fitted binary logistic regression model (glm with family = “binomial”).
    • DATA is the data frame for which the cutoff analysis should be produced. Usually train.
    • POSITIVE is the level in the outcome variable representing the positive outcome.
    • LABEL1 is the label for the first data set (default = “Sample 1”).
    • DATA2 is an optional second data frame (e.g., test/holdout sample).
    • LABEL2 is the label for the second data set, if provided (default = “Sample 2”).
cut <- cutoff_logistic(MOD = mod, DATA = train, DATA2 = test, POSITIVE = "Yes", 
                LABEL1 = "Training", LABEL2 = "Test")
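
The tradeoff the function visualizes can be sketched in base R; the simulated vectors below are illustrative, not results from directmktg:

```r
# Sensitivity falls and specificity rises as the cutoff increases
set.seed(4320)
actual <- rbinom(200, 1, 0.35)                    # 1 = "Yes" (simulated)
phat   <- plogis(-1 + 2 * actual + rnorm(200))    # higher phat for buyers
cutoffs <- seq(0.05, 0.95, by = 0.05)
sens <- sapply(cutoffs, function(ct) mean(phat[actual == 1] >= ct))
spec <- sapply(cutoffs, function(ct) mean(phat[actual == 0] <  ct))
# A common "balanced" choice is the cutoff where the two curves cross
cutoffs[which.min(abs(sens - spec))]
```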


9.9 ROC Curve and AUC

The ROC curve evaluates model discrimination across all possible cutoffs. The Area Under the Curve (AUC) summarizes overall classification performance. To get the ROC curve, use the roc_logistic() function from the MKT4320BGSU package.

Usage:

  • roc_logistic(MOD, DATA, LABEL1 = "Sample 1",
    DATA2 = NULL, LABEL2 = "Sample 2")
  • where:
    • MOD is a fitted binary logistic regression model (glm with family = “binomial”).
    • DATA is the data frame used for the ROC computation. Usually train.
    • LABEL1 is the label for the first data set (default = “Sample 1”).
    • DATA2 is an optional second data frame (e.g., test/holdout sample) (default = NULL).
    • LABEL2 is the label for the second data set, if provided (default = “Sample 2”).
roc_logistic(MOD = mod, DATA = train, DATA2 = test, 
             LABEL1 = "Training", LABEL2 = "Test")
Setting levels: control = No, case = Yes
Setting direction: controls < cases
Setting levels: control = No, case = Yes
Setting direction: controls < cases
$sample1
(ROC curve for the Training sample)

$sample2
(ROC curve for the Test sample)

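Conceptually, the AUC is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. A base-R sketch of that rank interpretation, using simulated data rather than the directmktg results:

```r
# AUC as a rank statistic: compare every positive to every negative
set.seed(4320)
actual <- rbinom(200, 1, 0.35)                    # 1 = "Yes" (simulated)
phat   <- plogis(-1 + 2 * actual + rnorm(200))    # model with real signal
pos <- phat[actual == 1]
neg <- phat[actual == 0]
auc <- mean(outer(pos, neg, ">")) + 0.5 * mean(outer(pos, neg, "=="))
auc   # well above 0.5, the value expected from random guessing
```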

9.10 Gain and Lift Charts

Gain and lift charts are especially useful for targeting decisions. These plots (and tables) show how much better the model performs relative to random targeting. To get the gain and lift charts, use the gainlift_logistic() function from the MKT4320BGSU package.

Usage:

  • gainlift_logistic(MOD, TRAIN, TEST, POSITIVE)
  • where:
    • MOD is a fitted binary logistic regression model (glm with family = “binomial”).
    • TRAIN is a data frame with the training data (usually train).
    • TEST is a data frame with the test/holdout data (usually test).
    • POSITIVE is the level in the outcome variable representing the positive outcome.
gl <- gainlift_logistic(MOD = mod, TRAIN = train, TEST = test, POSITIVE = "Yes")
gl$gainplot

gl$gaintable
# A tibble: 20 × 3
   `% Sample` Training Holdout
        <dbl>    <dbl>   <dbl>
 1       0.05    0.130   0.114
 2       0.1     0.25    0.2  
 3       0.15    0.370   0.343
 4       0.2     0.481   0.486
 5       0.25    0.593   0.6  
 6       0.3     0.713   0.743
 7       0.35    0.778   0.771
 8       0.4     0.843   0.829
 9       0.45    0.907   0.914
10       0.5     0.935   0.971
11       0.55    0.981   1    
12       0.6     0.991   1    
13       0.65    0.991   1    
14       0.7     0.991   1    
15       0.75    1       1    
16       0.8     1       1    
17       0.85    1       1    
18       0.9     1       1    
19       0.95    1       1    
20       1       1       1    
gl$liftplot

gl$lifttable
# A tibble: 20 × 3
   `% Sample` Training Holdout
        <dbl>    <dbl>   <dbl>
 1       0.05     2.60    2.83
 2       0.1      2.51    2.2 
 3       0.15     2.48    2.42
 4       0.2      2.42    2.53
 5       0.25     2.38    2.48
 6       0.3      2.38    2.54
 7       0.35     2.23    2.25
 8       0.4      2.11    2.10
 9       0.45     2.02    2.06
10       0.5      1.88    1.96
11       0.55     1.79    1.83
12       0.6      1.66    1.68
13       0.65     1.53    1.55
14       0.7      1.42    1.43
15       0.75     1.34    1.34
16       0.8      1.25    1.25
17       0.85     1.18    1.18
18       0.9      1.11    1.11
19       0.95     1.06    1.05
20       1        1       1   
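
The mechanics behind these tables can be sketched in base R with simulated data (illustrative, not the directmktg results): rank prospects by predicted probability, then track the cumulative share of buyers captured.

```r
# Gain: cumulative share of positives captured when targeting the top q%;
# Lift: gain divided by q (lift = 1 means no better than random targeting)
set.seed(4320)
actual <- rbinom(200, 1, 0.35)                    # 1 = "Yes" (simulated)
phat   <- plogis(-1 + 2 * actual + rnorm(200))
ord  <- order(phat, decreasing = TRUE)            # best prospects first
gain <- cumsum(actual[ord]) / sum(actual)
pct  <- seq_along(gain) / length(gain)
lift <- gain / pct
head(lift)   # early lift values are well above 1 for a useful model
```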

9.11 Margin Plots with easy_mp()

Coefficients are difficult to interpret directly. Marginal effects plots help visualize how predictors influence purchase probability. As with linear regression (see Section 8.6), we use the easy_mp() function from the MKT4320BGSU package to help create the margin plots.

mp_age <- easy_mp(mod, focal = "age")
mp_age$plot
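
The logic behind a margin plot can be sketched with base R predict(). The block below uses simulated data and a hypothetical refit (sim, fit), not the model above: vary the focal variable over a grid while holding the other predictors at representative values.

```r
# Sketch with simulated data (not directmktg): fit a model, then trace the
# predicted probability across age, holding salary at its sample mean
set.seed(4320)
sim <- data.frame(age    = round(runif(300, 20, 60)),
                  salary = round(runif(300, 20, 150)))
sim$buy <- rbinom(300, 1, plogis(-13 + 0.25 * sim$age + 0.04 * sim$salary))
fit  <- glm(buy ~ age + salary, data = sim, family = "binomial")
newd <- data.frame(age = 20:60, salary = mean(sim$salary))
newd$phat <- predict(fit, newdata = newd, type = "response")
plot(newd$age, newd$phat, type = "l",
     xlab = "Age", ylab = "Predicted probability of purchase")
```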

We can also examine categorical predictors.

mp_gender <- easy_mp(mod, focal = "gender")
mp_gender$plot

We can also examine interactions.

mod_int <- glm(buy ~ age * salary + gender, data = train, family = "binomial")
mp_int <- easy_mp(mod_int, focal = "age", int = "salary")
mp_int$plot


9.12 Putting It All Together

Binary logistic regression provides a complete framework for:

  • Predicting purchase probabilities
  • Classifying prospects
  • Evaluating models using multiple metrics
  • Supporting targeting decisions

No single metric tells the whole story. Marketing analysts must choose evaluation tools that align with business objectives.


9.13 What’s Next

In the next chapter, we shift from prediction to segmentation. We will use cluster analysis to:

  • Identify distinct customer segments based on observed attributes
  • Understand how customers naturally group together
  • Support positioning, targeting, and strategy development
  • Translate data-driven segments into actionable marketing insights

Where logistic regression answers questions like “Who is likely to buy?”, cluster analysis addresses questions such as:

  • What types of customers do we have?
  • How are customers meaningfully different from one another?
  • How many segments make sense from a managerial perspective?

The next chapter introduces several clustering approaches, discusses how to choose the number of clusters, and emphasizes interpretation and managerial usefulness over purely technical solutions.