Chapter 8 A/B Testing and Uplift Modeling

Data for this chapter:

  • This chapter uses the email.camp.w data from the MKT4320BGSU course package. Load the package, then use the data() function to load the data.

    # Load the course package
    library(MKT4320BGSU)
    # Load the data
    data(emailcampw)  # Note: the dataframe is called email.camp.w

8.1 Introduction

While A/B testing and uplift modeling can be performed mostly with base R functions, several user-defined functions in the MKT4320BGSU package make the process more streamlined and consistent.

8.2 Randomization check

  • To perform a randomization check for the treatment and control groups in an A/B test, use the rcheck function.
  • This function checks whether the characteristics/covariates used for uplift modeling are balanced across the test and control groups, as expected when the treatment was randomly assigned.
  • To use the function, we must pass it a dataframe containing the covariates we want to use to check randomization. We must also provide the name of the treatment variable, and the name(s) of the outcome variable(s) if they are included in the dataframe.
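  • A minimal base R sketch of the idea behind a randomization check is given here. It is not the rcheck implementation: the data are simulated (the sim dataframe and its values are hypothetical), the scaled mean difference is assumed to be the difference in group means divided by the overall standard deviation, and the p-value is assumed to come from a two-sample t-test.

```r
# Simulated stand-in data (hypothetical; not the email.camp.w data)
set.seed(42)
n <- 2000
sim <- data.frame(
  promotion = rbinom(n, 1, 0.5),                    # randomly assigned treatment flag
  recency   = sample(1:12, n, replace = TRUE),
  history   = round(rexp(n, 1 / 240), 2),
  zip       = factor(sample(c("Rural", "Suburban", "Urban"), n, replace = TRUE))
)

# Expand the factor into 0/1 indicator columns, one per level
zip_dummies <- model.matrix(~ zip - 1, data = sim)
covars <- cbind(sim[, c("recency", "history")], zip_dummies)

# For each covariate: group means, overall SD, scaled difference, t-test p-value
balance <- do.call(rbind, lapply(names(covars), function(v) {
  x  <- covars[[v]]
  tr <- x[sim$promotion == 1]
  ct <- x[sim$promotion == 0]
  data.frame(
    variable        = v,
    treatment_mean  = mean(tr),
    control_mean    = mean(ct),
    sd              = sd(x),
    scale_mean_diff = (mean(tr) - mean(ct)) / sd(x),  # assumed definition
    p_val           = t.test(tr, ct)$p.value          # assumed test
  )
}))
print(balance, digits = 3)
```

    With random assignment, the scaled mean differences should be close to zero and most p-values should be large.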

8.2.1 Using the rcheck function

  • Requires the following packages:
    • fastDummies
    • htmlTable (if option nice="ht" is used)
    • flextable (if option nice="ft" is used)
  • Usage: rcheck(data, treatment, outcome=NULL, nice=c("no","ft", "ht")) where:
    • data is the name of the dataframe containing the treatment variable and the covariates.
    • treatment is the variable name identifying the treatment variable. Must be in quotations.
    • outcome is the name (or a vector of names) of the outcome variable(s). Default value is NULL. Must be in quotations.
    • nice is the format for the output; can be:
      • "no" for standard output
      • "ft" for output using the flextable package
      • "ht" for output using the htmlTable package
  • Returns: A table containing the results of the randomization check

8.2.1.1 Examples

  • Example 1: Standard output

    rcheck(email.camp.w, "promotion", c("visit", "spend"), nice="no")
                                     variable treatment_mean control_mean      sd
    recency                           recency          5.810        5.725   3.504
    history                           history        245.995      242.539 253.384
    womens                             womens          0.545        0.539   0.498
    newbie                             newbie          0.497        0.493   0.500
    zip_Rural                       zip_Rural          0.143        0.148   0.353
    zip_Surburban               zip_Surburban          0.459        0.445   0.498
    zip_Urban                       zip_Urban          0.398        0.406   0.490
    channel_Multichannel channel_Multichannel          0.122        0.120   0.326
    channel_Phone               channel_Phone          0.436        0.439   0.496
    channel_Web                   channel_Web          0.442        0.441   0.497
                         scale_mean_diff p_val
    recency                        0.024 0.227
    history                        0.014 0.495
    womens                         0.011 0.574
    newbie                         0.007 0.719
    zip_Rural                     -0.014 0.497
    zip_Surburban                  0.027 0.185
    zip_Urban                     -0.017 0.403
    channel_Multichannel           0.006 0.782
    channel_Phone                 -0.005 0.809
    channel_Web                    0.001 0.968
  • Example 2: flextable output

    rcheck(email.camp.w, "promotion", c("visit", "spend"), nice="ft")

                                     Mean
    Variable              Treatment   Control        SD   Scaled Mean Difference   p-value
    recency                   5.810     5.725     3.504                    0.024     0.227
    history                 245.995   242.539   253.384                    0.014     0.495
    womens                    0.545     0.539     0.498                    0.011     0.574
    newbie                    0.497     0.493     0.500                    0.007     0.719
    zip_Rural                 0.143     0.148     0.353                   -0.014     0.497
    zip_Surburban             0.459     0.445     0.498                    0.027     0.185
    zip_Urban                 0.398     0.406     0.490                   -0.017     0.403
    channel_Multichannel      0.122     0.120     0.326                    0.006     0.782
    channel_Phone             0.436     0.439     0.496                   -0.005     0.809
    channel_Web               0.442     0.441     0.497                    0.001     0.968

8.3 Average Treatment Effect

  • To examine the average treatment effect both without control variables and with control variables that account for observed heterogeneity, use the abate function.
  • This function uses linear regression to calculate the average treatment effect both without controls and with controls. The function returns a flextable object.
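  • The comparison described above can be sketched with base R alone. This is an illustration under stated assumptions, not the abate implementation: the data are simulated (the sim dataframe and the true effect of 0.5 are hypothetical), and the average treatment effect is read off as the coefficient on the treatment variable in each lm fit.

```r
# Simulated stand-in data (hypothetical; not the email.camp.w data)
set.seed(1)
n <- 5000
sim <- data.frame(
  promotion = rbinom(n, 1, 0.5),
  recency   = sample(1:12, n, replace = TRUE)
)
# The true treatment effect on the simulated outcome is 0.5
sim$spend <- 2 + 0.5 * sim$promotion - 0.1 * sim$recency + rnorm(n)

# Treatment variable only, then treatment variable plus a control
m_without <- lm(spend ~ promotion, data = sim)
m_with    <- lm(spend ~ promotion + recency, data = sim)

# The coefficient on the treatment variable is the estimated ATE
ate <- c(without_controls = unname(coef(m_without)["promotion"]),
         with_controls    = unname(coef(m_with)["promotion"]))
print(round(ate, 3))
```

    Without controls, the coefficient equals the simple difference in mean outcome between the treatment and control groups. Under random assignment, adding controls mainly tightens the estimate rather than shifting it.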

8.3.1 Using the abate function

  • Requires the following packages:
    • dplyr
    • gtsummary
    • flextable
  • Usage: abate(model, treatment) where:
    • model is an existing linear regression (lm) object containing all control variables and the treatment variable. The treatment variable should appear as the first independent variable.
    • treatment is the variable name identifying the treatment variable. Must be in quotations.
  • Returns: A flextable object containing the results.

8.3.1.1 Examples

  • Example:

    # Create the 'lm' models
    ate.visit <- lm(visit ~ promotion + recency + history + zip + womens,
                    data=email.camp.w)
    ate.spend <- lm(spend ~ promotion + recency + history + zip + womens,
                    data=email.camp.w)
    
    # Use the function
    abate(ate.visit, "promotion")

                         Without Controls         With Controls
    Characteristic         Beta    p-value         Beta    p-value
    (Intercept)           0.106     <0.001        0.151     <0.001
    promotion             0.049     <0.001        0.050     <0.001
    recency                                      -0.006     <0.001
    history                                       0.000     <0.001
    zip
      Rural
      Surburban                                  -0.053     <0.001
      Urban                                      -0.065     <0.001
    womens                                        0.046     <0.001

    p-value: <0.001, <0.001, 0.005, 0.024

    abate(ate.spend, "promotion")

                         Without Controls         With Controls
    Characteristic         Beta    p-value         Beta    p-value
    (Intercept)           0.651     <0.001        1.265      0.011
    promotion             0.436      0.108        0.450      0.097
    recency                                      -0.081      0.042
    history                                       0.000      0.703
    zip
      Rural
      Surburban                                  -0.596      0.144
      Urban                                       0.098      0.814
    womens                                        0.049      0.858

    p-value: 0.11, 0.032, 0.000, 0.001

8.4 Uplift Modeling using Regression

  • To perform uplift modeling using regression, use the reguplift function. This function performs uplift modeling based on either logistic regression (for binary outcomes) or linear regression (for continuous outcomes). The function uses the two-model, indirect modeling approach.

  • In order to use the function, we must first create our base model.

    • The base model is usually a model with no interactions included, along with the treatment variable.
    • If known interactions are to be used, the base model can include the interactions also.
    • The base model must contain the treatment variable as the first independent variable.
  • Base model examples:

    # Base model for binary outcome variable
    email.visit <- glm(visit ~ promotion + recency + history + zip + womens,
                       data=email.camp.w, family="binomial")
    
    # Base model for continuous outcome variable
    email.spend <- lm(spend ~ promotion + recency + history + zip + womens,
                      data=email.camp.w)
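  • The two-model, indirect approach mentioned above can be sketched in base R. This is a sketch under assumptions, not the reguplift implementation: the data are simulated (the sim dataframe and its coefficients are hypothetical), one model is fit on the treated rows and one on the control rows, and lift is taken as the difference between the two predictions for every row.

```r
# Simulated stand-in data (hypothetical; not the email.camp.w data)
set.seed(7)
n <- 10000
sim <- data.frame(
  promotion = rbinom(n, 1, 0.5),
  recency   = sample(1:12, n, replace = TRUE),
  womens    = rbinom(n, 1, 0.5)
)
# Simulated visit probability: the promotion helps more when womens == 1
p <- plogis(-2 + 0.2 * sim$promotion + 0.8 * sim$promotion * sim$womens -
            0.05 * sim$recency)
sim$visit <- rbinom(n, 1, p)

# Fit one model per group, leaving the treatment variable out of each
m_treat <- glm(visit ~ recency + womens, family = "binomial",
               data = subset(sim, promotion == 1))
m_ctrl  <- glm(visit ~ recency + womens, family = "binomial",
               data = subset(sim, promotion == 0))

# Predict both outcomes for every row; lift is their difference
sim$out_treat <- predict(m_treat, newdata = sim, type = "response")
sim$out_ctrl  <- predict(m_ctrl,  newdata = sim, type = "response")
sim$lift      <- sim$out_treat - sim$out_ctrl

# Average predicted lift by segment
by_w <- aggregate(lift ~ womens, data = sim, FUN = mean)
print(by_w)
```

    Because each row gets a predicted outcome under both treatment and control, the lift can be computed for everyone, including rows that only appeared in one group.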

8.4.1 Using the reguplift function

  • Requires the following packages:
    • ggplot2
    • gtsummary (if option ct="Y" is used)
    • flextable (if option ct="Y" is used or if option int="Y" is used)
  • Usage: reguplift(model, treatment, pdata=NULL,ng=10, ar=NULL, int="N", ct="N") where:
    • model is a saved logistic or linear regression model object. The model must have been fit with the treatment variable as the first term on the right-hand side of the model formula, followed by all independent variables. For option int="Y", the original model should not include any interaction terms.
    • treatment is the variable name identifying the treatment variable. Must be in quotations.
    • pdata is the data upon which to calculate the lift. Default is NULL, in which case the lift will be calculated using the original model data.
    • ng is the number of groups to split the data for the group output table and the plots. Must be an integer between 5 and 20. Default is 10.
    • ar is the aspect ratio for the plots. Default is NULL.
    • int indicates whether an interaction check between independent variables is desired (int="Y") or not (int="N"). Default is "N".
    • ct indicates whether comparison tables between treatment levels are desired (ct="Y") or not (ct="N"). Default is "N". Rarely used.
  • Returns: A list containing the following objects.
    • $group is a table of lift results by ordered group based on ng
    • $all is the original model data or pdata (if provided) with lift values appended.
    • $plots is a list containing three plots:
      • $qini is a Qini plot containing a Qini coefficient
      • $uplift is a mean uplift plot by ordered group
      • $c.gain is a cumulative gain plot by ordered group
    • $int is an interaction table showing significant potential interactions.
    • $ct is a comparison table between treatment levels.
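  • To make the idea of an ordered group table concrete, here is a rough base R sketch. It is not the code behind $group: the scored dataframe is simulated and hypothetical, the column names are illustrative, and the grouping rule (rank by predicted lift, then cut into ng equal-sized groups) is an assumption about how such a table could be built.

```r
# Simulated scored data (hypothetical stand-in for model output with lift values)
set.seed(11)
n  <- 20000
ng <- 10
scored <- data.frame(
  promotion = rbinom(n, 1, 0.5),
  lift      = runif(n, 0, 0.2)          # stand-in for predicted lift
)
# Simulated outcome: the true effect of the promotion equals the predicted lift
scored$visit <- rbinom(n, 1, 0.1 + scored$promotion * scored$lift)

# Rank rows from highest to lowest predicted lift, then cut into ng equal groups
scored <- scored[order(-scored$lift), ]
scored$group <- ceiling(seq_len(n) * ng / n)

# Observed uplift per group: treated response rate minus control response rate
group_tab <- do.call(rbind, lapply(split(scored, scored$group), function(g) {
  data.frame(
    group          = g$group[1],
    n              = nrow(g),
    mean_pred_lift = mean(g$lift),
    obs_uplift     = mean(g$visit[g$promotion == 1]) -
                     mean(g$visit[g$promotion == 0])
  )
}))
print(group_tab, digits = 3)
```

    If the model ranks customers well, the observed uplift should be highest in the first groups and fall off toward the last ones.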

8.4.1.1 Examples

  • Using all default options

    # Save results as an object
    visit.uplift <- reguplift(email.visit, "promotion")
    spend.uplift <- reguplift(email.spend, "promotion")
    
    # Examine results
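    visit.uplift$plots
    [Plots: $qini (Qini plot), $uplift (mean uplift by group), $c.gain (cumulative gain by group)]

    spend.uplift$plots
    [Plots: $qini (Qini plot), $uplift (mean uplift by group), $c.gain (cumulative gain by group)]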
  • Using options

    # Save results as an object
    spend.uplift.5 <- reguplift(email.spend, "promotion", ng=5, int="Y")
    
    # Examine results
    spend.uplift.5$plots
    [Plots: $qini (Qini plot), $uplift (mean uplift by group), $c.gain (cumulative gain by group)]
    spend.uplift.5$int

    Interaction            Control
    womens:zipSurburban      0.057
    womens:zipUrban          0.006

    1 Values are p-values for interaction
    2 Outcome = visit
    3 Control: promotion = 0
    4 Treat: promotion = 1

8.5 LIFT Plots

  • To get LIFT plots based on an uplift modeling object, use the liftplot function.
  • This function creates a lift plot following uplift modeling. It can create a histogram (if var is NULL) or an error-bar plot. For continuous variables, it creates an error bar for each quintile of the variable. For factor variables, it creates an error bar for each level of the factor. It can also create side-by-side error-bar plots for two variables simultaneously by using the byvar option.
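  • The numbers behind a quintile error-bar plot can be sketched as follows. This is an illustration only, not the liftplot implementation: the res dataframe is simulated and hypothetical, the bins are assumed to be quintiles from quantile(), and the confidence interval is assumed to be a normal-approximation interval around the mean lift in each bin.

```r
# Simulated results (hypothetical stand-in for uplift output with a lift column)
set.seed(3)
n <- 3000
res <- data.frame(recency = sample(1:12, n, replace = TRUE))
res$lift <- 0.08 - 0.004 * res$recency + rnorm(n, sd = 0.02)

ci <- 0.95

# Quintile bins of the continuous variable
breaks  <- unique(quantile(res$recency, probs = seq(0, 1, 0.2)))
res$bin <- cut(res$recency, breaks = breaks, include.lowest = TRUE)

# Mean lift per bin with a normal-approximation confidence interval
z <- qnorm(1 - (1 - ci) / 2)
bars <- do.call(rbind, lapply(split(res, res$bin), function(b) {
  se <- sd(b$lift) / sqrt(nrow(b))
  data.frame(bin = b$bin[1], n = nrow(b), mean_lift = mean(b$lift),
             lower = mean(b$lift) - z * se, upper = mean(b$lift) + z * se)
}))
print(bars, digits = 3)
```

    Each row of bars corresponds to one error bar: the point is mean_lift and the whiskers run from lower to upper.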

8.5.1 Using the liftplot function

  • Requires the following packages:
    • ggplot2
  • Usage: liftplot(data, var=NULL, byvar=NULL, ar=NULL, ci=c(0.90, 0.95, 0.975, 0.99, 0)) where:
    • data is the name of the dataframe with the results of an uplift modeling analysis.
    • var is the variable name for which the error-bars should be created. Must be in quotations. Default is NULL for a histogram.
    • byvar is the variable name identifying the second variable if side-by-side error-bar plots are desired. Must be in quotations. Default is NULL.
    • ar is the aspect ratio for the plots. Default is NULL.
    • ci is the type of error bar desired. Ignored if var is NULL. Must be one of the following if var is not NULL:
      • 0 for error-bars to represent 1 standard deviation
      • 0.90 or 0.95 or 0.975 or 0.99 for error-bars to represent the desired confidence level.
  • Returns: A ggplot object

8.5.1.1 Examples

  • Example 1: Histogram

    liftplot(visit.uplift$all)

  • Example 2: Single variable lift plots

    # Standard deviation error bars
    liftplot(visit.uplift$all, var="recency", ci=0)

    # 99% CI error bars
    liftplot(visit.uplift$all, var="zip", ci=0.99)

  • Example 3: Side-by-side, two variable lift plots

    # Standard deviation error bars
    liftplot(spend.uplift$all, var="recency", byvar="zip", ci=0)

    # 99% CI error bars
    liftplot(spend.uplift$all, var="zip", byvar="womens", ci=0.99)