Chapter 5 dplyr Package

In this course, data transformations are primarily performed using the dplyr package (pronounced DEE ply er). This package makes data manipulation easier and more intuitive (for most). dplyr is built around the five main “verbs” shown below that make up a majority of data manipulation. However, there are other functions that dplyr uses to also help with data manipulation.

select is used to subset columns
filter is used to subset rows
mutate is used to add new columns based on calculations (usually with other columns)
summarise is use to perform summary calculations (e.g., mean, max, etc.) on data set
group_by is used to group rows of a data frame with the same value in specified columns

In addition, dplyr uses the pipe, %>%, to string together a series of functions. Think of functions strung together as upstream and downstream functions. The function to the left of %>% is the upstream function, while the function to the right is the downstream function.

By default, the downstream function assumes the value coming from the upstream function is the first argument in its function. Therefore, the first argument can be omitted. If the downstream function needs to use the value from the upstream function assigned to a different argument, a . is simply put in the position of that argument

5.1 The dplyr workflow

The dplyr package is designed to make data manipulation clear and readable.

library(dplyr)

A typical dplyr workflow:

Start with a data frame
Apply a sequence of transformation verbs
Save or display the result

5.2 Selecting variables

Use select() to keep only the variables you need. This does not modify the original dataset unless you save the result.

directmktg %>%
  select(userid, age, gender) %>%
  head()

    userid age gender
1 15624510  19   Male
2 15810944  35   Male
3 15668575  26 Female
4 15603246  27 Female
5 15804002  19   Male
6 15728773  27   Male

“Negative” selection can also be done by using the - (minus sign) before a variable name or a vector of variable names.

directmktg %>%
  select(-gender) %>%
  head()

    userid age buy salary age10 buy_binary salary_cat
1 15624510  19  No     19   1.9          0        Low
2 15810944  35  No     20   3.5          0        Low
3 15668575  26  No     43   2.6          0        Low
4 15603246  27  No     57   2.7          0        Med
5 15804002  19  No     76   1.9          0        Med
6 15728773  27  No     58   2.7          0        Med

directmktg %>%
  select(-c(age,gender)) %>%
  head()

    userid buy salary age10 buy_binary salary_cat
1 15624510  No     19   1.9          0        Low
2 15810944  No     20   3.5          0        Low
3 15668575  No     43   2.6          0        Low
4 15603246  No     57   2.7          0        Med
5 15804002  No     76   1.9          0        Med
6 15728773  No     58   2.7          0        Med

5.3 Filtering observations

Use filter() to keep rows that meet certain conditions. (Note: the nrow(object_name) from base R provides the number of rows in the data frame).

nrow(directmktg)

[1] 400

directmktg %>%
  filter(age >= 35) %>%
  nrow()

[1] 254

Multiple conditions can be combined:

directmktg %>%
  filter(age >= 35, 
         gender == "Male") %>%
  nrow()

[1] 124

5.4 Creating new variables

Use mutate() to create or transform variables. New variables are added to the data frame.

directmktg %>%
  select(userid, age) %>%
  mutate(age10 = age / 10) %>%
  head()

    userid age age10
1 15624510  19   1.9
2 15810944  35   3.5
3 15668575  26   2.6
4 15603246  27   2.7
5 15804002  19   1.9
6 15728773  27   2.7

5.5 Summaries with `summarise()`

The summarise() function is used to compute summary statistics from a data frame. It can be used with or without grouping.

When summarise() is used without group_by(), it computes summaries over the entire dataset.

directmktg %>%
  summarise(
    n = n(),
    mean_age = mean(age),
    buy_rate = mean(buy == "Yes")
  )

    n mean_age buy_rate
1 400   37.655   0.3575

Here’s what this code is doing:

n() counts the total number of observations in the dataset
mean(age) computes the overall average age
mean(buy == "Yes") computes the overall purchase rate

The result is a data frame with one row, where each column represents a summary statistic for the full dataset.

5.6 Grouped summaries with `group_by()` and `summarise()`

Often, you want to compute summaries separately for different groups, such as customer segments or demographic categories.

In dplyr, this is done by combining group_by() with summarize().

directmktg %>%
  group_by(gender)

# A tibble: 400 × 8
# Groups:   gender [2]
     userid   age buy   gender salary age10 buy_binary salary_cat
      <dbl> <dbl> <fct> <fct>   <dbl> <dbl>      <dbl> <chr>     
 1 15624510    19 No    Male       19   1.9          0 Low       
 2 15810944    35 No    Male       20   3.5          0 Low       
 3 15668575    26 No    Female     43   2.6          0 Low       
 4 15603246    27 No    Female     57   2.7          0 Med       
 5 15804002    19 No    Male       76   1.9          0 Med       
 6 15728773    27 No    Male       58   2.7          0 Med       
 7 15598044    27 No    Female     84   2.7          0 High      
 8 15694829    32 Yes   Female    150   3.2          1 High      
 9 15600575    25 No    Male       33   2.5          0 Low       
10 15727311    35 No    Female     65   3.5          0 Med       
# ℹ 390 more rows

The group_by() function does not change the data values. Instead, it tells R how the data should be temporarily divided into groups for the next operation.

At this point, no calculations have been performed. Once the data are grouped, summarise() computes statistics within each group.

directmktg %>%
  group_by(gender) %>%
  summarise(
    n = n(),
    mean_age = mean(age),
    buy_rate = mean(buy == "Yes")
  )

# A tibble: 2 × 4
  gender     n mean_age buy_rate
  <fct>  <int>    <dbl>    <dbl>
1 Male     196     36.9    0.337
2 Female   204     38.4    0.377

Step by step:

group_by(gender) splits the data into separate groups based on gender
n() counts observations within each group
mean(age) computes the average age within each group
mean(buy == "Yes") computes the purchase rate within each group

The result is a data frame with one row per group and one column per summary statistic.

After summarise() runs, the grouping structure is automatically dropped, so the result behaves like a regular data frame.

5.7 Connecting `summarize()` to base R

Conceptually, grouped summaries in dplyr perform the same task as a multi-step process in base R:

Split the data into groups
Compute summary statistics for each group
Combine the results into a table

For example, in base R you might compute group means using functions such as aggregate() or by manually subsetting the data.

The advantage of group_by() and summarise() is that these steps are expressed explicitly and readably, making your data transformations easier to follow, debug, and modify.

5.8 Combining transformations

As you’ve seen, one of the main advantages of dplyr is that multiple steps can be chained together.

directmktg_clean <- directmktg %>%
  filter(age >= 35) %>%
  mutate(age10 = age / 10,
         buy_binary = ifelse(buy == "Yes", 1, 0)) %>%
  select(userid, age10, gender, buy_binary)

head(directmktg_clean)

    userid age10 gender buy_binary
1 15810944   3.5   Male          0
2 15727311   3.5 Female          0
3 15733883   4.7   Male          1
4 15617482   4.5   Male          1
5 15704583   4.6   Male          1
6 15621083   4.8 Female          1

This approach keeps data preparation transparent and reproducible.

5.9 Key takeaway

Before any modeling or visualization, you should be able to: - load data reliably, - inspect variable types and values, - identify missing or problematic data, and - transform data into a usable analytical form.

These steps are essential for sound marketing analytics.

5.10 What’s next

In the next chapter, we will use cleaned and well-understood data to perform descriptive analysis, including frequency tables, crosstabs, measures of central tendency and dispersion, and correlation.