Chapter 5 dplyr Package
In this course, data transformations are primarily performed using the dplyr package (pronounced DEE ply er). This package makes data manipulation easier and more intuitive (for most). dplyr is built around the five main “verbs” shown below that make up a majority of data manipulation. However, there are other functions that dplyr uses to also help with data manipulation.
selectis used to subset columnsfilteris used to subset rowsmutateis used to add new columns based on calculations (usually with other columns)summariseis use to perform summary calculations (e.g., mean, max, etc.) on data setgroup_byis used to group rows of a data frame with the same value in specified columns
In addition, dplyr uses the pipe, %>%, to string together a series of functions. Think of functions strung together as upstream and downstream functions. The function to the left of %>% is the upstream function, while the function to the right is the downstream function.
By default, the downstream function assumes the value coming from the upstream function is the first argument in its function. Therefore, the first argument can be omitted. If the downstream function needs to use the value from the upstream function assigned to a different argument, a . is simply put in the position of that argument
5.1 The dplyr workflow
The dplyr package is designed to make data manipulation clear and readable.
A typical dplyr workflow:
- Start with a data frame
- Apply a sequence of transformation verbs
- Save or display the result
5.2 Selecting variables
Use select() to keep only the variables you need. This does not modify the original dataset unless you save the result.
userid age gender
1 15624510 19 Male
2 15810944 35 Male
3 15668575 26 Female
4 15603246 27 Female
5 15804002 19 Male
6 15728773 27 Male
“Negative” selection can also be done by using the - (minus sign) before a variable name or a vector of variable names.
userid age buy salary age10 buy_binary salary_cat
1 15624510 19 No 19 1.9 0 Low
2 15810944 35 No 20 3.5 0 Low
3 15668575 26 No 43 2.6 0 Low
4 15603246 27 No 57 2.7 0 Med
5 15804002 19 No 76 1.9 0 Med
6 15728773 27 No 58 2.7 0 Med
userid buy salary age10 buy_binary salary_cat
1 15624510 No 19 1.9 0 Low
2 15810944 No 20 3.5 0 Low
3 15668575 No 43 2.6 0 Low
4 15603246 No 57 2.7 0 Med
5 15804002 No 76 1.9 0 Med
6 15728773 No 58 2.7 0 Med
5.3 Filtering observations
Use filter() to keep rows that meet certain conditions. (Note: the nrow(object_name) from base R provides the number of rows in the data frame).
[1] 400
[1] 254
Multiple conditions can be combined:
[1] 124
5.4 Creating new variables
Use mutate() to create or transform variables. New variables are added to the data frame.
userid age age10
1 15624510 19 1.9
2 15810944 35 3.5
3 15668575 26 2.6
4 15603246 27 2.7
5 15804002 19 1.9
6 15728773 27 2.7
5.5 Summaries with summarise()
The summarise() function is used to compute summary statistics from a data frame.
It can be used with or without grouping.
When summarise() is used without group_by(), it computes summaries over the
entire dataset.
n mean_age buy_rate
1 400 37.655 0.3575
Here’s what this code is doing:
n()counts the total number of observations in the datasetmean(age)computes the overall average agemean(buy == "Yes")computes the overall purchase rate
The result is a data frame with one row, where each column represents a summary statistic for the full dataset.
5.6 Grouped summaries with group_by() and summarise()
Often, you want to compute summaries separately for different groups, such as customer segments or demographic categories.
In dplyr, this is done by combining group_by() with summarize().
# A tibble: 400 × 8
# Groups: gender [2]
userid age buy gender salary age10 buy_binary salary_cat
<dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <chr>
1 15624510 19 No Male 19 1.9 0 Low
2 15810944 35 No Male 20 3.5 0 Low
3 15668575 26 No Female 43 2.6 0 Low
4 15603246 27 No Female 57 2.7 0 Med
5 15804002 19 No Male 76 1.9 0 Med
6 15728773 27 No Male 58 2.7 0 Med
7 15598044 27 No Female 84 2.7 0 High
8 15694829 32 Yes Female 150 3.2 1 High
9 15600575 25 No Male 33 2.5 0 Low
10 15727311 35 No Female 65 3.5 0 Med
# ℹ 390 more rows
The group_by() function does not change the data values. Instead, it tells R how
the data should be temporarily divided into groups for the next operation.
At this point, no calculations have been performed. Once the data are grouped, summarise() computes statistics within each group.
directmktg %>%
group_by(gender) %>%
summarise(
n = n(),
mean_age = mean(age),
buy_rate = mean(buy == "Yes")
)# A tibble: 2 × 4
gender n mean_age buy_rate
<fct> <int> <dbl> <dbl>
1 Male 196 36.9 0.337
2 Female 204 38.4 0.377
Step by step:
group_by(gender)splits the data into separate groups based on gendern()counts observations within each groupmean(age)computes the average age within each groupmean(buy == "Yes")computes the purchase rate within each group
The result is a data frame with one row per group and one column per summary statistic.
After summarise() runs, the grouping structure is automatically dropped, so the
result behaves like a regular data frame.
5.7 Connecting summarize() to base R
Conceptually, grouped summaries in dplyr perform the same task as a multi-step
process in base R:
- Split the data into groups
- Compute summary statistics for each group
- Combine the results into a table
For example, in base R you might compute group means using functions such as
aggregate() or by manually subsetting the data.
The advantage of group_by() and summarise() is that these steps are expressed
explicitly and readably, making your data transformations easier to follow,
debug, and modify.
5.8 Combining transformations
As you’ve seen, one of the main advantages of dplyr is that multiple steps can be chained together.
directmktg_clean <- directmktg %>%
filter(age >= 35) %>%
mutate(age10 = age / 10,
buy_binary = ifelse(buy == "Yes", 1, 0)) %>%
select(userid, age10, gender, buy_binary)
head(directmktg_clean) userid age10 gender buy_binary
1 15810944 3.5 Male 0
2 15727311 3.5 Female 0
3 15733883 4.7 Male 1
4 15617482 4.5 Male 1
5 15704583 4.6 Male 1
6 15621083 4.8 Female 1
This approach keeps data preparation transparent and reproducible.