Chapter 6 Descriptive Analysis
Descriptive analysis summarizes and helps you understand your data numerically. In marketing analytics, it is often your first “sanity check” before modeling or visualization.
In this chapter you will learn how to:
- inspect a dataset and its variables,
- create frequency tables and crosstabs,
- compute measures of central tendency and dispersion,
- and compute and interpret correlations.
For this chapter, we’ll be using the airlinesat_small dataset from the MKT4320BGSU.
6.1 Descriptive Analysis
A quick descriptive workflow often looks like:
- Confirm the data structure (rows/columns, variable types)
- Look for missing values and odd ranges
- Summarize key variables (categorical → counts; numeric → mean/SD, etc.)
- Compare segments (e.g., by treatment, gender, region, etc.)
6.1.1 Inspecting the dataset
The following functions are useful for inspecting the dataset.
str()shows each variables’ type, displays factor levels if present, and gives a compact preview of values for each variabledim()returns the number of rows and columsn to quickly tell you sample size and number of variablesnames()provides all column names, which helps with selecting variables for writing code or formulashead()gives the first six rows of the dataset to allow for visual inspection of raw values
'data.frame': 1065 obs. of 13 variables:
$ age : num 30 55 56 43 44 40 39 41 33 51 ...
$ country : Factor w/ 5 levels "at","ch","de",..: 2 2 2 4 2 2 2 2 2 3 ...
$ flight_class : Factor w/ 3 levels "Business","Economy",..: 2 1 2 2 1 3 2 1 2 1 ...
$ flight_latest : Factor w/ 6 levels "within the last 12 months",..: 4 3 5 3 6 5 6 3 3 4 ...
$ flight_purpose: Factor w/ 2 levels "Business","Leisure": 2 1 1 2 1 2 1 1 2 1 ...
$ flight_type : Factor w/ 2 levels "Domestic","International": 1 2 1 1 2 2 1 2 1 2 ...
$ gender : Factor w/ 2 levels "female","male": 2 2 1 1 1 2 2 2 2 2 ...
$ language : Factor w/ 3 levels "English","French",..: 2 1 1 2 1 3 2 2 2 3 ...
$ nflights : num 2 6 8 7 25 16 35 9 3 4 ...
$ status : Factor w/ 3 levels "Blue","Gold",..: 1 2 1 1 2 2 1 2 1 2 ...
$ nps : num 6 10 8 8 6 7 8 7 8 8 ...
$ overall_sat : num 2 6 2 4 2 4 4 4 4 3 ...
$ reputation : num 3 6 4 6 5 3 3 4 2 4 ...
[1] 1065 13
[1] "age" "country" "flight_class" "flight_latest" "flight_purpose" "flight_type"
[7] "gender" "language" "nflights" "status" "nps" "overall_sat"
[13] "reputation"
age country flight_class flight_latest flight_purpose flight_type gender language nflights status nps
1 30 ch Economy within the last 6 months Leisure Domestic male French 2 Blue 6
2 55 ch Business within the last 3 months Business International male English 6 Gold 10
3 56 ch Economy within the last month Business Domestic female English 8 Blue 8
4 43 fr Economy within the last 3 months Leisure Domestic female French 7 Blue 8
5 44 ch Business within the last week Business International female English 25 Gold 6
6 40 ch First within the last month Leisure International male German 16 Gold 7
overall_sat reputation
1 2 3
2 6 6
3 2 4
4 4 6
5 2 5
6 4 3
6.1.2 Missing values (quick checks)
It is also important to check for missing values, because they can have an adverse effect on some analyses. The is.na() function is often used for this, which returns a TRUE if a value is NA. Ultimately, this will tell which values in the data are missing to help decide how to handle them.
age country flight_class flight_latest flight_purpose flight_type gender language
0 0 0 0 0 0 0 0
nflights status nps overall_sat reputation
0 0 0 0 0
[1] 0
6.2 Frequency Tables
Frequency tables summarize categorical variables (and sometimes binned numeric variables).
6.2.1 One-way frequency table
To create a frequency table, use the table() function in R. Counts are often converted into proportions or rates to make results easier to interpret and compare.
In R, the proportions() function is used to convert frequency tables into proportions. Multiply the table by 100 to get percentages.
female male
280 785
female male
0.2629108 0.7370892
female male
26.29108 73.70892
6.3 Crosstabs
A crosstab is a frequency table for two categorical variables. Crosstabs are used frequently in marketing to compare segments (e.g., purchase by gender).
6.3.1 Base R
Base R does not do a great job of easily creating crosstabs and testing for independence of the two variables. Instead, a multistep process is required:
Create the two-way frequency table using the
table(rowvar, colvar)function and assign it to a separate objectDisplay the two-way freq table by just using the table name
female male Business 39 146 Economy 239 626 First 2 13Use the function
proportions(tablename, margin)on the newly created object to get column, row, or total percentagesproportions(tablename)gives total percentagesproportions(tablename, 1)gives row percentagesproportions(tablename, 2)gives column percentagesfemale male Business 0.036619718 0.137089202 Economy 0.224413146 0.587793427 First 0.001877934 0.012206573female male Business 0.2108108 0.7891892 Economy 0.2763006 0.7236994 First 0.1333333 0.8666667female male Business 0.139285714 0.185987261 Economy 0.853571429 0.797452229 First 0.007142857 0.016560510
Use the function
chisq.test(tablename)on the newly created object to run the test of independenceWarning in chisq.test(ct): Chi-squared approximation may be incorrectPearson's Chi-squared test data: ct X-squared = 4.6912, df = 2, p-value = 0.09579
6.3.2 Using package sjPlot
The sjPlot package can print crosstabs with nicer formatting. Use the function tab_xtab(var.row=, var.col=, show.col.prc=TRUE) to get a standard crosstab with column percentages and the chi-square test of independence.
| flight_class | gender | Total | |
|---|---|---|---|
| female | male | ||
| Business |
39 13.9 % |
146 18.6 % |
185 17.4 % |
| Economy |
239 85.4 % |
626 79.7 % |
865 81.2 % |
| First |
2 0.7 % |
13 1.7 % |
15 1.4 % |
| Total |
280 100 % |
785 100 % |
1065 100 % |
χ2=4.691 · df=2 · Cramer’s V=0.066 · Fisher’s p=0.095 |
6.4 Measures of Central Tendency and Dispersion
For numeric variables, the most common summaries are:
- Central tendency: mean, median, mode
- Dispersion: variance, standard deviation, IQR, range
6.4.1 Base R
Any individual summary statistic can be easily calculated using Base R with functions such as:
mean(var)for meansd(var)for standard deviationquantile(var, .percentile)for percentiles (e.g., ‘.50’ would be median)
For summary statistics except for standard deviation, the summary(object) function can be used, where object can be a single variable or an entire data frame
Min. 1st Qu. Median Mean 3rd Qu. Max.
19.00 42.00 50.00 50.42 58.00 101.00
age country flight_class flight_latest flight_purpose flight_type
Min. : 19.00 at:108 Business:185 within the last 12 months:139 Business:525 Domestic :558
1st Qu.: 42.00 ch: 66 Economy :865 within the last 2 days : 57 Leisure :540 International:507
Median : 50.00 de:695 First : 15 within the last 3 months :296
Mean : 50.42 fr: 1 within the last 6 months :187
3rd Qu.: 58.00 us:195 within the last month :253
Max. :101.00 within the last week :133
gender language nflights status nps overall_sat reputation
female:280 English:233 Min. : 1.00 Blue :677 Min. : 1.00 Min. :1.00 Min. :1.000
male :785 French : 10 1st Qu.: 4.00 Gold :143 1st Qu.: 6.00 1st Qu.:3.00 1st Qu.:4.000
German :822 Median : 8.00 Silver:245 Median : 8.00 Median :4.00 Median :5.000
Mean : 13.42 Mean : 7.52 Mean :3.74 Mean :4.841
3rd Qu.: 16.00 3rd Qu.: 9.00 3rd Qu.:5.00 3rd Qu.:6.000
Max. :457.00 Max. :11.00 Max. :7.00 Max. :7.000
[1] 50.41972
[1] 12.27464
If you want a single, custom summary:
x <- airlinesat_small$age
c(n = sum(!is.na(x)),
mean = mean(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE),
median = median(x, na.rm = TRUE),
iqr = IQR(x, na.rm = TRUE),
min = min(x, na.rm = TRUE),
max = max(x, na.rm = TRUE)) n mean sd median iqr min max
1065.00000 50.41972 12.27464 50.00000 16.00000 19.00000 101.00000
6.4.2 Using package dplyr
With dplyr, you can summarize many variables and/or do summaries by groups.
Overall summaries:
airlinesat_small %>%
summarise(n = n(),
mean_age = mean(age, na.rm = TRUE),
sd_age = sd(age, na.rm = TRUE),
median_age = median(age, na.rm = TRUE),
iqr_age = IQR(age, na.rm = TRUE),
mean_nflights = mean(nflights, na.rm = TRUE),
sd_nflights = sd(nflights, na.rm = TRUE)) n mean_age sd_age median_age iqr_age mean_nflights sd_nflights
1 1065 50.41972 12.27464 50 16 13.41878 20.22647
Group summaries (example: by buy):
airlinesat_small %>%
group_by(status) %>%
summarise(n = n(),
mean_age = mean(age, na.rm = TRUE),
sd_age = sd(age, na.rm = TRUE),
mean_nflights = mean(nflights, na.rm = TRUE),
sd_nflights = sd(nflights, na.rm = TRUE))# A tibble: 3 × 6
status n mean_age sd_age mean_nflights sd_nflights
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 Blue 677 50.6 13.3 8.23 19.2
2 Gold 143 53 10.0 24.6 20.9
3 Silver 245 48.3 10.0 21.2 17.2
6.4.3 Using package vtable
The vtable package can generate clean, compact descriptive tables with the sumtable(data, vars=c("")) function. For factor variables, it will provide the counts and percents of each level.
sumtable(airlinesat_small, c("age", "nflights", "status"),
add.median=TRUE, # 'add.median=TRUE' includes a 50th percentile column
title="")| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 50 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|---|
| age | 1065 | 50 | 12 | 19 | 42 | 50 | 58 | 101 |
| nflights | 1065 | 13 | 20 | 1 | 4 | 8 | 16 | 457 |
| status | 1065 | |||||||
| … Blue | 677 | 64% | ||||||
| … Gold | 143 | 13% | ||||||
| … Silver | 245 | 23% |
The sumtable() function can also provide summary statistics by a grouping variable.
sumtable(airlinesat_small, c("age", "nflights", "status"),
add.median=TRUE,
group="gender",
title="")| Variable | N | Mean | SD | Median | N | Mean | SD | Median |
|---|---|---|---|---|---|---|---|---|
| age | 280 | 51 | 13 | 51 | 785 | 50 | 12 | 50 |
| nflights | 280 | 9.3 | 12 | 5 | 785 | 15 | 22 | 9 |
| status | 280 | 785 | ||||||
| … Blue | 225 | 80% | 452 | 58% | ||||
| … Gold | 16 | 6% | 127 | 16% | ||||
| … Silver | 39 | 14% | 206 | 26% |
6.5 Correlation
Correlation measures the strength of a linear relationship between two numeric variables. The most common is Pearson correlation (the default).
6.5.1 Base R
Base R can easily provide a correlation matrix of many variables, and it can provide a correlation test between two variables at a time, but it cannot produce a correlation matrix with p-values.
To get a correlation matrix, use the cor() function with a dataframe of the variables desired or using indexing on variable names. By default, it uses all observations, which can create NA values if any missing values exists. Therefore, the preference is to add the option use = "pairwise.complete.obs" to only calculate the correlation on non-missing values for each pair of variables.
mycorr_df <- airlinesat_small %>%
select(age, nflights, nps, overall_sat, reputation)
cor(mycorr_df, use = "pairwise.complete.obs") age nflights nps overall_sat reputation
age 1.00000000 -0.11576301 0.09867319 0.05903446 0.06082991
nflights -0.11576301 1.00000000 -0.08949782 -0.05366975 -0.06364290
nps 0.09867319 -0.08949782 1.00000000 0.29961310 0.50712230
overall_sat 0.05903446 -0.05366975 0.29961310 1.00000000 0.17748688
reputation 0.06082991 -0.06364290 0.50712230 0.17748688 1.00000000
cor(airlinesat_small[,c("age", "nflights", "nps", "overall_sat", "reputation")],
use = "pairwise.complete.obs") age nflights nps overall_sat reputation
age 1.00000000 -0.11576301 0.09867319 0.05903446 0.06082991
nflights -0.11576301 1.00000000 -0.08949782 -0.05366975 -0.06364290
nps 0.09867319 -0.08949782 1.00000000 0.29961310 0.50712230
overall_sat 0.05903446 -0.05366975 0.29961310 1.00000000 0.17748688
reputation 0.06082991 -0.06364290 0.50712230 0.17748688 1.00000000
To get the correlation test for any one pair of variables, use the cor.test(var1, var2) function. By default, it includes only observations that are non-missing in both variables.
Pearson's product-moment correlation
data: airlinesat_small$age and airlinesat_small$nflights
t = -3.7998, df = 1063, p-value = 0.000153
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.17461941 -0.05608231
sample estimates:
cor
-0.115763
6.5.2 Using package Hmisc
The Hmisc package is useful for correlation matrices with p-values using the rcorr() function. This function expects a matrix (or data frame coerced to matrix).
age nflights nps overall_sat reputation
age 1.00 -0.12 0.10 0.06 0.06
nflights -0.12 1.00 -0.09 -0.05 -0.06
nps 0.10 -0.09 1.00 0.30 0.51
overall_sat 0.06 -0.05 0.30 1.00 0.18
reputation 0.06 -0.06 0.51 0.18 1.00
n= 1065
P
age nflights nps overall_sat reputation
age 0.0002 0.0013 0.0541 0.0472
nflights 0.0002 0.0035 0.0800 0.0378
nps 0.0013 0.0035 0.0000 0.0000
overall_sat 0.0541 0.0800 0.0000 0.0000
reputation 0.0472 0.0378 0.0000 0.0000
6.5.3 Using package sjPlot
The sjPlot package can produce formatted correlation tables for reports using the tab_corr() function. By default, it uses listwise (or casewise) deletion, which removes an entire case (or row) if any value is NA, which may result in major data loss. Use the option na.deletion = "pairwise" to prevent this.
| age | nflights | nps | overall_sat | reputation | |
|---|---|---|---|---|---|
| age | |||||
| nflights | -0.116*** | ||||
| nps | 0.099** | -0.089** | |||
| overall_sat | 0.059 | -0.054 | 0.300*** | ||
| reputation | 0.061* | -0.064* | 0.507*** | 0.177*** | |
| Computed correlation used pearson-method with pairwise-deletion. | |||||
6.6 Why descriptive analysis matters
Descriptive statistics help you to:
- detect unusual values,
- understand typical behavior,
- compare groups,
- and check whether results are plausible.
They also guide decisions about which models or visualizations are appropriate.
6.7 What’s next
In the next chapter, we move from numbers to visuals. You will learn how to create effective data visualizations using a small amount of Base R, but primarily ggplot2. Visualizations allow you to see patterns, distributions, and relationships that are often difficult to detect from tables alone.
Together, descriptive statistics and visualization form the foundation of exploratory data analysis, which prepares you for modeling and inference later in the course.