Chapter 6 Descriptive Analysis

Descriptive analysis summarizes and helps you understand your data numerically. In marketing analytics, it is often your first “sanity check” before modeling or visualization.

In this chapter you will learn how to:

inspect a dataset and its variables,
create frequency tables and crosstabs,
compute measures of central tendency and dispersion,
and compute and interpret correlations.

For this chapter, we’ll be using the airlinesat_small dataset from the MKT4320BGSU.

data(airlinesat_small)

6.1 Descriptive Analysis

A quick descriptive workflow often looks like:

Confirm the data structure (rows/columns, variable types)
Look for missing values and odd ranges
Summarize key variables (categorical → counts; numeric → mean/SD, etc.)
Compare segments (e.g., by treatment, gender, region, etc.)

6.1.1 Inspecting the dataset

The following functions are useful for inspecting the dataset.

str() shows each variables’ type, displays factor levels if present, and gives a compact preview of values for each variable
dim() returns the number of rows and columsn to quickly tell you sample size and number of variables
names() provides all column names, which helps with selecting variables for writing code or formulas
head() gives the first six rows of the dataset to allow for visual inspection of raw values

# Take a look at the data frame
str(airlinesat_small)

'data.frame':   1065 obs. of  13 variables:
 $ age           : num  30 55 56 43 44 40 39 41 33 51 ...
 $ country       : Factor w/ 5 levels "at","ch","de",..: 2 2 2 4 2 2 2 2 2 3 ...
 $ flight_class  : Factor w/ 3 levels "Business","Economy",..: 2 1 2 2 1 3 2 1 2 1 ...
 $ flight_latest : Factor w/ 6 levels "within the last 12 months",..: 4 3 5 3 6 5 6 3 3 4 ...
 $ flight_purpose: Factor w/ 2 levels "Business","Leisure": 2 1 1 2 1 2 1 1 2 1 ...
 $ flight_type   : Factor w/ 2 levels "Domestic","International": 1 2 1 1 2 2 1 2 1 2 ...
 $ gender        : Factor w/ 2 levels "female","male": 2 2 1 1 1 2 2 2 2 2 ...
 $ language      : Factor w/ 3 levels "English","French",..: 2 1 1 2 1 3 2 2 2 3 ...
 $ nflights      : num  2 6 8 7 25 16 35 9 3 4 ...
 $ status        : Factor w/ 3 levels "Blue","Gold",..: 1 2 1 1 2 2 1 2 1 2 ...
 $ nps           : num  6 10 8 8 6 7 8 7 8 8 ...
 $ overall_sat   : num  2 6 2 4 2 4 4 4 4 3 ...
 $ reputation    : num  3 6 4 6 5 3 3 4 2 4 ...

# Dimensions (rows, columns)
dim(airlinesat_small)

[1] 1065   13

# Variable names
names(airlinesat_small)

 [1] "age"            "country"        "flight_class"   "flight_latest"  "flight_purpose" "flight_type"   
 [7] "gender"         "language"       "nflights"       "status"         "nps"            "overall_sat"   
[13] "reputation"

# First few rows
head(airlinesat_small)

  age country flight_class            flight_latest flight_purpose   flight_type gender language nflights status nps
1  30      ch      Economy within the last 6 months        Leisure      Domestic   male   French        2   Blue   6
2  55      ch     Business within the last 3 months       Business International   male  English        6   Gold  10
3  56      ch      Economy    within the last month       Business      Domestic female  English        8   Blue   8
4  43      fr      Economy within the last 3 months        Leisure      Domestic female   French        7   Blue   8
5  44      ch     Business     within the last week       Business International female  English       25   Gold   6
6  40      ch        First    within the last month        Leisure International   male   German       16   Gold   7
  overall_sat reputation
1           2          3
2           6          6
3           2          4
4           4          6
5           2          5
6           4          3

6.1.2 Missing values (quick checks)

It is also important to check for missing values, because they can have an adverse effect on some analyses. The is.na() function is often used for this, which returns a TRUE if a value is NA. Ultimately, this will tell which values in the data are missing to help decide how to handle them.

# Missing values by variable
colSums(is.na(airlinesat_small))

           age        country   flight_class  flight_latest flight_purpose    flight_type         gender       language 
             0              0              0              0              0              0              0              0 
      nflights         status            nps    overall_sat     reputation 
             0              0              0              0              0

# Total missing values in the dataset
sum(is.na(airlinesat_small))

[1] 0

6.2 Frequency Tables

Frequency tables summarize categorical variables (and sometimes binned numeric variables).

6.2.1 One-way frequency table

To create a frequency table, use the table() function in R. Counts are often converted into proportions or rates to make results easier to interpret and compare.

In R, the proportions() function is used to convert frequency tables into proportions. Multiply the table by 100 to get percentages.

# Frequency table for a categorical variable
table(airlinesat_small$gender)


female   male 
   280    785

# Add proportions (shares)
proportions(table(airlinesat_small$gender))


   female      male 
0.2629108 0.7370892

# Add percentages
100 * proportions(table(airlinesat_small$gender))


  female     male 
26.29108 73.70892

6.3 Crosstabs

A crosstab is a frequency table for two categorical variables. Crosstabs are used frequently in marketing to compare segments (e.g., purchase by gender).

6.3.1 Base R

Base R does not do a great job of easily creating crosstabs and testing for independence of the two variables. Instead, a multistep process is required:

Create the two-way frequency table using the table(rowvar, colvar) function and assign it to a separate object

Display the two-way freq table by just using the table name

# Counts
ct <- table(airlinesat_small$flight_class, airlinesat_small$gender)
ct


           female male
  Business     39  146
  Economy     239  626
  First         2   13

Use the function proportions(tablename, margin) on the newly created object to get column, row, or total percentages

proportions(tablename) gives total percentages
proportions(tablename, 1) gives row percentages

proportions(tablename, 2) gives column percentages

proportions(ct)             # Total proportions: what pprportion of all respondents


                female        male
  Business 0.036619718 0.137089202
  Economy  0.224413146 0.587793427
  First    0.001877934 0.012206573

proportions(ct, margin=1)   # Row proportions: how are genders distributed within a row


              female      male
  Business 0.2108108 0.7891892
  Economy  0.2763006 0.7236994
  First    0.1333333 0.8666667

proportions(ct, margin=2)   # Column proportions: how is th outcome distributed within each gender


                female        male
  Business 0.139285714 0.185987261
  Economy  0.853571429 0.797452229
  First    0.007142857 0.016560510

Use the function chisq.test(tablename) on the newly created object to run the test of independence

chisq.test(ct)

Warning in chisq.test(ct): Chi-squared approximation may be incorrect


    Pearson's Chi-squared test

data:  ct
X-squared = 4.6912, df = 2, p-value = 0.09579

6.3.2 Using package sjPlot

The sjPlot package can print crosstabs with nicer formatting. Use the function tab_xtab(var.row=, var.col=, show.col.prc=TRUE) to get a standard crosstab with column percentages and the chi-square test of independence.

tab_xtab(airlinesat_small$flight_class,
         airlinesat_small$gender,
         show.col.prc = TRUE)

flight_class	gender		Total
flight_class	female	male	Total
Business	39 13.9 %	146 18.6 %	185 17.4 %
Economy	239 85.4 %	626 79.7 %	865 81.2 %
First	2 0.7 %	13 1.7 %	15 1.4 %
Total	280 100 %	785 100 %	1065 100 %
χ²=4.691 · df=2 · Cramer’s V=0.066 · Fisher’s p=0.095

6.4 Measures of Central Tendency and Dispersion

For numeric variables, the most common summaries are:

Central tendency: mean, median, mode
Dispersion: variance, standard deviation, IQR, range

6.4.1 Base R

Any individual summary statistic can be easily calculated using Base R with functions such as:

mean(var) for mean
sd(var) for standard deviation
quantile(var, .percentile) for percentiles (e.g., ‘.50’ would be median)

For summary statistics except for standard deviation, the summary(object) function can be used, where object can be a single variable or an entire data frame

# Base R summary (min, quartiles, median, mean, max)
summary(airlinesat_small$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  19.00   42.00   50.00   50.42   58.00  101.00

summary(airlinesat_small)

      age         country    flight_class                   flight_latest  flight_purpose        flight_type 
 Min.   : 19.00   at:108   Business:185   within the last 12 months:139   Business:525    Domestic     :558  
 1st Qu.: 42.00   ch: 66   Economy :865   within the last 2 days   : 57   Leisure :540    International:507  
 Median : 50.00   de:695   First   : 15   within the last 3 months :296                                      
 Mean   : 50.42   fr:  1                  within the last 6 months :187                                      
 3rd Qu.: 58.00   us:195                  within the last month    :253                                      
 Max.   :101.00                           within the last week     :133                                      
    gender       language      nflights         status         nps         overall_sat     reputation   
 female:280   English:233   Min.   :  1.00   Blue  :677   Min.   : 1.00   Min.   :1.00   Min.   :1.000  
 male  :785   French : 10   1st Qu.:  4.00   Gold  :143   1st Qu.: 6.00   1st Qu.:3.00   1st Qu.:4.000  
              German :822   Median :  8.00   Silver:245   Median : 8.00   Median :4.00   Median :5.000  
                            Mean   : 13.42                Mean   : 7.52   Mean   :3.74   Mean   :4.841  
                            3rd Qu.: 16.00                3rd Qu.: 9.00   3rd Qu.:5.00   3rd Qu.:6.000  
                            Max.   :457.00                Max.   :11.00   Max.   :7.00   Max.   :7.000

# Mean and SD (remove missing values if present)
mean(airlinesat_small$age, na.rm = TRUE)

[1] 50.41972

sd(airlinesat_small$age, na.rm = TRUE)

[1] 12.27464

If you want a single, custom summary:

x <- airlinesat_small$age

c(n = sum(!is.na(x)),
  mean = mean(x, na.rm = TRUE),
  sd = sd(x, na.rm = TRUE),
  median = median(x, na.rm = TRUE),
  iqr = IQR(x, na.rm = TRUE),
  min = min(x, na.rm = TRUE),
  max = max(x, na.rm = TRUE))

         n       mean         sd     median        iqr        min        max 
1065.00000   50.41972   12.27464   50.00000   16.00000   19.00000  101.00000

6.4.2 Using package dplyr

With dplyr, you can summarize many variables and/or do summaries by groups.

Overall summaries:

airlinesat_small %>%
  summarise(n = n(),
            mean_age = mean(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE),
            median_age = median(age, na.rm = TRUE),
            iqr_age = IQR(age, na.rm = TRUE),
            mean_nflights = mean(nflights, na.rm = TRUE),
            sd_nflights = sd(nflights, na.rm = TRUE))

     n mean_age   sd_age median_age iqr_age mean_nflights sd_nflights
1 1065 50.41972 12.27464         50      16      13.41878    20.22647

Group summaries (example: by buy):

airlinesat_small %>%
  group_by(status) %>%
  summarise(n = n(),
            mean_age = mean(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE),
            mean_nflights = mean(nflights, na.rm = TRUE),
            sd_nflights = sd(nflights, na.rm = TRUE))

# A tibble: 3 × 6
  status     n mean_age sd_age mean_nflights sd_nflights
  <fct>  <int>    <dbl>  <dbl>         <dbl>       <dbl>
1 Blue     677     50.6   13.3          8.23        19.2
2 Gold     143     53     10.0         24.6         20.9
3 Silver   245     48.3   10.0         21.2         17.2

6.4.3 Using package vtable

The vtable package can generate clean, compact descriptive tables with the sumtable(data, vars=c("")) function. For factor variables, it will provide the counts and percents of each level.

sumtable(airlinesat_small, c("age", "nflights", "status"),
         add.median=TRUE,    # 'add.median=TRUE' includes a 50th percentile column
          title="")

(#tab:vtable_sumtable)
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 50	Pctl. 75	Max
age	1065	50	12	19	42	50	58	101
nflights	1065	13	20	1	4	8	16	457
status	1065
… Blue	677	64%
… Gold	143	13%
… Silver	245	23%

The sumtable() function can also provide summary statistics by a grouping variable.

sumtable(airlinesat_small, c("age", "nflights", "status"),
         add.median=TRUE, 
         group="gender",
         title="")

(#tab:vtable_sumtable_grouping)
gender	female				male
Variable	N	Mean	SD	Median	N	Mean	SD	Median
age	280	51	13	51	785	50	12	50
nflights	280	9.3	12	5	785	15	22	9
status	280				785
… Blue	225	80%			452	58%
… Gold	16	6%			127	16%
… Silver	39	14%			206	26%

6.5 Correlation

Correlation measures the strength of a linear relationship between two numeric variables. The most common is Pearson correlation (the default).

6.5.1 Base R

Base R can easily provide a correlation matrix of many variables, and it can provide a correlation test between two variables at a time, but it cannot produce a correlation matrix with p-values.

To get a correlation matrix, use the cor() function with a dataframe of the variables desired or using indexing on variable names. By default, it uses all observations, which can create NA values if any missing values exists. Therefore, the preference is to add the option use = "pairwise.complete.obs" to only calculate the correlation on non-missing values for each pair of variables.

mycorr_df <- airlinesat_small %>%
  select(age, nflights, nps, overall_sat, reputation)
cor(mycorr_df, use = "pairwise.complete.obs")

                    age    nflights         nps overall_sat  reputation
age          1.00000000 -0.11576301  0.09867319  0.05903446  0.06082991
nflights    -0.11576301  1.00000000 -0.08949782 -0.05366975 -0.06364290
nps          0.09867319 -0.08949782  1.00000000  0.29961310  0.50712230
overall_sat  0.05903446 -0.05366975  0.29961310  1.00000000  0.17748688
reputation   0.06082991 -0.06364290  0.50712230  0.17748688  1.00000000

cor(airlinesat_small[,c("age", "nflights", "nps", "overall_sat", "reputation")],
    use = "pairwise.complete.obs")

                    age    nflights         nps overall_sat  reputation
age          1.00000000 -0.11576301  0.09867319  0.05903446  0.06082991
nflights    -0.11576301  1.00000000 -0.08949782 -0.05366975 -0.06364290
nps          0.09867319 -0.08949782  1.00000000  0.29961310  0.50712230
overall_sat  0.05903446 -0.05366975  0.29961310  1.00000000  0.17748688
reputation   0.06082991 -0.06364290  0.50712230  0.17748688  1.00000000

To get the correlation test for any one pair of variables, use the cor.test(var1, var2) function. By default, it includes only observations that are non-missing in both variables.

cor.test(airlinesat_small$age, airlinesat_small$nflights)


    Pearson's product-moment correlation

data:  airlinesat_small$age and airlinesat_small$nflights
t = -3.7998, df = 1063, p-value = 0.000153
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.17461941 -0.05608231
sample estimates:
      cor 
-0.115763

6.5.2 Using package Hmisc

The Hmisc package is useful for correlation matrices with p-values using the rcorr() function. This function expects a matrix (or data frame coerced to matrix).

rcorr(as.matrix(mycorr_df))

              age nflights   nps overall_sat reputation
age          1.00    -0.12  0.10        0.06       0.06
nflights    -0.12     1.00 -0.09       -0.05      -0.06
nps          0.10    -0.09  1.00        0.30       0.51
overall_sat  0.06    -0.05  0.30        1.00       0.18
reputation   0.06    -0.06  0.51        0.18       1.00

n= 1065 


P
            age    nflights nps    overall_sat reputation
age                0.0002   0.0013 0.0541      0.0472    
nflights    0.0002          0.0035 0.0800      0.0378    
nps         0.0013 0.0035          0.0000      0.0000    
overall_sat 0.0541 0.0800   0.0000             0.0000    
reputation  0.0472 0.0378   0.0000 0.0000

6.5.3 Using package sjPlot

The sjPlot package can produce formatted correlation tables for reports using the tab_corr() function. By default, it uses listwise (or casewise) deletion, which removes an entire case (or row) if any value is NA, which may result in major data loss. Use the option na.deletion = "pairwise" to prevent this.

tab_corr(mycorr_df,
         triangle = "lower",
         show.p = TRUE,
         na.deletion = "pairwise")

	age	nflights	nps	overall_sat
age
nflights	-0.116***
nps	0.099**	-0.089**
overall_sat	0.059	-0.054	0.300***
reputation	0.061*	-0.064*	0.507***	0.177***
Computed correlation used pearson-method with pairwise-deletion.

6.6 Why descriptive analysis matters

Descriptive statistics help you to:

detect unusual values,
understand typical behavior,
compare groups,
and check whether results are plausible.

They also guide decisions about which models or visualizations are appropriate.

6.7 What’s next

In the next chapter, we move from numbers to visuals. You will learn how to create effective data visualizations using a small amount of Base R, but primarily ggplot2. Visualizations allow you to see patterns, distributions, and relationships that are often difficult to detect from tables alone.

Together, descriptive statistics and visualization form the foundation of exploratory data analysis, which prepares you for modeling and inference later in the course.