Chapter 2 Examining and Summarizing Data
Sources for this chapter:
- ggplot2: https://ggplot2.tidyverse.org/
2.1 Introduction
Examining and summarizing data involves visualizations (e.g., graphs and charts) and tables. For visualization, the most popular package in R is the ggplot2 package.
Data for this chapter:
The airlinesat data is used from the MKT4320BGSU course package. Load the package and use the data() function to load the data.
2.2 Visualizations
2.2.1 Package ggplot2
2.2.1.1 Introduction
- First, make sure the ggplot2 library is loaded:
library(ggplot2)
- ggplot2 is based on the basic idea that for a graph, we need three things:
  - Data: ggplot2 is designed to work with data frames
  - Coordinate system: what are the x and y variables
  - Geoms: how is the data being represented (e.g., points, lines, etc.)
- When ggplot is used in the console or from a script in RStudio, the plot appears in the Plots tab of the lower-right pane
Video Tutorial: Data Visualizations using ggplot (Introduction)
2.2.1.2 Usage
- A plot starts with the ggplot() function, which requires two arguments:
  - Source of the data, which can be piped in (i.e., %>%)
  - The mapping of the data components
    - This argument is an aesthetic function, aes(), which maps the variable(s) to the coordinate system
- If the ggplot() function alone is used, the output is simply the coordinate system, but with nothing plotted
  - Because a geom hasn’t been requested
ggplot(airlinesat,                    # Use data frame 'airlinesat'
       aes(x=country, y=nflights))    # Map 'country' on x and 'nflights' on y
- Other parts of the plot are added in layers, using +
  - A good analogy is building a house: the call to ggplot() is the foundation, but the structure is built one layer at a time
  - Example: Request a column chart for a discrete x and a continuous y
ggplot(airlinesat,                      # Use data frame 'airlinesat'
       aes(x=country, y=nflights)) +    # Map 'country' on x and 'nflights' on y
  geom_col()                            # Ask for column chart as the geom
  - NOTE: Each geom has a default statistic to plot
    - In this case, it is summing the nflights variable by country
    - We can use dplyr and ggplot2 together to get a different value, such as the mean
# Use airlinesat data
airlinesat %>%
  # Group data by 'country'
  group_by(country) %>%
  # Create summary statistic
  summarise(mean_nflights=mean(nflights, na.rm=TRUE)) %>%
  # Pass these results to ggplot and start the plot
  ggplot(aes(x=country, y=mean_nflights)) +   # Note dataset was piped
  # Request column geom
  geom_col()
Using ggplot() can get much more advanced. As the tutorial progresses, many examples of additional layers to a ggplot() call will be shown.
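To make the layering idea concrete, here is a minimal sketch that builds a plot object one step at a time and counts its layers after each step. The `toy` data frame and its values are invented for illustration; they are not the airlinesat data.

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(country  = c("at", "ch", "de"),
                  nflights = c(5, 12, 30))

# Foundation only: data and aesthetic mapping, no geom yet
base_plot <- ggplot(toy, aes(x = country, y = nflights))
n_layers_base <- length(base_plot$layers)

# Add one layer with `+`; the column geom draws the bars
col_plot <- base_plot + geom_col()
n_layers_col <- length(col_plot$layers)

# labs() changes the axis titles but is not a geom layer,
# so the layer count stays the same
final_plot <- col_plot + labs(x = "Country", y = "Number of Flights")
n_layers_final <- length(final_plot$layers)
```

Printing `base_plot` shows only the empty coordinate system, while `final_plot` shows the labeled column chart.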
Video Tutorial: Data Visualizations using ggplot (Basic Usage)
Video Tutorial: Data Visualizations using ggplot (Linking with dplyr)
2.2.2 Bar and Column Charts
In ggplot, bar charts, geom_bar(), are used for plotting a single discrete variable, while column charts, geom_col(), are used for plotting a discrete variable on the x axis and a continuous variable on the y axis.
2.2.2.1 Bar Charts
The standard bar chart provides a count of observations of each category of discrete variable x
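As a minimal sketch of the default counting behavior, the example below draws a bar chart from a small invented data frame and compares the bar heights that ggplot2 computed with the counts from base R's `table()`. The `toy` data are hypothetical and stand in for the course data.

```r
library(ggplot2)

# Hypothetical toy data; invented for illustration only
toy <- data.frame(gender = c("female", "male", "male", "male", "female", "male"))

# geom_bar() needs only x; it counts the rows in each category
bar_plot <- ggplot(toy, aes(x = gender)) +
  geom_bar()

# layer_data() returns the values ggplot2 computed for the bars
bar_counts <- layer_data(bar_plot)$count

# The same counts from base R, for comparison
table_counts <- as.vector(table(toy$gender))
```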
Video Tutorial: Bar Charts with ggplot::geom_bar (Part 1)
To get percentages of each category, we need to summarize the data and calculate the proportion for each category
airlinesat %>%
  group_by(gender) %>%            # Group data by gender
  summarise(n=n()) %>%            # Create variable with count of each gender
  mutate(prop=n/sum(n)) %>%       # Create variable with proportion by gender
  ggplot(aes(x=gender,            # Variable for the x-axis
             y=prop)) +           # Use 'prop' instead of default counts for y-axis
  geom_bar(stat="identity")       # Use the value of y as-is
Video Tutorial: Bar Charts with ggplot::geom_bar (Part 2)
To make the chart “pretty”, we can change the color of each bar, add layers for axis labels, use the scales package to have the y-axis show percent, add labels for the bars, etc.
airlinesat %>%
  group_by(gender) %>%            # Group data by gender
  summarise(n=n()) %>%            # Create variable with count of each gender
  mutate(prop=n/sum(n)) %>%       # Create variable with proportion by gender
  ggplot(aes(x=gender,            # Variable for the x-axis
             y=prop,              # Use 'prop' instead of default counts for y-axis
             fill=gender)) +      # Use different color for each bar
  geom_bar(show.legend=FALSE,     # Hide legend
           stat="identity") +     # Use the value of y as-is
  scale_y_continuous(labels=scales::label_percent()) +   # y-axis labels %
  labs(x="Gender", y="Percent") +                        # Label x- and y-axes
  geom_text(aes(label=sprintf("%.1f%%", prop*100)),      # Format label number
            vjust=.95,                                   # Vertically adjust the labels
            fontface="bold",                             # Bold typeface
            color="white")                               # Text color
Video Tutorial: Bar Charts with ggplot::geom_bar (Part 3)
2.2.2.1.1 Bar Chart Variations
2.2.2.1.1.1 Stacked Bar Chart
Used to show one discrete variable by another discrete variable, such as data you would see in a cross-tab
- The x= variable specifies the axis, while the fill= variable stacks the bars by the other variable
- As with other bar charts, the default is to count observations, so some manipulation is needed to get “100% stacked bar charts”
airlinesat %>%
  group_by(gender, flight_type) %>%   # Group data by two discrete variables
  summarise(n=n()) %>%                # Count observations for each combination
  mutate(prop=n/sum(n)) %>%           # Calculate prop WITHIN first grouping variable
  ggplot(aes(x=gender, y=prop, fill=flight_type)) +
  geom_bar(position="fill",           # Stack the bars
           stat="identity") +         # Use the value of y as-is
  scale_y_continuous(labels=scales::label_percent()) +   # y-axis labels %
  labs(x="Gender", y="Percent",       # Label x- and y-axes
       fill="Flight Type") +          # Label legend
  geom_text(aes(label=sprintf("%.1f%%", prop*100)),      # Format data label
            position=position_stack(vjust=.95),          # Adjust the labels
            fontface="bold",                             # Bold typeface
            color="white")                               # Text color
2.2.2.1.1.2 Side-by-Side Bar Chart
Also used to show one discrete variable by another discrete variable
- Again, default is to count observations, so some manipulation required to get percentages
- Percentages can be within a group (like in 100% stacked, see Figure 2.8) or percent of overall total (see Figure 2.9)
# NOTE: The code for this chart is nearly identical to the previous figure
#       ONLY the changes have been commented on below
airlinesat %>%
  group_by(gender, flight_type) %>%
  summarise(n=n()) %>%
  mutate(prop=n/sum(n)) %>%
  ggplot(aes(x=gender, y=prop, fill=flight_type)) +
  # NOTE: Use position="dodge" to make bars side-by-side
  geom_bar(position="dodge", stat="identity") +
  scale_y_continuous(labels=scales::label_percent()) +
  labs(x="Gender", y="Percent", fill="Flight Type") +
  # NOTE: Use position=position_dodge(width=1) to position labels
  #       in center of each bar horizontally; use vjust=.95 to
  #       position labels at the top of each bar
  geom_text(aes(label=sprintf("%.1f%%", prop*100)),
            position=position_dodge(width=1), vjust=.95,
            fontface="bold", color="white")
# NOTE: The code for this chart is nearly identical to the previous figure
#       ONLY the changes have been commented on below
airlinesat %>%
  group_by(gender, flight_type) %>%
  # NOTE: Use .groups="drop" to remove the grouping structure after
  #       summarising the data
  summarise(n=n(), .groups="drop") %>%
  mutate(prop=n/sum(n)) %>%
  ggplot(aes(x=gender, y=prop, fill=flight_type)) +
  geom_bar(position="dodge", stat="identity") +
  scale_y_continuous(labels=scales::label_percent()) +
  labs(x="Gender", y="Percent", fill="Flight Type") +
  geom_text(aes(label=sprintf("%.1f%%", prop*100)),
            position=position_dodge(width=1), vjust=.95,
            fontface="bold", color="white")
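The difference between within-group and overall percentages comes down to whether the data are still grouped when `sum(n)` is evaluated. The sketch below shows both calculations side by side on a small invented data frame; the `toy` data and the category labels "Domestic" and "International" are hypothetical, not taken from airlinesat.

```r
library(dplyr)

# Hypothetical toy data; invented for illustration only
toy <- data.frame(
  gender      = c("female", "female", "female", "male", "male", "male", "male"),
  flight_type = c("Domestic", "International", "Domestic",
                  "Domestic", "International", "International", "International")
)

# Within-group: "drop_last" (summarise()'s default here) keeps the data
# grouped by gender, so sum(n) is the total for each gender
within_gender <- toy %>%
  group_by(gender, flight_type) %>%
  summarise(n = n(), .groups = "drop_last") %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Overall: .groups = "drop" removes all grouping, so sum(n) is the grand total
overall <- toy %>%
  group_by(gender, flight_type) %>%
  summarise(n = n(), .groups = "drop") %>%
  mutate(prop = n / sum(n))
```

In `within_gender` the proportions add to 1 separately for each gender, while in `overall` all four proportions together add to 1.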
2.2.2.2 Column Charts
The standard column chart provides a sum of continuous variable y for each category of discrete variable x
To get a different summary statistic, such as mean, we can summarize the data and calculate the summary statistic for each category (and make the graph prettier)
airlinesat %>%
  group_by(flight_type) %>%
  summarise(mean=mean(nflights)) %>%
  ggplot(aes(x=flight_type, y=mean, fill=flight_type)) +
  geom_col(show.legend=FALSE) +
  labs(x="Flight Type", y="Mean Number of Flights") +
  geom_text(aes(label=sprintf("%.2f", mean)),   # Format label number
            vjust=.95,                          # Vertically adjust the labels
            fontface="bold",                    # Bold typeface
            color="white")                      # Text color
Video Tutorial: Column Charts with ggplot::geom_col (Part 1)
2.2.2.2.1 Side-by-Side Column Chart
A side by side column chart can be used to show two discrete variables on the x-axis
airlinesat %>%
  group_by(flight_type, flight_purpose) %>%
  summarise(mean=mean(nflights), .groups="drop") %>%
  ggplot(aes(x=flight_type, y=mean, fill=flight_purpose)) +
  geom_col(position="dodge") +
  labs(x="Flight Type", y="Mean Number of Flights", fill="Flight Purpose") +
  geom_text(aes(label=sprintf("%.2f", mean)),
            position=position_dodge(width=1), vjust=.95,
            fontface="bold", color="white")
Video Tutorial: Column Charts with ggplot::geom_col (Part 2)
2.2.3 Histogram
In ggplot, histograms are produced with the geom_histogram() geom, which produces a histogram of a single continuous variable.
- By default, the y-axis is a count of observations in each “bin” of the x variable
- A bin is a range of values of the continuous x variable
- By default, ggplot will produce a histogram with 30 bins, and a message is produced to that effect unless the bins are changed manually
2.2.3.1 Changing Bins
Histograms can look quite different based on the bins used. Bins can be changed in two ways: (1) number of bins; and (2) bin width
- Changing the number of bins is done with the bins= option
  - For example: geom_histogram(bins=20)
- Changing the bin width is done with the binwidth= option
  - For example: geom_histogram(binwidth=5)
- Use the interactive histograms (Figure 2.14 and Figure 2.15) to see how the histograms change
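The two binning options can be compared without the interactive figures. The sketch below draws the same invented ages twice, once with `bins=` and once with `binwidth=`, and then inspects the bars ggplot2 computed. The `toy` data are hypothetical and used only for illustration.

```r
library(ggplot2)

# Hypothetical toy data; ages are invented for illustration only
toy <- data.frame(age = c(19, 23, 31, 35, 38, 42, 44, 47, 50, 52, 58, 61, 67, 73))

# Same data, two binning choices
hist_bins  <- ggplot(toy, aes(x = age)) + geom_histogram(bins = 10)
hist_width <- ggplot(toy, aes(x = age)) + geom_histogram(binwidth = 5)

# Each row of layer_data() is one bar of the histogram
n_bars_bins  <- nrow(layer_data(hist_bins))
n_bars_width <- nrow(layer_data(hist_width))

# However the bins are drawn, every observation lands in exactly one bin
total_bins  <- sum(layer_data(hist_bins)$count)
total_width <- sum(layer_data(hist_width)$count)
```

Comparing `n_bars_bins` with `n_bars_width` shows how the two options split the same range of ages differently, even though both totals equal the number of observations.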
Video Tutorial: Histograms with ggplot::geom_histogram (Part 1)
2.2.3.2 Improving the Look
You may find the default histogram a little “blah” or tough to read. Just as the look of bar and column charts could be changed, so can the look of histograms
airlinesat %>%
  ggplot(aes(x=age)) +
  geom_histogram(color="black",   # Adds black border around each bar
                 fill="tan") +    # Makes each bar tan
  labs(x="Age", y="Frequency")
Video Tutorial: Histograms with ggplot::geom_histogram (Part 2)
2.2.3.3 Other Options
- Instead of the default count of observations, a density histogram can be created, where the sum of the area of the bars adds up to 1
- Often, a normal curve is added
airlinesat %>%
  ggplot(aes(x=age)) +
  geom_histogram(aes(y=after_stat(density)),   # Request density instead of count
                                               # (written ..density.. before ggplot2 3.3.0)
                 color="black",                # Adds black border around each bar
                 fill="tan") +                 # Makes each bar tan
  stat_function(fun=function(x)                # Adds normal curve overlay
    dnorm(x,
          mean=mean(airlinesat$age, na.rm=TRUE),   # Mean of normal dist
          sd=sd(airlinesat$age, na.rm=TRUE))) +    # StdDev of normal dist
  labs(x="Age", y="Density")
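The claim that the bar areas of a density histogram add up to 1 can be checked directly. The sketch below uses invented ages and assumes ggplot2 3.3.0 or later, where the density is requested with `after_stat(density)`; the `toy` data and the bin width of 10 are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical toy data; ages are invented for illustration only
toy <- data.frame(age = c(19, 23, 31, 35, 38, 42, 44, 47, 50, 52, 58, 61, 67, 73))

# Density histogram; binwidth = 10 is an arbitrary choice for this sketch
dens_plot <- ggplot(toy, aes(x = age)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 10)

# Area of each bar = height (density) times width of the bin
bars <- layer_data(dens_plot)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
```

Because the total area is 1, the histogram is on the same scale as a probability density, which is why a normal curve from `dnorm()` can be overlaid on it.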
Video Tutorial: Histograms with ggplot::geom_histogram (Part 3)
2.2.4 Box Plot
Box plots are drawn with the geom_boxplot() geom, which by default creates a box plot for a continuous y variable, but for each level of a discrete x variable. In addition, the standard box plot draws its “whiskers” as plain lines without end caps.
To get a box plot for only the continuous y variable, use x="" as the discrete x variable.
To add end caps to the whiskers, include a stat_boxplot(geom="errorbar") layer.
airlinesat %>%
  ggplot(aes(x="", y=age)) +
  geom_boxplot() +
  stat_boxplot(geom="errorbar") +   # Add end caps to the whiskers
  labs(x="",                        # Remove x axis label
       y="Age")                     # Make y axis label nicer
To make comparisons across a discrete x variable, replace the x="" from before with x=VARIABLE
airlinesat %>%
  ggplot(aes(x=flight_purpose, y=age)) +
  geom_boxplot() +
  stat_boxplot(geom="errorbar") +   # Add whiskers to box plot
  labs(x="Flight Purpose", y="Age")
2.2.5 Scatterplot
- Scatterplots are drawn with the geom_point() geom and are used to show the relationship between two continuous variables
  - Notice the warning given due to missing values (these warnings will be suppressed in other scatterplots below)
airlinesat %>%
  ggplot(aes(x=age, y=s10)) +   # s10 is satisfaction with condition of airplane
  geom_point()
2.2.5.1 Trendline
- Scatterplots become more helpful when we add a trend line.
- The most common trend line is a simple regression line, although others can be used.
  - Use geom_smooth(method="lm", se=FALSE) to add a linear trend line
airlinesat %>%
  ggplot(aes(x=age, y=s10)) +   # s10 is satisfaction with condition of airplane
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +   # Add trendline
  labs(x="Age", y="Satisfaction with Aircraft Condition")
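To see that the linear trend line is an ordinary simple regression line, the sketch below compares the y values that `geom_smooth(method="lm")` plots with predictions from `lm()` fitted to the same data. It uses the built-in `mtcars` data as a stand-in for airlinesat, and `formula = y ~ x` is spelled out only to avoid the informational message.

```r
library(ggplot2)

# mtcars ships with R; used here as stand-in data
trend_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Fit the same simple regression directly
fit <- lm(mpg ~ wt, data = mtcars)

# Layer 2 is the smooth; compare its plotted y values with lm() predictions
smooth_data <- layer_data(trend_plot, 2)
lm_pred <- predict(fit, newdata = data.frame(wt = smooth_data$x))
max_gap <- max(abs(smooth_data$y - lm_pred))
```

A `max_gap` of essentially zero means the plotted line and the `lm()` fit coincide.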
Video Tutorial: Scatterplots with ggplot::geom_point (Part 1)
2.2.5.2 Other Options
- The color, shape, and size of the points can be changed
  - In addition, they can vary by levels of a discrete variable
  - If a trend line is requested, separate trend lines will be provided for each level of the discrete variable
airlinesat %>%
  ggplot(aes(x=age, y=s10, color=flight_type)) +
  geom_point(shape=17) +
  geom_smooth(method="lm", se=FALSE) +   # Add trendline
  labs(x="Age", y="Satisfaction with Aircraft Condition", color="Flight Type")
Video Tutorial: Scatterplots with ggplot::geom_point (Part 2)
2.3 Tables and Statistics
2.3.1 Frequency Table
2.3.1.1 Base R
- The table(data$variable) function can produce a one-way frequency table
  - Wrapping the call to table with proportions will create the table with proportions (i.e., the fraction of observations in each category)
English  French  German
    233      10     822
    English      French      German
0.218779343 0.009389671 0.771830986
Table 2.1: One-way frequency table using Base R
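A minimal sketch of the two calls follows. So that it runs without the course package, it rebuilds a `language` vector from the counts shown in Table 2.1 instead of reading `airlinesat$language`. Note that `proportions()` is available from R 4.0.1; `prop.table()` is the older name for the same function.

```r
# Rebuild the language variable from the counts shown in Table 2.1
language <- rep(c("English", "French", "German"), times = c(233, 10, 822))

freq_table <- table(language)            # One-way frequency table (counts)
prop_table <- proportions(freq_table)    # Same table, as proportions

freq_table
round(prop_table, 3)                     # Limit to 3 digits after decimal point
```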
2.3.1.2 Package questionr
The freq() command from the package questionr produces nice one-way frequency tables (i.e., a frequency table for a single discrete variable)
library(questionr)
freq(airlinesat$language,   # Provide discrete variable
     cum=TRUE,              # Add cumulative percent column
     total=TRUE)            # Add total row at bottom
           n     %  val%  %cum val%cum
English  233  21.9  21.9  21.9    21.9
French    10   0.9   0.9  22.8    22.8
German   822  77.2  77.2 100.0   100.0
Total   1065 100.0 100.0 100.0   100.0
Table 2.2: One-way frequency table using questionr
2.3.2 Crosstabs
2.3.2.1 Base R
- Base R does not do a great job of easily creating cross-tabs and testing for independence of the two variables
- Using base R, a multistep process is required
  - Create the two-way frequency table using the table(rowvar, colvar) function and assign it to a separate object
    - Display the two-way freq table by just using the table name
  - Use the function proportions(tablename, margin) on the newly created object to get column, row, or total percentages
    - proportions(tablename) gives total percentages
    - proportions(tablename, 1) gives row percentages
    - proportions(tablename, 2) gives column percentages
  - Use the function chisq.test(tablename) on the newly created object to run the test of independence
# Create two way table
crosstab <- table(airlinesat$flight_purpose,   # Row variable
                  airlinesat$gender)           # Column variable
crosstab                                       # Display 2-way freq table
           female male
  Business     76  449
  Leisure     204  336
              female      male
  Business 0.2714286 0.5719745
  Leisure  0.7285714 0.4280255
        Pearson's Chi-squared test with Yates' continuity correction
data:  crosstab
X-squared = 73.386, df = 1, p-value < 2.2e-16
Table 2.3: Cross-tabs using Base R
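The full multistep process can be sketched as follows. To stay runnable without the course package, the table is rebuilt from the four counts shown above instead of being created with `table()` from airlinesat; everything after that first step applies unchanged to a table created from the data.

```r
# Rebuild the two-way table from the counts shown above
# (matrix() fills column by column: female first, then male)
crosstab <- as.table(matrix(c(76, 204, 449, 336), nrow = 2,
                            dimnames = list(flight_purpose = c("Business", "Leisure"),
                                            gender = c("female", "male"))))

total_pct <- proportions(crosstab)      # Each cell divided by the grand total
row_pct   <- proportions(crosstab, 1)   # Each cell divided by its row total
col_pct   <- proportions(crosstab, 2)   # Each cell divided by its column total

# Test of independence; for a 2x2 table chisq.test() applies
# Yates' continuity correction by default
chi <- chisq.test(crosstab)

col_pct
chi
```

The column proportions and the test statistic match the output shown above, which confirms that the displayed proportions are column percentages (margin 2).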
2.3.2.2 Alternative Packages
The following packages are not available through the BGSU Virtual Computing Lab, but can be installed if using R/RStudio on your own machine. These packages produce nicely formatted crosstabs.
2.3.2.2.1 Package sjPlot
Use the function tab_xtab(var.row=, var.col=, show.col.prc=TRUE) to get a standard crosstab with column percentages
library(sjPlot)
tab_xtab(var.row=airlinesat$flight_purpose,
         var.col=airlinesat$gender,
         show.col.prc=TRUE)
flight_purpose   gender                   Total
                 female      male
Business         76          449          525
                 27.1 %      57.2 %       49.3 %
Leisure          204         336          540
                 72.9 %      42.8 %       50.7 %
Total            280         785          1065
                 100 %       100 %        100 %
χ2=73.386 · df=1 · φ=0.265 · p=0.000
Table 2.4: Cross-tab using sjPlot
2.3.2.2.2 Package gmodels
Function CrossTable(rowvar, colvar, OPTIONS) has many options similar to SPSS
library(gmodels)
CrossTable(airlinesat$flight_purpose, airlinesat$gender,
           prop.r=FALSE,       # Exclude row percentages
           prop.t=FALSE,       # Exclude total percentages
           prop.chisq=FALSE,   # Exclude cell contribution to chi-sq
           digits=2,           # 2 digits after decimal point
           chisq=TRUE,         # Request test of independence
           format="SPSS")      # Request SPSS formatting
   Cell Contents
|-------------------------|
|                   Count |
|          Column Percent |
|-------------------------|
Total Observations in Table:  1065
                          | airlinesat$gender
airlinesat$flight_purpose |   female  |     male  | Row Total |
--------------------------|-----------|-----------|-----------|
                 Business |       76  |      449  |      525  |
                          |    27.14% |    57.20% |           |
--------------------------|-----------|-----------|-----------|
                  Leisure |      204  |      336  |      540  |
                          |    72.86% |    42.80% |           |
--------------------------|-----------|-----------|-----------|
             Column Total |      280  |      785  |     1065  |
                          |    26.29% |    73.71% |           |
--------------------------|-----------|-----------|-----------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 =  74.58406     d.f. =  1     p =  5.811064e-18
Pearson's Chi-squared test with Yates' continuity correction
------------------------------------------------------------
Chi^2 =  73.38648     d.f. =  1     p =  1.065938e-17
Minimum expected frequency: 138.0282
Table 2.5: Cross-tab using gmodels
2.3.3 Measures of Centrality and Dispersion
2.3.3.1 Base R
Any individual summary statistic can be easily calculated using Base R with functions such as:
- mean(x) for mean
- sd(x) for standard deviation
- quantile(x, .percentile) for percentiles (e.g., ‘.50’ would be median)
For summary statistics except for standard deviation, the summary(object) function can be used, where object can be a single variable or an entire data frame
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    4.00    8.00   13.42   16.00  457.00
Table 2.6: Summary statistics in R Base, one variable
      age            nflights           s10
 Min.   : 19.00   Min.   :  1.00   Min.   :  1.00
 1st Qu.: 42.00   1st Qu.:  4.00   1st Qu.: 50.00
 Median : 50.00   Median :  8.00   Median : 61.00
 Mean   : 50.42   Mean   : 13.42   Mean   : 64.54
 3rd Qu.: 58.00   3rd Qu.: 16.00   3rd Qu.: 83.00
 Max.   :101.00   Max.   :457.00   Max.   :100.00
                                   NA's   :40
Table 2.7: Summary statistics in R Base, multiple variables
Summary statistics for a continuous variable by different levels of a discrete variable can also be done in Base R using the tapply(continuous variable, discrete variable, function) function
tapply(airlinesat$nflights,         # Continuous variable to apply the function to
       airlinesat$flight_purpose,   # Discrete, grouping variable
       summary)                     # R function to apply by group
$Business
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    6.00   12.00   18.65   25.00  120.00
$Leisure
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   3.000   4.000   8.337   8.000 457.000
Table 2.8: Summary statistics in R Base, one variable, grouped
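Because summary() leaves out the standard deviation, tapply() can be called again with a different function to fill that gap. The sketch below does this on the built-in `mtcars` data, which stand in for airlinesat here: `mpg` plays the continuous variable and `cyl` the grouping variable.

```r
# mtcars ships with R; used here as stand-in data
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)   # Mean by group
group_sds   <- tapply(mtcars$mpg, mtcars$cyl, sd)     # Standard deviation by group

# Combine into one small table, one row per group
group_stats <- cbind(mean = group_means, sd = group_sds)
round(group_stats, 2)
```

Any function that takes a numeric vector (for example `median` or `max`) can be passed as the third argument in the same way.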
2.3.3.2 Package dplyr
- The dplyr package can also be used to manually create tables of summary statistics
  - One continuous variable
airlinesat %>%
  summarise(mean=mean(age),
            sd=sd(age),
            q1=quantile(age, .25),
            median=quantile(age, .50),
            q3=quantile(age, .75))
      mean       sd q1 median q3
1 50.41972 12.27464 42     50 58
Table 2.9: Summary statistics using dplyr, one variable
  - Multiple continuous variables
      age            nflights           s10
 Min.   : 19.00   Min.   :  1.00   Min.   :  1.00
 1st Qu.: 42.00   1st Qu.:  4.00   1st Qu.: 50.00
 Median : 50.00   Median :  8.00   Median : 61.00
 Mean   : 50.42   Mean   : 13.42   Mean   : 64.54
 3rd Qu.: 58.00   3rd Qu.: 16.00   3rd Qu.: 83.00
 Max.   :101.00   Max.   :457.00   Max.   :100.00
                                   NA's   :40
Table 2.10: Summary statistics using dplyr, multiple variables
  - One continuous variable by a discrete/grouping variable
airlinesat %>%
  group_by(flight_purpose) %>%
  summarise(mean=mean(age),
            sd=sd(age),
            q1=quantile(age, .25),
            median=quantile(age, .50),
            q3=quantile(age, .75))
# A tibble: 2 × 6
  flight_purpose  mean    sd    q1 median    q3
  <fct>          <dbl> <dbl> <dbl>  <dbl> <dbl>
1 Business        48.5  9.91    41     49    55
2 Leisure         52.3 14.0     43     53    63
Table 2.11: Summary statistics using dplyr, one variable, grouped
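When the same statistics are wanted for several continuous variables at once, one option (not shown in the original examples, so treat it as an assumption about how you might extend them) is dplyr's `across()` helper, available in dplyr 1.0.0 and later. The sketch uses the built-in `mtcars` data as a stand-in; `na.rm = TRUE` is included because a variable such as s10 contains missing values.

```r
library(dplyr)

# mtcars ships with R; used here as stand-in data
multi_stats <- mtcars %>%
  summarise(across(c(mpg, hp, wt),                         # Variables to summarise
                   list(mean = ~mean(.x, na.rm = TRUE),    # Mean, ignoring NAs
                        sd   = ~sd(.x, na.rm = TRUE))))    # Std. dev., ignoring NAs

multi_stats
```

The result has one column per variable and statistic, named by pattern (for example `mpg_mean` and `mpg_sd`). Adding a `group_by()` step before `summarise()` would give the same statistics for each level of a grouping variable.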
2.3.3.3 Package vtable
- Package vtable produces very nice looking tables of summary statistics, but it isn’t available in BGSU’s Virtual Computing Lab.
- Use function sumtable(data, vars="varname") to produce the table
  - One continuous variable
Variable   N     Mean  Std. Dev.  Min  Pctl. 25  Pctl. 75  Max
nflights   1065  13    20         1    4         16        457
Table 2.12: Summary statistics using vtable, one variable
  - Multiple continuous variables
sumtable(airlinesat,
         vars=c("nflights","age","s10"),   # Use `c()` for multiple variables
         add.median=TRUE)                  # Request median
Variable   N     Mean  Std. Dev.  Min  Pctl. 25  Pctl. 50  Pctl. 75  Max
nflights   1065  13    20         1    4         8         16        457
age        1065  50    12         19   42        50        58        101
s10        1025  65    21         1    50        61        83        100
Table 2.13: Summary statistics using vtable, multiple variables
  - One or more continuous variables by a discrete/grouping variable
flight_purpose   Business                  Leisure
Variable         N    Mean  SD   Median    N    Mean  SD  Median
nflights         525  19    18   12        540  8.3   21  4
age              525  48    9.9  49        540  52    14  53
s10              511  62    21   60        514  67    21  65
Table 2.14: Summary statistics using vtable, multiple variables, grouped
2.3.4 Correlation
Correlation provides a measure of the strength of association between two continuous variables.
2.3.4.1 Base R
- Base R can easily provide the correlation and a test of the correlation using the cor.test(variable1, variable2) function
  - By default, it includes only observations that are non-missing in both variables
        Pearson's product-moment correlation
data:  airlinesat$age and airlinesat$nflights
t = -3.7998, df = 1063, p-value = 0.000153
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.17461941 -0.05608231
sample estimates:
      cor
-0.115763
Table 2.15: Correlation with test in Base R
- Base R can also easily provide a correlation matrix of variables using the cor(data) function
  - By default, correlation will only be calculated for those pairs of variables that have no missing values
  - Use option use="pairwise.complete.obs" to calculate each correlation from all observations that are non-missing in both variables of that pair
  - However, Base R cannot produce a correlation matrix with p-values
# First create data frame with only variables wanted
mycorr <- airlinesat[,c("age", "nflights", "s10")]
# Use function `round` to limit to 3 digits after decimal point
round(cor(mycorr, use="pairwise.complete.obs"), 3)
            age nflights    s10
age       1.000   -0.116  0.167
nflights -0.116    1.000 -0.121
s10       0.167   -0.121  1.000
Table 2.16: Correlation matrix in Base R
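The effect of the `use=` option is easiest to see on a small example with one missing value. The `toy` data frame below is invented for illustration: with the default setting, every correlation involving `z` is NA, while `use="pairwise.complete.obs"` computes it from the rows where both variables are present.

```r
# Hypothetical toy data; z has one missing value
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2, 1, 4, 3, 5),
                  z = c(5, NA, 3, 2, 1))

cor_default  <- cor(toy)                                  # NA for any pair involving z
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")   # Uses rows complete for each pair

# Test of a single correlation; rows missing either variable are dropped
test_xz <- cor.test(toy$x, toy$z)

cor_default
round(cor_pairwise, 3)
```

The estimate from `cor.test(toy$x, toy$z)` equals the x-z entry of the pairwise matrix, because both are based on the same four complete rows.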
2.3.4.2 Package Hmisc
- The function rcorr() from the package Hmisc, which is available in the BGSU Virtual Computing Lab, can also be used to create correlation matrices
  - The rcorr() function requires a matrix, so the data frame of variables must be coerced into a matrix first
  - By default, rcorr() produces three separate matrices: correlation, number of observations, and p-values
  - Separate tables can be requested
    - rcorr(as.matrix(dataframe))[["r"]] provides the correlation matrix
    - rcorr(as.matrix(dataframe))[["P"]] provides the matrix of p-values
           age nflights   s10
age       1.00    -0.12  0.17
nflights -0.12     1.00 -0.12
s10       0.17    -0.12  1.00
n
          age nflights  s10
age      1065     1065 1025
nflights 1065     1065 1025
s10      1025     1025 1025
P
         age   nflights s10
age            2e-04    0e+00
nflights 2e-04          1e-04
s10      0e+00 1e-04
Table 2.17: Correlation matrix using Hmisc
# Use 'round()' function to limit digits in output
round(rcorr(as.matrix(mycorr))[["r"]],4)
round(rcorr(as.matrix(mycorr))[["P"]],5)
             age nflights     s10
age       1.0000  -0.1158  0.1671
nflights -0.1158   1.0000 -0.1206
s10       0.1671  -0.1206  1.0000
             age nflights     s10
age           NA  0.00015 0.00000
nflights 0.00015       NA 0.00011
s10      0.00000  0.00011      NA
Table 2.18: Separate correlation matrix output using Hmisc
2.3.4.3 Package sjPlot
- The function tab_corr() from the sjPlot package produces very nice correlation matrices
  - sjPlot is not available in BGSU’s Virtual Computing Lab
library(sjPlot)
tab_corr(mycorr,                    # Data frame of variables to use; created earlier
         na.deletion = "pairwise",  # Delete obs if either variable is missing
         corr.method = "pearson",   # Choose Pearson correlation coefficient
         show.p = TRUE,             # Show asterisks for significant correlations
         digits = 3,                # Show three decimal places
         triangle = "lower",        # Show only lower triangle
         fade.ns=FALSE)             # Do not fade insignificant correlations
           age         nflights    s10
age
nflights   -0.116***
s10        0.167***    -0.121***
Computed correlation used pearson-method with pairwise-deletion.
Table 2.19: Correlation matrix output using sjPlot
2.3.4.4 Package GGally
The ggpairs() function from package GGally can produce a combination scatterplot and correlation matrix
library(GGally)
ggpairs(mycorr,                                     # Data frame created earlier
        lower=list(continuous=wrap("smooth",        # Adds fit lines...
                                   method="lm",     # Using linear regression...
                                   se=FALSE,        # Without CI bands
                                   color="blue")),  # Color dots
        diag=list(continuous="blankDiag"))          # Sets diagonals to be blank