Visualizing Data

CS&SS 508 • Lecture 2

2 April 2024

Victoria Sass

Roadmap


Last time, we learned about:

  • R and RStudio
  • Quarto headers, syntax, and chunks
  • Basics of functions, objects, and vectors
  • Base R and packages


Today, we will cover:

  • Introducing the tidyverse!
  • Basics of ggplot2
  • Advanced features of ggplot2
  • ggplot2 extensions

File Types

We mainly work with three types of files in this class:

  • .qmd: These are Quarto markdown files, where you write code and plain or formatted text to make documents.
  • .R: These are R script files, where you write code to process and analyze data without making an output document.
  • .html (or .pdf, .docx, etc.): These are the output documents created when you Render a Quarto markdown document.

Make sure you understand the difference between the uses of these file types! Please ask for clarification if needed!

Introducing the tidyverse

Packages

Last week we discussed Base R and noted that much of R’s power and flexibility comes from its large number of diverse, user-created packages.

What are packages again?

Recall that packages are simply collections of functions and tools that others have already created and that will make your life easier!

The package 2-step

Remember that to install a new package you use install.packages("package_name") in the console. You only need to do this once per machine (unless you want to update to a newer version of a package).

To load a package into your current session of R you use library(package_name), preferably at the beginning of your R script or Quarto document. Every time you open RStudio it’s a new session and you’ll have to call library() on the packages you want to use.
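As a rough sketch of the two steps, the snippet below checks whether a package is installed before installing it, and then loads it. The helper name is_installed() is made up for illustration; requireNamespace() is the base R function doing the work. The install and load lines are left as comments so the sketch runs without changing anything on your machine.

```r
# Hypothetical helper (not part of any package): is `pkg` installed on this machine?
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this check should come back TRUE
stats_available <- is_installed("stats")
stats_available

# Step 1 (once per machine): install only if missing, e.g.
#   if (!is_installed("gapminder")) install.packages("gapminder")
# Step 2 (every new session): load the package, e.g.
#   library(gapminder)
```

Checking first avoids re-downloading a package every time a script is run.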

Packages

The Packages tab in the bottom-right pane of RStudio lists your installed packages.

The tidyverse

The tidyverse refers to two things:

  1. a specific package in R that loads several core packages within the tidyverse.
  2. a specific design philosophy, grammar, and focus on “tidy” data structures developed by Hadley Wickham and his team at RStudio (now named Posit).

The tidyverse package

The core packages within the tidyverse include:

  • ggplot2 (visualizations)
  • dplyr (data manipulation)
  • tidyr (data reshaping)
  • readr (data import/export)
  • purrr (iteration)
  • tibble (modern dataframe)
  • stringr (text data)
  • forcats (factors)
  • lubridate (dates and times)


The tidyverse philosophy

The principles underlying the tidyverse are:

  1. Reuse existing data structures.
  2. Compose simple functions with the pipe.
  3. Embrace functional programming.
  4. Design for humans.
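To make principle 2 a little more concrete, here is a small sketch using R’s native pipe |> (the same pipe used in the code later in this lecture). The numbers are made up; the point is only that a piped chain and a nested call compute the same thing, but the piped version reads left to right.

```r
x <- c(4, 9, 16)  # made-up values

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Piped call: read left to right, one simple function per step
piped <- x |> sqrt() |> sum()

identical(nested, piped)
```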

Gapminder Data

We’ll be working with data from Hans Rosling’s Gapminder project. An excerpt of these data can be accessed through an R package called gapminder. Check the Packages tab to see if gapminder appears (unchecked) in your computer’s list of installed packages.

If it doesn’t, run install.packages("gapminder") in the console.

Now, load the gapminder package as well as the tidyverse package:

library(gapminder)
library(tidyverse)
Every time you load tidyverse with library(), it reports which individual packages it is attaching, as well as any function conflicts it has with other packages loaded in the current session. This is useful information, but you can suppress this output by adding the message: false chunk option to your code chunk.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
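For example, a chunk in a .qmd file that loads both packages without printing the startup messages might look like the sketch below. The #| prefix is how Quarto reads options written at the top of a chunk; R itself treats those lines as ordinary comments.

```r
#| message: false
library(gapminder)
library(tidyverse)
```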

Check Out Gapminder

The data frame we will work with is called gapminder, available once you have loaded the package. Let’s see its structure:

str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...


What’s Notable Here?

  • Factor variables country and continent
    • Factors are categorical data with an underlying numeric representation
    • We’ll spend a lot of time on factors later!
  • Many observations: \(n=1704\) rows
  • For each observation, a few variables: \(p=6\) columns
  • A nested/hierarchical structure: year in country in continent
    • These are panel data!
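As a quick illustration of the “underlying numeric representation” of factors, here is a minimal base R sketch on a made-up vector (not the gapminder data). Each distinct label becomes a level, and each element is stored as the integer position of its level.

```r
# A small made-up character vector turned into a factor
continent <- factor(c("Asia", "Europe", "Africa", "Asia"))

levels(continent)      # the categories, sorted alphabetically by default
as.integer(continent)  # the integer code stored for each element
```

This is why str(gapminder) prints integers such as 3 3 3 for continent: they are level codes, not the labels themselves.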

Base R plot

China <- gapminder |> 
  filter(country == "China")
plot(lifeExp ~ year, 
     data = China, 
     xlab = "Year", 
     ylab = "Life expectancy",
     main = "Life expectancy in China", 
     col = "red", 
     pch = 16)


This plot is made with one function and many arguments.

Fancier: ggplot

ggplot(data = China, 
       mapping = aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(title = "Life expectancy in China", 
       x = "Year", 
       y = "Life expectancy") +
  theme_minimal(base_size = 18)


This ggplot is made with many functions and fewer arguments in each.

ggplot2

The ggplot2 package provides an alternative toolbox for plotting.

The core idea underlying this package is the layered grammar of graphics: i.e. that we can break up elements of a plot into pieces and combine them.

ggplots take a bit more work to create than Base R plots, but are usually:

  • prettier
  • more professional
  • much more customizable

Layered grammar of graphics

Structure of a ggplot

ggplot graphics objects consist of two primary components:

  1. Layers, the components of a graph.

    • We add layers to a ggplot object using +.
    • This includes adding lines, shapes, and text to a plot.
  2. Aesthetics, which determine how the layers appear.

    • We set aesthetics using arguments (e.g. color = "red") inside layer functions.
    • This includes modifying locations, colors, and sizes of the layers.

Aesthetic Vignette

Learn more about all possible aesthetic mappings in the ggplot2 aesthetic specifications vignette: run vignette("ggplot2-specs") in the console.

Layers

Layers are the components of the graph, such as:

  • ggplot(): initializes basic plotting object, specifies input data
  • geom_point(): layer of scatterplot points
  • geom_line(): layer of lines
  • geom_histogram(): layer of a histogram
  • labs() (or, to specify individually, ggtitle(), xlab(), ylab()): layers of labels
  • facet_wrap(): layer creating multiple plot panels
  • theme_bw(): layer replacing default gray background with black-and-white

Layers are separated by a + sign. For clarity, I usually put each layer on a new line.

Syntax Warning

Be sure to end each line with the +. The code will not run if a new line begins with a +.
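The warning above can be sketched with a tiny made-up data frame (the values are invented purely for illustration). The first version ends each line with +, so R keeps reading; the commented-out version shows the placement to avoid.

```r
library(ggplot2)

# Made-up toy data, only for illustration
toy <- data.frame(year = c(1952, 1957, 1962),
                  value = c(10, 12, 15))

# Works: each line ends with +, so R knows the expression continues
p <- ggplot(data = toy, aes(x = year, y = value)) +
  geom_point() +
  geom_line()

# Does not work as intended: R treats the first line as a complete plot,
# then fails on the line that starts with +
# ggplot(data = toy, aes(x = year, y = value))
#   + geom_point()

length(p$layers)  # number of geom layers added to the plot
```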

Aesthetics

Aesthetics control the appearance of the layers:

  • x, y: \(x\) and \(y\) coordinate values to use
  • color: set color of elements based on some data value
  • group: describe which points are conceptually grouped together for the plot (often used with lines)
  • size: set size of points/lines based on some data value (greater than 0)
  • alpha: set transparency based on some data value (between 0 and 1)

Mapping data inside aes() vs. creating plot-wise settings outside aes()

When an aesthetic argument is given inside aes(), it names a variable in the data, and the values of that variable are mapped onto the aesthetic. Given outside aes(), the same argument is only a setting: it takes one specific value and does not display a dimension of the data.
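A minimal sketch of the difference, using a made-up data frame (the column names x, y, and grp are invented for this example): the first plot maps color to a variable, the second sets color to a single value.

```r
library(ggplot2)

# Made-up toy data, only for illustration
toy <- data.frame(x = 1:4,
                  y = c(2, 4, 3, 5),
                  grp = c("a", "a", "b", "b"))

# Mapping: color varies with the values of `grp`, and a legend is drawn
p_mapped <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = grp), size = 3)

# Setting: every point is red, and no legend is needed
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(color = "red", size = 3)
```

A common slip is writing aes(color = "red"): that maps the constant string "red" as if it were data, so the points get a default palette color and a legend entry labeled “red” rather than actually being drawn in red.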

ggplot Templates


All ggplots have an initializing ggplot() call and at least one geom function.


ggplot(data = [dataset],
       mapping = aes(x = [x-variable], y = [y-variable])) +
  geom_xxx() +
  other options

ggplot(data = [dataset],
       mapping = aes(x = [x-variable], y = [y-variable])) +
  geom_xxx() +
  geom_yyy(mapping = aes(x = [x-variable], y = [y-variable])) +
  other options

ggplot() +
  geom_xxx(data = [dataset1],
           mapping = aes(x = [x-variable], y = [y-variable])) +
  geom_yyy(data = [dataset2],
           mapping = aes(x = [x-variable], y = [y-variable])) +
  other options
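The last template, where each geom receives its own dataset, could be filled in roughly as follows. Both data frames are made up for illustration (units_df and means_df are not objects from this lecture), and aggregate() from base R is used only to build the second one.

```r
library(ggplot2)

# Made-up toy data: two units measured in two years
units_df <- data.frame(year  = c(1990, 2000, 1990, 2000),
                       value = c(1, 2, 3, 5),
                       unit  = c("A", "A", "B", "B"))

# A second data frame holding the per-year mean of the toy values
means_df <- aggregate(value ~ year, data = units_df, FUN = mean)

# Empty ggplot() call; data and mappings are supplied layer by layer
p <- ggplot() +
  geom_line(data = units_df,
            mapping = aes(x = year, y = value, group = unit)) +
  geom_point(data = means_df,
             mapping = aes(x = year, y = value),
             size = 3)
```

Typing p on its own line draws the plot: one line per unit, with the per-year means overlaid as points.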

Example: Basic Jargon in Action!

Axis Labels, Points, No Background

Base ggplot

ggplot(data = China,  
       aes(x = year, y = lifeExp)) 

Scatterplot

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point()

Point Color and Size

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3)

X-Axis Label

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(x = "Year")

Y-Axis Label

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(x = "Year", 
       y = "Life expectancy")

Title

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy in China")

Theme

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy in China") +
  theme_minimal()

Text Size

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy in China") +
  theme_minimal(base_size = 18) 

Plotting All Countries

We have a plot we like for China…

… but what if we want all the countries?

A Mess!

ggplot(data = gapminder,
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal(base_size = 18) 

Lines

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp)) +
  geom_line(color = "red", size = 3) + 
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal(base_size = 18) 

Grouping

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country)) +
  geom_line(color = "red", size = 3) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal(base_size = 18) 

Size

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country)) +
  geom_line(color = "red") +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal(base_size = 18) 

Color

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line() +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal(base_size = 18) 

Facets

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line() +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal(base_size = 18) +
  facet_wrap(vars(continent))

Text Size

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line() +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal() +
  facet_wrap(vars(continent))

No Legend

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line() +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal() +
  facet_wrap(vars(continent)) + 
  theme(legend.position = "none")

Lab 2

Make a histogram

In pairs, create a histogram of life expectancy observations in the complete Gapminder dataset.

  1. Set the base layer by specifying the data as gapminder and the x variable as lifeExp

  2. Add a second layer to create a histogram using the function geom_histogram()

  3. Customize your plot with nice axis labels and a title.

  4. Add the color “salmon” to the entire plot (hint: use the fill argument, not color).

  5. Change this fill setting to an aesthetic and map continent onto it.

  6. Change the geom to geom_freqpoly. What happened and how might you fix it?

  7. Add facets for continent (create only 1 column).

  8. Add one of the built-in themes from ggplot2.

  9. Remove the legend from the plot.

Solution: 1. Set Base Layer

ggplot(gapminder, aes(x = lifeExp))

Solution: 2. Add Histogram Layer

ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(bins = 30)

Solution: 3. Add Label Layers

ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(bins = 30) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data")

Solution: 4. Add fill setting

ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(bins = 30, fill = "salmon") +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data")

Solution: 5. Add fill aesthetic

ggplot(gapminder, aes(x = lifeExp, fill = continent)) +
  geom_histogram(bins = 30) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data")

Solution: 6. Change geometry

ggplot(gapminder, aes(x = lifeExp, fill = continent)) +
  geom_freqpoly(bins = 30) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data")

Solution: 6. Change geometry

ggplot(gapminder, aes(x = lifeExp, color = continent)) +
  geom_freqpoly(bins = 30) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data")

Solution: 7. Add facets

ggplot(gapminder, aes(x = lifeExp, color = continent)) +
  geom_freqpoly(bins = 30) +
  facet_wrap(vars(continent), ncol = 1) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data")

Solution: 8. Add nicer theme

ggplot(gapminder, aes(x = lifeExp, color = continent)) +
  geom_freqpoly(bins = 30) +
  facet_wrap(vars(continent), ncol = 1) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data") +
  theme_minimal() 

Solution: 9. Remove legend

ggplot(gapminder, aes(x = lifeExp, color = continent)) +
  geom_freqpoly(bins = 30) +
  facet_wrap(vars(continent), ncol = 1) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data") + 
  theme_minimal() + 
  theme(legend.position = "none") 

Break!

Advanced ggplot tools

Further customization

Next, we’ll discuss:

  • Storing, modifying, and saving ggplots

  • Advanced axis changes (scales, text, ticks)

  • Legend changes (scales, colors, locations)

  • Using multiple geoms

  • Adding annotation for emphasis

Storing Plots

We can assign a ggplot object to a name:

lifeExp_by_year <- 
  ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line() +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal() + 
  facet_wrap(vars(continent)) +
  theme(legend.position = "none")

Afterwards, you can display or modify ggplots…

Showing a Stored Graph

lifeExp_by_year

Overriding previous specifications

lifeExp_by_year + 
  facet_grid(cols = vars(continent)) 

Adding More Layers

lifeExp_by_year +
  facet_grid(cols = vars(continent)) + 
  theme(legend.position = "bottom")

Saving ggplot Plots

If you want to save a ggplot, use ggsave():

ggsave(filename = "I_saved_a_file.pdf", 
       plot = lifeExp_by_year,
       height = 3, width = 5, units = "in")

If you didn’t manually set font sizes, these will usually come out at a reasonable size given the dimensions of your output file.
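Here is a self-contained sketch of ggsave() that can be run without touching your project folder. It uses the built-in mtcars data and writes to R’s temporary directory; the file name example_plot.pdf is arbitrary. ggsave() picks the output format from the file extension.

```r
library(ggplot2)

# A small plot from the built-in mtcars data, just to have something to save
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Write to a temporary folder so nothing in your project is touched
out_file <- file.path(tempdir(), "example_plot.pdf")

ggsave(filename = out_file, plot = p,
       width = 5, height = 3, units = "in")

file.exists(out_file)  # check that the file was written
```

Passing plot = explicitly is a good habit: if it is left out, ggsave() saves whichever plot was displayed last, which may not be the one you meant.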

Changing the Axes

We can modify the axes in a variety of ways, such as:

  • Change the \(x\) or \(y\) range using xlim() or ylim() layers

  • Change to a logarithmic or square-root scale on either axis: scale_x_log10(), scale_y_sqrt()

  • Change where the major/minor breaks are: scale_x_continuous(breaks =, minor_breaks = )

Axis Changes

ggplot(data = China, aes(x = year, y = gdpPercap)) +
  geom_line() +
  scale_y_log10(breaks = c(1000, 2000, 3000, 4000, 5000)) +
  xlim(1940, 2010) +
  ggtitle("Chinese GDP per capita")

Precise Legend Position

lifeExp_by_year +
  theme(legend.position = c(0.8, 0.2)) 

Instead of coordinates, you could also use “top”, “bottom”, “left”, or “right”.

Scales for Color, Shape, etc.

Scales are layers that control how the mapped aesthetics appear.

You can modify these with a scale_[aesthetic]_[option]() layer:

  • [aesthetic] is x, y, color, shape, linetype, alpha, size, fill, etc.
  • [option] is something like manual, continuous, binned or discrete (depending on nature of the variable).

Examples:

  • scale_alpha_ordinal(): scales alpha transparency for ordinal categorical variable
  • scale_x_log10(): maps a log10 transformation of the x-axis variable
  • scale_color_manual(): allows manual specification of color aesthetic

Legend Name and Manual Colors

lifeExp_by_year +
  theme(legend.position = c(0.8, 0.2)) +
  scale_color_manual(
    name = "Which continent are\nwe looking at?", # \n adds a line break 
    values = c("Africa" = "#4e79a7", "Americas" = "#f28e2c", 
               "Asia" = "#e15759", "Europe" = "#76b7b2", "Oceania" = "#59a14f"))



Note

This scale layer knows to “map” onto continent because continent is specified as the color aesthetic in our original ggplot object.

Fixed versus Free Scales

gapminder_sub <- gapminder |> 
  filter(year %in% c(1952, 1982, 2002)) # create subset with only 3 years
  
scales_plot <- ggplot(data = gapminder_sub, 
       aes(x = lifeExp, y = gdpPercap, fill = continent)) + 
  geom_jitter(alpha = 0.5, # alpha of points halfway transparent
              pch = 21, # shape is a circle with fill
              size = 3, # increase size
              color = "black") + # outline of circle is black 
  scale_fill_viridis_d(option = "D") + # fill circles with colors perceptible under various forms of color blindness
  facet_grid(rows = vars(year), # facet by years in the row
             cols = vars(continent)) + # facet by continent in the columns
  ggthemes::theme_tufte(base_size = 20) # increase base text size
scales_plot

scales_plot + scale_y_log10(breaks = c(250, 1000, 10000, 50000, 115000)) # put the y-axis on a log10 scale so the values are easier to compare

scales_plot + scale_y_log10(breaks = c(250, 1000, 10000, 50000, 115000)) +
  facet_grid(rows = vars(year), 
             cols = vars(continent), 
             scales = "free_x") # make the x axis vary by data 

scales_plot + scale_y_log10(breaks = c(250, 1000, 10000, 50000, 115000)) +
  facet_grid(rows = vars(year), 
             cols = vars(continent), 
             scales = "free_y") # make the y axis vary by data 

scales_plot + scale_y_log10(breaks = c(250, 1000, 10000, 50000, 115000)) +
  facet_grid(rows = vars(year), 
             cols = vars(continent), 
             scales = "free") # make both axes vary by data 

Using multiple geoms

ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.colour = "maroon")

ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.colour = "maroon") +
  geom_point(alpha = 0.25)

ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.colour = "maroon") +
  geom_jitter(alpha = 0.25)

ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.colour = "maroon") +
  geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 0.25)

Annotating specific datapoints for emphasis

# `outliers` and `no_outliers` are the data frames created in the code further below
ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.colour = "maroon", outlier.size = 3) +
  geom_jitter(data = no_outliers, position = position_jitter(width = 0.1, height = 0), alpha = 0.25, size = 3) + 
  geom_text(data = outliers, aes(label = country), color = "maroon", size = 8) + 
  theme_minimal(base_size = 18)

library(ggrepel)
ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.colour = "maroon", outlier.size = 3) +
  geom_jitter(data = no_outliers, position = position_jitter(width = 0.1, height = 0), alpha = 0.25, size = 3) + 
  geom_label_repel(data = outliers, aes(label = country), color = "maroon", alpha = 0.7, size = 8, max.overlaps = 13) + 
  theme_minimal(base_size = 18)

outliers <- gapminder |> 
  group_by(continent) |> 
  mutate(outlier = case_when(quantile(lifeExp, probs = 0.25) - (IQR(lifeExp) * 1.5) > lifeExp ~ "outlier", # anything lower than the 1st quartile - 1.5*IQR 
                             quantile(lifeExp, probs = 0.75) + (IQR(lifeExp) * 1.5) < lifeExp ~ "outlier", # anything higher than the 3rd quartile + 1.5*IQR
                             .default = NA)) |> 
  filter(!is.na(outlier)) |> # remove non-outliers
  ungroup() |> group_by(country) |> # regroup by country
  filter(lifeExp == min(lifeExp)) # filter just the min for each country

outliers
# A tibble: 13 × 7
# Groups:   country [13]
   country                continent  year lifeExp      pop gdpPercap outlier
   <fct>                  <fct>     <int>   <dbl>    <int>     <dbl> <chr>  
 1 Albania                Europe     1952    55.2  1282697     1601. outlier
 2 Bosnia and Herzegovina Europe     1952    53.8  2791000      974. outlier
 3 Bulgaria               Europe     1952    59.6  7274900     2444. outlier
 4 Haiti                  Americas   1952    37.6  3201488     1840. outlier
 5 Libya                  Africa     2002    72.7  5368585     9535. outlier
 6 Mauritius              Africa     2007    72.8  1250882    10957. outlier
 7 Montenegro             Europe     1952    59.2   413834     2648. outlier
 8 Portugal               Europe     1952    59.8  8526050     3068. outlier
 9 Reunion                Africa     1992    73.6   622191     6101. outlier
10 Rwanda                 Africa     1992    23.6  7290203      737. outlier
11 Serbia                 Europe     1952    58.0  6860147     3581. outlier
12 Tunisia                Africa     2002    73.0  9770575     5723. outlier
13 Turkey                 Europe     1952    43.6 22235677     1969. outlier
no_outliers <- gapminder |> 
  group_by(continent) |> 
  mutate(outlier = case_when(quantile(lifeExp, probs = 0.25) - (IQR(lifeExp) * 1.5) > lifeExp ~ "outlier",
                             quantile(lifeExp, probs = 0.75) + (IQR(lifeExp) * 1.5) < lifeExp ~ "outlier", 
                             .default = NA)) |> 
  filter(is.na(outlier)) # remove outliers

no_outliers
# A tibble: 1,679 × 7
# Groups:   continent [5]
   country     continent  year lifeExp      pop gdpPercap outlier
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <chr>  
 1 Afghanistan Asia       1952    28.8  8425333      779. <NA>   
 2 Afghanistan Asia       1957    30.3  9240934      821. <NA>   
 3 Afghanistan Asia       1962    32.0 10267083      853. <NA>   
 4 Afghanistan Asia       1967    34.0 11537966      836. <NA>   
 5 Afghanistan Asia       1972    36.1 13079460      740. <NA>   
 6 Afghanistan Asia       1977    38.4 14880372      786. <NA>   
 7 Afghanistan Asia       1982    39.9 12881816      978. <NA>   
 8 Afghanistan Asia       1987    40.8 13867957      852. <NA>   
 9 Afghanistan Asia       1992    41.7 16317921      649. <NA>   
10 Afghanistan Asia       1997    41.8 22227415      635. <NA>   
# ℹ 1,669 more rows
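Both data frames above rely on the same 1.5 × IQR rule described in the code comments. As a hedged sketch of that rule on its own, here it is applied to a small made-up vector in base R (the variable names are chosen for this example only):

```r
x <- c(10, 12, 13, 14, 15, 16, 40)  # made-up values

q1  <- quantile(x, probs = 0.25)  # 1st quartile
q3  <- quantile(x, probs = 0.75)  # 3rd quartile
iqr <- IQR(x)                     # interquartile range, q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
flagged <- x[x < lower_fence | x > upper_fence]
flagged
```

In the gapminder code, the same comparison is done inside group_by(continent), so the fences are computed separately for each continent.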

Bonus: Advanced Example!

End Result

We’re going to slowly build up a really detailed plot now!

Base ggplot

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) 

Lines

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() 

Continent Average

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  geom_line(stat = "smooth", 
            method = "loess", 
            aes(group = continent)) 

Facets

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  geom_line(stat = "smooth", 
            method = "loess", 
            aes(group = continent)) +
  facet_wrap(~ continent, 
             nrow = 2)

Color Scale

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  geom_line(stat = "smooth", 
            method = "loess", 
            aes(group = continent)) +
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue"))

Size Scale

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  geom_line(stat = "smooth", 
            method = "loess", 
            aes(group = continent)) +
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  scale_size_manual(name = "Life Exp. for:", 
                    values = c("Country" = 0.25, "Continent" = 3))

Mapping Color & Size

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line(aes(color = "Country", size = "Country")) +
  geom_line(stat = "smooth", method = "loess", 
            aes(group = continent, color = "Continent", size = "Continent")) +
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  scale_size_manual(name = "Life Exp. for:", 
                    values = c("Country" = 0.25, "Continent" = 3))

Alpha (Transparency)

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line(alpha = 0.5, 
            aes(color = "Country", size = "Country")) +
  geom_line(stat = "smooth", method = "loess", 
            aes(group = continent, color = "Continent", size = "Continent"), 
            alpha = 0.5) +
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  scale_size_manual(name = "Life Exp. for:", 
                    values = c("Country" = 0.25, "Continent" = 3))

Theme and Labels

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
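  geom_line(alpha = 0.5, 
            aes(color = "Country", size = "Country")) +
  geom_line(stat = "smooth", method = "loess", 
            aes(group = continent, color = "Continent", size = "Continent"), 
            alpha = 0.5) +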
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  scale_size_manual(name = "Life Exp. for:", 
                    values = c("Country" = 0.25, "Continent" = 3)) +
  theme_minimal(base_size = 14) + 
  labs(y = "Years", 
       x = "")

Title and Subtitle

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
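  geom_line(alpha = 0.5, 
            aes(color = "Country", size = "Country")) +
  geom_line(stat = "smooth", method = "loess", 
            aes(group = continent, color = "Continent", size = "Continent"), 
            alpha = 0.5) +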
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  scale_size_manual(name = "Life Exp. for:", 
                    values = c("Country" = 0.25, "Continent" = 3)) +
  theme_minimal(base_size = 14) + 
  labs(y = "Years", 
       x = "", 
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country")

Angled Tick Values

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
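  geom_line(alpha = 0.5, 
            aes(color = "Country", size = "Country")) +
  geom_line(stat = "smooth", method = "loess", 
            aes(group = continent, color = "Continent", size = "Continent"), 
            alpha = 0.5) +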
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  scale_size_manual(name = "Life Exp. for:", 
                    values = c("Country" = 0.25, "Continent" = 3)) +
  theme_minimal(base_size = 14) + 
  labs(y = "Years", 
       x = "", 
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  theme(axis.text.x = element_text(angle = 45)) 

Legend Position

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
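  geom_line(alpha = 0.5, 
            aes(color = "Country", size = "Country")) +
  geom_line(stat = "smooth", method = "loess", 
            aes(group = continent, color = "Continent", size = "Continent"), 
            alpha = 0.5) +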
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  scale_size_manual(name = "Life Exp. for:", 
                    values = c("Country" = 0.25, "Continent" = 3)) +
  theme_minimal(base_size = 14) + 
  labs(y = "Years", 
       x = "", 
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  theme(legend.position = c(0.82, 0.15), 
        axis.text.x = element_text(angle = 45))

ggplot Extensions!

tidyverse extended universe

ggplot2 can do a lot on its own. But because R allows anyone to expand the functionality of what already exists, numerous extensions to ggplot2 have been created.

We’ve already seen one example with ggrepel. But let’s look at a few others…

geomtextpath

If you want your labels to follow along the path of your plot (and maintain proper angles and spacing) try using geomtextpath.

# install.packages("geomtextpath") <- run in console first
library(geomtextpath)
gapminder |> 
  filter(country %in% c("Cuba", "Haiti", "Dominican Republic")) |> # restricting data to 3 regionally-specific countries
  ggplot(aes(x = year, 
             y = lifeExp, 
             color = country, 
             label = country)) + # specify label with text to appear
  geom_textpath() + # adding textpath geom to put labels within lines
  theme(legend.position = "none") # removing legend

ggridges

We can visualize the differing distributions of a continuous variable by levels of a categorical variable with ggridges!

# install.packages("ggridges") <- run in console first
library(ggridges)
ggplot(gapminder, 
       aes(x = lifeExp, 
           y = continent, 
           fill = continent, 
           color = continent)) +
  geom_density_ridges(alpha = 0.5, 
                      show.legend = FALSE) # add ridges, make all a bit transparent, remove legend

Correlation Matrices

Make visually appealing & informative correlation plots with GGally or ggcorrplot.

# install.packages("GGally") <- run in console first
library(GGally)

ggcorr(swiss, 
       geom = "circle", 
       min_size = 25, # specify minimum size of shape 
       max_size = 25, # specify maximum size of shape 
       label = TRUE, # label circles with correlation coefficient
       label_alpha = TRUE, # less strong correlations have lower alpha
       label_round = 2, # round correlation coefficients to 2 decimal places
       legend.position = c(0.15, 0.6), 
       legend.size = 12)

Code
# install.packages("ggcorrplot") <- run in console first
library(ggcorrplot)

# compute correlation matrix
corr <- round(cor(swiss), 1)
# compute matrix of correlation p-values
p_mat <- cor_pmat(swiss)

ggcorrplot(corr,
           hc.order = TRUE, # use hierarchical clustering to group like-correlations together
           type = "lower", # only show lower half of correlation matrix
           p.mat = p_mat, # give corresponding p-values for correlation matrix
           insig = "pch", # add default shape (an X) to correlations that are insignificant
           outline.color = "black", # outline cells in black
           ggtheme = ggthemes::theme_tufte(), # using a specific theme I like from ggthemes package 
           colors = c("#4e79a7", "white", "#e15759")) + # specify custom colors 
  theme(legend.position = c(0.15, 0.67))
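
To see what is being plotted, it can help to look at the underlying numbers first. The sketch below uses only base R and the built-in swiss data. It assumes that cor_pmat() tabulates a test like cor.test() for every pair of variables; the object name test_fe is made up for illustration.

```r
# Base R only: `swiss` ships with the datasets package
corr <- round(cor(swiss), 1) # pairwise Pearson correlations, rounded to 1 decimal

dim(corr)            # one row and one column per variable
isSymmetric(corr)    # cor(x, y) equals cor(y, x), so the matrix is symmetric
all(diag(corr) == 1) # every variable correlates perfectly with itself

# p-value for a single pair of variables (hypothetical object name)
test_fe <- cor.test(swiss$Fertility, swiss$Education)
test_fe$estimate # the correlation coefficient for this pair
test_fe$p.value  # a small p-value means the cell would not get an X in the plot
```

Because the matrix is symmetric, showing only the lower half (type = "lower") loses no information.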

Code
ggpairs(swiss, 
        lower = list(continuous = wrap("smooth", # specify a smoothing line added to scatterplots
                                       alpha = 0.5, 
                                       size = 0.2))) + 
  ggthemes::theme_tufte() # add nice theme from ggthemes

patchwork

Combine separate plots into the same graphic using patchwork.

Code
# install.packages("patchwork") <- run in console first
library(patchwork)

# Create first plot object
plot_lifeExp <- ggplot(gapminder, 
                       aes(x = lifeExp, y = continent, fill = continent, color = continent)) +
  geom_density_ridges(alpha = 0.5, show.legend = FALSE)

# Create second plot object
plot_boxplot <- ggplot(gapminder, 
                       aes(x = continent, y = lifeExp, color = continent)) +
  geom_boxplot(outlier.colour = "black", varwidth = TRUE) + # change outlier color and make width of boxes relative to N
  coord_flip() + # flip the coordinates (x & y) to align with first plot
  geom_jitter(position = position_jitter(width = 0.1, height = 0), # add datapoints to boxplot
              alpha = 0.25) + 
  geom_label_repel(data = outliers, # mapping new dataset with the outliers
                   aes(label = country), 
                   color = "black", 
                   alpha = 0.7, 
                   max.overlaps = 13) +
  theme(axis.text.y = element_blank(), # remove y axis text 
        axis.ticks.y = element_blank(), # remove y axis ticks 
        axis.title.y = element_blank(), # remove y axis title 
        legend.position = "none")

plot_lifeExp + plot_boxplot # simply add two objects together to place side by side
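
Beyond +, patchwork also provides | (side by side) and / (stacked) for more control over the layout. Here is a minimal sketch, using three simple plots made up for illustration (p_hist, p_box, and p_point are not from earlier slides):

```r
library(ggplot2)
library(gapminder)
library(patchwork)

# Three small illustrative plots (hypothetical names)
p_hist <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(bins = 30)
p_box <- ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot()
p_point <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.3) +
  scale_x_log10()

side_by_side <- p_hist | p_box # place plots next to each other
stacked <- p_hist / p_box      # place plots on top of each other

# Two plots in the top row, one wide plot below, plus a shared title and panel tags
combined <- (p_hist | p_box) / p_point +
  plot_annotation(title = "Life expectancy in gapminder",
                  tag_levels = "A")
combined
```

The parentheses matter: in R, / binds more tightly than |, so they make sure the top row is built before it is stacked on the bottom plot.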

themes in ggplot2

There are several built-in themes within ggplot2.

Code
plot_lifeExp + theme_bw() # reusing plot_lifeExp from previous slide and changing theme

Code
plot_lifeExp + theme_light()

Code
plot_lifeExp + theme_classic()

Code
plot_lifeExp + theme_linedraw()

Code
plot_lifeExp + theme_dark()

Code
plot_lifeExp + theme_minimal()

Code
plot_lifeExp + theme_gray()

Code
plot_lifeExp + theme_void()

ggthemes

Code
library(ggthemes)
plot_lifeExp + theme_excel()

Code
plot_lifeExp + theme_economist()

Code
plot_lifeExp + theme_few()

Code
plot_lifeExp + theme_fivethirtyeight()

Code
plot_lifeExp + theme_gdocs()

Code
plot_lifeExp + theme_stata()

Code
plot_lifeExp + theme_tufte()

Code
plot_lifeExp + theme_wsj()

Other theme packages and making your own!

These are just a handful of all the ready-made theme options available out there. Some other packages that might be useful/fun to check out:

  • hrbrthemes - provides typography-centric themes and theme components for ggplot2
  • urbnthemes - a set of tools for creating Urban Institute-themed plots and maps in R
  • bbplot - provides helpful functions for creating and exporting graphics made in ggplot in the style used by the BBC News data team
  • ggpomological - A ggplot2 theme based on the USDA Pomological Watercolor Collection

You can also design your own theme using the theme() function, which lets you get into the weeds and specify all the non-data ink in your plot. Once you come up with a theme you like, you can save it as an object (e.g. my_theme) and add it to any ggplot you create to maintain your own unique and consistent style.
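
As a minimal sketch of that idea, the code below builds a my_theme object on top of theme_minimal() and adds it to a plot. The specific settings (font size, bold title, legend at the bottom) and the plot p are only illustrative choices, not a recommended style.

```r
library(ggplot2)
library(gapminder)

# Start from a built-in theme, then override individual elements
my_theme <- theme_minimal(base_size = 14) +
  theme(plot.title = element_text(face = "bold"), # bold plot title
        panel.grid.minor = element_blank(),       # drop minor gridlines
        legend.position = "bottom")               # move legend below the plot

# An example plot to style (hypothetical, for illustration)
p <- ggplot(gapminder, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot(alpha = 0.5) +
  labs(title = "Life expectancy by continent")

p + my_theme # add the saved theme like any other layer

# Optional: make it the default for every plot in this session
old_theme <- theme_set(my_theme) # theme_set() returns the previous default
p                                # now drawn with my_theme automatically
theme_set(old_theme)             # restore the previous default
```

Adding my_theme plot by plot keeps the styling explicit, while theme_set() is handy when every plot in a document should share one look.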

Summary

ggplot2 can do a LOT! I don’t expect you to memorize all these tools, and you shouldn’t expect that of yourself either! With time and practice, you’ll start to remember the key tools.

Homework