Visualizing Data

CS&SS 508 • Lecture 2

7 October 2025

Victoria Sass

Roadmap


Last time, we learned:

  • R and RStudio
  • Quarto headers, syntax, and chunks
  • Basics of functions, objects, and vectors
  • Base R and packages


Today, we will cover:

  • Introducing the tidyverse!
  • Basics of ggplot2
  • Advanced features of ggplot2
  • ggplot2 extensions

File Types

We mainly work with three types of files in this class:

  • .qmd: These are markdown syntax files, where you write code and plain or formatted text to make documents.
  • .R: These are R syntax files, where you write code to process and analyze data without making an output document.
  • .html (or .pdf, .docx, etc.): These are the output documents created when you Render a Quarto markdown document.

Make sure you understand the difference between the uses of these file types! Please ask for clarification if needed!

Introducing the tidyverse

Packages

Last week we discussed Base R and the fact that what makes R extremely powerful and flexible is the large number of diverse user-created packages.

What are packages again?

Recall that packages are simply collections of functions and code that others have already created, and that will make your life easier!

The package 2-step

Remember that to install a new package you use install.packages("package_name") in the console. You only need to do this once per machine (unless you want to update to a newer version of a package).

To load a package into your current session of R you use library(package_name), preferably at the beginning of your R script or Quarto document. Every time you open RStudio it’s a new session and you’ll have to call library() on the packages you want to use.
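As a small sketch of the two steps, the code below checks which packages still need installing and prints the command to run, rather than installing anything itself. The helper name is_installed() is hypothetical (it is not part of any package); requireNamespace() is base R.

```r
# Hypothetical helper: TRUE if a package is installed on this machine
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Step 1 (once per machine): find packages that still need installing
wanted <- c("gapminder", "tidyverse")
to_install <- wanted[!vapply(wanted, is_installed, logical(1))]
if (length(to_install) > 0) {
  message("Run in the console: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
}

# Step 2 (every session): load what you need, e.g. library(gapminder)
```

Printing the command instead of calling install.packages() keeps the installation step in the console, where these slides suggest running it.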

Packages

The Packages tab in the bottom-right pane of RStudio lists your installed packages.

The tidyverse

The tidyverse refers to two things:

  1. a specific package in R that loads several core packages within the tidyverse.
  2. a specific design philosophy, grammar, and focus on “tidy” data structures developed by Hadley Wickham and his team at RStudio (now named Posit).

The tidyverse package

The core packages within the tidyverse include:

  • ggplot2 (visualizations)
  • dplyr (data manipulation)
  • tidyr (data reshaping)
  • readr (data import/export)
  • purrr (iteration)
  • tibble (modern dataframe)
  • stringr (text data)
  • forcats (factors)
  • lubridate (dates and times)


The tidyverse philosophy

The principles underlying the tidyverse are:

  1. Reuse existing data structures.
  2. Compose simple functions with the pipe.
  3. Embrace functional programming.
  4. Design for humans.
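To make the second principle concrete, here is a minimal sketch using only base R and the native pipe |> (the vector x is made up for illustration). Both versions compute the same thing; the piped one reads left to right, one simple function at a time.

```r
# A plain numeric vector: an existing data structure, reused as-is
x <- c(1, 4, 9, 16)

# Nested calls read inside-out ...
nested <- round(mean(sqrt(x)), 1)

# ... while the pipe reads left to right, one simple step at a time
piped <- x |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
```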

Gapminder Data

We’ll be working with data from Hans Rosling’s Gapminder project. An excerpt of these data can be accessed through an R package called gapminder. Check the Packages tab to see if gapminder appears (unchecked) in your computer’s list of installed packages.

If it doesn’t, run install.packages("gapminder") in the console.

Now, load the gapminder package as well as the tidyverse package:

library(gapminder)
library(tidyverse)
Note: Every time you load tidyverse with library(), it tells you which individual packages it is loading, as well as any function conflicts with other packages loaded in the current session. This is useful information, but you can suppress this output by adding the message: false chunk option to your code chunk.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
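The conflicts listed above mean that, once tidyverse is loaded, a bare filter() refers to the dplyr version. A hedged sketch of how to be explicit with the package::function() form follows; it uses the built-in mtcars data as a stand-in, which is an assumption for illustration and not a dataset used in this lecture.

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)

# stats::filter() is unrelated: it applies a linear filter to a series,
# here a 3-point moving average
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
as.numeric(smoothed)
```

The masked function is still available; you just need to name its package.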

Check Out Gapminder

The data frame we will work with is called gapminder, available once you have loaded the package. Let’s see its structure:

str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...


What’s Notable Here?

  • Factor variables country and continent
    • Factors are categorical data with an underlying numeric representation
    • We’ll spend a lot of time on factors later!
  • Many observations: \(n=1704\) rows
  • For each observation, a few variables: \(p=6\) columns
  • A nested/hierarchical structure: year in country in continent
    • These are panel data!
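To see what “categorical data with an underlying numeric representation” means, here is a tiny base R sketch with a made-up factor (not the gapminder data itself):

```r
# A small made-up factor of continent names
continent <- factor(c("Asia", "Europe", "Asia", "Africa"))

levels(continent)      # the categories, sorted alphabetically by default
as.integer(continent)  # the integer code stored for each element
```

This is the same idea as the integer codes shown after each Factor in the str() output above.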

Base R plot

China <- gapminder |> 
  filter(country == "China")
plot(lifeExp ~ year, 
     data = China, 
     xlab = "Year", 
     ylab = "Life expectancy",
     main = "Life expectancy in China", 
     col = "red", 
     pch = 16)


This plot is made with one function and many arguments.

Fancier: ggplot

ggplot(data = China, 
       mapping = aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(title = "Life expectancy in China", 
       x = "Year", 
       y = "Life expectancy") +
  theme_minimal(base_size = 18)


This ggplot is made with many functions and fewer arguments in each.

ggplot2

The ggplot2 package provides an alternative toolbox for plotting.

The core idea underlying this package is the layered grammar of graphics: i.e. that we can break up elements of a plot into pieces and combine them.
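A minimal sketch of that idea, assuming a small made-up data frame called toy: each piece is stored separately and then combined with +.

```r
library(ggplot2)

# Toy data frame, made up for illustration
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

base_plot   <- ggplot(toy, aes(x = x, y = y))  # data + mapping, no geoms yet
with_points <- base_plot + geom_point()        # add a layer of points
with_both   <- with_points + geom_line()       # add a layer of lines on top

length(base_plot$layers)  # number of layers before adding geoms
length(with_both$layers)  # number of layers after adding two geoms
```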

ggplots take a bit more work to create than Base R plots, but are usually:

  • prettier
  • more professional
  • much more customizable

Layered grammar of graphics

Structure of a ggplot

ggplot graphics objects consist of two primary components:

  1. Layers, the components of a graph.

    • We add layers to a ggplot object using +.
    • This includes adding lines, shapes, and text to a plot.
  2. Aesthetics, which determine how the layers appear.

    • We set aesthetics using arguments (e.g. color = "red") inside layer functions.
    • This includes modifying locations, colors, and sizes of the layers.

Aesthetic Vignette

Learn more about all possible aesthetic mappings in the ggplot2 “Aesthetic specifications” vignette, available with vignette("ggplot2-specs").

Layers

Layers are the components of the graph, such as:

  • ggplot(): initializes basic plotting object, specifies input data
  • geom_point(): layer of scatterplot points
  • geom_line(): layer of lines
  • geom_histogram(): layer of a histogram
  • labs() (or to specify individually: ggtitle(), xlab(), ylab()): layers of labels
  • facet_wrap(): layer creating multiple plot panels
  • theme_bw(): layer replacing default gray background with black-and-white

Layers are separated by a + sign. For clarity, I usually put each layer on a new line.

Syntax Warning

Be sure to end each line with the +. If a new line begins with a +, R treats the previous line as a complete command, so the new layer is not added and the code fails with an error.
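A short sketch of the difference, using a made-up toy data frame. The failing version is left commented out so the block runs as written.

```r
library(ggplot2)

# Toy data frame, made up for illustration
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

# Works: each line ends with +, so R knows the expression continues
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# Does not work: R treats the first line as complete, then fails on "+ ..."
# ggplot(toy, aes(x = x, y = y))
#   + geom_point()
```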

Aesthetics

Aesthetics control the appearance of the layers:

  • x, y: \(x\) and \(y\) coordinate values to use
  • color: set color of elements based on some data value
  • group: describe which points are conceptually grouped together for the plot (often used with lines)
  • size: set size of points/lines based on some data value (greater than 0)
  • alpha: set transparency based on some data value (between 0 and 1)

Mapping data inside aes() vs. creating plot-wise settings outside aes()

When an aesthetic argument is given inside aes(), it names a variable in the data, and the aesthetic varies with the values of that variable (a mapping). Given outside aes(), it is a setting: a single fixed value applied to the whole layer, which does not display any dimension of the data.
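As an illustrative sketch with the gapminder data, the two plots below differ only in where color is specified. layer_data() is a ggplot2 helper that returns the data a layer actually draws, used here to count the distinct colors.

```r
library(ggplot2)
library(gapminder)

# Mapped: color inside aes() varies with the continent variable
p_mapped <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent))

# Set: color outside aes() is one fixed value for every point
p_set <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "red")

# Count the distinct colors each plot actually draws
length(unique(layer_data(p_mapped, 1)$colour))
length(unique(layer_data(p_set, 1)$colour))
```

The mapped version also gets a legend automatically, because the colors carry information about the data; the set version does not.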

ggplot Templates


All ggplots have:

an initializing ggplot() call and at least one geom function.


ggplot(data = [dataset], 
       mapping = aes(x = [x_variable], y = [y_variable])) +
  geom_xxx() +
  other options

ggplot(data = [dataset], 
       mapping = aes(x = [x_variable], y = [y_variable])) +
  geom_xxx() +
  geom_yyy(mapping = aes(x = [x_variable], y = [y_variable])) +
  other options

ggplot() +
  geom_xxx(data = [dataset1],
           mapping = aes(x = [x_variable], y = [y_variable])) +
  geom_yyy(data = [dataset2],
           mapping = aes(x = [x_variable], y = [y_variable])) +
  other options
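Here is one possible filled-in version of the last template, where each geom gets its own dataset. The data frames observed and trend are made up for illustration and stand in for [dataset1] and [dataset2].

```r
library(ggplot2)

# Two made-up data frames standing in for [dataset1] and [dataset2]
observed <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))
trend    <- data.frame(x = c(1, 5), y = c(2, 10))

p <- ggplot() +
  geom_point(data = observed, mapping = aes(x = x, y = y)) +
  geom_line(data = trend, mapping = aes(x = x, y = y), color = "red") +
  labs(x = "x", y = "y")
```

Because ggplot() is called with no data or mapping, nothing is inherited and every layer must supply both.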

Example: Basic Jargon in Action!

Axis Labels, Points, No Background

Base ggplot

ggplot(data = China,  
       aes(x = year, y = lifeExp)) 

Scatterplot

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point()

Point Color and Size

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3)

X-Axis Label

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(x = "Year")

Y-Axis Label

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(x = "Year", 
       y = "Life expectancy")

Title

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy in China")

Theme

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy in China") +
  theme_minimal()

Text Size

ggplot(data = China, 
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy in China") +
  theme_minimal(base_size = 18) 

Plotting All Countries

We have a plot we like for China…

… but what if we want all the countries?

A Mess!

ggplot(data = gapminder,
       aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 3) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal(base_size = 18) 

Lines

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp)) +
  geom_line(color = "red", size = 3) + 
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal(base_size = 18) 

Grouping

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country)) +
  geom_line(color = "red", size = 3) +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal(base_size = 18) 

Size

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country)) +
  geom_line(color = "red") +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal(base_size = 18) 

Color

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line() +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal(base_size = 18) 

Facets

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line() +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal(base_size = 18) +
  facet_wrap(vars(continent))

Text Size

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line() +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal() +
  facet_wrap(vars(continent))

No Legend

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line() +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal() +
  facet_wrap(vars(continent)) + 
  theme(legend.position = "none")

Lab 2

Make a histogram

In pairs, create a histogram of life expectancy observations in the complete Gapminder dataset.

  1. Set the base layer by specifying the data as gapminder and the x variable as lifeExp

  2. Add a second layer to create a histogram using the function geom_histogram()

  3. Customize your plot with nice axis labels and a title.

  4. Add the color “salmon” to the entire plot (hint: use the fill argument, not color).

  5. Change this fill setting to an aesthetic and map continent onto it.

  6. Change the geom to geom_freqpoly. What happened and how might you fix it?

  7. Add facets for continent (create only 1 column).

  8. Add one of the built-in themes from ggplot2.

  9. Remove the legend from the plot.

Solution: 1. Set Base Layer

ggplot(gapminder, aes(x = lifeExp))

Solution: 2. Add Histogram Layer

ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(bins = 30)

Solution: 3. Add Label Layers

ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(bins = 30) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data")

Solution: 4. Add fill setting

ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(bins = 30, fill = "salmon") +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data")

Solution: 5. Add fill aesthetic

ggplot(gapminder, aes(x = lifeExp, fill = continent)) +
  geom_histogram(bins = 30) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data")

Solution: 6. Change geometry

ggplot(gapminder, aes(x = lifeExp, fill = continent)) +
  geom_freqpoly(bins = 30) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data")

Solution: 6. Change geometry (fix: map color instead of fill)

ggplot(gapminder, aes(x = lifeExp, color = continent)) +
  geom_freqpoly(bins = 30) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data")

Solution: 7. Add facets

ggplot(gapminder, aes(x = lifeExp, color = continent)) +
  geom_freqpoly(bins = 30) +
  facet_wrap(vars(continent), ncol = 1) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data")

Solution: 8. Add nicer theme

ggplot(gapminder, aes(x = lifeExp, color = continent)) +
  geom_freqpoly(bins = 30) +
  facet_wrap(vars(continent), ncol = 1) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data") +
  theme_minimal() 

Solution: 9. Remove legend

ggplot(gapminder, aes(x = lifeExp, color = continent)) +
  geom_freqpoly(bins = 30) +
  facet_wrap(vars(continent), ncol = 1) +
  xlab("Life Expectancy") +
  ylab("Count") +
  ggtitle("Histogram of Life Expectancy in Gapminder Data") + 
  theme_minimal() + 
  theme(legend.position = "none") 

Break!

Advanced ggplot tools

Further customization

Next, we’ll discuss:

  • Storing, modifying, and saving ggplots

  • Advanced axis changes (scales, text, ticks)

  • Legend changes (scales, colors, locations)

  • Using multiple geoms

  • Adding annotation for emphasis

Storing Plots

We can assign a ggplot object to a name:

lifeExp_by_year <- 
  ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, 
           group = country, 
           color = continent)) +
  geom_line() +
  labs(x = "Year",
       y = "Life expectancy",
       title = "Life expectancy over time") + 
  theme_minimal() + 
  facet_wrap(vars(continent)) +
  theme(legend.position = "none")

Afterwards, you can display or modify ggplots…

Showing a Stored Graph

lifeExp_by_year

Overriding previous specifications

lifeExp_by_year + 
  facet_grid(cols = vars(continent)) 

Adding More Layers

lifeExp_by_year +
  facet_grid(cols = vars(continent)) + 
    theme(legend.position = "bottom")

Saving ggplot Plots

If you want to save a ggplot, use ggsave():

ggsave(filename = "I_saved_a_file.pdf", 
       plot = lifeExp_by_year,
       height = 3, width = 5, units = "in")

If you didn’t manually set font sizes, these will usually come out at a reasonable size given the dimensions of your output file.
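A self-contained sketch of saving a plot, assuming a made-up toy data frame and a temporary folder as the destination so nothing in your project is overwritten. ggsave() picks the output format from the file extension.

```r
library(ggplot2)

# Toy data frame, made up for illustration
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))
p <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# Save to a temporary folder; the .pdf extension selects the PDF format
out_file <- file.path(tempdir(), "toy_plot.pdf")
ggsave(filename = out_file, plot = p,
       height = 3, width = 5, units = "in")

file.exists(out_file)
```

In your own work you would replace out_file with a path inside your project folder.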

Changing the Axes

We can modify the axes in a variety of ways, such as:

  • Change the \(x\) or \(y\) range using xlim() or ylim() layers

  • Change to a logarithmic or square-root scale on either axis: scale_x_log10(), scale_y_sqrt()

  • Change where the major/minor breaks are: scale_x_continuous(breaks = value(s), minor_breaks = value(s))
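As a hedged example of the last two ideas, the sketch below sets the y range with ylim() and places major and minor x-axis breaks by hand. The specific break values are arbitrary choices for illustration.

```r
library(ggplot2)
library(gapminder)

# Base R subset, so this block does not depend on dplyr
china <- subset(gapminder, country == "China")

p <- ggplot(china, aes(x = year, y = lifeExp)) +
  geom_line() +
  scale_x_continuous(breaks = seq(1950, 2010, by = 20),       # labeled ticks
                     minor_breaks = seq(1950, 2010, by = 5)) + # unlabeled gridlines
  ylim(0, 80)                                                  # y-axis range
```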

Axis Changes

ggplot(data = China, aes(x = year, y = gdpPercap)) +
    geom_line() +
    xlim(1940, 2010) + 
    scale_y_log10(breaks = c(1000, 2000, 3000, 4000, 5000)) + 
    ggtitle("Chinese GDP per capita")

Precise Legend Position

lifeExp_by_year +
  # For a legend inside the plot pane, give x and y coordinates between 0 and 1:
  # c(0, 0) is the bottom-left corner and c(1, 1) is the top-right corner
  theme(legend.position = "inside", legend.position.inside = c(0.8, 0.2))

Instead of plot-pane coordinates, you could also set legend.position to "top", "bottom", "left", or "right".

Scales for Color, Shape, etc.

Scales are layers that control how the mapped aesthetics appear.

You can modify these with a scale_[aesthetic]_[option]() layer:

  • [aesthetic] is x, y, color, shape, linetype, alpha, size, fill, etc.
  • [option] is something like manual, continuous, binned or discrete (depending on nature of the variable).

Examples:

  • scale_alpha_ordinal(): scales alpha transparency for ordinal categorical variable
  • scale_x_log10(): maps a log10 transformation of the x-axis variable
  • scale_color_manual(): allows manual specification of color aesthetic

Legend Name and Manual Colors

lifeExp_by_year +
  theme(legend.position = "inside", legend.position.inside = c(0.8, 0.2)) +
  # This scale knows to "map" onto continent because continent is specified
  # as the color aesthetic in our original ggplot object
  scale_color_manual(
    name = "Which continent are\nwe looking at?",  # \n adds a line break
    values = c("Africa" = "#80719e", "Americas" = "#fdc57e", 
               "Asia" = "#c55347", "Europe" = "#007190", "Oceania" = "#648f7b"))

Fixed versus Free Scales

gapminder_sub <- gapminder |> 
  filter(year %in% c(1952, 1982, 2002))  # subset with only 3 years of the data
  
scales_plot <- ggplot(data = gapminder_sub,
                      aes(x = lifeExp, y = gdpPercap, fill = continent)) +
              # alpha controls transparency: 0 is fully transparent, 1 is fully opaque
              geom_jitter(alpha = 0.5,
                          # shape 21 is a circle with a separate outline (color) and interior (fill)
                          pch = 21,
                          size = 3,            # increase size of points
                          color = "black") +   # outline of circle is black
              # fill colors that are perceptible for various forms of color-blindness
              scale_fill_viridis_d(option = "D") +
              # facet by year in the rows and by continent in the columns
              facet_grid(rows = vars(year),
                         cols = vars(continent)) +
              # a nice theme from the ggthemes package, with larger text throughout
              ggthemes::theme_tufte(base_size = 20)
scales_plot

scales_plot + 
  # Put the y-axis on a log scale so the data are easier to see
  scale_y_log10(breaks = c(250, 1000, 10000, 50000, 115000))

scales_plot + 
  scale_y_log10(breaks = c(250, 1000, 10000, 50000, 115000)) +
  facet_grid(rows = vars(year), 
             cols = vars(continent), 
             scales = "free_x")  # let the x-axis range vary by data

scales_plot + 
  scale_y_log10(breaks = c(250, 1000, 10000, 50000, 115000)) +
  facet_grid(rows = vars(year), 
             cols = vars(continent), 
             scales = "free_y")  # let the y-axis range vary by data

scales_plot + 
  scale_y_log10(breaks = c(250, 1000, 10000, 50000, 115000)) +
  facet_grid(rows = vars(year), 
             cols = vars(continent), 
             scales = "free")  # let both axis ranges vary by data

Using multiple geoms

ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.colour = "maroon")

Using multiple geoms

ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.colour = "maroon") +
  geom_point(alpha = 0.25)

Using multiple geoms

ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.colour = "maroon") +
  geom_jitter(alpha = 0.25)

Using multiple geoms

ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.colour = "maroon") +
  geom_jitter(position = position_jitter(width = 0.1, height = 0),
              alpha = 0.25)
Note: our outliers are repeated here because both geoms draw them. We’ll clean that up in the next slide…

Annotating specific datapoints for emphasis

outliers <- gapminder |> 
  group_by(continent) |> 
  mutate(outlier = case_when(
    # anything lower than the 1st quartile - 1.5*IQR
    quantile(lifeExp, probs = 0.25) - (IQR(lifeExp) * 1.5) > lifeExp ~ "outlier",
    # anything higher than the 3rd quartile + 1.5*IQR
    quantile(lifeExp, probs = 0.75) + (IQR(lifeExp) * 1.5) < lifeExp ~ "outlier",
    .default = NA)
    ) |> 
  filter(!is.na(outlier)) |>         # remove non-outliers (coded as missing in previous step)
  ungroup() |> group_by(country) |>  # regroup by country
  filter(lifeExp == min(lifeExp))    # keep just the minimum life expectancy for each country

outliers
# A tibble: 13 × 7
# Groups:   country [13]
   country                continent  year lifeExp      pop gdpPercap outlier
   <fct>                  <fct>     <int>   <dbl>    <int>     <dbl> <chr>  
 1 Albania                Europe     1952    55.2  1282697     1601. outlier
 2 Bosnia and Herzegovina Europe     1952    53.8  2791000      974. outlier
 3 Bulgaria               Europe     1952    59.6  7274900     2444. outlier
 4 Haiti                  Americas   1952    37.6  3201488     1840. outlier
 5 Libya                  Africa     2002    72.7  5368585     9535. outlier
 6 Mauritius              Africa     2007    72.8  1250882    10957. outlier
 7 Montenegro             Europe     1952    59.2   413834     2648. outlier
 8 Portugal               Europe     1952    59.8  8526050     3068. outlier
 9 Reunion                Africa     1992    73.6   622191     6101. outlier
10 Rwanda                 Africa     1992    23.6  7290203      737. outlier
11 Serbia                 Europe     1952    58.0  6860147     3581. outlier
12 Tunisia                Africa     2002    73.0  9770575     5723. outlier
13 Turkey                 Europe     1952    43.6 22235677     1969. outlier
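The 1.5*IQR rule used above can be checked by hand on a small example. This sketch uses a made-up vector x and base R only; the names lower_fence and upper_fence are illustrative.

```r
# Made-up values with one unusually low and one unusually high observation
x <- c(1, 10, 11, 12, 13, 14, 30)

q1 <- quantile(x, probs = 0.25)   # 1st quartile
q3 <- quantile(x, probs = 0.75)   # 3rd quartile
lower_fence <- q1 - 1.5 * IQR(x)  # anything below this is flagged
upper_fence <- q3 + 1.5 * IQR(x)  # anything above this is flagged

is_outlier <- x < lower_fence | x > upper_fence
x[is_outlier]
```

Here the quartiles are 10.5 and 13.5, so the IQR is 3, the fences are 6 and 18, and the flagged values are 1 and 30.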
no_outliers <- gapminder |> 
  group_by(continent) |> 
  mutate(outlier = case_when(
    quantile(lifeExp, probs = 0.25) - (IQR(lifeExp) * 1.5) > lifeExp ~ "outlier",
    quantile(lifeExp, probs = 0.75) + (IQR(lifeExp) * 1.5) < lifeExp ~ "outlier",
    .default = NA)) |> 
  filter(is.na(outlier))  # remove outliers from original data

no_outliers
# A tibble: 1,679 × 7
# Groups:   continent [5]
   country     continent  year lifeExp      pop gdpPercap outlier
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <chr>  
 1 Afghanistan Asia       1952    28.8  8425333      779. <NA>   
 2 Afghanistan Asia       1957    30.3  9240934      821. <NA>   
 3 Afghanistan Asia       1962    32.0 10267083      853. <NA>   
 4 Afghanistan Asia       1967    34.0 11537966      836. <NA>   
 5 Afghanistan Asia       1972    36.1 13079460      740. <NA>   
 6 Afghanistan Asia       1977    38.4 14880372      786. <NA>   
 7 Afghanistan Asia       1982    39.9 12881816      978. <NA>   
 8 Afghanistan Asia       1987    40.8 13867957      852. <NA>   
 9 Afghanistan Asia       1992    41.7 16317921      649. <NA>   
10 Afghanistan Asia       1997    41.8 22227415      635. <NA>   
# ℹ 1,669 more rows
ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.shape = NA) +  # remove outliers from boxplot geom
  geom_jitter(data = no_outliers,     # points not categorized as outliers, without color
              position = position_jitter(width = 0.1, height = 0), 
              alpha = 0.25, 
              size = 3) + 
  geom_jitter(data = outliers,        # points categorized as outliers, with color
              color = "maroon",
              position = position_jitter(width = 0.1, height = 0), 
              alpha = 0.7, 
              size = 3) +
  geom_text(data = outliers,          # only add identifying text to outlier points
            aes(label = country),
            color = "maroon", 
            size = 8) + 
  theme_minimal(base_size = 18)

library(ggrepel)  # provides additional ggplot2 geoms that repel overlapping text labels
ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(data = no_outliers, 
              position = position_jitter(width = 0.1, height = 0), 
              alpha = 0.25, 
              size = 3) + 
  geom_jitter(data = outliers, 
              color = "maroon",
              position = position_jitter(width = 0.1, height = 0), 
              alpha = 0.7, 
              size = 3) +
  geom_label_repel(data = outliers, 
                   aes(label = country), 
                   color = "maroon", 
                   alpha = 0.7,  # allow points to be somewhat visible through text labels
                   size = 8, 
                   # tolerance for overlapping labels (default is 10; 13 was chosen
                   # so none of the outlier labels would be removed)
                   max.overlaps = 13) +
  theme_minimal(base_size = 18)

Bonus: Advanced Example!

End Result

We’re going to slowly build up a really detailed plot now!

Base ggplot

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) 

What might be a good geom layer for this data?

Lines

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() 

Let’s also add a continent-specific average so we can visualize country-deviations from the regional average.

Continent Average

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  # A loess curve is something like a moving average
  geom_line(stat = "smooth", 
            method = "loess",
            aes(group = continent)) 

We can’t quite distinguish the averages from everything else yet. Let’s facet by continent and start mapping aesthetics to our data to visualize things more clearly.

Facets

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  geom_line(stat = "smooth", 
            method = "loess", 
            aes(group = continent)) +
  facet_wrap(vars(continent),
             nrow = 2)
Note: You can specify the faceting variable by wrapping the variable name in vars() (preferred), using ~ variable_name notation, or quoting variable name(s) as a character vector.
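The three ways of naming the faceting variable can be compared side by side. This is a sketch; n_panels() is a hypothetical helper written for this example, and layer_data() is the ggplot2 function that returns the data a layer draws, including its panel.

```r
library(ggplot2)
library(gapminder)

base_plot <- ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line()

p_vars    <- base_plot + facet_wrap(vars(continent), nrow = 2)  # vars() (preferred)
p_formula <- base_plot + facet_wrap(~ continent, nrow = 2)      # formula notation
p_string  <- base_plot + facet_wrap("continent", nrow = 2)      # character vector

# Hypothetical helper: number of panels a faceted plot draws
n_panels <- function(p) length(unique(layer_data(p, 1)$PANEL))

c(n_panels(p_vars), n_panels(p_formula), n_panels(p_string))
```

All three produce one panel per continent, so the choice is a matter of style.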

Facets allow us to gain a clearer understanding of the regional patterns. We want to differentiate the continent-average line from the country-specific lines, though, so let’s change its color.

Color Scale

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  geom_line(stat = "smooth", 
            method = "loess", 
            aes(group = continent)) +
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:",  # create informative legend title
                     # specify mapping variables and their respective color values
                     values = c("Country" = "black", "Continent" = "blue"))

Hmm, can’t quite see the blue line yet. Let’s make it bigger?

Size Scale

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  geom_line(stat = "smooth", 
            method = "loess", 
            aes(group = continent)) +
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  # Use the same legend title as the previous scale to combine
  # separate aesthetics into one legend
  scale_size_manual(name = "Life Exp. for:",
                    # specify mapping variables and their respective size values
                    values = c("Country" = 0.25, "Continent" = 3))

It doesn’t look like our color and size scales are actually mapping onto our variables. Why is that?

Mapping Color & Size

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line(aes(color = "Country", size = "Country")) +
  geom_line(stat = "smooth", method = "loess", 
            aes(group = continent, color = "Continent", size = "Continent")) +
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  scale_size_manual(name = "Life Exp. for:", 
                    values = c("Country" = 0.25, "Continent" = 3))
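The scales had nothing to act on: scale_color_manual() and scale_size_manual() only affect aesthetics that are mapped inside aes(). Here we add color and size mappings to both the Country- and Continent-specific line geoms.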

Huzzah! Let’s change the transparency on these lines a touch so we can see all our data more easily.

Alpha (Transparency)

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
  geom_line(alpha = 0.5, 
            aes(color = "Country", size = "Country")) +
  geom_line(stat = "smooth", method = "loess", 
            aes(group = continent, color = "Continent", size = "Continent"), 
            alpha = 0.5) +
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  scale_size_manual(name = "Life Exp. for:", 
                    values = c("Country" = 0.25, "Continent" = 3))

Now we’re getting somewhere! We can also add useful labels and clean up the theme.

Theme and Labels

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
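  geom_line(alpha = 0.5, 
            aes(color = "Country", size = "Country")) +
  geom_line(stat = "smooth", method = "loess", 
            aes(group = continent, color = "Continent", size = "Continent"), 
            alpha = 0.5) +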
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  scale_size_manual(name = "Life Exp. for:", 
                    values = c("Country" = 0.25, "Continent" = 3)) +
  # Add a nicer theme and increase relative font size throughout plot
  theme_minimal(base_size = 14) +
  # x is calendar year and y is years of life expectancy, so label y "Years" and
  # drop the x label (it can be inferred from the plot title we'll add next)
  labs(y = "Years",
       x = "")

What’s our plot showing? We should be explicit about that.

Title and Subtitle

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
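  geom_line(alpha = 0.5, 
            aes(color = "Country", size = "Country")) +
  geom_line(stat = "smooth", method = "loess", 
            aes(group = continent, color = "Continent", size = "Continent"), 
            alpha = 0.5) +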
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  scale_size_manual(name = "Life Exp. for:", 
                    values = c("Country" = 0.25, "Continent" = 3)) +
  theme_minimal(base_size = 14) + 
  labs(y = "Years", 
       x = "", 
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country")

The x-axis feels a little busy right now…

Angled Tick Values

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
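  geom_line(alpha = 0.5, 
            aes(color = "Country", size = "Country")) +
  geom_line(stat = "smooth", 
            method = "loess", 
            aes(group = continent, color = "Continent", size = "Continent"), 
            alpha = 0.5) +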
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  scale_size_manual(name = "Life Exp. for:", 
                    values = c("Country" = 0.25, "Continent" = 3)) +
  theme_minimal(base_size = 14) + 
  labs(y = "Years", 
       x = "", 
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  theme(axis.text.x = element_text(angle = 45))
1
The theme() function has many arguments that allow you to provide more granular, non-data, aesthetic customizations, such as rotating the x-axis text in this example.
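
Rotating the text is one option; another is to show fewer tick labels in the first place. The following is a minimal sketch of that idea using scale_x_continuous(breaks = ...). It assumes a small made-up data frame (toy) in place of gapminder, and the object name p_ticks is hypothetical.

```r
library(ggplot2)

# Toy data standing in for one country's series (values are made up)
toy <- data.frame(
  year    = seq(1952, 2007, by = 5),
  lifeExp = seq(40, 62, length.out = 12)
)

p_ticks <- ggplot(toy, aes(x = year, y = lifeExp)) +
  geom_line() +
  # Label only every 20th year instead of rotating the axis text
  scale_x_continuous(breaks = seq(1960, 2000, by = 20))

p_ticks
```

The same scale_x_continuous() layer could be added to the faceted gapminder plot in place of the rotated axis text.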

Note: fewer tick values might be better than angled labels! Finally, let’s move our legend so it isn’t wasting space.

Legend Position

ggplot(data = gapminder, 
       aes(x = year, y = lifeExp, group = country)) +
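  geom_line(alpha = 0.5, 
            aes(color = "Country", size = "Country")) +
  geom_line(stat = "smooth", 
            method = "loess", 
            aes(group = continent, color = "Continent", size = "Continent"), 
            alpha = 0.5) +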
  facet_wrap(~ continent, 
             nrow = 2) +
  scale_color_manual(name = "Life Exp. for:", 
                     values = c("Country" = "black", "Continent" = "blue")) +
  scale_size_manual(name = "Life Exp. for:", 
                    values = c("Country" = 0.25, "Continent" = 3)) +
  theme_minimal(base_size = 14) + 
  labs(y = "Years", 
       x = "", 
       title = "Life Expectancy, 1952-2007", 
       subtitle = "By continent and country") +
  theme(legend.position = "inside", legend.position.inside = c(0.82, 0.15), 
        axis.text.x = element_text(angle = 45))

Voilà!

ggplot Extensions!

tidyverse extended universe

ggplot2 can obviously do a lot on its own. But because R allows anyone and everyone to expand the functionality of what already exists, numerous extensions to ggplot2 have been created.

We’ve already seen one example with ggrepel. But let’s look at a few others…

geomtextpath

If you want your labels to follow along the path of your plot (and maintain proper angles and spacing) try using geomtextpath.

library(geomtextpath)
gapminder |> 
  filter(country %in% c("Cuba", "Haiti", "Dominican Republic")) |>
  ggplot(aes(x = year, 
             y = lifeExp, 
             color = country, 
             label = country)) +
  geom_textpath() +
  theme(legend.position = "none")
1
Run install.packages("geomtextpath") in console first
2
Restricting data to 3 regionally-specific countries
3
Specify label with text to appear
4
Adding textpath geom to put labels within lines
5
Removing legend

ggridges

We can visualize the differing distributions of a continuous variable by levels of a categorical variable with ggridges!

library(ggridges)
ggplot(gapminder, 
       aes(x = lifeExp, 
           y = continent, 
           fill = continent, 
           color = continent)) +
  geom_density_ridges(alpha = 0.5,
                      show.legend = FALSE)
1
Run install.packages("ggridges") in console first
2
Add ridges, make all ridges a bit transparent, remove legend

ggwordcloud

If you are working with text data, you may want to visualize the sentiment of words in your documents. You can use ggwordcloud for that.

library(ggwordcloud)
library(tidytext)
library(textdata)
library(janitor)

CSSS_508_Introductions <- read_csv(
  "CSSS 508 Introductions.csv",
  col_select = `What is one word that best describes your feelings about taking this class?`
)

feelings <- CSSS_508_Introductions |> 
  rename(text = `What is one word that best describes your feelings about taking this class?`) |> 
  mutate(text = str_to_lower(text)) |>
  unnest_tokens(word, text) |>
  filter(!word %in% stop_words$word) |>
  mutate(word = str_replace_all(word, "[[:punct:]]", "")) |>
  filter(!word %in%  c("", "bit", "audit", "level", "understanding", "rr"))

bing <- get_sentiments("bing")

word_sentiments <- feelings |> 
  inner_join(bing, by = "word") |>
  count(word, sentiment, sort = TRUE)

ggplot(word_sentiments, 
       aes(label = word, size = n, color = sentiment)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 30) + 
  scale_color_manual(values = c("positive" = "#007190", "negative" = "#c56127")) +
  theme_minimal() +
  theme(legend.position = "bottom")
1
Run install.packages("ggwordcloud") in console first
2
A package that provides functions to transform text to enable various types of text analyses
3
Provides various dictionaries to categorize words for sentiment analysis (e.g. bing)
4
Allows us to standardize text format (i.e. make all words lowercase)
5
Reads in the responses to a specific question from last week’s introductory survey
6
Makes all letters lowercase
7
Takes character strings with any number of words and splits them into individual-word-strings (each row of new word variable is one word)
8
Removes “stop words” (e.g. and, the, it) from word
9
Removes any punctuation
10
Explicitly removes non-feeling related words
11
Loads basic positive/negative sentiment dictionary (bing)
12
Adds sentiment column (with values positive/negative) for the words in word variable
13
Gets a count of the words (new variable n) to allow us to map by frequency
14
For geom_text_wordcloud we want to map onto the aesthetics label, size, and color, and then we can use familiar ggplot functions to add further specifications (e.g. custom colors, themes, etc.)

Feelings about taking Introduction to R (Fall 2025)

Correlation Matrices

Make visually appealing & informative correlation plots with GGally or ggcorrplot.

library(GGally)

ggcorr(swiss, 
       geom = "circle", 
       min_size = 25,
       max_size = 25,
       label = TRUE,
       label_alpha = TRUE,
       label_round = 2,
       legend.position = c(0.2, 0.75), 
       legend.size = 12)
1
Run install.packages("GGally") in console first
2
Specify minimum size of shape
3
Specify maximum size of shape
4
Label circles with correlation coefficient
5
Weaker correlations have lower alpha
6
Round correlation coefficients to 2 decimal places

library(ggcorrplot)

corr <- round(cor(swiss), 1)
p_mat <- cor_pmat(swiss)

ggcorrplot(corr,
           hc.order = TRUE,
           type = "lower",
           p.mat = p_mat,
           insig = "pch",
           outline.color = "black",
           ggtheme = ggthemes::theme_tufte(),
           colors = c("#4e79a7", "white", "#e15759")) +
  theme(legend.position = "inside", legend.position.inside = c(0.15, 0.67))
1
Run install.packages("ggcorrplot") in console first
2
Compute correlation matrix
3
Compute matrix of correlation p-values
4
Use hierarchical clustering to group like-correlations together
5
Only show lower half of correlation matrix
6
Give corresponding p-values for correlation matrix
7
Add default shape (an X) to correlations that are insignificant
8
Outline cells in black
9
Using a specific theme I like from ggthemes package
10
Specify custom colors
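
To make the two inputs to the plot above concrete, here is a minimal base R sketch of what a correlation matrix and a matching matrix of p-values contain. It assumes Pearson correlations and calls cor.test() pair by pair; this illustrates the idea and is not necessarily how cor_pmat() works internally. The names vars and p_manual are made up for this example.

```r
# swiss ships with base R (datasets package), so nothing needs installing
corr <- round(cor(swiss), 1)   # pairwise Pearson correlations, rounded

# p_manual is a hypothetical name: one cor.test() p-value per pair of columns
vars <- names(swiss)
p_manual <- sapply(vars, function(a) {
  sapply(vars, function(b) cor.test(swiss[[a]], swiss[[b]])$p.value)
})

corr["Fertility", "Education"]      # correlation for one pair
p_manual["Fertility", "Education"]  # and the p-value for the same pair
```

Cells with a large p-value are the ones a plot would typically flag as insignificant.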

ggpairs(swiss, 
        lower = list(continuous = wrap("smooth",
                                       alpha = 0.5, 
                                       size=0.2))) + 
  ggthemes::theme_tufte()
1
Specify a smoothing line added to scatterplots
2
Add nice theme from ggthemes

patchwork

Combine separate plots into the same graphic using patchwork.

library(patchwork)

plot_lifeExp <- ggplot(gapminder,
                       aes(x = lifeExp, 
                           y = continent, 
                           fill = continent, 
                           color = continent)) + 
  geom_density_ridges(alpha = 0.5, show.legend = FALSE)

plot_boxplot <- ggplot(gapminder,
                       aes(x = continent, 
                           y = lifeExp, 
                           color = continent)) +
  ggplot2::geom_boxplot(outlier.shape = NA, varwidth = TRUE) +
  coord_flip() +
  geom_jitter(data = outliers,
              color = "black",
              position = position_jitter(width = 0.1, height = 0), 
              alpha = 0.6) + 
  geom_jitter(data = no_outliers,
              position = position_jitter(width = 0.1, height = 0), 
              alpha = 0.25) + 
  geom_label_repel(data = outliers,
                   aes(label = country), 
                   color = "black", 
                   alpha = 0.6, 
                   max.overlaps = 13) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "none")

plot_lifeExp + plot_boxplot
1
Run install.packages("patchwork") in console first
2
Create first plot object
3
Create second plot object
4
Remove geom_boxplot outliers and make width of boxes relative to N
5
Flip the coordinates (x & y) to align with first plot
6
Add outlier datapoints
7
Add non-outlier datapoints
8
Mapping new dataset with the outliers
9
Remove y-axis text
10
Remove y-axis ticks
11
Remove y-axis title
12
Adding both objects together places them side by side
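
Beyond +, patchwork also provides | (side by side) and / (stacked) for more explicit layouts, plus plot_annotation() for a shared title. A minimal sketch, assuming the built-in mtcars data as a stand-in for gapminder; the object names p1, p2, p3, and combined are hypothetical.

```r
library(ggplot2)
library(patchwork)

# Three small plots built from mtcars, which ships with R
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
p3 <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)

# `|` places plots side by side, `/` stacks them vertically
combined <- (p1 | p2) / p3 +
  plot_annotation(title = "Three views of mtcars")

combined
```

The parentheses control grouping: here the first two plots share the top row and the third spans the bottom row.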

themes in ggplot2

There are several built-in themes within ggplot2.

plot_lifeExp + theme_bw()
1
Reusing plot_lifeExp from previous slide and changing theme

plot_lifeExp + theme_light()

plot_lifeExp + theme_classic()

plot_lifeExp + theme_linedraw()

plot_lifeExp + theme_dark()

plot_lifeExp + theme_minimal()

plot_lifeExp + theme_gray()

plot_lifeExp + theme_void()

ggthemes

library(ggthemes)
plot_lifeExp + theme_excel()

plot_lifeExp + theme_economist()

plot_lifeExp + theme_few()

plot_lifeExp + theme_fivethirtyeight()

plot_lifeExp + theme_gdocs()

plot_lifeExp + theme_stata()

plot_lifeExp + theme_tufte()

plot_lifeExp + theme_wsj()

Other theme packages and making your own!

These are just a handful of all the ready-made theme options available out there. Some other packages that might be useful/fun to check out:

  • hrbrthemes - provides typography-centric themes and theme components for ggplot2
  • urbnthemes - a set of tools for creating Urban Institute-themed plots and maps in R
  • bbplot - provides helpful functions for creating and exporting graphics made in ggplot in the style used by the BBC News data team
  • ggpomological - A ggplot2 theme based on the USDA Pomological Watercolor Collection

You can also design your own theme using the theme() function and really get into the weeds of specifying all the non-data ink in your plot. Once you come up with a theme you like, you can save it as an object (e.g. my_theme) and add it to any ggplot you create to maintain your own unique and consistent style.
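
As a minimal sketch of that workflow: start from a built-in theme, layer theme() tweaks on top, and store the result. The specific settings below are arbitrary illustrations, the built-in mtcars data stands in for your own, and the object name p_themed is hypothetical.

```r
library(ggplot2)

# Build a reusable theme object: a complete theme plus a few overrides
my_theme <- theme_minimal(base_size = 14) +
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold"),
        panel.grid.minor = element_blank())

# Add the saved theme to any plot, like any other layer
p_themed <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "Fuel economy by weight", color = "Cylinders") +
  my_theme

p_themed
```

Because my_theme is an ordinary object, it can be defined once at the top of a script or Quarto document and reused on every plot.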

Summary

ggplot2 can do a LOT! I don’t expect you to memorize all these tools, and neither should you! With time and practice, you’ll start to remember the key tools.

Homework