Manipulating and Summarizing Data

CS&SS 508 • Lecture 4

16 April 2024

Victoria Sass

Roadmap


Last time, we learned:

  • Best Practices
    • Code Style
    • Workflow
  • Reproducible Research
  • Indexing vectors & dataframes in Base R


Today, we will cover:

  • Types of Data
    • Logical Operators
  • Subsetting data
  • Modifying data
  • Summarizing data
  • Merging data

Death to Spreadsheets

Tools like Excel or Google Sheets let you manipulate spreadsheets using functions.

  • Spreadsheets are not reproducible: It’s hard to know how someone changed the raw data!
  • It’s hard to catch mistakes when you use spreadsheets1.

Today, we’ll use R to manipulate data more transparently and reproducibly.

How is data stored in R?

Under the hood, R stores different types of data in different ways.

  • e.g., R knows that 4.0 is a number, and that "Vic" is not a number.

So what exactly are the common data types, and how do we know what R is doing?


  • Logicals (logical)
  • Factors (factor)
  • Date/Date-time (Date, POSIXct, POSIXt)
  • Numbers (integer, double)
  • Missing Values (NA, NaN, Inf)
  • Character Strings (character)
  • c(FALSE, TRUE, TRUE)
  • factor(c("red", "blue"))
  • as_Date(c("2018-10-04"))
  • c(1, 10*3, 4, -3.14)
  • c(NA, NA, NA, NaN, NaN, NA)
  • c("red", "blue", "blue")

Logical Operators

Booleans

The simplest data type is a Boolean, or binary, variable: TRUE or FALSE1.

More often than not our data don’t actually have a variable with this data type, but they are definitely created and evaluated in the data manipulation and summarizing process.

Logical operators refer to base functions which allow us to test if a condition is present between two objects.

For example, we may test

  • Is A equal to B?
  • Is A greater than B?
  • Is A within B?

Naturally, these types of expressions produce a binary outcome of T or F which enables us to transform our data in a variety of ways!

Logical Operators in R

Comparing objects

  • ==:
  • !=:
  • >, >=, <, <=:
  • %in%:
  • is equal to1
  • not equal to
  • less than, less than or equal to, etc.
  • used when checking if equal to one of several values

Combining comparisons

  • &:
  • |:
  • !:
  • xor():
  • both conditions need to hold (AND)
  • at least one condition needs to hold (OR)
  • inverts a logical condition (TRUE becomes FALSE, vice versa)
  • exclusive OR (i.e. x or y but NOT both)

Unexpected Behavior

Be careful using == with numbers:

x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x
[1] 1 2
x == c(1, 2)
print(x, digits = 16)
1
Computers store numbers with a fixed number of decimal places so there’s no way to precisely represent decimals.
2
dplyr::near() is a useful alternative which ignores small differences.
[1] FALSE FALSE
[1] 0.9999999999999999 2.0000000000000004

Similarly mysterious, missing values (NA) represent the unknown. Almost anything conditional involving NAs will also be unknown:

NA > 5
10 == NA
NA == NA
3
The logic here: if you have one unknown and a second unknown, you don’t actually know if they equal one another!
[1] NA
[1] NA
[1] NA

This is the reason we use is.na() to check for missingness.

is.na(c(NA, 5))
[1]  TRUE FALSE

Examples of Logical Operators

Let’s create two objects, A and B

A <- c(5, 10, 15)
B <- c(5, 15, 25)

Comparisons:

A == B
A > B
A %in% B
4
Will return a vector the length of A that is TRUE whenever a value in A is anywhere in B.
Note: You CAN use %in% to search for NAs.
[1]  TRUE FALSE FALSE
[1] FALSE FALSE FALSE
[1]  TRUE FALSE  TRUE

Combinations:

A > 5 & A <= B 
B < 10 | B > 20
!(A == 10)
5
Be sure not to cut corners (i.e. writing
B < 10 | > 20). The code won’t technically error but it won’t evaluate the way you expect it to. Read more about the confusing logic behind this here.
[1] FALSE  TRUE  TRUE
[1]  TRUE FALSE  TRUE
[1]  TRUE FALSE  TRUE

Logical Summaries

  • any():
  • all():
  • the equivalent of |; it’ll return TRUE if there are any TRUE’s in x
  • the equivalent of &; it’ll return TRUE only if all values of x are TRUE’s
C <- c(5, 10, NA, 10, 20, NA)
any(C <= 10)
all(C <= 20)
all(C <= 20, na.rm = TRUE)
mean(C, na.rm = TRUE)
6
Like other summary functions, they’ll return NA if there are any missing values present and it’s FALSE.
7
Use na.rm = TRUE to remove NAs prior to evaluation.
8
When you evaluate a logical vector numerically, TRUE = 1 and FALSE = 0. This makes sum() and mean() useful when summarizing logical functions (sum gives number of TRUEs and mean gives the proportion).
[1] TRUE
[1] NA
[1] TRUE
[1] 11.25

Conditional transformations

if_else()

If you want to use one value when a condition is TRUE and another value when it’s FALSE.

if_else(condition = "A logical vector", 
        true = "Output when condition is true", 
        false = "Output when condition is false")
x <- c(-3:3, NA)
if_else(x > 0, "+ve", "-ve", "???")
9
There’s an optional fourth argument, missing which will be used if the input is NA.
[1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" "???"
case_when()

A very useful extension of if_else() for multiple conditions1.

case_when(
  x == 0   ~ "0",
  x < 0    ~ "-ve", 
  x > 0    ~ "+ve",
  is.na(x) ~ "???"
)
10
Use .default if you want to create a “default”/catch all value.
11
Both functions require compatible types: i.e. numerical and logical, strings and factors, dates and datetimes, NA and everything.
[1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"

dplyr

Today, we’ll use tools from the dplyr package to manipulate data!

  • Like ggplot2, dplyr is part of the Tidyverse, and included in the tidyverse package.
library(tidyverse)

To demonstrate data transformations we’re going to use the nycflights13 dataset, which you’ll need to download and load into R

# Download and load data
# install.packages("nycflights13")
library(nycflights13)
12
Run in console.
13
Load into R session.

nycflights13 includes five data frames1, some of which contain missing data (NA):

data(flights)
data(airlines)
data(airports)
data(planes)
data(weather)
14
flights leaving JFK, LGA, or EWR in 2013
15
airline abbreviations
16
airport metadata
17
airplane metadata
18
hourly weather data for JFK, LGA, and EWR

dplyr Basics

All dplyr functions have the following in common:

  1. The first argument is always a data frame.
  2. The subsequent arguments typically describe which columns to operate on, using the variable names (without quotes).
  3. The output is always a new data frame.

Each function operates either on rows, columns, groups, or entire tables.

To save the transformations you’ve made to a data frame you’ll need to save the output to a new object.

Subsetting data

Subset Rows: filter()

We often get big datasets, and we only want some of the entries. We can subset rows using filter().

delay_2hr <- flights |> 
  filter(dep_delay > 120)
delay_2hr
1
Here’s where all your new knowledge about logical operators comes in handy! Make sure to use == not = to test the logical condition.
2
Now, delay_2hr is an object in our environment which contains rows corresponding to flights that experienced at least a 2 hour delay.
# A tibble: 9,723 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      848           1835       853     1001           1950
 2  2013     1     1      957            733       144     1056            853
 3  2013     1     1     1114            900       134     1447           1222
 4  2013     1     1     1540           1338       122     2020           1825
 5  2013     1     1     1815           1325       290     2120           1542
 6  2013     1     1     1842           1422       260     1958           1535
 7  2013     1     1     1856           1645       131     2212           2005
 8  2013     1     1     1934           1725       129     2126           1855
 9  2013     1     1     1938           1703       155     2109           1823
10  2013     1     1     1942           1705       157     2124           1830
# ℹ 9,713 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Subset Columns: select()

What if we want to keep every observation, but only use certain variables? Use select()!

We can select columns by name:

flights |> 
  select(year, month, day)
3
You can use a - before a variable name or a vector of variables to drop them from the data (i.e. 
select(-c(year, month, day))).
# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ℹ 336,766 more rows

Subset Columns: select()

What if we want to keep every observation, but only use certain variables? Use select()!

We can select columns between variables (inclusive):

flights |> 
  select(year:day)
4
Add a ! before year and you’ll drop this group of variables from the data.
# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ℹ 336,766 more rows

Subset Columns: select()

What if we want to keep every observation, but only use certain variables? Use select()!

We can select columns based on a condition:

flights |> 
  select(where(is.character))
5
There are a number of helper functions you can use with select() including starts_with(), ends_with(), contains() and num_range(). Read more about these and more here.
# A tibble: 336,776 × 4
   carrier tailnum origin dest 
   <chr>   <chr>   <chr>  <chr>
 1 UA      N14228  EWR    IAH  
 2 UA      N24211  LGA    IAH  
 3 AA      N619AA  JFK    MIA  
 4 B6      N804JB  JFK    BQN  
 5 DL      N668DN  LGA    ATL  
 6 UA      N39463  EWR    ORD  
 7 B6      N516JB  EWR    FLL  
 8 EV      N829AS  LGA    IAD  
 9 B6      N593JB  JFK    MCO  
10 AA      N3ALAA  LGA    ORD  
# ℹ 336,766 more rows

Finding Unique Rows: distinct()

You may want to find the unique combinations of variables in a dataset. Use distinct()

flights |> 
  distinct(origin, dest)
6
Find all unique origin and destination pairs.
# A tibble: 224 × 2
   origin dest 
   <chr>  <chr>
 1 EWR    IAH  
 2 LGA    IAH  
 3 JFK    MIA  
 4 JFK    BQN  
 5 LGA    ATL  
 6 EWR    ORD  
 7 EWR    FLL  
 8 LGA    IAD  
 9 JFK    MCO  
10 LGA    ORD  
# ℹ 214 more rows

distinct() drops variables!

By default, distinct() drops unused variables. If you don’t want to drop them, add the argument .keep_all = TRUE:

flights |> 
  distinct(origin, dest, .keep_all = TRUE)
7
It’s not a coincidence that all of these distinct flights are on January 1: distinct() will find the first occurrence of a unique row in the dataset and discard the rest. Use count() if you’re looking for the number of occurrences.
# A tibble: 224 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 214 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Count Unique Rows: count()

flights |>
  count(origin, dest, sort = TRUE)
8
sort = TRUE arranges them in descending order of number of occurrences.
# A tibble: 224 × 3
   origin dest      n
   <chr>  <chr> <int>
 1 JFK    LAX   11262
 2 LGA    ATL   10263
 3 LGA    ORD    8857
 4 JFK    SFO    8204
 5 LGA    CLT    6168
 6 EWR    ORD    6100
 7 JFK    BOS    5898
 8 LGA    MIA    5781
 9 JFK    MCO    5464
10 EWR    BOS    5327
# ℹ 214 more rows

Modifying data

Sorting Data by Rows: arrange()

Sometimes it’s useful to sort rows in your data, in ascending (low to high) or descending (high to low) order. We do that with arrange().

flights |> 
  arrange(year, month, day, dep_time)
1
If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Sorting Data by Rows: arrange()

To sort in descending order, using desc() within arrange()

flights |> 
  arrange(desc(dep_delay))
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     9      641            900      1301     1242           1530
 2  2013     6    15     1432           1935      1137     1607           2120
 3  2013     1    10     1121           1635      1126     1239           1810
 4  2013     9    20     1139           1845      1014     1457           2210
 5  2013     7    22      845           1600      1005     1044           1815
 6  2013     4    10     1100           1900       960     1342           2211
 7  2013     3    17     2321            810       911      135           1020
 8  2013     6    27      959           1900       899     1236           2226
 9  2013     7    22     2257            759       898      121           1026
10  2013    12     5      756           1700       896     1058           2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Rename Variables: rename()

You may receive data with unintuitive variable names. Change them using rename().

flights |> 
  rename(tail_num = tailnum)
2
rename(new_name = old_name) is the format. Reminder to use janitor::clean_names() if you want to automate this process for a lot of variables.
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tail_num <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Variable Syntax

I recommend against using spaces in a name! It makes things really hard sometimes!!

Create New Columns: mutate()

You can add new columns to a data frame using mutate().

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1
  )
3
By default, mutate() adds new columns on the right hand side of your dataset, which makes it difficult to see if anything happened. You can use the .before argument to specify which numeric index (or variable name) to move the newly created variable to. .after is an alternative argument for this.
# A tibble: 336,776 × 21
    gain speed  year month   day dep_time sched_dep_time dep_delay arr_time
   <dbl> <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1    -9  370.  2013     1     1      517            515         2      830
 2   -16  374.  2013     1     1      533            529         4      850
 3   -31  408.  2013     1     1      542            540         2      923
 4    17  517.  2013     1     1      544            545        -1     1004
 5    19  394.  2013     1     1      554            600        -6      812
 6   -16  288.  2013     1     1      554            558        -4      740
 7   -24  404.  2013     1     1      555            600        -5      913
 8    11  259.  2013     1     1      557            600        -3      709
 9     5  405.  2013     1     1      557            600        -3      838
10   -10  319.  2013     1     1      558            600        -2      753
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Specifying Variables to Keep: mutate()

You can specify which columns to keep with the .keep argument:

flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )
4
“used” retains only the variables used to create the new variables, which is useful for checking your work. Other options include: “all,” “unused,” and “none.”
# A tibble: 336,776 × 6
   dep_delay arr_delay air_time  gain hours gain_per_hour
       <dbl>     <dbl>    <dbl> <dbl> <dbl>         <dbl>
 1         2        11      227    -9 3.78          -2.38
 2         4        20      227   -16 3.78          -4.23
 3         2        33      160   -31 2.67         -11.6 
 4        -1       -18      183    17 3.05           5.57
 5        -6       -25      116    19 1.93           9.83
 6        -4        12      150   -16 2.5           -6.4 
 7        -5        19      158   -24 2.63          -9.11
 8        -3       -14       53    11 0.883         12.5 
 9        -3        -8      140     5 2.33           2.14
10        -2         8      138   -10 2.3           -4.35
# ℹ 336,766 more rows

Move Variables Around: relocate()

You might want to collect related variables together or move important variables to the front. Use relocate()!

flights |> 
  relocate(time_hour, air_time)
5
By default relocate() moves variables to the front but you can also specify where to put them using the .before and .after arguments, just like in mutate().
# A tibble: 336,776 × 19
   time_hour           air_time  year month   day dep_time sched_dep_time
   <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
 1 2013-01-01 05:00:00      227  2013     1     1      517            515
 2 2013-01-01 05:00:00      227  2013     1     1      533            529
 3 2013-01-01 05:00:00      160  2013     1     1      542            540
 4 2013-01-01 05:00:00      183  2013     1     1      544            545
 5 2013-01-01 06:00:00      116  2013     1     1      554            600
 6 2013-01-01 05:00:00      150  2013     1     1      554            558
 7 2013-01-01 06:00:00      158  2013     1     1      555            600
 8 2013-01-01 06:00:00       53  2013     1     1      557            600
 9 2013-01-01 06:00:00      140  2013     1     1      557            600
10 2013-01-01 06:00:00      138  2013     1     1      558            600
# ℹ 336,766 more rows
# ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>

Summarizing data

Grouping Data: group_by()

If you want to analyze your data by specific groupings, use group_by():

flights |> 
  group_by(month)
1
group_by() doesn’t change the data but you’ll notice that the output indicates that it is “grouped by” month (Groups: month [12]). This means subsequent operations will now work “by month”.
# A tibble: 336,776 × 19
# Groups:   month [12]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Summarizing Data: summarize()

summarize() calculates summaries of variables in your data:

  • Count the number of rows
  • Calculate the mean
  • Calculate the sum
  • Find the minimum or maximum value

You can use any function inside summarize() that aggregates multiple values into a single value (like sd(), mean(), or max()).

summarize() Example

Let’s see what this looks like in our flights dataset:

flights |> 
  summarize(
    avg_delay = mean(dep_delay)
  )
2
The NA produced here is a result of calling mean on dep_delay. Any summarizing function will return NA if any of the values are NA. We can set na.rm = TRUE to change this behavior.
# A tibble: 1 × 1
  avg_delay
      <dbl>
1        NA

summarize() Example

Let’s see what this looks like in our flights dataset:

flights |> 
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE) 
  )
# A tibble: 1 × 1
  avg_delay
      <dbl>
1      12.6

Summarizing Data by Groups

What if we want to summarize data by our groups? Use group_by() and summarize()

flights |> 
  group_by(month) |> 
  summarize(
    delay = mean(dep_delay, na.rm = TRUE)
  )
# A tibble: 12 × 2
   month delay
   <int> <dbl>
 1     1 10.0 
 2     2 10.8 
 3     3 13.2 
 4     4 13.9 
 5     5 13.0 
 6     6 20.8 
 7     7 21.7 
 8     8 12.6 
 9     9  6.72
10    10  6.24
11    11  5.44
12    12 16.6 

Because we did group_by() with month, then used summarize(), we get one row per value of month!

Summarizing Data by Groups

You can create any number of summaries in a single call to summarize().

flights |> 
  group_by(month) |> 
  summarize(
    delay = mean(dep_delay, na.rm = TRUE), 
    n = n()
  )
3
n() returns the number of rows in each group.
# A tibble: 12 × 3
   month delay     n
   <int> <dbl> <int>
 1     1 10.0  27004
 2     2 10.8  24951
 3     3 13.2  28834
 4     4 13.9  28330
 5     5 13.0  28796
 6     6 20.8  28243
 7     7 21.7  29425
 8     8 12.6  29327
 9     9  6.72 27574
10    10  6.24 28889
11    11  5.44 27268
12    12 16.6  28135

Grouping by Multiple Variables

daily <- flights |> 
  group_by(year, month, day)  
daily
# A tibble: 336,776 × 19
# Groups:   year, month, day [365]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Summary & Grouping Behavior

When you summarize a tibble grouped by more than one variable, each summary peels off the last group. You can change the default behavior by setting the .groups argument to a different value, e.g., “drop” to drop all grouping or “keep” to preserve the same groups. The default is “drop_last”.

Remove Grouping: ungroup()

daily |> 
  ungroup() 
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

New Alternative for Grouping: .by

flights |> 
  summarize(
    delay = mean(dep_delay, na.rm = TRUE), 
    n = n(),
    .by = month
  )
4
.by works with all verbs and has the advantage that you don’t need to use the .groups argument to suppress the grouping message or ungroup() when you’re done.
# A tibble: 12 × 3
   month delay     n
   <int> <dbl> <int>
 1     1 10.0  27004
 2    10  6.24 28889
 3    11  5.44 27268
 4    12 16.6  28135
 5     2 10.8  24951
 6     3 13.2  28834
 7     4 13.9  28330
 8     5 13.0  28796
 9     6 20.8  28243
10     7 21.7  29425
11     8 12.6  29327
12     9  6.72 27574

Select Specific Rows Per Group: slice_*

There are five handy functions that allow you extract specific rows within each group:

  • df |> slice_head(n = 1) takes the first row from each group.
  • df |> slice_tail(n = 1) takes the last row in each group.
  • df |> slice_min(x, n = 1) takes the row with the smallest value of column x.
  • df |> slice_max(x, n = 1) takes the row with the largest value of column x.
  • df |> slice_sample(n = 1) takes one random row.

Let’s find the flights that are most delayed upon arrival at each destination.

Select Specific Rows Per Group: slice_*

flights |> 
  group_by(dest) |> 
  slice_max(arr_delay, n = 1) |>
  relocate(dest, arr_delay) 
5
You can vary n to select more than one row, or instead of n =, you can use prop = 0.1 to select (e.g.) 10% of the rows in each group.
# A tibble: 108 × 19
# Groups:   dest [105]
   dest  arr_delay  year month   day dep_time sched_dep_time dep_delay arr_time
   <chr>     <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1 ABQ         153  2013     7    22     2145           2007        98      132
 2 ACK         221  2013     7    23     1139            800       219     1250
 3 ALB         328  2013     1    25      123           2000       323      229
 4 ANC          39  2013     8    17     1740           1625        75     2042
 5 ATL         895  2013     7    22     2257            759       898      121
 6 AUS         349  2013     7    10     2056           1505       351     2347
 7 AVL         228  2013     8    13     1156            832       204     1417
 8 BDL         266  2013     2    21     1728           1316       252     1839
 9 BGR         238  2013    12     1     1504           1056       248     1628
10 BHM         291  2013     4    10       25           1900       325      136
# ℹ 98 more rows
# ℹ 10 more variables: sched_arr_time <int>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Merging Data

Why Merge Data?

In practice, we often collect data from different sources. To analyze the data, we usually must first combine (merge) them.

For example, imagine you would like to study county-level patterns with respect to age and grocery spending. However, you can only find,

  • County level age data from the US Census, and
  • County level grocery spending data from the US Department of Agriculture

Merge the data!!

To do this we’ll be using the various join functions from the dplyr package.

Joining in Concept

We need to think about the following when we want to merge data frames A and B:

  • Which rows are we keeping from each data frame?
  • Which columns are we keeping from each data frame?
  • Which variables determine whether rows match?

Keys

Keys are the way that two datasets are connected to one another. The two types of keys are:

  1. Primary: a variable or set of variables that uniquely identifies each observation.
    1. When more than one variable makes up the primary key it’s called a compound key
  2. Foreign: a variable (or set of variables) that corresponds to a primary key in another table.

Primary Keys

Let’s look at our data to gain a better sense of what this all means.

airlines records two pieces of data about each airline: its carrier code and its full name. You can identify an airline with its two letter carrier code, making carrier the primary key.

airlines 
# A tibble: 16 × 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         

airports records data about each airport. You can identify each airport by its three letter airport code, making faa the primary key.

airports
# A tibble: 1,458 × 8
   faa   name                             lat    lon   alt    tz dst   tzone    
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
# ℹ 1,448 more rows

planes records data about each plane. You can identify a plane by its tail number, making tailnum the primary key.

planes
# A tibble: 3,322 × 9
   tailnum  year type              manufacturer model engines seats speed engine
   <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
 1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
# ℹ 3,312 more rows

weather records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making origin and time_hour the compound primary key.

weather
# A tibble: 26,115 × 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
# ℹ 26,105 more rows
# ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#   visib <dbl>, time_hour <dttm>

flights has three variables (time_hour, flight, carrier) that uniquely identify an observation. More significantly, however, it contains foreign keys that correspond to the primary keys of the other datasets.

flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Foreign Keys

Note: grey shading indicates the primary key for that particular dataset.

  • flights$origin –> airports$faa
  • flights$dest –> airports$faa
  • flights$origin-flights$time_hour –> weather$origin-weather$time_hour.
  • flights$tailnum –> planes$tailnum
  • flights$carrier –> airlines$carrier

Checking Keys

A nice feature of these data are that the primary and foreign keys have the same name and almost every variable name used across multiple tables has the same meaning.1 This isn’t always the case!2

It is good practice to make sure your primary keys actually uniquely identify an observation and that they don’t have any missing values.

planes |> 
  count(tailnum) |>
  filter(n > 1)
1
If your primary keys uniquely identify each observation you’ll get an empty tibble in return.
# A tibble: 0 × 2
# ℹ 2 variables: tailnum <chr>, n <int>
planes |> 
  filter(is.na(tailnum))
2
If none of your primary keys are missing you’ll get an empty tibble in return here too.
# A tibble: 0 × 9
# ℹ 9 variables: tailnum <chr>, year <int>, type <chr>, manufacturer <chr>,
#   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>

Surrogate Keys

Sometimes you’ll want to create an index of your observations to serve as a surrogate key because the compound primary key is not particularly easy to reference.

For example, our flights dataset has three variables that uniquely identify each observation: time_hour, carrier, flight.

flights2 <- flights |> 
  mutate(id = row_number(), .before = 1)
flights2
3
row_number() simply specifies the row number of the data frame.
# A tibble: 336,776 × 20
      id  year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1     1  2013     1     1      517            515         2      830
 2     2  2013     1     1      533            529         4      850
 3     3  2013     1     1      542            540         2      923
 4     4  2013     1     1      544            545        -1     1004
 5     5  2013     1     1      554            600        -6      812
 6     6  2013     1     1      554            558        -4      740
 7     7  2013     1     1      555            600        -5      913
 8     8  2013     1     1      557            600        -3      709
 9     9  2013     1     1      557            600        -3      838
10    10  2013     1     1      558            600        -2      753
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Basic (Equi-) Joins

All join functions have the same basic interface: they take a pair of data frames and return one data frame.

The order of the rows and columns is primarily going to be determined by the first data frame.

dplyr has two types of joins: mutating and filtering.


Mutating Joins

Add new variables to one data frame from matching observations from another data frame.

  • left_join()
  • right_join()
  • inner_join()
  • full_join()

Filtering Joins

Filter observations from one data frame based on whether or not they match an observation in another data frame.

  • semi_join()
  • anti-join()

Mutating Joins



left_join()

left_join in nycflights13

flights2 <- flights |> 
  select(year, time_hour, origin, dest, tailnum, carrier)

With only the pertinent variables from the flights dataset, we can see how a left_join works with the airlines dataset.

flights2 |>
  left_join(airlines)
Joining with `by = join_by(carrier)`
# A tibble: 336,776 × 7
    year time_hour           origin dest  tailnum carrier name                  
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>                 
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      United Air Lines Inc. 
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      United Air Lines Inc. 
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      American Airlines Inc.
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      JetBlue Airways       
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Delta Air Lines Inc.  
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      United Air Lines Inc. 
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      JetBlue Airways       
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      ExpressJet Airlines I…
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      JetBlue Airways       
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      American Airlines Inc.
# ℹ 336,766 more rows

Different variable meanings

flights2 |> 
  left_join(planes)
Joining with `by = join_by(year, tailnum)`
# A tibble: 336,776 × 13
    year time_hour           origin dest  tailnum carrier type  manufacturer
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <chr>       
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      <NA>  <NA>        
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      <NA>  <NA>        
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      <NA>  <NA>        
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>  <NA>        
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      <NA>  <NA>        
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      <NA>  <NA>        
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      <NA>  <NA>        
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      <NA>  <NA>        
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      <NA>  <NA>        
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      <NA>  <NA>        
# ℹ 336,766 more rows
# ℹ 5 more variables: model <chr>, engines <int>, seats <int>, speed <int>,
#   engine <chr>

When we try to do this, however, we get a bunch of NAs. Why?

Different variable meanings

flights2 |> 
  left_join(planes)
Joining with `by = join_by(year, tailnum)`
# A tibble: 336,776 × 13
    year time_hour           origin dest  tailnum carrier type  manufacturer
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <chr>       
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      <NA>  <NA>        
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      <NA>  <NA>        
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      <NA>  <NA>        
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>  <NA>        
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      <NA>  <NA>        
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      <NA>  <NA>        
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      <NA>  <NA>        
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      <NA>  <NA>        
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      <NA>  <NA>        
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      <NA>  <NA>        
# ℹ 336,766 more rows
# ℹ 5 more variables: model <chr>, engines <int>, seats <int>, speed <int>,
#   engine <chr>

Join is trying to use tailnum and year as a compound key. While both datasets have year as a variable, they mean different things. Therefore, we need to be explicit here about what to join by.

Different variable meanings

flights2 |> 
  left_join(planes, join_by(tailnum))
4
join_by(tailnum) is short for join_by(tailnum == tailnum) making these types of basic joins equi joins.
# A tibble: 336,776 × 14
   year.x time_hour           origin dest  tailnum carrier year.y type          
    <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int> <chr>         
 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999 Fixed wing mu…
 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998 Fixed wing mu…
 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990 Fixed wing mu…
 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012 Fixed wing mu…
 5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991 Fixed wing mu…
 6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012 Fixed wing mu…
 7   2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6        2000 Fixed wing mu…
 8   2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV        1998 Fixed wing mu…
 9   2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6        2004 Fixed wing mu…
10   2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA          NA <NA>          
# ℹ 336,766 more rows
# ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
#   seats <int>, speed <int>, engine <chr>

Different variable names

If you have keys that have the same meaning (values) but are named different things in their respective datasets you’d also specify that with join_by()

flights2 |> 
  left_join(airports, join_by(dest == faa))
5
by = c("dest" = "faa") was the former syntax for this and you still might see that in older code.
# A tibble: 336,776 × 13
    year time_hour           origin dest  tailnum carrier name         lat   lon
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>      <dbl> <dbl>
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      George Bu…  30.0 -95.3
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      George Bu…  30.0 -95.3
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miami Intl  25.8 -80.3
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>        NA    NA  
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hartsfiel…  33.6 -84.4
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chicago O…  42.0 -87.9
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      Fort Laud…  26.1 -80.2
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      Washingto…  38.9 -77.5
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      Orlando I…  28.4 -81.3
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      Chicago O…  42.0 -87.9
# ℹ 336,766 more rows
# ℹ 4 more variables: alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>

This will match dest to faa for the join and then drop faa.

Different variable names

You can request dplyr to keep both keys with keep = TRUE argument.

flights2 |> 
  left_join(airports, join_by(dest == faa), keep = TRUE) 
# A tibble: 336,776 × 14
    year time_hour           origin dest  tailnum carrier faa   name         lat
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <chr>      <dbl>
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      IAH   George Bu…  30.0
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      IAH   George Bu…  30.0
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      MIA   Miami Intl  25.8
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>  <NA>        NA  
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      ATL   Hartsfiel…  33.6
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      ORD   Chicago O…  42.0
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      FLL   Fort Laud…  26.1
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      IAD   Washingto…  38.9
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      MCO   Orlando I…  28.4
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      ORD   Chicago O…  42.0
# ℹ 336,766 more rows
# ℹ 5 more variables: lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>

right_join()

Has the same interface as a left_join but keeps all rows in y instead of x

inner_join()

Has the same interface as a left_join but only keeps rows that occur in both x and y

full_join()

Has the same interface as a left_join but keeps all rows in either x or y

Filtering Joins



semi_join()

Keeps all rows in x that have a match in y

semi_join() in nycflights13

We could use a semi-join to filter the airports dataset to show just the origin airports.

airports |> 
  semi_join(flights2, join_by(faa == origin))
# A tibble: 3 × 8
  faa   name                  lat   lon   alt    tz dst   tzone           
  <chr> <chr>               <dbl> <dbl> <dbl> <dbl> <chr> <chr>           
1 EWR   Newark Liberty Intl  40.7 -74.2    18    -5 A     America/New_York
2 JFK   John F Kennedy Intl  40.6 -73.8    13    -5 A     America/New_York
3 LGA   La Guardia           40.8 -73.9    22    -5 A     America/New_York

anti_join()

Returns all rows in x that don’t have a match in y

anti_join() in nycflights13

We can find rows that are missing from airports by looking for flights that don’t have a matching destination airport.

airports |> 
  anti_join(flights2, join_by(faa == origin))
# A tibble: 1,455 × 8
   faa   name                             lat    lon   alt    tz dst   tzone    
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
# ℹ 1,445 more rows

More Than One Match

There are three possible outcomes for a row in x:

  • If it doesn’t match anything, it’s dropped.
  • If it matches 1 row in y, it’s preserved.
  • If it matches more than 1 row in y, it’s duplicated once for each match.

What happens if we match on more than one row?

More Than One Match

df1 <- tibble(key = c(1, 2, 2), val_x = c("x1", "x2", "x3"))
df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

df1 |> 
  inner_join(df2, join_by(key))
# A tibble: 5 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     2 x2    y3   
4     2 x3    y2   
5     2 x3    y3   

If you are doing this deliberately, you can set relationship = “many-to-many”, as the warning suggests.

Non-Equi Joins

The joins we’ve discussed thus far have all been equi-joins, where the rows match if the x key equals the y key. But you can also specify other types of relationships.

dplyr has four different types of non-equi joins:

  • Cross joins match every pair of rows.

Non-Equi Joins

The joins we’ve discussed thus far have all been equi-joins, where the rows match if the x key equals the y key. But you can also specify other types of relationships.

dplyr has four different types of non-equi joins:

  • Cross joins match every pair of rows.
  • Inequality joins use <, <=, >, and >= instead of ==.
    • Overlap joins are a special type of inequality join designed to work with ranges1.

Non-Equi Joins

The joins we’ve discussed thus far have all been equi-joins, where the rows match if the x key equals the y key. But you can also specify other types of relationships.

dplyr has four different types of non-equi joins:

  • Cross joins match every pair of rows.
  • Inequality joins use <, <=, >, and >= instead of ==.
    • Overlap joins are a special type of inequality join designed to work with ranges.
  • Rolling joins are similar to inequality joins but only find the closest match.

Lab

Manipulating Data

  1. Create a new object that contains gapminder1 (1) observations from China, India, and United States after 1980, and (2) variables corresponding to country, year, population, and life expectancy.
  2. How many rows and columns does the object contain?
  3. Sort the rows by year (ascending order) and population (descending order) and save that over the object created for answer 1. Print the first 6 rows.
  4. Add a new variable that contains population in billions.
  5. By year, calculate the total population (in billions) across these three countries
  6. In ggplot, create a line plot showing life expectancy over time by country. Make the plot visually appealing!

Answers

Question 1:

subset_gapminder <- gapminder |> 
  filter(country %in% c("China","India","United States"),  year > 1980 ) |>
  select(country, year, pop, lifeExp)
subset_gapminder
# A tibble: 18 × 4
   country        year        pop lifeExp
   <fct>         <int>      <int>   <dbl>
 1 China          1982 1000281000    65.5
 2 China          1987 1084035000    67.3
 3 China          1992 1164970000    68.7
 4 China          1997 1230075000    70.4
 5 China          2002 1280400000    72.0
 6 China          2007 1318683096    73.0
 7 India          1982  708000000    56.6
 8 India          1987  788000000    58.6
 9 India          1992  872000000    60.2
10 India          1997  959000000    61.8
11 India          2002 1034172547    62.9
12 India          2007 1110396331    64.7
13 United States  1982  232187835    74.6
14 United States  1987  242803533    75.0
15 United States  1992  256894189    76.1
16 United States  1997  272911760    76.8
17 United States  2002  287675526    77.3
18 United States  2007  301139947    78.2

Answers

Question 2:

# Option 1
c(nrow(subset_gapminder), ncol(subset_gapminder))
[1] 18  4
# Option 2
glimpse(subset_gapminder)
Rows: 18
Columns: 4
$ country <fct> "China", "China", "China", "China", "China", "China", "India",…
$ year    <int> 1982, 1987, 1992, 1997, 2002, 2007, 1982, 1987, 1992, 1997, 20…
$ pop     <int> 1000281000, 1084035000, 1164970000, 1230075000, 1280400000, 13…
$ lifeExp <dbl> 65.525, 67.274, 68.690, 70.426, 72.028, 72.961, 56.596, 58.553…
# Option 3
dim(subset_gapminder)
[1] 18  4

Answers

Question 3:

subset_gapminder <- subset_gapminder |> 
  arrange(year, desc(pop))


subset_gapminder |> head(6)
# A tibble: 6 × 4
  country        year        pop lifeExp
  <fct>         <int>      <int>   <dbl>
1 China          1982 1000281000    65.5
2 India          1982  708000000    56.6
3 United States  1982  232187835    74.6
4 China          1987 1084035000    67.3
5 India          1987  788000000    58.6
6 United States  1987  242803533    75.0
print(subset_gapminder[1:6, ])
# A tibble: 6 × 4
  country        year        pop lifeExp
  <fct>         <int>      <int>   <dbl>
1 China          1982 1000281000    65.5
2 India          1982  708000000    56.6
3 United States  1982  232187835    74.6
4 China          1987 1084035000    67.3
5 India          1987  788000000    58.6
6 United States  1987  242803533    75.0

Answers

Question 4:

subset_gapminder <- subset_gapminder |> 
  mutate(pop_billions = pop/1000000000)

subset_gapminder
# A tibble: 18 × 5
   country        year        pop lifeExp pop_billions
   <fct>         <int>      <int>   <dbl>        <dbl>
 1 China          1982 1000281000    65.5        1.00 
 2 India          1982  708000000    56.6        0.708
 3 United States  1982  232187835    74.6        0.232
 4 China          1987 1084035000    67.3        1.08 
 5 India          1987  788000000    58.6        0.788
 6 United States  1987  242803533    75.0        0.243
 7 China          1992 1164970000    68.7        1.16 
 8 India          1992  872000000    60.2        0.872
 9 United States  1992  256894189    76.1        0.257
10 China          1997 1230075000    70.4        1.23 
11 India          1997  959000000    61.8        0.959
12 United States  1997  272911760    76.8        0.273
13 China          2002 1280400000    72.0        1.28 
14 India          2002 1034172547    62.9        1.03 
15 United States  2002  287675526    77.3        0.288
16 China          2007 1318683096    73.0        1.32 
17 India          2007 1110396331    64.7        1.11 
18 United States  2007  301139947    78.2        0.301

Answers

Question 5:

subset_gapminder |> 
  group_by(year) |> 
  summarize(TotalPop_Billions = sum(pop_billions))
# A tibble: 6 × 2
   year TotalPop_Billions
  <int>             <dbl>
1  1982              1.94
2  1987              2.11
3  1992              2.29
4  1997              2.46
5  2002              2.60
6  2007              2.73
subset_gapminder |> 
  summarize(TotalPop_Billions = sum(pop_billions), 
            .by = year)
1
This new syntax allows for per-operation grouping which means it is only active within a single verb at a time (as opposed to being applied to the entire tibble until ungroup() is called). Learn more about this new feature here)
# A tibble: 6 × 2
   year TotalPop_Billions
  <int>             <dbl>
1  1982              1.94
2  1987              2.11
3  1992              2.29
4  1997              2.46
5  2002              2.60
6  2007              2.73

Answers

Question 6:

library(ggplot2) 
library(ggthemes)
library(geomtextpath)
ggplot(subset_gapminder, aes(year, lifeExp, color = country)) +
  geom_point() + 
  geom_textpath(aes(label = country), show.legend = FALSE) +
  xlab("Year") + 
  ylab("Life Expectancy (years)") +
  ggtitle("Life Expectancy (1982-2007)","China, India, and United States") +
  scale_x_continuous(breaks = c(1982, 1987, 1992, 1997, 2002, 2007), minor_breaks = c()) +
  ylim(c(50, 80)) + 
  scale_color_discrete(name = "Country") + 
  theme_tufte(base_size = 20) + 
  theme(legend.position = "bottom")

Homework