Homework 4 Key

library(tidyverse)
library(ggthemes)
library(nycflights13)
library(scales)
library(gt)

Answer 1

Choose an airport outside New York, and count how many flights went to that airport from NYC in 2013. How many of those flights started at JFK, LGA, and EWR respectively?

sd_flights <- flights |> 
  filter(dest == "SAN") |> 
  summarize(n = n(), 
            .by = origin)
1. An alternative to this summarize call would be to call count(origin).

There were 2,737 total flights from New York to San Diego in 2013: 1,603 originated from JFK, 1,134 from EWR, and none from LGA.1
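As a quick sketch of the count(origin) alternative mentioned in the annotation above: the names san_counts and total_san are mine, not part of the assignment, and count() may order the rows differently than summarize() with .by does.

```r
library(dplyr)
library(nycflights13)

# Same filter as in the answer, with count() in place of summarize(n = n(), .by = origin)
san_counts <- flights |>
  filter(dest == "SAN") |>
  count(origin)

# Total flights to SAN across all origin airports
total_san <- sum(san_counts$n)

# Format the total with a thousands separator, as footnote 1 suggests
scales::number(total_san, big.mark = ",")
```

An origin airport with no flights to SAN simply does not appear in san_counts, which is why only two rows come back.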

Answer 2

The variable arr_delay contains arrival delays in minutes (negative values represent early arrivals). Make a ggplot histogram displaying arrival delays for 2013 flights from NYC to the airport you chose.

flights |> 
  filter(dest == "SAN") |> 
  ggplot(aes(x = arr_delay)) +
  geom_histogram() + 
  geom_vline(xintercept = 0, color = "#e15759") +
  labs(title = "Arrival delays from NYC airports to SAN in 2013",
       x = "Arrival delays (minutes)",
       y = "Count") +
  theme_tufte() + 
  theme(plot.background = element_rect(fill = "#f6f7f9", color = NA))
2. geom_vline adds a vertical line to your plot, which can be useful for marking comparison or threshold values.
3. Here fill changes the background color of the plot to match the website's color, and color = NA removes the border around the plot.
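By default geom_histogram() uses 30 bins and prints a message suggesting you pick your own value. Here is a minimal sketch of the same histogram with an explicit bin width. The 10-minute width and the name p_hist are my assumptions, so try other values to see how the shape changes.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Same data and aesthetics as the answer, but with 10-minute bins (assumed width)
p_hist <- flights |>
  filter(dest == "SAN") |>
  ggplot(aes(x = arr_delay)) +
  geom_histogram(binwidth = 10) +
  geom_vline(xintercept = 0, color = "#e15759") +  # on-time reference line
  labs(title = "Arrival delays from NYC airports to SAN in 2013",
       x = "Arrival delays (minutes)",
       y = "Count")

p_hist
```

Flights with a missing arr_delay are dropped from the histogram with a warning.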

Answer 3

Use left_join to add weather data at departure to the subsetted data. If time_hour didn’t exist in one or both of these datasets, which variables would you need to merge on? Calculate the mean temperature by month at departure (temp) across all flights.

flights |> 
  filter(dest == "SAN") |> 
  left_join(weather) |> 
  summarize(avg_temp = mean(temp, na.rm = TRUE) |> round(2) |> number(suffix = "°F"),
            .by = month) |> 
  arrange(month) |> 
  gt() |>
  cols_align(align = "center") |>
  cols_label(month = "Month",
             avg_temp = "Average Temperature") |>
  tab_options(table.background.color = "#f6f7f9")
4. When computing the mean temperature I also rounded it to two decimal places with round() and added the °F suffix with the number function from the scales package. As the table below shows, number() still prints whole degrees here; its accuracy argument controls the displayed precision. This level of detail is not necessary for this assignment.
5. These calls make the table a bit nicer looking (again, not necessary here).
Month   Average Temperature
1       37°F
2       36°F
3       41°F
4       54°F
5       64°F
6       74°F
7       82°F
8       77°F
9       70°F
10      61°F
11      46°F
12      40°F
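If you do want decimals in a formatted temperature, here is a small sketch of how the accuracy argument of number() changes the result. The value example_temp is made up for illustration and is not taken from the data.

```r
library(scales)

# Hypothetical mean temperature, for illustration only
example_temp <- 36.854

# Without accuracy, number() chooses the precision itself
temp_default <- number(example_temp, suffix = "°F")

# With accuracy = 0.01, the value is shown to two decimal places
temp_two_dp <- number(example_temp, accuracy = 0.01, suffix = "°F")

temp_default
temp_two_dp
```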

If time_hour didn’t exist in one or both of these datasets, you would have to join on origin, year, month, day, and hour.
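To make that concrete, here is a sketch that drops time_hour from both tables to mimic the situation and then joins on the five remaining keys. The name san_weather is mine, and join_by() assumes dplyr 1.1.0 or later (the same version that .by requires).

```r
library(dplyr)
library(nycflights13)

# Drop time_hour from both tables to simulate datasets that lack it,
# then spell out the join keys explicitly
san_weather <- flights |>
  filter(dest == "SAN") |>
  select(-time_hour) |>
  left_join(select(weather, -time_hour),
            by = join_by(origin, year, month, day, hour))

# The weather columns (temp, precip, and so on) are now attached to each flight
san_weather |>
  select(origin, year, month, day, hour, temp) |>
  head()
```

Spelling out the keys with by also silences the message that left_join() prints when it has to guess the common columns.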

Answer 4

Investigate if there is a relationship between departure delay (dep_delay) and precipitation (precip) in the full dataset. Is the relationship different between JFK, LGA, and EWR? I suggest answering this question by making a plot and writing down a one-sentence interpretation2.

flights |> 
  left_join(weather) |> 
  ggplot(aes(x = precip, y = dep_delay)) + 
  geom_jitter(alpha = 0.1) +
  geom_smooth(method = "lm") +
  labs(title = "Does precipitation increase the likelihood of departure delays?",
       x = "Precipitation (in inches)",
       y = "Departure delay (in minutes)") + 
  theme_tufte() + 
  theme(plot.background = element_rect(fill = "#f6f7f9", color = NA))
6. It can be helpful to make the points transparent when there is extreme over-plotting, as in this plot. Transparency lets you see more clearly where the points pile up.
7. Adding a regression line makes the relationship between precipitation and departure delays easier to see.

It looks like there is a slightly positive relationship between precipitation and the length of a departure delay.
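If you want a number to go with the picture, one option is to fit the same kind of linear model that geom_smooth(method = "lm") draws. This is only a sketch and goes beyond what the question asks; flights_weather and fit are names I made up.

```r
library(dplyr)
library(nycflights13)

# Join weather at departure onto every flight, with the keys written out
flights_weather <- flights |>
  left_join(weather,
            by = join_by(origin, year, month, day, hour, time_hour))

# Simple linear regression of departure delay on precipitation;
# lm() drops rows where either variable is missing
fit <- lm(dep_delay ~ precip, data = flights_weather)

# The precip coefficient is the change in mean departure delay (minutes)
# associated with one additional inch of precipitation
coef(fit)
```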

flights |> 
  left_join(weather) |> 
  ggplot(aes(x = precip, y = dep_delay, color = origin)) +
  geom_jitter(alpha = 0.1) + 
  geom_smooth(method = "lm", fill = "#f6f7f9") + 
  labs(title = "Does precipitation increase the likelihood of departure delays?",
       x = "Precipitation (in inches)",
       y = "Departure delay (in minutes)") + 
  theme_tufte() + 
  theme(plot.background = element_rect(fill = "#f6f7f9", color = NA))
8. The only change here is mapping color to origin, which draws the points and a separate regression line for each origin airport.

When broken down by origin airport it looks like this slight positive association is a bit stronger for JFK and LGA than it is for EWR.
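The three colored lines can also be compared numerically by fitting one regression per origin airport. Again, this is a sketch rather than a required part of the answer, and joined and slopes are hypothetical names.

```r
library(dplyr)
library(nycflights13)

# Flights with weather at departure attached
joined <- flights |>
  left_join(weather,
            by = join_by(origin, year, month, day, hour, time_hour))

# Split the data by origin and pull the precip slope from a separate lm() fit for each airport
slopes <- sapply(
  split(joined, joined$origin),
  function(d) coef(lm(dep_delay ~ precip, data = d))[["precip"]]
)

# Named vector with one slope per origin airport
slopes
```

A larger slope means departure delays rise faster with precipitation at that airport, which is what the steeper lines in the plot show.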

Footnotes

  1. To add thousands separators and other formatting to inline numeric values, try the number function from the scales package.

  2. Hint: Read about geom_smooth() and consider how you might use it with the argument method = "lm" to plot a relationship between these two variables.