Iteration

CS&SS 508 • Lecture 9

21 May 2024

Victoria Sass

Roadmap


Last time, we learned:

  • Function Basics
  • Types of Functions
    • Vector Functions
    • Dataframe Functions
    • Plot Functions
  • Function Style Guide


Today, we will cover:

  • Introduction to Iteration
  • Common Iteration Tasks
    • Modifying Multiple Columns
    • Reading in Multiple Files
    • Saving Multiple Outputs
  • Base R Equivalents
    • Apply Family
    • for loops

Introduction to Iteration

Bad Repetition

If someone doesn’t know better, they might find the means of variables in the swiss data by typing a line of code for each column:


mean1 <- mean(swiss$Fertility)
mean2 <- mean(swiss$Agriculture)
mean3 <- mean(swissExamination)
mean4 <- mean(swiss$Fertility)
mean5 <- mean(swiss$Catholic)
mean5 <- mean(swiss$Infant.Mortality)
c(mean1, mean2 mean3, mean4, mean5, man6)
> Error: <text>:7:16: unexpected symbol
> 6: mean5 <- mean(swiss$Infant.Mortality)
> 7: c(mean1, mean2 mean3
>                   ^


Can you spot the problems?


How upset would they be if the swiss data had 200 columns instead of 6?

Good Repetition

Today you’ll learn a better way to repeat tasks, without repeating code, using functions from the dplyr and purrr packages in the tidyverse.


swiss |> dplyr::summarize(
  across(Fertility:Infant.Mortality, mean)
  )
>   Fertility Agriculture Examination Education Catholic Infant.Mortality
> 1     70.14       50.66       16.49     10.98    41.14            19.94

Goal: Don’t Repeat Yourself (DRY)!

The DRY idea: Computers are much better at doing the same thing over and over again than we are.

  • Writing code to repeat tasks for us reduces the most common human coding mistakes.
  • It also substantially reduces the time and effort involved in processing large volumes of data.
  • Lastly, compact code is more readable and easier to troubleshoot.

Method: Iteration!

Iteration involves repeatedly performing the same action on different objects.


We’ve already done some iteration, both because it’s built into R in certain ways, and because many of the tidyverse packages we’ve used have functions that are iterative.


Some examples we’ve seen:

  • Multiplying a vector x by any integer
    • Other languages require explicit looping but R iterates automatically with its recycling rules
  • Faceting ggplots
  • Summarizing a grouped dataset
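
As a minimal sketch of the first example above, the snippet below compares R's built-in vectorized multiplication with the explicit loop other languages would need. The vector x and the result names are made up for illustration.

```r
# Hypothetical example vector (not from the lecture data)
x <- c(1, 5, 10)

# Vectorized: R applies the multiplication to every element for us
tripled_vectorized <- x * 3

# Explicit loop: the element-by-element version of the same task
tripled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  tripled_loop[i] <- x[i] * 3
}

# Both approaches give the same result
identical(tripled_vectorized, tripled_loop)
```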


We’re now going to learn what makes R a functional programming language. That is, we’ll learn some functions that themselves take functions as arguments.
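
To make "functions that take functions as arguments" concrete, here is a small sketch in base R. The helper summarise_with() is hypothetical and exists only for this illustration; note that mean and max are passed by name, without parentheses.

```r
# Hypothetical helper: applies whatever function `fn` it is given to `values`
summarise_with <- function(values, fn) {
  fn(values)
}

scores <- c(2, 4, 6)

# Same helper, different behavior depending on the function we pass in
mean_score <- summarise_with(scores, mean)
max_score <- summarise_with(scores, max)
```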

Modifying Multiple Columns

Simple, Motivating Example…Continued

Let’s return to our first example from last week:

df <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5)
  )
df
> # A tibble: 5 × 4
>        a       b      c      d
>    <dbl>   <dbl>  <dbl>  <dbl>
> 1 -0.130 -0.398   0.347  0.368
> 2 -1.14   1.04   -0.433 -1.07 
> 3 -2.18  -1.06   -0.278 -1.10 
> 4  0.765 -0.0376  0.674 -0.185
> 5 -0.474  1.27    1.25   1.36


df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
  )
rescale01 <- function(x) { 
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}



df |> mutate(a = rescale01(a),
             b = rescale01(b),
             c = rescale01(c),
             d = rescale01(d))
> # A tibble: 5 × 4
>       a     b      c      d
>   <dbl> <dbl>  <dbl>  <dbl>
> 1 0.696 0.284 0.462  0.597 
> 2 0.354 0.902 0      0.0147
> 3 0     0     0.0918 0     
> 4 1     0.439 0.656  0.372 
> 5 0.580 1     1      1


Can we make this mutate call even more efficient?

df |> mutate(across(a:d, rescale01))
> # A tibble: 5 × 4
>       a     b      c      d
>   <dbl> <dbl>  <dbl>  <dbl>
> 1 0.696 0.284 0.462  0.597 
> 2 0.354 0.902 0      0.0147
> 3 0     0     0.0918 0     
> 4 1     0.439 0.656  0.372 
> 5 0.580 1     1      1

Basics of across()

across() makes it easy to apply the same transformation to multiple columns.


across(.cols, .fns, .names = NULL)


There are three particularly important arguments, the first two of which you’ll use in every call to across().

  • .cols specifies which columns to iterate over.
  • .fns specifies what to do with each column.
  • .names specifies the names of the output columns.
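
A quick sketch that spells out all three arguments by name, assuming dplyr is installed and using the built-in mtcars data (not part of this lecture's examples):

```r
library(dplyr)

across_demo <- mtcars |>
  summarize(
    across(
      .cols = c(mpg, hp),      # which columns to iterate over
      .fns = mean,             # what to do with each column
      .names = "mean_{.col}"   # how to name the output columns
    )
  )

names(across_demo)
```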

Selecting columns with .cols

.cols uses the same specifications as select() so you can use tidyselect functions like starts_with() to select columns based on their name.

iris |> 
  summarise(across(starts_with("Sepal"), median))
>   Sepal.Length Sepal.Width
> 1          5.8           3

You can also use everything() which selects every (non-grouping) column.

iris |> 
  summarise(across(everything(), median), 
            .by = Species)
>      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
> 1     setosa          5.0         3.4         1.50         0.2
> 2 versicolor          5.9         2.8         4.35         1.3
> 3  virginica          6.5         3.0         5.55         2.0

Lastly, where() allows you to select columns based on their type.

iris |> 
  summarise(across(where(is.numeric), median))
>   Sepal.Length Sepal.Width Petal.Length Petal.Width
> 1          5.8           3         4.35         1.3

Just like other selectors, you can combine these with Boolean algebra. For example, !where(is.numeric) selects all non-numeric columns.
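
As a small sketch of negating a selector, the code below counts the distinct values in every non-numeric column of iris. It assumes dplyr is loaded; n_distinct() is just one convenient function to apply here.

```r
library(dplyr)

# !where(is.numeric) keeps only the non-numeric columns (here, Species)
non_numeric_summary <- iris |>
  summarise(across(!where(is.numeric), n_distinct))

non_numeric_summary
```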

Calling a single function

The second argument to across() shows what makes R a functional programming language: here we’re passing a function to another function.


Important Distinction

We’re passing this function to across(), so across() can call it; we’re not calling it ourselves. That means the function name should never be followed by (). If you forget, you’ll get an error:


airquality |> 
  summarise(across(Ozone:Temp, median()))
> Error in `summarise()`:
> ℹ In argument: `across(Ozone:Temp, median())`.
> Caused by error in `median.default()`:
> ! argument "x" is missing, with no default

This error arises because you’re calling the function with no input, i.e. median().

Anonymous Functions

If the function you pass to across() has its own arguments that you want to specify, you’ll need to use an anonymous function:

airquality |> 
  summarise(across(Ozone:Temp, \(x) median(x, na.rm = TRUE)))
>   Ozone Solar.R Wind Temp
> 1  31.5     205  9.7   79

It is called anonymous because we never explicitly give it a name with <-. Another term programmers use for this is “lambda function”.


You might also see older code that looks like this:

airquality |> 
  summarise(across(Ozone:Temp, ~ median(.x, na.rm = TRUE)))
>   Ozone Solar.R Wind Temp
> 1  31.5     205  9.7   79

This is another way to write anonymous functions, but it only works inside tidyverse functions and always uses the variable name .x. Base syntax is now recommended (e.g. \(x) x + 1).
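
The sketch below checks that the three ways of writing the anonymous function give the same summary. It assumes dplyr is loaded and R is version 4.1 or later (needed for the \(x) shorthand); the object names are made up for illustration.

```r
library(dplyr)

# Base R shorthand
with_lambda <- airquality |>
  summarise(across(Ozone:Temp, \(x) median(x, na.rm = TRUE)))

# The long form that \(x) abbreviates
with_function <- airquality |>
  summarise(across(Ozone:Temp, function(x) median(x, na.rm = TRUE)))

# Older tidyverse-only formula syntax
with_formula <- airquality |>
  summarise(across(Ozone:Temp, ~ median(.x, na.rm = TRUE)))

all.equal(with_lambda, with_function)
all.equal(with_lambda, with_formula)
```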

Calling multiple functions

What if we want to know how many missing values we removed, in addition to calculating the median without those values?

If you need to call multiple functions within across(), you’ll need to turn them into a named list.

airquality |> 
  summarise(across(Ozone:Temp, 
                   list(median = \(x) median(x, na.rm = TRUE),
                        n_miss = \(x) sum(is.na(x)))))
>   Ozone_median Ozone_n_miss Solar.R_median Solar.R_n_miss Wind_median
> 1         31.5           37            205              7         9.7
>   Wind_n_miss Temp_median Temp_n_miss
> 1           0          79           0

The names of the list are used to name the new variables. In fact, the columns are named using a glue specification {.col}_{.fn} where .col is the name of the original column and .fn is the name of the function.
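
If you would rather have the function name come first, you can supply your own glue specification through .names. A minimal sketch, assuming dplyr is loaded and using only two of the columns to keep the output short:

```r
library(dplyr)

renamed <- airquality |>
  summarise(across(
    c(Ozone, Solar.R),
    list(median = \(x) median(x, na.rm = TRUE),
         n_miss = \(x) sum(is.na(x))),
    .names = "{.fn}_{.col}"   # function name first, then column name
  ))

names(renamed)
```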

Column Names

By default, the output of across() is given the same names as the inputs. This means that across() inside of mutate() will replace existing columns.

df |> mutate(across(a:d, rescale01))
> # A tibble: 5 × 4
>       a     b      c      d
>   <dbl> <dbl>  <dbl>  <dbl>
> 1 0.696 0.284 0.462  0.597 
> 2 0.354 0.902 0      0.0147
> 3 0     0     0.0918 0     
> 4 1     0.439 0.656  0.372 
> 5 0.580 1     1      1

If you’d like to instead create new columns, you can use the .names argument to give the output new names.

df |> mutate(across(a:d, rescale01, .names = "{.col}_rescaled"))
> # A tibble: 5 × 8
>        a       b      c      d a_rescaled b_rescaled c_rescaled d_rescaled
>    <dbl>   <dbl>  <dbl>  <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
> 1 -0.130 -0.398   0.347  0.368      0.696      0.284     0.462      0.597 
> 2 -1.14   1.04   -0.433 -1.07       0.354      0.902     0          0.0147
> 3 -2.18  -1.06   -0.278 -1.10       0          0         0.0918     0     
> 4  0.765 -0.0376  0.674 -0.185      1          0.439     0.656      0.372 
> 5 -0.474  1.27    1.25   1.36       0.580      1         1          1

.col simply represents the original variable name.

if_any() and if_all()

across() works well with mutate() and summarize() but it has two variants that work with filter().

airquality |> filter(if_any(Ozone:Temp, is.na))

This is the same as airquality |> filter(is.na(Ozone) | is.na(Solar.R) | is.na(Wind) | is.na(Temp)):
>    Ozone Solar.R Wind Temp Month Day
> 1     NA      NA 14.3   56     5   5
> 2     28      NA 14.9   66     5   6
> 3     NA     194  8.6   69     5  10
> 4      7      NA  6.9   74     5  11
> 5     NA      66 16.6   57     5  25
> 6     NA     266 14.9   58     5  26
> 7     NA      NA  8.0   57     5  27
> 8     NA     286  8.6   78     6   1
> 9     NA     287  9.7   74     6   2
> 10    NA     242 16.1   67     6   3
> 11    NA     186  9.2   84     6   4
> 12    NA     220  8.6   85     6   5
> 13    NA     264 14.3   79     6   6
> 14    NA     273  6.9   87     6   8
> 15    NA     259 10.9   93     6  11
> 16    NA     250  9.2   92     6  12
> 17    NA     332 13.8   80     6  14
> 18    NA     322 11.5   79     6  15
> 19    NA     150  6.3   77     6  21
> 20    NA      59  1.7   76     6  22
> 21    NA      91  4.6   76     6  23
> 22    NA     250  6.3   76     6  24
> 23    NA     135  8.0   75     6  25
> 24    NA     127  8.0   78     6  26
> 25    NA      47 10.3   73     6  27
> 26    NA      98 11.5   80     6  28
> 27    NA      31 14.9   77     6  29
> 28    NA     138  8.0   83     6  30
> 29    NA     101 10.9   84     7   4
> 30    NA     139  8.6   82     7  11
> 31    NA     291 14.9   91     7  14
> 32    NA     258  9.7   81     7  22
> 33    NA     295 11.5   82     7  23
> 34    78      NA  6.9   86     8   4
> 35    35      NA  7.4   85     8   5
> 36    66      NA  4.6   87     8   6
> 37    NA     222  8.6   92     8  10
> 38    NA     137 11.5   86     8  11
> 39    NA      64 11.5   79     8  15
> 40    NA     255 12.6   75     8  23
> 41    NA     153  5.7   88     8  27
> 42    NA     145 13.2   77     9  27

airquality |> filter(if_all(Ozone:Temp, is.na))

This is the same as airquality |> filter(is.na(Ozone) & is.na(Solar.R) & is.na(Wind) & is.na(Temp)):
> [1] Ozone   Solar.R Wind    Temp    Month   Day    
> <0 rows> (or 0-length row.names)
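
A short sketch, assuming dplyr is loaded, that confirms the equivalence stated above and shows one common follow-up: negating if_any() to keep only the rows with no missing values in the selected columns. The object names are made up for illustration.

```r
library(dplyr)

# Rows with a missing value in at least one of the four columns
any_missing <- airquality |>
  filter(if_any(Ozone:Temp, is.na))

# The same filter written out by hand
any_missing_explicit <- airquality |>
  filter(is.na(Ozone) | is.na(Solar.R) | is.na(Wind) | is.na(Temp))

# Negating if_any() keeps the rows that are complete in those columns
complete_rows <- airquality |>
  filter(!if_any(Ozone:Temp, is.na))

all.equal(any_missing, any_missing_explicit)
nrow(any_missing) + nrow(complete_rows) == nrow(airquality)
```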

across() in Functions

Naturally, across() lends itself to functions because it allows you to operate on multiple columns simultaneously.

Just remember to embrace the argument with {{ }} when it is used for column selection, since the first argument of across() uses tidy-select, a form of tidy evaluation.

summarize_means <- function(df, summary_vars = where(is.numeric)) {
  df |> 
    summarize(
      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
      n = n(),
      .groups = "drop"
    )
}


diamonds |> 
  group_by(cut) |> 
  summarize_means(c(carat, x:z))
> # A tibble: 5 × 6
>   cut       carat     x     y     z     n
>   <ord>     <dbl> <dbl> <dbl> <dbl> <int>
> 1 Fair      1.05   6.25  6.18  3.98  1610
> 2 Good      0.849  5.84  5.85  3.64  4906
> 3 Very Good 0.806  5.74  5.77  3.56 12082
> 4 Premium   0.892  5.97  5.94  3.65 13791
> 5 Ideal     0.703  5.51  5.52  3.40 21551

Reading in Multiple Files

Bad repetition redux

Imagine you have a directory full of Excel spreadsheets you want to read into R.

data2019 <- readxl::read_excel("data/y2019.xlsx")
data2020 <- readxl::read_excel("data/y2020.xlsx")
data2021 <- readxl::read_excel("data/y2021.xlsx")
data2022 <- readxl::read_excel("data/y2022.xlsx")

data <- bind_rows(data2019, data2020, data2021, data2022)

You could technically do this with copy and paste, but we know that is neither the most efficient nor the least error-prone approach.

Not to mention how inconvenient this would be if you had hundreds of files to read in and combine.

The iterative approach involves three broad steps:

  • use list.files() to list all the files in a directory
  • use purrr::map() to read each of them into a list
  • use purrr::list_rbind() to combine them into a single data frame

Step 1: Listing Files in a Directory

The first part of this method involves creating a character vector of all the file paths for the files you want to read in. We’ll motivate this example by reading in the gapminder data that’s saved in separate Excel sheets by year in my working directory.


paths <- list.files("data/gapminder",
                    pattern = "[.]xlsx$",
                    full.names = TRUE)
paths
>  [1] "data/gapminder/1952.xlsx" "data/gapminder/1957.xlsx"
>  [3] "data/gapminder/1962.xlsx" "data/gapminder/1967.xlsx"
>  [5] "data/gapminder/1972.xlsx" "data/gapminder/1977.xlsx"
>  [7] "data/gapminder/1982.xlsx" "data/gapminder/1987.xlsx"
>  [9] "data/gapminder/1992.xlsx" "data/gapminder/1997.xlsx"
> [11] "data/gapminder/2002.xlsx" "data/gapminder/2007.xlsx"

  • The first argument, path, is the directory to look within.
  • pattern is a regular expression used to filter the file names. The most common pattern is something like [.]xlsx$ or [.]csv$ to find all files with a specified extension.
  • full.names determines whether or not the directory name should be included in the output. You almost always want this to be TRUE.
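
If you want to try list.files() without the gapminder spreadsheets, here is a self-contained sketch in base R. The temporary directory and the file names in it are made up for illustration.

```r
# Create a throwaway directory with a mix of file types (hypothetical names)
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("1952.csv", "1957.csv", "notes.txt")))

# pattern keeps only the .csv files; full.names = TRUE keeps the directory
csv_paths <- list.files(demo_dir, pattern = "[.]csv$", full.names = TRUE)
basename(csv_paths)

# Without full.names = TRUE only the bare file names come back,
# which cannot be read unless demo_dir is the working directory
csv_names <- list.files(demo_dir, pattern = "[.]csv$")
```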

Reading Files into a List

Now we want to read these Excel sheets into a single object so we can use iteration in the next step! A list is the perfect tool for this.

files <- list(
  readxl::read_excel("data/gapminder/1952.xlsx"),
  readxl::read_excel("data/gapminder/1957.xlsx"),
  readxl::read_excel("data/gapminder/1962.xlsx"),
  readxl::read_excel("data/gapminder/1967.xlsx"),
  readxl::read_excel("data/gapminder/1972.xlsx"),
  readxl::read_excel("data/gapminder/1977.xlsx"),
  readxl::read_excel("data/gapminder/1982.xlsx"),
  readxl::read_excel("data/gapminder/1987.xlsx"),
  readxl::read_excel("data/gapminder/1992.xlsx"),
  readxl::read_excel("data/gapminder/1997.xlsx"),
  readxl::read_excel("data/gapminder/2002.xlsx"),
  readxl::read_excel("data/gapminder/2007.xlsx")
)

Unfortunately, this is just as tedious a method as reading in all the separate file paths and creating individual data frame objects!

Step 2: Using map() instead!

Instead of listing out all the read_excel() calls in our list, we can use the map() function from the tidyverse’s purrr package. map() is similar to across(), but instead of doing something to each column in a data frame, it does something to each element of a vector.

files <- map(paths, readxl::read_excel)

Now, what does files contain?

files[[1]]
> # A tibble: 142 × 5
>    country     continent lifeExp      pop gdpPercap
>    <chr>       <chr>       <dbl>    <dbl>     <dbl>
>  1 Afghanistan Asia         28.8  8425333      779.
>  2 Albania     Europe       55.2  1282697     1601.
>  3 Algeria     Africa       43.1  9279525     2449.
>  4 Angola      Africa       30.0  4232095     3521.
>  5 Argentina   Americas     62.5 17876956     5911.
>  6 Australia   Oceania      69.1  8691212    10040.
>  7 Austria     Europe       66.8  6927772     6137.
>  8 Bahrain     Asia         50.9   120447     9867.
>  9 Bangladesh  Asia         37.5 46886859      684.
> 10 Belgium     Europe       68    8730405     8343.
> # ℹ 132 more rows

Step 3: Combine Dataframes into One

Now that we have all our individual dataframes in elements of a list, we can use list_rbind() to combine them into one dataframe.

list_rbind(files)
> # A tibble: 1,704 × 5
>    country     continent lifeExp      pop gdpPercap
>    <chr>       <chr>       <dbl>    <dbl>     <dbl>
>  1 Afghanistan Asia         28.8  8425333      779.
>  2 Albania     Europe       55.2  1282697     1601.
>  3 Algeria     Africa       43.1  9279525     2449.
>  4 Angola      Africa       30.0  4232095     3521.
>  5 Argentina   Americas     62.5 17876956     5911.
>  6 Australia   Oceania      69.1  8691212    10040.
>  7 Austria     Europe       66.8  6927772     6137.
>  8 Bahrain     Asia         50.9   120447     9867.
>  9 Bangladesh  Asia         37.5 46886859      684.
> 10 Belgium     Europe       68    8730405     8343.
> # ℹ 1,694 more rows

The super efficient, full code for the last two steps would therefore be:

paths |> 
  map(readxl::read_excel) |> 
  list_rbind()
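
Here is a self-contained sketch of the same three-step pattern that you can run without the gapminder spreadsheets. As an assumption for portability it writes two tiny CSV files to a temporary directory and reads them with base R's read.csv() rather than readxl::read_excel(); the file names and values are made up.

```r
library(purrr)

# Set up two small files to read back in (hypothetical data)
demo_dir <- file.path(tempdir(), "map_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(country = c("A", "B"), pop = c(1, 2)),
          file.path(demo_dir, "1952.csv"), row.names = FALSE)
write.csv(data.frame(country = c("A", "B"), pop = c(3, 4)),
          file.path(demo_dir, "1957.csv"), row.names = FALSE)

combined <- list.files(demo_dir, pattern = "[.]csv$", full.names = TRUE) |>  # step 1
  map(read.csv) |>                                                          # step 2
  list_rbind()                                                              # step 3

combined
```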

Data in the Filepath

You may have noticed that we’re missing a year indicator in our final dataset. That’s because that information is actually a part of the filename itself.

There’s a way to include the filename in the data but we have to add another step to our paths pipeline:

files <- paths |> 
  set_names(basename) |>
  map(readxl::read_excel)

The set_names() function takes the function basename(), which extracts just the file name from the full path. The set_names() line therefore creates a named vector of the file paths in which the names are the filenames.

What is this doing?

paths |> 
  set_names(basename)
>                  1952.xlsx                  1957.xlsx 
> "data/gapminder/1952.xlsx" "data/gapminder/1957.xlsx" 
>                  1962.xlsx                  1967.xlsx 
> "data/gapminder/1962.xlsx" "data/gapminder/1967.xlsx" 
>                  1972.xlsx                  1977.xlsx 
> "data/gapminder/1972.xlsx" "data/gapminder/1977.xlsx" 
>                  1982.xlsx                  1987.xlsx 
> "data/gapminder/1982.xlsx" "data/gapminder/1987.xlsx" 
>                  1992.xlsx                  1997.xlsx 
> "data/gapminder/1992.xlsx" "data/gapminder/1997.xlsx" 
>                  2002.xlsx                  2007.xlsx 
> "data/gapminder/2002.xlsx" "data/gapminder/2007.xlsx"

What is this doing?

paths |> 
  set_names(basename) |> 
  map(readxl::read_excel)
> $`1952.xlsx`
> # A tibble: 142 × 5
>    country     continent lifeExp      pop gdpPercap
>    <chr>       <chr>       <dbl>    <dbl>     <dbl>
>  1 Afghanistan Asia         28.8  8425333      779.
>  2 Albania     Europe       55.2  1282697     1601.
>  3 Algeria     Africa       43.1  9279525     2449.
>  4 Angola      Africa       30.0  4232095     3521.
>  5 Argentina   Americas     62.5 17876956     5911.
>  6 Australia   Oceania      69.1  8691212    10040.
>  7 Austria     Europe       66.8  6927772     6137.
>  8 Bahrain     Asia         50.9   120447     9867.
>  9 Bangladesh  Asia         37.5 46886859      684.
> 10 Belgium     Europe       68    8730405     8343.
> # ℹ 132 more rows
> 
> $`1957.xlsx`
> # A tibble: 142 × 5
>    country     continent lifeExp      pop gdpPercap
>    <chr>       <chr>       <dbl>    <dbl>     <dbl>
>  1 Afghanistan Asia         30.3  9240934      821.
>  2 Albania     Europe       59.3  1476505     1942.
>  3 Algeria     Africa       45.7 10270856     3014.
>  4 Angola      Africa       32.0  4561361     3828.
>  5 Argentina   Americas     64.4 19610538     6857.
>  6 Australia   Oceania      70.3  9712569    10950.
>  7 Austria     Europe       67.5  6965860     8843.
>  8 Bahrain     Asia         53.8   138655    11636.
>  9 Bangladesh  Asia         39.3 51365468      662.
> 10 Belgium     Europe       69.2  8989111     9715.
> # ℹ 132 more rows
> 
> $`1962.xlsx`
> # A tibble: 142 × 5
>    country     continent lifeExp      pop gdpPercap
>    <chr>       <chr>       <dbl>    <dbl>     <dbl>
>  1 Afghanistan Asia         32.0 10267083      853.
>  2 Albania     Europe       64.8  1728137     2313.
>  3 Algeria     Africa       48.3 11000948     2551.
>  4 Angola      Africa       34    4826015     4269.
>  5 Argentina   Americas     65.1 21283783     7133.
>  6 Australia   Oceania      70.9 10794968    12217.
>  7 Austria     Europe       69.5  7129864    10751.
>  8 Bahrain     Asia         56.9   171863    12753.
>  9 Bangladesh  Asia         41.2 56839289      686.
> 10 Belgium     Europe       70.2  9218400    10991.
> # ℹ 132 more rows
> 
> $`1967.xlsx`
> # A tibble: 142 × 5
>    country     continent lifeExp      pop gdpPercap
>    <chr>       <chr>       <dbl>    <dbl>     <dbl>
>  1 Afghanistan Asia         34.0 11537966      836.
>  2 Albania     Europe       66.2  1984060     2760.
>  3 Algeria     Africa       51.4 12760499     3247.
>  4 Angola      Africa       36.0  5247469     5523.
>  5 Argentina   Americas     65.6 22934225     8053.
>  6 Australia   Oceania      71.1 11872264    14526.
>  7 Austria     Europe       70.1  7376998    12835.
>  8 Bahrain     Asia         59.9   202182    14805.
>  9 Bangladesh  Asia         43.5 62821884      721.
> 10 Belgium     Europe       70.9  9556500    13149.
> # ℹ 132 more rows
> 
> $`1972.xlsx`
> # A tibble: 142 × 5
>    country     continent lifeExp      pop gdpPercap
>    <chr>       <chr>       <dbl>    <dbl>     <dbl>
>  1 Afghanistan Asia         36.1 13079460      740.
>  2 Albania     Europe       67.7  2263554     3313.
>  3 Algeria     Africa       54.5 14760787     4183.
>  4 Angola      Africa       37.9  5894858     5473.
>  5 Argentina   Americas     67.1 24779799     9443.
>  6 Australia   Oceania      71.9 13177000    16789.
>  7 Austria     Europe       70.6  7544201    16662.
>  8 Bahrain     Asia         63.3   230800    18269.
>  9 Bangladesh  Asia         45.3 70759295      630.
> 10 Belgium     Europe       71.4  9709100    16672.
> # ℹ 132 more rows
> 
> $`1977.xlsx`
> # A tibble: 142 × 5
>    country     continent lifeExp      pop gdpPercap
>    <chr>       <chr>       <dbl>    <dbl>     <dbl>
>  1 Afghanistan Asia         38.4 14880372      786.
>  2 Albania     Europe       68.9  2509048     3533.
>  3 Algeria     Africa       58.0 17152804     4910.
>  4 Angola      Africa       39.5  6162675     3009.
>  5 Argentina   Americas     68.5 26983828    10079.
>  6 Australia   Oceania      73.5 14074100    18334.
>  7 Austria     Europe       72.2  7568430    19749.
>  8 Bahrain     Asia         65.6   297410    19340.
>  9 Bangladesh  Asia         46.9 80428306      660.
> 10 Belgium     Europe       72.8  9821800    19118.
> # ℹ 132 more rows
> 
> $`1982.xlsx`
> # A tibble: 142 × 5
>    country     continent lifeExp      pop gdpPercap
>    <chr>       <chr>       <dbl>    <dbl>     <dbl>
>  1 Afghanistan Asia         39.9 12881816      978.
>  2 Albania     Europe       70.4  2780097     3631.
>  3 Algeria     Africa       61.4 20033753     5745.
>  4 Angola      Africa       39.9  7016384     2757.
>  5 Argentina   Americas     69.9 29341374     8998.
>  6 Australia   Oceania      74.7 15184200    19477.
>  7 Austria     Europe       73.2  7574613    21597.
>  8 Bahrain     Asia         69.1   377967    19211.
>  9 Bangladesh  Asia         50.0 93074406      677.
> 10 Belgium     Europe       73.9  9856303    20980.
> # ℹ 132 more rows
> 
> $`1987.xlsx`
> # A tibble: 142 × 5
>    country     continent lifeExp       pop gdpPercap
>    <chr>       <chr>       <dbl>     <dbl>     <dbl>
>  1 Afghanistan Asia         40.8  13867957      852.
>  2 Albania     Europe       72     3075321     3739.
>  3 Algeria     Africa       65.8  23254956     5681.
>  4 Angola      Africa       39.9   7874230     2430.
>  5 Argentina   Americas     70.8  31620918     9140.
>  6 Australia   Oceania      76.3  16257249    21889.
>  7 Austria     Europe       74.9   7578903    23688.
>  8 Bahrain     Asia         70.8    454612    18524.
>  9 Bangladesh  Asia         52.8 103764241      752.
> 10 Belgium     Europe       75.4   9870200    22526.
> # ℹ 132 more rows
> 
> $`1992.xlsx`
> # A tibble: 142 × 5
>    country     continent lifeExp       pop gdpPercap
>    <chr>       <chr>       <dbl>     <dbl>     <dbl>
>  1 Afghanistan Asia         41.7  16317921      649.
>  2 Albania     Europe       71.6   3326498     2497.
>  3 Algeria     Africa       67.7  26298373     5023.
>  4 Angola      Africa       40.6   8735988     2628.
>  5 Argentina   Americas     71.9  33958947     9308.
>  6 Australia   Oceania      77.6  17481977    23425.
>  7 Austria     Europe       76.0   7914969    27042.
>  8 Bahrain     Asia         72.6    529491    19036.
>  9 Bangladesh  Asia         56.0 113704579      838.
> 10 Belgium     Europe       76.5  10045622    25576.
> # ℹ 132 more rows
> 
> $`1997.xlsx`
> # A tibble: 142 × 5
>    country     continent lifeExp       pop gdpPercap
>    <chr>       <chr>       <dbl>     <dbl>     <dbl>
>  1 Afghanistan Asia         41.8  22227415      635.
>  2 Albania     Europe       73.0   3428038     3193.
>  3 Algeria     Africa       69.2  29072015     4797.
>  4 Angola      Africa       41.0   9875024     2277.
>  5 Argentina   Americas     73.3  36203463    10967.
>  6 Australia   Oceania      78.8  18565243    26998.
>  7 Austria     Europe       77.5   8069876    29096.
>  8 Bahrain     Asia         73.9    598561    20292.
>  9 Bangladesh  Asia         59.4 123315288      973.
> 10 Belgium     Europe       77.5  10199787    27561.
> # ℹ 132 more rows
> 
> $`2002.xlsx`
> # A tibble: 142 × 5
>    country     continent lifeExp       pop gdpPercap
>    <chr>       <chr>       <dbl>     <dbl>     <dbl>
>  1 Afghanistan Asia         42.1  25268405      727.
>  2 Albania     Europe       75.7   3508512     4604.
>  3 Algeria     Africa       71.0  31287142     5288.
>  4 Angola      Africa       41.0  10866106     2773.
>  5 Argentina   Americas     74.3  38331121     8798.
>  6 Australia   Oceania      80.4  19546792    30688.
>  7 Austria     Europe       79.0   8148312    32418.
>  8 Bahrain     Asia         74.8    656397    23404.
>  9 Bangladesh  Asia         62.0 135656790     1136.
> 10 Belgium     Europe       78.3  10311970    30486.
> # ℹ 132 more rows
> 
> $`2007.xlsx`
> # A tibble: 142 × 5
>    country     continent lifeExp       pop gdpPercap
>    <chr>       <chr>       <dbl>     <dbl>     <dbl>
>  1 Afghanistan Asia         43.8  31889923      975.
>  2 Albania     Europe       76.4   3600523     5937.
>  3 Algeria     Africa       72.3  33333216     6223.
>  4 Angola      Africa       42.7  12420476     4797.
>  5 Argentina   Americas     75.3  40301927    12779.
>  6 Australia   Oceania      81.2  20434176    34435.
>  7 Austria     Europe       79.8   8199783    36126.
>  8 Bahrain     Asia         75.6    708573    29796.
>  9 Bangladesh  Asia         64.1 150448339     1391.
> 10 Belgium     Europe       79.4  10392226    33693.
> # ℹ 132 more rows

Data in the Filepath

To create a year variable we need to tell list_rbind to save the filename information.

gapminder <- paths |> 
  set_names(basename) |> 
  map(readxl::read_excel) |> 
  list_rbind(names_to = "year") |>
  mutate(year = parse_number(year))
gapminder
> # A tibble: 1,704 × 6
>     year country     continent lifeExp      pop gdpPercap
>    <dbl> <chr>       <chr>       <dbl>    <dbl>     <dbl>
>  1  1952 Afghanistan Asia         28.8  8425333      779.
>  2  1952 Albania     Europe       55.2  1282697     1601.
>  3  1952 Algeria     Africa       43.1  9279525     2449.
>  4  1952 Angola      Africa       30.0  4232095     3521.
>  5  1952 Argentina   Americas     62.5 17876956     5911.
>  6  1952 Australia   Oceania      69.1  8691212    10040.
>  7  1952 Austria     Europe       66.8  6927772     6137.
>  8  1952 Bahrain     Asia         50.9   120447     9867.
>  9  1952 Bangladesh  Asia         37.5 46886859      684.
> 10  1952 Belgium     Europe       68    8730405     8343.
> # ℹ 1,694 more rows

  • list_rbind(names_to = "year") saves the name of each list element (the filename) as the variable year.
  • parse_number() extracts just the numeric part of the filename, which is the actual year.

write_csv(gapminder, "gapminder.csv")

Be sure to save your work so you can simply read in one file when working on this project in the future!
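
The same idea as a self-contained sketch you can run anywhere. It assumes purrr, dplyr and readr are installed, and (as a stand-in for the Excel files) it uses two tiny made-up CSV files read with base R's read.csv().

```r
library(purrr)
library(dplyr)
library(readr)

# Two small files whose names carry the year (hypothetical data)
demo_dir <- file.path(tempdir(), "names_to_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(country = c("A", "B"), pop = c(1, 2)),
          file.path(demo_dir, "1952.csv"), row.names = FALSE)
write.csv(data.frame(country = c("A", "B"), pop = c(3, 4)),
          file.path(demo_dir, "1957.csv"), row.names = FALSE)

with_year <- list.files(demo_dir, pattern = "[.]csv$", full.names = TRUE) |>
  set_names(basename) |>               # name each path by its file name
  map(read.csv) |>                     # names carry over to the list elements
  list_rbind(names_to = "year") |>     # store each element's name in `year`
  mutate(year = parse_number(year))    # "1952.csv" becomes 1952

with_year
```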

More Complex Cases

Complicated Filenames

There may be other variables stored in the directory name, or maybe the file name contains multiple bits of data. If so, use set_names() (without arguments) to record the full path, then use separate_wider_delim() and friends to turn them into useful columns. See example at the end of this section.
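
As a rough sketch of that approach, the code below splits some made-up paths into columns. The paths, and the idea that each file name encodes a country and a year, are assumptions for illustration only; separate_wider_delim() needs tidyr 1.3.0 or later.

```r
library(tidyr)
library(tibble)

# Hypothetical paths of the form "directory/country_year.extension"
file_info <- tibble(path = c("data/kenya_1952.csv", "data/peru_1957.csv")) |>
  # Split off the directory; NA in `names` drops that piece
  separate_wider_delim(path, delim = "/", names = c(NA, "file")) |>
  # Split the extension from the rest of the file name
  separate_wider_delim(file, delim = ".", names = c("file", "ext")) |>
  # Split the remaining name into its two pieces of data
  separate_wider_delim(file, delim = "_", names = c("country", "year"))

file_info
```

Note that year is still a character column at this point; parse_number() or as.integer() would convert it.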

Untidy data of the same structure

You can use map many times to perform different tidying and data manipulation tasks before combining datasets. Alternatively you can list_rbind first and then perform data manipulation tasks using a standard dplyr approach. See examples here.

Heterogeneous data

Read this section of “R for Data Science”

Troubleshooting

Read this section of “R for Data Science”

Saving Multiple Outputs

Writing multiple csv files

Let’s imagine we want to save multiple datasets based on a feature of the data.

For example, what if we want a different csv for each clarity type in the diamonds dataset?

The easiest way to make these individual datasets is using group_nest():

by_clarity <- diamonds |> 
  group_nest(clarity) |>
  mutate(path = str_glue("diamonds-{clarity}.csv"))

by_clarity
> # A tibble: 8 × 3
>   clarity               data path             
>   <ord>   <list<tibble[,9]>> <glue>           
> 1 I1               [741 × 9] diamonds-I1.csv  
> 2 SI2            [9,194 × 9] diamonds-SI2.csv 
> 3 SI1           [13,065 × 9] diamonds-SI1.csv 
> 4 VS2           [12,258 × 9] diamonds-VS2.csv 
> 5 VS1            [8,171 × 9] diamonds-VS1.csv 
> 6 VVS2           [5,066 × 9] diamonds-VVS2.csv
> 7 VVS1           [3,655 × 9] diamonds-VVS1.csv
> 8 IF             [1,790 × 9] diamonds-IF.csv

  • group_nest() nests a tibble using a grouping specification. You can add the argument keep = TRUE if you want to include the grouping variable in the nested tibbles.
  • The mutate() call creates a column that gives the name of the output file.

by_clarity$data[[1]]
> # A tibble: 741 × 9
>    carat cut       color depth table price     x     y     z
>    <dbl> <ord>     <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
>  1  0.32 Premium   E      60.9    58   345  4.38  4.42  2.68
>  2  1.17 Very Good J      60.2    61  2774  6.83  6.9   4.13
>  3  1.01 Premium   F      61.8    60  2781  6.39  6.36  3.94
>  4  1.01 Fair      E      64.5    58  2788  6.29  6.21  4.03
>  5  0.96 Ideal     F      60.7    55  2801  6.37  6.41  3.88
>  6  1.04 Premium   G      62.2    58  2801  6.46  6.41  4   
>  7  1    Fair      G      66.4    59  2808  6.16  6.09  4.07
>  8  1.2  Fair      F      64.6    56  2809  6.73  6.66  4.33
>  9  0.43 Very Good E      58.4    62   555  4.94  5     2.9 
> 10  1.02 Premium   G      60.3    58  2815  6.55  6.5   3.94
> # ℹ 731 more rows

Using walk()

We want to carry out the following calls, but we can’t simply use map() because two arguments now vary: the data and the path.

write_csv(by_clarity$data[[1]], by_clarity$path[[1]])
write_csv(by_clarity$data[[2]], by_clarity$path[[2]])
write_csv(by_clarity$data[[3]], by_clarity$path[[3]])
...
write_csv(by_clarity$data[[8]], by_clarity$path[[8]])

So we could use map2(), which allows us to map over 2 inputs!

map2(by_clarity$data, by_clarity$path, write_csv)

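If we ran the above, map2() would call write_csv() once per pair, passing each element of data as the first argument and the matching element of path as the second. It would also print out all the datasets as it saves them, because write_csv() returns its input and map2() collects those results in a list.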

Since we don’t actually care about the output (i.e. the printed datasets) and only want the files to be written, there’s an even better function we can use: walk2().

walk2(by_clarity$data, by_clarity$path, write_csv)

walk2() does exactly the same work as map2() but throws the output away. We’re left with just the file-saving behavior, which is what we’re after.
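
As a self-contained sketch of the same pattern, the example below writes three small csv files into a temporary directory. The split of mtcars by cyl and the file names are assumptions made for illustration, not part of the lecture's diamonds example.

```r
library(purrr)
library(readr)

# Split a built-in data frame into a named list of smaller data frames
pieces <- split(mtcars, mtcars$cyl)

# One (hypothetical) output path per piece, inside a temporary directory
paths <- file.path(tempdir(), paste0("mtcars-cyl-", names(pieces), ".csv"))

# walk2() calls write_csv(pieces[[i]], paths[[i]]) for each i
# and prints nothing to the console
walk2(pieces, paths, write_csv)

# Confirm that every file was written
file.exists(paths)
```

Writing to tempdir() keeps the experiment from cluttering your project folder.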

Saving multiple plots

The same basic approach can be used to save multiple plots.

First let’s create a function that draws the plot we want.

carat_histogram <- function(df) {
  ggplot(df, aes(x = carat)) + geom_histogram(binwidth = 0.1)  
}

carat_histogram(by_clarity$data[[1]])

Now we can use map() to create a list of many plots and their eventual file paths:

by_clarity <- by_clarity |> 
  mutate(
    plot = map(data, carat_histogram),
    path = str_glue("clarity-{clarity}.png")
  )
by_clarity
> # A tibble: 8 × 4
>   clarity               data path             plot  
>   <ord>   <list<tibble[,9]>> <glue>           <list>
> 1 I1               [741 × 9] clarity-I1.png   <gg>  
> 2 SI2            [9,194 × 9] clarity-SI2.png  <gg>  
> 3 SI1           [13,065 × 9] clarity-SI1.png  <gg>  
> 4 VS2           [12,258 × 9] clarity-VS2.png  <gg>  
> 5 VS1            [8,171 × 9] clarity-VS1.png  <gg>  
> 6 VVS2           [5,066 × 9] clarity-VVS2.png <gg>  
> 7 VVS1           [3,655 × 9] clarity-VVS1.png <gg>  
> 8 IF             [1,790 × 9] clarity-IF.png   <gg>
by_clarity$plot[[1]]

Then use walk2() with ggsave() to save each plot:

walk2(
  by_clarity$path,
  by_clarity$plot,
  \(path, plot) ggsave(path, plot, width = 6, height = 6)
)


Which is shorthand for:

ggsave(by_clarity$path[[1]], by_clarity$plot[[1]], width = 6, height = 6)
ggsave(by_clarity$path[[2]], by_clarity$plot[[2]], width = 6, height = 6)
ggsave(by_clarity$path[[3]], by_clarity$plot[[3]], width = 6, height = 6)
...
ggsave(by_clarity$path[[8]], by_clarity$plot[[8]], width = 6, height = 6)

Apply Family

lapply

Base R has its own family of iterative functions: the apply family.

The closest one-to-one translation of map() in this family is lapply() (list apply). Since all of the examples of map() in today’s lecture are fairly simple, you can swap in lapply() for any of them.

lapply(swiss, FUN = median)
> $Fertility
> [1] 70.4
> 
> $Agriculture
> [1] 54.1
> 
> $Examination
> [1] 16
> 
> $Education
> [1] 8
> 
> $Catholic
> [1] 15.14
> 
> $Infant.Mortality
> [1] 20

Put simply, lapply() applies a function to each element of a list of any kind (e.g. a data frame, which is a list of columns) and returns a list.
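
Extra arguments for the function can be supplied either after FUN or inside an anonymous function. A small sketch using the built-in airquality data, which contains missing values (the object names are illustrative):

```r
# Pass na.rm = TRUE along to median() through lapply()'s ... argument
med_dots <- lapply(airquality, median, na.rm = TRUE)

# Or wrap the call in an anonymous function
med_anon <- lapply(airquality, \(x) median(x, na.rm = TRUE))

# Both approaches give the same named list
identical(med_dots, med_anon)
med_dots$Ozone
```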

sapply(): Simple lapply()

A downside of lapply() is that lists can be hard to work with, so sapply() always tries to simplify the result to a vector or matrix.

sapply(swiss, FUN = median)
>        Fertility      Agriculture      Examination        Education 
>            70.40            54.10            16.00             8.00 
>         Catholic Infant.Mortality 
>            15.14            20.00

In this case, our list was simplified to a named numeric vector. However, the simplification can fail and give you an unexpected type, so proceed with caution if you intend to use sapply().
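
To make the risk concrete, here is a small sketch of how the type sapply() returns depends on what the function happens to produce. The toy inputs and object names are assumptions for illustration.

```r
# Each call returns one number: simplified to a numeric vector
res_vector <- sapply(1:3, \(i) i^2)

# Each call returns two numbers: simplified to a 2 x 3 matrix
res_matrix <- sapply(1:3, \(i) c(i, i^2))

# Results have different lengths: no simplification, a list comes back
res_list <- sapply(1:3, \(i) seq_len(i))

# Empty input: an empty list, not an empty numeric vector
res_empty <- sapply(integer(0), \(i) i^2)

class(res_vector)
dim(res_matrix)
class(res_list)
class(res_empty)
```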

vapply(): vector apply

This version takes an additional argument, FUN.VALUE, that specifies the expected type and length of each result. This ensures that simplification occurs the same way regardless of the input; if a result doesn’t match, vapply() throws an error.

vapply(swiss, median, double(1))
>        Fertility      Agriculture      Examination        Education 
>            70.40            54.10            16.00             8.00 
>         Catholic Infant.Mortality 
>            15.14            20.00
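
The sketch below illustrates what the type template buys you: a mismatch produces an error rather than a surprising result, and empty input still returns a vector of the declared type. Object names are illustrative.

```r
# Correct template: each median is a single double
ok <- vapply(swiss, median, FUN.VALUE = double(1))

# Wrong template: vapply() stops with an error instead of guessing;
# tryCatch() captures the error object so the script keeps running
bad <- tryCatch(
  vapply(swiss, median, FUN.VALUE = character(1)),
  error = \(e) e
)
inherits(bad, "error")

# Empty input still yields a vector of the declared type
empty <- vapply(list(), length, FUN.VALUE = integer(1))
class(empty)
```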

tapply()

Another important member of the apply family is tapply(), which computes a single grouped summary. Below, the dplyr version comes first, followed by its tapply() equivalent.

diamonds |> 
  group_by(cut) |> 
  summarize(price = mean(price))
> # A tibble: 5 × 2
>   cut       price
>   <ord>     <dbl>
> 1 Fair      4359.
> 2 Good      3929.
> 3 Very Good 3982.
> 4 Premium   4584.
> 5 Ideal     3458.
tapply(diamonds$price, diamonds$cut, mean)
>      Fair      Good Very Good   Premium     Ideal 
>      4359      3929      3982      4584      3458


Unfortunately, tapply() returns its results as a named vector, which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame.
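
A minimal sketch of those gymnastics for the simplest case, one summary and one grouping variable, using mtcars so that it runs without extra packages (the column and object names are illustrative):

```r
# Mean mpg for each number of cylinders, as a named result
avg_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)

# The group labels live in the names, so rebuild the columns by hand
mpg_by_cyl <- data.frame(
  cyl = names(avg_mpg),           # note: the labels come back as character
  mean_mpg = as.vector(avg_mpg)
)

mpg_by_cyl
```

With several grouping variables or several summaries this bookkeeping grows quickly, which is where group_by() and summarize() are more convenient.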

apply()

Lastly, there’s apply(), which works over matrices (a data frame is first coerced to a matrix, so it works best when every column has the same type). You can apply the function to each row (MARGIN = 1) or column (MARGIN = 2).


apply(swiss, MARGIN = 2, FUN = summary)
>         Fertility Agriculture Examination Education Catholic Infant.Mortality
> Min.        35.00        1.20        3.00      1.00    2.150            10.80
> 1st Qu.     64.70       35.90       12.00      6.00    5.195            18.15
> Median      70.40       54.10       16.00      8.00   15.140            20.00
> Mean        70.14       50.66       16.49     10.98   41.144            19.94
> 3rd Qu.     78.45       67.65       22.00     12.00   93.125            21.70
> Max.        92.50       89.70       37.00     53.00  100.000            26.60
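
The two margins are easiest to compare on a tiny matrix. The sketch below also shows why mixed-type data frames need care: apply() converts its input to a matrix first, so every column takes on a common type. The toy objects are assumptions for illustration.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled column by column

row_sums <- apply(m, MARGIN = 1, FUN = sum)  # one value per row
col_sums <- apply(m, MARGIN = 2, FUN = sum)  # one value per column

# A data frame with mixed types becomes a character matrix first,
# so even the integer column x is seen as character
mixed <- data.frame(x = 1:2, y = c("a", "b"))
col_classes <- apply(mixed, MARGIN = 2, FUN = class)

row_sums
col_sums
col_classes
```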

for loops

Anatomy of a for loop

for loops are the fundamental building block of iteration that both the apply and map families use under the hood.

for loops are a powerful and general tool, and they will be important to learn as you become a more experienced R programmer.


The basic structure of a for loop looks like this:

for (element in vector) {
  # do something with element
}

Parallel with walk()

The most straightforward use of for loops is to achieve the same effect as walk(): call some function with a side-effect on each element of a vector/list.

A very basic example:

for (i in 1:10) {
  print(i)
}
> [1] 1
> [1] 2
> [1] 3
> [1] 4
> [1] 5
> [1] 6
> [1] 7
> [1] 8
> [1] 9
> [1] 10

The walk() equivalent:

1:10 |>
  walk(\(x) print(x))
> [1] 1
> [1] 2
> [1] 3
> [1] 4
> [1] 5
> [1] 6
> [1] 7
> [1] 8
> [1] 9
> [1] 10

Things get a little trickier if you want to save the output of the for loop.
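
One common pattern, sketched here under the assumption that every iteration produces a single number, is to create the output container before the loop and fill it in by position. The name col_means is illustrative.

```r
# Pre-allocate a container of the right type and length
col_means <- vector("double", length = ncol(swiss))
names(col_means) <- names(swiss)

# Fill it in one element at a time
for (i in seq_along(swiss)) {
  col_means[[i]] <- mean(swiss[[i]])
}

col_means
```

Looping over seq_along(swiss) rather than over the columns themselves gives you the index needed to store each result, and the loop ends up with the same values as vapply(swiss, mean, double(1)).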

When you’re ready to dive into more advanced functional programming topics, including loops, check out the Control Flow and Functional Programming chapters of Advanced R.

Lab

Iteration with across

  1. Compute the number of unique values in each column of palmerpenguins::penguins.

  2. Compute the mean of every column in mtcars.

  3. Group diamonds by cut, clarity, and color then count the number of observations and compute the mean of each numeric column.

  4. What happens if you use a list of functions in across(), but don’t name them? How is the output named?

Answers

  1. Compute the number of unique values in each column of palmerpenguins::penguins.
library(palmerpenguins)
data(penguins)

penguins |> summarise(across(everything(), n_distinct))
> # A tibble: 1 × 8
>   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
>     <int>  <int>          <int>         <int>             <int>       <int>
> 1       3      3            165            81                56          95
> # ℹ 2 more variables: sex <int>, year <int>

  2. Compute the mean of every column in mtcars.
mtcars |> summarise(across(everything(), mean))
>     mpg   cyl  disp    hp  drat    wt  qsec     vs     am  gear  carb
> 1 20.09 6.188 230.7 146.7 3.597 3.217 17.85 0.4375 0.4062 3.688 2.812

  3. Group diamonds by cut, clarity, and color then count the number of observations and compute the mean of each numeric column.
diamonds |> summarise(n = n(),
                      across(where(is.numeric), mean),
                      .by = c(cut, clarity, color))
> # A tibble: 276 × 11
>    cut       clarity color     n carat depth table price     x     y     z
>    <ord>     <ord>   <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
>  1 Ideal     SI2     E       469 0.874  61.7  56.1 3891.  6.02  6.02  3.71
>  2 Premium   SI1     E       614 0.726  61.2  58.8 3363.  5.64  5.61  3.44
>  3 Good      VS1     E        89 0.681  61.6  59.2 3713.  5.49  5.52  3.39
>  4 Premium   VS2     I       315 1.24   61.3  58.9 7156.  6.70  6.67  4.09
>  5 Good      SI2     J        53 1.32   62.4  59.1 5306.  6.85  6.86  4.27
>  6 Very Good VVS2    J        29 1.10   62.4  58.3 5960.  6.34  6.37  3.96
>  7 Very Good VVS1    I        69 0.571  62.2  58.0 2056.  5.17  5.20  3.22
>  8 Very Good SI1     H       547 0.974  62.0  58.0 4934.  6.15  6.17  3.82
>  9 Fair      VS2     E        42 0.690  64.5  59.4 3042.  5.50  5.45  3.53
> 10 Very Good VS1     H       257 0.772  62.0  57.7 3750.  5.68  5.70  3.53
> # ℹ 266 more rows

  4. What happens if you use a list of functions in across(), but don’t name them? How is the output named?
airquality |> 
  summarize(
    across(Ozone:Day, list(
      \(x) median(x, na.rm = TRUE),
      \(x) sum(is.na(x))
    )),
    n = n()
  )
>   Ozone_1 Ozone_2 Solar.R_1 Solar.R_2 Wind_1 Wind_2 Temp_1 Temp_2 Month_1
> 1    31.5      37       205         7    9.7      0     79      0       7
>   Month_2 Day_1 Day_2   n
> 1       0    16     0 153

If you don’t supply names for multiple functions, across() by default simply appends a number to the variable name: the first function’s output is named {.col}_1, the second’s {.col}_2, and so on.
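
For comparison, here is a sketch of the same kind of summary with a named list of functions, plus the .names argument for a custom pattern. The function names median and n_miss and the restriction to two columns are choices made for illustration.

```r
library(dplyr)

# Name each function so the output columns are self-describing
summary_fns <- list(
  median = \(x) median(x, na.rm = TRUE),
  n_miss = \(x) sum(is.na(x))
)

# Default naming with a named list: {.col}_{.fn}
named <- airquality |>
  summarize(across(Ozone:Solar.R, summary_fns))
names(named)

# Custom naming: put the function name first
custom <- airquality |>
  summarize(across(Ozone:Solar.R, summary_fns, .names = "{.fn}_{.col}"))
names(custom)
```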

Homework