Iteration
CS&SS 508 • Lecture 9
21 May 2024
Victoria Sass
If someone doesn’t know better, they might find the means of variables in the swiss data by typing a line of code for each column:
> Error: <text>:7:16: unexpected symbol
> 6: mean5 <- mean(swiss$Infant.Mortality)
> 7: c(mean1, mean2 mean3
> ^
Can you spot the problems?
How upset would they be if the swiss data had 200 columns instead of 6?
Today you’ll learn a better way to repeat tasks, without repeating code, using functions from the dplyr and purrr packages in the tidyverse.
> Fertility Agriculture Examination Education Catholic Infant.Mortality
> 1 70.14 50.66 16.49 10.98 41.14 19.94
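One way column means like those above might be computed without a line per column is sketched below. It assumes dplyr is installed and uses the built-in swiss data; the rounding step is an assumption made only to mirror the two-decimal display.

```r
library(dplyr)

# Apply mean() to every column of swiss in a single call.
swiss_means <- swiss |>
  summarise(across(everything(), mean))

# Round for display (an assumption; the printed output shows two decimals).
round(swiss_means, 2)
```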
The DRY idea: Computers are much better at doing the same thing over and over again than we are.
Iteration involves repeatedly performing the same action on different objects.
We’ve already done some iteration, both because it’s built into R in certain ways, and because many of the tidyverse packages we’ve used have functions that are iterative.
R iterates automatically with its recycling rules.
We’re now going to learn what makes R a functional programming language. That is, we’ll learn some functions that themselves take functions as arguments.
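A small sketch of both ideas, using base R only. The helper apply_twice() is a made-up name for illustration, not something from the lecture.

```r
# Recycling: the length-1 vector 2 is reused for every element of x.
x <- c(1, 2, 3)
doubled <- x * 2

# A function that takes another function as an argument
# (apply_twice is a hypothetical helper for illustration).
apply_twice <- function(f, value) f(f(value))
root_of_root <- apply_twice(sqrt, 16)
```

Here sqrt is passed by name, without parentheses, so that apply_twice() can call it.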
Let’s return to our first example from last week:
df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  # note: the denominator on the next line uses min(a) rather than min(b), a copy-paste slip
  b = (b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
)
Can we make this mutate call even more efficient?
across()
across() makes it easy to apply the same transformation to multiple columns.
There are three particularly important arguments, the first two of which you’ll use in every call to across().
.cols specifies which columns to iterate over.
.fns specifies what to do with each column.
.names specifies the names of the output columns.
.cols
.cols uses the same specifications as select(), so you can use tidyselect functions like starts_with() to select columns based on their name.
You can also use everything(), which selects every (non-grouping) column.
Lastly, where() allows you to select columns based on their type.
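The three selection styles can be sketched on a toy tibble. The data and column names below are invented for illustration; only the dplyr functions come from the text.

```r
library(dplyr)

# Toy data frame (column names are made up for illustration).
toy <- tibble(
  id      = c("a", "b", "c"),
  score_x = c(1, 2, 3),
  score_y = c(4, 5, 6),
  weight  = c(10, 20, 30)
)

# Select columns by name prefix.
by_name <- toy |> summarise(across(starts_with("score"), mean))

# Select columns by type: only the numeric ones.
by_type <- toy |> summarise(across(where(is.numeric), mean))

# Select every column; n_distinct() works for any column type.
all_cols <- toy |> summarise(across(everything(), n_distinct))
```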
The second argument to across(), .fns, is where R’s functional programming style shows up: here we’re passing a function to another function.
Important Distinction
We’re passing this function to across() so that across() can call it; we’re not calling it ourselves. That means the function name should never be followed by (), i.e. write median rather than median(). If you forget, you’ll get an error:
> Error in `summarise()`:
> ℹ In argument: `across(Ozone:Temp, median())`.
> Caused by error in `median.default()`:
> ! argument "x" is missing, with no default
If the function you pass to across() has its own arguments that you want to specify, you’ll need to use an anonymous function, e.g. \(x) median(x, na.rm = TRUE).
You might also see older code that uses the formula shorthand, e.g. ~ median(.x, na.rm = TRUE), where the argument is always named .x. Base syntax is now recommended (i.e. \(x) x + 1).
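A sketch of both spellings on the airquality data, assuming dplyr is installed and R is at least version 4.1 (needed for the \(x) syntax). The two calls should give the same result.

```r
library(dplyr)

# Base anonymous-function syntax (recommended).
medians <- airquality |>
  summarise(across(Ozone:Temp, \(x) median(x, na.rm = TRUE)))

# Older formula shorthand; the argument is always called .x.
medians_old <- airquality |>
  summarise(across(Ozone:Temp, ~ median(.x, na.rm = TRUE)))

medians
```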
> Ozone Solar.R Wind Temp
> 1 31.5 205 9.7 79
What if we want to know how many missing values we removed, in addition to calculating the median without those values?
If you need to call multiple functions within across(), you’ll need to turn them into a named list.
airquality |>
  summarise(across(Ozone:Temp,
                   list(median = \(x) median(x, na.rm = TRUE),
                        n_miss = \(x) sum(is.na(x)))))
The output columns are named using the glue specification {.col}_{.fn}, where .col is the name of the original column and .fn is the name of the function.
> Ozone_median Ozone_n_miss Solar.R_median Solar.R_n_miss Wind_median
> 1 31.5 37 205 7 9.7
> Wind_n_miss Temp_median Temp_n_miss
> 1 0 79 0
By default, the output of across() is given the same names as the inputs. This means that across() inside of mutate() will replace existing columns.
If you’d like to instead create new columns, you can use the .names argument to give the output new names.
In the .names specification, {.col} simply represents the original variable name.
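A sketch of .names in use. The rescale01() helper is assumed to match last week's rescaling function, and the toy data and seed are arbitrary, so the values will differ from any printed output.

```r
library(dplyr)

# rescale01() is an assumed helper: rescale a vector to the range [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

set.seed(508)  # arbitrary seed so the toy data is reproducible
df <- tibble(a = rnorm(5), b = rnorm(5), c = rnorm(5), d = rnorm(5))

# Keep a:d and add a_rescaled, ..., d_rescaled alongside them.
df_rescaled <- df |>
  mutate(across(a:d, rescale01, .names = "{.col}_rescaled"))
```

Without .names, the same call would overwrite a through d in place.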
> # A tibble: 5 × 8
> a b c d a_rescaled b_rescaled c_rescaled d_rescaled
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 -0.130 -0.398 0.347 0.368 0.696 0.284 0.462 0.597
> 2 -1.14 1.04 -0.433 -1.07 0.354 0.902 0 0.0147
> 3 -2.18 -1.06 -0.278 -1.10 0 0 0.0918 0
> 4 0.765 -0.0376 0.674 -0.185 1 0.439 0.656 0.372
> 5 -0.474 1.27 1.25 1.36 0.580 1 1 1
if_any() and if_all()
across() works well with mutate() and summarize(), but it has two variants, if_any() and if_all(), that work with filter().
airquality |> filter(is.na(Ozone) | is.na(Solar.R) | is.na(Wind) | is.na(Temp))
> Ozone Solar.R Wind Temp Month Day
> 1 NA NA 14.3 56 5 5
> 2 28 NA 14.9 66 5 6
> 3 NA 194 8.6 69 5 10
> 4 7 NA 6.9 74 5 11
> 5 NA 66 16.6 57 5 25
> 6 NA 266 14.9 58 5 26
> 7 NA NA 8.0 57 5 27
> 8 NA 286 8.6 78 6 1
> 9 NA 287 9.7 74 6 2
> 10 NA 242 16.1 67 6 3
> 11 NA 186 9.2 84 6 4
> 12 NA 220 8.6 85 6 5
> 13 NA 264 14.3 79 6 6
> 14 NA 273 6.9 87 6 8
> 15 NA 259 10.9 93 6 11
> 16 NA 250 9.2 92 6 12
> 17 NA 332 13.8 80 6 14
> 18 NA 322 11.5 79 6 15
> 19 NA 150 6.3 77 6 21
> 20 NA 59 1.7 76 6 22
> 21 NA 91 4.6 76 6 23
> 22 NA 250 6.3 76 6 24
> 23 NA 135 8.0 75 6 25
> 24 NA 127 8.0 78 6 26
> 25 NA 47 10.3 73 6 27
> 26 NA 98 11.5 80 6 28
> 27 NA 31 14.9 77 6 29
> 28 NA 138 8.0 83 6 30
> 29 NA 101 10.9 84 7 4
> 30 NA 139 8.6 82 7 11
> 31 NA 291 14.9 91 7 14
> 32 NA 258 9.7 81 7 22
> 33 NA 295 11.5 82 7 23
> 34 78 NA 6.9 86 8 4
> 35 35 NA 7.4 85 8 5
> 36 66 NA 4.6 87 8 6
> 37 NA 222 8.6 92 8 10
> 38 NA 137 11.5 86 8 11
> 39 NA 64 11.5 79 8 15
> 40 NA 255 12.6 75 8 23
> 41 NA 153 5.7 88 8 27
> 42 NA 145 13.2 77 9 27
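The long filter above can be written more compactly with if_any(). The sketch below assumes dplyr is installed and also shows if_all() for contrast.

```r
library(dplyr)

# Rows where at least one of Ozone:Temp is missing
# (the same condition as chaining is.na() calls with |).
any_missing <- airquality |> filter(if_any(Ozone:Temp, is.na))

# Rows where every one of Ozone:Temp is missing.
all_missing <- airquality |> filter(if_all(Ozone:Temp, is.na))

nrow(any_missing)
nrow(all_missing)
```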
across() in Functions
Naturally, across() lends itself to functions because it allows you to operate on multiple columns simultaneously.
Just remember to embrace with {{ }} when using an argument for column selection, since the first argument of across() uses the tidy evaluation method tidy-select.
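A sketch of a function that wraps across() and embraces its column-selection argument. The function name summarize_means() and its default are assumptions for illustration; ggplot2 is loaded only for the diamonds data.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# summary_vars is a tidy-select argument, so it is embraced with {{ }}.
summarize_means <- function(df, summary_vars = where(is.numeric)) {
  df |>
    summarize(
      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
      n = n(),
      .groups = "drop"
    )
}

cut_means <- diamonds |>
  group_by(cut) |>
  summarize_means(c(carat, x:z))
```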
> # A tibble: 5 × 6
> cut carat x y z n
> <ord> <dbl> <dbl> <dbl> <dbl> <int>
> 1 Fair 1.05 6.25 6.18 3.98 1610
> 2 Good 0.849 5.84 5.85 3.64 4906
> 3 Very Good 0.806 5.74 5.77 3.56 12082
> 4 Premium 0.892 5.97 5.94 3.65 13791
> 5 Ideal 0.703 5.51 5.52 3.40 21551
Imagine you have a directory full of Excel spreadsheets you want to read into R.
You could technically do it with copy and paste, but that’s neither the most efficient nor the least error-prone approach.
Not to mention how inconvenient this would be if you had hundreds of files to read in and combine.
The iterative approach involves three broad steps:
list.files()
to list all the files in a directorypurrr::map()
to read each of them into a listpurrr::list_rbind()
to combine them into a single data frameThe first part of this method involves creating a character vector of all the file paths for the files you want to read in. We’ll motivate this example by reading in the gapminder data that’s saved in separate excel sheets by year in my working directory.
path
, is the directory to look within.
pattern
is a regular expression used to filter the file names. The most common pattern is something like [.]xlsx$
or [.]csv$
to find all files with a specified extension.
full.names
determines whether or not the directory name should be included in the output. You almost always want this to be TRUE.
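A self-contained sketch of list.files() with these three arguments. It builds a throwaway directory of empty files, and the directory and file names are invented stand-ins for the lecture's data/gapminder folder.

```r
# Create a temporary directory with a few empty files (names are made up).
demo_dir <- file.path(tempdir(), "gapminder-demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("1952.xlsx", "1957.xlsx", "notes.txt")))

# Keep only the .xlsx files and return their full paths.
paths <- list.files(demo_dir, pattern = "[.]xlsx$", full.names = TRUE)
basename(paths)
```

The notes.txt file is excluded because it does not match the pattern.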
> [1] "data/gapminder/1952.xlsx" "data/gapminder/1957.xlsx"
> [3] "data/gapminder/1962.xlsx" "data/gapminder/1967.xlsx"
> [5] "data/gapminder/1972.xlsx" "data/gapminder/1977.xlsx"
> [7] "data/gapminder/1982.xlsx" "data/gapminder/1987.xlsx"
> [9] "data/gapminder/1992.xlsx" "data/gapminder/1997.xlsx"
> [11] "data/gapminder/2002.xlsx" "data/gapminder/2007.xlsx"
Now we want to read these Excel sheets into a single object so we can use iteration in the next step! A list is the perfect tool for this.
files <- list(
  readxl::read_excel("data/gapminder/1952.xlsx"),
  readxl::read_excel("data/gapminder/1957.xlsx"),
  readxl::read_excel("data/gapminder/1962.xlsx"),
  readxl::read_excel("data/gapminder/1967.xlsx"),
  readxl::read_excel("data/gapminder/1972.xlsx"),
  readxl::read_excel("data/gapminder/1977.xlsx"),
  readxl::read_excel("data/gapminder/1982.xlsx"),
  readxl::read_excel("data/gapminder/1987.xlsx"),
  readxl::read_excel("data/gapminder/1992.xlsx"),
  readxl::read_excel("data/gapminder/1997.xlsx"),
  readxl::read_excel("data/gapminder/2002.xlsx"),
  readxl::read_excel("data/gapminder/2007.xlsx")
)
Unfortunately, this is just as tedious a method as reading in all the separate file paths and creating individual data frame objects!
map() instead!
Instead of listing out all the read_excel() calls in our list, we can use the map() function from the tidyverse’s purrr package. map() is similar to across(), but instead of doing something to each column in a data frame, it does something to each element of a vector.
Now, what does files contain?
> # A tibble: 142 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 28.8 8425333 779.
> 2 Albania Europe 55.2 1282697 1601.
> 3 Algeria Africa 43.1 9279525 2449.
> 4 Angola Africa 30.0 4232095 3521.
> 5 Argentina Americas 62.5 17876956 5911.
> 6 Australia Oceania 69.1 8691212 10040.
> 7 Austria Europe 66.8 6927772 6137.
> 8 Bahrain Asia 50.9 120447 9867.
> 9 Bangladesh Asia 37.5 46886859 684.
> 10 Belgium Europe 68 8730405 8343.
> # ℹ 132 more rows
Now that we have all our individual data frames in elements of a list, we can use list_rbind() to combine them into one data frame.
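The read-then-combine pattern can be sketched end to end with small CSV files standing in for the Excel sheets. The file names and contents are invented, and the sketch assumes purrr (version 1.0.0 or later, for list_rbind()) and readr are installed.

```r
library(purrr)
library(readr)

# Write two small CSV files to a temporary directory (stand-ins for the Excel sheets).
csv_dir <- file.path(tempdir(), "map-demo")
dir.create(csv_dir, showWarnings = FALSE)
write_csv(data.frame(country = c("A", "B"), pop = c(1, 2)), file.path(csv_dir, "1952.csv"))
write_csv(data.frame(country = c("A", "B"), pop = c(3, 4)), file.path(csv_dir, "1957.csv"))

csv_paths <- list.files(csv_dir, pattern = "[.]csv$", full.names = TRUE)

# map() calls the reader once per path and returns a list of data frames.
files <- map(csv_paths, \(p) read_csv(p, show_col_types = FALSE))

# list_rbind() stacks the list elements into one data frame.
combined <- list_rbind(files)
```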
> # A tibble: 1,704 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 28.8 8425333 779.
> 2 Albania Europe 55.2 1282697 1601.
> 3 Algeria Africa 43.1 9279525 2449.
> 4 Angola Africa 30.0 4232095 3521.
> 5 Argentina Americas 62.5 17876956 5911.
> 6 Australia Oceania 69.1 8691212 10040.
> 7 Austria Europe 66.8 6927772 6137.
> 8 Bahrain Asia 50.9 120447 9867.
> 9 Bangladesh Asia 37.5 46886859 684.
> 10 Belgium Europe 68 8730405 8343.
> # ℹ 1,694 more rows
You may have noticed that we’re missing a year indicator in our final dataset. That’s because that information is actually a part of the filename itself.
There’s a way to include the filename in the data but we have to add another step to our paths pipeline:
What is this doing?
> 1952.xlsx 1957.xlsx
> "data/gapminder/1952.xlsx" "data/gapminder/1957.xlsx"
> 1962.xlsx 1967.xlsx
> "data/gapminder/1962.xlsx" "data/gapminder/1967.xlsx"
> 1972.xlsx 1977.xlsx
> "data/gapminder/1972.xlsx" "data/gapminder/1977.xlsx"
> 1982.xlsx 1987.xlsx
> "data/gapminder/1982.xlsx" "data/gapminder/1987.xlsx"
> 1992.xlsx 1997.xlsx
> "data/gapminder/1992.xlsx" "data/gapminder/1997.xlsx"
> 2002.xlsx 2007.xlsx
> "data/gapminder/2002.xlsx" "data/gapminder/2007.xlsx"
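A named vector like the one printed above can be produced with set_names(). This sketch hard-codes two of the paths rather than listing a real directory, and assumes purrr is installed.

```r
library(purrr)

paths <- c("data/gapminder/1952.xlsx", "data/gapminder/1957.xlsx")

# set_names() accepts a function, which it applies to the vector to build the names.
named_paths <- set_names(paths, basename)
names(named_paths)
```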
The set_names() function takes the function basename, which extracts just the file name from the full path. This line of code will therefore create a named vector of the file paths where the names are the filenames.
Mapping read_excel() over the named paths then returns a named list of data frames:
> $`1952.xlsx`
> # A tibble: 142 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 28.8 8425333 779.
> 2 Albania Europe 55.2 1282697 1601.
> 3 Algeria Africa 43.1 9279525 2449.
> 4 Angola Africa 30.0 4232095 3521.
> 5 Argentina Americas 62.5 17876956 5911.
> 6 Australia Oceania 69.1 8691212 10040.
> 7 Austria Europe 66.8 6927772 6137.
> 8 Bahrain Asia 50.9 120447 9867.
> 9 Bangladesh Asia 37.5 46886859 684.
> 10 Belgium Europe 68 8730405 8343.
> # ℹ 132 more rows
>
> $`1957.xlsx`
> # A tibble: 142 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 30.3 9240934 821.
> 2 Albania Europe 59.3 1476505 1942.
> 3 Algeria Africa 45.7 10270856 3014.
> 4 Angola Africa 32.0 4561361 3828.
> 5 Argentina Americas 64.4 19610538 6857.
> 6 Australia Oceania 70.3 9712569 10950.
> 7 Austria Europe 67.5 6965860 8843.
> 8 Bahrain Asia 53.8 138655 11636.
> 9 Bangladesh Asia 39.3 51365468 662.
> 10 Belgium Europe 69.2 8989111 9715.
> # ℹ 132 more rows
>
> $`1962.xlsx`
> # A tibble: 142 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 32.0 10267083 853.
> 2 Albania Europe 64.8 1728137 2313.
> 3 Algeria Africa 48.3 11000948 2551.
> 4 Angola Africa 34 4826015 4269.
> 5 Argentina Americas 65.1 21283783 7133.
> 6 Australia Oceania 70.9 10794968 12217.
> 7 Austria Europe 69.5 7129864 10751.
> 8 Bahrain Asia 56.9 171863 12753.
> 9 Bangladesh Asia 41.2 56839289 686.
> 10 Belgium Europe 70.2 9218400 10991.
> # ℹ 132 more rows
>
> $`1967.xlsx`
> # A tibble: 142 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 34.0 11537966 836.
> 2 Albania Europe 66.2 1984060 2760.
> 3 Algeria Africa 51.4 12760499 3247.
> 4 Angola Africa 36.0 5247469 5523.
> 5 Argentina Americas 65.6 22934225 8053.
> 6 Australia Oceania 71.1 11872264 14526.
> 7 Austria Europe 70.1 7376998 12835.
> 8 Bahrain Asia 59.9 202182 14805.
> 9 Bangladesh Asia 43.5 62821884 721.
> 10 Belgium Europe 70.9 9556500 13149.
> # ℹ 132 more rows
>
> $`1972.xlsx`
> # A tibble: 142 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 36.1 13079460 740.
> 2 Albania Europe 67.7 2263554 3313.
> 3 Algeria Africa 54.5 14760787 4183.
> 4 Angola Africa 37.9 5894858 5473.
> 5 Argentina Americas 67.1 24779799 9443.
> 6 Australia Oceania 71.9 13177000 16789.
> 7 Austria Europe 70.6 7544201 16662.
> 8 Bahrain Asia 63.3 230800 18269.
> 9 Bangladesh Asia 45.3 70759295 630.
> 10 Belgium Europe 71.4 9709100 16672.
> # ℹ 132 more rows
>
> $`1977.xlsx`
> # A tibble: 142 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 38.4 14880372 786.
> 2 Albania Europe 68.9 2509048 3533.
> 3 Algeria Africa 58.0 17152804 4910.
> 4 Angola Africa 39.5 6162675 3009.
> 5 Argentina Americas 68.5 26983828 10079.
> 6 Australia Oceania 73.5 14074100 18334.
> 7 Austria Europe 72.2 7568430 19749.
> 8 Bahrain Asia 65.6 297410 19340.
> 9 Bangladesh Asia 46.9 80428306 660.
> 10 Belgium Europe 72.8 9821800 19118.
> # ℹ 132 more rows
>
> $`1982.xlsx`
> # A tibble: 142 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 39.9 12881816 978.
> 2 Albania Europe 70.4 2780097 3631.
> 3 Algeria Africa 61.4 20033753 5745.
> 4 Angola Africa 39.9 7016384 2757.
> 5 Argentina Americas 69.9 29341374 8998.
> 6 Australia Oceania 74.7 15184200 19477.
> 7 Austria Europe 73.2 7574613 21597.
> 8 Bahrain Asia 69.1 377967 19211.
> 9 Bangladesh Asia 50.0 93074406 677.
> 10 Belgium Europe 73.9 9856303 20980.
> # ℹ 132 more rows
>
> $`1987.xlsx`
> # A tibble: 142 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 40.8 13867957 852.
> 2 Albania Europe 72 3075321 3739.
> 3 Algeria Africa 65.8 23254956 5681.
> 4 Angola Africa 39.9 7874230 2430.
> 5 Argentina Americas 70.8 31620918 9140.
> 6 Australia Oceania 76.3 16257249 21889.
> 7 Austria Europe 74.9 7578903 23688.
> 8 Bahrain Asia 70.8 454612 18524.
> 9 Bangladesh Asia 52.8 103764241 752.
> 10 Belgium Europe 75.4 9870200 22526.
> # ℹ 132 more rows
>
> $`1992.xlsx`
> # A tibble: 142 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 41.7 16317921 649.
> 2 Albania Europe 71.6 3326498 2497.
> 3 Algeria Africa 67.7 26298373 5023.
> 4 Angola Africa 40.6 8735988 2628.
> 5 Argentina Americas 71.9 33958947 9308.
> 6 Australia Oceania 77.6 17481977 23425.
> 7 Austria Europe 76.0 7914969 27042.
> 8 Bahrain Asia 72.6 529491 19036.
> 9 Bangladesh Asia 56.0 113704579 838.
> 10 Belgium Europe 76.5 10045622 25576.
> # ℹ 132 more rows
>
> $`1997.xlsx`
> # A tibble: 142 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 41.8 22227415 635.
> 2 Albania Europe 73.0 3428038 3193.
> 3 Algeria Africa 69.2 29072015 4797.
> 4 Angola Africa 41.0 9875024 2277.
> 5 Argentina Americas 73.3 36203463 10967.
> 6 Australia Oceania 78.8 18565243 26998.
> 7 Austria Europe 77.5 8069876 29096.
> 8 Bahrain Asia 73.9 598561 20292.
> 9 Bangladesh Asia 59.4 123315288 973.
> 10 Belgium Europe 77.5 10199787 27561.
> # ℹ 132 more rows
>
> $`2002.xlsx`
> # A tibble: 142 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 42.1 25268405 727.
> 2 Albania Europe 75.7 3508512 4604.
> 3 Algeria Africa 71.0 31287142 5288.
> 4 Angola Africa 41.0 10866106 2773.
> 5 Argentina Americas 74.3 38331121 8798.
> 6 Australia Oceania 80.4 19546792 30688.
> 7 Austria Europe 79.0 8148312 32418.
> 8 Bahrain Asia 74.8 656397 23404.
> 9 Bangladesh Asia 62.0 135656790 1136.
> 10 Belgium Europe 78.3 10311970 30486.
> # ℹ 132 more rows
>
> $`2007.xlsx`
> # A tibble: 142 × 5
> country continent lifeExp pop gdpPercap
> <chr> <chr> <dbl> <dbl> <dbl>
> 1 Afghanistan Asia 43.8 31889923 975.
> 2 Albania Europe 76.4 3600523 5937.
> 3 Algeria Africa 72.3 33333216 6223.
> 4 Angola Africa 42.7 12420476 4797.
> 5 Argentina Americas 75.3 40301927 12779.
> 6 Australia Oceania 81.2 20434176 34435.
> 7 Austria Europe 79.8 8199783 36126.
> 8 Bahrain Asia 75.6 708573 29796.
> 9 Bangladesh Asia 64.1 150448339 1391.
> 10 Belgium Europe 79.4 10392226 33693.
> # ℹ 132 more rows
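To create a year variable, we need to tell list_rbind() to save the filename information.
gapminder <- paths |>
  set_names(basename) |>
  map(readxl::read_excel) |>
  list_rbind(names_to = "year") |>
  mutate(year = parse_number(year))
gapminder
names_to = "year" saves each list element’s name (the filename) in a new column called year, and parse_number() then extracts the numeric year from it.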
> # A tibble: 1,704 × 6
> year country continent lifeExp pop gdpPercap
> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
> 1 1952 Afghanistan Asia 28.8 8425333 779.
> 2 1952 Albania Europe 55.2 1282697 1601.
> 3 1952 Algeria Africa 43.1 9279525 2449.
> 4 1952 Angola Africa 30.0 4232095 3521.
> 5 1952 Argentina Americas 62.5 17876956 5911.
> 6 1952 Australia Oceania 69.1 8691212 10040.
> 7 1952 Austria Europe 66.8 6927772 6137.
> 8 1952 Bahrain Asia 50.9 120447 9867.
> 9 1952 Bangladesh Asia 37.5 46886859 684.
> 10 1952 Belgium Europe 68 8730405 8343.
> # ℹ 1,694 more rows
Complicated Filenames
There may be other variables stored in the directory name, or maybe the file name contains multiple bits of data. If so, use set_names() (without arguments) to record the full path, then use separate_wider_delim() and friends to turn them into useful columns. See example at the end of this section.
Untidy data of the same structure
You can use map() many times to perform different tidying and data manipulation tasks before combining datasets. Alternatively, you can list_rbind() first and then perform data manipulation tasks using a standard dplyr approach. See examples here.
Heterogeneous data
Read this section of “R for Data Science”
Troubleshooting
Read this section of “R for Data Science”
Let’s imagine we want to save multiple datasets based on a feature of the data.
For example, what if we want a different CSV for each clarity type in the diamonds dataset?
The easiest way to make these individual datasets is using group_nest():
by_clarity <- diamonds |>
  group_nest(clarity) |>
  mutate(path = str_glue("diamonds-{clarity}.csv"))
by_clarity
Use keep = TRUE if you want to include the grouping variable in the nested tibbles.
> # A tibble: 8 × 3
> clarity data path
> <ord> <list<tibble[,9]>> <glue>
> 1 I1 [741 × 9] diamonds-I1.csv
> 2 SI2 [9,194 × 9] diamonds-SI2.csv
> 3 SI1 [13,065 × 9] diamonds-SI1.csv
> 4 VS2 [12,258 × 9] diamonds-VS2.csv
> 5 VS1 [8,171 × 9] diamonds-VS1.csv
> 6 VVS2 [5,066 × 9] diamonds-VVS2.csv
> 7 VVS1 [3,655 × 9] diamonds-VVS1.csv
> 8 IF [1,790 × 9] diamonds-IF.csv
Each element of the data column is itself a tibble. For example, the first element (clarity I1) holds 741 rows:
> # A tibble: 741 × 9
> carat cut color depth table price x y z
> <dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
> 1 0.32 Premium E 60.9 58 345 4.38 4.42 2.68
> 2 1.17 Very Good J 60.2 61 2774 6.83 6.9 4.13
> 3 1.01 Premium F 61.8 60 2781 6.39 6.36 3.94
> 4 1.01 Fair E 64.5 58 2788 6.29 6.21 4.03
> 5 0.96 Ideal F 60.7 55 2801 6.37 6.41 3.88
> 6 1.04 Premium G 62.2 58 2801 6.46 6.41 4
> 7 1 Fair G 66.4 59 2808 6.16 6.09 4.07
> 8 1.2 Fair F 64.6 56 2809 6.73 6.66 4.33
> 9 0.43 Very Good E 58.4 62 555 4.94 5 2.9
> 10 1.02 Premium G 60.3 58 2815 6.55 6.5 3.94
> # ℹ 731 more rows
walk()
We basically want to call write_csv() once for each data frame and its path, but we can’t simply use map() because now we have 2 arguments that vary.
So we could use map2(), which allows us to map over 2 inputs!
If we were to run map2() here, it would pass each pair of inputs as the first two arguments to write_csv() and also print out all the datasets as it saves them.
Since we don’t actually care about the output (i.e. the printed datasets) and only want the files to be written, there’s an even better function we can use: walk2().
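A sketch of the walk2() call, writing into a temporary folder rather than the working directory. The output folder is an assumption made to keep the example tidy; it assumes dplyr, purrr, readr, stringr and ggplot2 (for diamonds) are installed.

```r
library(dplyr)
library(purrr)
library(readr)
library(stringr)
library(ggplot2)  # provides the diamonds data

out_dir <- file.path(tempdir(), "by-clarity")  # temporary output folder (an assumption)
dir.create(out_dir, showWarnings = FALSE)

by_clarity <- diamonds |>
  group_nest(clarity) |>
  mutate(path = file.path(out_dir, as.character(str_glue("diamonds-{clarity}.csv"))))

# walk2() calls write_csv(data[[i]], path[[i]]) for each i and
# returns its first input invisibly, so nothing is printed.
walk2(by_clarity$data, by_clarity$path, write_csv)
```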
The same basic approach can be used to save multiple plots.
Given a function carat_histogram() that draws one plot from a data frame, we can use map() to create a list of many plots and their eventual file paths:
by_clarity <- by_clarity |>
  mutate(
    plot = map(data, carat_histogram),
    path = str_glue("clarity-{clarity}.png")
  )
by_clarity
> # A tibble: 8 × 4
> clarity data path plot
> <ord> <list<tibble[,9]>> <glue> <list>
> 1 I1 [741 × 9] clarity-I1.png <gg>
> 2 SI2 [9,194 × 9] clarity-SI2.png <gg>
> 3 SI1 [13,065 × 9] clarity-SI1.png <gg>
> 4 VS2 [12,258 × 9] clarity-VS2.png <gg>
> 5 VS1 [8,171 × 9] clarity-VS1.png <gg>
> 6 VVS2 [5,066 × 9] clarity-VVS2.png <gg>
> 7 VVS1 [3,655 × 9] clarity-VVS1.png <gg>
> 8 IF [1,790 × 9] clarity-IF.png <gg>
Then use walk2() with ggsave() to save each plot:
Which is shorthand for:
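A sketch of both forms follows. The body of carat_histogram(), the plot dimensions, and the temporary output folder are all assumptions made for illustration; only the function names come from the text.

```r
library(dplyr)
library(purrr)
library(stringr)
library(ggplot2)

# Hypothetical helper: histogram of carat for one data frame.
carat_histogram <- function(df) {
  ggplot(df, aes(x = carat)) + geom_histogram(binwidth = 0.1)
}

plot_dir <- file.path(tempdir(), "clarity-plots")  # temporary folder (an assumption)
dir.create(plot_dir, showWarnings = FALSE)

by_clarity <- diamonds |>
  group_nest(clarity) |>
  mutate(
    plot = map(data, carat_histogram),
    path = file.path(plot_dir, as.character(str_glue("clarity-{clarity}.png")))
  )

# One ggsave() call per (path, plot) pair.
walk2(
  by_clarity$path,
  by_clarity$plot,
  \(path, plot) ggsave(path, plot, width = 6, height = 6)
)

# The walk2() call stands in for writing each call out by hand:
# ggsave(by_clarity$path[[1]], by_clarity$plot[[1]], width = 6, height = 6)
# ggsave(by_clarity$path[[2]], by_clarity$plot[[2]], width = 6, height = 6)
# ... and so on through the eighth row.
```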
lapply
Base R has its own family of iterative functions: the apply family of functions.
The most one-to-one translation in this family is lapply (list apply) to map.
Because the uses of map in today’s lecture are fairly simple, you can swap in lapply for any of them.
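For instance, column medians of the built-in swiss data can be computed either way. The sketch assumes purrr is installed for the comparison.

```r
# lapply() returns a list with one element per column of swiss.
medians_list <- lapply(swiss, median)

# The purrr equivalent.
medians_map <- purrr::map(swiss, median)

medians_list$Fertility
```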
> $Fertility
> [1] 70.4
>
> $Agriculture
> [1] 54.1
>
> $Examination
> [1] 16
>
> $Education
> [1] 8
>
> $Catholic
> [1] 15.14
>
> $Infant.Mortality
> [1] 20
Simply, lapply() is used to apply a function over a list of any kind (e.g. a data frame) and return a list.
sapply(): Simple lapply()
A downside to lapply() is that lists can be hard to work with. sapply(), therefore, always tries to simplify the result.
> Fertility Agriculture Examination Education
> 70.40 54.10 16.00 8.00
> Catholic Infant.Mortality
> 15.14 20.00
In this case, our list was simplified to a named numeric vector. However, the simplification can fail and give you an unexpected type, so proceed with caution if you intend to use sapply().
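A short sketch of both behaviours in base R: the usual simplification, and one case (an empty input, chosen here for illustration) where the result type changes.

```r
# Usual case: one number per column, simplified to a named numeric vector.
medians_vec <- sapply(swiss, median)

# Edge case: with an empty input there is nothing to simplify,
# so sapply() returns an empty list instead of a numeric vector.
empty_result <- sapply(list(), median)

class(medians_vec)
class(empty_result)
```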
vapply(): vector apply
This version takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input.
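A sketch of vapply() in base R, where FUN.VALUE = numeric(1) declares that each call should return a single number. The empty-input case is included for illustration.

```r
# Each call to median() must return exactly one numeric value.
medians_vec <- vapply(swiss, median, FUN.VALUE = numeric(1))

# With an empty input the result is still numeric (length zero).
empty_result <- vapply(list(), median, FUN.VALUE = numeric(1))

class(empty_result)
```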
tapply()
Another important member of the apply family is tapply(), which computes a single grouped summary.
Unfortunately, tapply() returns its results in a named vector, which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame.
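A sketch using the built-in mtcars data (chosen here for illustration), including the extra step needed to turn the result into a data frame.

```r
# Mean mpg for each number of cylinders; the names are the group labels.
mean_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)

# The "gymnastics": rebuild a data frame from the names and values.
mean_mpg_df <- data.frame(
  cyl = names(mean_mpg),
  mean_mpg = as.vector(mean_mpg)
)
```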
apply()
Lastly, there’s apply(), which works over matrices or data frames. You can apply the function to each row (MARGIN = 1) or column (MARGIN = 2).
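A sketch of both margins on the swiss data in base R. The choice of summary() and mean() as the applied functions is an assumption for illustration.

```r
# MARGIN = 2: apply summary() to each column; the six-number summaries
# are bound together into a 6 x 6 matrix.
col_summaries <- apply(swiss, MARGIN = 2, FUN = summary)

# MARGIN = 1: apply mean() to each row, giving one value per province.
row_means <- apply(swiss, MARGIN = 1, FUN = mean)

dim(col_summaries)
```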
> Fertility Agriculture Examination Education Catholic Infant.Mortality
> Min. 35.00 1.20 3.00 1.00 2.150 10.80
> 1st Qu. 64.70 35.90 12.00 6.00 5.195 18.15
> Median 70.40 54.10 16.00 8.00 15.140 20.00
> Mean 70.14 50.66 16.49 10.98 41.144 19.94
> 3rd Qu. 78.45 67.65 22.00 12.00 93.125 21.70
> Max. 92.50 89.70 37.00 53.00 100.000 26.60
for loops
for loops are the fundamental building block of iteration that both the apply and map families use under the hood.
As you become a more experienced R programmer, you’ll find that for loops are a powerful and general tool that is important to learn.
walk()
The most straightforward use of for loops is to achieve the same effect as walk(): call some function with a side-effect on each element of a vector/list.
A very basic example:
Things get a little trickier if you want to save the output of the for loop.
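Both cases can be sketched in base R. The variable names are made up, and pre-allocating the output vector is one common approach rather than the only one.

```r
# Side effect only: print each element (comparable to walk()).
for (name in c("Fertility", "Education")) {
  print(name)
}

# Saving output: pre-allocate a vector, then fill it by index.
col_medians <- vector("double", ncol(swiss))
for (i in seq_along(swiss)) {
  col_medians[[i]] <- median(swiss[[i]])
}
names(col_medians) <- names(swiss)
```

The second loop needs the extra bookkeeping (allocation, indexing, naming) that map() and lapply() handle for you.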
When you’re ready to dive into more advanced functional programming topics, including loops, check out the Control Flow and Functional Programming chapters of Advanced R.
across
Compute the number of unique values in each column of palmerpenguins::penguins.
Compute the mean of every column in mtcars.
Group diamonds by cut, clarity, and color, then count the number of observations and compute the mean of each numeric column.
What happens if you use a list of functions in across(), but don’t name them? How is the output named?
Group diamonds by cut, clarity, and color, then count the number of observations and compute the mean of each numeric column.
> # A tibble: 276 × 11
> cut clarity color n carat depth table price x y z
> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 Ideal SI2 E 469 0.874 61.7 56.1 3891. 6.02 6.02 3.71
> 2 Premium SI1 E 614 0.726 61.2 58.8 3363. 5.64 5.61 3.44
> 3 Good VS1 E 89 0.681 61.6 59.2 3713. 5.49 5.52 3.39
> 4 Premium VS2 I 315 1.24 61.3 58.9 7156. 6.70 6.67 4.09
> 5 Good SI2 J 53 1.32 62.4 59.1 5306. 6.85 6.86 4.27
> 6 Very Good VVS2 J 29 1.10 62.4 58.3 5960. 6.34 6.37 3.96
> 7 Very Good VVS1 I 69 0.571 62.2 58.0 2056. 5.17 5.20 3.22
> 8 Very Good SI1 H 547 0.974 62.0 58.0 4934. 6.15 6.17 3.82
> 9 Fair VS2 E 42 0.690 64.5 59.4 3042. 5.50 5.45 3.53
> 10 Very Good VS1 H 257 0.772 62.0 57.7 3750. 5.68 5.70 3.53
> # ℹ 266 more rows
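One possible approach to the mtcars and diamonds exercises is sketched below. It assumes dplyr 1.1.0 or later (for the .by argument) and ggplot2 for the diamonds data. Placing n = n() after across() is a choice made here so that the count is not itself averaged; as a result, the column order differs slightly from the output above.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# Mean of every column in mtcars.
mtcars_means <- mtcars |>
  summarise(across(everything(), mean))

# Grouped means of each numeric column, plus a count per group.
diamond_means <- diamonds |>
  summarise(
    across(where(is.numeric), mean),
    n = n(),
    .by = c(cut, clarity, color)
  )
```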
What happens if you use a list of functions in across(), but don’t name them? How is the output named?
airquality |>
  summarize(
    across(Ozone:Day, list(
      \(x) median(x, na.rm = TRUE),
      \(x) sum(is.na(x))
    )),
    n = n()
  )
> Ozone_1 Ozone_2 Solar.R_1 Solar.R_2 Wind_1 Wind_2 Temp_1 Temp_2 Month_1
> 1 31.5 37 205 7 9.7 0 79 0 7
> Month_2 Day_1 Day_2 n
> 1 0 16 0 153
If names for multiple functions are not supplied, the default behavior of across() is simply to append a number to the variable name, i.e. the first function’s output will be named {.col}_1, the second’s {.col}_2, etc.