Workflow & Reproducibility
CS&SS 508 • Lecture 3
9 April 2024
Victoria Sass
There are honestly no hard-and-fast rules about what counts as correct code. Code written in many different styles will run and give you your desired results.
However, the following recommendations are made with these truths in mind:
You can read more about the specific style we’re using in this class here.
It’s good practice to name objects (and oftentimes variables) using only lowercase letters, numbers, and _ (to separate words).
Remember to give them descriptive names, even if that means they’re longer.
For readability you’ll want to put spaces around all mathematical operators (e.g. +, -, ==, <) as well as the assignment operator (<-).
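As a small illustration of the naming and spacing advice above, here is a sketch with made-up object names and values (exam_scores, bonus_points, and so on are hypothetical, not from the lecture):

```r
# Descriptive snake_case names, with spaces around operators and <-
exam_scores <- c(82, 91, 77)      # made-up example values
bonus_points <- 5
adjusted_scores <- exam_scores + bonus_points
passed_exam <- adjusted_scores >= 85

# The same logic also runs when written like this, but it is much harder to read:
# s<-c(82,91,77);p<-s+5>=85
```

Both versions produce the same result; the first one simply tells the reader what each object holds.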
As you begin to use more functions, sequentially, it can start to get unclear what’s happening when, and to what.
With nested functions, like those above, you need to read the order of operations inside out, which is a bit awkward. It becomes even more confusing the more function calls you have, especially when they have multiple arguments each.
Enter the pipe: |>
Pipes read “left to right” which is much more intuitive!
As you can see, pipes allow us to “chain” many function calls together easily.
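To make the contrast concrete, here is a minimal sketch of the same computation written first as nested calls and then as a pipe chain. The vector and the choice of functions are made up for illustration, and the native pipe assumes R 4.1 or later:

```r
values <- c(4, 9, 16, 25)  # made-up example values

# Nested calls: read from the inside out (sqrt, then mean, then round)
nested_result <- round(mean(sqrt(values)), 1)

# Piped calls: read top to bottom, one step per line
piped_result <- values |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested_result, piped_result)
```

The two forms are equivalent; the pipe only changes the order in which you read the steps.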
The so-called “native pipe” (i.e. built into base R) is relatively new: it has been available since R 4.1.0. Before this, the pipe was a function from the magrittr package that looks like this: %>%.
The magrittr pipe continues to work, but it behaves somewhat differently from the native pipe.
Most importantly, while both the magrittr pipe and the native pipe take the LHS (left-hand side) and “pipe” it to the RHS (right-hand side), they operate differently when it comes to explicitly specifying which argument of the RHS to pipe the LHS into.
library(magrittr) # provides %>%
a <- c("Z", NA, "C", "G", "A")
# magrittr pipe: . is the placeholder
a %>% gsub("A", "-", x = .)
# native pipe: _ is the placeholder (R >= 4.2) and must be passed to a named argument
a |> gsub("A", "-", x = _)
# leaving the "piped" argument as the only unnamed argument also works
a |> gsub(pattern = "A", replacement = "-")
# an anonymous function call lets you be explicit while specifying your own placeholder
a |> (\(placeholder) gsub("A", "-", x = placeholder))()
Some good syntax practices:
|> should have a space before it, and it should usually be the last thing on a line.
The |> pipe is recommended over %>% simply because it’s much simpler to use and it’s always available (%>% relies on the magrittr package, which is a dependency of tidyverse packages).
You’ll need to tell RStudio that you want it to insert the native pipe by going to Tools > Global Options > Code. Within the “Editing” tab there is an option to “Use native pipe operator, |>”. Check it!
Keyboard Shortcut
To insert a pipe (with spaces) quickly: Ctrl+Shift+M (Windows & Linux) or Shift+Command+M (Mac)
There are some other useful formatting options I’d suggest setting globally and others you can specify to your preferences.
Imagine you’ve inherited a bunch of code from someone else and NOTHING is styled in the tidyverse way you’ve become accustomed to. Or, you’ve dabbled in R before and you have some old code that is all over the place with respect to styling.
Thanks to Lorenz Walthert there’s a package for that! Using the styler package you can automatically apply the tidyverse style guide standards to various filetypes (.R, .qmd, .Rmd, etc.) or even entire projects.
Have a style or variation of the tidyverse style that you prefer? You can specify that in the package as well. Use the keyboard shortcut Cmd/Ctrl + Shift + P and search for “styler” to see all the options available.
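Besides the RStudio command palette, styler can also be called from code. The sketch below assumes the styler package is installed (the check simply skips the call otherwise); the messy_code string is a made-up example:

```r
# A deliberately unstyled line of code, stored as text (made-up example)
messy_code <- "my_total=sum(c(1,2,3))+10"

# Only call styler if it is installed on this machine
if (requireNamespace("styler", quietly = TRUE)) {
  # style_text() restyles code supplied as a character vector
  styled_code <- styler::style_text(messy_code)
  print(styled_code)
}

# For files and folders, styler also provides style_file() and style_dir()
```

Note that style_file() and style_dir() overwrite files in place, so it is worth trying them on a copy (or a version-controlled project) first.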
We’ve been working with Quarto documents but you’ll sometimes simply want to use an R script, which is basically an entire file that is just a code chunk.
File names should:
Organizing research projects is something you either do accidentally — and badly — or purposefully with some upfront labor.
Uniform organization makes switching between or revisiting projects easier.
project/
  readme.md
  data/
    derived/
      data_processed.RData
    raw/
      data_core.csv
      data_supplementary.csv
  docs/
    paper_asa.qmd
    paper_journal.qmd
  syntax/
    01-functions.R
    02-cleaning.R
    03-merging.R
    04-exploratory.R
    05-models.R
    06-visualizations.R
  visuals/
    descriptive.png
    heatmap.png
    predicted_probabilities.png
docs
syntax
data
visuals
readme.md describes the project
alternative model.R
code for exploratory analysis.r
finalreport.qmd
FinalReport.qmd
fig 1.png
Figure_02.png
model_first_try.R
run-first.r
temp.txt
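One way to screen a folder for awkward file names is a quick pattern check. This is only a sketch: the rule used here (lowercase letters, numbers, _ and - before the extension) is an assumption borrowed from the object-naming advice earlier in the lecture, and is_tidy_name is a hypothetical name:

```r
# A mix of file names taken from the examples in this lecture
file_names <- c(
  "alternative model.R",
  "FinalReport.qmd",
  "fig 1.png",
  "01-functions.R",
  "data_core.csv"
)

# Assumed rule: only lowercase letters, digits, _ and - before the extension
is_tidy_name <- grepl("^[a-z0-9_-]+\\.[A-Za-z]+$", file_names, perl = TRUE)

# Show each name next to the result of the check
data.frame(file_names, is_tidy_name)
```

Names with spaces or mixed case fail the check, while numbered, lowercase names pass and also sort in the order they are meant to be run.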
Your working directory is where R will look for any files that you ask it to load and where it’ll put anything you ask it to save. It is literally just a folder somewhere on your computer or the remote server/cloud you’re working within.
You can ask R what your current working directory is by running getwd() (get working directory).
[1] "/Users/victoriasass/Desktop/GitHub/CSSS508/Lectures/Lecture3"
You can see above that this lecture was created in a lecture-specific folder within a lectures folder, in a directory for this class, which is in a folder called GitHub on the Desktop of my laptop.
While you can technically set your working directory using setwd() (set working directory) and giving R a filepath, in the name of reproducible research DO NOT DO THIS! I strongly advise an alternative: RStudio Projects.
A “project” is RStudio’s built-in organizational support system which keeps all the files associated with a given project (i.e. data, R scripts, figures, results) together in one directory.
Creating a new project basically creates a new folder in a place that you specify. But it also does a few other extremely useful things:
It creates a .Rproj file which tracks your command history and all the files in your project folder.
You can create a project by clicking File > New Project.
To summarize Jenny Bryan, one should separate workflow (i.e. your personal tastes and habits) from product (i.e. the logic and output that is the essence of your project).
Workflow:
The software you use to write your code (e.g. R/RStudio)
The location you store a project
The specific computer you use
The code you ran earlier or typed into your console
Product:
The raw data
The code that operates on your raw data
The packages you use
The output files or documents
Each data analysis (or course using R) should be organized as a project.
For research to be reproducible, it must also be portable. Portable software operates independently of workflow.
Don’t:
Use setwd().
Use absolute paths, e.g. read_csv("C:/my_project/data/my_data.csv")
Use install.packages() in R script or .qmd files.
Use rm(list=ls()) anywhere but your console.
Do:
Use RStudio Projects (or the here package) to set directories.
Use relative paths, e.g. read_csv("./data/my_data.csv")
Load packages with library().
setwd() and rm(list=ls())
Don’t expect rm(list=ls()) to give you a fresh R session. It may feel that way when all the objects in your global environment disappear, but a host of dependencies (e.g. loaded packages, options set to non-defaults, the working directory) have not changed. Your script will still be vulnerable to those settings unless you start a fresh R session.
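The point about rm(list=ls()) can be checked directly. In this sketch the option value and the object name scratch_value are made up; tools is used only because it ships with every R installation:

```r
options(digits = 4)      # set an option to a non-default value
library(tools)           # attach a package (tools ships with R)
scratch_value <- 42      # create an object in the environment

rm(list = ls())          # "clear everything"

exists("scratch_value")         # the object is gone
getOption("digits")             # the option is still changed
"package:tools" %in% search()   # the package is still attached
```

Only the objects disappear. Restarting R (Session > Restart R in RStudio) is what actually resets options and attached packages.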
A file path specifies the location of a file in a computer’s file system structure. They can be used to locate files and web resources. Some important things to note:
File paths on macOS and Linux separate folders with a forward slash (/).
File paths on Windows separate folders with a backslash (\).
In R you can use the forward slash (/) as the path separator regardless of the operating system.
An absolute file path specifies the location of a file from the root directory in the file system structure. Absolute paths are also called “full file paths” or “full paths.”
A relative file path specifies the location of a file relative to the current directory, for example a file in the same folder or on the same server.
Relative file paths use a dot notation at the start of the path, followed by a path separator and the location of the file.
A single dot (.) indicates the current directory (as shown above).
A double dot (..) indicates the parent directory.
When you work in an RStudio Project your working directory is the project folder.
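A short sketch of the dot notation, built with base R's file.path(). The folder shared_data is hypothetical, and none of these files need to exist for the code to run since it only constructs the path strings:

```r
# Relative to the current directory (single dot)
relative_path <- file.path(".", "data", "my_data.csv")

# Relative to the parent directory (double dot); shared_data is a hypothetical folder
parent_path <- file.path("..", "shared_data", "my_data.csv")

# The matching absolute path depends on whichever computer runs the code
absolute_path <- file.path(getwd(), "data", "my_data.csv")

relative_path
parent_path
absolute_path
```

The relative paths are the same on every machine, while the absolute path changes with the user and computer, which is why relative paths inside a project are the portable choice.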
If you are working on an R script or qmd file in a subfolder of this project, the working directory of that file will be its subfolder (not the project folder).
Keep this in mind when you’re writing code and testing it interactively! Your current working directory will be the project folder when running code interactively even if you’re writing code for a qmd that has a subfolder as the working directory.
Often you do not want to include all code for a project in one .qmd file:
There are two ways to deal with this:
Use separate .R scripts or .qmd files which save results from complicated parts of a project, then load these results in the main .qmd file.
Use source() to run external .R scripts when the .qmd renders.
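Both approaches can be sketched in a few lines. To keep the example self-contained it writes to a temporary folder; the file names 01-functions.R and model_results.rds, the helper add_one(), and the stored value are all hypothetical:

```r
# Approach 2: keep helper code in its own script and run it with source()
script_path <- file.path(tempdir(), "01-functions.R")
writeLines("add_one <- function(x) x + 1", script_path)  # stand-in for a real script
source(script_path)
add_one(2)

# Approach 1: save results from a slow step, then load them in the main document
results_path <- file.path(tempdir(), "model_results.rds")
saveRDS(list(estimate = 0.5), results_path)   # made-up result, saved by one script
loaded_results <- readRDS(results_path)       # loaded by the main .qmd
loaded_results$estimate
```

In a real project the paths would point into the project folder (for example a syntax or data subfolder) rather than tempdir().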
I find it beneficial to break projects into many files:
Splitting up a project carries benefits:
Professional researchers and teams design projects as a pipeline.
A pipeline is a series of consecutive processing elements (scripts and functions in R).
Each stage of a pipeline…
This means…
Every stage (oval) has an unambiguous input and output. Everything that precedes a given stage is a dependency — something required to run it.
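A pipeline of this kind can be sketched with small functions, each taking one input and returning one output. The stage names (clean_data, fit_model, extract_slope) and the data are made up for illustration:

```r
# Stage 0: raw input (made-up data with one missing value)
raw_data <- data.frame(x = c(1, 2, NA, 4), y = c(2, 4, 6, 8))

# Stage 1: raw data in, complete cases out
clean_data <- function(data) data[complete.cases(data), ]

# Stage 2: clean data in, fitted model out
fit_model <- function(data) lm(y ~ x, data = data)

# Stage 3: fitted model in, one number out
extract_slope <- function(model) unname(coef(model)["x"])

slope <- raw_data |>
  clean_data() |>
  fit_model() |>
  extract_slope()

slope
```

Because each stage depends only on the output of the one before it, any stage can be rerun or replaced without touching the others.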
If you haven’t already, go to Tools > Global Options and adjust your settings (i.e. General, Code > Editing, and Code > Display) to those recommended in the lecture and any others that you’d like to change (i.e. Appearance, Pane Layout, or R Markdown)
Restyle the following pipelines following the guidelines discussed in lecture:
flights|>filter(dest=="IAH")|>group_by(year,month,day)|>summarize(n=n(),
delay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)
flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time>
0900,sched_arr_time<2000)|>group_by(flight)|>summarize(delay=mean(
arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)
Press Option + Shift + K / Alt + Shift + K. What happens? How can you get to the same place using the menus?
Tweak each of the following R commands so that they run correctly:
Clear .RData, Never save
Native pipe
Highlight function calls; preview colors; rainbow parentheses
Code appearance
Pane Layout
Markdown Preferences
flights|>filter(dest=="IAH")|>group_by(year,month,day)|>summarize(n=n(),
delay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)
flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time>
0900,sched_arr_time<2000)|>group_by(flight)|>summarize(delay=mean(
arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)
flights |>
  filter(dest == "IAH") |>
  group_by(year, month, day) |>
  summarize(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) |>
  filter(n > 10)

flights |>
  filter(
    carrier == "UA",
    dest %in% c("IAH", "HOU"),
    sched_dep_time > 0900,
    sched_arr_time < 2000
  ) |>
  group_by(flight) |>
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    cancelled = sum(is.na(arr_delay)),
    n = n()
  ) |>
  filter(n > 10)
CSSS508/
  Homeworks/
    HW1/
      homework1.qmd
      homework1.html
    HW2/
      homework2.qmd
      homework2.html
    HW3/
      homework3.qmd
      homework3.html
    HW4/
      homework4.qmd
      homework4.html
      data.csv
    HW5/
      homework5.qmd
      homework5.html
      data/
        data_raw.csv
        data_processed.Rdata
    HW6/
    HW7/
    HW8/
    HW9/
Reproducibility is not replication.
Reproducible studies can still be wrong… and in fact reproducibility makes proving a study wrong much easier.
Reproducibility means:
Any study that isn’t reproducible can only be trusted on faith.
Reproducibility comes in three forms (Stodden 2014):
R is particularly well suited to enabling computational reproducibility.
Tools for computational reproducibility will not fix flawed research design, nor offer a remedy for improper application of statistical methods.
Those are the difficult, non-automatable things you want skills in.
Elements of computational reproducibility:
For academic papers, degrees of reproducibility vary:
Interactive documents — like Quarto docs — combine code and text together into a self-contained document.
Interactive documents allow a reader to examine your computational methods within the document itself; in effect, they are self-documenting.
By re-running the code, they reproduce your results on demand.
Common Platforms:
[
Given a vector of values:
You can select from the vector with [
You can select rows and columns from dataframes with df[rows, cols].
# A tibble: 3 × 3
      x y         z
  <int> <chr> <dbl>
1     1 a     0.957
2     2 e     0.795
3     3 f     0.734

# A tibble: 3 × 2
      x y
  <int> <chr>
1     1 a
2     2 e
3     3 f
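Since the code that produced the output above did not survive on these slides, here is a base R sketch of the same kind of selection. The vector values are made up; the data frame reuses the values shown in the output above, stored as a plain data.frame rather than a tibble:

```r
values <- c(10, 4, 1, 21)   # made-up vector

values[2]            # one element by position
values[c(1, 3)]      # several elements by position
values[-1]           # everything except the first element
values[values > 5]   # elements that meet a condition

df <- data.frame(
  x = 1:3,
  y = c("a", "e", "f"),
  z = c(0.957, 0.795, 0.734)
)

first_two_rows <- df[1:2, ]        # rows 1 and 2, all columns
xy_columns <- df[, c("x", "y")]    # all rows, columns x and y
```

Leaving the rows or the cols position empty means "keep everything" along that dimension.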
data.frame() vs. tibble()
Tibbles are the tidyverse version of a base R dataframe. Usually you can use them interchangeably without issue, but they behave slightly differently when indexing in this way, which is important to know about.
If df is a data.frame, then df[, cols] will return a vector if cols selects a single column and a data frame if it selects more than one column.
[1] 1 2 3
If df is a tibble, then [ will always return a tibble.
# A tibble: 3 × 1
      x
  <int>
1     1
2     2
3     3
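The difference can be reproduced with the sketch below. The base R part always runs; the tibble part assumes the tibble package is installed and is skipped otherwise. The drop = FALSE argument is a base R option, not something covered on these slides, that makes a data.frame behave like a tibble here:

```r
df <- data.frame(x = 1:3, y = c("a", "e", "f"))

single_column <- df[, "x"]                 # data.frame: a single column becomes a vector
kept_as_df <- df[, "x", drop = FALSE]      # drop = FALSE keeps it a data frame

is.data.frame(single_column)
is.data.frame(kept_as_df)

# Only run the tibble comparison if the tibble package is installed
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::tibble(x = 1:3, y = c("a", "e", "f"))
  tibble_column <- tb[, "x"]               # tibble: [ always returns a tibble
  print(class(tibble_column))
}
```

Code that expects a vector from df[, "x"] will therefore behave differently if df is later converted to a tibble.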
[[ and $
[, which selects many elements, is paired with [[ and $, which extract a single element.
# A tibble: 4 × 2
      x     y
  <int> <dbl>
1     1    10
2     2     4
3     3     1
4     4    21
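Using the values shown in the output above (stored here as a base data.frame, which is an assumption since the original code is not on the slide), extracting a single column looks like this:

```r
df <- data.frame(x = 1:4, y = c(10, 4, 1, 21))

by_name <- df[["y"]]      # extract one column by name, as a vector
by_dollar <- df$y         # the same extraction with $
by_position <- df[[1]]    # extract one column by position

still_a_df <- df["y"]     # single [ keeps the data frame structure

identical(by_name, by_dollar)
```

[[ and $ return the column's contents, while [ returns a smaller data frame that still contains the column.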
One of the most difficult things as a beginner in R (or any language tbh) is not always knowing what to ask to solve your issue. Being in this class is a great first step! Some other useful tools:
Adding R to your query is basic but useful and often overlooked. Including the package name, if you’re using one, is another. Finally, what is it you want to do? For example “R dplyr create new variable based on value of another.”
When you post a question online, mention R and make sure to include a reprex so people can actually understand what your issue is.
Short for reproducible example, a reprex is a version of your code that someone could copy and run on their own machine, making it possible for them to help you troubleshoot your problem.
Check out the reprex package for assistance with this!
Understanding why R works in a certain way, and developing practices that keep you organized, will make you more efficient and help prevent minor and major frustrations going forward.
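For a sense of what a reproducible example might look like, here is a minimal sketch. The data and the question are made up; what matters is that the snippet creates its own small data and runs on anyone's machine:

```r
# Small made-up data that reproduces the problem (no external files needed)
scores <- data.frame(
  student = c("a", "b", "c"),
  score = c(90, NA, 75)
)

# The question being asked: "Why does this return NA instead of a number?"
mean_score <- mean(scores$score)
mean_score
```

If the reprex package is installed, copying code like this to the clipboard and running reprex::reprex() renders it together with its output, ready to paste into a question.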