Introduction to R, RStudio, and Quarto
CS&SS 508 • Lecture 1
26 March 2024
Victoria Sass
R and RStudio for over 10 years 😱
Let’s start by getting to know each other a bit better. On your index card write the following:
Name and pronouns
Program and year
Experience with programming (in R or more generally)
One word that best describes your feelings about taking this class
Would you rather be able to converse with (non-human) animals, or have lifelong fluency in every (human) language?
Pair up with someone nearby and introduce yourself to one another. Let’s take about 5-10 minutes to do this.
The syllabus (as well as lots of other information) can be found on our course website:
https://vsass.github.io/CSSS508
Feel free to follow along online as I run through the syllabus!
This course is intended to give students a foundational understanding of programming in the statistical language R. This knowledge is intended to be broadly useful wherever you encounter data in your education and career. General topics we will focus on include:
By the end of this course you should feel confident approaching any data you encounter in the future. We will cover almost no statistics; however, the intention is that this course will leave you prepared to progress in CS&SS or STAT courses with the ability to focus on statistics instead of coding. Additionally, the basic concepts you learn, such as logic and algorithmic thinking, will be applicable to other programming languages and to research in general.
Lecture: On Tuesdays we will meet in the CSSCR lab for an interactive session where we’ll cover a specific topic to help you learn fundamental skills, concepts, and principles for learning R. Additionally, these sessions will provide you with the opportunity to work with each other to learn and practice key skills in R. I will be available to answer questions and help troubleshoot code as well.
Office Hours: Drop in to ask questions, get advice, or continue discussions from lab/lecture. We can talk in a breakout room or with the group!
How to Contact Me
Please message me in our Slack Workspace rather than sending me an email. I get far too many emails a day and I don’t want to miss your message!
Communication
Learning is collaborative! In addition to being the place to communicate with me, our Slack is also where you can ask one another questions, share resources, and just generally check in with each other about how your adventures with R are going. You can find the link to join our workspace on our course Canvas.
Homework & Peer-Reviews
We will be using Canvas solely for homework & peer review submissions/deadlines and for any links I only want to distribute to those registered for this class (i.e. Slack and Office Hours Zoom).
Course Content
All course content will be accessible on our course website: https://vsass.github.io/CSSS508.
If you’ve never used Slack before you’ll need to download the desktop app.
A useful quick-start guide can be found here.
Go to our Canvas site for the invite link to join our private workspace.
Date | Topic |
---|---|
March 26 | Week 1: Introduction to R, RStudio, and Quarto |
April 2 | Week 2: Visualizing Data |
April 9 | Week 3: Workflow and Reproducibility |
April 16 | Week 4: Importing, Exporting, and Cleaning Data |
April 23 | Week 5: Manipulating and Summarizing Data |
April 30 | Week 6: Data Structures & Types |
May 7 | Week 7: Working with Text Data |
May 14 | Week 8: Writing Functions |
May 21 | Week 9: Iteration |
May 28 | Week 10: Next Steps |
None 😎
Materials: All course materials will be provided on the course website. This includes:
Laptops: You’re welcome to bring a laptop to class if you’d prefer to use your own machine.
Keep In Mind
The versions of R, RStudio, and Quarto (as well as any packages you have installed) will not necessarily be the same/up to date if you do your work on different computers. My advice is to consistently use the same device for homework assignments or to make sure to download the latest versions of R, RStudio, and Quarto when using a new machine.
Textbooks: This course has no textbook. However, I will be suggesting selections from R for Data Science to pair with each week’s topic. While not required, I strongly suggest reading those selections before doing the homework for that week.
Credit/No Credit (C/NC); You need at least 60% to get Credit
9 total homeworks; assessed on a 0-3 point rubric. Assigned at the end of lecture sessions and due a week later.
Evaluation | Points |
---|---|
Didn't turn anything in. | 0 |
Turned in but low effort, ignoring many directions. | 1 |
Decent effort, followed directions with some minor issues. | 2 |
Nailed it! | 3 |
One per homework, assessed on a binary satisfactory/unsatisfactory scale. Due 5 days after homework due date.
Evaluation | Points |
---|---|
Didn’t follow all peer-review instructions. | 0 |
Peer review is at least several sentences long, | 1 |
Homework/peer grading instructions and deadlines can be found on the Homework page of the course website. All homework will be turned in on Canvas by 4:30pm the day it is due.
Late Homework Will Automatically Lose Peer-Review Credit
Peer reviews are randomly assigned when the due date/time is reached. Therefore, if you don’t submit your homework on time, you will not be given a peer’s homework to review and vice versa. That said, life is messy and complicated and we all miss deadlines for a variety of reasons. Therefore, you can request that I review and provide feedback on a late assignment (message me on Slack) but you won’t be able to earn peer-review credit for that particular homework.
Yes, because:
You will write your reports better knowing others will see them
You learn alternate approaches to the same problem
You will have more opportunities to practice and have the material sink in
How to peer review:
Academic integrity is essential to this course and to your learning. Violations of the academic integrity policy include but are not limited to:
I hope you will collaborate with peers on assignments and use Internet resources when questions arise to help solve issues. The key is that you ultimately submit your own work.
Anything found in violation of this policy will be automatically given a score of 0 with no exceptions. If the situation merits, it will also be reported to the UW Student Conduct Office, at which point it is out of my hands. If you have any questions about this policy, please do not hesitate to reach out and ask.
I’m committed to fostering a friendly and inclusive classroom environment in which all students have an equal opportunity to learn and succeed. This course is an attempt to make an often difficult and frustrating experience (learning R for the first time) less obfuscating, daunting, and stressful. That said, learning happens in different ways and at a different pace for everyone. Learning is also a collaborative and creative process and my aim is to create an environment in which you all feel comfortable asking questions of me and each other. Treat your peers and yourself with empathy and respect as you all approach this topic from a range of backgrounds and experiences (in programming and in life).
Names & Pronouns: Everyone deserves to be addressed respectfully and correctly. Fill out your profile on Slack with your picture, preferred name (as your Display Name), and correct gender pronouns so we can all be on the same page!
Covid Considerations: I will follow all University rules and procedures regarding Covid, which may or may not change during the quarter. I also recognize that Covid creates unique circumstances and concerns for each of us, which may limit your ability to fully attend or participate in this course. You never need to apologize to me for anything pandemic-related. If there is something I can do to make you feel more comfortable during class, please let me know!
Diversity: Diverse backgrounds, embodiments, and experiences are essential to the critical thinking endeavor at the heart of university education. Therefore, I expect you to follow the UW Student Conduct Code in your interactions with your colleagues and me in this course by respecting the many social and cultural differences among us, which may include, but are not limited to: age, cultural background, disability, ethnicity, family status, gender identity and presentation, body size/shape, citizenship and immigration status, national origin, race, religious and political beliefs, sex, sexual orientation, socioeconomic status, and veteran status.
Accessibility & Accommodations: Your experience in this class is important to me. If you have already established accommodations with Disability Resources for Students (DRS), please communicate your approved accommodations to me at your earliest convenience so we can discuss your needs in this course. If you have not yet established services through DRS, but have a temporary health condition or permanent disability that requires accommodations (conditions include but are not limited to: mental health, attention-related, learning, vision, hearing, physical or health impacts), you are welcome to contact DRS at 206-543-8924, uwdrs@uw.edu, or through their website.
Religious Accommodations: Washington state law requires that UW develop a policy for accommodation of student absences or significant hardship due to reasons of faith or conscience, or for organized religious activities. The UW's policy, including more information about how to request an accommodation, is available at Religious Accommodations Policy. Accommodations must be requested within the first two weeks of this course using the Religious Accommodations Request form.
Getting Help: If at any point during the quarter you find yourself struggling to keep up, please let me know! I am here to help. A great place to start this process is by chatting before class, coming to office hours, or messaging me on Slack.
Also, help one another as you navigate this course! Slack allows you to chat directly with one another, send messages to the whole class about specific topics (see the already-created # r-code-questions and # quarto-questions channels), send snippets of code or entire files to one another, and much more.
Feedback
If you have feedback on any part of this course or the classroom environment I want to hear it! You can message me directly on Slack or send me an anonymous message here. Additionally, I will send out a mid-quarter feedback survey on Slack around Week 5.
Don’t ask like this:
tried lm(y~x) but it iddn’t work wat do
Instead, ask like this:
y <- seq(1:10) + rnorm(10)
x <- seq(0:10)
model <- lm(y ~ x)
Running the block above gives me the following error, anyone know why?
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) : variable lengths differ (found for 'x')
FYI
If you ask me a question directly over Slack I may send out your question (anonymously) along with my answer to the whole course.
Bold usually indicates an important vocabulary term. Remember these!
Italics indicate emphasis but also are used to point out things you must click with a mouse.
Code represents R code you could use to perform actions, as in “Ctrl-P to open the print dialogue.”
Code chunks that span the page represent actual R code embedded in the slides.
Since the lectures for this class were created using Quarto, there are numerous built-in features meant to facilitate your learning, particularly of R.
For R code embedded in the slides you will see a copy button, which you can click to copy the code. You can then paste it in your own Quarto document or R script to run it in your session of RStudio.
R is a programming language built for statistical computing.
If one already knows Stata or similar software, why use R?
RStudio is a “front-end” or integrated development environment (IDE) for R that can make your life easier.
We’ll show RStudio can…
It can also…
Manage git repositories
Run interactive tutorials
Handle other languages like C++, Python, SQL, HTML, and shell scripting
Built upon many of the developments of the R Markdown ecosystem, Quarto distills them into one coherent system and additionally expands its functionality by supporting other programming languages besides R, including Python and Julia.
The ability to create Quarto files in R is a powerful advantage. It allows us to:
If you don’t already have R and RStudio on your machine, now is the time to install them!
Open up RStudio now and choose File > New File > R Script.
Then, let’s get oriented with the interface:
Top Left: Code editor pane, data viewer (browse with tabs)
Bottom Left: Console for running code (> prompt)
Top Right: List of objects in environment, code history tab.
Bottom Right: Tabs for browsing files, viewing plots, managing packages, and viewing help files.
There are several ways to run R code in RStudio:
Highlight lines in the editor window and click Run at the top right corner of said window or hit Ctrl+Enter or ⌘+Enter to run them all.
With your caret on a line you want to run, hit Ctrl+Enter or ⌘+Enter. Note your caret moves to the next line, so you can run code sequentially with repeated presses.
Type individual lines in the console and hit Enter.
The console will show the lines you ran followed by any printed output.
If you mess up (e.g. leave off a parenthesis), R might show a + sign prompting you to finish the command:
Finish the command or hit Esc to get out of this.
In the console, type 123 + 456 + 789 and hit Enter.
The [1] in the output indicates the numeric index of the first element on that line.
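To see what that index is doing, it can help to print something longer than one value. A small sketch (the name long_vector is made up for illustration):

```r
# A single result: [1] marks the first (and only) element
123 + 456 + 789

# A longer vector wraps across several console lines; each line starts with
# the index of its first element (where it wraps depends on your console width)
long_vector <- 101:140
long_vector
```
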
Now in your blank R document in the editor, try typing the line sqrt(400) and either clicking Run or hitting Ctrl+Enter or ⌘+Enter.
sqrt() is an example of a function in R.
Arguments are the inputs to a function. In this case, the only argument to sqrt() is x which can be a number or a vector of numbers.
The basic template of a function is
function_name(argument1, argument2 = value2, argument3 = value3...)
Something to Note
Functions can have a wide range of arguments and some are required for the function to run, while others remain optional. You can see from each function’s help page which are not required because they will have an = with some default value pre-selected. If there is no = it is up to the user to define that value and it’s therefore a required specification.
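As one illustration of required versus optional arguments, here is a short sketch using round() from Base R, which was picked for this example and is not part of the slides:

```r
# round() needs x, while digits has a default (digits = 0), so it is optional
round(3.14159)               # relies on the default number of digits
round(3.14159, digits = 2)   # overrides the default

# args() prints a function's arguments and any defaults
args(round)
```
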
If we didn’t have a good guess as to what sqrt() will do, we can type ?sqrt in the console and look at the Help panel on the bottom right.
If you’re trying to look up the help page for a function and can’t remember its name you can search by a keyword and you will get a list of help pages containing said keyword.
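Both kinds of lookup can be sketched as follows. The search term and the object names sqrt_help and keyword_hits are only illustrative:

```r
# At the console you would normally just type:
#   ?sqrt               (help page for a function whose name you know)
#   ??"square root"     (search help pages for a keyword or phrase)

# The same lookups written as ordinary function calls. Storing the results
# instead of printing them means no help window is opened.
sqrt_help <- help("sqrt")
keyword_hits <- help.search("histogram", package = "graphics")
class(keyword_hits)
```

Printing either object (for example by typing keyword_hits) displays the results.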
Help files provide documentation on how to use functions and what functions produce. They will generally consist of the following sections:
R stores everything as an object, including data, functions, models, and output.
Operators like <- are functions that look like symbols but typically sit between their arguments (e.g. numbers or objects) instead of having them inside () like in sqrt(x).
We do math with operators, e.g., x + y.
+ is the addition operator!
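One way to convince yourself that operators really are functions is to call them in the usual function form by wrapping the symbol in backticks. A minimal sketch with made-up values:

```r
x <- 2
y <- 3

x + y            # the usual form, with the operator between its arguments
`+`(x, y)        # the same call written like sqrt(x)

`<-`(z, x + y)   # assignment is a function too; this creates z
z
```

You would rarely write code this way, but it shows why the two styles are equivalent.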
You can display or “call” an object simply by using its name.
Object names must begin with a letter and can contain letters, numbers, ., and _.
Try to be consistent in naming objects. RStudio auto-complete means long, descriptive names are better than short, vague ones! Good names save confusion later!
Separating words with _ is a common and practical naming convention that I strongly recommend.
Remember that object names are CaSe SeNsItIvE!!
Also, TYPOS MATTER!
An object’s name represents the information stored in that object, so you can treat the object’s name as if it were the values stored inside.
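For instance, assuming a hypothetical object called new_object:

```r
new_object <- 144   # store a value under a name

new_object          # "call" the object to display what it holds
new_object * 10     # use the name wherever you would use the value
sqrt(new_object)
```
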
A vector is one of many data types available in R. Specifically, it is a series of elements, such as numbers, strings, or booleans (i.e. TRUE, FALSE).
You can create a vector using the function c() which stands for “combine” or “concatenate”.
If you name an object the same name as an existing object, it will overwrite it.
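A short sketch of both points, using an illustrative name and illustrative values:

```r
new_vector <- c(4, 9, 16, 25)   # combine four numbers into one vector
new_vector
sqrt(new_vector)                # functions like sqrt() work on every element

# Reusing the name replaces the old contents without any warning
new_vector <- c("red", "green")
new_vector
```
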
There are other, more complex data types in R which we will discuss later in the quarter! These include matrices, arrays, lists, and dataframes.
Most data sets you will work with will be read into R and stored as a dataframe, so this course will mainly focus on manipulating and visualizing these objects.
Let’s try making a Quarto file:
My First Qmd
and click Create
---
title: "ggplot2 demo"
author: "Norah Jones"
date: "5/22/2021"
format:
  html:
    fig-width: 8
    fig-height: 4
    code-fold: true
---
## Air Quality
@fig-airquality further explores the impact of temperature on ozone level.
```{r}
#| label: fig-airquality
#| fig-cap: "Temperature and ozone level."
#| warning: false
library(ggplot2)
ggplot(airquality, aes(Temp, Ozone)) +
  geom_point() +
  geom_smooth(method = "loess")
```
Elements of a Quarto document include:
A YAML header (surrounded by ---s).
Code chunks (surrounded by sets of three backticks) and/or their output.
The header of a .qmd file is a YAML code block, and everything else is part of the main document. Try adding some of these other fields to your YAML and re-render it to see what it looks like.
To mess with global formatting, you can modify the header.
Include math \(y= \left( \frac{2}{3} \right)^2\) inline.
Or centered on your page like so:
\[\frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}_n\]
Or write code-looking font.
Or a block of code:
y <- 1:5
z <- y^2
Quarto docs can be modified in many ways. Visit these links for more information.
Inside Quarto, lines of R code are called chunks. Code is sandwiched between sets of three backticks and {r}. This chunk of code…
Produces this output in your document:
speed dist
Min. : 4.0 Min. : 2.00
1st Qu.:12.0 1st Qu.: 26.00
Median :15.0 Median : 36.00
Mean :15.4 Mean : 42.98
3rd Qu.:19.0 3rd Qu.: 56.00
Max. :25.0 Max. :120.00
Add this code chunk to your document!
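Judging by the output shown above, the body of such a chunk could be as small as the single call below. This is a sketch that assumes the built-in cars data:

```r
# In a .qmd file this line sits between the opening three-backtick {r} line
# and the closing three-backtick line
summary(cars)
```
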
Chunks have options that control what happens with their code. They are specified as special comments at the top of a block. For example:
Some useful and common options include:
echo: false - Keeps R code from being shown in the document
eval: false - Shows R code in the document without running it
include: false - Hides all output but still runs code (good for setup chunks where you load packages!)
output: false - Doesn’t include the results of that code chunk in the output
cache: true - Saves results of running that chunk so if it takes a while, you won’t have to re-run it each time you re-render the document
fig.height: 5, fig.width: 5 - Modify the dimensions of any plots that are generated in the chunk (units are in inches)
fig.cap: "Text" - Add a caption to your figure in the chunk
Try adding or changing the chunk options for the chunk in my_first_Rmd.qmd and re-render your document to see what happens.
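As a rough sketch of how options sit at the top of a chunk, the contents of one might look like this. The label and caption text are made up, and the #| prefix follows the sample document earlier in these slides:

```r
#| label: speed-histogram
#| echo: false
#| fig-cap: "Distribution of speeds in the cars data"

# With echo: false the rendered document shows the plot but not this code
hist(cars$speed)
```

Because each option line starts with #, R itself treats it as a comment; only Quarto reads it as an instruction.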
Sometimes we want to insert a value directly into our text. We do that using code in single backticks starting off with r.
Four score and seven years ago is the same as `r 4*20 + 7` years.
Four score and seven years ago is the same as 87 years.
The value of `x` rounded to the nearest two decimals is `r round(x, 2)`.
The value of x rounded to the nearest two decimals is 8.77.
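The inline expressions are ordinary R, so you can check them in the console first. In this sketch the value given to x is an assumption chosen to reproduce the rounded result above; the slides do not show how x was defined:

```r
x <- 8.7654    # hypothetical value, not taken from the slides

4 * 20 + 7     # what the first inline expression evaluates to
round(x, 2)    # what the second inline expression evaluates to
```
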
Having R dump values directly into your document protects you from silly mistakes:
In your YAML header, make the date come from R’s Sys.time() function by changing:
date: "March 26, 2024"
to
date: "`r Sys.time()`"
R and Packages
Simply by downloading R you have access to what is referred to as Base R. That is, the built-in functions and datasets that R comes equipped with, right out of the box.
Examples that we’ve already seen include <-, sqrt(), +, Sys.time(), and summary() but there are obviously many many more.
You can see a whole list of what Base R contains by running library(help = "base") in the console.
R Dataset: cars
In the sample Quarto document you are working on, we can load the built-in data cars, which loads as a dataframe, a type of object mentioned earlier. Then, we can look at it in a couple different ways.
data(cars) loads this dataframe into the Global Environment.
View(cars) pops up a Viewer tab in the source pane (“interactive” use only, don’t put in Quarto document!).
R Dataset: cars
str() displays the structure of an object:
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
summary() displays summary information:
R is pretty… Basic
hist() generates a histogram of a vector. Note that you can access a vector that is a column of a dataframe using $, the extract operator.
We can try and make this histogram a bit more appealing by adding more arguments and their specifications.
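A sketch of the two versions side by side. The particular arguments, labels, and colour below are illustrative choices rather than the ones used in the original slides:

```r
# Bare-bones histogram of one column, pulled out of the dataframe with $
hist(cars$speed)

# The same histogram with a few optional arguments filled in
hist(cars$speed,
     breaks = 10,                       # suggested number of bins
     xlab = "Speed (mph)",              # x-axis label
     main = "Histogram of car speeds",  # plot title
     col = "cornflowerblue")            # fill colour of the bars
```
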
We can also make scatterplots to show the relationship between two variables.
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Speeds and stopping distances of cars",
     pch = 16) # Point shape
abline(h = mean(cars$dist), col = "firebrick") # add horizontal line (y-value)
abline(v = mean(cars$speed), col = "cornflowerblue") # add vertical line (x-value)
Note
dist ~ speed is a formula of the type y ~ x. The first element (dist) gets plotted on the y-axis and the second (speed) goes on the x-axis. Regression formulae follow this convention as well!
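To illustrate the regression side of that convention, the same formula can be handed to lm(). This is a minimal sketch and the object name model is arbitrary:

```r
# dist is the outcome (y) and speed is the predictor (x), just as in plot()
model <- lm(dist ~ speed, data = cars)

coef(model)   # intercept and slope of the fitted line
```
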
R Dataset: swiss
Let’s look at another built-in dataset.
First, run ?swiss in the console to see what things mean.
Then, load it using data(swiss)
swiss
What makes R so powerful though is its extensive library of packages. Due to its open-source nature, anyone (even you!) can write a package that others can use.
Packages contain pre-made functions and/or data that can be used to extend Base R’s capabilities.
Base R/Package Analogy
Base R is like creating a recipe from scratch: going to the store and buying all the ingredients and cooking it by yourself. Using a package is more akin to using a meal-kit service: you still have to cook but you’re provided with the ingredients and step-by-step instructions for making the recipe.
As of this writing there are 20,613 available packages!
To use a package outside of Base R you need to do two things:
Download the package from CRAN (The Comprehensive R Archive Network) by running the following in your console:
install.packages("package_name")
This downloads the package to your local machine (or the server of whatever remote machine you’re using). Thus, you only ever need to do it once for each package!
Load the package into R so you can use it. You’ll do this by putting the following in an R Script or embedded in a code chunk in a Quarto file:
library(package_name)
gt Package
Let’s make a table that’s more polished than the code-y output R automatically gives us. To do this, we’ll want to install our first package called gt. In the console, run: install.packages("gt").
Nesting Functions
Note that we put the summary(swiss) function call inside the as.data.frame.matrix() call which all went into the gt() function. This is called nesting functions and is very common. I’ll introduce a method next week to avoid confusion from nesting too many functions inside each other.
What’s as.data.frame.matrix() Doing?
gt() takes as its first argument a data.frame-type object, while summary() produces a table-type object. Therefore, as.data.frame.matrix() was additionally needed to turn the table into a data.frame.
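The nesting can be unpacked one step at a time. In this sketch the intermediate names summary_table and summary_df are made up, and the gt() call only runs if the gt package is already installed:

```r
# Step by step: check what each function hands to the next one
summary_table <- summary(swiss)
class(summary_table)    # a table-type object

summary_df <- as.data.frame.matrix(summary_table)
class(summary_df)       # now a data.frame, which gt() can accept

# The same steps nested into one line, run from the inside out
if (requireNamespace("gt", quietly = TRUE)) {
  library(gt)
  gt(as.data.frame.matrix(summary(swiss)))
}
```

Storing intermediate results like this is one way to keep long nested calls readable.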
Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality |
---|---|---|---|---|---|
Min. :35.00 | Min. : 1.20 | Min. : 3.00 | Min. : 1.00 | Min. : 2.150 | Min. :10.80 |
1st Qu.:64.70 | 1st Qu.:35.90 | 1st Qu.:12.00 | 1st Qu.: 6.00 | 1st Qu.: 5.195 | 1st Qu.:18.15 |
Median :70.40 | Median :54.10 | Median :16.00 | Median : 8.00 | Median : 15.140 | Median :20.00 |
Mean :70.14 | Mean :50.66 | Mean :16.49 | Mean :10.98 | Mean : 41.144 | Mean :19.94 |
3rd Qu.:78.45 | 3rd Qu.:67.65 | 3rd Qu.:22.00 | 3rd Qu.:12.00 | 3rd Qu.: 93.125 | 3rd Qu.:21.70 |
Max. :92.50 | Max. :89.70 | Max. :37.00 | Max. :53.00 | Max. :100.000 | Max. :26.60 |
gt’s Version of head() and tail()
Fertility Agriculture Examination Education Catholic
Courtelary 80.2 17.0 15 12 9.96
Delemont 83.1 45.1 6 9 84.84
Franches-Mnt 92.5 39.7 5 5 93.40
Moutier 85.8 36.5 12 7 33.77
Neuveville 76.9 43.5 17 15 5.16
Porrentruy 76.1 35.3 9 7 90.57
Infant.Mortality
Courtelary 22.2
Delemont 22.2
Franches-Mnt 20.2
Moutier 20.3
Neuveville 20.6
Porrentruy 26.6
Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality | |
---|---|---|---|---|---|---|
1 | 80.2 | 17.0 | 15 | 12 | 9.96 | 22.2 |
2 | 83.1 | 45.1 | 6 | 9 | 84.84 | 22.2 |
3 | 92.5 | 39.7 | 5 | 5 | 93.40 | 20.2 |
4..44 | ||||||
45 | 35.0 | 1.2 | 37 | 53 | 42.34 | 18.0 |
46 | 44.7 | 46.6 | 16 | 29 | 50.43 | 18.2 |
47 | 42.8 | 27.7 | 22 | 29 | 58.33 | 19.3 |
👋 Bye Bye as.data.frame.matrix()
We no longer need as.data.frame.matrix() since we’re no longer using summary(). Both head() and gt_preview() take a data.frame-type object as their first argument which is the same data type as swiss.
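A sketch of these calls follows. The top_n and bottom_n values are an assumption chosen to match the three-and-three preview shown above, and the gt_preview() call is skipped when gt is not installed:

```r
head(swiss)       # first six rows, printed as plain console output
tail(swiss, 3)    # last three rows

# gt_preview() shows the top and bottom of a dataframe in one formatted table
if (requireNamespace("gt", quietly = TRUE)) {
  gt::gt_preview(swiss, top_n = 3, bottom_n = 3)
}
```
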
Comments
Anything written after # will be ignored by R.
Comments help collaborators and future-you understand what, and more importantly, why you are doing what you’re doing with that specific line/chunk of code.
Additionally, comments allow you to explain your overall coding plan and record anything important that you’ve discovered along the way.
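A small sketch of comments that explain the why as well as the what. The unit conversion and the name dist_m are invented for illustration:

```r
# Convert stopping distance from feet to metres so it matches the units
# used in the rest of a (hypothetical) report
dist_m <- cars$dist * 0.3048   # 1 ft = 0.3048 m

# mean(cars$dist)   # a commented-out line like this one is skipped entirely
mean(dist_m)
```
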