Working with Text Data

CS&SS 508 • Lecture 7

7 May 2024

Victoria Sass

Roadmap


Last time, we learned:

  • Types of Data
    • Numbers
    • Missing Values
  • Data Structures
    • Vectors
    • Matrices
    • Lists


Today, we will cover:

  • Types of Data
    • Strings
  • Pattern Matching & Regular Expressions

Strings

Basics of Strings

  • A general programming term for a unit of character data is a string
    • Strings are a sequence of characters
    • In R, “strings” and “character data” are mostly interchangeable.
    • Some languages have more precise distinctions, but we won’t worry about that here!
  • We can create strings by surrounding text, numbers, spaces, or symbols with quotes!
    • Examples: "Hello! My name is Vic" or "%*$#01234"
  • You can create a string using either single quotes (' ') or double quotes (" ")
    • In the interests of consistency, the tidyverse style guide recommends using " ", unless the string itself contains double quotes (a quick example follows this list)
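Both of these are valid strings, for instance (the object names here are just placeholders):

string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'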

Escaping with Strings

We use a lot of different symbols in our code that we might actually want to represent within a string itself. To do that, we need to escape that particular character. We can do that using \.

For instance, if we want to include a literal single or double quote in our string, we’d escape it by writing:

"\'"
'\"'
1
Single quote.
2
Double quote.

Similarly, if we want to represent a \ we’ll need to escape it as well…

"\\"
3
Backslash.

Note: When you print these objects you’ll see the escape characters. To view the string’s actual contents (and not the syntax needed to construct it), use str_view().

str_view(c("\'", '\"', "\\"))
4
All stringr functions begin with the prefix str_, which works nicely with RStudio’s auto-complete feature.
> [1] │ '
> [2] │ "
> [3] │ \

Other Special Characters

There are other things you may want to represent inside a character string, such as a new line, or a tab space.

str_view("Sometimes you need\nto create another line.")
str_view("\tOther times you just need to indent somewhere.")
5
Use \n to create a new line. Helpful when plotting if you have variable names or values that are wordy! If you need to do this for many values, you can use str_wrap() and specify the character width you desire.
6
Use \t to add a tab. str_view() highlights tabs in blue in your console so they stand out from other whitespace.
> [1] │ Sometimes you need
>     │ to create another line.
> [1] │ {\t}Other times you just need to indent somewhere.

Additionally, you can represent Unicode characters which will be written with the \u or \U escape.

str_view(c("\U1F00F", "\u2866", "\U1F192"))
> [1] │ 🀏
> [2] │ ⡦
> [3] │ 🆒

Data: King County Restaurant Inspections!

Today we’ll study real data on food safety inspections in King County, collected from data.kingcounty.gov.

Note these data are fairly large in their native .csv format. The following code can be used to download the data directly from my Github page as a smaller .Rdata object:

load(url("https://github.com/vsass/CSSS508/raw/main/Lectures/Lecture7/data/restaurants.Rdata"))

Quick Examination of the Data

glimpse(restaurants)
> Rows: 256,681
> Columns: 22
> $ Name                         <chr> "#807 TUTTA BELLA", "#807 TUTTA BELLA", "…
> $ `Program Identifier`         <chr> "#807 TUTTA BELLA", "#807 TUTTA BELLA", "…
> $ `Inspection Date`            <chr> "03/02/2023", "03/02/2023", "08/31/2022",…
> $ Description                  <chr> "Seating 0-12 - Risk Category III", "Seat…
> $ Address                      <chr> "2746 NE 45TH ST", "2746 NE 45TH ST", "27…
> $ City                         <chr> "SEATTLE", "SEATTLE", "SEATTLE", "SEATTLE…
> $ `Zip Code`                   <dbl> 98105, 98105, 98105, 98105, 98105, 98105,…
> $ Phone                        <chr> "(206) 722-6400", "(206) 722-6400", "(206…
> $ Longitude                    <dbl> -122.2964, -122.2964, -122.2964, -122.296…
> $ Latitude                     <dbl> 47.66231, 47.66231, 47.66231, 47.66231, 4…
> $ `Inspection Business Name`   <chr> "#807 TUTTA BELLA", "#807 TUTTA BELLA", "…
> $ `Inspection Type`            <chr> "Routine Inspection/Field Review", "Routi…
> $ `Inspection Score`           <dbl> 20, 20, 10, 10, 0, 0, 0, 30, 30, 0, 47, 4…
> $ `Inspection Result`          <chr> "Unsatisfactory", "Unsatisfactory", "Unsa…
> $ `Inspection Closed Business` <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
> $ `Violation Type`             <chr> "RED", "RED", "BLUE", "RED", NA, NA, NA, …
> $ `Violation Description`      <chr> "1300 - Food contact surfaces cleaned and…
> $ `Violation Points`           <dbl> 15, 5, 5, 5, 0, 0, 0, 5, 25, 0, 5, 2, 5, …
> $ Business_ID                  <chr> "PR0089260", "PR0089260", "PR0089260", "P…
> $ Inspection_Serial_Num        <chr> "DAJ5DTHLV", "DAJ5DTHLV", "DAEEWQC0L", "D…
> $ Violation_Record_ID          <chr> "IVBTPZO0B", "IV5GOME67", "IVQ7QYW2V", "I…
> $ Grade                        <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,…


Good Questions to Ask

  • What does each row represent?
  • Is the data in long or wide format?
  • What are the key variables?
  • How are the data stored? (data type)

Creating Strings

You can create strings based on the value of other strings with str_c() (string combine), which takes any number of vectors and returns a character vector.

str_c(c("CSSS", "STAT", "SOC"), 508)
str_c(c("CSSS", "STAT", "SOC"), 508, sep = " ")
str_c(c("CSSS", "STAT", "SOC"), 508, sep = " ", collapse = ", ")
7
By default, str_c() doesn’t put any separator between the pieces it combines.
8
You can add a specific separator, including a space, using the sep argument.
9
If you want to combine the output into a single string, use collapse.
> [1] "CSSS508" "STAT508" "SOC508" 
> [1] "CSSS 508" "STAT 508" "SOC 508" 
> [1] "CSSS 508, STAT 508, SOC 508"

Example #1 with Restaurant Data

restaurants |> 
  select(Name, Address, City) |> 
  distinct() |> 
  mutate(Sentence = str_c(Name, " is located at ", Address, " in ", City, "."),
         .keep = "none")
10
Notice there are spaces at the beginning and end of the fixed character strings. This is because if we used the sep argument here it would add a space before the period at the end of the sentence. So instead, we can add them directly where we want them.
11
Using .keep = "none" here in order to see just the results of our mutate.
> # A tibble: 10,969 × 1
>    Sentence                                                                    
>    <chr>                                                                       
>  1 #807 TUTTA BELLA is located at 2746 NE 45TH ST in SEATTLE.                  
>  2 +MAS CAFE  is located at 1906 N 34TH ST in SEATTLE.                         
>  3 ?al?al Cafe is located at 122 2ND AVE S in SEATTLE.                         
>  4 100 LB CLAM is located at 1001 FAIRVIEW AVE N Unit 1700A in SEATTLE.        
>  5 1000 SPIRITS is located at 1225 1ST AVE in SEATTLE.                         
>  6 100TH AVE CAKES is located at 15364 NE 96TH PL in REDMOND.                  
>  7 108 VIETNAMESE AUTHENTIC  CUISINE is located at 18114 E VALLEY HWY in KENT. 
>  8 11TH FRAME RESTAURANT & LOUNGE is located at 7638 NE BOTHELL WAY in KENMORE.
>  9 125TH ST GRILL is located at 12255 AURORA AVE N in Seattle.                 
> 10 12S TACOS MEXICAN FOOD KC1012 is located at 625 S 4TH ST in RENTON.         
> # ℹ 10,959 more rows

Example #2 with Restaurant Data

As we saw in the previous example, when you’re mixing many fixed and variable strings with str_c(), the code can get cluttered with quotation marks pretty quickly. An alternative with simpler syntax is str_glue(), in which anything inside {} is evaluated as if it were outside the quotes.

restaurants |> 
  select(Name, Address, City) |> 
  distinct() |> 
  mutate(Sentence = str_glue("{Name} is located at {Address} in {City}."), 
         .keep = "none")
> # A tibble: 10,969 × 1
>    Sentence                                                                    
>    <glue>                                                                      
>  1 #807 TUTTA BELLA is located at 2746 NE 45TH ST in SEATTLE.                  
>  2 +MAS CAFE  is located at 1906 N 34TH ST in SEATTLE.                         
>  3 ?al?al Cafe is located at 122 2ND AVE S in SEATTLE.                         
>  4 100 LB CLAM is located at 1001 FAIRVIEW AVE N Unit 1700A in SEATTLE.        
>  5 1000 SPIRITS is located at 1225 1ST AVE in SEATTLE.                         
>  6 100TH AVE CAKES is located at 15364 NE 96TH PL in REDMOND.                  
>  7 108 VIETNAMESE AUTHENTIC  CUISINE is located at 18114 E VALLEY HWY in KENT. 
>  8 11TH FRAME RESTAURANT & LOUNGE is located at 7638 NE BOTHELL WAY in KENMORE.
>  9 125TH ST GRILL is located at 12255 AURORA AVE N in Seattle.                 
> 10 12S TACOS MEXICAN FOOD KC1012 is located at 625 S 4TH ST in RENTON.         
> # ℹ 10,959 more rows

Example #3 with Restaurant Data

If you want to create a summary of certain character strings you can use str_flatten() which takes a character vector and combines each element of the vector into a single string.

restaurants |> 
  select(Name, `Inspection Score`) |>
  summarize(inspection_scores = str_flatten(`Inspection Score`, collapse = ", "), 
            .by = Name)
11
Notice that when a variable has spaces in its name (rather than being written in snake_case with underscores, for instance) you need to put backticks around it so R knows it is a single object name.
> # A tibble: 9,878 × 2
>    Name                                inspection_scores                        
>    <chr>                               <chr>                                    
>  1 "#807 TUTTA BELLA"                  20, 20, 10, 10, 0, 0                     
>  2 "+MAS CAFE "                        0, 30, 30, 0, 47, 47, 47, 47, 47, 0, 0   
>  3 "?al?al Cafe"                       0, 0                                     
>  4 "100 LB CLAM"                       0, 0, 0, 25, 25, 25, 25, 0, 0            
>  5 "1000 SPIRITS"                      0, 5, 0, 5, 0, 5, 0, 5, 0, 32, 32, 32, 2…
>  6 "100TH AVE CAKES"                   0, 0, 0, 0                               
>  7 "108 VIETNAMESE AUTHENTIC  CUISINE" 35, 35, 35, 30, 30, 15, 15               
>  8 "11TH FRAME RESTAURANT & LOUNGE"    20, 20, 0, 10, 10, 5, 0, 30, 30, 18, 18,…
>  9 "125TH ST GRILL"                    0, 20, 20, 20, 0, 20, 20, 20, 18, 18, 18…
> 10 "12S TACOS MEXICAN FOOD KC1012"     <NA>                                     
> # ℹ 9,868 more rows

Example #4 with Restaurant Data

What if we want to plot one of the variables in our dataset but many of its values are too long and it’d be too arduous to manually add \n to every long value? There’s str_wrap()!

restaurants |> 
  mutate(Name = str_wrap(Name, width = 20)) |> 
  distinct(Name)
> # A tibble: 9,873 × 1
>    Name                               
>    <chr>                              
>  1 "#807 TUTTA BELLA"                 
>  2 "+MAS CAFE"                        
>  3 "?al?al Cafe"                      
>  4 "100 LB CLAM"                      
>  5 "1000 SPIRITS"                     
>  6 "100TH AVE CAKES"                  
>  7 "108 VIETNAMESE\nAUTHENTIC CUISINE"
>  8 "11TH FRAME\nRESTAURANT & LOUNGE"  
>  9 "125TH ST GRILL"                   
> 10 "12S TACOS MEXICAN\nFOOD KC1012"   
> # ℹ 9,863 more rows
[Rendered table of the wrapped Name values, e.g. "108 VIETNAMESE" / "AUTHENTIC CUISINE" and "11TH FRAME" / "RESTAURANT & LOUNGE" displayed across two lines.]

Separating Character Strings into Multiple Variables

Oftentimes you’ll have multiple pieces of information in one single string. That’s where the family of separate_* functions1 come in handy.

separate_longer_delim(col, delim)
separate_longer_position(col, width)
separate_wider_delim(col, delim, names)
separate_wider_position(col, widths)
12
Takes a string and splits it into multiple rows based on a specified delimiter. Tends to be most useful when the number of components varies from row to row (a small sketch follows this list).
13
A rarer use case: this also splits into multiple rows, but based on a fixed width rather than a delimiter.
14
Takes a string and splits it into many columns based on a specified delimiter. Need to provide names for the new columns created by the split.
15
Rather than a delimiter you provide a named integer vector where the name gives the name of the new column, and the value is the number of characters it occupies.
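To make the first of these concrete, here is a minimal sketch of separate_longer_delim() on a small made-up tibble (not the restaurant data; the tidyverse is assumed to be loaded, as elsewhere in these slides):

scores <- tibble(Name = c("CAFE A", "CAFE B"),
                 inspection_scores = c("20, 10, 0", "35, 30"))

scores |> 
  separate_longer_delim(inspection_scores, delim = ", ")  # returns five rows: one per individual score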

Example with Restaurant Data

The most common use case will be the need to split a character string into multiple columns, which will require the separate_wider_* functions1.

restaurants |> 
  select(`Inspection Date`) |>
  separate_wider_delim(`Inspection Date`, 
                       delim = "/", 
                       names = c("month", "day", "year"))
16
This variable was read in as a character string rather than a date object.
> # A tibble: 256,681 × 3
>    month day   year 
>    <chr> <chr> <chr>
>  1 03    02    2023 
>  2 03    02    2023 
>  3 08    31    2022 
>  4 08    31    2022 
>  5 01    13    2022 
>  6 01    06    2021 
>  7 06    22    2023 
>  8 03    01    2023 
>  9 03    01    2023 
> 10 07    13    2022 
> # ℹ 256,671 more rows

separate_wider_* functions

The nice thing about this set of functions is that they have a built-in debugging method for instances when some rows don’t have the expected number of pieces.

restaurants |> 
  select(Address) |> 
  separate_wider_delim(Address, 
                       delim = " ", 
                       names = c("num", "name", "type")) 
> Error in `separate_wider_delim()`:
> ! Expected 3 pieces in each element of `Address`.
> ! 792 values were too short.
> ℹ Use `too_few = "debug"` to diagnose the problem.
> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
> ! 223848 values were too long.
> ℹ Use `too_many = "debug"` to diagnose the problem.
> ℹ Use `too_many = "drop"/"merge"` to silence this message.

These debugging options will add 3 new variables to the data frame that begin with the name of the splitting variable with a suffix to designate the information they provide.

  • _ok is a binary TRUE/FALSE telling you if that observation split in the expected way.
  • _pieces returns the number of pieces that observation actually contains.
  • _remainder returns the additional pieces left over (if any) for that observation.

separate_wider_* functions

The nice thing about this set of functions is that they have a built-in debugging method for instances when some rows don’t have the expected number of pieces.

debug <- restaurants |> 
  select(Address) |> 
  separate_wider_delim(Address, 
                       delim = " ", 
                       names = c("num", "name", "type"), 
                       too_many = "debug",
                       too_few = "debug") 
debug[debug$Address_pieces == 4, ]
17
too_many = "drop" will drop any additional pieces and too_many = "merge" will merge them all into the final column (a quick sketch using "merge" follows the output below).
18
Example of the too_many error (Address_pieces ranged from 4 to 9 in this dataset).
> # A tibble: 172,718 × 7
>    num   name  type  Address         Address_ok Address_pieces Address_remainder
>    <chr> <chr> <chr> <chr>           <lgl>               <int> <chr>            
>  1 2746  NE    45TH  2746 NE 45TH ST FALSE                   4 " ST"            
>  2 2746  NE    45TH  2746 NE 45TH ST FALSE                   4 " ST"            
>  3 2746  NE    45TH  2746 NE 45TH ST FALSE                   4 " ST"            
>  4 2746  NE    45TH  2746 NE 45TH ST FALSE                   4 " ST"            
>  5 2746  NE    45TH  2746 NE 45TH ST FALSE                   4 " ST"            
>  6 2746  NE    45TH  2746 NE 45TH ST FALSE                   4 " ST"            
>  7 1906  N     34TH  1906 N 34TH ST  FALSE                   4 " ST"            
>  8 1906  N     34TH  1906 N 34TH ST  FALSE                   4 " ST"            
>  9 1906  N     34TH  1906 N 34TH ST  FALSE                   4 " ST"            
> 10 1906  N     34TH  1906 N 34TH ST  FALSE                   4 " ST"            
> # ℹ 172,708 more rows
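As a quick sketch of the too_many = "merge" option described above (the column name rest is just illustrative; too_few = "align_start" is included so the 792 short addresses don’t trigger an error):

restaurants |> 
  select(Address) |> 
  separate_wider_delim(Address, 
                       delim = " ", 
                       names = c("num", "name", "rest"), 
                       too_few = "align_start",  # short addresses get NA instead of an error
                       too_many = "merge")       # any extra pieces are merged into the final column, rest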

separate_wider_* functions

The nice thing about this set of functions is that they have a built-in debugging method for instances when some rows don’t have the expected number of pieces.

debug <- restaurants |> 
  select(Address) |> 
  separate_wider_delim(Address, 
                       delim = " ", 
                       names = c("num", "name", "type"),
                       too_many = "debug",
                       too_few = "debug")
debug[debug$Address_pieces == 2, ]
19
too_few = "align_start" and too_few = "align_end" will add NAs to the missing pieces depending on where they should go.
20
Example of the too_few error.
> # A tibble: 792 × 7
>    num   name     type  Address      Address_ok Address_pieces Address_remainder
>    <chr> <chr>    <chr> <chr>        <lgl>               <int> <chr>            
>  1 1401  BROADWAY <NA>  1401 BROADW… FALSE                   2 ""               
>  2 1401  BROADWAY <NA>  1401 BROADW… FALSE                   2 ""               
>  3 1401  BROADWAY <NA>  1401 BROADW… FALSE                   2 ""               
>  4 1401  BROADWAY <NA>  1401 BROADW… FALSE                   2 ""               
>  5 1401  BROADWAY <NA>  1401 BROADW… FALSE                   2 ""               
>  6 1401  BROADWAY <NA>  1401 BROADW… FALSE                   2 ""               
>  7 1401  BROADWAY <NA>  1401 BROADW… FALSE                   2 ""               
>  8 1401  BROADWAY <NA>  1401 BROADW… FALSE                   2 ""               
>  9 1401  BROADWAY <NA>  1401 BROADW… FALSE                   2 ""               
> 10 1401  BROADWAY <NA>  1401 BROADW… FALSE                   2 ""               
> # ℹ 782 more rows

Modifying Strings: Converting Cases

str_to_upper(), str_to_lower(), str_to_title() convert cases, which is often a good idea to do before searching for values:

unique_cities <- unique(restaurants$City)
unique_cities  |> 
  head()
> [1] "SEATTLE" "REDMOND" "KENT"    "KENMORE" "Seattle" "RENTON"
str_to_upper(unique_cities) |> 
  head()
> [1] "SEATTLE" "REDMOND" "KENT"    "KENMORE" "SEATTLE" "RENTON"
str_to_lower(unique_cities) |> 
  head()
> [1] "seattle" "redmond" "kent"    "kenmore" "seattle" "renton"
str_to_title(unique_cities) |> 
  head()
> [1] "Seattle" "Redmond" "Kent"    "Kenmore" "Seattle" "Renton"

Modifying Strings: Removing Whitespace

Extra leading or trailing whitespace is common in text data:

unique_names <- unique(restaurants$Name)
unique_names |> head(3)
> [1] "#807 TUTTA BELLA" "+MAS CAFE "       "?al?al Cafe"

We can remove the white space using str_trim():

str_trim(unique_names) |> head(3)
> [1] "#807 TUTTA BELLA" "+MAS CAFE"        "?al?al Cafe"

Counting Characters

At the most basic level you can use str_length() to count the number of characters in a string.

phone_numbers <- restaurants |>
  select(`Phone`) |> 
  mutate(phone_length = str_length(`Phone`))

phone_numbers |> count(phone_length)
21
Getting the length of Phone
22
Getting the count of different lengths for Phone found in the data
> # A tibble: 4 × 2
>   phone_length      n
>          <int>  <int>
> 1           14 185717
> 2           15    155
> 3           18     49
> 4           NA  70760
phone_numbers |> 
  filter(phone_length %in% c(15, 18)) |> 
  slice_head(n = 1, by = phone_length)
23
Filtering for the two abnormal phone number lengths and getting the first observation (row) for each of the two lengths (15 and 18).
> # A tibble: 2 × 2
>   Phone              phone_length
>   <chr>                     <int>
> 1 (714) 670-=5051              15
> 2 (822) 370-0EXT3700           18

Subsetting Strings

If we want to subset a string we can use str_sub(). Let’s pull out just the area codes from the Phone variable.

restaurants |> 
  select(`Phone`) |> 
  mutate(area_code = str_sub(`Phone`, start = 2, end = 4)) |>
  distinct(area_code)
24
start and end are the positions where the “substring” should start and end (inclusive). You can also use negative values to count backwards from the end of a string. Note that str_sub() won’t fail if the string is too short: it will just return as much as possible. A short demo of both behaviors follows the output below.
> # A tibble: 209 × 1
>    area_code
>    <chr>    
>  1 206      
>  2 952      
>  3 758      
>  4 425      
>  5 702      
>  6 509      
>  7 512      
>  8 <NA>     
>  9 760      
> 10 801      
> # ℹ 199 more rows
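A short demo of those last two points (toy strings, not the restaurant data):

str_sub("(206) 722-6400", start = -4)   # "6400": negative positions count back from the end
str_sub("abc", start = 2, end = 10)     # "bc": a too-short string just returns what it can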

Working with Non-English Strings

Computer infrastructure is heavily biased towards English speakers so there are some things to be aware of if you’re interested in working with character data in a different language.


Encoding

  • UTF-8 can encode just about every character used by humans today and many extra symbols like emojis.
    • readr uses UTF-8 everywhere. This is a good default but will fail for data produced by older systems that don’t use UTF-8.
  • To read these correctly, you specify the encoding via the locale argument (hopefully that information is provided in the data documentation).
    • Unfortunately, that’s rarely the case, so readr provides guess_encoding() to help you figure it out. It’s not foolproof and works better when you have lots of text.
  • Learn more about the intricacies of encoding here.
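A minimal sketch of that workflow, assuming a hypothetical file legacy.csv produced by an older, non-UTF-8 system (the encoding name used here is just illustrative; readr is loaded with the tidyverse):

guess_encoding("legacy.csv")                        # lists likely encodings with a confidence score
read_csv("legacy.csv", 
         locale = locale(encoding = "ISO-8859-1"))  # re-read the file using the guessed encoding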

Working with Non-English Strings

Computer infrastructure is heavily biased towards English speakers so there are some things to be aware of if you’re interested in working with character data in a different language.


Letter Variations

  • Accented letters may be either 1 character or 2 depending upon how they’re encoded, which affects position for str_length() and str_sub().
  • str_equal() will recognize that the different variations have the same appearance while == will evaluate them as different.
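A minimal sketch of this issue: “é” can be stored as a single code point or as an “e” plus a combining accent, and the two print identically:

u <- "\u00e9"     # é as one character
v <- "e\u0301"    # e followed by a combining acute accent

str_length(c(u, v))   # 1 2
u == v                # FALSE: the underlying representations differ
str_equal(u, v)       # TRUE: compares what the strings look like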

Working with Non-English Strings

Computer infrastructure is heavily biased towards English speakers so there are some things to be aware of if you’re interested in working with character data in a different language.


Locale-Dependent Functions

  • A locale is similar to a language but includes an optional region specifier to handle regional variations within a language1.
  • Base R string functions automatically use the locale set by your operating system, which means they do what you expect for your language.
    • However, your code might work differently if you share it with someone who lives in a different country.
    • To avoid this problem, stringr defaults to English rules by using the “en” locale and requires you to specify the locale argument to override it.
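A minimal sketch of why this matters: Turkish distinguishes dotted and dotless i, so upper-casing “i” depends on the locale.

str_to_upper("i")                  # "I"  (stringr's default "en" locale)
str_to_upper("i", locale = "tr")   # "İ"  (Turkish locale: dotted capital I)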

Pattern Matching &
Regular Expressions

Pattern-Matching!

It’s common to want to see if a string satisfies a certain pattern.

We did this with numeric values earlier in this course!

restaurants |>  
  filter(`Inspection Score` < 10 | `Inspection Score` > 150)
> # A tibble: 139,744 × 22
>    Name         `Program Identifier` `Inspection Date` Description Address City 
>    <chr>        <chr>                <chr>             <chr>       <chr>   <chr>
>  1 "#807 TUTTA… #807 TUTTA BELLA     01/13/2022        Seating 0-… 2746 N… SEAT…
>  2 "#807 TUTTA… #807 TUTTA BELLA     01/06/2021        Seating 0-… 2746 N… SEAT…
>  3 "+MAS CAFE " +MAS CAFE            06/22/2023        Seating 0-… 1906 N… SEAT…
>  4 "+MAS CAFE " +MAS CAFE            07/13/2022        Seating 0-… 1906 N… SEAT…
>  5 "+MAS CAFE " +MAS CAFE            12/29/2021        Seating 0-… 1906 N… SEAT…
>  6 "+MAS CAFE " +MAS CAFE            07/29/2020        Seating 0-… 1906 N… SEAT…
>  7 "?al?al Caf… ?al?al Cafe          03/16/2023        Seating 13… 122 2N… SEAT…
>  8 "?al?al Caf… ?al?al Cafe          01/11/2023        Seating 13… 122 2N… SEAT…
>  9 "100 LB CLA… 100 LB CLAM          09/13/2023        Seating 0-… 1001 F… SEAT…
> 10 "100 LB CLA… 100 LB CLAM          07/13/2022        Seating 0-… 1001 F… SEAT…
> # ℹ 139,734 more rows
> # ℹ 16 more variables: `Zip Code` <dbl>, Phone <chr>, Longitude <dbl>,
> #   Latitude <dbl>, `Inspection Business Name` <chr>, `Inspection Type` <chr>,
> #   `Inspection Score` <dbl>, `Inspection Result` <chr>,
> #   `Inspection Closed Business` <lgl>, `Violation Type` <chr>,
> #   `Violation Description` <chr>, `Violation Points` <dbl>, Business_ID <chr>,
> #   Inspection_Serial_Num <chr>, Violation_Record_ID <chr>, Grade <dbl>

Patterns: str_detect()

We can do similar pattern-checking using str_detect():

str_detect(string, pattern)
1
string is the character string (or vector of strings) we want to examine and pattern is the pattern that we’re checking for, inside string. The output will be a TRUE/FALSE vector indicating if pattern was found.


restaurants |> 
  select(Name, Address) |> 
  filter(str_detect(Address, "Pike")) |> 
  distinct()
> # A tibble: 5 × 2
>   Name                           Address       
>   <chr>                          <chr>         
> 1 Axum Foods DBA Lands Of Origin 1532 Pike PL  
> 2 CHA CHA LOUNGE                 1013 E Pike ST
> 3 Kitchen and Market             1926 Pike PL  
> 4 Luke's Lobster                 104 Pike ST   
> 5 SAM'S TAVERN                   1024 E Pike ST


Hmmm…there are only 5 restaurants on a street with Pike in the name?!

Patterns: str_detect()

We can do similar pattern-checking using str_detect():

str_detect(string, pattern)
1
string is the character string (or vector of strings) we want to examine and pattern is the pattern that we’re checking for, inside string. The output will be a TRUE/FALSE vector indicating if pattern was found.


restaurants |> 
  select(Name, Address) |> 
  mutate(Address = str_to_title(Address)) |>
  filter(str_detect(Address, "Pike")) |> 
  distinct()
2
Note: Results are case-sensitive!! Therefore we need to transform all the addresses to the same case.
> # A tibble: 139 × 2
>    Name                                       Address           
>    <chr>                                      <chr>             
>  1 ALDER & ASH                                629 Pike St       
>  2 ALIBI ROOM, THE                            85 Pike St        
>  3 AMAZON RETAIL LLC                          610 E Pike St     
>  4 ATHENIAN INN                               1517 Pike Pl      
>  5 ATRIUM KITCHEN AT PIKE PLACE MARKET        93 Pike St Ste 101
>  6 AUDACITY WINEBAR ALEXANDRIA NICOLE CELLARS 800 Pike St       
>  7 Axum Foods DBA Lands Of Origin             1532 Pike Pl      
>  8 AYUTTHAYA THAI RESTAURANT                  727 E Pike St     
>  9 BAGELBOP                                   93 Pike St        
> 10 BAI TONG THAI STREET CAFE                  1121 E Pike St    
> # ℹ 129 more rows

Replacement: str_replace()

What about if you want to replace a string with something else? Use str_replace()!

This function works very similarly to str_detect(), but with one extra argument:

str_replace(string, pattern, replacement)
3
replacement is what pattern is substituted for.
restaurants |> 
  select(`Inspection Date`) |> 
  mutate(full_date = str_replace(string = `Inspection Date`, 
                                 pattern = "01/",
                                 replacement = "January "))
4
In this case, our pattern is limited since "01/" occurs both for the month and the day. This would be a good place for a regular expression.
> # A tibble: 256,681 × 2
>    `Inspection Date` full_date      
>    <chr>             <chr>          
>  1 03/02/2023        03/02/2023     
>  2 03/02/2023        03/02/2023     
>  3 08/31/2022        08/31/2022     
>  4 08/31/2022        08/31/2022     
>  5 01/13/2022        January 13/2022
>  6 01/06/2021        January 06/2021
>  7 06/22/2023        06/22/2023     
>  8 03/01/2023        03/January 2023
>  9 03/01/2023        03/January 2023
> 10 07/13/2022        07/13/2022     
> # ℹ 256,671 more rows

What are Regular Expressions?

Regular expressions1 or regexes are how we describe patterns we are looking for in text in a way that a computer can understand. We write an expression, apply it to a string input, and then can do things with matches we find.

  • Literal characters are defined snippets to search for like Pike or 01/.
  • Metacharacters2 let us be flexible in describing patterns. Some basic types of metacharacters are listed below.
    • Quantifiers control how many times a pattern can match
      • ? makes a pattern optional (i.e. it matches 0 or 1 times)
      • + lets a pattern repeat (i.e. it matches at least once)
      • * lets a pattern be optional or repeat (i.e. it matches any number of times, including 0)
      • {n} matches exactly n times, {n,} matches at least n times, {n,m} matches between n and m times
    • Character classes are defined by [] and let you match a set of characters
      • . matches any character except a new line (\n)
      • - allows you to specify a range
      • You can invert the match by starting the character class with ^
    • Grouping allows you to override the default precedence rules for regular expressions
      • () also allows you to create groups which can be referenced later in the regular expression with backreferences, like \1, \2
      • Use (?:), the non-grouping parentheses, to control precedence but not capture the match in a group. This is slightly more efficient than capturing parentheses and most useful for complex cases where you need to capture matches and control precedence independently.
    • Alternation, |, allows us to pick between one or more alternative patterns
    • Anchors allow you to add specificity as to where the match occurs
      • Use ^ to anchor the start
      • Use $ to anchor the end
      • Match the boundary between words (start or end) with \b
    • Lookarounds look ahead or behind the current match without “consuming” any characters. These are useful when you want to check that a pattern exists, but you don’t want to include it in the result.
      • (?=...) is a positive look-ahead assertion. Matches if ... matches at the current input
      • (?!...) is a negative look-ahead assertion. Matches if ... does not match at the current input
      • (?<=...) is a positive look-behind assertion. Matches if ... matches text preceding the current position. Length must be bounded (i.e. no * or +)
      • (?<!...) is a negative look-behind assertion. Matches if ... does not match text preceding the current position. Length must be bounded (i.e. no * or +)

You can read more about regular expressions in stringr here and this is a useful tutorial to learn regex if you need to/when you’re ready!
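To see a few of these metacharacters in action, here is a small demonstration on toy strings (not the restaurant data):

x <- c("apple", "banana", "pear", "SEATTLE", "98105")

str_view(x, "an+")        # quantifier: an "a" followed by one or more "n"s
str_view(x, "[aeiou]")    # character class: any lowercase vowel
str_view(x, "^p|r$")      # anchors + alternation: starts with "p" or ends with "r"
str_view(x, "[0-9]{5}")   # range + quantifier: exactly five digits in a row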


Separation with regex

Let’s go back to our example and see if we can use a regular expression to replace 01/ just for the month position of our date variable.

restaurants |> 
  select(`Inspection Date`) |> 
  mutate(full_date = str_replace(string = `Inspection Date`, 
                                 pattern = "^01/",
                                 replacement = "January "))
5
We can simply use a regex anchor (the start-of-string anchor ^) to make sure our replacement only happens to the 01/s in the month position.
> # A tibble: 256,681 × 2
>    `Inspection Date` full_date      
>    <chr>             <chr>          
>  1 03/02/2023        03/02/2023     
>  2 03/02/2023        03/02/2023     
>  3 08/31/2022        08/31/2022     
>  4 08/31/2022        08/31/2022     
>  5 01/13/2022        January 13/2022
>  6 01/06/2021        January 06/2021
>  7 06/22/2023        06/22/2023     
>  8 03/01/2023        03/01/2023     
>  9 03/01/2023        03/01/2023     
> 10 07/13/2022        07/13/2022     
> # ℹ 256,671 more rows

Separation with regex

Let’s look at a more realistic example and introduce the regex version of our separate_wider_* functions. What if we wanted to separate the Description variable into two separate variables: capacity_description and risk_category?

restaurants |> 
  count(Description) |>
  print(n = 33)
6
See all distinct values that Description takes to figure out how we need to separate this character vector.
7
You can force a tibble to print more than the default 10 rows by specifying the number with print(n).
> # A tibble: 33 × 2
>    Description                                       n
>    <chr>                                         <int>
>  1 Bakery-no seating - Risk Category I              15
>  2 Bakery-no seating - Risk Category II           4396
>  3 Bakery-no seating - Risk Category III           325
>  4 Bed and Breakfast - Risk Category I              72
>  5 Caterer - Risk Category I                        53
>  6 Caterer - Risk Category II                       85
>  7 Caterer - Risk Category III                    2006
>  8 Grocery Store-no seating - Risk Category I     9752
>  9 Grocery Store-no seating - Risk Category II    2487
> 10 Limited Food Services - no permanent plumbing  1152
> 11 Meat/Sea Food - Risk Category III             14258
> 12 Mobile Food Unit - Risk Category I              735
> 13 Mobile Food Unit - Risk Category II             494
> 14 Mobile Food Unit - Risk Category III           4500
> 15 Non-Profit Institution - Risk Category I        840
> 16 Non-Profit Institution - Risk Category II       613
> 17 Non-Profit Institution - Risk Category III     6342
> 18 School Lunch Program - Risk II                13767
> 19 Seating 0-12 - Risk Category I                 4329
> 20 Seating 0-12 - Risk Category II                7102
> 21 Seating 0-12 - Risk Category III              40880
> 22 Seating 13-50 - Risk Category I                1511
> 23 Seating 13-50 - Risk Category II               7331
> 24 Seating 13-50 - Risk Category III             58592
> 25 Seating 151-250 - Risk Category I                73
> 26 Seating 151-250 - Risk Category II               26
> 27 Seating 151-250 - Risk Category III           10347
> 28 Seating 51-150 - Risk Category I                667
> 29 Seating 51-150 - Risk Category II               975
> 30 Seating 51-150 - Risk Category III            55706
> 31 Seating > 250 - Risk Category I                  69
> 32 Seating > 250 - Risk Category II                  4
> 33 Seating > 250 - Risk Category III              7177


Separation with regex

res_sep <- restaurants |> 
  distinct(Name, Description) |>
  separate_wider_regex(cols = Description,
                       patterns = c(capacity_description = "^.+",
                                    risk_category = "Risk ?(?:Category)? ?I{1,3}$")) 
8
For this example I want to limit the dataset just to the pertinent variables for illustrative purposes so I am only keeping the distinct values of Name and Description.
9
The cols argument of this function is the column you want to separate.
10
The patterns argument takes a named character vector where the names become the column names and the character strings are regular expressions that match the desired contents of the vector.
> Error in `separate_wider_regex()`:
> ! Expected each value of `Description` to match the pattern, the whole
>   pattern, and nothing but the pattern.
> ! 104 values have problems.
> ℹ Use `too_few = "debug"` to diagnose the problem.
> ℹ Use `too_few = "start"` to silence this message.


I’ve triggered the debugging error message which tells me how to diagnose/ignore the mismatch that’s occurring.

Separation with regex

res_sep <- restaurants |> 
  distinct(Name, Description) |> 
  separate_wider_regex(cols = Description, 
                       patterns = c(capacity_description = "^.+",
                                    risk_category = "Risk ?(?:Category)? ?I{1,3}$"),
                       too_few = "debug") |> 
  distinct(capacity_description, risk_category, Description_ok, 
           Description_matches, Description_remainder) |>
  print(n = 33)
11
"^" matches the beginning of a string,
"." matches any character except a new line, and "+" quantifies that ".", asking it to return 1 or more characters.
12
"Risk" matches exactly, " ?" matches a singular white space 0 or 1 time,
"(?:Category)?" optionally matches the exact word “Category”, again " ?" matches a singular white space 0 or 1 time, "I{1,3}" matches “I” 1-3 times, and "$" signifies the end of the string.
13
Using distinct() on the created and debugging variables allows us to see what didn’t match.
> # A tibble: 33 × 5
>    capacity_description         risk_category Description_ok Description_matches
>    <chr>                        <chr>         <lgl>                        <int>
>  1 "Seating 0-12 - "            Risk Categor… TRUE                             2
>  2 "Seating 13-50 - "           Risk Categor… TRUE                             2
>  3 "Seating 51-150 - "          Risk Categor… TRUE                             2
>  4 "Bakery-no seating - "       Risk Categor… TRUE                             2
>  5 "Mobile Food Unit - "        Risk Categor… TRUE                             2
>  6 "Seating > 250 - "           Risk Categor… TRUE                             2
>  7 "Seating 151-250 - "         Risk Categor… TRUE                             2
>  8 "Grocery Store-no seating -… Risk Categor… TRUE                             2
>  9 "Seating 13-50 - "           Risk Categor… TRUE                             2
> 10 "Caterer - "                 Risk Categor… TRUE                             2
> 11 "Caterer - "                 Risk Categor… TRUE                             2
> 12 "Seating 13-50 - "           Risk Categor… TRUE                             2
> 13 "Seating 0-12 - "            Risk Categor… TRUE                             2
> 14 "Meat/Sea Food - "           Risk Categor… TRUE                             2
> 15 "Bakery-no seating - "       Risk Categor… TRUE                             2
> 16 "Seating 0-12 - "            Risk Categor… TRUE                             2
> 17 "Caterer - "                 Risk Categor… TRUE                             2
> 18 "Limited Food Services - no… <NA>          FALSE                            1
> 19 "Seating 51-150 - "          Risk Categor… TRUE                             2
> 20 "Seating 51-150 - "          Risk Categor… TRUE                             2
> 21 "School Lunch Program - "    Risk II       TRUE                             2
> 22 "Mobile Food Unit - "        Risk Categor… TRUE                             2
> 23 "Mobile Food Unit - "        Risk Categor… TRUE                             2
> 24 "Non-Profit Institution - "  Risk Categor… TRUE                             2
> 25 "Grocery Store-no seating -… Risk Categor… TRUE                             2
> 26 "Bakery-no seating - "       Risk Categor… TRUE                             2
> 27 "Seating > 250 - "           Risk Categor… TRUE                             2
> 28 "Non-Profit Institution - "  Risk Categor… TRUE                             2
> 29 "Non-Profit Institution - "  Risk Categor… TRUE                             2
> 30 "Seating 151-250 - "         Risk Categor… TRUE                             2
> 31 "Seating > 250 - "           Risk Categor… TRUE                             2
> 32 "Seating 151-250 - "         Risk Categor… TRUE                             2
> 33 "Bed and Breakfast - "       Risk Categor… TRUE                             2
> # ℹ 1 more variable: Description_remainder <chr>

Separation with regex

res_sep <- restaurants |> 
  distinct(Name, Description) |> 
  separate_wider_regex(cols = Description, 
                       patterns = c(capacity_description = "^.+",
                                    risk_category = "Risk ?(?:Category)? ?I{1,3}$"), 
                       too_few = "align_start")
res_sep
14
Since the only non-match was the one without a valid value for risk_category, we can give too_few the value "align_start", which tells the function to fill in an NA for the second variable wherever it has no value.
> # A tibble: 11,209 × 3
>    Name                                capacity_description   risk_category    
>    <chr>                               <chr>                  <chr>            
>  1 "#807 TUTTA BELLA"                  "Seating 0-12 - "      Risk Category III
>  2 "+MAS CAFE "                        "Seating 0-12 - "      Risk Category III
>  3 "?al?al Cafe"                       "Seating 13-50 - "     Risk Category III
>  4 "100 LB CLAM"                       "Seating 0-12 - "      Risk Category III
>  5 "1000 SPIRITS"                      "Seating 51-150 - "    Risk Category III
>  6 "100TH AVE CAKES"                   "Bakery-no seating - " Risk Category II 
>  7 "108 VIETNAMESE AUTHENTIC  CUISINE" "Seating 51-150 - "    Risk Category III
>  8 "11TH FRAME RESTAURANT & LOUNGE"    "Seating 51-150 - "    Risk Category III
>  9 "125TH ST GRILL"                    "Seating 51-150 - "    Risk Category III
> 10 "12S TACOS MEXICAN FOOD KC1012"     "Mobile Food Unit - "  Risk Category III
> # ℹ 11,199 more rows

We can clean up these variables a bit more with a version of str_replace(): str_remove(). This technically replaces the pattern match with "", or an empty string.

Separation with regex

res_sep <- restaurants |> 
  distinct(Name, Description) |> 
  separate_wider_regex(cols = Description, 
                       patterns = c(capacity_description = "^.+",
                                    risk_category = "Risk ?(?:Category)? ?I{1,3}$"), 
                       too_few = "align_start") |>
  mutate(capacity_description = str_remove(capacity_description, pattern = " - $"),
         risk_category = str_remove(risk_category, pattern = "Risk ?(?:Category)? "))
res_sep
15
We can remove the trailing " - " by using str_remove() and providing the regular expression for that piece of the capacity_description string.
16
Since this variable is already named risk_category, we can remove that language from the beginning of each string, by matching the first part of our original regular expression for this variable.
> # A tibble: 11,209 × 3
>    Name                                capacity_description risk_category
>    <chr>                               <chr>                <chr>        
>  1 "#807 TUTTA BELLA"                  Seating 0-12         III          
>  2 "+MAS CAFE "                        Seating 0-12         III          
>  3 "?al?al Cafe"                       Seating 13-50        III          
>  4 "100 LB CLAM"                       Seating 0-12         III          
>  5 "1000 SPIRITS"                      Seating 51-150       III          
>  6 "100TH AVE CAKES"                   Bakery-no seating    II           
>  7 "108 VIETNAMESE AUTHENTIC  CUISINE" Seating 51-150       III          
>  8 "11TH FRAME RESTAURANT & LOUNGE"    Seating 51-150       III          
>  9 "125TH ST GRILL"                    Seating 51-150       III          
> 10 "12S TACOS MEXICAN FOOD KC1012"     Mobile Food Unit     III          
> # ℹ 11,199 more rows

Separation with regex

What do the final 33 distinct values of these two new variables look like?

res_sep |> 
  distinct(capacity_description, risk_category) |> 
  print(n = 33)
> # A tibble: 33 × 2
>    capacity_description                          risk_category
>    <chr>                                         <chr>        
>  1 Seating 0-12                                  III          
>  2 Seating 13-50                                 III          
>  3 Seating 51-150                                III          
>  4 Bakery-no seating                             II           
>  5 Mobile Food Unit                              III          
>  6 Seating > 250                                 III          
>  7 Seating 151-250                               III          
>  8 Grocery Store-no seating                      I            
>  9 Seating 13-50                                 II           
> 10 Caterer                                       II           
> 11 Caterer                                       III          
> 12 Seating 13-50                                 I            
> 13 Seating 0-12                                  I            
> 14 Meat/Sea Food                                 III          
> 15 Bakery-no seating                             III          
> 16 Seating 0-12                                  II           
> 17 Caterer                                       I            
> 18 Limited Food Services - no permanent plumbing <NA>         
> 19 Seating 51-150                                I            
> 20 Seating 51-150                                II           
> 21 School Lunch Program                          II           
> 22 Mobile Food Unit                              I            
> 23 Mobile Food Unit                              II           
> 24 Non-Profit Institution                        III          
> 25 Grocery Store-no seating                      II           
> 26 Bakery-no seating                             I            
> 27 Seating > 250                                 I            
> 28 Non-Profit Institution                        II           
> 29 Non-Profit Institution                        I            
> 30 Seating 151-250                               I            
> 31 Seating > 250                                 II           
> 32 Seating 151-250                               II           
> 33 Bed and Breakfast                             I


Nice!

Other Uses for Regular Expressions

Even if you aren’t explicitly manipulating/analyzing text data for your research, knowing some things about regular expressions will still come in handy because they’re used in other places, both in Base R and the tidyverse.

  • apropos(pattern)
  • list.files(path, pattern)
  • matches()
  • pivot_longer()
  • separate_*_delim()

apropos()

apropos(pattern) searches all objects available from the global environment that match the given pattern. This is useful if you can’t quite remember the name of a function, for example:

apropos("separate")
> [1] "separate"                 "separate_"               
> [3] "separate_longer_delim"    "separate_longer_position"
> [5] "separate_rows"            "separate_rows_"          
> [7] "separate_wider_delim"     "separate_wider_position" 
> [9] "separate_wider_regex"

list.files()

list.files(path, pattern) lists all files in path that match a regular expression pattern. For example, you can find all the Quarto files in the current directory with:

list.files(pattern = "\\.qmd$")
> [1] "CSSS508_Lecture7_index.qmd" "CSSS508_Lecture7.qmd"

matches()

matches(pattern) selects all variables whose names match the supplied pattern. It’s a tidyselect function (like starts_with()) that you can use in any tidyverse function that selects variables.

names(iris)
iris |> select(matches("[pt]al")) |>
  names() 
17
[pt] signifies match either p or t.
> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"

pivot_longer()

pivot_longer()’s argument names_pattern takes a vector of regular expressions, just like separate_wider_regex(). It’s useful when extracting data out of variable names with a complex structure.

names(who) |> head(n = 10)
>  [1] "country"      "iso2"         "iso3"         "year"         "new_sp_m014" 
>  [6] "new_sp_m1524" "new_sp_m2534" "new_sp_m3544" "new_sp_m4554" "new_sp_m5564"
who |> pivot_longer(cols = new_sp_m014:newrel_f65,
                    names_to = c("diagnosis", "gender", "age"), 
                    names_pattern = "new_?(.*)_(.)(.*)",
                    values_to = "count") |> 
  slice_head(n = 10)
18
"new_?(.*)_(.)(.*)" explained: new matches exactly, then _? optionally matches an underscore, (.*) matches any number of characters and in this example it captures the new diagnosis variable, _ matches exactly, (.) matches one character which captures the gender variable m or f in this example, and lastly, (.*) again matches any number of characters, in this case it captures the varying digits of the age variable.
> # A tibble: 10 × 8
>    country     iso2  iso3   year diagnosis gender age   count
>    <chr>       <chr> <chr> <dbl> <chr>     <chr>  <chr> <dbl>
>  1 Afghanistan AF    AFG    1980 sp        m      014      NA
>  2 Afghanistan AF    AFG    1980 sp        m      1524     NA
>  3 Afghanistan AF    AFG    1980 sp        m      2534     NA
>  4 Afghanistan AF    AFG    1980 sp        m      3544     NA
>  5 Afghanistan AF    AFG    1980 sp        m      4554     NA
>  6 Afghanistan AF    AFG    1980 sp        m      5564     NA
>  7 Afghanistan AF    AFG    1980 sp        m      65       NA
>  8 Afghanistan AF    AFG    1980 sp        f      014      NA
>  9 Afghanistan AF    AFG    1980 sp        f      1524     NA
> 10 Afghanistan AF    AFG    1980 sp        f      2534     NA

separate_*_delim()

The delim argument in separate_longer_delim() and separate_wider_delim() usually matches a fixed string, but you can use regex() to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. regex(", ?").
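A minimal sketch of that idea on a made-up tibble (the column names are just illustrative):

tibble(id = 1:2, tags = c("red,blue, green", "small, large")) |> 
  separate_longer_delim(tags, delim = regex(", ?"))
# splits on a comma whether or not it is followed by a space, giving one row per tag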

Base R Equivalents1

Base R                              stringr

paste(x, sep, collapse)             str_c(x, sep, collapse)
paste0(x, collapse)                 str_flatten(x, collapse)
nchar(x)                            str_length(x)
substr(x, start, end)               str_sub(x, start, end)
toupper(x)                          str_to_upper(x)
tolower(x)                          str_to_lower(x)
tools::toTitleCase(x)               str_to_title(x)
trimws(x)                           str_trim(x)
grepl(pattern, x)                   str_detect(x, pattern)
sub(pattern, replacement, x)        str_replace(x, pattern, replacement)
strwrap(x)                          str_wrap(x)

There are many other useful stringr functions/variants of the functions we used today. Check them out here.
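For instance, here is one row of the table side by side on a toy vector:

x <- c("SEATTLE", "RENTON", "KENT")

grepl("SEA", x)        # Base R:  TRUE FALSE FALSE
str_detect(x, "SEA")   # stringr: TRUE FALSE FALSE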


Lab

Strings

First, install the babynames package in your console, then run the following code to load the babynames dataset into your global environment.

library(babynames)
data(babynames) 
1
US baby names provided by the Social Security Administration. This package contains all names used for at least 5 children of either sex for 1880-2017.


  1. What is the shortest name length? What is the longest name length? Mean? Median?
  2. What is the most popular letter for a name to start with?1
  3. Pick a year between 1880 and 2017 and use either str_c() or str_glue() to create a new variable that is a sentence stating what the most popular name was for each binary sex category in that year. Bonus: Add a line break in your sentence and use str_view() to see what the new string looks like2.
  4. Optional bonus: Make a plot of the popularity of your own name/nickname over time. What year was your name most popular? Is that close to your birth year?

Answers

babynames
> # A tibble: 1,924,665 × 5
>     year sex   name          n   prop
>    <dbl> <chr> <chr>     <int>  <dbl>
>  1  1880 F     Mary       7065 0.0724
>  2  1880 F     Anna       2604 0.0267
>  3  1880 F     Emma       2003 0.0205
>  4  1880 F     Elizabeth  1939 0.0199
>  5  1880 F     Minnie     1746 0.0179
>  6  1880 F     Margaret   1578 0.0162
>  7  1880 F     Ida        1472 0.0151
>  8  1880 F     Alice      1414 0.0145
>  9  1880 F     Bertha     1320 0.0135
> 10  1880 F     Sarah      1288 0.0132
> # ℹ 1,924,655 more rows

Answers

  1. What is the shortest name length? What is the longest name length? Mean? Median?
babynames |> 
  distinct(name) |> 
  mutate(length = str_length(name)) |> 
  summarise(shortest = min(length), 
            longest = max(length), 
            mean = mean(length), 
            median = median(length))
> # A tibble: 1 × 4
>   shortest longest  mean median
>      <int>   <int> <dbl>  <dbl>
> 1        2      15  6.53      6

Answers

  2. What is the most popular letter for a name to start with?1
babynames |> 
  mutate(first = str_sub(name, 1, 1)) |> 
  count(first, wt = prop) |> 
  arrange(desc(n))
> # A tibble: 26 × 2
>    first     n
>    <chr> <dbl>
>  1 J      32.6
>  2 M      25.9
>  3 A      20.9
>  4 C      18.7
>  5 R      17.8
>  6 E      16.0
>  7 D      15.5
>  8 L      15.3
>  9 S      13.8
> 10 B      11.8
> # ℹ 16 more rows

Answers

  3. Pick a year between 1880 and 2017 and use either str_c() or str_glue() to create a new variable that is a sentence stating what the most popular name was for each binary sex category in that year. Bonus: Add a line break in your sentence and use str_view() to see what the new string looks like1.
babynames |> 
  filter(year == 1950) |> 
  mutate(sex2 = if_else(sex == "F", "girl", "boy")) |>
  slice_max(prop, by = c(sex)) |>
  mutate(Sentence = str_wrap(str_glue("The most popular name for {sex2}s in 
                                      {year} was {name}."), 
                             width = 25)) |> 
  pull(Sentence) |>
  str_view()
2
Creating a new sex2 variable for better interpretability of the final Sentence variable.
3
Getting the most popular (by proportion of all names) male and female names.
4
pull() is similar to indexing with $ in Base R but works well with pipes. This is necessary to do before str_view() which only takes a vector of values (not a column from a data frame).
> [1] │ The most popular name for
>     │ girls in 1950 was Linda.
> [2] │ The most popular name for
>     │ boys in 1950 was James.

Answers

  4. Optional bonus: Make a plot of the popularity of your own name/nickname over time. What year was your name most popular? Is that close to your birth year?
library(ggrepel)
library(ggthemes)
library(patchwork)

colors <- c("#4e79a7","#f28e2c","#e15759","#76b7b2","#59a14f","#edc949",
            "#af7aa1","#ff9da7","#9c755f","#bab0ab")

victoria_plot <- babynames |> 
  filter(name == "Victoria") |> 
  mutate(sex2 = if_else(sex == "F", "Female", "Male")) |>
  ggplot(aes(x = year, y = prop, group = name, fill = name)) +
  geom_density(stat = "identity", alpha = 0.25, color = colors[1]) + 
  geom_vline(xintercept = 1988, color = colors[2], linetype = 2) +
  geom_vline(data = babynames |>
               filter(name == "Victoria") |>
               mutate(sex2 = if_else(sex == "F", "Female", "Male")) |>
               slice_max(prop, by = sex2),
             aes(xintercept = year), color = colors[3]) +
  facet_grid(sex2 ~ .,
             scales = "free_y") +
  scale_fill_manual(values = colors[1]) +
  labs(title = 'Popularity of the name "Victoria"',
       subtitle = "1880-2017, by binary sex category",
       y = "",
       x = "") +
  theme_tufte(base_size = 16) + 
  theme(legend.position = "none", 
        strip.background = element_rect(color="black",
                                        fill= alpha(colors[10], 0.5),
                                        linetype = 0))

vic_plot <- babynames |> 
  filter(name == "Vic") |> 
  mutate(sex2 = if_else(sex == "F", "Female", "Male")) |> 
  ggplot(aes(x = year, y = prop, group = name, fill = name)) +
  geom_density(stat = "identity", alpha = 0.25, color = colors[6]) + 
  geom_vline(xintercept = 1988, color = colors[2], linetype = 2) +
  geom_vline(data = babynames |> 
               filter(name == "Vic") |> 
               mutate(sex2 = if_else(sex == "F", "Female", "Male")) |> 
               slice_max(prop, by = sex2), 
             aes(xintercept = year), color = colors[3]) + 
  facet_grid(sex2 ~ .,
             scales = "free_y") + 
  scale_fill_manual(values = colors[6]) + 
  labs(title = 'Popularity of the name "Vic"',
       y = "", 
       caption = "Note: y-axes are of different scales; 
       Orange, dashed line represents 1988; 
       Red, solid line represents most popular 
       year for that name-sex pairing.",
       x = "Year") +
  theme_tufte(base_size = 16) + 
  theme(legend.position = "none", 
        strip.background = element_rect(color="black", 
                                        fill= alpha(colors[10], 0.5), 
                                        linetype = 0))

combo_plots <- victoria_plot / vic_plot + ylab(NULL)

wrap_elements(combo_plots) +
  theme_tufte(base_size = 16) +
  labs(tag = "Proportion of all names given to U.S. newborns") +
  theme(plot.tag = element_text(size = rel(1.25), angle = 90),
        plot.tag.position = "left")
5
For labels that don’t overlap.
6
For extra built-in themes.
7
Allows distinct plots to be put together into one visualization.
8
Creating an alternative sex variable for facet visualization purposes.
9
Vertical line for birth year.
10
Vertical line for most popular year for that name/nickname.
11
Facetting by sex2 and allowing the y-axis to vary based on facet value.
12
Applying desired colors.
13
Leaving axes blank for final patchwork labelling.
14
Specifying color for facet labels.
15
Adding note and x-axis text since this plot will be at the bottom of the overall visualization.
16
Creating an object for the patchwork visualization.
17
Putting together the two separate plots.
18
Creating and plotting a y axis that spans both plots.

Answers

Example using regular expressions:

nicknames <- babynames |> 
  mutate(nickname = case_when(str_detect(name, pattern = "^Vi.{2}oria$") ~ "Victoria",
                              str_detect(name, pattern = "^Vi.{2}or$") ~ "Victor",
                              str_detect(name, pattern = "^Vi[ck]{1,2}$") ~ "Vic",
                              str_detect(name, pattern = "^Tor[riey]*$") ~ "Tori",
                              str_detect(name, pattern = "^Vi[ck]+[iey]*$") ~ "Vicky",
                               .default = NA)) |> 
  filter(!is.na(nickname)) |>
  mutate(prop2 = sum(prop),
         .by = c(year, nickname, sex)) |>
  distinct(year, nickname, prop2, sex) |> 
  mutate(sex2 = if_else(sex == "F", "Female", "Male"),
         nickname = fct(nickname, levels = c("Victoria", "Victor", "Vicky", "Tori", "Vic")))

my_names <- nicknames |> 
  ggplot(aes(x = year, y = prop2, fill = nickname, group = nickname)) +
  geom_density(aes(color = nickname), stat = "identity", alpha = 0.15) +
  geom_vline(xintercept = 1988, color = colors[4], linetype = 2) +
  scale_fill_manual(values = colors[c(1:3, 5:7)]) +
  scale_color_manual(values = colors[c(1:3, 5:7)]) + 
  facet_grid(sex2 ~ .,
             scales = "free_y") + 
  geom_label_repel(data = nicknames |> slice_max(prop2, by = c(sex2, nickname)),
                   aes(label = nickname), stat = "identity") + 
  labs(title = 'Popularity of all nicknames for "Victoria" (including all spelling variants)',
       caption = "Note: y-axes are of different scales; Teal, dashed line represents 1988",
       subtitle = "1880-2017, by binary sex category",
       y = "Proportion of all names given to U.S. newborns", 
       x = "Year") +
  theme_tufte(base_size = 16) + 
  theme(legend.position = "none", 
        strip.background = element_rect(color="black", fill= alpha(colors[10], 0.5), linetype = 0))
my_names
19
Creating a new variable that finds all spelling variations of “Victoria” and its most common derivatives using regular expressions.
20
Removing all names that don’t match any of the versions of “Victoria” or its nicknames.
21
Calculating a new proportion that collapses all spelling variations into the most common variant.
22
Creating an alternative sex variable for facet visualization purposes.
23
Putting names in the order I want to assign for colors.
24
Picking the specific colors I want to assign to the 5 names.

Homework