Working with Text Data
CS&SS 508 • Lecture 7
7 May 2024
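Victoria Sass
A string is a sequence of characters, such as "Hello! My name is Vic" or "%*$#01234". You can create a string with either single quotes (' ') or double quotes (" "). By convention we use " ", unless the string contains multiple " ".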
We use a lot of different symbols in our code that we might actually want to represent within a string itself. To do that, we need to escape that particular character, which we do with a backslash (\).
For instance, if we want to include a literal single or double quote in our string, we’d escape it by writing:
Note: When you print these objects you’ll see the escape characters. To actually view the string’s contents (and not the syntax needed to construct it), use str_view().
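The escaping rules above can be sketched as follows. The example strings and object names are illustrative assumptions, not taken from the lecture.

```r
library(stringr)

# Escape a quote with a backslash when it matches the string's own delimiter.
single_quote <- 'It\'s a string'
double_quote <- "She said \"hello\""

# A literal backslash must itself be escaped.
backslash <- "C:\\temp"

# print() shows the syntax needed to build the strings (escapes included);
# str_view() shows the contents the strings actually hold.
print(c(single_quote, double_quote, backslash))
str_view(c(single_quote, double_quote, backslash))
```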
There are other things you may want to represent inside a character string, such as a new line, or a tab space.
str_view("Sometimes you need\nto create another line.")
str_view("\tOther times you just need to indent somewhere.")
Use \n to create a new line. This is helpful when plotting if you have variable names or values that are wordy! If you need to do this for one or more variables you can use str_wrap() and specify the character width you desire.
Use \t to add a tab. str_view() will highlight tabs in blue in your console to make them stand out from other random whitespace.
> [1] │ Sometimes you need
> │ to create another line.
> [1] │ {\t}Other times you just need to indent somewhere.
Additionally, you can represent Unicode characters, which are written with the \u or \U escape.
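A minimal sketch of the Unicode escapes. The two code points (a micro sign and an emoji) are arbitrary choices for illustration, not examples from the lecture.

```r
library(stringr)

# \u is followed by up to 4 hex digits, \U by up to 8.
unicode_examples <- c("\u00b5", "\U0001f604")

# Each escape produces a single character, however many digits it took to write.
str_length(unicode_examples)
str_view(unicode_examples)
```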
Today we’ll study real data on food safety inspections in King County, collected from data.kingcounty.gov.
Note these data are fairly large in their native .csv format. The following code can be used to download the data directly from my GitHub page as a smaller .Rdata object:
> Rows: 256,681
> Columns: 22
> $ Name <chr> "#807 TUTTA BELLA", "#807 TUTTA BELLA", "…
> $ `Program Identifier` <chr> "#807 TUTTA BELLA", "#807 TUTTA BELLA", "…
> $ `Inspection Date` <chr> "03/02/2023", "03/02/2023", "08/31/2022",…
> $ Description <chr> "Seating 0-12 - Risk Category III", "Seat…
> $ Address <chr> "2746 NE 45TH ST", "2746 NE 45TH ST", "27…
> $ City <chr> "SEATTLE", "SEATTLE", "SEATTLE", "SEATTLE…
> $ `Zip Code` <dbl> 98105, 98105, 98105, 98105, 98105, 98105,…
> $ Phone <chr> "(206) 722-6400", "(206) 722-6400", "(206…
> $ Longitude <dbl> -122.2964, -122.2964, -122.2964, -122.296…
> $ Latitude <dbl> 47.66231, 47.66231, 47.66231, 47.66231, 4…
> $ `Inspection Business Name` <chr> "#807 TUTTA BELLA", "#807 TUTTA BELLA", "…
> $ `Inspection Type` <chr> "Routine Inspection/Field Review", "Routi…
> $ `Inspection Score` <dbl> 20, 20, 10, 10, 0, 0, 0, 30, 30, 0, 47, 4…
> $ `Inspection Result` <chr> "Unsatisfactory", "Unsatisfactory", "Unsa…
> $ `Inspection Closed Business` <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
> $ `Violation Type` <chr> "RED", "RED", "BLUE", "RED", NA, NA, NA, …
> $ `Violation Description` <chr> "1300 - Food contact surfaces cleaned and…
> $ `Violation Points` <dbl> 15, 5, 5, 5, 0, 0, 0, 5, 25, 0, 5, 2, 5, …
> $ Business_ID <chr> "PR0089260", "PR0089260", "PR0089260", "P…
> $ Inspection_Serial_Num <chr> "DAJ5DTHLV", "DAJ5DTHLV", "DAEEWQC0L", "D…
> $ Violation_Record_ID <chr> "IVBTPZO0B", "IV5GOME67", "IVQ7QYW2V", "I…
> $ Grade <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,…
Good Questions to Ask
You can create strings based on the value of other strings with str_c() (string combine), which takes any number of vectors and returns a character vector.
str_c(c("CSSS", "STAT", "SOC"), 508)
str_c(c("CSSS", "STAT", "SOC"), 508, sep = " ")
str_c(c("CSSS", "STAT", "SOC"), 508, sep = " ", collapse = ", ")
By default, str_c() doesn’t put a space between the vectors it is combining.
You can add a separator between them with the sep argument.
You can return a single string by using collapse.
> [1] "CSSS508" "STAT508" "SOC508"
> [1] "CSSS 508" "STAT 508" "SOC 508"
> [1] "CSSS 508, STAT 508, SOC 508"
restaurants |>
select(Name, Address, City) |>
distinct() |>
mutate(Sentence = str_c(Name, " is located at ", Address, " in ", City, "."),
.keep = "none")
If we used the sep argument here it would add a space before the period at the end of the sentence. So instead, we can add the spaces directly where we want them.
We use .keep = "none" here in order to see just the results of our mutate.
> # A tibble: 10,969 × 1
> Sentence
> <chr>
> 1 #807 TUTTA BELLA is located at 2746 NE 45TH ST in SEATTLE.
> 2 +MAS CAFE is located at 1906 N 34TH ST in SEATTLE.
> 3 ?al?al Cafe is located at 122 2ND AVE S in SEATTLE.
> 4 100 LB CLAM is located at 1001 FAIRVIEW AVE N Unit 1700A in SEATTLE.
> 5 1000 SPIRITS is located at 1225 1ST AVE in SEATTLE.
> 6 100TH AVE CAKES is located at 15364 NE 96TH PL in REDMOND.
> 7 108 VIETNAMESE AUTHENTIC CUISINE is located at 18114 E VALLEY HWY in KENT.
> 8 11TH FRAME RESTAURANT & LOUNGE is located at 7638 NE BOTHELL WAY in KENMORE.
> 9 125TH ST GRILL is located at 12255 AURORA AVE N in Seattle.
> 10 12S TACOS MEXICAN FOOD KC1012 is located at 625 S 4TH ST in RENTON.
> # ℹ 10,959 more rows
As we saw in the previous example, when you’re mixing many fixed and variable strings with str_c() things can get overwhelmed by quotation marks pretty easily. An alternative with simpler syntax is str_glue(), in which anything inside {} will be evaluated like it’s outside the quotes.
> # A tibble: 10,969 × 1
> Sentence
> <glue>
> 1 #807 TUTTA BELLA is located at 2746 NE 45TH ST in SEATTLE.
> 2 +MAS CAFE is located at 1906 N 34TH ST in SEATTLE.
> 3 ?al?al Cafe is located at 122 2ND AVE S in SEATTLE.
> 4 100 LB CLAM is located at 1001 FAIRVIEW AVE N Unit 1700A in SEATTLE.
> 5 1000 SPIRITS is located at 1225 1ST AVE in SEATTLE.
> 6 100TH AVE CAKES is located at 15364 NE 96TH PL in REDMOND.
> 7 108 VIETNAMESE AUTHENTIC CUISINE is located at 18114 E VALLEY HWY in KENT.
> 8 11TH FRAME RESTAURANT & LOUNGE is located at 7638 NE BOTHELL WAY in KENMORE.
> 9 125TH ST GRILL is located at 12255 AURORA AVE N in Seattle.
> 10 12S TACOS MEXICAN FOOD KC1012 is located at 625 S 4TH ST in RENTON.
> # ℹ 10,959 more rows
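The str_glue() version of the sentence-building step might look like the sketch below. It uses a small made-up tibble (`toy`) in place of the full restaurants data, so the values are illustrative only.

```r
library(dplyr)
library(stringr)

# Hypothetical rows standing in for the restaurants data.
toy <- tibble(
  Name = c("ATHENIAN INN", "BAGELBOP"),
  Address = c("1517 PIKE PL", "93 PIKE ST"),
  City = c("SEATTLE", "SEATTLE")
)

# Anything inside {} is evaluated as code; everything else is literal text.
glued <- toy |>
  mutate(Sentence = str_glue("{Name} is located at {Address} in {City}."),
         .keep = "none")
glued
```

Compared with str_c(), the fixed text and the variables sit in one string, so there are no extra quotation marks or commas to keep track of.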
If you want to create a summary of certain character strings you can use str_flatten(), which takes a character vector and combines each element of the vector into a single string.
restaurants |>
select(Name, `Inspection Score`) |>
summarize(inspection_scores = str_flatten(`Inspection Score`, collapse = ", "),
.by = Name)
Backticks are needed around a variable name that contains spaces so that R knows it is a singular object name.
> # A tibble: 9,878 × 2
> Name inspection_scores
> <chr> <chr>
> 1 "#807 TUTTA BELLA" 20, 20, 10, 10, 0, 0
> 2 "+MAS CAFE " 0, 30, 30, 0, 47, 47, 47, 47, 47, 0, 0
> 3 "?al?al Cafe" 0, 0
> 4 "100 LB CLAM" 0, 0, 0, 25, 25, 25, 25, 0, 0
> 5 "1000 SPIRITS" 0, 5, 0, 5, 0, 5, 0, 5, 0, 32, 32, 32, 2…
> 6 "100TH AVE CAKES" 0, 0, 0, 0
> 7 "108 VIETNAMESE AUTHENTIC CUISINE" 35, 35, 35, 30, 30, 15, 15
> 8 "11TH FRAME RESTAURANT & LOUNGE" 20, 20, 0, 10, 10, 5, 0, 30, 30, 18, 18,…
> 9 "125TH ST GRILL" 0, 20, 20, 20, 0, 20, 20, 20, 18, 18, 18…
> 10 "12S TACOS MEXICAN FOOD KC1012" <NA>
> # ℹ 9,868 more rows
What if we want to plot one of the variables in our dataset but many of its values are too long and it’d be too arduous to manually add \n to every long value? There’s str_wrap()!
> # A tibble: 9,873 × 1
> Name
> <chr>
> 1 "#807 TUTTA BELLA"
> 2 "+MAS CAFE"
> 3 "?al?al Cafe"
> 4 "100 LB CLAM"
> 5 "1000 SPIRITS"
> 6 "100TH AVE CAKES"
> 7 "108 VIETNAMESE\nAUTHENTIC CUISINE"
> 8 "11TH FRAME\nRESTAURANT & LOUNGE"
> 9 "125TH ST GRILL"
> 10 "12S TACOS MEXICAN\nFOOD KC1012"
> # ℹ 9,863 more rows
Name |
---|
#807 TUTTA BELLA |
+MAS CAFE |
?al?al Cafe |
100 LB CLAM |
1000 SPIRITS |
100TH AVE CAKES |
108 VIETNAMESE |
11TH FRAME |
125TH ST GRILL |
12S TACOS MEXICAN |
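A minimal sketch of str_wrap() on a single long name. The width of 20 is an assumption chosen for illustration; the lecture's own width is not shown in the text.

```r
library(stringr)

long_name <- "108 VIETNAMESE AUTHENTIC CUISINE"

# str_wrap() inserts \n at word boundaries so no line exceeds the target width.
wrapped <- str_wrap(long_name, width = 20)

# str_view() displays the line break instead of the \n escape.
str_view(wrapped)
```

Inside a pipeline the same call would go in mutate(), so the wrapped values can be used directly as plot labels.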
Oftentimes you’ll have multiple pieces of information in one single string. That’s where the family of separate_* functions comes in handy.
separate_longer_delim(col, delim)
separate_longer_position(col, width)
separate_wider_delim(col, delim, names)
separate_wider_position(col, widths)
The most common use case will be the need to split a character string into multiple columns, which will require the separate_wider_* functions.
restaurants |>
select(`Inspection Date`) |>
separate_wider_delim(`Inspection Date`,
delim = "/",
names = c("month", "day", "year"))
> # A tibble: 256,681 × 3
> month day year
> <chr> <chr> <chr>
> 1 03 02 2023
> 2 03 02 2023
> 3 08 31 2022
> 4 08 31 2022
> 5 01 13 2022
> 6 01 06 2021
> 7 06 22 2023
> 8 03 01 2023
> 9 03 01 2023
> 10 07 13 2022
> # ℹ 256,671 more rows
separate_wider_* functions
The nice thing about this set of functions is that they have a built-in debugging method for instances when some rows don’t have the expected number of pieces.
> Error in `separate_wider_delim()`:
> ! Expected 3 pieces in each element of `Address`.
> ! 792 values were too short.
> ℹ Use `too_few = "debug"` to diagnose the problem.
> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
> ! 223848 values were too long.
> ℹ Use `too_many = "debug"` to diagnose the problem.
> ℹ Use `too_many = "drop"/"merge"` to silence this message.
These debugging options will add 3 new variables to the data frame that begin with the name of the splitting variable with a suffix to designate the information they provide.
- _ok is a binary TRUE/FALSE telling you if that observation split in the expected way.
- _pieces returns the number of pieces that observation actually contains.
- _remainder returns the additional pieces left over (if any) for that observation.
debug <- restaurants |>
select(Address) |>
separate_wider_delim(Address,
delim = " ",
names = c("num", "name", "type"),
too_many = "debug",
too_few = "debug")
debug[debug$Address_pieces == 4, ]
too_many = "drop" will drop any additional pieces and too_many = "merge" will merge them all into the final column.
Here we look at the rows with 4 pieces to see an example of the too_many error (Address_pieces ranged from 4 to 9 in this dataset).
> # A tibble: 172,718 × 7
> num name type Address Address_ok Address_pieces Address_remainder
> <chr> <chr> <chr> <chr> <lgl> <int> <chr>
> 1 2746 NE 45TH 2746 NE 45TH ST FALSE 4 " ST"
> 2 2746 NE 45TH 2746 NE 45TH ST FALSE 4 " ST"
> 3 2746 NE 45TH 2746 NE 45TH ST FALSE 4 " ST"
> 4 2746 NE 45TH 2746 NE 45TH ST FALSE 4 " ST"
> 5 2746 NE 45TH 2746 NE 45TH ST FALSE 4 " ST"
> 6 2746 NE 45TH 2746 NE 45TH ST FALSE 4 " ST"
> 7 1906 N 34TH 1906 N 34TH ST FALSE 4 " ST"
> 8 1906 N 34TH 1906 N 34TH ST FALSE 4 " ST"
> 9 1906 N 34TH 1906 N 34TH ST FALSE 4 " ST"
> 10 1906 N 34TH 1906 N 34TH ST FALSE 4 " ST"
> # ℹ 172,708 more rows
debug <- restaurants |>
select(Address) |>
separate_wider_delim(Address,
delim = " ",
names = c("num", "name", "type"),
too_many = "debug",
too_few = "debug")
debug[debug$Address_pieces == 2, ]
too_few = "align_start" and too_few = "align_end" will add NAs to the missing pieces depending on where they should go.
Here we look at the rows with 2 pieces to see an example of the too_few error.
> # A tibble: 792 × 7
> num name type Address Address_ok Address_pieces Address_remainder
> <chr> <chr> <chr> <chr> <lgl> <int> <chr>
> 1 1401 BROADWAY <NA> 1401 BROADW… FALSE 2 ""
> 2 1401 BROADWAY <NA> 1401 BROADW… FALSE 2 ""
> 3 1401 BROADWAY <NA> 1401 BROADW… FALSE 2 ""
> 4 1401 BROADWAY <NA> 1401 BROADW… FALSE 2 ""
> 5 1401 BROADWAY <NA> 1401 BROADW… FALSE 2 ""
> 6 1401 BROADWAY <NA> 1401 BROADW… FALSE 2 ""
> 7 1401 BROADWAY <NA> 1401 BROADW… FALSE 2 ""
> 8 1401 BROADWAY <NA> 1401 BROADW… FALSE 2 ""
> 9 1401 BROADWAY <NA> 1401 BROADW… FALSE 2 ""
> 10 1401 BROADWAY <NA> 1401 BROADW… FALSE 2 ""
> # ℹ 782 more rows
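Once the debugging output has shown what is going wrong, the non-debug options can be used to resolve it. The sketch below applies too_few = "align_start" and too_many = "merge" to three made-up addresses (one with the expected 3 pieces, one too long, one too short). The `toy` data and the choice of options are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical addresses with 3, 4, and 2 space-separated pieces.
toy <- tibble(Address = c("85 PIKE ST", "2746 NE 45TH ST", "1401 BROADWAY"))

pieces <- toy |>
  separate_wider_delim(Address,
                       delim = " ",
                       names = c("num", "name", "type"),
                       too_few = "align_start",  # pad missing pieces with NA at the end
                       too_many = "merge")       # fold extra pieces into the last column
pieces
```

Whether merging or dropping is appropriate depends on what the extra pieces mean, which is why inspecting them with "debug" first is worthwhile.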
str_to_upper(), str_to_lower(), and str_to_title() convert cases, which is often a good idea to do before searching for values:
> [1] "SEATTLE" "REDMOND" "KENT" "KENMORE" "Seattle" "RENTON"
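A short sketch of the three case-conversion functions on a made-up vector of city names, showing how standardizing case merges values that differ only by capitalization.

```r
library(stringr)

# Hypothetical values with inconsistent capitalization.
cities <- c("SEATTLE", "Seattle", "KENT")

str_to_upper(cities)
str_to_lower(cities)
str_to_title(cities)

# After conversion, "SEATTLE" and "Seattle" count as the same value.
unique_cities <- unique(str_to_title(cities))
unique_cities
```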
Extra leading or trailing whitespace is common in text data:
> [1] "#807 TUTTA BELLA" "+MAS CAFE " "?al?al Cafe"
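One way to deal with stray whitespace is sketched below, using str_trim() and its relative str_squish() (the latter is a stringr function not otherwise covered in this lecture). The input strings are made up.

```r
library(stringr)

# Hypothetical names with trailing, leading, and repeated internal spaces.
names_raw <- c("+MAS CAFE ", "  100 LB   CLAM ")

# str_trim() removes whitespace from the start and end only.
trimmed <- str_trim(names_raw)

# str_squish() also collapses runs of internal whitespace to a single space.
squished <- str_squish(names_raw)
```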
At the most basic level you can use str_length() to count the characters in a string.
phone_numbers <- restaurants |>
select(`Phone`) |>
mutate(phone_length = str_length(`Phone`))
phone_numbers |> count(phone_length)
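str_length() returns the number of characters in each value of Phone.
count() then tallies the different lengths of Phone found in the data.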
> # A tibble: 4 × 2
> phone_length n
> <int> <int>
> 1 14 185717
> 2 15 155
> 3 18 49
> 4 NA 70760
> # A tibble: 2 × 2
> Phone phone_length
> <chr> <int>
> 1 (714) 670-=5051 15
> 2 (822) 370-0EXT3700 18
If we want to subset a string we can use str_sub(). Let’s pull out just the area codes from the Phone variable.
restaurants |>
select(`Phone`) |>
mutate(area_code = str_sub(`Phone`, start = 2, end = 4)) |>
distinct(area_code)
start and end are the positions where the “substring” should start and end (inclusive). You can also use negative values to count backwards from the end of a string. Note that str_sub() won’t fail if the string is too short: it will just return as much as possible.
> # A tibble: 209 × 1
> area_code
> <chr>
> 1 206
> 2 952
> 3 758
> 4 425
> 5 702
> 6 509
> 7 512
> 8 <NA>
> 9 760
> 10 801
> # ℹ 199 more rows
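The three behaviours of str_sub() described above (inclusive positions, negative positions, and strings that are too short) can be sketched on a single phone number taken from the data preview.

```r
library(stringr)

phone <- "(206) 722-6400"

# Positions are inclusive: characters 2 through 4.
area_code <- str_sub(phone, start = 2, end = 4)

# Negative positions count backwards from the end of the string.
last_four <- str_sub(phone, start = -4, end = -1)

# A string that is too short does not cause an error.
too_short <- str_sub("ab", start = 1, end = 5)
```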
Computer infrastructure is heavily biased towards English speakers so there are some things to be aware of if you’re interested in working with character data in a different language.
readr uses UTF-8 everywhere. This is a good default but will fail for data produced by older systems that don’t use UTF-8. To read such data correctly, specify the encoding via the locale argument (hopefully that information is provided in the data documentation).
If it isn’t, readr provides guess_encoding() to help you figure it out. It’s not foolproof and works better when you have lots of text.
Accented letters can be encoded in more than one way (as a single character, or as a base letter plus a combining mark), which affects str_length() and str_sub().
str_equal() will recognize that the different variations have the same appearance while == will evaluate them as different.
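A small sketch of the letter-variation issue, using two encodings of ü as an illustrative example: one as a single code point, and one as u followed by a combining diaeresis.

```r
library(stringr)

# The same visible letter, written two different ways.
u <- c("\u00fc", "u\u0308")

# The two versions have different lengths...
str_length(u)

# ...so == treats them as different strings,
u[[1]] == u[[2]]

# while str_equal() recognizes that they are equivalent.
str_equal(u[[1]], u[[2]])
```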
Base R string functions will automatically use the locale set by your operating system, which means that they do what you expect for your language.
stringr defaults to English rules by using the “en” locale and requires you to specify the locale argument to override it.
It’s common to want to see if a string satisfies a certain pattern.
We did this with numeric values earlier in this course!
> # A tibble: 139,744 × 22
> Name `Program Identifier` `Inspection Date` Description Address City
> <chr> <chr> <chr> <chr> <chr> <chr>
> 1 "#807 TUTTA… #807 TUTTA BELLA 01/13/2022 Seating 0-… 2746 N… SEAT…
> 2 "#807 TUTTA… #807 TUTTA BELLA 01/06/2021 Seating 0-… 2746 N… SEAT…
> 3 "+MAS CAFE " +MAS CAFE 06/22/2023 Seating 0-… 1906 N… SEAT…
> 4 "+MAS CAFE " +MAS CAFE 07/13/2022 Seating 0-… 1906 N… SEAT…
> 5 "+MAS CAFE " +MAS CAFE 12/29/2021 Seating 0-… 1906 N… SEAT…
> 6 "+MAS CAFE " +MAS CAFE 07/29/2020 Seating 0-… 1906 N… SEAT…
> 7 "?al?al Caf… ?al?al Cafe 03/16/2023 Seating 13… 122 2N… SEAT…
> 8 "?al?al Caf… ?al?al Cafe 01/11/2023 Seating 13… 122 2N… SEAT…
> 9 "100 LB CLA… 100 LB CLAM 09/13/2023 Seating 0-… 1001 F… SEAT…
> 10 "100 LB CLA… 100 LB CLAM 07/13/2022 Seating 0-… 1001 F… SEAT…
> # ℹ 139,734 more rows
> # ℹ 16 more variables: `Zip Code` <dbl>, Phone <chr>, Longitude <dbl>,
> # Latitude <dbl>, `Inspection Business Name` <chr>, `Inspection Type` <chr>,
> # `Inspection Score` <dbl>, `Inspection Result` <chr>,
> # `Inspection Closed Business` <lgl>, `Violation Type` <chr>,
> # `Violation Description` <chr>, `Violation Points` <dbl>, Business_ID <chr>,
> # Inspection_Serial_Num <chr>, Violation_Record_ID <chr>, Grade <dbl>
str_detect()
We can do similar pattern-checking using str_detect():
str_detect(string, pattern)
string is the character string (or vector of strings) we want to examine and pattern is the pattern that we’re checking for inside string. The output will be a TRUE/FALSE vector indicating if pattern was found.
> # A tibble: 5 × 2
> Name Address
> <chr> <chr>
> 1 Axum Foods DBA Lands Of Origin 1532 Pike PL
> 2 CHA CHA LOUNGE 1013 E Pike ST
> 3 Kitchen and Market 1926 Pike PL
> 4 Luke's Lobster 104 Pike ST
> 5 SAM'S TAVERN 1024 E Pike ST
Hmmm…there are only 5 restaurants on a street with Pike in the name?!
restaurants |>
select(Name, Address) |>
mutate(Address = str_to_title(Address)) |>
filter(str_detect(Address, "Pike")) |>
distinct()
> # A tibble: 139 × 2
> Name Address
> <chr> <chr>
> 1 ALDER & ASH 629 Pike St
> 2 ALIBI ROOM, THE 85 Pike St
> 3 AMAZON RETAIL LLC 610 E Pike St
> 4 ATHENIAN INN 1517 Pike Pl
> 5 ATRIUM KITCHEN AT PIKE PLACE MARKET 93 Pike St Ste 101
> 6 AUDACITY WINEBAR ALEXANDRIA NICOLE CELLARS 800 Pike St
> 7 Axum Foods DBA Lands Of Origin 1532 Pike Pl
> 8 AYUTTHAYA THAI RESTAURANT 727 E Pike St
> 9 BAGELBOP 93 Pike St
> 10 BAI TONG THAI STREET CAFE 1121 E Pike St
> # ℹ 129 more rows
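The jump from 5 matches to 139 comes down to case sensitivity, which the sketch below reproduces on three made-up addresses. The regex(ignore_case = TRUE) alternative is an addition for illustration; the lecture itself standardizes case with str_to_title().

```r
library(stringr)

# Hypothetical addresses with mixed capitalization.
addresses <- c("1532 Pike PL", "629 PIKE ST", "85 PINE ST")

# Matching is case sensitive, so "PIKE" is missed.
raw_match <- str_detect(addresses, "Pike")

# Standardizing case first catches both spellings.
title_match <- str_detect(str_to_title(addresses), "Pike")

# Alternatively, ask the pattern itself to ignore case.
nocase_match <- str_detect(addresses, regex("pike", ignore_case = TRUE))
```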
str_replace()
What about if you want to replace a string with something else? Use str_replace()!
This function works very similarly to str_detect(), but with one extra argument:
str_replace(string, pattern, replacement)
restaurants |>
select(`Inspection Date`) |>
mutate(full_date = str_replace(string = `Inspection Date`,
pattern = "01/",
replacement = "January "))
The problem is that "01/" occurs both for the month and the day. This would be a good place for a regular expression.
> # A tibble: 256,681 × 2
> `Inspection Date` full_date
> <chr> <chr>
> 1 03/02/2023 03/02/2023
> 2 03/02/2023 03/02/2023
> 3 08/31/2022 08/31/2022
> 4 08/31/2022 08/31/2022
> 5 01/13/2022 January 13/2022
> 6 01/06/2021 January 06/2021
> 7 06/22/2023 06/22/2023
> 8 03/01/2023 03/January 2023
> 9 03/01/2023 03/January 2023
> 10 07/13/2022 07/13/2022
> # ℹ 256,671 more rows
Regular expressions or regexes are how we describe patterns we are looking for in text in a way that a computer can understand. We write an expression, apply it to a string input, and then can do things with matches we find.
- Literal characters match themselves exactly, like Pike or 01/.
- Metacharacters let us describe patterns more flexibly.
Quantifiers:
- ? makes a pattern optional (i.e. it matches 0 or 1 times)
- + lets a pattern repeat (i.e. it matches at least once)
- * lets a pattern be optional or repeat (i.e. it matches any number of times, including 0)
- {n} matches exactly n times, {n,} matches at least n times, {n,m} matches between n and m times
Character classes are defined by [] and let you match a set of characters:
- . matches any character except a new line (\n)
- - inside [] allows you to specify a range
- ^ at the start of [] inverts the set, matching anything not listed
Grouping:
- () controls precedence and also allows you to create groups which can be referenced later in the regular expression with backreferences, like \1, \2
- (?:), the non-grouping parentheses, control precedence but do not capture the match in a group. This is slightly more efficient than capturing parentheses and most useful for complex cases where you need to capture matches and control precedence independently.
Alternation:
- | allows us to pick between one or more alternative patterns
Anchors:
- ^ to anchor the start
- $ to anchor the end
- \b to match a word boundary
Lookarounds:
- (?=...) is a positive look-ahead assertion. Matches if ... matches at the current input
- (?!...) is a negative look-ahead assertion. Matches if ... does not match at the current input
- (?<=...) is a positive look-behind assertion. Matches if ... matches text preceding the current position. Length must be bounded (i.e. no * or +)
- (?<!...) is a negative look-behind assertion. Matches if ... does not match text preceding the current position. Length must be bounded (i.e. no * or +)
You can read more about regular expressions in stringr here and this is a useful tutorial to learn regex if you need to/when you’re ready!
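A few of the pieces above, sketched with stringr. The input strings are made up for illustration, and str_extract() and str_extract_all() are stringr functions not otherwise used in this lecture. Note that backslashes in a regex must be doubled inside an R string.

```r
library(stringr)

# Quantifier + anchors: an optional "u", and nothing before or after the word.
optional_u <- str_detect(c("color", "colour", "colouur"), "^colou?r$")

# Character class: "c" or "b" followed by "at".
class_match <- str_extract_all("cat bat rat", "[cb]at")[[1]]

# Counted quantifier with an end anchor: 1 to 3 "I"s at the end of the string.
numeral <- str_extract("Risk Category III", "I{1,3}$")

# Groups and backreferences: reorder month/day/year into year-month-day.
iso_date <- str_replace("03/01/2023",
                        "^(\\d{2})/(\\d{2})/(\\d{4})$",
                        "\\3-\\1-\\2")

# Look-ahead: digits that are followed by a hyphen, without matching the hyphen.
lower_bound <- str_extract("Seating 13-50", "\\d+(?=-)")
```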
Let’s go back to our example and see if we can use a regular expression to replace 01/ just for the month position of our date variable.
restaurants |>
select(`Inspection Date`) |>
mutate(full_date = str_replace(string = `Inspection Date`,
pattern = "^01/",
replacement = "January "))
We add the start anchor (^) to make sure our replacement only happens to the 01/s in the month position.
> # A tibble: 256,681 × 2
> `Inspection Date` full_date
> <chr> <chr>
> 1 03/02/2023 03/02/2023
> 2 03/02/2023 03/02/2023
> 3 08/31/2022 08/31/2022
> 4 08/31/2022 08/31/2022
> 5 01/13/2022 January 13/2022
> 6 01/06/2021 January 06/2021
> 7 06/22/2023 06/22/2023
> 8 03/01/2023 03/01/2023
> 9 03/01/2023 03/01/2023
> 10 07/13/2022 07/13/2022
> # ℹ 256,671 more rows
Let’s look at a more realistic example and introduce the regex version of our separate_wider_* functions. What if we wanted to separate the Description variable into two separate variables: capacity_description and risk_category?
First, let’s look at the values Description takes to figure out how we need to separate this character vector.
To see every row of the result, use print(n).
> # A tibble: 33 × 2
> Description n
> <chr> <int>
> 1 Bakery-no seating - Risk Category I 15
> 2 Bakery-no seating - Risk Category II 4396
> 3 Bakery-no seating - Risk Category III 325
> 4 Bed and Breakfast - Risk Category I 72
> 5 Caterer - Risk Category I 53
> 6 Caterer - Risk Category II 85
> 7 Caterer - Risk Category III 2006
> 8 Grocery Store-no seating - Risk Category I 9752
> 9 Grocery Store-no seating - Risk Category II 2487
> 10 Limited Food Services - no permanent plumbing 1152
> 11 Meat/Sea Food - Risk Category III 14258
> 12 Mobile Food Unit - Risk Category I 735
> 13 Mobile Food Unit - Risk Category II 494
> 14 Mobile Food Unit - Risk Category III 4500
> 15 Non-Profit Institution - Risk Category I 840
> 16 Non-Profit Institution - Risk Category II 613
> 17 Non-Profit Institution - Risk Category III 6342
> 18 School Lunch Program - Risk II 13767
> 19 Seating 0-12 - Risk Category I 4329
> 20 Seating 0-12 - Risk Category II 7102
> 21 Seating 0-12 - Risk Category III 40880
> 22 Seating 13-50 - Risk Category I 1511
> 23 Seating 13-50 - Risk Category II 7331
> 24 Seating 13-50 - Risk Category III 58592
> 25 Seating 151-250 - Risk Category I 73
> 26 Seating 151-250 - Risk Category II 26
> 27 Seating 151-250 - Risk Category III 10347
> 28 Seating 51-150 - Risk Category I 667
> 29 Seating 51-150 - Risk Category II 975
> 30 Seating 51-150 - Risk Category III 55706
> 31 Seating > 250 - Risk Category I 69
> 32 Seating > 250 - Risk Category II 4
> 33 Seating > 250 - Risk Category III 7177
res_sep <- restaurants |>
distinct(Name, Description) |>
separate_wider_regex(cols = Description,
patterns = c(capacity_description = "^.+",
risk_category = "Risk ?(?:Category)? ?I{1,3}$"))
We start with the distinct values of Name and Description.
The cols argument of this function is the column you want to separate.
The patterns argument takes a named character vector where the names become the column names and the character strings are regular expressions that match the desired contents of each new variable.
> Error in `separate_wider_regex()`:
> ! Expected each value of `Description` to match the pattern, the whole
> pattern, and nothing but the pattern.
> ! 104 values have problems.
> ℹ Use `too_few = "debug"` to diagnose the problem.
> ℹ Use `too_few = "start"` to silence this message.
I’ve triggered the debugging error message which tells me how to diagnose/ignore the mismatch that’s occurring.
res_sep <- restaurants |>
distinct(Name, Description) |>
separate_wider_regex(cols = Description,
patterns = c(capacity_description = "^.+",
risk_category = "Risk ?(?:Category)? ?I{1,3}$"),
too_few = "debug") |>
distinct(capacity_description, risk_category, Description_ok,
Description_matches, Description_remainder) |>
print(n = 33)
"^" matches the beginning of a string, "." matches any character except a new line, and "+" quantifies that ".", asking it to match 1 or more characters.
"Risk" matches exactly, " ?" matches a single white space 0 or 1 times, "(?:Category)?" optionally matches the exact word “Category”, again " ?" matches a single white space 0 or 1 times, "I{1,3}" matches “I” 1-3 times, and "$" signifies the end of the string.
Calling distinct() on the created and debugging variables allows us to see what didn’t match.
> # A tibble: 33 × 5
> capacity_description risk_category Description_ok Description_matches
> <chr> <chr> <lgl> <int>
> 1 "Seating 0-12 - " Risk Categor… TRUE 2
> 2 "Seating 13-50 - " Risk Categor… TRUE 2
> 3 "Seating 51-150 - " Risk Categor… TRUE 2
> 4 "Bakery-no seating - " Risk Categor… TRUE 2
> 5 "Mobile Food Unit - " Risk Categor… TRUE 2
> 6 "Seating > 250 - " Risk Categor… TRUE 2
> 7 "Seating 151-250 - " Risk Categor… TRUE 2
> 8 "Grocery Store-no seating -… Risk Categor… TRUE 2
> 9 "Seating 13-50 - " Risk Categor… TRUE 2
> 10 "Caterer - " Risk Categor… TRUE 2
> 11 "Caterer - " Risk Categor… TRUE 2
> 12 "Seating 13-50 - " Risk Categor… TRUE 2
> 13 "Seating 0-12 - " Risk Categor… TRUE 2
> 14 "Meat/Sea Food - " Risk Categor… TRUE 2
> 15 "Bakery-no seating - " Risk Categor… TRUE 2
> 16 "Seating 0-12 - " Risk Categor… TRUE 2
> 17 "Caterer - " Risk Categor… TRUE 2
> 18 "Limited Food Services - no… <NA> FALSE 1
> 19 "Seating 51-150 - " Risk Categor… TRUE 2
> 20 "Seating 51-150 - " Risk Categor… TRUE 2
> 21 "School Lunch Program - " Risk II TRUE 2
> 22 "Mobile Food Unit - " Risk Categor… TRUE 2
> 23 "Mobile Food Unit - " Risk Categor… TRUE 2
> 24 "Non-Profit Institution - " Risk Categor… TRUE 2
> 25 "Grocery Store-no seating -… Risk Categor… TRUE 2
> 26 "Bakery-no seating - " Risk Categor… TRUE 2
> 27 "Seating > 250 - " Risk Categor… TRUE 2
> 28 "Non-Profit Institution - " Risk Categor… TRUE 2
> 29 "Non-Profit Institution - " Risk Categor… TRUE 2
> 30 "Seating 151-250 - " Risk Categor… TRUE 2
> 31 "Seating > 250 - " Risk Categor… TRUE 2
> 32 "Seating 151-250 - " Risk Categor… TRUE 2
> 33 "Bed and Breakfast - " Risk Categor… TRUE 2
> # ℹ 1 more variable: Description_remainder <chr>
res_sep <- restaurants |>
distinct(Name, Description) |>
separate_wider_regex(cols = Description,
patterns = c(capacity_description = "^.+",
risk_category = "Risk ?(?:Category)? ?I{1,3}$"),
too_few = "align_start")
res_sep
Since the one value that didn’t match has no risk_category, we can give too_few the value "align_start", which tells the function to fill in anything without a value for the second variable with an NA.
> # A tibble: 11,209 × 3
> Name capacity_description risk_category
> <chr> <chr> <chr>
> 1 "#807 TUTTA BELLA" "Seating 0-12 - " Risk Category III
> 2 "+MAS CAFE " "Seating 0-12 - " Risk Category III
> 3 "?al?al Cafe" "Seating 13-50 - " Risk Category III
> 4 "100 LB CLAM" "Seating 0-12 - " Risk Category III
> 5 "1000 SPIRITS" "Seating 51-150 - " Risk Category III
> 6 "100TH AVE CAKES" "Bakery-no seating - " Risk Category II
> 7 "108 VIETNAMESE AUTHENTIC CUISINE" "Seating 51-150 - " Risk Category III
> 8 "11TH FRAME RESTAURANT & LOUNGE" "Seating 51-150 - " Risk Category III
> 9 "125TH ST GRILL" "Seating 51-150 - " Risk Category III
> 10 "12S TACOS MEXICAN FOOD KC1012" "Mobile Food Unit - " Risk Category III
> # ℹ 11,199 more rows
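The same call can be tried on a handful of values to see how each kind of Description is handled. The sketch below uses three values taken from the table of distinct descriptions above, placed in a small `toy` tibble (an illustrative stand-in for the full data).

```r
library(dplyr)
library(tidyr)

# Three Description values: the common form, the "Risk II" variant,
# and the one value with no risk category at all.
toy <- tibble(Description = c("Seating 0-12 - Risk Category III",
                              "School Lunch Program - Risk II",
                              "Limited Food Services - no permanent plumbing"))

toy_sep <- toy |>
  separate_wider_regex(cols = Description,
                       patterns = c(capacity_description = "^.+",
                                    risk_category = "Risk ?(?:Category)? ?I{1,3}$"),
                       too_few = "align_start")
toy_sep
```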
We can clean up these variables a bit more with a version of str_replace(): str_remove(). This technically replaces the pattern match with "", or an empty string.
res_sep <- restaurants |>
distinct(Name, Description) |>
separate_wider_regex(cols = Description,
patterns = c(capacity_description = "^.+",
risk_category = "Risk ?(?:Category)? ?I{1,3}$"),
too_few = "align_start") |>
mutate(capacity_description = str_remove(capacity_description, pattern = " - $"),
risk_category = str_remove(risk_category, pattern = "Risk ?(?:Category)? "))
res_sep
We can remove the trailing " - " by using str_remove() and providing the regular expression for that piece of the capacity_description string.
Since every value of risk_category starts with the same “Risk Category” language, we can remove that language from the beginning of each string by matching the first part of our original regular expression for this variable.
> # A tibble: 11,209 × 3
> Name capacity_description risk_category
> <chr> <chr> <chr>
> 1 "#807 TUTTA BELLA" Seating 0-12 III
> 2 "+MAS CAFE " Seating 0-12 III
> 3 "?al?al Cafe" Seating 13-50 III
> 4 "100 LB CLAM" Seating 0-12 III
> 5 "1000 SPIRITS" Seating 51-150 III
> 6 "100TH AVE CAKES" Bakery-no seating II
> 7 "108 VIETNAMESE AUTHENTIC CUISINE" Seating 51-150 III
> 8 "11TH FRAME RESTAURANT & LOUNGE" Seating 51-150 III
> 9 "125TH ST GRILL" Seating 51-150 III
> 10 "12S TACOS MEXICAN FOOD KC1012" Mobile Food Unit III
> # ℹ 11,199 more rows
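The two str_remove() patterns can be checked in isolation on individual strings, as in this sketch. The inputs are values that appear in the separated output above.

```r
library(stringr)

# " - $" matches the separator only when it sits at the end of the string.
capacity <- str_remove(c("Seating 0-12 - ", "Bakery-no seating - "),
                       pattern = " - $")

# The optional "Category" lets one pattern handle both prefix variants.
risk <- str_remove(c("Risk Category III", "Risk II"),
                   pattern = "Risk ?(?:Category)? ")
```

Note that the hyphen inside "Bakery-no seating" is untouched because the pattern requires the surrounding spaces and the end-of-string anchor.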
What do the final 33 distinct values of these two new variables look like?
> # A tibble: 33 × 2
> capacity_description risk_category
> <chr> <chr>
> 1 Seating 0-12 III
> 2 Seating 13-50 III
> 3 Seating 51-150 III
> 4 Bakery-no seating II
> 5 Mobile Food Unit III
> 6 Seating > 250 III
> 7 Seating 151-250 III
> 8 Grocery Store-no seating I
> 9 Seating 13-50 II
> 10 Caterer II
> 11 Caterer III
> 12 Seating 13-50 I
> 13 Seating 0-12 I
> 14 Meat/Sea Food III
> 15 Bakery-no seating III
> 16 Seating 0-12 II
> 17 Caterer I
> 18 Limited Food Services - no permanent plumbing <NA>
> 19 Seating 51-150 I
> 20 Seating 51-150 II
> 21 School Lunch Program II
> 22 Mobile Food Unit I
> 23 Mobile Food Unit II
> 24 Non-Profit Institution III
> 25 Grocery Store-no seating II
> 26 Bakery-no seating I
> 27 Seating > 250 I
> 28 Non-Profit Institution II
> 29 Non-Profit Institution I
> 30 Seating 151-250 I
> 31 Seating > 250 II
> 32 Seating 151-250 II
> 33 Bed and Breakfast I
Nice!
Even if you aren’t explicitly manipulating/analyzing text data for your research, knowing some things about regular expressions will still come in handy because they’re used in other places, both in Base R and the tidyverse.
apropos(pattern) searches all objects available from the global environment that match the given pattern. This is useful if you can’t quite remember the name of a function, for example:
list.files(path, pattern) lists all files in path that match a regular expression pattern. For example, you can find all the Quarto files in the current directory with:
matches(pattern) will select all variables whose name matches the supplied pattern. It’s a tidyselect function (like starts_with() and the like) that you can use in any tidyverse function that selects variables.
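The three functions above might be used as in the sketch below. The search patterns, the temporary folder, the file names, and the `toy` tibble are all assumptions made up for illustration.

```r
library(dplyr)

# apropos(): find objects whose names start with "read.csv".
apropos("^read\\.csv")

# list.files(): create two throwaway files, then keep only the Quarto one.
tmp <- file.path(tempdir(), "regex_demo")
dir.create(tmp, showWarnings = FALSE)
file.create(file.path(tmp, c("slides.qmd", "notes.txt")))
qmd_files <- list.files(tmp, pattern = "\\.qmd$")

# matches(): select columns whose names match a regular expression.
toy <- tibble(country = "A", new_sp_m014 = 1, newrel_f65 = 2)
selected <- toy |> select(matches("^new"))
```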
pivot_longer()’s names_pattern argument takes a vector of regular expressions, just like separate_wider_regex(). It’s useful when extracting data out of variable names with a complex structure.
> [1] "country" "iso2" "iso3" "year" "new_sp_m014"
> [6] "new_sp_m1524" "new_sp_m2534" "new_sp_m3544" "new_sp_m4554" "new_sp_m5564"
who |> pivot_longer(cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
values_to = "count") |>
slice_head(n = 10)
"new_?(.*)_(.)(.*)" explained: new matches exactly, then _? optionally matches an underscore, (.*) matches any number of characters and in this example captures the diagnosis variable, _ matches exactly, (.) matches one character, which captures the gender variable (m or f in this example), and lastly, (.*) again matches any number of characters, in this case capturing the varying digits of the age variable.
> # A tibble: 10 × 8
> country iso2 iso3 year diagnosis gender age count
> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
> 1 Afghanistan AF AFG 1980 sp m 014 NA
> 2 Afghanistan AF AFG 1980 sp m 1524 NA
> 3 Afghanistan AF AFG 1980 sp m 2534 NA
> 4 Afghanistan AF AFG 1980 sp m 3544 NA
> 5 Afghanistan AF AFG 1980 sp m 4554 NA
> 6 Afghanistan AF AFG 1980 sp m 5564 NA
> 7 Afghanistan AF AFG 1980 sp m 65 NA
> 8 Afghanistan AF AFG 1980 sp f 014 NA
> 9 Afghanistan AF AFG 1980 sp f 1524 NA
> 10 Afghanistan AF AFG 1980 sp f 2534 NA
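To see how the pattern copes with both naming styles in who (with and without the underscore after new), here is a sketch on a one-row made-up tibble with just two of the columns. The data values are invented.

```r
library(dplyr)
library(tidyr)

# One hypothetical row with one column of each naming style.
toy <- tibble(country = "A", new_sp_m014 = 1, newrel_f65 = 2)

toy_long <- toy |>
  pivot_longer(cols = -country,
               names_to = c("diagnosis", "gender", "age"),
               names_pattern = "new_?(.*)_(.)(.*)",
               values_to = "count")
toy_long
```

Each set of parentheses in names_pattern fills one of the names_to columns, in order.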
The delim argument in separate_longer_delim() and separate_wider_delim() usually matches a fixed string, but you can use regex() to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. regex(", ?").
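A minimal sketch of a regex delimiter, using a made-up column in which commas are followed by a space only some of the time.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical data: inconsistent spacing after the commas.
toy <- tibble(x = c("a,b, c", "d"))

# regex(", ?") treats both "," and ", " as the delimiter.
toy_long <- toy |>
  separate_longer_delim(x, delim = regex(", ?"))
toy_long
```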
Base R Equivalents

Base R                          stringr
paste(x, sep, collapse)         str_c(x, sep, collapse), str_flatten(x, collapse)
nchar(x)                        str_length(x)
substr(x, start, end)           str_sub(x, start, end)
toupper(x)                      str_to_upper(x)
tolower(x)                      str_to_lower(x)
tools::toTitleCase(x)           str_to_title(x)
trimws(x)                       str_trim(x)
grepl(pattern, x)               str_detect(x, pattern)
sub(pattern, replacement, x)    str_replace(x, pattern, replacement)
strwrap(x)                      str_wrap(x)
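To see a few of these pairs side by side, the sketch below runs the base R and stringr versions on the same input and checks that they agree. The vector x is a made-up example, and agreement on this input does not mean the functions are identical in every case (for instance, tools::toTitleCase() and str_to_title() capitalize differently and are left out here).

```r
library(stringr)

x <- c("  apple pie ", "banana")   # made-up example strings

# Each entry is TRUE when the base R and stringr results match on x
same <- c(
  length  = all(nchar(x) == str_length(x)),
  substr  = identical(substr(x, 1, 3), str_sub(x, 1, 3)),
  upper   = identical(toupper(x), str_to_upper(x)),
  trim    = identical(trimws(x), str_trim(x)),
  detect  = identical(grepl("an", x), str_detect(x, "an")),
  # Argument order differs: base R puts the string last, stringr puts it first
  replace = identical(sub("a", "A", x), str_replace(x, "a", "A")),
  flatten = identical(paste(x, collapse = ", "), str_flatten(x, collapse = ", "))
)
same
```

The main practical difference is argument order: stringr functions take the string first, which is what makes them convenient in a pipe.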
There are many other useful stringr functions and variants of the functions we used today; see the stringr package documentation for the full list.
First, install the babynames package in your console, then run the following code to load the babynames dataset into your global environment.
Use str_c() or str_glue() to create a new variable that is a sentence stating what the most popular name was for each binary sex category in that year. Bonus: Add a line break in your sentence and use str_view() to see what the new string looks like.
> # A tibble: 1,924,665 × 5
> year sex name n prop
> <dbl> <chr> <chr> <int> <dbl>
> 1 1880 F Mary 7065 0.0724
> 2 1880 F Anna 2604 0.0267
> 3 1880 F Emma 2003 0.0205
> 4 1880 F Elizabeth 1939 0.0199
> 5 1880 F Minnie 1746 0.0179
> 6 1880 F Margaret 1578 0.0162
> 7 1880 F Ida 1472 0.0151
> 8 1880 F Alice 1414 0.0145
> 9 1880 F Bertha 1320 0.0135
> 10 1880 F Sarah 1288 0.0132
> # ℹ 1,924,655 more rows
babynames |>
  filter(year == 1950) |>
  mutate(sex2 = if_else(sex == "F", "girl", "boy")) |>
  slice_max(prop, by = c(sex)) |>
  mutate(Sentence = str_wrap(str_glue("The most popular name for {sex2}s in
                                      {year} was {name}."),
                             width = 25)) |>
  pull(Sentence) |>
  str_view()
Creates the sex2 variable for better interpretability of the final Sentence variable.
pull() is similar to indexing with $ in base R but works well with pipes. It is needed before str_view(), which takes a vector of values rather than a data frame.
> [1] │ The most popular name for
> │ girls in 1950 was Linda.
> [2] │ The most popular name for
> │ boys in 1950 was James.
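The same sentence can also be built with str_c(), placing the line break by hand with \n instead of relying on str_wrap(). This is a minimal sketch under one assumption: rather than querying babynames, it uses a small hand-typed tibble (top_1950, a hypothetical name) holding the two rows shown in the output above.

```r
library(dplyr)
library(stringr)

# Two rows copied from the 1950 output shown above (not a fresh query)
top_1950 <- tibble(
  year = 1950,
  sex2 = c("girl", "boy"),
  name = c("Linda", "James")
)

sentences <- top_1950 |>
  # str_c() pastes the pieces together; "\n" inserts the line break
  mutate(Sentence = str_c("The most popular name for ", sex2, "s in\n",
                          year, " was ", name, ".")) |>
  # pull() extracts the column as a plain character vector
  pull(Sentence)

str_view(sentences)
```

With str_c() you decide where the break goes, whereas str_wrap() chooses break points from the requested width.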
library(ggrepel)
library(ggthemes)
library(patchwork)

colors <- c("#4e79a7","#f28e2c","#e15759","#76b7b2","#59a14f","#edc949",
            "#af7aa1","#ff9da7","#9c755f","#bab0ab")

victoria_plot <- babynames |>
  filter(name == "Victoria") |>
  mutate(sex2 = if_else(sex == "F", "Female", "Male")) |>
  ggplot(aes(x = year, y = prop, group = name, fill = name)) +
  geom_density(stat = "identity", alpha = 0.25, color = colors[1]) +
  geom_vline(xintercept = 1988, color = colors[2], linetype = 2) +
  geom_vline(data = babynames |>
               filter(name == "Victoria") |>
               mutate(sex2 = if_else(sex == "F", "Female", "Male")) |>
               slice_max(prop, by = sex2),
             aes(xintercept = year), color = colors[3]) +
  facet_grid(sex2 ~ .,
             scales = "free_y") +
  scale_fill_manual(values = colors[1]) +
  labs(title = 'Popularity of the name "Victoria"',
       subtitle = "1880-2017, by binary sex category",
       y = "",
       x = "") +
  theme_tufte(base_size = 16) +
  theme(legend.position = "none",
        strip.background = element_rect(color="black",
                                        fill= alpha(colors[10], 0.5),
                                        linetype = 0))

vic_plot <- babynames |>
  filter(name == "Vic") |>
  mutate(sex2 = if_else(sex == "F", "Female", "Male")) |>
  ggplot(aes(x = year, y = prop, group = name, fill = name)) +
  geom_density(stat = "identity", alpha = 0.25, color = colors[6]) +
  geom_vline(xintercept = 1988, color = colors[2], linetype = 2) +
  geom_vline(data = babynames |>
               filter(name == "Vic") |>
               mutate(sex2 = if_else(sex == "F", "Female", "Male")) |>
               slice_max(prop, by = sex2),
             aes(xintercept = year), color = colors[3]) +
  facet_grid(sex2 ~ .,
             scales = "free_y") +
  scale_fill_manual(values = colors[6]) +
  labs(title = 'Popularity of the name "Vic"',
       y = "",
       caption = "Note: y-axes are of different scales;
Orange, dashed line represents 1988; #
Red, solid line represents most popular #
year for that name-sex pairing.",
       x = "Year") +
  theme_tufte(base_size = 16) +
  theme(legend.position = "none",
        strip.background = element_rect(color="black",
                                        fill= alpha(colors[10], 0.5),
                                        linetype = 0))

combo_plots <- victoria_plot / vic_plot + ylab(NULL)

wrap_elements(combo_plots) +
  theme_tufte(base_size = 16) +
  labs(tag = "Proportion of all names given to U.S. newborns") +
  theme(plot.tag = element_text(size = rel(1.25), angle = 90),
        plot.tag.position = "left")
Recodes the sex variable for facet visualization purposes.
Facets by sex2 and allows the y-axis to vary based on facet value.
Example using regular expressions:
nicknames <- babynames |>
  mutate(nickname = case_when(str_detect(name, pattern = "^Vi.{2}oria$") ~ "Victoria",
                              str_detect(name, pattern = "^Vi.{2}or$") ~ "Victor",
                              str_detect(name, pattern = "^Vi[ck]{1,2}$") ~ "Vic",
                              str_detect(name, pattern = "^Tor[riey]*$") ~ "Tori",
                              str_detect(name, pattern = "^Vi[ck]+[iey]*$") ~ "Vicky",
                              .default = NA)) |>
  filter(!is.na(nickname)) |>
  mutate(prop2 = sum(prop),
         .by = c(year, nickname, sex)) |>
  distinct(year, nickname, prop2, sex) |>
  mutate(sex2 = if_else(sex == "F", "Female", "Male"),
         nickname = fct(nickname, levels = c("Victoria", "Victor", "Vicky", "Tori", "Vic")))

my_names <- nicknames |>
  ggplot(aes(x = year, y = prop2, fill = nickname, group = nickname)) +
  geom_density(aes(color = nickname), stat = "identity", alpha = 0.15) +
  geom_vline(xintercept = 1988, color = colors[4], linetype = 2) +
  scale_fill_manual(values = colors[c(1:3, 5:7)]) +
  scale_color_manual(values = colors[c(1:3, 5:7)]) +
  facet_grid(sex2 ~ .,
             scales = "free_y") +
  geom_label_repel(data = nicknames |> slice_max(prop2, by = c(sex2, nickname)),
                   aes(label = nickname), stat = "identity") +
  labs(title = 'Popularity of all nicknames for "Victoria" (including all spelling variants)',
       caption = "Note: y-axes are of different scales; Teal, dashed line represents 1988",
       subtitle = "1880-2017, by binary sex category",
       y = "Proportion of all names given to U.S. newborns",
       x = "Year") +
  theme_tufte(base_size = 16) +
  theme(legend.position = "none",
        strip.background = element_rect(color="black", fill= alpha(colors[10], 0.5), linetype = 0))

my_names
Recodes the sex variable for facet visualization purposes.
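To get a feel for which spellings each pattern in the case_when() above picks up, the sketch below applies the same patterns, in the same order, to a short vector of names. The spellings vector is hypothetical and chosen for illustration; it is not drawn from babynames.

```r
library(dplyr)
library(stringr)

# Hypothetical spellings chosen for illustration (not pulled from babynames)
spellings <- c("Victoria", "Viktoria", "Victor", "Vic", "Vick",
               "Vicky", "Vicki", "Tori", "Torrey", "Vivian")

# Same patterns and order as in the nicknames pipeline above
nickname <- case_when(
  str_detect(spellings, "^Vi.{2}oria$")     ~ "Victoria",
  str_detect(spellings, "^Vi.{2}or$")       ~ "Victor",
  str_detect(spellings, "^Vi[ck]{1,2}$")    ~ "Vic",
  str_detect(spellings, "^Tor[riey]*$")     ~ "Tori",
  str_detect(spellings, "^Vi[ck]+[iey]*$")  ~ "Vicky",
  .default = NA
)

# Show each spelling next to the nickname it was assigned
setNames(nickname, spellings)
```

The order of the conditions matters: "Vick" satisfies both the "Vic" and the "Vicky" patterns, and case_when() assigns it to whichever comes first. Names that match no pattern, such as "Vivian", come back as NA and are dropped by the filter() step in the pipeline.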