Roadmap

  • Text as Tidy Data
  • Sentiment Analysis
  • tf-idf
  • Word Embeddings
  • Topic Modeling
  • Further Learning & Resources

Source Material

Go to book site

Text as Tidy Data

The tidyverse


The tidyverse refers to two things:

  1. a specific package in R that loads several core packages within the tidyverse.
  2. a specific design philosophy, grammar, and focus on “tidy” data structures developed by Hadley Wickham and the team at Posit (formerly known as RStudio).

The tidyverse package

The core packages within the tidyverse include:

  • ggplot2 (visualizations)
  • dplyr (data manipulation)
  • tidyr (data reshaping)
  • readr (data import/export)
  • purrr (iteration)
  • tibble (modern dataframe)
  • stringr (text data)
  • forcats (factors)


The tidyverse philosophy

The principles underlying the tidyverse are:

  1. Reuse existing data structures.
  2. Compose simple functions with the pipe (see the sketch after this list).
  3. Embrace functional programming.
  4. Design for humans.
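
As a quick illustration of principle 2 (a sketch, not from the source material), composing simple verbs with the pipe looks like this, using the built-in mtcars data:

library(dplyr)

mtcars |>
  filter(cyl == 4) |>               # keep 4-cylinder cars
  group_by(gear) |>                 # one group per number of gears
  summarise(mean_mpg = mean(mpg))   # average mpg within each group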

What is Tidy Data?

  • Each variable is a column.
  • Each observation is a row.
  • Each value is a cell.

Why do we Want Tidy Data?

  • Easier to understand many rows than many columns
  • Required for plotting in ggplot2
  • Required for many types of statistical procedures (e.g. hierarchical or mixed effects models)
  • Fewer issues with missing values and “imbalanced” repeated measures data
  • Having a consistent method for storing data means it’s easier to learn the tools to work with it since there’s an underlying uniformity.

Most real-world data is not tidy because data are often organized for goals other than analysis (e.g. data entry, or text that was never intentionally created as data) and because most people aren’t familiar with the principles of tidy data.
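
To make this concrete, here is a small sketch (with hypothetical data, not from the Bob’s Burgers corpus) of turning an untidy, wide table into a tidy one:

library(tidyverse)

# Hypothetical wide data: one column per year, so "year" is hidden in the headers
scores_wide <- tibble(
  student = c("A", "B"),
  `2023`  = c(85, 90),
  `2024`  = c(88, 84)
)

# Tidy version: one row per student-year observation, one column per variable
scores_wide |>
  pivot_longer(cols = -student, names_to = "year", values_to = "score")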


Data from bobsburgersR

install.packages("pak")
pak("poncest/bobsburgersR")
1
Run these two lines of code in your console.

The pak package is an alternative to using install.packages() and, for our purposes here, allows us to download our data directly from GitHub.

library(tidyverse)
library(bobsburgersR)
head(transcript_data, n = 10)
# A tibble: 10 × 6
   season episode title        line raw_text                            dialogue
    <dbl>   <dbl> <chr>       <dbl> <chr>                               <chr>   
 1      1       1 Human Flesh     1 <NA>                                <NA>    
 2      1       1 Human Flesh     2 <NA>                                <NA>    
 3      1       1 Human Flesh     3 <NA>                                <NA>    
 4      1       1 Human Flesh     4 Listen, pep talk.                   Listen,…
 5      1       1 Human Flesh     5 Big day today.                      Big day…
 6      1       1 Human Flesh     6 It's our grand re-re-re-opening.    It's ou…
 7      1       1 Human Flesh     7 It's labor day weekend, And it loo… It's la…
 8      1       1 Human Flesh     8 So we have to-- Big day for anothe… So we h…
 9      1       1 Human Flesh     9 Go ahead, sorry.                    Go ahea…
10      1       1 Human Flesh    10 Go ahead, do your pep.              Go ahea…

First steps

library(tidytext) 
library(textclean)

clean_text <- function(text) {
  text |> 
    replace_html() |>
    replace_non_ascii() |>
    replace_contraction() |>
    replace_white() |>
    tolower() |>
    str_remove_all(pattern = "[:punct:]|[:symbol:]")
}
2
You can optionally run check_text() on your raw text variable to receive a print-out of possible functions you may want to use to clean your raw text data.
3
Creating our own helper function to clean our raw text variable
4
Removes any HTML tags
5
Removes non-ASCII characters
6
Expands contractions (e.g. it's → it is) so the same word isn't counted in multiple forms
7
Collapses multiple spaces
8
Normalizes all lower case
9
Removes all punctuation/symbols

Cleaning the data

clean_bob <- transcript_data |> 
  drop_na() |>
  mutate(cleaned_text = raw_text |> clean_text()) 
10
Drops any rows in a dataframe that have NAs in them


clean_bob |> 
  select(raw_text) |> 
  slice_sample(n = 5) 
# A tibble: 5 × 1
  raw_text                                                                      
  <chr>                                                                         
1 Yeah, come, Susmita.                                                          
2 And the heat is off in the restaurant,                                        
3 Whoa! Oh.                                                                     
4 I was gonna give it to you for your birthday  and, uh, surprise you, but it's…
5 It'll be just like at the doctor,  but we've both seen each other naked.      
clean_bob |> 
  select(cleaned_text) |> 
  slice_sample(n = 5) 
# A tibble: 5 × 1
  cleaned_text                                                                  
  <chr>                                                                         
1 yeah come susmita                                                             
2 and the heat is off in the restaurant                                         
3 whoa oh                                                                       
4 i was gonna give it to you for your birthday and uh surprise you but it is in…
5 it will be just like at the doctor but we have both seen each other naked     

Tokenization

  • In order to create a tidy dataset from lots of text, we need to define a feature that we want to use as our unit of analysis.
  • This could be many different things (e.g. a sentence, a paragraph, an n-gram), but the most common token is a word (see the bigram sketch below).
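
For instance, a minimal sketch (hypothetical one-line input, not from the transcript data) of tokenizing into bigrams instead of single words:

library(tidytext)
library(tibble)

tibble(text = "listen pep talk big day today") |>
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)
# one row per bigram: "listen pep", "pep talk", "talk big", ...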

Tidying the data

tidy_bob <- clean_bob |> 
  left_join(transcript_data |>
              select(season, episode) |> 
              distinct() |> 
              mutate(episode_seq = row_number())) |>
  select(-c(dialogue, raw_text)) |>
  unnest_tokens(output = word, input = cleaned_text)

tidy_bob
11
This left_join simply creates an episode_seq number that runs continuously across seasons (rather than resetting with each season)
12
Removes the dialogue and raw_text columns
13
This function tokenizes our data into a tidy format
# A tibble: 1,127,930 × 6
   season episode title        line episode_seq word  
    <dbl>   <dbl> <chr>       <dbl>       <int> <chr> 
 1      1       1 Human Flesh     4           1 listen
 2      1       1 Human Flesh     4           1 pep   
 3      1       1 Human Flesh     4           1 talk  
 4      1       1 Human Flesh     5           1 big   
 5      1       1 Human Flesh     5           1 day   
 6      1       1 Human Flesh     5           1 today 
 7      1       1 Human Flesh     6           1 it    
 8      1       1 Human Flesh     6           1 is    
 9      1       1 Human Flesh     6           1 our   
10      1       1 Human Flesh     6           1 grand 
# ℹ 1,127,920 more rows

Stop Words

As you might imagine, most words that appear in a corpus of text are going to be fairly boring and (depending on the analysis) not very insightful. These are what’s known as stop words.

data(stop_words)
stop_words
# A tibble: 1,149 × 2
   word        lexicon
   <chr>       <chr>  
 1 a           SMART  
 2 a's         SMART  
 3 able        SMART  
 4 about       SMART  
 5 above       SMART  
 6 according   SMART  
 7 accordingly SMART  
 8 across      SMART  
 9 actually    SMART  
10 after       SMART  
# ℹ 1,139 more rows
stop_words |> count(lexicon)
# A tibble: 3 × 2
  lexicon      n
  <chr>    <int>
1 SMART      571
2 onix       404
3 snowball   174

Removing Stop Words

tidy_bob <- tidy_bob |> 
  anti_join(stop_words)

tidy_bob
# A tibble: 355,240 × 6
   season episode title        line episode_seq word         
    <dbl>   <dbl> <chr>       <dbl>       <int> <chr>        
 1      1       1 Human Flesh     4           1 listen       
 2      1       1 Human Flesh     4           1 pep          
 3      1       1 Human Flesh     4           1 talk         
 4      1       1 Human Flesh     5           1 day          
 5      1       1 Human Flesh     6           1 grand        
 6      1       1 Human Flesh     6           1 rerereopening
 7      1       1 Human Flesh     7           1 labor        
 8      1       1 Human Flesh     7           1 day          
 9      1       1 Human Flesh     7           1 weekend      
10      1       1 Human Flesh     7           1 wharf        
# ℹ 355,230 more rows
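
If domain-specific filler still dominates after this step (as the term-frequency results on the next slides show, the top words are mostly interjections and character names), one option is to extend stop_words with a custom lexicon. A sketch, using a hypothetical custom word list:

custom_stops <- tibble(
  word    = c("yeah", "uh", "gonna", "hey", "um"),
  lexicon = "custom"
)

tidy_bob |>
  anti_join(bind_rows(stop_words, custom_stops), by = "word")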

Term Frequency

The most basic statistic we can derive after these preliminary steps is a count of which words are used most often in our data.

tidy_bob |> 
  count(word, sort = TRUE)
14
This is the same as group_by(word) |> summarise(n = n()) |> arrange(desc(n))
# A tibble: 25,814 × 2
   word       n
   <chr>  <int>
 1 yeah    6444
 2 uh      5320
 3 gonna   4687
 4 bob     3576
 5 tina    3157
 6 hey     2787
 7 gene    2557
 8 wait    2314
 9 god     2235
10 louise  2002
# ℹ 25,804 more rows

Term Frequency Visualization

observable <- c("#1f77b4","#ff7f0e","#2ca02c","#d62728","#9467bd",
                "#8c564b","#e377c2","#7f7f7f","#bcbd22","#17becf")

tidy_bob |> 
  group_by(season) |> 
  count(word, sort = TRUE) |>
  slice_max(order_by = n, n = 7) |>
  ungroup() |> 
  mutate(word_color = fct_reorder(word, n),
         word = reorder_within(word, n, season)) |>
  ggplot(aes(x = n, y = word, fill = word_color)) +
  geom_col(show.legend = FALSE) +
  labs(y = NULL) + 
  scale_y_reordered() +
  scale_fill_manual(values = observable) +
  facet_wrap( ~ season, scales = "free_y") +
  theme_minimal(base_size = 18)
15
Create an inclusive color palette (i.e. visually distinct colors for those with colorblindness)
16
Only get the top 7 words per season (which returns the top 10 words overall for these data)
17
Since reorder_within is a bit of a workaround, to apply our own color scheme we need to create a version of our variable explicitly for the fill argument (see this issue for reference)
18
Reorder the words by their count per season
19
This argument pairs with reorder_within to properly label the final plots
20
Applies our color palette to the 10 most common words in the data
21
Creates small multiples of the top 7 most common words by each season

Sentiment Analysis

Sentiment Analysis and Tidy Data

  • Sentiment analysis (aka opinion mining) can be used to get a sense of a source’s attitudes or emotions in a section of writing
  • We can use pre-existing sentiment dictionaries, create our own, or do a combination of the two
  • Words can be categorized in a number of ways:
    • As a simple binary of negative or positive
    • Along a numeric/likert scale from most negative to most positive
    • According to a set of particular emotions/attitudes they represent
  • These measures can then be summarized to represent the overall sentiment or the sentiment within some subset of text (pages, chapters, books, authors, etc.); a minimal sketch follows.
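
As a minimal sketch of that workflow (using the tidy_bob data from earlier and the bing lexicon introduced on the next slide; the following slides do a fuller per-episode comparison across lexicons), net sentiment per season could be computed as:

tidy_bob |>
  inner_join(get_sentiments("bing"), by = "word") |>   # attach positive/negative labels
  count(season, sentiment) |>                          # count each label per season
  pivot_wider(names_from = sentiment,
              values_from = n, values_fill = 0) |>
  mutate(net_sentiment = positive - negative)          # overall sentiment per season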

sentiments datasets

library(textdata)
1
Required to download the AFINN and nrc datasets
get_sentiments("afinn")
2
Scale from -5 to 5 delineating most negative to most positive
# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# ℹ 2,467 more rows
get_sentiments("bing")
3
Binary categorization of positive/negative
# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>    
 1 2-faces     negative 
 2 abnormal    negative 
 3 abolish     negative 
 4 abominable  negative 
 5 abominably  negative 
 6 abominate   negative 
 7 abomination negative 
 8 abort       negative 
 9 aborted     negative 
10 aborts      negative 
# ℹ 6,776 more rows
get_sentiments("nrc")
4
Categorizations of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust
# A tibble: 13,872 × 2
   word        sentiment
   <chr>       <chr>    
 1 abacus      trust    
 2 abandon     fear     
 3 abandon     negative 
 4 abandon     sadness  
 5 abandoned   anger    
 6 abandoned   fear     
 7 abandoned   negative 
 8 abandoned   sadness  
 9 abandonment anger    
10 abandonment fear     
# ℹ 13,862 more rows

Compare Lexicons

afinn <- tidy_bob |> 
  inner_join(get_sentiments("afinn")) |>
  group_by(episode_seq) |>
  summarise(sentiment = sum(value)) |>
  mutate(method = "AFINN")

bing_and_nrc <- bind_rows(
  
  tidy_bob |> 
    inner_join(get_sentiments("bing"))  |> 
    mutate(method = "Bing et al."),
  
  tidy_bob |> 
    inner_join(get_sentiments("nrc") |>  
                 filter(sentiment %in% c("positive", "negative")),
               relationship = "many-to-many"
    )  |> 
    mutate(method = "NRC"))  |> 
  
  count(method, episode_seq, sentiment)  |>
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) |>
  mutate(sentiment = positive - negative)
5
Attach lexicon values to tidy data
6
Group the data by episode
7
Sum the sentiment values for each token (word) for each episode
8
Add a column to differentiate the different methods
9
Keeping only the rows with “positive” or “negative” sentiments
10
Repeated words will match many times with this sentiment so it’s a many-to-many merge
11
Count the number of positive and negative sentiments by sentiment lexicon and episode number
12
This function takes the sentiment column and turns its values (positive, negative) into new columns
13
Values from n should fill the two new positive and negative sentiment columns
14
Implicit NAs (that are created by the new data structure) should be given a value of 0
15
Calculate overall sentiment by subtracting negative from positive counts

Visualize Lexicon Differences

bind_rows(afinn, bing_and_nrc) |> 
  mutate(sign = ifelse(sentiment > 0, "positive", "negative")) |>
  ggplot(aes(episode_seq, sentiment, fill = method, alpha = sign)) +
  geom_col(show.legend = FALSE) +
  scale_alpha_manual(values = c(0.5, 1)) +
  scale_fill_manual(values = observable[1:3]) +
  facet_wrap(~ method, ncol = 1, scales = "free_y") +
  theme_minimal(base_size = 18)
16
Create a variable to distinguish positive/negative explicitly
17
Create a plot with episode number on the x-axis, sentiment on the y-axis, color the plot by method using fill, and create differences in transparency (alpha) by sign
18
Use columns to represent the data
19
Change alpha transparency to 50% for negative and keep at 100% (no transparency) for positive
20
Manually change the fill colors to our pre-defined observable palette
21
Create small multiples by the lexicon method and allow the y-axis to vary based on their respective data ranges
22
Apply ggplot2’s minimal theme and increase the base font size

bind_rows(afinn, bing_and_nrc) |> 
  ggplot(aes(episode_seq, sentiment, color = method)) +
  geom_smooth(se = FALSE, span = 0.4) +
  scale_color_manual(values = observable[1:3]) + 
  theme_minimal(base_size = 18) +
  theme(legend.position = "bottom") 
23
Use color here instead of fill since we’re mapping a line instead of a column
24
Fits a smoothed trend line through the sentiment values (a loess smooth by default here; span controls how much smoothing is applied and se = FALSE hides the confidence band)

Putting it all together

What if we want to see the top words broken down by sentiment and lexicon?

afinn <- tidy_bob |> 
  inner_join(get_sentiments("afinn")) |> 
  mutate(sentiment = if_else(value > 0, "positive", "negative")) |>
  count(word, sentiment, sort = TRUE) |>
  mutate(method = "AFINN")

bing <- tidy_bob |> 
  inner_join(get_sentiments("bing")) |> 
  count(word, sentiment, sort = TRUE) |> 
  mutate(method = "Bing et al.")


nrc <- tidy_bob |> 
  inner_join(get_sentiments("nrc") |> 
               filter(sentiment %in% c("positive", "negative"))) |> 
  count(word, sentiment, sort = TRUE) |> 
  mutate(method = "NRC")
26
Recoding this sentiment to negative if value < 0 and positive if value > 0

Top Words by Sentiment & Lexicon

library(wordcloud) 
library(reshape2)
27
Needed to restructure dataset for wordclouds
afinn |> 
  arrange(desc(n)) |>
  acast(word ~ sentiment, value.var = "n", fill = 0) |> 
  comparison.cloud(colors = observable[c(4, 1)], 
                   title.size = 2, 
                   scale=c(4, 1),
                   max.words = 100)

bing |> 
  arrange(desc(n)) |>
  acast(word ~ sentiment, value.var = "n", fill = 0) |> 
  comparison.cloud(colors = observable[c(4, 1)], 
                   scale=c(4, 1),
                   title.size = 2)

nrc |> 
  arrange(desc(n)) |>
  acast(word ~ sentiment, value.var = "n", fill = 0) |> 
  comparison.cloud(colors = observable[c(4, 1)], 
                   title.size = 2)

tf-idf

Word Frequency

In most documents, a few words appear very frequently while the majority of words appear only rarely.

season_words <- clean_bob |>
  select(season, cleaned_text) |>
  unnest_tokens(word, cleaned_text) |>
  count(season, word, sort = TRUE)

season_words
1
Gets the count of each word by season
# A tibble: 83,633 × 3
   season word      n
    <dbl> <chr> <int>
 1     13 i      4024
 2     12 i      3886
 3     11 i      3832
 4     10 i      3816
 5      8 i      3573
 6      7 i      3527
 7      4 i      3476
 8      9 i      3407
 9      5 i      3406
10      3 i      3385
# ℹ 83,623 more rows
total_words <- season_words |>
  group_by(season) |>
  summarize(total = sum(n))

total_words
2
Creates a count of total words by season
# A tibble: 14 × 2
   season total
    <dbl> <int>
 1      1 44304
 2      2 31904
 3      3 84253
 4      4 86537
 5      5 85996
 6      6 78691
 7      7 90347
 8      8 90281
 9      9 92870
10     10 94967
11     11 95939
12     12 97485
13     13 97133
14     14 57223

Word Frequency by Season

library(colorspace)
library(ggridges) 
season_words <- left_join(season_words, total_words)

observable_light <- lighten(observable, 0.5)

ggplot(season_words, aes(x = n/total, y = season, fill = as_factor(season))) +
  geom_density_ridges(show.legend = FALSE) +
  xlim(NA, 0.0009) +
  scale_fill_manual(values = c(observable, observable_light)[order(rep(seq_along(observable), 2))][1:14]) +
  theme_minimal(base_size = 18)
3
Combines both datasets by season
4
Create version of observable color palette that is 50% lighter
5
A fun geom from the ggridges package that allows you to visualize many density curves at once
6
Switching between observable and observable_light to utilize a larger (14) color palette

tf versus tf-idf

  • So far we’ve been looking at term frequency (tf)
  • Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents
  • This can be combined with term frequency to calculate a term’s tf-idf (the two quantities multiplied together), which is the frequency of a term adjusted for how rarely it is used (a by-hand sketch of this calculation follows)
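
Concretely, bind_tf_idf() (used below) computes tf = n / total words in the document and idf = ln(number of documents / number of documents containing the term); tf-idf is their product. A sketch of the same arithmetic done by hand on season_words (which already carries the per-season total from the earlier join):

n_seasons <- n_distinct(season_words$season)

season_words |>
  group_by(word) |>
  mutate(idf = log(n_seasons / n_distinct(season))) |>  # rarer across seasons => larger idf
  ungroup() |>
  mutate(tf     = n / total,                            # within-season frequency
         tf_idf = tf * idf) |>
  arrange(desc(tf_idf))
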
library(topicmodels)

season_tf_idf <- season_words |> 
  bind_tf_idf(word, season, n)

season_tf_idf
# A tibble: 83,633 × 7
   season word      n total     tf   idf tf_idf
    <dbl> <chr> <int> <int>  <dbl> <dbl>  <dbl>
 1     13 i      4024 97133 0.0414     0      0
 2     12 i      3886 97485 0.0399     0      0
 3     11 i      3832 95939 0.0399     0      0
 4     10 i      3816 94967 0.0402     0      0
 5      8 i      3573 90281 0.0396     0      0
 6      7 i      3527 90347 0.0390     0      0
 7      4 i      3476 86537 0.0402     0      0
 8      9 i      3407 92870 0.0367     0      0
 9      5 i      3406 85996 0.0396     0      0
10      3 i      3385 84253 0.0402     0      0
# ℹ 83,623 more rows

Highest tf-idf by Season

season_tf_idf |> 
  select(-total) |> 
  arrange(desc(tf_idf))
# A tibble: 83,633 × 6
   season word            n       tf   idf  tf_idf
    <dbl> <chr>       <int>    <dbl> <dbl>   <dbl>
 1      1 ok             66 0.00149   1.54 0.00229
 2      2 beefsquatch    23 0.000721  2.64 0.00190
 3     14 cricket        37 0.000647  2.64 0.00171
 4      2 rodney         16 0.000502  2.64 0.00132
 5      9 clem           45 0.000485  2.64 0.00128
 6     10 wharfy         43 0.000453  2.64 0.00119
 7     12 purrbo         42 0.000431  2.64 0.00114
 8      8 bleaken        38 0.000421  2.64 0.00111
 9      1 torpedo        31 0.000700  1.54 0.00108
10      2 hanky          13 0.000407  2.64 0.00108
# ℹ 83,623 more rows
season_tf_idf |> 
  group_by(season) |> 
  slice_max(tf_idf, n = 10) |> 
  ungroup() |> 
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = as_factor(season))) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c(observable, observable_light)[order(rep(seq_along(observable), 2))][1:14]) +
  facet_wrap(~ season, ncol = 5, scales = "free") +
  labs(x = "tf-idf", y = NULL) + 
  theme_minimal(base_size = 18)

Word Embeddings



“You shall know a word
by the company it keeps.”
      - John Rupert Firth (British linguist)

Two Approaches

Restructuring our Data

nested_eps <- tidy_bob |> 
  add_count(word) |>
  filter(n >= 50) |>
  select(episode_seq, word) |>
  nest(words = c(word))

nested_eps 
1
Adds the count, n, for each word; equivalent to group_by(word) |> mutate(n = n())
2
Keep only words that occur at least 50 times
3
nest creates a list-column; in this case, a dataframe of all words that occurred by each episode
# A tibble: 272 × 2
   episode_seq words             
         <int> <list>            
 1           1 <tibble [703 × 1]>
 2           2 <tibble [768 × 1]>
 3           3 <tibble [680 × 1]>
 4           4 <tibble [778 × 1]>
 5           5 <tibble [664 × 1]>
 6           6 <tibble [792 × 1]>
 7           7 <tibble [694 × 1]>
 8           8 <tibble [714 × 1]>
 9           9 <tibble [642 × 1]>
10          10 <tibble [843 × 1]>
# ℹ 262 more rows

What do these nested tibbles look like? Let’s look at words from episode 1:

nested_eps[1, 2][[1]]
[[1]]
# A tibble: 703 × 1
   word   
   <chr>  
 1 listen 
 2 talk   
 3 day    
 4 day    
 5 weekend
 6 wharf  
 7 day    
 8 reason 
 9 linda  
10 middle 
# ℹ 693 more rows

Creating a Window Function

library(slider)

slide_windows <- function(tbl, window_size) {
  
  skipgrams <- slider::slide(
    tbl,
    ~.x,
    .after = window_size - 1,
    .step = 1,
    .complete = TRUE
  )
  
  safe_mutate <- safely(mutate)
  
  out <- map2(skipgrams,
              1:length(skipgrams),
              ~ safe_mutate(.x, window_id = .y))

  out |>
    list_transpose() |>
    pluck("result") |>
    compact() |>
    bind_rows()
}
4
This function identifies skip-gram windows in order to calculate the skip-gram probabilities
5
Read more about the slide function here
6
Takes in a tibble (of words, in this case)
7
Applies a function to the window; here we’re simply returning the words as-is
8
Defines how many elements after the current one should be included in the window
9
How many elements to shift by when computing the window
10
If TRUE, it only evaluates complete windows (i.e. entire window size is available from the current element)
11
skipgrams is a list-column containing a skipgram window for each word, for each episode
12
Creates a version of mutate that won’t break the function, rather it’ll capture results and errors in a list object for each skipgram
13
map2 iterates over each of these skipgrams to append a window_id column (the index of that skipgram within the episode) so we can identify the windows when we unpack this deeply nested words list later
14
The function map2 applies to each skipgram and its index is safe_mutate, which returns a list of two elements per skipgram: successfully mutated results go into result and any failures go into error
15
Transposes each episode's words list: instead of a list of skipgrams (each containing a result and an error element from safe_mutate), we get two lists (result and error), each holding the full set of skipgrams for that episode
16
pluck pulls out the result list for each episode's words column, so each episode's list is now composed of a tibble for each skipgram window with its corresponding window_id
17
Removes empty skipgrams (i.e. where window was incomplete)
18
Combines all skipgram words and window_ids into one complete tibble in words column corresponding to each episode

Let’s Have a Look Around 👀

library(widyr)
library(furrr)

plan(multisession)

tidy_pmi <- nested_eps |> 
  mutate(words = future_map(words, \(x) slide_windows(x, 4L))) |>
  unnest(words) |>
  unite(window_id, episode_seq, window_id) |>
  pairwise_pmi(word, window_id)
19
Allows this process to be run computationally in parallel (huge time-saver 🙏) since each episode can be run independently of the others
20
Takes the words column and applies our slide_windows() function to each episode’s list with a skip-gram window of 4 words
21
Unnests the words column so it is no longer an embedded list per episode but instead we have a tibble with columns for episode_seq, word, and window_id. Instead of 272 rows (1 row for each episode) we now have 925,280 rows (1 row for each word in a skip-gram window for all 272 episodes)
22
Overwrites the window_id column to be a combination of episode_seq and window_id, separated by _
23
Uses each instance of a word and its associated window_id to calculate the pointwise mutual information (PMI): the logarithm of the probability of finding two words together in a window, normalized by the probability of finding each word alone.


tidy_pmi
# A tibble: 420,500 × 3
   item1   item2      pmi
   <chr>   <chr>    <dbl>
 1 talk    listen  0.773 
 2 day     listen -0.224 
 3 weekend listen  0.0570
 4 wharf   listen -0.490 
 5 reason  listen  0.801 
 6 linda   listen  0.110 
 7 middle  listen -1.46  
 8 ahead   listen -1.10  
 9 sell    listen  0.676 
10 burgers listen -0.322 
# ℹ 420,490 more rows
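
To make the PMI definition concrete, here is a toy sketch (hypothetical windows, not from the transcript data): words that co-occur in the same window more often than their individual frequencies would predict get a positive PMI.

library(widyr)
library(tibble)

toy <- tribble(
  ~window_id, ~word,
  "w1", "burger",
  "w1", "fries",
  "w2", "burger",
  "w2", "fries",
  "w3", "burger",
  "w3", "music"
)

# PMI(x, y) = log( p(x, y in the same window) / (p(x) * p(y)) )
toy |> pairwise_pmi(word, window_id)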

Singular Value Decomposition (SVD)

  • We can next determine the word vectors from the PMI values using singular value decomposition (SVD)
    • This is a data reduction technique akin to PCA (Principal Component Analysis): it works by decomposing our data onto a set of orthogonal axes (a short base-R sketch follows this list).
      • The first axis is chosen to capture as much of the variance as possible
      • Keeping that first axis fixed, the remaining orthogonal axes are rotated to maximize the variance in the second axis
      • This is then repeated for all the remaining axes
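
A quick base-R sketch of the decomposition itself (not part of the analysis pipeline): any numeric matrix X factors as U diag(d) V', with the singular values d ordered from the axis capturing the most variation to the least.

set.seed(123)
X <- matrix(rnorm(20), nrow = 5)
decomp <- svd(X)

decomp$d                                                    # singular values, largest first
all.equal(X, decomp$u %*% diag(decomp$d) %*% t(decomp$v))   # reconstructs X (TRUE)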

tidy_word_vectors <- tidy_pmi  |> 
  widely_svd(
    item1, item2, pmi,
    nv = 100,
    maxit = 1000
  )

tidy_word_vectors
24
Takes our tidy_pmi table, turns it into a wide matrix, performs dimensionality reduction on it, and returns the result in tidy format
25
item1 serves as the item we want to perform dimensionality reduction on, item2 is the feature that links items to one another, and pmi is the value we’re reducing
26
Number of principal components to estimate
27
Optional argument specifying the maximum number of iterations
# A tibble: 116,600 × 3
   item1   dimension    value
   <chr>       <dbl>    <dbl>
 1 talk            1  0.0273 
 2 day             1  0.0487 
 3 weekend         1 -0.0351 
 4 wharf           1 -0.0366 
 5 reason          1 -0.0341 
 6 linda           1  0.0604 
 7 middle          1 -0.0319 
 8 ahead           1 -0.0267 
 9 sell            1 -0.0293 
10 burgers         1 -0.00294
# ℹ 116,590 more rows

Which words are close to each other in this new feature-space of word embeddings?

nearest_neighbors <- function(df, token) {
  df |>
    widely(
      ~ {
        # repeat the target token's vector so it lines up with every row
        y <- .[rep(token, nrow(.)), ]
        # cosine similarity: dot product divided by the product of the norms
        res <- rowSums(. * y) / 
          (sqrt(rowSums(. ^ 2)) * sqrt(sum(.[token, ] ^ 2)))
        
        matrix(res, ncol = 1, dimnames = list(x = names(res)))
      },
      sort = TRUE
    )(item1, dimension, value) |>
    select(-item2)
}
28
This function will find the nearest words to any given example using our newly created word embeddings


tidy_word_vectors |> 
  nearest_neighbors(
    "weird"
    )
# A tibble: 1,166 × 2
   item1  value
   <chr>  <dbl>
 1 weird  1    
 2 makes  0.348
 3 sounds 0.315
 4 bad    0.313
 5 crazy  0.303
 6 uhhuh  0.271
 7 hiding 0.259
 8 bobby  0.252
 9 pig    0.252
10 havent 0.247
# ℹ 1,156 more rows
tidy_word_vectors |> 
  nearest_neighbors(
    "jimmy"
    )
# A tibble: 1,166 × 2
   item1  value
   <chr>  <dbl>
 1 jimmy  1    
 2 jr     0.596
 3 pesto  0.502
 4 zeke   0.398
 5 trev   0.375
 6 told   0.371
 7 pestos 0.356
 8 friend 0.291
 9 butt   0.267
10 pizza  0.260
# ℹ 1,156 more rows
tidy_word_vectors |> 
  nearest_neighbors(
    "fart"
    )
# A tibble: 1,166 × 2
   item1   value
   <chr>   <dbl>
 1 fart    1    
 2 farts   0.417
 3 loud    0.377
 4 gas     0.339
 5 iii     0.329
 6 fire    0.323
 7 teeth   0.287
 8 notice  0.266
 9 turkeys 0.263
10 pee     0.252
# ℹ 1,156 more rows

Dimensions Explaining Most Variation

tidy_word_vectors |>
  filter(dimension <= 9) |>
  mutate(sign = if_else(value > 0, "positive", "negative")) |> 
  group_by(dimension) |>
  top_n(10, abs(value)) |>
  ungroup() |>
  mutate(item1 = reorder_within(item1, value, dimension)) |>
  ggplot(aes(item1, value, fill = as_factor(dimension), alpha = sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ dimension, scales = "free_y", ncol = 3) +
  scale_x_reordered() +
  scale_fill_manual(values = observable[1:9]) +
  scale_alpha_manual(values = c(0.5, 1)) +
  coord_flip() +
  labs(
    x = NULL,
    y = "Value",
    title = "First 9 principal components for text of Bob's Burgers scripts",
    subtitle = "Top words contributing to the components that explain the most variation") + 
  theme_minimal(base_size = 18)

Pre-trained Embeddings

  • Word embeddings work best when you have a large amount of data, and our Bob’s Burgers dataset is a bit small
    • In that case, you may want to explore using pre-trained word-embeddings
  • We’re going to use the wordsalad package, which provides convenient functions for fitting several of the more popular embedding approaches (such as GloVe) to our text
library(wordsalad)

glove_word_vec <- 
  glove(text = clean_bob$cleaned_text,
        dim = 100, 
        window = 4, 
        min_count = 50, 
        stopwords = stop_words$word, 
        composition = "tibble")

glove_word_vec
29
Global Vectors for Word Representation, aka GloVe
# A tibble: 1,166 × 101
   tokens         V1      V2      V3       V4      V5      V6      V7       V8
   <chr>       <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
 1 aa         0.0849 -0.599   0.275  -0.233    0.0862 -0.104  -0.0301 -0.169  
 2 admit      0.718  -0.213  -0.211  -0.0504   0.508  -0.104  -0.0441  0.304  
 3 breaking  -0.131   0.569  -0.236  -0.00908  0.447  -0.302  -0.307   0.212  
 4 ceiling   -0.353  -0.346  -0.356  -0.418    0.0683 -0.0663  0.176  -0.183  
 5 character -0.158   0.423   0.889  -0.247    0.536   0.0749  0.138   0.902  
 6 cleaning   0.0753 -0.0813 -0.0965 -0.205   -0.260   0.207   0.933  -0.117  
 7 closet     0.0156  0.157   0.391  -0.566    0.188  -0.189   0.506   0.354  
 8 college   -0.301   0.312   0.138   0.219   -0.390   0.218  -0.428  -0.227  
 9 corn      -0.251  -0.313  -0.508  -0.470   -0.393   0.269  -0.130   0.0282 
10 destroy    0.314   0.0674  0.642   0.161   -0.434  -0.0652  0.770   0.00947
# ℹ 1,156 more rows
# ℹ 92 more variables: V9 <dbl>, V10 <dbl>, V11 <dbl>, V12 <dbl>, V13 <dbl>,
#   V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>,
#   V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>,
#   V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>,
#   V32 <dbl>, V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>,
#   V38 <dbl>, V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>, …

Tidying our GloVe(s)

tidy_glove <- 
  glove_word_vec |> 
  pivot_longer(contains("V"),
               names_to = "dimension", 
               names_prefix = "V", 
               names_transform = list(dimension = as.integer)) |> 
  rename(item1 = tokens)

tidy_glove
# A tibble: 116,600 × 3
   item1 dimension   value
   <chr>     <int>   <dbl>
 1 aa            1  0.0849
 2 aa            2 -0.599 
 3 aa            3  0.275 
 4 aa            4 -0.233 
 5 aa            5  0.0862
 6 aa            6 -0.104 
 7 aa            7 -0.0301
 8 aa            8 -0.169 
 9 aa            9  0.344 
10 aa           10  0.796 
# ℹ 116,590 more rows
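
Because tidy_glove now has the same item1/dimension/value shape as tidy_word_vectors, the nearest_neighbors() helper from above can be reused on it directly (assuming the token appears in the GloVe vocabulary), e.g.:

tidy_glove |>
  nearest_neighbors("jimmy")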

Most Variation (Pre-Trained)

tidy_glove |> 
  filter(dimension <= 9)  |> 
  mutate(sign = if_else(value > 0, "positive", "negative")) |> 
  group_by(dimension) |>
  top_n(10, abs(value)) |>
  ungroup() |>
  mutate(item1 = reorder_within(item1, value, dimension)) |>
  ggplot(aes(item1, value, fill = as_factor(dimension), alpha = sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~dimension, scales = "free_y", ncol = 3) +
  scale_x_reordered() +
  scale_fill_manual(values = observable[1:9]) +
  scale_alpha_manual(values = c(0.5, 1)) +
  coord_flip() +
  labs(
    x = NULL,
    y = "Value",
    title = "First 9 principal components for text of Bob's Burgers scripts using GloVe word embeddings",
    subtitle = "Top words contributing to the components that explain the most variation") + 
  theme_minimal(base_size = 18)

Topic Modeling

Latent Dirichlet allocation (LDA)

  • A type of Bayesian network model that makes two general assumptions
    • Each document is a mixture of topics
    • Each topic is a distribution of words

LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document.

Cast Tidy Data into a Matrix

library(topicmodels)

bob_matrix <- season_words |> 
  anti_join(stop_words) |> 
  cast_dtm(season, word, n)
  
bob_matrix
1
This function turns tidy data (with one token per row/observation) into a Document x Term matrix (necessary data structure for topic modeling)
<<DocumentTermMatrix (documents: 14, terms: 25814)>>
Non-/sparse entries: 76935/284461
Sparsity           : 79%
Maximal term length: 28
Weighting          : term frequency (tf)
bob_matrix$dimnames$Docs
2
This shows us the row names of our sparse matrix (the documents that make up our corpus)
 [1] "11" "12" "10" "13" "9"  "4"  "7"  "8"  "3"  "5"  "6"  "14" "1"  "2" 
bob_matrix$dimnames$Terms |> head(50)
3
Here we see the first 50 column names (tokens/words) of our sparse matrix (out of 25,814 total)
 [1] "yeah"   "uh"     "gonna"  "bob"    "tina"   "hey"    "gene"   "god"   
 [9] "wait"   "louise" "teddy"  "guys"   "kids"   "time"   "dad"    "linda" 
[17] "mom"    "love"   "um"     "day"    "huh"    "ii"     "people" "boo"   
[25] "fine"   "stop"   "whoa"   "ah"     "fun"    "lot"    "nice"   "lin"   
[33] "bobby"  "stuff"  "guess"  "ooh"    "cat"    "pretty" "house"  "school"
[41] "couch"  "bad"    "ha"     "aah"    "jimmy"  "coming" "ow"     "cool"  
[49] "feel"   "home"  

Running our LDA Model

bob_lda <- LDA(bob_matrix, 
               k = 5,
               control = list(seed = 01302025))

bob_lda
4
Need to specify the number of topics (read more about how one might select the best K value here)
5
Set a seed so that the output of the model is predictable
A LDA_VEM topic model with 5 topics.

Per-Topic-Per-Word Probabilities

bob_topics <- tidy(bob_lda, matrix = "beta") 

bob_topics
# A tibble: 129,070 × 3
   topic term    beta
   <int> <chr>  <dbl>
 1     1 yeah  0.0180
 2     2 yeah  0.0179
 3     3 yeah  0.0191
 4     4 yeah  0.0173
 5     5 yeah  0.0181
 6     1 uh    0.0128
 7     2 uh    0.0141
 8     3 uh    0.0187
 9     4 uh    0.0158
10     5 uh    0.0136
# ℹ 129,060 more rows
bob_top_terms <- bob_topics |> 
  group_by(topic) |> 
  slice_max(beta, n = 10) |> 
  ungroup() |> 
  arrange(topic, -beta)

bob_top_terms
# A tibble: 50 × 3
   topic term     beta
   <int> <chr>   <dbl>
 1     1 yeah  0.0180 
 2     1 uh    0.0128 
 3     1 bob   0.0120 
 4     1 gonna 0.0120 
 5     1 hey   0.00958
 6     1 tina  0.00891
 7     1 gene  0.00774
 8     1 god   0.00627
 9     1 dad   0.00615
10     1 wait  0.00607
# ℹ 40 more rows
bob_top_terms |> 
  mutate(term = reorder_within(term, beta, topic)) |> 
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() + 
  scale_fill_manual(values = observable[1:5]) + 
  theme_minimal(base_size = 18)

Document-topic Probabilities

bob_documents <- tidy(bob_lda, matrix = "gamma")
bob_documents
# A tibble: 70 × 3
   document topic       gamma
   <chr>    <int>       <dbl>
 1 11           1 0.000000666
 2 12           1 0.000000668
 3 10           1 0.000000668
 4 13           1 0.000000671
 5 9            1 0.000000660
 6 4            1 0.000000741
 7 7            1 1.00       
 8 8            1 0.000000699
 9 3            1 1.00       
10 5            1 0.00232    
# ℹ 60 more rows
bob_documents |> 
  mutate(document = fct(document, levels = seq(1:14) |> as.character())) |> 
  group_by(topic) |> 
  slice_max(gamma, n = 10) |> 
  ungroup() |> 
  arrange(topic, -gamma) |> 
  ggplot(aes(gamma, document, fill = factor(topic))) +
  geom_col(show.legend = FALSE, alpha = 0.75) +
  scale_fill_manual(values = observable[1:5]) + 
  theme_minimal(base_size = 18)

Bob’s kinda boring

A better example:

library(gutenbergr)

books <- gutenberg_download(c(61, 408, 833, 14977), 
                            meta_fields = c("title", "author"))

books |> distinct(author, title)
# A tibble: 5 × 2
  author                                       title                            
  <chr>                                        <chr>                            
1 Marx, Karl                                   "The Communist Manifesto"        
2 Engels, Friedrich                            "The Communist Manifesto"        
3 Du Bois, W. E. B. (William Edward Burghardt) "The Souls of Black Folk"        
4 Veblen, Thorstein                            "The Theory of the Leisure Class"
5 Wells-Barnett, Ida B.                        "The Red Record\nTabulated Stati…

Book Cleaning

books <- books |> 
  distinct(text, .keep_all = TRUE) |>
  mutate(author = if_else(author == "Marx, Karl",
                          "Marx, Karl; Engels, Friedrich",
                          author),
         title = if_else(str_detect(title, "The Red Record"),
                         "The Red Record",
                         title))

books
6
Remove duplicates of The Communist Manifesto
7
Explicitly adding Engels to Communist Manifesto authorship
8
Truncating The Red Record for readability
# A tibble: 19,636 × 4
   gutenberg_id text                                                title author
          <int> <chr>                                               <chr> <chr> 
 1           61 "The Communist Manifesto"                           The … Marx,…
 2           61 ""                                                  The … Marx,…
 3           61 "by Karl Marx and Friedrich Engels"                 The … Marx,…
 4           61 "[From the English edition of 1888, edited by Frie… The … Marx,…
 5           61 "Contents"                                          The … Marx,…
 6           61 " I. BOURGEOIS AND PROLETARIANS"                    The … Marx,…
 7           61 " II. PROLETARIANS AND COMMUNISTS"                  The … Marx,…
 8           61 " III. SOCIALIST AND COMMUNIST LITERATURE"          The … Marx,…
 9           61 " IV. POSITION OF THE COMMUNISTS IN RELATION TO TH… The … Marx,…
10           61 "A spectre is haunting Europe—the spectre of Commu… The … Marx,…
# ℹ 19,626 more rows

Divide Each Book into Sections

by_section <- books |> 
  filter(text != "") |>
  mutate(text = str_remove_all(text, "[:digit:]"),
         word_count = str_count(text, "\\S+")) |>
  group_by(title) |> 
  mutate(cumulative_words = cumsum(word_count),
         row_chunk = ceiling(cumulative_words / 2000)) |>
  group_by(title, row_chunk) |> 
  summarize(section = str_c(text, collapse = " "), .groups = "drop") |>
  unite(document, title, row_chunk, sep = "_", remove = FALSE) |>
  select(document, section)

by_section
9
Remove rows with empty strings
10
Remove all digits from our raw text column
11
Count number of words per row
12
Calculate cumulative word count per book
13
Creates a row_chunk index that groups observations of ~ 2000 words together
14
Collapses all text rows within each row_chunk into a single section string, so each book section is about the same length
15
Combines title and row_chunk into a new variable called document, separated by _

Divide Each Book into Sections

# A tibble: 112 × 2
   document                  section                                            
   <chr>                     <chr>                                              
 1 The Communist Manifesto_1 "The Communist Manifesto by Karl Marx and Friedric…
 2 The Communist Manifesto_2 "powers of the nether world whom he has called up …
 3 The Communist Manifesto_3 "the bourgeois family-relations; modern industrial…
 4 The Communist Manifesto_4 "abolition of bourgeois property, the standard of …
 5 The Communist Manifesto_5 "and deserted with loud and irreverent laughter. O…
 6 The Communist Manifesto_6 "Bourgeois Socialism attains adequate expression, …
 7 The Red Record_1          "The Red Record: Tabulated Statistics and Alleged …
 8 The Red Record_2          "describe as such? Not by any means. With the Sout…
 9 The Red Record_3          "Va.; Nov. , Samuel Motlow, Lynchburg, Va.; Nov. ,…
10 The Red Record_4          "  Ford was greatly hurt and the Negro was held to…
# ℹ 102 more rows

Create our Tidy Text Dataset

by_section_word <- by_section |> 
  unnest_tokens(word, section)

by_section_word
# A tibble: 221,458 × 2
   document                  word     
   <chr>                     <chr>    
 1 The Communist Manifesto_1 the      
 2 The Communist Manifesto_1 communist
 3 The Communist Manifesto_1 manifesto
 4 The Communist Manifesto_1 by       
 5 The Communist Manifesto_1 karl     
 6 The Communist Manifesto_1 marx     
 7 The Communist Manifesto_1 and      
 8 The Communist Manifesto_1 friedrich
 9 The Communist Manifesto_1 engels   
10 The Communist Manifesto_1 from     
# ℹ 221,448 more rows

Find Document-Word Counts

word_counts <- by_section_word |> 
  anti_join(stop_words) |> 
  count(document, word, sort = TRUE)

word_counts
# A tibble: 56,260 × 3
   document                           word         n
   <chr>                              <chr>    <int>
 1 The Theory of the Leisure Class_43 class       41
 2 The Theory of the Leisure Class_45 class       37
 3 The Souls of Black Folk_26         negro       34
 4 The Theory of the Leisure Class_26 life        34
 5 The Communist Manifesto_3          property    33
 6 The Red Record_12                  white       33
 7 The Theory of the Leisure Class_1  class       32
 8 The Theory of the Leisure Class_51 class       32
 9 The Red Record_1                   white       31
10 The Theory of the Leisure Class_33 class       31
# ℹ 56,250 more rows
sections_dtm <- word_counts |> 
  cast_dtm(document, word, n)

sections_dtm
<<DocumentTermMatrix (documents: 112, terms: 13844)>>
Non-/sparse entries: 56260/1494268
Sparsity           : 96%
Maximal term length: 19
Weighting          : term frequency (tf)
sections_lda <- LDA(sections_dtm, k = 4, control = list(seed = 01302025))

sections_lda
A LDA_VEM topic model with 4 topics.

Visualize Top Terms

sections_topics <- tidy(sections_lda, matrix = "beta")
sections_topics |> 
  arrange(-beta)
# A tibble: 55,376 × 3
   topic term        beta
   <int> <chr>      <dbl>
 1     2 class    0.0216 
 2     2 leisure  0.0163 
 3     1 life     0.0161 
 4     2 life     0.0127 
 5     3 negro    0.0110 
 6     3 white    0.00946
 7     4 life     0.00904
 8     1 class    0.00884
 9     2 economic 0.00795
10     3 black    0.00774
# ℹ 55,366 more rows
top_terms <- sections_topics |> 
  group_by(topic) |> 
  slice_max(beta, n = 5) |> 
  ungroup() |> 
  arrange(topic, -beta)

top_terms |> 
  mutate(term = reorder_within(term, beta, topic)) |> 
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = observable[1:4]) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() + 
  theme_minimal(base_size = 18)

Per-document classification

sections_gamma <- tidy(sections_lda, matrix = "gamma")
sections_gamma
# A tibble: 448 × 3
   document                           topic     gamma
   <chr>                              <int>     <dbl>
 1 The Theory of the Leisure Class_43     1 0.0000349
 2 The Theory of the Leisure Class_45     1 0.0000349
 3 The Souls of Black Folk_26             1 0.0000320
 4 The Theory of the Leisure Class_26     1 1.00     
 5 The Communist Manifesto_3              1 1.00     
 6 The Red Record_12                      1 0.0000343
 7 The Theory of the Leisure Class_1      1 0.0000347
 8 The Theory of the Leisure Class_51     1 0.0000372
 9 The Red Record_1                       1 0.0000351
10 The Theory of the Leisure Class_33     1 0.799    
# ℹ 438 more rows
sections_gamma <- sections_gamma |> 
  separate(document, c("title", "section"), sep = "_", convert = TRUE)

sections_gamma |> 
  mutate(title = reorder(title, gamma * topic)) |> 
  ggplot(aes(factor(topic), gamma, fill = factor(topic))) +
  geom_boxplot(alpha = 0.5, show.legend = FALSE) +
  scale_fill_manual(values = observable[1:4]) +
  facet_wrap(~ title) +
  labs(x = "topic", y = expression(gamma)) + 
  theme_minimal(base_size = 18)

Section Classifications

section_classifications <- sections_gamma |> 
  group_by(title, section) |> 
  slice_max(gamma) |> 
  ungroup()

section_classifications
# A tibble: 112 × 4
   title                   section topic gamma
   <chr>                     <int> <int> <dbl>
 1 The Communist Manifesto       1     4 1.00 
 2 The Communist Manifesto       2     2 1.00 
 3 The Communist Manifesto       3     1 1.00 
 4 The Communist Manifesto       4     4 0.771
 5 The Communist Manifesto       5     2 0.967
 6 The Communist Manifesto       6     2 1.00 
 7 The Red Record                1     3 1.00 
 8 The Red Record                2     3 1.00 
 9 The Red Record                3     3 0.727
10 The Red Record                4     1 0.646
# ℹ 102 more rows

Book-Topic Consensus

book_topics <- section_classifications |> 
  count(title, topic) |> 
  group_by(title) |> 
  slice_max(n, n = 1) |>  
  ungroup() |> 
  transmute(consensus = title, topic)

book_topics
# A tibble: 4 × 2
  consensus                       topic
  <chr>                           <int>
1 The Communist Manifesto             2
2 The Red Record                      3
3 The Souls of Black Folk             3
4 The Theory of the Leisure Class     2

Incorrect Predictions by Section

section_classifications |> 
  inner_join(book_topics, by = "topic") |> 
  filter(title != consensus) |> 
  count(title, consensus)
# A tibble: 4 × 3
  title                           consensus                           n
  <chr>                           <chr>                           <int>
1 The Communist Manifesto         The Theory of the Leisure Class     3
2 The Red Record                  The Souls of Black Folk            13
3 The Souls of Black Folk         The Red Record                     35
4 The Theory of the Leisure Class The Communist Manifesto            27

Incorrect Predictions by Word

assignments <- broom::augment(sections_lda, 
                              data = sections_dtm)
assignments
16
Takes our model and appends information to each observation in the original data
# A tibble: 56,260 × 4
   document                           term  count .topic
   <chr>                              <chr> <dbl>  <dbl>
 1 The Theory of the Leisure Class_43 class    41      2
 2 The Theory of the Leisure Class_45 class    37      2
 3 The Souls of Black Folk_26         class     1      2
 4 The Theory of the Leisure Class_26 class     4      1
 5 The Communist Manifesto_3          class    16      1
 6 The Theory of the Leisure Class_1  class    32      2
 7 The Theory of the Leisure Class_51 class    32      2
 8 The Theory of the Leisure Class_33 class    31      1
 9 The Theory of the Leisure Class_28 class    30      2
10 The Theory of the Leisure Class_30 class     5      1
# ℹ 56,250 more rows

Incorrect Predictions by Word

assignments <- assignments |>
  separate(document, c("title", "section"),
           sep = "_", convert = TRUE) |>
  inner_join(book_topics,
             by = c(".topic" = "topic"))

assignments
17
Now we can combine these data with book_topics to see which words were incorrectly classified
# A tibble: 80,430 × 6
   title                           section term  count .topic consensus         
   <chr>                             <int> <chr> <dbl>  <dbl> <chr>             
 1 The Theory of the Leisure Class      43 class    41      2 The Communist Man…
 2 The Theory of the Leisure Class      43 class    41      2 The Theory of the…
 3 The Theory of the Leisure Class      45 class    37      2 The Communist Man…
 4 The Theory of the Leisure Class      45 class    37      2 The Theory of the…
 5 The Souls of Black Folk              26 class     1      2 The Communist Man…
 6 The Souls of Black Folk              26 class     1      2 The Theory of the…
 7 The Theory of the Leisure Class       1 class    32      2 The Communist Man…
 8 The Theory of the Leisure Class       1 class    32      2 The Theory of the…
 9 The Theory of the Leisure Class      51 class    32      2 The Communist Man…
10 The Theory of the Leisure Class      51 class    32      2 The Theory of the…
# ℹ 80,420 more rows

Confusion Matrix

library(scales)

assignments |> 
  count(title, consensus, wt = count) |> 
  mutate(across(c(title, consensus), ~str_wrap(., 20))) |> 
  group_by(title) |> 
  mutate(percent = n / sum(n)) |> 
  ggplot(aes(consensus, title, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = observable_light[4], label = percent_format()) +
  theme_minimal(base_size = 18) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank()) +
  labs(x = "Book words were assigned to",
       y = "Book words came from",
       fill = "% of assignments")

Two-Topic Model?

Further Learning & Resources

Methods and Implementation

UW Computing Resources

Thanks!