Tf-idf stands for term frequency - inverse document frequency: we take the word counts per season and multiply them by the scaled inverse fraction of seasons that contain the word. Simply put, we penalize words that are common across all seasons, and reward ones that are not.
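This calculation is what tidytext's bind_tf_idf() does for us. Here is a minimal sketch on an invented per-season count table (the words, seasons and counts below are made up for illustration, not taken from the real subtitle data):

```r
library(dplyr)
library(tidytext)

# Toy per-season word counts: "morty" appears in every season,
# "pickle" only in season 3.
word_counts <- tibble::tibble(
  season = c("S1", "S2", "S3", "S3"),
  word   = c("morty", "morty", "morty", "pickle"),
  n      = c(50, 40, 45, 12)
)

word_counts %>%
  bind_tf_idf(word, season, n) %>%
  arrange(desc(tf_idf))
# "pickle" gets idf = log(3/1) > 0, while "morty", present in all
# three seasons, gets idf = log(3/3) = 0 and hence tf_idf = 0.
```

Words shared by every season are zeroed out entirely, which is exactly the penalty described above.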
Thus far we have looked at all words across seasons. But where do the seasons differ from each other? And can we summarise each season using a handful of topics? To answer the first question, text mining's most notorious statistic, tf-idf, comes to the rescue.

Looking at the largest connected network, we arrive at the same conclusion as with term frequencies: Rick and Morty are the most important words. They are at the center of the network and so have the highest degree centrality scores. Besides visualising the importance of words in our network, we can similarly differentiate between words that precede either Rick or Morty. These are all the 1st degree connections (words) that have an edge pointing towards the main characters, but aren't shared among them. Looking at the red nodes, we recognize many of the things Rick throws at Morty: "Relax Morty!… It's science Morty!… Run Morty!". There are also a handful of words that precede both characters, like "Geez", "Boy" or "God". All other words that are more than one degree away are colored blue, as out of range.
The $Text column contains the subtitle text, surrounded by additional variables for line id, timestamp, season and episode number.

    str(df)
    # Read: 3 seasons, 31 episodes
    # 'data.frame': 16821 obs.
    # $ serie : chr "rick and morty" "rick and morty" "rick and morty" "rick and morty" ...
    # $ Text  : chr "I got a surprise for you, Morty." "It's the middle of the night. Rick, what's going on?" ...

This is the structure preferred by the tidytext package, as it is by the rest of the tidyverse. We can similarly get the number of times each two words appear after one another, called bi-grams:

    df %>%
      unnest_tokens(bigram, Text, token = "ngrams", n = 2)

Besides calculating summary statistics on bi-grams, we can now construct a network of words according to co-occurrence using igraph, the go-to package for network analysis in R. This igraph object contains a directed network, where the vertices are the words and an edge exists between each pair that appear after one another more than twice. Representing the text as a graph, we can calculate things such as degree centrality, and plot the results.
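The bigram-to-network pipeline can be sketched end to end; the stand-in data frame below is invented (the real df holds one subtitle line per row), but the filter on pairs appearing more than twice mirrors the rule described above:

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(igraph)

# Invented stand-in for the subtitle data frame.
df <- tibble::tibble(
  Text = c("run morty run morty run morty",
           "relax morty relax morty relax morty",
           "it's science morty")
)

bigram_counts <- df %>%
  unnest_tokens(bigram, Text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE) %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# Keep word pairs that follow one another more than twice and
# turn them into a directed igraph network; the count column n
# becomes an edge attribute.
g <- bigram_counts %>%
  filter(n > 2) %>%
  graph_from_data_frame()
```

Splitting the bigram into word1/word2 first is what lets graph_from_data_frame() treat the first two columns as the edge list and carry n along as a weight.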
One man's weekend project, another man's treasure

After reading the book Tidy Text Mining online, I have been wanting to try out some of the concepts outlined in the book, and the functions of the accompanying package, on an interesting dataset. With season 3 of Rick and Morty coming to an end last week, the stars have finally aligned to roll up my sleeves and have some fun with text mining. It is very easy to find English subtitles for pretty much anything on the Internet, so I was pretty stoked to find Francois Keck's subtools package on GitHub, which allows for reading .srt files (the usual format for subtitles) straight into R. With subtools, an entire series can be read with one command from the containing folder, read.subtitles.serie(). We convert the resulting MultiSubtitles object to a data.frame with a second command, subDataFrame().

    library(subtools)
    a <- read.subtitles.serie(dir = "/series/rick and morty/")