Words in Shakespeare: exploring the data

I am a big fan of Shakespeare, so I can confidently tell you that he used a lot of words across his works. How many? A whole bunch. But if that answer isn’t very satisfying to you, then strap in, because I decided to use R to explore the words used across the 37 plays generally agreed to have been written by him. Ultimately, I made a shiny app that lets anybody find out more about the words they’re interested in; my next blog posts will focus more on that. In this post, I’ll concentrate on sorting through the data itself.

Sourcing Shakespeare

To get the full texts of Shakespeare’s plays, I used the package gutenbergr, which makes it easy to find works in the public domain in the Project Gutenberg collection. As well as gutenbergr, I loaded the package data.table, as I prefer this data structure to data.frames.

I needed the gutenberg_id of the texts I wanted so I would be able to download them. As a first step, I identified all Project Gutenberg works that had Shakespeare as an author by filtering the gutenbergr dataset gutenberg_works. By checking the filtered output, I could then identify the works I actually wanted to include and filter the dataset again to only include these. Not all Shakespeare works were relevant to me - I didn’t want to include his poetry, for example, and there were a number of duplicates - so it is definitely worthwhile checking your filtered results if you are repeating this with a different author.

Using the gutenbergr function gutenberg_download(), I was then able to create a data.table containing the text of the works I wanted by referencing their gutenberg_id. I made sure to use the meta_fields argument to include the title of each work in the download, so each line of text was linked to a play.

# Load the packages needed

library(gutenbergr)
library(data.table)

# Obtain the metadata for Shakespeare plays

shake_metadata <- as.data.table(gutenberg_works(author == "Shakespeare, William"))
shake_metadata <- shake_metadata[gutenberg_id >= 1500 & gutenberg_id <= 1541 
                                 & gutenberg_id != 1505 & gutenberg_id != 1525]

# Download plays

shake_plays <- as.data.table(gutenberg_download(shake_metadata$gutenberg_id, 
                                                meta_fields = "title"))
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org

Great! Now I had a dataset that had a row for every line from each of Shakespeare’s plays.
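If you want to double-check what actually came down, a couple of quick queries along these lines will show which plays are included and how many lines of text each contributes (just a sketch, so I haven’t included the output):

# Titles that were downloaded
shake_plays[, unique(title)]

# Number of text rows (lines) per play
shake_plays[, .N, by = title]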

Making the data work for me

However, I wanted to explore words in Shakespeare, not lines, so I needed to break things down. For this, I needed tidytext, a really helpful package for manipulating text data. The function unnest_tokens() reshapes the dataset so it is one token per row. I had two types of token I was interested in: sentences and words. I decided to deal with the sentences first because this would be a more helpful approach for data cleaning; once the words are isolated it is quite hard to tell if they were originally in a character’s line or just part of the stage directions.

I wanted sentences because I was going to include a feature in my app that delivered a random quotation containing the word the user was interested in. The original dataset had a row for each line, but a line wasn’t quite right because many lines are not full sentences, so the user would end up with a half-finished thought. I also considered using the character’s full speech the word was in, but that wasn’t ideal either because some characters do go on a bit, and it would be hard for the casual reader to quickly spot where their word features. A sentence seemed like a good middle ground.

So, the unnest_tokens() function came in handy. I set the to_lower argument to FALSE because I wanted to preserve the original case of the quotations.

After viewing the output, I did a little data cleaning to remove sentences that were just stage directions, scene information or speaker names. You can see that I’ve used regular expressions to specify that a sentence shouldn’t start with a square bracket, a word in capitals or a digit, which all seemed to indicate non-spoken text. This isn’t perfect: the text clearly contains other stage directions written as ordinary sentences, but those are much harder to identify with these methods. For now, this will be acceptable.

library(tidytext)

shake_sentence <- unnest_tokens(shake_plays, sentence, text, 
                                token = "sentences", to_lower = FALSE)

shake_sentence <- shake_sentence[!grepl("^\\[", sentence) 
                                 & !grepl("^[A-Z]{2,}", sentence) 
                                 & !grepl("^\\d", sentence)]

I could then unnest the words from the sentences. The only difference here is that I left out the to_lower argument, letting it default to TRUE. This was because I didn’t want to arbitrarily differentiate between words that had their first letter capitalised simply because they were at the start of a sentence and the same words appearing in lower case in the middle of a sentence.

shake_words <- unnest_tokens(shake_sentence, word, sentence)

I also wanted a dataset that held information on how many words were in each play. Using data.table, this is as simple as using .N, which returns the number of rows in the data.table when used in the j element of the data.table query. I will do a post about data.table syntax at some point, but if you are unfamiliar with the structure or use of data.tables, I recommend this introduction.

As I wanted this information for each play, I included the by = title argument.

all_words <- shake_words[, .(total_words = .N), by = title]
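If the DT[i, j, by] pattern is new to you, here is a tiny self-contained sketch of what that query is doing, using an invented toy table rather than the Shakespeare data:

library(data.table)

# A toy table: one row per word, tagged with the play it came from (invented data)
toy <- data.table(title = c("Play A", "Play A", "Play A", "Play B"),
                  word  = c("the", "quick", "brown", "fox"))

# .N in the j slot counts rows; by = title does the count separately for each play,
# so this returns one row per title: Play A with 3 words, Play B with 1
toy[, .(total_words = .N), by = title]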

But I wasn’t finished with counting words. I wanted to know how many words were in each type of play. That meant creating vectors of the titles in each group. As I didn’t have an existing dataset with this information, I had to do this manually, which was a bit tedious but worth doing because I would need this information a number of times. I also didn’t need to create a vector of problem plays, as they could be defined as simply not being any of the others (which, ironically, is pretty much how they’re defined in literary studies too…).

comedies <- c("The Comedy of Errors", "The Taming of the Shrew", 
              "The Two Gentlemen of Verona", "Love's Labour's Lost", 
              "A Midsummer Night's Dream", "The Merchant of Venice", 
              "The Merry Wives of Windsor", "Much Ado about Nothing", 
              "As You Like It", "Twelfth Night; Or, What You Will", 
              "The Winter's Tale", "The Tempest")
histories <- c("King Henry VI, First Part", 
               "History of King Henry the Sixth, Second Part", 
               "The History of King Henry the Sixth, Third Part", 
               "The Tragedy of King Richard III", "King John", 
               "The Tragedy of King Richard the Second", 
               "King Henry IV, the First Part", "King Henry IV, Second Part", 
               "The Life of King Henry V", "The Life of Henry the Eighth")
tragedies <- c("The Tragedy of Titus Andronicus", "Romeo and Juliet", 
               "Julius Caesar", "Hamlet, Prince of Denmark",
               "Othello, the Moor of Venice", "The Tragedy of King Lear", 
               "Macbeth", "Antony and Cleopatra", "The Tragedy of Coriolanus", 
               "The Life of Timon of Athens", "Cymbeline")

all_words[, type := ifelse(title %in% comedies, "Comedies", 
                           ifelse(title %in% histories, "Histories", 
                                  ifelse(title %in% tragedies, "Tragedies", 
                                         "Problem plays")
                                  )
                           )]
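
# (Worth a quick sanity check after typing the titles in by hand: anything the
#  three vectors fail to match falls through to "Problem plays", so listing those
#  titles shows both the genuine problem plays and any typos.)
all_words[type == "Problem plays", title]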

all_type_words <- all_words[, sum(total_words), by = type]
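
As an aside, recent versions of data.table also provide fcase(), which can express the same classification without nesting ifelse() calls. A sketch that should be equivalent to the assignment above:

all_words[, type := fcase(title %in% comedies,  "Comedies",
                          title %in% histories, "Histories",
                          title %in% tragedies, "Tragedies",
                          default = "Problem plays")]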

Exploring the data

Now my datasets were all ready, I could begin asking questions!

For example, the number of rows in shake_words answers the question at the start about how many words there are in Shakespeare’s plays (give or take a few from rogue stage directions).

shake_words[, .N]
## [1] 808954

Or I can combine the length() and unique() functions to find out how many unique words are in the dataset.

length(unique(shake_words[, word]))
## [1] 25435
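
Since shake_words is a data.table, the same count can also be written with data.table’s uniqueN() helper inside the query; an equivalent one-liner:

shake_words[, uniqueN(word)]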

It is very straightforward to adapt this to find out how many times a particular word of your choice is used. Because my imagination knows no bounds, I’m using the word “word” for this example.

shake_words[word == "word", .N]
## [1] 493

That’s quite a lot! But which play uses “word” most often? There are a few steps to this:

  • Filter the dataset of all words, shake_words, to just the rows that have the word “word” in the word column (I’m already regretting my choice of, er, word). This is in the first part of the data.table query.

  • Using .N again, count the number of rows, this time specifying that they should be counted while grouped by title, so I know how many times the word has appeared in each play.

  • Then, as a new query, select the title with the highest total using the max() function.

word_in_plays <- shake_words[word == "word", .(focus_word_freq = .N), 
                             by = title]
word_in_plays[focus_word_freq == max(focus_word_freq), title]
## [1] "Love's Labour's Lost"

And how many times does “word” make an appearance? I only need a slight change to the code above, returning the count column (focus_word_freq) rather than the title.

word_in_plays[focus_word_freq == max(focus_word_freq), focus_word_freq]
## [1] 33

Well, 33 is certainly a lot, but as Love’s Labour’s Lost is not a widely performed play, I’m struggling to think of an example. Luckily, I can extract an example sentence without too much hassle.

I take the dataset of sentences I created earlier (shake_sentence), and filter it on two conditions. Firstly, using grepl() and a regex, I specify that I am looking for “word” with a word boundary either side (to avoid getting “sword”, for example). Secondly, the title must be “Love’s Labour’s Lost”.

I’ve selected the 17th example because it’s a fun one, but you could choose any you liked.

shake_sentence[grepl("\\bword\\b", sentence, ignore.case = TRUE) &
                 title == "Love's Labour's Lost", 
               sentence][17]
## [1] "I marvel thy master hath not eaten thee for a word, for thou are not so long by the head as honorificabilitudinitatibus; thou art easier swallowed than a flap-dragon."

I don’t think I’ll need to do any data checks to be pretty sure this is the only time “honorificabilitudinitatibus” appears…
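
In the app itself, the idea is to serve a random quotation rather than a hand-picked one. A minimal sketch of that, with the focus word hard-coded here purely for illustration (in the app it would come from user input):

focus_word <- "word"
candidates <- shake_sentence[grepl(paste0("\\b", focus_word, "\\b"), sentence,
                                   ignore.case = TRUE), sentence]
sample(candidates, 1)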

One final thing to check is what type of play uses the word “word” most. After all, 33 out of 493 still leaves a lot of uses in other works, so it might not be comedies.

However, I have to be careful here because the number of words differs drastically across the play types, so it makes most sense to look at how often the word appears as a proportion of all words, rather than simply frequency.

So, I start by adding a new column to word_in_plays, the dataset that shows the number of times the word has appeared in each play. The new column assigns each play its type, based on the vectors I created earlier. Once I’ve done this, I can group the plays by type and create a column for each type’s total using the sum() function.

word_in_type <- copy(word_in_plays)[, 
                                      type := ifelse(
                                        title %in% comedies,
                                        "Comedies", 
                                        ifelse(title %in% histories, 
                                               "Histories", 
                                               ifelse(title %in% tragedies, 
                                                      "Tragedies", 
                                                      "Problem plays")
                                               )
                                        )]
word_in_type[, total := sum(focus_word_freq), by = type]

Merging this new dataset with the dataset all_type_words that I created earlier, which holds the total number of words in each play type (in an automatically named column, V1), allows me to calculate the proportion and then filter the dataset to the type with the highest density of “word” usage.

word_in_type <- merge(word_in_type, all_type_words, by = "type")
word_in_type[, proportion := total/V1]
word_in_type[proportion == max(proportion), type][1]
## [1] "Comedies"

In this case it is comedies after all!

Well, this has been interesting, but there’s still lots more to do. My next posts will be on styling a shiny app that uses this information.

In the meantime, if you want spoilers, you can view the shiny app here and the full code for the app here.
