First, I needed to install the wordcloud and text mining (tm) packages. The RColorBrewer package is also required, so install it too if you don’t have it already.
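A minimal setup sketch, assuming the three packages are installed from CRAN. The `interactive()` guard is my own addition (not part of the original post) so that the snippet does not try to download anything when run as a script.

```r
# Packages used in this post
pkgs <- c("wordcloud", "tm", "RColorBrewer")

# Work out which of them are not installed yet
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]

# Install only what is missing.
# Assumption: skip the install when the code is run non-interactively.
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```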
I need a bunch of words. I've always liked the introduction speech that V makes in V for Vendetta [1]. I stored the words as a character string.
I would like to restrict the word cloud to v-words only. There are a number of steps required to present only v words.
- The string is split up into separate words as a data frame. Remember, a data frame is "a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case".
- Get the row numbers of the data frame containing words that start with v. This is done with the regular expression ^[Vv], which matches any word beginning with either an upper or lower case v.
- Use the row numbers to subset the words from the data frame. That is, the non v words are excluded.
- Some of these v words end with punctuation marks like full stops and commas. These are removed using the gsub() function and the character class [[:punct:]].
- The remaining v words, now free of punctuation, are fed into the wordcloud function with parameters defining the size of the text and the colours.
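The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the `speech` string is a short stand-in sentence of my own (not the actual speech), the variable names are hypothetical, and the `scale`, `min.freq` and palette values are just one reasonable choice. The plotting step is skipped if wordcloud or tm is not installed.

```r
# Stand-in text for illustration only (NOT the V for Vendetta speech)
speech <- "Vivid voices, valiant and vexed, visit the quiet village. Victory, they vow, is vital."

# 1. Split the string into words and hold them in a one-column data frame
words <- data.frame(word = strsplit(speech, " ")[[1]], stringsAsFactors = FALSE)

# 2. Row numbers of the words starting with an upper or lower case v
v_rows <- grep("^[Vv]", words$word)

# 3. Keep only those rows, which drops the non v words
v_words <- words[v_rows, "word"]

# 4. Strip full stops, commas and other punctuation
v_words <- gsub("[[:punct:]]", "", v_words)
print(v_words)

# 5. Draw the cloud; min.freq = 1 because every word occurs only once
if (requireNamespace("wordcloud", quietly = TRUE) &&
    requireNamespace("tm", quietly = TRUE)) {
  library(wordcloud)
  wordcloud(v_words, scale = c(3, 0.5), min.freq = 1, random.order = FALSE,
            colors = RColorBrewer::brewer.pal(8, "Dark2"))
}
```

Note that `min.freq` defaults to 3 in wordcloud, so without `min.freq = 1` a text where each word occurs once would produce an empty plot.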
Coolness - I have my v words as a word cloud. However, this isn't a particularly exciting word cloud: each word appears only once, so every word is drawn at the same size and in the same colour. Let's try the following quote (spoilers!) from Walter White in Breaking Bad [2].
Looking at the quote we can see that the words "I" and "you" occur a number of times. Let's run some code and produce the wordcloud.
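A rough sketch of this step, with loud assumptions: `bb` here is a short stand-in sentence I made up so the example is self-contained (it is not the Breaking Bad quote), and the wordcloud arguments are illustrative. The quick tally confirms that "I" and "you" are the repeated words before the text is handed to wordcloud as raw text.

```r
# Stand-in text for illustration only (NOT the actual quote)
bb <- "I know you, Skyler, and you know I watch the NASDAQ. You're sure I am wrong, but I think you're not."

# Quick tally of the raw words, ignoring full stops and commas
tokens <- gsub("[.,]", "", strsplit(bb, " ")[[1]])
print(table(tokens)[c("I", "you")])

# Hand the raw text straight to wordcloud (needs tm for its text handling)
if (requireNamespace("wordcloud", quietly = TRUE) &&
    requireNamespace("tm", quietly = TRUE)) {
  library(wordcloud)
  wordcloud(bb, min.freq = 1, random.order = FALSE,
            colors = RColorBrewer::brewer.pal(8, "Dark2"))
}
```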
Great, but I cannot see the "I" and "you". Such "stop words" are removed by default when wordcloud is given raw text. Also notice that "Skyler" and "NASDAQ" have been converted to lower case, and that the apostrophe has been removed from "you're".
What if I wished to include all words from the quote, adjust the capitals and keep the apostrophes? Starting again with the quote (bb – highlight and run it again), the following code is run [3].
After running the first nine lines, the quote looks like the below. 
The code takes the character string, splits it into separate words, counts the frequency of each word, sorts the counts and creates a data frame that is fed into the wordcloud function to produce the word cloud.
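One way this pipeline might look, as a sketch rather than the original code. Assumptions: `bb` is again a made-up stand-in string, whitespace is collapsed with base R `gsub()` as a stand-in for tm's `stripWhitespace()`, only sentence punctuation is removed so apostrophes survive, and words keep their original case (so "You're" and "you're" are counted separately). Supplying the frequencies explicitly means wordcloud does no text cleaning of its own.

```r
# Stand-in text for illustration only (NOT the actual quote); note the double space
bb <- "I know you, Skyler, and you know I watch the NASDAQ.  You're sure I am wrong, but I think you're not."

# Collapse runs of whitespace (base-R stand-in for tm::stripWhitespace)
bb <- gsub("[[:space:]]+", " ", bb)

# Split into words and drop sentence punctuation, keeping apostrophes
tokens <- strsplit(bb, " ")[[1]]
tokens <- gsub("[.,!?;:]", "", tokens)

# Count each word and sort from most to least frequent
freq <- sort(table(tokens), decreasing = TRUE)

# Data frame of words and counts to feed into wordcloud
word_freq <- data.frame(word = names(freq), freq = as.integer(freq),
                        stringsAsFactors = FALSE)
print(head(word_freq))

# With explicit frequencies, stop words, capitals and apostrophes are left alone
if (requireNamespace("wordcloud", quietly = TRUE)) {
  library(wordcloud)
  wordcloud(words = word_freq$word, freq = word_freq$freq, min.freq = 1,
            random.order = FALSE, colors = RColorBrewer::brewer.pal(8, "Dark2"))
}
```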
Done! All words are present. One could remove the stopwords from the quote by using the following. 
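As a hedged sketch of stop word removal: the snippet below filters the words against tm's English stop word list when tm is installed, and otherwise falls back to a tiny hypothetical list of my own. The `bb` string is a made-up stand-in, and the case-insensitive comparison is my choice rather than anything confirmed by the original code.

```r
# Stand-in text for illustration only (NOT the actual quote)
bb <- "I know you, Skyler, and you know I watch the NASDAQ. You're sure I am wrong, but I think you're not."

# Stop word list: tm's English list if available, else a tiny hypothetical fallback
stop_list <- if (requireNamespace("tm", quietly = TRUE)) {
  tm::stopwords("english")
} else {
  c("i", "you", "you're", "and", "the", "am", "but", "not")
}

# Compare lower-cased, punctuation-free versions of each word against the list
tokens <- strsplit(bb, " ")[[1]]
bare <- tolower(gsub("[.,!?;:]", "", tokens))
kept <- tokens[!bare %in% stop_list]

# Rebuild a single string so the frequency code can be rerun on it
bb_nostop <- paste(kept, collapse = " ")
print(bb_nostop)
```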
Then the code above (in the screengrab before the final Breaking Bad word cloud) can be run to generate a new wordcloud free of the common English words. Happy word clouding!
References and notes
1. Click here for a very excellent V for Vendetta kinetic typography of the speech.
2. If you have yet to watch Breaking Bad, DO NOT watch this spoiler.
3. Keen observers will have noticed bb <- stripWhitespace(bb) as the first line. This command takes the character string and removes any extra white space (such as that created by using tab). This line is required, otherwise the frequency table will tally the white space and distort the wordcloud.
 