Literature + Computation = Amazing Results

1 word, 2 words, 150 words! So many beautiful words! Ah, ah, ah, ah, ah, ah!

As a graduate student in English, I guess it comes as no surprise that I tend to have an inherent aversion to math and anything related to quantitative studies. Even as a child, I was unable to understand why the Count from Sesame Street felt such orgasmic joy as he immersed himself in the realm of numbers. I always viewed language as a safe haven from the influence of the quantitative, but then Algebra came along and decided to mix letters and numbers. Let’s just say that I was less than pleased with the combination.

During my undergrad studies in English linguistics, I came to appreciate quantitative approaches to texts and language, but I never thought that I would deal with this combination as a scholar of English. However, thanks to a graduate introductory course that I am taking on Digital Humanities (or Humanities Computing) at the University of Notre Dame, I have recently come across some incredibly useful ideas (and online software) that really expand the possibilities of quantitative research within the field of literature. These ideas and tools facilitate what Franco Moretti calls “distant reading,” which basically denotes the analysis of hundreds, if not thousands or millions, of literary texts in order to get a better sense of meaningful changes and developments throughout literary history (see my review of his book Graphs, Maps, Trees).

I have recently been dabbling with the interpretation and creation of programming code (using Python and HTML) thanks in part to the Programming Historian 2, a step-by-step tutorial on how to create basic computer programs that can decode and search for basic patterns in digital texts. Nonetheless, the possibilities of the pre-existing tools available on the web are indeed much more powerful and seductive than the basic programs I’ve developed so far. I will focus my attention on two of the many tools I have surveyed as of now: Voyant Tools and the Google Labs N-Gram Viewer.


According to its webpage, Voyant Tools is a web-based environment used primarily for the analysis of digital texts. This “environment” allows you to conduct different types of quantitative analyses (word counting, word frequency, etc.) with any text in practically any digital format (html, .doc, etc.). All you have to do is paste the text that you want to analyze or simply provide a web link to the actual text. The application then “reveals” facts, quantitative data, and statistics derived from the textual input. Not only can Voyant Tools show the frequency and distribution of particular words across the text, but it can also generate graphs and lists that illustrate the prominence of any word in comparison to another.

In order to test Voyant Tools, I simply pasted the URL of the html version of Oscar Wilde’s The Picture of Dorian Gray (made available via Project Gutenberg) into the main text box, and I clicked on the button labeled “reveal.” My browser then opened the following set of tools:

Now, it is important to note that I enabled the stop-word filter in order to limit the types of words that Voyant Tools counted and interpreted (which simply means that I asked VT to exclude “meaningless” grammatical words such as “the,” “a,” and “are,” among others, from its interpretation of the data). The application processed the textual data, organized it, and presented it in an array of useful formats.
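For readers curious about what stop-word filtering and word counting look like in code, here is a minimal Python sketch. It is not how Voyant Tools is actually implemented; the tiny stop-word list, the function name word_frequencies, and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch); real stop-word lists are much longer.
STOP_WORDS = {"the", "a", "an", "are", "is", "of", "and", "to", "in", "was"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lowercase word tokens in text, skipping stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)

# Invented sample text, not a quotation from the novel.
sample = "The painter is the maker of the portrait. The critic is a maker too."
freqs = word_frequencies(sample)
print(freqs.most_common(1))
```

Without the filter, “the” would top the list; with it, the most frequent word is one that actually says something about the text.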

The “Cirrus” section illustrates the most common words of the text in a visual cloud, in which the size of each word reflects its frequency within the corpus. The “Words in the Entire Corpus” section lists all of the words that appear in the source text along with how many times they appear. The “Corpus Reader” section displays the textual input and highlights each appearance of a word selected from the frequency list. The “Word Trend” section graphs the frequency of a selected word from the beginning to the end of the text. Note how I selected the word “man” within Wilde’s novel, and how the Word Trend section shows that the word increases in frequency as the novel progresses.
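The idea behind a word trend can be sketched in a few lines of Python: cut the text into equal-sized chunks and count the chosen word in each one. How Voyant Tools actually segments a text is not something I can confirm, so the equal-chunk approach, the function name word_trend, and the toy input below are assumptions.

```python
import re

def word_trend(text, word, segments=5):
    """Count occurrences of word in each of `segments` roughly equal chunks."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return [0] * segments
    size = -(-len(tokens) // segments)  # ceiling division: tokens per chunk
    counts = [
        tokens[start:start + size].count(word.lower())
        for start in range(0, len(tokens), size)
    ]
    # Very short texts may yield fewer chunks than requested; pad with zeros.
    counts += [0] * (segments - len(counts))
    return counts

# Invented toy "text" in which "man" becomes more common toward the end.
sample = "boy boy youth youth man youth man man man man man man"
print(word_trend(sample, "man", segments=3))  # [0, 3, 4]
```

Plotting these per-chunk counts from left to right gives the kind of rising or falling line that the Word Trend panel displays.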

Overall, I think the research possibilities of this program are indeed noteworthy, for it may allow us to offer concrete evidence for some of the claims we make as literary scholars. For instance, if we were to argue that Victor Frankenstein in Mary Shelley’s Frankenstein increasingly begins to view the creature as a human being, perhaps we could compare the frequency and distribution of words such as “monster,” “creature,” “devil,” and “wretch” with other terms such as “human,” “being,” and “man” in order to determine when and how the creature is labeled by his creator. Of course, there might be issues with these tools, especially when determining the context of these terms, and whether or not concepts are referred to using different names. However, as Moretti once posited in his aforementioned work, graphs and lists provide data, not interpretation.
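The Frankenstein comparison could be roughed out along these lines. This is only a sketch under several assumptions: the novel has already been split into a list of chapter strings, the group labels “monstrous” and “human” are my own, and the two sample “chapters” are invented sentences rather than Shelley’s text.

```python
import re
from collections import Counter

# Term groups from the Frankenstein example; the group labels are hypothetical.
GROUPS = {
    "monstrous": {"monster", "creature", "devil", "wretch"},
    "human": {"human", "being", "man"},
}

def group_counts_by_chapter(chapters, groups=GROUPS):
    """For each chapter, total the occurrences of every term in each group."""
    rows = []
    for chapter in chapters:
        counts = Counter(re.findall(r"[a-z]+", chapter.lower()))
        rows.append(
            {name: sum(counts[term] for term in terms)
             for name, terms in groups.items()}
        )
    return rows

# Invented stand-ins for chapters, used only to show the shape of the output.
chapters = [
    "The wretch fled. I cursed the monster and the devil.",
    "The creature spoke like a man, a thinking being.",
]
rows = group_counts_by_chapter(chapters)
print(rows)
```

A count like this cannot tell who is speaking or whether “being” is a noun or a verb, which is exactly the kind of contextual issue that still requires close reading.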


The premise of the N-Gram Viewer is far simpler than that of Voyant Tools: using the collection of digital books found within the Google Books archive, the N-Gram Viewer allows you to trace the presence of particular words or terms within thousands (and even millions) of books across a specific span of time. All you need to do is type in the word(s) that you are interested in tracing, set the years that you want to survey, and choose the corpus you want to study (English texts, Spanish texts, American texts, etc.), and the app will draw a nifty graph of the presence of this term in books that fall within the parameters you established.
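To make the premise concrete, here is a toy Python sketch of the underlying calculation: for each year, the share of all words in that year’s books that match the search term. It handles single words only, runs on three invented one-line “books,” and says nothing about how Google actually builds or smooths its graphs; the function name yearly_share is hypothetical.

```python
import re
from collections import defaultdict

def yearly_share(books, term):
    """Given (year, text) pairs, return {year: term count / total words}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in books:
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(term.lower())
    # Skip years with no words at all to avoid dividing by zero.
    return {year: hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}

# Invented miniature corpus: (publication year, text).
books = [
    (1850, "a gay and merry evening"),
    (1850, "the evening was long"),
    (1990, "a gay rights march"),
]
shares = yearly_share(books, "gay")
print(shares)
```

Dividing by the total number of words per year matters because far more books survive from recent decades; raw counts alone would make almost every word look as if it were on the rise.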

The coolest part is the fact that the N-Gram Viewer is able to illustrate the prominence of more than one term within the same chart, allowing you to trace, for instance, the use of different synonyms or of complementary concepts (e.g. “adult, child, teenager”; “gay, lesbian, bisexual, heterosexual”; “novel, poem, short story”; “Asian, Latino, Caucasian, African American,” etc.). Below, you can see the search I conducted for the terms “gay,” “homosexual,” “bisexual,” “queer,” and “lesbian” using a corpus of English fiction between the years of 1850 and 2000. The results were as follows:

It is quite amazing to see these results illustrated in such a clear and concise format. Note how the use of the term “gay” begins to decline after the 1850s, probably due to its increasing association with homosexuality rather than an emotional state of joy. It is after the 1970s (which coincides with the establishment of a gay rights movement after the 1969 Stonewall Riots in New York) that the use of the term “gay” begins to increase dramatically in English fiction, leading to a peak in the late 1990s (which happens to be the peak of the nationalization of gay media in television and popular culture). It is interesting to note that the term “lesbian” begins to appear and increase at roughly the same time that the term “gay” begins its ascent.

Of course, as with Voyant Tools, the N-Gram Viewer has issues, particularly when it comes to the shifting meaning of particular words, the prominence of a certain term to denote a particular concept, and the sampling of the books themselves (which, according to Culturomics, only represents around 12% of all the books ever published). But regardless of these issues, I particularly enjoy the possibilities that these tools present within the realm of distant reading, and I’m looking forward to seeing the new tools that these applications will inspire.

