Land Ho! Navigating the Geospatial Imagination of 19th Century British and American Maritime Fiction

Since the end of October, seven of my colleagues (Douglas E. Duhaime, Ana M. Jimenez-Moreno, Melissa McCoul, Daniel Murphy, Santiago Quintero, Bryan Santin, Suen Wong) and I have been working on a massive project in which we are trying to “anchor” the geographical imagination of 19th Century British and American sea fiction (based on a similar project developed by Dr. Matt Wilkens). Although this project is part of the requirements for a Digital Humanities course that we are currently taking, we are beginning to see a lot of potential in our work, and we are seriously considering some future possibilities with this project.

We are just concluding the data collection and organization phase, and we are ready to study and interpret our results. Although to some extent we are waiting to see what our data yields, we are contemplating interpreting it through the lens of Orientalism… for now. In terms of our data, our aim was to create a database of locations mentioned in a large corpus of British and American sea fiction and to chart said locations on geographical maps. In this post, I will share our “methodology” and the technical details of our project. Once we begin interpreting the data, I might just share some of our findings with you as well!

The Corpus

The literary corpus of our project is based on John Kohnen’s Nautical Fiction List, an annotated bibliography of sea fiction (drama, fiction, and poetry) that currently includes 806 authors and 2,190 titles. Our corpus is also based on another nautical fiction list compiled by the library of the California State University – Maritime (CSUM) campus. All bibliographical entries cited in these sources were distributed evenly among us. Within this distribution, we identified all of the 19th century texts and classified them as either British or American texts. Once this concentrated list was compiled, we searched for digital full-text versions of these works, which were mostly obtained via Project Gutenberg and The Internet Archive.

Priority was given to texts found within Gutenberg due to their superior textual quality. Most texts found in the Internet Archive are simply scans of physical copies and facsimiles converted into digital text using Optical Character Recognition (OCR) software, whereas texts available in Gutenberg are transcribed and revised multiple times by human agents; thus, the margin of error for Project Gutenberg texts is significantly lower. All of the information not available in the original version of the text (such as Project Gutenberg’s legal and copyright disclaimers) was stripped from the document, and each work of fiction was saved into a separate .TXT file classified by author, title, year of publication, and nationality (British or American).

In total, we were able to create a digital corpus of approximately seventy-four (74) 19th Century British maritime fiction texts, and approximately thirty-five (35) 19th Century American maritime texts. This amounts to a total of one hundred and nine (109) full-text versions of 19th Century maritime fiction. There are obvious issues that need to be addressed with this corpus. First and foremost, we are unaware of what percentage our data represents in terms of all of the maritime fiction published in the 19th century. Nonetheless, our goal is not to discuss every maritime text, but rather to take into consideration a larger corpus of this genre of fiction in order to make claims and interpretations that are in accordance with the goals of distant reading. Additionally, it is clear that our British corpus is more than twice the size of the American corpus. This, however, is completely understandable when taking into account that the publishing industry was far more advanced and developed in the British context, and perhaps also due to the prominence of shipping and sea travel in the British empire.

Named Entity Extraction and Database Creation

Locations within the texts were identified using Stanford CoreNLP, a suite of language analysis tools that processes digital English texts. In essence, each word within the text is tagged with meta-linguistic information according to markers established by the user. The software creates an .XML output file that contains all of the tokenized words in the source text tagged with their features, including but not limited to part of speech, dates, locations, times, and names. A sample token produced by Stanford CoreNLP with Part of Speech (POS – the word’s syntactic category) and Named Entity Recognizer (NER – labels for the names of things) tags would look somewhat similar to this (the text after each # is an explanation of the token, not part of the output):

<token id="1"> # ID number assigned to this particular token.
<word>Beaconsfield</word> # Token extracted from the source text.
<POS>NNP</POS> # Part of speech. In this case, NNP marks a singular proper noun.
<NER>LOCATION</NER> # NER classification. In this case, the word is tagged as a location.
</token>


Using original code written in the Python programming language, we devised a method to extract and list all tokens with a LOCATION NER marker for every text within our corpus. The CoreNLP output for each of these texts was briefly revised and corrected by hand. Tokens that were recognizably not locations (e.g. “esq,” “French,” “John,” “Sandwich King”) or locations that simply cannot be mapped geographically (e.g. “moon,” “Jupiter,” “Neptune”) were eliminated from the data. In total, our current data consists of 37,542 location mentions across all of our texts.
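For readers curious about what such an extraction script might look like, here is a minimal sketch. It is not the code we actually ran: the function name, the sample XML, and the decision to merge consecutive LOCATION tokens (so that “New” followed by “York” becomes “New York”) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def extract_locations(xml_text):
    """Return the location mentions found in a CoreNLP-style XML string.

    Assumption: consecutive LOCATION tokens inside one sentence belong to
    the same place name and are joined with a space.
    """
    root = ET.fromstring(xml_text)
    mentions = []
    for sentence in root.iter("sentence"):
        current = []  # words of the location currently being assembled
        for token in sentence.iter("token"):
            word = token.findtext("word", default="")
            ner = token.findtext("NER", default="O")
            if ner == "LOCATION":
                current.append(word)
            elif current:
                mentions.append(" ".join(current))
                current = []
        if current:  # a location that ends the sentence
            mentions.append(" ".join(current))
    return mentions


# Hand-written sample in the shape of the token shown above (hypothetical).
SAMPLE = """<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>We</word><POS>PRP</POS><NER>O</NER></token>
<token id="2"><word>left</word><POS>VBD</POS><NER>O</NER></token>
<token id="3"><word>New</word><POS>NNP</POS><NER>LOCATION</NER></token>
<token id="4"><word>York</word><POS>NNP</POS><NER>LOCATION</NER></token>
<token id="5"><word>for</word><POS>IN</POS><NER>O</NER></token>
<token id="6"><word>Beaconsfield</word><POS>NNP</POS><NER>LOCATION</NER></token>
</tokens></sentence></sentences></document></root>"""

print(extract_locations(SAMPLE))
```

A list like this would still need the hand correction described above, since the tagger cannot tell that “Jupiter” is unmappable or that “John” is not a place.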

The cleaned-up version of the data was then organized into a spreadsheet. Every instance of a location token was accompanied by the following information: input file, shortened file name, title of the text, author, publication date, and nationality. This spreadsheet was then converted into a .CSV file and imported into an online database server using MySQL Workbench (for Windows), which allows us to perform operations that are cumbersome in Microsoft Excel, such as keeping a tally of the count of each location mentioned in our corpus. The following query was used in MySQL Workbench to generate the counts of each location:

SELECT location, COUNT(*) FROM Maritime_Fiction WHERE nationality = 'BR' GROUP BY location ORDER BY COUNT(*) DESC;

This query, run once for the British texts and once for the American texts, generated two lists of the top 1,000 locations mentioned in each part of the corpus along with their total counts. Here are some tables listing the most common locations found within our corpus:

Table 1. Top 19 Locations Mentioned in AMERICAN 19th Century Maritime Fiction:

England 454
America 357
London 282
New York 276
Mardi 223
Israel 191
Samoa 182
Atlantic 160
Pacific 151
Paris 142
France 139
Europe 126
Tahiti 116
Boston 115
Rio 110
Cape Horn 105
Taji 100
Nantucket 96
Wallingford 95

Table 2. Top 19 Locations Mentioned in BRITISH 19th Century Maritime Fiction:

England 1523
London 604
Portsmouth 367
France 360
India 332
Malta 266
Jamaica 251
Europe 234
Africa 234
Spain 232
West Indies 207
Gibraltar 193
Greenwich 179
America 177
Atlantic 169
Plymouth 168
Mediterranean 167
China 159
Ireland 154
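The per-nationality tally behind these tables can be tried out without a database server. The sketch below is only an illustration: it swaps MySQL for Python’s built-in sqlite3 module, the rows are made up, and the table is reduced to two columns (location and nationality) with 'BR' and 'US' assumed as the nationality codes.

```python
import sqlite3

# Made-up location mentions: (location, nationality). Not our real data.
ROWS = [
    ("England", "BR"), ("London", "BR"), ("England", "BR"),
    ("England", "US"), ("New York", "US"), ("New York", "US"),
]


def top_locations(rows, nationality, limit=1000):
    """Count mentions per location for one nationality, most frequent first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Maritime_Fiction (location TEXT, nationality TEXT)"
    )
    conn.executemany("INSERT INTO Maritime_Fiction VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT location, COUNT(*) AS mentions FROM Maritime_Fiction "
        "WHERE nationality = ? GROUP BY location "
        "ORDER BY mentions DESC, location ASC LIMIT ?",
        (nationality, limit),
    ).fetchall()
    conn.close()
    return result


print(top_locations(ROWS, "BR"))
print(top_locations(ROWS, "US"))
```

Passing the nationality as a parameter makes it easy to run the same query twice, once per sub-corpus, which is how the two tables above were produced.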

Needless to say, our data has yielded some very interesting results, and I am very anxious to see what findings we’ll discover and what interpretations will be made!

Geospatial Information and Mapping

The complete British and American tables were uploaded to Google Fusion Tables, an experimental data management and online visualization application that allows one to process large sets of data and turn them into maps and charts. Luckily, Google Fusion Tables integrates Google’s Geocoding API services, which map archaic place names onto their contemporary equivalents, standardize alternate spellings of a location’s name, and translate the written location into a coordinate consisting of a latitude and a longitude. These coordinates are then used to create stunning visualizations of all the locations present within a data set. Every location mentioned in the corpus is marked by a colored dot. When one hovers the cursor over one of these dots, the dot’s meta-information (such as location name and count) is displayed. Here are some snapshots of what these visualizations look like from afar:

Generated British Map

Locations mentioned within our British corpus. Places marked in blue are the top locations mentioned in our collection of texts.

Locations mentioned within our American corpus.
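In our case Fusion Tables handled the geocoding behind the scenes, so we never called the Geocoding API ourselves. For anyone who wants to see roughly what the place-name-to-coordinate step involves, here is a hedged sketch. The two helper functions, the placeholder key, and the hand-written sample response with its approximate coordinates are assumptions for illustration; no network request is made.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(place, api_key):
    """Build the request URL for one place name (api_key is a placeholder)."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": place, "key": api_key})


def parse_coordinates(response):
    """Return (latitude, longitude) from a decoded JSON response, or None."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    location = response["results"][0]["geometry"]["location"]
    return (location["lat"], location["lng"])


# Hand-written stand-in for a decoded response; the numbers are illustrative.
SAMPLE_RESPONSE = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 51.48, "lng": 0.0}}}],
}

print(build_geocode_url("Cape Horn", "DEMO_KEY"))
print(parse_coordinates(SAMPLE_RESPONSE))
print(parse_coordinates({"status": "ZERO_RESULTS", "results": []}))
```

Returning None for an unmatched name matters for a corpus like ours, where some place names will not resolve to any modern location.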

Well, that’s all I’m sharing with you for now. Our data seems to have a lot of potential, and theoretically, there are dozens of interesting claims that we can make, and there are definitely other avenues that we will explore in terms of creating visualizations for our data.

Wish us luck!

Mapping the Imaginative Landscape of Texas After the Mexican-American War

One of the consequences of the Mexican-American War was the appropriation of over 500,000 square miles of Mexican territory by the United States in 1848. Places such as Texas, California, New Mexico, and Arizona, which were originally considered part of Mexico, were now considered part of the United States, which posed an immense problem for Mexicans living in these areas because they were now considered a social and cultural minority within a predominantly white and Protestant America. Texas’s proximity to the Mexican border made it a prominent nexus for the early stages of the Chicano movement (literature written by Mexican-Americans in the United States); after all, many battles during the Mexican-American War took place in Texan cities such as Brownsville, which were heavily populated by people of Mexican origin.

Since I personally have an interest in young adult Chicano/a literature, I became intrigued with the prominence of Texan locations (particularly those areas close to the Mexican/U.S. borders) in American literature published shortly after 1848–that is, when various parts of Mexico were annexed into the United States of America. Thus, this post depicts my efforts to map and trace the prominence of these locations in American literature, although not necessarily centered on texts written by Chicano/a authors. My data will therefore be centered on the presence of Texan locations in prose literatures written by American authors during the incorporation of Mexican territories into the U.S. (which marks the beginnings of the Chicano/a movement, generally speaking).

In order to achieve this, I generated maps using ArcGIS, an online application and content management system that allows users to create interactive maps capable of displaying quantitative data provided by databases. The data I used to generate the maps below was provided by a literary database developed by Dr. Matt Wilkens (which is based upon texts such as Lyle Wright’s American Fiction, 1851-1875), which includes a hefty percentage of the long prose titles published by American adults between 1851 and 1875, along with the geospatial information depicted in these titles using Google’s Geocoding API.

The database also included texts and geospatial information from other countries and dates, so I used MySQL Workbench 5.2 CE in order to create a table that only displayed the count of texts within the U.S., specifically within the region of Texas between the dates of 1851 and 1872. In order to avoid ambiguities within the data, I only included texts that mentioned specific cities or locations within Texas, meaning that all texts that simply mentioned the state of Texas were eliminated from the data set. The result of this search query listed over 31 Texas cities mentioned over a series of 972 texts. This data was used to generate the following map in ArcGIS:

(You can access the interactive version of this map by clicking here)

There are few major surprises within this generated map, seeing as the major cities in Texas seem to be the most prominent locations mentioned in American texts published between 1851 and 1872. The San Antonio and the Rio Grande regions are by far the most popular locations mentioned within the text sampling of this period, and the Rio Grande region, conveniently located near Brownsville, is by far the most prominent “borderline” region, with a total of 111 texts mentioning this location.

Although it is extremely interesting to see how prominent the Rio Grande region was in American texts written after the incorporation of Texas into the U.S. in 1848, keep in mind that this data includes texts written during and after the American Civil War. Cities found along the U.S./Mexican border, particularly Brownsville, were notorious for being smuggling points for goods during the Civil War, which might have influenced the prominence of border locations within American literature. Seeing as my interests lie in the prominence of Texan locations in American literature influenced by the aftermath of the Mexican-American War, some adjustments had to be made. Thus, I accessed the aforementioned database once again, this time making sure to create a count of text locations mentioned between 1851 and 1860, right before the Civil War began. The results were as follows:

(You can access the interactive version of this map by clicking here)
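The two date-restricted counts described above follow the same recipe: keep only the mentions that fall inside a range of years, drop bare mentions of the state itself, and count how many distinct texts name each place. The sketch below illustrates that recipe in Python rather than the MySQL query I actually ran; the records, titles, and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (title, publication year, place mentioned).
RECORDS = [
    ("Novel A", 1853, "San Antonio"),
    ("Novel A", 1853, "Rio Grande"),
    ("Novel B", 1858, "Rio Grande"),
    ("Novel B", 1858, "Rio Grande"),  # a repeated mention in the same text
    ("Novel C", 1866, "Austin"),
    ("Novel C", 1866, "Rio Grande"),
    ("Novel D", 1855, "Texas"),       # state-level mention, excluded below
]


def texts_per_place(records, first_year, last_year):
    """Count the distinct texts mentioning each place within a year range."""
    titles = defaultdict(set)
    for title, year, place in records:
        if place == "Texas":
            continue  # only specific cities or regions are kept
        if first_year <= year <= last_year:
            titles[place].add(title)
    return {place: len(found) for place, found in titles.items()}


print(texts_per_place(RECORDS, 1851, 1872))
print(texts_per_place(RECORDS, 1851, 1860))
```

Using a set per place means a text that names the Rio Grande fifty times still counts once, which matches the “number of texts mentioning this location” figures quoted in this post.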

A couple of interesting things occurred when eliminating the years marking the beginnings of the American Civil War. Notice how Austin and Houston decrease dramatically in terms of how many times they are mentioned within the literary corpus. The Rio Grande region, however, still carries the second-place medal in terms of location mentions of Texas after the Mexican-American War. Now we can more reasonably assume that the prominence of this location is intertwined with the effects of the Mexican-American War. Rio Grande, after all, was populated in 1846 as a transfer point for goods and soldiers during the invasion of Mexico. San Antonio is perhaps the obvious and most salient contender for location mentions within American literature published after the Mexican-American War due to the Battle of the Alamo, the event that “inspired” many Texan citizens to join the army during the Texas Revolution while in turn dramatically increasing U.S. hostility towards the Mexican population.

In due course, it’s quite interesting to see how American literature, especially in terms of location, is closely tied to historical events and shifts. But even more so, it’s quite amazing to be able to visualize and develop a more concrete notion of the presence of locations across hundreds of texts that shape the imaginative landscape of American literature during particular periods of time. Although this data is indeed tantalizing, note that the database used to create these maps contained authors deemed American, and I am personally not sure of how many Chicano authors were included within this data set.

My gut feeling, based on the prominence of places such as Rio Grande and San Antonio, is that few, if any, Chicano authors are included, seeing as places such as Brownsville, Laredo, and other border regions contain few or no location counts. It would be extremely interesting to create or obtain a database that exclusively lists locations mentioned in texts written by authors of Mexican or Chicano/a descent, in order to generate maps that can be compared and contrasted with the ones shown above. Then, it would be possible to have an even more encompassing view of the imaginative landscape of Texas from an American perspective and a Mexican/Chicano perspective. After all, literature does not follow the strict boundaries that are imposed in terms of location and space.

Disclaimer: The discussion above was an attempt to experiment with the possibilities of the ArcGIS content management system, and to share these possibilities with those interested in digital humanities, American literature, or history. None of the data interpretations above are definitive or conclusive.

Decoding the American Scholar: Towards a Distant Computational Reading of Emerson’s Prose

The following entry discusses some ideas that I plan to explore in a research paper that I will write for a course titled “Knowledge, Belief, and Science in Melville’s America,” which is being offered by Dr. Laura Dassow Walls at the University of Notre Dame during the fall semester of 2012.

During my last semester of school work, I became fascinated with the concept of hybridity. Something that became extremely apparent during my readings was the fact that the humanities and sciences are not as opposed as we may initially deem them to be. Also, I became aware of the tantalizing possibilities of approaching humanistic studies in a scientific/quantitative fashion (and the extent of these possibilities is increasing tenfold with a course I am taking in Digital Humanities/Humanities Computing). This research project will be my first attempt to approach a collection of literary texts from a scientific and quantitative perspective using the tools that I’ve encountered in the area of humanities computing. My hope is that this approach will help me to understand the ever-elusive Ralph Waldo Emerson and the overall patterns and systems that are implemented in his prose.

As readers of my website are well aware by now, Emerson has been an extremely difficult scholar to understand (at least in my opinion). I tend to develop a strange sense of fascination and utter confusion when I read his prose. I also find it tedious to delve into close readings of his essays mainly because he seems to posit ideas that are at times contradictory and difficult to reconcile (check out my past posts that discuss Emerson in order to understand this point). Of course, this is arguably because Emerson wrote from an extremely subjective point of view, but even more so, it is due to the fact that he was trying his best to grapple with notions that are both abstract and elusive: god, nature, humanity, science, religion, and methods. It can also be argued that Emerson had difficulties in terms of separating the objectivity of his idea(l)s from the subjectivity of his personal experiences. This notion is evidenced in essays such as “Experience,” in which he argues that grief is pointless and futile in the vast scope of the universe, yet it is blatantly obvious that the death of his child created an existential chasm within his life (check out the collection of letters that he sent after the death of his child if you don’t believe me).

How do we even begin to understand such a complex and obviously tormented individual? In order to hypothesize answers to these questions, I am going to suggest a rather Thoreauvian move: rather than trying to integrate myself with the text, and rather than trying to figure out Emerson through close readings, I am going to suggest that we should take a step back and try to piece together the mystery of Emerson through a distant reading.

What is distant reading? Franco Moretti greatly pushed forward this practice when he posited that the issue with close reading is that scholars are only able to study a very select number of texts, while virtually ignoring the influence of other texts within a collection or canon. Thus, in distant reading, readings of individual texts are set aside, and instead, the scholar focuses on determining systems, patterns, themes, and tropes that exist within a collection of texts in order to understand a system in its entirety. Now, Moretti is quite aware that when conducting a distant reading, there are definitely particularities and ideas that are lost. This is an extremely pressing issue, especially when dealing with authors such as Emerson, whose prose and poetry were injected with countless political, religious, and social ideologies that are ostensibly lost when approaching the text from a distance. However, Moretti argues that this is perhaps the only way to make the unmanageable and invisible forces behind literature visible:

Distant reading: where distance, let me repeat it, is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text: devices, themes, tropes—or genres and systems. And if, between the very small and the very large, the text itself disappears, well, it is one of those cases when one can justifiably say, Less is more. If we want to understand the system in its entirety, we must accept losing something. We always pay a price for theoretical knowledge: reality is infinitely rich; concepts are abstract, are poor. But it’s precisely this ‘poverty’ that makes it possible to handle them, and therefore to know. This is why less is actually more. (Conjectures…)

How will this notion of distant reading take place within my research? Simple. I created a database of Emerson’s major prose works in digitalized format (using an archive of Emerson’s texts in HTML format), including a selection of his early addresses and lectures, his first series of essays, and his second series of essays. This database of works, adapted from the prose readings available in the Norton Critical edition of Emerson’s prose and poetry, was organized in chronological order and saved within the same archive.

I then used a series of online textual analysis applications known as “Voyant Tools” (which I discuss at length in this post), which use a series of algorithms that will allow me to approach Emerson’s works in a distant, quantitative fashion: the program indicates the frequency and distribution of all of the words used within the inputted database, and it is even able to graphically illustrate the trend of each word within the entire scope of texts that I uploaded. Since the database contains the texts in chronological order, this will allow me to observe patterns of word usage from Emerson’s earlier works to his later ones.

I have already tested the program using a tentative collection of Emerson’s most famous prose works, and the results have indeed been interesting. I set Voyant Tools to remove stopwords from the database, meaning that all grammatical and non-content words were removed from the data that was provided. The application then produced a frequency list of the words in the entire corpus. The most frequent words were as follows (keep in mind that this list was generated using Emerson’s early addresses and lectures, his first and second series of essays, and his essay on Nature):

[Table: most frequent words in the Emerson corpus]

I think it is unsurprising to see that ‘man’ and ‘nature’ are the most common words found within Emerson’s prose, but something that did provoke a vast sense of curiosity was the abstract and conceptual nature of the words on this list. Not only does this provide evidence that Emerson was indeed an abstract writer, but it also highlights an important issue: most, if not all, of these words have various shades of meaning that can alter immensely according to the context in which the word is being used, and they are closely linked to subjective ideological views of the word. Also, note that most of the words in this list are concepts that tend to be associated with positive feelings and optimistic attitudes (god, truth, love, mind, great, good, new, life, world, nature, men, etc.). I think this says an awful lot about the rhetorical nature of Emerson’s prose, and how it is expected that the overabundance of these positive terms will serve as effective emotional rapport for an audience.
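The kind of stopword-filtered frequency list discussed above can be approximated in a few lines of Python. This is only a rough sketch of the idea, not how Voyant Tools works internally: the tiny stopword set, the tokenizing pattern, and the sample sentence (which is invented, not a quotation from Emerson) are all assumptions.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; real lists contain hundreds of words.
STOPWORDS = {"the", "a", "an", "and", "of", "is", "are", "in", "to", "it", "that"}


def word_frequencies(text, stopwords=STOPWORDS):
    """Count every word in the text that is not a stopword."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stopwords)


# Invented sample sentence, used only to show the mechanics.
SAMPLE = "The soul of man is in nature, and nature is the mirror of the soul of man."

frequencies = word_frequencies(SAMPLE)
for word, count in frequencies.most_common():
    print(word, count)
```

Note that a counter like this treats ‘man’ and ‘men’ as separate words and knows nothing about context, which is exactly the limitation raised above about words with many shades of meaning.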

What was even more fascinating were the trend graphs that I was able to generate, which indicate the usage of words across Emerson’s texts in a chronological fashion. Here is a slideshow of the graphs that I generated:

I think that the graphs tend to demonstrate some very insightful trends. For instance, Emerson’s use of the word ‘soul’ is particularly frequent during his earlier addresses and lectures (with the term appearing an average of over 50 times), whereas the use of the term begins to drop noticeably after the publication of his “Over-Soul” essay. Usage of the term ‘god’ starts off particularly strong in his earlier prose works, drops continuously as he continues to publish essays, and suddenly, towards the publication of his essay on “Nature,” skyrockets. What prompted this sudden interest in god? What led to this dramatic spike in the data?

I thought the graph that illustrated the trend of the words ‘new’ and ‘old’ was very intriguing, for not only is the term ‘new’ being used much more frequently than the term ‘old,’ but both concepts tend to follow the same rises and falls throughout Emerson’s work, indicating that the concepts are frequently contrasted and are perhaps presented in a binary fashion. Notice how these words are consistently used throughout the entirety of the prose works inputted in the collection of Emerson’s prose. I never realized how consistent “newness” and “oldness” were in Emerson’s prose!

The graph that compares the use of ‘man’ versus ‘men’ is also intriguing to me, for not only do both terms tend to demonstrate the same degree of fluctuation throughout Emerson’s works, but there is a noticeable divergence between the lines as they approach Emerson’s later works: whereas the plural ‘men’ is being used around 40 times when approaching his essay on Nature, the singular ‘man’ is used nearly 150 times (it surpasses the use of ‘men’ by a margin of nearly 300%). Perhaps this is in some way reflective of his increasing belief in the self-reliance of human beings, and his increasing concern with the perils of subjectivity.

I think there is something worthwhile to be studied here. The graphs have definitely opened up questions, but now the issue is to come up with some concrete answers and interpretations. I wonder how these graphs will change when I input more of Emerson’s prose work into the database. I am also concerned with whether or not I’ll be able to develop a full-fledged research project based on this quantitative data. My guess is that I will ultimately resort to close readings in order to better understand the trends and word frequencies produced by the program, but that in and of itself is an issue: I simply do not have the time to conduct close readings of every single one of the essays available in the database (especially considering that I am currently teaching, taking graduate courses, and working on annotations for a book series).

Do you have any thoughts or suggestions for this project? Does it seem somewhat feasible and worthwhile? Any and all feedback will be greatly appreciated!

Literature + Computation = Amazing Results

1 word, 2 words, 150 words! So many beautiful words! Ah, ah, ah, ah, ah, ah!

As a graduate student in English, I guess it comes as no surprise that I tend to have an inherent aversion to math and anything related to quantitative studies. Even as a child, I was unable to understand why the Count from Sesame Street felt such orgasmic joy as he immersed himself in the realm of numbers. I always viewed language as a safe haven from the influence of the quantitative, but then came Algebra and decided to mix letters and numbers. Let’s just say that I was less than pleased with the combination.

During my undergrad studies in English linguistics, I came to appreciate quantitative approaches towards texts and language, but I never thought that I would deal with this combination as a scholar of English. However, thanks to a graduate introductory course that I am taking on Digital Humanities (or Humanities Computing) at the University of Notre Dame, I have recently come across some incredibly useful ideas (and online software) that really augment the possibilities of quantitative research within the field of literature. These ideas and tools facilitate what Franco Moretti calls “distant reading,” which basically denotes the analysis of hundreds, if not thousands or millions, of literary texts in order to get a better sense of meaningful changes and developments throughout literary history (see my review of his book Graphs, Maps, Trees).

I have recently been dabbling with the interpretation and creation of programming code (using Python and HTML) thanks in part to the Programming Historian 2, a step-by-step tutorial on how to create basic computer programs that can decode and search for basic patterns in digital texts. Nonetheless, the possibilities of the pre-existing tools available on the web are indeed much more powerful and seductive than the basic programs I’ve developed so far. I will focus my attention on two of the many tools I have surveyed as of now: Voyant Tools and the Google Labs N-Gram Viewer.


According to its webpage, Voyant Tools is a web-based environment used primarily for the analysis of digital texts. This “environment” allows you to conduct different types of quantitative analyses (word counting, word frequency, etc.) with any text in practically any digital format (html, .doc, etc.). All you have to do is paste the text that you want to analyze or simply provide a web link to the actual text. The application then “reveals” facts, quantitative data, and statistics derived from the textual input. Not only can Voyant Tools demonstrate the frequency and distribution of particular words across the text, but it is also able to produce graphs and lists that illustrate the prominence of any word in comparison to another.

In order to test Voyant Tools, I simply pasted the URL of the html version of Oscar Wilde’s The Picture of Dorian Gray (made available via Project Gutenberg) into the main text bar, and I clicked on the button labeled “reveal.” My browser then opened the following set of tools:

Now, it is important to note that I applied limitations on the incorporation of stop words into my data in order to limit the types of words that Voyant Tools used and interpreted (which simply means that I requested VT to eliminate “meaningless” grammatical words such as “the,” “a,” “are,” among others, from the interpretations of the data). The application processed the textual data, organized it, and depicted it in an array of useful formats.

The “Cirrus” section illustrates the most common words of the text in a visual cloud, and the size of each word is directly correlated to its frequency within the corpus. The “Words in the Entire Corpus” section lists all of the words that appear in the source text and how many times they appear. The “Corpus Reader” section depicts the textual input and highlights the appearances of a word selected within the frequency list. The “Word Trend” section graphically illustrates the frequency of a selected word from the beginning to the end of the text. Note how I selected the word “man” within Wilde’s novel, and how the Word Trend section illustrates how the word increases in frequency as the novel progresses.
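The idea behind a word trend is simple enough to sketch by hand: split the text into equal segments and count the chosen word in each one. The snippet below is a rough approximation under my own assumptions (raw counts, five segments by default, a made-up sample text); I do not know exactly how Voyant Tools computes its graph.

```python
import re


def word_trend(text, word, segments=5):
    """Count occurrences of `word` in each of `segments` equal slices of the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = [0] * segments
    target = word.lower()
    for position, current in enumerate(words):
        if current == target:
            # Map the word's position to the segment it falls in.
            counts[position * segments // len(words)] += 1
    return counts


# Made-up sample in which "man" becomes more frequent towards the end.
SAMPLE = "sea sea ship man ship man man man"

print(word_trend(SAMPLE, "man", segments=4))
```

A list of counts that climbs from left to right corresponds to the rising line described above for “man” in Wilde’s novel.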

Overall, I think the research possibilities of this program are indeed noteworthy, for it may allow us to offer concrete evidence for some of the claims we make as literary scholars. For instance, if we were to argue that Victor Frankenstein in Mary Shelley’s Frankenstein increasingly begins to view the creature as a human being, perhaps we could compare the frequency and distribution of words such as “monster,” “creature,” “devil,” and “wretch” with other terms such as “human,” “being,” and “man” in order to determine when and how the creature is labeled by his creator. Of course, there might be issues with these tools, especially when determining the context of these terms, and whether or not concepts are referred to using different names. However, as Moretti once posited in his aforementioned work, graphs and lists provide data, not interpretation.


The premise of the N-Gram Viewer is far simpler than that of Voyant Tools: using the collection of digital books found within the Google Books archive, N-Gram viewer allows you to trace the presence of particular words or terms within thousands (and even millions) of books across a specific span of time. All you need to do is type in the word(s) that you are interested in tracing, establish the years that you want to survey, and the literary scope you want to study (English texts, Spanish texts, American texts, etc.), and the app will trace a nifty graph of the presence of this term in books that fall within the parameters you established.
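
The underlying calculation can be approximated with a toy example: for each year in the chosen range, divide the number of times a term appears by the total number of words published that year. The sketch below is an assumption-laden illustration, not Google's data or API; the function name, the `toy_books` list, and the restriction to single words are all mine.

```python
import re
from collections import defaultdict

def yearly_relative_frequency(books, term, start, end):
    """books: iterable of (year, text) pairs.
    Returns {year: occurrences of term / total words in that year}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    term = term.lower()
    for year, text in books:
        if not start <= year <= end:
            continue  # skip books outside the surveyed span
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(term)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Invented miniature "archive" used only to exercise the function.
toy_books = [
    (1850, "the gay sailors sang"),
    (1850, "a calm sea"),
    (1900, "the ship sailed on"),
    (2001, "outside the surveyed range"),
]
trend = yearly_relative_frequency(toy_books, "gay", 1850, 2000)
print(trend)
```

Plotting these per-year ratios against time would give a curve of the same kind that the viewer draws, only on a vastly smaller scale.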

The coolest part is the fact that N-Gram Viewer is able to illustrate the prominence of more than one term within the same chart, allowing you to trace, for instance, the use of different synonyms or of complementary concepts (e.g. “adult, child, teenager” – “gay, lesbian, bisexual, heterosexual” – “novel, poem, short story” – “Asian, Latino, Caucasian, African American”, etc.). Below, you can see the search I conducted for the terms “gay,” “homosexual,” “bisexual,” “queer,” and “lesbian” using a corpus of English fiction between the years of 1850 and 2000. The results were as follows:

It is quite amazing to see these results illustrated in such a clear and concise format. Note how the use of the term “gay” begins to decline after the 1850’s, probably due to its increasing association with homosexuality rather than an emotional state of joy. It is after the 1970’s (which coincides with the establishment of a gay rights movement after the 1969 Stonewall Riots in New York) that the use of the term gay begins to increase dramatically in English fiction, leading to a peak of the term in the late 90’s (which happens to be the peak of the nationalization of gay media in television and popular culture). It is interesting to note that the presence of the term “lesbian” roughly begins to manifest and increase during the same time that the use of the term gay begins its ascent.

Of course, as with Voyant Tools, the N-Gram Viewer has issues, particularly when it comes to the shifting meaning of particular words, the prominence of a certain term to denote a particular concept, and the sampling of the books themselves (which, according to Culturomics, only represents around 12% of all the books ever published). But regardless of these issues, I particularly enjoy the possibilities that these tools present within the realm of distant reading, and I’m looking forward to seeing the new tools that these applications will inspire.

Towards a Materialist Conception of Form

Traditionally, close reading is approached as the key to unlocking and understanding the nuances of history, meaning, and ideology in literary texts. Nevertheless, by conducting close readings of a select range of texts, one only engages with a minuscule and insignificant percentage of the literary corpus. Franco Moretti’s approach towards this issue in Graphs, Maps, Trees: Abstract Models for a Literary History argues for the implementation of quantitative and “scientific” methodologies in literary study as a solution to this dilemma. This implementation leads to the widening of the literary scope in order to focus on general historical developments rather than specific literary events, which in turn offers a fresh and bold conceptual approach towards the study of the novel. At first, his attempt at elevating the status of the quantitative, the spatial, and the genealogical as the heart of literary history seems questionable and even impractical within an academic tradition built upon subjectivity and creativity. Nonetheless, by pushing the scholar of literature to focus on notions such as genre and trends rather than on close readings of selected texts, Moretti manages to highlight the creative and intellectual possibilities of approaching literary history from a renewed perspective, which exemplifies the distinctive insights that literary graphs, maps, and trees can offer to the discipline of comparative literature.

Moretti begins his discussion by arguing that close reading only allows the reader to engage with a small percentage of the total amount of texts that have been published. This is problematic because it leads to the prioritization of canonical texts, leaving aside peripheral texts that were not favored or circulated by a wide audience. He points out that close reading strives to present certain texts as representative of literary history; however, this representative approach essentially ignores the entire spectrum of texts necessary to create a complete and representative understanding. Basing himself on previous research developed by other scholars, Moretti focuses his attention on illustrating general notions such as the rate of publication of novels within certain locations and time spans, and on the prominence of particular literary genres (epistolary, Gothic, etc.) across time in order to comprehend trends, rises, and falls that demand interpretations and explanations that shed light on the problems and gaps in literary history.

In order to illustrate these general notions and trends, Moretti divides his book into three different sections that discuss the tools/approaches that can be used to study novels from a general and encompassing perspective, which he dubs a “materialist conception of form” (92). In the first section on graphs, he discusses the use of quantitative diagrams to illustrate shifts, trends, and cycles that have appeared in the genre of the novel throughout time. In his second section on maps, he illustrates the use of spatial diagrams to give shape to the coexistence of the real and the imaginary in novels, and to point out the relationship that exists between social conflict (or social forces) and form. Lastly, he discusses the use of evolutionary or genealogical trees to analyze novels morphologically, thus tightening the bonds that exist between history and form. In due course, the use and implementation of positivist tools and methodologies in the study of literature stresses gaps and ambiguities that demand to be explained qualitatively in order to get a true and encompassing sense of literary history. Graphs reveal trends and cycles within novelistic genres, which require a historical or content-based explanation; maps are based on the reduction of the text into elements and their reconstruction into diagrams that might “possess ‘emerging’ qualities, which were not visible at the lower level” (53); trees illustrate literary genres from an evolutionary perspective, underlining the devices and contexts necessary for the typological survival and extinction of particular novels.

Moretti’s book pushes the reader to ask a central question: what would be the effects if the literary scholar’s attention were to shift from the exceptional to the general? Indeed, Moretti raises an intriguing question at this point, and it leads the reader to wonder what new knowledge or insights can be obtained by approaching novels from a bird’s-eye view rather than the microscopic view that has traditionally been favored in this area of study. Moretti, however, is not promoting a look at both the general history of literary texts and the thematic and ideological idiosyncrasies within particular novels (i.e. close reading). Instead, with his usage of the term “shift” of perspective, he is suggesting a complete change from the exact towards the general. Moretti grounds the need for this shift on valid claims; after all, it is virtually impossible to get a complete historical sense of a literary era or period simply by analyzing a minor percentage of texts written during those times, and it is even less feasible to get a true sense of the richness and variety of said period, especially when many texts are subjectively deemed to be representative of a particular period or genre. Ergo, Moretti suggests that a more “rational history” can be achieved if one were to put aside the traditional view of focusing on one text, and focus instead on the notions of world literature and comparative morphology: “take a form, follow it from space to space, and study the reasons for its transformations” (90). By doing so, one can arguably create a whole representation; a more collective and comprehensive approach, which can be achieved by employing the use of quantitative methods in literary studies, and by focusing on the forces that shape literary history (devices and genres) rather than on texts themselves: “Texts are certainly the real objects of literature… but they are not the right objects of knowledge for literary history” (76), for they are unable to represent a genre in its entirety.

Wherein lies the value of approaching literary texts in this fashion, and do scholars of literary history/studies truly want to leave the notion of close readings as an afterthought? True, the implementation of quantitative, cartographic, and genealogical methods can lead to a greater understanding of the trends and influences that shaped literary genres, and they push the reader to look beyond traditional categorizations and to see the variety and richness that exists within these classifications. However, the question is whether or not a truly “rational history” is something that literary scholars are aiming for in the first place, especially when such a rational approach undermines the aesthetic, cultural, sentimental, and political particularities that certain texts possess, and that lead to this intense desire to conserve, highlight, and perpetuate the validity of said texts in a culture that is increasingly distancing itself from interest in the literary. Also, I personally would argue that certain novels are favored over others simply because they embody certain qualities, such as aesthetic appeal and social/political significance, which make them both salient and influential; not every novel published is a work of art.

In his discussion of shifts within the genre of the novel, Moretti underscores the usage of quantitative methods as a way of being more inclusive and encompassing in the discussion of literary texts, especially as it pertains to the study of the novel as a genre: “all great theories of the novel have precisely reduced the novel to one basic form only […]; and if the reduction has given them their elegance and power, it has also erased nine tenths of literary history. Too much” (30). Although it can be argued that Moretti is not discrediting the value of current approaches towards the novel entirely, he is affirming that these approaches, although insightful, limit the scope of the literary texts that can be scrutinized, thus preventing a rational and comprehensive literary history from taking shape. To some degree, it is ironic that he pushes for a greater degree of literary inclusivity by somewhat excluding the academic practices that have defined the venue of literary study since its conception. What is even more peculiar is that Moretti resorts to the use of quantitative and genealogical methods in order to make the study of literature more open, exact, and methodical. Nonetheless, the data is ultimately approached with the same subjectivities and sense of murkiness that literary texts are already being approached with: “Quantitative research provides a type of data which is ideally independent of interpretations, […] and that is of course also its limit: it provides data, not interpretation” (9). However, Moretti justifies this limit by affirming that the quantitative data reveals problems, whereas form offers the solutions, and that it is precisely this revelation that leads to the questions that must be answered in literary studies (27). Are these apparent problems the ones that truly concern all scholars of literature? Furthermore, it must be questioned whether or not the literary can be understood with a positivist attitude.

A major issue with Moretti’s views and approaches is that he strives to attribute a sense of validity to the field of literature within a society that is increasingly favoring quantitative and positivist methods: a society that is pushing individuals to favor grounded and so-called pragmatic areas of study rather than abstract, ideological, and idealistic areas such as the humanities. However, should generalist and quantitative methods be the heart of literary studies and history, or should they be approached as a heuristic and complementary aid that works in conjunction with close reading to reach this indefinable and unattainable core? To some extent, graphs, maps and trees can function as pillars within the structure of literary studies, especially within the domain of literary history. Yet, it is questionable if they can (or will ever) serve as the foundation which supports the structure in the first place. Regardless, Moretti’s well-crafted book definitely adds a new layer of possibility to the study of literature and literary history, and as he himself points out, it does an outstanding job of “opening conceptual possibilities” rather than providing concrete and exact justifications (92).


Protocol: How Control Exists After Decentralization


Alexander R. Galloway’s Protocol: How Control Exists After Decentralization is by far one of the most exciting and challenging readings that I have encountered in a long time. In essence, it is a book on issues within the realm of computer science targeted towards individuals who have little or no experience within the field. By focusing his attention on the “institutional ecology” of modern computing, Galloway strives to offer a compelling and insightful look at the aspects of form, structure, and materiality within contemporary technology via the discussion of protocols, which in essence are logical rules (or arguably, formats or templates) of control that govern the exchange of data or information across a network. In due course, Galloway exposes how protocols and the advent of decentralized or distributional networks have shifted how notions such as power and control manifest in society (alluding to Foucauldian and Deleuzian frameworks), and how their existence creates a paradoxical tension between institutionalization/fixation and freedom/deterritorialization.

Although many of the concepts, ideas, and terminology discussed in this book may seem daunting and baffling to people who don’t have much experience with computer programming, Galloway illustrates complicated concepts using various heuristic aids and metaphors (his depiction of interstate highways and airports to explain how decentralized networks function was particularly illuminating). I also found his application of Marxist, Deleuzian, and Foucauldian theories to be compelling, and it really surprised me that this application helped me to better understand concepts that have been fuzzy and inaccessible to me in the past. His discussion of Foucault’s notion of biopower was particularly accessible when applied to protocols, and how they have helped transition control from a centralized presence ruled by physical and violent tendencies into a dispersed and abstract manifestation ruled by information, statistics, and quantitative data.

Given the rapidly changing nature of computer science and technology in general, it should come as no surprise that some of Galloway’s arguments and illustrative examples might seem dated and incorrect (after all, the book was published eight years ago). For instance, at one point Galloway posits that the corporate battles over video formats are moot with the presence of DVD, a format adopted according to a consensus among leaders in the film industry. Nonetheless, shortly after the publication of this book we witnessed the battle between the Blu-ray video format (led primarily by Sony and their integration of this technology into the PlayStation 3) and the now defunct HD DVD format. Not only does this demonstrate the circularity existent within the adoption of modern technology, but it also goes to challenge some of Galloway’s assumptions of how technical standards are determined in contemporary society. Nevertheless, Galloway’s text is definitely illuminating in terms of depicting the idiosyncrasies of protocols and their formal, material, and social qualities, which in turn will pave the way towards better criticism of technologies, ideas, and networks that are governed by protocological standards.

What immediately came to my mind was whether or not there are non-computational phenomena or manifestations that follow protocological guidelines. For instance, in my past studies in linguistics, language was usually approached as a centralized phenomenon regulated by core apparatuses (universal grammar, Broca’s area, etc.). But how do we explain language production in the case of children who undergo hemispherectomies, and still possess the ability to speak and decode language even when entire parts of the brain are removed? Is it possible that language acquisition, similar to the internet, is also based on decentralized protocological networks? Are there areas within literature and the arts that are also guided by structures and formats similar to protocols? The possibilities are indeed tantalizing.
