Since the end of October, seven of my colleagues (Douglas E. Duhaime, Ana M. Jimenez-Moreno, Melissa McCoul, Daniel Murphy, Santiago Quintero, Bryan Santin, Suen Wong) and I have been working on a massive project in which we are trying to “anchor” the geographical imagination of 19th Century British and American sea fiction (based on a similar project developed by Dr. Matt Wilkens). Although this project is part of the requirements for a Digital Humanities course that we are currently taking, we are beginning to see a lot of potential in our work, and we are seriously considering some future possibilities with this project.
We are just concluding the data collection and organization phase, and we are ready to study and interpret our results. Although to some extent we want to see what our data yields, we are contemplating interpreting it through the lens of Orientalism… for now. Our aim was to create a database of the locations mentioned in a large corpus of British and American sea fiction and to chart those locations on geographical maps. In this post, I will share our “methodology” and the technical details of our project. Once we begin interpreting the data, I might just share some of our findings with you as well!
The literary corpus of our project is based on John Kohnen’s Nautical Fiction List, an annotated bibliography of sea fiction (drama, fiction, and poetry) that currently includes 806 authors and 2,190 titles. Our corpus also draws on another nautical fiction list compiled by the library of the California State University – Maritime (CSUM) campus. All bibliographical entries cited in these sources were distributed evenly among us. Within our share, each of us identified all of the 19th century texts and classified them as either British or American. Once this narrowed-down list was compiled, we searched for digital full-text versions of these works, which were mostly obtained from Project Gutenberg and the Internet Archive.
Priority was given to texts found in Project Gutenberg due to their superior textual quality. Most texts found in the Internet Archive are simply physical copies and facsimiles converted into digital text using Optical Character Recognition (OCR) software, whereas texts available in Gutenberg are transcribed and revised multiple times by human volunteers, so the margin of error for Project Gutenberg texts is significantly lower. All of the information not present in the original version of the text, such as Project Gutenberg’s legal and copyright disclaimers, was stripped from the document, and each work was saved as a separate .TXT file labeled by author, title, year of publication, and nationality (British or American).
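For readers who want to try the clean-up step themselves, here is a minimal Python sketch of how the Gutenberg disclaimers could be stripped automatically. It is not the procedure we followed, just one plausible approach: it assumes each file wraps the book text between a line beginning with "*** START OF" and a line beginning with "*** END OF", which is common in Project Gutenberg files but not guaranteed. The function name strip_gutenberg_boilerplate is made up for this example.

```python
import re

# Marker lines that Project Gutenberg files commonly place around the book text.
# Assumption: the exact wording varies between files, so the patterns are loose.
START_RE = re.compile(r"^\*\*\*\s*START OF .*$", re.IGNORECASE | re.MULTILINE)
END_RE = re.compile(r"^\*\*\*\s*END OF .*$", re.IGNORECASE | re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return only the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()
```

Files without the markers pass through unchanged, so the output would still need a quick check by hand.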
In total, we were able to create a digital corpus of approximately seventy-four (74) 19th Century British maritime fiction texts and approximately thirty-five (35) 19th Century American maritime texts. This amounts to a total of one hundred and nine (109) full-text versions of 19th Century maritime fiction. There are obvious issues that need to be addressed with this corpus. First and foremost, we do not know what percentage of all the maritime fiction published in the 19th century our data represents. Nonetheless, our goal is not to discuss every maritime text, but rather to consider a larger corpus of this genre than close reading allows, in order to make claims and interpretations in keeping with the goals of distant reading. Additionally, our British corpus is more than twice the size of the American corpus. This is understandable, however, given that the publishing industry was far more developed in the British context, and perhaps also given the prominence of shipping and sea travel in the British Empire.
Named Entity Extraction and Database Creation
Locations within the texts were identified using Stanford CoreNLP, a suite of language analysis tools that processes digital English texts. In essence, each word within the text is tagged with linguistic information according to the annotators selected by the user. The software creates an .XML output file that contains all of the tokenized words in the source text tagged with their features, including but not limited to part of speech, dates, locations, times, and names. A sample token produced by Stanford CoreNLP with Part of Speech (POS, the word’s syntactic category) and Named Entity Recognizer (NER, labels for the names of things) tags would look somewhat similar to this (the text after each # is an explanation of that line):
<token id="1"> # ID number assigned to this particular token.
<word>Beaconsfield</word> # Token extracted from the source text.
<POS>NNP</POS> # Part of speech. In this case, the word is a singular proper noun.
<NER>LOCATION</NER> # NER classification. In this case, the word is tagged as a location.
</token>
Using an original script written in the Python programming language, we devised a method to extract and list all tokens with a LOCATION NER marker for each text within our corpus. The CoreNLP output for each of these texts was then briefly reviewed and corrected by hand. Tokens that were recognizably not locations (e.g. “esq,” “French,” “John,” “Sandwich King”) or locations that simply cannot be mapped geographically (e.g. “moon,” “Jupiter,” “Neptune”) were eliminated from the data. In total, our current data consists of 37,542 location mentions across all of our texts.
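To give a sense of what this kind of extraction involves, here is a short Python sketch. It is not our actual script. It assumes the CoreNLP XML nests token elements (each with word and NER children) inside sentence elements, joins consecutive LOCATION tokens into one name (so that "New" and "York" become "New York"), and drops any name found in a hand-made skip list. The function name extract_locations and the skip parameter are invented for this example.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def extract_locations(xml_text, skip=frozenset()):
    """List LOCATION mentions found in CoreNLP-style XML, in document order.

    Consecutive LOCATION tokens within a sentence are joined into one name.
    Names listed in `skip` (e.g. "moon", "Jupiter") are left out.
    """
    root = ET.fromstring(xml_text)
    found = []
    for sentence in root.iter("sentence"):
        # Pair each word with its NER label, keeping sentence order.
        tokens = [
            (tok.findtext("word"), tok.findtext("NER"))
            for tok in sentence.iter("token")
        ]
        # groupby collects runs of adjacent tokens that share the same label.
        for label, run in groupby(tokens, key=lambda pair: pair[1]):
            if label == "LOCATION":
                name = " ".join(word for word, _ in run)
                if name not in skip:
                    found.append(name)
    return found
```

A skip list only catches the false positives you already know about, so a pass by hand over the output is still needed.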
The cleaned-up version of the data was then organized into a spreadsheet. Every instance of a location token was accompanied by the following information: input file, shortened file name, title of the text, author, publication date, and nationality. This spreadsheet was then converted into a .CSV file and imported into an online database server using MySQL Workbench (for Windows), which lets us run queries that are cumbersome in Microsoft Excel, such as keeping a tally of how many times each location is mentioned in our corpus. The following query was used in MySQL Workbench to generate the counts of each location:
SELECT location, COUNT(*) FROM Maritime_Fiction WHERE nationality = 'BR' GROUP BY location ORDER BY COUNT(*) DESC; -- use 'US' for the American texts
This query, run once for the British texts and once for the American texts, generated two lists of the top 1,000 locations mentioned in the corpus along with their total counts. Here are some tables listing the most common locations found within our corpus:
Table 1. Top 15 Locations Mentioned in AMERICAN 19th Century Maritime Fiction:
Table 2. Top 15 Locations Mentioned in BRITISH 19th Century Maritime Fiction:
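Tallies like the ones behind these tables can also be reproduced without a MySQL server. The Python sketch below is only an illustration, not part of our workflow: it loads (location, nationality) pairs into an in-memory SQLite table and runs the same kind of grouped count. The table and column names and the 'BR'/'US' codes follow our query; the function name top_locations, the limit parameter, and the alphabetical tie-break are assumptions made for this example.

```python
import sqlite3


def top_locations(rows, nationality, limit=15):
    """Return (location, count) pairs for one nationality, most frequent first.

    `rows` is an iterable of (location, nationality) tuples.
    Ties are broken alphabetically so the output order is stable.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Maritime_Fiction (location TEXT, nationality TEXT)"
    )
    conn.executemany("INSERT INTO Maritime_Fiction VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT location, COUNT(*) AS mentions FROM Maritime_Fiction "
        "WHERE nationality = ? GROUP BY location "
        "ORDER BY mentions DESC, location ASC LIMIT ?",
        (nationality, limit),
    ).fetchall()
    conn.close()
    return result
```

Calling it once with "BR" and once with "US" gives the two ranked lists.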
Needless to say, our data has yielded some very interesting results, and I am eager to see what findings we’ll discover and what interpretations we’ll make!
Geospatial Information and Mapping
The complete British and American tables were uploaded to Google Fusion Tables, an experimental data management and online visualization tool that allows one to process large sets of data and turn them into maps and charts. Luckily, Google Fusion Tables integrates Google’s Geocoding API services, which map archaic place names to their contemporary equivalents, standardize alternate spellings of a location’s name, and convert each written location into latitude and longitude coordinates. These coordinates are then used to create stunning visualizations of all the locations present within a data set. Every location mentioned in the corpus is marked by a colored dot. When one hovers the cursor over one of these dots, the dot’s meta-information (such as location name and count) is displayed. Here are some snapshots of what these visualizations look like from afar:
Well, that’s all I’m sharing with you for now. Our data seems to have a lot of potential. In theory, there are dozens of interesting claims that we can make, and there are definitely other avenues that we will explore in terms of creating visualizations for our data.
Wish us luck!