The terms index and topic originated in the same classical disciplines: rhetoric and logic. Index refers to the longer phrases index locorum or index locorum communium, meaning index of places or index of common places (Walter Jackson Ong 123). Topic comes from the Greek word topoi (also meaning places), which referred to lines of argument and themes that orators drew upon in the process of composing their speeches. The index and the topic both belonged to cultural practices of organizing and storing ideas that could be recalled by memory. The development of written and print cultures through the early modern period in Europe transformed these oral rhetorical concepts into our modern notions of indexicality and topicality that refer to any system of categorical organization.
The shift from the rhetorical to the indexical also entailed a transition from local oral forms to universal print ones. Our turn to the eighteenth-century subject index (a genre that emerged when the former usage still held and the latter usage was developing) indicates an attempt to draw upon the terms and concepts of relationality found in the earlier rhetorical discourse as a way to approach the results of digital methods. This move offers a way of grappling with the oft-cited difficulty of situating and interpreting ‘signals’ identified by quantitative text analysis.
Our comparative approach has an important precedent in the emergence of media studies in the mid-twentieth century, which responded, in part, to the proliferation of electronic technologies. Scholars of orality and literacy such as Albert Lord, Marshall McLuhan, Elizabeth Eisenstein, and Walter Ong compared new forms of oral transmission that were invented in the late nineteenth and twentieth centuries to historical forms of orality, including oratory, ballads, and epic poetry. Looking back on this boom of electronic orality, Ong observes, “Contrasts between electronic media and print sensitized us to the earlier contrast between writing and orality” (Walter Jackson Ong 2–3). The culture of visualization that has developed on the web and in the academy as an outgrowth of digital media presents a comparable opportunity to scholars. Visualization has emerged as a new kind of research genre. We have attempted to interpret the consequences of visualization for humanistic interpretation by examining genres and concepts that emerged in what scholars of print consider to be the original age of visualization: the early modern period. (See, for instance, Ong on “visualist analogies” in Ramus, Method, and the Decay of Dialogue or John Bender and Michael Marrinan’s argument in The Culture of Diagram.)
In most cases, quantitative methods like topic modeling are used to identify a pattern within a set of textual data, and that pattern is interpreted within an already established set of historical, cultural, or generic expectations. D. Sculley and Bradley Pasanek have usefully explored the challenge this approach presents because of its tendency to produce hermeneutic circularity (Sculley and Pasanek 410). Our approach does not solve the problem of circularity; we accept that it is a given obstacle that faces any interpretive act. Instead we confront circularity by relating the computational model to another historical referent (the subject index), which turns our interpretation from the model itself to the relationship between the model and the index, at specific sites of correlation and contradiction.
We first imagined our tool, the Networked Corpus, as an algorithmic method for marking passages with topical or discursive similarities. Inspired by the practice of cross-referencing passages in printed books, the tool generates hyperlinks between passages that share topics according to a topic model. From this original conception, the tool echoed the practice of Renaissance commonplacing, which involved collecting literary exemplars under thematic or logical headings. Our tool differed, however, in not giving fixed names to the topical units: instead of enabling navigation from heading to passage, the Networked Corpus encourages users to navigate from one passage to another, without assuming any prior knowledge about what the topics that connect them might be.
One of our reasons for choosing this design is that the topics of topic modeling are difficult to name in a way that accurately reflects what they represent. Although the topics can sometimes appear to correspond to headings such as one might find in an index, they almost always contain words that are difficult to account for in this simplistic sort of interpretation. Thinking about what sort of navigation paradigms might work best with topic modeling led us to more general questions about the ways in which abstract models of texts can guide reading. It also occurred to us that some of the issues we were dealing with could apply to the sort of print indexing that became popular in the late eighteenth century, which could be said to involve its own abstract model of textual content.
We decided to put these two different models into dialogue by comparing a print index from the eighteenth century to a topic model trained on the same text. We chose the index from the 1784 edition of Adam Smith’s Wealth of Nations because it is an exceptionally detailed index, and because Smith’s text exemplifies one of the theoretical issues that we’re dealing with—the relation between abstract models and concrete particulars. The text also frequently switches between different conceptual frameworks that have distinctive vocabularies, making it unusually well-suited to topic modeling, which generally does not work very well when trained on a single book. We downloaded a copy of the text and index from Project Gutenberg, and used a Python script to split the file up and parse the index into a data structure that could be easily manipulated.
Our first goal after parsing the index was to come up with a way of determining how similar it was to a topic model. To do this, we needed to find possible matches between index headings and topics. There is a conceptual difficulty here, because the index and the topic model do not have quite the same structure. While the topic model assumes that pages can draw on topics to varying degrees represented by numbers between zero and one, the index headings either refer to a page or do not. There are also a large number of index entries for very specific subjects that only refer to a few pages, so we would only expect a fraction of the index headings to correlate with any topics.
We decided that the best way of dealing with this was to use a rank correlation formula—specifically, Spearman’s rho. This method correlates topics with index headings entirely in terms of a rank ordering; in the case of the topic model, the pages are ranked by the coefficient for the topic, and in the case of the index, all of the pages indexed under the heading are ranked above all those that are not. Using this definition, a perfect correlation would mean that the pages indexed under that heading always have a higher topic coefficient than the ones that are not. Each of the cases where this is not true decreases the correlation coefficient.
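The following is a minimal sketch of this kind of calculation rather than the project's actual code. It assumes that a topic is available as a mapping from page number to coefficient and that an index heading is available as a set of page numbers; the function names and the toy numbers are hypothetical. Ties are given averaged ranks, which is one common convention and an assumption here.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        # Rho is undefined when one side is constant (for example, a heading
        # that indexes no pages); returning 0.0 is an assumption of this sketch.
        return 0.0
    return cov / sqrt(vx * vy)


def heading_topic_rho(topic_weights, indexed_pages):
    """Correlate one topic with one index heading.

    topic_weights: {page: coefficient between 0 and 1} (hypothetical layout)
    indexed_pages: set of pages listed under the heading
    """
    pages = sorted(topic_weights)
    weights = [topic_weights[p] for p in pages]
    # Indexed pages get 1 and all others 0, so every indexed page outranks
    # every page that is not indexed.
    membership = [1 if p in indexed_pages else 0 for p in pages]
    return spearman_rho(weights, membership)


# Toy data: pages 3 and 4 are indexed under the heading.
separated = heading_topic_rho({1: 0.1, 2: 0.2, 3: 0.8, 4: 0.9}, {3, 4})
mixed = heading_topic_rho({1: 0.1, 2: 0.8, 3: 0.2, 4: 0.9}, {3, 4})
print(round(separated, 3), round(mixed, 3))  # 0.894 0.447
```

In the toy data, moving one unindexed page above an indexed one lowers the coefficient. Note that with averaged ranks for ties, a heading whose pages all outrank the rest receives the largest value attainable for that heading, which can fall short of 1 because the index side of the comparison has only two distinct ranks.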
We first used this calculation to determine how many topics the topic model should include in order to match the index as well as possible according to the coefficient that we had selected. For each possible number of topics from 5 to 60, we generated 40 topic models on The Wealth of Nations using the topic modeling program MALLET (automating the runs with Python). For each of these models, we then determined the number of index headings that correlate relatively well (rho >= 0.25) with some topic in the model. We plotted the results using R.
Since the number of index headings matched does not increase much after the number of topics exceeds 40, we concluded that a topic model with 40 topics would be the best to compare to the index, and generated a model with this number of topics to use as our comparator.
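The counting step behind this comparison can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it takes the rho values as already computed, the nested dictionary layout and function names are hypothetical, and averaging the counts over the models trained for each number of topics is our assumption about how the runs might be summarized. The numbers in the toy data are made up.

```python
from statistics import mean

RHO_THRESHOLD = 0.25  # the cutoff for a heading that correlates "relatively well"


def count_matched_headings(rho_by_heading, threshold=RHO_THRESHOLD):
    """Count headings whose best-correlated topic reaches the threshold.

    rho_by_heading: {heading: [rho with topic 0, rho with topic 1, ...]}
    """
    return sum(
        1
        for rhos in rho_by_heading.values()
        if rhos and max(rhos) >= threshold
    )


def mean_matches_by_topic_count(runs):
    """Average the matched-heading count over the models for each topic count.

    runs: {number of topics: [rho_by_heading for each trained model]}
    """
    return {
        num_topics: mean(count_matched_headings(model) for model in models)
        for num_topics, models in sorted(runs.items())
    }


# Toy data with invented rho values: two models with 5 topics, one with 10.
toy_runs = {
    5: [
        {"Labour": [0.10, 0.30], "Corn": [0.05, 0.10]},
        {"Labour": [0.20, 0.10], "Corn": [0.00, 0.10]},
    ],
    10: [
        {"Labour": [0.40], "Corn": [0.30]},
    ],
}
print(mean_matches_by_topic_count(toy_runs))
```

A summary of this kind, computed for each number of topics, is what one would plot to see where the number of matched headings levels off.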
Although we did find correlations that were strong enough to establish matches between the index and the topic model, the highest correlation coefficients are still fairly low (generally around 0.35), suggesting that the topic model does not do a very good job of predicting where a page will appear in the index. However, many of the matches do make conceptual sense, despite the large number of pages where the correlations break down. For example, the topic with the top words “wages labour common workmen employments year employment” correlates best with the index heading “Labour”. Our hunch was that looking at the particular passages where the index and the topic model fail to match up could be revealing about the different assumptions that underlie the two models. To enable this sort of reading, we created a special version of the Networked Corpus that shows the index and the topic model side-by-side, and enables the user to view a list of the passages where a particular heading and topic do and do not coincide. You can view this tool online at http://www.networkedcorpus.com/smith/topic-index.html.
Interpreting the two models by means of this visualization has enabled us to gain a new perspective on a technology that is so familiar as to appear transparent—the index—and has also helped us to better understand some of the conceptual issues that can arise in the interpretation of topic models. Although the code that we developed for comparing topic models and indexes is of relatively limited applicability, we believe that our theoretical approach could be applied elsewhere. Many of the computational methods we use today have precedents in the pre-computer era; and many artifacts from the past can be understood as embodiments of abstract models that could be interpreted through comparison with computational analogues. Our purpose in writing software to facilitate these comparisons is not the construction of new tools that can be used repeatedly, but the interrogation of the tools that we are already using, be they old or new.
All of the code we used in this project is available online at http://github.com/jeffbinder/networkedcorpus and http://github.com/jeffbinder/adamsmith. Our project Web site is http://www.networkedcorpus.com.
McCallum, Andrew Kachites. “MALLET: A Machine Learning for Language Toolkit.” http://mallet.cs.umass.edu. 2002. Online.
Bender, John, and Michael Marrinan. The Culture of Diagram. Stanford: Stanford University Press, 2010. Print.
Ong, Walter J. Ramus, Method, and the Decay of Dialogue: From the Art of Discourse to the Art of Reason. Chicago; London: The University of Chicago Press, 2004. Print.
Ong, Walter Jackson. Orality and Literacy. London; New York: Routledge, 2002. Print.
Sculley, D., and B. M. Pasanek. “Meaning and Mining: The Impact of Implicit Assumptions in Data Mining for the Humanities.” Literary and Linguistic Computing 23.4 (2008): 409–424. Print.