Macro Analysis (2.0)

Authorship
  1. Matthew Jockers

    Stanford University

Work text

At the 2005 meeting in Victoria, I presented a paper with
the revised title, “A Macro-Economic Model for Literary
Research.” That paper marked an early phase in a project aimed
at leveraging large digital text corpora for what I now call
“Macro-Analysis.” The work presented in Victoria was largely
a proof-of-concept using an electronic bibliography of 775
works of Irish-American literature. For that paper I performed
a dumbed-down sort of macro-analytic text analysis using
metadata fields and the titles of works in the bibliography.
Today the software has improved significantly, and in place of
the title analysis of 2005, this paper offers results derived from
a macro-analysis of the full text of 1125 British and American
novels from the 19th century. In the presentation, I provide a
general overview of the tool(s) and an explanation of the
methodology employed in the analysis.
The tools and techniques I have developed utilize both supervised
and unsupervised text-mining techniques. The supervised
techniques allow for a focused analysis in which a researcher
probes the corpus for items meeting a specific research criterion.
A very simple example might involve tracing the behavior, or
“frequency,” of some “signal” (a linguistic pattern, literary theme,
or author style) over the course of the corpus. I should note
here that while this process sounds similar in some ways to the
supervised machine learning approach being used by the NORA
project, it is specifically not like NORA in that I am not
employing machine learning or utilizing previously identified,
“marked,” training data. Instead, the “signal” is developed ad
hoc by the researcher/user.
An easy way to understand the project is to visualize the tool
being used: A user is given an interface that allows for the usual
sort of corpus searching. The user performs a corpus-wide
search for some term (or other feature, such as a word cluster
or syntactical pattern). The result page reports all of the texts
in the corpus in which the search term(s) is found, and then the
user is given a “toolbox” of macro-analytic tools with which
to process and analyze the result set. These tools are varied and
perform diverse sorts of analysis.
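
As a rough illustration of the search-then-analyze workflow described above, the sketch below shows how a corpus-wide search might produce a result set for the macro-analytic tools to operate on. The corpus layout and function names here are hypothetical stand-ins, not the actual interface of the tool.

import re

def search_corpus(corpus, pattern):
    """Return the subset of texts whose full text matches the search pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [doc for doc in corpus if regex.search(doc["text"])]

# corpus: a list of records such as {"title": ..., "year": ..., "text": ...}
# result_set = search_corpus(corpus, r"\bromance\b")
# The result set is then handed to the "toolbox" of macro-analytic tools,
# e.g. the topic-modeling and timeline sketches further below.
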
A topic-modeling tool, for example, provides the ability to
harvest the salient themes from the texts in the result set. The
user is thus able to say, for example, that works in the corpus
that contain the word “x” show a predominance of topic “n”.
In my own work with ethnic American literature, I have found
this tool valuable in assessing and quantifying the dominant
themes that occur in works where ethnic markers (words
denoting race or ethnicity) occur. This technique is derived
from the work of David Newman and his research team at the
University of California, Irvine.[1]
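
The abstract does not name the underlying model, but Newman's topic-modeling research is commonly associated with latent Dirichlet allocation (LDA). The sketch below uses scikit-learn's LDA implementation purely to illustrate how dominant topics might be harvested from a result set; it is not the paper's actual code.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def dominant_topics(result_set, n_topics=10, n_top_words=8):
    # Build a bag-of-words matrix over the texts in the result set.
    texts = [doc["text"] for doc in result_set]
    vectorizer = CountVectorizer(stop_words="english", max_features=5000)
    counts = vectorizer.fit_transform(texts)
    # Fit an LDA model and read off the highest-weight words per topic.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    vocab = vectorizer.get_feature_names_out()
    topics = []
    for weights in lda.components_:
        top = weights.argsort()[::-1][:n_top_words]
        topics.append([vocab[i] for i in top])
    return topics

# topics = dominant_topics(search_corpus(corpus, r"\birish\b"))
# Each entry lists the high-weight words characterizing one theme found
# in the works that contain the search term.
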
Another tool offers a type of literary time-series analysis. Figure
1 shows a graph produced by my “timeline” tool.
Figure 1
With the timeline tool, the results of any query of the corpus
can be mapped over time. Say, for example, that I am interested
in how some textual feature evolves over time. I perform the
necessary query to isolate the occurrences of that feature and
then choose the timeline tool. The resulting graph provides a
visual time-series analysis of the frequency of the pattern. The
graph shown here was produced after a simple search for
occurrences of the word “romance” in the titles of 7300 novels
from the 18th and 19th centuries. The raw counts are displayed
in red beneath each year. In addition to providing this timeline
of raw hits (figure 1), a second graph (figure 2) is also produced
that shows a normalized result in the form of “hits-per-100”
texts.
Figure 2
In this case, the normalized graph is particularly revealing
because it shows that the frequency of “romance” as a key word
in titles is not especially noteworthy. Aside from a very brief
period (1800-1810) in which the word appears in 2-3% of all titles,
its use holds steady at about 1%, or 1 occurrence per 100 titles, in
a given year.
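
The normalization behind figure 2 is simple to state: the raw hits in a year are divided by the number of texts published that year and scaled to "hits per 100 texts." The sketch below illustrates that arithmetic; the data structures are hypothetical.

from collections import Counter

def timeline(corpus, result_set):
    # Count how many texts the corpus contains for each year,
    # and how many of those texts appear in the result set.
    texts_per_year = Counter(doc["year"] for doc in corpus)
    hits_per_year = Counter(doc["year"] for doc in result_set)
    series = {}
    for year in sorted(texts_per_year):
        raw = hits_per_year.get(year, 0)
        per_100 = 100.0 * raw / texts_per_year[year]  # "hits per 100 texts"
        series[year] = (raw, per_100)
    return series

# For the "romance" example, a per_100 value of about 1.0 corresponds to the
# steady 1 occurrence per 100 titles reported for figure 2.
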
The macro-analytic tools developed in this research exist both
as command-line applications and as a (beta) extension to
the open-source eXtensible Text Framework (XTF) application
developed by Martin Haye and the California Digital Library.[2]
The successful implementation of the tools into XTF has been
achieved with the assistance of Stanford undergraduate digital
humanities major Jenny Loomis, who will spend the final five
minutes of this presentation giving a live demonstration of the
tool as implemented in the XTF application.

1. Newman's work is profiled online at the following sites:
<http://arstechnica.com/news.ars/post/20060802-7408.html>
<http://www.ics.uci.edu/community/news/press/view_press?id=51>
2. See <http://www.cdlib.org/inside/projects/xtf/>


Conference Info

Complete

ADHO - 2007

Hosted at University of Illinois, Urbana-Champaign

Urbana-Champaign, Illinois, United States

June 2, 2007 - June 8, 2007

106 works by 213 authors indexed

Series: ADHO (2)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None