Hidden Roads and Twisted Paths: Intertextual Discovery using Clusters, Classifications, and Similarities

paper
Authorship
  1. Charles Cooney

    University of Chicago

  2. Russell Horton

    University of Chicago

  3. Mark Olsen

    University of Chicago

  4. Robert Voyer

    University of Chicago

  5. Glenn Roe

    University of Chicago

Work text

While information retrieval (IR) and text analysis in the
humanities may share many common algorithms and
technologies, they diverge markedly in their primary objects
of analysis, use of results, and objectives. IR is designed to
find documents containing textual information bearing
on a specified subject, frequently by constructing abstract
representations of documents or “distancing” the reader from
texts. Humanistic text analysis, on the other hand, is aimed
primarily at enhancing understanding of textual information as
a complement to the “close reading” of relatively short passages.
Interpreting a poem, a novel or a philosophical treatise in the
context of large digital corpora is, we would argue, a constant
interplay between the estranging technologies of machine
learning and the direct reading of passages in a work. A
significant element of interpreting or understanding a passage
in a primary text is based on linking its specific elements
to parts of other works, often within a particular temporal
framework. To paraphrase Harold Bloom, ‘understanding’
in humanistic textual scholarship, ‘is the art of knowing the
hidden roads that go from text to text’. [1] Indeed, situating
a passage of a text in a wider context of an author’s oeuvre,
time period, or even larger intellectual tradition, is one of the
hallmarks of textual hermeneutics.
While finding the “hidden roads and twisted paths” between
texts has always been subject to the limitations of human
reading and recollection, machine learning and text mining
offer the tantalizing prospect of making that search easier.
Computers can sift through ever-growing collections of
primary documents to help readers find meaningful patterns,
guiding research and mitigating the frailties of human memory.
We believe that a combination of supervised and unsupervised
machine learning approaches can be integrated to propose
various kinds of passages of potential interest based on the
passage a reader is examining at a given moment, overcoming
some limitations of traditional IR tools. One-dimensional
measures of similarity, such as the single numerical score
generated by a vector space model, fail to account for the
diverse ways texts interact. Traditional ‘ranked relevancy
retrieval’ models assume a single measure of relevance that
can be expressed as an ordered list of documents, whereas
real textual objects are composed of smaller divisions, each of
which is relevant to other text objects in various complex
ways. Our goal is to utilize the many machine learning tools
available to create more sophisticated models of intertextual
relation than the monolithic notion of “similarity.”
We plan to adapt various ideas from information retrieval and
machine learning for our use. Measures of document similarity
form the basis of modern information retrieval systems, which
use a variety of techniques to compare input search strings to
document instances. In 1995, Singhal and Salton[2] proposed
that vector space similarity measures may also help to identify
related documents without human intervention such as human-embedded
hypertext links. Work by James Allan[3] suggests
the possibility of categorizing automatically generated links
using innate characteristics of the text objects, for example by
asserting the asymmetric relation “summary and expansion”
based on the relative sizes of objects judged to be similar.
Because we will operate not with a single similarity metric but
with the results of multiple classifiers and clusterers, we can
expand on this technique, categorizing intertextual links by
feeding all of our data into a final voting mechanism such as a
decision tree.
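As a minimal sketch of the two ideas just described, the following Python fragment scores a pair of text objects with a cosine measure over term-frequency vectors and, for pairs above a threshold, assigns a link type from their relative sizes. It is an illustration only and not the implementation used by any of the cited systems: the function names, the 0.3 threshold, the 2.0 size ratio, and the three link labels are all assumptions.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector over lowercased word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def propose_link(source, target, threshold=0.3, size_ratio=2.0):
    """Return (score, link_type), or None when the pair scores below threshold.

    threshold and size_ratio are illustrative values, not figures from the paper.
    """
    va, vb = term_vector(source), term_vector(target)
    score = cosine(va, vb)
    if score < threshold:
        return None
    len_a, len_b = sum(va.values()), sum(vb.values())
    if len_a >= size_ratio * len_b:
        link_type = "expansion"  # source is much longer than target
    elif len_b >= size_ratio * len_a:
        link_type = "summary"    # source is much shorter than target
    else:
        link_type = "similar"    # comparable sizes, no asymmetric relation asserted
    return score, link_type
```

The size test is deliberately crude; in a fuller system it would be only one of several inputs to the final voting step.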
Our experiments with intertextual discovery began with using
vector space calculations to try to identify that most direct of
intertextual relationships, plagiarism or borrowing. Using the
interactive vector space function in PhiloMine[4], we compared
the 77,000 articles of the 18th-century Encyclopédie of Diderot
and d’Alembert to the 77,000 entries in a reference work
contemporary to it, the Dictionnaire universel françois et latin,
published by the Jesuits in the small town of Trévoux. The Jesuit
defenders of the Dictionnaire de Trévoux, as it was popularly
known, loudly accused the Encyclopédists of extensive
plagiarism, a charge Diderot vigorously denied, but which
has never been systematically investigated by scholars. Our
procedure is to compare all articles beginning with a specific
letter in the Encyclopédie to all Trévoux articles beginning with
the same letter. For each article in the Encyclopédie, the system
displays each Trévoux article that scores above a user-designated
similarity threshold. Human readers manually inspect possible
matches, noting those that were probably plagiarized. Having
completed 18 of 26 letters, we have found that more than 2,000
articles (over 5% of those consulted) in the Encyclopédie were
“borrowed” from the Dictionnaire de Trévoux, a surprisingly
large proportion given the well-known antagonism between
the Jesuits and Encyclopédists.[5]
The Encyclopédie experiment has shown us strengths and
weaknesses of the vector space model on one specific kind
of textual relationship, borrowing, and has spurred us to
devise an expanded approach to find additional types of
intertextual links. Vector space proves to be very effective
at finding textual similarities indicative of borrowing, even in
cases where significant differences occur between passages.
However, vector space matching is not as effective at finding
articles borrowed from the Trévoux that became parts of
larger Encyclopédie articles, suggesting that we might profit
from shrinking the size of our comparison objects, with
paragraph-level objects being one obvious choice. In manually
sifting through proposed borrowings, we also noticed articles
that weren’t linked by direct borrowing, but in other ways
such as shared topic, tangential topic, expansion on an idea,
differing take on the same theme, etc. We believe that some
of these qualities may be captured by other machine learning
techniques. Experiments with document clustering using
packages such as CLUTO have shown promise in identifying
text objects of similar topic, and we have had success using
naive Bayesian classifiers to label texts by topic, authorial style
and time period. Different feature sets also offer different
insights, with part-of-speech tagging reducing features to a
bare, structural minimum and N-gram features providing a
more semantic perspective. Using clustering and classifiers
operating on a variety of feature sets should improve the
quality of proposed intertextual links as well as providing
a way to assign different types of relationships, rather than
simply labeling two text objects as broadly similar.
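The weakness noted above, where a borrowed article is absorbed into a longer one, can be illustrated with a short Python sketch that scores a candidate against each paragraph of an article as well as against the article as a whole. This is a sketch under stated assumptions rather than a description of PhiloMine: the function names, the blank-line paragraph convention, and the toy texts are hypothetical.

```python
import math
import re
from collections import Counter

def vec(text):
    """Term-frequency vector over lowercased word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_paragraph_match(article, candidate):
    """Return (whole_article_score, best_paragraph_score, best_paragraph_index).

    Paragraphs are assumed to be separated by blank lines.
    """
    cand = vec(candidate)
    whole = cosine(vec(article), cand)
    paragraphs = [p for p in re.split(r"\n\s*\n", article) if p.strip()]
    if not paragraphs:
        return whole, 0.0, None
    scores = [cosine(vec(p), cand) for p in paragraphs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return whole, scores[best], best

# Hypothetical toy texts: the second paragraph reuses the candidate verbatim.
ARTICLE = "alpha beta gamma delta\n\nun deux trois quatre\n\nepsilon zeta eta theta"
CANDIDATE = "un deux trois quatre"
whole, best_score, best_index = best_paragraph_match(ARTICLE, CANDIDATE)
```

In this toy case the article-level score is diluted by the unrelated paragraphs, while the paragraph-level score singles out the reused passage, which is the behavior that motivates smaller comparison objects.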
To test our hypothesis, we will conduct experiments linking
Encyclopédie articles to running text in other contemporaneous
French literature and reference materials using the various
techniques we have described, with an emphasis on
intelligently synthesizing the results of various machine learning
techniques to validate and characterize proposed linkages. We
will create vector representations of surface form, lemma,
and N-gram feature sets for the Encyclopédie and the object
texts as a pre-processing step before subjecting the data
to clustering and categorization of several varieties. Models
trained on the Encyclopédie will be used to classify and cluster
running text, so that for each segment of text we will have a
number of classifications and scores that show how related
it is to various Encyclopédie articles and classes of articles. A
decision tree will be trained to take into account all of the
classifications and relatedness measures we have available,
along with innate characteristics of each text object such as
length, and determine whether a link should exist between
two given text objects, and if so what kind of link. We believe
a decision tree model is a good choice because such models
excel at generating transparent classification procedures from
low-dimensionality data.
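To make the final voting step concrete, the sketch below trains a small decision tree on a handful of feature rows and uses it to label a proposed link. It is a hand-rolled, CART-style tree with Gini impurity, chosen here only for illustration; the paper does not specify a tree algorithm. The three features (a cosine score, a same-topic flag from a classifier, and a length ratio), the four link labels, and every training value are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Find the (impurity, feature, threshold) with the lowest weighted Gini impurity."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Return a leaf label (str) or a (feature, threshold, left, right) node."""
    split = None
    if depth < max_depth and len(set(labels)) > 1:
        split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def classify(tree, row):
    """Walk the tree from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical training data: [cosine score, same-topic flag, length ratio].
ROWS = [[0.85, 1, 1.0], [0.80, 1, 1.2], [0.75, 1, 4.0], [0.70, 1, 5.0],
        [0.30, 1, 1.1], [0.25, 1, 0.9], [0.10, 0, 1.0], [0.05, 0, 3.0]]
LABELS = ["borrowing", "borrowing", "expansion", "expansion",
          "shared_topic", "shared_topic", "no_link", "no_link"]

tree = build_tree(ROWS, LABELS)
link = classify(tree, [0.9, 1, 1.0])
```

Because each node is a single threshold on a named measure, the resulting tree can be read directly as a set of rules, which is the transparency argument made above.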
The toolbox that we have inherited or appropriated from
information retrieval needs to be extended to address
humanistic issues of intertextuality that are irreducible
to single numerical scores or ranked lists of documents.
Humanists know that texts, and parts of texts, participate in
complex relationships of various kinds, far more nuanced than
the reductionist concept of “similarity” that IR has generally
adopted. Fortunately, we have a wide variety of machine
learning tools at our disposal which can quantify different
kinds of relatedness. By taking a broad view of all these
measures, while looking narrowly at smaller segments of texts
such as paragraphs, we endeavor to design a system that can
propose specific kinds of lower-level intertextual relationships
that more accurately reflect the richness and complexity of
humanities texts. This kind of tool is necessary to aid the
scholar in bridging the gap between the distant view required
to manipulate our massive modern text repositories, and the
traditional close, contextual reading that forms the backbone
of humanistic textual study.
Notes
1. “Criticism is the art of knowing the hidden roads that go from
poem to poem”, Harold Bloom, “Interchapter: A Manifesto for
Antithetical Criticism” in The Anxiety of Influence: A Theory of Poetry
(Oxford University Press, New York, 1973).
2. Singhal, A. and Salton, G. “Automatic Text Browsing Using Vector
Space Model” in Proceedings of the Dual-Use Technologies and Applications
Conference, May 1995, 318-324.
3. Allan, James. “Automatic Hypertext Linking” in Proc. 7th ACM
Conference on Hypertext, Washington DC, 1996, 42-52.
4. PhiloMine is the set of text mining extensions to PhiloLogic which the
ARTFL Project released in Spring 2007. Documentation, source code,
and many examples are available at http://philologic.uchicago.edu/philomine/
A word on current work (bigrams, better normalization,
etc).
5. An article describing this work is in preparation for submission to
Text Technology.

Conference Info

Complete

ADHO - 2008

Hosted at University of Oulu

Oulu, Finland

June 25, 2008 - June 29, 2008

135 works by 231 authors indexed

Conference website: http://www.ekl.oulu.fi/dh2008/

Series: ADHO (3)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None