Delta, Delta Prime, and Modern American Poetry: Authorship Attribution Theory and Method

paper
Authorship
  1. David Hoover

    New York University

In the three years since John F. Burrows presented Delta,
his new measure of authorial difference, in his Busa Award
lecture (2001), there has been a flurry of activity in the
authorship attribution community and beyond. Delta measures
the difference between test texts and a set of texts by possible
authors in an elegantly simple way: the frequencies of the most
frequent words in the test text and in each of the primary texts
are compared with their mean frequencies in the primary set.
The difference between the test text and the mean, expressed
as a z-score for each word, is then compared with the
corresponding difference between the texts by each author in
the primary set and the mean. The absolute values of the
differences between the z-scores for all the words are then
summed and the mean is calculated, producing Delta, "the mean
of the absolute differences between the z-scores for a set of
word-variables in a given text-group and the z-scores for the
same set of word-variables in a target text" (Burrows 2002a,
271). The primary author whose texts show the smallest Delta,
the smallest mean difference, from the test text has the best
claim to being the author of the test text.
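
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not Burrows's or Hoover's implementation: the function names, the use of one pooled sample per author, the population standard deviation, and the decision to skip words with zero variance are all assumptions made here for the sake of a runnable example.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one tokenized text."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in vocab}


def burrows_delta(test_tokens, primary_texts, n_words=150):
    """Return {author: Delta} for a test text.

    primary_texts maps an author name to a list of tokens (assumption:
    one pooled sample per author). n_words is the number of most
    frequent words compared.
    """
    # Most frequent words across the whole primary set.
    pooled = Counter()
    for tokens in primary_texts.values():
        pooled.update(tokens)
    vocab = [w for w, _ in pooled.most_common(n_words)]

    freqs = {a: relative_freqs(t, vocab) for a, t in primary_texts.items()}
    test = relative_freqs(test_tokens, vocab)

    totals = {a: 0.0 for a in primary_texts}
    used = 0
    for w in vocab:
        values = [freqs[a][w] for a in primary_texts]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # assumption: a word with no variation carries no signal
        used += 1
        z_test = (test[w] - mu) / sigma
        for a in primary_texts:
            z_author = (freqs[a][w] - mu) / sigma
            totals[a] += abs(z_test - z_author)
    if used == 0:
        raise ValueError("no word varies across the primary set")
    # Delta is the mean absolute difference between z-scores.
    return {a: total / used for a, total in totals.items()}
```

Calling `min(deltas, key=deltas.get)` on the returned dictionary gives the primary author with the smallest Delta, that is, the likeliest candidate under this measure.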
Burrows has published two articles demonstrating the
effectiveness of Delta on Restoration poetry, even for small
texts (2002a, 2003), and has applied the technique to the
interplay between translation and authorship in "The Englishing
of Juvenal: Computational Stylistics and Translated Texts"
(2002b). David L. Hoover has just published two studies
involving Delta (2004a, 2004b) that automate the process of
calculating and evaluating the results of Delta in an Excel
spreadsheet with macros. Hoover's first article demonstrates
Delta's effectiveness on early 20th century novels, and shows
that increasing the number of frequent words to be analyzed
far beyond the 150 most frequent that Burrows uses—to the
700 or 800 most frequent—substantially improves the results,
as does the removal of personal pronouns and words that are
frequent in the entire corpus only because they are extremely
frequent in a single text. It also shows that large drops in Delta
from the first to the second likeliest author are strongly
associated with correct attributions. The second article shows
that it is possible to improve the accuracy of attribution by
Delta by selecting subsets of the word frequency list for analysis
and by changing the formula of Delta itself, and also extends
the testing of the measures to contemporary literary criticism,
where they continue to perform very well. These new methods
recapture information about whether a word is more or less
frequent than the mean, about how different the test text is from
the mean, about the size of the absolute difference between the
test text and each primary text, and about the direction of the
difference between the test text and the primary text.
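
Two of the refinements mentioned above, culling the word list and inspecting the drop in Delta between the two likeliest authors, can be sketched as follows. This is only an illustration under stated assumptions: the pronoun list, the `max_share` cutoff of 0.7, and both function names are hypothetical choices made for this example, and the raw difference returned by `delta_gap` is a simple stand-in for whatever measure of the drop the published studies use. The Delta Prime formulas themselves are not given in this abstract and are not reproduced here.

```python
from collections import Counter

# Illustrative list only; the abstract does not enumerate the pronouns removed.
PERSONAL_PRONOUNS = {
    "i", "me", "my", "mine", "we", "us", "our", "ours",
    "you", "your", "yours", "he", "him", "his", "she", "her", "hers",
    "it", "its", "they", "them", "their", "theirs",
}


def culled_word_list(texts, n_words=800, max_share=0.7):
    """Most frequent words after removing pronouns and one-text words.

    texts maps a text label to a list of lowercase tokens. A word is
    dropped when a single text supplies more than max_share of its
    occurrences in the corpus (hypothetical cutoff).
    """
    per_text = {label: Counter(tokens) for label, tokens in texts.items()}
    pooled = Counter()
    for counts in per_text.values():
        pooled.update(counts)

    kept = []
    for word, total in pooled.most_common():
        if word in PERSONAL_PRONOUNS:
            continue
        largest = max(counts[word] for counts in per_text.values())
        if largest / total > max_share:
            continue  # frequent overall only because of one text
        kept.append(word)
        if len(kept) == n_words:
            break
    return kept


def delta_gap(deltas):
    """Return (likeliest author, gap to the second likeliest author).

    deltas maps author -> Delta and must contain at least two authors.
    """
    ranked = sorted(deltas.items(), key=lambda item: item[1])
    (best, first), (_, second) = ranked[0], ranked[1]
    return best, second - first
```

A large gap returned by `delta_gap` corresponds to the "large drop" from the first to the second likeliest author; how large a gap should count as convincing is an empirical question that this sketch does not answer.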
Although Burrows's Delta is simple and intuitively reasonable,
it lacks any compelling theoretical justification, as do previous
statistical authorship attribution techniques and Hoover's
alterations. Nonetheless, it and some of the
variations upon it are manifestly and surprisingly effective,
even in difficult open authorship attribution situations in which
the claimants cannot be limited to a small number by traditional
means. Several researchers have further studies underway that
are not yet ready for public discussion: a 'real life' attribution
problem on 19th century prose, another on a Middle English
saint's life, and an application of the technique and its
variants to biology.
In this paper I investigate the effectiveness of Delta and
Hoover's various Delta Prime candidates on a corpus of
1,430,000 words of Modern American Poetry by poets born
between 1902 and 1943. This investigation returns to poetry
but brings the techniques forward to the 20th century. Although
it is well known that changes in language and style across long
spans of time are very considerable, and that many authorship
attribution techniques are sensitive to these differences,
preliminary results show that Delta and the various Delta Primes
are even more accurate on the corpus investigated here than on
the Restoration poetry that Burrows investigated. They are so
accurate, in fact, that the differences between the original Delta
and the alternatives are relatively small (it is difficult to improve
much on 100% accuracy). These results may be related to a
greater individuality in poetic styles in modern poetry, with
some poets using rhyme and meter and others working in much
looser forms, and to the presence of dialect. Whatever the cause,
however, they further demonstrate the robustness of the
techniques, which have now been tested on two corpora of
poetry written nearly 300 years apart, on novels from 1900,
and on contemporary literary criticism. Further tests on
contemporary prose and on texts tagged for part of speech are
ongoing, not so much in an attempt to further confirm the
effectiveness and reliability of Delta and Delta Prime, which
now seem very solidly validated, but rather in the hope of more
fully understanding why these relatively simple techniques
work so well, and of continuing to improve their already
impressive power.

Burrows, J.F. "Questions of Authorship: Attribution and
Beyond." Presented at the Association for Computers and the
Humanities and Association for Literary and Linguistic
Computing, Joint International Conference, New York, June
14, 2001.
Burrows, J.F. "'Delta': a measure of stylistic difference and a
guide to likely authorship." Literary and Linguistic Computing
17 (2002a): 267-287.
Burrows, J.F. "The Englishing of Juvenal: computational
stylistics and translated texts." Style 36 (2002b): 677-699.
Burrows, J.F. "Questions of Authorship: Attribution and
Beyond." Computers and the Humanities 37.1 (2003): 5-32.
Hoover, David L. "Testing Burrows's Delta." Literary and
Linguistic Computing 19.4 (2004a): 453-475.
Hoover, David L. "Delta Prime?" Literary and Linguistic
Computing 19.4 (2004b): 477-495.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2005

Hosted at University of Victoria

Victoria, British Columbia, Canada

June 15, 2005 - June 18, 2005

Conference website: http://web.archive.org/web/20071215042001/http://web.uvic.ca/hrd/achallc2005/

Series: ACH/ICCH (25), ALLC/EADH (32), ACH/ALLC (17)

Organizers: ACH, ALLC
