Collocations, Authorship Attribution, and Authorial Style

David L. Hoover

Authorship

1. David L. Hoover

New York University

david.hoover@nyu.edu

ACH/ALLC 2003, University of Georgia, Athens, Georgia

Editors: Eric Rochester and William A. Kretzschmar, Jr.
Encoder: Sara A. Schmidt

Authorship attribution typically seeks a small number of textual characteristics
that effectively distinguish the texts of different authors from one another (see Morton,
1978 for a classic discussion). With small groups of texts, these features can
be found by examining frequency lists manually, but statistical tests such as
the t-test can also be used (see Binongo and Smith, 1999). For the purposes of
authorship attribution, a few items occurring at consistent and consistently
different frequencies in all of the known texts by all of the claimants may be
sufficient for confident attribution.
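As a rough illustration of the kind of statistical screening mentioned above, the following sketch computes a t statistic for one word's frequency in the known texts of two authors. The frequencies and the function name `welch_t` are invented for illustration, and the choice of Welch's unequal-variance form is an assumption, not necessarily the test used in the studies cited.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    standard_error = sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical frequencies (per 1,000 words) of one word in each
# author's known texts; these numbers are not from the study.
author_a = [12.0, 13.0, 14.0]
author_b = [8.0, 9.0, 10.0]

t_stat = welch_t(author_a, author_b)
print(f"t = {t_stat:.2f}")
```

A large absolute value suggests that the word occurs at consistently different frequencies in the two sets of texts, which is the property the paragraph above describes as useful for attribution.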
Most multivariate authorship work focuses on frequent words, following the lead
of Burrows (1987, 1988, 1989, 1992a, 1992b, 1994). Much persuasive recent work
continues this tradition (Craig, 1999a, 1999b, 1999c, 1999d; Forsyth et al.,
1999; Holmes et al., 2001a, 2001b; McKenna and Antonia, 2001; Tweedie et al.,
1998). In two recent studies, however, I have shown that cluster analyses based
on frequent words often fail to attribute known texts to their authors, and that
analyses based on word sequences are sometimes more effective (Hoover, 2001,
2002). Continuing along these lines, I will test the accuracy of analyses based
on collocations, while simultaneously examining the effects of using much larger
numbers of items than are typically used. Large numbers of words, sequences, and
collocations provide more information for potential stylistic analyses, assure
that the results take into account a large proportion of the texts under
consideration, and, as we will see, usually produce more accurate results. The
results of my investigation also show that analyses based on collocations are
often more accurate than those based on frequent words or sequences.

For this investigation I will define collocations simply as any two words that
appear repeatedly within a certain span of words. Preliminary tests show that,
perhaps contrary to intuition, meaningful collocations like house...yard, or
car...highway, are not very effective for authorship attribution. They do not
occur very frequently, and their occurrence depends too much on the content of
the text. Many multivariate analyses have been based on function words alone, in
the belief that such frequent and relatively insignificant words are most likely
to reflect unconscious and regular authorial habits. This suggests the use of
collocations of function words, but preliminary tests show that these are also
not very effective. The most effective collocations are simply those that occur
at the highest frequencies, with the exception of collocations of personal
pronouns, which, like collocations of meaningful words, seem too much
conditioned by content (especially the characters) of the texts. I omit personal
pronouns and any items for which a single text provides more than 80% of the
occurrences (typically proper names).
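The selection procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure actually used in the study: a collocation is taken to be an ordered pair of words falling within `span` consecutive tokens, "repeatedly" is read as at least twice in the corpus, and the pronoun list is an illustrative guess. All function names, parameters, and sample texts are hypothetical.

```python
from collections import Counter

# Assumed list of personal pronouns; the study does not give its list.
PERSONAL_PRONOUNS = {"i", "me", "we", "us", "you", "he", "him",
                     "she", "her", "it", "they", "them"}


def count_collocations(tokens, span):
    """Count ordered word pairs that fall within `span` consecutive tokens."""
    counts = Counter()
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + span]:
            counts[(first, second)] += 1
    return counts


def select_collocations(texts, span, top_n, max_share=0.8, min_count=2):
    """Return up to top_n of the most frequent collocations after filtering."""
    per_text = {name: count_collocations(tokens, span)
                for name, tokens in texts.items()}
    totals = Counter()
    for counts in per_text.values():
        totals.update(counts)

    kept = []
    for pair, total in totals.most_common():
        if total < min_count:
            break  # remaining pairs are rarer still
        if PERSONAL_PRONOUNS & set(pair):
            continue  # omit collocations involving personal pronouns
        largest = max(counts[pair] for counts in per_text.values())
        if largest / total > max_share:
            continue  # one text supplies more than 80% of the occurrences
        kept.append(pair)
        if len(kept) == top_n:
            break
    return kept


# Invented toy texts, lower-cased and tokenized on whitespace.
texts = {
    "text_a": "she saw the house by the sea and the house on the hill".split(),
    "text_b": "she saw the house and the garden by the sea".split(),
    "text_c": "london fog london fog london fog".split(),
}
selected = select_collocations(texts, span=2, top_n=10)
print(selected)
```

In this toy corpus the pair ("london", "fog") is frequent but occurs in only one text, so the 80% rule removes it, much as the rule is said above to remove proper names.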
To test the effectiveness of collocations in authorship attribution, it seems best to begin
with a corpus of texts by known authors, so that various spans, numbers of
collocations, and statistical methods can be tested for effectiveness before
trying the method on real authorship questions. I begin with a corpus of 10,000
words of pure narrative from fourteen third-person novels by six authors from
about 1900, and, as a baseline, test the effectiveness of frequent words and
sequences. For the restriction to narrative and to third-person, see Burrows
(1987, 1992) and Hoover (2001). The best results cluster the texts of five of
the six authors. Although analysts usually select a small number of items (e.g.,
the 50 most frequent function words), much larger numbers of frequent words are
often more effective. I test the 50, 100, 200, 300, 400, 500, 600, 700, and 800
most frequent items except where fewer items than 800 occur frequently enough to
be included. (For this corpus, the best results for frequent words are based on
the 300-800 most frequent.) When collocations are tested, various spans and
linkages give various results, but several analyses correctly cluster the texts
of all six authors, as Fig. 1 shows. A representative completely correct cluster
analysis is shown in Fig. 2.

Fig. 1

Fig. 2
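The general shape of such an analysis, frequency profiles for the N most frequent items followed by a cluster analysis, can be sketched in a few lines. This is a minimal sketch only: the distance measure (Euclidean), the linkage (average), the function names, and the toy texts are all assumptions, since the abstract does not specify the software or settings used.

```python
from collections import Counter
from itertools import combinations
from math import dist


def frequency_profiles(texts, top_n):
    """Relative frequency, per text, of the corpus's top_n most frequent words."""
    corpus_counts = Counter(word for tokens in texts.values() for word in tokens)
    items = [word for word, _ in corpus_counts.most_common(top_n)]
    profiles = {}
    for name, tokens in texts.items():
        counts = Counter(tokens)
        profiles[name] = [counts[item] / len(tokens) for item in items]
    return profiles


def average_linkage_clusters(profiles, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [{name} for name in profiles]

    def linkage(a, b):
        # Average Euclidean distance over all cross-cluster pairs of texts.
        total = sum(dist(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters


# Invented toy "texts" by two hypothetical authors, A and B.
texts = {
    "A1": "the the the the of and".split(),
    "A2": "the the the of of and".split(),
    "B1": "and and and and of the".split(),
    "B2": "and and and of of the".split(),
}
profiles = frequency_profiles(texts, top_n=3)
clusters = average_linkage_clusters(profiles, n_clusters=2)
print(sorted(sorted(cluster) for cluster in clusters))
```

An analysis counts as correct in the sense used above when every author's texts end up together in a single cluster that contains no texts by anyone else. Raising `top_n` corresponds to testing the 50, 100, 200, and more most frequent items.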

It seems useful to test the methods on another genre, as I did in previous work
(Hoover, 2001), so my next corpus consists of the first 4,000 words of
twenty-one contemporary literary critical articles by ten authors. Here,
analyses based on frequent words and sequences each correctly cluster all of the
texts once. Analyses based on collocations with spans of two, five, and ten
words also succeed.

Analyses based on collocations seem to be quite effective in attributing texts to
their authors in cases of known authorship, and can now be tested in an
authorship simulation to see how well they work under conditions that more
closely resemble true attribution problems. The simulation includes the fourteen
narratives by six authors discussed above, adds four novels by two new authors,
and then two “anonymous” novels, each known to be by one of the eight authors.
Frequent sequences succeed for only six of the eight authors. Frequent words still
fail to cluster Kipling’s texts correctly, but they do successfully cluster the
four texts of the two new authors. They also consistently cluster one of the
anonymous texts with Cather’s texts and the other with London’s. Analyses based
on collocations with a span of four words are extremely effective and
consistent: the 400, 500, 600, 700, and 800 most frequent correctly cluster all
of the known texts, even when the graphs are strictly interpreted, as Fig. 3
shows. Like analyses based on frequent words, these also consistently cluster
the anonymous texts with those of Cather and London. These identifications are
correct. What makes these results even more impressive is the fact that four of
the six added texts, including the two anonymous ones, are first-person rather
than third-person narratives.

Fig. 3
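The simulation step, in which an anonymous text is assigned to whichever author's texts it joins, can be approximated without drawing a dendrogram. The sketch below is a simplification and an assumption, not the method of the study: it assigns each anonymous text to the author whose known texts lie closest on average. The profiles, text names, and author labels are invented.

```python
from math import dist
from statistics import mean


def attribute(anonymous_profile, known_profiles, author_of):
    """Return the author whose known texts are closest, on average."""
    distances_by_author = {}
    for name, profile in known_profiles.items():
        distance = dist(anonymous_profile, profile)
        distances_by_author.setdefault(author_of[name], []).append(distance)
    return min(distances_by_author,
               key=lambda author: mean(distances_by_author[author]))


# Hypothetical relative frequencies of three items in four known texts.
known_profiles = {
    "x1": [0.40, 0.10, 0.05],
    "x2": [0.38, 0.12, 0.06],
    "y1": [0.20, 0.30, 0.10],
    "y2": [0.22, 0.28, 0.12],
}
author_of = {"x1": "author_x", "x2": "author_x",
             "y1": "author_y", "y2": "author_y"}
anonymous = {
    "anon_1": [0.39, 0.11, 0.05],
    "anon_2": [0.21, 0.29, 0.11],
}

attributions = {name: attribute(profile, known_profiles, author_of)
                for name, profile in anonymous.items()}
print(attributions)
```

In a real test the profiles would hold several hundred collocation frequencies rather than three values, and the result would be checked against the known authorship of the "anonymous" texts.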

The results of my study confirm what many researchers have found: analyses based
on the frequencies of frequent words are quite effective in attributing texts to
their authors. Analyses based on frequent sequences of words are also often
effective, and are more effective under certain conditions, as I have shown
elsewhere (Hoover, 2002). Frequent collocations, however, are often more
effective than either words or sequences, producing the only completely correct
attributions in some cases and producing more consistently correct attributions
in others. The frequencies of frequent collocations clearly reflect important
aspects of authorial style. Analyses based on them constitute a promising method
of authorship attribution and may also prove useful in stylistic studies.

REFERENCES

Binongo, J. N. G. and Smith, M. W. A. (1999). The application of principal component analysis to stylometry. Literary and Linguistic Computing, 14(4): 445–65.

Burrows, J. F. (1987). Computation into Criticism. Oxford: Clarendon Press.

Burrows, J. F. and Hassall, A. J. (1988). Anna Boleyn and the authenticity of Fielding's feminine narratives. Eighteenth Century Studies, 427–453.

Burrows, J. F. (1989). ‘A Vision’ as a revision. Eighteenth Century Studies, 551–65.

Burrows, J. F. (1992a). Computers and the study of literature. In Butler, C. S. (ed.), Computers and Written Texts. Oxford: Blackwell, pp. 167–204.

Burrows, J. F. (1992b). Not unless you ask nicely: the interpretive nexus between analysis and information. Literary and Linguistic Computing, 7(2): 91–109.

Burrows, J. F. and Craig, D. H. (1994). Lyrical drama and the ‘turbid mountebanks’: styles of dialogue in Romantic and Renaissance tragedy. Computers and the Humanities, 63–86.

Craig, D. H. (1999a). Authorial attribution and computational stylistics: if you can tell authors apart, have you learned anything about them? Literary and Linguistic Computing, 14(1): 103–113.

Craig, H. (1999b). Contrast and change in the idiolects of Ben Jonson characters. Computers and the Humanities, 221–40.

Craig, H. (1999c). Jonsonian chronology and the styles of a tale of a tub. In Butler, Martin (ed.), Re-Presenting Ben Jonson: Text, History, Performance. Houndmills, England: MacMillan, St. Martin's, pp. 210–32.

Craig, H. (1999d). The weight of numbers: common words and Jonson's dramatic style. Ben Jonson Journal: Literary Contexts in the Age of Elizabeth, James and Charles, 243–59.

Forsyth, R. S., Holmes, D. I., and Tse, Emily K. (1999). Cicero, Sigonio, and Burrows: investigating the authenticity of the Consolatio. Literary and Linguistic Computing, 14(3): 375–400.

Holmes, D. I., Gordon, L. J., and Wilson, C. (2001a). A Widow and her Soldier: Stylometry and the American Civil War. Literary and Linguistic Computing, 16(4): 403–420.

Holmes, D. I., Robertson, M., and Paez, R. (2001b). Stephen Crane and the New-York Tribune: a case study in traditional and non-traditional authorship attribution. Computers and the Humanities, 35(3): 315–331.

Hoover, D. L. (2001). Statistical stylistics and authorship attribution: an empirical investigation. Literary and Linguistic Computing, 16(4): 421–44.

Hoover, D. L. (2002). New Directions in Statistical Stylistics and Authorship Attribution. Association for Literary and Linguistic Computing and Association for Computers and the Humanities, Joint International Conference, Tübingen, Germany, July 24–28.

McKenna, C. W. F. and Antonia, A. (2001). The Statistical Analysis of Style: Reflections on Form, Meaning, and Ideology in the ‘Nausicaa’ Episode of Ulysses. Literary and Linguistic Computing, 16(4): 353–373.

Morton, A. Q. (1978). Literary Detection: How to Prove Authorship and Fraud in Literature and Documents. New York: Scribner.

Tweedie, F. J., Holmes, D. I., and Corns, T. N. (1998). The provenance of De Doctrina Christiana, attributed to John Milton: a statistical investigation. Literary and Linguistic Computing, 13(2): 77–87.

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2003

"Web X: A Decade of the World Wide Web"

Hosted at University of Georgia

Athens, Georgia, United States

May 29, 2003 - June 2, 2003

Conference website: http://web.archive.org/web/20071113184133/http://www.english.uga.edu/webx/

Series: ACH/ICCH (23), ALLC/EADH (30), ACH/ALLC (15)

Organizers: ACH, ALLC
