Teasing Out Authorship and Style with T-tests and Zeta

David L. Hoover
New York University

Most computational stylistics methods were
developed for authorship attribution, but many
have also been applied to the study of style.
Investigating Wilkie Collins's Blind Love (1890),
left unfinished at his death and completed
by Walter Besant from a long synopsis and
notes provided by Collins, requires both
authorship attribution and stylistics. External
evidence indicates that Besant took over after
chapter 48 (Collins 2003), which provides
an opportunity to test whether Besant was
successful in matching Collins's style and to
investigate the styles of Collins and Besant. This
divided novel also facilitates the comparison
of two computational methods: the t-test and
Burrows's Zeta.
The t-test is a well-studied method for
determining the probability of a difference
between two groups arising by chance (a classic
use in authorship and stylistics is Burrows
1992). Here I use t-tests to identify words used
very differently by Collins and Besant. After
showing that those word frequencies accurately
identify the change of authorship, I examine
the words themselves for stylistically interesting
characteristics.
I created a combined word frequency list for
four novels by Besant and three by Collins, then
deleted words occurring only once or twice,
personal pronouns (too closely related to the
number and gender of characters), all words
with more than 90% of their occurrences in
one text (almost exclusively proper names),
and words limited to one author (required
for t-testing). I divided the novels into 167
4,000-word sections, and performed t-tests for
the remaining 6,600 words (using a Minitab
macro). I cleaned up the results and sorted them
on the p value in Excel (with another macro),
and retained only the 1,719 words with p < .05,
about 1,000 for Collins and 700 for Besant
(see https://files.nyu.edu/dh3/public/ClusterAnalysis-PCA-T-testingInMinitab.html
for detailed instructions and the macros).
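The word-selection and testing steps above can be sketched in Python. This is a minimal sketch, not the Minitab macro the paper used: the function names, the short pronoun list, and the toy data are hypothetical, and Welch's unequal-variance form of the t-test is an assumption, since the text does not say which variant was run. Dropping a short final section is likewise an assumption. Only the t statistic and degrees of freedom are computed; turning them into p values would need a t-distribution routine from a statistics package.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance

# Illustrative pronoun list only; the paper does not give its exact list.
PRONOUNS = {"i", "me", "my", "you", "your", "he", "him", "his", "she", "her",
            "it", "its", "we", "us", "our", "they", "them", "their"}

def filter_vocab(texts_by_author, min_count=3, max_share=0.9):
    """Apply the four exclusions described above to a combined word list."""
    per_text = [Counter(t) for texts in texts_by_author.values() for t in texts]
    total = sum(per_text, Counter())
    author_vocab = [set().union(*texts) for texts in texts_by_author.values()]
    keep = []
    for word, n in total.items():
        if n < min_count or word.lower() in PRONOUNS:
            continue  # occurs only once or twice, or is a personal pronoun
        if max(c[word] for c in per_text) > max_share * n:
            continue  # more than 90% of its occurrences fall in one text
        if not all(word in vocab for vocab in author_vocab):
            continue  # limited to one author
        keep.append(word)
    return sorted(keep)

def split_sections(tokens, size=4000):
    """Consecutive sections of `size` tokens; a short final remainder is dropped."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]

def section_freqs(section_list, word):
    """Frequency of `word` per 100 tokens in each section."""
    return [100.0 * s.count(word) / len(s) for s in section_list]

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def rank_words(sections_a, sections_b, vocab):
    """Words sorted by |t|; a positive t means author A uses the word more."""
    rows = []
    for word in vocab:
        a = section_freqs(sections_a, word)
        b = section_freqs(sections_b, word)
        if variance(a) == 0 and variance(b) == 0:
            continue  # t is undefined when neither group varies
        rows.append((word, *welch_t(a, b)))
    return sorted(rows, key=lambda row: -abs(row[1]))

# Toy data with 4-word sections, purely to show the calling pattern.
besant = split_sections("upon all upon but upon all upon then".split(), size=4)
collins = split_sections("asked to asked on upon to asked in".split(), size=4)
ranking = rank_words(besant, collins, ["upon", "asked"])
for word, t, df in ranking:
    print(f"{word}: t = {t:.2f}, df = {df:.1f}")
```

With real texts, `filter_vocab` would be run first on the tokenized novels, and its output passed as the vocabulary to `rank_words`; the sign of t then says which author favors each word.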
I tested these words on six new texts for each
author, a novel and five stories for Besant
and six novels for Collins. Beginning with the
500 most distinctive words for each author,
I deleted a few words that were absent from
these texts and used the remaining 993 words
to perform a cluster analysis (Fig. 1). (To keep
the graph readable, I divided the novels into
10,000-word sections, retaining only half the
sections.) Obviously, these marker words are
quite characteristic of the authors.
Fig. 1. Besant versus Collins: Cluster Analysis
When sections of Blind Love are tested along
with the texts above, the authorship change
after chapter forty-eight is starkly apparent (Fig.
2). This graph is based on the sums of the
frequencies of the 500 most distinctive words
for each author in each section. (The texts are
divided into 1,000-word sections; only a few
sections of the novels are shown; the frequencies
of Collins's marker words are multiplied by -1
for clarity.) Although Besant was working from
extensive notes, his style is distinctly different.
Had we not known which was Besant's first
chapter, these t-tested marker words would have
easily located it.
Fig. 2. Besant, Collins, Blind Love: T-tested Marker Words
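The per-section marker sums behind Fig. 2 can be sketched as follows. This is a minimal sketch under stated assumptions: frequencies are expressed per 100 tokens (the text does not give the unit), the marker sets and sections are toy data, and `locate_change` is a hypothetical rule for picking the split point, whereas the paper reads the change of authorship off the graph.

```python
from collections import Counter
from statistics import mean

def marker_scores(section, besant_markers, collins_markers):
    """Summed marker frequencies per 100 tokens; Collins's sum is negated, as in Fig. 2."""
    counts = Counter(section)
    besant = 100.0 * sum(counts[w] for w in besant_markers) / len(section)
    collins = 100.0 * sum(counts[w] for w in collins_markers) / len(section)
    return besant, -collins

def locate_change(net_scores):
    """Hypothetical rule: the split index that maximizes mean(after) - mean(before)."""
    return max(range(1, len(net_scores)),
               key=lambda k: mean(net_scores[k:]) - mean(net_scores[:k]))

# Toy marker sets and 5-word sections; the paper used the 500 most
# distinctive words for each author and 1,000-word sections.
besant_markers = {"upon", "all", "very"}
collins_markers = {"asked", "answered", "mrs"}
toy_sections = [s.split() for s in (
    "she asked and mrs answered",
    "mrs asked all the same",
    "upon all very great things",
    "very well upon asked all",
)]

scores = [marker_scores(s, besant_markers, collins_markers) for s in toy_sections]
net = [b + c for b, c in scores]   # positive leans Besant, negative leans Collins
change_at = locate_change(net)     # index of the first section on the Besant side
print(net, change_at)
```

Plotting the two values in `scores` for each section, in order, gives the kind of paired bars described above, with Collins's markers below the axis.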
Because the styles of Collins and Besant are
so distinct, these marker words should also
characterize them. Consider the twenty most
distinctive words for each author:
Besant: upon, all, but, then, and, not, or, very, so, because, great, thing, things, much, every, there, man, everything, is, well
Collins: answered, to, had, Mrs, on, asked, in, Miss, mind, suggested, person, resumed, excuse, left, at, reminded, creature, inquired, reply, when
Obviously, more of Besant's words are high-frequency function words, and many of Collins's words are related to speech presentation (answered, asked, inquired, resumed, suggested, reply, and reminded). The presence of added, begged, declared, exclaimed, explained, expressed, muttered, rejoined, and said as likely speech markers among the other Collins marker words, but only gasped, groaned, murmured, replied, and stammered for Besant, suggests they have different ways of presenting speech.
Sorting all of each author's marker words alphabetically immediately reveals word families that each author favors, as thing, things, and everything among the twenty most distinctive Besant words already suggests (anything and nothing are also Besant markers). His every and everything are joined by everybody and everywhere; anything by any and anywhere; nothing and not by never, no, nobody, none, and nor; and much by more, moreover, most, and mostly among his markers. Collins's answered is joined by answer, answering, and unanswerable; and five of his twenty words are joined by two others: ask, asked, asks; inquired, inquiries, inquiry; leave, leaving, left; person, personally, persons; suggest, suggested, suggestion.
About 600 of the 1,700 distinctive words form groups favored by one author, but only about 175 form split groups, many of which fall into intriguing patterns. Collins uses more contractions, so didn't, doesn't, and don't are Collins words, but did and does are Besant words, and similarly for must, need, should, and would and their negative contractions. The singular and possessive forms of brother, friend, sister, and son are Collins's words and the plural forms are Besant's; the singular vs. plural pattern continues almost without exception in split noun groups. Verbs in -ing are Collins words and third-person singular present forms are Besant's. Finally, all nineteen cardinal number marker words are Besant's, including the numbers one to ten (note that Besant's preferred plural nouns often follow numbers). This extraordinary patterning may not seem particularly surprising, but, so far as I know, it has never been noticed before, and cries out for investigation.
Two problems with t-testing are its privileging
of relatively uninteresting high-frequency words
and its inability to cope with words absent from
one author. John Burrows's Zeta addresses both
of these problems (Burrows 2006). (The specific
form used here was developed by Hugh Craig
(Craig and Kinney, 2009); for an automated
spreadsheet and instructions for performing


Conference Info

ADHO - 2010
"Cultural expression, old and new"

Hosted at King's College London

London, England, United Kingdom

July 7, 2010 - July 10, 2010

Conference website: http://dh2010.cch.kcl.ac.uk/

Series: ADHO (5)

Organizers: ADHO
