On Determining a Valid Text for Non-Traditional Authorship Attribution Studies: Editing, Unediting, and De-Editing

Joseph Rudman
Department of English, Carnegie Mellon University
jr20@andrew.cmu.edu

ACH/ALLC 2003, University of Georgia, Athens, Georgia

Editors: Eric Rochester; William A. Kretzschmar, Jr.
Encoder: Sara A. Schmidt

INTRODUCTION:

The work’s material history since its
inception, the vast and largely uncharted alterations imposed by
the history and by the mediation of generation upon generation
of printers, editors, publishers—this is a relativism we are
prone to ignore, but ignore at our peril.

(Marcus 1996)

The literary texts often are not homogenous
since they may comprise dialogues, narrative parts, etc. An
integrated approach, therefore, would require the development of
text sampling tools for selecting the parts of the text that
best illustrate an author’s style.

(Stamatatos et al. 2001)

Most non-traditional authorship attribution studies place too much emphasis
on statistics, stylistics, and the computer, and too little on
the integrity and validity of the primary data: the text itself.
It is intuitively obvious and easily shown empirically that if you are
conducting a study of the patterns of an author’s stylistic usage (e.g.
Daniel Defoe), the study will be systematically degraded by each
interpolation of non-Defoe text and even by each interpolation of Defoe text
of a different genre or a significantly different time period.
The crux of this paper is one important element in the empirical
methodology of a valid non-traditional authorship attribution study: the
preparation of the text for stylistic and statistical analysis through
unediting, de-editing, and editing.
The general emphasis of this presentation is on prose analysis with some
peripheral treatment of drama and poetry.

I. BACKGROUND AND DEFINITIONS
A. Why a valid text is necessary should not even be asked. No
valid experiment can be done if the input data is flawed: garbage
in, garbage out! Too many practitioners simply grab a text
from any available source without any thought to its pedigree
(e.g. Khmelev and Tweedie’s “Using Markov Chains for
Identification of Writers”). Are undertakings such as
Project Gutenberg or the Oxford Text Archive, with their easily
available machine-readable texts, a boon or a bane to
non-traditional authorship studies? This question is explored in
some detail.
B. Selecting a starting text
The validity of using texts
from the oral tradition and the scribal tradition is
discussed. Before any manipulation and analysis of a text is
carried out, a valid starting text must be acquired that
fulfills many necessary requirements. This selection is
primarily bibliographically driven. If a practitioner is not
savvy in the bibliographical arts, a collaborator who is should
be recruited. Examples of bad starting texts causing
problems are given (e.g. Peng and Hengartner’s “Quantitative
Analysis of Literary Styles”). If you cannot obtain a valid
text, do not do the study.
C. Unediting: getting back to the state of “not yet edited”
De-editing: removing selected text
Editing: changing (preparing) a text for statistical analysis

II. EXPLICATION
The statement, “each age, each author, each study
demands a different mixture of the following particulars,” is discussed.
A. Unediting
As a rule, the closest text to the holograph
should be found and used.
1. Editorial interpolation
a. Filled in lacunae
b. Marginal notation
c. ‘Changes’ in the text
d. Critical editions

2. Printer interpolation
For the Printer is a beast, and
understands nothing I can say to him of correcting
the press.
Dryden (Ward p. 97)
a. Catchwords (the first word of the next leaf
or gathering)
b. Signatures (combinations of letters and
numerals used something like catchwords)
c. Removing obvious typesetting mistakes (a
slippery slope)
i. ‘f’ for the long ‘s’
ii. Double words (e.g. ‘the the’, ‘was was’)
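
Two of the printer interpolations listed above can be handled programmatically. What follows is a minimal sketch, assuming a plain-text transcription. The function names, the helper `long_s_candidates`, and the tiny `KNOWN_WORDS` list are hypothetical illustrations and do not come from any study cited here. Doubled words are collapsed and logged, while suspected ‘f’ for long ‘s’ substitutions are only flagged for human review, since automatic correction is exactly the slippery slope noted above.

```python
import re
from itertools import product

# Hypothetical lexicon of accepted forms; a real study would need a
# period-appropriate word list chosen by the practitioner.
KNOWN_WORDS = {"said", "such", "sense", "fast", "fame", "first", "the", "he"}

# A word, whitespace, then the same word again (case-insensitive).
DOUBLED = re.compile(r"\b(\w+)(\s+)\1\b", flags=re.IGNORECASE)


def collapse_doubled_words(text):
    """Collapse an immediately repeated word ('the the' -> 'the').

    Returns the new text plus the list of doubled words found, so that
    every change can be disclosed. Note that legitimate repetitions
    (e.g. 'had had') would also be collapsed; the log exists for review.
    """
    found = [m.group(1) for m in DOUBLED.finditer(text)]
    return DOUBLED.sub(r"\1", text), found


def long_s_candidates(word):
    """Yield every variant of `word` with some 'f' characters read as 's'."""
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    for choice in product("fs", repeat=len(positions)):
        chars = list(word)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate != word:
            yield candidate


def flag_possible_long_s(text, known_words=KNOWN_WORDS):
    """Flag (token, suggestion) pairs; nothing in the text is changed."""
    flagged = []
    for token in re.findall(r"[A-Za-z]+", text):
        lower = token.lower()
        if "f" not in lower or lower in known_words:
            continue
        for candidate in long_s_candidates(lower):
            if candidate in known_words:
                flagged.append((token, candidate))
                break
    return flagged
```

Keeping the long ‘s’ step as a report rather than a rewrite leaves the final decision with the practitioner, who can then record which readings were accepted.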

B. De-editing
1. Quotes
a. Factual, unattributed
b. Factual, attributed
c. Self quotes from earlier writings

2. Plagiarism
a. Direct copy
b. Paraphrasing
c. Imitation

3. Collaboration
a. Sectional
b. Phrasal
c. Word level
d. Ghostwriting

4. Genre
a. Poetry, prose, drama, letters, etc.
b. Mixture (e.g. verse drama)

5. Graphs and Numbers
a. Tables
b. Lists
c. Arabic and Roman numerals

6. Guide words
a. Titles—chapter headings—the end word
‘Finis’
b. Marginal annotation

7. Foreign Languages
a. Sentence level and greater
b. Phrase or word level

8. Translations
a. Verbatim
b. Concepts

9. Examples of items de-edited (or not de-edited)
incorrectly by practitioners
a. Biblical quotes
b. Titles in direct apposition
c. Numbers that are spelled out
d. Words with an initial capital
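
De-editing can be made reproducible by removing material by rule and keeping a record of each removal. The following is a minimal sketch, assuming plain text, and it covers only category 5c above (Arabic and Roman numerals). The name `de_edit_numerals` and both patterns are hypothetical. Spelled-out numbers are left untouched here; whether to de-edit them is one of the judgment calls listed under item 9.

```python
import re

# Arabic numerals, allowing internal separators such as 1,200 or 3.5.
ARABIC = re.compile(r"\b\d+(?:[.,]\d+)*\b")

# Upper-case Roman numerals of two or more letters. Single letters are
# skipped so that the pronoun 'I' survives. Upper-case abbreviations that
# happen to be well-formed numerals (e.g. 'CD', 'MD') would also match,
# which is one reason every removal is logged for review.
ROMAN = re.compile(
    r"\b(?=[MDCLXVI]{2,}\b)"
    r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})(?!\w)"
)


def de_edit_numerals(text):
    """Remove Arabic and Roman numerals; return (new_text, removal_log)."""
    removed = []

    def drop(kind):
        def _sub(match):
            removed.append((kind, match.group(0)))
            return ""
        return _sub

    text = ARABIC.sub(drop("arabic"), text)   # first pass: Arabic numerals
    text = ROMAN.sub(drop("roman"), text)     # second pass: Roman numerals
    text = re.sub(r"[ \t]{2,}", " ", text)    # tidy the gaps left behind
    return text.strip(), removed
```

The removal log serves two purposes that the outline calls for: it lets the practitioner disclose everything that was done to the text, and it preserves the de-edited items for separate examination.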

C. Editing
1. Encoding the text
a. Why (e.g. homographic forms)
b. TEI

2. Regularizing
a. Spelling
b. Contracted forms (simple, compound)
c. Hyphenation
d. Masked words (e.g. ‘D_ _ _ e’ for ‘Defoe’)

3. Lemmatizing
a. Pro
b. Con
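
The regularizing steps in C.2 can be sketched as table-driven substitutions. This is a minimal illustration under stated assumptions: the spelling, contraction, and masked-word tables below are hypothetical examples (only ‘D_ _ _ e’ for ‘Defoe’ comes from the outline above), and a real study would build them from the texts at hand. Hyphenation and lemmatizing are not covered.

```python
import re

# Hypothetical lookup tables, keyed in lower case where case may vary.
SPELLING = {"publick": "public", "shew": "show", "compleat": "complete"}
CONTRACTIONS = {"'tis": "it is", "'twas": "it was", "don't": "do not"}
MASKED = {"D_ _ _ e": "Defoe"}  # matched exactly, case-sensitive


def _replace_words(text, table, kind, log):
    """Replace whole-word matches from `table`, preserving an initial capital."""
    keys = sorted(table, key=len, reverse=True)  # longest forms first
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)",
        flags=re.IGNORECASE,
    )

    def _sub(match):
        original = match.group(0)
        replacement = table[original.lower()]
        first_letter = next(ch for ch in original if ch.isalpha())
        if first_letter.isupper():
            replacement = replacement[0].upper() + replacement[1:]
        log.append((kind, original, replacement))
        return replacement

    return pattern.sub(_sub, text)


def regularize(text):
    """Return (regularized_text, log) where log lists every substitution."""
    log = []
    # Treat the typographic apostrophe as ASCII. This also converts
    # closing single quotation marks, a simplification in this sketch.
    text = text.replace("\u2019", "'")
    for masked, full in MASKED.items():
        count = text.count(masked)
        if count:
            log.extend([("masked", masked, full)] * count)
            text = text.replace(masked, full)
    text = _replace_words(text, CONTRACTIONS, "contraction", log)
    text = _replace_words(text, SPELLING, "spelling", log)
    return text, log
```

Because each substitution is logged with its original form, the regularized text can always be traced back to the unregularized one, and the tables themselves become part of what the practitioner discloses.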

D. Special Problems in Drama and Poetry
1. Stage directions
2. The ‘age’ dependency of transmission and technique.

III. SOME EXAMPLES
Studies that are compromised by mistakes of
commission and/or omission in editing, unediting, or de-editing.
A. Historia Augusta
1. Twelve individual studies

B. Shakespeare
1. Elliott and Valenza
2. Foster
3. Horton

C. Defoe
1. Hargevik
2. Rothman

IV. CONCLUSION
1. Some items that are de-edited are valid style markers in
their own right (e.g. Latin phrases, different genres) and should
be treated as such in a parallel study.
2. No matter which text is selected, the practitioner must
disclose which text was used and everything that was done to
it.
3. The same care must be taken with every text in the
study—the anonymous text, the suspected author’s text, and all
of the control texts.
4. If valid texts cannot be located and correctly edited,
unedited, and de-edited, do not do the study.
5. A valid text does not guarantee a valid study. However, a
non-valid text guarantees a non-valid study.

REFERENCES

Altick, Richard D., and John J. Fenstermaker. The Art of Literary Research (Fourth Edition). New York: W.W. Norton & Company, 1993.

Burrows, John. “Questions of Authorship: Attribution and Beyond.” A Lecture Delivered on the Occasion of the Roberto Busa Award. ACH-ALLC01 Conference, New York University, New York, June 14, 2001.

Elliott, Ward E. Y., and Robert J. Valenza. “So Many Hardballs, So Few Over the Plate: Conclusions From Our ‘Debate’ With Donald Foster.” Computers and the Humanities 36 (2002): 450-460.

Foster, Don. Author Unknown: On the Trail of Anonymous. New York: Henry Holt and Company, 2000.

Goldgar, Bertrand A. “Imitation and Plagiarism: The Lauder Affair and Its Critical Aftermath.” Studies in the Literary Imagination 34.1 (2001): 1-16.

Greetham, D. C. Textual Scholarship: An Introduction. New York: Garland, 1992.

Grefenstette, Gregory, and Pasi Tapanainen. “What is a Word, What is a Sentence? Problems of Tokenization.” Proceedings of the 3rd International Conference on Computational Lexicography. Budapest: Research Institute for Linguistics, Hungarian Academy of Sciences, 1994.

Hargevik, Stieg. The Disputed Assignment of “Memoirs of an English Officer” to Daniel Defoe (Part I and Part II). Stockholm: Almqvist and Wiksell, 1974.

Holmes, David I., et al. “A Widow and Her Soldier: Stylometry and the American Civil War.” Literary and Linguistic Computing 16.4 (2001): 403-420.

Horton, Thomas B. “The Effectiveness of the Stylometry of Function Words in Discriminating between Shakespeare and Fletcher.” Thesis. University of Edinburgh, 1987.

Khmelev, Dmitri V., and Fiona J. Tweedie. “Using Markov Chains for Identification of Writers.” Literary and Linguistic Computing 16.3 (2001): 299-307.

Lindey, Alexander. Plagiarism and Originality. New York: Harper and Brothers, 1952.

Marcus, Leah S. “Afterword: Confessions of a Reformed Uneditor.” In Andrew Murphy (ed.), The Renaissance Text: Theory, Editing, Textuality. Manchester: Manchester University Press, 2000. 211-216.

Marcus, Leah S. Unediting the Renaissance: Shakespeare, Marlowe, Milton. London: Routledge, 1996.

Novak, Maximillian E. “The Defoe Canon: Attribution and De-attribution.” Huntington Library Quarterly 59.1 (1997): 83-104.

Peng, Roger D., and Nicolas W. Hengartner. “Quantitative Analysis of Literary Styles.” The American Statistician 56.3 (2002): 175-185.

Project Gutenberg. URL:

Rogers, Pat. The Text of Great Britain: Theme and Design in Defoe’s ‘Tour’. Cranbury, NJ, 1998.

Rothman, Irving N. “Defoe De-Attributions Scrutinized Under Hargevik Criteria: Applying Stylometrics to the Canon.” Papers of the Bibliographical Society of America 94.3 (2000): 375-398.

Rudman, Joseph. “The State of Authorship Attribution Studies: Some Problems and Solutions.” Computers and the Humanities 31 (1998): 351-365.

Rudman, Joseph. “Non-Traditional Authorship Attribution Studies in the Historia Augusta: Some Caveats.” Literary and Linguistic Computing 13.3 (1998): 151-157.

Slater, Eliot. The Problem of “The Reign of King Edward III”: A Statistical Approach. Cambridge: Cambridge University Press, 1988.

Stamatatos, E., et al. “Computer-Based Authorship Attribution Without Lexical Measures.” Computers and the Humanities 35 (2001): 193-214.

Text Encoding Initiative.

Thorpe, James. Watching the Ps & Qs: Editorial Treatment of Accidentals. Lawrence, Kansas: University of Kansas Printing Service, 1971.

Ward, Charles E. The Letters of John Dryden: With Letters Addressed to Him. Durham, NC: Duke University Press, 1942.

Williams, David S. Stylometric Authorship Studies in Flavius Josephus and Related Literature. Lewiston, New York: The Edwin Mellen Press, 1992.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2003
"Web X: A Decade of the World Wide Web"

Hosted at University of Georgia

Athens, Georgia, United States

May 29, 2003 - June 2, 2003

Conference website: http://web.archive.org/web/20071113184133/http://www.english.uga.edu/webx/

Series: ACH/ICCH (23), ALLC/EADH (30), ACH/ALLC (15)

Organizers: ACH, ALLC
