Introduction to Stylomatic Analysis using R

Maciej Eder; Jan Rybicki

Authorship

1. Maciej Eder

Pedagogical University of Krakow
2. Jan Rybicki

Jagiellonian University

1. Brief Description
Stylometry, or the study of measurable features of
(literary) style, such as sentence length, vocabulary
richness and various frequencies (of words, word
lengths, word forms, etc.), has been around at
least since the middle of the 19th century, and has
found numerous practical applications in authorship
attribution research. These applications usually
rest on the belief that conscious or unconscious
elements of personal style can help detect the true
author of an anonymous text; that stylistic
fingerprints can betray the plagiarist; and that the
oldest authorship disputes (St. Paul’s epistles or
Shakespeare’s plays) can be settled with more or
less sophisticated statistical methods. While
specific issues remain largely unresolved (or, once
closed, are sooner or later reopened), a variety of
statistical approaches have been developed that
make it possible, often with spectacular precision,
to identify texts written by several authors based
on a single example of each author’s writing. But even
more interesting research questions arise beyond
bare authorship attribution: patterns of stylometric
similarity and difference also provide new insights
into relationships between different books by the
same author; between books by different authors;
between authors differing in terms of chronology
or gender; between translations of the same author
or group of authors; helping, in turn, to find new
ways of looking at works that seem to have been
studied from all possible perspectives. Nowadays, in
the era of ever-growing computing power and of ever
more literary texts available in electronic form, we
are able to perform stylometric experiments that our
predecessors could only dream of.
This half-day workshop is a hands-on introduction
to stylometric analysis in the programming language
R, using an emerging tool, a collection of Maciej
Eder’s and Jan Rybicki’s scripts, which perform
multivariate analyses of the frequencies of the
most frequent words, the most frequent word
n-grams, and the most frequent letter n-grams.
One of the scripts produces Cluster Analysis,
Multidimensional Scaling, Principal Component
Analysis and Bootstrap Consensus Tree graphs based
on Burrows’s Delta and other distance measures;
it applies additional (and optional) procedures,
such as Hoover’s ‘culling’ and pronoun deletion.
As a by-product, it can generate various frequency
lists; a stand-alone word-frequency maker is also
available. Another script provides insight into
state-of-the-art supervised techniques of
classification, such as Support Vector Machines,
k-Nearest Neighbor classification, or, more classically,
Delta as developed by Burrows. Our scripts have
already been used by other scholars to study
Wittgenstein’s dictated writings or, believe it or not,
DNA sequences!
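As a rough illustration of the Delta-based approach described above, the sketch below computes Burrows's Delta on the most frequent words of a toy corpus and attributes a disputed text to its nearest neighbour. It is written in Python rather than R and is not the workshop scripts themselves; the function names, the crude tokenizer, the use of the population standard deviation and the three invented sample texts are all assumptions made for the example.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase the text and return its runs of letters (a crude tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def most_frequent_words(corpus, n):
    """Return the n most frequent words, counted over the whole corpus."""
    pooled = Counter()
    for tokens in corpus.values():
        pooled.update(tokens)
    return [word for word, _ in pooled.most_common(n)]


def z_scores(corpus, words):
    """Map each text name to the z-scores of its relative word frequencies."""
    names = list(corpus)
    counts = {name: Counter(corpus[name]) for name in names}
    table = {name: [] for name in names}
    for word in words:
        column = [counts[name][word] / len(corpus[name]) for name in names]
        mu, sigma = mean(column), pstdev(column)
        for name, value in zip(names, column):
            # A word used equally often everywhere carries no signal: score 0.
            table[name].append((value - mu) / sigma if sigma > 0 else 0.0)
    return table


def delta(z_a, z_b):
    """Burrows's Delta: the mean absolute difference between two z-score lists."""
    return mean([abs(a - b) for a, b in zip(z_a, z_b)])


def nearest_text(table, target):
    """Return the name of the text with the smallest Delta to the target."""
    others = [name for name in table if name != target]
    return min(others, key=lambda name: delta(table[target], table[name]))


# Toy texts, invented for this sketch only.
texts = {
    "author_A": "the cat and the dog and the bird",
    "author_B": "a cat or a dog or a bird or a fish",
    "disputed": "the dog and the cat and the fish",
}
corpus = {name: tokenize(text) for name, text in texts.items()}
table = z_scores(corpus, most_frequent_words(corpus, 8))
print(nearest_text(table, "disputed"))
```

With real texts one would use far longer samples and word lists than in this toy corpus; the nearest-neighbour step here is only the simplest way of turning Delta distances into an attribution.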
The workshop will be an opportunity to see this in
practice on a variety of text collections, investigated
for authorial attribution, translatorial attribution,
genre, gender and chronology. Text collections in a
variety of languages will be provided; workshop
attendees are welcome to bring even more texts (in
either plain text format or TEI-XML). No previous
knowledge of R is necessary: our script is very
user-friendly (and very fast)!
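The description above names word n-grams and letter n-grams as features alongside single words. The following sketch, again in Python rather than R, shows one way such features and their relative frequencies could be extracted from plain text. The function names are hypothetical, and counting spaces as part of letter n-grams is an assumption of this example, not a documented behaviour of the scripts.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return every run of n consecutive word tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def letter_ngrams(text, n):
    """Return every run of n consecutive characters, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def most_frequent(items, k):
    """Return the k most frequent items with their relative frequencies."""
    counts = Counter(items)
    return [(item, count / len(items)) for item, count in counts.most_common(k)]


# A short sample line, used here only to exercise the functions.
tokens = "to be or not to be".split()
print(most_frequent(word_ngrams(tokens, 2), 3))
print(most_frequent(letter_ngrams("to be or not to be", 3), 3))
```

The resulting frequency lists can be fed into the same kind of distance computation as single-word frequencies.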
2. Tutorial Outline
During a brief introduction, (1) R will be installed on
the users’ laptops from the Internet (if it has not
already been installed); (2) participants will receive
CDs or pendrives with the script(s), a short quickstart
guide and several text collections prepared for
analysis; (3) some theory behind this particular
stylometric approach will be discussed, and the
possible uses of the tools presented will be
summarized. After (4) a short set of instructions,
participants will move on to (5) hands-on analysis,
producing as many different results as possible to
better assess the various aspects of stylometric
study; (6) additional texts may be downloaded from
the Internet or added by the participants themselves.
The results, both numerical and visual, will be
analyzed. For those more advanced in R (or S, or
Matlab), details of the script (R methods, functions,
and packages) will be discussed.
3. Special Requirements
Participants should come with their own laptops. We
have versions of the scripts for Windows, MacOS and
Linux. The workshop also requires a projector and
an Internet connection in the workshop room.
Digital Humanities 2012
References
Baayen, H. (2008). Analyzing Linguistic Data:
A Practical Introduction to Statistics using R.
Cambridge: Cambridge UP.
Burrows, J. (1987). Computation into Criticism: A
Study of Jane Austen’s Novels and an Experiment in
Method. Oxford: Clarendon Press.
Burrows, J. F. (2002). ‘Delta’: a measure of stylistic
difference and a guide to likely authorship. Literary
and Linguistic Computing 17(3): 267-287.
Craig, H. (1999). Authorial attribution and
computational stylistics: if you tell authors apart,
have you learned anything about them? Literary and
Linguistic Computing 14(1): 103-113.
Craig, H., and A. F. Kinney, eds. (2009).
Shakespeare, Computers, and the Mystery of
Authorship. Cambridge: Cambridge UP.
Eder, M. (2010). Does size matter? Authorship
attribution, small samples, big problem. Digital
Humanities 2010: Conference Abstracts. King’s
College London, pp. 132-135.
Eder, M. (2011). Style-markers in authorship
attribution: a cross-language study of the authorial
fingerprint. Studies in Polish Linguistics 6: 101-116.
Eder, M., and J. Rybicki (2011). Stylometry with
R. Digital Humanities 2011: Conference Abstracts.
Stanford University, Stanford, pp. 308-311.
Eder, M., and J. Rybicki (2012). Do birds of
a feather really flock together, or how to choose
test samples for authorship attribution. Literary and
Linguistic Computing 27 (in press).
Hoover, D. L. (2004). Testing Burrows’s Delta.
Literary and Linguistic Computing 19(4): 453-475.
Jockers, M. L., and D. M. Witten (2010). A
comparative study of machine learning methods
for authorship attribution. Literary and Linguistic
Computing 25(2): 215-223.
Koppel, M., J. Schler, and S. Argamon (2009).
Computational methods in authorship attribution.
Journal of the American Society for Information
Science and Technology 60(1): 9-26.
Oakes, M., and A. Pichler (2012). Computational
Stylometry of Wittgenstein’s Diktat für Schlick.
Bergen Language and Linguistic (Bells) Series, (in
press).
Rybicki, J. (2012). The great mystery of the (almost)
invisible translator: stylometry in translation. In
M. Oakes and M. Ji (eds.), Quantitative Methods
in Corpus-Based Translation Studies. Amsterdam:
John Benjamins.
Rybicki, J., and M. Eder (2011). Deeper Delta
across genres and languages: do we really need
the most frequent words? Literary and Linguistic
Computing 26(3): 315-321.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

ADHO - 2012

"Digital Diversity: Cultures, languages and methods"

Hosted at Universität Hamburg (University of Hamburg)

Hamburg, Germany

July 16, 2012 - July 22, 2012

Conference website: http://www.dh2012.uni-hamburg.de/

Series: ADHO (7)

Organizers: ADHO
