A framework for multilayered boundary detection: initial results from the Clementine Vulgate

Thomas Lippincott

Columbia University

We present a framework for general boundary detection notation
to create “hypothesis-files” that capture a single
testable theory for a classifier to attempt to learn. This
notation is capable of fine-grained divisions, down to the
level of individual words. It can also incorporate multiple
primary sources and languages.
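As a minimal sketch of what such a hypothesis could look like in memory,
the following assumes a hypothesis is a list of labelled spans addressed
by (chapter, verse, word) positions within a book. The names `Span` and
`label_for`, the source identifier, and the span boundaries are all
hypothetical illustrations, not the framework's actual notation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (chapter, verse, word index); a hypothetical addressing scheme.
Position = Tuple[int, int, int]


@dataclass(frozen=True)
class Span:
    source: str      # identifier of the primary source (assumed)
    book: str        # a span is confined to a single book
    start: Position  # inclusive
    end: Position    # inclusive
    label: str       # class the hypothesis assigns to the span


def label_for(spans: List[Span], source: str, book: str,
              pos: Position) -> Optional[str]:
    """Return the label of the last span covering pos, or None.

    Later spans override earlier ones, so a narrow passage can be
    carved out of a book-level default.
    """
    found = None
    for span in spans:
        if (span.source == source and span.book == book
                and span.start <= pos <= span.end):
            found = span.label
    return found


# Placeholder boundaries: a book-level label with one passage overridden.
hypothesis = [
    Span("vulgate", "Esther", (1, 1, 1), (16, 24, 999), "main"),
    Span("vulgate", "Esther", (10, 4, 1), (16, 24, 999), "addition"),
]

early = label_for(hypothesis, "vulgate", "Esther", (2, 5, 3))
late = label_for(hypothesis, "vulgate", "Esther", (11, 1, 1))
missing = label_for(hypothesis, "vulgate", "Genesis", (1, 1, 1))
```

Word-level positions let a single verse be split between labels, and the
`source` field leaves room for more than one primary text.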
The simplest feature set we consider is lemma frequencies.
These perform superbly, as they train against topical
lemmas. To prevent arbitrary over-training, such as
using proper names as features, we only retain words
recognized by a general-purpose Latin dictionary. Still,
overtraining to the narrative remains a concern, and for
that reason we do not focus on this feature set. A reasonable
approach may be to only use non-topical function
words, which were found to perform well in a study by
The second feature set is part-of-speech frequencies.
Latin words all begin as noun, verb or adjective, and
through inflection take on diverse parts of speech. Since
it is somewhat arbitrary how we distinguish parts of
speech and inflected forms here, we have begun with
the extremes of the fundamental types and the fully-inflected
The third feature set is inflectional frequencies. Collatinus
[10] is capable of lemmatising Latin words and
preserving the inflectional information that is stripped
off. We find 634 different inflection types throughout
the Vulgate. Unlike in most languages, an isolated word
in Latin can show an unambiguous syntactic role via
inflection. Therefore, this feature set includes syntactic
labels, which proved useful for Hirst et al.[6] when extracted
as bigrams. We have not attempted this yet, as the
inflectional analysis needs to be improved first. This is
an important issue that we will discuss in depth.
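The three feature sets described above all reduce to relative frequency
counts over analysed tokens. The sketch below illustrates that idea,
assuming each token has already been analysed into a (lemma, part of
speech, inflection tag) triple. The function names, the tag strings and
the tiny dictionary are invented for illustration and are not Collatinus
output or the framework's API.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# (lemma, part of speech, inflection tag); tag strings are invented.
Token = Tuple[str, str, str]


def relative_frequencies(items: Iterable[str]) -> Dict[str, float]:
    """Map each distinct item to its share of the total count."""
    counts = Counter(items)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


def extract_features(tokens: List[Token],
                     dictionary: Set[str]) -> Dict[str, float]:
    """Build one feature vector from three frequency tables.

    Lemmas missing from the dictionary (e.g. proper names) are dropped
    before counting, mirroring the filter described in the text.
    """
    groups = (
        ("lemma", [lem for lem, _, _ in tokens if lem in dictionary]),
        ("pos", [pos for _, pos, _ in tokens]),
        ("infl", [tag for _, _, tag in tokens]),
    )
    features: Dict[str, float] = {}
    for prefix, items in groups:
        for key, value in relative_frequencies(items).items():
            features[f"{prefix}:{key}"] = value
    return features


# Toy input: one proper name and three dictionary words.
tokens = [
    ("Abraham", "noun", "nom.sg"),
    ("video", "verb", "perf.3sg"),
    ("locus", "noun", "acc.sg"),
    ("et", "conj", "indecl"),
]
dictionary = {"video", "locus", "et"}
features = extract_features(tokens, dictionary)
```

Prefixing each key with its feature set keeps the three tables separable,
so a set such as the lemma frequencies can be dropped without re-running
the analysis.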
For the proof-of-concept, our targets for machine learning
are relatively undisputed divisions of the text. For
example, we consider language (immediately prior to
the Vulgate), literary style, and original author. Dividing
the text according to these features, the boundaries usually
fall between books. There are exceptions to this, for
example a passage in Esther known to have been written
separately, but for the purposes of our initial experiments
we divided the text into sets of books. The results
confirm the framework’s ability to detect stylistic boundaries,
and careful examination of its “misclassifications”
sometimes reveals subtle textual affinities.
Friedman[3] has presented a fine-grained theory of the
composition of the Torah, and we intend to encode this
in our hypothesis. Space and time permitting, we will
present the results of several variations of the theory as
applied to the Hebrew Masoretic text of the Torah.
in texts with complex compositional histories.
The framework is designed for end-to-end testing of
hypotheses via linguistic feature extraction and machine
learning. We describe initial results on the Vulgate Bible
utilizing the inflectional richness of the Latin language
and several well-known facets of its composition. These
results indicate that the framework is an effective testbed
for theories in source criticism, and we propose further
work that would extend its functionality to more texts
and facets.
Texts with a history as rich as the Bible present a unique
opportunity to study the interaction of compositional
features. Scholarship ranges from consensus on fundamental
points, to competing theories in source criticism
and translation. Moreover, passages in the Bible have
been grouped by style (poetic, historical, legal), function
(apocalyptic, prophetic), traditional author (Moses,
Joshua), historical time period (Torah, Lamentations) and
so forth. It is less clear what, if any, practical linguistic
differences these groupings represent, and how they
have interacted over time. We consider several widely-accepted
scholarly beliefs in choosing the targets for our
preliminary machine-learning experiments.
We perform proof-of-concept experiments on the Clementine
Vulgate, the official canon of the Catholic Church
from 1592 to 1979, because of the relative uniformity
and well-documented history of the text. The Vulgate
is composed entirely in Latin, a highly-inflectional liturgical
language of the Catholic Church and medieval
scholarship. Its regular, rich morphology makes it very
amenable to computational linguistics, although as a liturgical
language it receives little attention in practical
contexts. Every non-function word is distinguished by a
suffix which indicates grammatical qualities like gender,
number, tense, voice, etc. as well as syntactic role. Most
words belong to one of a small number of classes for
which these endings are completely deterministic: for
example, nouns belong to one of five declensions, while
verbs belong to one of four conjugations. Strict agreement
between parts of speech makes word-order almost
irrelevant, semantically. The text itself was composed
circa 400 A.D. by Jerome, from Greek, Hebrew, Latin
and Aramaic sources, and is accompanied by his commentary
on his translation methodology.
Non-traditional literary studies
Before presenting the framework, we address some
common pitfalls that arise when applying computational
methods to an ancient text, and how we attempt to avoid
them. Rudman[13] gives an overview of inherent problems
in such studies. Of these, we are particularly concerned
here with addressing the following: knowledge
of the disciplines that make up the field and incomplete
selection of style markers.
To avoid the errors of the interloper, we keep language-
and domain-specific choices distinct from our general
framework. Our principles are simply a) the text of the
Bible can in principle be divided along many historical
dimensions, b) linguistic features may remain that indicate
these divisions, c) machine learning, based on
these features, will be more successful at learning valid
than invalid divisions. These, we feel, are unbiased general
assumptions that lay the groundwork for collaboration
with domain experts.
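Principle c) can be made concrete by comparing a classifier's accuracy on
a hypothesised division against its accuracy on the same data with the
labels shuffled. The sketch below is only an illustration under stated
assumptions: it swaps in a toy nearest-centroid classifier with
leave-one-out evaluation rather than a WEKA learner, and the feature
vectors, labels and function names are invented.

```python
import random
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_accuracy(vectors: List[Vector], labels: List[str]) -> float:
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i, (held_out, gold) in enumerate(zip(vectors, labels)):
        groups: Dict[str, List[Vector]] = {}
        for j, (vec, lab) in enumerate(zip(vectors, labels)):
            if j != i:
                groups.setdefault(lab, []).append(vec)
        predicted = min(
            groups,
            key=lambda lab: squared_distance(held_out, centroid(groups[lab])),
        )
        correct += predicted == gold
    return correct / len(vectors)


def shuffled_baseline(vectors: List[Vector], labels: List[str],
                      trials: int = 200, seed: int = 0) -> float:
    """Mean leave-one-out accuracy over randomly shuffled labellings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        total += loo_accuracy(vectors, shuffled)
    return total / trials


# Invented two-feature vectors for eight text units in two groups.
vectors = [
    [0.10, 0.90], [0.12, 0.88], [0.08, 0.92], [0.11, 0.89],
    [0.90, 0.10], [0.88, 0.12], [0.92, 0.08], [0.89, 0.11],
]
labels = ["a"] * 4 + ["b"] * 4

hypothesis_accuracy = loo_accuracy(vectors, labels)
baseline_accuracy = shuffled_baseline(vectors, labels)
```

A division that reflects real linguistic differences should score clearly
above the shuffled baseline; a division that scores no better gives no
evidence that the features separate the proposed groups.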
The pitfalls in feature selection (“style markers”) include
limited feature sets and unfounded generalisations about
feature relevance (i.e. “style as a monolithic concept”).
We are very conscious of this, and in fact an initial motivation
for the study was to investigate the heterogeneous
usage of “style” in a text that demonstrates so many styles. We
cast a wide net in feature extraction, and present our
reasoning for subsequent modifications to this set.
Finally, Rudman[13] argues that non-traditional (i.e.
computational) studies should only follow extensive traditional
studies. This criterion is certainly met here: in
fact, our results so far are entirely based and evaluated
upon hypotheses developed over the past two centuries
of Biblical criticism, and the study concludes with an in-depth
application to a dominant theory in the field.
We will present our framework in detail: the major
points are that it is written in the Python programming
language, uses TEI-derived document encodings, and
uses the WEKA toolkit for machine learning. Primary
concerns are generality and modularity: specifically, the
feature extraction methods are simple APIs that can eas

