Automating the Search for Cross-language Text Reuse

paper, specified "short paper"
Authorship
  1. James Gawley

    University at Buffalo, State University of New York (SUNY)

  2. Christopher Forstall

    University at Buffalo, State University of New York (SUNY)

  3. Konnor Clark

    University at Buffalo, State University of New York (SUNY)

Work text

Tesserae is an open-source, online tool for the automated detection of allusions in Classical literature. Originally limited to Latin poetry, the corpus of texts available to Tesserae has recently expanded to include Greek poetry and drama. Word-level n-grams form the foundation of the existing detection algorithm: a standard search returns all instances in which a phrase in a later text shares two words with a phrase in an earlier text. This method has previously been shown to capture reliably the intertextual parallels already noted by philologists and to identify significant, previously unrecorded intertexts.
The ability to detect allusions across the language barrier would be a natural extension of Tesserae’s functionality as well as a significant contribution to Classical philology. Roman poets openly acknowledged their indebtedness to Greek literature (Horace famously remarked, “Greece, being conquered, tamed its wild conqueror, and brought the Arts to rustic Latium”), and scholarly studies of Latin poetry have long commented on allusions to earlier Greek sources. To apply the existing system where a Latin text alludes to a Greek one, Tesserae requires a translation dictionary linking Greek lemmata to associated Latin lemmata. This paper details two methods for building such a dictionary automatically and compares them by their ability to capture the parallels between Book 1 of Vergil's Aeneid and the Iliad of Homer noted by G.N. Knauer in his commentary.
The first method represents an original application of Bayes' theorem to a word-by-word alignment of the Greek New Testament with Jerome's Latin Vulgate.

For a given Greek word Gi, the set of Greek Bible verses in which it appears is identified. The words contained in the Latin translation of these verses become the set of possible translation candidates L. For each Li, the set of possible Greek words G is gathered from the Greek verses corresponding to the Latin verses in which Li appears. P(Gi|Li) is represented by the number of words in set G which may share a lemma with Gi, divided by the total number of words in that set. The probability of Gi is represented by a similar calculation, where the set of all words within the Greek text is substituted for G. The value of P(Li) is analogous. Bayes' theorem then combines these quantities into an estimate of P(Li|Gi), the probability that Li translates Gi. The success of this relatively simple alignment algorithm, as compared with the more classical IBM Models or Hidden Markov Models, may be explained by the grammatical similarity of these two inflected languages and by the importance the translator placed on remaining precisely faithful to the syntax of the original text.
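The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the verses are already lemmatized and aligned one-to-one, it assumes the three probabilities are combined by plain Bayes' rule, and the function name bayes_scores and the transliterated toy verses are invented for the example.

```python
from collections import Counter

def bayes_scores(greek_verses, latin_verses, greek_lemma):
    """Score Latin candidates for one Greek lemma as P(G|L) * P(L) / P(G).

    Assumption: verse i of greek_verses is the source of verse i of
    latin_verses, and every verse is a list of lemmata.
    """
    # Corpus-wide counts supply P(G) and P(L).
    greek_counts = Counter(w for verse in greek_verses for w in verse)
    latin_counts = Counter(w for verse in latin_verses for w in verse)
    n_greek = sum(greek_counts.values())
    n_latin = sum(latin_counts.values())
    p_g = greek_counts[greek_lemma] / n_greek

    # Candidate set L: Latin lemmata in verses whose Greek side has the lemma.
    candidates = set()
    for g_verse, l_verse in zip(greek_verses, latin_verses):
        if greek_lemma in g_verse:
            candidates.update(l_verse)

    scores = {}
    for cand in candidates:
        # Set G: Greek tokens of every verse whose Latin side contains cand.
        pool = [w for g_verse, l_verse in zip(greek_verses, latin_verses)
                if cand in l_verse for w in g_verse]
        p_g_given_l = pool.count(greek_lemma) / len(pool)
        p_l = latin_counts[cand] / n_latin
        scores[cand] = p_g_given_l * p_l / p_g
    return scores

# Invented toy "verses" (transliterated lemmata), for illustration only.
toy_greek = [["logos", "theos"], ["logos", "phos"], ["theos", "agape"]]
toy_latin = [["verbum", "deus"], ["verbum", "lux"], ["deus", "caritas"]]
toy_scores = bayes_scores(toy_greek, toy_latin, "logos")
```

On this toy data the highest-scoring candidate for "logos" is "verbum", because it co-occurs with "logos" in both verses where either appears.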
The second method uses English as a pivot language, an approach inspired by earlier work on Latin-Greek synonymy by Jeffrey Rydberg-Cox at Perseus. Using the XML-encoded digital editions of Lewis and Short’s Latin-English Lexicon and Liddell and Scott’s Greek-English Lexicon, two dictionaries widely considered authoritative for Classical languages and available through the Perseus Digital Library, each Latin or Greek headword is characterized by a feature set composed of the English words appearing in its definition. The Python-based Gensim topic modelling tools are then used to transform the English word counts to TF-IDF weights and calculate similarities between the dictionary entries. The similarity scores between entries are then interpreted as similarities in meaning between the respective headwords.
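The idea behind the pivot method can be illustrated with a small, self-contained sketch. It is a plain-Python stand-in for the TF-IDF and similarity steps, not Gensim's API and not the authors' code: the weighting formula (raw count times log inverse document frequency, then cosine similarity) is an assumption, and the glosses, the "la:" and "grc:" key prefixes, and the function names are invented for the example rather than taken from the lexica.

```python
import math
from collections import Counter

def tfidf_vectors(definitions):
    """Map each headword to a unit-length TF-IDF vector over its gloss words."""
    n_docs = len(definitions)
    # Document frequency: in how many definitions each English word occurs.
    doc_freq = Counter(w for words in definitions.values() for w in set(words))
    vectors = {}
    for head, words in definitions.items():
        counts = Counter(words)
        vec = {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        norm = math.sqrt(sum(x * x for x in vec.values()))
        vectors[head] = {w: x / norm for w, x in vec.items()} if norm else {}
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())

def best_latin(greek_head, vectors):
    """Return the Latin headword whose definition is most similar."""
    latin_heads = [h for h in vectors if h.startswith("la:")]
    return max(latin_heads, key=lambda h: cosine(vectors[greek_head], vectors[h]))

# Invented English glosses, for illustration only.
toy_definitions = {
    "la:bellum": ["war", "battle", "combat"],
    "la:amor": ["love", "desire"],
    "grc:polemos": ["war", "battle", "fight"],
    "grc:eros": ["love", "desire", "passion"],
}
toy_vectors = tfidf_vectors(toy_definitions)
```

Pooling the Latin and Greek entries before weighting means that English words common to many definitions contribute little to any similarity score, which is the usual motivation for TF-IDF over raw counts.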
Each of the two methods described above produces pairwise similarities between all Greek and Latin words considered, with those pairings rated by a probability measure between 0 and 1. Because each Greek word may have more than one possible Latin translation, each method accepts the top two translation candidates as valid.
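Selecting the two best candidates from a table of scores is straightforward. The following sketch assumes the scores for one Greek lemma are held in a dictionary keyed by Latin lemma; the function name and the toy scores are invented for illustration.

```python
import heapq

def top_candidates(scores, k=2):
    """Return the k highest-scoring Latin lemmata for one Greek lemma."""
    ranked = heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])
    return [lemma for lemma, _ in ranked]

# Invented similarity scores, for illustration only.
toy_candidate_scores = {"verbum": 0.50, "sermo": 0.30, "deus": 0.05}
best_two = top_candidates(toy_candidate_scores)
```

A lemma with fewer than two scored candidates simply yields a shorter list, and a lemma with none yields an empty one.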
The text of Homer’s Iliad is then indexed according to a feature set made up of Latin translation candidates. Each Greek token is lemmatized, and the token is then indexed according to all possible Latin translation candidates. Because lemmatization is unsupervised, ambiguous forms may have multiple possible Greek lemmata. Each possible Greek lemma will have two translation candidates if the respective translation method is successful, or zero if no translations are found. The text of Vergil’s Aeneid is indexed simply according to the possible Latin lemmata of each token. A given token in Vergil matches a token in Homer where one or more possible lemmata for the Latin word match against the set of translation candidates for the Greek word. A pair of phrases, one in Greek and the other in Latin, which share two or more words that match in this way, is returned as a possible allusion.
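The indexing and matching steps can be sketched as follows. This is a hedged illustration rather than Tesserae's actual code: the lemmatizers are replaced by lookup dictionaries, the rule that each Greek token may be paired with at most one Latin token (applied greedily) is an assumption the source does not specify, and all names and toy phrases are invented for the example.

```python
def index_greek(tokens, greek_lemmata, translations):
    """Index each Greek token by the Latin candidates of all its possible lemmata."""
    indexed = []
    for token in tokens:
        features = set()
        for lemma in greek_lemmata.get(token, []):
            # Each lemma contributes its candidates, or nothing if none were found.
            features.update(translations.get(lemma, []))
        indexed.append(features)
    return indexed

def index_latin(tokens, latin_lemmata):
    """Index each Latin token by its possible lemmata."""
    return [set(latin_lemmata.get(token, [])) for token in tokens]

def shared_word_count(greek_features, latin_features):
    """Greedily pair Latin tokens with unused Greek tokens that share a feature."""
    used = set()
    matches = 0
    for latin in latin_features:
        for i, greek in enumerate(greek_features):
            if i not in used and latin & greek:
                used.add(i)
                matches += 1
                break
    return matches

def is_candidate_allusion(greek_features, latin_features):
    """A phrase pair is returned when two or more words match."""
    return shared_word_count(greek_features, latin_features) >= 2

# Invented toy data (transliterated Greek), for illustration only.
toy_greek_lemmata = {"menin": ["menis"], "aeide": ["aeido"], "thea": ["thea"]}
toy_translations = {"menis": ["ira", "furor"], "aeido": ["cano", "canto"],
                    "thea": ["dea", "diva"]}
toy_latin_lemmata = {"arma": ["arma", "armo"], "virumque": ["vir"],
                     "cano": ["cano"], "cane": ["cano", "canis"],
                     "dea": ["dea"], "iram": ["ira"]}

homer = index_greek(["menin", "aeide", "thea"], toy_greek_lemmata, toy_translations)
phrase_a = index_latin(["arma", "virumque", "cano"], toy_latin_lemmata)
phrase_b = index_latin(["cane", "dea", "iram"], toy_latin_lemmata)
```

In this toy case phrase_a shares only one word with the Greek phrase and is not returned, while phrase_b shares three and is. A greedy pairing can undercount in rare cases where an optimal one-to-one assignment would find more matches.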
The two methods are evaluated by their ability to detect a subset of Aeneid-Iliad parallels collated from the commentary of G.N. Knauer. Each method retrieves a distinct, though partially overlapping, subset of the parallels noted by Knauer. Comparison of the respective performance of both methods suggests that, while each method can be shown to identify significant Latin-Greek allusions, the Bayesian alignment method provides better recall of the benchmark set than the 'pivot' method at the expense of precision. We ultimately aim to combine the output of both approaches into a single feature set.
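The trade-off between recall and precision can be made concrete with a short sketch. It assumes each parallel is identified by a pair of locus labels and that any returned pair absent from the benchmark counts against precision; the labels and both result sets below are placeholders invented for illustration, not results from this study.

```python
def precision_recall(returned, benchmark):
    """Compute precision and recall of returned parallels against a benchmark set."""
    returned, benchmark = set(returned), set(benchmark)
    hits = returned & benchmark
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(benchmark) if benchmark else 0.0
    return precision, recall

# Placeholder locus pairs, for illustration only.
toy_benchmark = {("A1", "I1"), ("A2", "I2"), ("A3", "I3"), ("A4", "I4")}
toy_method_a = {("A1", "I1"), ("A2", "I2"), ("A3", "I3"),
                ("A5", "I9"), ("A6", "I8"), ("A7", "I7")}
toy_method_b = {("A1", "I1"), ("A2", "I2")}

result_a = precision_recall(toy_method_a, toy_benchmark)
result_b = precision_recall(toy_method_b, toy_benchmark)
```

Here the first placeholder method finds more of the benchmark but returns more pairs outside it, while the second is the reverse. Because a commentary is not exhaustive, pairs outside the benchmark are not necessarily spurious, so precision measured this way is best read as a lower bound.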
References

Tesserae, tesserae.caset.buffalo.edu (Accessed on November 1, 2013).
Coffee, N. et al. (2012). "Intertextuality in the Digital Age." Transactions of the American Philological Association 142(2), Autumn 2012, pp. 383-422.
Horace, Epistles 2.1.156–7.
Knauer, G.N. (1964). "Die Aeneis und Homer: Studien zur poetischen Technik Vergils mit Listen der Homerzitate in der Aeneis." Göttingen: Vandenhoeck & Ruprecht.
Personal communication with author; tool archived at perseus.mpiwg-berlin.mpg.de/PR/syn.ann.html
www.perseus.tufts.edu
radimrehurek.com/gensim

Conference Info

ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO