An Unsupervised Lemmatization Model for Classical Languages

poster / demo / art installation
Authorship
  1. James O'Brien Gawley

    University at Buffalo, State University of New York (SUNY)

Work text

The lemmatization task, wherein a computer model must group together inflected forms derived from the same stem, or 'lemma,' is a fundamental problem in language modeling. Software that handles more complex humanistic tasks like authorship attribution or intertextuality detection relies on lemmatization as an early step.

In classical languages, the current standard depends on training sophisticated models on supervised data sets (Burns, 2018). These data sets include correct lemmatizations tagged by expert readers, a labor-intensive task. Modern languages can avoid generating supervised training data by taking advantage of much larger data sets. Moon and Erk (2008), for example, used an aligned corpus of English and German to infer lemmatization schemes without recourse to hand-tagged training data. Classical languages do not feature very large aligned corpora, and may not have access to a database of expert annotation for training new models.

This paper presents a technique for inferring a lemmatization model without training data, and tests the performance of this technique on classical Latin. Performance is on par both with supervised models of Latin and with models of modern languages derived from large, aligned data sets. In ambiguous cases, where a token might derive from more than one lemma, the model identifies the correct choice in roughly 66% of trials, roughly twice as often as random chance.

The technique presented here delivers this performance from raw text alone, and can therefore be applied to languages for which annotated training data is limited or non-existent. The data set used to train the model was provided by the open-source Tesserae project (http://tesserae.caset.buffalo.edu). The test data was supplied by the Classical Language Toolkit (CLTK: http://cltk.org).

The lemmatization model begins by determining the relative frequency of all lemmata found in a given corpus. The underlying assumption is that selecting the more common lemma is the most computationally efficient way of disambiguating between possibilities. To illustrate with an example: in the first line of Vergil's Aeneid, the word 'Arma' might stem from the verb 'armō', meaning 'to arm', or the noun 'arma', meaning 'weapons' (the latter possibility is correct). One intuitive way to resolve the ambiguity is to select the lemma that appears more frequently in the corpus. But how do we determine the frequency of each lemma, without training data, when the tokens in a given corpus are ambiguous?

To resolve this problem, the present study assumes that all possible lemmata are present in each ambiguous case. Figure 1 below illustrates this process with three tokens from Vergil's Aeneid.

Figure 1

As illustrated, the word form 'arma' might come from the verb 'armō' or the noun 'arma' (the latter is correct), while the form 'armis', found later in the poem, might come from the noun 'arma' or the noun 'armus' (again, the latter is correct). Different forms of the noun 'arma' overlap with different lemmata, but all of them share 'arma' as a potential stem. In other words, each lemmatization is correct in the same way, but incorrect in a different way. Over several million tokens in the classical Latin corpus, the wrong answers begin to cancel each other out, and the frequency counts in the model gradually converge on the true rate of appearance of each lemma.
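
The counting step can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the Tesserae implementation: the CANDIDATES table is a hypothetical stand-in for a full Latin morphological lookup, populated here with just the 'arma' example (the third form, 'armorum', is added for illustration).

```python
from collections import Counter

# Hypothetical candidate lookup: maps an inflected form to every lemma
# that could produce it. A real model would draw this from a Latin
# morphological dictionary covering the whole corpus.
CANDIDATES = {
    'arma':    ['arma', 'armo'],   # noun 'weapons' or verb 'to arm'
    'armis':   ['arma', 'armus'],  # noun 'weapons' or noun 'shoulder'
    'armorum': ['arma', 'armus'],
}

def train_frequency_model(tokens):
    """Credit every candidate lemma of every token in the corpus.

    Ambiguous tokens vote for all of their possible lemmata: the
    spurious votes scatter across many rival lemmata, while the true
    lemma is reinforced by every one of its inflected forms.
    """
    counts = Counter()
    for token in tokens:
        # Unknown forms fall back to themselves as their own lemma.
        for lemma in CANDIDATES.get(token, [token]):
            counts[lemma] += 1
    return counts

counts = train_frequency_model(['arma', 'armis', 'armorum'])
# counts -> Counter({'arma': 3, 'armus': 2, 'armo': 1})
```

Even in this three-token toy corpus, the shared stem 'arma' already outnumbers its rivals, which is the cancellation effect described above.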

Once the unsupervised frequency model has been trained, the lemmatizer simply selects the most frequently seen stem in ambiguous cases. Given an inflected form that might come from any of several lemmata, the lemmatizer selects the lemma that is most frequent in the corpus. The example of 'arma' is shown in Figure 2: because the noun 'arma' was seen more often in the corpus, it is selected as the correct lemmatization here.

Figure 2
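
Continuing the same sketch, the disambiguation step reduces to taking a maximum over the trained counts (again using the hypothetical CANDIDATES lookup defined above):

```python
def lemmatize(token, counts):
    """Select the candidate lemma seen most often in the corpus."""
    candidates = CANDIDATES.get(token, [token])
    # A Counter returns 0 for unseen lemmata, so max() is always safe.
    return max(candidates, key=lambda lemma: counts[lemma])

print(lemmatize('arma', counts))  # -> 'arma': the noun outscores the verb 'armo'
```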

Tested repeatedly against hand-lemmatized Latin text from the CLTK training data, this unsupervised lemmatizer selected the correct stem for roughly 89% of all tokens. This performance is comparable to that of the more sophisticated models currently in use for Latin lemmatization. It also exceeds the performance of random selection, which identifies the correct stem for only 79% of all tokens; the random baseline is already high because roughly 73% of Latin tokens are unambiguous.
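
A minimal sketch of that evaluation, assuming a hypothetical list of (token, gold_lemma) pairs drawn from hand-lemmatized text, would look like this:

```python
def accuracy(gold_pairs, counts):
    """Fraction of tokens whose predicted lemma matches the gold lemma."""
    correct = sum(1 for token, gold in gold_pairs
                  if lemmatize(token, counts) == gold)
    return correct / len(gold_pairs)

# e.g. accuracy([('arma', 'arma'), ('armis', 'armus')], counts) -> 0.5,
# since the toy model wrongly prefers 'arma' for the ambiguous 'armis'.
```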

Languages with greater ambiguity, such as classical Hebrew, may not achieve performance at this level. However, under-served languages such as Coptic Egyptian might use this model to build reliable lemmatizers without consuming resources in the annotation of training data.

Bibliography

Burns, P. (2018). Backoff Lemmatization for Ancient Greek with the Classical Language Toolkit. Digital Classicist London Seminar Series. London, England, July 27.

Moon, T. and Erk, K. (2008). Minimally supervised lemmatization scheme induction through bilingual parallel corpora. First International Conference on Global Interoperability for Language Resources. Hong Kong, China, January 9-11.

Conference Info

ADHO - 2019
"Complexities"

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019
