LiLa: Linking Latin. Building a Knowledge Base of Linguistic Resources for Latin

Marco Passarotti; Flavio M. Cecchini; Greta Franzini; Eleonora Litta; Francesco Mambrini; Paolo Ruffolo; Rachele Sprugnoli

Authorship

1. Marco Passarotti

Università Cattolica del Sacro Cuore
2. Flavio M. Cecchini

Università Cattolica del Sacro Cuore
3. Greta Franzini

Università Cattolica del Sacro Cuore
4. Eleonora Litta

Università Cattolica del Sacro Cuore
5. Francesco Mambrini

Università Cattolica del Sacro Cuore
6. Paolo Ruffolo

Università Cattolica del Sacro Cuore
7. Rachele Sprugnoli

Università Cattolica del Sacro Cuore

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Introduction
The spread of information technology has led to a substantial growth in the quantity, diversity and complexity of linguistic data available on the web. Today, although a large variety of linguistic resources (such as corpora, lexica and ontologies) exist for a number of languages, these are often not interoperable. Linking linguistic resources to one another would maximise their contribution to, and use in, linguistic analysis at multiple levels, be those lexical, morphological, syntactic, semantic or pragmatic. In an ideal virtuous cycle, linguistic resources benefit from the Natural Language Processing (NLP) tools used to build them, and NLP tools benefit from the linguistic resources that provide the empirical evidence upon which they are built.

The project
The objective of the
LiLa: Linking Latin project (2018-2023) is to connect and, ultimately, exploit the wealth of linguistic resources and NLP tools for Latin developed thus far, in order to bridge the gap between raw language data, NLP and knowledge description.

LiLa, which has received funding from the European Research Council (ERC) European Union’s Horizon 2020 Research and Innovation programme, is building a Linked Data Knowledge Base of linguistic resources and NLP tools for Latin. The Knowledge Base consists of different kinds of objects connected through edges labelled with a restricted set of values taken from a vocabulary of knowledge description. LiLa collects and connects both existing and newly-generated (meta)data. The former are mostly linguistic resources (corpora, lexica, ontologies, dictionaries, thesauri) and NLP tools (tokenisers, lemmatisers, PoS-taggers, morphological analysers and dependency parsers) for Latin. These are currently available from different providers under different licences. As for the latter, LiLa assesses a set of selected linguistic resources by expanding their lexical and/or textual coverage. In particular, it (a) enhances the Latin texts made available by existing digital libraries and resources with PoS-tagging and lemmatisation, (b) harmonises the annotation of the three Universal Dependencies treebanks for Latin

http://universaldependencies.org/

, (c) improves the lexical coverage of the
Latin WordNet

http://www.cyllenius.net/labium/index.php?option=com_content&task=view&id=21&Itemid=49

and the valency lexicon
Latin-Vallex

https://itreebank.marginalia.it/view/lvl.php

, and (d) expands the textual coverage of the
Index Thomisticus Treebank

https://itreebank.marginalia.it/view/ittb.php

. Furthermore, LiLa builds a set of newly-trained models for PoS-tagging and lemmatisation, and works on developing and testing the best performing NLP pipeline for such a task.

Conceptual model of LiLa.

As can be observed from the simplified conceptual model illustrated in Figure 1, the LiLa Knowledge Base is highly lexically-based.
Lemmas are the key node type in the Knowledge Base. Lemmas occur in
Lexical Resources (as lexical entries), but may have one or more (inflected)
Forms. For instance, the lemma
puella, ‘girl’ has forms like
puellam,
puellis and
puellas. Forms, too, can occur in lexical resources; for instance, in a lexicon containing all the word forms of a language (e.g.,
Thesaurus Formarum Totius Latinitatis, Tombeur, 1998). Both Lemmas and Forms can have one or more graphical variants (
condicio vs.
conditio). The occurrences of Forms in real texts are
Tokens, which are provided by
Textual Resources. Texts in Textual Resources can be different editions/versions of the same Work (e.g., the various editions of the
Orator by Cicero, possibly provided by different Textual Resources). Finally,
NLP tools process either
Forms (e.g., a morphological analyser) or Tokens (e.g., a PoS-tagger).

LiLa develops across the following five Work Packages (WPs):

WP1:
Selecting and Improving Linguistic Resources for Latin. This WP assesses resources eligible to enter the Knowledge Base, i.e., linguistic data-sets.

WP2:
Building the Knowledge Base. This WP represents the core of the project, as it aims to model the linguistic resources for Latin collected and selected in WP1 to build the Knowledge Base. Furthermore, this WP aims to make NLP tools for Latin interoperable and to connect them with linguistic resources in order to exploit the empirical evidence these provide for different NLP purposes.

WP3:
Querying the Knowledge Base. This WP intends to build a user-friendly interface to allow users to write and run SPARQL queries on interconnected linguistic resources.

WP4:
Testing and Evaluating the Knowledge Base. This WP tests the Knowledge Base by conducting research on its (meta)data.

WP5:
Disseminating the Results. This WP is devoted to the dissemination of the results of the project through publications, conference presentations, tutorials and workshops.

This poster contribution presents the detailed structure of LiLa, describes how it meets the so-called
FAIR Guiding Principles for scientific data management and stewardship, which state that scholarly data must be
Findable,
Accessible,
Interoperable and
Reusable (Wilkinson, 2016), and elaborates on the progress made in WP1.

Acknowledgements
The LiLa project has received funding from the European Research Council (ERC) European Union’s Horizon 2020 Research and Innovation programme under grant agreement No. 769994.

Bibliography

Tombeur, P. (1998).
Thesaurus formarum totius Latinitatis: a Plauto usque ad saeculum XXum. Brepols, Turnhout, Belgium.

Wilkinson, M. D. et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship.
Scientific Data,
3. http://dx.doi.org/10.1038/sdata.2016.18

Full text license: This text is republished here with permission from the original rights holder.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2019

"Complexities"

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019

436 works by 1162 authors indexed

Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html

References: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/programme/book-of-abstracts/index.html

Series: ADHO (14)

Organizers: ADHO