Named Entity Processing for Digital Humanities

Maud Ehrmann; Matteo Romanello; Simon Clematide

Authorship

1. Maud Ehrmann

DHLab - École Polytechnique Fédérale de Lausanne (EPFL)
2. Matteo Romanello

DHLab - École Polytechnique Fédérale de Lausanne (EPFL)
3. Simon Clematide

Universität Zürich (University of Zurich)

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Context and Motivation
Recognition and identification of real-world entities is at the core of virtually any text mining application. As a matter of fact, referential units such as names of persons, locations and organizations underlie the semantics of texts and guide their interpretation. Around since the seminal Message Understanding Conference (MUC) evaluation cycle in the 1990s (Grishman, 1996), named entity-related tasks have undergone major evolutions until now, from entity recognition and classification to entity disambiguation and linking (Nadeau et al 2007; Rao et al, 2013; Nouvel et al, 2015). Besides the general domain of well-written newswire data, named entity (NE) processing is also applied on specific domains, particularly bio-medical (Kim et al, 2013), and on more noisy inputs such as speech transcriptions (Galibert et al, 2014) and tweets (Ritter et al, 2011). More recently, NE processing has also been called upon to contribute to the domain of digital humanities, where massive digitization of historical documents is producing huge amounts of texts.
In the last few years, many cultural institutions have indeed engaged in large-scale digitization projects. Millions of images are being acquired and, when it comes to text, their content is transcribed, either manually via dedicated interfaces, or automatically via Optical Character Recognition (OCR). Beyond this great achievement in terms of document preservation and accessibility, the next crucial step is to provide an extensive and sophisticated access to the
content of these textual digital resources. In this regard, information extraction techniques, and particularly NE extraction and linking, can certainly be regarded as among the first steps.

De facto, NE processing tools are increasingly being used in the context of historical documents (cf. Ehrmann et al, 2016b for a state of the art). Research activities in this domain target texts of different nature (e.g., publications by cultural institutions, state-related documents, genealogical data, historical newspapers) and different tasks (NE recognition and classification, entity linking, or both). Experiments involve different time periods (from 16th to 20th c.), focus on different domains, and use different typologies. This great diversity demonstrates how many and varied needs are, but makes performance comparison difficult. As per language technologies in general (Sporleder, 2010), it appears that the application of NE processing on historical texts poses new challenges. First, inputs can be extremely noisy, with errors which do not resemble tweet misspellings or speech transcription hesitations, for which adapted approaches have already been devised (Ritter, 2011; Parada 2011). Second, the language under study is mostly of earlier stage(s), which renders usual external and internal evidences ineffective (e.g., the usage of different naming conventions). Further, beside historical VIPs, texts from the past contains rare entities which have undergone significant changes (esp. locations) or do no longer exist, and for which adequate linguistic resources and knowledge bases are missing (Ehrmann et al, 2016a). Finally, archives and texts from the past are not as anglophone as in today’s information society, making multilingual resources and processing capacities even more essential. Overall, and as already demonstrated by Vilain
et al. (2007), the transfer of NE tools from one domain to another is not straightforward and performances of NE tools, initially developed for homogeneous texts of the immediate past, are affected when applied on historical material.

Technically speaking, two strategies are usually followed: given an already existing system (available in-house, via web services, or publicly released), application ‘as it stands’, or after adaptation/tuning, generally by training on new material. Besides, the recent development and availability of deep learning architectures for NE recognition opens up new promises (Akbik et al, 2018), but these new approaches still need to be validated on cultural heritage texts.
In this context, we are pleased to offer a
tutorial on named entity processing on historical data which we hope will be beneficial for the DH community.

Objective
The objective of the tutorial is to provide the participants with essential knowledge with respect to a) NE processing in general and in DH, and b) how to apply NE recognition approaches. To this end, the session will be organized in two parts (theory and hands-on), as detailed in the synopsis below. Throughout the sessions, the audience will learn about the origins of named entity processing, the resources needed for their processing, the evaluation protocols, and the tools and algorithms used for their recognition, classification and disambiguation. Participants will also learn how to run an existing NER system and, more interestingly, how to build or adapt a system, by training it on historical materials.

Material
In the hands-on session we will make use of two datasets consisting of historical texts: 1. Quaero Dataset (French historical newspapers of the end of the XIX c., Galibert et al, 2012) and 2.
impresso dataset (Swiss and Luxembourgish historical newspapers in French and German, Ehrmann et al, 2019). Additionally, we will provide a list of alternative datasets, both historical and contemporary, that participants can decide to work with, in full respect of copyrights. Finally, participants are welcome to bring to the workshop their own datasets in order to apply the code and tools we will present to them.

Technical set-up
Hands-on material will be shared on GitHub and will include:

Jupyter notebooks with explanations and code examples; if relevant, we will set up a multi user environment (Jupyter Hub) in order to reduce system setup time during the tutorial;
a bibliography on the topic;
a list of of available open source academic and industrial tools;
slides of the tutorial.

Tutorial website:
https://impresso.github.io/named-entity-tutorial-dh2019/

GitHub repository:
https://github.com/impresso/named-entity-tutorial-dh2019

Impresso project

is supported by the Swiss National Science Foundation under grant CR- SII5_173719.

Bibliography

Akbik, A., Blythe, D. and Vollgraf, R. (2018).
Contextual string embeddings for sequence labeling. Proceedings of the 27th International Conference on Computational Linguistics (COLING ’18), Santa Fe, New Mexico, USA, August 2018.

Ehrmann, M., Nouvel, D. and Rosset, S. (2016a).
Named Entities Resources - Overview and Outlook. In N. Calzolari Conference Chair, K. Choukri, T. Declerck, M. Grobelnik, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the 10th International Conference on Language Resources and Evaluation, Portoro, Slovenia, May 2016.

Ehrmann, M., Colavizza, G., Rochat, Y. and F. Kaplan. (2016b).
Diachronic evaluation of NER systems on old newspapers. Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016) (No. EPFL-CONF-221391, pp. 97-107). Bochumer Linguistische Arbeitsberichte.

Ehrmann, M., Romanello, M., Kaplan, F., Düring, M., Bunout, E., Guido, D., Schroeder, P., van Beek, T., Fickers, A., Clematide, S., Ströbel, P. and Volk, M. (2019). Impresso - Media Monitoring of the Past (short paper) in Ridge, M. Colavizza, G., Brake, L., Ehrmann, M., Moreux, J-P., Prescott, A. The Past, Present and Future of Digital Scholarship with Newspaper Collections (panel). Digital Humanities Conference, Utrecht, Netherlands, July 2019.

Galibert, O. , Leixa, J., Adda, G., Choukri, K. and Gravier, G. (2014). The ETAPE Speech Processing Evaluation. Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’09), Reykjavik, Iceland, May 2014.

Galibert, O., Rosset, S., Grouin, C., Zweigenbaum, P. and Quintard, L. (2012).
Extended Named Entity Annotation on OCRed Documents: From Corpus Constitution to Evaluation Campaign. Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May 2012.

Grishman, R. and Sundheim, B. (1996).
Message Understanding Conference - 6: A brief history. Proceedings of the 16th Conference on Computational Linguistics (COLING ’96), Stroudsburg, PA, USA. Association for Computational Linguistics, 1: 466–471.

Kim, J.D., Ohta, T., Tateisi, Y. and Tsujii, J. (2003). Genia corpus, a semantically annotated corpus for bio-text mining. Bioinformatics, 19(1):180–182.

Nadeau, D. and Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.

Nouvel, D., Ehrmann, M. and Rosset, S. (2015). Les Entités Nommées pour le Traitement Automatique des Langues, ISTE edition.

Parada, C., Dredze, M., and Jelinek, F. (2011). OOV sensitive named-entity recognition in speech. In INTERSPEECH, 2011, pp. 2085–2088.

Rao, R., McNamee, P. and Dredze, M. (2013). Multisource, Multilingual Information Extraction and Summarization, chapter Entity Linking: Finding Extracted Entities in a Knowledge Base. Pages 93–115. Springer Berlin Heidelberg, Berlin, Heidelberg.

Ritter, A., Clark, S., Etzioni, M. and Etzioni, O. (2011). Named Entity Recognition in Tweets: An Experimental Study. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’11), Edinburgh, UK, July 2011.

Sporleder, C. (2010). Natural language processing for cultural heritage domains. Language and Linguistics Compass, 4(9):750–768.

Vilain, M., Su, J. and Lubar, S. (2007). Entity Extraction is a Boring Solved Problem: Or is It? In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, NAACL-Short ’07, pages 181–184. Association for Computational Linguistics.

Full text license: This text is republished here with permission from the original rights holder.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2019

"Complexities"

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019

436 works by 1162 authors indexed

Conference website: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/index.html

References: http://staticweb.hum.uu.nl/dh2019/dh2019.adho.org/programme/book-of-abstracts/index.html

Series: ADHO (14)

Organizers: ADHO

Named Entity Processing for Digital Humanities

1. Maud Ehrmann

2. Matteo Romanello

3. Simon Clematide

ADHO - 2019

"Complexities"