CLARIN: Resources, Tools, and Services for Digital Humanities Research

paper, specified "long paper"
Authorship
  1. Erhard Hinrichs

    Universität Tübingen (University of Tubingen / Tuebingen)

  2. Steven Krauwer

    Utrecht University

Work text

1. Introduction

CLARIN is the short name for the Common Language Resources and Technology Infrastructure. It aims at providing easy and sustainable access for scholars in the Humanities and Social Sciences (HSS) to digital language data and advanced tools to discover, explore, exploit, annotate, analyse or combine them, independent of where they are located. CLARIN is one of the research infrastructures that were selected for the European Research Infrastructures Roadmap by ESFRI, the European Strategy Forum on Research Infrastructures. The CLARIN Governance and Coordination body at the European level is CLARIN ERIC. An ERIC is a new type of international legal entity, established by the European Commission in 2009. Its members are governments or intergovernmental organisations.
CLARIN is in the process of building a networked federation of European data repositories, service centres and centres of expertise, with single sign-on access for all members of the academic community in all participating countries. Tools and data from different centres will be interoperable, so that data collections can be combined and tools from different sources can be chained to perform complex operations to support researchers in their work. The CLARIN infrastructure is still under construction, but a number of participating centres are already offering access services to data, tools and expertise. The purpose of the present paper is to give an overview of language resources, tools, and services that CLARIN presently offers.
2. Reference Data Sets

The federation of CLARIN centers offers a large number of reference data sets that are well known and widely used in the scientific community. The CLARIN Center in Vienna offers the Austrian Academy Corpus, a very large collection of German texts and German literature covering the period from 1848 to 1989. The German reference corpus DeReKo, the largest linguistically motivated collection of contemporary German texts with more than 4 billion word tokens, is hosted by the CLARIN Center in Mannheim. The CLARIN Center in Berlin provides access to the German Text Archive, a digital collection of German-language printed works from around 1650 to 1900, available as full text and as digital facsimile. The CLARIN Center in Sofia offers the Bulgarian Reference Corpus. The CLARIN Center in Warsaw hosts the National Corpus of Polish, a reference corpus with more than 1.5 billion words.
CLARIN centers offer extensive collections of spoken language. The CLARIN Center in Amsterdam is home to thousands of hours of audio material for Dutch, including more than 1000 hours of dialect recordings. The CLARIN Center in Munich specializes in digital corpora for contemporary German. The CLARIN Center in Sofia offers the Bulgarian Political and Journalistic Speech corpus.
CLARIN language resources are not restricted to the languages spoken in CLARIN member countries. The CLARIN Center in Nijmegen offers easy access to the DOBES Archive, which documents endangered languages around the world.
High-quality lexica are another key language resource. The CLARIN Center in Tartu provides on-line access to a variety of lexica for Estonian (http://www.keeleveeb.ee/). The CLARIN Center in Berlin is home to the Digitale Wörterbuch der deutschen Sprache (DWDS). The DWDS lexicon uses extensive digital corpus collections to document the actual usage of German words and offers on-line access to all materials at www.dwds.de/. Apart from traditional lexica, CLARIN also offers access to lexical resources that model word meanings in terms of a network of lexical and conceptual relations. The CLARIN center federation currently hosts such word nets for Czech, Danish, Dutch, Estonian, Finnish, German, and Norwegian.
In addition to reference data sets, CLARIN provides access to an extensive set of metadata records. The Virtual Language Observatory (www.clarin.eu/vlo) currently contains more than 500,000 metadata records for language resources and tools. Faceted search and a visual map provide easy-to-use interfaces for HSS scholars to locate language resources and tools that match the needs of a particular research project.
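The idea behind faceted search over metadata records can be illustrated with a minimal sketch. The records and field names below are hypothetical and do not reflect the actual schema or interface of the Virtual Language Observatory; the sketch only shows how selecting facet values narrows a result set and how the remaining facet counts are computed.

```python
from collections import Counter

# Hypothetical metadata records; field names are illustrative only
# and are NOT the schema used by the Virtual Language Observatory.
RECORDS = [
    {"name": "Corpus A", "language": "German", "type": "corpus"},
    {"name": "Lexicon B", "language": "Estonian", "type": "lexicon"},
    {"name": "Corpus C", "language": "German", "type": "corpus"},
    {"name": "Tagger D", "language": "Dutch", "type": "tool"},
]


def filter_records(records, **selected):
    """Keep only records whose fields match every selected facet value."""
    return [
        r for r in records
        if all(r.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(r[facet] for r in records if facet in r)


german = filter_records(RECORDS, language="German")
print([r["name"] for r in german])   # names of the German-language records
print(facet_counts(german, "type"))  # remaining resource types and their counts
```

Each additional facet selection is simply another keyword argument, so a scholar's successive clicks correspond to successively narrower calls to `filter_records`.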
2.2. Creation of New Resources

For new digital data sets, special care must be taken that data creation efforts adhere to best practices or standards for text encoding whenever possible and follow a data management plan. HSS scholars often lack the necessary experience or access to data repositories to meet these expectations. The CLARIN-D User Guide [1] provides practical information on the use of standards for language resources and on following good practices in data creation.
3. Data Mining and Data Analysis

3.1. Query Tools and Federated Content Search

Since data sets available in electronic form are typically very large, CLARIN centers support HSS scholars by providing powerful and easy-to-use query tools for many of the resources described above. Access is greatly facilitated if such query tools are realized as web applications and thus available in any web browser. Two good examples of this kind are the web application for querying the German Text Archive and the MIMORE (http://www.meertens.knaw.nl/mimore/search/tool) tool, which enables researchers to investigate morphosyntactic variation in the Dutch dialects by searching three related databases with a common on-line search engine. The search results can be visualized on geographic maps and exported for statistical analysis.
In addition to query interfaces for individual resources, CLARIN offers a Federated Content Search (FCS) functionality that enables HSS scholars to construct a virtual corpus collection from resources hosted by different CLARIN centers and to query this virtual corpus via a common search interface. Currently, nine CLARIN centers in Germany and in the Netherlands make more than 20 resources available to linguistic researchers via the common interface of the CLARIN-D Federated Content Search (weblicht.sfs.uni-tuebingen.de/Aggregator), and this number is growing. The CLARIN Center at the University of Oslo also provides FCS functionality via the GLOSSA corpus query tool [5].
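The principle of a federated search, in which one query is sent to several selected centers and the hits are merged into a single result list, can be sketched as follows. This is only an illustration under stated assumptions: the center names and sentences are invented, the "endpoints" are in-memory lists rather than remote services, and the code is not the implementation of the CLARIN-D aggregator.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for remote corpus endpoints (hypothetical names and data).
CORPORA = {
    "center_a": ["Das Haus ist alt", "Ein neues Haus"],
    "center_b": ["Het huis is oud"],
    "center_c": ["Haus und Hof"],
}


def search_endpoint(center, query):
    """Return (center, sentence) pairs whose text contains the query word."""
    q = query.lower()
    return [(center, s) for s in CORPORA[center] if q in s.lower().split()]


def federated_search(query, centers):
    """Send the same query to every selected center and merge the hits."""
    with ThreadPoolExecutor() as pool:
        # map() preserves the order of `centers`, so results are grouped
        # by center in the order the user selected them.
        per_center = pool.map(lambda c: search_endpoint(c, query), centers)
        return [hit for hits in per_center for hit in hits]


# The "virtual corpus" is just the user's selection of centers.
for center, sentence in federated_search("haus", ["center_a", "center_c"]):
    print(center, "|", sentence)
```

The queries run concurrently because, in a real federation, each endpoint is a separate server with its own response time.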
3.2. Workflows for Data Annotations

Language data that are annotated with linguistic information can be searched with high accuracy for specific data patterns. The CLARIN Centers in Oslo, Prague, Tübingen and at the Dutch Language Union offer linguistically annotated corpora, so-called treebanks, for Czech, Dutch, German, and Norwegian with accompanying query tools.
If a collection of language resources does not contain sufficient linguistic information, certain analyses are out of reach: for example, if the word forms in a corpus have not been lemmatized, it is impossible to obtain meaningful word frequency distributions. Likewise, if an HSS scholar wants to search for all person names in a very large newspaper corpus in order to obtain an overview of who is currently in the news, then the person names in such a corpus need to be marked up. CLARIN offers support for HSS scholars who need to add annotations of this kind. The web application WebLicht [2], hosted by the CLARIN Center in Tübingen, is a tool suite for the automatic annotation of text corpora. Linguistic tools such as tokenizers and part-of-speech taggers can be combined into custom processing chains. The resulting annotations can then be visualized in an appropriate way, such as in a table or tree format. Recently, the WebLicht tool suite has been extended to spoken language through the integration of the WebMaus tool provided by the CLARIN Center in Munich. WebMaus takes an audio file and its transcription as input and automatically aligns the speech signal with the transcription. WebLicht can then further annotate the transcription so that, via the automatic alignment, a user can find the portions of the speech signal that correspond to particular data patterns.
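Two points from this section can be made concrete with a small sketch: tools are chained so that each one adds an annotation layer to the output of the previous one, and lemmatization changes the word frequency picture. The sketch assumes a toy lemma lexicon and two hypothetical tool functions; it does not use the WebLicht API or its data format.

```python
from collections import Counter
from functools import reduce

# Toy lemma lexicon (an assumption for illustration); a real chain
# would call a lemmatizer service instead.
LEMMAS = {"houses": "house", "ran": "run", "runs": "run", "running": "run"}


def tokenize(doc):
    """Add a 'tokens' layer: lowercase words with punctuation stripped."""
    words = [w.strip(".,;:!?").lower() for w in doc["text"].split()]
    return {**doc, "tokens": [w for w in words if w]}


def lemmatize(doc):
    """Add a 'lemmas' layer: one lemma per token, defaulting to the token."""
    return {**doc, "lemmas": [LEMMAS.get(t, t) for t in doc["tokens"]]}


def run_chain(text, tools):
    """Apply each tool to the output of the previous one."""
    return reduce(lambda doc, tool: tool(doc), tools, {"text": text})


doc = run_chain("She runs. He ran. They are running.", [tokenize, lemmatize])
print(Counter(doc["tokens"])["run"])  # 0: the form "run" never occurs
print(Counter(doc["lemmas"])["run"])  # 3: all three forms share one lemma
```

Without the lemma layer, the three inflected forms are counted as three unrelated words, which is why unlemmatized corpora yield misleading frequency distributions.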
4. Data Visualization

Visualization tools that render the results of data analysis in an easy-to-grasp fashion are particularly important if the data sets involved are very large. While CLARIN cannot provide a comprehensive suite of eHumanities visualization tools, it can already support HSS scholars with a number of helpful applications [3]. The CLARIN Center at the University of Copenhagen has developed a visualization tool for parallel inspection of word nets. CiNaViz [6] is a web application that offers geo-visualizations for tracking city names with particular linguistic features.
5. Data Sharing and Data Archiving

CLARIN also provides support for the sharing, publishing and archiving of data sets. SimpleStore and OwnCloud solutions are available for collaborative work on the same data set. Many CLARIN data repositories offer archiving services for external resources and for finished data sets. For quality assurance, all CLARIN Centers are assessed by the CLARIN Assessment Committee according to strictly defined technical requirements (see: http://hdl.handle.net/1839/00-DOCS.CLARIN.EU-78) and have to obtain the Data Seal of Approval [7] for their services.
6. Conclusion

Interoperability of language resources and tools in the federation of CLARIN Centers is ensured by adherence to TEI and ISO standards for text encoding, by the use of persistent identifiers as long-lasting references to digital language data, and by the observance of common protocols: Shibboleth for user authentication and authorization, SRU/CQL for Federated Content Search, and OAI-PMH for metadata harvesting.
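To give an impression of what two of these protocols look like on the wire, the following sketch builds an SRU searchRetrieve request carrying a CQL query and an OAI-PMH ListRecords request. The parameter names are those defined by the SRU 1.2 and OAI-PMH 2.0 specifications; the base URLs are placeholders (an assumption for illustration), not actual CLARIN endpoints, and no request is sent.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def sru_search_url(base_url, cql_query, max_records=10):
    """Build an SRU 1.2 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "maximumRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"


def oai_list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL for one metadata format."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoints; real centers publish their own base URLs.
sru = sru_search_url("https://example.org/sru", '"Haus"', max_records=5)
oai = oai_list_records_url("https://example.org/oai")
print(sru)
print(oai)
print(parse_qs(urlparse(sru).query)["query"])  # the decoded CQL query
```

Because every center exposes the same request parameters, a single client can search or harvest from all of them by changing only the base URL.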
Here we could describe only a subset of all CLARIN resources and tools. For comprehensive and up-to-date information we refer interested readers to the CLARIN homepage: www.clarin.eu
References

[1] Herold, A. and L. Lemnitzer, eds. (2012). CLARIN-D User Guide. Available at: de.clarin.eu/en/language-resources/userguide.html.
[2] Hinrichs, E., M. Hinrichs, and T. Zastrow (2010). WebLicht: Web-Based LRT Services for German. In: Proceedings of the Systems Demonstrations at the 48th Annual Meeting of the Association for Computational Linguistics (ACL-2010). Uppsala, Sweden. pp. 25-29.
[3] Zastrow, T., E. Hinrichs, M. Hinrichs, and K. Beck (2013). Scientific Visualization for the Digital Humanities as CLARIN-D Web Applications. In: Proceedings of Digital Humanities 2013, University of Nebraska.
The ESFRI Roadmap contains five research infrastructures in the area of Social Sciences (CESSDA, European Social Survey, and SHARE) and Humanities (CLARIN and DARIAH).
en.wikipedia.org/wiki/German_Reference_Corpus.
nkjp.pl/index.php?page=0&lang=1
www.mpi.nl/dobes
[5] github.com/textlab
[6] weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/CiNaViz_-_Visualization_of_European_City_Names
[7] datasealofapproval.org/en

Conference Info

ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO