DHmine: an Open Source Cloud-based Framework for DH Research

poster / demo / art installation
Authorship
  1. 1. Tamás Mészáros

    Budapest University of Technology and Economics

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.


A couple of years ago our institutions started a DH project to process and study a large 18th century text corpora (roughly 1,5 million words). During this project we faced many challenges like text digitalization and encoding, creating a large author's dictionary, processing critical annotations, storing available texts and data, providing search, retrieval and display functions, and finally performing various data analysis experiments. We applied and also developed many kinds of software to cope with these challenges.
The result of this multi-year system development is the DHmine toolkit (
https://github.com/
mtwebit
/
dhmine), a collection of software tools for DH projects. During its development we characterized the following requirements: it should support team cooperation, it should be easy to deploy and maintain in a shared environment, and it should be flexible and scalable for various problem complexities and data set sizes. In order to meet these requirements we decided to explore the capabilities of cloud computing that provides convenient, on-demand, network-based access to shared resources and tools. Cloud services can be deployed and reconfigured rapidly to meet the changing needs of their users, and they can be maintained with much less effort than traditional desktop applications.

The DHmine toolkit is a set of cloud-based tools that provides many kinds of services for DH researchers. Its data storage solutions include a cloud-based file store based on the popular Nextcloud software, a noSQL database engine (MongoDB) that can hold unstructured data, a MySQL component to store relational data, and finally, and RDF tripestore based on RDF4J to store relational and factual knowledge. We extended these storage services with autonomous software tools (called agents) that perform various tasks on demand, for example OCR, TEI encoding, document conversions and analysis, entity recognition and others (Mészáros, 2016). There are two statistical tools included in the system: RStudio that is a programmable environment for document and data analysis, and a stylometry tool called Shtylo (
) that provides an easy-to-use Web interface to the popular stylo R package (Eder, 2016). All these components (and a few additional parts like LDAP user authentication, a text indexing tool etc.) are integrated into a unified, cloud-based execution environment.

There are two main interfaces to access the services of the DHmine toolkit. On the one hand there is a Web-based user interface that provides access to the individual tools (Nextcloud, R Studio, Shtylo etc.) and it also includes a programmable environment to develop data import, retrieval and visualization tools that meet the needs of the individual research projects. On the other hand there is a programmers interface (API) for external applications to access system functions like corpus creation and maintenance (document upload, search and retrieval), accessing data in the relational database and in the knowledge-store. This makes easy to integrate the services of the cloud-based system into various desktop applications.
In order to simplify the management of the DHmine toolkit a Docker-based virtual machine environment was developed. This deploys the components into their individual containers and it automatically reconfigures the external Web interface according to the actually running services. Adding or removing a service is as simple as running a command to enable or disable a container. The DHmine toolkit can be installed by cloning its git repository and running a simple text-based tool to select the required services.
The DHmine system was extensively used to create the Digital Mikes-Dictionary (Kiss and Mészáros, 2016) and Corpus, and critical annotations and related knowledge entries retrieved from DBpedia (Mészáros and Kiss, 2018). Since the tools became available other research projects were also started developing their own solutions based on the toolkit. These include processing a large bilingual literary correspondence from the twentieth century, and analyzing of family networks from the 17th and 18th centuries.
During our presentation we can illustrate how these tools work and it would be a great opportunity to help other researchers to take the first steps to install and tailor this open source software for their purposes.
Acknowledgement
The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamental Research Collaborations Grounding Innovation in Informatics and Infocommunications).

Bibliography

Mészáros, T. (2016).
Agent-supported knowledge acquisition for digital humanities research. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Budapest, Hungary, 9–12 October 2016.

Eder, M., Rybicki, J. and Kestemont, M. (2016)
Stylometry with R: A Package for Computational Text Analysis. The R Journal Vol. 8/1, 2016

Kiss, M. and Mészáros, T. (2016)
Creating an extended author's dictionary to support digital literary research. In DH Benelux 2016, Luxembourg, 9-10 June, 2016.

Mészáros, T. and Kiss, M. (2018)
Knowledge Acquisition from Critical Annotations. Information, 2018, 9(7), 179, pp. 1-10.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.