μServices and The Riddle of Literary Quality

paper, specified "long paper"
Authorship
  1. Gertjan Filarski

    Huygens Institute for the History of the Netherlands (Huygens ING) - Royal Netherlands Academy of Arts and Sciences (KNAW)

  2. Hayco de Jong

    Huygens Institute for the History of the Netherlands (Huygens ING) - Royal Netherlands Academy of Arts and Sciences (KNAW)

  3. Karina van Dalen-Oskam

    Huygens Institute for the History of the Netherlands (Huygens ING) - Royal Netherlands Academy of Arts and Sciences (KNAW)

Work text

Introduction

The Riddle of Literary Quality is a project funded by the Computational Humanities Program of the Royal Netherlands Academy of Arts and Sciences (KNAW). It runs at Huygens ING in partnership with the Institute for Logic, Language and Computation of the University of Amsterdam, and the Fryske Akademy in Leeuwarden. The aim of the project is to develop a method and the necessary software to analyze low-level and high-level formal features in a corpus of modern Dutch long fiction, to find out whether formal features in the texts play a role in the reception and evaluation of the text by the readers. Can we get more insight into the responses of readers to, for instance, texts with on average longer versus shorter sentences, or using a larger vocabulary, or on average showing a more complex syntactical structure (cf. Jautze et al.)? Is there a difference between those texts that readers consider to be highly literary and those that are experienced as more lowbrow? Can we distinguish texts found good or bad by readers based on formal features in these texts? And how do the opinions of readers correlate with the kind of reader they are?
The project thus aims to correlate formal features with readers’ opinions and readers’ roles. The formal features are analyzed through a chain of μServices, which we discuss in the second part of this paper. The first part addresses the analysis of readers’ opinions and readers’ roles.
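To make the notion of a low-level formal feature concrete, the following is a minimal sketch of how mean sentence length and vocabulary size could be measured for a single text. It is an illustration only, not the project's implementation: the function names, the naive regex-based sentence splitter and tokenizer, and the choice of features are all assumptions made here.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Naive tokenizer (assumption): lowercase runs of letters/digits,
    # optionally joined by apostrophes.
    return re.findall(r"[^\W_]+(?:'[^\W_]+)*", sentence.lower())


def low_level_features(text):
    """Return a few simple formal features for one text."""
    sentences = split_sentences(text)
    tokens = [tok for sentence in sentences for tok in tokenize(sentence)]
    n_sentences = len(sentences)
    n_tokens = len(tokens)
    vocabulary = set(tokens)
    return {
        "sentences": n_sentences,
        "tokens": n_tokens,
        # Average number of tokens per sentence.
        "mean_sentence_length": n_tokens / n_sentences if n_sentences else 0.0,
        # Number of distinct word forms.
        "vocabulary_size": len(vocabulary),
        # Distinct word forms divided by total tokens.
        "type_token_ratio": len(vocabulary) / n_tokens if n_tokens else 0.0,
    }
```

Note that a raw type-token ratio depends strongly on text length, so comparing novels of different sizes would in practice call for a length-normalized measure.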
Survey

To gather information about readers and their responses we set up a large online survey. We asked respondents for some personal information (age, gender, postal code, level of education) and posed sixteen questions to find out what kind of reader they predominantly are: autonomous or ‘distanced’ (reading for aesthetic pleasure), or heteronomous or ‘identifying’ (reading for fun, to discover other cultures or places, or to identify with the main characters). We based this distinction and our questions on work by Von Heydebrand & Winko on sociological aspects of (literary) reading. In addition, we presented a list of 400 recent novels, Dutch originals or translations into Dutch, and asked respondents to mark the ones they had read. A selection of these novels was then presented to them, with the request to rate each work on two scales: from ‘not so very literary’ to ‘highly literary’ and from ‘bad’ to ‘good’. The survey ran for six months and drew almost 14,000 respondents. Analysis of the results has just started.
The results of the survey will be correlated with the results of the measurements of formal features. We would like to describe the technical set-up we have devised to enable the scholars to analyze the texts in the corpus in a way that is trustworthy and sustainable, using μServices written in Java that can also be used by others to repeat and to verify our analyses.
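As an illustration of what correlating survey results with feature measurements could look like, here is a minimal sketch that computes a Pearson correlation coefficient between one formal feature and one mean reader rating per novel. The paper does not state which statistical method the project uses, so the choice of Pearson's r and the function name are assumptions; because the survey scales are ordinal, a rank-based measure such as Spearman's rho may be a better fit in practice.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered cross-products and squares (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Here `xs` would hold one feature value per novel (for instance mean sentence length) and `ys` the corresponding mean rating on one of the two survey scales.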
μServices

The research infrastructure for The Riddle of Literary Quality is designed with three goals in mind: research results must be reproducible; analytical tools must be reusable; and the entire workflow must be maintainable and reliable. We aim to provide a toolset whose workings can be verified, so that discussion can focus on the selected methodology, that is, the procedures and algorithms. This also means that we will make the code behind each μService open source.
To accomplish these goals the digital humanities engineering group at Huygens ING based the research infrastructure both on the results of COST Action IS0704: An Interoperable Supranational Infrastructure for Digital Editions (Interedition), of which the institute was grant-holder, and on the work of Joris van Zundert, chair of the Action. Van Zundert (Huygens ING) specified the development of lightweight and distributed interoperability solutions as an objective of the COST Action. These solutions were implemented as web services. The CollateX algorithm of Ronald Haentjens Dekker (Huygens ING) and Gregor Middell (University of Würzburg) was among the first and most successful of a series of compact analytical demonstrators called μServices.
The Riddle of Literary Quality does not aim to build a workflow management system. Such a top-down standardization methodology is left to large infrastructural programs like CLARIN, DARIAH or the Dutch Nederlab project. Instead we continue Interedition’s grassroots approach and leave (computational) researchers and PhD students free to experiment with high-level and low-level analytical algorithms in languages that range from Python to Java. These algorithms may or may not become part of the Riddle’s μService infrastructure; those that are deemed useful are eventually hosted on the institute’s servers.
The current services fall into three distinct categories: data import and preparation; analysis; and visualization and export. In the first group we offer, for example, a series of tools that convert documents to specified standards (such as ePub/PDF to TEI) and set the data in the correct character encoding (such as a conversion from Windows-1252 or ISO-8859 to UTF-8). To prepare the data for further analysis we have converted tools such as the Dutch Ucto: Unicode Tokenizer (Radboud University of Nijmegen/University of Tilburg) into μServices. The output of the services in this group is a standardized JSON format that can be read by the analytical services in the second category. Experiments in The Riddle currently focus on this analysis group. μServices in the third category perform output operations. Some create visualizations while others export the data to external environments for further analysis. For stylometric research, for example, we created a μService to export data from The Riddle to R and integrate it with the stylo package created at the Universities of Krakow (Maciej Eder/Jan Rybicki) and Antwerp (Mike Kestemont).
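The preparation step described above can be sketched as follows: bytes in an unknown legacy encoding are decoded to Unicode text, and tokenized output is wrapped in a JSON document for downstream services. This is a hedged illustration, not the project's code. The order in which encodings are tried, the function names, and the JSON field names (`id`, `encoding`, `sentences`) are all hypothetical; the paper does not specify the actual JSON schema.

```python
import json


def normalize(raw, candidates=("utf-8", "cp1252")):
    """Decode bytes to text, trying UTF-8 first, then Windows-1252.

    Falls back to ISO-8859-1 (latin-1), which can decode any byte sequence.
    The order of candidates is an assumption, not a documented rule.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")


def to_json_document(doc_id, sentences):
    """Wrap tokenized sentences in a JSON string (hypothetical schema)."""
    payload = {
        "id": doc_id,            # hypothetical field name
        "encoding": "UTF-8",     # hypothetical field name
        "sentences": sentences,  # list of token lists, one per sentence
    }
    # ensure_ascii=False keeps characters such as 'é' readable in the output;
    # the caller is expected to write the string out as UTF-8.
    return json.dumps(payload, ensure_ascii=False)
```

Trying UTF-8 first is a common heuristic because arbitrary Windows-1252 text containing accented characters rarely happens to be valid UTF-8, whereas the reverse mistake silently produces mojibake.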
The entire suite of μServices will remain available for persistent access and may be used in alternative workflows or by external third-party software. Thus the suite not only allows the results of The Riddle to be reproduced but will also support entirely new and original research.
Sample Workflow

As an example of a µServices-driven workflow we present one possible use: gathering statistical data from a corpus of ePubs. First, each ePub is sent to a service that prepares it for analysis by converting the book into a structured TEI document. Character-encoding issues are resolved by a second µService, resulting in a normalized, platform-independent UTF-8 version of the TEI document. Subsequently, a third service offers extraction operations on the structural level of the file. This service is used to extract all relevant paragraphs. These paragraphs are split into sentences and words by one of a family of (TEI-agnostic) tokenizers, such as the Ucto µService. Statistical analysis of these tokens is possible by sending the resulting list of tokens to the exporter µService, which transforms the extracted tokens into a format suitable for use in R.
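The later stages of this workflow can be sketched as a chain of small functions, each standing in for one service. This is an illustration under stated assumptions, not the actual services: the ePub-to-TEI conversion and encoding normalization are assumed to have happened already, a naive regex tokenizer replaces Ucto, CSV is chosen here as the R-readable format, and all function names and column names are hypothetical. In the real set-up each step would be a separate web service rather than a local function call.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"  # the TEI P5 XML namespace


def extract_paragraphs(tei_xml):
    """Return the text of every <p> inside the TEI <body>, whitespace-normalized."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{{{TEI}}}body")
    if body is None:
        return []
    paragraphs = []
    for p in body.iter(f"{{{TEI}}}p"):
        # itertext() also collects text inside inline elements such as <hi>.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


def split_sentences(text):
    # Naive splitter (assumption), standing in for a real tokenizer service.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Naive lowercase word tokenizer (assumption).
    return re.findall(r"[^\W_]+(?:'[^\W_]+)*", sentence.lower())


def tokens_to_csv(rows):
    """Serialize (paragraph, sentence, token) rows as CSV that R's read.csv can load."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["paragraph", "sentence", "token"])
    writer.writerows(rows)
    return buffer.getvalue()


def run_workflow(tei_xml):
    """Chain the steps: extract paragraphs, split, tokenize, export."""
    rows = []
    for p_idx, paragraph in enumerate(extract_paragraphs(tei_xml), start=1):
        for s_idx, sentence in enumerate(split_sentences(paragraph), start=1):
            for token in tokenize(sentence):
                rows.append((p_idx, s_idx, token))
    return tokens_to_csv(rows)
```

Restricting extraction to the `<body>` element keeps paragraphs from the TEI header (for example a publication statement) out of the token counts.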
Conclusion

To make sure that we are able to answer the main questions of The Riddle of Literary Quality – whether there are any correlations between readers' opinions about certain novels, readers' predominant reading role, and the values for a list of formal low-level and high-level features of the novels – we have chosen to develop a set of µServices that deal with single aspects of the needed analysis. By making these µServices available to other scholars we enable them to repeat and verify our research results. We provide users with tools that can be used to answer different questions than we have in The Riddle, thereby making the tools also useful in a wider sense for new original research. We hope our approach invites others to contribute µServices for further textual humanities research.
References

CollateX. collatex.net
Interedition. www.interedition.eu
Riddle of Literary Quality. literaryquality.huygens.knaw.nl
Ucto. ilk.uvt.nl/ucto
Eder, M., Kestemont, M. and Rybicki, J. (2013). Stylometry with R: a suite of tools. Digital Humanities 2013: Conference Abstracts. Lincoln: University of Nebraska-Lincoln, pp. 487-89. dh2013.unl.edu/abstracts/ab-136.html
Heydebrand, R. von and Winko, S. (1996). Einfuehrung in die Wertung von Literatur. Systematik – Geschichte – Legitimation. Paderborn etc.: Ferdinand Schoeningh.
Jautze, K., Koolen, C., Cranenburgh, A. van and Jong, H. de (2013). From high heels to weed attics: a syntactic investigation of chick lit and literature. Proceedings of the Workshop on Computational Linguistics for Literature 2013. aclweb.org/anthology/W/W13/W13-1410.pdf
Zundert, J. van, Middell, G., Hulle, D. Van, Haentjens Dekker, R., et al. (2011). Interedition: Principles, Practice and Products of an Open Collaborative Development Model for Digital Scholarly Editions. Digital Humanities 2011: Conference Abstracts. Stanford: Stanford University. dh2011abstracts.stanford.edu/xtf/view?docId=tei/ab-227.xml

Conference Info

ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO