Future Development of a System for Annotation and Linkage of Sources in Arts and Humanities

paper, specified "short paper"
Authorship
  1. Ivan Subotic

    Universität Basel (University of Basel)

  2. André Kilchenmann

    Universität Basel (University of Basel)

  3. Tobias Schweizer

    Universität Basel (University of Basel)

  4. Lukas Rosenthaler

    Universität Basel (University of Basel)

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

1 Introduction
Since the late 1990s, a large number of digitization projects relevant to research in the humanities have been carried out. The quality of the digital objects produced by these digitization campaigns most often meets the demands of a "digital facsimile". With the exception of text files such as PDF or plain text, where a full-text search may be appropriate, digital objects are hardly searchable directly, so associated metadata are needed to enable navigation within a collection of digital objects. One would expect this simplified accessibility and availability of digitized sources to have fundamentally changed research in the humanities by allowing more efficient and broader research methods. However, it seems that this is not yet the case. The reason is that there are very few digital tools available to support the qualitative and comparative methods required in source-based research in the humanities.

In the following, we look at three use cases which exemplify our vision for a research environment in the digital humanities.

Digital Humanist. Susan, a digital humanist, is working with digitized manuscripts. For her research, she needs to transcribe, annotate, and link her annotations, regions of interest, and transcriptions with each other. By employing SALSAH as her work environment, Susan can work on the digitized manuscripts in a fully digital workflow.

Long-term Accessibility of Digital Research Data. Jim is finishing up a five-year project, and needs to deposit his research results and the data accumulated during his research. The results and the digital data need to remain accessible in the long term, even after the funding has long since ended. Jim can export his digital research data to an institution deploying SALSAH, which will take care of their long-term accessibility.

Linked Open Data Workbench. Karen's research is based on materials which are provided in different repositories around the internet. She wants to be able to combine, annotate and create links between those digital objects. She would also like to share her results and allow other researchers to use them. By using SALSAH, Karen can connect to external resources shared over custom APIs or SPARQL endpoints, and work with the data as if it were stored locally. Using the SALSAH API and the provided SPARQL endpoint, other researchers can build upon her work.

SALSAH (System for Annotation and Linkage of Sources in Arts and Humanities) version 2.0 is currently under development at the Digital Humanities Lab (DHLab) of the University of Basel, and represents a browser-based VRE (Virtual Research Environment) that will respond to the requirements described in the three scenarios above.

The main contribution of this paper lies in the description of the novel approaches taken in the design of SALSAH 2.0, leading to new features and possibilities.

The remainder of this paper is organized as follows. In Section 2, we introduce newly developed features and Section 3 concludes.

2 SALSAH
SALSAH integrates digital (re)sources, metadata, research data, and relevant working tools. Using SALSAH, researchers are able to: (1) simultaneously visualize multiple digital objects (e.g., facsimiles, images, texts, transcripts, sound and video), (2) annotate digital objects and share these annotations with others, (3) establish relations (links) between digital objects and annotate these relations, (4) access and integrate external data sources (e.g., digital libraries) so that the VRE tools may be applied to these sources without the need for local duplicates, and (5) transcribe manuscripts, speech and video.

2.1 Software Architecture
The software is based on a multi-tier architecture in which application logic is distributed between (1) a client application ("front end") which users interact with, (2) a more or less centralized server ("back end"), and (3) local and/or external data providers which provide the sources that users can work on. The SALSAH software architecture is depicted in Figure 1.

While SALSAH has the capability to function as a repository for digital sources, this is not its primary goal. There are many repositories of professionally digitized sources, and it makes no sense to duplicate their content in yet another repository. Following a logical separation of annotation tools and digital representations, SALSAH provides the basis for referencing sources without having to store them itself. Furthermore, SALSAH can provide annotation, linkage information, and metadata to an external data provider via the SALSAH API (as long as the external request has access rights), as well as over a read-only SPARQL endpoint that provides LOD (Linked Open Data). We expect SALSAH in the long term to evolve into a true distributed P2P system.

Fig. 1: Software Architecture of SALSAH

2.2 Data-Model
The data model is based on the Resource Description Framework (RDF), the Resource Description Framework Schema (RDFS), and the OWL 2 Web Ontology Language, all proposed by the World Wide Web Consortium (W3C) for implementing the Semantic Web. This metadata model makes it possible to describe digital objects in a very flexible way, and to create links and relations between any objects (which are called "subjects" in RDF terminology). It is based on statements in the form of subject-predicate-object expressions about these digital subjects. Any number of such expressions can be used to describe subjects and their relations.
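The subject-predicate-object form can be illustrated with a minimal sketch in plain Python. It is not the SALSAH implementation: the identifiers (`ex:book1`, `ex:partOf`, and so on) and the `match` helper are hypothetical and introduced only for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers, invented for this illustration only.
GRAPH = [
    Triple("ex:book1", "dc:title", "Das Narrenschiff"),
    Triple("ex:book1", "dc:creator", "ex:brant"),
    Triple("ex:brant", "ex:name", "Sebastian Brant"),
    Triple("ex:page1", "ex:partOf", "ex:book1"),
]


def match(graph, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Everything stated about the book, and everything that links to it.
about_book = match(GRAPH, subject="ex:book1")
links_to_book = match(GRAPH, obj="ex:book1")
```

Any number of further statements can be appended to the list without changing a schema, which is the flexibility the data model relies on.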

A given set of predicates is called a vocabulary, and can be used to implement standard metadata schemes such as Dublin Core. Within SALSAH, different vocabularies may be used at the same time to describe a given subject. Since the value of an RDF expression may itself be a subject, RDF allows for a network-like representation of knowledge about a subject and its relations to other subjects. This metadata model is subject-centric, in the sense that for each digital subject, an individual set of predicates may be assigned, in contrast to the relational data model, which is much more restrictive in its ability to assign data fields to subjects. Hence, the data model used in SALSAH is especially well suited to the humanities, in which a flexible, qualitative coverage of metadata is essential. Figure 2 (a) depicts an excerpt from the SALSAH ontology, showing how a project's own metadata schema can be incorporated into SALSAH, and (b) a small part of the graph depicting an incunabulum of Sebastian Brant with the title "Das Narrenschiff".
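As a rough sketch of the subject-centric view, the following Python groups statements by subject, so that each subject carries its own set of predicates drawn from more than one vocabulary, and then follows values that are themselves subjects. The prefixes `dc:` (standing in for Dublin Core) and `proj:` (a hypothetical project vocabulary), as well as all identifiers and helper names, are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical statements mixing two vocabularies.
TRIPLES = [
    ("ex:book1", "dc:title", "Das Narrenschiff"),
    ("ex:book1", "dc:creator", "ex:brant"),
    ("ex:book1", "proj:hasPage", "ex:page1"),
    ("ex:page1", "proj:pageNumber", "1"),
    ("ex:brant", "proj:name", "Sebastian Brant"),
]


def describe(triples):
    """Group predicate/value pairs by subject."""
    description = defaultdict(dict)
    for s, p, o in triples:
        description[s].setdefault(p, []).append(o)
    return description


def reachable(description, start):
    """Collect every subject reachable from start by following
    values that are themselves subjects."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen or node not in description:
            continue  # already visited, or a plain literal value
        seen.add(node)
        for values in description[node].values():
            stack.extend(values)
    return seen


by_subject = describe(TRIPLES)
network = reachable(by_subject, "ex:book1")
```

Here the book and the page carry entirely different predicates, which a fixed relational table could only express with many empty columns or additional tables.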

Fig. 2: An excerpt from (a) the SALSAH Ontology and (b) the incunabulum of Sebastian Brant.

The data store consists of a native triple-store solution such as Jena, which serves the data over a SPARQL endpoint.

2.3 Versioning
SALSAH is a dynamic system in which data can be changed at any time by users having the necessary access rights. In order to use SALSAH as a citable repository, methods will be implemented to "freeze" a subset of the data and thus provide versioning. In order to solve this non-trivial problem, SALSAH will use the concept of temporal RDF, in which each element in the RDF graph of a certain granularity will be enriched with temporal information regarding its validity. For example, if the title of a book is changed, the old version is not overwritten, but is instead marked as valid up to the time when the change occurred, while the new title is marked as valid from then on. This allows users to retrieve the state of the RDF graph at any point in time.
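The book-title example can be sketched in a few lines of Python. This is a simplified illustration of validity intervals on statements, not the temporal RDF formalism itself nor the planned SALSAH implementation; the class names, the integer timestamps, and the identifiers are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemporalTriple:
    subject: str
    predicate: str
    obj: str
    valid_from: int
    valid_to: Optional[int] = None  # None means "still valid"


class TemporalGraph:
    def __init__(self):
        self.triples = []

    def set_value(self, subject, predicate, obj, at):
        """Close the currently valid statement (if any) and add the new one."""
        for t in self.triples:
            if (t.subject == subject and t.predicate == predicate
                    and t.valid_to is None):
                t.valid_to = at  # the old value is kept, not overwritten
        self.triples.append(TemporalTriple(subject, predicate, obj, valid_from=at))

    def snapshot(self, at):
        """Return the (subject, predicate, object) statements valid at time `at`."""
        return [
            (t.subject, t.predicate, t.obj)
            for t in self.triples
            if t.valid_from <= at and (t.valid_to is None or at < t.valid_to)
        ]


# Hypothetical data: the title is entered at time 10 and changed at time 20.
graph = TemporalGraph()
graph.set_value("ex:book1", "dc:title", "Narrenschiff", at=10)
graph.set_value("ex:book1", "dc:title", "Das Narrenschiff", at=20)
```

A citation made at time 15 can thus be resolved against `graph.snapshot(15)`, which still yields the earlier title even though the data has since changed.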

Versioning will lead to the concept of a new form of electronic publication. While e-papers and e-journals basically mimic the behavior of their paper equivalents, an annotated network of citable sources and links represents a novel form of publication. The reader will be able to navigate through the network and extract his or her own perspectives on the knowledge represented by the interconnected digital objects. This may be the first attempt, within academic publishing in the humanities, to go beyond the phenomenon in which "new media first mimic older media", as noted by Marshall McLuhan.

2.4 Digital Long-Term Preservation
DISTARNET (DISTributed ARchival NETwork) is a distributed, autonomous long-term digital preservation system. Essentially, DISTARNET exploits dedicated processes to ensure the integrity and consistency of data with a given replication degree. At the data level, DISTARNET supports complex data objects and the management of collections, annotations, and arbitrary links between digital objects. At the process level, dynamic replication management, consistency checking, and automated recovery of archived digital objects are provided, using autonomic behavior governed by preservation policies without any centralized component.
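The interplay of consistency checking, recovery, and a target replication degree can be sketched as follows. This is only an illustration of the general idea using checksums; it is not DISTARNET code. In particular, a single function stands in here for behavior that DISTARNET distributes across nodes without a centralized component, and the node names, the `check_and_repair` helper, and the in-memory "stores" are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def check_and_repair(nodes, object_id, expected, replication_degree):
    """nodes maps a node name to a dict of object_id -> bytes.
    Returns the names of the nodes holding an intact copy afterwards."""
    # Consistency check: which replicas still match the expected checksum?
    valid = [
        name for name, store in nodes.items()
        if object_id in store and checksum(store[object_id]) == expected
    ]
    if not valid:
        raise RuntimeError("no intact replica left")
    good_copy = nodes[valid[0]][object_id]

    # Recovery: overwrite corrupted replicas with an intact copy.
    for name, store in nodes.items():
        if object_id in store and name not in valid:
            store[object_id] = good_copy
            valid.append(name)

    # Replication: add copies on further nodes until the degree is reached.
    for name, store in nodes.items():
        if len(valid) >= replication_degree:
            break
        if object_id not in store:
            store[object_id] = good_copy
            valid.append(name)
    return valid


# Hypothetical network: one intact copy, one corrupted copy, one empty node.
nodes = {"a": {"obj": b"scan"}, "b": {"obj": b"sc@n"}, "c": {}}
holders = check_and_repair(nodes, "obj", checksum(b"scan"), replication_degree=3)
```

In this sketch the replication degree plays the role of a preservation policy: the function keeps repairing and copying until the requested number of intact replicas exists or no further nodes are available.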

DISTARNET will be implemented as a layer underneath the SALSAH local repository, and provide long-term preservation of the digital objects and associated metadata.

3 Conclusion
While the change from the analog to the digital domain makes sources available on the desktops of scholars and researchers, a real paradigm shift in source-based research requires new tools. Virtual Research Environments such as SALSAH may provide the necessary tools to gain a novel, computer-aided knowledge representation that is well-suited to the needs of humanities research. These tools will undoubtedly change the way research is done in the humanities. They will help researchers organize and retrieve knowledge more efficiently, and may disclose hidden relationships between sources, among other things, but they will not replace the researchers' ingenuity and intuition. SALSAH is in use by several research projects within the University of Basel, and has sparked interest on an international scale.

References
Rosenthaler, Lukas (2012). Virtual Research Environments. A New Approach for Dealing with Digitized Sources in Research in Arts and Humanities, in: Claire Clivaz et al. (editors): Reading Tomorrow. From Ancient Manuscripts to the Digital Era, Lausanne 2012, pp. 661-670, Ebook on http://www.ppur.info/lire-demain.html

Rosenthaler, Lukas and Schweizer, Tobias (2012). SALSAH - eine webbasierte Forschungsplattform für die Geisteswissenschaften, in: Bulletin der Schweizerischen Akademie der Geistes- und Sozialwissenschaften, Bern

Rosenthaler, Lukas (2011). Entwicklung einer Web 2.0-Applikation zur Präsentation und Erforschung der Basler Frühdrucke, in: Karin Krause und Barbara Schellewald (editors), Bild und Text im Mittelalter, Böhlau Verlag Köln

Schweizer, Tobias and Rosenthaler, Lukas (2011). SALSAH - eine virtuelle Forschungsumgebung für die Geisteswissenschaften, in: Dr. Andreas Bienert, Dr. Frank Weckend, Dr. James Hemsley, Prof. Vito Cappellini (editors), EVA 2011 Konferenzband, pp. 147-153, GFai Berlin

F. Manola and E. Miller, "RDF Primer," tech. rep.

D. Brickley and R. V. Guha, "RDF Vocabulary Description Language 1.0: RDF Schema," tech. rep.

W3C OWL Working Group, "OWL 2 Web Ontology Language," tech. rep.

Dublin Core Metadata Initiative. http://dublincore.org.

JENA. http://jena.apache.org.

C. Ogbuji, "SPARQL 1.1 Graph Store HTTP Protocol," W3C working draft, W3C, May 2011. http://www.w3.org/TR/2011/WD-sparql11-http-rdf-update-20110512/.

C. Gutierrez, C. Hurtado, and A. Vaisman (2005), "Temporal RDF," in European Semantic Web Conference, The Semantic Web: Research and Applications, vol. 3532/2005, pp. 93-107.

J. Tappolet and A. Bernstein (2009), "Applied temporal RDF: efficient temporal querying of RDF data with SPARQL," The Semantic Web: Research and Applications, no. June.

McLuhan, E. and Zingrone, F. (1995) (eds) Essential McLuhan. New York: BasicBooks

I. Subotic. A Distributed Archival Network for Process-Oriented Autonomic


Conference Info

Complete

ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

377 works by 898 authors indexed

XML available from https://github.com/elliewix/DHAnalysis (needs to replace plaintext)

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO