The Long Road Home: conversion and transformation of the Text Creation Partnership corpus

poster / demo / art installation
Authorship
  1. James Cummings

    Oxford University

  2. Sebastian Rahtz

    Oxford University

Work text

This poster addresses some of the practical problems of working with the underlying digital files prepared by the Early English Books Online-Text Creation Partnership (EEBO-TCP) using generalized tools developed for use with files following the recommendations of the Text Encoding Initiative (TEI). [1] This conversion work was initially driven by a desire to experiment with the creation of truly usable ePub editions of the Eighteenth Century Collections Online - Text Creation Partnership (ECCO-TCP) corpus released into the public domain in 2010. It was undertaken at the University of Oxford's IT Services, independently of the EEBO-TCP teams at the Bodleian Library and in Michigan.

The Text Creation Partnership (TCP) digitization programme is a large and complex operation, with very detailed guidance and standards (http://www.textcreationpartnership.org/docs/) worked out over the last decade. When the project started, the decision was made to use SGML markup as the archival storage format, and a variation on the TEI Guidelines, version P3, for the encoding vocabulary. Unfortunately, SGML-aware software is increasingly hard to come by, and the advantages of doing this kind of work in XML these days are self-evident. The TEI has also developed significantly since TEI P3, with the current TEI P5 releases making many changes and improvements (some of them owing to proposals arising from work rationalizing the TCP markup). Interchange or comparison with other TEI texts, or use of TEI-aware software, suggests that we should have a way to transform the TCP texts into valid TEI P5 XML. This does not mean that texts are inherently interoperable once they are in TEI, at least not without some effort, but achieving interoperability should be a vastly simpler task with a converted version of the EEBO-TCP corpus, because its texts have all been created by a single project following a single set of encoding guidelines. These TCP guidelines have been developed over the course of the project and, while not always perfect, have gradually become more consistent.

Transforming these EEBO-TCP texts into a basic, conventional web site alongside the facsimile page images is generally a straightforward task. However, taking advantage of some of the tools now commonly used to process digital files, particularly those based on the current TEI P5 recommendations, is much more problematic. At the very least this involves transforming the SGML markup to XML, and then to the latest edition of the TEI (TEI P5). This poster will document these stages of conversion, with examples of some of the problems encountered. The work has sometimes necessitated changes to the TEI Guidelines themselves, in order to consistently encode textual phenomena that have been identified by the TCP project but cannot adequately be described using the current TEI recommendations. In other cases, decisions have had to be made about the appropriate way to map some of the encoding variants adopted by EEBO-TCP back onto the existing TEI P5 markup guidelines.
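As a rough illustration of the second stage described above (legacy markup to TEI P5), the following Python sketch renames elements in a small XML fragment and moves them into the TEI P5 namespace. It assumes the SGML-to-XML step has already been done by other means, and the mapping table and sample fragment are hypothetical, chosen only for illustration; they are not the rules or software used in the actual conversion.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# HYPOTHETICAL mapping from legacy element names to TEI P5 names.
# These pairs are illustrative assumptions, not the project's real rules.
RENAME = {"DIV1": "div", "P": "p", "HI": "hi", "TEXT": "text", "BODY": "body"}


def to_p5(elem):
    """Recursively rename elements and place them in the TEI namespace."""
    local = RENAME.get(elem.tag, elem.tag.lower())
    elem.tag = f"{{{TEI_NS}}}{local}"
    # Attribute names are lowercased as well (an assumption for this sketch).
    attrs = {name.lower(): value for name, value in elem.attrib.items()}
    elem.attrib.clear()
    elem.attrib.update(attrs)
    for child in elem:
        to_p5(child)
    return elem


# Invented sample fragment, already well-formed XML.
legacy = (
    '<TEXT><BODY><DIV1 TYPE="chapter">'
    "<P>Some <HI>emphasised</HI> words.</P>"
    "</DIV1></BODY></TEXT>"
)

root = to_p5(ET.fromstring(legacy))
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
converted = ET.tostring(root, encoding="unicode")
print(converted)
```

A simple renaming table like this only covers one-to-one cases; the encoding variants mentioned above, where no direct P5 equivalent exists, would need case-by-case rules rather than a lookup.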

As well as the process, this poster presents some of the software that we have developed for converting ECCO-TCP and EEBO-TCP files. The exercise of transformation gives an interesting opportunity to examine the nature of the encoding of TCP texts, analyze the range of textual phenomena recorded in the corpus, and predict which structures will be amenable to discovery by future scholars. The TCP corpus of approximately 40,000 texts also provides a good testbed for the more generalized TEI tools that we have developed. For this poster we describe some of the tools that we have used for the TCP conversions and the results of our analysis of the converted TCP texts. As a case study we examine and demonstrate the generation of ebook editions (ePub format) of the ECCO-TCP and EEBO-TCP texts from the converted TEI. The results of such conversions will be discussed with regard to their usefulness for contemporary readers and any failures in representing the intellectual content of the original text.
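One simple way to survey the range of markup recorded in a set of converted texts is to count how often each element occurs. The Python sketch below does this for two tiny invented TEI P5 documents; the function name and sample data are assumptions made for illustration and do not represent the analysis tools described in this poster.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_counts(xml_strings):
    """Count element names (namespace stripped) across several documents."""
    counts = Counter()
    for source in xml_strings:
        for elem in ET.fromstring(source).iter():
            # Tags look like "{namespace}name"; keep only the local name.
            counts[elem.tag.rsplit("}", 1)[-1]] += 1
    return counts


# Invented sample documents standing in for converted TCP texts.
docs = [
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p>One <gap/> two</p></body></text></TEI>",
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p>A</p><p>B</p></body></text></TEI>",
]

counts = element_counts(docs)
for name, total in counts.most_common():
    print(name, total)
```

Run over a whole corpus (reading files from disk rather than strings), a tally of this kind gives a first overview of which structures are common and which are rare enough to need manual inspection.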

Notes
1. Significant thanks are owed to Paul Schaffner for his very patient and understanding help in explaining decisions made by the TCP project. We are also grateful to Martin Mueller, Stephen Ramsay and Brian Pytlik Zillig of the Monk and Abbott projects, who wrestled with some of the same dilemmas before and in parallel with us, for discussions of minutiae of the markup. See http://www.tei-c.org/ for more information about the TEI.


Conference Info

Complete

ADHO - 2013
"Freedom to Explore"

Hosted at University of Nebraska–Lincoln

Lincoln, Nebraska, United States

July 16, 2013 - July 19, 2013

243 works by 575 authors indexed

XML available from https://github.com/elliewix/DHAnalysis (still needs to be added)

Conference website: http://dh2013.unl.edu/

Series: ADHO (8)

Organizers: ADHO