White Rabbit: A Vertical Solution for Standoff Markup Encoding and Web-Delivery

Carl Stahmer

Authorship

1. Carl Stahmer

Maryland Institute for Technology and Humanities (MITH) - University of Maryland, College Park

Work text


Below is an extract of a conventional, TEI-encoded stanza from an anonymous Early Modern ballad:*
For what we with our flayes coulde get
To kepe our house and seriauntes
That dyd the freers from us fet
And with our soules played the marchantes
And thus they with theyr false warantes
Of our sweate have easelye lyved
That for fatnesse theyre belyes pantes
So greatlye have they us deceaved
This form of XML markup, which embeds the markup within the artifact itself, will be familiar to most readers. In their 1997 essay, “Hyperlink Semantics for Standoff Markup of Read-Only Documents”, Henry S. Thompson and David McKelvie describe a system of “standoff markup” in which markup data resides in a file physically separate from the one it describes. In a standoff markup system of the type described by Thompson and McKelvie, all of the TEI markup from the above citation would be stored in a separate file from the text itself, relying on a pointer system to describe its relation to the root text:*
The above example uses the order of words in the root text to define inclusionary collections of words that are described by the XML elements in the standoff markup file. It provides the same markup description of the text as the more familiar, traditional example given above, but without intruding into the integrity of the root text file.
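The XML for the pointer-based example did not survive in this plain-text copy. As a rough stand-in, and not the author's original markup, the following Python sketch illustrates the word-order idea: the root text stays untouched, and a separate list of spans points at it by word position. The names Span, tokenize, and apply_standoff, the TEI-style tags lg and l, and the 1-based inclusive word numbering are all assumptions made for illustration.

```python
from typing import NamedTuple
from xml.sax.saxutils import escape

# Root text: the first two lines of the stanza, stored without any markup.
ROOT_TEXT = (
    "For what we with our flayes coulde get\n"
    "To kepe our house and seriauntes"
)


class Span(NamedTuple):
    """One standoff element (hypothetical structure): a tag plus a word range."""
    tag: str    # element name, e.g. "l" for a verse line
    start: int  # 1-based position of the first word covered
    end: int    # 1-based position of the last word covered (inclusive)


# The standoff "file": it refers to the root text only by word position.
LAYER = [Span("lg", 1, 14), Span("l", 1, 8), Span("l", 9, 14)]


def tokenize(text):
    """Split the root text into an ordered list of words."""
    return text.split()


def apply_standoff(words, spans):
    """Render inline XML from root words and one properly nested layer."""
    opens, closes = {}, {}
    # At a shared start position, the longer (outer) span opens first.
    for s in sorted(spans, key=lambda s: (s.start, -s.end)):
        opens.setdefault(s.start, []).append(s.tag)
    # At a shared end position, the later-starting (inner) span closes first.
    for s in sorted(spans, key=lambda s: (s.end, -s.start)):
        closes.setdefault(s.end, []).append(s.tag)
    pieces = []
    for position, word in enumerate(words, start=1):
        prefix = "".join(f"<{t}>" for t in opens.get(position, []))
        suffix = "".join(f"</{t}>" for t in closes.get(position, []))
        pieces.append(prefix + escape(word) + suffix)
    return " ".join(pieces)


inline_xml = apply_standoff(tokenize(ROOT_TEXT), LAYER)
print(inline_xml)
```

The sketch handles only a single, properly nested layer. Overlapping hierarchies would be kept as separate layers and rendered one at a time.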
As Thompson and McKelvie note, one major advantage of a standoff markup system is that it makes it possible to use multiple, overlapping hierarchies to describe the same root artifact, something that cannot be achieved using a conventional, single-file markup approach. At the Maryland Institute for Technology and the Humanities (MITH, http://www.mith.umd.edu), we are currently collaborating with the Library of Congress (LOC, http://www.loc.gov/) to develop a platform for the encoding and delivery of digital resources in the LOC’s American Memory collection (http://memory.loc.gov/ammem/). The platform uses an interactive standoff markup system designed specifically to allow multiple, overlapping markup hierarchies, including markup provided by web users visiting the collection via the American Memory website.
This platform, named White Rabbit, is a vertical standoff markup solution. It provides an easy-to-use Graphical User Interface (GUI) for editorially controlled base-document preparation; a collection of web-service applications that allow users to browse, search, and retrieve resources using a standard HTML web browser, or to retrieve the raw XML source for each resource; and, most importantly, a web-based interface through which users can add their own markup layers to texts in the collection.
On the technical side, White Rabbit functions by tokenizing raw ASCII textual data at the level of base, lexically significant units (most often words, but frequently other diacritical and textual elements) and storing an ordered list of tokenized elements in a SQL database. It then allows users to construct XML using a simple point-and-click interface and to validate this XML against a DTD. XML “layers,” including those created by resource consumers visiting the LOC’s website, are stored in a collection of related database tables.
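The storage model described above might be sketched as follows, using Python's built-in sqlite3 module in place of whatever database White Rabbit actually uses. The table and column names (tokens, layers, spans, contributor, and so on), the tokenizing regular expression, and the helper functions are all hypothetical; the abstract does not specify a schema. DTD validation is not shown.

```python
import re
import sqlite3

# Assumed tokenization rule: a run of word characters is one token, and
# any other non-space character (e.g. punctuation) is a token of its own.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Hypothetical schema: one ordered token list per document, plus any
# number of markup layers that point at token positions.
SCHEMA = """
CREATE TABLE tokens (
    doc_id   INTEGER NOT NULL,
    position INTEGER NOT NULL,
    text     TEXT    NOT NULL,
    PRIMARY KEY (doc_id, position)
);
CREATE TABLE layers (
    layer_id    INTEGER PRIMARY KEY,
    doc_id      INTEGER NOT NULL,
    contributor TEXT    NOT NULL
);
CREATE TABLE spans (
    layer_id  INTEGER NOT NULL REFERENCES layers(layer_id),
    tag       TEXT    NOT NULL,
    start_pos INTEGER NOT NULL,
    end_pos   INTEGER NOT NULL
);
"""


def tokenize(text):
    """Return the ordered list of tokens in the raw text."""
    return TOKEN_RE.findall(text)


def store_document(conn, doc_id, text):
    """Store the tokens with 1-based positions; return the token count."""
    tokens = tokenize(text)
    conn.executemany(
        "INSERT INTO tokens VALUES (?, ?, ?)",
        [(doc_id, pos, tok) for pos, tok in enumerate(tokens, start=1)],
    )
    return len(tokens)


def add_layer(conn, doc_id, contributor, spans):
    """Store one markup layer as (tag, start, end) rows; return its id."""
    cur = conn.execute(
        "INSERT INTO layers (doc_id, contributor) VALUES (?, ?)",
        (doc_id, contributor),
    )
    layer_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO spans VALUES (?, ?, ?, ?)",
        [(layer_id, tag, start, end) for tag, start, end in spans],
    )
    return layer_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
token_count = store_document(
    conn, 1,
    "And thus they with theyr false warantes\n"
    "Of our sweate have easelye lyved",
)
# An editorial layer marking verse lines, and a visitor layer whose
# segment crosses the line boundary: two overlapping hierarchies.
editor_layer = add_layer(conn, 1, "editor", [("l", 1, 7), ("l", 8, 13)])
visitor_layer = add_layer(conn, 1, "visitor", [("seg", 6, 9)])
```

Because each layer only references token positions, the visitor's segment can straddle the editor's line boundary without either layer having to accommodate the other.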
As new markup layers are added to each artifact, resource consumers gain the ability to choose which markup “layers” to apply to the text on delivery. Once a markup layer is chosen, users can perform advanced XML searching, parsing, and manipulation of artifacts using any web browser. Taking the Early Modern ballad extracted above as an example, a user could search within a single ballad or a collection of ballads for all occurrences of a particular word or phrase within a refrain only. Using conventional markup systems, this type of functionality is available only through specialized XML browsers.
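A layer-restricted search of this kind could look roughly like the Python sketch below. It is illustrative only: the function find_in_element is hypothetical, spans are assumed to be (tag, start, end) tuples over 1-based word positions, and treating the stanza's final line as a refrain is an assumption made purely to have something to search.

```python
# The last four lines of the stanza, as an ordered list of words.
TOKENS = (
    "And thus they with theyr false warantes "
    "Of our sweate have easelye lyved "
    "That for fatnesse theyre belyes pantes "
    "So greatlye have they us deceaved"
).split()

# One chosen markup layer. Marking the final line as a refrain is an
# assumption for illustration, not a claim about the ballad.
LAYER = [
    ("l", 1, 7),
    ("l", 8, 13),
    ("l", 14, 19),
    ("refrain", 20, 25),
]


def find_in_element(tokens, spans, tag, word):
    """Return 1-based positions where `word` occurs inside a `tag` span."""
    target = word.lower()
    hits = set()
    for span_tag, start, end in spans:
        if span_tag != tag:
            continue
        for position in range(start, end + 1):
            if tokens[position - 1].lower() == target:
                hits.add(position)
    return sorted(hits)


# "have" appears twice in the text but only once inside the refrain.
everywhere = [i for i, t in enumerate(TOKENS, start=1) if t.lower() == "have"]
in_refrain = find_in_element(TOKENS, LAYER, "refrain", "have")
print(everywhere, in_refrain)
```

Swapping in a different layer changes what counts as a refrain without touching the token list, which is the point of choosing a layer before searching.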
White Rabbit also provides a collection of “hierarchy analysis” tools that allow users to analyze the ways in which multiple markup layers for a given text relate to each other. Using these tools, a user can identify filtered or unfiltered collections of layers and search for statistically significant patterns of convergence and divergence between the markup layers in these collections. Over time, this aspect of White Rabbit’s functionality provides an increasingly valuable bank of data regarding the ways in which a growing collection of users understands both formal and thematic elements of artifacts contained in the collection.
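The abstract does not say how convergence and divergence are measured. As one possible starting point, the Python sketch below scores each pair of layers by the Jaccard overlap of the word ranges they mark, ignoring tag names. The measure, the function names, and the sample layers are all assumptions, and no significance testing is attempted.

```python
from itertools import combinations


def extents(spans):
    """The set of (start, end) word ranges a layer marks, ignoring tags."""
    return {(start, end) for _tag, start, end in spans}


def jaccard(a, b):
    """Shared ranges divided by all ranges; 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_agreement(layers):
    """Map each pair of layer names to its agreement score."""
    return {
        (x, y): jaccard(extents(layers[x]), extents(layers[y]))
        for x, y in combinations(sorted(layers), 2)
    }


# Hypothetical layers over the same 13-word passage.
LAYERS = {
    "editor": [("l", 1, 7), ("l", 8, 13)],
    "visitor_a": [("l", 1, 7), ("seg", 6, 9)],
    "visitor_b": [("phrase", 6, 9)],
}

scores = pairwise_agreement(LAYERS)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Here the two visitors converge on the range 6 to 9 even though they name it differently, while visitor_b shares nothing with the editorial layer. A fuller analysis would also need to weigh tag names and partial overlaps.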
With White Rabbit you can have your cake and eat it too: applying multiple markup strategies to the same text for retrieval and display as determined by particular scholarly contexts, providing robust analysis of the patterns in textual structure that emerge through multiple, overlapping markup layers, and delivering finely tuned XML parsing and searching of texts to any user with a standard web browser.
White Rabbit comprises a collection of cross-platform, Java-based client and server applications and applets that communicate with a SQL database. It is an open-source platform specifically designed to be easily exportable to a variety of platforms and uses, and it will be available for public download in both compiled and source distributions in late 2006.
The proposed paper will present a brief introduction to the concept of standoff markup as described above, an interactive demonstration of the advantages of standoff markup, and, finally, an interactive demonstration of the White Rabbit platform. Detailed information about how to download and implement White Rabbit and/or participate in open-source project development will also be provided.
* Please note that submitting the XML examples above in plain text was breaking the automated submission interface. As such, entity references were used to replace certain elements. This will dramatically decrease the readability of the examples if the abstract is read in plain text.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info


ACH/ALLC / ACH/ICCH / ADHO / ALLC/EADH - 2006

Hosted at Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

Paris, France

July 5, 2006 - July 9, 2006


The effort to establish ADHO began in Tuebingen, at the ALLC/ACH conference in 2002: a Steering Committee was appointed at the ALLC/ACH meeting in 2004, in Gothenburg, Sweden. At the 2005 meeting in Victoria, the executive committees of the ACH and ALLC approved the governance and conference protocols and nominated their first representatives to the ‘official’ ADHO Steering Committee and various ADHO standing committees. The 2006 conference was the first Digital Humanities conference.

Conference website: http://www.allc-ach2006.colloques.paris-sorbonne.fr/

Series: ACH/ICCH (26), ACH/ALLC (18), ALLC/EADH (33), ADHO (1)

Organizers: ACH, ADHO, ALLC
