Distributed Multivalent Encoding

Authorship
  1. Paul Caton

    Brown University

Work text

Distributed multivalent encoding (DME) describes a web-based
text encoding practice and the result of that practice. It assumes
a digital text resource with a browser interface that allows users
to associate N encodings with N texts in the resource. Users
need have no other connection with the resource than their
using it and constraints should be minimal; hence the encoding
is distributed among multiple creators and multiple conceptual
approaches. To realize their encoding(s) users may ignore
existing encodings; they may apply one tagset to one text, or
to multiple texts, or they may apply multiple tagsets to one text,
or multiple texts; any element they create can reference any
other element created by themselves or by someone else. Thus
the resource’s encoding becomes multivalent in part and in
whole.
A cluster of activities defines the space of distributed
multivalent encoding, including annotation, keywording
(including “folksonomizing”), editing, and humanities criticism.
For any one component of DME, anticipatory previous work
exists [1]. Any newness of DME lies only in the surprising fact
that the components have never been fully assembled, though
they have all been present for some years. Now, however, we
can see DME emerging, albeit in hesitant, not-fully-developed
forms [2].
A DME resource has a base text (or texts), a web interface for
creating encoding, and a mechanism for storing encoding; these
three things enable all subsequent activities: editing, deleting,
retrieving, searching, etc. Considering each of the three parts
in a little more detail we will see few real technical obstacles
in DME’s path.
Distributed multivalent encoding starts with a reference base
digital text. A reference base text should be clearly marked as
such by the resource managers. Only they may change it (or
give someone else permission to change it), and any change
must be well publicized. A reference base text must have a
well-defined beginning and end, and a well-defined internal
base reference structure. While it is certainly possible to regard
either the simple byte sequence or the character sequence as
the base reference structure and then link encodings to byte or
character offsets, this is not a robust solution: any change to
the base text shifts every offset that follows it. Practicality
suggests the internal base reference structure should itself be
an encoding; convenience further suggests a presentation-based
encoding, on the grounds that presentation
in a medium reflects a community’s sense of that medium’s
main communicative organization and delivery units (Caton,
2001). A TEI Lite encoding
(<http://www.tei-c.org/Guidelines2/index.xml.ID=lite>), for
example, or one conforming to DLF Level 4
(<http://www.diglib.org/standards/tei.htm#level4>), would be
suitable.
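The difference between offset-based and structure-based reference can be illustrated with a minimal sketch. The two tiny base texts, the `id` attribute values, and the function names below are invented for illustration and are not taken from any actual DME resource; the sketch only assumes that the base reference structure gives each paragraph an explicit ID.

```python
import xml.etree.ElementTree as ET

# Two versions of a hypothetical base text. The managers of the resource
# have corrected the first paragraph only; the second is untouched.
BASE_V1 = '<text><p id="p1">A first paragraph.</p><p id="p2">A second paragraph.</p></text>'
BASE_V2 = '<text><p id="p1">A corrected first paragraph.</p><p id="p2">A second paragraph.</p></text>'


def by_offset(xml_source: str, start: int, end: int) -> str:
    """Address a span by character offsets into the serialized base text."""
    return xml_source[start:end]


def by_element_id(xml_source: str, element_id: str) -> str:
    """Address a span through the base reference structure (an element ID)."""
    root = ET.fromstring(xml_source)
    for elem in root.iter():
        if elem.get("id") == element_id:
            return "".join(elem.itertext())
    raise KeyError(element_id)


# Offsets for the second paragraph, computed against version 1.
target = "A second paragraph."
start = BASE_V1.index(target)
end = start + len(target)

offset_v1 = by_offset(BASE_V1, start, end)   # the intended span
offset_v2 = by_offset(BASE_V2, start, end)   # shifted by the correction
id_v1 = by_element_id(BASE_V1, "p2")
id_v2 = by_element_id(BASE_V2, "p2")         # still the intended span
```

In this sketch the stored offsets stop pointing at the second paragraph as soon as the first one changes, while the ID-based lookup is unaffected. IDs are not immune to change either, but they fail in fewer situations.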
We should stress, though, that the reference structure makes
no claim to being representationally definitive. Within the
structure, distinctions between tags and #PCDATA are purely
structural and do not define either “text” (as general
phenomenon) or “the text”. A supplementary encoding (i.e. an
encoding created by a user) associates with the text as
represented within the reference structure, not “the text” as
some reified cultural object. If we follow the reasoning of
Renear et al. on the relationship between FRBR entities and
XML documents, we would probably consider the base
reference text a manifestation of an expression, especially
because we incline towards a presentation-based encoding
(which, as Caton argues, is what OHCO-style encoding
is) (Renear et al., 2004; Caton, 2001). The FRBR vocabulary,
however, hardly resolves the problematic semantics of the
common phrases “the text” and “a text”. Hence our insistence
on treating the internal reference structure as definitive only
within the DME resource as a system. Indeed the very point of
a DME resource is to allow users to create encodings that they
can treat as definitive for their purposes.
The web interface presents the base text to the user and allows
the user to associate encoding(s) with parts of the text. The
encoding must allow for multiple overlapping hierarchies, and
so a format such as CLIX/TEI HORSE should be used (DeRose,
2004; Bauman, 2005). While the interface programming might
be complex in terms of having to manage numerous details,
the required functionality is straightforward. The actual
mechanics a resource employs are not important, except as
regards the degree to which they make the process awkward.
They will vary according to programmers’ preferences and with
changes in technology. The proof-of-concept Limner interface,
for example, uses HORSE and relies on explicit element IDs
and user-selected strings to mark where start and end tags go.
DeRose notes that ‘[o]ne often hears that … IDs are somehow
“safe” pointers into documents. However, this is not true; they
are at most “safer” than many other methods’ (DeRose, 2004).
His point is well taken. However, there must be some reference
system, and unless we resign ourselves to inline markup and
unwieldy file sizes, it seems preferable to keep the
supplementary markup separate from the base text and use a
system that is (to view the cup as half full rather than half
empty) at least safer than many other methods. A combination
of full XPath plus element ID plus a string of sufficient length
to have a strong chance of being unique should allow a DME
resource to consistently associate a stored out-of-line element
with its proper position with respect to the internal base
reference system. The Limner implementation is rather crude
and already dated; the DOM scripting features of current
browsers together with wider availability of XPath handling
functions in programming languages offer many opportunities
to improve upon it.
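The combined reference described above (path plus element ID plus a sufficiently distinctive string) might be resolved along the following lines. This is a sketch under stated assumptions, not the Limner implementation: the `Anchor` record, its field names, and the sample base text are hypothetical, the selected string is assumed to begin at the tag position, and Python's standard `xml.etree.ElementTree` supports only a subset of XPath rather than the full XPath the text calls for.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Anchor:
    """Hypothetical record locating one tag of an out-of-line element."""
    path: str        # location path (ElementTree's XPath subset only)
    element_id: str  # explicit ID of the base-text element
    context: str     # user-selected string, assumed to start at the tag


def resolve(root: ET.Element, anchor: Anchor) -> Optional[Tuple[ET.Element, int]]:
    """Return (element, character offset inside its text), or None."""
    # Prefer the explicit ID, which survives insertion of sibling elements.
    target = next(
        (e for e in root.iter() if e.get("id") == anchor.element_id), None
    )
    # Fall back to the stored path if no element carries the ID any more.
    if target is None:
        target = root.find(anchor.path)
    if target is None:
        return None
    text = "".join(target.itertext())
    # Refuse to guess when the string is missing or not unique.
    if text.count(anchor.context) != 1:
        return None
    return target, text.index(anchor.context)


BASE = ET.fromstring(
    '<text><body><p id="p1">An opening sentence.</p>'
    '<p id="p2">A second sentence about encoding.</p></body></text>'
)
# The same text after a paragraph has been inserted before p2,
# so that the stored path "body/p[2]" no longer points at p2.
REVISED = ET.fromstring(
    '<text><body><p id="p1">An opening sentence.</p>'
    '<p id="p1a">An inserted paragraph.</p>'
    '<p id="p2">A second sentence about encoding.</p></body></text>'
)

anchor = Anchor(path="body/p[2]", element_id="p2", context="sentence about")
hit = resolve(BASE, anchor)
hit_after_revision = resolve(REVISED, anchor)
ambiguous = resolve(BASE, Anchor("body/p[2]", "p2", "en"))
fallback = resolve(BASE, Anchor("body/p[1]", "missing-id", "opening"))
```

Each of the three components covers a different failure: the ID copes with structural insertions, the path copes with a lost ID, and the uniqueness check on the string prevents a tag from being silently attached at the wrong position.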
The actual details of storage are also of limited importance.
Relational databases can hold the information (as with Limner)
but it seems likely that future DME resources will use native
XML databases.
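As an illustration of the relational option, a store for out-of-line elements could be as small as two tables. The schema, table and column names, user names, and anchor strings below are assumptions made for this sketch; they do not describe Limner's actual database.

```python
import sqlite3

# Hypothetical schema: one row per supplementary encoding, one row per
# out-of-line element. refers_to lets an element point at any other
# element, whoever created it.
SCHEMA = """
CREATE TABLE encoding (
    id      INTEGER PRIMARY KEY,
    creator TEXT NOT NULL,
    tagset  TEXT NOT NULL
);
CREATE TABLE element (
    id           INTEGER PRIMARY KEY,
    encoding_id  INTEGER NOT NULL REFERENCES encoding(id),
    name         TEXT NOT NULL,
    start_anchor TEXT NOT NULL,
    end_anchor   TEXT NOT NULL,
    refers_to    INTEGER REFERENCES element(id)
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create a store with the sketch schema (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_encoding(conn, creator, tagset):
    cur = conn.execute(
        "INSERT INTO encoding (creator, tagset) VALUES (?, ?)",
        (creator, tagset),
    )
    return cur.lastrowid


def add_element(conn, encoding_id, name, start_anchor, end_anchor, refers_to=None):
    cur = conn.execute(
        "INSERT INTO element (encoding_id, name, start_anchor, end_anchor, refers_to)"
        " VALUES (?, ?, ?, ?, ?)",
        (encoding_id, name, start_anchor, end_anchor, refers_to),
    )
    return cur.lastrowid


def elements_by_creator(conn, creator):
    """Names of all elements created by one user, in insertion order."""
    rows = conn.execute(
        "SELECT element.name FROM element"
        " JOIN encoding ON element.encoding_id = encoding.id"
        " WHERE encoding.creator = ? ORDER BY element.id",
        (creator,),
    )
    return [row[0] for row in rows]


conn = open_store()
enc_a = add_encoding(conn, "user_a", "rhetorical-figures")
enc_b = add_encoding(conn, "user_b", "commentary")
# Two elements from different users whose ranges overlap in the base text;
# the second also references the first.
metaphor = add_element(conn, enc_a, "metaphor", "p2#sentence", "p2#encoding")
note = add_element(conn, enc_b, "note", "p2#second", "p2#about", refers_to=metaphor)
```

Because each element is stored as its own row with its own anchors, overlapping ranges from different encodings coexist without any nesting constraint.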
Without downplaying the amount of work involved, we can
confidently say that DME is perfectly possible with current
technology and that DME resources will be built in the near
future. The real unknowns (and potential problems) lie on the
social side. Who gets to encode? Will all supplementary
encodings be equal, or will some be “more equal” than others?
Will encodings be moderated? Will differently encoded and
competing base reference texts proliferate until the very notion
of a base reference becomes utterly compromised? Will a class
system of resources emerge, driven by a scholarly fear of
non-scholars’ contributions? The relative success of Wikipedia
in the face of all the things that could have stopped it should
make us optimistic. Probably DME will initially develop in
constrained forms, with resources authorizing users and
retaining ultimate editorial control over supplementary
encodings. In time we hope to see distributed multivalent
encoding become a widespread, democratic practice.
1. For example, the resource based around Pico Della Mirandola’s
Conclusiones CM publicae disputandae (PICO) features an
annotation system which is also employed by the Virtual
Humanities Lab (VHL), a resource whose development plan
includes implementing a form of DME.
2. See LIMNER, for example: a site intended as a proof-of-concept
DME resource, still in an early stage.
Bibliography
Bauman, Syd. "TEI HORSEing Around." Proceedings of
Extreme Markup Languages 2005, Montréal, Québec, August
2005. <http://www.mulberrytech.com/Extreme/Proceedings/html/2005/Bauman01/EML2005Bauman01.html>
Caton, Paul. "Markup's Current Imbalance." Markup
Languages: Theory and Practice 3.1 (2001).
DeRose, Steven. "Markup Overlap: A Review and a Horse."
Proceedings of Extreme Markup Languages 2004, Montréal,
Québec, August 2004. <http://www.mulberrytech.com/Extreme/Proceedings/xml/2004/DeRose01/EML2004DeRose01.html>
Digital Library Federation. TEI Text Encoding in Libraries:
Guidelines for Best Encoding Practices. Version 2.1. 2006.
<http://www.diglib.org/standards/tei.htm#level4>
LIMNER. <http://golf.services.brown.edu/projects/Limner/>
PICO. <http://www.stg.brown.edu/projects/pico/index.php>
Renear, Allen H., Pat Lawton, Christopher Phillippe, and David
Dubin. "An XML Document Corresponds to which FRBR
Group 1 Entity?" Proceedings of Extreme Markup Languages
2004, Montréal, Québec, August 2004. <http://www.mulberrytech.com/Extreme/Proceedings/html/2003/Lawton01/EML2003Lawton01.html>
TEI Lite. <http://www.tei-c.org/Guidelines2/index.xml.ID=lite>
Virtual Humanities Lab (VHL). <http://golf.services.brown.edu/projects/VHL/>


Conference Info

Complete

ADHO - 2007

Hosted at University of Illinois, Urbana-Champaign

Urbana-Champaign, Illinois, United States

June 2, 2007 - June 8, 2007

106 works by 213 authors indexed

Series: ADHO (2)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None