Introducing Phelix: an open XML database system

Lou Burnard
Oxford University, TGE-Adonis

At least one of the factors underlying the current success of XML is
its claim to subvert the traditional opposition between text and
database. XML systems, the claim goes, can combine the flexibility of
unstructured text retrieval with the discipline of a closely
structured database. This tension is pervasive in the humanities, but
one application area where it seems of particular relevance is that of
detailed descriptive bibliography. Traditional bibliographic systems
are good at enforcing consistent cataloguing practices, but the
form-filling ethos they exemplify is often at variance with the more
discursive practices which characterize the description of
manuscripts, incunabula, and cultural artefacts in general. Whatever
standards may be promulgated in these areas, individual scholars and
curators will continue to require the ability to wax lyrical on
unpredictable aspects of the objects in their care. All too often
unsympathetic systems designers have either forced scholarship into
the Procrustean bed of a rigid record structure or simply abandoned
any attempt to reflect the logical structure of a well-written
description in its encoding.

In Phelix we have tried to get the best of both worlds, by supporting
any XML-defined structure in a traditional relational database. Our
testbed, and the major motivation behind the development of the
system, has been the requirements of the EU-funded Master project, but
Phelix is not limited to this application, and our presentation of it
will therefore focus on the general principles underlying our
architecture rather than the details of its implementation.

1. The MASTER project

MASTER (Manuscript Access through Standards for Electronic Records) is
a three-year project funded by the European Libraries Programme, the
goal of which is the definition of a standard for the representation
of the structure of manuscript catalogue records, together with pilot
implementations of systems using that standard. The project is led by
the CTA at DMU, and its partners include the Royal Library of the
Netherlands, the Arnamagnæan Institute in Copenhagen, the IRHT in
Paris, and the Czech National Library in Prague. Our presentation will
not spend much time rehearsing the project's goals, deliverables, and
methods, as these have been presented at previous ALLC-ACH and DRH
conferences; for details consult the website at
http://www.cta.dmu.ac.uk/projects/master or the overview article cited
below (Robinson et al., 1999). The key points we wish to emphasize are:

- the distributed nature of the project (several different
institutions with very different traditions and conventions)

- a consequent emphasis on multiple solutions converging on flexible
standards in an open architecture

- the need to deliver demonstrable results within a short time scale

2. The MASTER DTD

We will only briefly review the key features of the document type
definition (DTD) developed for the Master project. This DTD, cast as a
set of extensions of the TEI scheme, supports records which vary
greatly in their complexity and level of detail. It can be used both
as the target for output from legacy systems and as a template for the
origination of new data: we will discuss the experiences of project
participants with respect to both modes of operation, and briefly
characterize the different software systems developed to support their
creation and validation of data for the Master database.

The Master DTD is usually thought of as a means of defining the
structure of a manuscript description, either as a free-standing
document or as a free-standing collection of such documents. It can,
however, also be used to define such a description within the context
of a full TEI document, such as a complete electronic version of a
manuscript, combining digital images, a transcription, and a
description of it. This is achieved by redefining the TEI <sourceDesc>
element to include a new metadata element <msDescription>, which is
also added to the existing TEI "chunk" class (see Sperberg-McQueen and
Burnard 1994).

In practice, even an apparently free-standing collection of
<msDescription>s will typically need to be embedded in a larger
TEI-conformant framework if the application is to support such important
features as authority checking for language codes, validation of
bibliographic records, or references to persons and places. Not the
least advantage of having developed the Master scheme in a TEI context
is the availability of predefined schemes for these and other
requirements.

3. Software to support the system

From the start of the project it was clear that different partners
would use different tools. At the Arnamagnæan Institute in Copenhagen,
a team led by Matthew Driscoll customised a cheap ASCII editor to
generate XML and validate records. At the IRHT in Paris, the team led
by Muriel Gougerot and Elizabeth Lalou produced a customised version
of Microsoft Access which allowed forms-based entry of Master
conformant records. In Oxford and elsewhere experiments were made
using specialist software such as XMetaL. And in the Netherlands and
Prague, software was customised to export records from pre-existing
relational databases in XML format. As of summer 2000, the Master
project had several hundred records, ranging in size from a few
hundred bytes to nearly a megabyte of XML data.

Our original project plan called for implementation of two independent
systems to manage all this data, one at De Montfort, and the other at
Oxford, as a means of demonstrating the system independence of the
project standards. This presentation will focus on the work we carried
out at Oxford rather than the more ambitious electronic publishing
system still being developed at De Montfort.

4. The Phelix Architecture

Our goal was to create an open system, which could be given away in
source form under the terms of the GNU licence, which would make full
use of the information encoded in the XML markup of our documents,
which would not be tied to any particular DTD or application area,
which would be relatively easy to use, and which could support a
large-scale collaborative project like Master. The system would not directly
support creation or editing of XML data (the project had already found
no shortage of tools for that purpose), but instead would focus on the
storage, search, retrieval, and display of records held in a central
database, accessible over the web.

The key problem to be solved was how we could combine the proven
advantages of database technologies with the equally well-proven
advantages of XML as a data format. We think our solution is
relatively unusual in that, though we do not claim any particular
originality in its design, we know of no other comparably thorough
implementation.

An XML document is, as every schoolboy and girl now knows, a
serialised tree, in which the terminal nodes are document fragments,
and in which nodes can be decorated at any level with additional
attribute-value pairs. Our approach is to represent each node as a row
in a relational database management system (RDBMS). Columns in each
row contain pointers (or keys) to the root element, to siblings of the
current element, and to its children, or (if it is a terminal) simply
contain data. We have very few tables in our database, since we do not
attempt to represent the semantics of the tree (i.e. what the
different node types mean or how they relate to each other) -- that
would simply duplicate information already present in the XML
rendition of the data. We rely heavily on the efficiency with which
modern database systems support key-based access to individual rows
and columns in the table -- but that is what such systems are designed
to optimize.
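
As a rough illustration, a node store along these lines might be
declared in SQL as follows. This is a minimal sketch assuming a
two-table layout; the table and column names here are our own
inventions for exposition, not the actual Phelix schema.

    -- Hypothetical node table: one row per node of the serialised tree.
    -- Names are illustrative only, not Phelix's actual schema.
    CREATE TABLE node (
        id           INTEGER PRIMARY KEY, -- unique key for this node
        root_id      INTEGER,             -- key of the document's root element
        parent_id    INTEGER,             -- key of the parent (NULL at the root)
        next_sibling INTEGER,             -- key of the following sibling, if any
        first_child  INTEGER,             -- key of the first child (NULL for terminals)
        node_type    CHAR(1),             -- 'e' = element, 't' = text fragment
        name         VARCHAR(64),         -- element name, for element nodes
        content      TEXT                 -- character data, for terminal nodes
    );

    -- Attribute-value pairs decorating a node go in a companion table.
    CREATE TABLE attribute (
        node_id INTEGER,                  -- key of the element carrying the attribute
        name    VARCHAR(64),
        value   TEXT
    );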

The user loads data into the table by submitting a single XML
document, or a collection of them, via a web-based forms
interface. Each document is then parsed against the project DTD and
(if valid) decomposed from its tree representation into a set of rows
which are then stored in the RDBMS. Queries against the database,
expressed in XML terms, are internally translated into SQL; the rows
returned are reassembled into XML document fragments, which are passed
back to the interface. We use XSLT to control
all XML transformations: this means, for example, that the user can
specify not only which XML elements from a document are to be returned
and in what order but also how they should be rendered by the HTML
browser or other interface client.
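
To make the round trip concrete, here is an invented example (our
illustration, not actual Phelix output) of how a two-element fragment
might decompose into rows of the hypothetical node table sketched
above, and of the kind of SQL into which a request for every
<msIdentifier> might be translated.

    -- Decomposition of <msDescription><msIdentifier>MS. 42</msIdentifier>
    -- </msDescription> into rows. Column order: id, root_id, parent_id,
    -- next_sibling, first_child, node_type, name, content.
    INSERT INTO node VALUES (1, 1, NULL, NULL, 2,    'e', 'msDescription', NULL);
    INSERT INTO node VALUES (2, 1, 1,    NULL, 3,    'e', 'msIdentifier',  NULL);
    INSERT INTO node VALUES (3, 1, 2,    NULL, NULL, 't', NULL, 'MS. 42');

    -- A request for the text content of every <msIdentifier> might then
    -- be translated internally into SQL along these lines:
    SELECT t.content
    FROM   node AS e
    JOIN   node AS t ON t.parent_id = e.id
    WHERE  e.name = 'msIdentifier'
      AND  t.node_type = 't';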

The current implementation uses a freely available RDBMS called MySQL
(but could equally well use, say, PostgreSQL or Oracle); XSLT
transformations are done using the Sablotron processor (but could
equally well use, say, XT or Saxon); the scripting language holding the
system together is PHP (but could equally well be ASP or Java).

The user interface we have developed uses an internet shopping
metaphor: it presents the virtual library as a customisable
supermarket, in which the user can reconfigure the arrangement of
goods on the shelf (for example by sorting manuscripts by their
century of production or country of origin) and make an initial
selection of items of particular interest for inclusion in a "basket",
against which further research can be carried out. Facilities
permitting, we will be happy to give a live demonstration of the
system, either during the session or after it.

5. Conclusions

We have developed a general-purpose, TEI-compatible XML database system
using freely available public domain tools. The system is configured
to support a complex XML DTD and offers sophisticated searching and
retrieval facilities. It is currently being tested on a
collaboratively designed database of several hundred manuscript
descriptions. We expect the system to have many further applications,
and hope to discuss some of them with attendees at the conference.

Bibliography

Robinson, P., Burnard, L., and Lalou, E. "Vers un standard européen de
description des manuscrits: le projet Master". In André, Jacques, and
Marie-Anne Chabin (eds): Les documents anciens. Document Numérique,
vol. 3, nos. 1-2, June 1999. Paris: Hermès Science Publications, 1999.

Sperberg-McQueen, C.M., and Burnard, L. "Guidelines for Electronic Text
Encoding and Interchange". ACH-ALLC-ACL Text Encoding Initiative,
Chicago and Oxford, 1994.
