Documents and data: modelling materials for Humanities research in XML and relational database

paper
Authorship
  1. John Bradley

    Centre for Computing in the Humanities - King's College London

Work text

The Humanities Computing (HC) community has had a long and fruitful association with SGML and XML through the work of the TEI. Indeed, the TEI has greatly deepened our understanding of the significance of markup of digital materials. With certain qualifications, the OHCO (Ordered Hierarchy of Content Objects) model that XML and the TEI embody (DeRose et al. 1990, with some useful follow-up discussion in DeRose 1997) has largely been shown to meet the needs of document-oriented markup tasks. The HC community's relationship with the relational database, on the other hand, does not appear to have been so positive. Part of the reason for this arises from the nature of many HC projects, which are focused on the preparation of digital editions of source texts and are much better served by SGML/XML and the TEI. However, the dissatisfaction with relational technology seems to go beyond this. We find it described as a "matrix straightjacket" in (Townsend 1999) -- although it is more generously described and understood in Greenstein's A Historian’s Guide to Computing (Greenstein 1994). The Orlando project reports that they have deliberately rejected the relational model as a way of structuring their materials, asserting that they do not represent "a 'database'; [because] the tagged prose must say subtle, complex things" (Orlando 1998). At almost every ACH/ALLC meeting this writer hears the relational model disparaged as inappropriate for some material or other.

Our experience at the Centre for Computing in the Humanities at King's has been different from what one might infer from this seemingly widely held negative view of the relational model. It is based on long-term (multi-year) and intimate associations with a broad range of humanities-oriented projects (more than 30, and growing) in a large number of different disciplines, dealing with source materials that include images as well as text. For a somewhat different perspective on similar issues, see also Alvarado 1999 – framed there in the context of multimedia, metadata and digital libraries rather than the more specifically text-oriented view we are taking in this paper.

Of course, the relational model is not appropriate for representing the structure of textual materials, and the OHCO model, with the possibility of mixed content within elements, provides a perhaps not perfect but often good-enough model. In a number of our projects, however, project materials that primarily look like documents fit for textual markup often turn out to harbour materials that are not well handled in that way alone. Take, for example, a brief excerpt from the Relics and Selves project, which puts online a number of articles analysing how 19th-century museums contributed to the development of national identities in several South American countries. TEI-like markup for a small snippet of this text might look like this (here rather simplified for the purposes of the present argument):

[. . .] <rs type="scientist" reg="Lacerda, João Baptista de"
key="sci023">Lacerda</rs>'s stroll through the collections is a
striking example of this inversion of the 'museum effect' so
masterfully plotted by <rs type="letrado" reg="Borges, Jorge Luis"
key="let001">Borges</rs> [. . .]
(Andermann 2001)

By tagging references to persons in the text (using the TEI's "rs" – "referencing string" tag) the access system can locate references to any individual mentioned in the collection. The "key" attribute (defined in TEI P3 as "provid[ing] an alternative identifier for the object being named, such as a database record key") is needed because we are not so much interested in identifying names in the text as names, but as references to persons. If there were two people named Jorge Luis Borges referenced across the entire project article collection, then the rs elements for one might have the key attribute shown here (let001), and those for the other a different key (say, sci192). The biographical information about the person (say, the person's birth and death dates, and/or some brief biographical prose sketch) is not appropriate to store here in the reference within the article, but belongs in another separate structure altogether – something structured more like a glossary, an encyclopaedia entry, or an index. In fact, a similar argument could be made for the person types recorded in the rs tag. A glossary entry defining for the reader what was meant by a letrado (or scientist, for that matter) would be in a structure separate from the article text itself, and the type code here perhaps should be viewed as a reference to that entry. Furthermore, it could be argued that this classification belongs with the person data, rather than here with the name reference. Graphically, we could see these relationships as:
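The separation between the name reference in the article and the person record it points to can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the wrapping <p> element, the PERSONS register and its field names are hypothetical, and the dates are left empty because none are given here.

```python
import xml.etree.ElementTree as ET

# Simplified article fragment, wrapped in a hypothetical <p> element so that it parses.
ARTICLE = (
    '<p><rs type="scientist" reg="Lacerda, João Baptista de" key="sci023">Lacerda</rs>'
    "'s stroll through the collections is a striking example of this inversion "
    "of the 'museum effect' so masterfully plotted by "
    '<rs type="letrado" reg="Borges, Jorge Luis" key="let001">Borges</rs></p>'
)

# Hypothetical person register, kept apart from the article text.
# Birth and death dates are left as None on purpose: they are not known here.
PERSONS = {
    "sci023": {"name": "Lacerda, João Baptista de", "role": "scientist",
               "born": None, "died": None},
    "let001": {"name": "Borges, Jorge Luis", "role": "letrado",
               "born": None, "died": None},
}


def resolve_references(xml_text, persons):
    """Pair the text of every <rs> element with the person record its key names."""
    root = ET.fromstring(xml_text)
    return [(rs.text, persons[rs.get("key")]) for rs in root.iter("rs")]


refs = resolve_references(ARTICLE, PERSONS)
for written, person in refs:
    print(f"{written!r} refers to {person['name']} ({person['role']})")
```

The article keeps only the key; everything that describes the person lives in the register, so a second person with the same name would simply carry a different key.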

Note the different character of the information in the "person" or "role" structure from that of the narrative-like article. If Relics and Selves were published as a book, the person data would likely appear at the back of the book as an index, separated from the article text. This book index would, in turn, have a semantically arbitrary ordering based on the spelling of the person's name so that references could be readily looked up. Unlike the article text, which is meant to be read in "document order", the person index is meant to be consulted, and the order of consultation will almost certainly vary for different readers. The "document order" of the index would have to be fixed on paper, but would, in fact, be determined solely to allow the user to jump into the middle of it to locate an individual of interest.

Furthermore, the index is, in fact, an editorial object, created by the editors from the original material and more useful when it is, as far as possible, made consistent. While preparing a person index the editor tries to include data for all agreed components entered for persons (even if, perhaps, the data is sometimes provisional). Thus, for each person a standard form of the name is given, even if that exact form does not appear in the body of the text at all. The person's birth and death dates, etc. would be provided as accurately and as consistently as possible (accuracy and consistency sometimes conflicting, of course, but both essential aims for usability!), and this data would be provided for as many entries as possible. The virtues of consistency and completeness, and the structured feel of the materials ("there shall be provision for the recording of birth and death dates for all persons in the collection"), are characteristic of a table in a relational database. Figure I is, in fact, the beginnings of an "entity-relationship" diagram of the kind often used to display the structure of a relational database. By structuring this data in a database-like fashion one can then ask the computer to make use of it, not only by allowing access through an alphabetical list of names, but also perhaps by ordering persons by birth decade, or by person type/role, as alternative ways to access the data.
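As a rough sketch of what these alternative orderings might look like once the person data sits in a table, the following uses Python's built-in sqlite3 module. The table layout, column names and all rows are assumptions made for illustration; the names and years are invented placeholders, not project data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " person_key TEXT PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " role TEXT NOT NULL,"
    " born INTEGER,"   # provision for dates on every row, even if unknown
    " died INTEGER)"
)
# Placeholder rows: names and years are invented for illustration only.
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?, ?)",
    [
        ("p1", "Alpha, Ana", "scientist", 1846, 1915),
        ("p2", "Beta, Bruno", "letrado", 1899, 1986),
        ("p3", "Gamma, Clara", "scientist", 1841, None),
    ],
)


def names(query):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(query)]


# The same table consulted in three different orders.
by_name = names("SELECT name FROM person ORDER BY name")
by_decade = names("SELECT name FROM person ORDER BY born / 10, name")
by_role = names("SELECT name FROM person ORDER BY role, name")
```

None of the three orderings is privileged: the alphabetical order a printed index would have to fix is just one query among several.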

In fact, the significance of the "relational" part of the name "relational database" is not captured by the "matrix straightjacket" description applied by Townsend et al. By "relational" we draw attention to the ability of the model to capture and then exploit connections between the different objects that it describes. Figure II, for example, shows a small subset of the tables defined for our Prosopography of the Byzantine World (PBW) project, and shows some of the kinds of data stored there about persons, "factoids", (primary) sources, and (geographic) locations. PBW records statements made by different primary sources as "factoids" (see Bradley and Short 2003 for considerably more detail about factoids and the implications of the factoid approach for this and other similar projects), and each factoid is explicitly linked to the spot in the source from which it was derived. The lines drawn between boxes show the connections between the data types that the system knows about. For example (because of the intermediate table between Factoid and Person) it is possible to associate one or more persons with each recorded factoid. The diagram shows that the database records which primary source contains each factoid, and that it is possible to associate more than one location with each factoid as well. These relationships allow the system to select or order factoids equally readily by the persons they are associated with, by the sources in which they appear, or by the locations in which they occurred.
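A minimal sketch of such a structure, again using sqlite3, may make the role of the intermediate tables clearer. It is not the PBW schema: every table name, column name and row below is a hypothetical stand-in chosen only to mirror the description above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE source   (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE person   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE location (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE factoid  (id INTEGER PRIMARY KEY,
                       source_id INTEGER NOT NULL REFERENCES source(id),
                       source_ref TEXT,          -- spot in the source
                       summary TEXT NOT NULL);
-- Intermediate tables: one factoid may involve several persons and locations.
CREATE TABLE factoid_person   (factoid_id INTEGER REFERENCES factoid(id),
                               person_id INTEGER REFERENCES person(id));
CREATE TABLE factoid_location (factoid_id INTEGER REFERENCES factoid(id),
                               location_id INTEGER REFERENCES location(id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Invented placeholder data.
conn.executescript("""
INSERT INTO source VALUES (1, 'Source A'), (2, 'Source B');
INSERT INTO person VALUES (1, 'Person X'), (2, 'Person Y');
INSERT INTO location VALUES (1, 'Place M'), (2, 'Place N');
INSERT INTO factoid VALUES (1, 1, 'p. 10', 'X meets Y'), (2, 2, 'p. 3', 'Y travels');
INSERT INTO factoid_person VALUES (1, 1), (1, 2), (2, 2);
INSERT INTO factoid_location VALUES (1, 1), (2, 1), (2, 2);
""")


def factoids_by(link_table, link_column, target_id):
    """Factoid ids reached through one of the two intermediate tables."""
    sql = (
        f"SELECT f.id FROM factoid f JOIN {link_table} l ON l.factoid_id = f.id "
        f"WHERE l.{link_column} = ? ORDER BY f.id"
    )
    return [row[0] for row in conn.execute(sql, (target_id,))]


for_person_x = factoids_by("factoid_person", "person_id", 1)
for_person_y = factoids_by("factoid_person", "person_id", 2)
for_place_n = factoids_by("factoid_location", "location_id", 2)
for_source_a = [r[0] for r in conn.execute(
    "SELECT id FROM factoid WHERE source_id = 1 ORDER BY id")]
```

Selecting by person, by location or by source takes the same kind of query in each case; no one of the three relationships is built into the layout of the data more deeply than the others.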

Now, XML-based models for representing this data can manage a similar flexibility of expression, but the primary way of representing associations between elements in XML (containment – the nesting of elements within each other) is not sufficient. One might group factoids by source, for example, and order them in source order as well – having, say, a set of <SOURCE> elements, and inside each source element the <FACTOID> elements that belong to the containing source. However, having used containment to express the relationship of factoids to source, one obviously cannot simultaneously use containment to express the relationships between these same factoids and persons, or between factoids and locations. For material like this, in fact, the tree model that element containment implies is, by itself, insufficient to represent the data. There is no "document node" in this data and no natural ordering of the material that corresponds to a document order.

SGML and XML allow one to use IDs and IDREFs to assert links between different element instances, and these can be pressed into service to represent the associations we show naturally in the relational database model. One could have, for example, a list of XML elements containing person information, a separate list for factoid information, a source list, and a location list. IDs and IDREFs would express the links between the particular persons, factoids, locations and sources. One would uniquely identify each factoid, location, person and source object by assigning each occurrence an ID in the corresponding <FACTOID>, <LOCATION>, <PERSON> or <SOURCE> element. Then the link between a person and factoids could be made by providing a list of IDREFs in each person element that identifies the set of factoids associated with him/her. Similar techniques would be used to link the other data together.
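The arrangement just described might be sketched as follows. The element names follow the text, but the attribute names (id, factoids, source, locations), the <DATA> wrapper and the sample content are assumptions. Note also that Python's ElementTree does not read a DTD, so treating these attributes as ID and IDREFS is a convention of this sketch rather than something the parser knows.

```python
import xml.etree.ElementTree as ET

# Four flat lists of elements; links are expressed by id values, not by nesting.
DATA = """
<DATA>
  <SOURCE id="s1"><TITLE>Source A</TITLE></SOURCE>
  <LOCATION id="l1"><NAME>Place M</NAME></LOCATION>
  <FACTOID id="f1" source="s1" locations="l1">X meets Y</FACTOID>
  <FACTOID id="f2" source="s1" locations="l1">Y travels</FACTOID>
  <PERSON id="p1" factoids="f1"><NAME>Person X</NAME></PERSON>
  <PERSON id="p2" factoids="f1 f2"><NAME>Person Y</NAME></PERSON>
</DATA>
"""

root = ET.fromstring(DATA)
# Index built by hand: the role a DTD-aware processor's ID table would play.
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}


def linked(element, attribute):
    """Follow a whitespace-separated list of id references."""
    return [by_id[ref] for ref in element.get(attribute, "").split()]


def factoids_of(person_id):
    """Person to factoids: follows the references stored on the PERSON element."""
    return [f.text for f in linked(by_id[person_id], "factoids")]


def persons_in(factoid_id):
    """Factoid to persons: the reverse direction needs a scan of every PERSON."""
    return [p.get("id") for p in root.iter("PERSON")
            if factoid_id in p.get("factoids", "").split()]
```

The link is easy to follow in the direction in which it was recorded; going the other way requires either a scan, as here, or a second set of references that must be kept in step with the first.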

Using ID/IDREF and containment in this way is, indeed, a fully adequate representation in XML of both the basic data and the relationships between them. A similar approach, linking highly structured material with narrative text, is indeed demonstrated in one of the Feature Structure examples shown in TEI P3 and P4. The downside to making use of ID/IDREFs to indicate linkage is that, although XML supports the explicit use of linking in this way, the use of ID/IDREFs is definitely a "second class" association technique compared to element containment. Furthermore, it is much less central to the design process of an XML DTD than linking is to design in the relational model. This is shown in a number of ways. The DTD language allows IDs and IDREFs to be specified but provides no way to further limit the kind of links they might be asked to represent. XML editing tools, although they may be able to enforce the uniqueness requirement for XML ID attributes, are very poor at making use of this linking information in other ways. The XPath query language does support element selection using ID/IDREF links, but does so in ways that are significantly more awkward than selection based on containment alone. Because element containment is the first-class way to represent relationships between elements, there has been recent work to explore the limits of the tree-oriented structure for non-tree-like data (Shanthi and Venkatesan 2003). Others have worked to extend tree-oriented XPath to handle the more general situation of navigating graph-like data (see Cassidy 2003).
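To illustrate the point about DTDs not limiting what a link may point at, here is a sketch of the kind of check a project would have to write for itself. The rule table, attribute names and sample data are all hypothetical. In the sample, a person's factoid list includes the id of a location; since an IDREFS value is valid whenever each token matches some ID in the document, a DTD declaring these attributes would not object.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: attribute name -> element type its references must reach.
EXPECTED_TARGET = {"factoids": "FACTOID", "source": "SOURCE", "locations": "LOCATION"}


def check_links(xml_text):
    """Report references that are dangling or that reach the wrong kind of element."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    problems = []
    for el in root.iter():
        for attribute, wanted in EXPECTED_TARGET.items():
            for ref in el.get(attribute, "").split():
                target = by_id.get(ref)
                if target is None:
                    problems.append(
                        f"{el.tag} {el.get('id')}: {attribute}={ref} points nowhere")
                elif target.tag != wanted:
                    problems.append(
                        f"{el.tag} {el.get('id')}: {attribute}={ref} "
                        f"reaches {target.tag}, expected {wanted}")
    return problems


SAMPLE = """
<DATA>
  <SOURCE id="s1"/>
  <LOCATION id="l1"/>
  <FACTOID id="f1" source="s1" locations="l1"/>
  <PERSON id="p1" factoids="f1 l1"/>
</DATA>
"""

problems = check_links(SAMPLE)
for line in problems:
    print(line)
```

In a relational design the equivalent constraint is part of the schema itself, since a foreign key names the table it refers to.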

There is evidence in the design of XML Query that the importance of efficiently supporting links between different hierarchies is being recognised -- not surprising, since XML Query's design team contained several people with a relational database background. Very recent work in the development of XML databases – perhaps inspired by XML Query – shows that there too developers are beginning to understand that links between different element hierarchies – expressed using IDREFs or in other ways – need to be recognised as "first-class" information in the system in the same way that element containment already is. Hopefully, in time XML databases will be designed in such a way that processing involving links between elements will work as efficiently with potentially large amounts of data as processing based on containment.

The XML development world, then, seems to be on the way to recognising that the modelling of material in XML needs more than the OHCO model and containment. We believe, however, that people working on text-based projects would benefit from designing XML-based projects with more awareness of this issue as well, even if, in the end, a relational database is not needed. We have seen examples of situations where, by designing the data representation by DTD alone, the project principals missed important relationships between materials that would have been revealed if some "entity-oriented" design of the kind undertaken in the building of databases had also been done. In our presentation we will expand on the issues mentioned above, and discuss some design strategies we have developed to handle the design of mixed document/data projects.

References

1. Alvarado, Rafael C. "Of Media, Data, Documents: An argument for the importance of Relational Technology to the project of Humanities Computing" in ACH-ALLC Conference Proceedings 1999: ACH/ALLC 1999.
2. Andermann, Jens. (2001) "The Museu Nacional at Rio de Janeiro" in Relics and Selves: Iconographies of the National in Argentina, Brazil and Chile, 1880-1890. Online at http://www.bbk.ac.uk/ibamuseum.
3. Bradley, John and Short, Harold (2003) "Texts into databases: The Evolving Field of New-style Prosopography", given at the ACH/ALLC conference, Athens, Georgia, 2003. Online at http://pigeon.cch.kcl.ac.uk/docs/papers/georgia1/.
4. Cassidy, S. (2003), "Generalizing XPath for directed graphs." in Proceedings Extreme Markup Languages 2003. Online at: http://www.idealliance.org/papers/extreme03/html/2003/Cassidy01/EML2003Cassidy01.html.
5. DeRose, Steven J., Durand, David G., Mylonas, Elli, Renear, Allen H, "What is Text, Really?" in Journal of Computing in Higher Education, Winter 1990, Vol. 1 (2), 3-26.
6. DeRose, Steven J., "Further Context for 'What is Text, Really?'" in ACM SIGDOC Asterisk Journal of Computer Documentation, August 1997, Vol 21:3, pp 40-31.
7. Greenstein, D. I. (1994). A Historian’s Guide to Computing, Oxford: Oxford University Press. pp. 268.
8. Orlando Project (1998) "The Orlando Project: The Orlando Project and the Question of Delivery in XML" at Markup Technologies '98, Chicago. November 1998. Online at: http://www.ualberta.ca/ORLANDO/presentations/ACH_1999/index.htm.
9. Shanthi, K. and Venkatesan S.K. (2003) "Gliding down from graphs to trees: an attempt to bottle geometry and chemical content" in Proceedings Extreme Markup Languages 2003. Online at: http://www.idealliance.org/papers/extreme03/html/2003/Venkatesan01/EML2003Venkatesan01.html.
10. Sperberg-McQueen, C.M. and Burnard, L. (1994). Guidelines for Electronic Text Encoding and Interchange (TEI P3), ALLC, ACH, ACL 1994.
11. Townsend, Sean et al. (1999) AHDS Guides to Good Practice: Digitising History. Oxford: Oxbow Books. Online at http://hds.essex.ac.uk/g2gp/digitising_history/.

Conference Info

Complete

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC
