Practical extraction of meaning from markup using XSLT

Paper

Authorship

  1. C. Michael Sperberg-McQueen
     Black Mesa Technologies LLC; Massachusetts Institute of Technology; World Wide Web Consortium (W3.org)

  2. Claus Huitfeldt
     Department of Philosophy, University of Bergen

This paper takes up ideas presented by Simons (1997 and 1999), Welty and Ide (1999), Ramalho et al. (1999), and Sperberg-McQueen et al. (2000) and describes software which provides a practical instantiation of their ideas. By applying their ideas in practice we will be much better able to test the predictions made by these authors, and in some cases to deepen and extend their results.

Our immediate area of concern is the problem of providing a clear, explicit account of the meaning and interpretation of markup. Scores of projects in humanities computing and elsewhere assume implicitly that markup is meaningful, and use its meaning to govern the processing of the data, but it proves remarkably difficult to find, in the literature, any straightforward account of how one can go about interpreting markup in such a way as to draw all and only the correct inferences from it.

Simons (1997) exemplifies one approach to this problem. He describes a system for conceptual modeling, based on an underlying object-oriented knowledge-base system (CELLAR), and proposes that use of this and similar systems will make possible a better and easier exploitation of the information in marked up texts. For the kind of knowledge base Simons describes to work well, it should be able to export, and equally to import, documents in suitably marked up form. The former is easy enough to accomplish, but the latter is more challenging. Simons (1999) describes a specific technique for translating from marked up documents into the knowledge-base system described earlier, using architectural forms as defined in ISO 10744 (HyTime). The notion of meaning makes no explicit appearance in Simons's work, but the conceptual model at the heart of his discussion plays a similar role; one might regard the translation from a marked up document into the objects of the knowledge base as a way of making explicit the meaning of the document.
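
For readers unfamiliar with the technique, the fragment below sketches how such a mapping might look; the architecture name cellar and the form name PersonName are invented for this sketch and are not taken from Simons's actual system. In the architectural-forms idiom, each element type carries a fixed attribute, named after the architecture, whose value names the architectural form onto which the element is mapped:

  <!-- Hypothetical architectural-form mapping (SGML DTD fragment).
       An architecture-aware parser would present every persName
       element to the client application as an instance of the
       architectural form PersonName. -->
  <!ELEMENT persName - - (#PCDATA)>
  <!ATTLIST persName
            cellar NAME #FIXED "PersonName">

Because the mapping is declared element type by element type, it can rename elements and discard material, but it cannot restructure the document freely; this is the limitation taken up below.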

As a technique for capturing the meaning of the elements and attributes in a document type definition (DTD), however, architectural forms have the theoretical disadvantage that they cannot perform arbitrary transformations on the input document. (A precise description of the tree transformations expressible by architectural forms has long been a desideratum, but has never been given as far as we are aware.) On the practical side, architectural forms suffer from the fact that only one of the many SGML and XML parsers available has ever implemented them.

Welty and Ide (1999) make a somewhat more explicit link between the meaning of markup and the representation of documents in a knowledge-base system. They outline, in general terms, a program for applying knowledge-management systems to marked up text -- or, viewed the other way round, a program of representing the information in documents in knowledge-management systems. They show some examples of how knowledge stored in a knowledge-management system (in their case, CLASSIC) can be used to make simple inferences, and suggest that such inferences can improve search and retrieval operations against the documents. Their examples of inference as a way of enriching information suffer, however, from implausible assumptions. (They assume, for example, that no authors are corporate authors, so that any name given as that of an author may be inferred to be the name of a person [p. 60]. They also assume that all senders and recipients of government documents are government officials, which would suggest that private citizens are never the originators, or the addressees, of such documents.) They are vague, also, in their description of how they propose to transfer the information captured by markup into the knowledge-management system, or how a practical query system is to be built, and their examples of queries which would in their view be made possible, or much easier, by the use of this system are sketchy.

Ramalho et al. (1999) describe the application of a knowledge-representation system (CAMILA) to questions of quality control. They describe a method of expressing constraints on document elements in a general form (for example, the constraint that the birth date of an individual should fall before the death date of that individual) and associating those constraints with the declaration of appropriate elements in a DTD. They also describe translating information from document instances into objects in the knowledge system, so that the constraints can be checked. But they provide no details of these processes, and they do not appear to contemplate a full formal representation of the meaning of the document and its markup in the knowledge-representation system.

Sperberg-McQueen et al. (2000) describe at some length a 'straw man' proposal for defining the proper interpretation of markup in a given markup language. In particular, they identify the meaning of markup in a document as the set of inferences authorized by it, and propose that it ought to be possible to generate those inferences from the document mechanically, following certain simple rules. Having set the straw man up, they then proceed to dismantle it, noting a number of problems in the rules they have proposed for generating the inferences. They sketch in general terms how a better account of meaning and interpretation in markup could be constructed, but leave the actual construction as an exercise for the reader.

All of these papers have in common that they suggest that one way, and perhaps the best way, to make explicit the meaning of the markup in a document is to translate it into some formalism designed for knowledge representation, whether a conventional knowledge-representation system, a new knowledge-management system designed in part for textual material, or a traditional logical formalism. They also have in common a remarkable reticence about the difficulties involved in working out such a translation in full for a DTD with more than a few elements (even Simons, by far the most forthcoming with details, limits himself to thirteen elements out of the hundreds in the TEI DTD). A skeptical reader might almost suspect that none of these papers reflects experience with working out a full system for translating from a DTD of realistic size into any target notation. In the long run, proposals such as those described above will have theoretical and practical significance only if their details can be worked out successfully, not only for small examples but for production DTDs with scores or hundreds of element types.

This paper describes a concrete realization of the 'framework' model proposed in Sperberg-McQueen et al. (2000), and a full specification, in the form described there, of two widely used DTDs, on the basis of which a better evaluation of that proposal should be possible. We describe notations or software implementing each part of the framework:

- a notation (specifically, an SGML/XML DTD) for expressing in a generic form (sentence skeletons, for both natural-language English sentences and sentences in a formal notation) the meaning of constructs in a markup language; we use the W3C XPointer notation for the 'deictic expressions' needed to express notions like "the contents of this element" or "the value of the type attribute on the nearest ancestor of type div"

- uses of that notation to define the meaning of elements and attributes in the TEI Lite and HTML 4.01 DTDs, in the form of sentence skeletons

- software (an XSLT transformation sheet) to generate the simple representation of XML documents in Prolog defined by Sperberg-McQueen et al. (2000); a minimal version is sketched below

- software (another XSLT transformation sheet) to apply the sentence skeletons to documents of the appropriate type and generate sentences, in Prolog or some other logical notation, expressing the inferences licensed by the markup

- software (a third XSLT transformation sheet) to read the sentence skeletons for a given DTD and generate the XSLT transformation sheet necessary to generate sentences for documents of that type
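
As an illustration of the first of these transformation sheets, the following minimal XSLT 1.0 sheet emits one Prolog fact per element and one per attribute. The predicate names element/3 and attribute/3 are simplifications made for this sketch; they do not reproduce the exact clause forms defined by Sperberg-McQueen et al. (2000).

  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <!-- One fact per element: identifier, generic identifier,
         parent identifier. -->
    <xsl:template match="*">
      <xsl:text>element('</xsl:text>
      <xsl:value-of select="generate-id()"/>
      <xsl:text>', '</xsl:text>
      <xsl:value-of select="name()"/>
      <xsl:text>', '</xsl:text>
      <xsl:value-of select="generate-id(..)"/>
      <xsl:text>').&#10;</xsl:text>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="*"/>
    </xsl:template>

    <!-- One fact per attribute: owning element, name, value. -->
    <xsl:template match="@*">
      <xsl:text>attribute('</xsl:text>
      <xsl:value-of select="generate-id(..)"/>
      <xsl:text>', '</xsl:text>
      <xsl:value-of select="name()"/>
      <xsl:text>', '</xsl:text>
      <xsl:value-of select="."/>
      <xsl:text>').&#10;</xsl:text>
    </xsl:template>
  </xsl:stylesheet>

Applied to a short TEI Lite document, this yields a database of ground facts such as element('d0e3', 'p', 'd0e1'). A production version must also serialize text nodes, record sibling order, and escape single quotes in attribute values so that the output remains legal Prolog.
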
We use XSLT to generate the sets of basic sentences describing the meaning of the markup in a document, because XSLT provides a compact, declarative, non-proprietary, and widely understood notation for document transformations. XSLT might indeed be used as the notation for describing the meaning of constructs in a given DTD: the second of the two transformation sheets mentioned above might be regarded as an explicit and declarative expression of the meaning of the DTD for which it is written. We nevertheless define a specialized DTD for specifying skeleton sentences, on the grounds that XSLT is more powerful than is necessary for this task. A weaker notation helps ensure that the specification of the skeleton sentences for a DTD is more compact, easier to understand and interpret, and computationally more tractable, than an unrestricted XSLT transformation sheet.
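
By way of illustration, a skeleton sentence for the TEI Lite abbr element might take roughly the following form. The element names skeleton, english, prolog, and deixis are simplified for this sketch and do not reproduce our DTD exactly:

  <!-- Hypothetical skeleton sentence for <abbr expan="...">. Each
       deixis element carries an XPointer-style deictic expression,
       evaluated relative to the element being interpreted when the
       skeleton is instantiated. -->
  <skeleton for="abbr">
    <english>The string <deixis ref="string(.)"/> is an
      abbreviation for <deixis ref="string(@expan)"/>.</english>
    <prolog>abbreviation(<deixis ref="string(.)"/>,
                         <deixis ref="string(@expan)"/>).</prolog>
  </skeleton>

The third transformation sheet compiles a file of such skeletons into the document-specific sheet of the second kind; since XSLT can generate XSLT output (the xsl:namespace-alias mechanism exists for precisely this purpose), that compilation step is itself an ordinary transformation sheet.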

We use Prolog as a target language for the logical statements about document structure and markup meaning, because it is a widely used and well understood formalism; time permitting, we expect also to perform experiments translating documents into other systems, including RDF (the W3C Resource Description Framework) and XTM (the XML Topic Map syntax).
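
For instance, the inference that a given abbreviation expands to 'manuscript' might be written in Prolog as abbreviation('MS', 'manuscript') and, under a hypothetical vocabulary (the mim namespace URI below is invented for this sketch), in RDF/XML as:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:mim="http://example.org/markup-meaning#">
    <!-- d0e17 stands for the generated identifier of the abbr
         element whose markup licenses the inference. -->
    <rdf:Description rdf:about="#d0e17">
      <mim:abbreviationFor>manuscript</mim:abbreviationFor>
    </rdf:Description>
  </rdf:RDF>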

The software described in this paper will be made available to conference participants as open-source software under the GNU General Public License.

References

Ramalho, José Carlos, Jorge Gustavo Rocha, José João Almeida, and Pedro Henriques. 1999. "SGML documents: Where does quality go?" Markup Languages: Theory & Practice 1.1 (1999): 75-90.

Simons, Gary F. 1997. "Conceptual Modeling versus Visual Modeling: A Technological Key to Building Consensus." CHum 30.4 (1997): 303-319.

Simons, Gary F. 1999. "Using Architectural Forms to Map TEI Data into an Object-Oriented Database." CHum 33.1-2 (1999): 85-101. Originally delivered in 1997 at the TEI 10 conference in Providence, R.I.

Sperberg-McQueen, C. M., Claus Huitfeldt, and Allen Renear. 2000. "Meaning and Interpretation of Markup." Paper delivered at ALLC/ACH 2000, Glasgow.

Welty, Christopher, and Nancy Ide. 1999. "Using the Right Tools: Enhancing Retrieval from Marked-up Documents." CHum 33.1-2 (1999): 59-84. Originally delivered in 1997 at the TEI 10 conference in Providence, R.I.
