XML Schema 1.0: A Language for Document Grammars

paper
Authorship
  1. C. Michael Sperberg-McQueen

    Black Mesa Technologies LLC, Massachusetts Institute of Technology, World Wide Web Consortium (W3.org)

Work text
The Standard Generalized Markup Language (SGML) and its offspring the Extensible Markup Language (XML) appear to be fairly well established as methods of representing texts in electronic form.[1] One of the characteristic features of SGML and XML which marks them as an advance over earlier systems of textual representation is their notion of document grammars: formal specifications of rules for distinguishing valid documents from other data streams. Document grammars prove useful in routine quality assurance (finding and cleaning up dirty data), in documentation of agreements between data providers and data consumers or of the contents of data flows, and as a means of specifying the contents of messages in client/server protocols.[2] In some respects, however, the notation defined by SGML and XML for document type definitions (DTDs) has proven to have some shortcomings.
• The use of a specialized notation rather than SGML or XML itself means that standard tools like XSLT and XPath processors cannot be used straightforwardly to work with DTD files.
• The availability of data typing for attributes, but not for #PCDATA content of elements, introduces an unnecessary complication and lack of parallelism into the comparison of elements and attributes.
• From a programming-language or database point of view, the selection of data types available for attributes may charitably be described as eccentric: it has strings (more or less) and various abstrusely specialized forms of tokens, but lacks integers, floating-point numbers, dates, and other standard types.
• Although a number of published DTDs (e.g. that of the TEI) rely explicitly on notions of class and inheritance similar to those used in object-oriented systems, DTD notation lacks explicit support for inheritance.
• Even if DTD notation did support inheritance, there is no standard way for applications to ask SGML/XML systems for information about the DTD used to validate a document.
• DTD notation does a poor job of supporting XML namespaces, which are increasingly important as a means of supporting compound documents and the mixture of different XML vocabularies in the same document.
For these and other reasons, there has in recent years been a good deal of interest in new languages for specifying document grammars [Bourret et al. 1999, Bray et al. 1998, Frankston/Thompson 1998, Layman et al. 1998, OASIS 2001]. XML Schema 1.0 is a non-proprietary schema language developed by the World Wide Web Consortium; work began in 1998, the specification became a W3C Recommendation in May 2001 [W3C 2001], and further development continues today. This paper will offer a brief introduction to XML Schema 1.0 and describe its salient features.
Unlike DTDs but like most recent schema languages, XML Schema 1.0 uses an XML vocabulary, rather than an ad hoc specialized non-XML notation, to represent document grammars. This makes XML Schema documents more verbose than equivalent schemas in DTD notation but also makes them much more easily processable.
XML Schema provides explicit support for XML namespaces and for combining XML vocabularies from different namespaces into a single composite schema. Given the increasing use of namespaces to minimize name conflicts between vocabularies, the inability of DTDs to handle this task adequately has become an increasingly distressing deficit.
DTDs intermingle several functions: in addition to defining constraints on the logical structure of marked-up documents, they also include entity declarations, which affect the initial scanning of the XML data stream. XML Schema, by contrast, assumes that a standard XML processor has already processed the XML document before schema validation is started: the input to an XML Schema validator is not an XML document in the strict sense, but an XML information set, which may be produced by parsing an XML document or by other means, such as the construction of a data structure in memory through function calls to an API. The output of an XML Schema validator is the same information set, augmented with information about the validity of each element and attribute in the document and about the validation episode itself. Defining schema validation as a mapping from an input information set to an output information set has advantages for the conceptual clarity of the specification, but it has also proven unpopular with some users, because it means that DTD notation must still be used to declare human-readable names for special characters, and because there is no prescribed XML form for the additional information about validity and datatyping produced by the XML Schema validator.
XML Schema provides a basic set of predefined “simple” datatypes, which can be associated with attribute values or with elements whose content is a simple character string without sub-elements. In addition to the legacy types inherited from XML, XML Schema provides types which correspond to those most commonly found in programming languages and database management systems: exact decimal numbers and integers, floating-point and double-precision numbers, dates and times (in the standard notation defined by ISO 8601), and some other more specialized datatypes. Schema authors can define new simple types by restricting existing ones in certain well-defined ways. They cannot, however, create new primitive types; this has advantages for interoperability and disadvantages for the expressive power of the language. The TEI date element, for example, can use the XML Schema date type to describe the value of its value attribute, which is required to use the ISO standard date format, but not to describe its content, which also denotes a date but does not use the standardized notation.
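
The derivation of simple types by restriction, and the asymmetry in the date example above, can be illustrated with a short schema fragment. This is a minimal sketch only: the type name centuryType and the century attribute are hypothetical, and the date element is loosely modeled on the TEI element rather than copied from any TEI schema (in particular, it is simplified to plain character content).

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical simple type, derived by restricting the
       built-in xs:integer to the range 1 through 21. -->
  <xs:simpleType name="centuryType">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="1"/>
      <xs:maxInclusive value="21"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Simplified date element: the value attribute is typed as an
       ISO 8601 date, but the content can only be typed as a string,
       since a free-form date is not in the xs:date lexical space. -->
  <xs:element name="date">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="value" type="xs:date"/>
          <xs:attribute name="century" type="centuryType"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

Under this sketch, an instance such as <date value="1848-03-13" century="19">13 March 1848</date> would be valid: the attribute values are checked against their datatypes, while the content is accepted as an arbitrary string.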
In addition to simple types, schemas can also define complex types, for elements which can contain sub-elements; complex types correspond to the content models and attribute declarations of DTDs. From object-oriented systems, however, XML Schema has adopted the concept of class inheritance: it is possible to derive new complex types from existing complex types, just as it is possible to derive new object classes from existing classes in an object-oriented programming language. Experience with DTDs shows that two quite separate kinds of inheritance may be needed for document grammars: one in which the derived type inherits some properties of its content model and attributes from the ancestor types, and another in which what is inherited is the ability of an element to occur in particular locations. (The TEI models these two different kinds of inheritance by distinguishing attribute-classes and model-classes.)
Perhaps the most important innovation in XML Schema 1.0 is that schema-based validation provides much more information than the simple yes/no is-this-valid? information provided by DTD-based validation. Information about the simple or complex type assigned to an attribute or element is provided by an XML Schema processor as part of the standard post-schema-validation information set (PSVI). The validity of each element and attribute is checked and recorded separately; this entails a distinction between the concept of full validity, which is recursive and requires that all descendants also be fully valid, and of local validity, which is not recursive.
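
One way to picture the two kinds of inheritance described above is the following sketch, which uses type derivation by extension for the inheritance of content model and attributes, and a substitution group for the inheritance of permitted locations. The element and type names (name, persName, nameType, persNameType) are invented for illustration and are not taken from the TEI or from the specification.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Base complex type: a content model plus one attribute. -->
  <xs:complexType name="nameType">
    <xs:sequence>
      <xs:element name="surname" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="key" type="xs:NMTOKEN"/>
  </xs:complexType>

  <!-- Derived type: inherits surname and key from nameType,
       then appends forename to the end of the content model. -->
  <xs:complexType name="persNameType">
    <xs:complexContent>
      <xs:extension base="nameType">
        <xs:sequence>
          <xs:element name="forename" type="xs:string"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- name heads a substitution group; persName may therefore
       appear wherever a content model refers to name. -->
  <xs:element name="name" type="nameType"/>
  <xs:element name="persName" type="persNameType"
              substitutionGroup="name"/>

</xs:schema>
```

The two mechanisms are independent in principle, although a member of a substitution group must have a type derived from (or identical to) the type of the head element, as persNameType is here.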
Since schema validation need not start at the root element of the document, and since a schema can direct that the contents of particular elements are not to be validated, or that the elements encountered in particular contexts need not be declared, XML Schema 1.0 can be said to have introduced a coherent concept of partial validation; whether it can be exploited to handle problems of structural variation in historical documents [Birnbaum 1997, Birnbaum/Mundie 1999] remains to be explored. The paper will conclude with a brief account of current work on XML Schema within the World Wide Web Consortium.
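
The schema-directed side of partial validation mentioned above can be sketched with element wildcards. In this hypothetical fragment (the element names transcription, title, body, and anomalies are assumptions made for the example), processContents="lax" asks the processor to validate only those children for which it can find a declaration, and processContents="skip" asks it to require nothing beyond well-formedness.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="transcription">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>

        <!-- lax: children with a known declaration are validated;
             undeclared children are accepted without complaint. -->
        <xs:element name="body">
          <xs:complexType>
            <xs:sequence>
              <xs:any processContents="lax"
                      minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>

        <!-- skip: the contents are not validated at all. -->
        <xs:element name="anomalies" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:any processContents="skip"
                      minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

Whether wildcards of this kind are an adequate response to structural variation in historical documents is exactly the open question raised above; the fragment shows only the mechanism.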
NOTES
1. This is not to ignore the recent work done by Patrick Durusau and Matthew Brooke O’Donnell on Just-In-Time Trees [Durusau/O’Donnell 2002a, 2002b], by Wendell Piez and Jeni Tennison on LMNL (Layered Markup and Annotation Language) [Piez/Tennison 2002], by Andreas Witt on the representation of concurrent markup structures in logical form [Witt 2002], or by Claus Huitfeldt and C. M. Sperberg-McQueen on TexMECS (Trivially Extended MECS (Multi-Element Code System)) [Sperberg-McQueen/Huitfeldt 2001]. All of these projects retain their interest, but at the moment most appear to be experimental systems rather than fully developed alternatives to SGML and XML.
2. This last usage is now prominent in the Simple Object Access Protocol and other Web-services work, but the ideas predate the current interest in Web services [Catteau 1999].
REFERENCES
Birnbaum, David J. “In defense of invalid SGML”. Paper given at ACH/ALLC 1997. http://clover.slavic.pitt.edu/~djb/achallc97.html
Birnbaum, David J., and David A. Mundie. “The problem of anomalous data: A transformational approach”. In Markup Languages: Theory & Practice 1.4 (1999): 1–19.
Bourret, Ronald, et al., ed. “Document Definition Markup Language (DDML) Specification”, Version 1.0. Submission to the World Wide Web Consortium, 19-Jan-1999. http://www.w3.org/TR/NOTE-ddml
Bray, Tim, Charles Frankston, and Ashok Malhotra, ed. “Document Content Description for XML”. Submission to the World Wide Web Consortium, 31-July-1998. http://www.w3.org/TR/1998/NOTE-dcd-19980731
Catteau, Tom. “An SGML system for the budget of the European Union”. In Markup Languages: Theory & Practice 1.3 (1999): 41–59.
Cowan, John, and Richard Tobin, ed. “XML Information Set”. W3C Recommendation, 24 October 2001. [Cambridge, Sophia-Antipolis, Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/xml-infoset/
Davidson, Andrew, et al. “Schema for Object-oriented XML 2.0”. W3C Note, 30 July 1999. http://www.w3.org/TR/NOTE-SOX/
Durusau, Patrick, and Matthew Brooke O’Donnell. “Visualizing overlapping hierarchies in textual markup”. Paper given at ALLC/ACH 2002, Tübingen, July 2002. http://www.uni-tuebingen.de/cgi-bin/abs/abs?propid=100
Durusau, Patrick, and Matthew Brooke O’Donnell. “Coming down from the trees: Next step in the evolution of markup?” Paper given at Extreme Markup Languages 2002, Montréal, August 2002. http://www.idealliance.org/papers/extreme02/author-pkg/2002/Durusau01/EML2002Durusau01.zip
Frankston, Charles, and Henry S. Thompson, ed. “XML-Data reduced”, Draft Version 0.21, 3 July 1998. http://www.ltg.ed.ac.uk/~ht/XMLData-Reduced.htm
Huitfeldt, Claus, and C. M. Sperberg-McQueen. “TexMECS: An experimental markup meta-language for complex documents”. [Working paper of the MLCD project at the University of Bergen]. Bergen: [n.p.], 2001. http://www.hit.uib.no/claus/mlcd/papers/texmecs.html
ISO (International Organization for Standardization). ISO 8601. Representations of dates and times. 1988–06–15. Available at: http://www.iso.ch/markete/8601.pdf
Layman, Andrew, et al. “XML-Data”. W3C Note [Acknowledged submission], 05 Jan 1998. http://www.w3.org/TR/1998/NOTE-XML-data-0105
OASIS (Organization for the Advancement of Structured Information Standards). “RELAX NG Specification”. Committee Specification, 3 December 2001. http://www.oasis-open.org/committees/relax-ng/spec-20011203.html
Piez, Wendell, and Jeni Tennison. “The Layered Markup and Annotation Language (LMNL)”. Paper given at Extreme Markup Languages 2002, Montréal, August 2002. Project home page at http://www.lmnl.org/
Text Encoding Initiative. Guidelines for electronic text encoding and interchange (TEI P4), ed. C. M. Sperberg-McQueen and Lou Burnard. XML-compatible edition prepared by Syd Bauman, Lou Burnard, Steven DeRose, and Sebastian Rahtz. Oxford, Providence, Charlottesville, Bergen: TEI Consortium, 2002.
Witt, Andreas. “Meaning and interpretation of concurrent markup”. Paper given at ALLC/ACH 2002, Tübingen, July 2002. http://coli.lili.uni-bielefeld.de/Texttechnologie/Forschergruppe/prolog/allc2002-witt.html
W3C (World Wide Web Consortium). “XML Schema Part 0: Primer”, ed. David Fallside. “XML Schema Part 1: Structures”, ed. Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. “XML Schema Part 2: Datatypes”, ed. Paul V. Biron and Ashok Malhotra. W3C Recommendation, 2 May 2001. [Cambridge, Sophia-Antipolis, Tokyo: W3C] http://www.w3.org/TR/xmlschema-0/, http://www.w3.org/TR/xmlschema-1/, http://www.w3.org/TR/xmlschema-2/

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2003
"Web X: A Decade of the World Wide Web"

Hosted at University of Georgia

Athens, Georgia, United States

May 29, 2003 - June 2, 2003

Conference website: http://web.archive.org/web/20071113184133/http://www.english.uga.edu/webx/

Series: ACH/ICCH (23), ALLC/EADH (30), ACH/ALLC (15)

Organizers: ACH, ALLC
