Automatic DTD simplification by examples

Authorship
  1. Alejandro Bia

    Libraries - University of Alicante

  2. Rafael C. Carrasco

    Lenguajes y Sistemas Informáticos - University of Alicante


This paper describes a method for the automatic generation of simplified DTDs from a source DTD and a set of sample marked-up files. The purpose is to create the minimal DTD with which the sample set of files complies. In this way, new files can be created and parsed using this simplified DTD while still remaining compliant with the original, more general DTD. The simplified DTD can be used to make the task of markup easier, especially for inexperienced XML writers.

The resulting tool was used at the Miguel de Cervantes Digital Library (http://cervantesvirtual.com/) to obtain simplified versions of the TEI DTD (Sperberg-McQueen and Burnard, 1994). This work is part of a larger project in the field of text markup and derived applications (Bia and Pedreño, 2000).

Motivation

"Having standardized-XML-vocabularies for common things allows developers to reuse existing DTDs, saving the cost of developing custom DTDs. Custom DTDs isolate their users and applications from others that might otherwise be able to share commonly formatted documents and data. Shared DTDs are the foundation of XML data interchange and reuse" (Hunter, 2000).

Saving the cost of developing our own DTD and the ability to interchange texts are some of the reasons why the teixlite.dtd (the XML version of the SGML teilite.dtd of the TEI encoding scheme) was chosen at the Cervantes digital library. However, teixlite is still too complex for markup beginners. Our markup team is composed mostly of humanists with some computer skills, who appreciate having their computer work simplified as much as possible.

On the other hand, our XML documents do not use, and do not need, all the markup options provided by the teixlite.dtd. A simpler DTD was therefore needed to ease markup tasks and to prevent the use of unwanted markup options. At the same time, we wanted our files to remain TEI compliant and to benefit from sharing a common DTD with other international digitization projects. In brief, we needed a simpler, TEI-compliant DTD that is a valid subset of the teixlite.dtd.

We started by defining the kinds of modifications we would allow ourselves to make to the TEIlite DTD, in order to make it simpler to use while keeping our documents TEI-compatible (with minor exceptions). In this sense we allowed the following changes, illustrated by the hypothetical declarations after the list:

To add normalized values to some attributes in order to force the use of fixed values instead of free data entry.
To add new attributes, but only in a few necessary cases (this is the only exception that may keep our files from being TEI compliant; however, these added attributes could easily be eliminated whenever we want to comply with the TEI standard).
To restrict element inclusion rules (we wanted to eliminate the possibility of including certain elements at certain levels of the markup).
To make some optional elements/attributes mandatory, to enforce our specific markup norms.
To eliminate optional elements we will not use, in order to simplify the markup task and to avoid possible errors (basically, to remove the features we decided not to use).
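For instance, the following hypothetical declarations (invented for illustration, not copied from the TEI DTD) show the flavour of these changes: a free-text attribute is replaced by a closed list of normalized values and made mandatory, and an optional child element is removed from a content model.

  <!-- Hypothetical declarations in the general DTD: -->
  <!ELEMENT quote (#PCDATA | p | lg)*>
  <!ATTLIST quote rend CDATA #IMPLIED>

  <!-- Restricted declarations for local use: -->
  <!ELEMENT quote (#PCDATA | p)*>
  <!ATTLIST quote rend (italics | indented) #REQUIRED>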
It is clear that doing the simplifications by hand is tedious and error-prone. Constructing a set of sample documents representative of all the types of documents we need to mark up, together with a program that simplifies the DTD automatically, alleviates this task.

Previous work

Document types are defined by extended context-free grammars whose production right-hand sides are unambiguous regular expressions (Brüggemann-Klein and Wood, 1998). Previous work has addressed the task of identifying a DTD from examples. A common difficulty in this approach is the need to find a correct degree of generalization. Some practical tools, such as FRED (Shafer, 1995), let users customize their preferred degree of generalization. Ahonen builds a (k,h)-testable model (Ahonen, 1995; Ahonen, 1997; Ahonen, Mannila, and Nikunen, 1997).

Young-Lai and Tompa (2000) rely on a stochastic approach to control overgeneralization, based in turn on the algorithm by Carrasco and Oncina (1999). Presumably, the stochastic approach needs large collections of hand-tagged documents.

The Pizza Chef (Burnard, 1997) is a tool to generate TEI-compliant DTDs suited to a particular task. In this case, however, only predefined tasks and TEI DTDs are allowed.

Objectives

In contrast, a general DTD that defines a global frame which a whole set of files must fulfill provides a natural way to avoid overgeneralization. In this sense, any particularized, narrow-scope DTD should not accept any document that is not accepted by the general, wide-scope DTD.

Therefore, the objective of our approach is to automatically select only those DTD features that are used by a set of valid documents (validated against the more general DTD) and eliminate the rest, obtaining a narrow-scope DTD which defines a subset of the original markup scheme. This "pruned" DTD can be used to build new documents of the same markup subclass, which in turn still comply with the original general DTD. Needless to say, working with a simpler DTD is easier.
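In other words, if r_e denotes the content model (a regular expression) of element e in the general DTD and r'_e its counterpart in the simplified DTD (notation introduced here only to state the property), the construction should guarantee L(r'_e) ⊆ L(r_e) for every element kept in the simplified DTD, so that every document valid with respect to the simplified DTD is also valid with respect to the general one.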

General description

For the implementation of the DTDprune toolkit we needed both an XML parser and a DTD parser. We assumed that both the XML sample files and the source DTD would be well-formed and valid, so there was no need to build validating parsers. Instead, we developed two simple parsers based on the XML BNF grammar described in (Harold, 1999). A diagram of the process is shown in the figure below.

Architecture of the DTD simplifier
As the diagram shows, the general DTD is processed to extract the structure of the markup model, from which we build Glushkov automata (Caron and Ziadi, 2000). The XML sample files are preprocessed to extract the elements used and their nesting patterns. Using the Glushkov automata that represent the regular expressions defining the possible element contents according to the general DTD, we keep track of the elements used in the sample files and mark the visited states of the automata. Finally, a simplification process takes place: it eliminates unused elements and simplifies the right-hand sides of element definitions, i.e. the regular expressions that define further nestings. The simplified DTD structure is then used to generate the new, simplified DTD.
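As a hypothetical illustration of this last step (the declarations below are invented for clarity, not taken from the TEI DTD), suppose the general DTD declares the following content model:

  <!-- Hypothetical content model in the general DTD: -->
  <!ELEMENT div (head?, (p | list | quote | table)+)>

If, inside div, the sample files use head, p and quote but never list or table, only the corresponding positions of the Glushkov automaton are visited, and the simplified DTD keeps just that part of the content model:

  <!-- Simplified content model after pruning the unvisited alternatives: -->
  <!ELEMENT div (head?, (p | quote)+)>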

Conclusions

Using this automated method, the simplified DTD can be updated immediately whenever new features are added to (or eliminated from) the sample set of XML files (modifications to files of the sample set must be validated against the general DTD). This process can be repeated to incrementally produce a final narrow-scope DTD. In this way, we use a complex DTD as a general markup-design frame to build a simpler working DTD that suits a specific project's markup needs.

Another use of this technique is to build a one-document DTD, i.e. the minimal DTD, derived from the general DTD, with which a given XML document complies.
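As a hypothetical example (both the document and the declarations are invented for illustration), consider a small file:

  <!-- A small sample document: -->
  <text>
    <body>
      <p>Some prose.</p>
      <p>More prose.</p>
    </body>
  </text>

If the general DTD declared, say, <!ELEMENT body (p | list | quote)+>, the one-document DTD derived from this file would retain only:

  <!ELEMENT text (body)>
  <!ELEMENT body (p)+>
  <!ELEMENT p (#PCDATA)>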

Another benefit of this technique is that we can produce statistics that may help markup designers improve their markup schemes. Information about the frequency of use of certain elements within others helps us detect unusual structures that could reflect markup mistakes, misuse of the DTD, or DTD features that allow unwanted generalization. This statistical data on the use of markup may help us decide whether to add new markup constraints or, on the contrary, to expand the simplified DTD.

References

C. M. Sperberg-McQueen and Lou Burnard, editors.
Guidelines for Electronic Text Encoding and Interchange (TEI P3).
Text Encoding Initiative, Chicago and Oxford, May 1994. Revised reprint, Oxford, May 1999.

Alejandro Bia and Andrés Pedreño.
The Miguel de Cervantes Digital Library: The Hispanic Voice on the WEB.
Literary and Linguistic Computing (LLC), Oxford University Press, 2000 (to appear).
Presented at ALLC/ACH 2000, The Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, 21-25 July 2000, University of Glasgow.

Anne Brüggemann-Klein and Derick Wood.
One-unambiguous regular languages.
Information and Computation, 142(2):182--206, 1 May 1998.

Keith E. Shafer.
Creating DTDs via the GB-Engine and Fred.
Technical report, OCLC Online Computer Library Center, Inc., 6565 Frantz Road, Dublin, Ohio 43017-3395, 1995.

Helena Ahonen.
Automatic generation of SGML content models.
Electronic Publishing: Origination, Dissemination, and Design, 8(2/3):195--206, June/September 1995.

H. Ahonen, H. Mannila, and E. Nikunen.
Generating grammars for SGML tagged texts lacking DTD.
Mathematical and Computer Modelling, 26(1):1--13, 1997.

H. Ahonen.
Disambiguation of SGML content models.
Lecture Notes in Computer Science, 1293:27, 1997.

Matthew Young-Lai and Frank W. M. Tompa.
Stochastic grammatical inference of text database structure.
Machine Learning, 40(2):1, 2000.

Rafael C. Carrasco and Jose Oncina.
Learning deterministic regular grammars from stochastic samples in polynomial time.
RAIRO (Theoretical Informatics and Applications), 33(1):1--20, 1999.

Lou Burnard.
The Pizza Chef: a TEI Tag Set Selector.
http://www.hcu.ox.ac.uk/TEI/pizza.html, September 1997.
(Original version 13 September 1997, updated July 1998; Version 2 released 8 Oct 1999).

David Hunter, Curt Cagle, Dave Gibbons, Nikola Ozu, Jon Pinnock, and Paul Spencer.
Beginning XML.
Programmer to Programmer. Wrox Press, 1102 Warwick Road, Acocks Green, Birmingham, B27 6BH, UK, 1st edition, 2000.

Pascal Caron and Djelloul Ziadi.
Characterization of Glushkov automata.
TCS: Theoretical Computer Science, 233:75--90, 2000.

Robert D. Cameron.
REX: XML Shallow Parsing with Regular Expressions.
Markup Languages: Theory & Practice, 1(3):61--68, Summer 1999.


Conference Info


ACH/ALLC 2001

Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001


Organizers: ACH, ALLC
