Automatic XML Mark-Up

Authorship
  1. Shazia Akhtar

    Dept. of Computer Science - University College Dublin

  2. Ronan G. Reilly

    Dept. of Computer Science - University College Dublin

  3. John Dunnion

    Dept. of Computer Science - University College Dublin

Abstract

In this paper we present a novel two-stage automatic XML markup system. The system uses Kohonen's self-organizing map (SOM) learning algorithm to arrange marked-up documents on a two-dimensional map such that nearby locations contain similar documents. It then employs an inductive learning algorithm (C5) to automatically extract and apply mark-up rules from the nearest SOM neighbours of an unmarked document. The system is designed to be adaptive, so that once the document is marked-up, it learns from its errors in order to improve accuracy. The resulting documents can be categorized on the self-organizing map, further improving the map's resolution.

1 INTRODUCTION

With the increased role of computers and computer technology in the humanities, textual mark-up has become centrally important. SGML has been the electronic text mark-up standard for some years, though XML is now a significant competitor for this role. XML provides many of the benefits of SGML but is simpler in structure and easier to implement. Before standards were developed and adopted, electronic text was encoded using idiosyncratic systems, which impeded the free exchange of texts among scholars. The Text Encoding Initiative (TEI; Plotkin & Sperberg-McQueen, 1999) addressed the proliferation of different text encoding schemes by providing guidelines for the preparation and interchange of machine-readable text. TEI lite, a subset of TEI, was designed for educational purposes. It is comparatively simple and can handle a reasonably wide variety of texts. With the rise to prominence of XML, teixlite emerged as the corresponding standard for XML. Some text-handling software (e.g., Panorama, XMetal) supports marking up documents in SGML and XML, but these tools require a considerable amount of manual intervention. There is, as yet, no tool available that can automatically mark up documents in XML.

As a first step in addressing this need, we present a novel system that automatically marks up documents in XML by combining the self-organizing map (SOM; Kohonen 1997a, 1997b) with inductive learning (C5; Quinlan, 1993, 1979). This system was developed as part of the INTENTS (Intelligent Navigation Tools for Hypertext documents) project at the Department of Computer Science, University College Dublin (UCD). The goal of INTENTS is the design of a suite of intelligent navigational tools that can be used in the construction, management, and navigation of large-scale hypertext documents. INTENTS is a project in CoSEI (Computer Science and English Initiative), a humanities computing programme in UCD. Another project in this programme involves constructing a computer-based chronology of the Irish Modernist poet, critic and art historian Thomas McGreevy, and we are using material from the McGreevy project as an initial testbed for our experiments.

2 SYSTEM ARCHITECTURE

Our system combines the self-organizing map algorithm with adaptive automatic mark-up in XML. SOM is a neural-network-based unsupervised learning algorithm that maps high-dimensional statistical data onto a lower-dimensional grid or map such that similar items (here, documents) appear close to each other on the map. The first phase of the process is the formation of the document map using the SOM algorithm, and the second phase deals with the automatic mark-up.
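
As an illustration only, the following Python sketch shows the basic SOM update on a small rectangular grid. It is not the WEBSOM implementation used in the project: the grid size, learning-rate schedule, Gaussian neighbourhood function and the toy document vectors are all assumptions made for the example, and the function names `train_som` and `best_matching_unit` are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matching_unit(grid, vector):
    """Return the grid coordinate whose weight vector is closest to `vector`."""
    return min(grid, key=lambda node: euclidean(grid[node], vector))


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small SOM; returns a dict mapping (row, col) to a weight vector."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Assumption: weights start as random values in [0, 1).
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                    # learning rate shrinks over time
        radius = max(radius0 * decay, 0.5)  # so does the neighbourhood
        for vector in vectors:
            bmu = best_matching_unit(grid, vector)
            for node, weights in grid.items():
                d = euclidean(node, bmu)    # distance measured on the grid
                if d > radius:
                    continue
                influence = math.exp(-(d * d) / (2.0 * radius * radius))
                # Pull the node's weights toward the input vector.
                for i in range(dim):
                    weights[i] += lr * influence * (vector[i] - weights[i])
    return grid


# Toy document vectors (hypothetical, e.g. normalised term frequencies).
docs = [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9], [1.0, 0.9, 0.0], [0.9, 1.0, 0.1]]
som = train_som(docs, rows=3, cols=3)
```

After training, `best_matching_unit(som, v)` gives the map location of a document vector `v`; documents with similar vectors tend to land on the same or adjacent nodes.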

Figure 1: Hybrid architecture

In the first phase, a document map is formed by applying the WEBSOM algorithm (Honkela et al., 1996) to a collection of marked-up documents. WEBSOM is a two-level architecture that applies SOM to document classification and retrieval.

Figure 2: Constructing a self-organising map

Once a map is formed, an incoming document is automatically mapped to the cluster of documents most similar to it. The proposed system then captures the mark-up information from the neighbouring documents using the C5/See5 rule-learning algorithm, and the incoming document is marked up automatically according to the extracted rule set. If the document is not related at all to the existing documents on the map, it is discarded. The system is adaptive: it learns from feedback and changes the mark-up of documents to make it more sensible and useful. The detailed architecture of the system is shown in Figure 3.

Figure 3: A detailed architecture of the system
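
The mapping step, in which an incoming document is placed on the map, its neighbours are collected, and an unrelated document is discarded, might be sketched as follows. This is a simplified illustration under stated assumptions: the map is represented as two plain dictionaries, the neighbourhood is a square of nodes around the best-matching node, and the `max_distance` threshold used to decide that a document is unrelated is hypothetical. None of these details are taken from the system itself.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def neighbours_on_map(doc_vector, node_weights, node_docs,
                      grid_radius=1, max_distance=0.5):
    """Return ids of marked-up documents near the incoming document.

    node_weights: dict mapping (row, col) to a weight vector.
    node_docs:    dict mapping (row, col) to a list of document ids.
    Returns None when the document is too far from every node
    (assumed here to mean it is unrelated and should be discarded).
    """
    bmu = min(node_weights,
              key=lambda node: euclidean(node_weights[node], doc_vector))
    if euclidean(node_weights[bmu], doc_vector) > max_distance:
        return None
    found = []
    for node in sorted(node_docs):
        # Square neighbourhood around the best-matching node.
        if max(abs(node[0] - bmu[0]), abs(node[1] - bmu[1])) <= grid_radius:
            found.extend(node_docs[node])
    return found


# Hypothetical three-node map with one marked-up document per node.
weights = {(0, 0): [0.0, 0.0], (0, 1): [0.2, 0.0], (3, 3): [1.0, 1.0]}
placed = {(0, 0): ["a"], (0, 1): ["b"], (3, 3): ["c"]}
near = neighbours_on_map([0.05, 0.0], weights, placed)
far = neighbours_on_map([5.0, 5.0], weights, placed)
```

In this sketch the documents returned for a related input would be the ones passed on to the rule learner, while a `None` result corresponds to the discard case.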

The second phase of the system is implemented as an independent automatic mark-up system. Eventually this system will be combined with the self-organizing map. The automatic mark-up system is described below in detail.

2.1 MARK-UP RULE EXTRACTION AND APPLICATION

The rule-based system comprises two main modules. The first module is an offline phase of the system, which deals with the extraction of rules from a database of letters marked-up in XML using an inductive learning approach. The second module is an online process and deals with the application of rules to any unmarked letters of a specific format to get the desired XML mark-up. We have initially considered only letters to demonstrate our ideas, as letters are one of the simpler forms of text in the corpus.

Figure 4: Automatic markup system

The system classifies elements of marked-up letters using rules that are automatically generated by the inductive learning algorithm (each piece of text enclosed between a pair of start and end tags is called an element of a document marked up in XML). All occurrences of each element are encoded using a fixed-width feature vector. The inductive learning algorithm processes these encoded vectors to develop classifiers for the elements of the XML documents. For this purpose we have selected the C5/See5 rule-learning algorithm (Quinlan, 2000). The advantages of this learning algorithm are that it is very fast, is not sensitive to missing features, can deal with a large number of features and is incremental. For our application See5.0 has generated a set of seven rules. These rules deal with different elements of XML marked-up letters.
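
To make the idea of a fixed-width feature vector concrete, the sketch below encodes each line of a letter with five features and classifies it with a few hand-written rules. Everything in it is an assumption made for illustration: the features, the thresholds, the tag names and the rules are hypothetical and are not the seven rules generated by See5.0, which would be learned from encoded training cases rather than written by hand.

```python
import re

# Hypothetical fixed-width encoding: every element yields exactly these features.
FEATURES = ("rel_position", "n_words", "has_digit",
            "ends_with_comma", "starts_with_dear")


def encode(line, index, total):
    """Encode one line of a letter as a fixed-width feature vector."""
    words = line.split()
    return (
        round(index / max(total - 1, 1), 2),   # position within the letter
        len(words),                            # length in words
        int(bool(re.search(r"\d", line))),     # contains a digit
        int(line.rstrip().endswith(",")),      # ends with a comma
        int(line.lower().startswith("dear")),  # opens like a salutation
    )


def classify(vector):
    """Hand-written stand-in for a learned rule set (illustrative only)."""
    rel_position, n_words, has_digit, ends_with_comma, starts_with_dear = vector
    if starts_with_dear:
        return "salute"
    if has_digit and rel_position < 0.3 and n_words <= 6:
        return "dateline"
    if rel_position > 0.7 and n_words <= 4 and ends_with_comma:
        return "closer"
    if rel_position > 0.7 and n_words <= 4:
        return "signed"
    return "p"


letter = [
    "12 May 1930",
    "Dear Tom,",
    "I received your letter last week and was glad of it.",
    "Yours sincerely,",
    "Thomas",
]
vectors = [encode(line, i, len(letter)) for i, line in enumerate(letter)]
labels = [classify(v) for v in vectors]
```

Because every vector has the same width, the encoded cases can be handed to a rule learner as rows of a table, one column per feature plus the element name as the class.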

The second phase of the system, which is an online process, involves the application of the extracted rules to unmarked letters to produce XML documents. Documents can be either well-formed or valid, according to the user's preference. Valid XML documents conform to the teixlite DTD. Once a letter selected by the user is marked up, it can be parsed to check the accuracy of the mark-up. If the text is not supported by the system, an error message is displayed.
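
The step from classified text to a well-formed XML document, followed by a parse to check the result, could look roughly like the Python sketch below. It uses the standard-library `xml.etree.ElementTree` module; the element names and the helper names `build_letter` and `is_well_formed` are assumptions for the example. Note that this parser checks well-formedness only and does not validate against a DTD, so checking validity against teixlite would need a validating parser.

```python
import xml.etree.ElementTree as ET


def build_letter(labelled_lines):
    """Wrap (tag, text) pairs in a minimal text/body structure.

    The tag names are supplied by the caller; the ones used in the
    example below are illustrative.
    """
    root = ET.Element("text")
    body = ET.SubElement(root, "body")
    for tag, text in labelled_lines:
        ET.SubElement(body, tag).text = text
    # Special characters such as & are escaped during serialisation.
    return ET.tostring(root, encoding="unicode")


def is_well_formed(xml_string):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


marked_up = build_letter([
    ("dateline", "12 May 1930"),
    ("salute", "Dear Tom,"),
    ("p", "Books & papers arrived safely."),
])
```

Building the document through an element tree rather than by string concatenation means that tags are always closed and reserved characters are escaped, so the output is well-formed by construction.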

Figure 5: Unmarked letter and well-formed XML markup of the same letter (produced by the system)

Figure 6: Valid XML markup

3 PERFORMANCE AND ANALYSIS

We applied the C5/See5 algorithm to 200 training and 50 test cases and achieved 98% accuracy (each case in our system represents one element from the collection of marked-up letters). We also tested it on the elements of about 20 letters and achieved an accuracy rate of 94%. The accuracy rate is the proportion of correctly marked-up elements among all elements of the tested letters. These results indicate that our approach is practical and that our system is a first step toward automatically marking up text documents in XML.
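
The accuracy measure described above amounts to a simple ratio. A minimal Python sketch, with a hypothetical function name and made-up labels that are not data from the experiments:

```python
def accuracy(predicted, expected):
    """Proportion of elements whose predicted tag matches the expected tag."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one element is required")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Made-up example: three of four elements are tagged correctly.
example = accuracy(["p", "salute", "p", "signed"],
                   ["p", "salute", "p", "p"])
```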

4 CONCLUSION

We have described a system that uses the novel technologies of self-organization and adaptive automatic mark-up. Our proposed system uses XML for the management and retrieval of hypertext documents according to the structural information preserved in them. The system marks up documents automatically by capturing the information in neighbouring documents on the self-organizing map. This information is extracted in the form of rules by the C5/See5 rule-learning algorithm. The system also learns from feedback and changes the mark-up to improve results. This functionality makes our system a novel and useful tool for electronic information exchange.

5 ACKNOWLEDGEMENTS

This project is funded by the Advanced Software Technologies Initiative (ASTI), which is funded by Enterprise Ireland’s Software Program in Advanced Technology (PAT) and the National Software Directorate.

REFERENCES

Honkela, T., Kaski, S., Lagus, K. & Kohonen, T. (1996). Newsgroup Exploration with WEBSOM method and browsing. Technical Report A32, Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo: Finland.
Kohonen, T. (1997a). Exploration of very large databases by self-organizing maps. In Proceedings of ICNN'97, International Conference on Neural Networks. PL1-PL6. IEEE Service Center: Piscataway, NJ.
Kohonen, T. (1997b). Self-Organizing Maps. Springer Series in Information Science.
Plotkin, W. & Sperberg-McQueen, C. M. (1999). Text Encoding Initiative. http://www.uic.edu/orgs/tei/
Quinlan, J. R. (2000). Data Mining Tools See5 and C5.0. http://www.rulequest.com/see5-info.html
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Quinlan, J. R. (1979). Discovering rules by induction from large collections of examples. In D. Michie (ed.), Expert Systems in the Micro Electronic Age. Edinburgh, UK: Edinburgh University Press.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2001

Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001

Series: ACH/ICCH (21), ALLC/EADH (28), ACH/ALLC (13)

Organizers: ACH, ALLC
