Developing Markup Metaschemas to Support Interoperation among Resources

paper
Authorship
  1. Gary F. Simons

    Graduate Institute of Applied Linguistics

Work text

INTRODUCTION
The work presented in this paper grows out of the EMELD project, which seeks to develop
“Electronic Metastructures for Endangered Language Data” [1]. One of the major aims stated in the project
proposal is the “formulation and promulgation of best practice in linguistic markup of texts and lexicon” [2].
The project is attempting to do this by forging a community consensus through a series of workshops [3].
The first workshop easily reached consensus that the best format for the interchange and archiving of
endangered language data is XML-based markup. It just as easily reached consensus that no single system of
XML markup could be imposed on all language resources. At the same time, there is consensus that linguists
need to be able to perform queries across data sets, even if they do not use the same markup. This paper
describes the solution that is being developed to support this kind of interoperability across resources that use
different markup systems. Before describing the details of the solution, some definitions and requirements are
elaborated.
DEFINITIONS
A markup language, like a natural language, has a lexicon, syntax, and semantics. The following terms are
used throughout this paper to refer to the descriptive artifacts that document these three aspects of markup:
• markup vocabulary: Enumerates the lexical inventory of markup: i.e., the set of elements
and attributes that are used in marking up a resource. (In practice, the vocabulary is
enumerated within the markup schema rather than in a separate document.)
• markup schema: Specifies the syntax of markup: i.e., a formal grammar defining constraints
on where elements and attributes must or may occur with respect to embedding and relative
order and on what their values may be. (This is typically realized in an XML DTD or an
XML Schema, though other mechanisms are emerging.)
• markup metaschema: Specifies the semantics of markup: i.e., a formal mapping from
elements and attributes to the linguistic concepts they represent. (This area of markup is not
as well developed as the syntactic area, but is beginning to be developed under the impetus of
the so-called Semantic Web [4].)
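As a concrete illustration of the first definition, the following minimal Python sketch enumerates the markup vocabulary actually used in a small XML resource. The sample document and its element and attribute names (lexicon, entry, form, pos, lang, id) are assumptions made for illustration; they are not taken from any EMELD schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexicon fragment; all element and attribute names are invented.
SAMPLE = """
<lexicon lang="xyz">
  <entry id="e1">
    <form>kata</form>
    <pos>noun</pos>
  </entry>
</lexicon>
"""


def markup_vocabulary(xml_text):
    """Return the sets of element names and attribute names used in a document."""
    root = ET.fromstring(xml_text)
    elements, attributes = set(), set()
    for node in root.iter():           # visits the root and every descendant
        elements.add(node.tag)
        attributes.update(node.attrib)  # adds the attribute names (dict keys)
    return elements, attributes


elements, attributes = markup_vocabulary(SAMPLE)
print(sorted(elements))    # ['entry', 'form', 'lexicon', 'pos']
print(sorted(attributes))  # ['id', 'lang']
```

This only recovers the vocabulary observed in one document instance. A schema may license elements and attributes that a given resource never uses.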
REQUIREMENTS
Given that the markup of language data will be in XML, what is the nature of the markup vocabulary? The
following is the basic requirement on the markup vocabulary, along with consequent features of the
implemented solution:
• Linguists need to be able to do more than just read texts and lexicons in display format; they
also need to be able to manipulate the content by selectively accessing individual items of
information.
1. The archival form of electronically encoded resources should not follow a strategy of
presentational markup; that is, the markup vocabulary should not be one that simply
identifies what the information will look like when displayed.
2. The archival form of electronically encoded language resources should follow a
strategy of descriptive markup; that is, the markup vocabulary should identify what
the individual pieces of information are from a linguistic point of view.
3. The markup vocabulary for a particular text or lexical resource should identify all of
the elements of information (not just some of them) that go into the analysis of the
text or the description of each lexical item.
4. Users still need a presentational display of the resource; this should be accomplished
by applying a stylesheet to the descriptively marked up resource.
HTML markup, when applied to language resources, is an example of presentational markup. It does
not offer linguists the ability to do automated processing of a linguistic nature, for instance, to perform a
query like “What are the part-of-speech categories used in this lexicon?” For this purpose a descriptive
markup vocabulary that specifically identifies the linguistic significance of each piece of information is
needed. But simply having a markup vocabulary is not enough; for each language resource there is also a
grammar that defines how the individual markup elements combine to form a valid text or lexicon.
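The part-of-speech query mentioned above becomes straightforward once the markup is descriptive. The following is a minimal sketch, assuming a hypothetical lexicon in which each entry carries a pos element; the element names and sample entries are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptively marked-up lexicon; element names are assumptions.
LEXICON = """
<lexicon>
  <entry><form>kata</form><pos>noun</pos><gloss>word</gloss></entry>
  <entry><form>lari</form><pos>verb</pos><gloss>run</gloss></entry>
  <entry><form>batu</form><pos>noun</pos><gloss>stone</gloss></entry>
</lexicon>
"""


def part_of_speech_categories(xml_text):
    """Return the distinct part-of-speech labels used in the lexicon, sorted."""
    root = ET.fromstring(xml_text)
    return sorted({pos.text.strip() for pos in root.iter("pos") if pos.text})


print(part_of_speech_categories(LEXICON))  # ['noun', 'verb']
```

With presentational markup the same labels might appear only as italicized text, and a program could not reliably tell an italic part of speech from any other italic string.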
• The linguist creating a text or lexicon needs the markup of the resource to be consistent
with his or her plan for its content and structure.
1. A single markup schema that sanctions all common practices in structuring the
content of language resources will be too permissive to constrain any single resource
to the specific plan of its creator [5].
2. There is enough convergence of practice that it will be possible to develop one or
more specific markup schemas that can be recommended for widespread use while
being adequately constraining.
3. There will always be plans for content and structure that are unique enough to require
that a unique markup schema be devised for the resource.
These consequences of the second requirement mean that there will be multiple markup schemas, even in the
context of best practice. In order to achieve interoperability of resources when there are multiple markup
schemas, it is necessary to introduce a meta-level in the approach to markup:
• Linguists need to be able to query and otherwise manipulate multiple texts or lexicons in a
single operation, even though they may individually have different markup vocabularies and
schemas.
1. As a foundation for interoperability, there must be a shared ontology for the kinds of
information that are marked up in language resources.
2. As the bridge to interoperability, each resource must have a metaschema that
formally documents how the elements and attributes of its markup schema map onto
the concepts of the common ontology.
3. The metaschema must be separate from the language resource (rather than being an
integral part of it) so that multiple resources can share the same metaschema.
4. It must be possible for a third party to create a metaschema for a resource that lacks
one without changing the resource itself. (This implies that the linkage from
metaschema to schema to resource is specified in a stand-off manner through
metadata.)
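The four consequences above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the EMELD implementation: the resources, the schema and metaschema identifiers, and the concept names (Headword, PartOfSpeech) are all hypothetical, and each metaschema is reduced to a simple mapping from element names to concepts.

```python
import xml.etree.ElementTree as ET

# Hypothetical resources; lexicon-a and lexicon-c follow the same schema.
RESOURCES = {
    "lexicon-a": "<lexicon><entry><form>kata</form><pos>noun</pos></entry></lexicon>",
    "lexicon-b": "<dict><lx><hw>lari</hw><cat>verb</cat></lx></dict>",
    "lexicon-c": "<lexicon><entry><form>batu</form><pos>noun</pos></entry></lexicon>",
}

# Hypothetical metaschemas: element name -> concept in the shared ontology.
METASCHEMAS = {
    "meta-a": {"form": "Headword", "pos": "PartOfSpeech"},
    "meta-b": {"hw": "Headword", "cat": "PartOfSpeech"},
}

# Stand-off metadata: the linkage lives outside the resources themselves,
# so a third party can add a metaschema without editing any resource.
RESOURCE_TO_SCHEMA = {
    "lexicon-a": "schema-a",
    "lexicon-b": "schema-b",
    "lexicon-c": "schema-a",
}
SCHEMA_TO_METASCHEMA = {"schema-a": "meta-a", "schema-b": "meta-b"}


def values_for_concept(concept):
    """Collect (resource id, value) pairs for one ontology concept across all resources."""
    found = []
    for resource_id, xml_text in RESOURCES.items():
        schema_id = RESOURCE_TO_SCHEMA[resource_id]
        mapping = METASCHEMAS[SCHEMA_TO_METASCHEMA[schema_id]]
        for node in ET.fromstring(xml_text).iter():
            if mapping.get(node.tag) == concept and node.text:
                found.append((resource_id, node.text.strip()))
    return found


print(values_for_concept("PartOfSpeech"))
# [('lexicon-a', 'noun'), ('lexicon-b', 'verb'), ('lexicon-c', 'noun')]
```

A single query phrased in terms of the shared concept reaches all three resources, even though two markup vocabularies are in play and no resource was modified.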
IMPLEMENTING METASCHEMAS
The ontology that serves as the foundation for interoperability (the first consequence of the third requirement above) is under development as one of
the EMELD project deliverables [6]. The complete paper will present the details of how metaschemas are being
implemented. In brief, a metaschema is an XML document that formally expresses the mapping of the
elements and attributes in a markup schema to the concepts in the linguistic ontology. The exact meaning of a
particular instance of an element or attribute often depends on its context in the entire markup structure.
To specify the relevant contexts, the metaschema uses XPath expressions [7].
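The role of context can be illustrated with a minimal Python sketch. The metaschema format shown here (a map element with path and concept attributes) is an assumption made for illustration, as are the concept names; the actual EMELD metaschema format is not given in this abstract. Python's xml.etree.ElementTree supports only a subset of XPath 1.0, which is sufficient for simple location paths like these.

```python
import xml.etree.ElementTree as ET

# Hypothetical metaschema: the same element name (gloss) maps to different
# concepts depending on the path at which it occurs.
METASCHEMA = """
<metaschema>
  <map path="./entry/sense/gloss" concept="SenseGloss"/>
  <map path="./entry/example/gloss" concept="ExampleTranslation"/>
</metaschema>
"""

# Hypothetical resource in which gloss appears in two different contexts.
LEXICON = """
<lexicon>
  <entry>
    <form>lari</form>
    <sense><gloss>run</gloss></sense>
    <example><text>Dia lari.</text><gloss>He runs.</gloss></example>
  </entry>
</lexicon>
"""


def concepts_in_context(metaschema_xml, resource_xml):
    """Apply each path rule to the resource and pair matched text with its concept."""
    rules = ET.fromstring(metaschema_xml).findall("map")
    root = ET.fromstring(resource_xml)
    results = []
    for rule in rules:
        for node in root.findall(rule.get("path")):
            results.append((rule.get("concept"), node.text))
    return results


print(concepts_in_context(METASCHEMA, LEXICON))
# [('SenseGloss', 'run'), ('ExampleTranslation', 'He runs.')]
```

A mapping keyed on the element name alone could not separate these two uses of gloss; the path expression supplies the distinguishing context.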
The metaschema also maps the elements of markup that define structure onto the generic structures of
an abstract data model [8]. As a result, a metaschema specifies an equivalent abstract document for any
document instance that conforms to its corresponding schema. The elements of the abstract document
correspond to generic structures; the specific concept of the ontology that identifies the linguistic meaning of
a particular element is encoded as the value of an attribute of the abstract element.
Interoperability is achieved by means of an XSLT script that compiles the metaschema into an
equivalent XSLT script that implements the translation of any document instance into its equivalent abstract
document. The complete paper will demonstrate how this transformation process yields documents with
comparable markup from documents that originally had different markup schemas.
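The compile-then-translate idea can be approximated in Python, with a closure standing in for the generated XSLT script. This is a sketch under stated assumptions rather than the method described above: the generic structure names (struct, feat), the concept attribute, the concept names, and the dictionary form of the metaschema are all hypothetical.

```python
import xml.etree.ElementTree as ET


def compile_metaschema(rules):
    """Turn a metaschema into a translator function.

    rules maps each element name to (generic structure name, ontology concept).
    The returned function plays the role of the generated XSLT script.
    """
    def translate(xml_text):
        def convert(node):
            generic, concept = rules[node.tag]  # KeyError if the tag is unmapped
            abstract = ET.Element(generic, {"concept": concept})
            if node.text and node.text.strip():
                abstract.text = node.text.strip()
            for child in node:
                abstract.append(convert(child))
            return abstract
        return ET.tostring(convert(ET.fromstring(xml_text)), encoding="unicode")
    return translate


# Two hypothetical metaschemas for two different markup schemas.
META_A = {
    "lexicon": ("struct", "Lexicon"),
    "entry": ("struct", "LexicalEntry"),
    "form": ("feat", "Headword"),
    "pos": ("feat", "PartOfSpeech"),
}
META_B = {
    "dict": ("struct", "Lexicon"),
    "lx": ("struct", "LexicalEntry"),
    "hw": ("feat", "Headword"),
    "cat": ("feat", "PartOfSpeech"),
}

translate_a = compile_metaschema(META_A)
translate_b = compile_metaschema(META_B)

doc_a = "<lexicon><entry><form>kata</form><pos>noun</pos></entry></lexicon>"
doc_b = "<dict><lx><hw>kata</hw><cat>noun</cat></lx></dict>"

abstract_a = translate_a(doc_a)
abstract_b = translate_b(doc_b)
print(abstract_a)
print(abstract_a == abstract_b)  # True
```

The two source documents use different element names, yet their abstract documents are identical, which is what makes a single query over both possible. This sketch ignores attributes and context-dependent mappings for brevity.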
NOTES
1. The project involves five host institutions and is sponsored by a five-year grant from the National
Science Foundation. See: http://saussure.linguistlist.org/cfdocs/emeld/index.cfm
2. Section 3.1 of http://linguist.emich.edu/%7Eworkshop/E-MELD.html
3. 2001 workshop, “The Need for Standards”:
http://saussure.linguistlist.org/cfdocs/emeld/documents/2001docs2.cfm and 2002 workshop, “Digitizing
Lexical Information”: http://saussure.linguistlist.org/cfdocs/emeld/workshop/2002/papers02.html
4. The Semantic Web is an activity of the W3C: http://www.w3.org/2001/sw/
5. The TEI DTD for dictionaries is an example of a markup schema that is so general as to be too
permissive for any one project. For instance, see “Using architectural processing to derive small,
problem-specific XML applications from large, widely-used SGML applications,” Gary F. Simons, SIL
Electronic Working Papers 1998-006, http://www.sil.org/silewp/1998/006/.
6. Scott Farrar and D. Terence Langendoen, 2002, “GOLD: A General Ontology for Linguistic
Description,” EMELD working paper,
http://saussure.linguistlist.org/cfdocs/emeld/documents/gold_draft4.doc. See related papers at
http://emeld.douglass.arizona.edu:8080/group.html
7. XML Path Language (XPath) Version 1.0, W3C Recommendation, 16 November 1999,
http://www.w3.org/TR/xpath
8. Nancy Ide and Laurent Romary, 2001, “Standards for Language Resources,” Proceedings of the
IRCS Workshop on Linguistic Databases,
http://www.ldc.upenn.edu/annotation/database/papers/Ide_Romary/29.3.pdf.
Peer


Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2003
"Web X: A Decade of the World Wide Web"

Hosted at University of Georgia

Athens, Georgia, United States

May 29, 2003 - June 2, 2003

83 works by 132 authors indexed

Affiliations need to be double-checked.

Conference website: http://web.archive.org/web/20071113184133/http://www.english.uga.edu/webx/

Series: ACH/ICCH (23), ALLC/EADH (30), ACH/ALLC (15)

Organizers: ACH, ALLC

Tags
  • Keywords: None
  • Language: English
  • Topics: None