Into the Crucible: Testing the Merits of Hierarchical Models, Embedded Markup, and Monolithic SGML DTDs in Light of Conceptual Models of Text

paper
Authorship
  1. Robin Cover
     Summer Institute of Linguistics (SIL International)

Work text

Introduction
From almost any vantage point, the expanding universe of electronic documents testifies to widespread use of embedded markup as the dominant means of representing document structure. Current markup schemes are strongly influenced by SGML, and thus employ descriptive (structural, logical, analytical) markup rather than presentational markup. Even in cases where presentational markup is evident, markup languages are typically defined by grammars (SGML DTDs) that govern the application of markup according to a hierarchical containment model. The World Wide Web, for example, is rapidly becoming the communication channel of choice on the Internet, and the various dialects of HTML grammar (e.g., HTML2, HTML3, Mozilla) are specified in SGML DTDs. Within government and industry sectors, the use of SGML for structuring documents and databases is on a steady rise. In the academic arena, many universities and college libraries have multiple projects based upon SGML, the most common point of reference being an SGML application documented in the TEI Guidelines.
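
To make the containment model concrete, a deliberately small sketch follows; the element names and content models are hypothetical, drawn neither from HTML nor from the TEI. The tags are descriptive - they name what the parts of the text are - and the DTD is a grammar stating which parts may contain which.

  <!-- A small, hypothetical DTD: descriptive element names (chapter,      -->
  <!-- heading, para) and a containment grammar stating that a chapter     -->
  <!-- holds one heading followed by one or more paragraphs.               -->
  <!DOCTYPE chapter [
    <!ELEMENT chapter - - (heading, para+) >
    <!ELEMENT heading - - (#PCDATA)        >
    <!ELEMENT para    - - (#PCDATA)        >
  ]>
  <chapter>
    <heading>On Embedded Markup</heading>
    <para>The tags say what each part is, not how it should look.</para>
  </chapter>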

It is not surprising that this technology shift characterized by pervasive use of embedded markup should give rise to renewed criticism of embedded markup in general, and of SGML in particular. On the one hand, no alternative document structuring technology appears capable of seriously challenging SGML any time soon. On the other hand, the absence of a popular competitive technology does not prove the adequacy of SGML. SGML's limitations, inherent drawbacks, and design inadequacies are in fact being exposed ever more clearly as SGML markup schemes are applied in diverse document environments and as software developers attempt to implement text processing systems based upon fully-developed markup specifications. The inherent limitations of embedded markup are compounded by the fact that SGML itself provides no formal mechanism to specify and validate document semantics, and that until recently, companion standards (e.g., HyTime and DSSSL) that do provide the necessary semantics have been missing. Thus, there is a large gap between what SGML can offer presently and what it is capable of offering once we have corresponding standards for processing semantics (e.g., query languages, tree transformations, hypertext navigation), together with enhanced document processing software supporting SGML. In the meantime, criticism is focused upon the limitations of SGML as a technology that supports merely the formal representation of structure.
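
The gap can be illustrated with another hypothetical fragment (the element names are again invented for the purpose). An SGML parser can verify that the instance below conforms to its DTD, but ISO 8879 itself says nothing about how the encoded structure is to be queried, transformed, rendered, or navigated.

  <!DOCTYPE note [
    <!ELEMENT note - - (q, cmt)       >
    <!ELEMENT q    - - (#PCDATA)      >
    <!ATTLIST q  src  CDATA #IMPLIED  >
    <!ELEMENT cmt  - - (#PCDATA)      >
  ]>
  <!-- Validation guarantees only that every note pairs a quotation with a -->
  <!-- comment.  Retrieving each q whose src is "witness-A", or deciding   -->
  <!-- how the pair should be displayed or traversed, requires processing  -->
  <!-- semantics supplied outside ISO 8879, e.g. by DSSSL, HyTime, or      -->
  <!-- application software.                                               -->
  <note>
    <q src="witness-A">a quoted passage</q>
    <cmt>an annotator's remark on the passage</cmt>
  </note>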

Just how serious are the limitations and weaknesses of current markup systems? One area of criticism relates to the SGML standard itself (ISO 8879:1986) and some of its inherent design flaws as a formal language, as judged by computer scientists. These issues are not insignificant, but will not be addressed here. Another kind of criticism has come from within the academic community itself, alleging that SGML by its very nature is both too restrictive and too prescriptive. This philosophical and methodological objection to SGML markup may be framed approximately as follows: SGML, inherently requiring a hierarchical (containment) model for the representation of text structure, defines markup languages in terms of a complicated and rigid formal grammar, thus enforcing or imposing a kind of interpretation upon texts that is not welcomed by linguists and literary scholars, and that does not serve the best interests of humanities research. In the worst case, SGML markup is not only inadequate for text encoding within the humanities: it embodies assumptions alien to the nature of the text being modeled, and thus may prove harmful in certain respects to the academic enterprise.

Such criticisms of SGML markup have not been formulated within the academic arena in a vacuum, as merely theoretical reflection, but by working scholars attempting to model real texts (e.g., Ian Lancashire, Claus Huitfeldt). As such, these misgivings represent a significant critique, and merit discussion in a public forum. In some cases the criticisms may represent justifiable reactions to injudicious and excessive claims about the value or usefulness of SGML markup (modification of a complex modular DTD is quite difficult; special software is required to parse, search, and browse heavily encoded SGML text). In other cases these criticisms may legitimately point toward technical weaknesses in particular SGML applications (a monolithic DTD adequate for all humanities texts and directly usable by generic software may be overly ambitious). In other cases the criticisms may reflect unsatisfied scholarly needs and expectations for supporting software - for example, software that might mask the imperfect match between the representational notation and the scholars' "pure" conceptual model of a text. Practical experience in SIL's CELLAR Project has revealed the inevitability of mismatches between several prominent conceptual models that will come into juxtaposition as cross-domain collaborative work proceeds: anticipating these mismatches can relieve predictable areas of tension based upon unrealistic expectations. Finally, scholarly criticisms in some cases may represent misunderstandings of the claims made for SGML, or of SGML itself.

The following presentation attempts to tease apart some of the subtle but critical issues bound up in these philosophical and methodological objections to SGML, particularly with respect to SGML's descriptive markup ethos as documented in the TEI Guidelines.

Articulations of the criticisms and responses
Criticism #1. SGML encoding requires that texts be represented formally in terms of a hierarchy (or multiple hierarchies). This hierarchical representation does violence to the texts being modeled, since the authors of these early texts did not conceive of them in hierarchical terms. Furthermore, the hierarchical representational notation tempts the modern encoder to see hierarchy in the text where there is none.

Response to be developed: Methodologically, there is no necessary requirement that the assumptions underlying a markup metalanguage (as a modeling language) be shared by the domain experts and document domain within which the specific markup language is applied. In particular, it is not necessarily true that a hierarchical database model does violence to a real world object not quintessentially "hierarchical" in structure by storing information about that object in a hierarchical data structure. Our conceptual model of a text (ontologically) does not have to be the same as our conceptual model of the markup language we use to represent (information about) the text. Multiple conceptual models need to be accommodated in a full implementation of structured text modeling and processing, but we should require only that these models be orthogonal, or commensurable at a very high level, and not formally identical. On the other hand, it may be more difficult to build software for direct processing of a hierarchically-structured (encoded) text if we believe the non-hierarchical relationships and structures are more important than the hierarchical ones.
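
One familiar form of the mismatch can be sketched as follows; the element names are illustrative, loosely in the spirit of TEI milestone techniques rather than taken from any particular DTD. Verse lines and sentences overlap, so one structure is encoded as containment while the other is merely recorded, at its boundaries, by empty elements. The encoding is hierarchical; the textual phenomenon it records need not be, and nothing obliges the encoder to believe otherwise.

  <!DOCTYPE lg [
    <!ELEMENT lg - - (l+)             >
    <!ELEMENT l  - - (#PCDATA | sb)*  >
    <!ELEMENT sb - O EMPTY            >
    <!ATTLIST sb  n  NUMBER #IMPLIED  >
  ]>
  <!-- Verse lines (l) form the containment hierarchy; sentence boundaries -->
  <!-- (sb), which cut across the lines, are marked by empty milestones.   -->
  <lg>
    <l><sb n="1">The first sentence of the poem begins here</l>
    <l>and runs on past the line break. <sb n="2">A second</l>
    <l>sentence crosses the next break as well.</l>
  </lg>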

Criticism #2. Many of the advertised advantages of "descriptive markup" (alternately called "logical" or "structural" markup), however applicable to the creation of texts in a modern production setting, are irrelevant and misleading in contexts where embedded markup is being applied analytically to pre-existing texts. The application of logical or analytic markup rather than mere "presentational markup" invites an unwarranted amount of subjectivity and interpretation - interpretation and analysis that should be left for a separate phase of scholarly inquiry.

Response to be developed: It is probably true that application of analytic (logical, structural) markup to pre-existing text is more difficult - involving more subjectivity and interpretation - than presentational markup. Simply put: it is frequently easier to see that a typographic or other visual effect is present than to correctly ascertain the meaning of the visually apparent phenomenon. It is also true that some of the processing benefits of "descriptive" markup are minimized, or entirely negated, if a markup analyst thinks presentational markup is appropriate and the SGML application being used is prejudiced against the encoding of the text in presentational (purely visual) terms.
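
The contrast is easy to exhibit with two encodings of the same words; the element and attribute names resemble TEI usage but are offered only as an illustration.

  <!DOCTYPE sample [
    <!ELEMENT sample - - (p+)                    >
    <!ELEMENT p      - - (#PCDATA | hi | title)* >
    <!ELEMENT hi     - - (#PCDATA)               >
    <!ATTLIST hi  rend  CDATA #IMPLIED           >
    <!ELEMENT title  - - (#PCDATA)               >
  ]>
  <sample>
    <!-- Presentational: records only the visible fact of the italics. -->
    <p>She had read <hi rend="italic">Middlemarch</hi> twice.</p>
    <!-- Descriptive: the encoder must decide why the italics are there (a -->
    <!-- title? emphasis? a foreign word?) and commit to one reading.      -->
    <p>She had read <title>Middlemarch</title> twice.</p>
  </sample>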

Criticism #3. The notion of a prescriptive grammar (an SGML DTD) is potentially appropriate in an authorship context where the grammar's presence can be acknowledged and its rules enforced by software. Imposing markup on a text ex post facto in accordance with the strict rules of a grammar is specious, or at least suspect, in the context of a modern scholarly attempt to understand (inductively) the nature of early and ancient texts.

Response to be developed: If an SGML application and its corresponding implementations are responsive to intellectual work suggesting the need for grammar modification, then the detrimental effects of an imposed grammar can be minimized. In fact, using a grammar model as a formal heuristic can provide one means for testing hypotheses about texts. In practice, however, we concede that the presence of a prestigious grammar might encourage the identification of hierarchy where there is none, and the identification of textual features that are not actually in the text. The path of least resistance, when one is forced to choose from among a closed set of markup options, may be to select the best available alternative rather than to describe in honest terms the characteristics and features actually present in the text. These dangers are acknowledged as real, but represent factors in a familiar tradeoff.
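
As a sketch of this heuristic use of a grammar (the element names are hypothetical), a content model can encode a falsifiable claim about a corpus, and a validating parser then identifies the texts that do not bear the claim out.

  <!-- The content model encodes the hypothesis that every letter in the   -->
  <!-- corpus opens with a salutation and closes with a signature.         -->
  <!ELEMENT letter     - - (salutation, body, signature) >
  <!ELEMENT salutation - - (#PCDATA) >
  <!ELEMENT body       - - (para+)   >
  <!ELEMENT para       - - (#PCDATA) >
  <!ELEMENT signature  - - (#PCDATA) >
  <!-- A letter that lacks a salutation fails to validate; the parse error -->
  <!-- is then evidence against the hypothesis, or grounds for revising    -->
  <!-- the grammar, rather than necessarily a defect in the text.          -->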

Criticism #4. Creating a monolithic, all-purpose grammar (SGML DTD) for all texts in the humanities is not only theoretically suspect: it has proven unachievable in practice. After several years of collaborative work, the TEI P3 DTD, for example, is recognizably inadequate for some purposes by not offering enough markup tags, and by allowing markup tags to be used in places where they should not be allowed. Furthermore, if scholars are allowed (and required) routinely to modify the TEI DTD in order to meet their needs, it remains questionable what value the "TEI DTD" will have as a standard. General purpose software developed to process TEI-SGML text will be largely unusable because of different DTD modifications and varying approaches used as work-around encoding solutions.

Response to be developed: The fact that the TEI DTD is judged sub-optimal or inadequate for every possible scholarly goal, or that (through no fault of its own) it does not handle multiple concurrent hierarchies elegantly, should not detract from its value as a monumental and historic achievement. It is true that some of the value of a typical SGML DTD for validation purposes is compromised by the TEI P3 DTD, owing to its extreme generality. There are no simple workarounds for this problem, given the current SGML specification. An obvious solution for individual encoding projects is to design more restrictive DTDs (proper subsets) useful for validation in local processing, and to conceive of the TEI DTD as defining an interchange model. It remains to be seen whether the current TEI P3 DTD is overly ambitious, attempting in a single specification to directly unify too many disparate conceptual models and methodological bases of operation. Likewise, it is not clear that the development of general-purpose software capable of handling all TEI markup constructs (with unbounded DTD extensions) is an achievable or even desirable goal. Given the vast differences in processing requirements for various content domains and methods of scholarly inquiry, it is perhaps more realistic to think of developing software that addresses requirements of narrower scope based upon TEI DTD subsets.
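
The subset strategy can be sketched with two alternative declarations for the same element; both are simplified illustrations, not the actual TEI P3 content models.

  <!-- Interchange-style declaration: permissive enough to accommodate     -->
  <!-- many encoding practices, and therefore able to validate little.     -->
  <!ELEMENT div - - ( (head | p | lg | list | note | q)* ) >

  <!-- A tighter local declaration for one project: every div opens with a -->
  <!-- head and then contains only prose paragraphs.  Anything valid under -->
  <!-- this model is also valid under the permissive model above, so the   -->
  <!-- local DTD can police project practice while the general DTD serves  -->
  <!-- interchange.                                                        -->
  <!ELEMENT div - - ( head, p+ ) >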

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1996

Hosted at University of Bergen

Bergen, Norway

June 25, 1996 - June 29, 1996

147 works by 190 authors indexed

Conference website: https://web.archive.org/web/19990224202037/www.hd.uib.no/allc-ach96.html

Series: ACH/ICCH (16), ALLC/EADH (23), ACH/ALLC (8)

Organizers: ACH, ALLC

Tags
  • Language: English