Typographic Regularization in the WWP Textbase

Authorship
  1. Jacqueline H. Russom

    Scholarly Technology Group - Brown University

  2. Sydney D (Syd) Bauman

    Brown University

Introduction

In English texts printed before 1630, the letters v, u, j, and i did not have the values that they have today. The word that we spell 'jury' was written 'iury'; the word that we spell 'ivory' was written 'iuory', etc. Another common typographical convention, characteristic of somewhat later texts, was to represent 'W' as 'VV' or 'Vv'. These archaic conventions make early texts difficult to read and compromise the matching of forms in information retrieval tasks. The Women Writers Project uses SGML tagging to encode a regularized spelling for such typographical variants, thereby allowing the option to display and search on either the original form or the regularized form.

Of the first 300 or so texts encoded by the WWP from 1989 to 1999, nearly half contained some manifestation of typographical difference from modern English requiring regularization. Encoding this information by hand was time-consuming and inefficient, but about 90 texts were manually tagged with regularized forms by encoders. These provided a substantial body of useful data for understanding the nature, extent, and frequency of distribution of what we have termed 'vuji' and 'VV' phenomena.

The Scholarly Technology Group undertook to develop a system to automatically identify words subject to this typographic convention and tag them with the regularized form. This system has two major components: an SGML-aware wordlist-based program, and a set of pattern matching rules derived from linguistic principles for English consonants and vowels. Both components have been designed to work with WWP markup conventions for such things as word division across a line break, errors or abbreviations within a word, and structural elements to be excluded from regularization.

The wordlist-based program matches whole words against a dictionary list of known forms requiring regularization and replaces each matched word in the text with a form containing the appropriate markup. The pattern-matching component uses a set of regular expressions to identify probable candidates for regularization that have not been found on the wordlist. Each such match is presented to the encoder, who can accept or reject the regularization as appropriate.
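The wordlist pass can be illustrated with a minimal Python sketch. It is not the STG program: the wordlist format, the names WORDLIST and tag_known_words, and the three entries (taken from the Askew sample) are assumptions made for illustration. The sketch leaves tags and entity references untouched and looks up only the runs of letters between them.

```python
import re

# Hypothetical wordlist: original spelling -> the same word with orig/reg markup.
# The format of the real WWP wordlist is not given in this abstract.
WORDLIST = {
    "vnto": '<orig reg="u">v</orig>nto',
    "maruele": 'mar<orig reg="v">u</orig>ele',
    "moue": 'mo<orig reg="v">u</orig>e',
}

# Tags and entity references are captured so that re.split keeps them.
MARKUP = re.compile(r"(<[^>]*>|&[A-Za-z]+;)")
WORD = re.compile(r"[A-Za-z]+")


def tag_known_words(text, wordlist=WORDLIST):
    """Replace whole words found on the wordlist with their marked-up form."""
    pieces = MARKUP.split(text)
    out = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            # Odd positions hold the captured markup; copy it unchanged.
            out.append(piece)
        else:
            # Even positions hold plain text; look up each run of letters.
            out.append(WORD.sub(lambda m: wordlist.get(m.group(0), m.group(0)), piece))
    return "".join(out)
```

Calling tag_known_words("I maruele not, <lb>vnto &s;o moue") tags 'maruele', 'vnto', and 'moue' and copies <lb> and &s; through unchanged. Unlike the system described here, this sketch does not reassemble words that are interrupted by an entity reference, an intra-word element, or a line break.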

Encoding Early Printed Texts

Example of original text

Anne A∫kewes an∫were vnto
Iohan La∫∫els letter.
Oh frynde mo∫t derelye belo
ued in God. I maruele not a lyt
tle, vvhat ∫huld moue yow, to iud
ge in me ∫o ∫lēdre a faythe, as to
feare deathe, vvhych is the ende
of all my∫erye.
Askew, Anne. The lattre examinacyon of Anne Askewe, 1547 Marpurg, 1547. Women Writers Online. Women Writers Project, Brown University. Unpublished.
WWP encoding practice documents structural features of the text, such as paragraphs (p element), chapters, stanzas, etc. (div element with a type attribute); typographical features such as line breaks (lb element), page breaks (pb), and catchwords & signatures (mw with type attribute); renditional characteristics such as italicization and superscription; and links to textual annotations. Intra-word entity references are used for characters that are not on the modern computer keyboard, including soft hyphens (shy), long s (s), accented letters (e.g., eacute), and macrons (e.g., emacr) -- macrons are further documented with an abbr element whose expan attribute indicates the specific nasal consonant elided in the original. In addition to the expansion of macrons, intra-word elements often occur for a variety of reasons, including editorial insertion of omitted characters or soft hyphens (in the corr attribute of sic), editorial correction of other obvious errors (also in corr of sic), expansion of some abbreviations (in the expan attribute of abbr), and, of course, the tagging of 'vuji' and 'vv' characters with their modern form (in the reg attribute of orig).

WWP encoded text:

<speaker rend="align(center)slant(upright)">Anne A&s;kewes an&s;were <orig reg="u">v</orig>nto
<lb><orig reg="J">I</orig>ohan La&s;&s;els letter.</speaker>
<p>Oh frynde mo&s;t derelye belo<sic corr="&shy"></sic>
<lb><orig reg="v">u</orig>ed in God. I mar<orig reg="v">u</orig>ele not a lyt<sic corr="&shy"></sic>
<lb>tle, what &s;huld mo<orig reg="v">u</orig>e yow, to <orig reg="j">i</orig>ud<sic corr="&shy"></sic>
<lb>ge in me &s;o &s;l<abbr expan="en">&emacr;</abbr>dre a faythe, as to<anchor id="ter306" corresp="lat306">
<lb>feare deathe, whych is the ende
<lb>of all my&s;erye.

Critical editions of early texts usually retain old spelling and typographical features of the copy text. Editions for general reading usually normalize typographical conventions, even when they retain the "old-spelling texture" [1]. The SGML version makes it possible to display either the original (as above) or a regularized version (as below).

Regularized display:

Anne Askewes answere unto
Johan Lassels letter.
Oh frynde most derelye belo-
ved in God. I marvele not a lyt-
tle, what shuld move yow, to jud-
ge in me so slendre a faythe, as to
feare deathe, whych is the ende
of all myserye.
In addition to providing considerable improvement in readability, the WWP's encoding practice makes it possible to retrieve early attestations of forms that would otherwise not qualify as 'matches' to a search query. For example, a search for beloved within two words of God would fail to match against the appropriate line in the unregularized Askew text.[2]
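The retrieval point can be sketched in Python as follows. This is an illustration under stated assumptions, not the WWP's delivery software: the function search_text, its regular expressions, and the decision to rejoin a word split by an editorial soft hyphen are hypothetical choices. It derives either an original or a regularized search string from a fragment of the encoded sample.

```python
import re

# <orig reg="X">Y</orig>: group 1 is the regularized form, group 2 the original.
ORIG = re.compile(r'<orig reg="([^"]*)">(.*?)</orig>')
# An editorial soft hyphen followed by a line break; removed to rejoin the word.
SOFT_HYPHEN_BREAK = re.compile(r'<sic corr="&shy;?"></sic>\s*<lb>')
TAG = re.compile(r"<[^>]*>")


def search_text(encoded, regularized=True):
    """Flatten an encoded fragment to plain text for searching (sketch only)."""
    text = SOFT_HYPHEN_BREAK.sub("", encoded)
    text = ORIG.sub(lambda m: m.group(1) if regularized else m.group(2), text)
    text = TAG.sub("", text)  # drop any remaining tags
    # Only the long-s entity is resolved here; other entities are left as-is.
    return text.replace("&s;", "s" if regularized else "\u017f")


ENCODED = 'mo&s;t derelye belo<sic corr="&shy"></sic>\n<lb><orig reg="v">u</orig>ed in God.'
```

With this fragment, the regularized string contains 'beloved' while the original string contains 'beloued', so a query for 'beloved' matches only the regularized view.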

Automated Markup Insertion

This paper describes:

  • creating an initial wordlist
    The procedure we used to create an initial list of words which would require 'vuji' markup based on the set of 90 or so texts that had already been manually tagged; the format of the wordlist itself; and constraints that we found to be necessary.
  • applying the wordlist program
  • characterizing the distribution of the phenomena
    Careful analysis of the 90 or so manually tagged files along with some research into physical bibliography allowed us to determine the linguistic rules underlying the typographical convention of our texts.
  • creating the pattern matching program
    Note: the complete paper discusses these algorithms in summary and in detail; during the presentation they will probably only be summarized.
  • the resulting process
    Encoders can enter the original typography during initial capture, and use the 'vuji' programs in a final pass through the text to insert appropriate orig reg markup.
  • continual build-up of the wordlist
    Part of the interactive pattern-matching process (the second component) is to add those words which required 'vuji' markup to the wordlist for the first component.
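The interactive second pass and the build-up of the wordlist might look roughly like the Python sketch below. The four rules are illustrative guesses at the kind of consonant and vowel patterns involved, not the rules derived by the project; propose, review, and the accept callback are hypothetical names, and a plain original-to-regularized mapping stands in for the wordlist. Only lowercase forms are handled.

```python
import re

# Hypothetical candidate rules, applied in order; illustrative only.
RULES = [
    (re.compile(r"vv"), "w"),                         # vvhat -> what
    (re.compile(r"^v(?=[^aeiou])"), "u"),             # vnto  -> unto
    (re.compile(r"(?<=[aeiou])u(?=[aeiou])"), "v"),   # moue  -> move
    (re.compile(r"^i(?=[aeiou])"), "j"),              # iudge -> judge
]


def propose(word):
    """Return a regularized candidate, or None if no rule changes the word."""
    candidate = word
    for pattern, replacement in RULES:
        candidate = pattern.sub(replacement, candidate)
    return candidate if candidate != word else None


def review(words, wordlist, accept):
    """Offer each unlisted candidate to the encoder; keep the accepted ones."""
    for word in words:
        if word in wordlist:
            continue  # already handled by the wordlist pass
        candidate = propose(word)
        if candidate is not None and accept(word, candidate):
            wordlist[word] = candidate
    return wordlist
```

Here accept stands for the encoder's decision; in an interactive tool it would prompt for each candidate. A rule set this simple produces false positives (it would turn 'iambic' into 'jambic'), which is why each match is reviewed before the word joins the wordlist.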
Conclusion

The project described in this paper tests the extent to which TEI encoding of 'vuji' phenomena can be automated, and characterizes the human intervention required to evaluate and process this feature of early modern English texts.

Notes

[1]
F. Bowers. "Readability and regularization in old-spelling texts of Shakespeare." The Huntington Library Quarterly, 50 (1987), 199-227.
[2]
The problem of retrieval of variant and archaic spellings remains; it is our expectation that the research involved in automating 'vuji' tagging will support the solution of this more difficult challenge.
Copyright © 2000 by Syd Bauman, Jacque Russom, and Brown University

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2001

Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001

Organizers: ACH, ALLC
