Building spell-checking facilities for ancient Spanish

paper
Authorship
  1. Alejandro Bia

    Libraries - University of Alicante

  2. Manuel Sanchez-Quero

    Libraries - University of Alicante

The rapid development of information technology has given rise to a new type of library, the digital library (Arms, 2000). The Miguel de Cervantes Digital Library (http://cervantesvirtual.com) is one of the most ambitious projects of its kind ever undertaken in the Spanish-speaking world, with more than 4000 digital books at present. These digitised works are mostly Hispanic classics from the 12th to the 20th century. Developing these digital books requires a great deal of care in correction and editing, but once corrected they can be processed in a uniform, large-scale way to produce the different publication formats and services offered to readers.

Regarding the human resources involved in the project, the largest group by far consists of correction and markup staff (Bia and Pedreño, 2000), who are in charge of the hardest-to-automate part of the production process: reading and correcting digitisation errors, structurally marking up the texts, and making important editing decisions that affect both the rendering and the functionality of the hypertext documents to be published. These humanists are highly skilled people with at least a bachelor's degree in philology or another humanistic discipline. We want them to devote their time to higher intellectual tasks, such as making editing or markup decisions or preparing the texts for interesting Internet services (like text analysis or concordance queries), rather than spending their energy on the tedious mechanical task of correction, which is the main bottleneck in our production workflow and by far the most time-consuming task.

In the case of contemporary works, spell-checkers have turned out to be a useful aid to the correction process, but for literary works written in ancient Spanish, commercially available modern spell-checkers may introduce more mistakes than they prevent. The reason is that spell-checker dictionaries include only modern usage, so when they are applied to old texts they take correct ancient word forms for mistakes and try to correct them. Unable to use spell-checking as an aid, correctors have to compare the original and the digitised texts side by side to detect the errors.

Being aware of the usefulness of spell-checkers in the correction of modern works, and lacking this facility for ancient texts, we decided to build dictionaries for ancient Spanish. This decision led to new problems and new questions. Since there is no single ancient Spanish, but rather a language that has evolved continuously through the centuries, how many old-Spanish dictionaries should we build? Should we set arbitrary chronological limits?

Taking advantage of the 4000 books already digitised and corrected at the Miguel de Cervantes Digital Library as a corpus covering several centuries of Spanish writing, we have built a time-aware system of dictionaries that takes into account the temporal dynamics of the language, to help solve the problem of spell-checking ancient Spanish.
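
The abstract does not specify how the time-aware dictionaries are organised. As a minimal sketch, assuming each word form is simply recorded together with the years of the works in which it is attested, the idea might look like this in Python. The toy corpus, the function names and the 150-year window are all hypothetical and chosen only for illustration; they are not taken from the project.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus of (year, text) pairs; NOT data from the Library.
TOY_CORPUS = [
    (1250, "el rey dixo que la muger era fermosa"),
    (1605, "el hidalgo dixo que la muger era hermosa"),
    (1880, "el hombre dijo que la mujer era hermosa"),
]

# Runs of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def build_time_aware_dictionary(corpus):
    """Map each word form to the sorted list of years in which it is attested."""
    attested = defaultdict(set)
    for year, text in corpus:
        for word in WORD_RE.findall(text.lower()):
            attested[word].add(year)
    return {word: sorted(years) for word, years in attested.items()}


def is_accepted(word, year, dictionary, window=150):
    """Accept a word if it is attested within `window` years of `year`.

    The window size is an assumed parameter, not a value from the paper.
    """
    years = dictionary.get(word.lower(), [])
    return any(abs(attested_year - year) <= window for attested_year in years)


def flag_suspects(text, year, dictionary, window=150):
    """Return the words of `text` that are not attested near `year`."""
    return [
        word
        for word in WORD_RE.findall(text.lower())
        if not is_accepted(word, year, dictionary, window)
    ]
```

With this toy data, `dixo` is accepted for a text dated 1600 but flagged in one dated 1900, while `dijo` behaves the other way round. A sliding window of this kind is one possible way to avoid fixing arbitrary chronological limits; it is not necessarily the approach taken in the project.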

In this paper we present the problems we have found, the decisions we have made, and the conclusions and results we have arrived at. We have also been able to extract statistical information on the evolution of the Spanish language through time. The final section of the paper deals with the technical details of this project and the innovative application of digital methods such as the use of TEI and XML markup.

References

William Arms
Digital Libraries
MIT Press, 2000, Cambridge, Massachusetts, ISBN 0-262-01880-8.

C.M. Sperberg-McQueen and Lou Burnard, editors.
Guidelines for Electronic Text Encoding and Interchange (Text Encoding Initiative P3), Revised Reprint, Oxford, May 1999.
TEI P3 Text Encoding Initiative, Chicago - Oxford, May 1994.

Alejandro Bia and Andrés Pedreño.
The Miguel de Cervantes Digital Library: The Hispanic Voice on the WEB.
LLC (Literary and Linguistic Computing) journal, Oxford University Press, (to be published soon) 2000.
Presented at ALLC/ACH 2000, The Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, 21-25 July 2000, University of Glasgow.

David Hunter, Curt Cagle, Dave Gibbons, Nikola Ozu, Jon Pinnock, and Paul Spencer.
Beginning XML.
Programmer to Programmer. Wrox Press, 1102 Warwick Road, Acocks Green, Birmingham, B27 6BH, UK, 1st edition, 2000.

Alejandro Bia
Automating the Workflow of the Miguel de Cervantes Digital Library
Poster at the ACM-DL'2000 Digital Libraries conference, June 2000, Menger Hotel, San Antonio, Texas, USA.

Olumide Owolabi
Efficient pattern searching over large dictionaries
Information Processing Letters, v.47, n.1, 17-21, August 1993.

Real Academia Española
Diccionario de la lengua española
Espasa Calpe, 1992, Madrid.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2001

Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001

Series: ACH/ICCH (21), ALLC/EADH (28), ACH/ALLC (13)

Organizers: ACH, ALLC