Dealing with difficult characters: XML encoding and the Private Use Area of Unicode

Odd Einar Haugen

Authorship

1. Odd Einar Haugen

University of Bergen

Original URL

http://web.archive.org/web/20040903094300/http://www.hum.gu.se/allcach2004/AP/html/prop68.html

Work text

This paper will discuss the encoding of medieval characters in XML documents, based on the recent publication of a character recommendation by the Medieval Unicode Font Initiative (MUFI), http://www.hit.uib.no/mufi. Although an impressive number of characters have been included in the Unicode standard, many characters and variant letter forms are still missing, and it is an open question whether these will be accepted by the standard. Furthermore, new precomposed characters -- i.e. characters with various diacritics encoded as a single code point -- will no longer be accepted by Unicode. Such characters can be encoded as sequences of a base character and combining marks and displayed with smart font technology, but support for this technology is not widespread and there are cross-platform problems.

Unicode has set aside a Private Use Area (PUA) for the encoding of characters by individual projects or user groups. The PUA will never be assigned official characters, but it is as well supported as any other part of the standard. Font editors such as FontLab allow users to define fonts with characters in the PUA, and these fonts will by and large be displayed correctly in applications with Unicode support. Documents containing characters in the PUA can be interchanged across applications and platforms, but they will only display the intended characters if the same font is installed or if a specific user group has agreed on a local standard for the PUA. In spite of the obvious compatibility problems, the PUA can prove to be a good solution in the short or even medium term.

There are in fact three separate PUAs in the Unicode standard. The first of these is the one in the Basic Multilingual Plane, covering 6,400 code points from U+E000 to U+F8FF. More recently, two supplementary planes, nos. 15 and 16, have been set aside, each containing 65,534 code points. The supplementary planes are not well supported yet, so the PUA of the Basic Multilingual Plane has received most attention. Commercial companies have been using some code points in this area, especially towards the end of the range, but otherwise it is being used by various interest groups, many within the academic community. One example is the Titus project, which has allocated several thousand characters for linguistic usage to this area. Other font projects are following, such as the Junicode font for Old and Medieval English, and Alphabetum, a multi-purpose font with many characters for the classical languages.
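The three ranges can be written down and checked in a few lines. The following is a minimal sketch in Python; the names PUA_RANGES, pua_name and pua_sizes are illustrative and do not come from the MUFI recommendation. The supplementary ranges stop at xFFFD because the last two code points of every plane are noncharacters, which is why each supplementary PUA holds 65,534 rather than 65,536 code points.

```python
import unicodedata

# The three Private Use Areas as inclusive code point ranges.
PUA_RANGES = {
    "BMP": (0xE000, 0xF8FF),
    "Plane 15": (0xF0000, 0xFFFFD),
    "Plane 16": (0x100000, 0x10FFFD),
}


def pua_name(char):
    """Return the name of the PUA that contains char, or None."""
    code_point = ord(char)
    for name, (first, last) in PUA_RANGES.items():
        if first <= code_point <= last:
            return name
    return None


def pua_sizes():
    """Return the number of code points in each PUA."""
    return {name: last - first + 1 for name, (first, last) in PUA_RANGES.items()}


def is_private_use(char):
    """Cross-check against the Unicode database: category 'Co' is Private Use."""
    return unicodedata.category(char) == "Co"
```

A call such as pua_name("\uE000") identifies the area a character belongs to, and is_private_use gives the same answer from the Unicode character database without hard-coded ranges.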

Version 1.0 of the MUFI recommendation has focused on Medieval Nordic characters, but it is expected that the coming versions will cover additional national or regional characters. It should be noted, though, that many characters and letter forms were used across large areas of Europe, so it is often misleading to assign characters to specific areas. For Medieval Nordic primary sources, the recommendation lists approx. 800 characters, of which only approx. 400 are in the Unicode standard. The remaining characters have been divided into 20-odd subranges in the PUA, in three main categories: base characters, precomposed characters and variant letter forms. At the moment, several fonts are being developed or extended to include the PUA of the MUFI recommendation, and they will in due course be made available on the MUFI site. Fortunately, TrueType fonts are now supported by all major platforms (Linux, Mac, Windows), so the whole set of MUFI characters can be encoded in a single font.

Although a Unicode font can be used as easily as any ordinary 8-bit font in a word processor, many academic projects would like to encode texts in a more interchangeable format than the one offered by word processors. Of the SGML derivatives, Extensible Markup Language (XML) has proven to be a robust and versatile encoding language, in spite of the well-known limitations of this standard, notably the encoding of concurrent (overlapping) or discontinuous structures.

Basically, an XML document is divided into a header with general information about the text and the nature of the encoding, and the body, with the text proper. The header refers to a Document Type Definition (DTD) with further specifications of the encoding. An XML document is by default encoded according to the Unicode standard; this is specified in the first line of an XML document, which typically reads <?xml version="1.0" encoding="UTF-8"?>. However, the use of characters from the full Unicode standard may cause problems when files are interchanged or processed by older software without Unicode support. For this reason, it may be advisable to encode all characters outside Basic Latin (a--z / A--Z) with entities.
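One simple way of keeping a file within the ASCII range is to write every other character as a numeric character reference, which any XML parser resolves without a DTD. The sketch below, in Python, illustrates this; it uses numeric references rather than the named entities discussed in this paper, and the function name to_ascii_xml is an assumption made for the example.

```python
from xml.sax.saxutils import escape


def to_ascii_xml(text):
    """Return text as ASCII-only XML character data.

    Markup characters (&, <, >) are escaped first; every character
    outside the ASCII range is then written as a hexadecimal numeric
    character reference such as &#x01EB;.
    """
    escaped = escape(text)
    return "".join(
        char if ord(char) < 128 else f"&#x{ord(char):04X};"
        for char in escaped
    )
```

The output survives transfer through software without Unicode support, since the file itself contains only ASCII characters, but it is harder for a human editor to read than named entities.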

The DTD will be needed for the definition of the entities, and by linking each entity to a corresponding code point in the Unicode standard -- including the PUA -- this can be done in a simple and efficient manner. With an appropriate font, texts encoded in this manner will be shown correctly.
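The linking of entities to code points can be illustrated with a small document whose entity declarations sit in the internal DTD subset. The following Python sketch parses such a document with the standard library. The entity names and the element name are chosen for the example, and the PUA code point U+E000 is an arbitrary placeholder rather than an actual MUFI assignment.

```python
import xml.etree.ElementTree as ET

# A tiny document with entity declarations in the internal DTD subset.
# "puaexample" points at an arbitrary PUA code point (placeholder only).
DOCUMENT = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE text [
  <!ENTITY thorn      "&#x00FE;">
  <!ENTITY oogon      "&#x01EB;">
  <!ENTITY puaexample "&#xE000;">
]>
<text>&thorn;&oogon;r &puaexample;</text>
"""


def parse_text(document=DOCUMENT):
    """Parse the document and return its text with all entities resolved."""
    root = ET.fromstring(document)
    return root.text
```

The file itself contains only ASCII characters, while the parsed text contains the Unicode characters, including the one in the PUA. Whether the PUA character is then displayed correctly depends entirely on the font in use.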

This paper will discuss how this can be effected for the encoding of Medieval Nordic primary sources, so that the usage of the PUA is sufficiently well documented. Of particular interest here is the question of decomposition. Many Latin characters with diacritics, such as 'á', 'è' and 'ô', have been encoded in precomposed form in Unicode (for compatibility with older ISO standards), but new combinations must be encoded as a sequence of a base character followed by one or more combining characters; that applies for example to a commonly used character in Old Norse poetry, the 'o' with ogonek and acute accent. A robust solution for the display of decomposed characters is thus strongly needed, and this solution should by analogy also be tried and tested for the already encoded precomposed characters. As a consequence, simplified keyboard layouts can be designed for medium-size character inventories such as the one needed for Medieval Nordic characters.
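The difference between the two kinds of characters shows up directly in Unicode normalization. The following Python sketch, using only the standard unicodedata module, composes and decomposes the characters mentioned above; the helper names are illustrative.

```python
import unicodedata

COMBINING_OGONEK = "\u0328"
COMBINING_ACUTE = "\u0301"


def compose(text):
    """Normalization Form C: compose wherever a precomposed form exists."""
    return unicodedata.normalize("NFC", text)


def decompose(text):
    """Normalization Form D: base character followed by combining marks."""
    return unicodedata.normalize("NFD", text)


def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(char):04X}" for char in text]


# 'a' with acute has a precomposed form, so it composes to one code point.
a_acute = compose("a" + COMBINING_ACUTE)

# 'o' with ogonek and acute has no single precomposed form. Composition
# stops at 'o with ogonek' (U+01EB), and the acute remains a separate
# combining character.
o_ogonek_acute = compose("o" + COMBINING_ACUTE + COMBINING_OGONEK)
```

Here code_points(a_acute) yields a single code point, U+00E1, while code_points(o_ogonek_acute) yields two, U+01EB followed by U+0301. Correct display of the latter depends on the font and rendering software positioning the accent over the base character.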

The PUA is at best a medium-term solution. The long-term solution is obviously to propose characters for the Unicode standard, with the aim of reducing the need for PUA encoding. The paper will conclude by discussing procedures for decommissioning characters in the PUA that have been accepted by the official Unicode standard.
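One possible form such a procedure could take is a documented mapping table from each decommissioned PUA code point to its official replacement, applied to existing texts in a single pass. The Python sketch below is an assumption about how this might be done, not the procedure proposed in the paper, and the mapping it contains is invented for illustration.

```python
import unicodedata

# Hypothetical mapping from a PUA code point to an official code point.
# U+E000 is an arbitrary placeholder, not a real MUFI assignment; the
# target U+A765 is LATIN SMALL LETTER THORN WITH STROKE.
PUA_TO_STANDARD = {
    0xE000: "\uA765",
}


def decommission(text, mapping=PUA_TO_STANDARD):
    """Replace mapped PUA characters and report those that remain.

    Returns the migrated text together with a sorted list of the
    private use characters that had no entry in the mapping.
    """
    migrated = text.translate(mapping)
    leftover = sorted(
        {char for char in migrated if unicodedata.category(char) == "Co"}
    )
    return migrated, leftover
```

Reporting the leftover characters makes it possible to see at a glance which PUA code points a text still depends on after the migration.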

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

Conference website: http://web.archive.org/web/20040815075341/http://www.hum.gu.se/allcach2004/

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC
