Unicode 5.0 and 5.1 and Digital Humanities Projects

Deborah Winthrop Anderson

University of California Berkeley

Work text

In theory, digital humanities projects should rely on standards
for text and character encoding. For character encoding,
the standard recommended by TEI P5 (TEI Consortium, eds.
Guidelines for Electronic Text Encoding and Interchange
[last modified: 03 Feb 2008], http://www.tei-c.org/P5/) is
the Unicode Standard
(http://www.unicode.org/standard/standard.html). The choices
digital projects make in character encoding can be critical,
because they affect text analysis and language processing, as
well as the creation, storage, and retrieval of textual digital
resources. This talk will discuss new characters and important
features of Unicode 5.0 and 5.1 that could affect digital
humanities projects, describe the process of proposing
characters for Unicode, and provide the theoretical
underpinnings for acceptance of new characters by the standards
committees. It will also give specific case studies from recent
Unicode proposals in which certain characters were not accepted,
relaying the discussion in the standards committees on why they
were not approved. This latter topic is important, because
decisions made by the standards committees ultimately will
affect text encoding.

For those characters not in Unicode, the P5 version of the
TEI Guidelines deftly describes what digital projects should do
in Chapter 5 (TEI Consortium, eds. “Representation of
Non-standard Characters and Glyphs,” Guidelines for Electronic
Text Encoding and Interchange [last modified: 03 Feb 2008],
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/WD.html
[accessed: 24 March 2008]), but one needs to be aware of the
new characters that are in the standards approval process. The
presentation will briefly discuss where on public websites to
look for new characters that are “in the pipeline.”
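
As an illustration of the first step a project might take, the
following minimal sketch flags characters in a text that fall
outside the assigned repertoire: private-use characters (general
category Co) and unassigned code points (Cn). The function name
find_nonstandard and the sample string are assumptions made for
this example, not part of TEI or Unicode. The result for Cn
depends on the version of the Unicode Character Database bundled
with the Python interpreter in use.

```python
import unicodedata


def find_nonstandard(text):
    """Return (index, code point, category) for each private-use (Co)
    or unassigned (Cn) character in text.

    Hypothetical helper for illustration only.
    """
    hits = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category in ("Co", "Cn"):
            hits.append((index, f"U+{ord(ch):04X}", category))
    return hits


# U+E000 is the first code point of the Private Use Area.
sample = "abc\ue000def"

# The Unicode version this interpreter knows about; it varies by Python release.
print("Unicode database version:", unicodedata.unidata_version)
print(find_nonstandard(sample))
```

Characters reported this way are candidates either for the
mechanisms described in Chapter 5 of the Guidelines or for
replacement once a standard code point exists.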

The release of Unicode 5.0 in July 2006 added 1,369 new
characters to the standard, and Unicode 5.1, due to be released
in April 2008, will add 1,624 more
(http://www.unicode.org/versions/Unicode5.1.0/). In order to
create projects that take advantage of what Unicode and
Unicode-compliant software offer, one must keep abreast of
developments in this standard and make appropriate changes to
fonts and documents as needed. For projects involving medieval
and historic texts, for example, the release of 5.1 will include
a significant number of European medieval letters, as well as
new Greek and Latin epigraphic letters, editorial brackets and
half-brackets, Coptic combining marks, Roman weights and
measures and coin symbols, Old Cyrillic letters and Old Slavonic
combining letters. The Menota project
(http://www.menota.org/guidelines-2/convertors/convert_2-0-b.page),
EMELD’s “School of Best Practice”
(http://linguistlist.org/emeld/school/classroom/conversion/index.html),
and SIL’s tools (http://scripts.sil.org/Conversion) all provide
samples of conversion methods for upgrading digital projects to
include new Unicode characters.
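
A conversion of this kind can be sketched in a few lines. The
example below is not the method used by Menota, EMELD, or SIL; it
only shows the general idea of replacing project-specific
private-use code points with standard ones. The private-use
values U+E000 and U+E001 and the names PUA_TO_STANDARD and
upgrade are invented for illustration; the targets U+A751 and
U+A75B are assumed here to be among the medieval Latin letters
added in Unicode 5.1. A real project would take its mapping from
its own documentation.

```python
import unicodedata

# Hypothetical mapping from private-use code points (invented for this
# example) to standard code points assumed to have been added in Unicode 5.1.
PUA_TO_STANDARD = {
    0xE000: 0xA751,
    0xE001: 0xA75B,
}


def upgrade(text, mapping=PUA_TO_STANDARD):
    """Replace each mapped private-use character with its standard code point."""
    return text.translate(mapping)


legacy = "\ue000ro \ue001"
converted = upgrade(legacy)

# Report the names of the converted characters as known to this interpreter's
# Unicode database; the fallback string is used if a name is unavailable.
for ch in converted:
    if ord(ch) > 0x7F:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name in this database>"))
```

Keeping the mapping in a single table makes it easy to review and
to extend as further characters move from the private-use area
into the standard.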

Since Unicode is the accepted standard for character encoding,
critical assessment submitted to the body in charge of Unicode,
the Unicode Technical Committee, is generally limited to two
kinds of comment: that a given character is missing from
Unicode, or, for a character that is proposed or already
encoded, critiques of its glyph and name, as well as its
line-breaking properties and sorting position. Chapter 5 of the
TEI P5 Guidelines mentions character properties but does not
discuss line-breaking or sorting, which are now two components
of Unicode proposals and are discussed in annexes and standards
on the Unicode Consortium website (Unicode Standard Annex #14,
“Line Breaking Properties,” and Unicode Technical Standard #10,
“Unicode Collation Algorithm,” both accessible from
www.unicode.org). Users should pay close attention to these two
features, for an incorrect assignment can account for peculiar
layout and sorting behavior in software. Comments on missing
characters, incorrect glyphs or names, and properties should all
be directed to the Unicode online contact page
(http://www.unicode.org/reporting.html). It is recommended that
an addition to Chapter 5 of P5 be made regarding line-breaking
and collation when defining new characters.
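
The effect of sorting position can be seen with a small example.
The sketch below contrasts plain code-point order with a
deliberately crude sort key that strips combining marks and
ignores case. The key is an assumption made for illustration and
is not an implementation of the Unicode Collation Algorithm; the
word list and the name crude_key are likewise hypothetical.

```python
import unicodedata

words = ["zebra", "Ähre", "apple", "Zug"]


def crude_key(word):
    """Crude stand-in for a collation key: decompose, drop combining
    marks, and fold case. This is NOT the Unicode Collation Algorithm."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


# Code-point order puts uppercase before lowercase and accented letters last.
print(sorted(words))                  # ['Zug', 'apple', 'zebra', 'Ähre']

# The crude key groups letters the way a reader would more likely expect.
print(sorted(words, key=crude_key))   # ['Ähre', 'apple', 'zebra', 'Zug']
```

A character whose collation weight is assigned incorrectly would
be misplaced in much the same way as “Ähre” in the first result,
which is why sorting position deserves review when a character is
proposed.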

The Unicode Standard will, with Unicode 5.1, have over
100,000 characters encoded, and proposals are underway
for several unencoded historic and modern minority scripts,
many through the Script Encoding Initiative at UC Berkeley
(http://www.linguistics.berkeley.edu/sei/alpha-script-list.html).
Reviewing the glyphs, names, and character properties for
this large number of characters is difficult. Assistance from
the academic world is sought for (a) authoring and review of
current proposals for unencoded characters and scripts, and (b)
proofing the beta versions of Unicode. With the participation
of digital humanists, this character encoding standard can be
made a reliable and useful standard for such projects.

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

ADHO - 2008

Hosted at University of Oulu

Oulu, Finland

June 25, 2008 - June 29, 2008

135 works by 231 authors indexed

Conference website: http://www.ekl.oulu.fi/dh2008/

Series: ADHO (3)

Organizers: ADHO
