Digital Image and Text Archives for Japanese Classical Literature

Shoichiro Hara; Hisashi Yasunaga

1. Introduction

The National Institute of Japanese Literature (NIJL), founded in 1972, is one of Japan's inter-university research institutes. It was established to survey most of the printed and handwritten Japanese classical materials from the Edo period (1603-1868) and earlier, and to collect the originals and/or microfilm reproductions in order to preserve them and provide public access. Over more than two decades of activity, the NIJL has established itself as the center of archival activity. At present, the NIJL provides only three catalogue databases. However, additional catalogue data, fulltext databases, and an image archive are under preparation, and they will be made public within this year as part of a digital library system. In the following, chapter 2 describes the present NIJL information system and the background of the digital library project, and chapter 3 outlines the digital library system. Finally, chapter 4 describes a new study, the "Digital Study System" for the humanities.

2. Present NIJL Information System

The NIJL's information system comprises computers, networks, and printing devices. Using this system, the NIJL provides the following three catalogues as an online database service and as printed materials.

1) Catalogue of Holding Microfilms of Manuscripts and Printed Books on Japanese Classical Literature,

2) Catalogue of Holding Manuscripts and Printed Books on Japanese Classical Literature, and

3) Bibliography of Research Papers on Japanese Classical Literature.

A feature of the NIJL's information system is that all data processing, from data compilation and correction through database service to publishing, is executed on a mainframe computer system. However, over more than ten years of operation, the system has accumulated many unsolved software and hardware problems. To solve these problems, we started the digital library project for Japanese classical literature. Over several years, this project will downsize the mainframe computer system and reconstruct it as a so-called distributed computer system. The key words of the digital library project are "standardization of data," "data independent from systems" and "multimedia oriented."

3. Digital Library System

The digital library system consists of catalogue databases, fulltext databases, and image archives.

3.1 Catalogue Databases

The NIJL's databases were designed more than 10 years ago around the devices available at that time. As the latest computer system cannot support these devices, we are taking this opportunity to reconstruct the whole database system. Having reviewed the old systems, we adopted a new policy of making data independent of hardware and software; specifically, we introduced SGML to describe the data. At present, we are reconstructing the three catalogue databases listed above. Two further catalogue databases, 1) the Union Catalogue of Japanese Classical Materials and 2) the Catalogue of Historical Materials, are also under preparation. All these catalogue databases will be made public within this year as part of the digital library system. One of the main dissatisfactions expressed by catalogue database users has been that "the catalogue databases are undoubtedly useful to find the existence of materials, but accessing the materials themselves is difficult for distant users (especially for foreign users)." Fulltext databases and image archives are our response to this complaint.

3.2 Fulltext Database

The NIJL has been organizing fulltext data since 1987. At present, the following four text collections have been compiled.

1) Anthology of Japanese Classical Literature (Nihon-Koten-Bungaku-Taikei: 100 volumes, about 560 works),

2) Anthology of Story Telling (Hanashibon-Taikei: 20 volumes, about 320 works, about 20,000 stories),

3) Anthology of Story in KANA (Kana-Zoshi-Shusei: 12 volumes, about 70 works, about 1,000 stories), and

4) Anthology of Poem in Shoho Version (Shoho-Hanpon-Kashu: 21 volumes).

Among these, the "Anthology of Japanese Classical Literature" and the "Anthology of Poem in Shoho Version" can be accessed on the World Wide Web. When we began constructing the fulltext databases, SGML was not widely used in Japan, and unfortunately there were no SGML applications that could process the Japanese language. For these reasons, we created our own text markup rules, which resemble SGML in their basic idea. We call them the "KOKIN Rules" (from KOKubungaku, meaning Japanese literature, and INformation; "KOKIN" is also the title of a famous anthology of classical Japanese poetry). Because the KOKIN rules were designed to be easy for researchers of Japanese classical literature to understand and use, all the fulltext data at the NIJL were compiled according to them. However, because the KOKIN rules are independent of any other standard, few tools exist to parse and check KOKIN text. Meanwhile, SGML has recently come to be regarded as an encoding scheme for exchanging text data among systems. Against this background, we decided to convert our KOKIN-marked text to SGML-marked text so that the data can circulate effectively.
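As a rough illustration of what converting line-oriented markup into SGML-style tagged text can involve, the following Python sketch may help. The actual KOKIN rules are not reproduced in this paper, so the flag syntax ("@T", "@A"), the element names, and the function names below are all assumptions invented for the example; this is not the NIJL's conversion tool.

```python
from html import escape

# Hypothetical flag-to-element table; neither side is taken from the real KOKIN rules.
ELEMENT_FOR_FLAG = {"T": "title", "A": "author"}


def convert_line(line: str) -> str:
    """Turn one line of toy flag markup into one SGML-style element."""
    if len(line) > 2 and line[0] == "@" and line[2] == " ":
        flag, content = line[1], line[3:]
        if flag not in ELEMENT_FOR_FLAG:
            # Reject unknown flags so markup errors surface during conversion.
            raise ValueError(f"unknown flag: @{flag}")
        element = ELEMENT_FOR_FLAG[flag]
    else:
        # Lines without a flag are treated as running text.
        element, content = "p", line
    # Escape the markup delimiters &, < and > inside the content.
    return f"<{element}>{escape(content, quote=False)}</{element}>"


def convert(text: str, root: str = "work") -> str:
    """Convert a whole toy-markup document, skipping blank lines."""
    body = [convert_line(line) for line in text.splitlines() if line.strip()]
    return "\n".join([f"<{root}>", *body, f"</{root}>"])


if __name__ == "__main__":
    sample = "@T Sample Title\n@A Sample Author\nFirst line of text."
    print(convert(sample))
```

Raising an error on an unknown flag doubles as a simple check of the source markup, which is the kind of parsing and checking support that a widely adopted standard tends to provide through ready-made tools.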

3.3 Image Archives

The NIJL has collected microfilm reproductions of classical materials. At present, the NIJL holds about 15,000,000 image frames on microfilm. About 90% of them reproduce materials held outside the NIJL, and the remainder reproduce the NIJL's own holdings. The image archive is derived from the microfilms of the NIJL's own holdings, both to avoid copyright problems and to speed construction. Each image is scanned as a 1-bit (bilevel) image at 600 DPI, compressed with the G4 (CCITT Group 4) method, and stored in TIFF format. The image database is linked with the database of the "Catalogue of Holding Manuscripts and Printed Books on Japanese Classical Literature." Users of the image archive first consult the catalogue database to find the materials they want, and then access the images by following the link between the two databases (this link is based on the call number of the materials, which appears in both databases). We digitized about 400,000 frames of microfilm in 1996 and about 150,000 frames in 1997.
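The call-number link between the two databases can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the record fields, the function names, the sample call numbers, titles, and file paths are all hypothetical and do not describe the NIJL's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueRecord:
    call_number: str  # key shared with the image database
    title: str


@dataclass(frozen=True)
class ImageFrame:
    call_number: str  # same key as in the catalogue
    frame_no: int
    tiff_path: str


def build_image_index(frames):
    """Group frames by call number, ordered by frame number."""
    index = {}
    for frame in frames:
        index.setdefault(frame.call_number, []).append(frame)
    for group in index.values():
        group.sort(key=lambda f: f.frame_no)
    return index


def search_catalogue(records, keyword):
    """Step 1: find catalogue records whose title contains the keyword."""
    needle = keyword.lower()
    return [r for r in records if needle in r.title.lower()]


def images_for(record, index):
    """Step 2: follow the call-number link to the record's frames."""
    return index.get(record.call_number, [])


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    records = [CatalogueRecord("A-001", "Sample Waka Collection"),
               CatalogueRecord("B-002", "Sample Tale")]
    frames = [ImageFrame("A-001", 2, "a001_0002.tif"),
              ImageFrame("A-001", 1, "a001_0001.tif")]
    index = build_image_index(frames)
    for record in search_catalogue(records, "waka"):
        print(record.title, [f.tiff_path for f in images_for(record, index)])
```

Keeping the call number as the only shared field means that either database can be rebuilt or moved to other software without touching the other, as long as the call numbers stay stable.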

4. Digital Study System

We have constructed various kinds of databases. However, over the past few years of development, we have recognized that these databases alone cannot always contribute to the research activities of humanities researchers. A database is only a bank of raw material data, whereas valuable results are produced within individual research environments. We therefore feel that effective tools are needed to support researchers' own methods and skills. The Digital Study System is a user-side tool intended to let humanities researchers organize multimedia data according to their own methods and skills. It contains an "Image Annotation Program," a "Version Control Mechanism," and a "Text Analyzer." The Image Annotation Program is the center of the Digital Study System; it allows researchers to attach text annotations to a particular position or area on an image. If a researcher attaches keywords or codes to images, they can retrieve the desired images by searching for a specific string among the annotations. In the same way, a researcher can collect images on a specific subject. Furthermore, images from different materials can be linked; for example, by attaching the same keywords or codes to several materials, a researcher can compare the versions of a specific sentence in the authentic text and its variants. The "Version Control Mechanism" constructs a version tree showing the history of data development. By reviewing this history, users can assess the quality of the data for themselves. The "Text Analyzer" is a collection of programs for lexical analysis, vocabulary statistics, and so on. At present, we are examining some programs to see whether they will be helpful in constructing the tool. These programs will have a modular architecture so that researchers can carry out complex text analysis by easily assembling tools.
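The annotation idea described above can be sketched in a few lines of Python. This is a minimal sketch, not the Image Annotation Program itself: the class and method names, the pixel-rectangle representation of an area, and the sample image identifiers and keyword codes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    image_id: str
    region: tuple  # assumed form: (x, y, width, height) in pixels
    text: str      # free text, keywords, or codes


class AnnotationStore:
    """Holds annotations and retrieves images by searching their text."""

    def __init__(self):
        self._annotations = []

    def annotate(self, image_id, region, text):
        """Attach a text annotation to an area of an image."""
        note = Annotation(image_id, region, text)
        self._annotations.append(note)
        return note

    def search(self, query):
        """Return every annotation whose text contains the query string."""
        return [a for a in self._annotations if query in a.text]

    def images_matching(self, query):
        """Return the distinct image ids with a matching annotation, in insertion order."""
        seen = []
        for note in self.search(query):
            if note.image_id not in seen:
                seen.append(note.image_id)
        return seen


if __name__ == "__main__":
    store = AnnotationStore()
    # The same hypothetical code "POEM-12" marks one passage in two versions.
    store.annotate("version_a/frame_0010", (120, 40, 300, 80), "POEM-12 authentic text")
    store.annotate("version_b/frame_0007", (90, 60, 310, 85), "POEM-12 variant")
    store.annotate("version_a/frame_0011", (0, 0, 50, 50), "seal")
    print(store.images_matching("POEM-12"))
```

Because the shared keyword is the only link between images, a researcher can group materials by any scheme they choose without changing the underlying image archive.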

5. Conclusion

The NIJL is reconstructing its databases using SGML to cope with the multimedia age. This reconstruction project is on the right track. We have also begun a software development program to support the individual research environment. Some of these databases will be made public within this year as part of the digital library system.

Full text license: This text is republished here with permission from the original rights holder.

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1998

"Virtual Communities"