An Approach to Information Access and Knowledge Discovery from Historical Documents

poster / demo / art installation
Authorship
  1. Fuminori Kimura

    Ritsumeikan University

  2. Akira Maeda

    Ritsumeikan University


1. Introduction
Recently, libraries, governments, and major internet
providers have been forming consortia to preserve historical
documents stored in libraries (e.g. Google Book
Search, Open Content Alliance, World Digital Library,
HathiTrust). This means that more and more old text
content will be accessible on the internet in the near future.
The huge amount of knowledge in old documents
is as important as that in the recent born-digital documents
typically available on the web, because old documents
are a collection of wisdom dating back to the B.C. era. It is
therefore useful to be able to access such old documents, and
even more useful to discover the hidden knowledge and
wisdom written in them.
To achieve this, it is necessary to retrieve
important information from old documents. However,
old documents are not always easy to retrieve, mainly
because language and culture change substantially
over time. Therefore, we need a method to access old
documents written in an ancient language using a query
in a modern language. We call this method “Cross-Age
Information Retrieval”. Moreover, we should consider
cultural differences over time, even within the same language.
For this, we need a method of “Cross-Cultural
Information Retrieval”.
Most research on information retrieval and information
access focuses on documents written in modern
language, but we believe that the knowledge and wisdom
written in old documents provide rich and valuable information
that is not available in modern-language
documents, especially in web content.
We propose a “Cross-Age Information Retrieval” method
to tackle these problems. It aims to discover
hidden knowledge and wisdom written in old documents.
2. Related Work
Much research on Cross-Language Information Retrieval
has been conducted in the last 10 years, against the background
of the rapid growth of the web around the world
since the mid-1990s. Various approaches, including
query translation, document translation, and the use
of an intermediate language, have been studied, and for certain
language pairs (e.g. between European languages),
adequate retrieval effectiveness has been achieved.
In contrast, there is very little research on information
retrieval methods for historical documents, and most
of it is based on simple keyword matching. Recently,
some approaches have been proposed to access
historical documents, and these can be regarded as a kind
of Cross-Age Information Retrieval (Gerlach et al. 2007;
Khaltarkhuu et al. 2006). Our goal is to establish a more
effective and sophisticated retrieval method that considers
not only language differences over time, but also cultural
differences between languages and ages.
3. The Proposed Method
We adopt a dictionary-based query translation approach,
since it has proven to be the most effective method for
Cross-Language Information Retrieval. For dictionary-based
methods to be effective, we need
precise and comprehensive dictionaries for both the modern
language and the ancient language. Using these two dictionaries,
we try to discover relationships between their entries
and to “translate” the query terms in the
modern language into equivalent terms in the ancient language.
For this translation process, we propose the following
method (Fig. 1):
Fig. 1 Overview of the proposed method for Cross-Age
Information Retrieval.
1. For each entry in the modern dictionary, we look
for an equivalent entry in the archaic word dictionary
by calculating the similarities between the definition
of the modern word and all the definitions of the
archaic words. For this process, we can use a standard
text similarity measure based on the vector space model and the tf-idf term weighting scheme.
2. Then, we take the most similar definition in the archaic
word dictionary, and regard that entry (headword)
as an equivalent of the modern word.
3. If more than one equivalent entry exists, we disambiguate
the translation candidates using a term
association measure such as mutual information, to
find the closest archaic equivalent for the modern
word.
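Steps 1 and 2 can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_archaic_equivalents`), the smoothed idf formula, the whitespace tokenizer, and the English toy glosses with placeholder headwords are all assumptions made for the example. Real Japanese dictionary definitions would need a morphological analyzer for tokenization.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace splitting is a simplification (assumption); Japanese
    # definitions would need a morphological analyzer instead.
    return text.lower().split()


def tfidf_vectors(token_lists):
    """Build one sparse tf-idf vector (term -> weight) per document."""
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    # Smoothed idf (+1) so that no term receives a zero weight.
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_archaic_equivalents(modern_definition, archaic_dict):
    """Rank archaic headwords by definition similarity (steps 1 and 2)."""
    headwords = list(archaic_dict)
    token_lists = [tokenize(modern_definition)]
    token_lists += [tokenize(archaic_dict[h]) for h in headwords]
    vectors = tfidf_vectors(token_lists)
    scores = [(h, cosine(vectors[0], v)) for h, v in zip(headwords, vectors[1:])]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy glosses, not taken from any real dictionary.
archaic_dict = {
    "archaic_A": "armed fight between armies",
    "archaic_B": "walking without shoes on the feet",
    "archaic_C": "memorial rite held at a temple",
}
ranking = rank_archaic_equivalents(
    "armed fight between states or armies", archaic_dict
)
best_headword = ranking[0][0]  # top-ranked candidate translation
```

If several headwords score close to the top, all of them would be kept as translation candidates and passed on to the disambiguation in step 3, which this sketch does not cover.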
4. Document Collections
Currently, it is not very easy to obtain historical documents
in text format. However, some digital libraries
(e.g. Google Book Search, Open Content Alliance)
are ready to provide their collections of historical documents
in text format for research purposes. Moreover,
numerous old documents are already available
online. In Japan, there is a volunteer-based effort called
“Aozora Bunko” to digitize and make accessible online over
7,000 copyright-expired classic literary works. Also,
many universities and institutions already
provide collections of old documents in text format.
We can use these huge collections of old documents for
our proposed method.
For now, we are focusing on a Japanese historical document
called “Hyohanki”, which was written in the late Heian
era (12th century) in Japan. It is a valuable resource
for research on the Japanese culture of that period.
An example of its original copy is shown in Fig. 2. Although
some parts of it have deteriorated or are missing,
all of the existing pages have been digitized into text format.
The existing pages consist of 2,488 diary entries.
Fig. 2 Example of the original copy of a historical
Japanese document “Hyohanki”.
5. Language Resources
As described in Section 3, we need dictionaries in order
to translate a modern-language query into archaic term(s).
In the case of “Hyohanki”, we can use existing
electronic dictionaries available on CD-ROM. For modern
Japanese, we use “Kojien”, one of the most
famous and comprehensive Japanese language dictionaries.
For the ancient language, we use “Kokugo-Daijiten”,
which covers not only modern words but also archaic
words.
6. Preliminary Experiment
We conducted a preliminary experiment to test the precision
of “Cross-Age retrieval” with our proposed method.
In this experiment, we used the diary entries of “Hyohanki”
as the ancient Japanese document collection, and prepared
3 modern Japanese queries, “戦争 (war)”, “法要
(Buddhist service)”, and “裸足 (bare foot)”. Since the archaic
equivalent of each query is worded differently,
no relevant documents can be retrieved if we use
the modern query terms as they are. Note that we consider one
diary entry as one document.
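The evaluation setup can be illustrated with a small sketch, again under stated assumptions: each diary entry is one document, retrieval is plain substring matching on the query term(s), and precision is the ratio of relevant documents among the retrieved ones. The entries below are invented toy strings, not text from “Hyohanki”. The relevance judgments and the function names `retrieve` and `precision` are hypothetical, and the terms “戦” and “軍” are used purely as illustrative translation candidates.

```python
def retrieve(entries, terms):
    """Return ids of entries that contain at least one term as a substring."""
    return [doc_id for doc_id, text in entries.items()
            if any(term in text for term in terms)]


def precision(retrieved, relevant):
    """Ratio of relevant documents among the retrieved documents."""
    if not retrieved:
        return 0.0  # convention chosen for this sketch when nothing is retrieved
    return sum(1 for doc_id in retrieved if doc_id in relevant) / len(retrieved)


# Invented toy entries, NOT text from "Hyohanki"; one entry = one document.
entries = {
    1: "今日合戦あり",
    2: "法会を行ふ",
    3: "戦の事を聞く",
    4: "将軍参内す",
}
relevant_to_war = {1, 3}  # hypothetical relevance judgments

modern_hits = retrieve(entries, ["戦争"])     # modern query term used as-is
archaic_hits = retrieve(entries, ["戦"])      # assumed archaic translation
mixed_hits = retrieve(entries, ["戦", "軍"])  # adds an ambiguous candidate

p_archaic = precision(archaic_hits, relevant_to_war)
p_mixed = precision(mixed_hits, relevant_to_war)
```

In this toy collection the modern term matches no entry, the single archaic term retrieves only relevant entries, and adding an ambiguous candidate pulls in an irrelevant entry and lowers precision.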
Table 1 shows the original modern Japanese query, the ancient
Japanese term(s) translated by the proposed method,
and the precision of retrieval using the translated
term(s). For the queries “法要 (Buddhist service)” and
“裸足 (bare foot)”, the proposed method worked quite
well and achieved almost 100% precision (the ratio of
relevant documents among the retrieved documents). However,
the query “戦争 (war)” resulted in very poor precision
(27%). The reason is that the proposed method returned
two translation candidates (i.e. “戦” and “軍”) for
this query. If we take only “戦” as the translated query,
we achieve 100% precision, but if we take only
“軍”, we obtain only 3.6% precision. This is because
the archaic term “軍” means not only “war” but
also “general (officer)” and “army”.
The query “死亡 (death)” also resulted in very poor precision
(15%), because its translation
“没” has several meanings: “death”, “deprivation”, and
“sunset”. These results suggest that we could improve
the precision by incorporating a suitable disambiguation
method for the translated archaic terms. For that purpose,
we could apply existing disambiguation methods
used in Cross-Language Information Retrieval, such as
mutual information.
7. Conclusion
In this paper, we proposed a novel information retrieval
technique called “Cross-Age Information Retrieval”,
which can be used to access old documents written in
ancient language using a query in modern language. We
conducted a preliminary experiment to test the precision
of cross-age retrieval by our proposed method. The experimental
results showed that our proposed method is
potentially useful for cross-age retrieval. Although our
proposed technique is still in an early stage, we believe
that we can achieve adequate retrieval effectiveness by
incorporating techniques used for Cross-Language Information
Retrieval.
Our goal is not only to realize cross-age retrieval, but
also to extend this technique to more advanced text mining
applications in order to discover hidden knowledge
and wisdom in the large number of premodern documents
that are now available in digital form.
Our future work includes resolving the ambiguity of translated
archaic terms, large-scale experiments in other
languages such as English, and consideration of cultural differences
over time, thereby extending our technique to
realize cross-age, cross-cultural, and cross-language information
access.
References
Gerlach, A. E. and Fuhr, N. (2007). Retrieval in text collections
with historic spelling using linguistic and spelling
variants. In Proceedings of the 7th ACM/IEEE Joint
Conference on Digital Libraries (JCDL 2007), pp. 333-341.
Khaltarkhuu, G. and Maeda, A. (2006). Retrieval Technique
with the Modern Mongolian Query on Traditional
Mongolian Text. In Proceedings of the 9th International
Conference on Asian Digital Libraries (ICADL 2006),
pp. 478-481.


Conference Info

Complete

ADHO - 2009

Hosted at University of Maryland, College Park

College Park, Maryland, United States

June 20, 2009 - June 25, 2009


Organizers: ADHO
