Automatic Techniques for Generating and Correcting Cultural Heritage Collection Metadata

Authorship
  1. Antal van den Bosch

    Tilburg University

  2. Caroline Sporleder

    Tilburg University

  3. Marieke van Erp

    Tilburg University

  4. Stephen Hunt

    Tilburg University

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

In cultural heritage domains the central element of interest, in terms of information and knowledge but also in terms of objective and subjective value, is the object. Text is an important medium in retrieving objects, or information about these objects. Even in cases of non-textual cultural heritage collections, text plays a crucial role, since text is the pervasive medium used for metadata, next to numeric identification codes and measurements (Chapman, 2005). At the same time, language is a noisy and ambiguous medium, and usage of textual metadata ultimately implies robustness to noise and proper disambiguation. If the goal is to automate the process of searching and retrieving information about objects from digitized metadata, language technology can aid in disambiguating, translating, and correcting textual metadata (Bontcheva et al., 2002). In this paper we describe ongoing work in the MITCH (Mining Information in Texts from the Cultural Heritage) project, in which we employ diverse language technologies to enrich textual metadata of a large object collection in the natural history domain.
Textual metadata can take many forms. Many document collections are made searchable through the use of index systems (pre-digital as well as digital) that use keywords, titles, and proper names (e.g. of authors). Alternatively, objects may be described more verbosely in annotations, captions, explanatory texts for educational purposes or exhibitions, or scholarly publications. A third type of textual metadata is the thesaurus or ontology, which aims to capture a standardized list or set of terms that occur in the domain, and possibly some basic relations between them such as synonymy, hyponymy, and hyperonymy, or more domain-specific relations. Fourth, auxiliary resources peripheral to the domain may exist that contain information of overlapping interest, such as lists of geographical names.
There is a need, clearly expressed by many cultural heritage
institutions, for digitizing the metadata of cultural heritage
collections, in order to make them searchable and accessible at
unprecedented scales. After the basic metadata is digitized,
allowing simple keyword-based search, an important next step
is the enrichment phase, the result of which enables more
complex and advanced types of search that better resemble the
typical questions of scholarly researchers. The key step in the
enrichment phase is to link all resources into a single network.
By establishing all meaningful links, a particular object
effectively becomes a node in this network, meaningfully
connected to all other entities that have a direct relation to that
object, such as the artist who created it, or the location in which
it was found.
By traversing the network, a search engine can explore complex
relations among sets of objects, and unions and intersections
of such sets, allowing, for instance, biologists to perform
large-scale searches for patterns and trends, such as the
occurrence of animals in a certain family, in particular stretches
of time (e.g. the last century) and space (e.g. the Amazon rain
forest). To achieve this goal, we develop language technology
of two types: (1) preprocessing technology for the automatic
correction and normalization of our textual resources, and (2)
metadata enrichment technology for identifying relevant
concepts in all textual metadata, and linking all resources into
a connected, searchable network.
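
To make the idea of such a network concrete, the following Python sketch (not part of the MITCH system; all record fields, identifiers, and values are invented for illustration) builds a toy graph with the networkx library in which specimens, taxa, collectors, and locations are nodes, and then answers a combined taxon, place, and time query by intersecting neighbour sets.

    # Toy sketch of an enriched metadata network; all identifiers,
    # fields, and values below are invented for illustration.
    import networkx as nx

    g = nx.Graph()
    specimens = [
        {"id": "SPEC-001", "taxon": "Felidae", "collector": "J. Smith",
         "location": "Amazon basin", "year": 1923},
        {"id": "SPEC-002", "taxon": "Felidae", "collector": "J. Smith",
         "location": "Amazon basin", "year": 1924},
    ]
    for s in specimens:
        g.add_node(s["id"], kind="specimen", year=s["year"])
        for attr in ("taxon", "collector", "location"):
            g.add_node(s[attr], kind=attr)
            g.add_edge(s["id"], s[attr], relation=attr)

    # Query: all Felidae specimens found in the Amazon basin in the
    # 20th century -- an intersection of two neighbour sets plus a
    # constraint on a node attribute.
    amazon = set(g.neighbors("Amazon basin"))
    hits = [n for n in g.neighbors("Felidae")
            if n in amazon and 1900 <= g.nodes[n]["year"] < 2000]
    print(hits)  # ['SPEC-001', 'SPEC-002']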
Preprocessing: Automatic metadata cleanup
The first step in enriching textual metadata is to make sure
that the textual metadata is as correct and consistent as
possible. In Sporleder et al. (2006a, 2006b) we describe two
complementary methods. The first method, horizontal
correction, aims at correcting inconsistent values in a database
record based on the other values in that record, and all other
records in the same database, while the second method, vertical
correction, focuses on values which were entered in the wrong
column. Both methods are language-independent, rely solely
on the database itself, and are fully automatic. Their novelty
lies in the fact that they can deal with textual material in
database cells, whereas most traditional error correction methods only work for numeric (Hawkins, 1980) or atomic categorical data (Knorr and Ng, 1998).
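
As a rough illustration only (the actual methods in Sporleder et al. (2006a, 2006b) use machine learning trained on the database itself), the Python sketch below approximates both ideas with simple co-occurrence counts: horizontal correction flags a value that is unusual given another value in the same record, and vertical correction flags a value that is far more typical of a different column. All field names, values, and thresholds are invented.

    # Simplified frequency-based illustration of horizontal and vertical
    # correction; field names and values are invented, and the real
    # methods use trained classifiers rather than raw counts.
    from collections import Counter, defaultdict

    records = [
        {"genus": "Panthera", "country": "Brazil", "collector": "J. Smith"},
        {"genus": "Panthera", "country": "Brazil", "collector": "J. Smith"},
        {"genus": "Panthera", "country": "Brasil", "collector": "J. Smith"},
        {"genus": "J. Smith", "country": "Brazil", "collector": "Panthera"},
    ]

    # Horizontal correction: flag a country value that is rare for
    # records sharing the same collector.
    country_given_collector = defaultdict(Counter)
    for r in records:
        country_given_collector[r["collector"]][r["country"]] += 1
    for r in records:
        seen = country_given_collector[r["collector"]]
        if seen[r["country"]] < max(seen.values()):
            print("horizontal flag:", r)

    # Vertical correction: flag a value that occurs mostly in a
    # different column than the one it was entered in.
    column_given_value = defaultdict(Counter)
    for r in records:
        for col, val in r.items():
            column_given_value[val][col] += 1
    for r in records:
        for col, val in r.items():
            likeliest = column_given_value[val].most_common(1)[0][0]
            if likeliest != col:
                print("vertical flag:", col, "=", val, "looks like a", likeliest, "value")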
In a range of experiments on a natural history specimen database
we observed that the two methods can effectively zoom in on
the errors in the database. Instead of having to manually inspect
all 16 thousand records in our database, each characterized by
47 attributes (columns), the methods flag only on the order of several hundred suspected errors. The flagged items tend to capture well above 90% of all actual errors, and typically raise about as many false alarms as proper hits. The methods are implemented in a prototype system that draws the attention of the expert user, with simple visual means, to suspected errors in the database. The user can approve and apply the suggested correction, enter another correction, or dismiss the error message.
Other cleanup methods currently under investigation are the
identification and normalization of different language versions
of the same location names (Van Erp, 2006), the disambiguation
of proper names referring to different entities (Bagga and
Baldwin, 1998) and the normalization of time expressions.
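
As a hedged sketch of what such location-name normalization could look like (this is not the method of Van Erp (2006); the variant table and the similarity cutoff are invented), one can map language and spelling variants to a single canonical form with an exact lookup plus a fuzzy fallback:

    # Toy normalization of multilingual location-name variants to a
    # canonical form; the variant table and the 0.8 cutoff are invented.
    from difflib import get_close_matches

    CANONICAL = {
        "Brazil": {"Brasil", "Brésil", "Brazilië"},
        "Celebes": {"Sulawesi"},
    }

    def normalize_location(name):
        # Exact match against a known canonical form or variant.
        for canon, variants in CANONICAL.items():
            if name == canon or name in variants:
                return canon
        # Fuzzy fallback for spelling variation or OCR noise.
        candidates = list(CANONICAL) + [v for vs in CANONICAL.values() for v in vs]
        close = get_close_matches(name, candidates, n=1, cutoff=0.8)
        return normalize_location(close[0]) if close else name

    print(normalize_location("Brasil"))    # Brazil
    print(normalize_location("Brazilie"))  # Brazil (fuzzy match)
    print(normalize_location("Borneo"))    # Borneo (unknown; unchanged)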
Automatic metadata generation
After resources have been cleaned and normalized, similar
methods can be used to generate more metadata, in order
to improve access to the core data. Again, we follow a strategy
which is fully automatic and completely self-sufficient, not relying on any source other than the data, because in the typical cultural heritage domain few useful background resources exist besides the data itself.
As a first example, Sporleder et al. (2006c) describe a
bootstrapping method to generate lists of proper names of
persons, locations, and taxonomic animal names from a
specimen database. In general, an object database offers its own
structured annotation by means of its column titles, such as
"location" (of the place where an animal specimen was found).
Gathering the contents of cells in a particular column results in a list of named entities of that particular type, which may also be found elsewhere in the same database, e.g. in comment fields,
or in other textual metadata resources such as field books or
scholarly articles. These texts can then be linked to the entries
in the database that contain the same location name, and the
occurrences of the name in the other texts can be automatically
annotated as being "location" names.
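
The sketch below illustrates the two steps in miniature (column names, values, and the example sentence are invented; the actual procedure in Sporleder et al. (2006c) involves filtering and iterative bootstrapping not shown here): a gazetteer is harvested from typed columns and then projected onto free text.

    # Miniature version of the bootstrapping idea: harvest typed name
    # lists from database columns, then mark their occurrences in other
    # textual metadata. All names and the example sentence are invented.
    import re

    records = [
        {"location": "Sipaliwini", "collector": "J. Smith"},
        {"location": "Paramaribo", "collector": "J. Smith"},
    ]

    # Step 1: the column title provides the entity type of its cells.
    gazetteer = {
        "location": {r["location"] for r in records},
        "person":   {r["collector"] for r in records},
    }

    # Step 2: annotate occurrences of the harvested names in a comment
    # field, field book, or article linked to the same records.
    text = "Specimen collected near Sipaliwini by J. Smith in 1949."
    for label, names in gazetteer.items():
        for name in sorted(names, key=len, reverse=True):
            text = re.sub(re.escape(name), f"<{label}>{name}</{label}>", text)
    print(text)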
In ongoing work we explore the usage of clustering methods
to discover hidden events that are not explicitly coded or
mentioned, but that do explain certain groupings of objects. An
example in the natural history domain is the concept of
expedition, which is not coded in the database, but which is the
cause for certain animals to be found within a concentrated period of time in a relatively restricted area by the same person
or group of persons. Adding such entities and linking them to the objects associated with them makes the object database searchable in more than its original dimensions.
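
A very rough sketch of how such hidden expedition events might be proposed (the grouping key of collector plus year, and all record values, are invented assumptions; the clustering work in MITCH is still exploratory) is to group specimen records that share a collector and a narrow time window:

    # Group specimen records into candidate 'expedition' events by
    # collector and collection year; fields, values, and the one-year
    # window are invented assumptions for illustration.
    from datetime import date
    from itertools import groupby

    records = [
        {"id": "SPEC-101", "collector": "J. Smith", "date": date(1936, 7, 2),  "area": "Curaçao"},
        {"id": "SPEC-102", "collector": "J. Smith", "date": date(1936, 7, 19), "area": "Bonaire"},
        {"id": "SPEC-103", "collector": "J. Smith", "date": date(1949, 3, 5),  "area": "Aruba"},
    ]

    def expedition_key(r):
        # Same collector and same year -> candidate for the same expedition.
        return (r["collector"], r["date"].year)

    records.sort(key=expedition_key)
    for (collector, year), group in groupby(records, key=expedition_key):
        ids = [r["id"] for r in group]
        print(f"candidate expedition: {collector}, {year}, specimens {ids}")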
Bibliography
Bagga, A., and B. Baldwin. "Entity-based Cross-document
Coreferencing Using the Vector Space Model." Proceedings
of the Thirty-Sixth Annual Meeting of the Association for
Computational Linguistics and Seventeenth International
Conference on Computational Linguistics. 1998. 79-85.
Bontcheva, K., D. Maynard, H. Cunningham, et al. "Using
Human Language Technology for Automatic Annotation and
Indexing of Digital Library Content." Proceedings of the 6th
European Conference on Research and Advanced Technology
for Digital Libraries. 2002. 613-625.
Chapman, A.D. " Principles and Methods of Data Cleaning -
Primary Species and Species Occurrence Data." Report for the
Global Biodiversity Information Facility. Copenhagen, 2005.
Hawkins, D.M. Identification of Outliers. London: Chapman
and Hall, 1980.
Knorr, E. M., and R. T. Ng. "Algorithms for Mining
Distance-based Outliers in Large Datasets." Proceedings of the
24th International Conference on Very Large Data Bases
(VLDB'98). 1998.
Sporleder, Caroline, Marieke van Erp, T. Porcelijn, and Antal
van den Bosch. "Spotting the 'Odd-One-Out': Data-Driven Error
Detection and Correction in Textual Databases." Proceedings
of the EACL 2006 Workshop on Adaptive Text Extraction and
Mining (ATEM-06). Trento, Italy, 2006a.
Sporleder, Caroline, Marieke van Erp, T. Porcelijn, and Antal
van den Bosch. "Correcting 'Wrong-Column" Errors in Text
Databases." Proceedings of the Annual Machine Learning
Conference of Belgium and The Netherlands (Benelearn-06).
Ghent, Belgium, 2006b.
Sporleder, Caroline, Marieke van Erp, T. Porcelijn, Antal van
den Bosch, and P. Arntzen. "Identifying Named Entities in Text
Databases from the Natural History Domain." Proceedings of
the Fifth International Conference on Language Resources and
Evaluation (LREC-06). Genoa, Italy, 2006c.
van Erp, M. "Bootstrapping Multilingual Geographical
Gazetteers from Corpora." Proceedings of the 11th ESSLLI
Student Session. Malaga, Spain, 2006.

Conference Info

Complete

ADHO - 2007

Hosted at University of Illinois, Urbana-Champaign

Urbana-Champaign, Illinois, United States

June 2, 2007 - June 8, 2007

106 works by 213 authors indexed

Series: ADHO (2)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None