Co-Reference: A New Method to Solve Old Problems

paper
Authorship
  1. 1. Øyvind Eide

    University of Oslo

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Introduction In this abstract, I will describe co-referencing as a
method for information integration. This method is
based on building blocks that have been available in analogue
as well as digital information systems for a long
time. Co-reference is about connecting such building
blocks in new ways in order to enable additional tools
necessary for creating a functional Semantic Web for
cultural heritage.
The co-reference concept will be explained and related
to other tools for information integration. Some prototype
implementations currently being developed for use
in storing co-reference facts is being described, along
with theoretical work towards the so-called network of
identity. As a conclusion, I describe how we organise our
work in this field as well as our plans for the future.
Definition
An example of a reference is the string “The table by the
window,” referring to a physical object. If another sting
“The beautiful table” is referring to the same physical
object, the two strings are said to co-refer.
A co-reference system will add relations directly between
different references in texts. To do this, an identity
relation has to be used. A possible identity relation for
people could be:
A=B iff A and B are both human beings that have
lived or are living, and the body of A is at any time
located at the same place in the physical room as the
body of person B.
Once such an identity relation is established, the co-reference
definition is simple:
If string x refer to A and string y refer to B, and
A=B, then x and y co-refer.
Recording Co-References
Computer software is developed to detect and disambiguate
names and other referring strings in texts. They
are quick and reliable, but at the cost of a high level of
mistakes; not-detected items as well as false positives.
Even when used on very structured data with additional
information available to help the algorithm, such as a
library catalogue example described in (Bennett 2006),
the success rate in automatic data integration is far from
acceptable: 70% hits with an error rate at almost 1%.
Human beings, on the other hand, are more reliable. But
they are also expensive.
There are two different ways of solving this quality-cost
problem. One could use the computer to create candidate
co-references and let human beings check the results,
possibly as a federated job by amateurs. Or one could
try to find ways to store results of work that people are
already doing as formalised co-reference information.
As many people working in the culture heritage sector
detect co-references as part of their job, we do not need
to hire new people. What we need to do is to create computational
and organisational systems so that they are
able to store the facts they detect in a form usable for a
co-reference system.
Information about the events in which co-references
are asserted should also be stored. We will then be in
a position to differentiate between statements made by
computer programs and statements made by persons, as
well as between different persons. This enables a higher
degree of reproducibility than what is common in many
areas of research made on the basis of cultural heritage.
It also provides an addressee to approach if someone later
suspects that a co-reference assertion may be wrong.
Similar approaches
Work related to co-reference detection is common in research.
One example of tools used in such work today
is thesauri. But instead of connecting the sources to one
another, as one would do in the co-reference approach,
each of the sources are linked to a record in the thesaurus.
From such links to thesauri, co-references may be
detected, but details about the event in which the source
for the co-reference information was recorded is commonly
not available.
Prosopographies are closely related to co-reference
work, but has not traditionally published full sets of coreference
information about the names mentioned in the
sources used. The factoid approach described by Bradley
and Short, on the other hand, will provide information
in a formalised way from propopographical work that is
very close to co-reference data sets as they are described
in this abstract: “A factoid is not a statement of fact about
a person [...] Instead, each one records an assertion by
a source at a particular spot about a person.” (Bradley 2005, p. 8)
The concept of co-reference is used in corpus linguistics
as well, in the sense that co-reference annotation is applied
to text corpora (Day 2008). This is closely connected
to anaphora resolution. One may say that anaphora
resolution is a part of co-reference detection as an anaphor
is just another referring string.
Level 1: Inside the Organisation
Co-reference management should be part of the overall
information strategy in an institution. The curator,
researcher or exhibition professional should be able to
add information to their local system about external resources
co-referring with internal resources.
Tracking down co-references have always been one of
the practical tasks performed by researchers, conservators,
librarians, and others processing information about
real world objects described in texts. The results have
sometimes been published, e.g. in indices or in footnotes.
But most often it has been saved in the scholar’s
head, maybe also in his notes. We should do better. We
should enable the creation of formalised data that may
look like this: The syntax of this example is not important. The information
could e.g. be expressed in RDF. The point is that
it is expressed in a machine readable way.
We are currently implementing a prototype co-reference
system in which specific pieces of information in our
databases can be connected to external resources. The
external resources can be a URL, but can also be a reference
to a specific place in a printed book. This system
will also store information about who is responsible for
asserting the co-reference, when it was done, and optional
comments (Eide 2008). Further development is necessary
to evaluate if this method will be usable in practical
work.
Another implementation is the tagging tool created at
FORTH in Greece as a diploma thesis by Kostas Pyloudis
and Pasxalis Georgopoulos. The tagging tool is
a web based application in which HTML web pages and
photographs on the internet can be co-referenced and annotated
by the user. The operation includes selecting a
part of the document, so that part of an image, e.g. the
head of one person, can be co-referred to a string in a
HTML document, e.g. a name (Melesanakis 2008). This
is an interesting attempt to enable anyone to take part
in building up collections of co-reference information.
Methods used by the Perseus Digital Library in their
information integration work are also similar to the coreference
approach (Babeu 2007).
Level 2: Network of identity
After a while, many co-references have been stored in
many institutions, some of them referring to the same
individuals. All such co-reference collections should be
connected and made available in a cross-institutional
system.
The systems at level 1 described above will be sustainable,
because the organisations gain from using them.
The federated system will be the sum of all the twoway
links from the level 1 systems. A model for such a
system is described in (Meghini 2008). The article also
describes a possible implementation. The system will
be able to answer questions about the entities being coreferred.
If it is connected to information systems storing
information about e.g. events, it will be a very interesting
tool for exploring large data sets based on cultural
heritage information.
Conclusions
To integrate the wider cultural heritage field, it is necessary
to connect persons, places, and objects described
in and owned by museums to references in e.g. digital
versions of printed books. It is not enough to connect
classes of information, particular items will also have
to be connected. Thus, co-reference tools will find their
place among the other tools used in the development of
the Semantic Web.
The CIDOC co-reference working group was established
in 2007. We are currently working on the research
and implementation described above. We are also developing
prototype protocols for exchanging co-reference
information between systems. We hope to set up a small
integrated test system in 2009 in order to show how such
systems may be implemented.
Literature
Babeu, A., Bamman, D., Crane, G., Kummer, R. and
Weaver, G. (2007). “Named Entity Identification and
Cyberinfrastructure.” In: Lecture Notes in Computer Science,
volume 4675.
Bennett, R., Hengel-Dittrich C., O’Neill, E.T. and Tillett B.B. (2006). “USA VIAF (Virtual International Authority
File): Linking Die Deutsche Bibliothek and Library
of Congress Name Authority Files.” 72nd IFLA General
Conference. URL: http://www.ifla.org/IV/ifla72/
papers/123-Bennett-en.pdf (checked 2008-11-14).
Bradley, J. and Short, H. (2005). “Texts into Databases:
The Evolving Field of New-style Prosopography.” P.
3-24 in: Literary and Linguistic Computing.
Day, D., Hitzeman, J., Wick, M., Crouch, K and Poesio,
M (2008). “A Corpus for Cross-Document Co-reference.”
P. 2996-2999 in: Proceedings of the Sixth International
Language Resources and Evaluation (LREC’08).
Eide, Ø. (2008). “The Unit for Digital Documentation
(EDD) system for storing coref information. A short
overview of a system under development.” Paper presented
at the meeting of the CIDOC Co-Reference
Working Group, Athens. URL: http://cidoc.mediahost.
org/co-reference-meetings-2008(en)(E1).xml (checked
2008-11-14)
Meghini, C., Doerr, M. and Spyratos N. (2008). “Managing
co-reference knowledge for data integration.” P.
229-248 in: Proceedings of EJC2008, the 18th European-
Japanese Conference on Information Modelling and
Knowledge Bases.
Melesanakis, V. (2008) “Tagging-Tool.” Paper presented
at the meeting of the CIDOC Co-Reference Working
Group, Athens. URL: http://cidoc.mediahost.org/coreference-
meetings-2008(en)(E1).xml (checked 2008-
11-14).

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

Complete

ADHO - 2009

Hosted at University of Maryland, College Park

College Park, Maryland, United States

June 20, 2009 - June 25, 2009

176 works by 303 authors indexed

Series: ADHO (4)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None