The LANCHART Search Engine - Making important progress in data and data archiving reuse

paper
Authorship
  1. 1. Michael Barner-Rasmussen

    University of Copenhagen

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

In this paper, we describe the basic design of a multi
purpose, multi disciplinary database-driven research
tool for spoken language research developed and implemented
by the LANCHART Centre (University of Copenhagen).
The LANCHART Search Engine provides
an open source, highly robust, fast and versatile framework
for the entering of, working with/analyzing, visualizing,
and not least reusing spoken language research
resources.
Secondly, we describe the IS-structures and programs
supporting the search engine, most notably the creation
of import-export filters for a number of widely used transcription
and analysis tools (presently Praat, Transcriber,
CLAN).
Last, to illustrate the concrete usability and versatility of
the resultant e-science tool, we demonstrate by example
some of the many uses this e-science tool has already
been put to use by LANCHART as well as by visiting
scholars.
Background
The LANCHART project (Gregersen 2009, lanchart.
dk) initially created the search engine for contextualized
searching in/mining of a very large spoken language corpora
presently consisting of ~380 sociolinguistic interviews
spanning more than 30 years of same-respondent
recordings. During the course of its development it has
proven itself an obvious candidate to an e-science tool
for other corpora of spoken language under the Danish
CLARIN project (http://english.dkclarin.ku.dk/). As it
stands the corpus encompasses ~7 M transcribed words
and ~39 M linguistic annotations.
The LANCHART search engine is a working research
tool that is supporting more than 15 different thus computer
assisted research efforts ranging from phonetic,
through grammatical to discourse context analyses.
Architecturally this versatility, i.e. reuse of both research
data and the underlying enabling technology, has been achieved by stratifying the basic data model into elements:
• topology or ‘data architecture’
• (analytical) content
• semantic structure(s)
This separation has long been advocated by prominent
new media scientists such as Bolter (2001), Weinberger
(2002), Levy (1998) and has proved rarely opportune in
the case of preparing scientific work on spoken language
corpora for an up to date e-science.
Data Architecture, outputs and the reusable
engine
The data architecture is dictated solely by the structural
elements of spoken language, i.e. it happens in contiguous
time, there may be many simultaneous speakers and
many things (of analytical interest) may happen at once.
These structural characteristics are afforded by designing
the data store around two basic entities:
1. ‘Slice’: a period of time defined by its starting and
stopping time relative to the recordings (recording
may be audiovisual or audio only).
2. ‘Tier’: a set of slices spanning the length of the
sample. Each slice may be annotated, e.g. with the word uttered
or a comment, but only one such annotation may be entered
for any one slice. Simultaneously occurring annotations
are represented by placing a slice with identical
boundaries in a different tier.
Tiers are identified by a unique tier name (Orthography
(AMF) in Fig. 1.) allowing for linking to speaker metadata,
annotation encoding convention, lists of related
tiers etc.
This purposefully devoid-of-meaning data architecture
creates an environment where most needs of researchers
of spoken language are met. They may annotate whatever,
wherever and importantly whenever (i.e. as finely
grained) as they wish.
The search engine on its part has two output formats: a
KWIC concordance of the finds and a .csv file in which
the finds are supplemented with columns containing time
values, the content of aligned intervals in all other tiers
and the speakers’ background information. This is then
used for statistical analyses using programs such as SAS.
Data Semantics and reuse of data
As hinted to above, the semantics, the meaning of the
annotations in any given tier, is entirely decided by the
annotator. The unique name of each tier allows for any
amount of metadata to be linked with the tier, e.g. guides
and explanations of the contents as well as coding conventions
used.
Reuse becomes the simple task of choosing between already
created tiers: which are of analytical interest for
the present purpose? Then create a ‘package’ with only
those tiers present, export this package to one of several
supported formats and get started on one’s own analytical
contribution.
IS structures and computer programs
supporting the search engine
A search engine is not much of a research tool if it isn’t
easy to input data for searches, does not scale or if it is
not well documented, easily customized and written in a
easy to maintain programming language.
With this in mind from the very outset, the LANCHART
Search Engine is developed in open source tools (LAMP,
Java/jsp) with initial support for the CLAN, Praat &
Transcriber spoken language representational formats.
It is also central to note, that the conversion and import
functions were implemented by translating to and from a
‘super-format’ instead of directly between formats. This
allows the researchers freedom to choose whatever tool
suits their situation best and for easy, if not always lossless,
translation between all formats used both currently
and in the future – writing new translators is, if not easy,
then thanks to the architecture surmountable. Ongoing computer assisted analysis /
e-science using the LANCHART Search
Engine
Here, we mention but a few of the LANCHART Centre’s
corpus/e-science research projects (for more see http://
lanchart.hum.ku.dk/reports):
In most detail, we describe the work of Jeffrey Parrot
who has formulated a project investigating socially salient
and theoretically challenging variation in the pronominal
case forms of Danish within the broader context
of research on the morphosyntax of case in Germanic,
to test whether variation in Danish pronominal case is
stable or represents a change in progress. By using the
analytical tiers already present in the corpus as his research
base he can then add his own tiers considering
factors including attested pronouns’ case form, person,
number, conjunct ordering, and structural position as
well as including speakers’ age, sex, geographic location,
profession, and relative literacy. Using the search
engine to locate candidate concordance sections, Jeffrey
may jump start this effort.
Other researchers at the centre have produced works on
the correlation of phonetic and grammatical variables, on
bilingualism, and perhaps most notably, phonetic variation
in real time over recordings spanning more than 30
years (see Gregersen 2009).
The digital infrastructure makes it possible not only to
study the influence of geographical origin, gender and
social class on the use of the linguistic variables in question
and the possible interaction of functional and sociolinguistic
factors but also to analyse possible correlation
between the different linguistic variables and between
linguistic variables and discourse context allowing us
to formulate and test hypotheses about the origin and
spreading of patterns of linguistic changes in late 20th
century Danish.
References
Blanke, Tobias et. al. (2008). e-Science in the Arts and
Humanities – A methodological perspective in Opas-
Hänninen, Lisa Lena (Ed.) et. al.(2008): Digital Humanities
2008. Book of Abstracts.
Bolter, Jay David (2001). Writing Space: Computers,
Hypertext, and the Remediation of Print, Second Edition.
Mahwah, NJ: Lawrence Erlbaum.
Juel Jensen, Torben (2009): “Generic variation? Developments
in use of generic pronouns in late 20th century
spoken Danish” In Acta Linguistica Hafniensia Vol. 41.
Levy, Pierre (1998). Becoming Virtual: Reality in the
Digital Age. Da Capo Press.
Normann Jørgensen, Jens (2005): “Gender differences in
the development of language choice and patterns in the
Køge Project” in Code-Switching in the Køge Project.
Special issue of the International Journal of Bilingualism
7:4, 2004 (publ. 2005)
Weinberger, David (2002). Small Pieces Loosely Joined:
A Unified Theory of the Web. NY: Perseus Publishing
Gregersen, Frans. Ed.(2009): Acta Linguistica Hafniensia
Vol. 41. Language Change In Real Time. Evidence
from the Danish Laboratory. International Journal Of
Linguistics (The Linguistic Circle of Copenhagen).

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

Complete

ADHO - 2009

Hosted at University of Maryland, College Park

College Park, Maryland, United States

June 20, 2009 - June 25, 2009

176 works by 303 authors indexed

Series: ADHO (4)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None