The Scottish Corpus of Texts and Speech: problems of corpus design

paper
Authorship
  1. Fiona Douglas

    University of Glasgow

Work text


The Scottish Corpus of Texts and Speech: problems of corpus design

Fiona Douglas
University of Glasgow
F.Douglas@englang.arts.gla.ac.uk

2002
University of Tübingen, Tübingen
ALLC/ACH 2002

Editor: Harald Fuchs
Encoder: Sara A. Schmidt

In recent years the use of large corpora has revolutionised the way we study
language. There are now numerous well-established corpus projects which have set
the standard for future corpus-based research. As more and more corpora are
developed and technology continues to offer greater and greater scope, the
emphasis has shifted from corpus size to establishing norms of good practice.
There is also an increasingly critical appreciation of the crucial role played
by corpus design. Corpus design can, however, present peculiar problems for
particular types of source material, and the development of the Scottish Corpus
of Texts and Speech illustrates the problems which may be encountered when
dealing with a complicated linguistic situation such as exists in Scotland.
The Scottish Corpus of Texts and Speech is the first large-scale corpus project
specifically dedicated to the languages of Scotland, and therefore it faces many
unanswered questions, such as those outlined below, which will have a direct
impact on the corpus design. The project is a joint venture by the Department of
English Language and STELLA project at the University of Glasgow, and the
Language Technology Group at the University of Edinburgh, and is funded by the
Engineering and Physical Sciences Research Council. The project seeks to address
the current gap in knowledge about the languages of Scotland by building a
publicly available electronic corpus of written and spoken texts mounted on the
Internet. The linguistic situation in Scotland is complex, with Scottish
English, Scots, Gaelic and numerous non-indigenous community languages all
playing a role. However, surprisingly little reliable information is available
on a variety of issues such as the survival of Scots, the distinguishing
characteristics of Scottish English, the use of non-indigenous languages, or the
way they have developed in Scotland.
The first phase of the corpus is focusing on the collection of Scots and Scottish
English texts. However, the language varieties Scots and Scottish English are
themselves difficult to describe, and between these two extremes lie
multifarious other language varieties which defy rigid categorisation.
Established practice norms to ensure corpus representativeness cannot be easily
applied, as these Scottish language varieties have disparate and shifting
functional roles. Scottish English is generally accepted in a wider variety of
formal contexts than Scots, but Scots has stronger local and community ties
which may also exert a pressure. Social class and education also influence when
and where each language variety may be used. Indeed the labels 'Scots' and
'Scottish English' are themselves problematic, as written and spoken varieties
of Scots and Scottish English are not as closely linked as might be assumed.
There are numerous different local varieties, and so there is a strong regional
dimension to be considered. Native Scots themselves often disagree about what is
and is not 'Scots', before they even reach considerations of where its use is
and is not considered to be appropriate. The perceived status of Scots thus has
important implications for the text types and modes in which it is used. To date
there has been no large-scale study to identify where each of these language
varieties is deemed acceptable usage by native Scots. Indeed, the native Scots
themselves have ambivalent and wide-ranging opinions on these language
varieties, and there are unspoken but nevertheless tangible rules which impact
on where and how and when they are used. Present-day Scots also has no agreed
standard spelling system, which presents problems when developing search tools
for the corpus.
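One way to picture the search problem is query expansion over groups of variant spellings. The following is only a minimal sketch under stated assumptions, not the project's search tool: the variant groups are a small illustrative table, and the names VARIANT_GROUPS, build_variant_index, expand_query and search are hypothetical.

```python
# Illustrative only: a tiny hand-made table of spelling variants.
# A real corpus would need a far larger, carefully curated resource.
VARIANT_GROUPS = [
    {"guid", "gude"},
    {"dinna", "dinnae"},
]


def build_variant_index(groups):
    """Map every known spelling to the full set of spellings in its group."""
    index = {}
    for group in groups:
        members = frozenset(form.lower() for form in group)
        for form in members:
            index[form] = members
    return index


def expand_query(term, index):
    """Return all spellings to search for; unknown terms match only themselves."""
    term = term.lower()
    return index.get(term, frozenset({term}))


def search(tokens, term, index):
    """Return the positions of tokens matching the term or any listed variant."""
    wanted = expand_query(term, index)
    return [i for i, tok in enumerate(tokens) if tok.lower() in wanted]


INDEX = build_variant_index(VARIANT_GROUPS)
tokens = "It wis a guid nicht and a gude hoose".split()
hits = search(tokens, "gude", INDEX)  # matches both "guid" and "gude"
```

A lookup table of this kind only finds variants that someone has already listed, so it sidesteps rather than solves the absence of an agreed standard spelling.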
A balanced corpus which seeks to reflect the true linguistic situation in
Scotland must be sensitive to these problems and anomalies. It must reflect the
variety and breadth of possible linguistic options without skewing the data
along preconceived notions of what is and is not Scots or Scottish English. It
must also gather its texts from a discourse community which has very ambivalent
views about the range of language varieties it encompasses.
This paper considers the problems presented for corpus design in view of the
complex linguistic situation that exists in Scotland. It considers questions
such as how to decide what should be included, how to choose, and in what
proportions relative to the corpus as a whole and to the range of possible
language varieties. It examines the problematic issue of how to construct a
well-balanced and representative corpus in what is largely uncharted linguistic
territory. The paper will also consider points of comparison with other
corpora.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2002
"New Directions in Humanities Computing"

Hosted at Universität Tübingen (University of Tubingen / Tuebingen)

Tübingen, Germany

July 23, 2002 - July 28, 2002

Conference website: http://web.archive.org/web/20041117094331/http://www.uni-tuebingen.de/allcach2002/

Series: ALLC/EADH (29), ACH/ICCH (22), ACH/ALLC (14)

Organizers: ACH, ALLC
