Corpora, Statistics and Common English Vocabulary: An Application to the ICAME Database

paper
Authorship
  1. Ahmad S. Peyawary

    University of Manitoba

  2. Paul Fortier

    University of Manitoba


1. Background: Thorndike and West's Work.
Thorndike, in collaboration with others, produced a number of books intended to help educators and textbook writers choose appropriate vocabulary for their students: [6], [7], and [8].
Thorndike intended the lists for elementary-level native speakers of English. However, the introduction to [8] (pp. x-xii) suggests that the list is suitable for grades 1-12, as well as for adults who are learning English as a second language.
Michael West, a disciple of Thorndike, used frequency analysis of texts in a corpus of 5,000,000 running words (2,500,000 for some of the items) of written English, containing a variety of materials, to develop a list of 2000 items of core English, published in 1953 as [9]. It was the first comprehensive word list developed specifically to address the needs of ESL/EFL students. The list was intended to contain the words necessary to enable an ESL/EFL learner to read with comprehension any published non-technical English text. Although it is an old list, its 2000 entries still stand as the most important words for EFL learners and are constantly used by designers and developers of ESL programs and materials ([5]: 22).
Nation ([5]: 14) states that 2000 high-frequency lemmas cover 87% of the English language. Nation's figures are based on West's work. Inaccuracies, excessive estimation, and unclear definitions of lemmas and of the corpora themselves leave the value of this work open to question (see [3] and [2]).
2. Data
The ICAME database is distributed on CD-ROM (cf. [4]). Among other corpora, it contains the Brown (American), LOB (British), and Kolhapur (Indian) corpora of one million running words each. Each corpus divides published English into 15 genres, and each genre is assigned a number of text samples proportional to its importance in the language. The materials used for the Brown corpus were written by American writers and published in America in 1961; those for the LOB corpus were written by British authors and published in Britain in 1961; and those for the Kolhapur corpus were written by Indian authors and published in India in 1978. A later date was chosen for the Indian corpus than for its counterparts to ensure inclusion of the specific characteristics of Indian English.
A systematic approach requires real data collected in a structured, principled way, as described by Hassard [3] and Biber [2]. These criteria are met by the three corpora contained in the ICAME database. Like all language data, the ICAME frequencies are skewed. Outliers were kept in the database for analysis.
3. Analysis
3.1 Conversion of keywords to lemmas
Lemmatization was an extensive process, but not a theoretically interesting one. The American Heritage Dictionary [1] was chosen as the authority for the determination of the lemmas. Its coding for lemmas and parts of speech was followed rigorously, even when there were inconsistencies or possible errors. This dictionary was chosen because it is one of the most inclusive among one-volume dictionaries and uses American spelling conventions.
The Brown and Kolhapur corpora could be lemmatized on the basis of existing data on the CD-ROM, but the LOB corpus needed further work. The elaborate information concerning the syntactic function of individual words added to the LOB data, although invaluable for certain types of studies, unfortunately proved to be a hindrance to lemmatization for the purposes of this study. The tags added to the data were stripped off before lemmatization began.
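The tag-stripping and lemma-counting steps might be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the study: the `word_TAG` token format, the tiny `LEMMA_TABLE` lookup, and the function names are all hypothetical, whereas the study took its lemmas from the dictionary [1].

```python
from collections import Counter

# Hypothetical word-form -> lemma lookup; the study used a dictionary for this.
LEMMA_TABLE = {"went": "go", "goes": "go", "books": "book", "was": "be", "is": "be"}

def strip_tag(token: str) -> str:
    """Drop an assumed trailing '_TAG' annotation, e.g. 'books_NNS' -> 'books'."""
    word, sep, _tag = token.rpartition("_")
    return word if sep else token

def lemma_frequencies(tagged_text: str) -> Counter:
    """Count lemma frequencies in a whitespace-separated, tagged text."""
    counts = Counter()
    for token in tagged_text.split():
        word = strip_tag(token).lower()
        # Fall back to the word form itself when no lemma is listed.
        counts[LEMMA_TABLE.get(word, word)] += 1
    return counts

sample = "He_PP3A went_VBD home_NR and_CC she_PP3A goes_VBZ home_NR"
freqs = lemma_frequencies(sample)
```

Here `went` and `goes` are both counted under the lemma `go`, so the frequency list is built over lemmas rather than word forms.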
3.2 Statistical Analysis
Spearman's rho was used for the statistical analysis. Data were analyzed in blocks of 50 whenever possible. In any case, the minimum size of a set of data to be analyzed was kept at 30, to permit use of the Bonferroni correction.
The hypotheses tested were as follows:
1. In a list sorted into descending order of frequency, the vocabulary before divergence in the frequency distribution patterns can be considered to be the core vocabulary, common to the three dialects.
2. Where the data are not correlated to a statistically significant degree, the dialects have diverged.
These hypotheses were tested in two ways: 1) The data were arranged in descending order of frequency and subjected to the statistical test. 2) The data were first classified syntactically and then put in descending order before being tested.
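A block-wise Spearman test of this kind might look like the sketch below. Several details are assumptions rather than the authors' method: the corpora are compared two at a time, a trailing block of fewer than 30 items is simply dropped, the Bonferroni divisor is the number of blocks tested, and the p-value uses the Fisher z approximation instead of exact tables. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for constant input
    return cov / math.sqrt(var_x * var_y)

def approx_p_value(rho, n):
    """Two-sided p-value from the Fisher z approximation (an assumption)."""
    if abs(rho) >= 1.0:
        return 0.0
    z = math.atanh(rho) * math.sqrt((n - 3) / 1.06)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def correlate_blocks(freq_a, freq_b, block=50, min_size=30, alpha=0.05):
    """Return (start, rho, significant) for each consecutive block."""
    starts = [s for s in range(0, len(freq_a), block)
              if min(block, len(freq_a) - s) >= min_size]
    threshold = alpha / len(starts) if starts else alpha  # Bonferroni correction
    results = []
    for s in starts:
        a, b = freq_a[s:s + block], freq_b[s:s + block]
        rho = spearman_rho(a, b)
        results.append((s, rho, approx_p_value(rho, len(a)) < threshold))
    return results

# Invented frequencies: the two lists agree on the first block only.
freq_a = list(range(200, 100, -1))
freq_b = freq_a[:50] + [freq_a[50 + (i * 37) % 50] for i in range(50)]
results = correlate_blocks(freq_a, freq_b)
```

Under hypothesis 1, the blocks that precede the first non-significant one would be read as the shared core; under hypothesis 2, a non-significant block marks the point of divergence.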
4. Results
4.1. The Undifferentiated List:
The lemmas were arranged in descending order of frequency. Although the whole set was correlated, applying the test to sliding windows of fifty items showed only 131 lemmas to be common to the three dialects of English. Such a result is perverse, because it contradicts both the data and what is known of language.
Therefore, the list was divided into syntactic categories, on the basis of the parts of speech provided by the dictionary. The categories which contained fewer than 30 items were combined, when possible, to obtain groups with more than 30 items. The categories were then sorted into descending order of frequency and subjected to the statistical test.
4.2. The Syntactic List:
This list has three components.
Excluded words: There were 96 lemmas in this group, with a coverage of 12.034% of the language. They either were culture-specific or belonged to categories containing so few words that they could not be tested reliably. They were considered to be part of the common vocabulary in any case.
Function words: There were 183 lemmas in this category, all of which passed the test. Taken together they represent 30.86% of the language.
Content words: They constitute the greatest proportion of the data in terms of lemmas. A large number of them fail the test. Yet there are many that pass the test and are part of the common vocabulary of English. The content words that pass the test represent 31.24% of the language.
These results are summarized in Table 1.
Table 1. Results of the Spearman's rho analysis: Summary

Excluded Words

Parts of Speech                   n.l.      c.%
Abbreviation-proper nouns           23    0.403
Proper nouns                        30    0.241
Geographical terms-proper nouns     23    0.293
Articles                             3    9.670
Modal auxiliaries                    6    1.305
Interjection                         1    0.013
Letters of alphabet                  6    0.067
Numeric 'th'                         1    0.012
Prefixes                             3    0.030
Category total                      96   12.034

Function Words

Parts of Speech                   n.l.      c.%   n.p.      c.%   n.f.      c.%
Adverbs                             52    1.458     52    1.458      0        0
Adverb/else                         48    3.936     48    3.936      0        0
Conjunctions                        13    4.345     13    4.345      0        0
Prepositions                        33   13.647     33   13.647      0        0
Pronoun/else                        37    7.470     37    7.470      0        0
Category total                     183   30.856    183    30.86      0        0

Content Words

Parts of Speech                   n.l.      c.%   n.p.      c.%   n.f.      c.%
Adjectives                         386   10.314    281    9.655    105    0.659
Nouns                             1012   16.182    288    8.679    724    7.502
Verbs                              437   14.810    205   12.906    232    1.904
Category total                    1835   41.306    774    31.24   1061   10.065

Cumulative                        2114   84.196    957   62.096   1061   10.065


n.l. = number of lemmas
c.% = coverage
n.p. = number of lemmas passing the test
n.f. = number of lemmas failing the test
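The category totals in Table 1 can be rechecked with a few lines of arithmetic. The figures below are copied from the table; the variable and function names are illustrative only.

```python
# (number of lemmas, coverage %) per row, copied from Table 1.
excluded = [(23, 0.403), (30, 0.241), (23, 0.293), (3, 9.670), (6, 1.305),
            (1, 0.013), (6, 0.067), (1, 0.012), (3, 0.030)]
function_passing = [(52, 1.458), (48, 3.936), (13, 4.345),
                    (33, 13.647), (37, 7.470)]
content_passing = [(281, 9.655), (288, 8.679), (205, 12.906)]
content_failing = [(105, 0.659), (724, 7.502), (232, 1.904)]

def totals(rows):
    """Return (total lemmas, total coverage rounded to 3 decimals)."""
    return sum(n for n, _ in rows), round(sum(c for _, c in rows), 3)

# Common vocabulary = excluded words + all lemmas that passed the test.
common_lemmas, common_coverage = totals(
    excluded + function_passing + content_passing)
print(common_lemmas, common_coverage)  # 1053 74.13
```

Summed from the three-decimal figures in the table, the common vocabulary comes to 1053 lemmas with a coverage of 74.130%, which agrees with the 74.14% quoted in the conclusions to within rounding.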
5. Conclusions
The 957 lemmas shown to be statistically related, together with the 96 words from the excluded categories, make up the common English vocabulary, covering 74.14% of the texts sampled in the three corpora. Since this common vocabulary comprises 1053 lemmas rather than 2000, this finding should lead to revision of ESL/EFL programs and textbooks in a manner that will enhance the motivation of students. This study further demonstrates the usefulness of computer-based corpora. Our results also suggest that effective analysis of text data must take into consideration the syntactic as well as the semantic dimensions of language, even when carrying out a lexicometric study.
6. References
1. The American Heritage Dictionary of the English Language. 3rd edition. 1992. Anne H. Soukhanov et al. (eds.). Boston: Houghton Mifflin.
2. Biber, Douglas. 1994. Representativeness in Corpus Design. In Antonio Zampolli, Nicoletta Calzolari, & Martha Palmer (eds.). Current Issues in Computational Linguistics: In Honor of Don Walker. Pisa: Giardini. pp. 377-407.
3. Hassard, Thomas H. 1991. Understanding Biostatistics. St. Louis, Missouri: Mosby Year Book.
4. Hofland, Knut and Stig Johansson. 1991. ICAME Collection of English Language Corpora (in CD-ROM format). Bergen: Norwegian Computing Centre for the Humanities.
5. Nation, I.S.P. 1990. Teaching and Learning Vocabulary. New York: Newbury House.
6. Thorndike, Edward L. 1927 (2nd edition). The Teacher's Word Book. New York: Teachers College, Columbia University.
7. Thorndike, Edward L. 1932. The Teacher's Word Book of 20,000 Words. New York: Teachers College, Columbia University.
8. Thorndike, Edward L. and Irving Lorge. 1944. The Teacher's Word Book of 30,000 Words. New York: Teachers College, Columbia University.
9. West, Michael. 1953. A General Service List of English Words. London: Longmans.


Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1998
"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998


Series: ACH/ALLC (10), ACH/ICCH (18), ALLC/EADH (25)

Organizers: ACH, ALLC
