Development and Assessment of Common Lexical Specifications for Six Central and Eastern European Languages

Tomaz Erjavec; Nancy Ide; Dan Tufis

Authorship

1. Tomaz Erjavec

Laboratory for Language and Speech Technologies
2. Nancy Ide

Vassar College, Department of Computer Science - Vassar College
3. Dan Tufis

RACAI-Romanian Academy Center for Artificial Intelligence

Parent session

LING (a), Nancy Ide

Original URL

http://web.archive.org/web/19980716092420/http://lingua.arts.klte.hu/allcach98/abst/abs10.htm

Work text

This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

1. INTRODUCTION
Multext-East was a project under the European Union Copernicus program whose goal was to develop language resources for six Central and Eastern European (CEE) languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovene) and to adapt existing tools and standards to them [3]. The project built on and extended the Multext project [9], which developed a comprehensive set of corpus-annotation tools, including tools for text segmentation, stochastic part-of-speech tagging, and alignment of parallel texts. Multext-East developed linguistic resources and created a multilingual, partially parallel corpus in the six CEE languages, a portion of which is annotated for part of speech and aligned.
Because the overall goal of Multext-East was to develop reusable resources, it was essential to establish standardized methods and specifications for these resources. To this end, a harmonized set of specifications for lexicon encoding was developed for the six Multext-East languages [11], based on the specifications developed in the EAGLES project [1] and their extension by the Multext project to six western European languages (English, French, Dutch, Italian, German, Spanish) [12]. Accommodating the different language families represented among the Multext-East languages (Romance, Finno-Ugric, and Slavic) demanded substantial assessment and modification of the pre-existing specifications, because these languages exhibit features that appear rarely in western European languages (e.g., heavy inflection, agglutination). To validate the specifications, the Multext-East project built lexicons for each of its six languages based on them and used the information contained in them for the automatic tagging of a parallel corpus of Orwell's "Nineteen Eighty-Four".
The availability of a harmonized set of lexical specifications provides a common base for comparison of various statistical properties of lexemes in these languages, which has heretofore been impossible. This paper provides an overview of the harmonized language specifications for Multext-East's six CEE languages and considers their comparative use and distribution in the project lexicons.
2. MORPHO-SYNTACTIC DESCRIPTIONS (MSDs)
The Multext-East lexical specifications describe the grammar of the morpho-syntactic descriptions (MSDs) used in the lexicons of the project. The development of harmonized lexical specifications for the six Multext-East languages began with proposals developed in the EAGLES project [1] and the modifications proposed for six western European languages in the Multext project [12]. These proposals were evaluated from the point of view of coverage for the six CEE languages. The nucleus of common features isolated within Multext for western European languages was assumed as the common ground for extension to the CEE languages. Specifications for information peculiar to the CEE languages were added as required taking care, however, that similar phenomena in the various (e.g. Slavic) project languages were encoded in a similar manner. This led to the formulation of a common proposal for lexicon specifications of the CEE languages, detailed in [11].
For each part of speech that is distinguished in the MSDs, the specifications give a table detailing the features used for that part of speech, the names and one-character codes for the values these features can take, and the applicability of the attribute/values to the six languages. The tables distinguish two types of attributes:
the minimal core features, i.e., those shared by most of the languages. These are common to all the Multext and Multext-East languages. This facilitates comparability of the information encoded in the lexical lists for the six Multext-East and six western European languages treated in Multext.
language-specific features, which apply only to (one or more) Multext-East languages.
The cross-language tables provide a concise summary of language differences and similarities. For example, Table 1 gives the number of attributes each of the six languages distinguishes for the various parts of speech. A hyphen in the table means that the particular part of speech is not valid for the language in question, while a zero denotes that the language distinguishes no features for that part of speech.

Part of speech  Romanian  Bulgarian  Czech  Slovene  Estonian  Hungarian
Noun                   6          5      5        5         3          7
Verb                   7          8     10        8         8          5
Adjective              7          3      7        5         3          8
Pronoun                8          8     12       10         4          7
Adverb                 3          1      2        2         0          4
Adposition             4          1      3        3         1          1
Conjunction            5          2      3        2         1          3
Numeral                7          5      7        5         4          7
Interjection           0          1      0        0         0          1
Residual               0          0      0        0         0          0
Abbreviation           5          0      0        0         3          0
Particle               2          2      0        0         -          -
Determiner             8          -      -        -         -          -
Article                5          -      -        -         -          1

Table 1. Number of attributes distinguished for each part of speech, by language
The grammar of the morpho-syntactic descriptions is realized in the lexical MSDs. These are provided as strings, using a linear, term-like encoding. In this notation, the position in a string of characters corresponds to an attribute, and specific characters in each position indicate the value for the corresponding attribute. That is, the positions in a string of characters are numbered 0, 1, 2, etc., and are used in the following way:
the character at position 0 encodes the part of speech;
each character at positions 1, 2, ..., n encodes the value of one attribute (person, gender, number, etc.), using the one-character code from the tables;
if an attribute does not apply, the corresponding position in the string contains the special marker "-" (hyphen). By convention, trailing hyphens are not included in the lexical MSDs.
For example, the specification
Vmm-2s
stands for
Verb main imperative (no Tense) second singular

Such specifications provide a simple and relatively compact encoding, and are similar in intent to the feature-structure encodings used in unification-based grammar formalisms. For instance, the example above can be glossed as the following attribute-value matrix:

Verb
  Type:    main
  VForm:   imperative
  Tense:   -
  Person:  second
  Number:  singular
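The positional encoding described above can be illustrated with a small decoder. This is a minimal sketch, not the project's software: the function name `decode_msd` and the table layout are assumptions, and the tables contain only the verb attributes and value codes glossed in the example (the real specification tables in [11] are much larger).

```python
# Illustrative subset of the specification tables (assumption: only the
# codes glossed in the text for "Vmm-2s" are included here).
POS_CODES = {"V": "Verb"}

# Attribute names for positions 1..n of a verb MSD, in order.
ATTRIBUTES = {"V": ["Type", "VForm", "Tense", "Person", "Number"]}

# One-character value codes per attribute.
VALUES = {
    "V": {
        "Type": {"m": "main"},
        "VForm": {"m": "imperative"},
        "Tense": {},
        "Person": {"2": "second"},
        "Number": {"s": "singular"},
    }
}


def decode_msd(msd):
    """Expand a positional MSD string into (part of speech, features)."""
    pos_code = msd[0]          # position 0 encodes the part of speech
    codes = msd[1:]            # positions 1..n encode attribute values
    features = {}
    for i, attribute in enumerate(ATTRIBUTES[pos_code]):
        # Trailing hyphens are omitted by convention, so a missing
        # position is treated the same as an explicit "-".
        code = codes[i] if i < len(codes) else "-"
        if code == "-":
            features[attribute] = None   # attribute does not apply
        else:
            features[attribute] = VALUES[pos_code][attribute][code]
    return POS_CODES[pos_code], features


pos, features = decode_msd("Vmm-2s")
```

Here `features["Tense"]` is `None`, mirroring the hyphen in the attribute-value matrix, and a shortened string such as `"Vm"` decodes with all omitted trailing attributes set to `None`.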

The EAGLES recommendations provide another special attribute value, the dot ("."), for cases where an attribute can take any value in its domain. The "any" value is especially relevant where wordforms are under-specified for certain attributes whose values can nevertheless be recovered from the immediate context by grammatical rules such as agreement. The "any" value was not necessary for the project languages except for Romanian, for which it was included to avoid redundancy in the Romanian wordform lexicon ([13]). However, rather than adding the dot notation for this special case, the Romanian encoding loaded the semantics of the "-" value with the additional meaning of "any value from the domain of the corresponding attribute".
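One way to read the overloaded "-" value is as a wildcard when an under-specified lexical MSD is compared with a fully specified one. The sketch below assumes that reading; the function name `compatible` and the under-specified strings used in the usage lines are invented for illustration and are not taken from the Romanian lexicon.

```python
def compatible(underspecified, full):
    """Return True if `underspecified` matches `full`, treating "-" in the
    under-specified MSD as "any value" (assumed Romanian-style reading)."""
    if underspecified[0] != full[0]:
        return False                      # parts of speech must agree
    # Restore omitted trailing hyphens so both strings have equal length.
    length = max(len(underspecified), len(full))
    left = underspecified.ljust(length, "-")
    right = full.ljust(length, "-")
    for a, b in zip(left[1:], right[1:]):
        if a != "-" and a != b:
            return False                  # a specified value conflicts
    return True


# Hypothetical under-specified strings compared with the "Vmm-2s" example.
matches_any_vform = compatible("Vm--2s", "Vmm-2s")
conflicting_person = compatible("Vm--3s", "Vmm-2s")
```

Only the under-specified side is treated as a wildcard here; a different design could make the check symmetric, which would be closer to unification.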
3. THE LEXICONS
Once the harmonized set of morpho-syntactic specifications for the six Multext-East languages was developed, lexicons incorporating these specifications were created for each language. Because the lexicons were used to automatically tag texts in the MULTEXT-East corpus, they provide full coverage of all corpus texts. Token lists for the texts were automatically generated and then fed through morphological analyzers in order to produce the lemma list (and associated morpho-syntactic information). Then the lemmas were fed back to the morphological generators (except for the agglutinative languages) in order to produce the complete inflected list, i.e., the full paradigms of the lemmas, which constitute the final lexicons of the project. The creation process and lexicon contents for each language are described in ([7]).
While the inclusion of full paradigms in the lexicons is still manageable for the Romance and Slavic languages, it is not feasible for the agglutinative languages of the project, namely Estonian and Hungarian. First, automatic generation for agglutinative languages produces a prohibitively large number of unacceptable wordforms. More importantly, even if it were possible to generate correct paradigms for these languages automatically, the number of possible wordforms of a lemma for these languages is so large (estimated at 20 million for Hungarian) as to preclude the possibility of including them all in a wordform lexicon. This problem was bypassed within the project because time and budget constraints did not allow the implementation of a generative solution (see Note 2). As a result, only the wordforms (with their relevant MSD interpretations) that actually occur in the corpus of the project are included for these two languages.
Entries in the lexicons are of the following format:

word-form  lemma  MSD

For example (Estonian):

aega  aeg  Nc-s1

Note that the same word-form may be associated with different MSDs (or lemmas) and therefore may appear in the first column of two or more entries. For example, the word-form in the entry cited above appears in the first column of the following entries as well:

aega  =    St
aega  aeg  Nc-s7

When the word-form is its own lemma, the "=" notation is placed in the lemma field. In the example above, for the entry "aega" where the MSD is "Adposition postposition" (St), the lemma is the word-form itself; however, for "aega" as "Noun common singular additive" (Nc-s7), the lemma is "aeg".
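The entry format and the "=" convention can be sketched as a small parser. The whitespace-separated layout and the names `Entry` and `parse_entry` are assumptions made for this illustration; the three sample lines are the Estonian entries quoted above.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    wordform: str
    lemma: str
    msd: str


def parse_entry(line):
    """Parse one 'word-form lemma MSD' line (assumed whitespace-separated)."""
    wordform, lemma, msd = line.split()
    if lemma == "=":
        lemma = wordform      # "=" means the word-form is its own lemma
    return Entry(wordform, lemma, msd)


lines = ["aega aeg Nc-s1", "aega = St", "aega aeg Nc-s7"]
entries = [parse_entry(line) for line in lines]

# All MSD interpretations recorded for the word-form "aega".
aega_msds = {e.msd for e in entries if e.wordform == "aega"}
```

After parsing, the postposition entry carries the lemma "aega" while the two noun entries carry "aeg".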
Table 2 summarizes the major characteristics of the six CEE lexicons, and includes data for a lexicon of English encoded using the same MSD formalism for comparative purposes. The languages belong to three language families (Romance, Slavic, Finno-Ugric), plus the Germanic English. The first field provides the number of lexical entries per language, and the second gives the number of distinct word-forms in the lexicons. The third field gives the number of distinct lemmas in the lexicon; thus each inflecting word will contribute one to this field. The "=" field provides the number of entries which are themselves lemmas (i.e., have "=" in the lemma field of their entry). Thus, the arithmetic difference between the "Lemmas" and the "=" fields gives (except for Estonian and Hungarian) the number of non-inflecting words in the lexicons. The "MSDs" field gives the total number of distinct MSDs used in the lexicon. "POS Ambig" and "MSD Ambig" provide the number of ambiguity classes (i.e., the number of different groupings associated with any one word-form) for part of speech categories and MSDs, respectively. Table 3 provides the same statistics for the main part of speech categories (noun, verb, adjective, adverb) in the lexicons. Note that a wordform ambiguous by part of speech is counted in more than one category.

Language   Entries  Wordforms  Lemmas      =  MSDs  POS Ambig  MSD Ambig
Bulgarian   333721     284211   17972  19064   185         42        680
Czech       141127      44191   14125  15568   924         35       1041
English      66473      43564   12622  25816   134         47        328
Estonian    130409      89337   36703  23384   563         63       1517
Hungarian    64511      50819   17033  18756   611         62       1533
Romanian    435086     344826   31365  33872   628         83       1383
Slovene     566427     198925   15123  15475  2040         51       1316

Table 2. Summary data for the six CEE language lexicons, plus English
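The fields of Table 2 can be computed from a list of lexicon entries roughly as follows. This is a sketch under stated assumptions, not the procedure used in the project: the function name `lexicon_summary` is invented, an ambiguity class is taken to be a distinct set of two or more tags shared by a word-form, and the "foo" entries (with the MSD "Nc-p1") are made-up placeholders added to the Estonian entries quoted earlier.

```python
from collections import defaultdict


def lexicon_summary(entries):
    """Compute Table 2 style counts from (wordform, lemma, msd) triples."""
    msds_by_wordform = defaultdict(set)
    lemmas = set()
    equals_entries = 0
    for wordform, lemma, msd in entries:
        if lemma == "=":
            equals_entries += 1       # entry is itself a lemma
            lemma = wordform
        lemmas.add(lemma)
        msds_by_wordform[wordform].add(msd)

    # Assumption: only groupings with at least two members count as
    # ambiguity classes; the paper does not state how singletons are handled.
    msd_classes = {frozenset(m) for m in msds_by_wordform.values() if len(m) > 1}
    pos_sets = {frozenset(msd[0] for msd in m) for m in msds_by_wordform.values()}
    pos_classes = {c for c in pos_sets if len(c) > 1}

    return {
        "Entries": len(entries),
        "Wordforms": len(msds_by_wordform),
        "Lemmas": len(lemmas),
        "=": equals_entries,
        "MSDs": len({msd for _, _, msd in entries}),
        "POS Ambig": len(pos_classes),
        "MSD Ambig": len(msd_classes),
    }


toy_lexicon = [
    ("aega", "aeg", "Nc-s1"),
    ("aega", "=", "St"),
    ("aega", "aeg", "Nc-s7"),
    ("foo", "=", "Nc-s1"),      # hypothetical placeholder entry
    ("foos", "foo", "Nc-p1"),   # hypothetical placeholder entry
]
summary = lexicon_summary(toy_lexicon)
```

For this toy lexicon the only ambiguous word-form is "aega", which yields one MSD ambiguity class (three MSDs) and one part-of-speech ambiguity class (noun or adposition).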

Language  POS  Entries  Wordform      =  Lemmas   MSDs  MSD Am
Bul       N      30.59     38.20  51.80   52.87   7.69   52.71
Bul       V      43.09     36.49   6.42    6.61  26.63   56.46
Bul       A      17.38     22.77  10.41   10.63   2.66   34.38
Bul       R       3.87      5.17  14.20   14.48   0.59   12.50
Cze       N      32.21     40.96  43.99   44.68   7.31   55.88
Cze       V      10.03     26.31  18.18   18.68  12.30   23.11
Cze       A      54.74     30.05  24.99   25.68  13.77   21.17
Cze       R       0.87      2.73   5.86    6.02   0.21    6.73
Eng       N      32.57     47.94  42.73   48.46  14.29   55.60
Eng       V      46.08     34.28  13.43   15.29  22.56   49.38
Eng       A      15.40     22.22  30.50   33.73   3.01   45.64
Eng       R       5.44      8.04  12.57   14.33   6.77   24.90
Est       N      61.16     71.02  55.29   58.54   9.41   69.17
Est       V      15.69     16.94   3.62    3.84  26.29   32.41
Est       A      19.39     23.47  26.31   27.87  13.32   35.18
Est       R       0.02      3.22  12.23   12.96   0.18   18.48
Hun       N      48.16     55.77  38.76   42.41  38.07   70.16
Hun       V      25.09     26.50   8.10    7.86   9.48   20.71
Hun       A      21.60     25.90  42.48   46.37  24.84   44.10
Hun       R       2.55      3.15   7.52    8.17   0.65   19.04
Rom       N      29.37     32.49  51.48   54.70   8.84   49.14
Rom       V      41.45     41.02  11.92   12.68  14.57   46.49
Rom       A      28.26     31.96  31.06   33.02  10.15   36.48
Rom       R       0.32      0.38   3.50    3.72   1.64    9.55
Slo       N      22.38     30.15  44.18   44.54   4.71   48.68
Slo       V      20.01     39.53  22.61   22.90   6.15   23.97
Slo       A      54.81     32.02  27.49   27.84  13.41   26.53
Slo       R       1.32      3.71   2.69    2.71   0.14    9.17

Table 3. Lexicon data by part of speech
A primary motivation for gathering quantitative information on lexical data from the six CEE languages is the need to develop automatic tagging mechanisms for these languages. The first decision that needs to be made here is, of course, choosing the appropriate tagset for each language. While several tagsets exist for the English language, as well as some harmonized tagsets for Western European languages, these tagsets are of limited use for Multext-East due to the considerable differences between the Multext-East languages (except for Romanian) and Western European languages.
There is very little experience in probabilistic tagging of Central and Eastern European languages, and the only known results ([6]) are poor for Czech, primarily due to free word order and the need for a very large tagset for highly inflected languages (approximately 1500 tags for Czech, and potentially millions for Estonian). The corpus size necessary to train a probabilistic tagger seems to be on the order of tens of millions of words, which is well beyond the scope of the project. Therefore, the Multext-East project includes a phase where the lexical MSDs for each language will be mapped to a significantly smaller "corpus tagset". The corpus tagsets are being chosen so that probabilistic taggers can actually do disambiguation with them. It is well known that collocational stochastic tagging methods (digram, trigram, n-gram) cannot discriminate the fine-grained distinctions made in the MSDs. Therefore the corpus tagsets must comprise broader categories which collapse or eliminate MSD values or (in some cases) features which a stochastic tagger cannot disambiguate.
Ambiguity classes or "genotypes" ([16]) provide useful information for designing tagsets appropriate for probabilistic disambiguation ([10],[14], [15]). We are using the data in Tables 2 and 3 to guide the development of POS corpus tag sets for the six project languages, using a process of step-wise refinement, by successively collapsing categories as results dictate.
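Collapsing MSDs into a coarser corpus tag can be sketched as keeping only selected positions of the MSD string. The function name `to_corpus_tag` and the choice of positions to keep are hypothetical; the paper does not say which features the project's corpus tagsets retain. The example drops the last position of the Estonian noun MSDs quoted earlier, purely to show how an ambiguity class shrinks.

```python
def to_corpus_tag(msd, keep_positions):
    """Project an MSD onto the given positions (a hypothetical reduction)."""
    kept = (msd[i] if i < len(msd) else "-" for i in keep_positions)
    # Drop trailing hyphens, following the convention used for MSDs.
    return "".join(kept).rstrip("-")


# Hypothetical choice: keep positions 0, 1 and 3, discarding position 4.
KEEP = [0, 1, 3]

msd_class = {"Nc-s1", "Nc-s7", "St"}            # MSDs of the word-form "aega"
tag_class = {to_corpus_tag(msd, KEEP) for msd in msd_class}
```

With this reduction the two noun readings fall together as "Ncs", so the three-way MSD ambiguity of "aega" becomes a two-way ambiguity between "Ncs" and "St". Repeating such a step and re-counting ambiguity classes is one way to carry out the step-wise refinement described above.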
A comparison of the "Wordform" and "Lemmas" columns shows that for all languages except for Czech and Slovene, verbs exhibit a high lemma/wordform ratio. Therefore, in these languages, verbal wordforms are strongly marked and easily recognizable. This observation was confirmed by several tagging experiments ([15]). For Czech and Slovene, on the other hand, verbs are no easier to identify than other parts of speech. Table 3 also suggests that recognition of adjectives is easier in the Slavic languages than in the others, due to distinct graphemic marking. Nouns are somewhat easier to differentiate from the other parts of speech for the two agglutinative languages (Hungarian and Estonian). In general, the largest number of MSDs is defined for verbs; however, Czech and Slovene allocate a higher proportion of tags to adjectives, while for Hungarian, almost two-thirds of the total number of tags are for nouns and adjectives. The last column in Table 3 shows that nouns are included in about half of the ambiguity classes, with Hungarian at the extreme (70.16% of the total number of MSD ambiguity classes include at least one nominal MSD).

Language   %Wordforms MSD non-amb  %Lemmas MSD non-amb
Bulgarian                   75.26                93.41
Czech                       40.10                94.80
English                     75.01                76.43
Estonian                    72.65                88.87
Hungarian                   78.02                81.73
Romanian                    85.18                88.24
Slovene                     34.89                97.61

Table 4: Non-ambiguous wordforms and lemmas

Czech and Slovene include several words with an exceptionally large number of MSDs (48 and 49 for Czech, and 54, 55, 56 and 57 for Slovene) and their lexicons therefore contain fewer unambiguous wordforms. However, in terms of lemmas, the lexicons for Czech, Slovene, and Bulgarian exhibit the lowest ambiguity, indicating that intra-category (inflectional) ambiguity is greatest for these languages. The Romanian lexicon exhibits the lowest wordform ambiguity (fewer than 15% of the words have more than one MSD). The non-ambiguity values in Table 4 result directly from the strategies used to handle syncretism (see Note 1). For statistical tagging, encoding syncretism and using "any value" attributes (removable from a tagset encoding) likely leads to greater tagging accuracy, and certainly increases the efficiency of training and tagging ([15]).
It is important to note that these figures say relatively little about ambiguity rates in running text; rather, they provide an index of ambiguity according to the encoding schema as well as an index of the degree of homography and syncretism that has been considered by the lexicon designers. We have gathered statistics over running text for English, Slovene, and Romanian, summarized in Table 5.

Language  %POS/Unamb  %MSD/Unamb
English         72.2        61.4
Slovene         69.8        26.9
Romanian        70.2        66.4

Table 5: Corpus Ambiguity

The ambiguity percentages here are very different from the same percentages computed for the lexicons (Table 4); this is expected since in the lexicons a lexical item (a word-form/MSD pair) appears only once, while in running text the number of occurrences of a given token may be quite large. This indicates that some MSD-ambiguous items appear quite frequently in the corpus, while a substantial number of MSD-unambiguous items in the lexicon do not appear there or are not very frequent. Similarly, all three languages show higher POS ambiguity (or equivalently lower POS non-ambiguity) in the corpus than in the lexicon. We are currently investigating the ramifications of this information for the development of POS corpus tags and tagging algorithms.
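The difference between lexicon-level and corpus-level figures comes down to counting types versus counting tokens, which a toy computation makes concrete. The word-forms, tags, and token sequence below are invented placeholders, not data from the project, and the function names are assumptions.

```python
def percent_unambiguous_types(lexicon):
    """Share of distinct word-forms that have exactly one MSD."""
    unambiguous = sum(1 for msds in lexicon.values() if len(msds) == 1)
    return 100.0 * unambiguous / len(lexicon)


def percent_unambiguous_tokens(tokens, lexicon):
    """Share of running-text tokens whose word-form has exactly one MSD."""
    unambiguous = sum(1 for token in tokens if len(lexicon[token]) == 1)
    return 100.0 * unambiguous / len(tokens)


# Invented toy data: one ambiguous but frequent word-form ("a") and three
# unambiguous ones, one of which ("d") never occurs in the text.
toy_lexicon = {
    "a": {"X1", "X2"},
    "b": {"Y1"},
    "c": {"Y2"},
    "d": {"Y3"},
}
toy_text = ["a", "a", "a", "b", "a", "c"]

by_type = percent_unambiguous_types(toy_lexicon)
by_token = percent_unambiguous_tokens(toy_text, toy_lexicon)
```

In the toy lexicon three of four word-forms are unambiguous, yet only two of the six tokens are, because the single ambiguous word-form accounts for most of the text. This is the same effect that separates Table 4 from Table 5.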
4. CONCLUSION
The paper provides an overview of the morpho-syntactic descriptions, lexicons, and lexical items in the corpus of the MULTEXT-East project, comprising six Central and Eastern European languages from three language families together with English as the hub.
A primary contribution of this work is, of course, the provision of widely available lexical and corpus resources for the languages of the project. The complete documentation of the MULTEXT-East project together with HTML corpus samplers is available on the WWW at http://nl.ijs.si/ME/. The entire corpus is available on CD-ROM through the TELRI concerted action ([3],[5]), together with four new translations of "Nineteen Eighty-Four" in Latvian, Lithuanian, Serbian, and Russian. These translations are encoded in the same way as the Multext-East corpus, using the CES specifications, and the Latvian, Lithuanian, and Serbian translations are sentence-segmented and aligned with the English. The CD-ROM is available for research purposes only, on a per-cost basis.
NOTES
1 Paradigmatic systems are well known for exhibiting the phenomenon of syncretism, the historical tendency of languages to reduce their use of inflection. In languages like English and German, for example, large subsets of former inflectional paradigms have become fused; a good example is the English present tense verb paradigm (walk -3sg, walks +3sg), which distinguishes only two forms, although potentially having six. In Slavic languages these fused or syncretic forms exhibit a richer patterning; which forms are syncretic differs across paradigms and can be also dependent on various inherent properties of the inflecting words.
2 One solution to this problem would be to add a generic tool for the decomposition of agglutinated words into the tool chain. Such a tool would also be applicable to the processing of other languages: agglutinative languages absolutely require such a tool, German needs it for compound words, and Arabic requires it for separating proclitics and enclitics.
ACKNOWLEDGMENTS
The work described here was partially supported by EU grant COP 106 (Copernicus program). The authors would like to acknowledge the contribution of the following people to the development of the lexical specifications: Monica Monachini, R. Pavlov, L. Dimitrova, L. Sinapova and K. Simov, V. Petkevic, H.J. Kaalep, L. Tihanyi, A.M. Barbu, and P. Holozan. We would also like to acknowledge Greg Priest-Dorman for his work on the preparation of the corpus and generation of some of the statistics.
REFERENCES
1. Bel, N., Calzolari, N. and Monachini, M., eds. (1995). Common Specifications and Notation for Lexicon Encoding. Deliverable 1.6.1. Multext Project LRE 62-050.
<http://www.lpl.univ-aix.fr/projects/multext/LEX/LEX1.html>
2. Erjavec, T. & Ide, N. (1998). The MULTEXT-East Corpus. First International Language Resources and Evaluation Conference, Granada, Spain.
3. Erjavec, T., Ide, N., Petkevic, V., Veronis, J. (1996). Multext-East: Multilingual Text, Tools and Corpora for Central and Eastern European Languages. Corpora Proceedings of the First TELRI European Seminar, pp. 87-98. Project description, documentation and resources available at <http://nl.ijs.si/ME/>.
4. Erjavec, T., Lawson, A., & Romary, L. (1998). East meets West: Producing Multilingual Resources in a European Context. First International Language Resources and Evaluation Conference, Granada, Spain.
5. Erjavec, T., Monachini, M. (Eds.) (1997). Specifications and Notation for Lexicon Encoding of Eastern Languages. Deliverable 1.1F. MULTEXT-East Project COP-106.
6. Hajic, J., Hladka, B. (1998). Czech Language Processing / POS Tagging. First International Language Resources and Evaluation Conference, Granada, Spain.
7. Ide, N., ed. (1996). Multext-East Language-Specific Resources. Deliverable D1.2. Multext-East Project COP 106. <http://www.lpl.univ-aix.fr/projects/multext-east/MTE2.html>
8. Ide, N. (1998). The Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora. First International Language Resources and Evaluation Conference, Granada, Spain. <http://www.cs.vassar.edu/CES/CES1.html>
9. Ide, N., Veronis, J. (1994). MULTEXT (Multilingual Tools and Corpora). Proceedings of the 14th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan, pp. 90-96.
10. Mason, O., & Tufis, D. (1997). Probabilistic Tagging in a Multi-lingual Environment: Making an English Tagger Understand Romanian. Proceedings of the Third International TELRI Seminar pp. 165-168., Montecatini.
11. Monachini, M., Erjavec, T. (Eds.) (1996). Common Specifications and Notation for Lexicon Encoding of Eastern Languages. Deliverable 1.1. Multext-East Project, COP-106. <http://nl.ijs.si/ME/Lexica/MorphSyn/>
12. Monachini M., Calzolari, N. (Eds.) (1996). Synopsis and Comparison of Morpho-syntactic Phenomena Encoded in Lexicons and in Corpora: A Common Proposal and Applications to European Languages. EAGLES document EAG-CLWG-MORPHSYN/R, Pisa. <http://www.ilc.pi.cnr.it/EAGLES96/morphsyn/morphsyn.html>
13. Tufis, D., Barbu, A., M., Patrascu, V., Rotariu, G., & Popescu C. (1997). Corpora and Corpus-Based Morpho-Lexical Processing. In D. Tufis, P. Andersen (Eds.): Recent Advances in Romanian Language Technology, pp. 35-56., Bucharest: Editura Academiei.
14. Tufis, D. (1998). Tiered Tagging. International Journal on Information Science and Technology, 1(2).
15. Tufis, D. & Mason, O. (1998). Tagging Romanian texts: A Case Study for QTAG, A Language Independent Probabilistic Tagger. First International Language Resources and Evaluation Conference, Granada, Spain (this volume).
16. Tzoukermann, E., & Radev, D. (1997). Tagging French Without Lexical Probabilities: Combining Linguistic Knowledge and Statistical Learning. cmp-lg/9/10002.

Full text license: This text is republished here with permission from the original rights holder.


Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1998

"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998

109 works by 129 authors indexed

Conference website: https://web.archive.org/web/19991022041140/http://lingua.arts.klte.hu/allcach98/

References: http://web.archive.org/web/19990225164509/http://lingua.arts.klte.hu/allcach98/abst/jegyzek.htm

Attendance: ~60 (https://web.archive.org/web/19990128030244/http://lingua.arts.klte.hu/allcach98/listpar3.htm)

Series: ACH/ALLC (10), ACH/ICCH (18), ALLC/EADH (25)

Organizers: ACH, ALLC
