Corpus-Supported Modelling of Syntactic Information on Nouns in the Danish PAROLE Lexicon

paper
Authorship
  1. Anna Braasch

    Center for Sprogteknologi


Introduction

The PAROLE project (see Note 1), which finished at the end of April 1998, aimed to construct large, multi-functional national language resources, i.e. electronic corpora and lexica, for twelve EU languages. The most comprehensive lexicon of Danish general language to date has been produced within this framework; it comprises approx. 20,000 morphological units (lemmas) provided with syntactic information.

This paper sketches some of the methods used in the Danish PAROLE lexicon work, with the focus on the modelling of syntactic information. Firstly, a brief outline of the project is given. Secondly, some relevant features of the PAROLE descriptive model are presented. Thirdly, we discuss possible strategies for selected subcategorisation issues. Finally, we consider the profitable use of corpus evidence in identifying and describing subcategorisation frames.

1 The PAROLE lexica - key notions

PAROLE lexica are primarily designed for various NLP application areas. Therefore, all information has to be formulated in a theory-neutral and application-independent way; this means that the lexica must not be dedicated to a particular linguistic theory or implementation. All PAROLE lexica share common core specifications and a common information structure. The practical solution adopted for the lexicon work is to follow a common approach:

the generic descriptive PAROLE model (which is based on the GENELEX model [5]; see Note 3) is used by all languages
the use of basic morphosyntactic features and values is in agreement with the relevant general and language-specific recommendations of EAGLES [12] (see Note 2)
the individual language groups [4] use a common descriptive language and basically harmonized encoding strategies. This, however, allows for some freedom, e.g. in choosing more general or more detailed levels of linguistic information and in structuring this information.

Additionally, two working principles should be followed in lexicon work:

labour and time saving through re-use of existing computational lexicon material
ensuring high quality and reliability by means of corpus-based or corpus-supported encoding (if feasible).

The common properties of the lexica will ensure their cross-compatibility, which is a prerequisite for establishing multilingual links between them.

2 Information structure in the PAROLE lexica

In the PAROLE model the lemma is described at three linguistic levels: morphology, syntax and semantics. Each level has its own entity type, namely the simple morphological unit (MuS), the syntactic unit (SynU) and the semantic unit (SemU), respectively. They are linked to each other in the following way.

A MuS comprises the morphological DESCRIPTION of the lemma. A SynU contains the DESCRIPTION of the syntactic behaviour(s) of the lemma. A MuS - having at least one syntactic instantiation - requires a link to at least one (but possibly more) SynU(s). Correspondingly, one or more SemUs can be linked to a SynU, furthermore one SemU can be linked to more than one SynU e.g. in the case of near-synonyms.
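
The linking rules above can be pictured with a minimal sketch. The class and field names below (ident, lemma, synus, semus, check_links) and the placeholder lemmas are assumptions made for illustration only; they are not the PAROLE encoding format.

```python
from dataclasses import dataclass, field


@dataclass
class SemU:
    """Semantic unit (hypothetical minimal form: an identifier only)."""
    ident: str


@dataclass
class SynU:
    """Syntactic unit; zero or more SemUs may be linked to it."""
    ident: str
    semus: list = field(default_factory=list)


@dataclass
class MuS:
    """Simple morphological unit carrying the lemma."""
    lemma: str
    synus: list = field(default_factory=list)

    def check_links(self, has_syntactic_instantiation=True):
        # A MuS with a syntactic instantiation needs at least one SynU.
        return bool(self.synus) or not has_syntactic_instantiation


# One SemU linked to the SynUs of two placeholder near-synonyms.
shared = SemU("sem_shared")
mus_a = MuS("lemma_a", [SynU("syn_a", [shared])])
mus_b = MuS("lemma_b", [SynU("syn_b", [shared])])
```

Here mus_a.check_links() holds because a SynU is attached, while a MuS without any SynU passes the check only if it has no syntactic instantiation.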

The semantic level of DESCRIPTION is not dealt with in the PAROLE project; it will be elaborated in the SIMPLE project (see Note 4), which will be launched on 1st May 1998.

At each linguistic level the model allows for a variable depth of DESCRIPTION, from basic to highly detailed. Basic information types are mandatory for the PAROLE lexicon; at the morphological level these are category (and subcategory), all inflectional properties and spelling variants.

At the syntactic level subcategorisation features make up the basic information. The syntactic behaviour of a unit is characterised by the number, syntactic function and syntactic category of complements it subcategorises for. Other features are linear order constraints and control (where appropriate).

3 Syntax in the PAROLE model

In the Danish implementation the syntactic level in general describes purely syntactically observable features. This means that semantic ambiguities of a syntactic unit are not captured if they are not reflected by different surface realisations. Although the following presentation of the syntactic level is somewhat simplified, it may indicate how highly structured the information modelling is. Further details can be found in the report on encoding Danish verbs by Navarretta [13].

The basic descriptive element is the DESCRIPTION, which comprises an entry word SELF (possibly with some restrictions on its use) and a sample CONSTRUCTION, which is a small piece of context, up to a sentence in length, containing the SELF. The CONSTRUCTION is described by the list of complements occurring in the context POSITIONs. Each of them is described by information about syntactic category, obligatoriness/optionality and syntactic function within the given immediate context. Each SynU is linked to at least one DESCRIPTION and each DESCRIPTION is built to cover a syntactic frame of at least one SynU. All SynUs showing the same behaviour share the same DESCRIPTION. In addition, a complex descriptive object - the FRAMESET - can handle a list of DESCRIPTIONs of a syntactic unit with all their nested elements.
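
As a rough illustration of this nesting, the following sketch models POSITION, DESCRIPTION and FRAMESET as plain Python data classes. All attribute names, the valency property and the shared_description helper are hypothetical and simplified; the actual PAROLE objects carry more information (for instance restrictions on SELF).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    category: str          # syntactic category, e.g. "PP" or "G"
    function: str          # syntactic function, e.g. "object"
    optional: bool = True  # assumed default: complements are optional


@dataclass(frozen=True)
class Description:
    code: str
    positions: tuple = ()
    construction: str = ""  # sample context containing SELF

    @property
    def valency(self):
        # Number of complement POSITIONs in the frame.
        return len(self.positions)


@dataclass(frozen=True)
class FrameSet:
    descriptions: tuple     # the DESCRIPTIONs grouped for a syntactic unit


_registry = {}


def shared_description(code, positions=(), construction=""):
    """Return the single Description stored under this code, creating it once."""
    if code not in _registry:
        _registry[code] = Description(code, tuple(positions), construction)
    return _registry[code]


avalent = shared_description("Dn0")
same = shared_description("Dn0")   # a second SynU reuses the same object
pp = shared_description("Dn1-PP-på", [Position("PP", "object")])
frameset = FrameSet((avalent, pp))
```

Keeping DESCRIPTIONs in one registry keyed by code is only one possible way to express that SynUs with the same behaviour share a DESCRIPTION.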

3.1 A few aspects of the Danish implementation

In the following the main focus is on the treatment of deverbal nouns. We discuss possible DESCRIPTION strategies at the syntactic level including some practical choices we have adopted in the Danish lexicon.

The subcategorisation properties of nouns are not clarified to the same degree as those of verbs, although certain types of nouns exhibit a syntactic behaviour similar to verbs. This is the case especially with nouns that are derived from verbs, i.e. deverbal nouns. In the literature, views range from 'Nouns are avalent - and nominalisations too' [11] to 'Nouns can and do take obligatory arguments' [6]. Between these extremes there are a number of assertions regarding the general optionality of arguments and regular NP complementation, cf. Herslund [8] and Kirchmeier-Andersen [10].

In our work we adopted a pragmatic approach, assuming that:

simple prototypical (concrete) nouns are avalent
only a restricted number of deverbal nouns have obligatory complements, thus the default value expresses optionality
deverbal nouns often inherit verbal complement structures, thus nominal constructions may be examined in parallel with the corresponding constructions of their base verbs.

Before the syntactic encoding two fundamental tasks were to decide on:

the main criterion and basic strategy for 'splitting units'
the choice of appropriate coding strategies (linguistic and practical approach) with respect to the treatment of optionality of complements and syntactic alternations.

3.1.1 'Splitting units'

The fundamental decision made is 'to split units as late as possible', i.e. if feasible not before the semantic level, in order to keep the number of SynUs linked to a MuS as low as possible, which prevents some overgeneration in NLP applications. However, when two subcategorisation frames of a morphological unit are incompatible with each other because they reflect different argument structures, we create separate SynUs and link them to the same MuS. This is the case for a few full homonyms, e.g. (the example is simplified):

MuS: jagt (sloop, cutter; hunt, chase)

SynU1:

DESCRIPTION1: avalent (Code=Dn0)

CONSTRUCTION1:

Jagten er et lille sejlskib med en mast.

(The sloop is a small sailing ship with one mast.)

SynU2:

DESCRIPTION1: monovalent (Code=Dn1-PP-på)

the direct object is realised as a prepositional phrase introduced by 'på'

CONSTRUCTION1:

Jagten på harer går ind.

(The open season for hares begins. Lit.: The hunt for hares begins.)

DESCRIPTION2: divalent (Code=Dn2G-PP-efter)

subject is realised as a genitive phrase, object as a prepositional phrase introduced by 'efter'

CONSTRUCTION2:

Politiets jagt efter forbrydere er farlig.

(Lit.: The police's hunt after criminals is dangerous.)
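
For readers who prefer a data view, the simplified jagt entry above could be written down roughly as follows. The dictionary keys and the valencies helper are assumptions chosen for this sketch; only the codes, functions, categories and sample sentences come from the example.

```python
# Hypothetical dictionary layout for the simplified entry shown above.
jagt = {
    "MuS": "jagt",
    "SynUs": [
        {"id": "SynU1", "descriptions": [
            {"code": "Dn0",
             "positions": [],
             "construction": "Jagten er et lille sejlskib med en mast."},
        ]},
        {"id": "SynU2", "descriptions": [
            {"code": "Dn1-PP-på",
             "positions": [
                 {"function": "object", "category": "PP", "prep": "på"},
             ],
             "construction": "Jagten på harer går ind."},
            {"code": "Dn2G-PP-efter",
             "positions": [
                 {"function": "subject", "category": "G"},
                 {"function": "object", "category": "PP", "prep": "efter"},
             ],
             "construction": "Politiets jagt efter forbrydere er farlig."},
        ]},
    ],
}


def valencies(entry):
    """Map each SynU id to the number of POSITIONs in each of its DESCRIPTIONs."""
    return {
        synu["id"]: [len(d["positions"]) for d in synu["descriptions"]]
        for synu in entry["SynUs"]
    }


summary = valencies(jagt)
```

The summary makes the split visible: the avalent ship reading sits in its own SynU, while the monovalent and divalent frames of the hunt reading stay together in one SynU.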

3.1.2 Coding strategies

A great number of deverbal nouns have several different subcategorisation frames which have to be expressed appropriately in DESCRIPTIONs. This is due to the various syntactic realisations of a complement (syntactic alternations) in a given POSITION, similarly to deverbal constructions. Alternations also occur combined with facultative realisation, the latter being pertinent to noun complementation.

On the other hand, one and the same surface realisation (syntactic category) can be assigned different syntactic functions depending on the underlying logical structure, e.g. a genitive phrase (G) or a prepositional phrase (PP-af) in nominal constructions can function as the subject or object of the head noun, e.g. fremstilling (account, statement, presentation, etc.; production, manufacture, etc.). Even if both complements are omitted the sentence still remains well-formed, as in the following example:

(Moderens) fremstilling (af begivenhederne) er ganske præcis.

(Lit.: The mother's presentation of the events is quite accurate.)

Moderens fremstilling er ganske præcis.

(Lit.: The mother's presentation is quite accurate.)

Begivenhedernes fremstilling er ganske præcis.

(Lit.: The events' presentation is quite accurate.)

Fremstillingen af moderen / af begivenhederne er ganske præcis.

(Lit.: The presentation of the mother / of the events is quite accurate.)

Fremstillingen er ganske præcis.

(Lit.: The presentation is quite accurate. Others: The account, statement, exposition... / production, manufacturing...)

These examples show the patterns of deverbal nouns having both a predicative noun and a function noun reading. In such cases the function reading, being always avalent, coincides with the predicative reading with the optional complements omitted. For details the reader is referred to [1] and [10].

Needless to say, a polysemous base verb such as fremstille (produce, make, etc.; give an account of, state, picture, represent, etc.) gives rise to several further combinations, including idiomatic expressions like fremstille ngn. i retten (bring sby before the court, lit.: present sby in the court).

DESCRIPTIONs of a SynU are based on its various types of immediate syntactic contexts, i.e. CONSTRUCTIONs. The structural design of the PAROLE model allows for a certain degree of freedom in choosing encoding strategies, viz.

underspecifying by giving one single, simple DESCRIPTION in case of optional realisation of the complement(s) with marking of (all) the POSITION(s) as optional: appropriate solution e.g. for encoding predicative and function noun readings by means of one DESCRIPTION
specifying explicitly by giving individual DESCRIPTIONs for each combination of optionality, syntactic function and syntactic realisation of each POSITION: used in the case of syntactic alternation based on some differences in argument structure or in the case of conditioned obligatoriness
relating the individually given DESCRIPTIONs to each other: used in the case of simple syntactic alternations, e.g. reciprocal alternation
related DESCRIPTIONs can be organised compactly by linking multiple explicit DESCRIPTIONs within one FRAMESET: particularly useful when many SynUs share the same set of DESCRIPTIONs.
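
The difference between the first two strategies can be sketched as follows: one underspecified DESCRIPTION whose POSITIONs are marked optional stands for the set of explicit frames obtained by realising or omitting each optional complement. The expand function and the label strings are hypothetical, and the sketch covers optionality only, not alternating functions or realisations.

```python
from itertools import product


def expand(positions):
    """Enumerate the explicit frames covered by one underspecified DESCRIPTION.

    positions: list of (label, optional) pairs.
    Returns a list of tuples, each naming the complements that are realised.
    """
    # An optional POSITION may be realised or omitted; an obligatory one
    # must always be realised.
    choices = [
        ((label,), ()) if optional else ((label,),)
        for label, optional in positions
    ]
    return [
        tuple(label for part in combo for label in part)
        for combo in product(*choices)
    ]


# fremstilling with an optional genitive subject and an optional PP-af object
frames = expand([("G:subject", True), ("PP-af:object", True)])
```

For the two optional complements of fremstilling this yields four frames, the last of which is empty and corresponds to the avalent use in 'Fremstillingen er ganske præcis.'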

4 Corpus-supported work

The number of nouns to be encoded in the lexicon is 12,000, of which approx. 7,000 are avalent (zerovalent) and thus share one single DESCRIPTION (Code=Dn0). The remainder shows a great diversity of subcategorisation features, partly because of morphological and semantic dissimilarities among nouns. This has led to the elaboration of a great number of DESCRIPTIONs (currently more than sixty, but this figure may change during the ongoing revision). Corpus investigations provided the necessary basis for this work; we have frequently consulted the corpus of the Dictionary of Contemporary Danish (DDO), which contains 40 million tokens. It is the largest general-language electronic text collection in Denmark. In addition, dictionaries, reference books and re-use of lexical data supported the work.

Ideally, the encoding of each entry should be based on corpus analysis, as also argued by Biber [2] and Hanks [7]. However, due to time constraints we were forced to restrict corpus searches to the most difficult or complicated cases. As a starting point we had hypotheses about the syntactic behaviour we were looking for, and we also consulted dictionaries for relevant information. However, noun complementation seems to be a somewhat neglected field, at least in traditional dictionaries.

Below we list the main points investigated and the information types, provided by corpus evidence, that need to be accounted for in the PAROLE model:

whether it is reasonable to assign to the MuS of the noun more than one syntactic unit SynU (on the basis of the considerations below)
whether there are any lexically-based restrictions observed in the use of SELF
whether the derived noun preserves the complementation structure(s) of the base verb (relations between noun and verb can be recorded by using the TRANSFUSYN facility)
which are the different complementation structures realised in the corpus (i.e. to find relevant CONSTRUCTIONs as a basis for DESCRIPTIONs to be assigned)
whether the noun has any obligatory complement (i.e. the value 'yes' of OPTIONALITY has then to be changed to 'no'; also conditioned obligatoriness)
what are the properties of each complement in the CONSTRUCTION, i.e. the syntactic category, function and, when appropriate, the preposition introducing the PP (in order to record the features and values required for each POSITION)
what kinds of syntactic alternations are frequent (to decide on whether it is relevant and efficient to organise a group of DESCRIPTIONs into a FRAMESET).

For reasons of effectiveness it is advantageous to establish a FRAMESET whenever it is possible to reflect generalisations over a greater number of syntactic units. Corpus-based linguistic and statistical observations on the frequency of combinations, together with the related figures for the encoded material, suggest a future re-arrangement of DESCRIPTIONs into appropriate FRAMESETs.

The search results were informative also on other points e.g. regarding frequency and distribution of complementation patterns, occurrences of unexpected complementation patterns, conditioned distribution of particular complements, etc. A very useful discovery was the 'non-occurrence' of some constructions that were expected (by linguists!) on the basis of the verbal complementation patterns. This prevented us from encoding theoretically possible but practically never-used constructions.
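
A check of this kind can be approximated with a very small script. The sketch below is an assumption about how such a search might be set up, not the procedure actually used on the DDO corpus: it counts which of the expected prepositions directly follow a form of the noun and lists the expected ones that never occur. The three sample lines are the example sentences from section 3.1.1, not corpus data, so the result says nothing about actual Danish usage.

```python
import re
from collections import Counter


def preposition_counts(lines, noun_forms, prepositions):
    """Count prepositions that directly follow one of the given noun forms."""
    forms = "|".join(map(re.escape, noun_forms))
    preps = "|".join(map(re.escape, prepositions))
    pattern = re.compile(rf"\b(?:{forms})\s+({preps})\b", re.IGNORECASE)
    counts = Counter()
    for line in lines:
        for match in pattern.finditer(line):
            counts[match.group(1).lower()] += 1
    return counts


# Toy input: the paper's own example sentences, standing in for concordance lines.
lines = [
    "Jagten på harer går ind.",
    "Politiets jagt efter forbrydere er farlig.",
    "Jagten er et lille sejlskib med en mast.",
]
expected = ["på", "efter", "af"]   # hypothesised prepositions to look for
counts = preposition_counts(lines, ["jagt", "jagten"], expected)
unattested = [p for p in expected if counts[p] == 0]
```

On real concordance lines, the unattested list would point to constructions that are expected on the basis of the verbal pattern but do not show up, which is the kind of 'non-occurrence' mentioned above.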

Finally, we noted observations relevant to the transition between syntax and semantics in a comment field for further treatment, as in many cases the corpus evidence showed that a particular syntactic behaviour - often common to whole groups of nouns - has systematic connections to lexical semantic properties.

Closing remarks

The main concern of this paper has been to present the principles of structuring lexical information in the PAROLE model with focus on the syntactic level. We selected a few fundamental tasks related to the encoding of the complementation structure of deverbal nouns in Danish and we demonstrated the role of corpus evidence in finding practically suitable and linguistically appropriate solutions for these tasks. Although we carefully prepared the work and were concerned with consistency in development and encoding, some decisions will probably need revision. We had to come up with some pragmatic solutions appropriate to the framework of the project with respect to its overall goals - and also its time constraints.

However, because of its size, structure and contents, the lexicon produced is a valuable starting point for a subsequent national project, recently initiated by CST [4], which will also benefit greatly from the linguistic and computational experience acquired in the PAROLE project.

Notes

1 LE-2-4017 PAROLE. EU-funded project.

The languages of the lexicon project were: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish. The other members of the Danish lexicon group are Costanza Navarretta and Nicolai Hartvig Sørensen.

2 EAGLES recommendations in (Monachini et al. 96) [12]

3 GENELEX project: development and implementation of a generic lexicon model, cf. (GENELEX 93) [5]

4 LE4-8346 SIMPLE. EU-funded project on Semantic Information for Multifunctional Plurilingual Lexica.

References

1. Allan, R., P. Holmes and T. Lundskær-Nielsen. Danish: A Comprehensive Grammar. Routledge, London and New York 1995.

2. Biber, D. Investigating language use through corpus-based analyses of association patterns. In: International Journal of Corpus Linguistics. Vol.1.No.2. 1996.

3. Bindi, R., Calzolari, N., Monachini, M., Pirrelli, V. and Zampolli, A. (1994). Corpora and Computational Lexica: Integration of Different Methodologies of Lexical Knowledge Acquisition. In: Literary and Linguistic Computing, Vol.9, no. 1, 1994.

4. Braasch, A., Buhr Christensen, A., Olsen, S. and S. Pedersen, B. A Large Scale Lexicon for Danish in the Information Society. To appear in: Proceedings from the 1st International Conference on Language Resources and Evaluation. Granada, 1998.

5. GENELEX Consortium. Report on the Syntactic Layer. GSI-Erli 1993.

6. Grimshaw, J. Argument Structure. Cambridge: MIT Press 1990.

7. Hanks, P. Contextual Dependency and Lexical Sets. In: International Journal of Corpus Linguistics. Vol.1.No.1. 1996.

8. Herslund, M. Typological Remarks on Complex Noun Phrases in Danish. In: The Valency of Nouns. Odense Working Papers in Language and Communication, No.15.

9. Ide, N. and Veronis, J. Knowledge Extraction from Machine-Readable Dictionaries: An Evaluation. In: P. Steffens (Ed.) Machine Translation and the Lexicon. Lecture Notes in Artificial Intelligence 898. Heidelberg 1995.

10. Kirchmeier-Andersen, S. Verbal and Nominal Valency. In: The Valency of Nouns. Odense Working Papers in Language and Communication, No.15.

11. Lachlan Mackenzie, J. Nouns are avalent - and nominalisations too. In: The Valency of Nouns. Odense Working Papers in Language and Communication, No.15.

12. Monachini, M., Calzolari, N. Synopsis and Comparison of Morphosyntactic Phenomena Encoded in Lexicons and Corpora. A Common Proposal and Applications to European Languages. EAGLES, May 1996.

13. Navarretta, C. Encoding Danish Verbs in the PAROLE Model. In: Proceedings from RANLP '97. Tzigov Chark, Bulgaria.


Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1998
"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998

Organizers: ACH, ALLC