Korean Analysis and Transfer in Multilingual Machine Translation System

paper
Authorship
  1. Sung-Kwon Choi
     Systems Engineering Research Institute

  2. Tae-Wan Kim
     Systems Engineering Research Institute

  3. Soo-Hyun Lee
     Systems Engineering Research Institute

  4. Dong-In Park
     Systems Engineering Research Institute

Sung-Kwon Choi
Systems Engineering Research Institute
skchoi@seri.re.kr
Tae-Wan Kim
Systems Engineering Research Institute
twkim@seri.re.kr
Soo-Hyun Lee
Systems Engineering Research Institute
shlee@seri.re.kr
Dong-In Park
Systems Engineering Research Institute
dipark@seri.re.kr
Keywords: machine translation, multilingualism, common grammatical knowledge

Abstract
Multilingual machine translation means translation between more than two languages. Existing multilingual machine translation systems can be classified as transfer-based or interlingua-based. In transfer-based systems the analysis and generation rules are written separately for each language, so the commonness of languages is ignored and the overall memory requirement grows. Interlingua-based systems face the difficulty of implementing a universal linguistic model that holds for many languages. To overcome these shortcomings, this paper describes a multilingual MT system built on common rules, which capture the commonness of languages and can be shared by many languages.
1 Introduction
The analysis and generation rules in existing transfer-based multilingual machine translation systems (SYSTRAN, EUROTRA, METAL, LOGOS, GETA, etc.) are independent of each other and differ according to the target language [Hutchins 1992]. In other words, these systems do not acknowledge the commonness of languages. As a result they take the form of a bundle of bilingual MT systems, which increases the size of the system. Some multilingual systems use an interlingual method to reduce the number of transfer processes (CETA, SALAT, DLT, KANT, etc.), but they face the difficult problem of completing a universal linguistic model [Lewis 1992]. Against this background, this paper describes a new multilingual machine translation method, based on common rules and constraint rules, that overcomes the problems of the existing systems. Common rules are rules that more than two languages have in common. Their merits are that they reduce memory space, increase the consistency of grammatical information and standardize the information structure of the lexicon, because the common rules are loaded into memory only once. They have a further merit for MT: when a new language is added to the existing system and translated into the existing languages, new grammar modules can be created easily by combining common rules. Constraint rules are rules that control the linguistic characteristics of individual languages. This paper consists of three parts. Chapter 2 introduces the construction of the whole system. Chapter 3 describes the modules consisting of common rules, that is, the common grammatical rules, the common lexicon information structure, the common structural transfer rules, and the common information transfer rules. Chapter 4 explains the analysis and transfer of Korean through the parameterized common rules and the constraint rules.
2 System construction
Figure 1 shows the construction of the multilingual machine translation system based on common rules and constraint rules:

Figure 1: Construction of multilingual machine translation system

The middle field of Figure 1 represents the common module. Each 'rn' is a file of common rules belonging to the common module. These files are called by the grammar modules of the individual languages and, together with the constraint rules for a language, constitute the grammar rules of that language. For example, Korean, Japanese, English and German in Figure 1 have the rule file r3 in common, but Korean and Japanese additionally share the rule files r2 and r4 because they are typologically more similar to each other than to English and German.
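
The sharing of rule files can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the file names r2, r3 and r4 and the grouping of Korean and Japanese come from the text, whereas the loader `load_rule_file`, the placeholder rule contents, and the choice to list only r3 for English and German are hypothetical and say nothing about the actual CAT2 files.

```python
from functools import lru_cache

LOAD_LOG = []  # records every real load, so that sharing can be observed


@lru_cache(maxsize=None)
def load_rule_file(name: str) -> tuple:
    """Hypothetical loader: each common rule file is read only once."""
    LOAD_LOG.append(name)
    return (f"rules-from-{name}",)  # placeholder rule content


# Common rule files called by each grammar module. r2/r3/r4 follow the text;
# listing only r3 for English and German is a simplification (assumption).
COMMON_FILES = {
    "korean": ["r2", "r3", "r4"],
    "japanese": ["r2", "r3", "r4"],
    "english": ["r3"],
    "german": ["r3"],
}


def build_grammar(language: str) -> dict:
    """Grammar of a language = shared common rules + its own constraint rules."""
    common = [rule for name in COMMON_FILES[language] for rule in load_rule_file(name)]
    constraints = [f"constraint-rules-{language}"]  # placeholder content
    return {"common": common, "constraints": constraints}


grammars = {language: build_grammar(language) for language in COMMON_FILES}
```

Although four grammar modules are built, `LOAD_LOG` records each of the three rule files only once, which is the memory saving the common module is meant to provide.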
3 Common rules
This chapter shows the construction of the common rules. The common rules for analysis consist of the common grammar rules and the common lexicon information structure; those for transfer consist of the common structural transfer rules and the common information transfer rules.
3.1 Common grammatical rules
To handle many languages in a multilingual machine translation system, the common grammatical rules should account for the linguistic phenomena of as many languages as possible. To cover configurational languages (e.g. English) as well as nonconfigurational languages with relatively free word order (e.g. Korean, Japanese, German), we have devised new grammar rules that combine X-bar syntactic theory [Jackendoff 1977] with HPSG [Pollard 1994]. The rules are binary, except for the coordination structure, which is ternary.

     head-final-structure    head-first-structure    head-middle-structure
1    PRED => ARG PRED        PRED => PRED ARG        COORD => ARG1 COORD ARG2
2    MODED => MOD MODED      MODED => MODED MOD
3    FUNCT => ARG FUNCT      FUNCT => FUNCT ARG

Table 1. Common grammar rules

The common grammar rules of Table 1 are given in Appendix 1 in the notation of the CAT2 machine translation system.
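
As an informal illustration, the rules of Table 1 can be held as plain data keyed by head position, so that a language picks only the columns it needs. The Python sketch below is not CAT2 code; the names `COMMON_GRAMMAR_RULES`, `select_rules` and `head_index` are hypothetical helpers introduced for this example.

```python
# Table 1 as data: each rule is (mother, daughters), grouped by head position.
COMMON_GRAMMAR_RULES = {
    "head-final": [
        ("PRED", ("ARG", "PRED")),
        ("MODED", ("MOD", "MODED")),
        ("FUNCT", ("ARG", "FUNCT")),
    ],
    "head-first": [
        ("PRED", ("PRED", "ARG")),
        ("MODED", ("MODED", "MOD")),
        ("FUNCT", ("FUNCT", "ARG")),
    ],
    "head-middle": [
        ("COORD", ("ARG1", "COORD", "ARG2")),
    ],
}


def select_rules(*head_positions):
    """Return the common rules for the chosen head positions (hypothetical helper)."""
    return [rule for pos in head_positions for rule in COMMON_GRAMMAR_RULES[pos]]


def head_index(rule):
    """Position of the head daughter, i.e. the daughter carrying the mother's label."""
    mother, daughters = rule
    return daughters.index(mother)


# A head-final language that also needs coordination selects two columns.
head_final_grammar = select_rules("head-final", "head-middle")
```

In this representation the head daughter is always the one that repeats the mother's label, which is why `head_index` can recover the head position without any extra annotation.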

3.2 Common lexicon information structure
A lexicon information structure is needed so that the lexical information of the multilingual machine translation system can be entered, managed and corrected consistently. It is desirable for this structure to be nested rather than flat, so that it can represent the possible linguistic information and so that groups of information can be moved together. For this reason we have chosen feature structures as the multilingual lexicon information structure and made the attributes the same across languages. Appendix 2 shows an example of the multilingual lexicon information structure.
3.3 Common structural transfer rules
There is also a part of the transfer process that many languages can share: compositional transfer, which copies a node of the source language to the target language when the analysis structure of the former and the generation structure of the latter are the same. To transfer structures that differ between languages compositionally, our multilingual machine translation system deletes the function words and then transforms the syntactic nodes into 'predicate-argument-modifier' nodes. Noncompositional structural rules, which cannot be handled by the common structural transfer rules, are recorded in the transfer lexicon because they depend on lexemes. The transfer rules are applied in a priority order: first the noncompositional structural transfer rules, second the common structural transfer rules, and last the lexical transfer rules in the lexicon. The following rule is the common structural transfer rule:
(1) common_structural_transfer_rule = {}.[+node] <=> {}.[+node].
Rule (1) says that all compositionally transferable trees, that is, those marked '+node', are transferred unchanged from the source language to the target language.
3.4 Common information transfer rules
The transfer process in multilingual machine translation can also be simplified by separating structure from information. In existing transfer-based machine translation systems, structural transfer has included information transfer, which has led to duplicated information and increased memory space. Separating structure from information removes these shortcomings. In this sense, the common information transfer rules transfer the information that is common to many languages; that is, they copy the semantic information from the source language to the target language. The semantic information is produced during analysis by mapping forms to their meanings. The following are the common information transfer rules (in the notation of the CAT2 system):
(2) Common information transfer rules
Lexical_semantic_transfer =
{head:{ehead:{sem:SEM}}}.[*] <=> {head:{ehead:{sem:SEM}}}.[*].
Transfer_of_semantic_roles =
{role:ROLE}.[*] <=> {role:ROLE}.[*].
The lexical semantic transfer rule says that the lexical semantic information of the source language is copied to the target language at the same node level, and vice versa ('<=>' marks the rule as bidirectional). The transfer of semantic roles copies the semantic role information between the source language and the target language.
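
The effect of the two rules in (2) can be sketched in Python as a copy of the `sem` and `role` features between nodes at the same position in two trees of the same shape. This is a minimal sketch and not the CAT2 implementation: the dictionary encoding, the `children` key, the function name `transfer_information` and the feature values in the example are assumptions made for illustration.

```python
def transfer_information(source: dict, target: dict) -> dict:
    """Copy head.ehead.sem and role from a source node to the target node
    at the same position, then recurse into the daughters pairwise."""
    sem = source.get("head", {}).get("ehead", {}).get("sem")
    if sem is not None:
        target.setdefault("head", {}).setdefault("ehead", {})["sem"] = dict(sem)
    if "role" in source:
        target["role"] = source["role"]
    for src_child, tgt_child in zip(source.get("children", []),
                                    target.get("children", [])):
        transfer_information(src_child, tgt_child)
    return target


# Illustrative values only: a predicate with two arguments.
source_tree = {
    "head": {"ehead": {"sem": {"stense": "ante"}}},
    "children": [{"role": "agent"}, {"role": "theme"}],
}
target_tree = {"children": [{}, {}]}
result = transfer_information(source_tree, target_tree)
```

Because the function only copies features between corresponding nodes, calling it with the two trees swapped gives the reverse direction, which mirrors the bidirectional reading of '<=>'.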
4 Korean analysis and transfer by constraint rules
The grammar of an individual language consists of universal rules and their parameters [Chomsky 1981], and languages can be classified typologically by these parameters [Greenberg 1963]. [Dorr 1993] is an example of machine translation that uses universal principles and their parameters. Following Greenberg's parameterized word order, we can state the standard word order of Korean as follows:
(3) Standard Word Order of Korean
SOV
Number-Noun
Demonstrative-Noun
Adjective-Noun
Possessive Pronoun-Noun
Relative clause-Noun
This standard word order gives a clue to the parameter setting of an individual language. The next section presents the parameterized common grammatical rules for Korean.
4.1 Korean analysis by parameterized common rules
According to the standard word order of Korean, the head must always follow its argument or modifier. We can therefore select the head-final common rules for Korean from the multilingual common grammatical rules in Table 1. These head-final rules, and the Head Feature Principle (HFP), which percolates the information of the lexical head up to its phrase, are as follows. (The coordination structure of Korean could be treated as part of the 'argument-functional word structure', but we keep it as a ternary structure for efficient analysis.)

     head-final-structure    head-middle-structure
1    PRED => ARG PRED        COORD => ARG1 COORD ARG2
2    MODED => MOD MODED
3    FUNCT => ARG FUNCT

Table 2. Parameterized common grammar rules for Korean

(4) Head_Feature_Principle =
{head:HEAD}.[{},{head:HEAD}].
A Korean sentence analysed by the parameterized common grammar rules and the HFP yields the following:
(5) cengpwunun saylowun kyeyhoykanul malyenhayessta.
government+SUBJ new plan+OBJ make+PAST+DECL
The government made a new plan.

In (5) the fine line shows the application of 'FUNCT => ARG FUNCT', the dotted line that of 'MODED => MOD MODED' and the thick line that of 'PRED => ARG PRED'.
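
The way the three head-final rules and the HFP combine for sentence (5) can be approximated with the following Python sketch. It is an illustration under assumptions, not the CAT2 analysis: the leaves are named by their English glosses, the head feature values (`cat`, `scase`, `tense`) are chosen for the example, and the inflected verb is treated as a single leaf for brevity.

```python
def leaf(gloss, **head):
    """A lexical node; its head features are given as keyword arguments."""
    return {"gloss": gloss, "head": head, "daughters": []}


def head_final(rule, left, right):
    """Binary head-final node. Following the HFP, the mother shares the
    head of its right daughter."""
    return {"rule": rule, "head": right["head"], "daughters": [left, right]}


FUNCT = "FUNCT => ARG FUNCT"
MODED = "MODED => MOD MODED"
PRED = "PRED => ARG PRED"

# government+SUBJ
subject = head_final(FUNCT, leaf("government", cat="noun"), leaf("SUBJ", scase="subj"))
# new plan
new_plan = head_final(MODED, leaf("new", cat="adj"), leaf("plan", cat="noun"))
# (new plan)+OBJ
obj = head_final(FUNCT, new_plan, leaf("OBJ", scase="obj"))
# make+PAST+DECL, kept as one leaf here (simplifying assumption)
verb = leaf("make+PAST+DECL", cat="verb", tense="past")

verb_phrase = head_final(PRED, obj, verb)
sentence = head_final(PRED, subject, verb_phrase)
```

Since every rule is head-final, the head features of the verb percolate unchanged to the top node, while the noun 'plan' determines the head of 'new plan'.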

4.2 Korean analysis by grammatical constraint rules
When analysing Korean for machine translation, the following characteristics must be given special consideration [Oh 1994]:
(6) Korean Characteristics
Phonological peculiarity
    sonyen-i, sonye-ka
    boy-SUBJ, girl-SUBJ
    boy, girl
Double objects
    kunun seoulul yehayngul hayessta.
    He-SUBJ Seoul-OBJ trip-OBJ make-PAST-DECL
    He made a trip to Seoul.
Honorifics
    kyoswunimkkeyse osipnita.
    professor-SUBJ(HON) come-HON-DECL
    The professor comes.
These peculiarities of Korean can be handled by the constraint rules. Table 3 shows the relation between the common rules and their constraint rules.
Korean characteristics        Common rules          Constraint rules
Phonological peculiarities    FUNCT => ARG FUNCT    Phonological rule
Double objects                PRED => ARG PRED      Argument exchange
Honorifics                    HFP                   Context information
Table 3. Common rules and constraint rules

- Phonological rule
Every morpheme records the type of its final phoneme, which a function word subcategorizes for and thereby predicts.

example) sonyen{phon:con} i{phon:voc,frame:{arg1:{phon:con}}}
boy{phon:con} SUBJ{phon:voc,frame:{arg1:{phon:con}}}
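
The phonological rule can be pictured as a compatibility check between the `phon` value of a noun and the `phon` value that a subject marker requires of its first argument. In the Python sketch below, the entries for 'sonyen' and 'i' follow the example; the entries for 'sonye' and 'ka' are assumed by analogy with 'sonye-ka' in (6), and the helper names are hypothetical.

```python
NOUNS = {
    "sonyen": {"phon": "con"},  # from the example: ends in a consonant
    "sonye": {"phon": "voc"},   # assumed: ends in a vowel
}

SUBJECT_MARKERS = {
    # from the example: 'i' requires a consonant-final argument
    "i": {"phon": "voc", "frame": {"arg1": {"phon": "con"}}},
    # assumed by analogy: 'ka' requires a vowel-final argument
    "ka": {"phon": "voc", "frame": {"arg1": {"phon": "voc"}}},
}


def marker_accepts(marker: str, noun: str) -> bool:
    """True if the noun's final phoneme type is the one the marker subcategorizes for."""
    return SUBJECT_MARKERS[marker]["frame"]["arg1"]["phon"] == NOUNS[noun]["phon"]


def choose_subject_marker(noun: str) -> str:
    """Pick the first subject marker whose frame is compatible with the noun."""
    return next(m for m in SUBJECT_MARKERS if marker_accepts(m, noun))
```

With these entries, `choose_subject_marker("sonyen")` selects 'i' and `choose_subject_marker("sonye")` selects 'ka', so the rule 'FUNCT => ARG FUNCT' itself needs no language-specific change.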
- Argument exchange
In the lexicon, the subcategorization frame of the functional verb 'hata' (= do/make) and that of the predicate noun are exchanged with each other:
lex      hata
frame    arg1    ARG1
         arg2    ARG2
         arg3    cat      noun
                 frame    arg1    ARG1
                          arg2    ARG2
Table 4. Lexicon of 'hata (do/make)'

example) kunun(arg1) seoulul(arg2) yehayngul(arg3(arg1,arg2)) ha(arg1,arg2,arg3)yessta.
He-SUBJ Seoul-OBJ trip-OBJ make-PAST-DECL
He made a trip to Seoul.
- Context information
The context information of the sentence subject must agree with that of the verb phrase.

example) kyoswunimkkeyse(context:honor) osi(context:honor)pnita.
professor-SUBJ(HON) come-HON-DECL.
The professor comes.
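
Agreement of context information can be read as unification of two feature structures that fails on conflicting values. The following Python sketch uses flat dictionaries for simplicity; the function `unify` and the value 'plain' for a non-honorific context are assumptions introduced for this illustration.

```python
def unify(a: dict, b: dict):
    """Merge two flat feature structures; return None if a feature clashes."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


subject = {"context": "honor"}      # kyoswunimkkeyse
verb_phrase = {"context": "honor"}  # osipnita

agreeing = unify(subject, verb_phrase)
clashing = unify(subject, {"context": "plain"})  # 'plain' is a hypothetical value
```

Here `agreeing` keeps the shared value, whereas `clashing` is `None`, which corresponds to rejecting an analysis whose subject and verb phrase disagree in honorific context.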
4.3 Transfer constraint rules
The syntactic tree of Korean is converted into a semantic tree by tree transformation. The semantic tree has a 'predicate-argument-modifier' arrangement, and the HFP is also applied to its nodes. The transformation rules convert the Korean syntactic tree (5) into the following semantic tree:
(7) cengpwunun saylowun kyeyhoykanul malyenhayessta.
government-SUBJ new plan-OBJ make-PAST-DECL.
The government made a new plan.

The semantic tree is the input to transfer. All semantic trees that can be transferred compositionally are transferred to the target language by the 'common structural transfer rules' and the 'common information transfer rules'. There are, however, cases of compositional transfer to which the common information transfer rules cannot be applied directly. Idiomatic expressions with the functional verbs 'hata' (do/make) or 'toyta' (be done/be made) are examples. During the transformation from syntactic tree to semantic tree we delete 'hata' and copy its information to the feature 'functional verb' of the predicate noun, so that the predicate noun becomes the predicate of the sentence. But no multilingual rule can control the relation between a predicate noun of the source language and a predicate noun of the target language, or between a predicate noun of the source language and a verb or adjective of the target language. For this reason we need rules that constrain the common transfer rules. The following are the transfer constraint rules for the common information transfer rules.

(8) Constraint rule of predicate noun
idiomatic expression vs idiomatic expression
Copy the information of the Korean functional verb to the functional verb of the target language if the lexeme of the target language has a functional verb that is equivalent to the Korean idiomatic expression with 'hata'.
ex.) sanpolul hata => take a walk, einen Spaziergang machen, sanpowo suru
     ilul hata => sikotowo suru
idiomatic expression vs verb or adjective
Copy the information of the Korean functional verb to the lexeme of the target language if the lexeme of the target language has no functional verb that is equivalent to the Korean idiomatic expression with 'hata'.
ex.) ilul hata => work, arbeiten
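
The two cases of constraint rule (8) can be sketched in Python as a lookup that decides where the information copied from 'hata' is placed. The target expressions are taken from the examples in (8); the segmentation into predicate noun and functional verb, the language codes, the `tense` feature used as the copied information, and the names `TRANSFER_LEXICON` and `transfer_predicate_noun` are assumptions made for this illustration.

```python
# (predicate noun, target language) -> (target lexeme, functional verb or None)
TRANSFER_LEXICON = {
    ("sanpo", "en"): ("walk", "take"),           # take a walk
    ("sanpo", "de"): ("Spaziergang", "machen"),  # einen Spaziergang machen
    ("sanpo", "ja"): ("sanpo", "suru"),          # sanpowo suru
    ("il", "ja"): ("sikoto", "suru"),            # sikotowo suru
    ("il", "en"): ("work", None),                # work
    ("il", "de"): ("arbeiten", None),            # arbeiten
}


def transfer_predicate_noun(node: dict, lang: str) -> dict:
    """node is a predicate noun of the semantic tree whose 'functional_verb'
    feature holds the information copied from the deleted 'hata'."""
    lexeme, functional_verb = TRANSFER_LEXICON[(node["lex"], lang)]
    info = dict(node["functional_verb"])
    if functional_verb is not None:
        # idiomatic expression vs idiomatic expression:
        # the information goes to the functional verb of the target language
        return {"lex": lexeme, "functional_verb": {"lex": functional_verb, **info}}
    # idiomatic expression vs verb or adjective:
    # the information goes to the target lexeme itself
    return {"lex": lexeme, **info}


walk_en = transfer_predicate_noun({"lex": "sanpo", "functional_verb": {"tense": "past"}}, "en")
work_en = transfer_predicate_noun({"lex": "il", "functional_verb": {"tense": "past"}}, "en")
```

In the first call the tense ends up on 'take', in the second on 'work', which is the distinction the two constraint rules draw.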
5 Conclusion
In this paper we have proposed a new approach to multilingual machine translation that accepts the commonness of languages in order to reduce the memory space of the multilingual machine translation system and to simplify the transfer process. The approach rests on common rules shared by many languages and constraint rules for the individual languages. For example, the analysis of Korean is handled by the parameterized common rules and the constraint rules, and the transfer from Korean to other target languages is handled by the common structural transfer rules, the common information transfer rules, and the transfer constraint rules. The following table shows the number of common and constraint rules used for the analysis and transfer of Korean in translating 300 Korean sentences into English or German.
Syntactic Analysis        Semantic Analysis         Transfer
Common    Constraint      Common    Constraint      Common    Constraint
9         55              39        8               43        3
- Further work

Although multilingual machine translation with common rules and constraint rules performs reasonably well, reducing the number of analysis rules and simplifying the transfer process, many problems remain to be solved:

Reducing the number of parse trees
Conflicts between old and new lexical information
Recognizing idiomatic expressions and collocations
Disambiguation of polysemy
To solve these problems we are testing the following methods:

Use of a probabilistic method
Information processing by multiple inheritance
Implementation of a compound unit recognizer
Use of the domain
References
[Oh 1994] Oh, Kil-Lok, Key-Sun Choi, Sey-Young Park (1994). Korean Language Engineering (in Korean). Tae-Young-Sa.

[Chomsky 1981] Chomsky, N. (1981). Lectures on Government and Binding. The Pisa Lectures. Studies in Generative Grammar 9. Foris Publications, Dordrecht Holland & Cinnaminson U.S.A.

[Greenberg 1963] Greenberg, J.H. (1963). Some universals of grammar with particular reference to the order of meaningful elements. In: Joseph H. Greenberg (ed.) Universals of Language. The M.I.T. Press, Cambridge, Massachusetts, 2nd edition.

[Dorr 1993] Dorr, B.J. (1993). Machine Translation: A View from the Lexicon. MIT Press, Cambridge, Massachusetts. London, England.

[Hutchins 1992] Hutchins, W.J. & H.L. Somers (1992). An Introduction to Machine Translation. Academic Press.

[Jackendoff 1977] Jackendoff, R.S. (1977). X-bar Syntax: A Study of Phrase Structure. Cambridge: MIT Press.

[Lewis 1992] Lewis, D. (1992). Computers and Translation. In: Christopher Butler (ed.) Computers and Written Texts. Blackwell, pp. 75-114.

[Pollard 1994] Pollard, C. and I. Sag (1994). Head-Driven Phrase Structure Grammar. Studies in Contemporary Linguistics. The University of Chicago Press, Chicago & London.

[Sharp 1994] Sharp, R. (1994). CAT2 Reference Manual Version 3.6. IAI Working Papers N.27. Saarbruecken, Germany.

Endnote
This paper summarizes an experiment with the multilingual machine translation system CAT2 [Sharp 1994]. The CAT2 system currently runs on a UNIX workstation. It is written in PROLOG and uses a 'constraint bottom-up chart' parser. We are currently translating Korean into English and German, and are testing translation from Korean into French, Chinese, Russian, and Japanese as target languages.
Appendix 1. Multilingual common grammar rules written in CAT2 notation
the capital letter: variable as the feature structure
';' : or
'>>' : if-then-
'role' : feature for semantic role
'frame' : feature for subcategorization
head-final-structure
pred_arg_pred={head:H}.[A, {head:H,frame:({arg1:A};{arg2:A};{arg3:A};{arg4:A})}].
moded_mod_moded={head:H}.[{role:mod,head:{restr:R}}, {head:H}>>R].
funct_arg_funct={head:H}.[A, {head:H,frame:{arg1:A}}].
head-first-structure
pred_pred_arg={head:H}.[{head:H,frame:({arg1:A};{arg2:A};{arg3:A};{arg4:A})}, A].
moded_moded_mod={head:H}.[{head:H}>>R, {role:mod,head:{restr:R}}].
funct_funct_arg={head:H}.[{head:H,frame:{arg1:A}}, A].
head-middle-structure
coord_arg1_coord_arg2={head:H}.[A1, {head:H,frame:{arg1:A1,arg2:A2}}, A2].
Appendix 2. Multilingual Lexicon Information Structure
Attribute                   Value                    Explanation
string                      STRING                   string
lex                         LEX                      lexeme
first                       yes/no                   first position of morpheme
last                        yes/no                   last position of morpheme
pos                         left/right/middle        word order
role                        agent/theme/goal/ etc.   semantic role
head  cat                   noun/verb/adverb etc.    category of head
      scase                 subj/obj/obj2            surface case
      restr                 RESTR                    modifiee's information
      ehead  cat            noun/verb/adverb         category of ehead
             num            sing/pl                  number
             pform          ey/eyse...               postposition/preposition
             tense          pres/past/fut            tense
             aspect         perfect/prog/...         aspect
             modal          abil/permit/...          modality
             type           main/sub/rel/..          sentence type
             sem  stense    simul/ante/pros          semantic tense
                  saspect   perf/dur/term..          semantic aspect
                  smodal    abil/permit/..           semantic modality
                  anim      hum/ani/plant/..         lexical semantics
frame                       FRAME                    subcategorization

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1997

Hosted at Queen's University

Kingston, Ontario, Canada

June 3, 1997 - June 7, 1997

Organizers: ACH, ALLC
