Information extraction with INTEX

Max Silberztein
IBM T.J. Watson Research Center


INTEX is a development environment that allows users to rapidly construct, test and maintain descriptions of specific patterns that occur in natural-language texts. For an overview of the system, see [Silberztein 1999]. Each description is represented by a local grammar, usually entered via the INTEX graph editor.

Local grammars can be used to represent:

-- character-based patterns, for the recognition of phone numbers (e.g. "sequence of 3 digits, followed by a space or a hyphen, followed by 4 digits"), email or Internet addresses, times or dates expressed numerically, reference or serial numbers, sentence endings, etc.

-- orthographical patterns, for the recognition of spelling variants (e.g. "centre" or "center"), company names and their variants ("International Business Machines Corp.", "Big Blue"), etc.

-- morphological patterns, for the recognition of families of derived words (e.g. "France, French, Frenchmen, frenchify") and inflected forms (conjugation of verbs, inflection of nouns);

-- families of lexical entries, for the recognition and indexing of related terms and concepts (e.g. "credit card, debit card, MasterCard, visa card...");

-- morphosyntactic patterns, for the recognition of frozen or semi-frozen expressions, such as complements of dates and times (e.g. "on Monday the 15th at 3PM", "two days ago in the early afternoon"), of locations, addresses, etc.

-- other morphosyntactic patterns, for the recognition and co-indexing of transformed syntactic constructions (e.g. "N0's trip to N1 = N0 went, traveled to N1"). These techniques involve the use of transducers and can therefore be applied to text encoding. A. Tutin, for example, has used INTEX for XML encoding and for semi-automatic tagging of anaphora.
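As an illustration of the first kind of pattern in the list above, the phone-number description can be approximated with an ordinary regular expression. This is only a minimal sketch in Python, not INTEX graph syntax; the names PHONE and find_phone_numbers and the sample sentence are assumptions made for the example.

```python
import re

# Hypothetical pattern for the description quoted in the list:
# 3 digits, then a space or a hyphen, then 4 digits.
# The lookarounds keep the match from starting or ending inside a longer digit run.
PHONE = re.compile(r"(?<!\d)\d{3}[ -]\d{4}(?!\d)")

def find_phone_numbers(text: str) -> list[str]:
    """Return every substring of text that matches the phone pattern."""
    return PHONE.findall(text)

sample = "Call 555-1234 or 555 9876; order ref 1234567 is not a phone number."
print(find_phone_numbers(sample))
```

The serial number in the sample is skipped because it has no separator between the third and fourth digits.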

One important characteristic of INTEX is that each local grammar can be easily re-used in other local grammars. Developers typically construct simple, elementary graphs that are equivalent to finite-state transducers (FSTs), and re-use these elementary graphs to construct more complex graphs.

This process is similar to the way engineers use Computer Aided Design systems to build "black boxes": simple logical operators (AND, XOR) are designed first, then reused in elementary arithmetic operations (ADD), which are in turn reused in large numbers in more complex arithmetic operations (ADD64), in ALUs, processors, etc. INTEX provides tools to help design, test, debug, refine and maintain large numbers of local grammars in libraries.
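The reuse of elementary graphs inside larger ones can be sketched with named sub-patterns. The following Python fragment is an assumption-laden illustration, not the INTEX graph format: the LIBRARY entries, the <NAME> reference notation and the expand helper are all hypothetical, and references must not be cyclic.

```python
import re

# Elementary "graphs": small patterns that can be tested on their own.
# All names and patterns here are hypothetical illustrations.
LIBRARY = {
    "DAY": r"Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday",
    "ORDINAL": r"(?:[1-9]|[12]\d|3[01])(?:st|nd|rd|th)",
    "HOUR": r"(?:1[0-2]|[1-9])(?:AM|PM)",
    # Larger graphs refer to smaller ones by name, written <NAME>.
    "DATE": r"on <DAY> the <ORDINAL>",
    "DATE_TIME": r"<DATE> at <HOUR>",
}

def expand(name: str, library: dict[str, str]) -> str:
    """Inline every <NAME> reference, recursively, into one regular expression."""
    def inline(match: re.Match) -> str:
        # Wrap the inlined body in a group so alternations stay self-contained.
        return "(?:" + expand(match.group(1), library) + ")"
    return re.sub(r"<([A-Z_]+)>", inline, library[name])

DATE_TIME = re.compile(expand("DATE_TIME", LIBRARY))
found = DATE_TIME.search("The meeting is on Monday the 15th at 3PM.")
print(found.group(0) if found else None)
```

Improving one elementary entry (for instance adding abbreviated day names to DAY) automatically benefits every larger pattern that refers to it, which is the point of the "black box" analogy.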

Another characteristic of INTEX is that all the objects processed (grammars, dictionaries and texts) are internally represented by FSTs. Therefore, all the functionalities provided by the system are expressed as a limited number of operations on FSTs. For instance, applying a grammar to a text is performed by computing the union of the grammar FSTs, and then the intersection of the resulting FST and the text FST. This architecture allows for very efficient algorithms (e.g. when applying a deterministic FST to indexed texts) and gives INTEX the power of a Turing machine (thanks to the ability to cascade FSTs).
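The union-then-intersection step can be sketched on toy data. The code below is a simplified illustration under several assumptions: it uses plain token-level automata rather than transducers, the text is treated as a linear automaton over tokens, and every name (Automaton, from_sequences, union, apply_grammar) is hypothetical rather than part of INTEX.

```python
from typing import NamedTuple

class Automaton(NamedTuple):
    starts: frozenset   # start states
    finals: frozenset   # accepting states
    delta: dict         # (state, token) -> set of next states

def from_sequences(name, sequences):
    """Build a toy grammar automaton accepting each given token sequence."""
    delta, finals = {}, set()
    start, counter = (name, 0), 1
    for seq in sequences:
        state = start
        for tok in seq:
            nxt = (name, counter)
            counter += 1
            delta.setdefault((state, tok), set()).add(nxt)
            state = nxt
        finals.add(state)
    return Automaton(frozenset([start]), frozenset(finals), delta)

def union(*automata):
    """Union of automata with disjoint state names: pool starts, finals, transitions."""
    starts, finals, delta = set(), set(), {}
    for a in automata:
        starts |= a.starts
        finals |= a.finals
        for key, targets in a.delta.items():
            delta.setdefault(key, set()).update(targets)
    return Automaton(frozenset(starts), frozenset(finals), delta)

def apply_grammar(grammar, tokens):
    """Intersect the grammar with the linear text automaton (position i
    --tokens[i]--> position i+1), starting from every text position.
    Returns the sorted (begin, end) token spans the grammar recognizes."""
    matches = set()
    for begin in range(len(tokens)):
        frontier = {(g, begin) for g in grammar.starts}
        while frontier:
            nxt = set()
            for g, pos in frontier:  # each pair is a state of the product automaton
                if g in grammar.finals and pos > begin:
                    matches.add((begin, pos))
                if pos < len(tokens):
                    for g2 in grammar.delta.get((g, tokens[pos]), ()):
                        nxt.add((g2, pos + 1))
            frontier = nxt
    return sorted(matches)

dates = from_sequences("date", [["on", "Monday"], ["on", "Tuesday"]])
cards = from_sequences("card", [["credit", "card"], ["debit", "card"]])
grammar = union(dates, cards)
tokens = "I lost my credit card on Monday".split()
spans = apply_grammar(grammar, tokens)
print([" ".join(tokens[b:e]) for b, e in spans])
```

Because the two grammars are merged before the text is read, the text is traversed once per starting position regardless of how many grammars are in the union.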

I will describe the implementation of a large-coverage description of French determiners, based on the descriptions available in Goosse & Grevisse (1986), Gross (1986) and Salkoff (1999). The grammar is organized as a hundred local grammars represented by finite-state automata.

References:

Goosse, André; Grevisse, Maurice. 1986. Le Bon Usage. Duculot: Paris-Gembloux.

Gross, Maurice. 1986. Grammaire transformationnelle du français : syntaxe du nom. Cantilène: Malakoff.

Salkoff, Morris. 1999. A French-English Grammar. John Benjamins: Amsterdam, Philadelphia.

Silberztein, Max. 1999. "Text Indexation with INTEX". In Computers and the Humanities vol. 33. Kluwer Academic Publishers: Amsterdam.


Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2001

Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001

Series: ACH/ICCH (21), ALLC/EADH (28), ACH/ALLC (13)

Organizers: ACH, ALLC
