Parsing a Web site with linguistic resources: GlossaNet

paper
Authorship
  1. Cedrick Fairon

    New York University, Dept of Linguistics - University of Paris

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

GlossaNet is an automated system that monitors Web sites. On dates and at intervals selected by the user, GlossaNet downloads the Web site, converts it into an electronic corpus, and uses the INTEX programs (M. Silberztein 1993) and the linguistic resources of the LADL (electronic dictionaries and libraries of local grammars) to parse it (B. Courtois and M. Silberztein 1990). We present the on-line version of GlossaNet, which is accessible on the Internet and offers an automatic concordancing service (http://glossa.ladl.jussieu.fr). It is mainly designed for use by linguists, but it is also used for information retrieval purposes, since the corpora available in GlossaNet are the daily updated on-line editions of 25 newspapers in French, English, Italian, Portuguese and Spanish.

Dynamic corpora

We borrow the term dynamic corpus from A. Renouf (1992, 1994) to characterize the way corpora are treated in GlossaNet. In linguistic studies, the term corpus generally refers to a static, finite collection of texts gathered on the basis of criteria chosen according to the planned applications. Once the corpus has been set up, it does not change. But, as A. Renouf showed through the AVIATOR project (Birmingham University), another approach to corpus design is possible, in which the corpus is viewed as a flow of electronic textual data. The technical difference between AVIATOR and GlossaNet is the full automation of GlossaNet: a module of GlossaNet called CorpusWeb downloads and converts a Web site into a corpus that feeds the flow of electronic data (see Figure "GlossaNet Process"). In our system, Web sites are treated and parsed as corpora, in fact as dynamic corpora, since their content changes over time (C. Fairon 1999). D. Walker (1999) has also used a Web crawler to create corpora.

GlossaNet Process

Here is the process that GlossaNet relaunches automatically each time the Web site is updated:

CorpusWeb (a Web grabber) downloads the updated Web site;
A filter converts the retrieved HTML documents into one simple text file (the new "Corpus");
If necessary, the system archives the new corpus (for later use);
The new corpus is processed using the INTEX methodology (the corpus is divided into sentences; large-coverage dictionaries and grammars are applied to the text);
Using the INTEX programs, the user's request is applied to the corpus.
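As a rough illustration, the five steps above can be sketched in Python. All names here (TextExtractor, split_sentences, concordance) are hypothetical; GlossaNet's actual implementation relies on CorpusWeb and the INTEX programs, not on this code.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Step 2: a filter that keeps only the text of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(html_page):
    parser = TextExtractor()
    parser.feed(html_page)
    # Collapse runs of whitespace left over from the markup.
    return " ".join(" ".join(parser.chunks).split())

def split_sentences(text):
    # Crude sentence splitter standing in for INTEX preprocessing (step 4).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def concordance(sentences, pattern):
    # Step 5: apply the user's request (here an ordinary regex) to the corpus.
    rx = re.compile(pattern)
    return [s for s in sentences if rx.search(s)]

# Simulated download (step 1 would be a real HTTP fetch of the updated site).
pages = ["<html><body><p>The FBI opened a case. He was in a hurry.</p></body></html>"]
corpus = " ".join(html_to_text(p) for p in pages)   # step 2: the new "Corpus"
hits = concordance(split_sentences(corpus), r"\bFBI\b")
```

Step 3 (archiving) is omitted here; it would simply write `corpus` to dated storage for later reuse.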

On-line service

Users must register on the GlossaNet server to access the system. Once this one-time registration is completed, they choose a working language, and GlossaNet displays the list of available corpora for that language (e.g. the Chicago Tribune, Los Angeles Times, Philadelphia Inquirer, New York Post, The Guardian, The Times and The Herald Tribune are available for English). The user chooses a corpus and composes his/her request in the form of a regular expression or a graph (a finite-state automaton). Here are examples of valid regular expressions that can be applied to an English corpus:

Example of regular expression                         Matched patterns

((<be><V:G> to)+<will>)<V:W>                          am going to rent, will check, etc.
<be>in(<DET>+<E>)<N>                                  was in a hurry, are in a sweat, etc.
the (FBI+Federal Bureau of Investigation+Bureau)      the FBI, the Federal Bureau of Investigation, the Bureau
<be>a good <N+hum>                                    is a good man, was a good teacher
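In the INTEX notation above, + denotes union and angle brackets refer to dictionary entries (e.g. <be> matches any inflected form of to be). The third expression uses only literal words, so as a rough analogy it can be approximated with an ordinary regular expression, where + becomes |; the other expressions depend on INTEX's electronic dictionaries and have no such direct equivalent.

```python
import re

# INTEX's union operator "+" corresponds to "|" in ordinary regexes, so
# "the (FBI+Federal Bureau of Investigation+Bureau)" is approximated as:
rx = re.compile(r"\bthe (FBI|Federal Bureau of Investigation|Bureau)\b")

text = "Agents from the FBI, i.e. the Federal Bureau of Investigation, arrived."
matches = [m.group(0) for m in rx.finditer(text)]
# matches: ["the FBI", "the Federal Bureau of Investigation"]
```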

In theory, graphs are equivalent to regular expressions, but in practice they offer a more convenient interface for representing complex structures. For instance, the following graph is equivalent to the first regular expression in the table above:

Each path of the graph defines a "valid" pattern that will be found if the graph is applied to a text.
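The idea that each path is one pattern can be sketched minimally as follows (this is not INTEX's graph format, and the transition labels are illustrative): a graph is a table of labelled transitions, and enumerating every path from the start state to the final state yields the set of valid patterns.

```python
# A toy graph for the union pattern ((<be><V:G> to)+<will>)<V:W>:
# state 0 branches into the two union members, which share the tail <V:W>.
graph = {
    0: [("be:V:G", 1), ("will", 2)],   # the two branches of the union
    1: [("to", 2)],
    2: [("V:W", 3)],                   # shared tail: an infinitive verb
}
FINAL = 3

def paths(state, prefix=()):
    """Yield every sequence of labels from `state` to the final state."""
    if state == FINAL:
        yield prefix
        return
    for label, nxt in graph.get(state, []):
        yield from paths(nxt, prefix + (label,))

patterns = [" ".join(p) for p in paths(0)]
# patterns: ["be:V:G to V:W", "will V:W"]
```

The two enumerated paths correspond to the two families of matches in the table, e.g. "am going to rent" and "will check".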

Results are sent to the user by e-mail in the form of a concordance. If the user has opted for an HTML concordance, each pattern matched by the user's request is presented in the concordance as a hyperlink that gives access to the original Web page where the occurrence was found. The occurrence is automatically highlighted in that page.
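A minimal sketch of such an HTML concordance line is shown below. The function name is hypothetical, and the in-page highlighting that GlossaNet performs on the original Web page is not reproduced; only the hyperlinked, emphasized match is.

```python
import html

def concordance_line(left, match, right, url):
    # The matched pattern is wrapped in a hyperlink pointing back to the
    # original Web page, with the occurrence shown in bold.
    return (f'{html.escape(left)} '
            f'<a href="{html.escape(url, quote=True)}"><b>{html.escape(match)}</b></a> '
            f'{html.escape(right)}')

line = concordance_line("He", "was in a hurry", "to leave.",
                        "http://example.org/article.html")
```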

Applications

The on-line version is mainly used by linguists for locating examples of lexical/syntactic structures but also by people who have to survey the press for professional reasons. This second category of users does not look for lexical or syntactic structures, but uses keywords instead.

For each language, GlossaNet includes several newspapers from various parts of the world, so it can also be used for comparative studies (for example, in French there are corpora from France, Belgium, Quebec and Switzerland).

Recently, the system has been used at the LADL (Laboratoire d'Automatique Documentaire et Linguistique, Université Paris 7) to update the DELA electronic dictionaries of English. Maintaining and extending these dictionaries is a considerable task, and an automated system that simplifies it is very useful. GlossaNet was used to automatically retrieve unknown common words in newspapers. The methodology and results are discussed in C. Fairon and B. Courtois (2000).
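The core idea, retrieving words that the dictionaries do not cover, can be sketched as follows. The toy word list stands in for the DELA dictionaries, and the actual procedure described in Fairon and Courtois (2000) is of course more elaborate.

```python
import re

# Toy stand-in for the DELA electronic dictionaries.
dictionary = {"the", "reporter", "filed", "a", "story", "about"}

def unknown_words(corpus, dictionary):
    # Tokenize crudely and keep the words absent from the dictionary:
    # these are candidate entries for extending the lexicon.
    tokens = re.findall(r"[a-z]+", corpus.lower())
    return sorted(set(tokens) - dictionary)

candidates = unknown_words("The reporter filed a story about weblogs.", dictionary)
# candidates: ["weblogs"]
```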

Because GlossaNet on-line requires no installation or special configuration on the user’s machine, it can be easily used for teaching.

Conclusion

GlossaNet combines several pre-existing technologies (a Web grabber, a corpus parser and linguistic resources) in order to parse Web sites as corpora.

The on-line system offers linguists a simple way of finding attestations of lexical and syntactic patterns in press corpora. It is no longer necessary to manipulate corpora and software to find new attestations: once the request is recorded, the system repeats the task automatically and sends a new concordance by e-mail every day or week.

During the first test period, GlossaNet on-line was used by more than 450 people and sent more than 600 concordances on a daily basis.

References

Fairon, Cédrick; Blandine Courtois. 2000. "Corpus dynamique et GlossaNet: Extension de la couverture lexicale des dictionnaires électroniques du LADL à l'aide de GlossaNet". In Actes du Colloque JADT 2000: 5e Journées Internationales d'Analyse Statistique des Données Textuelles, Lausanne.

Fairon, Cédrick. 1999. "Parsing a Web site as a Corpus". In C. Fairon (ed.). 1998-1999. Analyse lexicale et syntaxique: Le système INTEX, Lingvisticae Investigationes Tome XXII (Volume spécial), Amsterdam/Philadelphia: John Benjamins Publishing Co., 450 p.

Renouf, Antoinette. 1992. "A Word in Time: first findings from the investigation of dynamic text". ICAME Conference, Nijmegen.

Renouf, Antoinette. 1994. "Corpora and Historical Dictionaries". In I. Lancashire and T. Russon Wooldridge (eds.), Early Dictionary Databases, Center for Computing in the Humanities, University of Toronto, pp. 219-235.

Silberztein, Max. 1999. "Transducteurs pour le traitement automatique des textes". In B. Lamiroy (ed.), Le Lexique-grammaire. Travaux de Linguistique 37, pp. 127-142. Bruxelles: Duculot.

Walker, Derek. 1999. "Taking Snapshots of the Web with a TEI Camera". In Computers and the Humanities 33(1/2), pp. 185-192.


Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2001

Hosted at New York University

New York, NY, United States

July 13, 2001 - July 16, 2001

94 works by 167 authors indexed

Series: ACH/ICCH (21), ALLC/EADH (28), ACH/ALLC (13)

Organizers: ACH, ALLC
