Intuitive Interfaces for the Retrieval of Linguistic Data

paper
Authorship
  1. Eric Rochester

    University of Georgia

  2. William A. Kretzschmar

    University of Georgia

For many years, the Linguistic Atlas of the Middle and South Atlantic States (LAMSAS) has tried to make the data gathered for the Atlas available to anyone. At first this had to be done with printed books, but their cost in both time and money limited access to the volumes. Computerization of the complex phonetic LAMSAS data was an early priority of the present editor [3], [4]. An early program for computer access to the data was written in the high-level language of FoxBase+ to take advantage of the graphical resources of Macintosh computers, but distribution of this program was limited [2], [5]. The Internet, however, has allowed LAMSAS data to be distributed easily to anyone with a network connection. The first on-line version of LAMSAS attempted to develop an interactive Gopher interface. Since then, the public has been able to use a Web browser to access all the databases that have been keyboarded so far (cf. [7], [8]), using scripts and programs that recapitulate and extend many features of the original Macintosh program (http://hyde.park.uga.edu).
Our web site gets up to 10,000 hits per month, many from non-educational domains. Typical users range from grade school students to college students to curious web surfers. To make the information useful to specialists and non-specialists alike, intuitive, graphical interfaces have been used both to get queries from users and to present the data. For example, users can generate maps showing the communities where someone gave a particular reply to one of the survey questions [6]. Also, to look at information about a particular informant, users first click, on the overall survey map, the state in which the informant lives, and are then presented with a more detailed map of the state with the locations of the informants shown by labeled dots. By clicking on a label, users can see information about the informant. While this is intuitive enough, until the summer of 1997 users could only get information about one informant at a time. At that time, a web-based form was set up to enable searches of the informant database, which solved the immediate problem. However, this approach is limited in that only one value can be searched for in each field. For instance, getting a list of the informants in New York and New Jersey would require two separate searches.
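The limitation of the form can be illustrated with a minimal sketch. The records, field names, and function names below are hypothetical and invented only for illustration; they are not taken from the LAMSAS informant database or from the site's scripts.

```python
# Hypothetical informant records; identifiers and fields are invented for illustration.
INFORMANTS = [
    {"id": "A1", "state": "NY", "sex": "F"},
    {"id": "A2", "state": "NJ", "sex": "M"},
    {"id": "A3", "state": "GA", "sex": "M"},
]


def form_search(records, **criteria):
    """Form-style search: each field accepts exactly one value."""
    return [r for r in records
            if all(r.get(field) == value for field, value in criteria.items())]


def multi_value_search(records, **criteria):
    """More flexible search: each field accepts a set of acceptable values."""
    return [r for r in records
            if all(r.get(field) in values for field, values in criteria.items())]


# With the form, New York and New Jersey need two separate searches.
two_searches = form_search(INFORMANTS, state="NY") + form_search(INFORMANTS, state="NJ")

# A search that accepts several values per field needs only one.
one_search = multi_value_search(INFORMANTS, state={"NY", "NJ"})
```

Both calls select the same two records; the difference lies only in how many requests the user has to make.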
There are two obvious solutions. One is to extend the form. While this would be the easiest, it leaves open the question of how far to extend it. Although the database itself naturally limits the queries, using forms further restricts the possible searches. Moreover, the more the form is extended, the more unwieldy and the less intuitive it becomes. In the pursuit of flexibility, ease of use would be sacrificed.
The other option, which maintains ease of use while increasing flexibility, is to allow users to query the database using a natural language, in this case English. This is the option we have chosen. Because of the flexibility of natural language, it does not impose more constraints on the search than the database itself does. It also is no more unwieldy than English, and it allows us the option of later adding features to the search engine without having to change the overall nature of the interface.
In fact, such extensions have already been added. While originally only the informant database was included, now users can query either the informant database alone or a combination of the informant database with any number of the lexical databases, e.g., either "list everyone in Georgia" or "list everyone in Georgia who said andiron" (in response to the survey cue about the iron supports that hold the wood in a fireplace). Apart from learning that the search capabilities have been extended, users do not need to familiarize themselves with a new form or relearn a complicated query language.
As well as specifying what databases to search, users can also specify what fields to display. By default, all the fields queried in the search are included, but by using an "include" phrase users can specify more fields to include in the display. For instance, if users want to search on state, but also see the sex of the informants, they could query, "List everyone in South Carolina who said andiron, including the field sex."
To make the output as intuitive as the natural language input, users can specify how they want the data shown. When the user specifies the "list" option, the interface returns a list of the informants; when a "map" is requested, the interface returns a map showing the locations of the informants meeting the query specifications. With this function, users can ask the interface to "List everyone in Georgia who said andiron," "Map everyone in Georgia who said andiron," or "List and map everyone in Georgia who said andiron."
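As a rough illustration of these display options, the sketch below separates the output verbs and the "include" phrase from the rest of a query. It is an assumption-laden sketch in Python, not the site's own code: the function name `display_options`, the regular expressions, and the default of "list" when no verb is given are all hypothetical.

```python
import re

# Leading verbs: "list", "map", or "list and map" (pattern is an assumption).
VERB_PATTERN = re.compile(r"^(list|map)(?:\s+and\s+(list|map))?\s+", re.IGNORECASE)

# Trailing "including the field(s) ..." phrase (pattern is an assumption).
INCLUDE_PATTERN = re.compile(r",?\s*including the fields?\s+(.+)$", re.IGNORECASE)


def display_options(query):
    """Split a query into output modes, extra display fields, and search text."""
    text = query.strip().rstrip(".")

    modes = []
    verb_match = VERB_PATTERN.match(text)
    if verb_match:
        modes = [verb.lower() for verb in verb_match.groups() if verb]
        text = text[verb_match.end():]

    extra_fields = []
    include_match = INCLUDE_PATTERN.search(text)
    if include_match:
        names = re.split(r"\s*,\s*|\s+and\s+", include_match.group(1).lower())
        extra_fields = [name for name in names if name]
        text = text[:include_match.start()]

    return {
        "modes": modes or ["list"],  # assumed default when no verb is given
        "extra_fields": extra_fields,
        "search": text,
    }


options = display_options(
    "List everyone in South Carolina who said andiron, including the field sex."
)
both = display_options("List and map everyone in Georgia who said andiron")
```

Here `options["extra_fields"]` holds the additional field to display, and `both["modes"]` records that a list and a map were both requested; the remaining `search` text is what a query parser would then interpret.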
The process to implement this natural language query is fairly simple. We set up a semantic grammar to extract keywords from the query. This grammar is in the form of phrase structure rules that categorize the words and phrases according to what they add to the query, not according to their linguistic categories. In a simplified form, the grammar is as follows: all words that directly query the database are "values"; other words are designated "noise." A string of values that query the same field in the database constitutes a "field," and a field preceded by a negative or other qualifier constitutes a "field phrase." A string of field phrases constitutes a "query." A query preceded by an optional verb (either "list" or "map") constitutes a "command," which is then executed. The nodes in the phrase structure tree also have features, which specify the command that will execute the query represented by their constituents.
For example, in "Map all men not in Georgia or North Carolina," "map" is the verb, and "all men not in Georgia or North Carolina" is the query. The field phrase here is "not in Georgia or North Carolina," where "not" is a qualifier and "in Georgia or North Carolina" is a field. In this field, "Georgia" and "North Carolina" are values, while "in" and "or" are noise.
Originally, we implemented a prototype of this system using three tools: Prolog, the standard Definite Clause Grammar extension to Prolog to handle the phrase structure rules, and Michael Covington's Graph Unification Logic Programming (GULP) extension to handle the features [1]. After the problems in the algorithm had been worked out in Prolog, however, the system was ported to Perl, which allowed it to work with the web server and produce output more easily.
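The simplified grammar described above can be sketched in a few lines. The sketch is written in Python rather than the Prolog and Perl of the actual system, and it does not reproduce the DCG rules or GULP features; the vocabulary table, the function names, and the default verb are all assumptions made for illustration.

```python
import re

# Hypothetical vocabulary: each "value" maps to the database field it queries.
VALUES = {
    "georgia": "state",
    "north carolina": "state",
    "south carolina": "state",
    "men": "sex",
    "women": "sex",
}
QUALIFIERS = {"not"}
VERBS = {"list", "map"}


def tokenize(text):
    """Lowercase the query, keeping known two-word values as single tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    tokens = []
    i = 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if pair in VALUES:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def parse(text):
    """command -> (verb) query; query -> field phrase+; field phrase -> (qualifier) field."""
    tokens = tokenize(text)
    verb = "list"  # assumed default when the optional verb is omitted
    if tokens and tokens[0] in VERBS:
        verb = tokens.pop(0)

    query = []
    pending_negation = False
    for token in tokens:
        if token in QUALIFIERS:
            pending_negation = True
        elif token in VALUES:
            field = VALUES[token]
            # A qualifier or a change of field starts a new field phrase.
            if pending_negation or not query or query[-1]["field"] != field:
                query.append({"field": field, "negated": pending_negation, "values": []})
                pending_negation = False
            query[-1]["values"].append(token)
        # Every other word is noise and is skipped.
    return {"verb": verb, "query": query}


command = parse("Map all men not in Georgia or North Carolina")
```

For the example sentence, `command` contains the verb "map" and two field phrases: an unqualified one on the sex field with the value "men", and a negated one on the state field with the values "georgia" and "north carolina". Words such as "all," "in," and "or" fall through as noise.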
This type of query has turned out to be easily extensible and modifiable. It is a good way to provide a user interface to a database search engine that is both intuitive for users to query and easily maintainable and extensible for those managing the database. While not suitable for every type of database in the humanities (for instance, it would not work well for searching corpora), this application of linguistics to computing, applied back again to the humanities, has proven itself both efficient and useful and has continued the Atlas's tradition of using intuitive interfaces to provide intuitive information.
REFERENCES:
1. Covington, M. (1989) GULP 2.0: an extension of Prolog for unification-based grammar. Research Report AI-1989-01, Artificial Intelligence Programs, The University of Georgia.
2. Kirk, J., and W. Kretzschmar, Jr. (1992) Interactive Linguistic Mapping of Dialect Features. Literary and Linguistic Computing 7, pp. 168-175.
3. Kretzschmar, W. (1988) Computers and the American Linguistic Atlas. In Methods in Dialectology: Proceedings of the Sixth International Conference on Methods in Dialectology, edited by A. Thomas (Clevedon: Multilingual Matters), pp. 200-224.
4. Kretzschmar, W. (1989) Phonetic Display and Output. In Computer Methods in Dialectology, edited by W. Kretzschmar, E. Schneider, and E. Johnson (special issue of Journal of English Linguistics, vol. 22.1), pp. 47-53.
5. Kretzschmar, W. (1992) Interactive Computer Mapping for the Linguistic Atlas of the Middle and South Atlantic States (LAMSAS). In Old English and New: Essays in Language and Linguistics in Honor of Frederic G. Cassidy, edited by N. Doane, J. Hall, and R. Ringler (New York: Garland), pp. 400-414.
6. Kretzschmar, W. (1997) Computer-Assisted Study of American English Lexical Data. In From Ælfric to the New York Times: Studies in English Corpus Linguistics, edited by Udo Fries, Viviane Müller, and Peter Schneider (Amsterdam, Atlanta: Rodopi), pp. 239-247.
7. Kretzschmar, W., and R. Konopka. (1996) Management of Linguistic Databases. Journal of English Linguistics 24, pp. 61-70.
8. Kretzschmar, W. et al. (1993) Handbook of the Linguistic Atlas of the Middle and South Atlantic States. University of Chicago Press.

Conference Info

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1998
"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998

Series: ACH/ALLC (10), ACH/ICCH (18), ALLC/EADH (25)

Organizers: ACH, ALLC
