The Application of Corpus Linguistic Methods in Medical Information Retrieval

poster / demo / art installation
Authorship
  1. Gabriella Szakál

    School of Public Health - Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Work text

RATIONALE:

MEDLINE, produced by the US National Library of Medicine (NLM), is today one of the most widely used bibliographic databases in medicine. To meet the general requirement that every information retrieval system faces, NLM has designed and implemented a controlled medical vocabulary called MeSH (Medical Subject Headings) [1]. MeSH is an exciting topic for interdisciplinary analysis that combines the approaches of linguistics, computing, and library and information science.

OBJECTIVES:

The author's purpose is threefold: to give an overview of the problems of (medical) information retrieval; to find evidence, with the help of corpus linguistics, for the NLM indexers' choice of certain subject headings; and to apply the findings in interdisciplinary fields such as automated indexing, medical librarianship and information retrieval. Her starting hypothesis is that the terms occurring most frequently in a text are usually the ones selected as headword concepts [2].

METHODS:

The first step is to collect a small corpus of medical articles and to retrieve the MeSH headings already assigned to them in the MEDLINE database. The text of each article is then used as input to a concordancer, which produces word lists with their frequencies in the text. The first conclusions are drawn from a linguistic and semantic comparison of the two groups of terms.
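The word-list step and the comparison with assigned headings can be sketched in a few lines of Python. This is only a minimal illustration, not the concordancer used in the study: the tokenizer, the function names, the sample text and the sample headings are all assumptions made for the example.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter of lowercased word forms found in the text."""
    # Simple tokenizer (an assumption): runs of letters, optionally hyphenated.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(tokens)


def shared_terms(frequencies, headings):
    """Return the single-word headings that also occur in the text, with counts."""
    return {
        h.lower(): frequencies[h.lower()]
        for h in headings
        if frequencies[h.lower()] > 0
    }


# Hypothetical article fragment and hypothetical assigned headings.
article = (
    "Asthma is common in children. Inhaled steroids reduce asthma "
    "symptoms, and children with asthma need regular review."
)
assigned_headings = ["Asthma", "Child", "Steroids"]

freqs = word_frequencies(article)
print(freqs.most_common(2))                     # most frequent word forms
print(shared_terms(freqs, assigned_headings))   # headings found verbatim in the text
```

In this toy input the heading "Child" is not found, because the text only contains the form "children". A purely literal comparison of the two groups of terms therefore misses some matches, which is one reason the word forms need to be standardized against the controlled vocabulary.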

DISCUSSION:

In the author's opinion, it would be very important for a small library (or for individuals possessing a large corpus of medical texts not indexed by subject matter) to be able to assign keywords to these texts automatically in order to improve retrieval of the information they contain. This is especially true for publications not included in the NLM's database, since it is well known that MEDLINE does not aim to cover every journal in the biomedical field but applies strict journal inclusion criteria [3]. As a result, articles and other publications excluded from MEDLINE for some reason (because they were published in journals considered 'minor', or because they are manuscripts, albeit of scientific value) cannot be retrieved with the help of MEDLINE; moreover, they have not been given headwords from a controlled thesaurus such as MeSH or UMLS. As for the latter point, it is widely accepted that medical librarians and information specialists must aim at systematic and consistent indexing. The presentation will go into technical issues such as a) the electronic format of texts, b) word frequency lists (excluding 'STOP'-words) and c) standardization of word formats on the basis of the MeSH controlled vocabulary.

The order of the above issues is also the order in which they are carried out. Point a) refers to the problem of how to capture medical texts in a format that can be processed electronically. Once stored in one or more files, the elements of the corpus are processed by a concordancer, which produces word frequency lists (point b). A prerequisite of the headword assignment procedure is that the author defines the group of words to be treated as 'STOP'-words, i.e. words that are automatically excluded from the list. The final step in the indexing procedure is to try to map the concepts resulting from the above operations onto a controlled thesaurus such as MeSH (point c). The output of the mapping is the set of keywords that can then be assigned to the medical article and that enable users to search the database with a standardized terminology.
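The three steps can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the procedure actually implemented: the stop-word list, the word-to-term table and the function name are hypothetical, and a real system would consult the MeSH vocabulary itself instead of a hand-written dictionary.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the actual list is left to the author to define.
STOP_WORDS = {"the", "of", "and", "in", "a", "is", "with", "for", "to", "was", "were"}

# Hypothetical table mapping word forms to controlled terms (illustrative entries).
TERM_MAP = {
    "children": "Child",
    "child": "Child",
    "asthma": "Asthma",
    "steroids": "Steroids",
    "steroid": "Steroids",
}


def suggest_keywords(text, top_n=5):
    """Suggest controlled terms for a text, most frequent first."""
    # Point a: the text is assumed to be available as a plain string.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Point b: frequency list of word forms, with stop-words excluded.
    word_counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # Point c: map word forms onto controlled terms and pool their counts.
    term_counts = Counter()
    for word, count in word_counts.items():
        term = TERM_MAP.get(word)
        if term is not None:
            term_counts[term] += count
    return [term for term, _ in term_counts.most_common(top_n)]


sample = (
    "Asthma in children was treated with a steroid. Children with asthma "
    "improved, and asthma attacks were rare."
)
print(suggest_keywords(sample))
```

Pooling the counts after the mapping means that variant forms such as "steroid" and "steroids" strengthen the same keyword instead of competing with each other. Words with no entry in the table are simply dropped here; how to handle them in practice is left open.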

To supplement the above, the author will also describe a small-scale survey on terminology selection habits in MEDLINE searching, carried out among foreign medical students (both native and non-native speakers of English) studying at the University Medical School of Debrecen. The survey had two main objectives. On the one hand, it sought information about the students' habits in turning to MEDLINE during their university years; on the other hand, the author aimed to find a starting point for the analysis of the terms included in MeSH. The author's interest stemmed from her experience that students very rarely start a search with the appropriate term. This often causes them little trouble because MEDLINE either offers an 'assist' function for finding the valid MeSH term or automatically converts the query term into the appropriate term, on the basis of which the search is then executed; knowing the appropriate term may nevertheless result in better precision. The survey was intended as a preliminary study, which might be followed by a study involving a larger sample and a larger number of concepts. For the present purposes, five concepts were selected on the basis of the author's experience so far, with 2-4 alternative terms given for each. The students had to choose which term they would type in to find references on the underlying concept (lexical variation is ignored in this version). Once the findings of the survey have been analysed, the next step is a detailed study of the selected terms with the help of a larger computerized medical corpus. The presentation will look in more detail at the necessity and aims of the survey, the selection of the concepts included and the conclusions drawn, and it will present implications for future study.

CONCLUSIONS:

The developments in medicine and the consequent growth of the medical literature present a great challenge for the medical librarian in providing her services. The use of corpus linguistics and computing, for which the author has tried to show one possible method in the present paper, may therefore help to meet this challenge.

REFERENCES

1. Lindberg DAB, Humphreys BL, McCray AT. The Unified Medical Language System. Methods of Information in Medicine 32:281-91, 1993.

2. Aronson AR, Rindflesch TC, Browne AC. Exploiting a large thesaurus for information retrieval. (manuscript)

3. Fact Sheet: Journal Selection for Index Medicus® / MEDLINE®. http://www.nlm.nih.gov/pubs/factsheets/jsel.html (20 Oct 1997)


Conference Info


ACH/ALLC / ACH/ICCH / ALLC/EADH - 1998
"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998

109 works by 129 authors indexed

Series: ACH/ALLC (10), ACH/ICCH (18), ALLC/EADH (25)

Organizers: ACH, ALLC
