Applying Content Analysis to Humanities Computing Research Literature

Thomas B. Horton; Neal S. Coulter; Emanuel Grant

Authorship

1. Thomas B. Horton
Dept. of Computer Sci. & Engin. - Florida Atlantic University

2. Neal S. Coulter
Dept. of Computer Sci. & Engin. - Florida Atlantic University

3. Emanuel Grant
Dept. of Computer Sci. & Engin. - Florida Atlantic University

Parent session

LIT (b), Allen H. Renear

Original URL

http://lingua.arts.klte.hu/allcach98/abst/abs20.htm

Work text

Our paper will present the results of a content analysis study applied to humanities computing research literature. This work is part of a larger research effort taking place at our university that has employed these methods on data from other fields, notably computer science and software engineering. The method used in our work is co-word analysis [1], [2], [6], [8], [10], [11], [14]. Such an analysis attempts to reveal patterns and trends in technical discourse by measuring the strength of association of terms or key-words found in publications or other relevant texts for a given field of study. A main tenet of this approach is that the patterns revealed by this method are maps of the conceptual space of a field, answering questions about a field that are of interest. For example, the method has been used to map concepts found in software engineering literature, allowing us to distinguish it from the closely related field of computer science. Also, given data from different time periods, a series of maps may be produced to trace the conceptual changes in the field over a long period of time.
Our goal in applying this technique to humanities computing literature is two-fold. First, we are naturally seeking to learn more about the method's effectiveness and limitations by broadening our work to examine more discipline areas. Second, we believe this work can say something useful about the discipline of humanities computing itself. Scholars in this field might well agree upon the broad conceptual areas of research that make up humanities computing, but it is intriguing to see whether these areas can be identified empirically from the research literature itself. Results of a thorough study of the literature might help give scholars a better understanding of what makes their own discipline of humanities computing unique (a matter that is sometimes discussed in forums like the Humanist Listserv discussion group). Practical results are also possible. For example, results from a study of computer science literature will be utilized in reviewing and revising the ACM's Computing Classification System (CCS), a taxonomy developed by the major computing professional society that is used to classify publications in all areas of computing.
Results are currently complete for an initial small study of humanities computing research documents: the last two years of the ALLC/ACH conference. These results are encouraging in that they identify what the authors recognize as important topics that were addressed at these conferences. The results also recognized (and isolated) associations that one would reasonably expect to find in paper proposals submitted to a conference. This last result is not surprising, and in fact shows that the method is able to recognize verbal characteristics of the genre in addition to the texts' subject matter based on word associations. We will give some details of these results at the end of this proposal.
Our plan is to expand this study before the conference by looking at papers published in the journal Computers and the Humanities. Like several other previous co-word analysis studies, this will be based on the key-words assigned to these articles, not on the full text of the articles. Clearly more information would be present in the full-text versions of the articles, but examining key-words only is practical for such a large number of articles yet still allows us to look at changes over time.
The Co-word Analysis Method Used
Co-word analysis deals directly with sets of terms shared by documents. (A similar approach to this problem is co-citation analysis, which deals instead with shared citations.) Co-word analysis enables the structuring of data at various levels of analysis: (1) as networks of links and nodes; (2) as distributions of interacting networks; and (3) as transformations of networks over time periods.
Co-word analysis reduces a large space of related descriptors (i.e. words or phrases in our study) to multiple related smaller spaces that are easier to comprehend but are also indicative of actual partitions of interrelated concepts in the literature under consideration. This analysis requires an association measure and an algorithm for searching through a descriptor space. The analysis is designed to identify areas of strong focus that interrelate.
Metrics for co-word analysis have been studied extensively [1], [2], [8], [10], [14]. Two descriptors, i and j, co-occur if they are used together in the classification of a single document. Take a corpus consisting of N documents. Each document is indexed by a set of unique descriptors that can occur in multiple documents. Let c_k be the number of occurrences of descriptor k; i.e., the number of times k is used for indexing documents in the corpus. Let c_ij be the number of co-occurrences of descriptors i and j (the number of documents indexed by both descriptors). Different measures of association have been proposed. The basic metric used for this study is Strength S_ij.
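The counts defined above can be computed in a single pass over the corpus. The following is a minimal sketch, not the authors' implementation; the function name `count_descriptors` and the toy corpus are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def count_descriptors(documents):
    """Return (c, cij): occurrence counts and pairwise co-occurrence counts.

    `documents` is an iterable of descriptor collections, one per document.
    """
    c = Counter()
    cij = Counter()
    for descriptors in documents:
        # Each descriptor counts at most once per document.
        unique = sorted(set(descriptors))
        c.update(unique)
        # Sorting gives every unordered pair one canonical key (i, j) with i < j.
        cij.update(combinations(unique, 2))
    return c, cij


# Hypothetical toy corpus of N = 4 documents (not data from the study).
corpus = [
    {"tei", "sgml", "markup"},
    {"tei", "markup", "edition"},
    {"manuscript", "edition", "transcription"},
    {"tei", "sgml"},
]

c, cij = count_descriptors(corpus)
print(c["tei"])               # number of documents indexed by "tei"
print(cij[("sgml", "tei")])   # number of documents indexed by both descriptors
```

Storing each pair under a sorted tuple key avoids counting (i, j) and (j, i) separately.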
The Strength S of association between descriptors i and j is given by the expression:
$$S_{ij} = \frac{c_{ij}^{2}}{c_i \, c_j}, \qquad 0 \le S_{ij} \le 1$$
This metric has some useful properties and provides an intuitive measure of how strongly two terms are associated. A high value indicates only that some semantic relationship exists between the terms; it does not say what that relationship is.
Two descriptors that appear many times in isolation but only a few times together will yield a lower value than two descriptors that appear relatively less often alone but have a higher ratio of co-occurrences. Descriptors with relatively high values form the networks' links. A network consists of nodes (descriptors) connected by links. Each node must be linked to at least one other node in a network.
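The behaviour described in this paragraph can be checked numerically. The sketch below assumes that Strength takes the equivalence-index form common in the co-word literature, the squared co-occurrence count divided by the product of the two occurrence counts; the function name `strength` and the sample counts are hypothetical.

```python
def strength(c_i, c_j, c_ij):
    """Assumed form of Strength: c_ij^2 / (c_i * c_j), a value in [0, 1]."""
    if c_i <= 0 or c_j <= 0:
        raise ValueError("occurrence counts must be positive")
    if not 0 <= c_ij <= min(c_i, c_j):
        raise ValueError("co-occurrences cannot exceed either occurrence count")
    return (c_ij ** 2) / (c_i * c_j)


# Two frequent descriptors that rarely appear together (hypothetical counts).
frequent_pair = strength(c_i=100, c_j=100, c_ij=5)

# Two rarer descriptors with the same number of co-occurrences.
rare_pair = strength(c_i=10, c_j=10, c_ij=5)

print(frequent_pair, rare_pair)
assert rare_pair > frequent_pair
```

Under this assumed form, two descriptors that always appear together score 1, and two that never co-occur score 0.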
The co-word algorithm uses two passes through the data to produce pair-wise connections of descriptors in networks (see the attached graphic file, which is described below). Pass-1 builds networks that can identify areas of strong focus; Pass-2 can identify descriptors that associate in more than one network and thereby indicate pervasive issues. This pattern of networks yields a mosaic of the data being analyzed.
The first pass (Pass-1) generates the primary associations among descriptors; these descriptors are called internal nodes and the corresponding links are called internal links. A second pass (Pass-2) generates links between Pass-1 nodes across networks, thereby forming associations among completed networks. Pass-2 nodes and links are called external ones.
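The two passes can be sketched in code. This is a simplified illustration under stated assumptions, not the authors' algorithm: Pass-1 here seeds each network with the strongest unused link and grows it greedily up to a hypothetical `max_nodes` limit, and Pass-2 records every remaining link whose endpoints lie in different networks. The function name, parameter, and sample strengths are all invented for the example.

```python
def two_pass_networks(strengths, max_nodes=4):
    """Build networks from {(i, j): strength} with i < j; return a list of dicts."""
    # Strongest links first; ties are broken alphabetically for determinism.
    ranked = sorted(strengths, key=lambda pair: (-strengths[pair], pair))
    owner = {}      # descriptor -> index of the Pass-1 network that holds it
    networks = []

    # Pass-1: internal nodes and links.
    for seed in ranked:
        if seed[0] in owner or seed[1] in owner:
            continue  # a seed link needs two descriptors not yet in any network
        index = len(networks)
        net = {"nodes": set(seed), "internal": [seed], "external": []}
        owner[seed[0]] = owner[seed[1]] = index
        grown = True
        while grown and len(net["nodes"]) < max_nodes:
            grown = False
            for a, b in ranked:
                a_inside, b_inside = a in net["nodes"], b in net["nodes"]
                if a_inside == b_inside:
                    continue  # link is wholly inside or wholly outside this network
                new_node = b if a_inside else a
                if new_node in owner:
                    continue  # already claimed by another network
                net["nodes"].add(new_node)
                net["internal"].append((a, b))
                owner[new_node] = index
                grown = True
                break  # rescan from the strongest link
        networks.append(net)

    # Pass-2: external links join Pass-1 nodes that belong to different networks.
    for a, b in ranked:
        if a in owner and b in owner and owner[a] != owner[b]:
            networks[owner[a]]["external"].append((a, b))
            networks[owner[b]]["external"].append((a, b))
    return networks


# Hypothetical strengths, not results from the study.
sample = {
    ("markup", "tei"): 0.9,
    ("sgml", "tei"): 0.8,
    ("edition", "manuscript"): 0.7,
    ("manuscript", "transcription"): 0.6,
    ("edition", "tei"): 0.2,
}
for net in two_pass_networks(sample, max_nodes=3):
    print(sorted(net["nodes"]), net["external"])
```

In this sketch a descriptor belongs to at most one Pass-1 network, so any descriptor that shows up in another network's external links marks an association that crosses network boundaries.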

The attached graphic image file illustrates an example of these results. This figure displays the network connections as a map. Pass-1 links and nodes are represented by thick lines connecting thick boxes, respectively. Pass-2 nodes are in thin boxes, while Pass-2 links are shown as thin lines connecting Pass-1 and Pass-2 nodes.
Without some minimum constraints, descriptors appearing infrequently but almost always together could dominate networks; hence a minimum co-occurrence value is required to generate a link. At the same time, some maps can become cluttered due to an excessive number of legitimate links (but of generally decreasing values); hence, restrictions on numbers of nodes and links are sometimes required to help discover major partitions of concepts. However, many networks are limited only by the number of qualifying nodes.
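The effect of a minimum co-occurrence value can be illustrated with a small filter. This is a sketch under assumptions: the threshold of 3, the optional `max_links` cap, the function name, and the counts are hypothetical, and Strength is assumed to be the squared co-occurrence count divided by the product of the occurrence counts.

```python
def qualifying_links(c, cij, min_cooccurrence=3, max_links=None):
    """Return [((i, j), strength)] for pairs that meet the co-occurrence minimum."""
    links = []
    for (i, j), together in cij.items():
        if together < min_cooccurrence:
            continue  # too few shared documents to justify a link
        links.append(((i, j), together ** 2 / (c[i] * c[j])))
    # Strongest first; ties broken alphabetically so the order is stable.
    links.sort(key=lambda item: (-item[1], item[0]))
    return links if max_links is None else links[:max_links]


# Hypothetical counts: two descriptors that each occur once, in the same document,
# reach the maximum strength of 1.0 despite resting on a single document.
c = {"rare_a": 1, "rare_b": 1, "sgml": 15, "tei": 20}
cij = {("rare_a", "rare_b"): 1, ("sgml", "tei"): 12}

print(qualifying_links(c, cij, min_cooccurrence=1))  # the rare pair ranks first
print(qualifying_links(c, cij, min_cooccurrence=3))  # the rare pair is excluded
```

The `max_links` cap plays the role of the restriction on the number of links mentioned above; a cap on nodes would be applied while the networks are being built.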
Previous Work
The use of co-word analysis to develop and/or refine the taxonomy of a field is well established. Several strong European research teams continue to apply these techniques, sometimes in combination with other techniques [7], [12], [13], [15], [16]. In North America (aside from the examples already mentioned), co-word analysis has been integrated into knowledge-level support systems for scholarly communities [9]. Prototypes have been built that combine term association maps with knowledge acquisition and knowledge representation techniques to build detailed formal knowledge structures that provide a framework for knowledge expression, interchange, and collaborative development.
Results of the Initial Study
The maps produced by our initial analysis do reveal some easily-recognized themes. The map submitted with this proposal includes a number of Pass-1 nodes related to the creation of editions, including electronic editions, of a work. The word "manuscript" has strong associations with "book", "reading", "transcription", "edition" and "variant". This was the sixth map found by the analysis.
Other maps that appear to represent conceptual themes include the second map, which clearly relates to papers at the conferences on the TEI and related markup issues. Pass-1 nodes include "markup language", "markup", "SGML", "TEI", "encoding", etc. Interestingly, this map shows a strong degree of association among its own nodes, but a low level of secondary links to other networks. This suggests that the vocabulary of papers on the TEI and related subjects is very specifically focused on issues of markup, without sharing key-word associations with other themes represented at these conferences.
The third map produced shows links between Pass-1 nodes "web", "program", and "university". Other Pass-1 nodes in this map are "development", "department", "student", and "teaching". This map appears to reflect the creation and use of Web resources in academic settings. Other maps (which will of course be described at the conference) seem to focus on concepts such as writers and authorship, and on corpus-based studies.
The seventh map produced by the co-word analysis may interest a humanities computing scholar searching for something to help identify or define their own role in the world of research. This map links the Pass-1 node "humanity" with "scholar", "issue", "technology", and "computer". Other Pass-1 nodes found in the map include "individual", "collection", "library", "model", and "issue".
This last map is perhaps not surprising, since those writing paper proposals for a humanities computing conference employ the language and associate terms that reflect their own view of who they are. Likewise, these writers use the vocabulary employed by anyone writing a conference paper proposal, and this is also uncovered by the co-word analysis results. The first map produced by the analysis does not suggest any easily recognizable theme of the conference papers or our field, but includes words like "problem", "example", "project", "text", "way" and "number". Several of these terms seem to reflect how one naturally describes one's research and its results. The fourth map has a similar nature; it includes Pass-1 nodes like "information", "language", "time", "work", "use", "type", "form", "structure", "word", and "tool". These two maps are connected with a great many of the other maps, and seem to indicate a common vocabulary used by anyone writing a paper proposal for these conferences.
Conclusion
Our initial study indicates that this method of co-word analysis sheds some interesting light on concepts found empirically in the analysis of conference abstracts for two recent ALLC/ACH conferences. These results will be presented in more detail at the conference, including more detailed information about the method itself. Our larger study, based on key-words assigned to published papers in the journal Computers and the Humanities, will serve as a more challenging test for the methods introduced in this proposal.
References
1. Callon, M., Law, J., & Rip, A. (1986). Mapping of the dynamics of science and technology. London: Macmillan.
2. Callon, M., Courtial, J-P., & Laville, F. (1991). Co-word analysis as a tool for describing the network of interactions between basic and technological research: The case of polymer chemistry. Scientometrics, 22(1), pp. 153-203.
3. Coulter, N., Monarch, I., & Konda, S. Software Engineering as Seen Through Its Research Literature: A Study in Co-Word Analysis. To appear in the Journal of the American Society for Information Science.
4. Coulter, N., Monarch, I., Konda, S., & Carr, M. (1996a). An evolutionary perspective of software engineering research through co-word analysis. CMU/SEI-96-TR-019, Software Engineering Institute, Carnegie Mellon University, Pittsburgh.
5. Coulter, N. (1998). Changes to the Computing Classification System. Computing Reviews, 39(1) (forthcoming).
6. Courtial, J-P. (1994). A co-word analysis of scientometrics. Scientometrics, 31(3), pp. 251-260.
7. Courtial, J-P., Cahlik, T., & Callon, M. (1994). A model for social interaction between cognition and action through a key-word simulation of knowledge growth. Scientometrics, 31(2), pp. 173-192.
8. Courtial, J-P., & Law, J. (1989). A co-word study of artificial intelligence. Social Studies of Science, 19, pp. 301-311. London: Sage.
9. Gaines, B., & Shaw, M. (1994). Knowledge acquisition and representation techniques in scholarly communication. Proceedings SIGDOC'94: ACM 12th Annual International Conference on Systems Documentation, pp. 251-260. New York: ACM.
10. Law, J., & Whittaker, J. (1992). Mapping acidification research: A test of the co-word method. Scientometrics, 23(3), pp. 417-461.
11. Turner, W., Chartron, G., Laville, F., & Michelet, B. (1988). Packaging information for peer review: New co-word analysis techniques. In Van Raan, A. (Ed.), Handbook of quantitative studies of science and technology (pp. 291-323). Amsterdam: North Holland.
12. Turner, W., & Rojouan, F. (1991). Evaluating input/output relationships in a regional research network using co-word analysis. Scientometrics, 22(1), pp. 139-154.
13. Turner, W., Lelu, A., & Georgel, A. (1994). Geode: Optimizing data flow representation techniques in a network information system. Scientometrics, 30(1), pp. 269-281.
14. Whittaker, J. (1989). Creativity and conformity in science: Titles, keywords, and co-word analysis. Social Studies of Science, 19, pp. 473-496.
15. Zitt, M. (1991). A simple method for dynamic scientometrics using lexical analysis. Scientometrics, 22(1), pp. 229-252.
16. Zitt, M., & Bassecoulard, E. (1994). Development of a method for detection and trend analysis of research fronts built by lexical and cocitation analysis. Scientometrics, 30(1), pp. 333-351.

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 1998

"Virtual Communities"

Hosted at Debreceni Egyetem (University of Debrecen) (Lajos Kossuth University)

Debrecen, Hungary

July 5, 1998 - July 10, 1998

109 works by 129 authors indexed

Conference website: https://web.archive.org/web/19991022041140/http://lingua.arts.klte.hu/allcach98/

References: http://web.archive.org/web/19990225164509/http://lingua.arts.klte.hu/allcach98/abst/jegyzek.htm

Attendance: ~60 (https://web.archive.org/web/19990128030244/http://lingua.arts.klte.hu/allcach98/listpar3.htm)

Series: ACH/ALLC (10), ACH/ICCH (18), ALLC/EADH (25)

Organizers: ACH, ALLC
