A Survey on LDA Topic Modeling in Digital Humanities

Short paper

Authorship
  1. Keli Du, Julius-Maximilians-Universität Würzburg (Julius Maximilian University of Würzburg)


Introduction

Latent Dirichlet Allocation (LDA) topic modeling (Blei, 2012) is a statistical method that discovers hidden themes and topics in a text corpus, and it has been widely applied in digital humanities over the past several years. My survey of applications of topic modeling has identified 53 studies in the books of abstracts of the annual international conference of the Alliance of Digital Humanities Organizations between 2011 and 2018. (Collecting abstracts from the whole last decade was initially planned; unfortunately, the abstracts of DH2009 could not be obtained due to a broken link, and no studies related to topic modeling were found in the abstracts of DH2010.) Topic modeling-related approaches are increasing in number (Figure 1), and most of them belong to historical studies (18 approaches) and literary studies (17 approaches). The method has also been used in religious studies, digital archaeology, and other fields.

Figure 1. Distribution of studies over time.
The Problem

In its standard use, LDA topic modeling serves to browse corpora through topics and data visualizations, but in practice it is more complex than simply training and visualizing topics. The results of topic modeling can be influenced by several factors, such as the LDA hyperparameters, the number of topics, the chunk length of documents, the number of iterations of model updating, and hyperparameter optimization (Schöch, 2017). As far as I know, a common understanding of how to handle these factors has yet to be established. Text chunking is one example: Jockers (2013) divided novels into chunks in order to capture transient themes that only appear at certain points of a novel. In contrast, Nichols et al. (2018) write: “Following common practice using LDA on texts, we did not chunk or split the texts in our corpus for analysis.” Stop word removal is another example: removing stop words that carry no content generally yields more coherent topics. Most approaches remove stop words before training the model, but the effectiveness of removing stop words only after the modeling process has also been discussed (Schofield et al., 2017).
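To make the two stop word strategies concrete, the following Python sketch contrasts removal before training with post-hoc filtering of the topic-word lists. The toy documents, the stop word list, and the helper functions are purely illustrative assumptions; they do not reproduce the procedure of any surveyed study or of Schofield et al. (2017).

    # Illustrative only: 'docs' stands for a list of tokenized documents,
    # 'stopwords' for a list of function words compiled for the corpus at hand.
    docs = [["the", "whale", "and", "the", "sea"], ["a", "ship", "in", "the", "storm"]]
    stopwords = {"the", "and", "a", "in"}

    # Variant 1: remove stop words before training (the common practice).
    docs_filtered = [[w for w in doc if w not in stopwords] for doc in docs]

    # Variant 2: train on the full vocabulary and drop stop words only from the
    # displayed topic-word lists afterwards (the post-hoc removal discussed by
    # Schofield et al., 2017). 'lda' and 'dictionary' stand for a trained gensim
    # LdaModel and its vocabulary; both are assumptions used for illustration.
    def top_words(lda, dictionary, topic_id, n=20):
        return [dictionary[word_id] for word_id, _ in lda.get_topic_terms(topic_id, topn=n)]

    def top_words_post_filtered(lda, dictionary, topic_id, stopwords, n=20):
        candidates = top_words(lda, dictionary, topic_id, n=3 * n)
        return [w for w in candidates if w not in stopwords][:n]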

In order to provide a comprehensive overview of how humanities scholars understand and use topic modeling, I have surveyed the 53 approaches mentioned above in detail. In this paper I therefore propose to examine these approaches with respect to the following aspects:

Preprocessing: which text processing procedures are applied before topic modeling? What kinds of preprocessing are used, and why?

Modeling: which parameters control the topic modeling process? How can they influence the results, and how are they used in different approaches?

Postprocessing: which methods are used for the interpretation of topics? How are the quality of a topic model and of the individual topics evaluated?

Preprocessing:
Common preprocessing procedures include lemmatization, part-of-speech (POS) tagging, and document chunking. By reducing words to their base forms, the topic model can concentrate more on the semantic structure of the texts. Through POS tagging, words that carry little content can be identified and removed from the corpus in order to obtain more coherent topics. Chunking makes it possible to capture topics that only appear at certain points of a text. My survey pays particular attention to the reasons given for applying (or not applying) a preprocessing procedure in practice. For example, lemmatization is often applied when the corpora are in highly inflected languages such as German or French. Document chunking is handled very diversely: the chunk size may be several hundred or several thousand words, a page of a book, or one tenth of a book when it is split into ten equal segments. However, almost no approach explained the reasons behind its chunking choices.
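To illustrate the chunking step, the following minimal sketch splits a tokenized text into fixed-length chunks; the chunk size of 1,000 tokens and the input file name are arbitrary assumptions rather than recommendations drawn from the surveyed studies.

    def chunk_text(tokens, chunk_size=1000):
        """Split a tokenized text into consecutive chunks of chunk_size tokens."""
        return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

    # Each chunk is then treated as one "document" by the topic model, so that
    # transient themes occurring only in parts of a long text are not averaged away.
    with open("novel.txt", encoding="utf-8") as f:  # hypothetical input file
        tokens = f.read().split()                   # naive whitespace tokenization
    chunks = chunk_text(tokens, chunk_size=1000)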

Modeling:
The LDA hyperparameters, the number of topics, the number of iterations of model updating, and hyperparameter optimization control the modeling process. In fact, no approach reported how the hyperparameters were set, only two approaches reported their number of iterations (Schöch, 2015; Maryl & Eder, 2017), and only two applied hyperparameter optimization (Goldstone, 2014; Falk, 2016). The other approaches focus more on the visualization and analysis of topic models. The chosen number of topics was reported in 23 approaches, but only 8 of them explained how this number was determined.
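For readers unfamiliar with these parameters, the sketch below shows where they appear in one common implementation, gensim's LdaModel (MALLET, McCallum 2002, exposes analogous options). The concrete values are placeholders chosen for illustration, not settings recommended by the survey, and 'chunks' refers to the hypothetical preprocessed chunks from the earlier sketch.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # 'chunks' is assumed to be a list of tokenized, preprocessed text chunks.
    dictionary = Dictionary(chunks)
    corpus = [dictionary.doc2bow(chunk) for chunk in chunks]

    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=50,     # number of topics
        iterations=1000,   # iterations used when inferring document-topic distributions
        passes=10,         # full passes over the corpus during training
        alpha="auto",      # optimize the document-topic prior during training
        eta="auto",        # optimize the topic-word prior during training
        random_state=42,   # fix the seed so that the run can be reproduced
    )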

Postprocessing:
After the modeling process, it is important to evaluate the topic model and the topics, yet only 5 approaches reported their evaluation methods. Although topic quality can be evaluated by measuring topic coherence (Röder et al., 2015) and topics can also be labeled automatically (Magatti et al., 2009), only one approach used topic coherence to evaluate the trained topics (Rhody, 2014). It is more common to label topics manually and to highlight correlations between interesting topics and metadata. More than half of all approaches used data visualization for exploration purposes.
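As one possible evaluation step, the sketch below computes C_v coherence scores with gensim's CoherenceModel, whose measures follow the framework of Röder et al. (2015); 'lda', 'chunks', and 'dictionary' refer to the hypothetical model and data from the previous sketch, not to any surveyed study.

    from gensim.models import CoherenceModel

    # Average coherence of the whole model and per-topic scores; topics with low
    # scores are candidates for closer manual inspection before interpretation.
    cm = CoherenceModel(model=lda, texts=chunks, dictionary=dictionary, coherence="c_v")
    print("model coherence:", cm.get_coherence())
    for topic_id, score in enumerate(cm.get_coherence_per_topic()):
        print(topic_id, round(score, 3))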

Conclusion
This survey intends to provide an overview of the common use of topic modeling in digital humanities. It shows that the DH community does not always report how topic modeling was applied: of the 53 approaches, around 74% did not report how their corpora were prepared, more than 70% did not report which tool was used to train their topic models, almost 57% did not report how many topics were trained, and about 90.5% did not report how their topic models were evaluated. Without such technical details, the scientific reproducibility and the stability of this research remain questionable.

In addition, the lack of interest in (or knowledge of) the complexity of topic modeling itself may indicate non-optimal applications of the method. The default parametrization of a topic modeling tool can certainly be used, but because texts in the humanities (for example, literary texts) are more complex, it is not always clear how far prior insights into topic modeling from computer science carry over. A good example of the resulting variation is the subgenre classification using topic modeling in Schöch (2017), where the difference in accuracy between the worst and the best models is 17%. To obtain a better understanding of topic modeling, a series of systematic investigations into the impact of these factors on topic modeling results will be my next step.

Bibliography

Armaselu, F. (2018). The Magnifying Glass and the Kaleidoscope. Analysing Scale in Digital History and Historiography. In: DH.
Bailey, S., & Rochester, E. (2016). Testing the Doctrine of Election: A Computational Approach to Karl Barth's Church Dogmatics. In: DH.
Bartz, T., Beißwenger, M., Pölitz, C., Radtke, N., & Storrer, A. (2014, July). Neue Möglichkeiten der Arbeit mit strukturierten Sprachressourcen in den Digital Humanities mithilfe von Data-Mining. In: DH.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
Blevins, C. (2011). Topic modeling historical sources: Analyzing the diary of Martha Ballard. In: DH.
Broadwell, P., & Tangherlini, T. R. (2015). ElfYelp: Geolocated topic models for pattern discovery in a large folklore corpus. In: DH.
Burton, M. (2013). Data Driven Documentation of Digital Humanities Discourse. In: DH.
Chao, A. S., Li, Q., & Liu, Z. (2018). Integrating Latent Dirichlet Allocation and Poisson Graphical Model: A Deep Dive into the Writings of Chen Duxiu, Co-Founder of the Chinese Communist Party. In: DH.
Clivaz, C., Clark, E. S., Dilley, P., Faull, K. M., McBride-Lindsey, R., & Phillips, P. (2017). Digital Religion-Digital Theology. In: DH.
Croxall, B. (2017). Digital humanities from scratch: A pedagogy-driven investigation of an in-copyright corpus. In: DH.
Du, K. (2017). Authorship of Dream of the Red Chamber: A Topic Modeling Approach. In: DH.
Eder, M., Winkowski, J., Wozniak, M., Górski, R. L., & Grzybowski, B. (2018). Text Mining Methods to Solve Organic Chemistry Problems, or Topic Modeling Applied to Chemical Molecules. In: DH.
Eisenstein, J., Sun, I., & Klein, L. F. (2014). Exploratory thematic analysis for historical newspaper archives. In: DH.
Falk, M. G. (2016). Faraway, So Close!: Reading Adeline Mowbray Closely Using Topic Modelling. In: DH.
Fankhauser, P., Knappen, J., & Teich, E. (2016). Topical diversification over time in the royal society corpus. In: DH.
Fischer, S., Knappen, J., & Teich, E. (2018). Using Topic Modelling to Explore Authors’ Research Fields in a Corpus of Historical Scientific English. In: DH.
Fitzgerald, J. D., & Cordell, R. (2018). Stranger Genres: Computationally Classifying Reprinted Nineteenth Century Newspaper Texts. In: DH.
Fyfe, P. (2013). Counting Words with Henry James: Towards a Quantitative Hermeneutics. In: DH.
Garcia, A. S. (2018). Interrogating the Roots of American Settler Colonialism: Experiments in Network Analysis and Text Mining. In: DH.
Goldstone, A. (2014). Let DH Be Sociological! [Short Paper]. In: DH.
Hammond, A., & Brooke, J. (2018). SciFiQ, "Twinkle, Twinkle": A Computational Approach to Creating "the Perfect Science Fiction Story". In: DH.
Hettinger, L., Jannidis, F., Reger, I., & Hotho, A. (2016). Significance Testing for the Classification of Literary Subgenres. In: DH.
Jähnichen, P., Oesterling, P., Liebmann, T., Heyer, G., Kuras, C., & Scheuermann, G. (2015). Exploratory search through interactive visualization of topic models. In: DH.
Jautze, K., van Cranenburgh, A., & Koolen, C. (2016). Topic modeling literary quality. In: DH.
Jennings, C., & Binder, J. M. (2013). Eighteenth- and Twenty-First-Century Genres of Topical Knowledge. In: DH.
Jockers, M. L. (2011). Detecting and characterizing national style in the 19th century novel. In: DH.
Jockers, M. L. (2012). Computing and visualizing the 19th-century literary genome. In: DH.
Jockers, M. L. (2013). Macroanalysis: Digital methods and literary history. University of Illinois Press.
Kaufman, M. (2015). Everything on paper will be used against me: Quantifying Kissinger. In: DH.
Liu, A., Champagne, A., Douglass, J., Kleinman, S., Russell, J., & Thomas, L. (2017). Open, Shareable, Reproducible Workflows for the Digital Humanities: The Case of the 4Humanities.org "WhatEvery1Says" Project. In: DH.
Magatti, D., Calegari, S., Ciucci, D., & Stella, F. (2009, November). Automatic labeling of topics. In: Ninth International Conference on Intelligent Systems Design and Applications (ISDA '09) (pp. 1227-1232). IEEE.
Maryl, M., & Eder, M. (2017). Topic Patterns in an Academic Literary Journal: The Case of “Teksty Drugie”. In: DH.
McCallum, A. K. (2002). MALLET: A Machine Learning for Language Toolkit.
Meneses, L., & Mallen, E. (2017). Semantic Domains in Picasso's Poetry. In: DH.
Meneses, L., Martin, J., Furuta, R., & Siemens, R. (2018). Part Deux: Exploring the Signs of Abandonment of Online Digital Humanities Projects. In: DH.
Miller, B., Berger, C., Bhattcharyya, S., Caselli, T., Kelman, D., Olive, J., & Rajiva, J. (2016). Contextualizing Receptions of World Literature by Mining Multilingual Wikipedias. In: DH.
Mimno, D., Broadwell, P. M., & Tangherlini, T. R. (2014). The Telltale Hat: LDA and Classification Problems in a Large Folklore Corpus. In: DH.
Montague, J., Simpson, J., Rockwell, G., Ruecker, S., & Brown, S. (2015). Exploring large datasets with topic model visualizations. In: DH.
Nelson, R. K., Mimno, D., & Brown, T. (2012). Topic Modeling the Past. In: DH.
Nichols, R., Slingerland, E., Nielbo, K., Bergeton, U., Logan, C., & Kleinman, S. (2018). Modeling the Contested Relationship between Analects, Mencius, and Xunzi: Preliminary Evidence from a Machine-Learning Approach. The Journal of Asian Studies, 77(1), 19-57. doi:10.1017/S0021911817000973
Norén, F., & Mähler, R. (2017). Text Mining the History of Information Politics Through Thousands of Swedish Governmental Official Reports. In: DH.
Olesen, C. G., & Kisjes, I. (2018). OCR'ing and classifying Jean Desmet's business archive: methodological implications and new directions for media historical research. In: DH.
Organisciak, P., & Franklin, S. (2017). Modeling Creativity: Tracking Long-term Lexical Change. In: DH.
Organisciak, P., Auvil, L., & Downie, J. S. (2015). Remembering books: A within-book topic mapping technique. In: DH.
O'Sullivan, J., Shade, M., & Rowles, B. (2016). Player-Driven Content: Analysing Textual Communications in Online Roleplay. In: DH.
Quamen, H., & Hjartarson, P. (2014). Big Data and the Literary Archive: Topic Modeling the Watson-McLuhan Correspondence. In: DH.
Rhody, L. (2014). The Story of Stopwords: Topic Modeling an Ekphrastic Tradition. In: DH.
Riddell, A. B. (2011). Toward a Demography of Literary Forms: Building on Moretti's Graphs. In: DH.
Roe, G., Gladstone, C., & Morrissey, R. (2014). Discourses and disciplines in the Enlightenment: Topic modeling the French Encyclopédie. In: DH.
Röder, M., Both, A., & Hinneburg, A. (2015, February). Exploring the space of topic coherence measures. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (pp. 399-408). ACM.
Savonick, D., & Tagliaferri, L. (2018). The Purpose of Education: A Large-Scale Text Analysis of University Mission Statements. In: DH.
Schofield, A., Magnusson, M., & Mimno, D. (2017). Pulling out the stops: Rethinking stopword removal for topic models. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers (pp. 432-436).
Schöch, C. (2015). Topic Modeling French Crime Fiction. In: DH.
Schöch, C. (2017). Topic Modeling Genre: An Exploration of French Classical and Enlightenment Drama. Digital Humanities Quarterly, 11(2), §1-53. http://www.digitalhumanities.org/dhq/vol/11/2/000291/000291.html
Schreibman, S., Kerr, S., & McGarry, S. (2017). The Third Way: Discovery Beyond Search and Browse in Letters of 1916. In: DH.
Shaw, R. (2012). Contours of the Past: Computationally Exploring Civil Rights Histories. In: DH.
Shawn, G. (2013). Topic Modeling Time and Space: Archaeological Datasets as Discourses. In: DH.
Smeets, J., Scholtes, J. C., Rasterhoff, C., & Schavemaker, M. (2016). SMTP: Stedelijk Museum Text Mining Project. In: DH.
Wevers, M., Smits, T., & Impett, L. (2018). Modeling the Genealogy of Imagetexts: Studying Images and Texts in Conjunction using Computational Methods. In: DH.
Wolff, M. (2013). Surveying a corpus with alignment visualization and topic modeling. In: DH.
Yamada, T., & Inoue, S. (2015). Detection of People Relationship Using Topic Model from Diaries in Medieval Period of Japan. In: DH.

