Exploring Audiovisual Corpora in the Humanities: Methods, Infrastructure, and Software

panel / roundtable
Authorship
  1. Lauren Tilton

    University of Richmond

  2. Taylor Arnold

    University of Richmond

  3. Giles Bergel

    Oxford University - University of the West of England

  4. Jasmijn Van Gorp

    Utrecht University

  5. Julia Noordegraaf

    University of Amsterdam

  6. Liliana Melgar

    University of Amsterdam

  7. Mark Williams

    Dartmouth College

  8. John Bell

    Dartmouth College

  9. Roeland Ordelman

    Netherlands Institute for Sound and Vision

  10. Thomas Poell

    University of Amsterdam

Work text


Image analysis is increasingly central to the field. DH scholars are a critical part of interdisciplinary teams developing infrastructure for analyzing audio-visual (AV) culture at large scale. The first paper, "Computer Vision Software for AV Search and Discovery," describes novel approaches to increasing the discoverability of AV content in cultural heritage institutions, focusing on four search modalities (instance; category; identity via facial recognition; and text); how computer vision can complement and augment traditional metadata-based content management; and a new multimodal approach to search in which correspondences between the audio and visual content of videos are learnt from unlabelled data. “Analyzing Moving Images at Scale with the Distant Viewing Toolkit (DVT)” then turns to how digital humanists can use DVT and why we should be involved in creating image analysis infrastructure. The paper will begin with an overview of DVT, a software library that summarizes media objects through the automated detection of stylistic and content-driven metadata. The paper will then turn to how and why humanities scholars need to be a part of questioning and then retraining algorithms so that they address our scholarly concerns and commitments.
Training computer vision algorithms on historic images requires new training sets and metadata. The third paper, “The Media Ecology Project’s Semantic Annotation Tool (SAT): Collaborative Synergies to Train Computer Vision Analysis,” will discuss SAT, which enhances the precision of existing annotation methodologies by adding geometric targets within the frame and also provides the infrastructure to instantiate an iterative process of computer vision algorithm refinement. The paper will detail an innovative multi-partner project to collate, annotate, and mobilize significant curated metadata related to WWI footage at the U.S. National Archives for algorithmic research.
"From 'user' to 'co-developer': Strategies for a User-centered Approach to Building Media Analysis Infrastructure" will then discuss re-centering users in the development and implementation of infrastructure. The research communities expected to use these platforms must be collaborators from the beginning so that their needs are incorporated, else broad and generalized infrastructure will serve few needs well. They ground their argument in a case study, which discusses a year-long pilot program to engage scholars in the development of Media Suite; a media studies toolkit for audiovisual research that is a part of the Dutch national research infrastructure CLARIAH.
In line with ADHO’s commitment to amplifying diverse voices, the panel is intentionally designed to represent different institutional, national, and funding structures. There are two U.S.-based teams: one is from an elite, private Carnegie Research 2 university in the Northeast and the other is from a small, private liberal arts college in the South. The other two papers reflect work from public research-intensive universities in Europe. Funding includes government grants, commercial partnerships, and private philanthropy. Participants represent a range of fields including American Studies, Media Studies, and Statistics. The panel also includes three women presenters, a demographic regularly underrepresented in technical research, particularly in computer vision. The panel is therefore designed to bring different perspectives to bear on how to approach audiovisual DH infrastructure.
Computer Vision Software for AV Search and Discovery
This paper will describe current work in making audiovisual content searchable by using computer vision to identify features in images. The talk will first describe ongoing collaborations with cultural heritage sector partners that implement the current state of the art in computer vision software and, second, outline new research that improves the characterisation of AV content. The focus will be on VGG’s collaborations with the BBC and the British Film Institute. The intention behind these partnerships is twofold: first, to assist maintainers of large audiovisual datasets in benefiting from proven computer vision techniques; and second, to provide VGG with opportunities to refine and advance the state of the art in the field by working on tasks such as facial recognition across large datasets.
As the BBC resource is publicly accessible, it will provide the main case study for the first part of the paper. The BBC resource implements four search modalities (instance; category; identity via facial recognition; and text). It contains over ten thousand hours of broadcast news and factual programming, recorded over five years from several British TV channels and amounting to over 5 million keyframes. The paper will demonstrate each search mode; describe the methods upon which each is based; and note the performance of retrieval in relation to the state of the art. It will also give an account of how the resource is indexed and maintained; how the collaboration is structured; and how the methods might be integrated into either public-facing discovery tools or in-house AV management systems.
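To make the instance and category modalities concrete, the sketch below shows the general pattern behind embedding-based keyframe search: keyframes are embedded with a pretrained CNN and ranked by cosine similarity against a query image. This is an illustration of the general approach, not the VGG system itself; the model choice, file names, and similarity measure are assumptions for the example.

    # Illustrative embedding-based keyframe search (not the VGG pipeline).
    import numpy as np
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    model.fc = torch.nn.Identity()  # keep the 2048-d pooled features
    model.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def embed(path):
        """Return an L2-normalised feature vector for one keyframe."""
        with torch.no_grad():
            x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            v = model(x).squeeze(0).numpy()
        return v / np.linalg.norm(v)

    keyframe_paths = ["kf_000001.jpg", "kf_000002.jpg"]   # placeholder files
    index = np.stack([embed(p) for p in keyframe_paths])  # shape (N, 2048)
    scores = index @ embed("query.jpg")                   # cosine similarity
    top_hits = np.argsort(-scores)[:10]                   # best-matching keyframes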
Turning to the partnership with the British Film Institute, the paper will outline how computer vision can complement and augment traditional metadata-based content management. The BFI dataset comprises a digital video archive of films and television programmes described through custom metadata. While visual recognition and metadata-based search each have their own strengths, the paper will outline their strength in combination – metadata providing training data and ground truth, while visual recognition provides a means of identifying inconsistencies or errors in metadata. The paper will discuss the practicalities of integrating the two modes in a collection-management environment, as well as the importance of a development ecosystem based on open source software, documented standards, and a support community. It will propose that the digital humanities community could take a leading position in building such an ecosystem for computer vision. Last, the paper will outline a new multimodal approach to search, in which correspondences between the audio and visual content of videos are learnt from unlabelled data, with the result of localising sound production and improving visual categorisation. The paper will explore some of the potential uses of identifying these correspondences in digital humanities research projects.
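As one hedged illustration of metadata and visual recognition working in combination, as described above, the sketch below flags catalogue records whose recorded label disagrees with a classifier's confident prediction and queues them for human review. The field names and threshold are placeholders, not the BFI's actual schema.

    # Sketch: use visual recognition to audit catalogue metadata.
    # 'records' and 'classify' are assumed inputs, not a real BFI API.
    def audit_records(records, classify):
        """records: iterable of dicts with 'id', 'keyframe', 'genre' fields.
        classify: function returning (predicted_label, confidence)."""
        flagged = []
        for rec in records:
            pred, conf = classify(rec["keyframe"])
            # A confident disagreement suggests a metadata error worth reviewing.
            if conf > 0.9 and pred != rec["genre"]:
                flagged.append({"id": rec["id"],
                                "catalogued": rec["genre"],
                                "predicted": pred,
                                "confidence": conf})
        return flagged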
Analyzing Moving Images at Scale with the Distant Viewing Toolkit (DVT)
The paper will address how digital humanists can use the Distant Viewing Toolkit (DVT) and why we should be involved in creating image analysis tools. The paper will begin with an overview of DVT—a software library that addresses the challenges of working with moving images by summarizing media objects through the automated detection of stylistic and content-driven metadata. It algorithmically approximates the way humans process moving images by identifying and tracking objects, people, sound, and dialogue. The DVT software library allows users to input raw media files in a variety of formats. The input files are then analyzed to detect the following features: (1) the dominant colors and lighting over each shot; (2) time codes for shot and scene breaks; (3) bounding boxes for faces and other common objects; (4) consistent identifiers and descriptors for scenes, faces, and objects over time; (5) time codes and descriptions of diegetic and non-diegetic sound; and (6) a transcript of the spoken dialogue. These features serve as building blocks for the analysis of moving images in the same way words are the foundation for text analysis. From these extracted elements, higher-level features such as camera movement, framing, blocking, and narrative style can be derived and analyzed. After an overview of the features, we will then discuss the output formats and how non-technical users explore and visualize the extracted information. This discussion will center on a specific application to the research question of how to organize and search a large collection of local television news programs.
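Two of the building blocks listed above can be sketched briefly: shot-break detection via frame-to-frame histogram distance, and dominant colors via k-means clustering over pixels. DVT's own detectors are more sophisticated, and this is not the toolkit's API; the threshold and cluster count below are illustrative choices.

    # Minimal sketches of two feature extractors of the kind DVT provides.
    import cv2
    import numpy as np

    def shot_breaks(video_path, threshold=0.5):
        """Yield frame indices where the color histogram changes sharply."""
        cap = cv2.VideoCapture(video_path)
        prev_hist, i = None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            hist = cv2.calcHist([frame], [0, 1, 2], None,
                                [8, 8, 8], [0, 256] * 3)
            hist = cv2.normalize(hist, hist).flatten()
            if prev_hist is not None:
                # Bhattacharyya distance: near 0 = similar, near 1 = likely cut.
                d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
                if d > threshold:
                    yield i
            prev_hist, i = hist, i + 1
        cap.release()

    def dominant_colors(frame, k=3):
        """Cluster pixels with k-means; return the k cluster centers (BGR)."""
        pixels = frame.reshape(-1, 3).astype(np.float32)
        criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
        _, _, centers = cv2.kmeans(pixels, k, None, criteria, 3,
                                   cv2.KMEANS_RANDOM_CENTERS)
        return centers.astype(int)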
While explaining the usage of the toolkit in addressing humanities questions, we also address how and why humanists should be involved in creating image analysis methods. We will discuss how and why humanities scholars need to be a part of retraining algorithms in order to build computer vision methods that address our concerns and intellectual commitments. Specifically, we will discuss how we are applying transfer learning to adapt open-source computer vision algorithms so that they function better on moving images from across the 20th century. The toolkit utilizes and adjusts the architecture of three open source programming libraries: dlib (King 2009), ffmpeg (Tomar 2006), and TensorFlow (Abadi et al. 2016). Within these frameworks, novel computer vision and sound processing algorithms extract the required features. The project draws from VGGFace2 for face detection (Cao et al., 2018); YOLOv3 for object detection (Redmon and Farhadi, 2018); and AudioSet/VGGish for labeling sounds (Gemmeke, et al., 2017; Hershey, et al., 2017). Our work in building DVT consists of modifying and stitching together these six models and libraries for our specific humanities-centric needs. Digital humanists are well positioned to bring our critical lens to computer vision in order to facilitate archival discovery and to support scholarship in visual culture and media studies.
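The transfer-learning recipe described here can be sketched as follows: a pretrained backbone is frozen and only a small classification head is retrained on frames from the historical material. The dataset path and class count are placeholders; this is a minimal sketch, not the project's actual training code.

    # Illustrative transfer learning: freeze a pretrained backbone,
    # retrain a small head on period-specific frames.
    import tensorflow as tf

    base = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet",
        input_shape=(224, 224, 3), pooling="avg")
    base.trainable = False  # keep the generic visual features fixed

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation="softmax"),  # e.g. 10 hypothetical shot types
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Frames exported from archival footage, organised one folder per class.
    train = tf.keras.utils.image_dataset_from_directory(
        "frames/train", image_size=(224, 224), batch_size=32)
    model.fit(train, epochs=5)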
Thanks to the support of a 2018 United States National Endowment for the Humanities Office of Digital Humanities Advancement Grant, a fully working development version of DVT has been made available on GitHub. The Version 1.0 release is scheduled for Fall 2019.
The Media Ecology Project’s Semantic Annotation Tool (SAT): Collaborative Synergies to Train Computer Vision Analysis
This paper will introduce new work in the development of The Semantic Annotation Tool (SAT), an NEH-funded component of The Media Ecology Project (MEP) that affords the creation of precise time-based annotations of moving images.
MEP is working to realize a virtuous cycle of archival access and preservation, enabled by technological advances and new practical applications of digital tools and platforms. In a fundamental sense this is a sustainability project that treats moving image history as public memory. By fostering innovations in both granular close-textual analysis (traditional Arts and Humanities) and computational distant reading (computer vision and machine reading), MEP serves as a collaborative incubator for developing 21st-century research methods.
We will report on our innovative multi-partner collaboration to collate, annotate, and mobilize significant curated metadata related to vast collections of World War I footage at The National Archives and Records Administration (NARA) in the U.S. for algorithmic research. The key collection for this presentation is footage produced by the U.S. Signal Corps, which represents the first major motion picture endeavors by the U.S. military (1914-1936).
Within MEP's emerging research environment, Mediathread has allowed the documentation of written descriptions and tagging through full-frame time-based annotations. Deploying the Semantic Annotation Tool (SAT) enhances the precision of our already granular annotation methodology by adding geometric targets within the frame and real-time playback of annotations with sub-second resolution.
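A geometric, time-based annotation of this kind can be pictured roughly as follows, in the spirit of the W3C Web Annotation model with a media fragment selector (t= for seconds, xywh= for the in-frame region). SAT's exact serialisation may differ, and all values here are invented.

    # Sketch of a time-based annotation with a geometric target.
    annotation = {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": {"type": "TextualBody",
                 "value": "Soldiers crossing a pontoon bridge"},
        "target": {
            "source": "https://example.org/films/signal-corps-0042.mp4",
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                # 12.4s-17.9s; region 120px right, 80px down, 320x240 in size
                "value": "t=12.4,17.9&xywh=120,80,320,240",
            },
        },
    }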
SAT also provides the backend infrastructure to model and instantiate an ongoing, iterative process of computer vision algorithm refinement that features 1) manual time-based annotations that help to train deep learning algorithms; 2) the production of many additional time-based annotations via these algorithms; 3) the evaluation and refinement of the new annotations by means of further manual annotation work within the SAT environment; and 4) the application of the resulting refined annotation dataset as the new training set. This paper will detail a first-draft test case of such an SAT workflow.
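The four-step cycle can be summarized schematically; in the sketch below only the control flow is meant literally, and the train, propose, and review hooks are hypothetical stand-ins for the corresponding model and SAT components.

    # Schematic version of the four-step refinement cycle described above.
    def refinement_cycle(films, seed_annotations, train, propose, review, rounds=3):
        """train: fit a model on annotations; propose: generate new annotations
        for the films; review: manual correction/confirmation inside SAT."""
        annotations = list(seed_annotations)
        for _ in range(rounds):
            model = train(annotations)        # 1) learn from manual annotations
            proposed = propose(model, films)  # 2) machine-generate new annotations
            annotations += review(proposed)   # 3) correct manually, 4) fold back in
        return annotations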
The collection of Signal Corps WWI footage has been selectively curated into a test dataset of 300 films. Within this dataset we have developed, via manual annotation, a training set of annotations from 50 films for use in training the Distant Viewing Toolkit, which is being developed by a team at another U.S. university. The training metadata is derived by translating into SAT two unique collections of existing metadata: 1) the shot log of archival footage utilized in a recent three-part PBS American Experience program about the history of WWI (the shot log generously provided to MEP by program archival producer Lizzy McGlynn), and 2) selected notecards of precise shot-specific metadata produced as part of the documentation by The Signal Corps itself (materials found within the NARA Signal Corps collections).
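The translation step can be pictured with a small hedged sketch in which rows of an external shot log become time-based annotations. The column names are invented for the example; the actual PBS and Signal Corps metadata are structured differently.

    # Sketch: turn shot-log rows into time-based annotations.
    import csv

    def shotlog_to_annotations(csv_path, film_uri):
        """csv_path: shot log with assumed 'start', 'end', 'description' columns."""
        annotations = []
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                annotations.append({
                    "target": {
                        "source": film_uri,
                        "selector": {"type": "FragmentSelector",
                                     "value": f"t={row['start']},{row['end']}"},
                    },
                    "body": {"type": "TextualBody",
                             "value": row["description"]},
                })
        return annotations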
The paper will demonstrate how SAT affords a unique and unprecedented collaboration between digital humanities scholars and scientists, professional (documentary) filmmakers, and cultural memory archives in the advancement of DH audio-visual analysis toward enhanced video search and multimedia information retrieval.
From “user” to “co-developer”: Strategies for a User-centered Approach to Building a Media Analysis Infrastructure
Digital humanists experiment with information technologies and develop tools, either drawing on scholars’ own skills or working in close collaboration with data and system experts. Toolmaking has taken place within the digital humanities for over seventy years (Bradley, 2018). Thus, the design of systems and the understanding of their underlying processes are inherently at the core of scholarly investigation (van Zundert, 2017). According to Patrik Svensson (2010), this engagement assumes different forms, including using information technology as a tool, as an object of study, as an “exploratory laboratory”, as an expressive medium, and as an “activist venue”. Being a digital scholar therefore implies not only being a “user” of a “tool” or an evaluator of its performance, but also being at the center of a tool’s design from the beginning. This paper describes, by means of a case study, one way in which scholars can directly participate in the construction of digital humanities infrastructure. We will focus on the Media Suite, a media studies toolkit for audiovisual research that is part of the Dutch national research infrastructure CLARIAH.
In certain contexts, there has been a move to create overarching infrastructure for the sciences, humanities, and social sciences at the national and international level. The aim is to coordinate efforts in data sharing, interoperability, access, and tool provision according to standards. Thus, “personal”, bottom-up engagement with tool building has given way to a more supervised, top-down mode of tool and data “provision”. At the same time, these generalized infrastructures are expected to be usable in specific, collaborative research settings. The question for these infrastructures, “if we build it, will they come?” (van Zundert, 2012), becomes the fundamental paradox of how to attract scholars with specific research questions to tools and data that are generic and aimed at serving broader groups. Consequently, to avoid the risk of broad digital humanities infrastructures creating a separation between scholars (“users”) and infrastructure developers (“designers”), they need to devise ways to more directly involve the communities that they aim to serve.
In order to do this, CLARIAH set up a pilot scholar program in which scholars were invited for one year to offer feedback and help develop the infrastructure. This paper describes the strategies we used for “co-developing” the CLARIAH Media Suite with the six pilot research projects involved, including design sessions, a living “demonstration scenario” document, a chat room connected to the development issues in the GitHub repository, a summer school, and participation in a series of “demonstration” meetings organized by the overall CLARIAH pilot project program. In addition, this paper presents the results of the evaluation of the six research pilot projects. Since digital humanities projects lack common evaluation frameworks (Duşa et al., 2014), we evaluated the pilot projects using a questionnaire and interviews based on these topic categories: research outcomes, Media Suite improvements for research support, the scholars’ new skills, and dissemination outcomes (increasing their network and improving collaboration).

Bibliography
Arandjelović, R., Chatfield, K., Parkhi, O. M., Coto, E., Zisserman, A., & Vedaldi, A. (2018). Visual search of BBC News. VGG website.
Arandjelovic, R., & Zisserman, A. (2017). Look, listen and learn. In 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 609-617). IEEE.
Cao, Q., Shen, L., Xie, W., Parkhi, O. M., & Zisserman, A. (2018). VGGFace2: A dataset for recognising faces across pose and age. In Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on (pp. 67-74). IEEE.
Coto, E. & Zissermann, A. (2018). “VGG Image Classification (VIC) Engine", 2017.
Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Synthetic data for text localisation in natural images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). TensorFlow: A system for large-scale machine learning. In OSDI (Vol. 16, pp. 265-283).
Gemmeke, J. F., et al., (2017). Audio set: An ontology and human-labeled dataset for audio events. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (pp. 776-780). IEEE.
Hershey, S., et al., (2017). CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (pp. 131-135). IEEE.
King, D. E. (2009). Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10, 1755-1758.
Redmon, J., & Farhadi, A. (2018). YOLO-V3: An incremental improvement. arXiv preprint arXiv:1804.02767.
Tomar, S. (2006). Converting video formats with FFmpeg. Linux Journal, 2006(146), 10.
Arnold, T., Leonard, P., & Tilton, L. (2017). Knowledge creation through recommender systems. Digital Scholarship in the Humanities, 32, ii151-ii157.
Awad, G., Fiscus, J., Michel, M., Joy, D., Kraaij, W., Smeaton, A. F., & Ordelman, R. (2016). TRECVID 2016: Evaluating video search, video event detection, localization and hyperlinking.
Hamilton, K., Karahalios, K., Sandvig, C., & Eslami, M. (2014, April). A path to understanding the effects of algorithm awareness. In CHI'14 Extended Abstracts on Human Factors in Computing Systems (pp. 631-642). ACM.
Bradley, J. (2018). Digital tools in the humanities: Some fundamental provocations? Digital Scholarship in the Humanities.
Duşa, A., Nelle, D., Stock, G., & Wagner, G. G. (Eds.). (2014). Facing the Future: European Research Infrastructures for the Humanities and Social Sciences. Berlin: Scivero Verlag.
Svensson, P. (2010). The Landscape of Digital Humanities. Digital Humanities Quarterly, 4(1).
van Zundert, J. (2012). If You Build It, Will We Come? Large Scale Digital Infrastructures as a Dead End for Digital Humanities. Historical Social Research / Historische Sozialforschung, 37(3 (141)), 165–186.
van Zundert, J., & Haentjens Dekker, R. (2017). Code, scholarship, and criticism: When is code scholarship and when is it not? Digital Scholarship in the Humanities, 32, i121–i133.

