Multimodal AI support of source criticism in the humanities – work in progress

paper, specified "short paper"
Authorship
  1. Sander Münster

     FSU Jena, Germany

  2. Jonas Bruschke

     Julius-Maximilians-Universität Würzburg (Julius Maximilian University of Würzburg)

  3. Stephan Hoppe

     LMU München, Germany

  4. Ferdinand Maiwald

     FSU Jena, Germany

  5. Florian Niebling

     Julius-Maximilians-Universität Würzburg (Julius Maximilian University of Würzburg)

  6. Aaron Pattee

     LMU München, Germany

  7. Ronja Utescher

     University of Bielefeld

  8. Sina Zarrieß

     University of Bielefeld

Work text


Introduction

The use of images, texts and objects is an essential foundation of history studies. This project, funded by the German Federal Ministry of Education and Research (BMBF), seeks to establish an AI-based approach towards modelling image sources and their multimodal contexts as a new technique for researchers in architectural history studies. Related questions are: How do architectural historians discover and evaluate sources? How can AI best be of service to this end?

State of the Art

The point of departure for this project is the use of sources and source criticism in history studies. This is usually guided by a constructivist, problem-oriented approach featuring a critical analysis of the topics and methodologies in question (Reich, 2006), and it relies heavily on experience and tacit knowledge (Polanyi, 1966).

Language & Vision: Deep Learning (DL) methods have proven ideal for transfer learning at the intersection of image and language processing. For example, semantic representations such as word or sentence embeddings, which the computer learns from texts, are enriched by multimodal data such as image descriptions paired with actual visual representations (Hessel et al., 2019). However, for the extraction of multimodal information from scientific texts, it is still necessary to refine the referential connections between text and image components (Utescher and Zarrieß, 2021).
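
The abstract does not prescribe a specific model; purely as an illustration of how a shared image-text embedding space can be queried, the following minimal sketch scores candidate captions against a photograph with a publicly available CLIP model via the Hugging Face transformers library (the model choice, library, and file name are assumptions for illustration, not part of the project).

```python
# Minimal sketch: scoring candidate captions against a historical photograph
# with a pretrained vision-language model (CLIP). Model, library, and input
# file are assumptions for illustration only.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("kronentor_photo.jpg")  # hypothetical input image
captions = [
    "the Kronentor of the Zwinger in Dresden",
    "a Gothic cathedral facade",
    "a baroque gate crowned by a dome",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher scores indicate a closer image-text match in the shared embedding space.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze()
for caption, score in zip(captions, scores.tolist()):
    print(f"{score:.3f}  {caption}")
```

In practice, such similarity scores would only be one signal alongside the refined referential links between text and image components discussed above.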

Segmentation and Object Recognition: Photogrammetric processes deliver spatial relations between photographs and 3D geometries. The datasets developed in this way allow for the automatic segmentation (Martinovic et al., 2015, Hackel et al., 2016) of simple structures (Vosselman et al., 2004) as well as of complex objects such as buildings (Li et al., 2016, Agarwal et al., 2011).
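
As a minimal illustration of recognising simple structures in a laser-scanned or photogrammetric point cloud, the following sketch fits a dominant plane (e.g. one facade face) with RANSAC; the use of the Open3D library, the thresholds, and the file names are assumptions for illustration, not the methods of the cited works.

```python
# Minimal sketch: detecting a dominant planar structure (e.g. a wall face)
# in a photogrammetric point cloud via RANSAC plane fitting.
# Library (Open3D) and input file are assumptions for illustration only.
import open3d as o3d

pcd = o3d.io.read_point_cloud("zwinger_facade.ply")  # hypothetical point cloud

# Fit a plane to the densest planar region; inliers approximate one facade element.
plane_model, inlier_idx = pcd.segment_plane(
    distance_threshold=0.02,  # 2 cm tolerance, assuming metric units
    ransac_n=3,
    num_iterations=1000,
)
a, b, c, d = plane_model
print(f"Plane: {a:.2f}x + {b:.2f}y + {c:.2f}z + {d:.2f} = 0, "
      f"{len(inlier_idx)} supporting points")

facade = pcd.select_by_index(inlier_idx)
remainder = pcd.select_by_index(inlier_idx, invert=True)
o3d.io.write_point_cloud("facade_segment.ply", facade)
```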

Machine Learning (ML) is playing an ever-increasing role in image segmentation and object recognition (Minaee et al., 2021, Jiao et al., 2019).

Research Outline
The following will provide a brief overview of the first steps in the research.

Identifying Research Scenarios
A series of generic scenarios was identified with the assistance of expert consultation and workshops during the preliminary investigations (Kröber, 2021, Dewitz et al., 2019), and subsequently ordered by relevance and priority. Of the 20 described scenarios, the cross-media identification of object descriptions (“Which images, texts, and 3D data describe the same object?”) and the analysis of such descriptions (“How can the dating of historical image and text depictions be supported by multimodal validation using media whose dating has already been established?”) were chosen as the focal points of the research.

Cross-Media Classification

Figure 1: Identified architectural elements, using the Kronentor of the Zwinger in Dresden as an example, in the photograph (left), in text (middle), and in the 3D model (right).

A key requirement to this end is to identify and name such cross-media elements (Fig. 1). The framework for describing architectural elements in our project is provided by the Getty Art & Architecture Thesaurus (AAT), of which the subgroup “architectural elements” (http://vocab.getty.edu/aat/300000885, accessed 15.07.2021) is being used. The identified elements from texts (single words or word groups), images (polygonal image details), and 3D models (individual subgroup objects) are assigned to the corresponding AAT concept. Different processes are necessary depending on the source type, e.g. semantic segmentation, Named Entity Recognition (NER), and discourse parsing, in addition to processes for concept identification and semantic accumulation.
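
One possible way to make such a cross-media assignment concrete is a record that links text spans, polygonal image regions, and 3D sub-objects to a shared AAT concept URI. The following sketch is illustrative only; the field names and structure are our assumptions and not the project's actual data model.

```python
# Illustrative sketch of linking elements identified in different media to one
# Getty AAT concept; field names are assumptions, not the project's data model.
from dataclasses import dataclass, field


@dataclass
class TextSpan:
    document_id: str
    start: int          # character offsets in the source text
    end: int


@dataclass
class ImageRegion:
    image_id: str
    polygon: list       # [(x, y), ...] pixel coordinates of the region outline


@dataclass
class ModelPart:
    model_id: str
    node_path: str      # path of the subgroup object inside the 3D scene graph


@dataclass
class CrossMediaElement:
    aat_uri: str        # AAT concept the element is assigned to
    label: str
    text_spans: list = field(default_factory=list)
    image_regions: list = field(default_factory=list)
    model_parts: list = field(default_factory=list)


element = CrossMediaElement(
    aat_uri="http://vocab.getty.edu/aat/300000885",  # "architectural elements" subgroup
    label="architectural elements",
)
element.image_regions.append(
    ImageRegion("kronentor_photo", [(120, 80), (410, 80), (410, 560), (120, 560)])
)
element.text_spans.append(TextSpan("zwinger_description", start=1024, end=1041))
element.model_parts.append(ModelPart("zwinger_3d", node_path="Zwinger/Kronentor/Portal"))
```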

Multimodal Data Accumulation
In a further step, various approaches are used for the accumulation and validation of multimodal data. Within the 3D realm, for example, 2D images can be spatially related to the 3D model in order to transfer their content to the structure provided by the 3D model (Niebling et al., 2018).
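
As an illustration of such a transfer, the following sketch projects 3D model points into a spatially oriented photograph and lets the points that fall inside an annotated image region inherit that annotation; the camera parameters, the simplification of the region to a bounding box, and all concrete values are assumptions.

```python
# Minimal sketch: transferring a 2D image annotation to 3D points, assuming the
# camera pose (R, t) and intrinsics K were recovered photogrammetrically.
# All concrete values are placeholders for illustration.
import numpy as np

K = np.array([[2400.0, 0.0, 960.0],   # focal length / principal point (pixels)
              [0.0, 2400.0, 640.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                          # camera rotation (world -> camera)
t = np.zeros(3)                        # camera translation

points_3d = np.random.rand(10000, 3) * 10.0   # stand-in for model vertices

# Project the 3D points into the historical photograph.
cam = (R @ points_3d.T).T + t
uv = (K @ cam.T).T
uv = uv[:, :2] / uv[:, 2:3]

# 2D annotation, here simplified to the bounding box of an annotated polygon.
u_min, v_min, u_max, v_max = 400, 200, 1500, 1100
in_front = cam[:, 2] > 0
inside = (
    in_front
    & (uv[:, 0] >= u_min) & (uv[:, 0] <= u_max)
    & (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max)
)
annotated_points = points_3d[inside]
print(f"{inside.sum()} of {len(points_3d)} points inherit the image annotation")
```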

Automated Classification
A current step is to investigate approaches towards automating the identification and annotation of objects. For this purpose, AI-based models specialized in the respective modalities (3D models, images, and texts) will be used. Based on the pipeline described in (Wu et al., 2021), we are currently testing whether quality can be enhanced by improved text identification as well as by modular object retrieval for identifying architectural structures in images (Münster et al., in print), and by transferring this segmentation to 3D models.
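
In the spirit of caption-based weak supervision as used in the pipeline of (Wu et al., 2021), a simplified sketch of the text-identification step could match architectural-element terms against image captions to obtain candidate labels; the abbreviated term list, the captions, and the matching rule below are illustrative assumptions.

```python
# Illustrative sketch: deriving weak architectural-element labels for images by
# matching a hypothetical, abbreviated list of element terms against captions.
import re

# Small stand-in vocabulary; in the project this would come from the AAT subgroup.
AAT_TERMS = {
    "portal": "portals",
    "dome": "domes",
    "column": "columns",
    "gable": "gables",
}

captions = {
    "img_001": "View of the Kronentor with its crowned dome and flanking columns.",
    "img_002": "Detail of the gable above the northern portal.",
}

def weak_labels(caption: str) -> set:
    """Return the set of element labels whose term occurs in the caption."""
    found = set()
    for term, label in AAT_TERMS.items():
        if re.search(rf"\b{term}s?\b", caption, flags=re.IGNORECASE):
            found.add(label)
    return found

for image_id, caption in captions.items():
    print(image_id, sorted(weak_labels(caption)))
```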

Next Steps

Based upon the developed demonstrator, the next steps will be to cross-validate and multimodally enrich content and to test those results with historians within the research scenarios identified in step 1. Within this area, the discrepancy between the large amounts of data required by AI models and the complexity of historical expertise can be investigated, and it can be evaluated how existing AI models can be employed within the field of architectural history research and source criticism.

Bibliography

AGARWAL, S., FURUKAWA, Y., SNAVELY, N., SIMON, I., CURLESS, B., SEITZ, S. M. & SZELISKI, R. 2011. Building Rome in a day. Communications of the ACM, 54, 105.

DEWITZ, L., KRÖBER, C., MESSEMER, H., MAIWALD, F., MÜNSTER, S., BRUSCHKE, J. & NIEBLING, F. 2019. Historical Photos and Visualizations: Potential for Research. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLII-2/W15, 405–412.

HACKEL, T., WEGNER, J. D. & SCHINDLER, K. 2016. Fast semantic segmentation of 3D point clouds with strongly varying density. ISPRS Annals, 3, 177–184.

HESSEL, J., LEE, L. & MIMNO, D. 2019. Unsupervised Discovery of Multimodal Links in Multi-image, Multi-sentence Documents. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

JIAO, L., ZHANG, F., LIU, F., YANG, S., LI, L., FENG, Z. & QU, R. 2019. A Survey of Deep Learning-Based Object Detection. IEEE Access, 7, 128837-128868.

KRÖBER, C. 2021. German Art History Students’ use of Digital Repositories: an Insight. Papers Proceedings, Diversity, Divergence, Dialogue. Cham: Springer LNCS.

LI, M., NAN, L., SMITH, N. & WONKA, P. 2016. Reconstructing building mass models from UAV images. Computers & Graphics, 54, 84-93.

MARTINOVIC, A., KNOPP, J., RIEMENSCHNEIDER, H. & VAN GOOL, L. 2015. 3D all the way: Semantic segmentation of urban scenes from start to end in 3D. IEEE Computer Vision & Pattern Recognition, 4456–4465.

MINAEE, S., BOYKOV, Y. Y., PORIKLI, F., PLAZA, A. J., KEHTARNAVAZ, N. & TERZOPOULOS, D. 2021. Image Segmentation Using Deep Learning: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1-1.

MÜNSTER, S., LEHMANN, C., LAZARIV, T., MAIWALD, F. & KARSTEN, S. in print. Toward an Automated Processing Pipeline for a Browser-based, City-scale Mobile 4D VR Application Based on Historical Images. In: NIEBLING, F. & MÜNSTER, S. (eds.) Proceedings of the 2nd UHDL Workshop. Cham: Springer CCIS.

NIEBLING, F., MAIWALD, F., MÜNSTER, S., BRUSCHKE, J. & HENZE, F. 2018. Accessing Urban History by Historical Photographs. 2018 3rd Digital Heritage International Congress (DigitalHERITAGE) held jointly with 2018 24th International Conference on Virtual Systems & Multimedia (VSMM 2018), San Francisco, 1-8.

POLANYI, M. 1966. The tacit dimension, Chicago, University of Chicago Press.

REICH, K. 2006. Konstruktivistische Ansätze in den Sozial- und Kulturwissenschaften. Konstruktivistische Didaktik: Lehr- und Studienbuch mit Methodenpool. Beltz.

UTESCHER, R. & ZARRIEß, S. 2021. What Did This Castle Look like before? Exploring Referential Relations in Naturally Occurring Multimodal Texts. Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN).

VOSSELMAN, G., GORTE, B. G., SITHOLE, G. & RABBANI, T. 2004. Recognising structure in laser scanner point clouds. ISPRS Archives, 46, 33-38.

WU, X., AVERBUCH-ELOR, H., SUN, J. & SNAVELY, N. 2021. Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision.


Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

361 works by 945 authors indexed

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO