Training Algorithms to Read Complex Collections: Handwriting Classification for Improved HTR Models

paper, specified "short paper"
Authorship
  1. Carrie Pirmann

    Bucknell University

  2. Bhagawat Acharya

    Bucknell University

  3. Brian King

    Bucknell University

  4. Katherine Mary Faull

    Bucknell University

Work text

This paper presents a new handwriting grouping algorithm developed to decrease the Character Error Rate (CER) for a collection of manuscript documents written in various hands and in multiple languages. The Moravian Lives project (moravianlives.org), an international, collaborative DH project housed at Bucknell University, takes as its starting point a vast collection of archival materials held by the Moravian Church in the United Kingdom, Germany, and the United States. The materials include tens of thousands of handwritten ego-documents in a variety of handwriting styles. Because the documents are held in archives in several countries, one goal of the Moravian Lives project is to digitize and transcribe these memoirs to make them accessible to a broader audience.

Initially, transcriptions of the memoirs were crowdsourced. However, crowdsourcing is replete with problems, including varying transcription accuracy, the length of time needed to produce transcriptions, and a dearth of individuals who can read the handwriting styles of the documents, particularly those written in old German script. To facilitate the transcription process, in early 2019 the Moravian Lives team began using Transkribus (transkribus.eu). The platform allows for the creation of custom handwritten text recognition (HTR) models, which are trained on previously transcribed memoirs and used to machine-transcribe new documents (Muehlberger et al., 2019). With adequate training data (i.e., several hundred pages or 50,000+ words), models with a CER of five percent or less can be developed, which is sufficient for expediting archival work. Existing projects that have achieved this level of accuracy may be based on multiple hands, drawing on significant data from each hand. For example, the University of Greifswald has trained successful models with a 5% CER on a corpus of 250,000 words written in three different hands.
Similarly, the Bentham Project trained a highly accurate English-language model on 50,000 words written in a small number of hands (Muehlberger et al., 2019).

The numerous and varying handwriting styles found in the Moravian memoirs present multifaceted challenges to creating highly accurate models. We do not know how many scribes there were or, in most cases, their identities, and we continually come across new handwriting styles. Memoir documents range from two to 50 pages in length; most of the documents we are working with are 10 pages or fewer, so there is little training data per document. While we have had some success creating models via human identification of similarities in handwriting, we believe that automated scribe identification and/or automated grouping of handwriting by similarities in style could result in much more accurate models. To address this problem, an undergraduate computer science major and a professor of computer science joined the Moravian Lives team and are experimenting with deep learning to build a grouping model designed to group or sort memoirs by handwriting style. These groupings should enable the creation of more accurate models in Transkribus, as well as more accurate transcription outputs.
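
The Character Error Rate used above as the measure of model accuracy can be illustrated with a minimal sketch. It assumes the common definition of CER: the character-level edit distance between a reference transcription and the machine output, divided by the length of the reference. The function names and the sample strings are hypothetical, and the exact normalization applied by Transkribus may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: a human transcription and an HTR output
# containing one substituted and one missing character.
reference = "Lebenslauf der Schwester"
hypothesis = "Lebenslanf der Schwestr"
cer = character_error_rate(reference, hypothesis)
print(f"CER: {cer:.1%}")
```

Under this definition, a model meeting the five percent threshold makes at most about one character-level error per twenty characters of reference text.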


Conference Info

In review

ADHO - 2020
"carrefours / intersections"

Hosted at Carleton University, Université d'Ottawa (University of Ottawa)

Ottawa, Ontario, Canada

July 20, 2020 - July 25, 2020

475 works by 1078 authors indexed

Conference cancelled due to coronavirus. Online conference held at https://hcommons.org/groups/dh2020/. Data for this conference were initially prepared and cleaned by May Ning.

Conference website: https://dh2020.adho.org/

References: https://dh2020.adho.org/abstracts/

Series: ADHO (15)

Organizers: ADHO