Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

paper, specified "long paper"
Authorship
  1. Elizabeth Grumbach

    Initiative for Digital Humanities, Media, and Culture - Texas A&M University

  2. Matthew J Christy

    Initiative for Digital Humanities, Media, and Culture - Texas A&M University

  3. Laura Mandell

    Initiative for Digital Humanities, Media, and Culture - Texas A&M University

  4. Clemens Neudecker

    Koninklijke Bibliotheek (KB National Library of the Netherlands)

  5. Loretta Auvil

    Illinois Informatics Institute - University of Illinois, Urbana-Champaign

  6. Todd Samuelson

    University Libraries - Texas A&M University

  7. Apostolos Antonacopoulos

    Pattern Recognition and Image Analysis (PRImA) Research Lab - University of Salford

Work text
Introduction
In 2011, the Comité des Sages presented “The New Renaissance” to the European Commission, stating that “digiti[z]ation is more than a technical option, it is a moral obligation” to the public. The report stresses that the initiative’s goal is to ensure that we “experience a digital Renaissance instead of entering into a digital dark age.” If the lack of adequate, searchable early-modern digital resources can be correctly referred to as a “digital dark age,” then we are undoubtedly seeing the emergence of a “digital renaissance.” [1] Projects like IMPACT (Improving Access to Text) [2], eMOP (Early Modern OCR Project), TCP (Text Creation Partnership), and others have emerged in recent years to take up the call to arms issued by the Comité des Sages. However, we need more than the digitization of cultural materials; we need responsible digitization alongside a community engaged in the fight for digital visibility of those materials. Most importantly, large DH projects need effective and responsible management and collaboration standards. The aim of such standards is to let a project adapt and adjust to the changing climate, ultimately steering it safely into the harbor.

Overview
Many OCR (Optical Character Recognition) and cultural preservation projects are underway that need to be able to adapt their project plans to breakthroughs in OCR technology and crowd-sourcing as they occur. In Fall 2012, the Initiative for Digital Humanities, Media, and Culture at Texas A&M University received a $734,000 grant from the Mellon Foundation for the Early Modern OCR Project (eMOP) [3]. eMOP’s objective is to make machine-readable, or improve the machine readability of, 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). More generally, eMOP intends to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research (Mandell, 2013) [4]. At the end of the grant, we intend to publish an open-source OCR workflow in Taverna. This workflow will provide access to an early modern font database, customization guidelines for the Tesseract OCR engine, post-processing and diagnostic algorithms, and crowdsourcing and “scholar-sourcing” (as Brian Geiger has dubbed it) correction tools. But the overarching goal of eMOP, a project that blends book history [5], digital humanities, textual analysis, and machine learning, is ultimately to foster a community of scholars and institutions interested in the digital preservation of, and access to, these texts. To this end, eMOP has assembled an international team of collaborators from multiple disciplines.

eMOP, however, has faced problems in implementing its goals and processes. During Year One, the eMOP team and collaborators quickly realized that the grant document outlined milestones and goals well but did not provide the level of granularity needed to complete each of them. We have also realized that this field is continually advancing, and that big DH projects that do not adjust accordingly will end up reinventing the wheel. Active outreach to, and collaboration with, institutions outside the initial grant collaborators proved important. In addition, eMOP is working with proprietary page images and metadata in order to release an open-source tool, which has produced its own challenges. To succeed in producing a corpus of machine-readable texts and a workflow for future OCR projects, continual outreach and collaboration are needed, yet they are not always possible because of grant deadlines, funding restrictions, and other institutional roadblocks.

Getting Started
This panel considers how big DH projects, with big datasets, big networks of collaborators, and big goals, can adjust and adapt to change. It has long been noted that digital humanities projects lend themselves well to agile [6] development models [7], specifically the “philosophy of ‘releasing early and often’” (Scheinfeldt, 2010) [8]. However, these models often break down in the face of multi-institutional and international collaboration, software development, the assembly of large amounts of data, and what James Smithies and enterprise IT call “transition management,” or planning for “Change” (2011) [9]. A digital humanities project, large or small, also “seems to both depend upon collaboration and aim to support it” (Spiro, 2009) [10]. Each big DH project must find a practical balance between development management and collaboration methods.

This panel will bring together the eMOP management team at the IDHMC and collaborators from various disciplines and institutions to discuss the reasons why big DH projects need to plan for adaptation, ways in which projects can achieve this flexibility, and how to swiftly change directions.

If eMOP’s goal reflects the goal of digital humanities at large, i.e. to foster collaboration among various disciplines and cultivate inter-institutional and international relationships that make possible new kinds of humanities research, then this panel provides a microcosm of that endeavor.

Panel Organization
This panel will begin with a brief 5-minute overview of the goals of, methodologies for, and collaborators in the Early Modern OCR Project. Each speaker will then introduce a major directional change or challenge that eMOP has faced, together with the resulting solution, in 7 minutes or less. Introductions to challenges may include comparisons to other large DH projects (e.g. IMPACT). Discussion of the resulting solution may include a short software/tool demo. The panel organizers will then pose questions to the roundtable to begin an open conversation, leaving the remaining time for discussion among panelists and the audience. Discussion will likely focus on how to change directions, rethink decisions, and reconfigure plans when collaborating with multiple institutions and individuals while facing grant deadlines and milestones.

Questions that Panel Organizers may pose:
Discuss future models of big DH project management, especially how essential multi-institution and international collaboration can be.
What can big DH projects learn from the agile vs. traditional software development models?
Discuss best practices in project management, and how they might be modified in order to take in recent technological innovations or respond to challenges.
We know that “failure” (Unsworth, 1997) [11] is important: how can small failures be channeled into big success?
What kinds of cultural practices need to be taken into account when U.S. projects adopt European models, and vice versa?
How can transatlantic collaboration best be orchestrated so that projects benefit from collaborators’ advancements, both technological and social?
Participants
All panelists have committed to participating and eagerly anticipate the discussion of eMOP, large DH projects, and successful and responsible collaboration and development management.

Apostolos Antonacopoulos is the Director of the Pattern Recognition and Image Analysis (PRImA) research lab at the University of Salford, UK. Dr. Antonacopoulos has been working on issues of pattern recognition, image and document analysis, and historical document digital restoration for many years. In addition to eMOP, he has contributed to the IMPACT and Europeana Newspaper projects, and will discuss how the adoption and customization of software for large cultural preservation projects should be responsive to changing project needs.
Loretta Auvil works at the Illinois Informatics Institute (I3) at the University of Illinois at Urbana-Champaign. She has worked with a diverse set of application drivers to integrate machine learning and information visualization techniques to meet the needs of research partners. Prior to working for I3, she spent many years at NCSA on machine learning and information visualization projects and several years creating tools for visualizing performance data of parallel computer programs at Rome Laboratory and Oak Ridge National Laboratory. She will discuss big DH projects from the perspective of these experiences and her work with eMOP.
Liz Grumbach is Project Manager for the Advanced Research Consortium (ARC) and IDHMC “alt-ac” Research Staff. She is Co-Project Manager for eMOP (Year Two), and will briefly introduce the project (goals and methodologies). She will also end the panel by comparing the current workflow for the eMOP OCRing process with the proposed OCR workflow contained in the grant, summing up the overall changes that each collaborator’s contribution shaped.
Laura Mandell is Professor of English and Director of the IDHMC at Texas A&M University. In addition to being the Lead PI for eMOP, Dr. Mandell previously received a Mellon grant (2010) to investigate how effectively the open-source OCR engine Gamera could be trained to read early modern fonts. She will introduce the data management challenges eMOP has faced, demonstrating software and tool solutions created by eMOP graduate students and staff.
Clemens Neudecker serves as Technical Coordinator in the Research section of the Innovation & Development Department of the KB National Library of the Netherlands. He has worked on numerous large-scale national and international digitization / digital humanities projects since the early 2000s, with a particular focus on OCR (www.impact-project.eu) and scalable workflows (www.scape-project.eu), and will discuss how this previous knowledge aided the eMOP team.
Todd Samuelson is Assistant Professor at Texas A&M University and the Curator of Rare Books & Manuscripts at Cushing Memorial Library & Archives. Dr. Samuelson is the book history consultant for eMOP. He will discuss font history research roadblocks and demonstrate font creation and identification tools created by eMOP collaborators to solve these issues.
Panel Organizers:
Matthew Christy, Lead Software Applications Developer for the IDHMC and Co-Project Manager for eMOP (Year Two)

Liz Grumbach, IDHMC “alt-ac” Research Staff and Co-Project Manager for eMOP (Year Two)

References
1. European Commission: The Comité des Sages. The New Renaissance: Report of the Comité des Sages on bringing Europe’s cultural heritage online. By Elisabeth Niggemann, et al. 10 Jan 2011.

2. IMPACT. Annual Report: Project Periodic Report. Netherlands: IMPACT, 2011. Improving Access to Text. 9 Dec 2011. www.impact-project.eu/uploads/media/IMPACT_Annual_report_2011_Publishable_summary_01.pdf. 29 Oct 2013.

3. Mandell, Laura. Mellon Foundation Grant Proposal: "OCR'ing Early Modern Texts." Grant Proposal. 30 Jun 2012.

4. Mandell, Laura. (2013) Digitizing the Archive: The Necessity of an 'Early Modern' Period. Journal for Early Modern Cultural Studies 13.2: 83-92.

5. Heil, Jacob and Todd Samuelson. (2013) Book History in the Early Modern OCR Project, or, Bringing Balance to the Force. Journal for Early Modern Cultural Studies 13.4: 90-103. Web. 30 Oct 2013.

6. Beck, Kent, et al. Manifesto for Agile Software Development. Agile Alliance. 30 Oct 2013.

7. Martin, Robert Cecil (2003). Agile Software Development: Principles, Patterns, and Practices. Upper Saddle River, NJ: Prentice Hall.

8. Scheinfeldt, Tom. Stuff Digital Humanists Like: Defining Digital Humanities by its Values. Found History. 2 Dec 2010. www.foundhistory.org/2010/12/02/stuff-digital-humanists-like. 30 Oct 2013.

9. Smithies, James (2011). A View from IT. Digital Humanities Quarterly 5.3. Web. 30 Oct 2013.

10. Spiro, Lisa. Examples of Collaborative Digital Humanities Projects. Digital Scholarship in the Humanities. 1 Jun 2009. digitalscholarship.wordpress.com/2009/06/01/examples-of-collaborative-digital-humanities-project.. 30 Oct 2013.

11. Unsworth, John. Documenting the Reinvention of Text: The Importance of Failure. The Journal of Electronic Publishing 3.2 (1997). Web. 30 Oct 2013.

Conference Info

ADHO - 2014
"Digital Cultural Empowerment"

Hosted at École Polytechnique Fédérale de Lausanne (EPFL), Université de Lausanne

Lausanne, Switzerland

July 7, 2014 - July 12, 2014

Conference website: https://web.archive.org/web/20161227182033/https://dh2014.org/program/

Attendance: 750 delegates according to Nyhan 2016

Series: ADHO (9)

Organizers: ADHO