A Neural OCR Engine for North Saami

  1. Andre Kåsen

    National Library, Norway

  2. Håvard Østli

    National Library, Norway

  3. Andrea M. Huus

    National Library, Norway

  4. Lars Johnsen

    National Library, Norway

The DH-LAB at the National Library of Norway is developing an open-source optical character recognition (OCR) engine for North Saami, an under-resourced indigenous minority language recognized by the Norwegian state. The engine is trained with the Tesseract system by means of cross-lingual model transfer. Evaluated on a held-out portion of the ground truth, the model reaches a bag-of-words F1 measure of 0.98. It will be the first freely available OCR engine for North Saami.
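
To make the reported metric concrete, the following is a minimal sketch of one common way to compute a bag-of-words F1 measure between a ground-truth transcription and OCR output. It assumes whitespace tokenization and multiset (count-aware) matching; the function name `bag_of_words_f1` and these choices are illustrative assumptions, not necessarily the evaluation procedure used for the engine described here.

```python
from collections import Counter


def bag_of_words_f1(reference: str, hypothesis: str) -> float:
    """Bag-of-words F1 between ground truth and OCR output.

    Assumption: tokens are whitespace-separated and word order is ignored;
    repeated words are matched by count (multiset intersection).
    """
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hypothesis.split())

    # Words present in both bags, counted with multiplicity.
    overlap = sum((ref_counts & hyp_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(hyp_counts.values())  # share of OCR words that are correct
    recall = overlap / sum(ref_counts.values())     # share of ground-truth words recovered
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy strings: three of four words match.
    print(bag_of_words_f1("a b c d", "a b c x"))
```

Because word order is ignored, this measure rewards correctly recognized words regardless of where they appear on the page, which makes it tolerant of differences in line segmentation and reading order.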

