OCR improvements through machine learning methods and the impact on the long-term preservation of digitized content

Overview

The National Library of Luxembourg (Bibliothèque nationale du Luxembourg) has been digitizing its national heritage collections since the early 2000s. After a few years of image-only digitization projects, the library switched to METS/ALTO output with multiple manifestations, and over the years it has gained considerable expertise in creating digitized content enriched with both Optical Character Recognition (OCR) and Optical Layout Recognition (OLR). In 2020 the eLmA (eLuxemburgensia meets AI) project was launched to correct the full text (ALTO files) of more than 6,000,000 articles on the eluxemburgensia.lu site. The OCR quality of these articles varies for one or more reasons: the language in which the text is written (German and French and, to a lesser extent, Luxembourgish and English), the typography used (Gothic or Latin characters), or the quality of the digitization. This presentation will take a more in-depth look at the eLmA project and at its impact on the digital preservation of METS/ALTO content.

Speakers

Roxana Maurer, Bibliothèque nationale du Luxembourg, Coordinator Digital Preservation
Ralph Marschall, Bibliothèque nationale du Luxembourg, Project Manager Digitisation

Registration

Registration will open to OPF members shortly.

OPF members receive priority webinar registration and exclusive access to our webinar archives. To find out when registration opens to the community, sign up to our mailing list.
