OCR improvements through machine learning methods and the impact on the long-term preservation of digitized content


The National Library of Luxembourg (Bibliothèque nationale du Luxembourg) has been digitizing its national heritage collections since the early 2000s. After a few years of image-only digitization projects, the library switched to METS/ALTO output with multiple manifestations, and over the years it has built considerable expertise in creating digitized content enriched with both Optical Character Recognition (OCR) and Optical Layout Recognition (OLR). In 2020 the eLmA (eLuxemburgensia meets AI) project was launched to correct the full text (ALTO files) of more than 6,000,000 articles on the eluxemburgensia.lu site. The OCR quality of these articles varies for one or more reasons: the language in which the text is written (German and French and, to a lesser extent, Luxembourgish and English), the typeface used (Gothic or Latin characters), or the quality of the digitization. This presentation takes a more in-depth look at the eLmA project and at its impact on the digital preservation of METS/ALTO content.


Roxana Maurer, Bibliothèque nationale du Luxembourg, Coordinator Digital Preservation
Ralph Marschall, Bibliothèque nationale du Luxembourg, Project Manager Digitisation


Registration is now closed.
