How does lossy JP2 image compression influence OCR?

Many institutions have carried out large-scale digitisation projects over the last decade, and the question of how to store digital master images cost-effectively has made the JPEG2000 image format more popular in the library, museum, and archive community.

Lossy JP2 encoding of page image masters in particular has turned out to provide a good balance between reducing file size and preserving the visible properties of a master image. Lossy encoding means that it might not be possible to restore the original file at the bit level, even if there are no differences distinguishable to the human eye. More importantly in this context, the absence of visible changes does not at all imply that the computational processing of the images is unaffected.

More generally, the question is what consequences lossy JP2 encoding has for the processes that are built around the digital master files. One process that is directly related to digital master images representing text, such as book or newspaper pages, is optical character recognition (OCR), and the subordinate question therefore is how lossy JP2 encoding influences text recognition.

Sure, there are recommendations on which profile to use for a certain collection type, so we could simply rely on a typical profile recommended by institutions with many years of experience in digitisation projects. Still, I would say that additional evidence can avoid surprises and help to better understand what the impact on one's own collection items actually is.

I will approach this question in a practical way, by developing an experiment that allows flexibility in modifying the main variables that influence the outcome:

  • TIFF image data sample
  • JP2 codec (Kakadu, OpenJPEG)
  • JP2 compression parameter alternatives and parameter value ranges
  • OCR engine

Assuming that the plan is to migrate a TIFF image collection to JPEG2000, the input is a sample of TIFF image files of a certain bit depth (e.g. 8-bit grayscale images of book pages with a standard book layout). For one experiment, the main variables are a list of TIFF images and a list of increasing or decreasing compression parameter values. The experiment then encodes the TIFF images to JP2 with each compression parameter value, decodes the images back to TIFF, and subsequently applies OCR. The OCR result of each decoded image is then compared against the OCR result of the original TIFF image. The overall results of the experiments can be compared to recommendations for JPEG2000 profiles and provide reliable evidence of whether they are valid for one's own collection.
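To make the workflow more tangible, here is a minimal Python sketch of one such experiment. It is not the actual implementation, just an illustration under a few assumptions of mine: the codec is OpenJPEG, driven through its opj_compress and opj_decompress command line tools with the compression ratio (-r) as the parameter; the OCR engine is Tesseract, which is only an example choice; and the difference between two OCR results is measured as a simple character-level similarity from Python's difflib, which is a stand-in for whatever evaluation measure one prefers. All function names are made up for illustration.

```python
import difflib
import subprocess
from pathlib import Path
from statistics import mean
from typing import Callable, Dict, Sequence


def text_similarity(reference: str, candidate: str) -> float:
    """Similarity in [0, 1]; 1.0 means both OCR texts are identical.
    Assumption: a plain difflib ratio stands in for a proper OCR metric."""
    matcher = difflib.SequenceMatcher(None, reference, candidate, autojunk=False)
    return matcher.ratio()


def openjpeg_roundtrip(tiff: Path, ratio: float, workdir: Path) -> Path:
    """Encode TIFF to JP2 at the given compression ratio, then decode the
    JP2 back to TIFF. Assumes the OpenJPEG tools are on the PATH."""
    jp2 = workdir / f"{tiff.stem}_r{ratio}.jp2"
    decoded = workdir / f"{tiff.stem}_r{ratio}.tif"
    subprocess.run(
        ["opj_compress", "-i", str(tiff), "-o", str(jp2), "-r", str(ratio)],
        check=True,
    )
    subprocess.run(
        ["opj_decompress", "-i", str(jp2), "-o", str(decoded)],
        check=True,
    )
    return decoded


def tesseract_ocr(image: Path) -> str:
    """Run OCR and return the recognised text.
    Assumption: Tesseract is used here only as an example engine."""
    result = subprocess.run(
        ["tesseract", str(image), "stdout"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def run_experiment(
    images: Sequence[Path],
    ratios: Sequence[float],
    roundtrip: Callable[[Path, float], Path],
    ocr: Callable[[Path], str],
) -> Dict[float, float]:
    """Return the mean OCR similarity to the original TIFFs for each
    compression parameter value."""
    # OCR of the untouched originals is the baseline for every comparison.
    baseline = {image: ocr(image) for image in images}
    results: Dict[float, float] = {}
    for ratio in ratios:
        scores = [
            text_similarity(baseline[image], ocr(roundtrip(image, ratio)))
            for image in images
        ]
        results[ratio] = mean(scores)
    return results
```

With the tools installed, a run could look like run_experiment(tiffs, [5, 10, 20, 40], lambda img, r: openjpeg_roundtrip(img, r, Path("work")), tesseract_ocr), where tiffs is the list of sample files and the "work" directory already exists. Because the codec round trip and the OCR step are passed in as functions, swapping in Kakadu or another OCR engine only means replacing those two callables.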

Concrete results of these experiments will be presented shortly, in June, at the Archiving 2012 conference in Copenhagen. The experiments will be published in a way that allows the results to be reproduced using other image samples, codecs, compression parameters or parameter values, and OCR engines.
