Preserving Cultural Heritage
National libraries are responsible for bridging two goals: preserving the rich cultural heritage of our society and making it accessible to the public. Digitization is a means of reconciling these complex and seemingly contradictory goals. Digital copies of books provide access to their content while protecting the original artifacts and retaining that content in case of loss or destruction. While this solves one problem, it raises another: the long-term preservation of digital objects.
What is Matchbox?
The Matchbox Toolset is an open source toolset that provides decision-making support for various quality assurance tasks in digital libraries. It can be used to assess quality properties of image collections, compare different versions, or find duplicates within collections. Matchbox is based on state-of-the-art image processing technologies and does not rely on Optical Character Recognition (OCR), which makes it more flexible than previous approaches.
How does it work?
The solution is based on interest point detection, a technique now used in many fields of visual computing. Perceptually salient points are detected from the contrast properties of an image and described statistically. Because these descriptions are invariant to scale, illumination, and rotation, the approach is well suited to analyzing inhomogeneous document collections.
From these interest points, a visual vocabulary and document fingerprints are computed, an approach derived from classical document retrieval and applied here to image retrieval. Machine learning techniques identify interest points common to all images of a collection, and these form the visual vocabulary. Counting the occurrences of these visual words in an image yields a histogram-based fingerprint. This highly condensed representation of an image can be used for fast indexing and search operations.
Efficient machine learning algorithms then use these fingerprints to identify matching images within one book collection or between different collections. Once matching pairs have been identified, a geometric transformation is calculated from their corresponding interest points to scale, rotate, and align the images accurately. After this registration step, the images can be compared reliably and a similarity estimate can be calculated.
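The vocabulary and fingerprint steps described above can be illustrated with a minimal sketch. It is not the Matchbox implementation: the toy two-dimensional descriptors, the use of plain k-means with a farthest-point initialization, the histogram intersection measure, and all function names are assumptions chosen for illustration. Real interest point descriptors have many more dimensions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (centroid) closest to a descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))


def build_vocabulary(descriptors, k, iterations=10):
    """Cluster all descriptors of a collection; the k centroids act as visual words.

    Assumption: plain k-means with a deterministic farthest-point start.
    """
    centroids = [list(descriptors[0])]
    while len(centroids) < k:
        farthest = max(descriptors,
                       key=lambda d: min(squared_distance(d, c) for c in centroids))
        centroids.append(list(farthest))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_word(d, centroids)].append(d)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def fingerprint(descriptors, vocabulary):
    """Normalized histogram of visual word occurrences for one image."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def histogram_intersection(a, b):
    """Similarity in [0, 1] between two normalized histograms."""
    return sum(min(x, y) for x, y in zip(a, b))


# Hypothetical descriptors: page_a and page_b stand for two scans of the same page.
page_a = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
page_b = [[0.05, 0.1], [0.1, 0.05], [5.0, 5.0], [5.1, 5.1]]
page_c = [[9.0, 0.0], [9.1, 0.1], [9.0, 0.2], [0.0, 0.0]]

vocabulary = build_vocabulary(page_a + page_b + page_c, k=3)
fp_a, fp_b, fp_c = (fingerprint(p, vocabulary) for p in (page_a, page_b, page_c))

print("a vs b:", histogram_intersection(fp_a, fp_b))
print("a vs c:", histogram_intersection(fp_a, fp_c))
```

In this toy data the two scans of the same page receive identical fingerprints, while the unrelated page scores much lower. The fingerprint length equals the vocabulary size, which is what keeps the representation compact regardless of how many interest points an image contains.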
The method described has been implemented as a set of small tools. Rather than a monolithic program that solves one specific task, this approach offers flexibility for a range of current problems, such as detecting duplicated pages within a book, estimating quality differences between two digital versions of a book, and assembling a collection from different partial versions, as well as for problems not yet known.
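As an illustration of one of these tasks, detecting duplicated pages within a book, the following sketch compares histogram fingerprints pairwise and reports pairs above a similarity threshold. It is an assumption-laden example rather than a Matchbox tool: the fingerprint values, the cosine similarity measure, the threshold of 0.95, and the function names are all hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two fingerprint histograms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicate_candidates(fingerprints, threshold=0.95):
    """Return (page, page, score) for every pair at or above the threshold.

    Assumption: an exhaustive pairwise comparison, which is fine for a single
    book but would need an index for large collections.
    """
    candidates = []
    for (page_a, fp_a), (page_b, fp_b) in combinations(sorted(fingerprints.items()), 2):
        score = cosine_similarity(fp_a, fp_b)
        if score >= threshold:
            candidates.append((page_a, page_b, score))
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Hypothetical fingerprints over a three-word visual vocabulary.
book = {
    "page_001": [0.50, 0.10, 0.40],
    "page_002": [0.05, 0.80, 0.15],
    "page_003": [0.49, 0.11, 0.40],  # stands for a rescan of page_001
    "page_004": [0.30, 0.30, 0.40],
}

duplicates = find_duplicate_candidates(book)
for first, second, score in duplicates:
    print(f"{first} ~ {second} (similarity {score:.4f})")
```

Pairs found this way are only candidates. Following the method described above, each candidate pair would still be registered using its corresponding interest points and then compared in detail before being treated as a duplicate.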
The provided video demonstrates and visualizes the fundamental principles and technologies of the Matchbox Toolset.
Matchbox Screencast from SCAPE project on Vimeo.