Last week I attended the PREFORMA Experience Workshop in Berlin. The Open Preservation Foundation and PDF Association are leading veraPDF. The morning focused on use cases for conformance checkers from memory institutions and the afternoon explored the PREFORMA challenge with an overview of the testing phase which starts in January 2017. This was followed by presentations from the three suppliers that are developing the conformance checkers.
PREFORMA is a Pre-Commercial Procurement (PCP) project, co-funded by the EU. The three suppliers to the project were selected through a competitive tender process. The software must meet the strict requirements in the format specifications as well as requirements from memory institutions. Antonella Fresa, technical co-ordinator, noted that the work carried out by the three suppliers is in line with their expectations. PREFORMA are keen to talk to users for further feedback on the results.
Borje Justrell, project co-ordinator, introduced the project partners and their approach. The conformance checkers are all open source to help establish a community around the software and ensure it remains available after the project ends. The software is released under two open source licenses: GPLv3 or later and MPLv2 or later. The PREFORMA vision defines three preservation layers: bit preservation, logical preservation and semantic preservation. The project is focusing on the logical layer.
The keynote was given by Hannes Kulovits from the Austrian State Archives. One of the main challenges they face is that the different federal government departments do not all follow the same procedures when supplying their records. There is no consistent policy for records management and there are no file format restrictions – they have to be able to handle everything.
Hannes explained that the significant and technical properties of a file are key to maintaining its authenticity. The properties the archives focus on are:
Format and Sub-format
Assessing the risks for different file formats, such as the quality, availability and price of the specification, the number of open source tools available for identification and validation, and the licence cost.
Representation Instance Properties
Is the object valid and well-formed? What size is it, and is it searchable? Is the metadata valid, and does it conform to the format standard?
How many pages does it have? Does it contain footnotes, or a table of contents? If it’s an image, what size is it? How many bits per sample are there?
The archives also look at the record properties and check when the last change was made, what the subject area is and the date it was archived.
Hannes gave an example of a use case in which they received a digitally signed PDF that rendered differently on different computers. It transpired that the font was not embedded. This is a problem for digital preservation as it means the structure of the document can change.
The archives mostly use JHOVE and DROID to validate their files, and are planning to evaluate veraPDF and FITS. They are keen to try as many identification and validation tools as possible on their repository.
Hannes summarised with a few points that are central to the success of digital preservation:
- know WHAT you have
- be sure that digital files are what they purport to be
- know WHAT action to take and HOW
Up next we heard use cases from memory institutions, and digitisation and archiving service providers. Conformance checking files is considered an important step during ingest workflows and ingest reporting.
For digitised material it is easier to agree on a single format, or a limited number of formats, to ingest. Born digital content, however, is created by many producers and technology advances quickly, so it is difficult to enforce the same restriction on the formats accepted.
Testing of the conformance checkers has been carried out by external organisations. They presented some of their initial results and gave feedback on how the software could be improved. PREFORMA memory institutions are also testing the tools and adapting their workflows to integrate the conformance checkers.
Digital preservation? – just press save
Benjamin Yousefi, legal and technical adviser at Riksarkivet, did not have a background in digital preservation before he joined the project. When he began to consider the issues, one of the first questions was: how do we determine that a file conforms to the ISO 19005 standard? There are several validators available that try to answer this question, but their results disagree, so he could not advise which one his organisation should use.
ISO standards are interpreted differently so it is difficult to decide which implementation is correct. The PREFORMA Challenge was to establish an objective point of reference. The three suppliers have approached this differently: EasyInnova could not change the specification for TIFF. They have created TI/A as a source for interpreting the specification. MediaInfo have been involved in the creation of the specification for Matroska, acting as a legislator for the format. veraPDF is working with the PDF industry. Through the PDF Association’s Technical Working Group they are resolving ambiguities in the specification.
The PREFORMA Challenge
Bert Lemmens looked back at the role of memory institutions. They have been preserving paper and ink for decades, and digital preservation is not a new problem. PREFORMA describes digital preservation as:
Taking precautions enabling long-term access to digital data
This implies both policy decisions, implementing a sustainability strategy, and practical solutions, deploying tools to preserve and manage digital data
He then discussed current strategies for digital preservation:
There is often a good reason for this; organisations don’t understand the problem or know what action to take. Their strategy is that by doing nothing, they are not doing anything wrong.
Apply what you know from preserving analog material. Put it on a shelf and keep it safe, sealed and confidential.
Preserve the software manuals and information with machines. In some cases this is very useful.
All three are very passive strategies. More active strategies include:
Replace the underlying technology to preserve the content. This is very hard – organisations need to decide which formats they should migrate to and keep up to date. Which formats are newer, better, more sustainable? How do you choose the format? There is a huge list to choose from, each with their own risks, especially if you don’t understand the format.
Follow what the large organisations are doing
Create your own format or own solutions. This is more common in larger institutions and it creates a whole new range of risks. This approach depends largely on specific people within an organisation. When they leave, the knowledge goes too.
Look for better alternatives, convince others to do the same.
Memory institutions can lack knowledge about how file formats technically work, and often do not have control over the way they are produced, or the tools to manage the different types.
The PREFORMA Challenge Brief is to enable memory institutions to gain full control over the technical properties of digital content intended for long term preservation.
The main issue lies with the file format specification, which describes how the format has been put together. It is a document written in natural language and therefore open to interpretation. There are other issues with specifications. They can be:
- Inaccessible (closed)
- Planned for obsolescence (client lock in)
Conformance checking is defined as
The process of checking whether the technical properties of a digital file conform to the specification of the corresponding file format
Memory institutions need files with consistent properties to ensure authenticity of the content, simplify the management of collections and enable large scale migration and emulation.
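To make the definition of conformance checking above more concrete, here is a minimal sketch in Python of what a single rule check might look like. It is not taken from any of the PREFORMA tools; the function name `check_tiff_header` is hypothetical and it only covers the 8-byte header described in the TIFF 6.0 specification (byte-order mark, the magic number 42, and the offset of the first image file directory). A real conformance checker applies many more rules than this.

```python
import struct


def check_tiff_header(data: bytes) -> list[str]:
    """Return the header rule violations found in `data` (empty list if none).

    Illustrative only: checks the 8-byte TIFF header, nothing else.
    """
    if len(data) < 8:
        return ["file is shorter than the 8-byte TIFF header"]

    # Bytes 0-1: byte order, "II" (little-endian) or "MM" (big-endian).
    order = data[:2]
    if order == b"II":
        endian = "<"
    elif order == b"MM":
        endian = ">"
    else:
        return [f"unknown byte-order mark {order!r}"]

    # Bytes 2-3: magic number 42. Bytes 4-7: offset of the first IFD.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])

    problems = []
    if magic != 42:
        problems.append(f"magic number is {magic}, expected 42")
    if ifd_offset < 8:
        problems.append("first IFD offset points inside the header")
    if ifd_offset % 2:
        problems.append("first IFD offset is not on a word boundary")
    return problems


if __name__ == "__main__":
    sample = b"II*\x00\x08\x00\x00\x00"  # little-endian header, IFD at offset 8
    print(check_tiff_header(sample))
```

Returning a list of violations rather than a single true/false answer mirrors the point made at the workshop: institutions want to know which properties of a file deviate from the specification, not only that it failed.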
Three formats were selected for the conformance checker development: one text, one image and one moving image. The specifications were chosen because they are:
- Complete, and you can unambiguously point to one version
- Open – or subject to a nominal charge, but irrevocably royalty free
- Use reference implementations – have test files that show in practice what is and is not valid
Marcus Gerber, Riksarkivet, explained how testing of the conformance checkers is carried out. The suppliers, PREFORMA partners and external members of the digital preservation community have tested the software during development.
The software is hosted on GitHub so it is open: the project and community can track progress and log issues and feature requests. Each supplier has made stable, monthly software releases and has received written feedback from PREFORMA after each formal release. PREFORMA has used a combination of organic and synthetic test files contributed by the partners, suppliers and external sources. The prototyping phase ends on 31 December. A six-month scientific test phase will run from January to June 2017. There is still an open call for external partners to contribute to testing to help improve the software.
Each of the suppliers gave a presentation about the latest developments in the software and an overview of plans for the future. For more information about each of the conformance checkers, visit the PREFORMA Open Source Portal or take a look at their websites.
Carl Wilson, Open Preservation Foundation
Slides from the workshop are published at: https://web.archive.org/web/20181220234043/http://experienceworkshop.preforma-project.eu/programme/