To our friend and colleague,
Nicolas Yñesta, 1975-2020
A love story by Bertrand Caron,
with Alix Bruys, Yannick Grandcolas, Thomas Ledoux, Anne Paounov, Chloé Perrot, Luc Verrier, National Library of France (BnF)
[Note: this post also exists in French under the title Une déclaration d’amour aux formats]
The New Formats Expertise of the National Library of France
Knowledge of data formats directly affects our ability to make the most of the information they convey and to enable our future users to do the same. Without this knowledge, we are reduced to preserving an uninterpretable bitstream and we risk rendering it incorrectly1. For an organization that preserves digital information, formalizing a formats policy based on this knowledge is an essential component of an overall digital preservation policy.
In 2018, BnF resumed its activity of monitoring and studying data and metadata formats for the preservation of digital information, in order to publish a revised, justified and accepted formats policy. The year 2020 was marked by a significant acceleration in the pace of work, which will make it possible, early next year, to publish a policy document intended first for the institution itself, but also for its partners, network and potential donors.
This article presents the activity, the working methods adopted, and the results expected in the near future, and unveils one of the structuring elements of this policy: the criteria adopted at BnF to assess a format in order to ensure the sustainability of information.
The genesis
Since the end of the 2000s, one of the prefiguration groups of the BnF preservation system has been working on the issue of data and metadata formats. To its credit, we can cite in particular the study leading to the choice of the JPEG 2000 still image format for digitisation, with compression rates adapted to each type of medium2, the analysis of metadata packaging formats, which led to METS being preferred over the XFDU standard, and the comparison between the audioMD, videoMD and MPEG-7 formats for expressing the metadata produced by the characterisation of audio-visual files, which was decided in favour of MPEG-7.
Once the preservation system was put into production (2010), and for almost eight years, the group was only intermittently active, being called upon for specific subjects. In 2015, an internal report on the governance of digital preservation reaffirmed the need for a body to monitor and study data and metadata formats. It took almost three years to set up this group until, in 2018, redefined and revitalised, it took off!
The ‘Formats’ working group
The new working group “Données et métadonnées pour la préservation” (quickly abbreviated to “Groupe Formats”) had its charter redefined in 2016-2017. The task consisted in particular of identifying the organisational units that had to assist the group, either because they already had recognised experts on one of the subjects, or because they had stewardship of specialised collections and were therefore bound to develop their expertise on a specific type of content.
As of today, the group brings together 25 staff members working in specialised departments (Departments of Prints and Photography, Performing Arts, Audiovisual, Maps and Plans, Music) and in support departments (Departments of Information Systems, Conservation, Metadata, Cooperation3, Images and Digital Services4, Institutional Archives).
The group is also intended to be a space for knowledge transmission from experts to librarians responsible for new digital collections. It aims to involve the latter in decisions that may seem technical, but which directly affect the information preserved and the uses that can be made of it in the future5.
The method
In order to allow experts to work simultaneously on different parts of the policy document and to encourage micro-contributions, we chose not to work exclusively in the BnF production base6, but to develop the collaboratively written part on the software development platform GitHub (https://fr.github.com/), and more specifically in its wiki7. Many members thus discovered and familiarised themselves with this now indispensable tool. The contents are hosted in a space dedicated to BnF hackathons, which highlights their experimental and constantly revised status. The aim was to show the process of building competence, knowledge and know-how on data formats within BnF. Each content is therefore marked with a badge indicating its state of progress: only those marked as “validated” are considered fit for communication, although access to the others is not restricted.
Surprisingly, the lockdown declared in France from March 17th to May 11th enabled the group to make significant progress. Isolated in their own homes, freed from many of the obligations related to public service or to digitisation contract management, and from all other daily tasks, the group members were able to devote more working time to format studies.
Generally more computer-literate than the average BnF staff member, the group members continued to work collectively, and the weekly videoconferences organised on the subject also fostered group cohesion8. Finally, the choice of GitHub paid off, as it remained accessible from the personal equipment of most of the staff, unlike the BnF production bases.
The regular meetings became an opportunity to start training in the particular way that digital preservation specialists look at formats. We have thus been able to promote the tools developed by the international community (JHOVE, in particular) and by BnF (in particular the Preservation Planning module developed five years earlier and still under-used). Finally, the group is the ideal place to learn and use the common vocabulary of digital preservation9.
In the course of 2020, the parallel progress made in processing a voluminous collection of born-digital archives, donated to BnF in 2017 by the filmmaker Amos Gitaï10, provided the group members with an unexpected testing ground. Never had BnF received a born-digital donation on such a scale: 19 TB of data in approximately 200 different file formats, many of them proprietary, and successive video montages in FCP format (libraries of Final Cut Pro 7 or X projects). The necessary preservation operations, such as migrations carried out on proprietary format files, showed us that empirical practices implemented in an emergency do not make a policy, although they can help to establish one once supported by further studies.
As a result, a better shared and accepted policy
The result of these months of collective work is therefore a publication announced for the first quarter of 2021.
The reason why we are taking longer than expected to produce such a policy is the group's collective acknowledgment that a formats policy cannot simply list accepted formats. Rather, we chose to address the issue by type of content and to consider, for each one, which significant properties and functionalities we wish to preserve. In other words, the formats issues led us to ask ourselves the question of ‘preservation intent’.
We have also abandoned the idea of making recommendations in a context as general as a formats policy theoretically applicable to all the channels of an institution such as BnF. Like other institutions before us, we instead describe preferences and justify them by objective arguments or by uses and choices specific to the BnF context.
The group has begun to address the thorny question of preservation strategies, particularly, but not exclusively, in the face of exotic or simply unexpected formats that have arrived through the new projects that collect born-digital data. We noted the benefit of systematically framing the dilemma as a choice between adapting content to the environment and changing the environment to take content into account11. This formulation seemed more relevant to us than the traditional alternative of “migration” versus “emulation”.
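The example given in note 11 (32-bit JPEG images arriving in a chain that expects 24-bit ones) can be sketched as a simple check performed before that dilemma is even posed. The following is a minimal sketch in Python, not BnF's actual tooling: the function names and the `expected_bpp` parameter are assumptions made for illustration. It reads the first frame header of a JPEG stream and multiplies the sample precision by the number of components. Note that the component count only signals a mismatch: it does not by itself identify the colour model (four components may be CMYK or YCCK), so a characterisation tool is still needed to confirm it.

```python
import struct

# Start-of-frame markers; 0xC4 (DHT), 0xC8 (JPG) and 0xCC (DAC) are excluded
# because they are not frame headers.
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}


def jpeg_bits_per_pixel(data: bytes) -> int:
    """Return sample precision x component count from the first frame header."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"marker expected at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte preceding a marker
            pos += 1
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            # Frame header: precision (1 byte), height (2), width (2), components (1)
            precision, _height, _width, components = struct.unpack(
                ">BHHB", data[pos + 4:pos + 10])
            return precision * components
        pos += 2 + length  # skip this segment (marker + declared length)
    raise ValueError("no frame header found")


def choose_strategy(data: bytes, expected_bpp: int = 24) -> str:
    """Hypothetical routing: flag any image the processing chain does not expect."""
    if jpeg_bits_per_pixel(data) == expected_bpp:
        return "ingest as is"
    return "review: adapt the environment or migrate the content"
```

The check deliberately does not choose between the two options: as the text explains, that choice depends on how much such content is expected in the collections.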
It should also be noted that the policy document will not consist exclusively of original content: other institutions (the National Archives and Records Administration of the United States12, the Library of Congress13, the British Library14, etc.) have published equivalents. Nevertheless, the BnF contents have two major advantages.
- They are in French and are aimed at a public that is aware of digital issues but is not specialized in digital preservation; they have therefore been designed with an educational and concise rather than exhaustive approach.
- They are obviously adapted to BnF’s uses, needs and resources and have been decided, discussed and validated jointly. While the assessment of file formats is based on a certain number of objective criteria, the decision to adopt and process them in one way or another is based on a weighting of these criteria specific to the institution, which will have to determine the acceptable trade-offs between compactness and robustness, between simplicity and efficiency, etc.
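To illustrate how the same objective assessment can lead two institutions to different decisions, here is a minimal sketch in Python. It assumes a simple weighted mean, which is only one possible way of formalising such a weighting and is not presented as BnF's method. The four criteria are the ones named in the trade-offs above; the scores, weights and format names are entirely hypothetical.

```python
def weighted_score(scores, weights):
    """Weighted mean of per-criterion scores (all values here are illustrative)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight
               for name, weight in weights.items()) / total_weight


# Hypothetical assessments of two imaginary formats on a 0-5 scale.
format_a = {"compactness": 5, "robustness": 2, "simplicity": 3, "efficiency": 4}
format_b = {"compactness": 2, "robustness": 5, "simplicity": 4, "efficiency": 3}

# Two hypothetical institutions weighting the same criteria differently.
storage_constrained = {"compactness": 4, "robustness": 1,
                       "simplicity": 1, "efficiency": 2}
risk_averse = {"compactness": 1, "robustness": 4,
               "simplicity": 2, "efficiency": 1}

for label, weights in [("storage-constrained", storage_constrained),
                       ("risk-averse", risk_averse)]:
    a = weighted_score(format_a, weights)
    b = weighted_score(format_b, weights)
    print(f"{label}: format A = {a:.3f}, format B = {b:.3f}")
```

With these made-up numbers, the storage-constrained weighting ranks format A first (4.125 against 2.875) and the risk-averse weighting ranks format B first (4.125 against 2.875), although the underlying assessments are identical.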
Expected readership
The document is aimed at three different audiences:
- French-speaking libraries holding digital data that would like to develop a formats policy or compare their own to BnF’s,
- potential donors wishing to provide their creations in a form that BnF is able to handle,
- and, more generally, any producer of data interested in their persistence, whether or not they are likely to entrust them to BnF.
Content
The content of the policy document will consist of three main parts.
- The policy principles
  - Glossary of the main concepts,
  - Rationale for a formats policy,
  - Criteria for choice,
  - Methods of file analysis,
  - In the event that data are in a format deemed unsuitable for the institution’s policy, criteria to be considered in order to determine the migration strategy to be adopted (target format, migration method, whether or not to retain the original data).
- The structured list of formats identified by BnF
  - By type of content and by use, general considerations and technical characterisation metadata produced by BnF to judge the relevance, quality, possible uses and history of the file;
  - For each type of content and use, the list of formats itself, by level of preference (preferred formats, accepted, under study and recognised by the digital preservation community), each possibly accompanied by the preservation strategy adopted by BnF.
- One record for each preferred and accepted format, described with particular attention to all the parameters affecting each preservation criterion, the tools identified and preferred by BnF to produce, edit, render, characterize, validate and migrate it and the use or presence in the institution’s collections.
Among these sections, one is critical and can now be unveiled: the criteria adopted by BnF to assess a format in order to preserve its content information over the long term.
Conclusion
The group ‘Data and Metadata Formats for Preservation’ is therefore a privileged space for addressing digital preservation issues and promoting them as issues that concern the profession first and foremost, and not as mere technical questions. Its re-establishment as a permanent working group, although a long time in the making, has now been achieved and is helping to ensure that staff and their managers recognise that digital preservation is a recurring activity.
Once the initial deliverables have been published at the beginning of next year, the next task will be to promote and publicise the group’s activity, expertise and contributions within BnF itself, and to gradually implement its conclusions.
1 As an illustration of this issue, see the blog post published by BnF on the previous International Day for Digital Preservation: “The JPEG blues: properly rendering 32-bits JPEG”, on the OPF website, https://openpreservation.org/blogs/jpeg-got-the-blues/.
2 The results of the study were synthesised in a paper at the Archiving conference in 2017 on “JPEG2000 as a preservation format for digitization: lessons learned from a library”.
3 Gallica (https://gallica.bnf.fr), the shared digital library that offers its services to several hundreds of French partner institutions, is managed by the Department of Cooperation.
4 The Images and Digital Services Department manages https://images.bnf.fr/, a database of digitised and indexed images.
5 For example, one of the latest topics studied by the group was PSD (Photoshop) files. It made collection managers aware of how the format affects the ability to preserve all the layers of a layout: exporting the image merges the layers, so they are lost.
6 This database currently uses the IBM Lotus Notes solution as its technical building block.
7 The space currently in use is https://github.com/hackathonBnF/FichesFormat/wiki.
8 This arrangement was so popular that it has been maintained to this day, beyond the partial return of BnF staff to the site.
9 It is interesting to note that the adoption of the OAIS terminology has been questioned. Rather than teaching group members a complex data model, and assuming the familiarity of readers with the standard, we preferred reusing a consistent subset of these concepts and providing an appropriate definition in the policy document.
10 See the fonds record (http://comitehistoire.bnf.fr/dictionnaire-fonds/amos-gita%C3%AF) and the finding aid (https://archivesetmanuscrits.bnf.fr/ark:/12148/cc1063058).
11 As an example, see the issue posed by 32-bit CMYK JPEG images mentioned in note 1 above. The alternative that we faced was the following: given that our processing chain expects 24-bit RGB images, should we upgrade it to take into account content in another colour model, or, given the small amount of such content likely to be included in the BnF collections, should we migrate them to the RGB colour model?
12 See in particular the U.S. National Archives and Records Administration Digital Preservation Framework on GitHub (https://github.com/usnationalarchives/digital-preservation).
13 See the Library of Congress’ Recommended Formats Statement (https://www.loc.gov/preservation/resources/rfs/) and their rich descriptions of several hundred formats (Sustainability of Digital Formats: Planning for Library of Congress Collections, https://www.loc.gov/preservation/digital/formats/).
14 See in particular the comprehensive studies hosted on the Digital Preservation Coalition’s website at https://wiki.dpconline.org/index.php?title=File_Formats_Assessments.