Government departments are connecting their information systems to the e-Depot of the National Archives of the Netherlands (NANETH). The digital archival materials (information objects) coming from these systems (closed cases or other process-bound information) are subsequently preserved in the e-Depot. NANETH’s Service Organization supports the more complex connection projects. These projects are always preceded by a so-called impact assessment. In the impact assessments, experts from various NANETH departments and representatives of the responsible authority (provider) investigate what organisational, content and technical measures are required for the connection. The impact assessment results are input for the project plan for establishing the actual connection and ingest.
The organisational track includes project planning, contractual agreements, communication, and relation management. Certain technical measures are required for establishing the connection. As archives are (by law) required to be in good condition, properly arranged and accessible, we also investigate the content of the information objects.
In this blog, we provide a bird’s eye view of the topics we are currently discussing in the preservation meetings that are part of impact assessments, and how preservation tools support those conversations. The purpose of the conversations is to estimate the impact the information objects will have on NANETH’s preservation efforts. It is also possible that certain issues need to be addressed before a provider can connect to the e-Depot. We look back on the first results. Please remember that this blog only provides a bird’s eye overview of one aspect of impact assessments. The Dutch version of this blog can be found here: https://informatie2020.pleio.nl/blog/view/49801192/impactanalyses-hoe-preservationtools-het-gesprek-ondersteunen.
In the content track of impact assessments, we examine a representative selection of the information objects from a preservation perspective. The assessment begins with a conversation around the following topics or questions:
- To what extent does the dataset contain information types that will have an (extra) impact on preservation? Databases, software and GIS information are usually a bigger challenge than text files.
- To what extent do the information objects conform to NANETH’s preferred file formats (in Dutch)? The more the file formats differ from our preferred formats, the harder it is to guarantee their preservation and sustainable access, and to find the required preservation knowledge and expertise.
- To what extent does the set contain files with extensive or unusual (interactive) behaviour? Think of formulas or macros in Microsoft Office Excel, stored procedures in a database, or hyperlinks and active links in and between documents.
- To what extent does the set contain encrypted files? Encryption has to be removed before transfer, or the decryption key must be provided.
- To what extent does the set contain digital signatures? Digital signatures will be maintained only if they are necessary for legal reasons. Generally, however, the metadata of the e-Depot provides sufficient guarantees to prove the authenticity and integrity of information.
- To what extent does migration / conversion occur in the source information system, and how is the quality of migrations / conversions measured? See article 25 of the Archival Regulation (Archiefregeling, in Dutch).
To support the conversation, we use preservation tools to analyse the representative dataset. Among other things, the tools can show if the file extensions and extension information from the source system’s metadata match the file formats found by the tools, if file formats are well-formed and valid, if encryption / password protection has been applied, and if the tools can help fix certain (date) metadata. The tools play no role in the subject of migration in the source system.
The tools we use are the File Information Tool Set (FITS), developed and managed by Harvard University, and Clever, Crafty, Content Profiling of Objects (c3po). C3po was developed by the Technical University of Vienna in the SCAPE project and the Benchmark DP project.
FITS “identifies, validates and extracts technical metadata for a wide range of file formats. It acts as a wrapper, invoking and managing the output from several other open source tools.” (http://projects.iq.harvard.edu/fits/introduction)
C3po “is a software tool, which uses metadata extracted from files of a digital collection as input to generate a profile of the content set.” (http://peshkira.github.io/c3po/ [SCAPE] or https://github.com/datascience/c3po [Benchmark-DP])
The FITS and c3po websites explain how the software must be installed and operated. FITS is a product, c3po is a prototype. C3po provides a useful web interface with export capabilities, but both tools are primarily command line tools. Please note that specific (ICT) knowledge is required for installing the tools and their software dependencies, and for working with command line tools. At NANETH we use the latest version of FITS and, due to certain software dependencies, the SCAPE version of c3po.
First, we use FITS and c3po on the command line to create a c3po profile for the representative dataset. Then we select that set in the c3po web interface. C3po presents graphs and other information based on the information from the profile: how many files of which file format and file format version are in the dataset? Are the files well-formed and valid? When were the files created and last changed? This video explains more c3po features: https://youtu.be/6KibTpdxQBs.
C3po provides a good first impression of the dataset. However, viewing all the information in detail requires a lot of mouse clicks. It is therefore useful that the information from c3po can (in whole or in part) be exported as a comma-separated text file. In Microsoft Excel or LibreOffice Calc we can then perform our own analysis of the data and/or create additional graphs. One example is that we use an Excel formula to show the file extension of files in a column next to the name of the file formats found by the tools. As a result, we can see at a glance if the files (according to the tools) are what they say they are (according to their extension).
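The same extension check can also be scripted instead of done with a spreadsheet formula. A minimal Python sketch, assuming a (hypothetical) two-column export with `filename` and `format` columns; the real column names and format labels depend on the c3po version and the properties selected for export:

```python
import csv
import io

# Hypothetical excerpt of a c3po CSV export; real column names and
# format names will differ per c3po version and export settings.
sample_csv = """filename,format
report.pdf,Portable Document Format
letter.pdf,Microsoft Word Document
map.tif,Tagged Image File Format
"""

# Minimal mapping from extensions to the format names we expect;
# a real list would be much longer.
EXPECTED = {
    "pdf": "Portable Document Format",
    "tif": "Tagged Image File Format",
}

def find_mismatches(rows):
    """Yield (filename, identified format) pairs whose extension does
    not match the format the tools identified."""
    for row in rows:
        ext = row["filename"].rsplit(".", 1)[-1].lower()
        expected = EXPECTED.get(ext)
        if expected is not None and expected != row["format"]:
            yield (row["filename"], row["format"])

mismatches = list(find_mismatches(csv.DictReader(io.StringIO(sample_csv))))
print(mismatches)  # [('letter.pdf', 'Microsoft Word Document')]
```

As in the spreadsheet approach, any mismatch this reports is only a signal: a human expert still has to inspect the file to confirm there is a real problem.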
How do the tools support the conversation?
In this section, we briefly discuss the preservation conversation topics and what we can conclude from the (FITS and) c3po information.
Information types and behaviour
C3po does not tell us what information types there are in the dataset, or whether some of the files contain behaviour. However, information about file formats provides some insight. Are the information objects mostly text files (PDF, DOC, ODF, etc.), or databases, software or GIS files? Datasets mainly consisting of text files and images have a low(er) preservation impact. Databases and software are more complex, may contain (interactive) behaviour, and are more difficult to preserve.
Open file formats, or common and well-supported formats, will result in a lower preservation impact than proprietary and/or infrequently used file formats. The numbers of file format versions that are identified by the tools help prioritize preservation actions and/or the need to gain (new) preservation knowledge. As a result of impact assessments, we also know if we should advise a provider to use open file formats more often (and implement the Dutch ‘comply or explain’ policy w.r.t. open standards).
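The kind of prioritisation described above amounts to counting how often each format (version) occurs. A small Python sketch; the format labels below are made up for illustration:

```python
from collections import Counter

# Hypothetical format identifications for a small dataset.
formats = [
    "PDF/A-1b", "PDF/A-1b", "PDF 1.4", "MS Word 97-2003",
    "PDF/A-1b", "TIFF 6.0", "PDF 1.4",
]

counts = Counter(formats)

# Formats sorted by frequency: the head suggests where most of the
# preservation effort pays off, the tail flags rarer formats that may
# need extra attention or new preservation knowledge.
for fmt, n in counts.most_common():
    print(fmt, n)
```

In practice the input would come from a c3po export rather than a hard-coded list, but the principle is the same: the frequency distribution makes both the bulk and the ‘long tail’ visible at a glance.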
Well-formed and valid
When the files are analysed, it is good practice to check whether they are well-formed and valid. Well-formedness is about the form (syntax), validity more about the content (semantics). XML files, for example, are well-formed if they meet a number of criteria, such as the correct opening, closing, and nesting of tags. XML files are valid if the content complies with a document type definition (DTD) or schema (XSD). DTDs and XSDs can be used to define which tags are allowed in an XML file and what information the tags may contain (text, number, date, etc.).
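For XML, the well-formedness half of this distinction can be illustrated with the Python standard library. (Validity checking against a DTD or XSD requires a schema-aware validator such as lxml, which is not shown here; the sample XML snippets are made up.)

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the XML parses (correct syntax), False otherwise.

    Well-formedness says nothing about validity: whether the content
    complies with a DTD or XSD is a separate, schema-aware check.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<record><title>Report</title></record>"))  # True
print(is_well_formed("<record><title>Report</record></title>"))  # False (bad nesting)
```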
Although not for all file formats, the tools can provide information about encryption or password protection. Encryption and password protection complicate access to the information in the files, and should be removed before transfer to NANETH. The e-Depot’s access rights system should manage any access restrictions, not the encryption of or password protection on individual files.
The tools provide information about the creation and modification dates of files. By comparing this information with date information from the source system’s metadata, mismatches can be identified. Are the dates found by the tools equal (or at least roughly similar) to the dates from the metadata? To what extent were files created well before other files, or changed well after all other files? If mismatches are detected and confirmed as problems by human experts, the metadata gathered by the tools can be used to resolve those problems.
Different tools identify the creation and modification dates of files in different ways. Therefore, the tools sometimes do not seem to agree (or export, e.g., the same date in different date formats). For this reason alone, human content experts must check the tool information and make a well-informed decision about what information is ‘the truth’. Another reason why human experts need to check the information is that a creation date can be based on the creation date of a document template or a reused file. In those cases, the file can look much older than it actually is.
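The case where two tools export the same date in different notations can be handled by normalising before comparing. A Python sketch; the two date formats listed are assumptions, and a real export may use others:

```python
from datetime import datetime

# Hypothetical date strings: one from the source system's metadata,
# one extracted from the file by a tool, in different notations.
metadata_date = "2014-03-12"
tool_date = "12/03/2014"   # day/month/year

def parse_any(value):
    """Parse a date in one of the notations we have seen the tools and
    source systems use; extend the list as new notations appear."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            pass
    raise ValueError(f"unrecognised date format: {value!r}")

# The same date written differently is not a mismatch.
print(parse_any(metadata_date) == parse_any(tool_date))  # True
```

This only removes the notation problem; deciding which of two genuinely different dates is ‘the truth’ remains a job for human experts.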
What do we learn from impact assessments? A lot about the future. In addition to asking for a representative dataset, we also ask for an overview of the total number of file formats in the source system. This helps assess the representativeness of the dataset and provides additional information about what NANETH may have to deal with in the future. The good news is that the bulk of the information is already reasonably consistent with NANETH’s preferred formats. The bad news is that there is also a ‘long tail’ of relatively small numbers of other formats.
The impact assessments show that files are not always what they say they are. In one of the impact assessments, a number of files with a .pdf extension were recognized by the tools as Microsoft Word files. Manual inspection confirmed this, and the provider was able to investigate the cause and solve the problem before the actual connection project.
What the impact assessments also show is that human beings cannot be replaced by a computer. We already saw that manual inspection was required for checking file extension mismatches. The well-formedness and validity checks also require human effort. JHOVE is one of the tools used by FITS to check the format and validity of files. JHOVE is also integrated into NANETH’s e-Depot, and used, e.g., in ingest workflows. Problems that tools like JHOVE report should be investigated by human experts. Sometimes these problems need to be solved. Sometimes they do not appear to result in a significant preservation impact. Sometimes the problems are the result of a bug in the software. This analysis is labour intensive, but NANETH does not stand alone. Through the Information 2020 (Informatie 2020, in Dutch) knowledge platforms, we are in contact with national preservation experts outside NANETH. We are also members of, and work with, the Open Preservation Foundation (OPF, www.openpreservation.org). The OPF maintains a web page with JHOVE issues and error messages, organises JHOVE hack days, and has a Document Interest Group (DIG). One of the priorities of the DIG is to “Build a knowledge base of errors in daily digital preservation activities”. Another OPF group is the Archive Interest Group (AIG), in which we investigate common archive (preservation) priorities with the Danish National Archives and the National Archives of Estonia, including quality control.
The impact assessments have not yet resulted in the identification of problems with encryption or password protection of documents. We did however discover that there is a grey area. Some PDF files have specific password-protected permissions. In some cases, you can open and print the files, but you cannot change them or add comments. In particular, the restriction that content copying can be prohibited caught our eye in one of the early impact assessments. We are investigating the extent to which this restriction could hinder future preservation actions.
The above case of incorrect file extensions is an example of how impact assessments can result in issues that need to be resolved prior to the actual connection project. Another example is about dates. In one of the impact assessments the content experts saw that the source system metadata showed the same creation date for all files. We learned that this was the date on which the files were transferred from an old document management system to a new one. Using the date information the tools extracted from the files, the content experts were able to repair the date metadata.
The FITS and c3po tools provide useful information about the more technical aspects of digital information objects. This information is a welcome addition to impact assessment conversations on the subject of preservation, and on impact assessments in general. In impact assessments, the computer cannot replace humans, but automation helps make the impact assessments more efficient.
The topics covered by the preservation conversation and preservation tool analysis are, for now, focused on (a) the more technical aspects of (b) individual files. We will undoubtedly gain more knowledge and be able to cover more preservation topics* and improve our preservation impact estimates, but the impact assessments have already resulted in better prepared connection projects. In addition, we know a lot more about how many and which types of digital information objects will be coming our way, and are more prepared for the future.
* One topic to include is a more detailed conversation about significant properties. This topic is related to one of the priorities of the AIG.