What is preservation watch?
We worry about preserving digital content, and take preservation action, because content is at risk. Risk is the potential of losing something of value, weighed against the potential of gaining something of value. In digital preservation, the risk is losing long-term and continuous access to (or usability of) content by its intended users, weighed against the cost (or profit) of maintaining that access. Because access must be both long-term and continuous, there must be a long-term, continuous process that detects when content is misaligned with the requirements of the intended users. That process is preservation watch.
In practice, preservation watch is even more complex, because long-term and continuous access are often conflicting requirements. To tackle this, an institution would normally define a "preservation format", which tries to fulfill the long-term access requirement, and create "access" or "dissemination" copies, which are optimized for the user community.
Monitoring whether content is aligned with the long-term and continuous access requirements, i.e. whether the selected preservation and access formats are still adequate, is a big endeavor that quickly becomes infeasible with large-scale content. Institutions are normally able to tackle the usual suspects, like images and text documents, but are unable to process the long tail of file formats that almost all institutions have.
Scout – a preservation watch system
http://openplanets.github.io/scout/
Scout is a preservation watch system being developed within the SCAPE project. It provides an ontological knowledge base that centralizes the information needed to detect preservation risks and opportunities. It uses plugins to allow easy integration of new sources of information, such as file format registries, tools for characterization, migration and quality assurance, policies, human knowledge and others. The knowledge base can be easily browsed, and triggers can be installed to automatically notify users of new risks and opportunities. Examples of such notifications are: content fails to conform to defined policies, a format has become obsolete, or new tools able to render your content are available.
For example, you can continuously monitor the file formats of your content and other characteristics, e.g. compression scheme. Scout can monitor your content profile over time and allow you to compare it with other institutions, see how content evolves and cross-reference that information with your policies, file format registries (like PRONOM), and any other information that can be provided to Scout.
This will give you invaluable insight into your content and how it relates to the outside world.
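To make the idea of watching a content profile over time more concrete, here is a minimal sketch in Python. It is not Scout code and does not use Scout's API; the function names, the 5% threshold and the harvest numbers are all assumptions made up for illustration. It compares the format distribution of two snapshots and reports new formats, vanished formats and large shifts in share, which is the kind of condition a trigger could notify on.

```python
from typing import Dict, List


def format_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert absolute file counts per format into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {fmt: n / total for fmt, n in counts.items()}


def detect_changes(previous: Dict[str, int],
                   current: Dict[str, int],
                   threshold: float = 0.05) -> List[str]:
    """Return notices for new formats, vanished formats and large share shifts.

    The threshold (an assumed default) is the minimum absolute change in a
    format's share of the collection that is worth reporting.
    """
    notices = []
    prev_shares = format_shares(previous)
    curr_shares = format_shares(current)
    for fmt in sorted(curr_shares):
        if fmt not in prev_shares:
            notices.append(f"new format: {fmt}")
        elif abs(curr_shares[fmt] - prev_shares[fmt]) >= threshold:
            notices.append(f"share changed: {fmt}")
    for fmt in sorted(prev_shares):
        if fmt not in curr_shares:
            notices.append(f"format disappeared: {fmt}")
    return notices


# Hypothetical file counts for two harvests (illustrative numbers only).
harvest_2011 = {"text/html": 700, "image/jpeg": 200, "image/gif": 100}
harvest_2012 = {"text/html": 600, "image/jpeg": 300, "image/png": 100}

notices = detect_changes(harvest_2011, harvest_2012)
for notice in notices:
    print(notice)
```

In a real deployment the snapshots would come from the monitored content profile rather than from hard-coded dictionaries, and the notices would be delivered through whatever notification mechanism the trigger is configured with.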
What information does Scout currently have?
Content
Scout is able to monitor the content profile, which is a summary of the content characterization. Scout fetches information about file format distribution, file size, and file characteristics like compression scheme. It does this using C3PO and FITS: you run FITS on every file of your content to get the characterization output, and then run C3PO to generate the content profile XML that can be monitored by Scout.
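The sketch below illustrates, in Python, what "a summary of the content characterization" amounts to. It does not call FITS or C3PO and does not reproduce their output formats; the record layout, field names and sample files are assumptions chosen for illustration. It only shows the aggregation step: per-file characteristics go in, a collection-level profile comes out.

```python
from collections import Counter
from typing import Dict, List, Optional, TypedDict


class FileRecord(TypedDict):
    """Hypothetical per-file characterization record (not the FITS schema)."""
    path: str
    format: str
    size: int
    compression: Optional[str]


def build_profile(records: List[FileRecord]) -> Dict[str, object]:
    """Aggregate per-file records into a collection-level content profile."""
    formats = Counter(r["format"] for r in records)
    # Files without a compression scheme are left out of this distribution.
    compression = Counter(r["compression"] for r in records if r["compression"])
    return {
        "file_count": len(records),
        "total_size": sum(r["size"] for r in records),
        "format_distribution": dict(formats),
        "compression_schemes": dict(compression),
    }


# Illustrative sample data only; real records would be parsed from the
# characterization tool's output.
records: List[FileRecord] = [
    {"path": "a.html", "format": "text/html", "size": 2048, "compression": None},
    {"path": "b.jpg", "format": "image/jpeg", "size": 51200, "compression": "JPEG"},
    {"path": "c.tif", "format": "image/tiff", "size": 102400, "compression": "LZW"},
    {"path": "d.html", "format": "text/html", "size": 1024, "compression": None},
]

profile = build_profile(records)
print(profile)
```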
Here is an example of the data gathered from a web archive collection:
Internet Memory Foundation web archive collection (harvests of a confidential domain from 2009 to 2012)
Content size (on each harvest):
Format distribution (table with latest status):
… and a long tail of other formats.
Format distribution (diagram with history on each harvest):
Compression scheme (on each harvest):
Policies
Scout allows you to upload preservation control policies expressed in an RDF model created in the SCAPE project. Check the Preservation Policy Levels in SCAPE paper for more information about the Preservation Policy model. These control policies define requirements on the content that can be automatically checked for conformance against the monitored content. For example, you can upload to Scout a policy stating that the compression scheme must be lossless, and monitor your content so that you are warned whenever a file with lossy compression is added.
To add policies, you have to log into Scout and upload your policy RDF model in the Scout dashboard.
Note that the current version of Scout does not support multiple users (so you must download and run your own instance of Scout to do this). Note also that not all policies can be checked for conformance, as they might depend on information that is not yet available, but you can add more information to Scout at any time (via source adaptors). Finally, please note that you might need to create a new trigger that cross-references a control policy with the content profile, but default triggers and common vocabularies are currently being developed to make this cross-reference easier.
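As an illustration of what cross-referencing a control policy with the content profile means, here is a minimal Python sketch of the lossless-compression example. It is not how Scout evaluates policies: the real policies are RDF models, whereas this sketch assumes the policy has already been reduced to a plain set of allowed compression schemes. The function name, the allowed set and the counts are all hypothetical.

```python
from typing import Dict, Set


def find_violations(compression_counts: Dict[str, int],
                    allowed: Set[str]) -> Dict[str, int]:
    """Return the compression schemes (and file counts) the policy does not allow."""
    return {
        scheme: count
        for scheme, count in compression_counts.items()
        if scheme not in allowed and count > 0
    }


# Assumed policy: only these (lossless or uncompressed) schemes are acceptable.
policy_allowed = {"LZW", "Deflate", "Uncompressed"}

# Hypothetical compression distribution taken from a content profile.
profile_compression = {"LZW": 120, "JPEG": 30, "Uncompressed": 50}

violations = find_violations(profile_compression, policy_allowed)
if violations:
    for scheme, count in sorted(violations.items()):
        print(f"policy violation: {count} file(s) use {scheme} compression")
else:
    print("content conforms to the compression policy")
```

Expressing the policy as an allow-list rather than a list of forbidden schemes means that a scheme never seen before is flagged by default, which suits a watch process that is meant to surface surprises.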
Registries
Scout currently monitors the PRONOM registry via its SPARQL endpoint. It currently lists 843 file formats:
Web
An experiment with Automatic Preservation Watch using Information Extraction on the Web was presented at the last iPRES conference (2013). In this experiment, the journals provided by a publisher are automatically extracted from the Web through focused crawls (using journal and publisher names), and relations are derived from natural language statements using information extraction tools.
In the experiment, 500,000 web pages containing about 18 million sentences were crawled, which resulted in 2,000 journal titles and 500 journal-publisher relations. The results were then compared with eDepot and the Keepers registry.
Comparing the results with eDepot, we found that 86% of the gathered journal titles were not in eDepot and should be added, 10% were already registered and 4% were false positives. Manually comparing a sample of the results with the Keepers registry, we also estimate that around 50% of all journal-publisher relationships found were already registered, and 35% needed to be added. We also estimate that there are more false positives in the journal-publisher results, because detecting journal (title)-publisher (name) relations is more complex and error-prone than just detecting journal titles.
This experiment demonstrates that information extraction technologies can be a good complement to registries and can even serve as a substitute information source when no registries exist on some subject. Nevertheless, some work is needed to reduce the error rate; several suggestions on how to do this are available in the paper.
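To put the percentages above into absolute terms, the following Python snippet simply applies them to the totals reported for the experiment (2,000 journal titles and 500 journal-publisher relations). The resulting counts are approximations derived here, not figures from the paper, and the relation percentages were themselves estimated from a manual sample, so they are rougher still. The text does not say what the remaining share of relations represents, so it is only labeled as unaccounted for.

```python
from typing import Dict


def apply_percentages(total: int, percentages: Dict[str, int]) -> Dict[str, int]:
    """Turn whole-number percentages into approximate absolute counts."""
    return {label: total * pct // 100 for label, pct in percentages.items()}


# Totals and percentages as stated in the text.
title_counts = apply_percentages(2000, {
    "not in eDepot (should be added)": 86,
    "already registered": 10,
    "false positives": 4,
})

relation_counts = apply_percentages(500, {
    "already registered": 50,
    "needed to be added": 35,
})
# Share of relations not covered by the two stated estimates.
relation_counts["unaccounted for in the text"] = 500 - sum(relation_counts.values())

for label, count in title_counts.items():
    print(f"journal titles, {label}: about {count}")
for label, count in relation_counts.items():
    print(f"journal-publisher relations, {label}: about {count}")
```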
How can I use Scout?
Plans exist to create a central instance for Scout, which could serve as a central hub for digital preservation information. For now, there is no such central instance, but you can check out the demonstration instance at http://scout.scape.keep.pt (please be aware that this is a development/demonstration site and may go down at any moment).
You can also download and install your own instance of Scout, gather information and monitor your content. To learn how, check the development site: http://openplanets.github.io/scout/
Finally, you can send us your content profile and become an early adopter. To learn more, please contact me at lfaria[AT]keep.pt