Summary of Outputs and Roadmap (Feb 2012)


Since joining the project in July 2011 I have focused on aligning a number of different groups and outputs to be consistent and maintainable into the future. In this way I feel my role is not only to support OPF but to use it as a platform to support the ongoing digital preservation goals of others outside of the immediate OPF and SCAPE project communities.

As a result, some of the technical progress can be slow; more time is spent waiting, evaluating the problems that everyone has, and encouraging collaboration on similar solutions.


One of my first aims within OPF was to address sustainable support for tools. Like all good digital preservation systems, proper preservation starts with user education, in this case of the developers producing those tools. By encouraging the digital preservation community to adopt software packaging standards via Linux and Windows distributions (not via some hooky version of Java please!), we open up the potential community of people who can support the package in the future.

Sustainable software starts with people being able to install and evaluate the existing software; a software product dies if no one can use it. At that point people will simply code an alternative that may or may not conform to the original specification and intended use.

The starting point for packaging has been Debian/Ubuntu and Red Hat/Fedora Linux packages, and have been set up to provide the package mirrors for each platform. This allows features such as auto-upgrade to be enabled, so that users can not only install each software package in one click, but also keep up to date with the latest releases.

Doing this has also resulted in changes to the way people think about releasing software and how upgrades are carried out. From version numbering to changelogs, packaging software in industry/community-standard ways can only be a good thing and provides the best route to sustainability.

On that note I have also put together GitHub2ChangeLog (gh2ch), which takes a GitHub commit history and gives you back a fully structured changelog compatible with the stringent Debian packaging requirements.
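The idea behind gh2ch can be sketched as follows: take a list of commit messages (which the real tool pulls from the GitHub API) and render them as a `debian/changelog` stanza in the strict format Debian tooling expects. The function name, sample commits, and maintainer details below are invented for illustration; the actual behaviour of gh2ch may differ.

```python
# Illustrative sketch of the gh2ch idea: turn GitHub commit messages into a
# Debian-format changelog stanza (see deb-changelog(5) for the real rules).
# The sample data and function names here are hypothetical.

def debian_changelog(package, version, commits, maintainer, date):
    """Render one debian/changelog stanza from a list of commit messages."""
    lines = ["%s (%s) unstable; urgency=low" % (package, version), ""]
    for message in commits:
        # Changelog entries are indented two spaces and bulleted with "* ";
        # only the first line of each commit message is used here.
        lines.append("  * %s" % message.splitlines()[0])
    lines.append("")
    # Trailer: space, two dashes, maintainer, then TWO spaces before the
    # RFC 2822 date -- Debian's parsers are strict about this layout.
    lines.append(" -- %s  %s" % (maintainer, date))
    return "\n".join(lines)

if __name__ == "__main__":
    print(debian_changelog(
        "gh2ch", "0.1-1",
        ["Add GitHub commit parser", "Fix trailing whitespace in output"],
        "Dave Tarrant <dave@example.org>",
        "Tue, 28 Feb 2012 12:00:00 +0000",
    ))
```

Getting this trailer layout right (down to the double space) is exactly the sort of fiddly requirement that makes an automated tool worthwhile.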

This is currently available (it will be moved) at and as a Debian package on the OPF deb repository.

Package Building Infrastructure

In order to support the building of packages, a task which can often require a heavily customised machine (to build a package which can then go on any machine), OPF has begun making virtual machine images in Amazon which can be “fired up” for very short, very cheap amounts of time by anyone in order to build and submit the latest version of their software to the central repository. These machines can also act as a good test environment, and many machines can be started from the same image at the same time, reducing the IT overhead of maintaining physical machines.

At the time of writing the jpylyzer build AMIs (32 and 64-bit) are available from the EU-WEST (Ireland) location at

The OPF Website Software Section

Another route to software usage and sustainability is to better market software by giving it a user-friendly home. While GitHub and other revision-control websites are great for sharing and collaborating on code, they are not the best place for actual users to find out about the software and download a version for their computer. Unless you understand GitHub, these sites can be complex and misleading.

To help solve this problem, work has begun on a software section of the OPF website (see the software tab above) to give each software output a home page. Data to populate this page is brought in directly from GitHub and simply displayed in a better, more Google-friendly style. So we not only get better software pages, but they are auto-updated, meaning no one has to change the way they work with GitHub; they just have to fill in more detail to populate the software page with more than just a sentence.
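The "brought in directly from GitHub" step amounts to reading a repository's metadata from the GitHub API and picking out the fields a home page needs. A minimal sketch, using a hard-coded, invented sample shaped like the JSON the GitHub repository API returns (the field names are real API fields; the values and the `software_page_fields` helper are illustrative):

```python
import json

# Trimmed, invented sample of a GitHub repository API response
# (https://api.github.com/repos/<owner>/<repo>); only the fields a
# software home page would display are kept.
SAMPLE = """{
  "name": "jpylyzer",
  "description": "JP2 validator and properties extractor",
  "html_url": "https://github.com/openplanets/jpylyzer",
  "updated_at": "2012-02-20T10:15:00Z"
}"""

def software_page_fields(payload):
    """Map a repository API payload onto the fields the software page shows."""
    repo = json.loads(payload)
    return {
        "title": repo["name"],
        "summary": repo["description"],
        "source": repo["html_url"],
        "last_updated": repo["updated_at"],
    }

if __name__ == "__main__":
    for key, value in sorted(software_page_fields(SAMPLE).items()):
        print("%s: %s" % (key, value))
```

Because the data comes straight from the API, the page stays current whenever the repository changes, with no extra work from the developer.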


In order to evaluate data (e.g. files), the data is first required under an open licence so that it can be made available.

As a result I have already gathered together the old Planets corpora and the New Zealand GovDocs dataset and made these available at In addition to just mirroring the files, I have also re-zipped the GovDocs corpora by extension so it is possible to download just a single file type. There are still many terabytes left on the data server which can be filled with more files in the future, so this is also an ongoing task.
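The re-zip-by-extension step can be sketched as: walk the corpus, bucket files by extension, and write one zip per extension so users can fetch a single file type. The paths and the `govdocs-<ext>.zip` naming below are assumptions for illustration, not the actual layout on the data server.

```python
import os
import tempfile
import zipfile
from collections import defaultdict

def rezip_by_extension(corpus_dir, out_dir):
    """Write one zip per file extension found under corpus_dir.

    Illustrative sketch only; the real corpus layout and naming may differ.
    """
    buckets = defaultdict(list)
    for root, _dirs, files in os.walk(corpus_dir):
        for name in files:
            ext = os.path.splitext(name)[1].lstrip(".").lower() or "noext"
            buckets[ext].append(os.path.join(root, name))
    written = []
    for ext, paths in sorted(buckets.items()):
        zpath = os.path.join(out_dir, "govdocs-%s.zip" % ext)
        with zipfile.ZipFile(zpath, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in paths:
                # Store paths relative to the corpus root inside the zip.
                zf.write(path, os.path.relpath(path, corpus_dir))
        written.append(zpath)
    return written

if __name__ == "__main__":
    corpus, out = tempfile.mkdtemp(), tempfile.mkdtemp()
    for name in ("000001.pdf", "000002.pdf", "000003.csv"):
        open(os.path.join(corpus, name), "w").write("sample")
    for z in rezip_by_extension(corpus, out):
        print(os.path.basename(z))
```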

Looking at file identification tools, people wanted to know the differences between the DROID signature files, e.g. what is added each time. This first requires a back catalogue of the signature files, something it turns out even The National Archives (UK) didn’t have. Two days and a number of tweets later, crowdsourcing came to the rescue and I made a comparison service available at
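The core of such a comparison is simple: parse two signature-file releases and report which PUIDs (PRONOM unique identifiers) appear in the newer one but not the older. The XML below is a heavily simplified, invented stand-in for the real DROID signature file format (which is namespaced and far larger), and the helper names are mine, not the comparison service's.

```python
import xml.etree.ElementTree as ET

# Simplified, invented stand-ins for two DROID signature file releases.
OLD = """<FFSignatureFile Version="42">
  <FileFormatCollection>
    <FileFormat PUID="fmt/1" Name="Broadcast WAVE"/>
    <FileFormat PUID="fmt/11" Name="PNG 1.0"/>
  </FileFormatCollection>
</FFSignatureFile>"""

NEW = """<FFSignatureFile Version="45">
  <FileFormatCollection>
    <FileFormat PUID="fmt/1" Name="Broadcast WAVE"/>
    <FileFormat PUID="fmt/11" Name="PNG 1.0"/>
    <FileFormat PUID="fmt/44" Name="JPEG File Interchange Format"/>
  </FileFormatCollection>
</FFSignatureFile>"""

def puids(signature_xml):
    """Map PUID -> format name for every FileFormat in a signature file."""
    root = ET.fromstring(signature_xml)
    return {ff.get("PUID"): ff.get("Name") for ff in root.iter("FileFormat")}

def added_formats(old_xml, new_xml):
    """Formats present in the new release but absent from the old one."""
    old, new = puids(old_xml), puids(new_xml)
    return {puid: name for puid, name in new.items() if puid not in old}

if __name__ == "__main__":
    for puid, name in sorted(added_formats(OLD, NEW).items()):
        print(puid, name)
```

The same set difference, run the other way, would show formats that were removed between releases.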


Linked Data…data…data

In addition to simply sourcing files, there is high demand for data relating to software, experiments and analysis carried out in the area of digital preservation. There are also a large number of projects producing data which is distributed and hard to access. By creating a linked-data endpoint for OPF (, these datasets can be easily published, accessed and re-used by many.

The data endpoint is backed entirely by existing software and, when complete, will provide a style endpoint for data relating to digital preservation.

As well as providing an endpoint, the Linked Data Simple Storage Specification (lds3) provides a clear HTTP CRUD-based specification for managing data within the system. This specification (available at by the end of Q1 2012) outlines the requirement on data storage nodes to automatically version and apply provenance information to any incoming data. will be the first endpoint supporting this specification and will see some exciting developments over the coming year to support full graph querying on linked data, which allows you to:

  • Find out who published each individual bit of data
  • Resolve conflicts in the data
  • Eliminate un-trusted graphs in queries
  • Roll back data
  • Query the data which existed in the past
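The features above all fall out of one design decision: storing each submission as its own named graph with provenance attached, rather than merging everything into a single pool of triples. A toy sketch of that idea (this is not the lds3 implementation; the store layout, publisher names, and triples are invented):

```python
from collections import namedtuple

# Toy quad-store sketch: each named graph carries its own provenance
# (publisher, trust flag), so queries can report who said what and can
# skip un-trusted graphs entirely. All data here is invented.
Graph = namedtuple("Graph", ["publisher", "trusted", "triples"])

store = {
    "urn:graph:1": Graph("alice", True,
                         {("fmt/44", "hasName", "JFIF")}),
    "urn:graph:2": Graph("bob", False,
                         {("fmt/44", "hasName", "JPEG Interchange")}),
}

def query(store, predicate, trusted_only=False):
    """Return (publisher, subject, object) for every matching triple;
    with trusted_only=True, un-trusted graphs are eliminated."""
    results = []
    for _uri, graph in sorted(store.items()):
        if trusted_only and not graph.trusted:
            continue  # eliminate un-trusted graphs from the query
        for s, p, o in graph.triples:
            if p == predicate:
                # Provenance comes for free: each answer names its publisher.
                results.append((graph.publisher, s, o))
    return results

if __name__ == "__main__":
    print(query(store, "hasName"))                      # both publishers
    print(query(store, "hasName", trusted_only=True))   # trusted graphs only
```

Conflict resolution and roll-back follow the same pattern: because each graph is kept separate and versioned, conflicting statements can be attributed to their sources, and an older state of the data is just an older set of graphs.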

The Future

The next six months will see the population of these systems, bringing online the framework for the future of digital preservation. After that, the aims are all about usage of these systems and performing experiments using the linked data to prove the value of the systems and the people who populate them. It is in this period that the services provided by OPF will start to impact the lives of real digital preservation practitioners.

















  1. Bill Roberts
    February 22, 2012 @ 4:07 pm CET

    Maurice pointed me to

  2. davetaz
    February 28, 2012 @ 11:38 pm CET

Note that this spec, while online, is not final before the end of March 2012. I have put it online so Google can index it but I am not declaring it supported until I have implemented and tested the sandbox. Of course comments are welcome but I would advise patience regarding reference implementations and exemplars until these are officially announced via the OPF website.

  3. Bill Roberts
    February 21, 2012 @ 10:34 am CET

Hi Dave – I’m looking forward to seeing when you get it going – sounds great. I haven’t previously come across the Linked Data Simple Storage Specification – is that available online somewhere? (Your post was the only mention of it I could find on Google.)

Is it similar to the SPARQL graph store HTTP protocol?




