Summary of Outputs and Roadmap (Feb 2012)


Since joining the project in July 2011 I have focused on aligning a number of different groups and outputs to be consistent and maintainable into the future. In this way I feel my role is not only to support OPF but to use it as a platform to support the ongoing digital preservation goals of others outside of the immediate OPF and SCAPE project communities.

As a result, some of the technical progress can be slow; more time is spent waiting, evaluating the problems that everyone has, and encouraging collaboration on similar solutions.


One of my first aims within OPF was to address sustainable support for tools. Like all good digital preservation systems, proper preservation starts with user education, in this case of the developers producing those tools. By encouraging the digital preservation community to adopt software packaging standards via Linux and Windows distributions (not via some hooky version of Java please!), we open up the potential community of people who can support the package in the future.

Sustainable software starts with people being able to install and evaluate the existing software; a software product dies if no one can use it. At that point people will simply code an alternative that may or may not conform to the original specification and intended use.

The starting point for packaging has been Debian/Ubuntu and Red Hat/Fedora Linux packages, and have been set up to provide the package mirrors for each platform. This allows features such as auto-upgrade to be enabled, so that users can not only install each software package in one click, but also keep up to date with the latest releases.

Doing this has also resulted in changes to the way people think about releasing software and how upgrades are carried out. From version numbering to changelogs, packaging software in industry/community-standard ways can only be a good thing and provides the best route to sustainability.

On that note I have also put together GitHub2ChangeLog (gh2ch), which takes a GitHub commit history and gives you back a fully structured changelog compatible with the stringent Debian packaging requirements.
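The idea behind gh2ch can be sketched as follows: take a list of commit messages (which the real tool pulls from the GitHub API) and render them as a `debian/changelog` stanza in the strict format Debian tooling expects. The function name, sample commits, and maintainer details below are invented for illustration; the actual behaviour of gh2ch may differ.

```python
# Illustrative sketch of the gh2ch idea: turn GitHub commit messages into a
# Debian-format changelog stanza (see deb-changelog(5) for the real rules).
# The sample data and function names here are hypothetical.

def debian_changelog(package, version, commits, maintainer, date):
    """Render one debian/changelog stanza from a list of commit messages."""
    lines = ["%s (%s) unstable; urgency=low" % (package, version), ""]
    for message in commits:
        # Changelog entries are indented two spaces and bulleted with "* ";
        # only the first line of each commit message is used here.
        lines.append("  * %s" % message.splitlines()[0])
    lines.append("")
    # Trailer: space, two dashes, maintainer, then TWO spaces before the
    # RFC 2822 date -- Debian's parsers are strict about this layout.
    lines.append(" -- %s  %s" % (maintainer, date))
    return "\n".join(lines)

if __name__ == "__main__":
    print(debian_changelog(
        "gh2ch", "0.1-1",
        ["Add GitHub commit parser", "Fix trailing whitespace in output"],
        "Dave Tarrant <dave@example.org>",
        "Tue, 28 Feb 2012 12:00:00 +0000",
    ))
```

Getting this trailer layout right (down to the double space) is exactly the sort of fiddly requirement that makes an automated tool worthwhile.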

This is currently available (it will be moved) at and as a Debian package on the OPF deb repository.

Package Building Infrastructure

In order to support the building of packages, a task which can often require a heavily customised machine (to build a package which can then go on any machine), OPF has begun making virtual machine images in Amazon which can be “fired up” for very short, very cheap amounts of time by anyone in order to build and submit the latest version of their software to the central repository. These machines can also act as a good test environment, and many machines can be started from the same image at the same time, reducing the IT overhead of maintaining physical machines.

At the time of writing the jpylyzer build AMIs (32 and 64-bit) are available from the EU-WEST (Ireland) location at

The OPF Website Software Section

Another route to software usage and sustainability is to better market software by giving it a user-friendly home. While GitHub and other revision-control websites are great for sharing and collaborating on code, they are not the best place for actual users to find out about the software and download a version for their computer. Unless you understand GitHub, these sites can be complex and misleading.

To help solve this problem, work has begun on a software section of the OPF website (see the software tab above) to give each software output a home page. Data to populate this page is brought in directly from GitHub and simply displayed in a better, more Google-friendly style. So we not only get better software pages, but they are auto-updated, meaning no one has to change the way they work with GitHub; they just have to fill in more detail to populate the software page with more than just a sentence.
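The "brought in directly from GitHub" step amounts to reading a repository's metadata from the GitHub API and picking out the fields a home page needs. A minimal sketch, using a hard-coded, invented sample shaped like the JSON the GitHub repository API returns (the field names are real API fields; the values and the `software_page_fields` helper are illustrative):

```python
import json

# Trimmed, invented sample of a GitHub repository API response
# (https://api.github.com/repos/<owner>/<repo>); only the fields a
# software home page would display are kept.
SAMPLE = """{
  "name": "jpylyzer",
  "description": "JP2 validator and properties extractor",
  "html_url": "https://github.com/openplanets/jpylyzer",
  "updated_at": "2012-02-20T10:15:00Z"
}"""

def software_page_fields(payload):
    """Map a repository API payload onto the fields the software page shows."""
    repo = json.loads(payload)
    return {
        "title": repo["name"],
        "summary": repo["description"],
        "source": repo["html_url"],
        "last_updated": repo["updated_at"],
    }

if __name__ == "__main__":
    for key, value in sorted(software_page_fields(SAMPLE).items()):
        print("%s: %s" % (key, value))
```

Because the data comes straight from the API, the page stays current whenever the repository changes, with no extra work from the developer.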


In order to evaluate data (e.g. files), the data is first required under an open licence so that it can be made available.

As a result I have already gathered together the old Planets corpora and the New Zealand GovDocs dataset and made these available at In addition to just mirroring the files, I have also re-zipped the GovDocs corpora by extension so it is possible to download just a single file type. There are still many terabytes left on the data server which can be filled with more files in the future, so this is also an ongoing task.
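The re-zip-by-extension step can be sketched as: walk the corpus, bucket files by extension, and write one zip per extension so users can fetch a single file type. The paths and the `govdocs-<ext>.zip` naming below are assumptions for illustration, not the actual layout on the data server.

```python
import os
import tempfile
import zipfile
from collections import defaultdict

def rezip_by_extension(corpus_dir, out_dir):
    """Write one zip per file extension found under corpus_dir.

    Illustrative sketch only; the real corpus layout and naming may differ.
    """
    buckets = defaultdict(list)
    for root, _dirs, files in os.walk(corpus_dir):
        for name in files:
            ext = os.path.splitext(name)[1].lstrip(".").lower() or "noext"
            buckets[ext].append(os.path.join(root, name))
    written = []
    for ext, paths in sorted(buckets.items()):
        zpath = os.path.join(out_dir, "govdocs-%s.zip" % ext)
        with zipfile.ZipFile(zpath, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in paths:
                # Store paths relative to the corpus root inside the zip.
                zf.write(path, os.path.relpath(path, corpus_dir))
        written.append(zpath)
    return written

if __name__ == "__main__":
    corpus, out = tempfile.mkdtemp(), tempfile.mkdtemp()
    for name in ("000001.pdf", "000002.pdf", "000003.csv"):
        open(os.path.join(corpus, name), "w").write("sample")
    for z in rezip_by_extension(corpus, out):
        print(os.path.basename(z))
```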

Looking at file identification tools, people wanted to know the differences between the DROID signature files, e.g. what is added each time. This first requires a back catalogue of the signature files, something it turns out even The National Archives (UK) didn’t have. Two days and a number of tweets later, crowdsourcing came to the rescue and I made a comparison service available at
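The core of such a comparison is simple: parse two signature-file releases and report which PUIDs (PRONOM unique identifiers) appear in the newer one but not the older. The XML below is a heavily simplified, invented stand-in for the real DROID signature file format (which is namespaced and far larger), and the helper names are mine, not the comparison service's.

```python
import xml.etree.ElementTree as ET

# Simplified, invented stand-ins for two DROID signature file releases.
OLD = """<FFSignatureFile Version="42">
  <FileFormatCollection>
    <FileFormat PUID="fmt/1" Name="Broadcast WAVE"/>
    <FileFormat PUID="fmt/11" Name="PNG 1.0"/>
  </FileFormatCollection>
</FFSignatureFile>"""

NEW = """<FFSignatureFile Version="45">
  <FileFormatCollection>
    <FileFormat PUID="fmt/1" Name="Broadcast WAVE"/>
    <FileFormat PUID="fmt/11" Name="PNG 1.0"/>
    <FileFormat PUID="fmt/44" Name="JPEG File Interchange Format"/>
  </FileFormatCollection>
</FFSignatureFile>"""

def puids(signature_xml):
    """Map PUID -> format name for every FileFormat in a signature file."""
    root = ET.fromstring(signature_xml)
    return {ff.get("PUID"): ff.get("Name") for ff in root.iter("FileFormat")}

def added_formats(old_xml, new_xml):
    """Formats present in the new release but absent from the old one."""
    old, new = puids(old_xml), puids(new_xml)
    return {puid: name for puid, name in new.items() if puid not in old}

if __name__ == "__main__":
    for puid, name in sorted(added_formats(OLD, NEW).items()):
        print(puid, name)
```

The same set difference, run the other way, would show formats that were removed between releases.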


Linked Data…data…data

In addition to simply sourcing files, there is high demand for data relating to software, experiments and analysis carried out in the area of digital preservation. There are also a large number of projects producing data which is distributed and hard to access. By creating a linked-data endpoint for OPF (, these datasets can be easily published, accessed and re-used by many.

The data endpoint is backed entirely by existing software and, when complete, will provide a style endpoint for data relating to digital preservation.

As well as providing an endpoint, the Linked Data Simple Storage Specification (lds3) provides a clear HTTP CRUD-based specification for managing data within the system. This specification (available at by the end of Q1 2012) outlines the requirement on data storage nodes to automatically version and apply provenance information to any incoming data. will be the first endpoint supporting this specification and will see some exciting developments over the coming year to support full graph querying on linked data, which allows you to:

  • Find out who published each individual bit of data
  • Resolve conflicts in the data
  • Eliminate un-trusted graphs in queries
  • Roll back data
  • Query the data which existed in the past
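The features above all fall out of one design decision: storing each submission as its own named graph with provenance attached, rather than merging everything into a single pool of triples. A toy sketch of that idea (this is not the lds3 implementation; the store layout, publisher names, and triples are invented):

```python
from collections import namedtuple

# Toy quad-store sketch: each named graph carries its own provenance
# (publisher, trust flag), so queries can report who said what and can
# skip un-trusted graphs entirely. All data here is invented.
Graph = namedtuple("Graph", ["publisher", "trusted", "triples"])

store = {
    "urn:graph:1": Graph("alice", True,
                         {("fmt/44", "hasName", "JFIF")}),
    "urn:graph:2": Graph("bob", False,
                         {("fmt/44", "hasName", "JPEG Interchange")}),
}

def query(store, predicate, trusted_only=False):
    """Return (publisher, subject, object) for every matching triple;
    with trusted_only=True, un-trusted graphs are eliminated."""
    results = []
    for _uri, graph in sorted(store.items()):
        if trusted_only and not graph.trusted:
            continue  # eliminate un-trusted graphs from the query
        for s, p, o in graph.triples:
            if p == predicate:
                # Provenance comes for free: each answer names its publisher.
                results.append((graph.publisher, s, o))
    return results

if __name__ == "__main__":
    print(query(store, "hasName"))                      # both publishers
    print(query(store, "hasName", trusted_only=True))   # trusted graphs only
```

Conflict resolution and roll-back follow the same pattern: because each graph is kept separate and versioned, conflicting statements can be attributed to their sources, and an older state of the data is just an older set of graphs.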

The Future

The next six months will see the population of these systems, bringing online the framework for the future of digital preservation. After that, the aims are all about usage of these systems and performing experiments using the linked data to prove the value of the systems and the people who populate them. It is in this period that the services provided by OPF will start to impact the lives of real digital preservation practitioners.

















  1. Bill Roberts
    February 22, 2012 @ 4:07 pm CET

    Maurice pointed me to

  2. davetaz
    February 28, 2012 @ 11:38 pm CET

Note that this spec, while online, is not final before the end of March 2012. I have put it online so Google can index it but I am not declaring it supported until I have implemented and tested the sandbox. Of course comments are welcome but I would advise patience regarding reference implementations and exemplars until these are officially announced via the OPF website.

  3. Bill Roberts
    February 21, 2012 @ 10:34 am CET

Hi Dave – I’m looking forward to seeing when you get it going – sounds great. I haven’t previously come across the Linked Data Simple Storage Specification – is that available online somewhere? (Your post was the only mention of it I could find on Google.)

Is it similar to the SPARQL graph store HTTP protocol?




