FITS Blitz

FITS is a classic case of a great digital preservation tool that was developed with an initial injection of resource and that its creator (Harvard University) has since struggled to maintain. But let me be very clear: Harvard deserves no blame for this situation. They've created a tool that many in our community have found particularly useful, but they have been left to maintain it largely on their own.

Wouldn't it be great if different individuals and organisations in our community could all chip in to maintain and enhance the tool? Wrap new tools, upgrade outdated versions of existing tools, and so on? Well, many have started to do this, including some injections of effort from my own project, SPRUCE. What a lovely situation to be in, seeing the community come together to drive this tool forward…

Unfortunately we were perhaps a little naive about the effort and mechanics needed to make this happen as a genuine open source development. FITS is a complex beast, wrapping a good number of tools that extract a multitude of information about your files which is then normalised by FITS. What happens when you tweak one bit of code? Does the rest of the codebase still work as it should? Obviously you need to have confidence in a tool if it plays a critical role in your preservation infrastructure.

From the point of view of the SPRUCE Project, we'd like to see all the latest tweaks and enhancements to FITS brought together so that the practitioners we're supporting get a more effective tool. But we equally want future improvements to find their way into the codebase in a managed and dependable way, so that upgrading to a new FITS version doesn't involve lots of testing for every organisation using it.

So in partnership with Harvard and the Open Planets Foundation (with support from Creative Pragmatics), SPRUCE is supporting a two-week project to get the technical infrastructure in place to make FITS genuinely maintainable by the community. "FITS Blitz" will merge the existing code branches and establish a comprehensive testing setup so that further code developments only find their way in when there is confidence that other bits of functionality haven't been damaged by the changes.
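To make the idea of a testing setup a little more concrete, here is a minimal sketch in Python of the kind of regression check it might run: compare what a tool reports for each file in a test corpus against a stored set of expected values, and flag any file whose output has changed. This is not the FITS test harness. The names `diff_metadata`, `run_regression`, `GROUND_TRUTH` and `stub_extract`, the file paths and the metadata fields are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, List

Metadata = Dict[str, str]


def diff_metadata(expected: Metadata, actual: Metadata) -> List[str]:
    """Return human-readable differences between expected and actual fields."""
    problems = []
    for key in sorted(set(expected) | set(actual)):
        if key not in actual:
            problems.append(f"missing field {key!r}")
        elif key not in expected:
            problems.append(f"unexpected field {key!r}")
        elif expected[key] != actual[key]:
            problems.append(
                f"{key!r}: expected {expected[key]!r}, got {actual[key]!r}"
            )
    return problems


def run_regression(
    ground_truth: Dict[str, Metadata],
    extract: Callable[[str], Metadata],
) -> Dict[str, List[str]]:
    """Run `extract` on every corpus file and return only the files that differ."""
    failures = {}
    for path, expected in ground_truth.items():
        problems = diff_metadata(expected, extract(path))
        if problems:
            failures[path] = problems
    return failures


# Hypothetical ground truth: what we have agreed the tool should report.
GROUND_TRUTH = {
    "corpus/report.pdf": {"mimetype": "application/pdf"},
    "corpus/photo.tif": {"mimetype": "image/tiff"},
}


def stub_extract(path: str) -> Metadata:
    """Stand-in for the real tool; a real harness would invoke it here."""
    return dict(GROUND_TRUTH[path])


if __name__ == "__main__":
    failures = run_regression(GROUND_TRUTH, stub_extract)
    for path, problems in failures.items():
        print(path, problems)
    print("regressions found:", len(failures))
```

A change would only be merged when a check like this reports no regressions. The hard part, of course, is agreeing on the expected values in the first place.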

FITS Blitz commences next Monday. Please get in touch with me, or Carl Wilson from the Open Planets Foundation, if you'd like to find out more.

12 Comments

lfaria
November 7, 2013 @ 12:50 pm CET

I do agree; I would not want to create yet another codebase. This is only an exploratory implementation that aims to evaluate its results against other existing implementations, such as the Apache ODF Validator. What is done in our implementation, and suggested in the 2009 guideline, is to check the XML files against RELAX NG schemas, which are quite up to date.

But our current problem is: where does the truth lie? At first we thought that LibreOffice would always produce correct and valid ODF files, and that these could be treated as the ground truth for valid files. But Apache ODF Validator seems not to agree with LibreOffice on what constitutes a valid ODF file. Without knowing whom to trust, we created our own validation tool and will try to ascertain for ourselves who is right. Today's results actually seem to point to LibreOffice as the culprit, and we might choose Apache ODF Validator after all, but we still need more tests to be sure.

We also tried to compare the results with the output of online validation tools such as OpenDocument Fellowship and RHCloud ODF Validator, but the question of where the truth lies still remains.

This is why we think that testing FITS against a test corpus with well-defined ground truth is so important. But creating a good test corpus and ensuring the quality of the ground truth is not easy, as we can see from this ODF validation example. That is why I would like to bring some attention to this tool and call upon the community to help this project go further.

lfaria
November 7, 2013 @ 10:40 am CET

We experimented with Apache ODF Validator and with Office-o-tron and they did not function correctly. Apache ODF Validator gives many false negatives and false positives, and Office-o-tron is terribly slow. But as the schemas are publicly available and the method to validate is well defined by OASIS, we just quickly developed a new implementation that will be available soon at keeps-validator-odf.
