FITS Blitz

FITS is a classic case of a great digital preservation tool that was developed with an initial injection of resources, and whose creator (Harvard University) has subsequently struggled to maintain it. But let me be very clear: Harvard deserves no blame for this situation. They created a tool that many in our community have found particularly useful, and have been left to maintain it largely on their own.
 
Wouldn't it be great if different individuals and organisations in our community could all chip in to maintain and enhance the tool? Wrap new tools, upgrade outdated versions of existing tools, and so on? Well, many have started to do this, including some injections of effort from my own project, SPRUCE. What a lovely situation to be in, seeing the community come together to drive this tool forward…
 

Unfortunately we were perhaps a little naive about the effort and mechanics needed to make this happen as a genuine open source development. FITS is a complex beast, wrapping a good number of tools that extract a multitude of information about your files, which FITS then normalises. What happens when you tweak one bit of code? Does the rest of the codebase still work as it should? Obviously you need to have confidence in a tool if it plays a critical role in your preservation infrastructure.
 
From the point of view of the SPRUCE Project, we'd like to see all the latest tweaks and enhancements to FITS brought together so that the practitioners we're supporting get a more effective tool. But we equally want future improvements to find their way into the codebase in a managed and dependable way, so that upgrading to a new FITS version doesn't involve lots of testing for every organisation using it.
 
So in partnership with Harvard and the Open Planets Foundation (with support from Creative Pragmatics), SPRUCE is supporting a two week project to get the technical infrastructure in place to make FITS genuinely maintainable by the community. "FITS Blitz" will merge the existing code branches and establish a comprehensive testing setup so that further code developments only find their way in when there is confidence that other bits of functionality haven't been damaged by the changes.
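To make that concrete, the kind of safety net we have in mind is a suite of regression tests that run FITS against known sample files and fail if the reports change unexpectedly. The sketch below is purely illustrative: the fits.sh location, sample file and expected MIME type are placeholder assumptions, and a real suite would compare normalised FITS XML output against stored baselines rather than a single string.

```java
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.junit.Test;

/**
 * Regression-style check: run the FITS command line on a known sample file
 * and verify the report still contains the expected identification.
 * Paths and expected values are placeholders, not part of FITS itself.
 */
public class FitsRegressionTest {

    private static final String FITS_SCRIPT = "/opt/fits/fits.sh";             // hypothetical install location
    private static final String SAMPLE_FILE = "src/test/resources/sample.pdf"; // hypothetical corpus file
    private static final String OUTPUT_FILE = "target/sample-fits.xml";

    @Test
    public void identifiesSamplePdf() throws Exception {
        // Invoke FITS as a user would: fits.sh -i <input> -o <output>
        Process fits = new ProcessBuilder(FITS_SCRIPT, "-i", SAMPLE_FILE, "-o", OUTPUT_FILE)
                .inheritIO()
                .start();
        assertEquals("FITS should exit cleanly", 0, fits.waitFor());

        // Coarse baseline check: the report should still identify the sample as a PDF.
        // A fuller suite would diff normalised XML, ignoring timestamps and tool versions.
        Path output = Paths.get(OUTPUT_FILE);
        assertTrue("FITS produced no output file", Files.exists(output));
        String report = new String(Files.readAllBytes(output), StandardCharsets.UTF_8);
        assertTrue("Expected MIME type missing from report", report.contains("application/pdf"));
    }
}
```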
 
FITS Blitz commences next Monday. Please get in touch with me, or with Carl Wilson from the Open Planets Foundation, if you'd like to find out more.


12 Comments

  1. andy jackson
    November 7, 2013 @ 9:36 am CET

    Lovely to hear about all this work going ahead, and it's really good to publicise it like this.

    Just wanted to check you know that there's already an Apache ODF Validator you could exploit.

  2. johan
    November 18, 2013 @ 5:42 pm CET

    Just for info: incidentally, I created an ODT entry in the OPF File Format Risk registry today. If there are any validation issues, please feel free to report them, e.g. as a child page here:

    http://wiki.opf-labs.org/display/TR/OpenDocument+Text

     

  3. Jay Gattuso
    November 12, 2013 @ 1:54 am CET

    Interesting discussion. I just wanted to add that I think we need both things. Some use cases demand a reference spec for a format; other use cases ask for an exemplar implementation that can be poked and prodded.

    Why not both? The spec is a reference. The implementation(s) are a tangible example of a spec (you could also associate deviations from the spec here, e.g. a proprietary implementation of a format type that shares 95% of its structure with an "official" spec but includes some 5% proprietary novelty). Both are truths in their own right and should be used as such. The total knowledge of the format is then formed from both the specs and the implementations that a DP/format SME has decided to include….

  4. lfaria
    November 8, 2013 @ 11:18 am CET

    Our process for creating the existing corpora in the fits-testing project is to use current, well-known implementations such as LibreOffice and Microsoft Word, create a new document with some content, and then save or export it in as many formats as the tool options allow (a rough sketch of this export step follows at the end of this comment). But whether to consider these files valid or not may be a question of terminology, or point of view.

    From one point of view, a file being valid means it follows the file format specification. Whether an archive accepts files that do not conform to the specification is a matter of policy, and I agree that many times they would have to accept whatever implementations produce. But, nevertheless, the information about whether a file follows the formal specification should be there.

    From another point of view, the whole idea of digital preservation is continuous access for the community, and if the community uses the implementations (such as Microsoft Word and LibreOffice), then compatibility with these implementations is the most important objective, even if they deviate from formal standards.

    Now, it might be that there is no such thing as a valid file (as there is no truth), but there are files that "follow the formal specification" and files that "are compatible with implementation X". For now, we are considering valid to mean the first, but if no specification is available we might have to resort to the second definition. In the end, it might mean we just need to define clearer terms and let policy decide what to accept or not.
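    For illustration, a minimal sketch of that export step, assuming LibreOffice's soffice binary is on the PATH; the seed document, output directory and format list are placeholder assumptions, not part of the fits-testing project itself:

    ```java
    import java.io.File;
    import java.io.IOException;

    /**
     * Sketch of the corpus export step: take one seed document and ask
     * LibreOffice (headless) to convert it into several formats.
     * All paths and the format list are placeholder assumptions.
     */
    public class CorpusExporter {

        public static void main(String[] args) throws IOException, InterruptedException {
            File seed = new File("corpus/seed/sample.docx"); // hypothetical seed document
            File outDir = new File("corpus/generated");
            outDir.mkdirs();

            // A handful of target formats; extend as the tool options allow.
            String[] formats = { "odt", "doc", "docx", "rtf", "pdf" };

            for (String format : formats) {
                // Equivalent to: soffice --headless --convert-to <format> --outdir <dir> <seed>
                Process p = new ProcessBuilder("soffice", "--headless", "--convert-to", format,
                        "--outdir", outDir.getPath(), seed.getPath())
                        .inheritIO()
                        .start();
                if (p.waitFor() != 0) {
                    System.err.println("Conversion to " + format + " failed");
                }
            }
        }
    }
    ```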

  5. andy jackson
    November 8, 2013 @ 9:56 am CET

    I guess I'm just a little surprised that making a new tool would tell you anything you could not learn by picking through the Apache one, which appears to use the same validation methodology. However, everyone has their own way of approaching things, and now that I understand that this work reflects your process of understanding this format and its validation (rather than necessarily being a new tool intended to usurp existing implementations), it all makes much more sense.

    I do agree that creating test corpora that explore these issues is really important and useful work. However, I'm a little skeptical about the implication that the formal specification represents 'Ground Truth'. We have to deal with whatever the implementations create, and so the test corpus must include examples from the common tools, even if they break the formal specification. That specification may provide a useful baseline against which the variation between implementations might be compared and understood, but that does not make it 'the truth'.
