Getting FITS into shape

The Harvard Library developed FITS, the File Information Tool Set, as part of the ingest processing of its Digital Repository Service (DRS). This was mostly Spencer McEwen's work. It's a "Swiss army knife" that runs a number of different tools to identify file formats and extract metadata. It was put up on Google Code as open source, and a number of other institutions have started using it.

Harvard hasn't had the time to turn it into a more broadly useful project, but thanks to a SPRUCE Award, I've been spending April making various updates and fixes to it, with the results currently available on GitHub. That repository is a temporary way station; these changes will be merged into an institutionally maintained repository, though just where hasn't been determined yet.

The first task I undertook was adding Apache Tika as a new tool. The work on this started at the OPF Hackathon in Leeds. The advantage of Tika is that not only does it already cover a lot of formats, but it's actively maintained, so we can expect support for more formats in future releases. FITS is a Java application, and Tika is a reasonably well-documented Java library, so getting it to work wasn't very hard. The main complication was that Tika's output vocabulary is sprawling and undocumented, so there's no good way to tell what properties it might report in previously untested cases. This makes it more difficult to translate Tika terms into standard FITS output.
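To illustrate the translation problem, here is a minimal sketch rather than the actual FITS code. It uses only the Java standard library, so it doesn't call Tika itself. The class name, the handful of key-to-name mappings, and the idea of recording unrecognized keys for later review are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: translates tool-specific metadata keys into
// FITS-style property names and remembers the keys it could not map.
class ToolTermMapper {
    private final Map<String, String> known = new LinkedHashMap<>();
    private final Set<String> unmapped = new TreeSet<>();

    ToolTermMapper() {
        // Illustrative mappings only; a real table would be much larger.
        known.put("Content-Type", "mimetype");
        known.put("tiff:ImageWidth", "imageWidth");
        known.put("tiff:ImageLength", "imageHeight");
    }

    // Returns the FITS-style name, or empty if the key is not in the table.
    Optional<String> translate(String toolKey) {
        String fitsName = known.get(toolKey);
        if (fitsName == null) {
            unmapped.add(toolKey); // keep it so someone can review it later
        }
        return Optional.ofNullable(fitsName);
    }

    // Keys seen so far that had no mapping, in sorted order.
    Set<String> unmappedKeys() {
        return unmapped;
    }

    public static void main(String[] args) {
        ToolTermMapper mapper = new ToolTermMapper();
        System.out.println(mapper.translate("Content-Type"));
        System.out.println(mapper.translate("some:untestedProperty"));
        System.out.println("Unmapped: " + mapper.unmappedKeys());
    }
}
```

Collecting the unmapped keys instead of silently dropping them gives at least some visibility into properties that turn up in previously untested cases.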

Several of the tools FITS uses were out of date. JHOVE hadn't been brought up to its latest version because attempts to do so produced less metadata than version 1.5 did. This turned out to be because JHOVE had updated to the current MIX 2.0 schema, and FITS was still trying to interpret it as MIX 0.2. Once the problem was found, the fix was obvious.
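One way to guard against this kind of mismatch is to check which MIX namespace a tool's output declares before deciding how to interpret it. The sketch below is an assumption about how such a check could look, not the fix that went into FITS. The class name, method names and the wrapper element in the sample are made up for the example, and it relies on MIX 2.0 using the namespace http://www.loc.gov/mix/v20.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical check: which namespace does the <mix> element use?
class MixVersionCheck {
    static final String MIX_20_NS = "http://www.loc.gov/mix/v20";

    // Returns the namespace URI of the first element whose local name
    // is "mix", or null if the document contains no such element.
    static String mixNamespace(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required to see namespace URIs
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList mixNodes = doc.getElementsByTagNameNS("*", "mix");
            return mixNodes.getLength() == 0
                    ? null
                    : mixNodes.item(0).getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse tool output", e);
        }
    }

    static boolean isMix20(String xml) {
        return MIX_20_NS.equals(mixNamespace(xml));
    }

    public static void main(String[] args) {
        // Made-up wrapper element standing in for a tool's output.
        String sample = "<output><mix:mix xmlns:mix=\"" + MIX_20_NS + "\"/></output>";
        System.out.println("MIX 2.0? " + isMix20(sample));
    }
}
```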

DROID was a more difficult case. FITS was using DROID 3, and DROID 6 was vastly changed, to the point that FITS got numerous compilation errors after dropping in DROID 6. DROID has no public API documentation, making things difficult. Matt Palmer, who has worked on DROID development, provided vital help in figuring out how to call the current version.

Some efficiency issues turned up. DROID uses an XML signature file to identify files. It's big, and parsing it took over 13 seconds on my computer. If FITS is run on a large directory, that one-time cost is spread out over a lot of files, but it's a real problem when FITS is run on one file or a small directory. Hopefully there will be optimizations, perhaps a persistent serialized cache, in future versions of DROID.
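For what it's worth, the general idea of a persistent serialized cache can be sketched in a few lines of standard Java. This says nothing about DROID's internals; the class name, the method and the stand-in "parse" are all hypothetical. The slow parse runs only when no cache file exists, and later runs deserialize the saved result.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical cache: parse once, then reuse the serialized result.
class SignatureCache {
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T loadOrBuild(Path cacheFile, Supplier<T> slowParse) {
        try {
            if (Files.exists(cacheFile)) {
                // Fast path: read back the object saved by an earlier run.
                try (ObjectInputStream in =
                         new ObjectInputStream(Files.newInputStream(cacheFile))) {
                    return (T) in.readObject();
                }
            }
            // Slow path: do the expensive parse and save the result.
            T parsed = slowParse.get();
            try (ObjectOutputStream out =
                     new ObjectOutputStream(Files.newOutputStream(cacheFile))) {
                out.writeObject(parsed);
            }
            return parsed;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Cache unusable: " + cacheFile, e);
        }
    }

    public static void main(String[] args) {
        Path cache = Paths.get(System.getProperty("java.io.tmpdir"), "signatures.ser");
        // Stand-in for the slow XML parse of the signature file.
        ArrayList<String> signatures = loadOrBuild(cache,
                () -> new ArrayList<>(Arrays.asList("sig-a", "sig-b")));
        System.out.println(signatures);
    }
}
```

A real cache would also have to be thrown away whenever the signature file is updated, which this sketch doesn't attempt.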

The National Library of New Zealand's metadata tool was more problematic. An attempt to bring it up from version 3.4GA to 3.5GA ran into problems similar to the ones with DROID, with classes having been changed. Apparently this tool isn't being actively maintained, and I wasn't able to get the information needed to do the update. It's staying at 3.4GA in FITS.

Another task was improving the metadata vocabulary for video. FITS output isn't much more than a flat set of properties, so it wasn't possible to adopt any other schema wholesale, but I drew ideas from a number of sources, including MediaInfo, Archivematica, and PBCore. ExifTool is currently the best of the tools for reporting video properties, so the output was shaped by what it can produce. Hopefully other tools, such as Tika, will produce more information on video files in future versions.

Documentation is an important part of any open source project, but one that often gets low priority. I did some work on the Javadoc and added documentation in the wiki pages of the GitHub repository. In particular, there are instructions on how to add a new tool to FITS.

Hopefully this work will make FITS a more useful tool, both for Harvard and for its other users.

2 Comments

  1. paul
    May 1, 2013 @ 10:26 am CEST

     

    Thanks very much for making these very useful changes happen, Gary. Excellent work!

    FITS provides a great mechanism for meeting our practitioners' needs, and these fixes and enhancements bring it right back up to date. The second project funded with a small SPRUCE Award is focused on C3PO, a visualisation tool that enables the analysis of FITS output. This work will be complete in a few weeks' time and will provide further enhancement to the FITS-C3PO toolset.

  2. Jay Gattuso
    May 16, 2013 @ 9:27 pm CEST

    Hey Gary,

    An interesting read, thanks. 

    I'd be very interested in finding out any more info on the issues you encountered with the NLNZ MET. Whilst it's arguably true we're not actively maintaining it, there is some current work on the table that should result in some newer formats being added to the tool.

    I'm specifically interested to know if there is anything you learnt from your efforts with the tool that would be worth addressing as a priority? (Whilst acknowledging the code base is ~7 years old, so it's very much due some work….)

    Jay
