Getting FITS into shape

The Harvard Library developed FITS, the File Information Tool Set, as part of the ingest processing of its Digital Repository Service (DRS). This was mostly Spencer McEwen's work. It's a "Swiss army knife" that runs a number of different tools to identify file formats and extract metadata from files. It was put up on Google Code as open source, and a number of other institutions have started using it.

Harvard hasn't had the time to develop it into a more broadly useful project, but thanks to a SPRUCE Award, I've been spending April making various updates and fixes to it, with the results currently available on GitHub. That repository is a temporary way station; these changes will be merged into an institutionally maintained repository, though just where hasn't been determined yet.

The first task I undertook was adding Apache Tika as a new tool. The work on this started at the OPF Hackathon in Leeds. The advantage of Tika is that not only does it already cover a lot of formats, but it's actively maintained, so we can expect support for more formats in future releases. FITS is a Java application, and Tika is a reasonably well-documented Java library, so getting it to work wasn't very hard. The main complication was that Tika's output vocabulary is sprawling and undocumented, so there's no good way to tell what properties it might report in previously untested cases. This makes it more difficult to translate Tika terms into standard FITS output.
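One way to cope with an open-ended vocabulary is to translate the terms you know about and keep everything else separately, so unexpected properties are visible rather than silently dropped. The following is only a minimal sketch of that idea in plain Java. The class `VocabularyTranslator` and the example property names are hypothetical, chosen for illustration; they are not taken from the FITS source or checked against Tika's actual output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, for illustration only; not part of FITS or Tika. */
final class VocabularyTranslator {

    /** Tool-specific property name -> output property name. */
    private final Map<String, String> knownTerms;

    VocabularyTranslator(Map<String, String> knownTerms) {
        this.knownTerms = new LinkedHashMap<>(knownTerms);
    }

    /** Outcome of one translation pass over a tool's reported properties. */
    static final class Result {
        /** Properties that had a known translation, keyed by output name. */
        final Map<String, String> mapped = new LinkedHashMap<>();
        /** Properties with no known translation, keyed by the tool's own name. */
        final Map<String, String> unmapped = new LinkedHashMap<>();
    }

    /** Splits the tool's output into translated and untranslated properties. */
    Result translate(Map<String, String> toolOutput) {
        Result result = new Result();
        for (Map.Entry<String, String> entry : toolOutput.entrySet()) {
            String target = knownTerms.get(entry.getKey());
            if (target != null) {
                result.mapped.put(target, entry.getValue());
            } else {
                // Keep unknown terms so they can be logged and reviewed later.
                result.unmapped.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

Logging the `unmapped` entries while running over a varied test corpus would be one way to discover which terms still need a translation.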

Several of the tools FITS uses were out of date. JHOVE hadn't been brought up to its latest version because attempts to do so produced less metadata than version 1.5 did. This turned out to be because JHOVE had updated to the current MIX 2.0 schema, and FITS was still trying to interpret it as MIX 0.2. Once the problem was found, the fix was obvious.
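A schema mismatch like this can be caught early by looking at the namespace of the MIX element before interpreting it. The sketch below uses only the standard Java XML APIs. It assumes that MIX 2.0 documents use the namespace `http://www.loc.gov/mix/v20` and that the MIX block appears as an element with the local name `mix`; the class `MixVersionCheck` is hypothetical and is not how FITS or JHOVE actually handle this.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper, for illustration only; not part of FITS or JHOVE. */
final class MixVersionCheck {

    /** Assumed namespace URI for MIX 2.0. */
    static final String MIX_20_NS = "http://www.loc.gov/mix/v20";

    private MixVersionCheck() {}

    /** Returns the namespace of the first "mix" element, or null if there is none. */
    static String mixNamespace(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for namespace lookups
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // "*" matches an element named "mix" in any namespace.
        NodeList nodes = doc.getElementsByTagNameNS("*", "mix");
        return nodes.getLength() == 0 ? null : nodes.item(0).getNamespaceURI();
    }

    /** True only when a MIX element is present and declares the MIX 2.0 namespace. */
    static boolean isMix20(String xml) throws Exception {
        return MIX_20_NS.equals(mixNamespace(xml));
    }
}
```

A caller could use `isMix20` to pick the right set of element names, or at least to log a warning when the tool output uses a schema version the translation code wasn't written for.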

DROID was a more difficult case. FITS was using DROID 3, and DROID 6 was vastly changed, to the point that FITS got numerous compilation errors after dropping in DROID 6. DROID has no public API documentation, making things difficult. Matt Palmer, who has worked on DROID development, provided vital help in figuring out how to call the current version.

Some efficiency issues turned up. DROID uses an XML signature file to identify files. It's big, and parsing it took over 13 seconds on my computer. If FITS is run on a large directory, that one-time cost is spread out over a lot of files, but it's a real problem if FITS is run on one file or a small directory. Hopefully there will be optimizations, perhaps a persistent serialized cache, in future versions of DROID.
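To make the "persistent serialized cache" idea concrete, here is a minimal sketch using standard Java serialization. It is not DROID code and says nothing about how DROID represents its signatures internally; `ParsedFileCache`, its `load` method, and the `parser` function are all hypothetical. The assumption is simply that the parsed result is `Serializable` and that comparing file modification times is good enough to detect a changed source file.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

/** Hypothetical helper, for illustration only; not part of DROID or FITS. */
final class ParsedFileCache {

    private ParsedFileCache() {}

    /**
     * Returns the parsed form of source, reusing the serialized copy in
     * cacheFile when that copy is at least as new as source. Otherwise the
     * (slow) parser runs and its result is written to cacheFile.
     */
    static <T extends Serializable> T load(Path source, Path cacheFile,
                                           Function<Path, T> parser) throws IOException {
        if (Files.exists(cacheFile)
                && Files.getLastModifiedTime(cacheFile)
                        .compareTo(Files.getLastModifiedTime(source)) >= 0) {
            try (ObjectInputStream in =
                         new ObjectInputStream(Files.newInputStream(cacheFile))) {
                @SuppressWarnings("unchecked")
                T cached = (T) in.readObject();
                return cached;
            } catch (ClassNotFoundException | IOException e) {
                // Unreadable or incompatible cache: fall through and rebuild it.
            }
        }
        T parsed = parser.apply(source);
        try (ObjectOutputStream out =
                     new ObjectOutputStream(Files.newOutputStream(cacheFile))) {
            out.writeObject(parsed);
        }
        return parsed;
    }
}
```

With something like this in place, only the first run after a signature file update would pay the full parsing cost; later runs would pay only for deserialization, which may or may not be faster depending on the data structure involved.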

The National Library of New Zealand's metadata tool was more problematic. An attempt to bring it up from version 3.4GA to 3.5GA ran into problems similar to the ones with DROID, with classes having been changed. Apparently this tool isn't being actively maintained, and I wasn't able to get the information needed to do the update. It's staying at 3.4GA in FITS.

Another task was improving the metadata vocabulary for video. FITS output isn't much more than a flat set of properties, so it wasn't possible to adopt any other schema wholesale, but I drew ideas from a number of sources, including MediaInfo, Archivematica, and PBCore. ExifTool is currently the best of the tools for reporting video properties, so the output was shaped by what it can produce. Hopefully other tools, such as Tika, will produce more information on video files in future versions.

Documentation is an important part of any open source project, but one that often gets low priority. I did some work on the Javadoc and added documentation in the wiki pages of the GitHub repository. In particular, there are instructions on how to add a new tool to FITS.

Hopefully this work will make FITS a more useful tool, both for Harvard and for its other users.

By garymcgath, posted in garymcgath's Blog

30th Apr 2013  4:42 PM

