I am currently working on a tool that profiles collections of digital objects. The motivation is that efficient preservation planning requires analyzing the content that is to be preserved and detecting the significant properties of the collection or set of objects. Once this baseline is established, planning can begin, and different preservation actions can be evaluated on representative sample objects before the chosen action is actually executed. Afterwards the process can be repeated on demand. In my opinion, only with this first step of analysis and detection can the cycle of the preservation workflow be completed: analyze -> detect -> plan -> act -> repeat.
To achieve this goal I obviously need metadata about real content, and lots of it. I guess this is the place to thank Bjarne Andersen, Asger Askov Blekinge and their team at the State and University Library, Denmark for the metadata they have shared with me. Thank you guys, you are awesome!
I am using the File Information Tool Set, or FITS (http://code.google.com/p/fits/), developed by the Harvard University Library, for a number of reasons but mostly because of the data it provides and its standardized output. However, I have heard many arguments against FITS and its performance, so I will try to summarize them and tell you what I think of FITS.
FITS identifies, validates and extracts technical metadata for various file formats. Sounds great, doesn’t it?
Actually, it does none of those things itself! It just wraps a bunch of other tools and normalizes their output into a structured XML file that follows a well-defined schema.
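To make that concrete, here is an abridged, illustrative sketch of what such an output file looks like. The element names follow the FITS output schema as I understand it; the values are invented for illustration, so check the schema shipped with your FITS version:

```xml
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">
  <identification>
    <identity format="Portable Document Format" mimetype="application/pdf">
      <version>1.4</version>
    </identity>
  </identification>
  <fileinfo>
    <size>182044</size>
    <creatingApplicationName>OpenOffice.org</creatingApplicationName>
  </fileinfo>
  <filestatus>
    <well-formed toolname="Jhove">true</well-formed>
    <valid toolname="Jhove">true</valid>
  </filestatus>
  <metadata>
    <document>
      <pageCount>10</pageCount>
    </document>
  </metadata>
</fits>
```

Whatever Droid, JHOVE, Exiftool and the rest report individually ends up mapped into this one structure, which is exactly the point.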
Let me rephrase that:
“Don’t reinvent the wheel!”
That is the main reason why I like it and why I FITS:
Why try to invent yet another identification or characterization tool that claims to be so much better than all the others, when there are already so many of them? Why not just fork an existing one and make it better?
By the way, there is this amazing report that evaluates some of these tools (FITS is also in there, which I personally find a bit odd, as its purpose is the combination of tools rather than characterization itself, but anyway). Definitely a must-read if you are interested in the topic and haven’t read it yet.
Some characterization tools produce very good results on a variety of files, others produce good results only on specific files, and some perform better than others under specific circumstances. There is, however, one simple fact: at the end of the day, as a user I am interested in a lot of (relevant! and valid!) metadata that can be extracted easily (with one tool, in one environment, please). And here is why…
There are certainly use cases where identification alone would be enough for a set of objects, but from my point of view it will never be enough. From a preservation planning point of view it will most certainly never be enough! Knowing that a collection of objects consists of x formats and having a histogram over them is not enough. The format is certainly a very important property, a significant one, but it is one of many. We cannot treat it as if it were the only decision factor that influences what we are going to do with the set of objects. Consider the following example:
You have three PDF files and you know only their formats and their versions. Which two are similar? Which one is the outlier?
Pretty easy, right? Well, now consider the same three files, but note that FITS provides some more metadata about them.
Which two files are more similar now? Which one is the outlier? I hope you see that the format alone is not enough. It is just another property that has to be considered.
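A toy sketch of the kind of comparison the example gestures at. The property names and values here are entirely invented; the point is only that a richer feature vector can flip the answer that format and version alone would give:

```python
def similarity(a, b):
    """Fraction of shared properties whose values are equal."""
    keys = set(a) & set(b)
    return sum(a[k] == b[k] for k in keys) / len(keys)

# Three hypothetical PDF files described by a few FITS-style properties.
f1 = {"format": "PDF", "version": "1.4", "encrypted": False,
      "page_count": 10, "has_embedded_fonts": True}
f2 = {"format": "PDF", "version": "1.4", "encrypted": True,
      "page_count": 900, "has_embedded_fonts": False}
f3 = {"format": "PDF", "version": "1.6", "encrypted": False,
      "page_count": 10, "has_embedded_fonts": True}

# By format and version alone, f1 and f2 look identical and f3 is the
# outlier; with more properties, f1 is actually much closer to f3.
print(similarity(f1, f2))  # 0.4 (2 of 5 properties match)
print(similarity(f1, f3))  # 0.8 (4 of 5 properties match)
```

Any real profiling tool would of course weight properties and handle numeric ranges, but even this naive count shows why a format histogram is only the beginning.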
I am sure by now many of you are thinking: apparently he hasn’t used FITS if he is really proposing it for large-scale usage. Well, let me assure you I have used it extensively, and I believe this is the way to go. Maybe this version and the standard configuration are not ready; maybe this is not even the right implementation of the needed tool; but I believe this is the correct idea that we are looking for.
Obviously, FITS has many downsides – but let’s cut the project some slack. First of all, it is at version 0.6. Second, many of the downsides are fairly simple to fix. There are arguments such as “JHOVE performs badly on HTML files” – well, just turn it off for HTML. It provides mostly irrelevant information, like unclosed <br> tags, anyway.
Another would be: DROID is not the best identification tool out there anymore. Well, just exchange it for Apache Tika then.
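Both tweaks are, as far as I can tell, largely a matter of editing the tool list in fits.xml. A sketch of what that could look like – the tool class names and the exclude-exts attribute reflect my reading of the bundled configuration, so double-check them against the fits.xml of your FITS version:

```xml
<tools>
  <!-- keep JHOVE, but skip it for HTML, where its output is mostly noise -->
  <tool class="edu.harvard.hul.ois.fits.tools.jhove.Jhove" exclude-exts="htm,html" />
  <!-- a Tika wrapper (hypothetical here) could replace the DROID entry below -->
  <tool class="edu.harvard.hul.ois.fits.tools.droid.Droid" />
</tools>
```

No code changes required for the first fix; the second needs a wrapper class plus an XSLT, which brings me to my next point.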
Don’t reinvent the wheel, right?
On the other hand, I haven’t heard a single argument about data normalization. It seems to me that the digital preservation community is still so swamped with other problems that we tend to overlook some fundamental points. We need a normalized model of how this metadata is going to be represented. Every single tool out there provides different measures (with different data units) and a different format for the same concepts. And since creating a new way of describing something always reminds me of this comic: http://xkcd.com/927/ – maybe we should try to stick to an existing one.
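To illustrate what I mean by "different format for the same concepts": here is a minimal sketch, with invented tool names and field shapes, of mapping two tools' reports of one and the same measure onto a single canonical property:

```python
# Hypothetical raw records: two tools report an image's width differently,
# one with a unit baked into the string, one as a bare number.
raw = [
    {"tool": "ToolA", "field": "ImageWidth", "value": "1024 pixels"},
    {"tool": "ToolB", "field": "width",      "value": "1024"},
]

def normalize_width(record):
    """Map tool-specific width fields onto one canonical property."""
    if record["field"] in ("ImageWidth", "width"):
        number = record["value"].split()[0]  # strip a trailing unit if present
        return {"property": "imageWidth", "value": int(number),
                "unit": "px", "source": record["tool"]}
    return None

normalized = [normalize_width(r) for r in raw]
# Both records now agree: property imageWidth, value 1024, unit px --
# and, crucially, they have become comparable.
```

Multiply this by every property of every format and every wrapped tool, and it becomes clear why a shared, existing model beats each project inventing its own.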
FITS is the only tool (that I know of, so correct me if I am wrong) that tries to normalize the data to some extent.
So why not exchange some of the badly performing wrapped tools for a couple of better ones, and why not write some XSLTs so that more relevant metadata is added to the output?
Don’t reinvent the wheel, right?
As I pointed out, I am interested in valid metadata. Once again, FITS is the only tool that tries to provide some insight into whether or not a measurement is valid, by pointing out conflicts between the different wrapped tools. Yes, it is fairly rudimentary, but it is a start and it is more than anything else I know. Why not build upon that?
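For illustration, this is roughly how such a disagreement surfaces in the output when two wrapped tools identify the same file differently (abridged, with the attribute names as I recall them from the output schema, so treat the exact shape as an assumption):

```xml
<identification status="CONFLICT">
  <identity format="Hypertext Markup Language" mimetype="text/html">
    <tool toolname="Droid" />
  </identity>
  <identity format="Plain text" mimetype="text/plain">
    <tool toolname="file utility" />
  </identity>
</identification>
```

A conflict like this is not noise; it is exactly the signal that tells me which measurements I should not trust blindly.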
And since I don’t want to come across as if FITS is the greatest tool ever, here are some points that I believe should be fixed in the current implementation:
– Better error recovery on external tool crash, so that the whole framework does not hang!
– Better configuration (e.g. not only excludes based on extension, but maybe also includes).
– Better output format schema.
– Better and more flexible tool chaining.
– More known properties (more xslt transformations).
– … I could continue, but I don’t want to make this a rant.
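On the first point, the fix does not even have to be sophisticated. Here is a minimal sketch of what crash- and hang-proof invocation of a wrapped tool could look like – this is my own illustration of the idea, not FITS's actual mechanism:

```python
import subprocess

def run_tool(cmd, timeout_seconds=60):
    """Run one wrapped tool; never let a hung or crashing tool stall the run."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_seconds)
        if result.returncode != 0:
            # tool crashed: record the error and move on to the next tool
            return {"status": "error", "detail": result.stderr.strip()}
        return {"status": "ok", "output": result.stdout}
    except subprocess.TimeoutExpired:
        # tool hung: it gets killed, the framework keeps going
        return {"status": "timeout",
                "detail": "killed after %ds" % timeout_seconds}
    except FileNotFoundError:
        return {"status": "missing", "detail": cmd[0]}
```

The caller then simply skips or flags tools that report anything other than "ok", instead of letting one misbehaving wrapper take the whole framework down.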
I am pretty sure that if there were such a tool that did all that and potentially more, and if it were configured correctly, it would show very promising results and would be the foundation of many other tools built on top of this information. I am certain it would be of great input to Preservation Planning, and I am certain it would help us understand better what content we have.
That is why I like to FITS! I’ll be happy if you drop me a comment on why you do or do not!