Format Registry Challenge, Part Two


Before I started with format editing, I realized that it would be very simple to implement cool URLs for file formats. The only difficulty was that Tapestry associates pages with Java classes, and “x-fmt” is not a valid Java class name. This meant implementing URL redirects, which are a special topic in Tapestry. Tapestry does not believe in heavily modifying web.xml; almost all configuration takes place in so-called filter classes. Anyway, once I had sorted out how to do the rewrites, I could immediately produce nice output for requests like:

http://host/repository/x-fmt/123
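
To illustrate the idea behind the rewrite, here is a minimal sketch in plain Java. It is not the Tapestry filter I actually wrote and uses no Tapestry API; the class name `PuidUrlMapper` and the internal page path `/format` are assumptions made for this example. It only shows the mapping step: recognize a PUID-style path and turn it into a path that a page class can serve.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: maps a PUID-style request path onto an internal page path. */
final class PuidUrlMapper {

    // Matches "/fmt/123" or "/x-fmt/123" (the two PUID namespaces used by Pronom).
    private static final Pattern PUID_PATH = Pattern.compile("^/(x-fmt|fmt)/(\\d+)$");

    private PuidUrlMapper() {
    }

    /**
     * Returns the internal path for a PUID-style request path,
     * or an empty Optional if the path is not a PUID and should be left alone.
     * "/format" is an assumed page name, not taken from the real application.
     */
    static Optional<String> toInternalPath(String requestPath) {
        Matcher m = PUID_PATH.matcher(requestPath);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of("/format/" + m.group(1) + "/" + m.group(2));
    }
}
```

With this mapping, a request path of `/x-fmt/123` is handed to the `/format` page with `x-fmt` and `123` as its context, while any other path passes through untouched.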

My next challenge was incorporating Fido regular expression signatures in the application. This required an extension of the Pronom model, and hence a modification of my generated Pronom XSD and the associated PRONOMReport class. I decided to do this by hand, in order to avoid losing my previous annotations and extensions of PRONOMReport. Then I wanted to update my Pronom XML files (my database) with the 208 existing Fido regular expression signatures, which Adam has conveniently provided on GitHub in this file: formats.py. A few dozen lines of code allowed me to update all of the XML files as well as test my new data model (and the associated marshalling/unmarshalling).

I then noticed that a few critical attributes in the data model should definitely not be free strings, in particular FileFormatIdentifier.IdentifierType (MIME, PUID, or Apple Uniform Type Identifier), RelatedFormat.RelationshipType (Equivalent to, Is subtype of, etc.), and ExternalSignature.SignatureType (although there is only one value for this in the data set at this time, “File extension”). The easiest way for Tapestry to handle enumerated strings is to create Java enum types, but this has the drawback that the names of enum constants cannot contain spaces. So I replaced every instance of these strings in the XML sources with underscored versions (that is, “Is_subtype_of” instead of “Is subtype of”) and added the enum types and methods to my PRONOMReport class. Using enums, Tapestry automatically populates drop-down elements for any Select input type in the editing interface.
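
As a rough sketch of what such an enum can look like (the real one lives inside my PRONOMReport class and lists more values), here is a standalone version with only the two relationship types named above. The helper methods `label()` and `fromLabel()` are assumptions added for this example; they show how the original spaced wording can still be recovered from the underscored constant names.

```java
/** Sketch of an enumerated relationship type; only two of the values are shown. */
enum RelationshipType {
    // Constant names match the underscored strings now stored in the XML files.
    Equivalent_to,
    Is_subtype_of;

    /** Hypothetical helper: the original human-readable wording, with spaces. */
    String label() {
        return name().replace('_', ' ');
    }

    /**
     * Hypothetical helper: parses either the spaced or the underscored form.
     * Throws IllegalArgumentException for an unknown value.
     */
    static RelationshipType fromLabel(String label) {
        return valueOf(label.trim().replace(' ', '_'));
    }
}
```

For example, `RelationshipType.fromLabel("Is subtype of")` yields `Is_subtype_of`, and calling `label()` on that constant gives back “Is subtype of”.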

The rest of the work, which cost me more than one day, was all about setting up editing interfaces for a very complex data type – one that contains recursive lists of other objects. The most complicated relationship is that between a Format and its associated Signature, which looks like this:

FileFormat 1:n InternalSignature 1:n ByteSequence

While I was able to successfully display this nested relationship, editing it all on one screen was too difficult for me; in the end, I pushed the ByteSequence editor onto its own page. Although the editor development was time-consuming, Tapestry provided an important component (the AjaxFormLoop) that allowed for nice editing of those 1:n relationships, making it possible to add, edit, or delete any relationship within one Ajax component.
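
For readers who want to picture the nesting, the following is a simplified sketch of the FileFormat 1:n InternalSignature 1:n ByteSequence structure as plain Java classes. The field and method names here are assumptions for illustration; the actual classes are generated from the Pronom XSD and carry many more properties. The add and remove methods stand in for the kind of operations that an editor for a 1:n relationship has to perform on the underlying lists.

```java
import java.util.ArrayList;
import java.util.List;

/** Innermost level: one byte sequence of a signature (field name is assumed). */
final class ByteSequence {
    final String value;

    ByteSequence(String value) {
        this.value = value;
    }
}

/** Middle level: a signature owns any number of byte sequences. */
final class InternalSignature {
    final String name;
    final List<ByteSequence> byteSequences = new ArrayList<>();

    InternalSignature(String name) {
        this.name = name;
    }

    ByteSequence addByteSequence(String value) {
        ByteSequence sequence = new ByteSequence(value);
        byteSequences.add(sequence);
        return sequence;
    }
}

/** Outermost level: a format owns any number of signatures. */
final class FileFormat {
    final String name;
    final List<InternalSignature> signatures = new ArrayList<>();

    FileFormat(String name) {
        this.name = name;
    }

    InternalSignature addSignature(String signatureName) {
        InternalSignature signature = new InternalSignature(signatureName);
        signatures.add(signature);
        return signature;
    }

    boolean removeSignature(InternalSignature signature) {
        return signatures.remove(signature);
    }

    /** Counts byte sequences across all signatures of this format. */
    int totalByteSequences() {
        int total = 0;
        for (InternalSignature signature : signatures) {
            total += signature.byteSequences.size();
        }
        return total;
    }
}
```

Editing the outer list (signatures of a format) and the inner list (byte sequences of a signature) at the same time is what made a single-screen editor awkward, which is why the inner list ended up on its own page.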

So the status after five days is: I have a prototype Format Registry application based on XML persistence of files that should in principle be compatible with Pronom exports (although I probably shouldn’t have changed the enumerated values as I did; it would have been better to create customized bindings, as described here, but this was clearly beyond me in the time available). The user can edit file formats (in particular by adding Fido regular expression signatures), or create new formats, as well as search the existing data in a variety of ways.

Of course, there are some very important missing features: authentication and security (who is allowed to edit existing formats? who is allowed to add new formats?), validation (Tapestry has hooks for validation built in, but it would take another day to include this functionality in the editor), and perhaps most importantly, the generation of unique identifiers for new formats. I will post some of my thoughts regarding this last point in my next blog entry.
