I attended the Practical Tools for Digital Preservation – A Hackathon event as a developer. As well as being enjoyable, I found this event useful and interesting.
In my role as Developer for Arts and Humanities computing projects, I am implementing an OAIS-compliant digital archive for the University of St Andrews. This involves developing both workflows and solutions to technical problems. The event covered both of these aspects of digital archiving.
It was reassuring to see that the other attendees had come up against similar problems to me, and that they had found similar solutions. Some of the issues that collection owners presented were ones that I had already come across (e.g. file format identification and conversion), and others were ones that I can see myself coming across in the future (e.g. file corruption and how to evaluate batch conversions).
Collection and issues
The event paired developers with collection owners. I was put together with Cal Lee, who had a collection of realistic disk images. These are images of hard disks of Windows PCs which have been used in a realistic way, yet contain no personal or incriminating information. He explained to me that students had used these machines for several weeks, pretending to work for a fictitious company and carrying out various tasks. This left a trail of documents, application settings and user information on the disks. Because they contain no genuine information, these disk images can be made public and used by anyone interested in digital forensics.
The images were in AFF format. From what I understand, this format preserves not only the files in the file system, but also attempts to identify files still on the hard disk but deleted from the file system. I was given access to these, and also to XML files which listed all the files and directories in each disk image.
It was explained to me that, while there are many tools for creating forensically valuable disk images, the tools for presenting the data contained in the images were not so plentiful. Cal was hoping that we could develop an interface for browsing the disk images, and that this interface would be aware of permissions. Ideally, it might only give access to certain areas of the file system, depending on who was browsing the disk image.
Solution
In the time available, I was able to develop some of what Cal, the collection owner, was looking for. I implemented a web interface that lets a user choose a disk image, navigate the directories in its file system and see their contents. I also applied some basic permissions.
The XML files were very large (long flat lists), so I read them with a SAX parser. Each entry gave the path of a directory or file, and I used these paths, together with a stack, to reconstruct the file system’s tree structure. The tree was stored in a SQLite database, accessed through PDO for database independence. The nodes were labelled using the modified preorder tree traversal method so that they could be retrieved quickly later on.
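The ingest step can be sketched roughly as follows. This is an illustration in Python, not the PHP and PDO code written at the event. The element names (fileobject, filename, name_type), the table name node, the columns lft and rgt, and the sample listing are all assumptions made for the example. It also assumes that paths use forward slashes and that every directory is listed before its contents. The stack holds the directories that are still open, so the modified preorder labels can be handed out in the same single pass as the SAX parse.

```python
import sqlite3
import xml.sax

# Hypothetical, heavily simplified listing; element names are assumptions.
SAMPLE = b"""<dfxml>
<fileobject><filename>Documents and Settings</filename><name_type>d</name_type></fileobject>
<fileobject><filename>Documents and Settings/alice</filename><name_type>d</name_type></fileobject>
<fileobject><filename>Documents and Settings/alice/report.doc</filename><name_type>r</name_type></fileobject>
<fileobject><filename>WINDOWS</filename><name_type>d</name_type></fileobject>
<fileobject><filename>WINDOWS/system.ini</filename><name_type>r</name_type></fileobject>
</dfxml>"""


class ListingHandler(xml.sax.ContentHandler):
    """Streams the flat list and assigns (lft, rgt) labels using a stack."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished nodes: (path, is_dir, lft, rgt)
        self.stack = []    # nodes still open: (path, is_dir, lft)
        self.counter = 1   # next label to hand out
        self.field = None  # name of the text field being read, if any
        self.current = {}  # fields of the file object being read

    def startDocument(self):
        self._open("", True)  # synthetic root node for the whole disk image

    def startElement(self, name, attrs):
        if name == "fileobject":
            self.current = {"filename": "", "name_type": ""}
        elif name in self.current:
            self.field = name

    def characters(self, content):
        if self.field is not None:
            self.current[self.field] += content

    def endElement(self, name):
        if name == "fileobject":
            path = self.current["filename"]
            self._close_until(path)
            self._open(path, self.current["name_type"] == "d")
            self.current = {}
        elif name == self.field:
            self.field = None

    def endDocument(self):
        self._close_until(None)  # close everything, including the root

    def _open(self, path, is_dir):
        self.stack.append((path, is_dir, self.counter))
        self.counter += 1

    def _close_until(self, path):
        # Pop open nodes until the top of the stack is an ancestor of `path`.
        while self.stack:
            top = self.stack[-1][0]
            if path is not None and (top == "" or path.startswith(top + "/")):
                break
            node_path, is_dir, lft = self.stack.pop()
            self.rows.append((node_path, int(is_dir), lft, self.counter))
            self.counter += 1


def ingest(xml_bytes, conn):
    """Parse the listing and store the labelled nodes in SQLite."""
    handler = ListingHandler()
    xml.sax.parseString(xml_bytes, handler)
    conn.execute(
        "CREATE TABLE node (path TEXT, is_dir INTEGER, lft INTEGER, rgt INTEGER)"
    )
    conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", handler.rows)
    conn.commit()


def descendants(conn, path):
    """Everything below `path`, found with one range query on the labels."""
    lft, rgt = conn.execute(
        "SELECT lft, rgt FROM node WHERE path = ?", (path,)
    ).fetchone()
    rows = conn.execute(
        "SELECT path FROM node WHERE lft > ? AND rgt < ? ORDER BY lft", (lft, rgt)
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    ingest(SAMPLE, conn)
    print(descendants(conn, "Documents and Settings"))
```

The benefit of the labelling shows in descendants: a whole subtree is fetched with a single comparison on lft and rgt, with no recursion over parent links.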
Once the XML had been ingested, some basic permissions were applied. Initially, all directories and files were assigned to the admin user. The disk images were of Windows C drives, so users’ files were to be found under Documents and Settings. There are various system users there too, but these were excluded. Everything below, and including, a genuine user’s home directory was to be owned by that user. This was achieved using PHP to select and update rows in the database.
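A rough sketch of this permissions step is given below, again in Python rather than the original PHP. The table layout (node with path, is_dir, lft, rgt and owner columns), the sample rows and the list of system user directories are assumptions for illustration; the actual exclusion list used at the event is not recorded here. Because the nodes carry modified preorder labels, a home directory and everything beneath it can be reassigned with one UPDATE.

```python
import sqlite3

HOMES = "Documents and Settings"
# Assumed list of system profile directories to skip (hypothetical).
SYSTEM_USERS = {"All Users", "Default User", "LocalService", "NetworkService"}


def demo_db():
    """Small hand-labelled tree standing in for an ingested disk image."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (path TEXT, is_dir INTEGER, lft INTEGER,"
        " rgt INTEGER, owner TEXT)"
    )
    rows = [
        ("", 1, 1, 14),
        ("Documents and Settings", 1, 2, 11),
        ("Documents and Settings/All Users", 1, 3, 4),
        ("Documents and Settings/alice", 1, 5, 10),
        ("Documents and Settings/alice/My Documents", 1, 6, 9),
        ("Documents and Settings/alice/My Documents/report.doc", 0, 7, 8),
        ("WINDOWS", 1, 12, 13),
    ]
    conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, NULL)", rows)
    return conn


def apply_basic_permissions(conn):
    # Step 1: everything starts out owned by the admin user.
    conn.execute("UPDATE node SET owner = 'admin'")

    # Step 2: find the directories under "Documents and Settings".
    candidates = conn.execute(
        "SELECT path, lft, rgt FROM node WHERE is_dir = 1 AND path LIKE ?",
        (HOMES + "/%",),
    ).fetchall()

    for path, lft, rgt in candidates:
        user = path[len(HOMES) + 1:]
        # Keep only direct children (home directories) of genuine users.
        if "/" in user or user in SYSTEM_USERS:
            continue
        # Step 3: the home directory and everything below it go to the user.
        conn.execute(
            "UPDATE node SET owner = ? WHERE lft BETWEEN ? AND ?",
            (user, lft, rgt),
        )
    conn.commit()


if __name__ == "__main__":
    conn = demo_db()
    apply_basic_permissions(conn)
    for path, owner in conn.execute("SELECT path, owner FROM node ORDER BY lft"):
        print(owner, path or "/")
```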
These two steps (ingest and application of basic permissions) were run on the command line of the server. For the web interface, PHP queried the database, converted the resulting two-dimensional arrays into XML, and transformed this XML into HTML using XSLT.
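The first half of that web interface pipeline, from database rows to XML, might look something like the sketch below. It is written in Python for illustration, and the table layout, sample rows and XML element names (directory, dir, file) are assumptions rather than the format actually used. The XSLT stage is left out because Python's standard library has no XSLT processor. The query picks out the immediate children of a directory by excluding any descendant that has another node between it and the parent.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Direct children of a directory: descendants with no intermediate ancestor.
CHILDREN_SQL = """
SELECT c.path, c.is_dir, c.owner
FROM node AS p
JOIN node AS c ON c.lft > p.lft AND c.rgt < p.rgt
WHERE p.path = ?
  AND NOT EXISTS (
      SELECT 1 FROM node AS m
      WHERE m.lft > p.lft AND m.rgt < p.rgt
        AND m.lft < c.lft AND m.rgt > c.rgt)
ORDER BY c.lft
"""


def demo_db():
    """Small hand-labelled tree with owners already applied (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (path TEXT, is_dir INTEGER, lft INTEGER,"
        " rgt INTEGER, owner TEXT)"
    )
    rows = [
        ("", 1, 1, 14, "admin"),
        ("Documents and Settings", 1, 2, 11, "admin"),
        ("Documents and Settings/All Users", 1, 3, 4, "admin"),
        ("Documents and Settings/alice", 1, 5, 10, "alice"),
        ("Documents and Settings/alice/My Documents", 1, 6, 9, "alice"),
        ("Documents and Settings/alice/My Documents/report.doc", 0, 7, 8, "alice"),
        ("WINDOWS", 1, 12, 13, "admin"),
    ]
    conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def listing_xml(conn, directory):
    """Return the contents of one directory as an XML string."""
    root = ET.Element("directory", {"path": directory})
    for path, is_dir, owner in conn.execute(CHILDREN_SQL, (directory,)):
        name = path.rsplit("/", 1)[-1]  # last path component
        tag = "dir" if is_dir else "file"
        ET.SubElement(root, tag, {"name": name, "owner": owner})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(listing_xml(demo_db(), "Documents and Settings"))
```

An XML document of this shape could then be handed to an XSLT stylesheet to produce the HTML page for the directory being browsed.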
Limitations
Tools exist for extracting files from the disk images. I installed some, but wasn’t able to get them to work in time. It was hoped that these tools could have been used to extract and return files to the client via the web interface.
The XML files were in Digital Forensics XML format. Cal’s colleague, Kam Woods, explained that this was an evolving standard and that not all the fields were documented yet. One field looked as though it could be used to determine permissions for directories and files, but we couldn’t work out what its values meant. If we do find out, I could refine the application of permissions.
The code I wrote for transforming the flat list of files into a tree structure required that directories be recognised as directories. In some disk images, deleted directories were in the XML list, and this caused my program to break. A file with the name "." also broke my program.
The web interface only displays ownership, rather than requiring authentication and enforcing permissions.
Being able to access the files for full text indexing, metadata generation and permissions based on content would all be quite desirable.
Conclusion
Initially I wondered why I had been paired with Cal, given that his collection and issues were quite unrelated to my experience, as I had described it when introducing myself to the other attendees. But Paul Wheatley, who put the teams together, told me that he thought it would be good to give me a new challenge.
I’m glad now that he did, as it has given me an introduction to a complex area that we are very likely to face ourselves as expectations of the digital archive’s capabilities expand.
Developing a rough prototype has probably not gone very far to meet Cal’s requirements. But it has, I think, helped to refine the questions that need answering.
I am grateful to the Open Planets Foundation and the Digital Preservation Coalition for organising this very interesting event.