Identifying ebooks (File ID Hackathon)

Several of us at The British Library took part in the CURATEcamp file id hackathon on Friday.

We decided that one area where we could make a useful impact was the identification of various ebook formats. Ebooks are an important content type for the British Library, especially with the expected implementation of non-print legal deposit legislation next year. For a long list of formats, see: http://wiki.mobileread.com/wiki/E-book_formats

Having gathered a test corpus of various ebook formats, we looked at how the files were recognised by file, Tika and Droid. Initial testing showed that Droid did not recognise formats other than AZW/Mobi, PDF and TXT, whilst Tika misidentified many of the formats, particularly as chemical/x-pdb. File had the best coverage of the three tools, but it did not identify every file, and where it did recognise a format it often did not associate a useful mime type. Perhaps as a sensible default that avoids misidentification, it reported the mime type of many files as application/octet-stream.

Table showing results before our work on file signatures was carried out:

Format + mime type	Droid result	Tika result	File result
AZW/Mobi Application/vnd.amazon.ebook Application/x-mobipocket-ebook	PocketMobi*	Recognised as either application/vnd.amazon.ebook or text/html	Mobipocket E-book. Mime type: Application/octet-stream
EPUB Application/epub+zip	Application/zip	Application/epub+zip	Application/epub+zip
PDB (5 formats)	One recognised as PocketMobi*	Chemical/x-pdb	Recognised 3/5 formats. All mime types: Application/octet-stream
FB2 Application/x-fictionbook+xml	Text/xml	Application/x-fictionbook+xml	Application/xml
PRC	PocketMobi*	Text/html	Recognised but mime type: Application/octet-stream
LRF	–	Application/octet-stream	Recognised but mime type: Application/octet-stream
LIT Application/x-ms-reader	–	Application/octet-stream	Application/x-ms-reader
PKG application/x-newton-compatible-pkg	–	Application/octet-stream	Recognised but mime type: Application/octet-stream
RB Application/x-rocketbook	–	Text/x-ruby	Application/octet-stream
TCR	–	Application/octet-stream	Application/octet-stream

* indicates that the tool specified no mime type. Results in italics indicate incorrect identification.

We then spent the day using fidget to develop signatures that Tika could use to identify the ebooks in our test corpus. By my count we added thirteen signatures to a custom-mimetypes.xml, which enabled Tika to correctly identify all of our test files by the end of the day.
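To illustrate what a signature of this kind amounts to, here is a minimal Python sketch of offset-based magic-byte matching. It is not the set of signatures we wrote for Tika: the names `SIGNATURES` and `identify` are hypothetical, and the byte patterns are the commonly documented ones for EPUB, Mobipocket and Microsoft Reader files, used here as assumptions rather than values copied from our custom-mimetypes.xml.

```python
# Illustrative only: (offset, magic bytes, mime type) triples.
# Assumption: these are commonly documented patterns, not the hackathon's
# actual Tika signatures.
SIGNATURES = [
    # EPUB: the first zip entry is named "mimetype" and stored uncompressed
    (30, b"mimetypeapplication/epub+zip", "application/epub+zip"),
    # Mobipocket: Palm Database type/creator codes at offset 60
    (60, b"BOOKMOBI", "application/x-mobipocket-ebook"),
    # Microsoft Reader (LIT)
    (0, b"ITOLITLS", "application/x-ms-reader"),
]

DEFAULT_MIME = "application/octet-stream"


def identify(header: bytes) -> str:
    """Return the mime type of the first signature matching at its offset."""
    for offset, magic, mime in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return mime
    # Fall back to the generic type rather than guessing
    return DEFAULT_MIME
```

In use, something like `identify(open(path, "rb").read(128))` would be enough for these three patterns, since none of them extends beyond the first 128 bytes.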

Members of the team who were not creating signatures were busy researching existing mime types for formats, identifying where the identification was failing, identifying discrepancies in the output from the three tools and investigating the effect of rights management/encryption on epubs.

We still have some work to do, particularly around determining how correct and robust our new signatures are. We need more test files, with ground truth, to check the new signatures. One concern is that we may be recognising the same file format, created by multiple sources, in multiple ways. For example, are PDB files all different formats, or are they compatible? In other words, are we recognising the producer of the file rather than the format? An analogy is PDFs created by different tools: they are one output format, not different formats depending on the producer.
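One way to explore the PDB question is to read the type and creator codes from each file's header and see how they vary across the corpus. The sketch below assumes the commonly documented Palm Database layout (a 78-byte header with four-byte type and creator codes at offsets 60 and 64); `pdb_type_creator` is a hypothetical helper, not part of any of the tools we tested.

```python
import struct

# Assumption: standard Palm Database header layout, 78 bytes in total,
# with the type code at offset 60 and the creator code at offset 64.
PDB_HEADER_SIZE = 78
TYPE_CREATOR_OFFSET = 60


def pdb_type_creator(header: bytes):
    """Return the (type, creator) codes from a PDB header, or None if too short."""
    if len(header) < PDB_HEADER_SIZE:
        return None
    type_code, creator_code = struct.unpack_from(
        ">4s4s", header, TYPE_CREATOR_OFFSET
    )
    # latin-1 maps every byte to a character, so decoding cannot fail
    return type_code.decode("latin-1"), creator_code.decode("latin-1")
```

Files that share a type code but differ in creator code would be candidates for "same format, different producer". Whether that is really the case still needs checking against ground truth.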

We also need to consider more closely the implications of creating mime types, which we were forced to do for some formats. What impact will this have on different tools’ abilities to, in theory, provide the same information about the same files, if the new mime types are not shared?

In a more general sense, how does false-positive identification of files affect any assessment of the contents of a repository? What are the digital preservation risks of a false or missed identification? What should be done when identification tools don't agree?
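As a starting point for the last question, the results from several tools can at least be compared mechanically. The following is a minimal sketch, not an agreed policy: `compare_results` and its three labels are hypothetical, and it assumes that a missing result or application/octet-stream counts as "no answer" rather than as a disagreement.

```python
# Assumption: application/octet-stream means "the tool did not know".
GENERIC_MIME = "application/octet-stream"


def compare_results(results: dict) -> str:
    """Classify per-tool mime types as 'unidentified', 'consistent' or 'conflict'."""
    specific = set()
    for mime in results.values():
        if mime is None:
            continue  # the tool gave no mime type at all
        mime = mime.lower()  # mime types are case-insensitive
        if mime != GENERIC_MIME:
            specific.add(mime)
    if not specific:
        return "unidentified"
    if len(specific) == 1:
        return "consistent"
    return "conflict"
```

Applied to the table above, the EPUB row (application/zip from Droid, application/epub+zip from Tika and file) comes out as "conflict", while the RB row comes out as "consistent" even though Tika's only specific answer, text/x-ruby, is wrong. Agreement between tools is therefore not evidence of a correct identification.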

1 Comment

johan
November 19, 2012 @ 5:10 pm CET

Just a quick comment on File and missing mime types: basically there can be 2 reasons for this. Sometimes simply no registered mime type exists for a format (surprisingly common actually). You can check this here:

http://www.iana.org/assignments/media-types/index.html

For some formats only informal mime types exist (i.e. ones that are not registered at IANA), and these can be hard to track down. Epub is a good example.

Also, not all entries in File's 'magic' include a mime type definition. It's pretty easy to add one yourself; see also this blog post I wrote some time ago:

https://openpreservation.org/blogs/2012-08-09-magic-editing-and-creation-primer
