Identifying ebooks (File ID Hackathon)

Several of us at The British Library took part in the CURATEcamp file id hackathon on Friday.

We decided that one area where we could make a useful contribution was the identification of the various ebook formats. Ebooks are an important content type for the British Library, especially with the expected implementation of non-print legal deposit legislation next year. For a long list of formats, see: http://wiki.mobileread.com/wiki/E-book_formats

Having gathered a test corpus of various ebook formats, we looked at how the files were recognised by File, Tika and Droid. Initial testing showed that Droid did not recognise any formats other than AZW/Mobi, PDF and TXT, whilst Tika misidentified many of the formats, most often as chemical/x-pdb. File recognised more of the formats than either of the other tools, but it did not identify every file, and for many of the files it did recognise it reported no useful mime type, giving application/octet-stream instead. That is perhaps a sensible default, since it is preferable to a misidentification.

Table showing results before our work on file signatures was carried out:

| Format + mime type | Droid result | Tika result | File result |
|---|---|---|---|
| AZW/Mobi: application/vnd.amazon.ebook, application/x-mobipocket-ebook | PocketMobi* | Recognised as either application/vnd.amazon.ebook or text/html | Mobipocket E-book. Mime type: application/octet-stream |
| EPUB: application/epub+zip | application/zip | application/epub+zip | application/epub+zip |
| PDB (5 formats) | One recognised as PocketMobi* | chemical/x-pdb | Recognised 3/5 formats. All mime types: application/octet-stream |
| FB2: application/x-fictionbook+xml | text/xml | application/x-fictionbook+xml | application/xml |
| PRC | PocketMobi* | text/html | Recognised but mime type: application/octet-stream |
| LRF | | application/octet-stream | Recognised but mime type: application/octet-stream |
| LIT: application/x-ms-reader | | application/octet-stream | application/x-ms-reader |
| PKG: application/x-newton-compatible-pkg | | application/octet-stream | Recognised but mime type: application/octet-stream |
| RB: application/x-rocketbook | | text/x-ruby | application/octet-stream |
| TCR | | application/octet-stream | application/octet-stream |

*indicates no mimetype specified by tool.  Results in italics indicate incorrect identification.

We then spent the day using fidget to develop signatures that Tika could use to identify the ebooks in our test corpus. By my count we added thirteen signatures to a custom-mimetypes.xml file, which enabled Tika to correctly identify all of our test files by the end of the day.
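To illustrate what a signature of this kind amounts to, here is a minimal Python sketch of offset-based magic matching. It is not the content of our custom-mimetypes.xml and it does not use Tika's XML syntax. The names `SIGNATURES`, `FALLBACK` and `identify` are hypothetical, and the byte patterns are assumptions taken from the published EPUB container and Palm database layouts rather than from the signatures we wrote on the day.

```python
# Each signature is a list of (offset, expected bytes) pairs that must all match.
# Illustrative only: these are NOT the signatures written at the hackathon.
SIGNATURES = {
    # EPUB: a ZIP whose first entry is an uncompressed file named "mimetype"
    # (assumes a local file header with no extra field, so data starts at 38).
    "application/epub+zip": [
        (0, b"PK\x03\x04"),
        (30, b"mimetype"),
        (38, b"application/epub+zip"),
    ],
    # Mobipocket: Palm database type and creator codes, starting at offset 60.
    "application/x-mobipocket-ebook": [(60, b"BOOKMOBI")],
}

# Reported when nothing matches, mirroring the default seen in the tools.
FALLBACK = "application/octet-stream"


def identify(header: bytes) -> str:
    """Return the first mime type whose byte patterns all match the header."""
    for mime_type, patterns in SIGNATURES.items():
        if all(header[off:off + len(magic)] == magic for off, magic in patterns):
            return mime_type
    return FALLBACK
```

In practice the function would be given the first hundred or so bytes of each file. Requiring several patterns to match, as in the EPUB entry, is one way to avoid labelling every ZIP file as an ebook.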

Members of the team who were not creating signatures were busy researching existing mime types for the formats, working out where identification was failing, noting discrepancies in the output from the three tools, and investigating the effect of rights management/encryption on EPUBs.

We still have some work to do, particularly around determining how correct and robust our new signatures are. We need more test files, with ground truth, to check them. One concern is that we may be recognising the same file format, created by multiple sources, in multiple ways. For example, are the PDB files all different formats, or are they compatible? That is, are we recognising the producer of the file rather than the format? An analogy would be PDFs created by different tools: they are one output format, not different formats depending on the producer.
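One way to start on the PDB question is to look at the Palm database header, which stores a four-byte type code and a four-byte creator code. The sketch below reads both. The function name `pdb_type_creator` is hypothetical, and the offsets (60 and 64) are an assumption based on the published Palm database layout, not something taken from our signatures.

```python
from typing import Tuple


def pdb_type_creator(header: bytes) -> Tuple[str, str]:
    """Return the (type, creator) codes from a Palm database header.

    Assumes the standard PDB layout: a 32-byte name followed by fixed-size
    fields, with the type code at offset 60 and the creator code at 64.
    """
    if len(header) < 68:
        raise ValueError("need at least 68 bytes of header")
    # Codes are normally ASCII; replace anything else so we never crash here.
    type_code = header[60:64].decode("ascii", errors="replace")
    creator_code = header[64:68].decode("ascii", errors="replace")
    return type_code, creator_code
```

Grouping a corpus by these pairs would show which codes our PDB files actually carry. It cannot by itself tell us whether two different codes denote incompatible formats or simply different producers, which is why test files with ground truth are still needed.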

We also need to consider more closely the implications of creating mime types, which we were forced to do for some formats. If the new mime types are not shared, what impact will this have on the ability of different tools to provide, in theory, the same information about the same files?

In a more general sense how does false-positive identification of files affect any assessment of the contents of a repository?  What are the digital preservation risks of a false or missed identification?  What should be done when identification tools don’t agree?
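As a starting point for the last question, disagreement between tools can at least be surfaced automatically. The following Python sketch summarises the mime types reported for a single file. The function name `compare_results` and the status labels are hypothetical, and treating application/octet-stream as "no identification" rather than as a competing answer is an assumption on our part, not an agreed policy.

```python
from typing import Dict, Optional

NO_ANSWER = "application/octet-stream"


def compare_results(results: Dict[str, Optional[str]]) -> dict:
    """Summarise the mime types reported for one file by several tools.

    `results` maps a tool name to the mime type it reported, or None if
    the tool reported nothing.
    """
    answers = {}
    for tool, mime in results.items():
        normalised = (mime or "").strip().lower()
        # Skip empty results and the generic fallback type.
        if normalised and normalised != NO_ANSWER:
            answers[tool] = normalised

    distinct = set(answers.values())
    if not distinct:
        status = "unidentified"
    elif len(distinct) > 1:
        status = "conflict"
    elif len(answers) == 1:
        status = "uncorroborated"  # only one tool gave a real answer
    else:
        status = "agree"
    return {"status": status, "answers": answers}
```

Given the EPUB results from our table (Droid: application/zip; Tika and File: application/epub+zip), this reports a conflict. Deciding which tool to believe is still a human judgement, and a lone answer is flagged as uncorroborated because, as we saw with Tika, a single confident result can still be wrong.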

1 Comment

  1. johan
    November 19, 2012 @ 5:10 pm CET

    Just a quick comment on File and missing mime types: basically there can be 2 reasons for this. Sometimes simply no registered mime type exists for a format (surprisingly common actually). You can check this here:

    http://www.iana.org/assignments/media-types/index.html

    For some formats only informal mime types exist (i.e. ones that are not registered at IANA), and these can be hard to track down. Epub is a good example.

    Also, not all entries in File’s ‘magic’ include a mime type definition. It’s pretty easy to add one yourself; see also this blog post I wrote some time ago:

    https://openpreservation.org/blogs/2012-08-09-magic-editing-and-creation-primer
