EPUB for archival preservation: an update

Last year (2012) the KB released a report on the suitability of the EPUB format for archival preservation. A substantial number of EPUB-related developments have happened since then, and as a result some of the report's findings and conclusions have become outdated. This applies in particular to the observations on EPUB 3, and the support of EPUB by characterisation tools. This blog post provides an update to those findings. It addresses the following topics in particular:

  • Use of EPUB in scholarly publishing
  • Adoption and use of EPUB 3
  • EPUB 3 reader support
  • Support of EPUB by characterisation tools

In the following sections I will briefly summarise the main developments in each of these areas, after which I will wrap things up in a concluding section.

Use of EPUB in scholarly publishing

Although scholarly publishing is still dominated by PDF, the use of EPUB in this sector is on the rise. This blog post by Todd Carpenter gives the following examples:

At the time of writing, the above publishers are all using EPUB 2.

Adoption and use of EPUB 3

Over the last year a number of organisations that are representing the publishing industry have expressed their support of EPUB 3. The Book Industry Study Group (BISG) is a trade association for companies in the publishing industry. Last year (August 2012) BISG released a policy statement in which it endorsed "EPUB 3 as the accepted and preferred standard for representing, packaging, and encoding structured and semantically enhanced Web content — including XHTML, CSS, SVG, images, and other resources — for distribution in a single-file format". Early this year (March 2013) the International Publishers Association (IPA) issued a press release that also endorsed EPUB 3 as a "preferred standard format for representing HTML and other web content for distribution as single-file publications". IPA represents over 60 national publishing organisations from more than 50 countries. Finally, the European Booksellers Federation recently released a report on the interoperability of eBook Formats. Its authors did a comparison of the features and functionality provided by EPUB 3, Amazon's KF8 (Kindle) and Apple's e-book formats. They concluded that EPUB 3 "clearly covers the superset of the expressive abilities of all the formats", and that there is "no technical or functional reason not to use and establish EPUB 3 as an/the interoperable (open) ebook format standard". This all suggests that EPUB 3 is widely supported by the publishing industry.

Having said that, the actual use of EPUB 3 is still limited at this stage, even though some publishers have already started using the format. Earlier this year technical publisher O’Reilly started releasing all their new eBook bundles in EPUB 3 format. The announcement mentions that their backlist will be updated as well. Interestingly, they decided to create "hybrid" EPUBs that are backward-compatible with EPUB 2. In November 2012 publisher Hachette also announced the launch of their EPUB 3 program.

EPUB 3 reader support

At this time reader support for EPUB 3 is still limited, but there have been a number of significant developments since the second half of 2012:

Support of EPUB by characterisation tools

The 2012 report concluded that EPUB was not optimally supported by characterisation tools. This situation has improved considerably since then.


EPUB is now included in PRONOM, and has a corresponding DROID signature. This means that Fido should now be able to identify the format as well. On a side note, PRONOM doesn't differentiate between EPUB 2 and 3, and it appears that the current record (which is only an outline record anyway) either combines both versions, or only refers to EPUB 2. PRONOM should probably be more specific on this.
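To illustrate what a version-aware identification could look like, here is a minimal Python sketch. It is not how DROID or Fido work internally. It relies on two things from the EPUB container format: the first ZIP entry is a file named mimetype containing application/epub+zip, and META-INF/container.xml points to the package document, whose root element carries a version attribute. The function name epub_version is my own invention.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of META-INF/container.xml in the EPUB container format
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"
EPUB_MIMETYPE = b"application/epub+zip"


def epub_version(source):
    """Return the package version string (e.g. "2.0" or "3.0").

    Returns None if `source` (a path or file-like object) does not look
    like an EPUB. Hypothetical helper, for illustration only.
    """
    try:
        with zipfile.ZipFile(source) as zf:
            names = zf.namelist()
            # The first entry must be "mimetype" and hold the media type
            if not names or names[0] != "mimetype":
                return None
            if zf.read("mimetype").strip() != EPUB_MIMETYPE:
                return None
            # container.xml tells us where the package document lives
            container = ET.fromstring(zf.read("META-INF/container.xml"))
            rootfile = container.find(f".//{{{CONTAINER_NS}}}rootfile")
            if rootfile is None:
                return None
            package = ET.fromstring(zf.read(rootfile.get("full-path")))
            # The version attribute distinguishes EPUB 2 from EPUB 3
            return package.get("version")
    except (zipfile.BadZipFile, KeyError, ET.ParseError):
        return None
```

Calling epub_version("book.epub") on a hypothetical file returns the declared version, so EPUB 2 and EPUB 3 files can be told apart. Note that this only reports what the package document declares; it says nothing about whether the file is actually valid.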

Validation and feature extraction

The 2012 report included tests of two EPUB validator tools: epubcheck and flightcrew. While testing epubcheck in 2012, I wasn't entirely happy with the rather unstructured output that the tool produced. Also, I couldn't find any tool that was capable of extracting technical meta-information about an EPUB, like the presence of encryption or other digital rights management technology (feature extraction). Happily, starting with version 3.0 epubcheck is capable of extracting this kind of information. Moreover, it added an option to report its output in structured XML format that follows the JHOVE schema. I haven't done any elaborate testing, but a quick run on some of these EPUB 3 samples showed that epubcheck was able to identify font obfuscation, in which case a property hasEncryption (value true) is reported. I wasn't able to find any EPUB files with DRM, so I cannot confirm whether epubcheck detects this as well.
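For readers who want a feel for what this kind of feature extraction involves, below is a minimal Python sketch. It is an independent illustration and not epubcheck's implementation. It assumes that encryption is declared in META-INF/encryption.xml using XML Encryption elements, and that font obfuscation is flagged by one of two algorithm URIs (the IDPF and Adobe variants). The function name encryption_summary and the layout of the returned dictionary are my own. Entries that end up under "other" use some other algorithm; treating those as a possible sign of DRM is an assumption, not something this sketch can confirm.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML Encryption namespace used inside META-INF/encryption.xml
ENC_NS = "http://www.w3.org/2001/04/xmlenc#"
# Algorithm URIs that signal font obfuscation (IDPF and Adobe variants)
OBFUSCATION_ALGORITHMS = {
    "http://www.idpf.org/2008/embedding",
    "http://ns.adobe.com/pdf/enc#RC",
}


def encryption_summary(source):
    """Summarise the encryption declarations of an EPUB.

    `source` is a path or file-like object. Hypothetical helper,
    for illustration only.
    """
    with zipfile.ZipFile(source) as zf:
        if "META-INF/encryption.xml" not in zf.namelist():
            return {"has_encryption": False, "obfuscated": [], "other": []}
        root = ET.fromstring(zf.read("META-INF/encryption.xml"))

    obfuscated, other = [], []
    for data in root.iter(f"{{{ENC_NS}}}EncryptedData"):
        method = data.find(f"{{{ENC_NS}}}EncryptionMethod")
        ref = data.find(f".//{{{ENC_NS}}}CipherReference")
        algorithm = method.get("Algorithm") if method is not None else None
        uri = ref.get("URI") if ref is not None else None
        # Sort each resource by whether its algorithm is an obfuscation one
        if algorithm in OBFUSCATION_ALGORITHMS:
            obfuscated.append(uri)
        else:
            other.append(uri)
    return {"has_encryption": True, "obfuscated": obfuscated, "other": other}
```

For an EPUB with one obfuscated font, the result would list that font's path under "obfuscated" and leave "other" empty.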


As for flightcrew, no new versions of that tool have been released since August 2011, and it looks like it is not under any active development.

Discussion and conclusions

Since the release of the KB report on the suitability of EPUB for archival preservation, the EPUB landscape has changed considerably. First, a number of academic publishers have started to offer scholarly content in this format. Although EPUB 3 is still in its early stages, various organisations representing the publishing industry have explicitly expressed their support of EPUB 3. A number of software applications now exist that are able to read the format, and work on a high-performance open source EPUB 3 Software Development Kit is backed by major players in the digital publishing industry (including e-reader manufacturers such as Kobo and Sony). EPUB support by characterisation tools has improved as well, mostly thanks to a number of recent enhancements of epubcheck. So, overall, EPUB's credentials as a preservation format appear to have improved quite a bit over the last year. In the case of EPUB 3 it's still too early to say anything about actual adoption, but the conditions for adoption to happen look pretty favourable. This is something I will get back to in my next update, perhaps in another year from now.


