Convert me if you can – Preservation Planning with malicious PDFs

Do we fear that not understanding the PDF error messages today might put the readability of our archived PDFs in jeopardy tomorrow?

Preservation Planning is still not a widespread topic among Digital Archivists; most of us still struggle with ingest, storage and access. Migration and validation are nevertheless good practice in pre-ingest and ingest, but what do we do with the plethora of error messages?

Our Digital Archive went live in April 2015, so it will be ready for primary school soon. In 2012 we started testing Preservation Planning in our test system (which we have been working with since 2010, so it will be a bitchy teenager quite soon). Still, we are reluctant to go live with our current preservation plan.

In these days of home office and quarantine, I had an email conversation with a colleague who is still quite new to our field, and once again I was asked about PDF error messages. She is not the only one asking; indeed, the same questions pop into our minds regularly:

  • What does this PDF error mean?
  • How do I fix it?
  • Does it have an impact on long-term availability and, if so, what kind?

My guess is that we all want to prevent a situation in which, some years from now, our users complain that our PDFs no longer open or look funny (missing parts, compromised tables and graphics), and we are to blame because we did not understand the error messages today.

Today, our Digital Archive has 213,913 PDF files in permanent storage. Four years ago we bought pdfaPilot to convert, if possible, all of these PDFs into PDF/A-2b, and integrated it as a plugin into the Preservation Planning module of Rosetta, the digital archival system we are now using in our tenth year.

So, ready to go, are we?

Back in 2016, before we finally bought pdfaPilot, we conducted a test converting a few tens of thousands of PDFs and reached a success rate of more than 92%. We learned that our worst problem was non-embedded fonts and learned how to embed the fonts after the fact, again using the features of pdfaPilot, which improved our success rate further to 95%.

Due to some inconvenient features of Rosetta's preservation module, we did not go live then, but had a few things fixed and improved. Today there is still one bug left, but we recently decided it should not be a showstopper and ran a test in the Rosetta test system to see what the success rate is and, more importantly, which error messages are output when files cannot be converted.

We left out all PDFs that had already been migrated to PDF/A-2b in past tests, all password-protected PDF files (as pdfaPilot would not convert them) and all PDF files with the JHOVE error message “JHOVE compression method unknown (Error message contains Keywords Compression method is invalid or unknown to JHOVE)”. This error only seems to occur with PDF 1.6 and 1.7 (PUID fmt/19 and fmt/20), and for PDFs that throw this JHOVE error it cannot be determined whether password protection is in place.
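The three exclusion rules can be sketched as a small filter. This is only an illustration: the `PdfRecord` class and its field names are hypothetical and do not reflect how Rosetta or JHOVE actually expose this information. Only the keyword string is taken from the error message quoted above.

```python
from dataclasses import dataclass, field

# Keywords taken from the JHOVE error message quoted in the text.
JHOVE_KEYWORDS = "Compression method is invalid or unknown to JHOVE"


@dataclass
class PdfRecord:
    """Hypothetical per-file record; the field names are assumptions."""
    name: str
    already_pdfa2b: bool = False
    password_protected: bool = False
    jhove_messages: list[str] = field(default_factory=list)


def is_candidate(rec: PdfRecord) -> bool:
    """True if the file should be included in the conversion test."""
    if rec.already_pdfa2b or rec.password_protected:
        return False
    # Skip files whose JHOVE output contains the compression keywords.
    return not any(JHOVE_KEYWORDS in msg for msg in rec.jhove_messages)


def select_corpus(records: list[PdfRecord]) -> list[PdfRecord]:
    """Keep only the records that pass all three exclusion rules."""
    return [rec for rec in records if is_candidate(rec)]
```

Calling `select_corpus` on a list of such records would return only the files that are neither migrated already, nor password-protected, nor flagged with the JHOVE compression error.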

This left 16,127 for us to convert, distributed like this:

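| Format | PUID | Corpus to migrate | Able to migrate | Success rate in % |
|---|---|---|---|---|
| PDF 1.1 | fmt/15 | 946 | 518 | 54.8 |
| PDF 1.2 | fmt/16 | 4469 | 3529 | 79 |
| PDF 1.3 | fmt/17 | 3632 | 3182 | 88 |
| PDF 1.4 | fmt/18 | 2917 | 2221 | 76 |
| PDF 1.5 | fmt/19 | 1410 | 1261 | 89 |
| PDF 1.6 | fmt/20 | 10714 | 10489 | 98 |
| PDF 1.7 | fmt/276 | 66 | 58 | 88 |
| PDF/A | fmt/95 | 2687 | 2608 | 97 |
| all | | 26841 | 23866 | 89 |
| all without PDF 1.1 | | 25895 | 23348 | 90 |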
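The success rates can be recomputed from the two count columns. The following is a minimal sketch; the counts are as read from the table above, and the function names are only illustrative.

```python
# (format, PUID, corpus to migrate, able to migrate), as read from the table
ROWS = [
    ("PDF 1.1", "fmt/15", 946, 518),
    ("PDF 1.2", "fmt/16", 4469, 3529),
    ("PDF 1.3", "fmt/17", 3632, 3182),
    ("PDF 1.4", "fmt/18", 2917, 2221),
    ("PDF 1.5", "fmt/19", 1410, 1261),
    ("PDF 1.6", "fmt/20", 10714, 10489),
    ("PDF 1.7", "fmt/276", 66, 58),
    ("PDF/A", "fmt/95", 2687, 2608),
]


def success_rate(corpus: int, migrated: int) -> float:
    """Share of successfully migrated files in percent."""
    return 100 * migrated / corpus


def totals(rows) -> tuple[int, int]:
    """Sum the corpus and migrated columns over the given rows."""
    return sum(r[2] for r in rows), sum(r[3] for r in rows)


if __name__ == "__main__":
    for name, puid, corpus, migrated in ROWS:
        print(f"{name:8} {puid:8} {success_rate(corpus, migrated):5.1f} %")

    all_corpus, all_migrated = totals(ROWS)
    print(f"all      {success_rate(all_corpus, all_migrated):5.1f} %")

    # Same calculation without the underperformer PDF 1.1
    rest = [r for r in ROWS if r[0] != "PDF 1.1"]
    rest_corpus, rest_migrated = totals(rest)
    print(f"all without PDF 1.1 {success_rate(rest_corpus, rest_migrated):5.1f} %")
```

Summing the rows gives 26,841 files in the corpus and 23,866 migrated files, which matches the "all" row of the table.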

So, why is the success rate so low? Even without the underperformer PDF 1.1, the rate is lower than in our initial tests in 2016. pdfaPilot has improved since then and can handle a good handful more PDF errors than it could four years ago. Besides, we have changed nothing in our settings and configuration.

I can only guess, but I would say that the corpus of the latest test contains more malicious PDFs than the corpus we used in 2016. The real-world PDFs we receive nowadays are even more heterogeneous and malicious than they were in 2016. Back then, more than 95% of our errors were about non-embedded fonts. Even now we cannot fix them all: some fonts cannot be embedded for copyright reasons, and others we cannot embed because we do not yet have them in our font folders, e.g. because they are not free of charge. We constantly analyse the error files to see which fonts are still missing and which we can possibly add to our font folders to improve our success rate.

To further improve the rate, we have started to analyse the non-font-related PDF errors, initially listing only the top four.

PDF errors

Having bought a professional tool to convert to PDF/A, we have the advantage of being able to use its customer support, which we did in the hope of getting rid of the four most common PDF errors that prevent PDF/A conversion.

Annotation visible flag set

The PDF/A standard does not allow invisible annotations. They either have to be visible or be deleted altogether (Source). The support team helped us update our configuration so that invisible annotations are now deleted. We do not expect to see this error message again any time soon.

Font name/ Name object is not a valid UTF-8 string

This case is still recent, so it is not one hundred percent solved yet, but the support team sounds optimistic that these PDFs will be convertible to PDF/A. A possible workaround would be to sort the fonts into font subsets, which would prevent this error from being thrown and can be achieved by adjusting the pdfaPilot configuration.

Implementation limit: Max. number of nested graphic states

Although this problem is not entirely unsolvable when handling one PDF at a time, e.g. manually with Adobe Preflight, there is no easy way to alter our configuration so that these files are converted to PDF/A automatically. A solid workaround would be to render (rasterize) the page contents again, although this would be a major intervention.

ICC profile is not valid

This is our top hit, right after all font-related problems. Our configuration could be updated so that minor issues with ICC profiles can now be fixed automatically. However, some PDFs include an ICC profile that encodes the whole color space in a wrong and invalid way. Usually it is possible to open and read such a PDF, but a conversion is impossible.

Since updating our configuration, we have not yet conducted a mass test to see how many of our PDFs have a “too bad to be converted” ICC profile.

Next steps

Since three of the top four problems are at least partially solved, we have analysed our error files further and found the next top three errors:

  • A required ‘Subtype’ entry is missing
  • Syntax problem: PDF contains data after end of file marker
  • Implementation limit: Name object with a length greater than 127 byte

Thankfully, at least two of these errors are self-explanatory, and having seen junk after the end-of-file marker pretty often in PDFs and other files, I am sure this problem will be solvable in most cases. We will see whether a too-long name object can easily be shortened. I have no idea about the required ‘Subtype’ entry and will gratefully welcome ideas and hints about it.

So, this is how it will go. We will always be on the hunt for PDF error messages to improve our success rate. My guess is that some PDFs are just too malicious to convert and will stay so, no matter how much we learn about the PDF standard and its possible errors, and regardless of how good the tool is. I just hope the share of such files will drop in the years to come and that we will have a reliable stack of PDF/A-2b files in our archive which will not keep me awake at night, wondering if in a few years there will be only junk left, not only after the EOF marker but before it as well.
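For the "data after end of file marker" case, a rough check can be sketched in a few lines. This is not what pdfaPilot or JHOVE do internally; it only assumes that the last `%%EOF` in the file is the real end-of-file marker, and the function names are made up for this illustration. Any cutting should of course happen on a working copy, never on the archived original.

```python
EOF_MARKER = b"%%EOF"


def trailing_bytes_after_eof(data: bytes) -> bytes:
    """Return whatever follows the last %%EOF marker, ignoring line endings."""
    pos = data.rfind(EOF_MARKER)
    if pos == -1:
        raise ValueError("no %%EOF marker found")
    return data[pos + len(EOF_MARKER):].strip(b"\r\n")


def strip_trailing_data(data: bytes) -> bytes:
    """Cut everything after the last %%EOF marker and end with a newline."""
    pos = data.rfind(EOF_MARKER)
    if pos == -1:
        raise ValueError("no %%EOF marker found")
    return data[:pos + len(EOF_MARKER)] + b"\n"


if __name__ == "__main__":
    # Tiny synthetic example, not a valid PDF
    sample = b"%PDF-1.4\n...objects...\n%%EOF\njunk from a mail gateway"
    print(trailing_bytes_after_eof(sample))
    cleaned = strip_trailing_data(sample)
    print(trailing_bytes_after_eof(cleaned))
```

Searching for the last marker rather than the first matters because incrementally updated PDFs legitimately contain several `%%EOF` markers.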
