Preserving PDF – identify, validate, repair
22 participants from 8 countries – the UK, Germany, Denmark, the Netherlands, Switzerland, France, Sweden and the Czech Republic, not to forget umpteenthousand defect or somehow interesting PDF files brought to the event.
Not only is this my first Blog entry on the OPDF website, it is also about my first Hackathon. I guess it was Michelle's idea in the first place to organise a Hackathon with the Open Planets Foundation on the PDF topic and to have the event in our library in Hamburg. I am located in Kiel, but as we are renewing our parquet floor in Kiel at the moment, the room situation in Hamburg is much better (Furthermore, it's Hamburg which has the big airport).
The preparation for the event was pretty intense for me. Not only the organisation in Hamburg (food, rooms, water, coffee, dinner event) had to be done, much more intense was the preparating in terms of the Hacking itself.
I am a library- and information scientiest, not a programmer. Sometimes I would rather be a programmer considering my daily best-of-problems, but you should dress for the body you have, not for the body you'd like to have.
Having learned the little I know about writing code within the last 8 months and most of it just since this july, I am still brand-new to it. As there always is a so-called "summer break" (which means that everybody else is in a holiday and I actually have time to work on difficult stuff) I had some very intense Skype calls with Carl from the OPF, who enabled me to put all my work-in-progress PDF-tools to Github. I learned about Maven and Travis and was not quite recovered when the Hackathon actually started this monday and we all had to install some Virtual Ubuntu machine to be able to try out some best-of-tools like DROID, Tika and Fido and run it over our own PDF files.
We had Olaf Drümmer from the PDF Association as our Keynote Speaker for both days. On the first day, he gave us insights about PDF and PDF/A, and when I say insights, I really mean that. Talking about the building blocks of a PDF, the basic object types and encoding possibilities. This was much better than trying to understand the PDF 1.7 specification of 756 pages just by myself alone in the office with sentences like "a single object of type null, denoted by the keyword null, and having a type and value that are unequal to those of any other object".
We learned about the many different kinds of page content, the page being the most important structure unit of a PDF file and about the fact that a PDF page could have every size you can think of, but Acrobat 7.0 officially only supports a page dimension up to 381 km. The second day, we learned about PDF(/A)-Validation and what would theoretically be needed to have the perfect validator. Talking about the PDF and PDF/A specifications and all the specification quoted and referenced by these, I am under the impression that it would last some months to read them all – and so much is clear, somebody would have to read and understand them all. The complexity of the PDF file, the flexibility of the viewers and the plethora of users and user's needs will always take care of a heterogenious PDF reality with all the strangeness and brokenness possible. As far as I remember it is his guess that about 10 years of manpower would be needed to build a perfect validator, if it could be done at all. Being strucked by this perfectly comprehensible suggestions, it is probably not surprising that some of the participants had more questions at the end of the two days than they had at the beginning.
As PDF viewers tend to conceal problems and tend to display problematic PDF files in a decent way, they are usually no big help in terms of PDF validation or ensuring long-term-availability, quite the contrary.
Some errors can have a big impact on the longterm availability of PDF files, expecially content that is only referred to and not embedded within the file and might just be lost over time. On the other hand, the "invalid page tree node" which e. g. JHOVE likes to put its finger on, is not an error, but just a hint that the page tree is not balanced and the page cannot be found in the most efficient way. Even if all the pages would just be saved as an array and you would have to iterate through the whole array to go to a certain page, this would only slow down the loading, but does not prevent anybody from accessing the page he wants to read, especially if the affected PDF document only has a couple of dozen pages.
During the afternoon of the first day, we collected specific problems everybody has and formed working groups, each engaging in a different problem. One working group (around Olaf) started to seize JHOVE error messages and trying to figure out which ones really bear a risk and what do they mean in the first place, anyway? Some of the error messages definitely describe real existent errors and a rule or specification is hurt, but will practically never cause any problems displaying the file. Is this really an error then? Or just burocracy? Should a good validator even display this as an error – which formally would be the right thing to do – or not disturb the user unnessecarily?
Another group wanted to create a small java tool with an csv output that looks into a PDF file and puts out the information which Software has created the PDF file and which validation errors does it containt, starting with PDFBox, as this was easy to implement in Java. We came so far to get the tool working, but as we brought expecially broken PDF files to the event, it is not yet able to cope with all of them, we still have to make it error-proof.
By the way, it is really nice to be surrounded by people who obviously live in the same nerdy world than I do. When I told them I could not wait to see our new tool's output and was anxious to analyse the findings, the answer was just "And neither can I". Usually, I just get frowning fronts and "I do not get why you are interested in something so boring"-faces.
A third working group went to another room and tested the already existing tools with brought PDF samples in the Virtual Ubuntu Environment.
There were more ideas, some of them seemed to difficult or to impossible to be able to create a solution in such a small time, but some of us are determined to have some follow-up-event soon.
For example, Olaf stated that sometimes the text extraction in a PDF file does not work and the participant who sat next to me suggested to me, we could start to check the output against dicitonaries to see if the output still make sense. "But there are so many languages" I told him, thinking about my libary's content. "Well, start with one" he answered, following the idea that a big problem often can be split in several small ones.
Another participant would like to know more about the quality and compression of the JPEGs embedded within his PDF files, but some other doubted this information could still be retrieved.
When the event was over tuesday around 5 pm, we were all tired, but happy, with clear ideas or new interesting problems in our heads.
And just because I was already asked this today because I might look slightly tired still. We did sleep during the night. We did not hack it all through or slept on mattrasses in our library. Some of us had quite a few pitcher full of beer during the evening, but I am quite sure everybody made it to his or her Hotel room.
Twitter Hashtag #OPDFPDF