Destination Null: one of the many causes of PDF-hul 122

This is a continuation of the OPF blog posts I’ve been writing about various error messages from JHOVE (and other file format validators). The error message du jour is PDF-hul 122. The title already hints that there are many different causes for PDF-hul 122 errors. Or, to put it in the words of the JHOVE GitHub wiki entry for this error: “This needs review, it’s a horrible cludge that eats and PDFExceptions thrown while processing destination objects and always sets the invalid flag. Seems dubious behaviour. It, for example, reports the error “Invalid indirect destination – referenced object ‘ ‘ cannot be found”. This error comes from PDF-HUL-149.”
Therefore a word of warning at the beginning: if you come across this error message, take a closer look. The behavior described in this blog might be similar to yours, but it doesn’t have to be! It could be much more severe.

The file needed to reproduce this blog is publicly available at the time of writing.
I’ll be describing the process using a methodology I introduced at iPRES 2023. If you’re interested in reading the paper behind it, you can do so here. An abridged version of the workflow presented in this blog is also available on COPTR.

Validation error

The analysis starts with the “Well-Formed, but not valid” status that JHOVE reports together with the PDF-HUL-122 error “Invalid Destination”. Table 1 contains the version of JHOVE that was used for validation.

Cross-check with other tools

Let’s cross-check the error with other tools. For PDF, some of my go-to tools are pdfcpu, qpdf and PDF Checker. pdfcpu’s relaxed mode, qpdf and PDF Checker report no errors. pdfcpu’s strict mode reports a different error, which is font-related and thus seems to be unrelated. We will ignore the font error within the context of this blog.

| Tool and version / mode | Command | Result |
|---|---|---|
| JHOVE v1.28.0, PDF-hul 1.12.4 | GUI | ErrorMessage: edu.harvard.hul.ois.jhove.module.pdf.PdfInvalidException: Invalid Destination; ID: PDF-HUL-122; Offset: 540245; Status: Well-Formed, but not valid |
| pdfcpu v0.6.0 dev / relaxed mode | pdfcpu validate -mode relaxed inputfile | OK |
| pdfcpu v0.6.0 dev / strict mode | pdfcpu validate -mode strict inputfile | dict=type1FontDict required entry=FirstChar missing |
| qpdf v9.1.1 | qpdf --check --verbose inputfile | No syntax or stream encoding errors found |
| PDF Checker 2.1.0 | pdfchecker -i inputfile -j pathtoprofile\everything.json | No errors |
Table 1: Result of validation with JHOVE as well as other validators
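
If you want to repeat the command-line part of the cross-check in one go, a small wrapper can help. The following is only a sketch: it reuses the pdfcpu and qpdf invocations from Table 1, while the function names `build_commands` and `run_cross_check` are my own invention. JHOVE is left out because it was run via the GUI here, and PDF Checker is left out because its profile path depends on the local installation.

```python
import shutil
import subprocess


def build_commands(pdf_path):
    """Return the command-line invocations from Table 1, keyed by a label."""
    return {
        "pdfcpu relaxed": ["pdfcpu", "validate", "-mode", "relaxed", pdf_path],
        "pdfcpu strict": ["pdfcpu", "validate", "-mode", "strict", pdf_path],
        "qpdf": ["qpdf", "--check", "--verbose", pdf_path],
    }


def run_cross_check(pdf_path):
    """Run every tool that is installed; collect exit code and combined output."""
    results = {}
    for label, cmd in build_commands(pdf_path).items():
        if shutil.which(cmd[0]) is None:
            results[label] = None  # tool is not on PATH on this machine
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[label] = (proc.returncode, (proc.stdout + proc.stderr).strip())
    return results
```

Calling `run_cross_check("inputfile.pdf")` returns one entry per tool, so the messages can be compared side by side in the same way as in Table 1.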

Locate error in file and in specification

Locating the error in the file and in the specification goes hand in hand and might take a few rounds until the exact issue can be pinpointed. In this case, I started by locating the error in the file via the offset given by JHOVE (540245).
The offset lands between two objects. In that case it seems safe to assume that the problem is related to the object before the offset. Think of it as “I’ve looked at this object and something is wrong with it, so I’m throwing an error”. The object before the offset contains the following:

80 0 obj
<</D(Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45593:62)/S/GoTo>>
endobj
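
Finding “the object before the offset” can also be done programmatically. The sketch below is a rough approach under some assumptions: it scans the raw bytes for `N G obj` headers and returns the last one that starts before the given offset. The name `object_before` is hypothetical, and the simple pattern may produce false hits inside binary stream data and will not see objects stored in object streams, so treat the result as a hint and verify it in an editor.

```python
import re

# Matches an object header such as "80 0 obj" (object number, generation, keyword).
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def object_before(data, offset):
    """Return (obj_num, gen_num, start) of the last 'N G obj' header before offset.

    Returns None if no header is found before the offset.
    """
    last = None
    # Only look at the bytes in front of the offset reported by the validator.
    for match in OBJ_HEADER.finditer(data, 0, offset):
        last = (int(match.group(1)), int(match.group(2)), match.start())
    return last


if __name__ == "__main__":
    # "inputfile.pdf" is a placeholder; 540245 is the offset reported by JHOVE.
    with open("inputfile.pdf", "rb") as handle:
        print(object_before(handle.read(), 540245))
```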

GoTo indicates a pointer action which requires a destination D, namely (Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45593:62) (see ISO 32000-2:2017 Section 12.6.4.2 Go-To actions). Comparing what we have to the required entries listed in the specification, things look ok: we have both required entries, namely an action /GoTo and a destination /D. Object 80 in itself is therefore valid. We need to take a closer look at the destination itself.
The destination contained here is an indirect one that refers to another object by name: it’s a Named Destination (see ISO 32000-2:2017 Section 12.3.2.4 Named destinations). Whenever named destinations are used, you also need a Name dictionary which links each name to the actual object the name is pointing to.

There are different ways to find the Names dictionary. I took the shortcut and used Didier Stevens’ pdf-parser and just searched for /Names. I could also have used the same tool to search for Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45593:62 and would have found out that way that the reference is actually used twice in the PDF (once in object 80 and once in the Name tree).
The Name tree found in the dictionary (obj 145) looks as follows:

  <<
    /Names [(Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45487:36)[13 0 R/Fit]
    (Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45491:37)[17 0 R
    /Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45495:38)[25 0 R
    /Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45499:39)[25 0 R
    /Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45507:41)[25 0 R
    /Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45511:42)[27 0 R
    /Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45519:44)[27 0 R
    /Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45536:48)[31 0 R
    /Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45556:53)[38 0 R
    /Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45560:54)[38 0 R
    /Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45564:55)[40 0 R
    /Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45593:62)[null
    /Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45609:47)[31 0 R
    /Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45611:45)[31 0 R
    /Fit ]]
  >>

Each destination is expected to have a name (e.g., Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45487:36), a location to go to (e.g., 13 0 R, which in this case is the object for page 6 of the document) and a destination syntax telling the program how to display the destination (e.g., /Fit, which means that the target should be displayed “with its contents magnified just enough to fit the entire page within the window both horizontally and vertically”; see ISO 32000-2:2017 Table 149).
Looking at the name tree above, we can easily spot that the name we wanted to take a closer look at (Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45593:62) does not contain a target object. Instead it just says null. This violates the specification, as described above.
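
Spotting the null entry by eye works for a short name tree, but for longer ones a quick scan can help. The sketch below is an illustration, not a PDF parser: the function name `null_destinations` is made up, and the pattern assumes that names are simple literal strings without nested or escaped parentheses and that each destination array is written inline after its name, as in the excerpt above.

```python
import re

# A name in parentheses, followed by a destination array whose first element
# is either "null" or an indirect reference such as "13 0 R".
ENTRY = re.compile(rb"\(([^()]*)\)\s*\[\s*(null|\d+\s+\d+\s+R)\b")


def null_destinations(names_array):
    """Return the names whose destination array starts with null."""
    return [
        match.group(1).decode("latin-1")
        for match in ENTRY.finditer(names_array)
        if match.group(2) == b"null"
    ]


# Three entries copied from the name tree shown above.
excerpt = (
    b"/Names [(Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45564:55)[40 0 R\n"
    b"/Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45593:62)[null\n"
    b"/Fit ](Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45609:47)[31 0 R\n"
    b"/Fit ]]"
)
print(null_destinations(excerpt))
```

For the excerpt, only the entry ending in :.45593:62 should be listed, which matches what we found by reading the name tree.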

We’ve now found our problematic place in the file, namely (Rechte_von_Eltern_in_der_Kita_2018_V7_bf.indd:.45593:62)[null /Fit ], and also in the spec (see ISO 32000-2:2017 Sections 12.3.2 / 12.3.2.4).

Impact on rendering / file behavior

The next step is to ask whether the error is fixable and, if so, to fix it. However, there is actually a step in between: deciding whether the file should be fixed at all. In the case of this file, the decision can go both ways. Like most things in life, this decision should be an informed one. So let’s look at it in detail.

First we need to understand the impact that the error has on the rendering of the file, meaning its appearance but also its functional behavior. For some errors the problem is either directly visible in the viewing application or an error message of the viewing application makes us aware of it. In this case it’s not quite as easy, so we need to pinpoint the problematic location in the file. The fact that it’s a GoTo action combined with a named destination helps in this case: we’ll need to look for an internal link like those typically found in a table of contents. If we look at the other destinations in the name tree we can see that they all link to page objects. There are different ways to verify this: we can look at the objects in a (hex) editor or via terminal tools, or use tools that display the visual structure of the PDF pages, such as Preflight or iText RUPS, which I’ve talked a bit about in a different post before.

The expected behavior of the table of contents is that the section links are actionable: if you click on one, it navigates you to the respective section in the PDF file. With a bit of digging I could reconstruct which destination belongs to which table of contents entry. The “problematic” destination pointing to null is the last one. We can confirm this by clicking on “Schlussfolgerung” as many times as we want: nothing ever happens.

In other words, the validation error has a behavioral impact on the file. A link that should be actionable is not actionable.

Figure 1: Table of Contents: view as is (left) and with destinations linked to in GoTo action (right)

Fix it!

The easiest way to fix it would be to just replace null with the correct page object, which is 811 0 R. The problem is that 811 0 R takes up more bytes than null, so the byte offsets recorded in the PDF wouldn’t be correct anymore. That’s not an option.
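
To make the offset problem concrete, here is a small back-of-the-envelope sketch. The cross-reference table of a PDF stores the byte position of each object, so everything located after an insertion moves by the size difference. The helper name `shifted_offsets`, the example offsets and the edit position are hypothetical and only serve to illustrate the arithmetic.

```python
def shifted_offsets(offsets, edit_pos, delta):
    """Return new byte offsets after delta bytes were inserted at edit_pos.

    Objects that start after the edit move by delta; earlier ones stay put.
    """
    return [off + delta if off > edit_pos else off for off in offsets]


old, new = b"null", b"811 0 R"
delta = len(new) - len(old)  # 7 bytes instead of 4, so 3 extra bytes
print(delta)

# Hypothetical object offsets, with the edit happening at byte 400.
print(shifted_offsets([100, 500, 900], edit_pos=400, delta=delta))
```

Every cross-reference entry behind the edit, as well as the startxref value if the table itself comes later in the file, would have to be corrected by those three bytes, which is why a plain search-and-replace in an editor is not enough.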

The full version of Adobe Acrobat (as well as other professional PDF editors, I’m sure) offers an option to manipulate the link. We can delete the existing destination and add our own GoTo action to page 14. This fixes the problem, and revalidating the file in JHOVE now returns the file as being “Well-Formed and valid”. Congratulations!


Figure 2: Edit Link menu in Adobe Acrobat Pro 2017. The null link needs to be deleted first and then a correct one needs to be added.

To fix or not to fix – that is the question

Digital preservation wouldn’t be digital preservation if there weren’t a deeper question that we could wax philosophical about for hours on end. This file poses such a question. Above we showed that the error can be fixed; however, fixing it changes the internal structure of the file a little bit. For example, the original file had fewer objects and the root now has a different object ID. While the file is not impacted in a visual or behavioral way besides the intended fix, an authenticity purist might take offense at that change. Also, the file was passed to us as is. What if the error was intended!? Even though that might seem unlikely, we should always carefully consider which errors we want to fix. Ideally, we have policies in place that guide us on what to fix and what not, and on how to document our fixes.

Personally, I would rank this specific instance of the error as “Low”, meaning a fix is not necessary. My reasoning is that there is only a navigation loss and no information loss within the document: the table of contents is still intact, as all page numbers are written behind each entry and those page numbers align with the numbering of the document.
However, this is only true for this specific instance of the error. If, e.g., a destination to a second PDF was lost, the impact might be much more severe. And thus we’ve come full circle: the error described here is just one of the many PDF-hul 122 errors! I would always recommend examining those errors more closely to check whether the impact is indeed low or severe.

Workflow Summary

Figure 3: Summary of the validation-error-treatment workflow

1 Comment

  1. samalloing
    January 23, 2024 @ 7:12 am CET

    Hi Micky

    Thanks for taking the time for writing this. Very useful!
    Like you said, there are different PDF-HUL-122 errors. I created a pull request because this error actually hides some other errors: https://github.com/openpreserve/jhove/pull/882. When this pull request is accepted and released, more of the other errors will show instead of PDF-HUL-122.

    Sam
