TIFF format validation: easy-peasy?


The research question

I have never doubted the JHOVE TIFF module. The JHOVE TIFF module is always right. Everybody says so. That’s why nobody uses the myriad alternatives to it, although it’s so easy to write a TIFF validator, I could almost do it myself.

But while my colleague Michelle and I were drafting a paper for the IDCC this February, it dawned on me: “Everybody” has never written about the infallibility of JHOVE in a paper or blog post so far, nor run a thorough test that I know of. Besides, the “myriad alternatives” often seem difficult for me to use on my Windows machine, given my limited experience with command-line tools and batch scripting.

Last fall, I compared the validation tools JHOVE and Bad Peggy and how they both deal with JPEG validation (see OPF blog post). My goal was to analyse whether the JHOVE JPEG module is reliable, as we are basing our preservation decisions on it. In theory, my goal for this examination is the same, only focussed on TIFF, but my initial intention was admittedly biased: I wanted to prove that the JHOVE TIFF module indeed is infallible and that TIFF validation is, as I have always known, easy-peasy. As the analysis went on, I had to admit that reality is much more complicated.

The verdict of a validation tool is usually relied on without a second thought, although most validation tools are not free from false negatives and false positives. As JHOVE is widespread in the digital preservation community and integrated in out-of-the-box digital preservation software like Rosetta and Preservica, its reliability is especially interesting for the formats we probably all have in our archives, like TIFF images.

My research question: Is the JHOVE TIFF module really that good in comparison with other tools?

And, as a side-effect: Is TIFF validation really easy-peasy?

TIFF validation tools

First, there seems to be a plethora of tools to test TIFF validity, analyse TIFF tags and even repair common errors. Some are listed in COPTR (search for “TIFF” and “validation”, though some tools like ExifTool do some validation and are not marked as validation tools). Furthermore, the libtiff library offers many programs that can be integrated in other tools (see e. g. this collection of TIFF-tools). Most tools are not out-of-the-box tools with a nice GUI like JHOVE, which can be used by you and me on a Windows machine.

I selected the following tools for my test:

No. | Validation tool | Version | How to use | Remark
1 | JHOVE | 1.14.6 | GUI and Java library | -
2 | ImageMagick | 7.0.3 | Command-line, batch-script | help for the batch-script via Twitter from David Underdown and the ImageMagick people
3 | ExifTool | 10.37 | Command-line, batch-script | help for the batch-script from Mario from the German nestor format identification group
4 | DPF Manager | 3.1 | GUI | -
5 | checkit_tiff | 0.2.0 | runs on Linux only so far | Andreas, checkit_tiff developer from the SLUB Dresden, has run the test suite for me
6 | LibTIFF | 4.0.7 | runs on Linux only | Heinz from the German nestor format identification group has run the test suite for me

In summary, I was able to analyse the test suite with six tools. I had some help with a suitable batch-script for ImageMagick and ExifTool, but at least I could easily run the tests on my Windows machine (unlike checkit_tiff, for example, which requires Linux). For checkit_tiff and LibTIFF two colleagues helped me out and sent me the findings to analyse.

In the following paragraphs, I introduce the tools I have used for this analysis in more detail:

JHOVE

Validation: JHOVE is my go-to validator for TIFF files. The findings are intelligible and the expectations for the file to follow the TIFF specification seem reasonable. There are almost 70 known TIFF error and info messages in the JHOVE module. Most of them carry their meaning within the message, like “TileLength not defined”, and even a passer-by with a minimum of imagination can see why information about the tile length might come in handy for an image. In terms of transparency, the JHOVE website describes which requirements a TIFF file must meet to be well-formed or well-formed and valid.

Handling: As the GUI output never suited me, I began long ago to use JHOVE as a Java library and to generate my own HTML output, which is very user-friendly. Aside from the GUI output not being handy when dealing with many files, JHOVE is very easy to install and use, as one can just throw (drag & drop) files and folders at it and it will validate them all.

ImageMagick

Validation: Just to be fair, ImageMagick is not primarily about file validation, but instead about displaying, migrating and working on images. I have marked every file as invalid for which ImageMagick reported at least one error message, even if the error seems to be a minor one, like an unknown TIFF field or incorrect tag contents. As far as I know there is no list of all possible ImageMagick error messages. The corpus used for this blog post, the Google Imagetestsuite, however, yields more than 40 different error messages, listed here. In analysing the ImageMagick output for this blog post, “valid” means “error-free”.

Handling: If I had known how bad the output is before I started this test, I would have skipped this tool altogether. ImageMagick is a command-line tool and, as far as I know, batch-processing only produces text output, and this output is really a mess. It’s difficult to tell which error information belongs to which file, as some files are not even listed by name. I had to test those one by one, which was time-consuming and boring. But I am pigheaded and had already started to tell everybody I was going to test this tool. Even after I had written some Java to help me with the messy output, some stuff could not be automated, as I could not find a regular pattern for everything. I am very sure that ImageMagick is very useful for batch-processing when converting images etc., but obviously nobody has really thought about validating 166 images at once, or about validation in the first place.

ExifTool

Validation: ExifTool is not really meant for validation, either. It’s for metadata extraction. The information about image errors is just a by-product if the tool runs into any problems while trying to extract metadata. So it’s not really fair to treat ExifTool like a validation tool, as it would never complain about an absolutely unreadable TIFF which cannot be opened by any viewer, as long as all the metadata can be extracted. That might be the reason why ExifTool has the highest percentage of presumably valid TIFF files within this test. So, “valid” for ExifTool means that there were no warnings or errors in the metadata output.

Handling: It’s a command-line tool with quite good options for batch-processing whole folders and for writing human-readable CSV (though the CSV can have many, many columns, as images can have a myriad of metadata).

DPF Manager

Validation: The DPF Manager is for TIFF validation only and was built only for that. Some validation profiles are included (e. g. for “baseline TIFF”, “extended TIFF”, TIFF/EP and the TI/A Draft). Besides, you can specify your own validation profile to check against. For this analysis, only the baseline and extended TIFF profiles were used. For the DPF Manager, “valid” means that a TIFF file does not violate the specification in any way (depending on the profile used).

Handling: The DPF Manager is very easy to install (though you need admin rights to do so, it’s not portable like JHOVE is) and extremely easy to use: You can just drag and drop a file or folder on the GUI or, alternatively, select a file or folder. The tool is very fast – 80 TIFFs need less than a minute (size of the TIFFs varied between 9 KB and 4 MB) – and the HTML output is very nice and there is also METS and XML (although personally I would like an additional CSV option). Furthermore, the TIFF files are sorted by the number of errors. The worst ones come first and when you scroll down, the TIFFs get less and less invalid, with the valid ones at the end. So bad news first! There are also thumbnails of the TIFFs, if the image is renderable. I think in terms of usability this tool has easily won the contest.

As a bonus, each error is referenced to the page and section in the TIFF guide, including the exact quotation that the error refers to.

Having seen this, I am tempted to think of the DPF manager as the reference tool for whether a TIFF really is valid or not – the question is whether it outperforms my current go-to tool, JHOVE? It certainly aims at being the go-to-validator for TIFF-files.

checkit_tiff

Validation: checkit_tiff is too picky for this examination, as it validates against baseline TIFF whereas the TIFFs in the Google Imagetestsuite are not baseline. I still want to present the findings of the tool and include it in this Blogpost, as the reader surely has a different TIFF corpus and might very well want to validate baseline TIFF.

Handling: The tool runs on Linux and is a command-line tool. There is no Windows version yet, though one is about to be released soon.

LibTIFF

Validation: There are 695 different error messages from evaluation of the Google Imagetestsuite sample (listed here), mostly very similar ones dealing with unknown TIFF tags. Reading the error messages, LibTIFF seems to check the tags only, and omits some general file-structure validation, such as the check for end-of-file tags. It does check the TIFF header for the magic number, though (see “No TIFF magic number“). As far as I know, there is only a text output (“.log”), which in general is readable, but for bulk-analysis not much better than ImageMagick.

Handling: The tool runs on Linux.

Test corpus

The test was run on the Google Imagetestsuite for TIFF, which has the advantage of being openly available and consists of some really bad TIFF files. By that I mean that half of the files are not even renderable or look bogus, as if parts were missing or the image is just grey, white or black, and one does not quite know if that’s intentional.

The files are named after their MD5-checksums (see: About). Unfortunately, there is no “ground truth” indicating whether each TIFF file is valid or not. I have added information about whether the image is renderable in either Windows Photo Preview, Paint or ImageMagick in the Findings spreadsheet, and used this as a basis to compare the tools’ validation output against. I know that this is not a water-tight solution, as an image can be invalid and still open in a current viewer. Some of the Google images also look damaged in the viewer: either as if something is missing or just black or white – and I cannot figure out if this is on purpose (= this just is a picture of some black stuff) or the image is broken.

Reading the explanation of the TIFF Google Imagetestsuite, all images without the prefix “m-” are original and were not modified. All images with the prefix “m-“, however, were modified in some way. Although the intention was not necessarily to add errors to the image, the percentage of valid images for these files is much smaller (see table below).

For the chapter “Differences in terms of errors” below I have additionally used the 3 files of the fixit/checkit_tiff test corpus, as these files were better suited for the examples.

Examination of 166 TIFFs from the Google Imagetestsuite

For ImageMagick and ExifTool I have marked images as invalid if they threw errors.

If an image could not be analysed by a tool at all (e. g. because the tool only analyses files with a correct TIFF header and some files don’t provide one), it was marked as “could not be analysed”.

I have tested the images with Windows Photos, Paint, ImageMagick and the thumbnail preview in the DPF Manager. If an image could not be rendered by any of them, I marked it as “invalid” in the column “renderable in a viewer”.

Files | JHOVE | ImageMagick | ExifTool | DPF Manager (Baseline) | DPF Manager (Extended TIFF) | checkit_tiff | LibTIFF | Renderable in a viewer

all 166 files
valid / error free | 29 | 18 | 56 | 4 | 15 | 0 | 21 | 83
invalid / errors reported | 129 | 148 | 109 | 151 | 140 | 131 | 145 | 83
could not be analysed | 8 | - | 1 | 11 | 11 | 35 | - | -
% valid | 17.5% | 11% | 34% | 2.4% | 9% | 0% | 13% | 50%

47 original files (not modified)
valid | 27 | 18 | 44 | 3 | 14 | 0 | 21 | 47
invalid | 20 | 29 | 3 | 44 | 33 | 47 | 26 | 0
could not be analysed | - | - | - | - | - | - | - | -
% valid | 57% | 38% | 94% | 6% | 30% | 0% | 44% | 100%

119 modified files
valid | 2 | 0 | 12 | 0 | 1 | 0 | 0 | 36
invalid | 109 | 119 | 106 | 108 | 107 | 84 | 119 | 83
could not be analysed | 8 | - | 1 | 11 | 11 | 35 | - | -
% valid | 1.6% | 0% | 10% | 0% | 1% | 0% | 0% | 30%

83 non-renderable files
valid | 2 | 0 | 7 | 0 | 0 | 0 | 0 | 0
invalid | 75 | 83 | 75 | 72 | 72 | 80 | 83 | 83
could not be analysed | 5 | - | 1 | 11 | 11 | 3 | - | -
% valid | 2.4% | 0% | 8.4% | 0% | 0% | 0% | 0% | 0%

A few observations on the above figures:

checkit_tiff does not analyse the TIFF files if the magic number is missing.

Compared against the ability for a viewer to render the files, all tools seem to generate false positive (=false alarm) results – more files are marked as invalid than renderability would seem to imply.
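
To make the size of that effect tangible, here is a small Python sketch that turns the “invalid” counts for the 47 original (all renderable) files from the table above into false-alarm rates. It assumes, as this post does, that renderability is an acceptable stand-in for validity; the helper name `false_alarm_rate` is made up for this illustration.

```python
# Number of the 47 original, renderable files each tool marked as invalid
# (figures copied from the table in this post).
ORIGINAL_FILES = 47
INVALID_ORIGINALS = {
    "JHOVE": 20,
    "ImageMagick": 29,
    "ExifTool": 3,
    "DPF Manager (Baseline)": 44,
    "DPF Manager (Extended TIFF)": 33,
    "checkit_tiff": 47,
    "LibTIFF": 26,
}


def false_alarm_rate(invalid: int, total: int) -> float:
    """Share (in percent) of renderable files a tool marked as invalid.

    Assumption: a renderable file counts as valid, so every "invalid"
    verdict on such a file is treated as a false alarm.
    """
    return 100 * invalid / total


for tool, invalid in INVALID_ORIGINALS.items():
    rate = false_alarm_rate(invalid, ORIGINAL_FILES)
    print(f"{tool}: {rate:.0f}% of the renderable originals flagged")
```

Whether these really are false alarms depends, of course, on whether one accepts “opens in a viewer” as the definition of a good TIFF.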

The number of invalid “modified files”, and the number of invalid “non-renderable files” is very high. There were some files which could not be opened with Paint but could be opened with ImageMagick, though (marked with “ImageMagick can open” in the Spreadsheet).

I am reluctant to state that JHOVE has two false negatives here, but all the other tools have marked these two files as invalid, except ExifTool, which has marked one of the files as error-free (see Spreadsheet). Furthermore, no viewer can render the files. I tried to analyse the files with my own TIFF Java tools, but they could not process the files at all. Much as I hate it, I have to admit: these are false negatives. What else should I call them? I certainly would not want these two files going unnoticed in my archive, and apart from that one ExifTool result, every other tool detected that something is wrong with them.

Premature End-of-File

For 5 files of the test corpus JHOVE throws the “Premature End-of-File” error, which usually hints at a fatal problem with the file. Often parts of the file are missing; a typical issue is that the file was not completely downloaded or uploaded and the last chunk of the file is not there. JHOVE usually notices missing chunks at the end of a file.
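
One way such truncation can show up is that the TIFF header points to an image file directory (IFD) that lies partly or completely beyond the end of the file. The following Python sketch checks just that for the first IFD. It is my own illustration and not how JHOVE implements its check; the function name `first_ifd_is_truncated` is made up, and the sketch assumes the data starts with a correct classic TIFF header.

```python
import struct


def first_ifd_is_truncated(data: bytes) -> bool:
    """Return True if the first IFD does not fit inside `data`.

    Assumption: `data` begins with a correct classic TIFF header
    (byte order, magic number 42, 4-byte offset to the first IFD).
    """
    if len(data) < 8:
        return True  # not even a complete header
    endian = "<" if data[:2] == b"II" else ">"
    (ifd_offset,) = struct.unpack(endian + "I", data[4:8])
    # The IFD starts with a 2-byte entry count ...
    if ifd_offset + 2 > len(data):
        return True
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    # ... followed by 12 bytes per entry and a 4-byte offset to the next IFD.
    ifd_end = ifd_offset + 2 + 12 * entry_count + 4
    return ifd_end > len(data)
```

A real validator would go further and also check that the image data referenced by the tags (strips or tiles) lies within the file.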

Four of the five files (spreadsheet) do look very suspicious. Two cannot be opened, two are black, one looks as if parts of the text were missing. At least most tools agree that something is wrong with the files. Only the DPF manager considers one of the 5 files to be valid. Looking at the error messages of the DPF manager, there is no hint of a premature file ending.

ImageMagick throws the error “unexpected end-of-file”, but for five other files of the corpus (listed here). JHOVE reports other errors for these files, but at least all validation tools again agree on the invalidity of the files. They certainly look damaged in the viewer and one of them cannot even be opened. The DPF Manager could not even analyse these five files; they were all omitted from the analysis.

No TIFF magic number

If the file only purports to be a TIFF file, e. g. by the file extension, but the magic number cannot be found, all tools agree on the error (spreadsheet). None of the three files for which this error is reported could be opened, and JHOVE, ImageMagick, DPF Manager and checkit_tiff (by not handling the file) agree that the magic number is missing and that the file therefore cannot be a TIFF file, or rather that the TIFF signature is incorrect.
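
For readers who wonder what this check amounts to, here is a minimal Python sketch. According to the TIFF 6.0 specification, a TIFF file starts with a two-byte byte-order mark (`II` for little-endian, `MM` for big-endian), followed by the number 42 stored in that byte order. The function names are my own and the sketch is not taken from any of the tools tested.

```python
import struct


def has_tiff_magic(header: bytes) -> bool:
    """Return True if the first four bytes look like a classic TIFF header."""
    if len(header) < 4:
        return False
    byte_order = header[:2]
    if byte_order == b"II":      # little-endian ("Intel")
        fmt = "<H"
    elif byte_order == b"MM":    # big-endian ("Motorola")
        fmt = ">H"
    else:
        return False
    (magic,) = struct.unpack(fmt, header[2:4])
    return magic == 42


def file_has_tiff_magic(path: str) -> bool:
    """Read the first four bytes of a file and check them."""
    with open(path, "rb") as f:
        return has_tiff_magic(f.read(4))
```

A file that fails this test is not a TIFF at all, whatever its file extension claims, which is presumably why the tools agree so readily here.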

Commonalities in terms of errors

Sometimes, the tools agree on an error and even use very similar words to describe the error. One example is shown in the table below.

ImageMagick Error for file 0c84d07e1b22b76f24cccc70d8788e4a | JHOVE TIFF Module Error for file 0c84d07e1b22b76f24cccc70d8788e4a
Unknown field with tag 37680 (0x9330) encountered | Unknown TIFF IFD tag: 37680
Unknown field with tag 37677 (0x932d) encountered. | Unknown TIFF IFD tag: 37677
Unknown field with tag 37678 (0x932e) encountered. | Unknown TIFF IFD tag: 37678

Obviously, both tools check for unknown TIFF tags and report them if they encounter any. ImageMagick also gives the hex value of the field. It does not matter which of these two tools one uses: either will report unknown tags. At least both tools have done so with the Google Imagetestsuite.

Differences in terms of errors

For these two examples I have used files from the fixit/checkit_tiff test corpus.

JHOVE: “Invalid DateTime separator”

The JHOVE module correctly reports if the DateTime is not formatted as it is supposed to be and marks the file as “well-formed, but not valid”. It has done so with the “invalid_date.tiff” from the fixit / checkit_tiff test files. ImageMagick, however, completely fails to notice that there is something wrong with the DateTime tag in this file and the error goes unnoticed. (ImageMagick does report an error, but it seems to be unconnected to the DateTime, as it is about the “Photoshop” tag.) The DPF Manager also reports “Incorrect format for DateTime” and quotes the TIFF specification, so this is a false negative for ImageMagick.
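
The rule behind this error is simple enough to sketch. The TIFF 6.0 specification defines the DateTime tag as an ASCII string of the form “YYYY:MM:DD HH:MM:SS”, with colons as separators and a single space between date and time. The Python snippet below is a minimal illustration of such a check; the function name is hypothetical, and I do not know how JHOVE or the DPF Manager implement it internally.

```python
import re
from datetime import datetime

# Exactly "YYYY:MM:DD HH:MM:SS" (the terminating NUL byte is assumed
# to have been stripped already).
DATETIME_PATTERN = re.compile(r"[0-9]{4}:[0-9]{2}:[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}")


def is_valid_tiff_datetime(value: str) -> bool:
    """Check separators and field widths, then the plausibility of the date."""
    if not DATETIME_PATTERN.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y:%m:%d %H:%M:%S")
    except ValueError:
        return False  # e.g. month 13 or hour 25
    return True
```

A date written with hyphens or with a “T” between date and time, as is common in other formats, would fail this check even though every viewer will happily display the image.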

JHOVE: “Value offset not word-aligned”

The JHOVE module throws this error for the “minimal_valid”-Tiff in the checkit_tiff Examples and marks the TIFF as “not well-formed”. ImageMagick does not report any errors for this file. The DPF Manager, however, finds three errors in the file, two related to “bad word alignment in offset” (which sounds pretty much like the JHOVE error) and one inconsistency about the tag planar configuration, which does not sound that fatal (“PlanarConfiguration is irrelevant if SamplesPerPixel is 1, and need not be included.“).
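
What “word-aligned” means here: in a TIFF IFD, each 12-byte entry holds a tag, a field type, a value count and a 4-byte field that contains either the value itself (if it fits into 4 bytes) or an offset to the value. The TIFF 6.0 specification expects such offsets to fall on a word boundary, i.e. to be even. The following Python sketch lists the tags of the first IFD that break this rule. It is an illustration under these assumptions, with a made-up function name, and not the code of JHOVE or the DPF Manager.

```python
import struct
from typing import List

# Size in bytes of one value for each TIFF 6.0 field type
# (1 BYTE, 2 ASCII, 3 SHORT, 4 LONG, 5 RATIONAL, 6 SBYTE, 7 UNDEFINED,
#  8 SSHORT, 9 SLONG, 10 SRATIONAL, 11 FLOAT, 12 DOUBLE).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
              7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def unaligned_tags(data: bytes) -> List[int]:
    """Return the tags in the first IFD whose value offset is odd.

    Assumption: `data` is a complete classic TIFF with a correct header.
    """
    endian = "<" if data[:2] == b"II" else ">"
    (ifd_offset,) = struct.unpack_from(endian + "I", data, 4)
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    bad = []
    for i in range(entry_count):
        entry_start = ifd_offset + 2 + 12 * i
        tag, field_type, count, value_or_offset = struct.unpack_from(
            endian + "HHII", data, entry_start)
        size = TYPE_SIZES.get(field_type)
        if size is None:
            continue  # unknown field type: value size unknown, skip
        if size * count <= 4:
            continue  # value is stored inline, so there is no offset
        if value_or_offset % 2 != 0:
            bad.append(tag)
    return bad
```

Most viewers do not care about odd offsets, which may explain why ImageMagick stays silent while the stricter validators complain.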

Fun fact

Of the 166 files, only for four files do all of the tools (except checkit_tiff, which considers them all to be invalid) agree on validity (spreadsheet). If one decided on a file validity policy which only allows files into an archive about which no tool has any complaints, it would be a very empty archive indeed. It might not even be possible to satisfy them all with real-world images from different producers.
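
Such an “all tools must agree” policy is easy to express in code, which makes its strictness obvious. The Python sketch below uses made-up file names and verdicts purely for illustration; they are not taken from the findings spreadsheet, and the function name `unanimous_valid` is hypothetical.

```python
from typing import Dict, List


def unanimous_valid(verdicts: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the files that every tool considers valid.

    `verdicts` maps file name -> {tool name -> "valid", "invalid"
    or "not analysed"}.
    """
    return sorted(
        name for name, by_tool in verdicts.items()
        if all(verdict == "valid" for verdict in by_tool.values())
    )


# Made-up example data, NOT the real findings of this test.
example = {
    "a.tif": {"JHOVE": "valid", "DPF Manager": "valid", "LibTIFF": "valid"},
    "b.tif": {"JHOVE": "valid", "DPF Manager": "invalid", "LibTIFF": "valid"},
    "c.tif": {"JHOVE": "invalid", "DPF Manager": "not analysed", "LibTIFF": "invalid"},
}
print(unanimous_valid(example))  # prints ['a.tif']
```

A single dissenting or failing tool is enough to reject a file, so every tool added to the policy can only shrink the archive.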

Summary and conclusion

Although the tools agree on the “real bad” TIFF files, TIFF validation does not seem to be all that easy-peasy. It was much easier – at least with the corpus analysed – to determine what is a false positive and what is a false negative with the JPEGs in my last OPF blog post. The JHOVE TIFF module still seems to be a decent choice and I have not found any real gap like I did with the JHOVE JPEG module the other day, although the two false negatives leave me nervous.

Findings of the DPF manager seem to be trustworthy to me, as the TIFF specification can be referenced for each error found. Please note that some errors lead the DPF manager not to detect TIFF files as such, e. g. if the file ends prematurely or unexpectedly (see this spreadsheet).

Nevertheless, most of the tools – if not all – seem to be too paranoid. Assuming all non-modified TIFFs are valid (they are all renderable in a viewer), only ExifTool comes close, considering 94% of them to be valid (or “error-free”, as in the case of ExifTool). The second-best, JHOVE, still considers almost half of them to be invalid in some way. The DPF Manager considers only 30% of them to be valid (Extended TIFF) and is even able to back up every bit of it.

Back to my research questions:

Is the JHOVE TIFF module really that good in comparison with other tools?

Well. It’s pretty user-friendly, the error messages are intelligible (but most TIFF errors are, with every tool tested), the output can be dealt with, but it’s not as user-friendly as the DPF manager, which also has a nicer output. And, the DPF manager has the reference to the specification all the time, which really feels good when talking to my boss about the quality of our TIFF files. Look, the TIFF bible says it’s ok / not ok. Who would argue?

Nevertheless, JHOVE was the only (real validation) tool with false negatives for thoroughly invalid and un-renderable files, which would be worth a second look in one of my next posts.

And, as a side-effect: Is TIFF validation really easy-peasy?

It does not seem so, as the validators agree on very little indeed.

So, how to act?

I might just stick to JHOVE in our production digital preservation environment, but I will at least add the DPF Manager to our pre-ingest workflows, especially in our digitisation centre, to be sure we stick to the TIFF specification at least with TIFF files we generate ourselves. When receiving files from outsiders, I will be more tolerant, as I always am, but might add a preservation planning workflow to repair the TIFFs, if possible. But that will be the topic of another post at another time.


4 Comments

  1. Yvonne Tunnat
    February 13, 2017 @ 10:12 am CET

    Hi,

    in the meantime, ExifTool has a new version (10.42, http://www.sno.phy.queensu.ca/~phil/exiftool/) and it is a little bit easier to see the warnings at first glance if you use:

    exiftool -api validate -a -u -G1 FILE

    (Example for a batch is here: https://github.com/nestorFormatGroup/FormatIdentificationBenchmarking/blob/master/exifTool.bat)

    Best, Yvonne

  2. victormunoz
    January 19, 2017 @ 8:41 am CET

    Hello,

    We want to thank you very much for this analysis. We are happy you included the DPF Manager, and found out it is quite good!
    Since DPF Manager is a new tool, we assume some bugs are there, and your analysis will help us to find some of them (e.g. premature end-of-file issues).
    In the next version this will be solved! We are delivering releases nearly every month and it is planned to include new functionalities in the next releases. Follow https://twitter.com/DPFManager or http://dpfmanager.org/blog.html to keep informed!

    When we started the development of the DPF Manager, we also thought that JHOVE was the “role model”.
    However once we started digging into the TIFF specifications, we found out that there were several rules that were ignored, or partially implemented, in JHOVE.
    So, we tried to build a new tool from scratch that implemented absolutely all the rules word-by-word in the TIFF specification.

    Take into consideration that the goal of the DPF Manager is not just to tell if a TIFF file is renderable or not by actual TIFF readers, but whether it strictly follows the specifications.
    This is an important point, because most TIFF readers simply ignore invalid information they find inside a TIFF, and sometimes the ignored information does not affect renderability. But this does not mean that the TIFF file is correct (and even less so in terms of preservation).

    Regarding the differentiation between “Baseline” and “Extended Baseline”, it appears in the final version of the TIFF 6.0 specification.
    The difference between them is that the “Extended Baseline” includes more kinds of images (CMYK, YCbCr, tiled images, etc). An image of these kinds will only be valid when checking against the extension. So, for most of the users, the extended baseline should be the standard to use for validating.

    Thank you very much again for this fantastic work!

  3. art1pirat
    January 17, 2017 @ 4:27 pm CET

    Some hints from me:

    the tool checkit_tiff was developed with a different philosophy in mind. Instead of validating any TIFF file against any TIFF specification, we wanted to validate a TIFF file against our predefined ruleset of a baseline TIFF with extensions. This is important, because we do not need to parse the result output to check if the TIFF file fits our needs; we only need to define a rule set and check the exit code. Checkit_tiff was developed as a CLI because it must be integrated in our automatic preservation workflow. This is also the reason why it is coded in plain C.
    checkit_tiff is more descriptive now, please check the GIT repository at https://github.com/SLUB-digitalpreservation/checkit_tiff 🙂
    The problem in validation is not to create a validator, technically. It is the problem to:
    a) understand specifications
    b) fileformat in detail
    c) the differences between files in the wild
    d) develop minimal testcase examples (e.g.: https://github.com/SLUB-digitalpreservation/checkit_tiff/tree/master/tiffs_should_fail/)

    Also, you cannot say with certainty “it is valid”. It depends on your knowledge and expectations.
    E.g. is a TIFF valid, if the ICC is wrong? Is a TIFF valid if tags are contradictory? What if some points of specifications are fuzzy?

    In my opinion, for an average user the DPFManager is a good choice. For an automatic workflow I would suggest checkit_tiff.
    checkit_tiff is freely available and should be very portable (and can even be cross-compiled). There is no Windows version, because we only use the tool under Linux. Anyway, it is open source, send me a patch! (At the moment, a Windows version needs some rewrites in memory allocation. That is all) 😉
    Thanks to Yvonne and Michele for their work, great job!

  4. boardhead
    January 17, 2017 @ 12:48 pm CET

    Thanks for this article. I usually recommend JHOVE for TIFF validation without knowing how it compares to other tools. You are right that Exiftool isn’t designed as a validation tool, but it does much more strict testing when writing a file, so it would be informative to see how many of these files gave errors/warnings when writing. Also, the ExifTool -htmldump feature is very useful for detailed inspection of the TIFF structure, and may reveal other types of problems (see http://owl.phy.queensu.ca/~phil/exiftool/htmldump.html for an example output).
