Error detection of JPEG files with JHOVE and Bad Peggy – so who’s the real Sherlock Holmes here?

In this blog post I want to examine the findings of two validation tools: JHOVE (version 1.14.6) and Bad Peggy (version 2.0), which scans image files for damage using the Java Image IO library. The goal is to compare the findings so that readers know what to expect from these validation tools in their daily digital curation work.

This test was done with the publicly available Imagetestsuite from Google. Furthermore, I included pictures from events like Christmas parties and outdoor events in my library from the last 6 years, pictures contributed by friends and colleagues, some of my own pictures, and even some memes from public fun Facebook pages like “useless facts”.

In one case, I even opened a JPEG in an editor to remove some bytes to test if the tools would realise that something was wrong, hoping I would get error messages that were still missing from the list (which, by the way, worked out).

All in all 3070 JPEG files were tested. In the following, I will focus on the files that either the JHOVE JPEG module or Bad Peggy – or both – found objectionable in one way or the other.

In general, the JHOVE JPEG module knows 13 different error messages, whereas Bad Peggy can distinguish at least 30 (source code of KOST-Val, which uses Bad Peggy to validate JPEG files).

Table: Possible JHOVE errors on JPEG files

This table lists the JHOVE errors only. All errors found by Bad Peggy in the sample are listed in the table below the conclusion.

No. | Message | Examples in the sample? | No. of occurrences in sample | Bad Peggy equivalent
1 | DTT segment without previous DTI | no | | /
2 | Unexpected end of file | Yes (Example) | 26 | corrupt data: premature end of data segment. AND corrupt data: Truncated File – Missing EOI marker.
3 | I/O exception processing Exif metadata: | no | | /
4 | Invalid JPEG header | Yes (Example) | 4 | The file is not a JPEG (header).
5 | JFIF APP0 marker not at beginning of file | Yes (see “Expected marker byte 255, got”) | 17 | Bad Peggy does not recognise this file as invalid
6 | Marker not valid in context | Yes (Example) | 3 | invalid file structure: two SOI markers
7 | Expected marker byte 255, got | Yes (Example) | 78 | Bad Peggy does not recognise this file as invalid
8 | SPIFF marker not at beginning of file | no | | /
9 | File does not begin with SPIFF, Exif or JFIF segment | Yes (see “Expected marker byte 255, got”) | 105 | Bad Peggy does not recognise this file as invalid
10 | Error creating temporary file. Check your configuration: | no | | /
11 | Unrecognized tiling data | no | | /
12 | Value offset not word-aligned: xxx | Yes (Example) | 6 | Bad Peggy considers these files to be invalid as well, but throws different error messages
13 | No TIFF magic number: 4906 (Is officially a TIFF error message, but was thrown for this presumably JPEG file) | Yes (Example) | 1 | corrupt data: bad Huffman code. (only one file in the sample)

For five of the error messages no examples could be found, so I won’t look at them in this blog post. Furthermore, for the errors “No TIFF magic number: 4906” and “Value offset not word-aligned: xxx“, the examples in the sample were so scarce that I cannot explain them properly yet. If anybody out there has examples for these errors, I will happily extend this post and include the findings.

Let’s take a closer look at the different errors:

General information about the JPEG structure

Between the SOI (Start of Image) and the EOI (End of Image), other segments are allowed. Roughly speaking, the structure of a JPEG should be as follows:

  • SOI-segment (SOI: start of image): “FF D8”
  • APP0-segment (JFIF-Tag): “FF E0”
  • other segments
  • SOS-segment (SOS: start of scan): “FF DA”
  • data: compressed data
  • EOI-segment (EOI: end of image): “FF D9”

If there are, e.g., embedded thumbnails, a single file can have more than one SOI and EOI segment. In these cases, the JPEG has to have an SOI segment between two EOI thumbnail markers. (Bad Peggy checks this, whereas JHOVE does not.)
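To make the structure more tangible, here is a minimal Python sketch that walks the segments of a JPEG up to the first SOS marker. It is only an illustration under simplifying assumptions: the name `list_segments` is my own, fill bytes before markers are not handled, and it is not how JHOVE or Bad Peggy are implemented.

```python
import struct

# Markers without a length field: TEM, RST0-RST7, SOI, EOI.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))


def list_segments(data: bytes):
    """Return (offset, marker) pairs for all segments up to the first SOS."""
    segments = []
    pos = 0
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            raise ValueError(
                f"expected marker byte 255 at offset {pos}, got {data[pos]}"
            )
        marker = data[pos + 1]
        segments.append((pos, marker))
        if marker == 0xDA:
            # SOS: compressed data follows, so stop walking here.
            break
        if marker in STANDALONE:
            pos += 2
            continue
        if pos + 4 > len(data):
            raise ValueError("file ends inside a segment header")
        # The two-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length
    return segments


# Synthetic example: SOI, a 16-byte JFIF APP0 segment, then SOS.
sample = bytes.fromhex(
    "ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 01 00 01 00 00 ff da"
)
for offset, marker in list_segments(sample):
    print(f"offset {offset}: FF {marker:02X}")
```

For a real file, read the bytes with `open(path, "rb").read()` and pass them to `list_segments`.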

Unexpected end of file

Interestingly, JHOVE does not always detect that parts of a file are missing. Only when Bad Peggy throws both errors, “corrupt data: premature end of data segment” AND “corrupt data: Truncated File – Missing EOI marker“, does the JHOVE JPEG module detect that something is indeed wrong with the file. For several files Bad Peggy only throws “corrupt data: premature end of data segment” and JHOVE considers them to be valid.

Furthermore, the Bad Peggy error “corrupt data: xxxx extraneous bytes before marker 0xd9.” goes unnoticed by JHOVE. Images in this sample with these errors, though, do not look healthy to me at all.

JHOVE detects that there is something wrong with these images (“Unexpected end of file“):

JHOVE does not detect that these images are damaged, though Bad Peggy does (“corrupt data: premature end of data segment“):


There is clearly something missing, as you can see with these three examples, which are considered to be valid by JHOVE:

File Name | Bad Peggy Error | Impact
image195 | corrupt data: 83426 extraneous bytes before marker 0xd9. | color problems, picture seems to have two parts that do not belong together
image185 | corrupt data: premature end of data segment. | color problems, picture seems to have three parts
image183 | corrupt data: 19846 extraneous bytes before marker 0xd9. | color problems, picture seems to have two parts

As a digital archivist, I would want to know about these errors while ingesting data. I fully agree with Bad Peggy: this data is indeed corrupt. I consider this a false negative of the JHOVE JPEG module: the JPEG has serious problems, but JHOVE does not detect them. In this case, Bad Peggy is the better Sherlock Holmes.

The last bytes of a JPEG should look like this and always end with an EOI (end of image), which is “FF D9”:

In this example, the JPEG just ends without the necessary EOI:

If I add the “FF D9” manually and save the file, JHOVE does not detect that something is wrong with it any more. It considers the file to be valid & well-formed.
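This behaviour is what you would get from a check that only looks at the end of the file. The following Python sketch illustrates the idea; it is an assumption about the kind of test involved, not JHOVE's actual code, and the function name `ends_with_eoi` is made up for this example.

```python
def ends_with_eoi(data: bytes) -> bool:
    """Return True if the last two bytes are the EOI marker FF D9."""
    return data[-2:] == b"\xff\xd9"


# A synthetic "truncated" stream: SOI, SOS marker, a few data bytes, no EOI.
truncated = b"\xff\xd8\xff\xda\x12\x34\x56"
patched = truncated + b"\xff\xd9"

print(ends_with_eoi(truncated))
print(ends_with_eoi(patched))
```

Appending the two bytes satisfies the check, although the compressed data in front of them is still incomplete. Detecting that requires actually decoding the image data.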

Bad Peggy, however, still considers this file to be invalid:

File Name | Bad Peggy message 1 | Bad Peggy message 2
image178.JPG | corrupt data: premature end of data segment | corrupt data: Truncated file – Missing EOI marker
image176_added_FFD9.JPG | corrupt data: premature end of data segment | this error message is gone

Invalid JPEG header

Easy and straightforward: Both tools check the JPEG header and throw an error if there is no correct JPEG header. This is extremely useful, as tools usually cannot open files with a missing JPEG header. In most cases the file is unreadable for good – or it’s not even a JPEG in the first place.
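A header test of this kind is cheap to do yourself. Here is a minimal Python sketch, assuming that "correct JPEG header" means the file starts with the SOI marker “FF D8” followed by the first byte of another marker (“FF”). The exact rules applied by JHOVE and Bad Peggy may differ, and `has_jpeg_header` is a name I made up.

```python
def has_jpeg_header(path: str) -> bool:
    """Return True if the file starts with SOI (FF D8) plus a marker byte (FF)."""
    with open(path, "rb") as fh:
        return fh.read(3) == b"\xff\xd8\xff"
```

Calling `has_jpeg_header("some_file.jpg")` (a placeholder path) before running the heavier validation tools quickly sorts out files that are not JPEGs in the first place.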

JFIF APP0 marker not at beginning of file

After the SOI (“FF D8”), an APP0-segment should follow, which always starts with “FF E0” (see: “General information about the JPEG structure”). If this error is thrown, the APP0-segment does not follow directly after the SOI-segment. Viewed in a hex editor, a JPEG file which throws this error shows that the SOI marker “FF D8” is not followed by “FF E0”. Instead, in this example the SOI-marker is directly followed by an APP14 marker (“FF EE”), which is often used for copyright entries.

JHOVE flags this file as missing the APP0 marker; Bad Peggy, however, completely ignores this error and obviously does not test for it. The JFIF specification clearly states that the APP0 segment has to follow the SOI directly, but so far, none of the JPEG files of the sample have caused any problems in commonly used viewers. This cannot be marked as a false positive for the JHOVE JPEG module, but it currently does not seem to bear any practical risk for the affected data. As usual, the viewers are more flexible than the file format specification, at least contemporary viewers (we cannot guess for future viewers, however, which makes decisions about long-term availability harder).
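What you would otherwise look up in a hex editor can also be read programmatically. The following Python sketch returns the marker that directly follows the SOI; the function names are my own and the check is a simplified illustration, not the logic of the JHOVE JPEG module.

```python
def marker_after_soi(data: bytes) -> int:
    """Return the second byte of the marker that directly follows the SOI."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("file does not start with SOI (FF D8)")
    if len(data) < 4 or data[2] != 0xFF:
        raise ValueError("no marker directly after SOI")
    return data[3]


def starts_with_jfif_app0(data: bytes) -> bool:
    """Return True if an APP0 marker (FF E0) directly follows the SOI."""
    return marker_after_soi(data) == 0xE0


# Synthetic examples: SOI followed by APP0, and SOI followed by FF EE.
print(starts_with_jfif_app0(b"\xff\xd8\xff\xe0\x00\x10"))
print(hex(marker_after_soi(b"\xff\xd8\xff\xee\x00\x0e")))
```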

Marker not valid in context

The examples in the corpus are all visibly corrupted, as parts of the images seem to be missing:

The JHOVE error sounds pretty general. For the three examples in the corpus, Bad Peggy has found different equivalents.

File Name | JHOVE | Bad Peggy | Explanation
marker_1.jpg | marker not valid in context | corrupt data: 16 extraneous bytes before marker 0xe0 | Cannot explain the error, as there is no 0xe0 marker to be found
Marker_2.jpg | marker not valid in context | corrupt data: premature end of data segment | See “Unexpected end of file”
Marker_3.jpg | marker not valid in context | corrupt data: premature end of data segment; invalid file structure: two SOI markers | Searching for the SOI-segment in an affected file has shown a second SOI-segment later in the file

Expected marker byte 255, got xxx

This error occurs several times within the sample and gives a plethora of marker bytes which have been used instead of 255. So far, none of the affected JPEGs have shown any problems and Bad Peggy ignores the error altogether.

File does not begin with SPIFF, Exif or JFIF segment

A JPEG file usually uses the JPEG File Interchange Format (JFIF), but can also use Exif or SPIFF; in any case it has to start with one of these three segments and no other. Bad Peggy marks these files as invalid as well, but the error message is quite tight-lipped: “ype.”, which KOST (Switzerland) translated as something like “This JPEG contains characteristics that are not supported”, which does not really enlighten me.

JHOVE usually detects this error combined with “JFIF APP0 marker not at beginning of file” or “Expected marker byte 255, got XXX“. An example of a file that shows only this error is this:


Similarly to the example for the JHOVE error “JFIF APP0 marker not at beginning of file“, the tag directly following the SOI (“FF D8”)-marker is not a JFIF, Exif- or SPIFF-marker, but a copyright tag (“FF EE”).

Bad Peggy: Invalid file structure: Missing SOI between two EOI thumbnail markers.

Bad Peggy also detects an error that is completely ignored by the JHOVE JPEG module, and it has found this error in more than 100 files within the sample. The error is almost self-explanatory once you know what an SOI (start of image) and an EOI (end of image) are. So far, none of the affected JPEGs look bogus in any way or have had any problems being displayed.
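If you want to see for yourself in which order SOI and EOI markers occur in a file, a naive byte search already helps. The Python sketch below only lists the positions of “FF D8” and “FF D9”. It does not reproduce Bad Peggy's check, and it may report byte pairs inside metadata segments that merely look like markers. The function name `soi_eoi_positions` is made up.

```python
import re


def soi_eoi_positions(data: bytes):
    """Return (offset, name) pairs for every FF D8 / FF D9 byte pair found."""
    names = {b"\xff\xd8": "SOI", b"\xff\xd9": "EOI"}
    return [
        (match.start(), names[match.group()])
        for match in re.finditer(rb"\xff[\xd8\xd9]", data)
    ]


# Synthetic layout: main image with one embedded thumbnail.
data = b"\xff\xd8" + b"\x00" * 4 + b"\xff\xd8\x01\xff\xd9" + b"\x02" + b"\xff\xd9"
for offset, name in soi_eoi_positions(data):
    print(offset, name)
```

Comparing this listing for a file flagged by Bad Peggy with one that passes can give a first hint at what differs in the thumbnail structure.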

Conclusion

After a closer look at the affected JPEG data, I would not want these JPEG files to go unnoticed in my archive:

Two of the images cannot even be opened and displayed any more, and the rest have missing parts, mixed-up parts and colour problems. For practical reasons, I would want my tool to detect these errors automatically, and not necessarily more than those. These are the only JPEGs that obviously have problems; others show errors in JHOVE or Bad Peggy or both, but contemporary JPEG viewers have no problems displaying them. Of course it is impossible to say if future tools will be able to display these JPEGs properly.

Considering this, Bad Peggy has clearly won: It detects all of the visually corrupt images.

The JHOVE JPEG module misses 7 out of 18: these are the files with the Bad Peggy error “corrupt data: premature end of data segment” without the additional error “corrupt data: Truncated File – Missing EOI marker”, and the files with “xxxx extraneous bytes before marker 0xd9.” Maybe JHOVE would be just fine if these two extra tests were included. Whether anything else is seriously missing is a question we could only answer with a bigger sample.

These seven JPEGs with visible problems are missed by JHOVE:

All errors found by Bad Peggy

All in all, 1007 of the 3070 files in the sample had problems, if Bad Peggy is to be believed. As one file can contain more than one error, the findings are as follows:

Occurrences | Error flavour | Error message
15 | recognition and BadPeggy | The file is not a JPEG (header).
83 | | ype.
846 | invalid file structure | Missing SOI between two EOI thumbnail markers
2 | invalid file structure | two SOI markers.
2 | invalid file structure | two SOF markers.
2 | invalid file structure | Huffman table 0x00 was not defined
2 | invalid file structure | SOS before SOF.
1 | invalid file structure | Empty JPEG image (DNL not supported).
2 | invalid file structure | missing SOS marker.
21 | corrupt data | premature end of data segment
40 | corrupt data | 16 extraneous bytes before marker 0xe0
23 | corrupt data | Truncated File – Missing EOI marker
1 | corrupt data | bad Huffman code.
1 | corrupt data | found marker 0xf7 instead of RST0.
6 | other problems | Bogus Huffman table definition
6 | other problems | Bogus marker length
7 | other problems | Warning: unknown JFIF revision number 148.195.
7 | other problems | Image Format Error
3 | other problems | Invalid progressive parameters Ss=227 Se=63 Ah=1 Al=0
3 | other problems | Bogus DQT index 10.
2 | other problems | Quantization table 0x00 was not defined.
2 | other problems | Unsupported JPEG process: SOF type 0xc3.
2 | other problems | JFIF not permitted in stream metadata.
1 | other problems | Unsupported JPEG data precision 9
1 | other problems | Sampling factors too large for interleaved scan.
1 | other problems | Bogus sampling factors.
1 | other problems | Too many color components: 17, max 10.
1 | other problems | Sorry, there are legal restrictions on arithmetic coding.

1 Comment

  1. Yvonne Tunnat
    January 16, 2017 @ 10:09 am CET

    Hi,
    I have run the JPEG Google Imagetestsuite with JHOVE, Bad Peggy, ImageMagick and ExifTool and have listed all the findings in this spreadsheet:
    https://docs.google.com/spreadsheets/d/1v0JbJZFs_4Oy405li_xkmxAoWhAqQs-nB9yeNAg3tnI/edit#gid=0
    Have fun!
    Yvonne
