Properly Rendering 32-bit JPEGs
A colorful story by Alix Bruys, Bertrand Caron, Yannick Grandcolas and Thomas Ledoux from the National Library of France (BnF)
[Note: this post is also available in French under the title Le blues du JPEG.]
The Context: Acquired Born-Digital Photographs of Parisian Theaters
On 20 October 2017, the Department of Performing Arts of the National Library of France (BnF) acquired a set of 622 born-digital photographs representing architectural elements of the major Parisian theaters (Palais Garnier, Comédie-Française, etc.). Produced by photographers Sabine Hartl and Olaf-Daniel Meyer in 2012, these photographs were to illustrate the book Théâtres parisiens. Un patrimoine du XIXe siècle, published in 2013 by Citadelles et Mazenod. Although not all 622 photographs appeared in the book, the collection manager decided to acquire the whole set. A contract assigned to BnF the rights to publish them online in its digital library, Gallica. The photographs were delivered in the only available format: JPEG.
|Thumbnails of the original photographs of the Opéra Garnier Theater, Sabine Hartl & Olaf-Daniel Meyer, 2012-2013.|
In the spring of 2018, these 622 photographs were ingested into our Digital Documents Acquisition and Donation workflow, like any born-digital photographs acquired by or donated to BnF. The collection was divided into 30 sub-lots – each theater had two sub-lots, one for published photographs, one for unpublished ones. All photographs successfully passed QA procedures and were ingested into the preservation repository. When the process manager performed the final visual check, after the photographs became visible in Gallica, she noticed a blatant color shift on 5 of them. Where one would expect red and gold tones, they were blue and green! In contrast, the other 617 photographs were rendered as expected.
|Thumbnails of the same photographs, as available at https://gallica.bnf.fr/ark:/12148/bc6p05jkgg7|
The Investigation: IT to the Rescue!
As soon as we were alerted to this situation (red alert 😉), we went directly into the data storage to retrieve the dissemination copies. It turned out that each dissemination file was a regular 24-bit JPEG, but with this distinctive blue cast.
We then queried the archival system to see the technical characteristics of the preserved copy. The manifest (expressed in METS) reflects the following technical metadata:
The surprising part was the number of components: mix:bitsPerSampleValue=8,8,8,8, which indicates a 32-bit JPEG (four components of 8 bits each)! Yet the system had interpreted it as a regular JPEG.
To dig deeper, we retrieved the archival package (AIP) from SPAR, our preservation repository: there, we found the unaltered master file as it was initially delivered. Opening it with XnView brought up an alert box informing us that the image was being converted to a 24-bit RGB colorspace.
Indeed, the original file was a “Well-formed JPEG”, but one using 4 components (CMYK) to encode the information. This was a surprise because the only JPEGs we had encountered before had either 3 components (RGB color space) or 1 component (grayscale).
The Investigation: Backtracking on the Appraisal Step
We then asked the accessioning office about the preliminary controls. Contrary to our initial hypothesis, the normal procedure had been fully and correctly applied…
The pre-conditioning tool Frontin (a BnF in-house application bundling JHOVE among other programs) carries out a quick analysis and returns a result boiling down to a traffic light: green means that the file is accepted without restriction, red means that the file format is not accepted, yellow means that the file analysis raised some warnings. In the case of the 5 CMYK files, the yellow light showed up because the files were JPEGs but did not conform to the 24-bit JPEG profile we expected. When agents in charge of pre-conditioning operations get a yellow light, they are required to do a visual check to ensure that the content is readable and complete before submitting the package to QA procedures. This visual check was made by opening the files with XnView; the alert message mentioned above was displayed but was ignored because the image was rendered normally.
Analysis: What Exactly Went Wrong?
The question is whether such a file should be considered valid: does it conform to any standard? In fact, “common” JPEGs (the ones coming out of cameras) are 3-component images (24 bits) in the sRGB colorspace (the standard Red, Green, Blue). This colorspace is used for displaying images on screen, where the different components are added to create the target color.
But here we have a 4-component image (32 bits) in the CMYK colorspace. These letters stand for Cyan, Magenta, Yellow and Key (Black). This colorspace is used for printing, where the components are mixed in a subtractive way.
To render a CMYK image correctly, you need to convert it properly to the RGB colorspace. If you ignore the K channel, you might end up with an inverted-color image (similar to a negative).
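The arithmetic behind such a conversion can be sketched in a few lines of Python. This is only a naive, profile-free approximation, not the conversion used in our workflow: the function name cmyk_to_rgb and the inverted flag are illustrative assumptions. The flag reflects the convention, commonly seen in CMYK JPEGs written by Adobe software, of storing samples inverted (255 meaning "no ink"). An accurate rendering would need ICC color management.

```python
def cmyk_to_rgb(c, m, y, k, inverted=False):
    """Naive CMYK -> RGB conversion for 8-bit samples (0-255).

    Illustrative sketch only: it ignores ICC profiles entirely.
    `inverted=True` assumes the samples are stored inverted
    (255 = no ink), as is common in Adobe-written CMYK JPEGs.
    """
    if inverted:
        c, m, y, k = (255 - v for v in (c, m, y, k))

    def channel(ink):
        # Subtractive mixing: the more ink and the more black,
        # the less light is reflected. Integer math with rounding.
        return ((255 - ink) * (255 - k) + 127) // 255

    return channel(c), channel(m), channel(y)


# Full magenta + yellow ink, no black: pure red.
print(cmyk_to_rgb(0, 255, 255, 0))
# The same color stored with the inverted convention.
print(cmyk_to_rgb(255, 0, 0, 255, inverted=True))
```

Reading inverted samples as if they were not inverted (or the reverse) gives a very different color, which is why the input encoding has to be identified before converting.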
How come we didn’t detect it? JPEG images are standardized in the ISO/IEC 10918 standard “Digital compression and coding of continuous-tone still images”, but 32-bit JPEGs weren’t included in the standard until Part 6 (ISO/IEC 10918-6:2013 or ITU-T T.872), titled “Application to printing systems”. Part 6 was first published in June 2012, whereas Part 1 was released 20 years earlier, in September 1992! If we look closely at the coverage of the JHOVE JPEG module (by running jhove -m JPEG-hul -h), it states that the coverage is limited to Parts 1 to 5 and to the EXIF format from version 2.0 to 2.2.
In particular, the current version 1.22 of JHOVE does not take into account the new APP14 marker segment (one that begins with the string ‘Adobe’), which gives the proper information on color encoding (see for example this issue and a proposed pull-request). In fact, JHOVE detects this image as “Exif 2.1 (JEIDA-49-1998)”, but this is incorrect since the EXIF standard clearly states for the ‘SamplesPerPixel’ tag: “The number of components per pixel. Since this standard applies to RGB and YCbCr images, the value set for this tag is 3.” This image therefore cannot be EXIF and should be validated as something like an ‘Adobe JPEG’.
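To make this concrete, here is a minimal sketch of how the two pieces of information could be read straight from the JPEG segments: the number of components declared in the frame header (SOF) and the color transform flag carried by the Adobe APP14 segment. It is not how JHOVE works; the function name inspect_jpeg and the keys of the returned dictionary are illustrative assumptions, and the sketch assumes that every marker before the scan data carries a length field.

```python
import struct

# Start-of-frame markers SOF0..SOF15, excluding DHT (0xC4), JPG (0xC8), DAC (0xCC).
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}
APP14 = 0xEE  # application segment used by Adobe
SOS = 0xDA    # start of scan: compressed data follows, so we stop there


def inspect_jpeg(data: bytes) -> dict:
    """Return component count, bit depth and Adobe transform flag, if present."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    info = {"components": None, "bits_per_sample": None, "adobe_transform": None}
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker, skip it
            pos += 1
            continue
        if marker == SOS:
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker in SOF_MARKERS and len(payload) >= 6:
            info["bits_per_sample"] = payload[0]  # sample precision
            info["components"] = payload[5]       # number of components
        elif marker == APP14 and payload[:5] == b"Adobe" and len(payload) >= 12:
            # "Adobe" + version (2) + flags0 (2) + flags1 (2) + transform (1)
            info["adobe_transform"] = payload[11]
        pos += 2 + length
    return info
```

Called on the bytes of a file, for instance inspect_jpeg(open(path, "rb").read()) with a path of your own, a result of 4 components signals an image that is neither RGB nor grayscale. In the Adobe segment, a transform value of 0 with 4 components denotes plain CMYK, 1 denotes YCbCr and 2 denotes YCCK.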
But detection is not enough: we also need to convert the image properly so that our dissemination tools can handle it as a more usual 24-bit RGB image.
If you look for solutions to this issue, you will find many, all of which require you to identify your input format correctly in order to use the appropriate parameters:
- for ImageMagick, you need to run (more details here):
convert -negate -colorspace RGB input_CMYK_Image.jpg output_RGB_Image.jpg
- in Java, the standard ImageIO library did not handle it until recently, so you need extra code or an additional library [follow the StackExchange discussion]
In our workflow, the problem is made even more complex because every image is first converted to JPEG 2000 in order to comply with the IIIF protocol (for zooming, rotating, transforming, and the like). The initial image is therefore transformed twice: first to JPEG 2000, then back to JPEG, but in RGB, for web display. The knowledge of the 4th component is lost during the first conversion, which leads to the unwanted effect!
This misadventure is full of lessons:
- know your tools well, as well as their limits,
- never underestimate simple warnings from them,
- don’t overlook the complexity of apparently simple and well-known file formats: you need to monitor them as they evolve and develop,
- be ready to monitor your tools as well, and request or make enhancements accordingly or even change them,
- be prepared to learn more and share the new knowledge you gain about the formats,
- dissemination is the ultimate test of whether you handled your information correctly.
HAPPY WORLDWIDE DIGITAL PRESERVATION DAY
SHARE YOUR OWN STORY