What do we mean by “embedded” files in PDF?

The most important new feature of the recently released PDF/A-3 standard is that, unlike PDF/A-2 and PDF/A-1, it allows you to embed any file you like. Whether this is a good thing or not is the subject of some heated online discussions. But what do we actually mean by embedded files? As it turns out, the answer to this question isn’t as straightforward as you might think. One of the reasons for this is that in colloquial use we often talk about “embedded files” to describe the inclusion of any “non-text” element in a PDF (e.g. an image, a video or a file attachment). On the other hand, the term “embedded files” in the PDF standards (including PDF/A) refers to something much more specific, which is closely tied to PDF’s internal structure.

Embedded files and embedded file streams

When the PDF standard mentions “embedded files”, what it really refers to is a specific data structure. PDF has a File Specification Dictionary object, which in its simplest form is a table that contains a reference to some external file. PDF 1.3 extended this, making it possible to embed the contents of referenced files directly within the body of the PDF using Embedded File Streams. They are described in detail in Section 7.11.4 of the PDF Specification (ISO 32000). A File Specification Dictionary that refers to an embedded file can be identified by the presence of an EF entry.

Here’s an example (source: ISO 32000). First, here’s a file specification dictionary:

31 0 obj
<</Type /Filespec /F (mysvg.svg) /EF <</F 32 0 R>> >>
endobj

Note the EF entry, which references another PDF object. This is the actual embedded file stream. Here it is:

32 0 obj
<</Type /EmbeddedFile /Subtype /image#2Fsvg+xml /Length 72>>
stream
…SVG Data…
endstream
endobj

Note that the part between the stream and endstream keywords holds the actual file data, here an SVG image, but this could really be anything!

So, in short, when the PDF standard mentions “embedded files”, this really means Embedded File Streams.
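
To make this a bit more tangible, here is a minimal Python sketch that scans the raw bytes of a PDF for file specification dictionaries that carry an EF entry. It rests on some loud assumptions: the objects are stored uncompressed (not inside object streams), the file is not encrypted, and the keyword endobj does not happen to occur inside any stream data. The function name find_embedded_filespecs is made up for this illustration; a real tool should rely on a proper PDF parser.

```python
import re

# Matches "N G obj ... endobj" for plain (uncompressed) indirect objects only.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def find_embedded_filespecs(pdf_bytes):
    """Return the object numbers of Filespec dictionaries that have an EF entry.

    Hypothetical helper: a naive byte scan, not a full PDF parser.
    """
    hits = []
    for match in OBJ_RE.finditer(pdf_bytes):
        body = match.group(3)
        is_filespec = re.search(rb"/Type\s*/Filespec\b", body)
        # EF is either an inline dictionary or an indirect reference.
        has_ef = re.search(rb"/EF\s*(<<|\d+\s+\d+\s+R)", body)
        if is_filespec and has_ef:
            hits.append(int(match.group(1)))
    return hits


# The two objects from the example above, with placeholder stream data.
sample = (
    b"31 0 obj\n"
    b"<</Type /Filespec /F (mysvg.svg) /EF <</F 32 0 R>> >>\n"
    b"endobj\n"
    b"32 0 obj\n"
    b"<</Type /EmbeddedFile /Subtype /image#2Fsvg+xml /Length 72>>\n"
    b"stream\n...\nendstream\n"
    b"endobj\n"
)
print(find_embedded_filespecs(sample))
```

Only object 31 is reported here: object 32 is the embedded file stream itself, not a file specification dictionary.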

So what about “embedded” images?

Here’s the first source of confusion: if a PDF contains images, we often colloquially call these “embedded”. However, internally they are not represented as Embedded File Streams, but as so-called Image XObjects. (In fact the PDF standard also includes yet another structure called inline images, but let’s forget about those just to avoid making things even more complicated.)

Here’s an example of an Image XObject (again taken from ISO 32000):

10 0 obj
<< /Type /XObject /Subtype /Image /Width 100 /Height 200 /ColorSpace /DeviceGray /BitsPerComponent 8 /Length 2167 /Filter /DCTDecode >>
stream
…Image data…
endstream
endobj

Similar to embedded file streams, the part between the stream and endstream keywords holds the actual image data. The difference is that only a limited set of pre-defined formats is allowed. These are defined by the Filter entry (see Section 7.4 in ISO 32000). In the example above, the value of Filter is DCTDecode, which means we are dealing with JPEG encoded image data.
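
As a rough illustration of how the Filter entry identifies the image format, the Python sketch below reads the Subtype and Filter entries from the text of a stream dictionary. The function name and the plain-text input are assumptions made for this example. The lookup table only covers the four image-specific filters; the other standard filters (such as FlateDecode or LZWDecode) are general-purpose encodings and are labelled as such.

```python
import re

# Image-specific filters from the standard filter list, mapped to the
# format they imply. Anything else is treated as a generic encoding.
IMAGE_FILTERS = {
    "DCTDecode": "JPEG",
    "JPXDecode": "JPEG 2000",
    "CCITTFaxDecode": "CCITT Group 3/4 fax",
    "JBIG2Decode": "JBIG2",
}


def describe_image_xobject(dictionary_text):
    """Return a list of (filter name, description) pairs for an Image XObject.

    Returns None if the dictionary is not an Image XObject. Hypothetical
    helper: it works on dictionary text, not on a parsed PDF.
    """
    if not re.search(r"/Subtype\s*/Image\b", dictionary_text):
        return None
    # Filter may be a single name or an array of names.
    match = re.search(r"/Filter\s*(\[[^\]]*\]|/\w+)", dictionary_text)
    if match is None:
        return []  # no filter: raw, unencoded sample data
    names = re.findall(r"/(\w+)", match.group(1))
    return [(n, IMAGE_FILTERS.get(n, "generic encoding")) for n in names]


example = ("<< /Type /XObject /Subtype /Image /Width 100 /Height 200 "
           "/ColorSpace /DeviceGray /BitsPerComponent 8 /Length 2167 "
           "/Filter /DCTDecode >>")
print(describe_image_xobject(example))
```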

Embedded file streams and file attachments

Going back to embedded file streams, you may now start wondering what they are used for. According to Section 7.11.4.1 of ISO 32000, they are primarily intended as a mechanism to ensure that external references in a PDF (i.e. references to other files) remain valid. It also states:

The embedded files are included purely for convenience and need not be directly processed by any conforming reader.

This suggests that the usage of embedded file streams is simply restricted to file attachments (through a File Attachment Annotation or an EmbeddedFiles entry in the document’s name dictionary).

Here’s a sample file (created in Adobe Acrobat 9) that illustrates this:

http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/fileAttachment.pdf

Looking at the underlying code we can see the File Specification Dictionary:

37 0 obj
<</Desc()/EF<</F 38 0 R>>/F(KSBASE.WQ2)/Type/Filespec/UF(KSBASE.WQ2)>>
endobj

Note the /EF entry, which means the referenced file is embedded (the actual file data are in a separate stream object).

Further digging also reveals an EmbeddedFiles entry:

33 0 obj
<</EmbeddedFiles 34 0 R/JavaScript 35 0 R>>
endobj

However, careful inspection of ISO 32000 reveals that embedded file streams can also be used for multimedia! We’ll have a look at that in the next section…

Embedded file streams and multimedia

Section 13.2.1 (Multimedia) of the PDF Specification (ISO 32000) describes how multimedia content is represented in PDF (emphases added by me):

  • Rendition actions (…) shall be used to begin the playing of multimedia content.

  • A rendition action associates a screen annotation (…) with a rendition (…)

  • Renditions are of two varieties: media renditions (…) that define the characteristics of the media to be played, and selector renditions (…) that enables choosing which of a set of media renditions should be played.
  • Media renditions contain entries that specify what should be played (…), how it should be played (…), and where it should be played (…)

The actual data for a media object are defined by Media Clip Objects, and more specifically by the media clip data dictionary. Its description (Section 13.2.4.2) contains a note, saying that this dictionary “may reference a URL to a streaming video presentation or a movie embedded in the PDF file”. The description of the media clip data dictionary (Table 274) also states that the actual media data are “either a full file specification or a form XObject”.

In plain English, this means that multimedia content in PDF (e.g. movies that are meant to be rendered by the viewer) may be represented internally as an embedded file stream.

The following sample file illustrates this:

http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/embedded_video_quicktime.pdf

This PDF 1.7 file was created in Acrobat 9, and if you open it you will see a short Quicktime movie that plays upon clicking on it.

Digging through the underlying PDF code reveals a Screen Annotation, a Rendition Action and a Media clip data dictionary. The latter looks like this:

41 0 obj
<</CT(video/quicktime)/D 42 0 R/N(Media clip from animation.mov)/P<</TF(TEMPACCESS)>>/S/MCD>>
endobj

It contains a reference to another object (42 0), which turns out to be a File Specification Dictionary:

42 0 obj
<</EF<</F 43 0 R>>/F(<embedded file>)/Type/Filespec/UF(<embedded file>)>>
endobj

What’s particularly interesting here is the /EF entry, which means we’re dealing with an embedded file stream here. (The actual movie data are in a stream object (43 0) that is referenced by the file specification dictionary.)
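
The chain of references described here (media clip data dictionary, then file specification dictionary, then embedded file stream) can be followed with a few lines of Python. This is only a sketch: it assumes you already have the dictionary text of each object in a plain dict keyed by object number (a hypothetical input that a real PDF parser would have to supply), and the function name follow_media_clip is invented for this example.

```python
import re


def follow_media_clip(objects, clip_number):
    """Return (filespec object number, stream object number), or None.

    objects maps object numbers to dictionary text. None is returned when
    the clip has no D reference, or when D does not lead to a file
    specification with an EF entry (e.g. when it is a form XObject).
    """
    clip = objects[clip_number]
    # The D entry of the media clip data dictionary points to the media data.
    d_ref = re.search(r"/D\s+(\d+)\s+\d+\s+R", clip)
    if d_ref is None:
        return None
    filespec_number = int(d_ref.group(1))
    filespec = objects.get(filespec_number, "")
    # The F key inside the EF dictionary points to the embedded file stream.
    ef_ref = re.search(r"/EF\s*<<.*?/F\s+(\d+)\s+\d+\s+R", filespec)
    if ef_ref is None:
        return None
    return filespec_number, int(ef_ref.group(1))
```

Feeding this function the text of objects 41 and 42 from the sample file should lead from the media clip data dictionary to object 42 and on to the stream object 43.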

So, the analysis of this sample file confirms that embedded file streams are actually used by Adobe Acrobat for multimedia content.

What does PDF/A say on embedded file streams?

In PDF/A-1, embedded file streams are not allowed at all:

A file specification dictionary (…) shall not contain the EF key. A file’s name dictionary shall not contain the EmbeddedFiles key

In PDF/A-2, embedded file streams are allowed, but only if the embedded file itself is PDF/A (1 or 2) as well:

A file specification dictionary, as defined in ISO 32000-1:2008, 7.11.3, may contain the EF key, provided that the embedded file is compliant with either ISO 19005-1 or this part of ISO 19005.

Finally, in PDF/A-3 this last limitation was dropped, which means that any file may be embedded (source: this unofficial newsletter item, as at this moment I don’t have access to the full specification of PDF/A-3).
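
The three rules above boil down to a small decision function. The Python sketch below is just my reading of the quoted passages (and, for PDF/A-3, of the second-hand source mentioned above), not a substitute for a real validator; the function name and its parameters are made up for illustration.

```python
def embedded_file_allowed(pdfa_part, embedded_pdfa_part=None):
    """Say whether a PDF/A file of the given part may contain an embedded file.

    pdfa_part: 1, 2 or 3 (the part of ISO 19005 the host file claims).
    embedded_pdfa_part: 1, 2 or 3 if the embedded file is itself PDF/A,
    None if it is anything else. Both parameters are hypothetical.
    """
    if pdfa_part == 1:
        return False  # EF key not allowed at all
    if pdfa_part == 2:
        return embedded_pdfa_part in (1, 2)  # only PDF/A-1 or PDF/A-2
    if pdfa_part == 3:
        return True  # any file may be embedded
    raise ValueError("unknown PDF/A part: %r" % (pdfa_part,))
```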

Does this mean PDF/A-3 supports multimedia?

No, not at all! Even though nothing stops you from embedding multimedia content (e.g. a Quicktime movie), you wouldn’t be able to use it as a renderable object inside a PDF/A-3 document. The reason is that the annotations and actions that are needed for this (e.g. Screen annotations and Rendition actions, to name but a few) are not allowed in PDF/A-3. So effectively you are only able to use embedded file streams as attachments.

Adobe adding to the confusion

A few weeks ago the embedding issue came up again in a blog post by Gary McGath. One of the comments there is from Adobe’s Leonard Rosenthol (who is also the Project Leader for PDF/A). After correctly pointing out some mistakes in both the original blog post and in an earlier comment by me, he nevertheless added to the confusion by stating that objects that are rendered by the viewer (movies, etc.) all use Annotations, and that embedded files (which he apparently uses as a synonym for attachments) are handled in a completely different manner. This doesn’t appear to be completely accurate: at least one class of renderable objects (screen annotations/rendition actions) may be using embedded file streams. Also, embedded files that are used as attachments may be associated with a File Attachment Annotation, which means that “under the hood” both cases are actually more similar than first meets the eye (which is confirmed by the analysis of the two sample files in the preceding sections). Contributing to this confusion is also the fact that Section 7.11.4 of ISO 32000 erroneously states that embedded file streams are only used for non-renderable objects like file attachments, which is contradicted by their allowed use for multimedia content.

Does any of this matter, really?

Some might argue that the above discussion is nothing but semantic nitpicking. However, details like these do matter if we want to do a proper assessment of preservation risks in PDF documents. As an example, in this previous blog post I demonstrated how a PDF/A validator tool can be used to profile PDFs for “risky” features. Such tools typically give you a list of features. It is then largely up to the user to further interpret this information.

Now suppose we have a pre-ingest workflow that is meant to accept PDFs with multimedia content, while at the same time rejecting file attachments. By only using the presence of an embedded file stream (reported by both Apache’s and Acrobat’s Preflight tools) as a rejection criterion, we could end up unjustly rejecting files with multimedia content as well. To avoid this, we also need to take into account what the embedded file stream is used for, and for this we need to look at what annotation types are used, and the presence of any EmbeddedFiles entry in the document’s name dictionary. However, if we don’t know precisely which features we are looking for, we may well arrive at the wrong conclusions!
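
As a sketch of what such a check might look like, the Python below classifies the use of embedded file streams from two features that a profiling tool would have to report: the annotation subtypes found in the document and whether the name dictionary has an EmbeddedFiles entry. The function names, the input format and the accept/reject policy are all assumptions for this example; FileAttachment and Screen are the annotation subtype names used in ISO 32000.

```python
def classify_embedded_file_use(annotation_subtypes, has_embeddedfiles_entry):
    """Return a set with the likely uses of embedded file streams.

    annotation_subtypes: set of annotation subtype names found in the PDF.
    has_embeddedfiles_entry: True if the name dictionary has EmbeddedFiles.
    """
    uses = set()
    if "FileAttachment" in annotation_subtypes or has_embeddedfiles_entry:
        uses.add("attachment")
    if "Screen" in annotation_subtypes:
        uses.add("multimedia")
    return uses or {"unknown"}


def accept_for_ingest(annotation_subtypes, has_embeddedfiles_entry):
    """Hypothetical policy: multimedia is fine, file attachments are not."""
    uses = classify_embedded_file_use(annotation_subtypes,
                                      has_embeddedfiles_entry)
    return "attachment" not in uses


# A PDF with only a screen annotation is accepted ...
print(accept_for_ingest({"Screen"}, False))
# ... while one with an EmbeddedFiles entry is rejected.
print(accept_for_ingest({"Link"}, True))
```

Note that this deliberately ignores Movie and Sound annotations, and that an "unknown" result would still need a human to look at the file.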

This is made all the worse by the fact that preservation issues are often formulated in vague and non-specific ways. An example is this issue on the OPF Wiki on the detection of “embedded objects”. The issue’s description suggests that images and tables are the main concern (neither of which is, strictly speaking, an embedded object). The corresponding solution page subsequently complicates things further by also throwing file attachments into the mix. In order to solve issues like these, it is helpful to know that images are (mostly) represented as Image XObjects in PDF. The solution should then be a method for detecting Image XObjects. However, without some background knowledge of PDF’s internal data structure, solving issues like these becomes a daunting, if not impossible, task.

Final note

In this blog post I have tried to shed some light on a number of common misconceptions about embedded content in PDF. I might have inadvertently created some new ones in the process, so feel free to contribute any corrections or additions using the comment fields below.

The PDF specification is vast and complex, and I have only addressed a limited number of its features here. For instance, one might argue that a discussion of embedding-related features should also include fonts, metadata, ICC profiles, and so on. The coverage of multimedia features here is also incomplete, as I didn’t include Movie Annotations or Sound Annotations (which preceded the Screen Annotations, which are now more commonly used). These things were all left out here because of time and space constraints. This also means that further surprises may well be lurking ahead!


Johan van der Knijff
KB / National Library of the Netherlands

14 Comments

  1. ecochrane
    January 21, 2013 @ 11:59 pm CET

    Thanks again Paul, I appreciate the time taken to respond to these. As the folks from the National Library of Australia said recently:

    "Like most things to do with managing digital collections, effective ways of making preservation decisions are evolving. Let's be quite clear about what we mean by that statement: we (the digital preservation community) have no settled, agreed procedures for the full range of digital preservation challenges, nor even tentative plans we confidently expect to ensure adequate preservation for the next 100, 200 or 500 years. We don't know with certainty what we have to do now, let alone what will have to be done in 20, 50 or 100 years in the future. So we are necessarily speculating and making up proposals for action."

    As such, these kinds of discussions are important for establishing where to go in the future and what tools to use to get there. 

    So to address your argument:

    Producers don't provide rendering stack info and other useful information in most cases

    My point was that if they don't now, we should get them to in the future. And that it's not going to be that hard for them. In most cases for e.g. document files, all that is expected by the creator is that the user has a copy of MS Word that is able to open it, and is from the same era. Most people/orgs will be able to tell you that. If that means that the file has weird performance issues then that is a replication of what would have happened at the time. Not a bug/issue. 

    Even if they did, it would be patchy at best. Creator/producers don't all understand digital preservation, or care.

    They don't need to understand digital preservation to tell us how they expect users to interact with their files. In most cases the lack of understanding will make our jobs easier as their expectations will be low (see example above). 

    Where available the quality of this information would therefore vary tremendously

    I doubt it, I imagine it will mostly be of similar quality, mostly very good, however it will mostly be very basic, e.g. "requires acrobat reader", or "requires photo viewer", or "requires MS Powerpoint". 

    Some risks relate to external dependencies which an emulation solution does not address anyway

    Yes they (the risks) do and yes emulation doesn't address that.  I agreed to this earlier, and I'll agree again. It will be a matter of educating users to inform them about these issues in the future. And the best way to do this might be to use a characterisation type tool to identify when it is happening and to notify users every time they try to deposit such files so that they learn to tell the preserving institution about them in the future.  

    Therefore characterisation of everything is an inevitable need

     I'll just say no to that, both because I hope I have shown it to be untrue, and as I am unable to cite great evidence (which we have both been limited by in this thread)

    There is a problem with all of these arguments and that is the lack of data to support either position. In particular though there is a lack of data to support arguments that are pro-emulation as few people seem willing to try implementing it on a large scale. Hopefully that will change in the future and these discussions will be more evidence based and less tautological. 

  2. paul
    January 21, 2013 @ 12:47 pm CET

    Euan and Johan,

    I’d like to pick up on a few things Euan said:

    “This is why I believe the donors or creators of the files should be the ones to specify these details about them, as they are the ones who know what is important about the files.”

    While I have no doubt that some users have the expertise to understand the issues and come to the right conclusions, assuming that everyone will be able to do this, let alone understand concepts like embedded or non-embedded fonts is a little naive.

    “So whether or not it is difficult for donors to provide this information, they should do it. I do concede that getting this information from users is not always going to happen. It doesn’t mean we shouldn’t work towards ensuring that it happens more often than not though.”

    So if we agree that donors/producers won’t *always* be able to provide this information, and we don’t know when the information is accurate/reliable, surely we then have to characterise everything anyway? Let’s say it does happen. What form will this information be in? Will it be standardised? Will all those creators produce this info in an easily machine-digestible form? Seems unlikely. Therefore, you have to question the value of it.

    You still seem to be sure that if us preservation people just tried a little harder, producers of the content we preserve would provide us with all this useful rendering stack information, dependencies, etc. This does not happen in the majority of cases (and as I highlighted in my last post, and I’m familiar with a lot of organisations), even where effort has been put in to build up a relationship with the depositor.

    So in summary:

    • Producers don’t provide rendering stack info and other useful information in most cases
    • Even if they did, it would be patchy at best. Creator/producers don’t all understand digital preservation, or care.
    • Where available the quality of this information would therefore vary tremendously
    • Some risks relate to external dependencies which an emulation solution does not address anyway
    • Therefore characterisation of everything is an inevitable need
    Paul

  3. johan
    January 14, 2013 @ 12:39 pm CET

    Even if donors were able to provide all this information (I’m still very sceptical about this, but let’s assume you’re right), at best that would only help us for PDFs that we will be receiving at some future time (as I’m not aware of any of this happening in operational settings already). But:

    • We’ve been receiving PDFs for many years already;
    • we have millions of them, and
    • quite frankly we don’t know very much about them!

    Some may have dependencies we’re not even aware of. Your proposed solution doesn’t cover these materials at all, yet I hope you’ll agree with me that we want to keep them accessible over time. Characterisation would help us to identify problematic files, and take appropriate action if needed (e.g. by getting back to the original donors/publishers if needed).

    So, even if your solution would work, it only covers a part of the problem!

  4. ecochrane
    January 14, 2013 @ 12:38 am CET

    I agree that characterisation is useful. But I believe that it should be the donors/creators that specify which dependencies exist for files and whether they are encrypted etc.  This is particularly important for “dependencies” as (for example) while specific fonts may be specified in files it may not mean that those fonts are required for the object to be presented properly. In fact the font used to create a file may have regularly been not-available to the users that were the target audience of the content the file was presenting. So when an archival institution characterises a file as requiring a specific font when being rendered/interacted with, it may be wrong. This is why I believe the donors or creators of the files should be the ones to specify these details about them, as they are the ones who know what is important about the files.  So whether or not it is difficult for donors to provide this information, they should do it. I do concede that getting this information from users is not always going to happen. It doesn’t mean we shouldn’t work towards ensuring that it happens more often than not though.

    To make this a little more palatable it’s useful to point out that the requirements a donor might have to meet for such metadata may not be as onerous as you seem to think. You quoted Andy saying:

    “later years have seen an explosion in the number of distinct creator applications, with over 2100 different implementations of around 600 distinct software packages” .

    I added the bold formatting to the quote to make a point. In my previous comment I specified “interaction” stack quite deliberately. The interaction stack can often be far less complex than the creation stack and can be much easier for the donor to specify. For example in the case of pdf-based objects the donor might only require that the interaction stack include a version of Adobe Acrobat Reader that was current at the time the file was created (and not the full Adobe Acrobat Pro), and otherwise not care which particular version of Reader is used or which Operating System (OS) it is used with. The reason they might not care is because when the files were created the creators usually didn’t care about those fine details, and expected the files to be interacted with using many different interaction stacks.

    What this can lead to is a requirement or desire for archival institutions to provide more than one emulated interaction environment for future users to use to interact with the files and access the content captured using them. They may need, or want to do that when multiple environments existed at the same time that were likely used to interact with the objects and which may have presented slightly (or dramatically) different content to the users. Examples include Microsoft Word (.docx) documents accessed via Word 2007 and Word 2010 running on Windows XP, Windows Vista, Windows 7 or Windows 8, or the same files accessed via Word 2003 with the compatibility pack installed and running on any of the same operating systems, or the same files accessed via OSX and OSX versions of Word, or the same files accessed via LibreOffice, OpenOffice, Abiword, etc running on some version of Linux/Android etc. But I would stress that this is likely not going to be a necessity. In most cases it should only be necessary for the institutions to provide a representative interaction environment that represents a typical user’s experience of interacting with the objects as that is all that would have been expected to be available  by the creator when the files were created.

    An important point to re-emphasise here is that by giving the end-user the option of which environment to use, and by maintaining the original files without alteration, the institution is interpreting the content as little as possible. In contrast, with a migration-based approach the institution is required to regularly decide what content is important in order to assess the viability of migration tools that lose some content (as all do by definition).

    One last point to make before I finish up, the donors/creators need to specify details of the interaction environments for the objects in order to implement a migration based strategy anyway. If they don’t, then the archival institutions will not know which environment to use to test migration tools against.  For example, if a docx file is converted to a .odt file, how will the archival institution know which rendering of the docx file should be compared against the final rendering of the .odt file to check whether the content has been preserved? Guessing what environment it should be by using a characterisation tool might be better than nothing. But it should not be the goal at which we aim.

    So in summary it seems the main point we disagree on is whether or not donors/transferring agencies/publishers etc. will be able to provide the information (metadata) necessary to support an emulation strategy. I think that for the most part, regardless of how difficult it is to collect such information, it is critical that it is collected as otherwise the archival institution will have to make decisions about the importance of content in the objects,  that are inappropriate for it to make. I also wonder if part of your wariness around this is a result of having an expectation that the amount of information that donors will need to provide will be large.  I believe that in most cases (80% or more of the time) the donors will not have to provide much information at all, as the expectations around interaction environments for objects are often not very complex. 

    A final additional point re: software archives. There are economies of scale to be realised around these, especially via the use of cloud-emulation services such as those proposed by Dirk von Suchodoletz. There may only need to be a few such archives from which software can be lent/leased out to users on demand (and licensing issues centrally managed). These can be done legally, and yes they will be a challenge to build but so is everything in this field it seems. We just need people to get on and do it.

  5. ecochrane
    January 10, 2013 @ 11:25 pm CET

    Thanks for the reply Paul.  I’ve outlined some counter-points below.  

    1) I agree this is a challenge and for some types of institutions this will be more difficult than others. But it is definitely possible (and often simple). Publishers, for instance, would likely have a very good idea of the software required to interact with their products. Government and Corporate archives can set standards to a degree, and requiring documentation of the interaction-stack is not a particularly onerous standard to require. 

    2) The legality of donors providing software will depend on the jurisdiction and the license terms. In some jurisdictions I believe even OEM licensed can be sold/reused/donated (I am not a lawyer but I believe this is the case in Germany for instance). This is also not an essential step. The archival institutions can acquire the software themselves through legal means.  While there are not currently many legal means of doing so, this is a solvable problem provided there is a will to do so. 

    3) I think you countered this point yourself. As Dirk, Bram and others are showing, opening and interacting with a pdf-based object via an emulator doesn’t have to be any more difficult than opening and interacting with one via Adobe Acrobat reader, i.e. click to open. This “issue”, if not already solved, will be soon. 

    4) I agree this is a challenge and may necessitate the use of characterisation tools. However if donors can provide rendering-stack details then those should include all dependencies and therefore characterisation would not be required (your point about the challenge of getting this information still stands though). Or alternatively they might provide snapshots of desktops with all necessary software & dependencies included on them, to be used for interacting with the objects via emulators in the future. 

    5) My points in answer to your 4) apply equally to encryption. However what I would also suggest here is that a) this is an edge case much more so than embedded fonts and therefore not a great criticism of the overall solution, and b) the transferring owner will likely be quite aware of the objects it has that require encryption keys, as they will have had to have a solution for managing those keys on an ongoing basis, and will therefore be quite easily able to provide those keys to the archival institution. 

    So in summary, the approach I suggest is not necessarily straight forward at the moment but it does offer a solution to the problems of preserving PDFs. I concede that characterisation can be useful, however under the approach I outlined it is not necessary if you are willing to do a little more up-front (and certainly possible) work with donors etc. The issues that are being noted in trying to build characterisation and validation tools for every format and format variant highlight the complexity of an approach that relies on them. When this complexity and its associated cost are combined with the current (and likely future) impossibility of validating migration on a large scale, for a cost that isn’t outrageous, then you are left with serious questions about the feasibility of an approach that requires all of this. More generally the objections you raise are all things that could be solved if there is a will (and therefore funding) to do so.

    I may be sounding like a broken record here but the reasons I keep advocating emulation as a solution are that it works, can be proven to work (unlike large scale cost-feasible migration for anything but the simplest of objects), is mostly a just-in-time solution rather than a just-in-case solution, is likely to be cheaper than alternatives in the long-run as a result, and, if properly supported, can provide a much richer and more engaging digital history experience.

    While there is increasing support for emulation solutions, I often find myself wondering why there aren’t more practitioners that support this approach in contrast to the number that support migration. I am beginning to be concerned about the possibility that, after 10-15+ years of investment in digital preservation solutions that institutions are now stuck with, there is a large amount of vested interest in migration-based approaches. If this is at all the case then I implore anyone reading to take some time to reconsider their approach and to think about the possible long-term benefits an emulation-based strategy may have to your institution and to the future of your digital assets and digital cultural heritage.

     But perhaps I’m missing the point? I still don’t really understand what a “preservation risk” is meant to be. A risk to what? A risk to our ability to continue to interact with the objects in the future? 
