There’s Ample to Sample: Content Sampling at the British Library

Many factors contribute to the long-term preservation of and access to digital collections. And typically, the endpoint for this material is a repository—or other type of preservation system.

But what happens to content after it is stored? How do digital preservationists ensure that content is correct and valid when ingested, and that it remains unchanged over time? There are many approaches to mitigating the possibility of corruption pre- and post-ingest, and I propose regular content sampling as one of these methods.

This blog post outlines automated and manual sampling and assessment workflows for preserved content that have been developed by the British Library’s Digital Preservation Team. Please share any other methods and suggestions in the comments section.

The British Library endeavours to ensure the integrity of its collections starting at the points of acquisition (or creation) and ingest, with mechanisms built into these workflows. These steps typically include file format identification, characterisation, and validation, as well as checksum creation and validation. And while there are automated tools to carry out these steps, it is not possible to programmatically check for every issue that could occur.

Issues that could hinder the long-term preservation of digital content are constantly being discovered, and unknown problems are even more difficult to detect: this is why we wanted to conduct both automated and manual checking, the latter of which involves opening the files and visually inspecting them. It should also be kept in mind that anything discovered is not always an issue per se so much as an observation, which might not need to be immediately resolved, or resolved at all. For example, an observation may be noted during manual checking, but further research deems it not to be an issue that hinders the long-term preservation of the object. Either way, checking a sample of preserved content provides a method for identifying whether known issues have occurred, increases the opportunity to actively find new ones, and enables digital preservationists to carry out further investigation and intervene when and where necessary.

Automated and manual checking of preserved digital content at the British Library

Upon ingest, objects are checked for completeness and correctness as part of our repository’s validation stack, using tools like JHOVE and EpubCheck, amongst others, depending on the file format. As part of pre-ingest activity for digitised objects, we create checksums that are then validated and stored within their AIPs’ METS files; checksums provided with born-digital objects from publishers are also validated and stored in their AIPs’ METS files. Additionally, checksums for all preserved objects are subsequently preserved in the Library’s metadata store, and once objects are in the repository, an automated tool periodically validates these values and flags any objects whose checksums do not match.
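To illustrate the idea behind that periodic validation (this is a minimal sketch, not the Library's actual tool), a fixity check boils down to recomputing each object's checksum and comparing it with the stored value. The mapping of identifiers to stored SHA-256 values and the root directory of preserved files are hypothetical inputs:

import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    # Stream the file through SHA-256 so large objects are not held in memory at once.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_fixity(stored: dict[str, str], root: Path) -> list[str]:
    # `stored` maps object identifiers to previously recorded SHA-256 values (hypothetical input).
    # Returns the identifiers whose current checksum no longer matches the stored one.
    return [
        identifier
        for identifier, expected in stored.items()
        if sha256_of(root / identifier) != expected.lower()
    ]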

Although we have taken these steps, we also know that automated tools cannot guarantee integrity 100%; there are preservation issues, as well as broader concerns that could impact the understanding of intellectual content (e.g. blurry text and/or images), for which no automated detection solutions exist, or at least not yet.

With this in mind, the Library’s Digital Preservation Team created its Preservation of Ingested Collections: Assessments, Sampling, and Action plans (PICASA) project with the following objectives:

  • Determine the scope and nature of issues (if any) associated with specific, defined sub-collections
  • Identify potential aspects of the born-digital or digitised files that would hinder the automated QA process (if any)

Since the Library’s digital collections are broad in both scope and scale, we decided to take a more pragmatic approach and manually inspect content of specific formats and file types from our ingest streams, further parameterising where necessary. The content sample itself is determined by what we already know about our collections and through previous PICASA assessments.

Creating a statistically representative sample of preserved content

To begin, we needed a way to create a statistically representative sample of content. We used the following formula to calculate the sample size n required to estimate the proportion of files with issues within a 3% margin of error (d) at 95% confidence (the Z-score for this is 1.96, which we approximate to 2):

n = ((2/d)^2) * (p) * (1-p)

For our purposes, as we do not know what the expected proportion of files with issues (p) is, we set this at 50% (normalised to 0.5) in order to maximise the sample size – at worst, this may mean a larger than necessary sample.

Plugging these (normalised) figures into the formula gives:

n = ((2/0.03)^2) * (0.5) * (0.5)

n = 1111 (approx.)

At least 1,111 files are needed in our sample to achieve a 3% margin of error at the 95% confidence level.
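For reference, the calculation is easy to reproduce; a minimal Python sketch of the formula above (with the Z-score approximated to 2) might look like this:

def sample_size(d: float, p: float = 0.5, z: float = 2.0) -> float:
    # n = (z/d)^2 * p * (1 - p), where d is the margin of error and p the expected proportion.
    return (z / d) ** 2 * p * (1 - p)

n = sample_size(d=0.03)   # approx. 1111.1
print(round(n))           # 1111, the figure quoted above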

The following table indicates sample sizes at different confidence levels using the formula against a population of 100,000 (with a 3% margin of error and a 50% expected error proportion):

Confidence Level    Sample Size
75%                         368
80%                         456
85%                         576
90%                         752
95%                        1067
98%                        1503
99%                        1843
99.5%                      2189
99.9%                      3008
99.99%                     4205
99.999%                    5420

It should be noted that in the original sample size formula we used 2 as an approximation of the 95% confidence level Z-score; more accurately, it is 1.96 (for normal distributions). The Z-score indicates the number of standard deviations an element is from the mean for a given confidence level. The above table was generated using an R script that uses the true Z-score values (hence the slight difference in the 95% sample size).
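For anyone wanting to reproduce the table, a rough Python equivalent of that calculation (not the Library's R script) could use the inverse normal CDF to obtain the true two-tailed Z-score for each confidence level:

from statistics import NormalDist

def exact_sample_size(confidence: float, d: float = 0.03, p: float = 0.5) -> int:
    # Two-tailed Z-score for the confidence level, e.g. approx. 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return round((z / d) ** 2 * p * (1 - p))

for confidence in (0.75, 0.80, 0.85, 0.90, 0.95, 0.98, 0.99, 0.995, 0.999, 0.9999, 0.99999):
    print(f"{confidence * 100:g}%  {exact_sample_size(confidence)}")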

For collections over 100,000 (approx.), these sample sizes remain unchanged. However, as the total collection size drops below 100,000 (approx.), the sample size starts to represent an increasingly significant proportion of the collection. In this case, if the sample size represents more than 5% of our total collection population, a finite population correction factor should be applied, using:

n_a = n / (1 + (n-1)/N)

where n_a is the adjusted sample size, n is the required sample size (calculated above), and N is the collection size.
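A minimal sketch of this correction, applied to a hypothetical collection of 10,000 files using the 95% sample size from the table above:

def adjusted_sample_size(n: float, population: int) -> int:
    # Finite population correction: n_a = n / (1 + (n - 1) / N).
    return round(n / (1 + (n - 1) / population))

print(adjusted_sample_size(1067, 10_000))   # approx. 964 for a hypothetical 10,000-file collection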

Having calculated a sample size, you then need to select that many items from your collection. This can easily be achieved by drawing a simple random sample, as in the sketch below.
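For example, assuming you have a list of object identifiers for the collection being assessed (a placeholder here), a random sample could be drawn like this:

import random

def draw_sample(identifiers: list[str], n: int, seed: int = 42) -> list[str]:
    # Simple random sample without replacement; a fixed seed keeps the selection reproducible.
    return random.Random(seed).sample(identifiers, k=n)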

Automated and manual checking approach

For our first assessment, we chose to look at PDF access copies from the now-completed 19th Century Books ingest stream, which comprises digitised books ingested between 2008 and 2012.

Our calculated sample size of PDFs was 1,111. After randomly selecting this number of PDFs, we re-created the PDFs’ checksums and compared them to those stored in their respective METS files, and analysed the PDFs with both the version of JHOVE used when the content was ingested and the current version. There are individual automated tools for these steps, so as part of the automated checking process a script was created to connect them and to retrieve the sample content from our repository.
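The sketch below suggests the general shape such a script might take; it is not the Library's script, and the METS layout, checksum algorithm, and JHOVE command-line flags are assumptions that would need adjusting to the actual environment:

import hashlib
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

METS_NS = {"mets": "http://www.loc.gov/METS/"}

def stored_checksums(mets_path: Path) -> dict[str, str]:
    # Read the CHECKSUM attribute of each <mets:file> element; real METS profiles may differ.
    tree = ET.parse(mets_path)
    return {
        f.get("ID"): (f.get("CHECKSUM") or "").lower()
        for f in tree.findall(".//mets:file", METS_NS)
    }

def recompute_checksum(path: Path, algorithm: str = "md5") -> str:
    # The algorithm must match whatever is recorded in the METS (MD5 assumed here).
    digest = hashlib.new(algorithm)
    digest.update(path.read_bytes())
    return digest.hexdigest()

def run_jhove(jhove_cmd: str, pdf_path: Path) -> str:
    # Run a given JHOVE installation against one PDF and capture its report;
    # exact flags can vary between JHOVE versions.
    result = subprocess.run(
        [jhove_cmd, "-m", "PDF-hul", str(pdf_path)],
        capture_output=True, text=True, check=False,
    )
    return result.stdout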

Recording manual checking results and creating an observations vocabulary

For the manual checking component of this workflow, a spreadsheet containing the unique identifiers and locations of the PDFs was divided up amongst Digital Preservation Team members. Each person opened their assigned PDFs in Adobe Reader and visually checked every page, noting any observations in the spreadsheet. We also created a controlled vocabulary to define the observations that should be recorded.

The controlled vocabulary is a living list that is added to and revised as assessments are completed. It is helpful to include screen captures of the observations and have the list accessible to those conducting manual checking so it can be referred to throughout an assessment; within the spreadsheet, it is helpful to have the terms available in a drop-down menu within each cell in the observations column. Any additional observations that arise during an assessment should be discussed amongst team members and added to the controlled vocabulary if deemed an observation worth recording.
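One way to provide such a drop-down is a spreadsheet data-validation list; the sketch below uses openpyxl with three example terms from the vocabulary, and the column letter, row range, and file name are hypothetical:

from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active

# Restrict an observations column (column D here, hypothetically) to terms from the
# controlled vocabulary; three example terms are shown, the real list is longer.
vocabulary = DataValidation(type="list",
                            formula1='"blurriness,cropping,bleed-through"',
                            allow_blank=True)
ws.add_data_validation(vocabulary)
vocabulary.add("D2:D1112")          # one row per sampled PDF in a 1,111-file sample

wb.save("manual_checking.xlsx")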

The terms used, moreover, do not necessarily have to denote the underlying causes of the observations; these might not yet be known. They can simply visually describe what appears onscreen: for example, common observations with digitised content can include blurriness, cropping, and bleed through.

Another benefit of using a controlled vocabulary and recording results in a way that can be queried (like a spreadsheet) is that you can compare sampled content across collections and conduct further investigation to see whether commonalities exist.

Additionally, it could be helpful to assign ‘impact levels’ to any observations noted. When it is time to address them, this allows you to triage and prioritise the most significant ones.

Examples of impact levels:

  • Low impact (level 1): A change has occurred, but it has not impacted the meaning of the content. The user can still understand the intellectual content.
  • Medium impact (level 2): A change has occurred, but it has not changed the meaning of the content. However, the change could cause unnecessary difficulty and/or confusion for a user.
  • High impact (level 3): A change has occurred that completely misrepresents the intellectual content.

A spreadsheet for manual checking could therefore include the following column headings (a small query sketch follows the list):

  • Unique identifier: UUID for the digital object being checked
  • URL/location of object: Where the person manually checking the object can find it.
  • Object ok? (Y/N): At a high level, does the object open and render? If yes, move on to the next one; if no, record the observation type.
  • Type of change: Record the type of change that’s occurred (e.g. blurriness, cropping, etc.)
  • Impact level: 1, 2, or 3 (see above)
  • Notes: any additional information that is not captured in the other columns
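Assuming the results are saved with these exact headings (the file name 'manual_checking.xlsx' is hypothetical), a quick tally of observation types against impact levels could be produced along these lines:

import pandas as pd

checks = pd.read_excel("manual_checking.xlsx")

# Count observations by type and impact level across the objects flagged as not OK,
# e.g. to see how many high-impact (level 3) cases of cropping were found.
summary = (
    checks[checks["Object ok? (Y/N)"] == "N"]
    .groupby(["Type of change", "Impact level"])
    .size()
    .unstack(fill_value=0)
)
print(summary)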

Summarising findings

A last step of an assessment involves writing a final report and disseminating it to colleagues, particularly those who are key stakeholders and can help with any follow-up actions. We have found it helpful to include the following information:

  • Justification for the assessment: The reasons behind assessing a particular collection.
  • Information on the collection being sampled: How the content was acquired, or who is responsible for digitisation work; when ingest began and whether it is ongoing; and what types of files comprise an AIP.
  • Scope: What is the scope of the assessment? Are only access copies being looked at?
  • Sampling and assessment approach: Within the scope of the assessment, what are the manual and automated steps being taken to assess the sample content? What tools are being used? How is manual checking being undertaken and by whom?
  • Results: Record the results of the automated and manual checking. It is helpful to break these into two separate sub-sections.
  • Summary of assessment
  • Recommendations for follow-up actions

Concluding thoughts

Sampling and manual inspection have provided an opportunity to be more thorough in determining the preservation needs of the content the Library creates and acquires. They have also provided the chance to see whether any observations are, in fact, legitimate preservation concerns and to determine how best to address these with key stakeholders.

As mentioned, there are issues that could hinder the preservation of content that have not yet been discovered. Sampling increases the likelihood of discovering these in a pragmatic way, helps us understand the root cause of any issues, and provides an opportunity to explore whether automated solutions could be developed to resolve them.

In short, checking a sample of preserved content is about spotting possible issues early to mitigate the possibility of significant problems at a later date.

1 Comment

  1. jaygattuso
    February 21, 2017 @ 6:35 pm CET

    Hi Caylin. Thanks for posting this. It’s always interesting to see approaches to this problem of validating large collections.
    From your post, it looks like you’ve undertaken this method for at least the collection of 1,111 PDF files. Could you share your findings? I’d be very interested to see the numbers around your impact levels and any other insights the process gave you.
