Quality assured ARC to WARC migration

This blog post continues a series of posts on the web archiving topic „ARC to WARC migration“. It is a follow-up to the posts „ARC to WARC migration: How to deal with de-duplicated records?“ and „Some reflections on scalable ARC to WARC migration“.

The last of these posts, which described how SCAPE tools can be used for multi-terabyte web archive data migration, is the thematic basis for this post. One consequence of evaluating alternative approaches for processing web archive records with the Apache Hadoop framework was to abandon the native Hadoop job implementation (the arc2warc-migration-hdp module was deprecated and removed from the master branch). It had some disadvantages and brought no significant benefits in terms of performance and scale-out capability compared to the command line application arc2warc-migration-cli used together with the SCAPE tool ToMaR for parallel processing. That previous post did not elaborate on quality assurance, which is the main focus of this post.

The workflow diagram in figure 1 illustrates the main components and processes that were used to create a quality assured ARC to WARC migration workflow.

Figure 1: Workflow diagram of the ARC to WARC migration workflow

The main components used in this workflow are based on the Java Web Archive Toolkit (JWAT) for reading web archive ARC container files. Building on this toolkit, the „hawarp“ tool set was developed in the SCAPE project. It bundles several components for preparing and processing web archive data, especially data that is stored in ARC or WARC container files. Special attention was given to making sure that data can be processed using the Hadoop framework, an essential part of the SCAPE platform for distributed data processing on computer clusters.

The input of the workflow is an ARC container file, a format originally proposed by the Internet Archive to persistently store web archive records. The ARC to WARC migration tool is a Java command line application which takes an ARC file as input and produces a WARC file. The tool basically performs a procedural mapping of metadata between the ARC and WARC formats (see the constructors of the eu.scape_project.hawarp.webarchive.ArchiveRecord class). One point worth highlighting is that the records in ARC container files from the Austrian National Library's web archive were not structured homogeneously. When ARC records were created (using the Netarchive Suite/Heritrix web crawler in this case), the usual procedure was to strip the HTTP response metadata from the server's response and store it as part of the header of the ARC record. Possibly due to malformed URLs, this was not applied to all records, so that in some records the HTTP response metadata was still part of the payload content, which is how the WARC standard later defined it. The ARC to WARC migration tool handles these special cases accordingly. In general, as table 1 shows, HTTP response metadata is transferred from the ARC header to the WARC payload header and therefore becomes part of the payload content.

ARC                                      WARC
ARC header: HTTP response metadata   →   WARC header
ARC payload                          →   WARC payload: HTTP response metadata

Table 1: HTTP response metadata is transferred from ARC header to WARC payload header
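
To make the mapping in table 1 more concrete, here is a minimal Python sketch of the idea. It is not the code of the migration tool (which is written in Java): the function name build_warc_payload is hypothetical, and recognising the special case by checking whether the payload starts with "HTTP/" is an assumption made only for this illustration.

```python
def build_warc_payload(http_header, arc_payload):
    """Return the payload for the WARC record.

    http_header: HTTP response metadata that was stripped into the ARC
                 record header (bytes), or None if nothing was stripped.
    arc_payload: payload content of the ARC record (bytes).
    """
    # Special case (assumed detection rule): the HTTP response metadata
    # was never stripped and is still the first part of the payload.
    if arc_payload.startswith(b"HTTP/"):
        return arc_payload
    # Nothing stored in the ARC header, nothing to move.
    if not http_header:
        return arc_payload
    # Usual case: put the HTTP response metadata in front of the content,
    # separated by an empty line as in an HTTP response.
    return http_header.rstrip(b"\r\n") + b"\r\n\r\n" + arc_payload
```

In both cases the result has the same shape: HTTP response metadata first, then the content, which is what the WARC payload is expected to contain.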

The CDX-Index Creation module, which is also part of the „hawarp“ tool set, is used to create a file in the CDX file format. This plain text file stores selected attributes of the web archive records aggregated in ARC or WARC container files, one line per record. The main purpose of the CDX index is to provide a lookup table for the wayback software. The index file contains the information (URL, date, offset: record position in container file, container identifier, etc.) needed to retrieve the data required for rendering an archived web page and its dependent resources from the container files.

Apart from serving the purpose of rendering web resources using the wayback software, the CDX index file can also be used for a basic verification of whether the container format migration was successful, namely by comparing the CDX fields of the ARC CDX file and the WARC CDX file. The basic assumption here is that, apart from the offset and container identifier fields, all other fields must have the same values for corresponding records. In particular, the payload digest makes it possible to verify that the digest computed for the binary data (payload content) is the same for each record in both container formats.
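
The comparison can be sketched in a few lines of Python. This is only an illustration of the idea, not the script used in the workflow (which relies on the Perl script CSVDIFF). It assumes that each CDX file starts with a " CDX ..." header line naming the fields, that fields are separated by spaces, and that the usual CDX field letters apply (a: original URL, b: date, k: digest, V: offset, g: container file name). The function names are hypothetical.

```python
def parse_cdx(text):
    """Parse CDX text into (field_letters, rows), one dict per record."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    if not header or header[0] != "CDX":
        raise ValueError("missing CDX header line")
    fields = header[1:]
    rows = [dict(zip(fields, ln.split())) for ln in lines[1:]]
    return fields, rows


def compare_cdx(arc_text, warc_text, ignore=("V", "g")):
    """Return a list of (record_key, field, arc_value, warc_value) mismatches.

    Fields listed in `ignore` (offset and container name by default) are
    expected to differ and are skipped.
    """
    arc_fields, arc_rows = parse_cdx(arc_text)
    _, warc_rows = parse_cdx(warc_text)
    compared = [f for f in arc_fields if f not in ignore]
    # Records are matched by original URL and date (a simplification).
    warc_by_key = {(r.get("a"), r.get("b")): r for r in warc_rows}
    mismatches = []
    for arc_row in arc_rows:
        key = (arc_row.get("a"), arc_row.get("b"))
        warc_row = warc_by_key.get(key)
        if warc_row is None:
            mismatches.append((key, None, "present", "missing"))
            continue
        for field in compared:
            if arc_row.get(field) != warc_row.get(field):
                mismatches.append(
                    (key, field, arc_row.get(field), warc_row.get(field))
                )
    return mismatches
```

An empty result means that all compared fields, including the payload digest, agree for every record found in the ARC CDX file.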

An additional step of the workflow to verify the quality of the migration is to compare the rendering results of selected resources when they are retrieved from the original ARC and from the migrated WARC container files. To this end, the CDX files are first deployed to the wayback application. In a second step, the PhantomJS framework is used to take snapshots of the same resource rendered once from the ARC container file and once from the WARC container file.

Finally, the snapshot images are compared using Exiftool (basic image properties) and ImageMagick (measure AE: absolute error) in order to determine if the rendering result is equal for both instances. Randomized manual verification of individual cases may then conclude the quality control process.
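
To illustrate what the AE measure expresses, here is a small Python sketch that counts the pixels that differ between two images of the same size. It works on plain lists of pixel tuples and does not call ImageMagick. The function name and the pixel representation are assumptions made for this example only.

```python
def absolute_error(pixels_a, pixels_b):
    """Count the pixels that differ between two equally sized images.

    Each image is given as a flat list of (r, g, b) tuples.
    A result of 0 means the two snapshots are pixel-identical.
    """
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same number of pixels")
    return sum(1 for a, b in zip(pixels_a, pixels_b) if a != b)
```

For two snapshots of the same resource, a count of zero indicates an identical rendering result, while any other value is a reason to look at the individual case manually.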

There is an executable Taverna workflow available on myExperiment. The Taverna workflow is configured by adapting the constant values (light-blue boxes) which define the paths to configuration files, deployment files, and scripts in the processing environment. However, as Taverna is only used in this workflow as an orchestration tool to build a sequence of bash script invocations, it is also possible to use the individual scripts of the workflow directly and replace the Taverna variables (enclosed in double percent signs) accordingly.
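
If the scripts are used without Taverna, the variable replacement can be automated. The following Python sketch shows one possible way, assuming the placeholders have the form %%name%%. The function name fill_placeholders and the variable name used in the example below are hypothetical and not taken from the workflow.

```python
import re


def fill_placeholders(script, values):
    """Replace every %%name%% placeholder in `script` with values[name].

    Raises KeyError if a placeholder has no value, so that a script with
    unresolved variables is never executed by accident.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(name)
        return values[name]

    return re.sub(r"%%(\w+)%%", substitute, script)
```

For example, fill_placeholders("ls %%input_dir%%", {"input_dir": "/data/arc"}) produces a command line in which the placeholder is replaced by the given path.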

The following prerequisites must be fulfilled to be able to execute the Taverna workflow and/or the bash scripts it contains:

Linux operating system
Java version >= 1.7
arc2warc-migration-cli and cdx-creator executable jars downloaded. Alternatively, build them from source using hawarp (git clone https://github.com/openplanets/hawarp.git) and create executable jars with dependencies from the modules arc2warc-migration-cli and cdx-creator using „mvn assembly:assembly“ in the corresponding sub-modules (this additionally requires Maven2 and Git).
OpenSource wayback deployed to an Apache Tomcat servlet container. See this setup and configuration guide for installing the wayback software.
Wayback configured with a CDX collection. Configuration is done in the configuration files available in the WEB-INF folder of the deployed web application. See this example wayback.xml and CDXcollection.xml configuration files to see how to set up the CDX collection.
Perl script CSVDIFF installed
PhantomJS installed and PhantomJS script for taking snapshots from URLs available.
Exiftool and ImageMagick installed
Constant values in the Taverna workflow configured to match the system's environment paths.

The following screencast demonstrates the workflow using a simple "Hello World!" crawl as example:
