The browser-shots tool is developed by Internet Memory in the context of the SCAPE project, as part of the preservation and watch (PW) sub-project. Its goal is to perform automatic visual comparisons in order to detect rendering issues in archived Web pages.
From the tools developed within the project (in the preservation components sub-project), we selected MarcAlizer, a tool developed by UPMC that performs the visual comparison between two Web pages. In a second phase, the renderability analysis will also include the structural comparison of the pages, which is implemented by the new Pagelyser tool.
Since the core renderability analysis is thus performed by an external tool, the overall performance of the browser-shots tool is tied to this external dependency. We will keep integrating the latest MarcAlizer releases, as well as the updates to the tool that result from more specific training.
The detection of the rendering issues is done in the following three steps:
1° Screenshots of the Web pages are taken automatically with the Selenium framework, for different browser versions.
2° Pairs of screenshots are compared visually with the MarcAlizer tool (recently replaced by the PageAlizer tool, which also includes the structural comparison).
3° Rendering issues in the Web pages are detected automatically, based on the comparison results.
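The three steps can be sketched as a small Python function. This is only an illustration under stated assumptions, not the tool's actual code: take_screenshot and compare are hypothetical stand-ins for the Selenium and MarcAlizer building blocks, the comparator is assumed to return a similarity score where higher means more similar, and the threshold value is invented for the example.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical cut-off: the report does not say how comparison results
# are turned into a verdict.
ISSUE_THRESHOLD = 0.5


def detect_rendering_issues(
    url: str,
    browsers: List[str],
    take_screenshot: Callable[[str, str], bytes],
    compare: Callable[[bytes, bytes], float],
    threshold: float = ISSUE_THRESHOLD,
) -> List[Tuple[str, str, float]]:
    """Return the browser pairs whose screenshots of `url` differ too much."""
    # Step 1: take one screenshot per browser version.
    shots: Dict[str, bytes] = {b: take_screenshot(url, b) for b in browsers}

    issues = []
    for first, second in combinations(browsers, 2):
        # Step 2: visual comparison of one pair of screenshots.
        score = compare(shots[first], shots[second])
        # Step 3: flag the pair if the similarity is below the threshold.
        if score < threshold:
            issues.append((first, second, score))
    return issues
```

Passing the two building blocks in as callables keeps the orchestration independent of the comparator, which matters here because the comparison tool is an external dependency that is updated separately.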
Initial Implementation
The browser-shots tool is developed as a wrapper application that orchestrates the main building blocks (Selenium instances and MarcAlizer comparators) and performs large-scale experiments on archived Web content.
The browser versions tested so far are: Firefox (all available releases), Chrome (latest version only), Opera (the official 11th and 12th versions) and Internet Explorer (still to be fixed).
The initial, sequential implementation of the tool consists of several Python scripts running on a Debian Squeeze (64-bit) platform. This version was released on GitHub, and we received valuable feedback on it from the sub-project partners:
https://github.com/crawler-IM/browser-shots-tool
For the preliminary rounds of tests, we deployed the browser-shots tool on three nodes of IM's cluster and performed automated comparisons for around 440 pairs of URLs. The average processing time was about 16 seconds per pair of Web pages. These results showed that the existing solution is suitable for small-scale analysis only. Most of the processing time is spent on IO operations and disk access to the binary snapshot files. Taking the screenshots proved to be very time consuming, so the solution needs to be further optimized and parallelized before it can be deployed on a large scale.
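A back-of-the-envelope calculation shows why this does not scale. The sketch below uses the measured figures (440 pairs, about 16 seconds per pair) and treats the 16 seconds as a sequential cost per pair, which is an assumption; the 100 000-pair workload is purely hypothetical.

```python
SECONDS_PER_PAIR = 16.0  # measured average from the preliminary tests


def total_hours(n_pairs: int, seconds_per_pair: float = SECONDS_PER_PAIR) -> float:
    """Sequential processing time in hours for n_pairs URL pairs."""
    return n_pairs * seconds_per_pair / 3600


if __name__ == "__main__":
    print(f"440 pairs:     {total_hours(440):.2f} h")
    # Hypothetical larger workload, for illustration only.
    print(f"100 000 pairs: {total_hours(100_000) / 24:.1f} days")
```

Under these assumptions the 440 pairs take just under two hours, while 100 000 pairs would take roughly 18.5 days of sequential processing.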
The results also showed that a serious performance bottleneck is the passing of intermediate data between the modules. More precisely, materializing the screenshots as binary files on disk is a very time-consuming operation, especially in large-scale experiments on a large number of Web pages.
We therefore have to move to a different implementation of the tool, which will use an optimized version of MarcAlizer. The Web page screenshots taken with Selenium will be passed directly to the MarcAlizer comparator as streams, and the new implementation of the browser-shots tool will be a MapReduce job running on a Hadoop cluster. With this framework, the current rounds of tests can be extended to a much higher number of pairs of URLs.
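The stream-based hand-over could look roughly as follows in Python. The Selenium WebDriver bindings provide get() and get_screenshot_as_png(), which returns the PNG as bytes without touching the disk. The compare callable is a hypothetical stand-in for the MarcAlizer comparator, whose real interface is not described here.

```python
import io
from typing import BinaryIO, Callable


def capture_to_stream(driver, url: str) -> io.BytesIO:
    """Render `url` and return the screenshot as an in-memory PNG stream."""
    driver.get(url)
    # get_screenshot_as_png() returns the image as bytes, so no
    # intermediate file is written.
    return io.BytesIO(driver.get_screenshot_as_png())


def compare_urls(
    driver,
    url_a: str,
    url_b: str,
    compare: Callable[[BinaryIO, BinaryIO], float],
) -> float:
    """Capture both pages and pass the streams straight to the comparator."""
    shot_a = capture_to_stream(driver, url_a)
    shot_b = capture_to_stream(driver, url_b)
    # `compare` is an assumed interface standing in for MarcAlizer.
    return compare(shot_a, shot_b)
```

With a real browser, driver would be an object such as selenium.webdriver.Firefox(); any object offering the same two methods works, which also makes the function easy to exercise without a browser.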
In the second round, the browser-shots comparison tool is implemented as a MapReduce job in order to parallelize the processing of the input. The input in this case is a list of URLs together with a list of browser versions that are used to render the screenshots. Note the difference from the former version, where the input consisted of pairs of URLs that were rendered with one common browser version and then compared.
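The new input format can be illustrated by expanding the two lists into individual capture tasks. This is a minimal sketch; the function name and the one-URL-per-line format are assumptions made for the example.

```python
from itertools import product
from typing import Iterable, List, Tuple


def capture_tasks(url_lines: Iterable[str], browsers: List[str]) -> List[Tuple[str, str]]:
    """Expand a URL list and a browser list into (url, browser) capture tasks."""
    # Assumed input format: one URL per line, blank lines ignored.
    urls = [line.strip() for line in url_lines if line.strip()]
    # Every URL has to be rendered once in every browser version.
    return list(product(urls, browsers))
```

The number of screenshots therefore grows with the product of the two list lengths, whereas the former version needed exactly two screenshots per input pair.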
Optimizations
In order to achieve acceptable running times, a newer version of the MarcAlizer comparison tool was integrated into the browser-shots tool. The major improvement is the possibility of feeding the tool with in-memory objects instead of pointers to files on disk. This improvement, together with the elimination of unnecessary IO operations, leads to the following average times for the individual steps of the shot comparison:
1) browser shot acquisition: 2 s
2) MarcAlizer comparison: 2 s
Note that the time needed to render a screenshot in a browser depends mainly on the size of the rendered page. For instance, capturing a wsj.com page takes about 15 s on the IM machine, and the resulting PNG image is several MB in size.
MapReduce
As shown above, the operations on the screenshots are very expensive (remember that the list of tested browsers can be very long, and each browser requires one screenshot operation). We therefore need to parallelize the tool across several machines working on the input list of URLs. To facilitate this, we have employed Hadoop MapReduce, which is part of the SCAPE platform.
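The idea of splitting the URL list across workers can be sketched locally in Python. This is not Hadoop code: a thread pool stands in for the cluster, and split_input, run_split and process_url are hypothetical names. In the real job, Hadoop takes care of creating the input splits and scheduling the map tasks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Tuple


def split_input(urls: List[str], n_splits: int) -> List[List[str]]:
    """Partition the URL list round-robin into n_splits input splits."""
    return [urls[i::n_splits] for i in range(n_splits)]


def run_split(split: List[str], process_url: Callable[[str], Any]) -> List[Tuple[str, Any]]:
    """Map phase for one split: emit one (url, result) record per URL."""
    return [(url, process_url(url)) for url in split]


def run_job(urls: List[str], process_url: Callable[[str], Any], n_workers: int = 2) -> Dict[str, Any]:
    """Process all splits in parallel and merge the emitted records."""
    splits = split_input(urls, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda s: run_split(s, process_url), splits)
        return dict(record for part in parts for record in part)
```

Because each URL is processed independently of all others, the map phase needs no coordination between workers, which is what makes the task a good fit for MapReduce.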
The results of the comparisons are then materialized as a set of XML files, where each file represents the comparison of one pair of browser shots. To alleviate the problem of having a large number of small files, these files are automatically bundled together into one ZIP file. A C3P0 adapter has been implemented by TU Wien so that the results can be processed and passed on to Scout.
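Bundling many small XML results into a single ZIP archive can be done with the Python standard library alone. In the sketch below, the XML element names and the file naming scheme are invented for illustration; the actual schema produced by the tool and expected by the C3P0 adapter is not specified here.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Iterable, Tuple

Result = Tuple[str, str, str, float]  # (url, browser_a, browser_b, score)


def comparison_to_xml(url: str, browser_a: str, browser_b: str, score: float) -> bytes:
    """Serialize one comparison result (hypothetical element names)."""
    root = ET.Element("comparison", url=url)
    ET.SubElement(root, "browser").text = browser_a
    ET.SubElement(root, "browser").text = browser_b
    ET.SubElement(root, "score").text = f"{score:.3f}"
    return ET.tostring(root, encoding="utf-8")


def bundle_results(results: Iterable[Result]) -> bytes:
    """Write one XML entry per comparison into a single in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, result in enumerate(results):
            archive.writestr(f"comparison-{index:06d}.xml", comparison_to_xml(*result))
    return buffer.getvalue()
```

Writing the entries directly into the archive means the individual XML documents never exist as separate files, which avoids the small-files problem on a distributed file system.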
Tests
At the moment, we have run preliminary tests on the currently supported browser versions, Firefox and Opera. The list of URLs to test is about 13 000 entries long. We are using the IM central instance for these tests, which currently has two worker nodes (so parallel execution can cut the processing time roughly in half).
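A rough estimate of the expected running time can be derived from the average step times reported earlier (about 2 s per browser shot and 2 s per comparison). The sketch below is an estimate, not a measured result. It assumes that every pair of browsers is compared for each URL, that exactly two browser versions are tested (one Firefox and one Opera), and that the work divides evenly between the workers.

```python
from math import comb

CAPTURE_SECONDS = 2.0  # average browser shot time reported above
COMPARE_SECONDS = 2.0  # average MarcAlizer comparison time reported above


def estimate_hours(n_urls: int, n_browsers: int, n_workers: int = 1) -> float:
    """Estimated wall-clock hours, assuming all browser pairs are compared."""
    per_url = (
        n_browsers * CAPTURE_SECONDS             # one screenshot per browser
        + comb(n_browsers, 2) * COMPARE_SECONDS  # one comparison per browser pair
    )
    return n_urls * per_url / n_workers / 3600


if __name__ == "__main__":
    for workers in (1, 2):
        print(f"{workers} worker(s): {estimate_hours(13_000, 2, workers):.1f} h")
```

Under these assumptions, 13 000 URLs would take about 21.7 hours on one worker and about 10.8 hours on two. Large pages such as the wsj.com example would push the real figure higher.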