At the National Archives of the Netherlands, we’re writing an information sheet about WARC validation. WARCs are Web ARChive files for archiving harvested web content. Harvesting is the process of collecting and storing web content, such as web sites. We’re studying WARC validation because we expect to receive many (government) web sites. The information sheet accompanies the ‘richtlijn archivering overheidswebsites’ (guidelines for archiving government web sites). The sheet provides information about validating WARC files and describes our experiences while testing WARC validation tools.
We selected four WARC tools for our study: JHOVE, JWAT, Warcat and Warcio. More tools for processing WARCs are available, but in our opinion these four are the most commonly used, mature and actively maintained tools that can check or validate WARC files. Contact with colleagues who archive web sites confirmed this.
We tested the tools on a corpus of WARC files to learn what the tools actually do, and how the tools compare to each other. One thing struck us as odd: JWAT has been integrated in JHOVE, but the output of the tools is different. While testing the same WARC, JWAT reported 100 “WARC-Target-URI” errors and JHOVE 84 “Incorrect payload digest” errors.
As members of the Open Preservation Foundation, we took the opportunity to book a tech clinic with Carl Wilson. After presenting our findings to Carl and discussing them with him, we concluded that the differences were caused by a version mismatch between our version of JWAT (JWAT-warc v1.11 as part of JWAT-Tools 0.6.6) and the JWAT version integrated in JHOVE (JHOVE 1.22 includes JWAT-warc 1.0.3). Our tech clinic also resulted in one or two additional GitHub issues for JHOVE. The tech clinic furthered our understanding of the WARC validation ecosystem.
In our research into WARC validation, we noticed that some tools are validation tools that check conformance to WARC standard ISO 28500 and others ‘only’ check block and/or payload digests. Most tools support version 1.0 of the WARC standard (of 2009). Few support version 1.1 (of 2017).
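To make the idea of a digest check concrete, here is a minimal sketch in Python using only the standard library. It assumes the digest is written as `sha1:` followed by a Base32-encoded SHA-1 hash, a form commonly seen in WARC headers; the standard itself does not mandate one algorithm or encoding. The function names `warc_sha1_digest` and `digest_matches` are made up for this illustration and do not come from any of the tools mentioned. Note that for an HTTP response record the payload digest usually covers the body without the HTTP headers, while the block digest covers the whole record block.

```python
import base64
import hashlib


def warc_sha1_digest(data: bytes) -> str:
    """Return a digest label in the 'sha1:<Base32>' form (assumed convention)."""
    raw = hashlib.sha1(data).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def digest_matches(header_value: str, data: bytes) -> bool:
    """Compare a digest header value with the digest recomputed from data.

    Only the 'sha1' label is handled in this sketch; any other label
    raises ValueError rather than silently reporting a mismatch.
    """
    algorithm, _, expected = header_value.strip().partition(":")
    if algorithm.lower() != "sha1":
        raise ValueError(f"unsupported digest algorithm: {algorithm!r}")
    actual = warc_sha1_digest(data).partition(":")[2]
    # Base32 is case-insensitive, so normalise before comparing.
    return expected.upper() == actual


if __name__ == "__main__":
    payload = b"<html><body>Hello, archive!</body></html>"
    recorded = warc_sha1_digest(payload)  # what a crawler would have stored
    print(digest_matches(recorded, payload))            # intact payload
    print(digest_matches(recorded, payload + b"oops"))  # altered payload
```

A tool that ‘only’ checks digests does roughly this for every record: it recomputes the hash from the stored bytes and compares it with the value in the record header. It does not tell you whether the record otherwise conforms to the standard.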
While our information sheet is not yet finished, a preliminary conclusion is that there is no one WARC validation tool ‘to rule them all’. All tools have strengths and weaknesses, so using a combination of tools will probably be the best strategy for now. An additional reason for this is that we think the average WARC validation tool maturity level is not quite high enough. Some tools have one maintainer and/or have not been updated in the last few years.
When we contacted colleagues who archive web sites, we also noticed that most of them don’t use WARC validation tools at all. Are we way ahead of our time? Or did we miss something, perhaps your work? Please let us know.
Mentioned in this blog:
- Richtlijn archivering overheidswebsites (in Dutch): https://www.nationaalarchief.nl/archiveren/kennisbank/Richtlijn-Archiveren-Overheidswebsites
- WARC standard: https://www.iso.org/standard/68004.html and https://iipc.github.io/warc-specifications/
- JHOVE: https://openpreservation.org/products/jhove and https://github.com/openpreserve/jhove
- JWAT: https://sbforge.org/display/JWAT/JWAT and https://github.com/netarchivesuite/jwat
- Warcat: https://pypi.org/project/Warcat/ and https://github.com/chfoo/warcat
- Warcio: https://pypi.org/project/warcio/ and https://github.com/webrecorder/warcio
More WARC tools: