Blogs: Web Archiving

Blog posts filtered by the Web Archiving subject tag.

Browse blogs by subject

We recently posted an article on the UK Web Archive blog that may be of interest here, User-Driven Digital Preservation, where we summarise our work with the SCAPE Project on a little prototype application that explores how we might integrate user feedback and preservation actions into our usual discovery and access processes. The idea is […]

By Andy Jackson, posted in Andy Jackson's Blog

28th Aug 2014  8:53 PM  11897 Reads  No comments

I would like to draw your attention to the new QA tool for finger detection on scans: https://github.com/openplanets/finger-detection-tool. This tool was developed by AIT in scope of the SCAPE project.   Checking to identify fingers on scan manually is a very time-consuming and error-prone process. You need a tool to help you: Fingerdet. Fingerdet is […]

By Roman Graf, posted in Roman Graf's Blog

10th Jul 2014  11:49 AM  9800 Reads  No comments

This blog post continues a series of posts about the weeb archiving topic „ARC to WARC migration“, namely it is a follow-up on the posts „ARC to WARC migration: How to deal with de-duplicated records?“, and „Some reflections on scalable ARC to WARC migration“. Especially the last one of these posts ,which described how SCAPE […]

By shsdev, posted in shsdev's Blog

10th Jul 2014  10:44 AM  10760 Reads  No comments

Well over a year ago I wrote the ”A Year of FITS”(http://www.openpreservation.org/blogs/2013-01-09-year-fits) blog post describing how we, during the course of 15 months, characterised 400 million of harvested web documents using the File Information Tool Kit (FITS) from Harvard University. I presented the technique and the technical metadata and basically concluded that FITS didn’t fit […]

By Per Møldrup-Dalum, posted in Per Møldrup-Dalum's Blog

28th May 2014  9:30 PM  13564 Reads  1 Comment

In my last blog post about ARC to WARC migration I did a performance comparison of two alternative approaches for migrating very large sets of ARC container files to the WARC format using Apache Hadoop, and I said that resolving contextual dependencies in order to create self-contained WARC files was the next point to investigate […]

By shsdev, posted in shsdev's Blog

24th Mar 2014  4:13 PM  13637 Reads  No comments

This post covers two main topics that are related; characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite. Introducing Nanite Nanite is a Java project lead by Andy Jackson from the UK Web Archive, formed of two main subprojects: Nanite-Core: an API for Droid    Nanite-Hadoop: a MapReduce […]

By willp-bl, posted in willp-bl's Blog

21st Mar 2014  1:58 PM  13364 Reads  No comments

The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving where long-term preservation is of direct relevance for different task areas, like harvesting, storage, and access. Web archives usually consist of large data collections of multi-terabyte […]

By shsdev, posted in shsdev's Blog

7th Mar 2014  1:56 PM  12092 Reads  No comments

The Web is constantly evolving over time. Web content like texts, images, etc. are updated frequently. One of the major problems encountered by archiving systems is to understand what happened between two different versions of the web page.   We want to underline that the aim is not to compare two web pages like this […]

By Zeynep, posted in Zeynep's Blog

7th Feb 2014  1:15 PM  11011 Reads  No comments

In December last year I attended a Hadoop Hackathon in Vienna. A hackathon that has been written about before by other participants: Sven Schlarb's Impressions of the ‘Hadoop-driven digital preservation Hackathon’ in Vienna and Clemens and René's The Elephant Returns to the Library…with a Pig!. Like these other participants I really came home from this […]

By Per Møldrup-Dalum, posted in Per Møldrup-Dalum's Blog

23rd Jan 2014  9:01 AM  10932 Reads  No comments

From the very beginning of the SCAPE project on, it was a requirement that the SCAPE Execution Platform be able to leverage functionality of existing command line applications. The solution for this is ToMaR, a Hadoop-based application, which, amongst other things, allows for the execution of command line applications in a distributed way using a […]

By shsdev, posted in shsdev's Blog

16th Dec 2013  3:13 PM  14892 Reads  No comments