Blogs: Web Archiving

Blog posts filtered by the Web Archiving subject tag.

Related to my work exploring hyperlinks in documentary heritage – something I feel we’ll be taking care of for a long time – I created a hyperlink extract tool called tikalinkextract. Put simply – the tool will take your collection of files, extract the intellectual content using Apache Tika, and then analyse that content for […]

By ross-spencer, posted in ross-spencer's Blog

21st Oct 2017  1:12 PM  646 Reads  No comments

Much of the inspiration for this blog came from this source here. According to UNESCO, the authenticity of a record can be jeopardized by: Threats to integrity. Changes to the content of the object itself also potentially damage authenticity. Most such changes stem from threats to the object at a data level. A hyperlink is data. […]

By ross-spencer, posted in ross-spencer's Blog

19th May 2017  3:41 AM  1523 Reads  No comments

We recently posted an article on the UK Web Archive blog that may be of interest here, User-Driven Digital Preservation, where we summarise our work with the SCAPE Project on a little prototype application that explores how we might integrate user feedback and preservation actions into our usual discovery and access processes. The idea is […]

By Andy Jackson, posted in Andy Jackson's Blog

28th Aug 2014  8:53 PM  12493 Reads  No comments

I would like to draw your attention to the new QA tool for finger detection on scans: https://github.com/openplanets/finger-detection-tool. This tool was developed by AIT within the scope of the SCAPE project. Manually checking scans for fingers is a very time-consuming and error-prone process. You need a tool to help you: Fingerdet. Fingerdet is […]

By Roman Graf, posted in Roman Graf's Blog

10th Jul 2014  11:49 AM  10148 Reads  No comments

This blog post continues a series of posts on the web archiving topic “ARC to WARC migration”; specifically, it is a follow-up to the posts “ARC to WARC migration: How to deal with de-duplicated records?” and “Some reflections on scalable ARC to WARC migration”. Especially the last one of these posts, which described how SCAPE […]

By shsdev, posted in shsdev's Blog

10th Jul 2014  10:44 AM  11298 Reads  No comments

Well over a year ago I wrote the “A Year of FITS” (http://www.openpreservation.org/blogs/2013-01-09-year-fits) blog post describing how we, over the course of 15 months, characterised 400 million harvested web documents using the File Information Tool Kit (FITS) from Harvard University. I presented the technique and the technical metadata and basically concluded that FITS didn’t fit […]

By Per Møldrup-Dalum, posted in Per Møldrup-Dalum's Blog

28th May 2014  9:30 PM  14149 Reads  1 Comment

In my last blog post about ARC to WARC migration I did a performance comparison of two alternative approaches for migrating very large sets of ARC container files to the WARC format using Apache Hadoop, and I said that resolving contextual dependencies in order to create self-contained WARC files was the next point to investigate […]

By shsdev, posted in shsdev's Blog

24th Mar 2014  4:13 PM  14415 Reads  No comments

This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite. Introducing Nanite. Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects: Nanite-Core: an API for Droid; Nanite-Hadoop: a MapReduce […]

By willp-bl, posted in willp-bl's Blog

21st Mar 2014  1:58 PM  14143 Reads  No comments

The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving where long-term preservation is of direct relevance for different task areas, like harvesting, storage, and access. Web archives usually consist of large data collections of multi-terabyte […]

By shsdev, posted in shsdev's Blog

7th Mar 2014  1:56 PM  12627 Reads  No comments

The Web is constantly evolving. Web content such as text and images is updated frequently. One of the major problems encountered by archiving systems is understanding what happened between two different versions of a web page. We want to underline that the aim is not to compare two web pages like this […]

By Zeynep, posted in Zeynep's Blog

7th Feb 2014  1:15 PM  11473 Reads  No comments