Blogs: Web Archiving

Blog posts filtered by the Web Archiving subject tag.


In a previous blog post I showed how we resurrected NL-menu, the first Dutch web index. It explains how we recovered the site’s data from an old CD-ROM, and how we subsequently created a local copy of the site by serving the CD-ROM’s contents on the Apache web server. This follow-up post covers the final […]

By johan, posted in johan's Blog

11th Jul 2018  3:47 PM  362 Reads  No comments

NL-menu was the first Dutch web index. The site was originally founded by a consortium of SURFnet, Dutch universities and the KB. From the mid-nineties onwards it was maintained solely by the KB. NL-menu was discontinued in 2004, after which the site was taken offline. In 2006 the domain name was sold to a private […]

By johan, posted in johan's Blog

24th Apr 2018  5:01 PM  845 Reads  No comments

Related to my work exploring hyperlinks in documentary heritage – something I feel we’ll be taking care of for a long time – I created a hyperlink extract tool called tikalinkextract. Put simply – the tool will take your collection of files, extract the intellectual content using Apache Tika, and then analyse that content for […]

By ross-spencer, posted in ross-spencer's Blog

21st Oct 2017  1:12 PM  1587 Reads  No comments

Much of the inspiration for this blog came from this source here. According to UNESCO, the authenticity of a record can be jeopardized by: Threats to integrity. Changes to the content of the object itself also potentially damage authenticity. Most such changes stem from threats to the object at a data level. A hyperlink is data. […]

By ross-spencer, posted in ross-spencer's Blog

19th May 2017  3:41 AM  2125 Reads  No comments

We recently posted an article on the UK Web Archive blog that may be of interest here, User-Driven Digital Preservation, where we summarise our work with the SCAPE Project on a little prototype application that explores how we might integrate user feedback and preservation actions into our usual discovery and access processes. The idea is […]

By Andy Jackson, posted in Andy Jackson's Blog

28th Aug 2014  8:53 PM  13105 Reads  No comments

I would like to draw your attention to the new QA tool for finger detection on scans: https://github.com/openplanets/finger-detection-tool. This tool was developed by AIT within the scope of the SCAPE project. Manually checking scans for fingers is a very time-consuming and error-prone process. You need a tool to help you: Fingerdet. Fingerdet is […]

By Roman Graf, posted in Roman Graf's Blog

10th Jul 2014  11:49 AM  10555 Reads  No comments

This blog post continues a series of posts on the web archiving topic “ARC to WARC migration”; namely, it is a follow-up to the posts “ARC to WARC migration: How to deal with de-duplicated records?” and “Some reflections on scalable ARC to WARC migration”. Especially the last one of these posts, which described how SCAPE […]

By shsdev, posted in shsdev's Blog

10th Jul 2014  10:44 AM  11901 Reads  No comments

Well over a year ago I wrote the “A Year of FITS” (http://www.openpreservation.org/blogs/2013-01-09-year-fits) blog post describing how we, over the course of 15 months, characterised 400 million harvested web documents using the File Information Tool Set (FITS) from Harvard University. I presented the technique and the technical metadata and basically concluded that FITS didn’t fit […]

By Per Møldrup-Dalum, posted in Per Møldrup-Dalum's Blog

28th May 2014  9:30 PM  14798 Reads  1 Comment

In my last blog post about ARC to WARC migration I did a performance comparison of two alternative approaches for migrating very large sets of ARC container files to the WARC format using Apache Hadoop, and I said that resolving contextual dependencies in order to create self-contained WARC files was the next point to investigate […]

By shsdev, posted in shsdev's Blog

24th Mar 2014  4:13 PM  15022 Reads  No comments

This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite. Introducing Nanite: Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects: Nanite-Core: an API for Droid; Nanite-Hadoop: a MapReduce […]

By willp-bl, posted in willp-bl's Blog

21st Mar 2014  1:58 PM  14843 Reads  No comments