shsdev's Blog

This blog post continues a series of posts on the web archiving topic "ARC to WARC migration"; it is a follow-up on the posts "ARC to WARC migration: How to deal with de-duplicated records?" and "Some reflections on scalable ARC to WARC migration", especially the last of these, which described how SCAPE […]


10th Jul 2014  10:44 AM  11089 Reads  No comments

Authors: Martin Schaller, Sven Schlarb, and Kristin Dill

In the SCAPE Project, the memory institutions are working on practical application scenarios for the tools and solutions developed within the project. One of these application scenarios is the migration of a large image collection from one format to another. There are many reasons why such a […]


24th Jun 2014  9:12 AM  11211 Reads  No comments

In my last blog post about ARC to WARC migration, I compared the performance of two alternative approaches for migrating very large sets of ARC container files to the WARC format using Apache Hadoop, and I noted that resolving contextual dependencies in order to create self-contained WARC files was the next point to investigate […]
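
To make the migration pattern concrete, here is a minimal per-file sketch. ArcSource and WarcSink are hypothetical placeholders standing in for a real reader/writer library such as JWAT; the actual Hadoop-based implementation discussed in the posts is more involved.

    // Conceptual per-file migration loop: read ARC records, write WARC records.
    // ArcSource, WarcSink and their methods are hypothetical placeholders for a
    // real reader/writer library such as JWAT, not the SCAPE implementation.
    public class ArcToWarcSketch {
        interface ArcSource extends AutoCloseable {
            byte[] nextRecord() throws Exception; // null when the file is exhausted
        }
        interface WarcSink extends AutoCloseable {
            void writeRecord(byte[] record) throws Exception;
        }

        public static void migrate(ArcSource in, WarcSink out) throws Exception {
            byte[] record;
            while ((record = in.nextRecord()) != null) {
                // A real migration also maps ARC headers to WARC headers here.
                out.writeRecord(record);
            }
        }
    }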


24th Mar 2014  4:13 PM  14120 Reads  No comments

The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving, where long-term preservation is of direct relevance to different task areas, such as harvesting, storage, and access. Web archives usually consist of large data collections of multi-terabyte […]
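
A common pattern for this kind of workload is to give Hadoop a text file listing one ARC container file path per line, so that each map task processes one file. A minimal sketch, assuming a hypothetical migrateArcFile helper for the actual per-file work:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Each input line is the path of one ARC container file; the map task
    // processes that file and emits a status message per path.
    public class ArcPathMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String arcPath = line.toString().trim();
            String status = migrateArcFile(arcPath);
            context.write(new Text(arcPath), new Text(status));
        }

        // Hypothetical placeholder for the real per-file migration logic.
        private String migrateArcFile(String path) {
            return "migrated";
        }
    }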


7th Mar 2014  1:56 PM  12394 Reads  No comments

From the very beginning of the SCAPE project, it was a requirement that the SCAPE Execution Platform be able to leverage the functionality of existing command-line applications. The solution for this is ToMaR, a Hadoop-based application which, amongst other things, allows command-line applications to be executed in a distributed way using a […]
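
At its core, the idea is that each Hadoop task shells out to an existing tool. A minimal sketch of that idea using java.lang.ProcessBuilder; the tool name and -i/-o arguments are illustrative and do not reflect ToMaR's actual toolspec format:

    import java.io.IOException;

    // Conceptual sketch of a ToMaR-style wrapper: run an existing command-line
    // tool for one input file and report its exit code. The flags shown are
    // illustrative; ToMaR itself describes tool invocations in toolspec files.
    public class CmdLineTask {
        public static int run(String tool, String input, String output)
                throws IOException, InterruptedException {
            Process p = new ProcessBuilder(tool, "-i", input, "-o", output)
                    .inheritIO()   // forward the tool's stdout/stderr
                    .start();
            return p.waitFor();    // 0 conventionally means success
        }
    }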


16th Dec 2013  3:13 PM  15217 Reads  No comments

More than 20 developers attended the 'Hadoop-driven digital preservation Hackathon' in Vienna, which took place in the baroque "Oratorium" room of the Austrian National Library from 2nd to 4th December 2013. It was really exciting to hear people talking vividly about Hadoop, Pig, Hive, and HBase, followed by silent phases of concentrated coding accompanied […]


6th Dec 2013  4:30 PM  10930 Reads  No comments

The DROID software tool is developed by The National Archives (UK) to perform automated batch identification of file formats by assigning PRONOM Unique Identifiers (PUIDs) and MIME types to files. The tool uses so-called signature files, which carry format signature information from the PRONOM technical registry. Here I present some considerations for […]
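
For batch runs, DROID can be driven from its command-line jar. A sketch below; the -Nr (resource) and -Ns (signature file) no-profile flags are quoted from memory, so verify them against droid -h for your DROID version:

    import java.io.IOException;

    // Sketch: batch file format identification by shelling out to the DROID
    // command-line jar. The -Nr/-Ns no-profile flags are an assumption from
    // memory; check `droid -h` for the exact options of your DROID version.
    public class DroidBatchIdentify {
        public static int identify(String droidJar, String dir, String sigFile)
                throws IOException, InterruptedException {
            return new ProcessBuilder(
                        "java", "-jar", droidJar, "-Nr", dir, "-Ns", sigFile)
                    .inheritIO()
                    .start()
                    .waitFor();
        }
    }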


24th May 2013  11:44 AM  15040 Reads  3 Comments

This blog post is an answer to willp-bl's post "Mixing Hadoop and Taverna" and builds on some of the ideas I presented in my blog post "Big data processing: chaining Hadoop jobs using Taverna". First of all, it is very interesting to see willp-bl's variants of implementing a large-scale file format migration […]
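
For reference, chaining Hadoop jobs in plain Java (without Taverna) boils down to running the jobs sequentially and feeding the first job's output path to the second. A minimal sketch with mapper/reducer classes omitted and illustrative paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Two MapReduce jobs run in sequence: the second reads the first one's
    // output. Mapper/reducer classes are omitted for brevity.
    public class ChainedJobs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path in  = new Path(args[0]);
            Path mid = new Path(args[1]);   // intermediate output
            Path out = new Path(args[2]);

            Job first = Job.getInstance(conf, "step-1");
            FileInputFormat.addInputPath(first, in);
            FileOutputFormat.setOutputPath(first, mid);
            if (!first.waitForCompletion(true)) System.exit(1);

            Job second = Job.getInstance(conf, "step-2");
            FileInputFormat.addInputPath(second, mid);
            FileOutputFormat.setOutputPath(second, out);
            System.exit(second.waitForCompletion(true) ? 0 : 1);
        }
    }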


4th Mar 2013  12:10 PM  11349 Reads  No comments

Processing very large data sets is a core challenge of the SCAPE project. Using the SCAPE platform and a variety of services and tools, the SCAPE Testbeds are developing solutions for real-world institutional scenarios dealing with big data. The SCAPE platform is based on Apache Hadoop, an implementation of MapReduce, a programming model […]
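
As a reminder of the programming model, the canonical MapReduce example: the map function emits (word, 1) pairs and the reduce function sums the counts per word. A minimal sketch using the org.apache.hadoop.mapreduce API:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Word count: map emits (word, 1), reduce sums the counts per word.
    public class WordCount {
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) ctx.write(new Text(token), ONE);
                }
            }
        }
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }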


7th Aug 2012  10:07 AM  15430 Reads  No comments

Many institutions have carried out large-scale digitisation projects during the last decade, and the question of how to store the digital master images in a cost-effective way has made the JPEG2000 image format more popular in the library, museum, and archives community. The lossy JP2 encoding of page image masters in particular turned out to provide […]
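
To make this concrete, a sketch that shells out to OpenJPEG's opj_compress to create a lossy JP2 from a TIFF master; the -r (compression ratio) value of 8 is illustrative only and would be tuned per collection:

    import java.io.IOException;

    // Sketch: encode a TIFF page image master as lossy JP2 via OpenJPEG's
    // opj_compress. The -r option sets compression ratios; 8 is illustrative.
    public class Jp2Encoder {
        public static int encode(String tiffIn, String jp2Out)
                throws IOException, InterruptedException {
            return new ProcessBuilder(
                        "opj_compress", "-i", tiffIn, "-o", jp2Out, "-r", "8")
                    .inheritIO()
                    .start()
                    .waitFor();
        }
    }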


13th Feb 2012  11:29 AM  17538 Reads  No comments