willp-bl's Blog

This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite. Introducing Nanite: Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects: Nanite-Core: an API for Droid; Nanite-Hadoop: a MapReduce […]

By willp-bl, posted in willp-bl's Blog

21st Mar 2014  1:58 PM  14219 Reads  No comments

As Carl blogged previously, we now have virtually all SCAPE and OPF projects in Continuous Integration, building and unit testing in both Travis CI and Jenkins. Travis compiles the projects and executes unit tests whenever a new commit is pushed to GitHub or a pull request is submitted to the project. Jenkins […]

By willp-bl, posted in willp-bl's Blog

1st Nov 2013  10:19 AM  10859 Reads  No comments

Introduction: For our evaluations within SCAPE, it would be useful to be able to measure quantitatively the capabilities of the Hadoop clusters available to us, so that results from each cluster can be compared. Fortunately, the standard Hadoop distribution includes some examples that can be run as tests. Intel […]

By willp-bl, posted in willp-bl's Blog

30th Sep 2013  2:36 PM  12217 Reads  No comments

An important part of image file format migration is quality assurance. Various tools can be used, such as ImageMagick or Matchbox, but they either provide only one metric or are intended for different use cases. I wanted to investigate how image comparison algorithms are implemented, so I created a prototype tool/library for image quality analysis, called Dissimilar. […]

By willp-bl, posted in willp-bl's Blog

17th Jul 2013  12:50 PM  24222 Reads  4 Comments

We have been evaluating the latest Fedora Commons, version 3.6.2, as a test repository. Having followed the straightforward installation process, we were left with a repository with one preconfigured user: fedoraAdmin. There are two APIs: API-A for access and API-M for management. For our test instance, API-A was configured on […]

By willp-bl, posted in willp-bl's Blog

20th May 2013  12:54 PM  11726 Reads  No comments

Part of my work on the SCAPE testbeds involves producing a workflow for the large-scale migration of TIFF files to JP2, with validation. The tests I have run all involve lossy compression of the files. Two tools that could be used to validate the image payload, and therefore the success of a migration, are […]

By willp-bl, posted in willp-bl's Blog

5th Mar 2013  10:04 AM  11907 Reads  No comments

As part of our work on testbeds for the SCAPE project, we have been investigating the various ways in which a large-scale file format migration workflow could be implemented. The underlying technologies chosen for the platform are Hadoop and Taverna. One of the aims of the SCAPE project is to allow the automatic generation […]

By willp-bl, posted in willp-bl's Blog

14th Feb 2013  1:48 PM  14965 Reads  No comments

Several of us at the British Library took part in the CURATEcamp file ID hackathon on Friday. We decided that one issue on which we could make a useful impact was the identification of various ebook formats. eBooks are an important content type for the British Library, especially with the expected implementation of non-print legal deposit legislation […]

By willp-bl, posted in willp-bl's Blog

19th Nov 2012  3:53 PM  14479 Reads  1 Comment