Blogs: Characterisation

Blog posts filtered by the Characterisation subject tag.


In my previous blog post I addressed the detection of broken audio files in an automated workflow for ripping audio CDs. For (data) CD-ROMs and DVDs that are imaged to an ISO image, a similar problem exists: how can we be reasonably sure that the created image is complete? In this blog post I will […]

By johan, posted in johan's Blog

13th Jan 2017  3:30 PM  3590 Reads  5 Comments

At the KB we have a large collection of offline optical media. Most of these are CD-ROMs, but we also have a sizeable proportion of audio CDs. We’re currently in the process of designing a workflow for stabilising the contents of these materials using disk imaging. For audio CDs this involves ‘ripping’ the tracks to […]

By johan, posted in johan's Blog

4th Jan 2017  2:38 PM  977 Reads  3 Comments

On 11th October we held our first JHOVE online hack day. Our aim was to catalogue error messages produced by JHOVE to get a better understanding of their meaning and potential preservation impact. Background: organising an online hack day. We have been considering running online hackathons because attending face-to-face events has become more difficult as […]

By Becky, posted in Becky's Blog

19th Oct 2016  10:06 AM  983 Reads  No comments

For anyone dealing with a relatively small number of records, compared to say an internet or data archive, a reasonable process for ingest of material into your digital preservation system might be: 1. Process files with a file format identification tool 2. Per 1. process files with a file format validation tool 3. Per 1. […]

By ross-spencer, posted in ross-spencer's Blog

13th Mar 2016  5:27 AM  1701 Reads  No comments

Hi, this is my first blog post, in which I want to introduce the project I am currently working on: Flint. History: Flint (File/Format Lint) has developed out of DRMLint, a lightweight piece of Java software that makes use of different third-party tools (Preflight, iText, Calibre, JHOVE) to detect DRM in PDF files and EPUBs. […]

By alecs, posted in alecs's Blog

2nd Jul 2014  12:53 PM  10915 Reads  No comments

I have been working on some code to ensure the accurate and consistent output of any file format analysis based on the DROID CSV export, example here. One way of looking at it is an executive summary of a DROID analysis, except I don't think executives, as such, will be its primary user-base. The reason for pushing […]

By ross-spencer, posted in ross-spencer's Blog

3rd Jun 2014  7:20 AM  11140 Reads  1 Comment

Well over a year ago I wrote the “A Year of FITS” (http://www.openpreservation.org/blogs/2013-01-09-year-fits) blog post describing how we, during the course of 15 months, characterised 400 million harvested web documents using the File Information Tool Kit (FITS) from Harvard University. I presented the technique and the technical metadata and basically concluded that FITS didn’t fit […]

By Per Møldrup-Dalum, posted in Per Møldrup-Dalum's Blog

28th May 2014  9:30 PM  13628 Reads  1 Comment

This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite. Introducing Nanite: Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects: Nanite-Core, an API for Droid; Nanite-Hadoop, a MapReduce […]

By willp-bl, posted in willp-bl's Blog

21st Mar 2014  1:58 PM  13489 Reads  No comments

Fifteen days was the estimate I gave for completing an analysis on roughly 450,000 files we were holding at Archives New Zealand. Approximately three seconds per file for each round of analysis: 3 x 450,000 = 1,350,000 seconds; 1,350,000 seconds = 15.625 days. My bash script included calls to three Java applications, Apache Tika, 1.3 […]

By ross-spencer, posted in ross-spencer's Blog

24th Feb 2014  2:17 AM  17344 Reads  5 Comments

A while back I wrote a blog post, MIA: Metadata. I highlighted how difficult it was to capture certain metadata without a managed system – without an Electronic Document and Records Management System (EDRMS). I also questioned if we were doing enough with EDRMS by way of collecting data. Following that blog we sought out […]

By ross-spencer, posted in ross-spencer's Blog

4th Feb 2014  5:21 AM  11754 Reads  1 Comment