Blogs: Characterisation

Blog posts filtered by the Characterisation subject tag.

Fifteen days was the estimate I gave for completing an analysis of roughly 450,000 files we were holding at Archives New Zealand, at approximately three seconds per file for each round of analysis: 3 × 450,000 = 1,350,000 seconds, and 1,350,000 seconds = 15.625 days. My bash script included calls to three Java applications, Apache Tika, 1.3 […]
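The arithmetic behind that estimate is easy to verify. A minimal sketch in Java, assuming a flat three seconds per file and ignoring overheads such as JVM start-up time (the file count and per-file cost are the figures from the post above):

```java
public class AnalysisEstimate {
    public static void main(String[] args) {
        long fileCount = 450_000;      // approximate holdings at Archives New Zealand
        double secondsPerFile = 3.0;   // rough per-file cost for one round of analysis

        double totalSeconds = fileCount * secondsPerFile;  // 1,350,000 seconds
        double totalDays = totalSeconds / (60 * 60 * 24);  // 86,400 seconds in a day

        System.out.printf("Total: %,.0f seconds = %.3f days%n", totalSeconds, totalDays);
        // Prints: Total: 1,350,000 seconds = 15.625 days
    }
}
```

At this scale the per-file cost dominates everything: shaving a single second off each file saves roughly 5.2 days per round of analysis.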

By ross-spencer, posted in ross-spencer's Blog

24th Feb 2014  2:17 AM  19113 Reads  5 Comments

A while back I wrote a blog post, MIA: Metadata. I highlighted how difficult it was to capture certain metadata without a managed system, that is, without an Electronic Document and Records Management System (EDRMS). I also questioned whether we were doing enough with EDRMS by way of collecting data. Following that blog we sought out […]

By ross-spencer, posted in ross-spencer's Blog

4th Feb 2014  5:21 AM  14208 Reads  1 Comment

One of my first blogs here covered an evaluation of a number of format identification tools. One of the more surprising results of that work was that, of the five tools tested, no fewer than four (FITS, DROID, Fido and JHOVE2) failed even to run when executed with their associated […]

By johan, posted in johan's Blog

31st Jan 2014  12:58 PM  1682212 Reads  6 Comments

This blog follows up on three earlier posts about detecting preservation risks in PDF files. In part 1 I explored to what extent the Preflight component of the Apache PDFBox library can be used to detect specific preservation risks in PDF documents. This was followed up by some work during the SPRUCE Hackathon in Leeds, […]
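For readers who have not used Preflight, the core validation call is compact. A minimal sketch against the PDFBox Preflight API ("sample.pdf" is a placeholder path); the ValidationError codes it reports are the raw material for this kind of risk detection:

```java
import java.io.File;
import org.apache.pdfbox.preflight.PreflightDocument;
import org.apache.pdfbox.preflight.ValidationResult;
import org.apache.pdfbox.preflight.ValidationResult.ValidationError;
import org.apache.pdfbox.preflight.exception.SyntaxValidationException;
import org.apache.pdfbox.preflight.parser.PreflightParser;

public class PreflightCheck {
    public static void main(String[] args) throws Exception {
        ValidationResult result;
        PreflightParser parser = new PreflightParser(new File("sample.pdf")); // placeholder
        try {
            parser.parse();
            PreflightDocument document = parser.getPreflightDocument();
            document.validate();
            result = document.getResult();
            document.close();
        } catch (SyntaxValidationException e) {
            // Badly broken files fail before validation proper; the partial
            // result still carries the error codes collected so far.
            result = e.getResult();
        }

        if (result.isValid()) {
            System.out.println("Valid PDF/A-1b");
        } else {
            // Each error code maps to a specific PDF/A violation, which is
            // what a risk analysis can key on.
            for (ValidationError error : result.getErrorsList()) {
                System.out.println(error.getErrorCode() + " : " + error.getDetails());
            }
        }
    }
}
```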

By johan, posted in johan's Blog

27th Jan 2014  3:08 PM  23882 Reads  7 Comments

From the very beginning of the SCAPE project, it was a requirement that the SCAPE Execution Platform be able to leverage the functionality of existing command-line applications. The solution is ToMaR, a Hadoop-based application which, amongst other things, allows command-line applications to be executed in a distributed way using a […]
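ToMaR itself is driven by toolspec descriptions rather than hand-written MapReduce code, but the underlying pattern is straightforward to sketch. The following illustrates that general pattern only, not ToMaR's actual API: a Hadoop mapper that shells out to a command-line tool once per input record, with `file --mime-type` standing in for whatever preservation tool the job wraps:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative pattern only, not ToMaR's interface: one mapper call per
// input line, each line holding the path of one file to characterise.
public class ToolWrapperMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String path = value.toString().trim();

        // Stand-in command; a real job would invoke e.g. Tika, DROID or JHOVE.
        Process process = new ProcessBuilder("file", "--mime-type", "-b", path)
                .redirectErrorStream(true)
                .start();
        String output = new String(process.getInputStream().readAllBytes()).trim();
        process.waitFor();

        context.write(new Text(path), new Text(output));
    }
}
```

ToMaR generalises this pattern so that, amongst other things, the wrapped tools do not need to be Hadoop-aware.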

By shsdev, posted in shsdev's Blog

16th Dec 2013  3:13 PM  17288 Reads  No comments

More than 20 developers attended the ‘Hadoop-driven digital preservation Hackathon’ in Vienna, which took place in the baroque "Oratorium" room of the Austrian National Library from 2nd to 4th December 2013. It was really exciting to hear people talking animatedly about Hadoop, Pig, Hive and HBase, followed by silent phases of concentrated coding accompanied […]

By shsdev, posted in shsdev's Blog

6th Dec 2013  4:30 PM  11910 Reads  No comments

During and around this year's iPRES, a number of discussions sprang up around the topic of proper software archiving, which was also part of the DP challenges workshop discussions. With services emerging around emulation, e.g. those developed in the bwFLA project (see, for example, the blog posts on the EaaS demo and Digital Art curation), proper measures […]

By Dirk von Suchodoletz, posted in Dirk von Suchodoletz's Blog

10th Oct 2013  1:51 PM  15599 Reads  2 Comments

Last Friday I ran a workshop at the BL trying to identify what I guess we might call the significant properties of ebooks. This is to inform requirements for the ebook characterisation tools developed as part of SCAPE, and also to help inform BL staff involved in ebook ingest projects. To this end I wasn't just interested in the theoretically interesting features that […]

By pixelatedpete, posted in pixelatedpete's Blog

4th Sep 2013  9:13 AM  14110 Reads  1 Comment

Last winter I made a first attempt at identifying preservation risks in PDF files using the Apache Preflight PDF/A validator. This work was later followed up by others in two SPRUCE hackathons, in Leeds (see this blog post by Peter Cliff) and London (described here). Much of this later work tacitly assumes that Apache Preflight […]

By johan, posted in johan's Blog

25th Jul 2013  12:57 PM  24819 Reads  12 Comments

Now that the subproject lead in PW is being transferred from me to Kresimir, it seems a good time to reflect a little on what we have achieved in PW since February 2011 and what is left to do! What did we set out to do? To accomplish effective digital preservation, environments with a preservation […]

By cbecker, posted in cbecker's Blog

23rd Jul 2013  9:20 AM  13467 Reads  No comments