Blogs: Characterisation

Blog posts filtered by the Characterisation subject tag.


Course Overview: If you have a digital preservation strategy that involves digital files, you’ll know how important it is to understand the file formats in which your data is encoded. Doing this comprehensively involves at least three main operations: identifying the format, characterising the format, and validating the format. To put it another way, […]

By Becky, posted in Becky's Blog

18th Jul 2017  12:00 AM  0 Reads  No comments

Earlier this year I blogged about Isolyzer, a tool designed to help detect broken ISO images. Today I released a shiny new beta version that adds a significant amount of new functionality. Below is an overview of the main changes, followed by some warnings and caveats. Support of more file systems: Where previous […]

By johan, posted in johan's Blog

12th Jul 2017  3:06 PM  1585 Reads  No comments

In my previous blog post I addressed the detection of broken audio files in an automated workflow for ripping audio CDs. For (data) CD-ROMs and DVDs that are imaged to an ISO image, a similar problem exists: how can we be reasonably sure that the created image is complete? In this blog post I will […]

By johan, posted in johan's Blog

13th Jan 2017  3:30 PM  7764 Reads  5 Comments

At the KB we have a large collection of offline optical media. Most of these are CD-ROMs, but we also have a sizeable proportion of audio CDs. We’re currently in the process of designing a workflow for stabilising the contents of these materials using disk imaging. For audio CDs this involves ‘ripping’ the tracks to […]

By johan, posted in johan's Blog

4th Jan 2017  2:38 PM  4871 Reads  3 Comments

On 11th October we held our first JHOVE online hack day. Our aim was to catalogue error messages produced by JHOVE to get a better understanding of their meaning and potential preservation impact. Background: organising an online hack day. We have been considering running online hackathons because attending face-to-face events has become more difficult as […]

By Becky, posted in Becky's Blog

19th Oct 2016  10:06 AM  2648 Reads  No comments

For anyone dealing with a relatively small number of records, compared to say an internet or data archive, a reasonable process for ingest of material into your digital preservation system might be: 1. Process files with a file format identification tool 2. Per 1. process files with a file format validation tool 3. Per 1. […]

By ross-spencer, posted in ross-spencer's Blog

13th Mar 2016  5:27 AM  3490 Reads  No comments

Hi, this is my first blog post, in which I want to introduce the project I am currently working on: Flint. History: Flint (File/Format Lint) has developed out of DRMLint, a lightweight piece of Java software that makes use of different third-party tools (Preflight, iText, Calibre, JHOVE) to detect DRM in PDF files and EPUBs. […]

By alecs, posted in alecs's Blog

2nd Jul 2014  12:53 PM  11945 Reads  No comments

I have been working on some code to ensure the accurate and consistent output of any file format analysis based on the DROID CSV export (example here). One way of looking at it is as an executive summary of a DROID analysis, except I don't think executives, as such, will be its primary user base. The reason for pushing […]

By ross-spencer, posted in ross-spencer's Blog

3rd Jun 2014  7:20 AM  12365 Reads  1 Comment

Well over a year ago I wrote the “A Year of FITS” (http://www.openpreservation.org/blogs/2013-01-09-year-fits) blog post describing how we, during the course of 15 months, characterised 400 million harvested web documents using the File Information Tool Set (FITS) from Harvard University. I presented the technique and the technical metadata and basically concluded that FITS didn’t fit […]

By Per Møldrup-Dalum, posted in Per Møldrup-Dalum's Blog

28th May 2014  9:30 PM  15054 Reads  1 Comment

This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite. Introducing Nanite: Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects: Nanite-Core, an API for DROID, and Nanite-Hadoop, a MapReduce […]

By willp-bl, posted in willp-bl's Blog

21st Mar 2014  1:58 PM  15220 Reads  No comments