Blogs: Corpora

Blog posts filtered by the Corpora subject tag.

Last year (2012) the KB released a report on the suitability of the EPUB format for archival preservation. A substantial number of EPUB-related developments have happened since then, and as a result some of the report's findings and conclusions have become outdated. This applies in particular to the observations on EPUB 3, and the support […]

By johan, posted in johan's Blog

23rd May 2013  2:23 PM  17625 Reads  No comments

The typical digital artefact or complex object does not function (render, execute, …) without a certain software environment. Emulation-as-a-Service (EaaS) provides original environments running in platform emulators. Depending on the (complex) object to be handled, several software components are required to reproduce an original environment. Often, these components are proprietary and require a software license. […]

By Dirk von Suchodoletz, posted in Dirk von Suchodoletz's Blog

1st Apr 2013  2:23 PM  13467 Reads  3 Comments

The most important new feature of the recently released PDF/A-3 standard is that, unlike PDF/A-2 and PDF/A-1, it allows you to embed any file you like. Whether this is a good thing or not is the subject of some heated on-line discussions. But what do we actually mean by embedded files? As it turns out, […]

By johan, posted in johan's Blog

9th Jan 2013  1:42 PM  131323 Reads  16 Comments

The PDF format contains various features that may make it difficult to access content that is stored in this format in the long term. Examples include (but are not limited to): Encryption features, which may either restrict some functionality (copying, printing) or make files inaccessible altogether. Multimedia features (embedded multimedia objects may be subject to […]

By johan, posted in johan's Blog

19th Dec 2012  3:15 PM  16558 Reads  1 Comment

As many of you may know, Cal Lee, Andi Rauber and I recently attempted to facilitate a broad discussion on emerging research challenges within the DP community at a workshop at IPRES 2012. We solicited – and received (thanks again to all contributors!) – wide-ranging contributions from Europe, North America, and New Zealand. The invitation […]

By cbecker, posted in cbecker's Blog

13th Nov 2012  8:08 AM  13377 Reads  No comments

The National Library of Australia has just completed a small project to investigate and test a number of software tools of interest to digital preservation activities. The result of this project was an internal report describing the tests and the results, and giving some recommendations about the potential for using these tools in a planned replacement of […]

By matthewh, posted in matthewh's Blog

12th Aug 2012  11:30 PM  17290 Reads  No comments

As part of the evaluation framework I'm developing for OPF and SCAPE, I've been working on gathering a corpus of files to run experiments against. Although Govdocs1 would seem like a good place to start, there are a few problems: 1) It's too big, 1 million files is just showing off. 2) It's full of […]

By davetaz, posted in davetaz's Blog

26th Jul 2012  11:31 AM  23388 Reads  9 Comments

The SCAPE Characterisation Tool Testing Suite: this information has also been published in the SCAPE Deliverable D9.1. We have created a testing framework based on the Govdocs1 digital Corpora, and are using the characterisation results from Forensic Innovations, Inc. as ground truths. The framework we used for this evaluation can be found on […]

By blekinge, posted in blekinge's Blog

23rd Feb 2012  9:09 AM  27449 Reads  4 Comments

Many office suites and other applications allow the embedding of information in them via a link to another file. The use of linked spreadsheets is common amongst data-intensive agencies, and large documents are often managed through linking multiple office documents to form a single final product. Currently we have only anecdotal evidence as to […]

By Euan Cochrane, posted in Euan Cochrane's Blog

21st Nov 2011  4:23 AM  20677 Reads  1 Comment

As I already briefly mentioned in a previous blog post, one of the objectives of the SCAPE project is to develop an architecture that will enable large scale characterisation of digital file objects. As a first step, we are evaluating existing characterisation tools. The overall aim of this work is twofold. First, we want to […]

By johan, posted in johan's Blog

21st Sep 2011  1:40 PM  19687 Reads  No comments