Blogs: Identification

Blog posts filtered by the Identification subject tag.

Browse blogs by subject

All subjects Access Analysis Android apache tika ApacheTika AQuA ARC ARC to WARC archives archiving audiovisual Benchmark benchmarking best practice best practices Bit rot bitcurator board game British Library Characterisation Community compression Corpora curation Database Database Archiving Database Preservation Delivery Digital Forensics digital preservation digitisation Disk Images DROID E-ARK Project EaaS Education Emulation epub Experimentation extensible Fido File Formats FLAC Flashback floppy disk floppy disks floppy drive Format Identification Format Registry GitHub Hackathon Hardware obsolescence help httpreserve Identification IDPD17 IMPACT Internet Standards iPRES. community survey isolyzer jhove job JP2 JPEG2000 jpylyzer LZW magnetic media Matchbox MediaConch Members Metadata metadate Migration Monitoring Normalisation OCR open Open Planets Foundation Open Preservation Foundation Open source OPF diary Optimization Packaging PDF PDF/A Planets policy PREFORMA PREMIS preservation Preservation Actions preservation planning Preservation Risks Preservation Strategies Preservia Process Projects PRONOM Provenance pywb recordkeeping records Representation Information Research data research infrastructure Resources RFC Rogues Gallery Rosetta Roy SCAPE Server Siegfried Signature Development Software Software benchmarking SPARQL specification spreadsheets SPRUCE standards technical technical registry testing TIFF Tika Tools training validation veraPDF Virtual Machines w3c WARC Watch WAV WAVE Web Archiving Web Publications wget Wikidata Workflow Workflows Zip

I have been working on some code to ensure the accurate and consistent output of any file format analysis based on the DROID CSV export, example here. One way of looking at it is an executive summary of a DROID analysis, except I don't think executives, as such, will be its primary user-base.  The reason for pushing […]

By ross-spencer, posted in ross-spencer's Blog

3rd Jun 2014  7:20 AM  12618 Reads  1 Comment

Well over a year ago I wrote the ”A Year of FITS”(http://www.openpreservation.org/blogs/2013-01-09-year-fits) blog post describing how we, during the course of 15 months, characterised 400 million of harvested web documents using the File Information Tool Kit (FITS) from Harvard University. I presented the technique and the technical metadata and basically concluded that FITS didn’t fit […]

By Per Møldrup-Dalum, posted in Per Møldrup-Dalum's Blog

28th May 2014  9:30 PM  15300 Reads  1 Comment

This post covers two main topics that are related; characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite. Introducing Nanite Nanite is a Java project lead by Andy Jackson from the UK Web Archive, formed of two main subprojects: Nanite-Core: an API for Droid    Nanite-Hadoop: a MapReduce […]

By willp-bl, posted in willp-bl's Blog

21st Mar 2014  1:58 PM  15555 Reads  No comments

The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving where long-term preservation is of direct relevance for different task areas, like harvesting, storage, and access. Web archives usually consist of large data collections of multi-terabyte […]

By shsdev, posted in shsdev's Blog

7th Mar 2014  1:56 PM  13702 Reads  No comments

Fifteen days was the estimate I gave for completing an analysis on roughly 450,000 files we were holding at Archives New Zealand. Approximately three seconds per file for each round of analysis: 3 x 450,000 = 1,350,000 seconds 1,350,000 seconds = 15.625 days My bash script included calls to three Java applications, Apache Tika, 1.3 […]

By ross-spencer, posted in ross-spencer's Blog

24th Feb 2014  2:17 AM  18753 Reads  5 Comments

A while back I wrote a blog post, MIA: Metadata. I highlighted how difficult it was to capture certain metadata without a managed system – without an Electronic Document and Records Management System (EDRMS). I also questioned if we were doing enough with EDRMS by way of collecting data. Following that blog we sought out […]

By ross-spencer, posted in ross-spencer's Blog

4th Feb 2014  5:21 AM  13424 Reads  1 Comment

One of my first blogs here covered an evaluation of a number of format identification tools. One of the more surprising results of that work was that out of the five tools that were tested, no less than four of them (FITS, DROID, Fido and JHOVE2) failed to even run when executed with their associated […]

By johan, posted in johan's Blog

31st Jan 2014  12:58 PM  1681669 Reads  6 Comments

In December last year I attended a Hadoop Hackathon in Vienna. A hackathon that has been written about before by other participants: Sven Schlarb's Impressions of the ‘Hadoop-driven digital preservation Hackathon’ in Vienna and Clemens and René's The Elephant Returns to the Library…with a Pig!. Like these other participants I really came home from this […]

By Per Møldrup-Dalum, posted in Per Møldrup-Dalum's Blog

23rd Jan 2014  9:01 AM  12212 Reads  No comments

More than 20 developers visited the ‘Hadoop-driven digital preservation Hackathon’ in Vienna which took place in the baroque room called "Oratorium" of the Austrian National Library from 2nd to 4th of December 2013. It was really exciting to hear people vividly talking about Hadoop, Pig, Hive, HBase followed by silent phases of concentrated coding accompanied […]

By shsdev, posted in shsdev's Blog

6th Dec 2013  4:30 PM  11600 Reads  No comments

Here's a little newsbulletin about FIDO, the open source file format identification tool of OPF. It seems that the use of FIDO is growing the last few months. I am getting responses by e-mail and through the Github issuetracker from all over the world, ranging from requests for help, giving suggestions for improvement and even […]

By TechMaurice, posted in TechMaurice's Blog

18th Sep 2013  10:27 AM  19191 Reads  No comments