Blogs: Web Archiving

Blog posts filtered by the Web Archiving subject tag.


This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite. Introducing Nanite: Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects: Nanite-Core, an API for DROID, and Nanite-Hadoop, a MapReduce […]

By willp-bl, posted in willp-bl's Blog

21st Mar 2014  1:58 PM  15783 Reads  No comments

The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving, where long-term preservation is directly relevant to several task areas, such as harvesting, storage, and access. Web archives usually consist of large data collections of multi-terabyte […]

By shsdev, posted in shsdev's Blog

7th Mar 2014  1:56 PM  13876 Reads  No comments

The Web is constantly evolving. Web content such as text and images is updated frequently. One of the major problems encountered by archiving systems is understanding what happened between two different versions of a web page. We want to underline that the aim is not to compare two web pages like this […]

By Zeynep, posted in Zeynep's Blog

7th Feb 2014  1:15 PM  12723 Reads  No comments

In December last year I attended a Hadoop Hackathon in Vienna, an event that other participants have already written about: Sven Schlarb's Impressions of the ‘Hadoop-driven digital preservation Hackathon’ in Vienna and Clemens and René's The Elephant Returns to the Library…with a Pig!. Like these other participants I really came home from this […]

By Per Møldrup-Dalum, posted in Per Møldrup-Dalum's Blog

23rd Jan 2014  9:01 AM  12343 Reads  No comments

From the very beginning of the SCAPE project, it was a requirement that the SCAPE Execution Platform be able to leverage the functionality of existing command line applications. The solution is ToMaR, a Hadoop-based application which, amongst other things, allows command line applications to be executed in a distributed way using a […]

By shsdev, posted in shsdev's Blog

16th Dec 2013  3:13 PM  16918 Reads  No comments

More than 20 developers attended the ‘Hadoop-driven digital preservation Hackathon’ in Vienna, which took place in the baroque "Oratorium" room of the Austrian National Library from 2nd to 4th December 2013. It was really exciting to hear people vividly talking about Hadoop, Pig, Hive, and HBase, followed by silent phases of concentrated coding accompanied […]

By shsdev, posted in shsdev's Blog

6th Dec 2013  4:30 PM  11733 Reads  No comments

The browser-shots tool is developed by Internet Memory in the context of the SCAPE project, as part of the preservation and watch (PW) sub-project. The goal of this tool is to perform automatic visual comparisons in order to detect rendering issues in archived Web pages. From the tools developed in the scope of the project […]

By stanislav.barton, posted in stanislav.barton's Blog

26th Jul 2013  2:31 PM  15356 Reads  No comments

Following the community response to our workshop last year, we want to invite you again to contribute your future preservation challenge! Digital Preservation has emerged as a key challenge for information systems in almost any domain from eCommerce and eGovernment to finance, health, and personal life. The field is increasingly recognized and has taken major […]

By cbecker, posted in cbecker's Blog

17th Jun 2013  5:24 PM  14830 Reads  2 Comments

Digital Preservation is making progress in terms of tool development, the progressive establishment of standards, and increasing activity in user communities, but there is a wide gap in approaches to systematically assess, compare, and improve how organizations go about achieving their preservation goals. Some standards exist suggesting certain functional building blocks and others prescribing criteria […]

By cbecker, posted in cbecker's Blog

14th Jun 2013  5:46 PM  12879 Reads  No comments

This blog post is an answer to willp-bl's post "Mixing Hadoop and Taverna" and builds on some of the ideas that I presented in my blog post "Big data processing: chaining Hadoop jobs using Taverna". First of all, it is very interesting to see willp-bl's variants of implementing a large-scale file format migration […]

By shsdev, posted in shsdev's Blog

4th Mar 2013  12:10 PM  12389 Reads  No comments