What happens when the Internet and digital preservation coincide

The National Library of New Zealand is looking for an experienced Digital Preservation Web Engineer to lead its web archiving programme.

This might feel like a job recruitment post, but we’re seeing it as a rare chance to talk about how we think about our ongoing digital preservation programme, and what happens when we identify a gap in our capability that can’t be ignored.

You will be responsible for the strategic and operational management of the Library’s web archiving programme. The primary purpose of the Digital Preservation Web Engineer is to define, implement and support the efficient acquisition, preservation and discovery/delivery of web based digital content subject to the Library’s legislated mandate.

We’ve been a National Library for just over 50 years. Things have changed quite a bit in those 50 years, including of course the arrival of digital materials, closely followed by the internet as a way of communicating digital materials. We recognise that there is digital, and there is “online”, and sometimes digital is online. Being able to confidently collect online digital content while maintaining a sense of content, context and structure is a capability gap that we’ve been working around for a while. We’ve been operating in the space for a while; our first whole-of-domain harvest is nearly 10 years old. Going back further, and with gratitude to the Internet Archive, we also have a copy of the Geocities archive that we know contains New Zealand relevant content.

This includes harvesting via the Library’s selective and whole-of-domain web archiving programmes and the design, deployment and implementation of mechanisms for storing, indexing and providing access (including data mining) to the web collections.

We’ve been making .nz whole-of-domain harvests since 2008, and we now have six significantly sized whole-of-domain harvests to think about. These are multi-terabyte resources, comprised of millions of digital artifacts. They are cumbersome to the point of being virtually impenetrable, and we want to make these collections visible to our readers through useful, responsive, and ultimately thoughtful mechanisms.

We also have other web collections that we need to stitch into our historical memory of the web. Our selective harvests have been running for over 10 years, starting with HTTrack based content, through various iterations of Web Curator Tool created harvests.

This range of inputs leads us towards many questions as we try and address the meaningful delivery of this content to a readership that’s still establishing its own needs.

  • Do we use OpenWayBack, or WayBack as our delivery mechanism, or is it cleaner to derive our own arc/warc interrogation methods?
  • Do we blend all our collections into one sensible “lump” and leverage existing processes like Memento or Solr to feed into a smart search interface that makes sense of this giant blob of things…
  • What does a researcher want to see from our web harvested collections? Is it just access to a webpage or two? Text extracts? Binary item extracts? Sets of things? A diff of a resource over time?
  • What does a library catalogue record for these things look like? We currently have a high level descriptor that ring fences each WOD instance as a mega record, but what does more granular description look like, especially if we’re expecting to offer long term and reliable URIs for individual websites that form part of these WOD lumps? What can be automated, and what does a “quality” measure of automated record creation feel like when we consider the need to generate millions of new records? How can we prevent these records swamping our existing catalogue methods/systems?
  • What does meaningful preservation of these items look like? Is WARC the only way forward, or should we be considering much more granular processing of sub-WARC items, effectively more than a billion individual files?
  • What does “social-media” mean in this space? Do we need special consideration for web based information that originates from contemporary social media platforms? If no, why not – where does that social context reside with a national heritage collection, if yes, why? What does that look like?
  • What does geographic territoriality mean for collecting heritage materials, especially when we overlay de facto legal instruments used by the Library to build its collections? What might changes to copyright legislation or social expectations of permanence/impermanence mean for contemporary collection building? How do we sensibly traverse the boundaries between technical comprehension and intellectual insight into the various units of information that we intend to collect in perpetuity?
  • What is a viable web object anyway? Is it sensible or desirable to harvest data APIs for structured data like JSON or XML? Or should we be confining our thinking to browser rendered human readable content?

You will be working within a team of 8 digital preservation specialists in the Preservation Research & Consultancy team and be responsible for:

  • developing a world class approach to the gathering, preservation, dissemination and interrogation of web based collections.

  • drafting a business case for indexing, search and delivery for the Library’s approximately 80 TB of web archives.

  • contributing to the development of the strategic direction of the National Library with particular emphasis on the Library’s digital (web) collection building, preservation and access roles.

  • working closely with national and international peers, related teams within the department and with external vendors.

We believe we have a unique opportunity to consider national web collections at a scale that’s manageable but still comprehensive. This is an exciting time to be working in this space, working towards a common goal that’s national, international, novel, contemporary, technical and intellectual. We recognise there are partial solutions to many of the questions we’ve illuminated, and we further acknowledge that we’re not alone in this challenge; work with international partners and peers helps us to find current norms, deficiencies and, of course, answers.

In our Library, these collections have a home, but they’re in need of carving their own place in the vastness of our collections. One mechanism we know we can use to establish a sense of place within our collections is to demonstrate value, and the very vastness and technical complexity of the web collections leaves us with a growing problem which we expect to address partially through this appointment.

The challenge of collecting and maintaining web harvested items sits neatly inside our existing digital preservation paradigm, but we’re really interested to understand how the scale at which we can collect the Internet affects existing practice and thinking. We expect the two processes, digital collecting and digital preservation, to dovetail into a well-considered unified workflow. Without having this new role as a persistent component of our digital preservation team, informing and tempering discussions, we know that we’re in danger of forming generalised positions on practice and capability that might not ring true when undertaken at the scale we know we need to operate at.

Working at the Department of Internal Affairs, you’ll have the opportunity to make a real difference in the lives of New Zealanders.

The National Library of New Zealand sits within the Information and Knowledge Services branch, the Department’s epicentre of information management. Our branch is all about collecting, storing and preserving important things that are precious to New Zealand.

We don’t believe this point can be overstated. We are slowly starting to understand the cultural and research impact of web content, and this new post is a direct response to the challenge that sits behind national level collection building and the rapid uptake of Internet based content and information.

The content we collect has an extremely important role in our National memory, and as the National Digital Heritage Archive we have an obligation to ensure that we are able to operate with the care and expertise that this content demands. The web collections sit within a vastly broader context of heritage artefacts that intertwine our collective memory, and through their existence we are helping New Zealanders understand their sense of place and history as well as informing research and creative outputs alike. The web collections are no different, and we are extremely excited to have an opportunity to bolster our treasured collections with the appointment of an expert practitioner who brings a significant advancement to our organisational and national understanding of what it means to treasure and nurture web collections.

You will have substantial practical experience in an IT environment with specific emphasis on web framework technologies and protocols, and preferably also demonstrated experience in web archiving. You will be a self-driven specialist, ideally with a tertiary qualification in information technology, computer science or information science or its equivalent.

All said and done, this terse description of the successful candidate really encapsulates one of the generalised problems facing digital preservation today. We are an emergent discipline, finding our way through new challenges, and without specifically crafted routes into the work we expect to undertake. We are only just starting to see the edges of what’s possible, and unless we repeatedly open the door to complementary professions we are going to struggle to address the contemporary challenge of collecting fast moving content, regardless of the ongoing care required when today’s harvests become tomorrow’s Preservation Masters with all the attendant questions of technical sustainability.
