In Automation we believe

History

Two months ago, I published the blog post “Carved in Stone? Let’s edit it!” in which I described our four main use cases for changes to AIPs after ingest and how we have handled them so far.

The worst one, because it was not automated in the slightest, was “Data Producer replaces file”, which occurs about 300 times a year. The manual workflow consists of five steps, which are described in that blog post.

After the blog post was published, a conversation started on Twitter. A developer at Ex Libris suggested that automation should be technically possible. We talked afterwards, looked for a solution using the web services available for our long-term archive, which is based on Rosetta, and began to rethink our workflow.

Our solution

Until now, we used an Excel list in which my colleagues responsible for the representation platform entered the data sets whose PDF has to be replaced.

To use the web service in batch mode, we have to prepare a CSV file with a specific structure.

The information needed is:

Field Name | Description
Handle | Serves as identifier
old file name |
new file name |
state | Filled in by the program: “tested” means the test was successful, “done” indicates that the file has been replaced
notes DP |
SRU Key |
Update Error |
IE PID | Rosetta identifier, filled in by the program
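
To make the structure concrete, here is a minimal Python sketch of how such a CSV could be read and how the two program-filled columns could be set. It is not our actual program: the exact header spelling, the comma delimiter and the helper names (read_rows, write_rows, mark) are assumptions made for illustration.

```python
import csv
import io

# Column names follow the field list above; the exact header spelling and
# the comma delimiter in the real CSV are assumptions.
FIELDS = ["Handle", "old file name", "new file name", "state",
          "notes DP", "SRU Key", "Update Error", "IE PID"]


def read_rows(text):
    """Parse CSV text into a list of dicts, one per item to be replaced."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [f for f in FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def write_rows(rows):
    """Serialize rows back to CSV text, keeping the column order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def mark(row, state, ie_pid=""):
    """Return a copy of the row with the program-filled columns set."""
    if state not in ("tested", "done"):
        raise ValueError(f"unknown state: {state}")
    updated = dict(row)
    updated["state"] = state
    if ie_pid:
        updated["IE PID"] = ie_pid
    return updated
```

Returning a copy in mark keeps the original row untouched, so a failed run never leaves a half-updated line behind.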

This is a very handy solution, as these pieces of information can be provided by colleagues from the representation platform and the workflow can be run by any Digital Preservation team member.

The workflow is almost self-explanatory. “Run Test” means that no PDF is replaced yet; the program only checks whether every item can be found and could be replaced, and updates the CSV accordingly. As we archive several repositories, the one that contains the items to be altered must be selected via the drop-down menu. The “update files” workflow is the interesting one.

What does this program do, exactly? It finds the item on the representation platform via OAI, using the handle, and downloads the updated file (new file name). These two parameters, the handle and the new file name, are what the program needs to retrieve the new file from the representation platform.

Afterwards, the program finds the AIP in Rosetta via SRU, using the handle as the identifier, and receives the Rosetta-internal identifier, the IE PID. Using the old file name, the program can then retrieve the file PID and the representation ID via SRU again. Now all three necessary identifiers are available:

  • File PID
  • Representation ID
  • IE PID
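
For readers who have not used SRU: a lookup like the ones above is an HTTP GET request carrying a CQL query. The following Python sketch builds such a request URL with the standard SRU 1.2 searchRetrieve parameters. The base URL and the index name dc.identifier are placeholders, not the actual endpoint or index names of our Rosetta installation.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, index, value, max_records=1):
    """Build an SRU 1.2 searchRetrieve URL for the CQL query index="value"."""
    # Escape backslashes and double quotes so the value stays one CQL term.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": f'{index}="{escaped}"',
        "maximumRecords": str(max_records),
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and index name, for illustration only.
url = sru_search_url("https://rosetta.example.org/sru", "dc.identifier", "123/456")
```

The response is an XML document from which the program would then read the identifiers; that parsing step depends on the record schema and is left out here.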

With all this information, the web service is able to replace the old file with the new file, which was already downloaded in the first step.
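
Put together, the steps for a single CSV row can be sketched roughly as follows. This is an illustration and not our actual program: the four step functions are hypothetical and passed in as parameters, so nothing here claims to be the Rosetta web service API. That the test run also downloads the new file, and that errors end up in the “Update Error” column, are assumptions as well.

```python
def process_row(row, *, fetch_new_file, find_ie_pid, find_file_ids,
                replace_file, dry_run=True):
    """Process one CSV row and return an updated copy of it.

    The four callables are hypothetical stand-ins for the OAI download,
    the two SRU lookups and the update web service call.
    """
    updated = dict(row)
    try:
        # Step 1: download the new file from the representation platform.
        content = fetch_new_file(row["Handle"], row["new file name"])
        # Step 2: resolve the handle to the IE PID.
        ie_pid = find_ie_pid(row["Handle"])
        # Step 3: resolve the old file name to file PID and representation ID.
        file_pid, rep_id = find_file_ids(ie_pid, row["old file name"])
        updated["IE PID"] = ie_pid
        if dry_run:
            # "Run Test": everything was found, but nothing is replaced.
            updated["state"] = "tested"
        else:
            # "update files": replace the old file with the downloaded one.
            replace_file(ie_pid, rep_id, file_pid, content)
            updated["state"] = "done"
        updated["Update Error"] = ""
    except Exception as exc:
        # Record the problem in the row instead of aborting the whole batch.
        updated["Update Error"] = str(exc)
    return updated
```

Catching the error per row means one missing item does not stop the monthly batch; the row simply keeps its old state and can be checked by hand.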

All that remains is to integrate the new workflow into our daily routine, running it once a month for all the items that have been updated in the meantime.

This is another big and important step towards automation. So it really is true, most of the time: “Boring stuff can be automated”, just as I was promised back in 2011 when I applied for this job.
