Changing AIPs after Ingest
This blog post is about our four best-of AIP update use cases and how we have handled them over the last five years, plus some thoughts on how to improve our workflows in the future.
“Boring stuff can be automated.” Not always true
When I started my job back in October 2011, I was told: “No worries: Boring stuff can be automated.”
So why am I sitting here, almost a decade later, manually hacking changes into the AIPs in our dark archive?
Making changes to AIPs possible is necessary, though. Otherwise, the colleagues responsible for hosting the content would never have given me permission to ingest anything into the archive.
While many change workflows are at least partially automated, some are not.
In summary, we have four use cases:
- Data Producer withdraws access rights
- Data Producer hands in an additional file
- The not-so-persistent identifier changes
- Data Producer replaces file
There is no automatic communication between the representation platforms, which host the content for our users, and our dark archive. When we went live in 2015, our colleagues used e-mails to inform the Digital Preservation Team about changes to the AIPs.
It soon became clear that changes happen frequently, so we started to use an Excel list. That was exactly two years ago, which makes now a good time to review our workflows.
Since August 2018, the four use cases have occurred as follows:
| Use case | Occurrences |
| --- | --- |
| Data Producer withdraws access rights | 397 |
| Data Producer hands in an additional file | 15 |
| The not-so-persistent identifier changes | 59 |
| Data Producer replaces file | 280 |
Data Producer withdraws access rights
Our best-of case: the data producer decides that the publication of their work on our Open Access server is no longer desirable. Our colleagues delete the item on the platform and inform us via the Excel list.
As we are hesitant to delete AIPs once they are ingested, we instead change the access rights to “access right withdrawn”. We have a dark archive, and there are no plans to open it for public use. But just in case the units ever find their way back to some representation platform, we need to mark them clearly, so they won't ever be republished accidentally.
For years, we had a manual workflow for this, consisting of five steps:
- Search and open AIP
- Lock AIP to prevent simultaneous changes
- Open the metadata menu
- Assign access rights: change the current access rights to the new value (“access right withdrawn”)
- Commit changes
While this is not too bad and not too error-prone, it is also extremely boring and cumbersome, especially if we look at the numbers and see that we have to do it around 200 times a year. We ingest around 15,000 new units from the Open Access server per year, so that is only 1.3%. But as the number of AIPs keeps rising, even more cases are to be expected in the not-so-far-away future.
Therefore, we developed something fancier: a task chain which changes the access rights for a given set of AIPs. The only downside is that we need the PID of each AIP. The PID, however, is an internal identifier of the AIP within our archival system, which is based on the Rosetta software by Ex Libris. The colleagues who report changes to us give us their identifier, which is usually the handle; they have no way of knowing the PID within Rosetta. Of course, our Digital Preservation Team is able to search within Rosetta, but this would again be cumbersome and manual, and at that point it would be faster to use the manual workflow described above in the first place.
Fortunately, a member of our team has written a Perl script that takes a txt file of all the handles that have to be changed and retrieves the corresponding PIDs. Afterwards, we have a txt file with all the PIDs. Rosetta can work with that: we just feed the txt file with the PIDs to the task chain and the fun starts. The good thing is that this workflow is scalable: no matter how many PIDs the txt file contains, the workflow is a matter of seconds.
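For illustration, here is a minimal Python sketch of the handle-to-PID step. The actual tool is a Perl script against Rosetta's search interface; the endpoint, parameters, and response field below are assumptions, not Rosetta's real API.

```python
# Minimal sketch of the handle-to-PID lookup, assuming a hypothetical
# JSON search endpoint; our actual tool is a Perl script against Rosetta.
import requests

SEARCH_URL = "https://rosetta.example.org/api/search"  # hypothetical endpoint

def handle_to_pid(handle: str) -> str:
    """Look up the internal Rosetta PID for a given handle (assumed API)."""
    response = requests.get(SEARCH_URL, params={"identifier": handle})
    response.raise_for_status()
    return response.json()["pid"]  # assumed response field

# Read one handle per line, write the matching PIDs to a file that the
# task chain can consume.
with open("handles.txt") as infile, open("pids.txt", "w") as outfile:
    for line in infile:
        handle = line.strip()
        if handle:
            outfile.write(handle_to_pid(handle) + "\n")
```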
Data Producer hands in an additional file
Luckily, this does not happen so often. Sometimes a postprint is added. The workflow is very manual and contains three steps:
- lock AIP
- add representation (with a label, e.g. “postprint”)
- commit changes
As the cases are so scarce, the motivation to automate this workflow is small. However, as we will ingest the contents of our digitization centre pretty soon, a member of our team has developed an easy way to add files to an AIP using the Rosetta web services. This is still done for one AIP at a time, so not much time is gained. The gain is rather that the new application is much less error-prone, especially when AIPs consist of several hundred files. This is especially true if a file is not merely added but replaced (for more details, see “Data Producer replaces file”).
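As a rough illustration, this is what such an add-a-file call could look like. The endpoint, parameters, and upload mechanism are assumptions for the sketch; the real application uses Rosetta's web services, whose actual interface differs.

```python
# Sketch of adding a new representation to an existing AIP, assuming a
# hypothetical REST endpoint; the real tool uses Rosetta's web services.
import requests

ADD_REP_URL = "https://rosetta.example.org/api/aips/{pid}/representations"  # hypothetical

def add_representation(pid: str, label: str, filepath: str) -> None:
    """Attach a file as a new labelled representation of the AIP (assumed API)."""
    with open(filepath, "rb") as f:
        response = requests.post(
            ADD_REP_URL.format(pid=pid),
            data={"label": label},
            files={"file": f},
        )
    response.raise_for_status()

# Example: add a postprint to one AIP (hypothetical PID and filename).
add_representation("IE1234567", "postprint", "postprint.pdf")
```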
The not-so-persistent identifier changes
Interestingly, this is a workflow that was automated pretty fast, soon after going live. Not because it was so desperately necessary, but because it was relatively easy to do. Given an identifier DC field and a DC field to be replaced, we can enter a list with IDs and new values for the replace field, and the value will be replaced. So far, we have only used this for the PPN. We call the application “Metadata updater” and use it frequently. (Link to GitHub)
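A minimal sketch of the idea behind the Metadata updater: given an identifier DC field and a target DC field, read (identifier, new value) pairs from a list and replace the target field's value for each matching AIP. The endpoint, field names, and CSV layout are illustrative assumptions; see the GitHub repository for the real implementation.

```python
# Sketch of the "Metadata updater" idea. Endpoint, field names and CSV
# layout are assumptions, not the real application's interface.
import csv
import requests

UPDATE_URL = "https://rosetta.example.org/api/metadata"  # hypothetical
IDENTIFIER_FIELD = "dc:identifier"       # field used to find the AIP (assumed)
REPLACE_FIELD = "dc:identifier:ppn"      # field whose value gets replaced (assumed)

def replace_field_value(identifier: str, new_value: str) -> None:
    """Find the AIP by its identifier and replace one DC field (assumed API)."""
    response = requests.put(
        UPDATE_URL,
        json={
            "search_field": IDENTIFIER_FIELD,
            "search_value": identifier,
            "replace_field": REPLACE_FIELD,
            "new_value": new_value,
        },
    )
    response.raise_for_status()

# updates.csv: one "identifier,new PPN" pair per line
with open("updates.csv", newline="") as f:
    for identifier, new_value in csv.reader(f):
        replace_field_value(identifier, new_value)
```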
Data Producer replaces file
There are web services to communicate with the dark archive, so in theory this workflow could be automated. But there is a downside: our representation platform knows nothing about the SIP ID of the AIP in our dark archive, as there is no communication back to the platform. Only our submission application gets the SIP ID back; in addition, it holds all the information and a direct URL link to the representation platform.
Just recently, our DP developer created an application which can update the file. If the filename is identical, one only needs the PID of the AIP, its internal identifier in the archive. If the file has a new name, the IDs of the representation and of the file are also necessary. Plus, of course, one still needs to download the new file from the representation platform and upload it to the application.
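A sketch of that decision logic as described above: with an unchanged filename the PID is enough, while a renamed file additionally needs the representation and file IDs. All names and the upload call are hypothetical illustrations, not the application's real interface.

```python
# Sketch of the replace-file logic: with an unchanged filename the PID is
# enough; a renamed file also needs the representation and file IDs.
# Endpoint, parameter names and IDs are hypothetical illustrations.
from typing import Optional
import requests

REPLACE_URL = "https://rosetta.example.org/api/aips/{pid}/files"  # hypothetical

def replace_file(pid: str, filepath: str,
                 rep_id: Optional[str] = None,
                 file_id: Optional[str] = None) -> None:
    """Replace a file in the AIP; rep_id/file_id only needed on rename (assumed API)."""
    data = {}
    if rep_id and file_id:  # filename changed: address the exact file
        data = {"rep_id": rep_id, "file_id": file_id}
    with open(filepath, "rb") as f:
        response = requests.post(REPLACE_URL.format(pid=pid),
                                 data=data, files={"file": f})
    response.raise_for_status()

# Same filename: the PID suffices.
replace_file("IE1234567", "article.pdf")
# New filename: representation and file IDs are also required.
replace_file("IE1234567", "article_v2.pdf", rep_id="REP7654321", file_id="FL1111111")
```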
Running tests and measuring the time has shown:
This new application saves a lot of time and nerves for AIPs with many representations and files, such as the material of our digitization centre. This is why we developed it in the first place, as the ingest of our digitization centre's material is about to start this year.
For our Open Access repository, with AIPs that usually consist of only one file, nothing is gained, as we still have to search for all the IDs and the new file itself first. It makes no difference whether we upload the new file directly to the archive or via the new application.
I would guess a direct communication between the systems would be possible, at least via the submission application, but we have not yet implemented anything like this.
So what would be the ideal future?
Replacing files is one of our most frequent workflows, and it would be great to have a better way to automate it. That would require some communication between our archive and the representation platform: a button to be clicked which invokes an AIP update.
In a perfect world, there would be some AIP update application between our archive and the representation platform which handles all the changes independently and reliably, something like our submission application for the ingest. We are a long way from that. The only consolation is that we do not seem to be alone with this, according to a Twitter poll I did recently.