Digital Preservation Cost Modelling: Where did it all go wrong?

I recently spoke at a workshop on digital preservation costing organised by the lovely people at Knowledge Exchange and Nordbib. After briefly covering some of the work I was previously involved in as part of the LIFE Projects, I talked about why I think that estimating the costs of digital preservation is such a difficult challenge. I thought I’d reiterate some of those thoughts here, as well as my conclusions on how we should move things forward in this space.

I began by noting that there are some quite involved challenges in this field. Some of them are inherent in the problem area, and some we have made for ourselves. Of the latter, I think we’ve done a poor job of articulating why we actually want to cost digital preservation. Without clarity about the aims, the picture quickly becomes confusing.

These are a few reasons why an organisation might want to estimate DP costs:

Build a new repository from scratch – plan and budget for it
Add a new collection to our repository – can we afford it?
Evaluate/refine/compare lifecycles between organisations
Outsource our digital preservation or do it in house?
How much to charge for providing a preservation service to clients?

Clearly a different challenge might require a different solution, but matching these two things up is not straightforward at the moment. More broadly speaking, I think we’ve done a poor job of extracting requirements from our users, and a poor job of communicating those requirements. There are lots of generic statements out there about wanting to know more about costs, but few that really dig down to the reasons why.

Digital Preservation is of course also a moving target. It’s changing as new challenges and technologies come along. It’s also changing because we haven’t really got to a point where we can agree on how to do it. We’ve got a good idea of the approaches we can take, but we don’t have a lot of evidence on how well each approach or tool works in each circumstance. And we don’t have much in the way of agreed best practice. Estimating the cost of such an uncertain set of tasks is obviously going to be difficult.

Getting the right level of detail in a costing model is one of the hardest inherent challenges in this field. A simple model that allows us to make high level statements about where the biggest costs are likely to be tends to be of little practical use. Name me a cost generalisation (e.g. probably the best known: Ingest = costly) and I usually won’t have trouble coming up with examples that contradict it. This is because lifecycles are incredibly complex things and they depend on a whole host of variables. Attempt to model all these variables and you quickly find yourself at the other end of the cost modelling scale. Your model itself becomes bloated and complex. This makes it difficult to understand, difficult to develop and even harder to maintain. And then your users struggle to make sense of it as well.

We worked hard to tackle this particular challenge in LIFE3, and based on the evaluation project that followed, we almost completely failed. How we deal with this problem I’m afraid I really don’t know, and it’s clearly not one that only we encountered. Sabine Schrimpf, who also spoke at the workshop, noted that her organisation looked at using costing models already out there. They decided these were too complex and wanted something simpler, so they developed a new model. By the time they got to the end of the project, they still needed to do more, but had a model as complex as the others they’d originally reviewed.

Getting access to real cost data is essential for developing estimative models that yield anything close to usable and accurate results. But organisations are typically reluctant to share. This is a familiar refrain from me, and from my colleagues in this field. However, I think it’s important to note that we’ve actually been making this problem worse for ourselves. A number of different projects have captured cost data and shared it (albeit in small amounts). Unfortunately, we’ve all used different models, and different ways of describing the cost data. Re-using and interchanging the data with colleagues therefore becomes a lot of work.

My next step was to revisit the activities going on in this community that are examining costs. There are plenty to choose from. This is a list I compiled over a number of months, which features 13 different initiatives. I’m sure there are some that I missed, and I frequently go back and add more. I asked the 50 or so participants at the workshop to raise their hand if they were familiar with all of the initiatives on the list (note that key representatives from many of the initiatives were actually in the room). Only Neil Beagrie half raised his hand, and the cheeky grin on his face suggested that, had I carried out my threat to test him after the workshop, he would have been stumped before working through the whole list. Clearly this is at least twice as many initiatives as this small community can really support. So why do we reinvent the wheel rather than building on previous work? Communication and awareness are definitely part of the problem if no one (experts included) knows about all of this work. The lack of articulated requirements also makes it very easy for newcomers to decide that they have their own unique requirements and do some new (duplicative) development rather than building on existing efforts.

I made a final observation on the popularity of costing workshops. There were three in the space of three weeks:

Screening the Future: Managing the cost of archiving, master class, 22nd May, California
Nordbib: The Costs and Benefits of Keeping Knowledge, 11th June, Copenhagen
JCDL2012: Models for Digital Cost Analysis, 14th June, Washington DC

I asked the attendees at the Nordbib workshop if anyone was attending either of the other events. This time, even Neil didn’t raise his hand. Now I certainly don’t want to criticise the organisers of these events for putting together what were clearly very valuable sessions (for example, see this great write up on the Screening the Future master class from Inge Angevaare). But this does seem to be another indicator of a lack of awareness of related activities.

Clearly we’ve got to communicate and collaborate more effectively in this space. I think that the best way to begin doing this would be to develop a shared or standardised lifecycle model. With this foundation, all of the costing data we subsequently capture (or translate) and all of our developments in estimating or analysing lifecycle costs instantly become far more compatible. New developments would be complementary rather than competitive. This would be a huge win.
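
To make the compatibility point a little more concrete, here is a minimal sketch of what translating cost data into a shared model might look like. Everything in it is an assumption made for illustration: the stage names in `SHARED_STAGES`, the two projects and their cost categories in `MAPPINGS`, the `to_shared` helper and the cost figures are all hypothetical and are not taken from LIFE, KRDS or any other real model or dataset.

```python
from collections import defaultdict

# Hypothetical shared lifecycle stages (illustrative only, not from any real model).
SHARED_STAGES = ("acquisition", "ingest", "storage", "preservation", "access")

# Hypothetical mappings from each project's own cost categories to the shared stages.
MAPPINGS = {
    "project_a": {
        "selection": "acquisition",
        "ingest": "ingest",
        "bit_storage": "storage",
    },
    "project_b": {
        "deposit": "ingest",
        "archival_storage": "storage",
        "dissemination": "access",
    },
}


def to_shared(project, records):
    """Sum a project's (category, cost) records under the shared stage names."""
    mapping = MAPPINGS[project]
    totals = defaultdict(float)
    for category, cost in records:
        if category not in mapping:
            # An unmapped category is a gap in the agreed model, so fail loudly.
            raise KeyError(f"{project}: no shared stage agreed for {category!r}")
        totals[mapping[category]] += cost
    return dict(totals)


# Invented figures, purely for illustration.
a = to_shared("project_a", [("selection", 100.0), ("ingest", 250.0), ("bit_storage", 50.0)])
b = to_shared("project_b", [("deposit", 300.0), ("archival_storage", 75.0), ("dissemination", 20.0)])

# Once both datasets use the same stage names, they can be compared side by side.
for stage in SHARED_STAGES:
    print(stage, a.get(stage, 0.0), b.get(stage, 0.0))
```

The mapping tables are the part that would need community agreement. Once they exist, previously captured data only has to be translated once, and anything captured afterwards can use the shared stage names directly.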

But is it realistic? Looking at a number of the models (e.g. LIFE, KRDS, CMDP and the DANS work) there isn’t really a great deal of difference between them. There may be some parts of the lifecycle on which we need to agree to differ, perhaps where there are specific domain requirements, but I think we could still reach agreement on the majority of the lifecycle. This would be good enough to yield significant benefits.

How would we get buy-in to such a model, and how would we create it in the first place? Get the key players from each of the big costing initiatives together and hack out the detail in a one- or two-day workshop. I chatted with fellow presenters at the end of the Nordbib event, and this kind of collaboration seemed to be a popular idea.

The Knowledge Exchange / Nordbib workshop ended with some excellent discussion sessions, and further collaboration is on the action list for next steps, so I have great hope for the future!
