I spent last week at the Open Planets Foundation Database Archiving event at the Statens Arkvier in Copenhagen. There was a good mix of people, both technical and practicioners, with some informative presentations, and it was interesting to see the different approaches to the problem.
To my mind there appeared to be two distinct classes of use case:
Preserve the database and the environment allowing authentic access to the information and any accompanying applications.
In this case the requirement is to preserve an entire system, i.e.
- Table structures, including datatypes, and relationships.
- Business logic contained in the database e.g. stored procedures, triggers, constraints, etc.
- Ideally the user accounts and permissions should be retained, to ensure authentic access.
This case appears to cry out for an emulation solution, but there are potential problems:
- Preserving a system / set of systems, complete with user permissions and accompanying applications is not trivial even for comparatively simple systems.
- The original environment may not be that sophisticated meaning extracting new value form old data may be difficult
- Preserving old systems potentially carries the overhead of retaining the required technical skills to use it, e.g. DB Admin skills, these can be expensive.
The second case is really a subset of the first:
Extract / migrate the raw data and table structure from the original database.
- The table metadata and data can be serialised into an accessible XML format, e.g. SIARD.
- Where possible stored procedures, constraints, etc. are also serialised as XML.
- The serialised data can be loaded into a different RDBMS, or and alternative DB, to enable reuse.
While this approach is pragmatic the main drawback lies in what isn’t extracted. There will be information that can’t be migrated to a a new database, I can’t think of a single Stored Procedure dialect that doesn’t break ANSI SQL in multiple ways.
Any attempt to preserve a complex system using emulation will require a lot of bespoke work per system preserved. It’s difficult to see how much of the work involved can be generalised
The serialisation approach is more widely used. There were several demonstrations of serialisation tools given at the event. The problem is that none of the systems demonstrated used a common serialisation format. The SIARD format was used by at least two of the tools demonstrated, but in some cases the implementers had adapted the format to suit their own needs. This negates the advantage of using a common format, i.e. interoperability.
If I could pick a common format, I’d choose SIARD, it’s been in use for a few years and the format is to be submitted as a candidate ISO standard. I’ll have more to say about SIARD in a follow up post next week.