A while back I wrote a blog post,
MIA: Metadata. I highlighted how difficult it was to capture certain metadata without a managed system – without an Electronic Document and Records Management System (EDRMS). I also questioned if we were doing enough with EDRMS by way of collecting data. Following that blog we sought out the help of a student from the local region’s university to begin looking at EDRMS systems, to understand what metadata they collected, and how to collect additional ‘technical’ metadata using the tools often found in the digital preservation toolkit.
Sarah McKenzie is a student at Victoria University. She has been working at Archives New Zealand on a 400 Hour,
Summer Scholarship Programme that takes place during the university’s summer-break. Our department submitted three research proposals to the
School of Engineering and Computer Science and out of them Sarah selected the EDRMS focussed project. She began work in December and her scholarship is set to be completed mid-February.
To add further detail, the title and focus of the project is as follows:
Mechanism to connect the tools in the digital preservation toolset to content management and database systems for metadata extraction and generation
Electronic and document records management systems (EDRMS) are the only legitimate mechanism for storing electronic documents with sufficient organisational context to develop archival descriptions but are not necessarily suited at the point of the creation of a record to store important technical information. Sitting atop database management technology we are keen to understand mechanisms of generating this technical metadata before ingest into a digital archive.
We are keen to understand the challenge of developing this metadata from an EDRMS and DBMS perspective where it is appreciated that mechanisms of access may vary from system to another. In the DBMS context, technical metadata and contextual organisational metadata may be entirely non-existent.
With interfaces to popular characterization tools biased towards that of the file system it is imperative that we create mechanisms to use tools central to the preservation workflow in alternative ways. This project will ask students to develop an interface to EDRMS and DBMS systems that can support characterization using multiple digital preservation tools.
Metadata we’re seeking to gather includes format identification, characterisation reports along with other such data as SHA-1 checksums. Tools typical to the digital preservation workflow include DROID, JHOVE, FITS and TIKA.
The blog continues with Sarah writing for the OPF on behalf of Archives New Zealand. She provides some insight into her work thus far, and insight into her own methods of research and discovery within a challenging government environment.
EDRMS Systems
An EDRMS is a system for controlling, and tracking the creation of documents from the point they are made through publication and possibly even destruction. They function as a form of
version control for text documents, providing a way to accomplish a varying range of tasks in the management of documents. Some examples of tasks an EDRMS can perform are:
• Tracking creation date
• Changes and publication status
• Keeping a record of who has accessed the documents.
EDRMS stores are the individual databases of documents that are maintained for management. They are usually in a proprietary format, and interfacing directly with them means having access to the appropriate
Application Layer Interface (API) and
Software Development Kit (SDK). In some cases these are merged together requiring only one package. The actual structure of the store varies from system to system. Some use the directory structure that is part of the computer's file system and then have an interface from there. Others utilise a database for storing the documents.
Most EDRMS are running client/server architecture.
Currently Archives New Zealand has dealt with three different EDRMS stores:
‘Notes’ has a publically available API and the latest version is built in Java, allowing for ease of use with metadata extraction tools, used in the digital preservation community – The majority I have found to be written in Java. There are many EDRMS systems, and it's simply not possible to code a tool enabling our preservation toolkit to interact with all of them without a comprehensive review of all New Zealand government agencies and their IT suites.
A survey has been partially completed by Archives New Zealand. The large number of systems suggested a more focused approach in my research project, i.e. a particular instance of EDRMS, over multiple systems.
Gathering Information on Systems in Use
Within New Zealand, The Office of the Government Chief Information Officer (OGCIO) had already conducted a survey of electronic document management systems currently used by government agencies. This survey did not cover all government agencies, but with 113 agencies replying it was considered a large enough sample to understand the most widely used systems across government. Out of the 113, some agencies did not provide any information, leaving only 69 cases where a form of EDRMS was explicitly named. These results were then turned into an alphabetical table listing:
• EDRMS names
• The company that created them
• Any notes on the entry
• A list of agencies using them
In addition to the information provided by the OGCIO survey, some investigative work was done in looking through the records of the Archives' own document management system to find any reference to other EDRMS in use across government. Other active EDRMS systems were uncovered.
For the purposes of this research it was assumed that if an agency has ever used a given EDRMS, it is still relevant to the work of Archives New Zealand, and considered ‘in-use’ until it is verified that there are no more document stores from that particular system which remain not archived, migrated to a new format, or destroyed.
Obstacles were encountered in the process of converting the information into a reference table useful for this project. Some agencies provided the names of companies that built their EDRMS. This is understandable to some extent, since there has been a vanity in the software industry where companies name their flagship product after the company (or vice versa). However, in some cases it was difficult to discern what was meant because the company that made the original software had been bought out and their product was still being sold by the new owner under the same name – or the name had been turned into a brand for an arm of the new parent company which deals with all their EDRMS software (e.g. Autonomy Corporation has now become HP Autonomy, Hewlett-Packard's EDRMS branch).
In addition, sometimes there were multiple software packages for document management with the same name. While it was possible to deduce what some of these names meant, it was not possible to find all of them. In these cases the name provided by the agency was listed with a note explaining it was not possible to conclude what they meant, and some suggestions for further inquiry. Vendor acquisitions were listed to provide a path through to newer software packages that possibly have compatibility with the old software, and also provide a way to quickly track down current owners of an older piece of software.
The varying needs of different agencies means there is no one-size-fits-all EDRMS system (e.g. a system designed for legal purposes may offer specialised features one for general document handling wouldn't have). But since there has been no overarching standard for EDRMS for various purposes – it was assumed that agencies would make their own choices based on their business needs – there turned out to be a large number of systems in use, some of them obscure or old. The oldest system that could be reasonably verified as having been used was a 1990's version of a program originally created in the late 1980s, called Paradox. This was in progress of currently being upgraded and the data migrated to a system called Radar when the document mentioning it was written, but there was no clear note of this being completed.
At the time of writing it had been established that there were approximately 44 EDRMS ‘in-use’.
With 44 systems in use it was considered unfeasible to investigate the possibility of automating metadata extraction from all of them at this time. It was decided to set some boundaries for starting points. One boundary was, which EDRMS is the most used? The most common according to the information gathered looked to be Microsoft SharePoint, which we could gather may have 24 agencies using it, and Objective Corporation's Objective was associated with at least 12 agencies.
A second way to view this was to ask, ‘which systems have been recommended for use going forward?’ Archives New Zealand’s parent department
The Department of Internal Affairs (DIA) has created a three-supplier
panel for providing enterprise content management solutions to government agencies. Those suppliers are:
With two weeks remaining in the scholarship, and work already completed to connect a number of digital preservation tools together in a middle abstraction layer to provide a broad range of metadata for our digital archivists, it was decided that testing of the tool, that is connecting it to an EDRMS and extracting technical metadata, would be best done on a working, in-use EDRMS, from the proposed DIA supplier panel, that would continue to add value to Archives New Zealand’s work moving into the future.
Getting Things Out of an EDRMS
The following tools were considered to be a good set to start examining extraction of metadata from files:
Linking the tools together has been done via a java application that uses each tool's command line API to run them in turn. The files are identified first by Droid, and then each tool is run over the file to produce a collection of all available metadata in Comma Separated Values format. This showed that some tools extract the information in different ways (date formatting is not consistent) and some tools can read data, others cannot; for example, due to a
character encoding issue, a particular PDFs Title, Author, and Creator fields were not readable in JHOVE where they were read correctly in Tika, and NLMET – JHOVE still extracts information those tools do not.
When a tool sends its output to standard out it's a simple matter of working with the text output as it's fed back to the calling function from the process. In some cases a tool produces an output file which had to be read back in. In the case of the NLMET, a handler for the XML format had to be built. Since the XML schema had separate fields for date and time of creation and modification, the opportunity was taken to collate those into two single date-time fields so they would better fit into a schema.
The goal with the collated outputs is to have domain experts check over them to verify which tools produce the information they want, and once that is done a schema for which piece of data to get from which tool can be introduced to the program so it can create and populate the Archives metadata schema for the files it analyses.
The ideal goal for this tool is to connect it to an EDRMS system via an API layer, enabling the extraction of metadata from the files within a store without having to export the files. For that purpose the next stage in this research is to set up a test example of one of DIA’s proposed EDRMS solutions and try to access it with the tool unifier. It is hoped that this will provide an approach that can be applied to other document management systems moving forward.
johan
February 4, 2014 @ 10:44 am CET
This caught my attention:
The tools you're using here are all wrapped FITS already. So I'm curious if there was any particular reason for writing yet another wrapper instead of using FITS?