Over the last few months, I have been researching the problem of large-scale content profiling for preservation analysis. I do this for a number of reasons. For one, I am of the opinion that a format is just another property. Undoubtedly a very important one, but knowing which formats you have is not sufficient for good preservation planning and actions. I believe a good content profile sets the foundation of a preservation plan and helps reduce bias during the experiments phase. And lastly, it is a great source for preservation monitoring, but more on that later.
In this blog post I present a prototype content profiling tool called Clever, Crafty Content Profiling of Objects – c3po. 😉
What is c3po?
c3po is a tool that works with the metadata of digital objects and helps you get an idea of what you are dealing with. It consists of two parts: a CLI (Command Line Interface) application and a Web Application. The CLI app reads in and processes FITS metadata files and stores them in a document store. The Web Application offers visualisation, filtering, export of the data and much more.
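To make the processing step a bit more concrete, here is a minimal sketch of the idea – read one FITS output file, flatten a few measurements, and store them as a document. This is not c3po's actual code: the collection name, field names and file path are assumptions, and the element/attribute names simply follow typical FITS output.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import org.w3c.dom.Element;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;

public class FitsImportSketch {
    public static void main(String[] args) throws Exception {
        // Parse one FITS output file (the path is only an example).
        org.w3c.dom.Document fits = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("sample-object.fits.xml"));

        // FITS typically reports identification results as <identity format="..." mimetype="...">.
        Element identity = (Element) fits.getElementsByTagName("identity").item(0);

        // Flatten a couple of measurements into one document per object;
        // the field and collection names are illustrative, not c3po's schema.
        Document element = new Document("uid", "sample-object")
                .append("format", identity.getAttribute("format"))
                .append("mimetype", identity.getAttribute("mimetype"));

        try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> elements =
                    mongo.getDatabase("c3po").getCollection("elements");
            elements.insertOne(element);
        }
    }
}
```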
tl;dr The prototype of c3po has a CLI for processing near the data and a Web App for visualisation and analysis of the metadata.
Why do I need it and why the hell have you chosen FITS?
Well, simply because you may have no idea what content you have 🙂 Ok, that sounded rather harsh, but unfortunately, it is often true. You probably have an idea in terms of the number of objects, mime types, formats and versions, and even size. But do you have that picture across multiple collections? Or what if I ask you: "How many PDF 1.4 documents with a page count larger than 100 and password protection do you have in your collections?" or "Where do your invalid PDF/A documents come from and which applications created them?" If the answer is "I don't know" or "Let me ask our system and repository admins; I will tell you in a week", then you need c3po.
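Just to illustrate the kind of question, here is a rough sketch of how the first query could look once the FITS measurements sit in a document store such as MongoDB. The field names (format, format_version, page_count, is_protected) and the collection name are assumptions made for the example, not c3po's actual schema.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.conversions.Bson;

public class ProfileQuerySketch {
    public static void main(String[] args) {
        try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> elements =
                    mongo.getDatabase("c3po").getCollection("elements");

            // "PDF 1.4, more than 100 pages, password protected" as one filter.
            Bson filter = Filters.and(
                    Filters.eq("format", "Portable Document Format"),
                    Filters.eq("format_version", "1.4"),
                    Filters.gt("page_count", 100),
                    Filters.eq("is_protected", true));

            System.out.println("Matching objects: " + elements.countDocuments(filter));
        }
    }
}
```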
The problem is that these kinds of differences between the metadata measurements of digital objects (variance in page count, embedded images, number of tables, validity and well-formedness, embedded colour profiles, etc.) are often the cause of failed preservation actions. That is why I think it is a good idea to obtain a better overview of the content and to split it into (smaller) homogeneous sets. And homogeneity is not necessarily based on a single property such as the format 😉
I have already posted my thoughts on FITS and why I use it here, but to summarise: FITS provides not only identification data, but also very important deep characterisation data. Thanks to the tools it embeds, it has quite good coverage of content types. But undoubtedly, the best reason is the normalisation of the data. It is the only tool (that I know of) that gives me normalised output. On top of that, it provides me with hints as to whether a measurement is correct or not. Since there are numerous characterisation (and identification) tools out there, and all of them have different output schemas, I had to choose one to start with. FITS seemed an appropriate option for the reasons above, but also because it can be extended and because of its format coverage.
Nonetheless, c3po allows the use of other metadata formats. However, it will require a little effort to implement a new adaptor parser for every new schema. It is a prototype, remember? 🙂
tl;dr It is very hard to obtain an overview of large digital collections, and some kind of profile is needed. c3po uses FITS metadata to provide such an overview and generate an aggregated profile.
Nothing is impossible!
Now, I don't claim that c3po can handle endless amounts of data, but it is a start and I believe it is a proof of concept. Let me explain how it works:
In the current alpha version (0.2) there are two tools. The CLI application, which processes the raw data and stores it in a document store, can be executed near the data and the data store in order to optimise performance and reduce network overhead. c3po uses MongoDB, which is pretty neat. It is a NoSQL solution that combines the advantages of a fast key-value store and common relational databases. The best part is that it supports automatic node balancing and native map-reduce within the document store, which is a great tool for aggregation and analysis of the data.
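As a taste of what that map-reduce support looks like, here is a minimal sketch that counts objects per mime type – roughly the kind of aggregation a profile is built from. Again, the collection and field names are assumptions, and on recent MongoDB versions the aggregation pipeline would be the preferred way to do this.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MimeTypeDistributionSketch {
    public static void main(String[] args) {
        try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> elements =
                    mongo.getDatabase("c3po").getCollection("elements");

            // map: emit one count per object, keyed by its mime type
            String map = "function() { emit(this.mimetype, 1); }";
            // reduce: sum up the counts for each mime type
            String reduce = "function(key, values) { return Array.sum(values); }";

            // The result is one document per distinct mime type, e.g.
            // {"_id": "application/pdf", "value": 1234.0}
            for (Document row : elements.mapReduce(map, reduce)) {
                System.out.println(row.toJson());
            }
        }
    }
}
```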
With this prototype implementation and a single node for the document store, it seems feasible to scale up to a million documents, which in my opinion is a good start. I am aware that there are content-holding organisations that hold vastly larger amounts of information. Nonetheless, I believe that with a powerful enough infrastructure, it is possible to obtain a profile and analyse the peculiarities of a collection in a reasonable time frame even for these larger collections. And even if not, I am sure it will be helpful to others that have not yet passed the million-object threshold.
tl;dr The alpha prototype version 0.2 of c3po seems to scale easily to a million documents on a single node. With better infrastructure, it appears feasible to go beyond that without compromising scalability much.
Yeah, but I want to interact with the data!
The Web App is meant exactly for that. It offers you a few helpful features. First of all, it gives you an overview of your collections and allows you to drill down. You can browse the raw metadata of all the objects (or the filtered ones), and you can select representative sample objects based on several algorithms.
If you would like to do some other (more complex) analysis, you can easily export all the data (or a filtered subset) to a .csv file and then continue with a spreadsheet processor.
On top of this, you can export a special profile (in XML format) and use it in the new Plato 4 version (coming soon). This will automatically fill in the basis of your plan, and even more. Through the REST API, you can also connect c3po with Scout – the SCAPE preservation monitor. It will periodically talk to c3po and monitor your profile for specific changes. All of this works already and will be released really soon. Pretty neat, huh?
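To sketch the direction of that integration: a monitoring component like Scout would periodically pull the exported profile over HTTP and compare it with its previous snapshot. The endpoint below is purely hypothetical and only stands in for whatever the c3po REST API actually exposes.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProfilePollSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical endpoint; the real c3po REST API may expose the profile differently.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:9000/c3po/profile.xml")).GET().build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // A monitor like Scout would diff this profile against its previous snapshot
        // and notify the planner when properties of interest change.
        System.out.println("Fetched profile, " + response.body().length() + " characters");
    }
}
```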
tl;dr Analyse and filter the content with the Web App. Export to CSV and process with other tools. Integrate the profile with Plato and Scout.
Result
If you find this somewhat interesting, then please check out the following screencast that demonstrates c3po and leave me a comment.
Limitations
Well, as with any software, there are limitations. Here are some of the most important ones.
– Obviously, c3po's data quality is only as good as the quality of the provided metadata. We need better characterisation. Even more importantly, we need more performant characterisation. c3po's processing time does not include the time needed to characterise the content, which can be very long.
– Currently, only basic visualisations of the data are generated, and only based on a single property. Clearly, it would be helpful to be able to choose between different types of data and combinations of two or more properties.
– Only FITS is supported (for now). Other formats can be included by implementing a small adaptor that understands them; a rough sketch of such an adaptor follows below. That way, other great identification and characterisation tools, such as jpylyzer, can be used as well.
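To give a feel for the effort involved, here is what such an adaptor could look like. The interface and method names are hypothetical and are not c3po's actual API; they only mirror the idea of mapping one tool's report onto a common set of properties.

```java
import java.io.InputStream;
import java.util.Map;

// Hypothetical adaptor contract: turn one tool-specific metadata report into a
// flat map of normalised property values. This mirrors the idea, not c3po's real interface.
interface MetaDataAdaptor {
    Map<String, String> parse(InputStream report) throws Exception;
}

// Example: an adaptor for jpylyzer would read its XML report and map the
// measurements it understands onto the common property names.
class JpylyzerAdaptorSketch implements MetaDataAdaptor {
    @Override
    public Map<String, String> parse(InputStream report) throws Exception {
        // ... parse the jpylyzer XML here and fill in the real values ...
        return Map.of("mimetype", "image/jp2"); // placeholder value only
    }
}
```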
Next Steps
As next steps, I plan to run more tests on larger collections in order to uncover bugs, problems, and bottlenecks, and to figure out how the web app can be enhanced. I will try to address the aforementioned limitations and will concentrate on stability and integration with other components.
One more thing:
At the beginning of December, the first SCAPE Training Event will be hosted in Guimarães, Portugal. I will present c3po there, so if you are interested and want to talk to me, consider joining.
Links:
http://ifs.tuwien.ac.at/imp/c3po
https://github.com/peshkira/c3po
Wow, you actually read the whole thing. You rock!