Web-Scale Data Mining for Digital Preservation

Recent years have seen an ever-increasing interest in developing Data Mining methods that allow us to find structured information of interest in very large collections of data ("Big Data"). In this complex and emerging field, the digital preservation community may play an interesting role:

1. Information needs. On the one hand, the digital preservation community is actively developing tools to identify preservation risks, events and opportunities. As I highlight further on, this points to diverse and complex information needs that Big Data Analytics methods may help address.

2. Large-scale data and processing. On the other hand, and this is even more significant, the digital preservation community has both unique access to very large data sets and the infrastructure and experience needed to perform data-parallel processing on this data.

Taken together, this points to the potential of actively leveraging the data we preserve in order to make more informed digital preservation decisions.

An Example from SCAPE

In the SCAPE project, we are investigating scenarios in which we address information needs from digital preservation using large-scale data mining. We presented one such scenario at last year's iPres conference (slides here, paper here). In this scenario, we mined the Web for a simple piece of information:

Which publisher is responsible for which content?

Such information is currently aggregated in repositories such as the Keepers Registry. Check out the following screenshot from Keepers, which shows how they archive a journal from the area of "Big Data". The record includes the journal title, its ISSN, its publisher and the archiving agency:

Our goal was to automatically find more journals and their publishers in order to make such repositories more complete than they currently are.

Example Continued: Information Extraction on the Web

We implemented an Information Extraction (IE) system and executed it on a collection of crawled Web pages from the area of preservation. We were especially interested in sentences like the following:

"In 1991, two years before the merger with Reed, Elsevier acquired Pergamon Press in the UK."

and

"The American Journal of Preventive Medicine is the official journal of the American College of Preventive Medicine and the Association for Prevention Teaching and Research."

We performed deep syntactic analysis on such sentences and applied so-called lexico-syntactic patterns (such as "X acquired Y" or "Y is journal of X") to extract structured information from matching sentences. As a result, we extracted thousands of journal-publisher pairs, examples of which are given in the following table:

Journal	Publisher
A Journal of Human Environment	Royal Swedish Academy of Sciences
AAPS Journal	American Association of Pharmaceutical Scientists
Acta Radiologica	Scandinavian Society of Radiology
…	…

A manual evaluation revealed that 50% of all journal-publisher pairs found with this method were not in the Keepers Registry, but were correct and should be added. This shows how IE can be used to address information needs from the digital preservation community.
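To make the pattern idea more concrete, here is a minimal sketch of how patterns such as "X acquired Y" or "Y is journal of X" can pull pairs out of the two example sentences above. Please note that this is not our actual system: we match patterns on a deep syntactic analysis of each sentence, whereas this sketch assumes plain regular expressions over the surface text as a stand-in. The names NAME, PATTERNS and extract are hypothetical and chosen only for this illustration.

```python
import re

# Assumption: a "name" is a run of capitalized words, optionally joined by
# lowercase connectors ("Journal of Preventive Medicine"). A syntactic parser
# would identify noun phrases far more reliably than this.
NAME = r"\b[A-Z][\w&-]*(?:\s+(?:(?:of|for|and|the)\s+)*[A-Z][\w&-]*)*"

# Surface-level stand-ins for two lexico-syntactic patterns.
# In both, group x is the organization and group y is what it owns or publishes.
PATTERNS = {
    "acquired": re.compile(rf"(?P<x>{NAME}) acquired (?P<y>{NAME})"),
    "journal_of": re.compile(
        rf"(?P<y>{NAME}) is (?:the )?(?:official )?journal of (?:the )?(?P<x>{NAME})"
    ),
}


def extract(sentence):
    """Return a (pattern_name, x, y) tuple for every pattern that matches."""
    results = []
    for name, pattern in PATTERNS.items():
        match = pattern.search(sentence)
        if match:
            results.append((name, match.group("x"), match.group("y")))
    return results


SENTENCES = [
    "In 1991, two years before the merger with Reed, Elsevier acquired "
    "Pergamon Press in the UK.",
    "The American Journal of Preventive Medicine is the official journal of "
    "the American College of Preventive Medicine and the Association for "
    "Prevention Teaching and Research.",
]

for sentence in SENTENCES:
    for name, x, y in extract(sentence):
        print(f"{name}: x={x!r}, y={y!r}")
```

One limitation shows up immediately: in the second sentence the two coordinated organizations end up in a single x value, because a surface pattern cannot tell where one name ends and the next begins. This is one reason why syntactic analysis is worth the effort.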

Try it Out: Build Your Own Extractor!

For demonstration purposes, our Information Extraction system is now online as a workbench HERE. It executes IE on-the-fly on a very large corpus of over 160 million sentences crawled from the Web.

1. Try the examples. At the top left corner, there are some examples that you may choose from to get introduced to the system. In addition to the journal-publisher use case, we have created an example extractor that identifies which tool supports which file format.

2. Try creating your own. By selecting lexico-syntactic patterns and entity type restrictions, you create your own extractors. You can export the result tables using the export link at the bottom right. You can also create a permalink to share the extractor that you have created by clicking on the icon at the bottom left.

Try it out! By clicking on the question mark in the top middle of the page you get more detailed usage instructions for the workbench.

Outlook

We will demonstrate the system at the upcoming SCAPE Developer Workshop in Den Haag. Until then, some GUI details may change to make its use more intuitive. I look forward to many interesting discussions 🙂
