Tabular Data Normalisation Tool
In 2002 the UK government introduced regulation that required all UK local authorities to provide the British Library with a copy of the electoral register every year. However, the legislation did not require this data to be provided in any particular format and, as a result the data is sent to the British Library in a variety of digital formats.
This has presented some challenges to the British Library concerning how this information is stored for the long term.
Most of the data is sent to the British Library in the form of delimited text files; however there are also Excel spread sheets, Microsoft Word documents and a significant number of PDF files.
In addition some of the files are zipped and others are password protected. There are also some with incorrect file types, for examples some files have suffixes that indicate an excel spread sheet but are in fact text files.
To date the work at the British Library has concentrating on the data supplied as delimited text files, although some work has been done on PDFs.
Of the delimited text files the nature of the data varies :-
· Some have headers, although the header names are not consistent, while others have no headers.
· Some are comma delimited, while others are tab delimited.
· Some have data values enclosed in quotes, others don’t, while some have some have a mixture of both, quoted and unquoted.
· Most files include more data than the BL is required to keep and some of it is duplicated.
The aim was to create a tool that can take the supplied electoral register data as input and produce a single data file containing just the items of information the BL is required to hold in a normalised format.
Although this specific problem relates to UK Electoral Register data it represents a more general problem of processing disparate text files which need to be rationalised. The software has therefore been designed to be as generic as possible in order that it can be reused with other datasets of this type.
The solution consists of 4 stages :
The identification stage involves identifying the type of the incoming file using Apache Tika. At present only text files are supported.
This is the most complex stage as it is responsible for mapping columns from the incoming files to corresponding columns in the normalised output file. In order to do this it is necessary to determine the nature of the data in the incoming file. For example :-
Is it tabular (delimited)
Does it have a header line ?
If there is a header line we need to match header names in the incoming file to the required column headers in the output file, this is done using pattern matching.
If there is no header we need to identify the contents of columns using the data in those columns. This is done using a mixture of pattern matching e.g. postcodes, and comparing the contents of columns with expected values, e.g. common surnames.
The information used to carry out this characterisation is held in a normalisation properties file that is specific to the context. It effectively contains a description of the normalised output file.
Examples of the kind of information held in the properties file are :-
· the column headers of the output file,
· whether each item (column) of data is mandatory or optional,
· the kind of delimiter used,
· whether output data is enclosed in quotes,
· regular expressions describing the format of the expected data e.g. postcode
· validation information in order to carry out quality assurance, this might take the form of pattern matching or lists of values.
The aim of the characterisation stage is to produce a mapping that describes how the input data maps to the output data, i.e. which columns in the input file need to be extracted and copied to the output file.
The next step is migration which involves copying data from the input file to the output file. The mapping, produced by the characterisation stage, makes this a relatively trivial task, although it also incorporates QA using validation information stored in the properties file.
The first three stages, identification, characterisation and migration, are carried out by the Hadoop map process. The more usual Hadoop scenario is that of one large file that is split between a number of Hadoop nodes. In this case we have a large number of relatively small files and the input to the Hadoop map method is a file containing a list of all the incoming files to be processed. These files are then distributed between the available nodes in the Hadoop cluster. The output from the map process is a single normalised output file per incoming file. The Hadoop reduce process then merges these files together to produce a single normalised output file.
The tool has worked well in producing normalised output from incoming electoral register data. In order to check how generic it is the tool was also tested against another, completely different, set of government data files. All that should be needed to generate a normalised dataset from the new input files is a new version of the normalisation properties file. When the tool was tested against the new dataset it successfully produced the required normalised data file.
By marwoodls, posted in marwoodls's Blog