Developing an Audio QA workflow using Hadoop: Part I


It seems I have been working on this for a long time. It also seems the direction has moved a number of times. And I may still end up with a number of versions…

Now, the new goal is an Audio Migration+QA Taverna workflow built from a number of Hadoop jobs. The workflow will migrate a large number of mp3 files using ffmpeg, compare the content of the original and migrated files using xcorrSound waveform-compare, and thereby complete the audio part of checkpoint CP082 "Implementation of QA workflows 03" (PC.WP.3) (M36).

Thus version 1 (simple audio QA) will be a Taverna workflow including three Hadoop jobs: ffmpeg migration, mpg321 conversion, and waveform-compare on file lists/directories. Version 2 would add ffprobe property extraction and comparison.

I have changed my mind a number of times about which input and output best fit the tools, Taverna and Hadoop. For now, the input to the Taverna workflow is a file containing a list of paths to the mp3 files to migrate, plus an output path (and the number of files per task). This is also the input to the ffmpeg migration Hadoop job and the mpg321 conversion job. The output from each of these will be the path to the wav file (the output directory will also contain logs). The lists of paths to ffmpeg-migrated wavs and mpg321-converted wavs will then be combined in Taverna into a list of pairs of paths to wav files, which will be used as input to the xcorrSound waveform-compare Hadoop job.
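
The pairing step could look roughly like the sketch below. This is not the Taverna implementation, just an illustration in Python of what "combine two lists into a list of pairs" means here. It assumes that both jobs name their wav output after the original mp3 (song.mp3 becomes song.wav in each output directory); that naming scheme and the function name `pair_wav_paths` are made up for the example.

```python
from pathlib import PurePosixPath


def pair_wav_paths(ffmpeg_wavs, mpg321_wavs):
    """Pair wav paths from the two jobs by file name stem.

    Assumption: both jobs name their output after the original mp3,
    e.g. song.mp3 -> .../song.wav (hypothetical naming scheme).
    """
    # Index the mpg321 output by stem so each lookup is cheap.
    by_stem = {PurePosixPath(p).stem: p for p in mpg321_wavs}
    pairs, unmatched = [], []
    for path in ffmpeg_wavs:
        stem = PurePosixPath(path).stem
        if stem in by_stem:
            pairs.append((path, by_stem[stem]))
        else:
            # No counterpart, e.g. because one of the conversions failed.
            unmatched.append(path)
    return pairs, unmatched
```

Each pair would then become one line of input to the waveform-compare job, and anything left in `unmatched` points to a file where one of the two conversions did not produce output.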

The good thing about Taverna is the nice illustrations 😉 And the Taverna part is hopefully fairly straightforward, so I'll leave it as the last part. First I want to get the three Hadoop jobs running.

My trouble seems to be reading and writing files… Some of the trouble may be caused by testing on a local one-machine cluster, which probably has some quirks! Right now I am wrapping the FFmpeg migration as a Hadoop job. The tool can read local nfs files. The tool also seems to be able to write local files, although it does so with a "Permission denied" message in the log file?!? The Hadoop job is able to write the job result to an hdfs directory specified as part of the input. The Hadoop mapper, however, is not able to write to that same directory, but it can write the logs to a different hdfs directory. Thus I now have the output distributed over three different locations to make things work… And I do not have these settings in a nice configuration!!!

So this raises a number of questions:

  • Why am I not using They have probably encountered and solved many of the same issues. The answer for now is that this will be version 3, as it would be nice to compare.
  • Where do I want to read data from and write data to? If I have my data in a repository, it is probably not on hdfs. Do I really want to copy data to hdfs for processing and copy the results back from hdfs? The command line tools I am using do not seem to understand hdfs, which makes the simple answer no. So I want an nfs-mounted input and output data storage on my cluster. I can then read from this mount and write to this mount. I will probably put the event logs here as well, instead of on hdfs (here the input to the ffmpeg migration is the original mp3 file; the output is the migrated wav file and the "preservation event log" from the tool).
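
To make the nfs idea concrete, here is a minimal sketch in Python of what one migration could look like when everything lives on the mount. It is not the actual Hadoop mapper. The function names and the layout (wav file and log file side by side in one output directory) are assumptions for illustration, and the log here is simply whatever ffmpeg prints, not the "preservation event log" mentioned above.

```python
import subprocess
from pathlib import Path


def build_migration(mp3_path, output_dir):
    """Return (command, wav_path, log_path) for one mp3 file.

    output_dir is assumed to be a directory on the nfs mount.
    """
    stem = Path(mp3_path).stem
    wav_path = Path(output_dir) / (stem + ".wav")
    log_path = Path(output_dir) / (stem + ".log")
    # -y overwrites an existing output file; -i names the input file.
    command = ["ffmpeg", "-y", "-i", str(mp3_path), str(wav_path)]
    return command, wav_path, log_path


def migrate(mp3_path, output_dir):
    """Run ffmpeg and keep its messages next to the wav file."""
    command, wav_path, log_path = build_migration(mp3_path, output_dir)
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    with open(log_path, "w") as log:
        # ffmpeg writes its diagnostics to stderr, so send that to the log.
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return wav_path, result.returncode
```

A mapper would call something like `migrate` once per input line and emit only the wav path as its result, so the tool never needs to know about hdfs.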

This is work in progress… Soon to come: Part II with Taverna diagram 🙂


