Before Easter we planned to do a correctness benchmark for Audio Migration QA, specifically targeting the new tool xcorrSound waveform-compare, see https://openpreservation.org/blogs/2012-07-09-xcorrsound-waveform-compare-new-audio-quality-assurance-tool. The migration tool used was FFmpeg (version 0.10). The tools used in the QA were FFprobe, JHove2 (version 2.0.0) and xcorrSound waveform-compare (version 0.1.0). The tool used for the workflow was Taverna, and the workflow is available from myExperiment.
The challenge in a correctness baseline test set for audio migration quality assurance is that audio migration errors are rare. We thus wanted to create a "simulated annotated data set", where each entry consists of a "test file" with a possible migration error, a "control file" without any error that we can use for comparison, an "annotation" telling us about the test file, and a "similar" attribute that is either true or false.
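To make the shape of one entry concrete, here is a minimal sketch in Python. The class and function names are hypothetical and are not taken from the dataset or the workflow; the example values come from the challenge-TE-2.wav case described later in this post, and the annotation string is a paraphrase rather than the text stored in the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One entry of the simulated annotated data set (hypothetical layout)."""
    test_file: str     # file with a possible (simulated) migration error
    control_file: str  # error-free file used for comparison
    annotation: str    # free-text note describing the test file
    similar: bool      # expected verdict for test file vs. control file


def is_correct(entry: DatasetEntry, tool_says_similar: bool) -> bool:
    """A QA result counts as correct when it agrees with the annotation."""
    return entry.similar == tool_says_similar


# Illustrative entry; the annotation wording is paraphrased from this post.
entry = DatasetEntry(
    test_file="challenge-TE-2.wav",
    control_file="challenge.wav",
    annotation="Audacity Compressor 10:1, amplify -5",
    similar=True,
)
print(is_correct(entry, tool_says_similar=True))
```

The benchmark then boils down to comparing the workflow's verdict with the `similar` attribute for every entry.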
In connection with large-scale experiments in November 2012 (http://wiki.opf-labs.org/display/SP/EVAL-LSDR6-1), we did succeed in finding a migration error using waveform-compare. It turned out that this was caused by a bug in the old version of FFmpeg (0.6-36_git20101002) that we had used. This bug had been fixed in version 0.10. The error was a short bit of noise. Naturally, we made a test file for this type of error.
We also found that different conversion tools added different (inaudible) short pieces of silence to the end of the files. The waveform-compare tool reported these as 'not similar', but we decided that these files are similar, and the tool was updated accordingly. We also created test files with short bursts of silence in different places to test this.
We then had 5 different test files based on one original, a snippet of "Danmarks Erhvervsradio" (Danish Business Radio) from a show called "Børs & Valuta med Amagerbanken" from 8th of January 2002 from 8:45 till 9 am. This file is available from https://github.com/statsbiblioteket/xcorrsound-test-files/raw/master/DER259955_ffmpeg.wav.
We thought our test data set was meager and decided that we needed a more diverse annotated dataset for the correctness benchmarking. This was accomplished by issuing an internal challenge at the Digital Preservation Technologies Department at SB. The challenge was, given a correct wav file, to introduce a simulated migration error that our workflow would miss. The given original file was a snippet of a Danish Radio P3 broadcast from October 31st 1995, approximately 8:13 till 8:15 am, with the end of a song, then a short bit of talking, and then the beginning of a new song. You can download the file from https://github.com/statsbiblioteket/xcorrsound-test-files/raw/master/original.mp3 and listen to it. The reward was a chocolate Easter egg to anyone who succeeded.
This resulted in 23 new and very different test files. The full simulated annotated dataset used for the correctness benchmarking thus consists of 28 test files and 2 comparison files along with annotations, and is available on GitHub at https://github.com/statsbiblioteket/xcorrsound-test-files.
Experiments and Results
The experiences from the correctness benchmarking showed that the classification of files into similar and not similar is certainly debatable. We decided to reward everyone with a small Easter egg for participating, and we then chose a few remarkable contributions and awarded them bigger Easter eggs 🙂
The first big Easter egg went to the challenge-pmd.wav test file, which has a jpg image hidden in the least significant bits of the wave file. The difference between challenge-pmd.wav and challenge.wav is not audible to the human ear, and is only discovered by the waveform-compare tool if the match threshold is set to at least 0.9999994 (the default is 0.98). We think these files are similar! This means that our tool does not always 'catch' hidden images. For the fun story of how to hide an image in a sound file, see http://theseus.dk/per/articles/hiding-stuff-in-stuff/.
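To see why a least-significant-bit payload barely moves a correlation-based match value, here is a small sketch. It is not the xcorrSound algorithm: it uses a plain zero-lag normalized correlation, a synthetic 440 Hz tone instead of the real broadcast, and random bits instead of a jpg, all of which are assumptions made for illustration. The number it produces is therefore not comparable to the 0.9999994 figure above.

```python
import math
import random


def normalized_correlation(a, b):
    """Zero-lag normalized correlation of two equal-length sample lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm


def embed_lsb(samples, bits):
    """Overwrite the least significant bit of each sample with a payload bit."""
    return [(s & ~1) | bit for s, bit in zip(samples, bits)]


# Assumption: a synthetic one-second tone at 48 kHz stands in for real audio.
rate = 48000
original = [int(10000 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate)]

# Assumption: random bits stand in for the bytes of a hidden image.
rng = random.Random(0)
payload = [rng.randint(0, 1) for _ in original]

stego = embed_lsb(original, payload)
match = normalized_correlation(original, stego)

# Every sample changes by at most 1 out of a 16-bit range.
print(max(abs(x - y) for x, y in zip(original, stego)))
print(match > 0.98)
```

Each sample changes by at most one quantization step, so the two signals stay almost perfectly correlated and a default-style threshold of 0.98 is passed easily.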
The second big Easter egg went to the challenge-TE-2.wav test file, which was made using the Audacity Compressor set to 10:1 and amplify -5, which is similar to radio broadcast quality loss. The difference between challenge-TE-2.wav and challenge.wav is audible, but only discoverable with a threshold >= 0.99. The question is whether to accept these as similar. They certainly are similar, but this test file represents a loss of quality, and if we accept this loss of quality in a migration once, what happens if the file is migrated 50 times? The annotation is similar=true, and this is also our test result with the default threshold of 0.98, but perhaps the annotation should be false and the default threshold should be 0.99?
And then there is challenge-KFC-3.wav, where one channel is shifted a little less than 0.1 second, the file is then cut to the original length, and both the file and stream headers are updated to the correct length. The difference here is certainly audible, and the test file sounds awful. The waveform-compare tool, however, only compares one channel (default channel 0) and outputs success with offset 0. The correctness benchmark result is thus similar=true, which is wrong. If waveform-compare is set to compare channel 1, it again outputs success, but this time with an offset of 3959 samples (about 82 milliseconds, as the sample rate is 48 kHz). This suggests that the tool should be run on both (all) channels, and the offsets compared. This may be introduced in future versions of the workflows. Unfortunately this entry was late, so no Easter egg was awarded 🙁
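The suggested check can be sketched as follows. This does not run waveform-compare; it only takes the per-channel offsets reported above (0 for channel 0, 3959 for channel 1) and compares them. The function names and the `tolerance_samples` parameter are hypothetical.

```python
SAMPLE_RATE_HZ = 48000  # sample rate of the test file, as stated above


def offset_ms(offset_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert an offset in samples to milliseconds."""
    return 1000.0 * offset_samples / sample_rate_hz


def channels_agree(offsets_by_channel, tolerance_samples=0):
    """True if all channels report the same offset, within a tolerance."""
    offsets = list(offsets_by_channel.values())
    return max(offsets) - min(offsets) <= tolerance_samples


# Offsets reported for challenge-KFC-3.wav against challenge.wav.
reported = {0: 0, 1: 3959}

print(round(offset_ms(reported[1]), 1))  # 82.5
print(channels_agree(reported))          # False
```

With such a check, a file where every channel individually reports success would still be flagged as not similar when the channels disagree about the offset.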
A Few Notes
Some settings are also relevant for the checks of 'trivial' properties. For instance, we allow some slack when comparing durations. This was introduced because we have previously seen tools append a short bit of silence to the end of the file during migration. The default slack used in the benchmark is 10 milliseconds, but this may still be too little depending on the migration tool.
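A duration check with slack might look like the sketch below. How the workflow actually implements it is not shown in this post, so the function name, the use of integer milliseconds, and the example durations are assumptions; only the 10 millisecond default comes from the benchmark.

```python
def durations_match(original_ms, migrated_ms, slack_ms=10):
    """True if the two durations differ by no more than the allowed slack."""
    return abs(original_ms - migrated_ms) <= slack_ms


# Hypothetical durations: a migration that appends 5 ms of silence passes,
# one that appends 25 ms does not.
print(durations_match(120000, 120005))
print(durations_match(120000, 120025))
```

Raising `slack_ms` makes the check more tolerant of padding added by a particular migration tool, at the cost of missing small truncations.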
The full test results are attached 🙂