Following a few interesting conversations recently, I got interested in the idea of a 'bit flip' – an occasion where a single binary bit changes state from a 0 to a 1 or from a 1 to a 0 inside a file.
I wrote a very inefficient script that sequentially flipped every bit in a JPEG file, saved each new bitstream as a JPEG, attempted to render it with the [im] Python library and, if the render succeeded, calculated an RMSe value for the new file.
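For anyone wanting to try something similar, here is a minimal sketch of the flipping step. It is not the original script: the function names (`flip_bit`, `all_single_bit_flips`, `rmse`) and the bit numbering (bit 0 is the most significant bit of the first byte) are assumptions made for illustration, and the decoding step is left out because it depends on which image library you use. The `rmse` helper simply compares two equal-length sequences of numbers, such as flattened pixel values.

```python
import math


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted.

    Assumed numbering: bit 0 is the most significant bit of byte 0.
    """
    if not 0 <= bit_index < len(data) * 8:
        raise IndexError("bit_index out of range")
    byte_index, offset = divmod(bit_index, 8)
    out = bytearray(data)
    out[byte_index] ^= 0x80 >> offset  # XOR with a one-bit mask toggles that bit
    return bytes(out)


def all_single_bit_flips(data: bytes):
    """Yield (bit_index, damaged_copy) for every bit in `data`, in order."""
    for bit_index in range(len(data) * 8):
        yield bit_index, flip_bit(data, bit_index)


def rmse(a, b) -> float:
    """Root-mean-square error between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if len(a) == 0:
        return 0.0
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))
```

In use, you would read the file as bytes, loop over `all_single_bit_flips`, try to decode each damaged copy, and pass the decoded pixel values of the original and the damaged version to `rmse`. Note that this makes one full copy of the file per bit, so it is just as inefficient as described above.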
I've not really had much time to take this further at the moment, but it's an academic notion I'd be interested in exploring some more.
I'm not sure if a bit flip is a theoretical or 'real' threat on modern storage devices – in the millions of digital objects that have passed through my hands in the past 10 years, I've never knowingly handled a file damaged by a random bit flip. I'd be interested in any thoughts / experiences / observations on the topic.
Please see the attached file for some pretty pictures.
Feel free to get in touch if you want any more data – images, RMSe data or scripts.
pixelatedpete
February 18, 2013 @ 9:57 am CET
You know, I've been thinking (dangerous I know) and wondered if my thought was a good idea, a bad idea, been done before, etc. and this seemed a good place to find out! 🙂
Flipping a bit produces a broken image. If you flip the bits on lots of images you get lots of broken images, and broken images – particularly JPEGs – seem to exhibit very similar artifacts – at least the broken images seem familiar somehow.
We could use this technique then to create a large body of broken images quite quickly.
Now, my question is, will we see any similarity in that breakage?
I'm not expecting a direct correlation between the bit and the damage (though it'd be neat if flipping bit 17 always resulted in a cyan swathe across the image, for example) but rather that images that are broken may all produce similar artifacts/shapes?
If (big if probably) we can extract features from each of the broken images (Matchbox?) we may then be able to cluster around these features and start to answer that question – is there any similarity in the breakages?
Why?
If we can spot similarity, we can use that cluster data as another measure of whether or not an image is broken in the absence of any "ground truth" – i.e. we've not migrated the image and are checking against the original, we're just handling an image in isolation – say from a CD-ROM we're ingesting?
Could also do something similar with images identified as broken on the Atlas, but I'm not sure the corpus is big enough yet…
Having thought it all through, I think I'll go get on with it! 🙂
andy jackson
February 15, 2013 @ 1:44 pm CET
Just asked a colleague, and they said that over the last six years of operation of the main store, which has a current total of 50 million files containing about half a petabyte of data (replicated totals), the BL has seen spontaneous bitstream damage once (i.e. only one file has ever been repaired for this reason). There have been other errors, but they have been down to systematic sources like faulty hardware or workflow problems, rather than true spontaneous 'bit rot'.
So yes, it happens, but it is certainly rare.
Jay Gattuso
February 14, 2013 @ 7:27 pm CET
Indeed. When I get round to it, I'm going to loop them all into a movie. Even at full frame rate it's going to be a very long and dull movie!
Jay Gattuso
February 14, 2013 @ 7:13 pm CET
Yup, totally agree – I was following a couple of strands when I did this: one is the comments in the reply to Paul, and the other was to see what the resulting images look like!
I've really only seen file construction errors (where a filestream is created incorrectly) or truncation errors (where files haven't been written fully post tx or write).
Jay Gattuso
February 14, 2013 @ 7:16 pm CET
Good spot,
Fixed now, thanks.