Percipio; computer generated file signatures

Percipio is a small tool I have developed. You can find it here: https://github.com/blekinge/percipio

I will make a proper release soon, especially if anybody shows any interest. Percipio has been heavily inspired by TrID (http://mark0.net/soft-trid-e.html), a closed-source tool that is no longer developed.

File signatures have traditionally been created by hand. This stems from the belief that humans are much better than computers at reading binary data. This belief is wrong, or at least incomplete: humans can understand binary data at a higher level, but any simple matching task is done far more easily by a computer.

The primary task of Percipio is to generate file format recognition signatures. Given a set of data files, Percipio scans for header and footer bytes that the files have in common. These common bytes are the signature of the format. The signature is written in an XML format that attempts to be compatible with the one used by TrID. I hope someone will write an XSLT to transform these into Fido signatures, which should be very easy.
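To illustrate the idea of common header and footer bytes, here is a minimal Python sketch. It is not Percipio's actual code: the function name `common_bytes`, the `window` parameter and the offset-to-byte dictionaries are all assumptions made for this example, and the XML output is left out.

```python
def common_bytes(samples, window=16):
    """Find byte positions on which every sample file agrees.

    Hypothetical helper, not part of Percipio. `samples` is a list of
    bytes objects, `window` is how many bytes to inspect at each end.
    Returns (header, footer) as {offset: byte_value} dictionaries.
    """
    # Never look further than the shortest sample allows.
    length = min(window, min(len(s) for s in samples))
    first = samples[0]

    # Header offsets count from the start of the file: 0, 1, 2, ...
    header = {
        i: first[i]
        for i in range(length)
        if all(s[i] == first[i] for s in samples)
    }

    # Footer offsets are negative and count from the end: -1, -2, ...
    footer = {
        -i: first[-i]
        for i in range(1, length + 1)
        if all(s[-i] == first[-i] for s in samples)
    }
    return header, footer


if __name__ == "__main__":
    files = [b"GIF89a123;", b"GIF89a45678;", b"GIF89aXY;"]
    header, footer = common_bytes(files, window=6)
    print("header:", header)
    print("footer:", footer)
```

Positions where the samples disagree are simply left out of the dictionaries, so the result can contain gaps.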

Percipio cannot presently generate regular expression signatures. So if the same structure exists in all files but at different offsets, Percipio will not find it. I am thinking about algorithms for matching identical blocks this way, but for now Percipio does not have this capability.

Percipio also contains code to match files against signatures. Here again I use an idea from TrID. Rather than aiming for a unique identification, I score the file against all the known signatures and present a scoreboard. With computer-generated signatures there will be false positives, and I hope the scoreboard will give users a way to work with this.

The score is calculated as (number of bytes matched - number of bytes not matched) * number of files used to generate the signature.

I multiply the score by the number of files used to generate the signature, as this is an easy number that should roughly correspond to the quality of the signature. A signature generated from many files is less likely to contain “local” artifacts than one generated from few files. Of course, if the files are all related, this number is misleading, but I do not know any other way to estimate the quality of a signature automatically.
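The scoring and the scoreboard could be sketched in Python roughly as follows. This is an illustration, not Percipio's implementation: the names `score` and `scoreboard` are hypothetical, a signature is assumed to be a dictionary of offsets to byte values, and I assume that "bytes not matched" means signature bytes that differ from the file or fall outside it.

```python
def score(data, signature, file_count):
    """(matched - not matched) * number of files behind the signature.

    Hypothetical helper. `signature` maps offsets to expected byte
    values; negative offsets count from the end of the file.
    """
    matched = 0
    for offset, expected in signature.items():
        try:
            if data[offset] == expected:
                matched += 1
        except IndexError:
            # Assumption: a file too short for this offset is a miss.
            pass
    not_matched = len(signature) - matched
    return (matched - not_matched) * file_count


def scoreboard(data, signatures):
    """Score `data` against every signature, best score first.

    `signatures` maps a format name to (signature, file_count).
    """
    rows = [
        (name, score(data, sig, file_count))
        for name, (sig, file_count) in signatures.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    known = {
        "gif": ({0: 0x47, 1: 0x49, 2: 0x46}, 3),  # "GIF", from 3 files
        "png": ({0: 0x89, 1: 0x50}, 10),          # from 10 files
    }
    for name, points in scoreboard(b"GIF89a", known):
        print(name, points)
```

Because the file count is a plain multiplier, a poorly matching signature built from many files is pushed further down the scoreboard than one built from few files.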

Tests have shown that as few as three distinct files are sufficient to generate signatures of useful quality. If possible, use files created by different programs or computers, so that license and author identifiers embedded in the file headers are not picked up as part of the signature.
