What is the checksum of a directory? Using DROID reports and the concepts behind Merkle Trees to generate Directory, and Collection Checksums

An introduction to sumfolder1

by Ross Spencer



A directory on disk doesn’t have a checksum, but what if it did? This is a question I ask, and try to answer, in my new script, sumfolder1.

Unlike a file, a directory doesn’t have a payload that can be summed to create a checksum, which we often also refer to as a digest or a hash.

Yet, the function of a directory is hugely important, and more so in the field of digital preservation, which supports disciplines within the GLAM sector such as libraries and archives. Directories provide context and structure for collections of digital objects. They can help describe how objects were used and interacted with.

An example collection

A simple directory structure may look as follows:

   C:\DOCUMENTS\COLLECTION
   ├───policies
   │       fiscal-policy-2021.doc
   │       fiscal-policy-2022.doc
   │
   ├───speeches
   │       opening-of-parliament-2021.doc
   │
   └───strategic-planning
           budget-2022.xlsx
           foreigh-policy.doc
           schools.doc

Each of the files has the following checksum:

File name                        Checksum
fiscal-policy-2022.doc           144688d499ba783230b5583012cc76f1
fiscal-policy-2021.doc           bdd43f731248afe2a469042a3c98ee55
opening-of-parliament-2021.doc   a0e5ea45f7e4c2fd2669ce92565063f8
budget-2022.xlsx                 37692aad8bd0fc4b83c2fa005ac81c95
foreigh-policy.doc               600f19f72ed65fbc689734d3c62265c6
schools.doc                      9111b5b9989fecab44e07aff47aa73c3
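
For completeness, here is a minimal sketch of how a file-level checksum like those above is typically produced from a file’s payload; the path used is hypothetical, and any file on disk would do.

import hashlib


def file_checksum(path):
    """Return the MD5 hex digest of a file's payload, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical path for illustration only.
print(file_checksum("policies/fiscal-policy-2021.doc"))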

If we look at the file names, we may be able to guess where in the directory listing the files belong; we may not.

Fortunately, when we process a collection we normally receive two things:

  • A file and directory path listing.
  • A list of checksums. 

Given this information, we can align the two sets and see that we have a “complete” collection. We also know the original structure. 

But what if I had deleted a row of information from each listing, removing a file, its checksum, and the full path to it in the directory listing?

You wouldn’t be able to tell that the information was ever there.

Checksums are important when we’re working in archives. They tell us whether something has been modified (as we try to combat entropy). They tell us about duplicates. They give super-powers to file-format identification reports as I discuss in my Code4Lib paper Fractal in detail: What information is in a file format identification report?.

But checksums alone do not provide an archivist or librarian with structural information. They usually require an intellectual control layer to become useful in an archival context; that usually looks like a database with the listings above stored within – directories converted to fonds and sub-fonds, or series, and checksums stored alongside the file listings attached to those. Periodically the database will be compared against a newly generated list of checksums, looking for data corruption, and our job is largely done.

If information is deleted, or even added, we don’t have good computational ways of proving this. We rely on our knowledge of a collection, and techniques such as diplomatics, to authenticate a collection. 

If directories had checksums, we may be able to do more with the information in front of us. 

A worked example

Take this simple Python 3 script using the checksums and structure described above: 

"""Simple demo listing generating a collection hash.
"""

import hashlib


def sum_digest(listing, digest):
    """Return a hexdigest from a list."""
    for item in listing:
        digest.update(item.encode())
    return digest.hexdigest()


def hash_simple_collection():
    """Static function for hashing our demonstration collection."""
    digest_planning = hashlib.md5()
    planning = [
        "37692aad8bd0fc4b83c2fa005ac81c95",
        "600f19f72ed65fbc689734d3c62265c6",
        "9111b5b9989fecab44e07aff47aa73c3",
    ]
    hash_planning = sum_digest(planning, digest_planning)

    digest_speeches = hashlib.md5()
    speeches = [
        "a0e5ea45f7e4c2fd2669ce92565063f8",
    ]
    hash_speeches = sum_digest(speeches, digest_speeches)

    digest_policies = hashlib.md5()
    policies = [
        "144688d499ba783230b5583012cc76f1",
        "bdd43f731248afe2a469042a3c98ee55",
    ]
    hash_policies = sum_digest(policies, digest_policies)

    collection_digest = hashlib.md5()
    collection = [
        hash_planning,
        hash_speeches,
        hash_policies,
    ]
    collection_hash = sum_digest(collection, collection_digest)
    print(f"planning hash is:   {hash_planning}")
    print(f"speeches hash is:   {hash_speeches}")
    print(f"policies hash is:   {hash_policies}")
    print(f"collection hash is: {collection_hash}")


def main():
    """Primary entry point for this script."""
    hash_simple_collection()


if __name__ == "__main__":
    main()

This outputs the following information:

planning hash is:   23bede883649889f67f1428507ac69c2
speeches hash is:   3a9ccc1adcda8f2062cb2ea48b62bfa6
policies hash is:   c6ca13ea99414abb036dad0b06bba431
collection hash is: b09251f91b3bcc56b33a049e43c1d8c3

Which can be visualized as follows:

📁 collection b09251f91b3bcc56b33a049e43c1d8c3
   📁 policies c6ca13ea99414abb036dad0b06bba431
      📄 fiscal-policy-2022.doc 144688d499ba783230b5583012cc76f1
      📄 fiscal-policy-2021.doc bdd43f731248afe2a469042a3c98ee55
   📁 speeches 3a9ccc1adcda8f2062cb2ea48b62bfa6
      📄 opening-of-parliament-2021.doc a0e5ea45f7e4c2fd2669ce92565063f8
   📁 strategic-planning 23bede883649889f67f1428507ac69c2
      📄 budget-2022.xlsx 37692aad8bd0fc4b83c2fa005ac81c95
      📄 foreigh-policy.doc 600f19f72ed65fbc689734d3c62265c6
      📄 schools.doc 9111b5b9989fecab44e07aff47aa73c3

Borrowing from techniques used in Merkle Trees, e.g. in Git source control, and in blockchains such as Cardano, we can use the sum of the file hashes (the parts of the tree with data) to generate hashes for the directories that contain them. And given a list of directories with newly minted hashes, we can create an overall collection checksum that can begin to tell us more about what we see in front of us.
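
To make that recursion explicit, here is a minimal sketch that generalises the script above to an arbitrary nested tree; the folder_hash function and the dictionary layout are illustrative choices of mine, not part of sumfolder1. Because folders are processed in the same order as in the script, it should reproduce the same hashes.

import hashlib


def folder_hash(node):
    """Hash a folder by digesting its file checksums, then its sub-folder hashes."""
    digest = hashlib.md5()
    for file_checksum in node.get("files", []):
        digest.update(file_checksum.encode())
    for child in node.get("folders", []):
        digest.update(folder_hash(child).encode())
    return digest.hexdigest()


collection = {
    "folders": [
        # strategic-planning
        {"files": [
            "37692aad8bd0fc4b83c2fa005ac81c95",
            "600f19f72ed65fbc689734d3c62265c6",
            "9111b5b9989fecab44e07aff47aa73c3",
        ]},
        # speeches
        {"files": ["a0e5ea45f7e4c2fd2669ce92565063f8"]},
        # policies
        {"files": [
            "144688d499ba783230b5583012cc76f1",
            "bdd43f731248afe2a469042a3c98ee55",
        ]},
    ]
}

print(folder_hash(collection))  # should match the collection hash printed above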

Identifying changes in a collection

For example, if we remove a file, “fiscal-policy-2021.doc”, the collection checksum changes completely:

planning hash is:   23bede883649889f67f1428507ac69c2
speeches hash is:   3a9ccc1adcda8f2062cb2ea48b62bfa6
policies hash is:   1fb121f760ccf12bc711b62ca104c043
collection hash is: 256c9656439399e1c3a2b9c4c0fa1caa

Interestingly, the “strategic-planning” and “speeches” hashes remain the same because their underlying data hasn’t changed, but the “policies” hash has (it contained the removed file), and thus the collection hash changes because some of the data in the folders underneath has changed.

If we had a record of the original collection hash alone, we would know when downloading this new collection that it is inauthentic. We may not know yet what is different, but like with normal checksums, it triggers an investigation as to what changed. 

If we had a record of the entire original collection in front of us, listing file hashes and directory hashes, we would see immediately that something has changed in the policies directory, and nowhere else.
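
As a minimal sketch of that comparison, the snippet below diffs the original directory-hash listing against the post-removal listing to show where the change sits; compare_listings is an illustrative helper of mine, not part of sumfolder1.

original = {
    "strategic-planning": "23bede883649889f67f1428507ac69c2",
    "speeches": "3a9ccc1adcda8f2062cb2ea48b62bfa6",
    "policies": "c6ca13ea99414abb036dad0b06bba431",
}

after_removal = {
    "strategic-planning": "23bede883649889f67f1428507ac69c2",
    "speeches": "3a9ccc1adcda8f2062cb2ea48b62bfa6",
    "policies": "1fb121f760ccf12bc711b62ca104c043",
}


def compare_listings(before, after):
    """Report folders whose hashes differ between two listings."""
    for folder, checksum in before.items():
        if after.get(folder) != checksum:
            print(f"change detected in: {folder}")


compare_listings(original, after_removal)  # -> change detected in: policies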

Querying a collection

Given the original listing, with no changes, we can ask about the existence of a file within the listing, e.g. take:

  • opening-of-parliament-2021.doc
    • a0e5ea45f7e4c2fd2669ce92565063f8

We would simply create a hash for the directory that the file sits in, “speeches”, and then take the hashes of the two other directories sitting just below the collection directory: planning and policies, and we would arrive at a collection hash:

Planning: 23bede883649889f67f1428507ac69c2 +
Speeches: md5(a0e5ea45f7e4c2fd2669ce92565063f8) =
  3a9ccc1adcda8f2062cb2ea48b62bfa6 +
Policies: c6ca13ea99414abb036dad0b06bba431

= b09251f91b3bcc56b33a049e43c1d8c3 <- our collection hash

If these values all add up, we can say “file (x) was part of collection (y)”.
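
As a minimal sketch of that check, the snippet below recomputes the collection hash from the queried file hash and the two sibling directory hashes; verify_membership and md5_of are illustrative helpers of mine, not the sumfolder1 API.

import hashlib


def md5_of(items):
    """Hash a list of hex digests by digesting them in order."""
    digest = hashlib.md5()
    for item in items:
        digest.update(item.encode())
    return digest.hexdigest()


def verify_membership(file_hash, planning_hash, policies_hash, expected_collection_hash):
    """Recompute the speeches hash from the file hash, then the collection hash."""
    speeches_hash = md5_of([file_hash])
    recomputed = md5_of([planning_hash, speeches_hash, policies_hash])
    return recomputed == expected_collection_hash


print(
    verify_membership(
        "a0e5ea45f7e4c2fd2669ce92565063f8",  # opening-of-parliament-2021.doc
        "23bede883649889f67f1428507ac69c2",  # strategic-planning
        "c6ca13ea99414abb036dad0b06bba431",  # policies
        "b09251f91b3bcc56b33a049e43c1d8c3",  # recorded collection hash
    )
)  # -> True, assuming the hashes above are correct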

Proving a negative

If the file we were asking about had even the slightest difference in checksum: “b0e5ea45f7e4c2fd2669ce92565063f8” instead of “a0e5ea45f7e4c2fd2669ce92565063f8”, then the collection hash would be completely different: “7f7287697cf021c3850272977333ee69”. We would be able to say, “file with hash(x) was not part of collection (y)”.

📁 collection 7f7287697cf021c3850272977333ee69
   📁 policies c6ca13ea99414abb036dad0b06bba431
      📄 fiscal-policy-2022.doc 144688d499ba783230b5583012cc76f1
      📄 fiscal-policy-2021.doc bdd43f731248afe2a469042a3c98ee55
   📁 speeches 2e8a92cd1f60636f6a3fb502d658e5b8
      📄 opening-of-parliament-2021.doc b0e5ea45f7e4c2fd2669ce92565063f8
   📁 strategic-planning 23bede883649889f67f1428507ac69c2
      📄 budget-2022.xlsx 37692aad8bd0fc4b83c2fa005ac81c95
      📄 foreigh-policy.doc 600f19f72ed65fbc689734d3c62265c6
      📄 schools.doc 9111b5b9989fecab44e07aff47aa73c3

We can ask the same question of any other file in the collection, including where there are duplicates, and we can ask about folders and their existence too, as well as whether folders or files are empty.

Introducing sumfolder1

I have expanded on this proof of concept in the tool sumfolder1.

sumfolder1 performs the same function as demonstrated here, using a DROID file format identification report as its input (both DROID and the compatible Siegfried variant are supported).

If we ask the question again, whether the file with hash “a0e5ea45f7e4c2fd2669ce92565063f8” is part of archival collection “b09251f91b3bcc56b33a049e43c1d8c3”, we get the following result from sumfolder1:

"query": {
  "search_hash": "a0e5ea45f7e4c2fd2669ce92565063f8",
  "found": true,
  "type": "File",
  "results": [
    {
      "name": "opening-of-parliament-2021.doc",
      "parent_folder": "speeches (3a9ccc)",
      "parent_folder_name": "speeches",
      "containing_dirs": [
        "3a9ccc1adcda8f2062cb2ea48b62bfa6",
        "b09251f91b3bcc56b33a049e43c1d8c3"
      ]
    }
  ]
}

I have released the proof of concept on PyPI, so it is something that everyone reading this blog can try today. More information about how sumfolder1 works is available in the README. I’ll add to the README over time, and aside from this blog, I suspect writing this work up in a more formal paper would be beneficial to those who are interested.

Next steps

What I’d love to ask those reading this blog today is to take a look at this work, give it a try, and help me to improve it if they see, as I do, potential value in it. One person’s script is a long way away from becoming something others can use. The code is written very defensively, and it is very much a proof-of-concept with lots of ways to evolve. 

One question folks will have is: why have I chosen MD5? Really, it’s just a little bit shorter for demonstration purposes, and it’s a little bit less information for the reader to comprehend. The code adopts something approaching a strategy pattern that enables SHA-256 and BLAKE2 algorithms to be trivially swapped in.
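
As an illustration of that idea (and not the actual sumfolder1 code), the sketch below passes the hash constructor in as a parameter so that MD5, SHA-256, or BLAKE2 can be swapped in without changing the summing logic.

import hashlib


def sum_digest(listing, algorithm=hashlib.md5):
    """Hash a list of digests using the supplied hash constructor."""
    digest = algorithm()
    for item in listing:
        digest.update(item.encode())
    return digest.hexdigest()


policies = [
    "144688d499ba783230b5583012cc76f1",
    "bdd43f731248afe2a469042a3c98ee55",
]

print(sum_digest(policies))                   # MD5, as used throughout this post
print(sum_digest(policies, hashlib.sha256))   # SHA-256
print(sum_digest(policies, hashlib.blake2b))  # BLAKE2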

For those looking to try this immediately, there’s a demo mode, and the option to download a reference spreadsheet generated using a reference set included with the tool, using the following commands:

Demo:

  • python sumfolder1 --demo

Reference set: 

  • python sumfolder1 --ref

Normally a DROID CSV will be provided to the tool:

  • python sumfolder1 --csv <csv_name>

Summarizing sumfolder1

sumfolder1 potentially gives us a new way of talking about folders within an archival collection. We’ve so far been able to approximate talking about collections as a whole through techniques like bags, zips, and archival management systems/collections that provide intellectual control of objects such as these. We haven’t yet found a way to create a mathematical and cryptographic representation of a digital collection as a whole.

sumfolder1 provides us with a mathematical approach to proving a collection is whole and intact, and to proving this consistently and predictably.

While I am aware storage mechanisms are changing, e.g. object storage is increasingly popular, and that trust mechanisms are changing too, with blockchain improving and providing new techniques to investigate each new year, I do feel there may still be some value in looking at these things the old-fashioned way.

I hope that this blog and the topics raised by sumfolder1 are useful, but it will take a lot more prodding and probing to understand whether it is useful to the community.

Thinking in collections

The one takeaway I want to highlight is that, while some researchers focus on the file as the first-class object in digital preservation workflows, archivists have for a long time looked at collections and context. When we look at cyber-security risks in isolation, for example, we have to ask ourselves:

  • What did we receive, and what does the collection look like?

If we can identify something, risky or not, as part of a collection of objects that we took custody of, we have one set of maintenance or disposal questions to think about. If we can identify something as new, modified, or not belonging, then we have another set. 

There’s a lot to be said for archival thinking: thinking about context, about a collection as a whole, and about improving trust surrounding that, not just the individual items.

5 Comments

  1. Ross Spencer
    January 23, 2023 @ 11:02 am CET

    Thanks Nathan! Totally agree about the impact on sustainability. I should have touched upon that briefly here, and it’s something I should address in a long-form write-up. It would be great if this approach could provide alternatives to file-level checksums, perhaps something that can be rolled into sampling and spot-checking? I’m not sure yet. I was wondering about object storage so I appreciate you bringing that up, and your other comments too, especially about OCFL. Complexity-wise, I really hope to bring that down a notch. The primary loop of the tool is a bit over-complicated, but I think with the testing there and some immediate changes that have been identified it should get a bit tighter.

  2. Nathan Tallman
    January 19, 2023 @ 6:55 pm CET

    Great work, Ross. Although sumfolder1 has a lot more complexity, this reminds me a lot of AWS etags, which are essentially an md5 hash or hash of hashes in the case of multiple files. It’s an effective way to monitor at a higher level, though we should consider the sustainability impact of adding additional fixity processes. Maybe this is an alternative to the hegemony of file-level checksums or complementary to adding cryptographic signatures to files/packages.

    I once wrote a script to migrate content off CD-ROMs and used a folder-level checksum approach that involved calculating the hash of each file, adding it to a new tmp file, then hashing that file. But I really like how you’ve designed sumfolder1. It’s giving off OCFL vibes too.

  3. Ross Spencer
    January 19, 2023 @ 11:28 am CET

    Hi Matt, thanks for reading and the positive response!!

    The question of folder names came up in a discussion on the repository. I have responded there: https://github.com/ross-spencer/sumfolder1/discussions/5#discussioncomment-4706195

    This particular form of the work is meant (at least as I conceived it) to be content-based. That’s not to say the other use-case (making folder naming part of the algorithm) can’t be considered. Perhaps it is another fork? I am more than happy to entertain the idea as the code becomes more stable.

    NB. thanks too for the info on length extension attacks, I will make sure to add that!

  4. mattpalmer1086
    January 19, 2023 @ 11:03 am CET

    One more issue. Hash functions like MD5 rely on applying their state iteratively. You can *add* something to an existing hash by taking the hash and just running the digest with it and the new content.

    So this does not protect against someone *adding* things to an archive. This is known as a length extension attack in cryptography:

    https://en.wikipedia.org/wiki/Length_extension_attack

    The defence is to use a slightly more convoluted function that is immune to length extension. The HMAC algorithm can do this. This also works:

    H’ = H(H(0b || m)), where you first take the hash of the message with some zeros prepended to it, and then take the hash of that hash.

  5. mattpalmer1086
    January 19, 2023 @ 10:16 am CET

    Great idea! One thing that isn’t done is to bind the actual folder to the contents. So it would be possible to switch the hash contents of one folder with another.

    I don’t know if the name, or the path of the folder is what should identify it in an archive. Maybe it has some other persistent digital identifier. In any case, however the folder is permanently identified, that should also be bound in to the digest computation.
