A string generated by a hash algorithm (hash function) that allows us to detect changes to a stream of data, e.g. by comparing the hash generated after a data transfer with the one generated before it.
In digital preservation we tend to use the term checksum interchangeably with the word hash – the fixed length string generated by something called a cryptographic hash function (MD5, SHA1, SHA256, etc.).
Checksum may also refer to the process of comparing two checksum values – checking the sum – for changes in the data stream.
A checksum will usually be made up of hexadecimal characters 0-9 and A-F.
Purpose of a Checksum
A checksum algorithm calculates a fixed length string based on the data in a file alone.
A file containing the letters USA will always have the same MD5 checksum.
If a single bit of the data changes, the checksum will be unrecognisably different.
A file with the letters USB (USA to USB, a change of just two bits) has a completely different checksum.
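This behaviour can be demonstrated with Python's standard-library hashlib, rather than quoting the checksum values here:

```python
import hashlib

# MD5 digests of two inputs differing by only two bits ('A' is 0x41, 'B' is 0x42)
usa = hashlib.md5(b"USA").hexdigest()
usb = hashlib.md5(b"USB").hexdigest()

print(usa)  # a 32-character hexadecimal string
print(usb)  # a completely different 32-character hexadecimal string
```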
Checksums are great for spotting data integrity errors – the key to digital preservation.
Bit level preservation is simply about checking the checksums – constantly.
Data vs. Filename
Checksums are calculated on the data inside a file. If a filename changes, the checksum of the file is still the same because the data inside hasn't been changed. If a file is copied and given another filename, the checksums of the two files will be identical.
Checksums only operate on the data inside the file.
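A minimal sketch of this using Python's hashlib and a temporary directory (the filenames and data are illustrative):

```python
import hashlib
import tempfile
from pathlib import Path

# Two files, different names, identical contents
tmp = Path(tempfile.mkdtemp())
(tmp / "report.txt").write_bytes(b"some preserved data")
(tmp / "copy-of-report.txt").write_bytes(b"some preserved data")

h1 = hashlib.md5((tmp / "report.txt").read_bytes()).hexdigest()
h2 = hashlib.md5((tmp / "copy-of-report.txt").read_bytes()).hexdigest()
assert h1 == h2  # the filename plays no part in the checksum
```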
A mapping of data of arbitrary length to a fixed length string, the output of a hash function can be called a hash value, hash code, digest, or simply a hash. A checksum in digital preservation is a hash of the data inside a file.
A fixed length string. The output of a hash function.
Cryptographic Hash Function
A cryptographic hash function is a one-way function such that the original data cannot be determined from the hash value itself – it is infeasible to invert the function. Cryptographic hashes are quick to compute. The cryptographic hash functions employed in digital preservation have wide application elsewhere, so they are well tested and there are many tools that can support their use in our workflows.
One Way Function
A transformation of data such that the result cannot be transformed back into the original.
What do checksums look like?
Fixed length strings. Hexadecimal characters 0-9, A-F.
- MD5 – Message Digest 5.
- 32 character string.
- Theoretically, 21 quintillion files needed for a collision.
- SHA1 – Secure Hash Algorithm 1.
- 40 character string.
- Theoretically 1 septillion files needed for a collision.
- SHA256 – Secure Hash Algorithm 256.
- 64 character string.
- Theoretically 400 undecillion files needed for a collision.
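These collision figures follow from the birthday problem: roughly 1.1774·√(2ⁿ) random inputs are needed for a 50% chance of a collision in an n-bit hash. A quick check in Python:

```python
import math

def birthday_bound(bits):
    """Approximate number of random inputs needed for a 50% chance
    of at least one collision in an n-bit hash (birthday problem)."""
    return 1.1774 * math.sqrt(2 ** bits)

print(f"MD5    (128-bit): {birthday_bound(128):.2e}")  # ~2.2e19, '21 quintillion'
print(f"SHA1   (160-bit): {birthday_bound(160):.2e}")  # ~1.4e24, '~1 septillion'
print(f"SHA256 (256-bit): {birthday_bound(256):.2e}")  # ~4.0e38, '400 undecillion'
```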
Other Cryptographic Hashes...
- Spectral Hash
d41d8cd98f00b204e9800998ecf8427e – the MD5 checksum of a zero-byte file. Other hash functions are equally capable of generating a checksum from a zero-byte file.
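Python's hashlib can confirm the zero-byte checksums directly:

```python
import hashlib

# Hashing zero bytes of input still yields a well-defined checksum
empty = b""
md5_empty = hashlib.md5(empty).hexdigest()
sha1_empty = hashlib.sha1(empty).hexdigest()
sha256_empty = hashlib.sha256(empty).hexdigest()

print(md5_empty)   # d41d8cd98f00b204e9800998ecf8427e
print(sha1_empty)  # da39a3ee5e6b4b0d3255bfef95601890afd80709
print(sha256_empty)
```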
- A collision happens when two different data streams result in the same checksum value.
- This is a big concern when a checksum is used for security purposes (e.g. in password applications).
- A collision is computationally difficult to engineer but not impossible.
- Collisions could of course be incidental.
- An engineered collision for SHA1 (demonstrated in 2017) took knowledge of the algorithm, plus 9,223,372,036,854,775,808 SHA-1 computations, 6,500 years of CPU (Central Processing Unit) time, and 110 years of GPU (Graphics Processing Unit) time to create.
- Collisions are not a huge concern in digital preservation because multiple checksums may often be created for a single file to avoid such a situation.
- Archivists also have the concept of fixity.
- Collisions are a bigger concern when workflows rely on just a single checksum to align large amounts of data.
SHA1DEEP is a useful tool, available for Linux and Windows, for generating checksums recursively for a directory or directories of files. It has companion tools MD5DEEP and SHA256DEEP.
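A rough Python sketch of what a recursive checksum tool like SHA1DEEP does (the function names here are my own, not part of the tool):

```python
import hashlib
import os

def sha1_of_file(path, chunk_size=65536):
    """Stream a file through SHA1 so large files need not be read into memory at once."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def walk_checksums(root):
    """Yield (checksum, path) pairs for every file under root, recursively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            yield sha1_of_file(path), path
```

Streaming in chunks matters in preservation workflows, where individual files can be far larger than available memory.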
Supported in Rosetta
Checksums supported in the Rosetta digital preservation system are CRC32, MD5, and SHA1.
Supported by DROID
Checksums supported in DROID are MD5, SHA1, and SHA256.
AV Preserve Fixity
AV Preserve Fixity is a software agent for scheduling the scanning and checking of checksums for a given directory or directories of files. If a comparison fails – that is, a file that is expected to match doesn't – an email is sent alerting users to the error, enabling them to initiate procedures to restore the original data from backups. The tool is maintained by AV Preserve.
Because a given data input will always produce the same checksum value, checksums are great for de-duplication – the removal of duplicate files containing the same information.
In an archival context this may be more complicated where a duplicate record has multiple contexts.
In some storage systems, checksums can be used to store no more than one copy of an object that can then be referenced from multiple contexts.
Authenticity and Integrity
Checksums can prove data hasn’t changed which can help us to prove a record's authenticity and integrity from the point of transfer.
In UNESCO Memory of the World terms, integrity is the quality of being ‘uncorrupted and free of unauthorized and undocumented changes’ (UNESCO 2003).
UNESCO definition of integrity
The state of being whole, uncorrupted and free of unauthorised and undocumented changes. (UNESCO, 2003)
Checksums are effectively unique to a data stream and so can serve as unique, fixed-length identifiers for our files. Using them, we can keep track of files through various automated workflows.
Just a large number!
Checksums are just really big numbers. Computers are really good at working with numbers, which is why checksums are good for automated processes and comparisons. If we convert the hexadecimal d41d8cd98f00b204e9800998ecf8427e to a decimal number in Google we get 2.8194977e+38.
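In Python the same conversion is a single call:

```python
# A 32-character MD5 checksum is just a 128-bit number written in hexadecimal
digest = "d41d8cd98f00b204e9800998ecf8427e"  # MD5 of a zero-byte file
as_int = int(digest, 16)
print(as_int)            # the full 39-digit decimal number
print(f"{as_int:.7e}")   # the same number in scientific notation
```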
A number system of 16 characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. Hexadecimal can represent all numbers. Its primary application is the representation of binary data, with each byte written as two hexadecimal digits. Hexadecimal makes binary easier to read; for example, the number 255 in binary is 0b11111111, and in hexadecimal is 0xFF. A hexadecimal number is often prefixed with the digit zero and the letter ‘x’ to signal that the following characters are hexadecimal.
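Python uses the same 0b and 0x prefixes, so the example can be checked directly:

```python
# 255 written in binary and hexadecimal notation
assert 0b11111111 == 255
assert 0xFF == 255
print(bin(255))  # 0b11111111
print(hex(255))  # 0xff
```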
Checksums vs. Fixity
If a checksum should fail for any reason, archivists also have the concept of fixity. The concept of ‘remaining fixed in state'. We can observe file date ranges, e.g. modified and creation date. We can also look at the content and clues in the content for features that help us to prove a digital file is what it purports to be. There is only one Domesday Book – we have many ways of proving this is what it is without a checksum value per se.
Deterministic but Unpredictable
Cryptographic hashes are deterministic, meaning that for a given piece of data the same output – the same checksum value – will always be generated.
Output is, however, unpredictable between inputs, meaning that similar (but not identical) inputs produce radically different-looking checksum values, so the original data cannot be predicted from the output.
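One way to see this unpredictability is to count how many bits of the output change when the input changes by only two bits – roughly half of them should flip. A sketch using hashlib:

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length digests."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

d1 = hashlib.sha256(b"USA").digest()
d2 = hashlib.sha256(b"USB").digest()
# Two bits of input difference flip roughly half of the 256 output bits
print(bit_difference(d1, d2))
```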
A feature of a cryptographic hash function that makes it difficult to reverse engineer. The range of outputs for any given input is uniformly distributed meaning every possible output has an equal chance of occurring – you won’t see chunks of similar checksums output for similar (not the same) chunks of data.
Infeasible to Invert
It is computationally difficult and time-consuming to reverse engineer the output of a cryptographic hash function. The only mechanism to do it would be to try all possible combinations of input; yet the original data size is not known, and there are no clues to the original data type or content.
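A toy illustration of that brute-force approach: with a tiny, known input space the search is instant, but it cannot scale to inputs of unknown size and content.

```python
import hashlib
from itertools import product
from string import ascii_uppercase

# Toy preimage search: recover a 3-letter uppercase input from its MD5.
# With only 26**3 = 17,576 candidates this is instant; real inputs have an
# unbounded search space, which is what makes inversion infeasible.
target = hashlib.md5(b"USA").hexdigest()

found = None
for letters in product(ascii_uppercase, repeat=3):
    candidate = "".join(letters).encode()
    if hashlib.md5(candidate).hexdigest() == target:
        found = candidate
        break

print(found)  # b'USA'
```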
Having understood checksums, one might also be interested in fuzzy hashes. These are used in an alternative way to the checksums discussed here.
Fuzzy hashes are used to determine the similarity of content – e.g. to determine when only small changes have been made to a data stream.
This property of fuzzy hashes can be exploited to perform content sentencing, or to point users to similar content if there is a record available.