Checksums Flashcards Preview

Flashcards in Checksums Deck (29)
What is a...

Checksum
A string generated by a hash algorithm/hash function that allows us to detect changes to a stream of data, i.e. by comparing the result of a hash algorithm after data transfer with the one we generated before data transfer.

In digital preservation we tend to use the term checksum interchangeably with the word hash – the fixed length string generated by something called a cryptographic hash function (MD5, SHA1, SHA256, etc.).

Checksum may also refer to the process of comparing two checksum values – checking the sum – for changes in the data stream.

A checksum will usually be made up of hexadecimal characters 0-9 and A-F.
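As a rough illustration of the before-and-after comparison described on this card, the following Python sketch uses the standard `hashlib` module. The function name `checksum` and the sample byte strings are assumptions made for this example only.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the MD5 checksum of a byte stream as lowercase hexadecimal."""
    return hashlib.md5(data).hexdigest()

# Hypothetical data stream, checksummed before transfer.
sent = b"example record content"
before = checksum(sent)

# Simulate one intact transfer and one that corrupts a single byte.
received_ok = bytes(sent)
received_bad = b"exbmple record content"

print(before)                            # 32 hexadecimal characters
print(checksum(received_ok) == before)   # True: data unchanged
print(checksum(received_bad) == before)  # False: data changed
```

Note that `hexdigest()` prints the letters a-f in lowercase; the value is the same as its uppercase form.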


What is the...

Purpose of a Checksum

A checksum algorithm calculates a fixed length string based on the data in a file alone.

A file with the letters USA will always have the same MD5 checksum.

If a single bit changes, the checksum will be unrecognisably different.

A file with the letters USB (USA to USB, a change of two bits) has a completely different checksum.
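The USA and USB comparison can be tried out with a short Python sketch. It assumes each file contains exactly the three ASCII letters with no trailing newline; a file saved with a newline would give different checksums.

```python
import hashlib

usa = b"USA"
usb = b"USB"

# Count how many bits differ between the two inputs.
input_bits_changed = sum(bin(x ^ y).count("1") for x, y in zip(usa, usb))

md5_usa = hashlib.md5(usa).hexdigest()
md5_usb = hashlib.md5(usb).hexdigest()

print(input_bits_changed)   # 2
print(md5_usa)
print(md5_usb)
print(md5_usa == md5_usb)   # False
```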


Checksums are great for spotting data integrity errors – the key to digital preservation.

Bit level preservation is simply about checking the checksums – constantly.

What is the meaning of...

Data vs. Filename

Checksums are calculated on the data inside a file. If a filename changes, the checksum of the file is still the same because the data inside hasn’t been changed. If a file is copied and given another filename, the checksums of the two files will be identical.

Checksums only operate on the data inside the file.
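A minimal sketch of this behaviour, assuming Python 3.8 or later and using a temporary directory so nothing on disk is touched. The filenames and the helper `file_checksum` are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def file_checksum(path: Path) -> str:
    """SHA-256 of a file's contents; the filename is never fed into the hash."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "report.txt"          # hypothetical filename
    original.write_bytes(b"the data inside the file")
    first = file_checksum(original)

    # Rename the file, then checksum it again.
    renamed = original.rename(Path(tmp) / "renamed.txt")
    after_rename = file_checksum(renamed)

    # Copy the file under a new name, then checksum the copy.
    copy = Path(shutil.copy(renamed, Path(tmp) / "copy.txt"))
    after_copy = file_checksum(copy)

print(first == after_rename == after_copy)  # True
```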

What is a...

Hash Function

A mapping of data of arbitrary length to a fixed length string. The output of a hash function can be called a hash value, hash code, digest, or simply a hash. A checksum in digital preservation is a hash of the data inside a file.

What is a...


A fixed length string. The output of a hash function.

What is a...

Cryptographic Hash Function

A cryptographic hash function is a one-way function: the original data cannot be determined from the hash value itself – it is infeasible to invert the function. Cryptographic hashes are considered quick to compute. The cryptographic hash functions employed in digital preservation are also widely used elsewhere, so they are well tested and many tools support their use in our workflows.

What is a...

One Way Function

A transformation of data such that the result cannot be transformed back into the original.


What do checksums look like?

Fixed length strings. Hexadecimal characters 0-9, A-F.
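The fixed length can be checked with a small Python sketch: whatever the size of the input, each algorithm returns a hexadecimal string of the same length. The three sample inputs are arbitrary choices for illustration.

```python
import hashlib

inputs = [b"", b"a", b"a" * 1_000_000]  # zero bytes, one byte, a million bytes

# For each algorithm, collect the set of checksum lengths seen across all inputs.
lengths = {
    name: {len(hashlib.new(name, data).hexdigest()) for data in inputs}
    for name in ("md5", "sha1", "sha256")
}
print(lengths)  # {'md5': {32}, 'sha1': {40}, 'sha256': {64}}
```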

What is...

MD5
  • Message Digest 5.
  • 32 character string.
  • Theoretically, about 21 quintillion files needed for an even chance of an accidental collision.


What is...

SHA1
  • Secure Hash Algorithm 1.
  • 40 character string.
  • Theoretically, about 1 septillion files needed for an even chance of an accidental collision.

What is...

SHA256
  • Secure Hash Algorithm 256.
  • 64 character string.
  • Theoretically, about 400 undecillion files needed for an even chance of an accidental collision.
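The collision figures on the MD5, SHA1 and SHA256 cards are consistent with the standard birthday-bound approximation, in which roughly sqrt(2 · ln 2 · 2^bits) random inputs give a 50% chance of at least one accidental collision. That this is how the cards' figures were derived is an assumption; the sketch below simply shows the arithmetic, and the function name `files_for_even_odds` is hypothetical.

```python
import math

def files_for_even_odds(bits: int) -> float:
    """Approximate number of random inputs for a 50% chance of any collision."""
    return math.sqrt(2 * math.log(2) * 2.0 ** bits)

# Output sizes in bits: 32, 40 and 64 hexadecimal characters at 4 bits each.
for name, bits in (("MD5", 128), ("SHA1", 160), ("SHA256", 256)):
    print(f"{name}: {files_for_even_odds(bits):.2e}")
```

These are odds for accidental collisions between unrelated files, not the cost of deliberately engineering one.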



Other Cryptographic Hashes...

  • BLAKE-256
  • BLAKE-512
  • MD5
  • RadioGatún
  • SHA-1
  • SHA-256
  • Spectral Hash
  • Streebog
  • Tiger
  • Whirlpool

What is...




The MD5 checksum of a zero-byte file. Every hash algorithm produces a checksum for a zero-byte file, including:

  • MD5:



  • SHA1:



  • SHA256:
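The zero-byte checksums for these three algorithms can be regenerated at any time by hashing an empty byte string. A minimal Python sketch, with the variable name `empty_checksums` chosen for this example:

```python
import hashlib

# Hash zero bytes of input with each algorithm.
empty_checksums = {
    name: hashlib.new(name, b"").hexdigest()
    for name in ("md5", "sha1", "sha256")
}

for name, value in empty_checksums.items():
    print(f"{name}: {value}")
```

Seeing one of these values in a manifest is a useful hint that a file is empty, for example after a failed transfer.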




What are...

Collisions
  • A collision happens when two different data streams result in the same checksum value.
  • This is a big concern when a checksum is used for security purposes (e.g. in password applications).
  • A collision is computationally difficult to engineer but not impossible.
  • Collisions could of course be incidental.
  • An engineered collision for SHA1 (the SHAttered attack, announced in 2017) took knowledge of the algorithm, plus 9,223,372,036,854,775,808 SHA-1 computations, 6,500 years of CPU (Central Processing Unit) time, and 110 years of GPU (Graphics Processing Unit) time, to create.
  • Collisions are not a huge concern in digital preservation because multiple checksums may often be created for a single file to avoid such a situation.
  • Archivists also have the concept of fixity.
  • Collisions are a bigger concern when workflows rely on just a single checksum to align large amounts of data.
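Creating multiple checksums for a single file need not mean reading it several times. The following is a sketch, assuming Python's `hashlib`, that feeds each chunk of a stream to several algorithms in one pass. The function name `multi_checksum`, the chunk size and the sample data are illustrative choices.

```python
import hashlib
import io

def multi_checksum(stream, algorithms=("md5", "sha1", "sha256"), chunk_size=65536):
    """Read a binary stream once and return one hex digest per algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    # Read until the stream returns an empty bytes object (end of data).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}

digests = multi_checksum(io.BytesIO(b"example record content"))
for name, value in digests.items():
    print(name, value)
```

Two different files would have to collide under every algorithm at once to go unnoticed, which is far less likely than a collision under one.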

What is...

SHA1DEEP
A useful tool available for Linux and Windows for generating checksums recursively for a directory or directories of files. SHA1DEEP has compatriot tools MD5DEEP and SHA256DEEP.
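To show the idea of recursive checksumming, here is a minimal Python sketch. It is not SHA1DEEP and makes no attempt to match that tool's options or output format; the function name `sha1_tree` and the sample files are hypothetical, and Python 3.9 or later is assumed.

```python
import hashlib
import tempfile
from pathlib import Path

def sha1_tree(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-1 checksum."""
    result = {}
    for path in sorted(root.rglob("*")):   # rglob walks all subdirectories
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            result[relative] = hashlib.sha1(path.read_bytes()).hexdigest()
    return result

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_bytes(b"first")
    (root / "sub" / "b.txt").write_bytes(b"second")
    manifest = sha1_tree(root)

for name, digest in manifest.items():
    print(digest, name)
```

For large files, reading in chunks rather than with `read_bytes()` would keep memory use low.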

What checksums are...

Supported in Rosetta

Checksums supported in the Rosetta digital preservation system are CRC32, MD5, and SHA1.

What checksums are...

Supported by DROID

Checksums supported in DROID are MD5, SHA1, and SHA256.

What is...

AV Preserve Fixity

AV Preserve Fixity is a software agent for scheduling the scanning and checking of checksums for a given directory or directories of files. If a comparison fails, that is, a file that is expected to match doesn’t, then an email is sent alerting users to the error so that they can initiate procedures to restore the original data from backups. The tool is maintained by AV Preserve.
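The core of such a check, comparing stored checksums with freshly computed ones, can be sketched in a few lines of Python. This is not how AV Preserve Fixity itself is implemented; scheduling and email are left out, and the function name `failed_files` and the sample files are hypothetical (Python 3.9 or later assumed).

```python
import hashlib
import tempfile
from pathlib import Path

def failed_files(root: Path, expected: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose SHA-256 no longer matches."""
    failures = []
    for relative, stored in expected.items():
        path = root / relative
        current = hashlib.sha256(path.read_bytes()).hexdigest() if path.is_file() else None
        if current != stored:
            failures.append(relative)
    return failures

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "keep.txt").write_bytes(b"unchanged")
    (root / "rot.txt").write_bytes(b"original")
    # Record the checksums at the point of ingest.
    stored = {p.name: hashlib.sha256(p.read_bytes()).hexdigest() for p in root.iterdir()}

    (root / "rot.txt").write_bytes(b"originbl")   # simulate silent corruption
    failures = failed_files(root, stored)

print(failures)  # ['rot.txt']
```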

What is...

De-duplication

Because a given data input will always output the same checksum value, checksums are great for de-duplication, that is, the removal of duplicate files containing the same information.

In an archival context this may be more complicated where a duplicate record has multiple contexts.

In some storage systems, checksums can be used to store no more than one copy of an object that can then be referenced from multiple contexts.
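A small sketch of checksum-based duplicate detection in Python. The filenames and contents are made up for illustration, and the code only reports duplicates rather than deleting them, since the context of each copy may matter.

```python
import hashlib
from collections import defaultdict

# Hypothetical in-memory "files": name -> content.
files = {
    "minutes_v1.txt": b"board minutes",
    "minutes_copy.txt": b"board minutes",
    "agenda.txt": b"board agenda",
}

# Group filenames by the checksum of their content.
by_checksum = defaultdict(list)
for name, data in files.items():
    by_checksum[hashlib.sha256(data).hexdigest()].append(name)

duplicates = [names for names in by_checksum.values() if len(names) > 1]
print(duplicates)  # [['minutes_v1.txt', 'minutes_copy.txt']]
```

A storage system that keeps one copy per checksum works on the same principle: the checksum is the key, and each filename is a reference to it.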

What is...

Authenticity and Integrity

Checksums can prove data hasn’t changed, which can help us to prove a record's authenticity and integrity from the point of transfer.

In UNESCO Memory of the World terms, integrity is the quality of being ‘uncorrupted and free of unauthorized and undocumented changes’ (UNESCO 2003).

What is the...

UNESCO definition of integrity

The state of being whole, uncorrupted and free of unauthorised and undocumented changes. (UNESCO, 2003)

What is...


Checksums are unique to a data stream and thus can become unique, fixed-length identifiers for those files. Using checksums, we can keep track of our files through various automated workflows.


Just a large number!

Checksums are just really big numbers. Computers are really good at working with numbers, which is why checksums are good for automated processes and comparisons. Converting a hexadecimal checksum to a decimal number in Google gives, for example, 2.8194977e+38.
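The conversion can also be done in Python, which handles integers of any size. The sketch below assumes the hexadecimal value on this card is the MD5 checksum of a zero-byte file; that is an inference from the decimal figure quoted, not something the card states.

```python
import hashlib

hex_value = hashlib.md5(b"").hexdigest()   # MD5 of a zero-byte input (assumed value)
as_integer = int(hex_value, 16)            # base-16 string to a Python integer

print(hex_value)
print(as_integer)              # the full 39-digit decimal number
print(f"{as_integer:.7e}")     # 2.8194977e+38
```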

What is...

Hexadecimal
A number system of 16 characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. Hexadecimal can represent all numbers. Its primary application is the representation of binary numbers, with each byte written as two hexadecimal digits. Hexadecimal makes binary easier to read: for example, the number 255 is 0b11111111 in binary and 0xFF in hexadecimal. A hexadecimal number is often prefixed with the number zero and the letter ‘x’ to signal that the following characters are hexadecimal.
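Python's built-in functions give a quick way to move between decimal, binary and hexadecimal. The values below are arbitrary examples.

```python
value = 255

print(bin(value))                # 0b11111111
print(hex(value))                # 0xff
print(int("0xFF", 16))           # 255

# Each byte becomes exactly two hexadecimal digits.
two_bytes = bytes([222, 173])
print(two_bytes.hex())           # dead
```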

What is the meaning of...

Checksums vs. Fixity

If a checksum should fail for any reason, archivists also have the concept of fixity: the concept of ‘remaining fixed in state’. We can observe file dates, e.g. modified and creation dates. We can also look at the content, and at clues in the content, for features that help us to prove a digital file is what it purports to be. There is only one Domesday Book – we have many ways of proving this is what it is without a checksum value per se.

What is...

Deterministic but Unpredictable

Cryptographic hashes are deterministic, meaning that for a given piece of data the same output, that is, the same checksum value, will always be generated.

Output is, however, unpredictable between inputs, meaning that similar (not the same) input results in a radically different looking checksum value, so the original data cannot be predicted.
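Both properties can be observed with a short Python sketch that hashes the same input twice and then compares the output bits of two similar inputs. The helper `differing_bits` and the sample inputs are chosen for this example.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

first = hashlib.sha256(b"USA").digest()
second = hashlib.sha256(b"USB").digest()
repeat = hashlib.sha256(b"USA").digest()

print(first == repeat)                 # True: deterministic
print(differing_bits(first, second))   # typically near half of the 256 output bits
```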

What is...

Uniform Distribution

A feature of a cryptographic hash function that makes it difficult to reverse engineer. Outputs are uniformly distributed across the range of possible values, meaning every possible output has an equal chance of occurring – you won’t see clusters of similar checksums output for similar (not the same) chunks of data.
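A rough way to see this in practice is to hash many near-identical inputs and tally one character of the output. The sketch below is an informal check, not a statistical proof, and the input naming scheme `record-N` is made up.

```python
import hashlib
from collections import Counter

# Hash 16,000 very similar inputs ("record-0", "record-1", ...) and tally
# the first hexadecimal character of each SHA-256 checksum.
counts = Counter(
    hashlib.sha256(f"record-{i}".encode()).hexdigest()[0]
    for i in range(16_000)
)

for char in sorted(counts):
    print(char, counts[char])   # each count should sit close to 1,000
```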

What is...

Infeasible to Invert

It is computationally difficult and time consuming to reverse engineer the output of a cryptographic hash function. The only mechanism for doing it would be to try all possible combinations of input, yet the original data size is not known, and there are no clues to the original data type or content.

What are...

Fuzzy Hashes

Having understood checksums, one might also be interested in fuzzy hashes. These are used in an alternative way to the checksums discussed here.

Fuzzy hashes are used to determine the similarity of content – e.g. to determine when only small changes have been made to a data stream.

This property of fuzzy hashes can be exploited to perform content sentencing, or to point users to similar content if there is a record available.