- PRONOM is a digital preservation technical registry. It is maintained by The National Archives, UK, though its core architecture is little changed since 2004.
- Within the digital preservation community, PRONOM serves as a centralised service for file format signatures.
- Tools that use PRONOM’s signatures also give the user a unique identifier for every file format they can identify.
PRONOM Unique Identifiers (PUIDs) are assigned to all formats in the PRONOM registry. There are two primary types, fmt and x-fmt. The latter is the result of a historical error in which x-fmt identifiers were made available to the public. A subsequent decision was made to maintain x-fmt in favour of continuity as a standard. There is no longer a semantic difference between the identifier types: the x- prefix no longer means experimental, and an x-fmt identifier has the same standing as an fmt identifier.
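As a minimal sketch of the two identifier types, the Python below splits a PUID into its prefix and number. The pattern (a prefix, a slash, then digits) and the function name parse_puid are assumptions made for illustration, not part of any PRONOM tool.

```python
import re

# Assumed PUID shape: "fmt" or "x-fmt", a slash, then a number.
PUID_PATTERN = re.compile(r"(x-fmt|fmt)/(\d+)")


def parse_puid(puid):
    """Return (prefix, number) for a well-formed PUID, otherwise None."""
    match = PUID_PATTERN.fullmatch(puid)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Both prefixes are treated as equally valid; note that the two
# prefixes number their formats separately, so the prefix must be kept.
print(parse_puid("fmt/18"))
print(parse_puid("x-fmt/111"))
print(parse_puid("fmt18"))
```

Keeping the prefix alongside the number matters because fmt and x-fmt identifiers with the same number refer to different records.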
PRONOM web services
PRONOM delivers signature files to tools via web services. DROID, for example, will first use one web service to check for new signatures. If they exist, it will then communicate with a second web service to download those signatures in the form of a 'signature file'. A second type of signature file, the container signature file, is downloaded via more traditional web-based techniques, using a web page’s Last-Modified date to check for new data.
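The Last-Modified check can be sketched in Python as a comparison of two HTTP date strings. This is only an illustration of the idea, not how DROID implements it; the function name remote_is_newer and the example dates are hypothetical, and fetching the header from the server is left out.

```python
from email.utils import parsedate_to_datetime


def remote_is_newer(remote_last_modified, local_last_modified):
    """Compare two HTTP-date strings, e.g. 'Wed, 01 Jan 2020 00:00:00 GMT'.

    remote_last_modified: the Last-Modified value reported by the server.
    local_last_modified: the value remembered for the copy already held.
    """
    remote = parsedate_to_datetime(remote_last_modified)
    local = parsedate_to_datetime(local_last_modified)
    return remote > local


# Hypothetical dates: the server copy is one day newer than the local copy.
if remote_is_newer("Thu, 02 Jan 2020 00:00:00 GMT",
                   "Wed, 01 Jan 2020 00:00:00 GMT"):
    print("A newer container signature file is available")
```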
PRONOM can be accessed via XML making it possible to download and remix. The links look like:
DROID Signature File
- A DROID signature file is an XML file that contains a snapshot of PRONOM in its current state.
- The signature file is split into two sections, or three for container signatures. Its two main components are a list of file formats with their metadata, e.g. format MIMEType, and a mapping from those formats to a list of signatures.
- A container signature file contains a third section of ‘trigger PUIDs’ that is, PUIDs that trigger container identification when a match is found.
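The format-to-signature mapping described above can be sketched by reading a small XML sample in Python. The sample is heavily simplified and hand-written for illustration: its element and attribute names are modelled loosely on a DROID signature file and should be treated as assumptions, and the PUIDs and format names are placeholders. Real files may also declare an XML namespace, in which case the tag names below would need qualifying.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample. Names and values are illustrative only.
SAMPLE = """
<FFSignatureFile Version="0">
  <InternalSignatureCollection>
    <InternalSignature ID="1"/>
    <InternalSignature ID="2"/>
  </InternalSignatureCollection>
  <FileFormatCollection>
    <FileFormat ID="10" Name="Example Format A" PUID="fmt/9001"
                MIMEType="application/x-example">
      <InternalSignatureID>1</InternalSignatureID>
      <InternalSignatureID>2</InternalSignatureID>
    </FileFormat>
    <FileFormat ID="11" Name="Example Format B" PUID="x-fmt/9002"/>
  </FileFormatCollection>
</FFSignatureFile>
"""


def formats_to_signatures(xml_text):
    """Map each PUID to its name, MIME type and list of signature IDs."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for fmt in root.iter("FileFormat"):
        signature_ids = [el.text for el in fmt.findall("InternalSignatureID")]
        mapping[fmt.get("PUID")] = {
            "name": fmt.get("Name"),
            "mime": fmt.get("MIMEType"),
            "signatures": signature_ids,
        }
    return mapping


for puid, record in formats_to_signatures(SAMPLE).items():
    print(puid, record)
```

A format with an empty signature list, like the second one in the sample, has a PUID but cannot be identified by signature.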
A PRONOM release happens when a publishing job is run by The National Archives, UK. Importantly, the draft information in the database is published onto the web, and a signature file is created via a database stored procedure and uploaded to a location where it can be accessed via web service.
PRONOM Release Notes
The PRONOM release notes are released in XML form and are available from the PRONOM index page on the web. Each release is summarised in terms of:
- New Records: New records for file formats that now have PUIDs
- Updated records: Format records in PRONOM that have had their information updated in some way, including signature changes
- New Signatures: File formats that now have signatures associated with them and can be identified via PRONOM
The email address to send format requests to at The National Archives, UK.
DROID-list Google Group
An open community that is a good first place to start for discussing new file format signatures for PRONOM. Because the group is open, folks are invited to help with others’ identification issues. Signatures can be shared, and so can the workload in fixing them. PRONOM development is aided when there is as much information as possible about a file format and its potential signature. Otherwise, all of this work would have to be done by PRONOM’s developers.
DROID was the first client tool to make use of PRONOM signatures. The tool can be pointed at a directory, or directories, of files to recurse through. The files are then matched against the signatures in the signature file. DROID will return a PUID only for those files that do match. For all files, DROID outputs other metadata including last-modified date, and checksum if selected.
Fido was the second client tool to make use of a subset of the PRONOM signatures. Fido was created in Python and uses traditional regular expressions to match file formats with signatures. This meant converting the PRONOM signatures into a format that could be understood by a standard regular expression matching engine. Fido is used in Archivematica and is still maintained under the Open Preservation Foundation’s stewardship.
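To give a feel for that conversion, here is a minimal Python sketch that turns a small subset of PRONOM-style byte sequence syntax into a regular expression. It is not Fido's actual converter. It assumes only four token types (hex byte pairs, ?? for any single byte, * for any run of bytes, and {n} or {n-m} for a gap of bytes) and rejects everything else; the function name to_regex is hypothetical.

```python
import re

# Tokens handled by this sketch: "??", "*", "{n}", "{n-m}" and hex byte pairs.
TOKEN = re.compile(r"\s*(\?\?|\*|\{(\d+)(?:-(\d+))?\}|[0-9A-Fa-f]{2})")


def to_regex(pattern):
    """Convert a simplified PRONOM-style sequence into a compiled bytes regex."""
    pattern = pattern.strip()
    parts = []
    pos = 0
    while pos < len(pattern):
        match = TOKEN.match(pattern, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at position {pos}")
        token = match.group(1)
        if token == "??":
            parts.append(b".")                      # any single byte
        elif token == "*":
            parts.append(b".*")                     # any number of bytes
        elif token.startswith("{"):
            low, high = match.group(2), match.group(3)
            if high is None:
                parts.append(b".{%d}" % int(low))   # exactly n bytes
            else:
                parts.append(b".{%d,%d}" % (int(low), int(high)))
        else:
            parts.append(re.escape(bytes.fromhex(token)))  # literal byte
        pos = match.end()
    # DOTALL so that "." also matches the newline byte.
    return re.compile(b"".join(parts), re.DOTALL)


# "%PDF-", then any one byte, then "." (hex 2E).
pdf_like = to_regex("255044462D??2E")
print(bool(pdf_like.match(b"%PDF-1.4")))
print(bool(pdf_like.match(b"GIF89a")))
```

Using match rather than search anchors the expression to the start of the data, which is how a beginning-of-file sequence at offset zero would behave.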
Siegfried is a more current implementation of a DROID-like tool and utilizes all of the signature information available to DROID. Siegfried uses a different matching algorithm but will return equivalent metadata. Siegfried has a number of strengths:
- It is the first to use additional sources of file format signatures, including a type of signature from freedesktop.org, and a set of signatures from the Library of Congress
- Siegfried is primarily command line based making it easier to integrate with workflows
- Siegfried is also open source like DROID
Brunnhilde is a reporting companion tool for Siegfried created by Tim Walsh. Brunnhilde is part of BitCurator implementations and also integrates reports from sources such as ClamAV virus checker.
Roy is a utility created alongside Siegfried that allows users to customize signature files and Siegfried’s capability with those signature files. For example, it is possible to create custom offsets to match against, or to limit the number of formats with signatures in the signature file, e.g. only image formats for digitization workflows.
Offsets are important to the functionality of a signature: they describe where in a file certain byte patterns (signature patterns) are expected to be found. DROID and Siegfried both offer customisations which limit the size of an offset. These customisations can be used to speed up format identification, e.g. by scanning less data a scan can finish quicker, but this has its trade-offs.
A false positive occurs when a format is matched incorrectly, or imprecisely, in DROID or Siegfried. This can happen when the amount of scanning done by the tool is limited and the format has some similarities to another, e.g. PDF/A files require more bytes to be scanned than regular PDF. A false positive can be hard to spot because a match is, after all, a match. False positives can impact workflow routing and future preservation planning.
Beginning of File (BOF). A file will often have a magic signature in its very first few bytes and so we’ll often be looking at beginning of file sequences.
End of File (EOF). A good signature will also be anchored to another piece of data in the file. This will often be the very end, e.g. PDF provides an end of file sequence that can be used. Programs, while not very efficient at reading every byte in a file, can easily look at the head and tail of an object within a certain threshold of bytes.
Variable sequence (VAR). Some signatures have a moveable sequence specified. These sequences can be anywhere in the file and often require the tool to scan every byte which is slow. A key optimization of file format signatures is trying to remove variable sequences to replace them with fixed byte sequences (BOF or EOF) with larger ranges in which to find them.
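The difference between the three anchors, and the effect of limiting the offset, can be sketched in Python. This is an illustration under simple assumptions rather than the matching algorithm of DROID or Siegfried: sequences are plain byte strings, and the function names and the max_offset parameter are hypothetical.

```python
def match_bof(data, sequence, max_offset=0):
    """True if sequence starts within max_offset bytes of the start of data."""
    window = data[: max_offset + len(sequence)]
    return sequence in window


def match_eof(data, sequence, max_offset=0):
    """True if sequence ends within max_offset bytes of the end of data."""
    window = data[-(max_offset + len(sequence)):]
    return sequence in window


def match_var(data, sequence):
    """True if sequence occurs anywhere; may read every byte of the data."""
    return sequence in data


# A made-up PDF-like byte stream with a trailing newline after the EOF marker.
sample = b"%PDF-1.4\n" + b"x" * 100 + b"%%EOF\n"

print(match_bof(sample, b"%PDF-"))                  # found at offset 0
print(match_eof(sample, b"%%EOF", max_offset=0))    # missed: newline follows it
print(match_eof(sample, b"%%EOF", max_offset=1024)) # found with a wider range
print(match_var(sample, b"%%EOF"))                  # found, but scans it all
```

The second call shows the trade-off of a small offset: the scan is cheaper, but a sequence sitting just outside the window is missed.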
The DROID signature file contains more semantics than the signatures alone. To avoid, as much as possible, two PUIDs being returned for a single file, prioritization of signatures has to take place: if a matching signature has priority over another matching signature, then its result is returned in favour of the other. When looking at the records of signatures in PRONOM, prioritizations are listed on the front page of the record, not the signature page. All this information is included in the signature file, so that is how DROID finds it all in one place.
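Prioritization can be sketched in Python as a filter over the set of matches. The data structure (a dictionary from one identifier to the identifiers it has priority over) and the placeholder identifiers are assumptions for illustration, not the representation any tool uses internally.

```python
def apply_priorities(matches, has_priority_over):
    """Drop every match that another match in the list has priority over."""
    suppressed = set()
    for puid in matches:
        suppressed.update(has_priority_over.get(puid, ()))
    return [puid for puid in matches if puid not in suppressed]


# Placeholder identifiers: the specific format outranks the general one.
priorities = {"specific-format": ["general-format"]}

print(apply_priorities(["general-format", "specific-format"], priorities))
print(apply_priorities(["general-format"], priorities))
```

When only the general format matches, nothing suppresses it, so it is still returned.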
Standard signatures are signatures which look at the byte stream as read by the program, that is, without uncompressing it, or manipulating it in any other way first. What you see is what you get.
Container signatures require the tool to first uncompress the file. Container signatures only exist for OLE2 type files (Microsoft family, plus a few others), and ZIP type files (Microsoft family, Open Office, plus a few others). First a trigger is discovered, and that trigger maps to a set of rules for identification in the container signature which may include the specification of files or folders that must exist, and optionally magic number byte sequences inside specific files.
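For ZIP type files, the idea can be sketched with Python's zipfile module. This is a simplified illustration, not the DROID or Siegfried implementation: zipfile.is_zipfile stands in for the trigger, and the rule format (a list of required paths plus optional byte sequences to find inside named files), the function name and the sample file names are all hypothetical.

```python
import io
import zipfile


def matches_container_rule(data, required_paths, content_checks=None):
    """Check a ZIP byte stream against a simple, made-up container rule.

    required_paths: paths that must exist inside the container.
    content_checks: optional {path: byte sequence} that must be found
                    inside the named file.
    """
    content_checks = content_checks or {}
    # Stand-in for the trigger: is this a ZIP container at all?
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
        if not set(required_paths) <= names:
            return False
        for path, magic in content_checks.items():
            if path not in names or magic not in archive.read(path):
                return False
    return True


# Build a small in-memory ZIP to test against (contents are illustrative).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/x-example")
    archive.writestr("content.xml", "<doc/>")
sample = buffer.getvalue()

print(matches_container_rule(sample, ["mimetype", "content.xml"],
                             {"mimetype": b"application/x-example"}))
print(matches_container_rule(sample, ["word/document.xml"]))
```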
Inspect Container File Contents
This is different from container ‘identification’. If DROID or Siegfried encounters a file that is legitimately a container or ‘archive’ file format, such as ZIP or TAR (Tape Archive File), then setting 'Inspect Container File Contents' can make the tool look inside the file and return PUIDs for the container's contents as well.
Scan Web Archives
WARC (Web Archive) files are complex and can contain any number of any other file format. DROID and Siegfried can scan the contents of a WARC file returning PUIDs for every matching file inside.
A companion website for Siegfried that has drag and drop functionality for identifying individual files. Itforarchivists is pretty cool and retro and a great resource for introducing folks to the concept of format identification.
Signature Development Utility
A website (http://www.nationalarchives.gov.uk/pronom/sigdev/index.htm) that enables folks outside of TNA to create individual signature files for testing. A signature developed and tested in anticipation of a submission to PRONOM may make the turnaround to it being published as part of PRONOM much quicker.
DROID can calculate MD5, SHA1, SHA256
Siegfried can calculate CRC32, MD5, SHA1, SHA256
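For comparison with tool output, the same four checksums can be computed with Python's standard library. This is a minimal sketch; the function name checksums is hypothetical, and it reads the whole byte string into memory rather than streaming a large file.

```python
import hashlib
import zlib


def checksums(data):
    """Return CRC32, MD5, SHA1 and SHA256 of a byte string as hex strings."""
    return {
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


for name, value in checksums(b"example file contents").items():
    print(name, value)
```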
DROID results can be exported in a comma separated values table (CSV). It is also possible to save the results of a scan in an Apache Derby Database.
Siegfried results can be exported in YAML (YAML Ain't Markup Language), and CSV (Comma Separated Values table)