Double the magic


The latest version of siegfried, 1.5.0, has just been released. The big change is support for a second identifier type, freedesktop.org’s Shared MIME-Info specification.

MIME-Info signatures

MIME-Info is a common format for file format signatures published by the freedesktop.org project. Apache Tika is one application that has adopted the standard, using MIME-Info for the bulk of its file type detection and publishing its own MIME-Info signature file, tika-mimetypes.xml. The freedesktop.org project publishes its own MIME-Info signature files too.
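
To give a feel for what a MIME-Info signature looks like, here is a minimal Python sketch that reads a tiny hand-written sample in the shared-mime-info XML layout and matches byte strings at fixed offsets. The sample data and the helper names (load_signatures, identify) are made up for illustration. Real signature files also use nested matches, masks, offset ranges, priorities and glob patterns, none of which are handled here, and this is not how siegfried itself is implemented.

```python
import xml.etree.ElementTree as ET

# Namespace used by shared-mime-info XML files.
NS = {"smi": "http://www.freedesktop.org/standards/shared-mime-info"}

# Hypothetical, hand-written sample; not copied from tika-mimetypes.xml.
SAMPLE = """
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
  <mime-type type="application/pdf">
    <magic priority="50">
      <match type="string" value="%PDF-" offset="0"/>
    </magic>
    <glob pattern="*.pdf"/>
  </mime-type>
  <mime-type type="image/gif">
    <magic priority="50">
      <match type="string" value="GIF8" offset="0"/>
    </magic>
    <glob pattern="*.gif"/>
  </mime-type>
</mime-info>
"""


def load_signatures(xml_text):
    """Return (mime type, offset, magic bytes) for simple string matches."""
    root = ET.fromstring(xml_text.strip())
    signatures = []
    for mime_type in root.findall("smi:mime-type", NS):
        for match in mime_type.findall("smi:magic/smi:match", NS):
            offset = match.get("offset", "0")
            # Skip anything beyond a plain string at a single fixed offset.
            if match.get("type") != "string" or ":" in offset:
                continue
            magic = match.get("value").encode("latin-1")
            signatures.append((mime_type.get("type"), int(offset), magic))
    return signatures


def identify(data, signatures):
    """Return the first MIME type whose magic bytes appear at its offset."""
    for mime, offset, magic in signatures:
        if data[offset:offset + len(magic)] == magic:
            return mime
    return None


if __name__ == "__main__":
    sigs = load_signatures(SAMPLE)
    print(identify(b"%PDF-1.4\n", sigs))
    print(identify(b"GIF89a", sigs))
```

The point of the sketch is only that each MIME type carries its own byte patterns, so the MIME type itself serves as the identifier.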

In order to use MIME-Info signatures with siegfried, you need to use the roy tool to build a custom signature file. For example, the command to replace your default PRONOM signature file with a Tika MIME-Info signature file is: roy build -mi tika. You can use the freedesktop.org signatures instead with roy build -mi freedesktop.

This tutorial is a comprehensive guide to setting up and using roy:

Use cases

First off, it’s important to note that siegfried’s default signature file remains a PRONOM-only one, so if you’re in a PRONOM shop and depend on PUIDs for your workflows (like we do at my workplace), and aren’t much interested in alternate identifications, don’t worry!

On the other hand, MIME-Info signatures have advantages in terms of speed, identification of text and XML formats, and in their use of MIME types as unique identifiers. If your organisation uses MIME types rather than PUIDs, and sees a lot of text-based formats, it might be worth considering switching. For example, where PRONOM defaults to a plain text identification, Tika’s tika-mimetypes.xml signatures correctly identify siegfried’s golang source code:

[Image: tika]

Best of both worlds

Another option is to use multiple identifiers. The following example demonstrates how both Tika MIME-Info and PRONOM can be used together in siegfried:

[Image: pronom-tika]

The roy add -mi tika command adds a second identifier, based on the Tika MIME-Info signatures, to your existing default PRONOM signature file.

Using multiple identifiers won’t significantly slow down identification and will allow you the benefit of a second (or third, fourth etc.) opinion. This can be useful in tracking down unknowns, providing more precise identification of text-based formats, and in validating the MIME types suggested by PRONOM.
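
As a rough sketch of how a second opinion could be used to check the MIME types suggested by PRONOM, the Python below compares two identifiers' results for each file and lists the disagreements. The JSON here is a hand-written illustration, not captured siegfried output: the field names (files, filename, matches, ns, id, mime) and the sample values are assumptions, so check them against the output of your own siegfried version before relying on them. The function name compare_identifiers is hypothetical too.

```python
import json

# Illustrative sample only; field names and values are assumptions and
# should be checked against real siegfried output.
SAMPLE = """
{"files": [
  {"filename": "siegfried.go",
   "matches": [
     {"ns": "pronom", "id": "x-fmt/111", "mime": "text/plain"},
     {"ns": "tika", "id": "text/x-go"}
   ]},
  {"filename": "report.pdf",
   "matches": [
     {"ns": "pronom", "id": "fmt/18", "mime": "application/pdf"},
     {"ns": "tika", "id": "application/pdf"}
   ]}
]}
"""


def compare_identifiers(report, first="pronom", second="tika"):
    """List files where the first identifier's MIME type differs from
    the MIME type reported as the second identifier's id."""
    disagreements = []
    for entry in report["files"]:
        # Index this file's matches by the identifier that produced them.
        by_ns = {match["ns"]: match for match in entry["matches"]}
        first_mime = by_ns.get(first, {}).get("mime", "")
        second_mime = by_ns.get(second, {}).get("id", "")
        if first_mime != second_mime:
            disagreements.append((entry["filename"], first_mime, second_mime))
    return disagreements


if __name__ == "__main__":
    for name, a, b in compare_identifiers(json.loads(SAMPLE)):
        print(f"{name}: {a} vs {b}")
```

A disagreement is not necessarily an error: often, as with the source code example, one identifier is simply more specific than the other.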

Andy Jackson’s diagrams of file format registry coverage demonstrate how surprisingly small the overlap between registries is. Adding an additional identifier to your siegfried signature file really can double your magic!
