To fits or not to fits

I am currently working on a tool that profiles collections of digital objects. The motivation behind this is that efficient preservation planning requires analyzing the content that is to be preserved and detecting the significant properties of the collection or set of objects. As soon as this baseline is provided, planning can be done and different preservation actions can be evaluated on representative sample objects before the chosen action is actually executed. Afterwards the process can be repeated on demand. In my opinion, only with this first step of analysis and detection can the cycle of this preservation workflow be completed: analyze -> detect -> plan -> act -> repeat.

To achieve this goal I obviously need metadata of real content, and lots of it. I guess this is the place to thank Bjarne Andersen, Asger Askov Blekinge and their team at the State and University Library, Denmark for the metadata they have shared with me. Thank you guys, you are awesome!

I am using the File Information Tool Set, or FITS (http://code.google.com/p/fits/), developed by the Harvard University Library, for numerous reasons, but mostly because of the data it provides and its standardized output. However, I have heard many arguments against FITS and its performance, so I will try to summarize them and tell you what I think of FITS.

FITS identifies, validates and extracts technical metadata for various file formats. Sounds great, doesn’t it?
Actually, it does none of those things itself! It just wraps a bunch of other tools and normalizes their output into a structured XML file that follows a well-defined schema.
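To make that concrete, here is a rough sketch of what consuming that output looks like. The XML below is a hand-written fragment, not real FITS output – the element names follow the FITS output schema as I understand it, so treat them as approximate – and the few lines of Python just pull out the identification:

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative fragment of a FITS output file.
# Element and attribute names are approximate, not guaranteed
# to match the real fits_output schema exactly.
FITS_XML = """
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">
  <identification>
    <identity format="Portable Document Format" mimetype="application/pdf">
      <tool toolname="Jhove" toolversion="1.5" />
      <version>1.4</version>
    </identity>
  </identification>
</fits>
"""

NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

def identities(xml_text):
    """Return the (format, mimetype) pairs reported in a FITS output document."""
    root = ET.fromstring(xml_text)
    return [
        (i.get("format"), i.get("mimetype"))
        for i in root.findall(".//fits:identity", NS)
    ]

print(identities(FITS_XML))
# -> [('Portable Document Format', 'application/pdf')]
```

The point is not the parsing itself, but that one schema is enough: whichever tool did the work underneath, the consumer reads one normalized document.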

Let me rephrase that:

“Don’t reinvent the wheel!”

That is the main reason why I like it and why I fits:
Why try to invent yet another identification or characterization tool that claims to be so much better than all the others, when there are already so many of them? Why not just fork an existing one and make it better?

By the way, there is this amazing report that evaluates some of them (FITS is also in there, which I personally find a bit odd, as its purpose is the combination of tools and not characterization itself, but anyway). Definitely a must-read if you are interested in the topic and haven’t done so yet.
Some characterization tools produce very good results on a variety of files, others produce good results only on specific files, and some perform better than others under specific circumstances. There is, however, one simple fact: at the end of the day, as a user I am interested in a lot of (relevant! and valid!) metadata that can be extracted easily (with one tool in one environment, please). And here is why…

There are certainly use cases where identification alone would be enough for a set of objects, but from my point of view it will never be enough. From a preservation planning point of view it will most certainly never be enough! Knowing that a collection of objects consists of x formats and having a histogram over them is not enough. The format is certainly a very important property – a significant one – but it is one of many. We cannot treat it as the only decision factor that influences what we are going to do with the set of objects. Consider the following example:

You have three PDF files and you know only their formats and their versions. Which two are similar? Which one is the outlier?
Pretty easy, right? Well, now consider the same three files. Note, however, that there is some more metadata provided by FITS.

Which two files are more similar now? Which one is the outlier? I hope you see that the format alone is not enough. It is just another property that has to be considered.
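The point can be sketched in a few lines of Python. The property values below are invented for illustration (they are not from a real FITS run), and the distance function simply counts the properties on which two files disagree:

```python
# Hypothetical metadata for three PDF files. All values are invented
# for illustration only, not taken from a real FITS run.
files = {
    "a.pdf": {"format": "PDF", "version": "1.4", "encrypted": "no",  "valid": "true"},
    "b.pdf": {"format": "PDF", "version": "1.4", "encrypted": "yes", "valid": "false"},
    "c.pdf": {"format": "PDF", "version": "1.6", "encrypted": "no",  "valid": "true"},
}

def distance(x, y):
    """Number of properties on which two files disagree."""
    return sum(1 for k in x if x[k] != y[k])

# By format and version alone, a.pdf and b.pdf look identical and
# c.pdf is the outlier; with more properties, the picture flips.
print(distance(files["a.pdf"], files["b.pdf"]))  # -> 2
print(distance(files["a.pdf"], files["c.pdf"]))  # -> 1
```

With only format and version in view, a.pdf and b.pdf are indistinguishable; once encryption and validity are considered, a.pdf and c.pdf are the similar pair.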

I am sure by now many of you are thinking: apparently he hasn’t used FITS if he is really proposing it for large-scale usage. Well, let me assure you I did use it extensively, and I believe this is the way to go. Maybe this version and the standard configuration are not ready; maybe this is not even the implementation of the needed tool; but I believe it is the correct idea that we are looking for.

Obviously, FITS has many downsides – but let’s cut the project some slack. First of all, it is at version 0.6. Second, many of the downsides are fairly simple to fix. There are arguments such as: JHOVE performs badly on HTML files. Well, just turn it off for HTML; it provides mostly irrelevant information like unclosed <br> tags anyway.
Another would be: DROID is not the best identification tool out there anymore. Well, just exchange it for Apache Tika then.
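In FITS this kind of tuning is just configuration. A sketch of what the relevant lines in xml/fits.xml might look like (the class names and the exclude-exts attribute are from my memory of the 0.6 configuration, so double-check them against your own copy):

```xml
<!-- Hypothetical excerpt from xml/fits.xml: keep JHOVE, but skip it for HTML -->
<tool class="edu.harvard.hul.ois.fits.tools.jhove.Jhove" exclude-exts="htm,html" />
<!-- Swapping DROID for Apache Tika would mean wrapping Tika as a new tool class -->
<tool class="edu.harvard.hul.ois.fits.tools.droid.Droid" />
```

No code changes needed for the first fix; the second is a matter of writing one new wrapper class.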

Don’t reinvent the wheel, right?

On the other hand, I haven’t heard a single argument about data normalization. It seems to me that the digital preservation community is still so swamped with other problems that we tend to overlook some fundamental points. We need a normalized model of how this metadata is going to be represented. Every single tool out there provides different measures (with different data units) and a different format for the same concepts. And since creating a new way of describing something always reminds me of this comic: http://xkcd.com/927/ – maybe we should try to stick to an existing one.
FITS is the only tool (that I know of, so correct me if I am wrong) that tries to normalize the data to some extent.
So why not exchange some of the badly performing wrapped tools for a couple of better-performing ones, and why not write some XSLTs so that more relevant metadata is added to the output?
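Such transformations are small pieces of work. As a sketch, an XSLT along these lines (all element names here are invented for illustration, not taken from a shipped FITS mapping file) would pass everything through unchanged and rename one tool-specific element to a normalized one:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative mapping: copy everything, rename one tool-specific
     element to a normalized one. Element names are invented. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity transform: copy everything through unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- Map the tool's 'xResolution' onto a normalized 'samplingFrequencyX' -->
  <xsl:template match="xResolution">
    <samplingFrequencyX><xsl:value-of select="."/></samplingFrequencyX>
  </xsl:template>
</xsl:stylesheet>
```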

Don’t reinvent the wheel, right?

As I pointed out, I am interested in valid metadata. Once again, FITS is the only tool that tries to provide some insight into whether or not a measurement is valid, by pointing out conflicts between the different wrapped tools. Yes, it is fairly rudimentary, but it is a start, and it is more than anything else I know. Why not build upon that?
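As far as I know, conflicting identifications are flagged with a status attribute along these lines – the fragment below is hand-written to illustrate the idea, so verify the exact names against real output – and surfacing the conflict takes only a few lines:

```python
import xml.etree.ElementTree as ET

# Hand-written illustration of a FITS identification conflict;
# the status attribute and element names are approximate.
CONFLICT_XML = """
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">
  <identification status="CONFLICT">
    <identity format="Hypertext Markup Language" mimetype="text/html">
      <tool toolname="Droid" toolversion="3.0" />
    </identity>
    <identity format="Plain text" mimetype="text/plain">
      <tool toolname="file utility" toolversion="5.04" />
    </identity>
  </identification>
</fits>
"""

NS = {"f": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

def conflicting_formats(xml_text):
    """Return the formats claimed by different tools when FITS flags a conflict."""
    root = ET.fromstring(xml_text)
    ident = root.find("f:identification", NS)
    if ident is None or ident.get("status") != "CONFLICT":
        return []
    return [i.get("format") for i in ident.findall("f:identity", NS)]

print(conflicting_formats(CONFLICT_XML))
# -> ['Hypertext Markup Language', 'Plain text']
```

Rudimentary, yes – but a consumer that can see the disagreement is already better off than one fed a single tool's confident guess.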

And since I don’t want to come across as claiming that FITS is the greatest tool ever, here are some points that I believe should be fixed in the current implementation:
– Better error recovery on external tool crash, so that the whole framework does not hang!
– Better configuration (e.g. not only excludes based on extension, but maybe also includes).
– Better output format schema.
– Better and more flexible tool chaining.
– More known properties (more xslt transformations).
– … I could continue, but I don’t want to make this a rant.

I am pretty sure that if there were such a tool that does all that and potentially more, and if it were configured correctly, it would show very promising results and would be the foundation of many other tools that build on top of this information. I am certain that it would be a great input to preservation planning, and I am certain it would help us understand better what content we have.

That is why I like to fits! I’ll be happy if you drop me a comment on why you do or do not!

4 Comments

  1. andy jackson
    August 2, 2012 @ 10:34 pm CEST

    In my opinion, dependency management is the one wheel that FITS re-invents. It’s all rather manual, and I got stuck when trying to work out how to upgrade the built-in DROID. Personally, I would rather it re-used an existing dependency management system (e.g. Debian packages, Maven, etc.). Unfortunately, Windows does not really have any kind of sane package management, so if that’s the primary platform things get difficult. It would at least be nice if the pure Java parts of these tools used Maven and made their JARs available. DROID and JHOVE2 already use Maven to some extent (although upgrading the DROID that JHOVE2 depends on is also very hard!). JHOVE was easy to mavenise. NZME was a bit trickier because JFlac appears to be dead, but I can pass on what I’ve got if anyone wants to take it on. Unfortunately, only Apache Tika does this right and publishes its releases on Maven Central so that everyone else can get them (which is really easy, see these instructions for details).

    It’s worth pointing out that JHOVE2, NZME and Tika all wrap existing code and normalise the output. For NZME and Tika, this means pure-Java libraries, but for JHOVE2 this extends to native binaries. This can lead to some confusion. I saw a patch for FITS that wrapped JHOVE2, which itself wraps other tools. In particular, it wraps DROID 4, which means that FITS would have been running DROID twice, using two different versions, with no clear idea of which signature file was being used where. I’m not against having a layer that merges the results of tools, but perhaps we should aim to only have one of them. Or we could consider separating the normalisation code from the bit that combines the results, so that the former can be re-used and the latter can be kept light.

    EDIT: I just want to make clear that I don’t consider the dependency management thing to be a showstopper. I just wanted to make it clear that this aspect of FITS acted as a barrier to my attempts to improve it, and I worry that over the long term this means the developers of the tool will spend a lot of time managing dependencies by hand. If the developers don’t mind that, then fair enough.

  2. andy jackson
    August 3, 2012 @ 10:08 am CEST

    I agree that combining the results from multiple tools is generally useful and indeed quite necessary, and FITS provides a nice model for doing this. I particularly like the way it can be used to expose conflicts between the tools.

    The lack of progress on normalising metadata is frustrating, but there are a number of subtleties here that may help explain why progress is so slow. It is very difficult to normalise metadata without making false assertions, due to the way in which apparently similar concepts are in fact distinct (both in form and meaning) across different formats.

    To take a simple but important example, consider DPI (dots per inch). At the most basic level, some formats encode this as a simple integer, others as a ratio of integers, and still others as a floating point value, and in a range of units (not just inches). In general, one cannot transform between these without loss, and so comparing normalised values will require some kind of tolerance parameter, which in turn will depend on the context. Worse still, it’s not always clear which resolution we are talking about. JP2 has a ‘Capture Resolution’ and a ‘Display Resolution’, so in general the former maps to DPI. Except if you used older versions of Kakadu, in which case the latter may be more reliable. And which one does JHOVE express in its normalised output? I had to dig into the source code to find that one out – when tools normalise data they also tend to erase its provenance.
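    A sketch of what such a tolerance parameter means in practice (purely illustrative Python, not code from any of these tools): normalise everything to dots per inch first, with 1 inch = 2.54 cm, then compare within a context-dependent tolerance:

```python
CM_PER_INCH = 2.54

def to_dpi(value, unit):
    """Normalise a resolution to dots per inch. Lossy in general."""
    if unit == "in":
        return float(value)
    if unit == "cm":
        # dots per cm -> dots per inch
        return float(value) * CM_PER_INCH
    raise ValueError(f"unknown unit: {unit}")

def same_resolution(a, b, tolerance=0.5):
    """Compare two (value, unit) resolutions within a tolerance, in DPI."""
    return abs(to_dpi(*a) - to_dpi(*b)) <= tolerance

# 118.11 dots per cm is almost, but not exactly, 300 dpi (299.9994),
# so the verdict depends entirely on the tolerance chosen.
print(same_resolution((300, "in"), (118.11, "cm")))        # -> True
print(same_resolution((300, "in"), (118.11, "cm"), 1e-6))  # -> False
```

    The same two stored values are "equal" or "different" depending on a parameter that no metadata schema currently records.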

    There are similar issues with ‘Valid’ and ‘Well-formed’. These are only tightly defined for a small number of formats, e.g. XML, and it is not always clear what they mean elsewhere. Consequently, the use of valid and well-formed as global distinctions has been dropped from JHOVE2. Other tools disagree, but in the absence of a formal specification there is no right answer here.

    Lastly, and most awkwardly of all, there is the PDF feature you describe above as ‘encryption’. While the absence of file-level encryption can be determined reliably, the presence of encryption has at least two distinct meanings for PDF. If the file has an ‘owner password’ then it is encrypted, but the file-level password is known to all implementations (it’s an empty string), and its presence implies an agreement to restrict certain operations even when decrypted. This may represent some minor preservation risk, but that is as nothing compared to the ‘user password’, in which case the file is properly encrypted and cannot be opened without the password. It may be possible to determine the password by brute force, but this depends on the algorithm and key length involved. For all these reasons, a normalised property along the lines of ‘openable’ might be preferable to ‘encrypted’, i.e. in terms of the processes that can or cannot be performed using this data, rather than re-expressing the data itself.
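    Expressed as a sketch (the state names and the ‘openable’ mapping below are mine, invented for illustration, not from any schema):

```python
def openable(encryption_state, password_known=False):
    """
    Map PDF encryption states onto an 'openable' property.
    State names are illustrative, not from any metadata schema.
    """
    if encryption_state == "none":
        return True
    if encryption_state == "owner-password":
        # The file-level password is the empty string: any reader can open
        # it, though certain operations are restricted by agreement.
        return True
    if encryption_state == "user-password":
        # Properly encrypted: openable only if the password is known.
        return password_known
    raise ValueError(f"unknown state: {encryption_state}")

print(openable("owner-password"))  # -> True
print(openable("user-password"))   # -> False
```

    A flat ‘encrypted: yes’ would collapse the first and second non-trivial cases, which carry very different preservation risks.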

    This complexity only makes projects like FITS more important. When they work well, they can become the community’s expression of its understanding of these subtleties and how to resolve them. That common understanding is precisely what we need to build and preserve.

  3. paul
    August 2, 2012 @ 2:45 pm CEST

    Spencer: What’s the position with the tool wrapping in FITS, with regard to updating with new versions of the tools? I wondered if you’re aware of the experiment Dave Tarrant did at one of our hackathons last year, where he linked to existing Debian packages to do one-click installs of a wrapper for FILE and Exiftool, thereby ensuring that a new install always pulls the latest versions of the tools. This demonstrated the approach, albeit without the clever bit in FITS that combines the results. Platform issues aside, I wondered if this could be useful for you?

    http://wiki.opf-labs.org/display/REQ/Open+Planets+Foundation+-+File+Scanner

  4. spencermcewen
    July 27, 2012 @ 1:47 pm CEST

    Thanks for the feedback! Some of the recommended enhancements are already on my list, like the configuration option to include by extension rather than exclude, and support for Apache Tika. If you could open issues on the Google Code project site and provide some more information about the others, that would help get them considered for a future release.

    Regarding performance, we need to keep in mind that for most files FITS is invoking 6 or 7 tools behind the scenes. Version 0.6 spins off each tool in its own thread. If I remember correctly, I saw a 20-30% overall speed increase after implementing this. However, FITS will always be bound by the completion of the longest-running tool. Depending on the file format and the tool, this can vary. 0.6 also added a directory processing option that eliminates the overhead of initializing FITS for every file when running in batch mode.

    FITS is shipped with the configuration that works best for our needs.  We will be running it against all files ingested into our next generation digital repository.  It would be interesting to know if people are tweaking the configuration at all.  For example, by simply rearranging the <tool> elements in xml/fits.xml you can give preference to one tool over another in the consolidated output.  If Jhove and DROID disagree about a file format, but Jhove is listed first in the configuration then it will also be listed first in the output file. We are also using the -x option to convert the consolidated output into standard metadata formats like MIX, TextMD, DocumentMD, or AES57 depending on the input file format. The FITS configuration, mapping files and format tree are intended to be tweakable, but I agree it could be better documented.
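    For example, a hand-written excerpt (not the shipped file – check xml/fits.xml for the real class names):

```xml
<!-- Hand-written excerpt: listing Jhove before DROID gives Jhove's
     identification preference in the consolidated output. -->
<tool class="edu.harvard.hul.ois.fits.tools.jhove.Jhove" />
<tool class="edu.harvard.hul.ois.fits.tools.droid.Droid" />
```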
