Measuring Bigfoot

My previous blog post, Assessing file format risks: searching for Bigfoot?, prompted some interesting feedback from a number of people. Ross Spencer wrote a particularly elaborate response, and I originally wanted to reply to it directly using the comment fields. However, my reply turned out rather lengthier than I had intended, so I decided to turn it into a separate blog entry.

Numbers first?

Ross's overall point is that we need the numbers first: he makes a plea for collecting more format-related data and attaching numbers to them. Although these data do not translate directly into risks, Ross argues that it may be possible to use them to address format risks at a later stage. This may look like a sensible approach at first glance, but on closer inspection there's a pretty fundamental problem, which I'll try to explain below. To avoid any confusion, I will be using "format risk" here in the sense of Graf & Gordea, which follows from the idea of "institutional obsolescence" (probably worth a blog post by itself, but I won't go into that here).

The risk model

Graf & Gordea define institutional obsolescence in terms of "the additional effort required to render a file beyond the capability of a regular PC setup in a particular institution". Let's call this effort E. The aim is then to arrive at an index that has some predictive power of E; let's call this index R_E. For the sake of the argument it doesn't matter how R_E is defined precisely, but it's reasonable to assume it will be proportional to E (i.e. as the effort to render a file increases, so does the risk):

R_E ∝ E

The next step is to find a way to estimate R_E (the dependent variable) as a function of a set of potential predictor variables:

R_E = f(S, P, C, … )

where S = software count, P = popularity, C = complexity, and so on. To establish the predictor function we have two possibilities:

  1. use a statistical approach (e.g. multiple regression or something more sophisticated);
  2. use a conceptual model that is based on prior knowledge of how the predictor variables affect R_E.

The first case (statistical approach) is only feasible if we have actual data on E. For the second case we also need observations on E, if only to be able to say anything about the model's ability to predict R_E (verification).
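To make the problem concrete, here is a minimal sketch of option 1 in Python; a hedged illustration, not a worked-out method. Fitting R_E as a linear function of the predictors is straightforward, but only once we have observed effort values to fit against; all predictor values and effort figures below are invented for illustration.

    # Sketch of option 1 (statistical approach). Hypothetical data only:
    # fitting f() is trivial *if* we have observations of E, which we don't.
    import numpy as np

    # One row per format; columns: S (software count), P (popularity),
    # C (complexity). Values are invented for illustration.
    X = np.array([
        [12, 0.9, 3.1],
        [ 4, 0.2, 7.8],
        [25, 0.7, 1.4],
        [ 8, 0.5, 4.0],
        [ 2, 0.1, 9.2],
    ])

    # The missing ingredient: observed rendering effort E per format.
    E_observed = np.array([1.0, 6.5, 0.4, 2.8, 8.9])  # hypothetical!

    # Ordinary least squares: R_E ≈ intercept + X @ beta
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, E_observed, rcond=None)
    print("fitted coefficients:", beta)

The point stands regardless of the fitting method: without the E_observed column, nothing can be fitted, let alone verified.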

No observed data on E!

Either way, the problem here is that there's an almost complete lack of any data on E. Although we may have a handful of isolated 'war stories', these don't even come close to the amount of data that would be needed to support any risk model, no matter whether it is purely statistical or based on an underlying conceptual model [1]. So how are we going to model a quantity for which we have no observed data in the first place? Or am I overlooking something here?

Looking at Ross's suggestions for collecting more data, all of the examples he provides fall into the potential (!) predictor variables category. For instance, prompted by my observation on compression in PDF, Ross suggests analysing large collections of PDFs to establish patterns in the occurrence of various types of compression (and other features), and attaching numbers to them. Ross acknowledges that such numbers by themselves don't tell you whether PDF is "riskier" than another format, but he argues that:

once we’ve got them [the numbers], subject matter experts and maybe some of those mathematical types with far greater statistics capability than my own might be able to work with us to do something just a little bit clever with them.

Aside from the fact that it's debatable whether, in practical terms, the use of compression really constitutes a risk (is there any evidence to back up this claim?), there's a more fundamental issue here. Bearing in mind that, ultimately, the thing we're really interested in is E, how could collecting more data on potential predictor variables of E ever help in the near absence of any actual data on E? No amount of clever maths or statistics can compensate for that! Meanwhile, ongoing work on the prediction of E seems to focus mainly on the collection, aggregation and analysis of potential predictor variables (as Ross's suggestions also illustrate), even though the purpose of these efforts remains largely unclear.
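For concreteness, collecting the kind of predictor data Ross suggests is indeed easy. A naive sketch that tallies compression filter names across a folder of PDFs might look as follows; the directory name is hypothetical, and a raw byte scan misses filters hidden inside compressed object streams, so the counts are a lower bound at best.

    # Tally compression filter names across a collection of PDFs.
    import re
    from collections import Counter
    from pathlib import Path

    FILTER_RE = re.compile(
        rb"/(FlateDecode|DCTDecode|LZWDecode|JBIG2Decode|JPXDecode|"
        rb"CCITTFaxDecode|RunLengthDecode|ASCIIHexDecode|ASCII85Decode)")

    counts = Counter()
    for pdf in Path("pdf_collection").glob("*.pdf"):  # hypothetical folder
        counts.update(m.group(1).decode() for m in
                      FILTER_RE.finditer(pdf.read_bytes()))

    for name, n in counts.most_common():
        print(f"{name}: {n}")

Easy to run over a large collection; but note that its output is a table of predictor statistics, not a single observation of E.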

Within this context I was quite intrigued by the grant proposal mentioned by Andrea Goethals, which, from the description, looks like an actual (and quite possibly the first) attempt at the systematic collection of data on E (although, as Andy Jackson said here, I'm also wondering whether this may be too ambitious).

Obsolescence-related risks versus format instance risks

On a final note, Ross makes the following remark about the role of tools:

[W]ith tools such as Jpylyzer we have such powerful ways of measuring formats – and more and more should appear over time.

This is true to some extent, but a tool like jpylyzer only provides information on format instances (i.e. features of individual files); it doesn't say anything about preservation risks of the JP2 format in general. The same applies to tools that are able to detect features in individual PDF files that are risky from a long-term preservation point of view. Such risks affect file instances of current formats, and this is an area that is covered by the OPF File Format Risk Registry that is being developed within SCAPE (it currently only covers a limited number of formats). They are largely unrelated to (institutional) format obsolescence, which is the domain addressed by FFMA. This distinction is important, because the two types of risk need to be tackled in fundamentally different ways, using different tools, methods and data. Also, by not being clear about which risks are being addressed, we may end up not using our data in the best possible way. For example, Ross's suggestion on compression in PDF entails (if I'm understanding him correctly) the analysis of large volumes of PDFs in order to gather statistics on the use of different compression types. Since such statistics say little about individual file instances, a more practically useful approach might be to profile individual file instances for 'risky' features, as in the sketch below.
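As a hedged illustration of such instance-level profiling, the sketch below runs jpylyzer over a folder of JP2s and flags files it reports as invalid. The folder name is hypothetical, and the XML layout (in particular the isValid element) varies between jpylyzer versions, so treat the tag lookup as an assumption to verify against your own jpylyzer output.

    # Flag JP2 instances that jpylyzer reports as invalid.
    import subprocess
    import xml.etree.ElementTree as ET
    from pathlib import Path

    for jp2 in Path("jp2_masters").glob("*.jp2"):  # hypothetical folder
        xml_out = subprocess.run(["jpylyzer", str(jp2)],
                                 capture_output=True, check=True).stdout
        root = ET.fromstring(xml_out)
        # Look for an isValid element regardless of namespace or nesting.
        valid = next((el.text for el in root.iter()
                      if el.tag.split("}")[-1] == "isValid"), None)
        if valid != "True":
            print(f"flagged for closer inspection: {jp2}")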


  1. On a side note, even conceptual models often need to be fine-tuned against observed data, which can make them pretty similar to statistically derived models. 

4 Comments

  1. johan
    November 27, 2013 @ 1:02 pm CET

    Hi Ross,

    Thanks for taking the time to further clarify your original comment. I think I now have a much better idea of what you are getting at. To be clear, I wasn't making a case for doing nothing in my blog post; it's just that I'd like the limited amounts of time and effort that are available to be spent on stuff that's actually useful. If I understand you correctly, this is also why you are making a call for collecting and publishing figures/statistics on collections, since they can help in deciding where to invest resources. This makes complete sense, and I largely agree with you on this.

  2. ross-spencer
    November 8, 2013 @ 12:19 am CET

    Johan,

    To help put my original comment back into context, the title should simply have been "Numbers first…", with the emphasis on understanding later…

    I also need to distinguish my comment from one on obsolescence. What it is, and what perhaps isn't clear enough, is a comment on risk analysis (and management), which is the important angle that I think we can address with the collection of numbers. Obsolescence is only one risk we have to analyse and potentially deal with.

    To clarify further:

    "…ongoing work on the prediction of E mainly seems to be focused on the collection, aggregation and analysis of potential predictor variables (which is also illustrated by Ross's suggestions), even though the purpose of these efforts remains largely unclear."

    Again, "understanding later…"

    The purpose of these efforts is not to predict E alone. The purpose is the qualification of statements and observations that appeal to authority and lack precision:

    "For example, PDF is supported by a plethora of software tools, yet it is well known that few of them support every feature of the format (possibly even none, with the exception of Adobe's implementation).  "

    "quite a few (open-source) software tools support the JP2 format, but for this many of them (including ImageMagick and GraphicsMagick) rely on JasPer, a JPEG 2000 library that is notorious for its poor performance and stability. "

    "PDF is not a compressed format (in reality text in PDF nearly always uses Flate compression, whereas a whole array of compression methods may be used for images)"

    The purpose is looking at real institutional risk, and helping institutions understand where to place investment: for example, where they will see encryption or password protection, or maybe where they'll see certain configurations of JP2 profile when they only have the capability to deal with another, very specific one…

    The purpose is about allowing individuals and institutions to make decisions based on risk appetite.

    Jay found the spirit of the comment in creating format type statements. Expanding further, collecting numbers is getting easier and easier. I can analyse all the formats going into my collection and collect all the instance stats coming from tools like Jpylyzer and use them institutionally and make them available globally. Over time, I can build a picture of the JP2 images coming into my collection and see what that might mean in terms of tools I need to support the further analysis, rendering and migration of those files moving forward.

    A real risk to my institution might be accessioning password protected PDF files. This causes problems for access and for running preservation actions. If I have statements about PDF (backed up by numbers) across different domains (Internet, memory institutions, all government departments, specific government departments), then I might be able to talk to those above me and suggest where I need to concentrate investment of resources to handle potential accessions. Particularly useful if I now have figures and find that 99% of all PDFs encountered across all domains are never password protected, or, further down the line, following a metric-centric culture, find that 50% of all PDFs from a department I am accessioning from across multiple accessions are, in fact, protected…
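    For illustration, a minimal sketch of such a tally might look like this (using pypdf's is_encrypted flag; the one-folder-per-domain directory layout is hypothetical, and note that is_encrypted is also true for owner-password-only encryption):

        # Per-domain tally of password protected PDFs.
        from pathlib import Path
        from pypdf import PdfReader

        for domain in Path("accessions").iterdir():  # hypothetical layout
            pdfs = list(domain.glob("*.pdf"))
            if not pdfs:
                continue
            encrypted = sum(PdfReader(p).is_encrypted for p in pdfs)
            print(f"{domain.name}: {encrypted}/{len(pdfs)} encrypted "
                  f"({100 * encrypted / len(pdfs):.0f}%)")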

    It's probabilistic, but, as mentioned in my original comment, risk analysis is about the reduction of uncertainty, not the elimination of it. We still have to take numbers and understand how we want to treat risks. There might be appetite to not pro-actively deal with password protected files (research, tool-creation, purchasing), there might not be. In the former case, a risk once realised becomes an issue anyway, and that needs to be looked at. What we're doing by being pragmatic and thinking about what risks there are and what we might realistically have to deal with is concentrating our investments sensibly.

    We can apply this analysis across much of what we do.

    At this stage in my reply, I'm not sure what else needs addressing. Collecting numbers and understanding how we can use them is important. Do I think they can help me address the situation above? Yes. Do I think they can help me predict obsolescence, or allow me to say one format presents more of a preservation risk than another? Again (referring to my original comment), I don't know. I don't think I can necessarily answer the format obsolescence question, but I think that maybe stats can give us an idea about preservation risks – the caveat being that I acknowledge Andy Jackson's comment and agree that we need stories first.

    If I have a story about a developer creating a renderer for compressed JPEG, and another by the same developer creating a renderer for uncompressed TIFF, then we might gain an insight into the difficulties and requirements of coding each, and therefore an insight into what we might need to do should we find ourselves in future without applications for either (read: both are obsolete) and need to develop new capability to read either format. Should war stories suggest a compressed format is harder to develop for, it might be a potential preservation risk, and we might want to mitigate that by using uncompressed formats.

    But I'm indulging instinct by jumping down the road to the potential conclusion of war stories. Let's just get the war stories first.

    Well… let's get war stories, and, my view, in parallel with as much effort as is suitable, collect the other bits and pieces we need as well.

    What I really hope, Johan, given the conviction demonstrated in your post, is that you're not making an argument for doing nothing, that is, for not collecting statistics when we have the opportunity to, just because we don't completely understand the end-game yet. I'll emphasise one more time: understanding later… That understanding will likely cover a broad range of risks and areas, but in the short term it will at least help us qualify and quantify broad, non-scientific statements that have the potential to contribute to common misunderstandings, by attaching them to strong, repeatable, verifiable figures, reducing their otherwise inherent uncertainty and ambiguity.

    Ross

  3. johan
    October 17, 2013 @ 11:11 am CEST

    …totally agree with this. What you're describing here sounds pretty similar (well, if I'm understanding you correctly) to the work we started here:

    http://wiki.opf-labs.org/display/TR/OPF+File+Format+Risk+Registry

    I'll be the first to admit that this is a very small scale effort at the moment, and it remains to be seen if this will gather any further momentum. One initiative which, in my view, has the potential to be hugely useful here is ArchiveTeam's File Format Wiki, because its free-form nature makes it really straightforward to add information like this (either directly or by linking), and it's really easy to use and contribute to as well. See for instance this entry on XLSX:

    http://fileformats.archiveteam.org/wiki/XLSX

  4. Jay Gattuso
    October 15, 2013 @ 7:46 pm CEST

    a tool like jpylyzer only provides information on format instances (i.e. features of individual files); it doesn't say anything about preservation risks of the JP2 format in general. The same applies to tools that are able to detect features in individual PDF files that are risky from a long-term preservation point of view.

    This comment jumped out for me.

    I agree that these tools give us data on the instance of the file object. However, if we can bring that instance data back to a single place, together with our various risk assessments and experiences with instance sets, we can start to build a bigger-picture view of the format "type", and start to make some informed and detailed "format type" level statements by merging "format instance" level statements.

    As ever, for me it points to collaboration, and thus the development of tools and spaces that are easy to use and do most of the heavy lifting.
