From 1 Million to 21,000: Reducing Govdocs Significantly

As part of the evaluation framework I'm developing for OPF and SCAPE, I've been gathering a corpus of files to run experiments against.

Although Govdocs1 would seem like a good place to start, there are a few problems:

1) It's too big. 1 million files is just showing off.

2) It's full of repeats! There are over 700,000 PDF files.

3) Running experiments on 1 million files that are full of repeats generates too much data (yes, there is such a thing).

So I went on a mission to reduce the size of the corpus, which I explain here.

To reduce the size of the corpus I am relying on the ground truth data, which is the result of running the File Identification Toolset (FI-Tools) over the corpus. The ground truth data may not be correct, but I am relying on it being consistently wrong, so that the size of the corpus can still be easily reduced. We shall hope to find out later if it is wrong.

Stage 1 – Eradicate all Free Variables (Mr Bond)

The ground truth data also pulls out many of the characteristics of each file. Since we are only interested in the identification data, much of this can be removed.

Properties to remove:

  • Last Saved
  • Last Printed
  • Title
  • Number of Pages
  • SHA-1
  • Image Size
  • File Name (for now)
  • File Size (for now)
  • other characteristics…

Properties to keep:

  • Extension
  • Description
  • Version (& related information)
  • Valid File Extensions
  • Accuracy of Identification
  • Content
  • Creating Program (or library)
  • Description Index (serial code)
  • Extension Valid (Y/N)

At this point we will still have a million records; however, much of the remaining data *should* be repeated.
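
The pruning step can be sketched roughly as follows. This is only an illustration, assuming the ground truth has been exported as tab-separated text with a header row; the column names are taken from the "keep" list above, and both they and the `prune_columns` helper are hypothetical rather than the actual FI-Tools output format.

```python
import csv
import io

# Hypothetical column names, mirroring the "Properties to keep" list.
KEEP = [
    "Extension", "Description", "Version", "Valid File Extensions",
    "Accuracy of Identification", "Content", "Creating Program",
    "Description Index", "Extension Valid",
]

def prune_columns(rows, keep=KEEP):
    """Yield one tab-joined line per record, holding only the kept columns."""
    for row in rows:
        # Columns missing from a record are written as empty strings.
        yield "\t".join(row.get(name) or "" for name in keep)

# Made-up sample: two records that differ only in file name and SHA-1.
sample = io.StringIO(
    "File Name\tExtension\tDescription\tVersion\tSHA-1\n"
    "000001.pdf\tpdf\tPortable Document Format\t1.4\taaa\n"
    "000002.pdf\tpdf\tPortable Document Format\t1.4\tbbb\n"
)
pruned = list(prune_columns(csv.DictReader(sample, delimiter="\t")))
```

Once the per-file properties are dropped, the two sample records become identical lines, which is what makes the de-duplication in the next stage effective.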

Stage 2 – Sort-id

This is an easy stage. Run:

#sort -u data.txt > limit.txt

This gives us 4653 unique identifications spread across 87 different extensions. Of the 4653 identifications, the largest groups are:

PDF 3337
TEXT 267
XLS 194
HTML 169
DOC 147
PPT 58
PS 52
LOG 52
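
A per-extension breakdown like the one above could be produced along these lines. This is a sketch under assumptions: it supposes each record is a tab-separated line with the extension as its first field, and the `count_by_extension` helper is a hypothetical name, not part of any existing tool.

```python
from collections import Counter

def count_by_extension(lines):
    """Return (number of unique records, Counter of unique records per extension)."""
    # A set gives the same de-duplication as `sort -u`, without the ordering.
    unique = {line.rstrip("\n") for line in lines if line.strip()}
    # Assumption: the extension is the first tab-separated field.
    counts = Counter(record.split("\t", 1)[0].upper() for record in unique)
    return len(unique), counts

# Made-up sample records, including one exact repeat.
sample_lines = [
    "pdf\tPDF 1.4\tAcrobat Distiller\n",
    "pdf\tPDF 1.4\tAcrobat Distiller\n",
    "pdf\tPDF 1.3\tGhostscript\n",
    "txt\tPlain text\t\n",
]
total, counts = count_by_extension(sample_lines)
for extension, n in counts.most_common():
    print(extension, n)
```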

Only 20 extensions have more than 20 different identification types, probably because the Govdocs selection contains too few files of the other types. However, it is still shocking to see that PDFs can be created in 3337 different ways. While other formats, such as plain text, have never changed, we have 20 or so versions of PDF (including PDF/A) and loads of creation libraries. By trying to solve the problem, have we actually made it worse?

 

At this point we could just stop and select 4653 files, one of each type of identification. 

Stage 3 – Select some Files

The final stage is to actually select some files of each of the 4653 types of identification. 

I decided to select 10 files of each type of identification where possible. Where fewer than 10 were available, all of them were selected.

Where more than 10 were available, the following selection policy applied:

  • Select the largest in filesize
  • Select the smallest in filesize
  • Select 8 random others.
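
The selection policy above can be sketched as follows. This is a minimal illustration, not the code used to build the published corpus: the `select_files` name and the `(name, size)` tuple layout are assumptions. Note that the file name and file size, dropped "for now" in Stage 1, are needed again at this point.

```python
import random

def select_files(files, limit=10, rng=None):
    """Pick up to `limit` files for one identification type.

    `files` is a list of (name, size_in_bytes) tuples (hypothetical layout).
    """
    # Not more than the limit available: take everything.
    if len(files) <= limit:
        return list(files)
    rng = rng or random.Random()
    by_size = sorted(files, key=lambda f: f[1])
    smallest, largest = by_size[0], by_size[-1]
    # Draw the remaining picks at random from everything in between.
    others = rng.sample(by_size[1:-1], limit - 2)
    return [largest, smallest] + others

# Made-up example: 25 files of one identification type.
files = [(f"file{i:03d}.pdf", i * 100) for i in range(1, 26)]
chosen = select_files(files, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the selection repeatable, which may be useful if the corpus ever needs to be regenerated.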

Stage 4 – Publish

So with all this done, we have ~21,000 files available at http://corpora.opf-labs.org/govdocs_selected.tar.gz

Further to this, I'll also push up the code that does all this.

 

9 Comments

  1. Khoanc
    October 29, 2013 @ 11:02 pm CET

    Hi David Tarrant,

    The link for govdocs_selected.tar.gz (http://soton.corpora.openplanetsfoundation.org/govdocs_selected.tar.gz) is not available now. Could you please give me the link that I can download the file govdocs_selected.tar.gz. 

     

    Thanks

  2. Khoanc
    October 28, 2013 @ 1:09 pm CET

    Hi Dave,

    The link for govdocs_selected.tar.gz (http://soton.corpora.openplanetsfoundation.org/govdocs_selected.tar.gz) is not available now. Could you please give me the link that I can download the file govdocs_selected.tar.gz. 

     

    Thanks

     

     

  3. Dirk von Suchodoletz
    August 7, 2012 @ 9:48 pm CEST

    Great to see some action in the field of test data, thanks for the start here! There are a couple of comments on the test corpus and the number of files already, so I won't add to that. What was most important in my opinion is point four in the list, which is unfortunately the shortest. To have a good and well described test corpus would be very beneficial. But Govdocs1 can just be a start. Lots of filetypes, especially older ones, are missing (there are no files of the type we found e.g. on the CTOS floppies) and the selection of filetypes in that corpus is biased. Unfortunately it is not as easy to add to the corpus, as the data we analysed in the recovery project was restricted.

  4. davetaz
    July 31, 2012 @ 2:07 pm CEST

    Logically, the Date Created/Saved and Printed information should not cause loss of information relating to creating programs and versions. In some versions written by some programs this data will exist, whereas in other cases it won’t. The date information is only relevant where a service pack might “fix/break” the writing of some formats, and this is a hard case to generalise when the date information “is more likely” to be incorrect than other data.

    So I’d say yes, although all data is generally relevant, there is currently a stronger argument for discounting it. As I experiment with the various identification tools there may be some cases where other parameters should be considered.
