Some reflections on scalable ARC to WARC migration

The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving, where long-term preservation is directly relevant to several task areas such as harvesting, storage, and access.

Web archives usually consist of large data collections of multi-terabyte size; the largest archive is the Internet Archive, which according to its own statements stores about 364 billion pages occupying around 10 petabytes of storage. And the International Internet Preservation Consortium (IIPC), with its over 40 members worldwide, shows how diverse these institutions are, each with a different focus regarding the selection of content or the type of material they are interested in.

It is up to this international community and to the individual member institutions to ensure that archived web content can be accessed and displayed correctly in the future. The challenge in doing so lies in the nature of the content, which is like the internet itself: diverse in the formats used for publishing text and multimedia content, built on a rich variety of standards and programming languages, providing interactive user experiences with database-driven web sites and strongly interlinked functionality, involving social media content from numerous sources, and so on. In other words, apart from the sheer size, it is the heterogeneity and complexity of the data that poses the significant challenge for collection curators, developers, and long-term preservation specialists.

One of the topics the International Internet Preservation Consortium (IIPC) is dealing with is the question of how web archive content should actually be stored for the long term. Originally, content was stored in the ARC format as proposed by the Internet Archive. The format was designed to hold multiple web resources aggregated in a single – optionally compressed – container file. But it was never intended to be an ideal format for long-term storage; for example, it lacked features for adding contextual information in a standardised way. For this reason, the new WARC format was created as an ISO standard to provide additional features, especially the ability to hold harvested content as well as any metadata related to it in a self-contained manner.
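
To illustrate the difference in a simplified way (the values below are invented for illustration): an ARC version-1 record is introduced by a single space-separated header line giving URL, IP address, capture date, content type and length, such as

http://example.com/ 192.0.2.1 20140321120000 text/html 2345

while a WARC response record for the same capture starts with a block of named header fields that can be complemented with further metadata records in the same container:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://example.com/
WARC-Date: 2014-03-21T12:00:00Z
WARC-Record-ID: <urn:uuid:0f0c2ac9-7d36-4a3e-a6da-6b1e65f8f2f1>
Content-Type: application/http; msgtype=response
Content-Length: 2345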

A particularity of web archiving is that while some content changes continuously, other content remains static from one visit to the next. In order to preserve the changes, web pages are harvested by crawl jobs at a certain frequency. Storing the same content again at each visit would create redundant copies and would not make efficient use of storage.

For this reason, the Netarchive Suite, originally developed by The Royal Library and The State and University Library, and in the meantime used by other libraries as well, provides a mechanism called “deduplication”: when it detects that content has already been retrieved, it only stores a reference to the existing payload content instead of storing it again. The information about where the referenced content is actually stored is only available in the crawl log files, which means that if a crawl log file is missing, there is no knowledge of where any referenced content resides. In order to display a single web page with various images, for example, the Wayback Machine needs to know where to find content that may be scattered over various ARC container files. An index, e.g. in the CDX file format, contains the required information, and to build this index it is necessary to involve both the ARC files and the crawl log files in the index building process.
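
For illustration, a single CDX entry essentially maps a captured URL and timestamp to the name of the container file and the byte offset at which the record is stored; the exact set and order of fields varies between deployments, and the following line is entirely made up:

eu,scape-project)/ 20140321120000 http://scape-project.eu/ text/html 200 A2B3C4D5E6F7G2H3I4J5K6L7M2N3O4P5Q6R7S2T3 - 4567 IAH-20140321120000-00012.arc.gz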

From a long-term-preservation perspective, this is a problematic dependency. The ARC container files are not self-describing; they depend on operative data (log files generated by the crawl software) in a non-standardised manner. Web archive operators and developers know where to get the information, and the dependency might be well documented, but it still involves the risk of losing information that is essential for displaying and accessing the content.

This is one of the reasons why the option to migrate from ARC to the new WARC format is being considered by many institutions. But, as often happens, what looks like a simple format transformation at first glance rapidly turns into a project with complex requirements that are not easy to fulfil.

In the SCAPE project, there are several aspects that in our opinion deserve closer attention:

  1. The migration from ARC to WARC typically deals with large data sets, therefore a solution must provide an efficient, reliable and scalable transformation process. It must be able to scale out, which means that it should be possible to increase processing power by using a computing cluster of appropriate size, enabling organisations to complete the migration in a given time frame.

  2. Reading and writing the large data sets comes at a cost; sometimes the data must even be shifted to a (remote) cluster first. It should therefore be possible to easily hook in other processes that extract additional metadata from the content while it is being read.

  3. The migration from one format to another carries the risk of information loss. Quality assurance measures, such as calculating payload hashes and comparing content between corresponding ARC and WARC instances, or rendering subsets of the migrated content in the Wayback Machine, are possible approaches in this regard.

  4. Resolving the dependencies of the ARC container files on any external information entities is a necessary requirement. A solution should therefore not only look at a one-to-one mapping between ARC and WARC, but should also involve contextual information in the migration process.

The first concrete step in this activity was to find the right approach to the first of the aspects mentioned above.

In the SCAPE project, the Hadoop framework is an essential element of the so-called SCAPE platform. Hadoop is the core component responsible for efficiently distributing processing tasks to the available workers in a computing cluster.

Taking advantage of software development outcomes from the SCAPE project, there were two options for implementing a solution. The first was to use a module of the SCAPE platform called ToMaR, a Map/Reduce Java application that makes it easy to distribute command line application processing over a computing cluster (in the following: ARC2WARC-TOMAR). The second was a Map/Reduce application with a customised reader for the ARC format and a customised writer for the WARC format, so that the Hadoop framework can handle these web archive file formats directly (in the following: ARC2WARC-HDP).

An experiment was set up to compare the performance of the two approaches; the main question was whether the native Map/Reduce implementation had a significant performance advantage over using ToMaR with an underlying command line tool execution.

The reason why this advantage would have to be “significant” is that the ARC2WARC-HDP option has an important limitation: a native Map/Reduce implementation requires a Hadoop representation of a web archive record as the intermediate representation between reading the records from the ARC files and writing the records to WARC files. As this representation uses a byte array field to store the record's payload content, there is a theoretical limit of around 2 GB, because the length of a Java byte array is an integer value bounded by Integer.MAX_VALUE. In practice, the limit on payload content size may be much lower, depending on the hardware setup and configuration of the cluster.
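
A minimal sketch of such an intermediate record type (a hypothetical class, not the actual SCAPE implementation) shows where the limit comes from: a Hadoop Writable has to serialise and deserialise the payload through a byte array whose length is an int.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical intermediate representation of a web archive record.
// The payload is held in a byte array, so a single record can never exceed
// Integer.MAX_VALUE (~2 GB) bytes, and in practice much less, depending on
// the heap available to each task.
public class WebArchiveRecordWritable implements Writable {

    private String url = "";
    private String mimeType = "";
    private byte[] payload = new byte[0];

    public void set(String url, String mimeType, byte[] payload) {
        this.url = url;
        this.mimeType = mimeType;
        this.payload = payload;
    }

    public byte[] getPayload() {
        return payload;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeUTF(mimeType);
        out.writeInt(payload.length);   // length is an int: hard ~2 GB ceiling
        out.write(payload);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        mimeType = in.readUTF();
        payload = new byte[in.readInt()]; // the whole payload is materialised in memory
        in.readFully(payload);
    }
}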

This limitation would make an alternative solution necessary for records with large payload content. And such a separation between "small" and "large" records would increase the complexity of the application, especially when contextual information across different container files has to be involved in the migration process.

The implementations used for the migration are proof-of-concept tools, which means they are not intended to run a production migration at this stage. Concretely, they have the following limitations:

  1. Related to ARC2WARC-HDP, as already mentioned, there is a size limit for the in-memory representation of a web archive record. The largest ARC file in the data sets used in these experiments is around 300 MB, so record payload content can easily be stored in byte array fields.

  2. Exceptions are caught and logged, but there is no gathering of processing errors or other analytic results. As the focus here lies on performance evaluation, details of record processing are not taken into consideration.

  3. The current implementations neither do quality assurance nor involve contextual information, both of which were mentioned above as important aspects of the ARC to WARC migration.

The implementations are based on the Java Web Archive Toolkit (JWAT) for reading ARC container files and iterating over their records.

As an example of a process that can be hooked in while the data is being read, the implementations include Apache Tika to identify the payload content as an optional feature. All Hadoop job executions were therefore tested both with and without payload content identification enabled.
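
The reading loop roughly follows the pattern sketched below; note that the JWAT and Tika calls shown here (ArcReaderFactory.getReader, getNextRecord, hasPayload, getPayload().getInputStream(), Tika.detect) are given as I understand those APIs and are not copied from the SCAPE code, so they should be checked against the library versions actually used.

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.jwat.arc.ArcReader;
import org.jwat.arc.ArcReaderFactory;
import org.jwat.arc.ArcRecordBase;

// Iterate over the records of one ARC file with JWAT and, optionally,
// identify the payload with Apache Tika (the "-p" flag in the tools above).
public class ArcRecordIteration {

    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        InputStream in = new FileInputStream(args[0]);
        ArcReader reader = ArcReaderFactory.getReader(in);
        try {
            ArcRecordBase record;
            while ((record = reader.getNextRecord()) != null) {
                if (record.hasPayload()) {
                    // Detect the payload MIME type from the content stream.
                    String mime = tika.detect(record.getPayload().getInputStream());
                    System.out.println(mime);
                    // In the actual migration the record and its payload would be
                    // written to a WARC file at this point instead of printed.
                }
            }
        } finally {
            reader.close();
            in.close();
        }
    }
}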

As already mentioned, the ARC2WARC-HDP application was implemented as a Map/Reduce application which is started from the command line as follows:

hadoop jar arc2warc-migration-hdp-1.0-jar-with-dependencies.jar \
-i hdfs:///user/input/directory -o hdfs:///user/output/directory

The ARC2WARC-TOMAR workflow uses a command line Java implementation executed via ToMaR. One bash script was used to prepare the input needed by ToMaR and another to execute the ToMaR Hadoop job; a combined representation of the workflow is available as a Taverna workflow.

A so-called “tool specification” is needed to start an action in a ToMaR Hadoop job; it specifies the inputs and outputs and the Java command to be executed:

<?xml version="1.0" encoding="utf-8" ?>
<tool xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://scape-project.eu/tool tool-1.0_draft.xsd"
    xmlns="http://scape-project.eu/tool"
    xmlns:xlink="http://www.w3.org/1999/xlink" schemaVersion="1.0" name="bash">
  <operations>
    <operation name="migrate">
      <description>ARC to WARC migration using arc2warc-migration-cli</description>
      <command>
        java -jar /usr/local/java/arc2warc-migration-cli-1.0-jar-with-dependencies.jar -i ${input} -o ${output}
      </command>
      <inputs>
        <input name="input" required="true">
          <description>Reference to input file</description>
        </input>
      </inputs>
      <outputs>
        <output name="output" required="true">
          <description>Reference to output file</description>
        </output>
      </outputs>
    </operation>
  </operations>
</tool>

All commands allow use of a “-p” flag to enable Apache Tika identification of payload content.

The cluster used in the experiment has one controller machine (master) and five worker machines (slaves). The master node has two quad-core CPUs (8 physical/16 HyperThreading cores) with a clock rate of 2.40 GHz and 24 gigabytes of RAM. The slave nodes each have one quad-core CPU (4 physical/8 HyperThreading cores) with a clock rate of 2.53 GHz and 16 gigabytes of RAM. Regarding the Hadoop configuration, five processor cores of each worker machine were assigned to map tasks, two cores to reduce tasks, and one core was reserved for the operating system, giving a total of 25 processing cores for map tasks and 10 cores for reduce tasks.
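
Assuming a Hadoop 1.x (MapReduce v1) setup, which matches the era of this work, such a per-node slot assignment would be configured in mapred-site.xml roughly as follows; this is a sketch, not the actual cluster configuration:

<?xml version="1.0"?>
<!-- mapred-site.xml on each worker node (MapReduce v1 style settings;
     property names and values are assumptions for illustration) -->
<configuration>
  <!-- five concurrent map tasks per worker node -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>5</value>
  </property>
  <!-- two concurrent reduce tasks per worker node -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>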

The experiment was executed using data sets of different sizes: one with 1000 ARC files totalling 91.58 gigabytes, one with 4924 ARC files totalling 445.47 gigabytes, and finally one with 9856 ARC files totalling 890.21 gigabytes.

A summary of the results is shown in the table below.

                                       Obj./hour   Throughput   Avg. time/item
                                       (num)       (GB/min)     (s)
  Baseline                             834         1.2727       4.32
  Map/Reduce           1000 ARC files  4592        7.0089       0.78
                       4924 ARC files  4645        7.0042       0.77
  ToMaR                1000 ARC files  4250        6.4875       0.85
                       4924 ARC files  4320        6.5143       0.83
                       9856 ARC files  8300        12.4941      0.43
  Baseline (w. Tika)                   545         0.8321       6.60
  Map/Reduce w. Tika   1000 ARC files  2761        4.2139       1.30
                       4924 ARC files  2813        4.2419       1.28
  ToMaR w. Tika        1000 ARC files  3318        5.0645       1.09
                       4924 ARC files  2813        4.2419       1.28
                       9856 ARC files  7007        10.5474      0.51

The baseline value was determined by executing a standalone Java application that shifts content and metadata from one container format to the other using JWAT. It was executed on one worker node of the cluster and serves as a point of reference for the distributed processing.

Some observations on these data: compared to the local Java application, the cluster processing shows a significant performance increase for all Hadoop jobs – which should not be a surprise, as this is the purpose of distributed processing. Throughput also increases significantly with a larger data set, which becomes apparent when comparing the throughput of the 4924 and 9856 ARC file data sets. Between the two approaches ARC2WARC-HDP and ARC2WARC-TOMAR there is only a slight difference, which – given the above-mentioned caveats of the Map/Reduce implementation – highlights ToMaR as a promising candidate for production usage. Finally, the figures show that enabling Apache Tika increases the processing time by up to 50%.
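
As a back-of-the-envelope check of how these figures relate to each other: processing the 9856-file data set (890.21 GB) at 12.4941 GB/min corresponds to roughly 890.21 / 12.4941 ≈ 71 minutes of wall-clock time, and 9856 files in about 1.19 hours gives the reported rate of roughly 8300 objects per hour.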

As an outlook on further work, and following the arguments outlined here, resolving contextual dependencies in order to create self-contained WARC files is the next point to look into.

As a final remark, the proof-of-concept implementations presented here are far from a workflow that can be used in production. There is an ongoing discussion in the web archiving community as to whether it makes sense for memory institutions to tackle such a migration project at all; ensuring backwards compatibility of the Wayback Machine and safely preserving the contextual information is a viable alternative.

Many thanks to my colleagues nearby and in the SCAPE project who gave me useful hints and support.
