Breaking WAVEs (and some FLACs)


At the KB we have a large collection of offline optical media. Most of these are CD-ROMs, but we also have a sizeable proportion of audio CDs. We’re currently in the process of designing a workflow for stabilising the contents of these materials using disk imaging. For audio CDs this involves ‘ripping’ the tracks to audio files. Since the workflow will be automated to a high degree, basic quality checks on the created audio files are needed. In particular, we want to be sure that the created audio files are complete, as it is possible that some hardware failure during the ripping process could result in truncated or otherwise incomplete files.

To get a better idea of which software tools are best suited for this task, I created a small dataset of audio files which I deliberately damaged. I then ran each of these files through a set of candidate tools, and checked which tools were able to detect the faulty files. The first half of this blog post focuses on the WAVE format; the second half covers the FLAC format (we haven’t decided yet which format to use).

WAVE dataset

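For the WAVE dataset I started out with a small, intact WAVE file (frogs-01.wav). Using a hex editor I then made the following derivatives of this file:

  • frogs-01-last-byte-missing.wav: the last byte is missing.
  • frogs-01-last-2032-bytes-missing.wav: the last 2032 bytes are missing.
  • frogs-01-byte-missing-at-offset-811537.wav: one byte is missing at offset 811537.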
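The derivatives in this post were made by hand in a hex editor, but the same kinds of damage (bytes cut from the end, or a single byte removed somewhere in the middle) could also be produced from the command line. The following is a minimal sketch, not the method used here. The function names truncate_copy and drop_byte_copy are made up for illustration, and it assumes a head and tail that support the -c option (as the GNU and BSD versions do).

```shell
#!/bin/sh
# Hypothetical helpers for making damaged copies of a file.

# Copy SRC to DST while dropping the last N bytes.
truncate_copy() {  # usage: truncate_copy SRC DST N
    size=$(wc -c < "$1")
    head -c "$((size - $3))" "$1" > "$2"
}

# Copy SRC to DST while leaving out the single byte at a zero-based OFFSET.
drop_byte_copy() {  # usage: drop_byte_copy SRC DST OFFSET
    # head emits bytes 0..OFFSET-1; tail -c +K starts at the K-th byte
    # (counting from 1), so OFFSET+2 resumes right after the skipped byte.
    { head -c "$3" "$1"; tail -c "+$(($3 + 2))" "$1"; } > "$2"
}
```

For example, truncate_copy frogs-01.wav frogs-01-last-byte-missing.wav 1 would write a copy that lacks the final byte.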

Candidate tools, WAVE

The candidate tools I used to analyse the WAVE files are:

  • jhove includes a WAVE validation module, which makes it an obvious choice. The tested version is 1.14.6, 2016-05-12.
  • shntool is a "multi-purpose WAVE data processing and reporting utility". It was first released in 2000. The tested version is 3.0.7.
  • ffmpeg is a popular conversion tool for audio and video formats. The tested version is 3.2.2.
  • mediainfo is a widely-used feature extraction tool for audiovisual files. The tested version is v0.7.81.

Note that of the above tools, only Jhove and Shntool are designed to detect problems in WAVE files. Ffmpeg and Mediainfo were primarily designed for other purposes (format conversion and technical metadata extraction), not for detecting defective files. I included them here mainly because they are widely used, and I was curious whether they would throw up anything interesting when given defective files (see footnote 1). I ran the tools with the following command-line arguments (replacing "foo.wav" with the actual file name):

Jhove

jhove -m WAVE-hul foo.wav

Shntool

shntool info foo.wav

Ffmpeg

ffmpeg -v error -i foo.wav -f null -

Mediainfo

mediainfo foo.wav

I automated this using a simple shell script that runs each tool on all files, and then writes the output to a set of text files.
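The actual script is in the repository linked at the end of this post. As a rough illustration only, a loop along the following lines would do the job. The function name run_tools and the naming of the output files are assumptions made for this sketch, not taken from that script.

```shell
#!/bin/sh
# Sketch: run every candidate tool on every WAVE file in a directory and
# keep each tool's combined stdout/stderr in its own text file.
# run_tools and the "<name>-<tool>.txt" naming scheme are hypothetical.
run_tools() {  # usage: run_tools INPUT_DIR OUTPUT_DIR
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for f in "$in_dir"/*.wav; do
        [ -e "$f" ] || continue   # no matching files: skip the literal pattern
        name=$(basename "$f" .wav)
        # stdin comes from /dev/null so no tool waits for keyboard input;
        # "|| true" keeps the loop going when a tool reports an error.
        jhove -m WAVE-hul "$f"            > "$out_dir/$name-jhove.txt"     2>&1 < /dev/null || true
        shntool info "$f"                 > "$out_dir/$name-shntool.txt"   2>&1 < /dev/null || true
        ffmpeg -v error -i "$f" -f null - > "$out_dir/$name-ffmpeg.txt"    2>&1 < /dev/null || true
        mediainfo "$f"                    > "$out_dir/$name-mediainfo.txt" 2>&1 < /dev/null || true
    done
}
```

Calling run_tools wav results would then leave four text files per WAVE file in the results directory.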

Results, WAVE

The full output results of each tool can be found here.

Jhove

The ‘Status’ field in Jhove’s output summarises the validation outcome. Here are the results for each file:

File Result
frogs-01.wav Status: Well-Formed and valid
frogs-01-last-byte-missing.wav Status: Well-Formed and valid
frogs-01-last-2032-bytes-missing.wav Status: Well-Formed and valid
frogs-01-byte-missing-at-offset-811537.wav Status: Well-Formed and valid

So, Jhove was unable to detect any of the damaged files at all!

Shntool

Shntool checks a WAVE on six criteria, which are listed in its output under ‘Possible problems’:

Possible problems:
  File contains ID3v2 tag:    no
  Data chunk block-aligned:   yes
  Inconsistent header:        no
  File probably truncated:    no
  Junk appended to file:      no
  Odd data size has pad byte: n/a

The thing to watch here is the ‘File probably truncated’ item:

File Result
frogs-01.wav File probably truncated: no
frogs-01-last-byte-missing.wav File probably truncated: yes (missing 1 byte)
frogs-01-last-2032-bytes-missing.wav File probably truncated: yes (missing 2032 bytes)
frogs-01-byte-missing-at-offset-811537.wav File probably truncated: yes (missing 1 byte)

So, Shntool was able to detect all damaged files.
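To give an idea of how this kind of truncation can be detected at all: a WAVE file is a RIFF container, and bytes 4 to 7 of its header store the size of everything that follows the first 8 bytes. Comparing that declared size with the actual file size reveals missing bytes. The sketch below illustrates the principle only. It is not a description of how Shntool works internally, the function name riff_missing_bytes is made up, and it does no validation (it assumes the file is at least 8 bytes long and really is a RIFF file).

```shell
#!/bin/sh
# Sketch: print how many bytes a RIFF/WAVE file is short of the size
# declared in its header (0 means the two sizes agree).
riff_missing_bytes() {  # usage: riff_missing_bytes FILE
    file=$1
    actual=$(wc -c < "$file")
    # Bytes 4-7 hold the RIFF chunk size as a little-endian 32-bit integer;
    # od prints them as four unsigned decimal values.
    set -- $(od -An -tu1 -j4 -N4 "$file")
    # The declared size excludes the 8-byte "RIFF" + size prefix itself.
    declared=$(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) + 8 ))
    echo $(( declared - actual ))
}
```

Since the three damaged files all lost bytes well beyond the header, the declared size in each is presumably still that of the intact file, which would explain why a size comparison catches every one of them.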

Ffmpeg

For our Ffmpeg call we monitor any errors that are sent to the standard error stream. The results:

File Result
frogs-01.wav
frogs-01-last-byte-missing.wav [pcm_s16le @ 0x3545380] Invalid PCM packet, data has size 3 but at least a size of 4 was expected
Error while decoding stream #0:0: Invalid data found when processing input
frogs-01-last-2032-bytes-missing.wav
frogs-01-byte-missing-at-offset-811537.wav [pcm_s16le @ 0x2768380] Invalid PCM packet, data has size 3 but at least a size of 4 was expected
Error while decoding stream #0:0: Invalid data found when processing input

Interestingly, Ffmpeg reports an error for both files that have 1 byte missing, but not for the file that has 2032 bytes missing. This suggests that Ffmpeg is not a reliable tool for detecting broken WAVE files.
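A possible explanation, based only on the wording of the error message and not verified against Ffmpeg's source code: the decoder appears to expect the audio data in whole units of 4 bytes ("at least a size of 4 was expected"), which would fit 16-bit stereo audio. Under that assumption, what matters is not how many bytes are missing, but whether the remaining data still divides evenly into 4-byte units. The arithmetic can be sketched as follows (leftover_bytes is a made-up name, and the 4-byte unit size is the assumption just described):

```shell
#!/bin/sh
# Sketch: bytes left over at the end of the audio data after MISSING bytes
# are removed, assuming the intact data is a whole number of 4-byte units.
leftover_bytes() {  # usage: leftover_bytes MISSING
    unit=4
    echo $(( (unit - $1 % unit) % unit ))
}

leftover_bytes 1      # prints 3
leftover_bytes 2032   # prints 0
```

With 1 byte missing, the final unit is left with 3 bytes, matching the "data has size 3" in the error message. With 2032 bytes missing (508 times 4), nothing is left over, so the file would simply look like a slightly shorter recording.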

Mediainfo

Mediainfo didn’t report errors or warnings for any of these files. This is not surprising, but it does confirm that Mediainfo cannot be used for detecting broken WAVE files.

FLAC dataset

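Analogous to the WAVE dataset, I started out with a small, intact FLAC file (frogs-01.flac), which I then butchered into the following derivative files:

  • frogs-01-last-byte-missing.flac: the last byte is missing.
  • frogs-01-last-1000-bytes-missing.flac: the last 1000 bytes are missing.
  • frogs-01-byte-missing-at-offset-651202.flac: one byte is missing at offset 651202.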

Candidate tools, FLAC

The set of candidate tools is identical to the one used for the WAVE analysis, with two exceptions:

  • flac is the reference implementation of the FLAC format. The tested version is 1.3.0.
  • Since Jhove does not include a FLAC module, it was not used.

Flac

The Flac tool can encode audio to FLAC, and decode and analyze FLAC files. For these tests I ran it with the -t (or --test) option:

flac -t foo.flac

This decodes a FLAC without writing the decoded data to a file. Any errors during the decoding process are reported to the standard error stream.
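In an automated workflow it may be more convenient to look at the tool's exit status than to read the error stream. The sketch below assumes that flac exits with a non-zero status when the test fails, which is my understanding of its behaviour but was not part of the tests in this post. The function name flac_status is made up, and -s (silent) is used only to suppress the progress output.

```shell
#!/bin/sh
# Sketch: print OK or DAMAGED for one FLAC file, based on the exit status
# of "flac -t". Assumes a failed test yields a non-zero exit status.
flac_status() {  # usage: flac_status FILE
    if flac -t -s "$1" > /dev/null 2>&1; then
        echo OK
    else
        echo DAMAGED
    fi
}
```

Note that this simple version also prints DAMAGED when the file does not exist or when flac is not installed, so a real workflow would want to tell those cases apart.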

Results, FLAC

The full output results of each tool can be found here.

Shntool

Even though Shntool supports FLAC, it was not able to detect the missing data in any of the files:

File Result
frogs-01.flac File probably truncated: no
frogs-01-last-byte-missing.flac File probably truncated: no
frogs-01-last-1000-bytes-missing.flac File probably truncated: no
frogs-01-byte-missing-at-offset-651202.flac File probably truncated: no

So, Shntool does not provide any meaningful information on whether a FLAC is damaged.

Ffmpeg

Here are the results for Ffmpeg:

File Result
frogs-01.flac
frogs-01-last-byte-missing.flac [flac @ 0x294b860] overread: 1
Error while decoding stream #0:0: Invalid data found when processing input
frogs-01-last-1000-bytes-missing.flac [flac @ 0x3c5d860] overread: 1
Error while decoding stream #0:0: Invalid data found when processing input
frogs-01-byte-missing-at-offset-651202.flac [flac @ 0x279faa0] overread: 1
Error while decoding stream #0:0: Invalid data found when processing input

So, Ffmpeg was able to identify all damaged FLACs.

Mediainfo

Similar to the WAVE results, Mediainfo again didn’t report errors or warnings for any of these files.

Flac

Finally the results for the Flac tool:

File Result
frogs-01.flac
frogs-01-last-byte-missing.flac ERROR while decoding data
state = FLAC__STREAM_DECODER_END_OF_STREAM
frogs-01-last-1000-bytes-missing.flac ERROR while decoding data
state = FLAC__STREAM_DECODER_END_OF_STREAM
frogs-01-byte-missing-at-offset-651202.flac ERROR while decoding data
state = FLAC__STREAM_DECODER_READ_FRAME

So, the Flac tool was able to identify all defective files (see footnote 2).

Conclusion

Out of the candidate tools considered here, only Shntool was able to identify all damaged WAVE files in this experiment. As a result, this (ancient!) tool still appears to be the best choice for detecting damaged WAVE files. Surprisingly, Jhove was unable to detect any of the damaged files at all, and is probably best avoided for this particular purpose. For FLAC, both the Flac tool (the FLAC reference implementation) and Ffmpeg were able to detect all damaged files, and both appear to be suitable tools.
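For use in an automated workflow, Shntool's report would need to be turned into a simple pass/fail signal. One way to do that is to look for the "File probably truncated" line in the output of shntool info. The following is a minimal sketch under that assumption. The function name reports_truncation is made up, and the pattern relies on the output format shown earlier in this post staying the same across Shntool versions.

```shell
#!/bin/sh
# Sketch: read the output of "shntool info" on standard input and succeed
# (exit status 0) if it flags the file as probably truncated.
reports_truncation() {
    grep -q 'File probably truncated:[[:space:]]*yes'
}

# Example use (requires shntool to be installed):
#   shntool info foo.wav | reports_truncation && echo "foo.wav looks truncated"
```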

Dataset and scripts

All example files, scripts and raw tool output are available here:

https://github.com/KBNLresearch/detectDamagedAudio

Post scriptum: update on MediaInfo and MediaConch

In response to this post the developers of MediaInfo added support for detecting truncated WAVE files. This should cover all of the damaged WAVE files presented here. Moreover, their Twitter account announced that detection of FLAC flaws is planned for the MediaConch tool, but that they are looking for sponsors for this.


  1. Also, this thread on superuser.com recommends Ffmpeg for checking the integrity of video files.

  2. On a side note, I noticed that the error stream of the Flac tool sometimes contained a sequence of 21 non-printable ‘0x08’ (backspace) characters. This is probably a bug.

4 Comments

  1. David Russo
    November 21, 2019 @ 2:10 pm CET

    As an update on JHOVE’s performance: JHOVE 1.20 correctly reports truncated files as not well-formed, along with the number of bytes missing and the ID of the truncated chunk when available.

  2. Yvonne Tunnat
    January 16, 2017 @ 10:15 am CET

    Hi Johan,
    this is really interesting. I do similar stuff with JPEG & TIFF, and reading your post I have a great idea for how to test JHOVE with PDF files.
    I am really jealous of your cool script, I really have to extend my script skills. 🙂
    Best, Yvonne

  3. johan
    January 4, 2017 @ 3:52 pm CET

    Hi Carl, yes of course you can use those WAVs. I just added a license statement to the repo’s readme (CC-BY).

  4. Carl Wilson
    January 4, 2017 @ 3:29 pm CET

    Hi Johan, this is interesting work. I’m commenting as I’m currently adding some test WAV files to JHOVE. I presume I’m free to borrow the WAV data here? I have other sources but some synthetic, broken files would be useful. FYI this doesn’t mean that the WAV module will be fixed to address these testing issues right away. Using other FOSS tools to test JHOVE seems to be the best way of ensuring that JHOVE is doing its job properly. This blog post is an example of similar work using Bad Peggy to test JHOVE’s JPEG validator: https://openpreservation.org/blog/2016/11/29/jpegvalidation/
