Today, I perform a small experiment to investigate whether a carefully designed loss function can help a very low-capacity neural network “spend” that capacity only on perceptually relevant features. If we can design audio codecs like Ogg Vorbis that allocate bits according to perceptual relevance, then we should be able to design a loss function that penalizes perceptually relevant errors, and doesn’t bother much with those that fall near or below the threshold of human awareness.

Given enough capacity, data, and time, a model trained using mean squared error will eventually produce pleasing samples, but can we produce those same samples with less of all three?

I used zounds and pytorch to build a small experiment where I dip the first half of my pinkie toe into these waters.

The Experiment

In this experiment, I’ll train a “generator” network to transform a fixed noise vector (128-dimensional, drawn from a Gaussian distribution with zero mean, unit variance, and a diagonal covariance matrix) into a single fixed audio sample of dimension 8192 (~0.75 seconds at an 11025 Hz sampling rate). I’ll perform this same experiment with five different audio samples, and two different loss functions: plain mean squared error (MSE), and the perceptually-inspired loss described below.

The Perceptually-Inspired Loss

The PerceptualLoss is very rudimentary, but hopefully captures some characteristics of early stages in the human auditory processing pipeline, namely:

  • an FIR filter bank, whose filters’ center frequencies lie along the Bark scale
  • half-wave rectification (AKA ReLU)
  • a logarithmic, or decibel-like amplitude scaling
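
To make that concrete, here is a minimal PyTorch sketch of a loss with this shape. It is not the exact PerceptualLoss from the experiment’s repository: the Bark-spaced FIR filter bank is assumed to be precomputed and passed in, and the epsilon and padding choices are illustrative.

    import torch
    import torch.nn.functional as F


    class PerceptualLossSketch(torch.nn.Module):
        """Filter bank -> half-wave rectification -> log amplitude -> MSE in that space."""

        def __init__(self, filter_bank, eps=1e-8):
            super().__init__()
            # filter_bank: tensor of shape (n_filters, 1, kernel_size), one FIR
            # band-pass filter per row, center frequencies spaced along the Bark scale
            self.register_buffer('filter_bank', filter_bank)
            self.eps = eps

        def _perceptual_space(self, x):
            # x: (batch, 1, n_samples) raw audio
            bands = F.conv1d(x, self.filter_bank,
                             padding=self.filter_bank.shape[-1] // 2)
            rectified = F.relu(bands)               # half-wave rectification
            return torch.log(rectified + self.eps)  # decibel-like amplitude scaling

        def forward(self, generated, target):
            # penalize differences where they are perceptually relevant,
            # rather than per raw sample
            return F.mse_loss(self._perceptual_space(generated),
                              self._perceptual_space(target))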

The network architecture, as well as the weight initialization scheme (but not the exact initialized weights), is held constant as we subjectively evaluate the performance of our two loss functions on five different audio files containing:

  • Richard Nixon speaking
  • Bach piano music
  • an excerpt from the Top Gun soundtrack
  • a Kevin Gates track
  • a drum kit

For each (audio sample, loss) pair, the generator network is given 1000 iterations to learn to transform the fixed noise vector into the given audio sample, and “checkpoint” audio samples are recorded from the network every 250 iterations.
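
Each run then boils down to a loop like the one below. This is a sketch rather than the repository’s exact training code: Generator, load_target, and filter_bank are hypothetical stand-ins, and the Adam optimizer and learning rate are assumptions, not necessarily the settings actually used.

    import torch

    # hypothetical stand-ins for the experiment's actual components
    generator = Generator()                        # low-capacity network: 128 -> 8192 samples
    target = load_target()                         # one audio sample, shape (1, 1, 8192)
    criterion = PerceptualLossSketch(filter_bank)  # or torch.nn.MSELoss() for the baseline

    noise = torch.randn(1, 128)                    # the fixed latent vector
    optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)  # assumed settings

    checkpoints = []
    for iteration in range(1, 1001):
        optimizer.zero_grad()
        generated = generator(noise)               # attempt to reproduce the single target
        loss = criterion(generated, target)
        loss.backward()
        optimizer.step()

        if iteration % 250 == 0:                   # record a "checkpoint" rendering
            checkpoints.append(generated.detach().squeeze().cpu().numpy())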

Put another way, when over-fitting our low-capacity network to a dataset of size one, what does each loss emphasize?

Inspiration and Previous Work

This approach was inspired by the Deep Image Prior paper. While that paper sought to understand biases inherent in neural network architectures by keeping the loss fixed, and varying network structure, this experiment holds the architecture fixed, and tries to understand the contribution of different losses.

Both Autoencoding beyond pixels using a learned similarity metric and Generating Images with Perceptual Similarity Metrics based on Deep Networks explore losses that go beyond simple per-pixel (or per-sample) metrics.

The Code

The code for this experiment can be found on GitHub.

The Results

In each section below, you can hear generations from the network every 250 iterations.

Richard Nixon Speaking

Original

Mean Squared Error

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Perceptual Loss

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Conclusions

For this sample, MSE at 250 iterations is noisy and unintelligible, while the perceptual loss is already starting to be intelligible at the same point. By the end of the experiment, the samples generated by the network trained with MSE are fairly intelligible, but completely missing the sibilant at the end of the word “practice”. The network trained with perceptual loss captures this plainly and simply.

Bach Piano

Original

Mean Squared Error

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Perceptual Loss

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Conclusions

Even at the final iteration, generations from the network trained with MSE are “blurry”, and noisy. Generations from the network trained with perceptual loss are also a bit noisy, but overall, the final result is significantly clearer.

Top Gun Soundtrack

Original

Mean Squared Error

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Perceptual Loss

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Conclusions

This sample has a pretty broad frequency range, and a lot’s going on: bass guitar, synth, female vocals, and a snare drum. At 250 iterations, generations from the network trained with MSE are a noisy, low frequency mess, while generations from the network trained with perceptual loss are beginning to be intelligible.

The network trained with MSE does OK by the final iteration, but the network trained with perceptual loss does a much better job of generating the crisp high end, especially notable in the crack of the snare drum.

Kevin Gates

Original

Mean Squared Error

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Perceptual Loss

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Conclusions

While this sample covers a frequency range similar to that of the “Top Gun” soundtrack, both networks seem to struggle a bit more. Unsurprisingly at this point, the network trained with perceptual loss has managed to capture the sharp attack of the snare drum, as well as make the words “I get…” at the end of the sample almost intelligible. Generations from both networks are noisy, but the MSE generations are much more so.

Drumkit

Original

Mean Squared Error

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Perceptual Loss

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Conclusions

The difference between the two losses is not as apparent here, although there’s certainly more definition in the mid range of the tom from the network trained with perceptual loss. Interestingly, the perceptual loss network seems to plateau (or close to it) by iteration 250, while the MSE network starts pretty terribly and makes its way slowly to the same point.

Final Thoughts

While the results aren’t terribly dramatic, it does seem that the perceptually-inspired loss function has helped the network spend its limited capacity on more salient (from a human perspective) features of each sound. In general, this loss promoted:

  • crisper mid and high frequencies
  • less noise
  • quicker convergence on something intelligible

Obviously, my “perceptual model” is dead-simple, and certainly dead-wrong in many cases. Could we make even better use of model capacity by adding things that perceptual audio codecs use to save bits, like tonal and temporal masking? Our current model will penalize the generator for incorrectly producing a frequency that may be inaudible to the listener!

Addendum

The most realistic audio generation models of the recent past, namely WaveNet and SampleRNN, have some interesting (and surprising) properties:

  • they model audio at the PCM sample level, and not at the FFT frame level, which the vast majority of previous work has preferred
  • they are auto-regressive and recurrent models, respectively, which ultimately means that each successive sample is conditioned on all previous samples. Sample generation is thus serial, and not parallelizable (later work, especially on WaveNet, makes this not entirely true)
  • the problem is modelled in both cases as a classification problem, and not a regression problem. Samples are modeled as discrete classes, and the relationships between those classes must be learned, from scratch, by the model

The question of whether it’s reasonable to expect to generate audio samples in parallel, as if they were all independent of one another, is an interesting one, but isn’t really the topic of this post.

What’s most interesting in this context is the re-framing of the problem as a classification problem. This apparently has a couple benefits:

  • While using MSE to do regression assumes a Gaussian probability distribution, this approach allows us to learn any arbitrarily complex distribution
  • With MSE and naive per-raw-sample regression, it’s possible to do pretty well by modelling low frequencies (since that’s where most of the energy lies), and making a lot of noise. By framing the problem as classification, it becomes a lot harder for the model to get “partial credit” and coast along happily
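
To make the discretization concrete: WaveNet, for example, maps each raw sample to one of 256 classes using μ-law companding, and the network predicts a categorical distribution over those classes at every step. The sketch below is illustrative rather than code taken from either paper.

    import math
    import torch

    def mu_law_encode(audio, n_classes=256):
        # audio: float tensor with samples scaled to [-1, 1]
        mu = n_classes - 1
        # compress amplitude, then quantize to integer class labels in [0, n_classes)
        companded = torch.sign(audio) * torch.log1p(mu * torch.abs(audio)) / math.log1p(mu)
        return torch.clamp(((companded + 1) / 2 * mu + 0.5).long(), 0, mu)

    # a classification-style model emits logits of shape (n_samples, n_classes) and is
    # trained with cross-entropy against these integer targets, e.g.:
    #   loss = torch.nn.functional.cross_entropy(logits, mu_law_encode(target_audio))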

This approach is very clever, but doesn’t totally sit right with me, because audio samples are continuous, so why can’t we model them as such? Is the problem a missing perceptual model that maps groups of “raw” audio samples into a more perceptually appropriate space? Would regression in that space work better?