Today, I perform a small experiment to investigate whether a carefully designed loss function can help a very low-capacity neural network “spend” that capacity only on perceptually relevant features. If we can design audio codecs like Ogg Vorbis that allocate bits according to perceptual relevance, then we should be able to design a loss function that penalizes perceptually relevant errors, and doesn’t bother much with those that fall near or below the threshold of human awareness.

Given enough capacity, data, and time, a model trained using mean squared error will eventually produce pleasing samples, but can we produce those same samples with less of all three?

I used zounds and pytorch to build a small experiment where I dip the first half of my pinkie toe into these waters.

The Experiment

In this experiment, I’ll train a “generator” network to transform a fixed noise vector (128-dimensional, drawn from a Gaussian distribution with zero mean, unit variance, and a diagonal covariance matrix) into a single fixed audio sample of dimension 8192 (~0.75 seconds at an 11025 Hz sampling rate). I’ll perform this same experiment with five different audio samples, and two different loss functions: plain mean squared error (MSE), and the perceptually-inspired loss described below.

The Perceptually-Inspired Loss

The PerceptualLoss is very rudimentary, but hopefully captures some characteristics of early stages in the human auditory processing pipeline, namely:

  • an FIR filter bank, whose filters’ center frequencies lie along the Bark scale
  • half-wave rectification (AKA ReLU)
  • a logarithmic, or decibel-like amplitude scaling
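
To make that concrete, here is a minimal PyTorch sketch of a loss with this shape. It is not the exact PerceptualLoss from the experiment’s repository: the Bark-spaced FIR filter bank is assumed to be precomputed and passed in, and the epsilon and padding choices are illustrative.

    import torch
    import torch.nn.functional as F


    class PerceptualLossSketch(torch.nn.Module):
        """Filter bank -> half-wave rectification -> log amplitude -> MSE in that space."""

        def __init__(self, filter_bank, eps=1e-8):
            super().__init__()
            # filter_bank: tensor of shape (n_filters, 1, kernel_size), one FIR
            # band-pass filter per row, center frequencies spaced along the Bark scale
            self.register_buffer('filter_bank', filter_bank)
            self.eps = eps

        def _perceptual_space(self, x):
            # x: (batch, 1, n_samples) raw audio
            bands = F.conv1d(x, self.filter_bank,
                             padding=self.filter_bank.shape[-1] // 2)
            rectified = F.relu(bands)               # half-wave rectification
            return torch.log(rectified + self.eps)  # decibel-like amplitude scaling

        def forward(self, generated, target):
            # penalize differences where they are perceptually relevant,
            # rather than per raw sample
            return F.mse_loss(self._perceptual_space(generated),
                              self._perceptual_space(target))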

The network architecture, as well as the weight initialization scheme (but not the exact initialized weights), is held constant as we subjectively evaluate the performance of our two loss functions on five different audio files containing:

  • Richard Nixon speaking
  • Bach piano music
  • an excerpt from the Top Gun soundtrack
  • a Kevin Gates track
  • a drum kit

For each (audio sample, loss) pair, the generator network is given 1000 iterations to learn to transform the fixed noise vector into the given audio sample, and “checkpoint” audio samples are recorded from the network every 250 iterations.
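
Each run then boils down to a loop like the one below. This is a sketch rather than the repository’s exact training code: Generator, load_target, and filter_bank are hypothetical stand-ins, and the Adam optimizer and learning rate are assumptions, not necessarily the settings actually used.

    import torch

    # hypothetical stand-ins for the experiment's actual components
    generator = Generator()                        # low-capacity network: 128 -> 8192 samples
    target = load_target()                         # one audio sample, shape (1, 1, 8192)
    criterion = PerceptualLossSketch(filter_bank)  # or torch.nn.MSELoss() for the baseline

    noise = torch.randn(1, 128)                    # the fixed latent vector
    optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)  # assumed settings

    checkpoints = []
    for iteration in range(1, 1001):
        optimizer.zero_grad()
        generated = generator(noise)               # attempt to reproduce the single target
        loss = criterion(generated, target)
        loss.backward()
        optimizer.step()

        if iteration % 250 == 0:                   # record a "checkpoint" rendering
            checkpoints.append(generated.detach().squeeze().cpu().numpy())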

Put another way, when over-fitting our low-capacity network to a dataset of size one, what does each loss emphasize?

Inspiration and Previous Work

This approach was inspired by the Deep Image Prior paper. While that paper sought to understand biases inherent in neural network architectures by keeping the loss fixed, and varying network structure, this experiment holds the architecture fixed, and tries to understand the contribution of different losses.

Both Autoencoding beyond pixels using a learned similarity metric and Generating Images with Perceptual Similarity Metrics based on Deep Networks explore losses that go beyond simple per-pixel (or per-sample) metrics.

The Code

The code for this experiment can be found on GitHub.

The Results

In each section below, you can hear generations from the network every 250 iterations.

Richard Nixon Speaking

Original

Mean Squared Error

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Perceptual Loss

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Conclusions

For this sample, MSE at 250 iterations is noisy and unintelligible, while the perceptual loss is already starting to be intelligible at the same point. By the end of the experiment, the samples generated by the network trained with MSE are fairly intelligible, but completely missing the sibilant at the end of the word “practice”. The network trained with perceptual loss captures this plainly and simply.

Bach Piano

Original

Mean Squared Error

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Perceptual Loss

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Conclusions

Even at the final iteration, generations from the network trained with MSE are “blurry”, and noisy. Generations from the network trained with perceptual loss are also a bit noisy, but overall, the final result is significantly clearer.

Top Gun Soundtrack

Original

Mean Squared Error

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Perceptual Loss

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Conclusions

This sample has a pretty broad frequency range, and a lot’s going on: bass guitar, synth, female vocals, and a snare drum. At 250 iterations, generations from the network trained with MSE are a noisy, low frequency mess, while generations from the network trained with perceptual loss are beginning to be intelligible.

The network trained with MSE does OK by the final iteration, but the network trained with perceptual loss does a much better job of generating the crisp high end, especially notable in the crack of the snare drum.

Kevin Gates

Original

Mean Squared Error

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Perceptual Loss

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Conclusions

While this sample covers a frequency range similar to that of the “Top Gun” soundtrack, both networks seem to struggle a bit more. Unsurprisingly at this point, the network trained with perceptual loss has managed to capture the sharp attack of the snare drum, as well as make the words “I get…” at the end of the sample almost intelligible. Generations from both networks are noisy, but the MSE generations are much more so.

Drumkit

Original

Mean Squared Error

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Perceptual Loss

250 Iterations

500 Iterations

750 Iterations

1000 Iterations

Conclusions

The difference between the two losses is not as apparent here, although there’s certainly more definition in the mid range of the tom from the network trained with perceptual loss. Interestingly, the perceptual loss network seems to plateau (or close to it) by iteration 250, while the MSE network starts pretty terribly and makes its way slowly to the same point.

Final Thoughts

While the results aren’t terribly dramatic, it does seem that the perceptually-inspired loss function has helped the network spend its limited capacity on more salient (from a human perspective) features of each sound. In general, this loss promoted:

  • crisper mid and high frequencies
  • less noise
  • quicker convergence on something intelligible

Obviously, my “perceptual model” is dead-simple, and certainly dead-wrong in many cases. Could we make even better use of model capacity by adding things that perceptual audio codecs use to save bits, like tonal and temporal masking? Our current model will penalize the generator for incorrectly producing a frequency that may be inaudible to the listener!

Addendum

The most realistic audio generation models of the recent past, namely WaveNet and SampleRNN, have some interesting (and surprising) properties:

  • they model audio at the PCM sample level, and not at the FFT frame level, which the vast majority of previous work has preferred
  • they are auto-regressive and recurrent models, respectively, which ultimately means that each successive sample is conditioned on all previous samples. Sample generation is thus serial, and not parallelizable (later work, especially on WaveNet, makes this not entirely true)
  • the problem is modelled in both cases as a classification problem, and not a regression problem. Samples are modeled as discrete classes, and the relationships between those classes must be learned, from scratch, by the model

The question of whether it’s reasonable to expect to generate audio samples in parallel, as if they were all independent of one another, is an interesting one, but isn’t really the topic of this post.

What’s most interesting in this context is the re-framing of the problem as a classification problem. This apparently has a couple benefits:

  • While using MSE to do regression assumes a Gaussian probability distribution, this approach allows us to learn any arbitrarily complex distribution
  • With MSE and naive per-raw-sample regression, it’s possible to do pretty well by modelling low frequencies (since that’s where most of the energy lies), and making a lot of noise. By framing the problem as classification, it becomes a lot harder for the model to get “partial credit” and coast along happily
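
To make the discretization concrete: WaveNet, for example, maps each raw sample to one of 256 classes using μ-law companding, and the network predicts a categorical distribution over those classes at every step. The sketch below is illustrative rather than code taken from either paper.

    import math
    import torch

    def mu_law_encode(audio, n_classes=256):
        # audio: float tensor with samples scaled to [-1, 1]
        mu = n_classes - 1
        # compress amplitude, then quantize to integer class labels in [0, n_classes)
        companded = torch.sign(audio) * torch.log1p(mu * torch.abs(audio)) / math.log1p(mu)
        return torch.clamp(((companded + 1) / 2 * mu + 0.5).long(), 0, mu)

    # a classification-style model emits logits of shape (n_samples, n_classes) and is
    # trained with cross-entropy against these integer targets, e.g.:
    #   loss = torch.nn.functional.cross_entropy(logits, mu_law_encode(target_audio))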

This approach is very clever, but doesn’t totally sit right with me, because audio samples are continuous, so why can’t we model them as such? Is the problem a missing perceptual model that maps groups of “raw” audio samples into a more perceptually appropriate space? Would regression in that space work better?