Sparse Interpretable Audio Model

Table of Contents

  1. Model Architecture
  2. Future Directions
  3. Cite this Work
  4. Sound Samples

Model Architecture

This small model attempts to decompose audio featuring acoustic instruments into a sparse set of events, each represented by a real-valued event vector and the time at which the event occurs.

While event data are encoded as real-valued vectors rather than discrete values, the learned representation still lends itself to a sparse, interpretable, and hopefully easy-to-manipulate encoding. This first draft was trained on the amazing MusicNet dataset.
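
As a concrete illustration (not the project's actual code), here is a minimal sketch of what such an encoding might look like. The event count, vector dimensionality, and normalized time range are assumptions chosen for the example:

import numpy as np

# Assumed sizes, for illustration only; the real model's dimensions are
# not specified in this post.
n_events = 16    # number of sparse events in a segment (assumed)
event_dim = 32   # dimensionality of each real-valued event vector (assumed)

rng = np.random.default_rng(0)

# An encoded segment is a small set of (time, vector) pairs: each event has
# a real-valued vector describing *what* sounds and a scalar time describing
# *when* it occurs.
event_vectors = rng.normal(size=(n_events, event_dim))
event_times = np.sort(rng.uniform(0.0, 1.0, size=n_events))

# Sparsity makes the code cheap to manipulate: nudge one event in time, or
# swap its vector, without disturbing the rest of the segment.
event_times[3] += 0.05
event_vectors[7] = event_vectors[2]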

Each sound sample below includes the following elements:

  1. The original recording
  2. The model's reconstruction
  3. New audio using the original timings, but random event vectors
  4. New audio using the original event vectors, but with random timings (see the sketch following this list)
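
The two resampled variants can be sketched as drawing new values for one half of the representation while holding the other half fixed. In the sketch below (again, not the project's actual code), random event vectors are drawn from a Gaussian fit to the sample's own vectors, matching the mean-and-variance note in the sound samples section; decode is a hypothetical stand-in for the model's renderer:

import numpy as np

def random_vector_variant(event_vectors, rng):
    # Variant 3: keep the original timings, but redraw each event vector
    # from a Gaussian fit (per-dimension mean and variance) to this
    # sample's own event vectors.
    mu = event_vectors.mean(axis=0)
    sigma = event_vectors.std(axis=0)
    return rng.normal(mu, sigma, size=event_vectors.shape)

def random_timing_variant(event_times, rng):
    # Variant 4: keep the original event vectors, but redraw the event
    # times uniformly over the segment.
    return np.sort(rng.uniform(0.0, 1.0, size=event_times.shape))

# Either variant would then be rendered with the model's decoder, e.g.:
#   audio = decode(random_vector_variant(event_vectors, rng), event_times)  # variant 3
#   audio = decode(event_vectors, random_timing_variant(event_times, rng))  # variant 4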

Future Directions

There are several areas that could provide further gains in compression and interpretability.

Cite this Work

@misc{vinyard2023audio,
    author = {Vinyard, John},
    title = {Sparse Interpretable Audio},
    url = {https://JohnVinyard.github.io/machine-learning/2023/11/15/sparse-physical-model.html},
    year = 2023
}

Sound Samples

[Four sound samples are embedded here as audio players. Each sample includes: the original recording; the model's reconstruction; a version with random event vectors, drawn from the mean and variance of that sample's event vectors; a version with random timings; and a timeline visualization of the decoded events.]