In my PhD I look at how we might rethink AI systems from the perspective of empowering individual or marginalised forms of musical expression. In particular I look at sample-based music making, and how AI is impacting the relationships between creators, communities, sounds and technologies.
Can AI Support Sampling?
Can AI support sampling? Kind of… AI is super useful for organising and finding samples in a library1, or for re-synthesising a target sound from scratch2. We’re even at a point where generative models are small and quick enough to run on local machines in real-time, allowing us to play back large libraries of samples like instruments.
All of these technologies rely on what is called a ‘latent representation’. This is the compressed middle bit that an AI model learns between input and output: the innards of the system. There are several approaches for structuring latent representations, such as prioritising user interpretability3, prioritising impact on the sound4, or prioritising information efficiency for communication with other AI systems5.
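To make the idea concrete, here is a minimal sketch of that ‘middle bit’. This is not any of the models cited above, just a toy untrained encoder/decoder pair; all names and sizes (`n_in`, `n_latent`, the weight matrices) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy encoder/decoder: the 8 numbers in the middle (z) are the
# 'latent representation' a real model would learn during training.
n_in, n_latent = 256, 8
W_enc = rng.standard_normal((n_in, n_latent)) * 0.1  # encoder weights (untrained)
W_dec = rng.standard_normal((n_latent, n_in)) * 0.1  # decoder weights (untrained)

x = rng.standard_normal(n_in)  # stand-in for one frame of audio features
z = np.tanh(x @ W_enc)         # compress: 256 numbers down to 8
x_hat = z @ W_dec              # reconstruct the frame from the latent code

print(z.shape, x_hat.shape)    # (8,) (256,)
```

Everything a system lets you search, morph or control happens in that little `z` vector, which is why *how* it is structured matters so much.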
Each of these has its own affordances, but they share a fundamental issue: they are fixed. What if I think about my sound differently to you, or to the folks at Meta? None of these models are trained on my sounds, or on manipulating samples the way I like to, so they aren’t as useful for my practice.
“Just Train Your Own Model…? OK!”
We could retrain our own models from scratch… if we had all the data and compute required for high-quality music/audio generation, and if we were comfortable with the environmental overhead. But it would be much better if we had ways of reusing what is already there. Big AI models are flattening because they’re big, but that same scale is also what makes them generalise so well. If we trained our own, they’d be small, bad, or take ages and lots of hacking to work… Monica Dinculescu from Google’s Magenta says ‘it’d be like trying to learn all of music theory from a single song’.6
So it would be cool to personalise existing models on our own samples, and with control attributes of our choice. MALT is a system for quickly training such a small, personalised control model. Based on prior work7, it works by learning a small subspace from an existing VAE, adding a regularisation term to the loss until that subspace ‘makes sense’.
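As a rough illustration of what ‘regularising until it makes sense’ can mean, here is a hypothetical attribute-regularisation loss in the spirit of the prior work on latent regularisation. It rewards a latent axis for *ordering* a batch of sounds the same way a user-chosen control attribute (say, brightness) orders them. The function name, `delta`, and the data are all my own illustrative assumptions, not MALT’s actual implementation:

```python
import numpy as np

def attribute_reg_loss(z_dim, attr, delta=1.0):
    """z_dim: (batch,) values of one latent axis (the learned subspace dimension).
       attr:  (batch,) the user's control attribute for each sound."""
    # all pairwise differences within the batch
    dz = z_dim[:, None] - z_dim[None, :]
    da = attr[:, None] - attr[None, :]
    # loss is low when tanh(dz) agrees with the sign of da,
    # i.e. when the latent axis sorts sounds the way the attribute does
    return np.mean(np.abs(np.tanh(delta * dz) - np.sign(da)))

rng = np.random.default_rng(1)
attr = rng.standard_normal(16)                        # e.g. brightness per sample
aligned = attribute_reg_loss(attr.copy(), attr)       # axis already ordered like attr
shuffled = attribute_reg_loss(rng.permutation(attr), attr)
print(f"aligned={aligned:.3f}  shuffled={shuffled:.3f}")  # aligned axis scores lower
```

Minimising a term like this during the subspace training nudges one dimension of the latent space to line up with the attribute you care about, which is what makes the resulting knob feel meaningful.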
I implemented this in a Max patch so that everything from training to music making can happen on one machine… your data never leaves your computer.
(Check back! I'll update this page with a more detailed outline soon)