Introduction
This white paper outlines Boomcraft, a system that helps craft music using AI. Rather than artificially generating music with models whose training data is copyrighted by others, the approach here is one of Augmented Intelligence: it augments the musician’s own skills while preserving privacy and the rights to the music that is created. The system can be used by music producers, DJs, and other music professionals to create new music components from multimodal prompts, including humming a melody, text, and existing license-free or licensed music content.
Unique proposition
License-free content for training data
To avoid the legal issues associated with Udio, Suno and other AI music generation applications, our models are trained on license-free content and avoid the “black box” composition generation of other services. The produced music can be licensable: we ensure that the composition does not infringe on any master recording. Note, however, that we cannot guarantee that the composition does not infringe on earlier, pre-existing underlying compositions.
More specifically, we have 20TB of license-free individual instrument loops that can be combined to create new compositions. The instrument tracks can be indexed by:
- Instrument type. All instruments in a given track share the same key, tempo and time signature.
- Key: e.g. C major, B flat minor, etc. Common to all pitched instruments.
- Tempo / speed: e.g. 60 to 200 beats per minute. Applies to both pitched instruments and drums.
- Time signature: 4/4 (most common), 3/4, 6/8, etc. Applies to both pitched instruments and drums.
There are two versions of the pipeline:
Proof of Concept (PoC)
The input prompt is a specification of the instrument, key, tempo and time signature. The input prompt may be text or speech. The output is a loop that is generated based on the input prompt. The prompt may also result from an interaction with an AI agent.
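As an illustration, a text prompt of this kind could be turned into a structured loop specification with a very small rule-based parser. This is only a sketch; the field names, instrument list and regular expressions below are ours, not a fixed API.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoopSpec:
    """Structured request for a loop: instrument, key, tempo and time signature."""
    instrument: Optional[str] = None
    key: Optional[str] = None
    tempo_bpm: Optional[float] = None
    time_signature: Optional[str] = None

KNOWN_INSTRUMENTS = ("piano", "guitar", "bass", "drums", "strings", "synth")

def parse_prompt(prompt: str) -> LoopSpec:
    """Rule-based parser for prompts such as
    'piano with key B flat minor, 120 beats per minute and 4/4 time signature'."""
    text = prompt.lower()
    spec = LoopSpec()
    for name in KNOWN_INSTRUMENTS:
        if name in text:
            spec.instrument = name
            break
    m = re.search(r"key\s+([a-g](?:\s?(?:sharp|flat|#|b))?)\s*(major|minor)?", text)
    if m:
        spec.key = " ".join(p for p in m.groups() if p)
    m = re.search(r"(\d{2,3})\s*(?:beats per minute|bpm)", text)
    if m:
        spec.tempo_bpm = float(m.group(1))
    m = re.search(r"(\d+\s*/\s*\d+)", text)
    if m:
        spec.time_signature = m.group(1).replace(" ", "")
    return spec

print(parse_prompt("piano with key B flat minor, 120 beats per minute and 4/4 time signature"))
```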
The musician can then select the loop that is most suitable for the composition.
The 20TB library is annotated with the above features and the model is trained to generate new loops based on the input prompt. Annotations are done either manually or using AI models that can detect the key, tempo and time signature of the loop. The library may already contain annotations based on folder structure and therefore some of the features may be automatically extracted.
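A minimal sketch of the AI-assisted annotation path is shown below, using librosa for tempo estimation and a simple chroma-template heuristic for key detection. The key profiles and helper name are ours, and time-signature detection is left out since it typically needs a dedicated model.

```python
import numpy as np
import librosa

# Krumhansl-Schmuckler key profiles (major / minor), used as rough templates.
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def annotate_loop(path: str) -> dict:
    """Estimate tempo (BPM) and key for a single loop file."""
    y, sr = librosa.load(path, sr=22050, mono=True)

    # Tempo from the beat tracker.
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)

    # Key from the average chroma vector, correlated against rotated key templates.
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr).mean(axis=1)
    best_key, best_corr = None, -np.inf
    for tonic in range(12):
        for name, profile in (("major", MAJOR), ("minor", MINOR)):
            corr = np.corrcoef(chroma, np.roll(profile, tonic))[0, 1]
            if corr > best_corr:
                best_corr, best_key = corr, f"{NOTES[tonic]} {name}"

    return {"path": path, "tempo_bpm": float(np.atleast_1d(tempo)[0]), "key": best_key}

# Example: annotate_loop("loops/bass/loop_001.wav")
```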
The partially annotated dataset may be of the order of hundreds of tracks (a few tens of GBs) and needs to be fully annotated. For AI to be able to assist with the annotation, the dataset will be split into two parts: a training dataset and a validation dataset. The training dataset will be used to train an AI model to predict the key, tempo and time signature of a loop, and the validation dataset will be used to evaluate the model. The model will be trained to minimize the error between the predicted and the actual key, tempo and time signature of each loop.
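A sketch of the split-and-train step for key prediction follows, assuming the annotated subset is available as (feature, label) pairs built from chroma vectors as above; the stand-in data and classifier choice are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X: one mean-chroma vector (12 values) per annotated loop; y: its labeled key.
# Shapes are fabricated here purely to show the workflow.
rng = np.random.default_rng(0)
X = rng.random((400, 12))          # stand-in for real chroma features
y = rng.integers(0, 24, size=400)  # stand-in for 24 key classes (12 major + 12 minor)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```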
The input prompt will be embedded in the same space as the loops and the model will return the top-k closest new loops to the prompt. This is the generative AI facility that we provide to customers.
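A minimal sketch of this retrieval step is given below, assuming loop embeddings have already been computed with a text-audio model such as CLAP (see the Embedding generation section). The model name, embedding dimensions and helper function are illustrative, and the loop embeddings are stand-ins.

```python
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Pre-computed loop embeddings (one row per loop) and their ids -- stand-ins here.
loop_embeddings = np.random.default_rng(0).random((1000, 512)).astype(np.float32)
loop_ids = [f"loop_{i:04d}" for i in range(1000)]

def top_k_loops(prompt: str, k: int = 5) -> list[str]:
    """Embed the text prompt in the shared text-audio space and return the k closest loops."""
    inputs = processor(text=[prompt], return_tensors="pt")
    with torch.no_grad():
        query = model.get_text_features(**inputs).numpy()[0]

    # Cosine similarity between the prompt embedding and every loop embedding.
    query = query / np.linalg.norm(query)
    loops = loop_embeddings / np.linalg.norm(loop_embeddings, axis=1, keepdims=True)
    scores = loops @ query
    return [loop_ids[i] for i in np.argsort(scores)[::-1][:k]]

print(top_k_loops("piano, B flat minor, 120 bpm, 4/4"))
```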
Advanced Model (AM)
Our training pipeline takes these instruments and instrument loops as input and creates an internal representation that allows the model to:
- Generate new loops, configurable by the musician through a prompt. A prompt may be as simple as an explicit configuration of the instrument, key, tempo and time signature, or a more abstract prompt such as a rough track / demo / user input, or text (“piano with key B flat minor, 120 beats per minute and 4/4 time signature”).
- Condition loop generation on genre, offered as a selectable option.
- Complete a partial track: the AI determines its key, tempo and time signature and generates new loops that enhance / complete the partial track based on its features.
Although the purpose of this white paper is not to detail our AI modeling approach, in its simplest form we can satisfy the above requirements as follows:
- Use the database developed by Disco Theory to create embeddings of the individual loops. See the technical approach section for how this is done at a high level.
- The rough cut / user input acts as the query / prompt to the model and is itself embedded in the same space.
- The AI model returns the top-k closest new loops to the prompt, and the musician is then able to select the loops that are most suitable for the composition.
Multimodal prompts
The system is designed to be a Software as a Service (SaaS) application that can be used by music producers, DJs, and other music professionals to create music from multimodal prompts that include:
- Humming a melody
- Text
- Other existing music provided by the user.
The prompts generate music using a variety of generative AI models, and the application lets the composer iterate over the generated music using a variety of sound transformations to create a final composition.
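For example, a couple of the simpler transformations (pitch shifting and time stretching) can be sketched with off-the-shelf signal processing; the file paths below are placeholders.

```python
import librosa
import soundfile as sf

# Load a generated loop (placeholder path).
y, sr = librosa.load("generated/loop_piano_bbmin_120.wav", sr=None, mono=True)

# Shift the loop up by two semitones.
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# Stretch the loop to 90% of its original speed without changing pitch.
y_stretched = librosa.effects.time_stretch(y, rate=0.9)

sf.write("generated/loop_piano_bbmin_120_up2.wav", y_shifted, sr)
sf.write("generated/loop_piano_bbmin_120_slow.wav", y_stretched, sr)
```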
Open source baseline models
The so-called baseline AI models are released under the MIT license and can be used by anyone to create new music components.
The release of the model source code allows finetuning the model on your own private collection, so that the finetuned model can generate music that closely resembles your own style. Baseline models are updated regularly to include the latest research in AI music generation as well as new license-free content. Since training models is very expensive, the baseline models are trained and released periodically, every few months.
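As a rough illustration of the finetuning workflow only: the model class, checkpoint path, objective and data below are placeholders, not the actual released baseline.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder stand-in for a released baseline model; the real architecture ships with the MIT-licensed code.
class BaselineLoopModel(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU())
        self.head = nn.Linear(256, dim)

    def forward(self, x):
        return self.head(self.encoder(x))

model = BaselineLoopModel()
# model.load_state_dict(torch.load("baseline_2024_12.pt"))  # start from the published baseline weights

# Freeze the encoder and finetune only the head on the private collection.
for p in model.encoder.parameters():
    p.requires_grad = False

# Stand-in for embeddings of the user's private loops.
private = TensorDataset(torch.randn(256, 512))
loader = DataLoader(private, batch_size=32, shuffle=True)

opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
loss_fn = nn.MSELoss()

for epoch in range(3):
    for (batch,) in loader:
        opt.zero_grad()
        loss = loss_fn(model(batch), batch)  # simple reconstruction objective as a placeholder
        loss.backward()
        opt.step()
```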
Business Model
The business model is based on a SaaS subscription and a web-based UI that allows the user to compose music using the offered baseline models. A basic subscription allows the user to select a baseline model (similarly to how ChatGPT allows such a selection today). The costs associated with this basic tier are absorbed by the SaaS provider, and the system supports thousands of concurrent requests towards our cloud-hosted infrastructure where inference happens. The services that host such models tend to require higher-end CPU and memory compute and, depending on the complexity of the model and the latency requirements, typically require GPU acceleration as well.
Add-on services such as finetuning (see the corresponding dedicated section) are upsold; the associated fee covers storage of the private data, storage of multiple versions of a finetuned model, and the ephemeral compute costs of the finetuning job, which typically lasts from a few hours to a few days. Note, however, that such a job requires expensive AI acceleration hardware, with costs starting from $5 per hour.
In the annex we list a number of add-on services that various vendors are offering.
Technical approach
There are two main categories of AI models that have already been developed and deployed in the music industry.
Conditional generative models - these models are trained on a large corpus of music and can generate music based on a prompt. The prompt can be a melody, text, or existing music. Such models are typically very large and therefore expensive to produce.
AI assistive models that are embedded into existing tools such as Digital Audio Workstations (DAWs) - in this white paper we will use the term AI DAW to refer to them. In the Annex we provide a list of AI DAWs that are currently available.
The distinction between the two categories is not clear-cut, since AI DAWs may in fact use lightweight conditional models, or they may label as AI what is purely advanced audio signal processing, such as user-programmable filters.
Embedding generation
| Model Name | Description | Organization |
|---|---|---|
| CLAP | A model for learning audio concepts from natural language supervision (Elizalde et al. 2022). | Microsoft |
| CLAP (LAION) | Contrastive Language-Audio Pretraining. | LAION |
| Encodec | A state-of-the-art deep learning-based audio codec (Défossez et al. 2022). | Facebook/Meta Research |
| MERT | An acoustic music understanding model with large-scale self-supervised training (Li et al. 2023). | m-a-p |
| DAC | High-fidelity audio compression with Improved RVQGAN (Kumar et al. 2023). | Descript |
| CDPAM | A contrastive learning-based Deep Perceptual Audio Metric (Manocha et al. 2021). | Pranay Manocha et al. |
We use the Fréchet Inception Distance (FID), amongst other metrics, to evaluate the quality of the embeddings.
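A sketch of how a Fréchet-style distance between two sets of embeddings (e.g. real vs. generated loops) can be computed is shown below; `emb_real` and `emb_gen` are assumed to be NumPy arrays of shape (n_samples, dim).

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)

    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Example with stand-in embeddings:
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(500, 64)), rng.normal(size=(500, 64))))
```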
Vector Database and Similarity Search
We store the produced embeddings into Qdrant (Qdrant 2024). Qdrant is a high-performance, vector-based search database designed for similarity search and nearest neighbor search in large collections of vector embeddings. It is specifically built to enable machine learning and AI applications that require real-time vector search.
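A minimal sketch of storing loop embeddings in Qdrant and running a nearest-neighbor query follows; the collection name, vector size and payload fields are illustrative.

```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # in-memory instance for illustration; use a server URL in production

client.create_collection(
    collection_name="loops",
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)

# Upsert loop embeddings together with their annotations as payload.
rng = np.random.default_rng(0)
points = [
    PointStruct(
        id=i,
        vector=rng.random(512).tolist(),
        payload={"instrument": "piano", "key": "B flat minor", "tempo_bpm": 120, "time_signature": "4/4"},
    )
    for i in range(100)
]
client.upsert(collection_name="loops", points=points)

# Query with a prompt embedding (stand-in vector here) and return the 5 closest loops.
hits = client.search(collection_name="loops", query_vector=rng.random(512).tolist(), limit=5)
for hit in hits:
    print(hit.id, hit.score, hit.payload)
```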
Model Review
The Audiocraft Model (Meta)
In this section we provide a high-level description of the recently open-sourced model from Meta (Copet et al. 2023). The model, called AudioCraft, encompasses several elements of an arguably state-of-the-art (SOTA) generative modeling approach. The model’s block diagram is shown below.
You can also hear the demos the Meta team prepared. Note that the trained weights of the model are not available for download.
The model offers the following features:
Multimodal conditioning - the model can generate music from multiple modalities such as text but also melodies. The models are called conditional since what is generated is conditioned on user input (text and melodies) and earlier model output (the generated music).
Ability to generate stereo sound at 32 kHz, which is considered good quality, though below the 48 kHz production-quality standard.
The model is trained on a large corpus of licensed music.
A follow-up model adds instruction-following capabilities to the model (Zhang et al. 2024).
References
Annexes
AI DAW Features
AI DAWs offer a variety of features that can help the composer create music.
Music AI
They highlight the following features:
- Stem separation
- Lyrics Transcription
- Chord Recognition
- Beat Detection
- Mastering
- Vocal Synthesis
Their pricing is structured according to simultaneous compute and total storage, as well as per-minute usage of each module. Notice that the business / enterprise plan quotes “Private AI model creation and support”, which does not clarify whether this is a finetuning feature based on a baseline model they offer, whether the customer may bring a self-hosted model, or whether a model private to the customer can be developed. A video (YouTube 2024) shows how to use the service.
Soundverse AI
They highlight the following features:
- Stem separation
- Vocal isolation
- Vocal remover
Their pricing is token based. A video (YouTube 2024) shows how to use the service.
Apple’s Logic Pro
It offers AI assistance in the form of Session Players - they do stem separation as well.
Citation
@article{young2024,
author = {Young, Marving and Monogioudis, Pantelis},
title = {Boomcraft: Crafting Music Using {AI} Trained on License-Free
Content.},
date = {2024-12-29},
langid = {en},
abstract = {We present the elements of \_Boomcraft\_ - a system that
uses AI trained on license-free (creative-commons) music to help you
craft music from multimodal prompts.}
}