The Ingenuity of Music Language Models

Oct 8, 2025

Did you know you can wish for a song that doesn’t exist, and it will materialize out of thin air? Google researchers have developed an AI, called MusicLM, that takes text prompts as input and yields minutes-long musical pieces catered to the specifications mentioned in the prompts. This is quite similar to how DALL-E generates images from text prompts.

Like other Large Language Models (LLMs), MusicLM relies on deep learning and natural language processing. It learns hidden representations from its training dataset of music-text pairs shared by human experts, and uses them to generate music. That's not all: besides writing a music score, it can recommend new chords for existing music or create a brand new instrumental sound.

If, like me, you are wondering whether you need to convert your music into musical notation before MusicLM can work with it, you don't. Simply feed raw audio into the model: it is converted into a series of discrete tokens for analysis, and those tokens are then used to produce new audio sequences. MusicLM is built on top of AudioLM, another Google project, which uses two tokenizers to extract information from the musical sequence:

SoundStream tokenizer for producing acoustic tokens
w2v-BERT tokenizer for producing semantic tokens
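To make "discrete tokens" a little more concrete, here is a minimal sketch of residual vector quantization, the style of quantizer SoundStream is built around. It is only an illustration: in the real model a neural encoder first turns audio into embedding frames and the codebooks are learned, whereas the `CODEBOOKS` and the `frame` below are made-up values chosen so the arithmetic is easy to follow.

```python
# Toy residual vector quantization (RVQ). The codebooks are hypothetical
# illustrative values, not anything learned by SoundStream.

def nearest(codebook, vector):
    """Return the index of the codebook entry closest to `vector`."""
    def sq_dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize in stages; each stage encodes what the previous one missed."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Rebuild an approximation by summing the chosen codebook entries."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],         # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],  # finer stage
]

frame = [0.9, 0.2]                      # stand-in for one encoded audio frame
tokens = rvq_encode(frame, CODEBOOKS)   # one integer token per stage
approx = rvq_decode(tokens, CODEBOOKS)  # what a decoder would start from
print(tokens, approx)
```

Each frame ends up as a short list of integers, and those integers are what the language model actually reads and writes.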

Once the audio signal is tokenized, MuLan embeds music and text into a joint space, trained with a technique called contrastive loss. The joint embedding output is passed on to the next three stages:

Semantic modelling, which maps audio tokens to semantic tokens and learns long-term structural coherence
Coarse acoustic modelling, which generates acoustic tokens conditioned on the semantic tokens
Fine acoustic modelling, which refines the coarse tokens so that they are finer and more realistic. These fine tokens are passed into a SoundStream decoder to recreate a waveform.
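Going back to the contrastive loss used for the joint embedding: the idea is that a clip and its matching description should land close together, while mismatched pairs are pushed apart. Below is a minimal sketch of one common form, a symmetric InfoNCE-style loss. Whether MuLan uses exactly this form, and the `temperature` value, are assumptions on my part, and the embeddings are made-up 2-D vectors.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(audio_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss: audio i is the true match of text i."""
    audio = [normalize(a) for a in audio_embs]
    text = [normalize(t) for t in text_embs]
    n = len(audio)
    # Cosine similarity of every audio clip with every caption.
    sims = [[dot(a, t) / temperature for t in text] for a in audio]
    audio_to_text = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    text_to_audio = sum(
        cross_entropy_row([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return (audio_to_text + text_to_audio) / 2


# Hypothetical embeddings for two clips and two captions.
audio = [[1.0, 0.1], [0.1, 1.0]]
aligned_text = [[0.9, 0.0], [0.0, 0.8]]   # caption i points the same way as clip i
swapped_text = [[0.0, 0.8], [0.9, 0.0]]   # captions attached to the wrong clips

loss_aligned = contrastive_loss(audio, aligned_text)
loss_swapped = contrastive_loss(audio, swapped_text)
print(loss_aligned < loss_swapped)
```

The loss is small when matching pairs are the most similar ones in the batch and large when they are not, which is the pressure that pulls music and text into a shared space.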

Figure: MusicLM architecture (Source: Google)
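The way the three stages hand tokens to one another can be sketched as a chain of functions. Treat this as a wiring diagram in code and nothing more: every function is a hypothetical stand-in that does simple arithmetic, whereas the real stages are learned sequence models, and the token ranges and the `per_coarse` parameter are invented for the example.

```python
from typing import List

# Every function here is a hypothetical stand-in. Only the order in which
# tokens flow from one stage to the next reflects the description above.


def semantic_stage(conditioning: List[int], length: int) -> List[int]:
    """Conditioning tokens -> semantic tokens (long-term structure)."""
    seed = sum(conditioning)
    return [(seed + 7 * i) % 100 for i in range(length)]


def coarse_acoustic_stage(semantic: List[int]) -> List[int]:
    """Semantic tokens -> coarse acoustic tokens."""
    return [(3 * s + 1) % 1024 for s in semantic]


def fine_acoustic_stage(coarse: List[int], per_coarse: int = 2) -> List[int]:
    """Coarse tokens -> a longer sequence of fine acoustic tokens."""
    return [(c + k) % 1024 for c in coarse for k in range(1, per_coarse + 1)]


def generate_tokens(conditioning: List[int], length: int) -> List[int]:
    """Run the three stages in order; a decoder would consume the result."""
    semantic = semantic_stage(conditioning, length)
    coarse = coarse_acoustic_stage(semantic)
    return fine_acoustic_stage(coarse)


fine_tokens = generate_tokens(conditioning=[3, 5], length=4)
print(fine_tokens)
```

The point of the split is that each stage solves a narrower problem: structure first, rough sound second, detail last.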

MusicLM is quite flexible in its behaviour. It can act on paragraph-long descriptions that refer to a vibe, a genre or specific instruments to be included, or on short phrases like “melodic metal”. There is also a story mode, in which the model morphs between a sequence of timed prompts. For example, this sequence:

electronic song played in a videogame (0:00-0:15)
meditation song played next to a river (0:15-0:30)
fire (0:30-0:45)
fireworks (0:45-0:60)

resulted in generated audio that you can listen to here. Or check out the dulcet tones of Blob Opera, another Google machine-learning music experiment, here.
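If you wanted to work with a timed prompt sequence like the one above in your own code, a small parser is enough to turn it into a schedule. This is just a sketch of the text format as written in this post; `parse_schedule` and `prompt_at` are hypothetical helper names, not part of any MusicLM interface.

```python
import re
from typing import List, Optional, Tuple

Segment = Tuple[str, int, int]  # (prompt, start_seconds, end_seconds)

# Matches lines such as "fire (0:30-0:45)".
_LINE = re.compile(
    r"^(?P<text>.+?)\s*\((?P<m1>\d+):(?P<s1>\d+)-(?P<m2>\d+):(?P<s2>\d+)\)$"
)


def parse_schedule(lines: List[str]) -> List[Segment]:
    """Turn 'prompt (m:ss-m:ss)' lines into (prompt, start, end) tuples."""
    segments = []
    for line in lines:
        match = _LINE.match(line.strip())
        if match is None:
            raise ValueError(f"unrecognized schedule line: {line!r}")
        start = int(match["m1"]) * 60 + int(match["s1"])
        end = int(match["m2"]) * 60 + int(match["s2"])
        segments.append((match["text"], start, end))
    return segments


def prompt_at(segments: List[Segment], t: float) -> Optional[str]:
    """Return the prompt active at time t (seconds), or None if outside."""
    for text, start, end in segments:
        if start <= t < end:
            return text
    return None


STORY = [
    "electronic song played in a videogame (0:00-0:15)",
    "meditation song played next to a river (0:15-0:30)",
    "fire (0:30-0:45)",
    "fireworks (0:45-0:60)",
]

schedule = parse_schedule(STORY)
print(prompt_at(schedule, 20))
```

Each segment covers a half-open interval, so a boundary such as 0:15 belongs to the prompt that starts there rather than the one that ends there.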

Google has been cautious with the model and has not released it to the general public. Similar generative AI technologies such as Stable Diffusion and Midjourney have been accused of “misappropriation of creative content”, that is, of violating copyright law by scraping artists’ work without their consent. Copilot, the AI programming tool developed jointly by Microsoft, GitHub and OpenAI, is the subject of a similar lawsuit. As the rise of generative AI accelerates, staying alert to pitfalls such as copyright and data privacy issues is the right thing to do.
