The Ingenuity of Music Language Models

Oct 8, 2025

Did you know you can wish for a song that doesn’t exist, and it will materialize out of thin air? Google researchers have developed an AI called MusicLM that takes text prompts as input and yields minutes-long musical pieces tailored to the specifications in the prompt. This is quite similar to how DALL-E generates images from text prompts.

Like other Large Language Models (LLMs), MusicLM builds on deep learning and natural language processing. It learns hidden representations from a training dataset of music-text pairs annotated by human experts, and uses them to generate music. That’s not all: beyond composing a piece from scratch, it can suggest new chords for existing music or create a brand-new instrumental sound.

If you’re like me and wondering whether you need to convert your music into musical notation before MusicLM can work with it, you don’t. You simply feed raw audio into the model; it is converted into a series of discrete tokens for analysis and then used to produce new audio sequences. MusicLM is built on top of AudioLM, another Google project, which uses two tokenizers to extract information from the audio (a small sketch of this step follows the list):

  1. SoundStream tokenizer for producing acoustic tokens
  2. w2v-BERT tokenizer for producing semantic tokens
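
To make the two token streams concrete, here is a minimal sketch, assuming hypothetical `soundstream` and `w2v_bert` wrapper objects (the real AudioLM tokenizers are not packaged with an interface like this):

```python
import numpy as np

def tokenize_audio(waveform: np.ndarray, soundstream, w2v_bert):
    """Turn raw audio into the two discrete token streams used by AudioLM.

    `soundstream` and `w2v_bert` are hypothetical wrappers around the real
    models, used here only to illustrate the data flow.
    """
    # Acoustic tokens: SoundStream's residual vector quantizer compresses the
    # waveform into codebook indices that preserve fine-grained audio detail.
    acoustic_tokens = soundstream.encode(waveform)   # e.g. shape (frames, quantizers)

    # Semantic tokens: quantized w2v-BERT activations that capture longer-range
    # structure such as melody and rhythm rather than the exact sound.
    semantic_tokens = w2v_bert.quantize(waveform)    # e.g. shape (frames,)

    return semantic_tokens, acoustic_tokens
```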


Once the audio signal is tokenized, MuLan computes a joint embedding of music and text using a technique called contrastive loss (a minimal sketch of this objective appears after the list below). The joint embedding is passed on to the next three stages:

  1. Semantic modelling, which maps audio tokens to semantic (text-based) tokens and learns long-term structural coherence
  2. Coarse acoustic modelling, which generates acoustic tokens conditioned on the semantic tokens
  3. Fine acoustic modelling, which refines the coarse tokens into finer, more realistic ones. These fine tokens are passed into a SoundStream decoder to recreate a waveform (the full chain is sketched below).
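
The contrastive objective itself is standard: embeddings of matching music-text pairs are pulled together in a shared space while mismatched pairs in the same batch are pushed apart. Here is a minimal CLIP-style sketch in PyTorch; the audio and text encoders that produce these embeddings are assumed to exist and are not MuLan’s actual networks:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of (music, text) embedding pairs.

    audio_emb, text_emb: (batch, dim) tensors where matching pairs share a row.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity of every audio clip against every caption in the batch.
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Each clip should rank its own caption first, and vice versa.
    loss_audio_to_text = F.cross_entropy(logits, targets)
    loss_text_to_audio = F.cross_entropy(logits.t(), targets)
    return (loss_audio_to_text + loss_text_to_audio) / 2
```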
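
Putting the three stages together, generation is a chain of sequence models, each conditioned on the output of the previous one. The sketch below shows only the control flow; every model object in it is a hypothetical stand-in, since the real checkpoints are not public:

```python
def generate_music(text_prompt, mulan, semantic_model, coarse_model,
                   fine_model, soundstream_decoder):
    """Illustrative control flow of MusicLM-style generation
    (all model objects are hypothetical stand-ins)."""
    # Embed the text prompt into the joint music-text space.
    conditioning = mulan.embed_text(text_prompt)

    # Stage 1: sample semantic tokens that lay out the long-term structure.
    semantic_tokens = semantic_model.generate(conditioning)

    # Stage 2: sample coarse acoustic tokens conditioned on the semantic
    # tokens (and the MuLan embedding).
    coarse_tokens = coarse_model.generate(semantic_tokens, conditioning)

    # Stage 3: refine the coarse tokens into the remaining, finer token levels.
    fine_tokens = fine_model.generate(coarse_tokens)

    # Decode the full token stack back into a waveform with SoundStream.
    return soundstream_decoder.decode(fine_tokens)
```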

MusicLM architecture (Source: Google)

MusicLM is quite flexible in its behaviour. It can act on paragraph-long descriptions that refer to a vibe, a genre or specific instruments to be included, or on short phrases like “melodic metal”. There is also a story mode in which the model morphs between a sequence of timed prompts such as:

  1. electronic song played in a videogame (0:00-0:15)
  2. meditation song played next to a river (0:15-0:30)
  3. fire (0:30-0:45)
  4. fireworks (0:45-0:60)


The result is a generated audio clip that you can listen to here. Or check out the dulcet tones of Blob Opera, another Google AI music experiment, here.
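
One way to think about story mode is as a schedule of (prompt, start, end) segments whose generated audio is stitched together. This is purely illustrative; there is no public MusicLM API, and `generate_segment` and `crossfade` below are assumed helpers:

```python
# A story-mode schedule: (prompt, start_sec, end_sec) for each segment.
story = [
    ("electronic song played in a videogame", 0, 15),
    ("meditation song played next to a river", 15, 30),
    ("fire", 30, 45),
    ("fireworks", 45, 60),
]

def render_story(story, generate_segment, crossfade):
    """Generate each segment for its time window, then blend the clips so the
    piece morphs smoothly from one prompt to the next."""
    clips = [generate_segment(prompt, duration=end - start)
             for prompt, start, end in story]
    return crossfade(clips)
```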

Google has been cautious with the model and has not released it to the general public, as similar generative AI systems like Stable Diffusion and Midjourney have been accused of “misappropriation of creative content” for violating copyright law by scraping artists’ work without their consent. Copilot, the AI programming assistant developed jointly by Microsoft, GitHub and OpenAI, is also being sued in a similar case. As the rise of generative AI accelerates, staying alert to pitfalls such as copyright and data-privacy issues is always the right thing to do.
