Google have been making huge progress with their latest text-to-music AI, which creates songs as long as 5 minutes. All you have to do is describe what you want, and the AI does the rest.
Is this a good idea, or does it just put more unnecessary pressure on independent artists? Google have released a paper detailing their work and findings, revealing MusicLM and how it works. They claim it will “outperform previous systems both in audio quality and adherence to the text description”.
The examples, however, are only 30 seconds long. It will be interesting to see a full 5-minute piece and how it compares. Their caption choices were the following:
- “The main soundtrack of an arcade game. It is fast-paced and upbeat, with a catchy electric guitar riff. The music is repetitive and easy to remember, but with unexpected sounds, like cymbal crashes or drum rolls”.
- “A fusion of reggaeton and electronic dance music, with a spacey, otherworldly sound. Induces the experience of being lost in space, and the music would be designed to evoke a sense of wonder and awe, while being danceable”.
- “A rising synth is playing an arpeggio with a lot of reverb. It is backed by pads, sub bass line and soft drums. This song is full of synth sounds creating a soothing and adventurous atmosphere. It may be playing at a festival during two songs for a buildup”.
AI has been used to create music before, but nothing quite like this. There aren’t currently any other platforms that can create music from simple text descriptions; it seems Google’s will be the first of its kind. Google’s researchers are honest in their report, explaining various challenges they have faced, or are still facing, throughout the process.
It seems there is a lack of audio data paired with text. Text-to-image platforms don’t have this issue: they have gained a lot from large datasets, meaning they’re able to advance far quicker than MusicLM will. OpenAI’s DALL-E, for example, has caused a huge surge in popularity and public interest.
Another issue they’ve come across is the structure of the music. They note that music is structured “along a temporal dimension”, and that it is harder to understand the intent behind a music track when all they have is a text caption. The paper describes MusicLM as a “hierarchical sequence-to-sequence model for music generation”.
The AI uses machine learning to generate sequences at several levels of a song, processing its structure, melody and the sounds within. A huge dataset of unlabelled music has been used in the process, and various musicians have come together to help too, contributing more than 5,500 examples of music captions.
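The hierarchical, multi-level idea can be pictured as a pipeline: a coarse stage turns the text caption into tokens capturing long-term structure, and a finer stage expands each of those into many low-level tokens that would ultimately be decoded into audio. The sketch below is a toy illustration of that two-stage shape only; the function names, token vocabularies and arithmetic are all hypothetical stand-ins, not MusicLM’s actual model or API.

```python
import random

# Toy sketch of a hierarchical, two-stage generation pipeline.
# All names and token vocabularies here are hypothetical
# illustrations, not MusicLM's real architecture.

def generate_semantic_tokens(caption, length, rng):
    """Stage 1: map a text caption to coarse 'semantic' tokens
    that stand in for long-term structure (melody, rhythm)."""
    seed = sum(ord(c) for c in caption)  # stand-in for a text encoder
    return [(seed + rng.randrange(8)) % 64 for _ in range(length)]

def generate_acoustic_tokens(semantic_tokens, tokens_per_step, rng):
    """Stage 2: expand each coarse token into several fine-grained
    'acoustic' tokens, which a real system would decode to audio."""
    acoustic = []
    for s in semantic_tokens:
        acoustic.extend((s * 17 + rng.randrange(4)) % 256
                        for _ in range(tokens_per_step))
    return acoustic

rng = random.Random(0)
semantic = generate_semantic_tokens("fast-paced arcade chiptune", 4, rng)
acoustic = generate_acoustic_tokens(semantic, 3, rng)
print(len(semantic), len(acoustic))  # fine sequence is longer than coarse
```

The point of the hierarchy is that each stage works at a manageable sequence length: the coarse stage only has to model structure over a few tokens per second, while the fine stage handles the much longer acoustic sequence.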
Whistling or humming should be able to help with the melody of each song: these sounds can be used as audio input, and the melody is then rendered to suit the style described in the text. MusicLM isn’t yet accessible to the public; it’s still in the works. It will be interesting to see if it’s successful.