MusicGen is a state-of-the-art controllable text-to-music model that translates textual prompts into musical compositions. It uses a single-stage auto-regressive Transformer architecture and operates over discrete audio tokens produced by a 32 kHz EnCodec tokenizer. MusicGen was trained on 20K hours of licensed music, comprising an internal dataset of 10K high-quality music tracks together with ShutterStock and Pond5 music data. Four pre-trained models are available: small, medium, large, and melody. MusicGen is optimized for efficient, fast music generation and can produce 10 seconds of audio in about 35 seconds on a GPU.
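To get a feel for what "single-stage auto-regressive generation over EnCodec tokens" means in practice, here is a minimal back-of-the-envelope sketch of the token budget. It assumes the figures reported in the MusicGen paper: the 32 kHz EnCodec tokenizer emits 50 latent frames per second across 4 parallel codebooks, and the delay interleaving pattern lets one decoding step advance all codebooks at once (at the cost of a few extra flush steps).

```python
# Assumed figures from the MusicGen paper: 50 frames/s, 4 codebooks.
FRAME_RATE_HZ = 50
NUM_CODEBOOKS = 4

def decoding_steps(seconds: float) -> int:
    """Auto-regressive forward passes with the delay pattern:
    one step per latent frame, plus (codebooks - 1) flush steps."""
    return int(seconds * FRAME_RATE_HZ) + (NUM_CODEBOOKS - 1)

def tokens_generated(seconds: float) -> int:
    """Total discrete audio tokens produced for a clip."""
    return int(seconds * FRAME_RATE_HZ) * NUM_CODEBOOKS

print(decoding_steps(10))    # 503 forward passes for a 10-second clip
print(tokens_generated(10))  # 2000 audio tokens in total
```

This is why the single-stage design matters for speed: a naive flattening of the 4 codebooks would require roughly 4x as many forward passes (about 2000 instead of 503 for the same 10-second clip).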
