Nvidia Debuts Most Advanced AI Audio Solution
A team of GenAI researchers created a new sound tool that lets users control audio output simply by using text.
Called Fugatto (short for Foundational Generative Audio Transformer Opus 1), the tool generates or transforms any mix of music, voices, and sounds, described through prompts that can combine text and audio files. It can create a music snippet from a text prompt, remove instruments from or add them to an existing song, change the accent or emotion of a voice, and even let people produce sounds never heard before.
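Fugatto has no public API, so the sketch below is purely illustrative of the idea that a single request can mix free-form text with optional reference audio; the `FugattoPrompt` structure and its field names are hypothetical, not an actual NVIDIA interface.

```python
from dataclasses import dataclass, field

@dataclass
class FugattoPrompt:
    """Hypothetical prompt structure (Fugatto's real interface is
    unpublished). Illustrates the point that one request can combine
    free-form text with optional reference audio clips."""
    text: str                                             # the instruction
    audio_paths: list[str] = field(default_factory=list)  # optional reference audio

# A text-only generation request and an audio-conditioned edit request.
generate = FugattoPrompt(text="a melancholy piano melody in a rainy cafe")
transform = FugattoPrompt(
    text="change the singer's accent to Scottish and make it sound joyful",
    audio_paths=["vocals.wav"],
)
```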
Music producers could use Fugatto to quickly prototype or edit an idea for a song, trying out different styles, voices, and instruments; they could also add effects or enhance the overall audio quality of an existing track. An ad agency could use Fugatto to quickly adapt an existing campaign for multiple regions or situations, applying different accents and emotions to its voiceovers. Language learning tools could be personalized to use any voice a learner chooses. Video game developers could use the model to modify prerecorded assets so they fit the changing action as users play, or to create new assets on the fly from text instructions and optional audio inputs.
During inference, the model uses a technique called ComposableART to combine instructions that were only seen separately during training. Its ability to interpolate between instructions gives users fine-grained control over them, dialing in, say, the heaviness of an accent or the degree of sorrow in a voice. Fugatto can also generate sounds that change over time and let users shape how a soundscape evolves: it can, for instance, create a rainstorm moving through an area, with crescendos of thunder that slowly fade into the distance. And unlike most models, Fugatto can produce soundscapes it has never seen before, such as a thunderstorm easing into a dawn with the sound of birds singing.
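The article doesn't detail the mechanism, but combining separately learned instructions at inference is commonly done with classifier-free-guidance-style weighting. The sketch below shows that pattern under the assumption that ComposableART works in a similar spirit; the `composable_guidance` function, its weighting scheme, and the toy arrays are illustrative assumptions, not Fugatto's published implementation.

```python
import numpy as np

def composable_guidance(uncond: np.ndarray,
                        conds: list[np.ndarray],
                        weights: list[float]) -> np.ndarray:
    """Classifier-free-guidance-style composition (illustrative only,
    not Fugatto's confirmed internals): start from the unconditional
    model output and add a weighted guidance direction for each
    instruction's conditional output."""
    out = uncond.copy()
    for cond, w in zip(conds, weights):
        out += w * (cond - uncond)  # push the output toward this instruction
    return out

# Toy example: blend two "instructions" over a dummy model output.
rng = np.random.default_rng(0)
uncond = rng.normal(size=8)   # stand-in for the unconditional prediction
accent = rng.normal(size=8)   # stand-in: prediction given "heavy accent"
sorrow = rng.normal(size=8)   # stand-in: prediction given "sorrowful tone"
blended = composable_guidance(uncond, [accent, sorrow], [0.7, 0.3])
```

In a scheme like this, interpolation falls out of the weights: 0.0 ignores an instruction, 1.0 applies it fully, and intermediate values blend it in, which matches the fine-grained control the article describes.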
Fugatto is a foundational generative transformer model that builds on the team's prior work in areas such as speech modeling, audio vocoding, and audio understanding. The full version uses 2.5 billion parameters and was trained on a bank of NVIDIA DGX systems packing 32 NVIDIA H100 Tensor Core GPUs. One of the hardest parts of the effort was assembling the blended training dataset, which contains millions of audio samples. The team employed a multifaceted strategy for generating data and instructions, one that considerably expanded the range of tasks the model could perform, improved its accuracy, and enabled new tasks without requiring additional data. They also scrutinized existing datasets to reveal new relationships among the data. The overall work spanned more than a year.
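The article doesn't publish the dataset scheme, but one way to read "generating data and instructions" that expands task coverage without new audio is instruction augmentation: emitting several differently worded task records per clip. The template list, `augment` helper, and record format below are hypothetical illustrations of that idea, not the team's actual pipeline.

```python
import random

# Hypothetical instruction templates; the article describes generating
# instructions for existing audio but does not publish the actual scheme.
TEMPLATES = [
    "remove the {instrument} from this track",
    "make the {instrument} louder",
    "isolate the {instrument} part as a solo",
]

def augment(clip_path: str, instrument: str, n: int = 2) -> list[dict]:
    """Emit several (instruction, audio) training records from one clip,
    multiplying task coverage without collecting any new audio."""
    return [
        {"audio": clip_path, "instruction": t.format(instrument=instrument)}
        for t in random.sample(TEMPLATES, k=n)
    ]

records = augment("band_take_03.wav", "drums")
```

Under an approach like this, each clip contributes multiple task examples, which is one way a fixed audio corpus could expand the range of trainable tasks the way the article describes.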