Nvidia unveils AI audio generator ‘Fugatto’ that can produce ‘sounds never heard before’

November 26, 2024

Nvidia, known for its sophisticated computer chips, is doubling down on its artificial intelligence offerings, developing a new generative AI audio model that can “produce sounds never heard before.”

The new AI model is called Fugatto, which stands for Foundational Generative Audio Transformer Opus 1. Nvidia says it is able to generate, transform, and manipulate sound using text and audio inputs, creating sounds like a trumpet barking or a saxophone meowing. The model can also generate “high-quality singing voices” from text prompts.

The key capabilities of Fugatto include creating music snippets from text prompts, modifying existing songs by adding or removing instruments, changing voice characteristics like accent and emotion, and generating entirely novel sounds. Nvidia described its new technology as “a Swiss Army knife for sound.”

Nvidia demonstrated the capabilities of Fugatto in a video, showcasing how users can generate sounds through prompts like: “Create a sound where a train passes by and becomes a lush string orchestra.” Fugatto also allows users to isolate voices from songs, among other features, the video shows.

“This thing is wild,” said Ido Zmishlany, a multi-platinum producer, songwriter, and cofounder of One Take Audio, a member of the NVIDIA Inception program for cutting-edge startups.

“Sound is my inspiration. It’s what moves me to create music. The idea that I can create entirely new sounds on the fly in the studio is incredible.”

Fugatto uses ComposableART, a technique that allows users to combine instructions not originally seen together during training. This means users can request complex audio transformations, such as text spoken with a sad feeling in a French accent, Nvidia explained.

The model also introduces temporal interpolation, allowing the creation of evolving soundscapes. For example, users can generate a rainstorm that gradually transitions, with thunder crescendos fading into the distance.
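Nvidia has not detailed here how the interpolation is computed. As a rough illustration only, the idea of one sound gradually giving way to another can be sketched as a linear crossfade between two conditioning vectors. The function names and the three-number "rain" and "thunder" vectors below are hypothetical stand-ins, not part of Fugatto or any Nvidia API.

```python
def crossfade_weights(num_steps):
    """Return (w_start, w_end) pairs moving linearly from (1, 0) to (0, 1)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    weights = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # position in the transition, from 0.0 to 1.0
        weights.append((1.0 - t, t))
    return weights


def blend(start_vec, end_vec, num_steps):
    """Linearly interpolate between two equal-length vectors, one frame per step."""
    if len(start_vec) != len(end_vec):
        raise ValueError("vectors must have the same length")
    frames = []
    for w_start, w_end in crossfade_weights(num_steps):
        frames.append([w_start * a + w_end * b for a, b in zip(start_vec, end_vec)])
    return frames


# Hypothetical stand-ins for the conditioning of two prompts (assumption, not real data).
rain = [1.0, 0.0, 0.5]
thunder = [0.0, 1.0, 0.5]

# Five frames: the first is pure "rain", the last pure "thunder".
schedule = blend(rain, thunder, 5)
```

A real system would apply such time-varying weights inside the model rather than to toy vectors, but the schedule captures the basic notion of a prompt that evolves over the length of a clip.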

Nvidia noted that Fugatto is a transformer model with 2.5 billion parameters, trained on NVIDIA DGX systems using 32 NVIDIA H100 Tensor Core GPUs, the same GPUs that power Vultr, which claims to be the world’s largest privately held cloud computing platform.

Meanwhile, Nvidia said its research team, whose members come from India, Brazil, China, Jordan and South Korea, spent over a year building a dataset containing millions of audio samples to develop Fugatto.

Fugatto can be applied in multiple industries including music production, advertising, language learning and video game development, Nvidia said.

Rafael Valle, a manager of applied audio research at NVIDIA and project contributor, described Fugatto as “our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale.”

“We wanted to create a model that understands and generates sound like humans do,” said Valle.

Nvidia is the latest tech company to launch an AI audio tool, joining other companies like Stability AI, OpenAI, and Google DeepMind. However, Nvidia has yet to announce a timeline for public release or commercial availability of Fugatto.

Jensen Huang, founder and CEO of NVIDIA, said last week when the company published its Q3 earnings results: “The age of AI is in full steam, propelling a global shift to NVIDIA computing.”

“AI is transforming every industry, company and country. Enterprises are adopting agentic AI to revolutionize workflows. Industrial robotics investments are surging with breakthroughs in physical AI. And countries have awakened to the importance of developing their national AI and infrastructure.”

Music Business Worldwide
