Google unveils AI that can automatically sync soundtracks and dialogue to videos


Google’s artificial intelligence lab DeepMind has taken AI-created video content one step further – and conventional movie and TV production (not to mention sync licensing) one step closer to obsolescence.

In a blog post published on Monday (June 17), DeepMind says it’s developing “video-to-audio” (V2A) technology that pairs AI-created music, sound effects and even dialogue to AI-generated video.

“Video generation models are advancing at an incredible pace, but many current systems can only generate silent output,” DeepMind wrote.

“One of the next major steps toward bringing generated movies to life is creating soundtracks for these silent videos.”

DeepMind says its technology stands out from other projects that add sound to AI-generated video because “it can understand raw pixels.” Users can steer it with text prompts, but they aren’t actually necessary – the AI can work out for itself what sort of sounds suit a given video.

The tech can also automatically synchronize sound with image, DeepMind says (goodbye, sound editors, you won’t be needed).
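DeepMind hasn’t released code or an API for V2A, but the workflow it describes – raw video pixels in, an optional text prompt, synchronized audio out – can be sketched as a hypothetical interface. Every name below is invented for illustration; this is a stub, not DeepMind’s system:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class GeneratedAudio:
    samples: bytes      # PCM audio aligned to the video's timeline
    sample_rate: int

class VideoToAudioModel:
    """Hypothetical stand-in for a V2A model: it reads raw pixels and,
    optionally, a text prompt steering the soundtrack's content."""

    def generate_audio(self, frames: Sequence, fps: float = 24.0,
                       prompt: Optional[str] = None) -> GeneratedAudio:
        # A real model would condition audio generation on the pixels
        # (and the prompt, if given) and time the output to on-screen
        # action. This stub just returns silence of matching duration.
        sample_rate = 48_000
        duration_s = len(frames) / fps
        n_bytes = int(duration_s * sample_rate) * 2  # 16-bit mono
        return GeneratedAudio(samples=b"\x00" * n_bytes, sample_rate=sample_rate)

model = VideoToAudioModel()
frames = [object()] * 240  # placeholder for ten seconds of video at 24 fps
# The prompt is optional: per DeepMind, V2A can infer fitting sound
# from the pixels alone.
audio = model.generate_audio(frames, prompt="footsteps on concrete, tension")
print(audio.sample_rate, len(audio.samples))
```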

The DeepMind blog includes a number of clips that pair AI-generated sound with video, including a cinematic score (prompt: “cinematic, thriller, horror film, music, tension, ambience, footsteps on concrete”), an underwater scene (prompt: “jellyfish pulsating under water, marine life, ocean”), and a person playing a guitar.

“Initial results are showing this technology will become a promising approach for bringing generated movies to life,” the DeepMind blog stated.

The lab says the technology was trained on audio, video, and transcripts of spoken dialogue, augmented by “AI-generated annotations with detailed descriptions of sound.”
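Per that description, each record in such a training set would pair several modalities. A minimal sketch of the shape of the data – field names are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    video: bytes               # encoded video clip
    audio: bytes               # the clip's matching soundtrack
    dialogue_transcript: str   # transcript of any spoken dialogue
    sound_annotation: str      # AI-generated description of the audio,
                               # e.g. "rain on a tin roof, distant thunder"
```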

Notably, the lab didn’t say whether the audio, video and text transcripts were copyrighted, or whether the material was licensed for use in AI training, noting only that DeepMind is “committed to developing and deploying AI technologies responsibly.”


Google’s approach to AI training and copyright has been hard to parse. Though the company’s YouTube division has teamed up with major record companies to create AI music tools with the blessing of artists, Google also told the US Copyright Office last year that the use of copyrighted materials in training AI should be considered fair use.

For the time being, the V2A technology appears not to be ready for prime time – that is, it isn’t being released to the public.

“There are a number of other limitations we’re trying to address and further research is underway,” DeepMind said.

One area that the lab says needs improvement is spoken dialogue generation. The current iteration of the V2A tech “often result[s] in uncanny lip-syncing, as the video model doesn’t generate mouth movements that match the transcript,” DeepMind said.

Also, the quality of the audio drops when the video input includes “artifacts or distortions” that the V2A tech hasn’t been trained on, DeepMind said.

Nonetheless, it’s clear that video-to-audio technology like this is the missing link in the creation of instant, complete audiovisual content using AI.

Amidst the ongoing AI boom, numerous developers are working on sound-generating technology. As one example, earlier this month, Stability AI released Stable Audio Open, a free, open-source model that lets users generate high-quality audio samples from text prompts.

Though it isn’t meant for creating full-length musical tracks, it can generate snippets up to 47 seconds long, including sound effects, drum beats, instrument riffs, ambience, and other production elements commonly used in music and sound design.
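Unlike DeepMind’s V2A, Stable Audio Open can be run locally. A condensed sketch, based on the model card for the stabilityai/stable-audio-open-1.0 checkpoint and its companion stable-audio-tools library at the time of release (parameters may have changed since):

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the open weights from Hugging Face (requires accepting the license).
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
model = model.to(device)

# The model is conditioned on a text prompt plus a timing window;
# sample_size caps output at roughly 47 seconds of 44.1 kHz stereo.
conditioning = [{
    "prompt": "128 BPM tech house drum loop",
    "seconds_start": 0,
    "seconds_total": 30,
}]

output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=model_config["sample_size"],
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device=device,
)

# Collapse the batch dimension, peak-normalize, and write a 16-bit WAV.
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32).div(output.abs().max()).clamp(-1, 1)
output = output.mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, model_config["sample_rate"])
```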

The past few months have also seen the release of AI video creation tools capable of making uncannily realistic video, including OpenAI’s Sora, which went viral earlier this year with its convincing footage of people, animals and scenery.

Other AI video generators soon appeared, each vying for the title of “Sora killer” and each hailed by some as the best yet: Luma Labs’ Dream Machine, Runway’s Gen-3 Alpha and – most recently – Chinese video platform Kuaishou’s Kling.


With true-to-life AI video generation now in users’ hands, the issue of deepfakes is becoming increasingly urgent – which may be part of the reason Google’s DeepMind is hesitant to release a tool that (when perfected) will be able to add realistic sound effects and voices to AI-created videos.

DeepMind noted in its blog that it has integrated its SynthID tool into its V2A creations. SynthID is a technology that adds digital watermarks to AI-created content, making it identifiable as the product of AI tools.
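SynthID’s actual watermarking scheme isn’t public. Purely to illustrate the general idea – an imperceptible, key-dependent signal embedded in the audio and later detected by correlation – here is a toy spread-spectrum sketch that reflects nothing of SynthID’s real method:

```python
import numpy as np

def embed_watermark(audio: np.ndarray, key: int, strength: float = 0.005) -> np.ndarray:
    """Nudge samples toward a key-derived pseudo-random carrier."""
    carrier = np.random.default_rng(key).choice([-1.0, 1.0], size=audio.shape)
    return audio + strength * carrier

def detect_watermark(audio: np.ndarray, key: int, threshold: float = 0.0025) -> bool:
    """Correlate against the same carrier; marked audio scores ~= strength."""
    carrier = np.random.default_rng(key).choice([-1.0, 1.0], size=audio.shape)
    return float(np.mean(audio * carrier)) > threshold

rng = np.random.default_rng(0)
audio = 0.1 * rng.standard_normal(48_000)   # one second of noise at 48 kHz
marked = embed_watermark(audio, key=42)

print(detect_watermark(marked, key=42))  # True: watermark detected
print(detect_watermark(audio, key=42))   # False: unmarked audio
```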

DeepMind also gave a nod to the audiovisual creators at risk of being put out of work by these new AI tools.

“To make sure our V2A technology can have a positive impact on the creative community, we’re gathering diverse perspectives and insights from leading creators and filmmakers, and using this valuable feedback to inform our ongoing research and development,” the blog stated.
