DeepMind Unveils AI That Creates Soundtracks and Dialogue for Videos

DeepMind, Google’s AI research lab, is developing technology to generate soundtracks for videos. In a recent blog post, DeepMind introduced V2A, short for “video-to-audio,” positioning it as a crucial component of the AI-generated media landscape. While many organizations, including DeepMind, have built video-generating AI models, those systems generally can’t produce sound effects synchronized with the footage they generate.

“Video generation models are evolving rapidly, but many still produce silent output,” notes DeepMind. “V2A technology [could] offer a promising method to bring generated films to life.”

The V2A technology pairs a description of a soundtrack—such as “jellyfish pulsating under water, marine life, ocean”—with a video to produce music, sound effects, and dialogue that match the characters and tone of the footage. The generated audio is watermarked with DeepMind’s deepfake-combating SynthID technology. According to DeepMind, the model powering V2A is a diffusion model trained on a combination of sounds, dialogue transcripts, and video clips.

“By learning from video, audio, and additional annotations, our technology can associate specific audio events with corresponding visual scenes, adapting to the details provided in the annotations or transcripts,” DeepMind explains.
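DeepMind hasn’t released code or an API for V2A, but the description above maps onto a standard diffusion recipe: start from random noise and iteratively refine it into a waveform, conditioned on the video and an optional text prompt. The sketch below is purely illustrative; every function, shape, and name in it is invented, and a real system would use learned neural encoders and a trained denoiser.

```python
# Toy sketch of a video-to-audio diffusion loop. Hypothetical throughout:
# DeepMind has not published V2A's code, and every name here is invented.
import numpy as np

def encode_video(frames: np.ndarray) -> np.ndarray:
    """Stand-in visual encoder: collapse (T, H, W) frames to a T-dim vector."""
    return frames.reshape(frames.shape[0], -1).mean(axis=1)

def encode_prompt(prompt: str) -> np.ndarray:
    """Stand-in text encoder: hash characters into a fixed-size vector."""
    vec = np.zeros(16)
    for i, ch in enumerate(prompt):
        vec[i % 16] += ord(ch)
    return vec / max(len(prompt), 1)

def denoise_step(audio: np.ndarray, cond: np.ndarray, t: int) -> np.ndarray:
    """Stand-in denoiser. A trained model would predict the noise present
    at timestep t, conditioned on the video/text embedding, and remove it."""
    predicted_noise = 0.1 * np.tanh(audio + cond.mean())
    return audio - predicted_noise

def generate_audio(frames: np.ndarray, prompt: str = "",
                   steps: int = 50, num_samples: int = 16000) -> np.ndarray:
    """Start from pure noise and iteratively refine it into a waveform,
    guided by the video frames and the (optional) soundtrack description."""
    cond = np.concatenate([encode_video(frames), encode_prompt(prompt)])
    audio = np.random.randn(num_samples)      # pure noise
    for t in reversed(range(steps)):          # iterative refinement
        audio = denoise_step(audio, cond, t)
    return audio

# Example: 24 frames of fake 64x64 grayscale video plus a soundtrack prompt.
frames = np.random.rand(24, 64, 64)
waveform = generate_audio(frames, "jellyfish pulsating under water, marine life, ocean")
print(waveform.shape)  # (16000,) -- one second of audio at 16 kHz
```

The sketch omits everything that makes V2A interesting in practice, including the SynthID watermarking step; its only purpose is to show where the video and prompt conditioning enter the iterative refinement loop.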

However, questions remain about the copyright status of the training data and whether its creators were informed of DeepMind’s effort. We have reached out to DeepMind for clarification and will update this article if we receive a response.

AI-driven sound-generating tools aren’t new; Stability AI launched a similar system recently, and ElevenLabs introduced one in May. Nor are models that pair audio with video: Microsoft has projects that generate synchronized talking and singing videos from still images, and apps like Pika and GenreX use trained models to suggest music and effects appropriate to a given scene. Still, the breadth of V2A’s capabilities stands out.

DeepMind asserts that V2A is distinct because it can interpret a video’s raw pixels and synchronize generated sounds with them automatically, even without an explicit description. That said, the technology has its limits: DeepMind admits that audio quality suffers when the underlying model encounters videos with artifacts or distortions. And overall, the generated audio amounts to what my colleague Natasha Lomas called “a collection of stereotypical sounds.”

To avoid potential misuse and to ensure a positive impact on the creative community, DeepMind has decided not to release V2A to the public for the foreseeable future.

“To ensure our V2A technology benefits the creative community, we are gathering insights from prominent creators and filmmakers, using this feedback to guide our research and development,” DeepMind emphasizes. “Before considering public access, V2A technology will undergo thorough safety assessments and testing.”

While DeepMind highlights the utility of V2A for archivists and those working with historical footage, the rise of generative AI in this domain poses challenges for the film and TV industry. Strong labor protections will be essential to prevent the erosion of jobs—and, potentially, entire professions—due to generative media tools.
