Meta’s Movie Gen Model Delivers Realistic Videos with Sound: Unlimited Moo Deng Awaits!

No one yet fully understands the practical applications of generative video models, but that hasn't stopped industry giants like Runway, OpenAI, and Meta from investing millions in their development. Meta's newest creation, Movie Gen, seamlessly transforms text prompts into fairly realistic videos complete with sound—though, thankfully, not voice just yet. Importantly, they have opted against a public release.

Movie Gen is essentially a collection of foundation models, with the most prominent being its text-to-video capability. Meta claims it surpasses competitors like Runway’s Gen3, LumaLabs’ latest offering, and Kling1.5. However, these comparisons often serve more to establish participation in the competitive landscape than to definitively crown Movie Gen as the victor. Detailed technical specifications are available in the research paper published by Meta, which outlines all its components.

Audio for the videos is generated to align with the visuals. For example, you might hear engine sounds matching car movements or the rumble of a waterfall in the background, complemented by a crack of thunder when appropriate. The system can even add music when it suits the scene.

Meta trained Movie Gen using "a combination of licensed and publicly available datasets," which they referred to as "proprietary/commercially sensitive," providing no further details. This likely includes a vast array of Instagram and Facebook videos along with various publicly accessible content vulnerable to scraping.

The ultimate goal for Meta isn't just to claim fleeting recognition as the “state of the art” but to establish a comprehensive method that allows for the creation of high-quality videos from simple, natural language prompts. For example, a user could input something like, “imagine me as a baker crafting a shiny hippo cake during a thunderstorm.”

A common challenge with video generators is their inflexibility in editing. If you request a video of someone walking across the street and then decide you want them walking in the opposite direction, the whole shot is likely to regenerate differently. Meta addresses this with a text-based editing feature: users specify adjustments like, “change the background to a busy intersection” or “change her clothes to a red dress,” and the system aims to implement just those modifications while leaving the rest of the shot intact.
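To make that interaction pattern concrete, here is a minimal sketch of what such an edit request might look like. Movie Gen has no public API, so the `VideoEditRequest` type, its fields, and the clip name below are all hypothetical; this only illustrates the idea of a clip plus a localized natural-language instruction.

```python
from dataclasses import dataclass

@dataclass
class VideoEditRequest:
    """Hypothetical request for a localized, prompt-based video edit.

    Movie Gen exposes no public API; this sketch only illustrates the
    interaction the article describes: an existing clip plus a
    natural-language instruction, with everything else left untouched.
    """
    source_clip: str            # ID or path of the previously generated video (hypothetical)
    instruction: str            # the targeted change, in plain language
    preserve_rest: bool = True  # the key promise: only the named element changes

# Two localized edits against the same base clip.
edits = [
    VideoEditRequest("walk_across_street.mp4", "change the background to a busy intersection"),
    VideoEditRequest("walk_across_street.mp4", "change her clothes to a red dress"),
]

for edit in edits:
    print(f"edit {edit.source_clip!r}: {edit.instruction}")
```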

Camera movements are also recognized, meaning commands like “tracking shot” or “pan left” will be incorporated into the generated video. Although this still lacks the finesse of real camera control, it represents a notable improvement.

The model does have some unconventional limitations. It generates video just 768 pixels wide, a dimension shared with the now-outdated 1024×768 format and with the height of many “HD ready” displays (1366×768). Movie Gen upscales its output to 1080p, which is the basis for Meta’s claim of producing video at that resolution. The claim isn’t strictly accurate, but upscalers are good enough these days to merit some leniency.
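The arithmetic behind that claim is simple. A quick sketch, assuming a 16:9 clip at the stated 768-pixel width (the exact native dimensions are an assumption here, not something confirmed in this article):

```python
# Back-of-the-envelope math for the 1080p claim, assuming a 16:9 clip
# at the stated 768-pixel width.
native_w = 768
native_h = round(native_w * 9 / 16)   # 432 for a 16:9 frame

target_w, target_h = 1920, 1080       # standard 1080p
scale = target_w / native_w           # 2.5x in each dimension

print(f"native:   {native_w}x{native_h}")
print(f"upscaled: {round(native_w * scale)}x{round(native_h * scale)}")  # 1920x1080
print(f"output pixels interpolated per native pixel: {scale ** 2 - 1:.2f}")  # 5.25
```

Each native pixel becomes 6.25 output pixels at a 2.5× scale, so most of the “1080p” detail is interpolated rather than generated.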

Interestingly, the model generates up to 16 seconds of video at 16 frames per second, a frame rate nobody has ever asked for. Alternatively, it can produce 10 seconds of video at a more conventional 24 FPS.
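Notably, the two modes work out to almost the same total number of frames, which suggests, though Meta doesn’t say so, that the real constraint is a frame budget rather than a duration limit:

```python
# The two advertised modes come out to nearly the same total frame count,
# which hints (our inference, not Meta's statement) that the underlying
# constraint is a frame budget rather than clip length.
modes = {
    "16 s @ 16 fps": 16 * 16,   # 256 frames
    "10 s @ 24 fps": 10 * 24,   # 240 frames
}
for name, frames in modes.items():
    print(f"{name}: {frames} frames")
```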

As for the absence of voice generation, there are likely two factors at play. First, synchronized speech is a much harder problem than ambient sound: matching speech to lip movements and corresponding facial expressions adds substantial complexity, making it a prudent choice to postpone this capability. A prompt like “a clown delivering the Gettysburg Address while riding a tiny bike in circles” could quickly turn into viral chaos.

The second factor appears to be political: launching what amounts to a deepfake generator just ahead of a major election carries obvious reputational risk. By limiting the model’s capabilities, Meta raises the bar for misuse, ensuring that any malicious actor would have to do real work to abuse it. Combining this generative model with speech synthesis and lip-syncing tools is certainly feasible, but generating a candidate making outrageous claims is just as certainly not advisable.

“Movie Gen is currently a purely experimental AI research concept, and maintaining safety is our foremost priority, as it has been with all our generative AI technologies,” a Meta representative said in response to inquiries.

Unlike Meta’s Llama large language models, Movie Gen won’t be publicly accessible. Anyone can attempt to replicate its methods from the research paper, but the code itself will remain unpublished, aside from the “underlying evaluation prompt dataset,” which records the prompts used to generate the test videos.
