Google's new video-generation AI model, Lumiere, uses a new diffusion model called Space-Time U-Net, or STUNet, which figures out where objects in a video are (space) and how they simultaneously move and change (time). Ars Technica notes that this method lets Lumiere create the video in a single process rather than stitching together smaller still frames.
Lumiere starts by creating a base frame from the prompt. It then uses the STUNet framework to approximate where objects within that frame will move, generating additional frames that flow into one another and create the appearance of seamless motion. Lumiere also generates 80 frames, compared to 25 frames from Stable Video Diffusion.
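To make the single-pass idea concrete, here is a minimal, illustrative sketch (my own, not Google's code) of what a space-time U-Net style block can look like in PyTorch. The class name, layer sizes, and frame counts are arbitrary assumptions; the point is only that 3D convolutions compress the clip along height, width, and time together, so all 80 frames are processed as one volume rather than one image at a time.

```python
# Illustrative only: a toy space-time block, not Lumiere's actual architecture.
import torch
import torch.nn as nn


class TinySpaceTimeUNet(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        # 3D convolutions mix information across time as well as space.
        self.down = nn.Conv3d(3, channels, kernel_size=3, stride=2, padding=1)
        self.mid = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.up = nn.ConvTranspose3d(channels, 3, kernel_size=4, stride=2, padding=1)

    def forward(self, clip):
        # clip shape: (batch, channels, frames, height, width)
        x = torch.relu(self.down(clip))  # compress space *and* time
        x = torch.relu(self.mid(x))      # reason about the compressed clip
        return self.up(x)                # expand back to full resolution


# The whole 80-frame clip goes through the network in a single pass,
# instead of generating keyframes first and filling the gaps afterwards.
noisy_clip = torch.randn(1, 3, 80, 64, 64)
denoised = TinySpaceTimeUNet()(noisy_clip)
print(denoised.shape)  # torch.Size([1, 3, 80, 64, 64])
```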
Admittedly, I'm more of a text reporter than a video person, but Google's news release, along with a preprint scientific paper, shows that AI video generation and editing tools have gone from the uncanny valley to near-realistic in just a few years. It also establishes Google's technology in a space already occupied by competitors like Runway, Stable Video Diffusion, and Meta's Emu. Runway, one of the first mass-market text-to-video platforms, released Runway Gen-2 in March last year and began offering more realistic-looking videos, though its videos still have difficulty portraying motion.
Google was kind enough to put the clips and prompts on the Lumiere site, allowing me to run the same prompts through Runway for comparison. Here are the results:
Yes, some of the clips presented have a touch of artificiality, especially if you look closely at skin texture or when the scene is more atmospheric. But look at that turtle! It moves like a turtle would in water! It looks like a real turtle! I sent the Lumiere intro video to a friend who is a professional video editor. While she noted that "you can clearly tell this isn't quite real," she found it impressive enough that, had I not told her it was AI, she would have assumed it was CGI. (She also said, "That would take my job, wouldn't it?")
Other models stitch videos together from generated keyframes in which the movement has already happened (think of the drawings in a flip book), while STUNet lets Lumiere focus on the movement itself based on where the generated content should be at a given moment in the video.
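For contrast, here is a rough sketch (again my own illustration, not any vendor's actual pipeline) of what the keyframe-first approach amounts to: a handful of widely spaced keyframes is generated, and the frames between them are filled in afterwards, which is roughly where fast motion can get lost.

```python
# Illustrative only: naive temporal "stitching" between generated keyframes.
import torch
import torch.nn.functional as F

# Pretend these 8 frames are generated keyframes: (batch, channels, frames, H, W).
keyframes = torch.randn(1, 3, 8, 64, 64)

# Fill in the missing frames afterwards by interpolating along the time axis.
full_clip = F.interpolate(keyframes, size=(80, 64, 64), mode="trilinear", align_corners=False)
print(full_clip.shape)  # torch.Size([1, 3, 80, 64, 64])
```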
Google hasn't been a big player in the text-to-video category, but it has slowly released more advanced AI models and leaned toward a more multimodal focus. Its Gemini large language model will eventually bring image generation to Bard. Lumiere isn't available for testing yet, but it shows Google's ability to develop an AI video platform that's comparable to, and arguably slightly better than, generally available AI video generators like Runway and Pika. And just for the record, this is where Google was with AI video a couple of years ago.
In addition to text-to-video generation, Lumiere will also allow for image-to-video generation; stylized generation, which lets users make videos in a specific style; cinemagraphs, which animate only a portion of a video; and inpainting, which masks out an area of the video to change its color or pattern.
However, Google's Lumiere paper noted that "there is a risk of misuse for creating fake or malicious content using our technology, and we believe it is essential to develop and apply tools to detect biases and cases of malicious use in order to ensure safe and fair use." The paper's authors did not explain how this could be achieved.