Google researchers unveil ‘VLOGGER’, an AI that can bring still photos to life

Google researchers have developed a new artificial intelligence system that can generate lifelike videos of people speaking, gesturing and moving — from just a single still photo. The technology, called VLOGGER, relies on advanced machine learning models to synthesize startlingly realistic footage, opening up a range of potential applications while also raising concerns around deepfakes and misinformation.

Described in a research paper titled “VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis,” the AI model takes a photo of a person and an audio clip as input and outputs a video that matches the audio, showing the person speaking the words with corresponding facial expressions, head movements and hand gestures. The videos are not perfect and contain some artifacts, but they represent a significant leap in the ability to animate still images.

The researchers, led by Enric Corona at Google Research, used a class of machine learning models known as diffusion models to achieve the result. Diffusion models have recently shown remarkable performance at generating highly realistic images from text descriptions. By extending them into the video domain and training on a vast new dataset, the team was able to create an AI system that can bring photos to life in a highly convincing way.
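For readers curious about what a diffusion model does under the hood, the sketch below illustrates the noising half of the generic, textbook formulation: training data is gradually blended with random noise, and a neural network (not shown) is trained to reverse that process. This is not VLOGGER's code. The function names, the linear noise schedule and the four toy "pixel" values are all assumptions chosen for illustration.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances rising linearly (illustrative values, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for all steps s up to t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * gaussian noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)                # fixed seed so runs are repeatable
pixels = [0.5, -0.25, 1.0, 0.0]       # hypothetical stand-in for image data
slightly_noisy = add_noise(pixels, 10, alpha_bars, rng)    # early step: mostly signal
mostly_noise = add_noise(pixels, 999, alpha_bars, rng)     # final step: almost pure noise
```

Generation runs this process in reverse: starting from pure noise, a trained network removes a little noise at each step until an image, or in VLOGGER's case a sequence of video frames, emerges.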

“In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate,” the authors wrote.

A key enabler was the curation of a large new dataset called MENTOR, containing over 800,000 diverse identities and 2,200 hours of video, an order of magnitude larger than what was previously available. This allowed VLOGGER to learn to generate videos of people with varied ethnicities, ages, clothing, poses and surroundings, a breadth intended to limit bias.

The technology opens up a range of compelling use cases. The paper demonstrates VLOGGER’s ability to automatically dub videos into other languages by simply swapping out the audio track, to seamlessly edit and fill in missing frames in a video, and to create full videos of a person from a single photo.

One could imagine actors being able to license detailed 3D models of themselves that could be used to generate new performances. The technology could also be used to create photorealistic avatars for virtual reality and gaming. And it might enable the creation of AI-powered virtual assistants and chatbots that are more engaging and expressive.

Google sees VLOGGER as a step toward “embodied conversational agents” that can engage with humans naturally through speech, gestures and eye contact. “VLOGGER can be used as a stand-alone solution for presentations, education, narration, low-bandwidth online communication, and as an interface for text-only human-computer interaction,” the authors wrote.

However, the technology also has the potential for misuse, for example in creating deepfakes: synthetic media in which a person in a video is replaced with someone else’s likeness. As these AI-generated videos become more realistic and easier to create, they could exacerbate the challenges around misinformation and digital fakery.

While impressive, VLOGGER still has limitations. The generated videos are relatively short and have a static background. The individuals don’t move around a 3D environment. And their mannerisms and speech patterns, while realistic, are not yet indistinguishable from those of real humans.

Nonetheless, VLOGGER represents a significant step forward. “We evaluate VLOGGER on three different benchmarks and show that the proposed model surpasses other state-of-the-art methods in image quality, identity preservation and temporal consistency,” the authors reported.

With further advances, this type of AI-generated media is likely to become ubiquitous. We may soon live in a world where it is hard to tell whether the person speaking to us in a video is real or generated by a computer program.

VLOGGER provides an early glimpse of that future. It is a powerful demonstration of the rapid progress being made in artificial intelligence and a sign of the increasing challenges we will face in distinguishing between what is real and what is fake.

VLOGGER generates photorealistic videos of talking and gesturing avatars from a single image. (Credit: enriccorona.github.io)

A breakthrough in synthesizing talking heads

Potential applications and societal implications

A new frontier in AI research

Author: Michael Nuñez
Source: Venturebeat
Reviewed By: Editorial Team
