Just yesterday, I asked if Google would ever get an AI product release right on the first try. Consider that asked and answered — at least, going by the looks of its latest research.
This week, a team of 31 researchers at Google Research showed off VideoPoet, a new large language model (LLM) designed for a variety of video generation tasks.
The fact that the Google Research team built an LLM for these tasks is notable in and of itself. As they write in their pre-review research paper: “Most existing models employ diffusion-based methods that are often considered the current top performers in video generation. These video models typically start with a pretrained image model, such as Stable Diffusion, that produces high-fidelity images for individual frames, and then fine-tune the model to improve temporal consistency across video frames.”
By contrast, rather than building on a diffusion model such as the popular (and controversial) open source Stable Diffusion, the Google Research team decided to use an LLM: a different type of AI model, based on the transformer architecture, that is typically used for text and code generation, as in ChatGPT, Claude 2, or Llama 2. But instead of training it to produce text and code, the team trained it to generate videos.
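For readers who want a feel for what "using an LLM" means mechanically, here is a minimal sketch of greedy autoregressive decoding, the generic loop by which a language model extends a sequence one token at a time. It is not VideoPoet's code: `toy_next_token_probs`, the four-token vocabulary, and the end-of-sequence ID are all hypothetical stand-ins for a trained transformer.

```python
from typing import Callable, List


def greedy_generate(
    next_token_probs: Callable[[List[int]], List[float]],
    prompt: List[int],
    max_new_tokens: int,
    eos_id: int,
) -> List[int]:
    """Extend `prompt` one token at a time, always taking the likeliest token."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(seq)
        # Greedy choice: index of the highest-probability token.
        token = max(range(len(probs)), key=probs.__getitem__)
        seq.append(token)
        if token == eos_id:
            break
    return seq


def toy_next_token_probs(seq: List[int]) -> List[float]:
    """Hypothetical 'model': puts all probability on (last token + 1) mod 4."""
    vocab_size = 4
    probs = [0.0] * vocab_size
    probs[(seq[-1] + 1) % vocab_size] = 1.0
    return probs


generated = greedy_generate(toy_next_token_probs, prompt=[0], max_new_tokens=10, eos_id=3)
print(generated)
```

In a video-generating LLM the tokens would stand for pieces of video or audio rather than words, but the loop itself keeps the same shape.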
Pre-training was key
They did this by heavily “pre-training” the VideoPoet LLM on 270 million videos and more than 1 billion text-and-image pairs from “the public internet and other sources.” Specifically, that data was converted into text embeddings, visual tokens, and audio tokens, on which the AI model was “conditioned.”
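One way to picture visual and audio tokens living inside a single model is a shared vocabulary in which each modality gets its own range of IDs. The sketch below illustrates that general idea only. The vocabulary sizes, marker names such as `<bov>`, and the packing order are assumptions made for the example, not details taken from Google's paper.

```python
from typing import List, Tuple

# Hypothetical codebook sizes; not VideoPoet's real values.
VISUAL_VOCAB = 1024
AUDIO_VOCAB = 512

# Hypothetical marker tokens occupying the first IDs of the shared vocabulary.
SPECIAL = {"<bos>": 0, "<bov>": 1, "<eov>": 2, "<boa>": 3, "<eoa>": 4}
VISUAL_OFFSET = len(SPECIAL)
AUDIO_OFFSET = VISUAL_OFFSET + VISUAL_VOCAB


def pack(visual_codes: List[int], audio_codes: List[int]) -> List[int]:
    """Map per-modality codes into one sequence over a shared ID space."""
    if any(not 0 <= c < VISUAL_VOCAB for c in visual_codes):
        raise ValueError("visual code out of range")
    if any(not 0 <= c < AUDIO_VOCAB for c in audio_codes):
        raise ValueError("audio code out of range")
    seq = [SPECIAL["<bos>"], SPECIAL["<bov>"]]
    seq += [VISUAL_OFFSET + c for c in visual_codes]
    seq += [SPECIAL["<eov>"], SPECIAL["<boa>"]]
    seq += [AUDIO_OFFSET + c for c in audio_codes]
    seq.append(SPECIAL["<eoa>"])
    return seq


def unpack(seq: List[int]) -> Tuple[List[int], List[int]]:
    """Recover the per-modality codes from a packed sequence."""
    visual = [t - VISUAL_OFFSET for t in seq if VISUAL_OFFSET <= t < AUDIO_OFFSET]
    audio = [t - AUDIO_OFFSET for t in seq if t >= AUDIO_OFFSET]
    return visual, audio


packed = pack([0, 5, 1023], [0, 511])
print(packed)
print(unpack(packed))
```

Because every ID belongs to exactly one range, a single model can predict "the next token" without caring whether it is a piece of video or a piece of audio.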
The results are pretty jaw-dropping, even in comparison to some of the state-of-the-art consumer-facing video generation models such as Runway and Pika, the former a Google investment.
Longer, higher quality clips with more consistent motion
More than this, the Google Research team notes that its LLM approach to video generation may actually allow for longer, higher quality clips, eliminating some of the constraints of current diffusion-based video generating AIs, in which the movement of subjects tends to break down or turn glitchy after just a few frames.
“One of the current bottlenecks in video generation is in the ability to produce coherent large motions,” two of the team members, Dan Kondratyuk and David Ross, wrote in a Google Research blog post announcing the work. “In many cases, even the current leading models either generate small motion or, when producing larger motions, exhibit noticeable artifacts.”
But VideoPoet can generate larger and more consistent motion across longer videos of 16 frames, based on the examples posted by the researchers online. It also offers a wider range of capabilities right from the jump, including simulating different camera motions, applying different visual and aesthetic styles, and even generating new audio to match a given video clip. And it accepts a range of inputs as prompts, including text, images, and videos.
Integrating all these video generation capabilities within a single LLM, VideoPoet eliminates the need for multiple, specialized components, offering a seamless, all-in-one solution for video creation.
In fact, viewers surveyed by the Google Research team preferred it. The researchers showed an unspecified number of “human raters” clips generated by VideoPoet alongside clips generated by the video generation diffusion models Show-1, VideoCrafter, and Phenaki, two clips at a time, side by side. The human evaluators largely rated the VideoPoet clips as superior.
As summarized in the Google Research blog post: “On average people selected 24–35% of examples from VideoPoet as following prompts better than a competing model vs. 8–11% for competing models. Raters also preferred 41–54% of examples from VideoPoet for more interesting motion than 11–21% for other models.” You can see the results displayed in a bar chart format below as well.
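To make those percentages concrete, here is a minimal sketch of how side-by-side ratings turn into preference shares. The vote counts are invented for illustration and are not Google's data. The `"tie"` label assumes raters could also express no preference, which would explain why the quoted shares do not add up to 100%.

```python
from collections import Counter
from typing import Dict, List


def preference_shares(votes: List[str]) -> Dict[str, float]:
    """Return the percentage of votes for model 'a', model 'b', and 'tie'."""
    if not votes:
        raise ValueError("no votes to tally")
    counts = Counter(votes)
    total = len(votes)
    return {label: round(100 * counts[label] / total, 1) for label in ("a", "b", "tie")}


# Hypothetical tallies: 100 side-by-side comparisons of model "a" vs. model "b".
votes = ["a"] * 30 + ["b"] * 10 + ["tie"] * 60
shares = preference_shares(votes)
print(shares)
```

With these made-up numbers, model "a" is picked three times as often as model "b" even though most comparisons end without a preference.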
Built for vertical video
Google Research has tailored VideoPoet to produce videos in portrait orientation by default, or “vertical video,” catering to the mobile video marketplace popularized by Snap and TikTok.
Looking ahead, Google Research envisions expanding VideoPoet’s capabilities to support “any-to-any” generation tasks, such as text-to-audio and audio-to-video, further pushing the boundaries of what’s possible in video and audio generation.
There’s only one problem I see with VideoPoet right now: it’s not currently available for public usage. We’ve reached out to Google for more information on when it might become available and will update when we hear back. But until then, we’ll have to wait eagerly for its arrival to see how it really compares to other tools on the market.
Author: Carl Franzen
Source: Venturebeat
Reviewed By: Editorial Team