Stability AI debuts Stable Video Diffusion models in research preview

As OpenAI celebrates the return of Sam Altman, its rivals are moving to up the ante in the AI race. Just after Anthropic’s release of Claude 2.1 and Adobe’s reported acquisition of Rephrase.ai, Stability AI has announced Stable Video Diffusion, marking its entry into the much-sought-after video generation space.

Available for research purposes only, Stable Video Diffusion (SVD) includes two state-of-the-art AI models, SVD and SVD-XT, which generate short video clips from still images. The company says both deliver high-quality outputs, matching or even surpassing the performance of other AI video generators.

Stability AI has open-sourced the image-to-video models as part of its research preview and plans to tap user feedback to further refine them, ultimately paving the way for their commercial application.

Understanding Stable Video Diffusion

According to a blog post from the company, SVD and SVD-XT are latent diffusion models that take in a still image as a conditioning frame and generate 576 × 1024 video from it. Both models produce content at frame rates between three and 30 frames per second, but the output is short, lasting up to four seconds. The SVD model has been trained to produce 14 frames from stills, while SVD-XT goes up to 25, Stability AI noted.
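For readers curious how those specifications map onto code, below is a minimal sketch using Hugging Face’s diffusers library, which offers an image-to-video pipeline for SVD. The pipeline class, checkpoint name and call parameters shown are not drawn from Stability AI’s post, so treat them as assumptions rather than an official recipe.

```python
# Minimal sketch (not from Stability AI's post): image-to-video generation with
# SVD-XT through the diffusers library. The pipeline class, checkpoint name and
# call parameters below are assumptions about the Hugging Face integration.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # offload submodules to CPU to fit consumer GPUs

# A single still image serves as the conditioning frame, resized to 1024 x 576.
image = load_image("conditioning_frame.png").resize((1024, 576))

# SVD-XT is tuned for 25 frames; the frame rate stays within the 3-30 fps range.
result = pipe(
    image,
    num_frames=25,
    fps=7,
    decode_chunk_size=8,  # decode fewer latent frames at once to save memory
    generator=torch.manual_seed(42),
)
export_to_video(result.frames[0], "generated_clip.mp4", fps=7)
```

Lowering decode_chunk_size trades generation speed for GPU memory when the latent frames are decoded back into pixels, which matters on consumer hardware.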

To create Stable Video Diffusion, the company took a large, systematically curated video dataset, comprising roughly 600 million samples, and trained a base model with it. Then, this model was fine-tuned on a smaller, high-quality dataset (containing up to a million clips) to tackle downstream tasks such as text-to-video and image-to-video, predicting a sequence of frames from a single conditioning image.

Stability AI said the data for training and fine-tuning the model came from publicly available research datasets, although the exact source remains unclear.

More importantly, in a whitepaper detailing SVD, the authors write that this model can also serve as a base to fine-tune a diffusion model capable of multi-view synthesis. This would enable it to generate multiple consistent views of an object using just a single still image.

All of this could eventually culminate in a wide range of applications across sectors such as advertising, education and entertainment, the company added in its blog post.

High-quality output but limitations remain

In an external evaluation by human voters, SVD outputs were found to be of high quality, largely surpassing leading closed text-to-video models from Runway and Pika Labs. However, the company notes that this is just the beginning of its work and that the models are far from perfect at this stage. They often fall short of photorealism, generate videos without motion or with very slow camera pans, and fail to render faces and people as users may expect.

Eventually, the company plans to use this research preview to refine both models, address their current gaps and introduce new features, such as support for text prompts or text rendering in videos, for commercial applications. It emphasized that the current release is mainly aimed at inviting open investigation of the models, which could surface more issues (such as biases) and support safe deployment later.

“We are planning a variety of models that build on and extend this base, similar to the ecosystem that has built around stable diffusion,” the company wrote. It has also started inviting users to sign up for an upcoming web experience that will let them generate videos from text.

That said, it remains unclear when exactly the experience will be available.

A glimpse of Stable Video Diffusion’s text-to-video experience

How to use the models?

To get started with the new open-source Stable Video Diffusion models, users can find the code on the company’s GitHub repository and the weights required to run the models locally on its Hugging Face page. The company notes that usage will be allowed only after acceptance of its terms, which detail both permitted and excluded applications.
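As a concrete illustration of that download step, here is a minimal sketch using the huggingface_hub client. The repository id and the token handling are assumptions, since the post does not name the exact repository, and the gated weights are only served after accepting Stability AI’s terms on the model page.

```python
# Minimal sketch: pulling the Stable Video Diffusion weights for local use with
# the huggingface_hub client. The repository id and the need for an access token
# are assumptions; the weights are only served after accepting Stability AI's
# terms on the Hugging Face model page.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="stabilityai/stable-video-diffusion-img2vid-xt",  # assumed repo id
    token="hf_your_access_token",  # token for an account that accepted the terms
)
print(f"Model weights downloaded to: {local_path}")
```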

As of now, along with researching and probing the models, permitted use cases include generating artwork for design and other artistic processes, as well as applications in educational or creative tools.

Generating factual or “true representations of people or events” remains out of scope, Stability AI said.

Author: Shubham Sharma
Source: VentureBeat
Reviewed By: Editorial Team
