
IBM’s AI generates new footage from video stills

A paper coauthored by researchers at IBM describes an AI system — Navsynth — that can reconstruct videos seen during training as well as synthesize novel, unseen videos. While this in and of itself isn’t novel — it’s an area of acute interest for Alphabet’s DeepMind and others — the researchers say the approach produces higher-quality videos than existing methods. If the claim holds water, their system could be used to synthesize videos on which other AI systems train, supplementing real-world data sets that are incomplete or marred by corrupted samples.

As the researchers explain, the bulk of work in the video synthesis domain leverages GANs, or two-part neural networks consisting of generators that produce samples and discriminators that attempt to distinguish between the generated samples and real-world samples. They’re highly capable but suffer from a phenomenon called mode collapse, where the generator produces a limited diversity of samples (or even the same sample) regardless of the input.
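For readers unfamiliar with the setup, the two networks are trained against each other: the generator tries to fool the discriminator, while the discriminator tries to tell real samples from generated ones. The sketch below is a generic PyTorch illustration of that structure only; the layer sizes, dimensions, and variable names are invented for clarity and do not come from the IBM paper.

```python
import torch
import torch.nn as nn

# Generic GAN skeleton for illustration only -- not IBM's Navsynth architecture.
# latent_dim and the layer sizes are arbitrary choices for this sketch.
latent_dim, data_dim = 64, 784

generator = nn.Sequential(          # maps random noise to a fake sample
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)

discriminator = nn.Sequential(      # scores samples as real (1) or fake (0)
    nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(16, latent_dim)     # a batch of random noise vectors
fake = generator(z)                 # generated samples
score = discriminator(fake)         # discriminator's belief that they are real
print(score.shape)                  # torch.Size([16, 1])
```

Mode collapse shows up in this setup when, no matter which noise vector z is drawn, the generator keeps returning nearly identical samples that the discriminator fails to penalize.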

By contrast, IBM’s system consists of a variable representing video content features, a frame-specific transient variable (more on that later), a generator, and a recurrent machine learning model. It breaks videos down into a static constituent that captures the constant portion of the video common to all frames and a transient constituent that represents the temporal dynamics (i.e., periodic regularity driven by time-based events) between all the frames in the video. Effectively, the system jointly learns the static and transient constituents, which it uses to generate videos at inference time.

Above: Videos generated by IBM’s Navsynth system.
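The paper’s exact formulation isn’t spelled out in the article, but the general idea of pairing one static code shared across frames with a sequence of transient codes produced by a recurrent model, then decoding each frame from the pair, can be sketched roughly as follows. Every module, dimension, and name in this sketch is an assumption made for illustration, not a detail from the paper.

```python
import torch
import torch.nn as nn

# Rough illustration of a static + transient video generator.
# Dimensions and modules are invented for this sketch, not taken from the paper.
static_dim, transient_dim, frame_pixels, num_frames = 128, 32, 64 * 64 * 3, 16

rnn = nn.GRU(transient_dim, transient_dim, batch_first=True)  # models temporal dynamics
decoder = nn.Sequential(                                      # renders one frame at a time
    nn.Linear(static_dim + transient_dim, 512), nn.ReLU(),
    nn.Linear(512, frame_pixels), nn.Tanh(),
)

batch = 4
static_z = torch.randn(batch, static_dim)                     # shared across all frames
transient_in = torch.randn(batch, num_frames, transient_dim)
transient_z, _ = rnn(transient_in)                            # per-frame transient codes

# Concatenate the shared static code with each frame's transient code and decode.
static_rep = static_z.unsqueeze(1).expand(-1, num_frames, -1)
frames = decoder(torch.cat([static_rep, transient_z], dim=-1))
print(frames.shape)  # torch.Size([4, 16, 12288]) -> reshape to (4, 16, 3, 64, 64)
```

In a real model the decoder would typically be convolutional and the static code would be inferred from an input video rather than sampled at random; the sketch only conveys how the two constituents combine per frame.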

To ensure the static portion of the video is captured, the researchers’ system randomly chooses a frame during training and compares it with its corresponding generated frame. This keeps the generated frame close to the ground truth frame.
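A minimal sketch of that random-frame comparison, assuming a simple pixel-wise reconstruction loss (the paper may well use a different objective), might look like this:

```python
import torch
import torch.nn.functional as F

# Illustrative random-frame reconstruction check, not the paper's exact loss.
real_video = torch.randn(4, 16, 3, 64, 64)       # (batch, frames, channels, H, W)
generated_video = torch.randn(4, 16, 3, 64, 64)  # stand-in for the model's output

t = torch.randint(0, real_video.shape[1], (1,)).item()  # randomly chosen frame index
frame_loss = F.mse_loss(generated_video[:, t], real_video[:, t])
print(t, frame_loss.item())  # penalty that pulls the generated frame toward the real one
```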

In experiments, the research team trained, validated, and tested the system on three publicly available data sets: Chair-CAD, which consists of 1,393 3D models of chairs (of which 820 were chosen, using the first 16 frames of each); Weizmann Human Action, which provides 10 different actions performed by nine people, amounting to 90 videos; and the Golf scene data set, which contains 20,268 golf videos (of which 500 were chosen).


Compared with the videos generated by several baseline models, the researchers say their system produced “visually more appealing” videos that “maintained consistency” with sharper frames. Moreover, it reportedly demonstrated a knack for frame interpolation, a form of video processing in which intermediate frames are generated between existing ones in an attempt to make animation more fluid.
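As a purely illustrative example of what frame interpolation means, the snippet below blends two neighboring frames to produce an intermediate one; this naive averaging is not the system’s actual interpolation method.

```python
import torch

# Naive frame interpolation by blending two existing frames -- purely illustrative.
frame_a = torch.rand(3, 64, 64)   # frame at time t
frame_b = torch.rand(3, 64, 64)   # frame at time t + 1
alpha = 0.5                       # halfway between the two frames
intermediate = (1 - alpha) * frame_a + alpha * frame_b
print(intermediate.shape)         # torch.Size([3, 64, 64])
```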


Author: Kyle Wiggers.
Source: VentureBeat
