Video Creation by Demonstration. Given a demonstration video, our proposed \(\delta\)-Diffusion generates a video that naturally continues from a context image and carries out the same action concepts. The emoji is added to preserve anonymity in the visualization; it is not used during training or inference.
We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts from the demonstration. To enable this capability, we present \(\delta\)-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls, which rely on explicit signals, we adopt implicit latent control for the flexibility and expressiveness required by general videos. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos to condition the generation process with minimal appearance leakage. Empirically, \(\delta\)-Diffusion outperforms related baselines in both human preference and large-scale machine evaluations, and demonstrates potential towards interactive world simulation.
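As a minimal sketch of the inference-time interface just described, the following Python snippet shows how a context image and a demonstration video are combined; the callables `extract_action_latents` and `generator` are hypothetical stand-ins for the appearance bottleneck and the conditional generation model, and are not the paper's actual implementation.

```python
# Minimal sketch of Video Creation by Demonstration at inference time,
# under assumed interfaces (the two callables below are hypothetical).
from typing import Callable, Sequence


def create_by_demonstration(
    context_image,                     # context frame I from a new scene
    demonstration_video: Sequence,     # demonstration video V
    extract_action_latents: Callable,  # V -> action latents (appearance bottleneck)
    generator: Callable,               # (I, action latents) -> generated video
):
    """Generate a video that continues from I and reenacts the action in V."""
    delta_v = extract_action_latents(demonstration_video)  # implicit control signal
    return generator(context_image, delta_v)               # conditional generation
```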
(a) Overview of \(\delta\)-Diffusion. The context frame \(I\) is provided to the generation model \(\mathcal{G}\) along with the action latents \(\delta_V\) extracted from the demonstration video \(V\). (b) Extracting action latents. A spatiotemporal vision encoder extracts temporally-aggregated spatiotemporal representations \(\mathbf{z}\) from an input video \(V\), with \(t\) denoting the temporal dimension. In parallel, a spatial vision encoder extracts per-frame representations from \(V\), which are aligned to \(\mathbf{z}\) by the feature predictor \(\mathcal{P}\), yielding \(\mathbf{h}\). The appearance bottleneck then computes the action latents \(\delta_V\) by subtracting the aligned spatial representations \(\mathbf{h}\) from the spatiotemporal representations \(\mathbf{z}\).
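The subtraction-based bottleneck in (b) can be sketched as below. This is only an illustrative toy, assuming the encoders and predictor are simple linear placeholders operating on pre-extracted frame features rather than the video foundation models used in the paper; the module name `AppearanceBottleneck` and all dimensions are ours.

```python
# Toy sketch of the appearance bottleneck: action latents = z - h,
# with hypothetical linear placeholders for the encoders and predictor P.
import torch
import torch.nn as nn


class AppearanceBottleneck(nn.Module):
    def __init__(self, frame_dim: int = 512, latent_dim: int = 256):
        super().__init__()
        # Placeholder for the spatiotemporal video encoder (hypothetical).
        self.spatiotemporal_encoder = nn.Linear(frame_dim, latent_dim)
        # Placeholder for the per-frame spatial encoder (hypothetical).
        self.spatial_encoder = nn.Linear(frame_dim, frame_dim)
        # Feature predictor P that aligns per-frame spatial features to z.
        self.predictor = nn.Linear(frame_dim, latent_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, t, frame_dim) pre-extracted frame features of V.
        z = self.spatiotemporal_encoder(frame_feats)            # spatiotemporal rep.
        h = self.predictor(self.spatial_encoder(frame_feats))   # aligned spatial rep.
        return z - h                                            # action latents


feats = torch.randn(2, 8, 512)           # 2 clips, 8 frames, 512-d frame features
delta_v = AppearanceBottleneck()(feats)  # (2, 8, 256) action latents
```

The design intent, as stated in the caption, is that subtracting the appearance-dominated spatial representation leaves a control signal that carries the action while minimizing appearance leakage into the generator.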
Given one context image, \(\delta\)-Diffusion generates videos with a variety of actions by conditioning on different demonstration videos. The context image is shown, initially paused, at the start of each generated video.
Auto-regressive generation controlled by a sequence of three demonstration videos of varying lengths. For visualization purposes, the demonstration videos are placed side by side and each is played during the segment of generation that it conditions.
Please refer to this page for more qualitative samples, an ablation study of the bottleneck design, qualitative comparisons against prior works, and failure cases.