Video Creation by Demonstration. Given a demonstration video and an initial frame, our proposed \(\delta\)-Diffusion generates a video that continues naturally from the initial frame and carries out the same action as shown in the demonstration. The emoji is added to preserve anonymity in visualization only; it is not used during training or inference.
We present Video Creation by Demonstration: given a demonstration video and an initial frame from any scene, we generate a realistic video that continues naturally from the initial frame and carries out the action concepts from the demonstration. This setting is important because, unlike captions, camera poses, or point tracks, a demonstration video provides a detailed description of the target action without requiring extensive manual annotation.
The main challenge in training such models is the difficulty of curating supervised training data, i.e., pairs of the same action performed in different contexts. To mitigate this, we propose \(\delta\)-Diffusion, a self-supervised method that learns from unlabeled videos. Our key insight is that by placing a separately learned bottleneck on the features of a video foundation model, we can extract demonstration actions from these features while minimizing degenerate solutions. We find that \(\delta\)-Diffusion outperforms baselines in both human preference and large-scale machine evaluations.
(a) Overview of \(\delta\)-Diffusion. The initial frame \(I\) is provided to the generation model \(\mathcal{G}\) along with the action latents \(\delta_V\) extracted from the demonstration video \(V\). The output \(\hat{V}\) continues naturally from \(I\) and carries out the actions in \(V\). During training, the target \(\hat{V}\) is the same as the demonstration \(V\). (b) Extracting action latents. First, a video encoder extracts the spatiotemporal representations \(\mathbf{z}^{ST}\) from demonstration \(V\), with \(t\) denoting the temporal dimension. In parallel, an image encoder extracts per-frame spatial representations \(\mathbf{z}^{S}\) from \(V\), which are aligned to \(\mathbf{z}^{ST}\) by feature predictor \(\mathcal{P}\). The appearance bottleneck then computes the action latents \(\delta_V\) by subtracting the aligned spatial representations \(\mathcal{P}(\mathbf{z}^S_t)\) from the spatiotemporal representations \(\mathbf{z}^{ST}_t\) for each frame \(V_t\).
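To make the appearance bottleneck in (b) concrete, below is a minimal sketch of how the action latents \(\delta_V\) could be computed, assuming per-frame features have already been extracted by frozen encoders. The tensor shapes, the MLP form of the predictor \(\mathcal{P}\), and all names are illustrative assumptions, not the exact implementation.

```python
# Sketch of the appearance bottleneck: delta_V = z^ST_t - P(z^S_t) per frame.
# All module sizes and names here are assumptions for illustration.
import torch
import torch.nn as nn


class AppearanceBottleneck(nn.Module):
    """Computes action latents by removing aligned per-frame appearance."""

    def __init__(self, dim_spatial: int, dim_spatiotemporal: int, hidden: int = 1024):
        super().__init__()
        # Feature predictor P: maps per-frame spatial features into the
        # spatiotemporal feature space (a small MLP, as an assumption).
        self.predictor = nn.Sequential(
            nn.Linear(dim_spatial, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim_spatiotemporal),
        )

    def forward(self, z_st: torch.Tensor, z_s: torch.Tensor) -> torch.Tensor:
        # z_st: [T, D_st] spatiotemporal representations of demonstration V
        # z_s:  [T, D_s]  per-frame spatial representations of V
        # Subtracting the aligned appearance features leaves action latents.
        return z_st - self.predictor(z_s)


# Usage with dummy features for a 16-frame demonstration:
z_st = torch.randn(16, 768)    # from the video encoder
z_s = torch.randn(16, 1024)    # from the image encoder
bottleneck = AppearanceBottleneck(dim_spatial=1024, dim_spatiotemporal=768)
delta_v = bottleneck(z_st, z_s)  # action latents passed to the generator G
```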
Given one initial frame, \(\delta\)-Diffusion generates videos with a variety of actions by conditioning on different demonstration videos. The initial frames are shown and paused at the beginning of the generated videos.
Compositional generation controlled by a sequence of three different demonstration videos of varying lengths. For visualization purposes, the demonstration videos are placed side by side, and each is played during the segment it conditions.
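One plausible way to realize such compositional control, sketched below under assumptions rather than as the confirmed procedure, is to condition on each demonstration in turn and reuse the last generated frame as the initial frame for the next segment. Here `generate` and `extract_action_latents` are hypothetical wrappers around the generator \(\mathcal{G}\) and the appearance bottleneck.

```python
# Hypothetical chaining scheme for multi-demonstration composition.
def compose(initial_frame, demonstrations, generate, extract_action_latents):
    clips = []
    frame = initial_frame
    for demo in demonstrations:
        delta_v = extract_action_latents(demo)  # action latents of this demo
        clip = generate(frame, delta_v)         # video continuing from `frame`
        clips.append(clip)
        frame = clip[-1]                        # last frame seeds the next segment
    return clips
```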
Please refer to this page for more qualitative samples, an ablation study on the bottleneck design, qualitative comparisons against prior work, and failure cases.