Qualitative results for the bottleneck ablation. Using no bottleneck ("None") or a temporal normalization bottleneck ("Temp. Norm.") leads to appearance leakage, whereas generation with our appearance bottleneck preserves the context of the initial frame.
Qualitative comparisons of \(\delta\)-Diffusion against MotionDirector and WALT.
For MotionDirector and WALT, ground-truth captions are additionally provided during inference:
Row 1: "pushing a cloth clip from right to left".
Row 2: "moving phone up".
For MotionDirector and WALT, ground-truth captions are additionally provided during inference:
Row 1: "put oregano back".
Row 2: "wash knife".
For MotionDirector and WALT, ground-truth captions are additionally provided during inference:
Row 1: "knock blue plastic bottle over".
Row 2: "knock water bottle over".
Failure cases: the demonstration and the initial frame are mismatched (row 1), the semantics of the action concept are not fully carried out (row 2), and object permanence is not maintained for objects whose appearance changes rapidly (row 3).