Qualitative results for the bottleneck ablation. Using no bottleneck ("None") or a temporal normalization bottleneck ("Temp. Norm.") leads to appearance leakage, while generation with our appearance bottleneck preserves the input context.
Qualitative comparisons of \(\delta\)-Diffusion against MotionDirector and WALT.
For MotionDirector and WALT, ground truth captions are additionally provided during inference:
Row 1: "pushing a cloth clip from right to left".
Row 2: "moving phone up".
For MotionDirector and WALT, ground truth captions are additionally provided during inference:
Row 1: "put oregano back".
Row 2: "wash knife".
For MotionDirector and WALT, ground truth captions are additionally provided during inference:
Row 1: "knock blue plastic bottle over".
Row 2: "knock water bottle over".
We show failure cases where the demonstration and the context image are mismatched (row 1), the semantics of the action concepts are not fully carried out (row 2), and permanence is not maintained for objects with fast appearance changes (row 3).