We present sample results of our method.
We present sample results of our method combined with post-process de-flickering.
We present sample results of our method on top of SDEdit ([7]).
"a shiny silver robot" | Per-frame SDEdit [6] | TokenFlow + SDEdit [6] |
---|---|---|
"an ice sculpture" | ||
We present sample results of our method on top of ControlNet image synthesis ([9]).
"a colorful oil painting of a wolf" | Per-frame ControlNet [6] | TokenFlow + ControlNet [6] |
---|---|---|
"an anime of a man in a field" | ||
Existing text-guided video editing methods suffer from temporal inconsistency; we compare our method against them below.
"rainbow textured dog" | Ours | Text-to-video ([1]) | TAV ([2]) |
---|---|---|---|
Gen1 ([3]) | PnP per frame ([4]) | Fate-Zero ([8]) | Re-render a Video ([10]) |
"an origami of a stork" | Ours | Text-to-video ([1]) | TAV ([2]) |
---|---|---|---|
Gen1 ([3]) | PnP per frame ([4]) | Fate-Zero ([8]) | Re-render a Video ([10]) |
"a metal sculpture" | Ours | Text-to-video ([1]) | TAV ([2]) |
---|---|---|---|
Gen1 ([3]) | PnP per frame ([4]) | Fate-Zero ([8]) | Re-render a Video ([10]) |
"a fluffy wolf doll" | Ours | Text-to-video ([1]) | TAV ([2]) |
---|---|---|---|
Gen1 ([3]) | PnP per frame ([4]) | Fate-Zero ([8]) | Re-render a Video ([10]) |
We present additional qualitative comparisons of our method with Text2LIVE ([5]) and Ebsynth ([6]).
Text2LIVE lacks a strong generative prior and thus produces results of poor visual quality. Ebsynth performs well on video frames close to the edited keyframe, but either fails to propagate the edit to the rest of the video or introduces artifacts.
"a car in s snowy scene" | Ours | Text2LIVE ([5]) | Ebsynth ([6]) |
---|---|---|---|
We ablate TokenFlow propagation and keyframe randomization; a sketch of the keyframe schedule follows the table below.
"a colorful polygonal illustration" | Ours | Ours, constant keyframes | Extended attention, random keyframes |
---|---|---|---|
"a rainbow textured dog" | |||
We present a PCA visualization of the features of the original video, of a video edited by our method, and of a video edited per frame with PnP ([4]). Different rows show features from different layers of the UNet decoder; a minimal sketch of the visualization follows the table.
| Original video | Ours | Per-frame editing |
|---|---|---|
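A minimal sketch of how such a visualization can be computed, assuming the UNet decoder activations have already been extracted into an array of shape (frames, channels, height, width); the variable and function names are hypothetical.

```python
# Sketch of the feature PCA visualization; assumes decoder activations were
# already extracted into `features` with shape (frames, channels, h, w).
import numpy as np
from sklearn.decomposition import PCA

def pca_rgb(features: np.ndarray) -> np.ndarray:
    f, c, h, w = features.shape
    # One row per pixel across all frames, so every frame shares one basis.
    flat = features.transpose(0, 2, 3, 1).reshape(-1, c)
    proj = PCA(n_components=3).fit_transform(flat)
    # Rescale each principal component to [0, 1] for display as RGB.
    proj = (proj - proj.min(axis=0)) / (proj.max(axis=0) - proj.min(axis=0) + 1e-8)
    return proj.reshape(f, h, w, 3)
```

Fitting one PCA basis across all frames is what makes the visualization comparable over time: temporally consistent features map to consistent colors.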
[1] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
[2] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.
[3] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011, 2023.
[4] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[5] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European Conference on Computer Vision. Springer, 2022.
[6] Ondřej Jamriška, Šárka Sochorová, Ondřej Texler, Michal Lukáč, Jakub Fišer, Jingwan Lu, Eli Shechtman, and Daniel Sýkora. Stylizing video by example. ACM Transactions on Graphics, 2019.
[7] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
[8] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535, 2023.
[9] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
[10] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. arXiv preprint arXiv:2306.07954, 2023.