We present sample results of our method.
We present sample results of our method combined with post-process de-flickering.
We present sample results of our method on top of SDEdit ([7]).
"a shiny silver robot" | Per-frame SDEdit [6] | TokenFlow + SDEdit [6] |
---|---|---|
"an ice sculpture" | ||
We present sample results of our method on top of ControlNet image synthesis ([9]).
"a colorful oil painting of a wolf" | Per-frame ControlNet [6] | TokenFlow + ControlNet [6] |
---|---|---|
"an anime of a man in a field" | ||
Existing text-guided video editing methods suffer from temporal inconsistency; we compare our method against them below.
"rainbow textured dog" | Ours | Text-to-video ([1]) | TAV ([2]) |
---|---|---|---|
Gen1 ([3]) | PnP per frame ([4]) | Fate-Zero ([8]) | Re-render a Video ([10]) |
"an origami of a stork" | Ours | Text-to-video ([1]) | TAV ([2]) |
---|---|---|---|
Gen1 ([3]) | PnP per frame ([4]) | Fate-Zero ([8]) | Re-render a Video ([10]) |
"a metal sculpture" | Ours | Text-to-video ([1]) | TAV ([2]) |
---|---|---|---|
Gen1 ([3]) | PnP per frame ([4]) | Fate-Zero ([8]) | Re-render a Video ([10]) |
"a fluffy wolf doll" | Ours | Text-to-video ([1]) | TAV ([2]) |
---|---|---|---|
Gen1 ([3]) | PnP per frame ([4]) | Fate-Zero ([8]) | Re-render a Video ([10]) |
We present additional qualitative comparisons of our method with Text2LIVE ([5]) and Ebsynth ([6]).
Text2LIVE lacks a strong generative prior and thus produces results of poor visual quality. Ebsynth performs well on video frames close to the edited keyframe, but either fails to propagate the edit to the rest of the video or introduces artifacts.
"a car in s snowy scene" | Ours | Text2LIVE ([5]) | Ebsynth ([6]) |
---|---|---|---|
We ablate TokenFlow propagation and keyframe randomization; a sketch of the keyframe schedule follows the table below.
"a colorful polygonal illustration" | Ours | Ours, constant keyframes | Extended attention, random keyframes |
---|---|---|---|
"a rainbow textured dog" | |||
We present a PCA visualization of the features of the original video, of a video edited by our method, and of a video edited per frame with PnP ([4]). Different rows show features from different layers of the UNet decoder; a minimal sketch of the visualization follows the table.
| Original video | Ours | Per-frame editing |
|---|---|---|
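A minimal sketch of how such a visualization can be computed, assuming the UNet decoder activations have already been extracted into an array of shape (frames, channels, height, width); the variable and function names are hypothetical.

```python
# Sketch of the feature PCA visualization; assumes decoder activations were
# already extracted into `features` with shape (frames, channels, h, w).
import numpy as np
from sklearn.decomposition import PCA

def pca_rgb(features: np.ndarray) -> np.ndarray:
    f, c, h, w = features.shape
    # One row per pixel across all frames, so every frame shares one basis.
    flat = features.transpose(0, 2, 3, 1).reshape(-1, c)
    proj = PCA(n_components=3).fit_transform(flat)
    # Rescale each principal component to [0, 1] for display as RGB.
    proj = (proj - proj.min(axis=0)) / (proj.max(axis=0) - proj.min(axis=0) + 1e-8)
    return proj.reshape(f, h, w, 3)
```

Fitting one PCA basis across all frames is what makes the visualization comparable over time: temporally consistent features map to consistent colors.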
[1] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
[2] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.
[3] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011, 2023.
[4] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[5] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European Conference on Computer Vision. Springer, 2022.
[6] Ondřej Jamriška, Šárka Sochorová, Ondřej Texler, Michal Lukáč, Jakub Fišer, Jingwan Lu, Eli Shechtman, and Daniel Sýkora. Stylizing video by example. ACM Transactions on Graphics, 2019.
[7] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
[8] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535, 2023.
[9] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
[10] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. arXiv preprint arXiv:2306.07954, 2023.