The 360AI team, in collaboration with Sun Yat-sen University, has recently introduced FancyVideo, a new video generation model. Built on the UNet architecture, the model can produce high-quality videos with diverse resolutions, aspect ratios, styles, and motion ranges on consumer-grade graphics processing units (GPUs), such as the NVIDIA GeForce RTX 3090.
The model's capabilities extend beyond basic generation: it can also extend existing videos forward and backtrack them in time, opening up new possibilities for content creators and researchers alike. The 360AI team, known for its dedication to visual generation research and open-source community contributions, is led by Ao Ma, a master's graduate of the Chinese Academy of Sciences' Institute of Computing Technology. Ma has an extensive background in academic research and algorithm development at Microsoft Research Asia's Visual Computing Group and Alibaba's Tongyi Lab.
The team's latest endeavor, FancyVideo, addresses a common issue in existing text-to-video (T2V) models, which typically rely on spatial cross-attention. Because spatial cross-attention applies the same text condition uniformly to every frame, it limits the model's ability to capture temporal logic and generate videos with consistent motion. To overcome this challenge, the researchers introduced the Cross-frame Textual Guidance Module (CTGM), a component that enhances the text control mechanism and promotes dynamic and consistent video generation.
The CTGM consists of three key submodules: the Temporal Information Injector (TII), which incorporates frame-specific information from latent features into the text conditions; the Temporal Affinity Refiner (TAR), which refines the correlation matrix between cross-frame text conditions and latent features along the temporal dimension; and the Temporal Feature Booster (TFB), which strengthens the temporal consistency of the latent features.
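The article describes CTGM only at this high level, so the sketch below is a rough, assumption-laden illustration of how per-frame text conditions could be wired around a standard cross-attention block; it is not the authors' implementation. In particular, the TAR step here smooths per-frame conditions with a temporal convolution rather than refining the attention matrix itself, which is a simplification, and all layer choices and shapes are assumptions.

```python
# Illustrative-only sketch of a CTGM-style cross-frame text guidance block.
# Shapes, layer choices, and wiring are assumptions, not FancyVideo's code.
import torch
import torch.nn as nn


class CrossFrameTextGuidance(nn.Module):
    def __init__(self, dim: int, num_frames: int, num_heads: int = 8):
        super().__init__()
        self.num_frames = num_frames
        # TII: inject frame-specific latent information into the text condition.
        self.tii_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # TAR (simplified): refine the per-frame conditions along the temporal axis.
        self.tar = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        # Cross-attention between latent features and the per-frame text conditions.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # TFB: boost temporal consistency of the attended latent features.
        self.tfb = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, latent: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # latent: (batch * frames, tokens, dim); text: (batch, text_tokens, dim)
        bf, n, d = latent.shape
        b = bf // self.num_frames
        # Broadcast the shared text condition to every frame.
        text = text.repeat_interleave(self.num_frames, dim=0)
        # TII: let the text query the per-frame latent so each frame gets its own condition.
        frame_text, _ = self.tii_attn(text, latent, latent)
        # TAR (simplified): smooth the per-frame conditions over time with a temporal conv.
        ft = frame_text.reshape(b, self.num_frames, -1, d).mean(dim=2)   # (b, frames, dim)
        ft = self.tar(ft.transpose(1, 2)).transpose(1, 2)
        frame_text = frame_text + ft.reshape(bf, 1, d)
        # Cross-attention with the now frame-specific text conditions.
        out, _ = self.cross_attn(latent, frame_text, frame_text)
        # TFB: temporal refinement of the latent tokens across frames.
        tokens = out.reshape(b, self.num_frames, n, d).permute(0, 2, 3, 1).reshape(b * n, d, self.num_frames)
        tokens = self.tfb(tokens).reshape(b, n, d, self.num_frames).permute(0, 3, 1, 2).reshape(bf, n, d)
        return out + tokens
```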
During training, the FancyVideo model follows a pipeline that begins with a text-to-image (T2I) operation to generate the first frame, followed by an image-to-video (I2V) process. This approach maintains the advantages of T2I models for higher video quality while minimizing training costs. To control motion, the model leverages motion information extracted by RAFT (Recurrent All-Pairs Field Transforms) and time embeddings, which are injected into the network during training.
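The article does not spell out how the RAFT motion signal and time embeddings are injected. As a minimal sketch under that caveat, one plausible recipe is to reduce the optical flow to a scalar motion magnitude and embed it like a diffusion timestep; only torchvision's RAFT API below is real, while motion_score, sinusoidal_embedding, and the conditioning line are hypothetical helpers introduced for illustration.

```python
# Hedged sketch: turning RAFT optical flow into a motion embedding that could be
# added to a timestep embedding. Only torchvision's RAFT is a real API here;
# the embedding scheme is an assumption for illustration.
import math
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()


@torch.no_grad()
def motion_score(frames: torch.Tensor) -> torch.Tensor:
    """Mean optical-flow magnitude across consecutive frame pairs.

    frames: (T, 3, H, W) video clip for one sample.
    """
    prev, nxt = frames[:-1], frames[1:]
    prev, nxt = preprocess(prev, nxt)
    flow = raft(prev, nxt)[-1]          # (T-1, 2, H, W), final RAFT refinement
    return flow.norm(dim=1).mean()      # scalar motion magnitude


def sinusoidal_embedding(value: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Standard sinusoidal embedding, reused here for the motion scalar."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = value.float() * freqs
    return torch.cat([angles.sin(), angles.cos()], dim=-1)


# During training, one could add the motion embedding to the usual timestep
# embedding before it conditions the UNet blocks (an assumption, not the paper's code):
# cond = timestep_embedding + sinusoidal_embedding(motion_score(clip))
```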
Quantitative and qualitative evaluations of the model showcase its exceptional performance. On the EvalCrafter Benchmark, FancyVideo outperforms other T2V models in video quality, text consistency, motion handling, and temporal coherence. Furthermore, the model achieved state-of-the-art results in the zero-shot evaluations on the UCF-101 and MSR-VTT benchmarks, scoring high on both the Inception Score (IS) for video richness and the CLIPSIM metric for text consistency.
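For readers unfamiliar with the metric, CLIPSIM is typically computed as the average CLIP text-frame cosine similarity over a generated clip. The snippet below is a rough sketch of that computation using Hugging Face's CLIP; it is not the benchmarks' official evaluation code, and the model checkpoint choice is an assumption.

```python
# Rough sketch of a CLIPSIM-style text-frame similarity score (average CLIP
# cosine similarity over frames). Model choice and scoring details are
# assumptions; benchmark implementations may differ.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clipsim(frames, prompt: str) -> float:
    """frames: list of PIL images decoded from one generated video."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()  # average cosine similarity over frames
```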
FancyVideo’s open-source release marks a significant milestone in the democratization of advanced AI technologies, making cutting-edge video generation accessible to a wider audience. The model’s code and project page are available at https://github.com/360CVGroup/FancyVideo and https://fancyvideo.github.io/, respectively, inviting researchers and developers to explore, experiment, and contribute to this evolving field.
As AI continues to reshape various industries, the introduction of FancyVideo underscores the potential for collaborative research and open-source initiatives to push the boundaries of what’s possible. With its innovative approach to video generation, the 360AI team and their partners at Sun Yat-sen University have set a new standard for the future of AI-driven content creation.
【source】https://www.jiqizhixin.com/articles/2024-08-26-11