Latest News

Alibaba and East China Normal University Collaborate on AI Video Length Extension Technology: ExVideo

Shanghai, China – Alibaba and East China Normal University (ECNU) have jointly developed ExVideo, a novel AI technology for extending and optimizing video length. Unveiled recently, it generates longer, more frame-rich videos by extending the temporal scale of existing video synthesis models.

ExVideo is built on the Stable Video Diffusion model. The research team trained an extension model capable of generating coherent videos up to 128 frames long while preserving the original model’s generative capabilities. The key to ExVideo’s success lies in its optimized temporal modules, including 3D convolutions, temporal attention, and positional embeddings. These enhancements let the model handle longer time spans, significantly increasing the number of video frames without compromising the original model’s generative quality. Notably, ExVideo achieves this with minimal training cost, making it particularly suitable for scenarios with limited computational resources.

Key Features of ExVideo:

  • Temporal Scale Extension: ExVideo’s core capability is extending the temporal scale of video synthesis models, enabling them to process and generate longer video sequences than they were originally designed for. The additional frames make it possible to tell more complete stories or showcase dynamic scenes over extended durations.

  • Post-Tuning Strategy: ExVideo employs a post-tuning strategy, a crucial aspect of its design. By retraining specific parts of models such as Stable Video Diffusion, ExVideo enables them to generate longer videos, reaching 128 frames or more, while maintaining the models’ ability to generalize across diverse inputs.

  • Parameter Efficiency: Unlike conventional training, ExVideo’s post-tuning approach avoids training a completely new model from scratch. Instead, it optimizes existing models, significantly reducing the required parameter count and computational resources and making model extension more efficient and practical.

  • Preservation of Generative Capabilities: While extending video length, ExVideo prioritizes maintaining video quality. The generated videos are not only longer but also meet high standards of visual coherence, clarity, and overall quality.

  • Compatibility and Generality: ExVideo’s design prioritizes compatibility with various video synthesis models, allowing for its broad application across diverse video generation tasks. Whether dealing with 3D convolutions, temporal attention, or positional embedding, ExVideo provides corresponding extension strategies to adapt to different model architectures.
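The post-tuning idea described above can be sketched in PyTorch: freeze a pretrained backbone, then re-enable gradients only for the temporal components. The toy architecture and module names below are illustrative assumptions for this sketch, not ExVideo’s or Stable Video Diffusion’s actual code.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained video backbone. The layer names
# (spatial, temporal_conv, temporal_attn, pos_embed) are illustrative
# assumptions, not the real module names.
class ToyVideoBackbone(nn.Module):
    def __init__(self, dim: int = 32, max_frames: int = 128):
        super().__init__()
        self.spatial = nn.Conv2d(dim, dim, 3, padding=1)
        self.temporal_conv = nn.Conv3d(dim, dim, (3, 1, 1), padding=(1, 0, 0))
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.pos_embed = nn.Parameter(torch.zeros(max_frames, dim))

model = ToyVideoBackbone()

# Post-tuning: freeze everything, then re-enable gradients only for the
# temporal attention and positional embedding. Per the article, the 3D
# convolutions are also kept frozen, since they adapt to different time
# scales without extra fine-tuning.
for p in model.parameters():
    p.requires_grad = False
for name, p in model.named_parameters():
    if name.startswith(("temporal_attn", "pos_embed")):
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total}")
```

Only a small fraction of the parameters receives gradient updates, which is the source of the parameter efficiency the article describes.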

Availability and Resources:

ExVideo is readily accessible through various platforms:

  • Official Project Homepage: https://ecnu-cilab.github.io/ExVideoProjectPage/
  • GitHub Code Repository: https://github.com/modelscope/DiffSynth-Studio
  • Hugging Face Model Download: https://huggingface.co/ECNU-CILab/ExVideo-SVD-128f-v1
  • ModelScope Model Download: https://www.modelscope.cn/models/ECNU-CILab/ExVideo-SVD-128f-v1/summary
  • arXiv Technical Paper: https://arxiv.org/abs/2406.14130

Technical Principles:

ExVideo’s success hinges on two key principles, parameter post-tuning and temporal module extension, with the latter covering three temporal components:

  • Parameter Post-Tuning: ExVideo employs a parameter post-tuning approach to refine existing video synthesis models. This involves retraining specific model components rather than the entire model, enhancing efficiency.

  • Temporal Module Extension: ExVideo introduces extension strategies for temporal modules within video synthesis models. These strategies optimize 3D convolutional layers, temporal attention mechanisms, and positional embedding layers to accommodate longer video sequences.

  • 3D Convolutional Layers: 3D convolutional layers capture features along the temporal dimension in video synthesis. ExVideo retains the original model’s 3D convolutional layers as they adapt to different time scales without requiring additional fine-tuning.

  • Temporal Attention Mechanisms: To improve the model’s ability to handle long sequences, ExVideo fine-tunes the temporal attention module. This helps the model better understand the temporal coherence of video content.

  • Positional Embedding: Traditional video synthesis models may use static or trainable positional embeddings. ExVideo incorporates dynamic positional embeddings that adjust to the video’s length, enhancing the model’s ability to represent temporal information effectively.
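To make the temporal-module ideas concrete, here is a minimal sketch of frame-axis attention combined with positional embeddings computed for whatever frame count is requested, so the same code serves short and 128-frame clips alike. This is an illustrative reconstruction under stated assumptions, not ExVideo’s actual implementation; all names and shapes are hypothetical.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(num_frames: int, dim: int) -> torch.Tensor:
    """Sinusoidal position embedding generated on the fly for any
    sequence length — a simple stand-in for a length-aware embedding."""
    pos = torch.arange(num_frames, dtype=torch.float32).unsqueeze(1)
    freqs = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    emb = torch.zeros(num_frames, dim)
    emb[:, 0::2] = torch.sin(pos * freqs)
    emb[:, 1::2] = torch.cos(pos * freqs)
    return emb

class TemporalAttention(nn.Module):
    """Attention along the frame axis: each spatial location attends
    over time, which is what gives the model temporal coherence."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height, width, dim)
        b, f, h, w, d = x.shape
        pos = sinusoidal_embedding(f, d).to(x.dtype)
        x = x + pos.view(1, f, 1, 1, d)  # length-aware positions
        # Fold spatial locations into the batch; attend across frames.
        x = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, f, d)
        out, _ = self.attn(x, x, x)
        return out.reshape(b, h, w, f, d).permute(0, 3, 1, 2, 4)

video = torch.randn(1, 128, 4, 4, 32)  # 128 frames, 4x4 latent grid, 32 channels
out = TemporalAttention(32)(video)
print(out.shape)  # torch.Size([1, 128, 4, 4, 32])
```

Because the embedding is computed from the requested frame count rather than stored as a fixed-size table, extending the sequence length does not require retraining it from scratch.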

Impact and Future Directions:

ExVideo’s development holds significant implications for the field of AI video generation. Its ability to extend video length while maintaining quality opens up new possibilities for creating more engaging and informative video content. This technology can be leveraged in various applications, including:

  • Film and Television Production: ExVideo can facilitate the creation of longer and more detailed scenes, enhancing storytelling and visual effects.
  • Educational Content: ExVideo can be used to generate longer and more comprehensive educational videos, making learning more engaging and effective.
  • Marketing and Advertising: ExVideo can create longer and more captivating video advertisements, enhancing brand messaging and audience engagement.
  • Virtual Reality and Augmented Reality: ExVideo can generate longer and more immersive VR and AR experiences, enhancing user engagement and immersion.

As research continues, the team behind ExVideo aims to further enhance the technology’s capabilities, including:

  • Improved Temporal Coherence: The team is working on improving the temporal coherence of generated videos, ensuring smooth transitions and realistic motion.
  • Enhanced Generative Capabilities: The team is exploring ways to further enhance the generative capabilities of ExVideo, allowing for the creation of even more diverse and realistic videos.
  • Wider Model Compatibility: The team is striving to make ExVideo compatible with a wider range of video synthesis models, expanding its applicability across different domains.

ExVideo represents a significant advancement in AI video generation technology. Its ability to extend video length while preserving quality and efficiency opens up exciting possibilities for creating more engaging and informative video content across various applications. As the technology continues to evolve, we can expect to see even more innovative and impactful applications of ExVideo in the future.

[Source] https://ai-bot.cn/exvideo-model/
