Shanghai Jiao Tong University Open-Sources AniTalker: A Framework for Realistic Talking Head Video Generation
Shanghai, China – Researchers from Shanghai Jiao Tong University’s X-LANCE Lab and AISpeech have released AniTalker, an open-source framework for generating lifelike talking head videos. This innovative tool can transform a single static portrait into a dynamic video, complete with synchronized lip movements, natural facial expressions, and even subtle head movements, all driven by an input audio track.
AniTalker leverages self-supervised learning strategies to capture the intricate dynamics of human faces, including nuanced expressions and head movements. It employs a combination of diffusion models and variance adapters to generate diverse and controllable facial animations, achieving results comparable to those of industry giants like Alibaba’s EMO and Tencent’s AniPortrait.
Key Features of AniTalker:
- Static Portrait Animation: AniTalker can animate any single portrait image, making the depicted person appear to speak and express emotions.
- Audio Synchronization: The framework synchronizes input audio with the character’s lip movements and speech rhythm, creating a natural dialogue effect.
- Facial Dynamic Capture: Beyond lip sync, AniTalker simulates a range of complex facial expressions and subtle muscle movements.
- Diverse Animation Generation: Utilizing diffusion models, AniTalker produces diverse facial animations with random variations, adding naturalness and unpredictability to the generated content.
- Real-time Facial Animation Control: Users can guide the animation generation in real-time through control signals, including head posture, facial expressions, and eye movements.
- Speech-driven Animation Generation: The framework supports generating animations directly from speech signals, eliminating the need for additional video input.
- Long Video Continuous Generation: AniTalker can generate continuous animations for extended durations, suitable for lengthy dialogues or speeches.
Technical Details of AniTalker:
AniTalker’s core functionality relies on a multi-step process:
- Motion Representation Learning: The framework employs self-supervised learning to train a universal motion encoder that captures facial dynamics. This involves selecting source and target images from videos and learning motion information through target image reconstruction.
- Identity and Motion Decoupling: To ensure the motion representation remains free of identity-specific information, AniTalker utilizes metric learning and mutual information minimization. Metric learning helps the model distinguish the identities of different individuals, while mutual information minimization ensures the motion encoder captures motion rather than identity features (a minimal training sketch follows this list).
- Hierarchical Aggregation Layer (HAL): AniTalker introduces HAL to enhance the motion encoder’s understanding of motion variations across different scales. HAL integrates information from various stages of the image encoder through average pooling and weighted sum layers (sketched after this list).
- Motion Generation: Once the motion encoder is trained, AniTalker can generate motion representations based on user-controlled driving signals. This includes video-driven and speech-driven pipelines.
- Video-driven Pipeline: This pipeline uses a video sequence of the driving speaker to generate animation for the source image, accurately replicating the driving posture and facial expressions.
- Speech-driven Pipeline: Unlike the video-driven approach, the speech-driven method generates video based on speech signals or other control signals, synchronizing with the input audio.
- Diffusion Models and Variance Adapters: In the speech-driven method, AniTalker employs diffusion models to generate motion latent sequences and utilizes variance adapters to introduce attribute manipulation, resulting in diverse and controllable facial animations (a rough inference sketch follows this list).
- Rendering Module: Finally, an image renderer utilizes the generated motion latent sequences to render the final animated video frame by frame.
- Training and Optimization: AniTalker’s training process involves multiple loss functions, including reconstruction loss, perceptual loss, adversarial loss, mutual information loss, and identity metric learning loss, to optimize model performance (an illustrative weighted sum appears after this list).
- Control Attribute Features: AniTalker allows users to control head posture and camera parameters, such as head position and face size, to generate animations with specific attributes.
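To make the first two steps of this process more concrete, here is a minimal PyTorch-style sketch of a self-supervised reconstruction step with a mutual-information penalty. The module names (`motion_enc`, `identity_enc`, `renderer`, `club_estimator`) and the loss weight are illustrative assumptions, not AniTalker’s released code.

```python
import torch.nn.functional as F

def reconstruction_step(motion_enc, identity_enc, renderer, club_estimator,
                        src_img, tgt_img, w_mi=0.1):
    """One self-supervised step: reconstruct a target frame from the
    identity of a source frame plus the motion of the target frame.
    All four modules are hypothetical stand-ins for the paper's components."""
    identity_code = identity_enc(src_img)   # who the person is
    motion_code = motion_enc(tgt_img)       # how the face is moving

    # Reconstruct the target frame and penalize pixel-level error.
    recon = renderer(identity_code, motion_code)
    loss_recon = F.l1_loss(recon, tgt_img)

    # Mutual-information penalty (e.g. a CLUB-style upper bound) that
    # discourages the motion code from leaking identity information; an
    # identity metric-learning loss across speakers would be added similarly.
    loss_mi = club_estimator(motion_code, identity_code)

    return loss_recon + w_mi * loss_mi
```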
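The Hierarchical Aggregation Layer can be pictured roughly as below: feature maps from several encoder stages are average-pooled, projected to a common width, and mixed with learned weights. The class is a guess at that structure under those assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class HierarchicalAggregation(nn.Module):
    """Illustrative aggregation over multi-scale encoder features:
    average-pool each stage, project to a shared width, and combine
    with softmax-normalized learned weights (shapes are assumptions)."""

    def __init__(self, stage_channels, out_dim):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(c, out_dim) for c in stage_channels])
        self.stage_logits = nn.Parameter(torch.zeros(len(stage_channels)))

    def forward(self, stage_feats):  # list of (B, C_i, H_i, W_i) tensors
        pooled = [f.mean(dim=(2, 3)) for f in stage_feats]                   # average pooling
        projected = torch.stack([p(x) for p, x in zip(self.proj, pooled)])   # (S, B, D)
        weights = torch.softmax(self.stage_logits, dim=0)                    # (S,)
        return (weights[:, None, None] * projected).sum(dim=0)               # weighted sum -> (B, D)
```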
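For the speech-driven pipeline, the diffusion model, variance adapter, and renderer fit together roughly as in the following sketch. `diffusion.denoise`, `variance_adapter`, and `renderer` are hypothetical interfaces used only to show the data flow from audio features to rendered frames; the real framework’s API may differ.

```python
import torch

@torch.no_grad()
def speech_driven_animation(audio_feats, pose_control, diffusion, variance_adapter,
                            renderer, src_identity, num_frames, latent_dim, steps=50):
    """Hypothetical inference loop: denoise a motion-latent sequence
    conditioned on speech, apply user controls, then render each frame."""
    # Start from Gaussian noise over the whole motion-latent sequence.
    motion_latents = torch.randn(num_frames, latent_dim)

    # Iteratively denoise, conditioning on the speech features.
    for t in reversed(range(steps)):
        motion_latents = diffusion.denoise(motion_latents, t, cond=audio_feats)

    # Variance adapter: overlay user-controlled attributes such as
    # head pose or face scale onto the generated motion sequence.
    motion_latents = variance_adapter(motion_latents, pose_control)

    # Render the final video frame by frame from identity + motion.
    frames = [renderer(src_identity, z) for z in motion_latents]
    return torch.stack(frames)
```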
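Finally, the training objective listed above amounts to a weighted sum of the individual losses; the weights below are placeholders, since the article does not specify them.

```python
def total_loss(l_recon, l_perceptual, l_adversarial, l_mi, l_id_metric,
               w_perc=1.0, w_adv=0.1, w_mi=0.1, w_id=1.0):
    """Combine the losses named above; all weights are illustrative."""
    return (l_recon
            + w_perc * l_perceptual
            + w_adv * l_adversarial
            + w_mi * l_mi
            + w_id * l_id_metric)
```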
Applications of AniTalker:
AniTalker holds vast potential across various applications:
- Virtual Assistants and Customer Service: AniTalker can create realistic avatars for virtual assistants and customer service interactions, enhancing user engagement and personalization.
- Education and Training: The framework can generate animated characters for educational videos and simulations, making learning more engaging and interactive.
- Entertainment and Gaming: AniTalker can be used to create realistic characters for video games, movies, and other forms of entertainment.
- Social Media and Communication: The framework can facilitate more expressive and engaging communication on social media platforms.
Availability and Future Directions:
AniTalker is open-sourced on GitHub, allowing developers and researchers to access and contribute to the project. The researchers are actively working on improving the framework’s performance and adding new features, such as the ability to generate more diverse and realistic facial expressions and movements.
The release of AniTalker marks a significant advancement in the field of talking head video generation. Its open-source nature fosters collaboration and innovation, paving the way for more realistic and expressive virtual characters in various applications. As the technology continues to evolve, we can expect to see even more captivating and lifelike talking head videos generated by AniTalker and similar frameworks.
Source: https://ai-bot.cn/anitalker/