A novel approach from AutoNavi (Gaode Map) researchers allows for seamless transfer of visual model weights to downstream tasks, significantly accelerating the convergence of diffusion models like SiT by nearly 47x.
[Beijing, March 17, 2025] – The intersection of diffusion models and representation learning is rapidly becoming a fertile ground for innovation in artificial intelligence. Recent studies have highlighted the synergistic relationship between these two domains: intermediate representations from diffusion models can be leveraged for downstream vision tasks, while visual model representations can enhance the convergence speed and generation quality of diffusion models. However, a significant hurdle remains: transferring pre-trained weights from visual models to diffusion models is difficult because their inputs do not match; visual models are typically pretrained on raw pixels, whereas latent diffusion models operate in a Variational Autoencoder (VAE) latent space.
Now, researchers at AutoNavi (Gaode Map), a leading Chinese digital mapping, navigation and location-based services provider, have introduced a groundbreaking solution: Unified Self-Supervised Pretraining (USP). This innovative method addresses the aforementioned challenges by performing Masked Latent Modeling within the latent space of a VAE. The resulting pre-trained weights, particularly from ViT (Vision Transformer) encoders, can be seamlessly transferred to various downstream tasks, including image classification, semantic segmentation, and, crucially, image generation based on diffusion models.
The research paper, titled USP: Unified Self-Supervised Pretraining for Image Generation and Understanding, is available on arXiv: https://arxiv.org/pdf/2503.06132. The code for USP is also publicly accessible on GitHub: https://github.com/cxxgtxy/USP.
The core of USP lies in its ability to create a unified representation space that is compatible with both vision understanding and image generation tasks. By pretraining within the VAE’s latent space, the model learns to encode and decode images in a way that preserves crucial visual information while also being amenable to the probabilistic nature of diffusion models.
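To make the idea more concrete, the following is a minimal sketch of masked latent modeling inside a frozen VAE's latent space, in the spirit of MAE-style masked pretraining. The function and argument names (`masked_latent_modeling_step`, `vae`, `vit_encoder`, `light_decoder`) and the shapes are illustrative assumptions for this article, not the authors' released implementation, which is available in the linked repository.

```python
# Hedged sketch: masked latent modeling in a frozen VAE latent space.
# All names, shapes, and the decoder interface are assumptions, not the USP repo's API.
import torch

def masked_latent_modeling_step(vae, vit_encoder, light_decoder, images,
                                patch=2, mask_ratio=0.75):
    """One pretraining step: predict masked patches of the frozen VAE's latents."""
    with torch.no_grad():                      # the VAE stays frozen during pretraining
        z = vae.encode(images)                 # latent grid, e.g. (B, 4, H/8, W/8)
    B, C, h, w = z.shape

    # Patchify the latent grid into N tokens of dimension C * patch * patch.
    tokens = (z.unfold(2, patch, patch).unfold(3, patch, patch)
                .permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch))
    N, D = tokens.shape[1], tokens.shape[2]

    # Randomly split tokens into a small visible set and a large masked set.
    keep = int(N * (1 - mask_ratio))
    perm = torch.rand(B, N, device=z.device).argsort(dim=1)
    vis_idx, mask_idx = perm[:, :keep], perm[:, keep:]
    gather = lambda t, idx: torch.gather(t, 1, idx.unsqueeze(-1).expand(-1, -1, D))

    # Only visible latent patches reach the ViT encoder; a lightweight decoder
    # (hypothetical signature) predicts the full token sequence.
    visible = gather(tokens, vis_idx)
    pred = light_decoder(vit_encoder(visible), mask_idx, N)

    # Reconstruction loss on the masked latent patches only.
    return ((gather(pred, mask_idx) - gather(tokens, mask_idx)) ** 2).mean()
```

Because both the pretraining objective and later diffusion training live in the same VAE latent space, the ViT encoder's weights remain directly reusable downstream.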
The benefits of USP are two-fold:
- Competitive Performance on Understanding Tasks: USP achieves competitive results on standard image understanding benchmarks, demonstrating its effectiveness as a general-purpose visual representation learning method.
- Significant Acceleration of Diffusion Models: The most striking result is the substantial acceleration of diffusion models such as DiT (Diffusion Transformer) and SiT (Scalable Interpolant Transformer) when they are initialized from USP-pretrained weights. The reported convergence speedup of nearly 47x for SiT could drastically reduce the training time and computational resources required for high-quality image generation; a sketch of this weight-transfer idea follows the list.
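The sketch below shows how such a transfer could look in practice: copy the shape-compatible transformer-block parameters from a USP-pretrained ViT encoder into a DiT/SiT model before diffusion training, leaving diffusion-specific layers at their fresh initialization. The checkpoint path, key layout, and function name are assumptions for illustration, not the repository's loading code.

```python
# Hedged sketch of the weight-transfer step; `usp_checkpoint.pt` and the key
# layout are hypothetical, not the USP repo's actual checkpoint format.
import torch

def init_diffusion_model_from_usp(dit_model, ckpt_path="usp_checkpoint.pt"):
    pretrained = torch.load(ckpt_path, map_location="cpu")    # assumed: a ViT state_dict
    dit_state = dit_model.state_dict()
    # Keep only tensors whose names and shapes match the diffusion transformer.
    transferred = {k: v for k, v in pretrained.items()
                   if k in dit_state and v.shape == dit_state[k].shape}
    # Layers unique to the diffusion model (timestep/label embedders, final layer)
    # keep their random initialization; only the shared ViT trunk is copied over.
    result = dit_model.load_state_dict(transferred, strict=False)
    print(f"transferred {len(transferred)} tensors; "
          f"{len(result.missing_keys)} left at random init")
    return dit_model
```

The design point is that no architectural surgery is needed: because the pretrained encoder already operates on VAE latents, its weights drop into the diffusion transformer's trunk directly.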
This breakthrough has significant implications for the future of AI-powered image generation. By simplifying the transfer learning process and accelerating training, USP opens the door to more efficient and accessible development of advanced generative models. The ability to leverage pre-trained visual models for diffusion tasks promises to unlock new levels of realism, control, and creativity in image synthesis. Further research will likely focus on exploring the limits of USP’s applicability to different diffusion architectures and investigating its potential for other generative tasks beyond image generation.
References:
- Chu, X., et al. (2025). USP: Unified Self-Supervised Pretraining for Image Generation and Understanding. arXiv:2503.06132.