A novel approach from AutoNavi (Gaode Map) researchers allows for seamless transfer of visual model weights to downstream tasks, significantly accelerating the convergence of diffusion models like SiT by nearly 47x.

[Beijing, March 17, 2025] – The intersection of diffusion models and representation learning is rapidly becoming a fertile ground for innovation in artificial intelligence. Recent studies have highlighted the synergistic relationship between the two domains: intermediate representations from diffusion models can be leveraged for downstream vision tasks, while visual model representations can improve the convergence speed and generation quality of diffusion models. However, a significant hurdle remains: transferring pre-trained weights from visual models to diffusion models has been difficult because of the mismatch between the pixel-space inputs of visual models and the Variational Autoencoder (VAE) latent spaces in which many diffusion models operate.

Now, researchers at AutoNavi (Gaode Map), a leading Chinese digital mapping, navigation and location-based services provider, have introduced a groundbreaking solution: Unified Self-Supervised Pretraining (USP). This innovative method addresses the aforementioned challenges by performing Masked Latent Modeling within the latent space of a VAE. The resulting pre-trained weights, particularly from ViT (Vision Transformer) encoders, can be seamlessly transferred to various downstream tasks, including image classification, semantic segmentation, and, crucially, image generation based on diffusion models.
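For readers who want a concrete picture, the following is a minimal PyTorch sketch of Masked Latent Modeling as described above: latent patches produced by a VAE encoder are randomly masked and a ViT-style encoder is trained to reconstruct them. The module layout, patch size, masking ratio, and other hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch of Masked Latent Modeling, assuming an MAE-style setup:
# latent patches from a (frozen) VAE encoder are randomly masked and a
# ViT-style encoder reconstructs them. All sizes and names are illustrative.
import torch
import torch.nn as nn


class MaskedLatentModel(nn.Module):
    def __init__(self, latent_channels=4, patch=2, num_patches=256,
                 dim=768, depth=12, mask_ratio=0.75):
        super().__init__()
        self.patch = patch
        self.mask_ratio = mask_ratio
        patch_dim = latent_channels * patch * patch
        self.to_tokens = nn.Linear(patch_dim, dim)                  # embed latent patches
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))   # learned positional embedding
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), depth)
        self.decoder = nn.Linear(dim, patch_dim)                    # lightweight reconstruction head

    def forward(self, latents):
        # latents: (B, C, H, W) produced by a frozen VAE encoder
        B, C, H, W = latents.shape
        p = self.patch
        # patchify the latent grid into (B, N, C*p*p)
        x = latents.unfold(2, p, p).unfold(3, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        N = x.shape[1]
        tokens = self.to_tokens(x) + self.pos[:, :N]

        # keep a random subset of tokens, hide the rest
        n_keep = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)
        keep = perm[:, :n_keep]
        visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))

        encoded = self.encoder(visible)

        # scatter encoded tokens back; masked slots get the mask token
        full = self.mask_token.expand(B, N, -1).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, full.shape[-1]), encoded)
        pred = self.decoder(full)                                   # predicted latent patches

        # reconstruction loss on masked positions only
        mask = torch.ones(B, N, device=tokens.device)
        mask.scatter_(1, keep, 0.0)
        loss = ((pred - x) ** 2).mean(-1)
        return (loss * mask).sum() / mask.sum()
```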

The research paper, titled USP: Unified Self-Supervised Pretraining for Image Generation and Understanding, is available on arXiv: https://arxiv.org/pdf/2503.06132. The code for USP is also publicly accessible on GitHub: https://github.com/cxxgtxy/USP.

The core of USP lies in its ability to create a unified representation space that is compatible with both vision understanding and image generation tasks. By pretraining within the VAE’s latent space, the model learns to encode and decode images in a way that preserves crucial visual information while also being amenable to the probabilistic nature of diffusion models.
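As a rough illustration of what pretraining "within the VAE's latent space" might look like in practice, here is a hypothetical training step that encodes a batch with a frozen Stable-Diffusion-style VAE (via the diffusers library) and passes the latents to the model sketched above. The checkpoint name, scaling factor, and optimizer settings are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical latent-space pretraining step; the VAE checkpoint and the
# 0.18215 scaling factor are common Stable Diffusion defaults, used here
# only as stand-ins for whatever the authors actually employ.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()
model = MaskedLatentModel()                       # from the sketch above
opt = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

images = torch.randn(8, 3, 256, 256)              # stand-in for a real image batch
with torch.no_grad():
    # map images into the VAE latent space (B, 4, 32, 32 for 256x256 inputs)
    latents = vae.encode(images).latent_dist.sample() * 0.18215

opt.zero_grad()
loss = model(latents)                             # masked latent reconstruction loss
loss.backward()
opt.step()
```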

The benefits of USP are two-fold:

  • Competitive Performance on Understanding Tasks: USP achieves competitive results on standard image understanding benchmarks, demonstrating its effectiveness as a general-purpose visual representation learning method.
  • Significant Acceleration of Diffusion Models: The most striking result is the substantial acceleration of diffusion models such as DiT (Diffusion Transformer) and SiT (Scalable Interpolant Transformer) when initialized with USP-pretrained weights; a sketch of such a weight transfer follows this list. The reported convergence speedup of nearly 47x for SiT could drastically reduce the training time and computational resources required for high-quality image generation.
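To make the idea of "seamless transfer" concrete, the sketch below shows one plausible way to initialize a DiT- or SiT-style backbone from a pretrained encoder checkpoint by copying every tensor whose name and shape match. The checkpoint path and the key-matching logic are hypothetical and are not the authors' exact transfer recipe.

```python
# Hypothetical weight transfer: copy pretrained tensors into a diffusion
# backbone wherever parameter names and shapes line up. The checkpoint
# path is an assumption; real checkpoints may nest the state dict.
import torch

def transfer_pretrained_weights(checkpoint_path: str, target: torch.nn.Module) -> None:
    pretrained = torch.load(checkpoint_path, map_location="cpu")
    target_state = target.state_dict()

    transferred, skipped = {}, []
    for name, weight in pretrained.items():
        if name in target_state and target_state[name].shape == weight.shape:
            transferred[name] = weight      # e.g. shared transformer block weights
        else:
            skipped.append(name)            # e.g. patch embeddings or heads that differ

    target_state.update(transferred)
    target.load_state_dict(target_state)
    print(f"transferred {len(transferred)} tensors, skipped {len(skipped)}")
```

Because the encoder is pretrained directly on VAE latents, its input channels already match those of a latent-space diffusion backbone, which is exactly the input mismatch that previously made such transfers difficult.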

This breakthrough has significant implications for the future of AI-powered image generation. By simplifying the transfer learning process and accelerating training, USP opens the door to more efficient and accessible development of advanced generative models. The ability to leverage pre-trained visual models for diffusion tasks promises to unlock new levels of realism, control, and creativity in image synthesis. Further research will likely focus on exploring the limits of USP’s applicability to different diffusion architectures and investigating its potential for other generative tasks beyond image generation.

References:

  • Chu, X., et al. (2025). USP: Unified Self-Supervised Pretraining for Image Generation and Understanding. arXiv:2503.06132.

