China’s Zhipu AI Open-Sources Powerful Text-to-Image Model

By 智能小编 (AI Editor)

November 18, 2024 #cogview3, #每日AI快讯 (Daily AI Briefing)

Zhipu AI Open-Sources High-Performance Text-to-Image Model: CogView3-Plus-3B

A leap forward in open-source AI image generation.

Zhipu AI, a Chinese artificial intelligence company, has open-sourced its text-to-image generation model CogView3-Plus-3B under the permissive Apache 2.0 license. The release gives researchers and developers free access to a model that, according to its developers, rivals state-of-the-art commercial systems in both image quality and efficiency.

CogView3-Plus-3B builds on its predecessor, CogView3, which already outperformed other open-source models. CogView3 is a cascaded diffusion model that generates images in three stages: it first creates a 512×512 low-resolution image, then upscales it to 1024×1024, and finally to 2048×2048. In human evaluations, CogView3 outperformed SDXL, at the time the leading open-source text-to-image diffusion model, by 77.0% while requiring only about half of SDXL's inference time; a distilled variant of CogView3 achieves comparable performance in roughly one-tenth of SDXL's inference time. (See the CogView3 paper, arXiv:2403.05121, for the detailed methodology.)
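The three-stage cascade can be expressed as simple resolution arithmetic. The sketch below assumes square images and a fixed 2× upscaling per side at each stage, consistent with the 512 → 1024 → 2048 progression; the function name `cascade_plan` is hypothetical and does not come from the CogView3 code base, and no images are actually generated.

```python
# Arithmetic sketch of a three-stage cascade like CogView3's
# (512x512 -> 1024x1024 -> 2048x2048). No images are generated here.


def cascade_plan(base: int = 512, num_stages: int = 3) -> list[tuple[int, int]]:
    """Return (side_length, pixel_count) for each stage of a square cascade.

    Assumes every super-resolution stage doubles both width and height.
    """
    plan = []
    side = base
    for _ in range(num_stages):
        plan.append((side, side * side))
        side *= 2  # assumed 2x upscaling per side at each stage
    return plan


if __name__ == "__main__":
    stages = cascade_plan()
    for stage, (side, pixels) in enumerate(stages, start=1):
        print(f"stage {stage}: {side}x{side} = {pixels:,} pixels")
    # The final image has (2048 / 512) ** 2 = 16 times as many pixels as the base.
    print("final / base pixel ratio:", stages[-1][1] // stages[0][1])
```

Each stage quadruples the pixel count, so the last stage works on 16 times as many pixels as the first. Settling the composition at low resolution and refining it afterwards is the usual motivation for cascading, since the expensive high-resolution stages mainly need to add detail.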

CogView3-Plus-3B extends CogView3 by adopting the Diffusion Transformer (DiT) framework, combined with Zero-SNR diffusion noise scheduling and a text-image joint attention mechanism, which together improve image quality and flexibility. Compared with the commonly used MMDiT architecture, this design keeps both training and inference efficient. The model's Variational Autoencoder (VAE) uses a latent dimension of 16. In addition, mixed-resolution training allows CogView3-Plus-3B to generate images at resolutions anywhere from 512 to 2048 pixels.
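Zero-SNR noise scheduling means that at the final diffusion timestep the noisy input retains no trace of the original image (a signal-to-noise ratio of exactly zero), which matches inference, where sampling starts from pure noise. The following is a minimal sketch of one common way to build such a schedule: it applies the rescaling described by Lin et al. (2023) to a standard linear beta schedule. The starting schedule, the step count, and the helper names are illustrative assumptions; the source does not say which schedule CogView3-Plus-3B rescales or how.

```python
import math


def linear_betas(num_steps: int = 1000, start: float = 1e-4, end: float = 0.02) -> list[float]:
    """DDPM-style linear beta schedule, used here only as a starting point."""
    step = (end - start) / (num_steps - 1)
    return [start + i * step for i in range(num_steps)]


def cumulative_alpha_bars(betas: list[float]) -> list[float]:
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def rescale_zero_terminal_snr(betas: list[float]) -> list[float]:
    """Rescale a beta schedule so the last timestep has zero SNR.

    Shift sqrt(alpha_bar) so its final value is exactly 0, then scale it
    so its first value is unchanged (the rescaling of Lin et al., 2023).
    """
    sqrt_ab = [math.sqrt(a) for a in cumulative_alpha_bars(betas)]
    first, last = sqrt_ab[0], sqrt_ab[-1]
    sqrt_ab = [(s - last) * first / (first - last) for s in sqrt_ab]
    alpha_bars = [s * s for s in sqrt_ab]

    # Convert the new cumulative products back into per-step betas.
    new_betas = [1.0 - alpha_bars[0]]
    for prev, cur in zip(alpha_bars, alpha_bars[1:]):
        new_betas.append(1.0 - cur / prev)
    return new_betas


def snr(alpha_bar: float) -> float:
    """Signal-to-noise ratio of a timestep: alpha_bar / (1 - alpha_bar)."""
    return alpha_bar / (1.0 - alpha_bar)


if __name__ == "__main__":
    original = linear_betas()
    rescaled = rescale_zero_terminal_snr(original)
    print("final SNR, original:", snr(cumulative_alpha_bars(original)[-1]))
    print("final SNR, rescaled:", snr(cumulative_alpha_bars(rescaled)[-1]))
```

With the plain linear schedule the final timestep still keeps a small amount of signal (positive SNR), whereas the rescaled schedule ends at an SNR of exactly 0 while leaving the first timestep essentially unchanged.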

According to the developers' benchmark tests, CogView3-Plus-3B performs on par with leading commercial text-to-image models. Because the weights are openly available, image generation at this level is no longer confined to proprietary services, and the open-source release lets the wider AI community study, refine, and build on the model.

The release of CogView3-Plus-3B reflects Zhipu AI's commitment to open research. Combined with easy access and a permissive license, the model's performance could benefit applications ranging from creative content generation to scientific visualization, and may help drive further advances in AI image synthesis.

References:

* Zhipu AI. CogView3-Plus-3B model release (official announcement).
* CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion. arXiv:2403.05121. https://arxiv.org/abs/2403.05121