
**Headline:** InternVL Vision Large Model Upgraded, Achieving New Breakthroughs in Vision-Language Alignment

**Keywords:** Vision Large Model, Open Source, Fine-Grained Alignment

**News Content:**

Shanghai AI Lab Releases New-Generation InternVL Vision Large Model with Leading Open-Source Results on Core Visual Tasks

Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab), in collaboration with Tsinghua University, The Chinese University of Hong Kong, SenseTime, and other institutions, has open-sourced the new-generation InternVL vision large model.

The new-generation InternVL foundation model has a vision encoder with 6 billion parameters (InternViT-6B). It is the first to propose a contrastive-generative fusion progressive alignment technique, achieving fine-grained alignment between vision large models and language large models on Internet-scale data.

The model has achieved leading open-source performance on core visual tasks, reaching 87.8% top-1 accuracy on the ImageNet image classification task, 56.5% box AP on the COCO object detection task, and 57.5% mIoU on the ADE20K semantic segmentation task.

In addition, the InternVL model supports various downstream tasks such as image generation, image editing, and video understanding, demonstrating strong visual understanding and generation capabilities.

Shanghai AI Lab stated that open-sourcing the InternVL model will promote the research and application of vision large models and advance the field of artificial intelligence.

**Contrastive-Generative Fusion Progressive Alignment Technique**

The new-generation InternVL model is the first to propose a contrastive-generative fusion progressive alignment technique, achieving fine-grained alignment between vision large models and language large models on Internet-scale data.

This technique combines contrastive learning and generative alignment, gradually narrowing the semantic gap between vision large models and language large models through a progressive alignment process.
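The article does not include training details, but the combination it describes (a CLIP-style contrastive objective fused with a generative captioning objective) can be sketched in a few lines of PyTorch. Everything below is an illustrative assumption rather than InternVL's actual implementation; in particular, the `alpha` weighting and the stage-wise annealing mentioned in the comments are hypothetical:

```python
import torch
import torch.nn.functional as F

def alignment_loss(image_emb, text_emb, caption_logits, caption_ids,
                   temperature=0.07, alpha=0.5):
    """Sketch of a fused contrastive + generative alignment objective.

    image_emb:      (B, D) pooled image embeddings
    text_emb:       (B, D) pooled text embeddings
    caption_logits: (B, T, V) decoder logits for the paired captions
    caption_ids:    (B, T) ground-truth caption token ids
    """
    # Contrastive term (InfoNCE): matched image-text pairs lie on the
    # diagonal of the similarity matrix; both directions are averaged.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sims = image_emb @ text_emb.t() / temperature
    targets = torch.arange(sims.size(0), device=sims.device)
    contrastive = (F.cross_entropy(sims, targets) +
                   F.cross_entropy(sims.t(), targets)) / 2

    # Generative term: next-token prediction of the caption,
    # conditioned on the image upstream of these logits.
    generative = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_ids.reshape(-1))

    # A "progressive" schedule could anneal alpha across training
    # stages, e.g. contrastive-heavy early, generative-heavy later.
    return alpha * contrastive + (1 - alpha) * generative
```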

**Leading Open-Source Performance on Core Visual Tasks**

InternVL has achieved leading open-source performance on core visual tasks.

On the ImageNet image classification task, the model reaches 87.8% top-1 accuracy, surpassing previously open-sourced vision large models.

On the COCO object detection task, the model achieves 56.5% box AP, and on the ADE20K semantic segmentation task, it reaches 57.5% mIoU, both ranking among the top open-source vision large models.
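The article does not state the evaluation protocols behind these numbers, but one common way a contrastively aligned vision-language model is benchmarked on classification is zero-shot prompting: score an image against one text prompt per class. A minimal sketch, where `encode_image` and `encode_text` are assumed method names rather than InternVL's documented API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, image, class_names):
    """CLIP-style zero-shot classification: embed the image and one
    prompt per class, then pick the class with the highest similarity.
    `model.encode_image` / `model.encode_text` are assumed names."""
    prompts = [f"a photo of a {name}" for name in class_names]
    img = F.normalize(model.encode_image(image), dim=-1)   # (1, D)
    txt = F.normalize(model.encode_text(prompts), dim=-1)  # (C, D)
    scores = (img @ txt.t()).squeeze(0)                    # (C,)
    return class_names[scores.argmax().item()]
```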

**Downstream Task Applications**

The InternVL model also supports various downstream tasks such as image generation, image editing, and video understanding, demonstrating strong visual understanding and generation capabilities.

For example, in image generation tasks, the model can generate realistic images based on text descriptions; in image editing tasks, the model can perform functions such as image style transfer and image super-resolution; and in video understanding tasks, the model can perform video classification and video action recognition.

**Open Source Promotes Development**

Shanghai AI Lab stated that open-sourcing the InternVL model will promote the research and application of vision large models and advance the field of artificial intelligence.

Open-sourcing the model will allow researchers and developers to more easily access and use vision large models, thereby accelerating the innovation and application of artificial intelligence technologies.
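As a concrete picture of what that access looks like: InternVL's code is released under the OpenGVLab organization on GitHub, and its checkpoints are distributed through Hugging Face. Assuming the 6B vision encoder is published under an id such as `OpenGVLab/InternViT-6B-224px` (check the official repository for the exact released ids), loading it would follow the standard `transformers` remote-code pattern:

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# The model id is an assumption; see https://github.com/OpenGVLab/InternVL
# for the officially released checkpoints.
MODEL_ID = "OpenGVLab/InternViT-6B-224px"

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # 6B parameters; needs a large-memory GPU
    trust_remote_code=True,      # the model class ships with the checkpoint
).eval()

processor = CLIPImageProcessor.from_pretrained(MODEL_ID)

# Encode one image; using `pooler_output` as the pooled feature is an
# assumption about the remote-code model class.
image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
features = model(pixel_values.to(torch.bfloat16)).pooler_output
```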

[Source] https://mp.weixin.qq.com/s/bdfAJRqOF9tUk8Vy9KC_XQ
