## Tencent’s HunYuan Multimodal Understanding: Unveiling the First Domestically Developed MoE Multimodal Large Model
**Machine Intelligence Report**
In recent years, large language models like GPT have demonstrated powerful understanding and reasoning capabilities in the digital cognitive space, hinting at the dawn of general artificial intelligence. However, to propel general AI toward exploring the physical world, the first step is to solve visual understanding, that is, to build large multimodal understanding models.
Multimodal understanding allows AI to acquire and process information through multiple senses, just like humans, enabling it to understand and interact with the world more comprehensively. Breakthroughs in this field will let AI make greater strides in areas such as robotics and autonomous driving, truly bridging the gap between the digital and physical worlds.
However, compared with large language models, multimodal understanding models have developed relatively slowly, especially in the Chinese language domain. Moreover, the industry has not yet reached a consensus on the architecture and training methods for multimodal models.
Recently, Tencent HunYuan launched a large multimodal understanding model based on the MoE architecture. The model introduces innovations and deep optimizations in architecture, training methodology, and data processing, significantly improving performance and enabling it to understand images of arbitrary aspect ratio at resolutions of up to 7K.
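To make the native-resolution claim concrete, the short sketch below estimates how many ViT-style patch tokens an image of arbitrary size would produce if tokenized directly rather than resized to a fixed square. The 14-pixel patch size, the ceil-padding, and the `patch_grid` helper are all illustrative assumptions; the article does not describe HunYuan's actual image tokenizer.

```python
import math

def patch_grid(width: int, height: int, patch: int = 14) -> int:
    """Count ViT-style patch tokens for a native-resolution image.

    Hypothetical illustration: the 14-px patch size and ceil-padding
    are assumptions, not HunYuan's documented tokenizer.
    """
    cols = math.ceil(width / patch)   # pad the right edge to a whole patch
    rows = math.ceil(height / patch)  # pad the bottom edge to a whole patch
    return rows * cols

# Token count scales with image area, so an ultrawide 7K image and a
# phone screenshot each keep their native aspect ratio instead of being
# squashed into one fixed square.
print(patch_grid(7168, 3072))  # 512 cols * 220 rows = 112640
print(patch_grid(1080, 2400))  # 78 cols * 172 rows = 13416
```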
**Advantages of Tencent HunYuan’s Multimodal Model:**
* **Adopting the MoE Architecture:** Tencent HunYuan’s language model pioneered the adoption of the Mixture-of-Experts (MoE) architecture, improving overall model performance by 50% over the previous generation, with some Chinese-language capabilities now on par with GPT-4o. MoE accommodates additional modalities and tasks more gracefully, ensuring that different modalities and tasks reinforce rather than compete with one another (see the sketch after this list).
* **Simplicity and Scalability:** The design of Tencent HunYuan’s multimodal model follows the principles of simplicity, rationality, and scalability. It supports native arbitrary resolutions and uses a simple MLP adapter, making both the model and the data easier to expand and scale (also illustrated in the sketch below).
* **Emphasis on Generality, Practicality, and Reliability:** Unlike most multimodal models, which are tuned primarily on open-source benchmarks, Tencent HunYuan’s multimodal model places greater emphasis on generality, practicality, and reliability, and offers rich multimodal scene-understanding capabilities.
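As a rough illustration of the two architectural points above, the following PyTorch sketch pairs a top-k gated MoE feed-forward layer with a two-layer MLP adapter that projects vision-encoder features into the language model’s embedding space. All dimensions, the expert count, and the top-k routing shown here are assumptions for illustration; none of it reflects HunYuan’s actual implementation, which is not public.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Minimal top-k gated Mixture-of-Experts feed-forward layer.

    Illustrative only: the expert count, top_k, and the dense routing
    loop below are assumptions, not HunYuan's actual design.
    """
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # router: scores each token per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)  # each token's top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

class MLPAdapter(nn.Module):
    """Two-layer MLP projecting vision-encoder features into the LLM
    embedding space (the 'simple MLP adapter' mentioned above; the
    dimensions are hypothetical)."""
    def __init__(self, d_vision: int, d_model: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_vision, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_feats)  # (batch, n_patches, d_model)

# Toy usage with small dimensions: adapt fake image-patch features,
# then pass the resulting tokens through one MoE block.
patches = torch.randn(1, 16, 256)                 # 1 image, 16 patches, 256-d features
tokens = MLPAdapter(d_vision=256, d_model=512)(patches)
print(MoELayer(d_model=512, d_ff=2048)(tokens).shape)  # torch.Size([1, 16, 512])
```

The design intuition behind such a layer: routing each token to a small subset of experts keeps per-token compute roughly constant while total capacity grows with the number of experts, which is one reason MoE layers can absorb additional modalities and tasks without forcing them to compete for the same dense parameters.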
**In the recent SuperCLUE-V benchmark evaluation for Chinese multimodal large models (August 2024), Tencent HunYuan achieved first place in China, surpassing several mainstream closed-source models.**
**The SuperCLUE-V evaluation focuses on Chinese-language capabilities and assesses multimodal understanding in real-world application scenarios.** The model’s strong performance in this evaluation demonstrates significant progress in Chinese multimodal understanding and lays a solid foundation for propelling general AI toward the physical world.
**The release of Tencent HunYuan’s multimodal understanding model marks a significant breakthrough for China in the field of multimodal large models and points the way for the future development of AI.** As technology continues to advance, multimodal understanding models are expected to mature further and play a crucial role in more domains, bringing greater benefits to human society.
[Source] https://www.jiqizhixin.com/articles/2024-08-22-6