Title: "Meta Breakthrough: Modality-Aware Mixture-of-Experts Architecture Leads a New Wave of Multimodal AI"
Keywords: Modality Fusion, Meta Innovation, Mixture-of-Experts Architecture
News Content:
As artificial intelligence continues to advance, multimodal models have become a major research focus. Mixture-of-experts (MoE) models, which can perceive different modalities and act on them accordingly, have gained popularity in the field. Recently, Meta introduced a new modality-aware mixture-of-experts model aimed at better processing and integrating information across modalities.
In the paper "Chameleon: Mixed-modal early-fusion foundation models," Meta's Chameleon team proposed a single Transformer architecture that models mixed-modal sequences of discrete image and text tokens with a next-token prediction objective. The architecture reasons over and generates content across modalities seamlessly and, after pre-training on approximately 10 trillion mixed-modal tokens, exhibits broad visual and linguistic capabilities.
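To make the early-fusion objective concrete, here is a minimal PyTorch sketch, not Chameleon's actual code: a single decoder-style Transformer trained with next-token prediction over one sequence that interleaves text and image tokens. The shared vocabulary size, model dimensions, and the assumption that images have already been quantized into discrete IDs by a separate tokenizer are all illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch of early-fusion next-token prediction, not Chameleon's
# actual code. It assumes text and images are already mapped into one
# shared discrete vocabulary: text IDs from a BPE tokenizer, image IDs
# from a quantizing image tokenizer (e.g. a VQ codebook). All sizes
# here are toy values.
VOCAB_SIZE, D_MODEL, N_HEAD, N_LAYER = 4096, 512, 8, 6

class EarlyFusionLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEAD, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYER)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        # tokens: (batch, seq) of mixed text/image token IDs; a causal
        # mask makes this an autoregressive decoder-style model.
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.lm_head(h)

# The objective is the same regardless of modality: predict token t+1
# from tokens up to t, whether the target is a text or an image token.
model = EarlyFusionLM()
mixed = torch.randint(0, VOCAB_SIZE, (2, 128))  # toy mixed-modal batch
logits = model(mixed[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), mixed[:, 1:].reshape(-1))
loss.backward()
```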
Chameleon is particularly strong at generating long mixed-modal answers, even outperforming commercial models such as Gemini 1.0 Pro and GPT-4V. However, because the modalities are fused from the earliest stage of training, scaling the model's capabilities demands substantial computational resources.
To address this issue, the Meta FAIR team investigated routed sparse architectures and proposed MoMa, a modality-aware mixture-of-experts architecture. MoMa improves model performance by incorporating modules dedicated to specific modalities. This concept, termed modality-aware sparsity (MaS), lets the model better capture the characteristics of each modality while preserving strong cross-modal integration through partial parameter sharing and attention mechanisms.
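The routing idea can be illustrated with a short, hedged PyTorch sketch, not Meta's released implementation: each modality owns its own pool of feed-forward experts, tokens are first partitioned by their known modality, and a learned gate then selects one expert within that modality's pool. The pool sizes, dimensions, and top-1 gating are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch of modality-aware sparse routing (MaS), not Meta's
# released code. Each modality gets its own pool of feed-forward
# experts; a token is first partitioned by its (known) modality, then
# routed to one expert inside that pool by a learned gate.
D_MODEL, D_FF, EXPERTS_PER_MODALITY = 512, 2048, 4

def make_expert():
    return nn.Sequential(
        nn.Linear(D_MODEL, D_FF), nn.GELU(), nn.Linear(D_FF, D_MODEL))

class ModalityAwareMoE(nn.Module):
    def __init__(self, n_modalities=2):
        super().__init__()
        # One expert pool and one router per modality (0=text, 1=image).
        self.pools = nn.ModuleList([
            nn.ModuleList([make_expert() for _ in range(EXPERTS_PER_MODALITY)])
            for _ in range(n_modalities)])
        self.routers = nn.ModuleList([
            nn.Linear(D_MODEL, EXPERTS_PER_MODALITY)
            for _ in range(n_modalities)])

    def forward(self, x, modality):
        # x: (tokens, d_model); modality: (tokens,) integer modality IDs
        out = torch.zeros_like(x)
        for m, (pool, router) in enumerate(zip(self.pools, self.routers)):
            idx = (modality == m).nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            xm = x[idx]
            gate = router(xm).softmax(dim=-1)   # (n_m, E) routing weights
            top_w, top_e = gate.max(dim=-1)     # top-1 expert per token
            ym = torch.zeros_like(xm)
            for e, expert in enumerate(pool):
                sel = (top_e == e)
                if sel.any():
                    ym[sel] = expert(xm[sel]) * top_w[sel].unsqueeze(-1)
            out[idx] = ym
        return out

# Toy usage: 10 tokens, the first 6 text (0), the last 4 image (1).
layer = ModalityAwareMoE()
x = torch.randn(10, D_MODEL)
modality = torch.tensor([0] * 6 + [1] * 4)
y = layer(x, modality)  # (10, 512)
```

Because each token activates only one small expert within its modality's pool, the layer adds capacity specialized per modality without a proportional increase in per-token compute.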
This research further advances mixed-modal mixture-of-experts methods, particularly their application to vision-language encoders and masked language modeling. Future work may extend the technique to more domains, aiming for more intelligent and efficient cross-modal information processing.
Source: https://www.jiqizhixin.com/articles/2024-08-11-2