News Title: Apple and EPFL Jointly Create a Single Model for Any-to-Any Modalities, Efficiently Trained Across Dozens of Diverse Tasks and Modalities

Keywords: Multimodal Model, Apple New Release, Fine-Grained Training

News Content:

Title: Apple and EPFL Jointly Develop the New Multimodal Vision Model 4M-21, Ushering in a New Era of Single Models for Any-to-Any Modalities

Recently, Apple and a research team at the École Polytechnique Fédérale de Lausanne (EPFL) jointly developed a new multimodal vision model, 4M-21. The work pushes past the limits of existing multimodal and multitask foundation models, enabling a single model to handle any-to-any modality tasks.

It is reported that current multimodal and multitask foundation models such as 4M and UnifiedIO already demonstrate excellent performance, but the inputs they can accept and the tasks they can perform remain constrained by the number of modalities and tasks they were trained on. 4M-21, by contrast, is trained on dozens of highly diverse modalities and achieves much broader applicability through co-training on large-scale multimodal datasets and text corpora.

A key step in the team's training process is applying discrete tokenization to each modality: whether the data is structured (neural-network image feature maps, vectors, instance segmentation, human poses) or representable as text, 4M-21 handles it with ease.
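
To make the tokenization step concrete, here is a minimal sketch of how very different modalities could end up in one shared space of discrete token IDs. This is not the authors' code; the codebook size, ID offset, modality prefix tokens, and helper names are all illustrative assumptions.

```python
import numpy as np

CODEBOOK_SIZE = 1024   # assumed size of a learned VQ codebook
TEXT_ID_OFFSET = 2048  # assumed offset so text byte IDs don't collide with codebook IDs

def quantize_feature_map(feature_map: np.ndarray, codebook: np.ndarray) -> list:
    """Nearest-neighbour vector quantization: each spatial feature vector is
    replaced by the index of its closest codebook entry, i.e. a discrete token."""
    flat = feature_map.reshape(-1, feature_map.shape[-1])             # (H*W, C)
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (H*W, K)
    return dists.argmin(axis=1).tolist()

def tokenize_text(text: str) -> list:
    """Toy byte-level text tokenizer, shifted into its own ID range."""
    return [TEXT_ID_OFFSET + b for b in text.encode("utf-8")]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, 8))  # stand-in for a learned codebook
feature_map = rng.normal(size=(4, 4, 8))        # stand-in for a CNN/ViT feature map

# Both modalities become flat lists of integer tokens, distinguished only by
# an assumed modality prefix token (0 = feature map, 1 = caption).
sequence = [0] + quantize_feature_map(feature_map, codebook) \
         + [1] + tokenize_text("a dog on a beach")
print(len(sequence), sequence[:8])
```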

The research also shows that 4M-21 can perform at least three times as many tasks/modalities as existing models with no loss in performance. In addition, the work achieves finer-grained and more controllable multimodal generation, opening up new possibilities for multimodal applications.

The success of this research rests on a multimodal masked pretraining scheme, which lets the model adapt to data from many different modalities and improves its generalization ability.
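
As a rough illustration of the masking idea, the toy sketch below splits tokens drawn from several modalities into a visible input set and a masked target set. The modality names, token IDs, and input/target budgets here are made-up assumptions, and the published training recipe is certainly more involved.

```python
import random

def sample_masked_example(modality_tokens, input_budget=8, target_budget=8, seed=None):
    """Split all (modality, position, token) triples into a visible input set
    and a masked target set, mimicking a multimodal masked objective."""
    rng = random.Random(seed)
    triples = [(name, pos, tok)
               for name, toks in modality_tokens.items()
               for pos, tok in enumerate(toks)]
    rng.shuffle(triples)
    inputs = triples[:input_budget]
    targets = triples[input_budget:input_budget + target_budget]
    return inputs, targets

# Three already-tokenized modalities (IDs are arbitrary placeholders).
example = {
    "rgb_tokens":     [5, 17, 240, 99, 3, 812],
    "depth_tokens":   [44, 44, 12, 700],
    "caption_tokens": [2101, 2115, 2119, 2132],
}
visible, masked = sample_masked_example(example, seed=0)
print("visible to the model:", visible)
print("to be predicted     :", masked)
```

In a scheme of this kind, a transformer is trained to predict the masked tokens from the visible ones; because any subset of modalities can serve as input or target, conditioning on one set of modalities and decoding another at inference time is what produces the any-to-any behavior.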

Overall, the arrival of the 4M-21 model marks a new milestone for multimodal vision technology and is expected to drive broader and deeper development of multimodal applications.

Source: https://www.jiqizhixin.com/articles/2024-06-25-2
