News Title: Zhiyuan Research Institute Releases CMMU v0.1: GPT-4V Multimodal Model Achieves Only 30% Accuracy in Answering Questions, Image Understanding Capability Needs Improvement
Keywords: Zhiyuan releases CMMU, GPT-4V, multimodal model
News Content: Zhiyuan Research Institute recently released CMMU, a Chinese multimodal, multi-question-type understanding and reasoning benchmark. According to the announcement, the initial release, CMMU v0.1, contains 3,603 questions drawn and adapted from primary, middle, and high school exams administered under the standards of China's national education system. The questions span several formats, including single-choice, multiple-select, and fill-in-the-blank, and to prevent models from scoring by random guessing, the institute applies multiple evaluation methods.
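Conceptually, a per-question-type scorer with an anti-guessing safeguard might look like the sketch below. The data format, field names, and the option-shuffling rule are all assumptions for illustration; the announcement does not specify how CMMU actually implements its evaluation.

```python
import random
from collections import defaultdict

# Hypothetical question records; CMMU's real data schema is not described
# in the announcement.
QUESTIONS = [
    {"id": 1, "type": "single-choice", "options": ["A", "B", "C", "D"], "answer": "B"},
    {"id": 2, "type": "multiple-select", "options": ["A", "B", "C", "D"], "answer": "AC"},
    {"id": 3, "type": "fill-in-the-blank", "answer": "42"},
]

def shuffled_variants(question, n=3, seed=0):
    """Yield n copies of a choice question with the options re-ordered.

    Asking the model every variant and requiring a correct answer each
    time reduces the chance that a single lucky guess counts as correct.
    """
    rng = random.Random(seed)
    for _ in range(n):
        opts = question["options"][:]
        rng.shuffle(opts)
        yield {**question, "options": opts}

def score(predictions, questions):
    """Return accuracy per question type, given {question_id: is_correct}."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        total[q["type"]] += 1
        correct[q["type"]] += bool(predictions.get(q["id"]))
    return {t: correct[t] / total[t] for t in total}
```

Reporting accuracy separately for each question type, as CMMU does, makes it harder for a model to look strong overall by doing well only on the format most amenable to guessing.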
CMMU is challenging overall: OpenAI's GPT-4V multimodal model answers only about 30% of the questions correctly. An analysis of its error types shows that GPT-4V's image understanding and reasoning capabilities still need improvement. The finding has drawn attention because GPT-4V, an advanced model developed by OpenAI, is widely regarded as a significant achievement in natural language processing.
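The error-type analysis mentioned above could be tallied as in this minimal sketch. The category labels and counts here are invented for illustration; the announcement does not publish CMMU's annotation scheme or the actual distribution of GPT-4V's errors.

```python
from collections import Counter

# Hypothetical error log: each failed answer carries the category
# a grader assigned to it. These values are illustrative only.
errors = [
    "image understanding", "reasoning", "image understanding",
    "reasoning", "image understanding", "other",
]

breakdown = Counter(errors)
total = sum(breakdown.values())
for category, count in breakdown.most_common():
    print(f"{category}: {count}/{total} = {count / total:.0%}")
```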
The release of the CMMU benchmark has spurred further exploration of multimodal models. As artificial intelligence technology advances, multimodal models have become a research hotspot: by fusing information from modalities such as images, text, and speech, they can understand and reason about complex problems more effectively, improving answer accuracy.
However, the benchmark also exposes the current weaknesses of multimodal models in image understanding and reasoning. Closing that gap is the challenge for further improvement and optimization: researchers will need to strengthen work on image understanding and reasoning and explore more effective model architectures and training methods.
The CMMU benchmark gives the multimodal-model community an important evaluation platform. Testing across multiple question types and difficulty levels yields a more complete picture of a model's strengths and limitations, providing a reference point for subsequent research. The hope is that future work will further improve multimodal models' image understanding and reasoning, bringing greater breakthroughs to applied artificial intelligence.
【来源】https://mp.weixin.qq.com/s/wegZvv4hwLef0BpdIh32-A