Latest News


Headline: “Zhiyuan Releases CMMU Evaluation Benchmark, Challenging GPT-4V: High-Difficulty Multimodal Chinese Test, Model Answer Accuracy Stands at Only 30%”

Keywords: Zhiyuan CMMU, GPT-4V Evaluation, Multimodal Challenge

News Content: 【Zhiyuan Research Institute Launches CMMU (Chinese Multimodal Multi-type Understanding and Reasoning), a New Chinese-Language Evaluation Benchmark for Multimodal Models】 Recently, the well-known domestic artificial intelligence research institution Zhiyuan Research Institute officially released CMMU, a new Chinese-language evaluation benchmark for multimodal models. The benchmark comprises 3,603 questions carefully selected and compiled from the standardized exam banks used in China's primary, middle, and high schools, covering question types such as single-choice, multiple-choice, and fill-in-the-blank. It aims to comprehensively assess a model's ability to understand text, images, and other modalities, and to perform cross-disciplinary reasoning.
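
To make the benchmark setup above concrete, the sketch below shows one way a CMMU-style item set could be scored overall and per question type. This is a minimal illustration only: the item fields, answer formats, and the placeholder model callable are hypothetical assumptions, not the actual CMMU data schema or official evaluation harness.

```python
from collections import defaultdict

# Hypothetical CMMU-style items. The real benchmark pairs each question with an
# image; the field names and answer formats here are illustrative assumptions.
items = [
    {"id": "phys-001", "type": "single-choice",     "answer": "B"},
    {"id": "math-042", "type": "multiple-choice",   "answer": {"A", "C"}},
    {"id": "chem-117", "type": "fill-in-the-blank", "answer": "3.6 mol"},
]

def is_correct(item, prediction):
    """Score one prediction against the reference answer by question type."""
    if item["type"] == "multiple-choice":
        # Multi-select items count as correct only if every option matches.
        return set(prediction) == item["answer"]
    return str(prediction).strip() == str(item["answer"]).strip()

def evaluate(model_answer, items):
    """Return overall and per-type accuracy for a model_answer(item) callable."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in items:
        totals[item["type"]] += 1
        if is_correct(item, model_answer(item)):
            hits[item["type"]] += 1
    per_type = {t: hits[t] / totals[t] for t in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_type

# Example: a placeholder "model" that always answers "B".
overall, per_type = evaluate(lambda item: "B", items)
print(f"overall accuracy: {overall:.1%}", per_type)
```

A per-type breakdown like this is what makes the reported figure (roughly 30% for GPT-4V) interpretable, since single-choice, multiple-choice, and fill-in-the-blank items have very different chance baselines.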

CMMU v0.1 is designed to push the boundaries of current multimodal technology, with an intentionally high overall difficulty. Even GPT-4V, the multimodal model from the internationally leading AI company OpenAI, achieved an answer accuracy of only about 30% in the evaluation. An in-depth analysis of error types shows that large multimodal models (LMMs) still have clear limitations in image understanding and reasoning, exposing the weaknesses of existing models on such tasks.

The launch of this benchmark gives developers of multimodal models a more challenging and impartial assessment standard, driving progress in artificial intelligence and in multimodal technology in particular. As CMMU is iterated and refined, it is expected to guide related technologies toward more efficient and precise understanding and reasoning in practical applications, empowering intelligent innovation across more domains.

【Source】https://mp.weixin.qq.com/s/wegZvv4hwLef0BpdIh32-A
