Title: BAAI Releases Chinese Multimodal Evaluation Benchmark CMMU; GPT-4V Accuracy Only About 30%
Keywords: BAAI release, multimodal evaluation, GPT-4V accuracy
News Content:
The Beijing Academy of Artificial Intelligence (BAAI) recently released CMMU, a Chinese multimodal benchmark for multi-question-type understanding and reasoning. The benchmark comprises 3,603 questions drawn and compiled from national primary, junior high, and senior high school exams administered under China's education system, covering three question types: single-choice, multiple-choice, and fill-in-the-blank. To prevent models from scoring by "randomly guessing the correct answer," CMMU applies multiple evaluation safeguards.
However, CMMU is difficult overall: even OpenAI's GPT-4V multimodal model answers only about 30% of the questions correctly. Error-type analysis shows that GPT-4V still has room to improve in image understanding and reasoning. This result indicates that although artificial intelligence has made significant progress in language processing, it still faces challenges in understanding and reasoning.
The release of CMMU provides an important evaluation tool for research on Chinese multimodal models and should help advance the field. At the same time, GPT-4V's performance offers valuable feedback that deepens our understanding of AI's current limitations in understanding and reasoning.
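The article notes that CMMU uses multiple evaluation safeguards against random guessing but does not describe them. One widely used safeguard in multiple-choice benchmarks is circular evaluation: pose each question once per rotation of its option list and score it correct only if every variant is answered correctly. The sketch below illustrates that general idea and is not CMMU's actual implementation; `ask_model` is a hypothetical stand-in for whatever model-query function an evaluator uses.

```python
def circular_eval(ask_model, options, correct_text):
    """Ask the same single-choice question once per rotation of the
    option list; count it correct only if every variant is answered
    correctly. A model that always picks the same letter can pass at
    most one rotation, so lucky guesses are filtered out."""
    letters = "ABCDEFGH"[:len(options)]
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        # Letter the correct option occupies in this rotation.
        expected = letters[rotated.index(correct_text)]
        prompt = "\n".join(f"{l}. {o}" for l, o in zip(letters, rotated))
        if ask_model(prompt).strip() != expected:
            return False
    return True
```

A model that truly knows the answer passes every rotation, while one that always replies "A" fails as soon as the correct option rotates away from position A.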
[Source] https://mp.weixin.qq.com/s/wegZvv4hwLef0BpdIh32-A