
News Title: "ByteDance Doubao Large Model Team Releases Detail Image Caption Evaluation Benchmark, Advancing VLM Evaluation Methods"

Keywords: Doubao Large Model, DetailCaps-4870, CAPTURE metric

News Content: Recently, a new Detail Image Caption evaluation benchmark, jointly developed by the Chinese Academy of Sciences, Peking University, and ByteDance's Doubao large model team, was officially released. The benchmark aims to improve the reliability of performance evaluation for vision-language models (VLMs), filling a gap in current evaluation methods, particularly in the reliable assessment of detail image captioning capabilities.

VLM performance evaluation has traditionally relied mainly on question-answering formats, which do not fully probe a model's comprehensive understanding of image content. By incorporating the detail image captioning task, the new benchmark seeks to evaluate VLMs' foundational understanding abilities more comprehensively and precisely, providing more accurate and reliable evaluation results.

The DetailCaps-4870 dataset, one of the outcomes of this release, features a diverse collection of images and descriptive texts, offering a comprehensive test bed for model evaluation. Alongside it, the team introduced a new evaluation metric, CAPTURE, which shows the highest consistency with expert judgments among open-source evaluation metrics and, at low cost, achieves performance on par with GPT-Eval. This work not only provides new tools and methods for VLM performance evaluation but also serves as a significant reference for academia and industry.
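The article does not spell out how CAPTURE scores a caption. As a rough illustration only, a metric of this kind can be sketched as an F1-style overlap between visual elements (objects, attributes, relations) extracted from candidate and reference captions. Everything below — the element categories, the weights, and the exact-match comparison — is an assumption for illustration, not the published method, which reportedly relies on NLP parsing and softer matching.

```python
def capture_like_score(cand_elems: dict, ref_elems: dict) -> float:
    """Hypothetical sketch of an element-overlap caption metric.

    cand_elems / ref_elems map each category ("objects", "attributes",
    "relations") to a set of strings assumed to have been extracted
    from the candidate and reference captions beforehand.
    """
    scores = {}
    for key in ("objects", "attributes", "relations"):
        cand, ref = cand_elems[key], ref_elems[key]
        matched = len(cand & ref)  # exact set overlap (simplification)
        precision = matched / len(cand) if cand else 0.0
        recall = matched / len(ref) if ref else 0.0
        denom = precision + recall
        scores[key] = 2 * precision * recall / denom if denom else 0.0
    # Illustrative weights, not taken from the paper: objects weighted
    # highest, attributes and relations equally.
    weights = {"objects": 0.5, "attributes": 0.25, "relations": 0.25}
    return sum(weights[k] * scores[k] for k in scores)
```

A parser-based front end would populate the element sets automatically from raw captions; the point of the sketch is only that comparing structured elements, rather than raw n-grams, is what lets such a metric track expert judgments of detail captions.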

AIxiv, a column through which Jiqizhixin (Machine Heart) publishes academic and technical content, has long been dedicated to promoting academic exchange and dissemination. Over the past few years, it has published more than 2,000 articles covering top laboratories at major universities and companies worldwide. Researchers and practitioners who wish to share outstanding work are welcome to submit or get in touch for coverage; the submission emails are liyazhou@jiqizhixin.com and zhaoyunfeng@jiqizhixin.com.

The release of the Detail Image Caption evaluation benchmark, dataset, and evaluation metric not only marks progress in VLM performance evaluation but also provides substantial support for advancing and applying related technologies. As these results are adopted, there is reason to expect that VLMs will better understand, describe, and process image information, further driving the application of artificial intelligence across various fields.

【来源】https://www.jiqizhixin.com/articles/2024-07-15-3
