OpenAI's "Strawberry" Model Slips Again: What's Behind the Early-Morning Release of SWE-bench Verified?

In recent days, the AI community received a somewhat contentious piece of news: the "Strawberry" model that OpenAI had been expected to release in the early morning was delayed once again, and SWE-bench Verified was published in its place. Amid the race over AI coding capability, the swap has drawn wide discussion.

SWE-bench Verified is an improved version of the SWE-bench benchmark, intended to evaluate more accurately how well AI systems solve real software problems. The original SWE-bench is a dataset of 2,294 Issue-Pull Request pairs collected from 12 popular Python repositories, used to test how large language models (LLMs) perform when resolving real software issues on GitHub.
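
For readers who want to look at the data themselves, both the original benchmark and the verified subset are distributed as Hugging Face datasets. The snippet below is a minimal sketch, assuming the dataset IDs `princeton-nlp/SWE-bench` and `princeton-nlp/SWE-bench_Verified` and the field names used in the public release; it only inspects one instance and does not run any evaluation.

```python
# pip install datasets
from datasets import load_dataset

# Full SWE-bench test split: 2,294 issue/PR pairs from 12 Python repositories.
swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")

# Human-screened subset released alongside OpenAI's announcement (assumed dataset ID).
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(swe_bench), len(verified))

# Each instance pairs a GitHub issue with the gold patch and the tests
# that a candidate fix is judged against.
example = verified[0]
print(example["instance_id"])              # identifier of the repo + pull request
print(example["problem_statement"][:300])  # issue text the model must resolve
print(example["FAIL_TO_PASS"])             # tests that must flip from failing to passing
```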

The original SWE-bench, however, had several problems that could cause a model's autonomous software-engineering ability to be underestimated. For SWE-bench Verified, OpenAI therefore worked with the benchmark's original authors on human screening and revision of the tasks, making sure that the unit tests have an appropriate scope and that the problem descriptions are unambiguous.

In tests run on SWE-bench Verified, several AI coding agents scored higher than before, suggesting that the earlier benchmark did indeed understate AI coding ability. UIUC's Agentless approach, for example, roughly doubled its score on SWE-bench Verified, underscoring the value of the revision.
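
The headline number behind these scores is the resolved rate: roughly, the fraction of instances where, after the model's patch is applied, every FAIL_TO_PASS test now passes and every PASS_TO_PASS test still passes. Below is a minimal sketch of that aggregation step, assuming per-instance test results have already been produced by the official harness; the report layout and function names here are illustrative, not the harness's actual API.

```python
from typing import Dict, List

def is_resolved(report: Dict[str, List[str]]) -> bool:
    """An instance counts as resolved only if no required test failed.

    `report` is a hypothetical per-instance summary listing the test IDs
    that still failed in each category after the model's patch was applied.
    """
    return not report["fail_to_pass_failures"] and not report["pass_to_pass_failures"]

def resolved_rate(reports: Dict[str, Dict[str, List[str]]]) -> float:
    """Fraction of benchmark instances whose patch made all required tests pass."""
    if not reports:
        return 0.0
    return sum(is_resolved(r) for r in reports.values()) / len(reports)

# Illustrative data: two instances, one resolved, one with a remaining failure.
reports = {
    "astropy__astropy-12907": {"fail_to_pass_failures": [], "pass_to_pass_failures": []},
    "django__django-11099":   {"fail_to_pass_failures": ["test_regression"], "pass_to_pass_failures": []},
}
print(f"resolved rate: {resolved_rate(reports):.1%}")  # 50.0%
```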

Even so, for users who had been waiting for the "Strawberry" model, the release may feel like a consolation prize. As one comment put it, "We were expecting strawberries, but they shipped kale." The remark captures the gap between public expectations of OpenAI and what was actually released.

Overall, the release of SWE-bench Verified is a welcome improvement to how AI coding ability is evaluated, but it also highlights how difficult it is to get evaluation standards and benchmarks right. As AI coding capability keeps advancing, these benchmarks will have to keep evolving as well, so that they continue to reflect what AI can really do in software engineering.


[Source] https://www.jiqizhixin.com/articles/2024-08-14-6
