
San Francisco, CA – Meta’s latest AI model series, Llama 4, is facing a wave of skepticism after users reported disappointing performance in practical coding tasks, despite its impressive rankings in the Large Model Systems Organization (LMSYS) Arena benchmark. The release of Llama 4, encompassing three models – Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth – was initially met with excitement due to their strong performance in the arena, a popular platform for evaluating large language models (LLMs).

According to Meta’s official announcement, Llama 4 Maverick secured the second-highest overall ranking, becoming only the fourth model to surpass the 1400-point threshold. Furthermore, it claimed the top spot among open-source models, surpassing DeepSeek, and excelled in challenging tasks such as complex prompts, programming, mathematics, and creative writing.

However, the initial euphoria has quickly faded as users began to test Llama 4’s coding capabilities in real-world scenarios. Online forums and social media platforms are now filled with reports of Llama 4’s underwhelming performance in coding-related tasks.

One user, @deedydas, highlighted the poor performance of Llama 4 Scout (109B) and Maverick (402B) on the Kscores benchmark, which focuses specifically on code generation and code completion. According to @deedydas, these models lagged behind competitors such as GPT-4o, Gemini Flash, Grok 3, DeepSeek V3, and even smaller models such as Claude Sonnet 3.5/3.7.

The user illustrated this point with an example involving a ball bouncing within a rotating hexagon, where Llama 4’s performance was notably subpar. This observation was echoed by numerous other users in the comments section, who reported similar experiences with both Llama 4 Scout and Maverick.
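For readers unfamiliar with the test, the "ball bouncing inside a rotating hexagon" prompt asks a model to write a small, self-contained physics simulation. The sketch below is purely illustrative (it is not Llama 4's output and not the exact prompt used by @deedydas); it assumes pygame is installed, and constants such as RESTITUTION and ANGULAR_SPEED are arbitrary choices.

```python
# Illustrative sketch of the "ball in a rotating hexagon" coding task.
# Not Llama 4's output; assumes pygame is available (pip install pygame).
import math
import pygame

WIDTH, HEIGHT = 600, 600
CENTER = pygame.Vector2(WIDTH / 2, HEIGHT / 2)
HEX_RADIUS = 220
BALL_RADIUS = 12
GRAVITY = pygame.Vector2(0, 600)       # px/s^2 (arbitrary)
RESTITUTION = 0.9                      # fraction of speed kept after a bounce
ANGULAR_SPEED = math.radians(40)       # hexagon rotation speed, rad/s


def hexagon_points(angle: float) -> list:
    """Vertices of a regular hexagon rotated by `angle` around CENTER."""
    return [
        CENTER + HEX_RADIUS * pygame.Vector2(math.cos(angle + i * math.pi / 3),
                                             math.sin(angle + i * math.pi / 3))
        for i in range(6)
    ]


def bounce(pos, vel, points):
    """Reflect the ball off any hexagon edge it has penetrated.

    Minimal model: the moving wall imparts no tangential velocity to the ball.
    """
    for i in range(6):
        a, b = points[i], points[(i + 1) % 6]
        edge = b - a
        normal = pygame.Vector2(-edge.y, edge.x).normalize()
        if normal.dot(CENTER - a) < 0:   # orient the normal toward the interior
            normal = -normal
        dist = (pos - a).dot(normal)     # signed distance from the edge line
        if dist < BALL_RADIUS and vel.dot(normal) < 0:
            pos += normal * (BALL_RADIUS - dist)                  # push back inside
            vel -= (1 + RESTITUTION) * vel.dot(normal) * normal   # reflect velocity
    return pos, vel


def main():
    pygame.init()
    screen = pygame.display.set_mode((WIDTH, HEIGHT))
    clock = pygame.time.Clock()
    pos = pygame.Vector2(CENTER.x, CENTER.y - 100)
    vel = pygame.Vector2(150, 0)
    angle = 0.0

    running = True
    while running:
        dt = clock.tick(60) / 1000.0
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                running = False

        angle += ANGULAR_SPEED * dt
        vel += GRAVITY * dt
        pos += vel * dt
        pos, vel = bounce(pos, vel, hexagon_points(angle))

        screen.fill((20, 20, 30))
        pygame.draw.polygon(screen, (200, 200, 220), hexagon_points(angle), width=2)
        pygame.draw.circle(screen, (250, 120, 80), pos, BALL_RADIUS)
        pygame.display.flip()

    pygame.quit()


if __name__ == "__main__":
    main()
```

Even for a sketch this simple, the task requires correct edge normals, collision response, and frame-rate-independent integration, which is why users treat it as a quick sanity check of a model's practical coding ability.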

The discrepancies between Llama 4’s benchmark scores and its real-world performance have raised questions about the model’s training methodology and evaluation metrics. Some speculate that Llama 4 may have been over-optimized for the specific tasks and datasets used in the LMSYS Arena, leading to inflated scores that do not accurately reflect its general coding abilities. This phenomenon, sometimes described as benchmark overfitting or “gaming the benchmark,” occurs when a model is tuned to excel on a particular benchmark, potentially at the expense of broader generalization.

The situation highlights the challenges of evaluating LLMs and the importance of considering a variety of benchmarks and real-world use cases. While benchmarks like the LMSYS Arena provide a valuable means of comparing models, they should not be the sole determinant of a model’s overall quality and utility.

The concerns surrounding Llama 4’s coding performance serve as a reminder that LLMs are still under development and that further research is needed to improve their reliability and generalizability. As AI models become increasingly integrated into various aspects of our lives, it is crucial to ensure that they are evaluated rigorously and that their limitations are well understood. Meta has yet to respond to these criticisms, but the ongoing debate underscores the importance of transparency and accountability in the development and deployment of AI technologies.

References:

  • @deedydas’s post on [Platform where the post was made].
  • Official announcement of Meta Llama 4.
  • LMSYS Arena benchmark results.
  • Kscores benchmark results.

