
Beijing, [Date] – DeepSeek’s GRPO (Group Relative Policy Optimization) has shown significant promise in boosting the efficiency of reinforcement learning (RL) for large language models (LLMs). However, the research community has noted that the published details lack the depth needed to replicate the system at the scale required for industrial applications.

Now, a collaborative effort between Tsinghua University’s Institute for AI Industry Research (AIR) and ByteDance’s SIA Lab has yielded a significant breakthrough: DAPO, or Decoupled Clip and Dynamic sAmpling Policy Optimization. This open-source system is a state-of-the-art (SOTA) solution for large-scale LLM reinforcement learning, and the team plans to open-source the models trained with the algorithm soon.

The project page can be found at: https://dapo-sia.github.io/
The research paper is available at: https://dapo-sia.github.io/static/pdf/dapo_paper.pdf
The code repository is located at: https://github.com/volcengine/verl/tree/gm-tyx/puffin/main/recipe/dapo
The dataset used is hosted on Hugging Face: https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k

Using DAPO, the team has successfully trained a Qwen2.5-32B model to achieve strong results on the AIME 2024 benchmark.

The Significance of DAPO

The development of efficient and scalable RL algorithms is crucial for advancing the capabilities of LLMs. Reinforcement learning enables LLMs to learn from experience, optimizing their performance in complex tasks and adapting to dynamic environments. DeepSeek’s GRPO represented a step forward, but its limitations in replicability hindered wider adoption and further research.

DAPO addresses these limitations by providing a more transparent and readily implementable framework for large-scale LLM reinforcement learning. Its decoupled clip ranges and dynamic sampling allow for more stable and efficient training, leading to improved performance and faster convergence. The open-source nature of DAPO encourages collaboration and accelerates innovation in the field.
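The two ideas in the acronym can be illustrated in a few lines. The following is a minimal NumPy sketch, not the team’s implementation (the released code lives in the verl repository linked above): an asymmetric, "decoupled" clip on the PPO-style importance ratio, and a dynamic-sampling filter that drops prompt groups whose sampled answers are all correct or all wrong, since their group-relative advantage is zero. The epsilon values and function names here are illustrative assumptions.

```python
import numpy as np

def dapo_surrogate_loss(log_ratio, advantages, eps_low=0.2, eps_high=0.28):
    """Token-level surrogate loss with decoupled clip ranges.

    log_ratio:  log(pi_theta / pi_old) per token, shape (T,)
    advantages: per-token advantage estimates, shape (T,)
    eps_low / eps_high: asymmetric ("decoupled") clip bounds; a larger
    upper bound leaves more headroom to upweight low-probability tokens.
    (0.2 / 0.28 are illustrative defaults, not authoritative values.)
    """
    ratio = np.exp(np.asarray(log_ratio, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic (min) surrogate, averaged over tokens.
    return -np.mean(np.minimum(ratio * adv, clipped * adv))

def keep_for_training(rewards):
    """Dynamic-sampling filter: keep a prompt group only if its sampled
    answers are neither all wrong nor all correct, so the group-relative
    advantage is non-zero and the group contributes a gradient."""
    rewards = np.asarray(rewards)
    return bool(0 < rewards.sum() < len(rewards))
```

With identical old and new policies (log-ratio of zero) and unit advantages, the loss reduces to the negated mean advantage, which is a quick sanity check when wiring such a loss into a training loop.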

Looking Ahead

The release of DAPO marks a significant milestone in the development of RL for LLMs. The open-source code, dataset, and soon-to-be-released models will empower researchers and practitioners to explore the potential of DAPO and build upon its foundations. This collaborative approach promises to drive further advancements in LLM capabilities and unlock new applications across various domains.

The team’s success in training the Qwen2.5-32B model on the AIME 2024 benchmark demonstrates the effectiveness of DAPO. As the field continues to evolve, DAPO is poised to become a key tool for researchers and developers seeking to push the boundaries of LLM performance through reinforcement learning.

