Beijing, [Date] – DeepSeek’s GRPO (Group Relative Policy Optimization) has shown significant promise in boosting the efficiency of reinforcement learning (RL) for large language models (LLMs). However, the research community has noted that the published details lack the depth needed to replicate the system at industrial scale.
Now, a collaborative effort between Tsinghua University’s Artificial Intelligence Research (AIR) institute and ByteDance’s SIA Lab has yielded a significant breakthrough: DAPO, or Decoupled Clip and Dynamic sAmpling Policy Optimization. This open-source system represents a state-of-the-art (SOTA) solution for large-scale LLM reinforcement learning. Furthermore, the team plans to open-source the models trained using this algorithm soon.
The project page can be found at: https://dapo-sia.github.io/
The research paper is available at: https://dapo-sia.github.io/static/pdf/dapo_paper.pdf
The code repository is located at: https://github.com/volcengine/verl/tree/gm-tyx/puffin/main/recipe/dapo
The dataset used is hosted on Hugging Face: https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k
Using DAPO, the team has successfully trained a Qwen2.5-32B model to achieve impressive results on the AIME 2024 benchmark.
The Significance of DAPO
The development of efficient and scalable RL algorithms is crucial for advancing the capabilities of LLMs. Reinforcement learning enables LLMs to learn from experience, optimizing their performance in complex tasks and adapting to dynamic environments. DeepSeek’s GRPO represented a step forward, but its limitations in replicability hindered wider adoption and further research.
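GRPO’s central idea, introduced in DeepSeek’s work, is to replace a learned value-function baseline with a group-relative one: sample a group of responses per prompt, score them, and normalize each reward against the group’s mean and standard deviation. A minimal illustrative sketch (function and variable names are my own, not from the paper):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its group's
    mean and standard deviation (a GRPO-style baseline)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: one prompt, four sampled responses scored 1 (correct) or 0 (wrong)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group average receive positive advantages and are reinforced; those below it are suppressed, with no critic network required.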
DAPO addresses these limitations by providing a more transparent and readily implementable framework for large-scale LLM reinforcement learning. Its namesake techniques, a decoupled clipping range and dynamic sampling, make training more stable and sample-efficient, leading to improved performance and faster convergence. The open-source nature of DAPO encourages collaboration and accelerates innovation in the field.
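Per the paper, the decoupled clip ("Clip-Higher") separates the lower and upper clipping bounds of the PPO-style surrogate so that low-probability tokens can gain probability mass more freely, and dynamic sampling discards prompt groups whose samples all receive the same reward, since their group-relative advantage, and hence their gradient, is zero. A hedged sketch of both ideas, assuming a PPO-style token-level loss around them (the surrounding training loop is not shown):

```python
import torch

def decoupled_clip_loss(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with decoupled clip bounds (DAPO's Clip-Higher).

    ratio:     pi_theta(token) / pi_old(token), per token
    advantage: per-token advantage estimate
    Using eps_high > eps_low widens the upper bound, letting unlikely
    tokens be up-weighted more before clipping kicks in.
    """
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    return -torch.minimum(unclipped, clipped).mean()

def keep_group(rewards):
    """Dynamic sampling filter: drop a prompt's sample group when every
    response got the same reward (all correct or all wrong), because its
    group-normalized advantages are all zero and contribute no gradient."""
    return max(rewards) != min(rewards)
```

In practice the filtered-out groups are replaced by freshly sampled prompts so each batch keeps a constant number of gradient-bearing examples.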
Looking Ahead
The release of DAPO marks a significant milestone in the development of RL for LLMs. The open-source code, dataset, and soon-to-be-released models will empower researchers and practitioners to explore the potential of DAPO and build upon its foundations. This collaborative approach promises to drive further advancements in LLM capabilities and unlock new applications across various domains.
The team’s success in training the Qwen2.5-32B model on the AIME 2024 benchmark demonstrates the effectiveness of DAPO. As the field continues to evolve, DAPO is poised to become a key tool for researchers and developers seeking to push the boundaries of LLM performance through reinforcement learning.