Beijing, [Date] – DeepSeek’s GRPO (Group Relative Policy Optimization) has shown significant promise in boosting the efficiency of reinforcement learning (RL) for large language models (LLMs). However, the research community has noted that the published details lack the depth needed to replicate the system at industrial scale.
Now, a collaborative effort between Tsinghua University’s Artificial Intelligence Research (AIR) institute and ByteDance’s SIA Lab has yielded a significant breakthrough: DAPO, or Decoupled Clip and Dynamic sAmpling Policy Optimization. This open-source system represents a state-of-the-art (SOTA) solution for large-scale LLM reinforcement learning. Furthermore, the team plans to open-source the models trained using this algorithm soon.
The project page can be found at: https://dapo-sia.github.io/
The research paper is available at: https://dapo-sia.github.io/static/pdf/dapo_paper.pdf
The code repository is located at: https://github.com/volcengine/verl/tree/gm-tyx/puffin/main/recipe/dapo
The dataset used is hosted on Hugging Face: https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k
Using DAPO, the team has successfully trained a Qwen2.5-32B model to achieve impressive results on the AIME 2024 benchmark.
The Significance of DAPO
The development of efficient and scalable RL algorithms is crucial for advancing the capabilities of LLMs. Reinforcement learning enables LLMs to learn from experience, optimizing their performance in complex tasks and adapting to dynamic environments. DeepSeek’s GRPO represented a step forward, but its limitations in replicability hindered wider adoption and further research.
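GRPO’s central idea, introduced in DeepSeek’s work, is to replace a learned value-function baseline with a group-relative one: sample a group of responses per prompt, score them, and normalize each reward against the group’s mean and standard deviation. A minimal illustrative sketch (function and variable names are my own, not from the paper):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its group's
    mean and standard deviation (a GRPO-style baseline)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: one prompt, four sampled responses scored 1 (correct) or 0 (wrong)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group average receive positive advantages and are reinforced; those below it are suppressed, with no critic network required.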
DAPO addresses these limitations by providing a more transparent and readily implementable framework for large-scale LLM reinforcement learning. Its namesake techniques, a decoupled clipping range and dynamic sampling, make training more stable and sample-efficient, leading to improved performance and faster convergence. The open-source nature of DAPO encourages collaboration and accelerates innovation in the field.
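Per the paper, the decoupled clip ("Clip-Higher") separates the lower and upper clipping bounds of the PPO-style surrogate so that low-probability tokens can gain probability mass more freely, and dynamic sampling discards prompt groups whose samples all receive the same reward, since their group-relative advantage, and hence their gradient, is zero. A hedged sketch of both ideas, assuming a PPO-style token-level loss around them (the surrounding training loop is not shown):

```python
import torch

def decoupled_clip_loss(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with decoupled clip bounds (DAPO's Clip-Higher).

    ratio:     pi_theta(token) / pi_old(token), per token
    advantage: per-token advantage estimate
    Using eps_high > eps_low widens the upper bound, letting unlikely
    tokens be up-weighted more before clipping kicks in.
    """
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    return -torch.minimum(unclipped, clipped).mean()

def keep_group(rewards):
    """Dynamic sampling filter: drop a prompt's sample group when every
    response got the same reward (all correct or all wrong), because its
    group-normalized advantages are all zero and contribute no gradient."""
    return max(rewards) != min(rewards)
```

In practice the filtered-out groups are replaced by freshly sampled prompts so each batch keeps a constant number of gradient-bearing examples.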
Looking Ahead
The release of DAPO marks a significant milestone in the development of RL for LLMs. The open-source code, dataset, and soon-to-be-released models will empower researchers and practitioners to explore the potential of DAPO and build upon its foundations. This collaborative approach promises to drive further advancements in LLM capabilities and unlock new applications across various domains.
The team’s success in training the Qwen2.5-32B model on the AIME 2024 benchmark demonstrates the effectiveness of DAPO. As the field continues to evolve, DAPO is poised to become a key tool for researchers and developers seeking to push the boundaries of LLM performance through reinforcement learning.