What is Group Relative Policy Optimization (GRPO)?
@deepseek_ai Coder v2 is the best open code LLM, rivaling @openai GPT-4 in coding tasks. As part of the technical report, GRPO is mentioned as the RLHF method, but what is it? 🤔
GRPO was introduced in the DeepSeekMath paper earlier this year and is a method designed to improve mathematical reasoning capabilities with less memory consumption.
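In simplified form (leaving out the PPO-style clipping and per-token formulation used in the paper), the group-relative advantage and objective look roughly like this:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}$$

$$\mathcal{J}_{\text{GRPO}}(\theta) \approx \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\text{old}}}(o_i\mid q)}\,\hat{A}_i\right] - \beta\,\mathbb{D}_{\text{KL}}\!\left[\pi_\theta\,\|\,\pi_{\text{ref}}\right]$$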
Implementation
1️⃣ Generate multiple outputs for each input question using the current policy
2️⃣ Score these outputs using a reward model
3️⃣ Average the rewards within each group and use that mean as the baseline for computing the advantages
4️⃣ Update the policy to maximize the GRPO objective, which includes the advantages and a KL term (see the sketch below)
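As a rough picture of what those four steps look like in code, here's a toy sketch. The policy, reward model, vocabulary size, and hyperparameters below are invented for illustration (a real setup samples full completions from an LLM), not DeepSeek's actual training code:

```python
# Minimal, self-contained sketch of one GRPO step on a toy single-token "policy".
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, GROUP_SIZE, BETA = 8, 4, 0.04      # toy vocab, outputs per question, KL weight

policy_logits = torch.zeros(VOCAB, requires_grad=True)   # trainable toy policy
ref_logits = torch.zeros(VOCAB)                          # frozen reference (e.g. SFT) policy
optimizer = torch.optim.Adam([policy_logits], lr=1e-2)

def reward_model(answers: torch.Tensor) -> torch.Tensor:
    # Stand-in reward model: pretend token 3 is the correct answer.
    return (answers == 3).float()

# 1) Generate a group of G outputs for the same question with the current policy.
with torch.no_grad():
    old_log_probs_all = F.log_softmax(policy_logits, dim=-1)
    answers = torch.multinomial(old_log_probs_all.exp(), GROUP_SIZE, replacement=True)
    old_log_probs = old_log_probs_all[answers]

# 2) Score the outputs with the reward model.
rewards = reward_model(answers)

# 3) Group-relative advantages: normalize by the group's mean/std, no value network needed.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# 4) Maximize the GRPO objective: importance-weighted advantages minus a KL penalty
#    toward the reference policy (the KL sits in the loss, not in the reward).
log_probs = F.log_softmax(policy_logits, dim=-1)[answers]
ratio = (log_probs - old_log_probs).exp()                # equals 1 right after sampling
policy_term = (ratio * advantages).mean()

policy_log = F.log_softmax(policy_logits, dim=-1)
kl = (policy_log.exp() * (policy_log - F.log_softmax(ref_logits, dim=-1))).sum()

loss = -(policy_term - BETA * kl)                        # minimize the negative objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```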
Insights
💡 GRPO doesn't need a value-function model, reducing memory use and complexity
🔗 GRPO adds the KL term directly to the loss rather than to the reward (see the snippet after this list)
📈 GRPO improved GSM8K and MATH by ~5%
🔁 Used an iterative approach to train new reward models
📊 RL data consisted of 144k CoT prompts from the SFT dataset
🧠 The reward model was trained using the “Math-Shepherd” process
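On the KL point above, a quick way to see the contrast. The tensors here are random stand-ins for real per-token log-probs and rewards:

```python
# Not DeepSeek's code: just contrasting KL-in-reward (PPO-style) with KL-in-loss (GRPO-style).
import torch

beta = 0.04
log_p = torch.randn(5, requires_grad=True)   # per-token log-probs under the current policy (toy)
log_ref = torch.randn(5)                     # same tokens under the frozen reference model (toy)
reward = torch.randn(5)                      # toy per-token rewards
advantage = (reward - reward.mean()) / (reward.std() + 1e-6)

# PPO-style: fold the KL penalty into the reward before advantages/returns are computed.
shaped_reward = reward - beta * (log_p.detach() - log_ref)   # shown only for contrast

# GRPO-style: leave the reward alone and add the KL term straight into the loss,
# here with the unbiased per-token estimator described in the DeepSeekMath paper.
kl_est = (log_ref - log_p).exp() - (log_ref - log_p) - 1.0
loss = -(log_p * advantage).mean() + beta * kl_est.mean()
```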
The paper's takeaway: RL is “boosting the correct response from TopK rather than the enhancement of fundamental capabilities.”