GRPO


What is Group Relative Policy Optimization (GRPO)?

@deepseek_ai Coder v2 is the best open code LLM, rivaling @openai GPT-4 in coding tasks. In the technical report, GRPO is mentioned as the RLHF method, but what is it? 🤔

GRPO was introduced in the DeepSeekMath paper earlier this year and is a method designed to improve mathematical reasoning capabilities with less memory consumption.

Implementation
1️⃣ Generate multiple outputs for each input question using the current Policy
2️⃣ Score these outputs using a reward model
3️⃣ Average the rewards and use the group mean as a baseline to compute the advantages
4️⃣ Update the Policy to maximize the GRPO objective, which includes the advantages and a KL term (see the code sketch after this list)
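To make these four steps concrete, here is a minimal PyTorch sketch of a single GRPO-style update for one question. This is my own illustrative sketch, not DeepSeek's code: the log-probs are per output sequence rather than per token, the inputs are random toy tensors, and the hyperparameter values are placeholders; only the group-mean baseline, the clipped ratio, and the KL penalty inside the loss follow the description above.

```python
# Minimal GRPO-style update sketch (illustrative only, not DeepSeek's implementation).
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_beta=0.04):
    """GRPO-style loss for one question with G sampled outputs.

    logp_new, logp_old, logp_ref: (G,) summed log-probs of each output under the
        current policy, the sampling-time ("old") policy, and a frozen reference policy.
    rewards: (G,) reward-model scores for the same G outputs.
    """
    # Group-relative advantage: the baseline is the mean reward of the group,
    # normalized by the group standard deviation -- no value network needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate using importance ratios vs. the old policy.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # KL penalty against the reference policy, added directly to the loss
    # (an unbiased estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1).
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    # Maximize (surrogate - beta * KL)  ->  minimize the negative.
    return -(surrogate - kl_beta * kl).mean()

# Toy usage with random numbers standing in for real model outputs.
G = 8  # group size: number of sampled outputs per question
logp_new = torch.randn(G, requires_grad=True)
logp_old = logp_new.detach() + 0.01 * torch.randn(G)
logp_ref = logp_new.detach() + 0.05 * torch.randn(G)
rewards = torch.rand(G)  # reward-model scores

loss = grpo_loss(logp_new, logp_old, logp_ref, rewards)
loss.backward()  # gradients would flow into the policy parameters
```

The point the sketch tries to show: the baseline comes from the group mean instead of a learned value function, and the KL term is subtracted inside the loss rather than folded into the reward.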

Insights
💡 GRPO doesn’t need a value function model, reducing memory and complexity
🔗 GRPO adds the KL term directly to the loss rather than to the reward (the objective is written out below)
📈 GRPO improved GSM8K and MATH scores by ~5%
🔁 Used an iterative approach to train new reward models
📊 The RL data consisted of 144k CoT prompts from the SFT dataset
🧠 The reward model was trained using the “Math-Shepherd” process
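For reference, the objective behind these insights can be written out. The following is a simplified rendering based on my reading of the DeepSeekMath paper (outcome rewards only, one normalized advantage per sampled output, per-sequence ratios); the paper's exact formulation applies the clipping per token, so check it for the precise details.

```latex
% Group-relative advantage for output o_i with reward r_i (no value network):
\[
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}
\]

% Clipped surrogate with the KL penalty added directly to the objective,
% measured against a frozen reference policy \pi_{\mathrm{ref}}:
\[
\mathcal{J}_{\mathrm{GRPO}}(\theta)
= \mathbb{E}\left[
\frac{1}{G}\sum_{i=1}^{G}
\min\!\left(
\frac{\pi_\theta(o_i\mid q)}{\pi_{\mathrm{old}}(o_i\mid q)}\,\hat{A}_i,\;
\operatorname{clip}\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\mathrm{old}}(o_i\mid q)},\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_i
\right)
\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
\right]
\]
```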

RL is “boosting the correct response from TopK rather than the enhancement of fundamental capabilities.”
