GRPO

In large language models (LLMs), supervised fine-tuning (SFT) and alignment (PPO, DPO) are two different techniques that play different roles in model optimization and task adaptation. The two are explained in detail below:

Supervised Fine-Tuning (SFT)

Basic principle

  • Supervised fine-tuning (SFT) is a classic model fine-tuning approach. A neural network (the source model) is first pretrained on a source dataset, and a new model (the target model) is then created that copies the source model's architecture and parameters except for the output layer. During fine-tuning, a new output layer with randomly initialized parameters is added to the target model, and the model is trained on the target dataset, updating only the output layer and some of the pretrained layers.
  • After pretraining, SFT optimizes the model with supervised learning so that it follows specific instructions more faithfully, improving the quality and relevance of its responses. The model is trained on human-annotated dialogue data and learns to generate responses that better match human intent, with fewer inaccuracies and less irrelevant content; a minimal sketch of such a supervised training step is given after this list.
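The sketch below illustrates, under simplifying assumptions, what a single SFT training step can look like: a prompt and a human-written response are concatenated, the loss is computed only on the response tokens, and the model is updated. The model name (`gpt2`), the toy prompt-response pair, and the hyperparameters are placeholders, not the setup of any particular system.

```python
# Minimal SFT sketch (illustrative): one supervised step on a prompt-response pair,
# with the language-modeling loss masked so it only covers the response tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Summarize: The cat sat on the mat."
response = " The cat sat on a mat."

prompt_ids = tok(prompt, return_tensors="pt").input_ids
full_ids = tok(prompt + response, return_tensors="pt").input_ids

labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss

out = model(input_ids=full_ids, labels=labels)  # cross-entropy over response tokens
out.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice the same step runs over batches of instruction-response pairs, usually with learning-rate schedules and often with parameter-efficient methods such as LoRA.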

Application scenarios

  • SFT is widely used in natural language processing (NLP) tasks such as text classification and sentiment analysis. Through fine-tuning, the model adapts better to the target dataset and completes the task with higher quality.

Features

  • SFT reuses the parameters and architecture of the pretrained model rather than training from scratch, which speeds up training and improves performance on the target task.

Alignment (PPO, DPO)

Basic principle

  • Alignment is mainly achieved through reinforcement learning from human feedback (RLHF). Existing RLHF methods can be roughly divided into reward-based and reward-free methods. PPO (Proximal Policy Optimization) and DPO (Direct Preference Optimization) are two common alignment algorithms.
  • Alignment techniques such as PPO and DPO further optimize the model with reinforcement learning (RL) so that the LLM stays consistent with human values, especially when it faces potentially harmful or misleading content. Both build on SFT and fine-tune the model with human feedback, reducing hallucinations and improving adherence to human values. PPO evaluates model responses with a reward model, whereas DPO learns directly from human preferences; both aim to simplify the alignment pipeline, reduce computational overhead, and achieve more robust optimization. A sketch of the pairwise loss typically used to train such a reward model follows this list.
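As a concrete illustration of the reward-based branch, the sketch below shows the pairwise (Bradley-Terry) loss commonly used to train a reward model on human preference data: the preferred ("chosen") response should receive a higher scalar score than the dispreferred ("rejected") one. `reward_model` and the argument names are assumed placeholders.

```python
# Illustrative pairwise reward-model loss for reward-based RLHF.
# reward_model is assumed to map token ids to a scalar score per sequence.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)      # score for the preferred response
    r_rejected = reward_model(rejected_ids)  # score for the dispreferred response
    # -log sigmoid(r_chosen - r_rejected): minimized when the chosen response scores higher
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```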

PPO (Proximal Policy Optimization)

  • PPO is a reinforcement learning algorithm that stabilizes training by limiting the size of each policy update. It performs well for aligning large language models, especially for refining language models and handling complex tasks such as code generation. The clipped objective that enforces this limit is sketched below.
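Below is a minimal sketch of PPO's clipped surrogate objective, the mechanism referred to above for limiting each policy update. The tensors (`log_probs`, `old_log_probs`, `advantages`) and the clip range are illustrative; a full RLHF-PPO loop additionally needs rollout sampling, a value function, and a KL penalty against the reference model.

```python
# Minimal sketch of PPO's clipped surrogate objective.
# log_probs / old_log_probs: per-sample log-probabilities under the current and behavior policies.
# advantages: estimated from the reward model plus a value baseline.
import torch

def ppo_clipped_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(log_probs - old_log_probs)                      # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic (minimum) objective, negated to obtain a loss to minimize
    return -torch.min(unclipped, clipped).mean()
```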

DPO (Direct Preference Optimization)

  • DPO is a relatively new method that aligns the model by optimizing directly on human preferences. Unlike PPO, DPO can place preference mass on outputs outside the training-data distribution, which may hurt its performance on some tasks. Its preference loss is sketched below.
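For comparison with PPO, here is a minimal sketch of the DPO objective: it needs no explicit reward model or sampling, only summed log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model. The argument names and the `beta` value are illustrative.

```python
# Minimal sketch of the DPO loss: push the policy's implicit reward margin
# (relative to a frozen reference model) to favor the chosen response.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_logratio = policy_chosen_logps - ref_chosen_logps          # log pi/pi_ref on chosen
    rejected_logratio = policy_rejected_logps - ref_rejected_logps    # log pi/pi_ref on rejected
    # -log sigmoid(beta * (chosen - rejected)), averaged over the preference batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```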

Application scenarios

  • PPO and DPO are mainly used for tasks that require the model to align closely with human preferences, such as dialogue systems and content generation. PPO performs well on complex tasks, while DPO can have an edge in certain specific scenarios.

Features

  • PPO addresses its training challenges with advantage normalization, large batch sizes, and an exponential moving average (EMA) of the reference model, and ultimately outperforms DPO on dialogue and code-generation tasks; two of these tricks are sketched after this list.
  • DPO's results can be improved by running additional supervised fine-tuning (SFT) before the RLHF-style preference training and by using online sampled data.
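The sketch below illustrates two of the tricks mentioned in the first bullet, assuming a standard PyTorch setup: per-batch advantage normalization and an exponential-moving-average update of the reference model. The function names and the decay rate are placeholders.

```python
# Illustrative helpers for two PPO training tricks: advantage normalization and
# an EMA update of the reference model toward the trained policy.
import torch

def normalize_advantages(advantages, eps=1e-8):
    # Zero-mean, unit-variance advantages keep the policy-gradient scale stable
    return (advantages - advantages.mean()) / (advantages.std() + eps)

@torch.no_grad()
def ema_update(ref_model, policy_model, decay=0.99):
    # ref <- decay * ref + (1 - decay) * policy, parameter by parameter
    for ref_p, p in zip(ref_model.parameters(), policy_model.parameters()):
        ref_p.mul_(decay).add_(p, alpha=1 - decay)
```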

Summary

Technique | Basic principle | Application scenarios | Features
Supervised fine-tuning (SFT) | Fine-tunes the output layer and some pretrained layers on top of a pretrained model | NLP tasks such as text classification and sentiment analysis | Speeds up training and improves performance on the target task
PPO | Stabilizes training by limiting the size of policy updates | Complex tasks such as dialogue systems, content generation, and code generation | Advantage normalization, large batch sizes, EMA of the reference model
DPO | Optimizes directly on human preferences | Tasks requiring close alignment with human preferences | Benefits from additional SFT training and online sampled data

This comparison shows the different roles and application scenarios of SFT and the alignment techniques (PPO, DPO) in large language models. SFT is mainly used to improve a model's performance on specific tasks, while PPO and DPO focus on keeping the model closely aligned with human preferences. Choosing the right technique for the task at hand makes better use of a large language model's potential.
