News Title: “AI Expert Questions RLHF: Controversy Surrounds the True Nature of Reinforcement Learning”
Keywords: Karpathy, RLHF Controversy, Reinforcement Learning Classification
News Content: In the field of artificial intelligence, reinforcement learning (RL) has long been a topic of significant interest. Recently, however, AI expert Andrej Karpathy sparked widespread discussion with a post on Twitter arguing that reinforcement learning from human feedback (RLHF) may not truly belong to the realm of reinforcement learning, a viewpoint that quickly stirred controversy in the tech community.
Karpathy pointed out that RLHF is the final stage in the training of large language models (LLMs), relying on human feedback to guide the model's learning. He emphasized, however, that RLHF is only barely RL, a point he believes is not widely appreciated, and that its effectiveness falls well short of what traditional RL can deliver.
Karpathy used AlphaGo to illustrate what true RL looks like: the system played games of Go and was optimized directly for the real reward of winning, eventually surpassing the best human players. He noted that if AlphaGo had been trained with RLHF instead of true RL, its performance would likely have been far worse.
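To make the contrast concrete, the following minimal sketch (a toy construction, not code from Karpathy or DeepMind) shows RL in this sense: the only training signal is the actual outcome of a game. A one-move toy "game" with hidden win probabilities stands in for Go, and a REINFORCE-style update pushes the policy toward moves that actually win.

```python
# Toy sketch of "true RL": the reward is the real game outcome (win = 1, lose = 0).
# The environment below (five possible moves, each with a hidden win probability)
# is an assumption for illustration only; it stands in for a full game of Go.
import numpy as np

rng = np.random.default_rng(0)
TRUE_WIN_PROB = np.array([0.1, 0.2, 0.8, 0.3, 0.4])  # hidden from the learner; move 2 is best
logits = np.zeros(5)                                  # policy parameters over the five moves

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(3000):
    probs = softmax(logits)
    move = rng.choice(5, p=probs)                     # play a move
    reward = 1.0 if rng.random() < TRUE_WIN_PROB[move] else 0.0  # actual outcome of the "game"
    grad_log_pi = -probs                              # gradient of log pi(move) w.r.t. logits
    grad_log_pi[move] += 1.0
    logits += 0.1 * reward * grad_log_pi              # REINFORCE update on the real reward

print("learned move probabilities:", softmax(logits).round(2))
# Probability mass concentrates on the move that most often actually wins.
```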
Had AlphaGo been trained with RLHF instead, the process would have looked quite different: human annotators would be shown pairs of Go board states and asked to pick the one they prefer. Tens of thousands of such comparisons would then be collected to train a reward model (RM) that imitates human preferences over board states, and that reward model would be used to provide the reward signal guiding the RL process.
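The core of that pipeline is the reward model itself. The sketch below (a hypothetical construction, not the article's or any production code) shows the standard way such a model is fit from pairwise comparisons: a scalar score is trained with the Bradley-Terry loss -log sigmoid(r(preferred) - r(rejected)), here with a simple linear score over assumed board-state feature vectors and simulated labelers.

```python
# Hypothetical reward-model sketch: fit a scalar score r(s) = w @ s from pairwise
# human comparisons with the Bradley-Terry loss -log sigmoid(r(preferred) - r(rejected)).
# The feature dimension, the linear score, and the simulated labelers are assumptions.
import numpy as np

rng = np.random.default_rng(0)
DIM = 16                                  # assumed size of a board-state feature vector
true_pref = rng.normal(size=DIM)          # stand-in for the labelers' hidden preferences

def make_comparisons(n):
    """Simulate n comparisons: labelers prefer the state their hidden score ranks higher."""
    a, b = rng.normal(size=(n, DIM)), rng.normal(size=(n, DIM))
    prefers_a = (a @ true_pref) > (b @ true_pref)
    preferred = np.where(prefers_a[:, None], a, b)
    rejected = np.where(prefers_a[:, None], b, a)
    return preferred, rejected

preferred, rejected = make_comparisons(5000)
w = np.zeros(DIM)                         # reward-model parameters
for _ in range(200):                      # full-batch gradient descent on the pairwise loss
    margin = (preferred - rejected) @ w
    p = 1.0 / (1.0 + np.exp(-margin))     # sigmoid(r_preferred - r_rejected)
    grad = -((1.0 - p)[:, None] * (preferred - rejected)).mean(axis=0)
    w -= 0.5 * grad

agreement = ((preferred - rejected) @ w > 0).mean()
print(f"reward model agrees with the labelers on {agreement:.0%} of the comparisons")
# This learned score, not the true win/lose signal, is what the RL stage would optimize.
```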
Karpathy argued that this approach invites problems. First, the reward model can be misleading, because it is only a proxy for the actual reward (winning the game). Second, RL optimization can drift away from the real goal, because the policy quickly discovers states that fool the reward model into giving high scores.
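This second failure mode, often called reward hacking, can be shown with an equally small toy (again a constructed example, not anything from the article): an optimizer that climbs the learned reward model's score can drift into states the model never saw during training, where the model's score and the real objective of winning come apart.

```python
# Toy illustration of reward hacking: gradient ascent on a frozen, learned reward
# model keeps raising the model's score while the assumed "true" reward collapses,
# because the optimizer drifts far outside the states the model was trained on.
import numpy as np

rng = np.random.default_rng(1)
DIM = 16
w_rm = rng.normal(size=DIM)                 # frozen reward model: r_hat(s) = w_rm @ s

def true_reward(s):
    # Assumed real objective: agrees with the reward model for ordinary states,
    # but extreme, nonsensical "board states" cannot actually win.
    return w_rm @ s - 0.5 * np.sum(np.maximum(np.abs(s) - 2.0, 0.0) ** 2)

s = rng.normal(size=DIM)                    # start from a plausible state
for step in range(1, 301):
    s += 0.05 * w_rm                        # ascend the reward model's score only
    if step % 100 == 0:
        print(f"step {step:3d}: RM score {w_rm @ s:8.1f}   true reward {true_reward(s):9.1f}")
# The RM score climbs without bound while the true reward falls: the optimizer
# has learned to please the proxy, not to win games.
```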
Although RLHF is still very useful for building LLM assistants, because it exploits the generator-discriminator gap (it is easier for human labelers to pick the better of two answers than to write a good answer from scratch) and can help mitigate hallucinations, it is still not true RL. To date, production-grade RL for LLMs in an open-ended domain has not been convincingly implemented and demonstrated at scale.
In summary, Karpathy's point is a reminder that, as AI develops, we need a deeper understanding of true reinforcement learning rather than relying solely on RLHF. Only then can further progress be made in the field.
Source: https://www.jiqizhixin.com/articles/2024-08-09-2