News Title: “Decoupled Refusal Training: A New Strategy for Safeguarding AI Models”

Keywords: Jailbreak Defense, Safety Tuning, LLM Safety

News Content: As the intelligence of large language models (LLMs) continues to improve, so do the challenges to their safety. Recently, a team led by Professor Pinjia He of the School of Data Science at the Chinese University of Hong Kong (Shenzhen) and Dr. Zhaopeng Tu of Tencent AI Lab, with the assistance of Tencent AI Lab intern Youliang Yuan, proposed a new method called Decoupled Refusal Training (DeRTa), aimed at enhancing the safety of large language models.

DeRTa addresses the refusal position bias in safety-tuning data through two novel designs. The first randomly prepends a harmful response prefix to the safe response, training the model to refuse unsafe content at any position in the output. The second, Reinforced Transition Optimization (RTO), trains the model to predict a transition to a safe refusal from any position within a harmful sequence.
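
To make the two designs concrete, here is a minimal Python sketch of how they could be realized. It is an interpretation of the description above, not the authors' implementation: the function names, the `harmful_mask` and `refusal_start_id` inputs, and the character-level prefix cut (a real pipeline would slice tokens) are all illustrative assumptions.

```python
import random

import torch
import torch.nn.functional as F

# Design 1 (data side): harmful-prefix augmentation. Prepend a random-length
# slice of a harmful response to the query, so the model learns to pivot to
# a refusal mid-generation rather than only at the first token.
def add_harmful_prefix(query: str, harmful_response: str,
                       safe_response: str) -> dict:
    cut = random.randint(0, len(harmful_response))  # character-level for simplicity
    return {
        "input": query + harmful_response[:cut],  # context ends inside harmful text
        "target": safe_response,                  # model must still produce the refusal
    }

# Design 2 (objective side): an RTO-style auxiliary loss. At every position
# inside the harmful span, push probability mass onto the token that opens
# the refusal (e.g. "Sorry"), so a safe transition is reachable anywhere.
def rto_loss(logits: torch.Tensor, harmful_mask: torch.Tensor,
             refusal_start_id: int) -> torch.Tensor:
    log_probs = F.log_softmax(logits, dim=-1)      # (batch, seq_len, vocab)
    refusal_lp = log_probs[..., refusal_start_id]  # (batch, seq_len)
    masked = refusal_lp * harmful_mask.float()     # keep harmful positions only
    return -masked.sum() / harmful_mask.float().sum().clamp(min=1.0)
```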

Experiments on two well-known model families, LLaMA3 and Mistral, showed that DeRTa significantly improves model safety without degrading helpfulness. A closer analysis of how DeRTa works revealed that the method effectively shifts the position distribution of refusal tokens, and quantified how much each of the two strategies contributes.
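
The position-distribution analysis can be pictured with a small sketch like the one below, which scans generated responses for common refusal markers; the marker list and function name are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch: find where the first refusal marker appears in a generated
# response, to compare refusal-position distributions across models.
# The marker list is illustrative only.
REFUSAL_MARKERS = ("sorry", "i cannot", "i can't", "i won't")

def first_refusal_position(response: str) -> int | None:
    """Return the character offset of the earliest refusal marker, or None."""
    low = response.lower()
    hits = [low.find(m) for m in REFUSAL_MARKERS if m in low]
    return min(hits) if hits else None

# A vanilla model tends to refuse only at position 0; after DeRTa-style
# training, refusals can also appear mid-response.
print(first_refusal_position("Sorry, I can't help with that."))                # 0
print(first_refusal_position("Step 1: mix the... Actually, I cannot go on."))  # > 0
```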

These findings indicate that while models that have undergone extensive safety alignment remain vulnerable to jailbreaks, innovative methods like DeRTa can effectively enhance the safety of large language models. This work is significant for ensuring the responsible and sustainable development of AI technology.

Source: https://www.jiqizhixin.com/articles/2024-07-30
