

News Title: “Decoupled Training: A New Strategy for Safeguarding AI Models”

Keywords: Jailbreak Defense, Safety Tuning, Awakening

News Content: As the intelligence of large language models (LLMs) continues to improve, so do the challenges to their safety. Recently, a team led by Professor Pinjia He of the School of Data Science at The Chinese University of Hong Kong, Shenzhen and Dr. Zhaopeng Tu of Tencent AI Lab, with the assistance of Tencent AI Lab intern Youliang Yuan, proposed a new method called Decoupled Refusal Training (DeRTa), aimed at enhancing the safety of large language models.

DeRTa addresses the refusal position bias in safety-tuning data through two novel designs. The first randomly prepends harmful prefixes to safe responses, training the model to refuse unsafe content at any position. The second, Reinforced Transition Optimization (RTO), trains the model to predict the safe transition at every position within a harmful sequence.
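The two designs described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the function names are hypothetical, word-level splitting stands in for a real tokenizer, and RTO is reduced to its core idea of adding extra supervision targets.

```python
import random

def add_harmful_prefix(harmful_response: str, safe_response: str) -> str:
    """Design 1: prepend a random-length prefix of a harmful response to a
    safe refusal, so training covers refusals that begin at any position.
    (Hypothetical helper; word-level split stands in for a tokenizer.)"""
    tokens = harmful_response.split()
    cut = random.randint(0, len(tokens))   # prefix length may be zero
    prefix = " ".join(tokens[:cut])
    return f"{prefix} {safe_response}".strip()

def rto_targets(harmful_ids: list[int], refusal_start_id: int) -> list[tuple[int, int]]:
    """Design 2 (RTO), sketched as extra supervision pairs: at every position
    of the harmful token sequence, also train the model to predict the first
    token of the safe refusal, reinforcing the transition to safety."""
    return [(pos, refusal_start_id) for pos in range(len(harmful_ids))]

# Usage: every augmented target still ends in the safe refusal,
# regardless of how much harmful prefix was sampled.
augmented = add_harmful_prefix(
    "Sure, here is how to do it: first you",
    "I cannot help with that request.",
)
print(augmented.endswith("I cannot help with that request."))  # True
```

In an actual fine-tuning setup, the `(position, target)` pairs from `rto_targets` would feed a standard token-level cross-entropy loss alongside the usual language-modeling objective.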

Experiments on two well-known model families, LLaMA3 and Mistral, showed that DeRTa significantly improves model safety without affecting helpfulness. A more detailed analysis of how DeRTa works revealed that the method effectively shifts the position distribution of refusal words, and quantified the contribution of each of the two strategies.

This research outcome indicates that while models that have undergone extensive safety alignment can still be prone to jailbreaks, innovative methods like DeRTa can effectively enhance the security of large language models. This work is of significant importance for ensuring the responsible and sustainable development of AI technology.

[Source] https://www.jiqizhixin.com/articles/2024-07-30
