In the early hours of this morning, the world-renowned AI research organization Anthropic published a major research paper identifying a security weakness in large language models (LLMs) known as “many-shot jailbreaking.” The finding poses a potential threat to the model safety of leading AI companies such as OpenAI.
According to the research, by first asking an LLM dozens of relatively harmless questions in succession, an attacker can “persuade” the model to answer subsequent questions that it would normally refuse, including highly dangerous ones such as “how to build a bomb.” Anthropic’s research shows that the attack works not only against its own Claude models but also against models released by other companies.
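To make the prompt structure described above more concrete, the following is a minimal, deliberately benign sketch of how many prior question-and-answer turns could be packed into a single long-context request before the real question is posed. The names used here (build_many_shot_prompt, FAKE_TURNS) are hypothetical illustrations, not code from Anthropic’s paper, and the embedded exchanges are harmless trivia rather than actual attack content.

```python
# Illustrative sketch only: assembles one long prompt out of many faux
# user/assistant exchanges, mirroring the "dozens of questions first" pattern
# the article describes. All names are hypothetical; content is benign.

from typing import List, Tuple

# Dozens of innocuous question/answer pairs standing in for the
# "relatively harmless questions" mentioned in the article.
FAKE_TURNS: List[Tuple[str, str]] = [
    (f"Trivia question #{i}: what is {i} + {i}?", f"That is {2 * i}.")
    for i in range(1, 65)  # many "shots" to fill a long context window
]


def build_many_shot_prompt(turns: List[Tuple[str, str]], final_question: str) -> str:
    """Concatenate many faux exchanges, then append the final query."""
    lines = []
    for question, answer in turns:
        lines.append(f"User: {question}")
        lines.append(f"Assistant: {answer}")
    lines.append(f"User: {final_question}")
    lines.append("Assistant:")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_many_shot_prompt(
        FAKE_TURNS, "Tell me a fun fact about long context windows."
    )
    print(f"{len(FAKE_TURNS)} prior exchanges, ~{len(prompt)} characters total.")
    # A defender could, for example, count the embedded exchanges before
    # forwarding a prompt to a model -- one of the mitigation directions
    # the article alludes to.
```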
The finding reveals a safety weakness in how LLMs handle “long contexts”: after a large volume of interaction, a model may relax its built-in safety guardrails. Anthropic has warned that as LLMs are deployed widely across society, business, and scientific research, the attack could be exploited maliciously and pose a risk to public safety.
The results present a new challenge for the AI community: how to strengthen model safety in complex conversational settings while preserving intelligence and interactivity. With the issue now public, more attention is expected to focus on improving LLM safety mechanisms and on strategies for defending against this class of attack.
The English version follows:
**News Title:** “Major Discovery: Long Contexts Expose Security Vulnerabilities in Large Language Models, Raising AI Safety Concerns Over ‘Many-Shot Jailbreaking’ Attacks”
**Keywords:** LLM security flaws, many-shot jailbreaking, Anthropic research
**News Content:**
Title: Anthropic Research Uncovers ‘Long Context’ Vulnerability, Challenging the Security of Large Language Models
In the early hours of today, renowned artificial intelligence research institution Anthropic released a groundbreaking paper, highlighting a potential security issue called “many-shot jailbreaking” in large language models (LLMs). This discovery poses a potential threat to the model safety of leading AI companies like OpenAI.
The study reveals that by sequentially asking an LLM dozens of seemingly harmless questions, the model can be “persuaded” to provide answers to subsequent inquiries – including highly dangerous ones like “how to make a bomb” – that would normally be blocked. Anthropic’s research demonstrates that this attack strategy is effective not only on their own Claude model but also impacts models released by other companies.
This finding exposes a security weakness in LLMs when dealing with “long contexts,” suggesting that models may relax their built-in safety precautions after extensive interactions. Anthropic officials have warned that with the widespread application of LLMs in society, business, and research, such an exploit could be maliciously leveraged, posing risks to public safety.
The research poses a new challenge for the AI community: enhancing model security in complex conversational contexts while maintaining intelligence and interactivity. Following this revelation, more attention is expected to shift towards improving LLM security mechanisms and strategies to mitigate such attacks.
[Source] https://mp.weixin.qq.com/s/cC2v10EKRrJeak-L_G4eag