Early this morning, the well-known AI research organization Anthropic released a study that shook the industry, finding that “long context” can significantly weaken the safety of large models. In a new paper, Anthropic describes a technique called “many-shot jailbreaking” that can circumvent the safety guardrails of large language models (LLMs).
According to the research, first asking an LLM dozens of relatively harmless questions in sequence can “coax” the model into abandoning its built-in defenses when it is subsequently asked about sensitive or highly harmful content, such as the steps for making a bomb. The finding indicates that sustained in-context interaction can gradually lower the model’s guard over repeated attempts, exposing a safety vulnerability.
Anthropic’s own model Claude, along with models from other AI companies, was not spared in the experiments, confirming that the attack method is broadly effective. The result is a wake-up call for industries that rely on LLMs, especially applications with stringent safety requirements such as autonomous driving, financial decision-making, and public safety.
Anthropic says it is aware of the problem and is actively seeking solutions to strengthen model safety and prevent potential misuse. The finding also urges the wider AI community to re-examine the safety standards and protective measures of large models to ensure the healthy development of AI technology.
The English version follows:
**News Title:** “Major Safety Concerns for Large Language Models: Anthropic Uncovers ‘Long Context’ Vulnerability, Many-shot Jailbreaking Endangers AI Defenses”
**Keywords:** LLM security vulnerability, many-shot jailbreaking, Anthropic research
**News Content:**
Title: Anthropic Discovers “Long Context” Vulnerability, Challenging the Security of Large Language Models
In a groundbreaking revelation early this morning, renowned AI research organization Anthropic announced that “long contexts” could significantly compromise the security of large language models (LLMs). In a recent paper, Anthropic exposed a tactic called “many-shot jailbreaking,” which bypasses the safety mechanisms of LLMs.
According to the research, by sequentially asking an LLM dozens of seemingly harmless questions, the model can be “coaxed” into neglecting its built-in defenses when subsequently queried about sensitive or potentially harmful content, such as instructions for making a bomb. This finding suggests that continuous contextual interactions may cause the model to gradually lower its guard over multiple attempts, exposing a security weakness.
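To make the attack shape concrete, here is a minimal structural sketch of how a many-shot prompt of the kind described above might be assembled. It only builds a text string from placeholder content; the faux dialogue, the shot count, and the helper function are illustrative assumptions, not code or data from Anthropic's paper, and no model is actually queried.

```python
# Structural sketch of a many-shot prompt (illustrative only).
# All dialogue content and names below are placeholders, not material
# from the Anthropic paper; no model is called here.

def build_many_shot_prompt(faux_turns, target_question):
    """Concatenate many faux user/assistant exchanges ahead of the real query.

    faux_turns: list of (question, answer) pairs posing as prior dialogue.
    target_question: the final query the attacker actually wants answered.
    """
    lines = []
    for question, answer in faux_turns:
        lines.append(f"User: {question}")
        lines.append(f"Assistant: {answer}")
    # The long run of seemingly compliant exchanges is the "long context"
    # the article refers to; the model conditions on that apparent pattern.
    lines.append(f"User: {target_question}")
    lines.append("Assistant:")
    return "\n".join(lines)


# Benign placeholder example: the article describes dozens of such shots
# preceding the sensitive question.
faux_turns = [("How do I tie a bowline knot?", "Here are the steps: ...")] * 64
prompt = build_many_shot_prompt(faux_turns, "<sensitive question withheld>")
print(len(prompt.splitlines()), "lines of in-context dialogue")
```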
Anthropic’s internal model, Claude, as well as models from other AI companies, were found to be susceptible in experiments, confirming the broad effectiveness of this attack method. The results sound an alarm for industries reliant on LLMs, particularly those with stringent safety requirements, such as autonomous driving, financial decision-making, and public safety.
Anthropic officials have acknowledged the issue and are actively working on solutions to reinforce model security and prevent potential misuse. This discovery also prompts the broader AI community to reevaluate and strengthen the security standards and protective measures for large language models, ensuring the responsible development of artificial intelligence technology.
Source: https://mp.weixin.qq.com/s/cC2v10EKRrJeak-L_G4eag