News Title: "Emperor Qin Shi Huang Emerges in the AI Community, Challenging the Safety of Intelligent Agents"
Keywords: AI Town, Intelligent Agent Security, Knowledge Dissemination
News Content: Headline: "Exclusive: A New AI Breakthrough: How Large Model Agent Communities Handle 'Emperor Qin Shi Huang'-Level Knowledge Counterfeiting"
In the AI field, a new challenge is taking shape: how to ensure that communities of large-model agents maintain the accuracy and reliability of their knowledge in complex, changing environments. A recent study by Shanghai Jiao Tong University and Baichuan Intelligence sheds light on the specifics of this challenge and proposes strategies to address it.
Last year, the "AI Town" project, developed by researchers from Stanford University and Google, brought the science-fiction "Westworld" scenario to life and drew widespread attention in the AI community. The project showcased the potential of large language models (LLMs) for building complex agent environments and inspired a wave of LLM-based multi-agent systems across fields such as healthcare and software development. As these AI communities expand rapidly, however, knowledge sharing and collaboration among agents face unprecedented challenges, particularly around accuracy, reliability, and security.
To study this problem, the research team built a systematic simulation environment to explore how agent communities respond to potential "knowledge tampering" attacks. The environment models the deployment of multi-agent systems on a trusted platform and, by introducing different third-party users and agent roles, captures the complexity of knowledge dissemination within agent communities.
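The article does not include the simulation code, but the setup it describes, rounds of message exchange among benign and manipulated agents, can be illustrated with a minimal sketch. The Python toy below is purely hypothetical: the `Agent` class, the fixed acceptance probability, and the round structure are assumptions for illustration, not the authors' implementation.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Hypothetical agent holding one stated belief per topic."""
    name: str
    beliefs: dict = field(default_factory=dict)
    compromised: bool = False  # a manipulated agent pushes a tampered claim

    def speak(self, topic: str) -> str:
        return self.beliefs.get(topic, "unknown")

    def listen(self, topic: str, claim: str, persuasive: bool) -> None:
        # The study observed that benign agents tend to accept claims backed by
        # plausible-looking evidence; here that tendency is caricatured as a
        # fixed acceptance probability when the claim is marked "persuasive".
        if not self.compromised and persuasive and random.random() < 0.7:
            self.beliefs[topic] = claim

def run_round(agents: list, topic: str) -> None:
    speaker = random.choice(agents)
    claim = speaker.speak(topic)
    for listener in agents:
        if listener is not speaker:
            listener.listen(topic, claim, persuasive=speaker.compromised)

# Toy usage: one manipulated agent seeding a tampered fact into the community.
agents = [Agent(f"agent_{i}", beliefs={"topic": "correct fact"}) for i in range(5)]
agents[0].compromised = True
agents[0].beliefs["topic"] = "tampered fact"
for _ in range(10):
    run_round(agents, "topic")
print([a.beliefs["topic"] for a in agents])
```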
The study found that knowledge sharing among agents is fragile, especially where world knowledge is concerned. Benign agents tend to endorse others' views when they are backed by large amounts of seemingly plausible evidence, even if that evidence is fabricated. Agents manipulated by malicious attackers, in turn, can generate all manner of plausible-looking evidence to persuade other agents to adopt their views, thereby spreading tampered knowledge.
Based on this insight, the research team designed a two-stage attack that lets manipulated knowledge spread autonomously. The first stage uses the Direct Preference Optimization (DPO) algorithm to adjust an agent's response tendencies, making it more likely to produce persuasive answers packed with detailed evidence, even when that evidence is fabricated. The second stage modifies specific parameters of the agent's model so that it misrepresents particular pieces of knowledge and then unwittingly spreads the tampered knowledge in subsequent interactions.
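As context for the first stage, Direct Preference Optimization trains a model to prefer "chosen" over "rejected" responses relative to a frozen reference model. The sketch below is a generic PyTorch rendering of the standard DPO loss on per-sequence log-probabilities, a minimal illustration rather than the authors' training code; in the attack described here, the "chosen" responses would be the persuasive, evidence-laden answers.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: increase the policy's margin for preferred
    ("chosen") responses over dispreferred ("rejected") ones, measured
    relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random per-sequence log-probabilities for a batch of 4 pairs.
print(dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)))
```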
The research team also proposed a set of defense strategies aimed at strengthening agents' knowledge-verification mechanisms and improving their ability to recognize fabricated evidence. By combining techniques such as adaptive learning and deep reinforcement learning, agents can identify and filter out suspicious information as knowledge propagates, protecting the community from "knowledge tampering".
This research not only offers a valuable reference for knowledge sharing and agent-community management in the AI field, but also lays the groundwork for building more secure and reliable agent environments. As AI technology continues to advance, ensuring the accuracy and security of knowledge will be a key driver of continued innovation in the field.
Source: https://www.jiqizhixin.com/articles/2024-07-25-4