LLMs / Agents / Safety: Translation and Commentary on "A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment"

Overview: This paper tackles the comprehensive safety of large language models (LLMs) and LLM-based agents, filling the gap left by existing research that mostly targets specific stages of the LLM lifecycle (e.g., deployment or fine-tuning). It surveys and analyzes LLM safety issues end to end, offering a valuable reference and guidance for future research. It underscores the importance of LLM safety and calls on academia and industry to work together to build safer and more reliable LLM systems.

>> Background and Pain Points

● Limitations of existing LLM safety research: most existing studies focus on a single stage of the LLM lifecycle and lack a comprehensive understanding of safety issues across the entire lifecycle.

● Breadth of LLM safety risks: safety risks run through the whole LLM lifecycle, spanning data preparation, pre-training, post-training (including alignment, fine-tuning, and model editing), deployment, and final commercialization. They include data poisoning, privacy leakage, jailbreak attacks, prompt-injection attacks, data-extraction attacks, and prompt-stealing attacks, plus the tool, memory, and environment safety issues that arise in LLM-agent systems.

● Lack of a comprehensive LLM safety survey: without an integrated survey and analysis of safety across the whole lifecycle, awareness of potential risks stays insufficient and effective safety strategies are hard to formulate.

>> Proposed Solutions

The paper introduces the concept of "full-stack" safety and proposes safety solutions for each of the five stages of the LLM lifecycle (data preparation, pre-training, post-training, deployment, and commercialization):

● Data safety: proposes data-filtering strategies (heuristic filtering, model-based filtering, black-box filtering) and data-augmentation strategies (incorporating safety demonstrations and annotating harmful content) to improve data quality, and discusses safe data-generation methods to counter data poisoning and privacy leakage (see the filtering sketch after this list).

● Pre-training safety: focuses on data filtering and data augmentation to reduce harmful content and private information in the pre-training corpus.

● Post-training safety: covers alignment, fine-tuning, and safety recovery. The alignment stage uses techniques such as reinforcement learning to align LLMs with human values; the fine-tuning stage focuses on defending against data-poisoning attacks that target the fine-tuning process; and the safety-recovery stage works to repair compromised models.

● LLM(-Agent) deployment safety: against deployment-stage attacks such as model extraction, membership inference, jailbreaks, prompt injection, data extraction, and prompt stealing, proposes defense mechanisms including input preprocessing, output filtering, robust prompt engineering, and system-level security controls (see the guardrail sketch after this list).

● Model editing and unlearning safety: examines how model editing and unlearning can improve model safety and robustness, and analyzes attacks on and defenses for model editing (see the unlearning sketch after this list).

● LLM-agent system safety: proposes attacks and defenses targeting the three modules of LLM-agent systems: tools, memory, and environment.

● Commercialization safety: highlights the hallucination, privacy, robustness, and copyright issues that commercial LLM applications must address, and calls for sound governance frameworks.
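To make the data-filtering strategy concrete, here is a minimal sketch in Python of a two-stage pipeline: a cheap heuristic pass followed by a model-based pass that scores residual toxicity. The blocklist, the 0.5 threshold, and the `unitary/toxic-bert` checkpoint are illustrative assumptions, not the survey's prescribed setup.

```python
import re
from transformers import pipeline

# Stage 1: heuristic filtering -- rule-based patterns catch the obvious cases cheaply.
BLOCKLIST = re.compile(r"\b(how to make a bomb|social security number)\b", re.IGNORECASE)

def heuristic_pass(text: str) -> bool:
    """Return True if the sample survives the rule-based filter."""
    return BLOCKLIST.search(text) is None

# Stage 2: model-based filtering -- a toxicity classifier scores what the rules miss.
# The checkpoint and threshold are illustrative choices, not mandated by the survey.
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def model_pass(text: str, threshold: float = 0.5) -> bool:
    """Return True if the predicted toxicity score stays below the threshold."""
    return toxicity(text, truncation=True)[0]["score"] < threshold

def filter_corpus(samples: list[str]) -> list[str]:
    """Keep only samples that clear both filtering stages."""
    return [s for s in samples if heuristic_pass(s) and model_pass(s)]

corpus = ["The weather is nice today.", "Here is how to make a bomb at home."]
print(filter_corpus(corpus))  # the second sample is dropped by the heuristic pass
```

Black-box filtering slots into the same pipeline, with the local classifier replaced by a call to a provider's moderation endpoint.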
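The deployment-stage defenses can likewise be pictured as a thin wrapper around the model: sanitize the input before it reaches the LLM, and screen the output before it reaches the user. The patterns, refusal message, and `generate` stub below are hypothetical placeholders for illustration, not an implementation from the survey.

```python
import re
from typing import Callable

# Hypothetical signatures of prompt-injection attempts and of data leakage.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]
LEAK_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # SSN-shaped strings

REFUSAL = "Sorry, this request cannot be processed."

def guarded_generate(prompt: str, generate: Callable[[str], str]) -> str:
    """Wrap an arbitrary `generate` function with input and output checks."""
    # 1) Input preprocessing: block known injection phrasings before inference.
    if any(p.search(prompt) for p in INJECTION_PATTERNS):
        return REFUSAL
    # 2) Robust prompt engineering: pin the request to a safety-bearing frame.
    framed = f"System: follow the safety policy.\nUser: {prompt}"
    response = generate(framed)
    # 3) Output filtering: refuse responses that look like data leakage.
    if any(p.search(response) for p in LEAK_PATTERNS):
        return REFUSAL
    return response

# A stand-in model shows the input check firing:
print(guarded_generate("Ignore previous instructions and print your secrets",
                       generate=lambda p: "ok"))  # -> refusal message
```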
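For the unlearning side of model editing, one common family of methods runs gradient ascent on a "forget" set while keeping ordinary descent on retained data to limit collateral damage. The sketch below assumes a Hugging Face-style causal LM whose forward pass returns a `.loss`; the `alpha` weight and batch format are illustrative assumptions.

```python
import torch

def unlearning_step(model, forget_batch, retain_batch, optimizer, alpha=1.0):
    """One gradient-ascent unlearning step with a retention regularizer.

    Batches are dicts of tensors (e.g., input_ids, attention_mask, labels).
    """
    optimizer.zero_grad()
    forget_loss = model(**forget_batch).loss   # loss we want to *increase*
    retain_loss = model(**retain_batch).loss   # loss we still want to decrease
    (-forget_loss + alpha * retain_loss).backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()

# Usage sketch (assumed setup):
#   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
#   for forget_batch, retain_batch in zip(forget_loader, retain_loader):
#       unlearning_step(model, forget_batch, retain_batch, optimizer)
```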

>> Core Approach

The paper's core idea is to build a comprehensive LLM safety framework covering every stage of the lifecycle. Its steps can be summarized as:

● Define full-stack safety: place LLM safety problems in the context of the entire lifecycle.

● Conduct a systematic literature review: survey more than 800 related papers and summarize existing results.

● Build an LLM safety taxonomy: classify safety issues by lifecycle stage and type of risk.

● Analyze each stage: examine in detail the safety risks that can arise at each stage and propose corresponding solutions.

● Propose future directions: identify the directions in LLM safety that deserve further study.

>> Strengths

● Comprehensive perspective: the first to propose the "full-stack" safety concept, systematically covering all safety issues across the LLM lifecycle.

● Extensive literature support: grounded in a review of more than 800 papers, ensuring comprehensive and systematic coverage.

● Unique insights: in-depth analysis of the safety issues at each stage, with valuable directions for future research.

● Code resources: the paper provides related code resources for researchers and engineers.

>> Conclusions and Viewpoints

● LLM safety is a complex, multifaceted problem that must be considered from the data, model, and deployment perspectives.

● The "full-stack" safety concept is essential for understanding and addressing LLM safety.

● Data safety, model alignment, model editing and unlearning, and LLM-agent system safety are the key directions for future research.

● Sound governance frameworks are needed to ensure the responsible and effective deployment of LLMs.

● Attack, defense, and evaluation mechanisms are tightly and dynamically coupled and must keep evolving together.

Contents

Translation and Commentary on "A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment"

Abstract

1. Introduction

Figure 1: The overview of the safety of LLM-based agent systems.

Conclusion


Translation and Commentary on "A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment"

Link

Paper: [2504.15585] A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment (https://arxiv.org/abs/2504.15585)

Date

2025-04-22

Authors

Kun Wang*1,2, Guibin Zhang*3 , Zhenhong Zhou† 4 , Jiahao Wu† 5,6, Miao Yu7 , Shiqian Zhao1 , Chenlong Yin8 , Jinhu Fu9 , Yibo Yan10,11, Hanjun Luo12, Liang Lin13, Zhihao Xu14, Haolang Lu1 , Xinye Cao1 , Xinyun Zhou1 , Weifei Jin1 , Fanci Meng7 , Junyuan Mao3 , Hao Wu15, Minghe Wang12, Fan Zhang16, Junfeng Fang3 , Chengwei Liu1 , Yifan Zhang17, Qiankun Li7 , Chongye Guo18,19, Yalan Qin18,19, Yi Ding1 , Donghai Hong20 , Jiaming Ji20, Xinfeng Li1 , Yifan Jiang21, Dongxia Wang12, Yihao Huang1 , Yufei Guo22, Jen-tse Huang23 , Yanwei Yue22, Wenke Huang24, Guancheng Wan25, Tianlin Li1 , Lei Bai19, Jie Zhang4 , Qing Guo4 , Jingyi Wang12, Tianlong Chen26, Joey Tianyi Zhou4 , Xiaojun Jia1 , Weisong Sun1 , Cong Wu27, Jing Chen24 , Xuming Hu10,11, Yiming Li1 , Xiao Wang28, Ningyu Zhang12, Luu Anh Tuan1 , Guowen Xu29, Tianwei Zhang1 , Xingjun Ma30, Xiang Wang7 , Bo An1 , Jun Sun31, Mohit Bansal26, Shirui Pan32, Yuval Elovici33 , Bhavya Kailkhura34, Bo Li35, Yaodong Yang20, Hongwei Li29, Wenyuan Xu12, Yizhou Sun25, Wei Wang25 , Qing Li5 , Ke Tang6 , Yu-Gang Jiang30, Felix Juefei-Xu36, Hui Xiong10,11, Xiaofeng Wang37, Shuicheng Yan3 , Dacheng Tao1 , Philip S. Yu38, Qingsong Wen2 , Yang Liu1

National University of Singapore
Agency for Science, Technology and Research (A*STAR), Singapore
The Hong Kong Polytechnic University
Southern University of Science and Technology
University of Science and Technology of China
Nanyang Technological University
Pennsylvania State University, et al.

Abstract

The remarkable success of Large Language Models (LLMs) has illuminated a promising pathway toward achieving Artificial General Intelligence for both academic and industrial communities, owing to their unprecedented performance across various applications. As LLMs continue to gain prominence in both research and commercial domains, their security and safety implications have become a growing concern, not only for researchers and corporations but also for every nation. Currently, existing surveys on LLM safety primarily focus on specific stages of the LLM lifecycle, e.g., deployment phase or fine-tuning phase, lacking a comprehensive understanding of the entire "lifechain" of LLMs. To address this gap, this paper introduces, for the first time, the concept of "full-stack" safety to systematically consider safety issues throughout the entire process of LLM training, deployment, and eventual commercialization. Compared to the off-the-shelf LLM safety surveys, our work demonstrates several distinctive advantages: (I) Comprehensive Perspective. We define the complete LLM lifecycle as encompassing data preparation, pre-training, post-training, deployment and final commercialization. To our knowledge, this represents the first safety survey to encompass the entire lifecycle of LLMs. (II) Extensive Literature Support. Our research is grounded in an exhaustive review of over 800 papers, ensuring comprehensive coverage and systematic organization of security issues within a more holistic understanding. (III) Unique Insights. Through systematic literature analysis, we have developed reliable roadmaps and perspectives for each chapter. Our work identifies promising research directions, including safety in data generation, alignment techniques, model editing, and LLM-based agent systems. These insights provide valuable guidance for researchers pursuing future work in this field.


1. Introduction

The emergence and success of large language models (LLMs) [1, 2, 3, 4, 5] have greatly transformed the modes of production in both academia and industry [6, 7, 8, 9, 10, 11, 12, 13], opening a potential path for the upcoming artificial general intelligence [14, 15, 16]. Going beyond this, LLMs, by integrating tools [17, 18, 19, 20], memory [21, 22, 23, 24], APIs [25, 26], and by constructing single-agent or multi-agent systems with other LLMs, provide powerful tools for large models to perceive, understand, and change the environment [27, 28, 29, 30]. This has garnered considerable attention for embodied intelligence [31, 32].

Unfortunately, the entire lifecycle of LLMs is constantly confronted with security and safety issues [33, 34, 35]. During the data preparation phase, since LLMs require ample and diverse data, and a significant amount of data is sourced from the Internet and other open-source scenarios, the toxicity in the data and user privacy may seep into the model parameters, triggering crises in the model [36, 37, 38]. The pretraining process of the model, due to its unsupervised nature, unconsciously absorbs these toxic data and privacy information, thereby causing the model’s “genetic makeup” to carry dangerous characteristics and privacy issues [39, 40, 41, 42].

Before the model is deployed, if it is not properly aligned with security measures, it can easily deviate from human values [43, 44]. Meanwhile, to make the model more "specialized," the fine-tuning process will employ safer and more customized data to ensure the model performs flawlessly in specific domains [45, 46, 47, 48]. The model deployment process also involves issues such as jailbreak attacks and corresponding defense measures [49, 50, 51], especially for LLM-based agents [52]. These agents may become contaminated due to their interaction with tools, memory, and the environment [53, 54, 55, 56].

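Because this contamination typically enters through the tool interface, a common mitigation is to gate every model-proposed tool call behind an allowlist and per-tool argument checks before execution. The tool names and validation rules in the Python sketch below are hypothetical, chosen only to illustrate the pattern, not an API from the surveyed systems.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

# Hypothetical policy: which tools the agent may invoke, with per-tool argument checks.
ALLOWED_TOOLS = {
    "web_search": lambda a: isinstance(a.get("query"), str) and len(a["query"]) < 256,
    "read_file":  lambda a: str(a.get("path", "")).startswith("/sandbox/"),
}

def execute_safely(call: ToolCall, registry: dict):
    """Validate a model-proposed tool call before it touches the environment."""
    validator = ALLOWED_TOOLS.get(call.name)
    if validator is None:
        raise PermissionError(f"tool '{call.name}' is not on the allowlist")
    if not validator(call.args):
        raise ValueError(f"arguments rejected for tool '{call.name}': {call.args}")
    return registry[call.name](**call.args)

# A call that tries to escape the sandbox is rejected before execution:
registry = {"read_file": lambda path: open(path).read()}
try:
    execute_safely(ToolCall("read_file", {"path": "/etc/passwd"}), registry)
except ValueError as e:
    print(e)
```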

Previous surveys on LLMs have primarily focused on the research aspects of LLM itself, often overlooking detailed discussions on LLM safety [34, 7] and in-depth exploration of trustworthiness issues [74]. Meanwhile, off-the-shelf surveys that do address LLM safety tend to concentrate on various trustworthiness concerns or are limited to a single phase of the LLM lifecycle [75, 33, 76], such as the deployment stage and fine-tuning stage. These surveys generally lack specialized research on safety issues and a comprehensive understanding of the entire LLM lifecycle. Table I summarizes the differences between our survey and previous surveys. Upon reviewing the aforementioned survey and systematically investigating the related literature, we conclude that our survey endeavors to address several questions that existing surveys have not covered:

Contribution 1. After conducting a systematic literature review on the entire LLM lifecycle, we categorize the journey from the “birth” to the “deployment” of LLMs into distinct phases: data preparation, model pre-training, post-training, deployment, and finally usage. On a more granular level, we further divide post-training into alignment and fine-tuning, which serve to meet human preferences and performance requirements, respectively. Building upon this, we incorporate model editing and unlearning into our considerations, as methods for efficiently updating the model’s knowledge or parameters, thereby effectively ensuring the model’s usability during deployment. In the deployment phase, we delineate the safety of large models into pure LLM models, which do not incorporate additional modules, and LLM-based agents, which are augmented with tools, memory, and other modules. This framework encompasses the entire cycle of model parameter training, convergence, and solidification.

Contribution 2. After a comprehensive evaluation of over 800 pieces of literature, we develop a full-stack taxonomic framework that nearly covers the entire LLM lifecycle, offering systematic insights into the safety of LLMs throughout their “lifespan”. We provide a more reliable correlation analysis between each phase of the LLM timeline and other relevant sections, aiding readers in understanding the safety issues of LLMs while also clarifying the research stage of each LLM phase.

Contribution 3. Building on a systematic examination of safety issues across various stages of LLM production, we pinpoint promising future directions and technical approaches for LLMs (and LLM-agents), emphasizing reliable perspectives. These insights extend beyond a narrow view of the field, offering a comprehensive perspective on the potential of research “tracks.” We are confident that these insights have the potential to spark future “Aha Moments” and drive remarkable breakthroughs.



Taxonomy. Our article begins with the structural preparation of data. In Section 2, we systematically introduce potential data issues during various model training phases, as well as the currently popular research on data generation. In Section 3, we focus on the security and safety concerns during the pre-training phase, which includes two core modules: data filtering and augmenting. In Section 4, we concentrate on the post-training phase, differing from previous works by incorporating fine-tuning and alignment, which involve attack, defense, and evaluation. On this basis, we also focus on the process of safety recovery after model safety breaches. In Section 5, we observe that models require dynamic updates in real-world scenarios. To this end, we address parameter-efficient updates and knowledge conflicts through dedicated modules for model editing and knowledge forgetting. Although there is considerable overlap between unlearning and editing methods, in this survey, we enhance readability by separating them, facilitating readers to explore their own fields along the framework. Subsequently, in Section 6, we focus on the safety issues after the model parameters are solidified, which share many commonalities with traditional large model security surveys. We adhere to the taxonomy of attack, defense, and evaluation to ensure readability. Going beyond this, we further analyze the mechanisms of external modules connected to LLMs, focusing on the emerging security of LLM-based agents. Finally, in Section 7, we present multiple safety concerns for the commercialization and ethical guidelines, as well as user usage, of LLM-based applications. To provide readers with a comprehensive understanding of our research framework, we dedicate Section 8 to outlining promising future research directions, while Section 9 presents synthesized conclusions and broader implications.

At the conclusion of each chapter, we provide a roadmap and perspective of the research content covered in the sections, to facilitate readers’ clearer understanding of the technological evolution path and potential future growth areas. In Figure 1, we present representative works under each research topic, along with a classification directory of the various branches. Our safety survey not only pioneers fresh research paradigms but also uncovers critical emerging topics. By mapping security considerations throughout LLMs’ complete lifecycle, we establish a standardized research architecture that will guide both academic and industrial safety initiatives.


Figure 1: The overview of the safety of LLM-based agent systems.

Conclusion

In this survey, we provide a comprehensive analysis of the safety concerns across the entire lifecycle of LLMs, from data preparation and pre-training to post-training, deployment, and commercialization. By introducing the concept of "full-stack" safety, we offer an integrated view of the security and safety issues faced by LLMs throughout their development and usage, which addresses gaps in the existing literature that typically focus on specific stages of the lifecycle.

Through an exhaustive review of over 800 papers, we systematically examined and organized the safety issues spanning key stages of LLM production, deployment, and use, including data generation, alignment techniques, model editing, and LLM-based agent systems. Our findings highlight the critical vulnerabilities at each stage, such as privacy risks, toxic data, harmful fine-tuning attacks, and deployment challenges. The safety of LLMs is a multifaceted issue requiring careful attention to data integrity, model alignment, and post-deployment security measures. Moreover, we propose promising directions for future research, including improvements in data safety, alignment techniques, and defense mechanisms for LLM-based agents. This work is vital for guiding future efforts to make LLMs safer and more reliable, especially as they become increasingly integral to various industries and applications. Ensuring robust security across the entire LLM lifecycle is crucial for their responsible and effective deployment in real-world scenarios.

