LLMs之Multi-Turn:《LLMs Get Lost In Multi-Turn Conversation》翻译与解读

导读:该论文通过模拟实验揭示了LLM在多轮不明确对话中存在严重的性能下降问题,主要原因是模型可靠性降低。论文提出了一种分片模拟环境,能够更深入地分析LLM在多轮对话中的行为,并呼吁LLM开发者重视模型在多轮对话中的可靠性,同时也为用户提供了实用的建议。

>> 背景痛点:

● 多轮对话性能下降:大型语言模型(LLMs)作为对话界面,在多轮对话中存在性能下降的问题。现有LLM评估主要集中在单轮、完全指定的指令设置,忽略了用户在实际场景中经常出现的指令不明确的情况。

● 现有评估方式过于片段化:现有的多轮评估通常将对话视为一组可单独评估的片段式子任务,缺乏对用户指令逐步明确、即信息不充分(underspecification)场景的关注。LLM 在多轮对话中容易做出错误的假设,过早地尝试生成最终解决方案,并且过度依赖之前的(不正确的)答案尝试,从而导致性能下降。

>> 解决方案:

● 分片模拟:提出了一种模拟多轮、不明确对话的“分片模拟”(sharded simulation)环境。该环境利用现有的高质量单轮基准测试中的指令,通过“分片过程”(sharding process)将完全指定的指令转换为“分片指令”(sharded instructions)。分片指令将原始指令分解为一系列更小的指令片段(shards),每个片段逐步揭示原始指令的信息(分片效果的极简示意见本节末尾的代码)。

● 在对话的每一轮,用户(由LLM模拟)最多揭示一个信息片段,强制指令逐步通过对话揭示。

● 通过对模型进行单轮和多轮对话的模拟,比较LLM在不同场景下的性能表现,从而评估LLM在多轮对话中处理underspecification的能力。
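
下面用一段极简的 Python 示意说明上述分片过程的效果(示例指令与 shard 内容均为虚构,非论文原始数据):原始的完全指定指令被拆成若干 shard,第一个 shard 仅给出高层意图,其余 shard 逐步补充细节。

# Python 示意:一条完全指定的指令被拆分为多个 shards(示例内容为虚构)
full_instruction = (
    "写一个名为 sum_even_squares 的 Python 函数:输入一个整数列表,"
    "返回其中所有偶数的平方和;若列表为空则返回 0。"
)

sharded_instruction = [
    "帮我写个处理整数列表的 Python 函数。",           # shard 1:高层意图
    "我只关心列表里的偶数。",                          # shard 2:澄清细节
    "把这些偶数平方后加起来返回。",                    # shard 3:澄清细节
    "列表为空时返回 0,函数名叫 sum_even_squares。",    # shard 4:澄清细节
]
# 多轮模拟时,用户模拟器每轮最多向助手透露其中一个 shard。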

>> 核心思路步骤:

● 分片过程:将原始的完全指定指令分解为一系列信息片段(shards)。第一个 shard 通常是高层次的意图,后续的 shard 提供更详细的澄清。

● 分片模拟:

●● 用户模拟器(LLM):根据对话历史和剩余的 shards,决定在当前轮次揭示哪个 shard,并以自然的方式重新措辞。

●● 助手(待评估的LLM):接收用户消息,生成自由文本回复。

●● 系统:将助手回复分类为不同的策略(澄清、拒绝、对冲、询问、讨论、缺失、答案尝试)。

●● 答案提取器:如果助手尝试回答,提取答案片段。

●● 任务评估器:评估提取的答案是否正确。

● 模拟类型:

●● FULL:单轮,一次性提供完整的原始指令。

●● SHARDED:多轮,每轮逐步揭示 shards。

●● CONCAT:单轮,将所有 shards 连接成一条指令。

●● RECAP:在 SHARDED 的基础上,于最终轮次重复所有 shards。

●● SNOWBALL:每轮重复所有已揭示的 shards。

● 指标选择:

●● 平均性能(P):模型在给定模拟类型下的平均得分。

●● 能力(A90):模型在最佳情况下(90th percentile)的得分。

●● 不可靠性(U90/10):最佳情况(90th percentile)和最差情况(10th percentile)之间的差距。
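
假设对同一模型、同一模拟设定重复运行 N 次并得到 N 个得分,下面给出上述三个指标的一个极简计算示意(函数名 compute_metrics 与百分位的具体取法均为近似假设,未必与论文实现一致):

# Python 示意:由同一模型在同一模拟设定下的 N 次得分,计算 P、A90 与 U90/10
def compute_metrics(scores):
    s = sorted(scores)
    n = len(s)
    p = sum(s) / n                               # 平均性能 P
    a90 = s[int(round(0.9 * (n - 1)))]           # 90 百分位:最佳情况(近似取法)
    a10 = s[int(round(0.1 * (n - 1)))]           # 10 百分位:最差情况(近似取法)
    return {"P": p, "A90": a90, "U90/10": a90 - a10}

# 虚构示例:某模型在 SHARDED 设定下 10 次模拟的得分(0~100)
print(compute_metrics([30, 45, 50, 55, 60, 65, 70, 80, 90, 95]))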

>> 优势:

● 评估:能够在大规模实验中评估LLM在多轮、不明确对话中的性能。

● 理解:通过分解性能下降为能力损失和可靠性降低,更深入地理解LLM在多轮对话中失败的原因。

● 框架:提出的分片过程可以扩展到不同的任务和数据集,为LLM多轮对话评估提供了一个通用的框架。

● 精确研究:模拟环境能够控制对话的各个方面,例如信息揭示的顺序和速度,从而可以更精确地研究LLM的行为。

>> 结论和观点:

● LLM在多轮、不明确对话中的性能显著下降(平均下降39%)。

● 性能下降主要是由于可靠性大幅降低,而不是能力损失。

● LLM容易过早尝试回答问题,做出不正确的假设,并且过度依赖之前的(不正确的)答案尝试。

● 增加额外的测试时计算(推理 tokens)并不能有效解决多轮对话中的性能下降问题。

● 即使将温度设置为最低,也无法完全解决可靠性问题。

● 代理式的框架(agent-like framework)在处理信息方面可能存在局限性,LLM 应该原生支持多轮交互。

● 呼吁 LLM 开发者在提升模型能力的同时,也要重视模型在多轮对话中的可靠性。

● 建议用户在与LLM交互时,如果时间允许,可以尝试重新开始对话,或者将所有指令要求整合到一个指令中。
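
针对“将所有要求整合到一个指令中再重新发起对话”这条建议,下面给出一个极简示意(示例内容为虚构),把此前多轮中分散给出的要求拼接为一条完整指令,思路对应论文中的 CONCAT 设定:

# Python 示意:把多轮中分散给出的要求整合为一条完整指令(对应 CONCAT 思路)
user_turns = [
    "帮我写个处理整数列表的 Python 函数。",
    "我只关心列表里的偶数。",
    "对这些偶数求平方后求和并返回。",
    "列表为空时返回 0,函数名叫 sum_even_squares。",
]

consolidated_prompt = "请一次性完成以下任务,具体要求:\n" + "\n".join(
    "- " + turn for turn in user_turns
)
print(consolidated_prompt)  # 重新开启对话时,将整合后的指令一次性发送给模型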

目录

《LLMs Get Lost In Multi-Turn Conversation》翻译与解读

Abstract

1、Introduction

Figure 1

8、Conclusion

9、Limitations


《LLMs Get Lost In Multi-Turn Conversation》翻译与解读

地址

论文地址:https://arxiv.org/abs/2505.06120(LLMs Get Lost In Multi-Turn Conversation)

时间

2025年5月9日

作者

微软研究院,Salesforce 研究院

Abstract

Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that *when LLMs take a wrong turn in a conversation, they get lost and do not recover*.

大型语言模型(LLM)是对话式界面。因此,LLM 不仅能在用户能够完全明确任务时为其提供帮助,还能通过多轮对话交流帮助用户定义、探索和细化其需求。尽管对 LLM 对话日志的分析已证实用户指令中经常出现描述不充分的情况,但对 LLM 的评估主要集中在单轮、完全明确指令的场景上。在本研究中,我们进行了大规模模拟实验,以比较 LLM 在单轮和多轮设置下的表现。我们的实验表明,我们测试的所有顶级开放权重和闭源权重 LLM 在多轮对话中的表现都明显低于单轮对话,六项生成任务的平均表现下降了 39%。对 20 万多次模拟对话的分析将性能下降分解为两个部分:能力的轻微损失和可靠性的显著降低。我们发现,LLM 常常在早期轮次中做出假设,并过早尝试生成最终解决方案,随后又过度依赖这些解决方案。简单来说,我们发现当大型语言模型在对话中走错方向时,它们就会迷失方向且无法恢复。

1、Introduction

Today’s large language models (LLMs) function as conversational interfaces (e.g., ChatGPT, Gemini, Claude), enabling users to interact with the LLM through multiple conversation turns. Such interaction promises to help users not only when they know what they need (i.e., they can fully specify their requirements in an instruction), but also when they don’t. In such cases, users might start with an underspecified instruction and further clarify their needs through turn interactions. Though studies of LLM conversation logs have confirmed that underspecification in user instructions is prevalent [27], LLM systems are typically evaluated in single-turn, fully-specified settings.

Even though a growing body of work proposes to evaluate LLMs in a multi-turn fashion, we identify in our review (Section 2) that most prior work treats the conversation as episodic: conversation turns might relate to each other, but the conversation can effectively be decomposed as an array of subtasks that can be evaluated in isolation. We argue that episodic tasks move away from what is prevalent in human conversation: underspecification [91, 27].

当今的大语言模型(LLM)充当着对话界面的角色(例如 ChatGPT、Gemini、Claude),使用户能够通过多轮对话与 LLM 进行交互。这种交互不仅在用户明确知道自己需要什么时(即他们能够在指令中完整地说明需求)有帮助,而且在用户不清楚时也有帮助。在这种情况下,用户可能会从一个不明确的指令开始,并通过多轮对话进一步明确自己的需求。尽管对 LLM 对话日志的研究已经证实,用户指令中的不明确性很普遍[27],但 LLM 系统通常是在单轮、完全明确的设置下进行评估的。

尽管越来越多的工作提议以多轮的方式评估 LLM,但在我们的综述(第 2 节)中发现,大多数先前的工作都将对话视为片段式的:对话轮次之间可能存在关联,但对话实际上可以分解为一系列可以单独评估的子任务。我们认为,这种片段式任务偏离了人类对话中普遍存在的不明确性[91, 27]。

In this work, we close this gap by creating a simulation environment for multi-turn underspecified conversations – sharded simulation – that leverages existing instructions from high-quality single-turn benchmarks. At a high level, the sharding process we propose transforms existing single-turn instructions into sharded instructions, a set of smaller instructions that jointly deliver the same information as the original instruction. Sharded simulation then ensures that each turn of conversation reveals at most one shard of information per conversation turn, enforcing that the instruction is gradually revealed through the conversation.

On the set of tasks that we experimented on, we observed that models engaged in multi-turn underspecified conversations achieved an average performance of 65%–a 25-point drop from single-turn performances of 90% when they receive the entire instruction at the beginning of the conversation. Notably, we observe this drop in performance even in two-turn conversations, and across all LLMs we test, from small open-weights (LLama3.1-8B-Instruct) to state-of-the-art (Gemini 2.5 Pro).

Furthermore, we decompose the performance degradation into two components: (1) loss in aptitude, and (2) increase in unreliability. We find that in single-turn settings, models with higher aptitude tend to be more reliable (e.g., GPT-4.1, Gemini 2.5 Pro). On the other hand, all LLMs exhibit very high unreliability in multi-turn settings, regardless of aptitude. We refer to this as the lost in conversation phenomenon: when LLMs take a wrong turn in multi-turn conversation, they get lost and do not recover.

We investigate several explanations for this effect and show that the LLMs tend to (1) generate overly verbose responses, leading them to (2) propose final solutions prematurely in conversation, (3) make incorrect assumptions about underspecified details, and (4) rely too heavily on previous (incorrect) answer attempts.

在这项工作中,我们通过创建一个用于多轮不明确对话的模拟环境——分片模拟——来填补这一空白,该环境利用了来自高质量单轮基准的现有指令。从高层次来看,我们提出的分片过程将现有的单轮指令转换为分片指令,即一组较小的指令,它们共同传递的信息与原始指令相同。分片模拟则确保每一轮对话最多只透露一个分片的信息,从而强制指令通过对话逐步揭示。
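
下面用一段示意性的 Python 代码勾勒上文所述分片模拟中单个对话的控制流(其中 user_simulator、classify_strategy 等函数名与接口均为本文自拟的假设,仅用于说明流程,并非论文官方实现):

# Python 示意:分片模拟中单个对话的控制流(函数名与接口均为假设,仅示意流程)
def run_sharded_conversation(shards, user_simulator, assistant,
                             classify_strategy, extract_answer, evaluate):
    history, remaining = [], list(shards)
    while remaining:
        # 用户模拟器结合对话历史,从剩余 shards 中选一个并用自然口吻改写,每轮最多透露一个
        user_msg, remaining = user_simulator(history, remaining)
        history.append(("user", user_msg))

        reply = assistant(history)           # 被评估的 LLM 生成自由文本回复
        history.append(("assistant", reply))

        strategy = classify_strategy(reply)  # 澄清/拒绝/对冲/询问/讨论/缺失/答案尝试
        if strategy == "answer_attempt":
            answer = extract_answer(reply)   # 答案提取器截取回复中的答案片段
            if evaluate(answer):             # 任务评估器判断答案是否正确
                return True                  # 答对则对话提前结束
    return False                             # 所有 shards 揭示完毕仍未答对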

在我们实验的任务集上,我们观察到参与多轮不明确对话的模型平均性能为 65%,比在对话开始时就收到完整指令的单轮性能 90% 下降了 25 个百分点。值得注意的是,即使在两轮对话中,我们也观察到了这种性能下降,并且在我们测试的所有大型语言模型(LLM)中都存在这种情况,从小型开放权重模型(LLama3.1-8B-Instruct)到最先进的模型(Gemini 2.5 Pro)。

此外,我们将性能下降分解为两个部分:(1)能力的损失,以及(2)可靠性的降低。我们发现,在单轮对话中,能力更强的模型往往更可靠(例如,GPT-4.1、Gemini 2.5 Pro)。另一方面,在多轮对话中,所有大型语言模型(LLM)的可靠性都非常低,无论其能力如何。我们将此现象称为“对话迷失”现象:当大型语言模型在多轮对话中走错方向时,它们就会迷失方向且无法恢复。

我们对这种效应进行了多种解释的探究,并表明这些模型往往会(1)生成过于冗长的回答,导致它们(2)在对话中过早提出最终解决方案,(3)对未明确说明的细节做出错误假设,以及(4)过度依赖之前的(错误)回答尝试。

Our findings highlight a gap between how LLMs are used in practice and how the models are being evaluated. Ubiquitous performance degradation over multi-turn interactions is likely a reason for low uptake of AI systems [73, 4, 28], particularly with novice users who are less skilled at providing complete, detailed instructions from the onset of conversation [87, 35].

The rest of the paper is structured as follows: Section 2 situates our work with respect to prior work on multi-turn evaluation. In Section 3, we describe the simulation environment we built for both single- and multi-turn conversations on a diverse set of generation tasks. We introduce the six tasks and the metrics we use to evaluate the aptitude and reliability of models in Section 4.1. Sections 5-6 define our main experiment involving 15 LLMs, and analyze the main findings. Finally, the Implications section (Section 7) discusses the ramifications of the work, from the perspective of organizations that are building LLM-based conversation products, to that of end-users of the LLM-based systems. We provide actionable recommendations based on small-scale experiments and make a concrete call-to-action to LLM builders, urging them to prioritize multi-turn reliability in conjunction with aptitude in future model iterations.

我们的研究结果凸显了在实践中使用大型语言模型(LLM)的方式与评估这些模型的方式之间存在的差距。在多轮交互中普遍存在性能下降的情况可能是人工智能系统采用率低的原因[73, 4, 28],尤其是对于那些在对话开始时不太擅长提供完整、详细指令的新手用户而言[87, 35]。

本文的其余部分结构如下:第 2 节将我们的工作置于先前关于多轮评估工作的背景下。第 3 节描述了我们为单轮和多轮对话在一系列生成任务上构建的模拟环境。第 4.1 节介绍了六个任务以及我们用于评估模型能力与可靠性的指标。第 5 至 6 节定义了涉及 15 个 LLM 的主要实验,并分析了主要发现。最后,第 7 节的“影响”部分从构建基于 LLM 对话产品的组织到基于 LLM 系统的终端用户的角度讨论了这项工作的意义。我们基于小规模实验提供切实可行的建议,并向大型语言模型的构建者发出明确的行动呼吁,敦促他们在未来的模型迭代中优先考虑多轮对话的可靠性,同时兼顾模型的能力。

Figure 1:In this work, we simulate single- and multi-turn conversations for six generation tasks. The 15 LLMs we test perform much worse in multi-turn settings (-35%) explained by some loss in aptitude, and large losses in reliability. Aptitude is defined as performance in best-case conversation simulation, and unreliability as the gap between best- and worst-case performance. In short, we find that LLMs get lost in multi-turn, underspecified conversation.图 1:在本研究中,我们针对六项生成任务模拟了单轮和多轮对话。我们测试的 15 个大型语言模型在多轮对话场景中的表现要差得多(下降 35%),这主要是由于能力有所下降,以及可靠性大幅降低。能力被定义为最佳对话模拟情况下的表现,而不可靠性则是最佳和最差对话模拟表现之间的差距。简而言之,我们发现大型语言模型在多轮、未明确指定的对话中容易迷失方向。

8、Conclusion

In this work, we conduct a large-scale simulation of single- and multi-turn conversations with LLMs, and find that on a fixed set of tasks, LLM performance degrades significantly in multi-turn, underspecified settings. LLMs get lost in conversation, which materializes as a significant decrease in reliability as models struggle to maintain context across turns, make premature assumptions, and over-rely on their previous responses. Additional experiments reveal that known remediations that work for simpler settings (such as agent-like concatenation or decreasing temperature during generation) are ineffective in multi-turn settings, and we call on LLM builders to prioritize the reliability of models in multi-turn settings.

在本研究中,我们对大型语言模型(LLM)在单轮和多轮对话中的表现进行了大规模模拟,发现对于一组固定的任务,在多轮、信息不明确的设置中,LLM 的性能显著下降。LLM 在对话中迷失方向,具体表现为可靠性大幅降低,因为模型难以在多轮对话中保持上下文的一致性,过早做出假设,并过度依赖之前的回答。进一步的实验表明,适用于较简单设置的已知补救措施(例如类似代理的拼接或在生成过程中降低温度)在多轮设置中无效,我们呼吁 LLM 的开发者优先考虑模型在多轮设置中的可靠性。

9、Limitations

A first limitation of our work is the reliance on fully automated simulation. By relying on an LLM to simulate user utterances, we can scale our experiments, including running the same simulation multiple times, which would be cost-prohibitive with real users. However, the simulations we obtain are not representative of natural human-AI conversation. The properties of the sharding process (defined in Appendix C) and of the simulation environment (see Section 3.2) ensure that the simulated conversations follow a rather narrow structure, likely not modeling the full range of conversation dynamics that occur with a large, diverse user population. For example, the simulation process ensures a new shard of information is revealed at each turn, and that the last turn of the conversation has specified all the information needed to complete the task which might not happen with real users. Properties P1, P2, and P5 of the sharding process also restrict the scope of the conversation, as sharded instructions closely match an existing fully-specified instruction, with the high-level intent always identified in the conversation’s first turn. The minimal nature of shards is also unrealistic and potentially adversarial, though the gradual sharding experiment finds that different levels of shard granularity lead to similar performance degradations, as soon as conversations occur over two turns or more. Apart from sharding granularity, automatic simulation also lacks the nuance that can occur when a human is involved in conversation, from misunderstandings over terminology, giving up due to frustration with system failures [82], or the lack of a feasible end goal for certain conversations (e.g., the user wanting a solution to an unsolved problem). Because of these factors, we believe conducted simulations represent a benign testing ground for LLM multi-turn capabilities. Because of the overly simplified conditions of simulation, we believe the degradation observed in experiments is most likely an underestimate of LLM unreliability, and how frequently LLMs get lost in conversation in real-world settings. The experiments serve as a scalable, low-cost experimental environment for studying LLMs in multi-turn settings.

我们工作的第一个局限在于完全依赖自动化模拟。通过让大型语言模型模拟用户的话语,我们能够扩大实验规模,包括多次重复相同的模拟,这在使用真实用户时成本过高。然而,我们获得的模拟结果并不能代表自然的人机对话。分片过程(见附录 C 定义)和模拟环境(见第 3.2 节)的特性确保了模拟对话遵循一种相当狭窄的结构,可能无法涵盖与大量多样化用户群体交流时出现的全部对话动态。例如,模拟过程确保在每一轮对话中都会揭示新的信息片段,并且对话的最后一轮会包含完成任务所需的所有信息,这在真实用户中可能不会发生。分片过程的属性 P1、P2 和 P5 也限制了对话的范围,因为分片指令与现有的完整指令高度匹配,且高层次意图总是在对话的第一轮中被明确指出。分片的最小化特性也是不切实际且可能具有对抗性的,尽管逐步分片实验发现,只要对话超过两轮,不同粒度级别的分片都会导致类似的性能下降。除了分片粒度之外,自动模拟还缺乏人类参与对话时可能出现的细微差别,比如对术语的误解、因系统故障而感到沮丧从而放弃对话,或者某些对话缺乏可行的最终目标(例如,用户希望解决一个未解决的问题)。由于这些因素,我们认为进行的模拟是测试 LLM 多轮对话能力的良性环境。由于模拟条件过于简化,我们认为实验中观察到的性能下降很可能低估了 LLM 的不可靠性,以及在现实环境中 LLM 在对话中迷失的频率。这些实验为研究 LLM 在多轮对话环境中的表现提供了一个可扩展且成本低廉的实验环境。

A second limitation of our work is the focus on analytical tasks. Although we selected a diverse set of both programming and natural language tasks, we restricted experiments to tasks that involve an analytical solution. This restriction limits the scope of our findings, as we do not establish whether models get lost in conversation on more open-ended tasks, such as creative writing [5]. This was a conscious choice: though there has been some progress on creative writing evaluation, it is still an active area of research [6], and we relied on more established tasks and metrics for the initial set of experiments. Determining whether degradation occurs – and if so, identifying the magnitude – on creative tasks is an important direction for future work.

A third limitation of the work is the focus on text-only tasks in the English language. Establishing whether models get lost in conversation in other languages, or in tasks that involve multiple modalities in either user or assistant utterances, could help establish the scope of the degradation observed in LLM multi-turn capabilities.

我们研究的第二个局限性在于侧重于分析性任务。尽管我们选取了一组多样化的编程和自然语言任务,但我们将实验范围限制在那些具有分析解的任务上。这一限制缩小了我们的研究范围,因为我们无法确定模型在诸如创意写作等更开放的任务中是否会迷失方向[5]。这是有意为之的选择:尽管在创意写作评估方面已取得了一些进展,但这仍是研究的活跃领域[6],因此在最初的实验中,我们依赖于更成熟的任务和指标。确定在创意任务中是否会出现性能下降——如果会的话,确定其程度——是未来研究的一个重要方向。

这项工作的第三个局限性在于只关注了英语中的纯文本任务。确定模型在其他语言中,或者在用户或助手的表述涉及多种模态的任务中是否会迷失方向,有助于明确观察到的大型语言模型多轮对话能力下降的范围。
