Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
TL;DR Highlight
A comprehensive survey covering multi-turn conversation benchmarks, improvement methods, and real-world applications for LLMs — all in one place.
Who Should Read
AI developers and ML engineers building chatbots, tutoring systems, or medical consultation AI with multi-turn dialogue. Especially teams running production services where context retention across turns is critical.
Core Mechanics
- LLM performance degrades substantially in multi-turn settings — MT-Eval analysis shows accuracy drops relative to equivalent single-turn prompts, with errors compounding across turns
- Standard RLHF/SFT has limited effectiveness for multi-turn dialogue — MT-Bench-101 (an evaluation of 21 LLMs) found that typical alignment techniques do not guarantee consistency across turns
- Multi-turn jailbreak attacks (like Crescendo) can bypass safety guardrails of major models including GPT-3.5, GPT-4, and Med-PaLM2
- Task-specific multi-turn benchmarks are emerging: MT-Bench (general), MathChat-Bench (math), InterCode (coding)
Evidence
- MathChat-Agent framework: ~6% accuracy improvement on competitive math problems via multi-turn LLM collaboration (vs. single-turn CoT)
- MINT benchmark: adding tool use or feedback turns improves problem-solving success rate by 1-17%
- GPT-4-based multi-turn jailbreak (Crescendo): successfully bypassed safety filters on GPT-3.5, GPT-4, Med-PaLM2, and other major models
How to Apply
- When building a multi-turn evaluation pipeline, choose task-specific benchmarks (MT-Bench for general dialogue, MathChat-Bench for math, InterCode for coding). When using LLM-as-a-judge, assign a judge model from a different training lineage than the models being evaluated, to prevent preference leakage.
- For RAG chatbots needing long conversation context: integrate memory management techniques like MemGPT to handle context beyond the model's window.
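The judge-assignment advice above can be sketched as follows. The model names, the lineage table, and the "Score: N" rubric are illustrative assumptions for this sketch, not details from the survey; the point is only that the judge is drawn from a different training lineage than the candidate.

```python
import re

# Hypothetical lineage table: candidate and judge must come from
# different providers (e.g. an OpenAI model judged by an Anthropic model).
MODEL_LINEAGES = {
    "gpt-4o": "openai",
    "claude-3-5-sonnet": "anthropic",
    "gemini-1.5-pro": "google",
}

def pick_judge(candidate_model: str) -> str:
    """Return a judge model from a different training lineage than the candidate."""
    candidate_lineage = MODEL_LINEAGES[candidate_model]
    for model, lineage in MODEL_LINEAGES.items():
        if lineage != candidate_lineage:
            return model
    raise ValueError("No judge from a different lineage available")

def build_judge_prompt(dialogue: str, response: str) -> str:
    """Rubric prompt asking the judge to rate multi-turn consistency 1-10."""
    return (
        "Rate the assistant's final response for consistency with the full "
        "dialogue on a 1-10 scale. Reply as 'Score: N'.\n\n"
        f"Dialogue:\n{dialogue}\n\nFinal response:\n{response}"
    )

def parse_score(judge_reply: str) -> int:
    """Extract the numeric score from the judge's free-text reply."""
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    if not match:
        raise ValueError(f"Unparseable judge reply: {judge_reply!r}")
    return int(match.group(1))

print(pick_judge("gpt-4o"))  # a non-OpenAI judge, e.g. claude-3-5-sonnet
print(parse_score("Score: 8 (consistent with earlier turns)"))  # 8
```

In practice, `build_judge_prompt` would be sent to the judge model's API and the reply fed to `parse_score`; the separation into pure functions keeps the lineage check and score parsing testable without network calls.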
Code Example
# Summary-based memory injection in multi-turn conversations (COMEDY style)
import openai

def compress_history(history: list[dict], model: str = "gpt-4o-mini") -> str:
    """Convert conversation history into a compressed memo."""
    history_text = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    response = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize the key facts, user preferences, and unfinished requests from the following conversation in 3-5 lines."},
            {"role": "user", "content": history_text},
        ],
    )
    return response.choices[0].message.content

def multi_turn_chat(user_input: str, history: list[dict], compress_every: int = 5):
    """Compress history every N turns to maintain context."""
    history.append({"role": "user", "content": user_input})
    system_msg = {"role": "system", "content": "You are a helpful assistant."}
    # Generate a compressed memo once the turn count exceeds the threshold
    if len(history) > compress_every * 2:
        memo = compress_history(history[:-2])  # compress everything except the last 2 messages
        messages = [
            system_msg,
            {"role": "system", "content": f"[Previous Conversation Summary]\n{memo}"},
            *history[-2:],  # keep only the most recent exchange verbatim
        ]
    else:
        messages = [system_msg] + history
    response = openai.chat.completions.create(model="gpt-4o", messages=messages)
    assistant_msg = response.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_msg})
    return assistant_msg, history

# Usage example
history = []
for user_msg in [
    "Hi, I'm a Python beginner",
    "What is list comprehension?",
    "Give me more examples",
    "What about dictionaries?",
    "Can you give me a practical example?",
    "Compare it with the list method you mentioned earlier",
]:
    reply, history = multi_turn_chat(user_msg, history)
    print(f"User: {user_msg}")
    print(f"AI: {reply}\n")
Terminology
Related Resources
Original Abstract
Recent advances in large language models (LLMs) have substantially improved single-turn task performance, yet real-world applications increasingly demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent progress in evaluating and enhancing multi-turn LLM interactions. Centered on a task-oriented taxonomy (spanning instruction following in domains such as mathematics and coding, and conversational engagement in role-playing, healthcare, education, and adversarial jailbreak settings), we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness across prolonged dialogues. We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies (in-context learning, supervised fine-tuning, reinforcement learning, and architectural innovations), external integration approaches (memory augmentation, retrieval-based methods, and knowledge graphs), and agent-based techniques for collaborative interaction. Finally, we identify open challenges and promising directions for future research to further improve the robustness and effectiveness of multi-turn LLM interactions.