Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
TL;DR Highlight
A comprehensive survey covering multi-turn conversation benchmarks, improvement methods, and real-world applications for LLMs — all in one place.
Who Should Read
AI developers and ML engineers building chatbots, tutoring systems, or medical consultation AI with multi-turn dialogue. Especially teams running production services where context retention across turns is critical.
Core Mechanics
- LLM performance degrades substantially in multi-turn settings — MT-Eval analysis shows accuracy drops relative to equivalent single-turn prompts, with errors compounding across turns
- Standard RLHF/SFT has limited effectiveness for multi-turn dialogue — MT-Bench-101 (an evaluation of 21 LLMs) found that typical alignment techniques do not guarantee consistency across turns
- Multi-turn jailbreak attacks (like Crescendo) can bypass safety guardrails of major models including GPT-3.5, GPT-4, and Med-PaLM2
- Task-specific multi-turn benchmarks are emerging: MT-Bench (general), MathChat-Bench (math), InterCode (coding)
Evidence
- MathChat-Agent framework: ~6% accuracy improvement on competitive math problems via multi-turn LLM collaboration (vs. single-turn CoT)
- MINT benchmark: adding tool use or feedback turns improves problem-solving success rate by 1-17%
- GPT-4-based multi-turn jailbreak (Crescendo): successfully bypassed safety filters on GPT-3.5, GPT-4, Med-PaLM2, and other major models
How to Apply
- When building a multi-turn evaluation pipeline, choose task-specific benchmarks (MT-Bench for general dialogue, MathChat-Bench for math, InterCode for coding). When using LLM-as-a-judge, assign a judge model from a different training lineage than the models being evaluated, to prevent preference leakage.
- For RAG chatbots needing long conversation context: integrate memory management techniques like MemGPT to handle context beyond the model's window.
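The judge-assignment advice above can be sketched as follows. The model names, the lineage table, and the "Score: N" rubric are illustrative assumptions for this sketch, not details from the survey; the point is only that the judge is drawn from a different training lineage than the candidate.

```python
import re

# Hypothetical lineage table: candidate and judge must come from
# different providers (e.g. an OpenAI model judged by an Anthropic model).
MODEL_LINEAGES = {
    "gpt-4o": "openai",
    "claude-3-5-sonnet": "anthropic",
    "gemini-1.5-pro": "google",
}

def pick_judge(candidate_model: str) -> str:
    """Return a judge model from a different training lineage than the candidate."""
    candidate_lineage = MODEL_LINEAGES[candidate_model]
    for model, lineage in MODEL_LINEAGES.items():
        if lineage != candidate_lineage:
            return model
    raise ValueError("No judge from a different lineage available")

def build_judge_prompt(dialogue: str, response: str) -> str:
    """Rubric prompt asking the judge to rate multi-turn consistency 1-10."""
    return (
        "Rate the assistant's final response for consistency with the full "
        "dialogue on a 1-10 scale. Reply as 'Score: N'.\n\n"
        f"Dialogue:\n{dialogue}\n\nFinal response:\n{response}"
    )

def parse_score(judge_reply: str) -> int:
    """Extract the numeric score from the judge's free-text reply."""
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    if not match:
        raise ValueError(f"Unparseable judge reply: {judge_reply!r}")
    return int(match.group(1))

print(pick_judge("gpt-4o"))  # a non-OpenAI judge, e.g. claude-3-5-sonnet
print(parse_score("Score: 8 (consistent with earlier turns)"))  # 8
```

In practice, `build_judge_prompt` would be sent to the judge model's API and the reply fed to `parse_score`; the separation into pure functions keeps the lineage check and score parsing testable without network calls.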
Code Example
# Summary-based memory injection in multi-turn conversations (COMEDY style)
import openai

def compress_history(history: list[dict], model: str = "gpt-4o-mini") -> str:
    """Convert conversation history into a compressed memo."""
    history_text = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    response = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize the key facts, user preferences, and unfinished requests from the following conversation in 3-5 lines."},
            {"role": "user", "content": history_text},
        ],
    )
    return response.choices[0].message.content

def multi_turn_chat(user_input: str, history: list[dict], compress_every: int = 5):
    """Compress history every N turns to maintain context."""
    history.append({"role": "user", "content": user_input})
    system_msg = {"role": "system", "content": "You are a helpful assistant."}
    # Generate a compressed memo once the turn count exceeds the threshold
    if len(history) > compress_every * 2:
        memo = compress_history(history[:-2])  # compress everything except the last 2 messages
        messages = [
            system_msg,
            {"role": "system", "content": f"[Previous Conversation Summary]\n{memo}"},
            *history[-2:],  # keep only the most recent exchange verbatim
        ]
    else:
        messages = [system_msg] + history
    response = openai.chat.completions.create(model="gpt-4o", messages=messages)
    assistant_msg = response.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_msg})
    return assistant_msg, history

# Usage example
history = []
for user_msg in [
    "Hi, I'm a Python beginner",
    "What is list comprehension?",
    "Give me more examples",
    "What about dictionaries?",
    "Can you give me a practical example?",
    "Compare it with the list method you mentioned earlier",
]:
    reply, history = multi_turn_chat(user_msg, history)
    print(f"User: {user_msg}")
    print(f"AI: {reply}\n")
Terminology
Related Resources
Original Abstract
Recent advances in large language models (LLMs) have substantially improved single-turn task performance, yet real-world applications increasingly demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent progress in evaluating and enhancing multi-turn LLM interactions. Centered on a task-oriented taxonomy (spanning instruction following in domains such as mathematics and coding, and conversational engagement in role-playing, healthcare, education, and adversarial jailbreak settings), we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness across prolonged dialogues. We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies (in-context learning, supervised fine-tuning, reinforcement learning, and architectural innovations), external integration approaches (memory augmentation, retrieval-based methods, and knowledge graphs), and agent-based techniques for collaborative interaction. Finally, we identify open challenges and promising directions for future research to further improve the robustness and effectiveness of multi-turn LLM interactions.