Enhancing diagnostic capability with multi-agents conversational large language models
TL;DR Highlight
Multiple AI doctor agents conferring as in a clinical Multi-Disciplinary Team (MDT) meeting diagnose rare diseases more accurately than GPT-4 alone.
Who Should Read
Healthcare AI researchers and clinical informatics teams exploring multi-agent systems for complex medical decision support.
Core Mechanics
- Multi-Agent Conversation (MAC) framework modeled on clinical Multi-Disciplinary Team discussions: several doctor agents discuss a case together with a supervisor agent
- Best reported configuration: four doctor agents plus one supervisor agent, with GPT-4 as the base model
- Outperforms single GPT-3.5 and GPT-4 in both primary and follow-up consultations, with higher accuracy for diagnoses and for suggested tests
- Results were highly consistent across repeated runs
- Also outperforms Chain of Thought (CoT), Self-Refine, and Self-Consistency, while producing more output tokens
Evidence
- Evaluated on 302 rare disease cases, covering both medical knowledge and clinical reasoning
- Compared MAC against single-model GPT-3.5 and GPT-4, and against CoT, Self-Refine, and Self-Consistency
- MAC achieved higher accuracy for diagnoses and suggested tests in both primary and follow-up consultations
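A minimal sketch of how diagnostic accuracy of this kind could be scored, assuming each case yields a ranked list of candidate diagnoses and one reference diagnosis. The function name `top_k_accuracy`, the case-insensitive exact-match rule, and the sample data are illustrative assumptions, not the paper's evaluation procedure.

```python
from typing import Sequence


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of cases whose reference diagnosis appears in the top-k predictions."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = 0
    for ranked, truth in zip(predictions, references):
        # Hypothetical matching rule: case-insensitive exact string match.
        top_k = [d.strip().lower() for d in ranked[:k]]
        if truth.strip().lower() in top_k:
            hits += 1
    return hits / len(references)


# Made-up illustrative data, not results from the paper.
predictions = [
    ["Pompe disease", "polymyositis", "limb-girdle muscular dystrophy"],
    ["sarcoidosis", "tuberculosis", "lymphoma"],
]
references = ["polymyositis", "sarcoidosis"]

print(top_k_accuracy(predictions, references, k=1))  # 0.5
print(top_k_accuracy(predictions, references, k=3))  # 1.0
```

Exact string matching is the simplest possible rule; free-text model output would usually need synonym handling or expert review before a match can be trusted.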
How to Apply
- For high-stakes decisions with multiple plausible options, use a multi-agent debate structure rather than a single LLM query.
- Start from the configuration the paper found best: four doctor agents plus a supervisor agent that integrates their opinions, with GPT-4 as the base model.
- Structure the debate with a fixed number of rounds and a consensus protocol to avoid endless loops.
Code Example
# Simplified MAC-style pattern using the OpenAI Python SDK (v1+).
# Single pass only: each doctor agent answers once, then a supervisor integrates.
import openai

# Reads the API key from the OPENAI_API_KEY environment variable.
client = openai.OpenAI()


def doctor_agent(case: str, specialty: str) -> str:
    """Ask one specialist agent for a diagnosis with reasoning."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"You are an experienced {specialty} physician. Analyze the case and suggest a diagnosis with reasoning."},
            {"role": "user", "content": case},
        ],
    )
    return response.choices[0].message.content


def supervisor_agent(case: str, doctor_opinions: list[str]) -> str:
    """Ask the supervisor agent to merge all doctor opinions into one diagnosis."""
    opinions_text = "\n\n".join(
        f"Doctor {i + 1}: {op}" for i, op in enumerate(doctor_opinions)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a senior supervising physician. Review all doctor opinions and provide a final integrated diagnosis."},
            {"role": "user", "content": f"Case:\n{case}\n\nDoctor Opinions:\n{opinions_text}\n\nProvide the final diagnosis."},
        ],
    )
    return response.choices[0].message.content


# Usage example
case = "28-year-old patient with progressive muscle weakness, fatigue, and elevated CK levels..."
specialties = ["neurologist", "rheumatologist", "internal medicine specialist", "geneticist"]

# Run the four doctor agents, one per specialty
opinions = [doctor_agent(case, s) for s in specialties]

# The supervisor provides the final diagnosis
final_diagnosis = supervisor_agent(case, opinions)
print(final_diagnosis)
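The example above runs a single pass: each doctor answers once and the supervisor summarizes. A bounded multi-round variant might look like the sketch below. It assumes a hypothetical `ask(role, prompt)` callable wrapping whatever chat API is in use, and a hypothetical convention that the supervisor's reply starts with `FINAL:` once it accepts a diagnosis. Neither the round limit nor the stop convention comes from the paper.

```python
from typing import Callable

# Hypothetical signature: ask(role, prompt) -> model reply as text.
AskFn = Callable[[str, str], str]


def run_conversation(
    case: str,
    specialties: list[str],
    ask: AskFn,
    max_rounds: int = 3,
) -> str:
    """Run doctor/supervisor rounds until the supervisor finalizes or rounds run out."""
    transcript: list[str] = []
    verdict = ""
    for round_no in range(1, max_rounds + 1):
        history = "\n".join(transcript) or "(no discussion yet)"
        # Every doctor sees the case plus the discussion from earlier rounds.
        for specialty in specialties:
            reply = ask(specialty, f"Case:\n{case}\n\nDiscussion so far:\n{history}")
            transcript.append(f"[round {round_no}] {specialty}: {reply}")
        # The supervisor either finalizes or asks for another round.
        verdict = ask(
            "supervisor",
            f"Case:\n{case}\n\nDiscussion:\n" + "\n".join(transcript)
            + "\n\nReply 'FINAL: <diagnosis>' if the doctors agree, otherwise list open questions.",
        )
        transcript.append(f"[round {round_no}] supervisor: {verdict}")
        if verdict.strip().upper().startswith("FINAL:"):
            return verdict.strip()[len("FINAL:"):].strip()
    # Round budget exhausted: return the supervisor's last reply unchanged.
    return verdict


def fake_ask(role: str, prompt: str) -> str:
    """Offline stand-in for a real model call so the loop can be exercised."""
    if role == "supervisor":
        return "FINAL: polymyositis" if "[round 2]" in prompt else "Open question: biopsy?"
    return "Suspect inflammatory myopathy."


result = run_conversation(
    "progressive weakness, elevated CK",
    ["neurologist", "rheumatologist"],
    fake_ask,
)
print(result)  # polymyositis
```

Passing the model call in as `ask` keeps the loop testable offline; in practice it could wrap `doctor_agent` and `supervisor_agent` from the example above. The hard cap on rounds guarantees termination even if the supervisor never finalizes.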
Original Abstract
Large Language Models (LLMs) show promise in healthcare tasks but face challenges in complex medical scenarios. We developed a Multi-Agent Conversation (MAC) framework for disease diagnosis, inspired by clinical Multi-Disciplinary Team discussions. Using 302 rare disease cases, we evaluated GPT-3.5, GPT-4, and MAC on medical knowledge and clinical reasoning. MAC outperformed single models in both primary and follow-up consultations, achieving higher accuracy in diagnoses and suggested tests. Optimal performance was achieved with four doctor agents and a supervisor agent, using GPT-4 as the base model. MAC demonstrated high consistency across repeated runs. Further comparative analysis showed MAC also outperformed other methods including Chain of Thoughts (CoT), Self-Refine, and Self-Consistency with higher performance and more output tokens. This framework significantly enhanced LLMs’ diagnostic capabilities, effectively bridging theoretical knowledge and practical clinical application. Our findings highlight the potential of multi-agent LLMs in healthcare and suggest further research into their clinical implementation.