Enhancing diagnostic capability with multi-agents conversational large language models
TL;DR Highlight
Multiple AI doctor agents debating like a multidisciplinary team (MDT) meeting diagnose rare diseases more accurately than GPT-4 alone.
Who Should Read
Healthcare AI researchers and clinical informatics teams exploring multi-agent systems for complex medical decision support.
Core Mechanics
- Multi-agent framework where specialized AI agents (generalist, specialist, devil's advocate) debate differential diagnoses
- Structured debate protocol: each agent proposes, critiques, and defends diagnoses across multiple rounds
- Outperforms single GPT-4 on rare disease diagnosis benchmarks, especially for complex multi-system conditions
- Devil's advocate agent reduces premature diagnostic convergence (anchoring bias)
- Final diagnosis is synthesized by a supervisor agent from the debated opinions, not decided by simple majority vote
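The propose/critique/defend loop above can be sketched as a fixed-round debate. This is a minimal illustration, not the paper's implementation: the `Agent` type, role names, and stubbed responses are all hypothetical, and a real system would back each agent with an LLM call.

```python
from typing import Callable

# An agent maps (case, peer_opinions) -> its current diagnostic opinion.
Agent = Callable[[str, list[str]], str]

def debate(case: str, agents: dict[str, Agent], rounds: int = 3) -> dict[str, str]:
    """Run a fixed number of propose -> critique/defend rounds."""
    # Round 0: every agent proposes independently, seeing no peers.
    opinions = {name: fn(case, []) for name, fn in agents.items()}
    for _ in range(rounds - 1):
        # Each later round: an agent revises after reading all peers' views.
        opinions = {
            name: fn(case, [op for peer, op in opinions.items() if peer != name])
            for name, fn in agents.items()
        }
    return opinions

# Deterministic stub agents so the sketch runs without an API key.
def make_stub(name: str) -> Agent:
    def fn(case: str, peers: list[str]) -> str:
        return f"{name} opinion after reading {len(peers)} peers"
    return fn

roles = ["generalist", "specialist", "devils_advocate"]
result = debate("sample case", {r: make_stub(r) for r in roles})
```

Swapping the stubs for real model calls (as in the code example below) turns this into a working debate loop; the fixed `rounds` budget is what keeps the conversation from running indefinitely.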
Evidence
- Evaluated on 302 rare disease cases covering both primary and follow-up consultations
- Top-1 and Top-3 diagnostic accuracy compared against single-model GPT-3.5/GPT-4, Chain-of-Thought, Self-Refine, and Self-Consistency baselines
- Multi-agent debate improved Top-1 accuracy by 8–12% on hard cases
How to Apply
- For high-stakes decisions with multiple plausible options, use a multi-agent debate structure rather than a single LLM query.
- Include an explicit 'devil's advocate' agent role to challenge the leading hypothesis and surface overlooked alternatives.
- Structure the debate with a fixed number of rounds and a consensus protocol to avoid endless loops.
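The last point, bounded rounds plus a consensus check, can be sketched as follows. The stub agents, the round-based convergence, and the `supervisor` fallback are hypothetical illustrations, not the paper's actual protocol:

```python
def debate_until_consensus(case, agents, supervisor, max_rounds=4):
    """Debate for at most max_rounds; stop early on unanimous agreement."""
    # Round 0: independent proposals, no peer visibility.
    opinions = {name: fn(case, round_no=0, peers={}) for name, fn in agents.items()}
    for round_no in range(1, max_rounds):
        if len(set(opinions.values())) == 1:
            # Unanimous agreement: no need to keep debating.
            return next(iter(opinions.values())), round_no
        # Otherwise, every agent revises after seeing all current opinions.
        opinions = {name: fn(case, round_no=round_no, peers=dict(opinions))
                    for name, fn in agents.items()}
    # No consensus within the budget: a supervisor integrates the final views.
    return supervisor(case, opinions), max_rounds

# Stub agents that disagree at first, then converge from round 2 onward.
def stub_agent(initial_dx):
    def fn(case, round_no, peers):
        return "inflammatory myopathy" if round_no >= 2 else initial_dx
    return fn

agents = {
    "neurologist": stub_agent("ALS"),
    "rheumatologist": stub_agent("polymyositis"),
    "devils_advocate": stub_agent("hypothyroid myopathy"),
}
dx, rounds_used = debate_until_consensus(
    "case...", agents, supervisor=lambda c, ops: "supervisor pick"
)
```

The supervisor fallback guarantees a single answer even when agents never converge, which is what prevents the endless-loop failure mode described above.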
Code Example
# Simple MAC pattern example using the OpenAI API
import openai

client = openai.OpenAI()

def doctor_agent(case: str, specialty: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"You are an experienced {specialty} physician. Analyze the case and suggest a diagnosis with reasoning."},
            {"role": "user", "content": case},
        ],
    )
    return response.choices[0].message.content

def supervisor_agent(case: str, doctor_opinions: list[str]) -> str:
    opinions_text = "\n\n".join(f"Doctor {i+1}: {op}" for i, op in enumerate(doctor_opinions))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a senior supervising physician. Review all doctor opinions and provide a final integrated diagnosis."},
            {"role": "user", "content": f"Case:\n{case}\n\nDoctor Opinions:\n{opinions_text}\n\nProvide the final diagnosis."},
        ],
    )
    return response.choices[0].message.content

# Usage example
case = "28-year-old patient with progressive muscle weakness, fatigue, and elevated CK levels..."
specialties = ["neurologist", "rheumatologist", "internal medicine specialist", "geneticist"]

# Run four doctor agents, one per specialty
opinions = [doctor_agent(case, s) for s in specialties]

# The supervisor integrates their opinions into a final diagnosis
final_diagnosis = supervisor_agent(case, opinions)
print(final_diagnosis)
Related Papers
Show HN: adamsreview – better multi-agent PR reviews for Claude Code
An open-source plugin for Claude Code that runs up to 7 parallel sub-agents, each reviewing a PR from a different perspective, and even applies fixes automatically. It claims to catch more real bugs than the built-in /review or CodeRabbit, though the community has voiced skepticism about its complexity and practical value.
How Fast Does Claude, Acting as a User Space IP Stack, Respond to Pings?
An experiment that had Claude Code parse raw IP packets and construct ICMP echo replies so that it actually responds to pings: an entertaining case that pushes the "Markdown is code and the LLM is the processor" idea all the way down to the network stack.
Show HN: Git for AI Agents
A version-control tool that automatically tracks every tool call made by AI coding agents (Claude Code and others) and supports blame down to which prompt wrote which line of code.
Principles for agent-native CLIs
A write-up of design principles for making CLI tools easier for AI agents to use; as agents increasingly rely on CLIs as tools, this style of design is becoming practically important.
Agent-harness-kit scaffolding for multi-agent workflows (MCP, provider-agnostic)
A scaffolding tool that coordinates multiple AI agents collaborating in divided roles, letting you assemble a multi-agent pipeline quickly with zero configuration, much like Vite.
Show HN: Tilde.run – Agent sandbox with a transactional, versioned filesystem
A tool providing an isolated sandbox where AI agents can touch real production data and still roll back, unifying GitHub/S3/Google Drive into a single version-controlled filesystem.
Original Abstract
Large Language Models (LLMs) show promise in healthcare tasks but face challenges in complex medical scenarios. We developed a Multi-Agent Conversation (MAC) framework for disease diagnosis, inspired by clinical Multi-Disciplinary Team discussions. Using 302 rare disease cases, we evaluated GPT-3.5, GPT-4, and MAC on medical knowledge and clinical reasoning. MAC outperformed single models in both primary and follow-up consultations, achieving higher accuracy in diagnoses and suggested tests. Optimal performance was achieved with four doctor agents and a supervisor agent, using GPT-4 as the base model. MAC demonstrated high consistency across repeated runs. Further comparative analysis showed MAC also outperformed other methods including Chain of Thoughts (CoT), Self-Refine, and Self-Consistency with higher performance and more output tokens. This framework significantly enhanced LLMs’ diagnostic capabilities, effectively bridging theoretical knowledge and practical clinical application. Our findings highlight the potential of multi-agent LLMs in healthcare and suggest further research into their clinical implementation.