Leveraging long context in retrieval augmented language models for medical question answering
TL;DR Highlight
BriefContext, a map-reduce strategy, keeps RAG from overlooking key information that lands in the middle of a long retrieved medical context.
Who Should Read
Healthcare AI engineers building RAG systems for clinical documentation, EHR analysis, or medical literature search where critical information can appear anywhere in long documents.
Core Mechanics
- Standard RAG retrieves relevant chunks but LLMs show 'lost in the middle' degradation — information in the middle of long contexts receives less attention
- In medical documents, critical information (dosages, contraindications, lab values) can appear anywhere, so a model that under-attends to the middle of its context is particularly dangerous
- The proposed map-reduce RAG strategy: first MAP phase extracts key clinical information from each chunk independently, then REDUCE phase synthesizes the extracted information
- This two-phase approach gives each chunk independent attention before synthesis, which mitigates position bias
- The approach achieves higher recall of critical medical information than standard RAG while maintaining similar precision
- Particularly effective for structured medical documents (discharge summaries, clinical notes) with heterogeneous information distribution
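One way to check whether a pipeline suffers from the position bias described above is to move the same key chunk through every slot of the context and compare the answers. The sketch below only builds those orderings; `build_position_probes` and the placeholder chunk strings are illustrative names, not part of the paper's evaluation protocol.

```python
def build_position_probes(key_chunk: str, distractors: list[str]) -> list[list[str]]:
    """Return one chunk ordering per possible position of key_chunk."""
    probes = []
    for position in range(len(distractors) + 1):
        # Insert the key chunk at `position`, keeping distractor order fixed.
        ordering = distractors[:position] + [key_chunk] + distractors[position:]
        probes.append(ordering)
    return probes


# Hypothetical placeholder chunks; real probes would use retrieved passages.
probes = build_position_probes("KEY FACT", ["filler A", "filler B", "filler C"])
for index, ordering in enumerate(probes):
    print(index, ordering)
```

Feeding each ordering to the same QA pipeline and scoring the answers per position would show whether accuracy dips when the key chunk sits in the middle.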
Evidence
- On medical QA benchmarks: map-reduce RAG achieved 89% recall of critical clinical information vs. 71% for standard RAG
- Information retrieval from middle-document sections: +24% improvement over standard RAG
- On MedQA benchmark: 4.2% accuracy improvement over standard RAG baseline
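For readers who want to track similar numbers on their own data, recall and precision of critical information can be computed over sets of facts. This is a minimal sketch that assumes facts are compared as exact-match normalized strings; the summary does not say how the paper scored recall, and the fact labels below are made up for illustration.

```python
def recall_precision(expected: set[str], extracted: set[str]) -> tuple[float, float]:
    """Compute recall and precision of extracted facts against expected facts."""
    hits = len(expected & extracted)
    # Guard against empty sets to avoid division by zero.
    recall = hits / len(expected) if expected else 0.0
    precision = hits / len(extracted) if extracted else 0.0
    return recall, precision


# Hypothetical fact labels for one question.
expected_facts = {"dose", "contraindication", "egfr_threshold", "interaction"}
extracted_facts = {"dose", "egfr_threshold", "interaction", "allergy"}
print(recall_precision(expected_facts, extracted_facts))
```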
How to Apply
- For medical RAG: implement a 2-stage pipeline — Stage 1 (Map): for each retrieved chunk, extract structured clinical information (entities, values, relationships) independently. Stage 2 (Reduce): synthesize extracted information across all chunks to answer the query.
- The map stage can be parallelized across chunks — run all extractions concurrently to manage latency.
- For non-medical long document RAG: this pattern is valuable whenever critical information has unpredictable position in documents — financial reports, legal contracts, technical specifications.
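The concurrent map stage mentioned above can be sketched with a thread pool, which suits I/O-bound LLM calls. Here `parallel_map` and `fake_extract` are hypothetical names: `fake_extract` stands in for whatever per-chunk extraction call you use, and the worker count is an arbitrary default to tune against your provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def parallel_map(
    chunks: list[str],
    question: str,
    extract: Callable[[str, str], str],
    max_workers: int = 8,
) -> list[str]:
    """Run `extract(chunk, question)` for every chunk concurrently."""
    if not chunks:
        return []
    workers = min(max_workers, len(chunks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so extraction i
        # still corresponds to chunk i after the parallel run.
        return list(pool.map(lambda chunk: extract(chunk, question), chunks))


def fake_extract(chunk: str, question: str) -> str:
    """Stand-in for an LLM extraction call (hypothetical)."""
    return chunk.upper()


print(parallel_map(["chunk one", "chunk two"], "example question", fake_extract))
```

Because results come back in input order, the reduce stage can still cite which chunk each extraction came from.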
Code Example
# BriefContext map-reduce RAG pattern example
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def retrieve_documents(question: str) -> list[str]:
    """Placeholder for your existing retrieval step (hypothetical helper)."""
    raise NotImplementedError("Plug in your own retriever here")


def map_summarize(doc: str, question: str) -> str:
    """Individually summarize each document based on the question (map phase)"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a medical specialist summarizer. Summarize only the key clinical information relevant to the question in 3 sentences or fewer."},
            {"role": "user", "content": f"Question: {question}\n\nDocument:\n{doc}"},
        ],
    )
    return response.choices[0].message.content


def reduce_answer(summaries: list[str], question: str) -> str:
    """Combine summaries to generate the final answer (reduce phase)"""
    combined = "\n\n---\n\n".join(summaries)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a medical QA expert. Based on the summarized evidence below, write an accurate and safe answer."},
            {"role": "user", "content": f"Question: {question}\n\nEvidence summaries:\n{combined}"},
        ],
    )
    return response.choices[0].message.content


# Actual usage
question = "What are the contraindication criteria for metformin in patients with impaired renal function?"
docs = retrieve_documents(question)  # Existing retrieval step

# map: each call is independent, so this loop can be processed in parallel
summaries = [map_summarize(doc, question) for doc in docs]

# reduce
final_answer = reduce_answer(summaries, question)
print(final_answer)
Related Papers
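Show HN: Bible as RAG Database
A web service that indexes the entire Bible as a RAG (retrieval-augmented generation) database so that relevant verses can be searched semantically by topic or keyword. It is a practical example of applying RAG to religious texts and a useful reference for developers building similar projects.
Haystack: Open-Source AI Framework for Production Ready Agents, RAG
An open-source AI orchestration framework built by deepset. It is drawing attention as an alternative to LangChain and uses modular pipelines to take RAG, agent, and multimodal applications through to production.
We built a persistent agent memory layer on Elasticsearch with 0.89 recall
An architecture write-up of a multi-tenant long-term memory system built on Elasticsearch so that AI agents can remember user information after a session ends. It covers the concrete implementation that achieved an R@10 of 0.89 on 168 questions with zero cross-tenant data leaks.
TAHOE: Text-to-SQL with Automated Hint Optimization from Experience
A system that accumulates hints learned from an LLM's SQL generation failures into a reusable Hint Bank, substantially improving accuracy on Snowflake-dialect SQL without retraining the model.
Inside FAISS: Billion-Scale Similarity Search
An article that visually explains IVF (partitioning) and Product Quantization (compression), the core algorithms that let FAISS search billions of vectors quickly. It helps developers building RAG or vector search systems understand how FAISS works internally.
Show HN: Airbyte Agents – context for agents across multiple data sources
Airbyte released a Context Store that pre-indexes data from multiple SaaS systems such as Slack, Salesforce, and Linear so that agents do not have to comb through APIs one by one. It reports cutting token usage by up to 90% compared with the existing MCP approach.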
Original Abstract
While holding great promise for improving and facilitating healthcare through applications of medical literature summarization, large language models (LLMs) struggle to produce up-to-date responses on evolving topics due to outdated knowledge or hallucination. Retrieval-augmented generation (RAG) is a pivotal innovation that improves the accuracy and relevance of LLM responses by integrating LLMs with a search engine and external sources of knowledge. However, the quality of RAG responses can be largely impacted by the rank and density of key information in the retrieval results, such as the “lost-in-the-middle” problem. In this work, we aim to improve the robustness and reliability of the RAG workflow in the medical domain. Specifically, we propose a map-reduce strategy, BriefContext, to combat the “lost-in-the-middle” issue without modifying the model weights. We demonstrated the advantage of the workflow with various LLM backbones and on multiple QA datasets. This method promises to improve the safety and reliability of LLMs deployed in healthcare domains by reducing the risk of misinformation, ensuring critical clinical content is retained in generated responses, and enabling more trustworthy use of LLMs in critical tasks such as medical question answering, clinical decision support, and patient-facing applications.