Comprehensive testing of large language models for extraction of structured data in pathology
TL;DR Highlight
No need for GPT-4 — open-source LLMs match its performance on structured extraction from pathology reports.
Who Should Read
Healthcare AI engineers and clinical NLP practitioners who need to extract structured data from medical reports while keeping data on-premise.
Core Mechanics
- Benchmarks five open-source LLMs (Llama2 13B/70B, Llama3 8B/70B, Qwen2.5 7B) against GPT-4 on structured extraction from pathology reports
- Top open-source models achieve comparable F1 scores to GPT-4 on named entity recognition and structured field extraction
- Smaller 7B–13B parameter models perform surprisingly well with the right prompting and quantization, making deployment on consumer-grade hardware feasible
- Few-shot prompting with domain-specific examples significantly boosts open-source model performance
- Privacy and compliance benefits of running locally outweigh the marginal performance gap
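The few-shot pattern described above can be sketched as a small prompt builder. The function name and prompt layout here are illustrative, not taken from the paper:

```python
import json

def build_fewshot_prompt(fields, examples, report):
    """Assemble a few-shot extraction prompt: field list, worked
    (report -> JSON) examples, then the target report."""
    lines = ["Extract the following fields from the pathology report as JSON:"]
    lines += [f"- {field}" for field in fields]
    lines.append("Return only valid JSON, no explanation.")
    for example_report, example_json in examples:
        lines += ["Report:", example_report.strip(),
                  "Output:", json.dumps(example_json)]
    # The target report ends with an open "Output:" cue for the model.
    lines += ["Report:", report.strip(), "Output:"]
    return "\n".join(lines)
```

With 3–5 curated (report, JSON) pairs per subspecialty, this gives the domain-specific few-shot setup the bullet describes without any fine-tuning.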
Evidence
- Direct benchmark comparison on a pathology report dataset with GPT-4 as the baseline
- F1 scores within a few percentage points of GPT-4 for key extraction fields
- Tested with zero-shot, few-shot, and fine-tuned variants
How to Apply
- For structured extraction from medical documents, start with a few-shot prompting approach using 3–5 domain-specific examples before considering fine-tuning.
- Use open-source models (Llama3, Qwen2.5) on-premise when patient data cannot leave your infrastructure.
- Evaluate extraction quality field-by-field rather than with a single aggregate metric to catch per-field regressions.
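The field-by-field evaluation advice can be sketched as per-field exact-match accuracy, a simple stand-in for the paper's metrics; the helper below is illustrative:

```python
def per_field_accuracy(predictions, gold):
    """Exact-match accuracy for each field, computed over parallel
    lists of predicted and gold-standard dicts (one dict per report)."""
    scores = {}
    for field in gold[0]:
        hits = sum(1 for p, g in zip(predictions, gold)
                   if p.get(field) == g.get(field))
        scores[field] = hits / len(gold)
    return scores

pred = [{"er_status": "positive", "tumor_size_cm": 1.8},
        {"er_status": "negative", "tumor_size_cm": 2.0}]
gold = [{"er_status": "positive", "tumor_size_cm": 1.8},
        {"er_status": "positive", "tumor_size_cm": 2.0}]
print(per_field_accuracy(pred, gold))  # er_status drops to 0.5 while tumor_size_cm stays at 1.0
```

A single aggregate score would report 0.75 here and hide the fact that one field regressed completely, which is exactly what the per-field view catches.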
Code Example
# Example of extracting structured data from pathology reports using Ollama + Llama3 70B
import ollama
import json
report = """
Diagnosis: Invasive ductal carcinoma, grade 2.
Tumor size: 1.8 cm. Lymph nodes: 0/5 positive.
ER: positive, PR: positive, HER2: negative.
"""
prompt = f"""Extract the following fields from the pathology report as JSON:
- diagnosis
- grade
- tumor_size_cm
- lymph_nodes_positive
- er_status
- pr_status
- her2_status
Return only valid JSON, no explanation.
Report:
{report}
Few-shot example output:
{{"diagnosis": "Invasive ductal carcinoma", "grade": 2, "tumor_size_cm": 1.8,
"lymph_nodes_positive": "0/5", "er_status": "positive",
"pr_status": "positive", "her2_status": "negative"}}
Now extract from the report above:"""
response = ollama.chat(
    model="llama3:70b",
    messages=[{"role": "user", "content": prompt}],
)
try:
    structured = json.loads(response["message"]["content"])
    print(structured)
except json.JSONDecodeError:
    print("Parsing failed; add retry or output-fixing logic")
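The parsing-failure branch can be backed by a small output-fixing fallback before retrying the model. This regex-based salvage function is a common pattern, not something from the paper:

```python
import json
import re

def extract_json(text):
    """Salvage a JSON object from a model reply that wraps it in
    markdown code fences or surrounding prose; None if nothing parses."""
    fence = "`" * 3  # markdown fence marker, built indirectly
    text = text.replace(fence + "json", "").replace(fence, "")
    match = re.search(r"\{.*\}", text, re.DOTALL)  # grab the outermost {...}
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```

Only fall back to re-prompting the model when this returns None; a string-level fix is cheaper than another 70B inference call.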
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically showing that while LLM-written TLA+ specifications pass syntax checks, their behavioral conformance to the real system is only around 46%, exposing the practical limits of AI-driven formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic introduces NLA, a technique that converts the numeric vectors (activations) inside an LLM into directly readable natural language, a new advance in interpretability research into what an AI is actually "thinking."
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best model passes 95% or more of the tests on only 3% of tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a task into three tickets and even Claude/GPT will happily write security-vulnerable code 53–86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Original Abstract
Pathology departments generate large volumes of unstructured data as free-text diagnostic reports. Converting these reports into structured formats for analytics or artificial intelligence projects requires substantial manual effort by specialized personnel. While recent studies show promise in using advanced language models for structuring pathology data, they primarily rely on proprietary models, raising cost and privacy concerns. Additionally, important aspects such as prompt engineering and model quantization for deployment on consumer-grade hardware remain unaddressed. We created a dataset of 579 annotated pathology reports in German and English versions. Six language models (proprietary: GPT-4; open-source: Llama2 13B, Llama2 70B, Llama3 8B, Llama3 70B, and Qwen2.5 7B) were evaluated for their ability to extract eleven key parameters from these reports. Additionally, we investigated model performance across different prompt engineering strategies and model quantization techniques to assess practical deployment scenarios. Here we show that open-source language models extract structured data from pathology reports with high precision, matching the accuracy of proprietary GPT-4 model. The precision varies significantly across different models and configurations. These variations depend on specific prompt engineering strategies and quantization methods used during model deployment. Open-source language models demonstrate comparable performance to proprietary solutions in structuring pathology report data. This finding has significant implications for healthcare institutions seeking cost-effective, privacy-preserving data structuring solutions. The variations in model performance across different configurations provide valuable insights for practical deployment in pathology departments. Our publicly available bilingual dataset serves as both a benchmark and a resource for future research. 
Pathology departments produce many diagnostic reports as free text, which is hard to analyze or use in research and software projects. Converting this free text into standardized, organized information such as test results or diagnoses makes it easier to use. This task often requires human experts and takes time. Large language models (LLMs), advanced computer systems designed to understand and generate human-like text, might simplify this process. Here, we tested six LLMs, including freely available models and the commercial GPT-4 model, on 579 pathology reports in English and German. Our results show that freely available models can perform as well as commercial ones, providing a cheaper solution while avoiding privacy concerns. The shared dataset will support future research in pathology data processing. Grothey et al. examine the performance of large language models in structuring pathology reports. Findings demonstrate similar accuracy between commercial and open-source models, providing a cost-effective, privacy-conscious solution to extract structured data with high precision from bilingual datasets.