Comprehensive testing of large language models for extraction of structured data in pathology
TL;DR Highlight
No need for GPT-4 — open-source LLMs match its performance on structured extraction from pathology reports.
Who Should Read
Healthcare AI engineers and clinical NLP practitioners who need to extract structured data from medical reports while keeping data on-premise.
Core Mechanics
- Benchmarks open-source LLMs (Llama2 13B/70B, Llama3 8B/70B, Qwen2.5 7B) against GPT-4 on structured extraction from pathology reports
- Top open-source models match GPT-4's accuracy when extracting eleven key parameters from the reports
- Smaller 7B–13B parameter models perform surprisingly well with the right prompting and quantization, making deployment on consumer-grade hardware feasible
- Few-shot prompting with domain-specific examples significantly boosts open-source model performance
- Privacy and compliance benefits of running locally outweigh the marginal performance gap
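The zero-shot vs. few-shot distinction driving the prompt-engineering results can be sketched as a simple prompt builder. This is a minimal illustration, not the paper's actual prompts; the field names and the example record are invented for this sketch.

```python
# Sketch of zero-shot vs. few-shot prompt assembly. Field names and the
# example record are illustrative, not taken from the paper's dataset.
FIELDS = ["diagnosis", "tumor_size_cm", "er_status"]

def build_prompt(report, examples=()):
    """Zero-shot when `examples` is empty; few-shot otherwise."""
    lines = ["Extract these fields as JSON: " + ", ".join(FIELDS) + "."]
    # Each few-shot example is a (report, expected JSON) pair shown in full
    for ex_report, ex_json in examples:
        lines.append(f"Report:\n{ex_report}\nJSON:\n{ex_json}")
    # The target report comes last, ending with an open "JSON:" cue
    lines.append(f"Report:\n{report}\nJSON:")
    return "\n\n".join(lines)

zero_shot = build_prompt("ER: positive. Tumor 1.2 cm.")
few_shot = build_prompt(
    "ER: positive. Tumor 1.2 cm.",
    examples=[("ER: negative. Tumor 2.0 cm.",
               '{"diagnosis": null, "tumor_size_cm": 2.0, "er_status": "negative"}')],
)
print(len(few_shot) > len(zero_shot))  # prints: True
```

Adding even one or two such worked examples is the cheapest lever for closing the gap to GPT-4, since it requires no training infrastructure.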
Evidence
- Direct benchmark on 579 annotated pathology reports, available in German and English versions, with GPT-4 as the baseline
- Extraction accuracy of the best open-source models matches GPT-4 across eleven key parameters, though precision varies by model and configuration
- Tested across multiple prompt engineering strategies and quantization levels for consumer-grade deployment
How to Apply
- For structured extraction from medical documents, start with a few-shot prompting approach using 3–5 domain-specific examples before considering fine-tuning.
- Use open-source models (e.g., Llama 3, Qwen2.5) on-premise when patient data cannot leave your infrastructure.
- Evaluate extraction quality field-by-field rather than with a single aggregate metric to catch per-field regressions.
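The field-by-field evaluation recommended above can be sketched as follows. This is a hedged example with invented field names and records, not the paper's evaluation code; exact-match accuracy stands in for whatever per-field metric you prefer.

```python
# Sketch: per-field exact-match accuracy over paired gold/predicted records.
# Field names and data are illustrative, not from the paper's dataset.
def per_field_accuracy(gold, predicted, fields):
    """Return {field: fraction of records where prediction matches gold}."""
    scores = {}
    for field in fields:
        hits = sum(1 for g, p in zip(gold, predicted) if g.get(field) == p.get(field))
        scores[field] = hits / len(gold)
    return scores

gold = [
    {"er_status": "positive", "tumor_size_cm": 1.8},
    {"er_status": "negative", "tumor_size_cm": 2.4},
]
predicted = [
    {"er_status": "positive", "tumor_size_cm": 1.8},
    {"er_status": "positive", "tumor_size_cm": 2.4},  # er_status wrong here
]
scores = per_field_accuracy(gold, predicted, ["er_status", "tumor_size_cm"])
print(scores)  # prints: {'er_status': 0.5, 'tumor_size_cm': 1.0}
```

A single aggregate score would report 75% here and hide the fact that one field regressed to a coin flip; the per-field breakdown surfaces it immediately.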
Code Example
# Example of extracting structured data from pathology reports using Ollama + Llama3 70B
import ollama
import json
report = """
Diagnosis: Invasive ductal carcinoma, grade 2.
Tumor size: 1.8 cm. Lymph nodes: 0/5 positive.
ER: positive, PR: positive, HER2: negative.
"""
prompt = f"""Extract the following fields from the pathology report as JSON:
- diagnosis
- grade
- tumor_size_cm
- lymph_nodes_positive
- er_status
- pr_status
- her2_status
Return only valid JSON, no explanation.
Few-shot example output:
{{"diagnosis": "Invasive ductal carcinoma", "grade": 2, "tumor_size_cm": 1.8,
"lymph_nodes_positive": "0/5", "er_status": "positive",
"pr_status": "positive", "her2_status": "negative"}}
Report:
{report}
Now extract from the report above:"""
response = ollama.chat(
    model="llama3:70b",
    messages=[{"role": "user", "content": prompt}],
)
try:
    structured = json.loads(response["message"]["content"])
    print(structured)
except json.JSONDecodeError:
    # Model output may include prose or code fences around the JSON
    print("Parsing failed - add retry or output-fixing logic")
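The retry and output-fixing logic that the `except` branch alludes to can be sketched like this. The helper names are illustrative, and `chat_fn` stands in for any model call (e.g., a thin wrapper around `ollama.chat`); the JSON-recovery regex handles the common failure mode where the model wraps its answer in prose or code fences.

```python
import json
import re

def coerce_json(text):
    """Best-effort recovery of a JSON object from model output that may
    include code fences or surrounding prose."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise json.JSONDecodeError("no JSON object found", text, 0)
    return json.loads(match.group(0))

def extract_with_retry(chat_fn, prompt, retries=2):
    """Call the model up to `retries + 1` times until output parses.

    chat_fn(prompt) -> str is a hypothetical wrapper around your model call.
    """
    for _ in range(retries + 1):
        try:
            return coerce_json(chat_fn(prompt))
        except json.JSONDecodeError:
            continue  # re-ask the model on malformed output
    raise ValueError("model never returned parseable JSON")

# Simulated messy model output, for demonstration only
messy = 'Here is the JSON:\n```json\n{"er_status": "positive"}\n```'
result = extract_with_retry(lambda p: messy, "extract fields...")
print(result)  # prints: {'er_status': 'positive'}
```

For local models in particular, pairing a lenient parser with a bounded retry loop is usually enough to get reliable structured output without fine-tuning.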
Original Abstract
Pathology departments generate large volumes of unstructured data as free-text diagnostic reports. Converting these reports into structured formats for analytics or artificial intelligence projects requires substantial manual effort by specialized personnel. While recent studies show promise in using advanced language models for structuring pathology data, they primarily rely on proprietary models, raising cost and privacy concerns. Additionally, important aspects such as prompt engineering and model quantization for deployment on consumer-grade hardware remain unaddressed. We created a dataset of 579 annotated pathology reports in German and English versions. Six language models (proprietary: GPT-4; open-source: Llama2 13B, Llama2 70B, Llama3 8B, Llama3 70B, and Qwen2.5 7B) were evaluated for their ability to extract eleven key parameters from these reports. Additionally, we investigated model performance across different prompt engineering strategies and model quantization techniques to assess practical deployment scenarios. Here we show that open-source language models extract structured data from pathology reports with high precision, matching the accuracy of the proprietary GPT-4 model. The precision varies significantly across different models and configurations. These variations depend on specific prompt engineering strategies and quantization methods used during model deployment. Open-source language models demonstrate comparable performance to proprietary solutions in structuring pathology report data. This finding has significant implications for healthcare institutions seeking cost-effective, privacy-preserving data structuring solutions. The variations in model performance across different configurations provide valuable insights for practical deployment in pathology departments. Our publicly available bilingual dataset serves as both a benchmark and a resource for future research.
Pathology departments produce many diagnostic reports as free text, which is hard to analyze or use in research and computational projects. Converting this free text into standardized, organized information, such as test results or diagnoses, makes it easier to use. This task often requires human experts and takes time. Large language models (LLMs), which are advanced computer systems designed to understand and generate human-like text, might simplify this process. Here, we tested six LLMs, including freely available models and the commercial GPT-4 model, using 579 pathology reports in English and German. Our results show that freely available models can perform as well as commercial ones, providing a cheaper solution while avoiding privacy concerns. The shared dataset will support future research in pathology data processing. Grothey et al. examine the performance of large language models in structuring pathology reports. Findings demonstrate similar accuracy between commercial and open-source models, providing a cost-effective, privacy-conscious solution for extracting structured data with high precision from bilingual datasets.