Comprehensive testing of large language models for extraction of structured data in pathology
TL;DR Highlight
GPT-4 is not required: open-source LLMs match its accuracy on structured extraction from pathology reports.
Who Should Read
Healthcare AI engineers and clinical NLP practitioners who need to extract structured data from medical reports while keeping data on-premise.
Core Mechanics
- Benchmarks five open-source LLMs (Llama2 13B, Llama2 70B, Llama3 8B, Llama3 70B, Qwen2.5 7B) against GPT-4 on structured extraction from pathology reports
- Open-source models extract eleven key parameters with high precision, matching the accuracy of GPT-4
- Precision varies significantly across models and configurations, so the choice of open-source model matters
- The variation depends on the prompt engineering strategy (for example, few-shot prompting with domain-specific examples) and on the quantization method used for deployment
- Running locally addresses the cost and privacy concerns of proprietary models without giving up accuracy
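Few-shot prompting, as mentioned in the list above, can be sketched as a small message builder. This is only an illustration, not the prompt used in the paper: the field list, the example report, and the function name `build_few_shot_messages` are all assumptions made up for this sketch.

```python
import json

# Hypothetical field list and worked example, invented for illustration only.
FIELDS = ["diagnosis", "tumor_size_cm", "er_status"]
EXAMPLES = [
    (
        "Diagnosis: Invasive lobular carcinoma. Tumor size: 2.4 cm. ER: negative.",
        {
            "diagnosis": "Invasive lobular carcinoma",
            "tumor_size_cm": 2.4,
            "er_status": "negative",
        },
    ),
]


def build_few_shot_messages(report, examples=EXAMPLES, fields=FIELDS):
    """Return chat messages: instructions, worked examples, then the new report."""
    system = (
        "Extract these fields from the pathology report as JSON: "
        + ", ".join(fields)
        + ". Use null for fields that are not stated. Return only JSON."
    )
    messages = [{"role": "system", "content": system}]
    # Each example becomes a user turn (report) and an assistant turn (expected JSON).
    for example_report, example_output in examples:
        messages.append({"role": "user", "content": example_report})
        messages.append({"role": "assistant", "content": json.dumps(example_output)})
    # The report to be processed always comes last.
    messages.append({"role": "user", "content": report})
    return messages


messages = build_few_shot_messages("Diagnosis: Ductal carcinoma in situ. ER: positive.")
```

Keeping the examples as separate user and assistant turns keeps the worked answers clearly apart from the report being processed, which makes it harder to leak the expected answer into the prompt by accident.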
Evidence
- Direct comparison on 579 annotated pathology reports, available in German and English versions, with GPT-4 as the proprietary baseline
- Each model is scored on extracting eleven key parameters; open-source models match GPT-4's accuracy
- Tested across different prompt engineering strategies and model quantization techniques
How to Apply
- For structured extraction from medical documents, start with a few-shot prompting approach using 3–5 domain-specific examples before considering fine-tuning.
- Use open-source models (Llama 3, Qwen2.5) on-premise when patient data cannot leave your infrastructure, and re-check accuracy after quantization.
- Evaluate extraction quality field-by-field rather than with a single aggregate metric to catch per-field regressions.
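Field-by-field evaluation, as recommended in the last point above, could look like the following minimal sketch. It assumes exact-match scoring against annotated reference values, which may differ from the metric used in the paper; the function name `per_field_accuracy` and the sample records are hypothetical.

```python
def per_field_accuracy(predictions, references, fields):
    """Return, per field, the fraction of reports whose value matches the annotation."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    total = len(references)
    scores = {}
    for field in fields:
        # A missing field in the prediction counts as a miss unless the reference lacks it too.
        correct = sum(
            1 for pred, ref in zip(predictions, references)
            if pred.get(field) == ref.get(field)
        )
        scores[field] = correct / total if total else 0.0
    return scores


# Made-up records for illustration only.
references = [
    {"er_status": "positive", "tumor_size_cm": 1.8},
    {"er_status": "negative", "tumor_size_cm": 2.4},
]
predictions = [
    {"er_status": "positive", "tumor_size_cm": 1.8},
    {"er_status": "negative", "tumor_size_cm": 2.0},
]
scores = per_field_accuracy(predictions, references, ["er_status", "tumor_size_cm"])
print(scores)
```

In this toy data the aggregate looks acceptable while `tumor_size_cm` is wrong in half of the reports, which is the kind of per-field regression a single overall number would hide.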
Code Example
# Example: extracting structured data from a pathology report with Ollama + Llama 3 70B
# Assumes a local Ollama server with the llama3:70b model already pulled.
import json

import ollama

report = """
Diagnosis: Invasive ductal carcinoma, grade 2.
Tumor size: 1.8 cm. Lymph nodes: 0/5 positive.
ER: positive, PR: positive, HER2: negative.
"""

# The few-shot example uses a different, made-up report so that the answer
# for `report` is not leaked into the prompt.
example_report = """
Diagnosis: Invasive lobular carcinoma, grade 1.
Tumor size: 2.4 cm. Lymph nodes: 1/7 positive.
ER: positive, PR: negative, HER2: negative.
"""
example_output = {
    "diagnosis": "Invasive lobular carcinoma", "grade": 1, "tumor_size_cm": 2.4,
    "lymph_nodes_positive": "1/7", "er_status": "positive",
    "pr_status": "negative", "her2_status": "negative",
}

prompt = f"""Extract the following fields from the pathology report as JSON:
- diagnosis
- grade
- tumor_size_cm
- lymph_nodes_positive
- er_status
- pr_status
- her2_status
Return only valid JSON, no explanation.

Example report:
{example_report}
Example output:
{json.dumps(example_output)}

Now extract from this report:
{report}"""

response = ollama.chat(
    model="llama3:70b",
    messages=[{"role": "user", "content": prompt}],
    format="json",  # ask Ollama to constrain the reply to valid JSON
)

try:
    structured = json.loads(response["message"]["content"])
    print(structured)
except json.JSONDecodeError:
    print("Parsing failed: add retry or output-fixing logic")
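The retry and output-fixing logic mentioned in the final `except` branch above could be sketched as follows. This is one possible approach, not something taken from the paper: `parse_json_reply`, `extract_with_retry`, and the `ask_model` callable are hypothetical names, and `ask_model` stands in for whatever function sends a prompt to the model and returns its reply as text.

```python
import json


def parse_json_reply(text):
    """Return the first JSON object embedded in a model reply, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next opening brace.
            start = text.find("{", start + 1)
    return None


def extract_with_retry(ask_model, prompt, max_attempts=3):
    """Call ask_model(prompt) until the reply contains parseable JSON, up to max_attempts."""
    for _ in range(max_attempts):
        parsed = parse_json_reply(ask_model(prompt))
        if parsed is not None:
            return parsed
    return None


# Usage with a stand-in model that fails once, then wraps its JSON in extra text.
replies = iter(["Sorry, I cannot do that.", 'Result:\n{"er_status": "positive"}\nDone.'])
result = extract_with_retry(lambda prompt: next(replies), "Extract er_status as JSON.")
print(result)
```

Returning `None` after the last attempt leaves the decision to the caller, for example routing the report to manual review instead of storing a partial record.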
Original Abstract
Pathology departments generate large volumes of unstructured data as free-text diagnostic reports. Converting these reports into structured formats for analytics or artificial intelligence projects requires substantial manual effort by specialized personnel. While recent studies show promise in using advanced language models for structuring pathology data, they primarily rely on proprietary models, raising cost and privacy concerns. Additionally, important aspects such as prompt engineering and model quantization for deployment on consumer-grade hardware remain unaddressed. We created a dataset of 579 annotated pathology reports in German and English versions. Six language models (proprietary: GPT-4; open-source: Llama2 13B, Llama2 70B, Llama3 8B, Llama3 70B, and Qwen2.5 7B) were evaluated for their ability to extract eleven key parameters from these reports. Additionally, we investigated model performance across different prompt engineering strategies and model quantization techniques to assess practical deployment scenarios. Here we show that open-source language models extract structured data from pathology reports with high precision, matching the accuracy of proprietary GPT-4 model. The precision varies significantly across different models and configurations. These variations depend on specific prompt engineering strategies and quantization methods used during model deployment. Open-source language models demonstrate comparable performance to proprietary solutions in structuring pathology report data. This finding has significant implications for healthcare institutions seeking cost-effective, privacy-preserving data structuring solutions. The variations in model performance across different configurations provide valuable insights for practical deployment in pathology departments. Our publicly available bilingual dataset serves as both a benchmark and a resource for future research. 
Pathology departments produce many diagnostic reports as free text, which is hard to analyze or use in research and computer projects. Converting this free text into more standard organized information like test results or diagnoses, makes it easier to use. This task often requires human experts and takes time. Large language models (LLMs), which are advanced computer systems designed to understand and generate human-like text, might simplify this process. Here, we tested six LLMs, including freely available models and the commercial GPT-4 model, using 579 pathology reports in English and German. Our results show that freely available models can perform as well as commercial, providing a cheaper solution while avoiding privacy concerns. The shared dataset will support future research in pathology data processing. Grothey et al. examine the performance of large language models in structuring pathology reports. Findings demonstrate similar accuracy between commercial and open-source models providing a cost-effective, privacy-conscious solution to extract structured data with high precision from bilingual datasets.