DBAutoDoc: Automatic Discovery and Documentation of Undocumented DB Schemas with Statistical Analysis + Iterative LLM Refinement

DBAutoDoc: Automated Discovery and Documentation of Undocumented Database Schemas via Statistical Analysis and Iterative LLM Refinement

Mar 24, 2026 • Amith Nagarajan, Thomas Altman

TL;DR Highlight

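Automatically documents legacy "dark" databases through backpropagation-like iterative refinement: 96.1% on a composite metric at about $0.70 per 100 tables, a reported 99.5% saving over manual documentation.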

Who Should Read

Backend developers, DBAs, and data engineers who have inherited an undocumented legacy database, and anyone responsible for enterprise data migration.

Core Mechanics

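Reaches 96.1% on the composite metric (on AdventureWorks, with both Gemini and Claude models): FK F1 94.2%, PK F1 95.0%, description coverage 99%.
Core idea: treat the schema as a graph and, in a structure resembling backpropagation, repeatedly propagate meaning between parent and child tables. Median convergence: 2 iterations.
The LLM is central to FK detection: statistics alone 15/91 (20% precision) vs. LLM alone 75/84 (89% precision). The deterministic gates add +23 F1 points over the LLM alone.
Validated on private enterprise DBs: OrgA (36 tables) 97% PK coverage, OrgB (125 tables) 93% PK coverage, which rules out memorization from pre-training.
Sonnet 4.6/Opus 4.6 match Gemini's quality with 7x fewer tokens, making them the most cost-efficient choice.
Manual documentation: 2 to 4 hours per table, $12,000 to $48,000 per 100 tables. DBAutoDoc: ~$0.70 per 100 tables (99.5% saving).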
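The iterate-until-convergence idea can be illustrated with a toy sketch. This is not DBAutoDoc's implementation: the LLM refinement step is replaced by a hypothetical stub (`refine`) that merges the context tags of linked tables, and the names `Descriptions`, `propagate`, `links`, and `seed` are assumptions made for illustration. It only shows how context reaches tables that are not directly linked, and how the loop stops once a full pass changes nothing.

```typescript
// Table name -> context tags currently attached to that table (toy model of a description).
type Descriptions = Record<string, string[]>;

// Hypothetical stand-in for the LLM call: a table absorbs the tags of its neighbors.
function refine(own: string[], neighborTags: string[][]): string[] {
  const merged = own.slice();
  for (const tags of neighborTags) {
    for (const tag of tags) {
      if (merged.indexOf(tag) === -1) merged.push(tag);
    }
  }
  return merged.sort();
}

// Repeats refinement passes over the whole graph until a pass changes nothing
// or maxIterations is reached. Each pass reads only the previous pass's output.
function propagate(
  neighbors: Record<string, string[]>,
  initial: Descriptions,
  maxIterations: number
): { descriptions: Descriptions; iterations: number } {
  let current = initial;
  for (let iter = 1; iter <= maxIterations; iter++) {
    const previous = current;
    const next: Descriptions = {};
    let changed = false;
    for (const table of Object.keys(previous)) {
      const own = previous[table] || [];
      const context = (neighbors[table] || []).map((n) => previous[n] || []);
      const refined = refine(own, context);
      if (refined.length !== own.length) changed = true;
      next[table] = refined;
    }
    current = next;
    if (!changed) return { descriptions: current, iterations: iter };
  }
  return { descriptions: current, iterations: maxIterations };
}

// Example: a three-table chain Customer - Order - OrderLine.
const links: Record<string, string[]> = {
  Customer: ["Order"],
  Order: ["Customer", "OrderLine"],
  OrderLine: ["Order"],
};
const seed: Descriptions = {
  Customer: ["customer"],
  Order: ["order"],
  OrderLine: ["line-item"],
};
const outcome = propagate(links, seed, 10);
console.log(outcome.iterations, outcome.descriptions);
```

In this chain, Customer only learns about OrderLine on the second pass, through Order, and the third pass confirms that nothing changes any more.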

Evidence

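Evaluated on 4 public benchmarks (AdventureWorks, Chinook, Northwind, LousyDB) + 2 private enterprise DBs (OrgA, OrgB).
Ablation: removing the deterministic gates drops FK F1 from 94.2% to 71.7% (-22.5pp), and statistics alone reach 30%, so each layer's contribution is cleanly separated.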
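To make the role of the deterministic gates concrete, here is a minimal sketch of what such hard filters could look like. The summary does not list the eight gates, so the three checks below (matching data types, uniqueness of the referenced column, and containment of child values) are assumptions chosen for illustration, as are the names `ColumnProfile`, `GateVerdict`, and `checkFkCandidate`. The containment check is only conclusive if `values` holds the full column rather than a sample.

```typescript
type Cell = string | number | null;

// Hypothetical profile of one column; `values` is assumed to hold every row.
interface ColumnProfile {
  table: string;
  column: string;
  dataType: string;
  values: Cell[];
}

interface GateVerdict {
  pass: boolean;
  reason: string;
}

function nonNull(values: Cell[]): Array<string | number> {
  const out: Array<string | number> = [];
  for (const v of values) {
    if (v !== null) out.push(v);
  }
  return out;
}

function isUnique(values: Array<string | number>): boolean {
  const seen: Record<string, boolean> = {};
  for (const v of values) {
    const key = typeof v + ":" + String(v);
    if (seen[key]) return false;
    seen[key] = true;
  }
  return true;
}

// Returns pass=false as soon as one gate proves the FK child -> parent impossible.
function checkFkCandidate(child: ColumnProfile, parent: ColumnProfile): GateVerdict {
  if (child.dataType !== parent.dataType) {
    return { pass: false, reason: "data types differ" };
  }
  const parentValues = nonNull(parent.values);
  if (!isUnique(parentValues)) {
    return { pass: false, reason: "referenced column is not unique" };
  }
  for (const v of nonNull(child.values)) {
    if (parentValues.indexOf(v) === -1) {
      return { pass: false, reason: "child value missing from referenced column" };
    }
  }
  return { pass: true, reason: "no gate ruled it out" };
}

// Example columns (made-up data).
const customerId: ColumnProfile = { table: "Customer", column: "CustomerID", dataType: "int", values: [1, 2, 3] };
const orderCust: ColumnProfile = { table: "Order", column: "CustID", dataType: "int", values: [1, 1, 3, null] };
const orderQty: ColumnProfile = { table: "Order", column: "Qty", dataType: "int", values: [1, 5, 9] };

console.log(checkFkCandidate(orderCust, customerId)); // survives the gates
console.log(checkFkCandidate(orderQty, customerId)); // rejected: 5 and 9 are not customer IDs
```

A candidate that survives is not confirmed as an FK; it simply has not been ruled out, and would still go on to the LLM stage for judgment.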

How to Apply

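Install right away with npm install @memberjunction/db-auto-doc. Supports SQL Server, PostgreSQL, and MySQL.
Supplying ground truth injection (verified information) and seed context (domain hints) speeds up convergence: 3 to 5 iterations with neither, 2 iterations with both.
For DBs containing PII, set sampleSize: 0 to use structural metadata only, without sending actual values (for GDPR and similar regulatory compliance).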

Terminology

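Dark Database: a completely undocumented production DB with no declared PKs/FKs, no column descriptions, and no ERD.

Context Propagation: an iterative structure that feeds the analysis of child tables back into improving the descriptions of parent tables (similar to backpropagation).

Deterministic Gates: 8 hard filters that eliminate, at no cost, cases where an FK is mathematically impossible.

Ground Truth Injection: human-verified information anchors that the LLM never overwrites; they serve as the reference points for iterative propagation.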
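The "never overwritten" property of ground truth can be sketched as a merge rule applied after each refinement pass. This is an illustrative assumption, not the package's actual data model: `DocEntry`, `verifiedByHuman`, and `mergeIteration` are hypothetical names, and the LLM output is represented as a plain map of proposed descriptions.

```typescript
// Hypothetical documentation entry for one column, keyed as "Table.Column".
interface DocEntry {
  description: string;
  verifiedByHuman: boolean;
}

type SchemaDoc = Record<string, DocEntry>;

// Applies one pass of LLM proposals; human-verified entries are kept untouched.
function mergeIteration(previous: SchemaDoc, proposals: Record<string, string>): SchemaDoc {
  const next: SchemaDoc = {};
  for (const key of Object.keys(previous)) {
    const entry = previous[key];
    const proposal = proposals[key];
    if (entry.verifiedByHuman || proposal === undefined) {
      next[key] = entry; // anchor, or nothing proposed: keep as-is
    } else {
      next[key] = { description: proposal, verifiedByHuman: false };
    }
  }
  return next;
}

// Example with made-up descriptions.
const previousDoc: SchemaDoc = {
  "Orders.CustID": { description: "Customer who placed the order", verifiedByHuman: true },
  "Orders.Amt": { description: "Unknown numeric column", verifiedByHuman: false },
};
const llmProposals: Record<string, string> = {
  "Orders.CustID": "Internal row counter",
  "Orders.Amt": "Order total amount",
};
const mergedDoc = mergeIteration(previousDoc, llmProposals);
console.log(mergedDoc);
```

Because verified entries stay fixed across passes, neighboring tables keep receiving the same trusted context, which is one plausible reading of why anchors shorten convergence.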

Original Abstract

A tremendous number of critical database systems lack adequate documentation. Declared primary keys are absent, foreign key constraints have been dropped for performance, column names are cryptic abbreviations, and no entity-relationship diagrams exist. We present DBAutoDoc, a system that automates the discovery and documentation of undocumented relational database schemas by combining statistical data analysis with iterative large language model (LLM) refinement. DBAutoDoc's central insight is that schema understanding is fundamentally an iterative, graph-structured problem. Drawing structural inspiration from backpropagation in neural networks, DBAutoDoc propagates semantic corrections through schema dependency graphs across multiple refinement iterations until descriptions converge. This propagation is discrete and semantic rather than mathematical, but the structural analogy is precise: early iterations produce rough descriptions akin to random initialization, and successive passes sharpen the global picture as context flows through the graph. The system makes four concrete contributions detailed in the paper. On a suite of benchmark databases, DBAutoDoc achieved overall weighted scores of 96.1% across two model families (Google's Gemini and Anthropic's Claude) using a composite metric. Ablation analysis demonstrates that the deterministic pipeline contributes a 23-point F1 improvement over LLM-only FK detection, confirming that the system's contribution is substantial and independent of LLM pre-training knowledge. DBAutoDoc is released as open-source software with all evaluation configurations and prompt templates included for full reproducibility.