Show HN: A new benchmark for testing LLMs for deterministic outputs
TL;DR Highlight
Structured Output Benchmark (SOB) scores LLM JSON output on seven separate metrics, exposing differences in value correctness that schema-compliance checks alone do not show.
Who Should Read
Backend and ML engineers who build or operate pipelines that use LLMs to extract structured data from documents, images, and audio. Particularly useful for developers running production systems where the accuracy of JSON output affects downstream systems.
Core Mechanics
- Existing benchmarks (JSONSchemaBench, StructEval, etc.) only verify that a response is parsable JSON and passes the schema. Perfectly formatted but incorrect JSON can therefore receive a perfect score, so these benchmarks do not measure real-world production reliability.
- SOB evaluates across text (HotpotQA 5,000), image (olmOCR-bench 209), and audio (AMI Meeting Corpus 115) modalities using a unified scoring pipeline, reflecting real-world input environments like OCR, screenshots, and meeting transcripts.
- Images and audio recordings are normalized to text before evaluation, isolating pure structured output capability and excluding vision or ASR (speech recognition) performance.
- SOB reports seven metrics separately: Value Accuracy (exact value match), JSON Pass Rate (parsability), Type Safety (type match), Structure Coverage (structure inclusion), Path Recall (required key inclusion), Faithfulness (source grounding), and Perfect Response (complete record match). Value Accuracy is the most critical metric for production.
- Two gates prevent score inflation: JSON parsing failures result in zero scores for all downstream semantic metrics, and Value Accuracy only scores fields actually returned by the model, penalizing omissions.
- Schema difficulty is tagged as easy (1.0), medium (2.0), and hard (3.0) with corresponding weights applied to the final leaderboard, rewarding models that handle complex nested structures well.
- All evaluations run with temperature 0.0, a maximum output of 2048 tokens, and reasoning/thinking modes disabled, so that scores reflect pure structured output and extraction ability.
- Leaderboard highlights: 1st GPT-5.4 (Overall 0.870, Value Acc 0.798), 2nd GLM-4.7 (0.861, 0.804), 3rd Qwen3.5-35B (0.861, 0.801), 4th Gemini-2.5-Flash (0.860, 0.796), 5th Qwen3-235B (0.857, 0.786). Structural metrics (JSON Pass, Path Recall, etc.) are near ceiling across models, with differences arising in Value Accuracy and Perfect Response.
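
The parse gate and the difficulty weighting described above can be sketched roughly as follows. This is a minimal illustration, not SOB's actual implementation: the function names, the `semantic_score` callback, and the choice of a weighted mean for aggregation are assumptions. Only the weights 1.0/2.0/3.0 and the zero-on-parse-failure rule come from the description.

```python
import json
from typing import Callable

# Difficulty weights as stated above; the aggregation formula itself is an assumption.
DIFFICULTY_WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}


def gated_score(raw_output: str, semantic_score: Callable[[object], float]) -> float:
    """Return 0.0 when the output is not parsable JSON, else the semantic score."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # parse gate: downstream semantic metrics are zeroed
    return semantic_score(parsed)  # hypothetical callback, e.g. a value-accuracy function


def weighted_overall(records: list[tuple[str, float]]) -> float:
    """Difficulty-weighted mean over (difficulty, score) pairs."""
    total_weight = sum(DIFFICULTY_WEIGHTS[difficulty] for difficulty, _ in records)
    if total_weight == 0:
        return 0.0  # no records to aggregate
    weighted_sum = sum(DIFFICULTY_WEIGHTS[difficulty] * score for difficulty, score in records)
    return weighted_sum / total_weight
```

Under this weighting, a failure on a hard schema costs three times as much as a failure on an easy one, which is how the leaderboard would favor models that handle nested structures well.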
Evidence
- Experiences shared in the discussion point to the fragility of asking a single LLM call to handle both input parsing and JSON formatting. A two-step approach (perform the task first, then wrap the result in JSON with a separate LLM call) reportedly improves quality significantly, especially in agentic state machines that need HTML/JS/Python code snippets embedded in JSON.
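
A minimal sketch of the two-step approach might look like the following. The `llm` parameter is a hypothetical stand-in for whatever client is in use (any callable that takes a prompt and returns text), and the prompt wording and `schema_hint` argument are illustrative assumptions rather than anything prescribed by the benchmark or the discussion.

```python
import json
from typing import Any, Callable

# Hypothetical interface: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def extract_two_step(document: str, task: str, schema_hint: str, llm: LLM) -> Any:
    """Step 1 solves the task in free text; step 2 only formats that answer as JSON."""
    # Step 1: let the model focus on the task with no formatting constraints.
    answer = llm(f"{task}\n\nDocument:\n{document}\n\nAnswer in plain prose.")
    # Step 2: a separate call whose only job is to wrap the answer in JSON.
    formatted = llm(
        "Convert the answer below into JSON matching this schema. "
        "Do not add or change facts.\n"
        f"Schema: {schema_hint}\n\nAnswer:\n{answer}"
    )
    return json.loads(formatted)  # raises json.JSONDecodeError on malformed output
```

The trade-off is a second model call per record, so the approach fits best where single-call errors are frequent enough to justify the extra latency and cost.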
How to Apply
- If building pipelines to extract JSON from invoices, medical records, or meeting transcripts, select models based on the Value Accuracy and Perfect Response columns of the SOB leaderboard. These two metrics more directly reflect production reliability than the overall score.
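- For cost-sensitive, high-volume JSON extraction tasks, consider Qwen3.5-35B as an alternative to GPT-5.4. On the leaderboard it scores 0.861 overall and 0.801 on Value Accuracy, against 0.870 and 0.798 for GPT-5.4, so it potentially offers comparable accuracy at a significantly lower cost.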
- If encountering frequent errors when simultaneously parsing input and generating JSON with a single LLM call, experiment with a two-step approach: complete the task as free text first, then convert the result to JSON with a separate LLM call.
- To measure the structured output quality of your own LLM pipeline, adapt SOB’s seven metrics (JSON Pass → Structure Coverage → Path Recall → Type Safety → Value Accuracy → Faithfulness → Perfect Response) as a hierarchy of checks for internal evaluation.
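
For internal evaluation, a per-record version of that hierarchy could be sketched as below. This is an illustration under stated assumptions, not SOB's scoring code: the `flatten` and `evaluate` helpers are hypothetical, type and value checks are averaged over all expected paths so that a missing field counts as a miss, and Structure Coverage and Faithfulness are left out because they need definitions not given here.

```python
import json


def flatten(obj, prefix=""):
    """Map every leaf value to a dotted path, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = ((str(index), value) for index, value in enumerate(obj))
    else:
        return {prefix: obj}
    leaves = {}
    for key, value in items:
        leaves.update(flatten(value, f"{prefix}.{key}" if prefix else str(key)))
    return leaves


def evaluate(raw_output: str, expected: dict) -> dict:
    """Score one model response against the expected record."""
    metrics = {"json_pass": 0.0, "path_recall": 0.0, "type_safety": 0.0,
               "value_accuracy": 0.0, "perfect_response": 0.0}
    try:
        predicted = json.loads(raw_output)
    except json.JSONDecodeError:
        return metrics  # parse failure zeroes every downstream metric
    metrics["json_pass"] = 1.0
    want, got = flatten(expected), flatten(predicted)
    if not want:
        return metrics
    present = [path for path in want if path in got]
    metrics["path_recall"] = len(present) / len(want)
    # Assumption: denominators use all expected paths, so omissions lower the score.
    metrics["type_safety"] = sum(type(got[p]) is type(want[p]) for p in present) / len(want)
    metrics["value_accuracy"] = sum(got[p] == want[p] for p in present) / len(want)
    metrics["perfect_response"] = float(got == want)
    return metrics
```

Reporting each metric separately, rather than a single blended score, makes it easier to see whether a regression comes from missing keys, wrong types, or wrong values.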
Related Papers
MTG Bench: Testing how well LLMs can play Magic
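An original benchmark that measures how well LLMs reason over complex rules by testing their adherence to the rules of the card game MTG; gpt-5.5 took first place with a score of 95.4.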
ALIGNBEAM : Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing
A method that restores LLM safety broken by domain fine-tuning by borrowing it from a small safety model at inference time, without retraining.
The iPad was on Tailscale: a WebRTC debugging story
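A two-week debugging story about a rare bug in which only the iPad failed to receive responses over a WebRTC data channel; it turned out to be a compound bug caused by a hardcoded MTU constant in webrtc-rs acting together with Tailscale dropping IPv6 Fragment packets.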
Can LLMs Beat Classical Hyperparameter Optimization Algorithms?
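A study that directly compares LLM-based hyperparameter optimization agents with classical algorithms such as CMA-ES and TPE; it concludes that LLMs alone do not beat the classical methods, but a hybrid of the two, 'Centaur', achieves the best performance.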
What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks
Typographic tricks such as bold text, highlighting, and whitespace placement can bypass 10 content moderation systems, including GPT-4o and Llama Guard, more than 99% of the time.
Did Claude increase bugs in rsync?
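A post that uses real data and statistical analysis to test the social media claim that bugs increased after Claude AI was introduced to the rsync project; it concludes that there is no statistical evidence that releases after Claude's introduction are unusually buggy relative to the historical distribution.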