Show HN: A new benchmark for testing LLMs for deterministic outputs
TL;DR Highlight
Structured Output Benchmark (SOB) scores LLM JSON handling on seven separate metrics, surfacing quality differences that schema compliance alone cannot capture.
Who Should Read
Backend and ML engineers building or operating pipelines that extract structured data from documents, images, and audio with LLMs. Especially relevant in production settings where the accuracy of JSON output directly affects downstream systems.
Core Mechanics
- Existing benchmarks (JSONSchemaBench, StructEval, etc.) only check whether a response is parsable JSON that passes the schema. Perfectly formatted but factually wrong JSON can therefore earn a perfect score, so these benchmarks fail to measure real-world production reliability.
- SOB evaluates across text (HotpotQA, 5,000 samples), image (olmOCR-bench, 209), and audio (AMI Meeting Corpus, 115) modalities using a unified scoring pipeline, reflecting real-world input sources like OCR, screenshots, and meeting transcripts.
- Images and audio recordings are normalized to text before evaluation, isolating pure structured output capability and excluding vision or ASR (speech recognition) performance.
- SOB reports seven metrics separately: Value Accuracy (exact value match), JSON Pass Rate (parsability), Type Safety (type match), Structure Coverage (structure inclusion), Path Recall (required key inclusion), Faithfulness (source grounding), and Perfect Response (complete record match). Value Accuracy is the most critical metric for production.
- Two gates prevent score inflation: JSON parsing failures result in zero scores for all downstream semantic metrics, and Value Accuracy only scores fields actually returned by the model, penalizing omissions.
- Schema difficulty is tagged as easy (1.0), medium (2.0), and hard (3.0) with corresponding weights applied to the final leaderboard, rewarding models that handle complex nested structures well.
- All evaluations run with temperature 0.0, a 2048-token output cap, and reasoning/thinking modes disabled, so scores reflect pure structured-output extraction ability.
- Leaderboard highlights: 1st GPT-5.4 (Overall 0.870, Value Acc 0.798), 2nd GLM-4.7 (0.861, 0.804), 3rd Qwen3.5-35B (0.861, 0.801), 4th Gemini-2.5-Flash (0.860, 0.796), 5th Qwen3-235B (0.857, 0.786). Structural metrics (JSON Pass, Path Recall, etc.) are near ceiling across models, with differences arising in Value Accuracy and Perfect Response.
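The two anti-inflation gates and the difficulty weighting described above can be sketched roughly as follows. This is an illustrative reconstruction, not SOB's actual code; the function and field names are my own.

```python
import json

# Difficulty weights as described: easy=1.0, medium=2.0, hard=3.0.
DIFFICULTY_WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}

def score_response(raw: str, expected: dict, difficulty: str) -> dict:
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Gate 1: unparsable JSON zeroes every downstream semantic metric.
        return {"json_pass": 0.0, "value_accuracy": 0.0,
                "weight": DIFFICULTY_WEIGHTS[difficulty]}

    # Gate 2: only fields the model actually returned can score, while
    # omitted fields still count against the denominator of expected keys.
    correct = sum(1 for k, v in expected.items() if parsed.get(k) == v)
    return {"json_pass": 1.0,
            "value_accuracy": correct / len(expected),
            "weight": DIFFICULTY_WEIGHTS[difficulty]}

def leaderboard_score(results: list[dict]) -> float:
    # Difficulty-weighted mean: hard nested schemas move the needle more.
    total_weight = sum(r["weight"] for r in results)
    return sum(r["value_accuracy"] * r["weight"] for r in results) / total_weight
```

A parse failure on a hard (3.0-weight) record thus drags the final score down three times as much as one on an easy record, which is what rewards models that handle complex nested structures well.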
Evidence
- "Shared experiences highlight the vulnerability of simultaneously requesting 'input parsing' and 'JSON formatting' in a single LLM call. A two-step approach—performing the task first, then wrapping the result in JSON with a separate LLM call—significantly improves quality, especially in agentic state machines requiring HTML/JS/Python code snippets within JSON."
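The two-step pattern from the discussion can be sketched like this. The prompts and the `call_llm` callable are placeholders; plug in whatever provider client you actually use.

```python
import json

def extract_two_step(call_llm, document: str, schema: str) -> dict:
    """Two-step extraction: do the task first, then format as JSON.

    `call_llm` is any callable that takes a prompt string and returns
    the model's text response.
    """
    # Step 1: pure task, with no JSON-formatting constraints
    # competing for the model's attention.
    answer = call_llm(f"Perform the extraction task on this input:\n{document}")
    # Step 2: pure formatting; the model only transcribes the answer into JSON.
    wrapped = call_llm(
        "Convert this answer into JSON matching the schema below.\n"
        f"Schema:\n{schema}\nAnswer:\n{answer}"
    )
    return json.loads(wrapped)
```

Splitting the calls costs an extra round trip, but per the thread it pays off most when the JSON must embed HTML/JS/Python snippets, where single-call formatting is most fragile.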
How to Apply
- If building pipelines to extract JSON from invoices, medical records, or meeting transcripts, select models based on the Value Accuracy and Perfect Response columns of the SOB leaderboard. These two metrics more directly reflect production reliability than the overall score.
- For cost-sensitive, high-volume JSON extraction tasks, consider Qwen3.5-35B as an alternative to GPT-5.4: the leaderboard shows it matching GPT-5.4 on Value Accuracy (0.801 vs 0.798) at what is likely a much lower serving cost.
- If encountering frequent errors when simultaneously parsing input and generating JSON with a single LLM call, experiment with a two-step approach: complete the task as free text first, then convert the result to JSON with a separate LLM call.
- To measure the structured output quality of your own LLM pipeline, adapt SOB’s seven-metric framework (JSON Pass → Structure Coverage → Path Recall → Type Safety → Value Accuracy → Faithfulness → Perfect Response) as a hierarchical framework for internal evaluation.
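For internal evaluation along the lines suggested above, two of the mid-hierarchy metrics are easy to implement yourself. Below is a minimal sketch of Path Recall (are required keys present?) and Type Safety (do returned values have the expected types?); the path syntax and type map are illustrative, not SOB's definitions.

```python
def path_recall(parsed: dict, required_paths: list[str]) -> float:
    """Fraction of required dotted paths present in the parsed output."""
    def has_path(obj, path):
        for key in path.split("."):
            if not isinstance(obj, dict) or key not in obj:
                return False
            obj = obj[key]
        return True
    return sum(has_path(parsed, p) for p in required_paths) / len(required_paths)

def type_safety(parsed: dict, type_map: dict[str, type]) -> float:
    """Fraction of returned top-level fields whose values match the expected type.

    Only scores fields the model actually returned; omissions are the
    job of recall-style metrics, not type checks.
    """
    present = [k for k in type_map if k in parsed]
    if not present:
        return 0.0
    return sum(isinstance(parsed[k], type_map[k]) for k in present) / len(present)
```

Running these as a gated sequence (parse first, then structure, then types, then values) mirrors the hierarchical ordering SOB uses and makes it obvious at which stage a pipeline is losing points.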
Terminology
Related Papers
Claude.ai unavailable and elevated errors on the API
Anthropic’s entire service suite—Claude.ai, the API, Claude Code—became inaccessible for 1 hour and 18 minutes (17:34–18:52 UTC), sparking outrage among enterprise users over reliability concerns.
4TB of voice samples just stolen from 40k AI contractors at Mercor
Mercor data breach exposes voice recordings and ID scans of 40,000 contractors, fueling deepfake and voice fraud risks.
I cancelled Claude: Token issues, declining quality, and poor support
Anthropic’s Claude Code Pro experienced a three-week decline in speed, token allowance, and support quality, sparking a community discussion among developers.
Different Language Models Learn Similar Number Representations
Language models from Transformers to LSTMs consistently learn periodic patterns with periods T=2, 5, and 10 when representing numbers, offering a mathematical account of this 'convergent evolution' across architectures.
Diagnosing CFG Interpretation in LLMs
LLMs frequently lose semantic meaning despite syntactically correct output when exposed to novel grammar rules.
Kernel code removals driven by LLM-created security reports