Visual-ERM: Reward Modeling for Visual Equivalence
TL;DR Highlight
An 8B multimodal reward model for chart/table/SVG-to-code RL training that catches the fine-grained visual errors DINO and text-based rewards miss.
Who Should Read
ML engineers building pipelines that convert chart/document images to code or markdown — especially devs struggling with reward design when fine-tuning Vision-Language Models with RL.
Core Mechanics
- Existing rewards such as DINO embedding similarity or TEDS (Tree-Edit-Distance-based Similarity, a textual structure metric) are vulnerable to reward hacking: cases exist where the DINO score is 0.99 yet the chart's colors, axes, and data are completely wrong
- Visual-ERM is an 8B multimodal generative Reward Model based on Qwen3-VL-8B-Instruct that compares the original image with the rendered prediction image and outputs 4 error categories (structure/data/text/style) with severity as JSON
- SFT-trained on 340K error-annotation examples (104K charts, 125K tables, 111K SVGs) generated with GPT-4o-mini, distilling a larger model's error-critique ability into the smaller 8B model
- When used as RL reward: +8.4 pts on chart-to-code, +2.7 pts on table-to-markdown, +4.1 pts on SVG-to-code over base Qwen3-VL-8B-Instruct (consistently outperforms DINO-based RL)
- Also applicable for Test-Time Scaling — running a 3-round self-reflection/revision loop with Visual-ERM feedback adds another +8.0 pts
- On VC-RewardBench (1,335 high-quality instances), Visual-ERM 8B significantly outperforms Qwen3-VL-235B-Instruct and approaches closed-source models like GPT-4o
Evidence
- Chart-to-Code: baseline Qwen3-VL-8B-Instruct 69.6 → 78.0 after Visual-ERM RL (+8.4 pts), DINO-based RL only reached 76.1
- Applying Visual-ERM RL on VinciCoder-8B-SFT (already a strong specialized model) adds another +10.1 pts — effective even on strong baselines
- On VC-RewardBench: Visual-ERM (8B) avg F1h/F1s/Sc = 42.1/44.7/58.4 vs Qwen3-VL-235B-Instruct 29.5/32.4/56.2 — a 30x smaller model wins
- After 3-round reflection/revision with Test-Time Scaling: Chart-to-Code avg 78.0 → 81.1 (+3.1 pts), base model 65.7 → 77.6 (+8.0 pts)
How to Apply
- For RL reward design: render the model's output code into an image, feed the (original image, rendered image) pair to Visual-ERM, and map the summed severity score to a reward (lower total severity means higher reward). Adding render success as a format reward improves training stability.
- For Test-Time Scaling: loop (the model generates a code draft → render it → extract error feedback via Visual-ERM → append the feedback to the prompt and regenerate), repeated 3 times; 3 rounds is the cost/benefit sweet spot.
- For building Reward Model training data: inject errors into ground-truth image+code pairs using GPT-4o-mini to create (original, error version) pairs, then use a strong model to generate JSON annotations for category/severity/location/description in a distillation pipeline.
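The reflection/revision loop described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: `generate_code`, `render`, and `call_visual_erm` are hypothetical stand-ins for your VLM call, your renderer, and the Visual-ERM model call, and the severity normalizer (15.0) mirrors the reward code below.

```python
def reflect_and_revise(image, generate_code, render, call_visual_erm, rounds=3):
    """Test-time scaling loop: draft -> render -> Visual-ERM feedback -> regenerate.

    Returns the best draft seen, scored by 1 - normalized total severity.
    """
    feedback = None
    best_code, best_reward = None, -1.0
    for _ in range(rounds):
        code = generate_code(image, feedback)      # VLM call; feedback goes into the prompt
        rendered = render(code)                    # assume None on render failure
        if rendered is None:
            feedback = "The code failed to render; fix syntax/runtime errors."
            continue
        errors = call_visual_erm(image, rendered)  # list of {category, severity, location, description}
        reward = 1.0 - min(1.0, sum(e["severity"] for e in errors) / 15.0)
        if reward > best_reward:
            best_code, best_reward = code, reward
        if not errors:                             # no discrepancies found; stop early
            break
        feedback = "\n".join(
            f"[{e['category']}|severity {e['severity']}] {e['location']}: {e['description']}"
            for e in errors
        )
    return best_code
```

Tracking the best draft (rather than returning the last one) guards against a revision that regresses.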
Code Example
# Visual-ERM inference prompt example (Chart-to-Code)
PROMPT_CHART2CODE_JUDGEMENT = """
You are an Experienced Specialist for Data Visualization.
You will be provided with two images:
1. Original Image: a chart rendered using ground-truth Matplotlib code.
2. Generated Image: a chart rendered using AI-generated Matplotlib code.
Compare the Generated Image against the Original Image and identify all visual discrepancies.
Assign severity: 1(minor) / 2(medium) / 3(severe)
Output ONLY a single JSON object:
{
  "structure_error_count": int,
  "data_error_count": int,
  "text_error_count": int,
  "style_error_count": int,
  "errors": [
    {
      "category": "structure_error | data_error | text_error | style_error",
      "severity": 1 | 2 | 3,
      "location": "Specific location (e.g., 'Legend', 'X-axis label')",
      "description": "Concise description of the error."
    }
  ]
}
"""
# RL reward calculation example
def compute_verm_reward(errors: list, epsilon: float = 1e-6) -> float:
"""
Takes a Visual-ERM output error list and converts it to a reward in the [0,1] range
"""
total_severity = sum(e['severity'] for e in errors)
# In actual implementation, normalize by the maximum severity within the task
max_severity = 15.0 # Adjust per task
normalized = total_severity / (max_severity + epsilon)
reward_verm = max(0.0, min(1.0, 1.0 - normalized))
return reward_verm
def compute_total_reward(rendered_image, gt_image, render_success: bool) -> float:
render_success_reward = 1.0 if render_success else 0.0
# Obtain error list after calling Visual-ERM
    errors = call_visual_erm(gt_image, rendered_image)  # placeholder for the actual Visual-ERM call
verm_reward = compute_verm_reward(errors)
    return render_success_reward + verm_reward  # maximum 2.0
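Since a generative reward model is instructed to output "ONLY a single JSON object" but can still wrap it in prose or a markdown fence, a defensive parser is useful in practice. A minimal sketch (the helper name and fallback behavior are my own assumptions, not part of the paper):

```python
import json
import re


def parse_verm_output(text: str) -> list:
    """Extract the error list from a Visual-ERM response, tolerating code
    fences and surrounding prose. Returns [] if no valid JSON is found."""
    # Prefer a ```json ... ``` fenced object if one is present
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    candidate = fenced.group(1) if fenced else None
    if candidate is None:
        # Fall back to the outermost {...} span in the raw text
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            return []
        candidate = text[start:end + 1]
    try:
        obj = json.loads(candidate)
    except json.JSONDecodeError:
        return []
    # Keep only well-formed entries with an integer severity in 1..3
    return [e for e in obj.get("errors", [])
            if isinstance(e, dict) and e.get("severity") in (1, 2, 3)]
```

Returning an empty list on malformed output makes the RL reward degrade gracefully instead of crashing a training step.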
Original Abstract
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.