Visual-ERM: Reward Modeling for Visual Equivalence
TL;DR Highlight
An 8B multimodal reward model for chart/table/SVG-to-code RL training that catches the fine-grained visual errors DINO and text-based rewards miss.
Who Should Read
ML engineers building pipelines that convert chart/document images to code or markdown — especially devs struggling with reward design when fine-tuning Vision-Language Models with RL.
Core Mechanics
- Existing rewards such as DINO embedding similarity or TEDS (Tree-Edit-Distance-based Similarity, a textual structure metric) are vulnerable to reward hacking: cases exist where the DINO score is 0.99 yet the chart's colors, axes, and data are completely wrong
- Visual-ERM is an 8B multimodal generative Reward Model based on Qwen3-VL-8B-Instruct that compares the original image with the rendered prediction image and outputs 4 error categories (structure/data/text/style) with severity as JSON
- SFT-trained on 340K error-annotation examples (104K charts, 125K tables, 111K SVGs) generated with GPT-4o-mini, distilling a larger model's error-critique ability into the smaller 8B model
- When used as RL reward: +8.4 pts on chart-to-code, +2.7 pts on table-to-markdown, +4.1 pts on SVG-to-code over base Qwen3-VL-8B-Instruct (consistently outperforms DINO-based RL)
- Also applicable for Test-Time Scaling — running a 3-round self-reflection/revision loop with Visual-ERM feedback adds another +8.0 pts
- On VC-RewardBench (1,335 high-quality instances), Visual-ERM 8B significantly outperforms Qwen3-VL-235B-Instruct and approaches closed-source models like GPT-4o
Evidence
- Chart-to-Code: baseline Qwen3-VL-8B-Instruct 69.6 → 78.0 after Visual-ERM RL (+8.4 pts), DINO-based RL only reached 76.1
- Applying Visual-ERM RL on VinciCoder-8B-SFT (already a strong specialized model) adds another +10.1 pts — effective even on strong baselines
- On VC-RewardBench: Visual-ERM (8B) avg F1h/F1s/Sc = 42.1/44.7/58.4 vs Qwen3-VL-235B-Instruct 29.5/32.4/56.2 — a 30x smaller model wins
- After 3-round reflection/revision with Test-Time Scaling: Chart-to-Code avg 78.0 → 81.1 (+3.1 pts), base model 65.7 → 77.6 (+8.0 pts)
How to Apply
- For RL reward design: render the model's output code into an image, feed the (original image, rendered image) pair to Visual-ERM, and map the summed severity score to a reward (lower total severity means higher reward). Adding render success as a format reward improves training stability.
- For Test-Time Scaling: loop (the model generates a code draft → render it → extract error feedback via Visual-ERM → append the feedback to the prompt and regenerate), repeated 3 times; 3 rounds is the cost/benefit sweet spot.
- For building Reward Model training data: inject errors into ground-truth image+code pairs using GPT-4o-mini to create (original, error version) pairs, then use a strong model to generate JSON annotations for category/severity/location/description in a distillation pipeline.
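The reflection/revision loop described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: `generate_code`, `render`, and `call_visual_erm` are hypothetical stand-ins for your VLM call, your renderer, and the Visual-ERM model call, and the severity normalizer (15.0) mirrors the reward code below.

```python
def reflect_and_revise(image, generate_code, render, call_visual_erm, rounds=3):
    """Test-time scaling loop: draft -> render -> Visual-ERM feedback -> regenerate.

    Returns the best draft seen, scored by 1 - normalized total severity.
    """
    feedback = None
    best_code, best_reward = None, -1.0
    for _ in range(rounds):
        code = generate_code(image, feedback)      # VLM call; feedback goes into the prompt
        rendered = render(code)                    # assume None on render failure
        if rendered is None:
            feedback = "The code failed to render; fix syntax/runtime errors."
            continue
        errors = call_visual_erm(image, rendered)  # list of {category, severity, location, description}
        reward = 1.0 - min(1.0, sum(e["severity"] for e in errors) / 15.0)
        if reward > best_reward:
            best_code, best_reward = code, reward
        if not errors:                             # no discrepancies found; stop early
            break
        feedback = "\n".join(
            f"[{e['category']}|severity {e['severity']}] {e['location']}: {e['description']}"
            for e in errors
        )
    return best_code
```

Tracking the best draft (rather than returning the last one) guards against a revision that regresses.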
Code Example
# Visual-ERM inference prompt example (Chart-to-Code)
PROMPT_CHART2CODE_JUDGEMENT = """
You are an Experienced Specialist for Data Visualization.
You will be provided with two images:
1. Original Image: a chart rendered using ground-truth Matplotlib code.
2. Generated Image: a chart rendered using AI-generated Matplotlib code.
Compare the Generated Image against the Original Image and identify all visual discrepancies.
Assign severity: 1(minor) / 2(medium) / 3(severe)
Output ONLY a single JSON object:
{
  "structure_error_count": int,
  "data_error_count": int,
  "text_error_count": int,
  "style_error_count": int,
  "errors": [
    {
      "category": "structure_error | data_error | text_error | style_error",
      "severity": 1 | 2 | 3,
      "location": "Specific location (e.g., 'Legend', 'X-axis label')",
      "description": "Concise description of the error."
    }
  ]
}
"""
# RL reward calculation example
def compute_verm_reward(errors: list, epsilon: float = 1e-6) -> float:
"""
Takes a Visual-ERM output error list and converts it to a reward in the [0,1] range
"""
total_severity = sum(e['severity'] for e in errors)
# In actual implementation, normalize by the maximum severity within the task
max_severity = 15.0 # Adjust per task
normalized = total_severity / (max_severity + epsilon)
reward_verm = max(0.0, min(1.0, 1.0 - normalized))
return reward_verm
def compute_total_reward(rendered_image, gt_image, render_success: bool) -> float:
render_success_reward = 1.0 if render_success else 0.0
# Obtain error list after calling Visual-ERM
    errors = call_visual_erm(gt_image, rendered_image)  # placeholder for the actual Visual-ERM call
verm_reward = compute_verm_reward(errors)
    return render_success_reward + verm_reward  # maximum 2.0
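Since a generative reward model is instructed to output "ONLY a single JSON object" but can still wrap it in prose or a markdown fence, a defensive parser is useful in practice. A minimal sketch (the helper name and fallback behavior are my own assumptions, not part of the paper):

```python
import json
import re


def parse_verm_output(text: str) -> list:
    """Extract the error list from a Visual-ERM response, tolerating code
    fences and surrounding prose. Returns [] if no valid JSON is found."""
    # Prefer a ```json ... ``` fenced object if one is present
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    candidate = fenced.group(1) if fenced else None
    if candidate is None:
        # Fall back to the outermost {...} span in the raw text
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            return []
        candidate = text[start:end + 1]
    try:
        obj = json.loads(candidate)
    except json.JSONDecodeError:
        return []
    # Keep only well-formed entries with an integer severity in 1..3
    return [e for e in obj.get("errors", [])
            if isinstance(e, dict) and e.get("severity") in (1, 2, 3)]
```

Returning an empty list on malformed output makes the RL reward degrade gracefully instead of crashing a training step.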
Original Abstract
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.