ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
TL;DR Highlight
A pipeline that auto-converts a single image into a physics-simulation-ready 3D asset, scaled up to build a 100K-object robot manipulation dataset
Who Should Read
Robotics ML engineers building simulation data generation pipelines for robot manipulation policy training. Researchers struggling to acquire 3D assets for simulation-based learning.
Core Mechanics
- Single image → auto-converted to simulation-ready 3D mesh (using CLAY model, ~45 seconds per object)
- VLM auto-annotates physical properties (mass, friction coefficient, size), functional points (handles, spouts, buttons), and grasp points
- GraspGen (diffusion model-based grasp generator) produces up to 4,000 grasp candidates per object → physics-validated in SAPIEN simulator
- ManiTwin-100K: 512 categories, 100K objects, 5M verified grasp poses, 10M auto-generated grasp trajectories
- Grasp annotations validated on a Franka Panda gripper transfer to other robot platforms (other parallel grippers, multi-finger hands, etc.)
- Also supports auto-generation of robot VQA data: language grounding, function planning, task planning across 5 categories
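The per-asset annotation bundle described above (physical properties, functional points, verified grasps) can be sketched as a small schema. All class and field names here are illustrative assumptions, not the dataset's actual serialization format:

```python
from dataclasses import dataclass, field

@dataclass
class FunctionalPoint:
    label: str            # e.g. "handle", "spout", "button" (VLM-annotated)
    position: tuple       # (x, y, z) in the object frame

@dataclass
class GraspPose:
    translation: tuple    # 6-DoF grasp: position ...
    quaternion: tuple     # ... plus orientation (x, y, z, w)
    validated: bool = False  # set after physics validation in the simulator

@dataclass
class ManiTwinAsset:
    category: str
    description: str      # language description for VQA / grounding
    mesh_path: str
    mass_kg: float        # VLM-estimated physical properties
    friction: float
    size_m: tuple
    functional_points: list = field(default_factory=list)
    grasp_poses: list = field(default_factory=list)

# Hypothetical example asset
mug = ManiTwinAsset(
    category="mug",
    description="a white ceramic mug with a handle",
    mesh_path="mug_001.glb",
    mass_kg=0.35, friction=0.6, size_m=(0.09, 0.09, 0.10),
    functional_points=[FunctionalPoint("handle", (0.05, 0.0, 0.05))],
    grasp_poses=[GraspPose((0.05, 0.0, 0.05), (0.0, 0.0, 0.0, 1.0), validated=True)],
)

# Keep only grasps that passed simulation validation
verified = [g for g in mug.grasp_poses if g.validated]
```

In the reported pipeline, GraspGen would populate `grasp_poses` with up to 4,000 candidates before the SAPIEN validation pass flags the survivors.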
Evidence
- Human evaluation on 500 samples: category classification 100%, language description 99.6%, functional point labels 92.2%, physical property estimation 92.2%, grasp point selection 84.8% accuracy
- 3D generation success rate 69.67%, grasp simulation validation pass rate 76.13% (an average of 62.14 verified grasps retained per object)
- Image→3D generation: CLIP(I-I/T) 0.7769, CLIP(N-I/T) 0.6848 — much higher alignment than text→3D (0.2324, 0.1948 respectively)
- Unlike existing robotics object datasets such as RoboTwin-OD (731 objects) and GAPartNet (4K), provides simulation-ready assets with grasp, functional, and language annotations at 100K scale
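The CLIP alignment scores above are presumably cosine similarities between CLIP embeddings (input image vs. a render of the generated asset for I-I, image vs. text for I-T). A minimal sketch of the metric, using placeholder vectors in place of real CLIP features:

```python
import numpy as np

def clip_alignment(emb_a, emb_b):
    """Cosine similarity between two embedding vectors (CLIP-style score)."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float((a / np.linalg.norm(a)) @ (b / np.linalg.norm(b)))

# Placeholder embeddings stand in for CLIP features of the input image
# and a render of the generated 3D asset.
rng = np.random.default_rng(0)
img = rng.normal(size=512)
render = img + 0.3 * rng.normal(size=512)  # a faithful render stays close
score = clip_alignment(img, render)
```

A higher image-conditioned score than a text-conditioned one, as reported above, is expected: the input image pins down geometry and appearance far more tightly than a caption.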
How to Apply
- Feed e-commerce product images or text-to-image generated images into the ManiTwin pipeline to automatically get simulator-loadable 3D assets + grasp poses
- Load objects from the desired categories in the ManiTwin-100K dataset directly into simulators such as SAPIEN or Isaac Gym, then mass-generate pick-and-place trajectory data using the provided 6-DoF grasp poses
- For VQA data needs, use the layout generation feature to arrange multiple objects on a table and auto-generate language-action alignment QA pairs using functional point and language annotations
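Turning a provided 6-DoF grasp pose into a pick-and-place trajectory, as in the second bullet, can be sketched as waypoint derivation. The function name, the purely vertical approach/lift offsets, and the offset values are simplifying assumptions; the actual pipeline would approach along the grasp axis and execute the waypoints in the simulator:

```python
import numpy as np

def pickplace_waypoints(grasp_pos, place_pos, lift=0.15, approach=0.10):
    """Derive a simple pick-and-place waypoint sequence from a grasp pose.

    grasp_pos / place_pos: (x, y, z) positions in the world frame.
    Orientation handling is omitted for brevity.
    """
    g = np.asarray(grasp_pos, dtype=float)
    p = np.asarray(place_pos, dtype=float)
    up = np.array([0.0, 0.0, 1.0])
    return [
        ("pre_grasp", g + approach * up),  # hover above the grasp point
        ("grasp", g),                      # descend and close the gripper
        ("lift", g + lift * up),           # lift the object clear
        ("pre_place", p + lift * up),      # carry over the place location
        ("place", p),                      # lower and release
    ]

wps = pickplace_waypoints(grasp_pos=(0.4, 0.0, 0.05), place_pos=(0.2, 0.3, 0.05))
```

Because every ManiTwin asset ships with verified grasp poses, this kind of template can be stamped across objects and scene layouts to mass-produce trajectory data.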
Original Abstract
Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into a simulation-ready and semantically annotated 3D asset, enabling large-scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin-100K, a dataset containing 100K high-quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning. Our webpage is available at https://manitwin.github.io/.