ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
TL;DR Highlight
A pipeline that auto-converts a single image into a physics-simulation-ready 3D asset, scaled up to build a 100K-object robot manipulation dataset
Who Should Read
Robotics ML engineers building simulation data generation pipelines for robot manipulation policy training. Researchers struggling to acquire 3D assets for simulation-based learning.
Core Mechanics
- Single image → auto-converted to simulation-ready 3D mesh (using CLAY model, ~45 seconds per object)
- VLM auto-annotates physical properties (mass, friction coefficient, size), functional points (handles, spouts, buttons), and grasp points
- GraspGen (diffusion model-based grasp generator) produces up to 4,000 grasp candidates per object → physics-validated in SAPIEN simulator
- ManiTwin-100K: 512 categories, 100K objects, 5M verified grasp poses, 10M auto-generated grasp trajectories
- Grasp annotations validated on a Franka Panda gripper transfer to other robot platforms (other parallel grippers, multi-finger hands, etc.)
- Also supports auto-generation of robot VQA data: language grounding, function planning, task planning across 5 categories
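The per-asset annotation bundle described above (physical properties, functional points, verified grasps) can be sketched as a small schema. All class and field names here are illustrative assumptions, not the dataset's actual serialization format:

```python
from dataclasses import dataclass, field

@dataclass
class FunctionalPoint:
    label: str            # e.g. "handle", "spout", "button" (VLM-annotated)
    position: tuple       # (x, y, z) in the object frame

@dataclass
class GraspPose:
    translation: tuple    # 6-DoF grasp: position ...
    quaternion: tuple     # ... plus orientation (x, y, z, w)
    validated: bool = False  # set after physics validation in the simulator

@dataclass
class ManiTwinAsset:
    category: str
    description: str      # language description for VQA / grounding
    mesh_path: str
    mass_kg: float        # VLM-estimated physical properties
    friction: float
    size_m: tuple
    functional_points: list = field(default_factory=list)
    grasp_poses: list = field(default_factory=list)

# Hypothetical example asset
mug = ManiTwinAsset(
    category="mug",
    description="a white ceramic mug with a handle",
    mesh_path="mug_001.glb",
    mass_kg=0.35, friction=0.6, size_m=(0.09, 0.09, 0.10),
    functional_points=[FunctionalPoint("handle", (0.05, 0.0, 0.05))],
    grasp_poses=[GraspPose((0.05, 0.0, 0.05), (0.0, 0.0, 0.0, 1.0), validated=True)],
)

# Keep only grasps that passed simulation validation
verified = [g for g in mug.grasp_poses if g.validated]
```

In the reported pipeline, GraspGen would populate `grasp_poses` with up to 4,000 candidates before the SAPIEN validation pass flags the survivors.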
Evidence
- Human evaluation on 500 samples: category classification 100%, language description 99.6%, functional point labels 92.2%, physical property estimation 92.2%, grasp point selection 84.8% accuracy
- 3D generation success rate 69.67%, grasp simulation validation pass rate 76.13% (an average of 62.14 verified grasps retained per object)
- Image→3D generation: CLIP(I-I/T) 0.7769, CLIP(N-I/T) 0.6848 — much higher alignment than text→3D (0.2324, 0.1948 respectively)
- Unlike existing robotics object datasets such as RoboTwin-OD (731 objects) and GAPartNet (4K), provides simulation-ready assets with grasp, functional, and language annotations at 100K scale
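The CLIP alignment scores above are presumably cosine similarities between CLIP embeddings (input image vs. a render of the generated asset for I-I, image vs. text for I-T). A minimal sketch of the metric, using placeholder vectors in place of real CLIP features:

```python
import numpy as np

def clip_alignment(emb_a, emb_b):
    """Cosine similarity between two embedding vectors (CLIP-style score)."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float((a / np.linalg.norm(a)) @ (b / np.linalg.norm(b)))

# Placeholder embeddings stand in for CLIP features of the input image
# and a render of the generated 3D asset.
rng = np.random.default_rng(0)
img = rng.normal(size=512)
render = img + 0.3 * rng.normal(size=512)  # a faithful render stays close
score = clip_alignment(img, render)
```

A higher image-conditioned score than a text-conditioned one, as reported above, is expected: the input image pins down geometry and appearance far more tightly than a caption.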
How to Apply
- Feed e-commerce product images or text-to-image generated images into the ManiTwin pipeline to automatically get simulator-loadable 3D assets + grasp poses
- Load objects from the desired categories in the ManiTwin-100K dataset directly into simulators such as SAPIEN or Isaac Gym, then mass-generate pick-and-place trajectory data using the provided 6-DoF grasp poses
- For VQA data needs, use the layout generation feature to arrange multiple objects on a table and auto-generate language-action alignment QA pairs using functional point and language annotations
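Turning a provided 6-DoF grasp pose into a pick-and-place trajectory, as in the second bullet, can be sketched as waypoint derivation. The function name, the purely vertical approach/lift offsets, and the offset values are simplifying assumptions; the actual pipeline would approach along the grasp axis and execute the waypoints in the simulator:

```python
import numpy as np

def pickplace_waypoints(grasp_pos, place_pos, lift=0.15, approach=0.10):
    """Derive a simple pick-and-place waypoint sequence from a grasp pose.

    grasp_pos / place_pos: (x, y, z) positions in the world frame.
    Orientation handling is omitted for brevity.
    """
    g = np.asarray(grasp_pos, dtype=float)
    p = np.asarray(place_pos, dtype=float)
    up = np.array([0.0, 0.0, 1.0])
    return [
        ("pre_grasp", g + approach * up),  # hover above the grasp point
        ("grasp", g),                      # descend and close the gripper
        ("lift", g + lift * up),           # lift the object clear
        ("pre_place", p + lift * up),      # carry over the place location
        ("place", p),                      # lower and release
    ]

wps = pickplace_waypoints(grasp_pos=(0.4, 0.0, 0.05), place_pos=(0.2, 0.3, 0.05))
```

Because every ManiTwin asset ships with verified grasp poses, this kind of template can be stamped across objects and scene layouts to mass-produce trajectory data.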
Original Abstract
Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into a simulation-ready and semantically annotated 3D asset, enabling large-scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin-100K, a dataset containing 100K high-quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning. Our webpage is available at https://manitwin.github.io/.