Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation


UT Austin, NVIDIA Research, UC Berkeley, New York University

Sim-and-Real Co-Training Boosts Real-World Performance.

Simulation Data: Task-Aware Digital Cousins

Creating a perfect digital twin of a real-world task is challenging: it requires extensive manual tuning, system identification, and sourcing identical 3D assets. Instead, we opt to create task-aware digital cousins, i.e., tasks in simulation that share the same task semantics as the real-world task but may differ in visual and physical features.

Simulation Data: Prior Task-Agnostic Simulation Datasets

Creating a digital cousin still requires extra effort, so we also explore using prior task-agnostic simulation datasets. We define a prior task-agnostic simulation dataset as any simulation dataset that existed before the creation of the downstream real-world task. Such datasets may have numerous discrepancies with real-world data, but they offer a very simple and convenient way to leverage simulation data.

A Simple Recipe for Sim-and-Real Co-Training

We conduct a systematic study of co-training on real-robot data and synthetically generated data from simulation, which yields a simple recipe for leveraging synthetic simulation data in real-world manipulation:

Task and scene composition. Use task-aware digital cousins whose task and scene compositions are similar to those of the real-world tasks. Multi-task prior simulation data can still help even when the compositions differ.

Object composition and initialization. Incorporate diverse objects and varying placements in simulation to improve generalization.

Task-aware digital cousin alignment. Ensure simulation tasks share the same definition and success criteria as real-world tasks. Similar camera viewpoints help, but perfect alignment isn't necessary.

Co-training hyperparameters. Use significantly more simulation data than real-world data and carefully tune the co-training ratio.

Sim-and-Real Co-Training Pipeline

Sim-and-Real Co-training Pipeline. Left: we start with a real-world task in mind and some prior simulation data. Middle: given the real-world tasks and prior simulation data, we build additional digital cousin tasks that share semantic similarities with their real-world counterparts but may still have discrepancies in visual and physical features. From here we consolidate prior simulation data, digital cousin data, and real-world data. Right: we co-train on a mixture of real-world and simulation data, sampling simulation data according to a sampling ratio α; as our experiments show, this sampling ratio is crucial. After training, we deploy the resulting policy directly in the real world.
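To make the sampling step concrete, here is a minimal sketch (not the authors' released code) of batch construction under a fixed co-training ratio: each sample is drawn from the simulation dataset with probability α and from the real-world dataset otherwise. The dataset objects, batch size, and α value are illustrative assumptions.

```python
import random

def sample_cotraining_batch(sim_dataset, real_dataset, batch_size, alpha):
    """Draw one co-training batch.

    Each element comes from the simulation dataset with probability
    `alpha` and from the real-world dataset with probability 1 - alpha,
    independent of how many demos each dataset contains. `sim_dataset`
    and `real_dataset` stand in for any indexable demonstration datasets.
    """
    batch = []
    for _ in range(batch_size):
        source = sim_dataset if random.random() < alpha else real_dataset
        batch.append(random.choice(source))
    return batch

# Example usage (hypothetical datasets): heavily favor the much larger
# simulation dataset, e.g. ~90% simulation samples per batch.
# batch = sample_cotraining_batch(sim_demos, real_demos, batch_size=256, alpha=0.9)
```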

Co-Training with Simulation Data Consistently Improves Real-World Performance.

Trained on real-only data: 31% success rate.
Trained on real + task-aware digital cousin: 71% success rate.
Trained on real + prior simulation data: 69% success rate.
Trained on real + prior simulation data + task-aware digital cousin: 76% success rate.

Sim-and-Real Co-Training Remains Beneficial in Data-Rich Settings.

We use 4000 simulation demos and increase the number of real demos from 40 to 400 on the MultiTaskPnP task. The results show that our co-training recipe remains beneficial even in data-rich settings.

Sim-and-Real Co-Training Recipe Works across Multiple Robot Embodiments and Workspaces.

Humanoid tasks: Cup PnP, Milk Box PnP, Pour, and Bimanual Cube PnP.

Quantitatively, sim-and-real co-training boosts policy performance by 37.9% on average across 6 tasks and 2 embodiments.

Sim-and-Real Co-Training Improves Policy's Generalizability.

Co-training with simulation data helps the policy generalize to novel object entities and poses unseen in the real-world dataset.

Unseen object: real-only policy 33% vs. co-training policy 50%; real-only policy 10% vs. co-training policy 80%.

Novel object position: real-only policy 11% vs. co-training policy 28%; real-only policy 43% vs. co-training policy 100%.

Novel object color: co-training policy 100%.

Key Elements in Co-Training: Co-Training Ratio and Camera Alignment

Our findings help practitioners effectively leverage synthetic simulation data for real-world robotics tasks.

Co-Training Ratio

Effect of different co-training ratios (α). We experiment on the CupPnP task with 20 real demos and 1000 simulation demos from task-aware digital cousins. Tuning the co-training ratio is important for good performance of the co-trained policies.
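The short calculation below (an illustrative aside, not a result from the paper) shows why the ratio needs explicit control in this setting: pooling the 20 real and 1000 simulation demos and sampling uniformly would expose the policy to real data only about 2% of the time, whereas an explicit α fixes the mix independently of dataset sizes.

```python
# Illustrative arithmetic for the CupPnP setting above
# (20 real demos, 1000 simulation demos).
n_real, n_sim = 20, 1000

# Naive pooling: sample uniformly from the concatenated dataset.
real_fraction_pooled = n_real / (n_real + n_sim)
print(f"uniform pooling: {real_fraction_pooled:.1%} real samples per batch")  # ~2.0%

# Explicit co-training ratio: alpha directly controls the sim/real mix.
for alpha in (0.5, 0.75, 0.9, 0.99):
    print(f"alpha={alpha}: {1 - alpha:.0%} real samples per batch")
```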

Camera Alignment

Improving Visual Fidelity with Vid2Vid

Given our observation that aligning task-aware digital cousins with the real-world setup, such as matching camera viewpoints, enhances performance, we pose a follow-up question: would improving visual fidelity, i.e., making simulation rendering closer to reality, further enhance co-training?

We fine-tune CogVideoX (Yang et al., 2024), a video diffusion model, on real CupPnP demonstration videos in the Text-to-Video (T2V) modality. To adapt the model for style transfer on simulation videos, preserving object positions while enhancing visual realism, we adopt a simple strategy: instead of generating videos from pure noise, we introduce noise into reference simulation videos and use them as the initialization for video diffusion. Lower noise levels retain more of the original simulation textures, producing outputs closer to the inputs, while higher noise levels result in more realistic appearances but may distort object poses.
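Below is a minimal sketch of this noised-initialization idea (in the spirit of SDEdit-style image-to-image diffusion), assuming a generic denoising callable, a precomputed noise schedule, and latents already encoded from the simulation video; the function names and interfaces are hypothetical, not the authors' implementation.

```python
import torch

def vid2vid_from_sim(sim_latents, denoise_fn, timesteps, alphas_cumprod, strength=0.6):
    """Partially noise an encoded simulation video and denoise from there.

    sim_latents:    latents of the simulation video (e.g., shape (B, T, C, H, W))
    denoise_fn:     callable(latents, t) -> latents after one reverse-diffusion
                    step; stands in for the fine-tuned video model plus its
                    scheduler (hypothetical interface)
    timesteps:      diffusion timesteps in descending order (noisiest first)
    alphas_cumprod: cumulative noise schedule, indexed by timestep
    strength:       fraction of the schedule to run; lower keeps more of the
                    simulation layout, higher gives more realistic textures
                    but may distort object poses
    """
    # Skip the earliest (noisiest) steps and start partway through the schedule.
    n_steps = max(1, int(len(timesteps) * strength))
    t_start = timesteps[-n_steps]

    # Forward-noise the simulation latents to t_start:
    # x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    a_bar = alphas_cumprod[t_start]
    noise = torch.randn_like(sim_latents)
    latents = a_bar.sqrt() * sim_latents + (1.0 - a_bar).sqrt() * noise

    # Run only the remaining reverse-diffusion steps.
    for t in timesteps[-n_steps:]:
        latents = denoise_fn(latents, t)
    return latents
```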

We find that Vid2Vid-enhanced visual fidelity leads to a 5-10% average improvement in policy performance, with the most significant benefits occurring when the number of simulation or real-world trajectories is low.

We also explore an alternative strategy: instead of using the RGB simulation video as the noised input for T2V, we use the depth or segmentation video as the intermediate modality, conditioning a Video-to-Video (V2V) diffusion model on it. This reduces the requirement on the visual fidelity of the simulation videos, and we believe it opens up exciting opportunities for future research.

BibTeX

@article{maddukuri2025simandreal,
    title   = {Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation},
    author  = {Maddukuri, Abhiram and Jiang, Zhenyu and Chen, Lawrence Yunliang and Nasiriany, Soroush and Xie, Yuqi and Fang, Yu and Huang, Wenqi and Wang, Zu and Xu, Zhenjia and Chernyadev, Nikita and Reed, Scott and Goldberg, Ken and Mandlekar, Ajay and Fan, Linxi and Zhu, Yuke},
    journal = {arXiv preprint arXiv:2503.24361},
    year    = {2025}
  }

Contact

If you have any questions, please feel free to contact Zhenyu Jiang, Ajay Mandlekar, Jim Fan, and Yuke Zhu.