Creating a perfect digital twin copy of the real-world task is challenging, requiring extensive manual tuning, system identification, and sourcing identical 3D assets. Instead, we opt to create task-aware digital cousins, i.e., tasks in simulation that share the same task semantics but may have discrepancies in visual and physical features.
Creating digital cousins still requires extra effort, so we also explore using prior task-agnostic simulation datasets. We define a prior task-agnostic simulation dataset as any simulation dataset that existed before the creation of the downstream real-world task. These datasets may have numerous discrepancies with real-world data, but they offer a simple and convenient way to leverage simulation data.
We establish a systematic study of co-training on real-robot data and synthetically generated simulation data, resulting in a simple recipe for leveraging synthetic simulation data in real-world manipulation:
• Task and scene composition. Use task-aware digital cousins with similar task and scene compositions to real-world tasks. Multi-task prior simulation data can still help even with different compositions.
• Object composition and initialization. Incorporate diverse objects and varying placements in simulation to improve generalization.
• Task-aware digital cousin alignment. Ensure simulation tasks share the same definition and success criteria as real-world tasks. Similar camera viewpoints help, but perfect alignment isn't necessary.
• Co-training hyperparameters. Use significantly more simulation data than real-world data and carefully tune the co-training ratio.
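To ground these guidelines, here is a hypothetical sketch of how a task-aware digital cousin for a real cup pick-and-place task might be specified. Every name and value below is an illustrative assumption, not the paper's actual configuration.

```python
# Hypothetical digital-cousin specification for a real "cup pick-and-place" task.
# All field names and values are illustrative assumptions, not the paper's codebase.
digital_cousin_task = {
    # Same task semantics and success criterion as the real-world task.
    "task": "pick_cup_and_place_on_plate",
    "success_criterion": "cup_resting_on_plate",
    # Similar (not necessarily identical) scene composition and camera viewpoint.
    "scene": "kitchen_counter",
    "camera_pose": {"lookat": [0.4, 0.0, 0.9], "distance_m": 1.2, "yaw_deg": 180},
    # Diverse object assets and randomized initial placements to aid generalization.
    "cup_assets": ["mug_a", "mug_b", "paper_cup", "tumbler"],
    "placement_randomization": {"xy_range_m": 0.15, "yaw_range_deg": 180},
}
```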
Sim-and-Real Co-training Pipeline. Left: We start with a real-world task in mind and some prior simulation data. Middle: Given the real-world task and prior simulation data, we build additional digital cousin tasks that share semantic similarities with their real-world counterparts but may still have discrepancies in visual and physical features. From here, we consolidate prior simulation data, digital cousin data, and real-world data. Right: We co-train on a mixture of real-world and simulation data, sampling simulation data according to a sampling ratio α; as we show in experiments, this sampling ratio is crucial. After training, we deploy the resulting policy directly in the real world.
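To make the sampling step concrete, below is a minimal sketch of a co-training batch sampler. The function and variable names are illustrative rather than the paper's implementation, and the datasets are assumed to be indexable collections of (observation, action) pairs.

```python
import random

def sample_cotraining_batch(sim_dataset, real_dataset, batch_size, alpha=0.9):
    """Draw one co-training batch: a fraction `alpha` of samples comes from the
    simulation pool (prior simulation data + digital cousins), the rest from
    real-world data. Names here are illustrative, not the paper's code."""
    n_sim = round(alpha * batch_size)
    n_real = batch_size - n_sim

    sim_samples = [sim_dataset[random.randrange(len(sim_dataset))] for _ in range(n_sim)]
    real_samples = [real_dataset[random.randrange(len(real_dataset))] for _ in range(n_real)]

    batch = sim_samples + real_samples
    random.shuffle(batch)
    return batch
```

Because the simulation pool is typically much larger than the real pool, α effectively controls how often the scarce real demonstrations are revisited during training; the co-training ratio ablation below shows that this choice matters.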
Success rates under different training data mixtures:
• Trained on real-only data: 31%
• Trained on real + task-aware digital cousin: 71%
• Trained on real + prior simulation data: 69%
• Trained on real + prior simulation data + task-aware digital cousin: 76%
We use 4000 simulation demos and increase the number of real demos from 40 to 400 on the MultiTaskPnP task. The results show that our co-training recipe remains beneficial in data-rich settings.
Tasks shown: Cup PnP, Milk Box PnP, Pour, Bimanual Cube PnP.
Quantitatively, sim-and-real co-training boosts policy performance by 37.9% on average across six tasks and two embodiments.
Co-training with simulation data can also help the policy generalize to novel objects and object poses unseen in the real-world dataset:
Unseen Object
• Real-only policy: 33% → co-training policy: 50%
• Real-only policy: 10% → co-training policy: 80%
Novel Object Position
• Real-only policy: 11% → co-training policy: 28%
• Real-only policy: 43% → co-training policy: 100%
Novel Object Color
• Co-training policy: 100%
Our findings help practitioners effectively leverage synthetic simulation data for real-world robotics tasks.
Co-Training Ratio
Effect of different co-training ratios (α). We experiment on the CupPnP task with 20 real demos and 1000 simulation demos from task-aware digital cousins. Tuning the co-training ratio is important for good performance of co-trained policies.
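One way to make the role of α concrete (a sketch consistent with the pipeline description above; the per-sample imitation loss ℓ is our notation, not the paper's): if each batch draws a fraction α of its samples from simulation and 1 − α from real data, the expected training objective is

\[
\mathcal{L}(\theta) \;=\; \alpha\,\mathbb{E}_{(o,a)\sim\mathcal{D}_{\mathrm{sim}}}\big[\ell\big(\pi_\theta(o),\,a\big)\big] \;+\; (1-\alpha)\,\mathbb{E}_{(o,a)\sim\mathcal{D}_{\mathrm{real}}}\big[\ell\big(\pi_\theta(o),\,a\big)\big],
\]

so α = 0 recovers real-only training, α = 1 ignores the real data entirely, and intermediate values trade off abundant simulation gradient signal against scarce real-world signal.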
Visual Fidelity
Given our observation that aligning task-aware digital cousins with the real-world setup (e.g., matching camera viewpoints) enhances performance, we pose a follow-up question: would improving visual fidelity by making simulation rendering closer to reality enhance co-training?
We fine-tune CogVideoX (Yang et al., 2024), a video diffusion model, on CupPnP real demonstration videos in the Text2Video (T2V) modality. To adapt the model for style transfer on simulation videos—preserving object positions while enhancing visual realism—we adopt a simple strategy: instead of generating videos from pure noise, we introduce noise into reference simulation videos and use them as initialization for video diffusion. Lower noise levels retain more of the original simulation textures, producing outputs closer to the inputs, while higher noise levels result in more realistic appearances but may distort object poses.
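As a rough illustration of this noised-initialization trick, here is a minimal sketch assuming a DDPM-style noise schedule. `denoise_step`, `alphas_cumprod`, and the function name are hypothetical placeholders, not the CogVideoX API.

```python
import torch

def partial_noise_style_transfer(sim_video, denoise_step, alphas_cumprod, strength=0.6):
    """Start the reverse diffusion from a partially noised copy of the simulation
    video instead of pure Gaussian noise. `denoise_step(x, t)` stands in for one
    reverse step of the fine-tuned video diffusion model; all names here are
    illustrative assumptions."""
    T = alphas_cumprod.shape[0]
    t_start = int(strength * (T - 1))   # higher strength -> start from a noisier state

    # Forward-diffuse the simulation video to timestep t_start (DDPM forward process).
    a_bar = alphas_cumprod[t_start]
    noise = torch.randn_like(sim_video)
    x_t = a_bar.sqrt() * sim_video + (1.0 - a_bar).sqrt() * noise

    # Run the reverse process from t_start down to 0 with the fine-tuned denoiser.
    for t in range(t_start, -1, -1):
        x_t = denoise_step(x_t, t)
    return x_t
```

Here `strength` plays the role of the noise level discussed above.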
We find that this Vid2Vid-enhanced visual fidelity leads to a 5-10% average improvement in policy performance, with the largest benefits when the number of simulation or real-world trajectories is low.
We also explore an alternative strategy: instead of using the RGB simulation video as the noised input for T2V, we use the depth or segmentation video as the intermediate modality, conditioning a Video-to-Video (V2V) diffusion model on it. This relaxes the visual-fidelity requirements on the simulation videos, and we believe it opens up exciting opportunities for future research.
@article{maddukuri2025simandreal,
title = {Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation},
author = {Maddukuri, Abhiram and Jiang, Zhenyu and Chen, Lawrence Yunliang and Nasiriany, Soroush and Xie, Yuqi and Fang, Yu and Huang, Wenqi and Wang, Zu and Xu, Zhenjia and Chernyadev, Nikita and Reed, Scott and Goldberg, Ken and Mandlekar, Ajay and Fan, Linxi and Zhu, Yuke},
journal = {arXiv preprint arXiv:2503.24361},
year = {2025}
}
If you have any questions, please feel free to contact Zhenyu Jiang, Ajay Mandlekar, Jim Fan, and Yuke Zhu.