In this paper, we build a real-to-sim-to-real (Real2Sim2Real) system
for robot manipulation policy learning from casual human videos. We propose
a new framework, ROSE, that directly leverages casual videos to reconstruct
simulator-ready assets, including objects, scenes, and object trajectories, for training manipulation
policies with reinforcement learning in simulation. Unlike
existing real-to-sim pipelines that rely on specialized equipment or time-consuming
and labor-intensive human annotation, our pipeline is equipment-agnostic and fully
automated, enabling scalable data collection. From casual monocular videos,
ROSE enables the direct reconstruction of metric-scale scenes, objects, and object
trajectories in the same gravity-calibrated coordinate frame for robotic data collection
in the simulator. With ROSE, we curate a dataset of hundreds of simulator-ready
scenes from casual videos, both captured by ourselves and sourced from the Internet, and create a
benchmark for real-to-sim evaluation. Across a diverse suite of manipulation tasks,
ROSE outperforms existing baselines, laying the groundwork for scalable
robotic data collection and efficient Real2Sim2Real deployment.