We propose to learn legged robot locomotion skills by watching thousands of wild animal videos from the internet, such as those featured in nature documentaries. Such videos offer a rich and diverse collection of plausible motion examples that could inform how robots should move.
To achieve this, we introduce Reinforcement Learning from Wild Animal Videos (RLWAV), a method to ground these motions in physical robots. We first train a video classifier on a large-scale animal video dataset to recognize actions from RGB clips of animals in their natural habitats. We then train a multi-skill policy to control a robot in a physics simulator, using as the reinforcement learning reward the classification score assigned to videos of the robot’s movements captured by a third-person camera. Finally, we directly transfer the learned policy to a real quadruped, Solo.
Remarkably, despite the extreme gap in both domain and embodiment between animals in the wild and robots, our approach enables the policy to learn diverse skills such as walking, jumping, and keeping still, without relying on reference trajectories or skill-specific rewards.
We consider the Animal Kingdom dataset, a large and diverse collection of labeled animal videos spanning various species, sourced from the internet, including wildlife documentaries.
We focus on a subset of 4 action classes - Keeping still, Walking, Running and Jumping - excluding behaviors irrelevant for legged robots (e.g. "Flying", "Spitting Venom").
In total, we select 8,791 videos from this dataset.
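As a concrete illustration, below is a minimal sketch of this subset selection, assuming a hypothetical tab-separated annotation file mapping video ids to action labels; the actual Animal Kingdom release ships its own annotation format.

```python
# Hypothetical annotation format: one "video_id<TAB>action" pair per line.
KEPT_ACTIONS = {"Keeping still", "Walking", "Running", "Jumping"}

def select_videos(annotation_path: str) -> list[str]:
    """Return ids of videos whose action label is relevant for a legged robot."""
    selected = []
    with open(annotation_path) as f:
        for line in f:
            video_id, action = line.rstrip("\n").split("\t")
            if action in KEPT_ACTIONS:  # drops e.g. "Flying", "Spitting Venom"
                selected.append(video_id)
    return selected
```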
Reward Learning from Wild Animal Videos: We train a video classifier to recognize actions from RGB videos of animals in the wild. We finetune a Uniformer video encoder and employ several techniques to improve generalization to out-of-distribution data, such as weight averaging and random convolution augmentations. Training on diverse species in their varied natural habitats allows the classifier to generalize to our legged robot in simulation in a zero-shot manner.
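A minimal sketch of two such generalization techniques is given below, assuming one common instantiation of each: random-convolution texture augmentation and linear interpolation between pretrained and finetuned weights (in the style of WiSE-FT). The exact recipe used in RLWAV may differ.

```python
import copy
import torch
import torch.nn as nn

def random_conv_augment(frames: torch.Tensor, kernel_size: int = 3) -> torch.Tensor:
    """Pass frames (B, 3, H, W) through a freshly sampled random convolution.

    Randomizing low-level color/texture statistics while preserving shapes
    encourages the classifier to rely on motion and silhouette cues that
    transfer better to out-of-distribution inputs such as a simulated robot.
    """
    conv = nn.Conv2d(3, 3, kernel_size, padding=kernel_size // 2, bias=False)
    nn.init.normal_(conv.weight, std=1.0 / (3 * kernel_size ** 2))
    with torch.no_grad():
        return conv(frames)

def average_weights(pretrained: nn.Module, finetuned: nn.Module,
                    alpha: float = 0.5) -> nn.Module:
    """Interpolate pretrained and finetuned weights (alpha is a hypothetical
    hyperparameter); averaging tends to retain the robustness of the
    pretrained model while keeping the accuracy of the finetuned one."""
    averaged = copy.deepcopy(finetuned)
    state = averaged.state_dict()
    pre = pretrained.state_dict()
    for name, param in finetuned.state_dict().items():
        if param.is_floating_point():  # skip integer buffers
            state[name] = alpha * param + (1.0 - alpha) * pre[name]
    averaged.load_state_dict(state)
    return averaged
```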
Physical Grounding with Constrained Reinforcement Learning in a Simulator: With a third-person camera capturing videos of the robot’s movements, we use the classification score for the desired skill as a reward to train the policy with RL. We train a single multi-skill policy with PPO in the massively parallel Isaac Gym physics simulator. To ensure effective and safe sim-to-real transfer, we employ CaT, a constrained reinforcement learning algorithm, to make the policy comply with constraints that apply independently of the target skill, such as torque limits, foot air time and base orientation around the roll axis.
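The sketch below illustrates these two ingredients under simplifying assumptions: the reward is the classifier’s softmax score for the commanded skill on clips rendered in simulation, and constraints are handled by terminating episodes with a probability that grows with the violation, in the spirit of CaT. The tensor shapes and the cap p_max are hypothetical, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def skill_reward(classifier, clips: torch.Tensor,
                 skill_id: torch.Tensor) -> torch.Tensor:
    """Reward = classification score of the commanded skill.

    clips:    (num_envs, T, 3, H, W) videos rendered by the third-person camera.
    skill_id: (num_envs,) index of the commanded skill for each environment.
    """
    with torch.no_grad():
        logits = classifier(clips)            # (num_envs, num_skills)
        probs = F.softmax(logits, dim=-1)
    return probs.gather(1, skill_id.unsqueeze(1)).squeeze(1)

def termination_prob(violation: torch.Tensor, p_max: float = 0.25) -> torch.Tensor:
    """Constraints as terminations, simplified: the larger a constraint
    violation (e.g. torque above its limit), the more likely the step is
    treated as a termination, cutting off all future skill rewards."""
    excess = violation.clamp(min=0.0)
    return p_max * excess / excess.max().clamp(min=1e-6)
```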
[Videos of the robot executing each skill: Keeping still, Walking, Running, Jumping]
Despite the extreme gap in both domain and embodiment between animals in the wild and the quadruped robot, our approach enables the policy to learn distinct behaviors for keeping still, walking, running and jumping, without relying on reference trajectories or skill-specific rewards.