Robots Learn Tennis from YouTube: AI Achieves Human-Level Play with Limited Data

by Javier Moreno - Sports Editor

A significant leap in robotics and artificial intelligence has been achieved with the development of a new system capable of training humanoid robots to perform complex physical tasks – like playing tennis – using only readily available video data from sources like YouTube. Researchers at Tsinghua University and Peking University have unveiled Motus, a unified latent action world model, demonstrating that robots can learn advanced skills without relying on expensive and precise 3D motion capture data.

Overcoming the Data Bottleneck in Robotics

Traditionally, training robots to perform intricate movements has been hampered by the need for high-quality, meticulously labeled datasets, which are costly and time-consuming to acquire. Motus circumvents this limitation by leveraging the vast amount of video data available online: it treats video as a “hint” and employs a physics simulator to correct for physical inaccuracies, enabling effective learning from short, incomplete clips of human movement.
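
To make that idea concrete, here is a minimal, purely illustrative sketch of the video-as-hint concept; the helper functions (extract_latent_action, step_simulator, refine_with_simulator) and the toy dynamics are assumptions for illustration, not Motus code.

```python
import numpy as np

# Purely illustrative sketch: a video clip supplies only a rough "hint" of the
# motion, and a physics simulator is used to correct it. None of these helper
# functions or the toy dynamics come from Motus; they are assumptions.

def extract_latent_action(frame_prev: np.ndarray, frame_next: np.ndarray) -> np.ndarray:
    """Crude stand-in for a learned latent action: the mean pixel change ("delta action")."""
    return np.array([float((frame_next - frame_prev).mean())])

def step_simulator(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Toy dynamics: integrate the action with damping."""
    return 0.95 * state + 0.05 * action

def refine_with_simulator(frames: list, n_iters: int = 10) -> list:
    """Shrink video-derived actions whenever the simulated rollout leaves a valid range."""
    actions = [extract_latent_action(a, b) for a, b in zip(frames, frames[1:])]
    for _ in range(n_iters):
        state = np.zeros(1)
        for t, act in enumerate(actions):
            state = step_simulator(state, act)
            if abs(state[0]) > 1.0:          # physically implausible: damp the hint
                actions[t] = act * 0.5
    return actions

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = [rng.random((8, 8)) for _ in range(5)]  # a short, incomplete human-movement clip
    print(refine_with_simulator(clip))
```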

Impressive Results: A 90% Success Rate in Tennis

The research team demonstrated the effectiveness of Motus by training a humanoid robot to play tennis. Remarkably, the robot achieved a 90% success rate in hitting a ball traveling at 54 km/h (15 m/s) after just 5 hours of training. This rapid learning capability highlights the potential for accelerated development of physical intelligence in robots.

The Architecture Behind Motus

Motus employs a Mixture-of-Transformers (MoT) architecture that integrates three experts: understanding, action, and video generation. This allows the system to switch flexibly between different modeling modes, including world models, vision-language-action models, inverse dynamics models, and video generation models. Motus also learns latent actions from optical flow – extracting a pixel-level “delta action” – and is trained through a three-phase pipeline built on a six-layer data pyramid, enabling large-scale action pretraining.
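
A rough PyTorch sketch of the one-expert-per-modality idea is shown below; the dimensions, class names, and routing scheme are assumptions for illustration and do not reflect the released Motus implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the "one expert per modality" idea behind a
# Mixture-of-Transformers. Dimensions, class names, and routing are
# assumptions for illustration, not the released Motus architecture.

class ModalityExpert(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.block(tokens)

class MixtureOfTransformers(nn.Module):
    """Routes each modality's token stream to its own expert, then concatenates."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.experts = nn.ModuleDict({
            "understanding": ModalityExpert(dim),  # vision-language understanding tokens
            "action": ModalityExpert(dim),         # latent / robot-action tokens
            "video": ModalityExpert(dim),          # video-generation tokens
        })

    def forward(self, streams: dict) -> torch.Tensor:
        outputs = [self.experts[name](tokens) for name, tokens in streams.items()]
        return torch.cat(outputs, dim=1)

if __name__ == "__main__":
    model = MixtureOfTransformers()
    batch = {
        "understanding": torch.randn(2, 16, 256),
        "action": torch.randn(2, 8, 256),
        "video": torch.randn(2, 32, 256),
    }
    print(model(batch).shape)  # torch.Size([2, 56, 256])
```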

Key Components and Parameters

  • VGM (Video Generation Model): Wan2.2-5B (~5.00B parameters)
  • VLM (Vision-Language Model): Qwen3-VL-2B (~2.13B parameters)
  • Action Expert: ~641.5M parameters
  • Understanding Expert: ~253.5M parameters
  • Total Parameters: ~8B
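
As a quick sanity check, the reported component sizes add up to the stated total; the snippet below simply sums the figures copied from the list above.

```python
# Quick sanity check: sum the component sizes listed above (figures as reported).
components_billion = {
    "VGM (Wan2.2-5B)": 5.00,
    "VLM (Qwen3-VL-2B)": 2.13,
    "Action Expert": 0.6415,
    "Understanding Expert": 0.2535,
}
total = sum(components_billion.values())
print(f"Total = {total:.3f}B parameters")  # ~8.025B, consistent with the ~8B figure
```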

Recent Updates and Availability

As of December 16, 2025, Motus has been released with pretrained checkpoints and training code available on GitHub. Subsequent updates through December 27, 2025, have included support for RoboTwin inference, LeRobotDataset and MultiLeRobotDataset formats, and RoboTwin raw dataset conversion, along with optimized training scripts and three-view image concatenation scripts.
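
As a rough illustration of what a three-view concatenation step might look like, here is a minimal sketch; the view names and side-by-side layout are assumptions, not taken from the repository's scripts.

```python
import numpy as np

# Illustrative only: stitch three camera views (e.g. a head camera plus two
# wrist cameras) into one wide image. The view names and side-by-side layout
# are assumptions, not taken from the Motus repository's scripts.
def concat_three_views(head: np.ndarray, left: np.ndarray, right: np.ndarray) -> np.ndarray:
    assert head.shape == left.shape == right.shape, "all views must share H x W x C"
    return np.concatenate([left, head, right], axis=1)  # concatenate along the width axis

if __name__ == "__main__":
    views = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(3)]
    print(concat_three_views(*views).shape)  # (64, 192, 3)
```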

Implications for the Future of Robotics

The success of Motus suggests that the evolution of physical intelligence in robots may occur at a much faster pace than previously anticipated. By unlocking the potential of readily available video data, this research paves the way for more adaptable, versatile, and accessible robotic systems capable of performing a wider range of complex tasks in real-world environments.

Haoxuan Li, a Ph.D. student at Peking University involved in the research, has been recognized for his contributions to the field, with multiple paper acceptances at leading AI conferences such as NeurIPS, ICML, and AAAI in 2025 and 2026. Li’s homepage provides further details on his research activities.
