Reinforcement learning with Unitree G1 humanoid - Dev w/ G1 P.5

2025-07-25 · Source: sentdex · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Advanced, extended

Summary

A reinforcement learning policy trained with Proximal Policy Optimization (PPO) controls the arm of a Unitree G1 humanoid robot. The model is compact, a 2x64 network of only 184 kilobytes, yet it lets the G1's arm precisely seek and reach target positions in 3D space. It was trained for approximately 24 hours in an OpenAI Gym-style environment with 24 observation values, including joint angles, velocities, hand position, and goal position, and a 7D action space of incremental joint commands. Key challenges were sim-to-real alignment and safety: the G1's default motor speeds and unexpected snapping behavior can overload the motors and shut the robot down. The reward function penalizes distance to the target, collisions, and joint limit excursions, and rewards reaching the target within a 2 cm tolerance.
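The reward terms described above could be combined roughly as follows. This is a minimal sketch, not the original implementation: only the 2 cm tolerance comes from the summary, while the function name `reach_reward` and all weights are hypothetical values chosen for illustration.

```python
import math

GOAL_TOLERANCE_M = 0.02  # 2 cm tolerance, as stated in the summary

# Hypothetical weights, chosen only for illustration (not from the source).
DISTANCE_WEIGHT = 1.0
COLLISION_PENALTY = 10.0
JOINT_LIMIT_PENALTY = 5.0
GOAL_BONUS = 100.0


def reach_reward(hand_pos, goal_pos, collided, joint_limit_violated):
    """Return (reward, reached) for one step of a reach task.

    hand_pos and goal_pos are (x, y, z) positions in meters.
    """
    distance = math.dist(hand_pos, goal_pos)

    # Dense shaping term: the farther the hand is from the goal, the lower the reward.
    reward = -DISTANCE_WEIGHT * distance

    # Penalties for unsafe or physically invalid states.
    if collided:
        reward -= COLLISION_PENALTY
    if joint_limit_violated:
        reward -= JOINT_LIMIT_PENALTY

    # Sparse bonus once the hand is within tolerance of the goal.
    reached = distance <= GOAL_TOLERANCE_M
    if reached:
        reward += GOAL_BONUS

    return reward, reached
```

Keeping the penalties as separate named constants makes it easier to rebalance task completion against physical constraints when the policy starts exploiting one term.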

Key takeaway

Robotics engineers developing reinforcement learning policies for humanoid robots like the Unitree G1 should prioritize reward function design that incorporates physical constraints and safety. Tune joint limits carefully and consider dynamic episode termination to prevent motor overloads and unexpected robot behavior. Also build in safety protocols, such as booting the robot with the remote so the E-stop is immediately accessible, to mitigate the risks of high-speed movements and firmware quirks.

Key insights

PPO-trained, compact RL policies can achieve precise robotic arm control despite sim-to-real challenges and inherent robot safety risks.

Principles

Reward functions must balance task completion with physical constraints.
Sim-to-real alignment is crucial but rarely perfect.
Default robot speeds can pose significant safety risks.
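One common way to guard against fast, snapping motion is to rate-limit joint commands before they reach the motors. The sketch below illustrates that general technique only; the source does not say this is how the G1 policy was deployed, and `rate_limit`, `max_speed_rad_s`, and `dt` are hypothetical names.

```python
def rate_limit(previous_cmd, desired_cmd, max_speed_rad_s, dt):
    """Limit how far each joint command may move in one control tick.

    previous_cmd: joint targets (radians) sent on the last tick.
    desired_cmd:  joint targets (radians) the policy wants now.
    max_speed_rad_s: assumed safe joint speed, a hypothetical tuning value.
    dt: control period in seconds.
    """
    max_delta = max_speed_rad_s * dt
    limited = []
    for prev, want in zip(previous_cmd, desired_cmd):
        # Clamp the requested change to the per-tick budget.
        delta = max(-max_delta, min(max_delta, want - prev))
        limited.append(prev + delta)
    return limited
```

With `max_speed_rad_s=0.5` and `dt=0.02`, no joint target moves more than 0.01 rad per tick, however large the jump the policy requests.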

Method

Train a PPO model in an OpenAI Gym environment using 24 observations (joint states, hand/goal positions) and 7D incremental joint commands, terminating on goal, collision, or joint limits.
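The action handling and termination logic in this method might look like the sketch below. It assumes policy actions are normalized to [-1, 1] and scaled into small joint increments; `MAX_STEP_RAD`, the helper names, and the choice to clamp targets at the limits are illustrative assumptions, not details from the source.

```python
NUM_JOINTS = 7        # 7D action space, as stated in the method
MAX_STEP_RAD = 0.05   # hypothetical cap on the per-step joint increment


def apply_incremental_action(joint_angles, action, lower, upper):
    """Turn a 7D incremental action into new joint targets.

    Returns (new_targets, limit_hit), where limit_hit is True if any
    target had to be clamped to stay inside [lower, upper].
    """
    if not (len(joint_angles) == len(action) == len(lower) == len(upper) == NUM_JOINTS):
        raise ValueError(f"expected {NUM_JOINTS} values per argument")

    new_targets = []
    limit_hit = False
    for q, a, lo, hi in zip(joint_angles, action, lower, upper):
        a = max(-1.0, min(1.0, a))          # keep the action in [-1, 1]
        target = q + a * MAX_STEP_RAD       # incremental command
        clamped = max(lo, min(hi, target))  # enforce joint limits
        if clamped != target:
            limit_hit = True
        new_targets.append(clamped)
    return new_targets, limit_hit


def episode_done(reached_goal, collided, limit_hit):
    """End the episode on goal, collision, or a joint limit excursion."""
    return reached_goal or collided or limit_hit
```

Incremental commands keep each step small by construction, which also makes it straightforward to tighten individual limits (for example shoulder pitch or elbow) by editing `lower` and `upper`.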

In practice

Implement dynamic episode termination based on target delta.
Constrain joint limits (e.g., shoulder pitch, elbow) to prevent overloads.
Use remote control for boot-up to enforce E-stop readiness.
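The summary does not spell out how dynamic episode termination was implemented. One possible reading is that the step budget for an episode scales with the initial distance between the hand and the target, so nearby goals time out sooner. The sketch below follows that assumption; `step_budget` and every constant in it are hypothetical.

```python
import math

# Hypothetical constants, for illustration only.
BASE_STEPS = 50          # minimum steps granted to every episode
STEPS_PER_METER = 400    # extra steps per meter of initial hand-to-goal distance
MAX_STEPS = 500          # hard cap on episode length


def step_budget(start_hand_pos, goal_pos):
    """Return the maximum number of steps allowed for this episode."""
    delta = math.dist(start_hand_pos, goal_pos)
    return min(MAX_STEPS, BASE_STEPS + math.ceil(STEPS_PER_METER * delta))
```

An environment using this would compute the budget at reset and truncate the episode once the step counter exceeds it.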

Topics

Reinforcement Learning
Unitree G1
Robotic Arm Control
Proximal Policy Optimization
Sim-to-Real Transfer
Robot Safety
OpenAI Gym

Best for: Machine Learning Engineer, Robotics Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by sentdex.