Reinforcement learning with Unitree G1 humanoid - Dev w/ G1 P.5
Summary
A new reinforcement learning policy, trained using Proximal Policy Optimization (PPO), has been developed for controlling the arm of a Unitree G1 humanoid robot. This compact model, a 2x64 network weighing only 184 kilobytes, enables the G1's arm to seek and reach target positions in 3D space. The policy was trained for approximately 24 hours in an OpenAI Gym-formatted environment. It takes 24 observation values, including joint angles, velocities, hand position, and goal position, and outputs a 7D action of incremental joint commands. Key challenges were sim-to-real alignment and safety: the G1's default motor speeds and unexpected snapping behavior can overload the motors and shut the robot down. The reward function penalizes distance to the target, collisions, and joint limit excursions, and rewards reaching the target within a 2 cm tolerance.
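The reward terms described above can be sketched roughly as follows. This is a minimal illustration, not the original author's reward function: the function name `reach_reward`, the weights, and the bonus value are all assumptions chosen for readability. Only the 2 cm tolerance and the four terms (distance, collision, joint limits, success) come from the summary.

```python
import math

GOAL_TOLERANCE_M = 0.02  # 2 cm success tolerance, from the summary


def reach_reward(hand_pos, goal_pos, collided, joint_limit_violations,
                 w_dist=1.0, collision_penalty=10.0,
                 limit_penalty=1.0, success_bonus=10.0):
    """Return (reward, reached) for one step.

    All weights are hypothetical placeholders, not the values used
    to train the policy described in the article.
    """
    # Euclidean distance between hand and goal, in meters.
    dist = math.dist(hand_pos, goal_pos)

    # Dense shaping term: being farther from the goal costs more.
    reward = -w_dist * dist

    # Penalize any collision reported by the simulator.
    if collided:
        reward -= collision_penalty

    # Penalize each joint currently outside its allowed range.
    reward -= limit_penalty * joint_limit_violations

    # Sparse bonus once the hand is within tolerance of the goal.
    reached = dist <= GOAL_TOLERANCE_M
    if reached:
        reward += success_bonus

    return reward, reached
```

In a sketch like this, the relative size of the penalties determines how willing the policy is to risk a collision or a joint limit excursion to shorten the path, so the weights would need tuning against the real environment.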
Key takeaway
Robotics engineers developing reinforcement learning policies for humanoid robots like the Unitree G1 should prioritize robust reward function design that incorporates physical constraints and safety. Carefully tune joint limits and consider dynamic episode termination to prevent motor overloads and unexpected robot behavior. Additionally, adopt safety protocols, such as booting the robot with the remote so the E-stop is immediately within reach, to mitigate the risks of high-speed movements and firmware quirks.
Key insights
PPO-trained, compact RL policies can achieve precise robotic arm control despite sim-to-real challenges and inherent robot safety risks.
Principles
- Reward functions must balance task completion with physical constraints.
- Sim-to-real alignment is crucial but rarely perfect.
- Default robot speeds can pose significant safety risks.
Method
Train a PPO model in an OpenAI Gym environment using 24 observations (joint states, hand/goal positions) and 7D incremental joint commands, terminating each episode on goal achievement, collision, or a joint limit violation.
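The action and termination logic of such an environment might look like the sketch below. It assumes actions are normalized to [-1, 1] and scaled by a per-step cap (`MAX_DELTA_RAD`). That convention, the cap value, and the helper names are hypothetical; the summary only states that the 7D actions are incremental joint commands and that episodes end on goal, collision, or joint limits.

```python
NUM_JOINTS = 7         # 7D action space, from the summary
MAX_DELTA_RAD = 0.05   # assumed cap on each per-step joint increment
GOAL_TOLERANCE_M = 0.02


def apply_incremental_action(joint_pos, action, max_delta=MAX_DELTA_RAD):
    """Add a small, bounded increment to each joint target."""
    if len(joint_pos) != NUM_JOINTS or len(action) != NUM_JOINTS:
        raise ValueError("expected 7 joint positions and 7 action values")
    # Clip each action to [-1, 1] before scaling, so a single step
    # can never move a joint by more than max_delta radians.
    return [q + max_delta * max(-1.0, min(1.0, a))
            for q, a in zip(joint_pos, action)]


def termination_reason(dist_to_goal, collided, joint_pos, limits,
                       tol=GOAL_TOLERANCE_M):
    """Return 'goal', 'collision', 'joint_limit', or None to continue."""
    if dist_to_goal <= tol:
        return "goal"
    if collided:
        return "collision"
    # limits is a list of (low, high) pairs, one per joint.
    if any(q < lo or q > hi for q, (lo, hi) in zip(joint_pos, limits)):
        return "joint_limit"
    return None
```

Bounding the per-step increment is one way to keep commanded motion slow regardless of what the policy outputs, which is relevant given the motor speed concerns noted in the summary.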
In practice
- Implement dynamic episode termination based on target delta.
- Constrain joint limits (e.g., shoulder pitch, elbow) to prevent overloads.
- Use remote control for boot-up to enforce E-stop readiness.
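The first two practices above could be sketched as follows. The limit values, joint names, and step-budget constants are placeholders for illustration only and are not the G1's real ranges. Treating "dynamic episode termination based on target delta" as a step budget that scales with the initial distance to the target is one possible reading, not a confirmed description of the original approach.

```python
# Hypothetical tightened limits in radians; NOT the G1's real joint ranges.
SAFE_LIMITS = {
    "shoulder_pitch": (-1.0, 1.0),
    "elbow": (-0.5, 1.5),
}


def clamp_to_safe_limits(targets, limits=SAFE_LIMITS):
    """Clamp each commanded joint target into its tightened range.

    Joints without an entry in `limits` are passed through unchanged.
    """
    clamped = {}
    for name, value in targets.items():
        if name in limits:
            lo, hi = limits[name]
            value = max(lo, min(hi, value))
        clamped[name] = value
    return clamped


def episode_step_budget(initial_dist_m, steps_per_meter=400,
                        min_steps=50, max_steps=500):
    """Give nearer targets fewer steps and farther targets more.

    The episode would be truncated once this many steps have elapsed.
    """
    budget = initial_dist_m * steps_per_meter
    return int(max(min_steps, min(max_steps, budget)))
```

For example, `clamp_to_safe_limits({"shoulder_pitch": 2.0, "elbow": -1.0})` pulls both values back to the nearest bound, and `episode_step_budget(0.5)` allots 200 steps under these assumed constants.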
Topics
- Reinforcement Learning
- Unitree G1
- Robotic Arm Control
- Proximal Policy Optimization
- Sim-to-Real Transfer
- Robot Safety
- OpenAI Gym
Best for: Machine Learning Engineer, Robotics Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by sentdex.