Distributional Value Estimation Without Target Networks for Robust Quality-Diversity
Summary
QDHUAC is a novel Quality-Diversity Reinforcement Learning (QD-RL) algorithm designed to overcome the sample inefficiency of traditional QD methods, which often require tens of millions of environment steps. It achieves this by introducing a target-free distributional critic architecture, eliminating the computational bottleneck of target networks that typically stabilize high Update-to-Data (UTD) ratio training. QDHUAC provides dense, low-variance gradient signals, enabling stable training at high UTD ratios (e.g., UTD $\geq 10$) for Dominated Novelty Search. The algorithm employs a hybrid normalization scheme, combining Weight Normalization (WN) and Batch Normalization (BN) within a residual critic architecture, to mitigate internal covariate shift and constrain gradient magnitudes. Empirical results on high-dimensional Brax locomotion tasks, including Hopper, Walker2D, HalfCheetah, Ant, and Humanoid, demonstrate that QDHUAC achieves competitive coverage and fitness with an order of magnitude fewer samples than baseline QD-RL algorithms like PGA-ME and QD-PG.
Key takeaway
Research Scientists developing sample-efficient evolutionary reinforcement learning algorithms should investigate QDHUAC's target-free distributional critic with hybrid normalization. This approach significantly improves sample efficiency and stability in high-UTD regimes, particularly for tasks requiring diverse skill repertoires. You can achieve higher maximum fitness and faster discovery of diverse behaviors by adopting this architecture, potentially reducing computational costs by an order of magnitude compared to traditional target-network-based methods.
Key insights
Target-free distributional critics with hybrid normalization enable sample-efficient Quality-Diversity Reinforcement Learning at high UTD ratios.
Principles
- Target networks introduce latency in non-stationary QD-RL.
- Hybrid normalization stabilizes critics in high-UTD regimes.
- Distributional critics provide richer gradient signals.
Method
QDHUAC uses a target-free distributional critic with hybrid (Weight + Batch) normalization and a residual architecture. It integrates with Dominated Novelty Search, employing a hybrid mutation strategy and Prioritized Experience Replay.
In practice
- Use hybrid normalization for stable high-UTD training.
- Employ distributional critics for dense gradient signals.
- Consider target-free architectures for dynamic environments.
Topics
- Quality-Diversity
- Reinforcement Learning
- Target-Free Critic
- Distributional Value Estimation
- Hybrid Normalization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.NE updates on arXiv.org.