Distributional Value Estimation Without Target Networks for Robust Quality-Diversity

Source: cs.NE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

QDHUAC is a Quality-Diversity Reinforcement Learning (QD-RL) algorithm designed to overcome the sample inefficiency of traditional QD methods, which often require tens of millions of environment steps. It introduces a target-free distributional critic, removing the target networks that are typically used to stabilize training at high Update-to-Data (UTD) ratios but that become a computational bottleneck. QDHUAC's critic provides dense, low-variance gradient signals, enabling stable training at high UTD ratios (e.g., UTD $\geq 10$) for Dominated Novelty Search. To mitigate internal covariate shift and constrain gradient magnitudes, the algorithm uses a hybrid normalization scheme that combines Weight Normalization (WN) and Batch Normalization (BN) within a residual critic architecture. Empirical results on high-dimensional Brax locomotion tasks (Hopper, Walker2D, HalfCheetah, Ant, and Humanoid) show that QDHUAC achieves competitive coverage and fitness with an order of magnitude fewer samples than baseline QD-RL algorithms such as PGA-ME and QD-PG.
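
To make "target-free distributional critic" concrete, here is a minimal sketch of one way such an update could look. It assumes a categorical (C51-style) return distribution over a fixed support; the summary does not say which distributional form QDHUAC uses, so this is an illustration and not the paper's implementation. The names `toy_critic`, `project_target`, and `target_free_loss` are hypothetical. The relevant point is that the bootstrap distribution is computed with the same parameters as the prediction, with no lagged target copy.

```python
import math

def make_support(v_min, v_max, num_atoms):
    """Evenly spaced return atoms between v_min and v_max."""
    step = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * step for i in range(num_atoms)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_critic(params, obs):
    """Hypothetical stand-in for a critic: one logit per atom, linear in a scalar obs."""
    return softmax([w * obs for w in params])

def project_target(next_probs, reward, discount, support):
    """Project the shifted atoms reward + discount * z back onto the fixed support."""
    n = len(support)
    v_min, v_max = support[0], support[-1]
    delta = (v_max - v_min) / (n - 1)
    projected = [0.0] * n
    for p, z in zip(next_probs, support):
        tz = min(max(reward + discount * z, v_min), v_max)
        # Fractional atom index, clamped to guard against rounding error.
        b = min(max((tz - v_min) / delta, 0.0), float(n - 1))
        lo, hi = math.floor(b), math.ceil(b)
        if lo == hi:
            projected[lo] += p
        else:
            # Split the mass between the two neighbouring atoms.
            projected[lo] += p * (hi - b)
            projected[hi] += p * (b - lo)
    return projected

def target_free_loss(critic, params, obs, reward, next_obs, discount, support):
    """Cross-entropy between the projected bootstrap target and the prediction.

    Both distributions come from the same `params` (no target network).
    In an autodiff framework the target would be wrapped in a stop-gradient.
    """
    target = project_target(critic(params, next_obs), reward, discount, support)
    predicted = critic(params, obs)
    return -sum(t * math.log(q + 1e-8) for t, q in zip(target, predicted))
```

At a UTD ratio of 10, a loss like this would be minimized ten times per environment step on replayed transitions. Terminal-state handling is omitted to keep the sketch short.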

Key takeaway

Research scientists developing sample-efficient evolutionary reinforcement learning algorithms should investigate QDHUAC's target-free distributional critic with hybrid normalization. This approach significantly improves sample efficiency and stability in high-UTD regimes, particularly for tasks requiring diverse skill repertoires. Adopting this architecture can yield higher maximum fitness and faster discovery of diverse behaviors, potentially reducing computational costs by an order of magnitude compared to traditional target-network-based methods.

Key insights

Target-free distributional critics with hybrid normalization enable sample-efficient Quality-Diversity Reinforcement Learning at high UTD ratios.
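
The hybrid normalization mentioned here can be sketched as a residual block that applies a weight-normalized linear layer followed by batch normalization. The ordering of the two normalizations, the ReLU activation, and the function names are assumptions made for illustration; the summary only states that WN and BN are combined inside a residual critic.

```python
import math

def weight_norm_linear(x, v, g, b):
    """Linear layer whose i-th weight row is g[i] * v[i] / ||v[i]||."""
    out = []
    for v_row, g_i, b_i in zip(v, g, b):
        norm = math.sqrt(sum(w * w for w in v_row))
        dot = sum(w * xi for w, xi in zip(v_row, x))
        out.append(g_i * dot / norm + b_i)
    return out

def batch_norm(batch, eps=1e-5):
    """Normalize each feature to zero mean and unit variance across the batch."""
    n, dim = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dim)] for row in batch]

def residual_block(batch, v, g, b):
    """Compute x + relu(BN(WN-linear(x))) for every row x of the batch.

    The layer must be square (output width equals input width) so the
    skip connection can be added elementwise.
    """
    hidden = batch_norm([weight_norm_linear(x, v, g, b) for x in batch])
    return [[xi + max(0.0, hi) for xi, hi in zip(x, h)]
            for x, h in zip(batch, hidden)]
```

Weight normalization fixes the norm of each weight row to `g[i]` regardless of the scale of `v[i]`, which is one way to bound gradient magnitudes, while batch normalization keeps the activation statistics stable as the replay distribution shifts.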

Principles

Method

QDHUAC uses a target-free distributional critic with hybrid (Weight + Batch) normalization and a residual architecture. It integrates with Dominated Novelty Search, employing a hybrid mutation strategy and Prioritized Experience Replay.
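
For readers unfamiliar with Dominated Novelty Search, the sketch below shows the selection rule as it is commonly described: each solution is scored by its mean descriptor distance to the `k` nearest solutions with strictly higher fitness, and solutions that nothing outperforms receive an infinite score. The summary does not spell out these details, so the rule, the default `k`, and the function names are assumptions.

```python
import math

def dns_competition_scores(fitnesses, descriptors, k=3):
    """Mean distance from each solution to its k nearest strictly fitter solutions."""
    scores = []
    for f_i, d_i in zip(fitnesses, descriptors):
        dists = sorted(
            math.dist(d_i, d_j)
            for f_j, d_j in zip(fitnesses, descriptors)
            if f_j > f_i
        )
        if not dists:
            # No fitter solution exists, so this one is always kept.
            scores.append(math.inf)
        else:
            nearest = dists[:k]
            scores.append(sum(nearest) / len(nearest))
    return scores

def select_survivors(fitnesses, descriptors, population_size, k=3):
    """Indices of the population_size solutions with the highest scores."""
    scores = dns_competition_scores(fitnesses, descriptors, k)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:population_size]
```

Under this rule a low-fitness solution survives only if it is far, in descriptor space, from everything that beats it, which is how the population stays diverse without a fixed grid of niches.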

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.NE updates on arXiv.org.