Distributional Value Estimation Without Target Networks for Robust Quality-Diversity

2026-04-23 · Source: cs.NE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

QDHUAC is a Quality-Diversity Reinforcement Learning (QD-RL) algorithm designed to overcome the sample inefficiency of traditional QD methods, which often require tens of millions of environment steps. It does so with a target-free distributional critic, removing the target networks that are typically used to stabilize training at high Update-to-Data (UTD) ratios but that add a computational bottleneck. The critic provides dense, low-variance gradient signals, enabling stable training at high UTD ratios (e.g., UTD $\geq 10$) for Dominated Novelty Search. The critic uses a residual architecture with a hybrid normalization scheme that combines Weight Normalization (WN) and Batch Normalization (BN) to mitigate internal covariate shift and constrain gradient magnitudes. Empirical results on high-dimensional Brax locomotion tasks (Hopper, Walker2D, HalfCheetah, Ant, and Humanoid) show that QDHUAC achieves competitive coverage and fitness with an order of magnitude fewer samples than baseline QD-RL algorithms such as PGA-ME and QD-PG.
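
For readers unfamiliar with Dominated Novelty Search, the sketch below illustrates its selection rule as it is commonly described: each solution is scored by its mean descriptor-space distance to the k nearest solutions with strictly higher fitness, and a solution with no fitter competitor scores infinity. The function names, the Euclidean metric, and the value of k are illustrative assumptions and are not taken from the QDHUAC paper.

```python
import math


def dominated_novelty(fitness, descriptors, k=3):
    """Score each solution by its mean distance to the k nearest fitter solutions.

    Assumption: Euclidean distance in descriptor space. Solutions that no other
    solution beats on fitness receive an infinite score and always survive.
    """
    scores = []
    for f_i, d_i in zip(fitness, descriptors):
        # Distances to every strictly fitter solution, nearest first.
        dists = sorted(
            math.dist(d_i, d_j)
            for f_j, d_j in zip(fitness, descriptors)
            if f_j > f_i
        )
        if not dists:
            scores.append(math.inf)
        else:
            nearest = dists[:k]
            scores.append(sum(nearest) / len(nearest))
    return scores


def select_survivors(fitness, descriptors, population_size, k=3):
    """Return the indices of the population_size highest-scoring solutions."""
    scores = dominated_novelty(fitness, descriptors, k)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:population_size])
```

With this rule, a weak solution sitting next to a stronger one in descriptor space scores low and is discarded first, while a weak solution in an unexplored region is kept.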

Key takeaway

Research scientists developing sample-efficient evolutionary reinforcement learning algorithms should investigate QDHUAC's target-free distributional critic with hybrid normalization. The approach improves sample efficiency and stability in high-UTD regimes, particularly for tasks requiring diverse skill repertoires. Adopting this architecture may yield higher maximum fitness and faster discovery of diverse behaviors, potentially with an order of magnitude fewer environment samples than traditional target-network-based methods.

Key insights

Target-free distributional critics with hybrid normalization enable sample-efficient Quality-Diversity Reinforcement Learning at high UTD ratios.

Principles

Target networks introduce latency in non-stationary QD-RL.
Hybrid normalization stabilizes critics in high-UTD regimes.
Distributional critics provide richer gradient signals.
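
The first and third principles can be made concrete with a minimal, dependency-free sketch of a target-free distributional update. The summary does not say which distributional parameterization QDHUAC uses, so the quantile representation and quantile Huber loss here are assumptions, and `critic` is a hypothetical toy model. The point illustrated is only that the bootstrap target is computed from the same online parameters as the prediction, with no lagging target-network copy.

```python
def critic(params, state):
    """Toy quantile critic (hypothetical): quantile i is w_i * state + b_i."""
    return [w * state + b for w, b in params]


def td_target_quantiles(reward, done, gamma, next_quantiles):
    """Distributional Bellman target: shift and scale each next-state quantile."""
    not_done = 0.0 if done else 1.0
    return [reward + gamma * not_done * z for z in next_quantiles]


def quantile_huber_loss(pred, target, kappa=1.0):
    """Quantile Huber loss: sum over predicted quantiles, mean over target samples."""
    n = len(pred)
    total = 0.0
    for i, p in enumerate(pred):
        tau = (i + 0.5) / n  # midpoint quantile fraction for prediction i
        for t in target:
            u = t - p
            abs_u = abs(u)
            huber = 0.5 * u * u if abs_u <= kappa else kappa * (abs_u - 0.5 * kappa)
            indicator = 1.0 if u < 0 else 0.0
            total += abs(tau - indicator) * huber / kappa
    return total / len(target)


def target_free_loss(params, transition, gamma=0.99):
    """Loss whose target comes from the online critic itself."""
    state, reward, next_state, done = transition
    pred = critic(params, state)
    # Same params as the prediction: there is no separate target network.
    # In an autodiff framework this value would be wrapped in a stop-gradient.
    next_quantiles = critic(params, next_state)
    target = td_target_quantiles(reward, done, gamma, next_quantiles)
    return quantile_huber_loss(pred, target)
```

Because every predicted quantile is compared against every target sample, each transition contributes many error terms, which is one way to read the "richer gradient signals" claim.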

Method

QDHUAC uses a target-free distributional critic with hybrid (Weight + Batch) normalization and a residual architecture. It integrates with Dominated Novelty Search, employing a hybrid mutation strategy and Prioritized Experience Replay.
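
A minimal sketch of what a hybrid-normalized residual block could look like follows, written in plain Python for readability. The ordering of the layers (weight-normalized linear, batch normalization, ReLU, weight-normalized linear, skip connection) and all parameter names are assumptions; the summary states only that WN and BN are combined inside a residual critic. Batch normalization is shown in training mode without learned scale and shift.

```python
import math


def weight_norm_linear(x, v, g, b):
    """Linear layer with weight normalization: w_k = g_k * v_k / ||v_k||.

    x: list of input rows; v: one direction vector per output unit;
    g: one gain per output unit; b: one bias per output unit.
    """
    w = []
    for v_row, g_k in zip(v, g):
        norm = math.sqrt(sum(c * c for c in v_row))
        w.append([g_k * c / norm for c in v_row])
    return [
        [sum(wi * xi for wi, xi in zip(w_row, row)) + b_k for w_row, b_k in zip(w, b)]
        for row in x
    ]


def batch_norm(x, eps=1e-5):
    """Normalize each feature to zero mean and unit variance across the batch."""
    n = len(x)
    cols = list(zip(*x))
    means = [sum(c) / n for c in cols]
    variances = [sum((a - m) ** 2 for a in c) / n for c, m in zip(cols, means)]
    return [
        [(a - m) / math.sqrt(var + eps) for a, m, var in zip(row, means, variances)]
        for row in x
    ]


def relu(x):
    return [[max(0.0, a) for a in row] for row in x]


def residual_block(x, p):
    """Assumed layout: x + WN-Linear(ReLU(BN(WN-Linear(x))))."""
    h = weight_norm_linear(x, p["v1"], p["g1"], p["b1"])
    h = relu(batch_norm(h))
    h = weight_norm_linear(h, p["v2"], p["g2"], p["b2"])
    return [[a + c for a, c in zip(row_x, row_h)] for row_x, row_h in zip(x, h)]
```

In this sketch the gain `g` alone sets the length of each weight vector, which bounds the scale of the layer output, while batch normalization keeps hidden activations on a fixed scale as the replay data distribution shifts.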

In practice

Use hybrid normalization for stable high-UTD training.
Employ distributional critics for dense gradient signals.
Consider target-free architectures when the learning problem is non-stationary, as in QD-RL.
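
To make "high-UTD training" concrete, the loop below is a hedged sketch of how the UTD ratio controls the number of critic updates per collected environment step. The callables `env_step` and `update_critic` are hypothetical placeholders, and sampling is uniform for brevity, whereas the method described above uses Prioritized Experience Replay.

```python
import random


def train(num_env_steps, utd_ratio, env_step, update_critic, batch_size=4, seed=0):
    """Collect one transition per step, then run utd_ratio critic updates on it.

    env_step: hypothetical callable returning one transition.
    update_critic: hypothetical callable applying one gradient step to a batch.
    Returns the total number of critic updates performed.
    """
    rng = random.Random(seed)
    replay = []
    updates = 0
    for _ in range(num_env_steps):
        replay.append(env_step())
        if len(replay) < batch_size:
            continue  # wait until one full batch can be sampled
        for _ in range(utd_ratio):
            batch = rng.sample(replay, batch_size)  # uniform, not prioritized
            update_critic(batch)
            updates += 1
    return updates
```

With `utd_ratio=10`, every environment step is followed by ten gradient steps, so any instability in the critic is amplified tenfold relative to a UTD of 1; that is the regime the normalization and distributional choices are meant to stabilize.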

Topics

Quality-Diversity
Reinforcement Learning
Target-Free Critic
Distributional Value Estimation
Hybrid Normalization

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.NE updates on arXiv.org.