ROCm with PyTorch and PyTorch Lightning seems to still suck for research [D]

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, short

Summary

A user's experience with AMD's ROCm platform for machine learning research points to significant problems, particularly with PyTorch and PyTorch Lightning. After buying an RX 7900 XTX, the user hit widespread NaN errors during the backward pass when porting a flow matching model (SANA architecture) codebase that had run flawlessly on NVIDIA RTX 3090s. Switching between bf16 and fp32 precision and adjusting environment variables did not resolve the errors. A standard nanoGPT training script ran perfectly, which leads the user to suspect that the ROCm stack is robust only for well-established codebases and fragile with even slightly uncommon code. Commenters corroborate this, citing similar NaN issues, operators that behave differently than on CUDA, and a lack of clear documentation on the differences that matter to ROCm users.
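One way to narrow down NaN errors in a backward pass like the ones described above is to check, after each `backward()` call, which parameters received non-finite gradients. The following is a minimal sketch, not the original poster's code: the helper name `nonfinite_grads` and the toy model are assumptions made for illustration, and the NaN is injected by hand so the example runs on any machine.

```python
import torch
from torch import nn


def nonfinite_grads(model: nn.Module) -> list[str]:
    """Return the names of parameters whose gradient contains NaN or Inf."""
    bad = []
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    return bad


# Toy stand-in model (hypothetical); this is not the SANA architecture.
model = nn.Sequential(nn.Linear(4, 8), nn.GELU(), nn.Linear(8, 1))
x = torch.randn(16, 4)

loss = model(x).pow(2).mean()
loss.backward()
healthy = nonfinite_grads(model)  # gradients as computed by autograd

# Deliberately poison one gradient entry to show what a failure looks like.
model[0].weight.grad[0, 0] = float("nan")
poisoned = nonfinite_grads(model)

print("before poisoning:", healthy)
print("after poisoning:", poisoned)
```

In a real training loop the check would run right after `loss.backward()` and before the optimizer step. PyTorch's `torch.autograd.set_detect_anomaly(True)` can additionally point to the forward operation whose backward produced a NaN, at a noticeable performance cost.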

Key takeaway

If you are an AI engineer evaluating AMD GPUs for deep learning research, anticipate stability issues and NaN errors, especially with custom or less common PyTorch codebases. Do not expect a plug-and-play experience: debugging may be substantial, and you may need to try workarounds such as adjusting precision or compiler settings. Code developed and tuned on CUDA may not carry over cleanly, so validate backward passes and custom operations thoroughly before relying on the results.
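Validating a backward pass can be as simple as comparing gradients computed on the GPU against a float64 reference computed on the CPU. The sketch below illustrates the idea under stated assumptions: the helper `grad_mismatch` and the test function `fn` are hypothetical, and the tolerance you accept will depend on dtype and operation. ROCm builds of PyTorch expose the GPU under the `"cuda"` device name, so the same device string works on both vendors; the example falls back to the CPU when no GPU is present.

```python
import torch


def grad_mismatch(fn, x_cpu: torch.Tensor, device: str) -> float:
    """Max absolute difference between input gradients computed on `device`
    and a float64 reference computed on the CPU."""
    # Reference gradient: float64 on the CPU.
    x_ref = x_cpu.detach().double().requires_grad_(True)
    fn(x_ref).sum().backward()

    # Candidate gradient: original dtype on the target device.
    x_dev = x_cpu.detach().to(device).requires_grad_(True)
    fn(x_dev).sum().backward()

    diff = x_dev.grad.double().cpu() - x_ref.grad
    return diff.abs().max().item()


# Hypothetical operation under test; substitute the op or module you suspect.
def fn(t: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(t) * torch.softmax(t, dim=-1)


device = "cuda" if torch.cuda.is_available() else "cpu"
err = grad_mismatch(fn, torch.randn(8, 16), device)
print(f"max gradient mismatch on {device}: {err:.3e}")
```

A result that is NaN, or orders of magnitude larger than the float32 rounding error, suggests the operation behaves differently on that backend and is worth isolating further.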

Key insights

ROCm remains challenging for ML research due to instability and NaN issues with less common PyTorch codebases.

Best for: AI Engineer, NLP Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.