Performance Profiling on AMD GPUs - Part 4: Fortran OpenMP Offload Edition
Summary
Part 4 of the "Performance Profiling on AMD GPUs" series details systematic analysis and optimization techniques for Fortran OpenMP offload applications on AMD GPUs, using ROCm 7.2 tools. The case study is the GenASiS astrophysics simulation code, compiled with amdflang from ROCm AFAR drop 22.2.0 and run on an MI300A APU. The analysis revealed a relatively flat kernel profile: the top kernels consume approximately 50% of GPU time and operate at 64-94% of peak HBM bandwidth, which indicates memory-bound behavior. With HSA_XNACK=1, some kernels run up to 59% faster once page migration has completed, and the number of MEMORY_COPY instances drops by 84%. Workgroup size tuning, specifically setting thread_limit to 512 for computeeigenspeedskernel_l166, improved that kernel by 3% but had minimal overall impact because of the flat profile. Multi-device profiling showed a computation-to-communication ratio of 4-6x.
Key takeaway
HPC developers optimizing Fortran OpenMP offload applications on AMD GPUs should adopt a systematic profiling workflow. Prefer HSA_XNACK=1 for production runs, where it yields kernel performance gains of up to 59%, but use HSA_XNACK=0 for accurate short-run profiling. Instrument code with ROCTx markers and use rocprof-compute for roofline analysis to confirm which kernels are memory-bound. Workgroup size tuning offers only incremental gains; when the profile is flat, focus on broader optimizations such as bundling communication.
Key insights
Profiling Fortran OpenMP offload on AMD GPUs requires a systematic workflow to identify memory-bound kernels and to tune the HSA_XNACK setting and workgroup sizes.
Principles
- GPU kernels often exhibit memory-bound behavior.
- HSA_XNACK=1 can significantly boost kernel performance.
- Flat kernel profiles limit single-kernel optimization impact.
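The last principle can be made concrete with a little arithmetic. The sketch below applies an Amdahl-style estimate to a single-kernel improvement. The 10% share of GPU time is a hypothetical value chosen for illustration (the source reports only the 3% kernel-level gain, read here as a reduction in that kernel's time), and the function name is ours.

```python
def overall_gain(kernel_share, kernel_gain):
    """Fractional reduction in total GPU time when one kernel gets faster.

    kernel_share: fraction of total GPU time spent in the kernel (0..1)
    kernel_gain:  fractional reduction of that kernel's own time (0..1)
    """
    if not (0.0 <= kernel_share <= 1.0 and 0.0 <= kernel_gain < 1.0):
        raise ValueError("shares and gains must be fractions")
    # Time outside the kernel is unchanged; time inside shrinks by kernel_gain.
    new_total = (1.0 - kernel_share) + kernel_share * (1.0 - kernel_gain)
    return 1.0 - new_total


# Hypothetical: the tuned kernel accounts for 10% of GPU time (assumed, not
# from the source) and tuning makes it 3% faster (figure from the source).
gain = overall_gain(kernel_share=0.10, kernel_gain=0.03)
print(f"overall GPU time reduction: {gain:.2%}")
```

Under these assumptions the whole-run gain is a fraction of a percent, which is why a flat profile pushes effort toward optimizations that touch many kernels at once.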
Method
Establish a baseline, identify bottlenecks (host, device, or data movement), analyze hardware usage and kernel metrics, apply targeted optimizations, and iterate. Use rocprofv3 to collect traces, rocpd for summaries and Perfetto output, and rocprof-compute for roofline analysis.
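To illustrate what the roofline step checks, here is a minimal sketch of a one-level roofline classification. It is not how rocprof-compute works internally. The hardware numbers are round placeholders chosen for illustration, not MI300A specifications, and the function name is hypothetical.

```python
def roofline_bound(arith_intensity, peak_flops, peak_bw):
    """Classify a kernel against a simple one-level roofline.

    arith_intensity: FLOPs per byte moved to/from memory
    peak_flops:      peak compute rate in FLOP/s
    peak_bw:         peak memory bandwidth in bytes/s
    Returns (attainable FLOP/s, "memory-bound" or "compute-bound").
    """
    ridge = peak_flops / peak_bw  # intensity at which the two roofs meet
    attainable = min(peak_flops, arith_intensity * peak_bw)
    kind = "memory-bound" if arith_intensity < ridge else "compute-bound"
    return attainable, kind


# Placeholder hardware numbers for illustration only (assumed).
PEAK_FLOPS = 50e12  # 50 TFLOP/s
PEAK_BW = 5e12      # 5 TB/s

attainable, kind = roofline_bound(0.5, PEAK_FLOPS, PEAK_BW)
print(f"{kind}: attainable {attainable / 1e12:.1f} TFLOP/s")
```

A kernel whose arithmetic intensity sits left of the ridge point is limited by bandwidth, which is consistent with kernels that already run at a high fraction of peak HBM bandwidth.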
In practice
- Instrument Fortran code with ROCTx markers via hipfort.
- Use HSA_XNACK=0 for short profiling runs, HSA_XNACK=1 for production.
- Tune OpenMP thread_limit for optimal workgroup size.
Topics
- AMD GPUs
- Fortran OpenMP Offload
- ROCm Profiling Tools
- GenASiS
- Performance Optimization
- Unified Memory Model
Best for: Research Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.