Performance Profiling on AMD GPUs - Part 4: Fortran OpenMP Offload Edition

Source: AMD ROCm Blogs · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Mathematics & Computational Sciences) · Depth: Advanced, extended

Summary

This guide describes systematic analysis and optimization techniques for Fortran OpenMP offload applications on AMD GPUs using ROCm 7.2 tools. The case study is the GenASiS astrophysics simulation code, compiled with amdflang from ROCm AFAR drop 22.2.0 and run on an MI300A APU. The analysis revealed a relatively flat kernel profile: the top kernels account for approximately 50% of GPU time and operate at 64-94% of peak HBM bandwidth, which indicates memory-bound behavior. With HSA_XNACK=1, some kernels ran up to 59% faster once page migration had completed, and the number of MEMORY_COPY instances fell by 84%. Workgroup size tuning, specifically setting thread_limit to 512 for computeeigenspeedskernel_l166, improved that kernel by 3% but had minimal overall impact because the profile is flat. Multi-device profiling showed a computation-to-communication ratio of 4-6x.
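The remark that a 3% gain on one kernel barely moves the total can be checked with a little arithmetic. The sketch below is illustrative only: it reads "3% improvement" as a 3% reduction in that kernel's runtime, and the kernel's share of total GPU time (`ASSUMED_KERNEL_SHARE`) is a hypothetical value, since the summary does not state it.

```python
def overall_time_reduction(kernel_share: float, kernel_gain: float) -> float:
    """Fraction of total GPU time saved when a single kernel gets faster.

    kernel_share: fraction of total GPU time spent in the kernel (0..1).
    kernel_gain:  fractional reduction in that kernel's own runtime (0..1).
    """
    if not 0.0 <= kernel_share <= 1.0:
        raise ValueError("kernel_share must be between 0 and 1")
    if not 0.0 <= kernel_gain <= 1.0:
        raise ValueError("kernel_gain must be between 0 and 1")
    return kernel_share * kernel_gain


# Hypothetical: the summary does not give this kernel's share of GPU time.
ASSUMED_KERNEL_SHARE = 0.10
# The 3% per-kernel improvement quoted in the summary.
KERNEL_GAIN = 0.03

saved = overall_time_reduction(ASSUMED_KERNEL_SHARE, KERNEL_GAIN)
print(f"Overall GPU time saved: {saved:.2%}")
```

Under these assumptions the total saving is roughly 0.3% of GPU time, which is why per-kernel tuning pays off slowly when no single kernel dominates.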

Key takeaway

HPC developers optimizing Fortran OpenMP offload applications on AMD GPUs should adopt a systematic profiling workflow. Use HSA_XNACK=1 for production runs, where it improved kernel performance by up to 59%, but use HSA_XNACK=0 for accurate short-run profiling. Instrument the code with ROCTx markers and use rocprof-compute for roofline analysis to confirm which kernels are memory-bound. Workgroup size tuning offers only incremental gains, so for a flat profile focus on broader optimizations such as bundling communication.
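The roofline reasoning behind "memory-bound" can be sketched in a few lines. This is a generic roofline calculation, not the output or API of rocprof-compute; the peak values below are hypothetical placeholders rather than MI300A specifications, and the kernel numbers are invented for illustration.

```python
def attainable_gflops(arith_intensity: float, peak_gflops: float,
                      peak_bw_gbs: float) -> float:
    """Roofline ceiling: the lower of the compute roof and the bandwidth roof."""
    return min(peak_gflops, arith_intensity * peak_bw_gbs)


def is_memory_bound(arith_intensity: float, peak_gflops: float,
                    peak_bw_gbs: float) -> bool:
    """A kernel is memory-bound when it sits left of the ridge point."""
    ridge_point = peak_gflops / peak_bw_gbs  # FLOP per byte
    return arith_intensity < ridge_point


def bandwidth_fraction(bytes_moved: float, seconds: float,
                       peak_bw_gbs: float) -> float:
    """Achieved memory bandwidth as a fraction of the peak."""
    achieved_gbs = bytes_moved / seconds / 1e9
    return achieved_gbs / peak_bw_gbs


# Hypothetical device peaks (placeholders, not vendor specifications).
PEAK_GFLOPS = 60000.0
PEAK_BW_GBS = 5000.0

# Hypothetical kernel: 0.25 FLOP per byte, moving 4 GB in 1 ms.
ceiling = attainable_gflops(0.25, PEAK_GFLOPS, PEAK_BW_GBS)
bound = is_memory_bound(0.25, PEAK_GFLOPS, PEAK_BW_GBS)
fraction = bandwidth_fraction(4e9, 1e-3, PEAK_BW_GBS)
print(f"ceiling={ceiling:.0f} GFLOP/s, memory_bound={bound}, "
      f"bandwidth fraction={fraction:.0%}")
```

A kernel with low arithmetic intensity and a high bandwidth fraction is already close to what the memory system can deliver, so reducing data movement tends to help more than tuning launch parameters.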

Key insights

Profiling Fortran OpenMP offload applications on AMD GPUs requires a systematic workflow that identifies memory-bound kernels and guides the choice of HSA_XNACK setting and workgroup sizes.

Principles

Method

Establish a baseline, identify bottlenecks (host, device, or data movement), analyze hardware usage and kernel metrics, apply targeted optimizations, and iterate. Use rocprofv3 to collect traces, rocpd for summaries and Perfetto output, and rocprof-compute for roofline analysis.
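The "identify bottlenecks" step usually begins by ranking kernels by total time and checking how concentrated the profile is. The sketch below is a minimal illustration: the column names `Kernel_Name`, `Start_Timestamp`, and `End_Timestamp` are assumed to match a CSV kernel trace, so check them against your own profiler output, and the kernel names and timings are made-up sample data.

```python
from collections import defaultdict


def rank_kernels(rows):
    """Return [(kernel_name, total_ns, share_of_total)], longest first."""
    totals = defaultdict(int)
    for row in rows:
        duration = int(row["End_Timestamp"]) - int(row["Start_Timestamp"])
        totals[row["Kernel_Name"]] += duration
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, ns, ns / grand_total) for name, ns in ranked]


def top_share(ranked, n):
    """Combined share of total kernel time taken by the top n kernels."""
    return sum(share for _, _, share in ranked[:n])


# Made-up sample rows; with real data, read them using csv.DictReader.
sample_rows = [
    {"Kernel_Name": "kernel_a", "Start_Timestamp": "0", "End_Timestamp": "300"},
    {"Kernel_Name": "kernel_b", "Start_Timestamp": "300", "End_Timestamp": "550"},
    {"Kernel_Name": "kernel_c", "Start_Timestamp": "550", "End_Timestamp": "750"},
    {"Kernel_Name": "kernel_a", "Start_Timestamp": "750", "End_Timestamp": "850"},
    {"Kernel_Name": "kernel_d", "Start_Timestamp": "850", "End_Timestamp": "1000"},
]

ranked = rank_kernels(sample_rows)
for name, total_ns, share in ranked:
    print(f"{name}: {total_ns} ns ({share:.0%})")
print(f"Top 2 kernels: {top_share(ranked, 2):.0%} of kernel time")
```

When the top few kernels cover only about half of the time, as in the profile summarized here, the list is flat and no single kernel is an obvious target.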

Best for: Research Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.