From 5974fce9ad1602d5f0ce1b3de311bcbcdbaa2718 Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Thu, 10 Apr 2025 16:41:47 +0200 Subject: [PATCH 01/36] Fix CU and WGP mode effect on warpSize There is no effect on warpSize --- docs/how-to/hip_cpp_language_extensions.rst | 4 +--- docs/reference/hardware_features.rst | 10 ++++++---- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/how-to/hip_cpp_language_extensions.rst b/docs/how-to/hip_cpp_language_extensions.rst index afedf9cc06..21ab449be5 100644 --- a/docs/how-to/hip_cpp_language_extensions.rst +++ b/docs/how-to/hip_cpp_language_extensions.rst @@ -411,9 +411,7 @@ warpSize ================================================================================ The ``warpSize`` constant contains the number of threads per warp for the given -target device. It can differ between different architectures, and on RDNA -architectures it can even differ between kernel launches, depending on whether -they run in CU or WGP mode. See the +target device. It can differ between different architectures, see the :doc:`hardware features <../reference/hardware_features>` for more information. diff --git a/docs/reference/hardware_features.rst b/docs/reference/hardware_features.rst index f5e227fc78..5bf3a74b81 100644 --- a/docs/reference/hardware_features.rst +++ b/docs/reference/hardware_features.rst @@ -240,10 +240,12 @@ page. - 106 - 104 -.. [1] RDNA architectures have a configurable wavefront size. The native - wavefront size is 32, but they can run in "CU mode", which has an effective - wavefront size of 64. This affects the number of resident wavefronts and - blocks per compute Unit. +.. [1] The RDNA architectures feature an experimental compiler option called + ``mwavefrontsize64``, which determines the wavefront size for kernel code + generation. When this option is disabled, the native wavefront size of 32 is + used, when enabled wavefront size 64 is used. This option is not supported by + the HIP runtime. + .. [2] RDNA architectures expand the concept of the traditional compute unit with the so-called work group processor, which effectively includes two compute units, within which all threads can cooperate. From 8f2fc851787748b554828d6638e5eb92577789b8 Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Mon, 14 Apr 2025 12:57:15 +0200 Subject: [PATCH 02/36] Use single example for installation tests --- docs/install/build.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/install/build.rst b/docs/install/build.rst index 64deba241b..49cf9adca6 100644 --- a/docs/install/build.rst +++ b/docs/install/build.rst @@ -238,4 +238,5 @@ Run HIP ================================================= After installation and building HIP, you can compile your application and run. -Simple examples can be found in the `ROCm-examples repository `_. 
+A simple SAXPY example can be found in the `ROCm-examples repository `_ +and the guide on how to build and run it is in the :doc:`SAXPY tutorial <../tutorial/saxpy>` From 6b91c48c27e8deb186187ae4865e7dac6aea9b07 Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Tue, 15 Apr 2025 14:23:13 +0200 Subject: [PATCH 03/36] Update install and build instructions --- docs/install/build.rst | 254 ++++++++++++++++++++------------------- docs/install/install.rst | 112 ++++++++--------- 2 files changed, 187 insertions(+), 179 deletions(-) diff --git a/docs/install/build.rst b/docs/install/build.rst index 49cf9adca6..76903a81b9 100644 --- a/docs/install/build.rst +++ b/docs/install/build.rst @@ -9,27 +9,28 @@ Build HIP from source Prerequisites ================================================= -HIP code can be developed either on AMD ROCm platform using HIP-Clang compiler, or a CUDA platform with ``nvcc`` installed. -Before building and running HIP, make sure drivers and prebuilt packages are installed properly on the platform. +HIP code can be developed either on AMD ROCm platform using HIP-Clang compiler, +or a CUDA platform with ``nvcc`` installed. Before building and running HIP, +make sure drivers and prebuilt packages are installed properly on the platform. You also need to install Python 3, which includes the ``CppHeaderParser`` package. Install Python 3 using the following command: .. code-block:: shell - apt-get install python3 + apt-get install python3 Check and install ``CppHeaderParser`` package using the command: .. code-block:: shell - pip3 install CppHeaderParser + pip3 install CppHeaderParser Install ``ROCm LLVM`` package using the command: .. code-block:: shell - apt-get install rocm-llvm-dev + apt-get install rocm-llvm-dev .. _Building the HIP runtime: @@ -41,197 +42,200 @@ Set the repository branch using the variable: ``ROCM_BRANCH``. For example, for .. code-block:: shell - export ROCM_BRANCH=rocm-6.1.x + export ROCM_BRANCH=rocm-6.1.x .. tab-set:: - .. tab-item:: AMD - :sync: amd + .. tab-item:: AMD + :sync: amd - #. Get HIP source code. + #. Get HIP source code. - .. note:: - Starting in ROCM 5.6, CLR is a new repository that includes the former ROCclr, HIPAMD and - OpenCl repositories. OpenCL provides headers that ROCclr runtime depends on. + .. note:: + + Starting in ROCM 5.6, CLR is a new repository that includes the former ROCclr, HIPAMD and + OpenCl repositories. OpenCL provides headers that ROCclr runtime depends on. - .. note:: - Starting in ROCM 6.1, a new repository ``hipother`` is added to ROCm, which is branched out from HIP. - ``hipother`` provides files required to support the HIP back-end implementation on some non-AMD platforms, - like NVIDIA. + .. note:: - .. code-block:: shell + Starting in ROCM 6.1, a new repository ``hipother`` is added to ROCm, which is branched out from HIP. + ``hipother`` provides files required to support the HIP back-end implementation on some non-AMD platforms, + like NVIDIA. - git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git - git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git + .. code-block:: shell - CLR (Compute Language Runtime) repository includes ROCclr, HIPAMD and OpenCL. + git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git + git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git - ROCclr (ROCm Compute Language Runtime) is a virtual device interface which - is defined on the AMD platform. HIP runtime uses ROCclr to interact with different backends. 
+ CLR (Compute Language Runtime) repository includes ROCclr, HIPAMD and OpenCL. - HIPAMD provides implementation specifically for HIP on the AMD platform. + ROCclr (ROCm Compute Language Runtime) is a virtual device interface which + is defined on the AMD platform. HIP runtime uses ROCclr to interact with different backends. - OpenCL provides headers that ROCclr runtime currently depends on. - hipother provides headers and implementation specifically for non-AMD HIP platforms, like NVIDIA. + HIPAMD provides implementation specifically for HIP on the AMD platform. - #. Set the environment variables. + OpenCL provides headers that ROCclr runtime currently depends on. + hipother provides headers and implementation specifically for non-AMD HIP platforms, like NVIDIA. - .. code-block:: shell + #. Set the environment variables. - export CLR_DIR="$(readlink -f clr)" - export HIP_DIR="$(readlink -f hip)" + .. code-block:: shell + export CLR_DIR="$(readlink -f clr)" + export HIP_DIR="$(readlink -f hip)" - #. Build HIP. - .. code-block:: shell + #. Build HIP. - cd "$CLR_DIR" - mkdir -p build; cd build - cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF .. + .. code-block:: shell - make -j$(nproc) - sudo make install + cd "$CLR_DIR" + mkdir -p build; cd build + cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF .. - .. note:: + make -j$(nproc) + sudo make install - Note, if you don't specify ``CMAKE_INSTALL_PREFIX``, the HIP runtime is installed at - ````. + .. note:: - By default, release version of HIP is built. If need debug version, you can put the option ``CMAKE_BUILD_TYPE=Debug`` in the command line. + Note, if you don't specify ``CMAKE_INSTALL_PREFIX``, the HIP runtime is installed at + ````. - Default paths and environment variables: + By default, release version of HIP is built. If need debug version, you can + put the option ``CMAKE_BUILD_TYPE=Debug`` in the command line. - * HIP is installed into ````. This can be overridden by setting the ``INSTALL_PREFIX`` as the command option. - environment variable. - * HSA is in ````. This can be overridden by setting the ``HSA_PATH`` - environment variable. - * Clang is in ``/llvm/bin``. This can be overridden by setting the - ``HIP_CLANG_PATH`` environment variable. - * The device library is in ``/lib``. This can be overridden by setting the - ``DEVICE_LIB_PATH`` environment variable. - * Optionally, you can add ``/bin`` to your ``PATH``, which can make it easier to - use the tools. - * Optionally, you can set ``HIPCC_VERBOSE=7`` to output the command line for compilation. + Default paths and environment variables: - After you run the ``make install`` command, HIP is installed to ```` by default, or ``$PWD/install/hip`` while ``INSTALL_PREFIX`` is defined. + * HIP is installed into ````. This can be overridden by setting the ``INSTALL_PREFIX`` as the command option. + + * HSA is in ````. This can be overridden by setting the ``HSA_PATH`` environment variable. + + * Clang is in ``/llvm/bin``. This can be overridden by setting the ``HIP_CLANG_PATH`` environment variable. + + * The device library is in ``/lib``. This can be overridden by setting the ``DEVICE_LIB_PATH`` environment variable. + + * Optionally, you can add ``/bin`` to your ``PATH``, which can make it easier to use the tools. 
+ + * Optionally, you can set ``HIPCC_VERBOSE=7`` to output the command line for compilation. - #. Generate a profiling header after adding/changing a HIP API. + After you run the ``make install`` command, HIP is installed to ```` by default, or ``$PWD/install/hip`` while ``INSTALL_PREFIX`` is defined. - When you add or change a HIP API, you may need to generate a new ``hip_prof_str.h`` header. - This header is used by ROCm tools to track HIP APIs, such as ``rocprofiler`` and ``roctracer``. + #. Generate a profiling header after adding/changing a HIP API. - To generate the header after your change, use the ``hip_prof_gen.py`` tool located in - ``hipamd/src``. + When you add or change a HIP API, you may need to generate a new ``hip_prof_str.h`` header. + This header is used by ROCm tools to track HIP APIs, such as ``rocprofiler`` and ``roctracer``. - Usage: + To generate the header after your change, use the ``hip_prof_gen.py`` tool located in + ``hipamd/src``. - .. code-block:: shell + Usage: - `hip_prof_gen.py [-v] []` + .. code-block:: shell - Flags: + `hip_prof_gen.py [-v] []` - * ``-v``: Verbose messages - * ``-r``: Process source directory recursively - * ``-t``: API types matching check - * ``--priv``: Private API check - * ``-e``: On error exit mode - * ``-p``: ``HIP_INIT_API`` macro patching mode + Flags: - Example usage: + * ``-v``: Verbose messages + * ``-r``: Process source directory recursively + * ``-t``: API types matching check + * ``--priv``: Private API check + * ``-e``: On error exit mode + * ``-p``: ``HIP_INIT_API`` macro patching mode - .. code-block:: shell + Example usage: - hip_prof_gen.py -v -p -t --priv /include/hip/hip_runtime_api.h \ - /src /include/hip/amd_detail/hip_prof_str.h \ - /include/hip/amd_detail/hip_prof_str.h.new + .. code-block:: shell - .. tab-item:: NVIDIA - :sync: nvidia + hip_prof_gen.py -v -p -t --priv /include/hip/hip_runtime_api.h \ + /src /include/hip/amd_detail/hip_prof_str.h \ + /include/hip/amd_detail/hip_prof_str.h.new - #. Get the HIP source code. + .. tab-item:: NVIDIA + :sync: nvidia - .. code-block:: shell + #. Get the HIP source code. - git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git - git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git - git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hipother.git + .. code-block:: shell - #. Set the environment variables. + git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git + git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git + git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hipother.git - .. code-block:: shell + #. Set the environment variables. - export CLR_DIR="$(readlink -f clr)" - export HIP_DIR="$(readlink -f hip)" - export HIP_OTHER="$(readlink -f hipother)" + .. code-block:: shell - #. Build HIP. + export CLR_DIR="$(readlink -f clr)" + export HIP_DIR="$(readlink -f hip)" + export HIP_OTHER="$(readlink -f hipother)" - .. code-block:: shell + #. Build HIP. - cd "$CLR_DIR" - mkdir -p build; cd build - cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=nvidia -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF -DHIPNV_DIR=$HIP_OTHER/hipnv .. - make -j$(nproc) - sudo make install + .. code-block:: shell + + cd "$CLR_DIR" + mkdir -p build; cd build + cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=nvidia -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF -DHIPNV_DIR=$HIP_OTHER/hipnv .. 
+ make -j$(nproc) + sudo make install Build HIP tests ================================================= .. tab-set:: - .. tab-item:: AMD - :sync: amd + .. tab-item:: AMD + :sync: amd - * Build HIP catch tests. + **Build HIP catch tests.** - HIP catch tests are separate from the HIP project and use Catch2. + HIP catch tests are separate from the HIP project and use Catch2. - * Get HIP tests source code. + #. Get HIP tests source code. - .. code-block:: shell + .. code-block:: shell - git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip-tests.git + git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip-tests.git - * Build HIP tests from source. + #. Build HIP tests from source. - .. code-block:: shell + .. code-block:: shell - export HIPTESTS_DIR="$(readlink -f hip-tests)" - cd "$HIPTESTS_DIR" - mkdir -p build; cd build - cmake ../catch -DHIP_PLATFORM=amd -DHIP_PATH=$CLR_DIR/build/install # or any path where HIP is installed; for example: ``/opt/rocm`` - make build_tests - ctest # run tests + export HIPTESTS_DIR="$(readlink -f hip-tests)" + cd "$HIPTESTS_DIR" + mkdir -p build; cd build + cmake ../catch -DHIP_PLATFORM=amd -DHIP_PATH=$CLR_DIR/build/install # or any path where HIP is installed; for example: ``/opt/rocm`` + make build_tests + ctest # run tests - HIP catch tests are built in ``$HIPTESTS_DIR/build``. + HIP catch tests are built in ``$HIPTESTS_DIR/build``. - To run any single catch test, use this example: + To run any single catch test, use this example: - .. code-block:: shell + .. code-block:: shell - cd $HIPTESTS_DIR/build/catch_tests/unit/texture - ./TextureTest + cd $HIPTESTS_DIR/build/catch_tests/unit/texture + ./TextureTest - * Build a HIP Catch2 standalone test. + #. Build a HIP Catch2 standalone test. (Optional) - .. code-block:: shell + .. code-block:: shell - cd "$HIPTESTS_DIR" - hipcc $HIPTESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc \ - -I ./catch/include ./catch/hipTestMain/standalone_main.cc \ - -I ./catch/external/Catch2 -o hipPointerGetAttributes - ./hipPointerGetAttributes - ... + cd "$HIPTESTS_DIR" + hipcc $HIPTESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc \ + -I ./catch/include ./catch/hipTestMain/standalone_main.cc \ + -I ./catch/external/Catch2 -o hipPointerGetAttributes + ./hipPointerGetAttributes + ... - All tests passed + All tests passed - .. tab-item:: NVIDIA - :sync: nvidia + .. tab-item:: NVIDIA + :sync: nvidia - The commands to build HIP tests on an NVIDIA platform are the same as on an AMD platform. - However, you must first set ``-DHIP_PLATFORM=nvidia``. + The commands to build HIP tests on an NVIDIA platform are the same as on an AMD platform. + However, you must first set ``-DHIP_PLATFORM=nvidia``. Run HIP diff --git a/docs/install/install.rst b/docs/install/install.rst index c5cafac663..522c935edc 100644 --- a/docs/install/install.rst +++ b/docs/install/install.rst @@ -10,10 +10,10 @@ HIP can be installed on AMD (ROCm with HIP-Clang) and NVIDIA (CUDA with NVCC) pl .. note:: - The version definition for the HIP runtime is different from CUDA. On AMD - platforms, the :cpp:func:`hipRuntimeGetVersion` function returns the HIP - runtime version. On NVIDIA platforms, this function returns the CUDA runtime - version. + The version definition for the HIP runtime is different from CUDA. On AMD + platforms, the :cpp:func:`hipRuntimeGetVersion` function returns the HIP + runtime version. On NVIDIA platforms, this function returns the CUDA runtime + version. .. _install_prerequisites: @@ -22,84 +22,88 @@ Prerequisites .. tab-set:: - .. 
tab-item:: AMD - :sync: amd + .. tab-item:: AMD + :sync: amd - Refer to the Prerequisites section in the ROCm install guides: + Refer to the Prerequisites section in the ROCm install guides: - * :doc:`rocm-install-on-linux:reference/system-requirements` - * :doc:`rocm-install-on-windows:reference/system-requirements` + * :doc:`rocm-install-on-linux:reference/system-requirements` + * :doc:`rocm-install-on-windows:reference/system-requirements` - .. tab-item:: NVIDIA - :sync: nvidia + .. tab-item:: NVIDIA + :sync: nvidia - With NVIDIA GPUs, HIP requires unified memory. All CUDA-enabled NVIDIA - GPUs with compute capability 5.0 or later should be supported. For more - information, see `NVIDIA's list of CUDA enabled GPUs `_. + With NVIDIA GPUs, HIP requires unified memory. All CUDA-enabled NVIDIA + GPUs with compute capability 5.0 or later should be supported. For more + information, see `NVIDIA's list of CUDA enabled GPUs `_. Installation ======================================= .. tab-set:: - .. tab-item:: AMD - :sync: amd + .. tab-item:: AMD + :sync: amd - HIP is automatically installed during the ROCm installation. If you haven't yet installed ROCm, you - can find installation instructions here: + HIP is automatically installed during the ROCm installation. If you haven't + yet installed ROCm, you can find installation instructions here: - * :doc:`rocm-install-on-linux:index` - * :doc:`rocm-install-on-windows:index` + * :doc:`rocm-install-on-linux:index` + * :doc:`rocm-install-on-windows:index` - By default, HIP is installed into ``/opt/rocm``. + By default, HIP is installed into ``/opt/rocm``. - .. note:: - There is no autodetection for the HIP installation. If you choose to install it somewhere other than the default location, you must set the ``HIP_PATH`` environment variable as explained in `Build HIP from source <./build.html>`_. + .. note:: + + There is no autodetection for the HIP installation. If you choose to + install it somewhere other than the default location, you must set the + ``HIP_PATH`` environment variable as explained in + `Build HIP from source <./build.html>`_. - .. tab-item:: NVIDIA - :sync: nvidia + .. tab-item:: NVIDIA + :sync: nvidia - #. Install the NVIDIA toolkit. + #. Install the NVIDIA toolkit. - The latest release can be found here: - `CUDA Toolkit `_. + The latest release can be found here: + `CUDA Toolkit `_. - #. Setup the radeon repo. + #. Setup the radeon repo. - .. code-block::shell + .. code-block::shell - # Replace url with appropriate link in the table below - wget https://repo.radeon.com/amdgpu-install/6.2/distro/version_name/amdgpu-install_6.2.60200-1_all.deb - sudo apt install ./amdgpu-install_6.2.60200-1_all.deb - sudo apt update + # Replace url with appropriate link in the table below + wget https://repo.radeon.com/amdgpu-install/6.2/distro/version_name/amdgpu-install_6.2.60200-1_all.deb + sudo apt install ./amdgpu-install_6.2.60200-1_all.deb + sudo apt update - .. list-table:: amdgpu-install links - :widths: 25 100 - :header-rows: 1 + .. 
list-table:: amdgpu-install links + :widths: 25 100 + :header-rows: 1 - * - Ubuntu version - - URL - * - 24.04 - - https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/noble/amdgpu-install_6.2.60204-1_all.deb - * - 22.04 - - https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/jammy/amdgpu-install_6.2.60204-1_all.deb + * - Ubuntu version + - URL + * - 24.04 + - https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/noble/amdgpu-install_6.2.60204-1_all.deb + * - 22.04 + - https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/jammy/amdgpu-install_6.2.60204-1_all.deb - #. Install the ``hip-runtime-nvidia`` and ``hip-dev`` packages. This installs the CUDA SDK and HIP - porting layer. + #. Install the ``hip-runtime-nvidia`` and ``hip-dev`` packages. This installs the CUDA SDK and HIP + porting layer. - .. code-block:: shell + .. code-block:: shell - apt-get install hip-runtime-nvidia hip-dev + apt-get install hip-runtime-nvidia hip-dev - The default paths are: - * CUDA SDK: ``/usr/local/cuda`` - * HIP: ``/opt/rocm`` + The default paths are: + * CUDA SDK: ``/usr/local/cuda`` + * HIP: ``/opt/rocm`` - #. Set the HIP_PLATFORM to nvidia. + #. Set the HIP_PLATFORM to nvidia. - .. code-block:: shell + .. code-block:: shell - export HIP_PLATFORM="nvidia" + export HIP_PLATFORM="nvidia" Verify your installation ========================================================== @@ -108,4 +112,4 @@ Run ``hipconfig`` in your installation path. .. code-block:: shell - /opt/rocm/bin/hipconfig --full + /opt/rocm/bin/hipconfig --full From 2f0079b121c6aec966e54dfcd46f72aeb2188af7 Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Tue, 29 Apr 2025 19:18:45 +0200 Subject: [PATCH 04/36] Added not about HIP runtime build on windows --- docs/install/build.rst | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/docs/install/build.rst b/docs/install/build.rst index 76903a81b9..9e15251029 100644 --- a/docs/install/build.rst +++ b/docs/install/build.rst @@ -180,6 +180,11 @@ Set the repository branch using the variable: ``ROCM_BRANCH``. For example, for make -j$(nproc) sudo make install +.. note:: + + HIP runtime is not buildable on Windows as it depends on closed source + components. + Build HIP tests ================================================= From 1493d7ea8f209ac963c298e125499eb758a03ce4 Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Wed, 7 May 2025 16:17:02 +0200 Subject: [PATCH 05/36] Update text --- docs/install/build.rst | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/install/build.rst b/docs/install/build.rst index 9e15251029..b0a7baa43d 100644 --- a/docs/install/build.rst +++ b/docs/install/build.rst @@ -182,8 +182,7 @@ Set the repository branch using the variable: ``ROCM_BRANCH``. For example, for .. note:: - HIP runtime is not buildable on Windows as it depends on closed source - components. + The HIP runtime is only buildable on Linux. Build HIP tests ================================================= From bac057b36b7b7730589984d0bd9376cedf33cf96 Mon Sep 17 00:00:00 2001 From: Scott Date: Mon, 5 May 2025 15:26:12 -0700 Subject: [PATCH 06/36] Add missing newline character to include/hip/linker_types.h. 
--- include/hip/linker_types.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/hip/linker_types.h b/include/hip/linker_types.h index fd3d29d09a..505cdcf0f0 100755 --- a/include/hip/linker_types.h +++ b/include/hip/linker_types.h @@ -127,4 +127,4 @@ typedef enum hipJitFallback { #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__"); #endif -#endif // HIP_INCLUDE_HIP_LINKER_TYPES_H \ No newline at end of file +#endif // HIP_INCLUDE_HIP_LINKER_TYPES_H From 64a078c50579094e4c310e923a24840ccb2f6c53 Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Mon, 7 Apr 2025 08:25:31 +0200 Subject: [PATCH 07/36] Add lane masks bit-shift in the porting guide --- .wordlist.txt | 1 + docs/how-to/hip_porting_guide.rst | 64 +++++++++++++++++++++++++++++++ 2 files changed, 65 insertions(+) diff --git a/.wordlist.txt b/.wordlist.txt index 1bca54a941..6cbf374ae1 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -15,6 +15,7 @@ bfloat Bitcode bitcode bitcodes +bitmask blockDim blockIdx builtins diff --git a/docs/how-to/hip_porting_guide.rst b/docs/how-to/hip_porting_guide.rst index 6f1fe20f04..205ea6a848 100644 --- a/docs/how-to/hip_porting_guide.rst +++ b/docs/how-to/hip_porting_guide.rst @@ -611,6 +611,70 @@ code, while the host can query it during runtime via the device properties. See the :ref:`HIP language extension for warpSize ` for information on how to write portable wave-aware code. +Lane masks bit-shift +================================================================================ + +A thread in a warp is also called a lane, and a lane mask is a bitmask where +each bit corresponds to a thread in a warp. A bit is 1 if the thread is active, +0 if it's inactive. Bit-shift operations are typically used to create lane masks +and on AMD GPUs the ``warpSize`` can differ between different architectures, +that's why it's essential to use correct bitmask type, when porting code. + +Example: + +.. code-block:: cpp + + // Get the thread's position in the warp + unsigned int laneId = threadIdx.x % warpSize; + + // Use lane ID for bit-shift + val & ((1 << (threadIdx.x % warpSize) )-1 ); + + // Shift 32 bit integer with val variable + WarpReduce::sum( (val < warpSize) ? (1 << val) : 0); + +Lane masks are 32-bit integer types as this is the integer precision that C +assigns to such constants by default. GCN/CDNA architectures have a warp size of +64, :code:`threadIdx.x % warpSize` and :code:`val` in the example may obtain +values greater than 31. Consequently, shifting by such values would clear the +32-bit register to which the shift operation is applied. For AMD +architectures, a straightforward fix could look as follows: + +.. code-block:: cpp + + // Get the thread's position in the warp + unsigned int laneId = threadIdx.x % warpSize; + + // Use lane ID for bit-shift + val & ((1ull << (threadIdx.x % warpSize) )-1 ); + + // Shift 64 bit integer with val variable + WarpReduce::sum( (val < warpSize) ? (1ull << val) : 0); + +For portability reasons, it is better to introduce appropriately +typed placeholders as shown below: + +.. code-block:: cpp + + #if defined(__GFX8__) || defined(__GFX9__) + typedef uint64_t lane_mask_t; + #else + typedef uint32_t lane_mask_t; + #endif + +The use of :code:`lane_mask_t` with the previous example: + +.. 
code-block:: cpp + + // Get the thread's position in the warp + unsigned int laneId = threadIdx.x % warpSize; + + // Use lane ID for bit-shift + val & ((lane_mask_t{1} << (threadIdx.x % warpSize) )-1 ); + + // Shift 32 or 64 bit integer with val variable + WarpReduce::sum( (val < warpSize) ? (lane_mask_t{1} << val) : 0); + Porting from CUDA __launch_bounds__ ================================================================================ From 531c595b0bc132fdaba2f3c52f4ad60468310f6c Mon Sep 17 00:00:00 2001 From: Adel Johar Date: Tue, 20 May 2025 11:07:17 +0200 Subject: [PATCH 08/36] Docs: Add FP4/FP6 types --- docs/reference/low_fp_types.rst | 382 +++++++++++++++++++++++++++++++- 1 file changed, 379 insertions(+), 3 deletions(-) diff --git a/docs/reference/low_fp_types.rst b/docs/reference/low_fp_types.rst index 7fe450a35f..b5645eed61 100644 --- a/docs/reference/low_fp_types.rst +++ b/docs/reference/low_fp_types.rst @@ -12,6 +12,382 @@ and FP16 (Half Precision), which reduce memory and bandwidth requirements compar 32-bit or 64-bit formats. The following sections detail their specifications, variants, and provide practical guidance for implementation in HIP. +FP4 (4-bit Precision) +======================= + +FP4 (Floating Point 4-bit) numbers represent the current extreme in low-precision formats, +pushing the boundaries of memory optimization for specialized AI workloads. This ultra-compact +format is designed for scenarios where model size and computational efficiency are paramount +constraints, even at the cost of significant precision reduction. + +FP4 is particularly valuable in weight storage for large language models (LLMs) and vision +transformers, where aggressive quantization can dramatically reduce model size while +maintaining acceptable inference quality. By reducing memory footprint to a quarter of FP16, +FP4 enables deployment of larger models in memory-constrained environments or higher throughput +in existing hardware. + +The supported FP4 format is: + +- **E2M1 Format** + + - Sign: 1 bit + - Exponent: 2 bits + - Mantissa: 1 bit + +The E2M1 format offers a balance between minimal precision and a reasonable dynamic range, +optimized for weight storage in neural network applications. + +HIP Header +---------- + +The `HIP FP4 header `_ +defines the FP4 numbers. + +Supported Devices +----------------- + +Different GPU models support different FP4 formats. Here's a breakdown: + +.. list-table:: Supported devices for fp4 numbers + :header-rows: 1 + + * - Device Type + - E2M1 + * - Host + - Yes + * - CDNA1 + - No + * - CDNA2 + - No + * - CDNA3 + - Yes + * - RDNA2 + - No + * - RDNA3 + - No + +Using FP4 Numbers in HIP Programs +--------------------------------- + +To use the FP4 numbers inside HIP programs: + +.. code-block:: cpp + + #include + +FP4 numbers can be used on CPU side: + +.. code-block:: cpp + + __hip_fp4_storage_t convert_float_to_fp4( + float in, /* Input val */ + __hip_saturation_t sat /* Saturation behavior */ + ) { + return __hip_cvt_float_to_fp4(in, __HIP_E2M1, sat); + } + +The same can be done in kernels as well: + +.. code-block:: cpp + + __device__ __hip_fp4_storage_t d_convert_float_to_fp4( + float in, + __hip_saturation_t sat) { + return __hip_cvt_float_to_fp4(in, __HIP_E2M1, sat); + } + +The following code example demonstrates a simple roundtrip conversion using FP4 types: + +.. 
code-block:: cpp + + #include + #include + #include + #include + + #define hip_check(hip_call) \ + { \ + auto hip_res = hip_call; \ + if (hip_res != hipSuccess) { \ + std::cerr << "Failed in HIP call: " << #hip_call \ + << " at " << __FILE__ << ":" << __LINE__ \ + << " with error: " << hipGetErrorString(hip_res) << std::endl; \ + std::abort(); \ + } \ + } + + __global__ void float_to_fp4_to_float(float *in, + __hip_saturation_t sat, float *out, + size_t size) { + int i = threadIdx.x; + if (i < size) { + auto fp4 = __hip_cvt_float_to_fp4(in[i], __HIP_E2M1, sat); + out[i] = __hip_cvt_fp4_to_halfraw(fp4, __HIP_E2M1); + } + } + + int main() { + constexpr size_t size = 16; + hipDeviceProp_t prop; + hip_check(hipGetDeviceProperties(&prop, 0)); + bool is_supported = (std::string(prop.gcnArchName).find("gfx94") != std::string::npos); + if(!is_supported) { + std::cerr << "Need a gfx94x, but found: " << prop.gcnArchName << std::endl; + std::cerr << "Device conversions are not supported on this hardware." << std::endl; + return -1; + } + + constexpr __hip_saturation_t sat = __HIP_SATFINITE; + + // Create test data + std::vector in; + in.reserve(size); + for (size_t i = 0; i < size; i++) { + in.push_back(i * 0.5f); + } + + // Allocate device memory + float *d_in, *d_out; + hip_check(hipMalloc(&d_in, sizeof(float) * size)); + hip_check(hipMalloc(&d_out, sizeof(float) * size)); + hip_check(hipMemcpy(d_in, in.data(), sizeof(float) * size, hipMemcpyHostToDevice)); + + // Run conversion kernel + float_to_fp4_to_float<<<1, size>>>(d_in, sat, d_out, size); + + // Get results + std::vector result(size); + hip_check(hipMemcpy(result.data(), d_out, sizeof(float) * size, hipMemcpyDeviceToHost)); + + // Clean up + hip_check(hipFree(d_in)); + hip_check(hipFree(d_out)); + + // Display results + std::cout << "FP4 Roundtrip Results:" << std::endl; + for (size_t i = 0; i < size; i++) { + std::cout << "Original: " << in[i] << " -> FP4 roundtrip: " << result[i] << std::endl; + } + + return 0; + } + +There are C++ style classes available as well: + +.. code-block:: cpp + + __hip_fp4_e2m1 fp4_val(1.0f); + +FP4 type has its own class: + +- __hip_fp4_e2m1 + +There is support of vector of FP4 types: + +- __hip_fp4x2_e2m1: holds 2 values of FP4 e2m1 numbers +- __hip_fp4x4_e2m1: holds 4 values of FP4 e2m1 numbers + +FP6 (6-bit Precision) +======================== + +FP6 (Floating Point 6-bit) numbers represent an even more aggressive memory optimization +compared to FP8, designed specifically for ultra-efficient deep learning inference and +specialized AI applications. This extremely compact format delivers significant memory +and bandwidth savings at the cost of reduced dynamic range and precision. + +The primary advantage of FP6 is enabling higher computational throughput in +hardware-constrained environments, particularly for AI model deployment on edge devices +and applications where model size is a critical constraint. While offering less precision +than FP8, FP6 maintains sufficient accuracy for many inference tasks, especially when +used with carefully quantized models. + +There are two primary FP6 formats: + +- **E3M2 Format** + + - Sign: 1 bit + - Exponent: 3 bits + - Mantissa: 2 bits + +- **E2M3 Format** + + - Sign: 1 bit + - Exponent: 2 bits + - Mantissa: 3 bits + +The E3M2 format provides a wider numeric range with less precision, while the E2M3 format +offers higher precision within a narrower range. + +HIP Header +---------- + +The `HIP FP6 header `_ +defines the FP6 numbers. 
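To make the range-versus-precision trade-off between the two formats concrete,
the host-side sketch below round-trips the same value through both
interpretations. It assumes the conversion intrinsics shown later in this
section and the implicit ``__half_raw``-to-float conversion used in the device
example below:

.. code-block:: cpp

   #include <hip/hip_fp6.h>
   #include <iostream>

   int main() {
       const float x = 0.3f;
       const __hip_saturation_t sat = __HIP_SATFINITE;

       // E2M3 keeps more mantissa bits (finer precision, narrower range);
       // E3M2 keeps more exponent bits (wider range, coarser precision).
       __hip_fp6_storage_t e2m3 = __hip_cvt_float_to_fp6(x, __HIP_E2M3, sat);
       __hip_fp6_storage_t e3m2 = __hip_cvt_float_to_fp6(x, __HIP_E3M2, sat);

       // Convert back through __half_raw, as in the device example below.
       float back_e2m3 = __hip_cvt_fp6_to_halfraw(e2m3, __HIP_E2M3);
       float back_e3m2 = __hip_cvt_fp6_to_halfraw(e3m2, __HIP_E3M2);

       std::cout << x << " -> E2M3: " << back_e2m3
                 << ", E3M2: " << back_e3m2 << std::endl;
       return 0;
   }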
+ +Supported Devices +----------------- + +Different GPU models support different FP6 formats. Here's a breakdown: + +.. list-table:: Supported devices for fp6 numbers + :header-rows: 1 + + * - Device Type + - E3M2 + - E2M3 + * - Host + - Yes + - Yes + * - CDNA1 + - No + - No + * - CDNA2 + - No + - No + * - CDNA3 + - Yes + - Yes + * - RDNA2 + - No + - No + * - RDNA3 + - No + - No + +Using FP6 Numbers in HIP Programs +--------------------------------- + +To use the FP6 numbers inside HIP programs: + +.. code-block:: cpp + + #include + +FP6 numbers can be used on CPU side: + +.. code-block:: cpp + + __hip_fp6_storage_t convert_float_to_fp6( + float in, /* Input val */ + __hip_fp6_interpretation_t interpret, /* interpretation of number E3M2/E2M3 */ + __hip_saturation_t sat /* Saturation behavior */ + ) { + return __hip_cvt_float_to_fp6(in, interpret, sat); + } + +The same can be done in kernels as well: + +.. code-block:: cpp + + __device__ __hip_fp6_storage_t d_convert_float_to_fp6( + float in, + __hip_fp6_interpretation_t interpret, + __hip_saturation_t sat) { + return __hip_cvt_float_to_fp6(in, interpret, sat); + } + +The following code example demonstrates a roundtrip conversion using FP6 types: + +.. code-block:: cpp + + #include + #include + #include + #include + + #define hip_check(hip_call) \ + { \ + auto hip_res = hip_call; \ + if (hip_res != hipSuccess) { \ + std::cerr << "Failed in HIP call: " << #hip_call \ + << " at " << __FILE__ << ":" << __LINE__ \ + << " with error: " << hipGetErrorString(hip_res) << std::endl; \ + std::abort(); \ + } \ + } + + __global__ void float_to_fp6_to_float(float *in, + __hip_fp6_interpretation_t interpret, + __hip_saturation_t sat, float *out, + size_t size) { + int i = threadIdx.x; + if (i < size) { + auto fp6 = __hip_cvt_float_to_fp6(in[i], interpret, sat); + out[i] = __hip_cvt_fp6_to_halfraw(fp6, interpret); + } + } + + int main() { + constexpr size_t size = 16; + hipDeviceProp_t prop; + hip_check(hipGetDeviceProperties(&prop, 0)); + bool is_supported = (std::string(prop.gcnArchName).find("gfx94") != std::string::npos); + if(!is_supported) { + std::cerr << "Need a gfx94x, but found: " << prop.gcnArchName << std::endl; + std::cerr << "Device conversions are not supported on this hardware." 
<< std::endl; + return -1; + } + + // Test both formats + const __hip_saturation_t sat = __HIP_SATFINITE; + + // Create test vectors + std::vector in(size); + for (size_t i = 0; i < size; i++) { + in[i] = i * 0.5f; + } + + std::vector out_e2m3(size); + std::vector out_e3m2(size); + + // Allocate device memory + float *d_in, *d_out; + hip_check(hipMalloc(&d_in, sizeof(float) * size)); + hip_check(hipMalloc(&d_out, sizeof(float) * size)); + hip_check(hipMemcpy(d_in, in.data(), sizeof(float) * size, hipMemcpyHostToDevice)); + + // Test E2M3 format + float_to_fp6_to_float<<<1, size>>>(d_in, __HIP_E2M3, sat, d_out, size); + hip_check(hipMemcpy(out_e2m3.data(), d_out, sizeof(float) * size, hipMemcpyDeviceToHost)); + + // Test E3M2 format + float_to_fp6_to_float<<<1, size>>>(d_in, __HIP_E3M2, sat, d_out, size); + hip_check(hipMemcpy(out_e3m2.data(), d_out, sizeof(float) * size, hipMemcpyDeviceToHost)); + + // Display results + std::cout << "FP6 Roundtrip Results:" << std::endl; + for (size_t i = 0; i < size; i++) { + std::cout << "Original: " << in[i] + << " -> E2M3: " << out_e2m3[i] + << " -> E3M2: " << out_e3m2[i] << std::endl; + } + + // Clean up + hip_check(hipFree(d_in)); + hip_check(hipFree(d_out)); + + return 0; + } + +There are C++ style classes available as well: + +.. code-block:: cpp + + __hip_fp6_e2m3 fp6_val_e2m3(1.1f); + __hip_fp6_e3m2 fp6_val_e3m2(1.1f); + +Each type of FP6 number has its own class: + +- __hip_fp6_e2m3 +- __hip_fp6_e3m2 + +There is support of vector of FP6 types: + +- __hip_fp6x2_e2m3: holds 2 values of FP6 e2m3 numbers +- __hip_fp6x4_e2m3: holds 4 values of FP6 e2m3 numbers +- __hip_fp6x2_e3m2: holds 2 values of FP6 e3m2 numbers +- __hip_fp6x4_e3m2: holds 4 values of FP6 e3m2 numbers + FP8 (Quarter Precision) ======================= @@ -65,7 +441,7 @@ numbers compared to standard FP8 formats. HIP Header ---------- -The `HIP FP8 header `_ +The `HIP FP8 header `_ defines the FP8 ocp/fnuz numbers. Supported Devices @@ -317,10 +693,10 @@ supported with its two main formats, float16 and bfloat16. HIP Header ---------- -The `HIP FP16 header `_ +The `HIP FP16 header `_ defines the float16 format. -The `HIP BF16 header `_ +The `HIP BF16 header `_ defines the bfloat16 format. Supported Devices From 0b07b23f5a249802c54d92803fea1c874513e0e5 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Tue, 27 May 2025 12:53:59 -0700 Subject: [PATCH 09/36] add builtin-expect text --- docs/how-to/performance_guidelines.rst | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/how-to/performance_guidelines.rst b/docs/how-to/performance_guidelines.rst index 33dbbb4af4..efe532e6b6 100644 --- a/docs/how-to/performance_guidelines.rst +++ b/docs/how-to/performance_guidelines.rst @@ -262,7 +262,9 @@ For example, when the control condition depends on ``threadIdx`` or ``warpSize`` warp doesn't diverge. The compiler might optimize loops, short ifs, or switch blocks using branch predication, which prevents warp divergence. With branch predication, instructions associated with a false predicate are scheduled but -not executed, which avoids unnecessary operations. +not executed, which avoids unnecessary operations. For control conditions where +one outcome is significantly more likely than the other, use `__builtin_expect `_ +or ``[[likely]]`` to indicate the likely condition result. 
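As an illustration, the sketch below marks the in-range branch of a simple
elementwise kernel as the expected outcome. The kernel and its parameters are
hypothetical and only show where the hint is placed:

.. code-block:: cpp

   __global__ void scale(const float* in, float* out, size_t n, float factor)
   {
       const size_t i = blockIdx.x * blockDim.x + threadIdx.x;

       // Only threads in the last, partially filled block fail the bounds
       // check, so the compiler is told to optimize for the in-range case.
       if (__builtin_expect(i < n, 1)) {
           out[i] = in[i] * factor;
       }

       // Equivalent hint with the C++20 attribute:
       // if (i < n) [[likely]] { out[i] = in[i] * factor; }
   }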
Avoiding divergent warps ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ From bfe3aed632c9e1ee559d2041baecaf5f62757c96 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Thu, 5 Jun 2025 13:43:53 -0700 Subject: [PATCH 10/36] Update COMGR default enabled, and add note regarding use of COMGR --- docs/how-to/hip_rtc.rst | 54 +++++++++-------------------------------- 1 file changed, 12 insertions(+), 42 deletions(-) diff --git a/docs/how-to/hip_rtc.rst b/docs/how-to/hip_rtc.rst index 223e11081c..861e07fec8 100644 --- a/docs/how-to/hip_rtc.rst +++ b/docs/how-to/hip_rtc.rst @@ -17,9 +17,14 @@ alongside options to guide the compilation. * This library can be used for compilation on systems without AMD GPU drivers installed (offline compilation). However, running the compiled code still requires both the HIP runtime library and GPU drivers on the target system. - * This library depends on Code Object Manager (comgr). You can try to - statically link comgr into HIPRTC to avoid ambiguity. * Developers can bundle this library with their application. + * HIPRTC leverages AMD's Code Object Manager API (``Comgr``) internally, which + is designed to simplify linking, compiling, and inspecting code objects. For + more information, see the `llvm-project/amd/comgr/README `_. + * Comgr may cache HIPRTC compilations. To force full recompilation for each HIPRTC API invocation, set AMD_COMGR_CACHE=0. + + - When viewing the *README* in the Comgr GitHub repository you should look at a + specific branch of interest, such as ``docs/6.3.0`` or ``docs/6.4.1``, rather than the default branch. Compilation APIs =============================================================================== @@ -250,45 +255,6 @@ The full example is below: HIP_CHECK(hipFree(doutput)); } - -Kernel Compilation Cache -=============================================================================== - -HIPRTC incorporates a cache to avoid recompiling kernels between program -executions. The contents of the cache include the kernel source code (including -the contents of any ``#include`` headers), the compilation flags, and the -compiler version. After a ROCm version update, the kernels are progressively -recompiled, and the new results are cached. When the cache is disabled, each -kernel is recompiled every time it is requested. - -Use the following environment variables to manage the cache status as enabled or -disabled, the location for storing the cache contents, and the cache eviction -policy: - -* ``AMD_COMGR_CACHE`` By default this variable is unset and the - compilation cache feature is enabled. To disable the feature set the - environment variable to a value of ``0``. - -* ``AMD_COMGR_CACHE_DIR``: By default the value of this environment variable is - defined as ``$XDG_CACHE_HOME/comgr``, which defaults to - ``$USER/.cache/comgr`` on Linux, and ``%LOCALAPPDATA%\cache\comgr`` - on Windows. You can specify a different directory for the environment variable - to change the path for cache storage. If the runtime fails to access the - specified cache directory the cache is disabled. If the environment variable - is set to an empty string (``""``), the default directory is used. - -* ``AMD_COMGR_CACHE_POLICY``: If assigned a value, the string is interpreted and - applied to the cache pruning policy. The string format is consistent with - `Clang's ThinLTO cache pruning policy `_. - The default policy is defined as: - ``prune_interval=1h:prune_expiration=0h:cache_size=75%:cache_size_bytes=30g:cache_size_files=0``. 
- If the runtime fails to parse the defined string, or the environment variable - is set to an empty string (""), the cache is disabled. - -.. note:: - - This cache is also shared with the OpenCL runtime shipped with ROCm. - HIPRTC specific options =============================================================================== @@ -484,7 +450,7 @@ application requires the ingestion of bitcode/IR not derived from the currently installed AMD compiler, it must run with HIPRTC and comgr dynamic libraries that are compatible with the version of the bitcode/IR. -`Comgr `_ is a +`Comgr `_ is a shared library that incorporates the LLVM/Clang compiler that HIPRTC relies on. To identify the bitcode/IR version that comgr is compatible with, one can execute "clang -v" using the clang binary from the same ROCm or HIP SDK package. @@ -492,6 +458,10 @@ For instance, if compiling bitcode/IR version 14, the HIPRTC and comgr libraries released by AMD around mid 2022 would be the best choice, assuming the LLVM/Clang version included in the package is also version 14. +.. note:: + When viewing the *README* in the Comgr GitHub repository you should look at a + specific branch of interest, such as ``docs/6.3.0`` or ``docs/6.4.1``, rather than the default branch. + To ensure smooth operation and compatibility, an application may choose to ship the specific versions of HIPRTC and comgr dynamic libraries, or it may opt to clearly specify the version requirements and dependencies. This approach From d805be1eb5c958543be9287a136104f2174517d3 Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Mon, 12 May 2025 14:41:27 +0200 Subject: [PATCH 11/36] Add best practice for warpSize handling --- docs/how-to/hip_cpp_language_extensions.rst | 136 +++++++++++- .../template_warp_size_reduction.hip | 207 ++++++++++++++++++ .../example_codes/warp_size_reduction.hip | 184 ++++++++++++++++ 3 files changed, 521 insertions(+), 6 deletions(-) create mode 100644 docs/tools/example_codes/template_warp_size_reduction.hip create mode 100644 docs/tools/example_codes/warp_size_reduction.hip diff --git a/docs/how-to/hip_cpp_language_extensions.rst b/docs/how-to/hip_cpp_language_extensions.rst index 21ab449be5..fba47b2a6a 100644 --- a/docs/how-to/hip_cpp_language_extensions.rst +++ b/docs/how-to/hip_cpp_language_extensions.rst @@ -411,9 +411,9 @@ warpSize ================================================================================ The ``warpSize`` constant contains the number of threads per warp for the given -target device. It can differ between different architectures, see the -:doc:`hardware features <../reference/hardware_features>` for more -information. +target device. On AMD hardware, this is referred to as ``wavefront size``, which +may vary depending on the architecture. For more details, see the +:doc:`hardware features <../reference/hardware_features>`. Since ``warpSize`` can differ between devices, it can not be assumed to be a compile-time constant on the host. It has to be queried using @@ -421,8 +421,8 @@ compile-time constant on the host. It has to be queried using .. code-block:: cpp - int val; - hipDeviceGetAttribute(&val, hipDeviceAttributeWarpSize, deviceId); + int warpSizeHost; + hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId); .. note:: @@ -433,6 +433,130 @@ compile-time constant on the host. It has to be queried using of 32 can run on devices with a ``warpSize`` of 64, it only utilizes half of the compute resources. 
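As a minimal sketch of how the queried value can be used, the hypothetical
``launch_fill`` helper below sizes the launch so that every wavefront is fully
populated on both 32-wide and 64-wide devices (the ``fill`` kernel is
illustrative):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void fill(float* out, size_t n, float value)
   {
       const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           out[i] = value;
       }
   }

   void launch_fill(float* d_out, size_t n, float value, int deviceId)
   {
       int warpSizeHost = 0;
       hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId);

       // A block size that is a multiple of the queried warp size avoids
       // partially populated wavefronts regardless of the device's warp size.
       const int threadsPerBlock = 4 * warpSizeHost;
       const int blocks = static_cast<int>((n + threadsPerBlock - 1) / threadsPerBlock);
       fill<<<blocks, threadsPerBlock>>>(d_out, n, value);
   }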
+The ``warpSize`` parameter will no longer be a compile-time constant in a future +release of ROCm, however it will be still early folded by the compiler, which +means it can be used for loop bounds and supports loop unrolling similarly to +compile-time warp size. + +If the compile time warp size is still required, for example to select the correct +mask type or code path at compile time, the recommended approach is to determine +the warp size of the GPU on host side and setup the kernel accordingly, as shown +in the following block reduce example. + +The ``block_reduce`` kernel has a template parameter for warp size and performs +a reduction operation in two main phases: + +- Shared memory reduction: Reduction is performed iteratively, halving the + number of active threads each step until only a warp remains + (32 or 64 threads, depending on the device). + +- Warp-level reduction: Once the shared memory reduction completes, the + remaining threads use warp-level shuffling to sum the remaining values. This + is done efficiently with the ``__shfl_down`` intrinsic, which allows threads within + the warp to exchange values without explicit synchronization. + +.. tab-set:: + + .. tab-item:: WarpSize Template Parameter + :sync: template-warpsize + + .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip + :start-after: // [Sphinx template warp size block reduction kernel start] + :end-before: // [Sphinx template warp size block reduction kernel end] + :language: cpp + + + .. tab-item:: HIP warpSize + :sync: hip-warpsize + + .. literalinclude:: ../tools/example_codes/warp_size_reduction.hip + :start-after: // [Sphinx HIP warp size block reduction kernel start] + :end-before: // [Sphinx HIP warp size block reduction kernel end] + :language: cpp + +The host code with the main function: + +- Retrieves the warp size of the GPU (``warpSizeHost``) to determine the optimal + kernel configuration. + +- Allocates device memory (``d_data`` for input, ``d_results`` for block-wise + output) and initializes the input vector to 1. + +- Generates the mask variables for every warp and copies them to the device. + + .. tab-set:: + + .. tab-item:: Compile-time WarpSize + :sync: template-warpsize + + .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip + :start-after: // [Sphinx template warp size mask generation start] + :end-before: // [Sphinx template warp size mask generation end] + :language: cpp + + + .. tab-item:: HIP warpSize + :sync: hip-warpsize + + .. literalinclude:: ../tools/example_codes/warp_size_reduction.hip + :start-after: // [Sphinx HIP warp size mask generation start] + :end-before: // [Sphinx HIP warp size mask generation end] + :language: cpp + +- Selects the appropriate kernel specialization based on the warp + size (either 32 or 64) and launches the kernel. + + .. tab-set:: + + .. tab-item:: Compile-time WarpSize + :sync: template-warpsize + + .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip + :start-after: // [Sphinx template warp size select kernel start] + :end-before: // [Sphinx template warp size select kernel end] + :language: cpp + + + .. tab-item:: HIP warpSize + :sync: hip-warpsize + + .. literalinclude:: ../tools/example_codes/warp_size_reduction.hip + :start-after: // [Sphinx HIP warp size select kernel start] + :end-before: // [Sphinx HIP warp size select kernel end] + :language: cpp + +- Synchronizes the device and copies the results back to the host. 
+ +- Checks that each block's sum is equal with the expected mask bit count, + verifying the reduction's correctness. + +- Frees the device memory to prevent memory leaks. + +.. note:: + + The ``warpSize`` runtime example code is also provided for comparison purposes + and the full example codes are located in the `tools folder `_. + + The variable ``warpSize`` can be used for loop bounds and supports + loop unrolling similarly to the template parameter ``WarpSize``. + +For users who still require a compile-time constant warp size as a macro on the +device side, it can be defined manually based on the target device architecture, +as shown in the following example. + +.. code-block:: cpp + + #if defined(__GFX8__) || defined(__GFX9__) + #define WarpSize 64 + #else + #define WarpSize 32 + #endif + +.. note:: + + ``mwavefrontsize64`` compiler option is not supported by HIP runtime, that's + why the architecture based compile time selector is an acceptable approach. + ******************************************************************************** Vector types ******************************************************************************** @@ -855,7 +979,7 @@ The different shuffle functions behave as following: of range, the thread returns its own ``var``. ``__shfl_down`` - The thread reads ``var`` from lane ``laneIdx - delta``, thereby "shuffling" + The thread reads ``var`` from lane ``laneIdx + delta``, thereby "shuffling" the values of the lanes of the warp "down". If the resulting source lane is out of range, the thread returns its own ``var``. diff --git a/docs/tools/example_codes/template_warp_size_reduction.hip b/docs/tools/example_codes/template_warp_size_reduction.hip new file mode 100644 index 0000000000..2d265080d9 --- /dev/null +++ b/docs/tools/example_codes/template_warp_size_reduction.hip @@ -0,0 +1,207 @@ +// MIT License +// +// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. 
+ +#include +#include +#include +#include +#include + +#define HIP_CHECK(expression) \ +{ \ + const hipError_t status = expression; \ + if(status != hipSuccess){ \ + std::cerr << "HIP error " \ + << status << ": " \ + << hipGetErrorString(status) \ + << " at " << __FILE__ << ":" \ + << __LINE__ << std::endl; \ + } \ +} + +// [Sphinx template warp size block reduction kernel start] +template +using lane_mask_t = typename std::conditional::type; + +template +__global__ void block_reduce(int* input, lane_mask_t* mask, int* output, size_t size) { + extern __shared__ int shared[]; + + // Read of input with bounds check + auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id) + { + lane_mask_t warp_mask = lane_mask_t(1) << lane_id; + return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0; + }; + + const uint32_t tid = threadIdx.x, + lid = threadIdx.x % WarpSize, + wid = threadIdx.x / WarpSize, + bid = blockIdx.x, + gid = bid * blockDim.x + tid; + + // Read input buffer to shared + shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / WarpSize) + wid); + __syncthreads(); + + // Shared reduction + for (uint32_t i = blockDim.x / 2; i >= WarpSize; i /= 2) + { + if (tid < i) + shared[tid] = shared[tid] + shared[tid + i]; + __syncthreads(); + } + + // Use local variable in warp reduction + int result = shared[tid]; + __syncthreads(); + + // This loop would be unrolled the same with the runtime warpSize. + #pragma unroll + for (uint32_t i = WarpSize/2; i >= 1; i /= 2) { + result = result + __shfl_down(result, i); + } + + // Write result to output buffer + if (tid == 0) + output[bid] = result; +}; +// [Sphinx template warp size block reduction kernel end] + +// [Sphinx template warp size mask generation start] +template +void generate_and_copy_mask( + void *d_mask, + std::vector& vectorExpected, + int numOfBlocks, + int numberOfWarp, + int mask_size, + int mask_element_size) { + + std::random_device rd; + std::mt19937_64 eng(rd()); + + // Host side mask vector + std::vector> mask(mask_size); + // Define uniform unsigned int distribution + std::uniform_int_distribution> distr; + // Fill up the mask + for(int i=0; i < numOfBlocks; i++) { + int count = 0; + for(int j=0; j < numberOfWarp; j++) { + int mask_index = i * numberOfWarp + j; + mask[mask_index] = distr(eng); + if constexpr(WarpSize == 32) + count += __builtin_popcount(mask[mask_index]); + else + count += __builtin_popcountll(mask[mask_index]); + } + vectorExpected[i]= count; + } + + // Copy the mask array + HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice)); +} +// [Sphinx template warp size mask generation end] + +int main() { + + int deviceId = 0; + int warpSizeHost; + HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId)); + std::cout << "Warp size: " << warpSizeHost << std::endl; + + constexpr int numOfBlocks = 16; + constexpr int threadsPerBlock = 1024; + const int numberOfWarp = threadsPerBlock / warpSizeHost; + const int mask_element_size = warpSizeHost == 32 ? 
sizeof(uint32_t) : sizeof(uint64_t); + const int mask_size = numOfBlocks * numberOfWarp; + constexpr size_t arraySize = numOfBlocks * threadsPerBlock; + + int *d_data, *d_results; + void *d_mask; + int initValue = 1; + std::vector vectorInput(arraySize, initValue); + std::vector vectorOutput(numOfBlocks); + std::vector vectorExpected(numOfBlocks); + // Allocate device memory + HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data))); + HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size)); + HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results))); + // Host to Device copy of the input array + HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice)); + + // [Sphinx template warp size select kernel start] + // Fill up the mask variable, copy to device and select the right kernel. + if(warpSizeHost == 32) { + // Generate and copy mask arrays + generate_and_copy_mask<32>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size); + + // Start the kernel + block_reduce<32><<>>( + d_data, + static_cast(d_mask), + d_results, + arraySize); + } else if(warpSizeHost == 64) { + // Generate and copy mask arrays + generate_and_copy_mask<64>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size); + + // Start the kernel + block_reduce<64><<>>( + d_data, + static_cast(d_mask), + d_results, + arraySize); + } else { + std::cerr << "Unsupported warp size." << std::endl; + return 0; + } + // [Sphinx template warp size select kernel end] + + // Check the kernel launch + HIP_CHECK(hipGetLastError()); + // Check for kernel execution error + HIP_CHECK(hipDeviceSynchronize()); + // Device to Host copy of the result + HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost)); + + // Verify results + bool passed = true; + for(size_t i = 0; i < numOfBlocks; ++i) { + if(vectorOutput[i] != vectorExpected[i]) { + passed = false; + std::cerr << "Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl; + } + } + if(passed){ + std::cout << "Execution completed successfully." << std::endl; + }else{ + std::cerr << "Execution failed." << std::endl; + } + + // Cleanup + HIP_CHECK(hipFree(d_data)); + HIP_CHECK(hipFree(d_mask)); + HIP_CHECK(hipFree(d_results)); + return 0; +} \ No newline at end of file diff --git a/docs/tools/example_codes/warp_size_reduction.hip b/docs/tools/example_codes/warp_size_reduction.hip new file mode 100644 index 0000000000..0be830ff0e --- /dev/null +++ b/docs/tools/example_codes/warp_size_reduction.hip @@ -0,0 +1,184 @@ +// MIT License +// +// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. 
+// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE. + +#include +#include +#include +#include +#include + +#define HIP_CHECK(expression) \ +{ \ + const hipError_t status = expression; \ + if(status != hipSuccess){ \ + std::cerr << "HIP error " \ + << status << ": " \ + << hipGetErrorString(status) \ + << " at " << __FILE__ << ":" \ + << __LINE__ << std::endl; \ + } \ +} + +// [Sphinx HIP warp size block reduction kernel start] +__global__ void block_reduce(int* input, uint64_t* mask, int* output, size_t size){ + extern __shared__ int shared[]; + // Read of input with bounds check + auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id) + { + uint64_t warp_mask = 1ull << lane_id; + return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0; + }; + const uint32_t tid = threadIdx.x, + lid = threadIdx.x % warpSize, + wid = threadIdx.x / warpSize, + bid = blockIdx.x, + gid = bid * blockDim.x + tid; + // Read input buffer to shared + shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / warpSize) + wid); + __syncthreads(); + // Shared reduction + for (uint32_t i = blockDim.x / 2; i >= warpSize; i /= 2) + { + if (tid < i) + shared[tid] = shared[tid] + shared[tid + i]; + __syncthreads(); + } + + // Use local variable in warp reduction + int result = shared[tid]; + __syncthreads(); + + // This loop would be unrolled the same with the compile-time WarpSize. 
+ #pragma unroll + for (uint32_t i = warpSize/2; i >= 1; i /= 2) { + result = result + __shfl_down(result, i); + } + + // Write result to output buffer + if (tid == 0) + output[bid] = result; +}; +// [Sphinx HIP warp size block reduction kernel end] + +// [Sphinx HIP warp size mask generation start] +void generate_and_copy_mask( + uint64_t *d_mask, + std::vector& vectorExpected, + int warpSizeHost, + int numOfBlocks, + int numberOfWarp, + int mask_size, + int mask_element_size) { + + std::random_device rd; + std::mt19937_64 eng(rd()); + + // Host side mask vector + std::vector mask(mask_size); + // Define uniform unsigned int distribution + std::uniform_int_distribution distr; + // Fill up the mask + for(int i=0; i < numOfBlocks; i++) { + int count = 0; + for(int j=0; j < numberOfWarp; j++) { + int mask_index = i * numberOfWarp + j; + mask[mask_index] = distr(eng); + if(warpSizeHost == 32) + count += __builtin_popcount(mask[mask_index]); + else + count += __builtin_popcountll(mask[mask_index]); + } + vectorExpected[i]= count; + } + // Copy the mask array + HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice)); +} +// [Sphinx HIP warp size mask generation end] + +int main() { + int deviceId = 0; + int warpSizeHost; + HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId)); + std::cout << "Warp size: " << warpSizeHost << std::endl; + constexpr int numOfBlocks = 16; + constexpr int threadsPerBlock = 1024; + const int numberOfWarp = threadsPerBlock / warpSizeHost; + const int mask_element_size = sizeof(uint64_t); + const int mask_size = numOfBlocks * numberOfWarp; + constexpr size_t arraySize = numOfBlocks * threadsPerBlock; + int *d_data, *d_results; + uint64_t *d_mask; + int initValue = 1; + std::vector vectorInput(arraySize, initValue); + std::vector vectorOutput(numOfBlocks); + std::vector vectorExpected(numOfBlocks); + // Allocate device memory + HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data))); + HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size)); + HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results))); + // Host to Device copy of the input array + HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice)); + + // [Sphinx HIP warp size select kernel start] + // Generate and copy mask arrays + generate_and_copy_mask( + d_mask, + vectorExpected, + warpSizeHost, + numOfBlocks, + numberOfWarp, + mask_size, + mask_element_size); + + // Start the kernel + block_reduce<<>>( + d_data, + d_mask, + d_results, + arraySize); + // [Sphinx HIP warp size select kernel end] + + // Check the kernel launch + HIP_CHECK(hipGetLastError()); + // Check for kernel execution error + HIP_CHECK(hipDeviceSynchronize()); + // Device to Host copy of the result + HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost)); + // Verify results + bool passed = true; + for(size_t i = 0; i < numOfBlocks; ++i) { + if(vectorOutput[i] != vectorExpected[i]) { + passed = false; + std::cerr << "Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl; + } + } + if(passed){ + std::cout << "Execution completed successfully." << std::endl; + }else{ + std::cerr << "Execution failed." 
<< std::endl; + } + // Cleanup + HIP_CHECK(hipFree(d_data)); + HIP_CHECK(hipFree(d_mask)); + HIP_CHECK(hipFree(d_results)); + return 0; +} \ No newline at end of file From 357c0116369c079ebb5ca6063cacfbc15d90d2d5 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Wed, 9 Jul 2025 12:43:28 -0700 Subject: [PATCH 12/36] Update HIP API Reference link and other issues --- docs/doxygen/Doxyfile | 3 +- docs/faq.rst | 2 +- docs/how-to/hip_runtime_api/call_stack.rst | 2 +- docs/reference/hip_runtime_api_reference.rst | 49 +++++++++++++++- docs/sphinx/_toc.yml.in | 59 ++++++++++---------- docs/understand/programming_model.rst | 2 +- 6 files changed, 79 insertions(+), 38 deletions(-) diff --git a/docs/doxygen/Doxyfile b/docs/doxygen/Doxyfile index fb4eaae2de..0d6c94b4e4 100644 --- a/docs/doxygen/Doxyfile +++ b/docs/doxygen/Doxyfile @@ -834,7 +834,8 @@ INPUT = ../../include/hip \ ../../../clr/hipamd/include/hip/amd_detail/amd_hip_gl_interop.h \ ../../../clr/hipamd/include/hip/amd_detail/amd_surface_functions.h \ ../../../clr/hipamd/include/hip/amd_detail/amd_hip_cooperative_groups.h \ - ../../../ROCR-Runtime/src/inc/hsa_ext_amd.h + ../../../ROCR-Runtime/runtime/hsa-runtime/inc/hsa_ext_amd.h + # ../../../ROCR-Runtime/src/inc/hsa_ext_amd.h # This tag can be used to specify the character encoding of the source files # that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses diff --git a/docs/faq.rst b/docs/faq.rst index 15308e437c..f0d836761f 100644 --- a/docs/faq.rst +++ b/docs/faq.rst @@ -43,7 +43,7 @@ What NVIDIA CUDA features does HIP support? The :doc:`NVIDIA CUDA runtime API supported by HIP` and :doc:`NVIDIA CUDA driver API supported by HIP` pages describe which NVIDIA CUDA APIs are supported and what the equivalents are. -The :doc:`HIP API documentation ` describes each API and +The :doc:`HIP API documentation ` describes each API and its limitations, if any, compared with the equivalent CUDA API. The kernel language features are documented in the diff --git a/docs/how-to/hip_runtime_api/call_stack.rst b/docs/how-to/hip_runtime_api/call_stack.rst index 43354cd0cf..a9d03bb493 100644 --- a/docs/how-to/hip_runtime_api/call_stack.rst +++ b/docs/how-to/hip_runtime_api/call_stack.rst @@ -21,7 +21,7 @@ and AMD GPUs use different approaches. NVIDIA GPUs have the independent thread scheduling feature where each thread has its own call stack and effective program counter. On AMD GPUs threads are grouped; each warp has its own call stack and program counter. Warps are described and explained in the -:ref:`inherent_thread_hierarchy` +:ref:`inherent_thread_model` If a thread or warp exceeds its stack size, a stack overflow occurs, causing kernel failure. This can be detected using debuggers. diff --git a/docs/reference/hip_runtime_api_reference.rst b/docs/reference/hip_runtime_api_reference.rst index e77490f79e..4af699c309 100644 --- a/docs/reference/hip_runtime_api_reference.rst +++ b/docs/reference/hip_runtime_api_reference.rst @@ -8,7 +8,50 @@ HIP runtime API ******************************************************************************** -The HIP Runtime API reference: +The HIP Runtime API reference includes descriptions of HIP functions, as well as global datatypes, enums, and structs. -* :ref:`modules_reference` -* :ref:`global_defines_enums_structs_files_reference` +Modules +======= + +The API is organized into modules based on functionality. 
+ +* :ref:`initialization_version_reference` +* :ref:`device_management_reference` +* :ref:`execution_control_reference` +* :ref:`error_handling_reference` +* :ref:`stream_management_reference` +* :ref:`stream_memory_operations_reference` +* :ref:`event_management_reference` +* :ref:`memory_management_reference` + + * :ref:`memory_management_deprecated_reference` + * :ref:`external_resource_interoperability_reference` + * :ref:`stream_ordered_memory_allocator_reference` + * :ref:`unified_memory_reference` + * :ref:`virtual_memory_reference` + * :ref:`texture_management_reference` + * :ref:`texture_management_deprecated_reference` + * :ref:`surface_object_reference` + +* :ref:`peer_to_peer_device_memory_access_reference` +* :ref:`context_management_reference` +* :ref:`module_management_reference` +* :ref:`occupancy_reference` +* :ref:`profiler_control_reference` +* :ref:`launch_api_reference` +* :ref:`runtime_compilation_reference` +* :ref:`callback_activity_apis_reference` +* :ref:`graph_management_reference` +* :ref:`opengl_interoperability_reference` +* :ref:`graphics_interoperability_reference` +* :ref:`cooperative_groups_reference` + +Global defines, enums, structs and files +======================================== + +The structs, define macros, enums and files in the HIP runtime API. + +* :ref:`global_enum_defines_reference` +* :ref:`driver_types_reference` +* :doc:`../../doxygen/html/annotated` +* :doc:`../../doxygen/html/files` \ No newline at end of file diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 35ed57f0b6..72f87fb9a8 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -66,39 +66,36 @@ subtrees: - file: reference/hip_runtime_api_reference subtrees: - entries: - - file: reference/hip_runtime_api/modules + - file: reference/hip_runtime_api/modules/initialization_and_version + - file: reference/hip_runtime_api/modules/device_management + - file: reference/hip_runtime_api/modules/execution_control + - file: reference/hip_runtime_api/modules/error_handling + - file: reference/hip_runtime_api/modules/stream_management + - file: reference/hip_runtime_api/modules/stream_memory_operations + - file: reference/hip_runtime_api/modules/event_management + - file: reference/hip_runtime_api/modules/memory_management subtrees: - entries: - - file: reference/hip_runtime_api/modules/initialization_and_version - - file: reference/hip_runtime_api/modules/device_management - - file: reference/hip_runtime_api/modules/execution_control - - file: reference/hip_runtime_api/modules/error_handling - - file: reference/hip_runtime_api/modules/stream_management - - file: reference/hip_runtime_api/modules/stream_memory_operations - - file: reference/hip_runtime_api/modules/event_management - - file: reference/hip_runtime_api/modules/memory_management - subtrees: - - entries: - - file: reference/hip_runtime_api/modules/memory_management/memory_management_deprecated - - file: reference/hip_runtime_api/modules/memory_management/external_resource_interoperability - - file: reference/hip_runtime_api/modules/memory_management/stream_ordered_memory_allocator - - file: reference/hip_runtime_api/modules/memory_management/unified_memory_reference - - file: reference/hip_runtime_api/modules/memory_management/virtual_memory_reference - - file: reference/hip_runtime_api/modules/memory_management/texture_management - - file: reference/hip_runtime_api/modules/memory_management/texture_management_deprecated - - file: reference/hip_runtime_api/modules/memory_management/surface_object 
- - file: reference/hip_runtime_api/modules/peer_to_peer_device_memory_access - - file: reference/hip_runtime_api/modules/context_management - - file: reference/hip_runtime_api/modules/module_management - - file: reference/hip_runtime_api/modules/occupancy - - file: reference/hip_runtime_api/modules/profiler_control - - file: reference/hip_runtime_api/modules/launch_api - - file: reference/hip_runtime_api/modules/runtime_compilation - - file: reference/hip_runtime_api/modules/callback_activity_apis - - file: reference/hip_runtime_api/modules/graph_management - - file: reference/hip_runtime_api/modules/graphics_interoperability - - file: reference/hip_runtime_api/modules/opengl_interoperability - - file: reference/hip_runtime_api/modules/cooperative_groups_reference + - file: reference/hip_runtime_api/modules/memory_management/memory_management_deprecated + - file: reference/hip_runtime_api/modules/memory_management/external_resource_interoperability + - file: reference/hip_runtime_api/modules/memory_management/stream_ordered_memory_allocator + - file: reference/hip_runtime_api/modules/memory_management/unified_memory_reference + - file: reference/hip_runtime_api/modules/memory_management/virtual_memory_reference + - file: reference/hip_runtime_api/modules/memory_management/texture_management + - file: reference/hip_runtime_api/modules/memory_management/texture_management_deprecated + - file: reference/hip_runtime_api/modules/memory_management/surface_object + - file: reference/hip_runtime_api/modules/peer_to_peer_device_memory_access + - file: reference/hip_runtime_api/modules/context_management + - file: reference/hip_runtime_api/modules/module_management + - file: reference/hip_runtime_api/modules/occupancy + - file: reference/hip_runtime_api/modules/profiler_control + - file: reference/hip_runtime_api/modules/launch_api + - file: reference/hip_runtime_api/modules/runtime_compilation + - file: reference/hip_runtime_api/modules/callback_activity_apis + - file: reference/hip_runtime_api/modules/graph_management + - file: reference/hip_runtime_api/modules/graphics_interoperability + - file: reference/hip_runtime_api/modules/opengl_interoperability + - file: reference/hip_runtime_api/modules/cooperative_groups_reference - file: reference/hip_runtime_api/global_defines_enums_structs_files subtrees: - entries: diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 3cac7e374a..2fc13df65b 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -243,7 +243,7 @@ multiple threads via the thread ID constants ``threadIdx.x``, ``blockIdx.x``, et .. _inherent_thread_model: Hierarchical thread model ---------------------- +------------------------- As previously discussed, all threads of a kernel are uniquely identified by a set of integral values called thread IDs. 
The hierarchy consists of three levels: thread, From 7898f473aab139581d2bc8f1f6a5921ea044fe22 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Thu, 10 Jul 2025 12:00:29 -0700 Subject: [PATCH 13/36] Update compilers.rst to refer to ROCm Compiler reference --- docs/understand/compilers.rst | 129 +++++----------------------------- 1 file changed, 19 insertions(+), 110 deletions(-) diff --git a/docs/understand/compilers.rst b/docs/understand/compilers.rst index ccd2dbbec6..f35bb24627 100644 --- a/docs/understand/compilers.rst +++ b/docs/understand/compilers.rst @@ -8,95 +8,12 @@ HIP compilers ******************************************************************************** -ROCm provides the compiler driver ``hipcc``, that can be used on AMD ROCm and -NVIDIA CUDA platforms. +ROCm provides the compiler tools used to compile HIP applications for use on AMD GPUs. +The compilers setup the default libraries and include paths for the HIP and ROCm +libraries, and some needed environment variables. For more information, see the +:doc:`ROCm compiler reference `. -On ROCm, ``hipcc`` takes care of the following: - -- Setting the default library and include paths for HIP -- Setting some environment variables -- Invoking the appropriate compiler - ``amdclang++`` - -On NVIDIA CUDA platform, ``hipcc`` takes care of invoking compiler ``nvcc``. -``amdclang++`` is based on the ``clang++`` compiler. For more -details, see the :doc:`llvm project`. - -HIPCC -================================================================================ - -Common Compiler Options --------------------------------------------------------------------------------- - -The following table shows the most common compiler options supported by -``hipcc``. - -.. list-table:: - :header-rows: 1 - - * - - Option - - Description - * - - ``--fgpu-rdc`` - - Generate relocatable device code, which allows kernels or device functions - to call device functions in different translation units. - * - - ``-ggdb`` - - Equivalent to `-g` plus tuning for GDB. This is recommended when using - ROCm's GDB to debug GPU code. - * - - ``--gpu-max-threads-per-block=`` - - Generate code to support up to the specified number of threads per block. - * - - ``-offload-arch=`` - - Generate code for the given GPU target. - For a full list of supported compilation targets see the `processor names in AMDGPU's llvm documentation `_. - This option can appear multiple times to generate a fat binary for multiple - targets. - The actual support of the platform's runtime may differ. - * - - ``-save-temps`` - - Save the compiler generated intermediate files. - * - - ``-v`` - - Show the compilation steps. - -Linking --------------------------------------------------------------------------------- - -``hipcc`` adds the necessary libraries for HIP as well as for the accelerator -compiler (``nvcc`` or ``amdclang++``). We recommend linking with ``hipcc`` since -it automatically links the binary to the necessary HIP runtime libraries. - -Linking Code With Other Compilers -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -``nvcc`` by default uses ``g++`` to generate the host code. - -``amdclang++`` generates both device and host code. The code uses the same API -as ``gcc``, which allows code generated by different ``gcc``-compatible -compilers to be linked together. For example, code compiled using ``amdclang++`` -can link with code compiled using compilers such as ``gcc``, ``icc`` and -``clang``. 
Take care to ensure all compilers use the same standard C++ header -and library formats. - -libc++ and libstdc++ -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -``hipcc`` links to ``libstdc++`` by default. This provides better compatibility -between ``g++`` and HIP. - -In order to link to ``libc++``, pass ``--stdlib=libc++`` to ``hipcc``. -Generally, libc++ provides a broader set of C++ features while ``libstdc++`` is -the standard for more compilers, notably including ``g++``. - -When cross-linking C++ code, any C++ functions that use types from the C++ -standard library, such as ``std::string``, ``std::vector`` and other containers, -must use the same standard-library implementation. This includes cross-linking -between ``amdclang++`` and other compilers. - - -HIP compilation workflow +Compilation workflow ================================================================================ HIP provides a flexible compilation workflow that supports both offline @@ -115,25 +32,18 @@ performance overhead. Offline compilation -------------------------------------------------------------------------------- -The HIP code compilation is performed in two stages: host and device code -compilation stage. +Offline compilation is performed in two steps: host and device code +compilation. + +- Host-code compilation: On the host side, ``amdclang++`` or ``hipcc`` can + compile the host code in one step without other C++ compilers. -- Device-code compilation stage: The compiled device code is embedded into the +- Device-code compilation: The compiled device code is embedded into the host object file. Depending on the platform, the device code can be compiled - into assembly or binary. ``nvcc`` and ``amdclang++`` target different - architectures and use different code object formats. ``nvcc`` uses the binary - ``cubin`` or the assembly PTX files, while the ``amdclang++`` path is the - binary ``hsaco`` format. On CUDA platforms, the driver compiles the PTX files - to executable code during runtime. - -- Host-code compilation stage: On the host side, ``hipcc`` or ``amdclang++`` can - compile the host code in one step without other C++ compilers. On the other - hand, ``nvcc`` only replaces the ``<<<...>>>`` kernel launch syntax with the - appropriate CUDA runtime function call and the modified host code is passed to - the default host compiler. + into assembly or binary. For an example on how to compile HIP from the command line, see :ref:`SAXPY -tutorial` . +tutorial ` . Runtime compilation -------------------------------------------------------------------------------- @@ -142,27 +52,26 @@ HIP allows you to compile kernels at runtime using the ``hiprtc*`` API. Kernels are stored as a text string, which is passed to HIPRTC alongside options to guide the compilation. -For more details, see -:doc:`HIP runtime compiler <../how-to/hip_rtc>`. +For more information, see :doc:`HIP runtime compiler <../how-to/hip_rtc>`. Static libraries ================================================================================ -``hipcc`` supports generating two types of static libraries. +Both ``amdclang++`` and ``hipcc`` support generating two types of static libraries. - The first type of static library only exports and launches host functions within the same library and not the device functions. This library type offers - the ability to link with a non-hipcc compiler such as ``gcc``. Additionally, + the ability to link with another compiler such as ``gcc``. 
Additionally, this library type contains host objects with device code embedded as fat binaries. This library type is generated using the flag ``--emit-static-lib``: .. code-block:: shell - hipcc hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a + amdclang++ hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a gcc test.cpp -L. -lhipOptLibrary -L/path/to/hip/lib -lamdhip64 -o test.out - The second type of static library exports device functions to be linked by - other code objects by using ``hipcc`` as the linker. This library type + other code objects by using ``amdclang++`` or ``hipcc`` as the linker. This library type contains relocatable device objects and is generated using ``ar``: .. code-block:: shell @@ -171,6 +80,6 @@ Static libraries ar rcsD libHipDevice.a hipDevice.o hipcc libHipDevice.a test.cpp -fgpu-rdc -o test.out -A full example for this can be found in the ROCm-examples, see the examples for +Examples of this can be found in `rocm-examples `_ under `static host libraries `_ or `static device libraries `_. From aa8ad478cbb5a99642c40138c3b13e3b29976dd0 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 11 Jul 2025 13:52:20 -0700 Subject: [PATCH 14/36] Update docs/understand/compilers.rst Co-authored-by: Jeffrey Novotny --- docs/understand/compilers.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/understand/compilers.rst b/docs/understand/compilers.rst index f35bb24627..650ca622ef 100644 --- a/docs/understand/compilers.rst +++ b/docs/understand/compilers.rst @@ -9,8 +9,8 @@ HIP compilers ******************************************************************************** ROCm provides the compiler tools used to compile HIP applications for use on AMD GPUs. -The compilers setup the default libraries and include paths for the HIP and ROCm -libraries, and some needed environment variables. For more information, see the +The compilers set up the default libraries and include paths for the HIP and ROCm +libraries and some needed environment variables. For more information, see the :doc:`ROCm compiler reference `. Compilation workflow From 127e8118e1b736b49920bd4aef8b0e95cf9b5836 Mon Sep 17 00:00:00 2001 From: Jan Stephan Date: Fri, 4 Jul 2025 21:06:15 +0200 Subject: [PATCH 15/36] Fix Doxygen build warnings --- docs/doxygen/Doxyfile | 3 ++- include/hip/hip_ext.h | 2 +- include/hip/hip_runtime_api.h | 4 ++-- 3 files changed, 5 insertions(+), 4 deletions(-) diff --git a/docs/doxygen/Doxyfile b/docs/doxygen/Doxyfile index 0d6c94b4e4..92b394fa36 100644 --- a/docs/doxygen/Doxyfile +++ b/docs/doxygen/Doxyfile @@ -2211,7 +2211,8 @@ PREDEFINED = "__HIP_PLATFORM_AMD__" \ "HIP_PUBLIC_API" \ "HIP_ENABLE_WARP_SYNC_BUILTINS" \ "__HOST_DEVICE__" \ - "__forceinline__" + "__forceinline__" \ + "__HIP_NODISCARD=[[nodiscard]]" # If the MACRO_EXPANSION and EXPAND_ONLY_PREDEF tags are set to YES then this # tag can be used to specify a list of macro names that should be expanded. The diff --git a/include/hip/hip_ext.h b/include/hip/hip_ext.h index 67a71692bb..e4c47032db 100644 --- a/include/hip/hip_ext.h +++ b/include/hip/hip_ext.h @@ -27,7 +27,7 @@ THE SOFTWARE. 
#include #include #endif -/** @addtogroup Execution Execution Management +/** @addtogroup Execution Execution Control * @{ */ diff --git a/include/hip/hip_runtime_api.h b/include/hip/hip_runtime_api.h index 142564cbc6..8e707814cc 100644 --- a/include/hip/hip_runtime_api.h +++ b/include/hip/hip_runtime_api.h @@ -1481,7 +1481,7 @@ typedef enum hipLaunchAttributeID { */ typedef union hipLaunchAttributeValue { char pad[64]; ///< 64 byte padding - hipAccessPolicyWindow accessPolicyWindow; ///< Value of launch attribute ::hipLaunchAttributePolicyWindow. + hipAccessPolicyWindow accessPolicyWindow; ///< Value of launch attribute ::hipLaunchAttributeAccessPolicyWindow. int cooperative; ///< Value of launch attribute ::hipLaunchAttributeCooperative. Indicates whether the kernel is cooperative. int priority; ///< Value of launch attribute :: hipLaunchAttributePriority. Execution priority of kernel } hipLaunchAttributeValue; @@ -6111,7 +6111,7 @@ hipError_t hipLinkComplete(hipLinkState_t state, void** hipBinOut, size_t* sizeO /** * @brief Creates a linker instance with options. * @param [in] numOptions Number of options - * @param [in] option Array of options + * @param [in] options Array of options * @param [in] optionValues Array of option values cast to void* * @param [out] stateOut hip link state created upon success * From 7e3c339d079bbf7725e5dae8dbb32ac852d6ebc4 Mon Sep 17 00:00:00 2001 From: Brian Kocoloski Date: Tue, 15 Jul 2025 09:08:40 -0400 Subject: [PATCH 16/36] add changes --- .../memory_management/unified_memory.rst | 103 ++++++++++-------- 1 file changed, 59 insertions(+), 44 deletions(-) diff --git a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst index 55f3ceb35b..838433c313 100644 --- a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst +++ b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst @@ -41,48 +41,64 @@ Unified memory enables the access to memory located on other devices via several methods, depending on whether hardware support is available or has to be managed by the driver. -Hardware supported on-demand page migration --------------------------------------------------------------------------------- - -When a kernel on the device tries to access a memory address that is not in its -memory, a page-fault is triggered. The GPU then in turn requests the page from -the host or an other device, on which the memory is located. The page is then -unmapped from the source, sent to the device and mapped to the device's memory. -The requested memory is then available to the processes running on the device. - -In case the device's memory is at capacity, a page is unmapped from the device's -memory first and sent and mapped to host memory. This enables more memory to be -allocated and used for a GPU, than the GPU itself has physically available. - -This level of unified memory support can be very beneficial for sparse accesses -to an array, that is not often used on the device. - -Driver managed page migration --------------------------------------------------------------------------------- +Managed memory +================================================================================ -If the hardware does not support on-demand page migration, then all the pages -accessed by a kernel have to be resident on the device, so they have to be -migrated before the kernel is running. 
Since the driver can not know beforehand, -what parts of an array are going to be accessed, all pages of all accessed -arrays have to be migrated. This can lead to significant delays on the first run -of a kernel, on top of possibly copying more memory than is actually accessed by -the kernel. +Managed Memory is an extension of the unified memory architecture in which HIP +monitors memory access and intelligently migrates data between device and +system memories, thereby improving performance and resource efficiency. + +When a kernel on the device tries to access a managed memory address that is +not in its local device memory, a page-fault is triggered. The GPU then in +turn requests the page from the host or other device on which the memory is +located. The page is then unmapped from the source, sent to the device and +mapped to the device's memory. The requested memory is then available locally +to the processes running on the device, which improves performance as local +memory access outperforms remote memory access. + +Managed memory also expands the memory capacity available to a GPU kernel. When +migrating memory into the device on page-fault, if the device's memory is +already at capacity, a page is unmapped from the device's memory first and sent +and mapped to host memory. This enables more memory to be allocated and used +for a GPU than the GPU itself has physically available. This level of support +can be very beneficial, for example, for sparse accesses to an array that is +not often used on the device. .. _unified memory system requirements: -System requirements -================================================================================ +System requirements for managed memory +-------------------------------------------------------------------------------- -Unified memory is supported on Linux by all modern AMD GPUs from the Vega -series onward, as shown in the following table. Unified memory management can -be achieved by explicitly allocating managed memory using -:cpp:func:`hipMallocManaged` or marking variables with the ``__managed__`` -attribute. For the latest GPUs, with a Linux kernel that supports -`Heterogeneous Memory Management (HMM) +Some AMD GPUs do not support page-faults, and thus do not support on-demand +page-fault driven migration. On these architectures, if the programmer prefers +all GPU memory accesses to be local, all pages have to migrated before the +kernel is dispatched, as the driver cannot know beforehand which parts of a +dataset are going to be accessed. This can lead to significant delays on the +first run of a kernel, and, in the example of a sparsely accessed array, can +also lead to copying more memory than is actually accessed by the kernel. + +Note that on systems which do not support page-faults, managed memory APIs are +still accessible to the programmer, but managed memory operates in a degraded +fashion due to the lack of demand-driven migration. Furthermore, on these +systems it is still possible to use other unified memory allocators that do not +provide managed memory features. + +Managed memory is supported on Linux by all modern AMD GPUs from the Vega +series onward, as shown in the following table. Managed memory can be +explicitly allocated using :cpp:func:`hipMallocManaged()` or marking variables +with the ``__managed__`` attribute. For the latest GPUs, with a Linux kernel +that supports `Heterogeneous Memory Management (HMM) `_, the normal system -allocator can be used. +allocators can be used. -.. 
list-table:: Supported Unified Memory Allocators by GPU architecture +Note: to ensure the proper functioning of managed memory on supported GPUs, it +is __essential__ to set the environment variable ``HSA_XNACK=1`` and use a GPU +kernel mode driver that supports HMM +`_. Without this +configuration, access-driven memory migration will be disabled, and the +behavior will be similar to that of systems without HMM support. + +.. list-table:: Managed Memory Support by GPU Architecture :widths: 40, 25, 25 :header-rows: 1 :align: center @@ -98,7 +114,7 @@ allocator can be used. - ✅ :sup:`1` * - CDNA1 - ✅ - - ✅ :sup:`1` + - ❌ * - RDNA1 - ✅ - ❌ @@ -138,12 +154,11 @@ system requirements` and :ref:`checking unified memory support`. offers an easy transition for code written for CPUs to HIP code as the same system allocation API is used. -To ensure the proper functioning of system allocated unified memory on supported -GPUs, it is essential to set the environment variable ``HSA_XNACK=1`` and use -a GPU kernel mode driver that supports HMM -`_. Without this -configuration, the behavior will be similar to that of systems without HMM -support. +- **HIP allocated non-managed memory** + + :cpp:func:`hipMalloc()` and :cpp:func:`hipHostMalloc()` are dynamic memory + allocators available on all GPUs with unified memory support. Memory + allocated by these allocators is not migrated between device and host memory. The table below illustrates the expected behavior of managed and unified memory functions on ROCm and CUDA, both with and without HMM support. @@ -177,10 +192,10 @@ functions on ROCm and CUDA, both with and without HMM support. - host - page-fault migration * - :cpp:func:`hipHostRegister()` + - pinned host + - zero copy [zc]_ - undefined behavior - undefined behavior - - host - - page-fault migration * - :cpp:func:`hipHostMalloc()` - pinned host - zero copy [zc]_ From 4c4efd369a935731dd1844452d08b739b7523399 Mon Sep 17 00:00:00 2001 From: Brian Kocoloski Date: Tue, 15 Jul 2025 09:35:48 -0400 Subject: [PATCH 17/36] add README to wordlist --- .wordlist.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/.wordlist.txt b/.wordlist.txt index 6cbf374ae1..69934c1b06 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -145,6 +145,7 @@ PyHIP queryable prefetching quad +README representable RMW rocgdb From e71f584d1f3e1f210f4df0d475afaa8d98a57447 Mon Sep 17 00:00:00 2001 From: Brian Kocoloski Date: Wed, 16 Jul 2025 09:32:43 -0400 Subject: [PATCH 18/36] Update docs/how-to/hip_runtime_api/memory_management/unified_memory.rst Co-authored-by: randyh62 <42045079+randyh62@users.noreply.github.com> --- .../how-to/hip_runtime_api/memory_management/unified_memory.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst index 838433c313..47aae8156a 100644 --- a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst +++ b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst @@ -51,7 +51,7 @@ system memories, thereby improving performance and resource efficiency. When a kernel on the device tries to access a managed memory address that is not in its local device memory, a page-fault is triggered. The GPU then in turn requests the page from the host or other device on which the memory is -located. The page is then unmapped from the source, sent to the device and +located. 
The page is unmapped from the source, sent to the device and mapped to the device's memory. The requested memory is then available locally to the processes running on the device, which improves performance as local memory access outperforms remote memory access. From e913c1ca7ee6404fdaa3e35828fb7a2b8bfd2890 Mon Sep 17 00:00:00 2001 From: Brian Kocoloski Date: Wed, 16 Jul 2025 12:31:26 -0400 Subject: [PATCH 19/36] Update docs/how-to/hip_runtime_api/memory_management/unified_memory.rst Co-authored-by: randyh62 <42045079+randyh62@users.noreply.github.com> --- .../memory_management/unified_memory.rst | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst index 47aae8156a..6eefa6bf1a 100644 --- a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst +++ b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst @@ -91,12 +91,13 @@ that supports `Heterogeneous Memory Management (HMM) `_, the normal system allocators can be used. -Note: to ensure the proper functioning of managed memory on supported GPUs, it -is __essential__ to set the environment variable ``HSA_XNACK=1`` and use a GPU -kernel mode driver that supports HMM -`_. Without this -configuration, access-driven memory migration will be disabled, and the -behavior will be similar to that of systems without HMM support. +.. note:: + + To ensure the proper functioning of managed memory on supported GPUs, + it is __essential__ to set the environment variable ``HSA_XNACK=1`` and use a GPU + kernel mode driver that supports `HMM `_. + Without this configuration, access-driven memory migration will be disabled, + and the behavior will be similar to that of systems without HMM support. .. list-table:: Managed Memory Support by GPU Architecture :widths: 40, 25, 25 From 4e1cea09ae62233e3203188cca2eb35c4979dd5c Mon Sep 17 00:00:00 2001 From: Brian Kocoloski Date: Wed, 16 Jul 2025 12:56:41 -0400 Subject: [PATCH 20/36] address feedback --- docs/conf.py | 3 ++- .../memory_management/unified_memory.rst | 24 +++++++++++-------- 2 files changed, 16 insertions(+), 11 deletions(-) diff --git a/docs/conf.py b/docs/conf.py index de214e7ebf..e0ddc89518 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -57,7 +57,8 @@ "doxygen/mainpage.md", "understand/glossary.md", 'how-to/debugging_env.rst', - "data/env_variables_hip.rst" + "data/env_variables_hip.rst", + "venv" ] git_url = subprocess.check_output(['git', 'config', '--get', 'remote.origin.url']).strip().decode('ascii') diff --git a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst index 6eefa6bf1a..2f0a469e83 100644 --- a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst +++ b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst @@ -80,8 +80,9 @@ also lead to copying more memory than is actually accessed by the kernel. Note that on systems which do not support page-faults, managed memory APIs are still accessible to the programmer, but managed memory operates in a degraded fashion due to the lack of demand-driven migration. Furthermore, on these -systems it is still possible to use other unified memory allocators that do not -provide managed memory features. +systems it is still possible to use unified memory allocators that do not +provide managed memory features; see :ref:`unified memory allocators` for +more details. 
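For context, a minimal managed-memory sketch of the flow described above (illustrative only and not drawn from this patch; the kernel name, sizes, and launch configuration are assumptions) uses one allocation that both host and device dereference, with migration handled by the runtime on hardware that supports page faults:

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <iostream>

   // Each thread increments one element through the same pointer the host uses.
   __global__ void increment(int* data, size_t size) {
       size_t i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < size)
           data[i] += 1;
   }

   int main() {
       constexpr size_t size = 1 << 20;
       int* data = nullptr;

       // Managed allocation: visible to host and device; pages migrate on demand
       // on GPUs that support page faults.
       if (hipMallocManaged(&data, size * sizeof(int)) != hipSuccess)
           return 1;

       for (size_t i = 0; i < size; ++i)
           data[i] = 1;                                    // host writes

       increment<<<(size + 255) / 256, 256>>>(data, size); // device updates the same memory
       if (hipDeviceSynchronize() != hipSuccess)
           return 1;

       std::cout << "data[0] = " << data[0] << std::endl;  // host reads the result
       return hipFree(data) == hipSuccess ? 0 : 1;
   }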
Managed memory is supported on Linux by all modern AMD GPUs from the Vega series onward, as shown in the following table. Managed memory can be @@ -89,15 +90,15 @@ explicitly allocated using :cpp:func:`hipMallocManaged()` or marking variables with the ``__managed__`` attribute. For the latest GPUs, with a Linux kernel that supports `Heterogeneous Memory Management (HMM) `_, the normal system -allocators can be used. +allocators (e.g., ``new``, ``malloc()``) can be used. -.. note:: - - To ensure the proper functioning of managed memory on supported GPUs, - it is __essential__ to set the environment variable ``HSA_XNACK=1`` and use a GPU - kernel mode driver that supports `HMM `_. - Without this configuration, access-driven memory migration will be disabled, - and the behavior will be similar to that of systems without HMM support. +.. note:: + To ensure the proper functioning of managed memory on supported GPUs, it + is **essential** to set the environment variable ``HSA_XNACK=1`` and use a + GPU kernel mode driver that supports `HMM + `_. Without this + configuration, access-driven memory migration will be disabled, and the + behavior will be similar to that of systems without HMM support. .. list-table:: Managed Memory Support by GPU Architecture :widths: 40, 25, 25 @@ -107,6 +108,9 @@ allocators can be used. * - Architecture - :cpp:func:`hipMallocManaged()`, ``__managed__`` - ``new``, ``malloc()`` + * - CDNA4 + - ✅ + - ✅ :sup:`1` * - CDNA3 - ✅ - ✅ :sup:`1` From 859237b301ca84194ab3f26f5a02a7bd1e5a4e80 Mon Sep 17 00:00:00 2001 From: Brian Kocoloski Date: Wed, 16 Jul 2025 12:58:42 -0400 Subject: [PATCH 21/36] revert conf.py changes --- docs/conf.py | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/conf.py b/docs/conf.py index e0ddc89518..a9d7e3f534 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -57,10 +57,9 @@ "doxygen/mainpage.md", "understand/glossary.md", 'how-to/debugging_env.rst', - "data/env_variables_hip.rst", - "venv" + "data/env_variables_hip.rst" ] git_url = subprocess.check_output(['git', 'config', '--get', 'remote.origin.url']).strip().decode('ascii') if git_url.find("git:") != -1: - html_theme_options = {"repository_url": "https://github.com/ROCm/hip"} + html_theme_options = {"repository_url": "https://github.com/ROCm/hip"} \ No newline at end of file From baef6966504564e3162f2ea8468660478b40cb09 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Thu, 17 Jul 2025 11:12:43 -0700 Subject: [PATCH 22/36] restore TOC for modules --- docs/sphinx/_toc.yml.in | 59 ++++++++++++++++++++++------------------- 1 file changed, 31 insertions(+), 28 deletions(-) diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 72f87fb9a8..35ed57f0b6 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -66,36 +66,39 @@ subtrees: - file: reference/hip_runtime_api_reference subtrees: - entries: - - file: reference/hip_runtime_api/modules/initialization_and_version - - file: reference/hip_runtime_api/modules/device_management - - file: reference/hip_runtime_api/modules/execution_control - - file: reference/hip_runtime_api/modules/error_handling - - file: reference/hip_runtime_api/modules/stream_management - - file: reference/hip_runtime_api/modules/stream_memory_operations - - file: reference/hip_runtime_api/modules/event_management - - file: reference/hip_runtime_api/modules/memory_management + - file: reference/hip_runtime_api/modules subtrees: - entries: - - file: reference/hip_runtime_api/modules/memory_management/memory_management_deprecated - - 
file: reference/hip_runtime_api/modules/memory_management/external_resource_interoperability - - file: reference/hip_runtime_api/modules/memory_management/stream_ordered_memory_allocator - - file: reference/hip_runtime_api/modules/memory_management/unified_memory_reference - - file: reference/hip_runtime_api/modules/memory_management/virtual_memory_reference - - file: reference/hip_runtime_api/modules/memory_management/texture_management - - file: reference/hip_runtime_api/modules/memory_management/texture_management_deprecated - - file: reference/hip_runtime_api/modules/memory_management/surface_object - - file: reference/hip_runtime_api/modules/peer_to_peer_device_memory_access - - file: reference/hip_runtime_api/modules/context_management - - file: reference/hip_runtime_api/modules/module_management - - file: reference/hip_runtime_api/modules/occupancy - - file: reference/hip_runtime_api/modules/profiler_control - - file: reference/hip_runtime_api/modules/launch_api - - file: reference/hip_runtime_api/modules/runtime_compilation - - file: reference/hip_runtime_api/modules/callback_activity_apis - - file: reference/hip_runtime_api/modules/graph_management - - file: reference/hip_runtime_api/modules/graphics_interoperability - - file: reference/hip_runtime_api/modules/opengl_interoperability - - file: reference/hip_runtime_api/modules/cooperative_groups_reference + - file: reference/hip_runtime_api/modules/initialization_and_version + - file: reference/hip_runtime_api/modules/device_management + - file: reference/hip_runtime_api/modules/execution_control + - file: reference/hip_runtime_api/modules/error_handling + - file: reference/hip_runtime_api/modules/stream_management + - file: reference/hip_runtime_api/modules/stream_memory_operations + - file: reference/hip_runtime_api/modules/event_management + - file: reference/hip_runtime_api/modules/memory_management + subtrees: + - entries: + - file: reference/hip_runtime_api/modules/memory_management/memory_management_deprecated + - file: reference/hip_runtime_api/modules/memory_management/external_resource_interoperability + - file: reference/hip_runtime_api/modules/memory_management/stream_ordered_memory_allocator + - file: reference/hip_runtime_api/modules/memory_management/unified_memory_reference + - file: reference/hip_runtime_api/modules/memory_management/virtual_memory_reference + - file: reference/hip_runtime_api/modules/memory_management/texture_management + - file: reference/hip_runtime_api/modules/memory_management/texture_management_deprecated + - file: reference/hip_runtime_api/modules/memory_management/surface_object + - file: reference/hip_runtime_api/modules/peer_to_peer_device_memory_access + - file: reference/hip_runtime_api/modules/context_management + - file: reference/hip_runtime_api/modules/module_management + - file: reference/hip_runtime_api/modules/occupancy + - file: reference/hip_runtime_api/modules/profiler_control + - file: reference/hip_runtime_api/modules/launch_api + - file: reference/hip_runtime_api/modules/runtime_compilation + - file: reference/hip_runtime_api/modules/callback_activity_apis + - file: reference/hip_runtime_api/modules/graph_management + - file: reference/hip_runtime_api/modules/graphics_interoperability + - file: reference/hip_runtime_api/modules/opengl_interoperability + - file: reference/hip_runtime_api/modules/cooperative_groups_reference - file: reference/hip_runtime_api/global_defines_enums_structs_files subtrees: - entries: From fddd68715d464a401c80f01a239b6ee704c3a759 Mon Sep 17 
00:00:00 2001 From: Istvan Kiss Date: Fri, 9 May 2025 12:39:00 +0200 Subject: [PATCH 23/36] Add links to environment variables table --- docs/conf.py | 2 +- docs/data/env_variables_hip.rst | 278 ------------------ .../reference/env_variables/debug_hip_env.rst | 100 +++++++ .../env_variables/gpu_isolation_hip_env.rst | 27 ++ .../memory_management_hip_env.rst | 100 +++++++ .../env_variables/miscellaneous_hip_env.rst | 34 +++ .../env_variables/profiling_hip_env.rst | 23 ++ docs/how-to/debugging.rst | 38 ++- .../memory_management/coherence_control.rst | 2 + docs/reference/env_variables.rst | 30 +- 10 files changed, 315 insertions(+), 319 deletions(-) delete mode 100644 docs/data/env_variables_hip.rst create mode 100644 docs/data/reference/env_variables/debug_hip_env.rst create mode 100644 docs/data/reference/env_variables/gpu_isolation_hip_env.rst create mode 100644 docs/data/reference/env_variables/memory_management_hip_env.rst create mode 100644 docs/data/reference/env_variables/miscellaneous_hip_env.rst create mode 100644 docs/data/reference/env_variables/profiling_hip_env.rst diff --git a/docs/conf.py b/docs/conf.py index a9d7e3f534..fc3d1274f9 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -57,7 +57,7 @@ "doxygen/mainpage.md", "understand/glossary.md", 'how-to/debugging_env.rst', - "data/env_variables_hip.rst" + "data/reference/env_variables" ] git_url = subprocess.check_output(['git', 'config', '--get', 'remote.origin.url']).strip().decode('ascii') diff --git a/docs/data/env_variables_hip.rst b/docs/data/env_variables_hip.rst deleted file mode 100644 index 4192db7387..0000000000 --- a/docs/data/env_variables_hip.rst +++ /dev/null @@ -1,278 +0,0 @@ -.. meta:: - :description: HIP environment variables - :keywords: AMD, HIP, environment variables, environment - -HIP GPU isolation variables --------------------------------------------------------------------------------- - -The GPU isolation environment variables in HIP are collected in the following table. - -.. _hip-env-isolation: -.. list-table:: - :header-rows: 1 - :widths: 70,30 - - * - **Environment variable** - - **Value** - - * - | ``ROCR_VISIBLE_DEVICES`` - | A list of device indices or UUIDs that will be exposed to applications. - - Example: ``0,GPU-DEADBEEFDEADBEEF`` - - * - | ``GPU_DEVICE_ORDINAL`` - | Devices indices exposed to OpenCL and HIP applications. - - Example: ``0,2`` - - * - | ``HIP_VISIBLE_DEVICES`` or ``CUDA_VISIBLE_DEVICES`` - | Device indices exposed to HIP applications. - - Example: ``0,2`` - -HIP profiling variables --------------------------------------------------------------------------------- - -The profiling environment variables in HIP are collected in the following table. - -.. _hip-env-prof: -.. list-table:: - :header-rows: 1 - :widths: 70,30 - - * - **Environment variable** - - **Value** - - * - | ``HSA_CU_MASK`` - | Sets the mask on a lower level of queue creation in the driver, - | this mask will also be set for queues being profiled. - - Example: ``1:0-8`` - - * - | ``ROC_GLOBAL_CU_MASK`` - | Sets the mask on queues created by the HIP or the OpenCL runtimes, - | this mask will also be set for queues being profiled. - - Example: ``0xf``, enables only 4 CUs - - * - | ``HIP_FORCE_QUEUE_PROFILING`` - | Used to run the app as if it were run in rocprof. Forces command queue - | profiling on by default. 
- - | 0: Disable - | 1: Enable - -HIP debug variables --------------------------------------------------------------------------------- - -The debugging environment variables in HIP are collected in the following table. - -.. _hip-env-debug: -.. list-table:: - :header-rows: 1 - :widths: 35,14,51 - - * - **Environment variable** - - **Default value** - - **Value** - - * - | ``AMD_LOG_LEVEL`` - | Enables HIP log on various level. - - ``0`` - - | 0: Disable log. - | 1: Enables error logs. - | 2: Enables warning logs next to lower-level logs. - | 3: Enables information logs next to lower-level logs. - | 4: Enables debug logs next to lower-level logs. - | 5: Enables debug extra logs next to lower-level logs. - - * - | ``AMD_LOG_LEVEL_FILE`` - | Sets output file for ``AMD_LOG_LEVEL``. - - stderr output - - - - * - | ``AMD_LOG_MASK`` - | Specifies HIP log filters. Here is the ` complete list of log masks `_. - - ``0x7FFFFFFF`` - - | 0x1: Log API calls. - | 0x2: Kernel and copy commands and barriers. - | 0x4: Synchronization and waiting for commands to finish. - | 0x8: Decode and display AQL packets. - | 0x10: Queue commands and queue contents. - | 0x20: Signal creation, allocation, pool. - | 0x40: Locks and thread-safety code. - | 0x80: Kernel creations and arguments, etc. - | 0x100: Copy debug. - | 0x200: Detailed copy debug. - | 0x400: Resource allocation, performance-impacting events. - | 0x800: Initialization and shutdown. - | 0x1000: Misc debug, not yet classified. - | 0x2000: Show raw bytes of AQL packet. - | 0x4000: Show code creation debug. - | 0x8000: More detailed command info, including barrier commands. - | 0x10000: Log message location. - | 0x20000: Memory allocation. - | 0x40000: Memory pool allocation, including memory in graphs. - | 0x80000: Timestamp details. - | 0xFFFFFFFF: Log always even mask flag is zero. - - * - | ``HIP_LAUNCH_BLOCKING`` - | Used for serialization on kernel execution. - - ``0`` - - | 0: Disable. Kernel executes normally. - | 1: Enable. Serializes kernel enqueue, behaves the same as ``AMD_SERIALIZE_KERNEL``. - - * - | ``HIP_VISIBLE_DEVICES`` (or ``CUDA_VISIBLE_DEVICES``) - | Only devices whose index is present in the sequence are visible to HIP - - Unset by default. - - 0,1,2: Depending on the number of devices on the system. - - * - | ``GPU_DUMP_CODE_OBJECT`` - | Dump code object. - - ``0`` - - | 0: Disable - | 1: Enable - - * - | ``AMD_SERIALIZE_KERNEL`` - | Serialize kernel enqueue. - - ``0`` - - | 0: Disable - | 1: Wait for completion before enqueue. - | 2: Wait for completion after enqueue. - | 3: Both - - * - | ``AMD_SERIALIZE_COPY`` - | Serialize copies - - ``0`` - - | 0: Disable - | 1: Wait for completion before enqueue. - | 2: Wait for completion after enqueue. - | 3: Both - - * - | ``AMD_DIRECT_DISPATCH`` - | Enable direct kernel dispatch (Currently for Linux; under development for Windows). - - ``1`` - - | 0: Disable - | 1: Enable - - * - | ``GPU_MAX_HW_QUEUES`` - | The maximum number of hardware queues allocated per device. - - ``4`` - - The variable controls how many independent hardware queues HIP runtime can create per process, - per device. If an application allocates more HIP streams than this number, then HIP runtime reuses - the same hardware queues for the new streams in a round-robin manner. Note that this maximum - number does not apply to hardware queues that are created for CU-masked HIP streams, or - cooperative queues for HIP Cooperative Groups (single queue per device). 
- -HIP memory management related variables --------------------------------------------------------------------------------- - -The memory management related environment variables in HIP are collected in the -following table. - -.. _hip-env-memory: -.. list-table:: - :header-rows: 1 - :widths: 35,14,51 - - * - **Environment variable** - - **Default value** - - **Value** - - * - | ``HIP_HIDDEN_FREE_MEM`` - | Amount of memory to hide from the free memory reported by hipMemGetInfo. - - ``0`` - - | 0: Disable - | Unit: megabyte (MB) - - * - | ``HIP_HOST_COHERENT`` - | Specifies if the memory is coherent between the host and GPU in ``hipHostMalloc``. - - ``0`` - - | 0: Memory is not coherent. - | 1: Memory is coherent. - | Environment variable has effect, if the following conditions are statisfied: - | - One of the ``hipHostMallocDefault``, ``hipHostMallocPortable``, ``hipHostMallocWriteCombined`` or ``hipHostMallocNumaUser`` flag set to 1. - | - ``hipHostMallocCoherent``, ``hipHostMallocNonCoherent`` and ``hipHostMallocMapped`` flags set to 0. - - * - | ``HIP_INITIAL_DM_SIZE`` - | Set initial heap size for device malloc. - - ``8388608`` - - | Unit: Byte - | The default value corresponds to 8 MB. - - * - | ``HIP_MEM_POOL_SUPPORT`` - | Enables memory pool support in HIP. - - ``0`` - - | 0: Disable - | 1: Enable - - * - | ``HIP_MEM_POOL_USE_VM`` - | Enables memory pool support in HIP. - - | ``0``: other OS - | ``1``: Windows - - | 0: Disable - | 1: Enable - - * - | ``HIP_VMEM_MANAGE_SUPPORT`` - | Virtual Memory Management Support. - - ``1`` - - | 0: Disable - | 1: Enable - - * - | ``GPU_MAX_HEAP_SIZE`` - | Set maximum size of the GPU heap to % of board memory. - - ``100`` - - | Unit: Percentage - - * - | ``GPU_MAX_REMOTE_MEM_SIZE`` - | Maximum size that allows device memory substitution with system. - - ``2`` - - | Unit: kilobyte (KB) - - * - | ``GPU_NUM_MEM_DEPENDENCY`` - | Number of memory objects for dependency tracking. - - ``256`` - - - - * - | ``GPU_STREAMOPS_CP_WAIT`` - | Force the stream memory operation to wait on CP. - - ``0`` - - | 0: Disable - | 1: Enable - - * - | ``HSA_LOCAL_MEMORY_ENABLE`` - | Enable HSA device local memory usage. - - ``1`` - - | 0: Disable - | 1: Enable - - * - | ``PAL_ALWAYS_RESIDENT`` - | Force memory resources to become resident at allocation time. - - ``0`` - - | 0: Disable - | 1: Enable - - * - | ``PAL_PREPINNED_MEMORY_SIZE`` - | Size of prepinned memory. - - ``64`` - - | Unit: kilobyte (KB) - - * - | ``REMOTE_ALLOC`` - | Use remote memory for the global heap allocation. - - ``0`` - - | 0: Disable - | 1: Enable - -HIP miscellaneous variables --------------------------------------------------------------------------------- - -The following table lists environment variables that are useful but relate to -different features in HIP. - -.. _hip-env-other: -.. list-table:: - :header-rows: 1 - :widths: 35,14,51 - - * - **Environment variable** - - **Default value** - - **Value** - - * - | ``HIPRTC_COMPILE_OPTIONS_APPEND`` - | Sets compile options needed for ``hiprtc`` compilation. - - None - - ``--gpu-architecture=gfx906:sramecc+:xnack``, ``-fgpu-rdc`` diff --git a/docs/data/reference/env_variables/debug_hip_env.rst b/docs/data/reference/env_variables/debug_hip_env.rst new file mode 100644 index 0000000000..6e1e0f3e6a --- /dev/null +++ b/docs/data/reference/env_variables/debug_hip_env.rst @@ -0,0 +1,100 @@ +The debugging environment variables in HIP are collected in the following table. 
For +more information, check :doc:`hip:how-to/logging`, :doc:`hip:how-to/debugging` +and :doc:`GPU isolation `. + +.. _hip-env-debug: +.. list-table:: + :header-rows: 1 + :widths: 35,14,51 + + * - **Environment variable** + - **Default value** + - **Value** + + * - | ``AMD_LOG_LEVEL`` + | Enables HIP log on various level. + - ``0`` + - | 0: Disable log. + | 1: Enables error logs. + | 2: Enables warning logs next to lower-level logs. + | 3: Enables information logs next to lower-level logs. + | 4: Enables debug logs next to lower-level logs. + | 5: Enables debug extra logs next to lower-level logs. + + * - | ``AMD_LOG_LEVEL_FILE`` + | Sets output file for ``AMD_LOG_LEVEL``. + - stderr output + - + + * - | ``AMD_LOG_MASK`` + | Specifies HIP log filters. Here is the ` complete list of log masks `_. + - ``0x7FFFFFFF`` + - | 0x1: Log API calls. + | 0x2: Kernel and copy commands and barriers. + | 0x4: Synchronization and waiting for commands to finish. + | 0x8: Decode and display AQL packets. + | 0x10: Queue commands and queue contents. + | 0x20: Signal creation, allocation, pool. + | 0x40: Locks and thread-safety code. + | 0x80: Kernel creations and arguments, etc. + | 0x100: Copy debug. + | 0x200: Detailed copy debug. + | 0x400: Resource allocation, performance-impacting events. + | 0x800: Initialization and shutdown. + | 0x1000: Misc debug, not yet classified. + | 0x2000: Show raw bytes of AQL packet. + | 0x4000: Show code creation debug. + | 0x8000: More detailed command info, including barrier commands. + | 0x10000: Log message location. + | 0x20000: Memory allocation. + | 0x40000: Memory pool allocation, including memory in graphs. + | 0x80000: Timestamp details. + | 0xFFFFFFFF: Log always even mask flag is zero. + + * - | ``HIP_LAUNCH_BLOCKING`` + | Used for serialization on kernel execution. + - ``0`` + - | 0: Disable. Kernel executes normally. + | 1: Enable. Serializes kernel enqueue, behaves the same as ``AMD_SERIALIZE_KERNEL``. + + * - | ``HIP_VISIBLE_DEVICES`` (or ``CUDA_VISIBLE_DEVICES``) + | Only devices whose index is present in the sequence are visible to HIP + - Unset by default. + - 0,1,2: Depending on the number of devices on the system. + + * - | ``GPU_DUMP_CODE_OBJECT`` + | Dump code object. + - ``0`` + - | 0: Disable + | 1: Enable + + * - | ``AMD_SERIALIZE_KERNEL`` + | Serialize kernel enqueue. + - ``0`` + - | 0: Disable + | 1: Wait for completion before enqueue. + | 2: Wait for completion after enqueue. + | 3: Both + + * - | ``AMD_SERIALIZE_COPY`` + | Serialize copies + - ``0`` + - | 0: Disable + | 1: Wait for completion before enqueue. + | 2: Wait for completion after enqueue. + | 3: Both + + * - | ``AMD_DIRECT_DISPATCH`` + | Enable direct kernel dispatch (Currently for Linux; under development for Windows). + - ``1`` + - | 0: Disable + | 1: Enable + + * - | ``GPU_MAX_HW_QUEUES`` + | The maximum number of hardware queues allocated per device. + - ``4`` + - The variable controls how many independent hardware queues HIP runtime can create per process, + per device. If an application allocates more HIP streams than this number, then HIP runtime reuses + the same hardware queues for the new streams in a round-robin manner. Note that this maximum + number does not apply to hardware queues that are created for CU-masked HIP streams, or + cooperative queues for HIP Cooperative Groups (single queue per device). 
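To make the serialization variables above concrete, the following sketch (illustrative only; the kernel, sizes, and launch configuration are assumptions, not taken from this series) shows where launch and execution errors are typically observed. With ``AMD_SERIALIZE_KERNEL=3`` or ``HIP_LAUNCH_BLOCKING=1`` set in the environment, execution problems tend to be reported at the check right after the launch; otherwise they usually surface at the next blocking call such as ``hipDeviceSynchronize()``.

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <cstdio>
   #include <vector>

   __global__ void scale(float* data, size_t size, float factor) {
       size_t i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < size)
           data[i] *= factor;
   }

   int main() {
       constexpr size_t size = 1 << 16;
       std::vector<float> host(size, 1.0f);
       float* device_data = nullptr;

       if (hipMalloc(&device_data, size * sizeof(float)) != hipSuccess)
           return 1;
       if (hipMemcpy(device_data, host.data(), size * sizeof(float), hipMemcpyHostToDevice) != hipSuccess)
           return 1;

       scale<<<(size + 255) / 256, 256>>>(device_data, size, 2.0f);

       // Launch-time errors are reported here; with serialization enabled,
       // execution errors also tend to appear at this point.
       hipError_t status = hipGetLastError();
       if (status != hipSuccess) {
           std::fprintf(stderr, "launch failed: %s\n", hipGetErrorString(status));
           return 1;
       }

       // Without serialization, faults raised while the kernel runs are typically
       // observed at the next blocking call, such as this one.
       status = hipDeviceSynchronize();
       if (status != hipSuccess) {
           std::fprintf(stderr, "execution failed: %s\n", hipGetErrorString(status));
           return 1;
       }

       return hipFree(device_data) == hipSuccess ? 0 : 1;
   }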
diff --git a/docs/data/reference/env_variables/gpu_isolation_hip_env.rst b/docs/data/reference/env_variables/gpu_isolation_hip_env.rst new file mode 100644 index 0000000000..56fe8c08ec --- /dev/null +++ b/docs/data/reference/env_variables/gpu_isolation_hip_env.rst @@ -0,0 +1,27 @@ +Restricting the access of applications to a subset of GPUs, also known as GPU +isolation, allows users to hide GPU resources from programs. The GPU isolation +environment variables in HIP are collected in the following table. + +.. _hip-env-isolation: +.. list-table:: + :header-rows: 1 + :widths: 50,30,20 + + * - **Environment variable** + - **Links** + - **Value** + + * - | ``ROCR_VISIBLE_DEVICES`` + | A list of device indices or UUIDs that will be exposed to applications. + - :doc:`GPU isolation `, :doc:`Setting the number of compute units ` + - Example: ``0,GPU-DEADBEEFDEADBEEF`` + + * - | ``GPU_DEVICE_ORDINAL`` + | Devices indices exposed to OpenCL and HIP applications. + - :doc:`GPU isolation ` + - Example: ``0,2`` + + * - | ``HIP_VISIBLE_DEVICES`` or ``CUDA_VISIBLE_DEVICES`` + | Device indices exposed to HIP applications. + - :doc:`GPU isolation `, :doc:`HIP debugging ` + - Example: ``0,2`` diff --git a/docs/data/reference/env_variables/memory_management_hip_env.rst b/docs/data/reference/env_variables/memory_management_hip_env.rst new file mode 100644 index 0000000000..0bb6631a3f --- /dev/null +++ b/docs/data/reference/env_variables/memory_management_hip_env.rst @@ -0,0 +1,100 @@ +The memory management related environment variables in HIP are collected in the +following table. The ``HIP_HOST_COHERENT`` variable linked at the following +pages: + +- :ref:`Coherence control ` + +- :ref:`Memory allocation flags ` + +.. _hip-env-memory: +.. list-table:: + :header-rows: 1 + :widths: 35,14,51 + + * - **Environment variable** + - **Default value** + - **Value** + + * - | ``HIP_HIDDEN_FREE_MEM`` + | Amount of memory to hide from the free memory reported by hipMemGetInfo. + - ``0`` + - | 0: Disable + | Unit: megabyte (MB) + + * - | ``HIP_HOST_COHERENT`` + | Specifies if the memory is coherent between the host and GPU in ``hipHostMalloc``. + - ``0`` + - | 0: Memory is not coherent. + | 1: Memory is coherent. + | Environment variable has effect, if the following conditions are statisfied: + | - One of the ``hipHostMallocDefault``, ``hipHostMallocPortable``, ``hipHostMallocWriteCombined`` or ``hipHostMallocNumaUser`` flag set to 1. + | - ``hipHostMallocCoherent``, ``hipHostMallocNonCoherent`` and ``hipHostMallocMapped`` flags set to 0. + + * - | ``HIP_INITIAL_DM_SIZE`` + | Set initial heap size for device malloc. + - ``8388608`` + - | Unit: Byte + | The default value corresponds to 8 MB. + + * - | ``HIP_MEM_POOL_SUPPORT`` + | Enables memory pool support in HIP. + - ``0`` + - | 0: Disable + | 1: Enable + + * - | ``HIP_MEM_POOL_USE_VM`` + | Enables memory pool support in HIP. + - | ``0``: other OS + | ``1``: Windows + - | 0: Disable + | 1: Enable + + * - | ``HIP_VMEM_MANAGE_SUPPORT`` + | Virtual Memory Management Support. + - ``1`` + - | 0: Disable + | 1: Enable + + * - | ``GPU_MAX_HEAP_SIZE`` + | Set maximum size of the GPU heap to % of board memory. + - ``100`` + - | Unit: Percentage + + * - | ``GPU_MAX_REMOTE_MEM_SIZE`` + | Maximum size that allows device memory substitution with system. + - ``2`` + - | Unit: kilobyte (KB) + + * - | ``GPU_NUM_MEM_DEPENDENCY`` + | Number of memory objects for dependency tracking. + - ``256`` + - + + * - | ``GPU_STREAMOPS_CP_WAIT`` + | Force the stream memory operation to wait on CP. 
+ - ``0`` + - | 0: Disable + | 1: Enable + + * - | ``HSA_LOCAL_MEMORY_ENABLE`` + | Enable HSA device local memory usage. + - ``1`` + - | 0: Disable + | 1: Enable + + * - | ``PAL_ALWAYS_RESIDENT`` + | Force memory resources to become resident at allocation time. + - ``0`` + - | 0: Disable + | 1: Enable + + * - | ``PAL_PREPINNED_MEMORY_SIZE`` + | Size of prepinned memory. + - ``64`` + - | Unit: kilobyte (KB) + + * - | ``REMOTE_ALLOC`` + | Use remote memory for the global heap allocation. + - ``0`` + - | 0: Disable + | 1: Enable diff --git a/docs/data/reference/env_variables/miscellaneous_hip_env.rst b/docs/data/reference/env_variables/miscellaneous_hip_env.rst new file mode 100644 index 0000000000..facef59e0f --- /dev/null +++ b/docs/data/reference/env_variables/miscellaneous_hip_env.rst @@ -0,0 +1,34 @@ +The following table lists environment variables that are useful but relate to +different features in HIP. + +.. _hip-env-other: +.. list-table:: + :header-rows: 1 + :widths: 35,14,51 + + * - **Environment variable** + - **Default value** + - **Value** + + * - | ``HIPRTC_COMPILE_OPTIONS_APPEND`` + | Sets compile options needed for ``hiprtc`` compilation. + - Unset by default. + - ``--gpu-architecture=gfx906:sramecc+:xnack``, ``-fgpu-rdc`` + + * - | ``AMD_COMGR_SAVE_TEMPS`` + | Controls the deletion of temporary files generated during the compilation of COMGR. These files do not appear in the current working directory, but are instead left in a platform-specific temporary directory. + - Unset by default. + - | 0: Temporary files are deleted automatically. + | Non zero integer: Turn off the temporary files deletion. + + * - | ``AMD_COMGR_EMIT_VERBOSE_LOGS`` + | Sets logging of COMGR to include additional Comgr-specific informational messages. + - Unset by default. + - | 0: Verbose log disabled. + | Non zero integer: Verbose log enabled. + + * - | ``AMD_COMGR_REDIRECT_LOGS`` + | Controls redirect logs of COMGR. + - Unset by default. + - | `stdout` / `-`: Redirected to the standard output. + | `stderr`: Redirected to the error stream. \ No newline at end of file diff --git a/docs/data/reference/env_variables/profiling_hip_env.rst b/docs/data/reference/env_variables/profiling_hip_env.rst new file mode 100644 index 0000000000..0146f4a0e9 --- /dev/null +++ b/docs/data/reference/env_variables/profiling_hip_env.rst @@ -0,0 +1,23 @@ +The profiling environment variables in HIP are collected in the following table. For +more information, check :doc:`setting the number of CUs page `. + +.. _hip-env-prof: +.. list-table:: + :header-rows: 1 + :widths: 70,30 + + * - **Environment variable** + - **Value** + + * - | ``HSA_CU_MASK`` + | Sets the mask on a lower level of queue creation in the driver, this mask will also be set for queues being profiled. + - Example: ``1:0-8`` + + * - | ``ROC_GLOBAL_CU_MASK`` + | Sets the mask on queues created by the HIP or the OpenCL runtimes, this mask will also be set for queues being profiled. + - Example: ``0xf``, enables only 4 CUs + + * - | ``HIP_FORCE_QUEUE_PROFILING`` + | Used to run the app as if it were run in rocprof. Forces command queue profiling on by default. + - | 0: Disable + | 1: Enable diff --git a/docs/how-to/debugging.rst b/docs/how-to/debugging.rst index 6d5ff2ff24..fce7fac6ae 100644 --- a/docs/how-to/debugging.rst +++ b/docs/how-to/debugging.rst @@ -259,42 +259,50 @@ HSA provides environment variables that help analyze issues in drivers or hardwa * To isolate issues with hardware copy engines, you can use ``HSA_ENABLE_SDMA``. 
- ``HSA_ENABLE_SDMA=0`` causes host-to-device and device-to-host copies to use compute shader - blit kernels, rather than the dedicated DMA copy engines. Compute shader copies have low latency - (typically < 5 us) and can achieve approximately 80% of the bandwidth of the DMA copy engine. + ``HSA_ENABLE_SDMA=0`` causes host-to-device and device-to-host copies to use compute shader + blit kernels, rather than the dedicated DMA copy engines. Compute shader copies have low latency + (typically < 5 us) and can achieve approximately 80% of the bandwidth of the DMA copy engine. * To diagnose interrupt storm issues in the driver, you can use ``HSA_ENABLE_INTERRUPT``. - ``HSA_ENABLE_INTERRUPT=0`` causes completion signals to be detected with memory-based - polling, rather than interrupts. + ``HSA_ENABLE_INTERRUPT=0`` causes completion signals to be detected with memory-based + polling, rather than interrupts. HIP environment variable summary -------------------------------- Here are some of the more commonly used environment variables: -.. include-table:: data/env_variables_hip.rst +.. include-table:: data/reference/env_variables/debug_hip_env.rst :table: hip-env-debug General debugging tips ====================================================== * ``gdb --args`` can be used to pass the executable and arguments to ``gdb``. + * You can set environment variables (``set env``) from within GDB on Linux: - .. code-block:: bash + .. code-block:: bash + + (gdb) set env AMD_SERIALIZE_KERNEL 3 - (gdb) set env AMD_SERIALIZE_KERNEL 3 + .. note:: - .. note:: - This ``gdb`` command does not use an equal (=) sign. + This ``gdb`` command does not use an equal (=) sign. * The GDB backtrace shows a path in the runtime. This is because a fault is caught by the runtime, but it is generated by an asynchronous command running on the GPU. + * To determine the true location of a fault, you can force the kernels to run synchronously by setting the environment variables ``AMD_SERIALIZE_KERNEL=3`` and ``AMD_SERIALIZE_COPY=3``. This forces HIP runtime to wait for the kernel to finish running before returning. If the fault occurs when a kernel is running, you can see the code that launched the kernel inside the backtrace. The thread that's causing the issue is typically the one inside ``libhsa-runtime64.so``. + * VM faults inside kernels can be caused by: - * Incorrect code (e.g., a for loop that extends past array boundaries) - * Memory issues, such as invalid kernel arguments (null pointers, unregistered host pointers, bad pointers) - * Synchronization issues - * Compiler issues (incorrect code generation from the compiler) - * Runtime issues + * Incorrect code (e.g., a for loop that extends past array boundaries) + + * Memory issues, such as invalid kernel arguments (null pointers, unregistered host pointers, bad pointers) + + * Synchronization issues + + * Compiler issues (incorrect code generation from the compiler) + + * Runtime issues diff --git a/docs/how-to/hip_runtime_api/memory_management/coherence_control.rst b/docs/how-to/hip_runtime_api/memory_management/coherence_control.rst index a4d0234988..be05bb178c 100644 --- a/docs/how-to/hip_runtime_api/memory_management/coherence_control.rst +++ b/docs/how-to/hip_runtime_api/memory_management/coherence_control.rst @@ -64,6 +64,8 @@ To check the availability of fine- and coarse-grained memory pools, use Segment: GLOBAL; FLAGS: COARSE GRAINED ... +.. 
_hip-memory-coherence-table: + The APIs, flags and respective memory coherence control are listed in the following table: diff --git a/docs/reference/env_variables.rst b/docs/reference/env_variables.rst index fb1732311d..390cb9516c 100644 --- a/docs/reference/env_variables.rst +++ b/docs/reference/env_variables.rst @@ -12,44 +12,24 @@ on AMD platform, which are grouped by functionality. GPU isolation variables ================================================================================ -The GPU isolation environment variables in HIP are collected in the following table. -For more information, check :doc:`GPU isolation page `. - -.. include-table:: data/env_variables_hip.rst - :table: hip-env-isolation +.. include:: ../data/reference/env_variables/gpu_isolation_hip_env.rst Profiling variables ================================================================================ -The profiling environment variables in HIP are collected in the following table. For -more information, check :doc:`setting the number of CUs page `. - -.. include-table:: data/env_variables_hip.rst - :table: hip-env-prof +.. include:: ../data/reference/env_variables/profiling_hip_env.rst Debug variables ================================================================================ -The debugging environment variables in HIP are collected in the following table. For -more information, check :ref:`debugging_with_hip`. - -.. include-table:: data/env_variables_hip.rst - :table: hip-env-debug +.. include:: ../data/reference/env_variables/debug_hip_env.rst Memory management related variables ================================================================================ -The memory management related environment variables in HIP are collected in the -following table. - -.. include-table:: data/env_variables_hip.rst - :table: hip-env-memory +.. include:: ../data/reference/env_variables/memory_management_hip_env.rst Other useful variables ================================================================================ -The following table lists environment variables that are useful but relate to -different features. - -.. include-table:: data/env_variables_hip.rst - :table: hip-env-other +.. include:: ../data/reference/env_variables/miscellaneous_hip_env.rst From 72163f6a70993031dad90ed3b25df6cc2831ae72 Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Wed, 2 Jul 2025 14:43:45 +0200 Subject: [PATCH 24/36] Minor warp size fix --- docs/how-to/hip_cpp_language_extensions.rst | 22 +++++++++---------- .../how-to/hip_runtime_api/error_handling.rst | 1 + 2 files changed, 12 insertions(+), 11 deletions(-) diff --git a/docs/how-to/hip_cpp_language_extensions.rst b/docs/how-to/hip_cpp_language_extensions.rst index fba47b2a6a..2c08d2fa59 100644 --- a/docs/how-to/hip_cpp_language_extensions.rst +++ b/docs/how-to/hip_cpp_language_extensions.rst @@ -433,15 +433,15 @@ compile-time constant on the host. It has to be queried using of 32 can run on devices with a ``warpSize`` of 64, it only utilizes half of the compute resources. -The ``warpSize`` parameter will no longer be a compile-time constant in a future -release of ROCm, however it will be still early folded by the compiler, which -means it can be used for loop bounds and supports loop unrolling similarly to -compile-time warp size. +Prior to ROCm 7.0, the warpSize parameter was a compile-time constant. 
Starting +with ROCm 7.0, it is early folded by the compiler, allowing it to be used in +loop bounds and enabling loop unrolling in a manner similar to a compile-time +constant warp size. -If the compile time warp size is still required, for example to select the correct -mask type or code path at compile time, the recommended approach is to determine -the warp size of the GPU on host side and setup the kernel accordingly, as shown -in the following block reduce example. +If compile time warp size is required, for example to select the correct mask +type or code path at compile time, the recommended approach is to determine the +warp size of the GPU on host side and setup the kernel accordingly, as shown in +the following block reduce example. The ``block_reduce`` kernel has a template parameter for warp size and performs a reduction operation in two main phases: @@ -457,7 +457,7 @@ a reduction operation in two main phases: .. tab-set:: - .. tab-item:: WarpSize Template Parameter + .. tab-item:: WarpSize template parameter :sync: template-warpsize .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip @@ -486,7 +486,7 @@ The host code with the main function: .. tab-set:: - .. tab-item:: Compile-time WarpSize + .. tab-item:: WarpSize template parameter :sync: template-warpsize .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip @@ -508,7 +508,7 @@ The host code with the main function: .. tab-set:: - .. tab-item:: Compile-time WarpSize + .. tab-item:: WarpSize template parameter :sync: template-warpsize .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip diff --git a/docs/how-to/hip_runtime_api/error_handling.rst b/docs/how-to/hip_runtime_api/error_handling.rst index a400ff97ec..b860f639f4 100644 --- a/docs/how-to/hip_runtime_api/error_handling.rst +++ b/docs/how-to/hip_runtime_api/error_handling.rst @@ -27,6 +27,7 @@ without changing it. To get a human readable version of the errors, host thread. :cpp:func:`hipGetLastError` behavior will be matched with ``cudaGetLastError`` in ROCm release 7.0. + Best practices of HIP error handling: 1. Check errors after each API call - Avoid error propagation. From 97122df8ef860ba98d4829a9c2061eb0f2bb8f7b Mon Sep 17 00:00:00 2001 From: Adel Johar Date: Mon, 14 Jul 2025 16:05:06 +0200 Subject: [PATCH 25/36] Docs: Update low_fp_types --- .wordlist.txt | 1 + docs/reference/low_fp_types.rst | 369 ++++++++++++++++++++++---------- 2 files changed, 251 insertions(+), 119 deletions(-) diff --git a/.wordlist.txt b/.wordlist.txt index 69934c1b06..faa1107abf 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -12,6 +12,7 @@ Asynchrony asynchrony backtrace bfloat +BFloat Bitcode bitcode bitcodes diff --git a/docs/reference/low_fp_types.rst b/docs/reference/low_fp_types.rst index b5645eed61..65b296a733 100644 --- a/docs/reference/low_fp_types.rst +++ b/docs/reference/low_fp_types.rst @@ -43,28 +43,32 @@ HIP Header The `HIP FP4 header `_ defines the FP4 numbers. -Supported Devices ------------------ +Device Compatibility +-------------------- -Different GPU models support different FP4 formats. Here's a breakdown: +The following table shows hardware support for this precision format by GPU architecture. "Yes" +indicates native hardware acceleration is available, while "No" indicates hardware acceleration +is not available. -.. list-table:: Supported devices for fp4 numbers +.. 
list-table:: :header-rows: 1 * - Device Type - E2M1 - * - Host - - Yes * - CDNA1 - No * - CDNA2 - No * - CDNA3 + - No + * - CDNA4 - Yes * - RDNA2 - No * - RDNA3 - No + * - RDNA4 + - No Using FP4 Numbers in HIP Programs --------------------------------- @@ -130,9 +134,9 @@ The following code example demonstrates a simple roundtrip conversion using FP4 constexpr size_t size = 16; hipDeviceProp_t prop; hip_check(hipGetDeviceProperties(&prop, 0)); - bool is_supported = (std::string(prop.gcnArchName).find("gfx94") != std::string::npos); + bool is_supported = (std::string(prop.gcnArchName).find("gfx950") != std::string::npos); if(!is_supported) { - std::cerr << "Need a gfx94x, but found: " << prop.gcnArchName << std::endl; + std::cerr << "Need gfx950, but found: " << prop.gcnArchName << std::endl; std::cerr << "Device conversions are not supported on this hardware." << std::endl; return -1; } @@ -180,12 +184,12 @@ There are C++ style classes available as well: FP4 type has its own class: -- __hip_fp4_e2m1 +- ``__hip_fp4_e2m1`` There is support of vector of FP4 types: -- __hip_fp4x2_e2m1: holds 2 values of FP4 e2m1 numbers -- __hip_fp4x4_e2m1: holds 4 values of FP4 e2m1 numbers +- ``__hip_fp4x2_e2m1``: holds 2 values of FP4 e2m1 numbers +- ``__hip_fp4x4_e2m1``: holds 4 values of FP4 e2m1 numbers FP6 (6-bit Precision) ======================== @@ -224,20 +228,19 @@ HIP Header The `HIP FP6 header `_ defines the FP6 numbers. -Supported Devices ------------------ +Device Compatibility +-------------------- -Different GPU models support different FP6 formats. Here's a breakdown: +The following table shows hardware support for this precision format by GPU architecture. "Yes" +indicates native hardware acceleration is available, while "No" indicates hardware acceleration +is not available. -.. list-table:: Supported devices for fp6 numbers +.. list-table:: :header-rows: 1 * - Device Type - E3M2 - E2M3 - * - Host - - Yes - - Yes * - CDNA1 - No - No @@ -245,6 +248,9 @@ Different GPU models support different FP6 formats. Here's a breakdown: - No - No * - CDNA3 + - No + - No + * - CDNA4 - Yes - Yes * - RDNA2 @@ -253,6 +259,9 @@ Different GPU models support different FP6 formats. Here's a breakdown: * - RDNA3 - No - No + * - RDNA4 + - No + - No Using FP6 Numbers in HIP Programs --------------------------------- @@ -321,9 +330,9 @@ The following code example demonstrates a roundtrip conversion using FP6 types: constexpr size_t size = 16; hipDeviceProp_t prop; hip_check(hipGetDeviceProperties(&prop, 0)); - bool is_supported = (std::string(prop.gcnArchName).find("gfx94") != std::string::npos); + bool is_supported = (std::string(prop.gcnArchName).find("gfx950") != std::string::npos); if(!is_supported) { - std::cerr << "Need a gfx94x, but found: " << prop.gcnArchName << std::endl; + std::cerr << "Need gfx950, but found: " << prop.gcnArchName << std::endl; std::cerr << "Device conversions are not supported on this hardware." 
<< std::endl; return -1; } @@ -378,15 +387,15 @@ There are C++ style classes available as well: Each type of FP6 number has its own class: -- __hip_fp6_e2m3 -- __hip_fp6_e3m2 +- ``__hip_fp6_e2m3`` +- ``__hip_fp6_e3m2`` There is support of vector of FP6 types: -- __hip_fp6x2_e2m3: holds 2 values of FP6 e2m3 numbers -- __hip_fp6x4_e2m3: holds 4 values of FP6 e2m3 numbers -- __hip_fp6x2_e3m2: holds 2 values of FP6 e3m2 numbers -- __hip_fp6x4_e3m2: holds 4 values of FP6 e3m2 numbers +- ``__hip_fp6x2_e2m3``: holds 2 values of FP6 e2m3 numbers +- ``__hip_fp6x4_e2m3``: holds 4 values of FP6 e2m3 numbers +- ``__hip_fp6x2_e3m2``: holds 2 values of FP6 e3m2 numbers +- ``__hip_fp6x4_e3m2``: holds 4 values of FP6 e3m2 numbers FP8 (Quarter Precision) ======================= @@ -444,10 +453,12 @@ HIP Header The `HIP FP8 header `_ defines the FP8 ocp/fnuz numbers. -Supported Devices ------------------ +Device Compatibility +-------------------- -Different GPU models support different FP8 formats. Here's a breakdown: +The following table shows hardware support for this precision format by GPU architecture. "Yes" +indicates native hardware acceleration is available, while "No" indicates hardware acceleration +is not available. .. list-table:: Supported devices for fp8 numbers :header-rows: 1 @@ -455,9 +466,6 @@ Different GPU models support different FP8 formats. Here's a breakdown: * - Device Type - FNUZ FP8 - OCP FP8 - * - Host - - Yes - - Yes * - CDNA1 - No - No @@ -467,12 +475,18 @@ Different GPU models support different FP8 formats. Here's a breakdown: * - CDNA3 - Yes - No + * - CDNA4 + - No + - Yes * - RDNA2 - No - No * - RDNA3 - No - No + * - RDNA4 + - No + - Yes Using FP8 Numbers in HIP Programs --------------------------------- @@ -535,7 +549,7 @@ The following code example does roundtrip FP8 conversions on both the CPU and GP __device__ float d_convert_fp8_to_float(float in, __hip_fp8_interpretation_t interpret) { - __half hf = __hip_cvt_fp8_to_halfraw(in, interpret); + float hf = __hip_cvt_fp8_to_float(in, interpret); return hf; } @@ -572,9 +586,11 @@ The following code example does roundtrip FP8 conversions on both the CPU and GP constexpr size_t size = 32; hipDeviceProp_t prop; hip_check(hipGetDeviceProperties(&prop, 0)); - bool is_supported = (std::string(prop.gcnArchName).find("gfx94") != std::string::npos); // gfx94x + bool is_supported = (std::string(prop.gcnArchName).find("gfx94") != std::string::npos) + || (std::string(prop.gcnArchName).find("gfx950") != std::string::npos) + || (std::string(prop.gcnArchName).find("gfx12") != std::string::npos); if(!is_supported) { - std::cerr << "Need a gfx94x, but found: " << prop.gcnArchName << std::endl; + std::cerr << "Need a gfx94x, gfx950 or gfx12xx, but found: " << prop.gcnArchName << std::endl; std::cerr << "No device conversions are supported, only host conversions are supported." << std::endl; return -1; } @@ -625,6 +641,8 @@ The following code example does roundtrip FP8 conversions on both the CPU and GP } } std::cout << "...CPU and GPU round trip convert matches." << std::endl; + + return 0; } There are C++ style classes available as well. @@ -636,110 +654,72 @@ There are C++ style classes available as well. Each type of FP8 number has its own class: -- __hip_fp8_e4m3 -- __hip_fp8_e5m2 -- __hip_fp8_e4m3_fnuz -- __hip_fp8_e5m2_fnuz +- ``__hip_fp8_e4m3`` +- ``__hip_fp8_e5m2`` +- ``__hip_fp8_e4m3_fnuz`` +- ``__hip_fp8_e5m2_fnuz`` There is support of vector of FP8 types. 
-- __hip_fp8x2_e4m3: holds 2 values of OCP FP8 e4m3 numbers -- __hip_fp8x4_e4m3: holds 4 values of OCP FP8 e4m3 numbers -- __hip_fp8x2_e5m2: holds 2 values of OCP FP8 e5m2 numbers -- __hip_fp8x4_e5m2: holds 4 values of OCP FP8 e5m2 numbers -- __hip_fp8x2_e4m3_fnuz: holds 2 values of FP8 fnuz e4m3 numbers -- __hip_fp8x4_e4m3_fnuz: holds 4 values of FP8 fnuz e4m3 numbers -- __hip_fp8x2_e5m2_fnuz: holds 2 values of FP8 fnuz e5m2 numbers -- __hip_fp8x4_e5m2_fnuz: holds 4 values of FP8 fnuz e5m2 numbers +- ``__hip_fp8x2_e4m3``: holds 2 values of OCP FP8 e4m3 numbers +- ``__hip_fp8x4_e4m3``: holds 4 values of OCP FP8 e4m3 numbers +- ``__hip_fp8x2_e5m2``: holds 2 values of OCP FP8 e5m2 numbers +- ``__hip_fp8x4_e5m2``: holds 4 values of OCP FP8 e5m2 numbers +- ``__hip_fp8x2_e4m3_fnuz``: holds 2 values of FP8 fnuz e4m3 numbers +- ``__hip_fp8x4_e4m3_fnuz``: holds 4 values of FP8 fnuz e4m3 numbers +- ``__hip_fp8x2_e5m2_fnuz``: holds 2 values of FP8 fnuz e5m2 numbers +- ``__hip_fp8x4_e5m2_fnuz``: holds 4 values of FP8 fnuz e5m2 numbers FNUZ extensions will be available on gfx94x only. -FP16 (Half Precision) -===================== +Float16 (Half Precision) +======================== -FP16 (Floating Point 16-bit) numbers offer a balance between precision and +``float16`` (Floating Point 16-bit) numbers offer a balance between precision and efficiency, making them a widely adopted standard for accelerating deep learning inference. With higher precision than FP8 but lower memory requirements than FP32, -FP16 enables faster computations while preserving model accuracy. +``float16`` enables faster computations while preserving model accuracy. Deep learning workloads often involve massive datasets and complex calculations, -making FP32 computationally expensive. FP16 helps mitigate these costs by reducing +making FP32 computationally expensive. ``float16`` helps mitigate these costs by reducing storage and bandwidth demands, allowing for increased throughput without significant loss of numerical stability. This format is particularly useful for training and inference in GPUs and TPUs optimized for half-precision arithmetic. -There are two primary FP16 formats: - -- **float16 Format** - - - Sign: 1 bit - - Exponent: 5 bits - - Mantissa: 10 bits - -- **bfloat16 Format** +Float16 Format +-------------- - - Sign: 1 bit - - Exponent: 8 bits - - Mantissa: 7 bits +The ``float16`` format uses the following bit allocation: -The float16 format offers higher precision with a narrower range, while the bfloat16 -format provides a wider range at the cost of some precision. +- **Sign**: 1 bit +- **Exponent**: 5 bits +- **Mantissa**: 10 bits -Additionally, FP16 numbers have standardized representations developed by industry -initiatives to ensure compatibility across various hardware and software implementations. -Unlike FP8, which has specific representations like OCP and FNUZ, FP16 is more uniformly -supported with its two main formats, float16 and bfloat16. +This format offers higher precision with a narrower range compared to ``bfloat16``. HIP Header ---------- The `HIP FP16 header `_ -defines the float16 format. +defines the ``float16`` format. -The `HIP BF16 header `_ -defines the bfloat16 format. +Device Compatibility +-------------------- -Supported Devices ------------------ - -Different GPU models support different FP16 formats. Here's a breakdown: - -.. 
list-table:: Supported devices for fp16 numbers - :header-rows: 1 - - * - Device Type - - float16 - - bfloat16 - * - Host - - Yes - - Yes - * - CDNA1 - - Yes - - Yes - * - CDNA2 - - Yes - - Yes - * - CDNA3 - - Yes - - Yes - * - RDNA2 - - Yes - - Yes - * - RDNA3 - - Yes - - Yes +This precision format is supported across all GPU architectures. The HIP types and functions +are available for use in both host and device code, with implementation handled by the +compiler and device libraries. -Using FP16 Numbers in HIP Programs ----------------------------------- +Using Float16 Numbers in HIP Programs +------------------------------------- -To use the FP16 numbers inside HIP programs. +To use ``float16`` numbers inside HIP programs: .. code-block:: cpp #include // for float16 - #include // for bfloat16 -The following code example adds two float16 values on the GPU and compares the results +The following code example adds two ``float16`` values on the GPU and compares the results against summed float values on the CPU. .. code-block:: cpp @@ -764,8 +744,8 @@ against summed float values on the CPU. int idx = threadIdx.x; if (idx < size) { // Load as half, perform addition in float, store as float - float sum = __half2float(in1[idx] + in2[idx]); - out[idx] = sum; + __half sum = in1[idx] + in2[idx]; + out[idx] = __half2float(sum); } } @@ -776,8 +756,8 @@ against summed float values on the CPU. // Initialize input vectors as floats std::vector in1(size), in2(size); for (size_t i = 0; i < size; i++) { - in1[i] = i + 1.1f; - in2[i] = i + 2.2f; + in1[i] = i + 0.5f; + in2[i] = i + 0.5f; } // Compute expected results in full precision on CPU @@ -825,22 +805,173 @@ against summed float values on the CPU. } std::cout << "Success: CPU and GPU half-precision addition match within tolerance!" << std::endl; + + return 0; } +C++ Style Classes +----------------- -There are C++ style classes available as well. +Float16 numbers can be used with C++ style classes: .. code-block:: cpp __half fp16_val(1.1f); // float16 - __hip_bfloat16 fp16_val(1.1f); // bfloat16 -Each type of FP16 number has its own class: +Vector Support +-------------- + +There is support for vectors of float16 types: + +- ``__half2``: holds 2 values of float16 numbers + +BFloat16 (Brain float 16-bit precision) +======================================= + +``bfloat16`` (Brain Floating Point 16-bit) is a truncated version of the 32-bit IEEE 754 +single-precision floating-point format. Originally developed by Google for machine +learning applications, ``bfloat16`` provides a good balance between range and precision +for neural network computations. + +``bfloat16`` is particularly well-suited for deep learning workloads because it maintains +the same exponent range as FP32, making it less prone to overflow and underflow issues +during training. This format sacrifices some precision compared to float16 but offers +better numerical stability for many AI applications. + +BFloat16 Format +--------------- + +The ``bfloat16`` format uses the following bit allocation: + +- **Sign**: 1 bit +- **Exponent**: 8 bits +- **Mantissa**: 7 bits + +This format provides a wider range at the cost of some precision compared to ``float16``. + +HIP Header +---------- + +The `HIP BF16 header `_ +defines the ``bfloat16`` format. + +Device Compatibility +-------------------- + +This precision format is supported across all GPU architectures. 
The HIP types and functions +are available for use in both host and device code, with implementation handled by the +compiler and device libraries. + +Using ``bfloat16`` Numbers in HIP Programs +------------------------------------------ + +To use ``bfloat16`` numbers inside HIP programs: + +.. code-block:: cpp + + #include // for bfloat16 + +The following code example demonstrates basic ``bfloat16`` operations: + +.. code-block:: cpp + + #include + #include + #include + #include + + #define hip_check(hip_call) \ + { \ + auto hip_res = hip_call; \ + if (hip_res != hipSuccess) { \ + std::cerr << "Failed in HIP call: " << #hip_call \ + << " at " << __FILE__ << ":" << __LINE__ \ + << " with error: " << hipGetErrorString(hip_res) << std::endl; \ + std::abort(); \ + } \ + } + + __global__ void add_bfloat16(__hip_bfloat16* in1, __hip_bfloat16* in2, float* out, size_t size) { + int idx = threadIdx.x; + if (idx < size) { + // Load as bfloat16, perform addition, convert to float for output + __hip_bfloat16 sum = in1[idx] + in2[idx]; + out[idx] = __bfloat162float(sum); + } + } + + int main() { + constexpr size_t size = 32; + constexpr float tolerance = 1e-1f; // Allowable numerical difference + + // Initialize input vectors as floats + std::vector in1(size), in2(size); + for (size_t i = 0; i < size; i++) { + in1[i] = i + 0.5f; + in2[i] = i + 0.5f; + } + + // Compute expected results in full precision on CPU + std::vector cpu_out(size); + for (size_t i = 0; i < size; i++) { + cpu_out[i] = in1[i] + in2[i]; // Direct float addition + } + + // Allocate device memory (store input as bfloat16, output as float) + __hip_bfloat16 *d_in1, *d_in2; + float *d_out; + hip_check(hipMalloc(&d_in1, sizeof(__hip_bfloat16) * size)); + hip_check(hipMalloc(&d_in2, sizeof(__hip_bfloat16) * size)); + hip_check(hipMalloc(&d_out, sizeof(float) * size)); + + // Convert input to bfloat16 and copy to device + std::vector<__hip_bfloat16> in1_bf16(size), in2_bf16(size); + for (size_t i = 0; i < size; i++) { + in1_bf16[i] = __float2bfloat16(in1[i]); + in2_bf16[i] = __float2bfloat16(in2[i]); + } + + hip_check(hipMemcpy(d_in1, in1_bf16.data(), sizeof(__hip_bfloat16) * size, hipMemcpyHostToDevice)); + hip_check(hipMemcpy(d_in2, in2_bf16.data(), sizeof(__hip_bfloat16) * size, hipMemcpyHostToDevice)); + + // Launch kernel + add_bfloat16<<<1, size>>>(d_in1, d_in2, d_out, size); + + // Copy result back to host + std::vector gpu_out(size, 0.0f); + hip_check(hipMemcpy(gpu_out.data(), d_out, sizeof(float) * size, hipMemcpyDeviceToHost)); + + // Free device memory + hip_check(hipFree(d_in1)); + hip_check(hipFree(d_in2)); + hip_check(hipFree(d_out)); + + // Validation with tolerance + for (size_t i = 0; i < size; i++) { + if (std::fabs(cpu_out[i] - gpu_out[i]) > tolerance) { + std::cerr << "Mismatch at index " << i << ": CPU result = " << cpu_out[i] + << ", GPU result = " << gpu_out[i] << std::endl; + std::abort(); + } + } + + std::cout << "Success: CPU and GPU bfloat16 addition match within tolerance!" << std::endl; + + return 0; + } + +C++ Style Classes +----------------- + +``bfloat16`` numbers can be used with C++ style classes: + +.. code-block:: cpp + + __hip_bfloat16 bf16_val(1.1f); // bfloat16 -- __half -- __hip_bfloat16 +Vector Support +-------------- -There is support of vector of FP16 types. 
+There is support for vectors of bfloat16 types: -- __half2: holds 2 values of float16 numbers -- __hip_bfloat162: holds 2 values of bfloat16 numbers +- ``__hip_bfloat162``: holds 2 values of bfloat16 numbers From 33b2f42b1b0016f6f41482a9f8957d3cb9ba67e3 Mon Sep 17 00:00:00 2001 From: Jatin Chaudhary Date: Fri, 18 Jul 2025 14:28:38 +0100 Subject: [PATCH 26/36] Update hipExt Documentation Change-Id: I6e664a7ee4b2ca5de9f5353e2220fbeba3b94f84 --- docs/reference/low_fp_types.rst | 366 ++++++++++++++++++++++++++++++++ 1 file changed, 366 insertions(+) diff --git a/docs/reference/low_fp_types.rst b/docs/reference/low_fp_types.rst index 65b296a733..e70392e5fa 100644 --- a/docs/reference/low_fp_types.rst +++ b/docs/reference/low_fp_types.rst @@ -975,3 +975,369 @@ Vector Support There is support for vectors of bfloat16 types: - ``__hip_bfloat162``: holds 2 values of bfloat16 numbers + +HIP Extensions +============== + +HIP also provides some extensions APIs for microscaling formats. These are supported on AMD GPUs. `gfx950` provides hardware acceleration for hip extensions. Infact most APIs are 1 to 1 mapping of hardware instruction. + +Scale is also an input to the APIs. Scale is defined as type `__amd_scale_t` and is of format E8M0. + +hipExt Types +============ + +hipExt microscaling APIs introduce a bunch of types which are used throughout the set of APIs. + +.. list-table:: Types + :header-rows: 1 + + * - Types + - Notes + * - `__amd_scale_t` + - Store scale type which stores a value of E8M0. + * - `__amd_fp8_storage_t` + - Store a single fp8 value. + * - `__amd_fp8x2_storage_t` + - Store 2 packed fp8 value. + * - `__amd_fp8x8_storage_t` + - Store 8 packed fp8 value. + * - `__amd_fp4x2_storage_t` + - Store 2 packed fp4 value. + * - `__amd_fp4x8_storage_t` + - Store 8 packed fp4 value. + * - `__amd_bf16_storage_t` + - Store a single bf16 value. + * - `__amd_bf16x2_storage_t` + - Store 2 packed bf16 value. + * - `__amd_bf16x8_storage_t` + - Store 8 packed bf16 value. + * - `__amd_bf16x32_storage_t` + - Store 32 packed bf16 value. + * - `__amd_fp16_storage_t` + - Store a single fp16 value. + * - `__amd_fp16x2_storage_t` + - Store 2 packed fp16 value. + * - `__amd_fp16x8_storage_t` + - Store 8 packed fp16 value. + * - `__amd_fp16x32_storage_t` + - Store 32 packed fp16 value. + * - `__amd_floatx2_storage_t` + - Store 2 packed float value. + * - `__amd_floatx8_storage_t` + - Store 8 packed float value. + * - `__amd_floatx16_storage_t` + - Store 16 packed float value. + * - `__amd_floatx32_storage_t` + - Store 32 packed float value. + * - `__amd_fp6x32_storage_t` + - Store 32 packed fp6 value. + * - `__amd_shortx2_storage_t` + - Store 2 packed short value. + +C-APIs +====== + +The naming style of C API is as follows: + +All APIs start with `__amd`. +`_`: is used as a separator. +`cvt`: means convert i.e. convert from one format to another. +`sr`: if an API name has sr in it, means it will do stochastic rounding and will expect an input as seed. +`scale`: if an API has scale in it, means it will scale the values based on the `__amd_scale_t` input. + +`create`: The following APIs will be used to create composite types from smaller values +`extract`: The following set of APIs will extract out individual values from a composite type. + +Example: +`__amd_cvt_fp8x8_to_bf16x8_scale` : this API converts 8-packed fp8 values to 8 packed bf16. This will also accept input of scale to do the conversion. 
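
To make the naming convention concrete, the following device-side sketch wraps
this conversion. It relies only on the signature listed in the C API table that
follows; the helper name and the assumption that the hipExt extension header
declaring the ``__amd_*`` storage types is already included are illustrative and
not part of the documented API.

.. code-block:: cpp

   // Illustrative sketch: assumes the hipExt extension header that declares the
   // __amd_* storage types and __amd_cvt_fp8x8_to_bf16x8_scale has been included,
   // and that the code targets hardware with native support (gfx950).
   __device__ __amd_bf16x8_storage_t
   dequantize_fp8x8(const __amd_fp8x8_storage_t packed,
                    const __amd_fp8_interpretation_t interp,
                    const __amd_scale_t scale) {
       // Converts 8 packed FP8 values (interpreted as E4M3 or E5M2 according to
       // interp) to 8 packed BF16 values, applying the E8M0 scale on conversion.
       return __amd_cvt_fp8x8_to_bf16x8_scale(packed, interp, scale);
   }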
+ +`__amd_extract_fp8x2` : this API will extract out a 2 packed fp8 value from 8 packed fp8 value based on index. Example of 8-packed fp8: `{a:{fp8, fp8}, b:{fp8, fp8}, c:{fp8, fp8}, d:{fp8, fp8}}` based on index 0, 1, 2 or 3 the API will return `a`, `b`, `c` or `d` respectively. +`__amd_create_fp8x8` : this API will create 8 packed fp8 value from 4 inputs of 2 packed fp8 values. + +.. list-table:: C APIs + :header-rows: 1 + + * - API + - Notes + * - `float __amd_cvt_fp8_to_float(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t)` + - Convert a fp8 value to float. + * - `__amd_fp8_storage_t __amd_cvt_float_to_fp8_sr(const float, const __amd_fp8_interpretation_t, const unsigned int /* sr seed */)` + - Convert a float to fp8 value with stochastic rounding, seed is passed as unsigned int argument. + * - `float __amd_cvt_fp8_to_float_scale(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert a fp8 value to float with scale. + * - `float __amd_cvt_fp8_to_float_scale(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert a fp8 value to float with scale. + * - `__amd_floatx2_storage_t __amd_cvt_fp8x2_to_floatx2(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t)` + - Convert 2 packed fp8 value to 2 packed float. + * - `__amd_fp8x2_storage_t __amd_cvt_floatx2_to_fp8x2(const __amd_floatx2_storage_t, const __amd_fp8_interpretation_t)` + - Convert 2 packed float value to 2 packed fp8. + * - `__amd_fp4x2_storage_t __amd_cvt_floatx2_to_fp4x2_sr_scale(const __amd_floatx2_storage_t, const __amd_fp4_interpretation_t, const unsigned int /* sr seed */, const __amd_scale_t)` + - Convert 2 packed float value to 2 packed fp4 with stochastic rounding and scale. + * - `__amd_floatx2_storage_t __amd_cvt_fp4x2_to_floatx2_scale(const __amd_fp4x2_storage_t , const __amd_fp4_interpretation_t, const __amd_scale_t)` + - Convert 2 packed fp4 value to 2 packed float with scale. + * - `__amd_fp4x2_storage_t __amd_cvt_floatx2_to_fp4x2_scale(const __amd_floatx2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + - Convert 2 packed float value to 2 packed fp4 with scale. + * - `__amd_floatx2_storage_t __amd_cvt_fp8x2_to_floatx2_scale(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert 2 packed fp8 value to 2 packed float with scale. + * - `__amd_fp8x2_storage_t __amd_cvt_floatx2_to_fp8x2_scale(const __amd_floatx2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert 2 packed float value to 2 packed fp8 with scale. + * - `__amd_fp6x32_storage_t __amd_cvt_bf16x32_to_fp6x32_scale(const __amd_bf16x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + - Convert 32 packed bf16 value to 32 packed fp6 with scale. + * - `__amd_fp6x32_storage_t __amd_cvt_fp16x32_to_fp6x32_scale(const __amd_fp16x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + - Convert 32 packed fp16 value to 32 packed fp6 with scale. + * - `__amd_fp16x2_storage_t __amd_cvt_fp8x2_to_fp16x2_scale(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert 2 packed fp8 value to 2 packed fp16 with scale. + * - `__amd_fp16x8_storage_t __amd_cvt_fp8x8_to_fp16x8_scale(const __amd_fp8x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert 8 packed fp8 value to 8 packed fp16 with scale. 
+ * - `__amd_bf16x2_storage_t __amd_cvt_fp8x2_to_bf16x2_scale(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert 2 packed fp8 value to 2 packed bf16 with scale. + * - `__amd_bf16x2_storage_t __amd_cvt_fp8x2_to_bf16x2_scale(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert 2 packed fp8 value to 2 packed bf16 with scale. + * - `__amd_bf16x8_storage_t __amd_cvt_fp8x8_to_bf16x8_scale(const __amd_fp8x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert 8 packed fp8 value to 8 packed bf16 with scale. + * - `__amd_fp16x32_storage_t __amd_cvt_fp6x32_to_fp16x32_scale(const __amd_fp6x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + - Convert 32 packed fp6 value to 32 packed fp16 with scale. + * - `__amd_bf16x32_storage_t __amd_cvt_fp6x32_to_bf16x32_scale(const __amd_fp6x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + - Convert 32 packed fp6 value to 32 packed bf16 with scale. + * - `__amd_floatx32_storage_t __amd_cvt_fp6x32_to_floatx32_scale(const __amd_fp6x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + - Convert 32 packed fp6 value to 32 packed float with scale. + * - `__amd_fp16x2_storage_t __amd_cvt_fp4x2_to_fp16x2_scale(const __amd_fp4x2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + - Convert 2 packed fp4 value to 2 packed fp16 with scale. + * - `__amd_fp16x8_storage_t __amd_cvt_fp4x8_to_fp16x8_scale(const __amd_fp4x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + - Convert 8 packed fp4 value to 8 packed fp16 with scale. + * - `__amd_bf16x2_storage_t __amd_cvt_fp4x2_to_bf16x2_scale(const __amd_fp4x2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + - Convert 2 packed fp4 value to 2 packed bf16 with scale. + * - `__amd_bf16x8_storage_t __amd_cvt_fp4x8_to_bf16x8_scale(const __amd_fp4x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + - Convert 8 packed fp4 value to 8 packed bf16 with scale. + * - `__amd_floatx8_storage_t __amd_cvt_fp4x8_to_floatx8_scale(const __amd_fp4x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + - Convert 8 packed fp4 value to 8 packed float with scale. + * - `__amd_fp4x8_storage_t __amd_cvt_floatx8_to_fp4x8_scale(const __amd_floatx8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + - Convert 8 packed float value to 8 packed fp4 with scale. + * - `__amd_fp8x2_storage_t __amd_cvt_fp16x2_to_fp8x2_scale(const __amd_fp16x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert 2 packed fp16 value to 2 packed fp8 with scale. + * - `__amd_fp8x2_storage_t __amd_cvt_bf16x2_to_fp8x2_scale(const __amd_bf16x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert 2 packed bf16 value to 2 packed fp8 with scale. + * - `__amd_fp8x8_storage_t __amd_cvt_bf16x8_to_fp8x8_scale(const __amd_bf16x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert 8 packed bf16 value to 8 packed fp8 with scale. + * - `__amd_floatx8_storage_t __amd_cvt_fp8x8_to_floatx8_scale(const __amd_fp8x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert 8 packed fp8 value to 8 packed float with scale. + * - `__amd_fp16_storage_t __amd_cvt_fp8_to_fp16_scale(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert a fp8 value to fp16 with scale. 
+ * - `__amd_bf16_storage_t __amd_cvt_fp8_to_bf16_scale(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert a fp8 value to bf16 with scale. + * - `__amd_fp6x32_storage_t __amd_cvt_floatx16_floatx16_to_fp6x32_scale(const __amd_floatx16_storage_t, const __amd_floatx16_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + - Convert 2 inputs of 16-packed float values to 32 packed fp6 with scale. + * - `__amd_fp6x32_storage_t __amd_cvt_floatx32_to_fp6x32_scale(const __amd_floatx32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + - Convert 32 packed float values to 32 packed fp6 with scale. + * - `__amd_fp6x32_storage_t __amd_cvt_floatx32_to_fp6x32_sr_scale(const __amd_floatx32_storage_t, const __amd_fp6_interpretation_t, const unsigned int, const __amd_scale_t)` + - Convert 32 packed float values to 32 packed fp6 with stochastic rounding and scale. + * - `__amd_fp16_storage_t __amd_cvt_float_to_fp16_sr(const float, const unsigned int)` + - Convert a float value to fp16 with stochastic rounding. + * - `__amd_fp16x2_storage_t __amd_cvt_float_float_to_fp16x2_sr(const float, const float, const unsigned int)` + - Convert two inputs of float to 2 packed fp16 with stochastic rounding. + * - `__amd_bf16_storage_t __amd_cvt_float_to_bf16_sr(const float, const unsigned int)` + - Convert a float value to bf16 with stochastic rounding. + * - `__amd_fp6x32_storage_t __amd_cvt_fp16x32_to_fp6x32_sr_scale(const __amd_fp16x32_storage_t, const __amd_fp6_interpretation_t, const unsigned int, const __amd_scale_t)` + - Convert 32 packed fp16 values to 32 packed fp6 with stochastic rounding and scale. + * - `__amd_fp6x32_storage_t __amd_cvt_bf16x32_to_fp6x32_sr_scale(const __amd_bf16x32_storage_t, const __amd_fp6_interpretation_t, const unsigned int, const __amd_scale_t)` + - Convert 32 packed bf16 values to 32 packed fp6 with stochastic rounding and scale. + * - `__amd_fp4x2_storage_t __amd_cvt_bf16x2_to_fp4x2_scale(const __amd_bf16x2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + - Convert 2 packed bf16 value to 2 packed fp4 with scale. + * - `__amd_fp4x8_storage_t __amd_cvt_bf16x8_to_fp4x8_scale(const __amd_bf16x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + - Convert 8 packed bf16 value to 8 packed fp4 with scale. + * - `__amd_fp4x2_storage_t __amd_cvt_fp16x2_to_fp4x2_scale(const __amd_fp16x2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + - Convert 2 packed fp16 value to 2 packed fp4 with scale. + * - `__amd_fp4x8_storage_t __amd_cvt_fp16x8_to_fp4x8_scale(const __amd_fp16x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + - Convert 8 packed fp16 value to 8 packed fp4 with scale. + * - `__amd_fp4x8_storage_t __amd_cvt_floatx8_to_fp4x8_sr_scale(const __amd_floatx8_storage_t, const __amd_fp4_interpretation_t, const unsigned int, const __amd_scale_t)` + - Convert 8 packed float values to 8 packed fp4 with stochastic rounding and scale. + * - `_amd_fp4x2_storage_t __amd_cvt_bf16x2_to_fp4x2_sr_scale(const __amd_bf16x2_storage_t, const __amd_fp4_interpretation_t, const unsigned int,const __amd_scale_t)` + - Convert 2 packed bf16 value to 2 packed fp4 with stochastic rounding and scale. + * - `__amd_fp4x8_storage_t __amd_cvt_bf16x8_to_fp4x8_sr_scale(const __amd_bf16x8_storage_t, const __amd_fp4_interpretation_t, const unsigned int, const __amd_scale_t)` + - Convert 8 packed bf16 value to 8 packed fp4 with stochastic rounding and scale. 
+ * - `__amd_fp4x2_storage_t __amd_cvt_fp16x2_to_fp4x2_sr_scale(const __amd_fp16x2_storage_t, const __amd_fp4_interpretation_t, const unsigned int, const __amd_scale_t)` + - Convert 2 packed fp16 value to 2 packed fp4 with stochastic rounding and scale. + * - `__amd_fp4x8_storage_t __amd_cvt_fp16x8_to_fp4x8_sr_scale(const __amd_fp16x8_storage_t , const __amd_fp4_interpretation_t, const unsigned int, const __amd_scale_t)` + - Convert 8 packed fp16 values to 8 packed fp4 with stochastic rounding and scale. + * - `__amd_fp8x8_storage_t __amd_cvt_floatx8_to_fp8x8_sr_scale(const __amd_floatx8_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)` + - Convert 8 packed float values to 8 packed fp8 with stochastic rounding and scale. + * - `__amd_fp8_storage_t __amd_cvt_fp16_to_fp8_sr_scale(const __amd_fp16_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)` + - Convert a fp16 value to fp8 with stochastic rounding and scale. + * - `__amd_fp8x8_storage_t __amd_cvt_fp16x8_to_fp8x8_sr_scale(const __amd_fp16x8_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)` + - Convert 8 packed fp16 values to 8 packed fp8 with stochastic rounding and scale. + * - `__amd_fp8_storage_t __amd_cvt_bf16_to_fp8_sr_scale(const __amd_bf16_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)` + - Convert a bf16 value to fp8 with stochastic rounding and scale. + * - `__amd_fp8x8_storage_t __amd_cvt_bf16x8_to_fp8x8_sr_scale(const __amd_bf16x8_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)` + - Convert 8 packed bf16 values to 8 packed fp8 with stochastic rounding and scale. + * - `__amd_fp16_storage_t __amd_cvt_fp8_to_fp16(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t)` + - Convert a fp8 value to fp16. + * - `__amd_fp16x2_storage_t __amd_cvt_fp8x2_to_fp16x2(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t)` + - Convert 2 packed fp8 value to 2 packed fp16. + * - `__amd_fp8x2_storage_t __amd_cvt_fp16x2_to_fp8x2(const __amd_fp16x2_storage_t, const __amd_fp8_interpretation_t)` + - Convert 2 packed fp16 value to 2 packed fp8. + * - `__amd_fp8x8_storage_t __amd_cvt_fp16x8_to_fp8x8_scale(const __amd_fp16x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert 8 packed fp16 values to 8 packed fp8 with scale. + * - `__amd_fp8x8_storage_t __amd_cvt_floatx8_to_fp8x8_scale(const __amd_floatx8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + - Convert 8 packed float values to 8 packed fp8 with scale. + * - `__amd_fp8_storage_t __amd_cvt_fp16_to_fp8_sr(const __amd_fp16_storage_t, const __amd_fp8_interpretation_t, const short)` + - Convert a fp16 value to fp8 with stochastic rounding. + * - `float2 __amd_cvt_floatx2_to_float2(const __amd_floatx2_storage_t)` + - Convert 2 packed float value to hip's float2 type. + * - `__half __amd_cvt_fp16_to_half(const __amd_fp16_storage_t)` + - Convert fp16 type to hip's __half type. + * - `__half2 __amd_cvt_fp16x2_to_half2(const __amd_fp16x2_storage_t)` + - Convert 2 packed fp16 type to hip's __half2 type. + * - `__amd_fp16_storage_t __amd_cvt_half_to_fp16(const __half)` + - Convert hip's __half type to fp16 type. + * - `__amd_fp16x2_storage_t __amd_cvt_half2_to_fp16x2(const __half2)` + - Convert hip's __half2 type to 2 packed fp16. + * - `__hip_bfloat16 __amd_cvt_bf16_to_hipbf16(const __amd_bf16_storage_t)` + - Convert bf16 type to __hip_bfloat16 type. 
+ * - `__hip_bfloat162 __amd_cvt_bf16x2_to_hipbf162(const __amd_bf16x2_storage_t)` + - Convert 2 packed bf16 type to __hip_bfloat162 type. + * - `__amd_bf16_storage_t __amd_cvt_hipbf16_to_bf16(const __hip_bfloat16)` + - Convert __hip_bfloat16 to bf16 type. + * - `__amd_bf16x2_storage_t __amd_cvt_hipbf162_to_bf16x2(const __hip_bfloat162)` + - Convert __hip_bfloat162 to 2 packed bf16 type. + +HIP EXT C++ API +=============== + +There are C++ data structures also available. These are different from one in `` header. These APIs expose a wider capability set which are exclusive to `gfx950`. + +HIP EXT FP8 E4M3: + +.. code-block:: c + + struct __hipext_ocp_fp8_e4m3 { + // Constructor + __host__ __device__ __hipext_ocp_fp8_e4m3(const float); // Create fp8 e4m3 from float + __host__ __device__ __hipext_ocp_fp8_e4m3(const float, const unsigned int /* sr seed */); // Create fp8 e4m3 from float with stochastic rounding + __host__ __device__ __hipext_ocp_fp8_e4m3(const float, const unsigned int /* sr seed */, const __amd_scale_t /* scale */); // Create fp8 e4m3 from float with stochastic rounding and scale + __host__ __device__ __hipext_ocp_fp8_e4m3(const __amd_fp16_storage_t, const unsigned int /* sr seed */, const __amd_scale_t /* scale */); // Create fp8 e4m3 from fp16 with scale + __host__ __device__ __hipext_ocp_fp8_e4m3(const __amd_bf16_storage_t, const unsigned int /* sr seed */, const __amd_scale_t /* scale */); // Create fp8 e4m3 from bf16 with scale + + // Getters + __host__ __device__ __amd_fp16_storage_t get_scaled_fp16(const __amd_scale_t /* scale */) const; // get scaled fp16 value + __host__ __device__ __amd_bf16_storage_t get_scaled_bf16(const __amd_scale_t /* scale */) const; // get scaled bf16 value + __host__ __device__ float get_scaled_float(const __amd_scale_t /* scale */) const; // get scaled float value + + // Operators + __host__ __device__ operator float() const; // get a float value + }; + +HIP EXT FP8 E5M2: + +.. code-block:: c + + struct __hipext_ocp_fp8_e5m2 { + // Constructor + __host__ __device__ __hipext_ocp_fp8_e5m2(const float); // Create fp8 e4m3 from float + __host__ __device__ __hipext_ocp_fp8_e5m2(const float, const unsigned int /* sr seed */); // Create fp8 e4m3 from float with stochastic rounding + __host__ __device__ __hipext_ocp_fp8_e5m2(const float, const unsigned int /* sr seed */, const __amd_scale_t /* scale */); // Create fp8 e4m3 from float with stochastic rounding and scale + __host__ __device__ __hipext_ocp_fp8_e5m2(const __amd_fp16_storage_t, const unsigned int /* sr seed */, const __amd_scale_t /* scale */); // Create fp8 e4m3 from fp16 with scale + __host__ __device__ __hipext_ocp_fp8_e5m2(const __amd_bf16_storage_t, const unsigned int /* sr seed */, const __amd_scale_t /* scale */); // Create fp8 e4m3 from bf16 with scale + + // Getters + __host__ __device__ __amd_fp16_storage_t get_scaled_fp16(const __amd_scale_t /* scale */) const; // get scaled fp16 value + __host__ __device__ __amd_bf16_storage_t get_scaled_bf16(const __amd_scale_t /* scale */) const; // get scaled bf16 value + __host__ __device__ float get_scaled_float(const __amd_scale_t /* scale */) const; // get scaled float value + + // Operators + __host__ __device__ operator float() const; // get a float value + }; + +HIP EXT 2 Packed FP8 E4M3 + +.. 
code-block:: c + + struct __hipext_ocp_fp8x2_e4m3 { + __host__ __device__ __hipext_ocp_fp8x2_e4m3(const float, const float); // Create fp8x2 from two floats + __host__ __device__ __hipext_ocp_fp8x2_e4m3(const __amd_floatx2_storage_t); // Create fp8x2 from 2 packed floats + __host__ __device__ __hipext_ocp_fp8x2_e4m3(const __amd_floatx2_storage_t, __amd_scale_t /* scale */); // Create fp8x2 from 2 packed floats with scale + __host__ __device__ __hipext_ocp_fp8x2_e4m3(const __amd_fp16x2_storage_t, const __amd_scale_t /* scale */); // Create fp8x2 from 2 packed fp16 with scale + __host__ __device__ __hipext_ocp_fp8x2_e4m3(const __amd_bf16x2_storage_t, const __amd_scale_t /* scale */); // Create fp8x2 from 2 packed bf16 with scale + + // Getters + __host__ __device__ __amd_fp16x2_storage_t get_scaled_fp16x2(const __amd_scale_t) const; // Get scaled 2 packed fp16 + __host__ __device__ __amd_bf16x2_storage_t get_scaled_fp16x2(const __amd_scale_t) const; // Get scaled 2 packed fp16 + __host__ __device__ __amd_floatx2_storage_t get_scaled_floatx2(const __amd_scale_t scale)const; // Get scaled 2 packed float + + // Operators + __host__ __device__ operator __amd_floatx2_storage_t() const; // Get 2 packed float + }; + +HIP EXT 2 Packed FP8 E5M2 + +.. code-block:: c + + struct __hipext_ocp_fp8x2_e5m2 { + __host__ __device__ __hipext_ocp_fp8x2_e5m2(const float, const float); // Create fp8x2 from two floats + __host__ __device__ __hipext_ocp_fp8x2_e5m2(const __amd_floatx2_storage_t); // Create fp8x2 from 2 packed floats + __host__ __device__ __hipext_ocp_fp8x2_e5m2(const __amd_floatx2_storage_t, __amd_scale_t /* scale */); // Create fp8x2 from 2 packed floats with scale + __host__ __device__ __hipext_ocp_fp8x2_e5m2(const __amd_fp16x2_storage_t, const __amd_scale_t /* scale */); // Create fp8x2 from 2 packed fp16 with scale + __host__ __device__ __hipext_ocp_fp8x2_e5m2(const __amd_bf16x2_storage_t, const __amd_scale_t /* scale */); // Create fp8x2 from 2 packed bf16 with scale + + // Getters + __host__ __device__ __amd_fp16x2_storage_t get_scaled_fp16x2(const __amd_scale_t) const; // Get scaled 2 packed fp16 + __host__ __device__ __amd_bf16x2_storage_t get_scaled_fp16x2(const __amd_scale_t) const; // Get scaled 2 packed fp16 + __host__ __device__ __amd_floatx2_storage_t get_scaled_floatx2(const __amd_scale_t scale)const; // Get scaled 2 packed float + + // Operators + __host__ __device__ operator __amd_floatx2_storage_t() const; // Get 2 packed float + }; + +HIP EXT 32 packed FP6 E2M3 + +.. 
code-block:: c
+
+   struct __hipext_ocp_fp6x32_e2m3 {
+     __host__ __device__ __hipext_ocp_fp6x32_e2m3(const __amd_floatx16_storage_t, const __amd_floatx16_storage_t, const __amd_scale_t); // Create fp6x32 from two floatx16 with scale
+     __host__ __device__ __hipext_ocp_fp6x32_e2m3(const __amd_floatx32_storage_t, const unsigned int /* seed */, const __amd_scale_t); // Create fp6x32 from floatx32 with stochastic rounding and scale
+     __host__ __device__ __hipext_ocp_fp6x32_e2m3(const __amd_fp16x32_storage_t, const unsigned int /* seed */, const __amd_scale_t); // Create fp6x32 from fp16x32 with stochastic rounding and scale
+     __host__ __device__ __hipext_ocp_fp6x32_e2m3(const __amd_fp16x32_storage_t, const __amd_scale_t); // Create fp6x32 from fp16x32 with scale
+     __host__ __device__ __hipext_ocp_fp6x32_e2m3(const __amd_bf16x32_storage_t, const unsigned int /* seed */, const __amd_scale_t); // Create fp6x32 from bf16x32 with stochastic rounding and scale
+     __host__ __device__ __hipext_ocp_fp6x32_e2m3(const __amd_bf16x32_storage_t, const __amd_scale_t); // Create fp6x32 from bf16x32 with scale
+
+     // Getters
+     __host__ __device__ __amd_floatx32_storage_t get_scaled_floatx32(const __amd_scale_t) const; // Get Scaled floatx32
+     __host__ __device__ __amd_fp16x32_storage_t get_scaled_fp16x32(const __amd_scale_t) const; // Get Scaled fp16x32
+     __host__ __device__ __amd_bf16x32_storage_t get_scaled_bf16x32(const __amd_scale_t) const; // Get Scaled bf16x32
+   };
+
+HIP EXT 32 packed FP6 E3M2
+
+.. code-block:: c
+
+   struct __hipext_ocp_fp6x32_e3m2 {
+     __host__ __device__ __hipext_ocp_fp6x32_e3m2(const __amd_floatx16_storage_t, const __amd_floatx16_storage_t, const __amd_scale_t); // Create fp6x32 from two floatx16 with scale
+     __host__ __device__ __hipext_ocp_fp6x32_e3m2(const __amd_floatx32_storage_t, const unsigned int /* seed */, const __amd_scale_t); // Create fp6x32 from floatx32 with stochastic rounding and scale
+     __host__ __device__ __hipext_ocp_fp6x32_e3m2(const __amd_fp16x32_storage_t, const unsigned int /* seed */, const __amd_scale_t); // Create fp6x32 from fp16x32 with stochastic rounding and scale
+     __host__ __device__ __hipext_ocp_fp6x32_e3m2(const __amd_fp16x32_storage_t, const __amd_scale_t); // Create fp6x32 from fp16x32 with scale
+     __host__ __device__ __hipext_ocp_fp6x32_e3m2(const __amd_bf16x32_storage_t, const unsigned int /* seed */, const __amd_scale_t); // Create fp6x32 from bf16x32 with stochastic rounding and scale
+     __host__ __device__ __hipext_ocp_fp6x32_e3m2(const __amd_bf16x32_storage_t, const __amd_scale_t); // Create fp6x32 from bf16x32 with scale
+
+     // Getters
+     __host__ __device__ __amd_floatx32_storage_t get_scaled_floatx32(const __amd_scale_t) const; // Get Scaled floatx32
+     __host__ __device__ __amd_fp16x32_storage_t get_scaled_fp16x32(const __amd_scale_t) const; // Get Scaled fp16x32
+     __host__ __device__ __amd_bf16x32_storage_t get_scaled_bf16x32(const __amd_scale_t) const; // Get Scaled bf16x32
+   };
+
+HIP EXT 2 packed FP4
+
+.. 
code-block:: c + + struct __hipext_ocp_fp4x2_e2m1 { + __host__ __device__ __hipext_ocp_fp4x2_e2m1(const float, const float, const __amd_scale_t); // Create FP4x2 from two floats with scale + __host__ __device__ __hipext_ocp_fp4x2_e2m1(const __amd_floatx2_storage_t, const __amd_scale_t); // Create FP4x2 from floatx2 with scale + __host__ __device__ __hipext_ocp_fp4x2_e2m1(const __amd_bf16x2_storage_t, const __amd_scale_t); // Create FP4x2 from bf16x2 with scale + __host__ __device__ __hipext_ocp_fp4x2_e2m1(const __amd_fp16x2_storage_t, const __amd_scale_t); // Create FP4x2 from fp16x2 with scale + __host__ __device__ __hipext_ocp_fp4x2_e2m1(const __amd_floatx2_storage_t, const unsigned int, const __amd_scale_t); // Create FP4x2 from floatx2 with stochastic rounding and scale + __host__ __device__ __hipext_ocp_fp4x2_e2m1(const __amd_bf16x2_storage_t, const unsigned int, const __amd_scale_t); // Create FP4x2 from bf16x2 with stochastic rounding and scale + __host__ __device__ __hipext_ocp_fp4x2_e2m1(const __amd_fp16x2_storage_t, const unsigned int, const __amd_scale_t); // Create FP4x2 from fp16x2 with stochastic rounding and scale + + // Getters + __host__ __device__ __amd_floatx2_storage_t get_scaled_floatx2(const __amd_scale_t) const; // get scaled floatx2 + __host__ __device__ __amd_fp16x2_storage_t get_scaled_fp16x2(const __amd_scale_t) const; // Get scaled fp16x2 + __host__ __device__ __amd_bf16x2_storage_t get_scaled_bf16x2(const __amd_scale_t) const; // Get scaled bf16x2 + }; From 8badbcb6419a2106feb1c0ce8302171662b2b0fe Mon Sep 17 00:00:00 2001 From: Adel Johar Date: Mon, 21 Jul 2025 17:35:48 +0200 Subject: [PATCH 27/36] Docs: Cleanup of low_fp_type page --- .wordlist.txt | 3 + docs/reference/low_fp_types.rst | 208 ++++++++++++++++---------------- 2 files changed, 108 insertions(+), 103 deletions(-) diff --git a/.wordlist.txt b/.wordlist.txt index faa1107abf..3f595b2d63 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -70,6 +70,7 @@ HIP-Clang hipcc hipCtx hipexamine +hipExt hipified HIPify hipModule @@ -110,6 +111,7 @@ MALU maxregcount MiB memset +microscaling multicore multigrid multithreading @@ -165,6 +167,7 @@ sinewave sinf SOMA SPMV +sr structs SYCL syntaxes diff --git a/docs/reference/low_fp_types.rst b/docs/reference/low_fp_types.rst index e70392e5fa..d37808cd6f 100644 --- a/docs/reference/low_fp_types.rst +++ b/docs/reference/low_fp_types.rst @@ -979,9 +979,11 @@ There is support for vectors of bfloat16 types: HIP Extensions ============== -HIP also provides some extensions APIs for microscaling formats. These are supported on AMD GPUs. `gfx950` provides hardware acceleration for hip extensions. Infact most APIs are 1 to 1 mapping of hardware instruction. +HIP also provides some extensions APIs for microscaling formats. These are supported on AMD +GPUs. ``gfx950`` provides hardware acceleration for hip extensions. In fact most APIs are 1 to 1 +mapping of hardware instruction. -Scale is also an input to the APIs. Scale is defined as type `__amd_scale_t` and is of format E8M0. +Scale is also an input to the APIs. Scale is defined as type ``__amd_scale_t`` and is of format E8M0. hipExt Types ============ @@ -993,45 +995,45 @@ hipExt microscaling APIs introduce a bunch of types which are used throughout th * - Types - Notes - * - `__amd_scale_t` + * - ``__amd_scale_t`` - Store scale type which stores a value of E8M0. - * - `__amd_fp8_storage_t` + * - ``__amd_fp8_storage_t`` - Store a single fp8 value. 
- * - `__amd_fp8x2_storage_t` + * - ``__amd_fp8x2_storage_t`` - Store 2 packed fp8 value. - * - `__amd_fp8x8_storage_t` + * - ``__amd_fp8x8_storage_t`` - Store 8 packed fp8 value. - * - `__amd_fp4x2_storage_t` + * - ``__amd_fp4x2_storage_t`` - Store 2 packed fp4 value. - * - `__amd_fp4x8_storage_t` + * - ``__amd_fp4x8_storage_t`` - Store 8 packed fp4 value. - * - `__amd_bf16_storage_t` + * - ``__amd_bf16_storage_t`` - Store a single bf16 value. - * - `__amd_bf16x2_storage_t` + * - ``__amd_bf16x2_storage_t`` - Store 2 packed bf16 value. - * - `__amd_bf16x8_storage_t` + * - ``__amd_bf16x8_storage_t`` - Store 8 packed bf16 value. - * - `__amd_bf16x32_storage_t` + * - ``__amd_bf16x32_storage_t`` - Store 32 packed bf16 value. - * - `__amd_fp16_storage_t` + * - ``__amd_fp16_storage_t`` - Store a single fp16 value. - * - `__amd_fp16x2_storage_t` + * - ``__amd_fp16x2_storage_t`` - Store 2 packed fp16 value. - * - `__amd_fp16x8_storage_t` + * - ``__amd_fp16x8_storage_t`` - Store 8 packed fp16 value. - * - `__amd_fp16x32_storage_t` + * - ``__amd_fp16x32_storage_t`` - Store 32 packed fp16 value. - * - `__amd_floatx2_storage_t` + * - ``__amd_floatx2_storage_t`` - Store 2 packed float value. - * - `__amd_floatx8_storage_t` + * - ``__amd_floatx8_storage_t`` - Store 8 packed float value. - * - `__amd_floatx16_storage_t` + * - ``__amd_floatx16_storage_t`` - Store 16 packed float value. - * - `__amd_floatx32_storage_t` + * - ``__amd_floatx32_storage_t`` - Store 32 packed float value. - * - `__amd_fp6x32_storage_t` + * - ``__amd_fp6x32_storage_t`` - Store 32 packed fp6 value. - * - `__amd_shortx2_storage_t` + * - ``__amd_shortx2_storage_t`` - Store 2 packed short value. C-APIs @@ -1039,171 +1041,171 @@ C-APIs The naming style of C API is as follows: -All APIs start with `__amd`. -`_`: is used as a separator. -`cvt`: means convert i.e. convert from one format to another. -`sr`: if an API name has sr in it, means it will do stochastic rounding and will expect an input as seed. -`scale`: if an API has scale in it, means it will scale the values based on the `__amd_scale_t` input. +All APIs start with ``__amd``. +``_``: is used as a separator. +``cvt``: means convert i.e. convert from one format to another. +``sr``: if an API name has **sr** in it, means it will do stochastic rounding and will expect an input as seed. +``scale``: if an API has scale in it, means it will scale the values based on the ``__amd_scale_t`` input. -`create`: The following APIs will be used to create composite types from smaller values -`extract`: The following set of APIs will extract out individual values from a composite type. +``create``: The following APIs will be used to create composite types from smaller values +``extract``: The following set of APIs will extract out individual values from a composite type. Example: -`__amd_cvt_fp8x8_to_bf16x8_scale` : this API converts 8-packed fp8 values to 8 packed bf16. This will also accept input of scale to do the conversion. +``__amd_cvt_fp8x8_to_bf16x8_scale`` : this API converts 8-packed fp8 values to 8 packed bf16. This will also accept input of scale to do the conversion. -`__amd_extract_fp8x2` : this API will extract out a 2 packed fp8 value from 8 packed fp8 value based on index. Example of 8-packed fp8: `{a:{fp8, fp8}, b:{fp8, fp8}, c:{fp8, fp8}, d:{fp8, fp8}}` based on index 0, 1, 2 or 3 the API will return `a`, `b`, `c` or `d` respectively. -`__amd_create_fp8x8` : this API will create 8 packed fp8 value from 4 inputs of 2 packed fp8 values. 
+``__amd_extract_fp8x2`` : this API will extract out a 2 packed fp8 value from 8 packed fp8 value based on index. Example of 8-packed fp8: ``{a:{fp8, fp8}, b:{fp8, fp8}, c:{fp8, fp8}, d:{fp8, fp8}}`` based on index 0, 1, 2 or 3 the API will return ``a``, ``b``, ``c`` or ``d`` respectively. +``__amd_create_fp8x8`` : this API will create 8 packed fp8 value from 4 inputs of 2 packed fp8 values. .. list-table:: C APIs :header-rows: 1 * - API - Notes - * - `float __amd_cvt_fp8_to_float(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t)` + * - ``float __amd_cvt_fp8_to_float(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t)`` - Convert a fp8 value to float. - * - `__amd_fp8_storage_t __amd_cvt_float_to_fp8_sr(const float, const __amd_fp8_interpretation_t, const unsigned int /* sr seed */)` + * - ``__amd_fp8_storage_t __amd_cvt_float_to_fp8_sr(const float, const __amd_fp8_interpretation_t, const unsigned int /* sr seed */)`` - Convert a float to fp8 value with stochastic rounding, seed is passed as unsigned int argument. - * - `float __amd_cvt_fp8_to_float_scale(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``float __amd_cvt_fp8_to_float_scale(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert a fp8 value to float with scale. - * - `float __amd_cvt_fp8_to_float_scale(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``float __amd_cvt_fp8_to_float_scale(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert a fp8 value to float with scale. - * - `__amd_floatx2_storage_t __amd_cvt_fp8x2_to_floatx2(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t)` + * - ``__amd_floatx2_storage_t __amd_cvt_fp8x2_to_floatx2(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t)`` - Convert 2 packed fp8 value to 2 packed float. - * - `__amd_fp8x2_storage_t __amd_cvt_floatx2_to_fp8x2(const __amd_floatx2_storage_t, const __amd_fp8_interpretation_t)` + * - ``__amd_fp8x2_storage_t __amd_cvt_floatx2_to_fp8x2(const __amd_floatx2_storage_t, const __amd_fp8_interpretation_t)`` - Convert 2 packed float value to 2 packed fp8. - * - `__amd_fp4x2_storage_t __amd_cvt_floatx2_to_fp4x2_sr_scale(const __amd_floatx2_storage_t, const __amd_fp4_interpretation_t, const unsigned int /* sr seed */, const __amd_scale_t)` + * - ``__amd_fp4x2_storage_t __amd_cvt_floatx2_to_fp4x2_sr_scale(const __amd_floatx2_storage_t, const __amd_fp4_interpretation_t, const unsigned int /* sr seed */, const __amd_scale_t)`` - Convert 2 packed float value to 2 packed fp4 with stochastic rounding and scale. - * - `__amd_floatx2_storage_t __amd_cvt_fp4x2_to_floatx2_scale(const __amd_fp4x2_storage_t , const __amd_fp4_interpretation_t, const __amd_scale_t)` + * - ``__amd_floatx2_storage_t __amd_cvt_fp4x2_to_floatx2_scale(const __amd_fp4x2_storage_t , const __amd_fp4_interpretation_t, const __amd_scale_t)`` - Convert 2 packed fp4 value to 2 packed float with scale. - * - `__amd_fp4x2_storage_t __amd_cvt_floatx2_to_fp4x2_scale(const __amd_floatx2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp4x2_storage_t __amd_cvt_floatx2_to_fp4x2_scale(const __amd_floatx2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)`` - Convert 2 packed float value to 2 packed fp4 with scale. 
- * - `__amd_floatx2_storage_t __amd_cvt_fp8x2_to_floatx2_scale(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_floatx2_storage_t __amd_cvt_fp8x2_to_floatx2_scale(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert 2 packed fp8 value to 2 packed float with scale. - * - `__amd_fp8x2_storage_t __amd_cvt_floatx2_to_fp8x2_scale(const __amd_floatx2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp8x2_storage_t __amd_cvt_floatx2_to_fp8x2_scale(const __amd_floatx2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert 2 packed float value to 2 packed fp8 with scale. - * - `__amd_fp6x32_storage_t __amd_cvt_bf16x32_to_fp6x32_scale(const __amd_bf16x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp6x32_storage_t __amd_cvt_bf16x32_to_fp6x32_scale(const __amd_bf16x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)`` - Convert 32 packed bf16 value to 32 packed fp6 with scale. - * - `__amd_fp6x32_storage_t __amd_cvt_fp16x32_to_fp6x32_scale(const __amd_fp16x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp6x32_storage_t __amd_cvt_fp16x32_to_fp6x32_scale(const __amd_fp16x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)`` - Convert 32 packed fp16 value to 32 packed fp6 with scale. - * - `__amd_fp16x2_storage_t __amd_cvt_fp8x2_to_fp16x2_scale(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp16x2_storage_t __amd_cvt_fp8x2_to_fp16x2_scale(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert 2 packed fp8 value to 2 packed fp16 with scale. - * - `__amd_fp16x8_storage_t __amd_cvt_fp8x8_to_fp16x8_scale(const __amd_fp8x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp16x8_storage_t __amd_cvt_fp8x8_to_fp16x8_scale(const __amd_fp8x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert 8 packed fp8 value to 8 packed fp16 with scale. - * - `__amd_bf16x2_storage_t __amd_cvt_fp8x2_to_bf16x2_scale(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_bf16x2_storage_t __amd_cvt_fp8x2_to_bf16x2_scale(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert 2 packed fp8 value to 2 packed bf16 with scale. - * - `__amd_bf16x2_storage_t __amd_cvt_fp8x2_to_bf16x2_scale(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_bf16x2_storage_t __amd_cvt_fp8x2_to_bf16x2_scale(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert 2 packed fp8 value to 2 packed bf16 with scale. - * - `__amd_bf16x8_storage_t __amd_cvt_fp8x8_to_bf16x8_scale(const __amd_fp8x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_bf16x8_storage_t __amd_cvt_fp8x8_to_bf16x8_scale(const __amd_fp8x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert 8 packed fp8 value to 8 packed bf16 with scale. 
- * - `__amd_fp16x32_storage_t __amd_cvt_fp6x32_to_fp16x32_scale(const __amd_fp6x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp16x32_storage_t __amd_cvt_fp6x32_to_fp16x32_scale(const __amd_fp6x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)`` - Convert 32 packed fp6 value to 32 packed fp16 with scale. - * - `__amd_bf16x32_storage_t __amd_cvt_fp6x32_to_bf16x32_scale(const __amd_fp6x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + * - ``__amd_bf16x32_storage_t __amd_cvt_fp6x32_to_bf16x32_scale(const __amd_fp6x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)`` - Convert 32 packed fp6 value to 32 packed bf16 with scale. - * - `__amd_floatx32_storage_t __amd_cvt_fp6x32_to_floatx32_scale(const __amd_fp6x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + * - ``__amd_floatx32_storage_t __amd_cvt_fp6x32_to_floatx32_scale(const __amd_fp6x32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)`` - Convert 32 packed fp6 value to 32 packed float with scale. - * - `__amd_fp16x2_storage_t __amd_cvt_fp4x2_to_fp16x2_scale(const __amd_fp4x2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp16x2_storage_t __amd_cvt_fp4x2_to_fp16x2_scale(const __amd_fp4x2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)`` - Convert 2 packed fp4 value to 2 packed fp16 with scale. - * - `__amd_fp16x8_storage_t __amd_cvt_fp4x8_to_fp16x8_scale(const __amd_fp4x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp16x8_storage_t __amd_cvt_fp4x8_to_fp16x8_scale(const __amd_fp4x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)`` - Convert 8 packed fp4 value to 8 packed fp16 with scale. - * - `__amd_bf16x2_storage_t __amd_cvt_fp4x2_to_bf16x2_scale(const __amd_fp4x2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + * - ``__amd_bf16x2_storage_t __amd_cvt_fp4x2_to_bf16x2_scale(const __amd_fp4x2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)`` - Convert 2 packed fp4 value to 2 packed bf16 with scale. - * - `__amd_bf16x8_storage_t __amd_cvt_fp4x8_to_bf16x8_scale(const __amd_fp4x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + * - ``__amd_bf16x8_storage_t __amd_cvt_fp4x8_to_bf16x8_scale(const __amd_fp4x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)`` - Convert 8 packed fp4 value to 8 packed bf16 with scale. - * - `__amd_floatx8_storage_t __amd_cvt_fp4x8_to_floatx8_scale(const __amd_fp4x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + * - ``__amd_floatx8_storage_t __amd_cvt_fp4x8_to_floatx8_scale(const __amd_fp4x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)`` - Convert 8 packed fp4 value to 8 packed float with scale. - * - `__amd_fp4x8_storage_t __amd_cvt_floatx8_to_fp4x8_scale(const __amd_floatx8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp4x8_storage_t __amd_cvt_floatx8_to_fp4x8_scale(const __amd_floatx8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)`` - Convert 8 packed float value to 8 packed fp4 with scale. 
- * - `__amd_fp8x2_storage_t __amd_cvt_fp16x2_to_fp8x2_scale(const __amd_fp16x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp8x2_storage_t __amd_cvt_fp16x2_to_fp8x2_scale(const __amd_fp16x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert 2 packed fp16 value to 2 packed fp8 with scale. - * - `__amd_fp8x2_storage_t __amd_cvt_bf16x2_to_fp8x2_scale(const __amd_bf16x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp8x2_storage_t __amd_cvt_bf16x2_to_fp8x2_scale(const __amd_bf16x2_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert 2 packed bf16 value to 2 packed fp8 with scale. - * - `__amd_fp8x8_storage_t __amd_cvt_bf16x8_to_fp8x8_scale(const __amd_bf16x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp8x8_storage_t __amd_cvt_bf16x8_to_fp8x8_scale(const __amd_bf16x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert 8 packed bf16 value to 8 packed fp8 with scale. - * - `__amd_floatx8_storage_t __amd_cvt_fp8x8_to_floatx8_scale(const __amd_fp8x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_floatx8_storage_t __amd_cvt_fp8x8_to_floatx8_scale(const __amd_fp8x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert 8 packed fp8 value to 8 packed float with scale. - * - `__amd_fp16_storage_t __amd_cvt_fp8_to_fp16_scale(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp16_storage_t __amd_cvt_fp8_to_fp16_scale(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert a fp8 value to fp16 with scale. - * - `__amd_bf16_storage_t __amd_cvt_fp8_to_bf16_scale(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_bf16_storage_t __amd_cvt_fp8_to_bf16_scale(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert a fp8 value to bf16 with scale. - * - `__amd_fp6x32_storage_t __amd_cvt_floatx16_floatx16_to_fp6x32_scale(const __amd_floatx16_storage_t, const __amd_floatx16_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp6x32_storage_t __amd_cvt_floatx16_floatx16_to_fp6x32_scale(const __amd_floatx16_storage_t, const __amd_floatx16_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)`` - Convert 2 inputs of 16-packed float values to 32 packed fp6 with scale. - * - `__amd_fp6x32_storage_t __amd_cvt_floatx32_to_fp6x32_scale(const __amd_floatx32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp6x32_storage_t __amd_cvt_floatx32_to_fp6x32_scale(const __amd_floatx32_storage_t, const __amd_fp6_interpretation_t, const __amd_scale_t)`` - Convert 32 packed float values to 32 packed fp6 with scale. - * - `__amd_fp6x32_storage_t __amd_cvt_floatx32_to_fp6x32_sr_scale(const __amd_floatx32_storage_t, const __amd_fp6_interpretation_t, const unsigned int, const __amd_scale_t)` + * - ``__amd_fp6x32_storage_t __amd_cvt_floatx32_to_fp6x32_sr_scale(const __amd_floatx32_storage_t, const __amd_fp6_interpretation_t, const unsigned int, const __amd_scale_t)`` - Convert 32 packed float values to 32 packed fp6 with stochastic rounding and scale. 
- * - `__amd_fp16_storage_t __amd_cvt_float_to_fp16_sr(const float, const unsigned int)` + * - ``__amd_fp16_storage_t __amd_cvt_float_to_fp16_sr(const float, const unsigned int)`` - Convert a float value to fp16 with stochastic rounding. - * - `__amd_fp16x2_storage_t __amd_cvt_float_float_to_fp16x2_sr(const float, const float, const unsigned int)` + * - ``__amd_fp16x2_storage_t __amd_cvt_float_float_to_fp16x2_sr(const float, const float, const unsigned int)`` - Convert two inputs of float to 2 packed fp16 with stochastic rounding. - * - `__amd_bf16_storage_t __amd_cvt_float_to_bf16_sr(const float, const unsigned int)` + * - ``__amd_bf16_storage_t __amd_cvt_float_to_bf16_sr(const float, const unsigned int)`` - Convert a float value to bf16 with stochastic rounding. - * - `__amd_fp6x32_storage_t __amd_cvt_fp16x32_to_fp6x32_sr_scale(const __amd_fp16x32_storage_t, const __amd_fp6_interpretation_t, const unsigned int, const __amd_scale_t)` + * - ``__amd_fp6x32_storage_t __amd_cvt_fp16x32_to_fp6x32_sr_scale(const __amd_fp16x32_storage_t, const __amd_fp6_interpretation_t, const unsigned int, const __amd_scale_t)`` - Convert 32 packed fp16 values to 32 packed fp6 with stochastic rounding and scale. - * - `__amd_fp6x32_storage_t __amd_cvt_bf16x32_to_fp6x32_sr_scale(const __amd_bf16x32_storage_t, const __amd_fp6_interpretation_t, const unsigned int, const __amd_scale_t)` + * - ``__amd_fp6x32_storage_t __amd_cvt_bf16x32_to_fp6x32_sr_scale(const __amd_bf16x32_storage_t, const __amd_fp6_interpretation_t, const unsigned int, const __amd_scale_t)`` - Convert 32 packed bf16 values to 32 packed fp6 with stochastic rounding and scale. - * - `__amd_fp4x2_storage_t __amd_cvt_bf16x2_to_fp4x2_scale(const __amd_bf16x2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp4x2_storage_t __amd_cvt_bf16x2_to_fp4x2_scale(const __amd_bf16x2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)`` - Convert 2 packed bf16 value to 2 packed fp4 with scale. - * - `__amd_fp4x8_storage_t __amd_cvt_bf16x8_to_fp4x8_scale(const __amd_bf16x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp4x8_storage_t __amd_cvt_bf16x8_to_fp4x8_scale(const __amd_bf16x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)`` - Convert 8 packed bf16 value to 8 packed fp4 with scale. - * - `__amd_fp4x2_storage_t __amd_cvt_fp16x2_to_fp4x2_scale(const __amd_fp16x2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp4x2_storage_t __amd_cvt_fp16x2_to_fp4x2_scale(const __amd_fp16x2_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)`` - Convert 2 packed fp16 value to 2 packed fp4 with scale. - * - `__amd_fp4x8_storage_t __amd_cvt_fp16x8_to_fp4x8_scale(const __amd_fp16x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp4x8_storage_t __amd_cvt_fp16x8_to_fp4x8_scale(const __amd_fp16x8_storage_t, const __amd_fp4_interpretation_t, const __amd_scale_t)`` - Convert 8 packed fp16 value to 8 packed fp4 with scale. - * - `__amd_fp4x8_storage_t __amd_cvt_floatx8_to_fp4x8_sr_scale(const __amd_floatx8_storage_t, const __amd_fp4_interpretation_t, const unsigned int, const __amd_scale_t)` + * - ``__amd_fp4x8_storage_t __amd_cvt_floatx8_to_fp4x8_sr_scale(const __amd_floatx8_storage_t, const __amd_fp4_interpretation_t, const unsigned int, const __amd_scale_t)`` - Convert 8 packed float values to 8 packed fp4 with stochastic rounding and scale. 
- * - `_amd_fp4x2_storage_t __amd_cvt_bf16x2_to_fp4x2_sr_scale(const __amd_bf16x2_storage_t, const __amd_fp4_interpretation_t, const unsigned int,const __amd_scale_t)` + * - ``__amd_fp4x2_storage_t __amd_cvt_bf16x2_to_fp4x2_sr_scale(const __amd_bf16x2_storage_t, const __amd_fp4_interpretation_t, const unsigned int,const __amd_scale_t)`` - Convert 2 packed bf16 value to 2 packed fp4 with stochastic rounding and scale. - * - `__amd_fp4x8_storage_t __amd_cvt_bf16x8_to_fp4x8_sr_scale(const __amd_bf16x8_storage_t, const __amd_fp4_interpretation_t, const unsigned int, const __amd_scale_t)` + * - ``__amd_fp4x8_storage_t __amd_cvt_bf16x8_to_fp4x8_sr_scale(const __amd_bf16x8_storage_t, const __amd_fp4_interpretation_t, const unsigned int, const __amd_scale_t)`` - Convert 8 packed bf16 value to 8 packed fp4 with stochastic rounding and scale. - * - `__amd_fp4x2_storage_t __amd_cvt_fp16x2_to_fp4x2_sr_scale(const __amd_fp16x2_storage_t, const __amd_fp4_interpretation_t, const unsigned int, const __amd_scale_t)` + * - ``__amd_fp4x2_storage_t __amd_cvt_fp16x2_to_fp4x2_sr_scale(const __amd_fp16x2_storage_t, const __amd_fp4_interpretation_t, const unsigned int, const __amd_scale_t)`` - Convert 2 packed fp16 value to 2 packed fp4 with stochastic rounding and scale. - * - `__amd_fp4x8_storage_t __amd_cvt_fp16x8_to_fp4x8_sr_scale(const __amd_fp16x8_storage_t , const __amd_fp4_interpretation_t, const unsigned int, const __amd_scale_t)` + * - ``__amd_fp4x8_storage_t __amd_cvt_fp16x8_to_fp4x8_sr_scale(const __amd_fp16x8_storage_t , const __amd_fp4_interpretation_t, const unsigned int, const __amd_scale_t)`` - Convert 8 packed fp16 values to 8 packed fp4 with stochastic rounding and scale. - * - `__amd_fp8x8_storage_t __amd_cvt_floatx8_to_fp8x8_sr_scale(const __amd_floatx8_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)` + * - ``__amd_fp8x8_storage_t __amd_cvt_floatx8_to_fp8x8_sr_scale(const __amd_floatx8_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)`` - Convert 8 packed float values to 8 packed fp8 with stochastic rounding and scale. - * - `__amd_fp8_storage_t __amd_cvt_fp16_to_fp8_sr_scale(const __amd_fp16_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)` + * - ``__amd_fp8_storage_t __amd_cvt_fp16_to_fp8_sr_scale(const __amd_fp16_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)`` - Convert a fp16 value to fp8 with stochastic rounding and scale. - * - `__amd_fp8x8_storage_t __amd_cvt_fp16x8_to_fp8x8_sr_scale(const __amd_fp16x8_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)` + * - ``__amd_fp8x8_storage_t __amd_cvt_fp16x8_to_fp8x8_sr_scale(const __amd_fp16x8_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)`` - Convert 8 packed fp16 values to 8 packed fp8 with stochastic rounding and scale. - * - `__amd_fp8_storage_t __amd_cvt_bf16_to_fp8_sr_scale(const __amd_bf16_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)` + * - ``__amd_fp8_storage_t __amd_cvt_bf16_to_fp8_sr_scale(const __amd_bf16_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)`` - Convert a bf16 value to fp8 with stochastic rounding and scale. 
- * - `__amd_fp8x8_storage_t __amd_cvt_bf16x8_to_fp8x8_sr_scale(const __amd_bf16x8_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)` + * - ``__amd_fp8x8_storage_t __amd_cvt_bf16x8_to_fp8x8_sr_scale(const __amd_bf16x8_storage_t, const __amd_fp8_interpretation_t, const unsigned int, const __amd_scale_t)`` - Convert 8 packed bf16 values to 8 packed fp8 with stochastic rounding and scale. - * - `__amd_fp16_storage_t __amd_cvt_fp8_to_fp16(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t)` + * - ``__amd_fp16_storage_t __amd_cvt_fp8_to_fp16(const __amd_fp8_storage_t, const __amd_fp8_interpretation_t)`` - Convert a fp8 value to fp16. - * - `__amd_fp16x2_storage_t __amd_cvt_fp8x2_to_fp16x2(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t)` + * - ``__amd_fp16x2_storage_t __amd_cvt_fp8x2_to_fp16x2(const __amd_fp8x2_storage_t, const __amd_fp8_interpretation_t)`` - Convert 2 packed fp8 value to 2 packed fp16. - * - `__amd_fp8x2_storage_t __amd_cvt_fp16x2_to_fp8x2(const __amd_fp16x2_storage_t, const __amd_fp8_interpretation_t)` + * - ``__amd_fp8x2_storage_t __amd_cvt_fp16x2_to_fp8x2(const __amd_fp16x2_storage_t, const __amd_fp8_interpretation_t)`` - Convert 2 packed fp16 value to 2 packed fp8. - * - `__amd_fp8x8_storage_t __amd_cvt_fp16x8_to_fp8x8_scale(const __amd_fp16x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp8x8_storage_t __amd_cvt_fp16x8_to_fp8x8_scale(const __amd_fp16x8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert 8 packed fp16 values to 8 packed fp8 with scale. - * - `__amd_fp8x8_storage_t __amd_cvt_floatx8_to_fp8x8_scale(const __amd_floatx8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)` + * - ``__amd_fp8x8_storage_t __amd_cvt_floatx8_to_fp8x8_scale(const __amd_floatx8_storage_t, const __amd_fp8_interpretation_t, const __amd_scale_t)`` - Convert 8 packed float values to 8 packed fp8 with scale. - * - `__amd_fp8_storage_t __amd_cvt_fp16_to_fp8_sr(const __amd_fp16_storage_t, const __amd_fp8_interpretation_t, const short)` + * - ``__amd_fp8_storage_t __amd_cvt_fp16_to_fp8_sr(const __amd_fp16_storage_t, const __amd_fp8_interpretation_t, const short)`` - Convert a fp16 value to fp8 with stochastic rounding. - * - `float2 __amd_cvt_floatx2_to_float2(const __amd_floatx2_storage_t)` + * - ``float2 __amd_cvt_floatx2_to_float2(const __amd_floatx2_storage_t)`` - Convert 2 packed float value to hip's float2 type. - * - `__half __amd_cvt_fp16_to_half(const __amd_fp16_storage_t)` + * - ``__half __amd_cvt_fp16_to_half(const __amd_fp16_storage_t)`` - Convert fp16 type to hip's __half type. - * - `__half2 __amd_cvt_fp16x2_to_half2(const __amd_fp16x2_storage_t)` + * - ``__half2 __amd_cvt_fp16x2_to_half2(const __amd_fp16x2_storage_t)`` - Convert 2 packed fp16 type to hip's __half2 type. - * - `__amd_fp16_storage_t __amd_cvt_half_to_fp16(const __half)` + * - ``__amd_fp16_storage_t __amd_cvt_half_to_fp16(const __half)`` - Convert hip's __half type to fp16 type. - * - `__amd_fp16x2_storage_t __amd_cvt_half2_to_fp16x2(const __half2)` + * - ``__amd_fp16x2_storage_t __amd_cvt_half2_to_fp16x2(const __half2)`` - Convert hip's __half2 type to 2 packed fp16. - * - `__hip_bfloat16 __amd_cvt_bf16_to_hipbf16(const __amd_bf16_storage_t)` + * - ``__hip_bfloat16 __amd_cvt_bf16_to_hipbf16(const __amd_bf16_storage_t)`` - Convert bf16 type to __hip_bfloat16 type. 
- * - `__hip_bfloat162 __amd_cvt_bf16x2_to_hipbf162(const __amd_bf16x2_storage_t)` + * - ``__hip_bfloat162 __amd_cvt_bf16x2_to_hipbf162(const __amd_bf16x2_storage_t)`` - Convert 2 packed bf16 type to __hip_bfloat162 type. - * - `__amd_bf16_storage_t __amd_cvt_hipbf16_to_bf16(const __hip_bfloat16)` + * - ``__amd_bf16_storage_t __amd_cvt_hipbf16_to_bf16(const __hip_bfloat16)`` - Convert __hip_bfloat16 to bf16 type. - * - `__amd_bf16x2_storage_t __amd_cvt_hipbf162_to_bf16x2(const __hip_bfloat162)` + * - ``__amd_bf16x2_storage_t __amd_cvt_hipbf162_to_bf16x2(const __hip_bfloat162)`` - Convert __hip_bfloat162 to 2 packed bf16 type. HIP EXT C++ API =============== -There are C++ data structures also available. These are different from one in `` header. These APIs expose a wider capability set which are exclusive to `gfx950`. +There are C++ data structures also available. These are different from one in ```` header. These APIs expose a wider capability set which are exclusive to ``gfx950``. HIP EXT FP8 E4M3: From a560ce05f226305e18a63e17d3ff6c8c5d7caae5 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Tue, 22 Jul 2025 15:30:42 -0700 Subject: [PATCH 28/36] Move reference/env_variables folder under docs/reference Update conf.py Update exclude_patterns in conf.py for relocated env_variables include files Update debugging.rst Correct include-table for ./reference/env_variables/debug_hip_env.rst Update memory_management_hip_env.rst replace hip:hip-memory-coherence-table with coheence-control --- docs/conf.py | 10 +++++----- docs/faq.rst | 6 +++--- docs/how-to/debugging.rst | 2 +- docs/how-to/hip_porting_guide.rst | 6 +++--- docs/how-to/hip_runtime_api/asynchronous.rst | 2 +- docs/reference/env_variables.rst | 10 +++++----- .../reference/env_variables/debug_hip_env.rst | 0 .../reference/env_variables/gpu_isolation_hip_env.rst | 0 .../env_variables/memory_management_hip_env.rst | 4 ++-- .../reference/env_variables/miscellaneous_hip_env.rst | 0 .../reference/env_variables/profiling_hip_env.rst | 0 11 files changed, 20 insertions(+), 20 deletions(-) rename docs/{data => }/reference/env_variables/debug_hip_env.rst (100%) rename docs/{data => }/reference/env_variables/gpu_isolation_hip_env.rst (100%) rename docs/{data => }/reference/env_variables/memory_management_hip_env.rst (95%) rename docs/{data => }/reference/env_variables/miscellaneous_hip_env.rst (100%) rename docs/{data => }/reference/env_variables/profiling_hip_env.rst (100%) diff --git a/docs/conf.py b/docs/conf.py index fc3d1274f9..5dc32092f9 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -54,12 +54,12 @@ numfig = False exclude_patterns = [ - "doxygen/mainpage.md", - "understand/glossary.md", - 'how-to/debugging_env.rst', - "data/reference/env_variables" + "./doxygen/mainpage.md", + "./understand/glossary.md", + './how-to/debugging_env.rst', + "./reference/env_variables" ] git_url = subprocess.check_output(['git', 'config', '--get', 'remote.origin.url']).strip().decode('ascii') if git_url.find("git:") != -1: - html_theme_options = {"repository_url": "https://github.com/ROCm/hip"} \ No newline at end of file + html_theme_options = {"repository_url": "https://github.com/ROCm/hip"} diff --git a/docs/faq.rst b/docs/faq.rst index f0d836761f..696985c61d 100644 --- a/docs/faq.rst +++ b/docs/faq.rst @@ -40,10 +40,10 @@ for the full list. What NVIDIA CUDA features does HIP support? 
------------------------------------------- -The :doc:`NVIDIA CUDA runtime API supported by HIP` -and :doc:`NVIDIA CUDA driver API supported by HIP` +The :doc:`NVIDIA CUDA runtime API supported by HIP` +and :doc:`NVIDIA CUDA driver API supported by HIP` pages describe which NVIDIA CUDA APIs are supported and what the equivalents are. -The :doc:`HIP API documentation ` describes each API and +The :ref:`HIP runtime API reference` describes each API and its limitations, if any, compared with the equivalent CUDA API. The kernel language features are documented in the diff --git a/docs/how-to/debugging.rst b/docs/how-to/debugging.rst index fce7fac6ae..9222b8cfc9 100644 --- a/docs/how-to/debugging.rst +++ b/docs/how-to/debugging.rst @@ -273,7 +273,7 @@ HIP environment variable summary Here are some of the more commonly used environment variables: -.. include-table:: data/reference/env_variables/debug_hip_env.rst +.. include-table:: ./reference/env_variables/debug_hip_env.rst :table: hip-env-debug General debugging tips diff --git a/docs/how-to/hip_porting_guide.rst b/docs/how-to/hip_porting_guide.rst index 205ea6a848..410c17a76f 100644 --- a/docs/how-to/hip_porting_guide.rst +++ b/docs/how-to/hip_porting_guide.rst @@ -46,16 +46,16 @@ HIPIFY translate CUDA to HIP code. There are two flavours available, ``hipfiy-clang`` and ``hipify-perl``. -:doc:`hipify-clang ` is, as the name implies, a Clang-based +:doc:`hipify-clang ` is, as the name implies, a Clang-based tool, and actually parses the code, translates it into an Abstract Syntax Tree, from which it then generates the HIP source. For this, ``hipify-clang`` needs to be able to actually compile the code, so the CUDA code needs to be correct, and a CUDA install with all necessary headers must be provided. -:doc:`hipify-perl ` uses pattern matching, to translate the +:doc:`hipify-perl ` uses pattern matching, to translate the CUDA code to HIP. It does not require a working CUDA installation, and can also convert CUDA code, that is not syntactically correct. It is therefore easier to -set up and use, but is not as powerful as ``hipfiy-clang``. +set up and use, but is not as powerful as ``hipify-clang``. Scanning existing CUDA code to scope the porting effort -------------------------------------------------------------------------------- diff --git a/docs/how-to/hip_runtime_api/asynchronous.rst b/docs/how-to/hip_runtime_api/asynchronous.rst index 82c024969f..63aeddc2cf 100644 --- a/docs/how-to/hip_runtime_api/asynchronous.rst +++ b/docs/how-to/hip_runtime_api/asynchronous.rst @@ -88,7 +88,7 @@ developer may have to reduce the block size of the kernels. The kernel runtimes can be misleading for concurrent kernel runs, that is why during optimization it is a good practice to check the trace files, to see if one kernel is blocking another kernel, while they are running in parallel. For more information about -the application tracing, check::doc:`rocprofiler:/how-to/using-rocprof`. +application tracing, see :doc:`rocprofiler:how-to/using-rocprof`. When running kernels in parallel, the execution time can increase due to contention for shared resources. This is because multiple kernels may attempt diff --git a/docs/reference/env_variables.rst b/docs/reference/env_variables.rst index 390cb9516c..6010ce6cfa 100644 --- a/docs/reference/env_variables.rst +++ b/docs/reference/env_variables.rst @@ -12,24 +12,24 @@ on AMD platform, which are grouped by functionality. 
GPU isolation variables ================================================================================ -.. include:: ../data/reference/env_variables/gpu_isolation_hip_env.rst +.. include:: ./env_variables/gpu_isolation_hip_env.rst Profiling variables ================================================================================ -.. include:: ../data/reference/env_variables/profiling_hip_env.rst +.. include:: ./env_variables/profiling_hip_env.rst Debug variables ================================================================================ -.. include:: ../data/reference/env_variables/debug_hip_env.rst +.. include:: ./env_variables/debug_hip_env.rst Memory management related variables ================================================================================ -.. include:: ../data/reference/env_variables/memory_management_hip_env.rst +.. include:: ./env_variables/memory_management_hip_env.rst Other useful variables ================================================================================ -.. include:: ../data/reference/env_variables/miscellaneous_hip_env.rst +.. include:: ./env_variables/miscellaneous_hip_env.rst diff --git a/docs/data/reference/env_variables/debug_hip_env.rst b/docs/reference/env_variables/debug_hip_env.rst similarity index 100% rename from docs/data/reference/env_variables/debug_hip_env.rst rename to docs/reference/env_variables/debug_hip_env.rst diff --git a/docs/data/reference/env_variables/gpu_isolation_hip_env.rst b/docs/reference/env_variables/gpu_isolation_hip_env.rst similarity index 100% rename from docs/data/reference/env_variables/gpu_isolation_hip_env.rst rename to docs/reference/env_variables/gpu_isolation_hip_env.rst diff --git a/docs/data/reference/env_variables/memory_management_hip_env.rst b/docs/reference/env_variables/memory_management_hip_env.rst similarity index 95% rename from docs/data/reference/env_variables/memory_management_hip_env.rst rename to docs/reference/env_variables/memory_management_hip_env.rst index 0bb6631a3f..fe15850b2b 100644 --- a/docs/data/reference/env_variables/memory_management_hip_env.rst +++ b/docs/reference/env_variables/memory_management_hip_env.rst @@ -2,9 +2,9 @@ The memory management related environment variables in HIP are collected in the following table. The ``HIP_HOST_COHERENT`` variable linked at the following pages: -- :ref:`Coherence control ` +- :ref:`Coherence control ` -- :ref:`Memory allocation flags ` +- :ref:`Memory allocation flags ` .. _hip-env-memory: .. 
list-table:: diff --git a/docs/data/reference/env_variables/miscellaneous_hip_env.rst b/docs/reference/env_variables/miscellaneous_hip_env.rst similarity index 100% rename from docs/data/reference/env_variables/miscellaneous_hip_env.rst rename to docs/reference/env_variables/miscellaneous_hip_env.rst diff --git a/docs/data/reference/env_variables/profiling_hip_env.rst b/docs/reference/env_variables/profiling_hip_env.rst similarity index 100% rename from docs/data/reference/env_variables/profiling_hip_env.rst rename to docs/reference/env_variables/profiling_hip_env.rst From c77022854d5c75ea455c15a2fee9a8a5262cbe80 Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Fri, 25 Jul 2025 13:51:28 +0200 Subject: [PATCH 29/36] Fix the environment variables reference --- docs/reference/env_variables/memory_management_hip_env.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/reference/env_variables/memory_management_hip_env.rst b/docs/reference/env_variables/memory_management_hip_env.rst index fe15850b2b..8b4fee6d85 100644 --- a/docs/reference/env_variables/memory_management_hip_env.rst +++ b/docs/reference/env_variables/memory_management_hip_env.rst @@ -2,9 +2,9 @@ The memory management related environment variables in HIP are collected in the following table. The ``HIP_HOST_COHERENT`` variable linked at the following pages: -- :ref:`Coherence control ` +- :ref:`Coherence control ` -- :ref:`Memory allocation flags ` +- :ref:`Memory allocation flags ` .. _hip-env-memory: .. list-table:: From 95832325a6b54e66e668e183161b6583aafe71ae Mon Sep 17 00:00:00 2001 From: randyh62 Date: Wed, 16 Jul 2025 16:24:32 -0700 Subject: [PATCH 30/36] Add CUDA compatibility changes - fix Markdown linting errors - Add words for Compatility - Reformat code samples - minor edit - convert hip-changes from markdown to RST - Add cpp links to reference content - Update for Julia's comments - add Leo's comments - Clean up hipGetLastError --- .wordlist.txt | 16 ++ docs/hip-7-changes.rst | 344 ++++++++++++++++++++++++++++++++++++++++ docs/index.md | 7 + docs/sphinx/_toc.yml.in | 1 + 4 files changed, 368 insertions(+) create mode 100644 docs/hip-7-changes.rst diff --git a/.wordlist.txt b/.wordlist.txt index 3f595b2d63..19ef638591 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -41,7 +41,9 @@ Dereferencing DFT dll DirectX +dst EIGEN +elementSize enqueue enqueues entrypoint @@ -69,14 +71,24 @@ hcBLAS HIP-Clang hipcc hipCtx +hipError +hipErrorContextIsDestroyed +hipErrorInvalidHandle hipexamine hipExt +hipGetLastError hipified HIPify +hipMalloc +hipMemsetParams hipModule hipModuleLaunchKernel hipother +hipOutOfMemory +hipRTC HIPRTC +hipSetDevice +hipSuccess hyperthreading icc IILE @@ -122,7 +134,9 @@ NCCL NDRange nonnegative NOP +NULLPTR Numa +nvRTC ns Nsight ocp @@ -168,6 +182,7 @@ sinf SOMA SPMV sr +STL structs SYCL syntaxes @@ -178,6 +193,7 @@ tradeoffs templated toolkits transfering +typedef typedefs ULP ULPs diff --git a/docs/hip-7-changes.rst b/docs/hip-7-changes.rst new file mode 100644 index 0000000000..2c0520e2c9 --- /dev/null +++ b/docs/hip-7-changes.rst @@ -0,0 +1,344 @@ +.. meta:: + :description: This topic discusses the changes introduced in HIP 7.0 + :keywords: AMD, ROCm, HIP, HIP changes, CUDA, C++ language extensions + +.. 
_compatibility-changes: + +******************************************************************************* +HIP API 7.0 changes +******************************************************************************* + +To improve code portability between AMD and NVIDIA GPU programming models specific changes were made to the HIP API in the 7.0 release to simplify cross-platform programming. These changes align HIP C++ even more closely with NVIDIA CUDA. These changes are incompatible with prior releases, and might require recompiling existing HIP applications for use in the 7.0 release, or editing and recompiling code in some cases. In the best case, the change requires no modification of existing applications. These changes were made available in a preview release based on the 6.4.1 release, and as such, hopefully, you have had advanced notice and prepared for the following changes. + +Behavior changes in HIP Runtime API +=================================== + +Update ``hipGetLastError`` +-------------------------- + +Prior to the 7.0 release of the HIP API, :cpp:func:`hipGetLastError` was not fully compliant with CUDA's behavior. The purpose of this change is to have ``hipGetLastError`` return the last actual error caught in the current thread during the application execution. Neither ``hipSuccess`` nor ``hipErrorNotReady`` is considered an error. Take the following code as an example: + +.. code:: cpp + + 1: hipError_t err = hipMalloc(...); // returns hipOutOfMemory + 2: err = hipSetDevice(0); // returns hipSuccess + 3: err = hipGetLastError(); + +The prior behavior was for ``hipGetLastError`` at line 3 to return ``hipSuccess`` from line 2. In the 7.0 release, the value of ``err`` at line 3 is ``hipOutOfMemory`` which is the error returned in Line 1, rather than simply the result returned in line 2. This matches CUDA behavior. + +You can still use the prior functionality by using the ``hipExtGetLastError`` function. Notice that the function begins with ``hipExt`` which denotes a function call that is unique to HIP, without correlation to CUDA. This function was introduced with the 6.0 release. + +Cooperative groups changes +-------------------------- + +For :cpp:func:`hipLaunchCooperativeKernelMultiDevice` function, HIP now includes additional input parameter validation checks. + +* If the input launch stream is a NULLPTR or it is ``hipStreamLegacy``, the function now returns ``hipErrorInvalidResourceHandle``. +* If the stream capturing is active, the function returns the error code ``hipErrorStreamCaptureUnsupported``. +* If the stream capture status is invalidated, the function returns the error ``hipErrorStreamCaptureInvalidated``. + +The :cpp:func:`hipLaunchCooperativeKernel` function now checks the input stream handle. If it's invalid, the returned error is changed to ``hipErrorInvalidHandle`` from ``hipErrorContextIsDestroyed``. + +Update ``hipPointerGetAttributes`` +---------------------------------- + +:cpp:func:`hipPointerGetAttributes` now matches the functionality of ``cudaPointerGetAttributes`` which changed in CUDA 11. If a NULL host or attribute pointer is passed as input parameter, ``hipPointerGetAttributes`` now returns ``hipSuccess`` instead of the error code ``hipErrorInvalidValue``. + +Any application which is expecting the API to return an error instead of success could be impacted and a code change may need to handle the error properly. 
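+
+The following minimal sketch illustrates the new behavior; which attribute fields you inspect after the call depends on your application:
+
+.. code:: cpp
+
+   hipPointerAttribute_t attr;
+   // Passing a NULL pointer caused hipPointerGetAttributes to return
+   // hipErrorInvalidValue before HIP 7.0. With 7.0 the call returns hipSuccess,
+   // matching cudaPointerGetAttributes since CUDA 11, so code that branched on
+   // hipErrorInvalidValue should inspect the returned attributes instead.
+   hipError_t err = hipPointerGetAttributes(&attr, nullptr);
+   if (err == hipSuccess) {
+       // Handle the reported attributes here.
+   }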
+
+Update ``hipFree``
+------------------
+
+:cpp:func:`hipFree` previously performed an implicit synchronization wait that applied to all memory allocations. This wait has been disabled in the HIP 7.0 runtime for allocations made with ``hipMallocAsync`` and ``hipMallocFromPoolAsync`` to match the behavior of the CUDA API ``cudaFree``.
+
+HIP runtime compiler (hipRTC) changes
+=====================================
+
+Runtime compilation for HIP is available through the ``hipRTC`` library as described in :ref:`hip_runtime_compiler_how-to`. The library grew organically within the main HIP runtime code. However, segregation of the ``hipRTC`` code is now needed to ensure better compatibility and easier code portability.
+
+Removal of ``hipRTC`` symbols from HIP Runtime Library
+------------------------------------------------------
+
+``hipRTC`` has been an independent library since the 6.0 release, but the ``hipRTC`` symbols were still available in the HIP runtime library. Starting with the 7.0 release, ``hipRTC`` is no longer included in the HIP runtime, and any application using ``hipRTC`` APIs should link explicitly with the ``hipRTC`` library.
+
+This change makes the usage of the ``hipRTC`` library on Linux the same as on Windows and matches the behavior of CUDA ``nvRTC``.
+
+``hipRTC`` compilation
+----------------------
+
+Device code compilation via ``hipRTC`` now uses the ``__hip_internal`` namespace instead of the ``std`` namespace from the standard headers, to avoid namespace collisions. These changes are made in the HIP header files.
+
+No code change is required in any application, but rebuilding is necessary.
+
+Removal of datatypes from ``hipRTC``
+------------------------------------
+
+In ``hipRTC``, datatype definitions such as ``int64_t``, ``uint64_t``, ``int32_t``, and ``uint32_t`` could result in conflicts in some applications, because those applications use their own definitions for these types. ``nvRTC`` doesn't define these datatypes either.
+These datatypes are removed and replaced by HIP internal datatypes prefixed with ``__hip``, for example, ``__hip_int64_t``.
+
+Any application relying on HIP internal datatypes during ``hipRTC`` compilation might be affected.
+These changes have no impact on any application if it compiles as expected using ``nvRTC``.
+
+HIP header clean up
+===================
+
+HIP header files previously included unnecessary Standard Template Library (STL) headers.
+With the 7.0 release, unnecessary STL headers are no longer included, and only the required STL headers
+are included.
+
+Applications relying on HIP runtime header files might need to be updated to explicitly include the STL
+header files that the HIP headers no longer include in 7.0.
+
+API signature and struct changes
+================================
+
+API signature changes
+---------------------
+
+Signatures in some APIs have been modified to match corresponding CUDA APIs, as described below.
+
+The definitions of the following ``hipRTC`` APIs have changed:
+
+* :cpp:func:`hiprtcCreateProgram`
+* :cpp:func:`hiprtcCompileProgram`
+
+In these APIs, the input parameter type changes from ``const char**`` to ``const char* const*``.
+
+In addition, the following APIs have signature changes:
+
+* :cpp:func:`hipMemcpyHtoD`, the type of the second argument pointer changes from ``const void*`` to ``void*``.
+* :cpp:func:`hipCtxGetApiVersion`, the type of the second argument is changed from ``int*`` to ``unsigned int*``.
+
+These signature changes do not require code modifications but do require rebuilding the application.
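+
+As an illustration, a call such as the following minimal sketch keeps compiling unchanged; the kernel source and header names used here are placeholders:
+
+.. code:: cpp
+
+   const char* kernel_src     = "extern \"C\" __global__ void k() {}";
+   const char* header_srcs[]  = {"#define TILE 64\n"};
+   const char* header_names[] = {"config.h"};
+   hiprtcProgram prog;
+   // The header and include-name parameters are now const char* const*;
+   // existing arrays of const char*, as above, still bind to them.
+   hiprtcCreateProgram(&prog, kernel_src, "kernel.cpp", 1, header_srcs, header_names);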
+ +Deprecated struct ``HIP_MEMSET_NODE_PARAMS`` +-------------------------------------------- + +The deprecated structure ``HIP_MEMSET_NODE_PARAMS`` is removed. +You can use the definition :cpp:struct:`hipMemsetParams` instead, as input parameter, while using these two APIs: + +* :cpp:func:`hipDrvGraphAddMemsetNode` +* :cpp:func:`hipDrvGraphExecMemsetNodeSetParams` + +``hipMemsetParams`` struct change +--------------------------------- + +The struct :cpp:struct:`hipMemsetParams` is updated to be compatible with CUDA. +The change is from the old struct definition shown below: + +.. code:: cpp + + typedef struct hipMemsetParams { + void* dst; + unsigned int elementSize; + size_t height; + size_t pitch; + unsigned int value; + size_t width; + } hipMemsetParams; + +To the new struct definition as follows: + +.. code:: cpp + + typedef struct hipMemsetParams { + void* dst; + size_t pitch; + unsigned int value; + unsigned int elementSize; + size_t width; + size_t height; + } hipMemsetParams; + +No code change is required in any application using this structure, but rebuilding is necessary. + +HIP vector constructor change +----------------------------- + +Changes have been made to HIP vector constructors for ``hipComplex`` initialization to generate values in alignment with CUDA. The affected constructors are small vector types such as ``float2`` and ``int4`` for example. If your code previously relied on a single value to initialize all components within a vector or complex type, you might need to update your code. Otherwise, rebuilding the application is necessary but no code change is required in any application using these constructors. + +Stream capture updates +====================== + +Restrict stream capture modes +----------------------------- + +Stream capture mode has been restricted in HIP APIs through the addition of the macro ``CHECK_STREAM_CAPTURE_SUPPORTED``. + +In the HIP enumeration ``hipStreamCaptureMode``, three capture modes were previously supported: + +* Global +* ThreadLocal +* Relaxed + +As of the 7.0 release, when checking with the ``CHECK_STREAM_CAPTURE_SUPPORTED`` macro the only supported stream capture mode is ``hipStreamCaptureModeRelaxed``. The rest are not supported, and the macro will return ``hipErrorStreamCaptureUnsupported``. + +This change matches the behavior of CUDA. There is no impact on any application if stream capture works correctly on the CUDA platform. However, in the HIP runtime the API will return ``hipErrorStreamCaptureUnsupported`` on unsupported stream capture modes. + +This update involves the following APIs. They are allowed only in relaxed stream capture mode. Not all three capture modes. + +* :cpp:func:`hipMallocManaged` +* :cpp:func:`hipMemAdvise` + +Check stream capture mode +------------------------- + +The following APIs will check the stream capture mode and return error codes to match the behavior of CUDA. No impact if stream capture is working correctly on CUDA. Otherwise, the application would need to modify the graph being captured. + +* :cpp:func:`hipLaunchCooperativeKernelMultiDevice` - Returns error code while stream capture status is active. The usage is restricted during stream capture +* :cpp:func:`hipEventQuery` - Returns an error ``hipErrorStreamCaptureUnsupported`` in global capture mode +* :cpp:func:`hipStreamAddCallback` - The stream capture behavior is updated. The function now checks if any of the blocking streams are capturing. If so, it returns an error and invalidates all capturing streams. 
The usage of this API is restricted during stream capture to match CUDA. + +Stream capture error return +--------------------------- + +During stream capture, the following HIP APIs return the ``hipErrorStreamCaptureUnsupported`` error on the HIP runtime, but not always ``hipSuccess``, to match behavior with CUDA. + +* :cpp:func:`hipDeviceSetMemPool` +* :cpp:func:`hipMemPoolCreate` +* :cpp:func:`hipMemPoolDestroy` +* :cpp:func:`hipDeviceSetSharedMemConfig` +* :cpp:func:`hipDeviceSetCacheConfig` + +The usage of these APIs is restricted during stream capture. No impact if stream capture is working fine on CUDA. + +Error code changes +================== + +The following HIP APIs have been updated to return new or additional error codes to match the corresponding +CUDA APIs. Most existing applications just check if ``hipSuccess`` is returned and no change is needed. +However, if an application checks for a specific error code, the application code may need to be updated +to match/handle the new error code accordingly. + +Module management related APIs +------------------------------ + +Kernel launch APIs +^^^^^^^^^^^^^^^^^^ + +The following APIs have updated implementations: + +* :cpp:func:`hipModuleLaunchKernel` +* :cpp:func:`hipExtModuleLaunchKernel` +* :cpp:func:`hipExtLaunchKernel` +* :cpp:func:`hipDrvLaunchKernelEx` +* :cpp:func:`hipLaunchKernel` +* :cpp:func:`hipLaunchKernelExC` + +More conditional checks are added in the API implementation, and the return errors are added or changed in the following scenarios: + +* If the input stream handle is invalid, the returned error is changed to ``hipErrorContextIsDestroyed`` from ``hipErrorInvalidValue`` +* Adds a grid dimension check, if any input global work size dimension is zero, returns ``hipErrorInvalidValue`` +* Adds extra shared memory size check, if exceeds the size limit, returns ``hipErrorInvalidValue`` +* If the total number of threads per block exceeds the maximum work group limit during a kernel launch, the return value is changed to``hipErrorInvalidConfiguration`` from ``hipErrorInvalidValue`` + +``hipModuleLaunchCooperativeKernel`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Conditions are added in the API implementation of :cpp:func:`hipModuleLaunchCooperativeKernel`, and the returned errors are added in the following scenarios: + +* If the input stream is invalid, returns ``hipErrorContextIsDestroyed``, instead of ``hipErrorInvalidValue`` +* If any grid dimension or block dimension is zero, returns ``hipErrorInvalidValue`` +* If any grid dimension exceeds the maximum dimension limit, or work group size exceeds the maximum size, returns ``hipErrorInvalidConfiguration`` , instead of ``hipErrorInvalidValue`` +* If shared memory size in bytes exceeds the device local memory size per CU, returns ``hipErrorCooperativeLaunchTooLarge`` + +``hipModuleLoad`` +^^^^^^^^^^^^^^^^^^ + +The API updates the negative return of :cpp:func:`hipModuleLoad` to match the CUDA behavior. In cases where the file name exists but the file size is 0, the function returns ``hipErrorInvalidImage`` instead of ``hipErrorInvalidValue``. + +Texture management related APIs +------------------------------- + +The following APIs have updated the return codes to match the CUDA behavior: + +* :cpp:func:`hipTexObjectCreate`, supports zero width and height for 2D image. If either width or height are zero the function will not return ``false``. 
+* :cpp:func:`hipBindTexture2D`, adds an extra check: if the pointer for the texture reference or the device is NULL, returns ``hipErrorNotFound``.
+* :cpp:func:`hipBindTextureToArray`, if a NULL pointer is passed for the texture object, resource descriptor, or texture descriptor, returns the error ``hipErrorInvalidChannelDescriptor``, instead of ``hipErrorInvalidValue``.
+* :cpp:func:`hipGetTextureAlignmentOffset`, adds a return code ``hipErrorInvalidTexture`` when the texture reference pointer is NULL.
+
+Cooperative group related APIs
+-------------------------------
+
+``hipLaunchCooperativeKernelMultiDevice``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Validations are added to the API implementation of :cpp:func:`hipLaunchCooperativeKernelMultiDevice`, as follows:
+
+* If the input launch stream is NULL or is ``hipStreamLegacy``, returns ``hipErrorInvalidResourceHandle``.
+* If stream capture is active, returns the error ``hipErrorStreamCaptureUnsupported``.
+* If the stream capture status is invalidated, returns the error ``hipErrorStreamCaptureInvalidated``.
+* If the total number of threads per block exceeds the maximum work group limit during a kernel launch, the return value is changed to ``hipErrorInvalidConfiguration`` from ``hipErrorInvalidValue``.
+
+``hipLaunchCooperativeKernel``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Validations are added to the API implementation of :cpp:func:`hipLaunchCooperativeKernel`, as follows:
+
+* If the input stream handle is invalid, the returned error is changed to ``hipErrorInvalidHandle`` from ``hipErrorContextIsDestroyed``.
+* If the total number of threads per block exceeds the maximum work group limit during a kernel launch, the return value is changed to ``hipErrorInvalidConfiguration`` from ``hipErrorInvalidValue``.
+
+Invalid stream input parameter handling matches CUDA
+====================================================
+
+In order to match the CUDA runtime behavior more closely, HIP APIs with streams passed as input parameters no longer check the stream validity. Prior to the 7.0 release, the HIP runtime returned the error code ``hipErrorContextIsDestroyed`` if the stream was invalid. In CUDA 12 and later, the equivalent behavior is to raise a segmentation fault. With HIP 7.0, the HIP runtime matches CUDA by causing a segmentation fault.
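The sketch below is an editorial illustration, not part of the patch: it shows the ordering the new behavior assumes, namely that a stream handle must stay valid until all work submitted to it has completed, because an invalid handle is no longer diagnosed through an error code.

.. code-block:: cpp

   // Illustrative only: with HIP 7.0, passing a destroyed stream to the APIs
   // listed below may crash instead of returning hipErrorContextIsDestroyed,
   // so the application has to keep the stream alive until its work is done.
   #include <hip/hip_runtime.h>
   #include <cstdio>

   int main() {
       hipStream_t stream;
       if (hipStreamCreate(&stream) != hipSuccess) return 1;

       void* buffer = nullptr;
       if (hipMallocAsync(&buffer, 1024, stream) != hipSuccess) return 1;
       if (hipFreeAsync(buffer, stream) != hipSuccess) return 1;

       // Drain the stream before destroying it; after hipStreamDestroy the
       // handle must not be passed to any of the APIs listed below.
       if (hipStreamSynchronize(stream) != hipSuccess) return 1;
       if (hipStreamDestroy(stream) != hipSuccess) return 1;

       std::puts("stream work completed");
       return 0;
   }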
The list of APIs impacted by this change is as follows:
+
+* Stream management related APIs
+
+  * :cpp:func:`hipStreamGetCaptureInfo`
+  * :cpp:func:`hipStreamGetPriority`
+  * :cpp:func:`hipStreamGetFlags`
+  * :cpp:func:`hipStreamDestroy`
+  * :cpp:func:`hipStreamAddCallback`
+  * :cpp:func:`hipStreamQuery`
+  * :cpp:func:`hipLaunchHostFunc`
+
+* Graph management related APIs
+
+  * :cpp:func:`hipGraphUpload`
+  * :cpp:func:`hipGraphLaunch`
+  * :cpp:func:`hipStreamBeginCaptureToGraph`
+  * :cpp:func:`hipStreamBeginCapture`
+  * :cpp:func:`hipStreamIsCapturing`
+  * :cpp:func:`hipStreamGetCaptureInfo`
+  * :cpp:func:`hipGraphInstantiateWithParams`
+
+* Memory management related APIs
+
+  * :cpp:func:`hipMemcpyPeerAsync`
+  * :cpp:func:`hipMemcpy2DValidateParams`
+  * :cpp:func:`hipMallocFromPoolAsync`
+  * :cpp:func:`hipFreeAsync`
+  * :cpp:func:`hipMallocAsync`
+  * :cpp:func:`hipMemcpyAsync`
+  * :cpp:func:`hipMemcpyToSymbolAsync`
+  * :cpp:func:`hipStreamAttachMemAsync`
+  * :cpp:func:`hipMemPrefetchAsync`
+  * :cpp:func:`hipDrvMemcpy3D`
+  * :cpp:func:`hipDrvMemcpy3DAsync`
+  * :cpp:func:`hipDrvMemcpy2DUnaligned`
+  * :cpp:func:`hipMemcpyParam2D`
+  * :cpp:func:`hipMemcpyParam2DAsync`
+  * :cpp:func:`hipMemcpy2DArrayToArray`
+  * :cpp:func:`hipMemcpy2D`
+  * :cpp:func:`hipMemcpy2DAsync`
+  * :cpp:func:`hipMemcpy3D`
+
+* Event management related APIs
+
+  * :cpp:func:`hipEventRecord`
+  * :cpp:func:`hipEventRecordWithFlags`
+
+Developers porting CUDA code to HIP no longer need to modify their error handling code. However,
+if you have come to expect the HIP runtime to return the error code ``hipErrorContextIsDestroyed``,
+you might need to adjust your code.
+
+``warpSize`` Change
+===================
+
+To match the CUDA specification, ``warpSize`` is no longer a ``constexpr``.
+In general, this should be a transparent change. However, if an application was using ``warpSize``
+as a compile-time constant, it will have to be updated to handle the new definition.
+For more information, see `warpSize <./how-to/hip_cpp_language_extensions.html#warpsize>`_
+in :doc:`./how-to/hip_cpp_language_extensions`.
diff --git a/docs/index.md b/docs/index.md
index d47962d6fe..d48e8f8e9e 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -10,6 +10,13 @@ The Heterogeneous-computing Interface for Portability (HIP) is a C++ runtime API
 and kernel language that lets you create portable applications for AMD and
 NVIDIA GPUs from a single source code. For more information, see [What is
 HIP?](./what_is_hip)
+```{note}
+The 7.0 release of the HIP API includes backward incompatible changes to make it
+align more closely with NVIDIA CUDA. These change are incompatible with prior releases,
+and may require recompiling existing HIP applications for use in the 7.0 release.
+For more information, see [HIP API 7.0 changes](./hip-7-changes).
+``` + Installation instructions are available from: * [Installing HIP](./install/install) diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 35ed57f0b6..5be5d90049 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -7,6 +7,7 @@ root: index subtrees: - entries: - file: what_is_hip + - file: hip-7-changes - file: faq - caption: Install From ff6ebfc537a03fbeffa2a0bf72127fa02a3f521b Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Tue, 29 Jul 2025 21:35:41 +0200 Subject: [PATCH 31/36] Temporary fix for install broken links --- docs/install/install.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/install/install.rst b/docs/install/install.rst index 522c935edc..fec897168a 100644 --- a/docs/install/install.rst +++ b/docs/install/install.rst @@ -27,8 +27,8 @@ Prerequisites Refer to the Prerequisites section in the ROCm install guides: - * :doc:`rocm-install-on-linux:reference/system-requirements` - * :doc:`rocm-install-on-windows:reference/system-requirements` + * `System requirements (Linux) `_ + * `System requirements (Windows) `_ .. tab-item:: NVIDIA :sync: nvidia @@ -48,8 +48,8 @@ Installation HIP is automatically installed during the ROCm installation. If you haven't yet installed ROCm, you can find installation instructions here: - * :doc:`rocm-install-on-linux:index` - * :doc:`rocm-install-on-windows:index` + * `ROCm installation for Linux `_ + * `HIP SDK installation for Windows `_ By default, HIP is installed into ``/opt/rocm``. From 60a1306f9aa87f4aea43bfd4c70d1ce961bdf9b0 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Mon, 28 Jul 2025 18:39:32 -0700 Subject: [PATCH 32/36] Update hipgraph.rst Remove Beta warning on hipGRAPH topic --- docs/how-to/hip_runtime_api/hipgraph.rst | 6 ------ 1 file changed, 6 deletions(-) diff --git a/docs/how-to/hip_runtime_api/hipgraph.rst b/docs/how-to/hip_runtime_api/hipgraph.rst index 01c036af2b..7f36a7d373 100644 --- a/docs/how-to/hip_runtime_api/hipgraph.rst +++ b/docs/how-to/hip_runtime_api/hipgraph.rst @@ -8,12 +8,6 @@ HIP graphs ******************************************************************************** -.. note:: - The HIP graph API is currently in Beta. Some features can change and might - have outstanding issues. Not all features supported by CUDA graphs are yet - supported. For a list of all currently supported functions see the - :ref:`HIP graph API documentation`. - HIP graphs are an alternative way of executing tasks on a GPU that can provide performance benefits over launching kernels using the standard method via streams. A HIP graph is made up of nodes and edges. 
The nodes of a From c0e0b01ab65f64e0ec7f8c48633a30e6fcfd585f Mon Sep 17 00:00:00 2001 From: Adel Johar Date: Mon, 28 Jul 2025 16:28:41 +0200 Subject: [PATCH 33/36] Docs: Update unified memory management page --- .../unified_memory/um.drawio | 400 +++++++++--------- .../memory_management/unified_memory/um.svg | 18 +- .../memory_management/unified_memory.rst | 81 ++-- 3 files changed, 271 insertions(+), 228 deletions(-) diff --git a/docs/data/how-to/hip_runtime_api/memory_management/unified_memory/um.drawio b/docs/data/how-to/hip_runtime_api/memory_management/unified_memory/um.drawio index 1deeca61f5..c041156c5b 100644 --- a/docs/data/how-to/hip_runtime_api/memory_management/unified_memory/um.drawio +++ b/docs/data/how-to/hip_runtime_api/memory_management/unified_memory/um.drawio @@ -1,6 +1,6 @@ - + @@ -77,7 +77,7 @@ - + @@ -101,7 +101,7 @@ - + @@ -949,10 +949,10 @@ - + - + @@ -961,919 +961,931 @@ - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + + + + + + + + + + + + + diff --git a/docs/data/how-to/hip_runtime_api/memory_management/unified_memory/um.svg b/docs/data/how-to/hip_runtime_api/memory_management/unified_memory/um.svg index 83accc3b27..51e617d759 100644 --- a/docs/data/how-to/hip_runtime_api/memory_management/unified_memory/um.svg +++ b/docs/data/how-to/hip_runtime_api/memory_management/unified_memory/um.svg @@ -1,9 +1,9 @@ -Explicit Memory Management
CPU cores
CPU cores -
CPU
CPU -
GPU Memory
(HBM)
GPU Memory...
Unified Memory Management
Unified Memory
Unified Memory
CPU Memory (RAM)
CPU Memory (RAM)
GPU
GPU -
GPU cores
GPU cores -
GPU
GPU -
CPU cores
CPU cores -
CPU
CPU -
GPU cores
GPU cores -
Text is not SVG - cannot display
\ No newline at end of file +Explicit Memory Management
CPU cores
CPU cores +
CPU
CPU +
GPU Memory
(HBM)
GPU Memory...
Unified Memory Management
Unified Memory
Unified Memory
CPU Memory (RAM)
CPU Memory (RAM)
GPU
GPU +
GPU cores
GPU cores +
GPU
GPU +
CPU cores
CPU cores +
CPU
CPU +
GPU cores
GPU cores +
Text is not SVG - cannot display
\ No newline at end of file diff --git a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst index 2f0a469e83..ac7bba454e 100644 --- a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst +++ b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst @@ -9,6 +9,14 @@ Unified memory management ******************************************************************************* +This document covers unified memory management in HIP, which encompasses several +approaches that provide a single address space accessible from both CPU and GPU. +**Unified memory** refers to the overall architectural concept of this shared +address space, while **managed memory** is one specific implementation that +provides automatic page migration between devices. Other unified memory allocators +like :cpp:func:`hipMalloc()` and :cpp:func:`hipHostMalloc()` provide different +access patterns within the same unified address space concept. + In conventional architectures CPUs and attached devices have their own memory space and dedicated physical memory backing it up, e.g. normal RAM for CPUs and VRAM on GPUs. This way each device can have physical memory optimized for its @@ -39,7 +47,8 @@ model is shown in the following figure. Unified memory enables the access to memory located on other devices via several methods, depending on whether hardware support is available or has to be -managed by the driver. +managed by the driver. CPUs can access memory allocated via :cpp:func:`hipMalloc()`, +providing bidirectional memory accessibility within the unified address space. Managed memory ================================================================================ @@ -81,8 +90,8 @@ Note that on systems which do not support page-faults, managed memory APIs are still accessible to the programmer, but managed memory operates in a degraded fashion due to the lack of demand-driven migration. Furthermore, on these systems it is still possible to use unified memory allocators that do not -provide managed memory features; see :ref:`unified memory allocators` for -more details. +provide managed memory features; see +:ref:`memory allocation approaches in unified memory` for more details. Managed memory is supported on Linux by all modern AMD GPUs from the Vega series onward, as shown in the following table. Managed memory can be @@ -107,7 +116,7 @@ allocators (e.g., ``new``, ``malloc()``) can be used. * - Architecture - :cpp:func:`hipMallocManaged()`, ``__managed__`` - - ``new``, ``malloc()`` + - ``new``, ``malloc()``, ``allocate()`` * - CDNA4 - ✅ - ✅ :sup:`1` @@ -134,11 +143,16 @@ allocators (e.g., ``new``, ``malloc()``) can be used. :sup:`1` Works only with ``HSA_XNACK=1`` and kernels with HMM support. First GPU access causes recoverable page-fault. -.. _unified memory allocators: +.. _memory allocation approaches in unified memory: -Unified memory allocators +Memory allocation approaches in unified memory ================================================================================ +While managed memory provides automatic migration, unified memory encompasses +several allocation methods, each with different access patterns and migration +behaviors. The following section covers all available unified memory allocation +approaches, including but not limited to managed memory APIs. + Support for the different unified memory allocators depends on the GPU architecture and on the system. 
For more information, see :ref:`unified memory system requirements` and :ref:`checking unified memory support`. @@ -154,10 +168,11 @@ system requirements` and :ref:`checking unified memory support`. - **System allocated unified memory** - Starting with CDNA2, the ``new`` and ``malloc()`` system allocators allow + Starting with CDNA2, the ``new``, ``malloc()``, and ``allocate()`` (Fortran) system allocators allow you to reserve unified memory. The system allocator is more versatile and offers an easy transition for code written for CPUs to HIP code as the - same system allocation API is used. + same system allocation API is used. Memory allocated by these allocators can + be registered to be accessible on device using :cpp:func:`hipHostRegister()`. - **HIP allocated non-managed memory** @@ -181,10 +196,10 @@ functions on ROCm and CUDA, both with and without HMM support. - Access outside the origin without HMM or ``HSA_XNACK=0`` - Allocation origin with HMM and ``HSA_XNACK=1`` - Access outside the origin with HMM and ``HSA_XNACK=1`` - * - ``new``, ``malloc()`` + * - ``new``, ``malloc()``, ``allocate()`` - host - not accessible on device - - host + - first touch - page-fault migration * - :cpp:func:`hipMalloc()` - device @@ -194,13 +209,13 @@ functions on ROCm and CUDA, both with and without HMM support. * - :cpp:func:`hipMallocManaged()`, ``__managed__`` - pinned host - zero copy [zc]_ - - host + - first touch - page-fault migration * - :cpp:func:`hipHostRegister()` - pinned host - zero copy [zc]_ - - undefined behavior - - undefined behavior + - pinned host + - zero copy [zc]_ * - :cpp:func:`hipHostMalloc()` - pinned host - zero copy [zc]_ @@ -256,9 +271,9 @@ functions on ROCm and CUDA, both with and without HMM support. Checking unified memory support -------------------------------------------------------------------------------- -The following device attributes can offer information about which :ref:`unified -memory allocators` are supported. The attribute value is 1 if the functionality -is supported, and 0 if it is not supported. +The following device attributes can offer information about which :ref:`memory +allocation approaches in unified memory` are supported. The attribute value is +1 if the functionality is supported, and 0 if it is not supported. .. list-table:: Device attributes for unified memory management :widths: 40, 60 @@ -384,10 +399,11 @@ explicit memory management example is presented in the last tab. .. tab-item:: new .. code-block:: cpp - :emphasize-lines: 20-23 + :emphasize-lines: 21-24 #include #include + #include #define HIP_CHECK(expression) \ { \ @@ -406,10 +422,10 @@ explicit memory management example is presented in the last tab. // This example requires HMM support and the environment variable HSA_XNACK needs to be set to 1 int main() { - // Allocate memory for a, b, and c. - int *a = new int[1]; - int *b = new int[1]; - int *c = new int[1]; + // Allocate memory with proper alignment for performance + int *a = new(std::align_val_t(128)) int[1]; + int *b = new(std::align_val_t(128)) int[1]; + int *c = new(std::align_val_t(128)) int[1]; // Setup input values. *a = 1; @@ -424,10 +440,10 @@ explicit memory management example is presented in the last tab. // Prints the result. std::cout << *a << " + " << *b << " = " << *c << std::endl; - // Cleanup allocated memory. - delete[] a; - delete[] b; - delete[] c; + // Cleanup allocated memory with matching aligned delete. 
+ ::operator delete[](a, std::align_val_t(128)); + ::operator delete[](b, std::align_val_t(128)); + ::operator delete[](c, std::align_val_t(128)); return 0; } @@ -521,9 +537,24 @@ Performance optimizations for unified memory There are several ways, in which the developer can guide the runtime to reduce copies between devices, in order to improve performance. +With ``numactl --membind`` bindings, developers can control where physical +allocation occurs by restricting memory allocation to specific NUMA nodes. +This approach can reduce or eliminate the need for explicit data prefetching +since memory is allocated in the desired location from the start. + Data prefetching -------------------------------------------------------------------------------- +.. warning:: + Data prefetching is not always an optimization and can slow down execution, + as the API takes time to execute. If the memory is already in the right + place, prefetching will waste time. Users should profile their code to + verify whether prefetching is beneficial for their specific use case. + +When prefetching is beneficial, developers can consider setting different default +locations for different devices and using prefetch between them, which can help +eliminate IPC communication overhead when memory moves between devices. + Data prefetching is a technique used to improve the performance of your application by moving data to the desired device before it's actually needed. ``hipCpuDeviceId`` is a special constant to specify the CPU as target. From f25e829349680b0463ba597c858ed0b992dad599 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Mon, 18 Aug 2025 18:12:16 -0700 Subject: [PATCH 34/36] Update hip-7-changes.rst Implement Leo feedback --- docs/hip-7-changes.rst | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/hip-7-changes.rst b/docs/hip-7-changes.rst index 2c0520e2c9..3609e847dc 100644 --- a/docs/hip-7-changes.rst +++ b/docs/hip-7-changes.rst @@ -306,7 +306,6 @@ In order to match the CUDA runtime behavior more closely, HIP APIs with streams * Memory management related APIs * :cpp:func:`hipMemcpyPeerAsync` - * :cpp:func:`hipMemcpy2DValidateParams` * :cpp:func:`hipMallocFromPoolAsync` * :cpp:func:`hipFreeAsync` * :cpp:func:`hipMallocAsync` @@ -334,8 +333,8 @@ Developers porting CUDA code to HIP no longer need to modify their error handlin if you have come to expect the HIP runtime to return the error code ``hipErrorContextIsDestroyed``, you might need to adjust your code. -``warpSize`` Change -=================== +warpSize Change +=============== To match the CUDA specification, ``warpSize`` is no longer a ``constexpr``. In general, this should be a transparent change. However, if an application was using ``warpSize`` From 4414edea4a2aae76f378b28177468dca217d711e Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Thu, 7 Aug 2025 16:18:54 +0200 Subject: [PATCH 35/36] SAXPY tutorial: roc-obj replace with llvm-objdump Apply suggestions from code review Co-authored-by: Jan Stephan --- docs/tutorial/saxpy.rst | 95 ++++++++++++++++++++--------------------- 1 file changed, 47 insertions(+), 48 deletions(-) diff --git a/docs/tutorial/saxpy.rst b/docs/tutorial/saxpy.rst index c3dc766102..4242eba5f5 100644 --- a/docs/tutorial/saxpy.rst +++ b/docs/tutorial/saxpy.rst @@ -348,89 +348,88 @@ find out what device binary flavors are embedded into the executable? artifacts on disk. 
Add the ROCmCC installation folder to your PATH if you want to use these utilities (the utilities expect them to be on the PATH). - You can list embedded program binaries using ``roc-obj-ls``. + You can list embedded program binaries using ``llvm-objdump`` with + ``--offloading`` option. .. code-block:: bash - roc-obj-ls ./saxpy + llvm-objdump --offloading ./saxpy It should return something like: .. code-block:: shell - 1 host-x86_64-unknown-linux file://./saxpy#offset=12288&size=0 - 1 hipv4-amdgcn-amd-amdhsa--gfx803 file://./saxpy#offset=12288&size=9760 + ./saxpy: file format elf64-x86-64 + Extracting offload bundle: ./saxpy.0.host-x86_64-unknown-linux-gnu- + Extracting offload bundle: ./saxpy.0.hipv4-amdgcn-amd-amdhsa--gfx942 The compiler embeds a version 4 code object (more on `code object versions `_) - and used the LLVM target triple `amdgcn-amd-amdhsa--gfx803` (more on `target triples + and used the LLVM target triple ``amdgcn-amd-amdhsa--gfx942`` (more on `target triples `_). You can extract that program object in a disassembled fashion for human consumption - via ``roc-obj``. + via ``llvm-objdump``. .. code-block:: bash - roc-obj -t gfx803 -d ./saxpy + llvm-objdump --disassemble saxpy.0.hipv4-amdgcn-amd-amdhsa--gfx942 > saxpy.s - This creates two files on disk and ``.s`` extension is of most interest. - Opening this file or dumping it to the console using ``cat`` - lets find the disassembled binary of the SAXPY compute kernel, something - similar to: + This creates a file on the disk called ``saxpy.s`` Opening this file or + dumping it to the console using ``cat`` lets find the disassembled binary of + the SAXPY compute kernel, something similar to: .. code-block:: + saxpy.0.hipv4-amdgcn-amd-amdhsa--gfx942: file format elf64-amdgpu + Disassembly of section .text: - <_Z12saxpy_kernelfPKfPfj>: - s_load_dword s0, s[4:5], 0x2c // 000000001000: C0020002 0000002C - s_load_dword s1, s[4:5], 0x18 // 000000001008: C0020042 00000018 - s_waitcnt lgkmcnt(0) // 000000001010: BF8C007F - s_and_b32 s0, s0, 0xffff // 000000001014: 8600FF00 0000FFFF - s_mul_i32 s6, s6, s0 // 00000000101C: 92060006 - v_add_u32_e32 v0, vcc, s6, v0 // 000000001020: 32000006 - v_cmp_gt_u32_e32 vcc, s1, v0 // 000000001024: 7D980001 - s_and_saveexec_b64 s[0:1], vcc // 000000001028: BE80206A - s_cbranch_execz 22 // 00000000102C: BF880016 <_Z12saxpy_kernelfPKfPfj+0x88> - s_load_dwordx4 s[0:3], s[4:5], 0x8 // 000000001030: C00A0002 00000008 - v_mov_b32_e32 v1, 0 // 000000001038: 7E020280 - v_lshlrev_b64 v[0:1], 2, v[0:1] // 00000000103C: D28F0000 00020082 - s_waitcnt lgkmcnt(0) // 000000001044: BF8C007F - v_mov_b32_e32 v3, s1 // 000000001048: 7E060201 - v_add_u32_e32 v2, vcc, s0, v0 // 00000000104C: 32040000 - v_addc_u32_e32 v3, vcc, v3, v1, vcc // 000000001050: 38060303 - flat_load_dword v2, v[2:3] // 000000001054: DC500000 02000002 - v_mov_b32_e32 v3, s3 // 00000000105C: 7E060203 - v_add_u32_e32 v0, vcc, s2, v0 // 000000001060: 32000002 - v_addc_u32_e32 v1, vcc, v3, v1, vcc // 000000001064: 38020303 - flat_load_dword v3, v[0:1] // 000000001068: DC500000 03000000 - s_load_dword s0, s[4:5], 0x0 // 000000001070: C0020002 00000000 - s_waitcnt vmcnt(0) lgkmcnt(0) // 000000001078: BF8C0070 - v_mac_f32_e32 v3, s0, v2 // 00000000107C: 2C060400 - flat_store_dword v[0:1], v3 // 000000001080: DC700000 00000300 - s_endpgm // 000000001088: BF810000 + 0000000000001900 <_Z12saxpy_kernelfPKfPfj>: + s_load_dword s3, s[0:1], 0x2c // 000000001900: C00200C0 0000002C + s_load_dword s4, s[0:1], 0x18 // 000000001908: C0020100 00000018 + 
s_waitcnt lgkmcnt(0) // 000000001910: BF8CC07F + s_and_b32 s3, s3, 0xffff // 000000001914: 8603FF03 0000FFFF + s_mul_i32 s2, s2, s3 // 00000000191C: 92020302 + v_add_u32_e32 v0, s2, v0 // 000000001920: 68000002 + v_cmp_gt_u32_e32 vcc, s4, v0 // 000000001924: 7D980004 + s_and_saveexec_b64 s[2:3], vcc // 000000001928: BE82206A + s_cbranch_execz 20 // 00000000192C: BF880014 <_Z12saxpy_kernelfPKfPfj+0x80> + s_load_dwordx4 s[4:7], s[0:1], 0x8 // 000000001930: C00A0100 00000008 + v_mov_b32_e32 v1, 0 // 000000001938: 7E020280 + v_lshlrev_b64 v[0:1], 2, v[0:1] // 00000000193C: D28F0000 00020082 + s_load_dword s0, s[0:1], 0x0 // 000000001944: C0020000 00000000 + s_waitcnt lgkmcnt(0) // 00000000194C: BF8CC07F + v_lshl_add_u64 v[2:3], s[4:5], 0, v[0:1] // 000000001950: D2080002 04010004 + v_lshl_add_u64 v[0:1], s[6:7], 0, v[0:1] // 000000001958: D2080000 04010006 + global_load_dword v4, v[2:3], off // 000000001960: DC508000 047F0002 + global_load_dword v5, v[0:1], off // 000000001968: DC508000 057F0000 + s_waitcnt vmcnt(0) // 000000001970: BF8C0F70 + v_fmac_f32_e32 v5, s0, v4 // 000000001974: 760A0800 + global_store_dword v[0:1], v5, off // 000000001978: DC708000 007F0500 + s_endpgm // 000000001980: BF810000 + s_nop 0 // 000000001984: BF800000 Alternatively, call the compiler with ``--save-temps`` to dump all device binary to disk in separate files. .. code-block:: bash - amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 --save-temps + amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 --save-temps --offload-arch=gfx942 List all the temporaries created while compiling ``main.hip`` with: .. code-block:: bash ls main-hip-amdgcn-amd-amdhsa-* - main-hip-amdgcn-amd-amdhsa-gfx803.bc - main-hip-amdgcn-amd-amdhsa-gfx803.cui - main-hip-amdgcn-amd-amdhsa-gfx803.o - main-hip-amdgcn-amd-amdhsa-gfx803.out - main-hip-amdgcn-amd-amdhsa-gfx803.out.resolution.txt - main-hip-amdgcn-amd-amdhsa-gfx803.s - + main-hip-amdgcn-amd-amdhsa-gfx942.bc + main-hip-amdgcn-amd-amdhsa-gfx942.o + main-hip-amdgcn-amd-amdhsa-gfx942.out.resolution.txt + main-hip-amdgcn-amd-amdhsa-gfx942.hipi + main-hip-amdgcn-amd-amdhsa-gfx942.out + main-hip-amdgcn-amd-amdhsa-gfx942.s Files with the ``.s`` extension hold the disassembled contents of the binary. The filename notes the graphics IPs used by the compiler. The contents of - this file are similar to what ``roc-obj`` printed to the console. + this file are similar to the `*.s` file created with ``llvm-objdump`` earlier. .. tab-item:: Linux and NVIDIA :sync: linux-nvidia @@ -491,7 +490,7 @@ find out what device binary flavors are embedded into the executable? We can see that the compiler embedded a version 4 code object (more on code `object versions `_) and - used the LLVM target triple `amdgcn-amd-amdhsa--gfx906` (more on `target triples + used the LLVM target triple ``amdgcn-amd-amdhsa--gfx906`` (more on `target triples `_). Don't be alarmed about linux showing up as a binary format, AMDGPU binaries uploaded to the GPU for execution are proper linux ELF binaries in their format. 
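As a complement to the inspection steps above, the architecture of the device that is actually present can be queried at runtime and compared with the offload architecture (``gfx942`` in the listings). The snippet below is an editorial illustration and not part of the tutorial.

.. code-block:: cpp

   // Illustrative sketch: print the architecture name of device 0 so it can be
   // compared against the --offload-arch value used at compile time.
   #include <hip/hip_runtime.h>
   #include <iostream>

   int main() {
       hipDeviceProp_t props{};
       if (hipGetDeviceProperties(&props, 0) != hipSuccess) {
           std::cerr << "hipGetDeviceProperties failed" << std::endl;
           return 1;
       }
       // gcnArchName holds a string such as "gfx942:sramecc+:xnack-".
       std::cout << "Device 0 architecture: " << props.gcnArchName << std::endl;
       return 0;
   }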
From d3be919d4841dd9541b2b69413df2233215a7d52 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Tue, 19 Aug 2025 12:55:59 -0700 Subject: [PATCH 36/36] Add HIP 7.0 changes to documentation --- docs/hip-7-changes.rst | 20 ++--- docs/how-to/hip_porting_guide.rst | 2 + docs/how-to/hip_rtc.rst | 14 ++++ .../how-to/hip_runtime_api/error_handling.rst | 8 +- .../memory_management/host_memory.rst | 2 + docs/index.md | 8 +- docs/reference/complex_math_api.rst | 7 ++ docs/reference/error_codes.rst | 73 ++++++++++++++++++- 8 files changed, 109 insertions(+), 25 deletions(-) diff --git a/docs/hip-7-changes.rst b/docs/hip-7-changes.rst index 3609e847dc..cb6c6e693c 100644 --- a/docs/hip-7-changes.rst +++ b/docs/hip-7-changes.rst @@ -8,7 +8,7 @@ HIP API 7.0 changes ******************************************************************************* -To improve code portability between AMD and NVIDIA GPU programming models specific changes were made to the HIP API in the 7.0 release to simplify cross-platform programming. These changes align HIP C++ even more closely with NVIDIA CUDA. These changes are incompatible with prior releases, and might require recompiling existing HIP applications for use in the 7.0 release, or editing and recompiling code in some cases. In the best case, the change requires no modification of existing applications. These changes were made available in a preview release based on the 6.4.1 release, and as such, hopefully, you have had advanced notice and prepared for the following changes. +To improve code portability between AMD and NVIDIA GPU programming models, changes were made to the HIP API in ROCm 7.0 to simplify cross-platform programming. These changes align HIP C++ even more closely with NVIDIA CUDA. These changes are incompatible with prior releases, and might require recompiling existing HIP applications for use with ROCm 7.0, or editing and recompiling code in some cases. In the best case, the change requires no modification of existing applications. These changes were made available in a preview release based on ROCm 6.4.1 to help you prepare. Behavior changes in HIP Runtime API =================================== @@ -163,23 +163,13 @@ Stream capture updates Restrict stream capture modes ----------------------------- -Stream capture mode has been restricted in HIP APIs through the addition of the macro ``CHECK_STREAM_CAPTURE_SUPPORTED``. - -In the HIP enumeration ``hipStreamCaptureMode``, three capture modes were previously supported: - -* Global -* ThreadLocal -* Relaxed - -As of the 7.0 release, when checking with the ``CHECK_STREAM_CAPTURE_SUPPORTED`` macro the only supported stream capture mode is ``hipStreamCaptureModeRelaxed``. The rest are not supported, and the macro will return ``hipErrorStreamCaptureUnsupported``. - -This change matches the behavior of CUDA. There is no impact on any application if stream capture works correctly on the CUDA platform. However, in the HIP runtime the API will return ``hipErrorStreamCaptureUnsupported`` on unsupported stream capture modes. - -This update involves the following APIs. They are allowed only in relaxed stream capture mode. Not all three capture modes. +Stream capture mode has been restricted in the following APIs to relaxed (``hipStreamCaptureModeRelaxed``) mode: * :cpp:func:`hipMallocManaged` * :cpp:func:`hipMemAdvise` +These APIs are allowed only in relaxed stream capture mode. 
If the functions are used with stream capture in an unsupported mode, the HIP runtime returns ``hipErrorStreamCaptureUnsupported``.
+
 Check stream capture mode
 -------------------------
@@ -230,7 +220,7 @@ More conditional checks are added in the API implementation, and the return erro
 * If the input stream handle is invalid, the returned error is changed to ``hipErrorContextIsDestroyed`` from ``hipErrorInvalidValue``
 * Adds a grid dimension check, if any input global work size dimension is zero, returns ``hipErrorInvalidValue``
 * Adds extra shared memory size check, if exceeds the size limit, returns ``hipErrorInvalidValue``
-* If the total number of threads per block exceeds the maximum work group limit during a kernel launch, the return value is changed to``hipErrorInvalidConfiguration`` from ``hipErrorInvalidValue``
+* If the total number of threads per block exceeds the maximum work group limit during a kernel launch, the return value is changed to ``hipErrorInvalidConfiguration`` from ``hipErrorInvalidValue``
 
 ``hipModuleLaunchCooperativeKernel``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/docs/how-to/hip_porting_guide.rst b/docs/how-to/hip_porting_guide.rst
index 410c17a76f..f4a1d582eb 100644
--- a/docs/how-to/hip_porting_guide.rst
+++ b/docs/how-to/hip_porting_guide.rst
@@ -29,6 +29,8 @@ error code spaces:
 General Tips
 --------------------------------------------------------------------------------
 
+* ``hipDeviceptr_t`` is a ``void*`` and treated like a raw pointer, while ``CUdeviceptr``
+  is an unsigned integer type and treated as a device memory handle.
 * Starting to port on an NVIDIA machine is often the easiest approach, as the
   code can be tested for functionality and performance even if not fully ported
   to HIP.
diff --git a/docs/how-to/hip_rtc.rst b/docs/how-to/hip_rtc.rst
index 861e07fec8..f188c22839 100644
--- a/docs/how-to/hip_rtc.rst
+++ b/docs/how-to/hip_rtc.rst
@@ -14,6 +14,8 @@ alongside options to guide the compilation.
 
 .. note::
 
+   * Device code compilation via HIPRTC uses the ``__hip_internal`` namespace instead
+     of the ``std`` namespace to avoid namespace collision.
    * This library can be used for compilation on systems without AMD GPU drivers
      installed (offline compilation). However, running the compiled code still
      requires both the HIP runtime library and GPU drivers on the target system.
@@ -35,6 +37,11 @@ To use HIPRTC functionality the header needs to be included:
 
    #include <hip/hiprtc.h>
 
+.. note::
+
+   Prior to the 7.0 release, the HIP runtime included the hipRTC library. With the 7.0
+   release, the library is separate and must be included explicitly, as shown above.
+
 Kernels can be stored in a string:
 
 .. code-block:: cpp
@@ -255,6 +262,13 @@ The full example is below:
      HIP_CHECK(hipFree(doutput));
 }
 
+.. note::
+
+   Some applications define datatypes such as ``int64_t``, ``uint64_t``, ``int32_t``, and ``uint32_t``
+   that could lead to conflicts when integrating with ``hipRTC``. To resolve these conflicts, these
+   datatypes are replaced with HIP-specific internal datatypes prefixed with ``__hip``. For example,
+   ``int64_t`` is replaced by ``__hip_int64_t``.
+
 HIPRTC specific options
 ===============================================================================
 
diff --git a/docs/how-to/hip_runtime_api/error_handling.rst b/docs/how-to/hip_runtime_api/error_handling.rst
index b860f639f4..b78df92e2d 100644
--- a/docs/how-to/hip_runtime_api/error_handling.rst
+++ b/docs/how-to/hip_runtime_api/error_handling.rst
@@ -21,11 +21,9 @@ without changing it.
To get a human readable version of the errors, .. note:: - :cpp:func:`hipGetLastError` returns the returned error code of the last HIP - runtime API call even if it's ``hipSuccess``, while ``cudaGetLastError`` - returns the error returned by any of the preceding CUDA APIs in the same - host thread. :cpp:func:`hipGetLastError` behavior will be matched with - ``cudaGetLastError`` in ROCm release 7.0. + :cpp:func:`hipGetLastError` returns the last actual HIP API error caught in the current thread + during the application execution. Prior to ROCm 7.0, ``hipGetLastError`` might also return + ``hipSuccess`` or ``hipErrorNotReady`` from the last HIP runtime API call, which are not errors. Best practices of HIP error handling: diff --git a/docs/how-to/hip_runtime_api/memory_management/host_memory.rst b/docs/how-to/hip_runtime_api/memory_management/host_memory.rst index cee9872413..01f00ce555 100644 --- a/docs/how-to/hip_runtime_api/memory_management/host_memory.rst +++ b/docs/how-to/hip_runtime_api/memory_management/host_memory.rst @@ -108,6 +108,8 @@ C++ application. :cpp:func:`hipMalloc` and :cpp:func:`hipFree` are blocking calls. However, HIP also provides non-blocking versions :cpp:func:`hipMallocAsync` and :cpp:func:`hipFreeAsync`, which require a stream as an additional argument. + For asynchronous memory allocations made with ``hipMallocAsync`` and ``hipMallocFromPoolAsync`` + ``hipFree`` does not implicitly wait for synchronization, to match the behavior of ``cudaFree``. .. _pinned_host_memory: diff --git a/docs/index.md b/docs/index.md index d48e8f8e9e..744d525b3e 100644 --- a/docs/index.md +++ b/docs/index.md @@ -11,10 +11,10 @@ and kernel language that lets you create portable applications for AMD and NVIDIA GPUs from a single source code. For more information, see [What is HIP?](./what_is_hip) ```{note} -The 7.0 release of the HIP API includes backward incompatible changes to make it -align more closely with NVIDIA CUDA. These change are incompatible with prior releases, -and may require recompiling existing HIP applications for use in the 7.0 release. -For more information, see [HIP API 7.0 changes](./hip-7-changes). +HIP API 7.0 introduces changes to make it align more closely with NVIDIA CUDA. +These changes are incompatible with prior releases, and might require recompiling +existing HIP applications for use with the ROCm 7.0 release. For more information, +see [HIP API 7.0 changes](./hip-7-changes). ``` Installation instructions are available from: diff --git a/docs/reference/complex_math_api.rst b/docs/reference/complex_math_api.rst index 5306efec42..b65cc2e484 100644 --- a/docs/reference/complex_math_api.rst +++ b/docs/reference/complex_math_api.rst @@ -43,6 +43,13 @@ in both single and double precision formats. Complex Number Functions ======================== +.. note:: + + Changes have been made to small vector constructors for ``hipComplex`` and ``hipFloatComplex`` + initialization, such as ``float2`` and ``int4``. If your code previously relied + on a single value to initialize all components within a vector or complex type, you might need + to update your code. + A comprehensive collection of functions for creating and manipulating complex numbers, organized by functional categories for easy reference. 
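To make the constructor note above concrete, the following snippet is an editorial illustration (not part of the change set): spelling out every component explicitly is one way to keep vector and complex initialization identical on HIP and CUDA, regardless of the constructor change.

.. code-block:: cpp

   // Illustrative only: initialize each component explicitly rather than
   // relying on a single value being broadcast to all components.
   #include <hip/hip_runtime.h>
   #include <hip/hip_complex.h>
   #include <iostream>

   int main() {
       float2 v = make_float2(1.0f, 1.0f);                   // both components explicit
       hipFloatComplex z = make_hipFloatComplex(1.0f, 0.0f); // real, imaginary

       std::cout << v.x << ", " << v.y << std::endl;
       std::cout << hipCrealf(z) << " + " << hipCimagf(z) << "i" << std::endl;
       return 0;
   }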
diff --git a/docs/reference/error_codes.rst b/docs/reference/error_codes.rst index d46127c44c..047f7b5c69 100644 --- a/docs/reference/error_codes.rst +++ b/docs/reference/error_codes.rst @@ -13,6 +13,8 @@ returned by HIP API functions to indicate various runtime conditions and errors. For more details, see :ref:`Error handling functions `. +.. _basic_runtime_errors: + Basic Runtime Errors ==================== @@ -100,6 +102,8 @@ Basic Runtime Errors If this error is encountered, it generally means the API or feature is not fully supported in the current version. +.. _memory_management_errors: + Memory Management Errors ======================== @@ -139,6 +143,14 @@ Memory Management Errors - ``1052`` - Runtime memory call returned error + * - :term:`hipErrorInvalidChannelDescriptor` + - ``911`` + - Input for texture object, resource descriptor, or texture descriptor is a NULL pointer or invalid + + * - :term:`hipErrorInvalidTexture` + - ``912`` + - Texture reference pointer is NULL or invalid + .. glossary:: hipErrorOutOfMemory @@ -233,6 +245,21 @@ Memory Management Errors This error differs from ``hipErrorOutOfMemory`` in that it relates to memory operations internal to the HIP runtime rather than explicit application requests for memory allocation. + hipErrorInvalidChannelDescriptor + + This error indicates that an invalid channel descriptor is used to define the format and layout of data + in memory, particularly when working with textures or arrays. This could happen if the descriptor is + incorrectly set up or if it does not match the expected format for the operation being performed. + + hipErrorInvalidTexture + + The error code is returned when an invalid texture object is used in a function call. This typically + occurs when a texture object is not properly initialized or configured before being used in operations + that require valid texture data. If you encounter this error, it suggests that the texture object + might be missing necessary configuration details or has been corrupted. + +.. _device_context_errors: + Device and Context Errors ========================= @@ -385,6 +412,8 @@ Device and Context Errors * Custom build environments with mismatched components * Partial upgrades of the ROCm stack +.. _kernel_launch_errors: + Kernel and Launch Errors ======================== @@ -396,10 +425,18 @@ Kernel and Launch Errors - Value - Description + * - :term:`hipErrorInvalidValue`` + - ``1`` + - Invalid input value + * - :term:`hipErrorInvalidDeviceFunction` - ``98`` - Invalid device function + * - :term:`hipErrorContextIsDestroyed` + - ``709`` + - Invalid stream handle + * - :term:`hipErrorInvalidConfiguration` - ``9`` - Invalid configuration argument @@ -446,6 +483,11 @@ Kernel and Launch Errors .. glossary:: + hipErrorInvalidValue + + Error returned when a grid dimension check finds any input global work size + dimension is zero, or a shared memory size check finds the size exceeds the size limit. + hipErrorInvalidDeviceFunction Invalid device function. This error occurs when attempting to use a function that is not a valid device @@ -453,6 +495,10 @@ Kernel and Launch Errors * Code compiled for a specific GPU architecture (using ``--offload-arch``) but executed on an different/incompatible GPU + hipErrorContextIsDestroyed + + This error is returned when the input stream or input stream handle is invalid. + hipErrorInvalidConfiguration Invalid configuration argument. 
This error occurs when the configuration specified for a kernel launch @@ -507,7 +553,7 @@ Kernel and Launch Errors hipErrorInvalidKernelFile Invalid kernel file. This error occurs when the kernel file or module being loaded is corrupted or in - an invalid format. + an invalid format, for example the file name exists but the file size is 0. hipErrorInvalidImage @@ -556,6 +602,7 @@ Kernel and Launch Errors * Launching a cooperative kernel with grid dimensions that exceed hardware limits * Requesting more resources than available for synchronization across thread blocks + * The shared memory size in bytes exceeds the device local memory size per CU * Using cooperative groups on hardware with limited support * Not accounting for cooperative launch limitations in kernel configuration @@ -577,6 +624,8 @@ Kernel and Launch Errors normal operation. Additional debugging of the previous failed launch may be required to identify the root cause. +.. _stream_capture_errors: + Stream Capture Errors ===================== @@ -624,6 +673,10 @@ Stream Capture Errors - ``907`` - Operation not permitted on an event last recorded in a capturing stream + * - :term:`hipErrorInvalidResourceHandle` + - ``400`` + - Input launch stream is ``NULL`` or is ``hipStreamLegacy`` + .. glossary:: hipErrorStreamCaptureUnsupported @@ -754,6 +807,14 @@ Stream Capture Errors and cannot be used for host-side synchronization until the capture is complete and the graph is executed. + hipErrorInvalidResourceHandle + + This error is returned when the input launch stream is a NULL pointer, is invalid, or is ``hipStreamLegacy``. + If you encounter this error, you should check the validity of the resource handle being used in your HIP + API calls. Ensure that the handle was correctly obtained and has not been freed or invalidated before use. + +.. _profiler_errors: + Profiler Errors =============== @@ -845,6 +906,8 @@ Profiler Errors The HIP profiler must be in an active state before it can be stopped. This error is informational and indicates that the profiler is already in the desired inactive state. +.. _resource_mapping_errors: + Resource Mapping Errors ======================= @@ -992,6 +1055,8 @@ Resource Mapping Errors operation was attempted on a resource that was not mapped as a pointer. Resources must be mapped with the appropriate mapping type for the operations that will be performed on them. +.. _peer_access_errors: + Peer Access Errors ================== @@ -1058,6 +1123,8 @@ Peer Access Errors access between peer devices. Not all device combinations support peer access. Compatibility can be determined with :cpp:func:`hipDeviceCanAccessPeer()`. +.. _system_file_errors: + System and File Errors ====================== @@ -1183,6 +1250,8 @@ System and File Errors This is a catch-all error that may require looking at system logs or using additional debugging tools to identify the root cause. +.. _graphics_content_errors: + Graphics Context Errors ======================= @@ -1216,6 +1285,8 @@ Graphics Context Errors instantiated graph update. This error occurs when attempting to update an already instantiated graph with changes that are not allowed. +.. _hardware_errors: + Hardware Errors ===============
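As a practical note on consuming these error codes, the helper below is an editorial sketch (the macro name is arbitrary) that relies only on the public :cpp:func:`hipGetErrorName` and :cpp:func:`hipGetErrorString` calls to turn any failing status into a readable message.

.. code-block:: cpp

   // Illustrative helper: convert any non-hipSuccess status into a readable
   // message using the error name and description APIs.
   #include <hip/hip_runtime.h>
   #include <cstdio>
   #include <cstdlib>

   #define HIP_CHECK(expression)                                             \
       do {                                                                  \
           const hipError_t status = (expression);                           \
           if (status != hipSuccess) {                                       \
               std::fprintf(stderr, "HIP error %s (%d): %s at %s:%d\n",      \
                            hipGetErrorName(status),                         \
                            static_cast<int>(status),                        \
                            hipGetErrorString(status), __FILE__, __LINE__);  \
               std::exit(EXIT_FAILURE);                                      \
           }                                                                 \
       } while (0)

   int main() {
       int device_count = 0;
       HIP_CHECK(hipGetDeviceCount(&device_count));
       std::printf("Found %d HIP device(s)\n", device_count);
       return 0;
   }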