Add lane masks bit-shift in the porting guide

neon60 · neon60 · commit f5dc82cee99d · 2025-06-07T16:49:42.000+02:00
diff --git a/.wordlist.txt b/.wordlist.txt
@@ -15,6 +15,7 @@ bfloat
 Bitcode
 bitcode
 bitcodes
+bitmask
 blockDim
 blockIdx
 builtins
diff --git a/docs/how-to/hip_porting_guide.rst b/docs/how-to/hip_porting_guide.rst
@@ -611,6 +611,70 @@ code, while the host can query it during runtime via the device properties. See
 the :ref:`HIP language extension for warpSize <warp_size>` for information on
 how to write portable wave-aware code.
 
+Lane masks bit-shift
+================================================================================
+
+A thread in a warp is also called a lane, and a lane mask is a bitmask where
+each bit corresponds to a thread in a warp. A bit is 1 if the thread is active,
+0 if it's inactive. Bit-shift operations are typically used to create lane masks
+and on AMD GPUs the ``warpSize`` can differ between different architectures,
+that's why it's essential to use correct bitmask type, when porting code.
+
+Example:
+
+.. code-block:: cpp
+
+  // Get the thread's position in the warp
+  unsigned int laneId = threadIdx.x % warpSize;
+
+  // Use lane ID for bit-shift
+  val & ((1 << (threadIdx.x % warpSize) )-1 );
+
+  // Shift 32 bit integer with val variable
+  WarpReduce::sum( (val < warpSize) ? (1 << val) : 0);
+
+Lane masks are 32-bit integer types as this is the integer precision that C 
+assigns to such constants by default. GCN/CDNA architectures have a warp size of
+64, :code:`threadIdx.x % warpSize` and :code:`val` in the example may obtain 
+values greater than 31. Consequently, shifting by such values would clear the 
+32-bit register to which the shift operation is applied. For AMD
+architectures, a straightforward fix could look as follows:
+
+.. code-block:: cpp
+  
+  // Get the thread's position in the warp
+  unsigned int laneId = threadIdx.x % warpSize;
+
+  // Use lane ID for bit-shift
+  val & ((1ull << (threadIdx.x % warpSize) )-1 );
+
+  // Shift 64 bit integer with val variable
+  WarpReduce::sum( (val < warpSize) ? (1ull << val) : 0);
+
+For portability reasons, it is better to introduce appropriately
+typed placeholders as shown below:
+
+.. code-block:: cpp
+
+  #if defined(__GFX8__) || defined(__GFX9__)
+  typedef uint64_t lane_mask_t;
+  #else
+  typedef uint32_t lane_mask_t;
+  #endif
+
+The use of :code:`lane_mask_t` with the previous example:
+
+.. code-block:: cpp
+
+  // Get the thread's position in the warp
+  unsigned int laneId = threadIdx.x % warpSize;
+
+  // Use lane ID for bit-shift
+  val & ((lane_mask_t{1} << (threadIdx.x % warpSize) )-1 );
+
+  // Shift 32 or 64 bit integer with val variable
+  WarpReduce::sum( (val < warpSize) ? (lane_mask_t{1} << val) : 0);
+
 Porting from CUDA __launch_bounds__
 ================================================================================