Skip to content

Commit f5dc82c

Browse files
committed
Add lane masks bit-shift in the porting guide
1 parent 28678b6 commit f5dc82c

File tree

2 files changed

+65
-0
lines changed

2 files changed

+65
-0
lines changed

.wordlist.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@ bfloat
1515
Bitcode
1616
bitcode
1717
bitcodes
18+
bitmask
1819
blockDim
1920
blockIdx
2021
builtins

docs/how-to/hip_porting_guide.rst

Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -611,6 +611,70 @@ code, while the host can query it during runtime via the device properties. See
611611
the :ref:`HIP language extension for warpSize <warp_size>` for information on
612612
how to write portable wave-aware code.
613613

614+
Lane masks bit-shift
615+
================================================================================
616+
617+
A thread in a warp is also called a lane, and a lane mask is a bitmask where
618+
each bit corresponds to a thread in a warp. A bit is 1 if the thread is active,
619+
0 if it's inactive. Bit-shift operations are typically used to create lane masks
620+
and on AMD GPUs the ``warpSize`` can differ between different architectures,
621+
that's why it's essential to use correct bitmask type, when porting code.
622+
623+
Example:
624+
625+
.. code-block:: cpp
626+
627+
// Get the thread's position in the warp
628+
unsigned int laneId = threadIdx.x % warpSize;
629+
630+
// Use lane ID for bit-shift
631+
val & ((1 << (threadIdx.x % warpSize) )-1 );
632+
633+
// Shift 32 bit integer with val variable
634+
WarpReduce::sum( (val < warpSize) ? (1 << val) : 0);
635+
636+
Lane masks are 32-bit integer types as this is the integer precision that C
637+
assigns to such constants by default. GCN/CDNA architectures have a warp size of
638+
64, :code:`threadIdx.x % warpSize` and :code:`val` in the example may obtain
639+
values greater than 31. Consequently, shifting by such values would clear the
640+
32-bit register to which the shift operation is applied. For AMD
641+
architectures, a straightforward fix could look as follows:
642+
643+
.. code-block:: cpp
644+
645+
// Get the thread's position in the warp
646+
unsigned int laneId = threadIdx.x % warpSize;
647+
648+
// Use lane ID for bit-shift
649+
val & ((1ull << (threadIdx.x % warpSize) )-1 );
650+
651+
// Shift 64 bit integer with val variable
652+
WarpReduce::sum( (val < warpSize) ? (1ull << val) : 0);
653+
654+
For portability reasons, it is better to introduce appropriately
655+
typed placeholders as shown below:
656+
657+
.. code-block:: cpp
658+
659+
#if defined(__GFX8__) || defined(__GFX9__)
660+
typedef uint64_t lane_mask_t;
661+
#else
662+
typedef uint32_t lane_mask_t;
663+
#endif
664+
665+
The use of :code:`lane_mask_t` with the previous example:
666+
667+
.. code-block:: cpp
668+
669+
// Get the thread's position in the warp
670+
unsigned int laneId = threadIdx.x % warpSize;
671+
672+
// Use lane ID for bit-shift
673+
val & ((lane_mask_t{1} << (threadIdx.x % warpSize) )-1 );
674+
675+
// Shift 32 or 64 bit integer with val variable
676+
WarpReduce::sum( (val < warpSize) ? (lane_mask_t{1} << val) : 0);
677+
614678
Porting from CUDA __launch_bounds__
615679
================================================================================
616680

0 commit comments

Comments
 (0)