@@ -611,6 +611,70 @@ code, while the host can query it during runtime via the device properties. See
611
611
the :ref: `HIP language extension for warpSize <warp_size >` for information on
612
612
how to write portable wave-aware code.
613
613
614
+ Lane masks bit-shift
615
+ ================================================================================
616
+
617
+ A thread in a warp is also called a lane, and a lane mask is a bitmask where
618
+ each bit corresponds to a thread in a warp. A bit is 1 if the thread is active,
619
+ 0 if it's inactive. Bit-shift operations are typically used to create lane masks
620
+ and on AMD GPUs the ``warpSize `` can differ between different architectures,
621
+ that's why it's essential to use correct bitmask type, when porting code.
622
+
623
+ Example:
624
+
625
+ .. code-block :: cpp
626
+
627
+ // Get the thread's position in the warp
628
+ unsigned int laneId = threadIdx.x % warpSize;
629
+
630
+ // Use lane ID for bit-shift
631
+ val & ((1 << (threadIdx.x % warpSize) )-1 );
632
+
633
+ // Shift 32 bit integer with val variable
634
+ WarpReduce::sum( (val < warpSize) ? (1 << val) : 0);
635
+
636
+ Lane masks are 32-bit integer types as this is the integer precision that C
637
+ assigns to such constants by default. GCN/CDNA architectures have a warp size of
638
+ 64, :code: `threadIdx.x % warpSize ` and :code: `val ` in the example may obtain
639
+ values greater than 31. Consequently, shifting by such values would clear the
640
+ 32-bit register to which the shift operation is applied. For AMD
641
+ architectures, a straightforward fix could look as follows:
642
+
643
+ .. code-block :: cpp
644
+
645
+ // Get the thread's position in the warp
646
+ unsigned int laneId = threadIdx.x % warpSize;
647
+
648
+ // Use lane ID for bit-shift
649
+ val & ((1ull << (threadIdx.x % warpSize) )-1 );
650
+
651
+ // Shift 64 bit integer with val variable
652
+ WarpReduce::sum( (val < warpSize) ? (1ull << val) : 0);
653
+
654
+ For portability reasons, it is better to introduce appropriately
655
+ typed placeholders as shown below:
656
+
657
+ .. code-block :: cpp
658
+
659
+ #if defined(__GFX8__) || defined(__GFX9__)
660
+ typedef uint64_t lane_mask_t;
661
+ #else
662
+ typedef uint32_t lane_mask_t;
663
+ #endif
664
+
665
+ The use of :code: `lane_mask_t ` with the previous example:
666
+
667
+ .. code-block :: cpp
668
+
669
+ // Get the thread's position in the warp
670
+ unsigned int laneId = threadIdx.x % warpSize;
671
+
672
+ // Use lane ID for bit-shift
673
+ val & ((lane_mask_t{1} << (threadIdx.x % warpSize) )-1 );
674
+
675
+ // Shift 32 or 64 bit integer with val variable
676
+ WarpReduce::sum( (val < warpSize) ? (lane_mask_t{1} << val) : 0);
677
+
614
678
Porting from CUDA __launch_bounds__
615
679
================================================================================
616
680
0 commit comments