Commit 57f0535

Add best practice for warpSize handling
1 parent 14aa300 commit 57f0535

File tree

3 files changed: +521 -6 lines changed

docs/how-to/hip_cpp_language_extensions.rst

Lines changed: 130 additions & 6 deletions
@@ -411,18 +411,18 @@ warpSize
 ================================================================================
 
 The ``warpSize`` constant contains the number of threads per warp for the given
-target device. It can differ between different architectures, see the
-:doc:`hardware features <../reference/hardware_features>` for more
-information.
+target device. On AMD hardware, this is referred to as the ``wavefront size``,
+which may vary depending on the architecture. For more details, see the
+:doc:`hardware features <../reference/hardware_features>`.
 
 Since ``warpSize`` can differ between devices, it cannot be assumed to be a
 compile-time constant on the host. It has to be queried using
 :cpp:func:`hipDeviceGetAttribute` or :cpp:func:`hipGetDeviceProperties`, e.g.:
 
 .. code-block:: cpp
 
-   int val;
-   hipDeviceGetAttribute(&val, hipDeviceAttributeWarpSize, deviceId);
+   int warpSizeHost;
+   hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId);
 
 .. note::

@@ -433,6 +433,130 @@ compile-time constant on the host. It has to be queried using
    of 32 can run on devices with a ``warpSize`` of 64, it only utilizes half of
    the compute resources.
 
+The ``warpSize`` parameter will no longer be a compile-time constant in a
+future release of ROCm; however, it will still be early-folded by the compiler,
+which means it can be used for loop bounds and supports loop unrolling
+similarly to a compile-time warp size.
+
+If the compile-time warp size is still required, for example to select the
+correct mask type or code path at compile time, the recommended approach is to
+determine the warp size of the GPU on the host side and set up the kernel
+accordingly, as shown in the following block reduce example.
+
+The ``block_reduce`` kernel has a template parameter for the warp size and
+performs a reduction operation in two main phases:
+
+- Shared memory reduction: The reduction is performed iteratively, halving the
+  number of active threads in each step until only a single warp remains
+  (32 or 64 threads, depending on the device).
+
+- Warp-level reduction: Once the shared memory reduction completes, the
+  remaining threads use warp-level shuffling to sum the remaining values. This
+  is done efficiently with the ``__shfl_down`` intrinsic, which allows threads
+  within the warp to exchange values without explicit synchronization.
+
+.. tab-set::
+
+   .. tab-item:: Compile-time WarpSize
+      :sync: template-warpsize
+
+      .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
+         :start-after: // [Sphinx template warp size block reduction kernel start]
+         :end-before: // [Sphinx template warp size block reduction kernel end]
+         :language: cpp
+
+   .. tab-item:: HIP warpSize
+      :sync: hip-warpsize
+
+      .. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
+         :start-after: // [Sphinx HIP warp size block reduction kernel start]
+         :end-before: // [Sphinx HIP warp size block reduction kernel end]
+         :language: cpp
+
+The host code with the main function:
+
+- Retrieves the warp size of the GPU (``warpSizeHost``) to determine the
+  optimal kernel configuration.
+
+- Allocates device memory (``d_data`` for the input, ``d_results`` for the
+  block-wise output) and initializes the input vector to 1.
+
+- Generates the mask variables for every warp and copies them to the device.
+
+.. tab-set::
+
+   .. tab-item:: Compile-time WarpSize
+      :sync: template-warpsize
+
+      .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
+         :start-after: // [Sphinx template warp size mask generation start]
+         :end-before: // [Sphinx template warp size mask generation end]
+         :language: cpp
+
+   .. tab-item:: HIP warpSize
+      :sync: hip-warpsize
+
+      .. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
+         :start-after: // [Sphinx HIP warp size mask generation start]
+         :end-before: // [Sphinx HIP warp size mask generation end]
+         :language: cpp
+
+- Selects the appropriate kernel specialization based on the warp size (either
+  32 or 64) and launches the kernel.
+
+.. tab-set::
+
+   .. tab-item:: Compile-time WarpSize
+      :sync: template-warpsize
+
+      .. literalinclude:: ../tools/example_codes/template_warp_size_reduction.hip
+         :start-after: // [Sphinx template warp size select kernel start]
+         :end-before: // [Sphinx template warp size select kernel end]
+         :language: cpp
+
+   .. tab-item:: HIP warpSize
+      :sync: hip-warpsize
+
+      .. literalinclude:: ../tools/example_codes/warp_size_reduction.hip
+         :start-after: // [Sphinx HIP warp size select kernel start]
+         :end-before: // [Sphinx HIP warp size select kernel end]
+         :language: cpp
+
+- Synchronizes the device and copies the results back to the host.
+
+- Checks that each block's sum equals the expected mask bit count, verifying
+  the correctness of the reduction.
+
+- Frees the device memory to prevent memory leaks.
+
+.. note::
+
+   The runtime ``warpSize`` example code is provided for comparison purposes;
+   the full example codes are located in the `tools folder <https://github.com/ROCm/hip/tree/docs/develop/docs/tools/example_codes>`_.
+
+The variable ``warpSize`` can be used for loop bounds and supports loop
+unrolling similarly to the template parameter ``WarpSize``.
+
+For users who still require a compile-time constant warp size as a macro on the
+device side, it can be defined manually based on the target device
+architecture, as shown in the following example.
+
+.. code-block:: cpp
+
+   #if defined(__GFX8__) || defined(__GFX9__)
+       #define WarpSize 64
+   #else
+       #define WarpSize 32
+   #endif
+
+.. note::
+
+   The ``-mwavefrontsize64`` compiler option is not supported by the HIP
+   runtime, which is why the architecture-based compile-time selection is an
+   acceptable approach.
+
 ********************************************************************************
 Vector types
 ********************************************************************************
@@ -855,7 +979,7 @@ The different shuffle functions behave as following:
    of range, the thread returns its own ``var``.
 
 ``__shfl_down``
-   The thread reads ``var`` from lane ``laneIdx - delta``, thereby "shuffling"
+   The thread reads ``var`` from lane ``laneIdx + delta``, thereby "shuffling"
    the values of the lanes of the warp "down". If the resulting source lane is
    out of range, the thread returns its own ``var``.

Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
// MIT License
//
// Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.

#include <hip/hip_runtime.h>

#include <iostream>
#include <random>
#include <type_traits>
#include <vector>

#define HIP_CHECK(expression)                  \
{                                              \
    const hipError_t status = expression;      \
    if(status != hipSuccess){                  \
        std::cerr << "HIP error "              \
                  << status << ": "            \
                  << hipGetErrorString(status) \
                  << " at " << __FILE__ << ":" \
                  << __LINE__ << std::endl;    \
    }                                          \
}

// [Sphinx template warp size block reduction kernel start]
template<uint32_t WarpSize>
using lane_mask_t = typename std::conditional<WarpSize == 32, uint32_t, uint64_t>::type;

template<uint32_t WarpSize>
__global__ void block_reduce(int* input, lane_mask_t<WarpSize>* mask, int* output, size_t size) {
    extern __shared__ int shared[];

    // Read of input with bounds and mask check
    auto read_global_safe = [&](const uint32_t i, const uint32_t lane_id, const uint32_t mask_id)
    {
        lane_mask_t<WarpSize> warp_mask = lane_mask_t<WarpSize>(1) << lane_id;
        return (i < size) && (mask[mask_id] & warp_mask) ? input[i] : 0;
    };

    const uint32_t tid = threadIdx.x,
                   lid = threadIdx.x % WarpSize,
                   wid = threadIdx.x / WarpSize,
                   bid = blockIdx.x,
                   gid = bid * blockDim.x + tid;

    // Read input buffer to shared memory
    shared[tid] = read_global_safe(gid, lid, bid * (blockDim.x / WarpSize) + wid);
    __syncthreads();

    // Shared memory reduction
    for (uint32_t i = blockDim.x / 2; i >= WarpSize; i /= 2)
    {
        if (tid < i)
            shared[tid] = shared[tid] + shared[tid + i];
        __syncthreads();
    }

    // Use a local variable in the warp reduction
    int result = shared[tid];
    __syncthreads();

    // This loop would be unrolled the same way with the runtime warpSize.
#pragma unroll
    for (uint32_t i = WarpSize / 2; i >= 1; i /= 2) {
        result = result + __shfl_down(result, i);
    }

    // Write the result to the output buffer
    if (tid == 0)
        output[bid] = result;
}
// [Sphinx template warp size block reduction kernel end]

// [Sphinx template warp size mask generation start]
template<uint32_t WarpSize>
void generate_and_copy_mask(
    void *d_mask,
    std::vector<int>& vectorExpected,
    int numOfBlocks,
    int numberOfWarp,
    int mask_size,
    int mask_element_size) {

    std::random_device rd;
    std::mt19937_64 eng(rd());

    // Host side mask vector
    std::vector<lane_mask_t<WarpSize>> mask(mask_size);
    // Define a uniform unsigned int distribution
    std::uniform_int_distribution<lane_mask_t<WarpSize>> distr;
    // Fill up the mask
    for(int i = 0; i < numOfBlocks; i++) {
        int count = 0;
        for(int j = 0; j < numberOfWarp; j++) {
            int mask_index = i * numberOfWarp + j;
            mask[mask_index] = distr(eng);
            if constexpr(WarpSize == 32)
                count += __builtin_popcount(mask[mask_index]);
            else
                count += __builtin_popcountll(mask[mask_index]);
        }
        vectorExpected[i] = count;
    }

    // Copy the mask array
    HIP_CHECK(hipMemcpy(d_mask, mask.data(), mask_size * mask_element_size, hipMemcpyHostToDevice));
}
// [Sphinx template warp size mask generation end]

int main() {

    int deviceId = 0;
    int warpSizeHost;
    HIP_CHECK(hipDeviceGetAttribute(&warpSizeHost, hipDeviceAttributeWarpSize, deviceId));
    std::cout << "Warp size: " << warpSizeHost << std::endl;

    constexpr int numOfBlocks = 16;
    constexpr int threadsPerBlock = 1024;
    const int numberOfWarp = threadsPerBlock / warpSizeHost;
    const int mask_element_size = warpSizeHost == 32 ? sizeof(uint32_t) : sizeof(uint64_t);
    const int mask_size = numOfBlocks * numberOfWarp;
    constexpr size_t arraySize = numOfBlocks * threadsPerBlock;

    int *d_data, *d_results;
    void *d_mask;
    int initValue = 1;
    std::vector<int> vectorInput(arraySize, initValue);
    std::vector<int> vectorOutput(numOfBlocks);
    std::vector<int> vectorExpected(numOfBlocks);
    // Allocate device memory
    HIP_CHECK(hipMalloc(&d_data, arraySize * sizeof(*d_data)));
    HIP_CHECK(hipMalloc(&d_mask, mask_size * mask_element_size));
    HIP_CHECK(hipMalloc(&d_results, numOfBlocks * sizeof(*d_results)));
    // Host to Device copy of the input array
    HIP_CHECK(hipMemcpy(d_data, vectorInput.data(), arraySize * sizeof(*d_data), hipMemcpyHostToDevice));

    // [Sphinx template warp size select kernel start]
    // Fill up the mask variable, copy it to the device and select the right kernel.
    if(warpSizeHost == 32) {
        // Generate and copy mask arrays
        generate_and_copy_mask<32>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);

        // Start the kernel
        block_reduce<32><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
            d_data,
            static_cast<uint32_t*>(d_mask),
            d_results,
            arraySize);
    } else if(warpSizeHost == 64) {
        // Generate and copy mask arrays
        generate_and_copy_mask<64>(d_mask, vectorExpected, numOfBlocks, numberOfWarp, mask_size, mask_element_size);

        // Start the kernel
        block_reduce<64><<<dim3(numOfBlocks), dim3(threadsPerBlock), threadsPerBlock * sizeof(*d_data)>>>(
            d_data,
            static_cast<uint64_t*>(d_mask),
            d_results,
            arraySize);
    } else {
        std::cerr << "Unsupported warp size." << std::endl;
        return 1;
    }
    // [Sphinx template warp size select kernel end]

    // Check the kernel launch
    HIP_CHECK(hipGetLastError());
    // Check for kernel execution errors
    HIP_CHECK(hipDeviceSynchronize());
    // Device to Host copy of the result
    HIP_CHECK(hipMemcpy(vectorOutput.data(), d_results, numOfBlocks * sizeof(*d_results), hipMemcpyDeviceToHost));

    // Verify results
    bool passed = true;
    for(int i = 0; i < numOfBlocks; ++i) {
        if(vectorOutput[i] != vectorExpected[i]) {
            passed = false;
            std::cerr << "Validation failed! Expected " << vectorExpected[i] << " got " << vectorOutput[i] << " at index: " << i << std::endl;
        }
    }
    if(passed) {
        std::cout << "Execution completed successfully." << std::endl;
    } else {
        std::cerr << "Execution failed." << std::endl;
    }

    // Cleanup
    HIP_CHECK(hipFree(d_data));
    HIP_CHECK(hipFree(d_mask));
    HIP_CHECK(hipFree(d_results));
    return 0;
}
