|
| 1 | +--- |
| 2 | +authors: ["kokkos-team"] |
| 3 | +title: "kokkos-fft: FFT interface for Kokkos eco-system" |
| 4 | +date: 2025-06-25 |
| 5 | +tags: ["blog"] |
| 6 | +thumbnail: img/blog/2025/kokkos-fft/kokkos-fft.png |
| 7 | +--- |
| 8 | + |
| 9 | +# Introduction |
| 10 | + |
| 11 | +The fast Fourier transform (FFT) is a family of fundamental algorithms that is widely used in scientific computing and other areas [1]. [kokkos-fft](https://github.com/kokkos/kokkos-fft) is designed to help [Kokkos](https://github.com/kokkos/kokkos) users who are: |
| 12 | + |
| 13 | +* developing a Kokkos application which relies on FFT libraries. E.g., fluid simulation codes with periodic boundaries, plasma turbulence, etc. |
| 14 | + |
| 15 | +* wishing to integrate in-situ signal and image processing with FFTs. E.g., spectral analyses, low pass filtering, etc. |
| 16 | + |
| 17 | +* willing to use de facto standard FFT libraries just like [`numpy.fft`](https://numpy.org/doc/stable/reference/routines.fft.html). |
| 18 | + |
| 19 | +kokkos-fft [2] can benefit such users through the following features: |
| 20 | + |
| 21 | +* A simple interface like [`numpy.fft`](https://numpy.org/doc/stable/reference/routines.fft.html) with in-place and out-of-place transforms: |
| 22 | +Only accepts [Kokkos Views](https://kokkos.org/kokkos-core-wiki/API/core/view/view.html) to make APIs simple and safe. |
| 23 | + |
| 24 | +* 1D, 2D, 3D standard and real FFT functions (similar to [`numpy.fft`](https://numpy.org/doc/stable/reference/routines.fft.html)) over 1D to 8D Kokkos Views: |
| 25 | +Batched plans are automatically used if `View` dimension is larger than FFT dimension. |
| 26 | + |
| 27 | +* A reusable [FFT plan](https://kokkosfft.readthedocs.io/en/latest/api/plan/plan.html) which wraps the vendor libraries for each Kokkos backend: |
| 28 | +[fftw](http://www.fftw.org), [cuFFT](https://developer.nvidia.com/cufft), [hipFFT](https://github.com/ROCm/hipFFT) ([rocFFT](https://github.com/ROCm/rocFFT)), and [oneMKL](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html) are automatically enabled based on the enabled Kokkos backend. |
| 29 | + |
| 30 | +* Support for multiple CPU and GPU backends: |
| 31 | +FFT libraries for the enabled Kokkos backend are executed on the stream/queue used in that Execution space. |
| 32 | + |
| 33 | +* Compile time and/or runtime errors for invalid usage (e.g. `View` extents mismatch). |
| 34 | + |
| 35 | +# How it Works |
| 36 | + |
| 37 | +For those who are familiar with [`numpy.fft`](https://numpy.org/doc/stable/reference/routines.fft.html), you may use kokkos-fft quite easily. In fact, all of the numpy.fft functions (`numpy.fft.<function_name>`) have an analogous counterpart in kokkos-fft (`KokkosFFT::<function_name>`), which can run on the Kokkos device. In addition, kokkos-fft supports [in-place transform](https://kokkosfft.readthedocs.io/en/latest/intro/using.html#inplace-transform) and [plan reuse](https://kokkosfft.readthedocs.io/en/latest/intro/using.html#reuse-fft-plan) capabilities. |
| 38 | + |
| 39 | +Let's start with a simple example. The following C++ listing shows the 1D real to complex transform using `rfft` in kokkos-fft. |
| 40 | + |
| 41 | +```C++ |
| 42 | +#include <Kokkos_Core.hpp> |
| 43 | +#include <Kokkos_Random.hpp> |
| 44 | +#include <KokkosFFT.hpp> |
| 45 | +int main(int argc, char* argv[]) { |
| 46 | + Kokkos::ScopeGuard guard(argc, argv); |
| 47 | + const int n = 4; |
| 48 | + Kokkos::View<double*> x("x", n); |
| 49 | + Kokkos::View<Kokkos::complex<double>*> x_hat("x_hat", n/2+1); |
| 50 | + // initialize the input array with random values |
| 51 | + Kokkos::DefaultExecutionSpace exec; |
| 52 | + Kokkos::Random_XorShift64_Pool<> random_pool(/*seed=*/12345); |
| 53 | + Kokkos::fill_random(exec, x, random_pool, /*range=*/1.0); |
| 54 | + KokkosFFT::rfft(exec, x, x_hat); |
| 55 | + // block the current thread until all work enqueued into exec is finished |
| 56 | + exec.fence(); |
| 57 | +} |
| 58 | +``` |
| 59 | +
|
| 60 | +This is equivalent to the following Python code. |
| 61 | +
|
| 62 | +```python |
| 63 | +import numpy as np |
| 64 | +x = np.random.rand(4) |
| 65 | +x_hat = np.fft.rfft(x) |
| 66 | +``` |
| 67 | + |
| 68 | +There are two additional arguments in the Kokkos version: |
| 69 | + |
| 70 | +* `exec`: [*Kokkos execution space instance*](https://kokkos.org/kokkos-core-wiki/API/core/execution_spaces.html) whose internal stream/queue is attached to the plan of backend FFT library. |
| 71 | + |
| 72 | +* `x_hat`: [*Kokkos Views*](https://kokkos.org/kokkos-core-wiki/API/core/view/view.html) where the complex-valued FFT output will be stored. By accepting this view as an argument, the function allows the user to pre-allocate memory and optimize data placement, avoiding unnecessary allocations and copies. |
| 73 | + |
| 74 | +Also, kokkos-fft only accepts [Kokkos Views](https://kokkos.org/kokkos-core-wiki/API/core/view/view.html) as input data. The accessibility of a `View` from `ExecutionSpace` is statically checked and will result in a compilation error if not accessible. See [documentations](https://kokkosfft.readthedocs.io/en/latest/intro/quick_start.html) for basic usage. |
| 75 | + |
| 76 | +# Solving 2D Hasegawa-Wakatani turbulence with the Fourier spectral method |
| 77 | + |
| 78 | +For turbulence simulations, we sometimes consider periodic boundaries by assuming that a system is homogeneous and isotropic. Under periodic boundary conditions, we can solve the system of equations using the Fourier spectral method. Here, we consider a typical 2D plasma turbulence model, called the Hasegawa-Wakatani equation [3] (see the thumbnail for the vorticity structure). Using Kokkos and kokkos-fft, we can easily implement the code, just like Python, while getting a significant acceleration. As described in the [documentation](https://github.com/kokkos/kokkos-fft/tree/main/examples/10_HasegawaWakatani/README.md), the core computational kernel of the code is the nonlinear term. In Python, it is implemented as follows, |
| 79 | + |
| 80 | +```python |
| 81 | +def _poissonBracket(self, f, g): |
| 82 | + ikx_f = 1j * self.grid.kx * f |
| 83 | + iky_f = 1j * self.grid.kyh * f |
| 84 | + ikx_g = 1j * self.grid.kx * g |
| 85 | + iky_g = 1j * self.grid.kyh * g |
| 86 | + |
| 87 | + # Inverse FFT complex [ny, nx/2+1] => real [ny, nx] |
| 88 | + dfdx = self._backwardFFT(ikx_f) |
| 89 | + dfdy = self._backwardFFT(iky_f) |
| 90 | + dgdx = self._backwardFFT(ikx_g) |
| 91 | + dgdy = self._backwardFFT(iky_g) |
| 92 | + |
| 93 | + # Convolution in real space |
| 94 | + conv = dfdx * dgdy - dfdy * dgdx |
| 95 | + |
| 96 | + # Forward FFT real [ny, nx] => [ny, nx/2+1] |
| 97 | + poisson_bracket = self._forwardFFT(conv) |
| 98 | + |
| 99 | + # Reality condition |
| 100 | + poisson_bracket = realityCondition(poisson_bracket) |
| 101 | + return poisson_bracket |
| 102 | +``` |
| 103 | + |
| 104 | +We make 4 backward FFTs on `ikx_f`, `iky_f`, `ikx_g` and `iky_g`. Then, we perform the convolution in real space followed by a forward FFT on the result to compute the Poisson bracket. The equivalent Kokkos code is as follows, |
| 105 | + |
| 106 | +```C++ |
| 107 | +template <typename FViewType, typename GViewType, typename PViewType> |
| 108 | +void poissonBracket(const FViewType& fk, const GViewType& gk, PViewType& pk) { |
| 109 | + derivative(fk, gk, m_ik_fg_all); |
| 110 | + backwardFFT(m_ik_fg_all, m_dfgdx_all); |
| 111 | + |
| 112 | + // Convolution in real space |
| 113 | + convolution(m_dfgdx_all, m_conv); |
| 114 | + |
| 115 | + // Forward FFT |
| 116 | + forwardFFT(m_conv, pk); |
| 117 | + |
| 118 | + // ky == 0 component |
| 119 | + auto sub_pk = Kokkos::subview(pk, Kokkos::ALL, 0, Kokkos::ALL); |
| 120 | + realityCondition(sub_pk, m_mask); |
| 121 | +} |
| 122 | +``` |
| 123 | +
|
| 124 | +The functions `derivative` and `convolution` are parallelized with [`Kokkos::parallel_for`](https://kokkos.org/kokkos-core-wiki/API/core/parallel-dispatch/parallel_for.html) using a [`MDRangePolicy`](https://kokkos.org/kokkos-core-wiki/API/core/policies/MDRangePolicy.html). For example, `derivative` is computed in the following manner. It should be noted that we store the derivatives `ikx_f`, `iky_f`, `ikx_g` and `iky_g` as [`subviews`](https://kokkos.org/kokkos-core-wiki/API/core/view/subview.html) of a single view `ik_fg_all`. This way, we only need to perform one batched backward FFT over derivatives rather than calling FFTs multiple times for all derivatives. |
| 125 | +
|
| 126 | +```C++ |
| 127 | +template <typename FViewType, typename GViewType, typename FGViewType> |
| 128 | +void derivative(const FViewType& fk, const GViewType& gk, |
| 129 | + FGViewType& ik_fg_all) { |
| 130 | + auto ikx_f = |
| 131 | + Kokkos::subview(ik_fg_all, pair_type(0, 2), Kokkos::ALL, Kokkos::ALL); |
| 132 | + auto iky_f = |
| 133 | + Kokkos::subview(ik_fg_all, pair_type(2, 4), Kokkos::ALL, Kokkos::ALL); |
| 134 | +
|
| 135 | + auto ikx_g = Kokkos::subview(ik_fg_all, 4, Kokkos::ALL, Kokkos::ALL); |
| 136 | + auto iky_g = Kokkos::subview(ik_fg_all, 5, Kokkos::ALL, Kokkos::ALL); |
| 137 | + auto kx = m_grid->m_kx; |
| 138 | + auto kyh = m_grid->m_kyh; |
| 139 | +
|
| 140 | + const Kokkos::complex<double> z(0.0, 1.0); |
| 141 | + constexpr int nb_vars = 2; |
| 142 | +
|
| 143 | + range2D_type range(point2D_type{{0, 0}}, point2D_type{{m_nkyh, m_nkx2}}, |
| 144 | + tile2D_type{{TILE0, TILE1}}); |
| 145 | +
|
| 146 | + Kokkos::parallel_for( |
| 147 | + range, KOKKOS_LAMBDA(int iky, int ikx) { |
| 148 | + const auto tmp_ikx = z * kx(ikx); |
| 149 | + const auto tmp_iky = z * kyh(iky); |
| 150 | + for (int in = 0; in < nb_vars; in++) { |
| 151 | + const auto tmp_fk = fk(in, iky, ikx); |
| 152 | + ikx_f(in, iky, ikx) = tmp_ikx * tmp_fk; |
| 153 | + iky_f(in, iky, ikx) = tmp_iky * tmp_fk; |
| 154 | + } |
| 155 | + const auto tmp_gk = gk(iky, ikx); |
| 156 | + ikx_g(iky, ikx) = tmp_ikx * tmp_gk; |
| 157 | + iky_g(iky, ikx) = tmp_iky * tmp_gk; |
| 158 | + }); |
| 159 | +} |
| 160 | +``` |
| 161 | + |
| 162 | +For forward and backward FFTs, we create plans during initialization which are reused by the [`KokkosFFT::execute`](https://kokkosfft.readthedocs.io/en/latest/intro/using.html#reuse-fft-plan) function. The following function implements the forward FFT followed by unpacking. |
| 163 | + |
| 164 | +```C++ |
| 165 | +template <typename InViewType, typename OutViewType> |
| 166 | +void forwardFFT(const InViewType& f, OutViewType& fk) { |
| 167 | + KokkosFFT::execute(*m_forward_plan, f, m_forward_buffer); |
| 168 | + auto forward_buffer = m_forward_buffer; |
| 169 | + auto norm_coef = m_norm_coef; |
| 170 | + int nkx2 = m_nkx2, nkx = (m_nkx2 - 1) / 2, ny = m_ny, nv = 2; |
| 171 | + range3D_type range(point3D_type{{0, 0, 0}}, |
| 172 | + point3D_type{{nv, m_nkyh, nkx + 1}}, |
| 173 | + tile3D_type{{2, TILE0, TILE1}}); |
| 174 | + Kokkos::parallel_for( |
| 175 | + "FFT_unpack", range, KOKKOS_LAMBDA(int iv, int iky, int ikx) { |
| 176 | + fk(iv, iky, ikx) = forward_buffer(iv, iky, ikx) * norm_coef; |
| 177 | + int ikx_neg = nkx2 - ikx; |
| 178 | + int iky_neg = (ny - iky), iky_nonzero = iky; |
| 179 | + if (ikx == 0) ikx_neg = 0; |
| 180 | + if (iky == 0) { |
| 181 | + iky_neg = ny - 1; |
| 182 | + iky_nonzero = 1; |
| 183 | + } |
| 184 | + fk(iv, iky_nonzero, ikx_neg) = |
| 185 | + Kokkos::conj(forward_buffer(iv, iky_neg, ikx)) * norm_coef; |
| 186 | + }); |
| 187 | +} |
| 188 | +``` |
| 189 | +
|
| 190 | +The implementation together with detailed description can be found in the [examples](https://github.com/kokkos/kokkos-fft/tree/main/examples/10_HasegawaWakatani) directory. |
| 191 | +
|
| 192 | +# Benchmark |
| 193 | +
|
| 194 | +We have performed a benchmark of this application over multiple backends. We performed a simulation for 100 steps with a resolution of `1024 x 1024` while I/Os are disabled. The following table shows the achieved performance. |
| 195 | +
|
| 196 | +| Device | Icelake (python) | Icelake (36 cores) | A100 | H100 | MI250X (1 GCD) | PVC | |
| 197 | +| --- | --- | --- | --- | --- | --- | --- | |
| 198 | +| LOC | 568 | 738 | 738 | 738 | 738 | 738 | |
| 199 | +| Compiler/version | Python 3.12.3 | IntelLLVM 2023.0.0 | nvcc 12.2 | nvcc 12.3 | rocm 5.7 | IntelLLVM 2024.0.2 | |
| 200 | +| GB/s (Theoretical peak) | 205 | 205 | 1555 | 3350 | 1600 | 3276.8 | |
| 201 | +| Elapsed time [s] | 463 | 9.28 | 0.25 | 0.14 | 0.41 | 0.30 | |
| 202 | +| Speed up | x 1 | x 49.9 | x 1852 | x 3307 | x 1129 | x 1562 | |
| 203 | +
|
| 204 | +As expected, the Python version is the simplest in terms of lines of code (LOC). With Kokkos and kokkos-fft, the same logic can be implemented without significantly increasing the source code size (roughly 1.5 times longer). However, the performance gain is enormous, allowing a single and simple code runs on multiple architectures efficiently. |
| 205 | +Note that this performance improvement largely reflects the fact that KokkosFFT is using the various optimized hardware-specific FFT libraries under the hood. |
| 206 | +
|
| 207 | +# Future developments |
| 208 | +
|
| 209 | +We are planning to add the following functionalities. Contributions to the project are highly welcomed (see [developer guide](https://kokkosfft.readthedocs.io/en/latest/developer_guide.html)). |
| 210 | +
|
| 211 | +* Multi-GPU support with MPI |
| 212 | +
|
| 213 | +* Device callable batched capability of FFTs like [`kokkos-kernels`](https://github.com/kokkos/kokkos-kernels) |
| 214 | +
|
| 215 | +* Supporting callbacks if backend library supports that |
| 216 | +
|
| 217 | +# References |
| 218 | +
|
| 219 | +[1] Daniel N Rockmore; The FFT: an algorithm the whole family can use. Computing in Science & Engineering Jan/Feb 2000; 2 (1): 60-64. https://doi.org/10.1109/5992.814659 |
| 220 | +[2] Y. Asahi, T. Padioleau, P. Zehner, J. Bigot and D Lebrun-Grandie, kokkos-fft: A shared-memory FFT for the Kokkos ecosystem, Journal of Open Source Software (JOSS), submitted. [https://github.com/openjournals/joss-reviews/issues/8391](https://github.com/openjournals/joss-reviews/issues/8391) |
| 221 | +[3] Masahiro Wakatani, Akira Hasegawa; A collisional drift wave description of plasma edge turbulence. Phys. Fluids 1 March 1984; 27 (3): 611–618. https://doi.org/10.1063/1.864660 |
0 commit comments