Skip to content

Commit 82aadd8

Browse files
yasahi-hpcYuuichi Asahidalg24crtrott
authored
Add kokkos fft blog post with joss citation (#184)
* Add kokkos fft blog post * modify the explanation of execution space argument * Apply Christian's suggestion Co-authored-by: Christian Trott <[email protected]> * Rename blog-post-10.md to 2025-06-kokkos-fft.md --------- Co-authored-by: Yuuichi Asahi <[email protected]> Co-authored-by: Damien L-G <[email protected]> Co-authored-by: Christian Trott <[email protected]>
1 parent ea80836 commit 82aadd8

File tree

2 files changed

+221
-0
lines changed

2 files changed

+221
-0
lines changed
1.04 MB
Loading

content/blog/2025-06-kokkos-fft.md

Lines changed: 221 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,221 @@
1+
---
2+
authors: ["kokkos-team"]
3+
title: "kokkos-fft: FFT interface for Kokkos eco-system"
4+
date: 2025-06-25
5+
tags: ["blog"]
6+
thumbnail: img/blog/2025/kokkos-fft/kokkos-fft.png
7+
---
8+
9+
# Introduction
10+
11+
The fast Fourier transform (FFT) is a family of fundamental algorithms that is widely used in scientific computing and other areas [1]. [kokkos-fft](https://github.com/kokkos/kokkos-fft) is designed to help [Kokkos](https://github.com/kokkos/kokkos) users who are:
12+
13+
* developing a Kokkos application which relies on FFT libraries. E.g., fluid simulation codes with periodic boundaries, plasma turbulence, etc.
14+
15+
* wishing to integrate in-situ signal and image processing with FFTs. E.g., spectral analyses, low pass filtering, etc.
16+
17+
* willing to use de facto standard FFT libraries just like [`numpy.fft`](https://numpy.org/doc/stable/reference/routines.fft.html).
18+
19+
kokkos-fft [2] can benefit such users through the following features:
20+
21+
* A simple interface like [`numpy.fft`](https://numpy.org/doc/stable/reference/routines.fft.html) with in-place and out-of-place transforms:
22+
Only accepts [Kokkos Views](https://kokkos.org/kokkos-core-wiki/API/core/view/view.html) to make APIs simple and safe.
23+
24+
* 1D, 2D, 3D standard and real FFT functions (similar to [`numpy.fft`](https://numpy.org/doc/stable/reference/routines.fft.html)) over 1D to 8D Kokkos Views:
25+
Batched plans are automatically used if `View` dimension is larger than FFT dimension.
26+
27+
* A reusable [FFT plan](https://kokkosfft.readthedocs.io/en/latest/api/plan/plan.html) which wraps the vendor libraries for each Kokkos backend:
28+
[fftw](http://www.fftw.org), [cuFFT](https://developer.nvidia.com/cufft), [hipFFT](https://github.com/ROCm/hipFFT) ([rocFFT](https://github.com/ROCm/rocFFT)), and [oneMKL](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html) are automatically enabled based on the enabled Kokkos backend.
29+
30+
* Support for multiple CPU and GPU backends:
31+
FFT libraries for the enabled Kokkos backend are executed on the stream/queue used in that Execution space.
32+
33+
* Compile time and/or runtime errors for invalid usage (e.g. `View` extents mismatch).
34+
35+
# How it Works
36+
37+
For those who are familiar with [`numpy.fft`](https://numpy.org/doc/stable/reference/routines.fft.html), you may use kokkos-fft quite easily. In fact, all of the numpy.fft functions (`numpy.fft.<function_name>`) have an analogous counterpart in kokkos-fft (`KokkosFFT::<function_name>`), which can run on the Kokkos device. In addition, kokkos-fft supports [in-place transform](https://kokkosfft.readthedocs.io/en/latest/intro/using.html#inplace-transform) and [plan reuse](https://kokkosfft.readthedocs.io/en/latest/intro/using.html#reuse-fft-plan) capabilities.
38+
39+
Let's start with a simple example. The following C++ listing shows the 1D real to complex transform using `rfft` in kokkos-fft.
40+
41+
```C++
42+
#include <Kokkos_Core.hpp>
43+
#include <Kokkos_Random.hpp>
44+
#include <KokkosFFT.hpp>
45+
int main(int argc, char* argv[]) {
46+
Kokkos::ScopeGuard guard(argc, argv);
47+
const int n = 4;
48+
Kokkos::View<double*> x("x", n);
49+
Kokkos::View<Kokkos::complex<double>*> x_hat("x_hat", n/2+1);
50+
// initialize the input array with random values
51+
Kokkos::DefaultExecutionSpace exec;
52+
Kokkos::Random_XorShift64_Pool<> random_pool(/*seed=*/12345);
53+
Kokkos::fill_random(exec, x, random_pool, /*range=*/1.0);
54+
KokkosFFT::rfft(exec, x, x_hat);
55+
// block the current thread until all work enqueued into exec is finished
56+
exec.fence();
57+
}
58+
```
59+
60+
This is equivalent to the following Python code.
61+
62+
```python
63+
import numpy as np
64+
x = np.random.rand(4)
65+
x_hat = np.fft.rfft(x)
66+
```
67+
68+
There are two additional arguments in the Kokkos version:
69+
70+
* `exec`: [*Kokkos execution space instance*](https://kokkos.org/kokkos-core-wiki/API/core/execution_spaces.html) whose internal stream/queue is attached to the plan of backend FFT library.
71+
72+
* `x_hat`: [*Kokkos Views*](https://kokkos.org/kokkos-core-wiki/API/core/view/view.html) where the complex-valued FFT output will be stored. By accepting this view as an argument, the function allows the user to pre-allocate memory and optimize data placement, avoiding unnecessary allocations and copies.
73+
74+
Also, kokkos-fft only accepts [Kokkos Views](https://kokkos.org/kokkos-core-wiki/API/core/view/view.html) as input data. The accessibility of a `View` from `ExecutionSpace` is statically checked and will result in a compilation error if not accessible. See [documentations](https://kokkosfft.readthedocs.io/en/latest/intro/quick_start.html) for basic usage.
75+
76+
# Solving 2D Hasegawa-Wakatani turbulence with the Fourier spectral method
77+
78+
For turbulence simulations, we sometimes consider periodic boundaries by assuming that a system is homogeneous and isotropic. Under periodic boundary conditions, we can solve the system of equations using the Fourier spectral method. Here, we consider a typical 2D plasma turbulence model, called the Hasegawa-Wakatani equation [3] (see the thumbnail for the vorticity structure). Using Kokkos and kokkos-fft, we can easily implement the code, just like Python, while getting a significant acceleration. As described in the [documentation](https://github.com/kokkos/kokkos-fft/tree/main/examples/10_HasegawaWakatani/README.md), the core computational kernel of the code is the nonlinear term. In Python, it is implemented as follows,
79+
80+
```python
81+
def _poissonBracket(self, f, g):
82+
ikx_f = 1j * self.grid.kx * f
83+
iky_f = 1j * self.grid.kyh * f
84+
ikx_g = 1j * self.grid.kx * g
85+
iky_g = 1j * self.grid.kyh * g
86+
87+
# Inverse FFT complex [ny, nx/2+1] => real [ny, nx]
88+
dfdx = self._backwardFFT(ikx_f)
89+
dfdy = self._backwardFFT(iky_f)
90+
dgdx = self._backwardFFT(ikx_g)
91+
dgdy = self._backwardFFT(iky_g)
92+
93+
# Convolution in real space
94+
conv = dfdx * dgdy - dfdy * dgdx
95+
96+
# Forward FFT real [ny, nx] => [ny, nx/2+1]
97+
poisson_bracket = self._forwardFFT(conv)
98+
99+
# Reality condition
100+
poisson_bracket = realityCondition(poisson_bracket)
101+
return poisson_bracket
102+
```
103+
104+
We make 4 backward FFTs on `ikx_f`, `iky_f`, `ikx_g` and `iky_g`. Then, we perform the convolution in real space followed by a forward FFT on the result to compute the Poisson bracket. The equivalent Kokkos code is as follows,
105+
106+
```C++
107+
template <typename FViewType, typename GViewType, typename PViewType>
108+
void poissonBracket(const FViewType& fk, const GViewType& gk, PViewType& pk) {
109+
derivative(fk, gk, m_ik_fg_all);
110+
backwardFFT(m_ik_fg_all, m_dfgdx_all);
111+
112+
// Convolution in real space
113+
convolution(m_dfgdx_all, m_conv);
114+
115+
// Forward FFT
116+
forwardFFT(m_conv, pk);
117+
118+
// ky == 0 component
119+
auto sub_pk = Kokkos::subview(pk, Kokkos::ALL, 0, Kokkos::ALL);
120+
realityCondition(sub_pk, m_mask);
121+
}
122+
```
123+
124+
The functions `derivative` and `convolution` are parallelized with [`Kokkos::parallel_for`](https://kokkos.org/kokkos-core-wiki/API/core/parallel-dispatch/parallel_for.html) using a [`MDRangePolicy`](https://kokkos.org/kokkos-core-wiki/API/core/policies/MDRangePolicy.html). For example, `derivative` is computed in the following manner. It should be noted that we store the derivatives `ikx_f`, `iky_f`, `ikx_g` and `iky_g` as [`subviews`](https://kokkos.org/kokkos-core-wiki/API/core/view/subview.html) of a single view `ik_fg_all`. This way, we only need to perform one batched backward FFT over derivatives rather than calling FFTs multiple times for all derivatives.
125+
126+
```C++
127+
template <typename FViewType, typename GViewType, typename FGViewType>
128+
void derivative(const FViewType& fk, const GViewType& gk,
129+
FGViewType& ik_fg_all) {
130+
auto ikx_f =
131+
Kokkos::subview(ik_fg_all, pair_type(0, 2), Kokkos::ALL, Kokkos::ALL);
132+
auto iky_f =
133+
Kokkos::subview(ik_fg_all, pair_type(2, 4), Kokkos::ALL, Kokkos::ALL);
134+
135+
auto ikx_g = Kokkos::subview(ik_fg_all, 4, Kokkos::ALL, Kokkos::ALL);
136+
auto iky_g = Kokkos::subview(ik_fg_all, 5, Kokkos::ALL, Kokkos::ALL);
137+
auto kx = m_grid->m_kx;
138+
auto kyh = m_grid->m_kyh;
139+
140+
const Kokkos::complex<double> z(0.0, 1.0);
141+
constexpr int nb_vars = 2;
142+
143+
range2D_type range(point2D_type{{0, 0}}, point2D_type{{m_nkyh, m_nkx2}},
144+
tile2D_type{{TILE0, TILE1}});
145+
146+
Kokkos::parallel_for(
147+
range, KOKKOS_LAMBDA(int iky, int ikx) {
148+
const auto tmp_ikx = z * kx(ikx);
149+
const auto tmp_iky = z * kyh(iky);
150+
for (int in = 0; in < nb_vars; in++) {
151+
const auto tmp_fk = fk(in, iky, ikx);
152+
ikx_f(in, iky, ikx) = tmp_ikx * tmp_fk;
153+
iky_f(in, iky, ikx) = tmp_iky * tmp_fk;
154+
}
155+
const auto tmp_gk = gk(iky, ikx);
156+
ikx_g(iky, ikx) = tmp_ikx * tmp_gk;
157+
iky_g(iky, ikx) = tmp_iky * tmp_gk;
158+
});
159+
}
160+
```
161+
162+
For forward and backward FFTs, we create plans during initialization which are reused by the [`KokkosFFT::execute`](https://kokkosfft.readthedocs.io/en/latest/intro/using.html#reuse-fft-plan) function. The following function implements the forward FFT followed by unpacking.
163+
164+
```C++
165+
template <typename InViewType, typename OutViewType>
166+
void forwardFFT(const InViewType& f, OutViewType& fk) {
167+
KokkosFFT::execute(*m_forward_plan, f, m_forward_buffer);
168+
auto forward_buffer = m_forward_buffer;
169+
auto norm_coef = m_norm_coef;
170+
int nkx2 = m_nkx2, nkx = (m_nkx2 - 1) / 2, ny = m_ny, nv = 2;
171+
range3D_type range(point3D_type{{0, 0, 0}},
172+
point3D_type{{nv, m_nkyh, nkx + 1}},
173+
tile3D_type{{2, TILE0, TILE1}});
174+
Kokkos::parallel_for(
175+
"FFT_unpack", range, KOKKOS_LAMBDA(int iv, int iky, int ikx) {
176+
fk(iv, iky, ikx) = forward_buffer(iv, iky, ikx) * norm_coef;
177+
int ikx_neg = nkx2 - ikx;
178+
int iky_neg = (ny - iky), iky_nonzero = iky;
179+
if (ikx == 0) ikx_neg = 0;
180+
if (iky == 0) {
181+
iky_neg = ny - 1;
182+
iky_nonzero = 1;
183+
}
184+
fk(iv, iky_nonzero, ikx_neg) =
185+
Kokkos::conj(forward_buffer(iv, iky_neg, ikx)) * norm_coef;
186+
});
187+
}
188+
```
189+
190+
The implementation together with detailed description can be found in the [examples](https://github.com/kokkos/kokkos-fft/tree/main/examples/10_HasegawaWakatani) directory.
191+
192+
# Benchmark
193+
194+
We have performed a benchmark of this application over multiple backends. We performed a simulation for 100 steps with a resolution of `1024 x 1024` while I/Os are disabled. The following table shows the achieved performance.
195+
196+
| Device | Icelake (python) | Icelake (36 cores) | A100 | H100 | MI250X (1 GCD) | PVC |
197+
| --- | --- | --- | --- | --- | --- | --- |
198+
| LOC | 568 | 738 | 738 | 738 | 738 | 738 |
199+
| Compiler/version | Python 3.12.3 | IntelLLVM 2023.0.0 | nvcc 12.2 | nvcc 12.3 | rocm 5.7 | IntelLLVM 2024.0.2 |
200+
| GB/s (Theoretical peak) | 205 | 205 | 1555 | 3350 | 1600 | 3276.8 |
201+
| Elapsed time [s] | 463 | 9.28 | 0.25 | 0.14 | 0.41 | 0.30 |
202+
| Speed up | x 1 | x 49.9 | x 1852 | x 3307 | x 1129 | x 1562 |
203+
204+
As expected, the Python version is the simplest in terms of lines of code (LOC). With Kokkos and kokkos-fft, the same logic can be implemented without significantly increasing the source code size (roughly 1.5 times longer). However, the performance gain is enormous, allowing a single and simple code runs on multiple architectures efficiently.
205+
Note that this performance improvement largely reflects the fact that KokkosFFT is using the various optimized hardware-specific FFT libraries under the hood.
206+
207+
# Future developments
208+
209+
We are planning to add the following functionalities. Contributions to the project are highly welcomed (see [developer guide](https://kokkosfft.readthedocs.io/en/latest/developer_guide.html)).
210+
211+
* Multi-GPU support with MPI
212+
213+
* Device callable batched capability of FFTs like [`kokkos-kernels`](https://github.com/kokkos/kokkos-kernels)
214+
215+
* Supporting callbacks if backend library supports that
216+
217+
# References
218+
219+
[1] Daniel N Rockmore; The FFT: an algorithm the whole family can use. Computing in Science & Engineering Jan/Feb 2000; 2 (1): 60-64. https://doi.org/10.1109/5992.814659
220+
[2] Y. Asahi, T. Padioleau, P. Zehner, J. Bigot and D Lebrun-Grandie, kokkos-fft: A shared-memory FFT for the Kokkos ecosystem, Journal of Open Source Software (JOSS), submitted. [https://github.com/openjournals/joss-reviews/issues/8391](https://github.com/openjournals/joss-reviews/issues/8391)
221+
[3] Masahiro Wakatani, Akira Hasegawa; A collisional drift wave description of plasma edge turbulence. Phys. Fluids 1 March 1984; 27 (3): 611–618. https://doi.org/10.1063/1.864660

0 commit comments

Comments
 (0)