Commit 5a03612

[SYCL][NATIVE_CPU] Update docs for Native CPU compiler pipeline (#20042)
This integrates the appropriate compiler documentation originally in the oneAPI Construction Kit (OCK) into the Native CPU compiler pipeline documentation. It has been updated to try to reflect the Native CPU pipeline and to remove some of the references to OCK's structures, and some of the documentation has been moved to markdown files to be consistent with other documentation. Some of it may be irrelevant for Native CPU, and if so this should be updated over time. Support was added for the mermaid flowcharts in the config.
1 parent 899bd6d commit 5a03612

7 files changed

+2832
-78
lines changed

sycl/doc/design/SYCLNativeCPU.md

Lines changed: 33 additions & 78 deletions
@@ -2,7 +2,7 @@
22

33
The SYCL Native CPU flow aims at treating the host CPU as a "first class citizen", providing a SYCL implementation that targets CPUs of various different architectures, with no other dependencies than DPC++ itself, while bringing performances comparable to state-of-the-art CPU backends. SYCL Native CPU also provides some initial/experimental support for LLVM's [source-based code coverage tools](https://clang.llvm.org/docs/SourceBasedCodeCoverage.html) (see also section [Code coverage](#code-coverage)).
44

5-
# Compiler and runtime options
5+
## Compiler and runtime options
66

77
The SYCL Native CPU flow is enabled by setting `native_cpu` as a `sycl-target`:
88

@@ -33,9 +33,9 @@ Note that SYCL Native CPU co-exists alongside the other SYCL targets. For exampl
3333
```
3434
clang++ -fsycl -fsycl-targets=native_cpu,spir64 <input> -o <output>
3535
```
36-
The application can then run on either SYCL target by setting the DPC++ `ONEAPI_DEVICE_SELECTOR` environment variable accordingly.
36+
The application can then run on either SYCL target by setting the DPC++ `ONEAPI_DEVICE_SELECTOR` environment variable to include `native_cpu:cpu` accordingly.
3737

38-
## Configuring DPC++ with SYCL Native CPU
38+
### Configuring DPC++ with SYCL Native CPU
3939

4040
SYCL Native CPU needs to be enabled explicitly when configuring DPC++, using `--native_cpu`, e.g.
4141

@@ -45,24 +45,11 @@ python buildbot/configure.py \
4545
# other options here
4646
```
4747

48-
### libclc target triples
48+
#### libclc target triples
4949

5050
SYCL Native CPU uses [libclc](https://github.com/intel/llvm/tree/sycl/libclc) to implement many SPIRV builtins. When Native CPU is enabled, the default target triple for libclc will be `LLVM_TARGET_TRIPLE` (same as the default target triple used by `clang`). This can be overridden by setting the `--native-cpu-libclc-targets` option in `configure.py`.
5151

52-
### oneAPI Construction Kit
53-
54-
SYCL Native CPU uses the [oneAPI Construction Kit](https://github.com/codeplaysoftware/oneapi-construction-kit) (OCK) in order to support some core SYCL functionalities and improve performances, the OCK is fetched by default when SYCL Native CPU is enabled, and can optionally be disabled using the `NATIVECPU_USE_OCK` CMake variable (please note that disabling the OCK will result in limited functionalities and performances on the SYCL Native CPU backend):
55-
56-
```
57-
python3 buildbot/configure.py --native_cpu -DNATIVECPU_USE_OCK=Off
58-
```
59-
60-
By default the oneAPI Construction Kit is pulled at the project's configure time using CMake `FetchContent`. This behaviour can be overridden by setting `NATIVECPU_OCK_USE_FETCHCONTENT=Off` and `OCK_SOURCE_DIR=<path>`
61-
in order to use a local checkout of the oneAPI Construction Kit. The CMake variables `OCK_GIT_TAG` and `OCK_GIT_REPO` can be used to override the default git tag and repository used by `FetchContent`.
62-
63-
The SYCL Native CPU device needs to be selected at runtime by setting the environment variable `ONEAPI_DEVICE_SELECTOR=native_cpu:cpu`.
64-
65-
### oneTBB integration
52+
#### oneTBB integration
6653

6754
SYCL Native CPU can use oneTBB as an optional backend for task scheduling. oneTBB with SYCL Native CPU is enabled by setting `NATIVECPU_WITH_ONETBB=On` at configure time:
6855

@@ -76,9 +63,17 @@ This will pull oneTBB into SYCL Native CPU via CMake `FetchContent` and DPC++ ca
7663

7764
By default SYCL Native CPU implements its own scheduler whose only dependency is standard C++.
7865

79-
# Supported features and current limitations
66+
## Supported features and current limitations
67+
68+
The SYCL Native CPU supports all core SYCL features with some outstanding bugs. There are some optional features which have no or partial support:
69+
70+
* bfloat16
71+
* address sanitizer
72+
* images
73+
* device globals (unsure as we pass one of them)
74+
* ESIMD
8075

81-
The SYCL Native CPU flow is still WIP, not optimized and several core SYCL features are currently unsupported. Currently `barriers` are supported only when the oneAPI Construction Kit integration is enabled, several math builtins are not supported and attempting to use those will most likely fail with an `undefined reference` error at link time. Examples of supported applications can be found in the [runtime tests](https://github.com/intel/llvm/blob/sycl/sycl/test/native_cpu).
76+
Some of these, such as bfloat16, will fail with an `undefined reference` error at link time.
8277

8378

8479
To execute the `e2e` tests on SYCL Native CPU, configure the test suite with:
@@ -91,13 +86,12 @@ cmake \
9186
-G Ninja \
9287
-B build -S . \
9388
-DCMAKE_CXX_COMPILER=clang++ \
94-
-DSYCL_TEST_E2E_TARGETS="native_cpu:cpu"
95-
89+
-DSYCL_TEST_E2E_TARGETS="native_cpu:cpu"
9690
```
9791

9892
Note that a number of `e2e` tests are currently still failing.
9993

100-
# Vectorization
94+
## Vectorization
10195

10296
With the integration of the OneAPI Construction Kit, the SYCL Native CPU target
10397
also gained support for Whole Function Vectorization.\\
@@ -107,9 +101,11 @@ Whole Function Vectorization is enabled by default, and can be controlled throug
107101

108102
The `-march=` option can be used to select specific target cpus which may improve performance of the vectorized code.
109103

110-
For more details on how the Whole Function Vectorizer is integrated for SYCL Native CPU, refer to the [Technical details](#technical-details) section.
104+
For more details on how the Whole Function Vectorizer is integrated for SYCL Native CPU, refer to the [Native CPU Compiler Pipeline](#native-cpu-compiler-pipeline) section.
111105

112-
# Code coverage
106+
To run the Vecz lit tests, build DPC++ with `-DNATIVE_CPU_BUILD_VECZ_TEST_TOOLS=ON` and run with `check-sycl-vecz`.
107+
108+
## Code coverage
113109

114110
SYCL Native CPU has experimental support for LLVM's source-based [code coverage](https://clang.llvm.org/docs/SourceBasedCodeCoverage.html). This enables coverage testing across device and host code.
115111
Example usage:
@@ -121,7 +117,7 @@ llvm-profdata merge -sparse default.profraw -o foo.profdata
121117
llvm-cov show .\vector-add.exe -instr-profile=foo.profdata
122118
```
123119

124-
## Ongoing work
120+
### Ongoing work
125121

126122
* Complete support for remaining SYCL features, including but not limited to
127123
* math and other builtins
@@ -130,7 +126,9 @@ llvm-cov show .\vector-add.exe -instr-profile=foo.profdata
130126

131127
### Please note that Windows is partially supported but temporarily disabled due to some implementation details, it will be re-enabled soon.
132128

133-
# Technical details
129+
## Native CPU compiler pipeline
130+
131+
SYCL Native CPU formerly used the [oneAPI Construction Kit](https://github.com/uxlfoundation/oneapi-construction-kit) (OCK) via CMake FetchContent in order to support some core SYCL functionality and improve performance in the compiler pipeline. The relevant OCK parts have been brought into DPC++ and the Native CPU compiler pipeline is documented in [SYCLNativeCPUPipeline documentation](SYCLNativeCPUPipeline.md), with a brief overview below. The OCK-related parts are still controlled by the `NATIVECPU_USE_OCK` CMake variable, which is enabled by default.
134132

135133
The following section gives a brief overview of how a simple SYCL application is compiled for the SYCL Native CPU target. Consider the following SYCL sample, which performs vector addition using USM:
136134

@@ -174,62 +172,20 @@ entry:
174172
```
175173

176174
For the SYCL Native CPU target, the device compiler is in charge of materializing the SPIRV builtins (such as `@__spirv_BuiltInGlobalInvocationId`), so that they can be correctly updated by the runtime when executing the kernel. This is performed by the [PrepareSYCLNativeCPU pass](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/PrepareSYCLNativeCPU.cpp).
177-
The PrepareSYCLNativeCPUPass also emits a `subhandler` function, which receives the kernel arguments from the SYCL runtime (packed in a vector), unpacks them, and forwards only the used ones to the actual kernel.
175+
The PrepareSYCLNativeCPUPass also emits a `subhandler` wrapper function, which receives the kernel arguments from the SYCL runtime (packed in a vector), unpacks them, and forwards only the used ones to the actual kernel.
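The role of the `subhandler` can be illustrated with a minimal plain C++ sketch. This is not the code the pass emits: `SketchState`, `sketch_vadd_kernel`, `sketch_vadd_subhandler` and the argument slot layout (slot 1 unused) are all hypothetical names and choices made for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the per-invocation state supplied by the runtime.
struct SketchState {
  std::size_t global_id;
};

// Hypothetical "actual kernel": it only takes the arguments it really uses.
void sketch_vadd_kernel(int *out, const int *a, const int *b,
                        const SketchState *s) {
  const std::size_t i = s->global_id;
  out[i] = a[i] + b[i];
}

// Sketch of a subhandler: a uniform two-pointer signature. It unpacks the
// packed argument array and forwards only the used slots (0, 2 and 3 here;
// slot 1 is assumed to be unused by the kernel and is skipped).
void sketch_vadd_subhandler(void *const *packed_args, const SketchState *s) {
  assert(packed_args != nullptr && s != nullptr);
  int *out = static_cast<int *>(packed_args[0]);
  const int *a = static_cast<const int *>(packed_args[2]);
  const int *b = static_cast<const int *>(packed_args[3]);
  sketch_vadd_kernel(out, a, b, s);
}
```

Because every subhandler in this sketch has the same signature, a caller can invoke any kernel through one function pointer type without knowing which arguments the kernel uses.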
178176

179177

180-
## PrepareSYCLNativeCPU Pass
178+
### PrepareSYCLNativeCPU Pass
181179

182-
This pass will add a pointer to a `native_cpu::state` struct as kernel argument to all the kernel functions, and it will replace all the uses of SPIRV builtins with the return value of appropriately defined functions, which will read the requested information from the `native_cpu::state` struct. The `native_cpu::state` struct is defined in the [native_cpu UR adapter](https://github.com/oneapi-src/unified-runtime/blob/main/source/adapters/native_cpu/nativecpu_state.hpp) and the builtin functions are defined in the [native_cpu device library](https://github.com/intel/llvm/blob/sycl/libdevice/nativecpu_utils.cpp).
180+
This pass will add a pointer to a `native_cpu::state` struct as kernel argument to all the kernel functions, and it will replace all the uses of SPIRV builtins with the return value of appropriately defined functions, which will read the requested information from the `native_cpu::state` struct. For more information, see [PrepareSYCLNativeCPU Pass](SYCLNativeCPUPipeline.md#preparesyclnativecpu-pass).
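As a rough illustration of the transformation described above, the following plain C++ sketch models a kernel after the pass has run. The struct layout and the names `SketchNativeCpuState`, `sketch_get_global_id`, `sketch_scale_kernel` and `sketch_run_range` are assumptions for illustration; the real `native_cpu::state` struct and builtin functions are defined by DPC++ and differ from this.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical, heavily simplified stand-in for native_cpu::state.
struct SketchNativeCpuState {
  std::array<std::size_t, 3> global_id;
};

// Stand-in for an "appropriately defined function" that replaces a use of a
// SPIRV builtin: it reads the requested value from the state struct.
std::size_t sketch_get_global_id(unsigned dim, const SketchNativeCpuState *s) {
  assert(dim < 3);
  return s->global_id[dim];
}

// Kernel as it might look after the pass: the state pointer is an extra
// argument, and the global id is obtained through the helper above.
void sketch_scale_kernel(float *data, float factor,
                         const SketchNativeCpuState *s) {
  const std::size_t i = sketch_get_global_id(0, s);
  data[i] *= factor;
}

// Stand-in for the runtime: it updates the state before each invocation.
void sketch_run_range(float *data, float factor, std::size_t n) {
  SketchNativeCpuState s{};
  for (std::size_t i = 0; i < n; ++i) {
    s.global_id[0] = i;
    sketch_scale_kernel(data, factor, &s);
  }
}
```

The point of the design is that the kernel body no longer depends on any global builtin variable: everything it needs to know about its position in the range arrives through the state pointer.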
183181

184-
185-
The resulting IR is:
186-
187-
```llvm
188-
define weak dso_local void @_Z6Sample.NativeCPUKernel(ptr noundef align 4 %0, ptr noundef align 4 %1, ptr noundef align 4 %2, ptr %3) local_unnamed_addr #3 !srcloc !74 !kernel_arg_buffer_location !75 !kernel_arg_type !76 !sycl_fixed_targets !49 !sycl_kernel_omit_args !77 {
189-
entry:
190-
%ncpu_builtin = call ptr @_Z13get_global_idmP15nativecpu_state(ptr %3)
191-
%4 = load i64, ptr %ncpu_builtin, align 32, !noalias !78
192-
%arrayidx.i = getelementptr inbounds i32, ptr %1, i64 %4
193-
%5 = load i32, ptr %arrayidx.i, align 4, !tbaa !72
194-
%arrayidx4.i = getelementptr inbounds i32, ptr %2, i64 %4
195-
%6 = load i32, ptr %arrayidx4.i, align 4, !tbaa !72
196-
%add.i = add nsw i32 %5, %6
197-
%cmp.i8.i = icmp ult i64 %4, 2147483648
198-
tail call void @llvm.assume(i1 %cmp.i8.i)
199-
%arrayidx6.i = getelementptr inbounds i32, ptr %0, i64 %4
200-
store i32 %add.i, ptr %arrayidx6.i, align 4, !tbaa !72
201-
ret void
202-
}
203-
```
204-
This pass will also set the correct calling convention for the target, and handle calling convention-related function attributes, allowing to call the kernel from the runtime.
205-
206-
The `subhandler` for the SYCL Native CPU kernel looks like:
207-
208-
```llvm
209-
define weak void @_Z6Sample(ptr %0, ptr %1) #4 {
210-
entry:
211-
%2 = getelementptr %0, ptr %0, i64 0
212-
%3 = load ptr, ptr %2, align 8
213-
%4 = getelementptr %0, ptr %0, i64 3
214-
%5 = load ptr, ptr %4, align 8
215-
%6 = getelementptr %0, ptr %0, i64 4
216-
%7 = load ptr, ptr %6, align 8
217-
%8 = getelementptr %0, ptr %0, i64 7
218-
%9 = load ptr, ptr %8, align 8
219-
call void @_ZTS10SimpleVaddIiE.NativeCPUKernel(ptr %3, ptr %5, ptr %7, ptr %9, ptr %1)
220-
ret void
221-
}
222-
```
223-
224-
As you can see, the `subhandler` steals the kernel's function name, and receives two pointer arguments: the first one points to the kernel arguments from the SYCL runtime, and the second one to the `nativecpu::state` struct.
225-
226-
## Handling barriers
182+
### Handling barriers
227183

228184
On SYCL Native CPU, calls to `__spirv_ControlBarrier` are handled using the `WorkItemLoopsPass` from the oneAPI Construction Kit. This pass handles barriers by splitting the kernel between calls to `__spirv_ControlBarrier`, and creating a wrapper that runs the subkernels over the local range. In order to correctly interface to the oneAPI Construction Kit pass pipeline, SPIRV builtins are defined in the device library to call the corresponding `mux` builtins (used by the OCK).
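The barrier-splitting idea can be sketched in plain C++. This is a conceptual model only, not the output of `WorkItemLoopsPass`: the kernel, the function names and the scratch buffer are hypothetical. The original kernel is assumed to write `scratch[lid]`, hit a barrier, and then read a neighbour's slot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical subkernel holding the code before the barrier.
void sketch_before_barrier(std::size_t lid, const int *in, int *scratch) {
  scratch[lid] = in[lid] * 2;
}

// Hypothetical subkernel holding the code after the barrier. It reads a slot
// written by a different work-item, which is why the barrier is needed.
void sketch_after_barrier(std::size_t lid, std::size_t local_size,
                          const int *scratch, int *out) {
  out[lid] = scratch[(lid + 1) % local_size];
}

// Sketch of the wrapper: each subkernel runs over the whole local range
// before the next one starts, so every work-item has reached the barrier
// by the time any work-item continues past it.
void sketch_work_group_wrapper(std::size_t local_size, const int *in,
                               int *out) {
  assert(local_size > 0);
  std::vector<int> scratch(local_size);
  for (std::size_t lid = 0; lid < local_size; ++lid)
    sketch_before_barrier(lid, in, scratch.data());
  for (std::size_t lid = 0; lid < local_size; ++lid)
    sketch_after_barrier(lid, local_size, scratch.data(), out);
}
```

Running the work-items of a group sequentially in two loops gives the same result as running them concurrently with a real barrier, without needing one thread per work-item.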
229185

230-
## Vectorization
186+
### Vectorization
231187

232-
The OneAPI Construction Kit's Whole Function Vectorizer is executed as an LLVM Pass. Considering the following input function:
188+
The Whole Function Vectorizer is executed as an LLVM Pass. Considering the following input function:
233189

234190
```llvm
235191
define void @SimpleVadd(i32*, i32*, i32*) {
@@ -269,7 +225,7 @@ and points to the original version of the function. This information is used lat
269225
which will account for the vectorization when creating the Work Item Loops, and use the original version of the function to add
270226
peeling loops.
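How a vectorized kernel and its original scalar version can cooperate in a work-item loop is sketched below in plain C++. The width of 4, the function names and the loop structure are assumptions for illustration; the loops actually generated by the pipeline may be organised differently.

```cpp
#include <cassert>
#include <cstddef>

// Assumed vectorization width for this sketch.
constexpr std::size_t kSketchWidth = 4;

// Original (scalar) kernel: handles a single work-item.
void sketch_vadd_scalar(std::size_t id, const int *a, const int *b, int *out) {
  out[id] = a[id] + b[id];
}

// Stand-in for the vectorized kernel: handles kSketchWidth consecutive
// work-items per call. A real vectorizer would use SIMD instructions; a
// fixed-width loop is used here to keep the sketch portable.
void sketch_vadd_vectorized(std::size_t base, const int *a, const int *b,
                            int *out) {
  for (std::size_t lane = 0; lane < kSketchWidth; ++lane)
    out[base + lane] = a[base + lane] + b[base + lane];
}

// Work-item loop: the main loop steps by the vector width, and a peeling
// loop uses the original scalar kernel for the leftover work-items.
void sketch_work_item_loop(std::size_t n, const int *a, const int *b,
                           int *out) {
  assert(a != nullptr && b != nullptr && out != nullptr);
  std::size_t id = 0;
  for (; id + kSketchWidth <= n; id += kSketchWidth)
    sketch_vadd_vectorized(id, a, b, out);
  for (; id < n; ++id)
    sketch_vadd_scalar(id, a, b, out);
}
```

With 10 work-items and a width of 4, the main loop covers ids 0 to 7 in two calls and the peeling loop covers ids 8 and 9.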
271227

272-
## Kernel registration
228+
### Kernel registration
273229

274230
In order to register the SYCL Native CPU kernels to the SYCL runtime, we applied a small change to the `clang-offload-wrapper` tool: normally, the `clang-offload-wrapper` bundles the offload binary in an LLVM-IR module. Instead of bundling the device code, for the SYCL Native CPU target we insert an array of function pointers to the `subhandler`s, and the `sycl_device_binary_struct::BinaryStart` and `sycl_device_binary_struct::BinaryEnd` fields, which normally point to the begin and end addresses of the offload binary, now point to the begin and end of the array.
275231

@@ -285,7 +241,6 @@ In order to register the SYCL Native CPU kernels to the SYCL runtime, we applied
285241

286242
Each entry in the array contains the kernel name as a string, and a pointer to the `subhandler` function declaration. Since the subhandler's signature has always the same arguments (two pointers in LLVM-IR), the `clang-offload-wrapper` can emit the function declarations given just the function names contained in the `.table` file emitted by `sycl-post-link`. The symbols are then resolved by the system's linker, which receives both the output from the offload wrapper and the lowered device module.
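The array described above can be modelled with a small plain C++ sketch. The entry layout, the names `SketchKernelEntry`, `kSketchBegin`, `kSketchEnd` and `sketch_find_kernel`, and the linear lookup by name are all assumptions for illustration; the source does not specify how the runtime searches the array.

```cpp
#include <cassert>
#include <cstring>

// Uniform subhandler signature: two pointers (packed args and state).
using SketchSubhandlerFn = void (*)(void *const *args, void *state);

// Hypothetical array entry: kernel name plus a pointer to its subhandler.
struct SketchKernelEntry {
  const char *name;
  SketchSubhandlerFn fn;
};

// Two hypothetical subhandlers that record which one was called.
void sketch_kernel_a(void *const *args, void *) {
  *static_cast<int *>(args[0]) = 1;
}
void sketch_kernel_b(void *const *args, void *) {
  *static_cast<int *>(args[0]) = 2;
}

const SketchKernelEntry kSketchEntries[] = {
    {"kernel_a", sketch_kernel_a},
    {"kernel_b", sketch_kernel_b},
};

// Stand-ins for the BinaryStart and BinaryEnd fields: they delimit the
// array of entries rather than an offload binary.
const SketchKernelEntry *const kSketchBegin = kSketchEntries;
const SketchKernelEntry *const kSketchEnd =
    kSketchEntries + sizeof(kSketchEntries) / sizeof(kSketchEntries[0]);

// Assumed lookup: walk the range and compare names.
SketchSubhandlerFn sketch_find_kernel(const SketchKernelEntry *begin,
                                      const SketchKernelEntry *end,
                                      const char *name) {
  assert(begin <= end);
  for (const SketchKernelEntry *e = begin; e != end; ++e)
    if (std::strcmp(e->name, name) == 0)
      return e->fn;
  return nullptr;
}
```

Since all entries share one function pointer type, the table only needs the kernel names to declare the functions, which matches the reasoning given above for why the function names alone are sufficient.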
287243

288-
## Kernel lowering and execution
244+
### Kernel lowering and execution
289245

290246
The information produced by the device compiler is then employed to correctly lower the kernel LLVM-IR module to the target ISA (this is performed by the driver when `-fsycl-targets=native_cpu` is set). The object file containing the kernel code is linked with the host object file (and libsycl and any other needed library) and the final executable is run using the SYCL Native CPU UR Adapter, defined in [the Unified Runtime repo](https://github.com/oneapi-src/unified-runtime/tree/adapters/source/adapters/native_cpu).
291-
