[SYCL][NATIVE_CPU] Update docs for Native CPU compiler pipeline (#20042)
This integrates the appropriate compiler documentation originally in the
oneAPI Construction Kit (OCK) into the NativeCPU compiler pipeline
documentation.
It has been updated to try to reflect the Native CPU pipeline, and
remove some of the references to OCK's structures, as well as moving
some of the documentation to markdown files to be consistent with some
of the other documentation.
Some of it may be irrelevant for Native CPU, and if so this should be
updated over time.
Support was added for the mermaid flowcharts in the config.
sycl/doc/design/SYCLNativeCPU.md
Lines changed: 33 additions & 78 deletions
@@ -2,7 +2,7 @@
2
2
3
3
The SYCL Native CPU flow aims at treating the host CPU as a "first class citizen", providing a SYCL implementation that targets CPUs of various different architectures, with no other dependencies than DPC++ itself, while bringing performances comparable to state-of-the-art CPU backends. SYCL Native CPU also provides some initial/experimental support for LLVM's [source-based code coverage tools](https://clang.llvm.org/docs/SourceBasedCodeCoverage.html) (see also section [Code coverage](#code-coverage)).
4
4
5
-
# Compiler and runtime options
5
+
## Compiler and runtime options
6
6
7
7
The SYCL Native CPU flow is enabled by setting `native_cpu` as a `sycl-target`:
8
8
@@ -33,9 +33,9 @@ Note that SYCL Native CPU co-exists alongside the other SYCL targets. For exampl
The application can then run on either SYCL target by setting the DPC++ `ONEAPI_DEVICE_SELECTOR` environment variable accordingly.
36
+
The application can then run on either SYCL target by setting the DPC++ `ONEAPI_DEVICE_SELECTOR` environment variable to include `native_cpu:cpu` accordingly.
37
37
38
-
## Configuring DPC++ with SYCL Native CPU
38
+
### Configuring DPC++ with SYCL Native CPU
39
39
40
40
SYCL Native CPU needs to be enabled explicitly when configuring DPC++, using `--native_cpu`, e.g.
SYCL Native CPU uses [libclc](https://github.com/intel/llvm/tree/sycl/libclc) to implement many SPIRV builtins. When Native CPU is enabled, the default target triple for libclc will be `LLVM_TARGET_TRIPLE` (same as the default target triple used by `clang`). This can be overridden by setting the `--native-cpu-libclc-targets` option in `configure.py`.
51
51
52
-
### oneAPI Construction Kit
53
-
54
-
SYCL Native CPU uses the [oneAPI Construction Kit](https://github.com/codeplaysoftware/oneapi-construction-kit) (OCK) in order to support some core SYCL functionalities and improve performances, the OCK is fetched by default when SYCL Native CPU is enabled, and can optionally be disabled using the `NATIVECPU_USE_OCK` CMake variable (please note that disabling the OCK will result in limited functionalities and performances on the SYCL Native CPU backend):
By default the oneAPI Construction Kit is pulled at the project's configure time using CMake `FetchContent`. This behaviour can be overridden by setting `NATIVECPU_OCK_USE_FETCHCONTENT=Off` and `OCK_SOURCE_DIR=<path>`
61
-
in order to use a local checkout of the oneAPI Construction Kit. The CMake variables `OCK_GIT_TAG` and `OCK_GIT_REPO` can be used to override the default git tag and repository used by `FetchContent`.
62
-
63
-
The SYCL Native CPU device needs to be selected at runtime by setting the environment variable `ONEAPI_DEVICE_SELECTOR=native_cpu:cpu`.
64
-
65
-
### oneTBB integration
52
+
#### oneTBB integration
66
53
67
54
SYCL Native CPU can use oneTBB as an optional backend for task scheduling. oneTBB with SYCL Native CPU is enabled by setting `NATIVECPU_WITH_ONETBB=On` at configure time:
68
55
@@ -76,9 +63,17 @@ This will pull oneTBB into SYCL Native CPU via CMake `FetchContent` and DPC++ ca
76
63
77
64
By default SYCL Native CPU implements its own scheduler whose only dependency is standard C++.
78
65
79
-
# Supported features and current limitations
66
+
## Supported features and current limitations
67
+
68
+
The SYCL Native CPU supports all core SYCL features with some outstanding bugs. There are some optional features which have no or partial support:
69
+
70
+
* bfloat16
71
+
* address sanitizer
72
+
* images
73
+
* device globals (unsure as we pass one of them)
74
+
* ESIMD
80
75
81
-
The SYCL Native CPU flow is still WIP, not optimized and several core SYCL features are currently unsupported. Currently `barriers` are supported only when the oneAPI Construction Kit integration is enabled, several math builtins are not supported and attempting to use those will most likely fail with an `undefined reference` error at link time. Examples of supported applications can be found in the [runtime tests](https://github.com/intel/llvm/blob/sycl/sycl/test/native_cpu).
76
+
Some of these, such as bfloat16, will fail with an `undefined reference` error at link time.
82
77
83
78
84
79
To execute the `e2e` tests on SYCL Native CPU, configure the test suite with:
@@ -91,13 +86,12 @@ cmake \
91
86
-G Ninja \
92
87
-B build -S . \
93
88
-DCMAKE_CXX_COMPILER=clang++ \
94
-
-DSYCL_TEST_E2E_TARGETS="native_cpu:cpu"
95
-
89
+
-DSYCL_TEST_E2E_TARGETS="native_cpu:cpu"
96
90
```
97
91
98
92
Note that a number of `e2e` tests are currently still failing.
99
93
100
-
# Vectorization
94
+
## Vectorization
101
95
102
96
With the integration of the OneAPI Construction Kit, the SYCL Native CPU target
103
97
also gained support for Whole Function Vectorization.\\
@@ -107,9 +101,11 @@ Whole Function Vectorization is enabled by default, and can be controlled throug
107
101
108
102
The `-march=` option can be used to select specific target cpus which may improve performance of the vectorized code.
109
103
110
-
For more details on how the Whole Function Vectorizer is integrated for SYCL Native CPU, refer to the [Technical details](#technical-details) section.
104
+
For more details on how the Whole Function Vectorizer is integrated for SYCL Native CPU, refer to the [Native CPU Compiler Pipeline](#native-cpu-compiler-pipeline) section.
111
105
112
-
# Code coverage
106
+
To run the Vecz lit tests, build DPC++ with `-DNATIVE_CPU_BUILD_VECZ_TEST_TOOLS=ON` and run with `check-sycl-vecz`.
107
+
108
+
## Code coverage
113
109
114
110
SYCL Native CPU has experimental support for LLVM's source-based [code coverage](https://clang.llvm.org/docs/SourceBasedCodeCoverage.html). This enables coverage testing across device and host code.
llvm-cov show .\vector-add.exe -instr-profile=foo.profdata
122
118
```
123
119
124
-
## Ongoing work
120
+
### Ongoing work
125
121
126
122
* Complete support for remaining SYCL features, including but not limited to
127
123
* math and other builtins
@@ -130,7 +126,9 @@ llvm-cov show .\vector-add.exe -instr-profile=foo.profdata
130
126
131
127
### Please note that Windows is partially supported but temporarily disabled due to some implementation details; it will be re-enabled soon.
132
128
133
-
# Technical details
129
+
## Native CPU compiler pipeline
130
+
131
+
SYCL Native CPU formerly used the [oneAPI Construction Kit](https://github.com/uxlfoundation/oneapi-construction-kit) (OCK) via CMake FetchContent in order to support some core SYCL functionalities and improve performance in the compiler pipeline. The relevant OCK parts have been brought into DPC++ and the Native CPU compiler pipeline is documented in the [SYCLNativeCPUPipeline documentation](SYCLNativeCPUPipeline.md), with a brief overview below. The OCK-related parts are still controlled by the `NATIVECPU_USE_OCK` CMake variable, which is enabled by default.
134
132
135
133
The following section gives a brief overview of how a simple SYCL application is compiled for the SYCL Native CPU target. Consider the following SYCL sample, which performs vector addition using USM:
136
134
@@ -174,62 +172,20 @@ entry:
174
172
```
175
173
176
174
For the SYCL Native CPU target, the device compiler is in charge of materializing the SPIRV builtins (such as `@__spirv_BuiltInGlobalInvocationId`), so that they can be correctly updated by the runtime when executing the kernel. This is performed by the [PrepareSYCLNativeCPU pass](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/PrepareSYCLNativeCPU.cpp).
177
-
The PrepareSYCLNativeCPUPass also emits a `subhandler` function, which receives the kernel arguments from the SYCL runtime (packed in a vector), unpacks them, and forwards only the used ones to the actual kernel.
175
+
The PrepareSYCLNativeCPUPass also emits a `subhandler` wrapper function, which receives the kernel arguments from the SYCL runtime (packed in a vector), unpacks them, and forwards only the used ones to the actual kernel.
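The role of the `subhandler` wrapper can be illustrated with a small hand-written C++ sketch. This is not the code the pass emits (the pass works on LLVM-IR); `StateSketch`, `vadd_kernel`, `vadd_subhandler` and `run_vadd` are hypothetical names chosen for illustration, and the position of the unused argument is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-work-item state; the real
// native_cpu::state struct carries more information than this.
struct StateSketch {
  std::size_t global_id;
};

// Illustrative kernel: only three of the runtime-provided arguments are used.
void vadd_kernel(const int *a, const int *b, int *c, StateSketch *s) {
  const std::size_t i = s->global_id;
  c[i] = a[i] + b[i];
}

// Sketch of a subhandler: it receives every argument packed in a vector,
// unpacks them, and forwards only the ones the kernel actually uses.
void vadd_subhandler(const std::vector<void *> &packed_args, StateSketch *s) {
  auto *a = static_cast<const int *>(packed_args[0]);
  // packed_args[1] is assumed to be unused by the kernel, so it is dropped.
  auto *b = static_cast<const int *>(packed_args[2]);
  auto *c = static_cast<int *>(packed_args[3]);
  vadd_kernel(a, b, c, s);
}

// Plays the part of the runtime: packs the arguments and invokes the
// subhandler once per work-item.
std::vector<int> run_vadd(std::vector<int> a, std::vector<int> b) {
  std::vector<int> c(a.size());
  int unused = 0;
  std::vector<void *> packed = {a.data(), &unused, b.data(), c.data()};
  for (std::size_t i = 0; i < a.size(); ++i) {
    StateSketch s{i};
    vadd_subhandler(packed, &s);
  }
  return c;
}
```

Calling `run_vadd` with two equally sized vectors returns their element-wise sum; the kernel never sees the unused argument.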
178
176
179
177
180
-
## PrepareSYCLNativeCPU Pass
178
+
### PrepareSYCLNativeCPU Pass
181
179
182
-
This pass will add a pointer to a `native_cpu::state` struct as kernel argument to all the kernel functions, and it will replace all the uses of SPIRV builtins with the return value of appropriately defined functions, which will read the requested information from the `native_cpu::state` struct. The `native_cpu::state` struct is defined in the [native_cpu UR adapter](https://github.com/oneapi-src/unified-runtime/blob/main/source/adapters/native_cpu/nativecpu_state.hpp) and the builtin functions are defined in the [native_cpu device library](https://github.com/intel/llvm/blob/sycl/libdevice/nativecpu_utils.cpp).
180
+
This pass will add a pointer to a `native_cpu::state` struct as kernel argument to all the kernel functions, and it will replace all the uses of SPIRV builtins with the return value of appropriately defined functions, which will read the requested information from the `native_cpu::state` struct. For more information, see [PrepareSYCLNativeCPU Pass](SYCLNativeCPUPipeline.md#preparesyclnativecpu-pass).
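As a rough C++ analogy of this transformation (the real pass rewrites LLVM-IR), the sketch below shows a kernel after it has gained a state-pointer argument, with the builtin lookup replaced by a helper that reads from that struct. `state_sketch`, `get_global_id`, `scale_kernel` and `run_range` are hypothetical names, and the struct layout is an assumption rather than the actual `native_cpu::state` definition.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, reduced stand-in for native_cpu::state.
struct state_sketch {
  std::size_t global_id[3];
  std::size_t global_range[3];
};

// Stand-in for a device-library helper: returns the value that a SPIRV
// builtin such as the global invocation id would have provided.
inline std::size_t get_global_id(unsigned dim, const state_sketch *s) {
  return s->global_id[dim];
}

// Kernel after the transformation: the state pointer is an extra argument
// and the builtin use has become a call that reads from it.
void scale_kernel(int *data, int factor, const state_sketch *s) {
  const std::size_t i = get_global_id(0, s);
  data[i] *= factor;
}

// Plays the part of the runtime: updates the state for each work-item
// before invoking the kernel.
void run_range(int *data, std::size_t n, int factor) {
  for (std::size_t i = 0; i < n; ++i) {
    state_sketch s{{i, 0, 0}, {n, 1, 1}};
    scale_kernel(data, factor, &s);
  }
}
```

Because the ids come from an explicit argument rather than a global builtin, the runtime is free to update them between invocations.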
store i32 %add.i, ptr %arrayidx6.i, align 4, !tbaa !72
201
-
ret void
202
-
}
203
-
```
204
-
This pass will also set the correct calling convention for the target, and handle calling convention-related function attributes, allowing to call the kernel from the runtime.
205
-
206
-
The `subhandler` for the SYCL Native CPU kernel looks like:
As you can see, the `subhandler` steals the kernel's function name, and receives two pointer arguments: the first one points to the kernel arguments from the SYCL runtime, and the second one to the `nativecpu::state` struct.
225
-
226
-
## Handling barriers
182
+
### Handling barriers
227
183
228
184
On SYCL Native CPU, calls to `__spirv_ControlBarrier` are handled using the `WorkItemLoopsPass` from the oneAPI Construction Kit. This pass handles barriers by splitting the kernel between calls to `__spirv_ControlBarrier`, and creating a wrapper that runs the subkernels over the local range. In order to correctly interface to the oneAPI Construction Kit pass pipeline, SPIRV builtins are defined in the device library to call the corresponding `mux` builtins (used by the OCK).
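The barrier-splitting idea described above can be sketched in plain C++. This is a hand-written illustration under simplifying assumptions (a single one-dimensional work-group, one barrier), not the output of `WorkItemLoopsPass`; all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Conceptual original kernel, executed by every work-item `lid`:
//   scratch[lid] = in[lid] * 2;
//   barrier();
//   out[lid] = scratch[(lid + 1) % local_size];

// Part of the kernel that precedes the barrier.
void subkernel_before_barrier(std::size_t lid, const std::vector<int> &in,
                              std::vector<int> &scratch) {
  scratch[lid] = in[lid] * 2;
}

// Part of the kernel that follows the barrier; it reads a value written by a
// different work-item, which is why the barrier is required.
void subkernel_after_barrier(std::size_t lid, const std::vector<int> &scratch,
                             std::vector<int> &out) {
  out[lid] = scratch[(lid + 1) % scratch.size()];
}

// Wrapper: runs each subkernel over the whole local range before moving on,
// so all writes before the barrier are complete when the second loop starts.
std::vector<int> run_work_group(const std::vector<int> &in) {
  const std::size_t local_size = in.size();
  std::vector<int> scratch(local_size), out(local_size);
  for (std::size_t lid = 0; lid < local_size; ++lid)
    subkernel_before_barrier(lid, in, scratch);
  for (std::size_t lid = 0; lid < local_size; ++lid)
    subkernel_after_barrier(lid, scratch, out);
  return out;
}
```

Running the two loops back to back on one thread gives the same result as executing the work-items concurrently with a real barrier, which is what makes this approach suitable for a CPU target.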
229
185
230
-
## Vectorization
186
+
### Vectorization
231
187
232
-
The OneAPI Construction Kit's Whole Function Vectorizer is executed as an LLVM Pass. Considering the following input function:
188
+
The Whole Function Vectorizer is executed as an LLVM Pass. Considering the following input function:
233
189
234
190
```llvm
235
191
define void @SimpleVadd(i32*, i32*, i32*) {
@@ -269,7 +225,7 @@ and points to the original version of the function. This information is used lat
269
225
which will account for the vectorization when creating the Work Item Loops, and use the original version of the function to add
270
226
peeling loops.
271
227
272
-
## Kernel registration
228
+
### Kernel registration
273
229
274
230
In order to register the SYCL Native CPU kernels to the SYCL runtime, we applied a small change to the `clang-offload-wrapper` tool: normally, the `clang-offload-wrapper` bundles the offload binary in an LLVM-IR module. Instead of bundling the device code, for the SYCL Native CPU target we insert an array of function pointers to the `subhandler`s, and the `sycl_device_binary_struct::BinaryStart` and `sycl_device_binary_struct::BinaryEnd` fields, which normally point to the begin and end addresses of the offload binary, now point to the begin and end of the array.
275
231
@@ -285,7 +241,6 @@ In order to register the SYCL Native CPU kernels to the SYCL runtime, we applied
285
241
286
242
Each entry in the array contains the kernel name as a string, and a pointer to the `subhandler` function declaration. Since the subhandler's signature has always the same arguments (two pointers in LLVM-IR), the `clang-offload-wrapper` can emit the function declarations given just the function names contained in the `.table` file emitted by `sycl-post-link`. The symbols are then resolved by the system's linker, which receives both the output from the offload wrapper and the lowered device module.
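A minimal C++ sketch of this registration scheme follows. It is only an analogy for what the offload wrapper emits as LLVM-IR: `KernelEntry`, `BinaryDescSketch`, `SubhandlerFn`, `find_kernel` and the two toy kernels are hypothetical, and the field types differ from the real `sycl_device_binary_struct`.

```cpp
#include <cassert>
#include <cstring>

struct State; // opaque here; stands in for the state struct

// Assumed subhandler signature: two pointers, as described in the text.
using SubhandlerFn = void (*)(void *packed_args, State *state);

// One array entry: the kernel name plus a pointer to its subhandler.
struct KernelEntry {
  const char *name;
  SubhandlerFn fn;
};

// Toy subhandlers; in a real build these symbols are resolved by the linker.
void kernel_a(void *args, State *) { *static_cast<int *>(args) += 1; }
void kernel_b(void *args, State *) { *static_cast<int *>(args) *= 2; }

static const KernelEntry kEntries[] = {{"kernel_a", kernel_a},
                                       {"kernel_b", kernel_b}};

// Stand-in for the binary descriptor: the start/end fields delimit the entry
// array instead of an offload binary.
struct BinaryDescSketch {
  const KernelEntry *BinaryStart;
  const KernelEntry *BinaryEnd;
};

static const BinaryDescSketch kDesc = {
    kEntries, kEntries + sizeof(kEntries) / sizeof(kEntries[0])};

// Runtime-side lookup: walk the array between the two pointers by name.
SubhandlerFn find_kernel(const BinaryDescSketch &d, const char *name) {
  for (const KernelEntry *e = d.BinaryStart; e != d.BinaryEnd; ++e)
    if (std::strcmp(e->name, name) == 0)
      return e->fn;
  return nullptr;
}
```

A lookup by name returns a callable function pointer, or `nullptr` when no kernel with that name was registered.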
287
243
288
-
## Kernel lowering and execution
244
+
### Kernel lowering and execution
289
245
290
246
The information produced by the device compiler is then employed to correctly lower the kernel LLVM-IR module to the target ISA (this is performed by the driver when `-fsycl-targets=native_cpu` is set). The object file containing the kernel code is linked with the host object file (and libsycl and any other needed library) and the final executable is run using the SYCL Native CPU UR Adapter, defined in [the Unified Runtime repo](https://github.com/oneapi-src/unified-runtime/tree/adapters/source/adapters/native_cpu).