
Commit 7c6e3a8

[SYCL][NATIVE_CPU] Update docs for Native CPU compiler pipeline

This integrates the appropriate compiler documentation, originally in the oneAPI Construction Kit (OCK), into the Native CPU compiler pipeline documentation. It has been updated to reflect the Native CPU pipeline and to remove some references to OCK's structures, and some of the documentation has been moved to markdown files to be consistent with the other documentation. Some of it may be irrelevant for Native CPU; if so, it should be updated over time. Mermaid has also been enabled to allow viewing of flowcharts in the markdown.

1 parent b671165

7 files changed: +2800 −14 lines changed

llvm/docs/requirements.txt

Lines changed: 1 addition & 0 deletions

@@ -8,3 +8,4 @@ sphinxcontrib-applehelp==2.0.0
 sphinx-reredirects==0.1.6
 furo==2025.7.19
 myst-parser==4.0.0
+sphinxcontrib-mermaid==1.0.0

sycl/doc/conf.py

Lines changed: 4 additions & 1 deletion

@@ -32,7 +32,7 @@
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
-extensions = ["myst_parser"]
+extensions = ["myst_parser", "sphinxcontrib.mermaid"]
 
 # Implicit targets for cross reference
 myst_heading_anchors = 5
@@ -47,6 +47,9 @@
 # The suffix of source filenames.
 source_suffix = [".rst", ".md"]
 
+# Allow use of mermaid directly to view on github without the {}
+myst_fence_as_directive = ["mermaid"]
+
 exclude_patterns = [
     # Extensions are mostly in asciidoc which has poor support in Sphinx.
     "extensions/*",

sycl/doc/design/SYCLNativeCPU.md

Lines changed: 5 additions & 13 deletions

@@ -49,18 +49,6 @@ python buildbot/configure.py \
 
 SYCL Native CPU uses [libclc](https://github.com/intel/llvm/tree/sycl/libclc) to implement many SPIRV builtins. When Native CPU is enabled, the default target triple for libclc will be `LLVM_TARGET_TRIPLE` (same as the default target triple used by `clang`). This can be overridden by setting the `--native-cpu-libclc-targets` option in `configure.py`.
 
-### oneAPI Construction Kit
-
-SYCL Native CPU uses the [oneAPI Construction Kit](https://github.com/codeplaysoftware/oneapi-construction-kit) (OCK) in order to support some core SYCL functionalities and improve performances, the OCK is fetched by default when SYCL Native CPU is enabled, and can optionally be disabled using the `NATIVECPU_USE_OCK` CMake variable (please note that disabling the OCK will result in limited functionalities and performances on the SYCL Native CPU backend):
-
-```
-python3 buildbot/configure.py --native_cpu -DNATIVECPU_USE_OCK=Off
-```
-
-By default the oneAPI Construction Kit is pulled at the project's configure time using CMake `FetchContent`. This behaviour can be overridden by setting `NATIVECPU_OCK_USE_FETCHCONTENT=Off` and `OCK_SOURCE_DIR=<path>`
-in order to use a local checkout of the oneAPI Construction Kit. The CMake variables `OCK_GIT_TAG` and `OCK_GIT_REPO` can be used to override the default git tag and repository used by `FetchContent`.
-
-The SYCL Native CPU device needs to be selected at runtime by setting the environment variable `ONEAPI_DEVICE_SELECTOR=native_cpu:cpu`.
 
 ### oneTBB integration
 
@@ -96,6 +84,7 @@ cmake \
 ```
 
 Note that a number of `e2e` tests are currently still failing.
+The SYCL Native CPU device needs to be selected at runtime by setting the environment variable `ONEAPI_DEVICE_SELECTOR=native_cpu:cpu`.
 
 # Vectorization
 
@@ -128,7 +117,10 @@ llvm-cov show .\vector-add.exe -instr-profile=foo.profdata
 
 ### Please note that Windows is partially supported but temporarily disabled due to some implementation details, it will be re-enabled soon.
 
-# Technical details
+
+# Native CPU compiler pipeline
+
+SYCL Native CPU formerly used the [oneAPI Construction Kit](https://github.com/codeplaysoftware/oneapi-construction-kit) (OCK) to support some core SYCL functionality and improve performance in the compiler pipeline. The relevant parts have been brought into DPC++, and the Native CPU compiler pipeline is documented [here](SYCLNativeCPUPipeline.md), with a brief overview below. The OCK-related parts are still enabled via the `NATIVECPU_USE_OCK` CMake variable, which is on by default.
 
 The following section gives a brief overview of how a simple SYCL application is compiled for the SYCL Native CPU target. Consider the following SYCL sample, which performs vector addition using USM:
 
sycl/doc/design/SYCLNativeCPUPipeline.md

Lines changed: 277 additions & 0 deletions

@@ -0,0 +1,277 @@

Native CPU Compiler Pipeline Overview
=====================================

# Introduction

This document serves to introduce users to the Native CPU compiler pipeline. The
compiler pipeline performs several key transformations across several phases,
which can be difficult for new users to follow. The pipeline is constructed and
run in `llvm::sycl::utils::addSYCLNativeCPUBackendPasses`. All of the compiler
pipeline code can be found under
[llvm/lib/SYCLNativeCPUUtils](https://github.com/intel/llvm/tree/sycl/llvm/lib/SYCLNativeCPUUtils),
with the code which originated from the [oneAPI Construction
Kit](https://github.com/uxlfoundation/oneapi-construction-kit/tree/main) under
`compiler_passes` in that directory.

## Objective and Execution Model

The compiler pipeline's objective is to compile incoming LLVM IR
modules containing one or more kernel functions to object code ready for
execution when invoked by the host-side runtime. The assumptions placed
on the input and output kernels are as follows:

1. The original kernel is assumed to adhere to an implicit **SIMT**
   execution model; it runs once for each *work-item* in an
   **NDRange**.
2. It is passed a state struct which contains information about the scheduling.
3. All builtins which do not relate to scheduling have been processed, leaving
   only scheduling-related calls to `mux builtins`.
4. The final compiled kernel is assumed to be invoked from the
   host-side runtime once per *work-group* in the **NDRange**.

The following diagram provides an overview of the main phases of the
Native CPU compiler pipeline in terms of the underlying and assumed
kernel execution model.

The inner-most function is the original input kernel, which is *wrapped*
by new functions in successive phases until it is in a form ready to be
executed by the Native CPU driver.

```mermaid
flowchart TD;
Start(["Driver Entry Point"])
Start-->WiLoop["for (wi : wg)"]
WiLoop-->OrigKernel["original_kernel()"]
```

The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass)
is the key pass which makes some of the implicit parallelism
explicit. By introducing *work-item loops* around each kernel function,
it produces a new kernel entry point that runs once per work-group in an
**NDRange**.
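
To make the wrapping concrete, the following is a minimal C sketch with hypothetical names and signatures (the real pipeline operates on LLVM IR and passes scheduling state structs) of a SIMT kernel body placed inside a work-item loop that runs once per work-group:

```C
#include <stddef.h>

// Illustrative only: the original kernel body, written to run for a single
// work-item identified by its global id.
static void original_kernel_body(size_t global_id, int *a, const int *b) {
  a[global_id] += b[global_id];
}

// After work-item loops are introduced, the new entry point is invoked once
// per work-group and loops over every work-item in that group.
static void kernel_work_group_wrapper(size_t group_id, size_t local_size,
                                      int *a, const int *b) {
  for (size_t local_id = 0; local_id < local_size; ++local_id) {
    size_t global_id = group_id * local_size + local_id;
    original_kernel_body(global_id, a, b);
  }
}
```
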
## Compiler Pipeline Overview

With the overall execution model established, we can start to dive
deeper into the key phases of the compilation pipeline.

```mermaid
flowchart TD;
InputIR(["Input IR"])
SpecConstants(["Handling SpecConstants"])
Metadata(["Adding Metadata/Attributes"])
Vecz(["Vectorization"])
WorkItemLoops(["Work Item Loops / Barriers"])
DefineBuiltins(["Define builtins"])
TidyUp(["Tidy up"])

InputIR-->SpecConstants
SpecConstants-->Metadata
Metadata-->Vecz
Vecz-->WorkItemLoops
WorkItemLoops-->DefineBuiltins
DefineBuiltins-->TidyUp
```

### Input IR

The program begins as an LLVM module. Kernels in the module are assumed
to obey a **SIMT** programming model, as described earlier in [Objective
& Execution Model](#objective-and-execution-model).

Simple fix-up passes take place at this stage: the IR is massaged to
conform to specifications or to fix known deficiencies in earlier
representations. The input IR at this point will contain special
builtins, called `mux builtins`, for NDRange- or sub-group-style
operations, e.g. `__mux_get_global_id`. Many of the later passes
refer to these `mux builtins`.
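
As a rough pseudo C illustration (in the same spirit as the pseudo C later in this document; the builtin signature is simplified here and is not the real interface), a kernel at this stage only contains calls to scheduling-related `mux builtins`, which are still just declarations:

```C
// Pseudo C sketch of the input-IR state, with a simplified builtin signature.
size_t __mux_get_global_id(unsigned dim); // declared here, defined later in the pipeline

void foo(int *a, const int *b) {
  size_t id = __mux_get_global_id(0); // scheduling builtin left for later passes
  a[id] += b[id];                     // every other builtin has already been lowered
}
```
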
### Adding Metadata/Attributes

Native CPU IR metadata and attributes are attached to kernels. This
information is used by subsequent passes to identify certain aspects of
kernels which are not otherwise attainable or representable in LLVM IR.

[TransferKernelMetadataPass and
EncodeKernelMetadataPass](SYCLNativeCPUPipelinePasses.md#transferkernelmetadatapass-and-encodekernelmetadatapass)
are responsible for adding this information.

### Whole Function Vectorization

The [vecz](SYCLNativeCPUVecz.md) whole-function vectorizer is optionally run.

Note that vecz may perform its own scalarization, depending on the
options passed to it, potentially undoing the work of any previous
optimization passes, although it is able to preserve or even widen
pre-existing vector operations in many cases.
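
As a simplified C sketch of the effect (hypothetical names; the real vectorizer works on LLVM IR and emits a function like the `__vecz_v4_foo` seen later, using vector instructions), whole-function vectorization turns a per-work-item kernel body into one that covers several consecutive work-items per call:

```C
#include <stddef.h>

// Illustrative only: the scalar kernel body handles one work-item.
static void foo_scalar(size_t id, int *a, const int *b) {
  a[id] += b[id];
}

// A whole-function-vectorized version with vectorization factor 4 handles four
// consecutive work-items per invocation; shown here as a plain loop rather
// than the vector instructions a real vectorizer would emit.
static void foo_vectorized_x4(size_t first_id, int *a, const int *b) {
  for (size_t lane = 0; lane < 4; ++lane)
    foo_scalar(first_id + lane, a, b);
}
```
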
### Work-item Scheduling & Barriers

The work-item loops are added to each kernel by the [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass).

The kernel execution model changes at this stage to replace some of the
implicit parallelism with explicit looping, as described earlier in
[Objective & Execution Model](#objective-and-execution-model).

[Barrier Scheduling](#barrier-scheduling) takes place at this stage, as
well as [Vectorization Scheduling](#vectorization-scheduling) if the
vectorizer was run.

### Barrier Scheduling

The fact that the
[WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) handles
both work-item loops and barriers can be confusing to newcomers. These two
concepts are in fact linked. Taking the kernel code below, this section will
show how the `WorkItemLoopsPass` lays out and schedules a kernel's work-item
loops in the face of barriers.

```C
kernel void foo(global int *a, global int *b) {
  // pre barrier code - foo.mux-barrier-region.0()
  size_t id = get_global_id(0);
  a[id] += 4;
  // barrier
  barrier(CLK_GLOBAL_MEM_FENCE);
  // post barrier code - foo.mux-barrier-region.1()
  b[id] += 4;
}
```

The kernel has one global barrier, and one statement on either side of
it. The `WorkItemLoopsPass` conceptually breaks down the kernel into
*barrier regions*: the code reachable along the control flow between
barriers in the kernel. The example above has two regions:
the first contains the call to `get_global_id` and the read/update/write
of global memory pointed to by `a`; the second contains the
read/update/write of global memory pointed to by `b`.

To correctly observe the barrier's semantics, all work-items in the
work-group need to execute the first barrier region before beginning the
second. Thus the `WorkItemLoopsPass` produces two sets of work-item
loops to schedule this kernel:

```mermaid
graph TD;
A(["@foo.mux-barrier-wrapper()"])
A-->B{{"for (wi : wg)"}}
B-->C[["@foo.mux-barrier-region.0()<br> a[id] += 4;"]]
C-->D["fence"];
D-->E{{"for (wi : wg)"}}
E-->F[["@foo.mux-barrier-region.1() <br> b[id] += 4;"]]
```
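
In pseudo C (illustrative only; the helper names are placeholders and the live-variable handling is simplified), the wrapper in the diagram has roughly this shape:

```C
// Pseudo C sketch: two work-item loops, one per barrier region, separated by
// the barrier's memory fence. Helper names are placeholders, not real APIs.
void foo_mux_barrier_wrapper(size_t wg_size, int *a, int *b) {
  for (size_t wi = 0; wi < wg_size; ++wi) {
    // @foo.mux-barrier-region.0(): pre-barrier code for work-item wi
    size_t id = global_id_of(wi);
    a[id] += 4;
  }

  global_memory_fence(); // the barrier's fence semantics

  for (size_t wi = 0; wi < wg_size; ++wi) {
    // @foo.mux-barrier-region.1(): post-barrier code for work-item wi
    size_t id = global_id_of(wi); // rematerialized, see Live Variables below
    b[id] += 4;
  }
}
```
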
#### Live Variables

Note also that `id` is a *live variable* whose lifetime traverses the
barrier. The `WorkItemLoopsPass` creates a structure of live variables
which is passed between the successive barrier regions, containing data
that needs to be live in future regions.

In this case, however, calls to certain builtins like `get_global_id`
are treated specially and are materialized anew in each barrier region
where they are used.
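
As a sketch of the idea (the struct layout and names here are hypothetical; the pass derives the real layout per kernel), a value that genuinely has to survive the barrier is stored by the first region and reloaded by the second:

```C
#include <stddef.h>

// Illustrative only: one live-variables slot per work-item for a value that
// is computed before the barrier and used after it.
struct foo_live_vars {
  int scaled;
};

static void foo_region_0(size_t wi, struct foo_live_vars *live, const int *a) {
  live[wi].scaled = a[wi] * 2; // store the value across the barrier
}

static void foo_region_1(size_t wi, const struct foo_live_vars *live, int *b) {
  b[wi] = live[wi].scaled + 1; // reload it after the barrier
}
```
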
### Vectorization Scheduling

The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) is
responsible for laying out kernels which have been vectorized by the
[vecz](SYCLNativeCPUVecz.md) whole-function vectorizer.

The vectorizer creates multiple versions of the original kernel.
Vectorized kernels on their own are generally unable to fulfill
work-group scheduling requirements, as they operate only on a number of
work-items equal to a multiple of the vectorization factor. As such, for
the general case, several kernels must be combined to cover all
work-items in the work-group; the `WorkItemLoopsPass` is responsible for
this.

The following diagram uses a vectorization width of 4.

For brevity, the diagram below only details the inner-most work-item
loops. Most kernels will in reality have 2 outer levels of loops over
the full *Y* and *Z* work-group dimensions.

```mermaid
flowchart TD;
Start("@foo.mux-barrier-wrapper()")
OrigKernel0[["@foo()"]]
OrigKernel1[["@__vecz_v4_foo()"]]
Link1("`unsigned i = 0;
unsigned wg_size = get\_local\_size(0);
unsigned peel = wg\_size % 4;`")
ScalarPH{{"\< scalar check \>"}}
VectorPH("for (unsigned e = wg\_size - peel; i \< e; i += 4)")
Link2("for (; i< wg_size; i++)")
Return("return")

Start-->Link1
Link1-->|"if (wg_size != peel)"|VectorPH
Link1-->|"if (wg\_size == peel)"|ScalarPH
ScalarPH-->|"if (peel)"|Link2
Link2-->OrigKernel0
OrigKernel0-->Return
OrigKernel1-->ScalarPH
ScalarPH-->|"if (!peel)"|Return
VectorPH-->OrigKernel1
```

In the above example, the vectorized kernel is called to execute as many
work-items as possible, up to the largest multiple of the vectorization
factor less than or equal to the work-group size.

In the case that there are work-items remaining (i.e., if the work-group
size is not a multiple of 4), the original scalar kernel is called on the
remaining work-items (at most 3). These remaining work-items are
typically called the 'peel' iterations.
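
In pseudo C (following the names in the diagram; the call signatures are simplified and not the real generated interfaces), the combined wrapper looks roughly like this:

```C
// Pseudo C sketch of the vector main loop plus the scalar 'peel' loop.
void foo_mux_barrier_wrapper(unsigned wg_size) {
  unsigned i = 0;
  unsigned peel = wg_size % 4;

  // Vector loop: covers the largest multiple of 4 not exceeding wg_size.
  for (unsigned e = wg_size - peel; i < e; i += 4)
    __vecz_v4_foo(i); // simplified call; handles work-items i..i+3

  // Peel loop: the original scalar kernel covers the remaining (< 4) work-items.
  for (; i < wg_size; i++)
    foo(i); // simplified call; handles a single work-item
}
```
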
### Defining mux Builtins

The bodies of mux builtin function declarations are now provided.

The [PrepareSYCLNativeCPU](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/PrepareSYCLNativeCPU.cpp) pass does most of the materialization of scheduling builtins, connecting these scheduling-style instructions to the scheduling structure that is passed in.

Any remaining materialization of builtins, such as `__mux_mem_barrier`, is handled by
[DefineMuxBuiltinsPass](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_pipeline/source/define_mux_builtins_pass.cpp).
The use of this pass should probably be phased out in favour of doing it all in one place.

Some builtins may rely on others to complete their function. These
dependencies are handled transitively.

Pseudo C code:

```C
struct MuxWorkItemInfo { size_t local_ids[3]; ... };
struct MuxWorkGroupInfo { size_t group_ids[3]; ... };

// The kernel is given a wrapper function which calls the builtin:
void foo.mux-sched-wrapper(MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  size_t id = __mux_get_global_id(0, wi, wg);
}

// The DefineMuxBuiltinsPass provides the definition
// of __mux_get_global_id:
size_t __mux_get_global_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return (__mux_get_group_id(i, wi, wg) * __mux_get_local_size(i, wi, wg)) +
         __mux_get_local_id(i, wi, wg) + __mux_get_global_offset(i, wi, wg);
}

// And thus the definition of __mux_get_group_id...
size_t __mux_get_group_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wg->group_ids[i];
}

// ... and __mux_get_local_id, etc.
size_t __mux_get_local_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wi->local_ids[i];
}
```

### Tidy up

There is some tidying up at the end, such as deleting unused functions or
replacing the scalar kernel with the vectorized one.
