Commit 0e74590
[SYCL][NATIVE_CPU] Update docs for Native CPU compiler pipeline
This integrates the appropriate compiler documentation originally in the oneAPI Construction Kit (OCK) into the Native CPU compiler pipeline documentation. It has been updated to try to reflect the Native CPU pipeline and to remove some of the references to OCK's structures, as well as moving some of the documentation to markdown files to be consistent with some of the other documentation. Some of it may be irrelevant for Native CPU, and if so this should be updated over time.
1 parent b671165 commit 0e74590

File tree

4 files changed: +2786 -0 lines changed

.github/workflows/sycl-docs.yml

Lines changed: 2 additions & 0 deletions
@@ -35,6 +35,8 @@ jobs:
       run: |
         sudo apt-get install -y graphviz ssh ninja-build libhwloc-dev
         sudo pip3 install -r repo/llvm/docs/requirements.txt
+        # TODO: If works move to requirements.txt
+        sudo pip3 install sphinxcontrib-mermaid
     - name: Build Docs
       run: |
         mkdir -p $GITHUB_WORKSPACE/build
Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@
Native CPU Compiler Pipeline Overview
=====================================

# Introduction

This document serves to introduce users to the Native CPU compiler pipeline. The
compiler pipeline performs several key transformations over several phases that
can be difficult to understand for new users. The pipeline is constructed and
run in `llvm::sycl::utils::addSYCLNativeCPUBackendPasses`. All of the compiler
pipeline code can be found under
[llvm/lib/SYCLNativeCPUUtils](https://github.com/intel/llvm/tree/sycl/llvm/lib/SYCLNativeCPUUtils),
with the code which originated from the [oneAPI Construction
Kit](https://github.com/uxlfoundation/oneapi-construction-kit/tree/main) under
`compiler_passes` in that directory.

## Objective and Execution Model

The compiler pipeline's objective is to compile incoming LLVM IR
modules containing one or more kernel functions to object code ready for
execution when invoked by the host-side runtime. The assumptions placed
on the input and output kernels are as follows (a sketch illustrating
them appears after the list):

1. The original kernel is assumed to adhere to an implicit **SIMT**
   execution model; it runs once per *work-item* in an **NDRange**.
2. It is passed a state struct which contains information about the scheduling.
3. All builtins which do not relate to scheduling have been processed and we are
   left with some scheduling-related calls to `mux builtins`.
4. The final compiled kernel is assumed to be invoked from the
   host-side runtime once per *work-group* in the **NDRange**.
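
The following pseudo-C sketch is only illustrative of these assumptions: the
names `example_kernel` and `native_cpu_state`, and the single-argument form of
`__mux_get_global_id`, are invented for the sketch and do not correspond to
the exact incoming IR.

```C
#include <stddef.h>

// Hypothetical scheduling state struct (assumption 2); illustrative only.
struct native_cpu_state;

// Scheduling builtin (assumption 3), declared but not yet defined at this stage.
size_t __mux_get_global_id(unsigned dim);

// One invocation of the kernel handles exactly one work-item (assumption 1).
void example_kernel(int *a, struct native_cpu_state *state) {
  size_t id = __mux_get_global_id(0); // this work-item's position in the NDRange
  a[id] += 1;                         // per-work-item work
}
```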

The following diagram provides an overview of the main phases of the
Native CPU compiler pipeline in terms of the underlying and assumed
kernel execution model.

The inner-most function is the original input kernel, which is *wrapped*
by new functions in successive phases, until it is ready in a form to be
executed by the Native CPU driver.

```mermaid
flowchart TD;
    Start(["Driver Entry Point"])
    Start-->WiLoop["for (wi : wg)"]
    WiLoop-->OrigKernel["original_kernel()"]
```
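
In pseudo C, the same structure looks roughly like the sketch below. The names
`kernel_wrapper` and `wg_size` are invented for illustration; only
`original_kernel` appears in the diagram above. The driver invokes the wrapper
once per work-group, and the wrapper's work-item loop invokes the original
kernel body once per work-item.

```C
#include <stddef.h>

// Original kernel body: still written from a single work-item's point of view.
void original_kernel(size_t work_item);

// Wrapper produced by the pipeline: invoked by the driver once per work-group.
void kernel_wrapper(size_t wg_size) {
  for (size_t wi = 0; wi < wg_size; ++wi) // work-item loop added by the pipeline
    original_kernel(wi);
}
```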

The [WorkItemLoopsPass](native_cpu_pipeline_passes.md#workitemloopspass)
is the key pass which makes some of the implicit parallelism
explicit. By introducing *work-item loops* around each kernel function,
the new kernel entry point now runs on every work-group in an
**NDRange**.

## Compiler Pipeline Overview

With the overall execution model established, we can start to dive
deeper into the key phases of the compilation pipeline.

```mermaid
flowchart TD;
    InputIR(["Input IR"])
    SpecConstants(["Handling SpecConstants"])
    Metadata(["Adding Metadata/Attributes"])
    Vecz(["Vectorization"])
    WorkItemLoops(["Work Item Loops / Barriers"])
    DefineBuiltins(["Define builtins"])
    TidyUp(["Tidy up"])

    InputIR-->SpecConstants
    SpecConstants-->Metadata
    Metadata-->Vecz
    Vecz-->WorkItemLoops
    WorkItemLoops-->DefineBuiltins
    DefineBuiltins-->TidyUp
```

### Input IR

The program begins as an LLVM module. Kernels in the module are assumed
to obey a **SIMT** programming model, as described earlier in [Objective
& Execution Model](#objective-and-execution-model).

Simple fix-up passes take place at this stage: the IR is massaged to
conform to specifications or to fix known deficiencies in earlier
representations. The input IR at this point will contain special
builtins, called `mux builtins`, for NDRange- or subgroup-style
operations, e.g. `__mux_get_global_id`. Many of the later passes
will refer to these `mux builtins`.

### Adding Metadata/Attributes

Native CPU IR metadata and attributes are attached to kernels. This
information is used by the following passes to identify certain aspects of
kernels which are not otherwise attainable or representable in LLVM IR.

[TransferKernelMetadataPass and
EncodeKernelMetadataPass](native_cpu_pipeline_passes.md#transferkernelmetadatapass-and-encodekernelmetadatapass)
are responsible for adding this information.

### Whole Function Vectorization

The [vecz](native_cpu_vecz.md) whole-function vectorizer is optionally run.

Note that VECZ may perform its own scalarization, depending on the
options passed to it, potentially undoing the work of any previous
optimization passes, although it is able to preserve or even widen
pre-existing vector operations in many cases.
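
As a rough pseudo-C illustration of whole-function vectorization (the function
bodies below are invented, not actual vecz output), a vectorization width of 4
turns a kernel that handles one work-item per invocation into one that handles
four consecutive work-items per invocation:

```C
#include <stddef.h>

size_t __mux_get_global_id(unsigned dim); // scheduling builtin (simplified signature)

// Scalar kernel: one invocation processes a single work-item.
void foo(int *a) {
  size_t id = __mux_get_global_id(0);
  a[id] += 4;
}

// Conceptual vectorized form (width 4): one invocation processes work-items
// id, id+1, id+2 and id+3; in the real output the loop below is a vector add.
void __vecz_v4_foo(int *a) {
  size_t id = __mux_get_global_id(0); // id of the first of the 4 work-items
  for (int lane = 0; lane < 4; ++lane)
    a[id + lane] += 4;
}
```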

### Work-item Scheduling & Barriers

The work-item loops are added to each kernel by the
[WorkItemLoopsPass](native_cpu_pipeline_passes.md#workitemloopspass).

The kernel execution model changes at this stage to replace some of the
implicit parallelism with explicit looping, as described earlier in
[Objective & Execution Model](#objective-and-execution-model).

[Barrier Scheduling](#barrier-scheduling) takes place at this stage, as
well as [Vectorization Scheduling](#vectorization-scheduling) if the
vectorizer was run.

### Barrier Scheduling

The fact that the
[WorkItemLoopsPass](native_cpu_pipeline_passes.md#workitemloopspass) handles
both work-item loops and barriers can be confusing to newcomers. These two
concepts are in fact linked. Taking the kernel code below, this section will
show how the `WorkItemLoopsPass` lays out and schedules a kernel's work-item
loops in the face of barriers.

```C
kernel void foo(global int *a, global int *b) {
  // pre barrier code - foo.mux-barrier-region.0()
  size_t id = get_global_id(0);
  a[id] += 4;
  // barrier
  barrier(CLK_GLOBAL_MEM_FENCE);
  // post barrier code - foo.mux-barrier-region.1()
  b[id] += 4;
}
```

The kernel has one global barrier, and one statement on either side of
it. The `WorkItemLoopsPass` conceptually breaks down the kernel into
*barrier regions*, which constitute the code following the control-flow
between all barriers in the kernel. The example above has two regions:
the first contains the call to `get_global_id` and the read/update/write
of global memory pointed to by `a`; the second contains the
read/update/write of global memory pointed to by `b`.

To correctly observe the barrier's semantics, all work-items in the
work-group need to execute the first barrier region before beginning the
second. Thus the `WorkItemLoopsPass` produces two sets of work-item
loops to schedule this kernel:

```mermaid
graph TD;
    A(["@foo.mux-barrier-wrapper()"])
    A-->B{{"for (wi : wg)"}}
    B-->C[["@foo.mux-barrier-region.0()<br> a[id] += 4;"]]
    C-->D["fence"];
    D-->E{{"for (wi : wg)"}}
    E-->F[["@foo.mux-barrier-region.1() <br> b[id] += 4;"]]
```

#### Live Variables

Note also that `id` is a *live variable* whose lifetime traverses the
barrier. The `WorkItemLoopsPass` creates a structure of live variables
which are passed between the successive barrier regions, containing data
that needs to be live in future regions.

In this case, however, calls to certain builtins like `get_global_id`
are treated specially and are materialized anew in each barrier region
where they are used.
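
The following pseudo-C sketch illustrates the mechanism, assuming a
hypothetical kernel in which a value `x` computed before the barrier is used
after it. The struct and function names (`foo_live_vars`, `foo_region_0`, and
so on) are invented stand-ins for the `foo.mux-barrier-region.*` functions
shown above; the real pass works on LLVM IR, and builtins such as
`get_global_id` would simply be re-materialized rather than stored.

```C
#include <stddef.h>

// One slot of live data per work-item; only values that genuinely cross the
// barrier end up in here.
struct foo_live_vars {
  int x;
};

// First barrier region: produces the value that must survive the barrier.
static void foo_region_0(struct foo_live_vars *live, int *a, size_t id) {
  live->x = a[id] * 2;
}

// Second barrier region: consumes the value after the barrier.
static void foo_region_1(struct foo_live_vars *live, int *b, size_t id) {
  b[id] = live->x + 1;
}

// Wrapper: every work-item runs region 0 before any work-item runs region 1.
void foo_barrier_wrapper(int *a, int *b, size_t wg_size) {
  struct foo_live_vars live[wg_size]; // one live-variable struct per work-item
  for (size_t wi = 0; wi < wg_size; ++wi)
    foo_region_0(&live[wi], a, wi);
  // the memory fence corresponding to the original barrier goes here
  for (size_t wi = 0; wi < wg_size; ++wi)
    foo_region_1(&live[wi], b, wi);
}
```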

### Vectorization Scheduling

The [WorkItemLoopsPass](native_cpu_pipeline_passes.md#workitemloopspass) is
responsible for laying out kernels which have been vectorized by the
[vecz](native_cpu_vecz.md) whole-function vectorizer.

The vectorizer creates multiple versions of the original kernel.
Vectorized kernels on their own are generally unable to fulfill
work-group scheduling requirements, as they operate only on a number of
work-items equal to a multiple of the vectorization factor. As such, for
the general case, several kernels must be combined to cover all
work-items in the work-group; the `WorkItemLoopsPass` is responsible for
this.

The following diagram uses a vectorization width of 4.

For brevity, the diagram below only details the inner-most work-item
loops. Most kernels will in reality have two outer levels of loops over
the full *Y* and *Z* work-group dimensions.

```mermaid
flowchart TD;
    Start("@foo.mux-barrier-wrapper()")
    OrigKernel0[["@foo()"]]
    OrigKernel1[["@__vecz_v4_foo()"]]
    Link1("`unsigned i = 0;
    unsigned wg_size = get_local_size(0);
    unsigned peel = wg_size % 4;`")
    ScalarPH{{"< scalar check >"}}
    VectorPH("for (unsigned e = wg_size - peel; i < e; i += 4)")
    Link2("for (; i < wg_size; i++)")
    Return("return")

    Start-->Link1
    Link1-->|"if (wg_size != peel)"|VectorPH
    Link1-->|"if (wg_size == peel)"|ScalarPH
    ScalarPH-->|"if (peel)"|Link2
    Link2-->OrigKernel0
    OrigKernel0-->Return
    OrigKernel1-->ScalarPH
    ScalarPH-->|"if (!peel)"|Return
    VectorPH-->OrigKernel1
```

In the above example, the vectorized kernel is called to execute as many
work-items as possible, up to the largest multiple of the vectorization
factor less than or equal to the work-group size.

If there are work-items remaining (i.e., if the work-group size is not a
multiple of 4), the original scalar kernel is called on the up to 3
remaining work-items. These remaining work-items are typically called
the 'peel' iterations.
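
In pseudo C, the control flow above corresponds roughly to the following
sketch. The wrapper name and signatures are simplified for illustration (the
real wrapper is `foo.mux-barrier-wrapper`), and only the inner-most X loop is
shown.

```C
#include <stddef.h>

void foo(size_t first_work_item);            // original scalar kernel
void __vecz_v4_foo(size_t first_work_item);  // vectorized-by-4 kernel

void foo_wrapper(size_t wg_size) {
  size_t peel = wg_size % 4; // 0-3 work-items the vector loop cannot cover
  size_t i = 0;
  // Main vector loop: each call covers 4 consecutive work-items.
  for (size_t e = wg_size - peel; i < e; i += 4)
    __vecz_v4_foo(i);
  // Peel loop: the scalar kernel handles the remaining work-items, if any.
  for (; i < wg_size; ++i)
    foo(i);
}
```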

### Defining mux Builtins

The bodies of mux builtin function declarations are now provided.

The
[PrepareSYCLNativeCPU](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/PrepareSYCLNativeCPU.cpp)
pass does most of the materialization of scheduling builtins, connecting these
scheduling-style calls to the scheduling structure that is passed in.

Any remaining materialization of builtins is handled by the
[DefineMuxBuiltinsPass](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_pipeline/source/define_mux_builtins_pass.cpp),
such as `__mux_mem_barrier`. The use of this pass should probably be phased
out in preference to doing it all in one place.

Some builtins may rely on others to complete their function. These
dependencies are handled transitively.

Pseudo C code:

```C
struct MuxWorkItemInfo { size_t local_ids[3]; ... };
struct MuxWorkGroupInfo { size_t group_ids[3]; ... };

// And this wrapper function
void foo.mux-sched-wrapper(MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  size_t id = __mux_get_global_id(0, wi, wg);
}

// The DefineMuxBuiltinsPass provides the definition
// of __mux_get_global_id:
size_t __mux_get_global_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return (__mux_get_group_id(i, wi, wg) * __mux_get_local_size(i, wi, wg)) +
         __mux_get_local_id(i, wi, wg) + __mux_get_global_offset(i, wi, wg);
}

// And thus the definition of __mux_get_group_id...
size_t __mux_get_group_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wg->group_ids[i];
}

// and __mux_get_local_id, etc
size_t __mux_get_local_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wi->local_ids[i];
}
```

### Tidy up

There is some tidying up at the end, such as deleting unused functions or
replacing the scalar kernel with the vectorized one.
