Native CPU Compiler Pipeline Overview
=====================================

# Introduction

This document introduces users to the Native CPU compiler pipeline. The
pipeline performs several key transformations over several phases, which can
be difficult for new users to follow. The pipeline is constructed and run in
`llvm::sycl::utils::addSYCLNativeCPUBackendPasses`. All of the compiler
pipeline code can be found under
[llvm/lib/SYCLNativeCPUUtils](https://github.com/intel/llvm/tree/sycl/llvm/lib/SYCLNativeCPUUtils),
with the code which originated from the [oneAPI Construction
Kit](https://github.com/uxlfoundation/oneapi-construction-kit/tree/main) located under
`compiler_passes` in that directory.

## Objective and Execution Model

The compiler pipeline's objective is to compile incoming LLVM IR
modules containing one or more kernel functions to object code ready for
execution when invoked by the host-side runtime. The assumptions placed
on the input and output kernels are as follows:

1. The original kernel is assumed to adhere to an implicit **SIMT**
   execution model; it runs once for each *work-item* in an
   **NDRange**.
2. It is passed a state struct which contains information about the scheduling.
3. All builtins which do not relate to scheduling have been processed,
   leaving only scheduling-related calls to `mux builtins`.
4. The final compiled kernel is assumed to be invoked from the
   host-side runtime once per *work-group* in the **NDRange**.

The following diagram provides an overview of the main phases of the
Native CPU compiler pipeline in terms of the underlying and assumed
kernel execution model.

The inner-most function is the original input kernel, which is *wrapped*
by new functions in successive phases, until it is in a form ready to be
executed by the Native CPU driver.

```mermaid
flowchart TD;
  Start(["Driver Entry Point"])
  Start-->WiLoop["for (wi : wg)"]
  WiLoop-->OrigKernel["original_kernel()"]
```

The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass)
is the key pass which makes some of the implicit parallelism
explicit. By introducing *work-item loops* around each kernel function,
the new kernel entry point now runs on every work-group in an
**NDRange**.

## Compiler Pipeline Overview

With the overall execution model established, we can start to dive
deeper into the key phases of the compilation pipeline.

```mermaid
flowchart TD;
  InputIR(["Input IR"])
  SpecConstants(["Handling SpecConstants"])
  Metadata(["Adding Metadata/Attributes"])
  Vecz(["Vectorization"])
  WorkItemLoops(["Work Item Loops / Barriers"])
  DefineBuiltins(["Define builtins"])
  TidyUp(["Tidy up"])

  InputIR-->SpecConstants
  SpecConstants-->Metadata
  Metadata-->Vecz
  Vecz-->WorkItemLoops
  WorkItemLoops-->DefineBuiltins
  DefineBuiltins-->TidyUp
```

### Input IR

The program begins as an LLVM module. Kernels in the module are assumed
to obey a **SIMT** programming model, as described earlier in [Objective
& Execution Model](#objective-and-execution-model).

Simple fix-up passes take place at this stage: the IR is massaged to
conform to specifications or to fix known deficiencies in earlier
representations. The input IR at this point will contain special
builtins, called `mux builtins`, for ndrange- or subgroup-style
operations, e.g. `__mux_get_global_id`. Many of the later passes
will refer to these `mux builtins`.

### Adding Metadata/Attributes

Native CPU IR metadata and attributes are attached to kernels. This
information is used by following passes to identify certain aspects of
kernels which are not otherwise attainable or representable in LLVM IR.

[TransferKernelMetadataPass and
EncodeKernelMetadataPass](SYCLNativeCPUPipelinePasses.md#transferkernelmetadatapass-and-encodekernelmetadatapass)
are responsible for adding this information.

### Whole Function Vectorization

The [vecz](SYCLNativeCPUVecz.md) whole-function vectorizer is optionally run.

Note that VECZ may perform its own scalarization, depending on the
options passed to it, potentially undoing the work of any previous
optimization passes, although it is able to preserve or even widen
pre-existing vector operations in many cases.

### Work-item Scheduling & Barriers

The work-item loops are added to each kernel by the [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass).

The kernel execution model changes at this stage to replace some of the
implicit parallelism with explicit looping, as described earlier in
[Objective & Execution Model](#objective-and-execution-model).

[Barrier Scheduling](#barrier-scheduling) takes place at this stage, as
well as [Vectorization Scheduling](#vectorization-scheduling) if the
vectorizer was run.

### Barrier Scheduling

The fact that the
[WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) handles
both work-item loops and barriers can be confusing to newcomers. These two
concepts are in fact linked. Taking the kernel code below, this section will
show how the `WorkItemLoopsPass` lays out and schedules a kernel's work-item
loops in the face of barriers.

```C
kernel void foo(global int *a, global int *b) {
  // pre-barrier code - foo.mux-barrier-region.0()
  size_t id = get_global_id(0);
  a[id] += 4;
  // barrier
  barrier(CLK_GLOBAL_MEM_FENCE);
  // post-barrier code - foo.mux-barrier-region.1()
  b[id] += 4;
}
```

The kernel has one global barrier, and one statement on either side of
it. The `WorkItemLoopsPass` conceptually breaks down the kernel into
*barrier regions*, which constitute the code following the control flow
between all barriers in the kernel. The example above has two regions:
the first contains the call to `get_global_id` and the read/update/write
of global memory pointed to by `a`; the second contains the
read/update/write of global memory pointed to by `b`.

To correctly observe the barrier's semantics, all work-items in the
work-group need to execute the first barrier region before beginning the
second. Thus the `WorkItemLoopsPass` produces two sets of work-item
loops to schedule this kernel:

```mermaid
graph TD;
  A(["@foo.mux-barrier-wrapper()"])
  A-->B{{"for (wi : wg)"}}
  B-->C[["@foo.mux-barrier-region.0()<br> a[id] += 4;"]]
  C-->D["fence"];
  D-->E{{"for (wi : wg)"}}
  E-->F[["@foo.mux-barrier-region.1() <br> b[id] += 4;"]]
```

#### Live Variables

Note also that `id` is a *live variable* whose lifetime traverses the
barrier. The `WorkItemLoopsPass` creates a structure of live variables
which is passed between the successive barrier regions, containing data
that needs to be live in future regions.

In this case, however, calls to certain builtins like `get_global_id`
are treated specially and are materialized anew in each barrier region
where they are used.

### Vectorization Scheduling

The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) is
responsible for laying out kernels which have been vectorized by the
[vecz](SYCLNativeCPUVecz.md) whole-function vectorizer.

The vectorizer creates multiple versions of the original kernel.
Vectorized kernels on their own are generally unable to fulfill
work-group scheduling requirements, as they operate only on a number of
work-items equal to a multiple of the vectorization factor. As such, for
the general case, several kernels must be combined to cover all
work-items in the work-group; the `WorkItemLoopsPass` is responsible for
this.

The following diagram uses a vectorization width of 4.

For brevity, the diagram below only details the inner-most work-item
loops. Most kernels will in reality have 2 outer levels of loops over
the full *Y* and *Z* work-group dimensions.

```mermaid
flowchart TD;
  Start("@foo.mux-barrier-wrapper()")
  OrigKernel0[["@foo()"]]
  OrigKernel1[["@__vecz_v4_foo()"]]
  Link1("`unsigned i = 0;
  unsigned wg_size = get\_local\_size(0);
  unsigned peel = wg\_size % 4;`")
  ScalarPH{{"\< scalar check \>"}}
  VectorPH("for (unsigned e = wg\_size - peel; i \< e; i += 4)")
  Link2("for (; i \< wg_size; i++)")
  Return("return")

  Start-->Link1
  Link1-->|"if (wg_size != peel)"|VectorPH
  Link1-->|"if (wg\_size == peel)"|ScalarPH
  ScalarPH-->|"if (peel)"|Link2
  Link2-->OrigKernel0
  OrigKernel0-->Return
  OrigKernel1-->ScalarPH
  ScalarPH-->|"if (!peel)"|Return
  VectorPH-->OrigKernel1
```

| 221 | + |
| 222 | +In the above example, the vectorized kernel is called to execute as many |
| 223 | +work-items as possible, up to the largest multiple of the vectorization |
| 224 | +less than or equal to the work-group size. |
| 225 | + |
| 226 | +In the case that there are work-items remaining (i.e., if the work-group |
| 227 | +size is not a multiple of 4) then the original scalar kernel is called |
| 228 | +on the up to 3 remaining work-items. These remaining work-items are |
| 229 | +typically called the \'peel\' iterations. |
| 230 | + |
### Defining mux Builtins

The bodies of mux builtin function declarations are now provided.

The [PrepareSYCLNativeCPU](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/PrepareSYCLNativeCPU.cpp) pass does most of the materialization of scheduling builtins, connecting these scheduling-style instructions up to the scheduling structure that is passed in.

Any remaining materialization of builtins is handled by
[DefineMuxBuiltinsPass](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_pipeline/source/define_mux_builtins_pass.cpp),
such as ``__mux_mem_barrier``. The use of this pass should probably be phased
out in preference to doing it all in one place.

Some builtins may rely on others to complete their function. These
dependencies are handled transitively.

Pseudo C code:

```C
struct MuxWorkItemInfo { size_t local_ids[3]; ... };
struct MuxWorkGroupInfo { size_t group_ids[3]; ... };

// And this wrapper function
void foo.mux-sched-wrapper(MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  size_t id = __mux_get_global_id(0, wi, wg);
}

// The DefineMuxBuiltinsPass provides the definition
// of __mux_get_global_id:
size_t __mux_get_global_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return (__mux_get_group_id(i, wi, wg) * __mux_get_local_size(i, wi, wg)) +
         __mux_get_local_id(i, wi, wg) + __mux_get_global_offset(i, wi, wg);
}

// And thus the definition of __mux_get_group_id...
size_t __mux_get_group_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wg->group_ids[i];
}

// and __mux_get_local_id, etc.
size_t __mux_get_local_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wi->local_ids[i];
}
```

### Tidy up

There is some tidying up at the end, such as deleting unused functions or
replacing the scalar kernel with the vectorized one.