Commit 0e74590
[SYCL][NATIVE_CPU] Update docs for Native CPU compiler pipeline
This integrates the appropriate compiler documentation originally in the oneAPI Construction Kit (OCK) into the Native CPU compiler pipeline documentation. It has been updated to try to reflect the Native CPU pipeline and to remove some of the references to OCK's structures, as well as moving some of the documentation to markdown files to be consistent with some of the other documentation. Some of it may be irrelevant for Native CPU, and if so this should be updated over time.
1 parent b671165 commit 0e74590

File tree

4 files changed: +2786 -0 lines changed

.github/workflows/sycl-docs.yml

Lines changed: 2 additions & 0 deletions
@@ -35,6 +35,8 @@ jobs:
       run: |
         sudo apt-get install -y graphviz ssh ninja-build libhwloc-dev
         sudo pip3 install -r repo/llvm/docs/requirements.txt
+        # TODO: If works move to requirements.txt
+        sudo pip3 install sphinxcontrib-mermaid
     - name: Build Docs
       run: |
         mkdir -p $GITHUB_WORKSPACE/build
Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@
Native CPU Compiler Pipeline Overview
=====================================

# Introduction

This document serves to introduce users to the Native CPU compiler pipeline. The
compiler pipeline performs several key transformations over several phases that
can be difficult to understand for new users. The pipeline is constructed and
run in `llvm::sycl::utils::addSYCLNativeCPUBackendPasses`. All of the compiler
pipeline code can be found under
[llvm/lib/SYCLNativeCPUUtils](https://github.com/intel/llvm/tree/sycl/llvm/lib/SYCLNativeCPUUtils),
with the code which originated from the [oneAPI Construction
Kit](https://github.com/uxlfoundation/oneapi-construction-kit/tree/main) under
`compiler_passes` in that directory.

## Objective and Execution Model

The compiler pipeline's objective is to compile incoming LLVM IR
modules containing one or more kernel functions to object code ready for
execution when invoked by the host-side runtime. The assumptions placed
on the input and output kernels are as follows (a sketch illustrating
them appears after the list):

1. The original kernel is assumed to adhere to an implicit **SIMT**
   execution model; it runs once per *work-item* in an **NDRange**.
2. It is passed a state struct which contains information about the scheduling.
3. All builtins which do not relate to scheduling have been processed and we are
   left with some scheduling-related calls to `mux builtins`.
4. The final compiled kernel is assumed to be invoked from the
   host-side runtime once per *work-group* in the **NDRange**.
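
The following pseudo-C sketch is only illustrative of these assumptions: the
names `example_kernel` and `native_cpu_state`, and the single-argument form of
`__mux_get_global_id`, are invented for the sketch and do not correspond to
the exact incoming IR.

```C
#include <stddef.h>

// Hypothetical scheduling state struct (assumption 2); illustrative only.
struct native_cpu_state;

// Scheduling builtin (assumption 3), declared but not yet defined at this stage.
size_t __mux_get_global_id(unsigned dim);

// One invocation of the kernel handles exactly one work-item (assumption 1).
void example_kernel(int *a, struct native_cpu_state *state) {
  size_t id = __mux_get_global_id(0); // this work-item's position in the NDRange
  a[id] += 1;                         // per-work-item work
}
```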

The following diagram provides an overview of the main phases of the
Native CPU compiler pipeline in terms of the underlying and assumed
kernel execution model.

The inner-most function is the original input kernel, which is *wrapped*
by new functions in successive phases, until it is ready in a form to be
executed by the Native CPU driver.

```mermaid
flowchart TD;
    Start(["Driver Entry Point"])
    Start-->WiLoop["for (wi : wg)"]
    WiLoop-->OrigKernel["original_kernel()"]
```
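
In pseudo C, the same structure looks roughly like the sketch below. The names
`kernel_wrapper` and `wg_size` are invented for illustration; only
`original_kernel` appears in the diagram above. The driver invokes the wrapper
once per work-group, and the wrapper's work-item loop invokes the original
kernel body once per work-item.

```C
#include <stddef.h>

// Original kernel body: still written from a single work-item's point of view.
void original_kernel(size_t work_item);

// Wrapper produced by the pipeline: invoked by the driver once per work-group.
void kernel_wrapper(size_t wg_size) {
  for (size_t wi = 0; wi < wg_size; ++wi) // work-item loop added by the pipeline
    original_kernel(wi);
}
```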

The [WorkItemLoopsPass](native_cpu_pipeline_passes.md#workitemloopspass)
is the key pass which makes some of the implicit parallelism
explicit. By introducing *work-item loops* around each kernel function,
the new kernel entry point now runs on every work-group in an
**NDRange**.

## Compiler Pipeline Overview

With the overall execution model established, we can start to dive
deeper into the key phases of the compilation pipeline.

```mermaid
flowchart TD;
    InputIR(["Input IR"])
    SpecConstants(["Handling SpecConstants"])
    Metadata(["Adding Metadata/Attributes"])
    Vecz(["Vectorization"])
    WorkItemLoops(["Work Item Loops / Barriers"])
    DefineBuiltins(["Define builtins"])
    TidyUp(["Tidy up"])

    InputIR-->SpecConstants
    SpecConstants-->Metadata
    Metadata-->Vecz
    Vecz-->WorkItemLoops
    WorkItemLoops-->DefineBuiltins
    DefineBuiltins-->TidyUp
```

### Input IR

The program begins as an LLVM module. Kernels in the module are assumed
to obey a **SIMT** programming model, as described earlier in [Objective
& Execution Model](#objective-and-execution-model).

Simple fix-up passes take place at this stage: the IR is massaged to
conform to specifications or to fix known deficiencies in earlier
representations. The input IR at this point will contain special
builtins, called `mux builtins`, for NDRange- or subgroup-style
operations, e.g. `__mux_get_global_id`. Many of the later passes
will refer to these `mux builtins`.

### Adding Metadata/Attributes

Native CPU IR metadata and attributes are attached to kernels. This
information is used by the following passes to identify certain aspects of
kernels which are not otherwise attainable or representable in LLVM IR.

[TransferKernelMetadataPass and
EncodeKernelMetadataPass](native_cpu_pipeline_passes.md#transferkernelmetadatapass-and-encodekernelmetadatapass)
are responsible for adding this information.

### Whole Function Vectorization

The [vecz](native_cpu_vecz.md) whole-function vectorizer is optionally run.

Note that VECZ may perform its own scalarization, depending on the
options passed to it, potentially undoing the work of any previous
optimization passes, although it is able to preserve or even widen
pre-existing vector operations in many cases.
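
As a rough pseudo-C illustration of whole-function vectorization (the function
bodies below are invented, not actual vecz output), a vectorization width of 4
turns a kernel that handles one work-item per invocation into one that handles
four consecutive work-items per invocation:

```C
#include <stddef.h>

size_t __mux_get_global_id(unsigned dim); // scheduling builtin (simplified signature)

// Scalar kernel: one invocation processes a single work-item.
void foo(int *a) {
  size_t id = __mux_get_global_id(0);
  a[id] += 4;
}

// Conceptual vectorized form (width 4): one invocation processes work-items
// id, id+1, id+2 and id+3; in the real output the loop below is a vector add.
void __vecz_v4_foo(int *a) {
  size_t id = __mux_get_global_id(0); // id of the first of the 4 work-items
  for (int lane = 0; lane < 4; ++lane)
    a[id + lane] += 4;
}
```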

### Work-item Scheduling & Barriers

The work-item loops are added to each kernel by the
[WorkItemLoopsPass](native_cpu_pipeline_passes.md#workitemloopspass).

The kernel execution model changes at this stage to replace some of the
implicit parallelism with explicit looping, as described earlier in
[Objective & Execution Model](#objective-and-execution-model).

[Barrier Scheduling](#barrier-scheduling) takes place at this stage, as
well as [Vectorization Scheduling](#vectorization-scheduling) if the
vectorizer was run.

### Barrier Scheduling

The fact that the
[WorkItemLoopsPass](native_cpu_pipeline_passes.md#workitemloopspass) handles
both work-item loops and barriers can be confusing to newcomers. These two
concepts are in fact linked. Taking the kernel code below, this section will
show how the `WorkItemLoopsPass` lays out and schedules a kernel's work-item
loops in the face of barriers.

```C
kernel void foo(global int *a, global int *b) {
  // pre barrier code - foo.mux-barrier-region.0()
  size_t id = get_global_id(0);
  a[id] += 4;
  // barrier
  barrier(CLK_GLOBAL_MEM_FENCE);
  // post barrier code - foo.mux-barrier-region.1()
  b[id] += 4;
}
```

The kernel has one global barrier, and one statement on either side of
it. The `WorkItemLoopsPass` conceptually breaks down the kernel into
*barrier regions*, which constitute the code following the control-flow
between all barriers in the kernel. The example above has two regions:
the first contains the call to `get_global_id` and the read/update/write
of global memory pointed to by `a`; the second contains the
read/update/write of global memory pointed to by `b`.

To correctly observe the barrier's semantics, all work-items in the
work-group need to execute the first barrier region before beginning the
second. Thus the `WorkItemLoopsPass` produces two sets of work-item
loops to schedule this kernel:

```mermaid
graph TD;
    A(["@foo.mux-barrier-wrapper()"])
    A-->B{{"for (wi : wg)"}}
    B-->C[["@foo.mux-barrier-region.0()<br> a[id] += 4;"]]
    C-->D["fence"];
    D-->E{{"for (wi : wg)"}}
    E-->F[["@foo.mux-barrier-region.1() <br> b[id] += 4;"]]
```

#### Live Variables

Note also that `id` is a *live variable* whose lifetime traverses the
barrier. The `WorkItemLoopsPass` creates a structure of live variables
which are passed between the successive barrier regions, containing data
that needs to be live in future regions.

In this case, however, calls to certain builtins like `get_global_id`
are treated specially and are materialized anew in each barrier region
where they are used.
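
The following pseudo-C sketch illustrates the mechanism, assuming a
hypothetical kernel in which a value `x` computed before the barrier is used
after it. The struct and function names (`foo_live_vars`, `foo_region_0`, and
so on) are invented stand-ins for the `foo.mux-barrier-region.*` functions
shown above; the real pass works on LLVM IR, and builtins such as
`get_global_id` would simply be re-materialized rather than stored.

```C
#include <stddef.h>

// One slot of live data per work-item; only values that genuinely cross the
// barrier end up in here.
struct foo_live_vars {
  int x;
};

// First barrier region: produces the value that must survive the barrier.
static void foo_region_0(struct foo_live_vars *live, int *a, size_t id) {
  live->x = a[id] * 2;
}

// Second barrier region: consumes the value after the barrier.
static void foo_region_1(struct foo_live_vars *live, int *b, size_t id) {
  b[id] = live->x + 1;
}

// Wrapper: every work-item runs region 0 before any work-item runs region 1.
void foo_barrier_wrapper(int *a, int *b, size_t wg_size) {
  struct foo_live_vars live[wg_size]; // one live-variable struct per work-item
  for (size_t wi = 0; wi < wg_size; ++wi)
    foo_region_0(&live[wi], a, wi);
  // the memory fence corresponding to the original barrier goes here
  for (size_t wi = 0; wi < wg_size; ++wi)
    foo_region_1(&live[wi], b, wi);
}
```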

### Vectorization Scheduling

The [WorkItemLoopsPass](native_cpu_pipeline_passes.md#workitemloopspass) is
responsible for laying out kernels which have been vectorized by the
[vecz](native_cpu_vecz.md) whole-function vectorizer.

The vectorizer creates multiple versions of the original kernel.
Vectorized kernels on their own are generally unable to fulfill
work-group scheduling requirements, as they operate only on a number of
work-items equal to a multiple of the vectorization factor. As such, for
the general case, several kernels must be combined to cover all
work-items in the work-group; the `WorkItemLoopsPass` is responsible for
this.

The following diagram uses a vectorization width of 4.

For brevity, the diagram below only details the inner-most work-item
loops. Most kernels will in reality have two outer levels of loops over
the full *Y* and *Z* work-group dimensions.

```mermaid
flowchart TD;
    Start("@foo.mux-barrier-wrapper()")
    OrigKernel0[["@foo()"]]
    OrigKernel1[["@__vecz_v4_foo()"]]
    Link1("`unsigned i = 0;
    unsigned wg_size = get_local_size(0);
    unsigned peel = wg_size % 4;`")
    ScalarPH{{"< scalar check >"}}
    VectorPH("for (unsigned e = wg_size - peel; i < e; i += 4)")
    Link2("for (; i < wg_size; i++)")
    Return("return")

    Start-->Link1
    Link1-->|"if (wg_size != peel)"|VectorPH
    Link1-->|"if (wg_size == peel)"|ScalarPH
    ScalarPH-->|"if (peel)"|Link2
    Link2-->OrigKernel0
    OrigKernel0-->Return
    OrigKernel1-->ScalarPH
    ScalarPH-->|"if (!peel)"|Return
    VectorPH-->OrigKernel1
```

In the above example, the vectorized kernel is called to execute as many
work-items as possible, up to the largest multiple of the vectorization
factor less than or equal to the work-group size.

If there are work-items remaining (i.e., if the work-group size is not a
multiple of 4), the original scalar kernel is called on the up to 3
remaining work-items. These remaining work-items are typically called
the 'peel' iterations.
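
In pseudo C, the control flow above corresponds roughly to the following
sketch. The wrapper name and signatures are simplified for illustration (the
real wrapper is `foo.mux-barrier-wrapper`), and only the inner-most X loop is
shown.

```C
#include <stddef.h>

void foo(size_t first_work_item);            // original scalar kernel
void __vecz_v4_foo(size_t first_work_item);  // vectorized-by-4 kernel

void foo_wrapper(size_t wg_size) {
  size_t peel = wg_size % 4; // 0-3 work-items the vector loop cannot cover
  size_t i = 0;
  // Main vector loop: each call covers 4 consecutive work-items.
  for (size_t e = wg_size - peel; i < e; i += 4)
    __vecz_v4_foo(i);
  // Peel loop: the scalar kernel handles the remaining work-items, if any.
  for (; i < wg_size; ++i)
    foo(i);
}
```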

### Defining mux Builtins

The bodies of mux builtin function declarations are now provided.

The
[PrepareSYCLNativeCPU](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/PrepareSYCLNativeCPU.cpp)
pass does most of the materialization of scheduling builtins, connecting these
scheduling-style calls to the scheduling structure that is passed in.

Any remaining materialization of builtins is handled by the
[DefineMuxBuiltinsPass](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_pipeline/source/define_mux_builtins_pass.cpp),
such as `__mux_mem_barrier`. The use of this pass should probably be phased
out in preference to doing it all in one place.

Some builtins may rely on others to complete their function. These
dependencies are handled transitively.

Pseudo C code:

```C
struct MuxWorkItemInfo { size_t local_ids[3]; ... };
struct MuxWorkGroupInfo { size_t group_ids[3]; ... };

// And this wrapper function
void foo.mux-sched-wrapper(MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  size_t id = __mux_get_global_id(0, wi, wg);
}

// The DefineMuxBuiltinsPass provides the definition
// of __mux_get_global_id:
size_t __mux_get_global_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return (__mux_get_group_id(i, wi, wg) * __mux_get_local_size(i, wi, wg)) +
         __mux_get_local_id(i, wi, wg) + __mux_get_global_offset(i, wi, wg);
}

// And thus the definition of __mux_get_group_id...
size_t __mux_get_group_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wg->group_ids[i];
}

// and __mux_get_local_id, etc
size_t __mux_get_local_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wi->local_ids[i];
}
```

### Tidy up

There is some tidying up at the end, such as deleting unused functions or
replacing the scalar kernel with the vectorized one.
