Native CPU Compiler Pipeline Overview
=====================================

# Introduction

This document introduces users to the Native CPU compiler pipeline. The
pipeline performs several key transformations over several phases, which can
be difficult for new users to follow. The pipeline is constructed and run in
`llvm::sycl::utils::addSYCLNativeCPUBackendPasses`. All of the compiler
pipeline code can be found under
[llvm/lib/SYCLNativeCPUUtils](https://github.com/intel/llvm/tree/sycl/llvm/lib/SYCLNativeCPUUtils),
with the code which originated from the [oneAPI Construction
Kit](https://github.com/uxlfoundation/oneapi-construction-kit/tree/main) located under
`compiler_passes` in that directory.

## Objective and Execution Model

The compiler pipeline's objective is to compile incoming LLVM IR
modules containing one or more kernel functions to object code ready for
execution when invoked by the host-side runtime. The assumptions placed
on the input and output kernels are as follows:

1. The original kernel is assumed to adhere to an implicit **SIMT**
   execution model; it runs once for each *work-item* in an
   **NDRange**.
2. It is passed a state struct which contains information about the scheduling.
3. All builtins which do not relate to scheduling have been processed,
   leaving only scheduling-related calls to `mux builtins`.
4. The final compiled kernel is assumed to be invoked from the
   host-side runtime once per *work-group* in the **NDRange**.

The following diagram provides an overview of the main phases of the
Native CPU compiler pipeline in terms of the underlying and assumed
kernel execution model.

The inner-most function is the original input kernel, which is *wrapped*
by new functions in successive phases, until it is in a form ready to be
executed by the Native CPU driver.

```mermaid
flowchart TD;
  Start(["Driver Entry Point"])
  Start-->WiLoop["for (wi : wg)"]
  WiLoop-->OrigKernel["original_kernel()"]
```

The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass)
is the key pass which makes some of the implicit parallelism
explicit. By introducing *work-item loops* around each kernel function,
the new kernel entry point now runs on every work-group in an
**NDRange**.

## Compiler Pipeline Overview

With the overall execution model established, we can start to dive
deeper into the key phases of the compilation pipeline.

```mermaid
flowchart TD;
  InputIR(["Input IR"])
  SpecConstants(["Handling SpecConstants"])
  Metadata(["Adding Metadata/Attributes"])
  Vecz(["Vectorization"])
  WorkItemLoops(["Work Item Loops / Barriers"])
  DefineBuiltins(["Define builtins"])
  TidyUp(["Tidy up"])

  InputIR-->SpecConstants
  SpecConstants-->Metadata
  Metadata-->Vecz
  Vecz-->WorkItemLoops
  WorkItemLoops-->DefineBuiltins
  DefineBuiltins-->TidyUp
```

### Input IR

The program begins as an LLVM module. Kernels in the module are assumed
to obey a **SIMT** programming model, as described earlier in [Objective
& Execution Model](#objective-and-execution-model).

Simple fix-up passes take place at this stage: the IR is massaged to
conform to specifications or to fix known deficiencies in earlier
representations. The input IR at this point will contain special
builtins, called `mux builtins`, for ndrange- or subgroup-style
operations, e.g. `__mux_get_global_id`. Many of the later passes
will refer to these `mux builtins`.

### Adding Metadata/Attributes

Native CPU IR metadata and attributes are attached to kernels. This
information is used by following passes to identify certain aspects of
kernels which are not otherwise attainable or representable in LLVM IR.

[TransferKernelMetadataPass and
EncodeKernelMetadataPass](SYCLNativeCPUPipelinePasses.md#transferkernelmetadatapass-and-encodekernelmetadatapass)
are responsible for adding this information.

### Whole Function Vectorization

The [vecz](SYCLNativeCPUVecz.md) whole-function vectorizer is optionally run.

Note that VECZ may perform its own scalarization, depending on the
options passed to it, potentially undoing the work of any previous
optimization passes, although it is able to preserve or even widen
pre-existing vector operations in many cases.

### Work-item Scheduling & Barriers

The work-item loops are added to each kernel by the [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass).

The kernel execution model changes at this stage to replace some of the
implicit parallelism with explicit looping, as described earlier in
[Objective & Execution Model](#objective-and-execution-model).

[Barrier Scheduling](#barrier-scheduling) takes place at this stage, as
well as [Vectorization Scheduling](#vectorization-scheduling) if the
vectorizer was run.

### Barrier Scheduling

The fact that the
[WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) handles
both work-item loops and barriers can be confusing to newcomers. These two
concepts are in fact linked. Taking the kernel code below, this section will
show how the `WorkItemLoopsPass` lays out and schedules a kernel's work-item
loops in the face of barriers.

```C
kernel void foo(global int *a, global int *b) {
  // pre-barrier code - foo.mux-barrier-region.0()
  size_t id = get_global_id(0);
  a[id] += 4;
  // barrier
  barrier(CLK_GLOBAL_MEM_FENCE);
  // post-barrier code - foo.mux-barrier-region.1()
  b[id] += 4;
}
```

The kernel has one global barrier, and one statement on either side of
it. The `WorkItemLoopsPass` conceptually breaks down the kernel into
*barrier regions*, which constitute the code following the control flow
between all barriers in the kernel. The example above has two regions:
the first contains the call to `get_global_id` and the read/update/write
of global memory pointed to by `a`; the second contains the
read/update/write of global memory pointed to by `b`.

To correctly observe the barrier's semantics, all work-items in the
work-group need to execute the first barrier region before beginning the
second. Thus the `WorkItemLoopsPass` produces two sets of work-item
loops to schedule this kernel:

```mermaid
graph TD;
  A(["@foo.mux-barrier-wrapper()"])
  A-->B{{"for (wi : wg)"}}
  B-->C[["@foo.mux-barrier-region.0()<br> a[id] += 4;"]]
  C-->D["fence"];
  D-->E{{"for (wi : wg)"}}
  E-->F[["@foo.mux-barrier-region.1() <br> b[id] += 4;"]]
```

#### Live Variables

Note also that `id` is a *live variable* whose lifetime traverses the
barrier. The `WorkItemLoopsPass` creates a structure of live variables
which is passed between the successive barrier regions, containing data
that needs to be live in future regions.

In this case, however, calls to certain builtins like `get_global_id`
are treated specially and are materialized anew in each barrier region
where they are used.

### Vectorization Scheduling

The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) is
responsible for laying out kernels which have been vectorized by the
[vecz](SYCLNativeCPUVecz.md) whole-function vectorizer.

The vectorizer creates multiple versions of the original kernel.
Vectorized kernels on their own are generally unable to fulfill
work-group scheduling requirements, as they operate only on a number of
work-items equal to a multiple of the vectorization factor. As such, for
the general case, several kernels must be combined to cover all
work-items in the work-group; the `WorkItemLoopsPass` is responsible for
this.

The following diagram uses a vectorization width of 4.

For brevity, the diagram below only details the inner-most work-item
loops. Most kernels will in reality have 2 outer levels of loops over
the full *Y* and *Z* work-group dimensions.

```mermaid
flowchart TD;
  Start("@foo.mux-barrier-wrapper()")
  OrigKernel0[["@foo()"]]
  OrigKernel1[["@__vecz_v4_foo()"]]
  Link1("`unsigned i = 0;
  unsigned wg_size = get\_local\_size(0);
  unsigned peel = wg\_size % 4;`")
  ScalarPH{{"\< scalar check \>"}}
  VectorPH("for (unsigned e = wg\_size - peel; i \< e; i += 4)")
  Link2("for (; i \< wg_size; i++)")
  Return("return")

  Start-->Link1
  Link1-->|"if (wg_size != peel)"|VectorPH
  Link1-->|"if (wg\_size == peel)"|ScalarPH
  ScalarPH-->|"if (peel)"|Link2
  Link2-->OrigKernel0
  OrigKernel0-->Return
  OrigKernel1-->ScalarPH
  ScalarPH-->|"if (!peel)"|Return
  VectorPH-->OrigKernel1
```

| 221 | + |
| 222 | +In the above example, the vectorized kernel is called to execute as many |
| 223 | +work-items as possible, up to the largest multiple of the vectorization |
| 224 | +less than or equal to the work-group size. |
| 225 | + |
| 226 | +In the case that there are work-items remaining (i.e., if the work-group |
| 227 | +size is not a multiple of 4) then the original scalar kernel is called |
| 228 | +on the up to 3 remaining work-items. These remaining work-items are |
| 229 | +typically called the \'peel\' iterations. |
| 230 | + |
### Defining mux Builtins

The bodies of mux builtin function declarations are now provided.

The [PrepareSYCLNativeCPU](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/PrepareSYCLNativeCPU.cpp) pass does most of the materialization of scheduling builtins, connecting these scheduling-style instructions up to the scheduling structure that is passed in.

Any remaining materialization of builtins is handled by
[DefineMuxBuiltinsPass](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_pipeline/source/define_mux_builtins_pass.cpp),
such as ``__mux_mem_barrier``. The use of this pass should probably be phased
out in preference to doing it all in one place.

Some builtins may rely on others to complete their function. These
dependencies are handled transitively.

Pseudo C code:

```C
struct MuxWorkItemInfo { size_t local_ids[3]; ... };
struct MuxWorkGroupInfo { size_t group_ids[3]; ... };

// And this wrapper function
void foo.mux-sched-wrapper(MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  size_t id = __mux_get_global_id(0, wi, wg);
}

// The DefineMuxBuiltinsPass provides the definition
// of __mux_get_global_id:
size_t __mux_get_global_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return (__mux_get_group_id(i, wi, wg) * __mux_get_local_size(i, wi, wg)) +
         __mux_get_local_id(i, wi, wg) + __mux_get_global_offset(i, wi, wg);
}

// And thus the definition of __mux_get_group_id...
size_t __mux_get_group_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wg->group_ids[i];
}

// and __mux_get_local_id, etc.
size_t __mux_get_local_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wi->local_ids[i];
}
```

### Tidy up

There is some tidying up at the end, such as deleting unused functions or
replacing the scalar kernel with the vectorized one.