
Commit 7c6e3a8

[SYCL][NATIVE_CPU] Update docs for Native CPU compiler pipeline

This integrates the appropriate compiler documentation, originally in the oneAPI Construction Kit (OCK), into the Native CPU compiler pipeline documentation. It has been updated to reflect the Native CPU pipeline and to remove some references to OCK's structures, and some of the documentation has been moved to markdown files to be consistent with the other documentation. Some of it may be irrelevant for Native CPU; if so, it should be updated over time. Mermaid has also been enabled to allow viewing of flowcharts in the markdown.

1 parent b671165

7 files changed: +2800 −14 lines changed

llvm/docs/requirements.txt

Lines changed: 1 addition & 0 deletions

@@ -8,3 +8,4 @@ sphinxcontrib-applehelp==2.0.0
 sphinx-reredirects==0.1.6
 furo==2025.7.19
 myst-parser==4.0.0
+sphinxcontrib-mermaid==1.0.0

sycl/doc/conf.py

Lines changed: 4 additions & 1 deletion

@@ -32,7 +32,7 @@
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
-extensions = ["myst_parser"]
+extensions = ["myst_parser", "sphinxcontrib.mermaid"]
 
 # Implicit targets for cross reference
 myst_heading_anchors = 5
@@ -47,6 +47,9 @@
 # The suffix of source filenames.
 source_suffix = [".rst", ".md"]
 
+# Allow use of mermaid directly to view on github without the {}
+myst_fence_as_directive = ["mermaid"]
+
 exclude_patterns = [
     # Extensions are mostly in asciidoc which has poor support in Sphinx.
     "extensions/*",

sycl/doc/design/SYCLNativeCPU.md

Lines changed: 5 additions & 13 deletions

@@ -49,18 +49,6 @@ python buildbot/configure.py \
 
 SYCL Native CPU uses [libclc](https://github.com/intel/llvm/tree/sycl/libclc) to implement many SPIRV builtins. When Native CPU is enabled, the default target triple for libclc will be `LLVM_TARGET_TRIPLE` (same as the default target triple used by `clang`). This can be overridden by setting the `--native-cpu-libclc-targets` option in `configure.py`.
 
-### oneAPI Construction Kit
-
-SYCL Native CPU uses the [oneAPI Construction Kit](https://github.com/codeplaysoftware/oneapi-construction-kit) (OCK) in order to support some core SYCL functionalities and improve performances, the OCK is fetched by default when SYCL Native CPU is enabled, and can optionally be disabled using the `NATIVECPU_USE_OCK` CMake variable (please note that disabling the OCK will result in limited functionalities and performances on the SYCL Native CPU backend):
-
-```
-python3 buildbot/configure.py --native_cpu -DNATIVECPU_USE_OCK=Off
-```
-
-By default the oneAPI Construction Kit is pulled at the project's configure time using CMake `FetchContent`. This behaviour can be overridden by setting `NATIVECPU_OCK_USE_FETCHCONTENT=Off` and `OCK_SOURCE_DIR=<path>`
-in order to use a local checkout of the oneAPI Construction Kit. The CMake variables `OCK_GIT_TAG` and `OCK_GIT_REPO` can be used to override the default git tag and repository used by `FetchContent`.
-
-The SYCL Native CPU device needs to be selected at runtime by setting the environment variable `ONEAPI_DEVICE_SELECTOR=native_cpu:cpu`.
 
 ### oneTBB integration
 
@@ -96,6 +84,7 @@ cmake \
 ```
 
 Note that a number of `e2e` tests are currently still failing.
+The SYCL Native CPU device needs to be selected at runtime by setting the environment variable `ONEAPI_DEVICE_SELECTOR=native_cpu:cpu`.
 
 # Vectorization
 
@@ -128,7 +117,10 @@ llvm-cov show .\vector-add.exe -instr-profile=foo.profdata
 
 ### Please note that Windows is partially supported but temporarily disabled due to some implementation details, it will be re-enabled soon.
 
-# Technical details
+
+# Native CPU compiler pipeline
+
+SYCL Native CPU formerly used the [oneAPI Construction Kit](https://github.com/codeplaysoftware/oneapi-construction-kit) (OCK) to support some core SYCL functionality and improve performance in the compiler pipeline. The relevant parts have been brought into DPC++, and the Native CPU compiler pipeline is documented [here](SYCLNativeCPUPipeline.md), with a brief overview below. The OCK-related parts are still enabled via the `NATIVECPU_USE_OCK` CMake variable, which is on by default.
 
 The following section gives a brief overview of how a simple SYCL application is compiled for the SYCL Native CPU target. Consider the following SYCL sample, which performs vector addition using USM:
 
sycl/doc/design/SYCLNativeCPUPipeline.md

Lines changed: 277 additions & 0 deletions

@@ -0,0 +1,277 @@

Native CPU Compiler Pipeline Overview
=====================================

# Introduction

This document serves to introduce users to the Native CPU compiler pipeline. The
compiler pipeline performs several key transformations across several phases,
which can be difficult for new users to follow. The pipeline is constructed and
run in `llvm::sycl::utils::addSYCLNativeCPUBackendPasses`. All of the compiler
pipeline code can be found under
[llvm/lib/SYCLNativeCPUUtils](https://github.com/intel/llvm/tree/sycl/llvm/lib/SYCLNativeCPUUtils),
with the code which originated from the [oneAPI Construction
Kit](https://github.com/uxlfoundation/oneapi-construction-kit/tree/main) under
`compiler_passes` in that directory.

## Objective and Execution Model

The compiler pipeline's objective is to compile incoming LLVM IR
modules containing one or more kernel functions to object code ready for
execution when invoked by the host-side runtime. The assumptions placed
on the input and output kernels are as follows:

1. The original kernel is assumed to adhere to an implicit **SIMT**
   execution model; it runs once for each *work-item* in an
   **NDRange**.
2. It is passed a state struct which contains information about the scheduling.
3. All builtins which do not relate to scheduling have been processed, leaving
   only scheduling-related calls to `mux builtins`.
4. The final compiled kernel is assumed to be invoked from the
   host-side runtime once per *work-group* in the **NDRange**.

The following diagram provides an overview of the main phases of the
Native CPU compiler pipeline in terms of the underlying and assumed
kernel execution model.

The inner-most function is the original input kernel, which is *wrapped*
by new functions in successive phases until it is in a form ready to be
executed by the Native CPU driver.

```mermaid
flowchart TD;
Start(["Driver Entry Point"])
Start-->WiLoop["for (wi : wg)"]
WiLoop-->OrigKernel["original_kernel()"]
```

The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass)
is the key pass which makes some of the implicit parallelism
explicit. By introducing *work-item loops* around each kernel function,
it produces a new kernel entry point that runs once per work-group in an
**NDRange**.
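
To make the wrapping concrete, the following is a minimal C sketch with hypothetical names and signatures (the real pipeline operates on LLVM IR and passes scheduling state structs) of a SIMT kernel body placed inside a work-item loop that runs once per work-group:

```C
#include <stddef.h>

// Illustrative only: the original kernel body, written to run for a single
// work-item identified by its global id.
static void original_kernel_body(size_t global_id, int *a, const int *b) {
  a[global_id] += b[global_id];
}

// After work-item loops are introduced, the new entry point is invoked once
// per work-group and loops over every work-item in that group.
static void kernel_work_group_wrapper(size_t group_id, size_t local_size,
                                      int *a, const int *b) {
  for (size_t local_id = 0; local_id < local_size; ++local_id) {
    size_t global_id = group_id * local_size + local_id;
    original_kernel_body(global_id, a, b);
  }
}
```
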
## Compiler Pipeline Overview

With the overall execution model established, we can start to dive
deeper into the key phases of the compilation pipeline.

```mermaid
flowchart TD;
InputIR(["Input IR"])
SpecConstants(["Handling SpecConstants"])
Metadata(["Adding Metadata/Attributes"])
Vecz(["Vectorization"])
WorkItemLoops(["Work Item Loops / Barriers"])
DefineBuiltins(["Define builtins"])
TidyUp(["Tidy up"])

InputIR-->SpecConstants
SpecConstants-->Metadata
Metadata-->Vecz
Vecz-->WorkItemLoops
WorkItemLoops-->DefineBuiltins
DefineBuiltins-->TidyUp
```

### Input IR

The program begins as an LLVM module. Kernels in the module are assumed
to obey a **SIMT** programming model, as described earlier in [Objective
& Execution Model](#objective-and-execution-model).

Simple fix-up passes take place at this stage: the IR is massaged to
conform to specifications or to fix known deficiencies in earlier
representations. The input IR at this point will contain special
builtins, called `mux builtins`, for NDRange- or sub-group-style
operations, e.g. `__mux_get_global_id`. Many of the later passes
refer to these `mux builtins`.
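
As a rough pseudo C illustration (in the same spirit as the pseudo C later in this document; the builtin signature is simplified here and is not the real interface), a kernel at this stage only contains calls to scheduling-related `mux builtins`, which are still just declarations:

```C
// Pseudo C sketch of the input-IR state, with a simplified builtin signature.
size_t __mux_get_global_id(unsigned dim); // declared here, defined later in the pipeline

void foo(int *a, const int *b) {
  size_t id = __mux_get_global_id(0); // scheduling builtin left for later passes
  a[id] += b[id];                     // every other builtin has already been lowered
}
```
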
### Adding Metadata/Attributes

Native CPU IR metadata and attributes are attached to kernels. This
information is used by subsequent passes to identify certain aspects of
kernels which are not otherwise attainable or representable in LLVM IR.

[TransferKernelMetadataPass and
EncodeKernelMetadataPass](SYCLNativeCPUPipelinePasses.md#transferkernelmetadatapass-and-encodekernelmetadatapass)
are responsible for adding this information.

### Whole Function Vectorization

The [vecz](SYCLNativeCPUVecz.md) whole-function vectorizer is optionally run.

Note that vecz may perform its own scalarization, depending on the
options passed to it, potentially undoing the work of any previous
optimization passes, although it is able to preserve or even widen
pre-existing vector operations in many cases.
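
As a simplified C sketch of the effect (hypothetical names; the real vectorizer works on LLVM IR and emits a function like the `__vecz_v4_foo` seen later, using vector instructions), whole-function vectorization turns a per-work-item kernel body into one that covers several consecutive work-items per call:

```C
#include <stddef.h>

// Illustrative only: the scalar kernel body handles one work-item.
static void foo_scalar(size_t id, int *a, const int *b) {
  a[id] += b[id];
}

// A whole-function-vectorized version with vectorization factor 4 handles four
// consecutive work-items per invocation; shown here as a plain loop rather
// than the vector instructions a real vectorizer would emit.
static void foo_vectorized_x4(size_t first_id, int *a, const int *b) {
  for (size_t lane = 0; lane < 4; ++lane)
    foo_scalar(first_id + lane, a, b);
}
```
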
### Work-item Scheduling & Barriers

The work-item loops are added to each kernel by the [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass).

The kernel execution model changes at this stage to replace some of the
implicit parallelism with explicit looping, as described earlier in
[Objective & Execution Model](#objective-and-execution-model).

[Barrier Scheduling](#barrier-scheduling) takes place at this stage, as
well as [Vectorization Scheduling](#vectorization-scheduling) if the
vectorizer was run.

### Barrier Scheduling

The fact that the
[WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) handles
both work-item loops and barriers can be confusing to newcomers. These two
concepts are in fact linked. Taking the kernel code below, this section will
show how the `WorkItemLoopsPass` lays out and schedules a kernel's work-item
loops in the face of barriers.

```C
kernel void foo(global int *a, global int *b) {
  // pre barrier code - foo.mux-barrier-region.0()
  size_t id = get_global_id(0);
  a[id] += 4;
  // barrier
  barrier(CLK_GLOBAL_MEM_FENCE);
  // post barrier code - foo.mux-barrier-region.1()
  b[id] += 4;
}
```

The kernel has one global barrier, and one statement on either side of
it. The `WorkItemLoopsPass` conceptually breaks down the kernel into
*barrier regions*: the code reachable along the control flow between
barriers in the kernel. The example above has two regions:
the first contains the call to `get_global_id` and the read/update/write
of global memory pointed to by `a`; the second contains the
read/update/write of global memory pointed to by `b`.

To correctly observe the barrier's semantics, all work-items in the
work-group need to execute the first barrier region before beginning the
second. Thus the `WorkItemLoopsPass` produces two sets of work-item
loops to schedule this kernel:

```mermaid
graph TD;
A(["@foo.mux-barrier-wrapper()"])
A-->B{{"for (wi : wg)"}}
B-->C[["@foo.mux-barrier-region.0()<br> a[id] += 4;"]]
C-->D["fence"];
D-->E{{"for (wi : wg)"}}
E-->F[["@foo.mux-barrier-region.1() <br> b[id] += 4;"]]
```
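
In pseudo C (illustrative only; the helper names are placeholders and the live-variable handling is simplified), the wrapper in the diagram has roughly this shape:

```C
// Pseudo C sketch: two work-item loops, one per barrier region, separated by
// the barrier's memory fence. Helper names are placeholders, not real APIs.
void foo_mux_barrier_wrapper(size_t wg_size, int *a, int *b) {
  for (size_t wi = 0; wi < wg_size; ++wi) {
    // @foo.mux-barrier-region.0(): pre-barrier code for work-item wi
    size_t id = global_id_of(wi);
    a[id] += 4;
  }

  global_memory_fence(); // the barrier's fence semantics

  for (size_t wi = 0; wi < wg_size; ++wi) {
    // @foo.mux-barrier-region.1(): post-barrier code for work-item wi
    size_t id = global_id_of(wi); // rematerialized, see Live Variables below
    b[id] += 4;
  }
}
```
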
#### Live Variables

Note also that `id` is a *live variable* whose lifetime traverses the
barrier. The `WorkItemLoopsPass` creates a structure of live variables
which is passed between the successive barrier regions, containing data
that needs to be live in future regions.

In this case, however, calls to certain builtins like `get_global_id`
are treated specially and are materialized anew in each barrier region
where they are used.
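
As a sketch of the idea (the struct layout and names here are hypothetical; the pass derives the real layout per kernel), a value that genuinely has to survive the barrier is stored by the first region and reloaded by the second:

```C
#include <stddef.h>

// Illustrative only: one live-variables slot per work-item for a value that
// is computed before the barrier and used after it.
struct foo_live_vars {
  int scaled;
};

static void foo_region_0(size_t wi, struct foo_live_vars *live, const int *a) {
  live[wi].scaled = a[wi] * 2; // store the value across the barrier
}

static void foo_region_1(size_t wi, const struct foo_live_vars *live, int *b) {
  b[wi] = live[wi].scaled + 1; // reload it after the barrier
}
```
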
### Vectorization Scheduling

The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) is
responsible for laying out kernels which have been vectorized by the
[vecz](SYCLNativeCPUVecz.md) whole-function vectorizer.

The vectorizer creates multiple versions of the original kernel.
Vectorized kernels on their own are generally unable to fulfill
work-group scheduling requirements, as they operate only on a number of
work-items equal to a multiple of the vectorization factor. As such, for
the general case, several kernels must be combined to cover all
work-items in the work-group; the `WorkItemLoopsPass` is responsible for
this.

The following diagram uses a vectorization width of 4.

For brevity, the diagram below only details the inner-most work-item
loops. Most kernels will in reality have 2 outer levels of loops over
the full *Y* and *Z* work-group dimensions.

```mermaid
flowchart TD;
Start("@foo.mux-barrier-wrapper()")
OrigKernel0[["@foo()"]]
OrigKernel1[["@__vecz_v4_foo()"]]
Link1("`unsigned i = 0;
unsigned wg_size = get\_local\_size(0);
unsigned peel = wg\_size % 4;`")
ScalarPH{{"\< scalar check \>"}}
VectorPH("for (unsigned e = wg\_size - peel; i \< e; i += 4)")
Link2("for (; i< wg_size; i++)")
Return("return")

Start-->Link1
Link1-->|"if (wg_size != peel)"|VectorPH
Link1-->|"if (wg\_size == peel)"|ScalarPH
ScalarPH-->|"if (peel)"|Link2
Link2-->OrigKernel0
OrigKernel0-->Return
OrigKernel1-->ScalarPH
ScalarPH-->|"if (!peel)"|Return
VectorPH-->OrigKernel1
```

In the above example, the vectorized kernel is called to execute as many
work-items as possible, up to the largest multiple of the vectorization
factor less than or equal to the work-group size.

In the case that there are work-items remaining (i.e., if the work-group
size is not a multiple of 4), the original scalar kernel is called on the
remaining work-items (at most 3). These remaining work-items are
typically called the 'peel' iterations.
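
In pseudo C (following the names in the diagram; the call signatures are simplified and not the real generated interfaces), the combined wrapper looks roughly like this:

```C
// Pseudo C sketch of the vector main loop plus the scalar 'peel' loop.
void foo_mux_barrier_wrapper(unsigned wg_size) {
  unsigned i = 0;
  unsigned peel = wg_size % 4;

  // Vector loop: covers the largest multiple of 4 not exceeding wg_size.
  for (unsigned e = wg_size - peel; i < e; i += 4)
    __vecz_v4_foo(i); // simplified call; handles work-items i..i+3

  // Peel loop: the original scalar kernel covers the remaining (< 4) work-items.
  for (; i < wg_size; i++)
    foo(i); // simplified call; handles a single work-item
}
```
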
### Defining mux Builtins

The bodies of mux builtin function declarations are now provided.

The [PrepareSYCLNativeCPU](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/PrepareSYCLNativeCPU.cpp) pass does most of the materialization of scheduling builtins, connecting these scheduling-style instructions to the scheduling structure that is passed in.

Any remaining materialization of builtins, such as `__mux_mem_barrier`, is handled by
[DefineMuxBuiltinsPass](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_pipeline/source/define_mux_builtins_pass.cpp).
The use of this pass should probably be phased out in favour of doing it all in one place.

Some builtins may rely on others to complete their function. These
dependencies are handled transitively.

Pseudo C code:

```C
struct MuxWorkItemInfo { size_t local_ids[3]; ... };
struct MuxWorkGroupInfo { size_t group_ids[3]; ... };

// The kernel is given a wrapper function which calls the builtin:
void foo.mux-sched-wrapper(MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  size_t id = __mux_get_global_id(0, wi, wg);
}

// The DefineMuxBuiltinsPass provides the definition
// of __mux_get_global_id:
size_t __mux_get_global_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return (__mux_get_group_id(i, wi, wg) * __mux_get_local_size(i, wi, wg)) +
         __mux_get_local_id(i, wi, wg) + __mux_get_global_offset(i, wi, wg);
}

// And thus the definition of __mux_get_group_id...
size_t __mux_get_group_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wg->group_ids[i];
}

// ... and __mux_get_local_id, etc.
size_t __mux_get_local_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wi->local_ids[i];
}
```

### Tidy up

There is some tidying up at the end, such as deleting unused functions or
replacing the scalar kernel with the vectorized one.
