From 1e7595e21933dccbf8ec9c647a628275ea572d3f Mon Sep 17 00:00:00 2001 From: Colin Davidson Date: Thu, 11 Sep 2025 14:15:52 +0100 Subject: [PATCH 01/14] [SYCL][NATIVE_CPU] Update docs for Native CPU compiler pipeline This integrates the appropriate compiler documentation originally in the oneAPI Construction Kit (OCK) into the NativeCPU compiler pipeline documentation. It has been updated to try to reflect the Native CPU pipeline, and remove some of the references to OCK's structures, as well as moving some of the documentation to markdown files to be consistent with some of the other documentation. Some of it may be irrelevant for Native CPU, and if so this should be updated over time. Support was added for the mermaid flowcharts in the config. --- llvm/docs/requirements.txt | 1 + sycl/doc/conf.py | 5 +- sycl/doc/design/SYCLNativeCPU.md | 69 +- sycl/doc/design/SYCLNativeCPUPipeline.md | 322 ++++ .../doc/design/SYCLNativeCPUPipelinePasses.md | 1182 +++++++++++++++ sycl/doc/design/SYCLNativeCPUVecz.md | 1325 +++++++++++++++++ sycl/doc/index.rst | 3 + 7 files changed, 2846 insertions(+), 61 deletions(-) create mode 100644 sycl/doc/design/SYCLNativeCPUPipeline.md create mode 100644 sycl/doc/design/SYCLNativeCPUPipelinePasses.md create mode 100644 sycl/doc/design/SYCLNativeCPUVecz.md diff --git a/llvm/docs/requirements.txt b/llvm/docs/requirements.txt index 14f34a5465e44..36371f16e3769 100644 --- a/llvm/docs/requirements.txt +++ b/llvm/docs/requirements.txt @@ -8,3 +8,4 @@ sphinxcontrib-applehelp==2.0.0 sphinx-reredirects==0.1.6 furo==2025.7.19 myst-parser==4.0.0 +sphinxcontrib-mermaid==1.0.0 diff --git a/sycl/doc/conf.py b/sycl/doc/conf.py index 27e73ee3d3ad0..f8480ed58e724 100644 --- a/sycl/doc/conf.py +++ b/sycl/doc/conf.py @@ -32,7 +32,7 @@ # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. -extensions = ["myst_parser"] +extensions = ["myst_parser", "sphinxcontrib.mermaid"] # Implicit targets for cross reference myst_heading_anchors = 5 @@ -47,6 +47,9 @@ # The suffix of source filenames. source_suffix = [".rst", ".md"] +# Allow use of mermaid directly to view on github without the {} +myst_fence_as_directive = ["mermaid"] + exclude_patterns = [ # Extensions are mostly in asciidoc which has poor support in Sphinx. "extensions/*", diff --git a/sycl/doc/design/SYCLNativeCPU.md b/sycl/doc/design/SYCLNativeCPU.md index eee9479fef103..2696d785d1659 100644 --- a/sycl/doc/design/SYCLNativeCPU.md +++ b/sycl/doc/design/SYCLNativeCPU.md @@ -49,18 +49,6 @@ python buildbot/configure.py \ SYCL Native CPU uses [libclc](https://github.com/intel/llvm/tree/sycl/libclc) to implement many SPIRV builtins. When Native CPU is enabled, the default target triple for libclc will be `LLVM_TARGET_TRIPLE` (same as the default target triple used by `clang`). This can be overridden by setting the `--native-cpu-libclc-targets` option in `configure.py`. 
-### oneAPI Construction Kit
-
-SYCL Native CPU uses the [oneAPI Construction Kit](https://github.com/codeplaysoftware/oneapi-construction-kit) (OCK) in order to support some core SYCL functionalities and improve performances, the OCK is fetched by default when SYCL Native CPU is enabled, and can optionally be disabled using the `NATIVECPU_USE_OCK` CMake variable (please note that disabling the OCK will result in limited functionalities and performances on the SYCL Native CPU backend):
-
-```
-python3 buildbot/configure.py --native_cpu -DNATIVECPU_USE_OCK=Off
-```
-
-By default the oneAPI Construction Kit is pulled at the project's configure time using CMake `FetchContent`. This behaviour can be overridden by setting `NATIVECPU_OCK_USE_FETCHCONTENT=Off` and `OCK_SOURCE_DIR=`
-in order to use a local checkout of the oneAPI Construction Kit. The CMake variables `OCK_GIT_TAG` and `OCK_GIT_REPO` can be used to override the default git tag and repository used by `FetchContent`.
-
-The SYCL Native CPU device needs to be selected at runtime by setting the environment variable `ONEAPI_DEVICE_SELECTOR=native_cpu:cpu`.

 ### oneTBB integration

@@ -96,6 +84,7 @@ cmake \
 ```

 Note that a number of `e2e` tests are currently still failing.
+The SYCL Native CPU device needs to be selected at runtime by setting the environment variable `ONEAPI_DEVICE_SELECTOR=native_cpu:cpu`.

 # Vectorization

@@ -107,7 +96,7 @@ Whole Function Vectorization is enabled by default, and can be controlled throug

 The `-march=` option can be used to select specific target cpus which may improve performance of the vectorized code.

-For more details on how the Whole Function Vectorizer is integrated for SYCL Native CPU, refer to the [Technical details](#technical-details) section.
+For more details on how the Whole Function Vectorizer is integrated for SYCL Native CPU, refer to the [Native CPU Compiler Pipeline](#native-cpu-compiler-pipeline) section.

 # Code coverage

@@ -130,7 +119,10 @@ llvm-cov show .\vector-add.exe -instr-profile=foo.profdata

 ### Please note that Windows is partially supported but temporarily disabled due to some implementation details, it will be re-enabled soon.

-# Technical details
+
+# Native CPU compiler pipeline
+
+SYCL Native CPU formerly used the [oneAPI Construction Kit](https://github.com/codeplaysoftware/oneapi-construction-kit) (OCK) to support some core SYCL functionality and improve performance in the compiler pipeline. The relevant parts have been brought into DPC++, and the Native CPU compiler pipeline is documented [here](SYCLNativeCPUPipeline.md), with a brief overview below. The OCK-related parts are still controlled by the `NATIVECPU_USE_OCK` CMake variable, which is enabled by default.

 The following section gives a brief overview of how a simple SYCL application is compiled for the SYCL Native CPU target. Consider the following SYCL sample, which performs vector addition using USM:

@@ -174,54 +166,12 @@ entry:
 ```

 For the SYCL Native CPU target, the device compiler is in charge of materializing the SPIRV builtins (such as `@__spirv_BuiltInGlobalInvocationId`), so that they can be correctly updated by the runtime when executing the kernel. This is performed by the [PrepareSYCLNativeCPU pass](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/PrepareSYCLNativeCPU.cpp).
-The PrepareSYCLNativeCPUPass also emits a `subhandler` function, which receives the kernel arguments from the SYCL runtime (packed in a vector), unpacks them, and forwards only the used ones to the actual kernel. +The PrepareSYCLNativeCPUPass also emits a `subhandler` wrapper function, which receives the kernel arguments from the SYCL runtime (packed in a vector), unpacks them, and forwards only the used ones to the actual kernel. ## PrepareSYCLNativeCPU Pass -This pass will add a pointer to a `native_cpu::state` struct as kernel argument to all the kernel functions, and it will replace all the uses of SPIRV builtins with the return value of appropriately defined functions, which will read the requested information from the `native_cpu::state` struct. The `native_cpu::state` struct is defined in the [native_cpu UR adapter](https://github.com/oneapi-src/unified-runtime/blob/main/source/adapters/native_cpu/nativecpu_state.hpp) and the builtin functions are defined in the [native_cpu device library](https://github.com/intel/llvm/blob/sycl/libdevice/nativecpu_utils.cpp). - - -The resulting IR is: - -```llvm -define weak dso_local void @_Z6Sample.NativeCPUKernel(ptr noundef align 4 %0, ptr noundef align 4 %1, ptr noundef align 4 %2, ptr %3) local_unnamed_addr #3 !srcloc !74 !kernel_arg_buffer_location !75 !kernel_arg_type !76 !sycl_fixed_targets !49 !sycl_kernel_omit_args !77 { -entry: - %ncpu_builtin = call ptr @_Z13get_global_idmP15nativecpu_state(ptr %3) - %4 = load i64, ptr %ncpu_builtin, align 32, !noalias !78 - %arrayidx.i = getelementptr inbounds i32, ptr %1, i64 %4 - %5 = load i32, ptr %arrayidx.i, align 4, !tbaa !72 - %arrayidx4.i = getelementptr inbounds i32, ptr %2, i64 %4 - %6 = load i32, ptr %arrayidx4.i, align 4, !tbaa !72 - %add.i = add nsw i32 %5, %6 - %cmp.i8.i = icmp ult i64 %4, 2147483648 - tail call void @llvm.assume(i1 %cmp.i8.i) - %arrayidx6.i = getelementptr inbounds i32, ptr %0, i64 %4 - store i32 %add.i, ptr %arrayidx6.i, align 4, !tbaa !72 - ret void -} -``` -This pass will also set the correct calling convention for the target, and handle calling convention-related function attributes, allowing to call the kernel from the runtime. - -The `subhandler` for the SYCL Native CPU kernel looks like: - -```llvm -define weak void @_Z6Sample(ptr %0, ptr %1) #4 { -entry: - %2 = getelementptr %0, ptr %0, i64 0 - %3 = load ptr, ptr %2, align 8 - %4 = getelementptr %0, ptr %0, i64 3 - %5 = load ptr, ptr %4, align 8 - %6 = getelementptr %0, ptr %0, i64 4 - %7 = load ptr, ptr %6, align 8 - %8 = getelementptr %0, ptr %0, i64 7 - %9 = load ptr, ptr %8, align 8 - call void @_ZTS10SimpleVaddIiE.NativeCPUKernel(ptr %3, ptr %5, ptr %7, ptr %9, ptr %1) - ret void -} -``` - -As you can see, the `subhandler` steals the kernel's function name, and receives two pointer arguments: the first one points to the kernel arguments from the SYCL runtime, and the second one to the `nativecpu::state` struct. +This pass will add a pointer to a `native_cpu::state` struct as kernel argument to all the kernel functions, and it will replace all the uses of SPIRV builtins with the return value of appropriately defined functions, which will read the requested information from the `native_cpu::state` struct. For more information, see [PrepareSYCLNativeCPU Pass](SYCLNativeCPUPipeline.md#preparesyclnativecpu-pass). 
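+
+A very rough sketch of the shape of the transformed code (C++-like pseudocode only; the names below are illustrative and the actual output is LLVM IR, shown in the linked pipeline document):
+
+```cpp
+#include <cstddef>
+
+namespace native_cpu { struct state; } // real definition lives in the UR adapter
+
+// Helper materialized by the pass; reads the work-item index from the state
+// struct (the real helpers live in the native_cpu device library, and this
+// name is hypothetical).
+size_t native_cpu_get_global_id(unsigned dim, native_cpu::state *s);
+
+// The kernel gains a native_cpu::state pointer, and SPIRV builtin uses are
+// replaced with calls to such helpers.
+void kernel(int *a, int *b, native_cpu::state *s) {
+  size_t id = native_cpu_get_global_id(0, s);
+  a[id] = b[id];
+}
+
+// The subhandler unpacks the argument vector passed by the SYCL runtime and
+// forwards only the used arguments, plus the state pointer, to the kernel.
+void subhandler(void *const *args, native_cpu::state *s) {
+  kernel(static_cast<int *>(args[0]), static_cast<int *>(args[1]), s);
+}
+```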
## Handling barriers @@ -229,7 +179,7 @@ On SYCL Native CPU, calls to `__spirv_ControlBarrier` are handled using the `Wor ## Vectorization -The OneAPI Construction Kit's Whole Function Vectorizer is executed as an LLVM Pass. Considering the following input function: +The Whole Function Vectorizer is executed as an LLVM Pass. Considering the following input function: ```llvm define void @SimpleVadd(i32*, i32*, i32*) { @@ -288,4 +238,3 @@ Each entry in the array contains the kernel name as a string, and a pointer to t ## Kernel lowering and execution The information produced by the device compiler is then employed to correctly lower the kernel LLVM-IR module to the target ISA (this is performed by the driver when `-fsycl-targets=native_cpu` is set). The object file containing the kernel code is linked with the host object file (and libsycl and any other needed library) and the final executable is run using the SYCL Native CPU UR Adapter, defined in [the Unified Runtime repo](https://github.com/oneapi-src/unified-runtime/tree/adapters/source/adapters/native_cpu). - diff --git a/sycl/doc/design/SYCLNativeCPUPipeline.md b/sycl/doc/design/SYCLNativeCPUPipeline.md new file mode 100644 index 0000000000000..9de5e566e7cbe --- /dev/null +++ b/sycl/doc/design/SYCLNativeCPUPipeline.md @@ -0,0 +1,322 @@ +Native CPU Compiler Pipeline Overview +===================================== + +# Introduction + +This document serves to introduce users to the Native CPU compiler pipeline. The +compiler pipeline performs several key transformations over several phases that +can be difficult to understand for new users. The pipeline is constructed and +run in `llvm::sycl::utils::addSYCLNativeCPUBackendPasses`. All of the compiler +pipeline code can be found under +[llvm/lib/SYCLNativeCPUUtils](https://github.com/intel/llvm/tree/sycl/llvm/lib/SYCLNativeCPUUtils), +with the code which originated from the [oneAPI Construction +Kit](https://github.com/uxlfoundation/oneapi-construction-kit/tree/main), under +`compiler_passes` in that directory. + + +## Objective and Execution Model + +The compiler pipeline\'s objective is to compile incoming LLVM IR +modules containing one or more kernel functions to object code ready for +execution when invoked by the host-side runtime. The assumptions placed +on the input and output kernels is as follows: + +1. The original kernel is assumed to adhere to an implicit **SIMT** + execution model; it runs once per each *work-item* in an + **NDRange**. +2. It is passed a state struct which contains information about the scheduling. +3. All builtins which do not relate to scheduling have been processed and we are + left with some scheduling related calls to `mux builtins`. +4. The final compiled kernel is assumed to be invoked from the + host-side runtime once per *work-group* in the **NDRange**. + +The following diagram provides an overview of the main phases of the +Native CPU compiler pipeline in terms of the underlying and assumed +kernel execution model. + +```mermaid +flowchart TD; + Start(["Driver Entry Point"]) + Start-->WiLoop["for (wi : wg)"] + WiLoop-->OrigKernel["original_kernel()"] +``` + +The inner-most function is the original input kernel, which is *wrapped* +by new functions in successive phases, until it is ready in a form to be +executed by the Native CPU driver. + +The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) +is the key pass which makes some of the implicit parallelism +explicit. 
By introducing *work-item loops* around each kernel function, +the new kernel entry point now runs on every work-group in an +**NDRange**. + +## Compiler Pipeline Overview + +With the overall execution model established, we can start to dive +deeper into the key phases of the compilation pipeline. + +```mermaid +flowchart TD; + InputIR(["Input IR"]) + SpecConstants(["Handling SpecConstants"]) + Metadata(["Adding Metadata/Attributes"]) + Vecz(["Vectorization"]) + WorkItemLoops(["Work Item Loops / Barriers"]) + DefineBuiltins(["Define builtins and Tidy up"]) + + InputIR-->SpecConstants + SpecConstants-->Metadata + Metadata-->Vecz + Vecz-->WorkItemLoops + WorkItemLoops-->DefineBuiltins + DefineBuiltins-->TidyUp +``` + + +### Input IR + +The program begins as an LLVM module. Kernels in the module are assumed +to obey a **SIMT** programming model, as described earlier in [Objective +& Execution Model](#objective-and-execution-model). + +Simple fix-up passes take place at this stage: the IR is massaged to +conform to specifications or to fix known deficiencies in earlier +representations. The input IR at this point will contains special +builtins, called `mux builtins` for ndrange or subgroup +style operations e.g. `mux_get_global_id`. Many of these +later passes will refer to these `mux builtins`. + +### Adding Metadata/Attributes + +Native CPU IR metadata and attributes are attached to kernels. This +information is used by following passes to identify certain aspects of +kernels which are not otherwise attainable or representable in LLVM IR. + +[TransferKernelMetadataPass and +EncodeKernelMetadataPass](SYCLNativeCPUPipelinePasses.md#transferkernelmetadatapass-and-encodekernelmetadatapass) +are responsible for adding this information. + +### Whole Function Vectorization + +The [vecz](SYCLNativeCPUVecz.md) whole-function vectorizer is optionally run. + +Note that VECZ may perform its own scalarization, depending on the +options passed to it, potentially undoing the work of any previous +optimization passes, although it is able to preserve or even widen +pre-existing vector operations in many cases. + +#### Work-item Scheduling & Barriers + +The work-item loops are added to each kernel by the [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass). + +The kernel execution model changes at this stage to replace some of the +implicit parallelism with explicit looping, as described earlier in +[Objective & Execution Model](#objective-and-execution-model). + +[Barrier Scheduling](#barrier-scheduling) takes place at this stage, as +well as [Vectorization Scheduling](#vectorization-scheduling) if the +vectorizer was run. + + +### Barrier Scheduling + +The fact that the +[WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) handles +both work-item loops and barriers can be confusing to newcomers. These two +concepts are in fact linked. Taking the kernel code below, this section will +show how the `WorkItemLoopsPass` lays out and schedules a kernel\'s work-item +loops in the face of barriers. + +```C +kernel void foo(global int *a, global int *b) { + // pre barrier code - foo.mux-barrier-region.0() + size_t id = get_global_id(0); + a[id] += 4; + // barrier + barrier(CLK_GLOBAL_MEM_FENCE); + // post barrier code - foo.mux-barrier-region.1() + b[id] += 4; +} +``` + +The kernel has one global barrier, and one statement on either side of +it. 
The `WorkItemLoopsPass` conceptually breaks down the kernel into +*barrier regions*, which constitute the code following the control-flow +between all barriers in the kernel. The example above has two regions: +the first contains the call to `get_global_id` and the read/update/write +of global memory pointed to by `a`; the second contains the +read/update/write of global memory pointed to by `b`. + +To correctly observe the barrier\'s semantics, all work-items in the +work-group need to execute the first barrier region before beginning the +second. Thus the `WorkItemLoopsPass` produces two sets of work-item +loops to schedule this kernel: + +```mermaid +graph TD; + A(["@foo.mux-barrier-wrapper()"]) + A-->B{{"for (wi : wg)"}} + B-->C[["@foo.mux-barrier-region.0()
a[id] += 4;"]] + C-->D["fence"]; + D-->E{{"for (wi : wg)"}} + E-->F[["@foo.mux-barrier-region.1()
b[id] += 4;"]] +``` + +#### Live Variables + +Note also that `id` is a *live variable* whose lifetime traverses the +barrier. The `WorkItemLoopsPass` creates a structure of live variables +which are passed between the successive barrier regions, containing data +that needs to be live in future regions. + +In this case, however, calls to certain builtins like `get_global_id` +are treated specially and are materialized anew in each barrier region +where they are used. + +### Vectorization Scheduling + +The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) is +responsible for laying out kernels which have been vectorized by the +[vecz](SYCLNativeCPUVecz.md) whole-function vectorizer. + +The vectorizer creates multiple versions of the original kernel. +Vectorized kernels on their own are generally unable to fulfill +work-group scheduling requirements, as they operate only on a number of +work-items equal to a multiple of the vectorization factor. As such, for +the general case, several kernels must be combined to cover all +work-items in the work-group; the `WorkItemLoopsPass` is responsible for +this. + +The following diagram uses a vectorization width of 4. + +For brevity, the diagram below only details in inner-most work-item +loops. Most kernels will in reality have 2 outer levels of loops over +the full *Y* and *Z* work-group dimensions. + +```mermaid +flowchart TD; + Start("@foo.mux-barrier-wrapper()") + OrigKernel0[["@foo()"]] + OrigKernel1[["@__vecz_v4_foo()"]] + Link1("`unsigned i = 0; + unsigned wg_size = get\_local\_size(0); + unsigned peel = wg\_size % 4;`") + ScalarPH{{"\< scalar check \>"}} + VectorPH("for (unsigned e = wg\_size - peel; i \< e; i += 4)") + Link2("for (; i< wg_size; i++)") + Return("return") + + Start-->Link1 + Link1-->|"if (wg_size != peel)"|VectorPH + Link1-->|"if (wg\_size == peel)"|ScalarPH + ScalarPH-->|"if (peel)"|Link2 + Link2-->OrigKernel0 + OrigKernel0-->Return + OrigKernel1-->ScalarPH + ScalarPH-->|"if (!peel)"|Return + VectorPH-->OrigKernel1 +``` + +In the above example, the vectorized kernel is called to execute as many +work-items as possible, up to the largest multiple of the vectorization +less than or equal to the work-group size. + +In the case that there are work-items remaining (i.e., if the work-group +size is not a multiple of 4) then the original scalar kernel is called +on the up to 3 remaining work-items. These remaining work-items are +typically called the \'peel\' iterations. + +#### PrepareSYCLNativeCPU Pass + +This pass will add a pointer to a `native_cpu::state` struct as kernel argument to all the kernel functions, and it will replace all the uses of SPIRV builtins with the return value of appropriately defined functions, which will read the requested information from the `native_cpu::state` struct. The `native_cpu::state` struct is defined in the [native_cpu UR adapter](https://github.com/oneapi-src/unified-runtime/blob/main/source/adapters/native_cpu/nativecpu_state.hpp) and the builtin functions are defined in the [native_cpu device library](https://github.com/intel/llvm/blob/sycl/libdevice/nativecpu_utils.cpp). 
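+
+For orientation, the state struct can be thought of as a plain record of per-work-item and per-work-group scheduling data. The sketch below is only illustrative: the field names are hypothetical, and the authoritative definition is the UR adapter header linked above.
+
+```cpp
+#include <cstddef>
+
+namespace native_cpu {
+// Hypothetical sketch, not the real definition.
+struct state {
+  size_t global_id[3];   // updated as the runtime schedules each work-item
+  size_t local_id[3];
+  size_t group_id[3];
+  size_t global_range[3];
+  size_t local_range[3];
+  // ... offsets, number of groups, etc.
+};
+} // namespace native_cpu
+```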
+
+For a simple kernel of the form:
+
+```cpp
+   auto kern = [=](cl::sycl::id<1> wiID) {
+      c_ptr[wiID] = a_ptr[wiID] + b_ptr[wiID];
+    };
+```
+The resulting IR for this kernel with a `sycl::range` of 1 is:
+
+```llvm
+define weak dso_local void @_Z6Sample.NativeCPUKernel(ptr noundef align 4 %0, ptr noundef align 4 %1, ptr noundef align 4 %2, ptr %3) local_unnamed_addr #3 !srcloc !74 !kernel_arg_buffer_location !75 !kernel_arg_type !76 !sycl_fixed_targets !49 !sycl_kernel_omit_args !77 {
+entry:
+  %ncpu_builtin = call ptr @_Z13get_global_idmP15nativecpu_state(ptr %3)
+  %4 = load i64, ptr %ncpu_builtin, align 32, !noalias !78
+  %arrayidx.i = getelementptr inbounds i32, ptr %1, i64 %4
+  %5 = load i32, ptr %arrayidx.i, align 4, !tbaa !72
+  %arrayidx4.i = getelementptr inbounds i32, ptr %2, i64 %4
+  %6 = load i32, ptr %arrayidx4.i, align 4, !tbaa !72
+  %add.i = add nsw i32 %5, %6
+  %cmp.i8.i = icmp ult i64 %4, 2147483648
+  tail call void @llvm.assume(i1 %cmp.i8.i)
+  %arrayidx6.i = getelementptr inbounds i32, ptr %0, i64 %4
+  store i32 %add.i, ptr %arrayidx6.i, align 4, !tbaa !72
+  ret void
+}
+```
+This pass will also set the correct calling convention for the target, and handle calling convention-related function attributes, allowing the kernel to be called from the runtime.
+
+This kernel function is then wrapped by a `subhandler` function, which receives the kernel arguments from the SYCL runtime (packed in a vector), unpacks them, and forwards only the used ones to the actual kernel. The `subhandler` looks like:
+
+```llvm
+define weak void @_Z6Sample(ptr %0, ptr %1) #4 {
+entry:
+  %2 = getelementptr %0, ptr %0, i64 0
+  %3 = load ptr, ptr %2, align 8
+  %4 = getelementptr %0, ptr %0, i64 3
+  %5 = load ptr, ptr %4, align 8
+  %6 = getelementptr %0, ptr %0, i64 4
+  %7 = load ptr, ptr %6, align 8
+  %8 = getelementptr %0, ptr %0, i64 7
+  %9 = load ptr, ptr %8, align 8
+  call void @_ZTS10SimpleVaddIiE.NativeCPUKernel(ptr %3, ptr %5, ptr %7, ptr %9, ptr %1)
+  ret void
+}
+```
+
+As you can see, the `subhandler` steals the kernel's function name, and receives two pointer arguments: the first one points to the kernel arguments from the SYCL runtime, and the second one to the `native_cpu::state` struct.
+
+There is also some tidying up at the end, such as deleting unused functions or
+replacing the scalar kernel with the vectorized one.
+
+
+Any remaining materialization of builtins, such as ``__mux_mem_barrier``, is
+handled by the
+[DefineMuxBuiltinsPass](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_pipeline/source/define_mux_builtins_pass.cpp).
+The use of this pass should probably be phased out in preference to doing it
+all in one place.
+
+Some builtins may rely on others to complete their function. These
+dependencies are handled transitively.
+
+Pseudo C code:
+
+```C
+struct MuxWorkItemInfo { size_t local_ids[3]; /* ... */ };
+struct MuxWorkGroupInfo { size_t group_ids[3]; /* ... */ };
+
+// And this wrapper function
+void foo.mux-sched-wrapper(MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
+  size_t id = __mux_get_global_id(0, wi, wg);
+}
+
+// The DefineMuxBuiltinsPass provides the definition
+// of __mux_get_global_id:
+size_t __mux_get_global_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
+  return (__mux_get_group_id(i, wi, wg) * __mux_get_local_size(i, wi, wg)) +
+         __mux_get_local_id(i, wi, wg) + __mux_get_global_offset(i, wi, wg);
+}
+
+// And thus the definition of __mux_get_group_id...
+size_t __mux_get_group_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) { + return i >= 3 ? 0 : wg->group_ids[i]; +} + +// and __mux_get_local_id, etc +size_t __mux_get_local_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) { + return i >= 3 ? 0 : wi->local_ids[i]; +} +``` diff --git a/sycl/doc/design/SYCLNativeCPUPipelinePasses.md b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md new file mode 100644 index 0000000000000..98653b1ca13e6 --- /dev/null +++ b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md @@ -0,0 +1,1182 @@ +Compiler Utilities +================== + +The `compiler::utils` module exists under +[compiler_pipeline](https://github.com/intel/llvm/tree/sycl/llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_pipeline) +and provides a number of utility functions and LLVM passes inside the +`compiler::utils` namespace, along with `metadata` and Function attributes. +These utility passes are currently only being used by `Native CPU`. These +utilities were originally under the [oneAPI Construction +Kit](https://github.com/uxlfoundation/oneapi-construction-kit/tree/main). + +# TransferKernelMetadataPass and EncodeKernelMetadataPass + +These passes are responsible for setting up metadata on kernels under +compilation. Many other passes implicitly rely on the metadata and +attributes added by these passes, so it is recommended to run them +first, if possible. + +The difference between the two passes concerns the list of kernels it +runs over: + +- The `TransferKernelMetadataPass` runs over multiple kernels in the + module, using `!opencl.kernels` metadata if present, else all + functions in the module with the `llvm::CallingConv::SPIR_KERNEL` + calling convention. +- The `EncodeKernelMetadataPass` runs on *one* kernel, supplied by + name to the pass upon construction. + +Their job is three-fold: + +- To add [mux-kernel](#mux-kernel-attribute) entry-point function attributes + to the kernels covered by each pass. + +- To add [!reqd_work_group_size](#metadata) function metadata if not already + attached. It sets this information based on local work-group size + information which is: + + > - (`TransferKernelMetadataPass`) - taken from the kernel\'s + > entry in the `!opencl.kernels` module-level metadata. + > - (`EncodeKernelMetadataPass`) - optionally passed to the pass + > on construction. The local sizes passed to the pass should + > either be empty or correspond 1:1 with the list of kernels + > provided. + +- To add [mux-work-item-order](#function-attributes) work-item order function attributes. It uses + optional data supplied to either pass on construction to encode this + metadata. If not set, the default `xyz` order is used. + +# WorkItemLoopsPass + +The `WorkItemLoopsPass` is responsible for adding explicit parallelism +to implicitly parallel SIMT kernels. It does so by wrapping each kernel +up in a triple-nested loop over all work-items in the work-group. Thus, +kernels scheduled by this pass can be invoked once per work-group. + +The order in which work-items are executed is fairly flexible, but generally in +ascending order from [0] to [N-1] through the innermost [X] dimension, followed +by the [Y] dimension, and lastly the [Z] dimension. 
+ +Conceptually, the pass transforms `old_kernel` into `new_kernel` in the +example below: + +```cpp +void old_kernel(int *in, int *out) { + size_t id = get_local_linear_id(0); + out[id] = in[id] * 4; +} + +void new_kernel(int *in, int *out) { + for (size_t z = 0, sizeZ = get_local_size(2); z != sizeZ; z++) { + for (size_t y = 0, sizeY = get_local_size(1); y != sizeY; y++) { + for (size_t x = 0, sizeX = get_local_size(0); x != sizeX; x++) { + size_t id = (z * sizeY * sizeX) + (y * sizeX) + x; + out[id] = in[id] * 4; + } + } + } +} +``` + +To satisfy the programming model, the pass must be careful around +control barriers and *barrier-like* functions. The `WorkItemLoopsPass` +splits a kernel into separately executing kernel functions using barrier +calls as boundaries. Each section of the kernel split by these barriers +is known as a *barrier region*. + +```cpp +void old_kernel(int *in, int *out) { + size_t id = get_local_linear_id(0); + out[id * 2] = in[id]; + // All work-items in the work-group must encounter the barrier before any + // are allowed to continue execution beyond the barrier. + work_group_barrier(CLK_GLOBAL_MEM_FENCE); + out[id * 2 + 1] = in[id] * 4; +} + +void new_kernel(int *in, int *out) { + // Barrier region #0 + for (size_t z = 0, sizeZ = get_local_size(2); z != sizeZ; z++) { + for (size_t y = 0, sizeY = get_local_size(1); y != sizeY; y++) { + for (size_t x = 0, sizeX = get_local_size(0); x != sizeX; x++) { + size_t id = (z * sizeY * sizeX) + (y * sizeX) + x; + out[id * 2] = in[id]; + } + } + } + + // The control aspect of the barrier has been satisfied by the loops, so + // it has been decomposed to just a memory barrier. + mem_fence(CLK_GLOBAL_MEM_FENCE); + + // Barrier region #1 + for (size_t z = 0, sizeZ = get_local_size(2); z != sizeZ; z++) { + for (size_t y = 0, sizeY = get_local_size(1); y != sizeY; y++) { + for (size_t x = 0, sizeX = get_local_size(0); x != sizeX; x++) { + size_t id = (z * sizeY * sizeX) + (y * sizeX) + x; + out[id * 2 + 1] = in[id] * 4; + } + } + } +} +``` + +To propagate data dependencies between these *barrier regions*, an +analysis is performed to create a struct of live variables which is +passed as an argument to each kernel. Generated kernels then reference +this struct rather than the original values. A simplified example +follows: + +```cpp +void old_kernel(int *in, int *out) { + size_t id = get_local_linear_id(0); + // X is a barrier-carried dependency: produced in one barrier region and + // accessed in another. + int x = in[id] * 4; + // All work-items in the work-group must encounter the barrier before any + // are allowed to continue execution beyond the barrier. + work_group_barrier(CLK_GLOBAL_MEM_FENCE); + // Use X, produced by the previous barrier region. + out[id] = x; +} + +void new_kernel(int *in, int *out) { + struct kernel_live_vars { + int x; + }; + // Illustrative purposes: this is in reality a stack allocation. 
+ kernel_live_vars *live_vars = + malloc(get_local_size(0) * get_local_size(1) + * get_local_size(2) * sizeof(live_vars)); + + for (size_t z = 0, sizeZ = get_local_size(2); z != sizeZ; z++) { + for (size_t y = 0, sizeY = get_local_size(1); y != sizeY; y++) { + for (size_t x = 0, sizeX = get_local_size(0); x != sizeX; x++) { + size_t id = (z * sizeY * sizeX) + (y * sizeX) + x; + live_vars[id] = in[id] * 4; + } + } + } + + mem_fence(CLK_GLOBAL_MEM_FENCE); + + for (size_t z = 0, sizeZ = get_local_size(2); z != sizeZ; z++) { + for (size_t y = 0, sizeY = get_local_size(1); y != sizeY; y++) { + for (size_t x = 0, sizeX = get_local_size(0); x != sizeX; x++) { + size_t id = (z * sizeY * sizeX) + (y * sizeX) + x; + out[id] = live_vars[id]; + } + } + } +} +``` + +The loop that reconstructs the kernels in the wrapper function uses the +vectorization dimension as innermost cycle, and it relies on +[mux-work-item-order](#function-attributes) function attributes for the +outermost loops. + +Preserving debug info is a problem for the `WorkItemLoopsPass` due to +live variables getting stored in a struct passed as an argument to each +of the generated kernels. As a result the memory locations pointed to by +the debug info are out of date with respect to newly written values. By +specifying the `IsDebug` flag when creating the pass we can resolve this +problem at the expense of performance. + +When the `IsDebug` flag is set the pass adds a new `alloca` which +contains a pointer to the live variables struct of the currently +executing work-item, since there is a separate struct for each work-item +in a work-group. A new `store` instruction to this `alloca` is also +inserted before calls to each of the separated kernels with the new +address of the live variables struct for the work-item about to be +executed. These extra writes to the stack have a runtime cost which is +why this transformation is only done when compiling for debug. + +The benefit of adding the extra `alloca` is that it forces the address +to be placed on the stack, where we can point to it with +`llvm.dbg.declare()` intrinsics, rather than reading the address from a +register where it won\'t persist. Not all source variables are classed +as live however if they are not used past the first barrier, so when the +`IsDebug` flag is set we also modify the algorithm for finding live +variables to mark these `alloca` instructions as live. Otherwise their +values won\'t be updated for the current work item past the first +barrier and the debugger will print incorrect values. + +To point to the location in the live variables struct where each source +variable lives we use DWARF expressions, represented in LLVM by a +`DIExpression` metadata node. In our expression we first use a +`DW_OP_deref` DWARF operation to dereference the pointer in our +debugging `alloca` to find the start of the live variables struct. Then +next in the expression we have a `DW_OP_plus` operation taking an +integer operand for the byte offset into the struct for that particular +variable. + +In order to establish which values actually cross a barrier, we traverse +the CFG and build inter-barrier regions. We start traversal at the +beginning of the function, and at the barriers, and we end whenever we +encounter another barrier or a return statement. We collect all values +that are defined within one region, which have uses in any other region, +which are called \"external uses\". 
We also collect values that are +defined within one region and used in the same region, but where the +definition does not dominate the use. These are \"internal uses\" and +can occur where a barrier is present in a loop, such that the same +barrier that begins the inter-barrier region can also be hit at the end +of that region. (The definition must have dominated all its uses in the +original function, but a barrier inside a loop can result in the second +part of the loop body preceding the first within the inter-barrier +region.) + +We also implement a \"Barrier Tidying\" optimization that +posts-processes the set of live values to remove certain values where it +is expected that loading and storing these values will incur more +overhead than simply recalculating them from other available values +(including other barrier-stored values and kernel parameters). Values +considered removable are: + +> - NOP casts, +> - Casts from a narrow type to a wider type, +> - All other casts where the source operand is already in the +> barrier, +> - Vector splats, +> - Calls to \"rematerializable\" builtins - see +> `compiler::utils::eBuiltinPropertyRematerializable` + +If the barrier contains scalable vectors, the size of the struct is +dependent on the value of `vscale`, and so is the total number of struct +instances for a given work group size. In this case we create the +barrier memory area as a byte buffer (i.e. an array of `i8`), instead of +an array of barrier structs. The address of the barrier struct for the +subkernel invocations have to be computed knowing the vscale, and +pointer-cast to the barrier struct type. Any scalable vector members of +the barrier struct are put into a flexible array member (of type `i8`) +at the end, so that GEPs to individual members can be constructed by +calculating their byte offsets into this array and the results cast to +pointers of the needed type. The position of individual scalable vector +members is calculated by multiplying their equivalent \"fixed width\" +offset (i.e. the same as if vscale were equal to 1) by the actual +vscale. + +Once we know which values are to be included in the barrier struct, we +can split the kernel proper, creating a new function for each of the +inter-barrier regions, cloning the Basic Blocks of the original function +into it. We apply the barrier in the following order: external uses are +remapped into loads from the barrier struct, then any barrier-resident +values are stored into the barrier, and finally, internal uses are +remapped into the barrier. External and internal uses are dealt with +separately, since external uses can always be safely loaded only once at +the beginning of the new function, where as internal uses may or may not +need to load the loop-updated value. For this reason, stores are always +created immediately after the definitions of the relevant values, rather +than at the barrier at the end of the region. (This may have some scope +for further optimization work.) When tidying has removed a value from +the barrier, we have to also clone those values as well, in order to +re-compute these values from the value actually stored in the barrier +struct. Each subkernel returns an integer ID that maps to the barriers, +corresponding to the barrier that was encountered at the end of the +subkernel. There is a special barrier ID that represents the return +statement of the original kernel, and also one that represents the +kernel entry point. 
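+
+A minimal sketch of how the wrapper can drive the subkernels using these barrier IDs follows (illustrative C++ pseudocode only; the names, ID values and exact control flow are not the generated code, which is produced directly as IR):
+
+```cpp
+// Special IDs: one for the kernel entry point and one for the original return.
+constexpr int kBarrierEntry = -1;
+constexpr int kBarrierReturn = 0;
+
+// Each subkernel runs its work-item loops over one barrier region and returns
+// the ID of the barrier it reached (or kBarrierReturn once the original kernel
+// would have returned).
+int region_entry() { /* work-item loops over the entry region */ return 1; }
+int region_after_barrier_1() { /* work-item loops over the next region */ return kBarrierReturn; }
+
+void barrier_wrapper() {
+  int next = kBarrierEntry;
+  while (next != kBarrierReturn) {
+    switch (next) {
+      case kBarrierEntry: next = region_entry(); break;
+      case 1:             next = region_after_barrier_1(); break;
+    }
+  }
+}
+```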
+ +This pass runs over all functions in the module which have [mux-kernel](#function-attributes) entry-point attributes. + +The new wrappers take the name of either the \'tail\' or \'main\' +kernels \--whichever is present \-- suffixed by +\".mux-barrier-wrapper\". The wrappers call either the original +kernel(s) if no barriers are present, or the newly-created barrier +regions if barriers are present. The original kernels are left in the +module in either case but are marked as internal so that later passes +can optimize them if they are no longer called once inlined. + +Newly-created functions preserve the original calling convention, unless they +are kernels. In that case, the new functions will have `SPIR_FUNC` calling +convention. Newly-created functions steal the [mux-kernel](#function-attributes) +attributes from the original functions. + +Once we have all of our subkernels, we apply the 3-dimensional work item +loops individually to each subkernel. The return value of a subkernel is +used to determine which subkernel loop to branch to next, or to exit the +wrapper function, as appropriate. + +## Work-group scheduling (vectorized and scalar loops) + +The [WorkItemLoopsPass](#workitemloopspass) is responsible for stitching +together multiple kernels to make a single kernel capable of correctly +executing all work-items in the work-group. + +In particular, when a kernel has been vectorized with +[vecz](SYCLNativeCPUVecz.md) it executes multiple work-items at +once. Unless the work-group size in the vectorized dimension is known to +be a multiple of the vectorization factor, there exists the possibility +that some work-items will not be executed by the vectorized loop. + +As such, the [WorkItemLoopsPass](#workitemloopspass) is able to stitch +together kernels in several different configurations: + +- Vector + scalar loop +- Vector loop + vector-predicated tail +- Vector loop only +- Scalar loop only + +### Vector + Scalar + +The vector + scalar kernel combination is considered the default +behaviour. Most often the work-group size is unknown at compile time and +thus it must be assumed that the vector loop may not execute all +work-items. + +This configuration is used if the [WorkItemLoopsPass](#workitemloopspass) is +asked to run on a vectorized function which has +[!codeplay_ca_vecz.derived](#metadata) function metadata linking it back to its + scalar progenitor. In this case, both the vector and scalar kernel functions +are identified and are used. The vector work-items are executed first, followed +by the scalar work-items. + +```cpp +const size_t peel = group_size_x % vec_width; +const size_t peel_limit = group_size_x - peel; + +if (group_size_x >= vector_width) { + for (size_t z = 0; z < group_size_z; ++z) { + for (size_t y = 0; y < group_size_y; ++y) { + for (size_t wi = 0; wi < peel_limit; wi += vec_width) { + // run vectorized kernel if vec_width > 1, + // otherwise the scalar kernel. + } + } + } +} +if (group_size_x < vector_width || group_size_x % vector_width != 0) { + for (size_t z = 0; z < group_size_z; ++z) { + for (size_t y = 0; y < group_size_y; ++y) { + // peeled loop running remaining work-items (if any) on the scalar + // kernel + for (size_t wi = peel_limit; wi < group_size_x; ++wi) { + // run scalar kernel + } + } + } +} +``` + +Barriers are supported in this mode by creating a separate barrier +struct for both the vector and scalar versions of the kernel. 
+ +There are circumstances in which this mode is skipped in favour of +\"vector only\" mode: + +- If the local work-group size is known to be a multiple of the + vectorization factor. + + > - This is identified through the [!reqd_work_group_size](#metadata) function metadata. This is often automatically + > added to functions by compiler frontends if kernels are + > supplied with attributes (e.g., `reqd_work_group_size` in + > OpenCL). Alternatively, if the work-group size is known at + > compile time, use the + > [ TransferKernelMetadataPass or EncodeKernelMetadataPass ](#transferkernelmetadatapass-and-encodekernelmetadatapass) + > to encode functions with this information. + +- If the [WorkItemLoopsPass](#workitemloopspass) has been created with + the `ForceNoTail` option. + + - This is a global toggle for *all* kernels in the program. + +- If the kernel has been vectorized with vector predication. In this + case the vector loop is known to handle scalar iterations itself. + +If any of these conditions are true, the \"vector only\" mode is used. + +### Vector + Vector-predicated + +The vector + vector-predicated kernel combination is a special case +optimization of the default behaviour. + +If the pass detects both a vector and vector-predicated kernel linked to +the same original kernel with the same vectorization width, the scalar +tail loop is replaced with a straight-line call to the vector-predicated +kernel, which will perform all of the scalar iterations at once. + +```cpp +const size_t peel = group_size_x % vec_width; +const size_t peel_limit = group_size_x - peel; + +if (group_size_x >= vector_width) { + for (size_t z = 0; z < group_size_z; ++z) { + for (size_t y = 0; y < group_size_y; ++y) { + for (size_t wi = 0; wi < peel_limit; wi += vec_width) { + // run vectorized kernel if vec_width > 1, + } + if (peel) { + // run vector-predicated kernel + } + } + } +} +``` + +### Vector only + +If the [WorkItemLoopsPass](#workitemloopspass) is run on a vectorized +kernel for which no [vecz](SYCLNativeCPUVecz.md) linking metadata is found to +identify the scalar kernel, or if a scalar kernel is found but one of +the conditions listed above hold, then the kernel is emitted using the +vector kernel only. It is assumed that if no scalar kernel is found it +is because targets know that one is not required. + +### Scalar only + +If the [WorkItemLoopsPass](#workitemloopspass) is run on a scalar kernel +then only the scalar kernel is used. + +# OptimalBuiltinReplacementPass + +The `OptimalBuiltinReplacementPass` is an optimization call-graph pass designed +to replace calls to builtin functions with optimal equivalents. This is only +used in the [veczc](SYCLNativeCPUVecz.md#veczc---the-vecz-compiler) tool and +should probably be phased out here. + +The `OptimalBuiltinReplacementPass` iterates over the call graph from +kernels inwards to their called functions, and visits all call sites in +the caller functions. If a call is made to a function that the pass is +interested in, the call is deleted and is replaced with a series of +inline IR instructions. Using the call graph guarantees that +replacements are made on a priority basis; outermost functions are +replaced before any functions they themselves call. + +Replacements are optionally made according to a specific `BuiltinInfo` +object, which may be passed to this pass. It defaults to `nullptr`. 
If +this `BuiltinInfo` is present then it is asked whether it recognizes any +builtin functions and is tasked with inlining a suitable sequence of +instructions. + +Replacements are also performed on two abacus-internal builtins: +`__abacus_clz` and `__abacus_mul_hi`. Replacing these rather than their +OpenCL user-facing builtins allows replacements in more cases, as the +abacus versions are used to implement several other builtin functions. + +The `__abacus_clz` builtin \-- count leading zeros \-- can be exchanged +for a hardware intrinsic: `llvm.ctlz`. However, some variants are +skipped: 64-bit scalar and vector variants are skipped, since Arm uses +calls to an external function to help it implement this case. + +The `__abacus_mul_hi` builtin \-- multiplication returning the \"high\" +part of the product \-- can be exchanged for a shorter series of LLVM +instructions which perform the multiplication in a wider type before +shifting it down. This is desirable because abacus has a rule that it +never introduces larger types in its calculations. LLVM, however, is +able to match a specific sequence of instructions against a \"mul hi\" +node, which is canonical, well-optimized, and many targets directly +lower that node to a single instruction. 64-bit versions (scalar and +vector) are skipped since 64-bit \"mul hi\" and 128-bit integers are not +well supported on all targets. + +The `__abacus_fmin` and `__abacus_fmax` builtins can be exchanged for +hardware intrinsics: `llvm.minnum` and `llvm.maxnum`. This is not +performed on ARM targets due to LLVM backend compiler bugs. + +# RunVeczPass + +The `RunVeczPass` module pass provides a wrapper for using our +[vecz](SYCLNativeCPUVecz.md) IR vectorizer. This vectorizes the +kernel to a SIMD width specified when the pass is created. In our case +this is typically local size in the first dimension but there are other +factors to consider when picking the width, like being a power of 2. + +We only enable the vectorizer in host when the `-cl-wfv={always|auto}` +option is provided, a condition check which is the first thing this pass +does. If this check fails, the pass exits early, otherwise the +vectorizer is invoked through top level API +`vecz::Vectorizer::vectorize`. If the passed option is `-cl-wfv=auto`, +then we first have to check the layout of the input kernel to find out +if it is advantageous to vectorize it, and only do so if it is the case. +If the passed option is `-cl-wfv=always`, then we will try to vectorize +the kernel in any case. If successful, this will return a new vectorized +kernel function created in the LLVM module so that this vectorized +kernel is used instead of our scalar kernel from here on. + +## Cost Model Interface + +User cost-modelling in vecz can be handled by the +`vecz::VeczPassOptionsAnalsis` which takes a user defined query function +on construction. This pass is a required analysis pass for vecz, so be +sure to add it to your analysis manager. + +Vecz queries the result of this analysis before operating on a kernel, +and the user function may fill an array of `VeczPassOptions` which +contain suitably modelled widths, vectorization factors, and scalability +options determined suitable for the target. + +The `VeczPassOptionsAnalysis` pass can be default-constructed - in which +case vecz makes a conservative decision about kernel vectorization - or +be constructed passing in a user callback function. 
The function takes +as its parameters a reference to the function to be optionally +vectorized, and a reference to a vector of `VeczPassOptions` which it is +expected to fill in. + +If it\'s not interested in seeing the function vectorized, it returns +false; otherwise it fills in the `VeczPassOptions` array with the +choicest vectorization options it can muster for the target. For +example: + +```cpp +void InitMyAnalysisManager(llvm::ModuleAnalysisManager &MAM) { + MyCostModel CM; + MAM.registerPass([CM] { + return vecz::VeczPassOptionsAnalysis( + [CM](llvm::Function &F, + llvm::SmallVectorImpl &Opts) { + if (CM->getCostWFV(&F) > 0) { + // Vectorizing will make things worse, so don't + return false; + } + VeczPassOptions O; + vecz::VectorizationChoices &choices = O.choices; + if (!MyCostModel->hasDoubles()) { + choices.enable(eCababilityNoDoubleSupport); + } + if (CM->getCostPartialScalarization(&F) < 0) { + choices.enable(vecz::VectorizationChoices::ePartialScalarization); + } + if (CM->getCostBOSCC(&F) < 0) { + choices.enable(vecz::VectorizationChoices::eLinearizeBOSCC); + } + // Our silly target only has 42-wide SIMD units! + opts.factor = Vectorization::getFixedWidth(42); + Opts.emplace_back(std::move(O)); + return true; + }); + }); +} +``` + +To access the `VeczPassOptionsAnalysis` from inside any other pass in +the same pass manager, do the following: + +```cpp +auto queryPassOpts = getAnalysis(); +``` + +The above returns a pointer to the cost model the wrapper pass was +constructed with, and may return `nullptr` if no cost model was +provided. + +The Cost Model header file resides at `utils/cost_model.h`. + +# DefineMuxBuiltinsPass + +The `DefineMuxBuiltinsPass` performs a scan over all functions in the +module, calling `BuiltinInfo::defineMuxBuiltin` on all mux builtin +function declarations. + +If a definition of a mux builtin requires calls to other mux builtins +which themselves need defining, such dependencies can be added to the +end of the module\'s list of functions so that the +`DefineMuxBuiltinsPass` will visit those in turn. One example of this is +the lowering of `__mux_get_global_id` which calls `__mux_get_local_id`, +among other functions. + +# ReplaceLocalModuleScopeVariablesPass + +The `ReplaceLocalModuleScopeVariables` pass identifies global variables +in the local address space and places them in a struct called +`localVarTypes`, allocated in a newly created wrapper function. A +pointer to the struct is then passed via a parameter to the original +kernel. The wrapper function takes over function attributes and metadata +from the original kernel. + +When creating the struct we need to be aware of the alignment of members +so that they are OpenCL conformant for their type. To do this we +manually pad the struct by keeping track of each elements offset and +adding byte array entries for padding to meet alignment requirements. +Finally the whole struct is aligned to the largest member alignment +found. + +Once the struct is created the pass replaces all instructions using each +of the global variables identified in the previous step with +instructions referencing the matching struct member instead. Finally the +identified global variables are removed once all of their uses have been +replaced. + +# PrepareBarriersPass + +The `PrepareBarriersPass` is useful in order to satisfy the requirements +the [WorkItemLoopsPass](#workitemloopspass) has on kernels containing +barrier-like functions if running in conjunction with the +[RunVeczPass](#runveczpass). 
If running, it should be run before using +the vectorizer. + +It ensures that barriers are synchronized between two or more vectorized +versions of the same kernel. It gives each barrier a unique ID, which +the vectorizer preserves in each vectorized kernel, meaning the +`WorkItemLoopsPass` can correctly schedule the work-item loops for each +barrier region. + +# Metadata Utilities + +There are several key pieces of metadata used for inter-communication +between the Native CPU passes. + +In order to avoid hard-coding assumptions about the metadata\'s names, +number of operands, types of operands, etc., utility functions +**should** be used to access or manipulate the metadata. The specific +names and/or operands of these metadata is **not** guaranteed to be +stable. + +# Attribute Utilities + +There are several key attributes used for inter-communication between +the Native CPU passes. + +The +`compiler_passes/compiler_pipeline/include/compiler/utils/attributes.h` +header contains all such APIs, several of which are given here by way of +example: + +- `void setIsKernel(llvm::Function &F)` + - Adds the `mux-kernel` attribute to function `F`. +- `void setIsKernelEntryPt(llvm::Function &F)` + - Adds `"mux-kernel"="entry-point"` attribute to function `F` +- `bool isKernel(const llvm::Function &F)` + - Returns true if function `F` has a `mux-kernel` attribute +- `bool isKernelEntryPt(const llvm::Function &F)` + - Returns true if function `F` has a `mux-kernel` attribute with + the value `"entry-point"`. +- `void dropIsKernel(llvm::Function &F)` + - Drops the `mux-kernel` attribute from function `F`, if present. +- `void takeIsKernel(llvm::Function &ToF, llvm::Function &FromF)` + - Transfers `mux-kernel` attributes from function `FromF` to + function `ToF`, if present on the old function. + Overwrites any such metadata in the new function. + +# Sub-groups + +A implementation of SPIR-V sub-group builtins is provided by the +default compiler pipeline. + +The SPIR-V sub-group builtins are first translated into the +corresponding Native CPU builtin functions. These functions are +understood by the rest of the compiler and can be identified and +analyzed by the `BuiltinInfo` analysis. + +A definition of these mux builtins for where the sub-group size is 1 is +provided by `BIMuxInfoConcept` used by the +[DefineMuxBuiltinsPass](#definemuxbuiltinspass). + +Vectorized definitions of the various sub-group builtins are provided by +the VECZ pass, so any target running VECZ (and the above passes) will be +able to support sub-groups of a larger size than 1. Note that VECZ does +not currently interact \"on top of\" the mux builtins - it replaces them +in the functions it vectorized. This is future work to allow the two to +build on top of each other. + +If a target wishes to provide their own sub-group implementation they +should provide a derived `BIMuxInfoConcept` and override +`defineMuxBuiltin` for the sub-group builtins. + +# LLVM intermediate representation + +## Mangling + +Mangling is used by the vectorizer to declare, define and use internal +overloaded builtin functions. In general, the mangling scheme follows [Appendix +A of the SPIR 1.2 +specification](https://www.khronos.org/registry/SPIR/specs/spir_spec-1.2.pdf). +itself an extension of the Itanium C++ mangling scheme. + +## Vector Types + +The Itanium specification under-specifies vector types in general, so vendors +are left to establish their own system. 
In the vectorizer, fixed-length vector +types follow the convention that LLVM, GCC, ICC and others use. The first +component is ``Dv`` followed by the number of elements in the vector, followed by +an underscore (\ ``_``\ ) and then the mangled element type: + +``` llvm + <2 x i32> -> Dv2_i + <32 x double> -> Dv32_d +``` + +Scalable-vector IR types do not have an established convention. Certain vendors +such as ARM SVE2 provide scalable vector types at the C/C++ language level, but +those are mangled in a vendor-specific way. + +The vectorizer chooses its own mangling scheme using the Itanium +vendor-extended type syntax, which is ``u``\ , followed by the length of the +mangled type, then the mangled type itself. + +Scalable-vectors are first mangled with ``nx`` to indicate the scalable +component. The next part is an integer describing the known multiple of the +scalable component. Lastly, the element type is mangled according to the +established vectorizer mangling scheme (i.e. Itanium). + +Example: + +``` llvm + -> u5nxv1j + -> u5nxv2f + -> u6nxv16d + -> u11nxv4PU3AS1j + + define void @__vecz_b_interleaved_storeV_Dv16_dPU3AS1d(<16 x double> %0, double addrspace(1)* %1, i64 %2) { + define void @__vecz_b_interleaved_storeV_u6nxv16dPU3AS1d( %0, double addrspace(1)* %1, i64 %2) { +``` + +# Builtins + +The Following intermediate representations are used in the interface to Native CPU. Some of these may not be relevant for Native CPU, and may exist from the time this was part of the `oneAPI Construction Kit`. + +* ``size_t __mux_get_global_size(i32 %i)`` - Returns the number of global + invocations for the ``%i``'th dimension. +* ``size_t __mux_get_global_id(i32 %i)`` - Returns the unique global + invocation identifier for the ``%i``'th dimension. +* ``size_t __mux_get_global_offset(i32 %i)`` - Returns the global offset (in + invocations) for the ``%i``'th dimension. +* ``size_t __mux_get_local_size(i32 %i)`` - Returns the number of local + invocations within a work-group for the ``%i``'th dimension. +* ``size_t __mux_get_local_id(i32 %i)`` - Returns the unique local invocation + identifier for the ``%i``'th dimension. +* ``i32 __mux_get_sub_group_id()`` - Returns the sub-group ID. +* ``size_t __mux_get_num_groups(i32 %i)`` - Returns the number of work-groups + for the ``%i``'th dimension. +* ``i32 __mux_get_num_sub_groups()`` - Returns the number of sub-groups for + the current work-group. +* ``i32 __mux_get_max_sub_group_size()`` - Returns the maximum sub-group size + in the current kernel. +* ``i32 __mux_get_sub_group_size()`` - Returns the number of invocations in the + sub-group. +* ``i32 __mux_get_sub_group_local_id()`` - Returns the unique invocation ID + within the current sub-group. +* ``size_t __mux_get_group_id(i32 %i)`` - Returns the unique work-group + identifier for the ``%i``'th dimension. +* ``i32 __mux_get_work_dim()`` - Returns the number of dimensions in + use. +* ``__mux_dma_event_t __mux_dma_read_1D(ptr address_space(3) %dst,`` + ``ptr address_space(1) %src, size_t %width, __mux_dma_event_t %event)`` - DMA + 1D read from ``%src`` to ``%dst`` of ``%width`` bytes. May use ``%event`` + from previous DMA call. Returns event used. 
+* ``__mux_dma_event_t __mux_dma_read_2D(ptr address_space(3) %dst,``
+  ``ptr address_space(1) %src, size_t %width, size_t %dst_stride,``
+  ``size_t %src_stride, size_t %height, __mux_dma_event_t %event)`` - DMA 2D
+  read from ``%src`` to ``%dst`` of ``%width`` bytes and ``%height`` rows, with
+  ``%dst_stride`` bytes between dst rows and ``%src_stride`` bytes between src
+  rows. May use ``%event`` from previous DMA call. Returns event used.
+* ``__mux_dma_event_t __mux_dma_read_3D(ptr address_space(3) %dst,``
+  ``ptr address_space(1) %src, size_t %width, size_t %dst_line_stride,``
+  ``size_t %src_line_stride, size_t %height, size_t %dst_plane_stride,``
+  ``size_t %src_plane_stride, size_t %depth, __mux_dma_event_t %event)`` - DMA
+  3D read from ``%src`` to ``%dst`` of ``%width`` bytes, ``%height`` rows, and
+  ``%depth`` planes, with ``%dst_line_stride`` bytes between dst rows,
+  ``%src_line_stride`` bytes between src rows, ``%dst_plane_stride`` bytes
+  between dst planes, and ``%src_plane_stride`` between src planes. May use
+  ``%event`` from previous DMA call. Returns event used.
+* ``__mux_dma_event_t __mux_dma_write_1D(ptr address_space(1) %dst,``
+  ``ptr address_space(3) %src, size_t %width, __mux_dma_event_t %event)`` - DMA
+  1D write from ``%src`` to ``%dst`` of ``%width`` bytes. May use ``%event``
+  from previous DMA call. Returns event used.
+* ``__mux_dma_event_t __mux_dma_write_2D(ptr address_space(1) %dst,``
+  ``ptr address_space(3) %src, size_t %width, size_t %dst_stride,``
+  ``size_t %src_stride, size_t %height, __mux_dma_event_t %event)`` - DMA 2D
+  write from ``%src`` to ``%dst`` of ``%width`` bytes and ``%height`` rows,
+  with ``%dst_stride`` bytes between dst rows and ``%src_stride`` bytes between
+  src rows. May use ``%event`` from previous DMA call. Returns event used.
+* ``__mux_dma_event_t __mux_dma_write_3D(ptr address_space(1) %dst,``
+  ``ptr address_space(3) %src, size_t %width, size_t %dst_line_stride,``
+  ``size_t %src_line_stride, size_t %height, size_t %dst_plane_stride,``
+  ``size_t %src_plane_stride, size_t %depth,``
+  ``__mux_dma_event_t %event)`` - DMA 3D write from ``%src`` to ``%dst`` of
+  ``%width`` bytes, ``%height`` rows, and ``%depth`` planes, with
+  ``%dst_line_stride`` bytes between dst rows, ``%src_line_stride`` bytes
+  between src rows, ``%dst_plane_stride`` bytes between dst planes, and
+  ``%src_plane_stride`` between src planes. May use ``%event`` from previous DMA
+  call. Returns event used.
+* ``void __mux_dma_wait(i32 %num_events, __mux_dma_event_t*)`` - Wait on
+  events initiated by a DMA read or write.
+* ``size_t __mux_get_global_linear_id()`` - Returns a linear ID equivalent
+  to ``(__mux_get_global_id(2) - __mux_get_global_offset(2)) *``
+  ``__mux_get_global_size(1) * __mux_get_global_size(0) +``
+  ``(__mux_get_global_id(1) - __mux_get_global_offset(1)) *``
+  ``__mux_get_global_size(0) + (__mux_get_global_id(0) -``
+  ``__mux_get_global_offset(0))``.
+* ``size_t __mux_get_local_linear_id(void)`` - Returns a linear ID equivalent
+  to ``__mux_get_local_id(2) * __mux_get_local_size(1) *``
+  ``__mux_get_local_size(0) + __mux_get_local_id(1) * __mux_get_local_size(0)``
+  ``+ __mux_get_local_id(0)``.
+* ``size_t __mux_get_enqueued_local_size(i32 i)`` - Returns the enqueued
+  work-group size in the ``i``'th dimension; for uniform work-groups this is
+  equivalent to ``size_t __mux_get_local_size(i32 %i)``.
+* ``void __mux_mem_barrier(i32 %scope, i32 %semantics)`` - Controls the order + that memory accesses are observed (serves as a fence instruction). This + control is only ensured for memory accesses issued by the invocation calling + the barrier and observed by another invocation executing within the memory + ``%scope``. Additional control over the kind of memory controlled and what + kind of control to apply is provided by ``%semantics``. See `below + <#memory-and-control-barriers>`__ for more information. +* ``void __mux_work_group_barrier(i32 %id, i32 %scope, i32 %semantics)`` and + ``void __mux_sub_group_barrier(i32 %id, i32 %scope, i32 %semantics)`` - Wait + for other invocations of the work-group/sub-group to reach the current point + of execution (serves as a control barrier). A barrier identifier is provided + by ``%id`` (note that implementations **must** ensure uniqueness themselves, + e.g., by running the ``compiler::utils::PrepareBarriersPass``). These + builtins may also atomically provide a memory barrier with the same semantics + as ``__mux_mem_barrier(i32 %scope, i32 %semantics)``. See `below + <#memory-and-control-barriers>`__ for more information. + +## Group operation builtins + +Native CPU defines a variety of builtins to handle operations across a +sub-group, work-group, or *vector group*. + +The builtin functions are overloadable and are mangled according to the type of +operand they operate on. + +Each *work-group* operation takes as its first parameter a 32-bit integer +barrier identifier (``i32 %id``). Note that if barriers are used to implement +these operations, implementations **must** ensure uniqueness of these IDs +themselves, e.g., by running the ``compiler::utils::PrepareBarriersPass``. The +barrier identifier parameter is not mangled. + +> [!NOTE] +> The sub-group and work-group builtins are all **uniform**, that is, the +> behaviour is undefined unless all invocations in the group reach this point +> of execution. + + Future versions of Native CPU **may** add **non-uniform** versions of these + builtins. + +The groups are defined as: + +* ``work-group`` - a group of invocations running together as part of an ND + range. These builtins **must** only take scalar values. +* ``sub-group`` - a subset of invocations in a work-group which can synchronize + and share data efficiently. Native CPU leaves the choice of sub-group size + and implementation to the target; Native CPU only defines these builtins with + a "trivial" sub-group size of 1. These builtins **must** only take scalar + values. +* ``vec-group`` - a software level group of invocations processing data in + parallel *on a single invocation*. This allows the compiler to simulate a + sub-group without any hardware sub-group support (e.g., through + vectorization). These builtins **may** take scalar *or vector* values. The + scalar versions of these builtins are essentially identical to the + corresponding ``sub-group`` builtins with a sub-group size of 1. + + +### ``any``/``all`` builtins + +The ``any`` and ``all`` builtins return ``true`` if any/all of their operands +are ``true`` and ``false`` otherwise. + +```llvm + i1 @__mux_sub_group_any_i1(i1 %x) + i1 @__mux_work_group_any_i1(i32 %id, i1 %x) + i1 @__mux_vec_group_any_v4i1(<4 x i1> %x) +``` + +### ``broadcast`` builtins + +The ``broadcast`` builtins broadcast the value corresponding to the local ID to +the result of all invocations in the group. 
The sub-group version of this +builtin takes an ``i32`` sub-group linear ID to identify the invocation to +broadcast, and the work-group version take three ``size_t`` indices to locate +the value to broadcast. Unused indices (e.g., in lower-dimension kernels) +**must** be set to zero - this is the same value returned by +``__mux_get_global_id`` for out-of-range dimensions. + +```llvm + i64 @__mux_sub_group_broadcast_i64(i64 %val, i32 %sg_lid) + i32 @__mux_work_group_broadcast_i32(i32 %id, i32 %val, i64 %lidx, i64 %lidy, i64 %lidz) + i64 @__mux_vec_group_broadcast_v2i64(<2 x i64> %val, i32 %vec_id) +``` + +### ``reduce`` and ``scan`` builtins + +The ``reduce`` and ``scan`` builtins return the result of the group operation +for all values of their parameters specified by invocations in the group. + +Scans may be either ``inclusive`` or ``exclusive``. Inclusive scans perform the +operation over all invocations in the group. Exclusive scans perform the +operation over the operation's identity value and all but the final invocation +in the group. + +The group operation may be specified as one of: + +* ``add``/``fadd`` - integer/floating-point addition. +* ``mul``/``fmul`` - integer/floating-point multiplication. +* ``smin``/``umin``/``fmin`` - signed integer/unsigned integer/floating-point minimum. +* ``smax``/``umax``/``fmax`` - signed integer/unsigned integer/floating-point maximum. +* ``and``/``or``/``xor`` - bitwise ``and``/``or``/``xor``. +* ``logical_and``/``logical_or``/``logical_xor`` - logical ``and``/``or``/``xor``. + +Examples: + +```llvm + i32 @__mux_sub_group_reduce_add_i32(i32 %val) + i32 @__mux_work_group_reduce_add_i32(i32 %id, i32 %val) + float @__mux_work_group_reduce_fadd_f32(i32 %id, float %val) + + i32 @__mux_sub_group_scan_inclusive_mul_i32(i32 %val) + i32 @__mux_work_group_scan_inclusive_mul_i32(i32 %id, i32 %val) + float @__mux_work_group_scan_inclusive_fmul_f32(i32 %id, float %val) + + i64 @__mux_sub_group_scan_exclusive_mul_i64(i64 %val) + i64 @__mux_work_group_scan_exclusive_mul_i64(i32 %id, i64 %val) + double @__mux_work_group_scan_exclusive_fmul_f64(i32 %id, double %val) + + i64 @__mux_vec_group_scan_exclusive_mul_nxv1i64( %val) +``` + + +### Sub-group ``shuffle`` builtin + +The ``sub_group_shuffle`` builtin allows data to be arbitrarily transferred +between invocations in a sub-group. The data that is returned for this +invocation is the value of ``%val`` for the invocation identified by ``%lid``. + +``%lid`` need not be the same value for all invocations in the sub-group. + +```llvm + i32 @__mux_sub_group_shuffle_i32(i32 %val, i32 %lid) +``` + +### Sub-group ``shuffle_up`` builtin + +The ``sub_group_shuffle_up`` builtin allows data to be transferred from an +invocation in the sub-group with a lower sub-group local invocation ID up to an +invocation in the sub-group with a higher sub-group local invocation ID. + +The builtin has two operands: ``%prev`` and ``%curr``. To determine the result +of this builtin, first let ``SubgroupLocalInvocationId`` be equal to +``__mux_get_sub_group_local_id()``, let the signed shuffle index be equivalent +to this invocation’s ``SubgroupLocalInvocationId`` minus the specified +``%delta``, and ``MaxSubgroupSize`` be equal to +``__mux_get_max_sub_group_size()`` for the current kernel. + +* If the shuffle index is greater than or equal to zero and less than the + ``MaxSubgroupSize``, the result of this builtin is the value of the ``%curr`` + operand for the invocation with ``SubgroupLocalInvocationId`` equal to the + shuffle index. 
+
+* If the shuffle index is less than zero but greater than or equal to the
+  negative ``MaxSubgroupSize``, the result of this builtin is the value of the
+  ``%prev`` operand for the invocation with ``SubgroupLocalInvocationId`` equal
+  to the shuffle index plus the ``MaxSubgroupSize``.
+
+All other values of the shuffle index are considered to be out-of-range.
+
+``%delta`` need not be the same value for all invocations in the sub-group.
+
+```llvm
+  i8 @__mux_sub_group_shuffle_up_i8(i8 %prev, i8 %curr, i32 %delta)
+```
+
+### Sub-group ``shuffle_down`` builtin
+
+The ``sub_group_shuffle_down`` builtin allows data to be transferred from an
+invocation in the sub-group with a higher sub-group local invocation ID down to
+an invocation in the sub-group with a lower sub-group local invocation ID.
+
+The builtin has two operands: ``%curr`` and ``%next``. To determine the result
+of this builtin, first let ``SubgroupLocalInvocationId`` be equal to
+``__mux_get_sub_group_local_id()``, the unsigned shuffle index be equivalent to
+the sum of this invocation’s ``SubgroupLocalInvocationId`` plus the specified
+``%delta``, and ``MaxSubgroupSize`` be equal to
+``__mux_get_max_sub_group_size()`` for the current kernel.
+
+* If the shuffle index is less than the ``MaxSubgroupSize``, the result of this
+  builtin is the value of the ``%curr`` operand for the invocation with
+  ``SubgroupLocalInvocationId`` equal to the shuffle index.
+
+* If the shuffle index is greater than or equal to the ``MaxSubgroupSize`` but
+  less than twice the ``MaxSubgroupSize``, the result of this builtin is the
+  value of the ``%next`` operand for the invocation with
+  ``SubgroupLocalInvocationId`` equal to the shuffle index minus the
+  ``MaxSubgroupSize``.
+
+All other values of the shuffle index are considered to be out-of-range.
+
+``%delta`` need not be the same value for all invocations in the sub-group.
+
+```llvm
+  float @__mux_sub_group_shuffle_down_f32(float %curr, float %next, i32 %delta)
+```
+
+### Sub-group ``shuffle_xor`` builtin
+
+The ``sub_group_shuffle_xor`` builtin allows for efficient sharing of data
+between items within a sub-group.
+
+The data that is returned for this invocation is the value of ``%val`` for the
+invocation with sub-group local ID equal to this invocation’s sub-group local
+ID XOR’d with the specified ``%xor_val``. If the result of the XOR is greater
+than the current kernel's maximum sub-group size, then it is considered
+out-of-range.
+
+```llvm
+  double @__mux_sub_group_shuffle_xor_f64(double %val, i32 %xor_val)
+```
+
+### Memory and Control Barriers
+
+The mux barrier builtins synchronize both memory and execution flow.
+
+The specific semantics with which they synchronize are defined using the
+following enums.
+
+The ``%scope`` parameter defines which other invocations observe the memory
+ordering provided by the barrier. Only one of the values may be chosen
+simultaneously.
+
+```cpp
+  enum MemScope : uint32_t {
+    MemScopeCrossDevice = 0,
+    MemScopeDevice = 1,
+    MemScopeWorkGroup = 2,
+    MemScopeSubGroup = 3,
+    MemScopeWorkItem = 4,
+  };
+```
+
+The ``%semantics`` parameter defines the kind of memory affected by the
+barrier, as well as the ordering constraints. Only one of the possible
+**ordering**s may be chosen simultaneously. The **memory** field is a
+bitfield.
+
+```cpp
+  enum MemSemantics : uint32_t {
+    // The 'ordering' to apply to a barrier. A barrier may only set one of the
+    // following at a time:
+    MemSemanticsRelaxed = 0x0,
+    MemSemanticsAcquire = 0x2,
+    MemSemanticsRelease = 0x4,
+    MemSemanticsAcquireRelease = 0x8,
+    MemSemanticsSequentiallyConsistent = 0x10,
+    MemSemanticsMask = 0x1F,
+    // What kind of 'memory' is controlled by a barrier. Acts as a bitfield, so
+    // a barrier may, e.g., synchronize sub-group, work-group and cross
+    // work-group memory simultaneously.
+    MemSemanticsSubGroupMemory = 0x80,
+    MemSemanticsWorkGroupMemory = 0x100,
+    MemSemanticsCrossWorkGroupMemory = 0x200,
+  };
+```
+
+### Atomics and Fences
+
+The LLVM intermediate representation stored in
+``compiler::BaseModule::finalized_llvm_module`` **may** contain any of the
+following atomic instructions:
+
+* [`cmpxchg`](https://llvm.org/docs/LangRef.html#cmpxchg-instruction) for the [monotonic ordering](https://llvm.org/docs/LangRef.html#ordering) with *strong* semantics only
+* [`atomicrmw`](https://llvm.org/docs/LangRef.html#atomicrmw-instruction) for the following opcodes: ``add``, ``and``, ``sub``, ``min``,
+  ``max``, ``umin``, ``umax``, ``or``, ``xchg``, ``xor`` for the
+  [monotonic ordering](https://llvm.org/docs/LangRef.html#ordering) only
+
+A compiler **shall** correctly legalize or select these instructions to ISA
+specific operations.
+
+The LLVM intermediate representation stored in
+``compiler::BaseModule::finalized_llvm_module`` **may** also contain any of the
+following atomic instructions:
+
+* [cmpxchg](https://llvm.org/docs/LangRef.html#cmpxchg-instruction) for the [monotonic ordering](https://llvm.org/docs/LangRef.html#ordering) with *weak* semantics
+* [load](https://llvm.org/docs/LangRef.html#load-instruction) with the instruction marked as *atomic* for the [monotonic ordering](https://llvm.org/docs/LangRef.html#ordering)
+  only
+* [store](https://llvm.org/docs/LangRef.html#store-instruction) with the instruction marked as *atomic* for the [monotonic ordering](https://llvm.org/docs/LangRef.html#ordering)
+  only
+* [fence](https://llvm.org/docs/LangRef.html#fence-instruction) for the [acquire
+  ordering](https://llvm.org/docs/LangRef.html#ordering), [release
+  ordering](https://llvm.org/docs/LangRef.html#ordering) and [acq_rel
+  ordering](https://llvm.org/docs/LangRef.html#ordering) only.
+
+The atomic instructions listed above **shall not** have a
+[syncscope](https://llvm.org/docs/LangRef.html#syncscope) argument.
+
+No lock-free requirements are made on the above atomic instructions. A target
+**may** choose to provide a software implementation of the atomic instructions
+via some other mechanism such as a hardware mutex.
+
+## Metadata
+
+The following table describes metadata which can be introduced at different stages of the
+pipeline:
+
+ | Name | Fields | Description |
+ |------|--------|-------------|
+ |``!reqd_work_group_size``|i32, i32, i32|Required work-group size encoded as *X*, *Y*, *Z*. If not present, no required size is assumed.|
+ |``!max_work_dim``| i32 | Maximum dimension used for work-items. If not present, ``3`` is assumed.|
+ |``!codeplay_ca_wrapper``|various (incl. *vectorization options*)|Information about a *kernel entry point* regarding its work-item iteration over *sub-kernels* as stitched together by the ``WorkItemLoopsPass`` pass in the ``compiler::utils`` module. Typically this involves the loop structure, the vectorization width and options of each loop.|
+ |``!codeplay_ca_vecz.base``|*vectorization options*, ``Function*``| Links one function to another, indicating that the function acts as the *base* - or *source* - of vectorization with the given vectorization options, and the linked function is the result of a *successful* vectorization. A function may have *many* such pieces of metadata, if it was vectorized multiple times.|
+ |``!codeplay_ca_vecz.derived``|*vectorization options*, ``Function*``| Links one function to another, indicating that the function is the result of a *successful* vectorization with the given vectorization options, using the linked function as the *base* - or *source* - of vectorization. A function may only have **one** such piece of metadata.|
+ |``!codeplay_ca_vecz.base.fail``|*vectorization options*| Metadata indicating a *failure* to vectorize with the provided vectorization options.|
+ |``!mux_scheduled_fn``|i32, i32(, i32, i32)?| Metadata indicating the function parameter indices of the pointers to MuxWorkItemInfo and MuxWorkGroupInfo structures, respectively. A negative value (canonicalized as -1) indicates the function has no such parameter. Up to two additional custom parameter indices can be used by targets.|
+ |``!intel_reqd_sub_group_size``|i32|Required sub-group size encoded as a 32-bit integer. If not present, no required sub-group size is assumed.|
+
+Users **should not** rely on the name, format, or operands of these metadata.
+Instead, utility functions are provided by the ``utils`` module to work with
+accessing, setting, or updating each piece of metadata.
+
+> [!NOTE]
+> The metadata above which refer to *vectorization options* have no concise
+> metadata form as defined by the specification and **are not** guaranteed to
+> be backwards compatible. See the C++ utility APIs in the ``utils`` module as
+> described above for the specific information encoded/decoded by
+> vectorization.
+
+ | Name | Fields | Description |
+ |------|--------|-------------|
+ |``!mux-scheduling-params``|string, string, ...| A list of scheduling parameter names used by this target. Emitted into the module at the time scheduling parameters are added to functions that require them. The indices found in ``!mux_scheduled_fn`` function metadata are indices into this list.|
+
+## Function Attributes
+
+The following table describes function attributes which can be introduced at
+different stages of the pipeline:
+
+ | Attribute | Description |
+ |------------------|-------------|
+ |``"mux-kernel"/"mux-kernel"="x"``| Denotes a *"kernel"* function. Additionally denotes a *"kernel entry point"* if the value is ``"entry-point"``. See [mux-kernel](#mux-kernel-attribute) below for more details. |
+ |``"mux-orig-fn"="val"``| Denotes the name of the *"original function"* of a function. This original function may or may not exist in the module. The original function name is propagated through the compiler pipeline each time Native CPU creates a new function to wrap or replace a function. |
+ |``"mux-base-fn-name"="val"``| Denotes the *"base name component"* of a function. Used by several passes when creating new versions of a kernel, rather than appending suffix upon suffix.|
+
+ For example, a pass that suffixes newly-created functions with
+ ``".pass2"`` will generate ``@foo.pass1.pass2`` when given function
+ ``@foo.pass1``, but will generate simply ``@foo.pass2`` if the same
+ function has ``"mux-base-fn-name"="foo"``.
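+
+As a rough sketch (the function name, parameter and attribute values here are
+illustrative only, not the output of any particular pass), these attributes
+might appear on a kernel entry point in the IR as follows:
+
+```llvm
+; A kernel entry point partway through the pipeline: "mux-kernel"="entry-point"
+; marks it as invocable from the runtime, "mux-orig-fn" records the kernel it
+; was derived from, and "mux-base-fn-name" stops suffixes from piling up.
+define void @foo.pass1(ptr %args) #0 {
+entry:
+  ret void
+}
+
+attributes #0 = { "mux-kernel"="entry-point" "mux-orig-fn"="foo" "mux-base-fn-name"="foo" }
+```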
+
+ | Attribute | Description |
+ |-----------|-------------|
+ |``"mux-local-mem-usage"="val"``| Estimated local-memory usage for the function. Value must be a positive integer. |
+ |``"mux-work-item-order"="val"``| Work-item order (the dimensions over which work-items are executed from innermost to outermost) as defined by the ``utils_work_item_order_e`` enum. If not present, ``"xyz"`` may be assumed. |
+ | ``"mux-barrier-schedule"="val"``| Typically found on call sites. Determines the ordering of work-item execution after a barrier. See the `BarrierSchedule` enum. |
+ | ``"mux-no-subgroups"``| Marks the function as not explicitly using sub-groups (e.g., identified by the use of known mux sub-group builtins). If a pass introduces the explicit use of sub-groups to a function, it should remove this attribute. |
+
+### mux-kernel attribute
+
+SYCL programs generally consist of a number of *kernel functions*, which
+have a certain programming model and may be a subset of all functions in the
+*module*.
+
+Native CPU compiler passes often need to identify kernel functions amongst
+other functions in the module. Further to this, a Native CPU implementation may
+know that an even smaller subset of kernels are in fact considered *kernels
+under compilation*. In the interests of compile-time it is not desirable to
+optimize kernels that are known to never run.
+
+Under this scheme, it is further possible to distinguish between kernels that
+are *entry points* and those that aren't. Entry points are kernels which may be
+invoked from the runtime. Other kernels in the module may only be run when
+invoked indirectly: called from kernel entry points.
+
+The `mux-kernel` function attribute is used to
+communicate *kernels under compilation* and *kernel entry points* (a subset of
+those) between passes. This approach has a myriad of advantages. It provides a
+stable, consistent kernel identification method which other data do not: names
+cannot easily account for new kernels introduced by optimizations like
+vectorization; calling conventions are often made target-specific at some point
+in the pipeline; pointers to functions are unstable when kernels are
+replaced/removed.
+
+Passes provided by Native CPU ensure this attribute is updated when adding,
+removing, or replacing kernel functions. Each Native CPU pass in its
+documentation lists whether it operates on *kernels* or *kernel entry points*,
+if applicable.
diff --git a/sycl/doc/design/SYCLNativeCPUVecz.md b/sycl/doc/design/SYCLNativeCPUVecz.md
new file mode 100644
index 0000000000000..9796188255b79
--- /dev/null
+++ b/sycl/doc/design/SYCLNativeCPUVecz.md
@@ -0,0 +1,1325 @@
+# Vecz Documentation
+
+Codeplay's Vecz is a library based on LLVM that allows vectorization of SPMD
+programs such as Native CPU kernels.
+
+Vecz is automatically built during the oneAPI Construction Kit build process
+(only if a runtime compiler is found) but needs to be manually enabled to be
+used during the kernel compilation process. This is done by providing the
+`-cl-wfv={always|auto}` option before running any Native CPU program.
+
+Vecz ships with a standalone tool called `veczc`. This tool consumes LLVM IR,
+in bitcode or textual format, and produces vectorized output.
+ +The code can be found +[here](https://github.com/coldav/llvm/tree/colin/add_native_cpu_pipeline_docs/llvm/lib/SYCLNativeCPUUtils/compiler_passes/vecz), +but is also derived from the [oneAPI Construction +Kit](https://github.com/uxlfoundation/oneapi-construction-kit/tree/main/modules/compiler/vecz), +where the original code can be found. + +It also comes with `lit` tests which can be built by configuring with +`-DNATIVE_CPU_BUILD_VECZ_TEST_TOOLS=ON` and run with `check-sycl-vecz`, + +## Design ideas + +Vecz's design is based on the automatic whole function vectorization research +by [Ralf Karrenberg][1], titled "Automatic SIMD Vectorization of SSA-based +Control Flow Graphs", and a combination of other papers referenced at various +places in this document. While the process followed by Vecz is not exactly the +same, understanding the research would help to understand Vecz better. + +## Supporting infrastructure + +Vecz relies on certain other classes and functions provided by +`compiler::utils`. The `BuiltinInfo` interface is used to ascertain certain +properties of Native CPU builtin functions such as whether a particular function +has a vector equivalent, or whether it is a work item ID query. Specific +implementations of the interface are provided for Native CPU. + +Before running vecz, it is recommended to run the +`compiler::utils::OptimalBuiltinReplacementPass`. This replaces certain builtin +calls with LLVM instructions or intrinsics that perform the equivalent +operation, which enables later optimizations to work with them (which applies +both to LLVM's own optimization passes that vecz runs, and to some of vecz's +own transform passes). Furthermore, it allows these builtins to be widened +arbitrarily, without being limited to the widths available as Native CPU builtins. + +If a target intends to use the `compiler::utils::WorkItemLoopsPass` after Vecz, +it is important to ensure that, **before vecz**, all calls to barrier-like +functions in the full nested kernel call-graph are given unique barrier IDs. +Note that this effectively mandates full inlining of all functions containing +barrier-like calls. + +This is necessary because vectorization can considerably affect control flow, +so the ordering of the barriers in the function may change. If the +`WorkItemLoopsPass` needs to combine two different versions of the same kernel +into a single scheduled kernel, it is vital that direct correspondence of the +barrier calls is maintained. + +Users can run the `compiler::utils::PrepareBarriersPass`, which satisfies these +requirements. + +The `vecz::RunVeczPass` does not delete the original scalar kernel after +vectorization, nor does it transfer the scalar kernel name to the vectorized +function. + +## Target specialization + +Vecz provides an interface, `vecz::TargetInfo`, that allows the vectorizer to make +target-dependent decisions. This is retrieved via an analysis pass: +`vecz::TargetInfoAnalysis`. Targets may override the `vecz::TargetInfo` +constructed by this Analysis. The interface has a default implementation, which +may be overridden. + +Targets can override: + +* Builder functions for all of the various forms of memory + operation that vecz can output: loads and stores in masked or unmasked forms; + contiguous access; interleaved access; scatter/gather access. Targets may want + to provide their own intrinsics for these operations, if they exist. 
+* Builder functions for special forms of vector shuffles on scalable vectors, + namely "inner" and "outer" broadcasts (meaning the duplication of each vector + element `n` times, and the duplication of the entire vector `n` times), as + well as vectorized scalable extract and insert instructions (which vectorize + to a pick or replacement of every nth element to/from a narrower vector). + Since there are no LLVM instructions to efficiently perform these operations + on scalable vectors, the default implementation involves writing to memory and + reading it back, which is likely to be suboptimal. +* Builder function to construct the vector length argument of predicated vector + instructions, on targets that support this feature. +* Functions to support the Interleaved Groups Combining Pass. +* A function to support the SIMD Width Analysis that returns the widest + vectorization factor possible for a given set of values. +* A function to compute the preferred vector width of a given scalar type. The + packetizer can use this to emit multiple vectors per scalar-kernel value, + instead of a single wider vector. The default implementation is based on the + bit width of a single vector register (from `llvm::TargetTransformInfo`). + +For full details, see the documentation of the `vecz/target_info.h` header file. + +## Vectorization process + +* Clone the kernel function +* Analyze the kernel function to determine if it has a pattern not prone to + being vectorized +* Run preparation passes +* Perform control-flow to data-flow conversion (to handle divergent CFGs) +* Scalarize the kernel if desired/needed +* Run pre-packetization "middle optimization" passes +* Determine the optimal SIMD width for the kernel +* Packetize the kernel +* Perform optimizations and cleanup +* Define internal builtins + +## Uniform Analysis + +Instructions in a kernel function may be uniform (i.e. they evaluate to the +same value or have the same side-effects on all lanes) or varying. Varying +instructions usually have a data dependency on either the global ID or the local +ID of the work-item executing the kernel. As an example, in the following kernel +the store to the global memory, as well as the calculation of the address for +the store, are varying instructions depending on the global ID of the work-item. +The multiplication of `in` by 2 on the other hand is uniform, since it is the +same across all the work-items. + +```c +kernel void fn(int in, global int *out) { + size_t tid = get_global_id(0); + in = in * 2; + out[tid] = in; +} +``` + +Assuming to vectorize on the x dimension, after packetization, functions calls +like `get_global_id(0)` or `get_local_id(0)` will return vectors of consecutive +indices, which will allow us to packetize the store into a vector store. + +On architectures that support both scalar and vector instructions we do not want +to packetize uniform instructions, as each vector lane will perform the same +operation on the same data and return the same result. Instead, we want to keep +uniform instructions scalar (i.e. keep the original instructions untouched), and +broadcast their value into a vector when necessary. + +In order to differentiate between the varying and the uniform instructions, +we have implemented an analysis pass called the `Uniform Analysis`. This +analysis starts by finding "vector roots" in the function. By "roots" we mean +instructions (usually function calls) that we know to be varying. 
For example, +work-item ID related functions, or packetized arguments in non-kernel functions +are some common vector roots. Each root and its users are recursively marked as +varying. Marking a value happens before marking its users, so that use cycles +(e.g. phi nodes) do not cause infinite recursion. The instructions remaining +after this process are considered to be uniform. + +In the previous example kernel, the vector root used is the call to the +`get_global_id(0)` function. Starting from that point and then recursively +going through its users and their users etc. We recursively mark the address +calculation (`getelementptr`) for the output and the store to the output as +varying too. The `in` value and its multiplication are not marked as varying +since they are not using any varying values, they are being used by one. We +should note here that under special cases, such as an `alloca` instruction that +is stored into by a varying store, we might mark some instructions as varying +just because they are used by a varying instruction but in the general case we +do not. + +> The relevant classes and functions can be found in +> `source/include/analysis/uniform_value_analysis.h` and +> `source/analysis/uniform_value_analysis.cpp`. + +## Stride Analysis + +Memory operations can access their data in several different patterns, as a +function of the work item ID, categorized as: + +* Uniform: data is accessed from the same address for all work items; +* Contiguous: data is accessed sequentially with no gaps; +* Strided: data is accessed linearly but spaced apart with a constant or + uniform stride; +* Divergent: data is accessed with no discernible pattern. + +The stride analysis traverses address computation expressions, to ascertain +which kind of memory access is required, computing any constant strides +encountered. Uniform but variable strides are not computed during the analysis, +because doing so would require creating new instructions in the function, which +is at odds with the idea of an analysis pass. However, it is usually sufficient +to know that access is linear, without needing to know its actual value. When a +transform pass wishes to make use of the actual value, it can call the +`manifest()` function of the analysis, which will traverse its internal dataset +and create any instructions required. Note that pointers to these instructions +will survive until the analysis is invalidated. + +This analysis uses the result of [Uniform Analysis](#uniform-analysis). + +> The relevant classes and functions can be found in +> `source/include/analysis/stride_analysis.h` and +> `source/analysis/stride_analysis.cpp`. + +## Packetization Analysis + +The packetizer needs to know which instructions require packetization in +advance, for optimal functioning. An instruction that has been marked as +varying by the [Uniform Analysis](#uniform-analysis) may or may not require +packetization, since some varying values will form an expression computing +the address of a contiguous or strided memory operation. Therefore, this +analysis starts at the function's vector leaves, and works backwards through +operands, recursively marking values for packetization. When a contiguous or +strided memory operation is encountered, its address operand is not +followed. This allows a more accurate estimation of packetization requirements +prior to actual packetization, which is useful for the +[SIMD Width Analysis](#simd-width-analysis). + +This analysis uses the result of [Stride Analysis](#stride-analysis). 
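+
+As an illustration (the kernel and value names below are hypothetical, and the
+exact decisions depend on the analyses described above), consider a contiguous
+store whose address is computed from the work-item ID:
+
+```llvm
+declare i64 @__mux_get_global_id(i32)
+
+define void @double_it(ptr addrspace(1) %in, ptr addrspace(1) %out) {
+entry:
+  %id = call i64 @__mux_get_global_id(i32 0)
+  %src = getelementptr inbounds i32, ptr addrspace(1) %in, i64 %id
+  ; The load is contiguous in %id, so it is marked for packetization but its
+  ; address operands (%src, %id) are not followed and may remain scalar.
+  %x = load i32, ptr addrspace(1) %src
+  ; %val feeds a vector leaf, so it is marked for packetization.
+  %val = mul i32 %x, 2
+  %dst = getelementptr inbounds i32, ptr addrspace(1) %out, i64 %id
+  ; Vector leaf: a contiguous store; again its address operand is not followed.
+  store i32 %val, ptr addrspace(1) %dst
+  ret void
+}
+```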
+ +> The relevant classes and functions can be found in +> `source/include/analysis/packetization_analysis.h` and +> `source/analysis/packetization_analysis.cpp`. + +## Control Flow Graph Analysis + +Another useful analysis we employ is the Control Flow Graph (CFG) Analysis. As +the name suggests, it analyzes the control flow graph to store useful +information about blocks and loops such as what loop does a block live in and +what are the lcssa values of a loop (values that are live through a loop). + +The analysis works by iterating over the basic blocks, create a tag for each +block, and if the block belongs in a loop, then create a tag for that loop (if +one does not yet exist) and mark the loop as owning the block. + +If in the process of visiting the blocks we encounter a divergent branch, then +we say that the CFG needs to be converted into a Data-Flow Graph (see the +[Control Flow Conversion](#control-flow-conversion-pass) section). + +> The relevant classes and functions can be found in +> `source/include/analysis/control_flow_analysis.h` and +> `source/analysis/control_flow_analysis.cpp`. + +## Divergence Analysis + +> Formally known as `Rewire Target Analysis`. + +This analysis is used to find all the information regarding divergence in a CFG. +It uses as a pre-requisite the [Uniform Analysis](#uniform-analysis) to know +which instructions are varying when we first evaluate the CFG. Those +instructions are the basis to find divergent branches, which are branches whose +operands are varying. We name blocks having such instructions `div_causing`. +The analysis works by iterating over all the branches in the CFG until no more +new `div_causing` blocks are found. +When we find a `div_causing` block, we first compute the divergent path that +this block creates. This divergent path contains all the blocks from the +`div_causing` block to the post dominator of the latter. All the blocks that +belong in a divergent path may need to have their instructions marked varying, +in the case where they might be used outside the divergent path and thus need +to be packetized. +We then find all the join points of the `div_causing` block. Such blocks have +disjoint paths from the `div_causing` block. These blocks are called `blend` +blocks and will need to have their PHI nodes transformed into select +instructions because their path will be linearized. These blocks also need to +have their PHI nodes marked as varying. +After we have processed the `div_causing` block, we must find if this block +makes a loop divergent. A loop is divergent if it is possible that some work +items leave the loop at an exit, while others keep iterating. In essence, this +means that a divergent branch has no post-dominator in the loop. We also mark +all the exits of the loop where *some* work items may leave as divergent because +they will need to be linearized. +Finally, we compute `by_all` blocks which are blocks that need not be predicated +because no divergence is present when they are executed. + +> The relevant functions can be found in +> `source/include/analysis/divergence_analysis.h` and +> `source/analysis/divergence_analysis.cpp`. + +## Liveness Analysis + +This analysis is used to determine the set of live values at any point in the +program. A value becomes "live" at its definition and remains live until all of +its uses have been encountered. The result of this analysis provides an info +object for every basic block in the program, that contains the "Live In" set +(i.e. 
the values that are live at the start of the Basic Block, including all +of its PHI nodes), and the "Live Out" set (i.e. the values that are still live +at the end of the block). By iterating backwards over a Basic Block, starting +from the Live Outs, one can determine the set of values that are live at any +point in the program. + +The implementation is based on Section 5.2 of the paper "Computing Liveness Sets +for SSA-Form Programs." by Florian Brandner, Benoit Boissinot, Alain Darte, +Benoît Dupont de Dinechin, Fabrice Rastello. + +This analysis is used by [BOSCC](#branch-on-superword-condition-code), and by +[SIMD Width Analysis](#simd-width-analysis). + +> The relevant classes and functions can be found in +> `source/include/analysis/liveness_analysis.h` and +> `source/analysis/liveness_analysis.cpp` + +## SIMD Width Analysis + +This analysis is used to estimate the optimal vectorization width, depending on +the contents of the kernel. The current strategy is a two-stage process: first +we find the widest used varying value type, disregarding types that make up less +than a small proportion of the program according to some tolerance threshold. +The SIMD width computed is the number of elements of this type that will fit +into a single vector register. Then we analyze the program using the +[Liveness Analysis](#liveness-analysis) to estimate the maximum SIMD width that +will fit into vector registers, if it is wider than the previously computed +result. This allows vectorization to produce values that will not necessarily +fit into single vector registers, but will fit across multiple registers after +legalization. + +SIMD Width Analysis is performed only when vectorization is set to automatic +(either by using the `-cl-wfv=auto` option or by passing `bool Auto=true` to +`Vectorizer::vectorize()`). The analysis is performed after control flow +conversion, so that any changes made by this or prior passes will be taken into +account. + +This analysis uses the result of +[Packetization Analysis](#packetization-analysis). + +> The relevant classes and functions can be found in +> `source/include/analysis/simd_width_analysis.h` and +> `source/analysis/simd_width_analysis.cpp` + +## Vectorizability Analysis + +This is not an analysis in the traditional sense but more of a filter for +kernels that we know we will not be able to vectorize. There are a number of +cases that we cannot handle in Vecz, such as kernels containing specific atomic +instructions, functions returning vector results, or specific Native CPU builtins. +By checking for these conditions early in the vectorization process we can save +on compile time and also avoid accidentally generating an incorrectly vectorized +kernel. + +> The relevant functions can be found in +> `source/include/vectorization_context.h` and +> `source/vectorization_context.cpp`. + +## Reachability Analysis + +This is a utility class created to speed up CFG reachability queries required by +[BOSCC](#branch-on-superword-condition-code). It is not an analysis pass managed +by LLVM, but must be created manually where required. The algorithm is based on +an Open Proceedings paper entitled "Reachability Queries in Very Large Graphs: A +Fast Refined Online Search Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira +Jr, Mohammed J. Zaki. In addition to that approach, dominator and post-dominator +trees are used to further accelerate the process. + +The reachability data structure must be constructed from a directed acyclic +graph. 
Backedge traversal is a case not yet handled. + +> The relevant functions can be found in `source/include/reachability.h` and +> `source/reachability.cpp`. + +## Preparation Passes + +We employ a set of preparation passes that includes both optimization passes, +such as the Mem2Reg pass discussed later on, and passes necessary to generate IR +that Vecz can handle: + +* Dominator Tree Wrapper Pass (from LLVM) +* Loop Info Wrapper Pass (from LLVM) +* Switch Lowering Pass (from LLVM) +* Function Exit Nodes Unification Pass (from LLVM) +* Builtin Inlining Pass (from Vecz, described below) +* Promote Memory To Register Pass (from LLVM) +* Basic Mem2Reg Pass (from Vecz, described below) +* Instruction Combining Pass (from LLVM) +* Dead Code Elimination Pass (from LLVM) +* Pre-linearization Pass (from Vecz, described below) +* Instruction Combining Pass (again, to deal with Pre-linearization changes) +* CFG Simplification Pass (from LLVM) +* Unify Function Exit Nodes Pass (from LLVM) +* Loop Simplification Pass (from LLVM) +* Loop Rotate Pass (from LLVM) +* Simplify Infinite Loop Pass (from Vecz, described below) +* Induction Variable Simplification Pass (from LLVM) +* Early CSE Pass (from LLVM) +* LCSSA Pass (from LLVM - restores LCSSA form if broken by Early CSE) + +> The relevant classes and functions can be found in `vectorizer.cpp`, as +> well as `builtin_inlining_pass.h`, `builtin_inlining_pass.cpp`, +> `basic_mem2reg_pass.h`, and `basic_mem2reg_pass.cpp` for the Vecz +> passes. + +### Builtin Inlining Pass + +The Builtin Inlining pass replaces calls to Native CPU builtins with an inline +version of the builtin. This is done because the generic approach that we follow +for getting the scalar or vector equivalent of a builtin (described later in the +[Packetizing Native CPU Builtins](#packetizing-Native CPU-builtins) section) does not +work for all of them, so instead we bring the implementation in the same module +as the kernel and let the vectorizer vectorize it. More details can be found in +the [Vectorizing Builtin Functions](#vectorizing-builtin-functions) section. + +> The relevant class can be found in +> `source/include/transform/builtin_inlining_pass.h` and +> `source/transform/builtin_inlining_pass.cpp`. + +### Basic Mem2Reg Pass + +The Basic Mem2Reg pass performs `alloca` promotions, similar to how the LLVM +Mem2Reg pass operates. Of course, there are a number of requirements that an +`alloca` and its users need to fulfill before it is possible to perform such an +optimization but the general idea is this: we first check if it is possible +to determine the value (as an LLVM Value, not necessarily as a compile time +constant value) stored in the `alloca`, and if it is the case, then users of +the `alloca` are updated to use the stored value directly, instead of loading it +from the `alloca`. + +The Basic Mem2Reg is somewhat simpler than LLVM's own Promote Memory To Register +Pass, and as a result more strict in what it will promote. However, it is able +to promote some `alloca`s that LLVM's own pass cannot, for instance where there +are bitcasts involved. + +> The pass can be found in `basic_mem2reg_pass.h` and `basic_mem2reg_pass.cpp`. + +### Pre-linearization Pass + +This pass transforms simple `if-then` and `if-then-else` constructs by hoisting +their contents out into the parent scope, when it determines that the cost of +executing the instructions is less than the cost of the branches. 
This also takes +into account of the extra branching logic that will be inserted by +[BOSCC](#branch-on-superword-condition-code) during Control Flow Conversion. + +The CFG itself is not modified, instead being left for LLVM's CFG Simplification +Pass to tidy up. + +> The pass can be found in `source/transform/pre_linearize_pass.cpp`. + +### Simplify Infinite Loop Pass + +The Simplify Infinite Loop pass checks the CFG for infinite loops that may still +be present after the LLVM loop simplifications passes we call. This pass is +necessary for VECZ to make all loops have the same layout, i.e. in this case at +least one exit block, as handling infinite loops within the +[Control Flow Conversion Pass](#control-flow-conversion-pass) would add too much +overhead. + +This pass is a loop pass and will check, for each loop, if exit blocks are +present. If not, it means the loop cannot terminate (as it cannot be exited) so +we have to mutate it. It then tries to find the unique return block of the +function (it should only have one as we call `UnifyFunctionExitNodesPass` prior +to this pass to make sure we do have only one return block). This return block +will be the exit block of the infinite loop after mutation. After finding the +latter, we add a conditional branch to the latch that will either branch to the +header or to the return block. The condition of that conditional branch will +actually always be true such that it will still always branch to the loop +header to respect the semantic of the original program. Finally, the pass +updates the PHI nodes in the return block by adding an incoming block to them, +which is the latch of the infinite loop. It also adds new PHI nodes for uses in +the return block that may be defined after the infinite loop, for which adding +the edge from the infinite loop to the return block may break the SSA form. + +> The pass can be found in `source/transform/simplify_infinite_loop_pass.cpp`. + +## Remove Intptr Pass + +This pass scans for `PtrToInt` casts that can be eliminated and converted into +bitcasts or GEP instructions. It is able to eliminate a `PtrToInt` in the +following cases: + +* A PtrToInt followed by an IntToPtr, which is replaced by a pointer cast; +* A PtrToInt used by a PHI node, in which case the PHI node is replaced + by one of the pointer type; +* A PtrToInt where the pointer type is `i8*`, followed by an integer add or + subtract, in which case it is replaced by a GEP. + +Removing intptr casts makes it possible for uniform pointer strides to be +identified. + +> The pass can be found in `source/transform/remove_intptr_pass.cpp` + +## Squash Small Vectors Pass + +This pass looks for loads and stores of vector types that fit into a legal +integer, where packetization would result in non-contiguous access, and replaces +them with loads or stores of an integer scalar of the same size (where alignment +requirements permit). This allows more efficient generation of scatter/gather or +interleaved memory operations on these types. + +> The pass can be found in `source/transform/squash_small_vectors_pass.cpp` + +## Scalarization Pass + +This pass converts code that is already in a vector form into scalar code, or +retains a partially scalarized code, so that the packetizer can produce vector +IR at a vectorization width optimal for the target hardware. + +The scalarization pass is divided into two stages: the analysis and the actual +transformation stage. 
+ +In the analysis we mark the values that need scalarization, which includes +vector leaves and non-vector instructions using vector operands. +The non-vector instructions using vector operands are either `ExtractElement`s +with vector operands or `BitCastInst`s from vector to non-vector. Note that the +analysis is not performed in an analysis pass, but a utility class that runs +locally within the transform pass, since this information is not needed by any +other pass. + +If the vector leaf instruction or any of its arguments are of vector type and +the primitive size is greater than the partial scalarization factor (called +primitive size), the instruction is marked for needing scalarization. This marks +the end of analysis. + +Once we have the values that were marked for transformation by the analysis +stage, the vector operands are first scalarized and then followed by the vector +leaf instructions that need scalarization. + +> The utility classes can be found in +> `source/include/transform/scalarizer.h` and +> `source/transform/scalarizer.cpp`. +> The transform pass can be found in +> `source/include/transform/scalarization_pass.h` and +> `source/transform/scalarization_pass.cpp`. + +## Control Flow Conversion Pass + +The control flow conversion linearizes the divergent control flow executing both +if and else condition blocks. In order to preserve safe access, the blocks are +predicated with masks, to allow only legal access for calls and memory accesses +that have side-effects. + +Control flow conversion is the actual control-flow to data-flow conversion pass +that uses information from the control flow analysis and divergence analysis. +Conversion starts with generating masks, applying masks and generating selects, +followed by linearizing the control flow and finally repair the SSA form that +the linearization may have broken. + +The mask for every basic block is generated starting with the entry mask, which +is followed by a branch mask for cases where the entry mask is a phi node of its +predecessors. Then special masks in the loop are generated to handle run time +control flow divergence, namely; the loop active mask, combined exit mask. Next, +the loop live values need to reflect the right values for early exited lanes. +The masks are then applied to prevent side-effects for the inactive instances. +In case of a call to memory operation, it is replaced with a corresponding masked +internal builtin call [Defining Internal Builtins](#defining-internal-builtins). +The phi nodes are then transformed into selects, to enable control-flow to +data-flow conversion. + +The CFG is then linearized, where necessary. It is actually partially linearized +to retain uniform branches that we know need not be linearized. We apply the +partial linearization by identifying every divergent blocks thanks to the +divergence analysis to know which blocks should be linearized, and which blocks +may remain untouched. A divergent block is called a `div causing block`. To +linearize the CFG, we keep a deferral list that represents all the blocks that +lost their predecessor's edge because of divergence. When we reach a block, if +the block is a div causing block, then it can only have one successor, otherwise +the block can keep the same amount of successors it has. To know which block +should be the successor of another block, we choose between the current +successors, and the deferral list available for that block. 
The choice is then +made based on the Dominance-Compact Block Indexing, which assigns each block a +unique index. Dominance compactness means that for any block, all other blocks +dominated by that block follow on in a contiguous sequence. This is constructed +by a depth-first traversal of the dominator tree, visiting children in CFG +reverse post-order. (In actual fact, Loop-Compactness takes precedence over +Dominance-Compactness; the latter usually implies the former, but certain loops +with multiple exits can break this, so special care has to be taken.) Using that +index to choose the successor guarantees that if an edge `A to B` existed in the +original graph, an edge `A to X to B` will exist in the linearized graph, thus +conserving dominance. + +Once the CFG is linearized, we may have introduced new edges that were not there +previously, which may have broken the SSA form. Therefore, we must repair the +SSA form by introducing blend instructions (in the form of phi nodes) at the new +converging points. + +The partial linearization implementation was inspired from the paper +`Automatic SIMD Vectorization of SSA-based Control Flow Graphs` by +Ralf Karrenberg and `Partial Control-Flow Linearization` by +Simon Moll & Sebastian Hack. + +> The pass can be found in +> `source/include/transform/control_flow_conversion_pass.h` and +> `source/transform/control_flow_conversion_pass.cpp`. + +### Branch On Superword Condition Code + +Various optimizations directly linked to the partial linearization can be +applied. One of those optimizations is BOSCC (Branch On Superword Condition +Code), whose purpose is to duplicate predicated code into their uniform, +original, form so that this duplicated code can be executed when all lanes of +the SIMD group are either all true or all false. In fact, when this is the case, +there is no point to execute predicated instructions as all lanes will be +executed, or none. + +The first step of this optimization is to duplicate all the code paths that may +diverge so that we can execute that code when all lanes are true/false. We thus +have one part of the CFG that diverges and one that stays uniform, throughout +the execution of the code. However, just after duplicating the code, the latter +is separated from the original CFG and the rewiring will be done later, once the +linearization is done. In order to identify which blocks need to be duplicated, +we need to identify Single-Entry, Single-Exit (SESE) regions that contain +divergence-causing branches. We leverage the Dominance-Compact Block Indexing to +do this, since any SESE region is necessarily dominance compact. In the simple +case, a divergence-causing branch will be from the entry block of a SESE region. +However, this is not strictly necessarily the case in more complex CFGs, where +the SESE entry block might not be a divergent branch, but multiple divergent +branches may exist within the region. Therefore we deal with Multiple-Entry, +Single-Exit predicated subregions of the SESE that can potentially overlap each +other (although we only ever duplicate each predicated block once, regardless of +how many different regions it appears in), each beginning with a single +divergence-causing branch. + +Once the linearization is done, and we start repairing the CFG from all the +changes we made, we can start rewiring the duplicated (i.e. uniform) parts of +the CFG into the divergent ones. 
The first thing we do is to make the outermost +loop preheaders of duplicated loops always target the uniform loop because the +first time we enter the loop, all our lanes are activated/deactivated so there +is no need to execute the divergent loop. Then, for each divergent branch, we +add a run time checker that checks if all lanes are activated, in which case we +hit the all-lanes-activated uniform path. Otherwise, we check if none of the +lanes are activated, in which case we hit the no-lanes-activated uniform path. +Finally, if none of those two checks were true, then that means some condition +diverges: some lanes evaluate to true, and some evaluate to false; we thus have +to go into the divergent part of the CFG. As soon as we go into the divergent +part of the CFG (the one which contains predicated instructions), it is not +possible to go back into the uniform part of the CFG (the one that contains no +predicated instructions), until we reach a blend block, that is, a block where +all the previous divergent branches meet. + +In order to allow fast reachability queries of the CFG, all of the blend points +are computed and stored during modification of the CFG, which allows us to +construct a data structure to speed up the required reachability queries at the +point the PHI nodes are actually created, since if we were modifying the CFG +during this process, the reachability data structre would be continuously +invalidated. It also means that the PHI nodes can be created with all +predecessors known, and avoids cases where a reachable PHI node would be falsely +classified as unreachable simply because it hasn't been connected up yet. + +Reachability queries are handled by the +[Reachability Analysis](#reachability-analysis) class described earlier in this +document, except in some remaining cases outside of BOSCC, and in one case +inside of BOSCC where reachability needs to traverse backedges, which is not +handled by the aforementioned data structure. + +The BOSCC implementation was inspired from the paper +`Predicate Vectors if you must` by Shahar Timnat, Ohad Shacham and Ayal Zaks. + +> The class can be found in `source/control_flow_boscc.h` and +> `source/control_flow_boscc.cpp`. + +### Return on Superword Condition Code + +ROSCC is a simpler alternative to BOSCC that doesn't require any code +duplication. It handles only "early exit branches", i.e. code of the form: + +```c +if (some_condition) { + return; +} +``` + +Where `some_condition` is a varying expression, ROSCC will insert an additional +uniform branch directly to the exit block. + +ROSCC is applied only when BOSCC is turned off, since BOSCC will handle this +special case in a more general way. + +> The class can be found in `source/control_flow_roscc.h` and +> `source/control_flow_roscc.cpp`. + +### Instantiating functions with side effects + +Much like the memory operations, functions that may produce side-effects also +need to be masked. Call instructions are examined, and if it is determined that +we will not be able to handle the call in any other way, the call is replaced +with a call to a masked version of the function. + +The masked version is nothing more than a wrapper around the original call. The +wrapper function accepts the same arguments as the unmasked version and an +additional boolean (`i8`) argument for the mask. If the mask is true, the +wrapped function is executed and its result is returned, otherwise `poison` is +returned, without executing the wrapped function. 
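+
+For example, a masked wrapper for a hypothetical function `@foo` might look
+roughly like the following (the wrapper name and types are illustrative, not
+the exact symbols vecz emits):
+
+```llvm
+declare i32 @foo(i32)
+
+; Only calls @foo when the mask is set; otherwise returns poison.
+define i32 @foo_masked(i32 %x, i8 %mask) {
+entry:
+  %do.call = icmp ne i8 %mask, 0
+  br i1 %do.call, label %active, label %inactive
+
+active:
+  %ret = call i32 @foo(i32 %x)
+  ret i32 %ret
+
+inactive:
+  ret i32 poison
+}
+```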
+
+After replacing the call with the masked call, we mark the call for
+instantiation, as we cannot packetize it into a vector instruction.
+
+### Division by zero exceptions
+
+On some hardware, a divide by zero operation and/or a numerical overflow will
+result in a CPU exception. Since inactive vector lanes should never trigger
+such an exception, masks may also need to be applied using `select`
+instructions on the divisor, which result in a divisor of `1` for inactive
+lanes. There is no way for Vecz to get the information about this requirement
+from the target, and since most GPU hardware silently ignores division by
+zero, this behaviour is disabled by default. It can be enabled explicitly by
+using the `DivisionExceptions` [Vecz Choice](#vecz-choices).
+
+Note that the mask applied to the divisor is derived purely from the CFG. The
+behaviour of any division by zero on an active vector lane will be
+unaffected.
+
+
+## Packetization
+
+During packetization, instructions that define varying values or produce
+varying side-effects are turned into instructions that define the same value
+or produce the same effects for each SIMD instance. For example, an `add`
+instruction that takes two `i32` operands is turned into another `add`
+instruction that takes two `<N x i32>` operands (where `N` is the SIMD
+width). This is done recursively in three steps: first, we packetize any
+branch condition that requires packetization, and then, starting at the
+"vector leaves" of the function, the rest of the instructions. Vector leaves
+are instructions that allow varying values to "escape" from the function.
+Some examples of leaves include:
+
+* Store instructions, when the value to store and/or the pointer is varying
+* Call instructions, when varying operands are present or when the call has
+  no use
+* Return instructions
+
+After those two steps, we then proceed to packetize any remaining phi nodes,
+explained in more detail in a following [subsection](#packetizing-phi-nodes).
+
+During the packetization process, we might run into cases where we cannot
+packetize a varying instruction but instead have to instantiate it. By
+instantiation we mean repeating an instruction `N` times, one for each SIMD
+lane. A common example would be calls to the `printf` function, as in the
+following kernel:
+
+```c
+kernel void fn(global int *in, global int *out) {
+  size_t tid = get_global_id(0);
+  int load_in = in[tid];
+  int result = load_in * tid;
+  printf("in[%d] = %d\n", tid, result);
+  out[tid] = result;
+}
+```
+
+In this kernel, the call to the `printf` function will be repeated `N` times,
+with its arguments adjusted to use the correct global ID for each lane. On
+the other hand, the load and store instructions, as well as the
+multiplication, will be packetized into vector instructions of width `N`.
+More details on instantiation can be found in the
+[relevant section](#instantiation).
+
+As we have already mentioned, the packetization (or instantiation) starts
+from the vector leaves and recursively continues into their operands. As a
+matter of fact, the operands are packetized before the instruction itself; in
+order to generate the correct instruction, we first need to have the correct
+operands. This process stops when we reach one of the following:
+
+1. An operand that is uniform, such as constants or kernel arguments.
+2. A vector root such as `get_global_id(0)`.
+3. A pointer that we can handle in its scalar form.
+ +In the first case, we create a packetized version of the operand by simply +broadcasting its value into a vector, so that each element of the vector +contains the same value. Since the value is uniform, we do not need to proceed +and packetize its operands. The second case is handled specially by using the +scalar value to create a vector of sequentially increasing values. + +The third case is also special, because it depends on the access pattern +of the pointer. If the pointer is varying and we can determine a stride +to the access pattern then we do not need to packetize the pointer. +Instead, we can use the same pointer value as the base for a vector memory +instruction. More details on this can be found in the [Packetizing Memory +Operations](#packetizing-memory-operations) subsection. + +Given all of these, we can now see how the example kernel given above will be +vectorized, with a vector width of 4. Of course, Vecz is not a source-to-source +transformation pass but the following kernel captures the equivalent IR changes +that will be performed in an easier to read format: + +```c +kernel void fn(global int *in, global int *out) { + size_t tid = get_global_id(0); + size_t4 tid4 = {tid, tid, tid, tid} + {0, 1, 2, 3}; + int4 load_in4 = in[tid]; + int4 result4 = load_in4 * tid4; + printf("in[%d] = %d\n", tid4.s0, result4.s0); + printf("in[%d] = %d\n", tid4.s1, result4.s1); + printf("in[%d] = %d\n", tid4.s2, result4.s2); + printf("in[%d] = %d\n", tid4.s3, result4.s3); + out[tid] = result4; +} +``` + +Notice how the address for the vector load and store are still calculated using +the scalar ID variable (`tid`), since the kernel accesses the elements of the +array consecutively and thus we can use a vector load and a vector store with +the same base address. + +Regardless of how a varying instruction has been handled, after we have +packetized or instantiated it, we mark it for deletion. After we have gone +through all the vector leaves, we proceed to delete all the instructions that we +marked for deletion, as long as they have no remaining users. + +> The packetization pass can be found in +> `source/include/transform/packetization_pass.h` and +> `source/transform/packetization_pass.cpp`. + +This is the general approach taken to packetize instructions but some cases +need to be handled specially. We will now explain in more depth some special +packetization cases. + +### Packetizing Memory Operations + +Memory operations are special as their access pattern determines how they are +packetized. Specifically, a memory operation will have one of these mutually +exclusive access patterns: + +1. No recognizable stride +2. A stride of 0 elements (i.e. being uniform) +3. A stride of 1 element (i.e. being contiguous) +4. A stride of `X` elements + +In the first case we will packetize the memory operation using a scatter/gather +internal builtin. This means that we will generate an address for each SIMD +lane and the scatter/gather internal builtin will handle storing or loading the +elements to and from vectors. + +In the second case, we will choose between two different approaches depending +on the need for masking. If masking is necessary, we will use masked +scatter/gather, with all the lanes getting the same address. If, on the other +hand, masking is not required, we will keep the scalar instruction and if it is +a load instruction, use a vector splat for the loaded value. 
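+
+For illustration, a minimal sketch (not the exact IR Vecz emits) of this
+second case for a width of 4: the scalar load is kept and its result is
+broadcast with an `insertelement`/`shufflevector` pair:
+
+```
+%x     = load i32, i32 addrspace(1)* %p, align 4
+%ins   = insertelement <4 x i32> undef, i32 %x, i32 0
+%splat = shufflevector <4 x i32> %ins, <4 x i32> undef, <4 x i32> zeroinitializer
+```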
+
+In the third case we could generate the addresses of the sequential elements
+and use the same approach as in the first two, but there is a much better
+solution. All we need to do is perform a vector memory operation of width `N`
+with the same pointer as the scalar version of the instruction. This will
+efficiently load `N` elements from the memory into a vector. This is usually
+the most efficient way to load and store vectors.
+
+Finally, for the fourth case, we will use an interleaved memory operation
+internal builtin. This builtin takes the base pointer and the stride of the
+memory operation, so calculating each individual address is (in theory, see
+the next paragraph) not required.
+
+How the internal builtins are implemented in practice differs based on the
+target hardware, but a description of the generic version emitted by Vecz can
+be found in the [Defining Internal Builtins](#defining-internal-builtins)
+section.
+
+> The relevant functions and classes can be found in
+> `source/include/transform/memory_operations.h` and
+> `source/transform/memory_operations.cpp`.
+
+### Packetizing Phi Nodes
+
+Phi nodes are a bit more tricky to packetize as they introduce cycles in the
+def/use chain. To avoid this, an empty vector phi node (i.e. with no incoming
+value) is created at first, when packetizing from the leaves. Once all the
+leaves have been packetized, the incoming values of each empty phi are
+packetized in turn. Since packetizing an incoming value may involve
+packetizing a new phi, this process needs to be repeated until all phi nodes
+have been handled.
+
+### Packetizing Native CPU Builtins
+
+Since all the Native CPU builtins are known, we can use a special technique
+for easily and efficiently packetizing many of them. Many of the builtins
+already have vector equivalents, so we can just use them instead of
+vectorizing the scalar version (for that approach see [Vectorizing Builtin
+Functions](#vectorizing-builtin-functions) instead). This is done by first
+determining if it is safe to use the vector equivalent. For example, the
+vector version of the `length` builtin does not operate element-wise, so we
+cannot use it.
+
+After we make sure that there are no known issues with the vector version of
+the builtin, we construct the expected function signature based on the scalar
+and the vector types that we have. We then search for the function matching
+that signature in the current module, and also in the builtins module (if it
+is available). If the function is found then we can use it, otherwise we
+report that vectorizing the builtin failed. This means that this step can
+detect builtins that have no vector versions but have been vectorized from
+the scalar version using Vecz.
+
+> The builtin information code can be found in
+> `include/vecz/vecz_builtin_info.h`,
+> `source/include/cl_builtin_info.h`, and
+> `source/cl_builtin_info.cpp`.
+
+### Instantiation
+
+Instantiating a value or instruction means evaluating it in the context of
+each SIMD lane and creating a separate copy for each lane. Instantiating a
+call to `get_global_id(0)` or `get_local_id(0)` results in the SIMD lane's
+global or local ID. If the instruction has varying operands, they need to be
+instantiated (and so on recursively) too. Even though looking at the source
+code it looks as if the instantiation is handled by a different pass, it is
+the packetization pass that requests instantiation of instructions when
+necessary.
+In turn, the instantiator will call back into the packetization pass when it
+determines that it shouldn't instantiate an instruction.
+
+Whether or not an instruction should be instantiated is determined by the
+Instantiation Analysis. This analysis goes through all the instructions in
+the function and looks for instructions that we know we shouldn't try to
+packetize, such as `printf` calls, instructions that have types that we
+cannot create a vector with, or the masked user functions we talked about in
+the [Control Flow Conversion](#control-flow-conversion-pass) section.
+
+> The analysis pass can be found in
+> `source/include/analysis/instantiation_analysis.h` and
+> `source/analysis/instantiation_analysis.cpp`.
+> The transform pass can be found in
+> `source/include/transform/instantiation_pass.h` and
+> `source/transform/instantiation_pass.cpp`.
+
+As an example of what instantiation looks like in the actual IR, let's say
+that we have the following code snippet that loads a variable:
+
+```c
+... = in[get_global_id(0)]
+```
+
+The IR for the snippet looks like this:
+
+```
+%call = call spir_func i64 @_Z13get_global_idj(i32 0)
+%arrayidx = getelementptr inbounds i32, i32 addrspace(1)* %in, i64 %call
+%0 = load i32, i32 addrspace(1)* %arrayidx, align 4
+```
+
+If this code were to be instantiated, it would look like this:
+
+```
+%call = call spir_func i64 @_Z13get_global_idj(i32 0) #2
+%arrayidx0 = getelementptr inbounds i32, i32 addrspace(1)* %in1, i64 %call, i64 0
+%arrayidx1 = getelementptr inbounds i32, i32 addrspace(1)* %in1, i64 %call, i64 1
+%arrayidx2 = getelementptr inbounds i32, i32 addrspace(1)* %in1, i64 %call, i64 2
+%arrayidx3 = getelementptr inbounds i32, i32 addrspace(1)* %in1, i64 %call, i64 3
+%0 = load i32, i32 addrspace(1)* %arrayidx0, align 4
+%1 = load i32, i32 addrspace(1)* %arrayidx1, align 4
+%2 = load i32, i32 addrspace(1)* %arrayidx2, align 4
+%3 = load i32, i32 addrspace(1)* %arrayidx3, align 4
+```
+
+Note how each load takes a different memory address which depends on the SIMD
+lane index and the current global ID.
+
+Instantiation is usually done when a varying instruction cannot be
+packetized, e.g. calls to functions like `printf` which have no SIMD
+equivalent. The above example does not require instantiation as the scalar
+load can simply be turned into a vector load; it was given purely for
+demonstration purposes, and an actual example can be found in the
+[Packetization](#packetization) section.
+
+### Vectorizing Builtin Functions
+
+Sometimes vectorizing kernels is not enough and the Native CPU builtin
+functions called by the kernel have to be vectorized too. One example is
+`isequal`, which returns `1` if the two arguments have the same value or `0`
+if they don't or if one of the arguments is NaN. The IR implementation is
+simple:
+
+```
+define spir_func i32 @_Z7isequalff(float %x, float %y) {
+entry:
+  %cmp.i = fcmp oeq float %x, %y
+  %conv.i = zext i1 %cmp.i to i32
+  ret i32 %conv.i
+}
+```
+
+The first step to vectorize this builtin function is to declare the
+vectorized builtin:
+
+```
+declare spir_func <4 x i32> @__vecz_v4__Z7isequalff(<4 x float> %x, <4 x float> %y)
+```
+
+The second step is to copy the instructions from the original function to the
+vectorized function. One issue is that the function arguments now have type
+`<4 x float>` instead of `float`, which prevents copying instructions that
+refer to the original arguments.
+One way to work around this is to create `extractelement` instructions to
+act as placeholders for the arguments. Instructions that referred to the old
+arguments are changed to refer to the relevant placeholder instead:
+
+```
+define spir_func <4 x i32> @__vecz_v4__Z7isequalff(<4 x float> %x, <4 x float> %y) {
+entry:
+  %placeholder_x = extractelement <4 x float> %x, i32 0
+  %placeholder_y = extractelement <4 x float> %y, i32 0
+  %cmp.i = fcmp oeq float %placeholder_x, %placeholder_y
+  %conv.i = zext i1 %cmp.i to i32
+  ret i32 %conv.i
+}
+```
+
+Placeholder instructions are marked as such so that they are not mistaken for
+regular instructions. When the placeholder needs to be packetized, it is
+replaced with the actual argument:
+
+```
+define spir_func <4 x i32> @__vecz_v4__Z7isequalff(<4 x float> %x, <4 x float> %y) {
+entry:
+  %cmp.i1 = fcmp oeq <4 x float> %x, %y
+  %conv.i2 = zext <4 x i1> %cmp.i1 to <4 x i32>
+  ret <4 x i32> %conv.i2
+}
+```
+
+## Post-Vectorization Optimizations and Cleanup
+
+After the vectorization process is completed, we run some additional passes
+to further optimize and clean up the code:
+
+* Inline Post Vectorization Pass (from Vecz)
+* CFG Simplification Pass (from LLVM)
+* Global Value Numbering Pass (from LLVM)
+* Dead Code Elimination Pass (from LLVM)
+* Interleaved Group Combining pass (from Vecz)
+* Instruction Combining pass (from LLVM)
+* Masked Memory Operations Simplification pass (from Vecz)
+* Internal Builtin Definition pass (from Vecz)
+
+> The passes can be found in the files
+> `source/include/transform/passes.h` and
+> `source/transform/passes.cpp`.
+
+### Inline Post Vectorization Pass
+
+The Inline Post Vectorization Pass is responsible for inlining builtins that
+have no vector/scalar equivalent, as well as called functions that don't have
+the `NoInline` attribute.
+
+### Interleaved Group Combining Pass
+
+The Interleaved Group Combining pass is responsible for lowering groups of
+interleaved memory operations into vector memory operations. Specifically,
+if there is a group of `K` interleaved operations with stride `K`, each
+accessing the elements in between the others, they will be transformed into
+`K` consecutive vector operations. For example, if we have the interleaved
+operations *A*, *B*, *C*, and *D* (with a number next to the letter
+signifying the element index),
+
+```
+-------------------------------------------------
+|A1|B1|C1|D1|A2|B2|C2|D2|A3|B3|C3|D3|A4|B4|C4|D4|
+-------------------------------------------------
+```
+
+they will be optimized into the vector operations *a*, *b*, *c*, and *d*:
+
+```
+-------------------------------------------------
+|a1|a2|a3|a4|b1|b2|b3|b4|c1|c2|c3|c4|d1|d2|d3|d4|
+-------------------------------------------------
+```
+
+The first pattern commonly appears after scalarizing vector memory operations
+in a kernel and then revectorizing each one of them into vector instructions.
+
+> The pass can be found in
+> `source/include/transform/interleaved_group_combine_pass.h` and
+> `source/transform/interleaved_group_combine_pass.cpp` while some of
+> the optimization code can be found in `include/vecz_target_info.h`,
+> `source/vector_target_info.cpp` and
+> `source/vector_target_info_arm.cpp`.
+
+### Masked Memory Operations Simplification Pass
+
+This pass is responsible for lowering masked operations into unmasked or nop
+operations, assuming that we can determine the mask values at compile time.
+
+If all the lanes in the mask are set to `true` then the mask is unnecessary
+and the operation can be lowered to the equivalent unmasked operation. If,
+on the other hand, all the mask lanes are set to `false`, the operation will
+not be executed at all and it can thus be replaced by a nop. Note that such
+optimizations are only possible if the mask values are known at compile time;
+runtime optimizations need to be handled separately, specifically when the
+code for the internal builtins is generated.
+
+> The pass can be found in `vectorizer.h` and `vectorizer.cpp`.
+
+### Defining Internal Builtins
+
+We have already mentioned how the internal builtins are used in
+the [Control Flow Conversion](#control-flow-conversion-pass) and
+[Packetization](#packetization) sections. We have the following internal
+builtins:
+
+* masked load/store
+* interleaved load/store
+* masked interleaved load/store
+* scatter store / gather load
+* masked scatter store / gather load
+
+The masked versions perform the same operation as their unmasked
+counterparts, with the exception that the operation is only performed for the
+lanes for which the mask is `true`.
+
+For the masked loads and stores, as well as the masked scatter stores and
+gather loads, LLVM provides intrinsics that perform these operations. How the
+intrinsics are implemented obviously depends on the backend.
+
+However, LLVM does not provide intrinsics for the remaining operations:
+interleaved loads and stores, masked interleaved loads and stores, and
+unmasked scatter stores and gather loads. Assuming that the masked
+scatter/gather intrinsics that LLVM provides are at least as efficient as
+manually performing each memory operation separately and then collecting them
+into a vector, we use those LLVM intrinsics for these operations as well. For
+the interleaved operations, we first need to generate all the pointers, using
+the pointer base and the stride, and then call the LLVM intrinsic.
+
+If intrinsic generation fails, we define the function by emulating the vector
+operation, with appropriate masking when required, which of course is
+suboptimal.
+
+This design is the default used in Vecz but since it is modular, it is
+possible to change the implementation for any target that Vecz is ported to.
+Specifically, the `vector_target_info.cpp` file contains a number of
+`createX` functions (where `X` is the internal builtin name, e.g.
+"`MaskedLoad`") where the actual code for the builtins is generated. The
+functions are very generic; they take an `IRBuilder` and the required
+pointers and values, so it is easy to modify them without having to modify
+any other part of the vectorizer. Their code can be replaced with more
+optimal and target-specific code. It can also be modified to solve any issues
+the target might have with the current internal builtins implementation.
+
+> Note: The current implementation for scatter/gather uses an `addrspacecast`
+> instruction in order to use the LLVM intrinsics with pointers having an
+> address space other than 0. This works for the x86 implementation but
+> it might not work on other architectures.
+>
+> Note: The interleaved memory operations use the same fallback as the masked
+> interleaved ones.
+>
+> The pass and the relevant materialization code can be found in
+> `source/include/vectorization_context.h`,
+> `source/vectorization_context.cpp`, `include/vecz/vecz_target_info.h`,
+> and `source/vector_target_info.cpp`.
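+
+As a hedged example of the generic lowering described above (the function
+name and types here are illustrative only, not the actual internal builtin
+naming), an interleaved load of stride 2 and width 4 can be defined by
+materializing the per-lane pointers and deferring to the LLVM masked gather
+intrinsic:
+
+```
+define <4 x i32> @interleaved_load_stride2(i32 addrspace(1)* %base) {
+entry:
+  ; lane i reads element base[2 * i]
+  %ptrs = getelementptr i32, i32 addrspace(1)* %base, <4 x i64> <i64 0, i64 2, i64 4, i64 6>
+  %v = call <4 x i32> @llvm.masked.gather.v4i32.v4p1i32(<4 x i32 addrspace(1)*> %ptrs, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef)
+  ret <4 x i32> %v
+}
+
+declare <4 x i32> @llvm.masked.gather.v4i32.v4p1i32(<4 x i32 addrspace(1)*>, i32, <4 x i1>, <4 x i32>)
+```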
+
+### Cloning of the Native CPU Metadata
+
+After the vectorization process has been completed, and only if it has been
+successful, we update the Native CPU metadata in the module to include the
+vectorized kernel. This isn't done by a pass; it is just a function call
+at the end of the main vectorizer pass. Since some of the metadata requires
+information known only by the frontend compiler (clang), we use the existing
+metadata by cloning the relevant nodes and then replacing the function
+pointer.
+
+> Note: When copying the metadata, we do not adjust the workgroup size, even
+> though we are now executing fewer work items. Since each invocation of the
+> kernel is now performing the work of `N` scalar kernels, where `N` is the
+> vector width, we only need to execute `1/N` work items for each workgroup.
+
+## Miscellaneous
+
+This section covers various necessary utilities that the main vectorizer
+passes are using.
+
+### Builtins Information
+
+Vecz handles Native CPU builtins specially, so we need to be able to identify
+them, query for various characteristics, and of course pull their definition
+or implementation into the current module. This is all handled by the
+builtins info code found in `vecz_builtins_info.h`, `cl_builtin_info.h`, and
+`cl_builtin_info.cpp`. The `BuiltinInfo` class allows us to:
+
+* Identify builtins based on their name.
+* Identify various characteristics of a builtin, such as whether it is safe
+  to vectorize it, or if it is a builtin we need to handle specially.
+* Determine if a builtin is uniform.
+* Determine if a builtin has side-effects.
+* Get the vector or scalar equivalent of a builtin.
+* Materialize a builtin from the builtins module.
+* Emit a custom inline implementation for specific builtins.
+
+While the code is mostly centered on Native CPU, it can also handle LLVM
+intrinsics, and it can be expanded to handle any builtin functions necessary.
+
+As far as the identification of a builtin and its properties is concerned,
+parts are done in a generic way that works with all the builtins, and parts
+are hardcoded by the developers. For example, demangling a builtin and
+getting its name can be easily done automatically, while determining if a
+builtin returns a global or local ID needs to be hardcoded by the developers.
+For this reason, we have a large list of known builtins that have a set of
+special characteristics, while any builtin omitted from this list is assumed
+to conform to some default set of characteristics.
+
+### Function and Type Mangler and Demangler
+
+The SPIR standard mandates that all Native CPU builtins need to be mangled
+according to the Itanium name mangling convention, with some additional
+extensions. Normally, the mangling of function names is handled by the
+frontend of the compiler (clang), since it depends on language-specific
+types, but we can correctly identify and mangle/demangle all the Native CPU
+primitives, vectors, and image types at the IR level as well.
+
+Mangling is used by the builtins module to determine the name of a newly
+generated builtin (for example when creating the vector equivalent of a
+builtin), while demangling is used to identify the builtins. Furthermore,
+other parts of the vectorizer use the mangler/demangler for their own
+purposes.
+
+Currently, we only support a subset of the Itanium mangling rules but this is
+enough for most Native CPU kernels. For example, we cannot mangle `struct`
+types, as we cannot easily map between the C type name and the LLVM one.
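+
+As an assumed illustration (standard Itanium/SPIR mangling of the widened
+signature, not an exhaustive description of the extensions), the name that
+would be constructed for the 4-wide vector equivalent of the scalar `fmax`
+builtin looks like this:
+
+```
+_Z4fmaxff        ; fmax(float, float)   - the scalar builtin
+_Z4fmaxDv4_fS_   ; fmax(float4, float4) - its 4-wide vector equivalent
+```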
+
+> The relevant files are `include/vecz/vecz_mangling.h` and
+> `source/mangling.cpp`.
+
+### Stride and Offset Information
+
+As we explained in previous sections ([Packetization](#packetization),
+[Defining Internal Builtins](#defining-internal-builtins)), it is necessary
+for Vecz to be able to determine the access pattern of a memory operation.
+This essentially involves three attributes: the base pointer, the offset, and
+the stride. These can be determined with the help of the `OffsetInfo` class
+found in the `offset_info.h` and `offset_info.cpp` files.
+
+The class accepts a pointer and tries to determine the offset and the stride
+of the pointer. This is done by tracing the base of the pointer and keeping
+track of the operations performed on the way. As an example:
+
+```c
+kernel void fn(global int *in) {
+  size_t tid = get_global_id(0);
+  global int *ptr = in + tid * 2;
+  ...
+}
+```
+
+The `ptr` pointer value in this kernel is calculated by scaling the global ID
+by 2 and adding it to the `in` kernel argument. Its base pointer is the `in`
+kernel argument, which is uniform. `OffsetInfo` can determine that this
+pointer has an offset depending on the global ID and that it has a stride of
+2 elements (this is similar to the scalar evolution analysis used for loop
+optimizations). Having this information, we now know that any memory accesses
+using `ptr` need to be packetized as interleaved memory operations with a
+stride of 2.
+
+### Vectorization Dimension
+
+The user may decide to vectorize the code on any one of the three possible
+dimensions the workgroup is composed of. Vecz refers to them as dimension `0`
+(x), dimension `1` (y), and dimension `2` (z). Vecz supports this
+configuration via an additional parameter. This parameter directly affects
+the [Uniform Analysis](#uniform-analysis), the
+[Packetization](#packetization), and the
+[Stride and Offset Information](#stride-and-offset-information). If no
+parameter is specified, vectorization on the x dimension is assumed.
+
+### Vecz Choices
+
+"Choices" are options that the programmer can select regarding various
+aspects of the vectorization process. For example, it is possible to set Vecz
+to always vectorize uniform instructions. This is handled by the
+`VectorizationChoices` class. The choices can be set in four different ways:
+
+1. By modifying the code to explicitly enable or disable a choice. This is
+   meant to be used by developers when optimizing Vecz for a custom target.
+2. By passing the `-vecz-choice=` flag to `opt`. This is meant to be used for
+   testing and debugging purposes. More details for this option can be found
+   through `opt`'s `help` function.
+3. By setting the `CODEPLAY_VECZ_CHOICES` environment variable.
+4. By calling the `addChoice` method from the `Vectorizer` class.
+
+The `CODEPLAY_VECZ_CHOICES` variable accepts a string of Choices separated by
+a semicolon (`;`) character. More details, as well as the choices available,
+can be found in the `vecz_choices.h` file.
+
+> The `VectorizationChoices` class can be found in
+> `include/vecz/vecz_choices.h` and
+> `source/vectorization_choices.cpp`.
+
+## Obtaining Vectorization Statistics
+
+LLVM has support for collecting counters (called statistics in LLVM) from the
+passes. Vecz is among the passes that can produce such information. This can
+be done in two ways.
+
+First, the official way is to use `opt` with the `-stats` option. This will
+print the statistics from all the passes that have any.
+
+The second way is to pass the `-cl-llvm-stats` option to the oneAPI
+Construction Kit. This does pretty much the same as the `-stats` option, but
+it can be used in cases where it is not possible to use `-stats`.
+
+## Optimization Remarks
+
+Vecz utilizes the Remarks system available in LLVM, mostly to warn about
+vectorization failures. The remarks can be enabled by passing the
+`-pass-remarks=vecz` and `-pass-remarks-missed=vecz` command line options to
+`opt`, or the `-v` flag to `oclc`.
+
+## veczc - the VECZ Compiler
+
+The command line tool veczc is a standalone compiler that is used to
+vectorize LLVM bitcode binary files. Its main use is in our vecz LIT-based
+testing (see modules/compiler/vecz/test).
+
+It has the following arguments:
+
+* -o `file` the output bitcode file
+* -w `width` the width to vectorize the code to
+* -d `dimension` the dimension index to vectorize the code on
+
+* -k `name` the function names to select for vectorization. It can appear
+  multiple times, in one of several forms. In the standard form, simply
+  passing the names of kernels will ensure those kernels are vectorized with
+  the globally selected parameters, e.g.
+  `veczc -k foo -k bar ...`
+  selects both the `foo` and `bar` kernels for vectorization.
+  The more complex form allows specifying the vectorization parameters
+  *per*-kernel in multiplicate, e.g.
+  `veczc -k foo:4,8,16 ...`
+  will generate multiple vectorized versions of `foo` at vectorization
+  factors of 4, 8, and 16. All other parameters will be inherited from the
+  global configuration.
+  The complete syntax for the kernel specification switch (`k`) value is as
+  follows:
+
+  ```bnf
+  <kernel-spec> ::= <name> ':' <spec-list>
+  <spec-list>   ::= <spec>
+  <spec>        ::= (<width> opt)(<dimension> opt)(<local-size> opt)(<scalable> opt)
+  <spec-list>   ::= <spec> ',' <spec-list> // multiple specs are comma-separated
+  <number>      ::= [0-9]+ // a decimal integer
+  <name>        ::= [a-zA-Z_][a-zA-Z_0-9]+ // As in the simple form - the name of the kernel to vectorize
+  <dimension>   ::= '.' [123] // Vectorize only the given dimension
+  <width>       ::= <number> // vectorize by the given factor
+  <width>       ::= 'a' // automatic vectorization factor
+  <local-size>  ::= '@' <number> // Assume local size (SIMD width) is the given number
+  <scalable>    ::= 's' // Turn on scalable vector support
+  ```
+  N.B. there should be no whitespace, as this interface is designed for easy
+  unquoted use in common shells.
+
+It supports bitcode files with the following target triples:
+
+* `spir-unknown-unknown` 32-bit SPIR binaries
+* `spir64-unknown-unknown` 64-bit SPIR binaries
+
+Because veczc doesn't load all of the builtins prior to vectorization,
+declarations of scalar or vector versions of any builtins used in the input
+file must be present, otherwise scalarization or packetization will not be
+able to materialize the scalarized/vectorized builtin calls and veczc will
+fail with an error message.
+
+## References
+
+[1]: http://dblp.uni-trier.de/pers/hd/k/Karrenberg:Ralf
diff --git a/sycl/doc/index.rst b/sycl/doc/index.rst
index 6b9d058217c84..fe3e1078514a8 100644
--- a/sycl/doc/index.rst
+++ b/sycl/doc/index.rst
@@ -51,6 +51,9 @@ Design Documents for the oneAPI DPC++ Compiler
    design/DeviceConfigFile
    design/PropagateCompilerFlagsToRuntime
    design/SYCLNativeCPU
+   design/SYCLNativeCPUPipeline
+   design/SYCLNativeCPUPipelinePasses
+   design/SYCLNativeCPUVecz
    design/CommandGraph
    design/OffloadDesign
    design/PrivateAlloca

From d1db8a7475fda0346869b835e55737e974283968 Mon Sep 17 00:00:00 2001
From: Colin Davidson
Date: Fri, 12 Sep 2025 13:21:14 +0100
Subject: [PATCH 02/14] Minor updates after review.

--- sycl/doc/design/SYCLNativeCPU.md | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/sycl/doc/design/SYCLNativeCPU.md b/sycl/doc/design/SYCLNativeCPU.md index 2696d785d1659..ec586751d19eb 100644 --- a/sycl/doc/design/SYCLNativeCPU.md +++ b/sycl/doc/design/SYCLNativeCPU.md @@ -33,7 +33,7 @@ Note that SYCL Native CPU co-exists alongside the other SYCL targets. For exampl ``` clang++ -fsycl -fsycl-targets=native_cpu,spir64 -o ``` -The application can then run on either SYCL target by setting the DPC++ `ONEAPI_DEVICE_SELECTOR` environment variable accordingly. +The application can then run on either SYCL target by setting the DPC++ `ONEAPI_DEVICE_SELECTOR` environment variable to include `native_cpu:cpu` accordingly. ## Configuring DPC++ with SYCL Native CPU @@ -49,7 +49,6 @@ python buildbot/configure.py \ SYCL Native CPU uses [libclc](https://github.com/intel/llvm/tree/sycl/libclc) to implement many SPIRV builtins. When Native CPU is enabled, the default target triple for libclc will be `LLVM_TARGET_TRIPLE` (same as the default target triple used by `clang`). This can be overridden by setting the `--native-cpu-libclc-targets` option in `configure.py`. - ### oneTBB integration SYCL Native CPU can use oneTBB as an optional backend for task scheduling. oneTBB with SYCL Native CPU is enabled by setting `NATIVECPU_WITH_ONETBB=On` at configure time: @@ -84,7 +83,6 @@ cmake \ ``` Note that a number of `e2e` tests are currently still failing. -The SYCL Native CPU device needs to be selected at runtime by setting the environment variable `ONEAPI_DEVICE_SELECTOR=native_cpu:cpu`. # Vectorization @@ -119,10 +117,9 @@ llvm-cov show .\vector-add.exe -instr-profile=foo.profdata ### Please note that Windows is partially supported but temporarily disabled due to some implementation details, it will be re-enabled soon. - # Native CPU compiler pipeline -SYCL Native CPU formerly used uses the [oneAPI Construction Kit](https://github.com/codeplaysoftware/oneapi-construction-kit) (OCK) in order to support some core SYCL functionalities and improve performances in the compiler pipeline. This relevant parts have been brought into DPC++ and the Native CPU compiler pipeline is documented [here](SYCLNativeCPUPipeline.md), with a brief overview below. The OCK related parts are still enabled by using the `NATIVECPU_USE_OCK` CMake variable, but this is enabled by default. +SYCL Native CPU formerly used the [oneAPI Construction Kit](https://github.com/uxlfoundation/oneapi-construction-kit) (OCK) via CMake FetchContent in order to support some core SYCL functionalities and improve performances in the compiler pipeline. The relevant OCK parts have been brought into DPC++ and the Native CPU compiler pipeline is documented [here](SYCLNativeCPUPipeline.md), with a brief overview below. The OCK- related parts are still enabled by using the `NATIVECPU_USE_OCK` CMake variable, but this is enabled by default. The following section gives a brief overview of how a simple SYCL application is compiled for the SYCL Native CPU target. 
Consider the following SYCL sample, which performs vector addition using USM: From ec1273cd5a1aec40102466ee19e5f6ebd9542579 Mon Sep 17 00:00:00 2001 From: Colin Davidson Date: Fri, 12 Sep 2025 15:31:09 +0100 Subject: [PATCH 03/14] Response to more comments on SYCLNativeCPUPipeline.md --- sycl/doc/design/SYCLNativeCPUPipeline.md | 29 ++++++++++++++++++++++-- 1 file changed, 27 insertions(+), 2 deletions(-) diff --git a/sycl/doc/design/SYCLNativeCPUPipeline.md b/sycl/doc/design/SYCLNativeCPUPipeline.md index 9de5e566e7cbe..1975135b9e761 100644 --- a/sycl/doc/design/SYCLNativeCPUPipeline.md +++ b/sycl/doc/design/SYCLNativeCPUPipeline.md @@ -239,7 +239,27 @@ For a simple kernel of the form: c_ptr[wiID] = a_ptr[wiID] + b_ptr[wiID]; }; ``` -The resulting IR from a typical this kernel with a `sycl::range` of 1 is: + +with original incoming IR of: + +```llvm +define weak_odr dso_local spir_kernel void @_Z6Sample(ptr noundef align 4 %_arg_c_ptr, ptr noundef align 4 %_arg_a_ptr, ptr noundef align 4 %_arg_b_ptr) local_unnamed_addr #1 comdat !srcloc !74 !kernel_arg_buffer_location !75 !kernel_arg_type !76 !sycl_fixed_targets !49 !sycl_kernel_omit_args !77 { +entry: + %0 = load i64, ptr @__spirv_BuiltInGlobalInvocationId, align 32, !noalias !78 + %arrayidx.i = getelementptr inbounds i32, ptr %_arg_a_ptr, i64 %0 + %1 = load i32, ptr %arrayidx.i, align 4, !tbaa !72 + %arrayidx4.i = getelementptr inbounds i32, ptr %_arg_b_ptr, i64 %0 + %2 = load i32, ptr %arrayidx4.i, align 4, !tbaa !72 + %add.i = add nsw i32 %1, %2 + %cmp.i8.i = icmp ult i64 %0, 2147483648 + tail call void @llvm.assume(i1 %cmp.i8.i) + %arrayidx6.i = getelementptr inbounds i32, ptr %_arg_c_ptr, i64 %0 + store i32 %add.i, ptr %arrayidx6.i, align 4, !tbaa !72 + ret void +} +``` + +The resulting IR from a typical kernel with a `sycl::range` of dimension 1 is: ```llvm define weak dso_local void @_Z6Sample.NativeCPUKernel(ptr noundef align 4 %0, ptr noundef align 4 %1, ptr noundef align 4 %2, ptr %3) local_unnamed_addr #3 !srcloc !74 !kernel_arg_buffer_location !75 !kernel_arg_type !76 !sycl_fixed_targets !49 !sycl_kernel_omit_args !77 { @@ -258,6 +278,9 @@ entry: ret void } ``` + +This (scalar) IR was generated by this pass from the input IR by adding the state struct pointer, substituting the builtins to reference the state struct, and adapt the kernel name. + This pass will also set the correct calling convention for the target, and handle calling convention-related function attributes, allowing to call the kernel from the runtime. This kernel function is then wrapped again with a `subhandler` function, which receives the kernel arguments from the SYCL runtime (packed in a vector), unpacks them, and forwards only the used ones to the actual kernel and looks like: @@ -280,6 +303,8 @@ entry: As you can see, the `subhandler` steals the kernel's function name, and receives two pointer arguments: the first one points to the kernel arguments from the SYCL runtime, and the second one to the `nativecpu::state` struct. +The subhandler calls the function generated by the WorkItemLoopsPass, which calls the vectorized kernel and the scalar kernel if peeling is needed as described above. + There is also some tidying up at the end such as deleting unused functions or replacing the scalar kernel with the vectorized one. @@ -287,7 +312,7 @@ replacing the scalar kernel with the vectorized one. 
Any remaining materialization of builtins are handled by [DefineMuxBuiltinsPass](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_pipeline/source/define_mux_builtins_pass.cpp), such as ``__mux_mem_barrier``. The use of this pass should probably be phased -out in preferance to doing it all in one place. +out in preference to doing it all in one place. Some builtins may rely on others to complete their function. These dependencies are handled transitively. From 71cb2ad1f834716b61117eac76920aec4c319e4c Mon Sep 17 00:00:00 2001 From: Colin Davidson Date: Tue, 16 Sep 2025 09:32:43 +0100 Subject: [PATCH 04/14] More updates from comments --- sycl/doc/conf.py | 4 +- sycl/doc/design/SYCLNativeCPU.md | 2 +- sycl/doc/design/SYCLNativeCPUPipeline.md | 21 +++--- .../doc/design/SYCLNativeCPUPipelinePasses.md | 69 +++++++++---------- sycl/doc/design/SYCLNativeCPUVecz.md | 2 +- 5 files changed, 49 insertions(+), 49 deletions(-) diff --git a/sycl/doc/conf.py b/sycl/doc/conf.py index f8480ed58e724..a65ea7ff2601d 100644 --- a/sycl/doc/conf.py +++ b/sycl/doc/conf.py @@ -47,9 +47,11 @@ # The suffix of source filenames. source_suffix = [".rst", ".md"] -# Allow use of mermaid directly to view on github without the {} +# Make the GitHub-compatible syntax also work with MyST myst_fence_as_directive = ["mermaid"] +mermaid_output_format = 'png' + exclude_patterns = [ # Extensions are mostly in asciidoc which has poor support in Sphinx. "extensions/*", diff --git a/sycl/doc/design/SYCLNativeCPU.md b/sycl/doc/design/SYCLNativeCPU.md index ec586751d19eb..b89379271e3d1 100644 --- a/sycl/doc/design/SYCLNativeCPU.md +++ b/sycl/doc/design/SYCLNativeCPU.md @@ -119,7 +119,7 @@ llvm-cov show .\vector-add.exe -instr-profile=foo.profdata # Native CPU compiler pipeline -SYCL Native CPU formerly used the [oneAPI Construction Kit](https://github.com/uxlfoundation/oneapi-construction-kit) (OCK) via CMake FetchContent in order to support some core SYCL functionalities and improve performances in the compiler pipeline. The relevant OCK parts have been brought into DPC++ and the Native CPU compiler pipeline is documented [here](SYCLNativeCPUPipeline.md), with a brief overview below. The OCK- related parts are still enabled by using the `NATIVECPU_USE_OCK` CMake variable, but this is enabled by default. +SYCL Native CPU formerly used the [oneAPI Construction Kit](https://github.com/uxlfoundation/oneapi-construction-kit) (OCK) via CMake FetchContent in order to support some core SYCL functionalities and improve performances in the compiler pipeline. The relevant OCK parts have been brought into DPC++ and the Native CPU compiler pipeline is documented in [SYCLNativeCPUPipeline documentation](SYCLNativeCPUPipeline.md), with a brief overview below. The OCK- related parts are still enabled by using the `NATIVECPU_USE_OCK` CMake variable, but this is enabled by default. The following section gives a brief overview of how a simple SYCL application is compiled for the SYCL Native CPU target. 
Consider the following SYCL sample, which performs vector addition using USM: diff --git a/sycl/doc/design/SYCLNativeCPUPipeline.md b/sycl/doc/design/SYCLNativeCPUPipeline.md index 1975135b9e761..80938a2adf2fb 100644 --- a/sycl/doc/design/SYCLNativeCPUPipeline.md +++ b/sycl/doc/design/SYCLNativeCPUPipeline.md @@ -1,7 +1,6 @@ -Native CPU Compiler Pipeline Overview -===================================== +# Native CPU Compiler Pipeline Overview -# Introduction +## Introduction This document serves to introduce users to the Native CPU compiler pipeline. The compiler pipeline performs several key transformations over several phases that @@ -16,7 +15,7 @@ Kit](https://github.com/uxlfoundation/oneapi-construction-kit/tree/main), under ## Objective and Execution Model -The compiler pipeline\'s objective is to compile incoming LLVM IR +The compiler pipeline's objective is to compile incoming LLVM IR modules containing one or more kernel functions to object code ready for execution when invoked by the host-side runtime. The assumptions placed on the input and output kernels is as follows: @@ -106,7 +105,7 @@ options passed to it, potentially undoing the work of any previous optimization passes, although it is able to preserve or even widen pre-existing vector operations in many cases. -#### Work-item Scheduling & Barriers +### Work-item Scheduling & Barriers The work-item loops are added to each kernel by the [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass). @@ -119,13 +118,13 @@ well as [Vectorization Scheduling](#vectorization-scheduling) if the vectorizer was run. -### Barrier Scheduling +#### Barrier Scheduling The fact that the [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) handles both work-item loops and barriers can be confusing to newcomers. These two concepts are in fact linked. Taking the kernel code below, this section will -show how the `WorkItemLoopsPass` lays out and schedules a kernel\'s work-item +show how the `WorkItemLoopsPass` lays out and schedules a kernel's work-item loops in the face of barriers. ```C @@ -148,7 +147,7 @@ the first contains the call to `get_global_id` and the read/update/write of global memory pointed to by `a`; the second contains the read/update/write of global memory pointed to by `b`. -To correctly observe the barrier\'s semantics, all work-items in the +To correctly observe the barrier's semantics, all work-items in the work-group need to execute the first barrier region before beginning the second. Thus the `WorkItemLoopsPass` produces two sets of work-item loops to schedule this kernel: @@ -174,7 +173,7 @@ In this case, however, calls to certain builtins like `get_global_id` are treated specially and are materialized anew in each barrier region where they are used. -### Vectorization Scheduling +#### Vectorization Scheduling The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) is responsible for laying out kernels which have been vectorized by the @@ -225,9 +224,9 @@ less than or equal to the work-group size. In the case that there are work-items remaining (i.e., if the work-group size is not a multiple of 4) then the original scalar kernel is called on the up to 3 remaining work-items. These remaining work-items are -typically called the \'peel\' iterations. +typically called the 'peel' iterations. 
-#### PrepareSYCLNativeCPU Pass +### PrepareSYCLNativeCPU Pass This pass will add a pointer to a `native_cpu::state` struct as kernel argument to all the kernel functions, and it will replace all the uses of SPIRV builtins with the return value of appropriately defined functions, which will read the requested information from the `native_cpu::state` struct. The `native_cpu::state` struct is defined in the [native_cpu UR adapter](https://github.com/oneapi-src/unified-runtime/blob/main/source/adapters/native_cpu/nativecpu_state.hpp) and the builtin functions are defined in the [native_cpu device library](https://github.com/intel/llvm/blob/sycl/libdevice/nativecpu_utils.cpp). diff --git a/sycl/doc/design/SYCLNativeCPUPipelinePasses.md b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md index 98653b1ca13e6..5a05ec4bfd3a8 100644 --- a/sycl/doc/design/SYCLNativeCPUPipelinePasses.md +++ b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md @@ -1,5 +1,4 @@ -Compiler Utilities -================== +# SYCL Native CPU pipeline passes The `compiler::utils` module exists under [compiler_pipeline](https://github.com/intel/llvm/tree/sycl/llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_pipeline) @@ -9,7 +8,7 @@ These utility passes are currently only being used by `Native CPU`. These utilities were originally under the [oneAPI Construction Kit](https://github.com/uxlfoundation/oneapi-construction-kit/tree/main). -# TransferKernelMetadataPass and EncodeKernelMetadataPass +## TransferKernelMetadataPass and EncodeKernelMetadataPass These passes are responsible for setting up metadata on kernels under compilation. Many other passes implicitly rely on the metadata and @@ -46,7 +45,7 @@ Their job is three-fold: optional data supplied to either pass on construction to encode this metadata. If not set, the default `xyz` order is used. -# WorkItemLoopsPass +## WorkItemLoopsPass The `WorkItemLoopsPass` is responsible for adding explicit parallelism to implicitly parallel SIMT kernels. It does so by wrapping each kernel @@ -297,7 +296,7 @@ loops individually to each subkernel. The return value of a subkernel is used to determine which subkernel loop to branch to next, or to exit the wrapper function, as appropriate. -## Work-group scheduling (vectorized and scalar loops) +### Work-group scheduling (vectorized and scalar loops) The [WorkItemLoopsPass](#workitemloopspass) is responsible for stitching together multiple kernels to make a single kernel capable of correctly @@ -317,7 +316,7 @@ together kernels in several different configurations: - Vector loop only - Scalar loop only -### Vector + Scalar +#### Vector + Scalar The vector + scalar kernel combination is considered the default behaviour. Most often the work-group size is unknown at compile time and @@ -385,7 +384,7 @@ There are circumstances in which this mode is skipped in favour of If any of these conditions are true, the \"vector only\" mode is used. -### Vector + Vector-predicated +#### Vector + Vector-predicated The vector + vector-predicated kernel combination is a special case optimization of the default behaviour. @@ -413,7 +412,7 @@ if (group_size_x >= vector_width) { } ``` -### Vector only +#### Vector only If the [WorkItemLoopsPass](#workitemloopspass) is run on a vectorized kernel for which no [vecz](SYCLNativeCPUVecz.md) linking metadata is found to @@ -422,12 +421,12 @@ the conditions listed above hold, then the kernel is emitted using the vector kernel only. 
It is assumed that if no scalar kernel is found it is because targets know that one is not required. -### Scalar only +#### Scalar only If the [WorkItemLoopsPass](#workitemloopspass) is run on a scalar kernel then only the scalar kernel is used. -# OptimalBuiltinReplacementPass +### OptimalBuiltinReplacementPass The `OptimalBuiltinReplacementPass` is an optimization call-graph pass designed to replace calls to builtin functions with optimal equivalents. This is only @@ -473,7 +472,7 @@ The `__abacus_fmin` and `__abacus_fmax` builtins can be exchanged for hardware intrinsics: `llvm.minnum` and `llvm.maxnum`. This is not performed on ARM targets due to LLVM backend compiler bugs. -# RunVeczPass +### RunVeczPass The `RunVeczPass` module pass provides a wrapper for using our [vecz](SYCLNativeCPUVecz.md) IR vectorizer. This vectorizes the @@ -493,7 +492,7 @@ the kernel in any case. If successful, this will return a new vectorized kernel function created in the LLVM module so that this vectorized kernel is used instead of our scalar kernel from here on. -## Cost Model Interface +#### Cost Model Interface User cost-modelling in vecz can be handled by the `vecz::VeczPassOptionsAnalsis` which takes a user defined query function @@ -561,7 +560,7 @@ provided. The Cost Model header file resides at `utils/cost_model.h`. -# DefineMuxBuiltinsPass +### DefineMuxBuiltinsPass The `DefineMuxBuiltinsPass` performs a scan over all functions in the module, calling `BuiltinInfo::defineMuxBuiltin` on all mux builtin @@ -574,7 +573,7 @@ end of the module\'s list of functions so that the the lowering of `__mux_get_global_id` which calls `__mux_get_local_id`, among other functions. -# ReplaceLocalModuleScopeVariablesPass +### ReplaceLocalModuleScopeVariablesPass The `ReplaceLocalModuleScopeVariables` pass identifies global variables in the local address space and places them in a struct called @@ -596,7 +595,7 @@ instructions referencing the matching struct member instead. Finally the identified global variables are removed once all of their uses have been replaced. -# PrepareBarriersPass +### PrepareBarriersPass The `PrepareBarriersPass` is useful in order to satisfy the requirements the [WorkItemLoopsPass](#workitemloopspass) has on kernels containing @@ -610,7 +609,7 @@ the vectorizer preserves in each vectorized kernel, meaning the `WorkItemLoopsPass` can correctly schedule the work-item loops for each barrier region. -# Metadata Utilities +### Metadata Utilities There are several key pieces of metadata used for inter-communication between the Native CPU passes. @@ -621,7 +620,7 @@ number of operands, types of operands, etc., utility functions names and/or operands of these metadata is **not** guaranteed to be stable. -# Attribute Utilities +### Attribute Utilities There are several key attributes used for inter-communication between the Native CPU passes. @@ -647,7 +646,7 @@ example: function `ToF`, if present on the old function. Overwrites any such metadata in the new function. -# Sub-groups +### Sub-groups A implementation of SPIR-V sub-group builtins is provided by the default compiler pipeline. @@ -672,9 +671,9 @@ If a target wishes to provide their own sub-group implementation they should provide a derived `BIMuxInfoConcept` and override `defineMuxBuiltin` for the sub-group builtins. -# LLVM intermediate representation +### LLVM intermediate representation -## Mangling +#### Mangling Mangling is used by the vectorizer to declare, define and use internal overloaded builtin functions. 
In general, the mangling scheme follows [Appendix @@ -682,7 +681,7 @@ A of the SPIR 1.2 specification](https://www.khronos.org/registry/SPIR/specs/spir_spec-1.2.pdf). itself an extension of the Itanium C++ mangling scheme. -## Vector Types +##### Vector Types The Itanium specification under-specifies vector types in general, so vendors are left to establish their own system. In the vectorizer, fixed-length vector @@ -720,7 +719,7 @@ Example: define void @__vecz_b_interleaved_storeV_u6nxv16dPU3AS1d( %0, double addrspace(1)* %1, i64 %2) { ``` -# Builtins +#### Builtins The Following intermediate representations are used in the interface to Native CPU. Some of these may not be relevant for Native CPU, and may exist from the time this was part of the `oneAPI Construction Kit`. @@ -820,7 +819,7 @@ The Following intermediate representations are used in the interface to Native C as ``__mux_mem_barrier(i32 %scope, i32 %semantics)``. See `below <#memory-and-control-barriers>`__ for more information. -## Group operation builtins +##### Group operation builtins Native CPU defines a variety of builtins to handle operations across a sub-group, work-group, or *vector group*. @@ -859,7 +858,7 @@ The groups are defined as: corresponding ``sub-group`` builtins with a sub-group size of 1. -### ``any``/``all`` builtins +##### ``any``/``all`` builtins The ``any`` and ``all`` builtins return ``true`` if any/all of their operands are ``true`` and ``false`` otherwise. @@ -870,7 +869,7 @@ are ``true`` and ``false`` otherwise. i1 @__mux_vec_group_any_v4i1(<4 x i1> %x) ``` -### ``broadcast`` builtins +##### ``broadcast`` builtins The ``broadcast`` builtins broadcast the value corresponding to the local ID to the result of all invocations in the group. The sub-group version of this @@ -886,7 +885,7 @@ the value to broadcast. Unused indices (e.g., in lower-dimension kernels) i64 @__mux_vec_group_broadcast_v2i64(<2 x i64> %val, i32 %vec_id) ``` -### ``reduce`` and ``scan`` builtins +##### ``reduce`` and ``scan`` builtins The ``reduce`` and ``scan`` builtins return the result of the group operation for all values of their parameters specified by invocations in the group. @@ -924,7 +923,7 @@ Examples: ``` -### Sub-group ``shuffle`` builtin +##### Sub-group ``shuffle`` builtin The ``sub_group_shuffle`` builtin allows data to be arbitrarily transferred between invocations in a sub-group. The data that is returned for this @@ -936,7 +935,7 @@ invocation is the value of ``%val`` for the invocation identified by ``%lid``. i32 @__mux_sub_group_shuffle_i32(i32 %val, i32 %lid) ``` -### Sub-group ``shuffle_up`` builtin +##### Sub-group ``shuffle_up`` builtin The ``sub_group_shuffle_up`` builtin allows data to be transferred from an invocation in the sub-group with a lower sub-group local invocation ID up to an @@ -968,7 +967,7 @@ All other values of the shuffle index are considered to be out-of-range. i8 @__mux_sub_group_shuffle_up_i8(i8 %prev, i8 %curr, i32 %delta) ``` -### Sub-group ``shuffle_down`` builtin +##### Sub-group ``shuffle_down`` builtin The ``sub_group_shuffle_down`` builtin allows data to be transferred from an invocation in the sub-group with a higher sub-group local invocation ID down to @@ -1000,7 +999,7 @@ All other values of the shuffle index are considered to be out-of-range. 
float @__mux_sub_group_shuffle_down_f32(float %curr, float %next, i32 %delta) ``` -### Sub-group ``shuffle_xor`` builtin +##### Sub-group ``shuffle_xor`` builtin These ``sub_group_shuffle_xor`` builtin allows for efficient sharing of data between items within a sub-group. @@ -1015,7 +1014,7 @@ out-of-range. double @__mux_sub_group_shuffle_xor_f64(double %val, i32 %xor_val) ``` -### Memory and Control Barriers +##### Memory and Control Barriers The mux barrier builtins synchronize both memory and execution flow. @@ -1060,7 +1059,7 @@ bitfield. }; ``` -### Atomics and Fences +##### Atomics and Fences The LLVM intermediate representation stored in ``compiler::BaseModule::finalized_llvm_module`` **may** contain any of the @@ -1095,7 +1094,7 @@ No lock free requirements are made on the above atomic instructions. A target **may** choose to provide a software implementation of the atomic instructions via some other mechanism such as a hardware mutex. -## Metadata +### Metadata The following table describes metadata which can be introduced at different stages of the pipeline: @@ -1126,7 +1125,7 @@ accessing, setting, or updating each piece of metadata. |------|--------|-------------| |``!mux-scheduling-params``|string, string, ...| A list of scheduling parameter names used by this target. Emitted into the module at the time scheduling parameters are added to functions that requires them. The indices found in ``!mux_scheduled_fn`` function metadata are indices into this list. -## Function Attributes +### Function Attributes The following table describes function attributes which can be introduced at different stages of the pipeline: @@ -1150,7 +1149,7 @@ different stages of the pipeline: | ``"mux-barrier-schedule"="val"``| Typically found on call sites. Determines the ordering of work-item execution after a berrier. See the `BarrierSchedule` enum. | | ``"mux-no-subgroups"``| Marks the function as not explicitly using sub-groups (e.g., identified by the use of known mux sub-group builtins). If a pass introduces the explicit use of sub-groups to a function, it should remove this attribute. | -### mux-kernel attribute +#### mux-kernel attribute SYCL programs generally consist of a number of *kernel functions*, which have a certain programming model and may be a subset of all functions in the diff --git a/sycl/doc/design/SYCLNativeCPUVecz.md b/sycl/doc/design/SYCLNativeCPUVecz.md index 9796188255b79..2368476be684a 100644 --- a/sycl/doc/design/SYCLNativeCPUVecz.md +++ b/sycl/doc/design/SYCLNativeCPUVecz.md @@ -1,4 +1,4 @@ -# Vecz Documentation +# SYCL Native CPU Vecz Codeplay's Vecz is a library based on LLVM that allows vectorization of SPMD programs such as Native CPU kernels. From e801dba55ab15d21d4586f844ab7632812b0adc1 Mon Sep 17 00:00:00 2001 From: Colin Davidson Date: Tue, 16 Sep 2025 09:53:08 +0100 Subject: [PATCH 05/14] Remove png output --- sycl/doc/conf.py | 2 -- 1 file changed, 2 deletions(-) diff --git a/sycl/doc/conf.py b/sycl/doc/conf.py index a65ea7ff2601d..7180fa41cc941 100644 --- a/sycl/doc/conf.py +++ b/sycl/doc/conf.py @@ -50,8 +50,6 @@ # Make the GitHub-compatible syntax also work with MyST myst_fence_as_directive = ["mermaid"] -mermaid_output_format = 'png' - exclude_patterns = [ # Extensions are mostly in asciidoc which has poor support in Sphinx. 
"extensions/*", From 0a936a48e609f40f6e513063c4984cafb70ed06f Mon Sep 17 00:00:00 2001 From: Colin Davidson Date: Tue, 16 Sep 2025 12:27:41 +0100 Subject: [PATCH 06/14] Remove mermaid dependency in favour of images --- llvm/docs/requirements.txt | 1 - sycl/doc/conf.py | 5 +- sycl/doc/design/SYCLNativeCPUPipeline.md | 71 +++--------------- sycl/doc/design/images/native_cpu_vecz.jpg | Bin 0 -> 46693 bytes .../images/native_cpu_wi_loops_barrier.jpg | Bin 0 -> 22777 bytes 5 files changed, 11 insertions(+), 66 deletions(-) create mode 100644 sycl/doc/design/images/native_cpu_vecz.jpg create mode 100644 sycl/doc/design/images/native_cpu_wi_loops_barrier.jpg diff --git a/llvm/docs/requirements.txt b/llvm/docs/requirements.txt index 36371f16e3769..14f34a5465e44 100644 --- a/llvm/docs/requirements.txt +++ b/llvm/docs/requirements.txt @@ -8,4 +8,3 @@ sphinxcontrib-applehelp==2.0.0 sphinx-reredirects==0.1.6 furo==2025.7.19 myst-parser==4.0.0 -sphinxcontrib-mermaid==1.0.0 diff --git a/sycl/doc/conf.py b/sycl/doc/conf.py index 7180fa41cc941..27e73ee3d3ad0 100644 --- a/sycl/doc/conf.py +++ b/sycl/doc/conf.py @@ -32,7 +32,7 @@ # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. -extensions = ["myst_parser", "sphinxcontrib.mermaid"] +extensions = ["myst_parser"] # Implicit targets for cross reference myst_heading_anchors = 5 @@ -47,9 +47,6 @@ # The suffix of source filenames. source_suffix = [".rst", ".md"] -# Make the GitHub-compatible syntax also work with MyST -myst_fence_as_directive = ["mermaid"] - exclude_patterns = [ # Extensions are mostly in asciidoc which has poor support in Sphinx. "extensions/*", diff --git a/sycl/doc/design/SYCLNativeCPUPipeline.md b/sycl/doc/design/SYCLNativeCPUPipeline.md index 80938a2adf2fb..b8356386d4e47 100644 --- a/sycl/doc/design/SYCLNativeCPUPipeline.md +++ b/sycl/doc/design/SYCLNativeCPUPipeline.md @@ -29,20 +29,10 @@ on the input and output kernels is as follows: 4. The final compiled kernel is assumed to be invoked from the host-side runtime once per *work-group* in the **NDRange**. -The following diagram provides an overview of the main phases of the -Native CPU compiler pipeline in terms of the underlying and assumed -kernel execution model. - -```mermaid -flowchart TD; - Start(["Driver Entry Point"]) - Start-->WiLoop["for (wi : wg)"] - WiLoop-->OrigKernel["original_kernel()"] -``` - The inner-most function is the original input kernel, which is *wrapped* by new functions in successive phases, until it is ready in a form to be -executed by the Native CPU driver. +executed by the Native CPU driver. These include effectively wrapping a `for (wi : wg)` +around the original kernel. The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) is the key pass which makes some of the implicit parallelism @@ -55,23 +45,12 @@ the new kernel entry point now runs on every work-group in an With the overall execution model established, we can start to dive deeper into the key phases of the compilation pipeline. -```mermaid -flowchart TD; - InputIR(["Input IR"]) - SpecConstants(["Handling SpecConstants"]) - Metadata(["Adding Metadata/Attributes"]) - Vecz(["Vectorization"]) - WorkItemLoops(["Work Item Loops / Barriers"]) - DefineBuiltins(["Define builtins and Tidy up"]) - - InputIR-->SpecConstants - SpecConstants-->Metadata - Metadata-->Vecz - Vecz-->WorkItemLoops - WorkItemLoops-->DefineBuiltins - DefineBuiltins-->TidyUp -``` - +1. InputIR +2. Handling SpecConstants +3. 
Adding Metadata / Attributes +4. Vectorization +5. Work item loops and barriers +6. Define builtins and tidy up. ### Input IR @@ -152,15 +131,7 @@ work-group need to execute the first barrier region before beginning the second. Thus the `WorkItemLoopsPass` produces two sets of work-item loops to schedule this kernel: -```mermaid -graph TD; - A(["@foo.mux-barrier-wrapper()"]) - A-->B{{"for (wi : wg)"}} - B-->C[["@foo.mux-barrier-region.0()
a[id] += 4;"]] - C-->D["fence"]; - D-->E{{"for (wi : wg)"}} - E-->F[["@foo.mux-barrier-region.1()
b[id] += 4;"]] -``` +![Work Item Loops with barrier.](images/native_cpu_wi_loops_barrier.jpg) #### Live Variables @@ -193,29 +164,7 @@ For brevity, the diagram below only details in inner-most work-item loops. Most kernels will in reality have 2 outer levels of loops over the full *Y* and *Z* work-group dimensions. -```mermaid -flowchart TD; - Start("@foo.mux-barrier-wrapper()") - OrigKernel0[["@foo()"]] - OrigKernel1[["@__vecz_v4_foo()"]] - Link1("`unsigned i = 0; - unsigned wg_size = get\_local\_size(0); - unsigned peel = wg\_size % 4;`") - ScalarPH{{"\< scalar check \>"}} - VectorPH("for (unsigned e = wg\_size - peel; i \< e; i += 4)") - Link2("for (; i< wg_size; i++)") - Return("return") - - Start-->Link1 - Link1-->|"if (wg_size != peel)"|VectorPH - Link1-->|"if (wg\_size == peel)"|ScalarPH - ScalarPH-->|"if (peel)"|Link2 - Link2-->OrigKernel0 - OrigKernel0-->Return - OrigKernel1-->ScalarPH - ScalarPH-->|"if (!peel)"|Return - VectorPH-->OrigKernel1 -``` +![Work Item Loops with vecz.](images/native_cpu_vecz.jpg) In the above example, the vectorized kernel is called to execute as many work-items as possible, up to the largest multiple of the vectorization diff --git a/sycl/doc/design/images/native_cpu_vecz.jpg b/sycl/doc/design/images/native_cpu_vecz.jpg new file mode 100644 index 0000000000000000000000000000000000000000..91015aae3feb8d9c449a6fff6e9056fb9877ef9f GIT binary patch literal 46693 zcmdRWcUY6#vUiZ)rArOH7il70M4E^+=}nq60Re%~jG%yY0R;i2BORo7klv&>>4Yjh zK}vv-e6RcLbMHR;JLlYc&-wniU&wlRAZyJ#v&yWQ-^>K_9kT+ueqU8X6@-O_1$qqp zfH3nQWf1nYYro#W3kP`P65!(E;NTMB*8o(nSJ^>*SG0>srI_Me}Hug0f>|b01y@P8wjH9IFa?{j`Z zVNqpObxmzueM4hMXIFPmZ(skw_{8MY^vw6!IoRsj`o`wgkL?}A(a+pvv>lUx*lT-R`LuyOEz$%S>z7iicNIJj)0c$7-I_|_g&>|((L)OX{vD%uG- z#P#4bHlAZdw44$!F2paKjT72;o2G^6evJy=UJZ(-C#DvewKByCru}gzwpTA41B%SWfI{LCX7RUHHB zh_enupRF+c6FTJoGdfV1Bl55O==|a*176)sj{zzBgknIic%jH5K)^pq1BSnGR6PD& z*qrayZ7oQA39S*NsU-;pl+^@C-u_V$WdQyorQ81srMrKo^a|kVU!wF^o5{D?9I27;c4fn$rSVL4^Xxijgia*>& zp>PLs42VVzB_#4YiQ!))Fra^n#&2x=k7%_0Eg!)CSov==;)|w@`%RcOxcx6-G%z4y z&Gw9ADMu@$_^-WPqKLdF0lv8Aw2UMKBXuz#f-qpx?`X+lKv*0Y5RrTsWmL)hIZg~L*p+)d%zgCes$w589hsWOMxx@+|+Xh=}~q zlh;+~>#@H;#<(05b`@Y)J@I`1zU*LAi>TxWT27k;lO|UU z(TVnyAwP|~q)yAWb;)=dTBUCLQ9vfR!5Ew@%}NU@7^1u5{mxK3G|(syGJh}Jc+4&4 zAWc@L+hapUFwMBoegVsV;NwkLIO|)bauIm=K2PR|U#T%9L#J-Sy;M#uAiKkLpXVmK zn&)Wy);DZ5s#g4E`+hkPLGJonmLZjAOYF$`l{O+5UsJ>U5(zrPFj1dv2e~gM$)NlU zF}7Hy5b{HM2AmgB}J#eBBSe}*xjpD zT-i)neu1G91xX~d+xG<>lx)}M==)9`24qj}^?8xJW^>vk(lqmR2gSz$6n{hE55^(mxAo+)D%~PeytK{Lw#ckKkES;|{ zx#G1;YN=TAu&5Lh5+ut$TRRiGT8OX1r>FP5^IXD+yb$#iZW#yL7cg&{e(i5rm_JlB z_pr7~I$@f(t5V%Sr*hxWJPfoR9F5Rj2=s*8wETPFNelQBwTHA$eEYQ3L|H9_j@FteKOj+linc5`<9 zAbIo)^0u*>A^l6MJ;PKM#p<3sSIKv^DU9#>srzEF$(xQ)i&X}Dt+$cFm-lA84W{x! 
zW|V6$-LjK?=x+UN$0H_uK&e_OcD*#rfOTa}0-)NI-P~)19CR<3L%x`Ak73Cva=FNw zDA{bOO%a;DJv!*&_3FzY!#%R`n3~zQcP)^BiOBr!@#*tpLA_nw0 zMi~QAGnsKCN4D|3cwm2V&N?q%ZC*|Gj1LivkS@O)@~+{XS5}=0{74aZ$>!urrz7VxhP=hx z2h(tE59}uUilf3LRfEN7#B$_u3^|M^7$sf=b#g?Lj!Kl?V~T1L61!oK-%9_8!ouQi zxW~fz{XwOfxb{y1FnV}TbvjS&44c@Ft2OHvb=%4A3CpMwKZobe4@c+EZrwV`F&*{( z(MkzfgXzF~YCKjgY|UC+wC>9iGV~aixj)$UIlJTMRKY}9(%v2iViDMx<6S3Cm%UWd z^j%X7U6O+@hP^6Yv@|8=#NQJfTMAF&BY&$JIGTAK3FgBwIF}IUxyn8XdbIsBjSl{l z)+u1U?%jA$rH|vQHQXh}LcW8S!6q+o%Rw=YKB%76`*mO^|jFJ&Qkq%8NU z$G7TJmky+-P=&TCLKOT9;-R|L>yd*sNWWDm3yKUL-gz?Sj*==H2`F;*WoX%Gu83jA zT^t@|vK78EHkhiUf+f+q$ue z5I?mel@RRVs3SmpAqiENF26BVedW0ei|e++QdwE_^KPR;mg~FTT6apgN;3Z3eD0PW z_%!B`@QQhKs9sYr&B+k;f*YNJHnJObNC{~iT%Aq~$b$?T%pBYsgZR&A)i>3SGYsw( z)}&iLF8>+dxGUKqKC5!>-OurD^!GOLue#&;9<4$2Nwldblsj4(lFA26# z5-j;HFiCZ2d8s0-1w12g6vb0Chm<;Xsbx^vqjci_z9V{-(rE(T^Barq1UFr+;mTJ0 zye^Xpd%=j3o!$>rTv@(`3_I%`^3Hbta{u(!oKUt?yc`{#L4(notl5Zz+#AF|cgzHP#S{MF?)Qxq^iznU9ibv~ z*Dc9V6^Wbluw)sci&MG$v2(#Di-Ppit~2^b5PycqN?U=(N&=5bu9 z*G(u|;rY;Hk$i(au?tN#8E6fIM#xN3A}_mAt01=y)#_wULrFuv*~xZ^Dl<2W4YJy; z>$o#|}*v^khH6}r38=Q^=CIjL>kk1N{qHOLqJK}>8}%&FG)?DynLJKe3pbf~9d&J?ir6nF(LIGJP4$@% zAI@`od>lKLX2Bey!v0y4@)ijRH5IQtw45O;MDxL8a@{f(=n*bY%G&Y;_45%F3-mPd zxFzE(2LlRLL-nmYD64oaImAJK=uSZ4ei+ak?@ze^4Crob3gG08pvTmz(8o|XF9!6% zQ#>oX;ZVG{?U;AF4Jn5K?Nj@u%LA^a+5^Br+`@nyb7nChe?=rU^qVyO=$QH0U!J-A z+cV`xSO5n@rI^9LoZ}Ar7yIWE{}sikB)Zf8j>)FWj6$ zm*UX4;GblwfaxN?2{}WDGK*tC_!(E(aev_@<1f4fC?2X|K-c2V!#QLa1k!((BJD3y zm>8w=BSPaMB!6X^cO3H*II|K0>j{mlgazL{qzQJZL6;l&>PSDy6W86)s=>QP4A3)-i*@WZ@4KBX`% zI<-`Odj3S$?0APvQXhW^4B7HGynA_fZyFJUI7nk zX=sRDJv!L_<}*#Jd!9`7R&3sg)mn`!0_2gHfK`9jfHMczr`IKI(g^Pb-!PdXj0nMV z_0tIh>IdMgi464Xa!Tm(tqfGWg_VBkY|t*Mqt{R685CX(oa8Klu8gyxd<=;57YgtY zdO_I?Koo&<&?6P_|JR+ZL?QF1jYthxTbzu!e_T{fZL;#jNeoz*)Y?<4WzhWlssbBV zrEA`XhE-zgqZt5I`kSe;;tbTgN?}8WUg|iuQP{%6ipj&6#}CcN*Uj8-ZLTUdRot7u z-~XCC3IC8H!SUjg+u)(qCauEGgg1ettIYJp_a-y*uEcwGF*!CRSoDTi?^a?7aC~W= z)GH!Qfx|WWepUN&a{}-k0DT*puXsgh$%z5Ax+fI1p=c>F6$ zAK3XbL%@#^`vbwle^BBFZh#o*hnKS^HK~5 zl?1*9ooNLy>g?6hLU$4PAO@si-|00C=pFcdBHrwm|F9hmS)ny+dat6MLAM`G?Rfup zSvO}7)LAYIygSiWB?>6@_svx28IBlGT$|X!P_*m0XiUSPLwSv__Zpf9>9Tr)Fp3jK z$qGkLE8P4NGUs$cxDo3@Xd4P!iZ6Lub*&^HcLIl{M)(T!>K+nN4r$Y?o@ka|sTXl~ zfKPu~Rh9DZorS41QodM&1V=p6!IgbB`*d%9ko77jqf2w;N1$k?*^)C0$5>l!_A8Sq zBW)F}fiP;Qurg;>#QEKqpneYa>K7C!?kN;@d3#WS`K%TtL{n|_{=E9$Qcz~vkb4S# zU*Oc7|3{`^^PW#|&2CepPI5w|>X|BBJGw+-|E8&hu}heN`T;i8mh#Qeeq43N`)s(L zjHz)De}*{>s3#3dfDE}boT+v-(+mLPeRAGxKRkQWS3kQ>Kqs1_`Iu1IlYoini^6S4 zlmQ_(OFa2?Kdu1M$n8H9miLfzA=l z;PkUa(&yG5eO1p5%A{a>%&ehnSfX%!=X8k)DUjdqVKa^t0Iah)PW8(tv-z6NOWOTE z1COprxt@D1N3NoEQkfy%tzRuxW` z+8>qBY+D)(E57GddoJJ(OZmw_t-_YP<={XKSbY))Yx5MkdqWh5^i_XDd>zx)!P={l z2YZm>PTD1rakGa_jgwDd4n|_zzCTU0;=gMU3t;VtzcOB;Kw5l;h&&3olsD?~t*ln1 zVkuoEp8xTZlj+Uo`*%{8?M3a9I-6KWs!tmx(Hp9d_1QVR zqY(%rpYqT_@^tzib}TBkr!xn$P3N?=wALf=a@phWYw-_xj%U&v=*!|@#XcL1KDy`P zk)8o4(2)PoW*gq*=Yp0v7*KhTCI+O^ZDHWq7>-I3`A==Q{U>cm^dDpYo>u&qw4sM0 zeV}^~{O~Is!)6*y^O?u8u^CRIq2uXD014T%l6*FM8|fW9&)agz-EGmSGcRkPH%y&Q zYpK+7kYer>U)^kdgq|fd;slk6vrs>~>4h>7*1+0(ZWP;#km^Y#k9;1aXf|&aV%#Q4dL@0<+PzA)sdBLX0q9xbz}1pV5`4KQ#=WPN7{NSr0=OyO_1W-~#l_Q5 zcjoSzgHJcZ*@@q9zO8=a**Em2@tHKti?!m|f(E8QjZ9p{fT+w(e=LztJZIFjAn#0l zk`;iv?q%cWX5$b$z3dXlbyw|d{#(xm<3mD;Kte=auOh`doHXeZUU2>*B)sFN%Z#t@ zttw6C;#`E<>*xo0ljkp8g~=Z7FSzmmv4L#IGd3dP61CCe@`5;ticcX8k*seVzu`iZ+&V|2N5E-p&A`2m{Q;{LQ+ z{dfxH_eIG$oS<*5I=#u8b_%L{BxA!Ny`z&xk%3e})~z>D7FvF;Wck#MdKy9@mF>Kj zd~-V`n!-x-dS>_>JcUDu!`@{|i1r$7QU{NB) zZdSs{zR#7B)V}r^X}?xZy=!&7*&Sv@HNS0*k|#@tFK&DEMR}iwrlbrsRp?&UV0k`M 
z4*W*YxvBMC(BGS87^Q-s&4Piixy(%`=4q7f7}9#0n{6ZPst=0P%|RL$Anklfm?a0Y ztzdTu~*Pa{$uVi zW=4A&Mfceh`Owz=Dz4n}+&{L0{a)n&>($P=^5)U#yF@R(ZH*zJj^43jLqQhSG*kFf zhNVy!=Ht2~4L^H#Z|PvJrQk7(j@GZ9Vs~rhNxd4$-?XwX?wo$vL6eM1>`Ts5Et(>e zG*fKNSq#OW1jl;xzzhKJpEvu%ZxOgZo#wKl1`B1hCS9=N?y?>RH0EJ=T$hZr$AEGx zVGX;0l?K86hk?6aK|Q}Gw+vA1BAC#{}&qY*2s}jWFKszm&Uks({-G6vxwAAxeHaNE%Fa+ zj;NTKfEuC(oOs-&n-|FLmM1lAF^lfK(;fH#mik3DzoU(^^D4q?RVDVUGs)jF%Oe-* zT0`9u@EFl!397U5Ks5s?qmgQdPh^CaG8UDaQTC2eo z_Qd_>Ed_nrZy=>%uU>^Q6=8#b&=JO{(||Oss_`<3HVQS@RZ;>;OTua?Ys<=<16<+!gH^%e0Xz8C{~YT~?|*Hep-R*h~ywSWA8 zTgtxuZR@GDDeXY3W|Wnx@Of89MMx7E>L(O9I-Whhdh3zv9^NB|m%y$_ ztyz8@XtXxMh%EFzFAY=@&Tw)Sg>TGhiYrk!%|2bz;0mGG#I+Jnc$A=Fossn;uP<@b zSMEUjO{!6stRG4-+X|VeJmt>r;a<=dE>hY>l3BZdM!O4!AKOShXNot^h=4k1>G$eF zlD4`m1*a#)bDAfrbzPWrstfu2IK*Y|pU2!epTuec+0E4JYA0PrZpAO5xDX$$ zs7{vYoh&8BRx@-$1Y{xJisTwT)3^FZ5*bU^ZAs33WIZ+LFLS7z74d4)+^<8z&N^8H zUnx~0aKARU)@7(^_S@7SjY}oXsXpTntjJ;;SE{GP=QJwTRpTdq=S@o1Ycl3A?k(V0 z!>q7M2y~p5X z-)-)uM!wl!A*rvwU=V@#lZxvV5Gr)biO3Zljkm$T4*1BpnGgfAyhuhzMb_mtw{?nK zaV3-fC7#9m&#)7LK}qPj@C^7=s2u~!(>E%dNpi-3jx5R)_sm{lK=(hFYhoj`Q6k3- zv!8^hb@E0h`4*L`%vpzpPQMX)jx=pV`H@u}&LhTK#E1zzO^ zIw3<9qTpLyb*a$s86`wnv5PBBiRjR28nIueoc%(7t&yFjF@NY&(h@RN!Hk+tc5}$9 zOZWDOtj>~8vS^CrW&l54n^33M;ajT#u`cC?l1NYm&sKeJ7vu@L#8U)+PmCb}+bu~ip)bLBE zi=Zo6UlvUB!HQbQ&0tl=0K0>)X{W$JO!NUT(^ug0$b^mo@{rW6h#G+_!&AVdLU4b+ zh`vQDgFCfwIz!4Jn3h{dN%>lpU?D0xHx7UI1@4!I>Eq&0He)cla?QjTKBFonTHj%H2A(j(&(d}^aqq#xXLN##Cl^246$IP~lnR}u=j>77$8>gijW=LPc*39g3~$ZXhvPecYM?G<6-D!)oDks9 zlZdC))8npXr4k=b7-IytO~a~uI3KU~vQ=3m^&hR=->CESc`2>Hh|H36M-0q&d33nu zOre>bWI{TI>U+2v8r}*AtoaB$xm@u+pYfipUZyjjby8rftgW0{NuL>M{-8|pTF!{8 zH)({vlQPyTNiTP^s+j|Rp%|?F*(G(e#jCX27B;U))Rp?=l?P|oZLasL@d;Go4z{7T zb@b-zZban%7r{GdzA52qMUtJ1n3&h8rTd>g&RjRX_v}oJoz^&c@ePqe&59ugL|+4? z3lu?gpKM@2Az!BXyVfj7c}TgHT!l2g?Qbdc|~ddrQ9!lJ`!`QS(fx)|Fu zr3p7V&AjCu3i)QbWq8aEZO`JsM|IBB1=T$NkS+Ra4KeO3WEE>tlVyS0tGk+4_ua{t zf6&$$GVoUOChNZTb71kJkKD)M<;_f5abRm|d6|PC?tLK-&kt`F5eXG;7Ctjw@JUM5 z4tIMV74anLNwh>pi&-uqwM)YwwXA_(KZht}ckK!tvo_?=aoWlOR}3qw7GQKLvqKp4 zM_u0Cd?d_XXC+OCD*KF)OWGH#%S(^{XLW-fB(ehjd7&!ZTF^g9o|Pj62d z4LJ@h_F&)Np^w0U8F&Q`f3AVYnYq(gL83a*Je6Z7LZ;(O%La7*H=Dzp*z^kJ`0e$n zU~k$T^scQl*%hlxNHc?bM+3bPah)j&!()!RNWW%lh4i6%4ew1(Lqy%#Ypt#NJqwPf zf!Xyjba$)A(L@m9)rI@Ly~!7!lY^kegp9+@nuC3-sf)vYQG*|g@>RaqgCbA%%{m#9 zUny`QIlA5oZMHZ+pMejT%G_ai%sj3xoWiaHqRbI^uS2cXg!L%iNaHHwpn6*(h~0|1 zWh7X_RoyxDMer%EiSD?vps0pN^I@bkfxX2`R3!0rX~U-k&1_fYFz5}G7F;cCZh*Ef zY~Ertd(vQT`upN}wt8e1=?3A7?_7Ov>YX0~vo^~cJZaVa)o!yB@`!y)(R{}x{wZ!` z;hF{eT#$gp)MG8F&QZ(x(8R%N4IEE7&-X-J=_cyraf#OQF0+}Gf)o+Pa#?qLut6U^ zomaUPB6*8{%i%0D0x}X4x>vL>&L-QA%?+lE1PY-2HgKM+fE}v~Vg9A0BfxHIJ5Dl} zlDCQEEeB-+M%Zs7|K9I%!6^WSY%+&_ufs>b*RssFGy#-ZzfU~|RN+Q`#h=G7`YZ`;3YLUjioElssCWHow?gr0VGEm~ z&X#J_u{>=HnY`v|WH3=Z&m8nL{pTuo4(#;fY{!P6O>N2)HBHY3FM)01TPZhseK!CH zvZ4@tit@{JN;0^IJ{4|UPzM2eTu!l?o}QCO74e`{GrZN z$@7e=pvKdsk$rKYYOv#b+n@GN_hUR?Jh^>ax4+x8{Np~FYe#&0iCNQ%uIh&+R zj$5$;#ypPsPof)?H?I?k-Q-Ld`bvQuSyN~#gk1K>7-p<8TXpLGpgk?LbHW%X{`Baw z+z|Qpaw2$mEXoTJ(B-RZy5k9>Zg(hC(k%n2j z`?o^yp=L_0{r&yC;c(ksqGFd~<7LaG*?G(P1&dj+ys2irVhy)%S(ECS8yb2yu%koB zm92zb+E|(mygk+|%rvIFGiraBP7P<|LkxR6supLLdtF76R?YJ2Y&hQ*uap!~3B2_y z*GFpKtVgyXVs%P_Pe?^;>)YukgoX+BoC>**iCJn?Akb*Ekfq}bzR3~}WF5RtS)Gel zV*d%P)U#&_Cff6B=WQ#*$7Q`eGQ%y3?WEeq@Pi6lzg3#C)Qb_l@l~Um24H7lMVR-= zHHZc$J;mG6mN*2 zqk{wngyRCGL^+fUWsj`3=Q>+5!B10sfzU4Hr7uHR&hrHI)yMklp>ki+Y`*t$9;_zI z{E!9T9kUF&RD_g2LX5P#`BNpX-i#IsF7pug zgK4$8ngj!rCy(ZvvlEku8$9o$!H-$kNrQWm*V7-$|)gx8cR!!^SQ;OR5a*+6m5^^m^(%w-B#bPLALQ>xX)z z-eBhA!M6QDd9_Q*ORI?u_)=9)X-!OydFA55)rZc8N6baL)CG1=`L{M%Z#ix@HZsRq 
zHcbic2@A0{G=(e$^po5g(_}Ax2s0VTPuvPuj}zU zdy=*+>d7+pssG9=H_{T?K+{9F1?%1XyQy_8v6TKiuLBcU*8=JdC1mjpyzXPP_xPAYd%V_9C73xjmpUTRpggbQXO&_{mYdl7z;{w_L&9eh-@cgYuG{x%#Pn@01$Z{_LM+Ty?7Ix*e(o97MevD*FA><-9X!0!X6qWSN|l zL-sC%d$=7!(7)Q9yUs;8-t|V4Sp-)kYq8hnZ8EuJJT50f{Fg2}{Cu~$Wj``+wSS_$ z>|~K2&24O>O&70E-_r4pm+MP)J(`iL!xP;33_EuEHLT(((@Rz*y8?J#%X-r9WO}ToP2B@TKDG2eO zKffG{QJs^Yr}T-p7$iNA>$XqaRFr(aOpcNJ)iOQ%8RttD`jq`20_jOUkPz9;kMeF*6eIS<8V z3N2gKM6y|D?WkWs-=NfNv(*BOQOkS*+W6snauRk=_(QS8u`7j+UXXxdcq zrQiPWwyghCy9UlRRYJT^3D-0vn)q@Z62|d;)~oL2X7tj?bVXc`(Y^QBl>q<7akot5 zW5~{nK^8LYuv1IIOVy$|t+LOO67S2d4-ZxMZixFqM9&{s&qwdA$?wE<$6>o9KrVvZ zl}058CnulhR??5^s+Q%nDk2;^&F>Dry{S${aQmy=g%I_*Kh`D~K9Y)KC}T&PnqFlsADUk1!hF&) z9+b7L!#^WMvKM~VG~@Z@D^A9{Xe`BGG{HPlKt+qNS-)(zl|d74!bo{?ywCKU%P0Oa z&2izQo#(=VbpK`xd!)3E`ARb*JQ9qz>~xeL8K_w_>-43XmccD6Fvis_)F=3L41>7{ z=-aq$!B*-{pgs(VGJ-8CtEL|8^br3rC3L8KmWbnaL*Tkkf|wYdDui_<=04x75D12{ zf`_XiENU}we2wOaW{tJ-*v1)dpPAg}pftPQ_mz!8EhJNHpV7pxY+!hpSz(yi%yF6# z!eN{fF+kz(DO{jNa@k9yqw2>4xQ8d z3GQhu`g07&Y<$TImW)mq4BINvRiRm-{xgX`{!R0zc=hD&0P68J>@do(u_t@i>9f@~ zaelY2fQ?8s=Lk>I(3pl{->`Ufa36pRm8>_`SCpnq=sv=LYO4>=F(3eBZ5u>OKwX`d z0pQUE1A-=_T=EkvqMr`FGf>+7Y@7zckqUf2j2fR8dpl*%Sq2qtQnhj=SP2;lb;ckKZCdVt(sN`c9`frYOe9%F zn_5NW(&iV9snGmf`==4Z@&c#n)S8dA$t~Yj$N*_p`0#{kO}HZ7BW`Tp)cX zD-%cxtOTRET@|o#DeCah8X&IFwvrLvdSk3y(n7Q20=Fe5_lb#}j5+6u z4~Scn>LqAB0n{%Uu|oc8j;*R`I#3BF&(5rE81JT7#V|k{rS+PZV!i-W2A7WS@l_uQ z(P6bx3A4r$%3{JAL9iM_pbb6o<*D0A#z<@LXq2=Ko(5&s48 ze{YumFX9!=)^N3sC06*!RC-31nBmuj*qG>>W#MQV zhY%o(O9!ZEGy%%ws7zSI^;fK}_pX^e>wFMTIni~!CxzvXVpFpMJV80*eEr+Qz-pT1 z-`7ULVF6HMf7+E2(Er?6tiZ-{Daej1F+}^GRd^|Ei2!ATRqfDYUPncwU^sws=lq!g z3zgpmiAydAyKSdx#YR0zO7;HK5?W+eLR%r|*svR_gKL(ZZQ!hB>gm2IPzxw} zJJ!Dk`~1ECmFcezzY$(#pUu<`arll z+NjPod?JwNsbCg84)a}@p~i>U9$6?WHU^|E7FU3803#GTHkuZ$C*t0>+$aw+fH{g&XLLJO9`wwU307X zGU(Q-)#cvUfByTsHja^w1`Z;3v@efEnp&(fK7>GCiMcow=_1j}&@w4U=rb-oHZEOV zxM?%w@wxThor)cVShLXNzVWA6dm5Mh8O5Ink*iY|vR!?@K)tl;-P-!fDlwJf%ECCi zxCmCY#b&94_*tLO+i?l1Ap8PdbDg$&4)#?&#bt>$H^5#1LR*JbcbM~NHca!ojH@#k z9WCin%2P2QliJ3{y7vU4{PX8#oe8rz=l#*xqg%`u;5u^*Xr!y{Y&Mf@JaQGi=mjW# zO5n16688^3DcWg7rK%?TK_zBLNlQdzi=+h&9#!_RO`$IDjUf z=1obYx%%|GhZHo_BsKbOC3X(ALEKMJtOwwwl_A z&q=C33Nn*;eqfc!u9Rg+)^)bX>NKmy>CKR&F0S_7>=&A#Zw4*_^Qmdk{ToFX5Q{~b z0)yo?fqIbt=EU8OkCe&26zzOq_1yT@NS9FxS_uOKqN1DBBp_a_$v*mr|_eM`tsSh4jFLR9?Fk9G3Wbsf>h**tlo6eHpR_uS%RE_;G;}V-XCHffHSt{ z%H3Z*ERXU3D4ODZCD8rTg#vu;)Y|R(xemCMICbQiE@r^x zkc5D!3|dI?gN7$@tJxMm80?B3%JsRieyO%+3FT~f2fm{fBx}8LSp=3jJ{X zcF-P~UwLseUGJ-E@C6*WlvPUodb^-6^#kuB@JfBTYx2!ZoRivYKIM%HZBDxG&&oB} z)HV-PJMWmuPTwVFx_%R1!#g`rxwani+%=%1*8hQneQC>dsFz?x*KGp_vu=_*y=-dZ zFK?5_ydV$n0#bXM31Q{5@?p_9OOy!aHM1KzVhwaYN1%4*+IND3y-96iGeWR7Du|T9 zOxekh^K5uX8Z14*K00!_*q1_cL#5(65cg=kekq9 zPSx#y7KUa(sKn6DcPBBZwbxDdb3DZTOO%jx6Bu z>RS!cAGd3D=To$cpXxRh?%rW)iwTsO3-1C}*Ll@(z}ShQ8E;%5n0dX&y>hLeo>?v@ zsc+82Z%gtkBtX^iM7^Q(>` z{`7E0H+5ji;_eiSY(QPagD*MzO9kup%C6=w{pQ3+4H`iR_qq2^!JTM4EkkIPQ{3v6 zq_EvjlEB)!;Hp`4DRx`1zzziLgSyi}TOLMhjfW z_UrgB)i%WnJ3e*!UkM=SU?}MI%u%+~4I$Y0NkEY9S^^gTf}nNqn-EoOyxoz}>CZ7M zZRAKVjSS84=R)F|6Fh?Um6cVoH|zv&&d1-u=XezlEL+_wvuu~;wJ`;bMTjW?c96Y> zC~dfscgn0SsaFx=%}>SkMGY$GexTFQ;15p_(7XfhY9y<1_sHIErG}TKc)?1Q_mlh; zBUsR`mLgmajPS(WSx(h@>^T#-SW_?5I@5{Zhp$SvUflsA`jmvj+u0LO*Xm5ji*a-a z*G~$bTZOmM!e?7#)WWKRc$W&za_p_C6&>*l5}jkdZnZjMcca)Yi$!MjT%bs_i(@9M z_pSY$D$NYneu|6G@-MqR&)vgLYnLgWxA^5nJUvdMnUHYhUiz*=mL6?1Wc@1S8>Pez zA{u=VQ=;Y2KQ6}qgB#5ffMAU}WI~CL!|54~)6oX<79%`gv;al0be9i zQG57R-{V8G_ERP4wdeY({sGeEN|`uckE#FhGscttNzm&L)(`fGAdq3M=qPX1HY+D$ zALlM}uzP+E&huyXI}E>}iJ0jXsdkIIHMTx~?D`2%5|;M%rCYLz)j>wzR$)fFY|+pKTZ#9k0XI8WdwRbWF0&MOT!m$AIi 
zS0SHFB!f!I^Y$0LmB-uym}%Ca@Vr0@pcr63e{*?F4aH}L108L0Il&Sd?tRnBXR++MPMcFg3eX(7T{1T6azM8MjY_{6jrFW zmOxopdGqOgfga#k-$!zwL+7=J+Ir&tI(`NipR-dBJx5|da2g=|hYM7vU5?mgH@l;$ z%R>}@SX86tQBKg`$6EawTLi;^w$^~8DgbHbq>DSjfXM!-3}YweiYOEm3EVF0Awa4` z&{n0cB~bW8WYi8m7M^?I3cv;(@8N1!#I+e|&=~L^B^Ci{)n=9gfxsj#-Gl-+Qf-wk z?BtLHlHVne>7uVp!VvUnubAd@&JCh;Q@)Pm8`iO!?&WoWv1nur*X{(9IkvJ+i%9PK z%uNP00hX21Mv@QL0-cQ}?i2pDy@E0+K*(Ug^f&({W~*I**>N9;7cSl0rq#fk3su+rJm- zz)wdik|Uy4IaUsgCF=V}Zcuz%z7#FLTP_d3gTLC_m4uT3eyHDjCF^i8ll5C+2!1k3 z8<_uO_%gb2hZzIfo@?aaQs9O5Ya-n-APaBuEAFz`mXUK0HIxXD<@7dx;$nx|hy!7} zYM26#gQ2TdOLcQ@8s8aLbNeZ%cCULhXRFTA z8}JAUvTm*mLrUM#MehReU>LP`v+M< zWM36RGRkC0a{n~L)%Wwg(xOi0)xIPWTS^L!>v?zd>&5vu)SrT!lL`?c9Slr&wbZ&z zxijcv`CctL-|V`j*);I9iVSSKFVE;=dt1T+Yh2gZ^enDf6lSKWS?bc||1i!4P`4}- z%gZX)w7)8H$buFqCMNZ55q5gWPtPZl9|mww;APhl7qERO3~D&?$Pi#UZ>mTaU5W~B zQ}F__^x6nqvL2?bmQv}EaB=K4MvJZA!%9A=xD0+*cjim7CdnO-!0%XMox)4YLnfKkXS2nzz*dP@@IXJYVKPh zIsVvglXX4}zs3tp_2L#13e>+8%lTfo6_+~=61RplMtdx}`sqqhpQqd*Z;}lN*o?|6 zUnS1?S-g=TDQ9-6%f;I?isY*_AjTyVgr*Z)aHJ|XFhhz*UxZXOD zeXWY#=2gd^`@nFl`ykbEKP$m0*El;yvpM#TfsU94PbM|!LV#b9)-j3~v3nFx#!RU9YyiShAnznu>ZJu=N)yg1 zcBXB#czs*U!t&dfu9VyKKCI+j6W8`>28IyCK#lpRh=jbF>h$v~BOO}yeM#eK@ek{^ z>u+ypvGSoyuB$J29lf=kcySX+Y%*;-4uqu{k)D@^3CfhlX3o<`4-drOQxqq0K1Re& zmc869D3L(SgykVPAbxVnc?zvLLU}@asy|2^xP|5$y~gYKUIIypSluX1z!P0rzN80C z>MMH7ZcBN<+jV#a-Fq?t6ua36kgAF!AGJbLDv>$>9=$>E_Yx9+E@NW>C0_cg)_pwt z43OUZ3-*~Tqjbo1X!isXC~~zj1YfAW|MC1+8YI@nGPUA*9tyZV${NGbiAHMXhu!4o zd{CfLWVb8}$Y!1zGXmUZVIb@r3LJht>IqtuQ%4 z#BPriS}|dG@Zn@NvjQo@mfjl)v3p@_`-X@F0%41>Xxq)l1J7K3FbLo?Hl-|_Yo7YT zOQu^D;nb$5>DxiJD|879*xmb1L83wm z=a?}k{Wjpix~7ABX?9Lx)S(!T(P2ZP#E52z}-g~SpkOL(y#ni)Y*7~lszAN6k7-O6mSZ3sEqZN@( z%sZX)f+NforA4*_iLTma07aH2p)@~7FSBM*|~4O1shT;Dp&w*aMlC1X>QOWaDJ!mTQtzf z9L>aHb{~fX75<_~f4K#htqN_Z-Y^ZJz2xgiZa0{&d6W7WCrL2Y{cs(KI9L%r5#J5I zjaGMtXHzIwE<0bWM4p%iDpl7v&u>I2IOB@cl@hMWr|@KCh6;ZbnOoA%p9`Ca45r;W zGi)UYk>xJ+tLjU)jPmXX<{LgPhJd}eno}L!Cq&g*o538RevUC#x8wqzLTt;?u0yLu zaS)-hAxen67Oj@Qc-f=id0!{1wbpSFw#)n49G8W`W=Bz%o>BM3s?83;3uRAceJo!u zd^CXsdn)@|KUh~<+Na*r-R8>nR#8w1!KHa9Z%ZsX=9P_a0iuMM&SF#@0o;w$?-f^g z3hh1gQ>FCz*=#9U0gf@*;%U~~#ZSU-UESkb*(~*FYmR3|*CD^6>jrKH$Wfvt2b&vJ zfZk}>Q~umc#}BFQvPu0x|5>K%EmG#Rix?s$E53^M-at0S!>MnpXeQ-NY%COWAKW<1 z8+I*tf!V3Z%WP-{(Kz;+vzjew*7Ae#e?(1j*m9qdl?Z>Ry6VdXg1fs>JEon4K|(zi zLg+w5JaaumaAqYMsyC~9psTa4SYolP=#8oaCEMKV?D=S;q-S3kEkbU;DBQc`p}d&q%Jp76V2| z9+99?G4b#bZm3~7mfXw_s#-e(YycX*eJSq#iBDLg#(%1}tS1zt_*)>Bl|)4EbK{Xc zR<2Uj;YF*EV z{3^^7CJlNIukroullO5yjEZ!-x#ep=J zh2Z|$g?*BtgWzyJ<-PvnT>sN`ToqUDgkiguh)Z<5ACvAVXxYCY_Rv{A>!~Y!FIBI zBV~^@0bbM3KZOTa+nOTh{a$x{koOz^8T9!fBBN*vwn8GSx*8Y_@b@_wq%-&l@QW?E z4N6lFeTxqIr0i!PktK>3PSZs9k~2m2T2J};iq+Q(s6IN9G<$7=5dD(TraiP@!nY~1 zSlL}-un}?l9?qS<$ku(j#T8labj>QOs;$P0xU@jG62~(hj`00y{gk_gmm}LM?B5yA ztKcbbvzSr#Ju)gMcS^rO?a?WLj`O{y9OT2|b$J{ji8~D83u)ZERB^(VkI;PjlMP+B z9DT9$&Mzkg?~vKJupQkzR`~MvxG?wUTKLnwx-GuFmnek;%MSh`T;GKQk%;=q zWz(v?GpE{;nqb6&kG#=_^tGUsPK{ti9m6!)QU?uc>zr?udt9O`nd5r0Ad$kaJjnC* zhhxr|6|@_sUlg@>wuFyQCD!#;GBskSOaD)(RLO~3#=TYT426u&=Yg< zk#_1qy}XjCcWz@Pfms&cSeMfx^P>a!X}ow<(kYP;xsmy{-S*sm#rzxxNBd+xpX%+d zl-->8WYtZ+bhKCTCRqjM5$aPNVui#R%N)eJz-)O*_BfqcxV?_=CjGK;qT_X8gIlV% zDA_+L3%sCtu+eWaampR;^T0%?cKB5NzH8E+@1unqU&F%B-?E9VrLu<6OG%Xl_)K-q z>{Y77(B~NwhRd^gUOP7mf^P=GC{S~2NvoiKu4tC0e)c+dLqg(fhofMVyMjL8@K%kA ztU|eXAHlTiLqRS~ZXDdBmT3lxHr!V~*h(+nF4xSxIJxh}Uf&w^Q~iGDm}iZYHF2B% zsf4ryOd<4JM1@He8Cn^+HeezdiE;BOK{uIo)=BSHoRU219ibDqm5|uxvNyGTCVKdj zt?aV$&HlJxJkqNCB#<#4tqmpRhm=fV=iJI&6PfPGOa{HOjnEdk&#b5=x=S!cslnW? 
z#a~(TILdfwociH0EPJfe%x57wXV7FNksjhs8oiTRFjUZf;`no&GskJl&k zAt5ORhh)SQw-CJspSX=~>zQ)EGB?(?NA2&6&AC(wWlikSC~&D5=*K3@TmZH_olhXr2jKIcH%5;V$+!0hK{%%jK0-V z`{{Ot&gEN;>0$4;ylj1MYvE0OEKjH=E`RWrbuRq~{4Hf+d*&2^4rE5Oei1W$PNUUc ztW8C*8Rk60o7CF;I*RAhFQkls7A5*e{g~SCtiS6d3(#I`G61u6DtT9&ZUU}*jAEXf5eSoLCNbGM0@f`w0YX8(LVIQlBstaHp{oNe?$2wYc zv7`V{8ASau@^(R%Cef##(qT+#OMC9at# zGw?(H3Tp~Q`j?(m0d=>H&NvcP*At!g{!Xx`|G-(f9z_}-RJhV_U(*q`xA9hn3%#UP zyRS?5nZt(M=?4`~R4x_otI``d%=iNT;CwK$kVFO*#Y4n>bDd~uwH;v@x?2KF>{>Ld z7Bn2UDY|*T_AK;Y7foNvd&NjByx5!YIGRZ%SBKn6j}!;Dy66wSAP*?u5}Wsv#16Uu z?0rgxHHLA2HDz>Su+&1?KVNkI3Zn6U^sgcrWY4-k0Cq+ZN)l&dPlA2}`*FJCR{P&i z7NSJ)l1|Hr*P%1%zmJ9R8kBGtAwE-8ZW~8pua;I+=N}^6xz2-<6M5A7(G@3|m5J#Y zO4nHqti%eD5hOqfHo1i!kCO!apk7j^*&0gJ(N%X|(Q3)vnqyJ8YC_>+BY|CXb$=m# zuj?)7SQ7-W3B8=#nKf9kusr$+J_oEM4weJAFNdH%d_N0y!8`=mE&#L&Ez?+Kkr!Vj zbbbWF0G3uFnQzlif3nk+ezFU0G9VMW&FN{y-!At2`^Dz}bTQ#Pmen4s+82KryQuR= zkvamRE2vI#Da5=?ZCj8{kL5JZp5-er=*}5=0(FQ9c5T5<~$y& z0br#BZHq^;5tnWVf-B0u3kUkoCYHg9^LIwE5CFR60%Q$c+B?ts@5W5$S5lFjHIPmQ zJ(S-+#Z27!Qy)D9`uYC_Aw;Iz#jE^=bfzoYmGk>@jJ8rAod_5ln`y{yf?ujWq}EZV zoK4gZ!0`=mp)zE4nl7K4OZ+C##aP`8nET_v;`JhtFyG#-H7Dsfhd{@8W$_fc;&!<{ zb=75mbqe?pG;kcI{(n8|v5bcU(M6|wlDd{l9RpO|YR9}sD>vhJCs@!My4gP(W;f~s z?U+(-zIgkMx$~KG4R)mtdkJ0PdL`yiez125K4`i+bogJLET!0anfUo_<-mt9(A{mP ztAAp<@jn>A(6`k#LL$5_2z z|5M|iVf@XYFbTjxBOe67b%vmmv94qsNMIi3ISLddtA&ivoLWbbj7L{E&N!tmd`t=O z)PO{C@I_K}vMV0lOd8StYIAtT{Plz^YMFQTco>NI7`w4*ypzQ-D>qENky6X3!XEpn z`2*ke!Z!~33fCp2*D2mX0M7wg?*S#+VhxV#6kz`?RUSYtinf*aD8LD_wI&~Cept7$ zN&fJ>ai5*7H9I;R#eoh(b`4OX3cooSGPry%DgZmb8lr5_E^`rsfFHlN^fR?U_!_GGd?bc zhK;kV(}=nmc#7c5sbJa1yW$L9E2)_XM+6&xn)Z`=#a2iqjpB%AoO+`aJluZV4?wmU3=9 zSG!KF%KE&$^%CwDD-I$;XJ||xyI86rulg5czWN24&F6Jy-OZh4PeG`xi>k0kkLbJ+ zVM>3;A;eT_a?+$orS9?P=BkXabYEZH}0V(IvEdEr~lhB^5eFV?2O>(Ex`(V@CF zfqW?>c(Pu@)RgR#v!t|Ol3$~A_FI9ysqz<}&K39)5s%2Li9g4beUD){22uEhPxLN` z50(TAY7{W~Ud2tDmIT|~k{GyM*TgS(z^nIHn?+btheRa-zihGBqI+7B!E5~n-~C*I~gEHsyIKMGB^!M zhbG48&KcoOCbJ&(P+coT(IN~ok<&1x4DI@^P8Unvc(sNItw@`MBENNSJQ^kte`O#I zefKaO6xmtNr~&GyHK_T&VOU+Rwco7M_(ts%^u))M9MgYkl4#pafq zO$;w%xjrSBjk09z@3kV9zFGOpZe15?;ni1VuO|4!J(PK4ysuw$XP%)dtg>Fl;lQ>2 z$^h?DL4ln*a*aYZn$=$S6(42W>uYSavQJSR1xUlV=Z(`R@r3BAC&;Dc23xlZ)4q-= zwxBw$p#(aw#dy2=NL6!>sXXM!xvlewC(|X0`phQ( z(FJ$Y={#PoHd_b1f^1UN>&H?}akbA8-gzS=t>{MlI8J1d`$d3qw%Uc~KZQ-nULuAE zi;zfHb~)}h&wK$DwPh8S1IxmgJa4V)E5CgW^_cY%P~w*Up<4d3E&j`o&i6sPrWEmR z{xGGW0aZmDFH2;sIdnn=DvG5xe2NrtHi4Wn#C9ZK#rD$uY;4sMxl46oV4zEQDh``^ z$TWhs`|rz=B*%_K|xhs72 zX>g1!$#GkfH#9WFBL7b^0)~mW4?uu<2QvXczKS^hr?Ii7{lRypbER6Q3qrW<;i`#u z*!CWv+3(|~1D4DiryXp7!g)5b9CV!w-3p}yTwAGSWLNBJ$I>`IyWkA1zu{!P!`fp2 zAHjcITL}aztP8c@WGOz4_jj>(r!7#@WM_`~_#-G2tTZrZo;Cq^Wy1X#H99WX*Tw)c&H$HMb+>@67 zQL(nxGXkgW&3~qslRZOx8_Y)@uM_)ak>+DF_Xz-^+$uSAV@oTN5wao_xu-QE`Wp`a zC#Fj;i_m{7z~isPE-ZnoPXOvkbjE8hz(N6B)|!qMVHW=XIQi<=Nx-c59g6$UdF6%W zLcR8L-5;Ft9$nJiM%j3Wf9WUGTWpC;d<+`wI2yB3M;RD35b|ri5ua$#pi1HhwoT;$ za+a?0ndm8xPQ~Yl#Z4mB_@Lrg>zd^$Gc63i^TEaKlZneKK=B<0!T<(4fv5%{Ga#8@ z9*jxS+HQb!o?xx48}pj8(MGV9W6quIvkO3_PWXAn&AEYu1o` z3w+dgB#N&$02&Fo@3#;hS8Ou=BOm}yK=>)7$f4CfM7+&K zMyn}qe)*1g{3rvDJ_vQR$@9C!KwXG`zj=GbmE3>nCh(s-@&9HlO3wfs^V9TR>;NqE zg2#rTY!%^9j7_;t-gXlNDubmsm9#0Ed0n1~@wz9Zj&p zzu_e8^jV(#048Ftse$4d=Dh$dwg*_QFC?ZrcELSlzt|*Q3A#Q5)XM4z%cv@daGr@` zfiiLHenlWnT9H6E7iapDB=%SGd`ylUZ?4~@ct<$yys;B&ub-hf15tHd7Mk7l$7&`SFbV>Z4`K85>^U)6+jyQWhdZUZhsC7ANza(vxa528uyUX)1|mM*c^AzL!L@41T>zk&k_jBaT~Te-JuiQ%j;ls z<6m4H78yK>f4_g&x6e2DP&VoZCJ=>K>BioH)lcPXlqwe(8Gku6)x;eQ^mVstsMBMo zhtLR`{=jqh>h)B2f1n$6D3cwpi&nti=8^S*!{L`#>`KzxibX3#=>5IW7ti9XVlS*X(GttAnJS`Saa9Y4)0gn$0K1~oJO0*PMo0WVWP0mGHLLCDAF}QduExOQNSHqJ 
z2Z&Su%)Zshv$G5a>>ckhtv&;3{mxJk6S@2%r?DVB7>8>jwE$4!gpRK{=9MY4{E!BU zZve(}4N232PTnwj0yGoM?`Y>=_qP83f)137qL$t)DWu-?kef4k%W>Ii|Q&jdE~*&f54gkw$K6WTm2V zwb0^DwTHKdoBRQ!sbeATajBqHBo8fIL7O8`9gQ{4dCIvxkXi*<7zKT9f~%U)evC6a z4G*3F3yBT-3Pg5%dS@6M%woFYK)$&wJ}I!L(UE5nR5|~$9m+MYm-?K>Qj)hjFQMr> zgV|^PKp(G803J(tZh3)j0Fc7BI}<-5o}yLPEF;|2E4NMNQ!Dw_?m;o|y#5N{Zt%1&2+JOz;!f@q+y1~X0I{at~x=RZ*Dc^-}xer7Bi)`a`H4qa6 zjbb?*Kzr58(w_D|$(NAKAby+F>C@ z-_1302-t|O?~wSZAU>m2VcF**zFw;~tsz05DpTOgjf*SA%&UdA1QVlx{W6PksOog7 z={Fo&^ZLfgarqREcX45}OnT67^Jx>`tR}9!sCF-|4Va_bxqjSu(0n~4 z<&FZM`Xs|#(v9b$365Co;`l$U$DGT9BMxB7#g1SGfQ)NL#z%|DF}-I-Zm4L$=}4B$ zd%Uf3Jt236UXP@f(~vRaCpRujo1&l|t!xr}eFj2_3`$8|<3y&0LS!7G-JAntA00M~ z^||_rh@B#@A4GV{b7T%t&pED82ZPXqd*0^YK{>|x$|~x7QzCQSWq&S4Ra?SJbx2N9 z$Rxu~VfcO6i?_9cUJ>+EpLZ(|?!$O0efhjNiZbFkER4v-Q|BOX?=tr!#?7Cl&Mi2l zRU^sBY>X*k%i!a};512jt*gELEsxLi;sHae*fit>;v*K^iF$C=6j32&Q}5WYa(ijC z(V9O^ByFAN`6An%!E@Y0Dkg^0Qsr3NN0qs=CkFI4QhA9!>NgF6B50e0O&2i_k@aV{ zo8zSN<}1>cBD=M5*AxA^D{(c@%m^(qv}MoH7A*ZKx~z|3zob;2xj!)|ZkuShQvK0B zWq@R=d)GanK8_L-tci$7pEcS+o|!O4?|3h$8)$2pE8vu~$a5tbl;cxJKv-VgiKqJocH2Fen4* z=>>CWtO=dtEkeJl^#K>ctJJE*QK^{ME?v8vR=}F$4l6Wd;YPckKy#hpzr50`{|d+d z{haKj%2@VL6_t|ZIzT~_CgwP?9ECWyk;~+G6AV&2W;i;ajg9+}9ur>ZfMn`LEqa6Zde$zosg`Ju z??)48#dfdXaOA7VpsSTofI1vY;l>1LD9(K=XufPfIp&+(EIJwKNCiSmKMSq(f|E_;ObJun)rM1F4je zOQpT_@zuOs!LK~Y|1Y=pp9er+!?0UUWPW$7aJ(&-Lx97Rk|yn*Tf*FsY3Pz-fDhHY zbiMgp=MSuWAQS#jNV`HwhF?`frf>y&%WL@jie>_HdyBzCT>7uX?{RZ?%&({nAE?cg z*$QT<2LJL?RsV69`|JO>vFMr{3+G^R$K&Q0fGgGOMuio_)fQ(+--mGMMk&GDOErIH z=Un7zhbw%X!F@{&0Igi2W6Qh&PWIb6_ zO-TEQiu%uXDf{{Y0%LoFvO8@NGKE=Bp1JyRzQwur5vX13`d@=2|DBxr4G$z>#xs@W zLKSOmwXKo4SvONXRbuXqD7(wEl}|ln2~Iy*={<{^=ocOJhr6}7AY>xdhaD#vl(`jo@QgtDnS6?#d(+&prlN08UsQn=$ z?-oyi9bw^zFd)8F_S!=tMfXEOqSq3cxE0LaIO`qi!x{A;uf0yy^G$fz9eokm zk2Y*2cDnTP26(+GS={Ki+kj!`fb2=$Z{Q|s9=?aiWSMAl1^ZHyIf zr^aw}&IvZA?SCY`{uPHe=J4P~XN7qT*ii;k`49~pweYInmIzFlV{ZN>>q z4E|I(`JbT0e*=vDbj_3J~;b%yf0^&F+e?S zlymNBYqT0NNV+z@eHO*@yIk@U>-In{g#8-X1$BKqh6RX)`&(Wh@y7ux*$if<;X+D(E!inh~-y znlu5fo~a408!d@_q+X<=buplS_!K{t?<4Z&niTs~@t+=kn7UhXDeSj)YKxL6EbiEk zL}c)d8btE1b1m)t;Ex!KdrB44qR%_hOm;{!kA2VkgM$u~%Q`3C| z5-QW?4!UB8{h4<4&+z+$9s&jJ?7?E{dL_DL`#c=W;|IYRjSY*ayK4LM4@{!M#}7%A z#)--tbpsvOjGw27CVN}G>psIHcXC3RG{#(y;wkudUn>G!)N`!*rHs(c1sTry)-mQU z2^;sm36k_}IO0^Z-N{D0577V08Z}P{4;Krkn`KjG2-DZThl87g+n|qb$REE@EP7jy zezH&%u`Zb~*cISFIMC02wRxxv@~8P!-5L6{rm4N*ora4C!x&Zc)Ks`0mNn|sad8{@ zW6|{DH=I6`6=8K7Ek!C_(fdNo7>#MjHHi#I9};X~xl`7&nd3~sxk^FT7^^2456|^@ zd`4w>V`9?lRx6(NFAQ|d!QR%ZTaxkBUL_97Ow^B;geHmdz9H6>)8i*Ni+7Z@T_q#M zJgkj^^)qO67QAajbG)NYSppzSswjoQ?__C&awnkI7KS;dl_`6nYTh6X+< z&bC8$H}J_=w?gmOftJOkzmp1J#hor9VjHe#<&g!_1D&CP%*jaUc6TQNPo+J6iSiGA z-S4c@A1^i3eAG*V*v`9&FWGv7@Xef;nWud26k2YRTa;fHRYe}gko-&PVRp}RiAQdWS3y*0nen{lOl*o#3;;S?yjvw!(MRARe0Cs zM?3S4sS3Ebys6IIw@Ge%oF?<(mFm#7IDBirzEz=^hAL8kV{w3s>LQ|%I=riL+&;4 zA6$}nw3&*Td#auz_e)BaHtz;jZN1wx{NS57B{=Xo@|ujg3boS9)pyr$)~|Qlpnap{ zZlKQwL^1MNKact@F_uM+HFVSd$^4x4ncto({GjA;VNA4<(Zpavv__9Pm&$z`>rB72 z<)tx5TI#dksFd6$c#k$*qv=>v7m#_Xg>P=`_tfmQBS5jsH^u+?wS-?`^DEXv14YaC z5fB>fMK81XZh(JbnJuCldUFlM66ztCfspYN+iX|xD)omwHNQ$*C*$YzYSiL=0{+mC zREvSBBhTOfKN8_)EA5-2xno6SWd!cRkpBygz$GR8*!=v`Xwk>vg5L5Mf_g zXUs)EZK>;%$h+aHDKPtc93dfYqA#hcb@AOTD6dk%aLQQ$e#%}8oLTQS^T$ivZ{2p6 z&2=ii49YVq{sX*Bi_7K?CR`2p>**7xmvZl``W82|pt!E8xEff1fM5Qr*q!ZL*j%IY8(vuSq-S)1FQwFYNWuN7sm zzguM2zOM8NG8SfrGZ5r97!VOD)<){d-p+Ow#H>a1CS-r` zsbWW-?VgpPh5A#ogeC!eyWQ9e;e2`dNkpZ3TiSeX@W@RDk3%K8VKVGlWq~JiY`{hg zM%W)QO8BCpUA`C{Kql(*CEcPw%K-^}$zf|v$BkEO)w~+z(b0el21Xqy7|6J?;`nGr zX9^;3#h~t=zdb)l4_D&gdo7Z9hqqhy#c0pZwp%|j^vI3AY_io%F#$oo`>ra^8I6E0 
zJW7U0x941!+Z}9GYY22=1;TP>l-(#t`N}=mQtZ86<#?(-2)ieOr2$s4XUp5c#A$C$ z$&Nej)!B5sk8uy>S3S)32EstdQgWc_KQpZ&am1Zs-?G;^?7G06;O@Kkttopl!{XhOagpw=(dIdAW#juxO;@c-5X6`t- zQFaK56#V1kn>fvvt%l`>qHC6T+|gZi`**rBROYG$0BrkLMJ?>eas2JI0`>BAkp~an6r}!L@rWQzQqT&N;MbVf*QLUC}0}CMVJueIja1t$QVI>te z0p)A%OdZhWNVP=mq6Pt{k}bmWg2xchcm#mVs5NisAMUXL+(Sqi8(h_DiFxbkpCM+L z<+K2b%I?1lzakitY)vT*W4JJ!*~9P5r1WYvpg6i?p?l{@LY|ZApt1~0of-+<%(Vpk zbB!|qo7PkS_~;~P8c=lS)tLgef3tfN~f{^9yfMALM$`9nwHfCd7qyMvkOj&kg z)UiO4=V*V|v33=pXPPcQ->7L}iKzsG`euyjZL z)IpZi3vr*BJ>o|%d!zl2K>OdmhA`SYCsOkMpwTS)zZW_Bk0KlY5_#*7Bm1k}w1DCX zHe3?IKE5wImi|N5o+*o~Yets&#coBJaN^aPl(13mji(8dEqvr_p!>%aLVq1p(=1{n z2*Y;Pv=IBV#G}#)7M1W|?~6ki+mqy2YMfglUo@qYCe(uGlNo1YW}Kb5yMlBVJ&F(8 zzHhTdvPQB+(nw7`DrN1~md25}6+~qi{_ko||8$+1W)*@N$mc8^4}trS<#i7VsQ2HB z+OY{7zbSG_8KR3k7J!WazejhYJXkCPxDZ*Y5IT@l0p9=O#S}N z`#k+2lots0;dp4^|H>Es%S}jiqCQyDaHONa+-+6Hj=c@csEzl~3J+)lO=T-tkHE|G zNVHFqJGI!e-T%S5%ZvC`*`hX;@z9wY4SFt&q zwkb|8mc#J_@3pgT%S^0QHKhL9u^Wa1cxxwke#0>V!@07t{>m4yuw}EZs9+hVQ8|tp zq-HGsTt)lG+^)8xE{L*cfV#XI!CJM54q`D%kPFb>DMSoN{4<* zecJA7@Sip`pAT=f;pz1w*pdvs5{nUR>uxf61sWH2c0abo~FkWIs z-(1_7df+6%*;w%AEJe25a}C`nN2OLOo7Ux zTl|@AOs&_uiW;C!AkHpN`3i|&_~<|<%ZyKjgn~eLu50Q&QId+9_vD1TO78OOOi|PS zt7@a9L~E%bU2q$W_-yU0pUTctTlqH z@hTsCo4&&C5}e$$i`8iDi(QP{-CLTpmfENYxLPXkt?~D?clGKS&;^i(fEi#vI4H$I zK5$l?l@&*YUm7*mw6ssl>gNk)dU4ik2z{#(XorOr;4qhPc&W6t-cTesh*}FKg3zPz zVSN)}B?xm*o9TK&fd(E7*yEEwt`G)o#YE_Cm zV*qxTh(Bf7lFufQD98(nxk>D6~Sn!+)|3CjlAJG^eBSr+;y0 z{uSu`Z;vZ25t%z%Knkd(Fa08CXS_de_M_?+&`4|4IqY-=y;zK)MXSs7!9@`Qq88yB zDy?mozTXBy)hUd=R;4WV*`i0Tx$4%9c{N4?g)7Z8zX84ymSUeLR_v}0gw|6 z^CWa}dG7e>-RqZZuZ>o{x(N#XL!IO#P(cliWku+4Sf7ERaCx<{V0x1#yH&D?O!YZZ z>Cr0!&48>SAloJhOarh?u^j*@4;{~dtP@2NGHm-ZK=E|8XrvJi<2D#>ThzNO7(-3kWo zuz^*h0;Ac0n?zT(Mm?y*8Ub1+#cw#Q8>z<$pZ{5`k-v(?0$`i`P$VJX-2aDI!G99V zXTlEv=qw{UdHj^MkrR=!jyLl2Ept>?rTHB+30cDp@l{yK<*T&*BCLN`s2r1D75psB zdxF$~MZ9U|I7s>qenab~2YEEC1d^!Aku=OP4?ETB2Vd|2?O%-oVFg>ac}^$JtpVm7 zWlU(6<+L*URWgbU8})k=^Dns(pergKC{mH^my6to>OVSFenM!AJ>f6iKK19H|2sXz zVAwAU0GaFMP*tY6rpR{HRZ)_aNS|mZxx=*DyYFV4i3?*6=C3(k@lkt!@4nZ~%G~{tTCT`K z2-SyGjjI620tC8_^LS6A)&_b7Kk^nNxl~Z6q-@r2TcCdBX#9Bl=o%iyJLicm`y6XD z?JCdB;l?Q!7u9j-oKnpR(0RY87=i&Gp-`{Db@2H*cQp1qOA-728sIVfuZ#Qs+JAq2 zyg)$CZW_7?+d~Vz;M~FO={n?bfS#-oV53b|&_J^}CI-4r34wU3DIp^kU^M=;-YFWy)4oZYzG*E|Sr>Sd(xZ<{1Lw$8Yfi-{D^UC9O zD;TeTkCb>Wf{5@$gGi??c_Z%b-g<1-f^WDYz6z(!BYLe|UBX@GA3`A}PZkCfkf}Y; z#q|BOoF%lh7KK?wR+Sf3li{5`xKWDs;i(P(jMh?rF#V5^&@qHSwo`Ogs z9)JfHn6qjM#$nPf2>+C$dM(h0H25ob(wUgh7;<@fJcgk87JK`0BS+8ryu*w8HX#N# z>2D1+1#jHaCmU}jmX}?Vk}B_{doAFmHT$f!y)kNJYIkiW=VM`-5|4tI!t2p${0E`; z4$KJpGu=#Qu)=843FRzh=U9;Wd&dL!sl_EN29pnvNm5e!w|JpVUe*;7yRU#cAsbz| zJIatNfny5 zzjm9JW`VjhUp&0*y4on^z=O7PPw5R}0|x|7R+hCjy_lDy`vd8ybA|3;0hG&k6S6nr z5A2@?L5*R@SHVh>S{n+VMjhM=n@B#s?VmrAulH!^JuYq4z_v;DX}!skNg8ZDIGu=s z=oH3k&NbX6DW4kiR5w*7eK}Z4>q<^CQqK6&FU#ZW#m^|dp9gP6ojWN}(6{Z%2vY!C zsS1DI8D%OE*vHT;k*?8NZk5+MGm5h+^{#C}@85TB7H$ukZYH>7 z*;Y5Uw4|x!?P{BTA%5&hegB2t5$VwEiGn$dA0t_)RlHJP;z+u*M@%M)`9ftyVI0rl z&E&JmHylj68O#MMmo?7a2^~nk5lWNm?2-4~DWJfmwBQ*>;-jMh{=%d$kK|tR58u^Q zsW@0rFzB4@B=!rx)aYoQIe0FyY(7?AK^2+tA*Amvp3QB!H*=QiZ;xB~`Z{^gWeBcp z6u5i7)iV6Jg&(BXs_wGPmO|DPV&(pF{PovCFZl*5A(Rb*z%SpB9%E~9%x~7F*z?f@vmM2a8-KuZ0N|tjlSB2N1G|4=U-#pqK!;B-z^r=~-F0>QBxH5voJtIQfeHmS_zE|S z4&cDEM;u(@nFE`1YB|{rz^#@Gtdp=9(1n2W*Ieu#5Um*;BJ>89K%-}4mHbSeWv*`t za~o;95#dcjQJ=P+1#J)bmcB7`q&_O*pjl;EdcEGeez)zUaJz7BxmL=ZvSM zC)X|3CtRT;L3~x0YG1AP@!h5rQ@Y);EJ%wA!5VlrIXdoe>EudOwRy%hrTg=<7JnAX z4U)x2sGBk`Zxg?{amRCs8Q8=-)cXSrI=q~JI<=ZT?X&mzdG9dMCCg!fbf#@oGiQy- 
znybMfuUpx}nPj1Kg4hd)m&sEk#=SdIJ>mY4^|aiL$&XP1od6GM6u99a4GR#w5S=rs$y&Ud zD2d}v;@!8rH_qY(>Tk;!=1wWLg>>)yhI7508lnH*(lPZ;!CZ!!h{?9E!MM;9b&3yS zx+PC*9(J>mx{IimEUr$3aUo|MBqdT|7tn=)Lvd_{zn2H`Do(AUY+`*4IRPWHM+z1- z;if1f*zJFH&|^tf6?z~GE>z}9qux7dxEfoW5FhA2R20*7^|#Ox{`&Uy>n{)KK0lMl zsv3}^nZ`Kh0yS&Lsusl-GqYZ>SSzQ3MZ`<>T<%@Zu&v*%6*qUQha;N$Rzp~RC>OTv3 zofD^Ml=p$jFjkf#a7~NWPggz7WevF;?dfC8N<0{t(U0#k8}m#lBmK!fM{Q0wPSV1Z zBHy)Mj2rMm*p@+I>TXv&Q5|{hmVbYRyAJ-L)d*Ye11X;aLL)auS1n=BfN;^zAMM4=y4t9rH}kXclx_O*;@nfmDaC>2}9TO$e--0FaWd+m|{dky(`Be04j?H(E^S=b9)Tm1c2vG*5w1LE}kVu1B9%Sq5U6_zklFygk=m8L(FfyU^)!HDP-~-S?k+LXp+& z0zlxsARu)J3X0N$_V)a-<%Y%p%zM<^b(>DPd!Q!4*c%Z++ZQ&*X4?t{@r7J!YF zoBHc9=q_|A7;feFM@h@d0a5+omKQXDs5pSAk$??c6oQc~!rVgz%{4_Fmj!+h3mkr) zE+dtY?}7WSr=|xFH*5Cyw#Q>>Xv4ZTZGW{(P6*2z|JMgDASe0`h`Yj%Kc)7iKS|6` ziEjJaug)CGk=dRO;PZcecK!~X>of*8g_|1~UV=XjuQG;ThzL44`fG3@mKc4X?7xgI zkS5cPthyqS24is)0jOf?Z-aeploCJ&!b1Nz7(n97pJZ+Y5~%(>7+|=CpCL9Z6Bth% zwqF$xE(U4qyKw$KasD(K|C@REr-}3D;rL%oobbO*9RH%&D&U=eyC8&NmnO}1Dcz<$ zdgNg%GE)6GiG7$qO!ZcdP;fyQJ;$xnk=#NDX}dRxTp2|sj$3!|=jZX?3Nwq=6VF|& z_Z#Z#zM;;mJXzj{VcuOB13<9ZqgzK4ufT`u<=~w&z>b{W+5u8Ug0HPYgsK-*K`&;d zW1yqImJYpt!}j@Q|NQksJ%Dk3o_(X;Z-lOTQs^_$ED%dRJl&=SmwO;F5t2<|I@6aP zxqwbq=06bs{xL=7fBfGaTz|a|Lx6r)8p{qMaeByh!o6GT{$QuG68AI|2ZM)xPj3vv zQezg*+XRCrXrH9lqJ$cA%qNB%JSN>OvsiL#unKdLWqeW``Xx-L70@C43WK%rLhlY zrvK%Ao)10m^FHtP@&93dbI;5@_uR8wa~{WW9_M-9g?mk14lV20{*cEj5;jF6;#pB}|p_sn25R9h-#96mE@tA9lBF5$xf*t;Hc5#@I$3 zd?7fX5}^&TEx+-a+@|^}H16vy9dWG5HAijJQz8XGq1{5|j!) z;YHjEQ~k&F_q#|XzWTg{>In#8CqlAIi3N#f?-G^k`ccO!(-&Nqhl9N|H)hXfrD zWr|KxS)g8G4Qo=aTuaI=XrZsGY8WV)D{ri5pwi(8q=j9zVXTu?Dib^VF7v%mlqL0) z%y_`345?T=MFn)LOH-M$L^*PEJBIGR#LZi}l|+sg(RKy=I%^lg_E3V)}B zqQrMv2ARv?o{=VlCfSZjTykS^7IyZAE(|;XXIdTjjijfw7py2or~|kG36y0NrpdWFQWfYQAj-!sClZ6?+kTMkyn(ItWK@u z@@q`|x~;nh;d-9Ki-LUQd&mWs+x&5gx0D&@T>~I!KQ$ro%GzW81q|K|pe29_Gb z^@7~dW+qIm=1tG0CVIjudfL%TBAxnXT_1Ya@McnvLe(vvDf-A2o;{NilME8I%_P$@ z1CA8nEJzFe3V2Pg743rf^x$jh@LcIo6Nhb6Ox$w4=PO^+vFMFxBjYQF2@@%3T(%j} zM;kMMPLYOnF!99GB$~>;KPKJRl4vR~OHWt%xDmAO#4=rGI&hvxTvjc#m=M%A!%N4s zcjFVko^6Y&pl9=u=eo%A-J@%zW0@89Pek*7);sjK`!e_Kv80_i%09=+1JSGk1bhVK z*jBJ)veg=T-x5bbxPTM~aHPguF?%Q&<=x@u9MwZqn0?>Mz_h@Y+ROhMJN< zCJ0wYT+k?U`*F7e0sgk(-%3ZUG}dT?2ztFHYuLyw6^A?c8}*?1k74}O=-8oW2F9Ao zva<6$@9pVU@%>hxx1k&R0P-SCNRD5Vtw*ZWjVSnXU`QB4?C)V`)xKBVyvm(4ml3Fq zojrKDYW3^Yx9Vxk(DziRtYJgTe@Y%(J|Qq`gXndX9(sRB5xTXn@|e?iz8SK6dKj{` zpNClV`VWEduaOKk+-bo_3JI2)Yk{n$uI(de7g!0a5=niV!!=5`o_}x=p19lDvpjpL z_fn}99y-UAhrlbT)E^La8UPaI(9@H<+?&07TL8jcULZ9GTNDQ8frU7;yTi% zdn37bKwDNy+>8t>gvEU6Gu+A)k+ZvtWU;SGn6rp4a^atHrP~RFBuHUre^TvoffPn* zp+5OcY*nTuIv?V@T36now*(7+MiS3*GOpq_zHpS=^p~ z#z9BPDOOVa66+p$2)bDZgc+s(BqIE>&zU9EI?WJtX+ocT1pf-u8+D!!r!<3ib{&uY z>xYU%h|rZeRy#P+hoVltKxwoAKPDQ#(J3eBTR+1r=enwhe0bF!9A35Fo;%VwI|he2 zE2rmNr?b5*avw>^m3LX1R^*BvI6&p_GK1yrr*!wnp!3F+c z#|I9VY_Pt1#U|X9Gu@ZBG>;@U6?43W>un3#jLE=VEba1u_J*$VEPr;ATF*Sy(YMDV z$BSNEWI_pkFLXlWIaAKQt=b6SQfO0AhNa8~rkPpm91_@3bq@<)3AoEmm4D{0A_sVX zJ9R!0E}$!Q@;heE-r%{Acdg&|y6z3HI?DKbsXMo8wMJo60v9 zi#vqVpJK*kju39hR%kRAiVKCZi+j-Yo+tLG<6@dvTi`ueetLtMmRVmUrAZs8)#(?I zyb?U_C)K(nV8m9&s67Kgd!pg=m9%LWlbb_#4TiSTie+>uon7e|q>|d9kZTtuv!%mxyxLM?`B$I3*>-$$E`rv?YP+j3%v^&;PB4f@rso)7& zx+?VaA8Qr<&21W8OEHvoW2~Iy59)%vRqa=d>k|IHhRx-R-8t}9f>YKwjvVr2hg4kC zn-JT$5xE^HQ4x01Kv!|{?hm*AIKA8#j!kE)G5yPbOV-85o4$?_M3l+I(%+(=;R9ij zAtskhz4e16_5(2U?#C9KZ`%!BNpav-pv3hJed0|Q=mMk^F%1P4xejLutk7bre>UF4 zxb9OWhJN7V#m}7UB963XDbb2lR9d2qYPFE4DO6~aJk=DKTWkry*?pnpVW*!|!ux;( zGxlF1`ng%VGwN_A)k?z7*m?E2MV4I-rVmHz19`7wkP@`FLpi2o^wLGJpvVVeME~VM%5i@EJ-VeyqW&bvlBRZm9Dw2NzKjXH4nF5)uj@3T 
z%uSO?_JY96Cp4ms@sW8m(+uKgR6q`IidiJRS9G{Cs8rq7!Qsu4l(MrtEMQ-w{^{YI zA-%TxHV&!+T)dJ=QNOyUaZy1IM^Pear#jbq5q<(lQZH10{kAN9_pkW%LOu~CHSJ^zFQ_I~Qj^Q|j zBMA=kTY?qBzsQlj?quBhXtcsM#g^&R@$V(P`ZaPRCa6bD6mbYQs1af(J=ecjw;FF& z6VH(S#UoP1rUn_Y%r(O{@ed6ZensWrGo5`0)(KOkzv%fo7vIc21Jx5hsorQOEo)rb zH8!@EOd*5$SZD5c`3n9gD@q=h$IJqcWa5q-yC%WHYi9aD-PPp$OdO-R zrPaFYvDRoFo$JU=Ug{QL+j@-~q1V*YBJ0vZ_HO%L6gI(ui z5}lp4zFu_K7f#M`Qh~178YDu}PwGwD*e>s_x^@nOd|!veST4mrj~?7>$a}ta0xn#p zRL+X1)o{H)&L;-_w%lCS>47{$^syUGcbUH3U}5PiT#qiJDWxLn=`%RG34U#zgewfn zZRbBO#F$$s@~5cIoqB58Cy;of-@`q1@t1)5>lgpuY+v@=P9NHGqa!jm+@6*Jjd@S5 zwWl%ppU2pBA!&C}E7WvdRjqz&Q*+B59ZZr4MM?_^HKPaf4@1D&mZeHoOx5RP?}yma6(wjXgINkgd-ePah2p$#I&Zv)Kp z?&Fe&9+vTf(mA^?@!=vw`KffJYk#!laGhtwLf`(v}V_`S<532Kj@ZTnCwO-d(y^9k3Zc@jg-Lsi5~xp9u9xGZz(TP z9LPRn01lPlt@aG1p2$7wZPDKq{p>*Yt^)uu1wIS}^ZFr>#6^fx6Roz!SApAH2O1kd z*A#LmL&{sdBvMoT!^lc&p&~P3!u?egI_4>Y?{2~a(2er+QVg~B(%vSQF6j6 znN6m7qFMeR>u>(CMkvTUxgEPJ8XyW_A}=D*yRy>{WEA7=XWaJIl6Ork`H6%rs}utw za>>g;-EYnM*XZ-VZd+a;7*e#VQh=ev6}$Q7&bIR!Dj2n{i|O54@&-^x7GDe`I-h*X z848S!fL*E`H|gTWGB;c+2auBTpFB)P|{;ch_Bf;o{MM#%`Rk z>Uf^qCYeLdQ}}DQeoI|)te5_?y_`l|Y=df%?o|C3u1_RQ^xZ{d*Ff;nQTL)_VDh#l zPrm-dF@DiH8CH0;l0sXfa(e=>q!+(&4q(9yOkeMJJ-X=Q=6%UAC+?p0_-R^D^VeX&UMUNQHv7-!&!9-Ts>I>zS1ingwOLB+p`!6xKbh8 zp(ccfn1I8i+vQQ$u}U#J(K4k{EAjoLrDr|avGG~6k~!QDtdT((iJRRVwVXxVjB`yO ztI;XDra*r5kyKa=8vjg^5@e1e@X^>0^$#X-;RPfdBtobfkeObtzlzQO&HbFSP#TRZ z{EI2GJJr9f65qz1Ad=SfQba;4%t>}Jp%;$lRs&{vbfTGY^Al3??6#=4%-fF;=39|D*LoqQCL1aK5 z3nSA7IeQRxB(OM9R)9E9+2g^#-+`;jfwV^i$c$PVLHs0uq&y-oT;lET|Mj?l*Zldw z2Az=kxQP&OTn{U*Am~w&ZY-kD9*`*K183LoXY|`Mrw^~0L@|yKODu%Ou1S`z||l E2Wjj?qyPW_ literal 0 HcmV?d00001 diff --git a/sycl/doc/design/images/native_cpu_wi_loops_barrier.jpg b/sycl/doc/design/images/native_cpu_wi_loops_barrier.jpg new file mode 100644 index 0000000000000000000000000000000000000000..bbd43b74857b031e65785370b9a9e6e6129dd09f GIT binary patch literal 22777 zcmd?R2RNMF+9*7FiyFNqN)Ro2qDDlEi0GYYL5AqPOc1?H5ClPVqDAk$1<{4*4AIMs zIvC?T+2`Bex8L{O``!C||NhT)&UGHMjO$tLzVEfx-Iwbh*UKQXCyFYHAPfu)&2Vr4i{`vq9Ht@m0$HBqI#v#PR!^OWrc;f~UArTQV2?ZH32{{Q75g9cZ`AteH zDykc#G_=%|v=o$7l)nbSzykVU;}GEB5Ks~m5mWvzKiBOba(ql-&?y!MD+rSu1B)Ev zx*Nm-0%70)(EbAWKOPKB07hIqd;&rupg|282onPf3lkgb7t}y&f1nifkkGL3h{*Va_lZf#A5v1Yb8_?Y3kr)q zS5{Tm)YjEEGwlv4pP2m)FLD4cOl)i{Y`kB1VPJX!1&bUT=e8g&g`5VSx${j{p+J1fM{!vd9RzH` zns6!$moY+Wc99hh#4ptTg4w@@SkQk9v;Rcwzwnv`-N3>C77vRYBn!II%8ndD(1i0z z3FM3vE~eZNe1X66=J>2Zab1Hb)Y|x3uR%!aBXh4DUo>@c&^2h&aTEf#A-x7sDWUiT zaIQh43!kn*%fl?_yA^FKG-GX>cfZ%IT@O=pKbMH0it(oFA=#v5zjm3%7xCY_3 z#$1CQMvSx10V`PLRjEYn@h*|U#>yzR@Wdx=|8~K_AkH`_V0oT^4|l=tnd` z`OHy@1>@Q~6I>Dlxw((Dwr5`XJvjx+=NP8e zum@ogXf6e-arPu;#v0Z9unb4edGhd0kH^tUS;>dTvxhPKG{LYD=-Y6;VS2x>%9meX zaL4U6in9ckt70c^@PC%@n>9dRYX zP?19=GCwj8JYEd!H~1r#W?U%Q3BH;rN0PS4W?^605U=gS{7kA~Q=@eMeH)47SsEO+ z4>IkQYBkz4T_ms8_Oc?5Wkb-nejt+$FV*A6FCS0wS{l4|YtiJ0wTh{vR=E2u{`TE2 zH~X$zL9RV4BvSqk2=Pvj4Aw6b2gg@pM;p}=LA~Y91fY_E!o3iCI0M*uAeN^qfB+?% zarAV;VoFNV)xoh~Gqt9yin+#q&17N8q(@Tzax5iY^eq}5RKg^daAX;aw9EBURfP9- zv>~6CR2Q>=Ops4h3ohL>`U=!a!~quL?fQ;9?_SuCarZ9E)C`+XyZ(s^ej$111o>V{ zXo_G~)NL8=DZk7;lg^P{U!zOOk)>-8|7G(+hD|@ZxOVm!8kE5{)-3G2N>kIMDgavY zN+Mz*%1qlmRN-jmVKwqNOkMp(RJS&~tEE-1e`d(DdccU(dfE&Z@8b(lHNR>yU%lTG zo10tO6G!yI6+8Gt#Ny0dIcbuIRF129p}Z)*Qa_b3)PwvxnfbCIZvMKkKJ_sCL#3jj z^=BGN0zZ|G#DpH**r0Sh$AOj>mVcB|SyjhG)~(8vd-@mt05g+7+r?E1O}b%Vz37FW z7OXP)d)Q6|w<^zAKOwnE!&j$k?)xcmS0cmsp)@EqVPp`SQ&CP(z20LhJ|{+KHK;vF zka3mryylDknePbwj~TBWRd{ZkF_Y5sFAbBdi8_9Xu{05kf(z;0!yiE8Z(NkMsuR0C$66NYQYZo(CCT5qj&>$3<*eyMc%2Kq=L)6ZE8WJfZVtv zY6;iZOMZ{vMttXu3YiF&W 
z;qjQ4HHzV5V0v*puDV=2j-TGjWF>+mGZ*;;4O?6vpMa(E`q)z+SkrGE*;V$_#?CBc zNP4Rzze>8R{e{tr^hq6D=4~5AfY5vl{|~XKT>Y-RPtn*FS_R5s{ifkhaGnO9DN4Ld zO~1f_W>wGiO^WOo74g|r#Cyf-)-=qmcj{wmoFQHhD;4OfQ{h&RI-}0G zohpvE7Q=7d*ew5pv?5B``ASWkM^c~HKS9AS&#quAg8`|920PIWU22)_>{cyLc=ICfnCyz#IYV;iY&nWU)`Hg0KSP$? z%=1zyt%=#@Kz7O|dP=;@6Mmd3-kt}inF)hV;XkW4XW25dLJublUpgYZUz)KE9^Jpp zFrzIPrk3gX%zSIQM2atKX@|aaz+kJEOnRuNJDn{CS063^sr+3l(H7aFnHLE-lh#%O z0?lbP5cTDeB{gOtQ{y5%{UDG}WA9gc-j~AuNbryf>nD}FltA9BA29;Ae#)7y%+CY< z$}_s*oxVVLSPI$eYnD5z+nwswHoOY~; z-6wXQy&t>cy#`oC9cF?)>y29$Dlo4+g?Mqzl2{>H=1;M(=5kmvDv2~W)8-lkJ=o)u z_mJI~-o5GFaJsY>APN3n{P;)o!CM*!8}nl1OM|x~pG>YYZd3^qDJjH*y01a*`hC|R z`)iO4x_M-!gd~97Pp%%nlfn(@xf-;CC5*9;9dRCcJh0xITU^_ii*~qk4WfR1M#z<2 zuR?gV6h4I~H)Zc!FWR=_n;s7;tqph49%5*;-r-+d9o|fF(+YbEoQ(Xr>>Few$S+a4$%U z$HhqAJ-qd986&B`9kWhAhIu=HrNOS5!_$H3eS8YXm?%@{j4GLa2W!yiMnc_<6w469 zJckmmELO)lB#mb5LnoOYUpoo%ououkw4`Zv^j&^i718x^i0c)9#F{@*03AF(Zh7zi zkIhJKABTp(rd+|O7v#~3atCE5ELg?DDq(k)33sY^I%F}HJO#*1;oJ&tbz&>F! zy#}@Jww<#=ZPwf19BA{eL$g`sn_|pdqX)LL zt{k%DVTOuR`W@`dBSZglrBgEKk=E8LS1xnVejUXPMe=^UmQH~I12BCZ~Uqvl?4WO6h%$JX> zV#8>Bdb0@G9b$#vrRX2cK?`?`&@^y{t=VXw;X41TiZWjNxO(VdUA3;PPzTHXukHLj z9|EHTdpLJD%j~LA32JSPr{`DSUP&9sc4gjBTY-`|v=CgCLZnhdKUs1*X5Q}Ci1(7K z!PqBbKAF@|@TR)^q-6eQ6L(VxrG`N`S9T=mAC+ZUTO0!!JmjI+WubpP0@2249=ZR zS-fAVEqMsS4#FFqs@{55-N0R3k>?-V8hIp(H1+twf>1;P#wm{MRXW~kI$(RIi1)G} zxvxRHJH^|ynQN%9fsv=#%_C@S#V!(4{ zyjNukEYATM#ks~0hb22@qxDZ;3K?azp(vK<*z>*YV=!UkR2uDJa{2j>jV{6xjubMDzMN>9$NI(4$i_?ZaltqG~YKdRx@r`)cUs7Yzlff4EKcCI$A6i;2?a8gz+_ zY*X#>UWZ2#SOmPeX==W#Tqa8E*sKO&P$&eCM`dzfbiBspgWX4OyZHDGe}}*lr%~)a zWr@pQW6c|T(8aM+abM=>u0ds$pk~HBz0%L8TwOC7Z3bvn)M6XQJo|3xJXdyhbE7JQ z%D$BW{e6vG>(D!iU$Hn}aq;%WQps_C!XC9!Rq!h4JeXyk>o&z#Cr4&Deo0r-$e31} zus>E^aoUbvX64rPEtpM9fjul9ct@DJ%P;c)$vUm{WpZr_I^$SwHSt2bIz3-pIePE) z7`705z2Z5Y{LMEQycmh>CqzsOR3lzheG;C*D1sud%Clh;vEH8Y6#6?V29Nd$D&E{4 z!LkFbEyF6LWD$3Hm~gK_=$i~pYU17WGD)e!HDHGu**rYnQd7plHbb+lajLf;*!Ish z1x++!HD@;;^<>gZiRXCH=v#6}(sLT9nsJx05PcQik+SA=PKYpFN5)z&6@DsXh+e|;l# zh~1=eoL7PTWP?}=vv+lLUCE@j3%z_a$r`cxZXlf9EAIv*mXqRk=R1mriVsB^-Z6(L zQ0ia|DFC7v->lnL_0VH0fNjRyeTxYeeCjQ`3~pqLc3~UxN%uaBIL7zW_++}}+VnV? ze9u7Z;PLWLMwbV=gkAHxeO-=}L$EfY9?`=RpO)Mh%fjv!igg(lrZ4mN`)sIa3w@uw z`n+HZh|mkaWLNF|*!)PYaxTnBu1|ljkP0Ibi{J&&UD5B@b{DX|yd~#M{uNIKJBgd2 z^N5UK-ozF|G%tLiJX%V;&~`e7%=EG4Vw;cHcA50|H|F~|&N2j3uWS5iUS?;BMiH3y zJ%4!rW?$F6Wv4E1v~w8M?nGv7cHWeP>@(=`IX^7Ev7t?YQd3yQ4G!%xqw!*Uf;2Bl z$5hk2>yCxBo96fT8$umKwBh|`4-Kc*=r~s;pq0hh9={F zTDQ2NS3|^bro+{T(~9KVwbdctiSP$s)USoyX${NHiNhIraFy-Gb3l%w0(MZ^>uJwH z`#KFHPCmg2WxWHiUhheBEnPGRurb5>J2B5q2r` z>cX!!K8LKAjosQORw(y`sgLR?aiilJ6d`dC%qI$N5p-yFgz*%Y3Urf~uCdSB)rrAp zo3pwc9I}TA9u;x&JBeHAeV=>|%d2M*Mt1G7kQ|^zXJ+hbB{I^B?VRQ#^fYQ8kLQ?3 z8QC{Q%s(j#_q@~X-yYXzI5Em+J?YqH)p6uxG87XLkIic1g=4;K;E5z2owicg>eQ0l3qU zl{=Zn@Y2OJu(Lm*0>rVX`+%{O`LRpUBm5V+MY|8po*dKj_=x5y8jsO?NL8W9mm-#q z&Q}Cbcz`_t1EkwqPnahP?WGDh)c4zvjVBU3#XH|+lVn|p&Y2AW9&ylpaE^i}qp=ns z2scQlBtjAm@#6rT@n>cz6Q6WSf1iTGzI{0e+o0?U6Egi8B%472SSr5dXnrVCDS#6) zvhlVJ&b$LS56ZQ&q2=0W{i_ZLTeqLHD}tfD^~SWHlBaCQKA&3@U3p$8gKwtblnB3$ z%d%BSdw*%$??xz`=~p;jLBdUL~{M7R=1)&k){#HJZb}VtDyy7;^s! 
z%*nr_(9UFG&R^)^WcKxNMdwcu?_+edd7W=b?dmYv#2Y|;c5Jijx3v=u8b&;b0eKtw z^)KGmUmn+gQt&eUApV)SQl@fjVk8Y(!TnglropBV#PRC#L1ZKJ_}Ne$GP&*5|FSyz zTZQ}RU>467DFB1DCmK;(VDy22NzBC1vu)KfQE<=pIz!y5>>W=buh>p6dv_w*Y_QIU z8}@OjI__+Dy|g-3IdA#cv_oQj#P{*c!C$8 zumqYi0H@^=iWXM-!>m(7l%|=eOEc(s*#DtZFh97I^Sg%-*dvDTuXZ*Lg@x@N-8D8c zllPTWakzW@;)c{3HMS4^mJGdZD-Tj{nUA#7M+~~sGf|ypT^C)0y8L)z<2Zk{Zt>0k z$23HNrU39JO|PQON7gL=KtGn>6U<0N<3Cn_5v?fx2kQOZHT%D8hVLe-eF5TQS&_b_ zY!AImVMH#s+cG_@6D_p9Nn<#v*Mf7(eET4W*nl_aOFqDR^S(^I|IR%u=r&b8f*R0l zdeWQVHP=l|j{I~<`>Pb~BTC*FyZRPao%^1i@`M_k)yDrklw(XRr@)|U@W13(Z zaOd6=p~#wjn#Ov-5^2XJg|@9JAXu(~Uqg^ zi$aBc_et0B+rf(xZHuBGs;^YNlHZ#$n>|g{DQl96u(K%;fihK37#roNu^;Vv;>#$2 zf@7L8H$R_m53U3>ErSakZjDRH70%A;=f@Bqgw*hG5wEA+Y%=O_OMV)|(9Q04bj!b+ z!bP~+n&(5-5@EpYoiiErA1@j*pG_SPXOLCiccRMJA_z4!K znt)(3gb**-%?8}08pV@a=OC$G+t{jFTTwCmGn^I{qn*TFx5WACrDA)JH0iQ5(EPX! zTQ;m&jX1w;rId10z~4JRW$&OAnxyQ7{dGXTyClzQm7S@5GLH}MQ*9>8l78(tugD0A zNUK2GcL6ugvHkML&*#)^3F}nftemv^c+NIs{eM!Hm@}|7I z=nvKv_BuWFg}5(wi-TGX{rd1E189+!#a`w$JjmuJ30}d5Q@1BiqE09JlAS476;5rl z7ya@MS9NX}{9sII5}6i%KKip+;ZlCvhblArL?6W{AR<5r=E+@p*7(6FA{d_p$tq|b zel&e|ICw1LE47p=;!Fdsj$)shJqY;}tUbnT0JmU zFT=;Ei@x@@Ju`fOWy+)H4%=3Erb5d%t`cOe+iRIk7XV!YaRxe zpYGRH*5)w?=*inJ_)p5&KR#X}l+tmA?Ho|4uNo;#uaB~i?k!F$a65t2OzsOh49 zp}i*jPQ~;MnIn>DxhA>G1L(MBDL3on?S{`Ujjxv4b4YIB}pkd7daVK%!+Ga+ysgSn(X`5N`y zk8A4Guu|m}zf@R?b?lnr<2bUS;qSME8Ld6iBLaALk{4MKW?@U6qkSx4)thDRMw{-T z&-1~$@7b{*(S4Ap+?Tuu3Lv2J!Q$^l8Mi5?$8ad?axotrA5e9oOj42BOfC99(O1!H|Cj2GnP zho(n2EQRjYq(D~q5Km@!ou6hlMH{Oyv~>4@trD5AgfTm7Q4cO3 ztE>)bb?OFC!M9(jly3>Y>h&#s{3t1cxKmE3LV1le)B;0Oo=gH$-*60#|7wxMpYW)2 zk-Ki^;-yInbT0b)9)Xi$g&^ldaiLqvNe;7fH z?;tU*1GSd3efSw4JAB|J$Gu`G4+AEkfto&0{U9RL?0~oib*18k@t?4eEZJp({F;P5~w&@|T zw4;!Ut?0Wkg1eO~NgMaG&{U|4O?tXpJ?y#H!NXv+NU!63s8+?Kp;aA;+3uO($d@;I z2j0G}8R9m0c{WJsN^2=+Y#zcfq*=}}>s7J``BvSS*;eoABJBRcA}3*Gx9(#hWsG?_ z;-CU5^|Vj6uCoGr_LwbZ)DTUSYp7gGr)VxO<54L+b4J)i~NlYbMIVF1&5w6nNBf$lhYi%AQfT zu!&JKeQcZeo%F(oI6dF!-^S7m-)&dCy!QvQD-m^ zaE&U6T~ci|d93CV!|TA&Hhb%EQjuK;O~>AGi%~~_Bw*AqzbXr<)h!-C9;FUHZTnmn zoj|Phts&ATaYgB{`CY!yR`db!um`MclLbqZhYQ7$4`$|}ZwqZ^lk%k3iL~n^3boC# zUA}yZwj_|Mc`e57ee32&SLv|!whBK{0VxBX z4hsp(8N-{HMB$NG+^>ag#`5#e0IE;v557l(?ijIukvNxiV~!F z-_7}*cRK;%(`?o83^g^E93>9coauIiN-YDNsRR}#5&Y^exSNhKcb#9!61K8=k*Ya3 zYE4y_lg+8+nxt!PZ8+9;5g9AFH1+?m(oQ$fBSM#h9Lg?#s}fCPF*b(vn1wH5O{P7b zJT~F!nwofZW~p7{#jscP=oB%>4;O#0m!6W$gD%J3|3hy@7G%YVJhVHaS!)8w6!_AT zjD{K>rye;{9*;qG1%S9-&Q!~t20=4y8Az7ui|k0AVY=YegL85}rEAcOxy&od@14jI zw90EhcX(lU4LX+BbN}<|n!s6E*fr?(iD`k30;To`AQtjpb`jNAKV8K8MI5+aV`9y4 z8rb8g%YkS+$Oh_KZ_Mjd6tCAIY5A0zjgCnv2;h1!le2Dhz>Fbl7A`lE_qb3%zftQL zE358q3|+qvNG#=#l34Ctj(_rnjj2aIUe0p+Gtb7AFO1$#5TO)%d_al>Kdx!0QYYE+ zc@c_*+0QXw*+ub9dvze{^XHKEdEu{Y9}%r#kP?(CVli_S?0?h5D}Y%9KImwniN^|i zsd#AZO(=2izU0V5MN*Ip6SQ!%jB{nU05=A1)>${rYU$|sX{P^HUEoFn<^sVSqj?In zGDzLrc%!R4_LNx!Q4lGj7p1M?_?e?D)$C#08*jydMct!x7!iXvlpi~Uz~n6^ovMuW zC#XSoGi%o>@lel6v(lE1B>8VIE!n5*_^ z+tjL7_psj_EEvOfgYJc1V&~^~&Xa2irc2D%AU?m&{HMe{J~pk~1x`!~lv<$dNUpz> z->c};^2TSe*Pt~5Aoy#pCV34~H7_lMP|Fk_-Ib-dgHTtaL3f^-T}Xq+!9%~@bFdyi zvU79@Ul8CJy8rUuqnT4`Xj=SjWX#o68cvnm%_KB1wY`6&p3!;Ar#O7k5%;pT9x7?# zS$uo>%k9KKY=o&6LU@^Ks4dJ-?X#E1ff0fwLNqwDWJL~9i8Nqr`LL(ds30%cTP=JApehp~`7>LePp zU43I98y_A;r`)f6rdXy&d?cKK^7+VR8AAT?>%b=(Y5j|d!rXWxkcI+f`(NXPJQ;{3 zR3IN@m%9M6WiL89Exp=yjtK;je`r0s295s!zhdcuent)aGuBjJ@wcoA=PH^$J*Fr9 zTYy7Nw*zHGyyBX9kkEs*o5oU^n9X#&d_+G?n|br-#l9a9CbD6ZuOFA8BNx+ss@)32 zSrkVnz%QLy;Q7sb*B~iT0TdGu&PHs@uBdq#p}k(2AthP>-^lm5F++dXn}B-GaaM68 zvq27I#ezGRj=D5~DCGBpYY=k|nsPd%Z6h=fu#bwD(9$L?lz;8;W!-<6?RV8mDDHA1 
z$Otjuq{P!}L-{oMa9yl(q61V9%)ecJ{CzDHsbxxi4a%TTMw667kS89;8Ykb>QK;m_ zu9UO;zpwoRGJBgjJ0CL7&jnh-g!lgyT$QE;WseTcd`WEP5t-4l#BPqneq4yy;#`QG z;r5iF!sIF7asr#G{QOoS7*m^qZ!boIf9rxC@QD77Y-X#MI7sf*=MP`Z8gB*1y43IE zA1pRq$TUP;gEHeUf4Kg?L`vB2Hq~Sd+^EyB&c75DXksG2Llcd~jq~hAiyR2*Y|`_a z-D5hvQWj|E?dbBwq@S-jir2zlaa~6b@7A6-VLT}tYhxr=38$vJ=*P4Zs(pa-y}b2T zM*Vlbu}07`AN`mR%4@G{=DHlD8e>};r$6as`uStDl6!@y-_D8f37oQGQu87Hhe6xF z<*xteZ$RDZ%UghFnlK)%xdB00#3lVPA+ifiC&p6Wa|xEpYm>hlNt!c_Gia5{}QqFMH)~K^^Z6` z$%huo!^W|lZ=@~$K^zYAAG{~AfvHWNni5Xt)n=&Lwa4h{>lc?EDxI@a?s?# z*C4qT;Ber&T!Ug9G>m*zJlRrBzL_BvNw08EMo@HvXuyy9UtLN+bDhTTGmh`O1h69Q z@{mo+X-}s1Yov|TJ`}}|A6}@i`IpeSfGmo2e zrJ4*YbUYB7{;H!VMW zl=Kyl7gfSHd(a|dD7;)^Rij3*k#4G_ZTE8(srZ4t8aCxsT~CW=n1y$0EfQZwu$y$4 z$P*7}FJAzb4TEC;LJ#BE_Ix82yuF{XP3Dt=c6SSkPI?oL?6--M^ z53}4cSNB6Sj~aL0a}_4uH~!;k}g9Z zj$=H`!c+Uge8e9Ma2RP4~@Jg9Hn zO>fX=zXG;Ik38z5o|^4uQy;Cqq!YQELFu`F3Wp_)S%cR@iR>s|Sn(6Q-v#_eIq?Az zaFL6U>7Z(Y8t|aBl6HJ2USsn_ntM%mzNi4Z5*Ak3V@!L z4cP>nrkPr=j+|2-paG_wRqEql8G&Y$Hbc+awvqv={}AoJb5(`PvT}cHcg~v`7%dXn zC9RWIKd$uFEMJIok5BC#OLD@C7osU4|CnuuwYHurkDH&$Jp9>vJb z-RGSwJwTPe?WTqDZ%3+@TUiCHmS1_sw%|fbmCj|!Dy5cw3kDRal7@ccvw|`SOl2Kg0!34ncucIY&&n6DnWX~HI+4_uR3M!ZhtHxyTT zXQWk@(T^2Dj`w;pqUtz>_}?G`*Jt3KJp%Dfe4cBJ{_}ZrIG(C zJ7Tu}uGgl>58VuVLh-pZuuQZC^*5qD2*^c|>)QqoV^~cNzw$ zLO(Gw`Xi;vE9i8EG zPgSVAGp~Fa<<<+sxK$kkg(bG|)DbI+QWj{UitBXI>O*_y(8py%BS;fdKyYNH15O0& z={D3$=Yj+q4X5u<>(O?xsR7}Q04frP#lcQ18xfIAE7<}bxO)$EU*lq$L)yg=qD zkOUYe;!7p_yB_LDlYsx6w+@)MOyjpbSa<^#7Rg6tZArs)^-LLJFLrtjdFOUv5;~KL z>om7i$US4U-Slw6vB0FTa=EJ{hE7=wG~uQxZ9Ew9IZ-?{jUVbIZ=#Ih97m34OPIoF zj^isz>B}g7IQT0S8WrxURH22Y(R+Ih+SmY&V23DRFVN-u;Z3od#;()Spa77ndw6?@ zxkvW*m?rgwx^c?32ywQZ)0v!#4zL(b@8x)29&Zu z%fX|O-^&Fj9A#HXkl`V7FXLdwkLS{es8HZTW+SYjm?v-{Lw2T}iQ=f63Yw`uk*38L zhdu!uGMfMgTqFAjk^^I7iSi(}fdf_5G<2 z7!N@pJP=_w!d|IMmYmn!fI%Xlh0rcuDkhV4UHSkYCoLq-P@_gs{PUS>dgJN1Mjj4R zDJ0WuV2S61@&ZS}+bu~fVg75Sh*OU8;0-mNceu|cpBq3K?*^$qBq*{i6qeFj@!TVI zfC*r8hNINQe>}@690BKV)%tM2>)jQd$SI4ZC-&67A#UmXRK`*BpkZM|k%M!7ymA36 z?bc0K+I_ie_-uk9SJ9)+Hi(YGn9VObE215g1UiSj4WLIhbhVr0*#` zjdFB<&3SlmXEDBegthxytDzMQqq6tc9+g}aC*mnuXslV)H#e`a8PUCXyK$oD+cURI z$KAe~=mz4NT>e6RMI)R(c5cylmoLZCX+0SaCYo86)l_?uTV1wl#rXWbjtM;(pm9=* z#vt)xP{!lLql#Do9G)8bn}Dod)vg397SjG$FEVM6C|o^t0Beurwj*Ll^7NPpD?6&?nz%Cg5V3L|6}>Wyc~b zJlc8Nlo^5Z-Y}rX<<4U)HF2>s)mt~@?6Zv(s)1Xhxi*aS`LjPrCjR|diLH+R=0q^c zJB!N-WymgN_(fyOdw(+=rDI6W$O|nZX3Y*bs^ZD^ruJ zCYEXy^oa{qY|y*&z=5`RkXfd9c+pam=dc@j-jAWMfg1mTA73H;`3N#b>FrDAgH5wdL4)xGYi5 z_v1(?-^61IVU%+t!e(WdmQdiT$_aaSX-VDjl?IBk9Q%ko#YV5%8PA*biCSd)YT=iG zY_T;3p9}{JmjZ?_Za7yiZ8j=T)!_6EIS_93ewX)gOmYGp{A@c@Qx|Rtv)nY~%B~?# zgr(zqCCD{?xhYz*k2N>nt89Bmll}n)?E^g>;QE~m20{Z77Rpzi7vpd+T2pgU8Fv}= zVR%Lh^pUtdE+IZo-DjksA#z~WC}K_jMZ}GGKZqpC2X z%suQ{Df~}Aig2|BP4?*71ebq)3E1o~;0*IyV9qFaKjT3qyXx-_Zv*~C#=k2K0m*s8 zkp}r%fX3?;aH%X_uU4GkvN*hd0sFJO(yb0WuI&jfjgIxvPCvu3U!I%}hRBDy8RHjC8V_=ql_EI-Ss{Pa?Y$?O^cy^Zmg) zljhZsV)=w2pZkUpJ>zM0Iy=(`X7NmACAMSwOEZfyfD&i(D?`W58IWa@5F4*cbYKPp zATsj$u0d>idUw&*XW0+vsuV1LU#D@EzIt$h`ztjEH<<+$qx7rkr?rpV)LH4>)D^t8bRVjiK7;x#RHXYkTb`qb5ocFeQUIF>R)wM!b1o(OmX2@0m(axg zvgh=%I;g<|xvxX1JTZujs#RGmFL@&b6FTnGkaezs8br!t_TvywuChrG)@uw4%D{MB z-^iKYjaV?ymg2wpFM+L^2obs*brF*GY&WGvDCU46B}iA5@8x^Wr@qBIwE0Plj4{yz zPmV^Y3UCuF!G$JfT{e!yCqESG8)7*OUgG0F+{M-B;-uS3h|KyuMNU1TZJTsJwiCTE zARcgPhl&pL^#*9& z-ZL#VPJYwO4at@%F4_??>Mwoe%KgqzviY7oEf1x#5gitP_nvWeOy83oa1@n>%4a7} z4V2LWIwQO#F({I&jQp30@vjs{(!O!o6&2)~xZ>FUbF^UGwzR*}RMYkmaOVw$qDA~1?ekrB1H&T{v{*t46VErMe?vGLClDDId};=+=BTA zxCHm-Sq8~L9!?S&n=)> z_U=Fmrw#!HG6!vLInMRZ+izaPI0~Q$W!I^tStE}izZ(t6CjH%X|7VW=fOK|s)h0NA 
z4t4+gu`I=_@vX*;x`JnoZWg^To%HtJG1l7;XV^lmw@E?kEi5aZJ>`Ucw(yA^G|6|8 z@?4It*|Y?(XxjlL_X}!m-TOBlU@=v7loMSQ$S{L}YXr$%9`J>wx3|Y@@HthBiPbMP zAE-44hYsv&>b(EL##XUsqStY00$JYZa=`1jFblMZpL<$IE^-Z;BQMI3{}xaFWx0y_ zet)lX2G@U4d;9-hYX3WS%BN;z5zJqdHR-5>`v!$NaqGzxdAJ?B{xnqJa+tdW$78n? zxLK%#>a(2KEMxS0I9Xb@3j`awu0b#B0LeH#9D8ELHQb@SH!CEh=v@YDc%_lf<9c$W zeh6sYRN>JqERe5JFxg97KsVXUFQe&&9ICo6U1F&NP9OTF@;|F>DIpmH5;4w6Kc7f^ z<|@Z)K^BxEM|2~lsdo%&YZ_yX(k0d5J$WDNj%L|IWot+Hwhmb%74ZfiihrT&f1n+@ zYvJ2N1t_1yt%BP=$seK{p^+jxgAA7o%&_%wBECni)=o)h7yTUn54_fSQ2Pi*kCj{|X=%1*J`I)&!ZuoQ$32kY)N_A)*h%qim?ocg-}u-R7jYE^ zy)_w9Q?r;3$y;FHxe|XKqY0aO@p(m`N9di-_h8`nJadiSQ3{@Jtn0vvE|9z_Q@~8jp47a zJtgn$watmM6wKc+{;GOI0Ccet3GFnau4ZA{Q8E7{a*=B`Gxkbzn=Jb(6({*hTa%*W zZ3&jxQ0|*HvH3ESsdf)Hz4v|(k;$Kt@~Z-GkBy8}uPnMW$7%?M6Y{%=RHeIJ=Ik8~ zg|G>s0*mfi85_2xV3rvAfQpr32R)Th_bbC~@VQ&vIG+QhiIH))zA9tSf<@Hpt|O7a zXL{Ic)~9mZNx~FJP5oy=QE^0H}sbAoA=4l{3RDAZN_IrDgL#wQ< zVOi12kgURlsK3XC=@);+NrMjhw-Frrc}IRTHwU?fjfet7>{(Dg-dvL zu@FOEzxuK;ZxB`N`zT3+fPF!vpMohj`pqyw?=#^y>QoWl2!Drz(&~CC4j99g_ik^N zYtC$GQARK3Zuo@oGgh2PneR$3CDj?}J(Wr2vYVGto}&jA@XgWcE4}Ws#=VldsVRDU z!IH6#n|Gfao{uY5?0MkJIrLP5*y+EGq9ry11gDw>`C^7E@@E$3x2N)n()2AOwO$Ic zJvB~Y0|n{5CfM*|;Nv7@b*y2--q=I^odkS}M)@MQvw&M%$8psg;3&v8CS0%wj* zIc>}!Vd(KZ9@E3mtt?kr$BtyPw)Ag(sGIDv!FsEzHQ&p!0$*Hec*rquz5gn-seS)Z zO8bJ65D{~cf;BRA70A%!aKIg&y3~HD9U)*N_$cKTjSB`lSN|;TGr_ozP{Hv4v2Co= zSGCBPA_F6}l0gf0RUQS~m87kohgs3MFK-(v%Jp+5s8zo?y}Qxuc@4UyYIaFzw{&4d z?CcZJ0E1w9U3Og1M2NN_l^vv%toa)1!M0h4qFiL(=EUxzx0O-zxcyp$ho1C%Qorf# zAh%fJuc3jA6aQ^WuLCS2Rp2UQx9!+xI(M3xXE|)-nA;w3{CRiX`zML<`_2A;nmB*@ zMN$c6wkA_Nxm~;tCH3;uN1FRO*zqMc`S|(2+1v2A;v97d7?K)z8=?twUy>G@GoR z2SW&`_TOv%lymK@J_uZ7hp){FMA!?sH673z?ieOjc-uK0%30oTHL2{3$M#eUw3z+L zAW5SAdbI%V7>LqaNxOoWC05%3e$$&^-7}xW#zBFZQv21&`%?90FYGwT``&#kq)U7E zNv}XdP2pAptnr{j%~cyN-hD;QQ!4$sSp$YqY|!*|g)N-#)7On0*mqB+&U%s#WS&;APUxf%{ePeO`Ga?^6H$KFJ zdv;4^hI_V(2ztF5zw#61hguet9v?kU=973*sUq+$FCq9H8F@8{Cj}dvv|Vf2^Lw>w zggr`i`G8FRwB_ZTj@tO(XNp$)?X)Mf>mce!4;AI=ZpTW;{u&*zXS70+w*p~8>KP&Z z^SwOP$KBL~&_mUBDZ8tQWgLNM#Yj%_=@yiRrIA6FlaiYLh!?p^{G2$X3W1=6SkY3Z0df+ft z?J;uGJ5vD574x1^^zoJ{6u6a!GxBlz^i|`;I(#@wJWPsi72IS-^uv~BN`~W+c_`TW zmbwFX-TRfgFVC9lmZexj_%YMLlcNfEr(qUiWwR<<4*hl-<)S@E{48)@M zwCXKd@*`h!@%!z-q@@G!y^`B!VGBrk`38EvsVG-#DVHt!2gD+tI9Z8oiML;@R4^>} zH-vZK(PiQXP;D;?jY~S0=aHW2nw)!_x`){Qf))YC?;pN|>_7w3^v z#FARRYT|M&Z8Y8bT{2nH&1!ZbT~A;H_{oy3I-2!;L#CoEoo&opkSeu8^Iyve342`iZ4=P54(m z9ka~r);lV^20gc~i=SP@L+_SYCq^i1rFK#^@<{q>5z1N!Ej+kbPp3okKX%#?K!CAb2; zv1rHV@j9)CUi;4Cfa}c#HseBDg0Bo_`|sw-ykYetQ4lKkrc6Sws?Lk_=vE4ZXq4@= zomfn<3hRiCYhfRO?S^*bW5@W%=GE@Y?|^hS9+X5f%rqqbhg@-Mgmi4N-RC|}M$DI_ z*x3pgDp>J&oNPa)lTgM;qtZDq)t(H_$#QFiO+jr;6=I3JhOzW&A9U+{&| zotv*X)0Z#Zwwh;CmPT&k%q6|fsSTMb{?nBW<%&3D)emb+=KPba@H_YE?t^)@Y0Z~qj2%=mn4bA|fhLZ=A!l8-KDPVIcs_>FOL-%a_bmwS>Qo#to$k@r37 z%FKUfa~2C8*}Y`ujaHAv@sG-rR;|9XLG;Y+f9Ln}B%41dxaJk+I zg|`oyqw6ew)Gq$xym`a6Yhs^MdoSI)ZQJ|h+9Y?Q)ml3$=0+=6m%pD?b%WpLhxcQ9 z`B&5Ah1hoHXRfa3_Pi>pv!!IJ?!v~DP1|=w$6F|#yEbFx#AYE!TLbn@$08R8y%IdZ zIz{Heue3egzCTP?)EIC4(YgAEd&HHk`~t7N@=RWoTAy4uNi*!0#>M#abJnbh)BoBN z(weaI)}p;-0?hst)7yi%<6ITgmO>#+UB)xv*Ip=ZQgfX zGunHus_1m!Om5ose{J)D6LSwP0O#X?i&DBFQ^6N~5b{{im0Ez*0ol(4}vKkA?V zVV+oSsnSh)qlvS|1rz*18}1F2fNa{tiZ^6 zG;{uF&W`^jRTB&xMRoqqaO5I594y6)gSvpTfT+gUqXtV~js53;j3;)_ni3X|ZVE}JklCEbTXPU1(HOU&Z6zX;M`6D4M aSCZMQW6Z)p=WjItD>V|2siCg=|2F}`Mw{RO literal 0 HcmV?d00001 From bc5daabba891e835ee8b4a078f81c6cda04baa52 Mon Sep 17 00:00:00 2001 From: Colin Davidson Date: Tue, 16 Sep 2025 13:36:35 +0100 Subject: [PATCH 07/14] Removed unnecessary `` and \. 
--- sycl/doc/design/SYCLNativeCPUPipeline.md | 2 +- .../doc/design/SYCLNativeCPUPipelinePasses.md | 392 +++++++++--------- 2 files changed, 197 insertions(+), 197 deletions(-) diff --git a/sycl/doc/design/SYCLNativeCPUPipeline.md b/sycl/doc/design/SYCLNativeCPUPipeline.md index b8356386d4e47..0baf2469391f3 100644 --- a/sycl/doc/design/SYCLNativeCPUPipeline.md +++ b/sycl/doc/design/SYCLNativeCPUPipeline.md @@ -259,7 +259,7 @@ replacing the scalar kernel with the vectorized one. Any remaining materialization of builtins are handled by [DefineMuxBuiltinsPass](https://github.com/intel/llvm/blob/sycl/llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_pipeline/source/define_mux_builtins_pass.cpp), -such as ``__mux_mem_barrier``. The use of this pass should probably be phased +such as `__mux_mem_barrier`. The use of this pass should probably be phased out in preference to doing it all in one place. Some builtins may rely on others to complete their function. These diff --git a/sycl/doc/design/SYCLNativeCPUPipelinePasses.md b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md index 5a05ec4bfd3a8..0ac568cd5a669 100644 --- a/sycl/doc/design/SYCLNativeCPUPipelinePasses.md +++ b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md @@ -34,7 +34,7 @@ Their job is three-fold: attached. It sets this information based on local work-group size information which is: - > - (`TransferKernelMetadataPass`) - taken from the kernel\'s + > - (`TransferKernelMetadataPass`) - taken from the kernel's > entry in the `!opencl.kernels` module-level metadata. > - (`EncodeKernelMetadataPass`) - optionally passed to the pass > on construction. The local sizes passed to the pass should @@ -194,11 +194,11 @@ why this transformation is only done when compiling for debug. The benefit of adding the extra `alloca` is that it forces the address to be placed on the stack, where we can point to it with `llvm.dbg.declare()` intrinsics, rather than reading the address from a -register where it won\'t persist. Not all source variables are classed +register where it won't persist. Not all source variables are classed as live however if they are not used past the first barrier, so when the `IsDebug` flag is set we also modify the algorithm for finding live variables to mark these `alloca` instructions as live. Otherwise their -values won\'t be updated for the current work item past the first +values won't be updated for the current work item past the first barrier and the debugger will print incorrect values. To point to the location in the live variables struct where each source @@ -215,9 +215,9 @@ the CFG and build inter-barrier regions. We start traversal at the beginning of the function, and at the barriers, and we end whenever we encounter another barrier or a return statement. We collect all values that are defined within one region, which have uses in any other region, -which are called \"external uses\". We also collect values that are +which are called "external uses". We also collect values that are defined within one region and used in the same region, but where the -definition does not dominate the use. These are \"internal uses\" and +definition does not dominate the use. These are "internal uses" and can occur where a barrier is present in a loop, such that the same barrier that begins the inter-barrier region can also be hit at the end of that region. 
(The definition must have dominated all its uses in the @@ -225,7 +225,7 @@ original function, but a barrier inside a loop can result in the second part of the loop body preceding the first within the inter-barrier region.) -We also implement a \"Barrier Tidying\" optimization that +We also implement a "Barrier Tidying" optimization that posts-processes the set of live values to remove certain values where it is expected that loading and storing these values will incur more overhead than simply recalculating them from other available values @@ -237,7 +237,7 @@ considered removable are: > - All other casts where the source operand is already in the > barrier, > - Vector splats, -> - Calls to \"rematerializable\" builtins - see +> - Calls to "rematerializable" builtins - see > `compiler::utils::eBuiltinPropertyRematerializable` If the barrier contains scalable vectors, the size of the struct is @@ -251,7 +251,7 @@ the barrier struct are put into a flexible array member (of type `i8`) at the end, so that GEPs to individual members can be constructed by calculating their byte offsets into this array and the results cast to pointers of the needed type. The position of individual scalable vector -members is calculated by multiplying their equivalent \"fixed width\" +members is calculated by multiplying their equivalent "fixed width" offset (i.e. the same as if vscale were equal to 1) by the actual vscale. @@ -278,9 +278,9 @@ kernel entry point. This pass runs over all functions in the module which have [mux-kernel](#function-attributes) entry-point attributes. -The new wrappers take the name of either the \'tail\' or \'main\' -kernels \--whichever is present \-- suffixed by -\".mux-barrier-wrapper\". The wrappers call either the original +The new wrappers take the name of either the 'tail' or 'main' +kernels - whichever is present - suffixed by +".mux-barrier-wrapper". The wrappers call either the original kernel(s) if no barriers are present, or the newly-created barrier regions if barriers are present. The original kernels are left in the module in either case but are marked as internal so that later passes @@ -361,7 +361,7 @@ Barriers are supported in this mode by creating a separate barrier struct for both the vector and scalar versions of the kernel. There are circumstances in which this mode is skipped in favour of -\"vector only\" mode: +"vector only" mode: - If the local work-group size is known to be a multiple of the vectorization factor. @@ -382,7 +382,7 @@ There are circumstances in which this mode is skipped in favour of - If the kernel has been vectorized with vector predication. In this case the vector loop is known to handle scalar iterations itself. -If any of these conditions are true, the \"vector only\" mode is used. +If any of these conditions are true, the "vector only" mode is used. #### Vector + Vector-predicated @@ -452,20 +452,20 @@ Replacements are also performed on two abacus-internal builtins: OpenCL user-facing builtins allows replacements in more cases, as the abacus versions are used to implement several other builtin functions. -The `__abacus_clz` builtin \-- count leading zeros \-- can be exchanged +The `__abacus_clz` builtin (count leading zeros) can be exchanged for a hardware intrinsic: `llvm.ctlz`. However, some variants are skipped: 64-bit scalar and vector variants are skipped, since Arm uses calls to an external function to help it implement this case. 
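As a rough sketch of what this replacement amounts to (symbol names and mangling are simplified here, and this is not lifted from the pass source), the 32-bit case swaps the library call for the intrinsic:

```llvm
; Hypothetical before/after for a 32-bit clz. The `i1 false` flag asks for a
; result that is still defined when the input is zero, matching OpenCL clz().
%before = call i32 @__abacus_clz(i32 %x)
%after  = call i32 @llvm.ctlz.i32(i32 %x, i1 false)
```

The vector variants map onto the corresponding vector `llvm.ctlz` overloads in the same way.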
-The `__abacus_mul_hi` builtin \-- multiplication returning the \"high\" -part of the product \-- can be exchanged for a shorter series of LLVM +The `__abacus_mul_hi` builtin (multiplication returning the "high" +part of the product) can be exchanged for a shorter series of LLVM instructions which perform the multiplication in a wider type before shifting it down. This is desirable because abacus has a rule that it never introduces larger types in its calculations. LLVM, however, is -able to match a specific sequence of instructions against a \"mul hi\" +able to match a specific sequence of instructions against a "mul hi" node, which is canonical, well-optimized, and many targets directly lower that node to a single instruction. 64-bit versions (scalar and -vector) are skipped since 64-bit \"mul hi\" and 128-bit integers are not +vector) are skipped since 64-bit "mul hi" and 128-bit integers are not well supported on all targets. The `__abacus_fmin` and `__abacus_fmax` builtins can be exchanged for @@ -511,7 +511,7 @@ as its parameters a reference to the function to be optionally vectorized, and a reference to a vector of `VeczPassOptions` which it is expected to fill in. -If it\'s not interested in seeing the function vectorized, it returns +If it's not interested in seeing the function vectorized, it returns false; otherwise it fills in the `VeczPassOptions` array with the choicest vectorization options it can muster for the target. For example: @@ -568,7 +568,7 @@ function declarations. If a definition of a mux builtin requires calls to other mux builtins which themselves need defining, such dependencies can be added to the -end of the module\'s list of functions so that the +end of the module's list of functions so that the `DefineMuxBuiltinsPass` will visit those in turn. One example of this is the lowering of `__mux_get_global_id` which calls `__mux_get_local_id`, among other functions. @@ -614,7 +614,7 @@ barrier region. There are several key pieces of metadata used for inter-communication between the Native CPU passes. -In order to avoid hard-coding assumptions about the metadata\'s names, +In order to avoid hard-coding assumptions about the metadata's names, number of operands, types of operands, etc., utility functions **should** be used to access or manipulate the metadata. The specific names and/or operands of these metadata is **not** guaranteed to be @@ -663,7 +663,7 @@ provided by `BIMuxInfoConcept` used by the Vectorized definitions of the various sub-group builtins are provided by the VECZ pass, so any target running VECZ (and the above passes) will be able to support sub-groups of a larger size than 1. Note that VECZ does -not currently interact \"on top of\" the mux builtins - it replaces them +not currently interact "on top of" the mux builtins - it replaces them in the functions it vectorized. This is future work to allow the two to build on top of each other. @@ -686,8 +686,8 @@ itself an extension of the Itanium C++ mangling scheme. The Itanium specification under-specifies vector types in general, so vendors are left to establish their own system. In the vectorizer, fixed-length vector types follow the convention that LLVM, GCC, ICC and others use. 
The first -component is ``Dv`` followed by the number of elements in the vector, followed by -an underscore (\ ``_``\ ) and then the mangled element type: +component is `Dv` followed by the number of elements in the vector, followed by +an underscore ( `_` ) and then the mangled element type: ``` llvm <2 x i32> -> Dv2_i @@ -699,10 +699,10 @@ such as ARM SVE2 provide scalable vector types at the C/C++ language level, but those are mangled in a vendor-specific way. The vectorizer chooses its own mangling scheme using the Itanium -vendor-extended type syntax, which is ``u``\ , followed by the length of the +vendor-extended type syntax, which is `u` , followed by the length of the mangled type, then the mangled type itself. -Scalable-vectors are first mangled with ``nx`` to indicate the scalable +Scalable-vectors are first mangled with `nx` to indicate the scalable component. The next part is an integer describing the known multiple of the scalable component. Lastly, the element type is mangled according to the established vectorizer mangling scheme (i.e. Itanium). @@ -723,100 +723,100 @@ Example: The Following intermediate representations are used in the interface to Native CPU. Some of these may not be relevant for Native CPU, and may exist from the time this was part of the `oneAPI Construction Kit`. -* ``size_t __mux_get_global_size(i32 %i)`` - Returns the number of global - invocations for the ``%i``'th dimension. -* ``size_t __mux_get_global_id(i32 %i)`` - Returns the unique global - invocation identifier for the ``%i``'th dimension. -* ``size_t __mux_get_global_offset(i32 %i)`` - Returns the global offset (in - invocations) for the ``%i``'th dimension. -* ``size_t __mux_get_local_size(i32 %i)`` - Returns the number of local - invocations within a work-group for the ``%i``'th dimension. -* ``size_t __mux_get_local_id(i32 %i)`` - Returns the unique local invocation - identifier for the ``%i``'th dimension. -* ``i32 __mux_get_sub_group_id()`` - Returns the sub-group ID. -* ``size_t __mux_get_num_groups(i32 %i)`` - Returns the number of work-groups - for the ``%i``'th dimension. -* ``i32 __mux_get_num_sub_groups()`` - Returns the number of sub-groups for +* `size_t __mux_get_global_size(i32 %i)` - Returns the number of global + invocations for the `%i`'th dimension. +* `size_t __mux_get_global_id(i32 %i)` - Returns the unique global + invocation identifier for the `%i`'th dimension. +* `size_t __mux_get_global_offset(i32 %i)` - Returns the global offset (in + invocations) for the `%i`'th dimension. +* `size_t __mux_get_local_size(i32 %i)` - Returns the number of local + invocations within a work-group for the `%i`'th dimension. +* `size_t __mux_get_local_id(i32 %i)` - Returns the unique local invocation + identifier for the `%i`'th dimension. +* `i32 __mux_get_sub_group_id()` - Returns the sub-group ID. +* `size_t __mux_get_num_groups(i32 %i)` - Returns the number of work-groups + for the `%i`'th dimension. +* `i32 __mux_get_num_sub_groups()` - Returns the number of sub-groups for the current work-group. -* ``i32 __mux_get_max_sub_group_size()`` - Returns the maximum sub-group size +* `i32 __mux_get_max_sub_group_size()` - Returns the maximum sub-group size in the current kernel. -* ``i32 __mux_get_sub_group_size()`` - Returns the number of invocations in the +* `i32 __mux_get_sub_group_size()` - Returns the number of invocations in the sub-group. 
-* ``i32 __mux_get_sub_group_local_id()`` - Returns the unique invocation ID +* `i32 __mux_get_sub_group_local_id()` - Returns the unique invocation ID within the current sub-group. -* ``size_t __mux_get_group_id(i32 %i)`` - Returns the unique work-group - identifier for the ``%i``'th dimension. -* ``i32 __mux_get_work_dim()`` - Returns the number of dimensions in +* `size_t __mux_get_group_id(i32 %i)` - Returns the unique work-group + identifier for the `%i`'th dimension. +* `i32 __mux_get_work_dim()` - Returns the number of dimensions in use. -* ``__mux_dma_event_t __mux_dma_read_1D(ptr address_space(3) %dst,`` - ``ptr address_space(1) %src, size_t %width, __mux_dma_event_t %event)`` - DMA - 1D read from ``%src`` to ``%dst`` of ``%width`` bytes. May use ``%event`` +* `__mux_dma_event_t __mux_dma_read_1D(ptr address_space(3) %dst,` + `ptr address_space(1) %src, size_t %width, __mux_dma_event_t %event)` - DMA + 1D read from `%src` to `%dst` of `%width` bytes. May use `%event` from previous DMA call. Returns event used. -* ``__mux_dma_event_t __mux_dma_read_2D(ptr address_space(3) %dst,`` - ``ptr address_space(1) %src, size_t %width, size_t %dst_stride,`` - ``size_t %src_stride, size_t %height __mux_dma_event_t %event)`` - DMA 2D - read from ``%src`` to ``%dst`` of ``%width`` bytes and ``%height`` rows, with - ``%dst_stride`` bytes between dst rows and ``%src_stride`` bytes between src - rows. May use ``%event`` from previous DMA call. Returns event used. -* ``__mux_dma_event_t __mux_dma_read_3D(ptr address_space(3) %dst,`` - ``ptr address_space(1) %src, size_t %width, size_t %dst_line_stride,`` - ``size_t %src_line_stride, size_t %height, size_t %dst_plane_stride,`` - ``size_t %src_plane_stride, size_t %depth, __mux_dma_event_t %event)`` - DMA - 3D read from ``%src`` to ``%dst`` of ``%width`` bytes, ``%height`` rows, and - ``%depth`` planes, with ``%dst_line_stride`` bytes between dst rows, - ``%src_line_stride`` bytes between src rows, ``%dst_plane_stride`` bytes - between dst planes, and ``%src_plane_stride`` between src planes. May use - ``%event`` from previous DMA call. Returns event used. -* ``__mux_dma_event_t __mux_dma_write_1D(ptr address_space(1) ptr %dst,`` - ``ptr address_space(3) %src, size_t %width, __mux_dma_event_t %event)`` - DMA - 1D write from ``%src`` to ``%dst`` of ``%width`` bytes. May use ``%event`` +* `__mux_dma_event_t __mux_dma_read_2D(ptr address_space(3) %dst,` + `ptr address_space(1) %src, size_t %width, size_t %dst_stride,` + `size_t %src_stride, size_t %height __mux_dma_event_t %event)` - DMA 2D + read from `%src` to `%dst` of `%width` bytes and `%height` rows, with + `%dst_stride` bytes between dst rows and `%src_stride` bytes between src + rows. May use `%event` from previous DMA call. Returns event used. +* `__mux_dma_event_t __mux_dma_read_3D(ptr address_space(3) %dst,` + `ptr address_space(1) %src, size_t %width, size_t %dst_line_stride,` + `size_t %src_line_stride, size_t %height, size_t %dst_plane_stride,` + `size_t %src_plane_stride, size_t %depth, __mux_dma_event_t %event)` - DMA + 3D read from `%src` to `%dst` of `%width` bytes, `%height` rows, and + `%depth` planes, with `%dst_line_stride` bytes between dst rows, + `%src_line_stride` bytes between src rows, `%dst_plane_stride` bytes + between dst planes, and `%src_plane_stride` between src planes. May use + `%event` from previous DMA call. Returns event used. 
+* `__mux_dma_event_t __mux_dma_write_1D(ptr address_space(1) ptr %dst,` + `ptr address_space(3) %src, size_t %width, __mux_dma_event_t %event)` - DMA + 1D write from `%src` to `%dst` of `%width` bytes. May use `%event` from previous DMA call. Returns event used. -* ``__mux_dma_event_t __mux_dma_write_2D(ptr address_space(1) %dst,`` - ``ptr address_space(1) %src, size_t %width, size_t %dst_stride,`` - ``size_t %src_stride, size_t %height __mux_dma_event_t %event)`` - DMA 2D - write from ``%src`` to ``%dst`` of ``%width`` bytes and ``%height`` rows, - with ``%dst_stride`` bytes between dst rows and ``%src_stride`` bytes between - src rows. May use ``%event`` from previous DMA call. Returns event used. -* ``__mux_dma_event_t __mux_dma_write_3D(ptr address_space(3) %dst,`` - ``ptr address_space(1) %src, size_t %width, size_t %dst_line_stride,`` - ``size_t %src_line_stride, size_t %height, size_t %dst_plane_stride,`` - ``size_t %src_plane_stride, size_t %depth, - ``__mux_dma_event_t %event)`` - DMA 3D write from ``%src`` to ``%dst`` of - ``%width`` bytes, ``%height`` rows, and ``%depth`` planes, with - ``%dst_line_stride`` bytes between dst rows, ``%src_line_stride`` bytes - between src rows, ``%dst_plane_stride`` bytes between dst planes, and - ``src_plane_stride`` between src planes. May use ``%event`` from previous DMA +* `__mux_dma_event_t __mux_dma_write_2D(ptr address_space(1) %dst,` + `ptr address_space(1) %src, size_t %width, size_t %dst_stride,` + `size_t %src_stride, size_t %height __mux_dma_event_t %event)` - DMA 2D + write from `%src` to `%dst` of `%width` bytes and `%height` rows, + with `%dst_stride` bytes between dst rows and `%src_stride` bytes between + src rows. May use `%event` from previous DMA call. Returns event used. +* `__mux_dma_event_t __mux_dma_write_3D(ptr address_space(3) %dst,` + `ptr address_space(1) %src, size_t %width, size_t %dst_line_stride,` + `size_t %src_line_stride, size_t %height, size_t %dst_plane_stride,` + `size_t %src_plane_stride, size_t %depth, + `__mux_dma_event_t %event)` - DMA 3D write from `%src` to `%dst` of + `%width` bytes, `%height` rows, and `%depth` planes, with + `%dst_line_stride` bytes between dst rows, `%src_line_stride` bytes + between src rows, `%dst_plane_stride` bytes between dst planes, and + `src_plane_stride` between src planes. May use `%event` from previous DMA call. Returns event used. -* ``void __mux_dma_wait(i32 %num_events, __mux_dma_event_t*)`` - Wait on +* `void __mux_dma_wait(i32 %num_events, __mux_dma_event_t*)` - Wait on events initiated by a DMA read or write. -* ``size_t __mux_get_global_linear_id()`` - Returns a linear ID equivalent - to ``(__mux_get_global_id(2) - __mux_get_global_offset(2)) *`` - ``__mux_get_global_size(1) * __mux_get_global_size(0) +`` - ``(__mux_get_global_id(1) - __mux_get_global_offset(1)) *`` - ``__mux_get_global_size(0) + (__mux_get_global_id(0) -`` - ``__mux_get_global_offset(0))``. -* ``size_t __mux_get_local_linear_id(void)`` - Returns a linear ID equivalent - to ``__mux_get_local_id(2) * __mux_get_local_size(1) *`` - ``__mux_get_local_size(0) + __mux_get_local_id(1) * __mux_get_local_size(0)`` - ``+ __mux_get_local_id(0)``. -* ``size_t __mux_get_enqueued_local_size(i32 i)`` - Returns the enqueued - work-group size in the ``i``'th dimension, for uniform work-groups this is - equivalent to ``size_t __mux_get_local_size(i32 %i)``. 
-* ``void __mux_mem_barrier(i32 %scope, i32 %semantics)`` - Controls the order +* `size_t __mux_get_global_linear_id()` - Returns a linear ID equivalent + to `(__mux_get_global_id(2) - __mux_get_global_offset(2)) *` + `__mux_get_global_size(1) * __mux_get_global_size(0) +` + `(__mux_get_global_id(1) - __mux_get_global_offset(1)) *` + `__mux_get_global_size(0) + (__mux_get_global_id(0) -` + `__mux_get_global_offset(0))`. +* `size_t __mux_get_local_linear_id(void)` - Returns a linear ID equivalent + to `__mux_get_local_id(2) * __mux_get_local_size(1) *` + `__mux_get_local_size(0) + __mux_get_local_id(1) * __mux_get_local_size(0)` + `+ __mux_get_local_id(0)`. +* `size_t __mux_get_enqueued_local_size(i32 i)` - Returns the enqueued + work-group size in the `i`'th dimension, for uniform work-groups this is + equivalent to `size_t __mux_get_local_size(i32 %i)`. +* `void __mux_mem_barrier(i32 %scope, i32 %semantics)` - Controls the order that memory accesses are observed (serves as a fence instruction). This control is only ensured for memory accesses issued by the invocation calling the barrier and observed by another invocation executing within the memory - ``%scope``. Additional control over the kind of memory controlled and what - kind of control to apply is provided by ``%semantics``. See `below + `%scope`. Additional control over the kind of memory controlled and what + kind of control to apply is provided by `%semantics`. See `below <#memory-and-control-barriers>`__ for more information. -* ``void __mux_work_group_barrier(i32 %id, i32 %scope, i32 %semantics)`` and - ``void __mux_sub_group_barrier(i32 %id, i32 %scope, i32 %semantics)`` - Wait +* `void __mux_work_group_barrier(i32 %id, i32 %scope, i32 %semantics)` and + `void __mux_sub_group_barrier(i32 %id, i32 %scope, i32 %semantics)` - Wait for other invocations of the work-group/sub-group to reach the current point of execution (serves as a control barrier). A barrier identifier is provided - by ``%id`` (note that implementations **must** ensure uniqueness themselves, - e.g., by running the ``compiler::utils::PrepareBarriersPass``). These + by `%id` (note that implementations **must** ensure uniqueness themselves, + e.g., by running the `compiler::utils::PrepareBarriersPass`). These builtins may also atomically provide a memory barrier with the same semantics - as ``__mux_mem_barrier(i32 %scope, i32 %semantics)``. See `below + as `__mux_mem_barrier(i32 %scope, i32 %semantics)`. See `below <#memory-and-control-barriers>`__ for more information. ##### Group operation builtins @@ -828,9 +828,9 @@ The builtin functions are overloadable and are mangled according to the type of operand they operate on. Each *work-group* operation takes as its first parameter a 32-bit integer -barrier identifier (``i32 %id``). Note that if barriers are used to implement +barrier identifier (`i32 %id`). Note that if barriers are used to implement these operations, implementations **must** ensure uniqueness of these IDs -themselves, e.g., by running the ``compiler::utils::PrepareBarriersPass``. The +themselves, e.g., by running the `compiler::utils::PrepareBarriersPass`. The barrier identifier parameter is not mangled. > [!NOTE] @@ -843,25 +843,25 @@ barrier identifier parameter is not mangled. The groups are defined as: -* ``work-group`` - a group of invocations running together as part of an ND +* `work-group` - a group of invocations running together as part of an ND range. These builtins **must** only take scalar values. 
-* ``sub-group`` - a subset of invocations in a work-group which can synchronize +* `sub-group` - a subset of invocations in a work-group which can synchronize and share data efficiently. Native CPU leaves the choice of sub-group size and implementation to the target; Native CPU only defines these builtins with a "trivial" sub-group size of 1. These builtins **must** only take scalar values. -* ``vec-group`` - a software level group of invocations processing data in +* `vec-group` - a software level group of invocations processing data in parallel *on a single invocation*. This allows the compiler to simulate a sub-group without any hardware sub-group support (e.g., through vectorization). These builtins **may** take scalar *or vector* values. The scalar versions of these builtins are essentially identical to the - corresponding ``sub-group`` builtins with a sub-group size of 1. + corresponding `sub-group` builtins with a sub-group size of 1. -##### ``any``/``all`` builtins +##### `any`/`all` builtins -The ``any`` and ``all`` builtins return ``true`` if any/all of their operands -are ``true`` and ``false`` otherwise. +The `any` and `all` builtins return `true` if any/all of their operands +are `true` and `false` otherwise. ```llvm i1 @__mux_sub_group_any_i1(i1 %x) @@ -869,15 +869,15 @@ are ``true`` and ``false`` otherwise. i1 @__mux_vec_group_any_v4i1(<4 x i1> %x) ``` -##### ``broadcast`` builtins +##### `broadcast` builtins -The ``broadcast`` builtins broadcast the value corresponding to the local ID to +The `broadcast` builtins broadcast the value corresponding to the local ID to the result of all invocations in the group. The sub-group version of this -builtin takes an ``i32`` sub-group linear ID to identify the invocation to -broadcast, and the work-group version take three ``size_t`` indices to locate +builtin takes an `i32` sub-group linear ID to identify the invocation to +broadcast, and the work-group version take three `size_t` indices to locate the value to broadcast. Unused indices (e.g., in lower-dimension kernels) **must** be set to zero - this is the same value returned by -``__mux_get_global_id`` for out-of-range dimensions. +`__mux_get_global_id` for out-of-range dimensions. ```llvm i64 @__mux_sub_group_broadcast_i64(i64 %val, i32 %sg_lid) @@ -885,24 +885,24 @@ the value to broadcast. Unused indices (e.g., in lower-dimension kernels) i64 @__mux_vec_group_broadcast_v2i64(<2 x i64> %val, i32 %vec_id) ``` -##### ``reduce`` and ``scan`` builtins +##### `reduce` and `scan` builtins -The ``reduce`` and ``scan`` builtins return the result of the group operation +The `reduce` and `scan` builtins return the result of the group operation for all values of their parameters specified by invocations in the group. -Scans may be either ``inclusive`` or ``exclusive``. Inclusive scans perform the +Scans may be either `inclusive` or `exclusive`. Inclusive scans perform the operation over all invocations in the group. Exclusive scans perform the operation over the operation's identity value and all but the final invocation in the group. The group operation may be specified as one of: -* ``add``/``fadd`` - integer/floating-point addition. -* ``mul``/``fmul`` - integer/floating-point multiplication. -* ``smin``/``umin``/``fmin`` - signed integer/unsigned integer/floating-point minimum. -* ``smax``/``umax``/``fmax`` - signed integer/unsigned integer/floating-point maximum. -* ``and``/``or``/``xor`` - bitwise ``and``/``or``/``xor``. 
-* ``logical_and``/``logical_or``/``logical_xor`` - logical ``and``/``or``/``xor``. +* `add`/`fadd` - integer/floating-point addition. +* `mul`/`fmul` - integer/floating-point multiplication. +* `smin`/`umin`/`fmin` - signed integer/unsigned integer/floating-point minimum. +* `smax`/`umax`/`fmax` - signed integer/unsigned integer/floating-point maximum. +* `and`/`or`/`xor` - bitwise `and`/`or`/`xor`. +* `logical_and`/`logical_or`/`logical_xor` - logical `and`/`or`/`xor`. Examples: @@ -923,90 +923,90 @@ Examples: ``` -##### Sub-group ``shuffle`` builtin +##### Sub-group `shuffle` builtin -The ``sub_group_shuffle`` builtin allows data to be arbitrarily transferred +The `sub_group_shuffle` builtin allows data to be arbitrarily transferred between invocations in a sub-group. The data that is returned for this -invocation is the value of ``%val`` for the invocation identified by ``%lid``. +invocation is the value of `%val` for the invocation identified by `%lid`. -``%lid`` need not be the same value for all invocations in the sub-group. +`%lid` need not be the same value for all invocations in the sub-group. ```llvm i32 @__mux_sub_group_shuffle_i32(i32 %val, i32 %lid) ``` -##### Sub-group ``shuffle_up`` builtin +##### Sub-group `shuffle_up` builtin -The ``sub_group_shuffle_up`` builtin allows data to be transferred from an +The `sub_group_shuffle_up` builtin allows data to be transferred from an invocation in the sub-group with a lower sub-group local invocation ID up to an invocation in the sub-group with a higher sub-group local invocation ID. -The builtin has two operands: ``%prev`` and ``%curr``. To determine the result -of this builtin, first let ``SubgroupLocalInvocationId`` be equal to -``__mux_get_sub_group_local_id()``, let the signed shuffle index be equivalent -to this invocation’s ``SubgroupLocalInvocationId`` minus the specified -``%delta``, and ``MaxSubgroupSize`` be equal to -``__mux_get_max_sub_group_size()`` for the current kernel. +The builtin has two operands: `%prev` and `%curr`. To determine the result +of this builtin, first let `SubgroupLocalInvocationId` be equal to +`__mux_get_sub_group_local_id()`, let the signed shuffle index be equivalent +to this invocation’s `SubgroupLocalInvocationId` minus the specified +`%delta`, and `MaxSubgroupSize` be equal to +`__mux_get_max_sub_group_size()` for the current kernel. * If the shuffle index is greater than or equal to zero and less than the - ``MaxSubgroupSize``, the result of this builtin is the value of the ``%curr`` - operand for the invocation with ``SubgroupLocalInvocationId`` equal to the + `MaxSubgroupSize`, the result of this builtin is the value of the `%curr` + operand for the invocation with `SubgroupLocalInvocationId` equal to the shuffle index. * If the shuffle index is less than zero but greater than or equal to the - negative ``MaxSubgroupSize``, the result of this builtin is the value of the - ``%prev`` operand for the invocation with ``SubgroupLocalInvocationId`` equal - to the shuffle index plus the ``MaxSubgroupSize``. + negative `MaxSubgroupSize`, the result of this builtin is the value of the + `%prev` operand for the invocation with `SubgroupLocalInvocationId` equal + to the shuffle index plus the `MaxSubgroupSize`. All other values of the shuffle index are considered to be out-of-range. -``%delta`` need not be the same value for all invocations in the sub-group. +`%delta` need not be the same value for all invocations in the sub-group. 
```llvm i8 @__mux_sub_group_shuffle_up_i8(i8 %prev, i8 %curr, i32 %delta) ``` -##### Sub-group ``shuffle_down`` builtin +##### Sub-group `shuffle_down` builtin -The ``sub_group_shuffle_down`` builtin allows data to be transferred from an +The `sub_group_shuffle_down` builtin allows data to be transferred from an invocation in the sub-group with a higher sub-group local invocation ID down to a invocation in the sub-group with a lower sub-group local invocation ID. -The builtin has two operands: ``%curr`` and ``%next``. To determine the result -of this builtin , first let ``SubgroupLocalInvocationId`` be equal to -``__mux_get_sub_group_local_id()``, the unsigned shuffle index be equivalent to -the sum of this invocation’s ``SubgroupLocalInvocationId`` plus the specified -``%delta``, and ``MaxSubgroupSize`` be equal to -``__mux_get_max_sub_group_size()`` for the current kernel. - -* If the shuffle index is less than the ``MaxSubgroupSize``, the result of this - builtin is the value of the ``%curr`` operand for the invocation with - ``SubgroupLocalInvocationId`` equal to the shuffle index. - -* If the shuffle index is greater than or equal to the ``MaxSubgroupSize`` but - less than twice the ``MaxSubgroupSize``, the result of this builtin is the - value of the ``%next`` operand for the invocation with - ``SubgroupLocalInvocationId`` equal to the shuffle index minus the - ``MaxSubgroupSize``. All other values of the shuffle index are considered to +The builtin has two operands: `%curr` and `%next`. To determine the result +of this builtin , first let `SubgroupLocalInvocationId` be equal to +`__mux_get_sub_group_local_id()`, the unsigned shuffle index be equivalent to +the sum of this invocation’s `SubgroupLocalInvocationId` plus the specified +`%delta`, and `MaxSubgroupSize` be equal to +`__mux_get_max_sub_group_size()` for the current kernel. + +* If the shuffle index is less than the `MaxSubgroupSize`, the result of this + builtin is the value of the `%curr` operand for the invocation with + `SubgroupLocalInvocationId` equal to the shuffle index. + +* If the shuffle index is greater than or equal to the `MaxSubgroupSize` but + less than twice the `MaxSubgroupSize`, the result of this builtin is the + value of the `%next` operand for the invocation with + `SubgroupLocalInvocationId` equal to the shuffle index minus the + `MaxSubgroupSize`. All other values of the shuffle index are considered to be out-of-range. All other values of the shuffle index are considered to be out-of-range. -``%delta`` need not be the same value for all invocations in the sub-group. +`%delta` need not be the same value for all invocations in the sub-group. ```llvm float @__mux_sub_group_shuffle_down_f32(float %curr, float %next, i32 %delta) ``` -##### Sub-group ``shuffle_xor`` builtin +##### Sub-group `shuffle_xor` builtin -These ``sub_group_shuffle_xor`` builtin allows for efficient sharing of data +These `sub_group_shuffle_xor` builtin allows for efficient sharing of data between items within a sub-group. -The data that is returned for this invocation is the value of ``%val`` for the +The data that is returned for this invocation is the value of `%val` for the invocation with sub-group local ID equal to this invocation’s sub-group local -ID XOR’d with the specified ``%xor_val``. If the result of the XOR is greater +ID XOR’d with the specified `%xor_val`. If the result of the XOR is greater than the current kernel's maximum sub-group size, then it is considered out-of-range. 
@@ -1021,7 +1021,7 @@ The mux barrier builtins synchronize both memory and execution flow. The specific semantics with which they synchronize are defined using the following enums. -The ``%scope`` parameter defines which other invocations observe the memory +The `%scope` parameter defines which other invocations observe the memory ordering provided by the barrier. Only one of the values may be chosen simultaneously. @@ -1035,9 +1035,9 @@ simultaneously. }; ``` -The ``%semantics`` parameter defines the kind of memory affected by the +The `%semantics` parameter defines the kind of memory affected by the barrier, as well as the ordering constraints. Only one of the possible -**ordering**\s may be chosen simultaneously. The **memory** field is a +**ordering**s may be chosen simultaneously. The **memory** field is a bitfield. ```cpp @@ -1062,19 +1062,19 @@ bitfield. ##### Atomics and Fences The LLVM intermediate representation stored in -``compiler::BaseModule::finalized_llvm_module`` **may** contain any of the +`compiler::BaseModule::finalized_llvm_module` **may** contain any of the following atomic instructions: * [`cmpxchg`](https://llvm.org/docs/LangRef.html#cmpxchg-instruction) for the `monotonic ordering`_ with *strong* semantics only -* [`atomicrmw`](https://llvm.org/docs/LangRef.html#atomicrmw-instruction) for the following opcodes: ``add``, ``and``, ``sub``, ``min``, - ``max``, ``umin``, ``umax``, ``or``, ``xchg``, ``xor`` for the `monotonic +* [`atomicrmw`](https://llvm.org/docs/LangRef.html#atomicrmw-instruction) for the following opcodes: `add`, `and`, `sub`, `min`, + `max`, `umin`, `umax`, `or`, `xchg`, `xor` for the `monotonic ordering`_ only A compiler **shall** correctly legalize or select these instructions to ISA specific operations. The LLVM intermediate representation stored in -``compiler::BaseModule::finalized_llvm_module`` **may** also contain any of the +`compiler::BaseModule::finalized_llvm_module` **may** also contain any of the following atomic instructions: https://llvm.org/docs/LangRef.html#atomicrmw-instruction * [cmpxchg](https://llvm.org/docs/LangRef.html#cmpxchg-instruction) for the [monotonic ordering](https://llvm.org/docs/LangRef.html#ordering) with *weak* semantics @@ -1101,29 +1101,29 @@ pipeline: | Name | Fields | Description | |------|--------|-------------| - |``!reqd_work_group_size``|i32, i32, i32|Required work-group size encoded as *X*, *Y*, *Z*. If not present, no required size is assumed.| - |``!max_work_dim``| i32 | Maximum dimension used for work-items. If not present, ``3`` is assumed.| - |``!codeplay_ca_wrapper``|various (incl. *vectorization options*)|Information about a *kernel entry point* regarding its work-item iteration over *sub-kernels* as stitched together by the ``WorkItemLoopsPass`` pass in the ``compiler::utils`` module. Typically this involves the loop structure, the vectorization width and options of each loop.| - |``!codeplay_ca_vecz.base``|*vectorization options*, ``Function*``| Links one function to another, indicating that the function acts as the *base* - or *source* - of vectorization with the given vectorization options, and the linked function is the result of a *successful* vectorization. 
A function may have *many* such pieces of metadata, if it was vectorized multiple times.| - |``!codeplay_ca_vecz.derived``|*vectorization options*, ``Function*``| Links one function to another, indicating that the function is the result of a *successful* vectorization with the given vectorization options, using the linked function as the *base* - or *source* - of vectorization. A function may only have **one** such piece of metadata.| - |``!codeplay_ca_vecz.base.fail``|*vectorization options*| Metadata indicating a *failure* to vectorize with the provided vectorization options.| - |``!mux_scheduled_fn``|i32, i32(, i32, i32)?| Metadata indicating the function parameter indices of the pointers to MuxWorkItemInfo and MuxWorkGroupInfo structures, respectively. A negative value (canonicalized as -1) indicates the function has no such parameter. Up to two additional custom parameter indices can be used by targets.| - |``!intel_reqd_sub_group_size``|i32|Required sub-group size encoded as a 32-bit integer. If not present, no required sub-group size is assumed.| + |`!reqd_work_group_size`|i32, i32, i32|Required work-group size encoded as *X*, *Y*, *Z*. If not present, no required size is assumed.| + |`!max_work_dim`| i32 | Maximum dimension used for work-items. If not present, `3` is assumed.| + |`!codeplay_ca_wrapper`|various (incl. *vectorization options*)|Information about a *kernel entry point* regarding its work-item iteration over *sub-kernels* as stitched together by the `WorkItemLoopsPass` pass in the `compiler::utils` module. Typically this involves the loop structure, the vectorization width and options of each loop.| + |`!codeplay_ca_vecz.base`|*vectorization options*, `Function*`| Links one function to another, indicating that the function acts as the *base* - or *source* - of vectorization with the given vectorization options, and the linked function is the result of a *successful* vectorization. A function may have *many* such pieces of metadata, if it was vectorized multiple times.| + |`!codeplay_ca_vecz.derived`|*vectorization options*, `Function*`| Links one function to another, indicating that the function is the result of a *successful* vectorization with the given vectorization options, using the linked function as the *base* - or *source* - of vectorization. A function may only have **one** such piece of metadata.| + |`!codeplay_ca_vecz.base.fail`|*vectorization options*| Metadata indicating a *failure* to vectorize with the provided vectorization options.| + |`!mux_scheduled_fn`|i32, i32(, i32, i32)?| Metadata indicating the function parameter indices of the pointers to MuxWorkItemInfo and MuxWorkGroupInfo structures, respectively. A negative value (canonicalized as -1) indicates the function has no such parameter. Up to two additional custom parameter indices can be used by targets.| + |`!intel_reqd_sub_group_size`|i32|Required sub-group size encoded as a 32-bit integer. If not present, no required sub-group size is assumed.| Users **should not** rely on the name, format, or operands of these metadata. -Instead, utility functions are provided by the ``utils`` module to work with +Instead, utility functions are provided by the `utils` module to work with accessing, setting, or updating each piece of metadata. > [!NOTE] > The metadata above which refer to *vectorization options* have no concise metadata form as defined by the specification and **are not** guaranteed to - be backwards compatible. See the C++ utility APIs in the ``utils`` module as + be backwards compatible. 
See the C++ utility APIs in the `utils` module as
   described above for the specific information encoded/decoded by
   vectorization.
 
 | Name | Fields | Description |
 |------|--------|-------------|
-  |``!mux-scheduling-params``|string, string, ...| A list of scheduling parameter names used by this target. Emitted into the module at the time scheduling parameters are added to functions that requires them. The indices found in ``!mux_scheduled_fn`` function metadata are indices into this list.
+  |`!mux-scheduling-params`|string, string, ...| A list of scheduling parameter names used by this target. Emitted into the module at the time scheduling parameters are added to functions that require them. The indices found in `!mux_scheduled_fn` function metadata are indices into this list.
 
 ### Function Attributes
 
@@ -1133,21 +1133,21 @@ different stages of the pipeline:
 
 | Attribute | Description |
 |------------------|-------------|
-  |``"mux-kernel"/"mux-kernel"="x"``| Denotes a *"kernel"* function. Additionally denotes a *"kernel entry point"* if the value is ``"entry-point"``. `See below [mux-kernel](#mux-kernel-attribute) for more details. |
-  |``"mux-orig-fn"="val"``| Denotes the name of the *"original function"* of a function. This original function may or may not exist in the module. The original function name is propagated through the compiler pipeline each time Native CPU creates a new function to wrap or replace a function. |
-  |``"mux-base-fn-name"="val"``| Denotes the *"base name component"* of a function. Used by several passes when creating new versions of a kernel, rather than appending suffix upon suffix.|
+  |`"mux-kernel"/"mux-kernel"="x"`| Denotes a *"kernel"* function. Additionally denotes a *"kernel entry point"* if the value is `"entry-point"`. See [mux-kernel](#mux-kernel-attribute) below for more details. |
+  |`"mux-orig-fn"="val"`| Denotes the name of the *"original function"* of a function. This original function may or may not exist in the module. The original function name is propagated through the compiler pipeline each time Native CPU creates a new function to wrap or replace a function. |
+  |`"mux-base-fn-name"="val"`| Denotes the *"base name component"* of a function. Used by several passes when creating new versions of a kernel, rather than appending suffix upon suffix.|
 
   For example, a pass that suffixes newly-created functions with
-  ``".pass2"`` will generate ``@foo.pass1.pass2`` when given function
-  ``@foo.pass1``, but will generate simply ``@foo.pass2`` if the same
-  function has ``"mux-base-name"="foo"``.
+  `".pass2"` will generate `@foo.pass1.pass2` when given function
+  `@foo.pass1`, but will generate simply `@foo.pass2` if the same
+  function has `"mux-base-fn-name"="foo"`.
 
 | Attribute | Description |
 |-----------|-------------|
-  |``"mux-local-mem-usage"="val"``| Estimated local-memory usage for the function. Value must be a positive integer. |
-  |``"mux-work-item-order"="val"``| Work-item order (the dimensions over which work-items are executed from innermost to outermost) as defined by the ``utils_work_item_order_e`` enum. If not present, ``"xyz"`` may be assumed. |
-  | ``"mux-barrier-schedule"="val"``| Typically found on call sites. Determines the ordering of work-item execution after a berrier. See the `BarrierSchedule` enum. |
-  | ``"mux-no-subgroups"``| Marks the function as not explicitly using sub-groups (e.g., identified by the use of known mux sub-group builtins). If a pass introduces the explicit use of sub-groups to a function, it should remove this attribute. |
+  |`"mux-local-mem-usage"="val"`| Estimated local-memory usage for the function. Value must be a positive integer. |
+  |`"mux-work-item-order"="val"`| Work-item order (the dimensions over which work-items are executed from innermost to outermost) as defined by the `utils_work_item_order_e` enum. If not present, `"xyz"` may be assumed. |
+  | `"mux-barrier-schedule"="val"`| Typically found on call sites. Determines the ordering of work-item execution after a barrier. See the `BarrierSchedule` enum. |
+  | `"mux-no-subgroups"`| Marks the function as not explicitly using sub-groups (e.g., identified by the use of known mux sub-group builtins). If a pass introduces the explicit use of sub-groups to a function, it should remove this attribute. |
 
 #### mux-kernel attribute

From d88de7b86fe9800d8a328419297fb69f97106c7a Mon Sep 17 00:00:00 2001
From: Colin Davidson
Date: Tue, 16 Sep 2025 14:02:47 +0100
Subject: [PATCH 08/14] Fixed bad references

---
 sycl/doc/design/SYCLNativeCPUPipelinePasses.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/sycl/doc/design/SYCLNativeCPUPipelinePasses.md b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md
index 0ac568cd5a669..cd23390ae68cf 100644
--- a/sycl/doc/design/SYCLNativeCPUPipelinePasses.md
+++ b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md
@@ -807,8 +807,7 @@ The Following intermediate representations are used in the interface to Native C
   control is only ensured for memory accesses issued by the invocation calling
   the barrier and observed by another invocation executing within the memory
   `%scope`. Additional control over the kind of memory controlled and what
-  kind of control to apply is provided by `%semantics`. See `below
-  <#memory-and-control-barriers>`__ for more information.
+  kind of control to apply is provided by `%semantics`. See [memory and control barriers](#memory-and-control-barriers) for more information.
 * `void __mux_work_group_barrier(i32 %id, i32 %scope, i32 %semantics)` and
   `void __mux_sub_group_barrier(i32 %id, i32 %scope, i32 %semantics)` - Wait
   for other invocations of the work-group/sub-group to reach the current point
@@ -816,8 +815,7 @@ The Following intermediate representations are used in the interface to Native C
   of execution (serves as a control barrier). A barrier identifier is provided
   by `%id` (note that implementations **must** ensure uniqueness themselves,
   e.g., by running the `compiler::utils::PrepareBarriersPass`). These
   builtins may also atomically provide a memory barrier with the same semantics
-  as `__mux_mem_barrier(i32 %scope, i32 %semantics)`. See `below
-  <#memory-and-control-barriers>`__ for more information.
+  as `__mux_mem_barrier(i32 %scope, i32 %semantics)`. See [memory and control barriers](#memory-and-control-barriers) for more information.
 
 ##### Group operation builtins

From 1f9af147e579b6a37011ae940189572d78781dd4 Mon Sep 17 00:00:00 2001
From: Colin Davidson
Date: Tue, 16 Sep 2025 15:24:47 +0100
Subject: [PATCH 09/14] Updated for latest review comments

---
 sycl/doc/design/SYCLNativeCPUPipeline.md | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/sycl/doc/design/SYCLNativeCPUPipeline.md b/sycl/doc/design/SYCLNativeCPUPipeline.md
index 0baf2469391f3..3717f0c0067a2 100644
--- a/sycl/doc/design/SYCLNativeCPUPipeline.md
+++ b/sycl/doc/design/SYCLNativeCPUPipeline.md
@@ -20,19 +20,18 @@ modules containing one or more kernel functions to object code ready for
 execution when invoked by the host-side runtime.
The assumptions placed on the input and output kernels is as follows: -1. The original kernel is assumed to adhere to an implicit **SIMT** +1. The original kernel is assumed to adhere to an implicit **SIMT** execution model; it runs once per each *work-item* in an **NDRange**. 2. It is passed a state struct which contains information about the scheduling. 3. All builtins which do not relate to scheduling have been processed and we are - left with some scheduling related calls to `mux builtins`. -4. The final compiled kernel is assumed to be invoked from the + left with some scheduling related calls to "mux builtins". +4. The final compiled kernel is assumed to be invoked from the host-side runtime once per *work-group* in the **NDRange**. The inner-most function is the original input kernel, which is *wrapped* by new functions in successive phases, until it is ready in a form to be -executed by the Native CPU driver. These include effectively wrapping a `for (wi : wg)` -around the original kernel. +executed by the Native CPU driver. The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) is the key pass which makes some of the implicit parallelism @@ -77,9 +76,9 @@ are responsible for adding this information. ### Whole Function Vectorization -The [vecz](SYCLNativeCPUVecz.md) whole-function vectorizer is optionally run. +The [Vecz](SYCLNativeCPUVecz.md) whole-function vectorizer is optionally run. -Note that VECZ may perform its own scalarization, depending on the +Note that Vecz may perform its own scalarization, depending on the options passed to it, potentially undoing the work of any previous optimization passes, although it is able to preserve or even widen pre-existing vector operations in many cases. @@ -148,7 +147,7 @@ where they are used. The [WorkItemLoopsPass](SYCLNativeCPUPipelinePasses.md#workitemloopspass) is responsible for laying out kernels which have been vectorized by the -[vecz](SYCLNativeCPUVecz.md) whole-function vectorizer. +[Vecz](SYCLNativeCPUVecz.md) whole-function vectorizer. The vectorizer creates multiple versions of the original kernel. Vectorized kernels on their own are generally unable to fulfill @@ -164,7 +163,7 @@ For brevity, the diagram below only details in inner-most work-item loops. Most kernels will in reality have 2 outer levels of loops over the full *Y* and *Z* work-group dimensions. -![Work Item Loops with vecz.](images/native_cpu_vecz.jpg) +![Work Item Loops with Vecz.](images/native_cpu_vecz.jpg) In the above example, the vectorized kernel is called to execute as many work-items as possible, up to the largest multiple of the vectorization From 864cd5caabee543484b1c2c92264407d240e9820 Mon Sep 17 00:00:00 2001 From: Colin Davidson Date: Tue, 16 Sep 2025 15:27:51 +0100 Subject: [PATCH 10/14] More changes for Vecz capitalisation --- sycl/doc/design/SYCLNativeCPUPipelinePasses.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/sycl/doc/design/SYCLNativeCPUPipelinePasses.md b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md index cd23390ae68cf..e60177270e9d0 100644 --- a/sycl/doc/design/SYCLNativeCPUPipelinePasses.md +++ b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md @@ -303,7 +303,7 @@ together multiple kernels to make a single kernel capable of correctly executing all work-items in the work-group. In particular, when a kernel has been vectorized with -[vecz](SYCLNativeCPUVecz.md) it executes multiple work-items at +[Vecz](SYCLNativeCPUVecz.md) it executes multiple work-items at once. 
Unless the work-group size in the vectorized dimension is known to be a multiple of the vectorization factor, there exists the possibility that some work-items will not be executed by the vectorized loop. @@ -415,7 +415,7 @@ if (group_size_x >= vector_width) { #### Vector only If the [WorkItemLoopsPass](#workitemloopspass) is run on a vectorized -kernel for which no [vecz](SYCLNativeCPUVecz.md) linking metadata is found to +kernel for which no [Vecz](SYCLNativeCPUVecz.md) linking metadata is found to identify the scalar kernel, or if a scalar kernel is found but one of the conditions listed above hold, then the kernel is emitted using the vector kernel only. It is assumed that if no scalar kernel is found it @@ -475,7 +475,7 @@ performed on ARM targets due to LLVM backend compiler bugs. ### RunVeczPass The `RunVeczPass` module pass provides a wrapper for using our -[vecz](SYCLNativeCPUVecz.md) IR vectorizer. This vectorizes the +[Vecz](SYCLNativeCPUVecz.md) IR vectorizer. This vectorizes the kernel to a SIMD width specified when the pass is created. In our case this is typically local size in the first dimension but there are other factors to consider when picking the width, like being a power of 2. @@ -494,9 +494,9 @@ kernel is used instead of our scalar kernel from here on. #### Cost Model Interface -User cost-modelling in vecz can be handled by the +User cost-modelling in Vecz can be handled by the `vecz::VeczPassOptionsAnalsis` which takes a user defined query function -on construction. This pass is a required analysis pass for vecz, so be +on construction. This pass is a required analysis pass for Vecz, so be sure to add it to your analysis manager. Vecz queries the result of this analysis before operating on a kernel, @@ -505,7 +505,7 @@ contain suitably modelled widths, vectorization factors, and scalability options determined suitable for the target. The `VeczPassOptionsAnalysis` pass can be default-constructed - in which -case vecz makes a conservative decision about kernel vectorization - or +case Vecz makes a conservative decision about kernel vectorization - or be constructed passing in a user callback function. The function takes as its parameters a reference to the function to be optionally vectorized, and a reference to a vector of `VeczPassOptions` which it is @@ -661,8 +661,8 @@ provided by `BIMuxInfoConcept` used by the [DefineMuxBuiltinsPass](#definemuxbuiltinspass). Vectorized definitions of the various sub-group builtins are provided by -the VECZ pass, so any target running VECZ (and the above passes) will be -able to support sub-groups of a larger size than 1. Note that VECZ does +the Vecz pass, so any target running Vecz (and the above passes) will be +able to support sub-groups of a larger size than 1. Note that Vecz does not currently interact "on top of" the mux builtins - it replaces them in the functions it vectorized. This is future work to allow the two to build on top of each other. 
From fe53fcce307b2c02e2c94e4231ded80d1db27c0d Mon Sep 17 00:00:00 2001 From: Colin Davidson Date: Tue, 16 Sep 2025 17:14:24 +0100 Subject: [PATCH 11/14] Update after review comments --- sycl/doc/design/SYCLNativeCPUPipeline.md | 8 ++++---- sycl/doc/design/SYCLNativeCPUPipelinePasses.md | 4 ++-- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/sycl/doc/design/SYCLNativeCPUPipeline.md b/sycl/doc/design/SYCLNativeCPUPipeline.md index 3717f0c0067a2..1759ca23f8122 100644 --- a/sycl/doc/design/SYCLNativeCPUPipeline.md +++ b/sycl/doc/design/SYCLNativeCPUPipeline.md @@ -9,7 +9,7 @@ run in `llvm::sycl::utils::addSYCLNativeCPUBackendPasses`. All of the compiler pipeline code can be found under [llvm/lib/SYCLNativeCPUUtils](https://github.com/intel/llvm/tree/sycl/llvm/lib/SYCLNativeCPUUtils), with the code which originated from the [oneAPI Construction -Kit](https://github.com/uxlfoundation/oneapi-construction-kit/tree/main), under +Kit](https://github.com/uxlfoundation/oneapi-construction-kit), under `compiler_passes` in that directory. @@ -21,13 +21,13 @@ execution when invoked by the host-side runtime. The assumptions placed on the input and output kernels is as follows: 1. The original kernel is assumed to adhere to an implicit **SIMT** - execution model; it runs once per each *work-item* in an - **NDRange**. + execution model; it runs once per each *work-item* in an + **NDRange**. 2. It is passed a state struct which contains information about the scheduling. 3. All builtins which do not relate to scheduling have been processed and we are left with some scheduling related calls to "mux builtins". 4. The final compiled kernel is assumed to be invoked from the - host-side runtime once per *work-group* in the **NDRange**. + host-side runtime once per *work-group* in the **NDRange**. The inner-most function is the original input kernel, which is *wrapped* by new functions in successive phases, until it is ready in a form to be diff --git a/sycl/doc/design/SYCLNativeCPUPipelinePasses.md b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md index e60177270e9d0..6ce8a7ad90876 100644 --- a/sycl/doc/design/SYCLNativeCPUPipelinePasses.md +++ b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md @@ -53,8 +53,8 @@ up in a triple-nested loop over all work-items in the work-group. Thus, kernels scheduled by this pass can be invoked once per work-group. The order in which work-items are executed is fairly flexible, but generally in -ascending order from [0] to [N-1] through the innermost [X] dimension, followed -by the [Y] dimension, and lastly the [Z] dimension. +ascending order from \[0\] to \[N-1\] through the innermost \[X\] dimension, followed +by the \[Y\] dimension, and lastly the \[Z\] dimension. 
Conceptually, the pass transforms `old_kernel` into `new_kernel` in the example below: From 5d275f2180f8b8c177007973c93cd6278fa9ed89 Mon Sep 17 00:00:00 2001 From: Colin Davidson Date: Tue, 16 Sep 2025 17:27:05 +0100 Subject: [PATCH 12/14] Fix header levels in SYCLNativeCPU --- sycl/doc/design/SYCLNativeCPU.md | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/sycl/doc/design/SYCLNativeCPU.md b/sycl/doc/design/SYCLNativeCPU.md index b89379271e3d1..a7d3417e938e7 100644 --- a/sycl/doc/design/SYCLNativeCPU.md +++ b/sycl/doc/design/SYCLNativeCPU.md @@ -2,7 +2,7 @@ The SYCL Native CPU flow aims at treating the host CPU as a "first class citizen", providing a SYCL implementation that targets CPUs of various different architectures, with no other dependencies than DPC++ itself, while bringing performances comparable to state-of-the-art CPU backends. SYCL Native CPU also provides some initial/experimental support for LLVM's [source-based code coverage tools](https://clang.llvm.org/docs/SourceBasedCodeCoverage.html) (see also section [Code coverage](#code-coverage)). -# Compiler and runtime options +## Compiler and runtime options The SYCL Native CPU flow is enabled by setting `native_cpu` as a `sycl-target`: @@ -35,7 +35,7 @@ clang++ -fsycl -fsycl-targets=native_cpu,spir64 -o ``` The application can then run on either SYCL target by setting the DPC++ `ONEAPI_DEVICE_SELECTOR` environment variable to include `native_cpu:cpu` accordingly. -## Configuring DPC++ with SYCL Native CPU +### Configuring DPC++ with SYCL Native CPU SYCL Native CPU needs to be enabled explicitly when configuring DPC++, using `--native_cpu`, e.g. @@ -45,11 +45,11 @@ python buildbot/configure.py \ # other options here ``` -### libclc target triples +#### libclc target triples SYCL Native CPU uses [libclc](https://github.com/intel/llvm/tree/sycl/libclc) to implement many SPIRV builtins. When Native CPU is enabled, the default target triple for libclc will be `LLVM_TARGET_TRIPLE` (same as the default target triple used by `clang`). This can be overridden by setting the `--native-cpu-libclc-targets` option in `configure.py`. -### oneTBB integration +#### oneTBB integration SYCL Native CPU can use oneTBB as an optional backend for task scheduling. oneTBB with SYCL Native CPU is enabled by setting `NATIVECPU_WITH_ONETBB=On` at configure time: @@ -63,7 +63,7 @@ This will pull oneTBB into SYCL Native CPU via CMake `FetchContent` and DPC++ ca By default SYCL Native CPU implements its own scheduler whose only dependency is standard C++. -# Supported features and current limitations +## Supported features and current limitations The SYCL Native CPU flow is still WIP, not optimized and several core SYCL features are currently unsupported. Currently `barriers` are supported only when the oneAPI Construction Kit integration is enabled, several math builtins are not supported and attempting to use those will most likely fail with an `undefined reference` error at link time. Examples of supported applications can be found in the [runtime tests](https://github.com/intel/llvm/blob/sycl/sycl/test/native_cpu). @@ -84,7 +84,7 @@ cmake \ Note that a number of `e2e` tests are currently still failing. 
-# Vectorization +## Vectorization With the integration of the OneAPI Construction Kit, the SYCL Native CPU target also gained support for Whole Function Vectorization.\\ @@ -96,7 +96,7 @@ The `-march=` option can be used to select specific target cpus which may improv For more details on how the Whole Function Vectorizer is integrated for SYCL Native CPU, refer to the [Native CPU Compiler Pipeline](#native-cpu-compiler-pipeline) section. -# Code coverage +## Code coverage SYCL Native CPU has experimental support for LLVM's source-based [code coverage](https://clang.llvm.org/docs/SourceBasedCodeCoverage.html). This enables coverage testing across device and host code. Example usage: @@ -108,7 +108,7 @@ llvm-profdata merge -sparse default.profraw -o foo.profdata llvm-cov show .\vector-add.exe -instr-profile=foo.profdata ``` -## Ongoing work +### Ongoing work * Complete support for remaining SYCL features, including but not limited to * math and other builtins @@ -117,7 +117,7 @@ llvm-cov show .\vector-add.exe -instr-profile=foo.profdata ### Please note that Windows is partially supported but temporarily disabled due to some implementation details, it will be re-enabled soon. -# Native CPU compiler pipeline +## Native CPU compiler pipeline SYCL Native CPU formerly used the [oneAPI Construction Kit](https://github.com/uxlfoundation/oneapi-construction-kit) (OCK) via CMake FetchContent in order to support some core SYCL functionalities and improve performances in the compiler pipeline. The relevant OCK parts have been brought into DPC++ and the Native CPU compiler pipeline is documented in [SYCLNativeCPUPipeline documentation](SYCLNativeCPUPipeline.md), with a brief overview below. The OCK- related parts are still enabled by using the `NATIVECPU_USE_OCK` CMake variable, but this is enabled by default. @@ -166,15 +166,15 @@ For the SYCL Native CPU target, the device compiler is in charge of materializin The PrepareSYCLNativeCPUPass also emits a `subhandler` wrapper function, which receives the kernel arguments from the SYCL runtime (packed in a vector), unpacks them, and forwards only the used ones to the actual kernel. -## PrepareSYCLNativeCPU Pass +### PrepareSYCLNativeCPU Pass This pass will add a pointer to a `native_cpu::state` struct as kernel argument to all the kernel functions, and it will replace all the uses of SPIRV builtins with the return value of appropriately defined functions, which will read the requested information from the `native_cpu::state` struct. For more information, see [PrepareSYCLNativeCPU Pass](SYCLNativeCPUPipeline.md#preparesyclnativecpu-pass). -## Handling barriers +### Handling barriers On SYCL Native CPU, calls to `__spirv_ControlBarrier` are handled using the `WorkItemLoopsPass` from the oneAPI Construction Kit. This pass handles barriers by splitting the kernel between calls to `__spirv_ControlBarrier`, and creating a wrapper that runs the subkernels over the local range. In order to correctly interface to the oneAPI Construction Kit pass pipeline, SPIRV builtins are defined in the device library to call the corresponding `mux` builtins (used by the OCK). -## Vectorization +### Vectorization The Whole Function Vectorizer is executed as an LLVM Pass. Considering the following input function: @@ -216,7 +216,7 @@ and points to the original version of the function. This information is used lat which will account for the vectorization when creating the Work Item Loops, and use the original version of the function to add peeling loops. 
-## Kernel registration +### Kernel registration In order to register the SYCL Native CPU kernels to the SYCL runtime, we applied a small change to the `clang-offload-wrapper` tool: normally, the `clang-offload-wrapper` bundles the offload binary in an LLVM-IR module. Instead of bundling the device code, for the SYCL Native CPU target we insert an array of function pointers to the `subhandler`s, and the `sycl_device_binary_struct::BinaryStart` and `sycl_device_binary_struct::BinaryEnd` fields, which normally point to the begin and end addresses of the offload binary, now point to the begin and end of the array. @@ -232,6 +232,6 @@ In order to register the SYCL Native CPU kernels to the SYCL runtime, we applied Each entry in the array contains the kernel name as a string, and a pointer to the `subhandler` function declaration. Since the subhandler's signature has always the same arguments (two pointers in LLVM-IR), the `clang-offload-wrapper` can emit the function declarations given just the function names contained in the `.table` file emitted by `sycl-post-link`. The symbols are then resolved by the system's linker, which receives both the output from the offload wrapper and the lowered device module. -## Kernel lowering and execution +### Kernel lowering and execution The information produced by the device compiler is then employed to correctly lower the kernel LLVM-IR module to the target ISA (this is performed by the driver when `-fsycl-targets=native_cpu` is set). The object file containing the kernel code is linked with the host object file (and libsycl and any other needed library) and the final executable is run using the SYCL Native CPU UR Adapter, defined in [the Unified Runtime repo](https://github.com/oneapi-src/unified-runtime/tree/adapters/source/adapters/native_cpu). From dc58a743ebcfdbd2c283be724e146a3cef61b94a Mon Sep 17 00:00:00 2001 From: Colin Davidson Date: Tue, 16 Sep 2025 17:41:59 +0100 Subject: [PATCH 13/14] More updates from reviews --- sycl/doc/design/SYCLNativeCPUPipeline.md | 2 +- sycl/doc/design/SYCLNativeCPUPipelinePasses.md | 2 +- sycl/doc/design/SYCLNativeCPUVecz.md | 2 -- 3 files changed, 2 insertions(+), 4 deletions(-) diff --git a/sycl/doc/design/SYCLNativeCPUPipeline.md b/sycl/doc/design/SYCLNativeCPUPipeline.md index 1759ca23f8122..02b5fc95276b5 100644 --- a/sycl/doc/design/SYCLNativeCPUPipeline.md +++ b/sycl/doc/design/SYCLNativeCPUPipeline.md @@ -1,4 +1,4 @@ -# Native CPU Compiler Pipeline Overview +# SYCL Native CPU Pipeline Overview ## Introduction diff --git a/sycl/doc/design/SYCLNativeCPUPipelinePasses.md b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md index 6ce8a7ad90876..d2064a58b5226 100644 --- a/sycl/doc/design/SYCLNativeCPUPipelinePasses.md +++ b/sycl/doc/design/SYCLNativeCPUPipelinePasses.md @@ -1,4 +1,4 @@ -# SYCL Native CPU pipeline passes +# SYCL Native CPU Pipeline Passes The `compiler::utils` module exists under [compiler_pipeline](https://github.com/intel/llvm/tree/sycl/llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_pipeline) diff --git a/sycl/doc/design/SYCLNativeCPUVecz.md b/sycl/doc/design/SYCLNativeCPUVecz.md index 2368476be684a..44b705e995372 100644 --- a/sycl/doc/design/SYCLNativeCPUVecz.md +++ b/sycl/doc/design/SYCLNativeCPUVecz.md @@ -1320,6 +1320,4 @@ must be present, otherwise scalarization or packetization will not be able to materialize the scalarized/vectorized builtin calls and veczc will fail with an error message. 
-## References
-
 [1]: http://dblp.uni-trier.de/pers/hd/k/Karrenberg:Ralf

From 710f0dc248e012bf0dce214eca717aa95a8ed338 Mon Sep 17 00:00:00 2001
From: Colin Davidson
Date: Wed, 17 Sep 2025 11:06:01 +0100
Subject: [PATCH 14/14] Update on Supported features and current limitations

After recent comments, some updates
---
 sycl/doc/design/SYCLNativeCPU.md | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/sycl/doc/design/SYCLNativeCPU.md b/sycl/doc/design/SYCLNativeCPU.md
index a7d3417e938e7..bad698e736719 100644
--- a/sycl/doc/design/SYCLNativeCPU.md
+++ b/sycl/doc/design/SYCLNativeCPU.md
@@ -65,7 +65,15 @@ By default SYCL Native CPU implements its own scheduler whose only dependency is
 
 ## Supported features and current limitations
 
-The SYCL Native CPU flow is still WIP, not optimized and several core SYCL features are currently unsupported. Currently `barriers` are supported only when the oneAPI Construction Kit integration is enabled, several math builtins are not supported and attempting to use those will most likely fail with an `undefined reference` error at link time. Examples of supported applications can be found in the [runtime tests](https://github.com/intel/llvm/blob/sycl/sycl/test/native_cpu).
+SYCL Native CPU supports all core SYCL features, though with some outstanding bugs. The following optional features currently have no or only partial support:
+
+* bfloat16
+* address sanitizer
+* images
+* device globals (support is unclear; at least one of the related tests currently passes)
+* ESIMD
+
+Some of these, such as bfloat16, will fail with an `undefined reference` error at link time.
 
 To execute the `e2e` tests on SYCL Native CPU, configure the test suite with:
 
@@ -78,8 +86,7 @@ cmake \
 -G Ninja \
 -B build -S . \
 -DCMAKE_CXX_COMPILER=clang++ \
-  -DSYCL_TEST_E2E_TARGETS="native_cpu:cpu"
-
+  -DSYCL_TEST_E2E_TARGETS="native_cpu:cpu"
 ```
 Note that a number of `e2e` tests are currently still failing.
 
@@ -96,6 +103,8 @@ The `-march=` option can be used to select specific target cpus which may improv
 
 For more details on how the Whole Function Vectorizer is integrated for SYCL Native CPU, refer to the [Native CPU Compiler Pipeline](#native-cpu-compiler-pipeline) section.
 
+To run the Vecz lit tests, build DPC++ with `-DNATIVE_CPU_BUILD_VECZ_TEST_TOOLS=ON` and run with `check-sycl-vecz`.
+
 ## Code coverage
 
 SYCL Native CPU has experimental support for LLVM's source-based [code coverage](https://clang.llvm.org/docs/SourceBasedCodeCoverage.html). This enables coverage testing across device and host code.
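
For reference, the following is a minimal, illustrative LLVM-IR sketch of the transformation described in the pipeline documentation above: the PrepareSYCLNativeCPU pass appends a state-struct pointer argument to the kernel, rewrites uses of SPIR-V builtins into calls that read from that struct, and emits a `subhandler` wrapper that unpacks the argument vector received from the SYCL runtime. The type name `%nativecpu.state`, the accessor `@native_cpu.get_global_id`, and the single-pointer argument layout shown here are placeholders chosen for readability, not the exact symbols emitted by the pass.

```llvm
; Hypothetical input: the kernel reads the global invocation id through a
; SPIR-V builtin variable (materialized by the device compiler).
@__spirv_BuiltInGlobalInvocationId = external global <3 x i64>

define void @old_kernel(ptr %out) {
entry:
  %gid3 = load <3 x i64>, ptr @__spirv_BuiltInGlobalInvocationId
  %gid  = extractelement <3 x i64> %gid3, i64 0
  %slot = getelementptr inbounds i32, ptr %out, i64 %gid
  store i32 1, ptr %slot
  ret void
}

; Hypothetical output: a state-struct pointer is appended to the kernel
; signature and the builtin use becomes a call that reads from that struct.
%nativecpu.state = type opaque

declare i64 @native_cpu.get_global_id(i32, ptr)

define void @new_kernel(ptr %out, ptr %state) {
entry:
  %gid  = call i64 @native_cpu.get_global_id(i32 0, ptr %state)
  %slot = getelementptr inbounds i32, ptr %out, i64 %gid
  store i32 1, ptr %slot
  ret void
}

; Hypothetical subhandler: unpacks the argument vector received from the
; SYCL runtime and forwards the used arguments plus the state pointer.
define void @new_kernel.subhandler(ptr %packed_args, ptr %state) {
entry:
  %out = load ptr, ptr %packed_args
  call void @new_kernel(ptr %out, ptr %state)
  ret void
}
```

Every SPIR-V builtin used by a kernel is rewritten in the same fashion, so at run time the scheduler only needs to update the state struct as it iterates over work-items.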