doc update for hgq/da #1359
@@ -0,0 +1,32 @@

======================
Distributed Arithmetic
======================

.. image:: https://img.shields.io/badge/License-LGPLv3-blue.svg
   :target: https://www.gnu.org/licenses/lgpl-3.0.en.html
.. image:: https://badge.fury.io/py/da4ml.svg
   :target: https://badge.fury.io/py/da4ml
.. image:: https://img.shields.io/badge/arXiv-2507.04535-b31b1b.svg
   :target: https://arxiv.org/abs/2507.04535

Distributed Arithmetic (DA) is a strategy for constant-matrix-vector multiplication (CMVM) operations used in hls4ml. The implementation is provided by an external library, `da4ml <https://github.com/calad0i/da4ml>`__. The library transforms the CMVM operations into an adder graph with common subexpression elimination to reduce the overall complexity. As the CMVM operation is fully unrolled, ``reuse_factor`` **must** be 1 (the default) for the corresponding CMVM operations [*]_. Compared to the traditional ``Latency`` strategy CMVM kernels, DA can usually reduce LUT usage by up to 30% and eliminate all DSP usage.

.. rst-class:: light
.. image:: _static/da4ml-workflow.svg
   :alt: Workflow of DA in hls4ml
   :width: 600

When the DA strategy is used, the CMVM operations will be implemented bit-exactly, and the accumulator precision setting will not be used.

.. [*] Not to be confused with ``II=1``. ``reuse_factor`` is the ``II`` for one CMVM operation, not one layer. One layer may invoke the same CMVM kernel multiple times and thus have ``II>1`` while each CMVM operation is unrolled, e.g., convolution layers with more than one partition.

Currently, the DA strategy is only available for the Vivado/Vitis HLS backends. The following layers are supported:

* Dense
* Convolutional (1D, 2D)
* EinsumDense
* Multi-head attention (implemented as multiple EinsumDense layers)

While possible, the RNN layers are not yet supported by the DA strategy.
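A minimal sketch of enabling DA from the Python API is shown below. The strategy name ``distributed_arithmetic`` and the part number are assumptions for illustration; consult the da4ml documentation for the exact strategy string supported by your hls4ml version.

.. code-block:: Python

    import hls4ml

    # Sketch: select the DA strategy for the whole model (assumed strategy name).
    # reuse_factor is left at its default of 1, as required for DA.
    config = hls4ml.utils.config_from_keras_model(model, granularity='name')
    config['Model']['Strategy'] = 'distributed_arithmetic'

    hls_model = hls4ml.converters.convert_from_keras_model(
        model,
        hls_config=config,
        backend='Vitis',  # DA is only available for the Vivado/Vitis HLS backends
        output_dir='da_prj',
        part='xcvu13p-flga2577-2-e',
    )
    hls_model.compile()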
For more details, please refer to the `da4ml repository <https://github.com/calad0i/da4ml>`__ or the `paper <https://arxiv.org/abs/2507.04535>`__.
@@ -18,22 +18,22 @@

To implement a custom layer in ``hls4ml`` with the extension API, the required components are:

* Function config template
* Registration of layer, source code, and templates

Complete example for Keras v2
=============================

For concreteness, let's say our custom layer ``KReverse`` is implemented in Keras v2 and reverses the order of the last dimension of the input.

.. code-block:: Python

    # Keras implementation of a custom layer
    class KReverse(keras.layers.Layer):
        '''Keras implementation of a hypothetical custom layer'''

        def __init__(self):
            super().__init__()

        def call(self, inputs):
            return inputs[..., ::-1]

        def get_config(self):
            return super().get_config()
@@ -58,19 +58,44 @@

It also returns a list of output shapes (one shape for each output).
In this case, there is a single output with the same shape as the input.

.. tabs::

    .. tab:: Keras v2

        .. code-block:: Python

            # Parser for converter
            def parse_reverse_layer(keras_layer, input_names, input_shapes, data_reader):
                layer = {}
                layer['class_name'] = 'KReverse'
                layer['name'] = keras_layer['config']['name']
                layer['n_in'] = input_shapes[0][1]

                if input_names is not None:
                    layer['inputs'] = input_names

                return layer, [shape for shape in input_shapes[0]]

    .. tab:: Keras v3

        .. code-block:: Python

            from hls4ml.converters.keras_v3._base import register, KerasV3LayerHandler

            @register
            class KReverseHandler(KerasV3LayerHandler):
                '''Keras v3 layer handler for KReverse'''

                handles = ('KReverse',)

                def handle(
                    self,
                    layer: 'keras.Layer',
                    in_tensors: Sequence['KerasTensor'],
                    out_tensors: Sequence['KerasTensor'],
                ) -> dict[str, Any] | tuple[dict[str, Any], ...]:
                    # Only layer-specific parameters are needed.
                    # Common parameters are automatically added in the base class.
                    assert len(in_tensors[0].shape) == 2, 'KReverse is only supported for 2D tensors'
                    return {'n_in': in_tensors[0].shape[-1]}

Review comment: This part only results in empty tabs for both Keras v2 and v3. The Python code examples are both shown, but disappear completely once you click on the tabs.

Next, we need the actual HLS implementation of the function, which can be written in a header file ``nnet_reverse.h``.
@@ -140,33 +165,33 @@

In this case, the HLS code is valid for both the Vivado and Quartus backends.

.. code-block:: Python

    # Register the converter for custom Keras layer
    hls4ml.converters.register_keras_v2_layer_handler('KReverse', parse_reverse_layer)
    # For Keras v3, use register on a subclassed KerasV3LayerHandler from hls4ml.converters.keras_v3._base instead

    # Register the hls4ml's IR layer
    hls4ml.model.layers.register_layer('KReverse', HReverse)

    for backend_id in ['Vivado', 'Quartus']:
        # Register the optimization passes (if any)
        backend = hls4ml.backends.get_backend(backend_id)
        backend.register_pass('remove_duplicate_reverse', RemoveDuplicateReverse, flow=f'{backend_id.lower()}:optimize')

        # Register template passes for the given backend
        backend.register_template(HReverseConfigTemplate)
        backend.register_template(HReverseFunctionTemplate)

        # Register HLS implementation
        backend.register_source('/path/to/your/nnet_reverse.h')

Review comment: I think we can make these blocks have tabs for v2 and v3.
Review comment: Would be nice to add the tabs here too once they are figured out.

Finally, we can actually test the ``hls4ml`` custom layer compared to the Keras one.

.. code-block:: Python

    # Test if it works
    kmodel = keras.models.Sequential(
        [
            keras.layers.Input(shape=(8,)),
            KReverse(),
            keras.layers.ReLU(),
        ]
    )
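A sketch of the remaining test steps is shown below; the output directory, part, and comparison tolerance are placeholders, and the integer-valued inputs are chosen so the fixed-point result should match the Keras output.

.. code-block:: Python

    import numpy as np
    import hls4ml

    # Sketch: convert the model with the registered custom layer and compare outputs.
    hconfig = hls4ml.utils.config_from_keras_model(kmodel, granularity='name')
    hmodel = hls4ml.converters.convert_from_keras_model(
        kmodel,
        hls_config=hconfig,
        output_dir='test_kreverse',  # placeholder output directory
        backend='Vivado',
        io_type='io_parallel',
    )
    hmodel.compile()

    # Integer inputs avoid rounding differences between float and fixed point.
    x = np.random.randint(-8, 8, (16, 8)).astype('float32')
    kres = np.array(kmodel(x))
    hres = hmodel.predict(x).reshape(kres.shape)
    np.testing.assert_allclose(hres, kres, rtol=0, atol=1e-2)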
@@ -1,49 +1,71 @@

====================================
High Granularity Quantization (HGQ2)
====================================

.. note::
   New projects are encouraged to use `HGQ2 <../hgq2.html>`_ instead of the original `HGQ <../hgq.html>`_.
   HGQ2 extends the original HGQ with more supported layers and more quantizer options, and is built on top of Keras v3, which can be used natively with the JAX, PyTorch, and TensorFlow backends.

.. image:: https://img.shields.io/badge/License-LGPLv3-blue.svg
   :target: https://www.gnu.org/licenses/lgpl-3.0.en.html
.. image:: https://github.com/calad0i/HGQ2/actions/workflows/sphinx-build.yml/badge.svg
   :target: https://calad0i.github.io/HGQ2/
.. image:: https://badge.fury.io/py/hgq2.svg
   :target: https://badge.fury.io/py/hgq2
.. image:: https://img.shields.io/badge/arXiv-2405.00645-b31b1b.svg
   :target: https://arxiv.org/abs/2405.00645

HGQ2 (High Granularity Quantization 2) is a quantization-aware training framework built on Keras v3, targeting real-time deep learning applications on edge devices like FPGAs. It provides a comprehensive set of tools for creating and training quantized neural networks with minimal effort.

HGQ2 implements a gradient-based automatic bitwidth optimization and quantization-aware training algorithm. By leveraging gradients, it allows for bitwidth optimization at arbitrary granularity, up to the per-weight and per-activation level.

.. rst-class:: light
.. image:: _static/hgq-overview.svg
   :alt: HGQ overview
   :width: 600

Key Features
------------

- **Multi-backend support**: Works with TensorFlow, JAX, and PyTorch through Keras v3
- **Flexible quantization**: Supports different quantization schemes, including fixed-point and minifloat
- **Hardware synthesis**: Direct integration with hls4ml for FPGA deployment
- **Trainable quantization parameters**: Optimizes bitwidths through gradient-based methods
- **Effective Bit-Operations (EBOPs)**: Accurate resource estimation of the deployed firmware during training
- **Advanced layer support**: Einsum, EinsumDense, and multi-head attention layers with quantization and hardware synthesis support

.. code-block:: python
   :caption: Simple example

   import keras
   from hgq.layers import QDense, QConv2D
   from hgq.config import LayerConfigScope, QuantizerConfigScope

   # Set up the quantization configuration
   # These values are the defaults, shown here for demonstration purposes
   with (
       # Configuration scopes for setting the default quantization type and overflow mode;
       # the second scope overrides the first one for the 'datalane' place
       QuantizerConfigScope(place='all', default_q_type='kbi', overflow_mode='SAT_SYM'),
       QuantizerConfigScope(place='datalane', default_q_type='kif', overflow_mode='WRAP'),
       # Configuration scope for enabling EBOPs and setting the beta0 value
       LayerConfigScope(enable_ebops=True, beta0=1e-5),
   ):
       model = keras.Sequential([
           QConv2D(32, (3, 3), activation='relu'),
           keras.layers.MaxPooling2D((2, 2)),
           keras.layers.Flatten(),
           QDense(10),
       ])

   ...  # Training, evaluation, and anything else you want to do with the model

   model_hls = hls4ml.converters.convert_from_keras(model, ...)
   # Model-wise precision propagation is done automatically for HGQ models for bit-exactness
   # Do NOT pass a precision config unless you know what you are doing

   model_hls.compile()

.. note::
   Do not pass any precision configuration to ``hls4ml.converters.convert_from_keras`` in general. HGQ-defined models will invoke model-wise precision propagation automatically to ensure bit-exactness between the Keras model and the generated HLS code (see `here <./precision.html>`__ for more details).
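As a quick sanity check, the converted model can be compared against the Keras model directly. The sketch below assumes ``x_test`` is a batch of inputs with the model's input shape and that the model was built and converted as above.

.. code-block:: python

   import numpy as np

   # For a properly quantized HGQ2 model, the HLS model output should match
   # the Keras model output (both evaluated in float here).
   y_keras = np.array(model(x_test))
   y_hls = model_hls.predict(x_test).reshape(y_keras.shape)
   np.testing.assert_array_equal(y_keras, y_hls)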
@@ -0,0 +1,52 @@

===================================
High Granularity Quantization (HGQ)
===================================

Review comment: Can we put both of these under a subsection of "Advanced"? Another option is to introduce a new section called "Concepts" and put the subsection there. And if the advice for new code is HGQ2, maybe add a note here and mark this one as (deprecated) in the title.
Review comment: I agree.

.. note::
   While still supported and maintained, HGQ is deprecated in favor of `HGQ2 <../hgq2.html>`_. New projects are strongly encouraged to use HGQ2 instead.

.. image:: https://github.com/calad0i/HGQ/actions/workflows/sphinx-build.yml/badge.svg
   :target: https://calad0i.github.io/HGQ/
.. image:: https://badge.fury.io/py/hgq.svg
   :target: https://badge.fury.io/py/hgq
.. image:: https://img.shields.io/badge/arXiv-2405.00645-b31b1b.svg
   :target: https://arxiv.org/abs/2405.00645

`High Granularity Quantization (HGQ) <https://github.com/calad0i/HGQ/>`_ is a library that performs gradient-based automatic bitwidth optimization and quantization-aware training for neural networks to be deployed on FPGAs. By leveraging gradients, it allows for bitwidth optimization at arbitrary granularity, up to the per-weight and per-activation level.

.. image:: https://calad0i.github.io/HGQ/_images/overview.svg
   :alt: Overview of HGQ
   :align: center

Conversion of models made with the HGQ library is fully supported. The HGQ models are first converted to the proxy model format, which can then be parsed by hls4ml bit-accurately. Below is an example of how to create a model with HGQ and convert it to an hls4ml model.

.. code-block:: Python

   import keras
   from HGQ.layers import HDense, HDenseBatchNorm, HQuantize
   from HGQ import ResetMinMax, FreeBOPs

   model = keras.models.Sequential([
       HQuantize(beta=1.e-5),
       HDenseBatchNorm(32, beta=1.e-5, activation='relu'),
       HDenseBatchNorm(32, beta=1.e-5, activation='relu'),
       HDense(10, beta=1.e-5),
   ])

   opt = keras.optimizers.Adam(learning_rate=0.001)
   loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
   model.compile(optimizer=opt, loss=loss, metrics=['accuracy'])
   callbacks = [ResetMinMax(), FreeBOPs()]

   model.fit(..., callbacks=callbacks)

   from HGQ import trace_minmax, to_proxy_model
   from hls4ml.converters import convert_from_keras_model

   trace_minmax(model, x_train, cover_factor=1.0)
   proxy = to_proxy_model(model, aggressive=True)

   model_hls = convert_from_keras_model(proxy, backend='vivado', output_dir=..., part=...)

An interactive example of HGQ can be found in the `kaggle notebook <https://www.kaggle.com/code/calad0i/small-jet-tagger-with-hgq-1>`_. Full documentation can be found at `calad0i.github.io/HGQ <https://calad0i.github.io/HGQ/>`_.
@@ -0,0 +1,23 @@

==============================
Model-wise Precision Inference
==============================

The model-wise precision inference (implemented in :py:class:`~hls4ml.model.optimizer.passes.bit_exact.BitExact`) attempts to infer the appropriate configuration for **all** precisions in the model. Unlike the automatic precision inference, this pass disregards all user-defined precision and trusts only the data embedded in the model, i.e., the actual values of the weights and the explicit quantizers defined between layers.

Review comment: trusts?
Review comment: Meaning only those are used as the source of precision derivation.

This pass uses a modified symbolic interval arithmetic to compute the ranges and the needed quantization steps for all precisions in the model graph, with the goal of eliminating any discrepancy between the quantized model and the original model. In the inference process, only the raw weight values and the explicit quantizers (either ``FixedPointQuantizer``, or ``linear/relu`` layers with ``trusted=True``) are considered as sources of precision information. All other precision information (e.g., user-defined precision in ``config_from_*`` functions) will not be used in the inference process.

Review comment: Without a?

Invocation of this pass is configured by the ``bit_exact`` key in the backend configuration (default: ``None``). There are two ways to enable this pass (see the sketch after this list):

- When converting from ``HGQ/HGQ2`` models, this pass is automatically enabled unless ``bit_exact`` is explicitly set to ``False``.
- For other models, this pass can be enabled by setting ``bit_exact`` to ``True``. Currently, only ``QKeras`` sets this key automatically when converting from ``QKeras`` models. Support for ``QONNX`` is planned but not yet implemented.
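A minimal sketch for the second case is shown below. It assumes ``bit_exact`` is forwarded to the backend configuration as a keyword argument of the converter, as the description above suggests; the exact plumbing may differ in your hls4ml version.

.. code-block:: Python

    import hls4ml

    # Sketch: explicitly enable model-wise precision inference for a non-HGQ model.
    # 'bit_exact' is assumed to be accepted as a backend configuration key here.
    hls_model = hls4ml.converters.convert_from_keras_model(
        model,
        backend='Vitis',
        output_dir='prj_bit_exact',
        bit_exact=True,
    )
    hls_model.compile()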
If the original model is not properly quantized, this pass will lead to huge bitwidths in the model. In this context, properly quantized models are those that have quantizers defined between **all layers with non-trivial arithmetic**. The successful application of this pass should result in a bit-exact model, i.e., the quantized model should produce the same outputs as the original model for all inputs [*]_.

Review comment: Should we make explicit what non-trivial arithmetic means?
Review comment: Essentially everything except reshape, flatten, and linear (why would you put them at the input, though?). Will add.
Review comment: You haven't been in hls4ml long enough 😉. People regularly put reshape or flatten as the first layer in order to simplify IP integration. And yes, I agree with what you're thinking reading that sentence. Let's not comment on this practice further, just make the docs clear.

Not all operator types are supported in this pass. If any unsupported operator is encountered during the inference, this pass will **crash** the conversion process to prevent silent failures. Please consider using the `automatic precision inference <../auto.html>`_ if your model contains unsupported operators or unquantized components.

.. warning::
   Importantly, quantizers **should be used immediately after the inputs**, or the input precision may not be properly inferred. If you are using ``HGQ/HGQ2``, this is automatically taken care of in most cases. If you are using ``QKeras``, make sure to put a ``QActivation`` with ``quantized_bits`` right after the input layer so that the input precision can be derived.
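For ``QKeras`` users, a minimal sketch of the recommended pattern is shown below; the layer sizes and bitwidths are arbitrary placeholders.

.. code-block:: Python

    from tensorflow import keras
    from qkeras import QActivation, QDense, quantized_bits, quantized_relu

    inp = keras.layers.Input(shape=(16,))
    # Quantize immediately after the input so the input precision can be derived
    x = QActivation(quantized_bits(8, 0))(inp)
    x = QDense(32, kernel_quantizer=quantized_bits(6, 0, alpha=1))(x)
    x = QActivation(quantized_relu(6))(x)
    out = QDense(5, kernel_quantizer=quantized_bits(6, 0, alpha=1))(x)
    model = keras.Model(inp, out)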
.. [*] While quantized, the original model will still operate on floating-point values, so there is a chance that the outputs will not be exactly the same due to float rounding errors in the original model.

.. note::
   Unlike with the automatic precision inference, it is strongly recommended **not** to use the ``config_from_*`` functions to set the precisions in the model. Automatic precision inference and this pass cannot be used simultaneously.

Review comment: Can we make this statement more specific? It sounds like users should never use config_from_*?
Review comment: Should also update CITATION.cff.