Release Notes

Release 0.10

  • AMD Quark for PyTorch

    • New Features

      • Support PyTorch 2.7.1 and 2.8.0.
      • Support int3 quantization and export of int3-quantized models.
      • Support the AWQ algorithm with Gemma3 and Phi4.
      • Applying the GPTQ algorithm runs 3-4x faster than in AMD Quark 0.9, using CUDA/HIP Graph by default. If required, CUDA Graph for GPTQ can be disabled with the environment variable QUARK_GRAPH_DEBUG=0.
      • The QuaRot algorithm supports a new configuration parameter rotation_size to define custom Hadamard rotation sizes (see the sketch after this list). Please refer to the QuaRotConfig documentation.
      • Support the Qronos post-training quantization algorithm. Please refer to the arXiv paper and the Quark documentation.
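
      As a rough illustration of the new rotation_size parameter, the sketch below configures QuaRot with a custom Hadamard rotation size. The import paths and the algo_config field are assumptions modeled on Quark's other algorithm configs, not confirmed API; consult the QuaRotConfig documentation for the exact usage.

            # Illustrative sketch only: import paths and fields are assumptions.
            from quark.torch.quantization.config.config import Config  # assumed path
            from quark.torch.quantization.config.algo_config import QuaRotConfig  # assumed path

            # rotation_size sets the size of the Hadamard rotation blocks applied by
            # QuaRot; it must divide the hidden dimension being rotated. 128 is a
            # hypothetical value.
            algo_config = QuaRotConfig(rotation_size=128)

            # quant_config is a QuantizationConfig defined elsewhere by the user.
            config = Config(global_quant_config=quant_config, algo_config=algo_config)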
    • QuantizationSpec check:

      • Whenever a QuantizationSpec finishes initialization, its configuration is now checked automatically. If an invalid configuration is supplied, a warning or error message is reported so the user can correct it, catching potential mistakes as early as possible instead of letting them surface as runtime errors during quantization.
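
      For illustration, the spec below follows the field layout used in Quark's published examples; the inconsistent variant in the closing comment shows the kind of mistake the new check is meant to surface at construction time (the exact warning/error behavior is illustrative).

            from quark.torch.quantization.config.type import Dtype, QSchemeType, ScaleType, RoundType
            from quark.torch.quantization.config.config import QuantizationSpec
            from quark.torch.quantization.observer.observer import PerGroupMinMaxObserver

            # A consistent per-group spec: the new init-time check passes silently.
            spec = QuantizationSpec(
                dtype=Dtype.uint4,
                observer_cls=PerGroupMinMaxObserver,
                symmetric=False,
                scale_type=ScaleType.float,
                round_method=RoundType.half_even,
                qscheme=QSchemeType.per_group,
                ch_axis=1,
                is_dynamic=False,
                group_size=128,
            )

            # An inconsistent spec, e.g. qscheme=QSchemeType.per_group without a
            # group_size, is now flagged with a warning or error at construction
            # time rather than failing later during quantization.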
    • LLM Depth-Wise Pruning tool:

      • Added a depth-wise pruning tool that reduces LLM model size by deleting consecutive decoder layers according to a user-supplied pruning ratio.
      • Candidate layers are ranked by their influence on perplexity (PPL): the block of consecutive layers with the least impact on PPL is considered least important to the model and is deleted. A schematic sketch of this selection follows below.
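
      The selection criterion can be sketched as follows. This is a schematic illustration of the idea only, not Quark's implementation; model.layers and the ppl() helper are placeholders for the model's decoder stack and a perplexity evaluation on a held-out set.

            import torch.nn as nn

            def find_block_to_prune(model, ppl, num_layers_to_drop: int) -> int:
                """Return the start index of the consecutive decoder-layer block
                whose removal increases perplexity the least."""
                layers = list(model.layers)
                best_start, best_ppl = 0, float("inf")
                for start in range(len(layers) - num_layers_to_drop + 1):
                    kept = layers[:start] + layers[start + num_layers_to_drop:]
                    model.layers = nn.ModuleList(kept)    # temporarily drop the block
                    candidate_ppl = ppl(model)            # PPL without this block
                    if candidate_ppl < best_ppl:
                        best_start, best_ppl = start, candidate_ppl
                    model.layers = nn.ModuleList(layers)  # restore the full stack
                return best_start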
    • Model Support:

      • Support OCP MXFP4, MXFP6, MXFP8 quantization of new models: DeepSeek-R1, Llama4-Scout, Llama4-Maverick, gpt-oss-20b, gpt-oss-120b.
    • Deprecations and breaking changes

      • The OCP MXFP6 weight packing layout was modified to match the layout expected by the CDNA4 mfma_scale instruction.

      • In the examples/language_modeling/llm_ptq/quantize_quark.py example, the quantization scheme w_mxfp4_a_mxfp6 is removed and replaced by w_mxfp4_a_mxfp6_e2m3 and w_mxfp4_a_mxfp6_e3m2.

    • Important bug fixes

      • Fixed a bug in the QuaRot and Rotation algorithms where fused rotations were wrongly applied twice to input embeddings / LM head weights.

      • Reduced the slowness of reloading large quantized models such as DeepSeek-R1 with Transformers + Quark.

  • AMD Quark for ONNX

    • New Features:

      • API Refactor: introduced a new API design with improved consistency and usability.

        • Supported class-based algorithm usage.
        • Aligned data types between Quark Torch and Quark ONNX.
        • Refactored quantization configs.
      • Auto Search Enhancements

        • Two-Stage Search: First identifies the best calibration config, then searches for the optimal FastFinetune config based on it. Expands the search space for higher efficiency.
        • Advanced-Fastft Search: Supports continuous search spaces, advanced algorithms (e.g., TPE), and parallel execution for faster, smarter searching.
        • Joint-Parameter Search: Combines coupled parameters into a unified space to avoid ineffective configurations and improve search quality.
      • Added support for ONNX 1.19 and ONNXRuntime 1.22.1.

      • Added optimized weight-scale calculation with the MinMSE method to improve quantization accuracy (a schematic sketch follows after this list).

      • Accelerated calibration with multi-process support, covering algorithms such as MinMSE, Percentile, Entropy, Distribution, and LayerwisePercentile.

      • Added progress bars for Percentile, Entropy, Distribution, and LayerwisePercentile algorithms.

      • Allowed users to specify a directory for saving cache files.
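
      As a rough illustration of the MinMSE idea referenced above, the sketch below grid-searches a weight scale that minimizes the quantization mean-squared error for a single tensor. It is schematic only, not Quark's implementation.

           import numpy as np

           def minmse_weight_scale(w: np.ndarray, num_bits: int = 8, num_candidates: int = 100) -> float:
               """Try shrunken versions of the max-abs scale and keep the one that
               minimizes the MSE between the tensor and its quantized reconstruction."""
               qmax = 2 ** (num_bits - 1) - 1
               base_scale = float(np.abs(w).max()) / qmax
               best_scale, best_mse = base_scale, np.inf
               for i in range(1, num_candidates + 1):
                   scale = base_scale * i / num_candidates
                   q = np.clip(np.round(w / scale), -qmax - 1, qmax)
                   mse = float(np.mean((q * scale - w) ** 2))
                   if mse < best_mse:
                       best_scale, best_mse = scale, mse
               return best_scale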

    • Enhancements:

      • Significantly reduced memory usage across various configurations, including calibration and FastFinetune stages, with optimizations for both CPU and GPU memory.
      • Improved clarity of error and warning outputs, helping users select better parameters based on memory and disk conditions.
    • Bug fixes and minor improvements:

      • Provided actionable hints when OOM or insufficient disk space issues occur in calibration and fast fine-tuning.
      • Fixed multi-GPU issues during FastFinetune.
      • Fixed a bug related to converting BatchNorm to Conv.
      • Fixed a bug in BF16 conversion on models larger than 2GB.
  • Quark Torch API Refactor

    • LLMTemplate for simplified quantization configuration:

      • Introduced the LLMTemplate class for convenient LLM quantization configuration
      • Built-in templates for popular LLM architectures (Llama4, Qwen, Mistral, Phi, DeepSeek, GPT-OSS, etc.)
      • Support for multiple quantization schemes: int4/uint4 (group sizes 32, 64, 128), int8, fp8, mxfp4, mxfp6e2m3, mxfp6e3m2, bfp16, mx6
      • Advanced features: layer-wise quantization, KV cache quantization, attention quantization
      • Algorithm support: AWQ, GPTQ, SmoothQuant, AutoSmoothQuant, Rotation
      • Custom template and scheme registration, letting users define their own templates and quantization schemes
            from quark.torch import LLMTemplate

            # List available templates
            templates = LLMTemplate.list_available()
            print(templates)  # ['llama', 'opt', 'qwen', 'mistral', ...]

            # Get a specific template
            llama_template = LLMTemplate.get("llama")

            # Create a basic configuration
            config = llama_template.get_config(scheme="fp8", kv_cache_scheme="fp8")
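
      The template-derived config can then be passed to Quark's quantizer in the usual way. A minimal sketch, assuming model is a loaded Hugging Face model and calib_dataloader yields calibration batches in the model's expected input format:

            from quark.torch import ModelQuantizer

            # Quantize using the config produced by the template above.
            quantizer = ModelQuantizer(config)
            quantized_model = quantizer.quantize_model(model, calib_dataloader)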
  • Export and import APIs are deprecated in favor of new ones:

    • ModelExporter.export_safetensors_model is deprecated in favor of export_safetensors:

      Before:

            from quark.torch import ModelExporter
            from quark.torch.export.config.config import ExporterConfig, JsonExporterConfig

            export_config = ExporterConfig(json_export_config=JsonExporterConfig())
            exporter = ModelExporter(config=export_config, export_dir=export_dir)
            exporter.export_safetensors_model(model, quant_config)
      After:
            from quark.torch import export_safetensors
            export_safetensors(model, output_dir=export_dir)
    • ModelImporter.import_model_info is deprecated in favor of import_model_from_safetensors:

      Before:
            from quark.torch.export.api import ModelImporter

            model_importer = ModelImporter(
               model_info_dir=export_dir,
               saved_format="safetensors"
            )
            quantized_model = model_importer.import_model_info(original_model)
      After:
            from quark.torch import import_model_from_safetensors
            quantized_model = import_model_from_safetensors(
               original_model,
               model_dir=export_dir
            )
  • Quark ONNX API Refactor

    • Before:

      • Basic Usage:
           from quark.onnx import ModelQuantizer
           from quark.onnx.quantization.config.config import Config
           from quark.onnx.quantization.config.custom_config import get_default_config

           input_model_path = "demo.onnx"
           quantized_model_path = "demo_quantized.onnx"
           calib_data_path = "calib_data"
           # ImageDataReader is a user-defined calibration data reader
           calib_data_reader = ImageDataReader(calib_data_path)

           a8w8_config = get_default_config("A8W8")
           quantization_config = Config(global_quant_config=a8w8_config)
           quantizer = ModelQuantizer(quantization_config)
           quantizer.quantize_model(input_model_path, quantized_model_path, calib_data_reader)
      • Advanced Usage:
           from quark.onnx import ModelQuantizer
           from quark.onnx.quantization.config.config import Config, QuantizationConfig
           from onnxruntime.quantization.calibrate import CalibrationMethod
           from onnxruntime.quantization.quant_utils import QuantFormat, QuantType, ExtendedQuantType

           input_model_path = "demo.onnx"
           quantized_model_path = "demo_quantized.onnx"
           calib_data_path = "calib_data"
           calib_data_reader = ImageDataReader(calib_data_path)

           DEFAULT_ADAROUND_PARAMS = {
               "DataSize": 1000,
               "FixedSeed": 1705472343,
               "BatchSize": 2,
               "NumIterations": 1000,
               "LearningRate": 0.1,
               "OptimAlgorithm": "adaround",
               "OptimDevice": "cpu",
               "InferDevice": "cpu",
               "EarlyStop": True,
           }

           quant_config = QuantizationConfig(
               calibrate_method=CalibrationMethod.Percentile,
               quant_format=QuantFormat.QDQ,
               activation_type=QuantType.QInt8,
               weight_type=QuantType.QInt8,
               nodes_to_exclude=["/layer.2/Conv_1", "^/Conv/.*"],
               subgraphs_to_exclude=[(["start_node_1", "start_node_2"], ["end_node_1", "end_node_2"])],
               include_cle=True,
               include_fast...