Release Notes

Release 0.10

  • AMD Quark for PyTorch

    • New Features

      • Support PyTorch 2.7.1 and 2.8.0.
      • Support int3 quantization and export of int3-quantized models.
      • Support the AWQ algorithm with Gemma3 and Phi4.
      • Applying the GPTQ algorithm runs 3-4x faster than in AMD Quark 0.9, using CUDA/HIP Graph by default. If required, CUDA Graph for GPTQ can be disabled with the environment variable QUARK_GRAPH_DEBUG=0.
      • The QuaRot algorithm supports a new configuration parameter rotation_size to define custom Hadamard rotation sizes (see the sketch after this list). Please refer to the QuaRotConfig documentation.
      • Support the Qronos post-training quantization algorithm. Please refer to the arXiv paper and the Quark documentation.
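
      As a rough illustration of the new rotation_size parameter, the sketch below configures QuaRot with a custom Hadamard rotation size. The import paths and the algo_config field are assumptions modeled on Quark's other algorithm configs, not confirmed API; consult the QuaRotConfig documentation for the exact usage.

            # Illustrative sketch only: import paths and fields are assumptions.
            from quark.torch.quantization.config.config import Config  # assumed path
            from quark.torch.quantization.config.algo_config import QuaRotConfig  # assumed path

            # rotation_size sets the size of the Hadamard rotation blocks applied by
            # QuaRot; it must divide the hidden dimension being rotated. 128 is a
            # hypothetical value.
            algo_config = QuaRotConfig(rotation_size=128)

            # quant_config is a QuantizationConfig defined elsewhere by the user.
            config = Config(global_quant_config=quant_config, algo_config=algo_config)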
    • QuantizationSpec check:

      • Whenever a QuantizationSpec finishes initialization, its configuration is now checked automatically. If an invalid configuration is supplied, a warning or error message is reported so the user can correct it, catching potential mistakes as early as possible instead of letting them surface as runtime errors during quantization.
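
      For illustration, the spec below follows the field layout used in Quark's published examples; the inconsistent variant in the closing comment shows the kind of mistake the new check is meant to surface at construction time (the exact warning/error behavior is illustrative).

            from quark.torch.quantization.config.type import Dtype, QSchemeType, ScaleType, RoundType
            from quark.torch.quantization.config.config import QuantizationSpec
            from quark.torch.quantization.observer.observer import PerGroupMinMaxObserver

            # A consistent per-group spec: the new init-time check passes silently.
            spec = QuantizationSpec(
                dtype=Dtype.uint4,
                observer_cls=PerGroupMinMaxObserver,
                symmetric=False,
                scale_type=ScaleType.float,
                round_method=RoundType.half_even,
                qscheme=QSchemeType.per_group,
                ch_axis=1,
                is_dynamic=False,
                group_size=128,
            )

            # An inconsistent spec, e.g. qscheme=QSchemeType.per_group without a
            # group_size, is now flagged with a warning or error at construction
            # time rather than failing later during quantization.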
    • LLM Depth-Wise Pruning tool:

      • Added a depth-wise pruning tool that reduces LLM model size by deleting consecutive decoder layers according to a user-supplied pruning ratio.
      • Candidate layers are ranked by their influence on perplexity (PPL): the block of consecutive layers with the least impact on PPL is considered least important to the model and is deleted. A schematic sketch of this selection follows below.
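
      The selection criterion can be sketched as follows. This is a schematic illustration of the idea only, not Quark's implementation; model.layers and the ppl() helper are placeholders for the model's decoder stack and a perplexity evaluation on a held-out set.

            import torch.nn as nn

            def find_block_to_prune(model, ppl, num_layers_to_drop: int) -> int:
                """Return the start index of the consecutive decoder-layer block
                whose removal increases perplexity the least."""
                layers = list(model.layers)
                best_start, best_ppl = 0, float("inf")
                for start in range(len(layers) - num_layers_to_drop + 1):
                    kept = layers[:start] + layers[start + num_layers_to_drop:]
                    model.layers = nn.ModuleList(kept)    # temporarily drop the block
                    candidate_ppl = ppl(model)            # PPL without this block
                    if candidate_ppl < best_ppl:
                        best_start, best_ppl = start, candidate_ppl
                    model.layers = nn.ModuleList(layers)  # restore the full stack
                return best_start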
    • Model Support:

      • Support OCP MXFP4, MXFP6, MXFP8 quantization of new models: DeepSeek-R1, Llama4-Scout, Llama4-Maverick, gpt-oss-20b, gpt-oss-120b.
    • Deprecations and breaking changes

      • The OCP MXFP6 weight packing layout was modified to match the layout expected by the CDNA4 mfma_scale instruction.

      • In the examples/language_modeling/llm_ptq/quantize_quark.py example, the quantization scheme w_mxfp4_a_mxfp6 is removed and replaced by w_mxfp4_a_mxfp6_e2m3 and w_mxfp4_a_mxfp6_e3m2.

    • Important bug fixes

      • Fixed a bug in the QuaRot and Rotation algorithms where fused rotations were wrongly applied twice to input embeddings / LM head weights.

      • Reduced the slowness of reloading large quantized models such as DeepSeek-R1 with Transformers + Quark.

  • AMD Quark for ONNX

    • New Features:

      • API Refactor: introduced a new API design with improved consistency and usability.

        • Supported class-based algorithm usage.
        • Aligned data types between Quark Torch and Quark ONNX.
        • Refactored quantization configs.
      • Auto Search Enhancements

        • Two-Stage Search: First identifies the best calibration config, then searches for the optimal FastFinetune config based on it. Expands the search space for higher efficiency.
        • Advanced-Fastft Search: Supports continuous search spaces, advanced algorithms (e.g., TPE), and parallel execution for faster, smarter searching.
        • Joint-Parameter Search: Combines coupled parameters into a unified space to avoid ineffective configurations and improve search quality.
      • Added support for ONNX 1.19 and ONNXRuntime 1.22.1.

      • Added optimized weight-scale calculation with the MinMSE method to improve quantization accuracy (a schematic sketch follows after this list).

      • Accelerated calibration with multi-process support, covering algorithms such as MinMSE, Percentile, Entropy, Distribution, and LayerwisePercentile.

      • Added progress bars for Percentile, Entropy, Distribution, and LayerwisePercentile algorithms.

      • Allowed users to specify a directory for saving cache files.
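
      As a rough illustration of the MinMSE idea referenced above, the sketch below grid-searches a weight scale that minimizes the quantization mean-squared error for a single tensor. It is schematic only, not Quark's implementation.

           import numpy as np

           def minmse_weight_scale(w: np.ndarray, num_bits: int = 8, num_candidates: int = 100) -> float:
               """Try shrunken versions of the max-abs scale and keep the one that
               minimizes the MSE between the tensor and its quantized reconstruction."""
               qmax = 2 ** (num_bits - 1) - 1
               base_scale = float(np.abs(w).max()) / qmax
               best_scale, best_mse = base_scale, np.inf
               for i in range(1, num_candidates + 1):
                   scale = base_scale * i / num_candidates
                   q = np.clip(np.round(w / scale), -qmax - 1, qmax)
                   mse = float(np.mean((q * scale - w) ** 2))
                   if mse < best_mse:
                       best_scale, best_mse = scale, mse
               return best_scale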

    • Enhancements:

      • Significantly reduced memory usage across various configurations, including calibration and FastFinetune stages, with optimizations for both CPU and GPU memory.
      • Improved clarity of error and warning outputs, helping users select better parameters based on memory and disk conditions.
    • Bug fixes and minor improvements:

      • Provided actionable hints when OOM or insufficient disk space issues occur in calibration and fast fine-tuning.
      • Fixed multi-GPU issues during FastFinetune.
      • Fixed a bug related to converting BatchNorm to Conv.
      • Fixed a bug in BF16 conversion on models larger than 2GB.
  • Quark Torch API Refactor

    • LLMTemplate for simplified quantization configuration:

      • Introduced the LLMTemplate class for convenient LLM quantization configuration
      • Built-in templates for popular LLM architectures (Llama4, Qwen, Mistral, Phi, DeepSeek, GPT-OSS, etc.)
      • Support for multiple quantization schemes: int4/uint4 (group sizes 32, 64, 128), int8, fp8, mxfp4, mxfp6e2m3, mxfp6e3m2, bfp16, mx6
      • Advanced features: layer-wise quantization, KV cache quantization, attention quantization
      • Algorithm support: AWQ, GPTQ, SmoothQuant, AutoSmoothQuant, Rotation
      • Custom template and scheme registration, letting users define their own templates and quantization schemes
            from quark.torch import LLMTemplate

            # List available templates
            templates = LLMTemplate.list_available()
            print(templates)  # ['llama', 'opt', 'qwen', 'mistral', ...]

            # Get a specific template
            llama_template = LLMTemplate.get("llama")

            # Create a basic configuration
            config = llama_template.get_config(scheme="fp8", kv_cache_scheme="fp8")
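
      The template-derived config can then be passed to Quark's quantizer in the usual way. A minimal sketch, assuming model is a loaded Hugging Face model and calib_dataloader yields calibration batches in the model's expected input format:

            from quark.torch import ModelQuantizer

            # Quantize using the config produced by the template above.
            quantizer = ModelQuantizer(config)
            quantized_model = quantizer.quantize_model(model, calib_dataloader)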
  • Export and import APIs are deprecated in favor of new ones:

    • ModelExporter.export_safetensors_model is deprecated in favor of export_safetensors:

      Before:

            from quark.torch import ModelExporter
            from quark.torch.export.config.config import ExporterConfig, JsonExporterConfig

            export_config = ExporterConfig(json_export_config=JsonExporterConfig())
            exporter = ModelExporter(config=export_config, export_dir=export_dir)
            exporter.export_safetensors_model(model, quant_config)
      After:
            from quark.torch import export_safetensors
            export_safetensors(model, output_dir=export_dir)
    • ModelImporter.import_model_info is deprecated in favor of import_model_from_safetensors:

      Before:
            from quark.torch.export.api import ModelImporter

            model_importer = ModelImporter(
               model_info_dir=export_dir,
               saved_format="safetensors"
            )
            quantized_model = model_importer.import_model_info(original_model)
      After:
            from quark.torch import import_model_from_safetensors
            quantized_model = import_model_from_safetensors(
               original_model,
               model_dir=export_dir
            )
  • Quark ONNX API Refactor

    • Before:

      • Basic Usage:
           from quark.onnx import ModelQuantizer
           from quark.onnx.quantization.config.config import Config
           from quark.onnx.quantization.config.custom_config import get_default_config

           input_model_path = "demo.onnx"
           quantized_model_path = "demo_quantized.onnx"
           calib_data_path = "calib_data"
           # ImageDataReader is a user-defined calibration data reader
           calib_data_reader = ImageDataReader(calib_data_path)

           a8w8_config = get_default_config("A8W8")
           quantization_config = Config(global_quant_config=a8w8_config)
           quantizer = ModelQuantizer(quantization_config)
           quantizer.quantize_model(input_model_path, quantized_model_path, calib_data_reader)
      • Advanced Usage:
           from quark.onnx import ModelQuantizer
           from quark.onnx.quantization.config.config import Config, QuantizationConfig
           from onnxruntime.quantization.calibrate import CalibrationMethod
           from onnxruntime.quantization.quant_utils import QuantFormat, QuantType, ExtendedQuantType

           input_model_path = "demo.onnx"
           quantized_model_path = "demo_quantized.onnx"
           calib_data_path = "calib_data"
           calib_data_reader = ImageDataReader(calib_data_path)

           DEFAULT_ADAROUND_PARAMS = {
               "DataSize": 1000,
               "FixedSeed": 1705472343,
               "BatchSize": 2,
               "NumIterations": 1000,
               "LearningRate": 0.1,
               "OptimAlgorithm": "adaround",
               "OptimDevice": "cpu",
               "InferDevice": "cpu",
               "EarlyStop": True,
           }

           quant_config = QuantizationConfig(
               calibrate_method=CalibrationMethod.Percentile,
               quant_format=QuantFormat.QDQ,
               activation_type=QuantType.QInt8,
               weight_type=QuantType.QInt8,
               nodes_to_exclude=["/layer.2/Conv_1", "^/Conv/.*"],
               subgraphs_to_exclude=[(["start_node_1", "start_node_2"], ["end_node_1", "end_node_2"])],
               include_cle=True,
               include_fast...