Release 0.10
Release Notes
- AMD Quark for PyTorch
- New Features
- Support PyTorch 2.7.1 and 2.8.0.
- Support for int3 quantization and exporting of models.
- Support the AWQ algorithm with Gemma3 and Phi4.
- Support the Qronos advanced post-training quantization algorithm. Please refer to the arXiv paper and the Quark documentation.
- Applying the GPTQ algorithm runs 3x-4x faster compared to AMD Quark 0.9, using CUDA/HIP Graph by default. If required, CUDA Graph for GPTQ can be disabled with the environment variable `QUARK_GRAPH_DEBUG=0`, as sketched below.
- The QuaRot algorithm supports a new configuration parameter `rotation_size` to define custom Hadamard rotation sizes. Please refer to the QuaRotConfig documentation.
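A minimal sketch of both knobs follows. The environment variable is the one documented above; the QuaRotConfig import path and constructor shown here are assumptions, so check the QuaRotConfig documentation for the exact module and signature.

```python
import os

# GPTQ uses CUDA/HIP Graph by default in 0.10; setting QUARK_GRAPH_DEBUG=0
# (here, or in the shell) falls back to the non-graph execution path.
os.environ["QUARK_GRAPH_DEBUG"] = "0"

# Assumed import path and usage -- only the rotation_size parameter itself is
# documented in these notes; see the QuaRotConfig documentation.
# from quark.torch.quantization.config.config import QuaRotConfig
# quarot_config = QuaRotConfig(rotation_size=128)  # custom Hadamard rotation size
```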
- QuantizationSpec check:
  - Every time a user finishes initializing a `QuantizationSpec`, a configuration check runs automatically. If an invalid configuration is supplied, a warning or error message is shown so the user can correct it. This surfaces potential errors as early as possible instead of causing a runtime error later in the quantization process; the behavior is sketched below.
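A minimal sketch of what the early check is meant to catch; the import paths and the specific inconsistency (a per-group spec without a group size) are assumptions for illustration, not the documented check list.

```python
# Assumed import paths; adjust to your installed Quark version.
from quark.torch.quantization.config.config import QuantizationSpec
from quark.torch.quantization.config.type import Dtype, QSchemeType

# A per-group spec without a group_size is internally inconsistent. With the
# new check, Quark warns or errors here, at construction time, rather than
# failing later inside the quantization run.
spec = QuantizationSpec(
    dtype=Dtype.int4,
    qscheme=QSchemeType.per_group,
    # group_size=128,  # <- deliberately omitted to trigger the config check
)
```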
- LLM Depth-Wise Pruning tool:
  - A depth-wise pruning tool that decreases LLM model size by deleting consecutive decoder layers according to a user-supplied pruning ratio.
  - Consecutive layers that have the least influence on perplexity (PPL) are regarded as having the least influence on the LLM and can be deleted; the idea is sketched below.
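The selection step can be pictured with a small, self-contained sketch (not the tool's API): given a way to measure PPL with a window of consecutive decoder layers removed, keep the window whose removal hurts PPL the least.

```python
# Illustrative only: pick which consecutive decoder layers to delete based on
# perplexity (PPL) influence, mirroring the idea behind the depth-wise pruning tool.
def pick_layers_to_prune(num_layers: int, pruning_ratio: float, ppl_after_drop) -> list[int]:
    """ppl_after_drop(start, length) evaluates the model's PPL with layers
    [start, start + length) removed; it is supplied by the caller."""
    length = max(1, int(num_layers * pruning_ratio))
    candidates = range(num_layers - length + 1)
    # The window whose removal yields the lowest PPL has the least influence.
    best_start = min(candidates, key=lambda start: ppl_after_drop(start, length))
    return list(range(best_start, best_start + length))

# Toy usage with a fake PPL oracle (middle layers matter least in this toy).
fake_ppl = lambda start, length: 12.0 + abs(start - 16) * 0.1
print(pick_layers_to_prune(num_layers=32, pruning_ratio=0.25, ppl_after_drop=fake_ppl))
```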
- Model Support:
- Support OCP MXFP4, MXFP6, MXFP8 quantization of new models: DeepSeek-R1, Llama4-Scout, Llama4-Maverick, gpt-oss-20b, gpt-oss-120b.
- Deprecations and breaking changes
  - The OCP MXFP6 weight packing layout is modified to match the layout expected by the CDNA4 `mfma_scale` instruction.
  - In the `examples/language_modeling/llm_ptq/quantize_quark.py` example, the quantization scheme `w_mxfp4_a_mxfp6` is removed and replaced by `w_mxfp4_a_mxfp6_e2m3` and `w_mxfp4_a_mxfp6_e3m2`.
- Important bug fixes
- AMD Quark for ONNX
- New Features:
- API Refactor (introduced the new API design with improved consistency and usability):
- Supported class-based algorithm usage.
- Aligned data types across Quark Torch and Quark ONNX.
- Refactored quantization configs.
- Auto Search Enhancements
- Two-Stage Search: First identifies the best calibration config, then searches for the optimal FastFinetune config based on it. Expands the search space for higher efficiency.
- Advanced-Fastft Search: Supports continuous search spaces, advanced algorithms (e.g., TPE), and parallel execution for faster, smarter searching.
- Joint-Parameter Search: Combines coupled parameters into a unified space to avoid ineffective configurations and improve search quality.
- Added support for ONNX 1.19 and ONNXRuntime 1.22.1.
- Added optimized weight-scale calculation with the MinMSE method to improve quantization accuracy.
- Accelerated calibration with multi-process support, covering algorithms such as MinMSE, Percentile, Entropy, Distribution, and LayerwisePercentile.
- Added progress bars for the Percentile, Entropy, Distribution, and LayerwisePercentile algorithms.
- Allowed users to specify a directory for saving cache files.
- Enhancements:
- Significantly reduced memory usage across various configurations, including calibration and FastFinetune stages, with optimizations for both CPU and GPU memory.
- Improved clarity of error and warning outputs, helping users select better parameters based on memory and disk conditions.
- Bug fixes and minor improvements:
- Provided actionable hints when OOM or insufficient disk space issues occur in calibration and fast fine-tuning.
- Fixed multi-GPU issues during FastFinetune.
- Fixed a bug related to converting BatchNorm to Conv.
- Fixed a bug in BF16 conversion on models larger than 2GB.
- Quark Torch API Refactor
- LLMTemplate for simplified quantization configuration:
  - Introduced the `LLMTemplate` class for convenient LLM quantization configuration.
  - Built-in templates for popular LLM architectures (Llama4, Qwen, Mistral, Phi, DeepSeek, GPT-OSS, etc.).
  - Support for multiple quantization schemes: int4/uint4 (group sizes 32, 64, 128), int8, fp8, mxfp4, mxfp6e2m3, mxfp6e3m2, bfp16, mx6.
  - Advanced features: layer-wise quantization, KV cache quantization, attention quantization.
  - Algorithm support: AWQ, GPTQ, SmoothQuant, AutoSmoothQuant, Rotation.
  - Custom template and scheme registration capabilities for users to define their own templates and quantization schemes.
```python
from quark.torch import LLMTemplate

# List available templates
templates = LLMTemplate.list_available()
print(templates)  # ['llama', 'opt', 'qwen', 'mistral', ...]

# Get a specific template
llama_template = LLMTemplate.get("llama")

# Create a basic configuration
config = llama_template.get_config(scheme="fp8", kv_cache_scheme="fp8")
```
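The resulting config is meant to feed the usual Quark Torch PTQ flow; a minimal sketch, assuming the config object returned by `get_config` is accepted directly by `ModelQuantizer` and that `model` and `calib_dataloader` are already prepared:

```python
from quark.torch import ModelQuantizer

# Assumption: the LLMTemplate-produced config plugs into the standard PTQ flow.
quantizer = ModelQuantizer(config)
quantized_model = quantizer.quantize_model(model, calib_dataloader)
```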
- Export and import APIs are deprecated in favor of new ones:
- `ModelExporter.export_safetensors_model` is deprecated in favor of `export_safetensors`:

Before:

```python
from quark.torch import ModelExporter
from quark.torch.export.config.config import ExporterConfig, JsonExporterConfig

export_config = ExporterConfig(json_export_config=JsonExporterConfig())
exporter = ModelExporter(config=export_config, export_dir=export_dir)
exporter.export_safetensors_model(model, quant_config)
```

After:

```python
from quark.torch import export_safetensors

export_safetensors(model, output_dir=export_dir)
```
- `ModelImporter.import_model_info` is deprecated in favor of `import_model_from_safetensors`:

Before:

```python
from quark.torch.export.api import ModelImporter

model_importer = ModelImporter(
    model_info_dir=export_dir,
    saved_format="safetensors"
)
quantized_model = model_importer.import_model_info(original_model)
```

After:

```python
from quark.torch import import_model_from_safetensors

quantized_model = import_model_from_safetensors(
    original_model,
    model_dir=export_dir
)
```
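Taken together, the new APIs give a compact export/import round trip; a sketch assuming a quantized `model`, the original float `original_model`, and an `export_dir` path are already in scope:

```python
from quark.torch import export_safetensors, import_model_from_safetensors

# Save the quantized model in safetensors format...
export_safetensors(model, output_dir=export_dir)

# ...then later reload the quantized weights onto the original model.
quantized_model = import_model_from_safetensors(original_model, model_dir=export_dir)
```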
- Quark ONNX API Refactor
- Before:
- Basic Usage:

```python
from quark.onnx import ModelQuantizer
from quark.onnx.quantization.config.config import Config
from quark.onnx.quantization.config.custom_config import get_default_config

input_model_path = "demo.onnx"
quantized_model_path = "demo_quantized.onnx"
calib_data_path = "calib_data"
# Calibration data reader defined elsewhere in the example.
calib_data_reader = ImageDataReader(calib_data_path)

a8w8_config = get_default_config("A8W8")
quantization_config = Config(global_quant_config=a8w8_config)
quantizer = ModelQuantizer(quantization_config)
quantizer.quantize_model(input_model_path, quantized_model_path, calib_data_reader)
```
- Advanced Usage:
from quark.onnx import ModelQuantizer
from quark.onnx.quantization.config.config import Config, QuantizationConfig
from onnxruntime.quantization.calibrate import CalibrationMethod
from onnxruntime.quantization.quant_utils import QuantFormat, QuantType, ExtendedQuantType
input_model_path = "demo.onnx"
quantized_model_path = "demo_quantized.onnx"
calib_data_path = "calib_data"
calib_data_reader = ImageDataReader(calib_data_path)
DEFAULT_ADAROUND_PARAMS = {
"DataSize": 1000,
"FixedSeed": 1705472343,
"BatchSize": 2,
"NumIterations": 1000,
"LearningRate": 0.1,
"OptimAlgorithm": "adaround",
"OptimDevice": "cpu",
"InferDevice": "cpu",
"EarlyStop": True,
}
quant_config = QuantizationConfig(
calibrate_method=CalibrationMethod.Percentile,
quant_format=QuantFormat.QDQ,
activation_type=QuantType.QInt8,
weight_type=QuantType.QInt8,
nodes_to_exclude=["/layer.2/Conv_1", "^/Conv/.*"],
subgraphs_to_exclude=[(["start_node_1", "start_node_2"], ["end_node_1", "end_node_2"])],
include_cle=True,
include_fast...