Improve QAT int4 weight-only numerics #2986

andrewor14 · 2025-09-11T23:17:27Z

Summary: Similar to #2937, this commit improves the prepare vs convert SQNR of int4 weight-only QAT from 12 to 45. This is achieved by mimicking the numerics of the target FBGEMM bf16-int4 kernel more closely. In particular, the FBGEMM kernel:

1. Performs asymmetric [0, 15] quantization first, then recenters to 8
2. Uses a smaller scale eps of 1e-6 instead of bf16's eps (0.0078125)
3. Quantizes the weights using the min val instead of zero points
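
The three points above can be traced on a single weight group in plain Python. This is a minimal sketch, not the torchao or FBGEMM implementation: `fake_quantize_group` is a hypothetical helper, and the rounding mode (Python's round-half-to-even) is an assumption, since the PR description does not state how the kernel rounds.

```python
def fake_quantize_group(weights, eps=1e-6, qmax=15, recenter=8):
    """Fake-quantize one group of floats to int4 and back (illustration only).

    Returns (signed_codes, dequantized_values).
    """
    min_val = min(weights)
    max_val = max(weights)
    # A small eps (1e-6) only guards against a zero range; it does not
    # inflate the scale of groups with a small but nonzero range.
    scale = max(max_val - min_val, eps) / qmax

    signed_codes = []
    dequantized = []
    for w in weights:
        # Asymmetric quantization to [0, 15], offset by min_val directly
        # rather than by an integer zero point.
        q = round((w - min_val) / scale)
        q = max(0, min(qmax, q))
        # Recenter to the signed int4 range [-8, 7].
        q_signed = q - recenter
        signed_codes.append(q_signed)
        # Dequantize: min_val + scale * 8 acts as the floating-point offset.
        dequantized.append(q_signed * scale + (min_val + scale * recenter))
    return signed_codes, dequantized


codes, values = fake_quantize_group([-1.0, 0.0, 0.5, 2.0])
print(codes)
print(values)
```

With bf16's eps (0.0078125) as the clamp, any group whose range is below that value would be quantized with an artificially large scale, which is one plausible reason the smaller eps matters for matching the kernel.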

Unit tests:

python test/quantization/test_qat.py -k test_quantize_api_int4
python test/quantization/test_qat.py -k test_fbgemm_int4_weight_only_primitives
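
For readers unfamiliar with the metric these tests compare, SQNR (signal-to-quantization-noise ratio) is commonly defined as the ratio of signal power to error power, expressed in decibels. The sketch below assumes that usual definition; `sqnr_db` is a hypothetical helper written for illustration and is not the function the test suite calls.

```python
import math


def sqnr_db(reference, approximation):
    """Return 10 * log10(signal power / noise power) for two equal-length sequences.

    Assumes the reference is not all zeros.
    """
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approximation))
    if noise == 0:
        # Identical outputs: no quantization noise at all.
        return math.inf
    return 10 * math.log10(signal / noise)


reference = [1.0, -2.0, 3.0]
close = [0.99, -2.01, 3.01]
far = [0.9, -2.1, 3.1]
print(sqnr_db(reference, close), sqnr_db(reference, far))
```

A higher value means the prepare-step (fake-quantized) output is closer to the convert-step (actually quantized) output, so raising the threshold tightens the check.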

End-to-end tests:

Fine-tuning Llama3.1-8B with and without this PR in unsloth:

- fine-tune for 1 epoch on yahma/alpaca-cleaned with LoRA
- batch size 8, learning rate 2e-4, no gradient accumulation

Wikitext:

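- QAT int4 quantized model (with this PR) achieved 33% lower perplexity degradation than the int4 baseline (8.3548 vs. 8.7655, against a float perplexity of 7.5551)
- QAT int4 quantized model without this PR was worse than the int4 baseline (10.0683)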

==> unsloth_model_lora_baseline_output/lm_eval_float.log <==
|        |       |none  |     0|word_perplexity|↓  |7.5551|±  |   N/A|

==> unsloth_model_lora_baseline_output/lm_eval_quantized.log <==
|        |       |none  |     0|word_perplexity|↓  |8.7655|±  |   N/A|

# QAT with this PR (quantized)
==> unsloth_model_lora_qat_int4_output/lm_eval_quantized.log <==
|        |       |none  |     0|word_perplexity|↓  |8.3548|±  |   N/A|

# QAT without this PR (quantized)
==> unsloth_model_lora_qat_int4_output/lm_eval_quantized.log <==
|        |       |none  |     0|word_perplexity|↓  |10.0683|±  |   N/A|
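
The relationship between these four numbers can be checked with a few lines of arithmetic. The sketch below is only a reading of the reported values: it measures how much of the float-to-int4 perplexity gap the better QAT run closes. The variable names are illustrative.

```python
# Word perplexities copied from the logs above (lower is better).
float_ppl = 7.5551
int4_baseline_ppl = 8.7655
qat_better_ppl = 8.3548   # the QAT run that beats the int4 baseline
qat_worse_ppl = 10.0683   # the QAT run that does not


def degradation(ppl):
    """Perplexity increase relative to the unquantized float model."""
    return ppl - float_ppl


# Fraction of the baseline's degradation removed by the better QAT run.
recovered = 1 - degradation(qat_better_ppl) / degradation(int4_baseline_ppl)
# Plain relative perplexity reduction versus the int4 baseline.
relative_drop = 1 - qat_better_ppl / int4_baseline_ppl

print(f"degradation recovered: {recovered:.1%}")
print(f"relative perplexity reduction: {relative_drop:.1%}")
print(f"worse run exceeds baseline: {qat_worse_ppl > int4_baseline_ppl}")
```

The first figure is roughly a third, which lines up with the 33% quoted in the summary; the raw perplexity reduction against the int4 baseline is much smaller.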

pytorch-bot · 2025-09-11T23:17:31Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2986

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2bc59a1 with merge base 10ba659:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jerryzh168 · 2025-09-11T23:30:08Z

test/quantization/test_qat.py

        """
        Test the following:
            quantize_(model, QATConfig(Int4WeightOnlyConfig(), step="prepare"))
            quantize_(model, QATConfig(Int4WeightOnlyConfig(), step="convert"))
        """
        self._test_quantize_api_against_ptq(
-            Int4WeightOnlyConfig(version=version),
-            target_prepare_sqnr=12,
+            Int4WeightOnlyConfig(version=version, int4_packing_format=packing_format),


I feel it's fine for QAT to only support version 2

although you may want to cover more int4 packing formats, such as TILE_PACKED_TO_4D (the previous tinygemm layout)

yeah I think we can drop version 1, but it's BC breaking so we can do it separately

jerryzh168 · 2025-09-11T23:37:57Z

torchao/quantization/qat/fake_quantizer.py

+        fbgemm_symmetric_qmax = 8
+        w_grouped = w.to(torch.float32).view(w.shape[0], -1, self.config.group_size)
+        max_val = torch.amax(w_grouped, dim=-1, keepdim=True)
+        min_val = torch.amin(w_grouped, dim=-1, keepdim=True)
+        scale = torch.clamp(max_val - min_val, min=eps) / qmax
+        zero_point = min_val + scale * fbgemm_symmetric_qmax


why don't we call int4_row_quantize_zp and get the scale/zero_point from there? is it because of performance concerns?

I guess we could ask fbgemm to add another function to just compute scale/zero_point so we can call it here in the future

yeah and also they cast the quantized values to int8, which we don't want to do here
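
The per-group scale and zero point computed in the hunk above can be mirrored in plain Python for a single weight row. This is a sketch under stated assumptions: `group_scale_and_zero_point` is a hypothetical helper, and the defaults `eps=1e-6` and `qmax=15` are taken from the PR summary rather than from the lines shown in the diff.

```python
def group_scale_and_zero_point(row, group_size, eps=1e-6, qmax=15,
                               fbgemm_symmetric_qmax=8):
    """Return one (scale, zero_point) pair per contiguous group of a weight row."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be divisible by group_size")
    params = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        min_val, max_val = min(group), max(group)
        # Same form as the diff: clamp the range, then divide by qmax.
        scale = max(max_val - min_val, eps) / qmax
        # The zero point is a float offset in weight space, not an int4 code.
        zero_point = min_val + scale * fbgemm_symmetric_qmax
        params.append((scale, zero_point))
    return params


for scale, zero_point in group_scale_and_zero_point([0.0, 1.5, 3.0, -3.0], group_size=2):
    print(scale, zero_point)
```

Computing only these two values, without producing int8-cast quantized weights, is the part the thread suggests a future FBGEMM helper could expose.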

andrewor14 · 2025-09-12T21:48:33Z

@jerryzh168 I updated the PR description with end-to-end tasks, can you take another look?

andrewor14 requested a review from jerryzh168 September 11, 2025 23:17

meta-cla bot added the CLA Signed label Sep 11, 2025

andrewor14 force-pushed the update-int4-qat branch from 00d10c8 to f17861a Compare September 11, 2025 23:21

jerryzh168 reviewed Sep 11, 2025

andrewor14 added the topic: improvement label Sep 12, 2025

andrewor14 force-pushed the update-int4-qat branch from f17861a to 2bc59a1 Compare September 12, 2025 21:47

jerryzh168 approved these changes Sep 12, 2025
