Conversation

Contributor

andrewor14 commented Sep 11, 2025

**Summary:** Similar to #2937, this commit improves the prepare vs convert SQNR of int4 weight-only QAT from 12 to 45. This is achieved by mimicking the numerics of the target FBGEMM bf16-int4 kernel more closely. In particular, the FBGEMM kernel (sketched below):

1. Performs asymmetric [0, 15] quantization first, then recenters around 8
2. Uses a smaller scale eps of 1e-6 instead of bf16's eps (0.0078125)
3. Quantizes the weights using the min val instead of the zero point
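
A minimal sketch of these numerics, using a plain round for illustration (the helper name and `group_size` default are made up here, and the actual QAT path uses fake quantization with a straight-through estimator):

```python
import torch

def int4_weight_fake_quant(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    eps = 1e-6   # smaller than bf16 eps (0.0078125)
    qmax = 15    # asymmetric [0, 15] range
    mid = 8      # recenter point used by the FBGEMM kernel
    w_grouped = w.to(torch.float32).view(w.shape[0], -1, group_size)
    max_val = torch.amax(w_grouped, dim=-1, keepdim=True)
    min_val = torch.amin(w_grouped, dim=-1, keepdim=True)
    scale = torch.clamp(max_val - min_val, min=eps) / qmax
    zero_point = min_val + scale * mid
    # quantize against min_val (not the zero point), then recenter to [-8, 7]
    q = torch.clamp(torch.round((w_grouped - min_val) / scale), 0, qmax) - mid
    # dequantize to produce the fake-quantized weights
    return (q * scale + zero_point).view_as(w).to(w.dtype)
```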

**Unit tests:**

```
python test/quantization/test_qat.py -k test_quantize_api_int4
python test/quantization/test_qat.py -k test_fbgemm_int4_weight_only_primitives
```

**End-to-end tests:**

Fine-tuning Llama3.1-8B with and without this PR in unsloth:

- fine-tune for 1 epoch on yahma/alpaca-cleaned with LoRA
- batch size 8, learning rate 2e-4, no gradient accumulation

Wikitext:

- With this PR, the QAT int4 quantized model closes roughly 33% of the perplexity gap between the int4 baseline and the float model
- Without this PR, the QAT int4 quantized model is worse than the int4 baseline

```
# Float baseline
==> unsloth_model_lora_baseline_output/lm_eval_float.log <==
|        |       |none  |     0|word_perplexity|↓  |7.5551|±  |   N/A|

# Int4 baseline (quantized, no QAT)
==> unsloth_model_lora_baseline_output/lm_eval_quantized.log <==
|        |       |none  |     0|word_perplexity|↓  |8.7655|±  |   N/A|

# QAT with this PR (quantized)
==> unsloth_model_lora_qat_int4_output/lm_eval_quantized.log <==
|        |       |none  |     0|word_perplexity|↓  |8.3548|±  |   N/A|

# QAT without this PR (quantized)
==> unsloth_model_lora_qat_int4_output/lm_eval_quantized.log <==
|        |       |none  |     0|word_perplexity|↓  |10.0683|±  |   N/A|
```


pytorch-bot bot commented Sep 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2986

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2bc59a1 with merge base 10ba659:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label on Sep 11, 2025
"""
Test the following:
quantize_(model, QATConfig(Int4WeightOnlyConfig(), step="prepare"))
quantize_(model, QATConfig(Int4WeightOnlyConfig(), step="convert"))
"""
self._test_quantize_api_against_ptq(
Int4WeightOnlyConfig(version=version),
target_prepare_sqnr=12,
Int4WeightOnlyConfig(version=version, int4_packing_format=packing_format),
Contributor

I feel it's fine for QAT to only support version 2

Contributor

although you may want to cover more int4 packing formats, such as TILE_PACKED_TO_4D (the previous tinygemm layout)

Contributor Author

yeah, I think we can drop version 1, but it's BC-breaking so we can do it separately
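
For context, here is a minimal sketch of the prepare/convert flow exercised by the docstring quoted above; the toy model, dtype, and import paths are assumptions on my part rather than code from this PR:

```python
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_
from torchao.quantization.qat import QATConfig

# toy model standing in for the network being fine-tuned
model = torch.nn.Sequential(torch.nn.Linear(256, 256)).to(torch.bfloat16).cuda()
base_config = Int4WeightOnlyConfig()

# "prepare": swap in fake-quantized int4 weights for QAT fine-tuning
quantize_(model, QATConfig(base_config, step="prepare"))
# ... fine-tune the model here ...

# "convert": replace the fake-quantized weights with real int4 weight-only weights
quantize_(model, QATConfig(base_config, step="convert"))
```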

Comment on lines +183 to +188

```python
fbgemm_symmetric_qmax = 8
w_grouped = w.to(torch.float32).view(w.shape[0], -1, self.config.group_size)
max_val = torch.amax(w_grouped, dim=-1, keepdim=True)
min_val = torch.amin(w_grouped, dim=-1, keepdim=True)
scale = torch.clamp(max_val - min_val, min=eps) / qmax
zero_point = min_val + scale * fbgemm_symmetric_qmax
```
Contributor

jerryzh168 commented Sep 11, 2025

why don't we call int4_row_quantize_zp and get the scale/zero_point from there? is it because of performance concerns?

I guess we could ask fbgemm to add another function to just compute scale/zero_point so we can call it here in the future

Contributor Author

yeah and also they cast the quantized values to int8, which we don't want to do here
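
As a side note, a quick self-contained check (with made-up values) that the recentering in the quoted snippet is consistent: dequantizing the recentered code with `zero_point = min_val + scale * 8` matches the plain asymmetric dequant against `min_val`:

```python
import torch

eps, qmax, mid = 1e-6, 15, 8                              # constants from the snippet above
min_val, max_val = torch.tensor(-0.3), torch.tensor(0.5)  # made-up per-group min/max
scale = torch.clamp(max_val - min_val, min=eps) / qmax
zero_point = min_val + scale * mid

w = torch.tensor(0.1234)                                   # made-up weight value
q = torch.clamp(torch.round((w - min_val) / scale), 0, qmax)

dq_recentered = (q - mid) * scale + zero_point  # FBGEMM-style, values recentered to 8
dq_asymmetric = q * scale + min_val             # plain asymmetric [0, 15] dequant
assert torch.allclose(dq_recentered, dq_asymmetric)
```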

andrewor14 added the topic: improvement label on Sep 12, 2025
Contributor Author

andrewor14

@jerryzh168 I updated the PR description with end-to-end tasks, can you take another look?
