

@compilade (Collaborator) commented Nov 7, 2025

(alternative to #17064, cc @ngxson)

This adds support for a few formats in the compressed-tensors quant method.

I've also re-tested plain fp8 with https://huggingface.co/Qwen/Qwen3-4B-FP8 to make sure I didn't break it.

I also found and fixed a problem in the lazy tensors related to skipping metadata changes for binary operators; without the fix, the broadcast shift (used when unpacking) didn't produce the correct final shape.
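
For readers unfamiliar with the pack-quantized layout, here is a minimal sketch of the kind of broadcast-shift unpacking involved (illustrative only, not taken from the PR; it assumes int4 values packed eight per int32, low nibble first, with a made-up per-tensor scale — the packing order and names are assumptions):

```python
import numpy as np

def unpack_int4(packed: np.ndarray, n_cols: int) -> np.ndarray:
    """Unpack int32-packed signed int4 values to int8 (assumed low-nibble-first layout)."""
    shifts = np.arange(0, 32, 4)                       # one shift per nibble position
    nibbles = (packed[..., None] >> shifts) & 0xF      # broadcast shift adds a trailing axis of size 8
    nibbles = nibbles.reshape(*packed.shape[:-1], -1)[..., :n_cols]
    return np.where(nibbles >= 8, nibbles - 16, nibbles).astype(np.int8)  # sign-extend two's-complement nibbles

# Hypothetical usage: dequantize to F32 with a made-up scale
packed = np.array([[0x76543210, 0xFEDCBA98]], dtype=np.uint32)
weights_f32 = unpack_int4(packed, 16).astype(np.float32) * np.float32(0.01)
```

It is this kind of broadcasted shift whose output shape the lazy-tensor fix is about.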



@ubergarm commented Nov 7, 2025

Successfully converted with compilade/convert-prequant-compressed-tensors@128118fdb by running:

```
numactl -N 1 -m 1 \
python \
    convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF \
    /mnt/data/models/moonshotai/Kimi-K2-Thinking/
```

```
Shard (46/46): 100%|██████████| 22.5G/22.5G [01:15<00:00, 299Mbyte/s]
Writing: 100%|██████████| 2.05T/2.05T [1:59:32<00:00, 286Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/
```

Currently inferencing successfully on a Q8_0 quant (with Q4_0 routed experts)!

@ubergarm commented Nov 7, 2025

Well, thanks to the llama.cpp team, there is a GGUF available for testing:

https://www.reddit.com/r/LocalLLaMA/comments/1oqo57j/ubergarmkimik2thinkinggguf_hugging_face/

Great job ngxson, compilade, DevQuasar, Bartowski, AesSedai, and more folks who pulled together hacking on this one today! 🫶

@ggerganov requested a review from ngxson on November 7, 2025 at 07:58

@ngxson (Collaborator) left a comment

Nice, thanks!

@jukofyork (Collaborator) commented Nov 7, 2025

Is this the PR that is doing the round trip to BF16? If so, can we actually be sure it will work? See:

#17064 (comment)

It would be better to find this out now than have many TB of bastard quants getting uploaded to HF!

@ngxson (Collaborator) commented Nov 7, 2025

IIUC this will round-trip to F32, then quantize to target types like Q8_0. The dequant only happens internally.

We cannot yet directly map Kimi-K2's int4 to Q4_0 because Q4_0 only supports F16 scales. Adding a Q4_0 variant with BF16 scales is possible but would require quite a lot of work.

I think the requant approach introduced by this PR is a good temporary solution. We can improve it if the model (or this quant scheme) gets a lot of usage.
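
(For context, the F16 restriction comes from ggml's Q4_0 block layout, which stores a single F16 scale per block of 32 nibble-packed weights; a rough numpy view, for illustration only:)

```python
import numpy as np

# Rough numpy view of a ggml Q4_0 block: an F16 scale "d" followed by 16 bytes
# holding 32 packed 4-bit quantized values.
BLOCK_Q4_0 = np.dtype([("d", np.float16), ("qs", np.uint8, 16)])

print(BLOCK_Q4_0.itemsize)  # 18 bytes per block of 32 weights
```

A BF16-scaled group therefore can't be dropped into that `d` field without either converting the scale to F16 or defining a new block type.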

@jukofyork (Collaborator) commented:

Is there a layout diagram for GGUF similar to this one for safetensors?

[safetensors file-layout diagram]

If there is, it would be pretty trivial to quantize to a slightly-wrong version of Q4_0 using this PR, and then use a simple C/C++ program to overwrite the values in the GGUF with the lossless values via a bit of bit-shifting...

@jukofyork (Collaborator) commented Nov 7, 2025

(deleted so that nobody tries doing this!)

> We cannot yet directly map Kimi-K2's int4 to Q4_0 because Q4_0 only supports F16 scales. Adding a Q4_0 variant with BF16 scales is possible but would require quite a lot of work.

So long as it doesn't overflow (and @ubergarm said in the other PR that he only saw very small values, way below ±65504), there is a direct mapping between BF16 and F16:

[BF16 vs. F16 bit-layout diagram]

so there would be no need for a special BF16 version of Q4_0 if we can correctly map the packed INT4 nibbles to Q4_0 nibbles, and overall the whole mapping should (hopefully) be completely lossless.

Actually, this won't be quite lossless as I forgot about subnormals... But it shouldn't be far enough off to make much difference.
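
A quick way to sanity-check the normal-range claim (an illustrative sketch, not part of this PR; torch is used only because the convert script already depends on it):

```python
import torch

def bf16_roundtrips_via_f16(x: torch.Tensor) -> torch.Tensor:
    """True where a BF16 value survives a BF16 -> F16 -> BF16 round trip unchanged."""
    bf = x.to(torch.bfloat16)
    back = bf.to(torch.float16).to(torch.bfloat16)
    return bf == back

# BF16 stores 7 mantissa bits vs. F16's 10, so any BF16 value inside F16's normal range
# (roughly 6.1e-5 .. 65504 in magnitude) maps exactly; values that land in F16's
# subnormal range can lose mantissa bits.
vals = torch.tensor([1.0, 3.1875e-3, 6.0e-5, 1.0e-7])
print(bf16_roundtrips_via_f16(vals))  # the last value falls in F16's subnormal range and fails the check
```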

@ngxson (Collaborator) commented Nov 7, 2025

> If there is, it would be pretty trivial to quantize to a slightly-wrong version of Q4_0 using this PR, and then use a simple C/C++ program to overwrite the values in the GGUF with the lossless values via a bit of bit-shifting...

This discussion seems better placed in my WIP branch for mapping int4 -> Q4_0; here it's a bit off-topic:

#17064 (comment)

My version is almost working; I just need to correct the nibble layout. Will open a PR for discussion.

@csabakecskemeti (Contributor) commented Nov 7, 2025

Just FYI: not strictly related to this PR, but for moonshotai/Kimi-K2-Thinking.

I think I've made it work in an alternative way. I've built a conversion utility, int4-to-bf16, inspired by the DeepSeek V3 dequantizer.

Both the Q3 and Q2 GGUFs seem to be working:

[screenshot: kimi-think-proof]

Experimental quants are uploading here (please allow some more time for the upload): DevQuasar/moonshotai.Kimi-K2-Thinking-GGUF

Feel free to test the quants and the converter.

Regardless of the above, it would be nice if this native quant handling worked!

@jukofyork (Collaborator) commented:

> > If there is, it would be pretty trivial to quantize to a slightly-wrong version of Q4_0 using this PR, and then use a simple C/C++ program to overwrite the values in the GGUF with the lossless values via a bit of bit-shifting...
>
> This discussion seems better placed in my WIP branch for mapping int4 -> Q4_0; here it's a bit off-topic:
>
> #17064 (comment)
>
> My version is almost working; I just need to correct the nibble layout. Will open a PR for discussion.

Yeah, sorry - I'll continue the discussion there.

I also tested whether using uniform weights with make_qx_quants would work (i.e. find a set of lossless nibbles, possibly with a different scale and with the nibbles allowed to be re-centred, etc.). Sadly, it doesn't work either when given lots of random starting blocks (whatever heuristic it uses must not be exhaustive enough to find the true minimum, which has close to zero error...).

@compilade (Collaborator, Author) commented Nov 9, 2025

To be clear, I agree with @jukofyork; requantization isn't lossless when a naïve quantization function is used (like the default Q4_0 one).

The plan is to implement generalized repacking to allow fast lossless conversion (apart from BF16 -> F16 for very small scales). But as described in #14810 (comment), this requires handling permutes, reshapes, stacking (and maybe splitting), so it's not exactly simple.

I'll still proceed with merging this because dequantization works (but note the caveat with requantization), and repacking will use part of that functionality (to get the quantized values in the correct order).
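
To make the requantization caveat concrete, here's a minimal numpy sketch that loosely mirrors a reference Q4_0-style quantizer (the block values and scale are made up, and it works in the signed domain rather than the stored nibble+8 form):

```python
import numpy as np

# One block of 32 original int4 weights whose extreme value is +/-7 (no -8 present),
# with an arbitrary per-block scale.
orig_q = np.array([7, -7, 5, 3, -2, 1, 0, 4] * 4)
orig_scale = np.float32(0.0123)
deq = (orig_q * orig_scale).astype(np.float32)   # what the convert script sees after dequantizing

# Naive Q4_0-style requantization: derive the scale from the largest-magnitude element
# (d = max / -8), then round each value to the nearest of the 16 levels.
amax_idx = int(np.argmax(np.abs(deq)))
d = deq[amax_idx] / -8.0
new_q = np.clip(np.round(deq / d), -8, 7).astype(np.int64)

print(np.array_equal(new_q, orig_q))  # False: the re-derived scale differs from the original,
                                      # so the recovered nibbles no longer match
```

If the block's largest-magnitude value happens to be a -8 (and the scale survives the F32/F16 conversions), the round trip can come out exact, but in general it doesn't.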

@CISC (Collaborator) commented Nov 9, 2025

> To be clear, I agree with @jukofyork; requantization isn't lossless when a naïve quantization function is used (like the default Q4_0 one).

TBH, it shouldn't be expected to be either.

> I'll still proceed with merging this because dequantization works (but note the caveat with requantization), and repacking will use part of that functionality (to get the quantized values in the correct order).

Agreed, repacking is an additional feature that deserves its own PR.

@compilade compilade merged commit 1c07c0c into master Nov 9, 2025
10 checks passed
