

@compilade (Collaborator) commented Nov 7, 2025

(alternative to #17064, cc @ngxson)

This adds support for a few formats in the compressed-tensors quant method.

I've also re-tested plain fp8 with https://huggingface.co/Qwen/Qwen3-4B-FP8 to make sure I didn't break it.

I also found and fixed a problem in the lazy tensors related to skipping metadata changes for binary operators; without the fix, the broadcast shift (used when unpacking) didn't produce the correct final shape.
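
For readers unfamiliar with the pack-quantized layout, here is a minimal sketch of the kind of broadcast-shift unpacking involved (illustrative only, not taken from the PR; it assumes int4 values packed eight per int32, low nibble first, with a made-up per-tensor scale — the packing order and names are assumptions):

```python
import numpy as np

def unpack_int4(packed: np.ndarray, n_cols: int) -> np.ndarray:
    """Unpack int32-packed signed int4 values to int8 (assumed low-nibble-first layout)."""
    shifts = np.arange(0, 32, 4)                       # one shift per nibble position
    nibbles = (packed[..., None] >> shifts) & 0xF      # broadcast shift adds a trailing axis of size 8
    nibbles = nibbles.reshape(*packed.shape[:-1], -1)[..., :n_cols]
    return np.where(nibbles >= 8, nibbles - 16, nibbles).astype(np.int8)  # sign-extend two's-complement nibbles

# Hypothetical usage: dequantize to F32 with a made-up scale
packed = np.array([[0x76543210, 0xFEDCBA98]], dtype=np.uint32)
weights_f32 = unpack_int4(packed, 16).astype(np.float32) * np.float32(0.01)
```

It is this kind of broadcasted shift whose output shape the lazy-tensor fix is about.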



@ubergarm commented Nov 7, 2025

Successfully converted with compilade/convert-prequant-compressed-tensors@128118fdb by running:

```
numactl -N 1 -m 1 \
python \
    convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF \
    /mnt/data/models/moonshotai/Kimi-K2-Thinking/
```

```
Shard (46/46): 100%|██████████| 22.5G/22.5G [01:15<00:00, 299Mbyte/s]
Writing: 100%|██████████| 2.05T/2.05T [1:59:32<00:00, 286Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/
```

Currently inferencing successfully on a Q8_0 quant (with Q4_0 routed experts)!

@ubergarm commented Nov 7, 2025

Well, thanks to the llama.cpp team, there is a GGUF available for testing:

https://www.reddit.com/r/LocalLLaMA/comments/1oqo57j/ubergarmkimik2thinkinggguf_hugging_face/

Great job ngxson, compilade, DevQuasar, Bartowski, AesSedai, and more folks who pulled together hacking on this one today! 🫶

@ggerganov requested a review from ngxson on November 7, 2025 at 07:58

@ngxson (Collaborator) left a comment

Nice, thanks!

@jukofyork (Collaborator) commented Nov 7, 2025

Is this the PR that is doing the round trip to BF16? If so, can we actually be sure it will work? See:

#17064 (comment)

It would be better to find this out now than have many TB of bastard quants getting uploaded to HF!

@ngxson (Collaborator) commented Nov 7, 2025

IIUC this will round-trip to F32, then quantize to target types like Q8_0. The dequant only happens internally.

We cannot yet directly map Kimi-K2's int4 to Q4_0 because Q4_0 only supports F16 scales. Adding a Q4_0 variant with BF16 scales is possible but would require quite a lot of work.

I think the requant approach introduced by this PR is a good temporary solution. We can improve it if the model (or this quant scheme) gets a lot of usage.
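
(For context, the F16 restriction comes from ggml's Q4_0 block layout, which stores a single F16 scale per block of 32 nibble-packed weights; a rough numpy view, for illustration only:)

```python
import numpy as np

# Rough numpy view of a ggml Q4_0 block: an F16 scale "d" followed by 16 bytes
# holding 32 packed 4-bit quantized values.
BLOCK_Q4_0 = np.dtype([("d", np.float16), ("qs", np.uint8, 16)])

print(BLOCK_Q4_0.itemsize)  # 18 bytes per block of 32 weights
```

A BF16-scaled group therefore can't be dropped into that `d` field without either converting the scale to F16 or defining a new block type.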

@jukofyork (Collaborator) commented:

Is there a layout diagram for GGUF similar to this one for safetensors?

[safetensors file-layout diagram]

If there is, it would be pretty trivial to quantize to a slightly-wrong version of Q4_0 using this PR, and then use a simple C/C++ program to overwrite the values in the GGUF with the lossless values via a bit of bit-shifting...

@jukofyork (Collaborator) commented Nov 7, 2025

(deleted so that nobody tries doing this!)

> We cannot yet directly map Kimi-K2's int4 to Q4_0 because Q4_0 only supports F16 scales. Adding a Q4_0 variant with BF16 scales is possible but would require quite a lot of work.

So long as it doesn't overflow (and @ubergarm said in the other PR that he only saw very small values, way below ±65504), there is a direct mapping between BF16 and F16:

[BF16 vs. F16 bit-layout diagram]

so there would be no need for a special BF16 version of Q4_0 if we can correctly map the packed INT4 nibbles to Q4_0 nibbles, and overall the whole mapping should (hopefully) be completely lossless.

Actually, this won't be quite lossless as I forgot about subnormals... But it shouldn't be far enough off to make much difference.
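
A quick way to sanity-check the normal-range claim (an illustrative sketch, not part of this PR; torch is used only because the convert script already depends on it):

```python
import torch

def bf16_roundtrips_via_f16(x: torch.Tensor) -> torch.Tensor:
    """True where a BF16 value survives a BF16 -> F16 -> BF16 round trip unchanged."""
    bf = x.to(torch.bfloat16)
    back = bf.to(torch.float16).to(torch.bfloat16)
    return bf == back

# BF16 stores 7 mantissa bits vs. F16's 10, so any BF16 value inside F16's normal range
# (roughly 6.1e-5 .. 65504 in magnitude) maps exactly; values that land in F16's
# subnormal range can lose mantissa bits.
vals = torch.tensor([1.0, 3.1875e-3, 6.0e-5, 1.0e-7])
print(bf16_roundtrips_via_f16(vals))  # the last value falls in F16's subnormal range and fails the check
```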

@ngxson (Collaborator) commented Nov 7, 2025

> If there is, it would be pretty trivial to quantize to a slightly-wrong version of Q4_0 using this PR, and then use a simple C/C++ program to overwrite the values in the GGUF with the lossless values via a bit of bit-shifting...

This discussion seems better placed in my WIP branch for mapping int4 -> Q4_0; here it's a bit off-topic:

#17064 (comment)

My version is almost working; I just need to correct the nibble layout. Will open a PR for discussion.

@csabakecskemeti (Contributor) commented Nov 7, 2025

Just FYI: not strictly related to this PR, but for moonshotai/Kimi-K2-Thinking.

I think I've made it work in an alternative way. I've built a conversion utility, int4-to-bf16, inspired by the DeepSeek V3 dequantizer.

Both the Q3 and Q2 GGUFs seem to be working:

[screenshot: kimi-think-proof]

Experimental quants are uploading here (please allow some more time for the upload): DevQuasar/moonshotai.Kimi-K2-Thinking-GGUF

Feel free to test the quants and the converter.

Regardless of the above, it would be nice if this native quant handling worked!

@jukofyork (Collaborator) commented:

> > If there is, it would be pretty trivial to quantize to a slightly-wrong version of Q4_0 using this PR, and then use a simple C/C++ program to overwrite the values in the GGUF with the lossless values via a bit of bit-shifting...
>
> This discussion seems better placed in my WIP branch for mapping int4 -> Q4_0; here it's a bit off-topic:
>
> #17064 (comment)
>
> My version is almost working; I just need to correct the nibble layout. Will open a PR for discussion.

Yeah, sorry - I'll continue the discussion there.

I also tested whether using uniform weights with make_qx_quants would work (i.e. find a set of lossless nibbles, possibly with a different scale and with the nibbles allowed to be re-centred, etc.). Sadly, it doesn't work either when given lots of random starting blocks (whatever heuristic it uses must not be exhaustive enough to find the true minimum, which has close to zero error...).

@compilade (Collaborator, Author) commented Nov 9, 2025

To be clear, I agree with @jukofyork; requantization isn't lossless when a naïve quantization function is used (like the default Q4_0 one).

The plan is to implement generalized repacking to allow fast lossless conversion (apart from BF16 -> F16 for very small scales). But as described in #14810 (comment), this requires handling permutes, reshapes, stacking (and maybe splitting), so it's not exactly simple.

I'll still proceed with merging this because dequantization works (but note the caveat with requantization), and repacking will use part of that functionality (to get the quantized values in the correct order).
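
To make the requantization caveat concrete, here's a minimal numpy sketch that loosely mirrors a reference Q4_0-style quantizer (the block values and scale are made up, and it works in the signed domain rather than the stored nibble+8 form):

```python
import numpy as np

# One block of 32 original int4 weights whose extreme value is +/-7 (no -8 present),
# with an arbitrary per-block scale.
orig_q = np.array([7, -7, 5, 3, -2, 1, 0, 4] * 4)
orig_scale = np.float32(0.0123)
deq = (orig_q * orig_scale).astype(np.float32)   # what the convert script sees after dequantizing

# Naive Q4_0-style requantization: derive the scale from the largest-magnitude element
# (d = max / -8), then round each value to the nearest of the 16 levels.
amax_idx = int(np.argmax(np.abs(deq)))
d = deq[amax_idx] / -8.0
new_q = np.clip(np.round(deq / d), -8, 7).astype(np.int64)

print(np.array_equal(new_q, orig_q))  # False: the re-derived scale differs from the original,
                                      # so the recovered nibbles no longer match
```

If the block's largest-magnitude value happens to be a -8 (and the scale survives the F32/F16 conversions), the round trip can come out exact, but in general it doesn't.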

@CISC (Collaborator) commented Nov 9, 2025

> To be clear, I agree with @jukofyork; requantization isn't lossless when a naïve quantization function is used (like the default Q4_0 one).

TBH, it shouldn't be expected to be either.

> I'll still proceed with merging this because dequantization works (but note the caveat with requantization), and repacking will use part of that functionality (to get the quantized values in the correct order).

Agreed, repacking is an additional feature that deserves its own PR.

@compilade compilade merged commit 1c07c0c into master Nov 9, 2025
10 checks passed
